# Analyzing Stack Overflow Activity

In our developer advocacy team we keep an eye on what is happening on Stack Overflow (SO). It is good to know what people are struggling with, and we try to help by answering questions and by writing blog posts and giving talks about the most common issues.

Our team advocates for a range of IBM products. In this analysis I focused on the tags `cloudant`, `couchdb`, `dashdb` and `pouchdb`, four closely linked databases. **Linked how?** The Stack Overflow questions with these tags are stored in a [Cloudant] database. **Patrick: how?**

For the analysis of the data from these 2060 questions I used a Jupyter Python notebook on the [IBM Data Science Experience]() that you can find on [GitHub](). In this notebook the SO data is analysed to find out more about the users by answering the following questions:

1. How many unique users are there and what questions do they ask? 
1. Are users asking questions with different tags, or only about one product? 
1. Are users (owners) beginners or more experienced?
1. Does the number of users change over time?

And some other possible questions:

* How long are the questions active for?
* What is the rating of the questions grouped by tag?
* Can we find users on Twitter and see what they say there?
* What is the sentiment in the questions and answers?
* Is there a relationship between the ranking of a question and its length?
* What are the best questions to ask or answer to increase your reputation?


## Load and clean the data

As the data is stored in Cloudant, the first step is to load the data into the notebook, clean it up and convert it into a pandas DataFrame.

### Prerequisites

Import PixieDust, enable the Apache Spark Job monitor and load some more packages.

Install or update missing packages with `!pip install --user <package>`.

In [None]:
import pixiedust
pixiedust.enableJobMonitor()

In [None]:
from pyspark.sql.functions import explode
from pyspark.sql import functions as F

import numpy as np
import pandas as pd
from datetime import datetime

from io import BytesIO, StringIO  
import requests  
import json  

from wordcloud import WordCloud, STOPWORDS

import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline

### Configure database connectivity

Customize this cell with your Cloudant/CouchDB connection information.

In [None]:
# @hidden_cell
# Enter your Cloudant host name
host = '--myhostname--'
# Enter your Cloudant user name
username = '--myusername--'
# Enter your Cloudant password
password = '--mysecretpassword--'
# Enter your source database name
database = '--mydatabasename--'

### Load documents from the database

Load the documents into an Apache Spark DataFrame and describe the data structure.

In [None]:
# no changes are required to this cell
from pyspark.sql import SQLContext

# obtain Spark SQL Context (`sc` is the SparkContext provided by the notebook)
sqlContext = SQLContext(sc)
# load data
so_data = sqlContext.read.format("com.cloudant.spark").\
                                 option("cloudant.host", host).\
                                 option("cloudant.username", username).\
                                 option("cloudant.password", password).\
                                 load(database)              
so_data.cache()                

In [None]:
so_data.printSchema()
so_data.count()

### Clean up data and convert to a pandas DataFrame

Select only the relevant data and convert it to a table. As the tags column holds several tags per question, also add a column for each tag position by using a `lambda` function. For the further analysis it is also handy to have boolean columns that flag the occurrence of the 4 tags of interest. These can also be added with a `lambda` function.

In [None]:
# all users
sodf = so_data.select(so_data.question.question_id.alias("id"),
                       so_data.question.owner.accept_rate.alias("accept_rate"),
                       so_data.question.owner.reputation.alias("reputation"),
                       so_data.question.owner.user_id.alias("user_id"),
                       so_data.question.answer_count.alias("answer_count"), 
                       so_data.question.creation_date.alias("creation"), 
                       so_data.question.closed_date.alias("closed"),
                       so_data.question.is_answered.alias("answered"),
                       so_data.question.score.alias("score"),
                       so_data.question.view_count.alias("views"),
                       so_data.question.title.alias("title"),
                       so_data.question.tags.alias("tags")).toPandas()

In [None]:
tags = sodf['tags'].apply(pd.Series)
tags = tags.rename(columns = lambda x: 'tags_' + str(x))

sodf = pd.concat([sodf[:], tags[:]], axis=1)

sodf['cloudant']=sodf['tags'].apply(lambda x: 'cloudant' in x)
sodf['dashdb']=sodf['tags'].apply(lambda x: 'dashdb' in x)
sodf['couchdb']=sodf['tags'].apply(lambda x: 'couchdb' in x)
sodf['pouchdb']=sodf['tags'].apply(lambda x: 'pouchdb' in x)

sodf.head()
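
To make the two `lambda` steps easier to follow, here is a minimal sketch on a toy DataFrame. The three questions and their tags are made up for illustration, and the sketch assumes that each entry in `tags` is a Python list, as it is right after `toPandas()`.

```python
import pandas as pd

# Toy stand-in for the real data: each question has a list of tags
toy = pd.DataFrame({
    'id': [1, 2, 3],
    'tags': [['cloudant', 'couchdb'], ['pouchdb'], ['dashdb', 'sql', 'cloudant']],
})

# One column per tag position (tags_0, tags_1, ...); short lists are padded with NaN
tag_cols = toy['tags'].apply(pd.Series).rename(columns=lambda x: 'tags_' + str(x))

# One boolean column per product of interest
products = ['cloudant', 'couchdb', 'dashdb', 'pouchdb']
for product in products:
    toy[product] = toy['tags'].apply(lambda tags, p=product: p in tags)

toy = pd.concat([toy, tag_cols], axis=1)
print(toy[['id'] + products])
```

The loop over `products` does the same as the four separate `apply` lines in the cell above, so adding a fifth tag of interest only means extending the list.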

### Save the DataFrame in object-store

To save time and avoid loading the data from Cloudant again, save `sodf` to a CSV file in Object Storage. When you come back to the notebook later, you can start directly by loading this file.

In [None]:
# @hidden_cell
credentials_1 = {
  'auth_url':'https://identity.open.softlayer.com',
  'project':'object_storage_de9e50d4_b6ba_4f25_926a_75c568e896a0',
  'project_id':'f564592bf4d24d41b89ea8229243fa05',
  'region':'dallas',
  'user_id':'5c8116961cf345dfb9d75c5cb24a2e47',
  'domain_id':'299db6126ac14081bf872905c18ba585',
  'domain_name':'790621',
  'username':'member_fbf96b4f9f669a48110776d13d879b48d00c7a25',
  'password':"""D]2x4Ms&FMywU_Ez""",
  'container':'SOanalysis',
  'tenantId':'undefined',
  'filename':'wordcloud.txt'
}    

In [None]:
sodf.to_csv('SOdata.csv', index=False, encoding='utf-8')

def put_file(credentials, local_file_name):
    """Upload a local file to Bluemix Object Storage V3 and print the response."""
    with open(local_file_name, 'r') as f:
        my_data = f.read()
    # request an authentication token from the identity service
    url1 = ''.join([credentials['auth_url'], '/v3/auth/tokens'])
    data = {'auth': {'identity': {'methods': ['password'],
            'password': {'user': {'name': credentials['username'],'domain': {'id': credentials['domain_id']},
            'password': credentials['password']}}}}}
    headers1 = {'Content-Type': 'application/json'}
    resp1 = requests.post(url=url1, data=json.dumps(data), headers=headers1)
    resp1_body = resp1.json()
    # look up the public object-store endpoint for the configured region
    for e1 in resp1_body['token']['catalog']:
        if e1['type'] == 'object-store':
            for e2 in e1['endpoints']:
                if e2['interface'] == 'public' and e2['region'] == credentials['region']:
                    url2 = ''.join([e2['url'], '/', credentials['container'], '/', local_file_name])
    s_subject_token = resp1.headers['x-subject-token']
    headers2 = {'X-Auth-Token': s_subject_token, 'accept': 'application/json'}
    # upload the file content to the container
    resp2 = requests.put(url=url2, headers=headers2, data=my_data)
    print(resp2)

put_file(credentials_1, 'SOdata.csv')
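
One detail of the upload code is worth spelling out: the nested loop searches the service catalog in the token response for the public `object-store` endpoint, and `url2` is simply never set when nothing matches. The sketch below isolates that lookup in a small helper that fails with a clear message instead. The name `find_object_store_url` and the example catalog (including its URLs) are hypothetical; only the dictionary keys are taken from the code above.

```python
def find_object_store_url(catalog, region, interface='public'):
    """Return the endpoint URL of the object-store service, or raise if absent."""
    for service in catalog:
        if service['type'] != 'object-store':
            continue
        for endpoint in service['endpoints']:
            if endpoint['interface'] == interface and endpoint['region'] == region:
                return endpoint['url']
    raise LookupError('no %s object-store endpoint in region %s' % (interface, region))


# Hypothetical catalog with the same shape that the code above reads
example_catalog = [
    {'type': 'identity', 'endpoints': [
        {'interface': 'public', 'region': 'dallas', 'url': 'https://identity.example.com'}]},
    {'type': 'object-store', 'endpoints': [
        {'interface': 'admin', 'region': 'dallas', 'url': 'https://admin.example.com/v1'},
        {'interface': 'public', 'region': 'london', 'url': 'https://lon.example.com/v1'},
        {'interface': 'public', 'region': 'dallas', 'url': 'https://dal.example.com/v1'}]},
]

url = find_object_store_url(example_catalog, 'dallas')
print(url)
```

In the notebook the catalog would be `resp1_body['token']['catalog']`, and the same helper could serve both the upload and the download function.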

### Read the DataFrame from the saved csv file in object store

In [None]:
# @hidden_cell
# This function accesses a file in your Object Storage. The definition contains your credentials.
# You might want to remove those credentials before you share your notebook.
def get_object_storage_file_with_credentials_de9e50d4b6ba4f25926a75c568e896a0(container, filename):
    """This function returns a StringIO object containing
    the file content from Bluemix Object Storage."""

    url1 = ''.join(['https://identity.open.softlayer.com', '/v3/auth/tokens'])
    data = {'auth': {'identity': {'methods': ['password'],
            'password': {'user': {'name': 'member_fbf96b4f9f669a48110776d13d879b48d00c7a25','domain': {'id': '299db6126ac14081bf872905c18ba585'},
            'password': 'D]2x4Ms&FMywU_Ez'}}}}}
    headers1 = {'Content-Type': 'application/json'}
    resp1 = requests.post(url=url1, data=json.dumps(data), headers=headers1)
    resp1_body = resp1.json()
    for e1 in resp1_body['token']['catalog']:
        if(e1['type']=='object-store'):
            for e2 in e1['endpoints']:
                        if(e2['interface']=='public'and e2['region']=='dallas'):
                            url2 = ''.join([e2['url'],'/', container, '/', filename])
    s_subject_token = resp1.headers['x-subject-token']
    headers2 = {'X-Auth-Token': s_subject_token, 'accept': 'application/json'}
    resp2 = requests.get(url=url2, headers=headers2)
    return StringIO(resp2.text)

sodf2 = pd.read_csv(get_object_storage_file_with_credentials_de9e50d4b6ba4f25926a75c568e896a0('SOanalysis', 'SOdata.csv'))
sodf2.head()
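
A side effect of the CSV round trip is that list-valued columns such as `tags` come back as plain strings. That is fine for the word clouds further down, but anything that needs the individual tags has to parse the strings again. The following is a minimal sketch of one way to do that with `ast.literal_eval`, written for Python 3 and using an in-memory buffer and made-up rows instead of Object Storage.

```python
import ast
from io import StringIO

import pandas as pd

original = pd.DataFrame({'id': [1, 2],
                         'tags': [['cloudant', 'couchdb'], ['pouchdb']]})

# Write to an in-memory buffer instead of a file, then read it back
buffer = StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# After the round trip every value in 'tags' is a plain string
tags_are_strings = all(isinstance(t, str) for t in restored['tags'])

# ast.literal_eval safely parses each string back into a Python list
restored['tags'] = restored['tags'].apply(ast.literal_eval)
print(restored['tags'].tolist())
```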

## What can we learn from the data?

Let's start with some basic numbers.

In [None]:
print len(np.unique(sodf2['user_id'])), 'users asked', len(sodf2), 'questions between',datetime.fromtimestamp(sodf2['creation'].min()).strftime('%Y-%m-%d'), 'and', datetime.fromtimestamp(sodf2['creation'].max()).strftime('%Y-%m-%d')

# str.format keeps the output identical on Python 2 and Python 3
print('{} questions were about Cloudant.'.format(sodf2['answer_count'].where(sodf2['cloudant']).count()))
print('{} questions were about dashDB.'.format(sodf2['answer_count'].where(sodf2['dashdb']).count()))
print('{} questions were about CouchDB.'.format(sodf2['answer_count'].where(sodf2['couchdb']).count()))
print('{} questions were about PouchDB.'.format(sodf2['answer_count'].where(sodf2['pouchdb']).count()))
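
The first two questions from the introduction (how many unique users there are, and whether they ask about one product or several) can be approached with a `groupby` on `user_id`. This is a minimal sketch on made-up data. It assumes the boolean tag columns created during the clean-up step and is not the analysis of the real 2060 questions.

```python
import pandas as pd

toy = pd.DataFrame({
    'user_id':  [10, 10, 11, 12, 12, 12],
    'cloudant': [True, False, True, False, False, False],
    'couchdb':  [False, True, False, True, True, False],
    'pouchdb':  [False, False, False, False, True, True],
    'dashdb':   [False, False, False, False, False, False],
})
products = ['cloudant', 'couchdb', 'dashdb', 'pouchdb']

# Questions asked per user
questions_per_user = toy.groupby('user_id').size()

# For each user: did they ever ask about each product? Then count the products.
products_per_user = toy.groupby('user_id')[products].any().sum(axis=1)

multi_product_users = int((products_per_user > 1).sum())
print('{} of {} users asked about more than one product'.format(
    multi_product_users, len(products_per_user)))
```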

These numbers look good in a bar chart as well. Medians and ranges of the number of views, accept rate, reputation and score are easy to compare with boxplots.

In [None]:
reputation = [sodf2['reputation'].where(sodf2['cloudant']).dropna(), sodf2['reputation'].where(sodf2['dashdb']).dropna(),
             sodf2['reputation'].where(sodf2['couchdb']).dropna(), sodf2['reputation'].where(sodf2['pouchdb']).dropna()]

views = [sodf2['views'].where(sodf2['cloudant']).dropna(), sodf2['views'].where(sodf2['dashdb']).dropna(),
             sodf2['views'].where(sodf2['couchdb']).dropna(), sodf2['views'].where(sodf2['pouchdb']).dropna()]

accept = [sodf2['accept_rate'].where(sodf2['cloudant']).dropna(), sodf2['accept_rate'].where(sodf2['dashdb']).dropna(),
             sodf2['accept_rate'].where(sodf2['couchdb']).dropna(), sodf2['accept_rate'].where(sodf2['pouchdb']).dropna()]

questions = [sodf2['answer_count'].where(sodf2['cloudant']).count(),sodf2['answer_count'].where(sodf2['dashdb']).count(),
            sodf2['answer_count'].where(sodf2['couchdb']).count(),sodf2['answer_count'].where(sodf2['pouchdb']).count()]

answers = [sodf2['answer_count'].where(sodf2['cloudant']).sum(), sodf2['answer_count'].where(sodf2['dashdb']).sum(),
          sodf2['answer_count'].where(sodf2['couchdb']).sum(), sodf2['answer_count'].where(sodf2['pouchdb']).sum()]

score = [sodf2['score'].where(sodf2['cloudant']).dropna(), sodf2['score'].where(sodf2['dashdb']).dropna(),
          sodf2['score'].where(sodf2['couchdb']).dropna(), sodf2['score'].where(sodf2['pouchdb']).dropna()]

ticks = ['cloudant','dashdb','couchdb','pouchdb']

fig = plt.subplots(nrows=3, ncols=2, figsize=(16, 10))

ax0 = plt.subplot(321)
ind = np.arange(4)
width = 0.35       
ax0.bar(ind, questions, width, color='r')
ax0.bar(ind+0.35, answers, width, color='b')
ax0.set_xticks(0.2 + (ind + width / 2))
ax0.set_xticklabels(ticks)
ax0.legend(('Questions', 'Answers'))

ax1 = plt.subplot(322)
ax1.boxplot(reputation)
ax1.set_ylim([0,1000])
ax1.set_ylabel('Reputation')
ax1.set_xticklabels(ticks)

ax2 = plt.subplot(323)
ax2.boxplot(views)
ax2.set_ylim([0,600])
ax2.set_ylabel('Views')
ax2.set_xticklabels(ticks)

ax3 = plt.subplot(324)
ax3.boxplot(accept)
#ax3.set_ylim([0,600])
ax3.set_ylabel('Accept rate')
ax3.set_xticklabels(ticks)

ax4 = plt.subplot(325)
ax4.boxplot(score)
ax4.set_ylim([-2,3])
ax4.set_ylabel('Score')
ax4.set_xticklabels(ticks)
plt.tight_layout()


## Most popular questions

popularity = f(views,score,tags)

views = f(question age)

Might need to explode the data, duplicating questions when multiple tags

Classify or group questions?
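
As a starting point for these notes, here is a hedged sketch of the "explode" step and of views as a function of question age. The rows and the reference date are made up, `DataFrame.explode` needs pandas 0.25 or newer, and the sketch assumes `tags` holds lists and `creation` holds epoch seconds interpreted as UTC.

```python
import pandas as pd

toy = pd.DataFrame({
    'id': [1, 2],
    'views': [100, 30],
    'creation': [1483228800, 1485907200],   # epoch seconds: 2017-01-01 and 2017-02-01 (UTC)
    'tags': [['cloudant', 'couchdb'], ['pouchdb']],
})

# One row per (question, tag) pair, so a question with two tags appears twice
exploded = (toy.explode('tags')
               .rename(columns={'tags': 'tag'})
               .reset_index(drop=True))

# Age in days relative to a fixed reference date, then views per day of age
reference = pd.Timestamp('2017-03-03')
created = pd.to_datetime(exploded['creation'], unit='s')
exploded['age_days'] = (reference - created).dt.days
exploded['views_per_day'] = exploded['views'] / exploded['age_days']

print(exploded[['id', 'tag', 'age_days', 'views_per_day']])
```

Normalising views by age is only one possible definition of popularity. Score and tags could be folded in later once it is clearer what the measure should capture.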

Maybe there are some correlations between variables. Let's try some quick scatter plots. Note that the axes are cut off, because the outliers made it hard to see anything. At first sight there are no correlations. I am not entirely sure yet what I am trying to find here.

In [None]:
list(sodf2)

fig = plt.subplots(nrows=2, ncols=2, figsize=(13, 8))

ax0 = plt.subplot(221)
ax0.scatter(sodf2.reputation.where(sodf2['couchdb']), sodf2.views.where(sodf2['couchdb']),color='g')
ax0.scatter(sodf2.reputation.where(sodf2['pouchdb']), sodf2.views.where(sodf2['pouchdb']),color='orange')
ax0.scatter(sodf2.reputation.where(sodf2['cloudant']), sodf2.views.where(sodf2['cloudant']),color='r')
ax0.scatter(sodf2.reputation.where(sodf2['dashdb']), sodf2.views.where(sodf2['dashdb']),color='b')
ax0.set_ylim([0,2500])
ax0.set_xlim([0,2000])
ax0.set_ylabel('Views')
ax0.set_xlabel('Reputation')
ax0.legend(('couchdb','pouchdb','cloudant','dashdb'))

ax1 = plt.subplot(222)
ax1.scatter(sodf2.reputation.where(sodf2['couchdb']), sodf2.accept_rate.where(sodf2['couchdb']),color='g')
ax1.scatter(sodf2.reputation.where(sodf2['pouchdb']), sodf2.accept_rate.where(sodf2['pouchdb']),color='orange')
ax1.scatter(sodf2.reputation.where(sodf2['cloudant']), sodf2.accept_rate.where(sodf2['cloudant']),color='r')
ax1.scatter(sodf2.reputation.where(sodf2['dashdb']), sodf2.accept_rate.where(sodf2['dashdb']),color='b')
ax1.set_ylim([-2,102])
ax1.set_xlim([0,2000])
ax1.set_ylabel('Accept rate')
ax1.set_xlabel('Reputation')

ax2 = plt.subplot(223)
ax2.scatter(sodf2.score.where(sodf2['couchdb']), sodf2.views.where(sodf2['couchdb']),color='g')
ax2.scatter(sodf2.score.where(sodf2['pouchdb']), sodf2.views.where(sodf2['pouchdb']),color='orange')
ax2.scatter(sodf2.score.where(sodf2['cloudant']), sodf2.views.where(sodf2['cloudant']),color='r')
ax2.scatter(sodf2.score.where(sodf2['dashdb']), sodf2.views.where(sodf2['dashdb']),color='b')
#ax2.set_ylim([-2,102])
ax2.set_ylim([0,2500])
ax2.set_xlabel('Score')
ax2.set_ylabel('Views')

ax3 = plt.subplot(224)
ax3.scatter(sodf2.score.where(sodf2['couchdb']), sodf2.accept_rate.where(sodf2['couchdb']),color='g')
ax3.scatter(sodf2.score.where(sodf2['pouchdb']), sodf2.accept_rate.where(sodf2['pouchdb']),color='orange')
ax3.scatter(sodf2.score.where(sodf2['cloudant']), sodf2.accept_rate.where(sodf2['cloudant']),color='r')
ax3.scatter(sodf2.score.where(sodf2['dashdb']), sodf2.accept_rate.where(sodf2['dashdb']),color='b')
ax3.set_ylim([-2,102])
#ax3.set_ylim([0,2500])
ax3.set_xlabel('Score')
ax3.set_ylabel('Accept rate')

plt.tight_layout()
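
Scatter plots are hard to read when a few outliers stretch the axes, so a correlation matrix can serve as a numeric cross-check. The sketch below uses invented numbers and ranks the values first, which makes the result a Spearman rank correlation. Whether rank correlation is the right measure for this data is an assumption, not something the plots above establish.

```python
import pandas as pd

toy = pd.DataFrame({
    'reputation':  [10, 50, 200, 1500, 9000],
    'views':       [40, 120, 90, 300, 260],
    'score':       [0, 1, 0, 3, 2],
    'accept_rate': [50, 60, 70, 80, 90],
})

# Rank the values first so a few extreme outliers cannot dominate;
# the Pearson correlation of ranks is the Spearman rank correlation
rank_corr = toy.rank().corr()
print(rank_corr.round(2))
```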

As there is data for two years, there might be trends over time!

**TODO: summarize number of questions per month etc.**
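
A possible shape for this summary, sketched on three made-up questions: group by calendar month and count rows for the total, or sum the boolean tag columns for a per-product count. Unlike the cell below, the sketch converts the timestamps with `pd.to_datetime(..., unit='s')`, which interprets them as UTC rather than local time.

```python
import pandas as pd

toy = pd.DataFrame({
    'creation': [1483228800, 1483315200, 1485907200],   # epoch seconds (UTC)
    'cloudant': [True, False, True],
    'pouchdb':  [False, True, True],
})

toy['date'] = pd.to_datetime(toy['creation'], unit='s')
month = toy['date'].dt.to_period('M')

# Total number of questions per month
per_month = toy.groupby(month).size()

# Questions per month and product: summing booleans counts the True values
per_month_by_tag = toy.groupby(month)[['cloudant', 'pouchdb']].sum()

print(per_month)
print(per_month_by_tag)
```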

In [None]:
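sodf2['date']=sodf2['creation'].apply(lambda x: datetime.fromtimestamp(x))

# group the questions by calendar month
per1 = sodf2.date.dt.to_period("M")
g1 = sodf2.groupby(per1)
# summing the boolean tag columns gives the number of questions per tag and month
g2 = g1[['cloudant', 'dashdb', 'couchdb', 'pouchdb']].sum()

#g2.head()

list(g2)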



In [None]:



fig = plt.subplots(nrows=2, ncols=1, figsize=(20, 8))

ax0 = plt.subplot(211)
# g2 holds the monthly sum of each boolean tag column, i.e. the questions per month
months = g2.index.to_timestamp()
ax0.plot(months, g2.couchdb, color='g')
ax0.plot(months, g2.pouchdb, color='orange')
ax0.plot(months, g2.cloudant, color='r')
ax0.plot(months, g2.dashdb, color='b')
ax0.set_ylim([0,500])
ax0.set_xlabel('Month')
ax0.set_ylabel('Questions per month')
ax0.legend(('couchdb','pouchdb','cloudant','dashdb'))

### Word cloud of all tags

A quick [word cloud](https://github.com/amueller/word_cloud) to see which tags are used most. 

In [None]:
# change the numbers [h, s, l = hue, saturation, lightness] for different colors
def random_color_func(word=None, font_size=None, position=None,  orientation=None, font_path=None, random_state=None):
    h = int(360.0 * 140.0 / 255.0)
    s = int(150.0 * 255.0 / 255.0)
    l = int(100.0 * float(random_state.randint(50, 150)) / 255.0)
    return "hsl({}, {}%, {}%)".format(h, s, l)

tagtext = sodf2['tags'].to_string()
#tagtext2 = tagtext.split()
#tagtext2

stopwords = set(STOPWORDS)
stopwords.add("bluemix")
stopwords.add("ibm")
stopwords.add("NaN")

fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(16, 24))

wordcloud = WordCloud(background_color="white", margin=10, random_state=1, 
                      color_func=random_color_func, stopwords=stopwords, 
                     max_font_size=40, min_font_size=9).generate(tagtext)
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
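
A word cloud is essentially a picture of word frequencies, so the same information can be checked as plain counts. The sketch below counts tags with `collections.Counter` on three invented questions and skips the same two stop words as above. It assumes the tags are available as lists.

```python
from collections import Counter

# Toy stand-in for the 'tags' column: one list of tags per question
questions = [
    ['cloudant', 'couchdb', 'javascript'],
    ['pouchdb', 'javascript'],
    ['cloudant', 'ibm', 'bluemix'],
]
ignored = {'bluemix', 'ibm'}

# Count every tag that is not in the ignore list
tag_counts = Counter(tag for tags in questions for tag in tags if tag not in ignored)
top_tags = tag_counts.most_common(2)
print(top_tags)
```

As far as I know, recent versions of the `wordcloud` package can also draw a cloud directly from such a dictionary of counts with `generate_from_frequencies`, which avoids the detour through `to_string()`. Check the documentation of your installed version before relying on it.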

### Comparison of word clouds for cloudant, couchdb, dashdb and pouchdb

In [None]:
tags_of_interest = ['cloudant', 'dashdb', 'couchdb', 'pouchdb']

stopwords = set(STOPWORDS)
stopwords.update(["bluemix", "ibm", "NaN"])

fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(24, 12))

# one word cloud per product tag, laid out in a 2x2 grid
for i, tag in enumerate(tags_of_interest):
    # keep only the tags of questions that carry this product tag
    tagtext = sodf2['tags'].where(sodf2[tag]).to_string()
    plt.subplot(2, 2, i + 1)
    wordcloud = WordCloud(background_color="white", margin=10, random_state=1,
                          color_func=random_color_func, stopwords=stopwords,
                          max_font_size=40, min_font_size=9).generate(tagtext)
    plt.imshow(wordcloud)
    plt.axis("off")
    # label each panel so the four clouds can be told apart
    plt.title(tag)

plt.show()

In [None]:
print datetime.fromtimestamp(sodf2['creation'].min()).strftime('%Y-%m-%d')

#sodf2['creation'].max() 

Differences between the users of cloudant, couchdb, dashdb and pouchdb.

There is a large overlap in keywords, so the question is now whether there is a way to separate the 4 different products based on the

* reputation of the question owner
* accept rate of the question owner
* number of views of the question
* question score
* key words




In [None]:
# split question owners into experts and beginners at a reputation of 1000;
# rows without a reputation (NaN) fail both comparisons and end up in neither group
experts = sodf[sodf['reputation'] >= 1000]

beginners = sodf[sodf['reputation'] < 1000]

print(len(experts))
print(len(beginners))
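
A single cut-off at 1000 is a fairly coarse split. As a sketch of a finer-grained alternative, `pd.cut` can assign each user to a reputation band. The band edges and labels below are hypothetical and would need tuning against the real distribution. On the real data the question owners should also be de-duplicated first (for example with `drop_duplicates('user_id')`), because one user can own many questions.

```python
import pandas as pd

toy = pd.DataFrame({'user_id': [1, 2, 3, 4, 5],
                    'reputation': [1, 150, 999, 1000, 25000]})

# Hypothetical band edges; right=False makes each band include its lower edge,
# so a reputation of exactly 1000 counts as 'experienced'
edges = [0, 100, 1000, float('inf')]
labels = ['new', 'beginner', 'experienced']
toy['level'] = pd.cut(toy['reputation'], bins=edges, labels=labels, right=False)

level_counts = toy['level'].value_counts().to_dict()
print(level_counts)
```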

### Number of unique users and their experience

### Are there unique groups of users? Any classification of users possible? 

## Question tags over time

In [None]:
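# top 5 tags
# groupBy/orderBy are Spark methods, so use the Spark DataFrame so_data here
# (sodf2 is a pandas DataFrame); explode creates one row per tag
tags_exploded = so_data.select(explode(so_data.question.tags).alias("tag"))
top5tags = tags_exploded.groupBy("tag").count().orderBy("count", ascending=False)
top5 = top5tags.head(5)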

#datetime.fromtimestamp(1486667728)

In [None]:
# most frequent tag
check=top5[0].tag

# number of questions per creation timestamp and tag, computed on the Spark
# DataFrame so_data because groupBy/show are not available on pandas DataFrames
grouped = so_data.select(so_data.question.creation_date.alias("creation"),
                         explode(so_data.question.tags).alias("tag")) \
                 .groupBy("creation", "tag").count()

grouped.show(20)

In [None]:
tagsS = sodf['tags'].apply(pd.Series)
tagsS = tagsS.rename(columns = lambda x: 'tags_' + str(x))

sodf2 = pd.concat([sodf[:], tagsS[:]], axis=1)

sodf2['cloudant']=sodf2['tags'].apply(lambda x: 'cloudant' in x)
sodf2['dashdb']=sodf2['tags'].apply(lambda x: 'dashdb' in x)
sodf2['pixiedust']=sodf2['tags'].apply(lambda x: 'pixiedust' in x)

sodf2.head()

In [None]:
# explode the tags: one row per (question, tag) pair
cols = ['id', 'answer_count', 'closed', 'creation', 'answered',
        'accept_rate', 'reputation', 'score', 'views']
rows = []
for _, x in sodf.iterrows():
    for nn in x['tags']:
        rows.append([x[c] for c in cols] + [nn])

# the last column holds a single tag per row
sodf3 = pd.DataFrame(rows, columns=cols + ['tags'])
sodf3.head()

In [None]:
tag_counts = pd.DataFrame(sodf3.groupby('tags').size().rename('counts'))
tag_count = tag_counts.sort_values(['counts'], ascending=[False])
print len(tag_count), 'unique tags'
tag_count.head(10)

## Data visualisations

In [None]:
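# tag_count (not sodf3) holds the 'counts' column and is already sorted;
# reset_index turns the tag names into a regular column for PixieDust
display(tag_count.reset_index())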