# Analyzing Stack Overflow Activity

In our developer advocacy team we keep an eye on what is happening on Stack Overflow (SO). Stack Overflow questions can provide valuable insights into what customers are struggling with and how they are consuming our services.

This notebook answers the following questions:

* Which runtime environment or programming language is most commonly referred to in questions for our offerings?
* Which tags are most commonly used?



## Load and clean the data

As the data is stored in Cloudant, the first step is to load the data into the notebook and prepare it for analysis.

### Prerequisites

Import the required packages.

Install or update missing packages with `!pip install --user <package>`.
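
To find out which packages are missing before running the import cell, you can probe for them without importing them. This is a minimal sketch; the helper name `missing_packages` is hypothetical and not part of any library, and it assumes the top-level package names are `pixiedust` and `pyspark`.

```python
import importlib.util

def missing_packages(names):
    """Return the top-level package names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Assumption: these are the top-level names of the packages this notebook imports.
missing = missing_packages(["pixiedust", "pyspark"])
if missing:
    print("Install with: !pip install --user " + " ".join(missing))
else:
    print("All required packages are available.")
```

Only top-level names are checked here, so a package that is installed but outdated still needs a manual update.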

In [None]:
import pixiedust
import pyspark.sql.functions as func
import pyspark.sql.types as types

### Configure database connectivity

Customize this cell with your Cloudant/CouchDB connection information.

In [None]:
# @hidden_cell
# Enter your Cloudant host name
host = '--myhostname--'
# Enter your Cloudant user name
username = '--myusername--'
# Enter your Cloudant password
password = '--mysecretpassword--'
# Enter your source database name
database = '--mydatabasename--'

### Load documents from the database

Load the documents into an Apache Spark DataFrame and describe the data structure.

In [None]:
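# no changes are required to this cell
from pyspark.sql import SQLContext

# obtain Spark SQL Context; `sc` is assumed to be the SparkContext predefined by the notebook environment
sqlContext = SQLContext(sc)
# load data from the Cloudant database configured in the previous cell
so_data = sqlContext.read.format("com.cloudant.spark").\
                                 option("cloudant.host", host).\
                                 option("cloudant.username", username).\
                                 option("cloudant.password", password).\
                                 load(database)
# cache the DataFrame because it is reused by several cells
so_data.cache()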

In [None]:
# debug only
so_data.printSchema()
so_data.count()

### Prepare the data

Select data that's relevant to this analysis.
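
The selection flattens each nested question document into one row with short column names. The plain-Python sketch below shows the same idea outside Spark. The sample document and its values are invented for illustration, and `get_path`, `FIELDS` and `flatten` are hypothetical names. Only a subset of the selected fields is listed.

```python
# Hypothetical sample document; values are invented for illustration only.
doc = {
    "question": {
        "question_id": 1,
        "owner": {"reputation": 120, "user_id": 42},
        "answer_count": 2,
        "is_answered": True,
        "title": "Example question",
        "tags": ["cloudant", "node.js"],
    }
}

# Maps each flat column name to the dotted path inside the document.
FIELDS = {
    "id": "question.question_id",
    "accept_rate": "question.owner.accept_rate",
    "reputation": "question.owner.reputation",
    "answered": "question.is_answered",
    "title": "question.title",
    "tags": "question.tags",
}

def get_path(d, path):
    """Follow a dotted path through nested dicts; return None if a key is missing."""
    current = d
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current

def flatten(document):
    """Build one flat row (column name -> value) from a nested document."""
    return {alias: get_path(document, path) for alias, path in FIELDS.items()}

row = flatten(doc)
print(row)
```

Fields that are absent from a document, such as `accept_rate` in this sample, end up as `None` in the flattened row.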

In [None]:

sodf = so_data.select(so_data.question.question_id.alias("id"),
                       so_data.question.owner.accept_rate.alias("accept_rate"),
                       so_data.question.owner.reputation.alias("reputation"),
                       so_data.question.owner.user_id.alias("user_id"),
                       so_data.question.answer_count.alias("answer_count"), 
                       so_data.question.creation_date.alias("creation"), 
                       so_data.question.closed_date.alias("closed"),
                       so_data.question.is_answered.alias("answered"),
                       so_data.question.score.alias("score"),
                       so_data.question.view_count.alias("views"),
                       so_data.question.title.alias("title"),
                       so_data.question.tags.alias("tags")) 

In [None]:
# debug only
sodf.printSchema()
#sodf.select(sodf.id, sodf.tags).head(100)

## Identify runtime environment or programming language
Questions sometimes contain tags that might identify a programming language or runtime environment, such as `node.js` or `ios`. The following chart depicts correlations between our offerings and those tags.
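
The counting logic behind the chart can be sketched in plain Python on a handful of questions. For every question, each key tag it carries is paired with each environment tag it carries, and the pairs are tallied. The sample questions are invented, the tag lists are shortened, and `count_key_env_pairs` is a hypothetical helper name. The notebook itself performs this step with Spark.

```python
from collections import Counter

# Shortened, illustrative versions of the tag lists used in the notebook.
sample_key_tags = ["cloudant", "dashdb"]
sample_env = ["node.js", "java", "python"]

# Hypothetical questions; ids and tags are invented for illustration.
sample_questions = [
    {"id": 1, "tags": ["cloudant", "node.js"]},
    {"id": 2, "tags": ["cloudant", "java", "couchdb"]},
    {"id": 3, "tags": ["dashdb", "python"]},
    {"id": 4, "tags": ["cloudant", "node.js", "javascript"]},
    {"id": 5, "tags": ["ibm-bluemix"]},
]

def count_key_env_pairs(questions, key_tags, env):
    """Count how often each (key tag, environment tag) pair occurs on the same question."""
    counts = Counter()
    for question in questions:
        tags = question.get("tags") or []
        keys = [tag for tag in tags if tag in key_tags]
        envs = [tag for tag in tags if tag in env]
        for key in keys:
            for environment in envs:
                counts[(key, environment)] += 1
    return counts

pair_counts = count_key_env_pairs(sample_questions, sample_key_tags, sample_env)
# List the pairs from most to least frequent.
for (key, environment), n in pair_counts.most_common():
    print(key, environment, n)
```

Questions without a key tag (id 5) or without an environment tag contribute nothing to the counts.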

In [None]:
# only consider questions that contain one of these tags
key_tags = ["cloudant", "dashdb"]
# of the questions that meet the first condition, only count those that also contain
# a tag that is associated with a programming language or runtime environment
# the following curated list was obtained from https://meta.stackoverflow.com/tags; customize as needed
env = ["node.js", "java", "javascript", "python", "android", "php", "node-red", "cordova", "c#", "ios", "swift"]

In [None]:
# ---------------
# prepare data
# ---------------
env_df = sodf.select(sodf.id, sodf.tags)

def extractKeyTags(tags):
    # input: ["tag1","tag2", "tag3", ...]
    # output: ["key_tag1","key_tag2",...]
    # a missing (null) tags array is treated as empty
    return [tag for tag in (tags or []) if tag in key_tags]

extractUDF = func.udf(extractKeyTags, types.ArrayType(types.StringType()))
# add a column listing the key tags found in each question
env_df = env_df.withColumn("keys", extractUDF(env_df.tags))
# one row per (question, key tag)
env_df = env_df.select(env_df.id, func.explode(env_df.keys).alias("key"), env_df.tags)
# one row per (question, key tag, tag)
env_df = env_df.select(env_df.id, env_df.key, func.explode(env_df.tags).alias("tag"))
# keep only environment tags and count each (key, env) combination
env_df = (env_df.filter(func.col("tag").isin(env))
                .groupBy(["key", "tag"])
                .count()
                .orderBy("count", ascending=False)
                .withColumnRenamed("tag", "env"))

# +--------+----------+-----+
# |     key|       env|count|
# +--------+----------+-----+
# |cloudant|   node.js|   73|
# |cloudant|      java|   49|
# | ...

# ---------------
# visualize data
# ---------------
#env_df.show()
display(env_df)

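## Generic tag associations

The following chart shows which other tags are most commonly used together with the tags of our offerings.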
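
The co-occurrence count behind this analysis can be sketched in plain Python. For every question that carries a key tag, each of its other tags is tallied against that key tag. The sample questions are invented, and `count_cooccurring_tags` is a hypothetical helper name. The notebook itself performs this step with Spark.

```python
from collections import Counter

# Hypothetical questions; tags are invented for illustration.
example_questions = [
    {"id": 1, "tags": ["dashdb", "sql", "ibm-bluemix"]},
    {"id": 2, "tags": ["dashdb", "sql"]},
    {"id": 3, "tags": ["cloudant", "couchdb"]},
]

def count_cooccurring_tags(questions, key_tags):
    """Count every tag that appears on the same question as a key tag, excluding the key tag itself."""
    counts = Counter()
    for question in questions:
        tags = question.get("tags") or []
        for key in tags:
            if key not in key_tags:
                continue
            for tag in tags:
                if tag != key:
                    counts[(key, tag)] += 1
    return counts

tag_counts = count_cooccurring_tags(example_questions, ["dashdb"])
# List the associations from most to least frequent.
for (key, tag), n in tag_counts.most_common():
    print(key, tag, n)
```

Because only `dashdb` is passed as a key tag, the `cloudant` question is ignored entirely.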

In [None]:
# only consider questions that contain one of these tags 
key_tags = ["dashdb"]

In [None]:
# ---------------
# prepare data
# ---------------
tag_df = sodf.select(sodf.id, sodf.tags)

def extractKeyTags(tags):
    # input: ["tag1","tag2", "tag3", ...]
    # output: ["key_tag1","key_tag2",...]
    # a missing (null) tags array is treated as empty
    return [tag for tag in (tags or []) if tag in key_tags]

extractUDF = func.udf(extractKeyTags, types.ArrayType(types.StringType()))
# add a column listing the key tags found in each question
tag_df = tag_df.withColumn("keys", extractUDF(tag_df.tags))
# one row per (question, key tag)
tag_df = tag_df.select(tag_df.id, func.explode(tag_df.keys).alias("key"), tag_df.tags)
# one row per (question, key tag, tag)
tag_df = tag_df.select(tag_df.id, tag_df.key, func.explode(tag_df.tags).alias("tag"))
# drop the key tag itself and count each (key, tag) combination
tag_df = (tag_df.filter(func.col("key").isin(key_tags))
                .filter("key != tag")
                .groupBy(["key", "tag"])
                .count()
                .orderBy("count", ascending=False))

# sample output (obtained with "cloudant" in key_tags)
# +--------+------------+-----+
# |     key|         tag|count|
# +--------+------------+-----+
# |cloudant|     couchdb|  190|
# |cloudant| ibm-bluemix|  170|
# |  ...

# ---------------
# visualize data
# ---------------
#tag_df.show()
display(tag_df)