# Stack Overflow

## Introduction 

In the second part of this assignment, we will create and analyze time series of creation dates of Stack Overflow questions. This assignment is to be completed **INDIVIDUALLY** and it is due on **October 7 at 7pm**.

Let's create some time series from the data. You may choose to analyze either users or tags. To analyze users, take the 100 users with the most question posts; for each user, your time series will be the number of questions that user posted per sampling period. To analyze tags, take the 100 most popular question tags; for each tag, your time series will be the number of questions with that tag per sampling period.

You may sample your data by week, by month, by day of the week, or by hour of the day, depending on the trend you hope to find. For example, if you analyze tags by hour of the day, your hypothesis could be that languages used more in industry (e.g. JavaScript) will have more questions posted during work hours, whereas languages taught in academia (e.g. Python) will have more questions posted after midnight, when students are scrambling to finish their homework.
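As a rough illustration of the hour-of-day idea, the sketch below builds one 24-point series per tag from a tiny, made-up table. The column names `Tag` and `CreationDate` and the sample rows are assumptions for the example, not part of the assignment data.

```python
import pandas as pd

# Hypothetical toy data: one row per question, with its first tag and creation time.
toy = pd.DataFrame({
    "Tag": ["python", "python", "javascript", "javascript", "python"],
    "CreationDate": pd.to_datetime([
        "2015-01-05 01:15:00",
        "2015-01-06 01:45:00",
        "2015-01-05 14:30:00",
        "2015-01-07 14:05:00",
        "2015-01-08 23:59:00",
    ]),
})

# Rows are hours of the day, columns are tags, cells are question counts.
hourly = pd.crosstab(toy["CreationDate"].dt.hour, toy["Tag"])

# Reindex so that all 24 hours appear, including hours with no questions.
hourly = hourly.reindex(range(24), fill_value=0)

# Divide each column by its total so that tags are compared by the shape
# of their daily pattern rather than by their overall volume.
profile = hourly / hourly.sum()
```

Whether to compare raw counts (`hourly`) or proportions (`profile`) depends on your hypothesis; proportions keep very popular tags from dominating the comparison.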

Compare the time series using one of the methods discussed in class. In a few paragraphs, write down what you were hoping to find in the data, which time series you created, and which method you chose and why. **(30 pts)**

You may find the [pandas.DataFrame.resample](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.resample.html) method helpful.
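For calendar-based sampling (weekly or monthly counts), a minimal sketch of `resample` might look like the following. The timestamps are invented for illustration; with real data you would index by the parsed `CreationDate` values of one user or tag.

```python
import pandas as pd

# Hypothetical creation times for the questions of a single tag.
stamps = pd.to_datetime([
    "2015-01-05 09:00:00",
    "2015-01-06 10:30:00",
    "2015-01-14 16:45:00",
    "2015-02-02 08:00:00",
])

# One observation per question; summing the ones in each bin counts questions.
ones = pd.Series(1, index=stamps)

# "W" bins by week (ending on Sunday by default); empty weeks sum to 0.
weekly = ones.resample("W").sum()

# "MS" bins by calendar month, labelled with the first day of the month.
monthly = ones.resample("MS").sum()

print(weekly)
print(monthly)
```

`resample` requires a datetime-like index, which is why the timestamps are converted with `pd.to_datetime` first.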

In [None]:
from pprint import pprint
import xml.etree.ElementTree as ET
from pandas import Series, DataFrame
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
matplotlib.style.use('ggplot')
from math import sqrt

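N = 100          # number of top tags to keep
rows = []        # one dict per question; the DataFrame is built once at the end
tag_count = {}   # first tag of each question -> number of questions

for event, element in ET.iterparse('stackoverflow-posts-2015.xml'):
    # Questions have PostTypeId == '1'. Using .get() skips elements that
    # lack the attribute (such as the root) instead of raising KeyError.
    if element.attrib.get('PostTypeId') == '1':
        try:
            tags = element.attrib['Tags']
            # Keep only the first tag, i.e. everything up to the first '>'.
            tag = tags[:tags.find('>') + 1]
            rows.append({'Id': element.attrib['Id'],
                         'CreationDate': element.attrib['CreationDate'],
                         'Tags': tag})
            tag_count[tag] = tag_count.get(tag, 0) + 1
        except KeyError:
            # Skip questions that are missing Id, CreationDate, or Tags.
            pass
    # Release the parsed element so memory stays bounded on a large dump.
    element.clear()

# Building the frame from a list of rows avoids the slow (and, in recent
# pandas versions, removed) DataFrame.append call inside the loop.
questions = pd.DataFrame(rows, columns=['Id', 'CreationDate', 'Tags'])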
            
# Keep the N tags with the most questions, most frequent first.
top_tags = sorted(tag_count, key=tag_count.get, reverse=True)[:N]

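# Rows are hours of the day (0-23), columns are the top tags.
time_series2 = pd.DataFrame(0, index=list(range(24)), columns=top_tags)

for tag in top_tags:
    for dt in questions.loc[questions['Tags'] == tag, 'CreationDate'].tolist():
        # Characters 11-12 of the timestamp string hold the hour.
        # .at updates the cell in place; assigning through .xs() may write
        # to a temporary copy and leave time_series2 unchanged.
        time_series2.at[int(dt[11:13]), tag] += 1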

def cosine_similarity(l1, l2):
    """Return the cosine of the angle between two equal-length numeric lists."""
    p_sum = sum(a * b for a, b in zip(l1, l2))
    l1_sum = sum(a * a for a in l1)
    l2_sum = sum(b * b for b in l2)
    return p_sum / sqrt(l1_sum * l2_sum)

similarities = {}
for col1 in top_tags:
    for col2 in top_tags:
        if col1 != col2:
            if (col2, col1) not in similarities:
                cl1 = time_series2[col1].tolist()
                cl2 = time_series2[col2].tolist()
                similarities[(col1, col2)] = cosine_similarity(cl1, cl2)

print('By cosine similarity, %s is the most similar tag pair.\n%s is the most dissimilar tag pair.' 
      % (str(max(similarities, key=similarities.get)), str(min(similarities, key=similarities.get))))


Choose a different distance/similarity metric and repeat the same time series analysis. Compare the two different metrics you used. **(10 pts)**
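When comparing metrics, it can help to see how they respond to scale. The sketch below uses NumPy and two made-up vectors (both assumptions for illustration) to show that cosine similarity ignores the magnitude of a series while Euclidean distance does not, and that for unit-length vectors the two are tied together by d² = 2(1 − cos).

```python
import numpy as np

# Two hypothetical count series of equal length.
a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([4.0, 3.0, 2.0, 1.0])

def cosine_sim(x, y):
    """Cosine of the angle between x and y."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def euclidean_dist(x, y):
    """Straight-line distance between x and y."""
    return float(np.linalg.norm(x - y))

cos_ab = cosine_sim(a, b)
dist_ab = euclidean_dist(a, b)

# Scaling one series by 10 leaves the cosine unchanged but inflates the distance.
cos_scaled = cosine_sim(10 * a, b)
dist_scaled = euclidean_dist(10 * a, b)

# After scaling both series to unit length, squared Euclidean distance
# equals 2 * (1 - cosine similarity).
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
dist_unit = euclidean_dist(a_unit, b_unit)

print(cos_ab, cos_scaled)
print(dist_ab, dist_scaled)
print(dist_unit ** 2, 2 * (1 - cos_ab))
```

One consequence is that Euclidean distance on raw counts can be driven largely by how popular two tags are, so the two metrics may rank tag pairs quite differently unless the series are normalized first.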

In [None]:
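def euclidean_distance(l1, l2):
    """Return the Euclidean distance between two equal-length numeric lists."""
    res = 0
    for i in range(len(l1)):
        p = l1[i] - l2[i]
        res += p*p
    return sqrt(res)

distances = {}
for col1 in top_tags:
    for col2 in top_tags:
        if col1 != col2:
            # Distance is symmetric, so compute each unordered pair once.
            if (col2, col1) not in distances:
                cl1 = time_series2[col1].tolist()
                cl2 = time_series2[col2].tolist()
                distances[(col1, col2)] = euclidean_distance(cl1, cl2)

# A smaller distance means a more similar pair, so the most similar pair
# is the minimum and the most dissimilar pair is the maximum.
print('By Euclidean distance, %s is the most similar tag pair.\n%s is the most dissimilar tag pair.'
      % (str(min(distances, key=distances.get)), str(max(distances, key=distances.get))))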
      
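# Scale both metrics by their maxima so they can be compared pair by pair.
sim_dist = {}
max_sim = max(similarities.values())
max_dist = max(distances.values())

for tag_pair in similarities:
    sim_dist[tag_pair] = [similarities[tag_pair]/max_sim, distances[tag_pair]/max_dist]

# Average the per-pair ratio over the number of pairs (not over N, the number
# of tags), skipping pairs whose distance is 0 to avoid dividing by zero.
# Similarity rises and distance falls as two series become more alike, so
# this ratio is only a rough indicator of how differently the metrics behave.
ratios = [s/d for s, d in sim_dist.values() if d != 0]
difference = sum(ratios)/len(ratios)

print('The mean ratio of the two scaled metrics is %s. (Identical scaled values would produce 1)' % str(difference))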

