### At what point in high school do students give up on math? Identifying knowledge gaps in high school and college math education using Stack Exchange's Math forum

#### Additional Questions:
    1. What are the most popular subjects over time? (Upvotes, Favorites and Views)
    2. What are the most hotly debated subjects over time? (Comment Count, Longevity)
    3. What are the most fundamental subjects students are asking about over time? (Asked multiple times, answered quickly and by only a few commenters)
    4. What are the demographics of the posters and commenters? (Education level, age and location)
    5. Who are the most active math people? How quickly do they reply? Who has way too much free time?
    6. Can I determine sex from username?
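
Question 5 ("How quickly do they reply?") can be made concrete as the time between a question and its earliest answer. The sketch below is a minimal, standard-library illustration on made-up rows, not the notebook's method. It assumes the Stack Exchange dump conventions that `PostTypeId` is `'1'` for questions and `'2'` for answers, that answers point to their question through `ParentId`, and that `CreationDate` looks like `2010-07-20T19:09:27.200`. The function name `first_answer_delays` is hypothetical.

```python
from datetime import datetime

DATE_FORMAT = '%Y-%m-%dT%H:%M:%S.%f'  # assumed timestamp layout of the dump

def first_answer_delays(posts):
    """Map question Id -> timedelta between the question and its earliest answer."""
    asked = {}
    first_answer = {}
    for post in posts:
        when = datetime.strptime(post['CreationDate'], DATE_FORMAT)
        if post['PostTypeId'] == '1':        # question
            asked[post['Id']] = when
        elif post['PostTypeId'] == '2':      # answer
            parent = post['ParentId']
            if parent not in first_answer or when < first_answer[parent]:
                first_answer[parent] = when
    # Unanswered questions are left out of the result
    return {qid: first_answer[qid] - asked[qid]
            for qid in asked if qid in first_answer}

# Made-up rows for illustration only
posts = [
    {'Id': '1', 'PostTypeId': '1', 'CreationDate': '2010-07-20T19:09:27.200'},
    {'Id': '2', 'PostTypeId': '2', 'ParentId': '1', 'CreationDate': '2010-07-20T19:14:27.200'},
    {'Id': '3', 'PostTypeId': '2', 'ParentId': '1', 'CreationDate': '2010-07-20T19:11:27.200'},
    {'Id': '4', 'PostTypeId': '1', 'CreationDate': '2010-07-21T08:00:00.000'},
]
delays = first_answer_delays(posts)
```

Grouping the same delays by the answering user's `OwnerUserId` would give a per-user view of who replies fastest.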

In [3]:
import numpy as np
import pandas as pd
import nltk
import re
import os
import codecs
from datetime import datetime
from sklearn import feature_extraction
import mpld3
from xml.etree import ElementTree as ET

try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2

In [4]:
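class MLStripper(HTMLParser):
    """Collects only the text nodes of an HTML fragment, discarding the tags."""
    def __init__(self):
        # Run the base initializer (it calls reset()); Python 3 also needs it to set convert_charrefs
        HTMLParser.__init__(self)
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    """Return the plain text of an HTML string."""
    s = MLStripper()
    s.feed(html)
    return s.get_data()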

In [5]:
etree = ET.parse('Data/Math/math.stackexchange.com/Posts.xml')
root = etree.getroot()
# Each child of the root is one <row> element; its attributes are the post's fields
postdicts = [row.attrib for row in root]
postdata = pd.DataFrame(postdicts)
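
`ET.parse` builds the whole tree in memory before any row is read, which can be costly for a large dump file. A streaming alternative is sketched below using `ElementTree.iterparse`; it is an illustration under assumptions, not the approach used in this notebook. It assumes each record is a `<row>` element whose attributes hold the fields, and the helper name `iter_rows` and its `keep` parameter are hypothetical.

```python
import io
from xml.etree import ElementTree as ET

def iter_rows(source, keep=None):
    """Yield one attribute dict per <row> element, optionally restricted to the keys in `keep`."""
    for _, elem in ET.iterparse(source, events=('end',)):
        if elem.tag == 'row':
            attrs = dict(elem.attrib)
            if keep is not None:
                attrs = {k: v for k, v in attrs.items() if k in keep}
            yield attrs
            elem.clear()  # release the row's attributes once they have been copied

# Tiny in-memory stand-in for a dump file (made-up content)
sample = io.BytesIO(
    b'<posts>'
    b'<row Id="1" PostTypeId="1" Body="&lt;p&gt;Hi&lt;/p&gt;" />'
    b'<row Id="2" PostTypeId="2" ParentId="1" Body="ok" />'
    b'</posts>'
)
rows = list(iter_rows(sample, keep={'Id', 'PostTypeId'}))
```

With a real file, `pd.DataFrame(list(iter_rows('Data/Math/math.stackexchange.com/Posts.xml')))` should produce the same kind of frame as the cell above.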

In [6]:
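# Assign back to the column: edits to the rows yielded by iterrows() are not guaranteed to reach the frame
postdata['Body'] = postdata['Body'].map(strip_tags, na_action='ignore')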

In [7]:
etree = ET.parse('Data/Math/math.stackexchange.com/Users.xml')
root = etree.getroot()
# Each child of the root is one <row> element; its attributes are the user's fields
userdicts = [row.attrib for row in root]
userdata = pd.DataFrame(userdicts)

In [8]:
postdata = pd.merge(postdata, userdata, left_on='OwnerUserId', right_on='Id', how='left')

In [9]:
etree = ET.parse('Data/Math/math.stackexchange.com/Comments.xml')
root = etree.getroot()
# Each child of the root is one <row> element; its attributes are the comment's fields
commentdicts = [row.attrib for row in root]
commentdata = pd.DataFrame(commentdicts)

In [10]:
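# Assign back to the column: edits to the rows yielded by iterrows() are not guaranteed to reach the frame
commentdata['Text'] = commentdata['Text'].map(strip_tags, na_action='ignore')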

In [11]:
commentdata = pd.merge(commentdata, userdata, left_on='UserId', right_on='Id', how='left')

In [12]:
commentdata.columns

Index([u'CreationDate_x', u'Id_x', u'PostId', u'Score', u'Text', u'UserDisplayName', u'UserId', u'AboutMe', u'AccountId', u'Age', u'CreationDate_y', u'DisplayName', u'DownVotes', u'Id_y', u'LastAccessDate', u'Location', u'ProfileImageUrl', u'Reputation', u'UpVotes', u'Views', u'WebsiteUrl'], dtype='object')

In [13]:
postdata.columns

Index([u'AcceptedAnswerId', u'AnswerCount', u'Body', u'ClosedDate', u'CommentCount', u'CommunityOwnedDate', u'CreationDate_x', u'FavoriteCount', u'Id_x', u'LastActivityDate', u'LastEditDate', u'LastEditorDisplayName', u'LastEditorUserId', u'OwnerDisplayName', u'OwnerUserId', u'ParentId', u'PostTypeId', u'Score', u'Tags', u'Title', u'ViewCount', u'AboutMe', u'AccountId', u'Age', u'CreationDate_y', u'DisplayName', u'DownVotes', u'Id_y', u'LastAccessDate', u'Location', u'ProfileImageUrl', u'Reputation', u'UpVotes', u'Views', u'WebsiteUrl'], dtype='object')

In [14]:
postdata.CreationDate_x = pd.to_datetime(postdata.CreationDate_x)
postdata.LastActivityDate = pd.to_datetime(postdata.LastActivityDate)

In [15]:
postdata['Longevity'] = postdata.LastActivityDate - postdata.CreationDate_x

In [30]:
# Pull out tags
# '<' and '>' need no escaping, so this works whether pandas treats the pattern as a regex or a literal
postdata['Tags'] = postdata.Tags.str.replace('<', '').str.replace('>', ',')
postdata.Tags[0:10]

0                            set-theory,intuition,faq,
1                          calculus,limits,definition,
2             soft-question,big-list,online-resources,
3                                                  NaN
4                    number-theory,irrational-numbers,
5                         soft-question,math-software,
6                                                  NaN
7    linear-algebra,combinatorics,generating-functi...
8                                                  NaN
9               algebra-precalculus,decimal-expansion,
Name: Tags, dtype: object
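
To rank subjects, the comma-joined strings above still have to be split into individual tags and counted. The following is a minimal Python 3 sketch using only the standard library. The helper names `split_tags` and `tag_counts` are hypothetical, and the sample values are copied from the output above, with `NaN` standing for posts that carry no tags.

```python
from collections import Counter

def split_tags(tag_string):
    """Turn 'a,b,c,' into ['a', 'b', 'c']; non-string values such as NaN give []."""
    if not isinstance(tag_string, str):
        return []
    return [tag for tag in tag_string.split(',') if tag]

def tag_counts(tag_strings):
    """Count how many posts carry each tag."""
    counts = Counter()
    for tag_string in tag_strings:
        counts.update(split_tags(tag_string))
    return counts

sample = [
    'set-theory,intuition,faq,',
    'calculus,limits,definition,',
    'soft-question,big-list,online-resources,',
    float('nan'),
    'soft-question,math-software,',
]
counts = tag_counts(sample)
```

On the full frame, `tag_counts(postdata.Tags)` would give overall tag frequencies; grouping posts by creation year first would give the "over time" view.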

In [34]:
# Calculate Body Length
# Bracket assignment creates the column; attribute assignment would only set a Python attribute
postdata['BodyLength'] = postdata['Body'].str.len()
postdata.BodyLength[0:9]

0     269
1      71
2      72
3     191
4     196
5     117
6     285
7     392
8    1490
Name: BodyLength, dtype: int64

In [None]:
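import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# Engagement metrics per post; the XML attributes load as strings, so cast them to numbers
metrics = ['AnswerCount', 'CommentCount', 'FavoriteCount', 'ViewCount']
dataforplotting = postdata[['Tags', 'Longevity'] + metrics].copy()
dataforplotting[metrics] = dataforplotting[metrics].apply(pd.to_numeric, errors='coerce')

# Views vs. comments is one illustrative pairing; any two of the metrics above can be swapped in
dataforplotting.plot(kind='scatter', x='ViewCount', y='CommentCount', title='Views vs. Comments', figsize=(16, 6))
plt.show()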

To Do:
1. Exploratory visualization.
2. Classify education level from seed labels, then train a classifier to pull in more topics.
    - If a post has a given tag, or its title contains "high school", I am 99% sure it is a high-school-level question.
    - I am 100% sure that some tags (e.g. abstract algebra) are not high school level.
3. Cluster the post text to find more posts at each grade level among those the tags do not cover.
4. This yields the set of high-school-level posts.
5. Visualize high school vs. non-high-school posts.
6. Run topic modeling for more granularity within tags.
7. Identify which topics students struggle with in each subject, since those may be where they give up.
8. Rank topics as most complex, most hotly debated, or most fundamental.
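
The seed-labeling idea in item 2 could look roughly like the sketch below: a post gets a label only when a tag or the title makes the level near-certain, and everything else stays unlabeled for the later clustering step. The tag sets are hypothetical placeholders (only `algebra-precalculus` appears in the output above, and abstract algebra is the one exclusion the notes name), and letting the exclusion set win over the inclusion set is an assumption.

```python
# Hypothetical seed sets; the real lists would need to be curated by hand
HIGH_SCHOOL_TAGS = {'algebra-precalculus'}
NOT_HIGH_SCHOOL_TAGS = {'abstract-algebra'}

def seed_label(tags, title=''):
    """Return 'high-school', 'not-high-school', or None when the post gives no clear signal."""
    tags = set(tags)
    if tags & NOT_HIGH_SCHOOL_TAGS:
        # Assumed precedence: the "100% sure" exclusion beats the "99% sure" inclusion
        return 'not-high-school'
    if tags & HIGH_SCHOOL_TAGS or 'high school' in title.lower():
        return 'high-school'
    return None

labels = [
    seed_label(['algebra-precalculus', 'decimal-expansion']),
    seed_label(['abstract-algebra', 'algebra-precalculus']),
    seed_label(['calculus'], title='High school calculus help'),
    seed_label(['limits']),
]
```

Posts labeled `None` are the ones a classifier or clustering step would then have to place.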