## 0402 - Finding Correlation and Significant Variables in Pandas (Lobste.rs)

In this case study, we'll prepare [lobste.rs](http://lobste.rs) stories so that we can look for significant variables and correlations.

In [None]:
import pandas as pd
import requests
from fuzzywuzzy import fuzz
from collections import Counter
from yellowbrick.features.rankd import Rank2D 
from sklearn.feature_selection import mutual_info_regression

### If you'd rather read the latest stories from the API, uncomment the commented lines (and comment out the final line)

In [None]:
#import io
#resp = requests.get('https://lobste.rs/hottest.json')
#stories = pd.read_json(io.BytesIO(resp.content))  # wrap the raw bytes in a file-like object
#stories = stories.set_index('short_id')

stories = pd.read_json('../data/lobsters_sample.json')

In [None]:
stories.head()

In [None]:
stories.dtypes

### Let's take a look at the submitter_user field, which appears to hold a dict

In [None]:
stories.submitter_user.iloc[3]

In [None]:
user_df = stories['submitter_user'].apply(pd.Series)

In [None]:
user_df.head()

### Can we combine the user data without potential column overlap?

In [None]:
set(user_df.columns).intersection(stories.columns)

In [None]:
user_df = user_df.rename(columns={'created_at': 
                                  'user_created_at'})

In [None]:
stories = pd.concat([stories.drop(['submitter_user'], axis=1), 
                     user_df], axis=1)

In [None]:
stories.head()
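The expand, check, rename, and concat steps above can be tried end to end on a toy frame. This is a minimal sketch: the `toy` DataFrame and its values are invented for illustration and only mimic the shape of the real stories data.

```python
import pandas as pd

# Toy stand-in for the stories frame; all values are made up.
toy = pd.DataFrame({
    "title": ["A", "B"],
    "created_at": ["2018-01-01", "2018-01-02"],
    "submitter_user": [
        {"username": "ann", "karma": 10, "created_at": "2016-05-01"},
        {"username": "bob", "karma": 25, "created_at": "2017-03-09"},
    ],
})

# Each dict becomes a row and each key becomes a column.
toy_users = toy["submitter_user"].apply(pd.Series)

# Column names present in both frames would collide after concat.
overlap = set(toy_users.columns).intersection(toy.columns)

# Rename the colliding column, then join the frames side by side.
toy_users = toy_users.rename(columns={"created_at": "user_created_at"})
combined = pd.concat([toy.drop(["submitter_user"], axis=1), toy_users], axis=1)
print(overlap)
print(list(combined.columns))
```

Checking the overlap before concatenating matters because `pd.concat` with `axis=1` does not complain about duplicate column names; it simply keeps both, which makes later column selection ambiguous.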

### Let's check for nulls

In [None]:
stories.shape

In [None]:
stories.dropna().shape

In [None]:
stories.dropna(thresh=10, axis=1).shape
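To see what `thresh` does with `axis=1`, here is a small sketch on invented data (the `toy` frame and its column names are hypothetical). A column is kept only if it has at least `thresh` non-null values.

```python
import numpy as np
import pandas as pd

# Toy frame with columns of varying completeness; values are made up.
toy = pd.DataFrame({
    "full": [1, 2, 3, 4],
    "mostly": [1.0, np.nan, 3.0, 4.0],
    "sparse": [np.nan, np.nan, np.nan, 4.0],
})

# Non-null count per column: this is the number thresh is compared against.
non_null = toy.notna().sum()

# Keep only the columns with at least 3 non-null values.
kept = toy.dropna(thresh=3, axis=1)
print(non_null.to_dict())
print(list(kept.columns))
```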

### Exercise: which columns would be dropped?

In [None]:
%load ../solutions/lobsters_dropped.py


## Let's make the tags easier to use by turning them into feature columns.

In [None]:
tag_df = stories.tags.apply(pd.Series)

In [None]:
tag_df.head()

In [None]:
# what are our unique tags?

pd.unique(tag_df.values.ravel())

In [None]:
set(tag_df.values.ravel())

In [None]:
len(pd.unique(tag_df.values.ravel()))

In [None]:
# most common tags

Counter(tag_df.values.ravel()).most_common(5)

### Let's create a dummy df with our tags

In [None]:
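# stack() gives one row per (story, tag) pair; after one-hot encoding,
# group on the story index to get back to one row per story.
# (sum(level=0) was removed in pandas 2.0, hence groupby(level=0).sum().)
tag_df = pd.get_dummies(
    tag_df.stack()).groupby(level=0).sum()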

In [None]:
tag_df.head()
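The same kind of indicator table can also be built straight from a column of lists. The sketch below is an alternative route, not the notebook's own method, and it runs on an invented `toy_tags` Series; it assumes a pandas version that provides `Series.explode`.

```python
import pandas as pd

# Toy tags column: one list of tags per story; values are made up.
toy_tags = pd.Series([["python", "web"], ["rust"], ["python"]], name="tags")

# explode() yields one row per (story, tag) pair, repeating the story's index label.
long_tags = toy_tags.explode()

# One indicator column per tag, then collapse back to one row per story.
toy_dummies = pd.get_dummies(long_tags).groupby(level=0).sum()
print(toy_dummies)
```

Grouping on index level 0 is what merges the several rows belonging to one story back into a single row of 0/1 flags.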

### Now we can add it back to our stories DataFrame

In [None]:
stories = pd.concat([stories, 
                     tag_df], axis=1)

In [None]:
stories.head()

### Another potentially useful feature is the post times...

In [None]:
stories['created_hour'] = stories.created_at.map(
    lambda x: x.hour)

In [None]:
stories['created_dow'] = stories.created_at.map(
    lambda x: x.weekday())
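If a timestamp column has a proper datetime dtype, the `.dt` accessor gives the same features without a lambda. This is a sketch on an invented `toy_times` Series; whether `created_at` in the real data was parsed to a datetime dtype is an assumption worth checking with `stories.dtypes` first.

```python
import pandas as pd

# Toy timestamps; values are made up.
toy_times = pd.Series(pd.to_datetime(["2018-03-05 09:30", "2018-03-10 22:15"]))

# Vectorized equivalents of x.hour and x.weekday() (Monday is 0, Sunday is 6).
hours = toy_times.dt.hour
dows = toy_times.dt.dayofweek
print(hours.tolist(), dows.tolist())
```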

### Let's analyze some of the correlations in our features so far...

In [None]:
stories[['created_hour', 'score']].corr()

In [None]:
stories[['created_dow', 'score']].corr()

In [None]:
stories[['karma', 'score']].corr()

In [None]:
stories[['comment_count', 'score']].corr()

In [None]:
stories[['hardware', 'score']].corr()

## Exercise: can you find a stronger positive correlation?

- You can use sklearn's `mutual_info_regression` (and pass score as the target)
- You can use Rank2D from Yellowbrick
- You can use a for loop over the DataFrame returned by pandas `corr()`
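As a starting point for the `corr()` approach, one option is to take the `score` column of the correlation matrix and sort it, rather than looping. The sketch below uses an invented `toy` frame, so the numbers say nothing about the real stories.

```python
import pandas as pd

# Toy numeric features; values are made up.
toy = pd.DataFrame({
    "score":         [1, 2, 3, 4, 5],
    "comment_count": [2, 4, 6, 8, 10],  # rises in lockstep with score
    "created_hour":  [5, 4, 3, 2, 1],   # falls as score rises
    "karma":         [3, 1, 4, 1, 5],
})

# Pearson correlation of every column with score, strongest positive first.
corr_with_score = (
    toy.corr()["score"]
    .drop("score")                      # a column's correlation with itself is always 1
    .sort_values(ascending=False)
)
print(corr_with_score)
```

On the full stories frame, non-numeric columns would need to be excluded first, for example with `stories.select_dtypes("number")`.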

## What else should we add?

- use fuzzywuzzy to match titles against topics
- add normalization or scaling to comments
- extract domain name
- number of words in the title
- number of capitalized words in the title
- use NLP to extract named entities from the title
- what else?
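A few of these ideas can be sketched with the standard library and pandas string methods. The `toy` frame below is invented, and the `title` and `url` column names are assumptions about the stories data; "capitalized" is taken here to mean "first character is uppercase".

```python
from urllib.parse import urlparse

import pandas as pd

# Toy stories; values are made up.
toy = pd.DataFrame({
    "title": ["Why Rust Is Fast", "a tour of pandas internals"],
    "url": ["https://blog.example.com/rust", "http://example.org/pandas?x=1"],
})

# Domain name: the network location part of the URL (empty string for an empty URL).
toy["domain"] = toy["url"].map(lambda u: urlparse(u).netloc)

# Number of whitespace-separated words in the title.
toy["title_words"] = toy["title"].str.split().str.len()

# Number of words whose first character is uppercase.
toy["title_cap_words"] = toy["title"].map(
    lambda t: sum(word[:1].isupper() for word in t.split())
)
print(toy[["domain", "title_words", "title_cap_words"]])
```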

In [None]:
stories.shape

In [None]:
stories.to_json('../data/lobsters_sample_cleaned.json')