## Case Study: Preparing Lobste.rs Stories for Machine Learning

In this case study, we'll prepare [lobste.rs](http://lobste.rs) stories for machine learning. That means extracting features and cleaning up the messy parts of the data. We'll use pandas along with `sklearn.preprocessing` and `fuzzywuzzy`.

In [1]:
import pandas as pd
import requests
from fuzzywuzzy import fuzz
from collections import Counter
from sklearn import preprocessing

### The cell below reads the latest stories from the API and saves a copy. To load previously saved data instead, comment out the API lines and uncomment the final line

In [5]:
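import io  # standard library; wraps the response bytes in a file-like object
resp = requests.get('https://lobste.rs/hottest.json')
# newer pandas versions expect a path or file-like object rather than raw JSON
stories = pd.read_json(io.BytesIO(resp.content))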
stories = stories.set_index('short_id')
stories.to_json('hottest.json')
# stories = pd.read_json('../data/all_lobsters.json')

In [6]:
stories.head()

short_id,comment_count,comments_url,created_at,description,downvotes,score,short_id_url,submitter_user,tags,title,upvotes,url
fskvfo,3,https://lobste.rs/s/fskvfo/security_footgun_etcd,2018-03-18 13:27:27,,0,13,https://lobste.rs/s/fskvfo,"{'username': 'calvin', 'created_at': '2014-07-...","[devops, security]",The security footgun in etcd,13,https://elweb.co/the-security-footgun-in-etcd/
d42dv7,0,https://lobste.rs/s/d42dv7/types_indeterminates,2018-03-18 16:13:06,,0,3,https://lobste.rs/s/d42dv7,"{'username': 'HenriTuhola', 'created_at': '201...","[ml, plt]",Types and Indeterminates,3,http://boxbase.org/entries/2018/mar/19/types-a...
xss2yb,15,https://lobste.rs/s/xss2yb/life_land_unqualifi...,2018-03-17 14:05:38,,0,13,https://lobste.rs/s/xss2yb,"{'username': 'maxhallinan', 'created_at': '201...","[elm, practices]",Life in the land of unqualified imports,13,https://maxhallinan.com/posts/2018/03/17/life-...
zgu5en,0,https://lobste.rs/s/zgu5en/gaijin_engineer_tokyo,2018-03-18 16:23:28,,0,4,https://lobste.rs/s/zgu5en,"{'username': 'friendlysock', 'created_at': '20...",[culture],Gaijin Engineer in Tokyo,4,https://medium.com/@xevix/gaijin-engineer-in-t...
omhcqr,0,https://lobste.rs/s/omhcqr/aigo_chinese_encryp...,2018-03-18 15:03:25,"<p><a href=""https://syscall.eu/blog/2018/03/12...",0,2,https://lobste.rs/s/omhcqr,"{'username': 'sevan', 'created_at': '2013-06-0...","[hardware, security]",Aigo Chinese encrypted HDD − Part 1: taking it...,2,https://syscall.eu/blog/2018/03/12/aigo_part1/


In [4]:
stories.dtypes

comment_count              int64
comments_url              object
created_at        datetime64[ns]
description               object
downvotes                  int64
score                      int64
short_id_url              object
submitter_user            object
tags                      object
title                     object
upvotes                    int64
url                       object
dtype: object

### Let's take a look at the submitter_user field, as it appears to be a dict

In [7]:
stories.submitter_user.iloc[3]

{'about': '*Literally* full of ants.\r\n\r\nFriendly engineer and human being.\r\n\r\nStrong opinions held weakly, sometimes weekly.\r\n\r\nHit me up at "ch" plus "ris" (at) "k" plus "e" plus "dagital" dot com.\r\n\r\n> Gentrification is the process by which nebulous threats are pacified and alchemised into money. \r\n',
 'avatar_url': 'https://lobste.rs/avatars/friendlysock-100.png',
 'created_at': '2014-02-20T00:43:41.000-06:00',
 'is_admin': False,
 'is_moderator': False,
 'karma': 14747,
 'username': 'friendlysock'}

In [8]:
user_df = stories['submitter_user'].apply(pd.Series)

In [9]:
user_df.head()

short_id,about,avatar_url,created_at,github_username,is_admin,is_moderator,karma,twitter_username,username
fskvfo,Soon we will all have special names... names d...,https://lobste.rs/avatars/calvin-100.png,2014-07-01T06:47:13.000-05:00,NattyNarwhal,False,False,24328.0,,calvin
d42dv7,,https://lobste.rs/avatars/HenriTuhola-100.png,2017-10-11T17:37:54.000-05:00,,False,False,41.0,,HenriTuhola
xss2yb,[https://www.maxhallinan.com/](https://www.max...,https://lobste.rs/avatars/maxhallinan-100.png,2017-07-05T09:16:05.000-05:00,maxhallinan,False,False,164.0,,maxhallinan
zgu5en,*Literally* full of ants.\r\n\r\nFriendly engi...,https://lobste.rs/avatars/friendlysock-100.png,2014-02-20T00:43:41.000-06:00,,False,False,14747.0,,friendlysock
omhcqr,,https://lobste.rs/avatars/sevan-100.png,2013-06-02T17:42:02.000-05:00,,False,False,3402.0,,sevan
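
`apply(pd.Series)` builds one Series per row, which is easy to read but can be slow on larger frames. As a minimal sketch of an alternative, the dict column can be turned into a list and handed to the `DataFrame` constructor directly. The `toy` frame and its values are made up for illustration and are not part of the lobste.rs data.

```python
import pandas as pd

# Toy stand-in for the stories frame; the values are invented for illustration.
toy = pd.DataFrame(
    {
        "title": ["First post", "Second post"],
        "submitter_user": [
            {"username": "alice", "karma": 10},
            {"username": "bob", "karma": 25, "github_username": "bobhub"},
        ],
    },
    index=pd.Index(["aaa111", "bbb222"], name="short_id"),
)

# Build a frame from the list of dicts, reusing the original index so the
# rows still line up with the stories they came from.
toy_users = pd.DataFrame(toy["submitter_user"].tolist(), index=toy.index)

print(toy_users)
```

Keys missing from a given dict (here `github_username` for the first row) come out as missing values, just as they do with `apply(pd.Series)`.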


### Can we combine the user data without potential column overlap?

In [10]:
set(user_df.columns).intersection(stories.columns)

{'created_at'}

In [12]:
user_df = user_df.rename(columns={'created_at': 'user_created_at'})

In [13]:
stories = pd.concat([stories.drop(['submitter_user'], axis=1), 
                     user_df], axis=1)

In [14]:
stories.head()

short_id,comment_count,comments_url,created_at,description,downvotes,score,short_id_url,tags,title,upvotes,url,about,avatar_url,user_created_at,github_username,is_admin,is_moderator,karma,twitter_username,username
fskvfo,3,https://lobste.rs/s/fskvfo/security_footgun_etcd,2018-03-18 13:27:27,,0,13,https://lobste.rs/s/fskvfo,"[devops, security]",The security footgun in etcd,13,https://elweb.co/the-security-footgun-in-etcd/,Soon we will all have special names... names d...,https://lobste.rs/avatars/calvin-100.png,2014-07-01T06:47:13.000-05:00,NattyNarwhal,False,False,24328.0,,calvin
d42dv7,0,https://lobste.rs/s/d42dv7/types_indeterminates,2018-03-18 16:13:06,,0,3,https://lobste.rs/s/d42dv7,"[ml, plt]",Types and Indeterminates,3,http://boxbase.org/entries/2018/mar/19/types-a...,,https://lobste.rs/avatars/HenriTuhola-100.png,2017-10-11T17:37:54.000-05:00,,False,False,41.0,,HenriTuhola
xss2yb,15,https://lobste.rs/s/xss2yb/life_land_unqualifi...,2018-03-17 14:05:38,,0,13,https://lobste.rs/s/xss2yb,"[elm, practices]",Life in the land of unqualified imports,13,https://maxhallinan.com/posts/2018/03/17/life-...,[https://www.maxhallinan.com/](https://www.max...,https://lobste.rs/avatars/maxhallinan-100.png,2017-07-05T09:16:05.000-05:00,maxhallinan,False,False,164.0,,maxhallinan
zgu5en,0,https://lobste.rs/s/zgu5en/gaijin_engineer_tokyo,2018-03-18 16:23:28,,0,4,https://lobste.rs/s/zgu5en,[culture],Gaijin Engineer in Tokyo,4,https://medium.com/@xevix/gaijin-engineer-in-t...,*Literally* full of ants.\r\n\r\nFriendly engi...,https://lobste.rs/avatars/friendlysock-100.png,2014-02-20T00:43:41.000-06:00,,False,False,14747.0,,friendlysock
omhcqr,0,https://lobste.rs/s/omhcqr/aigo_chinese_encryp...,2018-03-18 15:03:25,"<p><a href=""https://syscall.eu/blog/2018/03/12...",0,2,https://lobste.rs/s/omhcqr,"[hardware, security]",Aigo Chinese encrypted HDD − Part 1: taking it...,2,https://syscall.eu/blog/2018/03/12/aigo_part1/,,https://lobste.rs/avatars/sevan-100.png,2013-06-02T17:42:02.000-05:00,,False,False,3402.0,,sevan


### Let's check for nulls

In [15]:
stories.shape

(25, 20)

In [22]:
stories.dropna(axis=1).shape

(25, 17)

In [38]:
stories.dropna(thresh=9, axis=1).shape

(25, 19)

### Exercise: which columns would be dropped?

In [40]:
# %load ../solutions/lobsters_dropped.py
set(stories.columns) - set(stories.dropna(thresh=9, axis=1).columns)
#stories['github_username'].isna().sum()

{'twitter_username'}
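
With `axis=1`, `thresh=9` keeps only the columns that have at least nine non-null values, so a column is dropped when it falls below that count. The following sketch shows the same mechanics on a small invented frame (the column names and the threshold of 3 are assumptions chosen for the example).

```python
import numpy as np
import pandas as pd

# Invented data: three columns with 4, 3 and 1 non-null values respectively.
toy = pd.DataFrame(
    {
        "full": [1, 2, 3, 4],
        "mostly_full": [1.0, np.nan, 3.0, 4.0],
        "mostly_empty": [np.nan, np.nan, np.nan, 4.0],
    }
)

# Count the non-null values per column to see what the threshold is compared with.
non_null_counts = toy.notna().sum()

# Keep columns with at least 3 non-null values.
kept = toy.dropna(thresh=3, axis=1)
dropped = set(toy.columns) - set(kept.columns)

print(non_null_counts)
print(dropped)
```

Checking `notna().sum()` first is a quick way to pick a sensible threshold before dropping anything.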

## Let's make the tags easier to use by having them as features in the columns.

In [62]:
tag_df = stories.tags.apply(pd.Series)
tag_df.head()

short_id,0,1,2
fskvfo,devops,security,
d42dv7,ml,plt,
xss2yb,elm,practices,
zgu5en,culture,,
omhcqr,hardware,security,


In [63]:
tag_df.values

array([['devops', 'security', nan],
       ['ml', 'plt', nan],
       ['elm', 'practices', nan],
       ['culture', nan, nan],
       ['hardware', 'security', nan],
       ['illumos', 'unix', nan],
       ['practices', nan, nan],
       ['erlang', 'video', nan],
       ['javascript', nan, nan],
       ['pdf', 'plt', 'scala'],
       ['hardware', nan, nan],
       ['python', nan, nan],
       ['databases', nan, nan],
       ['unix', nan, nan],
       ['plt', 'practices', nan],
       ['networking', nan, nan],
       ['hardware', 'java', 'pdf'],
       ['math', 'pdf', 'practices'],
       ['culture', nan, nan],
       ['c++', 'compsci', 'pdf'],
       ['person', nan, nan],
       ['practices', nan, nan],
       ['formalmethods', nan, nan],
       ['c++', 'games', 'graphics'],
       ['vcs', nan, nan]], dtype=object)

In [64]:
# what are our unique tags?

pd.unique(tag_df.values.ravel())

array(['devops', 'security', nan, 'ml', 'plt', 'elm', 'practices',
       'culture', 'hardware', 'illumos', 'unix', 'erlang', 'video',
       'javascript', 'pdf', 'scala', 'python', 'databases', 'networking',
       'java', 'math', 'c++', 'compsci', 'person', 'formalmethods',
       'games', 'graphics', 'vcs'], dtype=object)

In [None]:
len(pd.unique(tag_df.values.ravel()))

In [51]:
# most common tags

Counter(tag_df.values.ravel()).most_common(5)

[(nan, 33), ('practices', 5), ('pdf', 4), ('plt', 3), ('hardware', 3)]

### Let's create a dummy df with our tags

In [65]:
tag_df.head()

short_id,0,1,2
fskvfo,devops,security,
d42dv7,ml,plt,
xss2yb,elm,practices,
zgu5en,culture,,
omhcqr,hardware,security,


In [73]:
type(tag_df.stack().index)

pandas.core.indexes.multi.MultiIndex

In [71]:
tag_df.stack().head()

short_id   
fskvfo    0      devops
          1    security
d42dv7    0          ml
          1         plt
xss2yb    0         elm
dtype: object

In [82]:
# sum(level=0) was removed in pandas 2.0; group on the first index level instead
tag_df = pd.get_dummies(
    tag_df.stack()).groupby(level=0).sum()

In [86]:
tag_df.head()

short_id,c++,compsci,culture,databases,devops,elm,erlang,formalmethods,games,graphics,...,pdf,person,plt,practices,python,scala,security,unix,vcs,video
fskvfo,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
d42dv7,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
xss2yb,0,0,0,0,0,1,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
zgu5en,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
omhcqr,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
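
The same one-indicator-column-per-tag result can also be reached without the intermediate wide frame. This is a sketch of a swapped-in technique, not the approach used above: it relies on `Series.explode` to put each tag on its own row and then sums the dummies per story. The `toy_tags` Series and its index labels are invented for the example.

```python
import pandas as pd

# Invented stand-in for stories.tags: one list of tags per story.
toy_tags = pd.Series(
    [["devops", "security"], ["ml"], ["hardware", "security"]],
    index=pd.Index(["a", "b", "c"], name="short_id"),
)

# explode() repeats the index label once per tag, giving one tag per row.
long_tags = toy_tags.explode()

# One indicator column per tag, then collapse back to one row per story.
dummies = pd.get_dummies(long_tags).groupby(level=0).sum().astype(int)

print(dummies)
```

Grouping on index level 0 is what folds the repeated `short_id` rows back into a single row per story.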


### Now we can add it back to our stories DataFrame

In [85]:
# drop the raw tags column now that each tag has its own indicator column
stories = pd.concat([stories.drop('tags', axis=1), tag_df], axis=1)

In [87]:
stories.head()

short_id,comment_count,comments_url,created_at,description,downvotes,score,short_id_url,title,upvotes,url,...,pdf,person,plt,practices,python,scala,security,unix,vcs,video
fskvfo,3,https://lobste.rs/s/fskvfo/security_footgun_etcd,2018-03-18 13:27:27,,0,13,https://lobste.rs/s/fskvfo,The security footgun in etcd,13,https://elweb.co/the-security-footgun-in-etcd/,...,0,0,0,0,0,0,1,0,0,0
d42dv7,0,https://lobste.rs/s/d42dv7/types_indeterminates,2018-03-18 16:13:06,,0,3,https://lobste.rs/s/d42dv7,Types and Indeterminates,3,http://boxbase.org/entries/2018/mar/19/types-a...,...,0,0,1,0,0,0,0,0,0,0
xss2yb,15,https://lobste.rs/s/xss2yb/life_land_unqualifi...,2018-03-17 14:05:38,,0,13,https://lobste.rs/s/xss2yb,Life in the land of unqualified imports,13,https://maxhallinan.com/posts/2018/03/17/life-...,...,0,0,0,1,0,0,0,0,0,0
zgu5en,0,https://lobste.rs/s/zgu5en/gaijin_engineer_tokyo,2018-03-18 16:23:28,,0,4,https://lobste.rs/s/zgu5en,Gaijin Engineer in Tokyo,4,https://medium.com/@xevix/gaijin-engineer-in-t...,...,0,0,0,0,0,0,0,0,0,0
omhcqr,0,https://lobste.rs/s/omhcqr/aigo_chinese_encryp...,2018-03-18 15:03:25,"<p><a href=""https://syscall.eu/blog/2018/03/12...",0,2,https://lobste.rs/s/omhcqr,Aigo Chinese encrypted HDD − Part 1: taking it...,2,https://syscall.eu/blog/2018/03/12/aigo_part1/,...,0,0,0,0,0,0,1,0,0,0


### Another potentially useful feature is the post times...

In [88]:
stories['created_hour'] = stories.created_at.dt.hour

In [89]:
stories['created_dow'] = stories.created_at.dt.weekday  # Monday=0, Sunday=6

In [90]:
stories[['created_hour','created_dow']].head()

short_id,created_hour,created_dow
fskvfo,13,6
d42dv7,16,6
xss2yb,14,5
zgu5en,16,6
omhcqr,15,6
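
The weekday numbers follow Python's convention, where Monday is 0 and Sunday is 6. As a small check, the sketch below takes two `created_at` timestamps from the output above and also derives the day name, which can be easier to read when exploring. The standalone `created_at` Series is an assumption made so that the example runs on its own.

```python
import pandas as pd

# Two timestamps copied from the created_at column shown earlier.
created_at = pd.Series(
    pd.to_datetime(["2018-03-18 13:27:27", "2018-03-17 14:05:38"])
)

# The .dt accessor exposes datetime components for the whole column at once.
hours = created_at.dt.hour.tolist()
dows = created_at.dt.weekday.tolist()
names = created_at.dt.day_name().tolist()

print(hours, dows, names)
```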


### Let's analyze some of the correlations in our features so far...

In [91]:
stories[['created_hour', 'score']].corr()

,created_hour,score
created_hour,1.0,0.130952
score,0.130952,1.0


In [92]:
stories[['created_dow', 'score']].corr()

,created_dow,score
created_dow,1.0,-0.900492
score,-0.900492,1.0


In [93]:
stories[['karma', 'score']].corr()

,karma,score
karma,1.0,0.075633
score,0.075633,1.0


In [94]:
stories[['comment_count', 'score']].corr()

,comment_count,score
comment_count,1.0,0.74946
score,0.74946,1.0


In [95]:
stories[['hardware', 'score']].corr()

,hardware,score
hardware,1.0,-0.319139
score,-0.319139,1.0


## Exercise: can you find a more highly positive correlation?
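
One way to approach this exercise, rather than checking one column at a time, is to correlate every numeric column with `score` in a single call and sort the result. The sketch below does that with `DataFrame.corrwith` on an invented frame; the values are made up and say nothing about the real lobste.rs correlations.

```python
import pandas as pd

# Invented data with a numeric target, two numeric features and a text column.
toy = pd.DataFrame(
    {
        "score": [1, 2, 3, 4, 5],
        "comment_count": [0, 1, 1, 3, 4],
        "created_hour": [23, 5, 12, 8, 1],
        "title": list("abcde"),
    }
)

# Only numeric columns can be correlated, so filter out the rest first.
numeric = toy.select_dtypes(include="number")

# Pearson correlation of each remaining column with score, highest first.
correlations = (
    numeric.drop(columns="score")
    .corrwith(numeric["score"])
    .sort_values(ascending=False)
)

print(correlations)
```

With only 25 stories in the real data, any correlation found this way is best treated as a hint for further exploration rather than a firm result.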

## We might also want to normalize or scale the scores. sklearn.preprocessing offers StandardScaler, MinMaxScaler and Normalizer

In [None]:
normed_score = preprocessing.normalize(stories[['score']])

In [None]:
normed_score[:5]
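
By default, `preprocessing.normalize` rescales each row (each sample) to unit length. With a single column, every row holds just one value, so each positive score is divided by itself. The sketch below shows this on the five scores from the head of the data and contrasts it with `axis=0`, which normalizes along the column instead. The standalone `scores` array is an assumption made so that the example runs on its own.

```python
import numpy as np
from sklearn import preprocessing

# The five scores shown in stories.head(), as a single-column 2D array.
scores = np.array([[13.0], [3.0], [13.0], [4.0], [2.0]])

# Default: axis=1 with the L2 norm, so each row is scaled to length 1.
row_normed = preprocessing.normalize(scores)

# axis=0 scales the whole column to length 1, preserving relative sizes.
col_normed = preprocessing.normalize(scores, axis=0)

print(row_normed.ravel())
print(col_normed.ravel())
```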

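#### Hmm... normalize rescales each row to unit length, so a single column of positive scores becomes all ones. Maybe a min-max scaler works better for our needs!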

In [None]:
scaler = preprocessing.MinMaxScaler()

In [None]:
scaled_score = scaler.fit_transform(stories[['score']])

In [None]:
scaled_score[:5]

In [None]:
stories['scaled_score'] = scaled_score[:,0]
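
With its default settings, `MinMaxScaler` maps each column onto the range 0 to 1 using `(x - min) / (max - min)`. As a rough check under that assumption, the sketch below compares the scaler's output with the formula written out by hand. The `toy` frame reuses the five scores from the head of the data.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

# The five scores shown in stories.head().
toy = pd.DataFrame({"score": [13, 3, 13, 4, 2]})

# Fit and apply the scaler; the result is a 2D array, so take column 0.
scaler = preprocessing.MinMaxScaler()
scaled = scaler.fit_transform(toy[["score"]])[:, 0]

# The same transformation written out by hand.
lowest, highest = toy["score"].min(), toy["score"].max()
manual = (toy["score"] - lowest) / (highest - lowest)

print(scaled)
print(manual.to_numpy())
```

The lowest score maps to 0 and the highest to 1, so a single outlier can compress all the other values toward one end of the range.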

## Exercise: can you add a scaled or normalized karma score?

## What else should we add?

- use fuzzywuzzy to match titles against topics
- add normalization or scaling to comments
- extract domain name
- number of words in the title
- number of capitalized words in the title
- use NLP to extract named entities from the title
- what else?
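
A few of these ideas need nothing beyond pandas and the standard library. The sketch below covers the domain name, the word count and the capitalized-word count. The first row comes from the data above, the second row is invented, and the new column names (`domain`, `title_word_count`, `title_capitalized_count`) are hypothetical choices for illustration.

```python
from urllib.parse import urlparse

import pandas as pd

# First row is taken from the stories above; the second is made up.
toy = pd.DataFrame(
    {
        "title": ["The security footgun in etcd", "Example Title With Capitals"],
        "url": [
            "https://elweb.co/the-security-footgun-in-etcd/",
            "http://example.com/post",
        ],
    }
)

# Domain name: the network location part of the parsed URL.
toy["domain"] = toy["url"].map(lambda u: urlparse(u).netloc)

# Number of whitespace-separated words in the title.
toy["title_word_count"] = toy["title"].str.split().str.len()

# Number of words whose first character is uppercase.
toy["title_capitalized_count"] = toy["title"].map(
    lambda t: sum(word[0].isupper() for word in t.split())
)

print(toy[["domain", "title_word_count", "title_capitalized_count"]])
```

Depending on the model, `domain` would probably need the same dummy-variable treatment as the tags before it can be used as a feature.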