# Project 3 - Web APIs & Classification

## Part 4b - Scattertext Analysis (Extra) - Classify TV Show

> This notebook uses **Scattertext** to inspect, through interactive plots, which terms drive the classification models. The code is adapted from the original, publicly available Scattertext Jupyter notebook.

> For the clearest demonstration, this notebook uses project data from a classmate, Marguerite Siboni, whose two subreddits are more alike in content.

> Make sure the column names in the code match your dataframe; otherwise the code will not run.

**Credits and special thanks:**
1. Ms. Marguerite Siboni, for permission to use her subreddit post data.
2. Mr. Jason Kessler, for his **Scattertext** tool and associated resources, including the public repo (with various demos) and YouTube training videos.

### Table of Contents

- [4.0-Import Libraries](#4.0---Import-Libraries)
- [4.1-Load Data](#4.1---Load-Data)
- [4.2-Preprocess](#4.2---Preprocess)
- [4.3-Calculate F-Score](#4.3---Calculate-F-Score)
- [4.4-Visualization](#4.4---Visualization)

### 4.0 - Import Libraries

In [15]:
%matplotlib inline
import scattertext as st
import re, io
from pprint import pprint
import pandas as pd
import numpy as np
from scipy.stats import rankdata, hmean, norm
import spacy
import os, pkgutil, json, urllib
from urllib.request import urlopen
# display and HTML live in IPython.display; importing them from IPython.core.display is deprecated
from IPython.display import IFrame, display, HTML
from scattertext import CorpusFromPandas, produce_scattertext_explorer
display(HTML("<style>.container { width:98% !important; }</style>"))

In [16]:
# Load spaCy's small English pipeline, used below to tokenize and parse the post titles
nlp = spacy.load('en_core_web_sm')

### 4.1 - Load Data

In [17]:
%store -r df_to_preprocess

In [18]:
df = df_to_preprocess.copy()  # load your dataframe; copy so the stored original is not modified below
df.head()

Unnamed: 0,post_title,post_content,title_and_content,class
0,Should I watch Season 3,I know this probably sounds dumb but I just wa...,Should I watch Season 3 I know this probably s...,0
1,The Unofficial Rewatch Thread - S3E06: “What K...,**From IMDB:** The team looks forward as they ...,The Unofficial Rewatch Thread - S3E06: “What K...,0
2,Just started watching it and have a few questi...,I'm a huge Sorkin fan and recently picked up t...,Just started watching it and have a few questi...,0
3,The Unofficial Rewatch Thread - S3E05: “Oh She...,**From IMDB:** Shocking information regarding ...,The Unofficial Rewatch Thread - S3E05: “Oh She...,0
4,"On the bus, early season 2",Added the spoiler tag just in case....\n\n&amp...,"On the bus, early season 2 Added the spoiler t...",0
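
The cells below read the `post_title` and `class` columns, so it can help to confirm they exist before going further. A minimal sketch, assuming a hypothetical helper named `missing_columns` (not part of pandas or Scattertext) and a toy frame standing in for the stored dataframe:

```python
import pandas as pd

# Columns the later cells rely on; adjust to match your own dataframe.
REQUIRED_COLUMNS = ["post_title", "class"]

def missing_columns(frame, required=REQUIRED_COLUMNS):
    """Return the required column names absent from `frame` (hypothetical helper)."""
    return [col for col in required if col not in frame.columns]

# Toy stand-in for the real dataframe.
toy_df = pd.DataFrame({"post_title": ["Should I watch Season 3"], "class": [0]})

print(missing_columns(toy_df))                          # nothing missing
print(missing_columns(toy_df.drop(columns=["class"])))  # 'class' is missing
```

An empty list means the remaining cells should find the columns they expect.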


### 4.2 - Preprocess

In [19]:
# Re-map class column back from binary dummy variable.
# Change the class in the lambda function to match yours
df['class'] = df['class'].apply(lambda x: "thenewsroom" if x == 0 else "thewestwing")
df.head()

Unnamed: 0,post_title,post_content,title_and_content,class
0,Should I watch Season 3,I know this probably sounds dumb but I just wa...,Should I watch Season 3 I know this probably s...,thenewsroom
1,The Unofficial Rewatch Thread - S3E06: “What K...,**From IMDB:** The team looks forward as they ...,The Unofficial Rewatch Thread - S3E06: “What K...,thenewsroom
2,Just started watching it and have a few questi...,I'm a huge Sorkin fan and recently picked up t...,Just started watching it and have a few questi...,thenewsroom
3,The Unofficial Rewatch Thread - S3E05: “Oh She...,**From IMDB:** Shocking information regarding ...,The Unofficial Rewatch Thread - S3E05: “Oh She...,thenewsroom
4,"On the bus, early season 2",Added the spoiler tag just in case....\n\n&amp...,"On the bus, early season 2 Added the spoiler t...",thenewsroom
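
One possible alternative to the lambda is a dictionary lookup with `Series.map`. Any label missing from the dictionary becomes `NaN` instead of silently falling into the second class, which also makes an accidental second run of the cell easy to notice. A small sketch on toy data (the `label_names` dictionary and the toy frame are illustrative assumptions):

```python
import pandas as pd

# Hypothetical mapping from the binary dummy back to subreddit names.
label_names = {0: "thenewsroom", 1: "thewestwing"}

toy = pd.DataFrame({"class": [0, 1, 1, 2]})  # 2 is a deliberately invalid label
toy["class_name"] = toy["class"].map(label_names)

# Labels that are not in the dictionary become NaN, so they are easy to spot.
unmapped = toy["class_name"].isna().sum()
print(toy)
print("unmapped rows:", unmapped)
```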


In [20]:
# Count the class
print("Document Count")
print(df.groupby('class')['post_title'].count())
print("Word Count")

Document Count
class
thenewsroom    997
thewestwing    997
Name: post_title, dtype: int64
Word Count


In [21]:
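# Word count per class from whitespace-split titles (not displayed: it is not the cell's last expression)
df.groupby('class').apply(lambda x: x['post_title'].apply(lambda x: len(x.split())).sum())
# Tokenize: parse each post title with spaCy
df['parsed'] = df['post_title'].apply(nlp)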
df.head()

Unnamed: 0,post_title,post_content,title_and_content,class,parsed
0,Should I watch Season 3,I know this probably sounds dumb but I just wa...,Should I watch Season 3 I know this probably s...,thenewsroom,"(Should, I, watch, Season, 3)"
1,The Unofficial Rewatch Thread - S3E06: “What K...,**From IMDB:** The team looks forward as they ...,The Unofficial Rewatch Thread - S3E06: “What K...,thenewsroom,"(The, Unofficial, Rewatch, Thread, -, S3E06, :..."
2,Just started watching it and have a few questi...,I'm a huge Sorkin fan and recently picked up t...,Just started watching it and have a few questi...,thenewsroom,"(Just, started, watching, it, and, have, a, fe..."
3,The Unofficial Rewatch Thread - S3E05: “Oh She...,**From IMDB:** Shocking information regarding ...,The Unofficial Rewatch Thread - S3E05: “Oh She...,thenewsroom,"(The, Unofficial, Rewatch, Thread, -, S3E05, :..."
4,"On the bus, early season 2",Added the spoiler tag just in case....\n\n&amp...,"On the bus, early season 2 Added the spoiler t...",thenewsroom,"(On, the, bus, ,, early, season, 2)"
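
The per-class word count in the cell above splits titles on whitespace, which is not the same as counting spaCy tokens: in the `parsed` column the comma in "On the bus, early season 2" is a token of its own. A sketch of the whitespace count on toy titles (the third title is made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "class": ["thenewsroom", "thenewsroom", "thewestwing"],
    "post_title": ["Should I watch Season 3",
                   "On the bus, early season 2",
                   "Favorite Bartlet moment?"],
})

# Count whitespace-separated words per title, then sum within each class.
word_counts = (toy["post_title"].str.split().str.len()
               .groupby(toy["class"]).sum())
print(word_counts)
```

Here "bus," counts as one word, whereas spaCy would split it into two tokens.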


### 4.3 - Calculate F-Score

In [22]:
# Instantiate Scattertext corpus
corpus = st.CorpusFromParsedDocuments(df, category_col='class', parsed_col='parsed').build()

In [23]:
# Calculate precision, recall (each term's share of all thenewsroom term counts), and the raw F-score for each term
term_freq_df = corpus.get_term_freq_df()
term_freq_df['thenewsroom_precision'] = term_freq_df['thenewsroom freq'] / (term_freq_df['thenewsroom freq'] + term_freq_df['thewestwing freq'])
term_freq_df['thenewsroom_freq_pct'] = term_freq_df['thenewsroom freq'] / term_freq_df['thenewsroom freq'].sum()
# Raw F-score: harmonic mean of precision and recall; 0 when either of them is 0
term_freq_df['thenewsroom_hmean'] = term_freq_df.apply(
    lambda x: (hmean([x['thenewsroom_precision'], x['thenewsroom_freq_pct']])
               if x['thenewsroom_precision'] > 0 and x['thenewsroom_freq_pct'] > 0
               else 0), axis=1)
term_freq_df.sort_values(by='thenewsroom_hmean', ascending=False).head(10)

term,thenewsroom freq,thewestwing freq,thenewsroom_precision,thenewsroom_freq_pct,thenewsroom_hmean
the,677,532,0.559967,0.034624,0.065215
newsroom,189,3,0.984375,0.009666,0.019144
i,189,199,0.487113,0.009666,0.018956
of,179,161,0.526471,0.009155,0.017996
to,176,183,0.490251,0.009001,0.017678
a,164,190,0.463277,0.008387,0.016477
's,154,90,0.631148,0.007876,0.015558
in,151,175,0.46319,0.007723,0.015192
the newsroom,146,3,0.979866,0.007467,0.014821
season,135,38,0.780347,0.006904,0.013688
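
The ranking above shows the weakness of the raw harmonic mean: the frequency share is tiny for every term, so it dominates the result, and a function word such as `the` outranks a distinctive term such as `newsroom`. The arithmetic for those two rows can be reproduced with the standard library (values copied from the table; `statistics.harmonic_mean` stands in for `scipy.stats.hmean` here):

```python
from statistics import harmonic_mean

# (precision, frequency share) copied from the table above.
rows = {
    "the": (0.559967, 0.034624),
    "newsroom": (0.984375, 0.009666),
}

# Raw F-score: harmonic mean of precision and frequency share.
raw_f = {term: harmonic_mean(pair) for term, pair in rows.items()}
for term, score in raw_f.items():
    print(f"{term}: {score:.6f}")
```

Even with a precision of 0.98, `newsroom` scores lower than `the`, which motivates the rescaling in the next cell.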


In [25]:
# Map precision and recall through a fitted normal CDF so both lie between 0 and 1, then take their harmonic mean as the scaled F-score (better for plotting).
def normcdf(x):
    return norm.cdf(x, x.mean(), x.std())
term_freq_df['thenewsroom_precision_normcdf'] = normcdf(term_freq_df['thenewsroom_precision'])
term_freq_df['thenewsroom_freq_pct_normcdf'] = normcdf(term_freq_df['thenewsroom_freq_pct'])
term_freq_df['thenewsroom_scaled_f_score'] = hmean([term_freq_df['thenewsroom_precision_normcdf'], term_freq_df['thenewsroom_freq_pct_normcdf']])
term_freq_df.sort_values(by='thenewsroom_scaled_f_score', ascending=False).head(10)

term,thenewsroom freq,thewestwing freq,thenewsroom_precision,thenewsroom_freq_pct,thenewsroom_hmean,thenewsroom_precision_normcdf,thenewsroom_freq_pct_normcdf,thenewsroom_scaled_f_score
rewatch thread,25,0,1.0,0.001279,0.002554,0.861632,0.999067,0.925274
unofficial rewatch,25,0,1.0,0.001279,0.002554,0.861632,0.999067,0.925274
the unofficial,25,0,1.0,0.001279,0.002554,0.861632,0.999067,0.925274
genoa,24,0,1.0,0.001227,0.002452,0.861632,0.998559,0.925056
newsroom season,24,0,1.0,0.001227,0.002452,0.861632,0.998559,0.925056
mcavoy,23,0,1.0,0.001176,0.00235,0.861632,0.997809,0.924734
sloan,22,0,1.0,0.001125,0.002248,0.861632,0.99672,0.924266
don,22,0,1.0,0.001125,0.002248,0.861632,0.99672,0.924266
will mcavoy,21,0,1.0,0.001074,0.002146,0.861632,0.995168,0.923598
maggie,21,0,1.0,0.001074,0.002146,0.861632,0.995168,0.923598
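
The scaling step replaces each raw value with its position under a normal curve fitted to the whole column, so precision and frequency share end up on a comparable 0 to 1 scale before the harmonic mean is taken. A rough sketch of the same idea using only the standard library (`statistics.NormalDist` in place of `scipy.stats.norm`; the three terms and their values are made up):

```python
from statistics import NormalDist, harmonic_mean, mean, stdev

def normcdf(values):
    """Map each value to the CDF of a normal fitted to the whole list."""
    # stdev is the sample standard deviation, like pandas' Series.std()
    dist = NormalDist(mean(values), stdev(values))
    return [dist.cdf(v) for v in values]

# Toy (precision, frequency share) columns for three hypothetical terms.
terms = ["term_a", "term_b", "term_c"]
precision = [0.55, 1.00, 0.50]
freq_pct = [0.0350, 0.0012, 0.0001]

precision_cdf = normcdf(precision)
freq_cdf = normcdf(freq_pct)
scaled_f = [harmonic_mean(pair) for pair in zip(precision_cdf, freq_cdf)]
for term, score in zip(terms, scaled_f):
    print(f"{term}: {score:.3f}")
```

With only three terms the fitted normal is not meaningful; the point is the mechanics. On a real vocabulary most terms are rare, so even a modest frequency share lands near the top of the CDF, as the `thenewsroom_freq_pct_normcdf` column shows.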


In [26]:
# Calculate corner score
term_freq_df['thenewsroom_corner_score'] = corpus.get_corner_scores('thenewsroom')
term_freq_df.sort_values(by='thenewsroom_corner_score', ascending=False).head(10)

term,thenewsroom freq,thewestwing freq,thenewsroom_precision,thenewsroom_freq_pct,thenewsroom_hmean,thenewsroom_precision_normcdf,thenewsroom_freq_pct_normcdf,thenewsroom_scaled_f_score,thenewsroom_corner_score
rewatch thread,25,0,1.0,0.001279,0.002554,0.861632,0.999067,0.925274,0.924595
unofficial rewatch,25,0,1.0,0.001279,0.002554,0.861632,0.999067,0.925274,0.924595
the unofficial,25,0,1.0,0.001279,0.002554,0.861632,0.999067,0.925274,0.924595
genoa,24,0,1.0,0.001227,0.002452,0.861632,0.998559,0.925056,0.924593
newsroom season,24,0,1.0,0.001227,0.002452,0.861632,0.998559,0.925056,0.924593
mcavoy,23,0,1.0,0.001176,0.00235,0.861632,0.997809,0.924734,0.92459
sloan,22,0,1.0,0.001125,0.002248,0.861632,0.99672,0.924266,0.924586
don,22,0,1.0,0.001125,0.002248,0.861632,0.99672,0.924266,0.924586
will mcavoy,21,0,1.0,0.001074,0.002146,0.861632,0.995168,0.923598,0.924582
maggie,21,0,1.0,0.001074,0.002146,0.861632,0.995168,0.923598,0.924582


In [27]:
# Print the top 10 useful terms for each class based on scaled F score
term_freq_df = corpus.get_term_freq_df()
term_freq_df['thewestwing Score'] = corpus.get_scaled_f_scores('thewestwing')
term_freq_df['thenewsroom Score'] = corpus.get_scaled_f_scores('thenewsroom')
print("Top 10 thenewsroom terms")
pprint(list(term_freq_df.sort_values(by='thenewsroom Score', ascending=False).index[:10]))
print("Top 10 thewestwing terms")
pprint(list(term_freq_df.sort_values(by='thewestwing Score', ascending=False).index[:10]))

Top 10 thenewsroom terms
['unofficial rewatch',
 'rewatch thread',
 'the unofficial',
 'newsroom season',
 'genoa',
 'mcavoy',
 'newsroom',
 'the newsroom',
 'sloan',
 'don']
Top 10 thewestwing terms
['bartlet',
 'leo',
 'tww',
 'cj',
 'president',
 'josh',
 'toby',
 'president bartlet',
 'reboot',
 'bartlett']


### 4.4 - Visualization

#### 4.4.1 Features based on F-Score

In [32]:
# Scale Frequency
def scale(ar): 
    return (ar - ar.min()) / (ar.max() - ar.min())

def zero_centered_scale(ar):
    scores = np.zeros(len(ar))
    scores[ar > 0] = scale(ar[ar > 0])
    scores[ar < 0] = -scale(-ar[ar < 0])
    return (scores + 1) / 2.

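# Sum only the raw frequency columns; term_freq_df also holds the score columns added in 4.3
frequencies_scaled = scale(np.log(corpus.get_term_freq_df().sum(axis=1).values))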
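
`zero_centered_scale` keeps the sign of a score visible after scaling: negative values land in the lower half of the 0 to 1 range, positive values in the upper half, and zero sits at 0.5. A quick check on made-up coefficients, with the two functions repeated so the snippet runs on its own:

```python
import numpy as np

def scale(ar):
    # Min-max scale to the range 0..1
    return (ar - ar.min()) / (ar.max() - ar.min())

def zero_centered_scale(ar):
    scores = np.zeros(len(ar))
    scores[ar > 0] = scale(ar[ar > 0])    # positives -> 0..1
    scores[ar < 0] = -scale(-ar[ar < 0])  # negatives -> -1..0
    return (scores + 1) / 2.              # shift to 0..1 with zero at 0.5

coefs = np.array([-4.0, -1.0, 0.0, 2.0, 6.0])  # made-up coefficients
scaled = zero_centered_scale(coefs)
print(scaled)  # [0.  0.5 0.5 0.5 1. ]
```

Note that the smallest positive and the smallest-magnitude negative value both map to 0.5, the same position as zero, because each side is min-max scaled separately. If one side contains a single value, the min-max step divides zero by zero and yields `nan`.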

In [46]:
html = produce_scattertext_explorer(corpus,
                                    category='thenewsroom',
                                    category_name='The News Room',
                                    not_category_name='The West Wing',
                                    minimum_term_frequency=5,
                                    width_in_pixels=1000,
                                    x_coords=frequencies_scaled,
                                    y_coords=corpus.get_scaled_f_scores('thenewsroom', beta=0.5),
                                    scores=corpus.get_scaled_f_scores('thenewsroom', beta=0.5),  # same category as y_coords so colors match positions
                                    sort_by_dist=False,
                                    metadata=df['class'],
                                    x_label='Log Frequency',
                                    y_label='Scaled F-Score')
file_name = './figures/tv_SFSvsLog.html'
os.makedirs(os.path.dirname(file_name), exist_ok=True)  # make sure ./figures exists
with open(file_name, 'wb') as f:
    f.write(html.encode('utf-8'))
IFrame(src=file_name, width = 1200, height=700)
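
The cell above passes `beta=0.5` when requesting the scaled F-scores. In the textbook F-beta definition, a beta below 1 weights precision more heavily than recall. The sketch below uses that textbook formula on made-up values; whether Scattertext applies exactly this form internally is an assumption, and `f_beta` is an illustrative helper, not a Scattertext function.

```python
def f_beta(precision, recall, beta=1.0):
    """Textbook F-beta score: weighted harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# A term with high precision but low recall (made-up values).
p, r = 0.9, 0.3
balanced = f_beta(p, r, beta=1.0)
precision_leaning = f_beta(p, r, beta=0.5)
print(balanced, precision_leaning)
```

For this term the score rises when beta drops from 1 to 0.5, because its strong side (precision) now carries more weight.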

#### 4.4.2 Features Frequency Plot by F-Score in Alphabetical Order

In [44]:
html = produce_scattertext_explorer(corpus,
                                    category='thenewsroom',
                                    category_name='The News Room',
                                    not_category_name='The West Wing',
                                    width_in_pixels=1000,
                                    minimum_term_frequency=5,
                                    metadata=df['class'],
                                    term_significance = st.LogOddsRatioUninformativeDirichletPrior())
file_name = './figures/tv_ScattertextRankDefault.html'
os.makedirs(os.path.dirname(file_name), exist_ok=True)  # make sure ./figures exists
with open(file_name, 'wb') as f:
    f.write(html.encode('utf-8'))
IFrame(src=file_name, width = 1200, height=700)
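
The `term_significance` argument above selects a log-odds ratio with an uninformative Dirichlet prior. A rough sketch of that statistic as formulated by Monroe, Colaresi and Quinn (2008), using the same small pseudo-count for every term: the function name `log_odds_z`, the class totals, the vocabulary size and the `alpha` value are illustrative assumptions, and Scattertext's implementation may differ in its details.

```python
import math

def log_odds_z(count_a, count_b, total_a, total_b, vocab_size, alpha=0.01):
    """z-score of the log-odds ratio with a constant pseudo-count `alpha` per term."""
    alpha_0 = alpha * vocab_size  # total prior mass over the vocabulary
    odds_a = (count_a + alpha) / (total_a + alpha_0 - count_a - alpha)
    odds_b = (count_b + alpha) / (total_b + alpha_0 - count_b - alpha)
    delta = math.log(odds_a) - math.log(odds_b)
    variance = 1.0 / (count_a + alpha) + 1.0 / (count_b + alpha)
    return delta / math.sqrt(variance)

# Term counts for 'newsroom' and 'i' from the frequency table in 4.3;
# the class totals and vocabulary size are illustrative, not measured.
z_newsroom = log_odds_z(189, 3, 19553, 20000, vocab_size=5000)
z_i = log_odds_z(189, 199, 19553, 20000, vocab_size=5000)
print(round(z_newsroom, 2), round(z_i, 2))
```

A large positive z-score marks a term as characteristic of the first class, while a value near zero, as for `i`, marks a term used about equally in both.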

#### 4.4.3 L2-Penalized (Ridge) Logistic Regression Coefficients vs. Log Frequency

In [45]:
from sklearn.linear_model import LogisticRegression
scores = corpus.get_logreg_coefs('thenewsroom',
                                 LogisticRegression(penalty='l2', C=10, max_iter=10000, n_jobs=-1))
scores_scaled = zero_centered_scale(scores)

html = produce_scattertext_explorer(corpus,
                                    category='thenewsroom',
                                    category_name='The News Room',
                                    not_category_name='The West Wing',
                                    minimum_term_frequency=5,
                                    width_in_pixels=1000,
                                    x_coords=frequencies_scaled,
                                    y_coords=scores_scaled,
                                    scores=scores,
                                    sort_by_dist=False,
                                    metadata=df['class'],
                                    x_label='Log frequency',
                                    y_label='L2-Penalized Log Reg Coef')
file_name = './figures/tv_L2vsLog.html'
os.makedirs(os.path.dirname(file_name), exist_ok=True)  # make sure ./figures exists
with open(file_name, 'wb') as f:
    f.write(html.encode('utf-8'))
IFrame(src=file_name, width = 1200, height=700)

In [None]:
# html = st.produce_scattertext_explorer(corpus,
#                                        category='thenewsroom',
#                                        category_name='The News Room',
#                                        not_category_name='The West Wing',
#                                        minimum_term_frequency=5,
#                                        width_in_pixels=1000,
#                                        transform=st.Scalers.log_scale_standardize)
# file_name = './figures/tv_ScattertextLog.html'
# open(file_name, 'wb').write(html.encode('utf-8'))
# IFrame(src=file_name, width = 1200, height=700)