# Project 3 - Web APIs & Classification

## Part 4a - Scattertext Analysis

> This part uses **Scattertext** to examine, through interactive plots, which terms separate the two classes behind the models. The code is adapted from the publicly available Scattertext Jupyter notebook.

> Make sure the column and class names used in the code match those in your dataframe; otherwise the code will not run.

**Credits and special thanks:**
1. Mr. Jason Kessler for his **scattertext** tool and associated resources, including the public repo (with various demos) and YouTube training videos.

### Table of Contents

- [4.0-Import Libraries](#4.0---Import-Libraries)
- [4.1-Load Data](#4.1---Load-Data)
- [4.2-Preprocess](#4.2---Preprocess)
- [4.3-Calculate F-Score](#4.3---Calculate-F-Score)
- [4.4-Visualization](#4.4---Visualization)

### 4.0 - Import Libraries

In [46]:
%matplotlib inline
import scattertext as st
import re, io
from pprint import pprint
import pandas as pd
import numpy as np
from scipy.stats import rankdata, hmean, norm
import spacy
import os, pkgutil, json, urllib
from urllib.request import urlopen
from IPython.display import IFrame, display, HTML
from scattertext import CorpusFromPandas, produce_scattertext_explorer
display(HTML("<style>.container { width:98% !important; }</style>"))

In [47]:
# Load spaCy's small English pipeline; it is used below to tokenize the post titles
nlp = spacy.load('en_core_web_sm')

### 4.1 - Load Data

In [48]:
%store -r df_to_preprocess

In [49]:
df = df_to_preprocess.copy()  # load your dataframe; copy so the stored original is not modified
df.head()

Unnamed: 0,post_title,post_content,title_and_content,class
0,Halfway through chemo today!!!,So today marks my last dose of AC chemo before...,Halfway through chemo today!!! So today marks ...,0
1,Textured Implants have been recalled in France...,,Textured Implants have been recalled in France...,0
2,Found this in my bra today. This is my sleep b...,,Found this in my bra today. This is my sleep b...,0
3,Just found out my mom has cancer. Any advice?,I’m not very familiar with breast cancer. She ...,Just found out my mom has cancer. Any advice? ...,0
4,Called to go in for ultrasound,Hi all! Kind of long story so bear with me. Ab...,Called to go in for ultrasound Hi all! Kind of...,0


### 4.2 - Preprocess

In [50]:
# Re-map class column back from binary dummy variable.
# Change the class in the lambda function to match yours
df['class'] = df['class'].apply(lambda x: "breastcancer" if x == 0 else "airquality")
df.head()

Unnamed: 0,post_title,post_content,title_and_content,class
0,Halfway through chemo today!!!,So today marks my last dose of AC chemo before...,Halfway through chemo today!!! So today marks ...,breastcancer
1,Textured Implants have been recalled in France...,,Textured Implants have been recalled in France...,breastcancer
2,Found this in my bra today. This is my sleep b...,,Found this in my bra today. This is my sleep b...,breastcancer
3,Just found out my mom has cancer. Any advice?,I’m not very familiar with breast cancer. She ...,Just found out my mom has cancer. Any advice? ...,breastcancer
4,Called to go in for ultrasound,Hi all! Kind of long story so bear with me. Ab...,Called to go in for ultrasound Hi all! Kind of...,breastcancer


In [51]:
# Count documents per class (the per-class word count is computed in the next cell)
print("Document Count")
print(df.groupby('class')['post_title'].count())
print("Word Count")

Document Count
class
airquality      904
breastcancer    976
Name: post_title, dtype: int64
Word Count


In [52]:
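# Word count per class (titles split on whitespace). Stored in a variable because
# a notebook cell only displays its last expression.
word_counts = df.groupby('class')['post_title'].apply(lambda titles: titles.str.split().str.len().sum())
# Tokenize each title with spaCy
df['parsed'] = df['post_title'].apply(nlp)
df.head()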

Unnamed: 0,post_title,post_content,title_and_content,class,parsed
0,Halfway through chemo today!!!,So today marks my last dose of AC chemo before...,Halfway through chemo today!!! So today marks ...,breastcancer,"(Halfway, through, chemo, today, !, !, !)"
1,Textured Implants have been recalled in France...,,Textured Implants have been recalled in France...,breastcancer,"(Textured, Implants, have, been, recalled, in,..."
2,Found this in my bra today. This is my sleep b...,,Found this in my bra today. This is my sleep b...,breastcancer,"(Found, this, in, my, bra, today, ., This, is,..."
3,Just found out my mom has cancer. Any advice?,I’m not very familiar with breast cancer. She ...,Just found out my mom has cancer. Any advice? ...,breastcancer,"(Just, found, out, my, mom, has, cancer, ., An..."
4,Called to go in for ultrasound,Hi all! Kind of long story so bear with me. Ab...,Called to go in for ultrasound Hi all! Kind of...,breastcancer,"(Called, to, go, in, for, ultrasound)"


### 4.3 - Calculate F-Score

In [55]:
# Instantiate Scattertext corpus
corpus = st.CorpusFromParsedDocuments(df, category_col='class', parsed_col='parsed').build()

In [56]:
# For each term, calculate precision (share of the term's occurrences that fall in breastcancer),
# its frequency share within breastcancer (the recall analogue), and their harmonic mean (raw F-score)
term_freq_df = corpus.get_term_freq_df()
bc_freq = term_freq_df['breastcancer freq']
aq_freq = term_freq_df['airquality freq']
term_freq_df['bc_precision'] = bc_freq / (bc_freq + aq_freq)
term_freq_df['bc_freq_pct'] = bc_freq / bc_freq.sum()
term_freq_df['bc_hmean'] = term_freq_df.apply(
    lambda row: hmean([row['bc_precision'], row['bc_freq_pct']])
    if row['bc_precision'] > 0 and row['bc_freq_pct'] > 0 else 0,
    axis=1)
term_freq_df.sort_values(by='bc_hmean', ascending=False).head(10)

term,breastcancer freq,airquality freq,bc_precision,bc_freq_pct,bc_hmean
i,276,90,0.754098,0.015613,0.030592
breast,259,0,1.0,0.014651,0.028879
cancer,250,14,0.94697,0.014142,0.027868
and,195,190,0.506494,0.011031,0.021591
a,186,204,0.476923,0.010522,0.020589
to,181,238,0.431981,0.010239,0.020003
breast cancer,178,0,1.0,0.010069,0.019937
my,167,49,0.773148,0.009447,0.018665
the,166,406,0.29021,0.00939,0.018192
for,139,142,0.494662,0.007863,0.01548
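
As a quick check on the table above, the raw F-score for `cancer` can be reproduced by hand. The sketch below uses only the counts and the rounded `bc_freq_pct` value printed in the table; the helper name `harmonic_mean` is introduced here for illustration and is not part of the notebook.

```python
def harmonic_mean(a, b):
    """Harmonic mean of two non-negative numbers; 0 if either is 0."""
    return 2 * a * b / (a + b) if a > 0 and b > 0 else 0.0

# Values for the term "cancer", copied from the table above
bc_count, aq_count = 250, 14
bc_freq_pct = 0.014142  # rounded value as printed in the table

# Precision: share of all "cancer" occurrences that come from breastcancer posts
bc_precision = bc_count / (bc_count + aq_count)
bc_hmean = harmonic_mean(bc_precision, bc_freq_pct)

print(round(bc_precision, 5))
print(round(bc_hmean, 6))
```

Because the frequency share is tiny compared with precision, the harmonic mean is dominated by it, which is why common words such as `i`, `and`, and `the` rank highly on the raw score.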


In [59]:
# Map precision and frequency share through a normal CDF (fitted to each column's mean and std)
# so both lie between 0 and 1, then take their harmonic mean as the scaled F-score.
def normcdf(x):
    return norm.cdf(x, x.mean(), x.std())
term_freq_df['bc_precision_normcdf'] = normcdf(term_freq_df['bc_precision'])
term_freq_df['bc_freq_pct_normcdf'] = normcdf(term_freq_df['bc_freq_pct'])
term_freq_df['bc_scaled_f_score'] = hmean([term_freq_df['bc_precision_normcdf'], term_freq_df['bc_freq_pct_normcdf']])
term_freq_df.sort_values(by='bc_scaled_f_score', ascending=False).head(10)

term,breastcancer freq,airquality freq,bc_precision,bc_freq_pct,bc_hmean,bc_precision_normcdf,bc_freq_pct_normcdf,bc_scaled_f_score
breast cancer,178,0,1.0,0.010069,0.019937,0.730268,1.0,0.84411
breast,259,0,1.0,0.014651,0.028879,0.730268,1.0,0.84411
mom,58,0,1.0,0.003281,0.00654,0.730268,1.0,0.84411
diagnosed,58,0,1.0,0.003281,0.00654,0.730268,1.0,0.84411
chemo,57,0,1.0,0.003224,0.006428,0.730268,1.0,0.84411
lump,52,0,1.0,0.002942,0.005866,0.730268,1.0,0.84411
mastectomy,45,0,1.0,0.002546,0.005078,0.730268,1.0,0.84411
just,42,0,1.0,0.002376,0.00474,0.730268,0.999998,0.844109
’m,41,0,1.0,0.002319,0.004628,0.730268,0.999997,0.844109
treatment,41,0,1.0,0.002319,0.004628,0.730268,0.999997,0.844109
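
The ties in the table above follow directly from the scaling: every listed term has the same precision CDF (0.730268) and a frequency CDF that has saturated at roughly 1.0, so the harmonic mean is nearly identical. The sketch below illustrates the same normal-CDF scaling with the standard library instead of SciPy. The toy columns are hypothetical, and `statistics.stdev` is used because it is the sample standard deviation, which matches the pandas default for `Series.std()`.

```python
from statistics import NormalDist, mean, stdev

def normcdf(values):
    """Map each value to its percentile under a normal fit to the sample."""
    fitted = NormalDist(mean(values), stdev(values))
    return [fitted.cdf(v) for v in values]

def harmonic_mean(a, b):
    """Harmonic mean of two non-negative numbers; 0 if either is 0."""
    return 2 * a * b / (a + b) if a > 0 and b > 0 else 0.0

# Hypothetical toy columns, not taken from the corpus
precision = [1.0, 0.95, 0.5, 0.3, 0.05]
freq_pct = [0.015, 0.014, 0.011, 0.009, 0.001]

scaled_f = [harmonic_mean(p, f)
            for p, f in zip(normcdf(precision), normcdf(freq_pct))]
print([round(s, 3) for s in scaled_f])

# The tie in the table: precision CDF 0.730268 with a saturated frequency CDF of 1.0
tied_score = harmonic_mean(0.730268, 1.0)
print(round(tied_score, 5))
```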


In [60]:
# Calculate corner score
term_freq_df['bc_corner_score'] = corpus.get_corner_scores('breastcancer')
term_freq_df.sort_values(by='bc_corner_score', ascending=False).head(10)

term,breastcancer freq,airquality freq,bc_precision,bc_freq_pct,bc_hmean,bc_precision_normcdf,bc_freq_pct_normcdf,bc_scaled_f_score,bc_corner_score
breast,259,0,1.0,0.014651,0.028879,0.730268,1.0,0.84411,0.872781
breast cancer,178,0,1.0,0.010069,0.019937,0.730268,1.0,0.84411,0.872781
mom,58,0,1.0,0.003281,0.00654,0.730268,1.0,0.84411,0.872779
diagnosed,58,0,1.0,0.003281,0.00654,0.730268,1.0,0.84411,0.872779
chemo,57,0,1.0,0.003224,0.006428,0.730268,1.0,0.84411,0.872778
lump,52,0,1.0,0.002942,0.005866,0.730268,1.0,0.84411,0.872778
mastectomy,45,0,1.0,0.002546,0.005078,0.730268,1.0,0.84411,0.872777
just,42,0,1.0,0.002376,0.00474,0.730268,0.999998,0.844109,0.872776
biopsy,41,0,1.0,0.002319,0.004628,0.730268,0.999997,0.844109,0.872775
’m,41,0,1.0,0.002319,0.004628,0.730268,0.999997,0.844109,0.872775


In [61]:
# Print the top 10 useful terms for each class based on scaled F score
term_freq_df = corpus.get_term_freq_df()
term_freq_df['airquality Score'] = corpus.get_scaled_f_scores('airquality')
term_freq_df['breastcancer Score'] = corpus.get_scaled_f_scores('breastcancer')
print("Top 10 bc terms")
pprint(list(term_freq_df.sort_values(by='breastcancer Score', ascending=False).index[:10]))
print("Top 10 aq terms")
pprint(list(term_freq_df.sort_values(by='airquality Score', ascending=False).index[:10]))

Top 10 bc terms
['breast',
 'breast cancer',
 'diagnosed',
 'mom',
 'chemo',
 'lump',
 'mastectomy',
 'just',
 'i ’m',
 'treatment']
Top 10 aq terms
['pollution',
 'air pollution',
 'air quality',
 'quality',
 'air',
 'china',
 'coal',
 'co2',
 'toxic',
 'environmental']


### 4.4 - Visualization

#### 4.4.1 Features based on F-Score

In [63]:
# Scale Frequency
def scale(ar): 
    return (ar - ar.min()) / (ar.max() - ar.min())

def zero_centered_scale(ar):
    scores = np.zeros(len(ar))
    scores[ar > 0] = scale(ar[ar > 0])
    scores[ar < 0] = -scale(-ar[ar < 0])
    return (scores + 1) / 2.

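# Sum only the two raw frequency columns; term_freq_df also holds the score columns added earlier
frequencies_scaled = scale(np.log(term_freq_df[['breastcancer freq', 'airquality freq']].sum(axis=1).values))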
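
To see what `zero_centered_scale` does to a vector of signed scores, here is a plain-Python restatement of the two helpers above, applied to a small hypothetical input. It uses lists instead of NumPy arrays so it runs without dependencies; it is a sketch of the same arithmetic, not a replacement for the notebook's functions.

```python
def scale(values):
    """Min-max scale a list to [0, 1]; needs at least two distinct values."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def zero_centered_scale(values):
    """Scale positives to (0.5, 1] and negatives to [0, 0.5), with 0 at 0.5."""
    pos = [v for v in values if v > 0]
    neg = [-v for v in values if v < 0]
    pos_map = dict(zip(pos, scale(pos))) if pos else {}
    neg_map = dict(zip(neg, scale(neg))) if neg else {}
    scaled = []
    for v in values:
        if v > 0:
            s = pos_map[v]
        elif v < 0:
            s = -neg_map[-v]
        else:
            s = 0.0
        scaled.append((s + 1) / 2)
    return scaled

# Hypothetical coefficients: two negative, one zero, two positive
coefs = [-2.0, -1.0, 0.0, 1.0, 3.0]
scaled = zero_centered_scale(coefs)
print(scaled)
```

Note that the smallest positive and the smallest-magnitude negative value both land on 0.5, the same position as zero, because each side is min-max scaled separately. Each side also needs at least two distinct values; otherwise the range is zero and the division fails here (the NumPy version yields `nan` instead).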

In [65]:
html = produce_scattertext_explorer(corpus,
                                    category='breastcancer',
                                    category_name='Breast Cancer',
                                    not_category_name='Air Quality',
                                    minimum_term_frequency=5,
                                    width_in_pixels=1000,
                                    x_coords=frequencies_scaled,
                                    y_coords=corpus.get_scaled_f_scores('breastcancer', beta=0.5),
                                    scores=corpus.get_scaled_f_scores('breastcancer', beta=0.5),
                                    sort_by_dist=False,
                                    metadata=df['class'],
                                    x_label='Log Frequency',
                                    y_label='Scaled F-Score')
file_name = './figures/bc_vs_aq_SFSvsLog.html'
# Write with a context manager so the file is closed before it is displayed
with open(file_name, 'wb') as f:
    f.write(html.encode('utf-8'))
IFrame(src=file_name, width=1200, height=700)
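
The plot above passes `beta=0.5` to `get_scaled_f_scores`. In the textbook F-beta formula, a beta below 1 weights precision more heavily than recall. The sketch below shows that effect on hypothetical inputs; that Scattertext applies exactly this weighting to its scaled precision and frequency values is an assumption, and `f_beta` is a helper name introduced here.

```python
def f_beta(precision, recall, beta=1.0):
    """Textbook F-beta score: beta < 1 favors precision, beta > 1 favors recall."""
    if precision <= 0 or recall <= 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical term: highly precise for the category, but only moderately frequent
precision, recall = 0.9, 0.5

for beta in (0.5, 1.0, 2.0):
    print(beta, round(f_beta(precision, recall, beta), 4))
```

With precision higher than recall, the score rises as beta shrinks, so `beta=0.5` pushes category-specific terms further up the y-axis than the balanced F1 would.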

#### 4.4.2 Features Frequency Plot by F-Score in Alphabetical Order

In [67]:
html = produce_scattertext_explorer(corpus,
                                    category='breastcancer',
                                    category_name='Breast Cancer',
                                    not_category_name='Air Quality',
                                    width_in_pixels=1000,
                                    minimum_term_frequency=5,
                                    metadata=df['class'],
                                    term_significance = st.LogOddsRatioUninformativeDirichletPrior())
file_name = './figures/bc_vs_aq_ScattertextRankDefault.html'
# Write with a context manager so the file is closed before it is displayed
with open(file_name, 'wb') as f:
    f.write(html.encode('utf-8'))
IFrame(src=file_name, width=1200, height=700)
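
The `term_significance` argument above selects a log-odds ratio with an uninformative Dirichlet prior. As a rough illustration of the idea, the sketch below computes the z-score form of that statistic as described by Monroe, Colaresi, and Quinn in "Fightin' Words". The totals, vocabulary size, and prior strength `alpha` are hypothetical, and the function is not Scattertext's implementation, whose defaults and output form may differ.

```python
import math

def log_odds_z(count_a, count_b, total_a, total_b, vocab_size, alpha=0.01):
    """z-score of the log-odds ratio with a flat Dirichlet prior on every term."""
    alpha0 = alpha * vocab_size  # total prior mass across the vocabulary
    odds_a = (count_a + alpha) / (total_a + alpha0 - count_a - alpha)
    odds_b = (count_b + alpha) / (total_b + alpha0 - count_b - alpha)
    delta = math.log(odds_a) - math.log(odds_b)
    variance = 1.0 / (count_a + alpha) + 1.0 / (count_b + alpha)
    return delta / math.sqrt(variance)

# Counts for "cancer" from the frequency table; totals and vocabulary size are hypothetical
z_cancer = log_odds_z(250, 14, total_a=17000, total_b=17000, vocab_size=5000)
# A term used equally often in both categories scores zero
z_neutral = log_odds_z(100, 100, total_a=17000, total_b=17000, vocab_size=5000)

print(round(z_cancer, 2), round(z_neutral, 2))
```

A large positive z-score marks a term as characteristic of the first category, a large negative one as characteristic of the second. The variance term shrinks the score for rare terms, whose log-odds are noisy.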

#### 4.4.3 L2-Penalized (Ridge) Logistic Regression Coefficients vs. Log Frequency

In [69]:
from sklearn.linear_model import LogisticRegression
scores = corpus.get_logreg_coefs('breastcancer',
                                 LogisticRegression(penalty='l2', C=10, max_iter=10000, n_jobs=-1))
scores_scaled = zero_centered_scale(scores)

html = produce_scattertext_explorer(corpus,
                                    category='breastcancer',
                                    category_name='Breast Cancer',
                                    not_category_name='Air Quality',
                                    minimum_term_frequency=5,
                                    width_in_pixels=1000,
                                    x_coords=frequencies_scaled,
                                    y_coords=scores_scaled,
                                    scores=scores,
                                    sort_by_dist=False,
                                    metadata=df['class'],
                                    x_label='Log frequency',
                                    y_label='L2-Penalized Log Reg Coef')
file_name = './figures/bc_vs_aq_L2vsLog.html'
# Write with a context manager so the file is closed before it is displayed
with open(file_name, 'wb') as f:
    f.write(html.encode('utf-8'))
IFrame(src=file_name, width=1200, height=700)
