# Scattertext for Descriptive Text Analytics and Visualization Using Data from the 2016 Presidential Debates
---

Now that we have reviewed basic visualization of word frequency, let's look at some more advanced tools. Scattertext is a comprehensive package that provides tools for visualizing terms and associations, topics and categories, term scores, text classification weights, semiotic squares, word similarity, and even emojis! 

We are going to start with visualizing terms and associations, then we will explore some of the more advanced topics. 

> https://github.com/JasonKessler/Scattertext-PyData

### First, ensure that you have installed all of the packages listed in the README

### Import Packages

In [3]:
import scattertext as st
import spacy
from pprint import pprint
import en_core_web_sm

#CSV
import csv
from collections import Counter

#pandas
import pandas as pd

#Matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

#numpy
import numpy as np

# nltk
import nltk
# stopwords, FreqDist, word_tokenize
from nltk.corpus import stopwords
from nltk import FreqDist, word_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

#regular expression
import re

#seaborn
import seaborn as sns

#SKlearn packages
import sklearn
from lightning.classification import CDClassifier
# feature engineering (words to vectors)
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
# classification algorithms (or classifiers)
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC
# build a pipeline
from sklearn.pipeline import Pipeline
# model evaluation, validation
from sklearn.metrics import classification_report, f1_score, accuracy_score, confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split, GridSearchCV 
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
#pip install scikit-plot 
import scikitplot as skplt

#### Reload and Process the debate dataset
* Scattertext and spaCy make the preprocessing very easy
* All we need to do is create a DataFrame with two columns (one for the category and one for the text data)
 1. Speaker (Trump or Clinton): these are our categories
 2. Text: the text data for analysis
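
As a rough illustration of the two-column layout described above, the sketch below builds a tiny DataFrame by hand and keeps only the two candidates. Every row, including the "Moderator" speaker label, is made up for illustration and is not taken from data/debate.csv.

```python
import pandas as pd

# Hypothetical rows that mimic the layout of the debate data (not real transcript lines).
toy = pd.DataFrame({
    "Speaker": ["Moderator", "Clinton", "Trump", "Clinton"],
    "Text": [
        "Good evening and welcome.",
        "We need to invest in jobs.",
        "I will tell you, it is bad.",
        "I want to work with everyone.",
    ],
})

# Keep only the two categories we want to compare.
candidates = toy[toy["Speaker"].isin(["Clinton", "Trump"])].reset_index(drop=True)

print(candidates["Speaker"].value_counts().to_dict())
```

Filtering with isin keeps the original row order, which is a small difference from stacking one speaker's rows after the other's.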

In [4]:
df = pd.read_csv("data/debate.csv", encoding = 'iso-8859-1')
del df['Line']
del df['Date']
df_clinton = df[df.Speaker=="Clinton"].copy()
df_trump = df[df.Speaker=="Trump"].copy()
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows the same way
df3 = pd.concat([df_clinton, df_trump])

### Turn the data frame into a Scattertext Corpus
* We want to look for differences between Trump and Clinton, so we set (category_col='Speaker')
* We want to analyze the text for each candidate, so we set (text_col='Text')
* We are not going to remove stopwords for this visualization because they might actually provide insight in this use case (if you want to remove stopwords, the code is included below)
>Jason Kessler, the creator of Scattertext, said "function words can reveal interesting psychological traits"

* ##### To remove stopwords: 
 * from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS (older scikit-learn releases exposed this as sklearn.feature_extraction.stop_words)
 * chain this after .build()
   * .remove_terms(ENGLISH_STOP_WORDS, ignore_absences=True)

* ##### spaCy error troubleshooting
 * If spaCy raises an error, it may not be able to load the English language model through the 'en' shortcut (spaCy 3.x no longer supports shortcut names). Replace 'nlp = spacy.load('en')' with:
  * nlp = en_core_web_sm.load()
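
To make the stopword trade-off discussed above concrete, here is a minimal sketch in plain Python, independent of Scattertext. The stop list and the sample sentence are hypothetical and far smaller than ENGLISH_STOP_WORDS; the point is only that filtering drops function words such as pronouns from the term counts.

```python
from collections import Counter

# Tiny, hypothetical stop list for illustration only.
TOY_STOP_WORDS = {"i", "you", "she", "he", "the", "and", "to", "is", "it"}

def term_counts(text, stop_words=frozenset()):
    """Count lowercase whitespace-separated tokens, skipping any in stop_words."""
    tokens = [tok.strip(".,!?").lower() for tok in text.split()]
    return Counter(tok for tok in tokens if tok and tok not in stop_words)

sample = "She is wrong and I will tell you the truth."  # made-up sentence
with_function_words = term_counts(sample)
without_function_words = term_counts(sample, TOY_STOP_WORDS)

print(sorted(with_function_words))
print(sorted(without_function_words))
```

Pronouns like "she" disappear from the filtered counts, which is exactly the kind of signal the unfiltered visualization keeps.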

In [5]:
nlp = spacy.load('en') 
corpus = st.CorpusFromPandas(df3, category_col='Speaker', text_col='Text', nlp=nlp).build()

##### Let's see the terms that are most characteristic of the corpus as a whole (Trump and Clinton together) compared with general English usage

In [7]:
print(list(corpus.get_scaled_f_scores_vs_background().index[:10]))

['obamacare', 'wikileaks', 'raqqa', 'obama', 'outsmarted', 'mosul', 'baghdadi', 'irredeemable', 'tweeting', 'underleveraged']


##### Let's see the terms that are most associated with Trump

In [8]:
term_freq_df = corpus.get_term_freq_df()
term_freq_df['Trump Score'] = corpus.get_scaled_f_scores('Trump')
pprint(list(term_freq_df.sort_values(by='Trump Score', ascending=False).index[:10]))

['hillary',
 "she 's",
 'she',
 'bad',
 'tell you',
 'tell',
 "they 're",
 'clinton',
 'her',
 'and she']
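
For intuition about what a scaled F-score rewards, the sketch below follows the basic definition: a term's precision (the share of its occurrences that fall in the category) and its in-category frequency are each rescaled with a normal CDF and then combined with a harmonic mean. This is a simplified illustration under those assumptions, not Scattertext's actual implementation, which includes further refinements. The function names and the word counts are hypothetical.

```python
from statistics import NormalDist, mean, pstdev

def normal_cdf_rescale(values):
    """Map raw values into (0, 1) using the CDF of a normal fitted to those values."""
    raw = list(values.values())
    mu, sigma = mean(raw), pstdev(raw)
    if sigma == 0:  # all values identical, so nothing distinguishes the terms
        return {term: 0.5 for term in values}
    dist = NormalDist(mu, sigma)
    return {term: dist.cdf(v) for term, v in values.items()}

def toy_scaled_f_scores(category_counts, other_counts, beta=1.0):
    """Harmonic mean of CDF-rescaled precision and in-category frequency per term."""
    terms = set(category_counts) | set(other_counts)
    total_in_category = sum(category_counts.values())
    precision, frequency = {}, {}
    for term in terms:
        in_cat = category_counts.get(term, 0)
        overall = in_cat + other_counts.get(term, 0)
        precision[term] = in_cat / overall
        frequency[term] = in_cat / total_in_category
    p, f = normal_cdf_rescale(precision), normal_cdf_rescale(frequency)
    b2 = beta ** 2
    return {t: (1 + b2) * p[t] * f[t] / (b2 * p[t] + f[t]) for t in terms}

# Made-up counts, not computed from the debate transcripts.
trump_counts = {"hillary": 8, "she": 6, "the": 20, "work": 1}
clinton_counts = {"donald": 7, "he": 6, "the": 19, "work": 5}

scores = toy_scaled_f_scores(trump_counts, clinton_counts)
for term, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{term:8s} {score:.3f}")
```

A term scores well only when it is both frequent within the category and mostly confined to it, which is why words a speaker uses about the opponent tend to rise to the top.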


##### Let's see the terms that are most associated with Clinton

In [9]:
term_freq_df = corpus.get_term_freq_df()
term_freq_df['Clinton Score'] = corpus.get_scaled_f_scores('Clinton')
pprint(list(term_freq_df.sort_values(by='Clinton Score', ascending=False).index[:10]))

['donald',
 'need to',
 'his',
 'that he',
 "'ve got",
 'he',
 'i want',
 'that is',
 'work',
 'he has']


## Visualization of terms with Scattertext
* Using the corpus we created, the produce_scattertext_explorer function will create an interactive visualization of terms
* Hillary Clinton's term frequencies will be plotted on the Y axis
* Donald Trump's term frequencies will be plotted on the X axis
* You can click on a term to see the sentences in which Hillary Clinton or Donald Trump used it
* We will also use IFrame so we can embed the HTML output in the notebook

In [10]:
# category='Clinton' selects the Speaker value to treat as the category of interest (displayed as 'Hillary Clinton')
# not_category_name='Donald Trump' labels everything that is not in the Clinton category
html = st.produce_scattertext_explorer(corpus, category='Clinton', category_name='Hillary Clinton', 
                                       not_category_name='Donald Trump', width_in_pixels=1000, metadata=df3['Speaker'])
open("TrumpClinton_Visualization.html", 'wb').write(html.encode('utf-8'))

736831

###### This will also create a standalone HTML file that you can open from your file explorer

In [12]:
from IPython.display import IFrame, display, HTML
file_name = 'TrumpClinton_Visualization.html'
open(file_name, 'wb').write(html.encode('utf-8'))
IFrame(src=file_name, width = 1300, height=800)
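
The cells above write the file with a bare open(...).write(...), which leaves closing the file handle to the interpreter. The number echoed under the first cell is the return value of write, that is, the number of bytes written. A slightly more defensive sketch uses a with-block; the html string here is a short stand-in rather than real Scattertext output, and the temporary folder is an assumption made so the example does not touch your project files.

```python
from pathlib import Path
import tempfile

# Stand-in for the string returned by produce_scattertext_explorer.
html = "<html><body><p>placeholder for the Scattertext output</p></body></html>"

out_dir = Path(tempfile.mkdtemp())  # temporary folder used only for this sketch
out_path = out_dir / "TrumpClinton_Visualization.html"

# The with-block closes the file as soon as the write finishes.
with open(out_path, "wb") as fh:
    bytes_written = fh.write(html.encode("utf-8"))

print(bytes_written, out_path.exists())
```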

# Topic Modeling Visualization Using Scattertext
>Instead of looking for common terms, we will be looking for topics that Hillary Clinton and Donald Trump raised in their debates

In [13]:
df3.head(1)

Unnamed: 0,Speaker,Text
2,Clinton,"How are you, Donald?"


##### Create a corpus of topics/categories rather than terms. We will do this using **FeatsFromOnlyEmpath**

In [14]:
empath_corpus = st.CorpusFromParsedDocuments(df3, category_col='Speaker', 
                                             feats_from_spacy_doc=st.FeatsFromOnlyEmpath(), 
                                             parsed_col='Text').build()
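
To give a feel for what "topics rather than terms" means, the sketch below counts how many words in each document fall into a topic, using a tiny hand-made lexicon. The lexicon, the topic names, and the sentences are all hypothetical; the real Empath lexicon is much larger, and this is not how FeatsFromOnlyEmpath is implemented internally.

```python
from collections import Counter

# Hypothetical mini-lexicon for illustration only.
TOY_LEXICON = {
    "economy": {"jobs", "taxes", "trade"},
    "security": {"border", "military", "isis"},
}

def topic_counts(text, lexicon=TOY_LEXICON):
    """Count, per topic, how many tokens of the text appear in that topic's word set."""
    tokens = [tok.strip(".,!?").lower() for tok in text.split()]
    counts = Counter()
    for topic, words in lexicon.items():
        counts[topic] = sum(tok in words for tok in tokens)
    return counts

# Made-up sentences, not transcript lines.
docs = {
    "Clinton": "We need jobs and fair taxes.",
    "Trump": "Trade is bad and the border is open.",
}
by_speaker = {speaker: topic_counts(text) for speaker, text in docs.items()}
print(by_speaker)
```

Each document is reduced to a handful of topic counts, so the resulting plot compares speakers on topics instead of on individual words.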

In [15]:
html = st.produce_scattertext_explorer(empath_corpus, category= 'Clinton', category_name='Hillary Clinton', 
                                       not_category_name='Donald Trump', width_in_pixels=1000, 
                                       metadata=df3['Speaker'], use_non_text_features=True, use_full_doc=True)
open("TrumpClintonDebate-Empath.html", 'wb').write(html.encode('utf-8'))

661421

In [16]:
from IPython.display import IFrame, display, HTML
file_name = 'TrumpClintonDebate-Empath.html'
open(file_name, 'wb').write(html.encode('utf-8'))
IFrame(src=file_name, width = 1500, height=700)