# Automated Text Analysis for Social Scientists (Day 1) 

[John McLevey](http://www.johnmclevey.com)  
Associate Professor  
University of Waterloo 

June 7 & 8, 2019  
University of British Columbia 

A few introductory notes: 

* **Please feel free to interrupt me at any point during the workshop.** I want to know if you are confused early, since concepts build throughout the workshop. 
* If you are working on something and need help, please put a Post-it note on the back of your laptop screen. One of us will come over to help.
* A version of this notebook will be bundled with my forthcoming book *Doing Computational Social Science* (Sage UK). 
* Some of the code below is not as compact or efficient as it could be. This is to make things more transparent and easy to read. 


## Workshop Topics 

### Day 1

* Content Analysis & Automated Text Analysis 
* Getting Started with Text Data and Natural Language Processing 
* Exploratory Text Analysis: Word Scores and Visualization 

### Day 2 

* Constructing Feature Matrices 
* Identifying Latent Topics with Unsupervised Learning 
* The Semantic Network Alternative 
* Text Classification and Scaling Up Content Analysis with Supervised Learning 

In [7]:
!pip install spacy
!python -m spacy download en_core_web_sm 
!pip install sklearn --upgrade


    Linking successful
    /anaconda3/lib/python3.6/site-packages/en_core_web_sm -->
    /anaconda3/lib/python3.6/site-packages/spacy/data/en_core_web_sm

    You can now load the model via spacy.load('en_core_web_sm')

Requirement already up-to-date: sklearn in /anaconda3/lib/python3.6/site-packages (0.0)


In [8]:
import spacy 
nlp = spacy.load('en_core_web_sm')  # the model downloaded above

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
# make inline plots vector graphics instead of raster. 
%config InlineBackend.figure_format = 'svg'

import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import PCA, NMF, LatentDirichletAllocation
from sklearn.cluster import KMeans
from yellowbrick.cluster import SilhouetteVisualizer

import scattertext as st
import pyLDAvis
import pyLDAvis.sklearn
pyLDAvis.enable_notebook()

We are going to read in data from a `csv` file stored in our Google Drive. To simplify things, we will give Google Colab access to our Google Drive. You will need to follow the instructions on the screen. 

In [1]:
from google.colab import drive # lets us hook into Google Drive
drive.mount('/content/drive')

ModuleNotFoundError: No module named 'google'

In [2]:
!ls "/content/drive/My Drive/Workshops/text_analysis_workshop-master/"
workshop = "/content/drive/My Drive/Workshops/text_analysis_workshop-master/"

ls: /content/drive/My Drive/Workshops/text_analysis_workshop-master/: No such file or directory


# Getting Data into Python 

In [11]:
dat = workshop + 'data/fake_news.csv'
df = pd.read_csv(dat)

In [12]:
df.head(20)

Unnamed: 0.1,Unnamed: 0,title,text,label
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL
5,6903,"Tehran, USA","\nI’m not an immigrant, but my grandparents ...",FAKE
6,7341,Girl Horrified At What She Watches Boyfriend D...,"Share This Baylee Luciani (left), Screenshot o...",FAKE
7,95,‘Britain’s Schindler’ Dies at 106,A Czech stockbroker who saved more than 650 Je...,REAL
8,4869,Fact check: Trump and Clinton at the 'commande...,Hillary Clinton and Donald Trump made some ina...,REAL
9,2909,Iran reportedly makes new push for uranium con...,Iranian negotiators reportedly have made a las...,REAL


In [13]:
df.sample(10)

Unnamed: 0.1,Unnamed: 0,title,text,label
471,5060,'America is already strong': Obama continues D...,President Obama offered enthusiastic support f...,REAL
997,5753,SHOCK VIDEO : Hillary Needs Help Climbing ONE ...,SHOCK VIDEO : Hillary Needs Help Climbing ONE ...,FAKE
700,17,Stop the vendetta against Planned Parenthood,THE STING videos targeting Planned Parenthood ...,REAL
2619,3969,Fate of Paris attack mastermind unclear after ...,French investigators would not publicly identi...,REAL
4185,5874,Preventing cultural genocide with the Mother T...,Preventing cultural genocide with the Mother T...,FAKE
6183,10181,It Happened: Personal Notes From A Young Chica...,Chicago Cubs first baseman Anthony Rizzo stepp...,FAKE
2797,4190,Trump Scoffs at Cruz Choosing a Running Mate: ...,It's an unusual move for a presidential candid...,REAL
1793,5125,Trump's VP search enters frenzied phase,(CNN) Donald Trump's vice presidential search ...,REAL
4066,8916,The Next Big Shoe to Drop,The Next Big Shoe to Drop Posted on The Next...,FAKE
1376,9328,Cahill vs. Kalma Debut Album Out Now!,We Are Change \n\nThe debut album by Cahill vs...,FAKE


In [14]:
df['title']

0                            You Can Smell Hillary’s Fear
1       Watch The Exact Moment Paul Ryan Committed Pol...
2             Kerry to go to Paris in gesture of sympathy
3       Bernie supporters on Twitter erupt in anger ag...
4        The Battle of New York: Why This Primary Matters
5                                             Tehran, USA
6       Girl Horrified At What She Watches Boyfriend D...
7                       ‘Britain’s Schindler’ Dies at 106
8       Fact check: Trump and Clinton at the 'commande...
9       Iran reportedly makes new push for uranium con...
10      With all three Clintons in Iowa, a glimpse at ...
11      Donald Trump’s Shockingly Weak Delegate Game S...
12      Strong Solar Storm, Tech Risks Today | S0 News...
13           10 Ways America Is Preparing for World War 3
14                       Trump takes on Cruz, but lightly
15                             How women lead differently
16      Shocking! Michele Obama & Hillary Caught Glamo...
17      Hillar

In [15]:
df.groupby(['label']).size()

label
FAKE    3164
REAL    3171
dtype: int64

We can pull a variable out of a `dataframe` and store it as a `list`. This is sometimes useful when you only need a small subset of your data to work with. It can easily be reattached to the `dataframe` later. 

In [16]:
titles = df['title'].tolist()

In [17]:
for t in titles[:10]:
    print(t)

You Can Smell Hillary’s Fear
Watch The Exact Moment Paul Ryan Committed Political Suicide At A Trump Rally (VIDEO)
Kerry to go to Paris in gesture of sympathy
Bernie supporters on Twitter erupt in anger against the DNC: 'We tried to warn you!'
The Battle of New York: Why This Primary Matters
Tehran, USA
Girl Horrified At What She Watches Boyfriend Do After He Left FaceTime On
‘Britain’s Schindler’ Dies at 106
Fact check: Trump and Clinton at the 'commander-in-chief' forum
Iran reportedly makes new push for uranium concessions in nuclear talks


# Content Analysis & Automated Text Analysis 

Before we get practical, let's review some big picture issues. 

* shared challenges in text analysis 
    * managing unstructured data
    * interpreting latent meaning and multiple ways to interpret the same text 
    * *far too much text* for any researcher or research team to process 
* sampling? yes, but... 
    * the population is usually unknown 
    * unitizing text is ambiguous 
    * some documents are more important than others 
* critical questions about text: 
    * why are you looking at these texts and not others? 
    * who produced these texts and why? 
    * intended to be public? 
    * whose interests? 
    * who preserved and why? 
    * genre effects? (e.g. co-occurrence windows in news text) 
    
  
## The Promise of Computation 

![](img/mclevey_overview.png)
  
* Natural language processing (NLP), word scores, and exploratory text analysis
* Supervised vs. unsupervised learning: the big picture 
* Discovering latent topics
    * vector space model and cluster analysis 
    * LDA and structural topic models 
    * semantic networks 
* Text classification and supervised learning 
* Integrating supervised and unsupervised learning 
    * Patterns from unsupervised analyses $\neq$ patterns from supervised analyses
    * Unsupervised > Guided Reading > Supervised 
    
We are going to rely a lot on a package called `spacy`. There are a variety of reasons why I think you should use `spacy` instead of established alternatives, like `nltk`. 

# Getting Started with Text Data and Natural Language Processing 

The following abstract is from Muller, Sampson, and Winter's (2018) article "[Environmental Inequality: The Social Causes and Consequences of Lead Exposure](https://www.annualreviews.org/doi/10.1146/annurev-soc-073117-041222)," published in the *Annual Review of Sociology*. Let's use this example to illustrate some basic natural language processing (NLP) concepts and tasks before moving on to a bigger example.

> In this article, we review evidence from the social and medical sciences on the causes and effects of lead exposure. We argue that lead exposure is an important subject for sociological analysis because it is socially stratified and has important social consequences -- consequences that themselves depend in part on children's social environments. We present a model of environmental inequality over the life course to guide an agenda for future research. We conclude with a call for deeper exchange between urban sociology, environmental sociology, and public health, and for more collaboration between scholars and local communities in the pursuit of independent science for the common good.

When you read this abstract, you know where words begin and end, and where sentences begin and end. It is more challenging for a computer to be able to tell where these "tokens" begin and end. Why? What are some common challenges for a computer "reading" text data? 
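One way to see why tokenization is hard is to try the naive approach first. The sketch below splits one sentence from the abstract on whitespace using only the Python standard library. It is an illustration of the problem, not a description of how `spaCy` tokenizes text.

```python
# Naive tokenization: split on whitespace only.
sentence = ("We conclude with a call for deeper exchange between urban sociology, "
            "environmental sociology, and public health.")

naive_tokens = sentence.split()
print(naive_tokens)

# Punctuation stays glued to the neighbouring word, so 'sociology,' and
# 'sociology' would be counted as two different tokens.
glued = [tok for tok in naive_tokens if not tok.isalnum()]
print(glued)
```

Contractions and possessives (e.g. "children's"), hyphenated words, and abbreviations with periods cause similar problems for simple rules.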

Pre-processing tasks like selecting words based on their part-of-speech always degrade the reading experience for a human because we are stripping out information that makes it easier to understand the meaning of any individual text. But when we want a computer to tell us something about the content of many texts in a document collection, the same pre-processing tasks are *essential* for producing informative results. Effective pre-processing is all about knowing what kinds of information need to be preserved or removed to improve the ability of the computer to "read" many texts. That may or may not involve tasks like selecting nouns and verbs, but it will almost always include tasks like removing stopwords and punctuation and normalizing text. 

Let's begin by walking through some basic pre-processing on the abstract introduced above. The first thing we will do is create a variable containing the abstract, and then we will feed it into the `spaCy` pipeline `nlp()`, which we defined right under `import spacy` when we were importing the packages used in this lesson. 

In [18]:
ab = "In this article, we review evidence from the social and medical sciences on the causes and effects of lead exposure. We argue that lead exposure is an important subject for sociological analysis because it is socially stratified and has important social consequences -- consequences that themselves depend in part on children's social environments. We present a model of environmental inequality over the life course to guide an agenda for future research. We conclude with a call for deeper exchange between urban sociology, environmental sociology, and public health, and for more collaboration between scholars and local communities in the pursuit of independent science for the common good."

proc = nlp(ab)

When we process text by running it through the `spaCy` pipeline, `spaCy` stores the information we want in a `Doc` object. If we want to see the tokenized sentences, for example, we can iterate over the sentences in the `Doc` object and print them to screen. In this example, our `Doc` object is stored in the variable `proc`.

In [19]:
for sent in proc.sents:
    print(sent)
    print('\n')

In this article, we review evidence from the social and medical sciences on the causes and effects of lead exposure.


We argue that lead exposure is an important subject for sociological analysis because it is socially stratified and has important social consequences -- consequences that themselves depend in part on children's social environments.


We present a model of environmental inequality over the life course to guide an agenda for future research.


We conclude with a call for deeper exchange between urban sociology, environmental sociology, and public health, and for more collaboration between scholars and local communities in the pursuit of independent science for the common good.




We can iterate over tokens (e.g. sentences, words) for a variety of important text processing tasks, including normalizing text, removing stopwords, identifying parts-of-speech, and extracting named entities. For example, we can use normalized words rather than the original words by iterating over the words in the abstract and adding each word's lemma to a list. This time we will use a list comprehension to iterate over the tokens. 

When working with natural language data, we have to make a decision about how to handle words that mean more or less the same thing, but which have different surface forms (e.g. compute, computing). On the one hand, leaving words as they are enables us to preserve nuances in language use that may be useful in answering our research questions. The downside of this approach is that those words are tokenized and counted separately, as if they had no semantic similarity. An alternative approach is to normalize the text by grouping together words that mean more or less the same thing and reducing them to the same token. The idea, in short, is to define classes of equivalent words and treat them as a single token. Doing so loses some of the nuances of language use, but can dramatically improve the results of most text analysis algorithms.

The two most widely used approaches to text normalization are **stemming** and **lemmatization**. Stemming is a rule-based approach to normalizing words regardless of what role the word plays in a sentence (e.g. noun or verb), or of the surrounding context. For example, the Snowball stemmer (or Porter 2, following the well-known Porter Stemmer) takes in each individual word and follows rules about what parts of the word (e.g. 'ing') should be cut off. As you might imagine, the results you get back are often not themselves valid words.
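To make the idea of rule-based suffix stripping concrete, here is a deliberately tiny sketch. It is *not* the Porter or Snowball algorithm: the suffix list and the minimum-length rule are assumptions made up for this illustration, and the function name `toy_stem` is hypothetical.

```python
# Toy suffix-stripping stemmer. This is NOT the Porter or Snowball algorithm;
# the suffix list below is a made-up, minimal rule set for illustration only.
SUFFIXES = ('ing', 'ed', 'es', 's')

def toy_stem(word):
    """Strip the first matching suffix, ignoring part-of-speech and context."""
    for suffix in SUFFIXES:
        # keep at least three characters so short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

words = ['computing', 'computed', 'computes', 'consequences', 'sciences']
stems = [toy_stem(w) for w in words]
print(stems)
```

The three forms of "compute" collapse to the same stem, which is the point of normalization, but stems like "comput" and "scienc" are not words you would find in a dictionary.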

Rather than chopping off parts of tokens to get to a word stem, lemmatization normalizes words by reducing them to their dictionary form. As a result, it almost always returns valid words, which makes it considerably easier to interpret the results of almost any text analysis. In addition, lemmatization takes a token's part-of-speech (discussed below) into account, which enables it to differentiate between ways of using the same word (e.g. 'meeting' as a noun, 'meeting' as a verb). Lemmatization is generally accurate, and is almost always going to be a better choice than stemming, if only for the reasons discussed above. It is also more widely used.
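The role of part-of-speech can be sketched with a hand-made lookup table. The table entries and the function name `toy_lemma` are assumptions for illustration only; `spaCy` relies on much larger resources and rules than this.

```python
# Toy lemmatizer: a hand-made lookup table keyed on (word, part-of-speech).
# The entries are illustrative assumptions, not spaCy's actual lemma data.
LEMMA_TABLE = {
    ('meeting', 'NOUN'): 'meeting',
    ('meeting', 'VERB'): 'meet',
    ('sciences', 'NOUN'): 'science',
    ('stratified', 'VERB'): 'stratify',
}

def toy_lemma(word, pos):
    """Return the dictionary form if known, otherwise the lowercased word."""
    return LEMMA_TABLE.get((word.lower(), pos), word.lower())

print(toy_lemma('meeting', 'NOUN'))
print(toy_lemma('meeting', 'VERB'))
print(toy_lemma('Stratified', 'VERB'))
```

The same surface form maps to different lemmas depending on its tag, which a context-free stemmer cannot do.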

Keeping in mind that our goal with computational text analysis is to see the shape and structure of the forest, not any individual tree, you can probably see why this is useful in the context of analyzing natural language data. Although we lose some of the nuances of language use by normalizing the text, we improve our analysis of the **corpus** (i.e. the 'forest') itself.

As mentioned earlier, `spaCy`'s `nlp()` does all of the heavy computing up front. As a result, our `Doc` object already includes information about the lemmas of each token in our abstract. In the code below, we can iterate over each token in the `Doc` and add its lemma to a list.

In [20]:
lemmas = [token.lemma_ for token in proc]
print(lemmas)

['in', 'this', 'article', ',', '-PRON-', 'review', 'evidence', 'from', 'the', 'social', 'and', 'medical', 'science', 'on', 'the', 'cause', 'and', 'effect', 'of', 'lead', 'exposure', '.', '-PRON-', 'argue', 'that', 'lead', 'exposure', 'be', 'an', 'important', 'subject', 'for', 'sociological', 'analysis', 'because', '-PRON-', 'be', 'socially', 'stratify', 'and', 'have', 'important', 'social', 'consequence', '--', 'consequence', 'that', '-PRON-', 'depend', 'in', 'part', 'on', 'child', "'s", 'social', 'environment', '.', '-PRON-', 'present', 'a', 'model', 'of', 'environmental', 'inequality', 'over', 'the', 'life', 'course', 'to', 'guide', 'an', 'agenda', 'for', 'future', 'research', '.', '-PRON-', 'conclude', 'with', 'a', 'call', 'for', 'deep', 'exchange', 'between', 'urban', 'sociology', ',', 'environmental', 'sociology', ',', 'and', 'public', 'health', ',', 'and', 'for', 'more', 'collaboration', 'between', 'scholar', 'and', 'local', 'community', 'in', 'the', 'pursuit', 'of', 'independent

In [21]:
wo_stops = [token for token in proc if not token.is_stop]
print(wo_stops)

[In, article, ,, review, evidence, social, medical, sciences, causes, effects, lead, exposure, ., We, argue, lead, exposure, important, subject, sociological, analysis, socially, stratified, important, social, consequences, --, consequences, depend, children, 's, social, environments, ., We, present, model, environmental, inequality, life, course, guide, agenda, future, research, ., We, conclude, deeper, exchange, urban, sociology, ,, environmental, sociology, ,, public, health, ,, collaboration, scholars, local, communities, pursuit, independent, science, common, good, .]


Extracting words by their part-of-speech is no different. First, let's print the part-of-speech for each word in the abstract. Then let's make a list that includes only the nouns. 

In [22]:
for item in proc:
    print(item.text + '({})'.format(item.pos_))

In(ADP)
this(DET)
article(NOUN)
,(PUNCT)
we(PRON)
review(VERB)
evidence(NOUN)
from(ADP)
the(DET)
social(ADJ)
and(CCONJ)
medical(ADJ)
sciences(NOUN)
on(ADP)
the(DET)
causes(NOUN)
and(CCONJ)
effects(NOUN)
of(ADP)
lead(NOUN)
exposure(NOUN)
.(PUNCT)
We(PRON)
argue(VERB)
that(ADP)
lead(ADJ)
exposure(NOUN)
is(VERB)
an(DET)
important(ADJ)
subject(NOUN)
for(ADP)
sociological(ADJ)
analysis(NOUN)
because(ADP)
it(PRON)
is(VERB)
socially(ADV)
stratified(VERB)
and(CCONJ)
has(VERB)
important(ADJ)
social(ADJ)
consequences(NOUN)
--(PUNCT)
consequences(NOUN)
that(ADP)
themselves(PRON)
depend(VERB)
in(ADP)
part(NOUN)
on(ADP)
children(NOUN)
's(PART)
social(ADJ)
environments(NOUN)
.(PUNCT)
We(PRON)
present(VERB)
a(DET)
model(NOUN)
of(ADP)
environmental(ADJ)
inequality(NOUN)
over(ADP)
the(DET)
life(NOUN)
course(NOUN)
to(PART)
guide(VERB)
an(DET)
agenda(NOUN)
for(ADP)
future(ADJ)
research(NOUN)
.(PUNCT)
We(PRON)
conclude(VERB)
with(ADP)
a(DET)
call(NOUN)
for(ADP)
deeper(ADJ)
exchange(NOUN)
between(ADP)
urba

We can iterate over the tokens and add the word to a list of nouns if its part-of-speech classification is noun. 

In [23]:
nouns = [item.text for item in proc if item.pos_ == 'NOUN']
print(nouns)

['article', 'evidence', 'sciences', 'causes', 'effects', 'lead', 'exposure', 'exposure', 'subject', 'analysis', 'consequences', 'consequences', 'part', 'children', 'environments', 'model', 'inequality', 'life', 'course', 'agenda', 'research', 'call', 'exchange', 'sociology', 'sociology', 'health', 'collaboration', 'scholars', 'communities', 'pursuit', 'science', 'good']


`spaCy` is also able to identify noun chunks, or "phrases." 

In [24]:
for item in proc.noun_chunks:
    print(item.text)

this article
we
evidence
the social and medical sciences
the causes
effects
lead exposure
We
lead exposure
an important subject
sociological analysis
it
important social consequences
consequences
themselves
part
children's social environments
We
a model
environmental inequality
the life course
an agenda
future research
We
a call
deeper exchange
urban sociology
environmental sociology
public health
more collaboration
scholars
local communities
the pursuit
independent science
the common good


As you can see, `spaCy` really simplifies key tasks in natural language processing. Knowing how, when, and why to do these and other tasks is the secret to getting good results when you are doing automated text analysis. **If you don't pre-process your text, you will not get good results, no matter how sophisticated your models are.** Garbage in, garbage out. 

How, then, can we combine these methods into a simple text pre-processing step? And how do we scale this up to a collection of abstracts (or any other text data) rather than a single string?

# Working with Real Data

In the cells below, I have provided some code to pre-process text data from a fake news dataset. 

The cell below this one will take a bit of time to run because of all the work `spaCy` is doing to parse the text. Once it has finished, run the next code cell to actually pre-process the text. 

We will use the fake news dataset we already loaded into memory. We will work with a sample so that we are not waiting around too long for things to compute. 

In [25]:
fake_sample = df[df['label'] == 'FAKE'].sample(200)
real_sample = df[df['label'] == 'REAL'].sample(200)

In [26]:
sampled_news = pd.concat([fake_sample, real_sample])
len(sampled_news)

400

In [27]:
text = sampled_news['text'].tolist()

## The NLP Pipeline 

Pour yourself some coffee. The next step can take a bit of time... 

In [28]:
processed = [nlp(t) for t in text]

## Named Entity Recognition 

* PERSON, GPE, etc. 

In [29]:
for each in processed: 
    for ent in each.ents:
        # compare strings with ==, not `is` (identity comparison is unreliable here)
        if ent.label_ == 'PERSON':
            print(ent.text, ent.label_)

Donald Trump PERSON
Hillary Clinton PERSON
Clinton PERSON
John Podesta PERSON
Trumps PERSON
Clinton PERSON
Ben Swann PERSON
Clinton PERSON
Swann PERSON
John Podesta PERSON
Wikileaks PERSON
Clinton PERSON
Huma Abedin PERSON
Clinton PERSON
Clinton PERSON
Wikileaks PERSON
Clinton PERSON
Glenn Thrush PERSON
John Podesta PERSON
Thrush PERSON
Thrush PERSON
Bernie Sanders PERSON
Clinton PERSON
Clinton PERSON
Hillary Clinton PERSON
Barack Hussein Obama Soetoro Sobarkah PERSON
Einstein PERSON
Byron Brown PERSON
John Greven PERSON
Rupert Spira PERSON
Opitz “ Oneness ” PERSON
Planet PERSON
Souls PERSON
Katie Gallanti Letting Go PERSON
Making Room PERSON
Albert Einstein PERSON
York Dobyns PERSON
Einstein PERSON
Billy Okeefe PERSON
McClatchy Tribune PERSON
Brandon Smith PERSON
Hillary Clinton PERSON
Clinton PERSON
Clinton PERSON
Weiner PERSON
— Anthony Weiner PERSON
Hillary Clinton’s PERSON
Clinton PERSON
Clinton PERSON
Clinton PERSON
Clinton PERSON
Clinton PERSON
Hillary Clinton PERSON
Bill Clinto
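Printing every entity quickly becomes unwieldy, so a tally is often more useful. The sketch below counts names with `collections.Counter`. To keep it runnable without `spaCy`, it uses a few `(text, label)` pairs copied from the output above as stand-in data; with the real `processed` list, the pairs would come from `ent.text` and `ent.label_`.

```python
from collections import Counter

# A few (text, label) pairs copied from the printed output above. With real
# data these would come from `ent.text` and `ent.label_`.
entities = [
    ('Donald Trump', 'PERSON'),
    ('Hillary Clinton', 'PERSON'),
    ('Clinton', 'PERSON'),
    ('John Podesta', 'PERSON'),
    ('Clinton', 'PERSON'),
    ('Wikileaks', 'PERSON'),  # a tagging error: not a person
    ('John Podesta', 'PERSON'),
]

# Tally how often each PERSON entity appears.
person_counts = Counter(text for text, label in entities if label == 'PERSON')
print(person_counts.most_common(3))
```

Note that "Clinton" and "Hillary Clinton" are counted separately, and that "Wikileaks" is mislabeled as a person. Named entity recognition output usually needs some cleaning before it is analyzed.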

## Pre-processing Text

In [30]:
analysis_text = []

for doc in processed: 
    reduced = []
    for token in doc:
        if not token.is_stop and token.pos_ == 'NOUN':
            reduced.append(token.lemma_)
    analysis_text.append(reduced)

In [31]:
for each in analysis_text[:2]:
    print(each)

['criminal', 'issue', 'idiot', 'medium', 'debate', 'view', 'release', 'email', 'campaign', 'chair', 'question', 'leak', 'email', 'question', 'journalist', 'reality', 'check', 'agency', 'hack', 'agency', 'credibility', 'issue', 'agency', 'weapon', 'destruction', 'agency', 'hack', 'issue', 'leak', 'information', 'report', 'email', 'email', 'pay', 'play', 'scheme', 'email', 'deputy', 'adviser', 'authority', 'initiative', 'order', 'access', 'email', 'collusion', 'campaign', 'medium', 'outlet', 'fact', 'email', 'wikileak', 'reporter', 'campaign', 'staff', 'writer', 'article', 'writing', 'hack', 'sic', 'section', 'sic', 'word', 'email', 'year', 'staff', 'e', 'mail', 'campaign', 'issue', 'hacking', 'email', 'reality', 'information', 'email', 'who', 'accusation', 'election', 'truth', 'medium', 'election', 'medium', 'medium', 'outlet', 'truth', 'weight', 'conclusion', 'way', 'thing', 'email', 'release', 'doubt', 'truth', 'question', 'people', 'activity', 'place', 'minion', 'indication', 'justic

The results are stored as a list of lists, and the content of each is as expected.
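The loop above can be wrapped in a small reusable function. This is a minimal sketch, assuming each token exposes the `lemma_`, `pos_`, and `is_stop` attributes used earlier. The names `keep_lemmas` and `fake_token` are hypothetical, and the stand-in tokens exist only so the example runs without loading a `spaCy` model.

```python
from types import SimpleNamespace

def keep_lemmas(doc, keep_pos=('NOUN',)):
    """Return lemmas of tokens that are not stopwords and have a wanted POS tag."""
    return [tok.lemma_ for tok in doc if not tok.is_stop and tok.pos_ in keep_pos]

# Stand-in tokens so the sketch runs without spaCy; a real `Doc` from `nlp()`
# exposes the same three attributes on each token.
def fake_token(lemma, pos, is_stop):
    return SimpleNamespace(lemma_=lemma, pos_=pos, is_stop=is_stop)

toy_doc = [
    fake_token('-PRON-', 'PRON', True),
    fake_token('review', 'VERB', False),
    fake_token('evidence', 'NOUN', False),
    fake_token('from', 'ADP', True),
    fake_token('the', 'DET', True),
    fake_token('science', 'NOUN', False),
]

nouns_only = keep_lemmas(toy_doc)
nouns_and_verbs = keep_lemmas(toy_doc, keep_pos=('NOUN', 'VERB'))
print(nouns_only)
print(nouns_and_verbs)
```

With the real data, the same function could be applied to every document with `[keep_lemmas(doc) for doc in processed]`.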

# Exploratory Text Analysis with `scattertext`

## Text Scatterplots and Finding Important Terms

There are many types of exploratory analysis we may try here, including 'dictionary methods', where we can do searches and counts for specific words, compare their distribution across categories of texts, and observe their usage frequency over time. To do so, we typically want to have, in advance, a list of words and phrases we are interested in, including specialized `dictionaries` of words that are related specifically to our domain of inquiry. This would be a natural time to use dictionary methods to explore your text, but we will set those aside until a later chapter and focus instead on visual approaches to exploring text data.
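For reference, the core of a dictionary method is just counting. The sketch below is a minimal illustration with a hypothetical three-word dictionary and two tiny made-up token lists. None of these values come from the workshop data.

```python
from collections import Counter

# Hypothetical dictionary of terms of interest and two tiny made-up documents.
dictionary = {'email', 'hack', 'leak'}
docs = {
    'FAKE': ['email', 'hack', 'agency', 'email', 'leak'],
    'REAL': ['campaign', 'email', 'delegate', 'primary'],
}

# Count how often each dictionary term appears in each category.
dictionary_counts = {
    label: Counter(tok for tok in tokens if tok in dictionary)
    for label, tokens in docs.items()
}
print(dictionary_counts)
```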

One common task in text analysis is comparing how often specific words appear in different categories of documents. In the example we are using here, we can compare the words that appear in fake and real news stories. The package `scattertext` simplifies this process, and produces intuitive graphs that are much more informative and useful than some common alternatives, such as word clouds (which I think you should *never* use). Along the way, you will learn a little about assigning various different scores to words depending on how much they can tell us about a text's content relative to other texts in the corpus.

It's time to use some of the other metadata from our fake news dataset. 

In [32]:
sampled_news.head()

Unnamed: 0.1,Unnamed: 0,title,text,label
6068,10063,Is Who Hacked Podesta’s Emails the Issue or th...,Print \nEver notice how the criminals deflect ...,FAKE
5910,7772,Comment on 5 Corporations Own The U.S. Media –...,"From the day we are born into this world, we...",FAKE
2735,7067,The Fix Is In: NBC Affiliate Accidentally Post...,Posted 11/03/2016 12:44 am by PatriotRising wi...,FAKE
346,9961,Podesta wiki leaks...We prefer Muslims over Ch...,Podesta wiki leaks...We prefer Muslims over Ch...,FAKE
2173,6380,Politicians Will Feel the Heat From Rising Tem...,Politicians Will Feel the Heat From Rising Tem...,FAKE


`Scattertext` is going to do some pre-processing for us in the background, so the code below will also take some time to run. Pour yourself some more coffee? 

In [33]:
corpus = st.CorpusFromPandas(sampled_news, category_col='label', text_col='text', nlp=nlp).build()

### Precision, Recall, and F-Scores

`Scattertext` identifies terms that are statistically associated with categories of documents, which in our data are 'FAKE' or 'REAL'. To do so, it uses measures of term importance that have been developed by specialists in the field of information retrieval. The key measure used is the scaled F-score. The F-score is the harmonic mean of two measures that are important to understand: *precision* and *recall*.

Imagine, for a moment, that you are evaluating the results of a search of some set of documents. If you searched for 'split labour market' and all of the results that were returned were relevant, then you would have a high precision score. If many irrelevant results were returned, you would have a lower precision score. While this is useful, it does not tell us whether or not all of the relevant documents were returned. This is where recall helps. Recall is the fraction of relevant documents that our search query actually retrieved.

In short, high precision means that we have far more relevant results than irrelevant results, and high recall means we discovered most of what is actually relevant in the document collection. Typically, increases in one are associated with decreases in the other, but when a word has high precision and high recall, it will have a higher **F-score**. `Scattertext` computes precision and recall for each category, and then computes scaled F-scores.
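The search example can be worked through in a few lines. The sketch below computes precision, recall, and the plain (unscaled) F-score for a hypothetical query; the document IDs are made up. It does not reproduce the additional scaling that `Scattertext` applies.

```python
def precision_recall_f(retrieved, relevant):
    """Compute precision, recall, and their harmonic mean (the F-score)."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# Hypothetical search: 4 documents returned, 3 of them relevant,
# out of 6 relevant documents in the whole collection.
retrieved = ['d1', 'd2', 'd3', 'd9']
relevant = ['d1', 'd2', 'd3', 'd4', 'd5', 'd6']

p, r, f = precision_recall_f(retrieved, relevant)
print(round(p, 2), round(r, 2), round(f, 2))
```

Here precision is 3/4, recall is 3/6, and the F-score is 0.6. Because it is a harmonic mean, the F-score is pulled toward the lower of the two values.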

We can use the method `.get_scaled_f_scores_vs_background()` to identify words that distinguish our fake news corpus from *general English-language data*. 

In [34]:
print(list(corpus.get_scaled_f_scores_vs_background().index[:25]))

.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
  corpus_unigram_freq = corpus_freq_df.ix[[term for term


['obama', 'trump', 'comey', 'facebook', 'obamacare', 'kasich', 'twitter', 'hillary', 'dapa', 'rubio', 'barack', 'wikileaks', 'zika', 'romney', 'podesta', 'manafort', 'clinton', 'rannazzisi', 'tweeted', 'mateen', 'sanders', 'abedin', 'airstrikes', 'eurozone', 'whiteness']


It is also possible to have `Scattertext` return words that are associated most with each of our two categories. Here we construct a second `dataframe` from our `Scattertext` corpus that contains the term frequencies for fake and real news stories. The `dataframe` can be sorted to display words with high F-scores in either category. 

In [35]:
term_df = corpus.get_term_freq_df()
term_df['fscore_fake'] = corpus.get_scaled_f_scores('FAKE')
term_df['fscore_real'] = corpus.get_scaled_f_scores('REAL')

In [36]:
topFake = term_df.sort_values(by='fscore_fake', ascending=False).reset_index()[:25]
topReal = term_df.sort_values(by='fscore_real', ascending=False).reset_index()[:25]

In [37]:
topFake

Unnamed: 0,term,FAKE freq,REAL freq,fscore_fake,fscore_real
0,comey,106,11,1.0,0.0
1,october,111,18,0.9925,0.0075
2,emails,102,22,0.984797,0.015203
3,global,71,12,0.97951,0.02049
4,russia,178,52,0.974928,0.025072
5,fbi,128,38,0.974234,0.025766
6,the fbi,78,20,0.973068,0.026932
7,human,91,29,0.969136,0.030864
8,order,90,29,0.968435,0.031565
9,the us,85,28,0.966088,0.033912


You should be able to see the frequencies and F-scores for both fake and real news stories. You will likely notice that many words with strong associations with one category have weak associations with the other category. Some words are strongly associated with both categories. Depending on what we want to know, we can think of these words as domain-specific stopwords. Can you think of any examples? 

Our next task is to better understand these relationships by plotting category-term associations with a scatterplot. `Scattertext` uses a JavaScript library called `D3` to produce interactive graphs that run in your browser. Hovering over points reveals additional information about words that are statistically associated with our categories. 

We will run some code to display the results in our notebook, but it will also write a file to disk that you can open up later and explore. The file may take some time to load. That's normal. 

In [38]:
from IPython.display import IFrame, display, HTML
display(HTML("<style>.container { width:98% !important; }</style>"))

In [39]:
html = st.produce_scattertext_explorer(corpus, category='FAKE', category_name='FAKE', not_category_name='REAL', width_in_pixels=1000)


In [40]:
file_name = 'html/news.html'
# write the interactive plot to disk; the context manager closes the file
with open(file_name, 'w', encoding='utf-8') as f:
    f.write(html)
IFrame(src=file_name, width=1200, height=700)