# Scattertext Viz of News Headlines

The following visualizations of the differences between right- and left-biased headlines were produced with the [Scattertext](https://github.com/JasonKessler/scattertext) library for Python. This notebook is adapted from Jason Kessler's PyData [tutorial](https://github.com/JasonKessler/Scattertext-PyData/blob/master/PyData-Scattertext-Part-1.ipynb).

### Install Libraries

In [None]:
!pip install scattertext
!pip install spacy
!python -m spacy download en

In [None]:
!pip install news-please
!pip install fuzzywuzzy
!pip install python-Levenshtein

### Import Modules

In [3]:
# import modules
%matplotlib inline
import scattertext as st
import re, io
from pprint import pprint
import pandas as pd
import numpy as np
from scipy.stats import rankdata, hmean, norm
import spacy
import os, pkgutil, json, urllib
from urllib.request import urlopen
from IPython.display import IFrame
from IPython.core.display import display, HTML
from scattertext import CorpusFromPandas, produce_scattertext_explorer
display(HTML("<style>.container { width:98% !important; }</style>"))
nlp = spacy.load('en')

### Load AllSides Media Headlines

In [4]:
def read_data(filename):
    # read in csv
    df = pd.read_csv(filename, encoding='utf-8')
    
    # keep only the bias, source, and headline columns
    df = df.loc[:, ['bias', 'source', 'headline']]
    df = df.rename(index=str, columns={"headline": "text"})
    
    return df

In [5]:
convention_df = read_data('news-corpus-df.csv')
convention_df.head()

Unnamed: 0,bias,source,text
0,Left,Washington Post,b'How trend-riding Trump is taking credit for ...
1,Right,Wall Street Journal- Editorial,b'It\xe2\x80\x99s Trump\xe2\x80\x99s Economy Now'
2,Center,USA TODAY,b'The Bubble: By undoing Obama accomplishments...
3,Center,Wall Street Journal- News,b'Trump Lawyer Michael Cohen\xe2\x80\x99s Atto...
4,Left,Vox,b'Reports suggest Michael Cohen is thinking of...
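
The `text` column above appears to hold the printed form of Python `bytes` objects (`b'...'` with `\xe2\x80\x99` escapes) rather than decoded strings. If left that way, the cleaning step later in the notebook breaks each escape into stray tokens such as `xe`, which may be why terms like `trump xe` appear in the score lists. Below is a minimal sketch of one way to recover readable text, assuming every value is a complete bytes literal encoded as UTF-8. `decode_bytes_repr` is a hypothetical helper, not part of the original notebook.

```python
import ast

def decode_bytes_repr(value):
    """Turn a string like "b'It\\xe2\\x80\\x99s'" back into readable text.

    Values that are not bytes literals are returned unchanged.
    """
    if not (isinstance(value, str) and value[:2] in ("b'", 'b"')):
        return value
    try:
        raw = ast.literal_eval(value)  # safely evaluates the literal only
    except (ValueError, SyntaxError):
        return value  # truncated or malformed literal: leave untouched
    if isinstance(raw, bytes):
        return raw.decode("utf-8", errors="replace")
    return value

sample = r"b'It\xe2\x80\x99s Trump\xe2\x80\x99s Economy Now'"
print(decode_bytes_repr(sample))
```

If the assumption holds for the whole file, the helper could be applied once after loading, for example with `df['text'].apply(decode_bytes_repr)`.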


In [6]:
print("Document Count")
print(convention_df.groupby('bias')['text'].count())
print("Word Count")

Document Count
bias
Center      807
Left       1316
Right      1039
Name: text, dtype: int64
Word Count


### Text Preprocessing

In [7]:
# import key modules
import re
import string

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer

[nltk_data] Downloading package stopwords to /content/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [11]:
# strip texts of punctuation, boilerplate, and stop words
def text_prepare(text):
    """
        text: a string
        return: modified initial string
    """
    text = text.lower()
    text = text.replace('\n',' ')
    
    letters = list(string.ascii_lowercase)
    numbers = ['0','1','2','3','4','5','6','7','8','9']
    banned = ["’","’","“","—","”","‘","–",'#','[','/','(',')','{','}','\\','[',']','|','@',',',';','+','-']
    banned = ''.join(banned) + string.punctuation + ''.join(numbers)
    banned = banned.replace(".", "")
    stop_list = set(stopwords.words('english') + letters)
    
    translation_table = dict.fromkeys(map(ord, banned), ' ')
    text = text.translate(translation_table)
    text = re.sub(' +',' ',text)
    text = ' '.join([word for word in text.split() if word not in stop_list])
    return text
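
To see what this cleaning does to a headline, here is a small self-contained sketch of the same steps. It is an illustration only: `DEMO_STOP_WORDS` is a hypothetical, tiny stand-in for NLTK's English stop word list so that the example runs without downloading a corpus, and `text_prepare_demo` is not the function used in the rest of the notebook.

```python
import string

# Hypothetical stand-in for NLTK's English stop words, plus single letters,
# mirroring how the notebook adds the alphabet to its stop list.
DEMO_STOP_WORDS = {"the", "is", "of", "for", "to", "in", "and"} | set(string.ascii_lowercase)

def text_prepare_demo(text, stop_words=DEMO_STOP_WORDS):
    """Lower-case, blank out punctuation and digits (except '.'), drop stop words."""
    text = text.lower().replace("\n", " ")
    # Everything in `banned` is replaced by a space; periods are kept on purpose.
    banned = (string.punctuation + string.digits + "’‘“”—–").replace(".", "")
    text = text.translate(dict.fromkeys(map(ord, banned), " "))
    return " ".join(word for word in text.split() if word not in stop_words)

print(text_prepare_demo("Trump's Tax Plan: 5 Things to Know"))  # trump tax plan things know
print(text_prepare_demo("Police Clash in St. Louis"))           # police clash st. louis
```

Because periods survive, abbreviations such as `st.` keep their dot, which matches the cleaned rows shown further down.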

In [12]:
# shuffle df for random sampling
convention_df = convention_df.sample(frac=1).reset_index(drop=True)

In [13]:
# rewrite df with cleaned text
convention_df['text'] = convention_df['text'].apply(text_prepare)
  
convention_df.head()

Unnamed: 0,bias,source,text
0,Center,BBC News,trump nfl row us president denies comments rac...
1,Center,Reuters,trump discuss tax plan senate republicans next...
2,Left,CNN (Web News),trump lay plan combating radical islamic terro...
3,Left,Washington Post,police protesters clash st. louis former offic...
4,Right,Fox News,tillerson grilled secretary state confirmation...


### Transform DF into Scattertext corpus

In [14]:
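# word count per bias category (assigned so the result is not silently discarded)
word_counts = convention_df.groupby('bias').apply(lambda x: x.text.apply(lambda x: len(x.split())).sum())
# parse every cleaned headline with spaCy
convention_df['parsed'] = convention_df.text.apply(nlp)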

In [15]:
corpus = st.CorpusFromParsedDocuments(convention_df, category_col='bias', parsed_col='parsed').build()

In [16]:
# remove stop words
# list of words in corpus: corpus._term_idx_store
stop_word_list = ['via getty', 'inbox', 'subscribe', '×', 'close ×', 'screen close', 'full screen', 'buy second', 'second continue', 'story continued', 'llc permission', '―', 'xe', '\\xe2\\x80\\x99', 'news', 'for reprint', 'llc', 'post', 'click', 'to', '’ve', 'unsupported on', 'share', 'that ’s', 'still', 'got', 'it', '37', 'of his', 'this report', 'ofs', 'fox', 'photos', '’m', 'is the', 's.', 'around', 'times', 'also', 'the', 'copyright', 'washington times', 'mr', 'press', 'wait', 'associated', 'unsubscribe', 'view', 'photo wait', 'http', '#', 'associated press', 'more videos', 'get', 'just watched', 'permission', 'however', 'b.', 'ms.', 'here©', 'device', 'copyright ©', 'paste', '10', 'the associated', 'contributed to', 'hide', 'and his', 'videos', 'said mr.', '_', '©', 'contributed', 'embed', 'n’t', '/', 'something', 'i', 'that they', 'read', 'for a', 'playback', 'must watch', 'washington post', 'just', 'to get', 'r', 'read more', 'toggle', 'more', 'i ’m', 'follow', 'is', 'https', ' ', 'said', 'mr.', 'unsupported', 'or blog', 'your device', 'for', 'cnn', 'of 76', 'that', 'ms', 'andhis', 'click here', 'or share', 'replay', 'press contributed', 'they', 'must', 'prof', 'www', 'it ’s', 'told', '’re', 'the washington', '1', "'s rise", '© 2018', 'to this', 'skip', 'around the', 'blog', 'cut', 'told fox', 'mrs.', 'hide caption', 'ad', 'watched', '/ the', 'replay more', 'and the', '’s', '2018', 'copy', '&', 'read or', 'reprint permission', 'are', 'told cnn', 'watch', 'here for', 'also said', 'copy this', 'reprint', 'report', 'advertisement', 'mrs', 'caption', 'autoplay', 'fox news', 'dr', 'enlarge', 'times llc', '76', 'photo', 'this']
stop_word_list = list(set(stop_word_list))

update_stop = []
for term in stop_word_list:
  if term in corpus._term_idx_store:
    update_stop.append(term)
corpus = corpus.remove_terms(update_stop)

In [17]:
term_freq_df = corpus.get_term_freq_df()
term_freq_df.head()

Unnamed: 0_level_0,Center freq,Left freq,Right freq
term,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
nfl row,1,0,0
nfl,4,3,4
related,3,1,1
race,12,12,11
president,11,39,31


In [18]:
term_freq_df = corpus.get_term_freq_df()
list(term_freq_df.columns.values)

['Center  freq', 'Left  freq', 'Right  freq']

In [19]:
print(list(corpus.get_scaled_f_scores_vs_background().index[:10]))

['obamacare', 'comey', 'tillerson', 'tweets', 'manafort', 'trump', 'brexit', 'wikileaks', 'scaramucci', 'priebus']


In [20]:
term_freq_df = corpus.get_term_freq_df()
term_freq_df['Left Score'] = \
corpus.get_scaled_f_scores('Left ')
pprint(list(term_freq_df.sort_values(by='Left Score', ascending=False).index[0:25]))

['clinton xe',
 'explained',
 'test',
 'hillary clinton',
 'want',
 'job',
 'conway',
 'intelligence',
 'finally',
 'pushes',
 'making',
 'trump xe',
 'win',
 'jeff sessions',
 'line',
 'missile',
 'trump white',
 'donald trump',
 'opinion',
 'donald',
 'jeff',
 'america',
 'ryan',
 'lead',
 'team']


In [21]:
term_freq_df['Right Score'] = \
corpus.get_scaled_f_scores('Right ')
pprint(list(term_freq_df.sort_values(by='Right Score', 
                                      ascending=False).index[0:25]))

['slams',
 'trump rips',
 'schumer',
 'trump hits',
 'vows',
 'hits',
 'go',
 'think',
 'media',
 'trump slams',
 'rips',
 'hill',
 'dem',
 'lies',
 'dems',
 'americans',
 'taking',
 'left',
 'away',
 'look',
 'fact',
 'comments',
 'un',
 'terror',
 'list']


In [22]:
term_freq_df = corpus.get_term_freq_df()
term_freq_df['dem_precision'] = term_freq_df['Left  freq'] * 1./(term_freq_df['Left  freq'] + term_freq_df['Right  freq'])
term_freq_df['dem_freq_pct'] = term_freq_df['Left  freq'] * 1./term_freq_df['Left  freq'].sum()
term_freq_df['dem_hmean'] = term_freq_df.apply(lambda x: (hmean([x['dem_precision'], x['dem_freq_pct']])
                                                                   if x['dem_precision'] > 0 and x['dem_freq_pct'] > 0 
                                                                   else 0), axis=1)                                                        
term_freq_df.sort_values(by='dem_hmean', ascending=False).iloc[:10]

Unnamed: 0_level_0,Center freq,Left freq,Right freq,dem_precision,dem_freq_pct,dem_hmean
term,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
trump,368,617,471,0.567096,0.032616,0.061685
clinton,35,101,62,0.619632,0.005339,0.010587
donald trump,18,84,39,0.682927,0.00444,0.008824
donald,18,84,40,0.677419,0.00444,0.008823
trump xe,12,81,35,0.698276,0.004282,0.008512
house,54,68,61,0.527132,0.003595,0.007141
says,42,64,54,0.542373,0.003383,0.006724
new,27,48,55,0.466019,0.002537,0.005047
hillary,8,47,30,0.61039,0.002485,0.004949
senate,40,47,49,0.489583,0.002485,0.004944
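
As a check on the table above, the row for `trump` can be reproduced by hand. The sketch below uses only the standard library. The counts and the frequency share are copied from that row, and the variable names simply mirror the column names.

```python
from statistics import harmonic_mean

# Values for the term "trump", copied from the table above.
left_freq, right_freq = 617, 471
dem_freq_pct = 0.032616  # share of all Left term occurrences

# Precision: of the Left + Right uses of the term, the fraction that are Left.
dem_precision = left_freq / (left_freq + right_freq)

# Harmonic mean of precision and frequency share.
dem_hmean = harmonic_mean([dem_precision, dem_freq_pct])

print(round(dem_precision, 6))  # 0.567096
print(round(dem_hmean, 4))      # 0.0617
```

The harmonic mean is pulled toward the smaller of its two inputs. The frequency share is far smaller than the precision for every term here, so this ranking mostly follows raw frequency, as the top rows of the table show.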


### Visualizations

#### Raw Frequency

In [24]:
html = produce_scattertext_explorer(corpus,
                                    category='Left ',
                                    category_name='Democratic',
                                    not_category_name='Republican',
                                    width_in_pixels=1000,
                                    minimum_term_frequency=5,
                                    transform=st.Scalers.scale,
                                    metadata=convention_df['source'])
file_name = 'AllSides_Scattertext_Scale.html'
open(file_name, 'wb').write(html.encode('utf-8'))
IFrame(src=file_name, width = 1200, height=700)

**Log Scale**

In [None]:
html = st.produce_scattertext_explorer(corpus,
                                       category='Left ',
                                       category_name='Democratic',
                                       not_category_name='Republican',
                                       minimum_term_frequency=5,
                                       width_in_pixels=1000,
                                       transform=st.Scalers.log_scale_standardize)
file_name = 'AllSides_Scattertext_Log.html'
open(file_name, 'wb').write(html.encode('utf-8'))
IFrame(src=file_name, width = 1200, height=700)

**Rank terms by frequency percentiles instead of raw frequencies**

In [None]:
html = produce_scattertext_explorer(corpus,
                                    category='Left ',
                                    category_name='Democratic',
                                    not_category_name='Republican',
                                    width_in_pixels=1000,
                                    minimum_term_frequency=5,
                                    transform=st.Scalers.percentile,
                                    metadata=convention_df['source'])
file_name = 'AllSides_Scattertext_RankData.html'
open(file_name, 'wb').write(html.encode('utf-8'))
IFrame(src=file_name, width = 1200, height=700)

**Fall back to alphabetic order among equally frequent terms**

In [41]:
html = produce_scattertext_explorer(corpus,
                                    category='Left ',
                                    category_name='Democratic',
                                    not_category_name='Republican',
                                    width_in_pixels=1000,
                                    minimum_term_frequency=5,
                                    metadata=convention_df['source'],
                                    term_significance = st.LogOddsRatioUninformativeDirichletPrior())
file_name = 'AllSides_Scattertext_RankDefault.html'
open(file_name, 'wb').write(html.encode('utf-8'))
IFrame(src=file_name, width = 1200, height=700)
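
The idea behind alphabetical tie-breaking can be illustrated without Scattertext. The sketch below is not the library's implementation, and `percentile_alphabetical_demo` is a hypothetical name. It orders terms by count and then by spelling, so that equally frequent terms still receive distinct, evenly spaced positions; the direction of the alphabetical ordering is an assumption. The counts are Left frequencies taken from the table earlier in the notebook.

```python
def percentile_alphabetical_demo(freqs):
    """Map {term: count} to {term: position in [0, 1]}.

    Terms are ordered by count, then alphabetically, so equal counts
    do not collapse onto the same position.
    """
    ordered = sorted(freqs, key=lambda term: (freqs[term], term))
    last = max(len(ordered) - 1, 1)  # avoid dividing by zero for a single term
    return {term: rank / last for rank, term in enumerate(ordered)}

toy_counts = {"senate": 47, "hillary": 47, "house": 68, "trump": 617}
positions = percentile_alphabetical_demo(toy_counts)
for term, pos in positions.items():
    print(term, round(pos, 3))
```

Here `hillary` and `senate` share a count of 47 but land at different positions, which spreads tied terms along the axis instead of stacking them.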

**L2-penalized logistic regression coefficients vs. log term frequency**

In [26]:
def scale(ar): 
    return (ar - ar.min()) / (ar.max() - ar.min())

def zero_centered_scale(ar):
    scores = np.zeros(len(ar))
    scores[ar > 0] = scale(ar[ar > 0])
    scores[ar < 0] = -scale(-ar[ar < 0])
    return (scores + 1) / 2.

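# use a fresh frequency table: term_freq_df also carries the score columns added earlier
frequencies_scaled = scale(np.log(corpus.get_term_freq_df().sum(axis=1).values))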
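
To make the behaviour of `zero_centered_scale` concrete, here is a pure-Python restatement of the same logic on a handful of made-up coefficients. The `_demo` functions and the input values are hypothetical and exist only for illustration; the notebook itself uses the NumPy versions above.

```python
def scale_demo(values):
    """Min-max scale a list of numbers to [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def zero_centered_scale_demo(values):
    """Scale positives and negatives separately, then map to [0, 1] around 0.5."""
    pos = [v for v in values if v > 0]
    neg = [-v for v in values if v < 0]
    pos_scaled = dict(zip(pos, scale_demo(pos))) if pos else {}
    neg_scaled = dict(zip(neg, scale_demo(neg))) if neg else {}
    out = []
    for v in values:
        if v > 0:
            s = pos_scaled[v]
        elif v < 0:
            s = -neg_scaled[-v]
        else:
            s = 0.0
        out.append((s + 1) / 2)
    return out

coefs = [-4.0, -1.0, 0.0, 2.0, 4.0, 6.0]
print(zero_centered_scale_demo(coefs))  # [0.0, 0.5, 0.5, 0.5, 0.75, 1.0]
```

Two properties are worth noting. The smallest positive and the smallest-magnitude negative value both land on the midline at 0.5, together with exact zeros. And because each side is min-max scaled on its own, a side whose values are all identical (including a side with one value) would divide by zero.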

In [None]:
from sklearn.linear_model import LogisticRegression
scores = corpus.get_logreg_coefs('Left ',
                                 LogisticRegression(penalty='l2', C=10, max_iter=10000, n_jobs=-1))
scores_scaled = zero_centered_scale(scores)

html = produce_scattertext_explorer(corpus,
                                    category='Left ',
                                    category_name='Democratic',
                                    not_category_name='Republican',
                                    minimum_term_frequency=5,
                                    width_in_pixels=1000,
                                    x_coords=frequencies_scaled,
                                    y_coords=scores_scaled,
                                    scores=scores,
                                    sort_by_dist=False,
                                    metadata=convention_df['source'],
                                    x_label='Log frequency',
                                    y_label='L2-Penalized Log Reg Coef')
file_name = 'AllSides_L2vsLog.html'
open(file_name, 'wb').write(html.encode('utf-8'))
IFrame(src=file_name, width = 1200, height=700)

**Scaled F-Score**

In [27]:
html = produce_scattertext_explorer(corpus,
                                    category='Left ',
                                    category_name='Democratic',
                                    not_category_name='Republican',
                                    minimum_term_frequency=5,
                                    width_in_pixels=1000,
                                    x_coords=frequencies_scaled,
                                    y_coords=corpus.get_scaled_f_scores('Left ', beta=0.5),
                                    scores=corpus.get_scaled_f_scores('Left ', beta=0.5),
                                    sort_by_dist=False,
                                    metadata=convention_df['source'],
                                    x_label='Log Frequency',
                                    y_label='Scaled F-Score')
file_name = 'AllSides_SFSvsLog.html'
open(file_name, 'wb').write(html.encode('utf-8'))
IFrame(src=file_name, width = 1200, height=700)

#### Log-Odds Ratio

In [28]:
freq_df = corpus.get_term_freq_df().rename(columns={'Left  freq': 'y_dem', 'Right  freq': 'y_rep'})
a_w = 0.01
y_i, y_j = freq_df['y_dem'].values, freq_df['y_rep'].values

In [30]:
n_i, n_j = y_i.sum(), y_j.sum()
a_0 = len(freq_df) * a_w
delta_i_j = (  np.log((y_i + a_w) / (n_i + a_0 - y_i - a_w))
                 - np.log((y_j + a_w) / (n_j + a_0 - y_j - a_w)))
var_delta_i_j = ( 1./(y_i + a_w) + 1./(n_i + a_0 - y_i - a_w)
                    + 1./(y_j + a_w) + 1./(n_j + a_0 - y_j - a_w))
zeta_i_j = delta_i_j/np.sqrt(var_delta_i_j)
max_abs_zeta = max(zeta_i_j.max(), -zeta_i_j.min())
zeta_scaled_for_charting = ((((zeta_i_j > 0).astype(float) * (zeta_i_j/max_abs_zeta))*0.5 + 0.5)
                            + ((zeta_i_j < 0).astype(float) * (zeta_i_j/max_abs_zeta) * 0.5))
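
For a single term, the z-score of the log-odds ratio with an uninformative Dirichlet prior can be worked through with the standard library alone. The sketch below is illustrative: `log_odds_z` is a hypothetical helper and all counts are made up. The variance uses the full four-term form, with each category's own total (`n_i` or `n_j`) in its "everything else" denominator, matching the denominators of the log-odds difference.

```python
import math

def log_odds_z(y_i, y_j, n_i, n_j, vocab_size, a_w=0.01):
    """z-score of the smoothed log-odds difference for one term.

    y_i, y_j   -- term counts in categories i and j
    n_i, n_j   -- total term counts in categories i and j
    vocab_size -- number of distinct terms (the prior mass is vocab_size * a_w)
    """
    a_0 = vocab_size * a_w
    delta = (math.log((y_i + a_w) / (n_i + a_0 - y_i - a_w))
             - math.log((y_j + a_w) / (n_j + a_0 - y_j - a_w)))
    variance = (1 / (y_i + a_w) + 1 / (n_i + a_0 - y_i - a_w)
                + 1 / (y_j + a_w) + 1 / (n_j + a_0 - y_j - a_w))
    return delta / math.sqrt(variance)

# Hypothetical counts: two categories of 1,000 tokens each, 500 distinct terms.
print(log_odds_z(30, 10, 1000, 1000, 500) > 0)  # True: term leans toward category i
print(log_odds_z(10, 10, 1000, 1000, 500))      # 0.0: identical usage
```

Dividing by the standard deviation is what keeps rare terms from dominating: a term seen once in one category and never in the other has a large log-odds difference but also a large variance, so its z-score stays modest.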

In [None]:
html = produce_scattertext_explorer(corpus,
                                    category='Left ',
                                    category_name='Democratic',
                                    not_category_name='Republican',
                                    minimum_term_frequency=5,
                                    width_in_pixels=1000,
                                    x_coords=frequencies_scaled,
                                    y_coords=zeta_scaled_for_charting,
                                    scores=zeta_i_j,
                                    sort_by_dist=False,
                                    metadata=convention_df['source'],
                                    x_label='Log Frequency',
                                    y_label='Log Odds Ratio w/ Uninformative Prior (alpha_w=0.01)')
file_name = 'LOPriorvsLog.html'
open(file_name, 'wb').write(html.encode('utf-8'))
IFrame(src=file_name, width = 1200, height=700)

#### Cornerstone

In [None]:
corner_scores = corpus.get_corner_scores('Left ')
html = produce_scattertext_explorer(corpus,
                                    category='Left ',
                                    category_name='Democratic',
                                    not_category_name='Republican',
                                    minimum_term_frequency=5,
                                    width_in_pixels=1000,
                                    x_coords=frequencies_scaled,
                                    y_coords=corner_scores,
                                    scores=corner_scores,
                                    sort_by_dist=False,
                                    metadata=convention_df['source'],
                                    x_label='Log Frequency',
                                    y_label='Corner Scores')
file_name = 'CornervsLog.html'
open(file_name, 'wb').write(html.encode('utf-8'))
IFrame(src=file_name, width = 1200, height=700)