# Word frequency

First step in exploration: which words occur more frequently in one data set versus another?

1. [Load data](#Load-data)
2. [Raw frequency](#Raw-frequency)
3. [Normalized frequency](#Normalized-frequency)
4. [TF-IDF](#TF-IDF)
5. [Exploration](#Exploration)

### Load data

In [1]:
import pandas as pd
fake_news_data = pd.read_csv('data/fakeNewsDatasets/fake_news_small.tsv', sep='\t', index_col=False)
real_news_data = pd.read_csv('data/fakeNewsDatasets/real_news_small.tsv', sep='\t', index_col=False)
display(fake_news_data.head())
display(real_news_data.head())
print(fake_news_data.shape[0])
# print(fake_news_data[5])
# print(real_news_data[5])

Unnamed: 0,title,text,topic,id
0,"Alex Jones Vindicated in ""Pizzagate"" Controversy","""Alex Jones, purveyor of the independent inves...",biz,1
1,THE BIG DATA CONSPIRACY,so that in the no so far future can institute ...,biz,2
2,California Surprisingly Lenient on Auto Emissi...,"Setting Up Face-Off With Trump ""California's c...",biz,3
3,Mexicans Are Chomping at the Bit to Stop NAFTA...,Mexico has been unfairly gaining from NAFTA as...,biz,4
4,Breaking News: Snapchat to purchase Twitter fo...,Yahoo and AOL could be extremely popular over ...,biz,5


Unnamed: 0,title,text,topic,id
0,"Alex Jones Vindicated in ""Pizzagate"" Controversy","""Alex Jones, purveyor of the independent inves...",biz,1
1,THE BIG DATA CONSPIRACY,so that in the no so far future can institute ...,biz,2
2,California Surprisingly Lenient on Auto Emissi...,"Setting Up Face-Off With Trump ""California's c...",biz,3
3,Mexicans Are Chomping at the Bit to Stop NAFTA...,Mexico has been unfairly gaining from NAFTA as...,biz,4
4,Breaking News: Snapchat to purchase Twitter fo...,Yahoo and AOL could be extremely popular over ...,biz,5


240


### Raw frequency

As a first step, let's look at raw word frequency.
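
Before running this on the news data, it may help to see what "raw frequency" means on a tiny corpus. The sketch below is illustrative only: the two toy documents are made up, and whitespace splitting stands in for the `WordPunctTokenizer` used in the notebook. It builds a small document-term matrix by hand and sums each column.

```python
from collections import Counter

# Toy corpus (hypothetical; not drawn from the news data).
docs = ["trump said trump will", "school students said"]

# Whitespace tokenization stands in for WordPunctTokenizer here.
tokenized = [doc.split() for doc in docs]
vocab = sorted({tok for toks in tokenized for tok in toks})

# Document-term matrix: one row per document, one column per vocabulary word.
dtm = [[Counter(toks)[word] for word in vocab] for toks in tokenized]

# Raw frequency of a word = sum of its column across all documents.
raw_frequency = {word: sum(row[j] for row in dtm) for j, word in enumerate(vocab)}
print(raw_frequency)
```

`CountVectorizer` does the same bookkeeping with a sparse matrix, which is why the cells below sum over `axis=0` to get per-word counts.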

In [3]:
# word frequency
from nltk.tokenize import WordPunctTokenizer
from sklearn.feature_extraction.text import CountVectorizer
from stop_words import get_stop_words
en_stops = get_stop_words('en')
tokenizer = WordPunctTokenizer()
cv = CountVectorizer(min_df=0.001, max_df=0.75, 
                     tokenizer=tokenizer.tokenize, stop_words=en_stops,
                     ngram_range=(1,1))
# get vocab for all data
# pd.concat replaces Series.append, which was removed in pandas 2.0
combined_txt = pd.concat([fake_news_data.loc[:, 'text'], real_news_data.loc[:, 'text']])
combined_txt_dtm = cv.fit_transform(combined_txt)
sorted_vocab = list(sorted(cv.vocabulary_.keys(), key=cv.vocabulary_.get))
# get separate DTM for each news data
cv = CountVectorizer(min_df=0.001, max_df=0.75, tokenizer=tokenizer.tokenize, stop_words=en_stops, vocabulary=sorted_vocab)
fake_news_dtm = cv.fit_transform(fake_news_data.loc[:, 'text'].values)
real_news_dtm = cv.fit_transform(real_news_data.loc[:, 'text'].values)

In [4]:
## top words
import numpy as np
fake_news_dtm_top_words = pd.Series(np.array(fake_news_dtm.sum(axis=0))[0], index=sorted_vocab).sort_values(ascending=False)
real_news_dtm_top_words = pd.Series(np.array(real_news_dtm.sum(axis=0))[0], index=sorted_vocab).sort_values(ascending=False)
print(fake_news_dtm_top_words.head(20))
print(real_news_dtm_top_words.head(20))

'            384
"            343
s            278
-            182
will         161
trump        113
said          90
."            89
new           86
president     76
t             61
time          56
can           54
school        53
one           53
now           49
many          48
just          43
years         43
students      42
dtype: int64
'            384
"            343
s            278
-            182
will         161
trump        113
said          90
."            89
new           86
president     76
t             61
time          56
can           54
school        53
one           53
now           49
many          48
just          43
years         43
students      42
dtype: int64


In [5]:
# per-topic
article_topics = fake_news_data.loc[:, 'topic'].unique()
en_stops = get_stop_words('en')
tokenizer = WordPunctTokenizer()
top_k = 20
for topic_i in article_topics:
    print(f'topic = {topic_i}')
    fake_news_data_i = fake_news_data[fake_news_data.loc[:, 'topic']==topic_i]
    real_news_data_i = real_news_data[real_news_data.loc[:, 'topic']==topic_i]
    # get vocab, compute counts, etc.
    cv = CountVectorizer(min_df=0.001, max_df=0.75, 
                         tokenizer=tokenizer.tokenize, stop_words=en_stops,
                         ngram_range=(1,1))
    # pd.concat replaces Series.append, which was removed in pandas 2.0
    combined_txt_i = pd.concat([fake_news_data_i.loc[:, 'text'], real_news_data_i.loc[:, 'text']])
    combined_txt_dtm_i = cv.fit_transform(combined_txt_i)
    sorted_vocab_i = list(sorted(cv.vocabulary_.keys(), key=cv.vocabulary_.get))
    # get separate DTM for each news data
    cv = CountVectorizer(min_df=0.001, max_df=0.75, 
                         tokenizer=tokenizer.tokenize, stop_words=en_stops,
                         ngram_range=(1,1), vocabulary=sorted_vocab_i)
    fake_news_dtm_i = cv.fit_transform(fake_news_data_i.loc[:, 'text'].values)
    real_news_dtm_i = cv.fit_transform(real_news_data_i.loc[:, 'text'].values)
    # get top counts
    fake_news_dtm_top_words_i = pd.Series(np.array(fake_news_dtm_i.sum(axis=0))[0], index=sorted_vocab_i).sort_values(ascending=False).head(top_k)
    real_news_dtm_top_words_i = pd.Series(np.array(real_news_dtm_i.sum(axis=0))[0], index=sorted_vocab_i).sort_values(ascending=False).head(top_k)
    print('top words for fake news articles')
    display(fake_news_dtm_top_words_i)
    print('top words for real news articles')
    display(real_news_dtm_top_words_i)

topic = biz
top words for fake news articles


'            41
"            34
s            33
will         28
-            28
uk           21
said         19
$            14
."           13
eu           13
trump        13
deal         13
company      11
many         10
european     10
companies    10
now           9
may           8
last          8
new           8
dtype: int64

top words for real news articles


'            41
"            34
s            33
will         28
-            28
uk           21
said         19
$            14
."           13
eu           13
trump        13
deal         13
company      11
many         10
european     10
companies    10
now           9
may           8
last          8
new           8
dtype: int64

topic = edu
top words for fake news articles


"            53
school       47
'            45
students     38
s            31
-            23
will         22
education    20
trump        15
president    12
new          12
children     11
schools      10
first        10
parents      10
."           10
law          10
said         10
student      10
time         10
dtype: int64

top words for real news articles


"            53
school       47
'            45
students     38
s            31
-            23
will         22
education    20
trump        15
president    12
new          12
children     11
schools      10
first        10
parents      10
."           10
law          10
said         10
student      10
time         10
dtype: int64

topic = entmt
top words for fake news articles


"            106
s             64
-             31
will          29
."            25
t             24
one           17
new           16
time          16
show          16
said          14
also          14
fans          13
way           12
last          11
now           11
(             11
just          11
character     10
actress       10
dtype: int64

top words for real news articles


"            106
s             64
-             31
will          29
."            25
t             24
one           17
new           16
time          16
show          16
said          14
also          14
fans          13
way           12
last          11
now           11
(             11
just          11
character     10
actress       10
dtype: int64

topic = polit
top words for fake news articles


"             59
s             56
president     50
clinton       29
-             25
donald        22
said          20
white         16
house         16
washington    15
will          14
."            13
t             11
us            11
cnn           11
just          11
obama         11
great         11
(             11
)             11
dtype: int64

top words for real news articles


"             59
s             56
president     50
clinton       29
-             25
donald        22
said          20
white         16
house         16
washington    15
will          14
."            13
t             11
us            11
cnn           11
just          11
obama         11
great         11
(             11
)             11
dtype: int64

topic = sports
top words for fake news articles


"         55
s         51
-         41
will      26
game      24
team      23
."        21
said      18
two       16
one       14
years     13
year      13
last      12
time      11
new       10
brazil    10
world     10
race       9
sports     9
just       9
dtype: int64

top words for real news articles


"         55
s         51
-         41
will      26
game      24
team      23
."        21
said      18
two       16
one       14
years     13
year      13
last      12
time      11
new       10
brazil    10
world     10
race       9
sports     9
just       9
dtype: int64

topic = tech
top words for fake news articles


'           49
s           43
will        42
"           36
new         34
-           34
can         16
amazon      14
google      13
now         12
many        11
apple       10
said         9
(            9
t            9
devices      9
time         9
world        9
research     8
youtube      8
dtype: int64

top words for real news articles


'           49
s           43
will        42
"           36
new         34
-           34
can         16
amazon      14
google      13
now         12
many        11
apple       10
said         9
(            9
t            9
devices      9
time         9
world        9
research     8
youtube      8
dtype: int64

In [6]:
def compute_frequency(text_data, tokenizer, stops, vocab):
    cv = CountVectorizer(tokenizer=tokenizer.tokenize, stop_words=stops,
                         ngram_range=(1,1), vocabulary=vocab)
    dtm = cv.fit_transform(text_data)
    word_frequency = np.array(dtm.sum(axis=0))[0]
    word_frequency = pd.Series(word_frequency, index=vocab)
    return word_frequency

In [7]:
fake_news_text = fake_news_data.loc[:, 'text'].values
real_news_text = real_news_data.loc[:, 'text'].values
fake_news_word_frequency = compute_frequency(fake_news_text, tokenizer, en_stops, sorted_vocab)
real_news_word_frequency = compute_frequency(real_news_text, tokenizer, en_stops, sorted_vocab)
# compute difference
fake_vs_real_news_word_frequency_diff = fake_news_word_frequency - real_news_word_frequency
fake_vs_real_news_word_frequency_diff.sort_values(inplace=True, ascending=False)
# show words with highest/lowest difference
top_k = 20
print('words that occurred in more fake news articles')
print(fake_vs_real_news_word_frequency_diff.head(top_k))
print('words that occurred in more real news articles')
print(fake_vs_real_news_word_frequency_diff.tail(top_k))

words that occurred in more fake news articles
���            0
eu             0
espen          0
esporte        0
essentially    0
estate         0
estimated      0
et             0
ethnically     0
ethnicity      0
euro           0
escorted       0
euronext       0
europe         0
european       0
eurpoe         0
evasions       0
eve            0
even           0
evening        0
dtype: int64
words that occurred in more real news articles
player      0
played      0
placing     0
plagued     0
plaguing    0
plainly     0
plan        0
planes      0
planet      0
planned     0
planning    0
plans       0
plant       0
planted     0
plants      0
plated      0
platform    0
platonic    0
play        0
!           0
dtype: int64


These differences suggest that fake news articles focused more on the actions of specific people (`trump`, `clinton`) and less on specific details (`tuesday`, `financial`).

However, these results could be due to longer articles that allowed e.g. real news writers to cover more details. How do we control for length?

Let's compute the normalized frequency for fake news and real news articles, to identify words that occurred more often than expected in one genre of article.

### Normalized frequency
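
To see why normalization controls for length, consider a minimal sketch with made-up token lists, where the "long" corpus is just the "short" one repeated three times. The names `short_corpus`, `long_corpus` and `normalized_frequency` are hypothetical and only serve this illustration.

```python
from collections import Counter

# Hypothetical token lists: the long corpus repeats the short one 3 times.
short_corpus = ["trump", "said", "trump"]
long_corpus = short_corpus * 3

def normalized_frequency(tokens):
    """Divide each word count by the total number of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# Raw counts differ only because one corpus is longer...
print(Counter(short_corpus)["trump"], Counter(long_corpus)["trump"])
# ...while the normalized frequencies agree.
print(normalized_frequency(short_corpus)["trump"],
      normalized_frequency(long_corpus)["trump"])
```

The raw counts differ by a factor of three, but each word's share of the total is the same in both corpora, so the ratio of normalized frequencies is not driven by how much text each side contains.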

In [1]:
def compute_norm_frequency(text_data, tokenizer, stops, vocab):
    cv = CountVectorizer(tokenizer=tokenizer.tokenize, stop_words=stops,
                         ngram_range=(1,1), vocabulary=vocab)
    dtm = cv.fit_transform(text_data)
    # normalize each word's count by the total count over all words
    word_norm_frequency = np.array(dtm.sum(axis=0) / dtm.sum(axis=0).sum())[0]
    # store in format that is easy to manipulate
    word_norm_frequency = pd.Series(word_norm_frequency, index=vocab)
    return word_norm_frequency

In [None]:
tokenizer = WordPunctTokenizer()
stops = get_stop_words('en')
fake_news_word_norm_frequency = compute_norm_frequency(fake_news_data.loc[:, 'text'].values, tokenizer, stops, sorted_vocab)
real_news_word_norm_frequency = compute_norm_frequency(real_news_data.loc[:, 'text'].values, tokenizer, stops, sorted_vocab)
## compute ratio: what words are used more often in fake news than real news?
def compute_text_word_ratio(text_data_1, text_data_2):
    text_word_ratio = text_data_1 / text_data_2
    # drop non-occurring words
    text_word_ratio = text_word_ratio[~np.isinf(text_word_ratio)]
    text_word_ratio = text_word_ratio[~np.isnan(text_word_ratio)]
    text_word_ratio = text_word_ratio[text_word_ratio != 0.]
    text_word_ratio.sort_values(inplace=True, ascending=False)
    return text_word_ratio
fake_vs_real_news_word_frequency_ratio = compute_text_word_ratio(fake_news_word_norm_frequency, real_news_word_norm_frequency)
# show words with highest/lowest ratio
top_k = 20
print('words with a higher normalized frequency in fake news')
print(fake_vs_real_news_word_frequency_ratio.head(top_k))
print('words with a higher normalized frequency in real news')
print(fake_vs_real_news_word_frequency_ratio.tail(top_k))

OK! We see that fake news consistently focuses on `hillary` (e.g. her email case) as well as potential conspiracy theories (`secret`, `ai`). In contrast, real news focuses on concrete time details (`january`, `40`) and uses some words that "hedge" its claims (`potentially`, `story`).

### TF-IDF

What if we want to identify words that occur frequently in just a few documents? E.g. some fake news stories may disproportionately use rare but inflammatory words.

Let's try TF-IDF, which normalizes term frequency by the inverse document frequency:

$$\text{tf-idf(word)} = \frac{\text{freq(word)}}{\text{document-freq(word)}}$$
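
As a rough illustration of the formula above, the sketch below computes the simplified ratio on three made-up documents. It also shows the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, which is the variant scikit-learn documents for `TfidfVectorizer` with its default `smooth_idf=True`. The toy documents and the helper names are assumptions for this example, and the sketch omits the L2 normalization that `TfidfVectorizer` applies to each document by default.

```python
import math

# Hypothetical tokenized documents (not drawn from the news data).
docs = [
    ["trump", "said", "email"],
    ["trump", "said"],
    ["trump", "qatar", "qatar"],
]
n_docs = len(docs)

def document_frequency(word):
    """Number of documents that contain the word at least once."""
    return sum(1 for doc in docs if word in doc)

def simple_tfidf(word, doc):
    """Simplified ratio: term frequency divided by document frequency."""
    return doc.count(word) / document_frequency(word)

def smoothed_idf(word):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + document_frequency(word))) + 1

# "qatar" is frequent in one document and absent elsewhere: high score.
print(simple_tfidf("qatar", docs[2]))
# "trump" appears in every document: low score.
print(simple_tfidf("trump", docs[2]))
print(smoothed_idf("qatar"), smoothed_idf("trump"))
```

In both variants, a word concentrated in a few documents scores higher than a word spread across all of them, which is the property this section relies on.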

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
def compute_non_zero_mean(data):
    non_zero_data = data[data != 0.]
    non_zero_mean = non_zero_data.mean()
    return non_zero_mean
def compute_tfidf(text_data, tokenizer, stops, vocab):
    tfidf_vec = TfidfVectorizer(tokenizer=tokenizer.tokenize, stop_words=stops, vocabulary=vocab)
    text_tfidf_matrix = tfidf_vec.fit_transform(text_data).toarray()
    # compute mean over non-zero TF-IDF values
    text_tfidf_score = np.apply_along_axis(lambda x: compute_non_zero_mean(x), 0, text_tfidf_matrix)
    text_tfidf_score = pd.Series(text_tfidf_score, index=vocab)
    return text_tfidf_score

In [None]:
fake_news_tfidf = compute_tfidf(fake_news_text, tokenizer, en_stops, sorted_vocab)
real_news_tfidf = compute_tfidf(real_news_text, tokenizer, en_stops, sorted_vocab)
fake_vs_real_news_word_tfidf_ratio = fake_news_tfidf / real_news_tfidf
fake_vs_real_news_word_tfidf_ratio.dropna(inplace=True)
fake_vs_real_news_word_tfidf_ratio.sort_values(inplace=True, ascending=False)
top_k = 20
print('words with higher TF-IDF scores in fake news')
display(fake_vs_real_news_word_tfidf_ratio.head(top_k))
print('words with higher TF-IDF scores in real news')
display(fake_vs_real_news_word_tfidf_ratio.tail(top_k))

This method succeeds in identifying fairly rare words that characterize real and fake news.

For fake news, we see that words with higher TF-IDF scores include those related to business transactions (`retailers`, `gas`) and Middle Eastern countries (`saudi`, `qatar`).

For real news, the words with higher TF-IDF scores include words that directly address conspiracies (`pizzagate`, `investigators`, `authentication`) and words that speculate on the veracity of claims (`absurdity`, `suddenly`).

### Exploration

Now it's time for you to explore the data a little more with word frequency modeling!

Some thoughts:

- The original data are organized by topic. What are the words that characterize real/fake news in each topic?
- Changing the vocabulary size could identify more rare words (e.g. lowering `min_df` threshold in `CountVectorizer`). What happens if you include more words in the vocabulary?
- Up until now we have focused more strongly on single words (unigrams). What if we include phrases (changing `ngram_range` in the `CountVectorizer`)? Will we see more examples of conspiracy theories being highlighted by the real news?
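
For the last point, here is a minimal sketch of what moving from unigrams to bigrams means. The helper `ngrams` and the token list are hypothetical; `CountVectorizer` builds its n-grams in a similar way, joining adjacent tokens with a space.

```python
def ngrams(tokens, n):
    """Return all runs of n adjacent tokens, joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical token list for illustration.
tokens = ["hillary", "clinton", "email", "server"]

unigrams = ngrams(tokens, 1)  # corresponds to ngram_range=(1,1)
bigrams = ngrams(tokens, 2)   # corresponds to ngram_range=(2,2)
print(unigrams)
print(bigrams)
```

Setting `ngram_range=(1,2)` would keep both lists in one vocabulary, at the cost of a much larger and sparser document-term matrix.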

In [None]:
## example: test different n-gram range
## generate new vocabulary
tokenizer = WordPunctTokenizer()
en_stops = get_stop_words('en')
def compute_word_freq_custom(text_data, custom_cv, vocab):
    text_dtm = custom_cv.transform(text_data)
    word_norm_frequency = np.array(text_dtm.sum(axis=0) / text_dtm.sum(axis=0).sum())[0]
    word_norm_frequency = pd.Series(word_norm_frequency, index=vocab)
    return word_norm_frequency
# create custom vectorizer for bigrams
bigram_cv = CountVectorizer(min_df=0.001, max_df=0.75, 
                            tokenizer=tokenizer.tokenize, stop_words=en_stops,
                            ngram_range=(2,2))
# get vocab for all data
# pd.concat replaces Series.append, which was removed in pandas 2.0
combined_txt = pd.concat([fake_news_data.loc[:, 'text'], real_news_data.loc[:, 'text']])
combined_txt_dtm = bigram_cv.fit_transform(combined_txt)
sorted_bigram_vocab = list(sorted(bigram_cv.vocabulary_.keys(), key=bigram_cv.vocabulary_.get))
## compute frequency ratio for bigrams
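fake_news_bigram_frequency = compute_word_freq_custom(fake_news_data.loc[:, 'text'].values, bigram_cv, sorted_bigram_vocab)
# the real news frequencies must be computed too, before taking the ratio
real_news_bigram_frequency = compute_word_freq_custom(real_news_data.loc[:, 'text'].values, bigram_cv, sorted_bigram_vocab)
fake_vs_real_news_bigram_word_frequency_ratio = compute_text_word_ratio(fake_news_bigram_frequency, real_news_bigram_frequency)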
fake_vs_real_news_bigram_word_frequency_ratio.sort_values(inplace=True, ascending=False)
top_k = 20
print('top bigrams that occur more often in fake news data')
display(fake_vs_real_news_bigram_word_frequency_ratio.head(top_k))
print('top bigrams that occur more often in real news data')
display(fake_vs_real_news_bigram_word_frequency_ratio.tail(top_k))