# Word frequency

First step in exploration: which words occur more frequently in one data set versus another?

1. [Load data](#Load-data)
2. [Raw frequency](#Raw-frequency)
3. [Normalized frequency](#Normalized-frequency)
4. [TF-IDF](#TF-IDF)
5. [Exploration](#Exploration)

### Load data

In [None]:
!wget https://bitbucket.org/istewart6/core_tutorial_2020/raw/36e69f9d777319ae2cc94354cf57bd01f3e080b3/data.zip; unzip data

--2020-12-09 22:31:43--  https://bitbucket.org/istewart6/core_tutorial_2020/raw/36e69f9d777319ae2cc94354cf57bd01f3e080b3/data.zip
Resolving bitbucket.org (bitbucket.org)... 104.192.141.1, 2406:da00:ff00::22c3:9b0a, 2406:da00:ff00::22c2:513, ...
Connecting to bitbucket.org (bitbucket.org)|104.192.141.1|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 230909807 (220M) [application/zip]
Saving to: ‘data.zip’


2020-12-09 22:31:51 (108 MB/s) - ‘data.zip’ saved [230909807/230909807]

Archive:  data.zip
   creating: data/
   creating: data/fakeNewsDatasets/
  inflating: data/fakeNewsDatasets/fake_news_small.tsv  
  inflating: data/fakeNewsDatasets/real_news_small.tsv  
   creating: data/fake_news_challenge/
  inflating: data/fake_news_challenge/fake_news_glove_embed.model  
  inflating: data/fake_news_challenge/real_news_word2vec_embed.model  
  inflating: data/fake_news_challenge/Fake.csv  
  inflating: data/fake_news_challenge/fake_news_word2vec_embed.model  
  inf

In [None]:
!pip install stop_words

Collecting stop_words
  Downloading https://files.pythonhosted.org/packages/1c/cb/d58290804b7a4c5daa42abbbe2a93c477ae53e45541b1825e86f0dfaaf63/stop-words-2018.7.23.tar.gz
Building wheels for collected packages: stop-words
  Building wheel for stop-words (setup.py) ... done
  Created wheel for stop-words: filename=stop_words-2018.7.23-cp36-none-any.whl size=32916 sha256=ae954665226e0323c0f7897db983bf43a1510e49f63e6738c6842e09629c3769
  Stored in directory: /root/.cache/pip/wheels/75/37/6a/2b295e03bd07290f0da95c3adb9a74ba95fbc333aa8b0c7c78
Successfully built stop-words
Installing collected packages: stop-words
Successfully installed stop-words-2018.7.23


In [None]:
import pandas as pd
fake_news_data = pd.read_csv('data/fakeNewsDatasets/fake_news_small.tsv', sep='\t', index_col=False)
real_news_data = pd.read_csv('data/fakeNewsDatasets/real_news_small.tsv', sep='\t', index_col=False)
display(fake_news_data.head())
display(real_news_data.head())
print(fake_news_data.shape[0])
# print(fake_news_data[5])
# print(real_news_data[5])

Unnamed: 0,title,text,topic,id
0,"Alex Jones Vindicated in ""Pizzagate"" Controversy","""Alex Jones, purveyor of the independent inves...",biz,1
1,THE BIG DATA CONSPIRACY,so that in the no so far future can institute ...,biz,2
2,California Surprisingly Lenient on Auto Emissi...,"Setting Up Face-Off With Trump ""California's c...",biz,3
3,Mexicans Are Chomping at the Bit to Stop NAFTA...,Mexico has been unfairly gaining from NAFTA as...,biz,4
4,Breaking News: Snapchat to purchase Twitter fo...,Yahoo and AOL could be extremely popular over ...,biz,5


Unnamed: 0,title,text,topic,id
0,Alex Jones Apologizes for Promoting 'Pizzagate...,Alex Jones a prominent conspiracy theorist an...,biz,1
1,Banks and Tech Firms Battle Over Something Aki...,The big banks and Silicon Valley are waging an...,biz,2
2,California Upholds Auto Emissions Standards,"Setting Up Face-Off With Trump ""California's ...",biz,3
3,Renegotiate Nafta? Mexicans Say Get On With It,For more than two decades free trade has been...,biz,4
4,Snapchat 'will be bigger than Twitter,"Yahoo and AOL with advertisers' ""Snapchat cou...",biz,5


240


### Raw frequency

As a first step, let's look at raw word frequency.

In [None]:
# word frequency
from nltk.tokenize import WordPunctTokenizer
from sklearn.feature_extraction.text import CountVectorizer
from stop_words import get_stop_words
en_stops = get_stop_words('en')
tokenizer = WordPunctTokenizer()
cv = CountVectorizer(min_df=0.001, max_df=0.75, 
                     tokenizer=tokenizer.tokenize, stop_words=en_stops,
                     ngram_range=(1,1))
# get vocab for all data
# Series.append was removed in pandas 2.0; pd.concat stacks the two text columns instead
combined_txt = pd.concat([fake_news_data.loc[:, 'text'], real_news_data.loc[:, 'text']])
combined_txt_dtm = cv.fit_transform(combined_txt)
sorted_vocab = list(sorted(cv.vocabulary_.keys(), key=cv.vocabulary_.get))
# get separate DTM for each news data
cv = CountVectorizer(min_df=0.001, max_df=0.75, tokenizer=tokenizer.tokenize, stop_words=en_stops, vocabulary=sorted_vocab)
fake_news_dtm = cv.fit_transform(fake_news_data.loc[:, 'text'].values)
real_news_dtm = cv.fit_transform(real_news_data.loc[:, 'text'].values)



In [None]:
## top words
import numpy as np
fake_news_dtm_top_words = pd.Series(np.array(fake_news_dtm.sum(axis=0))[0], index=sorted_vocab).sort_values(ascending=False)
real_news_dtm_top_words = pd.Series(np.array(real_news_dtm.sum(axis=0))[0], index=sorted_vocab).sort_values(ascending=False)
print(fake_news_dtm_top_words.head(20))
print(real_news_dtm_top_words.head(20))

,            851
'            384
"            343
s            278
-            182
will         161
trump        113
said          90
."            89
new           86
president     76
t             61
time          56
can           54
school        53
one           53
now           49
many          48
just          43
years         43
dtype: int64
'        451
-        344
"        339
s        323
,        314
said     148
will      87
year      64
also      53
t         52
."        51
new       47
trump     47
$         45
first     41
one       41
last      38
(         38
two       38
:         37
dtype: int64


In [None]:
# per-topic
article_topics = fake_news_data.loc[:, 'topic'].unique()
en_stops = get_stop_words('en')
tokenizer = WordPunctTokenizer()
top_k = 20
for topic_i in article_topics:
    print(f'topic = {topic_i}')
    fake_news_data_i = fake_news_data[fake_news_data.loc[:, 'topic']==topic_i]
    real_news_data_i = real_news_data[real_news_data.loc[:, 'topic']==topic_i]
    # get vocab, compute counts, etc.
    cv = CountVectorizer(min_df=0.001, max_df=0.75, 
                         tokenizer=tokenizer.tokenize, stop_words=en_stops,
                         ngram_range=(1,1))
    combined_txt_i = pd.concat([fake_news_data_i.loc[:, 'text'], real_news_data_i.loc[:, 'text']])
    combined_txt_dtm_i = cv.fit_transform(combined_txt_i)
    sorted_vocab_i = list(sorted(cv.vocabulary_.keys(), key=cv.vocabulary_.get))
    # get separate DTM for each news data
    cv = CountVectorizer(min_df=0.001, max_df=0.75, 
                         tokenizer=tokenizer.tokenize, stop_words=en_stops,
                         ngram_range=(1,1), vocabulary=sorted_vocab_i)
    fake_news_dtm_i = cv.fit_transform(fake_news_data_i.loc[:, 'text'].values)
    real_news_dtm_i = cv.fit_transform(real_news_data_i.loc[:, 'text'].values)
    # get top counts
    fake_news_dtm_top_words_i = pd.Series(np.array(fake_news_dtm_i.sum(axis=0))[0], index=sorted_vocab_i).sort_values(ascending=False).head(top_k)
    real_news_dtm_top_words_i = pd.Series(np.array(real_news_dtm_i.sum(axis=0))[0], index=sorted_vocab_i).sort_values(ascending=False).head(top_k)
    print('top words for fake news articles')
    display(fake_news_dtm_top_words_i)
    print('top words for real news articles')
    display(real_news_dtm_top_words_i)

topic = biz
top words for fake news articles




,            106
'             41
"             34
s             33
-             28
will          28
uk            21
said          19
$             14
trump         13
eu            13
deal          13
."            13
company       11
many          10
companies     10
european      10
now            9
may            8
jobs           8
dtype: int64

top words for real news articles


'            73
s            65
-            58
"            50
said         44
$            26
will         18
us           16
1            16
)            15
company      15
:            13
last         13
firm         13
trump        13
uk           13
financial    13
european     13
eu           13
two          12
dtype: int64

topic = edu
top words for fake news articles




"            53
school       47
'            45
students     38
s            31
-            23
will         22
education    20
trump        15
president    12
new          12
children     11
student      10
."           10
said         10
parents      10
law          10
time         10
schools      10
first        10
dtype: int64

top words for real news articles


'             29
s             24
-             24
school        23
students      21
"             19
education     11
said           9
,"             9
student        8
year           8
percent        7
children       6
president      6
according      5
will           5
college        5
)              5
(              5
university     5
dtype: int64

topic = entmt
top words for fake news articles




,            161
"            106
s             64
-             31
will          29
."            25
t             24
one           17
time          16
show          16
new           16
also          14
said          14
fans          13
way           12
now           11
last          11
just          11
(             11
character     10
dtype: int64

top words for real news articles


"        151
s        102
-         89
said      42
,         30
."        29
also      23
t         21
will      20
one       16
film      16
year      16
first     15
told      14
new       14
--        14
news      13
show      13
john      11
years     11
dtype: int64

topic = polit
top words for fake news articles




trump         79
'             69
"             59
s             56
president     50
clinton       29
-             25
donald        22
said          20
house         16
white         16
washington    15
will          14
."            13
just          11
cnn           11
)             11
(             11
obama         11
us            11
dtype: int64

top words for real news articles


"            33
'            32
s            28
trump        25
-            20
said         18
,"           12
president    10
mr            9
clinton       7
campaign      6
t             5
will          5
time          5
:             5
first         5
u             4
america       4
press         4
order         4
dtype: int64

topic = sports
top words for fake news articles




,         148
"          55
s          51
-          41
will       26
game       24
team       23
."         21
said       18
two        16
one        14
years      13
year       13
last       12
time       11
new        10
brazil     10
world      10
sports      9
just        9
dtype: int64

top words for real news articles


-          124
s           83
"           71
will        25
year        23
said        21
world       17
game        16
sport       16
one         15
."          15
two         14
win         14
time        13
6           13
team        13
federer     12
old         12
sports      11
(           11
dtype: int64

topic = tech
top words for fake news articles




'           49
s           43
will        42
"           36
new         34
-           34
can         16
amazon      14
google      13
now         12
many        11
apple       10
t            9
world        9
devices      9
said         9
(            9
time         9
research     8
app          8
dtype: int64

top words for real news articles


-             29
'             27
s             21
"             15
will          14
said          14
new           12
,"             9
also           7
devices        7
can            6
google         6
year           6
like           6
t              6
announced      5
monday         5
see            5
game           4
technology     4
dtype: int64

In [None]:
def compute_frequency(text_data, tokenizer, stops, vocab):
    cv = CountVectorizer(tokenizer=tokenizer.tokenize, stop_words=stops,
                         ngram_range=(1,1), vocabulary=vocab)
    dtm = cv.fit_transform(text_data)
    word_frequency = np.array(dtm.sum(axis=0))[0]
    word_frequency = pd.Series(word_frequency, index=vocab)
    return word_frequency

In [None]:
fake_news_text = fake_news_data.loc[:, 'text'].values
real_news_text = real_news_data.loc[:, 'text'].values
fake_news_word_frequency = compute_frequency(fake_news_text, tokenizer, en_stops, sorted_vocab)
real_news_word_frequency = compute_frequency(real_news_text, tokenizer, en_stops, sorted_vocab)
# compute difference
fake_vs_real_news_word_frequency_diff = fake_news_word_frequency - real_news_word_frequency
fake_vs_real_news_word_frequency_diff.sort_values(inplace=True, ascending=False)
# show words with highest/lowest difference
top_k = 20
print('words that occurred in more fake news articles')
print(fake_vs_real_news_word_frequency_diff.head(top_k))
print('words that occurred in more real news articles')
print(fake_vs_real_news_word_frequency_diff.tail(top_k))

words that occurred in more fake news articles
,            537
will          74
trump         66
president     41
new           39
."            38
many          33
clinton       27
donald        25
time          24
now           24
school        24
can           23
stated        20
students      19
even          18
white         18
order         16
way           16
great         16
dtype: int64
words that occurred in more real news articles
)             -8
000           -9
–             -9
6            -10
report       -11
m            -11
4            -11
tuesday      -11
three        -11
1            -12
financial    -12
$            -16
--           -17
also         -19
:            -20
year         -27
s            -45
said         -58
'            -67
-           -162
dtype: int64




These differences suggest that fake news articles focused more on the actions of specific people (`trump`, `clinton`) and less on specific details (`tuesday`, `financial`).

However, these differences could simply reflect article length: longer articles give writers (e.g. real news writers) room to cover more details, which inflates raw counts. How do we control for length?

Let's compute the normalized frequency for fake news and real news articles, to identify words that occurred more often than expected in one genre of article.
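
To make the idea concrete, here is a minimal sketch of length normalization on two made-up corpora. The helper `normalized_frequency` and the toy sentences are hypothetical and use plain whitespace splitting rather than the notebook's tokenizer. The point is only that dividing by the total token count makes a short and a long corpus comparable.

```python
from collections import Counter

def normalized_frequency(docs):
    """Return each word's share of all tokens across a list of documents."""
    counts = Counter(word for doc in docs for word in doc.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# Hypothetical toy corpora: the "long" corpus just repeats the short one 3 times.
short_corpus = ["trump said no", "trump said yes"]
long_corpus = short_corpus * 3

short_freq = normalized_frequency(short_corpus)
long_freq = normalized_frequency(long_corpus)

# Raw counts differ by a factor of 3, but the normalized shares are the same.
print(short_freq["trump"], long_freq["trump"])
```

Because each share is a fraction of the corpus total, the values in each dictionary sum to 1, and the ratio between two corpora no longer depends on how much text each one contains.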

### Normalized frequency

In [None]:
def compute_norm_frequency(text_data, tokenizer, stops, vocab):
    cv = CountVectorizer(tokenizer=tokenizer.tokenize, stop_words=stops,
                         ngram_range=(1,1), vocabulary=vocab)
    dtm = cv.fit_transform(text_data)
    # divide each word's count by the total token count, so the frequencies sum to 1
    word_norm_frequency = np.array(dtm.sum(axis=0) / dtm.sum(axis=0).sum())[0]
    # store in format that is easy to manipulate
    word_norm_frequency = pd.Series(word_norm_frequency, index=vocab)
    return word_norm_frequency

In [None]:
tokenizer = WordPunctTokenizer()
stops = get_stop_words('en')
fake_news_word_norm_frequency = compute_norm_frequency(fake_news_data.loc[:, 'text'].values, tokenizer, stops, sorted_vocab)
real_news_word_norm_frequency = compute_norm_frequency(real_news_data.loc[:, 'text'].values, tokenizer, stops, sorted_vocab)
## compute ratio: what words are used more often in fake news than real news?
def compute_text_word_ratio(text_data_1, text_data_2):
    text_word_ratio = text_data_1 / text_data_2
    # drop non-occurring words
    text_word_ratio = text_word_ratio[~np.isinf(text_word_ratio)]
    text_word_ratio = text_word_ratio[~np.isnan(text_word_ratio)]
    text_word_ratio = text_word_ratio[text_word_ratio != 0.]
    text_word_ratio.sort_values(inplace=True, ascending=False)
    return text_word_ratio
fake_vs_real_news_word_frequency_ratio = compute_text_word_ratio(fake_news_word_norm_frequency, real_news_word_norm_frequency)
# show words with highest/lowest ratio
top_k = 20
print('words that occurred in more fake news articles')
print(fake_vs_real_news_word_frequency_ratio.head(top_k))
print('words that occurred in more real news articles')
print(fake_vs_real_news_word_frequency_ratio.tail(top_k))

words that occurred in more fake news articles
hillary      11.724431
commented    10.886971
needs        10.049512
secret        9.212053
caused        8.374593
ai            8.374593
provided      7.537134
earth         6.699675
begin         6.699675
instead       6.699675
attempt       5.862215
release       5.862215
success       5.862215
stein         5.862215
lack          5.862215
charges       5.024756
tennis        5.024756
met           5.024756
phone         5.024756
groups        5.024756
dtype: float64
words that occurred in more real news articles
anniversary    0.167492
customer       0.167492
missing        0.167492
saw            0.167492
value          0.167492
jersey         0.167492
providers      0.167492
potentially    0.167492
growing        0.139577
indian         0.139577
story          0.139577
drawn          0.139577
january        0.139577
february       0.139577
vehicle        0.119637
brady          0.119637
40             0.119637
18             0.119637



OK! We see that fake news focuses disproportionately on `hillary` (e.g. her email case) as well as potential conspiracy theories (`secret`, `ai`). In contrast, real news focuses on concrete time details (`january`, `40`) and uses some words that "hedge" its claims (`potentially`, `story`).

### TF-IDF

What if we want to identify words that occur frequently in just a few documents? E.g. some fake news stories may disproportionately use rare but inflammatory words.

Let's try TF-IDF, which weights a word's term frequency by its inverse document frequency, so that words appearing in many documents are down-weighted. In simplified form:

$$\text{tf-idf(word)} = \frac{\text{freq(word)}}{\text{document-freq(word)}}$$
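
The formula above is a simplification. As a rough sketch of what `TfidfVectorizer` computes with its default settings (smoothed IDF of the form `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each document's vector), here is a pure-Python version on a made-up corpus. The toy documents and the helper names `smoothed_idf` and `tfidf` are assumptions for illustration, not part of the notebook.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

# Hypothetical, already-tokenized toy documents.
docs = [
    ["trump", "said", "secret"],
    ["trump", "said", "school"],
    ["trump", "said", "school"],
    ["trump", "deal"],
]
n_docs = len(docs)
vocab = sorted({word for doc in docs for word in doc})
# Document frequency: the number of documents that contain each word.
doc_freq = {word: sum(word in doc for doc in docs) for word in vocab}
idf = {word: smoothed_idf(n_docs, doc_freq[word]) for word in vocab}

def tfidf(doc):
    """Term count times IDF for each word, scaled so the vector has unit L2 norm."""
    weights = {word: doc.count(word) * idf[word] for word in set(doc)}
    norm = math.sqrt(sum(value * value for value in weights.values()))
    return {word: value / norm for word, value in weights.items()}

scores = tfidf(docs[0])
for word in sorted(scores, key=scores.get, reverse=True):
    print(f"{word:8s} df={doc_freq[word]} tfidf={scores[word]:.3f}")
```

In the first document every word occurs once, so the ranking is driven entirely by document frequency: `secret` (one document) scores above `said` (three documents), which scores above `trump` (all four documents).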

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
def compute_non_zero_mean(data):
    non_zero_data = data[data != 0.]
    non_zero_mean = non_zero_data.mean()
    return non_zero_mean
def compute_tfidf(text_data, tokenizer, stops, vocab):
    tfidf_vec = TfidfVectorizer(tokenizer=tokenizer.tokenize, stop_words=stops, vocabulary=vocab)
    text_tfidf_matrix = tfidf_vec.fit_transform(text_data).toarray()
    # compute mean over non-zero TF-IDF values
    text_tfidf_score = np.apply_along_axis(lambda x: compute_non_zero_mean(x), 0, text_tfidf_matrix)
    text_tfidf_score = pd.Series(text_tfidf_score, index=vocab)
    return text_tfidf_score

In [None]:
fake_news_tfidf = compute_tfidf(fake_news_text, tokenizer, en_stops, sorted_vocab)
real_news_tfidf = compute_tfidf(real_news_text, tokenizer, en_stops, sorted_vocab)
fake_vs_real_news_word_tfidf_ratio = fake_news_tfidf / real_news_tfidf
fake_vs_real_news_word_tfidf_ratio.dropna(inplace=True)
fake_vs_real_news_word_tfidf_ratio.sort_values(inplace=True, ascending=False)
top_k = 20
print('words with higher TF-IDF scores in fake news')
display(fake_vs_real_news_word_tfidf_ratio.head(top_k))
print('words with higher TF-IDF scores in real news')
display(fake_vs_real_news_word_tfidf_ratio.tail(top_k))



words with higher TF-IDF scores in fake news


steel         4.280091
retailers     4.089067
tourists      3.949276
friendship    3.927483
morgan        3.752379
bruno         3.728585
gas           3.509625
saudi         3.392921
arnold        3.364493
privacy       3.356762
sacrifice     3.214699
emoji         3.112207
duncan        3.080392
michelle      3.053198
ebony         3.014936
qatar         2.980143
fees          2.926611
ai            2.926181
wawrinka      2.924895
kyrgios       2.892984
dtype: float64

words with higher TF-IDF scores in real news


virtual           0.331943
comfortable       0.330759
saran             0.312199
hacking           0.309510
junco             0.308395
farah             0.303634
iphones           0.300928
putin             0.297846
engines           0.292603
fisher            0.280006
alcohol           0.279075
absurdity         0.277041
suddenly          0.272967
pizzagate         0.270884
punk              0.265890
authentication    0.264018
factor            0.264018
graduates         0.254588
tempe             0.239919
investigators     0.221240
dtype: float64

This method succeeds in identifying fairly rare words that characterize real and fake news.

For fake news, we see that words with higher TF-IDF scores include those related to business transactions (`retailers`, `gas`) and Middle Eastern countries (`saudi`, `qatar`).

For real news, the words with higher TF-IDF scores include words that directly address conspiracies (`pizzagate`, `investigators`, `authentication`) and words that speculate on the veracity of claims (`absurdity`, `suddenly`).

### Exploration

Now it's time for you to explore the data a little more with word frequency modeling!

Some thoughts:

- The original data are organized by topic. What are the words that characterize real/fake news in each topic?
- Changing the vocabulary size could identify more rare words (e.g. lowering `min_df` threshold in `CountVectorizer`). What happens if you include more words in the vocabulary?
- So far we have focused on single words (unigrams). What if we include phrases (changing `ngram_range` in the `CountVectorizer`)? Will we see more examples of conspiracy theories being highlighted by the real news?
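
As a starting point for the second idea, the sketch below shows how `min_df` changes the vocabulary on a tiny made-up corpus. The four sentences are hypothetical, and the example uses `CountVectorizer`'s default tokenizer rather than `WordPunctTokenizer`. An integer `min_df` is an absolute document count, while a float (as in the cells above) is a proportion of documents.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not taken from the news data.
docs = [
    "trump said secret deal",
    "trump said school",
    "trump said school deal",
    "trump tower",
]

# min_df=2 keeps only words that appear in at least 2 documents.
strict_cv = CountVectorizer(min_df=2)
strict_cv.fit(docs)
strict_vocab = sorted(strict_cv.vocabulary_)

# min_df=1 keeps every word, including those seen in a single document.
loose_cv = CountVectorizer(min_df=1)
loose_cv.fit(docs)
loose_vocab = sorted(loose_cv.vocabulary_)

print('min_df=2:', strict_vocab)
print('min_df=1:', loose_vocab)
```

Lowering the threshold admits `secret` and `tower`, each of which occurs in only one document. On the real data the same change will add many rare words, so expect noisier ratios for words with very small counts.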

In [None]:
## example: test different n-gram range
## generate new vocabulary
tokenizer = WordPunctTokenizer()
en_stops = get_stop_words('en')
def compute_word_freq_custom(text_data, custom_cv, vocab):
    text_dtm = custom_cv.transform(text_data)
    word_norm_frequency = np.array(text_dtm.sum(axis=0) / text_dtm.sum(axis=0).sum())[0]
    word_norm_frequency = pd.Series(word_norm_frequency, index=vocab)
    return word_norm_frequency
# create custom vectorizer for bigrams
bigram_cv = CountVectorizer(min_df=0.001, max_df=0.75, 
                            tokenizer=tokenizer.tokenize, stop_words=en_stops,
                            ngram_range=(2,2))
# get vocab for all data
combined_txt = pd.concat([fake_news_data.loc[:, 'text'], real_news_data.loc[:, 'text']])
combined_txt_dtm = bigram_cv.fit_transform(combined_txt)
sorted_bigram_vocab = list(sorted(bigram_cv.vocabulary_.keys(), key=bigram_cv.vocabulary_.get))
## compute frequency ratio for bigrams
fake_news_bigram_frequency = compute_word_freq_custom(fake_news_data.loc[:, 'text'].values, bigram_cv, sorted_bigram_vocab)
real_news_bigram_frequency = compute_word_freq_custom(real_news_data.loc[:, 'text'].values, bigram_cv, sorted_bigram_vocab)
fake_vs_real_news_bigram_word_frequency_ratio = compute_text_word_ratio(fake_news_bigram_frequency, real_news_bigram_frequency)
fake_vs_real_news_bigram_word_frequency_ratio.sort_values(inplace=True, ascending=False)
top_k = 20
print('top unigrams/bigrams that occur more often in fake news data')
display(fake_vs_real_news_bigram_word_frequency_ratio.head(top_k))
print('top unigrams/bigrams that occur more often in real news data')
display(fake_vs_real_news_bigram_word_frequency_ratio.tail(top_k))

top unigrams/bigrams that occur more often in fake news data




hillary clinton     10.837533
, "                  7.711322
however ,            7.502908
trump .              7.502908
, many               6.669251
, one                5.835595
game ,               5.001938
president donald     5.001938
couldn '             5.001938
first lady           5.001938
white house          4.724053
donald trump         4.335013
monday .             4.168282
night ,              4.168282
now ,                4.168282
trump tower          4.168282
" just               4.168282
. new                3.751454
, wanted             3.334626
supreme court        3.334626
dtype: float64

top unigrams/bigrams that occur more often in real news data


economy .        0.208414
performance -    0.208414
2 .              0.208414
well -           0.208414
. k              0.208414
. 4              0.208414
2016 .           0.208414
. still          0.208414
s really         0.208414
, adding         0.208414
science ,        0.208414
k .              0.208414
- year           0.189467
year -           0.185257
- old            0.175507
middle east      0.166731
indian wells     0.166731
. report         0.166731
world number     0.166731
1 .              0.083366
dtype: float64