# ANALYSIS 2: FREQUENCY LISTS AND KEYNESS

In this notebook, we conduct an exploratory analysis of the two corpora (2019 and 2020). We will:

 - look at the most common words and bigrams in each corpus
 - look at the words most distinctive to each corpus
 - look at the kinds of words that come after thankful/grateful

In [1]:
%run functions.ipynb

In [2]:
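import json

# Open the file in a context manager so the handle is closed after loading.
with open("../data/cleaned/tweets_2019.json", encoding="utf-8") as f:
    tweets_2019 = json.load(f)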

In [3]:
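# Open the file in a context manager so the handle is closed after loading.
with open("../data/cleaned/tweets_2020.json", encoding="utf-8") as f:
    tweets_2020 = json.load(f)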

## Cleaning:

To look at word frequencies, we need to join the tweets into a single string and then tokenize it.

Join the tweets into one string.

In [4]:
string_2019 = ''.join(tweets_2019)

In [5]:
string_2020 = ''.join(tweets_2020)
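
One thing to keep in mind when concatenating: the separator passed to `join` decides whether neighboring tweets stay apart. The sketch below uses two made-up tweets (not taken from the corpus) to show the difference. Whether it matters here depends on whether the cleaned tweets already end in whitespace, which this notebook does not show.

```python
# Two made-up tweets standing in for entries of tweets_2019 (illustrative only).
example_tweets = ["so thankful for my family", "happy thanksgiving everyone"]

# Joining with an empty string fuses the last word of one tweet
# to the first word of the next.
fused = "".join(example_tweets)

# Joining with a space keeps every word separate.
spaced = " ".join(example_tweets)

print(fused.split())
print(spaced.split())
```

Either way, joining everything into one string means that bigrams can span the boundary between two tweets.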

Tokenize tweets.

In [6]:
tokens_2019 = tokenize(string_2019, lowercase = True, strip_chars = '.,!')

In [7]:
tokens_2020 = tokenize(string_2020, lowercase = True, strip_chars = '.,!')
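
The `tokenize` function comes from `functions.ipynb`, which is not shown here. As a rough sketch of what a function with these parameters might do (the name `sketch_tokenize` and its body are assumptions, not the actual implementation):

```python
def sketch_tokenize(text, lowercase=False, strip_chars=""):
    """Hypothetical stand-in for tokenize: lowercase, delete characters, split on whitespace."""
    if lowercase:
        text = text.lower()
    # The third argument of str.maketrans lists characters to delete.
    text = text.translate(str.maketrans("", "", strip_chars))
    return text.split()


example_tokens_out = sketch_tokenize("So THANKFUL, for you!", lowercase=True, strip_chars=".,!")
print(example_tokens_out)
```

With a whitespace split like this, anything not listed in `strip_chars` (apostrophes, `#`, `@`, emoji) stays attached to its token, which would be consistent with tokens such as `i’m` and `#family` in the lists below.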

## STEP 1: FREQUENCY LISTS

Let's look at frequency lists for the two years to see how loved ones and health show up, so we know which words and phrases to analyze.

In [8]:
word_freq_2019 = Counter(tokens_2019)
bigram_freq_2019 = Counter(get_bigram_tokens(tokens_2019))
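
`get_bigram_tokens` also lives in `functions.ipynb`. A minimal sketch of the idea on a toy token list, assuming bigrams are space-joined pairs of adjacent tokens (as the output below suggests); `sketch_bigrams` is a hypothetical name:

```python
from collections import Counter


def sketch_bigrams(tokens):
    """Hypothetical stand-in for get_bigram_tokens: pair each token with the next one."""
    return [f"{first} {second}" for first, second in zip(tokens, tokens[1:])]


example_tokens = ["thankful", "for", "my", "family", "thankful", "for", "you"]
example_word_freq = Counter(example_tokens)
example_bigram_freq = Counter(sketch_bigrams(example_tokens))

print(example_word_freq.most_common(2))
print(example_bigram_freq.most_common(1))
```

`Counter.most_common(n)` returns the `n` highest counts as `(item, count)` pairs, which is the format of the lists printed below.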

In [9]:
print('Top 30 words\n============')
print(word_freq_2019.most_common(30)) 

print('\nTop 30 bigrams\n==============')
print(bigram_freq_2019.most_common(30))

Top 30 words
[('for', 9222), ('and', 9026), ('to', 8736), ('the', 8247), ('you', 6574), ('i', 5684), ('a', 5402), ('of', 4837), ('thanksgiving', 4123), ('my', 4026), ('thankful', 3101), ('all', 3035), ('in', 2858), ('is', 2767), ('this', 2615), ('that', 2465), ('grateful', 2295), ('so', 2256), ('have', 2237), ('we', 2233), ('your', 2226), ('be', 2187), ('with', 2115), ('thanks', 2094), ('are', 1932), ('happy', 1926), ('our', 1758), ('me', 1740), ('family', 1710), ('day', 1700)]

Top 30 bigrams
[('thankful for', 2354), ('grateful for', 1503), ('happy thanksgiving', 1288), ('for the', 1093), ('to be', 889), ('for all', 827), ('i am', 827), ('thank you', 787), ('give thanks', 758), ('for my', 592), ('of the', 582), ('all the', 572), ('thanks for', 562), ('for you', 562), ('to you', 522), ('all of', 518), ('you and', 503), ('in the', 467), ('we are', 447), ('thanksgiving to', 445), ('have a', 436), ('my life', 428), ('thanks to', 425), ('so much', 418), ('and i', 413), ('to the', 412), ('i

In [10]:
word_freq_2020 = Counter(tokens_2020)
bigram_freq_2020 = Counter(get_bigram_tokens(tokens_2020))

In [11]:
print('Top 30 words\n============')
print(word_freq_2020.most_common(30)) 

print('\nTop 30 bigrams\n==============')
print(bigram_freq_2020.most_common(30))

Top 30 words
[('for', 8791), ('and', 8691), ('to', 8517), ('the', 8180), ('you', 7138), ('i', 5850), ('a', 5347), ('of', 4542), ('my', 3870), ('thanksgiving', 3599), ('all', 2978), ('thankful', 2817), ('this', 2811), ('is', 2747), ('in', 2708), ('that', 2441), ('so', 2409), ('grateful', 2367), ('we', 2328), ('your', 2270), ('have', 2253), ('be', 2141), ('thanks', 1994), ('with', 1916), ('happy', 1897), ('it', 1793), ('are', 1792), ('me', 1657), ('on', 1611), ('but', 1532)]

Top 30 bigrams
[('thankful for', 2085), ('grateful for', 1556), ('happy thanksgiving', 1249), ('for the', 1025), ('i am', 859), ('to be', 854), ('thank you', 797), ('for you', 750), ('#oreninyourarea #oreninyourarea', 726), ('for all', 698), ('you and', 673), ('give thanks', 649), ('to you', 621), ('of the', 602), ('this year', 556), ('for my', 529), ('all the', 528), ('thanks for', 511), ('so much', 482), ('all of', 450), ('you all', 449), ('we are', 449), ('in the', 439), ('have a', 429), ('and i', 409), ('thanksg

### What does this tell us?

1. There is a lot of overlap between the two years in the most frequent words.

2. The subject of our hypothesis comes up in the 2019 word list as "family", so we know that it is a top topic in these expressions of gratitude.

3. We also see a lot of pronouns (I, we, you, me) and preposition-led bigrams (for my, of the, for you, to you), both of which will be interesting to analyze in context. See the later notebook on this once we have tested the hypothesis.

4. Overall, however, these lists are not very helpful: they are dominated by function words, so we cannot see what is most significant or different between the years.
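
One reason raw counts are hard to compare is that the two corpora need not be the same size. A small sketch of normalizing a count to a rate per 10,000 tokens, using made-up counters rather than the real ones (`per_10k` is a hypothetical helper):

```python
from collections import Counter


def per_10k(counter, word):
    """Frequency of `word` per 10,000 tokens in the corpus behind `counter`."""
    total = sum(counter.values())
    return 10_000 * counter[word] / total if total else 0.0


# Toy counters with made-up numbers, not the corpus counts.
toy_a = Counter({"family": 30, "the": 970})
toy_b = Counter({"family": 30, "the": 1970})

# Same raw count for "family", but different relative frequencies.
print(per_10k(toy_a, "family"))
print(per_10k(toy_b, "family"))
```

The same helper could be applied to `word_freq_2019` and `word_freq_2020` to put the two years on a common scale.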

  
Next, I remove stop words so that we can see which content words are most frequent and which ones we can use to test the hypothesis.

### Frequency lists without stop words: 

In [12]:
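# Use a new name so the NLTK `stopwords` module is not overwritten (re-running
# the cell would otherwise fail), and a set so membership checks are fast.
stop_words = set(stopwords.words('english'))

We can remove stop words from these token lists.

In [13]:
no_stop_tokens_2019 = [word for word in tokens_2019 if word not in stop_words]

In [14]:
no_stop_tokens_2020 = [word for word in tokens_2020 if word not in stop_words]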
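
To make the filtering step concrete, here is a toy version with a tiny hand-written stop list (an assumption for illustration; the notebook uses NLTK's English list):

```python
# A tiny hand-written stop list for illustration only.
toy_stop_words = {"i", "am", "for", "my", "and", "so"}

toy_tokens = ["i", "am", "so", "thankful", "for", "my", "family", "and", "friends"]

# Keep only the tokens that are not in the stop list.
toy_content_tokens = [word for word in toy_tokens if word not in toy_stop_words]

print(toy_content_tokens)
```

Note that bigrams built from a filtered list can pair words that were not adjacent in the tweet: here "family" and "friends" become neighbors once "and" is removed, which is how a bigram like "family friends" can appear in the lists below.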

In [15]:
no_stop_word_freq_2019 = Counter(no_stop_tokens_2019)
no_stop_bigram_freq_2019 = Counter(get_bigram_tokens(no_stop_tokens_2019))

In [16]:
print('Top 30 words\n============')
print(no_stop_word_freq_2019.most_common(30)) 

print('\nTop 30 bigrams\n==============')
print(no_stop_bigram_freq_2019.most_common(30))

Top 30 words
[('thanksgiving', 4123), ('thankful', 3101), ('grateful', 2295), ('thanks', 2094), ('happy', 1926), ('family', 1710), ('day', 1700), ('love', 1492), ('blessings', 1203), ('i’m', 1158), ('appreciate', 1107), ('give', 1028), ('&amp;', 1006), ('thank', 1001), ('friends', 984), ('life', 979), ('gratitude', 974), ('appreciation', 889), ('everyone', 844), ('people', 839), ('god', 838), ('today', 825), ('lucky', 799), ('much', 789), ('us', 788), ('thankfulness', 779), ("i'm", 745), ('year', 720), ('blessing', 708), ('hope', 677)]

Top 30 bigrams
[('happy thanksgiving', 1290), ('give thanks', 766), ('i’m thankful', 350), ('god bless', 323), ('family friends', 311), ("i'm thankful", 251), ('thanksgiving everyone', 226), ('i’m grateful', 218), ('many blessings', 191), ('every day', 168), ('thanksgiving day', 159), ("i'm grateful", 154), ('friends family', 142), ('thanksgiving family', 132), ('loved ones', 105), ('thanks lord', 100), ('thankful family', 98), ('express gratitude', 94)

In [17]:
no_stop_word_freq_2020 = Counter(no_stop_tokens_2020)
no_stop_bigram_freq_2020 = Counter(get_bigram_tokens(no_stop_tokens_2020))

In [18]:
print('Top 30 words\n============')
print(no_stop_word_freq_2020.most_common(30)) 

print('\nTop 30 bigrams\n==============')
print(no_stop_bigram_freq_2020.most_common(30))

Top 30 words
[('thanksgiving', 3599), ('thankful', 2817), ('grateful', 2367), ('thanks', 1994), ('happy', 1897), ('love', 1449), ('family', 1413), ('day', 1265), ('i’m', 1170), ('blessings', 1168), ('appreciate', 1151), ('&amp;', 1135), ('year', 1060), ('thank', 1014), ('gratitude', 989), ('#oreninyourarea', 924), ('appreciation', 917), ('god', 916), ('much', 893), ('give', 868), ('life', 838), ('people', 816), ('us', 800), ('everyone', 777), ('friends', 767), ('thankfulness', 763), ('today', 750), ('blessing', 715), ('one', 697), ('like', 696)]

Top 30 bigrams
[('happy thanksgiving', 1249), ('#oreninyourarea #oreninyourarea', 726), ('give thanks', 650), ('i’m thankful', 347), ('god bless', 346), ('i’m grateful', 270), ('thanksgiving everyone', 206), ('mau #oreninyourarea', 198), ("i'm thankful", 171), ('family friends', 168), ('many blessings', 145), ("i'm grateful", 142), ('#oreninyourarea mau', 132), ('every day', 113), ('loved ones', 112), ('thanksgiving day', 102), ('stay safe', 1

### What do we find from these frequency lists without stop words?

We find words having to do with our analysis such as:
- family 
- friends 
- people
- everyone

This matters for testing our hypothesis because it shows that the topics we want to measure are prominent in the text. Now it is a matter of conducting analyses that can effectively compare how loved ones and health are discussed in the two corpora.

We also find some interesting ideas related but not central to my hypothesis that we can test later:
- pronouns: i, we, us
- big picture nouns: love, life, today, day, god, hope
- nouns referring to other people: people, everyone
- verbs: give
- modifiers/intensifiers: like, much

## STEP 2: KEYNESS

#### We can also do a Keyness analysis to see what are the most distinct words between the years.
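
`calculate_keyness` is defined in `functions.ipynb`, so the exact statistic it uses is not shown here. A common choice for keyness is the log-likelihood (G2) statistic; the sketch below assumes that choice and uses toy counts. The names `log_likelihood` and `sketch_keyness` are hypothetical.

```python
import math
from collections import Counter


def log_likelihood(freq_a, freq_b, total_a, total_b):
    """G2 log-likelihood for one word, given its counts and the two corpus sizes."""
    combined = freq_a + freq_b
    expected_a = total_a * combined / (total_a + total_b)
    expected_b = total_b * combined / (total_a + total_b)
    score = 0.0
    # A zero count contributes nothing to the sum.
    if freq_a > 0:
        score += freq_a * math.log(freq_a / expected_a)
    if freq_b > 0:
        score += freq_b * math.log(freq_b / expected_b)
    return 2 * score


def sketch_keyness(counter_a, counter_b, top=5):
    """Rank words that are relatively more frequent in corpus A than in corpus B."""
    total_a = sum(counter_a.values())
    total_b = sum(counter_b.values())
    scored = []
    for word, freq_a in counter_a.items():
        freq_b = counter_b[word]
        if freq_a / total_a > freq_b / total_b:
            score = log_likelihood(freq_a, freq_b, total_a, total_b)
            scored.append((word, freq_a, freq_b, score))
    return sorted(scored, key=lambda row: row[3], reverse=True)[:top]


# Toy counts (made up), 100 tokens per corpus.
toy_2019 = Counter({"family": 40, "safe": 5, "thanks": 55})
toy_2020 = Counter({"family": 20, "safe": 30, "thanks": 50})
toy_keyness = sketch_keyness(toy_2019, toy_2020)
for row in toy_keyness:
    print(row)
```

Under this statistic, a word scores high when its share of one corpus differs strongly from its share of the other, and the score grows with the raw counts, so frequent words with modest differences can still rank highly.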

In [19]:
calculate_keyness(word_freq_2019, word_freq_2020, top=50)

WORD                     Corpus A Freq.  Corpus B Freq.  Keyness
day                      1700      1265      54.754
•                        130       41        46.723
2019                     78        17        41.094
cowboys                  73        19        32.638
full                     224       120       29.672
@                        266       158       25.468
thanksgiving             4123      3599      24.898
jones                    35        5         24.649
chronicles               79        28        24.205
loving                   100       41        24.171
#family                  156       79        24.020
wonderful                414       278       23.968
🍁                        80        30        22.486
friends                  984       767       22.358
family                   1710      1413      22.076
16:34                    71        26        20.711
#food                    31        5         20.322
troops                   33        7         17.781
la 

In [20]:
calculate_keyness(word_freq_2020, word_freq_2019, top=50)

WORD                     Corpus A Freq.  Corpus B Freq.  Keyness
2020                     198       36        127.097
safe                     331       120       107.418
stay                     233       83        77.522
year                     1060      720       73.116
workers                  73        12        49.995
you                      7138      6574      37.424
through                  303       182       33.256
dropped                  41        5         32.947
but                      1532      1266      31.580
check                    101       38        31.039
courts                   50        10        30.006
gates                    50        11        27.867
im                       271       166       27.864
u                        464       329       26.191
click                    42        8         26.109
💯                        41        8         25.053
you’ve                   119       58        22.837
album                    56        17        22.836
d

In [21]:
calculate_keyness(bigram_freq_2019, bigram_freq_2020, top=50)

WORD                     Corpus A Freq.  Corpus B Freq.  Keyness
• •                      63        5         57.264
he is                    196       90        37.920
thanksgiving from        282       155       34.679
family and               362       217       33.553
for he                   125       50        31.563
and friends              190       96        29.424
more you                 49        10        27.233
family to                129       58        26.091
the cowboys              40        7         24.868
with family              110       47        24.627
1 chronicles             69        22        24.456
black friday             47        11        23.269
#thanksgiving #givethanks 48        12        22.337
full of                  141       73        20.503
the art                  31        5         20.322
may you                  76        30        19.633
good; for                54        17        19.453
on thanksgiving          153       84        18.879
end

In [22]:
calculate_keyness(bigram_freq_2020, bigram_freq_2019, top=50)

WORD                     Corpus A Freq.  Corpus B Freq.  Keyness
this year                556       291       90.310
safe and                 131       30        70.618
stay safe                101       19        63.337
check out                70        13        44.289
year but                 68        13        42.163
lord is                  45        5         37.701
year has                 57        10        37.465
out my                   58        12        33.931
for you                  750       562       31.370
and his                  108       44        29.246
you and                  673       503       28.582
for u                    89        33        27.953
his courts               48        10        27.932
courts with              48        10        27.932
gates with               49        11        26.860
his name                 44        9         25.965
know the                 50        12        25.877
his gates                48        11        25.860
you

### What do we find from keyness? We can already see our hypothesis tested in this analysis. 

Notable tokens/bigrams distinctive to 2019:

- words we are interested in for our hypothesis: family, #family, friends
- descriptions (adjectives/adverbs): full, loving, wonderful, amazing
- nouns: day, night
- big picture nouns: #love, grace, mercy, abundance
- pronouns: her, he, our

Notable tokens/bigrams distinctive to 2020:

- descriptions: safe, alone, tough, crazy, difficult, different, lot
- modifiers: despite, though, but
- verbs: stay
- COVID-related nouns: workers, healthcare, heroes, health, healthy
- referring to people: you (heroes, workers)
- reflexive on this year: this year, stay safe


#### What does this mean for our hypothesis?

We can already see our hypothesis being tested: family/friends are more distinctive to 2019, and health/healthy are more distinctive to 2020.

## STEP 3: Look for words around "thankful"

We want to see other words/phrases that come up around this word to use in comparison.

### KWIC for "thankful"

#### 2019

In [23]:
kwic_thankful_19 = []

for tweet in tweets_2019:
    tokens_19 = tokenize(tweet, lowercase = True)
    kwic_19 = make_kwic("thankful", tokens_19)
    kwic_thankful_19.extend(kwic_19)

In [24]:
thankful_19_sample = random.sample(kwic_thankful_19, 50)
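
Because `random.sample` draws a different sample on every run, the lines shown below will change each time the cell is executed. A small sketch of fixing the seed to make a draw repeatable (the seed value and the toy list are arbitrary choices for illustration):

```python
import random

# Illustrative list standing in for kwic_thankful_19.
toy_lines = [f"line {i}" for i in range(100)]

random.seed(42)  # any fixed integer works; 42 is arbitrary
first_draw = random.sample(toy_lines, 5)

random.seed(42)  # re-seeding reproduces the same draw
second_draw = random.sample(toy_lines, 5)

print(first_draw == second_draw)
```

`random.sample` raises a `ValueError` if the requested sample is larger than the list, so the sample size has to stay at or below the number of KWIC lines found.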

In [25]:
print_kwic(sort_kwic(thankful_19_sample , order=['R1']))

                  thanksgiving to all be  thankful  and grateful for what
                  i’ve received i’m also  thankful  for every bit of
                       people i am truly  thankful  for but i aint
                         be worse so i’m  thankful  for what i have
                      what have you been  thankful  for this year blessed
                                          thankful  for my beautiful girlfriend
                   and i’m so incredibly  thankful  for all the wonderful
                  extra thankful today 🥰  thankful  for friends amp family
                                          thankful  for you and shawnabenson
                                          thankful  for everyone whos tweeting
                 i thankful this holiday  thankful  for all the food
                             a lot to be  thankful  for i try to
                       share what we are  thankful  for which casts a
                     spirit every day im  thankful  for eve

#### 2020

In [26]:
kwic_thankful_20 = []

for tweet in tweets_2020:
    tokens_20 = tokenize(tweet, lowercase = True)
    kwic_20 = make_kwic("thankful", tokens_20)
    kwic_thankful_20.extend(kwic_20)

In [27]:
thankful_20_sample = random.sample(kwic_thankful_20, 50)

In [28]:
print_kwic(sort_kwic(thankful_20_sample, order=['R1']))

                     thanks we should be  thankful  all year long he
                                          thankful  and grateful for every
                   remember to always be  thankful  and count your blessings
                                          thankful  but not complete 
                       woman can ask for  thankful  foe this beautiful princess
                    of thankfulness i am  thankful  for you httpstcop7b8tba8hs 
                    my darkest times i’m  thankful  for harry for bringing
                                          thankful  for y’all💜 httpstcout5yhnmlqn 
                  26 of thankfulness i’m  thankful  for my sister how
          droakley1689 tomascol amen tom  thankful  for your steadfast faithfulness
                                          thankful  for my family and
                                          thankful  for everyone’s health and
                     people smile i’m so  thankful  for you and what
                    

### What does this tell us?

1. We see that people call out the others they are thankful for in several ways:
 -  "you" (directly addressing one or more people)
 -  "my" or "our" children/kid/friend (possessive)
 -  people/person/everyone (a group of people)
 
 
 **Once again, this tells us that these pronouns play a huge role. Stay tuned on this.**

2. It is difficult to tell who is actually being called out when we only have 4 words on either side. So we need to look at collocates around these words to see what is coming up.
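
The four-word window mentioned above can be sketched as follows. `make_kwic` itself is defined in `functions.ipynb`; `sketch_kwic` is a hypothetical stand-in that assumes a symmetric window of tokens on each side of the keyword.

```python
def sketch_kwic(keyword, tokens, window=4):
    """Hypothetical stand-in for make_kwic: (left context, keyword, right context) per hit."""
    lines = []
    for index, token in enumerate(tokens):
        if token == keyword:
            left = tokens[max(0, index - window):index]
            right = tokens[index + 1:index + 1 + window]
            lines.append((left, token, right))
    return lines


# A made-up tweet, already lowercased and split into tokens.
toy_tweet = "this year i am so thankful for my sister and her kids".split()
toy_kwic = sketch_kwic("thankful", toy_tweet)
for left, node, right in toy_kwic:
    print(" ".join(left).rjust(30), f" {node} ", " ".join(right))
```

In this toy tweet the right-hand context stops at "and", so the window cuts off part of who is being thanked. That is the limitation a wider collocate window is meant to address.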

### What else can we do?

We can use collocates around this word to see their contexts using token lists without stop words.
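
As a rough picture of what collecting collocates involves, here is a sketch that gathers every token within a window around each occurrence of the node word. The real `collocates` function is in `functions.ipynb`; `sketch_collocates` and its handling of the `win` argument are assumptions.

```python
def sketch_collocates(tokens, node, win=(6, 6)):
    """Hypothetical stand-in for collocates: tokens within the window around each hit."""
    left_span, right_span = win
    found = []
    for index, token in enumerate(tokens):
        if token == node:
            found.extend(tokens[max(0, index - left_span):index])
            found.extend(tokens[index + 1:index + 1 + right_span])
    return found


# Made-up tokens with stop words already removed.
toy_no_stop = ["happy", "thanksgiving", "thankful", "family", "friends", "health"]
toy_collocates = sketch_collocates(toy_no_stop, "thankful", win=(1, 2))
print(toy_collocates)
```

Because the stop words are already gone, a window of six tokens reaches further into the original tweet than six words would, and (since the token list is one long sequence) it can also reach into a neighboring tweet.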

In [29]:
coll_2019 = collocates(no_stop_tokens_2019,"thankful", win=[6,6])

In [30]:
sample_19 = random.sample(coll_2019, 500)

In [31]:
coll_2020 = collocates(no_stop_tokens_2020,"thankful", win=[6,6])

In [32]:
sample_20 = random.sample(coll_2020, 500)

We can compare these collocates in a **keyness analysis:**

In [33]:
sample_19_vocab = set(sample_19)  # set lookups are much faster than scanning a list
coll_19_freq = Counter(word for word in tokens_2019 if word in sample_19_vocab)

In [34]:
sample_20_vocab = set(sample_20)  # set lookups are much faster than scanning a list
coll_20_freq = Counter(word for word in tokens_2020 if word in sample_20_vocab)

In [35]:
calculate_keyness(coll_19_freq, coll_20_freq, top=50)

WORD                     Corpus A Freq.  Corpus B Freq.  Keyness
day                      1700      1265      37.886
friends                  984       767       14.224
family                   1710      1413      11.695
thanksgiving             4123      3599      8.869
#happythanksgivng        61        32        7.330


In [36]:
calculate_keyness(coll_20_freq, coll_19_freq, top=50)

WORD                     Corpus A Freq.  Corpus B Freq.  Keyness
year                     1060      720       90.536
u                        464       329       33.197
different                113       54        25.508
lot                      284       196       22.767
&amp;                    1135      1006      19.031
much                     893       789       15.456
help                     193       142       11.652
grateful                 2367      2295      11.404
like                     696       620       11.085
thing                    188       143       9.581
bless                    612       563       6.737


#### What does this tell us?

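We are looking at what is distinctive about the words that occur around the word "thankful". Note that the frequencies compared here are whole-corpus counts for the words that appear in the sampled collocates (the counts for day, friends, family, and year match those in Step 2), so the tables show which of the words found near "thankful" differ most between the two years overall. This should give us an indication of the topics being invoked around expressions of gratitude.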
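
To make the distinction concrete, the sketch below contrasts two ways of counting on made-up data: counting every corpus token whose word appears among the collocates (the approach in cells 33 and 34), and counting the collocates themselves. All names and numbers are illustrative assumptions.

```python
from collections import Counter

# Toy data: a whole corpus and the collocates found near the node word.
toy_corpus_tokens = ["day", "day", "day", "family", "family", "year"]
toy_collocate_sample = ["family", "day"]

# Whole-corpus counts, restricted to words that appear in the sample.
sample_vocab = set(toy_collocate_sample)
corpus_counts = Counter(word for word in toy_corpus_tokens if word in sample_vocab)

# Alternative: count how often each word occurs as a collocate.
collocate_counts = Counter(toy_collocate_sample)

print(corpus_counts)
print(collocate_counts)
```

The first approach says how common a collocate is in the corpus as a whole; the second says how often it occurs near "thankful". Which one is appropriate depends on the question being asked.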

Words distinctive to 2019:
 - reflexive on the day: day, thanksgiving
 - positive descriptions: wonderful, amazing
 - adding to hypothesis: friends, family 
 
Words distinctive to 2020:
 - reflexive on the year: year
 - pronouns: im (i'm), u 
 - adding to hypothesis: health
 - god
 

## Based on these three parts, what does this tell us about our hypothesis?

Loved ones: We can see that direct mentions of loved ones (family and friends) are more distinctive to the 2019 corpus, as are some pronouns (he, our, we).

Health: We can see that mentions of health (health, healthy) are more distinctive to 2020, as is the word *safe*, which makes sense given the pandemic.

### Updated Hypothesis: It appears that loved ones are mentioned more in 2019, and health more in 2020. 

### What do we need to do next?

1. Look at each topic (loved ones and health) independently to measure and analyze how they function differently in the two corpora.

2. Re-evaluate this hypothesis as we go along, possibly making it more nuanced as we discover both quantitatively and qualitatively how these topics are invoked in tweets.

### NOTE: 
*We found other interesting patterns in these corpora that are not directly related to testing the hypothesis. Once we have explored the hypothesis further, we will go into these trends in pronouns and big picture words. See the last analysis notebook.*