# ANALYSIS 4: Part of Speech & Year Descriptions (Year/2019/2020)

In this notebook, I explore the two corpora using part-of-speech tagging. I look at:
- nouns
- verbs
- adjectives

Then, I calculate the keyness for each of these part-of-speech lists to find out which words are most distinctive to each corpus.

Finally, I look at the collocates and keyness for the word "year" (and for "2019" and "2020" themselves), because people talk about the context of their lives very differently in the two years.

This adds to my assessment of how people describe the year overall, which is the secondary analysis I use in my data story.

In [1]:
%run functions.ipynb

In [2]:
with open("../data/cleaned/tweets_2019.json", encoding="utf-8") as f:  # file is closed after reading
    tweets_2019 = json.load(f)

In [3]:
with open("../data/cleaned/tweets_2020.json", encoding="utf-8") as f:  # file is closed after reading
    tweets_2020 = json.load(f)

# Cleaning

Make strings.

In [4]:
string_2019 = ''.join(tweets_2019)

In [5]:
string_2020 = ''.join(tweets_2020)

Tokenize.

In [6]:
tokens_2019 = tokenize(string_2019, lowercase = True, strip_chars = '.,!')

In [7]:
tokens_2020 = tokenize(string_2020, lowercase = True, strip_chars = '.,!')

Take out stop words.

In [8]:
stop_words = set(stopwords.words('english'))  # separate name, so the NLTK `stopwords` module is not overwritten

In [9]:
no_stop_tokens_2019 = [word for word in tokens_2019 if word not in stop_words]

In [10]:
no_stop_tokens_2020 = [word for word in tokens_2020 if word not in stop_words]

Frequency lists.

In [11]:
word_freq_2019 = Counter(tokens_2019)
bigram_freq_2019 = Counter(get_bigram_tokens(tokens_2019))

In [46]:
word_freq_2020 = Counter(tokens_2020)
bigram_freq_2020 = Counter(get_bigram_tokens(tokens_2020))

## Part of Speech tagging

We need part-of-speech-tagged tokens for each year.

In [12]:
tagged_tokens_2019 = pos_tag(tokens_2019)

In [13]:
sents_2019 = sent_tokenize(string_2019)

In [14]:
tagged_tokens_2020 = pos_tag(tokens_2020)

In [15]:
sents_2020 = sent_tokenize(string_2020)
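
The verb, noun, and adjective sections all filter the tagged tokens by the start of the Penn Treebank tag. As a minimal sketch of that shared step, the helper below (`words_with_tag_prefix` is a hypothetical name, not something defined in functions.ipynb) works on any list of `(word, tag)` pairs. The sample pairs are hand-tagged toy data, not output from the real corpora.

```python
from collections import Counter

# Penn Treebank prefixes: 'V' matches VB, VBD, VBG, VBN, VBP, VBZ;
# 'N' matches NN, NNS, NNP, NNPS; 'JJ' matches JJ, JJR, JJS.
def words_with_tag_prefix(tagged_tokens, prefix):
    """Return the words whose POS tag starts with `prefix` (hypothetical helper)."""
    return [word for word, tag in tagged_tokens if tag.startswith(prefix)]

# Hand-tagged toy sample standing in for the output of pos_tag (not real data)
sample_tagged = [
    ("we", "PRP"), ("stay", "VBP"), ("safe", "JJ"),
    ("workers", "NNS"), ("helped", "VBD"), ("family", "NN"),
]

sample_verbs = Counter(words_with_tag_prefix(sample_tagged, "V"))
sample_nouns = Counter(words_with_tag_prefix(sample_tagged, "N"))
sample_adjectives = Counter(words_with_tag_prefix(sample_tagged, "JJ"))
```

With the real data, the same call would take `tagged_tokens_2019` or `tagged_tokens_2020` in place of the toy sample.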

# VERBS

#### Verbs 2019

In [16]:
verbs_2019 = []
for word, tag in tagged_tokens_2019:
    if tag.startswith('V'):
        verbs_2019.append(word)

In [17]:
verb_dist_2019=Counter(verbs_2019)

In [18]:
verb_dist_2019.most_common(20)

[('is', 2767),
 ('thanksgiving', 2519),
 ('have', 2235),
 ('be', 2187),
 ('are', 1932),
 ('i', 1176),
 ('am', 1095),
 ('appreciate', 895),
 ('give', 790),
 ('love', 759),
 ('was', 702),
 ('#thanksgiving', 589),
 ('do', 587),
 ('been', 574),
 ('has', 568),
 ('blessed', 545),
 ('know', 504),
 ('being', 500),
 ('hope', 494),
 ('get', 426)]

#### Verbs 2020

In [19]:
verbs_2020 = []
for word, tag in tagged_tokens_2020:
    if tag.startswith('V'):
        verbs_2020.append(word)

In [20]:
verb_dist_2020 = Counter(verbs_2020)

In [21]:
verb_dist_2020.most_common(20)

[('is', 2747),
 ('have', 2253),
 ('thanksgiving', 2197),
 ('be', 2141),
 ('are', 1792),
 ('i', 1238),
 ('am', 1123),
 ('appreciate', 923),
 ('love', 802),
 ('was', 728),
 ('has', 706),
 ('been', 706),
 ('give', 671),
 ('do', 591),
 ('know', 562),
 ('blessed', 505),
 ('get', 468),
 ('#thanksgiving', 466),
 ('being', 462),
 ('hope', 457)]

#### We can calculate the keyness for these lists of verbs. 

This will tell us which verbs (actions) are distinctive to each year.
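
To make the idea of keyness concrete, here is a minimal sketch using the log-likelihood (G2) statistic, which is one common way to score how over-represented a word is in one corpus relative to another. This is an assumption: `calculate_keyness` comes from functions.ipynb and may use a different formula or different cut-offs, so `log_likelihood_keyness` and its `min_freq` parameter are hypothetical, and the counts below are made up.

```python
import math
from collections import Counter

def log_likelihood_keyness(freq_a, freq_b, min_freq=5):
    """Score each word in corpus A by signed log-likelihood against corpus B.

    Hypothetical helper: positive scores mean the word is relatively more
    frequent in A, negative scores mean it is relatively more frequent in B.
    """
    total_a = sum(freq_a.values())
    total_b = sum(freq_b.values())
    scores = {}
    for word, a in freq_a.items():
        if a < min_freq:
            continue
        b = freq_b.get(word, 0)
        # Expected counts if the word were equally likely in both corpora
        expected_a = total_a * (a + b) / (total_a + total_b)
        expected_b = total_b * (a + b) / (total_a + total_b)
        g2 = 2 * a * math.log(a / expected_a)
        if b > 0:
            g2 += 2 * b * math.log(b / expected_b)
        if a / total_a < b / total_b:
            g2 = -g2
        scores[word] = g2
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Made-up counts, only to show the shape of the result
toy_a = Counter({"stay": 30, "eat": 10})
toy_b = Counter({"stay": 10, "eat": 30})
toy_keyness = log_likelihood_keyness(toy_a, toy_b)
```

In the toy example "stay" gets a positive score and "eat" a negative one, because "stay" makes up a larger share of corpus A than of corpus B.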

In [22]:
calculate_keyness(verb_dist_2019, verb_dist_2020, top=50)

WORD                     Corpus A Freq.  Corpus B Freq.  Keyness
thanksgiving             2519      2197      20.032
loving                   64        26        16.322
wouldn’t                 35        9         16.250
ate                      27        6         14.323
#thanksgiving            589       466       13.613
#gratitude               73        35        13.424
excited                  25        6         12.394
#blessed                 130       82        10.664
knew                     37        15        9.470
spend                    124       80        9.292
give                     790       671       8.972
y’all                    23        7         8.893
told                     51        28        6.651


In [23]:
calculate_keyness(verb_dist_2020, verb_dist_2019, top=50)

WORD                     Corpus A Freq.  Corpus B Freq.  Keyness
stay                     138       51        42.144
dropped                  41        5         32.368
thinking                 48        16        16.945
looks                    61        24        16.890
god                      292       204       16.252
has                      706       568       15.855
miss                     66        29        15.032
been                     706       574       14.476
help                     142       89        12.603
you’ve                   30        9         12.062
guys                     116       70        11.785
helped                   104       61        11.608
praying                  30        10        10.591
changed                  39        16        10.065
learn                    52        25        9.842
following                37        15        9.746
check                    57        29        9.461
passed                   29        11        8.511
deserve

### What do we find?

The verbs distinctive to 2019 are more reflective of the Thanksgiving holiday:
 - loving 
 - ate
 - spend
 - give
 
The verbs distinctive to 2020 are reflective of the struggles of 2020:
 - stay
 - dropped 
 - miss
 - helped
 - changed
 - learn
 - became 
 
Overall, there is a lot more substantive action in the key verbs from 2020, suggesting that people are talking about real happenings in these tweets. In other words, perhaps the verbs in 2019 function more as "fluff", while those in 2020 are more practical.

# NOUNS

#### Nouns 2019

In [24]:
nouns_2019 = []
for word, tag in tagged_tokens_2019:
    if tag.startswith('N'):
        nouns_2019.append(word)

In [25]:
noun_dist_2019=Counter(nouns_2019)

#### Nouns 2020

In [26]:
nouns_2020 = []
for word, tag in tagged_tokens_2020:
    if tag.startswith('N'):
        nouns_2020.append(word)

In [27]:
noun_dist_2020=Counter(nouns_2020)

Raw frequency lists are hard to compare directly, so let's look at the keyness of the nouns instead:

In [28]:
calculate_keyness(noun_dist_2019, noun_dist_2020, top=50)

WORD                     Corpus A Freq.  Corpus B Freq.  Keyness
day                      1700      1265      45.518
•                        109       31        42.510
cowboys                  49        10        26.320
la                       38        6         24.490
jones                    35        5         23.944
chronicles               77        27        22.806
endureth                 46        13        18.071
friends                  893       692       17.110
store                    32        7         16.222
troops                   32        7         16.222
family                   1710      1413      16.222
season                   186       115       13.784
l                        25        5         13.646
😂                        43        15        12.834
🍁                        51        20        12.606
@                        183       116       12.203
fun                      71        34        11.667
practice                 68        32        11.644
dal

In [29]:
calculate_keyness(noun_dist_2020, noun_dist_2019, top = 50)

WORD                     Corpus A Freq.  Corpus B Freq.  Keyness
year                     1060      720       82.020
workers                  73        12        51.496
im                       148       66        36.145
stay                     87        30        31.686
courts                   50        10        30.992
💯                        38        5         30.248
check                    44        9         26.828
gates                    48        11        26.774
album                    55        15        26.187
grateful                 723       577       23.879
moots                    34        6         22.951
tweet                    148       82        22.383
lot                      283       195       20.619
generations              36        8         20.591
n                        37        10        17.775
care                     104       57        16.182
congratulations          42        14        15.976
health                   203       141       14.287
her

### What does this show us?

Again, we see the year reflected in these tweets. 

For the 2019 corpus, nouns are more focused on traditional expressions of gratitude on Thanksgiving:
- family, friends, relationship, abundance, feast


For the 2020 corpus, we have nouns that really tell the story of the pandemic:
- workers, stay, care, heroes, prayers

For 2020, we also have nouns related to the election:
- election, democracy, courts, states, truth, americans


# ADJECTIVES

#### Adjectives 2019

In [30]:
adj_2019 = []
for word, tag in tagged_tokens_2019:
    if tag.startswith('JJ'):
        adj_2019.append(word)

In [31]:
adj_dist_2019=Counter(adj_2019)

#### Adjectives 2020

In [32]:
adj_2020 = []
for word, tag in tagged_tokens_2020:
    if tag.startswith('JJ'):
        adj_2020.append(word)

In [33]:
adj_dist_2020=Counter(adj_2020)

Keyness for adjectives:

In [34]:
calculate_keyness(adj_dist_2019, adj_dist_2020, top = 50)

WORD                     Corpus A Freq.  Corpus B Freq.  Keyness
full                     224       120       28.878
wonderful                412       273       24.362
#gobblegobble            32        7         16.612
🦃                        63        26        14.762
amazing                  289       199       14.088
#thankful                291       209       11.147
#blackfriday             23        6         10.127
south                    22        6         9.243
#grateful                216       153       8.991
gobble                   28        10        8.343
turkey                   69        38        8.200
@                        40        19        7.020


In [35]:
calculate_keyness(adj_dist_2020, adj_dist_2019, top = 50)

WORD                     Corpus A Freq.  Corpus B Freq.  Keyness
safe                     326       116       110.355
u                        320       204       29.515
different                113       54        23.120
healthy                  109       52        22.382
difficult                58        19        21.894
sorry                    39        11        17.481
tough                    54        21        16.042
#thanksgivingday         32        9         14.386
crazy                    65        30        14.282
willing                  27        7         13.171
positive                 80        43        12.449
normal                   22        5         12.074
white                    56        28        10.379
#bethankful              29        10        10.245
mental                   39        17        9.557
only                     68        38        9.538
alex                     18        5         8.198
own                      96        64        7.446
other 

### What does this mean? How does it add to our discussion comparing gratitude in the two years?

For adjectives distinctive to 2019...
- wonderful
- amazing 
- turkey 
- grateful, thankful

For adjectives distinctive to 2020...
- related to COVID: safe, healthy, normal
- challenges: different, difficult, sorry, weird, sad, tough, crazy

### What did the PoS analysis tell us?

It told us that the contexts in which words function differ between the two corpora, and that these differences reflect the differences between the years overall. Most basically, 2020 was a much harder year for many people than 2019.

# Looking at reflections on the year.

#### We can look at the collocates around "year" to see how people are referencing the year.
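
As a minimal sketch of what a window-based collocate count looks like, the function below counts only the tokens that fall within a fixed window around each occurrence of the node word. This is an assumption about the general technique, not a copy of `collocates` from functions.ipynb (which is not shown here); `window_collocate_counts` and its `left`/`right` parameters are hypothetical, and the sentence is a made-up example.

```python
from collections import Counter

def window_collocate_counts(tokens, node, left=4, right=4):
    """Count tokens appearing within `left`/`right` positions of `node`.

    Hypothetical helper: the node word itself is excluded at its own
    position, and only in-window tokens contribute to the counts.
    """
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        start = max(0, i - left)
        window = tokens[start:i] + tokens[i + 1:i + 1 + right]
        counts.update(window)
    return counts

# Made-up sentence, only to show which tokens land inside the window
toy_tokens = "what a tough year we all stayed safe this year".split()
toy_counts = window_collocate_counts(toy_tokens, "year", left=2, right=2)
```

Counting inside the window gives a different number than counting how often a collocate appears anywhere in the corpus: in the toy sentence, "stayed" never falls within two tokens of "year", so it does not appear in `toy_counts` at all.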

In [36]:
coll_yr_19 = collocates(no_stop_tokens_2019,"year", win=[4,4])

In [37]:
coll_yr_19_freq = Counter([word for word in no_stop_tokens_2019 if (word in coll_yr_19)])

In [38]:
coll_yr_19_freq.most_common()[:30]

[('thanksgiving', 4123),
 ('thankful', 3101),
 ('grateful', 2295),
 ('thanks', 2094),
 ('happy', 1926),
 ('family', 1710),
 ('day', 1700),
 ('love', 1492),
 ('blessings', 1203),
 ('i’m', 1158),
 ('appreciate', 1107),
 ('give', 1028),
 ('&amp;', 1006),
 ('thank', 1001),
 ('friends', 984),
 ('life', 979),
 ('gratitude', 974),
 ('appreciation', 889),
 ('everyone', 844),
 ('people', 839),
 ('god', 838),
 ('today', 825),
 ('lucky', 799),
 ('much', 789),
 ('us', 788),
 ('thankfulness', 779),
 ("i'm", 745),
 ('year', 720),
 ('blessing', 708),
 ('hope', 677)]

In [39]:
coll_yr_20 = collocates(no_stop_tokens_2020,"year", win=[4,4])

In [40]:
coll_yr_20_freq = Counter([word for word in no_stop_tokens_2020 if (word in coll_yr_20)])

In [41]:
coll_yr_20_freq.most_common()[:30]

[('thanksgiving', 3599),
 ('thankful', 2817),
 ('grateful', 2367),
 ('thanks', 1994),
 ('happy', 1897),
 ('love', 1449),
 ('family', 1413),
 ('day', 1265),
 ('i’m', 1170),
 ('blessings', 1168),
 ('appreciate', 1151),
 ('&amp;', 1135),
 ('year', 1060),
 ('thank', 1014),
 ('gratitude', 989),
 ('#oreninyourarea', 924),
 ('appreciation', 917),
 ('god', 916),
 ('much', 893),
 ('give', 868),
 ('life', 838),
 ('people', 816),
 ('us', 800),
 ('everyone', 777),
 ('friends', 767),
 ('thankfulness', 763),
 ('today', 750),
 ('blessing', 715),
 ('one', 697),
 ('like', 696)]

In [42]:
calculate_keyness(coll_yr_19_freq, coll_yr_20_freq, top=50)

WORD                     Corpus A Freq.  Corpus B Freq.  Keyness
day                      1700      1265      83.441
thanksgiving             4123      3599      60.740
2019                     78        17        45.017
family                   1710      1413      42.003
friends                  984       767       36.753
full                     224       120       36.417
cowboys                  73        19        36.113
wonderful                414       278       32.864
thankful                 3101      2817      28.018
🍁                        80        30        25.712
give                     1028      868       20.999
season                   186       115       19.992
amazing                  366       272       18.089
🦃                        213       141       17.897
life                     979       838       17.609
#grateful                261       183       17.213
#thanksgiving            618       505       16.580
part                     149       94        14.952
spe

In [43]:
calculate_keyness(coll_yr_20_freq, coll_yr_19_freq, top=50)

WORD                     Corpus A Freq.  Corpus B Freq.  Keyness
2020                     198       36        116.828
safe                     331       120       94.088
stay                     233       83        68.045
year                     1060      720       51.963
im                       271       166       21.296
you’ve                   119       58        18.997
different                113       54        18.918
healthy                  109       52        18.322
u                        464       329       17.819
✨                        44        13        16.543
though                   124       65        16.355
tweet                    164       97        14.718
alone                    78        36        14.145
tough                    54        21        13.688
#thanksgivingday         56        23        12.877
lot                      284       196       12.770
crazy                    74        36        11.868
heroes                   25        6         11.735
ch

### What does this tell us?

This makes it really clear that people assess the two years very differently. The collocates around "year" in 2020 are much more grounded in the context of the pandemic (and more negative) than those in 2019, which are a lot more positive and general.

In 2019, the words are more positive (amazing, wonderful) or about Thanksgiving/gratitude (thankful, thanksgiving, season).

In 2020, the distinctive words around "year" are about enduring the year (during, through, been, stay) and about how it compares to other years (different, weird, crazy, tough).

## "2020" vs. "2019":

Now let's look directly at how much each of these years is talked about, and in what contexts.

In [44]:
word_freq_2019["2019"]

78

In [47]:
word_freq_2020["2020"]

198

This is really interesting: "2020" appears 198 times in the 2020 corpus, while "2019" appears only 78 times in the 2019 corpus. This reflects how much "2020" has become a cultural trope because the year has been so unusual.
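
Raw counts like these are easiest to compare when the two corpora are about the same size. As a hedged sketch, a count can be normalized to a rate per 10,000 tokens before comparing. `per_10k` is a hypothetical helper and the counter below is toy data, not the real frequency lists.

```python
from collections import Counter

def per_10k(word, freq):
    """Return occurrences of `word` per 10,000 tokens in `freq` (hypothetical helper)."""
    total = sum(freq.values())
    if total == 0:
        return 0.0
    return 10_000 * freq[word] / total

# Toy counter standing in for a real word frequency list
toy_freq = Counter({"2020": 2, "safe": 8})
toy_rate = per_10k("2020", toy_freq)
```

In this notebook the calls would be `per_10k("2019", word_freq_2019)` and `per_10k("2020", word_freq_2020)`.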

#### We can also look at how 2019 and 2020 compare in context

In [43]:
coll_19_2 = collocates(no_stop_tokens_2019,"2019", win=[4,4])

In [44]:
coll_19_freq_2 = Counter([word for word in no_stop_tokens_2019 if (word in coll_19_2)])

In [45]:
coll_19_freq_2.most_common()[:30]

[('thanksgiving', 4123),
 ('thankful', 3101),
 ('grateful', 2295),
 ('thanks', 2094),
 ('happy', 1926),
 ('family', 1710),
 ('day', 1700),
 ('love', 1492),
 ('blessings', 1203),
 ('i’m', 1158),
 ('give', 1028),
 ('&amp;', 1006),
 ('thank', 1001),
 ('friends', 984),
 ('life', 979),
 ('gratitude', 974),
 ('appreciation', 889),
 ('god', 838),
 ('today', 825),
 ('lucky', 799),
 ('much', 789),
 ('us', 788),
 ('thankfulness', 779),
 ('year', 720),
 ('blessing', 708),
 ('hope', 677),
 ('one', 674),
 ('many', 642),
 ('good', 639),
 ('blessed', 622)]

In [46]:
coll_20_2 = collocates(no_stop_tokens_2020,"2020", win=[4,4])

In [47]:
coll_20_freq_2 = Counter([word for word in no_stop_tokens_2020 if (word in coll_20_2)])

In [48]:
coll_20_freq_2.most_common()[:30]

[('thanksgiving', 3599),
 ('thankful', 2817),
 ('grateful', 2367),
 ('thanks', 1994),
 ('happy', 1897),
 ('love', 1449),
 ('family', 1413),
 ('day', 1265),
 ('i’m', 1170),
 ('blessings', 1168),
 ('appreciate', 1151),
 ('&amp;', 1135),
 ('year', 1060),
 ('thank', 1014),
 ('gratitude', 989),
 ('appreciation', 917),
 ('god', 916),
 ('much', 893),
 ('give', 868),
 ('life', 838),
 ('people', 816),
 ('us', 800),
 ('everyone', 777),
 ('friends', 767),
 ('thankfulness', 763),
 ('today', 750),
 ('blessing', 715),
 ('one', 697),
 ('like', 696),
 ('lucky', 693)]

In [49]:
dist_19 = calculate_keyness(coll_19_freq_2, coll_20_freq_2, print_table=False, top=-1, keyness_threshold=-100000)

In [50]:
show_keyitems(dist_19, 30, c1='green', c2='purple',
              corpusA='Distinctive Words Around 2019', corpusB='Distinctive Words Around 2020')

In [51]:
calculate_keyness(coll_20_freq_2, coll_19_freq_2, top=50)

WORD                     Corpus A Freq.  Corpus B Freq.  Keyness
2020                     198       36        82.997
safe                     331       120       52.769


### What does this tell us?

Once again, the reflections on "year" in 2019/2020 and the direct comparison of "2019"/"2020" yield the same results we've been seeing. Tweets in 2019 are more about Thanksgiving, gratitude, and positive descriptions of the year. In 2020, they are more practical. The fact that "safe" is the distinctive word around "2020" reflects what the year has been for many.