# ANALYSIS 3: HEALTH

In the first analysis notebook, we established that the words health and healthy were distinctive to the 2020.

In this notebook I look more closely at invocations of health by:

- explore different words frequencies, and by comparing to keyness from last notebook, decide which words to use to test hypothesis 
- look at words/phrases in context in KWIC
- look at collocates around words

In [1]:
%run functions.ipynb

In [2]:
tweets_2019 = json.load(open("../data/cleaned/tweets_2019.json"))

In [3]:
tweets_2020 = json.load(open("../data/cleaned/tweets_2020.json"))

# Cleaning

To look at frequencies of words we need to put tweets into string and then tokenize

Need to make string of tweets.

In [4]:
string_2019 = ''.join(tweets_2019)

In [5]:
string_2020 = ''.join(tweets_2020)

Tokenize strings.

In [6]:
tokens_2019 = tokenize(string_2019, lowercase = True, strip_chars = '.,!')

In [7]:
tokens_2020 = tokenize(string_2020, lowercase = True, strip_chars = '.,!')

Word and Bigram Freq Need to be Saved

In [8]:
word_freq_2019 = Counter(tokens_2019)
bigram_freq_2019 = Counter(get_bigram_tokens(tokens_2019))

In [9]:
word_freq_2020 = Counter(tokens_2020)
bigram_freq_2020 = Counter(get_bigram_tokens(tokens_2020))

# HEALTH

We will start by comparing invocations of health in the two corpora.

We will look at these words as describing health:

- health
- healthy 
- alive 
- life 
- living 
- body 
- wellness
- well-being 
- sickness
- sick
- disease
- fit


### We can run a filtered frequency.

In [10]:
health_words = ['health', 'healthy', 'alive', 'life', 'live','living', 'body', 'wellness', 
                    'well-being', 'sickness', 'sick', 'disease', 'fit']

In [11]:
filter_freq_list(word_freq_2019, health_words)

[('life', 979),
 ('live', 182),
 ('health', 141),
 ('living', 60),
 ('alive', 58),
 ('healthy', 52),
 ('body', 34),
 ('fit', 15),
 ('sick', 15),
 ('wellness', 3),
 ('disease', 1),
 ('sickness', 1)]

In [12]:
filter_freq_list(word_freq_2020, health_words)

[('life', 838),
 ('health', 203),
 ('live', 162),
 ('healthy', 109),
 ('alive', 81),
 ('living', 55),
 ('sick', 23),
 ('body', 22),
 ('fit', 11),
 ('disease', 7),
 ('well-being', 3)]

### Remember that the distinctive words from Keyness relevant to health were:
- health
- healthy

### So, what do we know so far, given Frequencies and Keyness?
    
    
1. A lot of words related to health are not very frequent. Because we want to explore this topic for our analysis, we will look at health in context, but this might not be a very rich line of inquiry.
  
2. We see that "life" is very frequent. We need to see how it functions in context to make sure it is reflecting what we are interested in. Even so, it will be interesting to look at

  
Thus, for the purposes of our analysis, we want to focus on the most frequent tokens:

 -  health
 -  healthy
 -  life

### What do we need to do next?

A. Look at the context of these words through KWIC

B. Collocates around the keywords we have selected

C. Keyness of these collocates to see what is distinctive for each year about invocations of these ideas.
    

# KWIC: 

#### Now looking at mentions of health in context.

### Health (2019)

In [13]:
kwic_health_19 = []

for tweet in tweets_2019:
    tokens_19 = tokenize(tweet, lowercase = True)
    kwic_19 = make_kwic("health", tokens_19)
    kwic_health_19.extend(kwic_19)

In [14]:
kwic_health_19_sample = random.sample(kwic_health_19, 50)

In [15]:
print_kwic(sort_kwic(kwic_health_19_sample, order=['R1']))

                      bed food water and  health     
                                          health  abundance amp peace happythanksgiving2019
             during the thursday morning  health  amp fitness segment on
                    is about family good  health  and friendships grateful for
        food shelter clothing employment  health  and an education we
                  birthday today in good  health  and happiness i pray
                      am grateful for my  health  and hope that these
                  to bestow blessings of  health  and joy on my
                bring you blessings good  health  and happiness httpstcoxmupuovgrg 
                bring you blessings good  health  and happiness httpstcoxmupuovgrg 
                  birthday today in good  health  and happiness i pray
          family friends provisions life  health  and great tunes 😁🤘🏻🍁🦃🍽❤️
                    family and peace and  health  and whatnot  
                    life be grateful for  health 

### Health (2020)

In [16]:
kwic_health_20 = []

for tweet in tweets_2020:
    tokens_20 = tokenize(tweet, lowercase = True)
    kwic_20 = make_kwic("health", tokens_20)
    kwic_health_20.extend(kwic_20)

In [17]:
kwic_health_20_sample = random.sample(kwic_health_20, 50)

In [18]:
print_kwic(sort_kwic(kwic_health_20_sample, order=['R1']))

                          of a home poor  health     
                        the most part my  health     
                     and loved ones good  health     
                     pups many wishes of  health  amp happiness to you
                   for my family friends  health  amp to all who
                      am grateful for my  health  and all the positive
                   for my family friends  health  and team today 2020
                    people lose they are  health  and free time for
               friends grateful for good  health  and life’s many blessings
                        wishing u a good  health  and may our almighty
                 any attention to mental  health  and the reality of
               🧡 thankful for everyone’s  health  and safety this year
                      am grateful for my  health  and all the positive
                for family friends faith  health  and the travelblogger community
                      our loved ones our  health  and

### Healthy (2019)

In [19]:
kwic_healthy_19 = []

for tweet in tweets_2019:
    tokens_19 = tokenize(tweet, lowercase = True)
    kwic_19 = make_kwic("healthy", tokens_19)
    kwic_healthy_19.extend(kwic_19)

In [20]:
kwic_healthy_19_sample = random.sample(kwic_healthy_19, 50)

In [21]:
print_kwic(sort_kwic(kwic_healthy_19_sample, order=['R1']))

                      working to keep us  healthy  all year long brother
                          i am alive and  healthy  and surrounded by love
                       people tend to be  healthy  and happy they exhibit
                        everyone in wb a  healthy  and happy holiday httpstcotc6ajjxi2f
                    achieve the dream of  healthy  and successful students and
                    achieve the dream of  healthy  and successful students and
               allah continue bless with  healthy  and miji nagari 
                      and everyone to be  healthy  and happy brightest blessings
          thanksgiving skip you’re still  healthy  and rich don’t have
                     says “it’s having a  healthy  appreciation for the broader
                         life i wax very  healthy  but now i thank
         wonderful husband amp beautiful  healthy  children today and everyday
         wonderful husband amp beautiful  healthy  children today and everyday
        

### Healthy (2020)

In [22]:
kwic_healthy_20 = []

for tweet in tweets_2020:
    tokens_20 = tokenize(tweet, lowercase = True)
    kwic_20 = make_kwic("healthy", tokens_20)
    kwic_healthy_20.extend(kwic_20)

In [23]:
kwic_healthy_20_sample = random.sample(kwic_healthy_20, 50)

In [24]:
print_kwic(sort_kwic(kwic_healthy_20_sample, order=['R1']))

                        to stay safe and  healthy     
     mentally physically amp financially  healthy     
                       ones are safe and  healthy     
                   my family and friends  healthy  2020 has been challenging
                   my family and friends  healthy  2020 has been challenging
                          for a safe and  healthy  2021 and always hunterpence
                         long 4 yrs stay  healthy  amp safe 💜 httpstcouyddx6fgbu
                       that my family is  healthy  and that we are
                      to wish everyone a  healthy  and happy thanksgiving let
                          me dear you be  healthy  and happy plz god
                        i hope yall stay  healthy  and safe  
                everyone is staying safe  healthy  and feeling the love
                       that my family is  healthy  and that we are
                 i’m grateful that we’re  healthy  and that we have
                       stayed amp left

### Life (2019)

In [25]:
kwic_life_19 = []

for tweet in tweets_2019:
    tokens_19 = tokenize(tweet, lowercase = True)
    kwic_19 = make_kwic("life", tokens_19)
    kwic_life_19.extend(kwic_19)

In [26]:
kwic_life_19_sample = random.sample(kwic_life_19, 50)

In [27]:
print_kwic(sort_kwic(kwic_life_19_sample, order=['R1']))

                       be great full for  life     
                                     for  life     
                    many blessings in my  life  a full life of
                       best people in my  life  amp in 176 days
                    havenyou both in his  life  and putting him first
                  family and a wonderful  life  and im extremely grateful
                     us through his long  life  and khilafat  
                       the gifts in your  life  and leaving room for
                   the blessings in your  life  blessings thanksgiving thankful success
                                          life  family and friends happy
                  this year blessed with  life  family and friends happy
                   continue to impact my  life  for the better the
                       im warm thank you  life  for keeping me in
             niece nephew nieces friends  life  happy me fun familia
                                          life  has its u

### Life (2020)

In [28]:
kwic_life_20 = []

for tweet in tweets_2020:
    tokens_20 = tokenize(tweet, lowercase = True)
    kwic_20 = make_kwic("life", tokens_20)
    kwic_life_20.extend(kwic_20)

In [29]:
kwic_life_20_sample = random.sample(kwic_life_20, 50)

In [30]:
print_kwic(sort_kwic(kwic_life_20_sample, order=['R1']))

                                     for  life     
                        what you have in  life     
                  instead of an innocent  life     
                    live in harmony with  life  and that all people
                    many blessings in my  life  and i am grateful
                        the people in my  life  and thankful that the
                         that came to my  life  and made it endlessly
                          has been in my  life  and put a smile
                                          life  and all my blessings
                         of things in my  life  and i started to
                    guys have changed my  life  and words will never
                       will go places in  life  because one girl blessed
                 in our community making  life  better for more people
            happy love familia goodtimes  life  blessings blessedlife httpstcown1nxegeey 
                                          life  can be trying in
    

### What does this tell us?

1. As expected, 'health' is surrounded by references to health care workers, the year, and other people a lot more in 2020. 

2. 'healthy' is not very frequent, so it is hard to make conclusions on a small KWIC list. But, it is clear that 'healthy' is also more in this COVID context in 2020 then in 2019 when it is more a good wish in addition to other wishes on Thanksgiving. 

3. 'life' could be its own reach project as it is clearly very multi-faceted in how it is used.

### What can we do next?

It is difficult to compare KWIC, so we can:

A. **look at the collocates around these words to see in what contexts they function.**  
B. **look at the collacates' keyness to see if these contexts are distinctive.**


##### Let's focus on HEALTH and HEALTHY for now and make some conclusions about our hypothesis and then we can look at life in the next notebook.


## Keyness of Collocates Around Health + Healthy.

## Health 

In [31]:
coll_health_19 = collocates(tokens_2019,"health", win=[7,7])

In [32]:
coll_health_19_freq = Counter([word for word in tokens_2019 if (word in coll_health_19)])

In [33]:
coll_health_20 = collocates(tokens_2020,"health", win=[7,7])

In [34]:
coll_health_20_freq = Counter([word for word in tokens_2020 if (word in coll_health_20)])

In [35]:
calculate_keyness(coll_health_19_freq, coll_health_20_freq, top=50)

WORD                     Corpus A Freq.Corpus B Freq.Keyness
day                      1700      1265      90.636
thanksgiving             4123      3599      70.841
•                        130       41        53.779
for                      9222      8791      48.391
family                   1710      1413      47.294
2019                     78        17        45.915
friends                  984       767       40.436
and                      9026      8691      38.824
thankful                 3101      2817      34.091
of                       4837      4542      33.046
our                      1758      1530      31.107
to                       8736      8517      28.434
with                     2115      1916      24.072
give                     1028      868       23.924
he                       558       434       23.233
life                     979       838       20.236
at                       820       694       18.709
#thanksgiving            618       505       18.570
#gr

In [36]:
calculate_keyness(coll_health_20_freq, coll_health_19_freq, top=50)

WORD                     Corpus A Freq.Corpus B Freq.Keyness
safe                     331       120       91.251
year                     1060      720       47.755
through                  303       182       24.134
healthcare               25        5         13.462
but                      1532      1266      12.660
lot                      284       196       11.687
during                   140       84        11.199
positive                 80        43        9.335
has                      706       568       8.268
care                     142       92        8.158
health                   203       141       8.043
been                     706       574       7.267
myself                   122       79        7.027
despite                  51        26        6.931


### Collocates around health distinctive to 2019 corpus: 
 - loved ones: family, friends, families, 
 - nouns: work, home
 - positive descriptions: best, great, abundance
 - life
 - pronouns: he, my, me, we're/i'm

### Collocates around health distinctive to 2020 corpus:
 - covid related nouns: safe, healthcare, care, health
 - modifiers: despite, through, lot, but, despite
 - pronouns: myself
 - encouragement: positive, stay 

# Healthy

In [37]:
coll_healthy_19 = collocates(tokens_2019,"healthy", win=[4,4])

In [38]:
coll_healthy_20 = collocates(tokens_2020,"healthy", win=[4,4])

In [39]:
coll_healthy_19_freq = Counter([word for word in tokens_2020 if (word in coll_healthy_19)])

In [40]:
coll_healthy_20_freq = Counter([word for word in tokens_2020 if (word in coll_healthy_20)])

In [41]:
calculate_keyness(coll_healthy_19_freq, coll_healthy_20_freq, top=50)

WORD                     Corpus A Freq.Corpus B Freq.Keyness
for                      8791      8791      128.311
and                      8691      8691      126.851
to                       8517      8517      124.312
the                      8180      8180      119.393
you                      7138      7138      104.184
i                        5850      5850      85.385
a                        5347      5347      78.043
of                       4542      4542      66.294
thanksgiving             3599      3599      52.530
all                      2978      2978      43.466
this                     2811      2811      41.029
is                       2747      2747      40.094
in                       2708      2708      39.525
that                     2441      2441      35.628
grateful                 2367      2367      34.548
we                       2328      2328      33.979
have                     2253      2253      32.884
be                       2141      2141      31.24

- 2020 corpus does not have distinctive collocates around healthy

### Collocates around healthy distinctive to 2019 corpus:

- hypothesis/loved ones nouns: family, friends 
- expressions of gratitude: thank, gratitude, appreciation, blessing, bless, blessed
- positive modifiers: very, all
- big picture nouns: love, life, hope

### What does this mean for our hypothesis?

Once again, we see our further analysis of health in context, both in KWIC and keyness of the collocates, **support our hypothesis** about family/friends being more distinctive to 2019 and health being distinctive to 2020.

### What else has this analysis suggests?

We see that words surrounding health in 2020 are negative modifiers (but, though, despite) and less positive than those surrounding health in 2019 which are more positive expressions of gratitude, big picture nouns, and modifiers. 

This suggests that the context in which gratitude for these topics occurs is different, cushioned in different language. This makes sense given the year we've had. 