## Keyness Analysis:

* Some of the most prominent words (based on the corpus analysis in Part 1), plus other words that may be of interest: `safety`, `consumer`, `problem`, `risk`, `president`
* This section conducts a keyness analysis of these words by comparing their normalized frequencies across the decade corpora
* The same analysis structure is then applied to some of the most prominent bigrams

### Select Corpora:

In [2]:
### 1980s Corpus
import os
os.chdir('../data')
eighties_text = open('eighties.txt').read()

In [3]:
### 1990s corpus:
import os
os.chdir('../data')
nineties_text = open('nineties.txt').read()

In [4]:
### 2000s corpus:
import os
os.chdir('../data')
thous_text = open('twothousands.txt').read()

In [5]:
### 2010s:
import os
os.chdir('../data')
tens_text = open('twentytens.txt').read()

In [6]:
### 2020s:
import os
os.chdir('../data')
twenties_text = open('twenties.txt').read()
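
The five loading cells above repeat the same steps, so they could be collapsed into one helper. The following is a minimal sketch rather than the notebook's own code: the `load_corpora` name, the `DECADE_FILES` mapping, and the UTF-8 encoding are assumptions made for illustration; only the file names come from the cells above.

```python
from pathlib import Path

# Decade labels mapped to the file names used in the cells above.
DECADE_FILES = {
    "80s": "eighties.txt",
    "90s": "nineties.txt",
    "00s": "twothousands.txt",
    "10s": "twentytens.txt",
    "20s": "twenties.txt",
}


def load_corpora(data_dir, files=DECADE_FILES, encoding="utf-8"):
    """Return {decade label: full text} for each listed file found in data_dir.

    The encoding default is an assumption; adjust it to match the real files.
    """
    data_dir = Path(data_dir)
    texts = {}
    for label, name in files.items():
        path = data_dir / name
        if path.exists():  # skip files that are missing instead of raising
            texts[label] = path.read_text(encoding=encoding)
    return texts
```

Calling `load_corpora('../data')` would then return one dictionary keyed by decade, without changing the working directory or leaving file handles open.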

### Create Frequency Lists:

In [7]:
### Relevant functions:
import os
import re
import math
import random
from collections import Counter

import pandas as pd

# Switch to the analysis folder and load the shared helper functions
os.chdir('../data_analysis')
%run functions.ipynb

In [8]:
no_toc_80s=eighties_text[10690:]
token_80s=tokenize(no_toc_80s,lowercase=True,strip_chars='!,."')
count_80s=Counter(token_80s)

In [9]:
no_toc_90s=nineties_text[11325:]
token_90s=tokenize(no_toc_90s,lowercase=True,strip_chars='!,."')
count_90s=Counter(token_90s)

In [10]:
no_toc_00s=thous_text[13296:]
token_00s=tokenize(no_toc_00s,lowercase=True,strip_chars='!,."')
count_00s=Counter(token_00s)

In [11]:
no_toc_10s=tens_text[12975:]
token_10s=tokenize(no_toc_10s,lowercase=True,strip_chars='!,."')
count_10s=Counter(token_10s)

In [12]:
no_toc_20s=twenties_text[15370:]
token_20s=tokenize(no_toc_20s,lowercase=True,strip_chars='!,."')
count_20s=Counter(token_20s)
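
The `tokenize` function used above is loaded from `functions.ipynb` and is not shown in this notebook. As a rough illustration of what such a step might do, here is a hypothetical stand-in (`simple_tokenize` is an invented name, and the real helper may behave differently): it lowercases the text, splits on whitespace, and strips the listed characters from the edges of each token before counting.

```python
from collections import Counter


def simple_tokenize(text, lowercase=True, strip_chars='!,."'):
    """Hypothetical stand-in for the notebook's tokenize helper.

    Splits on whitespace, optionally lowercases, and strips the given
    characters from both ends of every token. Empty tokens are dropped.
    """
    if lowercase:
        text = text.lower()
    tokens = [tok.strip(strip_chars) for tok in text.split()]
    return [tok for tok in tokens if tok]


# Toy sentence for illustration only (not taken from the corpora).
sample = 'The FDA said, "Safety first!" Safety matters.'
tokens = simple_tokenize(sample)
counts = Counter(tokens)
print(tokens)
print(counts["safety"])
```

Because the counts are case-sensitive, lowercasing first is what lets `Safety` and `safety` fall under the same key.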

### Match items and calculate score

#### Item 1: "problem"

In [13]:
size_80s=len(token_80s)
size_90s=len(token_90s)
size_00s=len(token_00s)
size_10s=len(token_10s)
size_20s=len(token_20s)

size_80s, size_90s, size_00s, size_10s, size_20s

(90642, 87007, 107619, 81784, 80278)

In [14]:
base=10000

In [15]:
# A default of 0 avoids a TypeError if the word is absent from a corpus
norm_80s= count_80s.get('problem', 0)/size_80s*base
norm_90s= count_90s.get('problem', 0)/size_90s*base
norm_00s= count_00s.get('problem', 0)/size_00s*base
norm_10s= count_10s.get('problem', 0)/size_10s*base
norm_20s= count_20s.get('problem', 0)/size_20s*base

In [17]:
print(f'decade, "problem", normalized frequency')
print(f'80s: {norm_80s:.2f} per {base:,} words')
print(f'90s: {norm_90s:.2f} per {base:,}')
print(f'00s: {norm_00s:.2f} per {base:,}')
print(f'10s: {norm_10s:.2f} per {base:,}')
print(f'20s: {norm_20s:.2f} per {base:,}')

decade, "problem", normalized frequency
80s: 8.05 per 10,000 words
90s: 5.52 per 10,000
00s: 8.73 per 10,000
10s: 3.30 per 10,000
20s: 1.12 per 10,000


### Observations:
* Use of the term "problem" declined overall, though not steadily
* It picked up in the 2000s (8.73), then dropped to its lowest level of any decade by the 2020s (1.12)
* This may indicate less negative sentiment in the coverage of these cases over time
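
The normalization used for each item is (count / corpus size) × base. For example, a rate of 8.05 per 10,000 in the 90,642-token 1980s corpus corresponds to roughly 73 occurrences (8.05 × 90,642 / 10,000 ≈ 73). A small helper can make the calculation reusable; this is a sketch with an assumed function name (`normalized_freq`) and toy data, not the notebook's own code.

```python
from collections import Counter


def normalized_freq(counts, item, size, base=10_000):
    """Occurrences of `item` per `base` tokens.

    Returns 0.0 when the item is absent or the corpus is empty,
    so the caller never divides by zero or by None.
    """
    if size == 0:
        return 0.0
    return counts.get(item, 0) / size * base


# Toy data for illustration only (not the notebook's corpora).
toy_tokens = "a problem is a problem until it is solved".split()
toy_counts = Counter(toy_tokens)
toy_rate = normalized_freq(toy_counts, "problem", len(toy_tokens), base=100)
print(f"{toy_rate:.2f} per 100 tokens")
```

With the real data, the same call would look like `normalized_freq(count_80s, 'problem', size_80s, base)`.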

#### Item 2: "safety"

In [18]:
# A default of 0 avoids a TypeError if the word is absent from a corpus
norm_80s_safety= count_80s.get('safety', 0)/size_80s*base
norm_90s_safety= count_90s.get('safety', 0)/size_90s*base
norm_00s_safety= count_00s.get('safety', 0)/size_00s*base
norm_10s_safety= count_10s.get('safety', 0)/size_10s*base
norm_20s_safety= count_20s.get('safety', 0)/size_20s*base

In [19]:
print(f'decade, "safety", normalized frequency')
print(f'80s: {norm_80s_safety:.2f} per {base:,} words')
print(f'90s: {norm_90s_safety:.2f} per {base:,}')
print(f'00s: {norm_00s_safety:.2f} per {base:,}')
print(f'10s: {norm_10s_safety:.2f} per {base:,}')
print(f'20s: {norm_20s_safety:.2f} per {base:,}')

decade, "safety", normalized frequency
80s: 12.25 per 10,000 words
90s: 6.78 per 10,000
00s: 9.20 per 10,000
10s: 9.05 per 10,000
20s: 12.46 per 10,000


### Observations:
* The 1980s (12.25) and 2020s (12.46) had comparable rates
* The rate dropped sharply in the 1990s (6.78), then climbed back to 1980s levels by the 2020s
* The later increase may reflect more transparency in company messaging about the contents and risks of the drugs being produced
* COVID-19 likely also raised safety concerns in the 2020s, and the measures taken in response probably influenced the coverage

#### Item 3: "consumer"

In [20]:
# A default of 0 avoids a TypeError if the word is absent from a corpus
norm_80s_con= count_80s.get('consumer', 0)/size_80s*base
norm_90s_con= count_90s.get('consumer', 0)/size_90s*base
norm_00s_con= count_00s.get('consumer', 0)/size_00s*base
norm_10s_con= count_10s.get('consumer', 0)/size_10s*base
norm_20s_con= count_20s.get('consumer', 0)/size_20s*base

In [28]:
print(f'decade, "consumer", normalized frequency')
print(f'80s: {norm_80s_con:.2f} per {base:,} words')
print(f'90s: {norm_90s_con:.2f} per {base:,}')
print(f'00s: {norm_00s_con:.2f} per {base:,}')
print(f'10s: {norm_10s_con:.2f} per {base:,}')
print(f'20s: {norm_20s_con:.2f} per {base:,}')

decade, "consumer", normalized frequency
80s: 4.19 per 10,000 words
90s: 2.99 per 10,000
00s: 1.58 per 10,000
10s: 4.89 per 10,000
20s: 4.86 per 10,000


### Observations:
* Use of the term fell through the 1990s (2.99) and 2000s (1.58), then recovered in the 2010s (4.89) and held steady in the 2020s (4.86)
* Apart from that dip, rates stayed close to the 1980s level (4.19)


#### Item 4: "risk"

In [23]:
# A default of 0 avoids a TypeError if the word is absent from a corpus
norm_80s_risk= count_80s.get('risk', 0)/size_80s*base
norm_90s_risk= count_90s.get('risk', 0)/size_90s*base
norm_00s_risk= count_00s.get('risk', 0)/size_00s*base
norm_10s_risk= count_10s.get('risk', 0)/size_10s*base
norm_20s_risk= count_20s.get('risk', 0)/size_20s*base

In [26]:
print(f'decade, "risk", normalized frequency')
print(f'80s: {norm_80s_risk:.2f} per {base:,} words')
print(f'90s: {norm_90s_risk:.2f} per {base:,}')
print(f'00s: {norm_00s_risk:.2f} per {base:,}')
print(f'10s: {norm_10s_risk:.2f} per {base:,}')
print(f'20s: {norm_20s_risk:.2f} per {base:,}')

decade, "risk", normalized frequency
80s: 2.32 per 10,000 words
90s: 5.52 per 10,000
00s: 17.93 per 10,000
10s: 17.49 per 10,000
20s: 25.04 per 10,000


### Observations:
* Use of the word "risk" rose more than tenfold, from 2.32 per 10,000 words in the 1980s to 25.04 in the 2020s
* Rates in the 2000s (17.93) and 2010s (17.49) were comparable
* The types of drugs being produced in later decades may have carried more risk
* More mentions of "risk" could also reflect greater transparency within the drug industry about the associated risks

#### Item 5: "president"

In [25]:
# A default of 0 avoids a TypeError if the word is absent from a corpus
norm_80s_pres= count_80s.get('president', 0)/size_80s*base
norm_90s_pres= count_90s.get('president', 0)/size_90s*base
norm_00s_pres= count_00s.get('president', 0)/size_00s*base
norm_10s_pres= count_10s.get('president', 0)/size_10s*base
norm_20s_pres= count_20s.get('president', 0)/size_20s*base

In [27]:
print(f'decade, "president", normalized frequency')
print(f'80s: {norm_80s_pres:.2f} per {base:,} words')
print(f'90s: {norm_90s_pres:.2f} per {base:,}')
print(f'00s: {norm_00s_pres:.2f} per {base:,}')
print(f'10s: {norm_10s_pres:.2f} per {base:,}')
print(f'20s: {norm_20s_pres:.2f} per {base:,}')

decade, "president", normalized frequency
80s: 8.16 per 10,000 words
90s: 5.52 per 10,000
00s: 4.00 per 10,000
10s: 1.71 per 10,000
20s: 1.00 per 10,000


### Observations:
* This word was chosen because it may track how often the "president" of a company (or a similar figure) spoke out or was tied to a scandal, so its frequency could serve as a measure of interest
* Its frequency fell in every successive decade, with the 2020s lowest at 1.00 per 10,000 words
* The 1980s had the highest normalized frequency (8.16)
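
The comparisons above rest on normalized frequencies alone. A keyness statistic is one common way to judge whether a difference between two corpora is larger than chance would suggest. The sketch below uses the log-likelihood (G²) statistic, which is not computed anywhere in this notebook; the function name and the toy counts are assumptions for illustration.

```python
import math


def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Log-likelihood (G2) keyness of an item in corpus A versus corpus B.

    freq_* are raw counts of the item; size_* are total token counts.
    Terms with an observed count of 0 contribute nothing to the sum.
    """
    total_freq = freq_a + freq_b
    if total_freq == 0:
        return 0.0
    total_size = size_a + size_b
    # Expected counts if the item were spread evenly across both corpora.
    expected_a = size_a * total_freq / total_size
    expected_b = size_b * total_freq / total_size
    ll = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:
            ll += observed * math.log(observed / expected)
    return 2 * ll


# Toy counts for illustration only (not the notebook's data).
score = log_likelihood(20, 1000, 5, 1000)
print(f"G2 = {score:.2f}")
```

A G² value above 3.84 corresponds to p < 0.05 on a chi-square distribution with one degree of freedom. With the real data, the inputs would be raw counts such as `count_80s.get('risk', 0)` together with `size_80s`, rather than the normalized rates.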

## Bigram Keyness Analysis

In [98]:
### Bigrams Lists:

bigram_80s=get_ngram_tokens(token_80s,n=2)
count_bi_80s=Counter(bigram_80s)

bigram_90s=get_ngram_tokens(token_90s,n=2)
count_bi_90s=Counter(bigram_90s)

bigram_00s=get_ngram_tokens(token_00s,n=2)
count_bi_00s=Counter(bigram_00s)

bigram_10s=get_ngram_tokens(token_10s,n=2)
count_bi_10s=Counter(bigram_10s)

bigram_20s=get_ngram_tokens(token_20s,n=2)
count_bi_20s=Counter(bigram_20s)

In [144]:
bigrams_list = ['risk of', 'heart attack', 
                        'consumer trust', 'consumer confidence',
                        'drug administration','he said'
                        ]

print(f'Normalized Frequencies:  1980s     1990s     2000s     2010s    2020s')

for item in bigrams_list:
    norm_freq_80s=count_bi_80s.get(item,0)/size_80s*base
    norm_freq_90s=count_bi_90s.get(item,0)/size_90s*base
    norm_freq_00s=count_bi_00s.get(item,0)/size_00s*base
    norm_freq_10s=count_bi_10s.get(item,0)/size_10s*base
    norm_freq_20s=count_bi_20s.get(item,0)/size_20s*base

    print(f'{item: <26}{round(norm_freq_80s,2): <10}{round(norm_freq_90s,2): <10}{round(norm_freq_00s,2): <10}{round(norm_freq_10s,2): <10}{round(norm_freq_20s,2): <10}')

Normalized Frequencies:  1980s     1990s     2000s     2010s    2020s
risk of                   0.66      1.49      11.34     5.26      7.1       
heart attack              0.0       0.23      7.15      0.61      0.37      
consumer trust            0.0       0.0       0.0       0.0       0.0       
consumer confidence       0.0       0.0       0.0       0.0       0.0       
drug administration       18.64     12.53     13.1      17.0      18.31     
he said                   20.85     13.1      8.18      4.16      1.12      


### Observations:
* This analysis reports normalized frequencies of phrases per 10,000 words
* The phrase "risk of" became more common in the news/media coverage over time: it was rare in the 1980s (0.66) and 1990s (1.49), peaked in the 2000s (11.34), and stayed well above early levels in the 2010s (5.26) and 2020s (7.1)
* Use of the phrase "heart attack" rose until the 2000s (7.15) and then fell off through the 2020s (0.37); this could relate to the drugs being recalled, as the drugs recalled up to the 2000s may have been more associated with heart issues/concerns
* The phrases "consumer trust" and "consumer confidence" did not occur in any decade; this could be an analysis issue, or it may simply be that these exact phrases are not used in the coverage
* Mentions of "drug administration" stayed relatively steady: they dipped in the 1990s (12.53) and 2000s (13.1), then returned to roughly 1980s levels in the 2010s (17.0) and 2020s (18.31)
* The phrase "he said" declined sharply over time (20.85 to 1.12); this may reflect how many public statements relevant to the issue were quoted (whether made by the head of a company, a representative, or a public official speaking on its behalf)
  * in other words, mentions of public statements dwindled with time
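
The bigram counts above come from `get_ngram_tokens`, which is loaded from `functions.ipynb` and not shown here. Since the lookups use keys like `'risk of'`, it presumably returns space-joined strings. The following hypothetical stand-in (`ngram_strings` is an invented name) illustrates that approach on toy data, including what a zero count looks like for a phrase that never occurs.

```python
from collections import Counter


def ngram_strings(tokens, n=2):
    """Return every run of n adjacent tokens, joined with single spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Toy tokens for illustration only (not taken from the corpora).
toy_tokens = "the risk of heart attack and the risk of stroke".split()
toy_bigrams = Counter(ngram_strings(toy_tokens, n=2))

for phrase in ("risk of", "heart attack", "consumer trust"):
    # .get(phrase, 0) reports 0 for phrases that never occur
    print(phrase, toy_bigrams.get(phrase, 0))
```

A phrase only matches if its words are adjacent and in the same form after tokenization, which is one reason a phrase can show a count of zero even when both words appear elsewhere in the text.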

## KWIC Context Comparisons (Written Notes/Observations):

1980s Corpus:
* Corporate messaging reflected negative press and connotations regarding company actions and public image
* There was ambiguity in accountability, particularly about who should bear responsibility for public health and safety issues
* Reports indicated instances of test evasion/cheating
* Greater emphasis appeared to be placed on maintaining corporate image rather than addressing public health concerns effectively

1990s:
* Notable improvements were observed in corporate messaging, with more mentions of stronger safety methods/protocols.
* Ambiguity persisted, particularly around drug efficacy and the reliability of safety claims.
* The FDA's role became more central, though concerns about corporate transparency remained.

2000s:
* Messaging demonstrated a clear shift toward prioritizing public health and patient safety. 
* Greater mention of “public health”
* Companies increasingly emphasized compliance with regulatory standards, including the need for longer and more rigorous drug studies
* References to “quality assurance” and consumer safety became more frequent, reflecting a higher corporate commitment to addressing public concerns
* Enhanced communication of safety warnings and trust-building marked a significant effort to rebuild credibility and strengthen consumer confidence

2010s:
* More mentions of words and phrases related to credibility and taking responsibility, showing that corporate messaging was increasingly built on these principles
* Suggests stronger adherence to better communication strategies and more attention to health concerns
* A shift toward a higher standard of corporate accountability
* More focus on building public trust, with consumer input having more influence on recall protocols and on how the public was addressed

2020s: Comparison of Frequency Lists Across Decades (2020s vs. Earlier Years)
* In the 2020s, there’s more focus on specific drugs like "Chantix" and "Metformin," similar to the 2010s focus on "Valsartan," while earlier years talked about broader controversies like "diet drugs"
* Contamination terms like "Nitrosamine" and "NDMA" are still key in the 2020s, continuing from the 2010s, whereas earlier decades didn’t focus much on chemical impurities
* The 2020s use more technical language like "impurity" and "acceptable levels," which feels more precise compared to the simpler terms like "safety" and "recall" in the 1980s and 1990s
* Mentions of "patients" and "health" are consistent across all decades, but the 2020s feel more personal with words like "consumers" and "affected" compared to the more corporate or regulatory focus in the past
* Big companies like "Pfizer" and "Pharma" are named outright in the 2020s, reflecting the same scrutiny on companies seen in the 2010s with mentions like "McNeil," but in the 1980s and 1990s, company names felt less central
* Health conditions like "thyroid" and "diabetes" are showing up more in the 2020s, while earlier decades seemed to focus more on singular crises like heart valve issues or weight-loss drugs
* The tone in the 2020s feels proactive, with words like "potential" and "adverse," showing companies trying to get ahead of issues, unlike the 1980s and 1990s, where the focus was more reactive and on assigning blame

## VADER Scores Comparisons (Written Notes/Observations):
* The 2000s corpus had the highest negative sentiment score (0.095)
* The 2020s corpus had the highest positive score (0.08)
* All corpora had negative compound scores except the 2020s corpus; this may indicate improved communication strategies over time, given that media coverage and related sentiment are more positive than in the other decade corpora