# Problem set 3: Text analysis of DOJ press releases

**Total points (without extra credit)**: 52 

- For background:

    - DOJ is the federal law enforcement agency responsible for federal prosecutions; this contrasts with the local prosecutions in the Cook County dataset we analyzed earlier. Here's a short explainer on which crimes get prosecuted federally versus locally: https://www.criminaldefenselawyer.com/resources/criminal-defense/federal-crime/state-vs-federal-crimes.htm#:~:text=Federal%20criminal%20prosecutions%20are%20handled,of%20state%20and%20local%20law. 
    - Here's the Kaggle that contains the data: https://www.kaggle.com/jbencina/department-of-justice-20092018-press-releases 
    - Here's the code the dataset creator used to scrape those press releases, if you're interested: https://github.com/jbencina/dojreleases

## 0.0 Import packages

In [11]:
## helpful packages
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import random
import re
import string

## nltk imports
import nltk
### uncomment and run these lines if you haven't downloaded relevant nltk add-ons yet
### nltk.download('averaged_perceptron_tagger')
### nltk.download('stopwords')
from nltk import pos_tag
from nltk.tokenize import word_tokenize, wordpunct_tokenize
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import stopwords

## spacy imports
import spacy
### uncomment and run the below line if you haven't loaded the en_core_web_sm library yet
### ! python -m spacy download en_core_web_sm
import en_core_web_sm
nlp = en_core_web_sm.load()

## vectorizer
from sklearn.feature_extraction.text import CountVectorizer

## sentiment
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

## lda
from gensim import corpora
import gensim

## repeated printouts and wide-format text
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
pd.set_option('display.max_colwidth', None)

## 0.1 Load and clean text data

In [12]:
## first, unzip the file pset3_inputdata.zip 
## then, run this code to load the unzipped json file and convert to a dataframe
## (may need to change the pathname depending on where you store stuff)
## and convert some of the attributes from lists to values
doj = pd.read_json("pset3_inputdata/combined.json", lines = True)

## due to json, topics are in a list so remove them and concatenate with ;
doj['topics_clean'] = ["; ".join(topic) 
                      if len(topic) > 0 else "No topic" 
                      for topic in doj.topics]

## similarly with components
doj['components_clean'] = ["; ".join(comp) 
                           if len(comp) > 0 else "No component" 
                           for comp in doj.components]

## drop older columns from data
doj = doj[['id', 'title', 'contents', 'date', 'topics_clean', 
           'components_clean']].copy()

doj.head()

FileNotFoundError: File pset3_inputdata/combined.json does not exist

## 1. Tagging and sentiment scoring (17 points)

Focus on the following press release: `id` == "17-1204" about this pharmaceutical kickback prosecution: https://www.forbes.com/sites/michelatindera/2017/11/16/fentanyl-billionaire-john-kapoor-to-plead-not-guilty-in-opioid-kickback-case/?sh=21b8574d6c6c 

The `contents` column is the one we're treating as a document. You may need to convert it from a pandas series to a single string.

We'll call the raw string of this press release `pharma`

In [None]:
## your code to subset to one press release and take the string
pharma_doj = doj[doj['id'] == '17-1204']
pharma = pharma_doj['contents'].iloc[0]

# check to make sure we have raw string of press release
print(type(pharma))
pharma

### 1.1 part of speech tagging (3 points)

A. Preprocess the `pharma` press release to remove all punctuation / digits (you can use `.isalpha()` to subset)

B. With the preprocessed press release from part A, use the part of speech tagger within nltk to tag all the words in that one press release with their part of speech. 

C. Using the output from B, extract the adjectives and sort those adjectives from most occurrences to fewest occurrences. Print a dataframe with the 5 most frequent adjectives and their counts in the `pharma` release. See here for a list of the names of adjectives within nltk: https://pythonprogramming.net/natural-language-toolkit-nltk-part-speech-tagging/

**Resources**:

- Documentation for `.isalpha()`: https://www.w3schools.com/python/ref_string_isalpha.asp
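
As a rough illustration of steps A through C, the sketch below filters tokens with `.isalpha()` and tallies adjective tags with `collections.Counter`. The tokens and the tagged pairs are hand-written stand-ins (an assumption) for what `word_tokenize` and `nltk.pos_tag` would return, so the block runs without any nltk downloads.

```python
from collections import Counter

# Hypothetical tokens; in the problem set these would come from word_tokenize(pharma).
tokens = ["The", "former", "CEO", ",", "charged", "in", "2017", ",",
          "faces", "federal", "charges", "."]

# Step A: keep purely alphabetic tokens (drops punctuation and digits).
alpha_tokens = [tok.lower() for tok in tokens if tok.isalpha()]

# Hand-written (word, tag) pairs standing in for nltk.pos_tag output.
tagged = [("former", "JJ"), ("ceo", "NN"), ("federal", "JJ"),
          ("charges", "NNS"), ("federal", "JJ"), ("larger", "JJR")]

# Step C: the Penn Treebank adjective tags are JJ, JJR, and JJS.
adjective_tags = {"JJ", "JJR", "JJS"}
adj_counts = Counter(word for word, tag in tagged if tag in adjective_tags)

# most_common(n) returns (word, count) pairs sorted from most to fewest.
top_adjectives = adj_counts.most_common(5)
print(alpha_tokens)
print(top_adjectives)
```

The resulting list of pairs can be passed straight to `pd.DataFrame(..., columns=['adjective', 'count'])` to get the requested dataframe.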

In [None]:
## restrict string to only words, removing punctuation and digits
pharma_preprocessed = [word for word in word_tokenize(pharma.lower()) if word.isalpha()]
pharma_preprocessed = ' '.join(pharma_preprocessed)

In [None]:
## tokenize words and then use part of speech tagger
#pharma_preprocessed
words = word_tokenize(pharma_preprocessed)
#words
tags = nltk.pos_tag(words)
#tags

In [None]:
# get adjectives
# got which tags to use from this website : https://cs.nyu.edu/~grishman/jet/guide/PennPOS.html
# (JJ = adjective, JJR = comparative, JJS = superlative)
adjectives = [word for word, tag in tags if tag in ['JJ', 'JJR', 'JJS']]

# count occurrences of each adjective
adj_counts = {}
for adj in adjectives:
    if adj in adj_counts:
        adj_counts[adj] += 1
    else:
        adj_counts[adj] = 1

# sort from most to fewest occurrences and keep the top 5
sorted_adj_counts = sorted(adj_counts.items(), key=lambda item: item[1], reverse=True)
most_common_adjectives = sorted_adj_counts[:5]

# dataframe with only the 5 most frequent adjectives and their counts
df_adjectives = pd.DataFrame(most_common_adjectives, columns=['adjective', 'count'])
print(df_adjectives)

### 1.2 named entity recognition (4 points)

A. Using the original `pharma` press release (so the one before stripping punctuation/digits), use spaCy to extract all named entities from the press release.

B. Print the unique named entities with the tag: `LAW`

In [None]:
## use before preprocessed string to get named entity recognition
spacy_pressrelease = nlp(pharma)
print(type(spacy_pressrelease))
# for one_tok in spacy_pressrelease.ents:
#     print("Entity: " + one_tok.text + "; NER tag: " + one_tok.label_)

In [None]:
## unique entities w/ tag LAW
law_tag = [one_tok.text for one_tok in spacy_pressrelease.ents if one_tok.label_ == 'LAW']
law_tag = set(law_tag) 
law_tag

C. Use Google to summarize in one sentence what the `RICO` named entity means and why this might apply to a pharmaceutical kickbacks case (and not just a mafia case...) 

#### RICO entity explanation

RICO stands for the Racketeer Influenced and Corrupt Organizations Act, which is why it is given the tag `LAW`. It can apply to a pharmaceutical kickbacks case, and not just a mafia case, because the definition of racketeering has expanded to cover patterns of repeated crime in corporate settings, including white-collar crimes like bribery, counterfeiting, theft, embezzlement, fraud, and money laundering, with the company serving as the "corrupt organization".


source: https://en.wikipedia.org/wiki/Racketeer_Influenced_and_Corrupt_Organizations_Act

D. You want to extract the possible sentence lengths the CEO is facing; pull out the named entities with (1) the label `DATE` and (2) that contain the word year or years (hint: you may want to use the `re` module for that second part). Print these named entities.
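
A minimal sketch of the two-part filter described in D, assuming the entities have already been pulled out as `(text, label)` pairs. The pairs below are invented for illustration; with spaCy they would be `(ent.text, ent.label_)` for each `ent` in the parsed document's `.ents`.

```python
import re

# Hypothetical (text, label) pairs standing in for spaCy entities.
entities = [("20 years", "DATE"), ("2017", "DATE"), ("three years", "DATE"),
            ("RICO", "LAW"), ("one year", "DATE"), ("November 2017", "DATE")]

# \b marks word boundaries and s? makes the plural optional,
# so the pattern matches "year" or "years" as a whole word.
year_pattern = re.compile(r"\byears?\b", flags=re.IGNORECASE)

year_entities = [text for text, label in entities
                 if label == "DATE" and year_pattern.search(text)]
print(year_entities)
```

Checking the label first keeps non-date entities out even when their text happens to contain the word "year".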

In [None]:
## named entities with DATE label and w/ year or years

def extract_years(spacy_res):
    years = []

    # loop through the named entities of the spaCy doc that was passed in
    for one_ent in spacy_res.ents:
        if one_ent.label_ == 'DATE':
            # check if 'year' or 'years' appears as a whole word in the named entity
            if re.search(r'\byears?\b', one_ent.text):
                years.append(one_ent.text)
                
    return years


## pass the parsed spaCy doc (not the raw string) so the function can read .ents
possible_sentences = extract_years(spacy_pressrelease)
print(possible_sentences)

E. Pull and print the original parts of the press releases where those year lengths are mentioned (e.g., the sentences or rough region of the press release). Describe in your own words (1 sentence) what length of sentence (prison) and probation (supervised release) the CEO may be facing if convicted after this indictment (if there are multiple lengths mentioned describe the maximum). 

**Hint**: you may want to use `re.search` or `re.findall`, but anything that works is fine 😳.
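
One possible way to pull the surrounding sentences for part E is sketched below. The text is a made-up stand-in for the `pharma` string, and the sentence splitter is a deliberately crude assumption: it breaks after any `.`, `!`, or `?` followed by whitespace, so it would also split after abbreviations such as "U.S.".

```python
import re

# Hypothetical text; the real input would be the `pharma` string.
text = ("The charge provides for a sentence of no greater than 20 years in prison. "
        "The defendants were arrested on Thursday. "
        "It also provides for three years of supervised release.")

# Split wherever sentence-ending punctuation is followed by whitespace.
sentences = re.split(r"(?<=[.!?])\s+", text)

# Keep the sentences that mention "year" or "years" as a whole word.
year_sentences = [s for s in sentences if re.search(r"\byears?\b", s)]
for s in year_sentences:
    print(s)
```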

In [None]:
## your code here
year_pattern = r'([^.]*?years[^.]*\.)'
sentences_with_years = re.findall(year_pattern, pharma)
for sentence in sentences_with_years:
    print(sentence)

The maximum length of sentence for the CEO is 20 years, along with a 3-year probation period if convicted after the indictment.

### 1.3 sentiment analysis (10 points)

A. Subset the press releases to those labeled with one of three topics via `topics_clean`: Civil Rights, Hate Crimes, and Project Safe Childhood. We'll call this `doj_subset` going forward and it should have 717 rows.



In [None]:
## your code here for subsetting
topics = ['Civil Rights', 'Hate Crimes', 'Project Safe Childhood']
doj_subset = doj[doj['topics_clean'].isin(topics)].copy().reset_index(drop=True)
doj_subset

B. Write a function that takes one press release string as an input and:

- Removes named entities from each press release string (**Hint**: you may want to use `re.sub` with an or condition)
- Scores the sentiment of the entire press release using the `SentimentIntensityAnalyzer` and `polarity_scores`
- Returns the length-four (negative, positive, neutral, compound) sentiment dictionary (any order is fine)

Apply that function to each of the press releases in `doj_subset`. 

**Hints**: 

- A function + list comprehension to execute will take about 30 seconds on a respectable local machine and about 2 mins on jhub; if it's taking a very long time, you may want to check your code for inefficiencies. If you can't fix those, for partial credit on this part/full credit on remainder, you can take a small random sample of the 717
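
The `re.sub` hint with an "or" condition could look roughly like the sketch below. The entity strings and sentence are invented; in the problem set the entity list would come from spaCy's `.ents`, and the cleaned string would then be passed to `polarity_scores`.

```python
import re

# Hypothetical entity strings; with spaCy these would be the ent.text values.
entity_texts = ["Justice Department", "Alabama", "U.S. Army", "Army"]
text = "The Justice Department said the U.S. Army soldier lived in Alabama."

# Escape each entity (so "." is literal), put longer strings first so that
# "U.S. Army" is tried before "Army", and join with | as the "or" condition.
ordered = sorted(set(entity_texts), key=len, reverse=True)
entity_pattern = "|".join(re.escape(ent) for ent in ordered)

without_entities = re.sub(entity_pattern, "", text)
# Collapse the doubled spaces left where entities were removed.
without_entities = re.sub(r"\s+", " ", without_entities).strip()
print(without_entities)
```

If a press release has no entities, the joined pattern is an empty string, so a real function would want to skip the substitution in that case.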


In [None]:
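## your code here to define function
## create the analyzer once rather than on every call
sia = SentimentIntensityAnalyzer()

def sentiment_analysis(press_string):
    # run spaCy once to find which tokens belong to a named entity
    nlp_res = nlp(press_string)
    # keep only tokens with an empty ent_type_, i.e. tokens outside any named entity
    without_ner = " ".join([tok.text for tok in nlp_res if not tok.ent_type_])
    
    # polarity_scores returns a dict with keys neg, neu, pos, compound
    sentiment = sia.polarity_scores(without_ner)
    
    return sentiment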
   
# test  
#print(sentiment_analysis('A federal jury convicted Rick Lee Evans, 43, of Anniston, Alabama, today of aggravated sexual abuse of a child after a five-day trial, Assistant Attorney General Leslie R. Caldwell of the Justice Department’s Criminal Division and U.S. Attorney Joyce White Vance of the Northern District of Alabama announced.  According to evidence introduced at trial, Evans, a former U.S. Army soldier, and his then-wife, a Department of Defense employee, were residing in Germany when they were asked to take temporary custody of a five-year-old child whose parents were deployed to Iraq with the U.S. Army.  Evans sexually abused the child on multiple occasions during the 18 months that the child lived with him from May 2007 to December 2008.  Trial Attorney Austin M. Berry of the Criminal Division’s Child Exploitation and Obscenity Section (CEOS) and Assistant U.S. Attorney Jacquelyn Hutzell of the Northern District of Alabama are prosecuting the case.  U.S. Army Criminal Investigations Division and the FBI’s Birmingham, Alabama, Division investigated the case. This case was brought as part of Project Safe Childhood, a nationwide initiative to combat the growing epidemic of child sexual exploitation and abuse, launched in May 2006 by the Department of Justice.  Led by U.S. Attorneys’ offices and CEOS, Project Safe Childhood marshals federal, state and local resources to better locate, apprehend and prosecute individuals who exploit children via the Internet, as well as to identify and rescue victims.'))

In [None]:
scores = [sentiment_analysis(press_str) for press_str in doj_subset['contents']]

In [None]:
## your code here executing the function
## reuse the `scores` list computed above instead of scoring every release a second time
doj_subset['sentiment_scores'] = scores
doj_subset[['contents', 'sentiment_scores']]

C. Add the four sentiment scores to the `doj_subset` dataframe to create a dataframe: `doj_subset_wscore`. Sort from highest neg to lowest neg score and print the top `id`, `contents`, and `neg` columns of the two most neg press releases. 

Notes:

- Don't worry if your sentiment score differs slightly from our output on GitHub; differences in preprocessing can lead to diff scores

In [None]:
## already added column to doj_subset df, so now create a copy
doj_subset_wscore = doj_subset.copy().reset_index(drop=True)
doj_subset_wscore['neg_score'] = doj_subset_wscore['sentiment_scores'].apply(lambda x: x['neg'])
doj_subset_wscore = doj_subset_wscore.sort_values('neg_score', ascending = False)

# print id, contents, neg_score column for two most negative press releases
for i in range(2):
    print(doj_subset_wscore[['id', 'contents', 'neg_score']].iloc[i])

D. With the dataframe from part C, find the mean compound sentiment score for each of the three topics in `topics_clean` using `groupby` and `agg`.

E. Add a 1 sentence interpretation of why we might see the variation in scores (remember that compound is a standardized summary where -1 is most negative; +1 is most positive)


In [None]:
## agg and find the mean compound score by topic
doj_subset_wscore['compound_score'] = doj_subset_wscore['sentiment_scores'].apply(lambda x: x['compound'])
compound_scores = doj_subset_wscore.groupby('topics_clean').agg({'compound_score': 'mean'}).reset_index()
compound_scores

We might see variation in compound scores because of the nature and emotional context of each type of crime. For example, hate crime press releases describe hateful, hostile acts, which would explain the most negative compound score, whereas civil rights press releases might not contain outwardly hostile language, leading to a less negative compound score.

# 2. Topic modeling (25 points)

For this question, use the `doj_subset_wscore` data, which is restricted to civil rights, hate crimes, and project safe childhood and has the sentiment scores added.


## 2.1 Preprocess the data by removing stopwords, punctuation, and non-alpha words (5 points)

A. Write a function that:

- Takes in a single raw string in the `contents` column from that dataframe
- Does the following preprocessing steps:

    - Converts the words to lowercase
    - Removes stopwords, adding the custom stopwords in your code cell below to the default stopwords list
    - Only retains alpha words (so removes digits and punctuation)
    - Only retains words 4 characters or longer
    - Uses the snowball stemmer from nltk to stem

- Returns a joined preprocessed string
    
B. Use `apply` or list comprehension to execute that function and create a new column in the data called `processed_text`
    
C. Print the `id`, `contents`, and `processed_text` columns for the following press releases:

id = 16-718 (this case: https://www.seattletimes.com/nation-world/doj-miami-police-reach-settlement-in-civil-rights-case/)

id = 16-217 (this case: https://www.wlbt.com/story/32275512/three-mississippi-correctional-officers-indicted-for-inmate-assault-and-cover-up/)
    
**Resources**:

- Here's code examples for the snowball stemmer: https://www.geeksforgeeks.org/snowball-stemmer-nlp/

In [None]:
custom_doj_stopwords = ["civil", "rights", "division", "department", "justice",
                        "office", "attorney", "district", "case", "investigation", "assistant",
                       "trial", "assistance", "assist"]

In [None]:
## your code defining a text processing function
def preprocess(words):
    words = words.lower()
    
    stop_words = set(stopwords.words('english')).union(custom_doj_stopwords)
    stemmer = SnowballStemmer('english')
    
    tokens = word_tokenize(words)
    processed = [
        stemmer.stem(tok) for tok in tokens
        if tok.isalpha() and tok not in stop_words and len(tok) >= 4
    ]
    
    return ' '.join(processed)

In [None]:
## your code executing the function
doj_subset_wscore['processed_text'] = [preprocess(content) for content in doj_subset_wscore['contents']]

In [None]:
## your code showing the examples
filtered_doj_wscores = doj_subset_wscore[doj_subset_wscore['id'].isin(['16-718', '16-217'])]
filtered_doj_wscores[['id', 'contents', 'processed_text']]

## 2.2 Create a document-term matrix from the preprocessed press releases and explore top words (5 points)

A. Use the `create_dtm` function I provide (alternately, feel free to write your own!) and create a document-term matrix using the preprocessed press releases; make sure metadata contains the following columns: `id`, `compound` sentiment column you added, and the `topics_clean` column

B. Print the top 10 words for press releases with compound sentiment in the top 5% (so the most positive sentiment)

C. Print the top 10 words for press releases with compound sentiment in the bottom 5% (so the most negative sentiment)

**Hint**: for these, remember the pandas quantile function from pset one.  

D. Print the top 10 words for press releases in each of the three `topics_clean`

For steps B - D, to receive full credit, write a function `get_topwords` that helps you avoid duplicated code when you find top words for the different subsets of the data. There are different ways to structure it but one way is to feed it subsetted data (so data subsetted to one topic etc.) and for it to get the top words for that subset.


In [14]:
def create_dtm(list_of_strings, metadata):
    vectorizer = CountVectorizer(lowercase = True)
    dtm_sparse = vectorizer.fit_transform(list_of_strings)
    dtm_dense_named = pd.DataFrame(dtm_sparse.todense(), 
        columns=vectorizer.get_feature_names_out())
    dtm_dense_named_withid = pd.concat([metadata.reset_index(), dtm_dense_named], axis = 1)
    return(dtm_dense_named_withid)

In [None]:
dtm_doj = create_dtm(list_of_strings=doj_subset_wscore['processed_text'],
                     metadata=doj_subset_wscore[['id', 'compound_score', 'topics_clean']])


dtm_doj

In [13]:
def get_topwords(dtm, num_words=10):
    word_counts = dtm[dtm.columns[4:]].sum(axis=0).sort_values(ascending=False)
    return word_counts.head(num_words)

# thresholds for top and bottom 5% compound scores
top_5_percent_threshold = dtm_doj['compound_score'].quantile(0.95)
bottom_5_percent_threshold = dtm_doj['compound_score'].quantile(0.05)

# get subsets based on thresholds
top_5_percent_dtm = dtm_doj[dtm_doj['compound_score'] >= top_5_percent_threshold]
bottom_5_percent_dtm = dtm_doj[dtm_doj['compound_score'] <= bottom_5_percent_threshold]

print("top 10 words in the top 5% most positive:")
print(get_topwords(top_5_percent_dtm))
print("top 10 words in the bottom 5% most negative:")
print(get_topwords(bottom_5_percent_dtm))

# for topics
for topic in dtm_doj['topics_clean'].unique():
    topic_dtm = dtm_doj[dtm_doj['topics_clean'] == topic]
    print("top 10 words for topic " + topic + ":")
    print(get_topwords(topic_dtm))


NameError: name 'create_dtm' is not defined

## 2.3 Estimate a topic model using those preprocessed words (5 points)

A. Going back to the preprocessed words from part 2.1, estimate a topic model with 3 topics, since you want to see if the unsupervised topic models recover different themes for each of the three manually-labeled areas (civil rights; hate crimes; project safe childhood). You have free rein over the other topic model parameters beyond the number of topics.

B. After estimating the topic model, print the top 15 words in each topic.

**Hints and Resources**:

- Same topic modeling resources linked to above
- Make sure to use the `random_state` argument within the model so that the numbering of topics does not move around between runs of your code

In [None]:
# tokenize text
doj_clean = doj_subset_wscore[doj_subset_wscore.processed_text != ""].copy()
tokenized_text = [wordpunct_tokenize(one_text) 
                for one_text in 
                doj_clean.processed_text]
#tokenized_text

In [None]:
## preprocess and estimate topicmod

### create dictionary
text_proc_dict = corpora.Dictionary(tokenized_text)

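### filter dictionary: drop tokens in fewer than 2% of documents (no_below is an
### absolute document count) or in more than 98% (no_above is a fraction, not a count)
text_proc_dict.filter_extremes(no_below = round(doj_clean.shape[0]*0.02),
                             no_above = 0.98)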

### create corpus from dictionary
corpus_fromdict_proc = [text_proc_dict.doc2bow(one_text) 
                       for one_text in tokenized_text]

# corpus_fromdict_proc

In [None]:
### estimate model
n_topics = 3
ldamod_proc = gensim.models.ldamodel.LdaModel(corpus_fromdict_proc, 
                                              num_topics = n_topics, 
                                              id2word=text_proc_dict, 
                                              passes=6, alpha = 'auto',
                                              per_word_topics = True, 
                                              random_state = 91988)

### print topics and words
topics = ldamod_proc.print_topics(num_words = 15)
for topic in topics:
    print(topic)
    

In [None]:
import pyLDAvis.gensim_models as gensimvis
import pyLDAvis
pyLDAvis.enable_notebook()
lda_display_proc = gensimvis.prepare(ldamod_proc, corpus_fromdict_proc, text_proc_dict)
pyLDAvis.display(lda_display_proc)

## 2.4 Add topics back to main data and explore correlation between manual labels and our estimated topics (10 points)

A. Extract the document-level topic probabilities. Within `get_document_topics`, use the argument `minimum_probability` = 0 to make sure all 3 topic probabilities are returned. Write an assert statement to make sure the length of the list is equal to the number of rows in the `doj_subset_wscore` dataframe

B. Add the topic probabilities to the `doj_subset_wscore` dataframe as columns and create a column, `top_topic`, that assigns each document to its highest-probability topic (e.g., topic 1, 2, or 3)

C. For each of the manual labels in `topics_clean` (Hate Crime, Civil Rights, Project Safe Childhood), print the breakdown of the % of documents with each top topic (so, for instance, Hate Crime has 246 documents-- if 123 of those documents are coded to topic_1, that would be 50%; and so on). **Hint**: pd.crosstab and normalize may be helpful: https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.crosstab.html

D. Using a couple press releases as examples, write a 1-2 sentence interpretation of why some of the manual topics map on more cleanly to an estimated topic than other manual topic(s)
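
To make the percentage breakdown in part C concrete, here is a small sketch of `pd.crosstab` with `normalize`. The six rows are invented for illustration and are not results from the press release data.

```python
import pandas as pd

# Hypothetical labels for six documents (not real results).
toy = pd.DataFrame({
    "topics_clean": ["Hate Crimes", "Hate Crimes", "Hate Crimes", "Hate Crimes",
                     "Civil Rights", "Civil Rights"],
    "top_topic": ["topic_1", "topic_1", "topic_2", "topic_1",
                  "topic_2", "topic_3"],
})

# normalize="index" divides each row by its row total, so every row sums to 1;
# multiplying by 100 turns those shares into percentages.
pct = pd.crosstab(toy["topics_clean"], toy["top_topic"], normalize="index") * 100
print(pct.round(1))
```

Normalizing by row answers "of the documents with this manual label, what share landed in each estimated topic", which is the breakdown the question asks for; `normalize="columns"` would answer the reverse question.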


In [None]:
## your code here to get doc-level topic probabilities 
document_topics = [ldamod_proc.get_document_topics(item, minimum_probability=0) for item in corpus_fromdict_proc]
assert len(document_topics) == len(doj_subset_wscore), "the length of the list is not equal to the number of rows in the doj_subset_wscores dataframe"
print(len(document_topics))
print(len(doj_subset_wscore))
#length of both is 717

In [None]:
## each element of document_topics is a list of (topic_index, probability) tuples,
## and gensim numbers topics from 0
for k in range(n_topics):
    # column topic_1 holds gensim topic 0, topic_2 holds topic 1, and so on;
    # fall back to 0 if a topic is missing from a document's list
    doj_subset_wscore['topic_' + str(k + 1)] = [
        dict(doc).get(k, 0) for doc in document_topics
    ]

## label each document with the column name of its highest-probability topic
doj_subset_wscore['top_topic'] = doj_subset_wscore[['topic_1', 'topic_2', 'topic_3']].idxmax(axis=1)

In [None]:
## topic proportions
topic_distribution = pd.crosstab(doj_subset_wscore['topics_clean'], doj_subset_wscore['top_topic'], normalize='index') * 100
print(topic_distribution)


In [None]:
random_docs = doj_subset_wscore.sample(n=3, random_state=56)  

for index, row in random_docs.iterrows():
    print("document ID:", row['id'])
    print("contents:", row['contents'])
    print("actual topic:", row['topics_clean'])
    print("predicted topic (LDA):", row['top_topic'])
    print("-------------------------------------------------------------------------")

Some manual topics map more cleanly onto an estimated topic than others mainly because the language used within the topic is more consistent. For example, 'hate crime' documents predominantly map to topic_1, likely because these releases share recurring themes and vocabulary, such as 'discrimination', 'violence', or frequent forms of hate crime like graffiti or slurs, which helps the LDA model group them together. 'Civil rights' documents, on the other hand, are mapped less consistently, likely because a more diverse range of content and issues falls under that category.
 

# 3. Extend the analysis from unigrams to bigrams (10 points)

In the previous question, you found top words via a unigram representation of the text. Now, we want to see how those top words change with bigrams (pairs of words)

A. Using the `doj_subset_wscore` data and the `processed_text` column (so the words after stemming/other preprocessing), create a column in the data called `processed_text_bigrams` that combines each consecutive pair of words into a bigram separated by an underscore. Eg:

"depart reach settlem" would become "depart_reach reach_settlem"

Do this by writing a function `create_bigram_onedoc` that takes in a single `processed_text` string and returns a string with its bigrams structured similarly to above example
 
**Hint**: there are many ways to solve but `zip` may be helpful: https://stackoverflow.com/questions/21303224/iterate-over-all-pairs-of-consecutive-items-in-a-list

B. Print the `id`, `processed_text`, and `processed_text_bigrams` columns for press release with id = 16-217

In [None]:
## A
def create_bigram_onedoc(text):
    words = text.split()
    bigrams = ['_'.join(bigram) for bigram in zip(words, words[1:])]
    return ' '.join(bigrams)


doj_subset_wscore['processed_text_bigrams'] = [create_bigram_onedoc(processed_content) for processed_content in doj_subset_wscore['processed_text']]

In [None]:
## B
filtered_doj_wscores = doj_subset_wscore[doj_subset_wscore['id'].isin(['16-217'])]
filtered_doj_wscores[['id', 'processed_text', 'processed_text_bigrams']]

C. Use the create_dtm function and the `processed_text_bigrams` column to create a document-term matrix (`dtm_bigram`) with these bigrams. Keep the following three columns in the data: `id`, `topics_clean`, and `compound` 

D. Print the (1) dimensions of the `dtm` matrix from question 2.2  and (2) the dimensions of the `dtm_bigram` matrix. Comment on why the bigram matrix has more dimensions than the unigram matrix 
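
For intuition on part D, the sketch below counts distinct unigrams and distinct bigrams in a tiny invented corpus. The documents are hypothetical stemmed strings, and plain Python sets stand in for the vocabulary that `CountVectorizer` would build.

```python
# Hypothetical preprocessed documents (already stemmed and space-separated).
docs = ["depart reach settlem polic",
        "polic depart reach agreement",
        "settlem reach polic depart"]

unigram_vocab = set()
bigram_vocab = set()
for doc in docs:
    words = doc.split()
    unigram_vocab.update(words)
    # zip(words, words[1:]) pairs each word with its successor.
    bigram_vocab.update("_".join(pair) for pair in zip(words, words[1:]))

# Each distinct term becomes one column of a document-term matrix.
print(len(unigram_vocab), len(bigram_vocab))
```

Even with only five distinct stems, word order already produces seven distinct bigrams here. The number of rows is unaffected, since there is still one row per document.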

In [None]:
dtm_bigram = create_dtm(list_of_strings=doj_subset_wscore['processed_text_bigrams'],
                     metadata=doj_subset_wscore[['id', 'compound_score', 'topics_clean']])



In [None]:
print("shape of initial dtm matrix: " + str(dtm_doj.shape))
print("shape of bigram dtm matrix: " + str(dtm_bigram.shape))

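# The bigram matrix has the same number of rows (one per press release) but many more columns:
# each column is a distinct term, and the stems combine into far more distinct ordered pairs
# than there are distinct stems.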

E. Find and print the 10 most prevalent bigrams for each of the three topics_clean using the `get_topwords` function from 2.2

In [None]:
# your code here
for topic in dtm_bigram['topics_clean'].unique():
    topic_dtm = dtm_bigram[dtm_bigram['topics_clean'] == topic]
    print("top 10 bigrams for topic " + topic + ":")
    print(get_topwords(topic_dtm))

# 4. Optional extra credit (2 points)

You notice that the pharmaceutical kickbacks press release we analyzed in question 1 was for an indictment, and that in the original data, there's not a clear label for whether a press release outlines an indictment (charging someone with a crime), a conviction (convicting them after that charge either via a settlement or trial), or a sentencing (how many years of prison or supervised release a defendant is sentenced to after their conviction).

You want to see if you can identify pairs of press releases where one press release is from one stage (e.g., indictment) and another is from a different stage (e.g., a sentencing).

You decide that one way to approach is to find the pairwise string similarity between each of the processed press releases in `doj_subset`. There are many ways to do this, so Google for some approaches, focusing on ones that work well for entire documents rather than small strings.

Find the top two pairs (so four press releases total)-- do they seem like different stages of the same crime or just press releases covering similar crimes?
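
One of many possible approaches is cosine similarity between bag-of-words counts, sketched below with only the standard library. The ids and texts are invented, and this is an illustration of the idea rather than the intended solution.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical ids and preprocessed texts (not drawn from the data).
docs = {
    "A": "offic indict assault inmat prison",
    "B": "offic sentenc assault inmat prison",
    "C": "man plead guilti child exploit",
    "D": "polic settlem agreement reform",
}

def cosine_similarity(counts_a, counts_b):
    """Cosine similarity between two bag-of-words Counters."""
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

bags = {doc_id: Counter(text.split()) for doc_id, text in docs.items()}

# Score every unordered pair once, then sort from most to least similar.
pair_scores = sorted(
    ((cosine_similarity(bags[a], bags[b]), a, b) for a, b in combinations(bags, 2)),
    reverse=True,
)
for score, a, b in pair_scores[:2]:
    print(a, b, round(score, 3))
```

With 717 press releases there are 256,686 unordered pairs, so a vectorized route such as scikit-learn's `TfidfVectorizer` followed by `sklearn.metrics.pairwise.cosine_similarity` may be more practical than a pure-Python loop.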

In [None]:
# your code here 