# Problem set 3: Text analysis of DOJ press releases

**Total points (without extra credit)**: 52 

- For background:

    - DOJ is the federal law enforcement agency responsible for federal prosecutions; this contrasts with the local prosecutions in the Cook County dataset we analyzed earlier. Here's a short explainer on which crimes get prosecuted federally versus locally: https://www.criminaldefenselawyer.com/resources/criminal-defense/federal-crime/state-vs-federal-crimes.htm#:~:text=Federal%20criminal%20prosecutions%20are%20handled,of%20state%20and%20local%20law. 
    - Here's the Kaggle that contains the data: https://www.kaggle.com/jbencina/department-of-justice-20092018-press-releases 
    - Here's the code the dataset creator used to scrape those press releases here if you're interested: https://github.com/jbencina/dojreleases

## 0.0 Import packages

In [1]:
## helpful packages
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import random
import re
import string
from collections import Counter 

## nltk imports
import nltk
### uncomment and run these lines if you haven't downloaded relevant nltk add-ons yet
### nltk.download('averaged_perceptron_tagger')
### nltk.download('stopwords')
from nltk import pos_tag
from nltk.tokenize import word_tokenize, wordpunct_tokenize
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import SnowballStemmer
from nltk.corpus import stopwords
from nltk.sentiment import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')

## spacy imports
import spacy
### uncomment and run the below line if you haven't loaded the en_core_web_sm library yet
### ! python -m spacy download en_core_web_sm
import en_core_web_sm
nlp = en_core_web_sm.load()

## vectorizer
from sklearn.feature_extraction.text import CountVectorizer

## sentiment
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

## lda
from gensim import corpora
import gensim

## repeated printouts and wide-format text
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
pd.set_option('display.max_colwidth', None)

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/katinachristensen/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


## 0.1 Load and clean text data

In [2]:
## first, unzip the file pset3_inputdata.zip 
## then, run this code to load the unzipped json file and convert to a dataframe
## (may need to change the pathname depending on where you store stuff)
## and convert some of the attributes from lists to values
doj = pd.read_json("combined.json", lines = True)

## due to json, topics are in a list so remove them and concatenate with ;
doj['topics_clean'] = ["; ".join(topic) 
                      if len(topic) > 0 else "No topic" 
                      for topic in doj.topics]

## similarly with components
doj['components_clean'] = ["; ".join(comp) 
                           if len(comp) > 0 else "No component" 
                           for comp in doj.components]

## drop older columns from data
doj = doj[['id', 'title', 'contents', 'date', 'topics_clean', 
           'components_clean']].copy()

doj.head()

Unnamed: 0,id,title,contents,date,topics_clean,components_clean
0,,Convicted Bomb Plotter Sentenced to 30 Years,"PORTLAND, Oregon. ‚Äì Mohamed Osman Mohamud, 23, who was convicted in 2013 of attempting to use a weapon of mass destruction (explosives) in connection with a plot to detonate a vehicle bomb at an annual Christmas tree lighting ceremony in Portland, was sentenced today to serve 30 years in prison, followed by a lifetime term of supervised release. Mohamud, a naturalized U.S. citizen from Somalia and former resident of Corvallis, Oregon, was arrested on Nov. 26, 2010, after he attempted to detonate what he believed to be an explosives-laden van that was parked near the tree lighting ceremony in Portland.¬† The arrest was the culmination of a long-term undercover operation, during which Mohamud was monitored closely for months as his bomb plot developed.¬† The device was in fact inert, and the public was never in danger from the device. At sentencing, United States District Court Judge Garr M. King, who presided over Mohamed‚Äôs 14-day trial, said ‚Äúthe intended crime was horrific,‚Äù and that the defendant, even though he was presented with options by undercover FBI employees, ‚Äúnever once expressed a change of heart.‚Äù¬† King further noted that the Christmas tree ceremony was attended by up to 10,000 people, and that the defendant ‚Äúwanted everyone to leave either dead or injured.‚Äù¬† King said his sentence was necessary in view of the seriousness of the crime and to serve as deterrence to others who might consider similar acts.¬†¬†¬†¬† ‚ÄúWith today‚Äôs sentencing, Mohamed Osman Mohamud is being held accountable for his attempted use of what he believed to be a massive bomb to attack innocent civilians attending a public Christmas tree lighting ceremony in Portland,‚Äù said John P. Carlin, Assistant Attorney General for National Security.¬† ‚ÄúThe evidence clearly indicated that Mohamud was intent on killing as many people as possible with his attack.¬† Fortunately, law enforcement was able to identify him as a threat, insert themselves in the place of a terrorist that Mohamud was trying to contact, and thwart Mohamud‚Äôs efforts to conduct an attack on our soil.¬† This case highlights how the use of undercover operations against would-be terrorists allows us to engage and disrupt those who wish to commit horrific acts of violence against the innocent public.¬† The many agents, analysts, and prosecutors who have worked on this case deserve great credit for their roles in protecting Portland from the threat posed by this defendant and ensuring that he was brought to justice.‚Äù ‚ÄúThis trial provided a rare glimpse into the techniques Al Qaeda employs to radicalize home-grown extremists,‚Äù said Amanda Marshall, U.S. Attorney for the District of Oregon.¬† ‚ÄúWith the sentencing today, the court has held this defendant accountable.¬†¬† I thank the dedicated professionals in the law enforcement and intelligence communities who were responsible for this successful outcome.¬† I look forward to our continued work with Muslim communities in Oregon who are committed to ensuring that all young people are safe from extremists who seek to radicalize others to engage in violence.‚Äù¬† According to the trial evidence, in February 2009, Mohamud began communicating via e-mail with Samir Khan, a now-deceased al Qaeda terrorist who published Jihad Recollections, an online magazine that advocated violent jihad, and who also published Inspire, the official magazine of al-Qaeda in the Arabian Peninsula.¬† Between February and August 2009, Mohamed exchanged approximately 150 emails with Khan.¬† Mohamud wrote several articles for Jihad Recollections that were published under assumed names. In August 2009, Mohamud was in email contact with Amro Al-Ali, a Saudi national who was in Yemen at the time and is today in custody in Saudi Arabia for terrorism offenses.¬† Al-Ali sent Mohamud detailed e-mails designed to facilitate Mohamud‚Äôs travel to Yemen to train for violent jihad.¬† In December 2009, while Al-Ali was in the northwest frontier province of Pakistan, Mohamud and Al-Ali discussed the possibility of Mohamud traveling to Pakistan to join Al-Ali in terrorist activities. Mohamud responded to Al-Ali in an e-mail: ‚Äúyes, that would be wonderful, just tell me what I need to do.‚Äù¬† Al-Ali referred Mohamud to a second associate overseas and provided Mohamud with a name and email address to facilitate the process. In the following months, Mohamud made several unsuccessful attempts to contact Al-Ali‚Äôs associate.¬† Ultimately, an FBI undercover operative contacted Mohamud via email under the guise of being an associate of Al-Ali‚Äôs.¬† Mohamud and the FBI undercover operative agreed to meet in Portland in July 2010.¬† At the meeting, Mohamud told the FBI undercover operative he had written articles that were published in Jihad Recollections.¬† Mohamud also said that he wanted to become ‚Äúoperational.‚Äù¬† Asked what he meant by ‚Äúoperational,‚Äù Mohamud said he wanted to put an explosion together, but needed help. According to evidence presented at trial, at a meeting in August 2010, Mohamud told undercover FBI operatives he had been thinking of committing violent jihad since the age of 15.¬† Mohamud then told the undercover FBI operatives that he had identified a potential target for a bomb: the annual Christmas tree lighting ceremony in Portland‚Äôs Pioneer Courthouse Square on Nov. 26, 2010.¬† The undercover FBI operatives cautioned Mohamud several times about the seriousness of this plan, noting there would be many people at the event, including children, and emphasized that Mohamud could abandon his attack plans at any time with no shame.¬† Mohamud indicated the deaths would be justified and that he would not mind carrying out a suicide attack on the crowd. According to evidence presented at trial, in the ensuing months Mohamud continued to express his interest in carrying out the attack and worked on logistics.¬† On Nov. 4, 2010, Mohamud and the undercover FBI operatives traveled to a remote location in Lincoln County, Oregon, where they detonated a bomb concealed in a backpack as a trial run for the upcoming attack.¬† During the drive back to Corvallis, Mohamud was asked if was capable looking at all the bodies of those who would be killed during the explosion.¬† In response, Mohamud noted, ‚ÄúI want whoever is attending that event to be, to leave either dead or injured.‚Äù¬† Mohamud later recorded a video of himself, with the assistance of the undercover FBI operatives, in which he read a statement that offered his rationale for his bomb attack.¬† On Nov. 18, 2010, undercover FBI operatives picked up Mohamud to travel to Portland to finalize the details of the attack.¬† On Nov. 26, 2010, just hours before the planned attack, Mohamud examined the 1,800 pound bomb in the van and remarked that it was ‚Äúbeautiful.‚Äù¬† Later that day, Mohamud was arrested after he attempted to remotely detonate the inert vehicle bomb rked near the Christmas tree lighting ceremony This case was investigated by the FBI, with assistance from the Oregon State Police, the Corvallis Police Department, the Lincoln County Sheriff‚Äôs Office and the Portland Police Bureau.¬† The prosecution was handled by Assistant U.S. Attorneys Ethan D. Knight and Pamala Holsinger from the U.S. Attorney‚Äôs Office for the District of Oregon.¬† Trial Attorney Jolie F. Zimmerman, from the Counterterrorism Section of the Justice Department‚Äôs National Security Division, assisted. # # # 14-1077",2014-10-01T00:00:00-04:00,No topic,National Security Division (NSD)
1,12-919,$1 Million in Restitution Payments Announced to Preserve North Carolina Wetlands,"WASHINGTON ‚Äì North Carolina‚Äôs Waccamaw River watershed will benefit from a $1 million restitution order from a federal court, funding environmental projects to acquire and preserve wetlands in an area damaged by illegal releases of wastewater from a corporate hog farm, announced Ignacia S. Moreno, Assistant Attorney General of the Justice Department‚Äôs Environment and Natural Resources Division; U.S. Attorney for the Eastern District of North Carolina Thomas G. Walker; Director Greg McLeod from the North Carolina State Bureau of Investigation; and Camilla M. Herlevich, Executive Director of the North Carolina Coastal Land Trust. Freedman Farms Inc. was sentenced in February 2012 to five years of probation and ordered to pay $1.5 million in fines, restitution and community service payments for violating the Clean Water Act when it discharged hog waste into a stream that leads to the Waccamaw River.¬† William B. Freedman, president of Freedman Farms, was sentenced to six months in prison to be followed by six months of home confinement.¬† Freedman Farms also is required to implement a comprehensive environmental compliance program and institute an annual training program. In an order issued on April 19, 2012, the court ordered that the defendants would be responsible for restitution of $1 million in the form of five annual payments starting in January 2013, which the court will direct to the North Carolina Coastal Land Trust (NCCLT).¬† The NCCLT plans to use the money to acquire and conserve land along streams in the Waccamaw watershed.¬† The court also directed a $75,000 community service payment to the Southern Environmental Enforcement Network, an organization dedicated to environmental law enforcement training and information sharing in the region. ¬†‚ÄúThe resolution of the case against Freedman Farms demonstrates the commitment of the Department of Justice to enforcing the Clean Water Act to ensure the protection of human health and the environment,‚Äù said Assistant Attorney General Moreno.¬† ‚ÄúThe court-ordered restitution in this case will conserve wetlands for the benefit of the people of North Carolina.¬† By enforcing the nation‚Äôs environmental laws, we will continue to ensure that concentrated animal feeding operations (CAFOs) operate without threatening our drinking water, the health of our communities and the environment.‚Äù ‚ÄúThis office is committed to doing our part to hold accountable those who commit crimes against our environment, which can cause serious health problems to residents and damage the environment that makes North Carolina such a beautiful place to live and visit,‚Äù said U.S. Attorney Walker. ‚ÄúThis case shows what we can accomplish when our SBI agents work closely with their local, state and federal partners to investigate environmental crimes and hold the polluters accountable,‚Äù said Director McLeod.¬† ‚ÄúWe‚Äôll continue our efforts to fight illegal pollution that damages our water and puts the public‚Äôs health at risk.‚Äù ¬†‚ÄúThe Waccamaw is unique and wild,‚Äù said Director Herlevich of the North Carolina Coastal Land Trust. ‚ÄúIts watershed includes some of the most extensive cypress gum swamps in the state, and its headwaters at Lake Waccamaw contain fish that are found nowhere else on Earth.¬† We appreciate the trust of the court and the U. S. Attorney, and we look forward to using these funds for conservation projects in a river system that is one of our top conservation priorities.‚Äù According to evidence presented in court, in December 2007 Freedman Farms discharged hog waste into Browder‚Äôs Branch, a tributary to the Waccamaw River that flows through the White Marsh, a large wetlands complex.¬† Freedman Farms, located in Columbus County, N.C., is in the business of raising hogs for market, and this particular farm had some 4,800 hogs.¬† The hog waste was supposed to be directed to two lagoons for treatment and disposal.¬† Instead, hog waste was discharged from Freedman Farms directly into Browder‚Äôs Branch.¬† The Clean Water Act is a federal law that makes it illegal to knowingly or negligently discharge a pollutant into a water of the United States.¬† The Freedman case was investigated by the U.S. Environmental Protection Agency (EPA) Criminal Investigation Division, the U.S. Army Corps of Engineers and the North Carolina State Bureau of Investigation, with assistance from the EPA Science and Ecosystem Support Division.¬† The case was prosecuted by Assistant U.S. Attorney J. Gaston B. Williams of the Eastern District of North Carolina and Trial Attorney Mary Dee Carraway of the Environmental Crimes Section of the Justice Department‚Äôs Environment and Natural Resources Division. The North Carolina Coastal Land Trust is celebrating its 20th anniversary of saving special lands in eastern North Carolina. The organization has protected nearly 50,000 acres of lands with scenic, recreational, historic and ecological values. North Carolina Coastal Land Trust has saved streams and wetlands that provide clean water, forests that are havens for wildlife, working farms that provide local food and nature parks that everyone can enjoy.¬† More information about the Coastal Land Trust is available at www.coastallandtrust.org.",2012-07-25T00:00:00-04:00,No topic,Environment and Natural Resources Division
2,11-1002,$1 Million Settlement Reached for Natural Resource Damages at Superfund Site in Massachusetts,"BOSTON‚Äì A $1-million settlement has been reached for natural resource damages (NRD) at the Blackburn & Union Privileges Superfund Site in Walpole, Mass., the Departments of Justice and Interior (DOI), and the Office of the Massachusetts Attorney General announced today. ¬† The Blackburn & Union Privileges Superfund Site includes 22 acres of contaminated land and water in Walpole. The contamination resulted from the operations of various industrial facilities dating back to the 19th century that exposed the site to asbestos, arsenic, lead and other hazardous substances. ¬† The private parties involved in the settlement include two former owners and operators of the site, W.R. Grace & Co.‚Äì Conn. and Tyco Healthcare Group LP, as well as the current owners, BIM Investment Corp. and Shaffer Realty Nominee Trust. ¬† From about 1915 to 1936, a predecessor of W.R. Grace manufactured asbestos brake linings and clutch linings on a large portion of the property. From 1946 to about 1983, a predecessor of Tyco Healthcare operated a cotton fabric manufacturing business, which used caustic solutions, on a portion of the property. ¬† In a 2010 settlement with U.S. Environmental Protection Agency (EPA), the four private parties agreed to perform a remedial action to clean up the site at an estimated cost of $13 million. The consent decree lodged today resolves both state and federal NRD liability claims; it requires the parties to pay $1,094,169.56 to the state and federal natural resource trustees, the Massachusetts Executive Office of Energy and Environmental Affairs (EEA) and DOI, for injuries to ecological resources including groundwater and wetlands, which provide habitat for waterfowl and wading birds, including black ducks and great blue herons. The trustees will use the settlement funds for natural resource restoration projects in the area. ¬† ‚ÄúThis settlement demonstrates our commitment to recovering damages from the parties responsible for injury to natural resources, in partnership with state trustees,‚Äù said Bruce Gelber, Acting Deputy Assistant Attorney General of the Justice Department‚Äôs Environment and Natural Resources Division. ¬† ‚ÄúThe citizens of Walpole have had to live with the environmental impact of this contamination for many years,‚Äù Attorney General Martha Coakley said. ‚ÄúWe are pleased that today‚Äôs agreement will not only require the responsible parties to reimburse taxpayer dollars, but will also provide funding to begin restoring or replacing the wetland and other natural resources.‚Äù ¬† The consent decree was lodged in the U.S. District Court for Massachusetts. A portion of the funds, $300,000, will be distributed to the EEA-sponsored groundwater restoration projects; $575,000 will be used for ecological restoration projects jointly sponsored by EEA and the U.S. Fish and Wildlife Service (FWS). ¬† In addition, $125,000 will go for projects jointly sponsored by EEA and FWS that achieve both ecological and groundwater restoration; $57,491.34 will be allocated for reimbursement for the FWS‚Äôs assessment costs; and $36,678.22 will be distributed as reimbursement for the commonwealth‚Äôs assessment costs. ‚ÄúThis settlement provides the means for a range of projects designed to compensate the public for decades of groundwater and other ecological damage at this site.¬† I encourage local citizens and organizations to become engaged in the public process that will take place as we solicit, take comment on, and choose these projects in the months ahead,‚Äù said Energy and Environmental Affairs Secretary Richard K. Sullivan Jr., who serves as the Commonwealth‚Äôs Natural Resources Damages trustee. ‚ÄúThis settlement will help restore habitat for fish and wildlife in the Neponset River watershed,‚Äù said Tom Chapman of the FWS New England Field Office. ‚ÄúWe look forward to working with the commonwealth and local stakeholders to implement restoration.‚Äù ¬† ‚ÄúMore than 100 years-worth of industrial activities at this site caused major environmental contamination to the Neponset River, nearby wetlands and to groundwater below the site,‚Äù said Commissioner Kenneth Kimmell of the Massachusetts Department of Environmental Protection (MassDEP), which will staff the Trustee Council for the Commonwealth. ‚ÄúWe will ensure that the community and the public will be active participants in the process to use these NRD funds to restore the injured natural resources.‚Äù ¬† Under the federal Comprehensive Environmental Response, Compensation and Liability Act, EEA and DOI, acting through the FWS, are the designated state and federal natural resource Trustees for the site. The site has been listed on the EPA‚Äôs National Priorities List since 1994. The consent decree is subject to a public comment period and court approval. A copy of the consent decree and instructions about how to submit comments is available on www.usdoj.gov/enrd/Consent_Decrees.html . ¬† After the consent decree is approved, EEA and FWS will develop proposed restoration plans to use the settlement funds for restoration projects. The proposed restoration plans will also be made available to the public for review and comment. ¬† Assistant Attorney General Matthew Brock of Massachusetts Attorney General Coakley's Environmental Protection Division handled this matter. Attorney Jennifer Davis of MassDEP, Attorney Anna Blumkin of EEA and MassDEP‚Äôs NRD Coordinator Karen Pelto also worked on this settlement.",2011-08-03T00:00:00-04:00,No topic,Environment and Natural Resources Division
3,10-015,10 Las Vegas Men Indicted \r\nfor Falsifying Vehicle Emissions Tests,"WASHINGTON‚ÄîA federal grand jury in Las Vegas today returned indictments against 10 Nevada-certified emissions testers for falsifying vehicle emissions test reports, the Justice Department announced. Each defendant faces one felony Clean Air Act count for falsifying reports between November 2007 and May 2009. The number of falsifications varied by defendant, with some defendants having falsified approximately 250 records, while others falsified more than double that figure. One defendant is alleged to have falsified over 700 reports. The individuals indicted include:¬† Escudero resides in Pahrump, Nev. All other individuals are from Clark County, Nev. The 10 defendants are alleged to have engaged in a practice known as ""clean scanning"" vehicles. The scheme involved entering the Vehicle Identification Number (VIN) for a vehicle that would not pass the emissions test into the computerized system, then connecting a different vehicle the testers knew would pass the test. These falsifications were allegedly performed for anywhere from $10 to $100 over and above the usual emissions testing fee. The U.S. Environmental Protection Agency (EPA), under the Clean Air Act, requires the state of Nevada to conduct vehicle emissions testing in certain areas because the areas exceed national standards for carbon monoxide and ozone. Las Vegas is currently required to perform emissions testing. To obtain a registration renewal, vehicle owners bring the vehicles to a licensed inspection station for testing. The emissions inspector logs into a computer to activate the system by using a unique password issued to the emissions inspector. The emissions inspector manually inputs the vehicle‚Äôs VIN to identify the tested vehicle, then connects the vehicle for model year 1996 and later to an onboard diagnostics port connected to an analyzer. The analyzer downloads data from the vehicle‚Äôs computer, analyzes the data and provides a ""pass"" or ""fail"" result. The pass or fail result and vehicle identification data are reported on the Vehicle Inspection Report. It is a crime to knowingly alter or conceal any record or other document required to be maintained by the Clean Air Act.¬† ""Falsifications of vehicle emissions testing, such as those alleged in the indictments unsealed today, are serious matters and we intend to use all of our enforcement tools to stop this harmful practice. These actions undermine a system that is designed to reduce air pollutants including smog and provide better air quality for the citizens of Nevada,"" said Ignacia S. Moreno, Assistant Attorney General for the Justice Department‚Äôs Environment and Natural Resources Division. ""The residents of Nevada deserve to know that the vast majority of licensed vehicle emission inspectors are not corrupt and are not circumventing emission testing procedures,"" said U.S. Attorney Bogden. ""These indictments should serve as a clear warning to offenders that the Department of Justice will prosecute you if you make fraudulent statements and reports concerning compliance with the federal Clean Air Act."" ""Lying about car emissions means dirtier air, which is especially of concern in areas like Las Vegas that are already experiencing air quality problems,"" said Cynthia Giles, Assistant Administrator for Enforcement and Compliance Assurance at EPA. ""We will take aggressive action to ensure communities have clean air."" The maximum penalty for the felony violations contained in the indictments includes up to two years in prison and a fine of up to $250,000. An indictment is merely an accusation, and a defendant is presumed innocent unless and until proven guilty in a court of law. The case was investigated by the EPA, Criminal Investigation Division; and the Nevada Department of Motor Vehicles Compliance Enforcement Division. The case is being prosecuted by the U.S. Attorney‚Äôs Office for the District of Nevada and the Justice Department‚Äôs Environmental Crimes Section.",2010-01-08T00:00:00-05:00,No topic,Environment and Natural Resources Division
4,18-898,"$100 Million Settlement Will Speed Cleanup Work at Centredale Manor Superfund Site in North Providence, R.I.","The U.S. Department of Justice, the U.S. Environmental Protection Agency (EPA), and the Rhode Island Department of Environmental Management (RIDEM) announced today that two subsidiaries of Stanley Black & Decker Inc.‚ÄîEmhart Industries Inc. and Black & Decker Inc.‚Äîhave agreed to clean up dioxin contaminated sediment and soil at the Centredale Manor Restoration Project Superfund Site in North Providence and Johnston, Rhode Island.¬† ‚ÄúWe are pleased to reach a resolution through collaborative work with the responsible parties, EPA, and other stakeholders,‚Äù said¬†Acting Assistant Attorney General Jeffrey H. Wood for the Justice Department's¬†Environment and Natural Resources Division . ‚ÄúToday‚Äôs settlement ends protracted litigation and allows for important work to get underway to restore a healthy environment for citizens living in and around the Centredale Manor Site and the Woonasquatucket River.‚Äù ‚ÄúThis settlement demonstrates the tremendous progress we are achieving working with responsible parties, states, and our federal partners to expedite sites through the entire Superfund remediation process,‚Äù said EPA Acting Administrator Andrew Wheeler. ‚ÄúThe Centredale Manor Site has been on the National Priorities List for 18 years; we are taking charge and ensuring the Agency makes good on its promise to clean it up for the betterment of the environment and those communities affected.‚Äù ‚ÄúSuccessfully concluding this settlement paves the way for EPA to make good on our commitment to aggressively pursue cleaning up the Centredale Manor Superfund Site,‚Äù said EPA New England Regional Administrator Alexandra Dunn. ‚ÄúWe are excited to get to work on the cleanup at this site, and get it closer to the goal of being fully utilized by the North Providence and Johnston communities.‚Äù ‚ÄúWe are pleased that the collective efforts of the State of Rhode Island, EPA, and DOJ in these negotiations have concluded in this major milestone toward the cleanup of the Centredale Manor Restoration Superfund site and are consistent with our long-standing efforts to make the polluter pay,‚Äù said RIDEM Director Janet Coit. ‚ÄúThe settlement will speed up a remedy that protects public health and the river environment, and moves us closer to the day that we can reclaim recreational uses of this beautiful river resource.‚Äù The settlement, which includes cleanup work in the Woonasquatucket River (River) and bordering residential and commercial properties along the River, requires the companies to perform the remedy selected by EPA for the Site in 2012, which is estimated to cost approximately $100 million, and resolves longstanding litigation. The cleanup remedy includes excavation of contaminated sediment and floodplain soil from the Woonasquatucket River, including from adjacent residential properties. Once the cleanup remedy is completed, full access to the Woonasquatucket River should be restored for local citizens. The cleanup will be a step toward the State‚Äôs goal of a fishable and swimmable river. The work will also include upgrading caps over contaminated soil in the peninsula area of the Site that currently house two high-rise apartment buildings. The settlement also ensures that the long-term monitoring and maintenance of the site, as directed in the remedy, will be implemented to ensure that public health is protected.¬† Under the settlement, Emhart and Black & Decker will reimburse EPA for approximately $42 million in past costs incurred at the Site. The companies will also reimburse EPA and the State of Rhode Island for future costs incurred by those agencies in overseeing the work required by the settlement. The settlement will also include payments on behalf of two federal agencies to resolve claims against those agencies. These payments, along with prior settlements related to the Site, will result in a 100 percent recovery for the United States of its past and future response costs related to the Site. Litigation related to the Site has been ongoing for nearly eight years. While the Federal District Court found Black & Decker and Emhart to be liable for their hazardous waste and responsible to conduct the cleanup of the Site, it had also ruled that EPA needed to reconsider certain aspects of that cleanup. EPA appealed the decision requiring it to reconsider aspects of the cleanup. This settlement, once entered by the District Court, will resolve the litigation between the United States, Rhode Island, and Emhart and Black and Decker, allowing the cleanup of the Site to begin. The Site spans a one and a half mile stretch of the Woonasquatucket River and encompasses a nine-acre peninsula, two ponds and a significant forested wetland. From the 1940s to the early 1970s, Emhart‚Äôs predecessor operated a chemical manufacturing facility on the peninsula and used a raw material that was contaminated with 2,3,7,8-tetrachlorodibenzo-p-dioxin, a toxic form of dioxin. The Site property was also previously used by a barrel refurbisher. Elevated levels of dioxins and other contaminants have been detected in soil, groundwater, sediment, surface water and fish.¬† The Site was added to the National Priorities List (NPL) in 2000, and in December 2017, EPA included the Centredale Manor Restoration Project Superfund Site on a list of Superfund sites targeted for immediate and intense attention. Several short-term actions were previously performed at the Site to address immediate threats to the residents and minimize potential erosion and downstream transport of contaminated soil and sediment. This settlement is the latest agreement EPA has reached since the Site was listed on the NPL. Prior agreements addressed the performance and recovery of costs for the past environmental investigations and interim cleanup actions from Emhart, the barrel reconditioning company, the current owners of the peninsula portion of the Site, and other potentially responsible parties. The Consent Decree, lodged in the U.S. District Court of Rhode Island, will be posted in the Federal Register and available for public comment for a period of 30 days. The Consent Decree can be viewed on the Justice Department website:¬†www.justice.gov/enrd/Consent_Decrees.html.¬† EPA information on the Centredale Manor Superfund Site:¬†www.epa.gov/superfund/centredale.",2018-07-09T00:00:00-04:00,Environment,Environment and Natural Resources Division


## 1. Tagging and sentiment scoring (17 points)

Focus on the following press release: `id` == "17-1204" about this pharmaceutical kickback prosecution: https://www.forbes.com/sites/michelatindera/2017/11/16/fentanyl-billionaire-john-kapoor-to-plead-not-guilty-in-opioid-kickback-case/?sh=21b8574d6c6c 

The `contents` column is the one we're treating as a document. You may need to to convert it from a pandas series to a single string.

We'll call the raw string of this press release `pharma`

In [5]:
## your code to subset to one press release and take the string 
string = ' '.join(doj['contents'].astype(str)) 

# Filter the DataFrame to get the row corresponding to the press release with id "17-1204"
subset_data = doj[doj['id'] == "17-1204"]

# Convert the contents column of the subset to a single string
pharma = subset_data['contents'].iloc[0] 

pharma 

'The founder and majority owner of Insys Therapeutics Inc., was arrested today and charged with leading a nationwide conspiracy to profit by using bribes and fraud to cause the illegal distribution of a Fentanyl spray intended for cancer patients experiencing breakthrough pain.\xa0"More than 20,000 Americans died of synthetic opioid overdoses last year, and millions are addicted to opioids. And yet some medical professionals would rather take advantage of the addicts than try to help them," said Attorney General Jeff Sessions. "This Justice Department will not tolerate this.\xa0 We will hold accountable anyone ‚Äì from street dealers to corporate executives -- who illegally contributes to this nationwide epidemic.\xa0 And under the leadership of President Trump, we are fully committed to defeating this threat to the American people.‚ÄùJohn N. Kapoor, 74, of Phoenix, Ariz., a current member of the Board of Directors of Insys, was arrested this morning in Arizona and charged with RICO co

### 1.1 part of speech tagging (3 points)

A. Preprocess the `pharma` press release to remove all punctuation / digits (you can use `.isalpha()` to subset)

B. With the preprocessed press release from part A, use the part of speech tagger within nltk to tag all the words in that one press release with their part of speech. 

C. Using the output from B, extract the adjectives and sort those adjectives from most occurrences to fewest occurrences. Print a dataframe with the 5 most frequent adjectives and their counts in the `pharma` release. See here for a list of the names of adjectives within nltk: https://pythonprogramming.net/natural-language-toolkit-nltk-part-speech-tagging/

**Resources**:

- Documentation for `.isalpha()`: https://www.w3schools.com/python/ref_string_isalpha.asp

In [6]:
## your code here to restrict to alpha
pharma_cleaned = ''.join(char for char in pharma if char.isalpha() or char.isspace())

In [7]:
## your code here for part of speech tagging and part C 

# Tokenize the cleaned press release text into words
tokens = word_tokenize(pharma_cleaned)

# Perform part-of-speech tagging
pos_tags = nltk.pos_tag(tokens)

# Part C 
# Filter adjectives from pos_tags
adjectives = [word for word, pos in pos_tags if pos.startswith('JJ')]

# Count the occurrences of each adjective
adjectives_count = Counter(adjectives)

# Sort adjectives by frequency
sorted_adjectives = sorted(adjectives_count.items(), key=lambda x: x[1], reverse=True)

# Convert the sorted adjectives to a DataFrame
df_top_adjectives = pd.DataFrame(sorted_adjectives, columns=['Adjective', 'Count'])

# Select the top 5 most frequent adjectives
top_5_adjectives = df_top_adjectives.head(5)

top_5_adjectives 

Unnamed: 0,Adjective,Count
0,former,8
1,opioid,5
2,nationwide,4
3,other,3
4,addictive,3


## 1.2 named entity recognition (4 points)

A. Using the original `pharma` press release (so the one before stripping punctuation/digits), use spaCy to extract all named entities from the press release.

B. Print the unique named entities with the tag: `LAW`

In [8]:
## your code here for part A
# Process the original pharmaceutical press release text
doc = nlp(pharma)

# Extract named entities
named_entities = [(ent.text, ent.label_) for ent in doc.ents]

# Print the named entities
for entity, label in named_entities:
    print(f"{entity}: {label}")

Insys Therapeutics Inc.: ORG
today: DATE
Fentanyl: PERSON
More than 20,000: CARDINAL
Americans: NORP
last year: DATE
millions: CARDINAL
Jeff Sessions: PERSON
This Justice Department: ORG
Trump: PERSON
American: NORP
‚ÄùJohn N. Kapoor: PERSON
74: DATE
Phoenix: GPE
Ariz.: GPE
the Board of Directors: ORG
Insys: ORG
this morning: TIME
Arizona: GPE
RICO: LAW
Kapoor: PERSON
Executive: ORG
Board: ORG
Insys: ORG
Phoenix: GPE
today: DATE
U.S.: GPE
District Court: ORG
Boston: GPE
a later date: DATE
today: DATE
Boston: GPE
Insys: ORG
December 2016.The: DATE
Kapoor: GPE
Michael L. Babich: PERSON
40: DATE
Scottsdale: GPE
Ariz.: GPE
Alec Burlakoff: PERSON
42: DATE
Charlotte: GPE
N.C.: GPE
Richard M. Simon: PERSON
46: DATE
Seal Beach: GPE
Calif.: GPE
Sunrise Lee: PERSON
36: DATE
Bryant City: GPE
Mich.: GPE
Joseph A. Rowan: PERSON
43: DATE
Panama City: GPE
Fla.: GPE
Managed Markets: ORG
Michael J. Gurry: PERSON
53: DATE
Scottsdale: GPE
Ariz.: GPE
Subsys: ORG
Kapoor: GPE
six: CARDINAL
Kapoor: PERSON
Un

In [9]:
## your code here for part B
# Extract unique named entities with the tag "LAW"
law_entities = set([(ent.text, ent.label_) for ent in doc.ents if ent.label_ == "LAW"])

# Print the unique named entities with the tag "LAW"
for entity, label in law_entities:
    print(f"{entity}: {label}")

the Controlled Substances Act: LAW
RICO: LAW


C. Use Google to summarize in one sentence what the `RICO` named entity means and why this might apply to a pharmaceutical kickbacks case (and not just a mafia case...) 

- RICO (Racketeer Influenced and Corrupt Organizations Act) is a U.S. federal law targeting organized crime (like a mafia case) but can also apply to cases involving illegal activities by corporations or individuals connected to a criminal enterprise, such as pharmaceutical kickbacks, becasue RICO includes various forms of racketeering and unlawful behavior. 

D. You want to extract the possible sentence lengths the CEO is facing; pull out the named entities with (1) the label `DATE` and (2) that contain the word year or years (hint: you may want to use the `re` module for that second part). Print these named entities.

In [10]:
## your code here
# Extract named entities with the label "DATE" and containing the words "year" or "years"
sentence_lengths = [(ent.text, ent.label_) for ent in doc.ents if ent.label_ == "DATE" and re.search(r'\b(?:year|years)\b', ent.text, flags=re.IGNORECASE)]

# Print the extracted named entities
for entity, label in sentence_lengths:
    print(f"{entity}: {label}")

last year: DATE
20 years: DATE
three years: DATE
five years: DATE
three years: DATE


E. Pull and print the original parts of the press releases where those year lengths are mentioned (e.g., the sentences or rough region of the press release). Describe in your own words (1 sentence) what length of sentence (prison) and probation (supervised release) the CEO may be facing if convicted after this indictment (if there are multiple lengths mentioned describe the maximum). 

**Hint**: you may want to use re.search or re.findall 

- For part E, you can use `re.search` and `re.findall`, or anything that works üò≥.

In [15]:
## your code here --- NEED TO FIX THIS

# Convert the pharma press release into sentences
sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', pharma)

# Find and print the sentences containing the relevant named entities
sentences_containing_years = []
for sentence in sentences:
    if any(re.search(r'\b(?:year|years)\b', sentence, flags=re.IGNORECASE) for entity in sentence_lengths):
        sentences_containing_years.append(sentence)

print("Sentences containing year lengths:")
for sentence in sentences_containing_years:
    print(sentence)


Sentences containing year lengths:
"More than 20,000 Americans died of synthetic opioid overdoses last year, and millions are addicted to opioids.
Neves, Special Agent in Charge of the VA OIG Northeast Field Office.The charges of conspiracy to commit RICO and conspiracy to commit mail and wire fraud each provide for a sentence of no greater than 20 years in prison, three years of supervised release and a fine of $250,000, or twice the amount of pecuniary gain or loss.
 The charges of conspiracy to violate the Anti-Kickback Law provide for a sentence of no greater than five years in prison, three years of supervised release and a $25,000 fine.


## Describe in your own words (1 sentence) what length of sentence (prison) and probation (supervised release) the CEO may be facing if convicted after this indictment (if there are multiple lengths mentioned describe the maximum).
- If convicted after this indictment, the CEO may face a maximum sentence of 20 years in prison.

## 1.3 sentiment analysis  (10 points)

A. Subset the press releases to those labeled with one of three topics via `topics_clean`: Civil Rights, Hate Crimes, and Project Safe Childhood. We'll call this `doj_subset` going forward and it should have 717 rows.



In [16]:
## your code here for subsetting
doj_subset = doj[doj['topics_clean'].isin(['Civil Rights', 'Hate Crimes', 'Project Safe Childhood'])]
print("There are " + str(len(doj_subset)) + " rows in doj_subset.") 
doj_subset 

There are 717 rows in doj_subset.


Unnamed: 0,id,title,contents,date,topics_clean,components_clean
77,17-1235,Additional Former Correctional Officer Pleads Guilty to Beating of Handcuffed and Shackled Inmate at Angola State Prison,"A former supervisory correctional officer at Louisiana State Penitentiary in Angola, Louisiana, pleaded guilty yesterday in connection with the beating of a handcuffed and shackled inmate, in addition to conspiring to cover up their misconduct by falsifying official records and lying to internal investigators about what happened.¬† ¬† James Savoy, 39, of Marksville, Louisiana, admitted during his plea hearing that he witnessed other officers using excessive force against the inmate and failed to intervene; that he conspired with other officers to cover up the beating by engaging in a variety of obstructive acts; and that he personally falsified official prison records to cover up the attack. ¬† Scotty Kennedy, 48, of Beebe, Arkansas, and John Sanders, 30, of Marksville, Louisiana previously pleaded guilty in November 2016, and September 2017, for their roles in the beating and cover up. ¬† ‚ÄúEvery citizen has the right to due process and protection from unreasonable force, and correctional officers who violate these basic Constitutional rights must be held accountable for their egregious actions‚Äù said Acting Assistant Attorney General John Gore of the Civil Rights Division.¬† ‚ÄúThe Justice Department will continue to vigorously prosecute correctional officers who violate the public‚Äôs trust by committing crimes and to covering up violations of federal criminal law.‚Äù ¬† ‚ÄúYesterday is another example of our office‚Äôs unwavering commitment to pursuing those who violate the federal criminal civil rights laws,‚Äù said Acting United States Attorney for the Middle District of Louisiana Corey Amundson. ‚ÄúWe will continue to work closely with the Justice Department‚Äôs Civil Rights Division and the FBI to ensure that no one is above the law.‚Äù¬† ¬† This case is being investigated by the FBI‚Äôs Baton Rouge Resident Agency and is being prosecuted by Assistant U.S. Attorney Frederick A. Menner, Jr. of the Middle District of Louisiana and Trial Attorney Christopher J. Perras of the Civil Rights Division‚Äôs Criminal Section.",2017-11-02T00:00:00-04:00,Civil Rights,"Civil Rights Division; USAO - Louisiana, Middle"
155,15-1522,Alabama Man Found Guilty of Aggravated Sexual Abuse of a Child,"A federal jury convicted Rick Lee Evans, 43, of Anniston, Alabama, today of aggravated sexual abuse of a child after a five-day trial, Assistant Attorney General Leslie R. Caldwell of the Justice Department‚Äôs Criminal Division and U.S. Attorney Joyce White Vance of the Northern District of Alabama announced.¬† According to evidence introduced at trial, Evans, a former U.S. Army soldier, and his then-wife, a Department of Defense employee, were residing in Germany when they were asked to take temporary custody of a five-year-old child whose parents were deployed to Iraq with the U.S. Army.¬† Evans sexually abused the child on multiple occasions during the 18 months that the child lived with him from May 2007 to December 2008.¬† Trial Attorney Austin M. Berry of the Criminal Division‚Äôs Child Exploitation and Obscenity Section (CEOS) and Assistant U.S. Attorney Jacquelyn Hutzell of the Northern District of Alabama are prosecuting the case.¬† U.S. Army Criminal Investigations Division and the FBI‚Äôs Birmingham, Alabama, Division investigated the case. This case was brought as part of Project Safe Childhood, a nationwide initiative to combat the growing epidemic of child sexual exploitation and abuse, launched in May 2006 by the Department of Justice.¬† Led by U.S. Attorneys‚Äô offices and CEOS, Project Safe Childhood marshals federal, state and local resources to better locate, apprehend and prosecute individuals who exploit children via the Internet, as well as to identify and rescue victims.¬† For more information about Project Safe Childhood, please visit www.justice.gov/psc.",2015-12-11T00:00:00-05:00,Project Safe Childhood,"Criminal Division; Criminal - Child Exploitation and Obscenity Section; USAO - Alabama, Northern"
157,16-213,Alabama Man Indicted on Child Pornography and Sex Tourism Charges,"An Alabama native was indicted today and charged with multiple crimes involving travel with intent to engage in illicit sexual conduct with minors and child pornography, announced Assistant Attorney General Leslie R. Caldwell of the Justice Department‚Äôs Criminal Division and U.S. Attorney Kenyen R. Brown of the Southern District of Alabama. Clarence Edward Evers Jr., aka Bud, a technology teacher employed by the Conecuh County, Alabama, Board of Education, was arrested on Feb. 11, 2016, and was charged today with five counts of travel with intent to engage in illicit sexual conduct with a minor, one count of attempted travel with intent to engage in illicit sexual conduct with a minor, one count of production and attempted production of child pornography, one count of transportation of child pornography, one count of receipt of child pornography, one count of access with intent to view child pornography and one count of possession of child pornography. According to the indictment, Evers allegedly traveled to Thailand in the summers of 2010 through 2014 for the purpose of engaging in illicit sexual conduct with a minor and allegedly attempted to make a similar trip in the spring of 2015.¬† During the 2014 trip, Evers also allegedly photographed his victims‚Äô abuse and then transported the images back to the United States.¬† In addition, Evers allegedly had other images of child sexual exploitation on his computers and other electronic devices. The charges contained in the indictment are only allegations.¬† Evers is presumed innocent unless and until he is proven guilty beyond a reasonable doubt in a court of law. ICE-HSI is investigating this case.¬† Trial Attorney James E. Burke IV of the Criminal Division‚Äôs Child Exploitation and Obscenity Section (CEOS) and Assistant U.S. Attorneys Sean P. Costello and Maria E. Murphy of the Southern District of Alabama are prosecuting the case. This case was brought as part of Project Safe Childhood, a nationwide initiative to combat the growing epidemic of child sexual exploitation and abuse launched in May 2006 by the Department of Justice.¬† Led by U.S. Attorneys‚Äô Offices and CEOS, Project Safe Childhood marshals federal, state and local resources to better locate, apprehend and prosecute individuals who exploit children via the Internet, as well as to identify and rescue victims.¬† For more information about Project Safe Childhood, please visit www.justice.gov/psc.",2016-02-24T00:00:00-05:00,Project Safe Childhood,"Criminal Division; Criminal - Child Exploitation and Obscenity Section; USAO - Alabama, Southern"
162,16-381,Alabama Man Indicted for Producing Child Pornography Involving Multiple Victims,"An Alabama man was indicted today by a federal grand jury in Birmingham, Alabama, on charges related to the production of child pornography involving four minor victims, announced Assistant Attorney General Leslie R. Caldwell of the Justice Department‚Äôs Criminal Division and U.S. Joyce White Vance of the Northern District of Alabama. Gregory Jerome Lee, 53, formerly of Cullman County, Alabama, was indicted on four counts of production of child pornography, one count of conspiracy to advertise child pornography and one count of conspiracy to distribute and receive child pornography. According to the indictment, from September 1996 through December 2004, Lee used, persuaded, coerced and enticed minors to engage in sexually explicit conduct in order to produce images of that conduct.¬† Between September 1996 and August 2007, Lee conspired with other individuals to distribute and receive child pornography through a variety of means, including the Internet. The U.S. Postal Inspection Service (USPIS) is investigating the case.¬† Trial Attorney Amy E. Larson of the Criminal Division‚Äôs Child Exploitation and Obscenity Section (CEOS) and Assistant U.S. Attorney Jacquelyn Hutzell of the Northern District of Alabama are prosecuting the case.¬† The charges and allegations contained in an indictment are merely accusations.¬† The defendant is presumed innocent unless and until proven guilty.¬† Members of the public who may have information related to this matter should call the USPIS Birmingham Office at (205) 326-2909. This case was brought as part of Project Safe Childhood, a nationwide initiative to combat the growing epidemic of child sexual exploitation and abuse launched in May 2006 by the Department of Justice.¬† Led by U.S. Attorneys‚Äô Offices and CEOS, Project Safe Childhood marshals federal, state and local resources to better locate, apprehend and prosecute individuals who exploit children via the Internet, as well as to identify and rescue victims.¬† For more information about Project Safe Childhood, please visit www.justice.gov/psc.",2016-03-30T00:00:00-04:00,Project Safe Childhood,"Criminal Division; Criminal - Child Exploitation and Obscenity Section; USAO - Alabama, Northern"
168,14-464,Alabama Man Indicted for Threatening African-American Man and Another Person at Restaurant,"Jeremy Heath Higgins was indicted for threatening an African-American man at a Quinton, Alabama, restaurant, and for threatening another person who ordered Higgins to leave the restaurant due to his behavior, Acting Assistant Attorney General Jocelyn Samuels for the Justice Department‚Äôs Civil Rights Division and U.S. Attorney Joyce Vance for the Northern District of Alabama announced today.¬†¬† Higgins, 28, was charged in a three count indictment returned yesterday by a federal grand jury in the U.S. District Court for the Northern District of Alabama.¬† The indictment charges him with one felony count and two misdemeanor counts of interference with a federally-protected activity.¬† The indictment alleges that on June 14, 2013, Higgins approached and threatened an African-American man at the Alabama Rose Steakhouse because the man was present at the restaurant with a white woman.¬† According to the indictment, another person ordered Higgins to leave the premises of the restaurant because of Higgins‚Äô behavior toward the African-American man, after which Higgins allegedly shouted a threat to burn down the restaurant.¬† The indictment further alleges that Higgins threatened the person who had ordered him to leave the restaurant by painting graffiti on the restaurant‚Äôs exterior and fence.¬†¬†¬†¬† If convicted of the felony count of the indictment, Higgins could face a maximum sentence of 10 years in prison and a $250,000 fine.¬† For each of the misdemeanor charges, Higgins could face a maximum sentence of one year in prison and a $200,000 fine.¬†¬† This case is being investigated by the FBI and is being prosecuted by Assistant U.S. Attorney Robin B. Mark of the Northern District of Alabama and Trial Attorney David Reese of the Justice Department‚Äôs Civil Rights Division. An indictment is merely an accusation, and the defendant is presumed innocent unless proven guilty.",2014-05-01T00:00:00-04:00,Hate Crimes,Civil Rights Division; Civil Rights - Criminal Section
...,...,...,...,...,...,...
13002,09-368,West Virginia Man Pleads Guilty on Federal Civil Rights Charges,"WASHINGTON - Daryl Lee Fierce, 69, of Charleston, W.Va., pleaded guilty today to a civil rights charge in federal court in the Southern District of West Virginia for using fire to intimidate and interfere with a person‚Äôs housing rights. Fierce set fire to the victim‚Äôs home because African-American and biracial individuals visited the victim in her home. Pursuant to the plea agreement, Fierce faces up to 10 years in prison and a fine of up to $250,000. Sentencing is scheduled for July 30, 2009. According to documents filed in court, on or about July 16, 2007, Fierce admitted that he set fire to a home located on Noyes Avenue in Charleston because the tenant occupying the home, a white woman, associated with persons of another race and color. Fierce set fire to the outside wall of the victim‚Äôs bedroom at night as she slept. Fierce further admitted that before the incident he had used racial epithets against guests, including young children, who visited the victim‚Äôs home. ""Living in one‚Äôs home and associating with friends of one‚Äôs choosing, without violent interference because of race, is a core right of all persons in this country,"" said Loretta King, Acting Assistant Attorney General for Civil Rights. ""The defendant used violence against an innocent victim because of his racial prejudice. This is illegal, and despicable, and we will prosecute such crimes whenever and wherever they occur."" The FBI, the Charleston Police Department and the Charleston Fire Department investigated this case. The case was prosecuted by James Walsh with the Justice Department‚Äôs Civil Rights Division and Lisa G. Johnson, Assistant U.S. Attorney for the Southern District of West Virginia.",2009-04-20T00:00:00-04:00,Hate Crimes,Civil Rights Division
13032,18-775,Wisconsin Man Indicted for Producing Child Pornography Outside of the United States,"A Wisconsin man was charged in an indictment yesterday with the crimes of producing and possessing child pornography and engaging in illicit sexual conduct in a foreign place, announced Acting Assistant Attorney General John P. Cronan of the Justice Department‚Äôs Criminal Division and U.S. Attorney¬†Matthew D. Krueger¬†of the¬†Eastern District of¬†Wisconsin. Jeffrey H. Ernisse, 61, is currently incarcerated for state offenses related to child exploitation at the Red Granite Correctional Institution in Wisconsin.¬† A grand jury in the U.S. District Court for the Eastern District of Wisconsin indicted Ernisse on two counts of producing child pornography, two counts of producing child pornography outside of the United States, one count of engaging in illicit sexual conduct with a minor in the Philippines and one count of possessing child pornography. According to the indictment, on or about March 10, 2015 and then again, on or about April 7, 2015, Ernisse used a minor to engage in sexually explicit conduct for the purpose of producing child pornography.¬† Between approximately June 17, 2014, and approximately April 11, 2015, Ernisse engaged in illicit sexual conduct with a minor in the Republic of the Philippines.¬† And on or about Dec. 18, 2015, Ernisse possessed child pornography. The charges contained in the indictment are merely allegations.¬† The defendant is presumed innocent until proven guilty beyond a reasonable doubt in a court of law.¬†¬†¬†¬†¬†¬† ¬† ¬† ¬† ¬† ¬†¬† U.S. Immigration and Customs Enforcement‚Äôs Homeland Security Investigations (HSI) is investigating this case with the cooperation of the Sheboygan, Wisconsin, Police Department. ¬†Trial Attorney¬†William M. Grady¬†of the Criminal Division‚Äôs Child Exploitation and Obscenity Section (CEOS) and Assistant U.S. Attorneys¬†Megan J. Paulson and Penelope L. Coblentz of the¬†Eastern District of¬†Wisconsin are prosecuting the case. This investigation is a part of Project Safe Childhood, a nationwide initiative to combat the growing epidemic of child sexual exploitation and abuse launched in May 2006 by the Department of Justice. ¬†Led by U.S. Attorneys‚Äô Offices and CEOS, Project Safe Childhood marshals federal, state,¬†and local resources to better locate, apprehend,¬†and prosecute individuals who exploit children via the Internet, as well as to identify and rescue victims. ¬†For more information about Project Safe Childhood, please visit¬†www.justice.gov/psc.",2018-06-13T00:00:00-04:00,Project Safe Childhood,Criminal Division; Criminal - Child Exploitation and Obscenity Section
13034,12-596,Wisconsin Man Pleads Guilty to Sexual Exploitation of a Minor in Belize,"WASHINGTON ‚Äì A Wisconsin man pleaded guilty today in federal court in Milwaukee to traveling in foreign commerce and engaging in and attempting to engage in illicit sexual conduct with a minor, announced Assistant Attorney General Lanny A. Breuer of the Justice Department‚Äôs Criminal Division; U.S. Attorney James L. Santelle of the Eastern District of Wisconsin; John Morton, Director of U.S. Immigration and Customs Enforcement (ICE); and Scott Bultrowicz, Director of the U.S. State Department‚Äôs Diplomatic Security Service (DSS).¬† Roland J. Flath, 72, pleaded guilty before U.S. District Judge J.P. Stadtmueller. According to court documents, Flath, of Fond du Lac, Wis., traveled to Belize in July 2006, and subsequently sexually molested a minor girl from Belize.¬† Flath was originally charged by a criminal complaint filed in the Eastern District of Wisconsin in October 2010.¬† He was arrested by the Guatemalan National Civil Police on Feb. 20, 2011, expelled to the United States and arrested in the United States by ICE agents and the U.S. Marshal Service.¬† Flath was indicted on March 22, 2011, by a grand jury sitting in the Eastern District of Wisconsin. Flath faces a maximum penalty of up to 30 years in prison and a fine of $250,000. This case was brought as part of Project Safe Childhood, a nationwide initiative to combat the growing epidemic of child sexual exploitation and abuse launched in May 2006 by the Department of Justice.¬† Led by U.S. Attorneys‚Äô Offices and the Criminal Division‚Äôs Child Exploitation and Obscenity Section (CEOS), Project Safe Childhood marshals federal, state and local resources to better locate, apprehend and prosecute individuals who exploit children, as well as to identify and rescue victims.¬† For more information about Project Safe Childhood, please visit www.projectsafechildhood.gov. This case is being prosecuted by Assistant U.S. Attorney Penelope Coblentz of the Eastern District of Wisconsin and Trial Attorney Mi Yung Park of CEOS.¬† Assistance was provided by the Office of International Affairs in the Justice Department‚Äôs Criminal Division.¬† This case is a result of investigative efforts led by ICE Homeland Security Investigations (HSI) in Milwaukee and the DSS‚Äôs Regional Security Office in Belize, CEOS‚Äôs High Technology Investigative Unit, and the Belize Police Department.",2012-05-09T00:00:00-04:00,Project Safe Childhood,Criminal Division
13068,18-359,Wyoming Military Department Found Liable for Subjecting Employee to Sexual Harassment,"The Justice Department today announced that on March 21, 2018, a federal district court in Casper, Wyoming, found that the Wyoming Military Department (WMD) discriminated against former employee Amanda Dykes by subjecting her to sexual harassment and constructively discharging her.¬† The verdict was returned after a July 2017 bench trial during which the Justice Department produced evidence that the defendant violated Title VII of the Civil Rights Act of 1964, which prohibits discrimination on the basis of race, color, national origin, sex, and religion. ¬† The evidence produced at trial showed that Dykes was subjected to sexual harassment by her direct supervisor, former employee Don Smith, when both worked at WMD‚Äôs Wyoming Youth Challenge Program.¬† Smith subjected Dykes to persistent, unwelcomed conduct including poems, songs, and emails professing his affection and love for her as well as constant visits to her office.¬† These intensified to such a degree that Dykes asked her subordinates to help her avoid being left alone with her supervisor.¬† ¬† ¬† Dykes reported the supervisor‚Äôs conduct to her employer‚Äôs human resources department as well as to his direct supervisor, but received no assistance in remedying the harassment.¬† The court found that harassing behavior persisted for over 18 months despite Dykes‚Äô numerous complaints, that no reasonable employee could be expected to remain in her job under these circumstances, and that Dykes had no choice but to resign her position in September 2011 to avoid the continued harassment.¬† ¬† The district court ordered WMD to pay $221,030.62 to Dykes for the salary and benefits she lost as a result of her constructive discharge.¬† ¬† This judgment represents the first successful sexual harassment trial verdict obtained in a Title VII case since the launch of the Civil Rights Division‚Äôs Sexual Harassment in the Workplace Initiative (SHWI), which focuses on workplace sexual harassment in the public sector. ¬† As part of the Initiative, the Justice Department will continue to bring sex discrimination claims against state and local government employers with a renewed emphasis on sexual harassment charges.¬† The Department will also work to develop effective remedial measures that can be used to hold public sector employers accountable where Title VII violations have been found, including identifying changes to existing employer practices and policies that will result in safe work environments.¬† More information about the Civil Rights‚Äô Division‚Äôs Sexual Harassment in the Workplace Initiative can be found here. ¬† ‚ÄúThe Justice Department vigorously enforces Title VII to ensure that people can work free from sexual harassment and retaliation,‚Äù said Acting Assistant Attorney General John Gore ¬†of the Civil Rights Division.¬† ‚ÄúThe verdict sends the clear message that this Justice Department will continue to effectively combat sex-based discrimination whenever it occurs in a public sector workplace.‚Äù ¬† Dykes originally filed her sexual harassment charge against the WMD with the Denver Field Office of the Equal Employment Opportunity Commission (EEOC), which investigated and determined that there was reasonable cause to believe that discrimination had occurred and referred the matters to the Department of Justice.¬† ¬†¬†¬†¬†¬†¬†¬†¬†¬†¬† More information about Title VII and other federal employment laws is available at the division‚Äôs Employment Litigation Section website.¬† The continued enforcement of Title VII is a priority of the Civil Rights Division.¬† Additional information about the Civil Rights Division of the Department of Justice is available on the division website.¬† ¬† EEOC enforces federal laws prohibiting employment discrimination. Further information about EEOC is available on its website.¬† ¬† The United States was represented in this case by Robert Galbreath, Torie Atkinson, Brian McEntire, and Patty Stasco.",2018-03-23T00:00:00-04:00,Civil Rights,Civil Rights Division; Civil Rights - Employment Litigation Section; USAO - Wyoming


B. Write a function that takes one press release string as an input and:

- Removes named entities from each press release string (**Hint**: you may want to use `re.sub` with an or condition)
- Scores the sentiment of the entire press release using the `SentimentIntensityAnalyzer` and `polarity_scores`
- Returns the length-four (negative, positive, neutral, compound) sentiment dictionary (any order is fine)

Apply that function to each of the press releases in `doj_subset`. 

**Hints**: 

- A function + list comprehension to execute will takes about 30 seconds on a respectable local machine and about 2 mins on jhub; if it's taking a very long time, you may want to check your code for inefficiencies. If you can't fix those, for partial credit on this part/full credit on remainder, you can take a small random sample of the 717


In [18]:
## your code here to define function
def analyze_press_release(string: str):
    "''Press Release Sentiment Analyzer"''
    # Remove named entities
    named_entities = nlp(string)
    for entity in named_entities.ents:
        if entity.text.isalpha():
            string = re.sub(re.escape(entity. text), "", string)
    
    # Get sentiment scores
    sent_obj = SentimentIntensityAnalyzer()
    sentiment = sent_obj. polarity_scores(string)
    return sentiment

In [19]:
## your code here executing the function

# Apply the function to each press release in the doj_subset DataFrame
sentiment_scores = doj_subset['contents'].apply(lambda x: analyze_press_release(x))

sentiment_scores 

77        {'neg': 0.172, 'neu': 0.758, 'pos': 0.07, 'compound': -0.9893}
155      {'neg': 0.121, 'neu': 0.769, 'pos': 0.111, 'compound': -0.7003}
157       {'neg': 0.091, 'neu': 0.803, 'pos': 0.105, 'compound': 0.6124}
162      {'neg': 0.118, 'neu': 0.772, 'pos': 0.111, 'compound': -0.5385}
168      {'neg': 0.149, 'neu': 0.792, 'pos': 0.059, 'compound': -0.9786}
                                      ...                               
13002     {'neg': 0.14, 'neu': 0.794, 'pos': 0.066, 'compound': -0.9645}
13032      {'neg': 0.078, 'neu': 0.798, 'pos': 0.123, 'compound': 0.946}
13034    {'neg': 0.133, 'neu': 0.729, 'pos': 0.138, 'compound': -0.0516}
13068     {'neg': 0.13, 'neu': 0.735, 'pos': 0.135, 'compound': -0.2054}
13081    {'neg': 0.133, 'neu': 0.832, 'pos': 0.035, 'compound': -0.9898}
Name: contents, Length: 717, dtype: object

C. Add the four sentiment scores to the `doj_subset` dataframe to create a dataframe: `doj_subset_wscore`. Sort from highest neg to lowest neg score and print the top `id`, `contents`, and `neg` columns of the two most neg press releases. 

Notes:

- Don't worry if your sentiment score differs slightly from our output on GitHub; differences in preprocessing can lead to diff scores

In [24]:
## your code here

# Convert the resulting Series of dictionaries to a DataFrame
sentiment_df = pd.DataFrame(sentiment_scores.tolist(),index = sentiment_scores.index)
# Concatenate the sentiment DataFrame with the original doj_subset DataFrame
doj_subset_wscore = pd.concat([doj_subset, sentiment_df], axis=1)

# Sort the DataFrame by the 'negative' score in descending order
doj_subset_wscore = doj_subset_wscore.sort_values(by='neg', ascending=False) 

# Print the top two most negative 
top_2_neg = doj_subset_wscore.head(2)
top_2_neg[["id", "contents", "neg"]] 

Unnamed: 0,id,contents,neg
11593,16-718,"In a nine-count indictment unsealed today, two Mississippi correctional officers were charged with beating an inmate and a third was charged with helping to cover it up.¬† The indictment charged Lawardrick Marsher, 28, and Robert Sturdivant, 47, officers at Mississippi State Penitentiary, in Parchman, Mississippi, with a beating that included kicking, punching and throwing the victim to the ground.¬† Marsher and Sturdivant were charged with violating the right of K.H., a convicted prisoner, to be free from cruel and unusual punishment. ¬†Sturdivant was also charged with failing to intervene while Marsher was punching and beating K.H.¬† The indictment alleges that their actions involved the use of a dangerous weapon and resulted in bodily injury to the victim. A third officer, Deonte Pate, 23, was charged along with Marsher and Sturdivant for conspiring to cover up the beating. ¬†The indictment alleges that all three officers submitted false reports and that all three lied to the FBI. If convicted, Marsher and Sturdivant face a maximum sentence of 10 years in prison on the excessive force charges. ¬†Each of the three officers faces up to five years in prison on the conspiracy and false statement charges, and up to 20 years in prison on the false report charges. An indictment is merely an accusation, and the defendants are presumed innocent unless and until proven guilty. This case is being investigated by the FBI‚Äôs Jackson Division, with the cooperation of the Mississippi Department of Corrections.¬† It is being prosecuted by Assistant U.S. Attorney Robert Coleman of the Northern District of Mississippi and Trial Attorney Dana Mulhauser of the Civil Rights Division‚Äôs Criminal Section.¬† ¬†¬† Marsher Indictment",0.276
329,14-248,"The Department of Justice announced that this morning John W. Ng, 58, of Albuquerque, N.M., made his initial appearance in federal court on a criminal complaint charging him with a hate crime offense.¬† This charge is related to anti-Semitic threats Ng made against a Jewish woman who owns and operates the Nosh Jewish Delicatessen and Bakery in Albuquerque. Ng was arrested by the FBI on March 7, 2014, based on a criminal complaint alleging that he interfered with the victim‚Äôs federally protected rights by threatening her and interfering with her business because of her religion.¬† According to the criminal complaint, between Jan. 22, 2014, and Feb. 8, 2014, Ng allegedly posted threatening anti-Semitic notes on and in the vicinity of the victim‚Äôs business. A criminal complaint merely establishes probable cause, and Ng is presumed innocent unless proven guilty.¬† If convicted on the offense charged in the criminal complaint, Ng faces a maximum statutory penalty of one year in prison. This matter was investigated by the Albuquerque Division of the FBI and is being prosecuted by Assistant U.S. Attorney Mark T. Baker of the U.S. Attorney‚Äôs Office for the District of New Mexico and Trial Attorney AeJean Cha of the U.S. Department of Justice‚Äôs Civil Rights Division.",0.268


D. With the dataframe from part C, find the mean compound sentiment score for each of the three topics in `topics_clean` using group_by and agg.

E. Add a 1 sentence interpretation of why we might see the variation in scores (remember that compound is a standardized summary where -1 is most negative; +1 is most positive)


In [25]:
## agg and find the mean compound score by topic
mean_compound_scores = doj_subset_wscore.groupby('topics_clean').agg({'compound': 'mean'})

mean_compound_scores 

Unnamed: 0_level_0,compound
topics_clean,Unnamed: 1_level_1
Civil Rights,0.151243
Hate Crimes,-0.895692
Project Safe Childhood,-0.244107


## E. Add a 1 sentence interpretation of why we might see the variation in scores (remember that compound is a standardized summary where -1 is most negative; +1 is most positive)
- One reason why the topics of Civil Rights may recive more positive score compare to Hate Crime and Project Safe Childhood is because Civil Rights may be spoken about with more positive language then the other two categories, which could be discussed with more neutral or pessimistic language.  

# 2. Topic modeling (25 points)

For this question, use the `doj_subset_wscores` data that is restricted to civil rights, hate crimes, and project safe childhood and with the sentiment scores added


## 2.1 Preprocess the data by removing stopwords, punctuation, and non-alpha words (5 points)

A. Write a function that:

- Takes in a single raw string in the `contents` column from that dataframe
- Does the following preprocessing steps:

    - Converts the words to lowercase
    - Removes stopwords, adding the custom stopwords in your code cell below to the default stopwords list
    - Only retains alpha words (so removes digits and punctuation)
    - Only retains words 4 characters or longer
    - Uses the snowball stemmer from nltk to stem

- Returns a joined preprocessed string
    
B. Use `apply` or list comprehension to execute that function and create a new column in the data called `processed_text`
    
C. Print the `id`, `contents`, and `processed_text` columns for the following press releases:

id = 16-718 (this case: https://www.seattletimes.com/nation-world/doj-miami-police-reach-settlement-in-civil-rights-case/)

id = 16-217 (this case: https://www.wlbt.com/story/32275512/three-mississippi-correctional-officers-indicted-for-inmate-assault-and-cover-up/)
    
**Resources**:

- Here's code examples for the snowball stemmer: https://www.geeksforgeeks.org/snowball-stemmer-nlp/

In [26]:
custom_doj_stopwords = ["civil", "rights", "division", "department", "justice",
                        "office", "attorney", "district", "case", "investigation", "assistant",
                       "trial", "assistance", "assist"]

In [27]:


# Custom stopwords to add
custom_stopwords = ["custom_stopword1", "custom_stopword2"]

def preprocess_text(raw_text):
    # Convert text to lowercase
    raw_text = raw_text.lower()
    
    # Tokenize text into words
    words = nltk.word_tokenize(raw_text)
    
    # Remove stopwords
    stop_words = set(stopwords.words('english') + custom_stopwords)
    filtered_words = [word for word in words if word not in stop_words]
    
    # Remove punctuation and digits, and retain only alpha words
    filtered_words = [word for word in filtered_words if word.isalpha()]
    
    # Retain words 4 characters or longer
    filtered_words = [word for word in filtered_words if len(word) >= 4]
    
    # Stemming using SnowballStemmer
    stemmer = SnowballStemmer("english")
    stemmed_words = [stemmer.stem(word) for word in filtered_words]
    
    # Join preprocessed words into a single string
    preprocessed_text = ' '.join(stemmed_words)
    
    return preprocessed_text

In [28]:
## your code executing the function
# Apply the preprocess_text function to each row in the DataFrame and create a new column 'processed_text'
doj_subset_wscore['processed_text'] = doj_subset_wscore['contents'].apply(preprocess_text)

In [35]:
## your code showing the examples
examples1 = doj_subset_wscore[doj_subset_wscore['id'].isin(['16-718', '16-217'])][['id', 'contents', 'processed_text']]
examples1 

Unnamed: 0,id,contents,processed_text
11593,16-718,"In a nine-count indictment unsealed today, two Mississippi correctional officers were charged with beating an inmate and a third was charged with helping to cover it up.¬† The indictment charged Lawardrick Marsher, 28, and Robert Sturdivant, 47, officers at Mississippi State Penitentiary, in Parchman, Mississippi, with a beating that included kicking, punching and throwing the victim to the ground.¬† Marsher and Sturdivant were charged with violating the right of K.H., a convicted prisoner, to be free from cruel and unusual punishment. ¬†Sturdivant was also charged with failing to intervene while Marsher was punching and beating K.H.¬† The indictment alleges that their actions involved the use of a dangerous weapon and resulted in bodily injury to the victim. A third officer, Deonte Pate, 23, was charged along with Marsher and Sturdivant for conspiring to cover up the beating. ¬†The indictment alleges that all three officers submitted false reports and that all three lied to the FBI. If convicted, Marsher and Sturdivant face a maximum sentence of 10 years in prison on the excessive force charges. ¬†Each of the three officers faces up to five years in prison on the conspiracy and false statement charges, and up to 20 years in prison on the false report charges. An indictment is merely an accusation, and the defendants are presumed innocent unless and until proven guilty. This case is being investigated by the FBI‚Äôs Jackson Division, with the cooperation of the Mississippi Department of Corrections.¬† It is being prosecuted by Assistant U.S. Attorney Robert Coleman of the Northern District of Mississippi and Trial Attorney Dana Mulhauser of the Civil Rights Division‚Äôs Criminal Section.¬† ¬†¬† Marsher Indictment",indict unseal today mississippi correct offic charg beat inmat third charg help cover indict charg lawardrick marsher robert sturdiv offic mississippi state penitentiari parchman mississippi beat includ kick punch throw victim ground marsher sturdiv charg violat right convict prison free cruel unusu punish sturdiv also charg fail interven marsher punch beat indict alleg action involv danger weapon result bodili injuri victim third offic deont pate charg along marsher sturdiv conspir cover beat indict alleg three offic submit fals report three lie convict marsher sturdiv face maximum sentenc year prison excess forc charg three offic face five year prison conspiraci fals statement charg year prison fals report charg indict mere accus defend presum innoc unless proven guilti case investig jackson divis cooper mississippi depart correct prosecut assist attorney robert coleman northern district mississippi trial attorney dana mulhaus civil right divis crimin section marsher indict
6727,16-217,"The Justice Department has reached a comprehensive settlement agreement with the city of Miami and the Miami Police Department (MPD) resolving the Justice Department‚Äôs investigation of officer-involved shootings by MPD officers, announced Principal Deputy Assistant Attorney General Vanita Gupta, head of the Justice Department‚Äôs Civil Rights Division and U.S. Attorney Wifredo A. Ferrer of the Southern District of Florida. The settlement, which was approved by Miami‚Äôs city commission today and will go into effect when the agreement is signed by all parties, resolves claims stemming from the Justice Department‚Äôs investigation into officer-involved shootings by MPD officers, which was conducted under the Violent Crime Control and Law Enforcement Act of 1994.¬† The investigation‚Äôs findings, issued in July 2013, identified a pattern or practice of excessive use of force through officer-involved shootings in violation of the Fourth Amendment of the Constitution.¬† The city‚Äôs compliance with the settlement will be monitored by an independent reviewer, former Tampa, Florida, Police Chief Jane Castor.¬† Under the settlement agreement, the city will implement comprehensive reforms to ensure constitutional policing and support public trust.¬†¬†The settlement agreement is designed to minimize officer-involved shootings and to more effectively and quickly investigate officer-involved shootings that do occur, through measures that include: ‚ÄúThis settlement represents a renewed commitment by the city of Miami and Chief Rodolfo Llanes to provide constitutional policing for Miami residents and to protect public safety through sustainable reform,‚Äù said Principal Deputy Assistant Attorney General Gupta.¬† ‚ÄúThe agreement will help to strengthen the relationship between the MPD and the communities they serve by improving accountability for officers who fire their weapons unlawfully, and provides for community participation in the enforcement of this agreement.‚Äù¬† ‚ÄúToday's agreement is the result of a joint effort between the Department of Justice and the City of Miami to ensure that the Miami Police Department continues its efforts to make our community safe while protecting the sacred Constitutional rights of all of our citizens,‚Äù said U.S. Attorney Ferrer.¬† ‚ÄúThrough oversight and communication, the agreement seeks to make permanent the positive changes that former Chief Orosa and Chief Llanes have made, and we applaud the City Commission‚Äôs vote.‚Äù The settlement agreement builds upon important reforms implemented by the city since the Justice Department issued its findings, including:¬† The investigation was conducted by attorneys and staff from the Civil Rights Division‚Äôs Special Litigation Section and the Civil Division of the U. S. Attorney‚Äôs Office of the Southern District of Florida.",justic depart reach comprehens settlement agreement citi miami miami polic depart resolv justic depart investig shoot offic announc princip deputi assist attorney general vanita gupta head justic depart civil right divis attorney wifredo ferrer southern district florida settlement approv miami citi commiss today effect agreement sign parti resolv claim stem justic depart investig shoot offic conduct violent crime control enforc investig find issu juli identifi pattern practic excess forc shoot violat fourth amend constitut citi complianc settlement monitor independ review former tampa florida polic chief jane castor settlement agreement citi implement comprehens reform ensur constitut polic support public trust settlement agreement design minim shoot effect quick investig shoot occur measur includ settlement repres renew commit citi miami chief rodolfo llane provid constitut polic miami resid protect public safeti sustain reform said princip deputi assist attorney general gupta agreement help strengthen relationship communiti serv improv account offic fire weapon unlaw provid communiti particip enforc today agreement result joint effort depart justic citi miami ensur miami polic depart continu effort make communiti safe protect sacr constitut right citizen said attorney ferrer oversight communic agreement seek make perman posit chang former chief orosa chief llane made applaud citi commiss settlement agreement build upon import reform implement citi sinc justic depart issu find includ investig conduct attorney staff civil right divis special litig section civil divis attorney offic southern district florida


## 2.2 Create a document-term matrix from the preprocessed press releases and to explore top words (5 points)

A. Use the `create_dtm` function I provide (alternately, feel free to write your own!) and create a document-term matrix using the preprocessed press releases; make sure metadata contains the following columns: `id`, `compound` sentiment column you added, and the `topics_clean` column

B. Print the top 10 words for press releases with compound sentiment in the top 5% (so the most positive sentiment)

C. Print the top 10 words for press releases with compound sentiment in the bottom 5% (so the most negative sentiment)

**Hint**: for these, remember the pandas quantile function from pset one.  

D. Print the top 10 words for press releases in each of the three `topics_clean`

For steps B - D, to receive full credit, write a function `get_topwords` that helps you avoid duplicated code when you find top words for the different subsets of the data. There are different ways to structure it but one way is to feed it subsetted data (so data subsetted to one topic etc.) and for it to get the top words for that subset.


In [29]:
#cleaned one from chatGPT 
#def create_dtm(list_of_strings, metadata):
   # vectorizer = CountVectorizer(lowercase=True)
   # dtm_sparse = vectorizer.fit_transform(list_of_strings)
   # dtm_dense_named = pd.DataFrame(dtm_sparse.todense(), columns=vectorizer.get_feature_names_out())
   # dtm_dense_named_withid = pd.concat([metadata.reset_index(), dtm_dense_named], axis=1)
   # return dtm_dense_named_withid

#the one given that produces an error message! 
def create_dtm(list_of_strings, metadata):
    vectorizer = CountVectorizer(lowercase=True)
    dtm_sparse = vectorizer.fit_transform(list_of_strings)
    dtm_dense_named = pd.DataFrame(dtm_sparse.todense(), columns=vectorizer.get_feature_names_out())
    dtm_dense_named_withid = pd.concat([metadata.reset_index(), dtm_dense_named], axis=1)
    return(dtm_dense_named_withid)

In [30]:
# your code here
list = ['id', 'compound', 'topics_clean']
data_for_dtm = doj_subset_wscore['processed_text']
dtm = create_dtm(data_for_dtm, metadata = doj_subset_wscore[list])
dtm 

Unnamed: 0,index,id,compound,topics_clean,aaron,abandon,abbat,abbi,abbott,abdomen,...,zamora,zane,zealand,zealous,zeeman,zero,zobel,zone,zunggeemog,zwengel
0,11593,16-718,-0.9968,Civil Rights,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,329,14-248,-0.9943,Hate Crimes,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,572,13-312,-0.9980,Hate Crimes,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,11876,15-1348,-0.9963,Civil Rights,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,501,11-626,-0.9985,Hate Crimes,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
712,6787,17-132,0.9905,Civil Rights,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
713,10286,15-1559,0.9460,Civil Rights,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
714,11065,16-294,0.9442,Civil Rights,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
715,11085,15-667,0.9118,Civil Rights,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [34]:
# B. Print the top 10 words for press releases with compound sentiment in the top 5%
top_5_percentile = int(len(dtm) * 0.05)
top_sentiment_indices = dtm['compound'].argsort()[-top_5_percentile:][::-1]
top_words_pos_sentiment = dtm.loc[top_sentiment_indices].sum().sort_values(ascending=False).head(10)
print("Top 10 words for press releases with positive sentiment:")
print(top_words_pos_sentiment)

AttributeError: 'DataFrame' object has no attribute 'argsort'

In [59]:
def get_topwords(dtm):
    # Sum the frequencies of each word across all documents
    total_word_freq = dtm.iloc[:, 4:].sum(axis=0)
    
    # Convert the total word frequencies to a dictionary
    word_freq_dict = dict(total_word_freq)
    
    # Return the top 10 words
    return Counter(word_freq_dict).most_common(10)

# Example usage:
top_words = get_topwords(dtm)
print(top_words)

# B. # Step 1: Calculate the 95th percentile of compound sentiment
top_5_percent_threshold = dtm['compound'].quantile(0.95)

# Step 2: Filter the data for press releases with compound sentiment above the 95th percentile
top_5_percent_data = dtm[dtm['compound'] > top_5_percent_threshold]

# Step 3: Use the get_topwords function to get the top 10 words from the filtered data
top_words = get_topwords(top_5_percent_data)

# B. Function to print the top 10 words for press releases with compound sentiment in the top 5%
def top_positive_sentiment(data):
    # Calculate the 95th percentile of compound sentiment
    top_5_percent_threshold = data['compound'].quantile(0.95)
    
    # Filter data for press releases with compound sentiment in the top 5%
    top_5_percent_data = data[data['compound'] > top_5_percent_threshold]
    
    # Get top words for the filtered data
    get_topwords(top_5_percent_data)

# C. Function to print the top 10 words for press releases with compound sentiment in the bottom 5%
def top_negative_sentiment(data):
    # Calculate the 5th percentile of compound sentiment
    bottom_5_percent_threshold = data['compound'].quantile(0.05)
    
    # Filter data for press releases with compound sentiment in the bottom 5%
    bottom_5_percent_data = data[data['compound'] < bottom_5_percent_threshold]
    
    # Get top words for the filtered data
    get_topwords(bottom_5_percent_data)

# D. Function to print the top 10 words for press releases in each of the three topics_clean
def top_words_by_topic(data):
    # Group data by topics_clean and get top words for each group
    for topic, group_data in data.groupby('topics_clean'):
        print(f"Top 10 words for {topic}:")
        get_topwords(group_data)

#Example usage:
top_positive_sentiment(dtm)
top_negative_sentiment(dtm)
top_words_by_topic(dtm)



[('attorney', 2920), ('depart', 2673), ('right', 2360), ('district', 1899), ('justic', 1898), ('divis', 1865), ('civil', 1852), ('offic', 1654), ('assist', 1500), ('investig', 1364)]


ValueError: cannot reindex on an axis with duplicate labels

## 2.3 Estimate a topic model using those preprocessed words (5 points)

A. Going back to the preprocessed words from part 2.3.1, estimate a topic model with 3 topics, since you want to see if the unsupervised topic models recover different themes for each of the three manually-labeled areas (civil rights; hate crimes; project safe childhood). You have free rein over the other topic model parameters beyond the number of topics.

B. After estimating the topic model, print the top 15 words in each topic.

**Hints and Resources**:

- Same topic modeling resources linked to above
- Make sure to use the `random_state` argument within the model so that the numbering of topics does not move around between runs of your code

In [184]:
# your code here 
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Assuming 'preprocessed_words' is a list of preprocessed words

# Convert preprocessed words to a document-term matrix
vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(doj_subset_wscore['processed_text']) 

# Define the number of topics
n_topics = 3

# Initialize LDA model with random_state
lda_model = LatentDirichletAllocation(n_components=n_topics, random_state=42)

# Fit the model to the document-term matrix
lda_model.fit(dtm)

# Print top words for each topic
def print_top_words(model, feature_names, n_top_words=15):
    for topic_idx, topic in enumerate(model.components_):
        print(f"Topic #{topic_idx + 1}:")
        print(" ".join([feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]]))
        print()

print("Top 15 words for each topic:")
print_top_words(lda_model, vectorizer.get_feature_names_out())

Top 15 words for each topic:
Topic #1:
attorney child district exploit offic sexual case divis assist depart investig crimin sentenc safe prosecut

Topic #2:
depart right justic civil hous offic divis discrimin disabl attorney district enforc agreement said state

Topic #3:
right attorney civil divis depart victim district justic crime assist defend charg prosecut investig sentenc



## 2.4 Add topics back to main data and explore correlation between manual labels and our estimated topics (10 points)

A. Extract the document-level topic probabilities. Within `get_document_topics`, use the argument `minimum_probability` = 0 to make sure all 3 topic probabilities are returned. Write an assert statement to make sure the length of the list is equal to the number of rows in the `doj_subset_wscores` dataframe

B. Add the topic probabilities to the `doj_subset_wscores` dataframe as columns and create a column, `top_topic`, that reflects each document to its highest-probability topic (eg topic 1, 2, or 3)

C. For each of the manual labels in `topics_clean` (Hate Crime, Civil Rights, Project Safe Childhood), print the breakdown of the % of documents with each top topic (so, for instance, Hate Crime has 246 documents-- if 123 of those documents are coded to topic_1, that would be 50%; and so on). **Hint**: pd.crosstab and normalize may be helpful: https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.crosstab.html

D. Using a couple press releases as examples, write a 1-2 sentence interpretation of why some of the manual topics map on more cleanly to an estimated topic than other manual topic(s)


In [185]:
## your code here to get doc-level topic probabilities 
# Convert preprocessed words to a document-term matrix
vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(doj_subset_wscore['processed_text'])

# Define the number of topics
n_topics = 3

# Initialize LDA model with random_state
lda_model = LatentDirichletAllocation(n_components=n_topics, random_state=42)

# Fit the model to the document-term matrix
lda_model.fit(dtm)

# Extract document-level topic probabilities
def extract_document_topics(model, dtm, n_topics):
    document_topics = []
    for document in dtm:
        topics = model.transform(document.reshape(1, -1))[0]
        # Pad the topic distribution with zeros if necessary
        padded_topics = np.concatenate([topics, np.zeros(n_topics - len(topics))])
        document_topics.append(padded_topics)
    return document_topics

# Call the function to extract document-level topic probabilities
document_topics = extract_document_topics(lda_model, dtm, n_topics)

# Assert statement to check the length of the list
assert len(document_topics) == len(doj_subset_wscore), "Length of document-level topic probabilities does not match the number of rows in doj_subset_wscore dataframe"

# Print the extracted document-level topic probabilities
print(document_topics)

[array([0.55420576, 0.00258412, 0.44321012]), array([0.27714486, 0.0409751 , 0.68188005]), array([0.00198359, 0.00184364, 0.99617277]), array([0.65804426, 0.00217813, 0.33977761]), array([0.00181607, 0.00185046, 0.99633347]), array([0.00258013, 0.00264015, 0.99477972]), array([0.34432585, 0.00151119, 0.65416296]), array([0.15334248, 0.00127309, 0.84538442]), array([0.00160767, 0.00152678, 0.99686556]), array([0.00188186, 0.00185501, 0.99626313]), array([0.6059508 , 0.00346384, 0.39058536]), array([0.23611779, 0.00312562, 0.76075658]), array([0.177144 , 0.0022207, 0.8206353]), array([0.0098674 , 0.97874533, 0.01138726]), array([0.00111656, 0.00100967, 0.99787376]), array([0.30128397, 0.00170808, 0.69700795]), array([0.00147188, 0.00144302, 0.9970851 ]), array([0.00300235, 0.00283529, 0.99416236]), array([0.00366003, 0.00356193, 0.99277805]), array([0.00149476, 0.00144079, 0.99706445]), array([0.00148785, 0.00140256, 0.99710958]), array([0.04731969, 0.32292426, 0.62975605]), array([0.001

In [186]:
## your code here to add those topic probabilities to the dataframe
# Convert the list of document-level topic probabilities to a DataFrame
topic_probabilities_df = pd.DataFrame(document_topics, columns=[f"topic_{i+1}" for i in range(n_topics)])

# Add the topic probabilities to the doj_subset_wscore dataframe
doj_subset_wscores_with_topics = pd.concat([doj_subset_wscore.reset_index(drop=True), topic_probabilities_df], axis=1)

# Create a column 'top_topic' that reflects each document's highest-probability topic
doj_subset_wscores_with_topics['top_topic'] = doj_subset_wscores_with_topics[[f"topic_{i+1}" for i in range(n_topics)]].idxmax(axis=1)
doj_subset_wscores_with_topics['top_topic'] = doj_subset_wscores_with_topics['top_topic'].str.replace('topic_', '').astype(int)

# Display the modified dataframe
doj_subset_wscores_with_topics 

Unnamed: 0,id,title,contents,date,topics_clean,components_clean,neg,neu,pos,compound,processed_text,topic_1,topic_2,topic_3,top_topic
0,16-718,Three Mississippi Correctional Officers Indicted for Inmate Assault and Cover-Up,"In a nine-count indictment unsealed today, two Mississippi correctional officers were charged with beating an inmate and a third was charged with helping to cover it up.¬† The indictment charged Lawardrick Marsher, 28, and Robert Sturdivant, 47, officers at Mississippi State Penitentiary, in Parchman, Mississippi, with a beating that included kicking, punching and throwing the victim to the ground.¬† Marsher and Sturdivant were charged with violating the right of K.H., a convicted prisoner, to be free from cruel and unusual punishment. ¬†Sturdivant was also charged with failing to intervene while Marsher was punching and beating K.H.¬† The indictment alleges that their actions involved the use of a dangerous weapon and resulted in bodily injury to the victim. A third officer, Deonte Pate, 23, was charged along with Marsher and Sturdivant for conspiring to cover up the beating. ¬†The indictment alleges that all three officers submitted false reports and that all three lied to the FBI. If convicted, Marsher and Sturdivant face a maximum sentence of 10 years in prison on the excessive force charges. ¬†Each of the three officers faces up to five years in prison on the conspiracy and false statement charges, and up to 20 years in prison on the false report charges. An indictment is merely an accusation, and the defendants are presumed innocent unless and until proven guilty. This case is being investigated by the FBI‚Äôs Jackson Division, with the cooperation of the Mississippi Department of Corrections.¬† It is being prosecuted by Assistant U.S. Attorney Robert Coleman of the Northern District of Mississippi and Trial Attorney Dana Mulhauser of the Civil Rights Division‚Äôs Criminal Section.¬† ¬†¬† Marsher Indictment",2016-06-21T00:00:00-04:00,Civil Rights,"Civil Rights Division; Civil Rights - Criminal Section; USAO - Mississippi, Northern",0.276,0.694,0.030,-0.9968,indict unseal today mississippi correct offic charg beat inmat third charg help cover indict charg lawardrick marsher robert sturdiv offic mississippi state penitentiari parchman mississippi beat includ kick punch throw victim ground marsher sturdiv charg violat right convict prison free cruel unusu punish sturdiv also charg fail interven marsher punch beat indict alleg action involv danger weapon result bodili injuri victim third offic deont pate charg along marsher sturdiv conspir cover beat indict alleg three offic submit fals report three lie convict marsher sturdiv face maximum sentenc year prison excess forc charg three offic face five year prison conspiraci fals statement charg year prison fals report charg indict mere accus defend presum innoc unless proven guilti case investig jackson divis cooper mississippi depart correct prosecut assist attorney robert coleman northern district mississippi trial attorney dana mulhaus civil right divis crimin section marsher indict,0.554206,0.002584,0.443210,1
1,14-248,Albuquerque Man Charged with Federal Hate Crime Related to Anti-Semitic Threats Against Businesswoman,"The Department of Justice announced that this morning John W. Ng, 58, of Albuquerque, N.M., made his initial appearance in federal court on a criminal complaint charging him with a hate crime offense.¬† This charge is related to anti-Semitic threats Ng made against a Jewish woman who owns and operates the Nosh Jewish Delicatessen and Bakery in Albuquerque. Ng was arrested by the FBI on March 7, 2014, based on a criminal complaint alleging that he interfered with the victim‚Äôs federally protected rights by threatening her and interfering with her business because of her religion.¬† According to the criminal complaint, between Jan. 22, 2014, and Feb. 8, 2014, Ng allegedly posted threatening anti-Semitic notes on and in the vicinity of the victim‚Äôs business. A criminal complaint merely establishes probable cause, and Ng is presumed innocent unless proven guilty.¬† If convicted on the offense charged in the criminal complaint, Ng faces a maximum statutory penalty of one year in prison. This matter was investigated by the Albuquerque Division of the FBI and is being prosecuted by Assistant U.S. Attorney Mark T. Baker of the U.S. Attorney‚Äôs Office for the District of New Mexico and Trial Attorney AeJean Cha of the U.S. Department of Justice‚Äôs Civil Rights Division.",2014-03-10T00:00:00-04:00,Hate Crimes,Civil Rights Division; Civil Rights - Criminal Section,0.268,0.685,0.046,-0.9943,depart justic announc morn john albuquerqu made initi appear feder court crimin complaint charg hate crime offens charg relat threat made jewish woman own oper nosh jewish delicatessen bakeri albuquerqu arrest march base crimin complaint alleg interf victim feder protect right threaten interf busi religion accord crimin complaint alleg post threaten note vicin victim busi crimin complaint mere establish probabl caus presum innoc unless proven guilti convict offens charg crimin complaint face maximum statutori penalti year prison matter investig albuquerqu divis prosecut assist attorney mark baker attorney offic district mexico trial attorney aejean depart justic civil right divis,0.277145,0.040975,0.681880,3
2,13-312,Aryan Brother Inmate Sentenced for Federal Hate Crime for Assaulting Fellow Inmate,"John Hall, 27, an Aryan Brotherhood member and inmate at the Federal Correctional Institution (FCI) in Seagoville, Texas, was sentenced today by U.S. District Judge Reed O‚ÄôConnor after pleading guilty to violating the Matthew Shepard and James Byrd Jr. Hate Crimes Prevention Act stemming from his assault of a fellow inmate, whom he believed to be gay, the Department of Justice announced. Hall assaulted his fellow inmate with a dangerous weapon, causing bodily injury to the victim on Dec. 20, 2011. Hall was sentenced to serve 71 months in prison to be served consecutively with the sentence he is currently serving. The assault occurred on Dec. 20, 2011, inside the FCI Seagoville when Hall targeted and attacked the victim, a fellow inmate, because he believed the victim was gay or involved in a sexual relationship with another male inmate. Hall repeatedly punched, kicked and stomped on the victim‚Äôs face with his shod feet, a dangerous weapon, while yelling a homophobic slur. The victim lost consciousness during the assault and suffered multiple lacerations to his face. The victim also sustained a fractured eye socket, lost a tooth, fractured other teeth and was treated at a hospital for the injuries he sustained during Hall‚Äôs unprovoked attack. Hall pleaded guilty to violating the Matthew Shepard and James Byrd Jr. Hate Crimes Prevention Act on Nov. 8, 2012. ‚ÄúBrutality and violence based on sexual orientation has no place in a civilized society,‚Äù said Thomas E. Perez, Assistant Attorney General for the Civil Rights Division. ‚ÄúThe Justice Department is committed to using all the tools in our law enforcement arsenal, including the Matthew Shepard and James Byrd Jr. Hate Crimes Prevention Act, to prosecute acts motivated by hate.‚Äù ‚ÄúThis prosecution sends a clear message that this office, in partnership with attorneys in the department‚Äôs Civil Rights Division, will prioritize and aggressively prosecute hate crimes and others civil rights violations in North Texas,‚Äù said U.S. Attorney Sarah R. Salda√±a of the Northern District of Texas. This case was investigated by the FBI Dallas Division. The case was prosecuted by Assistant U.S. Attorney Errin Martin and Trial Attorney Adriana Vieco of the Civil Rights Division.",2013-03-14T00:00:00-04:00,Hate Crimes,Civil Rights Division; Civil Rights - Criminal Section,0.254,0.709,0.037,-0.9980,john hall aryan brotherhood member inmat feder correct institut seagovill texa sentenc today district judg reed connor plead guilti violat matthew shepard jame byrd hate crime prevent stem assault fellow inmat believ depart justic announc hall assault fellow inmat danger weapon caus bodili injuri victim hall sentenc serv month prison serv consecut sentenc current serv assault occur insid seagovill hall target attack victim fellow inmat believ victim involv sexual relationship anoth male inmat hall repeat punch kick stomp victim face shod feet danger weapon yell homophob slur victim lost conscious assault suffer multipl lacer face victim also sustain fractur socket lost tooth fractur teeth treat hospit injuri sustain hall unprovok attack hall plead guilti violat matthew shepard jame byrd hate crime prevent brutal violenc base sexual orient place civil societi said thoma perez assist attorney general civil right divis justic depart commit use tool enforc arsenal includ matthew shepard jame byrd hate crime prevent prosecut act motiv prosecut send clear messag offic partnership attorney depart civil right divis priorit aggress prosecut hate crime other civil right violat north texa said attorney sarah salda√±a northern district texa case investig dalla divis case prosecut assist attorney errin martin trial attorney adriana vieco civil right divis,0.001984,0.001844,0.996173,3
3,15-1348,Two Former Jailers at the Kentucky River Regional Jail Indicted on Charges Related to the Death of A Pretrial Detainee,"The Justice Department announced today that a federal grand jury in London, Kentucky, has indicted two former deputy jailers at the Kentucky River Regional Jail on charges related to the July 9, 2013, in-custody death of Larry Trent, a pretrial detainee at the jail.¬† The indictment charges Damon Hickman, 38, and William Howell, 59, with causing Trent‚Äôs death, and charges Hickman with attempting to cover up his involvement in the death.¬† Hickman and Howell are charged with federal civil rights violations for depriving Trent of his civil rights.¬† Count one of the indictment charges Hickman and Howell of failing to provide Trent with necessary medical care after he was injured, thereby acting with deliberate indifference to a substantial risk of harm to Trent, which resulted in Trent‚Äôs death.¬† Count two of the indictment also charges both defendants with using excessive force against Trent, resulting in bodily injury to him. Hickman is additionally charged with one count of obstruction of justice for falsifying an official log by indicating that observations of Trent were being made and that Trent was ‚Äú10-4,‚Äù meaning that he was safe and not in obvious physical distress, when in fact Trent was not ‚Äú10-4.‚Äù Hickman and Howell face a maximum penalty of life in prison for the death-resulting civil rights offense, and face a maximum penalty of 10 years in prison for assaulting Trent.¬† Hickman faces a maximum penalty of 20 years in prison for falsification of records in a federal investigation. An indictment is merely an accusation, and the defendants are presumed innocent unless proven guilty. The case is being investigated by the FBI‚Äôs London Resident Agency, with assistance provided by the Kentucky State Police. ¬†The case is being prosecuted by Trial Attorney Sanjay Patel of the Civil Rights Division‚Äôs Criminal Section and Assistant U.S. Attorney Hydee Hawkins of the Eastern District of Kentucky. Hickman and Howell Indictment",2015-11-02T00:00:00-05:00,Civil Rights,Civil Rights Division; Civil Rights - Criminal Section,0.252,0.692,0.056,-0.9963,justic depart announc today feder grand juri london kentucki indict former deputi jailer kentucki river region jail charg relat juli death larri trent pretrial detaine jail indict charg damon hickman william howel caus trent death charg hickman attempt cover involv death hickman howel charg feder civil right violat depriv trent civil right count indict charg hickman howel fail provid trent necessari medic care injur therebi act deliber indiffer substanti risk harm trent result trent death count indict also charg defend use excess forc trent result bodili injuri hickman addit charg count obstruct justic falsifi offici indic observ trent made trent mean safe obvious physic distress fact trent hickman howel face maximum penalti life prison civil right offens face maximum penalti year prison assault trent hickman face maximum penalti year prison falsif record feder investig indict mere accus defend presum innoc unless proven guilti case investig london resid agenc assist provid kentucki state polic case prosecut trial attorney sanjay patel civil right divis crimin section assist attorney hyde hawkin eastern district kentucki hickman howel indict,0.658044,0.002178,0.339778,1
4,11-626,Arkansas Man Pleads Guilty to Federal Hate Crime Related to the Assault of Five Hispanic Men,"WASHINGTON ‚Äì The Justice Department announced today that Sean Popejoy, 19, of Green Forest, Ark., pleaded guilty in federal court to one count of committing a federal hate crime and one count of conspiring to commit a federal hate crime. This is the first conviction for a violation of the Matthew Shepard and James Byrd Jr. Hate Crimes Prevention Act, which was enacted in October 2009. ¬† Information presented during the plea hearing established that in the early morning hours of June 20, 2010, Popejoy admitted that he was part of a conspiracy to threaten and injure five Hispanic men who had pulled into a gas station parking lot. ¬†The co-conspirators pursued the victims in a truck. When the co-conspirators caught up to the victims, Popejoy leaned outside of the front passenger window and waived a tire wrench at the victims and continued to threaten and hurl racial epithets at the victims. The co-conspirator rammed into the victims' car, which caused the victims‚Äô car to cross the opposite lane of traffic, go off the road, crash into a tree and ignite. As a result of the co-conspirators‚Äô actions, the victims suffered bodily injury, including one victim who sustained life-threatening injuries. ‚ÄúJames Byrd, Jr. and Matthew Shepard were brutally murdered more than a decade ago, and today the first defendant is convicted for a hate crime under the critical new law enacted in their names,‚Äù said Thomas E. Perez, Assistant Attorney General for the Civil Rights Division. ‚ÄúIt is unacceptable that violent acts of hate committed because of someone‚Äôs race continue to occur in 2011, and the department will continue to use every available tool to identify and prosecute hate crimes whenever and wherever they occur.¬† ‚ÄúIt is terrible and disturbing that violence motivated by hatred of another‚Äôs race continues to occur,‚Äù said Conner Eldridge, U.S. Attorney for the Western District of Arkansas. ‚ÄúWe are committed to prosecuting such crimes in the Western District of Arkansas.‚Äù If convicted, the defendant faces a maximum punishment of 15 years in prison. This case is being investigated by the FBI‚Äôs Fayetteville Division in cooperation with the Arkansas State Police Department and the Carroll County Sheriff‚Äôs Office. ¬† The case is being prosecuted by Trial Attorney Edward Chung of the Department of Justice‚Äôs Civil Rights Division and Assistant U.S. Attorney Kyra Jenner for the Western District of Arkansas.",2011-05-16T00:00:00-04:00,Hate Crimes,Civil Rights Division; Civil Rights - Criminal Section,0.250,0.717,0.033,-0.9985,washington justic depart announc today sean popejoy green forest plead guilti feder court count commit feder hate crime count conspir commit feder hate crime first convict violat matthew shepard jame byrd hate crime prevent enact octob inform present plea hear establish earli morn hour june popejoy admit part conspiraci threaten injur five hispan pull station park pursu victim truck caught victim popejoy lean outsid front passeng window waiv tire wrench victim continu threaten hurl racial epithet victim ram victim caus victim cross opposit lane traffic road crash tree ignit result action victim suffer bodili injuri includ victim sustain injuri jame byrd matthew shepard brutal murder decad today first defend convict hate crime critic enact name said thoma perez assist attorney general civil right divis unaccept violent act hate commit someon race continu occur depart continu everi avail tool identifi prosecut hate crime whenev wherev occur terribl disturb violenc motiv hatr anoth race continu occur said conner eldridg attorney western district arkansa commit prosecut crime western district convict defend face maximum punish year prison case investig fayettevill divis cooper arkansa state polic depart carrol counti sheriff offic case prosecut trial attorney edward chung depart justic civil right divis assist attorney kyra jenner western district arkansa,0.001816,0.001850,0.996333,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
712,17-132,Justice Department Reaches Agreement with St. James Parish Louisiana School District to Desegregate Schools,"The Department of Justice has reached an agreement with the St. James Parish School District in Louisiana that upon completion will end court supervision of the district‚Äôs schools.¬†¬†The consent order, approved yesterday by the U.S. District Court for the Eastern District of Louisiana, addresses all remaining issues in the school desegregation case, and when fully implemented will lead to the closing of that case.¬†¬† The consent order, negotiated with the school district and private plaintiffs, represented by the NAACP Legal Defense and Educational Fund, puts the district on a path to full unitary status within three years provided it: The consent order declares that the district has already met its desegregation obligations in the area of transportation. ¬†The court will retain jurisdiction over the consent order during its implementation, and the Justice Department will monitor the district‚Äôs compliance.¬†¬† ‚ÄúWe are pleased to have worked hand-in-hand with the schools to ensure equal and fair treatment for the students of the St. James Parish School District,‚Äù said Acting Assistant Attorney General Tom Wheeler of the Civil Rights Division.¬†¬†‚ÄúWe look forward to working with the district and private plaintiffs to implement the consent order and bring this case to a successful close.‚Äù Promoting school desegregation and enforcing Title IV of the Civil Rights Act of 1964 is a top priority of the Justice Department‚Äôs Civil Rights Division. ¬†Additional information about the Civil Rights Division is available on its website at www.justice.gov/crt.¬† St. James Parish Consent Order",2017-01-31T00:00:00-05:00,Civil Rights,Civil Rights Division; Civil Rights - Educational Opportunities Section,0.000,0.823,0.177,0.9905,depart justic reach agreement jame parish school district louisiana upon complet court supervis district school consent order approv yesterday district court eastern district louisiana address remain issu school desegreg case fulli implement lead close case consent order negoti school district privat plaintiff repres naacp legal defens educ fund put district path full unitari status within three year provid consent order declar district alreadi desegreg oblig area transport court retain jurisdict consent order implement justic depart monitor district complianc pleas work school ensur equal fair treatment student jame parish school district said act assist attorney general wheeler civil right divis look forward work district privat plaintiff implement consent order bring case success promot school desegreg enforc titl civil right prioriti justic depart civil right divis addit inform civil right divis avail websit jame parish consent order,0.002931,0.832505,0.164565,2
713,15-1559,Readout of Department of Justice‚Äôs First Meetings in Chicago Following Announcement of Pattern or Practice Investigation of the Chicago Police Department,"The Department of Justice, including lawyers and senior leaders from the Civil Rights Division, and the U.S. Attorney‚Äôs Office of the Northern District of Illinois, completed two days of introductory meetings in Chicago today following last week‚Äôs announcement of a pattern or practice investigation into the Chicago Police Department (CPD).¬† The team, comprised primarily of lawyers from the Civil Rights Division was joined by the head of the Civil Rights Division Vanita Gupta, as well as Zachary Fardon, the U.S Attorney of the Northern District of Illinois. ¬†The investigation into use of force, disparities in use of force and accountability systems of the CPD is being led by the Civil Rights Division with assistance from the U.S. Attorney‚Äôs Office of the Northern District of Illinois. On Dec. 16, the group met with CPD Superintendent John Escalante and briefed CPD command staff on the investigative process.¬† The Civil Rights Division also had initial meetings with community members and organizations in order to solicit information and explain the pattern or practice investigation‚Äôs scope and process. Today, Dec. 17, the Civil Rights Division and U.S. Attorney‚Äôs Office met with additional community groups, city officials and union representatives.¬† Meetings with the city of Chicago included Mayor Rahm Emanuel and his staff and a separate meeting with the Independent Police Review Authority Administrator Sharon Fairley.¬† Throughout the investigative process the Civil Rights Division, assisted by the U.S. Attorney‚Äôs Office, will¬†continue to meet with representatives from the community, the city and the unions. ¬†During the course of the investigation, community members will have the opportunity to provide information both in public meetings and privately.¬† Any public meetings will be announced at a later date.¬† Anyone who wishes to share information relevant to the investigation is encouraged to contact the Department of Justice by phone: (844) 401-3735 or email: community.cpd@usdoj.gov.",2015-12-17T00:00:00-05:00,Civil Rights,"Civil Rights Division; Civil Rights - Special Litigation Section; USAO - Illinois, Northern",0.000,0.937,0.063,0.9460,depart justic includ lawyer senior leader civil right divis attorney offic northern district illinoi complet day introductori meet chicago today follow last week announc pattern practic investig chicago polic depart team compris primarili lawyer civil right divis join head civil right divis vanita gupta well zachari fardon attorney northern district illinoi investig forc dispar forc account system civil right divis assist attorney offic northern district illinoi group superintend john escalant brief command staff investig process civil right divis also initi meet communiti member organ order solicit inform explain pattern practic investig scope process today civil right divis attorney offic addit communiti group citi offici union repres meet citi chicago includ mayor rahm emanuel staff separ meet independ polic review author administr sharon fairley throughout investig process civil right divis assist attorney offic continu meet repres communiti citi union cours investig communiti member opportun provid inform public meet privat public meet announc later date anyon wish share inform relev investig encourag contact depart justic phone email,0.002409,0.995167,0.002424,2
714,16-294,"Statement from Head of the Civil Rights Division Vanita Gupta Regarding Ferguson, Missouri, City Council Vote to Approve Consent Decree","Principal Deputy Assistant Attorney General Vanita Gupta, head of the Justice Department‚Äôs Civil Rights Division, released the following statement regarding the Ferguson, Missouri, City Council vote to approve the proposed consent decree with the Department of Justice: ‚ÄúTonight, the city of Ferguson, Missouri, took an important step towards guaranteeing all of its citizens the protections of our Constitution. ¬†We are pleased that they have approved the consent decree, a document designed to provide the framework needed to institute constitutional policing in Ferguson, and look forward to filing it in court in the coming days and beginning to work with them towards implementation.‚Äù",2016-03-15T00:00:00-04:00,Civil Rights,Civil Rights Division; Civil Rights - Special Litigation Section,0.000,0.839,0.161,0.9442,princip deputi assist attorney general vanita gupta head justic depart civil right divis releas follow statement regard ferguson missouri citi council vote approv propos consent decre depart justic tonight citi ferguson missouri took import step toward guarante citizen protect constitut pleas approv consent decre document design provid framework need institut constitut polic ferguson look forward file court come day begin work toward implement,0.005817,0.988377,0.005806,2
715,15-667,"Statement from Vanita Gupta, Head of the Justice Department's Civil Rights Division, U.S. Attorney Steven M. Dettlebach for the Northern District of Ohio and Special Agent in Charge Stephen D. Anthony for the FBI","Statement from Vanita Gupta, head of the Justice Department‚Äôs Civil Rights Division, U.S. Attorney Steven M. Dettelbach for the Northern District of Ohio and Special Agent in Charge Stephen D. Anthony for the FBI: ‚ÄúThe U.S. Attorney's Office, the Federal Bureau of Investigation and the Civil Rights Division of the Department of Justice have been monitoring the extensive investigation that has been conducted around the events of Nov. 29, 2012.¬† We will now review the testimony and evidence presented in the state trial.¬† We will continue our assessment, review all available legal options and will collaboratively determine what, if any, additional steps are available and appropriate given the requirements and limitations of the applicable laws in the federal judicial system.¬† This review is separate and distinct from the Civil Rights Division and U.S. Attorney's Office's productive efforts to resolve civil pattern and practice allegations under 42 U.S.C. 14141 with the city of Cleveland.‚Äù",2015-05-23T00:00:00-04:00,Civil Rights,Civil Rights Division,0.000,0.915,0.085,0.9118,statement vanita gupta head justic depart civil right divis attorney steven dettelbach northern district ohio special agent charg stephen anthoni attorney offic feder bureau investig civil right divis depart justic monitor extens investig conduct around event review testimoni evid present state trial continu assess review avail legal option collabor determin addit step avail appropri given requir limit applic law feder judici system review separ distinct civil right divis attorney offic product effort resolv civil pattern practic alleg citi cleveland,0.005075,0.721912,0.273013,2


In [187]:
## your code here to summarize the topic proportions for each of the topics_clean 
# Calculate the percentage breakdown of top topics for each manual label
topic_breakdown = pd.crosstab(doj_subset_wscores_with_topics['topics_clean'], doj_subset_wscores_with_topics['top_topic'], normalize='index') * 100

# Display the percentage breakdown
print("Percentage breakdown of top topics for each manual label:")
topic_breakdown 

Percentage breakdown of top topics for each manual label:


top_topic,1,2,3
topics_clean,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Civil Rights,20.0,66.557377,13.442623
Hate Crimes,0.406504,0.406504,99.186992
Project Safe Childhood,100.0,0.0,0.0


## D. Using a couple press releases as examples, write a 1-2 sentence interpretation of why some of the manual topics map on more cleanly to an estimated topic than other manual topic(s)
- The mapping between manual topics and estimated topics may vary due to the complexity of the topics. Topics such as "Hate Crime" or "Project Safe Childhood," which represent more specific and distinct themes, may map more cleanly to estimated topics because they are likely to exhibit stronger patterns in the underlying data; alternately, broader topics like "Civil Rights" may encompass a wider range of issues and perspectives, leading to more diffuse mappings with estimated topics as they might intersect with multiple themes identified by the model. 

# 3. Extend the analysis from unigrams to bigrams (10 points)

In the previous question, you found top words via a unigram representation of the text. Now, we want to see how those top words change with bigrams (pairs of words)

A. Using the `doj_subset_wscore` data and the `processed_text` column (so the words after stemming/other preprocessing), create a column in the data called `processed_text_bigrams` that combines each consecutive pairs of word into a bigram separated by an underscore. Eg:

"depart reach settlem" would become "depart_reach reach_settlem"

Do this by writing a function `create_bigram_onedoc` that takes in a single `processed_text` string and returns a string with its bigrams structured similarly to above example
 
**Hint**: there are many ways to solve but `zip` may be helpful: https://stackoverflow.com/questions/21303224/iterate-over-all-pairs-of-consecutive-items-in-a-list

B. Print the `id`, `processed_text`, and `processed_text_bigram` columns for press release with id = 16-217

In [188]:
## your code here 
import pandas as pd

# Assuming 'doj_subset_wscore' is the DataFrame containing the processed text in the 'processed_text' column

# Function to create bigrams from a list of words
def create_bigrams(text):
    words = text.split()
    bigrams = [f"{words[i]}_{words[i+1]}" for i in range(len(words)-1)]
    return " ".join(bigrams)

# Apply the function to create bigrams for each row in the 'processed_text' column
doj_subset_wscore['processed_text_bigrams'] = doj_subset_wscore['processed_text'].apply(create_bigrams)

# Display the appropriate data from B 
doj_subset_wscore[doj_subset_wscore['id'] == '16-217'][['id', 'processed_text', 'processed_text_bigrams']]

Unnamed: 0,id,processed_text,processed_text_bigrams
6727,16-217,justic depart reach comprehens settlement agreement citi miami miami polic depart resolv justic depart investig shoot offic announc princip deputi assist attorney general vanita gupta head justic depart civil right divis attorney wifredo ferrer southern district florida settlement approv miami citi commiss today effect agreement sign parti resolv claim stem justic depart investig shoot offic conduct violent crime control enforc investig find issu juli identifi pattern practic excess forc shoot violat fourth amend constitut citi complianc settlement monitor independ review former tampa florida polic chief jane castor settlement agreement citi implement comprehens reform ensur constitut polic support public trust settlement agreement design minim shoot effect quick investig shoot occur measur includ settlement repres renew commit citi miami chief rodolfo llane provid constitut polic miami resid protect public safeti sustain reform said princip deputi assist attorney general gupta agreement help strengthen relationship communiti serv improv account offic fire weapon unlaw provid communiti particip enforc today agreement result joint effort depart justic citi miami ensur miami polic depart continu effort make communiti safe protect sacr constitut right citizen said attorney ferrer oversight communic agreement seek make perman posit chang former chief orosa chief llane made applaud citi commiss settlement agreement build upon import reform implement citi sinc justic depart issu find includ investig conduct attorney staff civil right divis special litig section civil divis attorney offic southern district florida,justic_depart depart_reach reach_comprehens comprehens_settlement settlement_agreement agreement_citi citi_miami miami_miami miami_polic polic_depart depart_resolv resolv_justic justic_depart depart_investig investig_shoot shoot_offic offic_announc announc_princip princip_deputi deputi_assist assist_attorney attorney_general general_vanita vanita_gupta gupta_head head_justic justic_depart depart_civil civil_right right_divis divis_attorney attorney_wifredo wifredo_ferrer ferrer_southern southern_district district_florida florida_settlement settlement_approv approv_miami miami_citi citi_commiss commiss_today today_effect effect_agreement agreement_sign sign_parti parti_resolv resolv_claim claim_stem stem_justic justic_depart depart_investig investig_shoot shoot_offic offic_conduct conduct_violent violent_crime crime_control control_enforc enforc_investig investig_find find_issu issu_juli juli_identifi identifi_pattern pattern_practic practic_excess excess_forc forc_shoot shoot_violat violat_fourth fourth_amend amend_constitut constitut_citi citi_complianc complianc_settlement settlement_monitor monitor_independ independ_review review_former former_tampa tampa_florida florida_polic polic_chief chief_jane jane_castor castor_settlement settlement_agreement agreement_citi citi_implement implement_comprehens comprehens_reform reform_ensur ensur_constitut constitut_polic polic_support support_public public_trust trust_settlement settlement_agreement agreement_design design_minim minim_shoot shoot_effect effect_quick quick_investig investig_shoot shoot_occur occur_measur measur_includ includ_settlement settlement_repres repres_renew renew_commit commit_citi citi_miami miami_chief chief_rodolfo rodolfo_llane llane_provid provid_constitut constitut_polic polic_miami miami_resid resid_protect protect_public public_safeti safeti_sustain sustain_reform reform_said said_princip princip_deputi deputi_assist assist_attorney attorney_general general_gupta gupta_agreement agreement_help help_strengthen strengthen_relationship relationship_communiti communiti_serv serv_improv improv_account account_offic offic_fire fire_weapon weapon_unlaw unlaw_provid provid_communiti communiti_particip particip_enforc enforc_today today_agreement agreement_result result_joint joint_effort effort_depart depart_justic justic_citi citi_miami miami_ensur ensur_miami miami_polic polic_depart depart_continu continu_effort effort_make make_communiti communiti_safe safe_protect protect_sacr sacr_constitut constitut_right right_citizen citizen_said said_attorney attorney_ferrer ferrer_oversight oversight_communic communic_agreement agreement_seek seek_make make_perman perman_posit posit_chang chang_former former_chief chief_orosa orosa_chief chief_llane llane_made made_applaud applaud_citi citi_commiss commiss_settlement settlement_agreement agreement_build build_upon upon_import import_reform reform_implement implement_citi citi_sinc sinc_justic justic_depart depart_issu issu_find find_includ includ_investig investig_conduct conduct_attorney attorney_staff staff_civil civil_right right_divis divis_special special_litig litig_section section_civil civil_divis divis_attorney attorney_offic offic_southern southern_district district_florida


C. Use the create_dtm function and the `processed_text_bigrams` column to create a document-term matrix (`dtm_bigram`) with these bigrams. Keep the following three columns in the data: `id`, `topics_clean`, and `compound` 

D. Print the (1) dimensions of the `dtm` matrix from question 2.2  and (2) the dimensions of the `dtm_bigram` matrix. Comment on why the bigram matrix has more dimensions than the unigram matrix 

E. Find and print the 10 most prevelant bigrams for each of the three topics_clean using the `get_topwords` function from 2.2

In [189]:
# your code here
def create_dtm_with_bigrams(list_of_strings, metadata):
    vectorizer = CountVectorizer(lowercase=True, ngram_range=(1, 2))  # Include unigrams and bigrams
    dtm_sparse = vectorizer.fit_transform(list_of_strings)
    dtm_dense_named = pd.DataFrame(dtm_sparse.todense(), columns=vectorizer.get_feature_names_out())
    dtm_dense_named_withid = pd.concat([metadata.reset_index(drop=True), dtm_dense_named], axis=1)
    return dtm_dense_named_withid

# Create the document-term matrix with bigrams
dtm_bigram = create_dtm_with_bigrams(doj_subset_wscore['processed_text_bigrams'], doj_subset_wscore[['id', 'topics_clean', 'compound']])

# Display the shape of each matrix
dtm_bigram 

# Print the dimensions of the dtm matrix
print("The dimensions of the dtm matrix: " + str(dtm.shape)) 

# Print the dimensions of the dtm_bigram matrix
print("The dimensions of the dtm_bigram matrix: " + str(dtm_bigram.shape)) 

Unnamed: 0,id,topics_clean,compound,aaron_ford,aaron_ford ford_memphi,aaron_latham,aaron_latham latham_munci,aaron_mcgrath,aaron_mcgrath mcgrath_websit,aaron_parrish,...,zone_varianc,zone_varianc varianc_would,zunggeemog_noel,zunggeemog_noel noel_womack,zunggeemog_prompt,zunggeemog_prompt prompt_forc,zunggeemog_write,zunggeemog_write write_report,zwengel_princeton,zwengel_princeton princeton_illinoi
0,16-718,Civil Rights,-0.9968,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,14-248,Hate Crimes,-0.9943,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,13-312,Hate Crimes,-0.9980,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,15-1348,Civil Rights,-0.9963,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,11-626,Hate Crimes,-0.9985,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
712,17-132,Civil Rights,0.9905,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
713,15-1559,Civil Rights,0.9460,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
714,16-294,Civil Rights,0.9442,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
715,15-667,Civil Rights,0.9118,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


The dimensions of the dtm matrix: (717, 6758)
The dimensions of the dtm_bigram matrix: (717, 177110)


## Comment on why the bigram matrix has more dimensions than the unigram matrix
- The bigram matrix has more dimensions than the unigram matrix because it considers not only individual words (unigrams) but also pairs of consecutive words: bigrams. Each bigram represents a unique combination of two words, effectively expanding the space of the matrix. In turn, the bigram matrix captures more detailed information about word sequences and dependencies, leading to a higher-dimensional representation compared to using unigrams alone. 

In [190]:
# Part E 

# Define a function to print the top bigrams for each topic
def print_top_bigrams_for_topic(topic):
    # Filter data for the specific topic
    topic_subset = dtm_bigram[dtm_bigram['topics_clean'] == topic]
    # Get the top bigrams for the topic
    top_bigrams = get_topwords(topic_subset.drop(['id', 'topics_clean', 'compound'], axis=1))
    # Print the top bigrams
    print(f"Top 10 bigrams for the '{topic}' topic:")
    print(top_bigrams)
    print()

# Print top bigrams for each topic
topics = dtm_bigram['topics_clean'].unique()
for topic in topics:
    print_top_bigrams_for_topic(topic)

Top 10 bigrams for the 'Civil Rights' topic:
civil_right                         948
justic_depart                       720
civil_right right_divis             679
right_divis                         679
assist_attorney                     392
attorney_general                    379
assist_attorney attorney_general    300
fair_hous                           227
depart_civil                        224
deputi_assist                       223
dtype: int64

Top 10 bigrams for the 'Hate Crimes' topic:
civil_right                         769
civil_right right_divis             523
right_divis                         523
assist_attorney                     421
hate_crime                          368
justic_depart                       275
plead_guilti                        275
attorney_general                    238
assist_attorney attorney_general    223
trial_attorney                      214
dtype: int64

Top 10 bigrams for the 'Project Safe Childhood' topic:
project_safe safe_childhood 

# 4. Optional extra credit (2 points)

You notice that the pharmaceutical kickbacks press release we analyzed in question 1 was for an indictment, and that in the original data, there's not a clear label for whether a press release outlines an indictment (charging someone with a crime), a conviction (convicting them after that charge either via a settlement or trial), or a sentencing (how many years of prison or supervised release a defendant is sentenced to after their conviction).

You want to see if you can identify pairs of press releases where one press release is from one stage (e.g., indictment) and another is from a different stage (e.g., a sentencing).

You decide that one way to approach is to find the pairwise string similarity between each of the processed press releases in `doj_subset`. There are many ways to do this, so Google for some approaches, focusing on ones that work well for entire documents rather than small strings.

Find the top two pairs (so four press releases total)-- do they seem like different stages of the same crime or just press releases covering similar crimes?

In [191]:
# your code here 
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Assuming 'doj_subset' is the DataFrame containing processed press releases

# Create TF-IDF vectors for the processed press releases
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(doj_subset_wscore['processed_text'])

# Calculate pairwise cosine similarity between the press releases
pairwise_similarity = cosine_similarity(tfidf_matrix, tfidf_matrix)

# Find the top two pairs with the highest similarity
n_pairs = 2
top_pairs = []
for i in range(len(pairwise_similarity)):
    for j in range(i+1, len(pairwise_similarity)):
        similarity = pairwise_similarity[i][j]
        top_pairs.append((i, j, similarity))

# Sort the pairs by similarity in descending order
top_pairs.sort(key=lambda x: x[2], reverse=True)

# Print the top two pairs
for idx1, idx2, similarity in top_pairs[:n_pairs]:
    print("Pair " + str(n_pairs) + ":")
    print("Press Release 1:")
    print(doj_subset_wscore.iloc[idx1]['processed_text'])
    print("\nPress Release 2:")
    print(doj_subset_wscore.iloc[idx2]['processed_text'])
    print("Similarity:", similarity)
    print("----")

Pair 2:
Press Release 1:
church hill maryland resid sentenc today year prison follow lifetim term supervis releas entic minor engag sexual activ attempt transfer obscen materi minor announc act assist attorney general kenneth blanco justic depart crimin divis act attorney benjamin greenberg southern district florida robert moor plead guilti march district judg daniel hurley southern district florida moor employ secret divis assign white hous time arrest remain custodi sinc time moor sinc termin secret servic posit accord admiss made connect plea moor maintain profil social media applic provid platform exchang digit imag well voic text messag delawar state polic detect delawar child predat task forc creat profil site pose girl moor engag number onlin chat session mobil app period includ moor work number onlin chat moor undercov offic pose femal minor sexual natur sever occas moor sent pictur includ sexual explicit imag accord plea document arrest enforc discov moor communic minor florid