# Problem set 4: Text analysis of DOJ press releases

**Total points (without extra credit)**: 52 

- For background:

    - DOJ is the federal law enforcement agency responsible for federal prosecutions; this contrasts with the local prosecutions in the Cook County dataset we analyzed earlier. Here's a short explainer on which crimes get prosecuted federally versus locally: https://www.criminaldefenselawyer.com/resources/criminal-defense/federal-crime/state-vs-federal-crimes.htm#:~:text=Federal%20criminal%20prosecutions%20are%20handled,of%20state%20and%20local%20law. 
    - Here's the Kaggle dataset that contains the data: https://www.kaggle.com/jbencina/department-of-justice-20092018-press-releases 
    - Here's the code the dataset creator used to scrape those press releases, if you're interested: https://github.com/jbencina/dojreleases

## 0.0 Import packages

In [1]:
## helpful packages
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import random
import re
import string

## nltk imports
import nltk
### uncomment and run these lines if you haven't downloaded relevant nltk add-ons yet
### nltk.download('averaged_perceptron_tagger')
### nltk.download('stopwords')
from nltk import pos_tag
from nltk.tokenize import word_tokenize, wordpunct_tokenize
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import stopwords

## spacy imports
import spacy
### if you haven't downloaded the en_core_web_sm model yet, run the line below
### in a terminal (or in a notebook cell with a leading !)
### python -m spacy download en_core_web_sm
import en_core_web_sm
nlp = en_core_web_sm.load()

## vectorizer
from sklearn.feature_extraction.text import CountVectorizer

## sentiment
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

## lda
from gensim import corpora
import gensim

## repeated printouts and wide-format text
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
pd.set_option('display.max_colwidth', None)

## 0.1 Load and clean text data

In [2]:
## first, unzip the file pset4_inputdata.zip 
## then, run this code to load the unzipped json file and convert to a dataframe
## (may need to change the pathname depending on where you store stuff)
## and convert some of the attributes from lists to values
doj = pd.read_json("./combined.json", lines = True)

## because of the json format, topics are stored as lists, so join each list into one string separated by ;
doj['topics_clean'] = ["; ".join(topic) 
                      if len(topic) > 0 else "No topic" 
                      for topic in doj.topics]

## similarly with components
doj['components_clean'] = ["; ".join(comp) 
                           if len(comp) > 0 else "No component" 
                           for comp in doj.components]

## drop older columns from data
doj = doj[['id', 'title', 'contents', 'date', 'topics_clean', 
           'components_clean']].copy()

doj.head()

Unnamed: 0,id,title,contents,date,topics_clean,components_clean
0,,Convicted Bomb Plotter Sentenced to 30 Years,"PORTLAND, Oregon. – Mohamed Osman Mohamud, 23, who was convicted in 2013 of attempting to use a weapon of mass destruction (explosives) in connection with a plot to detonate a vehicle bomb at an annual Christmas tree lighting ceremony in Portland, was sentenced today to serve 30 years in prison, followed by a lifetime term of supervised release. Mohamud, a naturalized U.S. citizen from Somalia and former resident of Corvallis, Oregon, was arrested on Nov. 26, 2010, after he attempted to detonate what he believed to be an explosives-laden van that was parked near the tree lighting ceremony in Portland. The arrest was the culmination of a long-term undercover operation, during which Mohamud was monitored closely for months as his bomb plot developed. The device was in fact inert, and the public was never in danger from the device. At sentencing, United States District Court Judge Garr M. King, who presided over Mohamed’s 14-day trial, said “the intended crime was horrific,” and that the defendant, even though he was presented with options by undercover FBI employees, “never once expressed a change of heart.” King further noted that the Christmas tree ceremony was attended by up to 10,000 people, and that the defendant “wanted everyone to leave either dead or injured.” King said his sentence was necessary in view of the seriousness of the crime and to serve as deterrence to others who might consider similar acts. “With today’s sentencing, Mohamed Osman Mohamud is being held accountable for his attempted use of what he believed to be a massive bomb to attack innocent civilians attending a public Christmas tree lighting ceremony in Portland,” said John P. Carlin, Assistant Attorney General for National Security. “The evidence clearly indicated that Mohamud was intent on killing as many people as possible with his attack. 
Fortunately, law enforcement was able to identify him as a threat, insert themselves in the place of a terrorist that Mohamud was trying to contact, and thwart Mohamud’s efforts to conduct an attack on our soil. This case highlights how the use of undercover operations against would-be terrorists allows us to engage and disrupt those who wish to commit horrific acts of violence against the innocent public. The many agents, analysts, and prosecutors who have worked on this case deserve great credit for their roles in protecting Portland from the threat posed by this defendant and ensuring that he was brought to justice.” “This trial provided a rare glimpse into the techniques Al Qaeda employs to radicalize home-grown extremists,” said Amanda Marshall, U.S. Attorney for the District of Oregon. “With the sentencing today, the court has held this defendant accountable. I thank the dedicated professionals in the law enforcement and intelligence communities who were responsible for this successful outcome. I look forward to our continued work with Muslim communities in Oregon who are committed to ensuring that all young people are safe from extremists who seek to radicalize others to engage in violence.” According to the trial evidence, in February 2009, Mohamud began communicating via e-mail with Samir Khan, a now-deceased al Qaeda terrorist who published Jihad Recollections, an online magazine that advocated violent jihad, and who also published Inspire, the official magazine of al-Qaeda in the Arabian Peninsula. Between February and August 2009, Mohamed exchanged approximately 150 emails with Khan. Mohamud wrote several articles for Jihad Recollections that were published under assumed names. In August 2009, Mohamud was in email contact with Amro Al-Ali, a Saudi national who was in Yemen at the time and is today in custody in Saudi Arabia for terrorism offenses. 
Al-Ali sent Mohamud detailed e-mails designed to facilitate Mohamud’s travel to Yemen to train for violent jihad. In December 2009, while Al-Ali was in the northwest frontier province of Pakistan, Mohamud and Al-Ali discussed the possibility of Mohamud traveling to Pakistan to join Al-Ali in terrorist activities. Mohamud responded to Al-Ali in an e-mail: “yes, that would be wonderful, just tell me what I need to do.” Al-Ali referred Mohamud to a second associate overseas and provided Mohamud with a name and email address to facilitate the process. In the following months, Mohamud made several unsuccessful attempts to contact Al-Ali’s associate. Ultimately, an FBI undercover operative contacted Mohamud via email under the guise of being an associate of Al-Ali’s. Mohamud and the FBI undercover operative agreed to meet in Portland in July 2010. At the meeting, Mohamud told the FBI undercover operative he had written articles that were published in Jihad Recollections. Mohamud also said that he wanted to become “operational.” Asked what he meant by “operational,” Mohamud said he wanted to put an explosion together, but needed help. According to evidence presented at trial, at a meeting in August 2010, Mohamud told undercover FBI operatives he had been thinking of committing violent jihad since the age of 15. Mohamud then told the undercover FBI operatives that he had identified a potential target for a bomb: the annual Christmas tree lighting ceremony in Portland’s Pioneer Courthouse Square on Nov. 26, 2010. The undercover FBI operatives cautioned Mohamud several times about the seriousness of this plan, noting there would be many people at the event, including children, and emphasized that Mohamud could abandon his attack plans at any time with no shame. Mohamud indicated the deaths would be justified and that he would not mind carrying out a suicide attack on the crowd. 
According to evidence presented at trial, in the ensuing months Mohamud continued to express his interest in carrying out the attack and worked on logistics. On Nov. 4, 2010, Mohamud and the undercover FBI operatives traveled to a remote location in Lincoln County, Oregon, where they detonated a bomb concealed in a backpack as a trial run for the upcoming attack. During the drive back to Corvallis, Mohamud was asked if was capable looking at all the bodies of those who would be killed during the explosion. In response, Mohamud noted, “I want whoever is attending that event to be, to leave either dead or injured.” Mohamud later recorded a video of himself, with the assistance of the undercover FBI operatives, in which he read a statement that offered his rationale for his bomb attack. On Nov. 18, 2010, undercover FBI operatives picked up Mohamud to travel to Portland to finalize the details of the attack. On Nov. 26, 2010, just hours before the planned attack, Mohamud examined the 1,800 pound bomb in the van and remarked that it was “beautiful.” Later that day, Mohamud was arrested after he attempted to remotely detonate the inert vehicle bomb rked near the Christmas tree lighting ceremony This case was investigated by the FBI, with assistance from the Oregon State Police, the Corvallis Police Department, the Lincoln County Sheriff’s Office and the Portland Police Bureau. The prosecution was handled by Assistant U.S. Attorneys Ethan D. Knight and Pamala Holsinger from the U.S. Attorney’s Office for the District of Oregon. Trial Attorney Jolie F. Zimmerman, from the Counterterrorism Section of the Justice Department’s National Security Division, assisted. # # # 14-1077",2014-10-01T00:00:00-04:00,No topic,National Security Division (NSD)
1,12-919,$1 Million in Restitution Payments Announced to Preserve North Carolina Wetlands,"WASHINGTON – North Carolina’s Waccamaw River watershed will benefit from a $1 million restitution order from a federal court, funding environmental projects to acquire and preserve wetlands in an area damaged by illegal releases of wastewater from a corporate hog farm, announced Ignacia S. Moreno, Assistant Attorney General of the Justice Department’s Environment and Natural Resources Division; U.S. Attorney for the Eastern District of North Carolina Thomas G. Walker; Director Greg McLeod from the North Carolina State Bureau of Investigation; and Camilla M. Herlevich, Executive Director of the North Carolina Coastal Land Trust. Freedman Farms Inc. was sentenced in February 2012 to five years of probation and ordered to pay $1.5 million in fines, restitution and community service payments for violating the Clean Water Act when it discharged hog waste into a stream that leads to the Waccamaw River. William B. Freedman, president of Freedman Farms, was sentenced to six months in prison to be followed by six months of home confinement. Freedman Farms also is required to implement a comprehensive environmental compliance program and institute an annual training program. In an order issued on April 19, 2012, the court ordered that the defendants would be responsible for restitution of $1 million in the form of five annual payments starting in January 2013, which the court will direct to the North Carolina Coastal Land Trust (NCCLT). The NCCLT plans to use the money to acquire and conserve land along streams in the Waccamaw watershed. The court also directed a $75,000 community service payment to the Southern Environmental Enforcement Network, an organization dedicated to environmental law enforcement training and information sharing in the region. 
“The resolution of the case against Freedman Farms demonstrates the commitment of the Department of Justice to enforcing the Clean Water Act to ensure the protection of human health and the environment,” said Assistant Attorney General Moreno. “The court-ordered restitution in this case will conserve wetlands for the benefit of the people of North Carolina. By enforcing the nation’s environmental laws, we will continue to ensure that concentrated animal feeding operations (CAFOs) operate without threatening our drinking water, the health of our communities and the environment.” “This office is committed to doing our part to hold accountable those who commit crimes against our environment, which can cause serious health problems to residents and damage the environment that makes North Carolina such a beautiful place to live and visit,” said U.S. Attorney Walker. “This case shows what we can accomplish when our SBI agents work closely with their local, state and federal partners to investigate environmental crimes and hold the polluters accountable,” said Director McLeod. “We’ll continue our efforts to fight illegal pollution that damages our water and puts the public’s health at risk.” “The Waccamaw is unique and wild,” said Director Herlevich of the North Carolina Coastal Land Trust. “Its watershed includes some of the most extensive cypress gum swamps in the state, and its headwaters at Lake Waccamaw contain fish that are found nowhere else on Earth. We appreciate the trust of the court and the U. S. Attorney, and we look forward to using these funds for conservation projects in a river system that is one of our top conservation priorities.” According to evidence presented in court, in December 2007 Freedman Farms discharged hog waste into Browder’s Branch, a tributary to the Waccamaw River that flows through the White Marsh, a large wetlands complex. 
Freedman Farms, located in Columbus County, N.C., is in the business of raising hogs for market, and this particular farm had some 4,800 hogs. The hog waste was supposed to be directed to two lagoons for treatment and disposal. Instead, hog waste was discharged from Freedman Farms directly into Browder’s Branch. The Clean Water Act is a federal law that makes it illegal to knowingly or negligently discharge a pollutant into a water of the United States. The Freedman case was investigated by the U.S. Environmental Protection Agency (EPA) Criminal Investigation Division, the U.S. Army Corps of Engineers and the North Carolina State Bureau of Investigation, with assistance from the EPA Science and Ecosystem Support Division. The case was prosecuted by Assistant U.S. Attorney J. Gaston B. Williams of the Eastern District of North Carolina and Trial Attorney Mary Dee Carraway of the Environmental Crimes Section of the Justice Department’s Environment and Natural Resources Division. The North Carolina Coastal Land Trust is celebrating its 20th anniversary of saving special lands in eastern North Carolina. The organization has protected nearly 50,000 acres of lands with scenic, recreational, historic and ecological values. North Carolina Coastal Land Trust has saved streams and wetlands that provide clean water, forests that are havens for wildlife, working farms that provide local food and nature parks that everyone can enjoy. More information about the Coastal Land Trust is available at www.coastallandtrust.org.",2012-07-25T00:00:00-04:00,No topic,Environment and Natural Resources Division
2,11-1002,$1 Million Settlement Reached for Natural Resource Damages at Superfund Site in Massachusetts,"BOSTON– A $1-million settlement has been reached for natural resource damages (NRD) at the Blackburn & Union Privileges Superfund Site in Walpole, Mass., the Departments of Justice and Interior (DOI), and the Office of the Massachusetts Attorney General announced today. The Blackburn & Union Privileges Superfund Site includes 22 acres of contaminated land and water in Walpole. The contamination resulted from the operations of various industrial facilities dating back to the 19th century that exposed the site to asbestos, arsenic, lead and other hazardous substances. The private parties involved in the settlement include two former owners and operators of the site, W.R. Grace & Co.– Conn. and Tyco Healthcare Group LP, as well as the current owners, BIM Investment Corp. and Shaffer Realty Nominee Trust. From about 1915 to 1936, a predecessor of W.R. Grace manufactured asbestos brake linings and clutch linings on a large portion of the property. From 1946 to about 1983, a predecessor of Tyco Healthcare operated a cotton fabric manufacturing business, which used caustic solutions, on a portion of the property. In a 2010 settlement with U.S. Environmental Protection Agency (EPA), the four private parties agreed to perform a remedial action to clean up the site at an estimated cost of $13 million. The consent decree lodged today resolves both state and federal NRD liability claims; it requires the parties to pay $1,094,169.56 to the state and federal natural resource trustees, the Massachusetts Executive Office of Energy and Environmental Affairs (EEA) and DOI, for injuries to ecological resources including groundwater and wetlands, which provide habitat for waterfowl and wading birds, including black ducks and great blue herons. The trustees will use the settlement funds for natural resource restoration projects in the area. 
“This settlement demonstrates our commitment to recovering damages from the parties responsible for injury to natural resources, in partnership with state trustees,” said Bruce Gelber, Acting Deputy Assistant Attorney General of the Justice Department’s Environment and Natural Resources Division. “The citizens of Walpole have had to live with the environmental impact of this contamination for many years,” Attorney General Martha Coakley said. “We are pleased that today’s agreement will not only require the responsible parties to reimburse taxpayer dollars, but will also provide funding to begin restoring or replacing the wetland and other natural resources.” The consent decree was lodged in the U.S. District Court for Massachusetts. A portion of the funds, $300,000, will be distributed to the EEA-sponsored groundwater restoration projects; $575,000 will be used for ecological restoration projects jointly sponsored by EEA and the U.S. Fish and Wildlife Service (FWS). In addition, $125,000 will go for projects jointly sponsored by EEA and FWS that achieve both ecological and groundwater restoration; $57,491.34 will be allocated for reimbursement for the FWS’s assessment costs; and $36,678.22 will be distributed as reimbursement for the commonwealth’s assessment costs. “This settlement provides the means for a range of projects designed to compensate the public for decades of groundwater and other ecological damage at this site. I encourage local citizens and organizations to become engaged in the public process that will take place as we solicit, take comment on, and choose these projects in the months ahead,” said Energy and Environmental Affairs Secretary Richard K. Sullivan Jr., who serves as the Commonwealth’s Natural Resources Damages trustee. “This settlement will help restore habitat for fish and wildlife in the Neponset River watershed,” said Tom Chapman of the FWS New England Field Office. 
“We look forward to working with the commonwealth and local stakeholders to implement restoration.” “More than 100 years-worth of industrial activities at this site caused major environmental contamination to the Neponset River, nearby wetlands and to groundwater below the site,” said Commissioner Kenneth Kimmell of the Massachusetts Department of Environmental Protection (MassDEP), which will staff the Trustee Council for the Commonwealth. “We will ensure that the community and the public will be active participants in the process to use these NRD funds to restore the injured natural resources.” Under the federal Comprehensive Environmental Response, Compensation and Liability Act, EEA and DOI, acting through the FWS, are the designated state and federal natural resource Trustees for the site. The site has been listed on the EPA’s National Priorities List since 1994. The consent decree is subject to a public comment period and court approval. A copy of the consent decree and instructions about how to submit comments is available on www.usdoj.gov/enrd/Consent_Decrees.html . After the consent decree is approved, EEA and FWS will develop proposed restoration plans to use the settlement funds for restoration projects. The proposed restoration plans will also be made available to the public for review and comment. Assistant Attorney General Matthew Brock of Massachusetts Attorney General Coakley's Environmental Protection Division handled this matter. Attorney Jennifer Davis of MassDEP, Attorney Anna Blumkin of EEA and MassDEP’s NRD Coordinator Karen Pelto also worked on this settlement.",2011-08-03T00:00:00-04:00,No topic,Environment and Natural Resources Division
3,10-015,10 Las Vegas Men Indicted \r\nfor Falsifying Vehicle Emissions Tests,"WASHINGTON—A federal grand jury in Las Vegas today returned indictments against 10 Nevada-certified emissions testers for falsifying vehicle emissions test reports, the Justice Department announced. Each defendant faces one felony Clean Air Act count for falsifying reports between November 2007 and May 2009. The number of falsifications varied by defendant, with some defendants having falsified approximately 250 records, while others falsified more than double that figure. One defendant is alleged to have falsified over 700 reports. The individuals indicted include: Escudero resides in Pahrump, Nev. All other individuals are from Clark County, Nev. The 10 defendants are alleged to have engaged in a practice known as ""clean scanning"" vehicles. The scheme involved entering the Vehicle Identification Number (VIN) for a vehicle that would not pass the emissions test into the computerized system, then connecting a different vehicle the testers knew would pass the test. These falsifications were allegedly performed for anywhere from $10 to $100 over and above the usual emissions testing fee. The U.S. Environmental Protection Agency (EPA), under the Clean Air Act, requires the state of Nevada to conduct vehicle emissions testing in certain areas because the areas exceed national standards for carbon monoxide and ozone. Las Vegas is currently required to perform emissions testing. To obtain a registration renewal, vehicle owners bring the vehicles to a licensed inspection station for testing. The emissions inspector logs into a computer to activate the system by using a unique password issued to the emissions inspector. The emissions inspector manually inputs the vehicle’s VIN to identify the tested vehicle, then connects the vehicle for model year 1996 and later to an onboard diagnostics port connected to an analyzer. 
The analyzer downloads data from the vehicle’s computer, analyzes the data and provides a ""pass"" or ""fail"" result. The pass or fail result and vehicle identification data are reported on the Vehicle Inspection Report. It is a crime to knowingly alter or conceal any record or other document required to be maintained by the Clean Air Act. ""Falsifications of vehicle emissions testing, such as those alleged in the indictments unsealed today, are serious matters and we intend to use all of our enforcement tools to stop this harmful practice. These actions undermine a system that is designed to reduce air pollutants including smog and provide better air quality for the citizens of Nevada,"" said Ignacia S. Moreno, Assistant Attorney General for the Justice Department’s Environment and Natural Resources Division. ""The residents of Nevada deserve to know that the vast majority of licensed vehicle emission inspectors are not corrupt and are not circumventing emission testing procedures,"" said U.S. Attorney Bogden. ""These indictments should serve as a clear warning to offenders that the Department of Justice will prosecute you if you make fraudulent statements and reports concerning compliance with the federal Clean Air Act."" ""Lying about car emissions means dirtier air, which is especially of concern in areas like Las Vegas that are already experiencing air quality problems,"" said Cynthia Giles, Assistant Administrator for Enforcement and Compliance Assurance at EPA. ""We will take aggressive action to ensure communities have clean air."" The maximum penalty for the felony violations contained in the indictments includes up to two years in prison and a fine of up to $250,000. An indictment is merely an accusation, and a defendant is presumed innocent unless and until proven guilty in a court of law. The case was investigated by the EPA, Criminal Investigation Division; and the Nevada Department of Motor Vehicles Compliance Enforcement Division. 
The case is being prosecuted by the U.S. Attorney’s Office for the District of Nevada and the Justice Department’s Environmental Crimes Section.",2010-01-08T00:00:00-05:00,No topic,Environment and Natural Resources Division
4,18-898,"$100 Million Settlement Will Speed Cleanup Work at Centredale Manor Superfund Site in North Providence, R.I.","The U.S. Department of Justice, the U.S. Environmental Protection Agency (EPA), and the Rhode Island Department of Environmental Management (RIDEM) announced today that two subsidiaries of Stanley Black & Decker Inc.—Emhart Industries Inc. and Black & Decker Inc.—have agreed to clean up dioxin contaminated sediment and soil at the Centredale Manor Restoration Project Superfund Site in North Providence and Johnston, Rhode Island. “We are pleased to reach a resolution through collaborative work with the responsible parties, EPA, and other stakeholders,” said Acting Assistant Attorney General Jeffrey H. Wood for the Justice Department's Environment and Natural Resources Division . “Today’s settlement ends protracted litigation and allows for important work to get underway to restore a healthy environment for citizens living in and around the Centredale Manor Site and the Woonasquatucket River.” “This settlement demonstrates the tremendous progress we are achieving working with responsible parties, states, and our federal partners to expedite sites through the entire Superfund remediation process,” said EPA Acting Administrator Andrew Wheeler. “The Centredale Manor Site has been on the National Priorities List for 18 years; we are taking charge and ensuring the Agency makes good on its promise to clean it up for the betterment of the environment and those communities affected.” “Successfully concluding this settlement paves the way for EPA to make good on our commitment to aggressively pursue cleaning up the Centredale Manor Superfund Site,” said EPA New England Regional Administrator Alexandra Dunn. 
“We are excited to get to work on the cleanup at this site, and get it closer to the goal of being fully utilized by the North Providence and Johnston communities.” “We are pleased that the collective efforts of the State of Rhode Island, EPA, and DOJ in these negotiations have concluded in this major milestone toward the cleanup of the Centredale Manor Restoration Superfund site and are consistent with our long-standing efforts to make the polluter pay,” said RIDEM Director Janet Coit. “The settlement will speed up a remedy that protects public health and the river environment, and moves us closer to the day that we can reclaim recreational uses of this beautiful river resource.” The settlement, which includes cleanup work in the Woonasquatucket River (River) and bordering residential and commercial properties along the River, requires the companies to perform the remedy selected by EPA for the Site in 2012, which is estimated to cost approximately $100 million, and resolves longstanding litigation. The cleanup remedy includes excavation of contaminated sediment and floodplain soil from the Woonasquatucket River, including from adjacent residential properties. Once the cleanup remedy is completed, full access to the Woonasquatucket River should be restored for local citizens. The cleanup will be a step toward the State’s goal of a fishable and swimmable river. The work will also include upgrading caps over contaminated soil in the peninsula area of the Site that currently house two high-rise apartment buildings. The settlement also ensures that the long-term monitoring and maintenance of the site, as directed in the remedy, will be implemented to ensure that public health is protected. Under the settlement, Emhart and Black & Decker will reimburse EPA for approximately $42 million in past costs incurred at the Site. 
The companies will also reimburse EPA and the State of Rhode Island for future costs incurred by those agencies in overseeing the work required by the settlement. The settlement will also include payments on behalf of two federal agencies to resolve claims against those agencies. These payments, along with prior settlements related to the Site, will result in a 100 percent recovery for the United States of its past and future response costs related to the Site. Litigation related to the Site has been ongoing for nearly eight years. While the Federal District Court found Black & Decker and Emhart to be liable for their hazardous waste and responsible to conduct the cleanup of the Site, it had also ruled that EPA needed to reconsider certain aspects of that cleanup. EPA appealed the decision requiring it to reconsider aspects of the cleanup. This settlement, once entered by the District Court, will resolve the litigation between the United States, Rhode Island, and Emhart and Black and Decker, allowing the cleanup of the Site to begin. The Site spans a one and a half mile stretch of the Woonasquatucket River and encompasses a nine-acre peninsula, two ponds and a significant forested wetland. From the 1940s to the early 1970s, Emhart’s predecessor operated a chemical manufacturing facility on the peninsula and used a raw material that was contaminated with 2,3,7,8-tetrachlorodibenzo-p-dioxin, a toxic form of dioxin. The Site property was also previously used by a barrel refurbisher. Elevated levels of dioxins and other contaminants have been detected in soil, groundwater, sediment, surface water and fish. The Site was added to the National Priorities List (NPL) in 2000, and in December 2017, EPA included the Centredale Manor Restoration Project Superfund Site on a list of Superfund sites targeted for immediate and intense attention. 
Several short-term actions were previously performed at the Site to address immediate threats to the residents and minimize potential erosion and downstream transport of contaminated soil and sediment. This settlement is the latest agreement EPA has reached since the Site was listed on the NPL. Prior agreements addressed the performance and recovery of costs for the past environmental investigations and interim cleanup actions from Emhart, the barrel reconditioning company, the current owners of the peninsula portion of the Site, and other potentially responsible parties. The Consent Decree, lodged in the U.S. District Court of Rhode Island, will be posted in the Federal Register and available for public comment for a period of 30 days. The Consent Decree can be viewed on the Justice Department website: www.justice.gov/enrd/Consent_Decrees.html. EPA information on the Centredale Manor Superfund Site: www.epa.gov/superfund/centredale.",2018-07-09T00:00:00-04:00,Environment,Environment and Natural Resources Division


## 1. Tagging and sentiment scoring (17 points)

Focus on the following press release: `id` == "17-1204" about this pharmaceutical kickback prosecution: https://www.forbes.com/sites/michelatindera/2017/11/16/fentanyl-billionaire-john-kapoor-to-plead-not-guilty-in-opioid-kickback-case/?sh=21b8574d6c6c 

The `contents` column is the one we're treating as a document. You may need to convert it from a pandas series to a single string.

We'll call the raw string of this press release `pharma`

In [3]:
## your code to subset to one press release and take the string
pharma = doj.loc[doj['id'] == '17-1204'] 
pharma = ' '.join(pharma['contents'].tolist())
pharma

'The founder and majority owner of Insys Therapeutics Inc., was arrested today and charged with leading a nationwide conspiracy to profit by using bribes and fraud to cause the illegal distribution of a Fentanyl spray intended for cancer patients experiencing breakthrough pain.\xa0"More than 20,000 Americans died of synthetic opioid overdoses last year, and millions are addicted to opioids. And yet some medical professionals would rather take advantage of the addicts than try to help them," said Attorney General Jeff Sessions. "This Justice Department will not tolerate this.\xa0 We will hold accountable anyone – from street dealers to corporate executives -- who illegally contributes to this nationwide epidemic.\xa0 And under the leadership of President Trump, we are fully committed to defeating this threat to the American people.”John N. Kapoor, 74, of Phoenix, Ariz., a current member of the Board of Directors of Insys, was arrested this morning in Arizona and charged with RICO conspi

### 1.1 part of speech tagging (3 points)

A. Preprocess the `pharma` press release to remove all punctuation and digits (so you can use `.isalpha()` to subset)

B. With the preprocessed press release from part A, use the part of speech tagger within nltk to tag all the words in that one press release with their part of speech. 

C. Using the output from B, extract the adjectives and sort those adjectives from most occurrences to fewest occurrences. Print a dataframe with the 5 most frequent adjectives and their counts in the `pharma` release. See here for a list of the names of adjectives within nltk: https://pythonprogramming.net/natural-language-toolkit-nltk-part-speech-tagging/

**Resources**:

- Documentation for `.isalpha()`: https://www.w3schools.com/python/ref_string_isalpha.asp
- `processtext` function here has an example of tokenizing and filtering to words where `.isalpha()` is true: https://github.com/jhaber-zz/QSS20_public/blob/main/activities/solutions/05_textasdata_partII_topicmodeling_solutions.ipynb
- Part of speech tagging section of this code: https://github.com/jhaber-zz/QSS20_public/blob/main/activities/solutions/05_textasdata_partI_textmining_solutions.ipynb



In [4]:
# A. Preprocess the pharma press release to remove all punctuation / digits
# so that you can use .isalpha() to subset
processed_pharma = ''.join(c for c in pharma if c.isalpha() or c.isspace())

In [5]:
## your code here for part of speech tagging
# B. With the preprocessed press release from part A, use the part of speech tagger within nltk
# to tag all the words in that one press release with their part of speech

words = word_tokenize(processed_pharma)
tags = nltk.pos_tag(words)


# C. Using the output from B, extract the adjectives and sort those adjectives
# from most occurrences to fewest occurrences.
adj_list = [word for word, tag in tags if tag == 'JJ']
adj_counts = {}
for adj in adj_list:
    if adj in adj_counts:
        adj_counts[adj] += 1
    else:
        adj_counts[adj] = 1
      
sorted_adj_counts = sorted(adj_counts.items(), key=lambda x: x[1], reverse=True)

# Print the 5 most frequent adjectives and their counts in the pharma release
df = pd.DataFrame(sorted_adj_counts[0:5], columns=['Adjective', 'Count'])
print(df)

    Adjective  Count
0      former      8
1      opioid      5
2  nationwide      4
3       other      3
4   addictive      3
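
The counting loop above can also be written with `collections.Counter`. The sketch below is one possible variant, not the required approach. It assumes a list of `(word, tag)` pairs shaped like the output of `nltk.pos_tag`, and it counts every tag beginning with `JJ`, so comparative (`JJR`) and superlative (`JJS`) adjectives are included as well. The `example_tags` list and the `top_adjectives` helper are hypothetical names used only for illustration.

```python
from collections import Counter

# Hypothetical (word, tag) pairs shaped like nltk.pos_tag output
example_tags = [
    ("former", "JJ"), ("executives", "NNS"), ("former", "JJ"),
    ("larger", "JJR"), ("largest", "JJS"), ("opioid", "JJ"),
    ("former", "JJ"), ("opioid", "JJ"),
]

def top_adjectives(tagged_words, n=5):
    """Return the n most frequent adjectives as (word, count) pairs."""
    # Keep every adjective tag: JJ, JJR (comparative), JJS (superlative)
    adjectives = [word for word, tag in tagged_words if tag.startswith("JJ")]
    return Counter(adjectives).most_common(n)

top_two = top_adjectives(example_tags, n=2)
print(top_two)
```

The list of pairs can be passed directly to `pd.DataFrame(..., columns=['Adjective', 'Count'])` to get the same layout as the printed dataframe.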


### 1.2 named entity recognition (4 points)

A. Using the original `pharma` press release (so the one before stripping punctuation/digits), use spaCy to extract all named entities from the press release.

B. Print the unique named entities with the tag: `LAW`

**Resources**:

- For parts A and B: named entity recognition part of this code: https://github.com/jhaber-zz/QSS20_public/blob/main/activities/solutions/05_textasdata_partI_textmining_solutions.ipynb

In [36]:
## your code here for part A
nlp = spacy.load("en_core_web_sm")

spacy_pharma= nlp(pharma)


for ent in spacy_pharma.ents:
    print(ent.text, ent.label_)

Insys Therapeutics Inc. ORG
today DATE
Fentanyl PERSON
More than 20,000 CARDINAL
Americans NORP
last year DATE
millions CARDINAL
Jeff Sessions PERSON
This Justice Department ORG
Trump PERSON
American NORP
N. Kapoor PERSON
74 DATE
Phoenix GPE
Ariz. GPE
the Board of Directors ORG
Insys ORG
this morning TIME
Arizona GPE
RICO LAW
Executive ORG
Board ORG
Insys ORG
Phoenix GPE
today DATE
U.S. GPE
District Court ORG
Boston GPE
a later date DATE
today DATE
Boston GPE
Insys ORG
December 2016.The DATE
Kapoor GPE
Michael L. Babich PERSON
40 DATE
Scottsdale GPE
Ariz. GPE
Alec Burlakoff PERSON
42 DATE
Charlotte GPE
N.C. GPE
Richard M. Simon PERSON
46 DATE
Seal Beach GPE
Calif. GPE
Sunrise Lee PERSON
36 DATE
Bryant City GPE
Mich. GPE
Joseph A. Rowan PERSON
43 DATE
Panama City GPE
Fla. GPE
Managed Markets ORG
Michael J. Gurry PERSON
53 DATE
Scottsdale GPE
Ariz. GPE
Subsys PERSON
Kapoor GPE
six CARDINAL
Kapoor PERSON
Acting United States GPE
William D. Weinreb PERSON
Today DATE
Insys ORG
Harold H. Sha

In [37]:
## your code here for part B

# Printing the unique named entities with the tag "LAW"
unique_entities = set([ent.text for ent in spacy_pharma.ents if ent.label_ == "LAW"])

print("Unique named entities with the tag 'LAW':" + str(unique_entities))


Unique named entities with the tag 'LAW':{'the Controlled Substances Act', 'RICO'}


C. Use Google to summarize in one sentence what the `RICO` named entity means and why this might apply to a pharmaceutical kickbacks case (and not just a mafia case...) 

In [38]:
## your code here
# The Racketeer Influenced and Corrupt Organizations Act (RICO) is a federal law that targets organized crime. 
# It allows for the prosecution of individuals who participate in an enterprise that engages in criminal activity. 
# In the context of a pharmaceutical kickbacks case, RICO may apply if there is evidence that the company 
# and its employees were involved in a pattern of illegal activity, such as paying kickbacks to doctors or 
# Medicare/Medicaid providers in exchange for prescribing or purchasing their drugs. 
# By using RICO, prosecutors can target the entire criminal enterprise and hold its members accountable, 
# rather than just individual actors.

D. You want to extract the possible sentence lengths the CEO is facing; pull out the named entities with (1) the label `DATE` and (2) that contain the word year or years (hint: you may want to use the `re` module for that second part). Print these named entities.

In [39]:
# D. Extracting named entities with label "DATE" containing "year" or "years"
date_entities = [ent.text for ent in spacy_pharma.ents if ent.label_ == "DATE"]
year_entities = [entity for entity in date_entities if re.search(r'\byear(s?)\b', entity, re.IGNORECASE)]

print("Named entities that contain the word year or years:")
print(year_entities)



Named entities that contain the word year or years:
['last year', 'three years', 'three years']


E. Pull and print the original parts of the press releases where those year lengths are mentioned (e.g., the sentences or rough region of the press release). Describe in your own words (1 sentence) what length of sentence (prison) and probation (supervised release) the CEO may be facing if convicted after this indictment (if there are multiple lengths mentioned describe the maximum). 

**Hint**: you may want to use re.search or re.findall 

- For part E, `re.search` and `re.findall` examples here for filtering to ones containing year (multiple approaches; some need not involve `re`): https://github.com/jhaber-zz/QSS20_public/blob/main/activities/solutions/05_basicregex_solutions.ipynb
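
One way to approach part E is sketched below. It splits the text into rough sentences with `re.split` and keeps the sentences that contain any of the extracted phrases. The helper name `sentences_mentioning` and the sample text are assumptions made for illustration. The split on punctuation followed by whitespace is only approximate, since the press releases sometimes omit the space after a period.

```python
import re

def sentences_mentioning(text, phrases):
    """Return the rough sentences in `text` that contain any of `phrases`."""
    unique_phrases = sorted(set(phrases))
    if not unique_phrases:
        return []
    # Split after ., ! or ? when followed by whitespace (approximate sentences)
    sentences = re.split(r"(?<=[.!?])\s+", text)
    # Build an "or" pattern from the phrases, escaping any special characters
    pattern = re.compile("|".join(re.escape(p) for p in unique_phrases),
                         re.IGNORECASE)
    return [s.strip() for s in sentences if pattern.search(s)]

# Hypothetical sample text modeled on the press release wording
sample_text = (
    "The CEO was arrested today. The charges provide for a sentence of no "
    "greater than 20 years in prison, three years of supervised release and "
    "a fine of $250,000. A hearing will be scheduled."
)
matches = sentences_mentioning(sample_text, ["three years", "three years"])
for sentence in matches:
    print(sentence)
```

Deduplicating the phrases first means each matching sentence is returned once, even when the same entity text was extracted several times.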

In [40]:
# Part E
pharma_list = pharma.split(".")

# create a set to store sentences that have already been printed
printed_entities = set()

for sent in pharma_list:
    for year in year_entities:
        if year in sent:
            if sent not in printed_entities:
                print("Named entity:", year)
                print("Sentence containing named entity:", sent)
                printed_entities.add(sent)
                
# If convicted of conspiracy to commit RICO or mail and wire fraud, the CEO faces at most 20 years in prison,
# three years of supervised release, and a $250,000 fine (or twice the amount of pecuniary gain or loss).
# If convicted of conspiracy to violate the Anti-Kickback Law, he faces at most five years in prison,
# three years of supervised release, and a $25,000 fine.


Named entity: last year
Sentence containing named entity:  "More than 20,000 Americans died of synthetic opioid overdoses last year, and millions are addicted to opioids
Named entity: three years
Sentence containing named entity: The charges of conspiracy to commit RICO and conspiracy to commit mail and wire fraud each provide for a sentence of no greater than 20 years in prison, three years of supervised release and a fine of $250,000, or twice the amount of pecuniary gain or loss
Named entity: three years
Sentence containing named entity:   The charges of conspiracy to violate the Anti-Kickback Law provide for a sentence of no greater than five years in prison, three years of supervised release and a $25,000 fine


### 1.3 sentiment analysis (10 points)

- Sentiment analysis section of this script: https://github.com/jhaber-zz/QSS20_public/blob/main/activities/solutions/05_textasdata_partI_textmining_solutions.ipynb


A. Subset the press releases to those labeled with one of three topics via `topics_clean`: Civil Rights, Hate Crimes, and Project Safe Childhood. We'll call this `doj_subset` going forward and it should have 717 rows.



In [11]:
## your code here for subsetting
# Subsetting the press releases to those labeled with one of three topics
topics_clean = ["Civil Rights", "Hate Crimes", "Project Safe Childhood"]
doj_subset = doj[doj['topics_clean'].isin(topics_clean)]

print("Number of rows in the subsetted dataframe:", doj_subset.shape[0])

Number of rows in the subsetted dataframe: 717


B. Write a function that takes one press release string as an input and:

- Removes named entities from each press release string (**Hint**: you may want to use `re.sub` with an or condition)
- Scores the sentiment of the entire press release using the `SentimentIntensityAnalyzer` and `polarity_scores`
- Returns the length-four (negative, positive, neutral, compound) sentiment dictionary (any order is fine)

Apply that function to each of the press releases in `doj_subset`. 

**Hints**: 

- I used a function + list comprehension to execute and it takes about 30 seconds on my local machine and about 2 mins on jhub; if it's taking a very long time, you may want to check your code for inefficiencies. If you can't fix those, for partial credit on this part/full credit on remainder, you can take a small random sample of the 717
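
The `re.sub` hint above can be sketched as follows, assuming the entity strings for a press release have already been extracted (for example from `ent.text` in spaCy). The helper name `remove_entities` and the sample inputs are hypothetical. Longer entities are placed first in the alternation so that "Jeff Sessions" is removed as a whole before the shorter "Jeff" is tried.

```python
import re

def remove_entities(text, entity_texts):
    """Delete every occurrence of the given entity strings from `text`."""
    # Longest first, so multi-word entities win over their own substrings
    unique = sorted(set(entity_texts), key=len, reverse=True)
    if not unique:
        return text
    # "Or" pattern: entity1|entity2|..., escaped so "Inc." is matched literally
    pattern = "|".join(re.escape(entity) for entity in unique)
    cleaned = re.sub(pattern, "", text)
    # Collapse the whitespace left behind by the deletions
    return re.sub(r"\s+", " ", cleaned).strip()

# Hypothetical sample sentence and entity list
sample = "Jeff Sessions said the Justice Department will act."
cleaned_sample = remove_entities(
    sample, ["Jeff Sessions", "Justice Department", "Jeff"]
)
print(cleaned_sample)
```

The cleaned string can then be passed to `polarity_scores` in place of the raw text.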


In [12]:

# create the analyzer once rather than on every function call
sia = SentimentIntensityAnalyzer()

def func_remove_and_score(text):
    # Approximate named-entity removal with a regular expression: strip capitalized
    # words (a heuristic that also drops ordinary sentence-initial words)
    named_entities_pattern = r'\b[A-Z][a-zA-Z]+\b'
    text = re.sub(named_entities_pattern, '', text)

    # Score the sentiment of the remaining text
    sentiment_scores = sia.polarity_scores(text)

    return sentiment_scores


In [13]:
## your code here executing the function
# Applying the function to each press release in doj_subset
sentiment_scores = [func_remove_and_score(pr) for pr in doj_subset['contents']]

C. Add the four sentiment scores to the `doj_subset` dataframe to create a dataframe: `doj_subset_wscore`. Sort from highest neg to lowest neg score and print the `id`, `contents`, and `neg` columns for the two most negative press releases. 

Notes:

- Don't worry if your sentiment score differs slightly from our output on GitHub; differences in preprocessing can lead to different scores

In [14]:
## your code here
# Apply function to each press release in doj_subset
doj_subset_wscore = doj_subset.copy()
doj_subset_wscore["sentiment"] = doj_subset_wscore["contents"].apply(func_remove_and_score)
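# polarity_scores returns keys in the order neg, neu, pos, compound; list the new names in that same order
doj_subset_wscore[["negative", "neutral", "positive", "compound"]] = pd.DataFrame(doj_subset_wscore["sentiment"].tolist(), index=doj_subset_wscore.index)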

# Sort by highest negative score
doj_subset_wscore = doj_subset_wscore.sort_values(by="negative", ascending=False)

# Print top 2 most negative press releases
top_2_neg = doj_subset_wscore[["id", "contents", "negative"]].head(2)
print(top_2_neg)

           id  \
329    14-248   
11593  16-718   

D. With the dataframe from part C, find the mean compound sentiment score for each of the three topics in `topics_clean` using `groupby` and `agg`.

E. Add a 1 sentence interpretation of why we might see the variation in scores (remember that compound is a standardized summary where -1 is most negative; +1 is most positive)


In [15]:
## agg and find the mean compound score by topic
# D
mean_compound_scores = doj_subset_wscore.groupby('topics_clean').agg({'compound': 'mean'})
print(mean_compound_scores)

# E. Add a 1 sentence interpretation of why we might see the variation in scores (remember that compound is 
# a standardized summary where -1 is most negative; +1 is most positive)

# Hate Crimes and Project Safe Childhood releases describe violent or exploitative criminal conduct 
# and its prosecution, which likely drives their strongly negative scores, while Civil Rights releases 
# also cover settlements and agreements written in more neutral or positive language, pulling that mean toward zero.

                        compound
topics_clean                    
Civil Rights           -0.101437
Hate Crimes            -0.934456
Project Safe Childhood -0.722579


# 2. Topic modeling (25 points)

For this question, use the `doj_subset_wscore` data, which is restricted to civil rights, hate crimes, and project safe childhood and has the sentiment scores added


## 2.1 Preprocess the data by removing stopwords, punctuation, and non-alpha words (5 points)

A. Write a function that:

- Takes in a single raw string in the `contents` column from that dataframe
- Does the following preprocessing steps:

    - Converts the words to lowercase
    - Removes stopwords, adding the custom stopwords in your code cell below to the default stopwords list
    - Only retains alpha words (so removes digits and punctuation)
    - Only retains words 4 characters or longer
    - Uses the snowball stemmer from nltk to stem

- Returns a joined preprocessed string
    
B. Use `apply` or list comprehension to execute that function and create a new column in the data called `processed_text`
    
C. Print the `id`, `contents`, and `processed_text` columns for the following press releases:

id = 16-718 (this case: https://www.seattletimes.com/nation-world/doj-miami-police-reach-settlement-in-civil-rights-case/)

id = 16-217 (this case: https://www.wlbt.com/story/32275512/three-mississippi-correctional-officers-indicted-for-inmate-assault-and-cover-up/)
    
**Resources**:

- Here's code examples for the snowball stemmer: https://www.geeksforgeeks.org/snowball-stemmer-nlp/
- Here's code with topic modeling steps: https://github.com/jhaber-zz/QSS20_public/blob/main/activities/solutions/05_textasdata_partII_topicmodeling_solutions.ipynb

In [16]:
custom_doj_stopwords = ["civil", "rights", "division", "department", "justice",
                        "office", "attorney", "district", "case", "investigation", "assistant",
                       "trial", "assistance", "assist"]

In [17]:
default_stops = stopwords.words("english")
stops = default_stops + custom_doj_stopwords

snowball = SnowballStemmer("english")

def func(text):
    
    # lowercase and tokenize the string
    tokens = nltk.word_tokenize(text.lower())
    
    # remove stopwords
    nostops = [token for token in tokens if token not in stops]
    
    # stem the words that are alpha words and at least 4 characters
    stems = [snowball.stem(token) for token in nostops if token.isalpha() and len(token) > 3]
    
    # join the stems back into a single space-separated string
    return " ".join(stems)

In [18]:
# apply the function
doj_subset_wscore["processed_text"] = doj_subset_wscore.contents.apply(func)

In [19]:
print(doj_subset_wscore[["id", "contents", "processed_text"]][doj_subset_wscore.id.isin(["16-718", "16-217"])])

           id  \
11593  16-718   
6727   16-217   

## 2.2 Create a document-term matrix from the preprocessed press releases and explore top words (5 points)

A. Use the `create_dtm` function I provide (alternately, feel free to write your own!) and create a document-term matrix using the preprocessed press releases; make sure metadata contains the following columns: `id`, `compound` sentiment column you added, and the `topics_clean` column

B. Print the top 10 words for press releases with compound sentiment in the top 5% (so the most positive sentiment)

C. Print the top 10 words for press releases with compound sentiment in the bottom 5% (so the most negative sentiment)

**Hint**: for these, remember the pandas quantile function from pset one.  

D. Print the top 10 words for press releases in each of the three `topics_clean`

For steps B - D, to receive full credit, write a function `get_topwords` that helps you avoid duplicated code when you find top words for the different subsets of the data. There are different ways to structure it but one way is to feed it subsetted data (so data subsetted to one topic etc.) and for it to get the top words for that subset.

**Resources**:

- This notebook contains an example of applying the `create_dtm` function: https://github.com/jhaber-zz/QSS20_public/blob/main/activities/solutions/05_textasdata_partII_topicmodeling_solutions.ipynb


In [20]:
def create_dtm(list_of_strings, metadata):
    vectorizer = CountVectorizer(lowercase = True)
    dtm_sparse = vectorizer.fit_transform(list_of_strings)
    dtm_dense_named = pd.DataFrame(dtm_sparse.todense(), 
        columns=vectorizer.get_feature_names_out())
    dtm_dense_named_withid = pd.concat([metadata.reset_index(), dtm_dense_named], axis = 1)
    return(dtm_dense_named_withid)

In [21]:
dtm = create_dtm(doj_subset_wscore.processed_text, doj_subset_wscore[["id", "compound", "topics_clean"]])

In [22]:
print(dtm)

     index       id  compound  topics_clean  aaron  abandon  abbat  abbi  \
0      329   14-248   -0.9957   Hate Crimes      0        0      0     0   
1    11593   16-718   -0.9964  Civil Rights      0        0      0     0   
2      572   13-312   -0.9977   Hate Crimes      0        0      0     0   
3      501   11-626   -0.9985   Hate Crimes      0        0      0     0   
4    11903    18-70   -0.9908  Civil Rights      0        0      0     0   
..     ...      ...       ...           ...    ...      ...    ...   ...   
712  11065   16-294    0.8519  Civil Rights      0        0      0     0   
713   6787   17-132    0.9758  Civil Rights      0        0      0     0   
714   1857   17-271    0.9899  Civil Rights      0        0      0     0   
715  10286  15-1559    0.8481  Civil Rights      0        0      0     0   
716  11085   15-667    0.4767  Civil Rights      0        0      0     0   

     abbott  abdomen  ...  zamora  zane  zealand  zealous  zeeman  zero  \
0         0 

In [41]:
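def get_topwords(subset, nwords):
    # the first 4 columns are metadata (index, id, compound, topics_clean);
    # select the word columns by position since a word can share a metadata column's name
    sums = subset.iloc[:, 4:].sum(axis = 0)
    sort = sums.sort_values(ascending=False)
    # return the requested number of words rather than a hard-coded 10
    out = sort[:nwords]
    
    return out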

In [42]:
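# dtm has two columns named "compound" (the sentiment score from the metadata and,
# apparently, the stemmed word "compound"), so take the first one by position
upper = dtm.compound.iloc[:,0].quantile(q=.95)
lower = dtm.compound.iloc[:,0].quantile(q=.05)

# check that the first "compound" column holds the sentiment scores
print(dtm.compound.iloc[:,0])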

0     -0.9957
1     -0.9964
2     -0.9977
3     -0.9985
4     -0.9908
        ...  
712    0.8519
713    0.9758
714    0.9899
715    0.8481
716    0.4767
Name: compound, Length: 717, dtype: float64


In [43]:
# top 5% sentiment
get_topwords(dtm[dtm.compound.iloc[:,0] > upper], nwords=10)

# bottom 5% sentiment
get_topwords(dtm[dtm.compound.iloc[:,0] < lower], nwords=10)


# three topics 
get_topwords(dtm[dtm.topics_clean == "Project Safe Childhood"], nwords=10)

get_topwords(dtm[dtm.topics_clean == "Civil Rights"], nwords=10)

get_topwords(dtm[dtm.topics_clean == "Hate Crimes"], nwords=10)

agreement     174.0
state         110.0
enforc        106.0
disabl        102.0
ensur         101.0
settlement     86.0
student        86.0
polic          85.0
communiti      80.0
school         80.0
dtype: float64

victim      161.0
assault     156.0
offic       152.0
crime       146.0
hate        119.0
sentenc      98.0
defend       95.0
feder        90.0
prosecut     84.0
charg        84.0
dtype: float64

child          1018.0
exploit         698.0
sexual          570.0
safe            476.0
childhood       472.0
project         472.0
pornographi     447.0
children        416.0
crimin          404.0
prosecut        374.0
dtype: float64

offic        627.0
hous         620.0
discrimin    541.0
enforc       531.0
disabl       509.0
said         497.0
feder        475.0
violat       470.0
state        443.0
general      408.0
dtype: float64

victim      590.0
crime       533.0
prosecut    476.0
hate        472.0
defend      459.0
sentenc     455.0
charg       452.0
guilti      430.0
feder       426.0
said        424.0
dtype: float64

## 2.3 Estimate a topic model using those preprocessed words (5 points)

A. Going back to the preprocessed words from part 2.1, estimate a topic model with 3 topics, since you want to see if the unsupervised topic models recover different themes for each of the three manually-labeled areas (civil rights; hate crimes; project safe childhood). You have free rein over the other topic model parameters beyond the number of topics.

B. After estimating the topic model, print the top 15 words in each topic.

**Hints and Resources**:

- Same topic modeling resources linked to above
- Make sure to use the `random_state` argument within the model so that the numbering of topics does not move around between runs of your code

In [26]:
# store texts in a list
list_of_texts = [wordpunct_tokenize(word) for word in doj_subset_wscore.processed_text]

# make dict
text_dict = corpora.Dictionary(list_of_texts)

# filter very common and uncommon words
# no_below is an absolute document count; no_above is a fraction of the corpus
lower_bound = round(doj_subset_wscore.shape[0]*0.05)

text_dict.filter_extremes(no_below = lower_bound,
                            no_above = 0.95)

# convert each tokenized text to a bag of words: a list of (token id, count) pairs
corpus_dict = [text_dict.doc2bow(text) 
                   for text in list_of_texts]
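
In gensim's `filter_extremes`, `no_below` is an absolute number of documents while `no_above` is a fraction of the corpus. The pure-Python sketch below illustrates that rule on a toy corpus. It is an approximation written for illustration, not gensim's own code, and `filter_vocab` and `toy_docs` are hypothetical names.

```python
from collections import Counter

def filter_vocab(docs, no_below, no_above):
    """Keep tokens found in at least `no_below` documents and in at most
    a fraction `no_above` of all documents."""
    # Document frequency: count each token once per document
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = int(no_above * len(docs))
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}

# Hypothetical toy corpus of already-stemmed tokens
toy_docs = [
    ["child", "victim"],
    ["child", "hous"],
    ["child", "victim"],
    ["child"],
]
kept = filter_vocab(toy_docs, no_below=2, no_above=0.75)
print(kept)
```

Here "child" is dropped for appearing in every document and "hous" for appearing in only one, so only "victim" survives.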


In [27]:
mod = gensim.models.ldamodel.LdaModel(corpus_dict, 
                                         num_topics = 3, 
                                         id2word=text_dict, 
                                         passes=10, 
                                         alpha = 'auto',
                                         per_word_topics = True,
                                         random_state=1)

In [28]:
topics = mod.print_topics(num_words = 5)

for topic in topics:
    print(topic)
    
## It looks like the topic model recovered the three manual topics. 
## Topic 0 is Hate Crimes, topic 1 is Project Safe Childhood, and topic 2 is Civil Rights. 

(0, '0.015*"charg" + 0.015*"sentenc" + 0.015*"victim" + 0.014*"prosecut" + 0.014*"feder"')
(1, '0.033*"child" + 0.022*"exploit" + 0.020*"sexual" + 0.016*"victim" + 0.015*"safe"')
(2, '0.018*"hous" + 0.016*"discrimin" + 0.015*"disabl" + 0.012*"enforc" + 0.012*"agreement"')


## 2.4 Add topics back to main data and explore correlation between manual labels and our estimated topics (10 points)

A. Extract the document-level topic probabilities. Within `get_document_topics`, use the argument `minimum_probability` = 0 to make sure all 3 topic probabilities are returned. Write an assert statement to make sure the length of the list is equal to the number of rows in the `doj_subset_wscore` dataframe

B. Add the topic probabilities to the `doj_subset_wscore` dataframe as columns and create a column, `top_topic`, that assigns each document to its highest-probability topic (eg topic 1, 2, or 3)

C. For each of the manual labels in `topics_clean` (Hate Crime, Civil Rights, Project Safe Childhood), print the breakdown of the % of documents with each top topic (so, for instance, Hate Crime has 246 documents-- if 123 of those documents are coded to topic_1, that would be 50%; and so on). **Hint**: pd.crosstab and normalize may be helpful: https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.crosstab.html

D. Using a couple press releases as examples, write a 1-2 sentence interpretation of why some of the manual topics map on more cleanly to an estimated topic than other manual topic(s)

**Resources**:

- End of this code (`Additional summaries of topics and documents`) contains example of how to use `get_document_topics` and other steps to add topic probabilities back to data: https://github.com/jhaber-zz/QSS20_public/blob/main/activities/solutions/05_textasdata_partII_topicmodeling_solutions.ipynb
- If you're getting errors, use `shape`, `len`, and other commands to check the dimensionality of things at different steps 
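
The logic of parts B and C can be sketched in plain Python, assuming each document's topics arrive as a list of `(topic_id, probability)` pairs like those returned by `get_document_topics`. The helper names and the example values below are hypothetical. The row-wise shares computed here correspond to what `pd.crosstab(..., normalize="index")` reports.

```python
from collections import Counter, defaultdict

def top_topic(doc_topic_probs):
    """Return the label of the highest-probability topic for one document."""
    topic_id, _ = max(doc_topic_probs, key=lambda pair: pair[1])
    return f"topic_{topic_id}"

def topic_breakdown(labels, top_topics):
    """For each manual label, return the share of documents per top topic."""
    counts = defaultdict(Counter)
    for label, topic in zip(labels, top_topics):
        counts[label][topic] += 1
    return {
        label: {topic: n / sum(c.values()) for topic, n in c.items()}
        for label, c in counts.items()
    }

# Hypothetical probabilities for four documents and their manual labels
example_doc_topics = [
    [(0, 0.90), (1, 0.05), (2, 0.05)],
    [(0, 0.60), (1, 0.10), (2, 0.30)],
    [(0, 0.20), (1, 0.10), (2, 0.70)],
    [(0, 0.02), (1, 0.96), (2, 0.02)],
]
example_labels = ["Hate Crimes", "Civil Rights", "Civil Rights",
                  "Project Safe Childhood"]

example_top = [top_topic(doc) for doc in example_doc_topics]
shares = topic_breakdown(example_labels, example_top)
print(shares)
```

In this toy example half of the Civil Rights documents land in `topic_0`, so that label's row splits evenly between two topics while the other two labels map to a single topic each.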

In [29]:
doc_topics = [mod.get_document_topics(el, minimum_probability=0) for el in corpus_dict]

assert len(doc_topics) == len(doj_subset_wscore.index)

In [30]:
probs_df = pd.DataFrame([t for lst in doc_topics for t in lst],
                                     columns = ['topic', 'probability'])

probs_df['doc_id'] = list(np.concatenate([[one_id] * 3 for one_id in doj_subset_wscore.id]).flat)

probs_df_wide = pd.pivot_table(probs_df, index = ['doc_id'], columns = ['topic']).reset_index().reset_index(drop = True)

probs_df_wide.columns = ['doc_id'] + ["topic_" + str(i) for i in np.arange(0, 3)]

doj_subset_merged = pd.merge(probs_df_wide, doj_subset_wscore, left_on = 'doc_id', right_on = 'id')


doj_subset_merged['top_topic'] = doj_subset_merged[[col for col in doj_subset_merged.columns if "topic_" in col]].idxmax(axis=1)
    

In [31]:
breakdown = pd.crosstab(index=doj_subset_merged.topics_clean, columns=doj_subset_merged.top_topic, normalize="index")
breakdown

## Hate Crimes was categorized most cleanly: 100% of its documents have topic_0 as their top topic.
## Project Safe Childhood was modeled almost as well, with 99% of its documents assigned to topic_1.
## About a third of the Civil Rights documents were assigned to the Hate Crimes topic (topic_0), likely because
## both kinds of releases share words about protected-class identity, like race and gender. This code prints some of these mismatches:

cond = [doj_subset_merged.top_topic == "topic_0", doj_subset_merged.top_topic == "topic_1", doj_subset_merged.top_topic == "topic_2"]
values = ["Hate Crimes", "Project Safe Childhood", "Civil Rights"]

doj_subset_merged["top_named"] = np.select(cond, values, "MISSING")
mismatches = doj_subset_merged[doj_subset_merged.top_named != doj_subset_merged.topics_clean]
mismatches[["contents", 'top_named', "topics_clean"]].head()

## Many of the mismatches are Civil Rights releases about criminal prosecutions (for example, police officers 
## charged with using excessive force), which use many of the same words as Hate Crimes releases, like "victim" and "prosecuted."


top_topic               topic_0  topic_1  topic_2
topics_clean                                     
Civil Rights            0.33887  0.00000  0.66113
Hate Crimes             1.00000  0.00000  0.00000
Project Safe Childhood  0.00000  0.99375  0.00625


Unnamed: 0,contents,top_named,topics_clean
244,"Principal Deputy Assistant Attorney General Vanita Gupta, head of the Justice Department’s Civil Rights Division and U.S. Attorney Tammy Dickinson of the Western District of Missouri announced that former Independence, Missouri, police officer Timothy Runnels pleaded guilty today for violating the constitutional rights of a minor who was in his custody. Documents filed in connection with the guilty plea state that Runnels, while employed as an officer of the Independence Police Department, deprived the minor of his civil rights by deliberately dropping the minor face first onto the ground while the minor was restrained and not posing a threat to Runnels or others. According to the court filings, Runnels’ actions resulted in bodily injury to the minor. Runnels faces a statutory maximum sentence of 10 years in prison and a fine of $250,000 for his conviction for violating the minor’s civil rights. “The department remains committed to ensuring that police officers who violate their sworn oaths by using excessive force are held accountable,” said Principal Deputy Assistant Attorney General Gupta. “I am hopeful that today’s plea brings a measure of closure for the victim.” “The use of excessive force by law enforcement officers is a serious offense that strikes at the heart of Constitutional protections for all citizens,” said U.S. Attorney Dickinson of the Western District of Missouri. “This former police officer who violated his sworn duty to protect and serve should not reflect upon the vast majority of officers who perform their duties with integrity and professionalism.” This case was investigated by the FBI’s Kansas City Division and is being prosecuted by Trial Attorney Shan Patel of the Civil Rights Division and First Assistant U.S. Attorney David Ketchmark of the Western District of Missouri.",Hate Crimes,Civil Rights
259,"The Justice Department announced today that the former Mamou, Louisiana, Police Chief Gregory W. Dupuis was sentenced to one year and one day in prison, and former Mamou Police Officer and Chief Robert McGee pleaded guilty to one count of the deprivation of rights under color of law, both for their roles in a series of incidents in which they deployed TASERs on non-resistant inmates at the Mamou Jail. Dupuis’s and McGee’s guilty pleas are the result of a federal investigation into the illegal use of excessive force upon inmates at the Mamou jail. Dupuis, 57, of Mamou, pleaded guilty to one count of violation of an individual’s civil rights on Apr. 13, 2015, and was sentenced today by U.S. District Judge Richard T. Haik of the Western District of Louisiana. According to evidence presented at Dupuis’s plea hearing, Dupuis served as police chief from 1994 to 1997 and from 2004 to 2014. During his tenure as chief, officers, including McGee, repeatedly administered TASER shocks as a form of punishment on inmates who were being disruptive, even if the inmates’ disruption was purely verbal, and on inmates who were calm and compliant when the officer deployed the TASER. On Apr. 25, 2010, Dupuis went to the department’s jail to deal with a verbally disruptive detainee. Dupuis ordered the detainee to get down from his bunk and put his hands on the far wall. The detainee complied. Dupuis then entered the cell and deployed the TASER on the detainee’s back, causing the detainee to fall to the ground, suffer pain and injure his knee. At his plea hearing, Dupuis admitted that he knew at the time that his actions were unlawful. McGee, 44, of Mamou, pleaded guilty today to one count of violation of an individual’s civil rights committed as an officer in 2010, prior to his 2014 election as chief of the Mamou Police Department. According to McGee’s guilty plea, McGee was called to the Mamou Police Department on multiple occasions in 2010 and 2011 to deal with disruptive inmates. 
On Aug. 6, 2010, McGee and an inmate were engaged in a conversation. Although the inmate posed no threat to himself or the officers, McGee fired the TASER at the inmate, causing the inmate to fall and experience pain. McGee, who was elected Mamou police chief after this incident, resigned his position as chief on Oct. 8, 2015, as a result of the federal investigation. McGee faces up to 10 years in prison, three years supervised release and a $250,000 fine. A sentencing date was not set. “The defendants abused the trust given to them as law enforcement officers when they engaged in a pattern of repeatedly tasing compliant detainees,” said Principal Deputy Assistant Attorney General Vanita Gupta, head of the the Civil Rights Division. “The Justice Department will vigorously prosecute those who violate the civil rights laws to ensure that the rights of all individuals, including those in custody, are protected.” “Law enforcement officers have a duty to ensure that detainees are treated fairly and humanely when taken into custody,” said U.S. Attorney Stephanie A. Finley of the Western District of Louisiana. “Mr. Dupuis and Mr. McGee breached that trust and violated their oaths by using excessive force on incarcerated individuals.” The FBI and the Louisiana State Police conducted the investigation. Trial Attorneys Stephen Curran and Sanjay Patel of the Civil Rights Division, and Assistant U.S. Attorneys Myers P. Namie and Robert Abendroth of the Western District of Louisiana are prosecuting the case.",Hate Crimes,Civil Rights
264,"District Judge J. Michael Seabright today sentenced former Honolulu Police Officer Vincent Morre, 38, to 30 months in prison for violating the civil rights of two Honolulu men. On May 19, 2015, Morre pleaded guilty to two counts of depriving the two men’s right to be free from the use of unreasonable force by a law enforcement officer on Sept. 5, 2014. According to information presented to the court, Morre, then a 10 year veteran of the Honolulu Police Department, was searching for a fugitive when he entered a game room on Hopaka Street. Once in the game room, Morre, in an unprovoked attack, kicked J.T. in the head. Morre then continued to search the game room for the fugitive. On his way out, Morre reapproached J.T. but first struck F.F. (who was seated next to J.T) in the face twice and then kicked his chest. Morre then continued the assault on J.T. by kicking J.T. off his chair. As Morre was leaving the game room, he threw a metal stool which hit J.T. in the head, requiring three stitches. Five days later, Morre filed a false police report omitting that he had assaulted J.T. or F.F. “When this defendant violated the trust of the people he was sworn to serve, the Department of Justice stood ready to enforce the law and protect the civil rights of all Americans,” said Principal Deputy Assistant Attorney General Vanita Gupta, head of the Civil Rights Division. “This case represents important steps in vindicating the civil rights of the victims of unreasonable use of force by a law enforcement officer,” said U.S. Attorney Florence T. Nakakuni of the District of Hawaii. “The FBI would like to thank the Honolulu Police Department for its cooperation, assistance, and transparency during this investigation,” said Special Agent in Charge Paul Delacourt of the FBI’s Honolulu Field Office. Assistant U.S. Attorney Darren W.K. Ching of the District of Hawaii and Trial Attorney Angie Cha of the Civil Rights Division prosecuted the case.",Hate Crimes,Civil Rights
270,"Leonard Dreyer, a captain with the DeKalb County, Georgia, Sheriff's Office, has been indicted by a federal grand jury in the Northern District of Georgia on charges of encouraging Dwight Hamilton, a former corrections officer, to use excessive force against an inmate at the DeKalb County Jail and for attempting to obstruct justice by persuading officers who witnessed the incident to write false reports. Dreyer was also charged with obstructing justice by making false statements to an FBI agent investigating the allegations. Dreyer, 50, of Decatur, Georgia, was arraigned today. He was indicted by a federal grand jury on Oct. 20, 2015. Hamilton, who was indicted earlier this year for using excessive force and writing false reports, has been charged in the same indictment with additional counts of excessive force and obstruction of justice. According to the indictment and other information presented in court: Dreyer began working for the DeKalb County Sheriff’s Office in 2004 and worked as a supervisor in the jail from 2010 to 2012. Hamilton worked in the jail from 2005 to 2012. In 2011 and 2012, Hamilton, who was supervised by Dreyer, tased inmates without justification, many of them multiple times, in five separate incidents during his time at the jail. The inmates suffered injuries as a result of the tasing. The superseding indictment charges that in all five instances, Hamilton used excessive force and thereby violated the inmates’ Constitutional rights. Following four of the five tasing incidents, Hamilton wrote a false report with the intent of impeding, obstructing and improperly influencing the investigation. In the first report, Hamilton falsely reported that the victim “made a step toward” Hamilton immediately before Hamilton tased him. In another report, Hamilton falsely wrote that the victim failed to comply with Hamilton’s commands before Hamilton tased him. 
Before one of the five incidents, Dreyer directed Hamilton to tase an inmate who was mouthing off to him. After Hamilton repeatedly tased the inmate without legal justification, Dreyer encouraged three witness officers to write false reports that were favorable to Hamilton and would justify the tasing. During the federal investigation of the incident, Dreyer also made false statements to an FBI agent. Members of the public are reminded that the indictment only contains charges. The defendant is presumed innocent of the charges and it will be the government’s burden to prove the defendant’s guilt beyond a reasonable doubt at trial. This case is being investigated by the FBI. This case is being prosecuted by Assistant U.S. Attorney Brent Alan Gray of the Northern District of Georgia and Trial Attorney Christopher Perras of the Civil Rights Division. Dreyer Superseding Indictment",Hate Crimes,Civil Rights
272,"A former prosecutor for the St. Louis Circuit Attorney’s Office pleaded guilty in federal court today to concealing her knowledge of St. Louis Metropolitan Police Department (SLMPD) officers’ assault upon an arrestee, announced U.S. Attorney Tammy Dickinson of the Western District of Missouri and Principal Deputy Assistant Attorney General Vanita Gupta, head of the Justice Department’s Civil Rights Division. Bliss Barber Worrell, 28, of Clayton, Missouri, pleaded guilty before U.S. District Judge Henry E. Autrey of the Eastern District of Missouri to misprision of a felony. Worrell admitted that she failed to notify authorities that on July 22, 2014, police officers assaulted an arrestee in their custody, and that she took an affirmative step to conceal the felony. Worrell also admitted that she filed charges without disclosing knowledge of the assault to her colleagues, supervisors or the judge assigned to setting a bond. She admitted that she allowed the charges to stand despite later learning that the facts that made out the charge of attempted escape were fabricated to cover for injuries the arrestee sustained during the assault. “Prosecutors are trusted to exercise discretion in enforcing the law and are charged above all with doing justice in a fair and impartial manner,” said Principal Deputy Assistant Attorney General Gupta. “In this instance, the defendant ran afoul of her obligation to uphold the Constitution, and must therefore be held to answer for her actions.” “An officer of the court allowed her friendship with a police officer to eclipse her public obligation to uphold justice,” said U.S. Attorney Dickinson. “This remains an ongoing investigation that extends farther than this defendant’s role in covering up an egregious civil rights violation.” Worrell served as an assistant circuit attorney (ACA) in the St. Louis Circuit Attorney’s Office Misdemeanor Division from August 2013 through July 2014. 
In that capacity, she prosecuted criminal violations of Missouri state statutes on behalf of the state of Missouri. One of her duties was to make determinations as to whether there was probable cause that an individual committed a crime, based on evidence provided to her by law enforcement and/or civilian witnesses. According to today’s plea agreement, Worrell developed a close friendship with a veteran officer of the SLMPD, who is not identified in court documents. On July 22, 2014, the officer informed Worrell that an individual, identified as M.W., was arrested at Ballpark Village by another officer for possessing the veteran officer’s daughter’s credit card. On July 23, 2014, the veteran officer provided Worrell with additional details and told Worrell that he had thrown M.W. against a wall, beat him up, thrown a chair at him and “shoved [his] pistol down the guy’s throat.” After the conversation, Worrell met the arresting officer who confirmed that M.W. was found with stolen credit cards, and that M.W. had resisted arrest and attempted to flee. Working with a new ACA, Worrell issued charges against M.W. herself, including charges for resisting and attempting to escape, despite knowing that she should wait for an ACA without personal knowledge of the case to become available. Worrell concealed her knowledge that M.W. had been assaulted at the police station. After issuing the charges, Worrell had another conversation with the veteran officer and learned that the attempted escape charge was fabricated. Worrell concealed this information from her supervisors, allowing the charge to stand. This case is being investigated by the FBI’s St. Louis Division. The case is being prosecuted by First Assistant U.S. Attorney David M. Ketchmark of the Western District of Missouri, who has been appointed as Special Attorney to the U.S. Attorney General, and Trial Attorney Fara Gold of the Civil Rights Division. The U.S. 
Attorney’s Office of the Western District of Missouri is prosecuting this case with the Civil Rights Division due to the recusal of the U.S. Attorney’s Office of the Eastern District of Missouri. Worrell Plea Agreement",Hate Crimes,Civil Rights


# 3. Extend the analysis from unigrams to bigrams (10 points)

In the previous question, you found top words via a unigram representation of the text. Now, we want to see how those top words change with bigrams (pairs of consecutive words).

A. Using the `doj_subset_wscore` data and the `processed_text` column (so the words after stemming/other preprocessing), create a column in the data called `processed_text_bigrams` that combines each pair of consecutive words into a bigram joined by an underscore. E.g.:

"depart reach settlem" would become "depart_reach reach_settlem"

Do this by writing a function `create_bigram_onedoc` that takes in a single `processed_text` string and returns a string with its bigrams structured like the example above.
 
**Hint**: there are many ways to solve this, but `zip` may be helpful: https://stackoverflow.com/questions/21303224/iterate-over-all-pairs-of-consecutive-items-in-a-list

B. Print the `id`, `processed_text`, and `processed_text_bigrams` columns for the press release with id = 16-217

In [32]:
## your code here 
def create_bigram_onedoc(text):
    """Join each pair of consecutive words in `text` with an underscore."""
    words = text.split()
    # zip(words, words[1:]) pairs each word with the word that follows it
    bigrams = ['_'.join(pair) for pair in zip(words, words[1:])]
    return ' '.join(bigrams)

doj_subset_wscore['processed_text_bigrams'] = doj_subset_wscore['processed_text'].apply(create_bigram_onedoc)

# print id, processed_text, and processed_text_bigrams for id = 16-217
doj_subset_wscore.loc[doj_subset_wscore['id'] == '16-217', ['id', 'processed_text', 'processed_text_bigrams']]

Unnamed: 0,id,processed_text,processed_text_bigrams
6727,16-217,reach comprehens settlement agreement citi miami miami polic resolv shoot offic announc princip deputi general vanita gupta head wifredo ferrer southern florida settlement approv miami citi commiss today effect agreement sign parti resolv claim stem shoot offic conduct violent crime control enforc find issu juli identifi pattern practic excess forc shoot violat fourth amend constitut citi complianc settlement monitor independ review former tampa florida polic chief jane castor settlement agreement citi implement comprehens reform ensur constitut polic support public trust settlement agreement design minim shoot effect quick investig shoot occur measur includ settlement repres renew commit citi miami chief rodolfo llane provid constitut polic miami resid protect public safeti sustain reform said princip deputi general gupta agreement help strengthen relationship communiti serv improv account offic fire weapon unlaw provid communiti particip enforc today agreement result joint effort citi miami ensur miami polic continu effort make communiti safe protect sacr constitut citizen said ferrer oversight communic agreement seek make perman posit chang former chief orosa chief llane made applaud citi commiss settlement agreement build upon import reform implement citi sinc issu find includ conduct attorney staff special litig section southern florida,reach_comprehens comprehens_settlement settlement_agreement agreement_citi citi_miami miami_miami miami_polic polic_resolv resolv_shoot shoot_offic offic_announc announc_princip princip_deputi deputi_general general_vanita vanita_gupta gupta_head head_wifredo wifredo_ferrer ferrer_southern southern_florida florida_settlement settlement_approv approv_miami miami_citi citi_commiss commiss_today today_effect effect_agreement agreement_sign sign_parti parti_resolv resolv_claim claim_stem stem_shoot shoot_offic offic_conduct conduct_violent violent_crime crime_control control_enforc enforc_find find_issu issu_juli 
juli_identifi identifi_pattern pattern_practic practic_excess excess_forc forc_shoot shoot_violat violat_fourth fourth_amend amend_constitut constitut_citi citi_complianc complianc_settlement settlement_monitor monitor_independ independ_review review_former former_tampa tampa_florida florida_polic polic_chief chief_jane jane_castor castor_settlement settlement_agreement agreement_citi citi_implement implement_comprehens comprehens_reform reform_ensur ensur_constitut constitut_polic polic_support support_public public_trust trust_settlement settlement_agreement agreement_design design_minim minim_shoot shoot_effect effect_quick quick_investig investig_shoot shoot_occur occur_measur measur_includ includ_settlement settlement_repres repres_renew renew_commit commit_citi citi_miami miami_chief chief_rodolfo rodolfo_llane llane_provid provid_constitut constitut_polic polic_miami miami_resid resid_protect protect_public public_safeti safeti_sustain sustain_reform reform_said said_princip princip_deputi deputi_general general_gupta gupta_agreement agreement_help help_strengthen strengthen_relationship relationship_communiti communiti_serv serv_improv improv_account account_offic offic_fire fire_weapon weapon_unlaw unlaw_provid provid_communiti communiti_particip particip_enforc enforc_today today_agreement agreement_result result_joint joint_effort effort_citi citi_miami miami_ensur ensur_miami miami_polic polic_continu continu_effort effort_make make_communiti communiti_safe safe_protect protect_sacr sacr_constitut constitut_citizen citizen_said said_ferrer ferrer_oversight oversight_communic communic_agreement agreement_seek seek_make make_perman perman_posit posit_chang chang_former former_chief chief_orosa orosa_chief chief_llane llane_made made_applaud applaud_citi citi_commiss commiss_settlement settlement_agreement agreement_build build_upon upon_import import_reform reform_implement implement_citi citi_sinc sinc_issu issu_find find_includ includ_conduct 
conduct_attorney attorney_staff staff_special special_litig litig_section section_southern southern_florida


C. Use the create_dtm function and the `processed_text_bigrams` column to create a document-term matrix (`dtm_bigram`) with these bigrams. Keep the following three columns in the data: `id`, `topics_clean`, and `compound` 

D. Print the (1) dimensions of the `dtm` matrix from question 2.2  and (2) the dimensions of the `dtm_bigram` matrix. Comment on why the bigram matrix has more dimensions than the unigram matrix 

E. Find and print the 10 most prevalent bigrams for each of the three `topics_clean` categories using the `get_topwords` function from 2.2
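For part E, the `get_topwords` function from 2.2 is not reproduced in this section, so the following is only a minimal sketch of the underlying idea: sum each term column within a topic and keep the largest totals. The helper `top_terms_by_topic`, the toy matrix, and its counts are all hypothetical and made up for illustration; they are not the assignment's own function or data.

```python
import pandas as pd

# Toy document-term matrix: metadata columns followed by bigram count columns.
# All values are invented purely for illustration.
toy_dtm = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "topics_clean": ["Hate Crimes", "Hate Crimes", "Civil Rights", "Civil Rights"],
    "compound": [-0.9, -0.8, 0.5, 0.7],
    "hate_crime": [3, 2, 0, 1],
    "civil_right": [1, 1, 4, 3],
    "settlement_agreement": [0, 0, 2, 6],
})

def top_terms_by_topic(dtm, topic_col, meta_cols, n=10):
    """Hypothetical helper: return {topic: Series of the n largest summed term counts}."""
    # Every column that is not metadata is treated as a term column
    term_cols = [c for c in dtm.columns if c not in meta_cols]
    # Sum term counts across all documents that share a topic
    totals = dtm.groupby(topic_col)[term_cols].sum()
    return {topic: row.nlargest(n) for topic, row in totals.iterrows()}

top = top_terms_by_topic(toy_dtm, "topics_clean", ["id", "topics_clean", "compound"], n=2)
for topic, counts in top.items():
    print(topic)
    print(counts)
```

If something like this were applied to the real `dtm_bigram`, note that `create_dtm` calls `metadata.reset_index()`, so the matrix would also carry an `index` column that would need to be listed among the metadata columns.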

In [33]:
# create_dtm function from Part 2
# def create_dtm(list_of_strings, metadata):
#     vectorizer = CountVectorizer(lowercase = True)
#     dtm_sparse = vectorizer.fit_transform(list_of_strings)
#     dtm_dense_named = pd.DataFrame(dtm_sparse.todense(), 
#         columns=vectorizer.get_feature_names())
#     dtm_dense_named_withid = pd.concat([metadata.reset_index(), dtm_dense_named], axis = 1)
#     return(dtm_dense_named_withid)

# Part C
metadata = doj_subset_wscore[['id', 'topics_clean', 'compound']]
metadata
dtm_bigram = create_dtm(doj_subset_wscore['processed_text_bigrams'], metadata)

# Part D
dtm.shape
dtm_bigram.shape
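# Both matrices have the same 717 rows (one per press release); they differ only in columns.
# The bigram matrix has far more columns (71,334 vs. 6,760) because every distinct pair of
# adjacent words gets its own column. The same word forms a different bigram with each
# different neighbor, so there are many more unique bigrams than unique words.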

Unnamed: 0,id,topics_clean,compound
329,14-248,Hate Crimes,-0.9957
11593,16-718,Civil Rights,-0.9964
572,13-312,Hate Crimes,-0.9977
501,11-626,Hate Crimes,-0.9985
11903,18-70,Civil Rights,-0.9908
...,...,...,...
11065,16-294,Civil Rights,0.8519
6787,17-132,Civil Rights,0.9758
1857,17-271,Civil Rights,0.9899
10286,15-1559,Civil Rights,0.8481


(717, 6760)

(717, 71334)
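The jump in column count can be reproduced on a tiny scale. The sketch below uses three made-up documents that contain exactly the same five stemmed words in different orders; the toy strings are assumptions for illustration only, while `create_bigram_onedoc` mirrors the function defined in part A.

```python
from sklearn.feature_extraction.text import CountVectorizer

def create_bigram_onedoc(text):
    """Join each pair of consecutive words with an underscore (as in part A)."""
    words = text.split()
    return " ".join("_".join(pair) for pair in zip(words, words[1:]))

# Made-up documents: the same five words, reordered
toy_docs = [
    "polic offic use excess forc",
    "offic use forc polic excess",
    "excess forc polic offic use",
]
toy_bigrams = [create_bigram_onedoc(d) for d in toy_docs]

# The fitted vocabulary size equals the number of term columns in the DTM
n_unigrams = len(CountVectorizer().fit(toy_docs).vocabulary_)
n_bigrams = len(CountVectorizer().fit(toy_bigrams).vocabulary_)
print(n_unigrams, n_bigrams)
```

Reordering the words adds no new unigram columns, but each new adjacency adds a new bigram column, so the bigram vocabulary grows faster than the unigram vocabulary. The underscore keeps each bigram as a single token because `CountVectorizer`'s default token pattern treats `_` as a word character.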

# 4. Optional extra credit (2 points)

You notice that the pharmaceutical kickbacks press release we analyzed in question 1 was for an indictment, and that in the original data, there's not a clear label for whether a press release outlines an indictment (charging someone with a crime), a conviction (convicting them after that charge either via a settlement or trial), or a sentencing (how many years of prison or supervised release a defendant is sentenced to after their conviction).

You want to see if you can identify pairs of press releases where one press release is from one stage (e.g., indictment) and another is from a different stage (e.g., a sentencing).

You decide that one way to approach this is to find the pairwise string similarity between each of the processed press releases in `doj_subset`. There are many ways to do this, so Google for some approaches, focusing on ones that work well for entire documents rather than small strings.

Find the top two pairs (so four press releases total). Do they seem like different stages of the same crime or just press releases covering similar crimes?

In [34]:
import textdistance as td

text1_list = []
text2_list = []
scores_list = []

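# Score every ordered pair of distinct texts with Jaro-Winkler similarity.
# Each unordered pair is scored twice, as (a, b) and (b, a), so every match
# appears in two rows of scores_df.
for text1 in doj_subset_wscore.processed_text:
    for text2 in doj_subset_wscore.processed_text:
        if text1 != text2:
            score = td.jaro_winkler.normalized_similarity(text1, text2)
            text1_list.append(text1)
            text2_list.append(text2)
            scores_list.append(score)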
            
scores_df = pd.DataFrame({"text1": text1_list, "text2": text2_list, "scores": scores_list})



In [35]:
final = scores_df.sort_values(by="scores", ascending=False).iloc[:4]
final

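# Each pair is scored in both orders, so the four rows below are two unique pairs, each listed twice.
# Both pairs appear to be different stages of the same case rather than merely similar crimes:
# the first is the guilty plea and the later sentencing of the same defendant (Moor), and the second
# is the jury conviction and the later sentencing of the same defendant (Furman).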

Unnamed: 0,text1,text2,scores
398630,church hill maryland resid plead guilti today feder court count entic minor engag sexual activ count attempt transfer obscen materi minor announc act general kenneth blanco crimin wifredo ferrer southern florida robert moor plead guilti today judg daniel hurley southern florida moor employ secret assign white hous time arrest remain custodi sinc time moor sinc termin secret servic posit accord admiss made connect plea moor maintain profil social media applic provid platform exchang digit imag well voic text messag delawar state polic detect delawar child predat task forc creat profil site pose girl moor engag number onlin chat session mobil app period includ moor work number onlin chat moor undercov offic pose femal minor sexual natur sever occas moor sent pictur includ sexual explicit imag accord plea document arrest enforc discov moor communic minor florida moor admit communic sent sexual explicit imag entic minor send sexual explicit photo well moor engag type behavior girl texa anoth girl missouri moor request feder charg delawar transfer southern florida could plead guilti charg time immigr custom enforc homeland secur investig delawar child predat task forc investig austin berri crimin child exploit obscen section ceo corey steinberg southern florida prosecut brought part project safe childhood nationwid initi combat grow epidem child sexual exploit abus launch attorney offic ceo project safe childhood marshal feder state local resourc better locat apprehend prosecut individu exploit children internet well identifi rescu victim inform project safe childhood pleas visit http,church hill maryland resid sentenc today year prison follow lifetim term supervis releas entic minor engag sexual activ attempt transfer obscen materi minor announc act general kenneth blanco crimin act benjamin greenberg southern florida robert moor plead guilti march judg daniel hurley southern florida moor employ secret assign white hous time arrest remain custodi sinc time moor 
sinc termin secret servic posit accord admiss made connect plea moor maintain profil social media applic provid platform exchang digit imag well voic text messag delawar state polic detect delawar child predat task forc creat profil site pose girl moor engag number onlin chat session mobil app period includ moor work number onlin chat moor undercov offic pose femal minor sexual natur sever occas moor sent pictur includ sexual explicit imag accord plea document arrest enforc discov moor communic minor florida moor admit communic sent sexual explicit imag entic minor send sexual explicit photo well moor engag type behavior girl texa anoth girl missouri moor request feder charg delawar transfer southern florida could plead guilti charg time immigr custom enforc homeland secur investig delawar child predat task forc investig austin berri crimin child exploit obscen section ceo corey steinberg southern florida prosecut delawar brought part project safe childhood nationwid initi combat grow epidem child sexual exploit abus launch attorney offic ceo project safe childhood marshal feder state local resourc better locat apprehend prosecut individu exploit children internet well identifi rescu victim inform project safe childhood pleas visit http,0.900475
382899,church hill maryland resid sentenc today year prison follow lifetim term supervis releas entic minor engag sexual activ attempt transfer obscen materi minor announc act general kenneth blanco crimin act benjamin greenberg southern florida robert moor plead guilti march judg daniel hurley southern florida moor employ secret assign white hous time arrest remain custodi sinc time moor sinc termin secret servic posit accord admiss made connect plea moor maintain profil social media applic provid platform exchang digit imag well voic text messag delawar state polic detect delawar child predat task forc creat profil site pose girl moor engag number onlin chat session mobil app period includ moor work number onlin chat moor undercov offic pose femal minor sexual natur sever occas moor sent pictur includ sexual explicit imag accord plea document arrest enforc discov moor communic minor florida moor admit communic sent sexual explicit imag entic minor send sexual explicit photo well moor engag type behavior girl texa anoth girl missouri moor request feder charg delawar transfer southern florida could plead guilti charg time immigr custom enforc homeland secur investig delawar child predat task forc investig austin berri crimin child exploit obscen section ceo corey steinberg southern florida prosecut delawar brought part project safe childhood nationwid initi combat grow epidem child sexual exploit abus launch attorney offic ceo project safe childhood marshal feder state local resourc better locat apprehend prosecut individu exploit children internet well identifi rescu victim inform project safe childhood pleas visit http,church hill maryland resid plead guilti today feder court count entic minor engag sexual activ count attempt transfer obscen materi minor announc act general kenneth blanco crimin wifredo ferrer southern florida robert moor plead guilti today judg daniel hurley southern florida moor employ secret assign white hous time arrest remain custodi sinc 
time moor sinc termin secret servic posit accord admiss made connect plea moor maintain profil social media applic provid platform exchang digit imag well voic text messag delawar state polic detect delawar child predat task forc creat profil site pose girl moor engag number onlin chat session mobil app period includ moor work number onlin chat moor undercov offic pose femal minor sexual natur sever occas moor sent pictur includ sexual explicit imag accord plea document arrest enforc discov moor communic minor florida moor admit communic sent sexual explicit imag entic minor send sexual explicit photo well moor engag type behavior girl texa anoth girl missouri moor request feder charg delawar transfer southern florida could plead guilti charg time immigr custom enforc homeland secur investig delawar child predat task forc investig austin berri crimin child exploit obscen section ceo corey steinberg southern florida prosecut brought part project safe childhood nationwid initi combat grow epidem child sexual exploit abus launch attorney offic ceo project safe childhood marshal feder state local resourc better locat apprehend prosecut individu exploit children internet well identifi rescu victim inform project safe childhood pleas visit http,0.900475
330557,minnesota prior convict sexual abus children sentenc today serv life plus year prison product distribut receipt possess child pornographi well commit child offens requir regist offend announc general lesli caldwel crimin andrew luger minnesota furman feder minnesota sentenc judg david doti minnesota also order furman restitut amount victim furman convict follow juri accord evid present separ investig minnesota bureau crimin apprehens minneapoli polic enforc offic obtain child pornographi video internet protocol address link furman home evid show search resid execut cass counti sheriff furman admit download child pornographi accord evid also inform special agent produc imag depict child exploit involv girl care younger year time abus evid show subsequ forens analysi furman comput digit media confirm produc pornograph photograph video children accord evid investig also found furman possess hundr imag video depict children engag act adult furman prior minnesota state court convict engag act minor decemb furman plead guilti sexual abus girl care januari furman convict bench sexual abus girl care result requir regist offend minnesota lead minnesota internet crime children task forc minneapoli polic member child exploit task forc investig melinda william minnesota deputi chief alexandra gelber crimin child exploit obscen section ceo prosecut brought part project safe childhood nationwid initi combat grow epidem child sexual exploit abus launch attorney offic ceo project safe childhood marshal feder state local resourc better locat apprehend prosecut individu exploit children internet well identifi rescu victim inform project safe childhood pleas visit,minnesota prior convict sexual abus children convict today child pornographi charg general lesli caldwel crimin andrew luger minnesota made announc furman feder minnesota found guilti follow produc receiv distribut possess child pornographi commit feloni offens requir regist offend senior judg david doti minnesota 
presid sentenc later date prior convict furman subject mandatori life sentenc plus year prison accord evid present fall winter separ investig minnesota bureau crimin apprehens minneapoli polic enforc offic obtain child pornographi video internet protocol address link furman home evid show thereaft search resid execut cass counti sheriff furman acknowledg download child pornographi accord evid also inform special agent produc imag depict genitalia girl care turn four year five year time evid show subsequ forens analysi furman comput digit media reveal child pornographi produc august septemb includ sexual explicit photo video accord evid investig also found hundr imag video depict children engag act adult furman prior minnesota state court convict engag act minor decemb furman plead guilti sexual abus girl care januari furman convict bench sexual abus girl care result requir regist offend investig minnesota lead minnesota internet crime children icac task forc minneapoli polic member child exploit task forc prosecut melinda william minnesota deputi chief alexandra gelber crimin child exploit obscen section ceo brought part project safe childhood nationwid initi combat grow epidem child exploit abus launch offic ceo project safe childhood marshal feder state local resourc better locat apprehend prosecut individu exploit children internet well identifi rescu victim inform project safe childhood pleas visit,0.899851
345573,minnesota prior convict sexual abus children convict today child pornographi charg general lesli caldwel crimin andrew luger minnesota made announc furman feder minnesota found guilti follow produc receiv distribut possess child pornographi commit feloni offens requir regist offend senior judg david doti minnesota presid sentenc later date prior convict furman subject mandatori life sentenc plus year prison accord evid present fall winter separ investig minnesota bureau crimin apprehens minneapoli polic enforc offic obtain child pornographi video internet protocol address link furman home evid show thereaft search resid execut cass counti sheriff furman acknowledg download child pornographi accord evid also inform special agent produc imag depict genitalia girl care turn four year five year time evid show subsequ forens analysi furman comput digit media reveal child pornographi produc august septemb includ sexual explicit photo video accord evid investig also found hundr imag video depict children engag act adult furman prior minnesota state court convict engag act minor decemb furman plead guilti sexual abus girl care januari furman convict bench sexual abus girl care result requir regist offend investig minnesota lead minnesota internet crime children icac task forc minneapoli polic member child exploit task forc prosecut melinda william minnesota deputi chief alexandra gelber crimin child exploit obscen section ceo brought part project safe childhood nationwid initi combat grow epidem child exploit abus launch offic ceo project safe childhood marshal feder state local resourc better locat apprehend prosecut individu exploit children internet well identifi rescu victim inform project safe childhood pleas visit,minnesota prior convict sexual abus children sentenc today serv life plus year prison product distribut receipt possess child pornographi well commit child offens requir regist offend announc general lesli caldwel crimin andrew luger minnesota furman 
feder minnesota sentenc judg david doti minnesota also order furman restitut amount victim furman convict follow juri accord evid present separ investig minnesota bureau crimin apprehens minneapoli polic enforc offic obtain child pornographi video internet protocol address link furman home evid show search resid execut cass counti sheriff furman admit download child pornographi accord evid also inform special agent produc imag depict child exploit involv girl care younger year time abus evid show subsequ forens analysi furman comput digit media confirm produc pornograph photograph video children accord evid investig also found furman possess hundr imag video depict children engag act adult furman prior minnesota state court convict engag act minor decemb furman plead guilti sexual abus girl care januari furman convict bench sexual abus girl care result requir regist offend minnesota lead minnesota internet crime children task forc minneapoli polic member child exploit task forc investig melinda william minnesota deputi chief alexandra gelber crimin child exploit obscen section ceo prosecut brought part project safe childhood nationwid initi combat grow epidem child sexual exploit abus launch attorney offic ceo project safe childhood marshal feder state local resourc better locat apprehend prosecut individu exploit children internet well identifi rescu victim inform project safe childhood pleas visit,0.899851
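Jaro-Winkler compares strings character by character and is usually applied to short strings such as names. One common document-level alternative is cosine similarity between TF-IDF vectors, sketched below. This is a swapped-in technique, not the method used in the cell above, and the four toy texts are invented for illustration. Taking only the upper triangle of the similarity matrix means each unordered pair is scored once, so the top two rows are two distinct pairs.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up stand-ins for processed press releases
toy_texts = pd.Series([
    "resid plead guilti today entic minor",
    "resid sentenc today prison entic minor",
    "citi reach settlement agreement polic",
    "bank fraud scheme indict today",
])

# One TF-IDF row per document, then all pairwise cosine similarities at once
tfidf = TfidfVectorizer().fit_transform(toy_texts)
sim = cosine_similarity(tfidf)

# Keep only the upper triangle (i < j) so each unordered pair appears once
rows, cols = np.triu_indices(len(toy_texts), k=1)
pairs = pd.DataFrame({"doc1": rows, "doc2": cols, "score": sim[rows, cols]})

top_pairs = pairs.sort_values("score", ascending=False).head(2)
print(top_pairs)
```

Here `doc1` and `doc2` are positions in `toy_texts`; on real data they could be mapped back to the `id` column to inspect the matched press releases.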
