This notebook performs data cleaning for the scraped DataFrame.

In [145]:
import pandas as pd

In [146]:
df = pd.read_csv("scraped_police_reports14.csv") # dataframe of all scraped articles
df.head()

Unnamed: 0,name,department,url,text
0,"Andrew Allen, badge #37",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,Home Legislative File 2021-01132 RCA Legal s...
1,"Elijah Allen, badge #None",Department:Minnesota State Patrol,No articles found,No articles found
2,"Michael Anderson, badge #None",Department:Martin County Sheriffs Office,No articles found,No articles found
3,"Oluwademilade Adediran, badge #21",Department:Minneapolis Police Department,No articles found,No articles found
4,"Wade Anderson, badge #None",Department:Winona Police Department,No articles found,No articles found


In [147]:
df.shape # There are 15,527 articles (or links without articles)

(15527, 4)

In [148]:
df.isna().sum() # There are 205 missing articles

name            0
department      0
url             0
text          205
dtype: int64

In [149]:
null_text_rows = df.loc[df["text"].isna()] # Find the rows where there is no article text
null_text_rows.head()

Unnamed: 0,name,department,url,text
31,"Roxanne Affeldt, badge #None",Department:Champlin Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,
645,"Todd Babekuhl, badge #246",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,
657,"Todd Babekuhl, badge #246",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,
719,"Stephanie L. Bailey, badge #29300",Department:St. Paul Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,
726,"Daniel Anderson, badge #94",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,


We will drop the rows where `text` is `NaN`, because that means there is no article for that row.

In [150]:
df = df.dropna(subset=["text"]) # Drop only the rows where `text` is NaN (the other columns have no missing values)
df.shape

(15322, 4)

In [151]:
df["url"].nunique() # There are 1,855 distinct values in `url` across more than 15,000 rows, so many rows share the same value. The placeholder "No articles found" counts as one of these values.

1855
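To see which values account for the repeated URLs, one option is `value_counts()`. The sketch below runs on a small made-up frame (the `example.org` URLs are hypothetical), not on the scraped data, so the counts are only illustrative.

```python
import pandas as pd

# Toy stand-in for the scraped frame; these URLs are made up for illustration.
toy = pd.DataFrame({
    "url": [
        "No articles found",
        "https://example.org/a.pdf",
        "No articles found",
        "https://example.org/a.pdf",
        "https://example.org/b.pdf",
        "No articles found",
    ]
})

# Count how often each distinct value of `url` occurs, most frequent first.
url_counts = toy["url"].value_counts()
print(url_counts)

# Number of distinct values, with the placeholder string included.
n_unique = toy["url"].nunique()
print(n_unique)
```

On the real data, the same call would show whether the placeholder string or a handful of real URLs dominate the repeats.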

Below, we find the rows where the URL or the text, respectively, is equal to `"No articles found"`.

In [152]:
no_article_rows = df.loc[df["url"] == "No articles found"]
no_article_rows

Unnamed: 0,name,department,url,text
1,"Elijah Allen, badge #None",Department:Minnesota State Patrol,No articles found,No articles found
2,"Michael Anderson, badge #None",Department:Martin County Sheriffs Office,No articles found,No articles found
3,"Oluwademilade Adediran, badge #21",Department:Minneapolis Police Department,No articles found,No articles found
4,"Wade Anderson, badge #None",Department:Winona Police Department,No articles found,No articles found
6,"Fernando F. Abla-Reyes, badge #801160",Department:St. Paul Police Department,No articles found,No articles found
...,...,...,...,...
15483,"John Zupancic, badge #None",Department:Gilbert Police Department,No articles found,No articles found
15484,"Brian Zwach, badge #None",Department:Oak Park Heights Police Department,No articles found,No articles found
15485,"Kenneth Zwak, badge #None",Department:Duluth Police Department,No articles found,No articles found
15486,"Christian Zweerink, badge #None",Department:Park Rapids Police Department,No articles found,No articles found


In [153]:
no_text_rows = df.loc[df["text"] == "No articles found"]
no_text_rows

Unnamed: 0,name,department,url,text
1,"Elijah Allen, badge #None",Department:Minnesota State Patrol,No articles found,No articles found
2,"Michael Anderson, badge #None",Department:Martin County Sheriffs Office,No articles found,No articles found
3,"Oluwademilade Adediran, badge #21",Department:Minneapolis Police Department,No articles found,No articles found
4,"Wade Anderson, badge #None",Department:Winona Police Department,No articles found,No articles found
6,"Fernando F. Abla-Reyes, badge #801160",Department:St. Paul Police Department,No articles found,No articles found
...,...,...,...,...
15483,"John Zupancic, badge #None",Department:Gilbert Police Department,No articles found,No articles found
15484,"Brian Zwach, badge #None",Department:Oak Park Heights Police Department,No articles found,No articles found
15485,"Kenneth Zwak, badge #None",Department:Duluth Police Department,No articles found,No articles found
15486,"Christian Zweerink, badge #None",Department:Park Rapids Police Department,No articles found,No articles found


In [154]:
no_text_rows.equals(no_article_rows) # Check if the rows without text and the rows without articles are the same

True

There are 12,418 rows with no article text, out of the 15,322 rows that remain after dropping the `NaN` rows. We will drop these rows as well.
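A count like this can be read off a boolean mask. The following is a minimal sketch on made-up rows; the constant `PLACEHOLDER` and the `example.org` URL are illustrative names, not part of the original notebook.

```python
import pandas as pd

PLACEHOLDER = "No articles found"  # sentinel string seen in the scraped data

# Toy stand-in for the scraped frame (hypothetical rows).
toy = pd.DataFrame({
    "url": [PLACEHOLDER, "https://example.org/a.pdf", PLACEHOLDER],
    "text": [PLACEHOLDER, "Some article text", PLACEHOLDER],
})

# True for rows where either column holds the placeholder.
is_placeholder = (toy["url"] == PLACEHOLDER) | (toy["text"] == PLACEHOLDER)

# Summing a boolean Series counts the True values.
n_placeholder = int(is_placeholder.sum())

# Negating the mask keeps rows where neither column is the placeholder.
kept = toy.loc[~is_placeholder]
print(n_placeholder, len(kept))
```

Negating the combined mask is equivalent to requiring that both columns differ from the placeholder, which is the filter applied in the next cell.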

In [155]:
df = df.loc[(df["url"] != "No articles found") & (df["text"] != "No articles found")]
df

Unnamed: 0,name,department,url,text
0,"Andrew Allen, badge #37",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,Home Legislative File 2021-01132 RCA Legal s...
5,"Guled Abdullahi, badge #706",Department:Hennepin County Sheriff's Department,https://assets.nationbuilder.com/cuapb/pages/1...,"Hennepin County 300 South Sixth Street, Minne..."
11,"Dean V. Albers, badge #None",Department:Goodhue County Sheriff's Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,"2/22/2021 Jenson v. Craft, Civil No. 01-1488(D..."
22,"Scott Aikins, badge #22",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,"2/22/2021 United States v. Diriye, Case No. 14..."
24,"Matthew Aish, badge #None",Department:Columbia Heights Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,Too_Long1 Arbitration LELS (Mathew Aish)/...
...,...,...,...,...
15520,"David Voss, badge #7441",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,MINNEAPOLIS POLICE DEPARTMENT Internal Affair...
15522,"Roderick Weber, badge #7612",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,Too_Long
15524,"Roderick Weber, badge #7612",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,No text found
15525,"Roderick Weber, badge #7612",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,Too_Long


Next, we investigate the rows where `text` is exactly equal to `Too_Long` or `No text found`.

In [156]:
df.loc[df["text"] == "Too_Long"]

Unnamed: 0,name,department,url,text
928,"Tyrone Barze, badge #330",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,Too_Long
929,"Tyrone Barze, badge #330",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,Too_Long
1060,"Tyrone Barze, badge #330",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,Too_Long
1135,"Tyrone Barze, badge #330",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,Too_Long
1346,"Mukhtar Abdulkadir, badge #47",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,Too_Long
...,...,...,...,...
15271,"Roderick Weber, badge #7612",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,Too_Long
15276,"Roderick Weber, badge #7612",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,Too_Long
15516,"Roderick Weber, badge #7612",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,Too_Long
15522,"Roderick Weber, badge #7612",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,Too_Long


In [157]:
df.loc[df["text"] == "No text found"]

Unnamed: 0,name,department,url,text
25,"Charles Adams, III, badge #59",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,No text found
98,"Dean V. Albers, badge #None",Department:Goodhue County Sheriff's Department,https://www.dropbox.com/s/skcxonxtqyo7vi2/POST...,No text found
130,"Mukhtar Abdulkadir, badge #47",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,No text found
161,"Lance Ahlrich, badge #18",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,No text found
192,"Daniel Anderson, badge #94",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,No text found
...,...,...,...,...
15490,"Rhonda Ziegelmann, badge #None",Department:Hennepin County Sheriff's Department,https://assets.nationbuilder.com/cuapb/pages/1...,No text found
15496,"Rhonda Ziegelmann, badge #None",Department:Hennepin County Sheriff's Department,https://assets.nationbuilder.com/cuapb/pages/1...,No text found
15507,"Roderick Weber, badge #7612",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,No text found
15513,"Geoffrey Toscano, badge #7257",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,No text found


There are 400 rows where the text is too long and 96 rows with no text found. We will drop these rows. That results in a DataFrame with 2,408 rows.
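Both counts can be obtained in one pass by filtering on the two sentinel strings and tallying them. This is a sketch on made-up rows; the name `SENTINELS` is an assumption introduced here for illustration.

```python
import pandas as pd

SENTINELS = ["No text found", "Too_Long"]  # exact-match marker strings

# Toy stand-in for the `text` column (hypothetical rows).
toy = pd.DataFrame({
    "text": [
        "Too_Long",
        "No text found",
        "A regular article",
        "Too_Long",
        "Too_LongUNITED STATES DISTRICT COURT",  # prefix only, not an exact match
    ]
})

# isin() matches whole cell values, so the prefixed row is not counted here.
sentinel_counts = toy.loc[toy["text"].isin(SENTINELS), "text"].value_counts()
print(sentinel_counts)
```

Because `isin` compares whole values, rows that merely start with `Too_Long` and then continue with article text survive this filter.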

In [158]:
df = df.loc[~df["text"].isin(["No text found", "Too_Long"])] # Rows where the text is not equal to "No text found" AND not equal to "Too_Long".
df = df.reset_index(drop=True) # reset index to a sequential integer index
df

Unnamed: 0,name,department,url,text
0,"Andrew Allen, badge #37",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,Home Legislative File 2021-01132 RCA Legal s...
1,"Guled Abdullahi, badge #706",Department:Hennepin County Sheriff's Department,https://assets.nationbuilder.com/cuapb/pages/1...,"Hennepin County 300 South Sixth Street, Minne..."
2,"Dean V. Albers, badge #None",Department:Goodhue County Sheriff's Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,"2/22/2021 Jenson v. Craft, Civil No. 01-1488(D..."
3,"Scott Aikins, badge #22",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,"2/22/2021 United States v. Diriye, Case No. 14..."
4,"Matthew Aish, badge #None",Department:Columbia Heights Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,Too_Long1 Arbitration LELS (Mathew Aish)/...
...,...,...,...,...
2403,"Richard D. Zimmerman, badge #7961",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,MINNEAPOLIS POLICE DEPARTMENT INTERNAL AFFAIR...
2404,"Michael N. Young, badge #7895",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,"Cook, Joni ,:- --Irom: Glampe, Travis ..ien..."
2405,"Steve Wuorinen, badge #7876",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,4/11/2021 Cop Gets $585K After Colleagues Snoo...
2406,"David Voss, badge #7441",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,MINNEAPOLIS POLICE DEPARTMENT Internal Affair...


Some articles in the `text` column start with `"Too_Long"`, even if there is article text. Let's examine those.

In [159]:
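# Keep only the first 20 rows whose text starts with "Too_Long" (.head(20) caps the result at 20 rows)
# TODO: question for Jesse: if an article has text but starts with "Too_Long", can we drop it?
too_long_articles = df.loc[df["text"].str.startswith("Too_Long")].head(20)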
too_long_articles

Unnamed: 0,name,department,url,text
4,"Matthew Aish, badge #None",Department:Columbia Heights Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,Too_Long1 Arbitration LELS (Mathew Aish)/...
10,"Vincent A. Adams, badge #720505",Department:St. Paul Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,Too_LongOFFICE OF THE RAMSEY COUNTY ATTORNEY J...
12,"Philip Alejandrino, badge #29",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,Too_Long UNITED STATES DISTRICT COURT DIST...
16,"Philip Alejandrino, badge #29",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,Too_LongUNITED STATES DISTRICT COURT DISTRICT...
23,"Christopher Abbas, badge #2",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,Too_Long UNITED STATES DISTRICT COURT ...
37,"Mukhtar Abdulkadir, badge #47",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,Too_Long 1 IN THE MATTER OF ARBITRATION BETWE...
43,"Richard C. Altonen, badge #66",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,Too_Long2/25/2021 The Bad Cops: How Minneapoli...
78,"Abdiwahab Ali, badge #30",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,Too_LongOPCR Case #17-03832 Table of Contents...
105,"Medaria Arradondo, badge #231",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,Too_Long2/25/2021 The Bad Cops: How Minneapoli...
110,"Carlos Baires Escobar, badge #275",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,Too_LongUNITED STATES DISTRICT COURT DISTRICT...


Below, we look at the full text of individual rows that start with `"Too_Long"`.

In [160]:
too_long_articles.loc[155, "text"] # investigate one of the too_long rows



In [161]:
too_long_articles.loc[157, "text"]

'Too_LongMinneapolis  Office of Police  Conduct Review  INVESTIGATIVE REPORT  Complaint Number: 17-10527  Investigator:  Stephen J McKean  Officer (s):  Shannon Barnette  Case Type:  Administrative  Date of Incident:  May 25, 2017  Complaint Filed:  June 07, 2017  Minneapolis  Police  13.43 - Personnel Data  CASE OVERVIEW  On 05/25/2017, at 18:11 hours, a 9-1-1 call was received from an unknown male subject who stated that  he was the Complainant\'s 13.82 however, he wished to remain anonymous. The caller asked for officers  to respond to Privacy Policy home for "a safety or a welfare check" as she has "mental health issues", the  nature of which he could not elaborate. 1 The caller stated that Privacy Policy called him "threatening my  family and me." The caller said that he didn\'t think she was going to do anything.2 He also stated, "I  mean, I don\'t think she\'s gonna hurt anyone, but..."3 At 20:14 hours on the 25th, Officer  13.43  13.43 and Officer  13.43  13.43 responded to  PP

In [162]:
df.loc[2405, "text"] # examine a regular article to compare it with the "Too_Long" ones

'4/11/2021 Cop Gets $585K After Colleagues Snooped on Her DMV Data | WIRED https://www.wired.com/story/minnesota-police-dmv-database-abuse/ 1/5 LOUISE MATSAKIS SECURITY 06.21.2019 03:47 PM Minnesota Cop Awarded $585K After Colleagues Snooped on Her DMV Data A jury this week finds that Minneapolis police officers abused their license database access. Dozens of other lawsuits have made similar claims. IN 2013, AMY Krekelberg received an unsettling notice from Minnesota’s Department of Natural Resources: An employee had abused his access to a government driver’s license database and snooped on thousands of people in the state, mostly women. Krekelberg learned that she was one of them. MARLIN LEVISON/STAR TRIBUNE/GETTY IMAGES 4/11/2021 Cop Gets $585K After Colleagues Snooped on Her DMV Data | WIRED https://www.wired.com/story/minnesota-police-dmv-database-abuse/ 2/5 When Krekelberg asked for an audit of accesses to her DMV records, as allowed by Minnesota state law, she learned that her in

In [163]:
len(too_long_articles), len(df)

(20, 2408)

There is still article text in the `"Too_Long"` articles, but it seems to be cut off mid-sentence, which suggests that only a limited number of characters could be read before the article was flagged as too long. Just in case, we will keep the `"Too_Long"` prefix at the beginning of these article texts, so that we know which articles might be incomplete. Note that `too_long_articles` was capped with `.head(20)`, so the 20 rows counted above are a lower bound and not the total number of rows whose text starts with `"Too_Long"`.
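One way to count every row whose text starts with `"Too_Long"`, without the `.head(20)` cap used in cell 159, is to sum the boolean mask directly. The sketch below uses made-up rows, and the column name `maybe_truncated` is a hypothetical addition, not something the original notebook creates.

```python
import pandas as pd

# Toy stand-in for the cleaned frame (hypothetical rows).
toy = pd.DataFrame({
    "text": [
        "Too_LongUNITED STATES DISTRICT COURT",
        "A regular article",
        "Too_Long 1 IN THE MATTER OF ARBITRATION",
    ]
})

# Boolean mask over all rows, with no cap on the number of matches.
starts_too_long = toy["text"].str.startswith("Too_Long")

# Total number of possibly truncated articles.
n_too_long = int(starts_too_long.sum())

# Optional: store the flag in its own column so the prefix is not the only marker.
toy["maybe_truncated"] = starts_too_long
print(n_too_long)
```

Keeping the flag in a separate column would also make it possible to strip the prefix from the text later without losing track of which articles might be incomplete.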

In [164]:
df.to_csv("cleaned_police_reports.csv", index=False) # save cleaned data as a csv

Potential problems with this data cleaning code:
1. There is extra text at the end of some articles that is not related to the article. This might confuse the LLM.
2. We didn't remove page numbers or URLs that are not part of the article text.
3. The text is all on one line. For readability, we can split each article into sentences using NLTK's sentence tokenizer, and split the text so that every `n` sentences are on one line. However, the NLTK sentence tokenizer might not be 100% correct for our data. We would have to manually check that it's correct, and we can't do that for thousands of articles.
4. Only 2,408 rows with usable article text remain, out of the 15,527 rows scraped.
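For problem 2, a rough starting point could be a pair of regular expressions that strip URLs and standalone page counters such as `1/5`. This is only a sketch under assumptions: the patterns and the function name `strip_urls_and_page_markers` are illustrative, they have not been checked against the full dataset, and the page-counter pattern would also remove a standalone fraction such as `1/2`.

```python
import re

# Anything that looks like an http(s) URL, up to the next whitespace.
URL_RE = re.compile(r"https?://\S+")

# Standalone "page/total" markers such as "1/5". The lookarounds require
# whitespace (or the string boundary) on both sides, so dates like
# "4/11/2021" are left alone.
PAGE_RE = re.compile(r"(?<!\S)\d{1,3}/\d{1,3}(?!\S)")


def strip_urls_and_page_markers(text: str) -> str:
    """Remove URLs and standalone page counters, then collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = PAGE_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


sample = (
    "Cop Gets $585K | WIRED "
    "https://www.wired.com/story/minnesota-police-dmv-database-abuse/ 1/5 "
    "LOUISE MATSAKIS"
)
cleaned = strip_urls_and_page_markers(sample)
print(cleaned)
```

If this approach were adopted, it could be applied to the whole column with `df["text"].map(strip_urls_and_page_markers)`, ideally after spot-checking a sample of articles by hand.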