This notebook performs data cleaning for the scraped DataFrame.

## Remove missing rows

In [6176]:
import pandas as pd

In [6177]:
df = pd.read_csv("scraped_police_reports14.csv") # dataframe of all scraped articles
df.head()

Unnamed: 0,name,department,url,text
0,"Andrew Allen, badge #37",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,Home Legislative File 2021-01132 RCA Legal s...
1,"Elijah Allen, badge #None",Department:Minnesota State Patrol,No articles found,No articles found
2,"Michael Anderson, badge #None",Department:Martin County Sheriffs Office,No articles found,No articles found
3,"Oluwademilade Adediran, badge #21",Department:Minneapolis Police Department,No articles found,No articles found
4,"Wade Anderson, badge #None",Department:Winona Police Department,No articles found,No articles found


In [6178]:
df.shape # There are 15,527 articles (or links without articles)

(15527, 4)

In [6179]:
df.dtypes

name          object
department    object
url           object
text          object
dtype: object

In [6180]:
df.isna().sum() # There are 205 missing articles

name            0
department      0
url             0
text          205
dtype: int64

In [6181]:
null_text_rows = df.loc[df["text"].isna()] # Find the rows where there is no article text
null_text_rows.head()

Unnamed: 0,name,department,url,text
31,"Roxanne Affeldt, badge #None",Department:Champlin Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,
645,"Todd Babekuhl, badge #246",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,
657,"Todd Babekuhl, badge #246",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,
719,"Stephanie L. Bailey, badge #29300",Department:St. Paul Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,
726,"Daniel Anderson, badge #94",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,


We will drop the rows where `text` is `NaN`, because that means there is no article for that row.

In [6182]:
df = df.dropna() # Drop rows where there is any NaN value
df.shape

(15322, 4)
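
As a side note, `dropna()` without arguments drops a row when any column is `NaN`. Here only `text` has missing values, so the result is the same, but restricting the check to `text` states the intent more directly. A minimal sketch on a toy DataFrame (the `toy` data and `example.com` URLs are made up for illustration):

```python
import pandas as pd

# Toy data (hypothetical values): one row lacks text, another lacks a URL.
toy = pd.DataFrame({
    "url": ["https://example.com/a", "https://example.com/b", None],
    "text": ["first article", None, "third article"],
})

dropped_any = toy.dropna()                  # drops a row if ANY column is NaN
dropped_text = toy.dropna(subset=["text"])  # drops a row only if `text` is NaN

print(len(dropped_any), len(dropped_text))
```

With `subset=["text"]`, a row that is missing only its URL would be kept, which may or may not be what is wanted.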

In [6183]:
len(df["url"].unique()) # 1,855 unique URL values (counting the "No articles found" placeholder as one) across more than 15,000 rows, so many rows share a URL

1855
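
Because the placeholder string `"No articles found"` is stored in the `url` column, it is counted as one of the unique values. The sketch below, on made-up data, shows one way to count only real URLs; the `PLACEHOLDER` name and the toy rows are assumptions for illustration.

```python
import pandas as pd

PLACEHOLDER = "No articles found"  # placeholder string used in the dataset

# Toy data (hypothetical URLs)
toy = pd.DataFrame({
    "url": [
        PLACEHOLDER,
        PLACEHOLDER,
        "https://example.com/a",
        "https://example.com/a",
        "https://example.com/b",
    ]
})

n_unique_all = toy["url"].nunique()                                    # counts the placeholder too
n_unique_real = toy.loc[toy["url"] != PLACEHOLDER, "url"].nunique()    # real URLs only

print(n_unique_all, n_unique_real)
```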

Below, we find the rows where the URL or the text, respectively, is equal to "No articles found".

In [6184]:
no_article_rows = df.loc[df["url"] == "No articles found"]
no_article_rows

Unnamed: 0,name,department,url,text
1,"Elijah Allen, badge #None",Department:Minnesota State Patrol,No articles found,No articles found
2,"Michael Anderson, badge #None",Department:Martin County Sheriffs Office,No articles found,No articles found
3,"Oluwademilade Adediran, badge #21",Department:Minneapolis Police Department,No articles found,No articles found
4,"Wade Anderson, badge #None",Department:Winona Police Department,No articles found,No articles found
6,"Fernando F. Abla-Reyes, badge #801160",Department:St. Paul Police Department,No articles found,No articles found
...,...,...,...,...
15483,"John Zupancic, badge #None",Department:Gilbert Police Department,No articles found,No articles found
15484,"Brian Zwach, badge #None",Department:Oak Park Heights Police Department,No articles found,No articles found
15485,"Kenneth Zwak, badge #None",Department:Duluth Police Department,No articles found,No articles found
15486,"Christian Zweerink, badge #None",Department:Park Rapids Police Department,No articles found,No articles found


In [6185]:
no_text_rows = df.loc[df["text"] == "No articles found"]
no_text_rows

Unnamed: 0,name,department,url,text
1,"Elijah Allen, badge #None",Department:Minnesota State Patrol,No articles found,No articles found
2,"Michael Anderson, badge #None",Department:Martin County Sheriffs Office,No articles found,No articles found
3,"Oluwademilade Adediran, badge #21",Department:Minneapolis Police Department,No articles found,No articles found
4,"Wade Anderson, badge #None",Department:Winona Police Department,No articles found,No articles found
6,"Fernando F. Abla-Reyes, badge #801160",Department:St. Paul Police Department,No articles found,No articles found
...,...,...,...,...
15483,"John Zupancic, badge #None",Department:Gilbert Police Department,No articles found,No articles found
15484,"Brian Zwach, badge #None",Department:Oak Park Heights Police Department,No articles found,No articles found
15485,"Kenneth Zwak, badge #None",Department:Duluth Police Department,No articles found,No articles found
15486,"Christian Zweerink, badge #None",Department:Park Rapids Police Department,No articles found,No articles found


In [6186]:
no_text_rows.equals(no_article_rows) # Check if the rows without text and the rows without articles are the same

True
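
The same check can be done on the boolean masks directly, without building two DataFrames. This is only a sketch on toy data; `PLACEHOLDER`, `url_missing`, and `text_missing` are illustrative names, not part of the notebook.

```python
import pandas as pd

PLACEHOLDER = "No articles found"

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "url": [PLACEHOLDER, "https://example.com/a", PLACEHOLDER],
    "text": [PLACEHOLDER, "article text", PLACEHOLDER],
})

url_missing = toy["url"] == PLACEHOLDER
text_missing = toy["text"] == PLACEHOLDER

# True only if both columns flag exactly the same rows
masks_agree = bool((url_missing == text_missing).all())
# Number of rows flagged by either column
n_missing = int((url_missing | text_missing).sum())

print(masks_agree, n_missing)
```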

There are 12,418 rows with no article text, out of the 15,322 rows that remain after dropping `NaN` values. We will drop these rows as well.

In [6187]:
df = df.loc[(df["url"] != "No articles found") & (df["text"] != "No articles found")]
df

Unnamed: 0,name,department,url,text
0,"Andrew Allen, badge #37",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,Home Legislative File 2021-01132 RCA Legal s...
5,"Guled Abdullahi, badge #706",Department:Hennepin County Sheriff's Department,https://assets.nationbuilder.com/cuapb/pages/1...,"Hennepin County 300 South Sixth Street, Minne..."
11,"Dean V. Albers, badge #None",Department:Goodhue County Sheriff's Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,"2/22/2021 Jenson v. Craft, Civil No. 01-1488(D..."
22,"Scott Aikins, badge #22",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,"2/22/2021 United States v. Diriye, Case No. 14..."
24,"Matthew Aish, badge #None",Department:Columbia Heights Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,Too_Long1 Arbitration LELS (Mathew Aish)/...
...,...,...,...,...
15520,"David Voss, badge #7441",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,MINNEAPOLIS POLICE DEPARTMENT Internal Affair...
15522,"Roderick Weber, badge #7612",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,Too_Long
15524,"Roderick Weber, badge #7612",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,No text found
15525,"Roderick Weber, badge #7612",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,Too_Long


Next, we investigate the rows where `text` is equal to `Too_Long` or `No text found`.

In [6188]:
df.loc[df["text"] == "Too_Long"]

Unnamed: 0,name,department,url,text
928,"Tyrone Barze, badge #330",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,Too_Long
929,"Tyrone Barze, badge #330",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,Too_Long
1060,"Tyrone Barze, badge #330",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,Too_Long
1135,"Tyrone Barze, badge #330",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,Too_Long
1346,"Mukhtar Abdulkadir, badge #47",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,Too_Long
...,...,...,...,...
15271,"Roderick Weber, badge #7612",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,Too_Long
15276,"Roderick Weber, badge #7612",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,Too_Long
15516,"Roderick Weber, badge #7612",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,Too_Long
15522,"Roderick Weber, badge #7612",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,Too_Long


In [6189]:
df.loc[df["text"] == "No text found"]

Unnamed: 0,name,department,url,text
25,"Charles Adams, III, badge #59",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,No text found
98,"Dean V. Albers, badge #None",Department:Goodhue County Sheriff's Department,https://www.dropbox.com/s/skcxonxtqyo7vi2/POST...,No text found
130,"Mukhtar Abdulkadir, badge #47",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,No text found
161,"Lance Ahlrich, badge #18",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,No text found
192,"Daniel Anderson, badge #94",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,No text found
...,...,...,...,...
15490,"Rhonda Ziegelmann, badge #None",Department:Hennepin County Sheriff's Department,https://assets.nationbuilder.com/cuapb/pages/1...,No text found
15496,"Rhonda Ziegelmann, badge #None",Department:Hennepin County Sheriff's Department,https://assets.nationbuilder.com/cuapb/pages/1...,No text found
15507,"Roderick Weber, badge #7612",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,No text found
15513,"Geoffrey Toscano, badge #7257",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,No text found


There are 400 rows where the text is too long and 96 rows with no text found. We will drop these rows. That results in a DataFrame with 2,408 rows.

In [6190]:
df = df.loc[~df["text"].isin(["No text found", "Too_Long"])] # Rows where the text is not equal to "No text found" AND not equal to "Too_Long".
df = df.reset_index(drop=True) # reset index to a sequential integer index
df

Unnamed: 0,name,department,url,text
0,"Andrew Allen, badge #37",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,Home Legislative File 2021-01132 RCA Legal s...
1,"Guled Abdullahi, badge #706",Department:Hennepin County Sheriff's Department,https://assets.nationbuilder.com/cuapb/pages/1...,"Hennepin County 300 South Sixth Street, Minne..."
2,"Dean V. Albers, badge #None",Department:Goodhue County Sheriff's Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,"2/22/2021 Jenson v. Craft, Civil No. 01-1488(D..."
3,"Scott Aikins, badge #22",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,"2/22/2021 United States v. Diriye, Case No. 14..."
4,"Matthew Aish, badge #None",Department:Columbia Heights Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,Too_Long1 Arbitration LELS (Mathew Aish)/...
...,...,...,...,...
2403,"Richard D. Zimmerman, badge #7961",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,MINNEAPOLIS POLICE DEPARTMENT INTERNAL AFFAIR...
2404,"Michael N. Young, badge #7895",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,"Cook, Joni ,:- --Irom: Glampe, Travis ..ien..."
2405,"Steve Wuorinen, badge #7876",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,4/11/2021 Cop Gets $585K After Colleagues Snoo...
2406,"David Voss, badge #7441",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,MINNEAPOLIS POLICE DEPARTMENT Internal Affair...


## Check for valid values

Some articles in the `text` column start with `"Too_Long"`, even if there is article text. Let's examine those.

In [6191]:
too_long_articles = df.loc[df["text"].str.startswith("Too_Long")].head(20)
too_long_articles

Unnamed: 0,name,department,url,text
4,"Matthew Aish, badge #None",Department:Columbia Heights Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,Too_Long1 Arbitration LELS (Mathew Aish)/...
10,"Vincent A. Adams, badge #720505",Department:St. Paul Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,Too_LongOFFICE OF THE RAMSEY COUNTY ATTORNEY J...
12,"Philip Alejandrino, badge #29",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,Too_Long UNITED STATES DISTRICT COURT DIST...
16,"Philip Alejandrino, badge #29",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,Too_LongUNITED STATES DISTRICT COURT DISTRICT...
23,"Christopher Abbas, badge #2",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,Too_Long UNITED STATES DISTRICT COURT ...
37,"Mukhtar Abdulkadir, badge #47",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,Too_Long 1 IN THE MATTER OF ARBITRATION BETWE...
43,"Richard C. Altonen, badge #66",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,Too_Long2/25/2021 The Bad Cops: How Minneapoli...
78,"Abdiwahab Ali, badge #30",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,Too_LongOPCR Case #17-03832 Table of Contents...
105,"Medaria Arradondo, badge #231",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,Too_Long2/25/2021 The Bad Cops: How Minneapoli...
110,"Carlos Baires Escobar, badge #275",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,Too_LongUNITED STATES DISTRICT COURT DISTRICT...


Below, we examine the articles that start with `"Too_Long"`.

In [6192]:
too_long_articles.loc[155, "text"] # investigate one of the too_long rows



In [6193]:
too_long_articles.loc[157, "text"]

'Too_LongMinneapolis  Office of Police  Conduct Review  INVESTIGATIVE REPORT  Complaint Number: 17-10527  Investigator:  Stephen J McKean  Officer (s):  Shannon Barnette  Case Type:  Administrative  Date of Incident:  May 25, 2017  Complaint Filed:  June 07, 2017  Minneapolis  Police  13.43 - Personnel Data  CASE OVERVIEW  On 05/25/2017, at 18:11 hours, a 9-1-1 call was received from an unknown male subject who stated that  he was the Complainant\'s 13.82 however, he wished to remain anonymous. The caller asked for officers  to respond to Privacy Policy home for "a safety or a welfare check" as she has "mental health issues", the  nature of which he could not elaborate. 1 The caller stated that Privacy Policy called him "threatening my  family and me." The caller said that he didn\'t think she was going to do anything.2 He also stated, "I  mean, I don\'t think she\'s gonna hurt anyone, but..."3 At 20:14 hours on the 25th, Officer  13.43  13.43 and Officer  13.43  13.43 responded to  PP

In [6194]:
df.loc[2405, "text"] # examine a regular article to compare it with the "Too_Long" ones

'4/11/2021 Cop Gets $585K After Colleagues Snooped on Her DMV Data | WIRED https://www.wired.com/story/minnesota-police-dmv-database-abuse/ 1/5 LOUISE MATSAKIS SECURITY 06.21.2019 03:47 PM Minnesota Cop Awarded $585K After Colleagues Snooped on Her DMV Data A jury this week finds that Minneapolis police officers abused their license database access. Dozens of other lawsuits have made similar claims. IN 2013, AMY Krekelberg received an unsettling notice from Minnesota’s Department of Natural Resources: An employee had abused his access to a government driver’s license database and snooped on thousands of people in the state, mostly women. Krekelberg learned that she was one of them. MARLIN LEVISON/STAR TRIBUNE/GETTY IMAGES 4/11/2021 Cop Gets $585K After Colleagues Snooped on Her DMV Data | WIRED https://www.wired.com/story/minnesota-police-dmv-database-abuse/ 2/5 When Krekelberg asked for an audit of accesses to her DMV records, as allowed by Minnesota state law, she learned that her in

In [6195]:
len(too_long_articles), len(df)

(20, 2408)

There is still article text in the `"Too_Long"` articles, but it appears to be cut off mid-sentence, which suggests that only a limited number of characters was read before the article was flagged as too long. We will keep the `"Too_Long"` prefix on these texts so that we know which articles might be incomplete. Note that `too_long_articles` was limited to 20 rows by `.head(20)`, so the 20 reported above is the size of the inspected sample and may understate the number of affected rows in the 2,408-row DataFrame; `df["text"].str.startswith("Too_Long").sum()` gives the full count.
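
To illustrate the difference between counting the full mask and counting a `.head()` sample, here is a minimal sketch on made-up texts. The `is_truncated` column is a hypothetical addition, not something the notebook creates.

```python
import pandas as pd

PREFIX = "Too_Long"

# Toy data (hypothetical texts)
toy = pd.DataFrame({
    "text": [
        "Too_LongUNITED STATES DISTRICT COURT",
        "A regular article",
        "Too_Long 1 IN THE MATTER OF ARBITRATION",
    ]
})

# Flag possibly incomplete articles in a separate column; the prefix stays in `text`
toy["is_truncated"] = toy["text"].str.startswith(PREFIX)

n_truncated = int(toy["is_truncated"].sum())              # full count
n_in_sample = len(toy.loc[toy["is_truncated"]].head(1))   # a head() sample caps the count

print(n_truncated, n_in_sample)
```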

Next, we check all columns for invalid values, such as whitespace-only or purely numeric entries.

### Initial check of all columns

In [6196]:
# Check if any column contains an entry with all whitespace characters. It does not.
for col in df.columns:
    print(df[col].str.isspace().sum())

0
0
0
0
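
One caveat: `str.isspace()` returns `False` for an empty string, so this check does not catch entries that are exactly `""`. A small sketch of a stricter check, using a toy Series (names are illustrative):

```python
import pandas as pd

# Toy data: a whitespace-only entry, an empty string, and a normal entry
s = pd.Series(["  ", "", "article text"])

n_whitespace_only = int(s.str.isspace().sum())   # misses the empty string
n_blank = int((s.str.strip() == "").sum())       # catches empty and whitespace-only entries

print(n_whitespace_only, n_blank)
```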


In [6197]:
# Check if any column contains an entry with all numeric characters. It does not
for col in df.columns:
    print(df[col].str.isnumeric().sum())

0
0
0
0


In [6198]:
# Remove leading and trailing whitespace from all entries
for col in df.columns: 
    df[col] = df[col].str.strip() # remove whitespace from that column

### Check invalid URLs

In [6199]:
df.sort_values(by = "url")

Unnamed: 0,name,department,url,text
2258,"Kent Warnberg, badge #7570",Department:Minneapolis Police Department,file:///home/chronos/u-6ec0973dcaad89553a5ddf1...,Failed: No connection adapters were found for ...
1702,"Jeffrey Pennaz, badge #5551",Department:Minneapolis Police Department,https://assets.nationbuilder.com/cuapb/pages/1...,Home Legislative File 2022-00234 RCA Legal S...
1777,"Kurt Radke, badge #5882",Department:Minneapolis Police Department,https://assets.nationbuilder.com/cuapb/pages/1...,Home Legislative File 2022-00241 RCA Legal S...
2120,"Craig A. Taylor, badge #7139",Department:Minneapolis Police Department,https://assets.nationbuilder.com/cuapb/pages/1...,Home Legislative File 2022-00233 RCA Legal S...
2121,"Cory Taylor, badge #7141",Department:Minneapolis Police Department,https://assets.nationbuilder.com/cuapb/pages/1...,Home Legislative File 2022-00240 RCA Legal S...
...,...,...,...,...
816,"Mark Hanneman, badge #2654",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,MPD officer who killed Amir Locke disciplined ...
1113,"Joseph Klimmek, badge #3735",Department:Minneapolis Police Department,https://www.govinfo.gov/content/pkg/USCOURTS-m...,UNITED STATES DISTRICT COURT DISTRICT OF MINN...
68,"Tobias Anderson, badge #132",Department:Minneapolis Police Department,https://www.govinfo.gov/content/pkg/USCOURTS-m...,UNITED STATES DISTRICT COURT DISTRICT OF MINN...
1426,"Daryl McKenna, badge #4623",Department:Minneapolis Police Department,https://www.govinfo.gov/content/pkg/USCOURTS-m...,UNITED STATES DISTRICT COURT DISTRICT OF MINNE...


In [6200]:
print(f"URL: {df.loc[2258, 'url']}")   # single quotes inside the f-string also work before Python 3.12
print(f"Text: {df.loc[2258, 'text']}")

URL: file:///home/chronos/u-6ec0973dcaad89553a5ddf1372e2bf8e9d7aa203/MyFiles/Downloads/Star_Tribune_Wed__May_5__1993_.pdf
Text: Failed: No connection adapters were found for 'file:///home/chronos/u-6ec0973dcaad89553a5ddf1372e2bf8e9d7aa203/MyFiles/Downloads/Star_Tribune_Wed__May_5__1993_.pdf'


We will drop row 2258: its URL is a local `file://` path rather than an `https://` web address, and its text is just an error message.

In [6201]:
df.drop(index=2258, inplace=True)

In [6202]:
unique_urls = df["url"].value_counts() # number of rows for each unique URL, most frequent first
unique_urls

url
https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pages/270/attachments/original/1618168204/Cop_Gets__585K_After_Colleagues_Snooped_on_Her_DMV_Data___WIRED.pdf?1618168204                                                           51
https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pages/270/attachments/original/1618167029/KREKELBERG_v._Anoka_County__439_F._Supp._3d_1143_-_Dist._Court__Minnesota_2020_-_Google_Scholar.pdf?1618167029                           51
https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pages/270/attachments/original/1618167547/2016_USCOURTS-mnd-0_13-cv-03562-1.pdf?1618167547                                                                                         51
https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pages/270/attachments/original/1618166256/Krekelberg_Privacy_Lawsuit.pdf?1618166256                                                                                                50
https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pages/270/attachments/original/1614285311/2020_T

In [6203]:
unique_urls = sorted(list(unique_urls.index)) # get only the URLs

# All URLs seem valid, since they start with https
for url in unique_urls:
    print(url)

https://assets.nationbuilder.com/cuapb/pages/1462/attachments/original/1647659311/RCA-2022-00193_-_Legal_Settlement_Workers'_Compensation_claim_of_Jeffrey_Pennaz.pdf?1647659311
https://assets.nationbuilder.com/cuapb/pages/1462/attachments/original/1647659312/RCA-2022-00194_-_Legal_Settlement_Workers'_Compensation_claim_of_Kurt_Radke.pdf?1647659312
https://assets.nationbuilder.com/cuapb/pages/1462/attachments/original/1647659312/RCA-2022-00195_-_Legal_Settlement_Workers'_Compensation_claim_of_Craig_Taylor.pdf?1647659312
https://assets.nationbuilder.com/cuapb/pages/1462/attachments/original/1647659313/RCA-2022-00196_-_Legal_Settlement_Workers'_Compensation_claim_of_Cory_Taylor.pdf?1647659313
https://assets.nationbuilder.com/cuapb/pages/1462/attachments/original/1647659313/RCA-2022-00197_-_Legal_Settlement_Workers'_Compensation_claim_of_Joseph_Will.pdf?1647659313
https://assets.nationbuilder.com/cuapb/pages/1462/attachments/original/1647659314/RCA-2022-00198_-_Legal_Settlement_Workers'_Co

In [6204]:
starts_with_https = df["url"].str.startswith("https://") # Boolean Series: True where the URL starts with "https://"
bool(starts_with_https.all()) # True if every URL in the DataFrame starts with "https://"

True
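
A prefix check works here; an alternative is to parse each URL and inspect its scheme, which also makes it easy to list the offending schemes. The sketch below uses the standard library's `urllib.parse.urlparse` on made-up URLs; the `toy` data and the set of accepted schemes are assumptions.

```python
from urllib.parse import urlparse

import pandas as pd

# Toy data (hypothetical URLs): one web address, one local file path
toy = pd.DataFrame({
    "url": [
        "https://example.com/report.pdf",
        "file:///home/user/Downloads/report.pdf",
    ]
})

# Extract the scheme ("https", "file", ...) of each URL
schemes = toy["url"].map(lambda u: urlparse(u).scheme)
is_web_url = schemes.isin(["http", "https"])  # accepted schemes are an assumption

print(schemes.tolist(), is_web_url.tolist())
```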

### Identify duplicates

In [6205]:
is_duplicate = df.duplicated() # bool Series that is True if that row is a duplicate
sum(is_duplicate) # number of duplicated rows

17
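
By default, `duplicated()` marks only the second and later copies of a row, so the count above excludes the first occurrence of each duplicated row. To see every copy side by side, `keep=False` can be used. A sketch on toy data (values are made up):

```python
import pandas as pd

# Toy data (hypothetical values): the first two rows are identical
toy = pd.DataFrame({
    "name": ["A", "A", "B"],
    "url": ["https://example.com/x", "https://example.com/x", "https://example.com/y"],
})

later_copies = toy.duplicated()          # marks only repeats after the first occurrence
all_copies = toy.duplicated(keep=False)  # marks every row that has a duplicate

print(later_copies.tolist(), all_copies.tolist())
```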

In [6206]:
df[is_duplicate]

Unnamed: 0,name,department,url,text
53,"Andrew Allen, badge #37",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,CASE 0:13-cv-00435-DSD-JJK Document 1 File...
194,"John E. Bennett, badge #406",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,3/28/2021 Errant bite ends Minneapolis police ...
261,"Peter Brazeau, badge #750",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,3/18/2021 Minneapolis ofﬁcer ordered back on t...
302,"Richard D. Bruns, badge #None",Department:Adrian Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,4/13/2021 Minnesota police officer arrested fo...
368,"Joseph Carroll, badge #200",Department:Richfield Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,"3/6/2021 What we know about Brian Quinones, th..."
524,"John C. Delmonico, badge #1491",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,Minneapolis Police Officers: 350 S. 5th St. ...
633,"Mark Durand, badge #1627",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,4/11/2021 Cop Gets $585K After Colleagues Snoo...
751,"Michael Griffin, badge #2475",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,2/25/2021 Minneapolis cop under investigation ...
961,"Jeffrey R. Jindra, badge #3289",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,3/2/2021 MPR: Feds won't prosecute Minneapolis...
1123,"Joseph Klimmek, badge #3735",Department:Minneapolis Police Department,https://assets.nationbuilder.com/cuapb/pages/1...,From the desk of: Medaria Arradondo Chief of...


In [6207]:
dup_indexes = df[is_duplicate].index # indices of duplicate rows

for index in dup_indexes:
    print(f"\n{'-'*100}\n{index}: URL: {df.loc[index, 'url']}\n\n{df.loc[index, 'text']}")


----------------------------------------------------------------------------------------------------
53: URL: https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pages/270/attachments/original/1614808522/2015DanielFancherComplaint.pdf?1614808522

CASE 0:13-cv-00435-DSD-JJK   Document 1   Filed 02/22/13   Page 1 of 11 CASE 0:13-cv-00435-DSD-JJK   Document 1   Filed 02/22/13   Page 2 of 11 CASE 0:13-cv-00435-DSD-JJK   Document 1   Filed 02/22/13   Page 3 of 11 CASE 0:13-cv-00435-DSD-JJK   Document 1   Filed 02/22/13   Page 4 of 11 CASE 0:13-cv-00435-DSD-JJK   Document 1   Filed 02/22/13   Page 5 of 11 CASE 0:13-cv-00435-DSD-JJK   Document 1   Filed 02/22/13   Page 6 of 11 CASE 0:13-cv-00435-DSD-JJK   Document 1   Filed 02/22/13   Page 7 of 11 CASE 0:13-cv-00435-DSD-JJK   Document 1   Filed 02/22/13   Page 8 of 11 CASE 0:13-cv-00435-DSD-JJK   Document 1   Filed 02/22/13   Page 9 of 11 CASE 0:13-cv-00435-DSD-JJK   Document 1   Filed 02/22/13   Page 10 of 11 CASE 0:13-cv-00435-DSD-JJK   Document 1

We will drop the duplicate rows.

In [6208]:
df = df.drop_duplicates(ignore_index=True) # remove duplicate rows and reset the index to a (new) sequential integer index
len(df)

2390
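
As a sanity check, the row counts reported so far can be reconciled with simple arithmetic. The numbers below are taken from the outputs above; the variable names are only for illustration.

```python
# Row counts reported earlier in this notebook
rows_loaded = 15527         # rows in the CSV
rows_nan_text = 205         # rows with NaN text
rows_no_article = 12418     # rows equal to "No articles found"
rows_too_long = 400         # rows equal to "Too_Long"
rows_no_text_found = 96     # rows equal to "No text found"
rows_bad_url = 1            # row 2258 (file:// URL)
rows_duplicated = 17        # exact duplicate rows

after_dropna = rows_loaded - rows_nan_text
after_placeholders = after_dropna - rows_no_article - rows_too_long - rows_no_text_found
after_dedup = after_placeholders - rows_bad_url - rows_duplicated

print(after_dropna, after_placeholders, after_dedup)
```

Each intermediate value matches a shape or length printed earlier in the notebook.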

### Check invalid values for text

In [6209]:
df.sort_values(by = "text")

Unnamed: 0,name,department,url,text
1809,"Peter J. Ritschel, badge #6037",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,         ...
1867,"Peter J. Ritschel, badge #6037",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,"""bad faith"" has been characterized as conduct ..."
249,"Christopher Bennett, badge #413",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,() At this point the discrepancies between th...
149,"Tyrone Barze, badge #330",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,() Mpls. pays $82K in case of Maple Grove woma...
1444,"Phillip Meemken, badge #None",Department:Stearns County Sheriff's Department,https://assets.nationbuilder.com/cuapb/pages/1...,(/) (/submit/) (https://twitter.com/news8new...
...,...,...,...,...
1829,"Peter J. Ritschel, badge #6037",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,• Ritschel took Dahl's bike and brought it aro...
781,"Michael Griffin, badge #2475",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,• Sgt. Biederman was called to the scene and c...
639,"David C. Elliott, badge #1755",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,•Officer Elliott accessed 738 photos through E...
362,"Jason Allen Carlson, badge #None",Department:Ely Police Department,https://assets.nationbuilder.com/cuapb/pages/1..., Support the Timberjay by making a donation. ...


Below we check all rows that seem to have invalid text.

In [6210]:
print(f"Text: {df.loc[1809, 'text']}")
print(f"URL: {df.loc[1809, 'url']}")

df = df.drop(index=1809) # Drop this row: its text is unreadable symbols rather than article content

Text:                   !"#$%"#$ &"'$() "'$*& !+  , !$(*'   -./0./12
URL: https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pages/270/attachments/original/1626475811/Ritschel_hearing.pdf?1626475811


In [6211]:
df.loc[1444, "text"] # Keep this row since it's a valid article, just with a bunch of URLs at the start

'(/)  (/submit/)  (https://twitter.com/news8news)  (https://www.facebook.com/news8000?ref=hl)  (https://www.instagram.com/wkbtnews8/?hl=en)  (https://www.youtube.com/user/wkbttv) Plea deal for Minnesota deputy accused of sexual abuse Posted: May 7, 2012 9:07 PM Updated: December 21, 2019 10:17 AM by Site sta\x01 (https://www.news8000.com/author/sitesta\x01/) ANOKA, Minn. (AP) — A Stearns County sheri\x01’s deputy has reached a plea deal on charges that he sexually abused three teenagers. The St. Cloud Times reports Sgt. Phil Meemken reached the deal Monday as jury selection was scheduled to begin for his trial in Anoka County. Paul Young, head of the violent crime division of the Anoka County Attorney’s O\x02ce, tells the newspaper Meemken will plead guilty to one count of fourth-degree criminal sexual conduct and two counts of providing alcohol to a minor. Meemken was facing 25 counts, Including 22 counts of criminal sexual conduct. He was accused of providing alcohol to three teenage

In [6212]:
df.loc[362, "text"] # This one is valid. But the article seems to be incomplete because it has 2 pages out of 5

"\uf071 Support the Timberjay by making a donation.  (/supportus/ ) Welcome! \xa0  Log in (/) Serving Northern St. Louis County, Minnesota \uf071You have 2 free items remaining before a subscription is required. \xa0 Subscribe now! (/subscribe/?town_id=0) \xa0 | \xa0 Log in (/login.html) Carlson resigns as Ely police o\x01cer City’s employee committee recommends denial, urges termination instead Posted Wednesday, February 1, 2017 4:15 pm Keith Vandervort ELY - Ely police o\x01cer Jason Carlson, who avoided jail time after striking a plea agreement in court and admitting to an illegal sexual relationship with an underage girl, submitted his resignation to the city o\x01cials last week. The city’s employee relations committee recommended that the City Council deny the request. The Council will discuss and could take action on the issue at their next meeting on Feb. 7. Carlson, an 11-year veteran of the city’s police force, was on paid administrative leave with the city from April 2015 un

Below we check the length of all strings in the text column, to see if any strings are empty or very short.

In [6213]:
# Add a column for the length of each string in the text column, just for data cleaning purposes. We will remove it after data cleaning.
df["text_length"] = df["text"].str.len()

for length in sorted(list(df["text_length"])): # print the string length of each entry, in ascending order
    print(length)

13
13
13
13
13
13
13
13
13
13
13
13
13
13
13
13
23
71
85
92
100
100
143
143
143
143
143
143
143
143
143
143
143
143
143
143
143
143
143
167
175
176
189
189
189
189
190
190
190
190
190
191
191
191
191
191
191
191
191
192
192
192
192
192
192
192
192
192
192
192
192
192
192
192
192
192
192
192
192
192
192
192
192
192
193
193
195
214
215
216
216
216
217
218
218
218
218
218
218
218
219
222
224
225
225
231
272
283
283
283
283
283
344
354
355
355
357
378
378
379
379
379
379
379
379
379
380
380
381
381
381
381
381
383
385
385
385
385
385
385
385
385
385
385
385
385
385
385
385
385
385
389
389
407
407
407
407
407
407
407
407
407
418
420
425
461
496
496
496
496
567
567
567
575
575
575
575
575
575
575
575
575
575
575
575
575
575
575
575
575
578
580
580
596
596
658
680
697
698
701
711
711
711
711
711
750
777
777
783
783
793
793
794
818
818
818
818
818
818
818
830
880
881
881
890
893
895
898
899
914
923
939
939
939
939
939
939
939
959
959
959
959
959
959
959
959
959
959
975
975
982
999
999
999
999
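
Printing every length produces a very long listing. A more compact view, sketched here on toy data, counts how many rows share each length; repeated short lengths then stand out immediately. The `toy` DataFrame is made up for illustration.

```python
import pandas as pd

# Toy data (hypothetical texts)
toy = pd.DataFrame({
    "text": ["Access denied", "Access denied", "A longer article body."]
})
toy["text_length"] = toy["text"].str.len()

# Number of rows for each distinct length, shortest first
length_counts = toy["text_length"].value_counts().sort_index()

print(length_counts)
```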


In [6214]:
df.head()

Unnamed: 0,name,department,url,text,text_length
0,"Andrew Allen, badge #37",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,Home Legislative File 2021-01132 RCA Legal s...,1426
1,"Guled Abdullahi, badge #706",Department:Hennepin County Sheriff's Department,https://assets.nationbuilder.com/cuapb/pages/1...,"Hennepin County 300 South Sixth Street, Minne...",1476
2,"Dean V. Albers, badge #None",Department:Goodhue County Sheriff's Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,"2/22/2021 Jenson v. Craft, Civil No. 01-1488(D...",34621
3,"Scott Aikins, badge #22",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,"2/22/2021 United States v. Diriye, Case No. 14...",22890
4,"Matthew Aish, badge #None",Department:Columbia Heights Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,Too_Long1 Arbitration LELS (Mathew Aish)/...,44726


Some strings have only 13 characters, which is very few for an article. Let's check those:

In [6215]:
rows_13 = df.loc[df["text_length"] == 13] # all data rows where text length is 13
rows_13

Unnamed: 0,name,department,url,text,text_length
36,"Brian Anderson, badge #91",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,Access denied,13
186,"Justin Benner, badge #None",Department:Ramsey County Sheriff's Office,https://assets.nationbuilder.com/cuapb/pages/1...,Access denied,13
333,"Charles Cape, badge #971",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,Access denied,13
369,"Timothy Carson, badge #1006",Department:Minneapolis Police Department,https://assets.nationbuilder.com/cuapb/pages/1...,Access denied,13
684,"Joe Fuller, badge #2160",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,Access denied,13
707,"Matthew George, badge #None",Department:Bloomington Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,Access denied,13
1412,"Cullan McHarg, badge #None",Department:Bloomington Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,Access denied,13
1442,"Nick Melser, badge #None",Department:Bloomington Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,Access denied,13
1524,"Justin G Mullen, badge #122",Department:Burnsville Police Department,https://assets.nationbuilder.com/cuapb/pages/1...,Access denied,13
1832,"David Robins, badge #6046",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,Access denied,13


Since all of these rows just say "Access denied", we will drop them.

In [6216]:
df.drop(index = rows_13.index, inplace = True) # drop invalid rows
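
Selecting on length alone would also remove any legitimate text that happens to be 13 characters long. A minimal sketch of a stricter filter that matches the placeholder string itself; the `toy` frame and its values are made up for illustration and are not taken from the scraped data:

```python
import pandas as pd

# Hypothetical toy data standing in for the scraped DataFrame
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "text": ["Access denied", "A real article body ...", " Access denied "],
})

# Strip surrounding whitespace, then compare against the placeholder text
is_denied = toy["text"].str.strip().eq("Access denied")

# Keep only the rows that are not access-denied placeholders
toy_clean = toy.loc[~is_denied]
print(toy_clean)
```

Building a boolean mask and selecting with `~is_denied` leaves the original frame untouched, which makes the step easier to re-run than an in-place drop.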

In [6217]:
sorted(list(df["text_length"]))

[23,
 71,
 85,
 92,
 100,
 100,
 143,
 143,
 143,
 143,
 143,
 143,
 143,
 143,
 143,
 143,
 143,
 143,
 143,
 143,
 143,
 143,
 143,
 167,
 175,
 176,
 189,
 189,
 189,
 189,
 190,
 190,
 190,
 190,
 190,
 191,
 191,
 191,
 191,
 191,
 191,
 191,
 191,
 192,
 192,
 192,
 192,
 192,
 192,
 192,
 192,
 192,
 192,
 192,
 192,
 192,
 192,
 192,
 192,
 192,
 192,
 192,
 192,
 192,
 192,
 192,
 192,
 192,
 193,
 193,
 195,
 214,
 215,
 216,
 216,
 216,
 217,
 218,
 218,
 218,
 218,
 218,
 218,
 218,
 219,
 222,
 224,
 225,
 225,
 231,
 272,
 283,
 283,
 283,
 283,
 283,
 344,
 354,
 355,
 355,
 357,
 378,
 378,
 379,
 379,
 379,
 379,
 379,
 379,
 379,
 380,
 380,
 381,
 381,
 381,
 381,
 381,
 383,
 385,
 385,
 385,
 385,
 385,
 385,
 385,
 385,
 385,
 385,
 385,
 385,
 385,
 385,
 385,
 385,
 385,
 389,
 389,
 407,
 407,
 407,
 407,
 407,
 407,
 407,
 407,
 407,
 418,
 420,
 425,
 461,
 496,
 496,
 496,
 496,
 567,
 567,
 567,
 575,
 575,
 575,
 575,
 575,
 575,
 575,
 575,
 575,
 575,
 5

Now check all rows where the text has 100 or fewer characters, since that is still very short for an article.

In [6218]:
rows_100 = df.loc[df["text_length"] <= 100]
rows_100

Unnamed: 0,name,department,url,text,text_length
106,"Julio Baez, badge #None",Department:Kasson Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,https://minnesota.cbslocal.com/2018/06/21/kass...,92
744,"Michael Griffin, badge #2475",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,CASE 0:12-cv-02551-DWF-JSM Document 15 Fil...,71
1350,"Michael Manley, badge #4385",Department:Minneapolis Police Department,https://assets.nationbuilder.com/cuapb/pages/1...,1336323 1336324 1336325,23
1558,"Patrick V. McMahon, badge #4636",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,MR. DENNIS RODOLFO HERNANDEZ-ANTUNEZ 1147 N 7...,100
2276,"Christopher Steward, badge #6823",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,MR. DENNIS RODOLFO HERNANDEZ-ANTUNEZ 1147 N 7...,100
2372,"Michael Zueg, badge #None",Department:Walnut Grove Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,https://minnesota.cbslocal.com/2018/07/25/poli...,85


In [6219]:
for index in rows_100.index:
    print(f"{index}: {rows_100.loc[index, 'text']}") # single quotes inside the f-string also work on Python < 3.12

106: https://minnesota.cbslocal.com/2018/06/21/kasson-pd-officer-accused-criminal-sexual-conduct/
744: CASE 0:12-cv-02551-DWF-JSM   Document 15   Filed 01/26/15   Page 1 of 1
1350: 1336323 1336324 1336325
1558: MR. DENNIS RODOLFO HERNANDEZ-ANTUNEZ  1147 N 7TH ST  APT E-3041  MINNEAPOLIS, MN  55411  CUAPB002843
2276: MR. DENNIS RODOLFO HERNANDEZ-ANTUNEZ  1147 N 7TH ST  APT E-3041  MINNEAPOLIS, MN  55411  CUAPB002843
2372: https://minnesota.cbslocal.com/2018/07/25/police-chief-pleads-guilty-to-prostitution/


None of these rows contains an article, so we will drop them.

In [6220]:
df.drop(index = rows_100.index, inplace=True)
len(df)

2367
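
Two of the short texts printed above are bare URLs. As a hedged sketch, texts of that kind could be flagged by pattern rather than by length; the regular expression is an assumption about what counts as "URL only", and the third sample string is invented:

```python
import pandas as pd

texts = pd.Series([
    "https://minnesota.cbslocal.com/2018/07/25/police-chief-pleads-guilty-to-prostitution/",
    "1336323 1336324 1336325",
    "An actual sentence that mentions https://example.com in passing.",  # invented example
])

# True only when the whole string is a single http(s) URL, ignoring outer whitespace
is_bare_url = texts.str.fullmatch(r"\s*https?://\S+\s*")
print(is_bare_url.tolist())
```

`Series.str.fullmatch` requires the entire string to match, so a real article that merely contains a link is not flagged.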

Now check texts with 200 or fewer characters.

In [6221]:
# Get rows with length <= 200, and sort the dataframe in ascending order by the length of the text.
rows_200 = df.loc[df["text_length"] <= 200].sort_values(by = "text_length")
rows_200

Unnamed: 0,name,department,url,text,text_length
1157,"Robert J. Kroll, badge #3874",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,CASE 0:03-cv-01182-DSD-SRN Document 16 Fil...,143
2169,"Geoffrey Toscano, badge #7257",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,CASE 0:06-cv-00579-DWF-AJB Document 51 Fil...,143
2003,"Donald Smulski, Jr., badge #6696",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,CASE 0:06-cv-00579-DWF-AJB Document 51 Fil...,143
1999,"Katherine Smulski, badge #468",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,CASE 0:06-cv-00579-DWF-AJB Document 51 Fil...,143
1987,"Roger D. Smith, badge #6689",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,CASE 0:06-cv-00579-DWF-AJB Document 51 Fil...,143
...,...,...,...,...,...
1646,"Brett R. Palkowitsch, badge #520305",Department:St. Paul Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,"/ Star Tribune (Minneapolis, Minnesota) · Thu...",192
85,"Heather J. Aschoff, badge #233",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,"Star Tribune (Minneapolis, Minnesota) · Wed, ...",192
538,"Joshua Demmerly, badge #None",Department:Warroad Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,"Star Tribune (Minneapolis, Minnesota) · Sun, ...",193
382,"Tou M. Cha, badge #112410",Department:St. Paul Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,"Star Tribune (Minneapolis, Minnesota) · Fri, ...",193


In [6222]:
for index in rows_200.index:
    print(f"{'-'*50}\n{index}: {rows_200.loc[index, 'text']}\n") # single quotes inside the f-string also work on Python < 3.12

--------------------------------------------------
1157: CASE 0:03-cv-01182-DSD-SRN   Document 16   Filed 04/13/04   Page 1 of 2 CASE 0:03-cv-01182-DSD-SRN   Document 16   Filed 04/13/04   Page 2 of 2

--------------------------------------------------
2169: CASE 0:06-cv-00579-DWF-AJB   Document 51   Filed 10/19/07   Page 1 of 2 CASE 0:06-cv-00579-DWF-AJB   Document 51   Filed 10/19/07   Page 2 of 2

--------------------------------------------------
2003: CASE 0:06-cv-00579-DWF-AJB   Document 51   Filed 10/19/07   Page 1 of 2 CASE 0:06-cv-00579-DWF-AJB   Document 51   Filed 10/19/07   Page 2 of 2

--------------------------------------------------
1999: CASE 0:06-cv-00579-DWF-AJB   Document 51   Filed 10/19/07   Page 1 of 2 CASE 0:06-cv-00579-DWF-AJB   Document 51   Filed 10/19/07   Page 2 of 2

--------------------------------------------------
1987: CASE 0:06-cv-00579-DWF-AJB   Document 51   Filed 10/19/07   Page 1 of 2 CASE 0:06-cv-00579-DWF-AJB   Document 51   Filed 10/19/07   Pag

Double-check rows I'm unsure about:

In [6223]:
print(df.loc[330, "text"])

Minneapolis police officer jailed, accused of sex crime - StarTribune.com http://www.startribune.com/minneapolis-police-officer-jailed-accused-of... 1 of 1 9/10/2017, 12:24 PM


In [6224]:
df.loc[144, "text"]

'Minneapolis officer accused of strong-arm tactics with canvasser - StarT... http://www.startribune.com/minneapolis-officer-accused-of-strong-arm-ta... 1 of 1 8/25/2018, 2:37 AM'

We will drop all of these rows. None of them contain articles, and most of them just contain the title and URL of an article.

In [6225]:
df.drop(index=rows_200.index, inplace=True)
len(df)

2302
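
The length-based passes above (13, then 100 or fewer, then 200 or fewer characters) all apply the same rule with a different cutoff. A minimal sketch of that rule as one reusable function; `drop_short_texts` and the `toy` frame are hypothetical names introduced here, not part of the notebook:

```python
import pandas as pd

def drop_short_texts(frame: pd.DataFrame, min_length: int) -> pd.DataFrame:
    """Return a copy of `frame` without rows whose text has fewer than `min_length` characters."""
    lengths = frame["text"].str.len()
    # Rows with missing text have a NaN length and are dropped as well
    return frame.loc[lengths >= min_length].copy()

toy = pd.DataFrame({"text": ["Access denied", "x" * 150, "y" * 201, "z" * 5000]})

# Dropping texts of 200 characters or fewer means keeping 201 and up
kept = drop_short_texts(toy, min_length=201)
print(len(toy), "->", len(kept))
```

This only reproduces the final cutoff. The manual inspection at each threshold is what justified dropping those rows in the first place, so the function is a convenience for re-runs, not a replacement for that check.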

In [6226]:
unique_text_lengths = df["text_length"].value_counts() # unique values of text length
unique_text_lengths

text_length
72671    51
38274    51
7970     50
33017    50
53556    18
         ..
7914      1
3274      1
4346      1
2375      1
42150     1
Name: count, Length: 1248, dtype: int64
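
A compact way to see how often lengths repeat is to count the counts, so that the result has one row per repeat frequency instead of one row per length. A small sketch on invented lengths (`toy_lengths` is illustrative only):

```python
import pandas as pd

toy_lengths = pd.Series([500, 500, 500, 720, 720, 901, 1300], name="text_length")

# length -> number of rows with that length
per_length = toy_lengths.value_counts()

# repeat frequency -> number of distinct lengths with that frequency
count_of_counts = per_length.value_counts().sort_index()
print(count_of_counts)
```

Lengths shared by many rows are a hint that the same text appears more than once, which is what the duplicate check later in the notebook examines directly.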

In [6227]:
for length in list(unique_text_lengths.values):
    print(length)

51
51
50
50
18
17
17
17
15
15
15
14
14
11
10
10
10
10
9
9
8
8
8
7
7
7
7
7
7
7
7
7
7
7
7
6
6
6
6
6
6
6
6
6
6
6
6
6
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1


In [6228]:
for l in sorted(list(df["text_length"])):
    print(l)

214
215
216
216
216
217
218
218
218
218
218
218
218
219
222
224
225
225
231
272
283
283
283
283
283
344
354
355
355
357
378
378
379
379
379
379
379
379
379
380
380
381
381
381
381
381
383
385
385
385
385
385
385
385
385
385
385
385
385
385
385
385
385
385
389
389
407
407
407
407
407
407
407
407
407
418
420
425
461
496
496
496
496
567
567
567
575
575
575
575
575
575
575
575
575
575
575
575
575
575
575
575
575
578
580
580
596
596
658
680
697
698
701
711
711
711
711
711
750
777
777
783
783
793
793
794
818
818
818
818
818
818
818
830
880
881
881
890
893
895
898
899
914
923
939
939
939
939
939
939
939
959
959
959
959
959
959
959
959
959
959
975
975
982
999
999
999
999
1010
1027
1027
1027
1027
1027
1027
1027
1027
1028
1038
1038
1038
1038
1038
1038
1077
1079
1079
1079
1079
1079
1079
1079
1089
1093
1099
1101
1104
1104
1115
1115
1115
1117
1117
1120
1120
1120
1120
1127
1136
1151
1151
1161
1161
1161
1161
1173
1180
1180
1180
1180
1180
1180
1198
1198
1198
1207
1218
1238
1247
1247
1247
1247
1248
125

In [6229]:
len(df["text"].unique()), len(df) # check for duplicate texts

(1394, 2302)

There are 2,302 texts but only 1,394 unique texts. Let's check the duplicates.

In [6230]:
is_duplicate_text = df["text"].duplicated() # bool Series: True for each row whose text already appeared in an earlier row (default keep="first", so the first occurrence itself is not marked)
df["is_text_duplicated"] = is_duplicate_text # add a column for whether a row has a duplicate
sum(is_duplicate_text), len(df)

(908, 2302)
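
The count of 908 equals 2,302 minus 1,394, which is what `duplicated()` returns with its default `keep="first"`: every repeat is flagged, but the first occurrence is not. A minimal sketch of the difference from `keep=False`, on an invented Series:

```python
import pandas as pd

texts = pd.Series(["a", "b", "a", "a", "c", "b"])

# Default keep="first": flags repeats only, so the sum equals len - nunique
repeats_only = texts.duplicated()

# keep=False: flags every member of a group that occurs more than once
all_copies = texts.duplicated(keep=False)

print(repeats_only.sum(), len(texts) - texts.nunique())
print(all_copies.sum())
```

Which variant is appropriate depends on the goal: the default suits deduplication, while `keep=False` suits inspecting every row that shares its text with another row.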

In [6231]:
df[is_duplicate_text].sort_values(by = "text").head(20)

Unnamed: 0,name,department,url,text,text_length,is_text_duplicated
293,"Jeanine Brudenell, badge #845",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,(https://lp.¡ndlaw.com/) FINDLAW (HTTPS://LP.F...,78719,True
2277,"Roderick Weber, badge #7612",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,(https://lp.¡ndlaw.com/) FINDLAW (HTTPS://LP.F...,78719,True
1466,"Timothy J. Merkel, badge #4744",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,(https://lp.¡ndlaw.com/) FINDLAW (HTTPS://LP.F...,78719,True
734,"Robert Greer, badge #2441",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,(https://lp.¡ndlaw.com/) FINDLAW (HTTPS://LP.F...,78719,True
422,"Erika Christensen, badge #1097",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,(https://lp.¡ndlaw.com/) FINDLAW (HTTPS://LP.F...,78719,True
895,"Timothy J. Hoeppner, badge #3051",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,(https://lp.¡ndlaw.com/) FINDLAW (HTTPS://LP.F...,78719,True
1746,"Daniel Pommerenke, badge #5774",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,(https://lp.¡ndlaw.com/) FINDLAW (HTTPS://LP.F...,78719,True
1555,"Edward T. Nelson, badge #4970",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,(https://lp.¡ndlaw.com/) FINDLAW (HTTPS://LP.F...,78719,True
210,"John A. Billington, badge #556",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,(https://lp.¡ndlaw.com/) FINDLAW (HTTPS://LP.F...,78719,True
1078,"Joel Kimmerle, badge #3703",Department:Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,(https://lp.¡ndlaw.com/) FINDLAW (HTTPS://LP.F...,19746,True


In [6232]:
sum(df[["text", "url"]].duplicated()) # how many rows have the same text and URL as another row

900

So there are 8 rows that have the same text as another row, but a different URL from the other row.
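
To look at such rows directly, one option is to count distinct URLs per text and keep the texts that have more than one. A hedged sketch on made-up data (the `toy` values are placeholders):

```python
import pandas as pd

toy = pd.DataFrame({
    "text": ["t1", "t1", "t2", "t2", "t3"],
    "url":  ["u1", "u1", "u2", "u9", "u3"],
})

# Number of distinct URLs observed for each text
urls_per_text = toy.groupby("text")["url"].nunique()

# Texts that were scraped from more than one URL
multi_url_texts = urls_per_text[urls_per_text > 1].index
print(toy.loc[toy["text"].isin(multi_url_texts)])
```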

In [6233]:
sum(df[["text", "url", "name"]].duplicated()) # how many rows have the same text, URL, AND officer name as another row. There are no such rows.

0

Problem: If an article is about multiple police officers, then each officer gets their own row, and all those rows contain the same article text.
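
One possible way to handle this, sketched under the assumption that a downstream step wants one row per article, is to group by URL and text and collect the officer names into a list. The notebook does not say this is the chosen remedy, and the `toy` data and the `names` column are hypothetical:

```python
import pandas as pd

toy = pd.DataFrame({
    "name": ["Officer A", "Officer B", "Officer C"],
    "url":  ["u1", "u1", "u2"],
    "text": ["shared article", "shared article", "solo article"],
})

# One row per (url, text) pair, with all associated officers gathered in a list
articles = (
    toy.groupby(["url", "text"], as_index=False)
       .agg(names=("name", list))
)
print(articles)
```

Note that `groupby` skips rows whose key is missing by default, so rows with a missing `url` or `text` would need separate handling.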

### Check invalid `name` values

In [6234]:
for name in sorted(list(df["name"].unique())):
    print(name)

Aaron Apitz, badge #None
Aaron C. Morrison, badge #4859
Aaron Collins, badge #1259
Aaron Heuer, badge #None
Aaron L. Biard, badge #542
Aaron Pearson, badge #5504
Aaron Prescott, badge #5806
Aaron Womble, badge #7851
Abdiaziz Omar, badge #None
Abdirashid Mohamed, badge #None
Abdiwahab Ali, badge #30
Abdulkhayr  Hirse, badge #None
Abraham T. Cyr, badge #139000
Abubakar Muridi, badge #4896
Adam Castilleja, badge #None
Adam Hakanson, badge #3069
Adam Lepinski, badge #4093
Adam Lewis, badge #4102
Adam Miller, badge #659
Adam Moen, badge #4832
Adan Casas-Uriostegui, badge #None
Aimee Walker (Linson), badge #4172
Aine M. Bebeau, badge #35000
Alan K. Nielsen, badge #None
Alan Knowlton, badge #None
Alan Liotta, badge #4196
Alan Salvosa, badge #None
Alan W. Williams, badge #7763
Alberto Mendez, badge #509
Alex Nonnemacher, badge #5157
Alexander Brown, badge #820
Alexander Graham, badge #510
Alexander Walls, badge #7498
Alexandra Dubay, badge #1617
Alexandra Urgiles, badge #685
Allan Kertscher, b

All names look valid.
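
Reading the list by eye can be complemented with a pattern check. The sketch below assumes every value has the form `<officer name>, badge #<digits or None>`, which matches the values shown in this notebook but is not a documented guarantee of the source; the last sample value is an invented malformed entry:

```python
import pandas as pd

names = pd.Series([
    "Andrew Allen, badge #37",
    "Elijah Allen, badge #None",
    "Donald Smulski, Jr., badge #6696",
    "badge #12",  # hypothetical malformed value
])

# Greedy `.+` keeps commas inside the name (e.g. "Smulski, Jr.") in the officer group
pattern = r"^(?P<officer>.+), badge #(?P<badge>\d+|None)$"
parts = names.str.extract(pattern)

# Rows that do not match the expected format come back as missing
malformed = names[parts["officer"].isna()]
print(parts)
print(malformed)
```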

In [6235]:
name_lengths = sorted(list(df["name"].str.len())) # check for empty strings in the name column. There are none

for l in name_lengths:
    print(l)

20
20
20
21
21
21
21
21
21
21
21
21
21
21
21
21
21
21
21
21
21
21
21
22
22
22
22
22
22
22
22
22
22
22
22
22
22
22
22
22
22
22
22
22
22
22
22
22
22
22
22
22
22
22
22
22
22
22
22
22
22
22
22
23
23
23
23
23
23
23
23
23
23
23
23
23
23
23
23
23
23
23
23
23
23
23
23
23
23
23
23
23
23
23
23
23
23
23
23
23
23
23
23
23
23
23
23
23
23
23
23
23
23
23
23
23
23
23
23
23
23
23
23
23
23
23
23
23
23
23
23
23
23
23
23
23
23
23
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
24
2

### Check invalid departments

In [6236]:
sorted(list(df["department"].unique())) # all unique department names

['Department:',
 'Department:Adams Police Department',
 'Department:Adrian Police Department',
 'Department:Alexandria Police Department',
 "Department:Anoka County Sheriff's Office",
 'Department:Babbitt Police Department',
 'Department:Barnesville Police Department',
 'Department:Bayport Police Department',
 'Department:Belle Plaine Police Department',
 'Department:Bloomington Police Department',
 'Department:Braham Police Department',
 'Department:Brooklyn Center Police Department',
 'Department:Brooklyn Park Police Department',
 'Department:Brownsdale Police Department',
 'Department:Bureau of Criminal Apprehension',
 'Department:Burnsville Police Department',
 'Department:Canby Police Department',
 "Department:Carlton County Sheriff's Department",
 "Department:Carver County Sheriff's Department",
 'Department:Champlin Police Department',
 'Department:Chaska Police Department',
 'Department:Chatfield Police Department',
 "Department:Chisago County Sheriff's Department",
 'Departmen

Some rows have a `department` value of just `Department:`, so we will look at those.

In [6237]:
no_dept_rows = df.loc[df["department"] == "Department:"] # rows with no department specified
no_dept_rows

Unnamed: 0,name,department,url,text,text_length,is_text_duplicated
371,"Adan Casas-Uriostegui, badge #None",Department:,https://assets.nationbuilder.com/cuapb/pages/1...,C C STATE OF MINNESOTA BOARD OF PEACE OFFICER ...,2416,False
531,"Brian Denny, badge #None",Department:,https://assets.nationbuilder.com/cuapb/pages/1...,C C STATE OF MINNESOTA BOARD OF PEACE OFFICER ...,2522,False
1116,"Alan Knowlton, badge #None",Department:,https://assets.nationbuilder.com/cuapb/pages/1...,C C) STATE OF MINNESOTA BOARD OF PEACE OFFICER...,3836,False
1125,"Alan Knowlton, badge #None",Department:,https://assets.nationbuilder.com/cuapb/pages/1...,Skip to Main Content Logout My Account Search ...,4045,False
1203,"David W. Larson, badge #None",Department:,https://assets.nationbuilder.com/cuapb/pages/1...,C C STATE OF MINNEOSTA BOARD OF PEACE OFFICER ...,2419,False
1470,"Richard G. Miller, badge #None",Department:,https://assets.nationbuilder.com/cuapb/pages/1...,Skip to Main Content Logout My Account Search ...,4265,False
1680,"Shawn L. Peters, badge #None",Department:,https://assets.nationbuilder.com/cuapb/pages/1...,Skip to Main Content Logout My Account Search ...,2278,False
1857,"Curtis Rude, badge #None",Department:,https://assets.nationbuilder.com/cuapb/pages/1...,C C S ‘ STATE OF MINNESOTA BOARD OF PEACE OFFi...,2373,False
2364,"Michael Zeug, badge #None",Department:,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,STAR TRIBUNE › Annotations Off-duty McLeod ...,1403,True


In [6238]:
for index in no_dept_rows.index:
    print(f"\n{'-'*100}\n{index}: URL: {df.loc[index, 'url']}\n\n{df.loc[index, 'text']}") # single quotes inside the f-string also work on Python < 3.12


----------------------------------------------------------------------------------------------------
371: URL: https://assets.nationbuilder.com/cuapb/pages/1472/attachments/original/1669104724/Casas-Uriostegui_Adan-20-039-13919.pdf?1669104724

C C STATE OF MINNESOTA BOARD OF PEACE OFFICER STANDARDS AND TRAINING In the Matter of the Peace Officer License of Adan Casas-Uriostegui License Number: 13919 1. The Minnesota Board of Peace Officer Standards and Training (“Board”) is authorized pursuant to Minnesota Statutes sections 626.84 through 626.90 (1998) and Minnesota Rules chapter 6700 to license, regulate, and discipline persons who apply for, petition, or hold peace officer licenses in the State of Minnesota and is further authorized pursuant to Minnesota Statutes section 214.10 to review complaints against peace officers and to initiate appropriate disciplinary action. 2. Adan Casas-Uriostegui (“Respondent”) has been a peace officer in Minnesota since June 18, 1997. 3. Pursuant to M

All these rows look valid, so we will set their `department` value to missing (`None`) instead of dropping them.

In [6239]:
df["department"].dtype # check that the data types are the same before and after

dtype('O')

In [6240]:
df.loc[no_dept_rows.index, "department"] = None
df.loc[df["department"].isna()]

Unnamed: 0,name,department,url,text,text_length,is_text_duplicated
371,"Adan Casas-Uriostegui, badge #None",,https://assets.nationbuilder.com/cuapb/pages/1...,C C STATE OF MINNESOTA BOARD OF PEACE OFFICER ...,2416,False
531,"Brian Denny, badge #None",,https://assets.nationbuilder.com/cuapb/pages/1...,C C STATE OF MINNESOTA BOARD OF PEACE OFFICER ...,2522,False
1116,"Alan Knowlton, badge #None",,https://assets.nationbuilder.com/cuapb/pages/1...,C C) STATE OF MINNESOTA BOARD OF PEACE OFFICER...,3836,False
1125,"Alan Knowlton, badge #None",,https://assets.nationbuilder.com/cuapb/pages/1...,Skip to Main Content Logout My Account Search ...,4045,False
1203,"David W. Larson, badge #None",,https://assets.nationbuilder.com/cuapb/pages/1...,C C STATE OF MINNEOSTA BOARD OF PEACE OFFICER ...,2419,False
1470,"Richard G. Miller, badge #None",,https://assets.nationbuilder.com/cuapb/pages/1...,Skip to Main Content Logout My Account Search ...,4265,False
1680,"Shawn L. Peters, badge #None",,https://assets.nationbuilder.com/cuapb/pages/1...,Skip to Main Content Logout My Account Search ...,2278,False
1857,"Curtis Rude, badge #None",,https://assets.nationbuilder.com/cuapb/pages/1...,C C S ‘ STATE OF MINNESOTA BOARD OF PEACE OFFi...,2373,False
2364,"Michael Zeug, badge #None",,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,STAR TRIBUNE › Annotations Off-duty McLeod ...,1403,True


In [6241]:
df["department"].dtype

dtype('O')

In [6242]:
# Removes the "Department:" prefix
df["department"] = df["department"].str.removeprefix("Department:")
df.head()

Unnamed: 0,name,department,url,text,text_length,is_text_duplicated
0,"Andrew Allen, badge #37",Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,Home Legislative File 2021-01132 RCA Legal s...,1426,False
1,"Guled Abdullahi, badge #706",Hennepin County Sheriff's Department,https://assets.nationbuilder.com/cuapb/pages/1...,"Hennepin County 300 South Sixth Street, Minne...",1476,False
2,"Dean V. Albers, badge #None",Goodhue County Sheriff's Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,"2/22/2021 Jenson v. Craft, Civil No. 01-1488(D...",34621,False
3,"Scott Aikins, badge #22",Minneapolis Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,"2/22/2021 United States v. Diriye, Case No. 14...",22890,False
4,"Matthew Aish, badge #None",Columbia Heights Police Department,https://d3n8a8pro7vhmx.cloudfront.net/cuapb/pa...,Too_Long1 Arbitration LELS (Mathew Aish)/...,44726,False
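
`Series.str.removeprefix` (available in pandas 1.4 and later) leaves missing values missing, which is why the rows set to `None` earlier are unaffected. The sketch below, on an invented Series, also shows what happens to a bare `Department:` value that was not set to missing first:

```python
import pandas as pd

dept = pd.Series(
    ["Department:Winona Police Department", "Department:", None],
    dtype="object",
)

# The prefix is removed where present; the missing entry stays missing
stripped = dept.str.removeprefix("Department:")
print(stripped.tolist())

# A bare "Department:" becomes an empty string, so mask it out as missing too
normalized = stripped.mask(stripped.eq(""))
print(normalized.isna().tolist())
```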


In [6243]:
df.drop(columns = ["text_length", "is_text_duplicated"]).to_csv("cleaned_police_reports.csv", index=False)
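
As a quick sanity check on the export, the file can be read back to confirm that the helper columns are gone and that missing departments survive the round trip. This sketch writes to an in-memory buffer instead of `cleaned_police_reports.csv`, and the `toy` frame is invented:

```python
import io
import pandas as pd

toy = pd.DataFrame({
    "name": ["Officer A", "Officer B"],
    "department": ["Winona Police Department", None],
    "text_length": [1426, 2416],
})

# Write without the helper column and without the index, as in the cell above
buffer = io.StringIO()
toy.drop(columns=["text_length"]).to_csv(buffer, index=False)

# Read the CSV back; the empty department field is parsed as NaN
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded.isna().sum())
```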