### Importing DOJ data and saving to Pickle file for ease of use later

In [51]:
import pandas as pd

In [52]:
doj = pd.read_json('combined.json', lines=True)

In [53]:
doj.head()

| | components | contents | date | id | title | topics |
|---|---|---|---|---|---|---|
| 0 | [National Security Division (NSD)] | PORTLAND, Oregon. – Mohamed Osman Mohamud, 23,... | 2014-10-01 04:00:00 | | Convicted Bomb Plotter Sentenced to 30 Years | [] |
| 1 | [Environment and Natural Resources Division] | WASHINGTON – North Carolina’s Waccamaw River... | 2012-07-25 04:00:00 | 12-919 | $1 Million in Restitution Payments Announced t... | [] |
| 2 | [Environment and Natural Resources Division] | BOSTON– A $1-million settlement has been... | 2011-08-03 04:00:00 | 11-1002 | $1 Million Settlement Reached for Natural Reso... | [] |
| 3 | [Environment and Natural Resources Division] | WASHINGTON—A federal grand jury in Las Vegas... | 2010-01-08 05:00:00 | 10-015 | 10 Las Vegas Men Indicted \r\nfor Falsifying V... | [] |
| 4 | [Environment and Natural Resources Division] | The U.S. Department of Justice, the U.S. Envir... | 2018-07-09 04:00:00 | 18-898 | $100 Million Settlement Will Speed Cleanup Wor... | [Environment] |


In [54]:
import pickle

In [55]:
with open("doj.pkl", "wb") as f:
    pickle.dump(doj, f)

### Basic information

In [56]:
doj.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13087 entries, 0 to 13086
Data columns (total 6 columns):
components    13087 non-null object
contents      13087 non-null object
date          13087 non-null datetime64[ns]
id            12810 non-null object
title         13087 non-null object
topics        13087 non-null object
dtypes: datetime64[ns](1), object(5)
memory usage: 613.5+ KB


### Checking topic values

In [57]:
doj['topics'].value_counts()

[]                                                                                   8399
[Tax]                                                                                 706
[Consumer Protection]                                                                 335
[Civil Rights]                                                                        305
[Antitrust]                                                                           292
[Hate Crimes]                                                                         246
[Environment]                                                                         175
[Health Care Fraud]                                                                   174
[Project Safe Childhood]                                                              166
[Public Corruption]                                                                   144
[Counterterrorism]                                                                    140
[StopFraud

Roughly 64% of the press releases (8,399 of 13,087) have an empty topics list, and the ones that are categorized are spread across a wide range of topics. These are empty lists rather than nulls, which is why `doj.info()` reports no missing values for this column. The sparse labeling makes this a more interesting unsupervised learning problem: it may be possible to build topic models that produce a cleaner set of categories describing all of the documents.
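To make the difference between "empty" and "null" concrete, here is a minimal sketch on a made-up toy frame (the `toy` DataFrame and its values are illustrative assumptions, not the real data). It shows that `isna()` finds nothing to flag, while a length check does.

```python
import pandas as pd

# Toy stand-in for the real DataFrame; the values are invented for illustration.
toy = pd.DataFrame({
    "topics": [[], ["Tax"], [], ["Tax", "Civil Rights"], []],
})

# An empty list is a real Python object, so pandas does not treat it as missing.
n_null = toy["topics"].isna().sum()

# Counting list lengths is what actually reveals the unlabeled rows.
n_empty = (toy["topics"].apply(len) == 0).sum()
empty_share = n_empty / len(toy)

print(f"null rows: {n_null}, empty rows: {n_empty} ({empty_share:.0%})")
```

The same two lines should work on `doj['topics']` as long as every cell holds a list, which is what the output above suggests.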

### Checking article lengths and dealing with empty articles

In [58]:
lengths = [len(x) for x in doj['contents']]

In [59]:
max(lengths)

178106

In [60]:
min(lengths)

0

In [61]:
wordcounts = [len(x.split()) for x in doj['contents']]

In [62]:
max(wordcounts)

25254

In [63]:
min(wordcounts)

0

In [64]:
for i in doj.index:
    if len(doj.at[i,'contents']) == 0:
        print(doj.loc[i])

components                                  [Criminal Division]
contents                                                       
date                                        2013-02-12 05:00:00
id                                                       13-185
title         Florida Man Pleads Guilty to Federal Election ...
topics                                                       []
Name: 3403, dtype: object
components                                  [Criminal Division]
contents                                                       
date                                        2013-02-20 05:00:00
id                                                       13-217
title         North Carolina Commodities Firm Owner Sentence...
topics                                                       []
Name: 9304, dtype: object


Only two articles have empty contents, so this is not a huge problem. We will deal with them by simply assigning the title as the contents of the article.

In [65]:
for i in doj.index:
    if len(doj.at[i,'contents']) == 0:
        doj.at[i,'contents'] = doj.at[i,'title']

In [66]:
doj.loc[3403]

components                                  [Criminal Division]
contents      Florida Man Pleads Guilty to Federal Election ...
date                                        2013-02-12 05:00:00
id                                                       13-185
title         Florida Man Pleads Guilty to Federal Election ...
topics                                                       []
Name: 3403, dtype: object
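The row-by-row loop above is fine for two rows. As an alternative, the same fill can be sketched with a boolean mask, which avoids iterating over the index. The `toy` frame below is a made-up example, not the real data.

```python
import pandas as pd

# Toy stand-in with one empty article body; values are invented.
toy = pd.DataFrame({
    "title": ["Title A", "Title B", "Title C"],
    "contents": ["Body of A", "", "Body of C"],
})

# Boolean mask marking rows whose contents string has length zero.
empty = toy["contents"].str.len() == 0

# Copy the title into contents for the masked rows only.
toy.loc[empty, "contents"] = toy.loc[empty, "title"]

print(toy)
```

If whitespace-only bodies were also a concern, the mask could be built from `toy["contents"].str.strip()` instead; the notebook itself only checks for length zero.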

### Checking that all have titles

In [67]:
lengths = [len(x) for x in doj['title']]

In [68]:
min(lengths)

20

Good! The shortest title is 20 characters long, so every article has a title.

### Investigating "Components" column

In [69]:
doj['components'].value_counts()

[Criminal Division]                                                                                                                   2680
[Tax Division]                                                                                                                        1862
[Civil Division]                                                                                                                       926
[Civil Rights Division]                                                                                                                862
[Office of the Attorney General]                                                                                                       822
[Antitrust Division]                                                                                                                   798
[Environment and Natural Resources Division]                                                                                           798
[Civil Rights Division, Civ

The DOJ has a lot of different components, and a single press release can list more than one of them. This column may be a better way than the topics column to check whether our clusters make sense at the end, since the topics column is empty for most rows.
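Because `value_counts()` on a column of lists counts each distinct combination separately, a per-component tally needs the lists flattened first. A rough sketch, assuming a pandas version that has `Series.explode` (0.25 or later) and using an invented `toy` frame:

```python
import pandas as pd

# Toy stand-in; the component names mirror the output above but the rows are invented.
toy = pd.DataFrame({
    "components": [
        ["Criminal Division"],
        ["Tax Division"],
        ["Civil Rights Division", "Criminal Division"],
        ["Criminal Division"],
    ],
})

# explode() gives each list element its own row, so a release with two
# components contributes to both counts.
per_component = toy["components"].explode().value_counts()

# Number of releases that list more than one component.
n_multi = (toy["components"].apply(len) > 1).sum()

print(per_component)
print("multi-component releases:", n_multi)
```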

### Checking time range

In [70]:
doj['date'].min()

Timestamp('2009-01-05 05:00:00')

In [71]:
doj['date'].max()

Timestamp('2018-07-27 04:00:00')

In [72]:
doj['date'].isna().sum()

0

The dates run from 1/5/09 to 7/27/18, and every press release has one. Yay!
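Beyond the minimum and maximum, it can help to see how the releases are spread across years. A small sketch of that check on invented timestamps (the `toy` frame is an assumption for illustration):

```python
import pandas as pd

# Toy stand-in with a few timestamps in the same format as the real column.
toy = pd.DataFrame({
    "date": pd.to_datetime([
        "2009-01-05 05:00:00",
        "2009-06-01 04:00:00",
        "2018-07-27 04:00:00",
    ]),
})

# Pull the calendar year out of each timestamp and count releases per year,
# sorted chronologically rather than by frequency.
per_year = toy["date"].dt.year.value_counts().sort_index()

print(per_year)
```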

### Investigating 'ID' Column

In [73]:
doj['id'].value_counts()

13-526      3
11-750      2
11-780      2
11-065      2
13-763      2
15-142      2
12-409      2
12-1778     2
17-344      2
14-486      2
15-512      2
11-116      2
13-634      2
15-321      2
11-472      2
17-047      2
12-986      2
14-1171     2
14-1125     2
17-1453     2
11-278      2
15-864      2
15-947      2
15-135      2
10-1220     2
15-481      2
17-378      2
14-418      2
15-457      2
17-812      2
           ..
14-1412     1
17-798      1
14-1237     1
15-881      1
10-934      1
12-499      1
14-539      1
11-867      1
12-1051     1
17-867      1
16-1303     1
18 - 393    1
10-769      1
15-1404     1
13-389      1
14-1224     1
12-130      1
15-1200     1
09-1316     1
11-887      1
17-710      1
10-715      1
14-1307     1
18-204      1
11-296      1
14-664      1
16-381      1
12-415      1
12-105      1
13-1283     1
Name: id, Length: 12672, dtype: int64

It looks like the ID starts with the two-digit year of the press release, potentially followed by a sequence number within that year. Not every article has an ID (12,810 of 13,087 do), some IDs appear more than once, and at least one is formatted inconsistently (`18 - 393`). Each article does have a timestamp, so we can just use that and the index to identify the articles.
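The "year, then number" reading of the ID can be tested rather than eyeballed. Below is a hedged sketch that splits IDs with a regular expression and compares the year prefix to the date column. The `toy` rows, including the date paired with `18 - 393`, are invented, and the pattern is only a guess at the format based on the values shown above.

```python
import pandas as pd

# Toy stand-in; IDs copy the shapes seen above, dates are invented.
toy = pd.DataFrame({
    "id": ["12-919", "18 - 393", None, "10-015"],
    "date": pd.to_datetime(
        ["2012-07-25", "2018-03-30", "2014-10-01", "2010-01-08"]
    ),
})

# Split "YY-NNN" into year and sequence parts, tolerating spaces
# around the hyphen. Rows with a missing ID come back as NaN.
parts = toy["id"].str.extract(r"^\s*(?P<yy>\d{2})\s*-\s*(?P<seq>\d+)\s*$")

# Two-digit year taken from the timestamp, as a zero-padded string.
date_yy = toy["date"].dt.strftime("%y")

# True only where an ID exists and its prefix agrees with the date.
has_id = toy["id"].notna()
matches = parts["yy"].eq(date_yy) & has_id

print(parts.assign(matches=matches))
```

On the real data, `matches[has_id].mean()` would give the share of IDs whose prefix agrees with the date.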

### Getting rid of things unimportant to the analysis

In [74]:
doj.drop(columns='id',inplace=True)

Because the topics column is mostly empty, and the components column gives us more consistent information about subject matter, we'll drop the topics column for now too.

In [75]:
doj.drop(columns='topics',inplace=True)

In [78]:
doj.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13087 entries, 0 to 13086
Data columns (total 4 columns):
components    13087 non-null object
contents      13087 non-null object
date          13087 non-null datetime64[ns]
title         13087 non-null object
dtypes: datetime64[ns](1), object(3)
memory usage: 409.0+ KB


There is no need to keep articles with identical contents, so we drop duplicates based on the contents column.

In [81]:
doj.drop_duplicates(subset='contents',inplace=True)

In [82]:
doj.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 13081 entries, 0 to 13086
Data columns (total 4 columns):
components    13081 non-null object
contents      13081 non-null object
date          13081 non-null datetime64[ns]
title         13081 non-null object
dtypes: datetime64[ns](1), object(3)
memory usage: 511.0+ KB


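Looks like there were only six duplicates (13,087 rows down to 13,081). Phew. Note that the index is no longer a contiguous range: it still runs from 0 to 13086, with gaps where rows were dropped.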
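The number of rows that `drop_duplicates` will remove can be counted up front with `duplicated`, which is a handy sanity check. A minimal sketch on an invented `toy` frame:

```python
import pandas as pd

# Toy stand-in with one repeated article body; values are invented.
toy = pd.DataFrame({
    "title": ["A", "A again", "B", "C"],
    "contents": ["same text", "same text", "other text", "third text"],
})

# duplicated() marks every repeat after the first occurrence, which is
# the same set of rows drop_duplicates() removes with its default keep="first".
n_dupes = toy.duplicated(subset="contents").sum()

# Non-inplace version, so the original frame stays available for comparison.
deduped = toy.drop_duplicates(subset="contents")

print(f"{n_dupes} duplicate row(s); {len(toy)} -> {len(deduped)} rows")
```

If a gap-free index is wanted afterwards, `deduped.reset_index(drop=True)` would renumber the rows, at the cost of losing the original labels.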

In [84]:
doj['title'].value_counts()

Northern California Real Estate Investor Agrees to Plead Guilty to Bid Rigging at Public Foreclosure Auctions                                8
Justice Department to Monitor Elections in Texas                                                                                             7
President Obama Grants Commutations                                                                                                          6
Miami-Area Resident Pleads Guilty to Participating in $200 Million Medicare Fraud Scheme                                                     5
Deputy Attorney General Sally Q. Yates Statement on the President’s Recent Clemency Decisions                                                5
Justice Department to Monitor Elections in Mississippi                                                                                       4
President Obama Grants Commutations and Pardons                                                                                              4

A few titles appear more than once. Since rows with identical contents have already been dropped, the text of these press releases differs at least slightly, which could give the model useful information. Some of the headlines are also general enough that they could refer to different events. We'll keep them in for now.
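One way to look at the repeated titles side by side is to pull out every row that shares a title and count the distinct bodies under each one. This is only a sketch on invented rows (the `toy` frame and its headlines are assumptions):

```python
import pandas as pd

# Toy stand-in: one headline reused for two different bodies.
toy = pd.DataFrame({
    "title": ["Monitor Elections", "Monitor Elections", "Unique Headline"],
    "contents": ["Text about 2012.", "Text about 2014.", "Something else."],
})

# keep=False flags every row of a repeated title, not just the later ones.
repeated = toy[toy.duplicated(subset="title", keep=False)]

# Number of distinct article bodies under each repeated title.
bodies_per_title = repeated.groupby("title")["contents"].nunique()

print(bodies_per_title)
```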

### Saving cleaned data to file

In [86]:
with open("doj_c.pkl", "wb") as f:
    pickle.dump(doj, f)
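For loading the file back later, pandas has its own pickle helpers, which are an alternative to calling the `pickle` module directly. The sketch below round-trips an invented `toy` frame through a temporary directory; the file name `toy.pkl` is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in with a list column and a datetime column, like the real data.
toy = pd.DataFrame({
    "components": [["Criminal Division"], ["Tax Division"]],
    "date": pd.to_datetime(["2013-02-12 05:00:00", "2013-02-20 05:00:00"]),
    "title": ["First headline", "Second headline"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.pkl")  # hypothetical file name
    toy.to_pickle(path)                  # write the frame to disk
    restored = pd.read_pickle(path)      # read it back

# Unlike a CSV round trip, the lists and datetimes come back as-is.
print(restored.dtypes)
```

As with any pickle, files should only be loaded from sources you trust.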