# Data Extraction
This Python notebook downloads the dataset, processes it, displays it in tabular form, and stores the result in a pickle file for easier reuse.

**Authors:** Kevin Lu, Shrusti Jain, Smeet Patel, Taobo Liao

# Imports and Graph Configurations

In [None]:
import numpy
import pandas as pd
import time
import tensorflow as tf
import random
import matplotlib
import torch
#%matplotlib notebook
import matplotlib.pyplot as plt
import scipy.stats
import matplotlib.offsetbox as offsetbox
from matplotlib.ticker import StrMethodFormatter

In [None]:
# For some reason, this needs to be in a separate cell.
params = {
    "font.size": 15,
    "lines.linewidth": 5,
}
plt.rcParams.update(params)

In [None]:
!gdown 1enR3DLH7iDuI0mG8rV3Z21tPhdZZRXOv
!gdown 1zeyltSH_KaN0qQCRCiZR8kXOG6VUXU9T

Downloading...
From (original): https://drive.google.com/uc?id=1enR3DLH7iDuI0mG8rV3Z21tPhdZZRXOv
From (redirected): https://drive.google.com/uc?id=1enR3DLH7iDuI0mG8rV3Z21tPhdZZRXOv&confirm=t&uuid=17798677-87e6-4aed-8b47-ee16936e66cf
To: /content/train.pkl
100% 224M/224M [00:02<00:00, 94.3MB/s]
Downloading...
From: https://drive.google.com/uc?id=1zeyltSH_KaN0qQCRCiZR8kXOG6VUXU9T
To: /content/debug.pkl
100% 11.2M/11.2M [00:00<00:00, 55.7MB/s]


# Data Extraction Sets

The dataset comes from the public website https://catalog.data.gov/dataset/crime-data-from-2020-to-present. It has already been loaded and converted to a pickle file for easier storage.
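The notebook does not show how the original conversion to pickle was done. As a minimal sketch, assuming the source was a CSV export, the step might look like the following. The two-row sample (including its raw date strings), `sample_csv`, and `pickle_path` are illustrative stand-ins, not the authors' actual files.

```python
import io
import os
import tempfile

import pandas as pd

# Hypothetical two-row sample standing in for the downloaded CSV export.
sample_csv = io.StringIO(
    "DR_NO,Date Rptd,DATE OCC,TIME OCC,AREA NAME\n"
    "190326475,03/01/2020 12:00:00 AM,03/01/2020 12:00:00 AM,2130,Wilshire\n"
    "200106753,02/09/2020 12:00:00 AM,02/08/2020 12:00:00 AM,1800,Central\n"
)

# Load the CSV into a DataFrame, then serialize it as a pickle file.
raw_df = pd.read_csv(sample_csv)
pickle_path = os.path.join(tempfile.mkdtemp(), "sample.pkl")
raw_df.to_pickle(pickle_path)

# Reading the pickle back restores the same columns and dtypes.
restored_df = pd.read_pickle(pickle_path)
print(restored_df.dtypes)
```

Unlike CSV, a pickle preserves column dtypes across the round trip, which is one plausible reason for choosing it here.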

In [None]:
crime_df = pd.read_pickle('/content/train.pkl')

In [None]:
crime_df.columns

Index(['DR_NO', 'Date Rptd', 'DATE OCC', 'TIME OCC', 'AREA', 'AREA NAME',
       'Rpt Dist No', 'Part 1-2', 'Crm Cd', 'Crm Cd Desc', 'Mocodes',
       'Vict Age', 'Vict Sex', 'Vict Descent', 'Premis Cd', 'Premis Desc',
       'Weapon Used Cd', 'Weapon Desc', 'Status', 'Status Desc', 'Crm Cd 1',
       'Crm Cd 2', 'Crm Cd 3', 'Crm Cd 4', 'LOCATION', 'Cross Street', 'LAT',
       'LON'],
      dtype='object')

In [None]:
# Convert the Date Rptd and DATE OCC columns from strings to datetime objects.
crime_df['Date Rptd'] = pd.to_datetime(crime_df['Date Rptd'], format='%m/%d/%Y %I:%M:%S %p')
crime_df['DATE OCC'] = pd.to_datetime(crime_df['DATE OCC'], format='%m/%d/%Y %I:%M:%S %p')
# crime_df['TIME OCC'] = pd.to_datetime(crime_df['TIME OCC'], format='')
# TIME OCC is not converted because its format is inconsistent/unknown, e.g., the TIME OCC value at column 11 is merely '1'.
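One possible reading of `TIME OCC`, which the notebook does not confirm, is that it stores 24-hour `HHMM` clock times as integers, so leading zeros are dropped and `1` would mean 00:01. Under that assumption, a conversion could be sketched as below. `parse_time_occ` is a hypothetical helper, not part of the original code.

```python
import pandas as pd


def parse_time_occ(values):
    """Parse HHMM-style values into datetime.time objects.

    Assumption: values are 24-hour clock times stored as integers
    (or integer-like strings) with leading zeros dropped, e.g. 1 -> 00:01.
    """
    # Restore the dropped leading zeros: 1 -> '0001', 2130 -> '2130'.
    padded = pd.Series(values).astype(str).str.zfill(4)
    # Values that do not fit HHMM become NaT instead of raising an error.
    return pd.to_datetime(padded, format='%H%M', errors='coerce').dt.time


parsed = parse_time_occ([1, 2130, 1800])
print(parsed.tolist())
```

If the column holds floats or other non-integer values, the zero-padding step would need adjusting before this approach applies.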


In [None]:
crime_df.head()

| | DR_NO | Date Rptd | DATE OCC | TIME OCC | AREA | AREA NAME | Rpt Dist No | Part 1-2 | Crm Cd | Crm Cd Desc | ... | Status | Status Desc | Crm Cd 1 | Crm Cd 2 | Crm Cd 3 | Crm Cd 4 | LOCATION | Cross Street | LAT | LON |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 190326475 | 2020-03-01 | 2020-03-01 | 2130 | 7 | Wilshire | 784 | 1 | 510 | VEHICLE - STOLEN | ... | AA | Adult Arrest | 510.0 | 998.0 | | | 1900 S LONGWOOD AV | | 34.0375 | -118.3506 |
| 1 | 200106753 | 2020-02-09 | 2020-02-08 | 1800 | 1 | Central | 182 | 1 | 330 | BURGLARY FROM VEHICLE | ... | IC | Invest Cont | 330.0 | 998.0 | | | 1000 S FLOWER ST | | 34.0444 | -118.2628 |
| 2 | 200320258 | 2020-11-11 | 2020-11-04 | 1700 | 3 | Southwest | 356 | 1 | 480 | BIKE - STOLEN | ... | IC | Invest Cont | 480.0 | | | | 1400 W 37TH ST | | 34.021 | -118.3002 |
| 3 | 200907217 | 2023-05-10 | 2020-03-10 | 2037 | 9 | Van Nuys | 964 | 1 | 343 | SHOPLIFTING-GRAND THEFT ($950.01 & OVER) | ... | IC | Invest Cont | 343.0 | | | | 14000 RIVERSIDE DR | | 34.1576 | -118.4387 |
| 4 | 220614831 | 2022-08-18 | 2020-08-17 | 1200 | 6 | Hollywood | 666 | 2 | 354 | THEFT OF IDENTITY | ... | IC | Invest Cont | 354.0 | | | | 1900 TRANSIENT | | 34.0944 | -118.3277 |


In [None]:
# Save two versions of the dataset in pickle format. The first
# 50,000 rows are saved to debug.pkl for debugging purposes,
# while the entire dataset is saved to train.pkl for training
# or further analysis.
# Note: both paths are relative, so if the working directory is
# /content, this overwrites the files downloaded earlier.

debug_path = 'debug.pkl'
train_path = 'train.pkl'
crime_df.iloc[:50000].to_pickle(debug_path)
crime_df.to_pickle(train_path)
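As a quick sanity check on this save step, the same pattern can be exercised on a toy frame and read back. This is only a sketch: `save_debug_and_full`, `toy_df`, and the temporary output directory are hypothetical names introduced for illustration.

```python
import os
import tempfile

import pandas as pd


def save_debug_and_full(df, out_dir, debug_rows=50000):
    """Write the first `debug_rows` rows and the full frame as pickles."""
    debug_path = os.path.join(out_dir, "debug.pkl")
    train_path = os.path.join(out_dir, "train.pkl")
    df.iloc[:debug_rows].to_pickle(debug_path)
    df.to_pickle(train_path)
    return debug_path, train_path


# Toy data standing in for crime_df; a temp directory avoids overwriting real files.
toy_df = pd.DataFrame({"DR_NO": range(10), "AREA": [1] * 10})
debug_path, train_path = save_debug_and_full(toy_df, tempfile.mkdtemp(), debug_rows=3)

# Read both files back to confirm the subset and the full frame were stored.
debug_df = pd.read_pickle(debug_path)
train_df = pd.read_pickle(train_path)
print(len(debug_df), len(train_df))
```

Note that `iloc[:n]` takes the first rows in stored order, so the debug subset is not a random sample of the data.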

