## Data Merging and Cleaning

This file assembles two "master" dataframes for experimentation. The first dataframe contains one row per arrest, with that day's weather information attached. The second dataframe contains the number of arrests each day with the associated mean weather information.

### How to use:

1. **Download CSV data and place in the ./input directory**
    - [NYPD_Arrest_Data__Year_to_Date_.csv](https://drive.google.com/file/d/1Ee7dSLLK7EdiMwiE94uK3p0SiDd7vTrP/view?usp=sharing)
    - [NYPD_Arrests_Data__Historic_.csv](https://drive.google.com/file/d/1g_Iok1V2NnWKBy0r9qGG7XavUlOhDzWe/view?usp=sharing)
    - [daily_new_york_data.csv](https://drive.google.com/file/d/1_B0xP4ORTzHCG9S4LaEB_yEjC1mvITXU/view?usp=sharing)
2. **Run cells in order from top to bottom**
    - Run each cell only once; some cells modify dataframes in place (for example, setting the index), so re-running them can raise errors
3. **View output in the ./output directory**
    - We can either output files in our program, or call this module directly from another module to receive this data.

In [None]:
import pandas as pd

### Arrest data

We can load all arrests, or filter by law codes or by level of offense. If we don't want to filter anything, then everything in the next two cells should be commented out. If we want to filter by one or both, uncomment one or both cells and set the variables to the desired values.

**Dataset**
   - Current arrest dataset: https://data.cityofnewyork.us/Public-Safety/NYPD-Arrest-Data-Year-to-Date-/uip8-fykc
   - Historic arrest dataset: https://data.cityofnewyork.us/Public-Safety/NYPD-Arrests-Data-Historic-/8h9b-rp9u
    
**Resources**
   - New York State Penal Law Offenses: https://ypdcrime.com/penallawlist.php?tfm_order=DESC&tfm_orderby=code
   - Laws of New York: https://ypdcrime.com/penal.law/

In [None]:
# Filter by law code(s).
# These codes seem rather arbitrary, see: https://ypdcrime.com/penallawlist.php?tfm_order=DESC&tfm_orderby=code
# PL 1211200 is "FELONY ASSAULT" for example.
# PL 1303501 is "RAPE" for example.

# Comment out to disable:
law_codes = ['PL 12']

In [None]:
# Filter by level of offense.
# F: felony; M: misdemeanor; V: violation

# Comment out to disable:
law_cat_code = 'F'

In [None]:
historic_arrests = pd.read_csv('./input/NYPD_Arrests_Data__Historic_.csv')

In [None]:
current_arrests = pd.read_csv('./input/NYPD_Arrest_Data__Year_to_Date_.csv')

In [None]:
# Number of dates.
print('Number of unique dates: ' + str(len(current_arrests['ARREST_DATE'].unique()) + len(historic_arrests['ARREST_DATE'].unique())))

In [None]:
# Number of arrests.
print('Number of arrests: ' + str(len(current_arrests['ARREST_DATE']) + len(historic_arrests['ARREST_DATE'])))

In [None]:
# Append historic data to current data for a full list.
# pd.concat is used because DataFrame.append was removed in pandas 2.0.
total_arrests = pd.concat([current_arrests, historic_arrests])

In [None]:
def format_date_arrests(date):
    split = date.split('/')
    return split[2] + split[0] + split[1]

In [None]:
total_arrests['date'] = total_arrests['ARREST_DATE'].apply(format_date_arrests)
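As an alternative to splitting strings by hand, the same conversion can be sketched with pandas' own date parsing. This assumes `ARREST_DATE` is always formatted as `MM/DD/YYYY`, which is what `format_date_arrests` implies; `toy_dates` is made-up sample data, not rows from the real file.

```python
import pandas as pd

# Made-up sample values in the assumed MM/DD/YYYY format.
toy_dates = pd.Series(['01/05/2019', '12/31/2018'])

# Parse with an explicit format, then render as YYYYMMDD strings.
as_yyyymmdd = pd.to_datetime(toy_dates, format='%m/%d/%Y').dt.strftime('%Y%m%d')
print(as_yyyymmdd.tolist())  # ['20190105', '20181231']
```

Parsing with an explicit format has the side benefit of raising an error on malformed dates instead of silently producing a bad key.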

In [None]:
def get_law_code_prefix(code):
    # Missing law codes are NaN (a float), so only slice actual strings.
    if isinstance(code, str):
        return code[:5]
    else:
        return code

In [None]:
# Create a column with just the first five characters of the law code for filtering.
total_arrests['law_code_abbr'] = total_arrests['LAW_CODE'].apply(get_law_code_prefix)

In [None]:
# Filter arrests in a new dataframe.

# Limit by law code if set.
# A NameError here just means the filter cell above was commented out.
try:
    arrests = total_arrests[total_arrests['law_code_abbr'].isin(law_codes)]
except NameError:
    arrests = total_arrests

# Limit by category code if set.
try:
    arrests = arrests[arrests['LAW_CAT_CD'] == law_cat_code]
except NameError:
    pass
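The optional filters can also be expressed without relying on whether a variable exists. The sketch below uses a hypothetical helper, `filter_arrests`, that takes `None` to mean "no filter"; the helper name and the toy rows are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

def filter_arrests(df, law_codes=None, law_cat_code=None):
    """Hypothetical helper: apply each filter only when a value is given."""
    result = df
    if law_codes is not None:
        # Keep rows whose five-character prefix is in the requested list.
        result = result[result['law_code_abbr'].isin(law_codes)]
    if law_cat_code is not None:
        # Keep rows at the requested offense level (F, M, or V).
        result = result[result['LAW_CAT_CD'] == law_cat_code]
    return result

# Made-up rows standing in for total_arrests.
toy_arrests = pd.DataFrame({
    'law_code_abbr': ['PL 12', 'PL 12', 'PL 13', None],
    'LAW_CAT_CD': ['F', 'M', 'F', 'F'],
})

felony_pl12 = filter_arrests(toy_arrests, law_codes=['PL 12'], law_cat_code='F')
print(len(felony_pl12))                  # 1
print(len(filter_arrests(toy_arrests)))  # 4 (no filters applied)
```

With this shape, disabling a filter means passing `None` rather than commenting out a cell.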

In [None]:
# Drop rows missing important data.
arrests = arrests.dropna(subset=['PD_DESC', 'OFNS_DESC'])

In [None]:
arrests.head()

### Weather data
#### Hourly weather

In [None]:
hourly_weather = pd.read_csv('./input/daily_new_york_data.csv')

# Grabbing more than temperature just so they're handy if we want to play a bit.
hourly_weather = hourly_weather[['dt', 'temp', 'feels_like', 'temp_min', 'temp_max', 
               'humidity', 'wind_speed', 'wind_deg', 'rain_1h', 
               'rain_3h', 'snow_1h', 'snow_3h', 'clouds_all']]

In [None]:
print('Amount of hourly weather data: ' + str(len(hourly_weather)))

In [None]:
from datetime import datetime
from pytz import timezone # for timezone awareness

def format_date_weath(dt):
    # Interpret the Unix timestamp in New York local time and format it as YYYYMMDD.
    localtz = timezone('America/New_York')
    dt_aware = datetime.fromtimestamp(dt, tz=localtz)
    return dt_aware.strftime('%Y%m%d')

In [None]:
# Format dates like the other dataframes.
hourly_weather['date'] = hourly_weather['dt'].apply(lambda dt: datetime.utcfromtimestamp(dt).strftime('%Y%m%d'))
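Note that the cell above assigns each hour to a UTC calendar day. If the arrest dates are local New York dates (an assumption about the NYPD data), hours late in the New York evening fall on the next UTC day. A minimal sketch of the difference, using two made-up timestamps:

```python
import pandas as pd

# Made-up Unix timestamps: 2020-01-01 00:00 UTC and 2020-01-01 05:00 UTC.
toy_dt = pd.Series([1577836800, 1577854800])

utc_times = pd.to_datetime(toy_dt, unit='s', utc=True)

# Calendar day in UTC versus calendar day in New York.
utc_dates = utc_times.dt.strftime('%Y%m%d')
local_dates = utc_times.dt.tz_convert('America/New_York').dt.strftime('%Y%m%d')

print(utc_dates.tolist())    # ['20200101', '20200101']
print(local_dates.tolist())  # ['20191231', '20200101']
```

Whether this shift matters depends on how sensitive the daily averages are to a few hours moving between days.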

#### Daily weather

In [None]:
# Daily weather averages and extremes.
# Copy so that filling missing values below does not modify hourly_weather.
daily_weather = hourly_weather.copy()
daily_weather[['rain_1h', 'rain_3h', 
            'snow_1h', 'snow_3h']] = daily_weather[['rain_1h', 'rain_3h', 
                                                'snow_1h', 'snow_3h']].fillna(value=0)

daily_weather = daily_weather.groupby(['date']).agg({'temp':'mean', 'feels_like':'mean', 
                                        'temp_min': 'min', 'temp_max': 'max',
                                        'humidity': 'mean', 'wind_speed': 'mean',
                                        'wind_deg': 'mean', 'rain_1h': 'mean',
                                        'rain_3h': 'mean', 'snow_1h': 'mean',
                                        'snow_3h': 'mean', 'clouds_all': 'mean'})

daily_weather['date'] = daily_weather.index

In [None]:
# Number of days of weather averages.
print('Number of days of weather averages: ' + str(len(daily_weather)))
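To make the aggregation concrete, here is a small sketch of the same pattern on made-up hourly rows: missing precipitation is treated as zero, then each day is reduced with a per-column function. The `toy_hourly` values are invented for illustration.

```python
import pandas as pd

# Made-up hourly readings for two days; None means no rain was reported.
toy_hourly = pd.DataFrame({
    'date': ['20200101', '20200101', '20200102'],
    'temp': [0.0, 4.0, 10.0],
    'temp_min': [-1.0, 3.0, 9.0],
    'temp_max': [1.0, 5.0, 11.0],
    'rain_1h': [None, 2.0, None],
})

# Treat missing precipitation as zero before averaging.
toy_hourly['rain_1h'] = toy_hourly['rain_1h'].fillna(0)

# One row per day: mean temperature, daily extremes, mean hourly rain.
toy_daily = toy_hourly.groupby('date').agg({
    'temp': 'mean', 'temp_min': 'min', 'temp_max': 'max', 'rain_1h': 'mean',
})
print(toy_daily)
```

Filling with zero before taking the mean matters: without it, the mean would be computed over only the hours that reported rain.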

### Combined arrest data and weather data

This dataframe contains more detail, including each arrest's description and its location in the city by borough. It may or may not be used, depending on whether we have time for more elaborate visualizations. Otherwise, the next dataframe, which shows daily summaries, is what we'll focus on first.

In [None]:
# Set date as index for merging.
arrests.set_index(['date'], inplace=True)

In [None]:
# Merge arrest data and weather data.
all_arrests = pd.merge(left=arrests, right=daily_weather, how='left',
                        left_index=True, right_index=True)

all_arrests = all_arrests[['PD_DESC', 'OFNS_DESC', 'LAW_CODE', 'LAW_CAT_CD',
                             'ARREST_BORO', 'AGE_GROUP', 
                             'PERP_SEX', 'PERP_RACE', 'temp', 
                             'feels_like', 'temp_min', 'temp_max', 'humidity', 
                             'wind_speed', 'wind_deg', 'rain_1h', 'rain_3h', 
                             'snow_1h', 'snow_3h']]

all_arrests.rename(columns={'PD_DESC': 'pd_desc', 'OFNS_DESC': 'ofns_desc', 
                             'LAW_CODE': 'law_code', 'LAW_CAT_CD': 'law_cat_cd',
                             'ARREST_BORO': 'arrest_boro',
                             'AGE_GROUP': 'age_group', 'PERP_SEX': 'perp_sex',
                             'PERP_RACE': 'perp_race'}, inplace=True)

all_arrests.dropna(inplace=True)

In [None]:
# Total number of arrests with associated weather info.
print('Total number of arrests with associated weather info: ' + str(len(all_arrests)))

In [None]:
# Convert borough code to borough name.
def get_borough(b):
    boroughs = {'B': 'The Bronx', 'K': 'Brooklyn', 'M': 'Manhattan',
                'Q': 'Queens', 'S': 'Staten Island'}
    # Leave unrecognized codes unchanged rather than mislabeling them.
    return boroughs.get(b, b)

In [None]:
# Clean up categorical text.
all_arrests['pd_desc'] = all_arrests['pd_desc'].str.capitalize()
all_arrests['ofns_desc'] = all_arrests['ofns_desc'].str.capitalize()
all_arrests['arrest_boro'] = all_arrests['arrest_boro'].apply(get_borough)

### Write to file

In [None]:
all_arrests.to_csv('./output/all_arrests.csv')

### Combined daily arrest data and weather data

This dataframe will likely be our primary dataset, since it's giving us a day-by-day arrest count with the mean weather conditions for that day.

In [None]:
daily_arrests = all_arrests.groupby(['date']).count()
daily_arrests.drop(columns=['ofns_desc', 'arrest_boro'], inplace=True)
daily_arrests.rename(columns={'pd_desc': 'num_arrests'}, inplace=True)
daily_arrests = daily_arrests[['num_arrests']]
daily_arrests = pd.merge(left=daily_arrests, right=daily_weather, 
                      how='left', left_index=True, right_index=True)
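The per-day count can also be sketched with `groupby(...).size()`, which counts rows directly instead of counting non-null values in one column. The toy frame below is made up and only mimics the shape of `all_arrests` (indexed by `date`).

```python
import pandas as pd

# Made-up arrests indexed by date, mimicking the shape of all_arrests.
toy_all = pd.DataFrame(
    {'pd_desc': ['Assault 3', 'Assault 3', 'Robbery']},
    index=pd.Index(['20200101', '20200101', '20200102'], name='date'),
)

# Count rows per date and name the resulting column.
toy_counts = toy_all.groupby(level='date').size().rename('num_arrests').to_frame()
print(toy_counts)
```

Because `all_arrests` has already had rows with missing values dropped, the two approaches should agree there; `size()` simply makes the intent more explicit.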

In [None]:
# Total number of arrests with associated weather info:
# (Compare with total arrests above for sanity check.)
print('Total number of arrests with associated weather info: ' + str(daily_arrests['num_arrests'].sum()))

In [None]:
# Total number of days with arrests and associated weather info
# Sanity check number of days where we have crime and weather data.
# Perhaps check out why this number is a tad lower than the weather df.
print('Total number of days with arrests and associated weather info: ' + str(len(daily_arrests)))
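One way to investigate why the day count is lower than in the weather dataframe is an outer merge with `indicator=True`, which labels each row by the side it came from. The sketch below uses made-up toy frames; applying it to `daily_weather` and `daily_arrests` is left as an assumption about how the check would be wired in.

```python
import pandas as pd

# Made-up daily frames: weather covers three days, arrests only two.
toy_weather_days = pd.DataFrame(
    {'temp': [1.0, 2.0, 3.0]},
    index=pd.Index(['20200101', '20200102', '20200103'], name='date'),
)
toy_arrest_days = pd.DataFrame(
    {'num_arrests': [5, 7]},
    index=pd.Index(['20200101', '20200103'], name='date'),
)

# The _merge column reports 'left_only', 'right_only', or 'both' per row.
coverage = pd.merge(toy_weather_days, toy_arrest_days, how='outer',
                    left_index=True, right_index=True, indicator=True)

# Days that have weather data but no arrests.
weather_only = coverage[coverage['_merge'] == 'left_only'].index.tolist()
print(weather_only)  # ['20200102']
```

Days that show up as `right_only` would instead point to arrests with no matching weather record.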

### Write to file

In [None]:
daily_arrests.to_csv('./output/daily_arrests.csv')

### EDA visualizations

Display a grid of regression plots to quickly see whether there is any correlation between the number of arrests per day and the weather conditions.

In [None]:
daily_arrests.head()

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(ncols=2, nrows=4, sharey=True, figsize=(24, 24))
ax1, ax2, ax3, ax4, ax5, ax6, ax7, ax8 = ax.flatten()

sns.regplot(x=daily_arrests['temp'], y=daily_arrests['num_arrests'], ax=ax1).set_title("Temp")
sns.regplot(x=daily_arrests['feels_like'], y=daily_arrests['num_arrests'], ax=ax2).set_title("Feels Like")
sns.regplot(x=daily_arrests['temp_min'], y=daily_arrests['num_arrests'], ax=ax3).set_title("Temp: Min")
sns.regplot(x=daily_arrests['temp_max'], y=daily_arrests['num_arrests'], ax=ax4).set_title("Temp: Max")
sns.regplot(x=daily_arrests['rain_1h'], y=daily_arrests['num_arrests'], ax=ax5).set_title("Rain: 1 Hour")
sns.regplot(x=daily_arrests['rain_3h'], y=daily_arrests['num_arrests'], ax=ax6).set_title("Rain: 3 Hours")
sns.regplot(x=daily_arrests['snow_1h'], y=daily_arrests['num_arrests'], ax=ax7).set_title("Snow: 1 Hour")
sns.regplot(x=daily_arrests['snow_3h'], y=daily_arrests['num_arrests'], ax=ax8).set_title("Snow: 3 Hours")
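The plots can be complemented with Pearson correlation coefficients, which put a number on each relationship. This is a minimal sketch on a made-up frame shaped like `daily_arrests`; the values are invented and say nothing about the real data.

```python
import pandas as pd

# Made-up daily summary shaped like daily_arrests.
toy_daily = pd.DataFrame({
    'num_arrests': [10, 12, 14, 16],
    'temp': [1.0, 2.0, 3.0, 4.0],
    'rain_1h': [0.3, 0.0, 0.3, 0.0],
})

# Pearson correlation of every weather column with the arrest count.
correlations = toy_daily.corr()['num_arrests'].drop('num_arrests')
print(correlations)
```

Values near zero would support the visual impression of little linear relationship, though they would not rule out nonlinear effects.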

### Remarks

There doesn't seem to be much correlation when we look at all types of crime. Some crime types may be more sensitive to the weather than others, so the next step is to narrow the data down to violent crimes or street crimes and see whether a relationship appears.