# Importing Modules

We will import the basics and a few specialty modules used throughout the analysis.

In [1]:
import numpy as np
import pandas as pd
from pandas import DataFrame, Series
%matplotlib inline
from matplotlib import pyplot as plt
import matplotlib.dates as mdates
import matplotlib.mlab as mlab
import matplotlib.ticker as tkr
import geocoder as geo
import datetime as dt

# Getting Started

To get started, we will first describe the data. The majority of the analysis will be centered around felony incidents in New York City and the school violence incidents within NYC schools. For our analysis, we will also need population data as well as school location data.


To be clear, we will be analyzing a limited felony dataset covering 7 types of felony. For our purposes, we will use the general Wikipedia definitions. Note that these definitions can vary from state to state:
- Grand larceny: theft of personal property having a value above a legally specified amount.
- Robbery: the taking of another person's property by force, fear, or intimidation, with the intent to permanently deprive that person of the property.
- Burglary: typically defined as the unlawful entry into almost any structure (not just a home or business) with the intent to commit any crime inside (not just theft/larceny). No physical breaking and entering is required; the offender may simply trespass through an open door. Unlike robbery, which involves use of force or fear to obtain another person's property, there is usually no victim present during a burglary.
- Assault: carried out by a threat of bodily harm coupled with an apparent, present ability to cause the harm.
- Grand Larceny of Motor Vehicle: the unlawful taking of property — in this case, a vehicle — that belongs to someone else, done with the intent to permanently deprive the owner of the property.
- Rape: non-consensual sexual intercourse that is committed by physical force, threat of injury, or other duress.
- Murder & Non-negligent Manslaughter: The unjustifiable, inexcusable, and intentional killing of a human being with (murder) or without (non-negligent) deliberation, premeditation, and malice.


# Importing Data

Import the felony data, population data, school violence data (VADIR files), and school location information. 

In [2]:
# Direct loading - no extra manipulation needed
# parse_dates expects a list of column names, not a bare string
felony_df = pd.read_csv("NYPD_7_Major_Felony_Incidents.csv", parse_dates = ["Occurrence Date"])
borough_df = pd.read_csv("New_York_City_Population_By_Boroughs.csv")
location_df = pd.read_excel("15-16SchoolDirectory.xlsx")

### VADIR datasets

VADIR [school violence] dataset: The 2010 - 2014 files share the same column names, so they are easy to import and concatenate. However, the 2006 - 2009 files each have a few unique column names, so each must be cleaned and concatenated to the main dataframe separately. Moreover, some columns have "w/ weapon" and others have "w/out weapon" in their names. These are shortened to the suffixes "_ww" and "_nw" for with weapon and no weapon, respectively.
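
As a minimal sketch of the renaming rule, assume the header row leaves a blank cell after each incident name, with the blank cell standing for the no-weapon column. The toy header row and the function name `label_weapon_columns` are illustrative only and are not taken from the VADIR files:

```python
import numpy as np
import pandas as pd

# Toy header row (illustrative): each incident name is followed by a blank cell
raw = pd.DataFrame([["School Name", "Burglary", np.nan, "Robbery", np.nan]])

def label_weapon_columns(header_row):
    """Give each blank cell the previous name plus '_nw', and suffix that previous name with '_ww'."""
    names = list(header_row)
    for idx in range(1, len(names)):
        if pd.isna(names[idx]):
            names[idx] = names[idx - 1] + "_nw"  # no weapon
            names[idx - 1] += "_ww"              # with weapon
    return names

toy_names = label_weapon_columns(raw.iloc[0])
print(toy_names)
```

Starting the loop at index 1 avoids wrapping around to the last name if the very first cell happens to be blank.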

In [3]:
# Just load the VADIR datasets and concatenate to a single dataframe

def clean_vars(df, row_num):
    """ 
    Helper function to clean up column names for VADIR files.
    INPUT: dataframe, index of the row that contains the names
    OUTPUT: list of column names
    """
    names = list(df.iloc[row_num])
    for idx in range(len(names)):
        if str(names[idx]) == 'nan':
            names[idx] = names[idx - 1] + '_nw' # 'no weapon'
            names[idx - 1] += '_ww' # 'with weapon'
    return names


# loading 2014-15 VADIR file
v2014_df = pd.read_excel("VADIR_2014.xls")

# build the cleaned column names from the header row:
COLNAMES = clean_vars(v2014_df, 1)

vadir_df = pd.read_excel('VADIR_2014.xls', names = COLNAMES, skiprows = 3)
vadir_df['School Year'] = ['2014'] * len(vadir_df)
print( 'Data from 2014-15 loaded. Total of {} rows.'.format(len(vadir_df)))

# import files for 2010-13 (they have same format as 2014 so this is quick)
files = ['VADIR_2013.xls', 'VADIR_2012.xls', 'VADIR_2011.xls', 'VADIR_2010.xls']
for f in files:
    df = pd.read_excel(f, names = COLNAMES, skiprows = 3)
    df['School Year'] = [f[-8:-4]] * len(df)
    vadir_df = pd.concat([vadir_df, df], ignore_index = True)
    print( '... data from {} appended. Added {} rows for a total of {}.'.format(f, len(df), len(vadir_df)))
    
# ... starting with 2009
vadir_2009 = pd.read_excel('VADIR_2009.xls')
vadir_2009.columns = clean_vars(vadir_2009, 1)[:3] + ['District'] + clean_vars(vadir_2009, 1)[4:]
vadir_2009 = vadir_2009[3:]
#vadir_2009.head()
vadir_2009['School Year'] = [2009] * len(vadir_2009)
vadir_df = pd.concat([vadir_df, vadir_2009], ignore_index = True)
print( '... data from 2009-10 appended. Added {} rows for a total of {}.'.format(len(vadir_2009), len(vadir_df)))

# ... next 2008
vadir_2008 = pd.read_excel('VADIR_2008.xls')
vadir_2008.columns = clean_vars(vadir_2008, 1)[:2] + ['District'] + clean_vars(vadir_2008, 1)[3:]
vadir_2008 = vadir_2008[3:]
#vadir_2008.head()
vadir_2008['School Year'] = [2008] * len(vadir_2008)
vadir_df = pd.concat([vadir_df, vadir_2008], ignore_index = True)
print( '... data from 2008-9 appended. Added {} rows for a total of {}.'.format(len(vadir_2008), len(vadir_df)))

# ... next 2007
vadir_2007 = pd.read_excel('VADIR_2007.xls')
vadir_2007.columns = ['County', 'BEDS Code', 'District'] + clean_vars(vadir_2007, 0)[3:]
vadir_2007 = vadir_2007[2:]
#vadir_2007.head()
vadir_2007['School Year'] = [2007] * len(vadir_2007)
vadir_df = pd.concat([vadir_df, vadir_2007], ignore_index = True)
print( '... data from 2007-8 appended. Added {} rows for a total of {}.'.format(len(vadir_2007), len(vadir_df)))

# ... next 2006
vadir_2006 = pd.read_excel('VADIR_2006.xls')
vadir_2006.columns = ['County', 'District', 'School Name', 'BEDS Code'] + clean_vars(vadir_2006, 0)[4:]
vadir_2006 = vadir_2006[2:]
#vadir_2006.head()
vadir_2006['School Year'] = [2006] * len(vadir_2006)
vadir_df = pd.concat([vadir_df, vadir_2006], ignore_index = True)
print( '... data from 2006-7 appended. Added {} rows for a total of {}.'.format(len(vadir_2006), len(vadir_df)))

# Quick Check  - counts by year
vadir_df['School Year'].value_counts()

Data from 2014-15 loaded. Total of 1805 rows.
... data from VADIR_2013.xls appended. Added 1792 rows for a total of 3597.
... data from VADIR_2012.xls appended. Added 1735 rows for a total of 5332.
... data from VADIR_2011.xls appended. Added 1693 rows for a total of 7025.
... data from VADIR_2010.xls appended. Added 1678 rows for a total of 8703.
... data from 2009-10 appended. Added 1531 rows for a total of 10234.
... data from 2008-9 appended. Added 1545 rows for a total of 11779.
... data from 2007-8 appended. Added 1500 rows for a total of 13279.
... data from 2006-7 appended. Added 1455 rows for a total of 14734.


2014    1805
2013    1792
2012    1735
2011    1693
2010    1678
2008    1545
2009    1531
2007    1500
2006    1455
Name: School Year, dtype: int64

### VADIR datasets - manipulation

Reordering columns, merging duplicate columns, removing problematic (non-data) rows left over from the source files, and fixing inconsistencies.
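
The duplicate-column merge can be illustrated on a toy frame. This is only a sketch with made-up values: `combine_first` keeps the value from the preferred column and fills its gaps from the misspelled duplicate, after which the duplicate can be dropped.

```python
import numpy as np
import pandas as pd

# Toy data (made up): the same measure split across a correctly spelled
# column and a misspelled duplicate
toy = pd.DataFrame({
    "Drug Possession": [1.0, np.nan, 3.0],
    "Drug Possesion": [np.nan, 2.0, 30.0],
})

# Prefer the correctly spelled column; fall back to the duplicate where it is missing
toy["Drug Possession"] = toy["Drug Possession"].combine_first(toy["Drug Possesion"])
toy = toy.drop(columns=["Drug Possesion"])
print(toy)
```

Where both columns hold a value (the last row here), the preferred column wins, so it is worth checking for such conflicts before dropping the duplicate.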

In [4]:
# Reordering of columns

# Demographic Categories and their initial indices:
# County[12], District[15], School Name[46], BEDS code[8], School Year[48], Enrollment [18]
# Grade Organization[22], Need/Resource Category[33], School Type[47]
cols = sorted(vadir_df.columns.tolist())
print(cols)

new_order = [12, 15, 46, 48, 18, 8, 22, 33, 47] + list(range(1,8)) + [9, 19, 10, 11, 13, 14, 16, 17, 20, 21]
new_order += list(range(23, 33)) + list(range(34, 46)) + [49, 50, 51, 52]
cols = [cols[idx] for idx in new_order]
vadir_df = vadir_df[cols]
vadir_df.head()

['Alcohol Possesion', 'Alcohol Possession', 'Arson', 'Assault With Serious Physical Injury_nw', 'Assault With Serious Physical Injury_ww', 'Assault with Physical Injury_nw', 'Assault with Physical Injury_ww', 'BEDS Code', 'Bomb Threat', 'Burglary_nw', 'Burglary_ww', 'County', 'Criminal Mischief_nw', 'Criminal Mischief_ww', 'District', 'Drug Possesion', 'Drug Possession', 'Enrollment', 'False Alarm', 'Forcible Sex Offenses_nw', 'Forcible Sex Offenses_ww', 'Grade Organization', 'Homicide_nw', 'Homicide_ww', 'Intimidation, Harassment, Menacing, or Bullying_nw', 'Intimidation, Harassment, Menacing, or Bullying_ww', 'Kidnapping_nw', 'Kidnapping_ww', 'Larceny, or Other Theft_nw', 'Larceny, or Other Theft_ww', 'Minor Altercations_nw', 'Minor Altercations_ww', 'Need/Resource Category', 'Other Disruptive', 'Other Disruptive Incidents', 'Other Sex Offenses_nw', 'Other Sex Offenses_ww', 'Other Sex offenses_nw', 'Other Sex offenses_ww', 'Reckless Endangerment_nw', 'Reckless Endangerment_ww', 'Riot

Unnamed: 0.1,Criminal Mischief_nw,Drug Possesion,School Type,Through Screening,False Alarm,Bomb Threat,Homicide_nw,Other Disruptive,School Year,Alcohol Possession,...,Reckless Endangerment_ww,Riot_nw,Riot_ww,Robbery_nw,Robbery_ww,School Name,Under Other Circumstances,Unnamed: 0,Unnamed: 1,Unnamed: 2
0,,,,0.0,,,,,2014,,...,,,,,,,0.0,Bronx,,Academic Leadership Charter School
1,,,,0.0,,,,,2014,,...,,,,,,,0.0,Bronx,,American Dream Charter School
2,,,,0.0,,,,,2014,,...,,,,,,,0.0,Bronx,,Brilla College Preparatory Charter School
3,,,,0.0,,,,,2014,,...,,,,,,,0.0,Bronx,,Bronx Academy Of Promise Charter School
4,,,,0.0,,,,,2014,,...,,,,,,,0.0,Bronx,,Bronx Charter School For Better Learning


In [None]:
# Merging duplicate columns

# rename 'False Alarm' to "Bomb Threat False Alarm"
vadir_df.rename(columns={'False Alarm':"Bomb Threat False Alarm"}, inplace=True)

# create standard name for "Alcohol Possession"
# vadir_df['Alcohol Possession'] = vadir_df['Alcohol Possession'].combine_first(vadir_df['Alcohol Possesion'])
# vadir_df['Alcohol Possession'] = vadir_df['Alcohol Possession'].combine_first(vadir_df['Use Possession or Sale of Alcohol'])
# vadir_df.drop(['Alcohol Possession', 'Use Possession or Sale of Alcohol'], axis=1, inplace=True)

# create standard name for "Drug Possession"
vadir_df['Drug Possession'] = vadir_df['Drug Possession'].combine_first(vadir_df['Drug Possesion'])
# vadir_df['Drug Possession'] = vadir_df['Drug Possession'].combine_first(vadir_df['Use Possession or Sale of Drugs'])
# vadir_df.drop(['Drug Possesion', 'Use Possession or Sale of Drugs'], axis=1, inplace=True)
vadir_df.drop(['Drug Possesion'], axis=1, inplace=True)

# create standard name for "Other Disruptive Incidents"
vadir_df['Other Disruptive Incidents'] = vadir_df['Other Disruptive Incidents'].combine_first(vadir_df['Other Disruptive'])
vadir_df.drop(['Other Disruptive'], axis=1, inplace=True)

# create standard name for "Other Sex Offenses_nw"
vadir_df['Other Sex Offenses_nw'] = vadir_df['Other Sex Offenses_nw'].combine_first(vadir_df['Other Sex offenses_nw'])
vadir_df.drop(['Other Sex offenses_nw'], axis=1, inplace=True)

# create standard name for "Other Sex Offenses_ww"
vadir_df['Other Sex Offenses_ww'] = vadir_df['Other Sex Offenses_ww'].combine_first(vadir_df['Other Sex offenses_ww'])
vadir_df.drop(['Other Sex offenses_ww'], axis=1, inplace=True)

# Remove duplicate column to "School Type"
vadir_df.drop(['Need/Resource Category'], axis=1, inplace=True)

# Take a look at the now cleaned column names
vadir_df.columns.tolist()

In [None]:
# Fixing problematic rows

vadir_df.drop([13277, 13278, 14732, 14733], axis=0, inplace=True)
# ... and reset the index (reindex() with no arguments returns an unchanged copy)
vadir_df.reset_index(drop=True, inplace=True)

In [None]:
# Fixing Categorical Data Inconsistencies

# Adjust 'County' values to all have the same capitalization
vadir_df['County'] = vadir_df['County'].apply(lambda x: x.title() if isinstance(x, str) else x)

# Standardize the School Type label 'Charter School' to 'Charter'
vadir_df.loc[vadir_df['School Type'] == 'Charter School', 'School Type'] = 'Charter'

# Adjust 'School Name' values to all have the same capitalization
vadir_df['School Name'] = vadir_df['School Name'].apply(lambda x: x.title() if isinstance(x, str) else x)


### Tally number of incidents in total, with weapons, without weapons.

In [None]:
# get numeric values
COUNT_COLUMNS = vadir_df.columns[8:].tolist()

vadir_df[COUNT_COLUMNS].head()

# convert data types
#vadir_df[COUNT_COLUMNS] = vadir_df[COUNT_COLUMNS].apply(lambda x: pd.to_numeric(x))

# # compute tallies
# vadir_df['Total Incidents'] = vadir_df[COUNT_COLUMNS].sum(axis=1)

# WEAPON_COLS = [x for x in COUNT_COLUMNS if x[-3:] == '_ww']
# vadir_df['Incidents w/ Weapons'] = vadir_df[WEAPON_COLS].sum(axis=1)

# NO_WEAPON_COLS = [x for x in COUNT_COLUMNS if x[-3:] == '_nw']
# vadir_df['Incidents w/o Weapons'] = vadir_df[NO_WEAPON_COLS].sum(axis=1)

# # reorder columns
# ORDER = ['County', 'District', 'School Name', 'School Year', 'Enrollment', 'BEDS Code',
#          'Grade Organization','School Type','Total Incidents','Incidents w/ Weapons',
#          'Incidents w/ Weapons'] + COUNT_COLUMNS
# vadir_df = vadir_df[ORDER]

# school_df = vadir_df.copy()
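
The tallies outlined in this section can be sketched on a toy frame. The column names follow the `_ww` / `_nw` convention used above, but the values are made up, and coercing unparseable entries to NaN is an assumption about how messy cells should be handled:

```python
import pandas as pd

# Toy data (made up): counts stored as text, with one unparseable entry
toy = pd.DataFrame({
    "School Name": ["A", "B"],
    "Burglary_ww": ["1", "0"],
    "Burglary_nw": ["2", "x"],
    "Arson": [1, 2],
})

count_cols = [c for c in toy.columns if c != "School Name"]

# Convert every count column to numbers; anything unparseable becomes NaN
toy[count_cols] = toy[count_cols].apply(pd.to_numeric, errors="coerce")

# Row-wise tallies (sum skips NaN by default)
toy["Total Incidents"] = toy[count_cols].sum(axis=1)
toy["Incidents w/ Weapons"] = toy[[c for c in count_cols if c.endswith("_ww")]].sum(axis=1)
toy["Incidents w/o Weapons"] = toy[[c for c in count_cols if c.endswith("_nw")]].sum(axis=1)
print(toy)
```

Selecting with `endswith` rather than a substring test keeps columns that merely contain the letters "ww" or "nw" out of the tallies.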

### Borough Population Data - Cleanup

In [None]:
# capitalize borough and set them as index (to match NYPD Data)
borough_df.Borough = borough_df.Borough.apply(lambda x : x.upper())
borough_df.set_index('Borough', inplace = True)

# Add areas (measured in sq miles, source: Wikipedia)
borough_df["Area(sq mi)"] = pd.Series({'BRONX': 42.47 , 'BROOKLYN': 69.5, 'MANHATTAN': 22.82 ,
                                'QUEENS': 108.1 , 'STATEN ISLAND': 57.92 })
# Take  a look
borough_df
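
Assigning a Series as a new column aligns it on the index, which is why the boroughs are uppercased and set as the index first. A small sketch of that behavior, using placeholder population figures rather than the real ones:

```python
import pandas as pd

# Placeholder populations (not real figures), indexed by borough name
toy = pd.DataFrame({"Population": [100, 200, 300]},
                   index=["BRONX", "QUEENS", "STATEN ISLAND"])

# Keys are matched to the index by label, regardless of their order in the dict
toy["Area(sq mi)"] = pd.Series({"QUEENS": 108.1, "BRONX": 42.47, "MANHATTAN": 22.82})
print(toy)
```

Labels missing from the Series (STATEN ISLAND here) become NaN, and labels missing from the frame (MANHATTAN) are silently ignored, so a spelling or capitalization mismatch shows up as a missing value rather than as an error.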

# NYPD 7 Major Felony Incidents


In [None]:
print("Number of observations:", len(felony_df), "\nNumber of Variables:", len(felony_df.columns),\
     "\nDate Range:", int(felony_df["Occurrence Year"].min()), "-", int(felony_df["Occurrence Year"].max()))

In [None]:
# what percent of the data is from 2006 and after?
year_counts = felony_df['Occurrence Year'].value_counts()
recent = year_counts[list(range(2006,2016))].sum()
print("Percent of data dated 2006 or after: {:.2f}%".format(recent/len(felony_df) * 100))

# plot crime incidence by year (2000-2015)
year_counts[list(range(2000,2016))].plot(kind = 'bar')
plt.title("Felonies committed in NYC by Year (2000-2015)")
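
Selecting years with an explicit list of labels raises a `KeyError` in recent pandas versions if any of those years is absent from the counts. A hedged alternative, shown on made-up years, is to filter the index with a boolean mask:

```python
import pandas as pd

# Toy occurrence years (made up)
years = pd.Series([1998, 2005, 2006, 2010, 2015, 2015])
year_counts = years.value_counts()

# Boolean mask on the index: years with no records simply do not appear
recent = year_counts[year_counts.index >= 2006].sum()
share = recent / len(years) * 100
print("Percent of data dated 2006 or after: {:.2f}%".format(share))
```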

In [None]:
# plot crime incidence by felony type (whole data set)
felony_counts = felony_df.Offense.value_counts()
#counts.sort_values(inplace=True)
felony_counts.plot(kind = 'bar')
plt.title("Frequency of Felony Offenses (NYC, 1905-2015)")
plt.ylabel("Number Recorded")

In [None]:
# plot crime incidence by borough (whole data set)
felony_df.Borough.value_counts().plot(kind = 'bar')
plt.title("Total Felonies by Borough (1905-2015)")

In [None]:
# plotting by borough and offense
fig, ax = plt.subplots(figsize = (10,6))

colors = ['r', 'g', 'b', 'c', 'y', 'm', 'orange']

# Count incidents per (borough, offense) pair and put one offense per column,
# so every offense is drawn against the same borough order
offense_by_borough = felony_df.groupby(['Borough', 'Offense']).size().unstack('Offense')
offense_by_borough.plot(kind = 'bar', ax = ax, color = colors)

plt.title("Felonies by Offense and Borough")
plt.legend()
       

### Set up Felony data to do temporal analysis

In [None]:
# Note: This cell takes 5-10 minutes to run

felony_df = pd.read_csv("NYPD_7_Major_Felony_Incidents.csv", parse_dates = ["Occurrence Date"])

#creating a new column to strip off Time from Occurrence Date
felony_df['Short Occurrence Date']= pd.to_datetime(felony_df['Occurrence Date'])
felony_df['Short Occurrence Date'] = [d.strftime('%Y-%m-%d') if not pd.isnull(d) else '' for d \
                                      in felony_df['Short Occurrence Date']]

#removing data prior to 2006
felony_df_2006 = felony_df[felony_df["Occurrence Year"]>2005]

#there are some dates that have a year of 2006 and higher but a date of 1900, removing these as well
# (filter the already-reduced frame, and copy it so the new columns below go into an independent dataframe)
felony_df_2006 = felony_df_2006[felony_df_2006["Short Occurrence Date"]>'2005-12-31'].copy()

month_order = {'Jan':'01', 'Feb': '02', 'Mar': '03', 'Apr': '04', 'May': '05', 'Jun': '06', 'Jul': '07', 
               'Aug': '08', 'Sep': '09', 'Oct': '10', 'Nov': '11', 'Dec': '12'}
felony_df_2006['Occurrence Month Ordered'] = [month_order[m] + ' ' + m for m in \
                                              felony_df_2006['Occurrence Month']]

day_order = {'Monday': 1, 'Tuesday': 2, 'Wednesday': 3, 'Thursday': 4, 'Friday': 5, 'Saturday': 6, 'Sunday': 7}
felony_df_2006['Day of Week Ordered'] = [str(day_order[d]) + ' ' + d for d in felony_df_2006['Day of Week']]

#groupby by several parameters for plots

datecount = felony_df_2006.groupby('Short Occurrence Date', as_index=False)['OBJECTID'].count()
yearcount = felony_df_2006.groupby('Occurrence Year', as_index=False)['OBJECTID'].count()
monthcount = felony_df_2006.groupby('Occurrence Month Ordered', as_index=False)['OBJECTID'].count()
daycount = felony_df_2006.groupby('Occurrence Day', as_index=False)['OBJECTID'].count()
weekdaycount = felony_df_2006.groupby('Day of Week Ordered', as_index=False)['OBJECTID'].count()
offensecount = felony_df_2006.groupby('Offense', as_index=False)['OBJECTID'].count()
yearoffensecount = felony_df_2006.groupby(['Occurrence Year', 'Offense'], as_index=False)['OBJECTID'].count()
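
The two date filters can also be combined into a single boolean mask. The sketch below uses made-up rows, and the timestamp format string is an assumption about how the dates are written in the CSV:

```python
import pandas as pd

# Toy rows (made up): one pre-2006 record, one valid record,
# one with a mismatched 1900 date, and one with a missing date
toy = pd.DataFrame({
    "Occurrence Year": [2005, 2006, 2007, 2010],
    "Occurrence Date": ["12/31/2005 11:00:00 PM", "01/15/2006 08:30:00 AM",
                        "03/02/1900 12:00:00 AM", None],
})

# Assumed timestamp format; values that cannot be parsed become NaT
toy["Occurrence Date"] = pd.to_datetime(
    toy["Occurrence Date"], format="%m/%d/%Y %I:%M:%S %p", errors="coerce")

# Both conditions must hold; comparisons against NaT evaluate to False
mask = (toy["Occurrence Year"] > 2005) & (toy["Occurrence Date"] >= pd.Timestamp("2006-01-01"))
recent = toy[mask].copy()
recent["Short Occurrence Date"] = recent["Occurrence Date"].dt.strftime("%Y-%m-%d")
print(recent)
```

Keeping the column as a datetime and formatting it with `.dt.strftime` avoids a Python-level loop over every row.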

In [None]:
# get a time series view of the number of offenses per day
dates = datecount["Short Occurrence Date"]
x = [dt.datetime.strptime(d,'%Y-%m-%d').date() for d in dates]
y = datecount["OBJECTID"]

plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
plt.plot(x,y)
plt.gcf().autofmt_xdate()
plt.title('Felony Offenses from 2006 to 2015')

In [None]:
# there is an evident seasonal spike and valley in felonies each year
# the below shows a view for 2009 through 2010
datecount_2010 = datecount[datecount['Short Occurrence Date'] > '2008-12-31']
datecount_2010 = datecount_2010[datecount_2010['Short Occurrence Date'] < '2011-01-01']

dates = datecount_2010["Short Occurrence Date"]
x = [dt.datetime.strptime(d,'%Y-%m-%d').date() for d in dates]
y = datecount_2010["OBJECTID"]

plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
plt.plot(x,y)
plt.gcf().autofmt_xdate()
plt.title('Felony Offenses from 2009 through 2010')

#looks like the number of felonies spike in September/October and 
# then fall back down at the end of each calendar year
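
One way to make a seasonal pattern easier to read is to aggregate the daily counts by month. The sketch below uses synthetic daily counts (random numbers, not the NYPD data) purely to show the mechanics of `resample`:

```python
import numpy as np
import pandas as pd

# Synthetic daily counts for 2009-2010 (random, for illustration only)
days = pd.date_range("2009-01-01", "2010-12-31", freq="D")
rng = np.random.default_rng(0)
daily = pd.Series(rng.poisson(300, len(days)), index=days)

# "MS" groups by calendar month and labels each group with the month's first day
monthly = daily.resample("MS").sum()
print(monthly.head())
```

With the real data, `daily` would need a `DatetimeIndex` built from the occurrence dates before resampling.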

In [None]:
# this shows a decline in the number of felonies recorded from 2006 to 2015
# same data as the yearly bar chart above, with the years as x-axis labels and formatted y tick labels
x = yearcount['Occurrence Year']
xlabels = [int(xl) for xl in x]
y = yearcount["OBJECTID"]

def func(x, pos):  
   s = '{:0,d}'.format(int(x))
   return s

y_format = tkr.FuncFormatter(func) 
plt.xticks(x, xlabels)
plt.gca().yaxis.set_major_formatter(y_format)
plt.plot(x,y)
plt.title('Total Annual Felony Offenses from 2006 to 2015')
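
The tick formatter relies on Python's format mini-language, where a comma in the format spec inserts thousands separators. A standalone sketch follows; the function name `thousands` is illustrative:

```python
def thousands(x, pos=None):
    """Format a tick value as an integer with comma separators."""
    return "{:,d}".format(int(x))

label = thousands(1123465.0)
print(label)
```

The unused `pos` argument is kept because a function passed to `FuncFormatter` is called with both the tick value and its position.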

In [None]:
# wanted to see a breakout of the type of offenses across years

# x = yearoffensecountp.index
# xlabels = [int(xl) for xl in x]
# yb = yearoffensecountp["BURGLARY"]
# yfa = yearoffensecountp["FELONY ASSAULT"]
# ygl = yearoffensecountp["GRAND LARCENY"]
# yglmv = yearoffensecountp["GRAND LARCENY OF MOTOR VEHICLE"]
# ymm = yearoffensecountp["MURDER & NON-NEGL. MANSLAUGHTE"]
# yr = yearoffensecountp["RAPE"]
# yrb = yearoffensecountp["ROBBERY"]

# def func(x, pos):  
#    s = '{:0,d}'.format(int(x))
#    return s

# y_format = tkr.FuncFormatter(func) 
# plt.xticks(x, xlabels)
# plt.gca().yaxis.set_major_formatter(y_format)

# plt.plot(x, yb, 'b-', label='Burglary')
# plt.plot(x, yfa, 'g-', label='Felony Assault')
# plt.plot(x, ygl, 'r-', label='Grand Larceny')
# plt.plot(x, yglmv, 'y-', label='Grand Larceny - Motor')
# plt.plot(x, ymm, 'k-', label='Murder')
# plt.plot(x, yr, 'm-', label='Rape')
# plt.plot(x, yrb, 'c-', label='Robbery')

# plt.legend(loc="best", bbox_to_anchor=[1, 1],
#            ncol=2, shadow=True, title="Legend", fancybox=True)
# plt.title('Felony Offenses by Type from 2006 to 2015')
# plt.show()

In [None]:
# here is a look at total felony offenses by day of month
# looks like the first of the month has the highest offenses - about 10,000 more than any other day
# perhaps this is just the busiest day to record offenses?
# 31 doesn't have as many offenses since not all months have 31 days - most likely
x = daycount['Occurrence Day']
y = daycount['OBJECTID']
plt.gca().yaxis.set_major_formatter(y_format)
plt.bar(x,y)
plt.title('Total Felony Offenses by Day of Month from 2006 to 2015')

In [None]:
# breaking out the offenses by month total, it does look a bit seasonal and similar to the time series chart above
# February has the lowest number of offenses - potentially due to the cold weather in NYC (it is also the shortest month)
x = range(len(monthcount['Occurrence Month Ordered']))
xlabels = [label[-3:] for label in monthcount['Occurrence Month Ordered']]
y = monthcount['OBJECTID']
plt.xticks(x, xlabels)
plt.gca().yaxis.set_major_formatter(y_format)
# center the bars on the tick positions and pad the axis by half a bar on each side
plt.gca().set_xlim(-0.5,len(xlabels)-0.5)
plt.bar(x,y, align='center')
plt.title('Total Felony Offenses by Month from 2006 to 2015')
plt.show()

In [None]:
# breaking out offenses by day of week, Friday spikes with almost 20,000 more offenses than any other day
x = range(len(weekdaycount['Day of Week Ordered']))
xlabels = list(weekdaycount['Day of Week Ordered'])
y = weekdaycount['OBJECTID']
plt.xticks(x, xlabels, rotation='vertical')
plt.gca().yaxis.set_major_formatter(y_format)
# center the bars on the tick positions and pad the axis by half a bar on each side
plt.gca().set_xlim(-0.5,len(xlabels)-0.5)
plt.bar(x,y, align='center')
plt.title('Total Felony Offenses by Day of Week from 2006 to 2015')
plt.show()
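
Prefixing labels with a number is one way to force calendar order. Another option, sketched here on made-up weekday values, is to count first and then reindex the counts against an explicit ordering:

```python
import pandas as pd

# Toy weekday values (made up)
days = pd.Series(["Friday", "Monday", "Sunday", "Monday", "Friday", "Friday"])
order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Count, then put the counts in calendar order; days with no records get 0
counts = days.value_counts().reindex(order, fill_value=0)
print(counts)
```

This keeps the plain day names as labels, so the plotted tick labels need no trimming afterwards.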

In [None]:
# create plot comparing types of Felonies by Borough
fel_by_bor = felony_df.groupby('Borough')['Offense'].value_counts().sort_values(ascending=False)
fel_by_bor_un = fel_by_bor.unstack("Offense")
fel_by_bor_un.sort_index(ascending=False, level = 'GRAND LARCENY')
fel_by_bor_un.plot(kind="bar", figsize= (15,7), title = "Borough By Offense")
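
A borough-by-offense table like this can also be built with `pd.crosstab`, which fills combinations that never occur with 0 instead of NaN. The rows below are made up for illustration:

```python
import pandas as pd

# Toy incidents (made up)
toy = pd.DataFrame({
    "Borough": ["BRONX", "BRONX", "QUEENS", "QUEENS", "QUEENS", "BROOKLYN"],
    "Offense": ["ROBBERY", "GRAND LARCENY", "GRAND LARCENY",
                "GRAND LARCENY", "BURGLARY", "ROBBERY"],
})

# One row per borough, one column per offense, cells hold incident counts
table = pd.crosstab(toy["Borough"], toy["Offense"])

# Order the boroughs by one offense; sort_values returns a new frame
table = table.sort_values("GRAND LARCENY", ascending=False)
print(table)
```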

# School Data Analysis

In [None]:
school_df[school_df["School Year"].astype(int) > 2010]

In [None]:
school_df["School Year"] = school_df["School Year"].astype(int)

# drop any tally columns left over from earlier runs so they are not counted twice
stale_cols = ['Incidents w/ Weapons', 'Incidents w/ Weapons.1',
              'Incidents w/ weapons', 'Incidents w/out weapons']
school_df.drop([c for c in stale_cols if c in school_df.columns], axis=1, inplace=True)

# total the "no weapon" (_nw) and "with weapon" (_ww) incident columns per row
nw_cols = [col for col in school_df.columns if col.endswith('_nw')]
school_df["Incidents w/out weapons"] = school_df[nw_cols].sum(axis=1)

ww_cols = [col for col in school_df.columns if col.endswith('_ww')]
school_df["Incidents w/ weapons"] = school_df[ww_cols].sum(axis=1)

# assuming no other column starts with "Inc", inc_cols[0] is the without-weapons
# tally and inc_cols[1] is the with-weapons tally
inc_cols = [col for col in school_df.columns if col.startswith("Inc")]

# DataFrame.plot returns the Axes object
ax = school_df.groupby("School Year")[inc_cols].sum().plot(kind="bar", figsize = (12, 10), 
                                                      secondary_y = inc_cols[1], legend=True,
                                                      title = "Incidents without and with weapons"
                                                      )

In [None]:
fig, ax = plt.subplots()

# take x and y from the same grouped result so the years and totals stay aligned
# (a set has no guaranteed order)
yearly_incidents = school_df.groupby("School Year")[inc_cols].sum()

ax.plot(yearly_incidents.index, 
        yearly_incidents[inc_cols[0]],
        'b',
       label = "Incidents without weapons")
ax.set_ylabel("Count of Incidents W/out Weapons")

ax2 = ax.twinx()
ax2.plot(yearly_incidents.index, 
        yearly_incidents[inc_cols[1]],
         'g',
       label = "Incidents with weapons")
ax2.set_ylabel("Count of Incidents W/ Weapons")

lines = ax.get_lines() + ax2.get_lines()
ax.legend(lines, [line.get_label() for line in lines], loc = "center left", bbox_to_anchor = (1.15, 0.5))
ax.set_title("Incidents With and Without Weapons")
plt.show()

## Create Lat/Lon values using Geocoder based on addresses

In [None]:
# Finds which schools exist in both the violence and location dataframes

location_df = pd.read_csv("15-16SchoolDirectory.csv")

loc_sed = set(location_df["SED CODE"])
sch_beds = set(school_df["BEDS Code"])
print("Location file school count:", len(loc_sed), "\nSchool violence school count:", len(sch_beds),
      "\nSchools with both violence and locations:", len(loc_sed & sch_beds))
school_beds_with_addresses = list(loc_sed & sch_beds)
filtered_school_df = school_df[school_df["BEDS Code"].isin(school_beds_with_addresses)]
filtered_school_df = filtered_school_df.reset_index(drop = True)
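
As an alternative to set arithmetic, a left merge with `indicator=True` marks which violence records found a matching location. This is a sketch on made-up codes; only the two key column names come from the cell above:

```python
import pandas as pd

# Toy frames (made-up codes): schools can appear in several years of violence data
violence = pd.DataFrame({"BEDS Code": [1, 1, 2, 3], "School Year": [2013, 2014, 2014, 2014]})
locations = pd.DataFrame({"SED CODE": [2, 3, 4], "CITY": ["BRONX", "BROOKLYN", "QUEENS"]})

# Keep every violence record and flag whether a location row matched it
merged = violence.merge(locations, left_on="BEDS Code", right_on="SED CODE",
                        how="left", indicator=True)

matched = merged[merged["_merge"] == "both"]
unmatched = merged[merged["_merge"] == "left_only"]
print(len(matched), "records matched;", len(unmatched), "records had no location")
```

The `_merge` column makes it easy to inspect which schools were lost, which the set intersection alone does not show.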

In [None]:
# Note: Takes 30 seconds to run

# Creates Full Address column in both location df and filtered school df
location_df["Full_Address"] = location_df["MAILING ADDRESS"] + ", " + location_df["CITY"] + \
                                ", " + location_df["STATE"] + ", " + location_df["ZIP"] 

# Look up each school's address by its code instead of parsing the printed form of a Series
address_by_code = location_df.drop_duplicates("SED CODE").set_index("SED CODE")["Full_Address"]
filtered_school_df["Full_Address"] = filtered_school_df["BEDS Code"].map(address_by_code)

# Need a unique school list only, to stay below the 2,500-record geocoder limit (1,807 total)
beds_list = [] # format of [beds code, address, school name]
for beds in (loc_sed & sch_beds):
    beds_list.append([beds, filtered_school_df[filtered_school_df["BEDS Code"]==beds].Full_Address.values[0],\
                filtered_school_df[filtered_school_df["BEDS Code"]==beds]["School Name"].values[0]])

beds_df = pd.DataFrame(beds_list, columns = ["BEDS_Code", "Address", "SchoolName"])
print("There are", len(beds_df), "unique schools in beds_df.")
beds_df.head()

## WARNING: The next cell can only be run once a day.

It also takes at least 10 minutes to run. This is a limitation of the Google geocoding service: only 2,500 calls can be made per day, and the cell below makes 1,807 of them.
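
Because of the daily quota, one option is to cache results on disk so that a rerun only looks up addresses it has not seen before. The sketch below is not part of the original notebook: `geocode_with_cache` and `fake_lookup` are hypothetical names, and `fake_lookup` stands in for the real geocoder call so the example runs without network access.

```python
import json
import os
import tempfile

def geocode_with_cache(addresses, lookup, cache_path):
    """Return {address: result}, calling `lookup` only for addresses not yet cached."""
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path) as fh:
            cache = json.load(fh)
    for address in addresses:
        if address not in cache:
            cache[address] = lookup(address)
    # Persist everything seen so far for the next run
    with open(cache_path, "w") as fh:
        json.dump(cache, fh)
    return cache

# Hypothetical stand-in for the geocoder: records each call, returns fixed coordinates
calls = []
def fake_lookup(address):
    calls.append(address)
    return [40.0, -74.0]

cache_file = os.path.join(tempfile.mkdtemp(), "latlon_cache.json")
first = geocode_with_cache(["1 Main St", "2 Main St"], fake_lookup, cache_file)
second = geocode_with_cache(["1 Main St", "3 Main St"], fake_lookup, cache_file)
print(calls)
```

In the notebook, `lookup` could be a small function wrapping the `geo.google(x).latlng` call used in the cell below, so each address is sent to the service at most once.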

In [None]:
# Takes 10 min at least for all 1,807 schools! Only works ONCE A DAY due to geocoder 
# restrictions with Google (2,500 calls per day)

latlon = [geo.google(x).latlng for x in beds_df.Address]
beds_df["latlon"] = latlon
beds_df.head()


## END WARNING

In [None]:
# Add "Lat" and "Lon" columns on the filtered school dataframe

latlon_list = [beds_df.latlon[beds_df[beds_df["BEDS_Code"]==x].index.values[0]] for x in filtered_school_df["BEDS Code"]]
# latlng can be empty or None when geocoding fails; store None for those schools
filtered_school_df["Lat"] = [x[0] if x else None for x in latlon_list]
filtered_school_df["Lon"] = [x[1] if x else None for x in latlon_list]
filtered_school_df[["Full_Address", "Lat", "Lon"]].head()