# Lat Long Analysis

This portion of our project uses the schools' latitude and longitude (previously obtained using geocoder) to perform some final cleaning on the VADIR data and then join it to the NYC crime data by location. The code below will:  

* __Load vadir data__ and cleaning it with the functions from cleandata.py  
* Ensure __schools are consistently named__ (we'll use the names from the lat long file and join them using the beds/seds code. This also means that we'll discard records from the schools for which we don't have lat/long)  
* __Fill in missing boroughs__ for records from 2006-2007.  
* __Identify felonies within a 1 mile__ radius of a given school.  
* __Plot correlations__ between school indicents and felonies (by year, by borough, by felony type, by location, by school incident type).

In [1]:
% matplotlib inline
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from vincenty import vincenty
import cleandata as cd

### Read in and clean the VADIR data from the 2006-2014 school years.

In [2]:
# Raw data for each year
RAW_DATA_DICT = {2006: 'VADIR_2006.xls', 2007: 'VADIR_2007.xls', 2008: 'VADIR_2008.xls', 
                 2009: 'VADIR_2009.xls', 2010: 'VADIR_2010.xls', 2011: 'VADIR_2011.xls', 
                 2012: 'VADIR_2012.xls', 2013: 'VADIR_2013.xls', 2014: 'VADIR_2014.xls'}

# Duplicate name columns in raw files (and their replacements)
DUP_COLS = {'County Name':'County', 'District Name': 'District', 'BEDS CODE': 'BEDS Code', 
            'False Alarm':'Bomb Threat False Alarm',
            'Other Sex offenses': 'Other Sex Offenses', 
            'Use Possession or Sale of Drugs': 'Drug Possession', 
            'Use Possession or Sale of Alcohol': 'Alcohol Possession',
            'Other Disruptive Incidents': 'Other Disruptive Incidents', 
            'Drug Possesion': 'Drug Possession', 'Alcohol Possesion': 'Alcohol Possession', 
            'Other Disruptive': 'Other Disruptive Incidents'}

# Read in raw data and correct duplicate columns
vadir_df = cd.vadir_concat_dfs(RAW_DATA_DICT, DUP_COLS)

... data from VADIR_2006.xls appended. Added 1455 rows for a total of 1455.
... data from VADIR_2007.xls appended. Added 1500 rows for a total of 2955.
... data from VADIR_2008.xls appended. Added 1545 rows for a total of 4500.
... data from VADIR_2009.xls appended. Added 1531 rows for a total of 6031.
... data from VADIR_2010.xls appended. Added 1678 rows for a total of 7709.
... data from VADIR_2011.xls appended. Added 1693 rows for a total of 9402.
... data from VADIR_2012.xls appended. Added 1735 rows for a total of 11137.
... data from VADIR_2013.xls appended. Added 1792 rows for a total of 12929.
... data from VADIR_2014.xls appended. Added 1805 rows for a total of 14734.


In [3]:
# Reorder columns putting demographic information first.
DEMO_COLS = ['School Name', 'School Type', 'School Year', 'BEDS Code',  'County', 
             'District', 'Enrollment', 'Grade Organization', 'Need/Resource Category']
vadir_df = cd.vadir_reorder_columns(vadir_df, DEMO_COLS)

In [4]:
# Create Columns for "Total incidents", "Incidents w/out Weapons" and "Incidents w/ Weapons"
COLUMNS = vadir_df.columns.tolist()
INCIDENT_COLS = [c for c in COLUMNS if c not in DEMO_COLS]
vadir_df = cd.vadir_create_tallies(vadir_df, INCIDENT_COLS)

In [5]:
# Cleaning VADIR data...
# ... consistently name county and school type, fix name capitalization, remove comment rows.
school_df = cd.vadir_clean_concat_df(vadir_df)

In [None]:
# Take a look
school_df.head(2)

### Join School Data and Location Data

In [6]:
# read in lat-long file that contains correct names and locations
#... ??? replace this call with function call to geocoder ???
latlon_df = pd.read_csv('SchoolLatLon.csv', index_col=0)

In [None]:
# take a look
latlon_df.head(2)

In [7]:
# ensure BEDS and SED are integers so that they'll be recognized as identical
latlon_df["SED CODE"] = latlon_df["SED CODE"].astype(np.int64)
school_df["BEDS Code"] = school_df["BEDS Code"].astype(np.int64)

In [8]:
# join latlong data to school data using the BEDS code
school_df = pd.merge(school_df, latlon_df, left_on=['BEDS Code'],right_on=['SED CODE'], how='left')

In [9]:
# drop the now redundant SED code
school_df.drop(['SED CODE'], axis=1, inplace=True)

In [10]:
# Take a look at the resulting data/missing values
print('... {} unique schools,'.format(len(school_df['BEDS Code'].unique())))
schools_withloc = school_df[school_df['latlon'].notnull()]['BEDS Code'].unique()
schools_missingloc = school_df[school_df['latlon'].isnull()]['BEDS Code'].unique()
print('... of which {} have lat/long'.format(len(schools_withloc)))
print('... and {} are missing lat/long'.format(len(schools_missingloc)))

... 1967 unique schools,
... of which 1807 have lat/long
... and 160 are missing lat/long


### Fix School Names

In [11]:
# The problem:
print('... original dataset has {} unique school names'.format(len(school_df['School Name'].unique())))
print('... but only {} unique BEDS Codes'.format(len(school_df['BEDS Code'].unique())))

... original dataset has 3135 unique school names
... but only 1967 unique BEDS Codes


In [12]:
# Helper Function
def fix_case(x):
    """Function to put a school name in the correct case"""
    if not x:
        return x
    elif x[:3] in ['PS ', 'JHS', 'MS ']:
        return x[:3] + x[3:].title()
    else:
        return x.title()

In [13]:
# Fix missing LEGAL NAMES with School Name
school_df['LEGAL NAME'].fillna(school_df['School Name'], inplace=True)

# Fix case and reassign to School Name
school_df['School Name'] = school_df['LEGAL NAME'].apply(fix_case)

# drop the now redundant LEGAL NAME column
school_df.drop(['LEGAL NAME'], axis=1, inplace=True)

In [14]:
# Problem Solved
print('... new dataset has {} unique school names.'.format(len(school_df['School Name'].unique())))
print('... and {} unique BEDS Codes.'.format(len(school_df['BEDS Code'].unique())))

... new dataset has 1959 unique school names.
... and 1967 unique BEDS Codes.


### Fill in missing boroughs

__TODO:__ From the numbers below it looks like some of the County values got switched around (eg. decrease in Manhattan counts?)... I think we probalby need a way of creating the county_map dictionary that prioritizes the traditional borough name). Come back to this.

In [15]:
# The first problem:
print('... {} entries are missing county info'.format(sum(school_df['County'].isnull())))
print('Other county tallies:')
school_df.County.value_counts()

... 3076 entries are missing county info
Other county tallies:


Brooklyn              3566
Bronx                 2737
Queens                2260
Manhattan             1228
New York              1040
Staten Island          478
Nyc Central Office     344
Nassau                   1
Name: County, dtype: int64

In [16]:
# create dictionary of county by BEDS Code
c = school_df[school_df['County'].notnull()][['BEDS Code','County']].to_dict()
county_map = {c['BEDS Code'][idx]: c['County'][idx] for idx in c['County'].keys()}

In [17]:
# map counties using dictionary
school_df.County = school_df['BEDS Code'].map(county_map)

In [18]:
# The first problem Solved
print('... Now only {} entries are missing county info'.format(sum(school_df['County'].isnull())))
print('Other county tallies:')
school_df.County.value_counts()

... Now only 3 entries are missing county info
Other county tallies:


Brooklyn              4637
Bronx                 3536
Queens                2961
New York              2856
Staten Island          639
Manhattan               80
Nyc Central Office       9
Nassau                   9
Name: County, dtype: int64

In [19]:
# Are any BEDS Codes are linked with more than one Borough(County)?
school_df.groupby('BEDS Code')['County'].apply(lambda x: len(x.unique())).value_counts()

1    1967
Name: County, dtype: int64

### Prep for distance sorting crime locations

In [20]:
# Helper function -- check dist
def is_in_radius(school_point, crime_point, radius):
    """
    Function using vincenty package to check distance between school and crime.
    INPUT: (lat,long) tuples for school and crime (in degrees), radius in miles.
    OUTPUT: Boolean
    """
    return vincenty(school_point, crime_point, miles=True) <= radius

In [21]:
# Helper function -- extract lat/long from object type
def parse_latlong(dataframe, loc_column):
    """
    Function to extract lat/long coords. 
    INPUT: dataframe and name of column with string tuple or list pair of coordinates.
    OUTPUT: n/a. Function modifies dataframe to add a lat and long column with float type.
    """
    get_lat = lambda x: x.split(',')[0][1:] if type(x)==type('s') else np.nan
    get_long = lambda x: x.split(',')[1][:-1] if type(x)==type('s') else np.nan
    dataframe['lat'] = dataframe[loc_column].apply(get_lat).astype('float64')
    dataframe['long'] = dataframe[loc_column].apply(get_long).astype('float64')
    print('... latitude and longitude extracted for dataframe.')

In [22]:
# load NYC dataframe
felony_df = pd.read_csv('NYPD_7_Major_Felony_Incidents.csv', index_col = False)

#### ERROR?: the cleaning function seems to take forever... need to take a closer look/maybe fix it.

In [None]:
# ... and clean it
#felony_df = cd.clean_NYPD(felony_df)

In [23]:
# Extact Lattitude and longitude data for both dataframes
parse_latlong(school_df, 'latlon')
parse_latlong(felony_df, 'Location 1')

... latitude and longitude extracted for dataframe.
... latitude and longitude extracted for dataframe.


In [24]:
# Testing vincenty on the first felony and first school
first_school_point = (school_df.loc[0,'lat'], school_df.loc[0,'long']) 
first_felony_point = (felony_df.loc[1,'lat'], felony_df.loc[1,'long']) 

# not w/in 2 miles, but yes, w/in 50
print('Distance: ', vincenty(first_school_point, first_felony_point))
print("... w/in 2 mi?", is_in_radius(first_school_point, first_felony_point, 2))
print("... w/in 50 mi?",is_in_radius(first_school_point, first_felony_point, 50))

Distance:  13.120085
... w/in 2 mi? False
... w/in 50 mi? True


### Function(s) to extract crime tallies w/in radius of schools

I suspect that looping through each school is going to take forever... but we'll try that first and then explore a smarter (dynamic programming) alternative.

In [25]:
# Quick Check, are there rows with 'latlon' but not 'lat'
print('... there are {} missing latlon entries'.format(sum(school_df.latlon.isnull())))
print('... there are {} missing lat entries'.format(sum(school_df.lat.isnull())))

... there are 640 missing latlon entries
... there are 640 missing lat entries


In [30]:
# BRUTE FORCE OPTION: 
# This is super inefficient because the function has to apply the location filter over
# the entire NYPD df (well, the year subset) for each school & each year.

def bf_crimecount(school_df, felony_df, radius = 1):
    """
    Brute force function to find all felonies w/in given radius of each school.
    INPUT: school df w/ cols 'latlon', 'lat', 'long', and 'School Year'
           felony df w/ cols 'Occurrence Year', 'lat','long','Offense', and 'Identifier'
    OUTPUT: n/a, modifies school data.
    """
    # Initialize new columns
    school_df['CrimeIDS'] = pd.Series()
    school_df['Total Felonies w/in 1mi'] = pd.Series()
    school_df['Grand Larceny w/in 1mi'] = pd.Series()
    school_df['Robbery w/in 1mi'] = pd.Series()
    school_df['Burglary w/in 1mi'] = pd.Series()
    school_df['Assault w/in 1mi'] = pd.Series()
    school_df['Auto Theft w/in 1mi'] = pd.Series()
    school_df['Rape w/in 1mi'] = pd.Series()
    school_df['Murders w/in 1mi'] = pd.Series()
    print('... created 9 new empty columns to store crime information.') # for debugging
    
    # Group by location (school) and Year
    grouped = school_df[school_df.lat.notnull()].groupby(['latlon','School Year'])
    print('... grouped by school and year, {} groups in total'.format(len(grouped.groups))) # for debugging
    
    # Loop through schools & years
    for name, df in grouped:
        print('>>> now processing group:', name) # for debugging
        coordinates, year = name # unpacking from groupby 
        location = (df.lat.mean(), df.long.mean())
        #NOTE: the coordinates are all the same so the mean is just the location

        # get subset of felonies for that year
        crimes = felony_df[felony_df['Occurrence Year'] == year] 
        print('    ... found {} crimes w/in this year'.format(len(crimes))) # for debugging

        # further subset felonies using location criteria
        if crimes.empty:
            local_crimes = pd.DataFrame()
        else:    
            criteria = crimes.apply(lambda x: is_in_radius(location, (x.lat, x.long), radius), axis=1 )
            local_crimes = crimes[criteria]
        print('    ... of these, {} occurred w/in one mile'.format(len(local_crimes))) # for debugging
        
        # compute crime counts
        counts = pd.Series({'GRAND LARCENY':0, 'ROBBERY':0, 'BURGLARY':0, 'FELONY ASSAULT':0,
                              'RAPE':0, 'GRAND LARCENY OF MOTOR VEHICLE':0,  
                              'MURDER & NON-NEGL. MANSLAUGHTE':0})
        counts.update(local_crimes['Offense'].value_counts())
        
        #store crime ids and tallies in school DF
        school_df.loc[grouped.groups[name],['CrimeIDS']] = str(local_crimes.Identifier.unique().tolist())
        school_df.loc[grouped.groups[name],['Total Felonies w/in 1mi']] = len(local_crimes)        
        school_df.loc[grouped.groups[name],['Grand Larceny w/in 1mi']] = counts.loc['GRAND LARCENY']
        school_df.loc[grouped.groups[name],['Robbery w/in 1mi']] = counts.loc['ROBBERY']
        school_df.loc[grouped.groups[name],['Burglary w/in 1mi']] = counts.loc['BURGLARY']
        school_df.loc[grouped.groups[name],['Assault w/in 1mi']] = counts.loc['FELONY ASSAULT']
        school_df.loc[grouped.groups[name],['Auto Theft w/in 1mi']] = counts.loc['GRAND LARCENY OF MOTOR VEHICLE']
        school_df.loc[grouped.groups[name],['Rape w/in 1mi']] = counts.loc['RAPE']
        school_df.loc[grouped.groups[name],['Murder w/in 1mi']] = counts.loc['MURDER & NON-NEGL. MANSLAUGHTE']


### Testing Brute Force function on a small subset of the data

In [31]:
# create subsets
school_subset = school_df.head(5)
felony_subset = felony_df[felony_df['Occurrence Year'] == 2006].head(50)

In [32]:
# run function
bf_crimecount(school_subset, felony_subset)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is tryin

... created 9 new empty columns to store crime information.
... grouped by school and year, 4 groups in total
>>> now processing group: ('[40.812627, -73.919908]', 2006)
    ... found 50 crimes w/in this year
    ... of these, 0 occurred w/in one mile


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s


KeyError: "['Murder w/in 1mi'] not in index"

In [29]:
# take a look
school_subset

Unnamed: 0,School Name,School Type,School Year,BEDS Code,County,District,Enrollment,Grade Organization,Need/Resource Category,Alcohol Possession,...,long,CrimeIDS,Total Felonies w/in 1mi,Grand Larceny w/in 1mi,Robbery w/in 1mi,Burglary w/in 1mi,Assault w/in 1mi,Auto Theft w/in 1mi,Rape w/in 1mi,Murders w/in 1mi
0,Bronx Charter School For Better Learning,Charter,2006,321100860855,Bronx,,229,,,0,...,-73.847787,['9fed2cab'],1.0,0.0,1.0,0.0,0.0,0.0,0.0,
1,Bronx Charter School For Children,Charter,2006,320700860852,Bronx,,262,,,0,...,-73.919908,[],0.0,0.0,0.0,0.0,0.0,0.0,0.0,
2,Bronx Charter School For Excellence,Charter,2006,321100860859,Bronx,,185,,,0,...,-73.858543,"['2598f1c7', '26e0d827']",2.0,1.0,0.0,1.0,0.0,0.0,0.0,
3,Bronx Charter School For The Arts,Charter,2006,320800860846,Bronx,,284,,,0,...,-73.886463,[],0.0,0.0,0.0,0.0,0.0,0.0,0.0,
4,Bronx Lighthouse Charter School,Charter,2006,320800860870,Bronx,,169,,,0,...,,,,,,,,,,


## There must be a smarter way?

In [None]:
# ???
# STEP 1:
# STEP 2:
# STEP 3: