# Lat Long Analysis

This portion of our project uses the schools' latitude and longitude (previously obtained using geocoder) to perform some final cleaning on the VADIR data and then join it to the NYC crime data by location. The code below will:  

* __Load vadir data__ and cleaning it with the functions from cleandata.py  
* Ensure __schools are consistently named__ (we'll use the names from the lat long file and join them using the beds/seds code. This also means that we'll discard records from the schools for which we don't have lat/long)  
* __Fill in missing boroughs__ for records from 2006-2007.  
* __Identify felonies within a 1 mile__ radius of a given school.  
* __Plot correlations__ between school indicents and felonies (by year, by borough, by felony type, by location, by school incident type).

In [35]:
% matplotlib inline
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from vincenty import vincenty
import cleandata as cd
import time

### Read in and clean the VADIR data from the 2006-2014 school years.

In [2]:
# Raw data for each year
RAW_DATA_DICT = {2006: 'VADIR_2006.xls', 2007: 'VADIR_2007.xls', 2008: 'VADIR_2008.xls', 
                 2009: 'VADIR_2009.xls', 2010: 'VADIR_2010.xls', 2011: 'VADIR_2011.xls', 
                 2012: 'VADIR_2012.xls', 2013: 'VADIR_2013.xls', 2014: 'VADIR_2014.xls'}

# Duplicate name columns in raw files (and their replacements)
DUP_COLS = {'County Name':'County', 'District Name': 'District', 'BEDS CODE': 'BEDS Code', 
            'False Alarm':'Bomb Threat False Alarm',
            'Other Sex offenses': 'Other Sex Offenses', 
            'Use Possession or Sale of Drugs': 'Drug Possession', 
            'Use Possession or Sale of Alcohol': 'Alcohol Possession',
            'Other Disruptive Incidents': 'Other Disruptive Incidents', 
            'Drug Possesion': 'Drug Possession', 'Alcohol Possesion': 'Alcohol Possession', 
            'Other Disruptive': 'Other Disruptive Incidents'}

# Read in raw data and correct duplicate columns
vadir_df = cd.vadir_concat_dfs(RAW_DATA_DICT, DUP_COLS)

... data from VADIR_2006.xls appended. Added 1455 rows for a total of 1455.
... data from VADIR_2007.xls appended. Added 1500 rows for a total of 2955.
... data from VADIR_2008.xls appended. Added 1545 rows for a total of 4500.
... data from VADIR_2009.xls appended. Added 1531 rows for a total of 6031.
... data from VADIR_2010.xls appended. Added 1678 rows for a total of 7709.
... data from VADIR_2011.xls appended. Added 1693 rows for a total of 9402.
... data from VADIR_2012.xls appended. Added 1735 rows for a total of 11137.
... data from VADIR_2013.xls appended. Added 1792 rows for a total of 12929.
... data from VADIR_2014.xls appended. Added 1805 rows for a total of 14734.


In [3]:
# Reorder columns putting demographic information first.
DEMO_COLS = ['School Name', 'School Type', 'School Year', 'BEDS Code',  'County', 
             'District', 'Enrollment', 'Grade Organization', 'Need/Resource Category']
vadir_df = cd.vadir_reorder_columns(vadir_df, DEMO_COLS)

In [4]:
# Create Columns for "Total incidents", "Incidents w/out Weapons" and "Incidents w/ Weapons"
COLUMNS = vadir_df.columns.tolist()
INCIDENT_COLS = [c for c in COLUMNS if c not in DEMO_COLS]
vadir_df = cd.vadir_create_tallies(vadir_df, INCIDENT_COLS)

In [5]:
# Cleaning VADIR data...
# ... consistently name county and school type, fix name capitalization, remove comment rows.
school_df = cd.vadir_clean_concat_df(vadir_df)

In [6]:
# Take a look
school_df.head(2)

### Join School Data and Location Data

In [7]:
# read in lat-long file that contains correct names and locations
#... ??? replace this call with function call to geocoder ???
latlon_df = pd.read_csv('SchoolLatLon.csv', index_col=0)

In [8]:
# take a look
#latlon_df.head(2)

In [9]:
# ensure BEDS and SED are integers so that they'll be recognized as identical
latlon_df["SED CODE"] = latlon_df["SED CODE"].astype(np.int64)
school_df["BEDS Code"] = school_df["BEDS Code"].astype(np.int64)

In [10]:
# join latlong data to school data using the BEDS code
school_df = pd.merge(school_df, latlon_df, left_on=['BEDS Code'],right_on=['SED CODE'], how='left')

In [11]:
# drop the now redundant SED code
school_df.drop(['SED CODE'], axis=1, inplace=True)

In [12]:
# Take a look at the resulting data/missing values
print('... {} unique schools,'.format(len(school_df['BEDS Code'].unique())))
schools_withloc = school_df[school_df['latlon'].notnull()]['BEDS Code'].unique()
schools_missingloc = school_df[school_df['latlon'].isnull()]['BEDS Code'].unique()
print('... of which {} have lat/long'.format(len(schools_withloc)))
print('... and {} are missing lat/long'.format(len(schools_missingloc)))

... 1967 unique schools,
... of which 1807 have lat/long
... and 160 are missing lat/long


### Fix School Names

In [13]:
# The problem:
print('... original dataset has {} unique school names'.format(len(school_df['School Name'].unique())))
print('... but only {} unique BEDS Codes'.format(len(school_df['BEDS Code'].unique())))

... original dataset has 3135 unique school names
... but only 1967 unique BEDS Codes


In [14]:
# Helper Function
def fix_case(x):
    """Function to put a school name in the correct case"""
    if not x:
        return x
    elif x[:3] in ['PS ', 'JHS', 'MS ']:
        return x[:3] + x[3:].title()
    else:
        return x.title()

In [15]:
# Fix missing LEGAL NAMES with School Name
school_df['LEGAL NAME'].fillna(school_df['School Name'], inplace=True)

# Fix case and reassign to School Name
school_df['School Name'] = school_df['LEGAL NAME'].apply(fix_case)

# drop the now redundant LEGAL NAME column
school_df.drop(['LEGAL NAME'], axis=1, inplace=True)

In [16]:
# Problem Solved
print('... new dataset has {} unique school names.'.format(len(school_df['School Name'].unique())))
print('... and {} unique BEDS Codes.'.format(len(school_df['BEDS Code'].unique())))

... new dataset has 1959 unique school names.
... and 1967 unique BEDS Codes.


### Fill in missing boroughs

__TODO:__ From the numbers below it looks like some of the County values got switched around (eg. decrease in Manhattan counts?)... I think we probalby need a way of creating the county_map dictionary that prioritizes the traditional borough name). Come back to this.

In [17]:
# The first problem:
print('... {} entries are missing county info'.format(sum(school_df['County'].isnull())))
print('Other county tallies:')
school_df.County.value_counts()

... 3076 entries are missing county info
Other county tallies:


Brooklyn              3566
Bronx                 2737
Queens                2260
Manhattan             1228
New York              1040
Staten Island          478
Nyc Central Office     344
Nassau                   1
Name: County, dtype: int64

In [18]:
# create dictionary of county by BEDS Code
c = school_df[school_df['County'].notnull()][['BEDS Code','County']].to_dict()
county_map = {c['BEDS Code'][idx]: c['County'][idx] for idx in c['County'].keys()}

In [19]:
# map counties using dictionary
school_df.County = school_df['BEDS Code'].map(county_map)

In [20]:
# The first problem Solved
print('... Now only {} entries are missing county info'.format(sum(school_df['County'].isnull())))
print('Other county tallies:')
school_df.County.value_counts()

... Now only 3 entries are missing county info
Other county tallies:


Brooklyn              4637
Bronx                 3536
Queens                2961
New York              2856
Staten Island          639
Manhattan               80
Nassau                   9
Nyc Central Office       9
Name: County, dtype: int64

In [21]:
# Are any BEDS Codes are linked with more than one Borough(County)?
school_df.groupby('BEDS Code')['County'].apply(lambda x: len(x.unique())).value_counts()

1    1967
Name: County, dtype: int64

### Prep for distance sorting crime locations

In [23]:
# Helper function -- check dist
def is_in_radius(school_point, crime_point, radius):
    """
    Function using vincenty package to check distance between school and crime.
    INPUT: (lat,long) tuples for school and crime (in degrees), radius in miles.
    OUTPUT: Boolean
    """
    return vincenty(school_point, crime_point, miles=True) <= radius

In [24]:
# Helper function -- extract lat/long from object type
def parse_latlong(dataframe, loc_column):
    """
    Function to extract lat/long coords. 
    INPUT: dataframe and name of column with string tuple or list pair of coordinates.
    OUTPUT: n/a. Function modifies dataframe to add a lat and long column with float type.
    """
    get_lat = lambda x: x.split(',')[0][1:] if type(x)==type('s') else np.nan
    get_long = lambda x: x.split(',')[1][:-1] if type(x)==type('s') else np.nan
    dataframe['lat'] = dataframe[loc_column].apply(get_lat).astype('float64')
    dataframe['long'] = dataframe[loc_column].apply(get_long).astype('float64')
    print('... latitude and longitude extracted for dataframe.')

In [25]:
# load NYC dataframe
felony_df = pd.read_csv('NYPD_7_Major_Felony_Incidents.csv', index_col = False)

#### ERROR?: the cleaning function seems to take forever... need to take a closer look/maybe fix it.

In [26]:
# ... and clean it
#felony_df = cd.clean_NYPD(felony_df)

In [56]:
# Extact Lattitude and longitude data for both dataframes
parse_latlong(school_df, 'latlon')
parse_latlong(felony_df, 'Location 1')

... latitude and longitude extracted for dataframe.
... latitude and longitude extracted for dataframe.


In [41]:
# Testing vincenty on the first felony and first school
first_school_point = (school_df.loc[0,'lat'], school_df.loc[0,'long']) 
first_felony_point = (felony_df.loc[1,'lat'], felony_df.loc[1,'long']) 

# not w/in 2 miles, but yes, w/in 50
print('Distance: ', vincenty(first_school_point, first_felony_point))
print("... w/in 2 mi?", is_in_radius(first_school_point, first_felony_point, 2))
print("... w/in 50 mi?",is_in_radius(first_school_point, first_felony_point, 50))

Distance:  13.120085
... w/in 2 mi? False
... w/in 50 mi? True


## Function to extract crime tallies w/in radius of schools

I suspect that looping through each school is going to take forever... but we'll try that first and then explore a smarter (dynamic programming) alternative.

In [29]:
# Quick Check, are there rows with 'latlon' but not 'lat'
print('... there are {} missing latlon entries'.format(sum(school_df.latlon.isnull())))
print('... there are {} missing lat entries'.format(sum(school_df.lat.isnull())))

... there are 640 missing latlon entries
... there are 640 missing lat entries


#### Part 1: Helper functions for setting up a grid for NYC lat/long coords
NOTES: The max lat of a school is ~ 40.9  and the distance between 40.9 and 40.95 is over 3 miles... but there are 7 crimes that fell under the jurisdiction of the NY Transit police whose locations are recorded north of 41 degrees (the farthes one is 500 miles away). The minimum longitude of a school is ~-74.24 which is around 3 miles from -74.3. There are 63 crimes that occurred west of -74.3. I suggest that we disregard these outliers for the purposes of our analysis

In [88]:
# Initial exploration of ranges
max_lat = 40.95
min_lat = min(felony_df.lat.max(), school_df.lat.min())
max_long = max(felony_df.long.max(), school_df.long.max())
min_long = -74.3

lat_dist = vincenty((min_lat, 0.5*(max_long + min_long)),(max_lat, 0.5*(max_long + min_long)), miles=True)
long_dist = vincenty((min_long, 0.5*(max_lat + min_lat)),(max_long, 0.5*(max_lat + min_lat)), miles=True)

print('Latitude ranges from {} to {} with a total distance of {}'.format(min_lat, max_lat, lat_dist))
print('Longitude ranges from {} to {} with a total distance of {}'.format(min_long, max_long, long_dist))

Lat ranges from 40.5078027 to 40.95 with a total distance of 30.512686
Long ranges from -74.3 to -73.4883134 with a total distance of 56.290003


In [101]:
# Helper function to identify grid cell that contains a given point
def nyc_grid(lat,long):
    """
    This function identifies a square mile cell of NYC that contains 
    the given longitude and latitude point. There are 1500 cells in 
    total. 30 rows each represent a segement of latitude and 50 
    columns each represent a segment of longitude. The cells are 
    numbered 0 through 1599 and they are unique to this analysis.
    """
    # max and min values from data set
    max_lat = 40.95
    min_lat = 40.50
    max_long = -73.45
    min_long = -74.30
    
    # divide each range into segments of a little over a mile
    delta_lat = (max_lat - min_lat)/28
    delta_long = (max_long - min_long)/48

    # then segment each direction
    lat_seg = np.array([min_lat + idx*delta_lat for idx in range(-1,29)])
    long_seg = np.array([min_long + idx*delta_long for idx in range(-1,49)])

    # identify where given point fits in segments
    row = sum(lat_seg <= lat) - 1
    col = sum(long_seg <= long) - 1
    
    # return grid number
    if row < 0 or row == 29 or col < 0 or col == 49:
        return np.nan
    else:
        return row * 50 + col

In [114]:
# Test an out of bound point
nyc_grid(40.653161, -76.862164)

nan

In [115]:
# Test an in bound point
nyc_grid(40.821798, -73.886463)

1074

In [110]:
# helper function to get a list of adjacent cells
def get_adjacent(cell_num):
    """ 
    This function identifies a group of cells which together superset 
    any points within a mile of any location in the original cell.
    INPUT: a cell number (< 5999) from NYC grid
    OUTPUT: a list of adjacent and or diagonal cell numbers
    
    NOTE: this function should only be run on cell numbers of vadir
    school locations since the nyc_grid is designed so that all
    schools are in a cell that is not a boarder cell.
    """
    col = cell_num % 50
    row = cell_num // 50
    row_range = [row - 1, row, row + 1]
    col_range = [col - 1, col, col + 1]
    return [r * 50 + c for r in row_range for c in col_range]
    

In [111]:
# Test grid adjacency
get_adjacent(52)

[1, 2, 3, 51, 52, 53, 101, 102, 103]

#### Part 2: Crime counting function to search only within adjacent cells of the school

NOTE: loading grid cell#s for the felony data set takes 3-4 minutes and only needs to be done once. I've included it (commented out) in the function... but you should uncomment it in the final version of our notebook as well as the first time you load the felony data each day.

In [125]:
def gridbased_crimecount(school_df, felony_df):
    """
    Grid based function to identify felonies w/in one mile of each school.
    INPUT: school df w/ cols 'latlon', 'lat', 'long', and 'School Year'
           felony df w/ cols 'Occurrence Year', 'lat','long','Offense', and 'Identifier'
    OUTPUT: n/a, modifies school data.
    """
    # prepare felony dataframe by adding a column for nyc_grid cell number
#     tick0 = time.clock() # for debugging
#     felony_df.lat.fillna(0, inplace=True)
#     felony_df.long.fillna(0, inplace=True)
#     felony_df['NYC_grid'] = felony_df.apply(lambda x: nyc_grid(x.lat, x.long),axis=1)
#     tock0 = time.clock() # for debugging
#     print('< It took {} seconds to calculate all of the felony grids.'.format(tock0 - tick0))
    
    # Initialize new columns in school data frame
    school_df['CrimeIDS'] = pd.Series()
    school_df['Total Felonies w/in 1mi'] = pd.Series()
    school_df['Grand Larceny w/in 1mi'] = pd.Series()
    school_df['Robbery w/in 1mi'] = pd.Series()
    school_df['Burglary w/in 1mi'] = pd.Series()
    school_df['Assault w/in 1mi'] = pd.Series()
    school_df['Auto Theft w/in 1mi'] = pd.Series()
    school_df['Rape w/in 1mi'] = pd.Series()
    school_df['Murders w/in 1mi'] = pd.Series()
    
    # Group schools by location (school BEDS code) 
    grouped = school_df[school_df.lat.notnull()].groupby(['BEDS Code'])
    print('... found {} unique schools'.format(len(grouped.groups))) # for debugging
    
    # Loop through schools & years
    tick = time.clock() # for debugging
    for beds, df in grouped:
        tick1 = time.clock() # for debugging
        #print('>>> now processing:', beds) # for debugging
        # NOTE: the coordinates should all be the same so the mean is just the location
        assert len(df.lat.unique().tolist()) == 1, 'ERROR: multiple latitudes for this school.'
        location = (df.lat.mean(), df.long.mean())
        
        #tick2 = time.clock() # for debugging
        cells_to_search = get_adjacent(nyc_grid(*location)) 
        felony_loc = felony_df.loc[felony_df.NYC_grid.isin(cells_to_search)]
        ###felony_loc = felony_df[felony_df.NYC_grid.apply(lambda x: x in cells_to_search)]
        #print('    ... found {} crimes w/in 9 cells'.format(len(felony_loc))) # for debugging
        local = felony_loc.apply(lambda x: is_in_radius(location, (x.lat, x.long), 1), axis=1)
        local_crimes = felony_loc[local]
        #print('    ... of which {} are w/in one mile'.format(len(local_crimes))) # for debugging
        #tock2 = time.clock() # for debugging
        #print('< Find nearby crimes {} seconds>'.format(tock2 - tick2))  # for debugging     
        
        # tally and store felonies for each year
        for year in df['School Year'].unique():
            #tick3 = time.clock()  # for debugging
            # school_df indices for this year
            idxs = df[df['School Year'] == year].index.tolist()
            # subset of crimes w/in 1 mi that occurred this year
            subset = local_crimes[local_crimes['Occurrence Year'] == year]
            #print('    ... of which {} occurred in '.format(len(subset), year))  # for debugging 
            
            # tally and store felony counts
            school_df.loc[idxs,['CrimeIDS']] = str(subset.Identifier.unique().tolist())
            school_df.loc[idxs,['Total Felonies w/in 1mi']] = len(subset)        
            school_df.loc[idxs,['Grand Larceny w/in 1mi']] = sum(subset['Offense'] == 'GRAND LARCENY')
            school_df.loc[idxs,['Robbery w/in 1mi']] = sum(subset['Offense'] == 'ROBBERY')
            school_df.loc[idxs,['Burglary w/in 1mi']] = sum(subset['Offense'] == 'BURGLARY')
            school_df.loc[idxs,['Assault w/in 1mi']] = sum(subset['Offense'] == 'FELONY ASSAULT')
            school_df.loc[idxs,['Auto Theft w/in 1mi']] = sum(subset['Offense'] == 'GRAND LARCENY OF MOTOR VEHICLE')
            school_df.loc[idxs,['Rape w/in 1mi']] = sum(subset['Offense'] == 'RAPE')
            school_df.loc[idxs,['Murders w/in 1mi']] = sum(subset['Offense'] == 'MURDER & NON-NEGL. MANSLAUGHTE')
            #tock3 = time.clock()
            #print('< Add counts for the year: {} seconds>:'.format(tock3 - tick3))   # for debugging  
        tock1 = time.clock() # for debugging  
        print('>>> Processed {} in {} seconds.'.format( beds, tock1 - tick1)) # for debugging  
    tock = time.clock()  # for debugging  
    print(' TOTAL TIME:', tock - tick) # for debugging  

In [126]:
# run function (got rid of print statments first.)
gridbased_crimecount(school_subset, felony_df)

# BETTER, but time is still a struggle
# ... seems to take about 5 seconds per school (once the felony data is asigned grids.)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is tryin

... found 4 unique schools
>>> now processing: 320700860852
< Tally felonies for one school: 5.161850000000015 seconds>
>>> now processing: 320800860846
< Tally felonies for one school: 4.477165999999954 seconds>
>>> now processing: 321100860855
< Tally felonies for one school: 2.3801900000000273 seconds>
>>> now processing: 321100860859
< Tally felonies for one school: 3.5629490000000033 seconds>
 TOTAL TIME: 15.586018999999993


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s
