# COGS 108 - Data Checkpoint

# Names

- Sarah Borsotto
- Niharika Malhotra
- Marvin Ochoa Estrada
- Ann Luong
- Dhathry Doppalapudi


<a id='research_question'></a>
# Research Question

Is there a correlation between the number of parking tickets issued in different regions of Los Angeles County and the demographics of those regions?

# Dataset(s)

- Dataset Name: Los Angeles County Median Income and AMI
- Link to the dataset: https://geohub.lacity.org/datasets/lacounty::median-income-and-ami-census-tract/explore?location=33.769267%2C-118.302668%2C8.64&showTable=true 
- Number of observations: 2495

This dataset reports, for each census tract in Los Angeles County, the median household income, the Area Median Income (AMI) category derived from that income, and flags comparing the median household income to AMI thresholds. Each row also identifies the region the tract belongs to (the `sup_dist`, `csa`, and `spa` columns).

- Dataset Name: Parking Tickets - LA Open Data Portal
- Link to the dataset: https://data.lacity.org/Transportation/Parking-Citations/wjz9-h9np/data
- Number of observations: 15111221        

This dataset comes from Los Angeles’s Open Data Initiative, started by Mayor Eric Garcetti in 2013, and contains data on parking citations. The table includes the ticket number, the date the ticket was issued, the make of the car, the location (as a street address), the type of violation, the longitude/latitude, and the fine amount.


We will use the latitude and longitude given in the parking citation dataset to plot the location of each data point on a map. Then, we will see which data points fall into the regions defined in the AMI dataset in order to compare the density of parking tickets within each income region. 

# Setup

In [1]:
%pip install geopandas

Note: you may need to restart the kernel to use updated packages.


In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import os
sns.set(font_scale=2, style="white")

import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.style as style
plt.rcParams['figure.figsize'] = (12, 5)

# Plots latitude and longitude points on a map
from shapely.geometry import Point
import geopandas as gpd
from geopandas import GeoDataFrame

# Converts state plane coordinates to latitude and longitude
from pyproj import Proj, transform

%config InlineBackend.figure_format = 'retina'

# Data Cleaning

First, we will read the parking ticket dataset and find out what information we can extract from it.

In [3]:
citations = pd.read_csv(r"Parking_Citations.csv")



In [4]:
print(citations.columns)
citations.head()

Index(['Ticket number', 'Issue Date', 'Issue time', 'Meter Id', 'Marked Time',
       'RP State Plate', 'Plate Expiry Date', 'VIN', 'Make', 'Body Style',
       'Color', 'Location', 'Route', 'Agency', 'Violation code',
       'Violation Description', 'Fine amount', 'Latitude', 'Longitude',
       'Agency Description', 'Color Description', 'Body Style Description'],
      dtype='object')


Unnamed: 0,Ticket number,Issue Date,Issue time,Meter Id,Marked Time,RP State Plate,Plate Expiry Date,VIN,Make,Body Style,...,Route,Agency,Violation code,Violation Description,Fine amount,Latitude,Longitude,Agency Description,Color Description,Body Style Description
0,1103341116,12/21/2015,1251.0,,,CA,200304.0,,HOND,PA,...,01521,1.0,4000A1,NO EVIDENCE OF REG,50.0,99999.0,99999.0,,,
1,1103700150,12/21/2015,1435.0,,,CA,201512.0,,GMC,VN,...,1C51,1.0,4000A1,NO EVIDENCE OF REG,50.0,99999.0,99999.0,,,
2,1104803000,12/21/2015,2055.0,,,CA,201503.0,,NISS,PA,...,2R2,2.0,8939,WHITE CURB,58.0,6439997.9,1802686.4,,,
3,1104820732,12/26/2015,1515.0,,,CA,,,ACUR,PA,...,2F11,2.0,000,17104h,,6440041.1,1802686.2,,,
4,1105461453,09/15/2015,115.0,,,CA,200316.0,,CHEV,PA,...,1FB70,1.0,8069A,NO STOPPING/STANDING,93.0,99999.0,99999.0,,,


Next, we are going to get rid of all of the columns that we don't need and then drop any rows that have missing values. The information that is of interest to us is:
- Ticket number
- Issue date
- State plate
- Make
- Location
- Violation code
- Violation description
- Fine amount
- Latitude
- Longitude

In [5]:
citations = citations[['Ticket number', 'Issue Date', 'RP State Plate', 'Make', 'Location', 'Violation code', 'Violation Description', 'Fine amount', 'Latitude', 'Longitude']]
citations = citations.dropna()
print(citations.shape)
citations.head()

(15111221, 10)


Unnamed: 0,Ticket number,Issue Date,RP State Plate,Make,Location,Violation code,Violation Description,Fine amount,Latitude,Longitude
0,1103341116,12/21/2015,CA,HOND,13147 WELBY WAY,4000A1,NO EVIDENCE OF REG,50.0,99999.0,99999.0
1,1103700150,12/21/2015,CA,GMC,525 S MAIN ST,4000A1,NO EVIDENCE OF REG,50.0,99999.0,99999.0
2,1104803000,12/21/2015,CA,NISS,200 WORLD WAY,8939,WHITE CURB,58.0,6439997.9,1802686.4
4,1105461453,09/15/2015,CA,CHEV,GEORGIA ST/OLYMPIC,8069A,NO STOPPING/STANDING,93.0,99999.0,99999.0
5,1106226590,09/15/2015,CA,CHEV,SAN PEDRO S/O BOYD,4000A1,NO EVIDENCE OF REG,50.0,99999.0,99999.0


Now, we are going to extract the year in which each ticket was issued to make it easier to filter the data.

In [6]:
citations['year'] = pd.to_datetime(citations['Issue Date'], format='%m/%d/%Y').dt.year
citations.head()

Unnamed: 0,Ticket number,Issue Date,RP State Plate,Make,Location,Violation code,Violation Description,Fine amount,Latitude,Longitude,year
0,1103341116,12/21/2015,CA,HOND,13147 WELBY WAY,4000A1,NO EVIDENCE OF REG,50.0,99999.0,99999.0,2015
1,1103700150,12/21/2015,CA,GMC,525 S MAIN ST,4000A1,NO EVIDENCE OF REG,50.0,99999.0,99999.0,2015
2,1104803000,12/21/2015,CA,NISS,200 WORLD WAY,8939,WHITE CURB,58.0,6439997.9,1802686.4,2015
4,1105461453,09/15/2015,CA,CHEV,GEORGIA ST/OLYMPIC,8069A,NO STOPPING/STANDING,93.0,99999.0,99999.0,2015
5,1106226590,09/15/2015,CA,CHEV,SAN PEDRO S/O BOYD,4000A1,NO EVIDENCE OF REG,50.0,99999.0,99999.0,2015


We're only interested in data from the past 5 years, so we will keep only the entries whose issue year is between 2018 and 2022 (inclusive).

In [7]:
citations = citations[citations['year'].between(2018, 2022)]  # both endpoints are inclusive
print(citations.shape)
citations['year'].value_counts()

(8448834, 11)


2018    1995302
2019    1949171
2021    1567893
2022    1513752
2020    1422716
Name: year, dtype: int64

In [8]:
citations.head()

Unnamed: 0,Ticket number,Issue Date,RP State Plate,Make,Location,Violation code,Violation Description,Fine amount,Latitude,Longitude,year
19951,4336729224,10/27/2018,CA,TOYT,10667 TELFAIR AVE,22500E,BLOCKING DRIVEWAY,68.0,99999.0,99999.0,2018
20614,4336729235,10/27/2018,CA,OTHR,10341 WOODWARD AVE,80.73.2,EXCEED 72HRS-ST,68.0,99999.0,99999.0,2018
20615,4336729246,10/27/2018,CA,LEXS,10650 SHERMAN GROVE AVE,80.73.2,EXCEED 72HRS-ST,68.0,6464946.0,1918022.0,2018
20616,4336729250,10/27/2018,CA,PONT,7530 SAN FERNANDO ROAD,80.73.2,EXCEED 72HRS-ST,68.0,6454926.0,1898328.0,2018
20617,4336729261,10/27/2018,CA,PONT,7530 SAN FERNANDO ROAD,5204A-,DISPLAY OF TABS,25.0,6454926.0,1898328.0,2018


Looking at `citations.head()` now, you can see that there are a lot of invalid coordinates (latitude and longitude are both 99999). You can also see that the valid coordinates are expressed in the California State Plane Coordinate System, Zone 5, rather than as the ordinary latitude and longitude that we are used to. We need latitude and longitude coordinates to plot the location of the tickets on a map, so for now we are only going to look at entries with valid coordinates and convert them to latitude/longitude.

In [9]:
citations = citations[citations['Longitude'] != 99999]
print(citations.shape)
citations.head()

(7802916, 11)


Unnamed: 0,Ticket number,Issue Date,RP State Plate,Make,Location,Violation code,Violation Description,Fine amount,Latitude,Longitude,year
20615,4336729246,10/27/2018,CA,LEXS,10650 SHERMAN GROVE AVE,80.73.2,EXCEED 72HRS-ST,68.0,6464946.0,1918022.0,2018
20616,4336729250,10/27/2018,CA,PONT,7530 SAN FERNANDO ROAD,80.73.2,EXCEED 72HRS-ST,68.0,6454926.0,1898328.0,2018
20617,4336729261,10/27/2018,CA,PONT,7530 SAN FERNANDO ROAD,5204A-,DISPLAY OF TABS,25.0,6454926.0,1898328.0,2018
20618,4336729272,10/27/2018,CA,DODG,7590 GLENOAKS BLVD,80.73.2,EXCEED 72HRS-ST,68.0,6457772.0,1898373.0,2018
20619,4336729283,10/27/2018,CA,TOYT,9601 CABRINI DR,80.73.2,EXCEED 72HRS-ST,68.0,6458568.0,1899818.0,2018
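
The filter above checks only the `Longitude` column. As a minimal sketch, assuming a row could carry the 99999 placeholder in just one of the two fields, both columns can be checked at once. The first two toy rows below use values from the tables above; the third row is hypothetical.

```python
import pandas as pd

SENTINEL = 99999.0  # placeholder the dataset uses for a missing coordinate

# Toy stand-in for the citations table.
toy = pd.DataFrame({
    'Latitude': [99999.0, 6464946.0, 99999.0],
    'Longitude': [99999.0, 1918022.0, 1898328.0],  # third row is hypothetical
})

# Keep a row only if neither coordinate is the placeholder.
valid = toy[(toy['Latitude'] != SENTINEL) & (toy['Longitude'] != SENTINEL)]
print(valid)
```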


Looking at `citations.shape`, you can see that there are over 7 million entries. Converting every entry one row at a time would take multiple days on our computers, so we're only going to look at a random sample of 5000 entries from each year.

In [10]:
citations2018 = citations[citations['year'] == 2018]
citations2018 = citations2018.sample(frac = 1) #shuffle the data
citations2018 = citations2018.head(5000)

In [11]:
citations2019 = citations[citations['year'] == 2019]
citations2019 = citations2019.sample(frac = 1)
citations2019 = citations2019.head(5000)

In [12]:
citations2020 = citations[citations['year'] == 2020]
citations2020 = citations2020.sample(frac = 1)
citations2020 = citations2020.head(5000)

In [13]:
citations2021 = citations[citations['year'] == 2021]
citations2021 = citations2021.sample(frac = 1)
citations2021 = citations2021.head(5000)

In [14]:
citations2022 = citations[citations['year'] == 2022]
citations2022 = citations2022.sample(frac = 1)
citations2022 = citations2022.head(5000)

In [15]:
citations = pd.concat([citations2018, citations2019, citations2020, citations2021, citations2022])
#we should have 25,000 entries
citations.shape

(25000, 11)

In [16]:
citations['year'].value_counts()

2018    5000
2019    5000
2020    5000
2021    5000
2022    5000
Name: year, dtype: int64
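
The five per-year cells above repeat the same shuffle-and-take-the-head pattern. As a sketch of a more compact alternative, pandas can draw the same number of rows from every group in one call with `groupby(...).sample(...)`. The toy frame and `N_PER_YEAR = 3` are illustrative assumptions; the notebook draws 5000 per year. Passing `random_state` makes the sample reproducible, which the unseeded shuffle above is not.

```python
import pandas as pd

# Toy stand-in for the citations table (hypothetical values).
toy = pd.DataFrame({
    'Ticket number': range(12),
    'year': [2018] * 4 + [2019] * 4 + [2020] * 4,
})

N_PER_YEAR = 3  # the notebook uses 5000

# Draw N_PER_YEAR rows without replacement from each year.
# random_state fixes the draw so rerunning the cell gives the same sample.
toy_sample = toy.groupby('year').sample(n=N_PER_YEAR, random_state=0)

print(toy_sample['year'].value_counts().sort_index())
```

Note that `sample(n=...)` raises an error if any group has fewer than `n` rows, which is not a concern here because every year has more than a million citations.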

In [17]:
citations.head()

Unnamed: 0,Ticket number,Issue Date,RP State Plate,Make,Location,Violation code,Violation Description,Fine amount,Latitude,Longitude,year
7303609,4328154935,03/16/2018,CA,MERZ,16400 VENTURA BL,88.13B+,METER EXP.,63.0,6413382.0,1879975.0,2018
8569173,4344285551,12/11/2018,CA,BMW,9901 PICO BL W,88.13B+,METER EXP.,63.0,6438605.0,1841744.0,2018
7362643,4329008946,03/28/2018,CA,LEXS,1147 ECHO PARK AVE,88.63B+,OFF STR/OVERTIME/MTR,58.0,6483441.0,1849852.0,2018
8632489,4344052042,12/23/2018,CA,MAZD,843 LOS ANGELES ST S,88.13B+,METER EXP.,63.0,6484998.0,1837549.0,2018
6971339,4324364936,01/10/2018,CA,DODG,579 VENICE BLVD,80.71.3,PARKING/FRONT YARD,68.0,6420827.0,1818193.0,2018


Now that we have reduced our dataset to a size that we can work with, we can convert all of the state plane coordinates to latitude and longitude coordinates.

In [18]:
from pyproj import Transformer

# source: the current coordinate system (CA State Plane Zone 5, EPSG:2229, US survey feet)
# target: the coordinate system that we want (latitude/longitude, EPSG:4326)
# always_xy=True keeps the axis order as (x, y) in and (lon, lat) out
transformer = Transformer.from_crs('EPSG:2229', 'EPSG:4326', always_xy=True)
count = 0

for index in citations.index:
    # despite their names, these columns hold the state plane x and y values
    x = citations.loc[index, 'Latitude']
    y = citations.loc[index, 'Longitude']
    new_lon, new_lat = transformer.transform(x, y)
    citations.loc[index, 'Longitude'] = new_lon
    citations.loc[index, 'Latitude'] = new_lat
    count += 1
    if count % 5000 == 0:
        print(count) #track progress



5000
10000
15000
20000
25000
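
The conversion above handles one row per loop iteration. pyproj's `Transformer.transform` also accepts whole arrays, so a sketch of a vectorized version follows. The two input rows are taken from the sampled citations shown earlier; the variable names `pts` and `to_wgs84` are illustrative, not part of the notebook.

```python
import pandas as pd
from pyproj import Transformer

# Two sampled rows from the notebook. The columns named 'Latitude' and
# 'Longitude' hold state plane x and y (in feet) before conversion.
pts = pd.DataFrame({
    'Latitude': [6413382.0, 6438605.0],
    'Longitude': [1879975.0, 1841744.0],
})

# always_xy=True means inputs are (x, y) and outputs are (lon, lat).
to_wgs84 = Transformer.from_crs('EPSG:2229', 'EPSG:4326', always_xy=True)

# One call converts every row because transform() works on arrays.
lon, lat = to_wgs84.transform(pts['Latitude'].to_numpy(), pts['Longitude'].to_numpy())
pts['Longitude'] = lon
pts['Latitude'] = lat

print(pts)
```

A single array call avoids the per-row Python overhead, so it may make converting far more than 25,000 rows practical, although it has not been timed on the full dataset here.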


In [19]:
citations.head()

Unnamed: 0,Ticket number,Issue Date,RP State Plate,Make,Location,Violation code,Violation Description,Fine amount,Latitude,Longitude,year
7303609,4328154935,03/16/2018,CA,MERZ,16400 VENTURA BL,88.13B+,METER EXP.,63.0,34.157264,-118.490149,2018
8569173,4344285551,12/11/2018,CA,BMW,9901 PICO BL W,88.13B+,METER EXP.,63.0,34.052521,-118.406265,2018
7362643,4329008946,03/28/2018,CA,LEXS,1147 ECHO PARK AVE,88.63B+,OFF STR/OVERTIME/MTR,58.0,34.075207,-118.258318,2018
8632489,4344052042,12/23/2018,CA,MAZD,843 LOS ANGELES ST S,88.13B+,METER EXP.,63.0,34.041412,-118.253074,2018
6971339,4324364936,01/10/2018,CA,DODG,579 VENICE BLVD,80.71.3,PARKING/FRONT YARD,68.0,33.987596,-118.464595,2018


# Median Income and AMI Dataset Cleaning

Now we will explore the median income and AMI dataset so that we can better establish the socioeconomic status of each region later on.

We will begin by reading in the data.

In [20]:
# read the median income and AMI GeoJSON file, which includes the boundary of each census tract
ami = gpd.read_file(r"Median_Income_and_AMI_(census_tract).geojson")
ami

Unnamed: 0,tract,med_hh_income,med_hh_income_universe,ami_category,below_med_income,below_60pct_med_income,below_moderate_income,sup_dist,csa,spa,ESRI_OID,Shape__Area,Shape__Length,geometry
0,06037199700,38892.0,1204,Very Low Income,Yes,Yes,Yes,District 1,Los Angeles - Wholesale District,SPA 4 - Metro,2347,1.041050e+07,13808.463241,"POLYGON ((-118.22672 34.06242, -118.22453 34.0..."
1,06037199801,41027.0,903,Very Low Income,Yes,Yes,Yes,District 1,Los Angeles - Lincoln Heights,SPA 4 - Metro,2348,3.724107e+06,9459.391827,"POLYGON ((-118.21559 34.07186, -118.21169 34.0..."
2,06037199802,42500.0,612,Very Low Income,Yes,Yes,Yes,District 1,Los Angeles - Lincoln Heights,SPA 4 - Metro,2349,3.296129e+06,8868.744225,"POLYGON ((-118.21563 34.07365, -118.21309 34.0..."
3,06037199900,37232.0,845,Very Low Income,Yes,Yes,Yes,District 1,Los Angeles - Lincoln Heights,SPA 4 - Metro,2350,4.782361e+06,10141.728022,"POLYGON ((-118.21528 34.06349, -118.21547 34.0..."
4,06037201110,65000.0,782,Low Income,Yes,No,Yes,District 1,Los Angeles - El Sereno,SPA 4 - Metro,2351,1.099246e+07,15893.383636,"POLYGON ((-118.18182 34.09277, -118.18039 34.0..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2490,06037554516,126450.0,1215,Above Moderate Income,No,No,No,District 4,City of Cerritos,SPA 7 - East,4837,1.555650e+07,21274.227408,"POLYGON ((-118.07225 33.85452, -118.07047 33.8..."
2491,06037554517,107672.0,1352,Above Moderate Income,No,No,No,District 4,City of Cerritos,SPA 7 - East,4838,1.421767e+07,15905.089174,"POLYGON ((-118.06374 33.86586, -118.05352 33.8..."
2492,06037554518,104439.0,1558,Above Moderate Income,No,No,No,District 4,City of Cerritos,SPA 7 - East,4839,1.938903e+07,21218.412991,"POLYGON ((-118.04646 33.87326, -118.03776 33.8..."
2493,06037554519,131012.0,1216,Above Moderate Income,No,No,No,District 4,City of Cerritos,SPA 7 - East,4840,1.866694e+07,19500.866806,"POLYGON ((-118.06362 33.85858, -118.05495 33.8..."


### Median Income and AMI

We know that the median income and AMI dataset has statistics for median household income and the corresponding AMI category, but what else does it include? Let's take a look!

In [21]:
ami.head()

Unnamed: 0,tract,med_hh_income,med_hh_income_universe,ami_category,below_med_income,below_60pct_med_income,below_moderate_income,sup_dist,csa,spa,ESRI_OID,Shape__Area,Shape__Length,geometry
0,6037199700,38892.0,1204,Very Low Income,Yes,Yes,Yes,District 1,Los Angeles - Wholesale District,SPA 4 - Metro,2347,10410500.0,13808.463241,"POLYGON ((-118.22672 34.06242, -118.22453 34.0..."
1,6037199801,41027.0,903,Very Low Income,Yes,Yes,Yes,District 1,Los Angeles - Lincoln Heights,SPA 4 - Metro,2348,3724107.0,9459.391827,"POLYGON ((-118.21559 34.07186, -118.21169 34.0..."
2,6037199802,42500.0,612,Very Low Income,Yes,Yes,Yes,District 1,Los Angeles - Lincoln Heights,SPA 4 - Metro,2349,3296129.0,8868.744225,"POLYGON ((-118.21563 34.07365, -118.21309 34.0..."
3,6037199900,37232.0,845,Very Low Income,Yes,Yes,Yes,District 1,Los Angeles - Lincoln Heights,SPA 4 - Metro,2350,4782361.0,10141.728022,"POLYGON ((-118.21528 34.06349, -118.21547 34.0..."
4,6037201110,65000.0,782,Low Income,Yes,No,Yes,District 1,Los Angeles - El Sereno,SPA 4 - Metro,2351,10992460.0,15893.383636,"POLYGON ((-118.18182 34.09277, -118.18039 34.0..."


What about the size of the dataset?

In [22]:
ami.shape

(2495, 14)

This means that there are currently 2495 observations in our data, with 14 different columns. What can these columns tell us?

In [23]:
ami.columns

Index(['tract', 'med_hh_income', 'med_hh_income_universe', 'ami_category',
       'below_med_income', 'below_60pct_med_income', 'below_moderate_income',
       'sup_dist', 'csa', 'spa', 'ESRI_OID', 'Shape__Area', 'Shape__Length',
       'geometry'],
      dtype='object')

In [24]:
ami.describe()

Unnamed: 0,med_hh_income,med_hh_income_universe,ESRI_OID,Shape__Area,Shape__Length
count,2458.0,2495.0,2495.0,2495.0,2495.0
mean,76849.334418,1335.672946,3594.0,45797160.0,20273.817017
std,35546.132788,533.910989,720.388784,427058000.0,34074.651234
min,4918.0,0.0,2347.0,483653.2,2815.257443
25%,51157.5,988.0,2970.5,5921008.0,10607.040729
50%,69698.0,1282.0,3594.0,10257610.0,14365.057033
75%,94515.5,1625.0,4217.5,18357920.0,19904.426714
max,250001.0,5617.0,4841.0,16086910000.0,915242.577112


`describe()` only summarizes the five numeric columns, so more than half of the columns are categorical rather than numerical. What are the types of the remaining columns?

In [25]:
# finding types of each column
ami.dtypes

tract                       object
med_hh_income              float64
med_hh_income_universe       int64
ami_category                object
below_med_income            object
below_60pct_med_income      object
below_moderate_income       object
sup_dist                    object
csa                         object
spa                         object
ESRI_OID                     int64
Shape__Area                float64
Shape__Length              float64
geometry                  geometry
dtype: object

In [26]:
ami.index

RangeIndex(start=0, stop=2495, step=1)

Something that will be important for our analysis later on is identifying median income for specific regions. Let's take a look at the different locations in this dataset:

In [27]:
ami['csa'].unique()

array(['Los Angeles - Wholesale District',
       'Los Angeles - Lincoln Heights', 'Los Angeles - El Sereno',
       'Los Angeles - Highland Park', 'Los Angeles - University Hills',
       'Los Angeles - Boyle Heights', 'City of Pomona',
       'City of Diamond Bar', 'Unincorporated - Rowland Heights',
       'City of Industry', 'City of Walnut', 'Unincorporated - Covina',
       'City of Covina', 'City of San Dimas',
       'Unincorporated - Covina (Charter Oak)', 'City of Glendora',
       'Unincorporated - Azusa', 'City of Azusa', 'City of Norwalk',
       'City of Artesia', 'City of Lakewood', 'City of Hawaiian Gardens',
       'City of Long Beach', 'City of Cerritos',
       'Los Angeles - Little Tokyo', 'Los Angeles - Chinatown',
       'Los Angeles - Downtown', 'Los Angeles - Temple-Beaudry',
       'Los Angeles - Historic Filipinotown', 'Los Angeles - Westlake',
       'City of Irwindale', 'City of Baldwin Park', 'City of West Covina',
       'City of Signal Hill', 'Los Angeles

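Many of the `csa` values carry a prefix such as 'Los Angeles - ' or 'Unincorporated - '. Since all of our data is from Los Angeles County, we can drop the prefix and keep only the city or community name.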

In [28]:
# function to single out the city/community from the string
def strip_region_prefix(city):
    # keep only the text after the first ' - ' separator, if there is one
    if ' - ' in city:
        return city.split(' - ', 1)[1]
    return city
ami['csa'] = ami['csa'].apply(strip_region_prefix)

In [29]:
ami.head()

Unnamed: 0,tract,med_hh_income,med_hh_income_universe,ami_category,below_med_income,below_60pct_med_income,below_moderate_income,sup_dist,csa,spa,ESRI_OID,Shape__Area,Shape__Length,geometry
0,6037199700,38892.0,1204,Very Low Income,Yes,Yes,Yes,District 1,Wholesale District,SPA 4 - Metro,2347,10410500.0,13808.463241,"POLYGON ((-118.22672 34.06242, -118.22453 34.0..."
1,6037199801,41027.0,903,Very Low Income,Yes,Yes,Yes,District 1,Lincoln Heights,SPA 4 - Metro,2348,3724107.0,9459.391827,"POLYGON ((-118.21559 34.07186, -118.21169 34.0..."
2,6037199802,42500.0,612,Very Low Income,Yes,Yes,Yes,District 1,Lincoln Heights,SPA 4 - Metro,2349,3296129.0,8868.744225,"POLYGON ((-118.21563 34.07365, -118.21309 34.0..."
3,6037199900,37232.0,845,Very Low Income,Yes,Yes,Yes,District 1,Lincoln Heights,SPA 4 - Metro,2350,4782361.0,10141.728022,"POLYGON ((-118.21528 34.06349, -118.21547 34.0..."
4,6037201110,65000.0,782,Low Income,Yes,No,Yes,District 1,El Sereno,SPA 4 - Metro,2351,10992460.0,15893.383636,"POLYGON ((-118.18182 34.09277, -118.18039 34.0..."


We should also check for null values.

In [30]:
# checking if csa column has nans
ami['csa'].hasnans

False

In [31]:
# checking if ami_category column has nans
ami['ami_category'].hasnans

True

In [32]:
# checking if median_hh_income has nans
ami['med_hh_income'].hasnans

True

In [33]:
# finding all the rows in the dataframe where the med_hh_income column is NaN
ami.loc[ ami['med_hh_income'].isna() ]

Unnamed: 0,tract,med_hh_income,med_hh_income_universe,ami_category,below_med_income,below_60pct_med_income,below_moderate_income,sup_dist,csa,spa,ESRI_OID,Shape__Area,Shape__Length,geometry
446,6037216301,,614,,,,,District 2,Miracle Mile,SPA 4 - Metro,2793,6049547.0,11309.81632,"POLYGON ((-118.37618 34.05960, -118.37226 34.0..."
502,6037578100,,0,,,,,District 4,City of Long Beach,SPA 8 - South Bay,2849,18516370.0,18912.709493,"POLYGON ((-118.12347 33.78716, -118.12143 33.7..."
504,6037599100,,79,,,,,District 4,Santa Catalina Island,SPA 8 - South Bay,2851,3698756000.0,554591.259931,"MULTIPOLYGON (((-118.60443 33.47856, -118.5988..."
761,6037222700,,108,,,,,District 2,Exposition Park,SPA 6 - South,3108,9012062.0,13506.623725,"POLYGON ((-118.29155 34.02550, -118.28776 34.0..."
896,6037115103,,11,,,,,District 3,Northridge,SPA 2 - San Fernando,3243,16329320.0,22141.574995,"POLYGON ((-118.53393 34.24457, -118.52741 34.2..."
1458,6037265301,,0,,,,,District 3,Westwood,SPA 5 - West,3805,17371240.0,20071.778704,"POLYGON ((-118.45550 34.07585, -118.45467 34.0..."
1560,6037901003,,0,,,,,District 5,City of Lancaster,SPA 1 - Antelope Valley,3907,28092030.0,21201.25735,"POLYGON ((-118.23668 34.70401, -118.21916 34.7..."
1658,6037273403,,1020,,,,,District 3,Venice,SPA 5 - West,4005,2869621.0,8586.103161,"POLYGON ((-118.47977 33.99453, -118.47762 33.9..."
2052,6037920200,,0,,,,,District 5,Castaic,SPA 2 - San Fernando,4399,122208700.0,64176.721584,"POLYGON ((-118.61868 34.48922, -118.61079 34.4..."
2072,6037980001,,0,,,,,District 5,City of Burbank,SPA 2 - San Fernando,4419,25081050.0,22534.656488,"POLYGON ((-118.37032 34.20120, -118.36582 34.2..."


In [34]:
# all of the rows having NaN in med_hh_income column have NaNs in ami_category,
# below_med_income, below_60pct_med_income, and below_moderate_income as well
ami.loc[ ami['med_hh_income'].isna()].shape
ami.shape

(2495, 14)

In [35]:
# shape of df once dropping rows with NaN
ami.dropna().shape

(2458, 14)
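
Dropping rows with NaN reduces the table from 2495 to 2458 rows, which matches the `med_hh_income` count of 2458 reported by `describe()`. Because a notebook cell displays only its last expression, the claim that the derived columns are missing in exactly the same rows is not shown directly by the output above. A minimal sketch of one way to check it, using a toy frame with hypothetical values and only two of the derived columns:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the ami table (hypothetical values).
toy_ami = pd.DataFrame({
    'med_hh_income': [38892.0, np.nan, 65000.0, np.nan],
    'ami_category': ['Very Low Income', None, 'Low Income', None],
    'below_med_income': ['Yes', None, 'Yes', None],
})

income_missing = toy_ami['med_hh_income'].isna()
derived_cols = ['ami_category', 'below_med_income']

# True only if every derived column is missing in exactly the rows
# where med_hh_income is missing, and present everywhere else.
same_rows = toy_ami[derived_cols].isna().eq(income_missing, axis=0).all().all()

print(income_missing.sum(), same_rows)
```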

In [36]:
# dropping rows with nans
ami.dropna(inplace = True)

In [37]:
ami['med_hh_income'].hasnans

False

In [38]:
ami['ami_category'].hasnans

False

We can also simplify the location labels by eliminating redundant information. For example, in the csa column, many of the locations are described as 'City of ...'. We will therefore remove this prefix and keep only the name of the city.

In [39]:
ami['csa'].unique()

array(['Wholesale District', 'Lincoln Heights', 'El Sereno',
       'Highland Park', 'University Hills', 'Boyle Heights',
       'City of Pomona', 'City of Diamond Bar', 'Rowland Heights',
       'City of Industry', 'City of Walnut', 'Covina', 'City of Covina',
       'City of San Dimas', 'Covina (Charter Oak)', 'City of Glendora',
       'Azusa', 'City of Azusa', 'City of Norwalk', 'City of Artesia',
       'City of Lakewood', 'City of Hawaiian Gardens',
       'City of Long Beach', 'City of Cerritos', 'Little Tokyo',
       'Chinatown', 'Downtown', 'Temple-Beaudry', 'Historic Filipinotown',
       'Westlake', 'City of Irwindale', 'City of Baldwin Park',
       'City of West Covina', 'City of Signal Hill', 'Pico-Union',
       'Hancock Park', 'Wilshire Center', 'Little Bangladesh',
       'Koreatown', 'Bassett', 'City of La Puente', 'West Puente Valley',
       'Valinda', 'San Jose Hills', 'Avocado Heights', 'North Whittier',
       'Hacienda Heights', 'Tujunga', 'Sun Valley', 'Sunlan

In [40]:
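# drop a leading 'City of ' from the string, as that is unnecessary
# (startswith avoids cutting names that merely contain 'City of ' elsewhere)
fncof = lambda x : x[len('City of '):] if x.startswith('City of ') else x
ami['csa'] = ami['csa'].apply(fncof)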
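
Both label clean-up steps can also be written with pandas' vectorized string methods instead of `apply`. The sketch below is an alternative, not the notebook's method; the sample labels are taken from the original `ami['csa'].unique()` output, and the names `raw` and `cleaned` are illustrative.

```python
import pandas as pd

# A few raw csa labels as they appear before any cleaning.
raw = pd.Series([
    'Los Angeles - Wholesale District',
    'Unincorporated - Covina (Charter Oak)',
    'City of Covina',
    'Los Angeles - Temple-Beaudry',
])

cleaned = (
    raw.str.replace(r'^.*? - ', '', regex=True)    # drop everything up to the first ' - '
       .str.replace(r'^City of ', '', regex=True)  # drop a leading 'City of '
)

print(cleaned.tolist())
# ['Wholesale District', 'Covina (Charter Oak)', 'Covina', 'Temple-Beaudry']
```

One consequence worth keeping in mind is that labels such as 'City of Covina' and 'Unincorporated - Covina' both become 'Covina' after cleaning, so they can no longer be told apart by name.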

In [41]:
#checking the unique values
ami['csa'].unique()

array(['Wholesale District', 'Lincoln Heights', 'El Sereno',
       'Highland Park', 'University Hills', 'Boyle Heights', 'Pomona',
       'Diamond Bar', 'Rowland Heights', 'Industry', 'Walnut', 'Covina',
       'San Dimas', 'Covina (Charter Oak)', 'Glendora', 'Azusa',
       'Norwalk', 'Artesia', 'Lakewood', 'Hawaiian Gardens', 'Long Beach',
       'Cerritos', 'Little Tokyo', 'Chinatown', 'Downtown',
       'Temple-Beaudry', 'Historic Filipinotown', 'Westlake', 'Irwindale',
       'Baldwin Park', 'West Covina', 'Signal Hill', 'Pico-Union',
       'Hancock Park', 'Wilshire Center', 'Little Bangladesh',
       'Koreatown', 'Bassett', 'La Puente', 'West Puente Valley',
       'Valinda', 'San Jose Hills', 'Avocado Heights', 'North Whittier',
       'Hacienda Heights', 'Tujunga', 'Sun Valley', 'Sunland',
       'Lakeview Terrace', 'Shadow Hills', 'Pacoima', 'Country Club Park',
       'Melrose', 'Park La Brea', 'Carthay', 'Miracle Mile',
       'South Carthay', 'Crestview', 'Mid-city', 'La

Lastly, we can delete unnecessary columns.

In [42]:
ami = ami.drop(columns=['tract', 'ESRI_OID'])
ami.head()

Unnamed: 0,med_hh_income,med_hh_income_universe,ami_category,below_med_income,below_60pct_med_income,below_moderate_income,sup_dist,csa,spa,Shape__Area,Shape__Length,geometry
0,38892.0,1204,Very Low Income,Yes,Yes,Yes,District 1,Wholesale District,SPA 4 - Metro,10410500.0,13808.463241,"POLYGON ((-118.22672 34.06242, -118.22453 34.0..."
1,41027.0,903,Very Low Income,Yes,Yes,Yes,District 1,Lincoln Heights,SPA 4 - Metro,3724107.0,9459.391827,"POLYGON ((-118.21559 34.07186, -118.21169 34.0..."
2,42500.0,612,Very Low Income,Yes,Yes,Yes,District 1,Lincoln Heights,SPA 4 - Metro,3296129.0,8868.744225,"POLYGON ((-118.21563 34.07365, -118.21309 34.0..."
3,37232.0,845,Very Low Income,Yes,Yes,Yes,District 1,Lincoln Heights,SPA 4 - Metro,4782361.0,10141.728022,"POLYGON ((-118.21528 34.06349, -118.21547 34.0..."
4,65000.0,782,Low Income,Yes,No,Yes,District 1,El Sereno,SPA 4 - Metro,10992460.0,15893.383636,"POLYGON ((-118.18182 34.09277, -118.18039 34.0..."


Let's take a final look at our dataset!

In [43]:
ami

Unnamed: 0,med_hh_income,med_hh_income_universe,ami_category,below_med_income,below_60pct_med_income,below_moderate_income,sup_dist,csa,spa,Shape__Area,Shape__Length,geometry
0,38892.0,1204,Very Low Income,Yes,Yes,Yes,District 1,Wholesale District,SPA 4 - Metro,1.041050e+07,13808.463241,"POLYGON ((-118.22672 34.06242, -118.22453 34.0..."
1,41027.0,903,Very Low Income,Yes,Yes,Yes,District 1,Lincoln Heights,SPA 4 - Metro,3.724107e+06,9459.391827,"POLYGON ((-118.21559 34.07186, -118.21169 34.0..."
2,42500.0,612,Very Low Income,Yes,Yes,Yes,District 1,Lincoln Heights,SPA 4 - Metro,3.296129e+06,8868.744225,"POLYGON ((-118.21563 34.07365, -118.21309 34.0..."
3,37232.0,845,Very Low Income,Yes,Yes,Yes,District 1,Lincoln Heights,SPA 4 - Metro,4.782361e+06,10141.728022,"POLYGON ((-118.21528 34.06349, -118.21547 34.0..."
4,65000.0,782,Low Income,Yes,No,Yes,District 1,El Sereno,SPA 4 - Metro,1.099246e+07,15893.383636,"POLYGON ((-118.18182 34.09277, -118.18039 34.0..."
...,...,...,...,...,...,...,...,...,...,...,...,...
2490,126450.0,1215,Above Moderate Income,No,No,No,District 4,Cerritos,SPA 7 - East,1.555650e+07,21274.227408,"POLYGON ((-118.07225 33.85452, -118.07047 33.8..."
2491,107672.0,1352,Above Moderate Income,No,No,No,District 4,Cerritos,SPA 7 - East,1.421767e+07,15905.089174,"POLYGON ((-118.06374 33.86586, -118.05352 33.8..."
2492,104439.0,1558,Above Moderate Income,No,No,No,District 4,Cerritos,SPA 7 - East,1.938903e+07,21218.412991,"POLYGON ((-118.04646 33.87326, -118.03776 33.8..."
2493,131012.0,1216,Above Moderate Income,No,No,No,District 4,Cerritos,SPA 7 - East,1.866694e+07,19500.866806,"POLYGON ((-118.06362 33.85858, -118.05495 33.8..."


# Combining the datasets

In [44]:
ami['geometry']

0       POLYGON ((-118.22672 34.06242, -118.22453 34.0...
1       POLYGON ((-118.21559 34.07186, -118.21169 34.0...
2       POLYGON ((-118.21563 34.07365, -118.21309 34.0...
3       POLYGON ((-118.21528 34.06349, -118.21547 34.0...
4       POLYGON ((-118.18182 34.09277, -118.18039 34.0...
                              ...                        
2490    POLYGON ((-118.07225 33.85452, -118.07047 33.8...
2491    POLYGON ((-118.06374 33.86586, -118.05352 33.8...
2492    POLYGON ((-118.04646 33.87326, -118.03776 33.8...
2493    POLYGON ((-118.06362 33.85858, -118.05495 33.8...
2494    POLYGON ((-118.10858 33.88694, -118.10828 33.8...
Name: geometry, Length: 2458, dtype: geometry

In [45]:
ami.columns

Index(['med_hh_income', 'med_hh_income_universe', 'ami_category',
       'below_med_income', 'below_60pct_med_income', 'below_moderate_income',
       'sup_dist', 'csa', 'spa', 'Shape__Area', 'Shape__Length', 'geometry'],
      dtype='object')

In [46]:
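def getArea(lat, lon, df):
    # shapely points are (x, y), i.e. (longitude, latitude)
    point = Point(lon, lat)
    # linear scan over every tract polygon; return the first one that matches
    for index in df.index:
        polygon = df.loc[index, 'geometry']
        # intersects() is true when the point is inside the polygon or on its boundary
        if polygon.intersects(point):
            return [df.loc[index, 'csa'], df.loc[index, 'ami_category']]
    # the point does not fall in any tract
    return [None, None]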

In [48]:
cities = []
income = []
count = 0
for index in citations.index:
    lat = citations.loc[index, 'Latitude']
    lon = citations.loc[index, 'Longitude']
    amiRegion = getArea(lat, lon, ami)
    csa = amiRegion[0]
    income_category = amiRegion[1]
    cities.append(csa)
    income.append(income_category)
    count += 1
    if count % 5000 == 0:
        print(count) #track progress


5000
10000
15000
20000
25000


In [50]:
citations['csa'] = cities
citations['ami_category'] = income
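
The lookup above tests each citation against the tract polygons one at a time. geopandas also provides a spatial join, `gpd.sjoin`, which uses a spatial index to match points to polygons. The following is a minimal sketch under stated assumptions: the two square polygons, their labels, and the three points are made up for illustration, and `tracts` and `ticket_points` are hypothetical names standing in for `ami` and `citations`.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Polygon

# Toy stand-in for the ami table: two adjacent unit squares (hypothetical).
tracts = gpd.GeoDataFrame(
    {'csa': ['A', 'B'], 'ami_category': ['Low Income', 'Moderate Income']},
    geometry=[
        Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),
        Polygon([(1, 0), (2, 0), (2, 1), (1, 1)]),
    ],
    crs='EPSG:4326',
)

# Toy stand-in for the citations table; the last point lies outside both squares.
tickets = pd.DataFrame({'Longitude': [0.5, 1.5, 5.0], 'Latitude': [0.5, 0.5, 5.0]})

# Build point geometries from the longitude/latitude columns.
ticket_points = gpd.GeoDataFrame(
    tickets,
    geometry=gpd.points_from_xy(tickets['Longitude'], tickets['Latitude']),
    crs='EPSG:4326',
)

# A left join keeps tickets that match no polygon; their labels become NaN.
joined = gpd.sjoin(ticket_points, tracts, how='left', predicate='intersects')

print(joined[['csa', 'ami_category']])
```

Unlike the first-match loop, a point lying exactly on a boundary shared by two polygons would appear once per matching polygon in the joined result, so duplicates may need to be dropped afterwards.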

In [51]:
citations.head()

Unnamed: 0,Ticket number,Issue Date,RP State Plate,Make,Location,Violation code,Violation Description,Fine amount,Latitude,Longitude,year,csa,ami_category
7303609,4328154935,03/16/2018,CA,MERZ,16400 VENTURA BL,88.13B+,METER EXP.,63.0,34.157264,-118.490149,2018,Encino,Above Moderate Income
8569173,4344285551,12/11/2018,CA,BMW,9901 PICO BL W,88.13B+,METER EXP.,63.0,34.052521,-118.406265,2018,Cheviot Hills,Above Moderate Income
7362643,4329008946,03/28/2018,CA,LEXS,1147 ECHO PARK AVE,88.63B+,OFF STR/OVERTIME/MTR,58.0,34.075207,-118.258318,2018,Echo Park,Low Income
8632489,4344052042,12/23/2018,CA,MAZD,843 LOS ANGELES ST S,88.13B+,METER EXP.,63.0,34.041412,-118.253074,2018,Wholesale District,Very Low Income
6971339,4324364936,01/10/2018,CA,DODG,579 VENICE BLVD,80.71.3,PARKING/FRONT YARD,68.0,33.987596,-118.464595,2018,Venice,Above Moderate Income


In [52]:
citations['csa'].unique()

array(['Encino', 'Cheviot Hills', 'Echo Park', 'Wholesale District',
       'Venice', 'Del Rey', 'San Pedro', 'Baldwin Hills', 'Reseda',
       'Westlake', 'Van Nuys', 'East Hollywood', 'Silverlake',
       'Brentwood', 'Carthay', 'Hancock Park', 'Downtown', 'Los Feliz',
       'Sherman Oaks', 'Miracle Mile', 'West Los Angeles',
       'Little Armenia', 'Hollywood', 'Florence-Firestone',
       'Harbor Gateway', 'Melrose', 'Hollywood Hills', 'North Hollywood',
       'Lincoln Heights', 'South Carthay', 'Palms', 'Woodland Hills',
       'University Park', 'Beverlywood', 'Wilshire Center', 'Rancho Park',
       None, 'Tarzana', 'Little Tokyo', 'Granada Hills', 'North Hills',
       'Temple-Beaudry', 'Pico-Union', 'Valley Village', 'Studio City',
       'Mar Vista', 'Pacific Palisades', 'Figueroa Park Square',
       'Boyle Heights', 'Wilmington', 'Historic Filipinotown', 'Westwood',
       'Exposition Park', 'Harvard Park', 'Century Palms/Cove',
       'Eagle Rock', 'Winnetka', 'Atwater 

In [53]:
citations['ami_category'].unique()

array(['Above Moderate Income', 'Low Income', 'Very Low Income',
       'Extremely Low Income', None, 'Moderate Income'], dtype=object)

In [54]:
citations['csa'].hasnans

True

In [55]:
citations['ami_category'].hasnans

True

In [58]:
citations = citations.dropna()

In [59]:
citations['csa'].unique()

array(['Encino', 'Cheviot Hills', 'Echo Park', 'Wholesale District',
       'Venice', 'Del Rey', 'San Pedro', 'Baldwin Hills', 'Reseda',
       'Westlake', 'Van Nuys', 'East Hollywood', 'Silverlake',
       'Brentwood', 'Carthay', 'Hancock Park', 'Downtown', 'Los Feliz',
       'Sherman Oaks', 'Miracle Mile', 'West Los Angeles',
       'Little Armenia', 'Hollywood', 'Florence-Firestone',
       'Harbor Gateway', 'Melrose', 'Hollywood Hills', 'North Hollywood',
       'Lincoln Heights', 'South Carthay', 'Palms', 'Woodland Hills',
       'University Park', 'Beverlywood', 'Wilshire Center', 'Rancho Park',
       'Tarzana', 'Little Tokyo', 'Granada Hills', 'North Hills',
       'Temple-Beaudry', 'Pico-Union', 'Valley Village', 'Studio City',
       'Mar Vista', 'Pacific Palisades', 'Figueroa Park Square',
       'Boyle Heights', 'Wilmington', 'Historic Filipinotown', 'Westwood',
       'Exposition Park', 'Harvard Park', 'Century Palms/Cove',
       'Eagle Rock', 'Winnetka', 'Atwater Villag

In [60]:
citations['csa'].value_counts()

Downtown               1765
Hollywood              1345
Melrose                1268
West Los Angeles        857
Wholesale District      840
                       ... 
Palisades Highlands       1
Inglewood                 1
Willowbrook               1
Burbank                   1
West Carson               1
Name: csa, Length: 143, dtype: int64

In [62]:
citations['ami_category'].unique()

array(['Above Moderate Income', 'Low Income', 'Very Low Income',
       'Extremely Low Income', 'Moderate Income'], dtype=object)

In [63]:
citations['ami_category'].value_counts()

Very Low Income          8406
Low Income               7694
Above Moderate Income    5217
Extremely Low Income     3147
Moderate Income           201
Name: ami_category, dtype: int64
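
`value_counts()` sorts the categories by frequency. To read them from lowest to highest income instead, the counts can be reindexed. In this sketch the counts are copied from the output above, while the ordering in `income_order` is an assumption inferred from the category names rather than something stated in the dataset.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series({
    'Very Low Income': 8406,
    'Low Income': 7694,
    'Above Moderate Income': 5217,
    'Extremely Low Income': 3147,
    'Moderate Income': 201,
})

# Assumed ordering from lowest to highest income (inferred from the names).
income_order = [
    'Extremely Low Income',
    'Very Low Income',
    'Low Income',
    'Moderate Income',
    'Above Moderate Income',
]

ordered_counts = counts.reindex(income_order)
# Fraction of the sampled tickets that falls in each category.
share = (ordered_counts / ordered_counts.sum()).round(3)

print(pd.DataFrame({'tickets': ordered_counts, 'share': share}))
```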

In [64]:
citations.shape

(24665, 13)