# Median Income and AMI Dataset Cleaning

Next, we will explore the Median Income and AMI (area median income) dataset so that we can better establish socioeconomic status by area later on.

In [1]:
# import the libraries we need
import pandas as pd
import numpy as np
import os

We will begin by reading in the data.

In [4]:
# get the median income and ami csv file
df = pd.read_csv("Median_Income_and_AMI.csv")
df

Unnamed: 0,tract,med_hh_income,med_hh_income_universe,ami_category,below_med_income,below_60pct_med_income,below_moderate_income,sup_dist,csa,spa,ESRI_OID,Shape__Area,Shape__Length
0,6037199700,38892.0,1204,Very Low Income,Yes,Yes,Yes,District 1,Los Angeles - Wholesale District,SPA 4 - Metro,2347,1.041050e+07,13808.463240
1,6037199801,41027.0,903,Very Low Income,Yes,Yes,Yes,District 1,Los Angeles - Lincoln Heights,SPA 4 - Metro,2348,3.724107e+06,9459.391827
2,6037199802,42500.0,612,Very Low Income,Yes,Yes,Yes,District 1,Los Angeles - Lincoln Heights,SPA 4 - Metro,2349,3.296129e+06,8868.744225
3,6037199900,37232.0,845,Very Low Income,Yes,Yes,Yes,District 1,Los Angeles - Lincoln Heights,SPA 4 - Metro,2350,4.782361e+06,10141.728020
4,6037201110,65000.0,782,Low Income,Yes,No,Yes,District 1,Los Angeles - El Sereno,SPA 4 - Metro,2351,1.099246e+07,15893.383640
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2490,6037554516,126450.0,1215,Above Moderate Income,No,No,No,District 4,City of Cerritos,SPA 7 - East,4837,1.555650e+07,21274.227410
2491,6037554517,107672.0,1352,Above Moderate Income,No,No,No,District 4,City of Cerritos,SPA 7 - East,4838,1.421767e+07,15905.089170
2492,6037554518,104439.0,1558,Above Moderate Income,No,No,No,District 4,City of Cerritos,SPA 7 - East,4839,1.938903e+07,21218.412990
2493,6037554519,131012.0,1216,Above Moderate Income,No,No,No,District 4,City of Cerritos,SPA 7 - East,4840,1.866694e+07,19500.866810


### Median Income and AMI

We know that the median income and AMI dataset has statistics for median household income and the corresponding AMI category, but what else does it include? Let's take a look!

In [6]:
df.head()

Unnamed: 0,tract,med_hh_income,med_hh_income_universe,ami_category,below_med_income,below_60pct_med_income,below_moderate_income,sup_dist,csa,spa,ESRI_OID,Shape__Area,Shape__Length
0,6037199700,38892.0,1204,Very Low Income,Yes,Yes,Yes,District 1,Los Angeles - Wholesale District,SPA 4 - Metro,2347,10410500.0,13808.46324
1,6037199801,41027.0,903,Very Low Income,Yes,Yes,Yes,District 1,Los Angeles - Lincoln Heights,SPA 4 - Metro,2348,3724107.0,9459.391827
2,6037199802,42500.0,612,Very Low Income,Yes,Yes,Yes,District 1,Los Angeles - Lincoln Heights,SPA 4 - Metro,2349,3296129.0,8868.744225
3,6037199900,37232.0,845,Very Low Income,Yes,Yes,Yes,District 1,Los Angeles - Lincoln Heights,SPA 4 - Metro,2350,4782361.0,10141.72802
4,6037201110,65000.0,782,Low Income,Yes,No,Yes,District 1,Los Angeles - El Sereno,SPA 4 - Metro,2351,10992460.0,15893.38364


What about the size of the dataset?

In [7]:
df.shape

(2495, 13)

This means that there are currently 2,495 observations (rows) in our data, each with 13 columns. What can these columns tell us?

In [13]:
df.columns

Index(['tract', 'med_hh_income', 'med_hh_income_universe', 'ami_category',
       'below_med_income', 'below_60pct_med_income', 'below_moderate_income',
       'sup_dist', 'csa', 'spa', 'ESRI_OID', 'Shape__Area', 'Shape__Length'],
      dtype='object')

In [14]:
df.describe()

Unnamed: 0,tract,med_hh_income,med_hh_income_universe,ESRI_OID,Shape__Area,Shape__Length
count,2495.0,2458.0,2495.0,2495.0,2495.0,2495.0
mean,6037403000.0,76849.334418,1335.672946,3594.0,45797160.0,20273.817017
std,230389.6,35546.132788,533.910989,720.388784,427058000.0,34074.651233
min,6037101000.0,4918.0,0.0,2347.0,483653.2,2815.257443
25%,6037209000.0,51157.5,988.0,2970.5,5921008.0,10607.04073
50%,6037403000.0,69698.0,1282.0,3594.0,10257610.0,14365.05703
75%,6037552000.0,94515.5,1625.0,4217.5,18357920.0,19904.42671
max,6037980000.0,250001.0,5617.0,4841.0,16086910000.0,915242.5771


describe() only summarizes the 6 numeric columns, so the other 7 of the 13 columns, more than half, are not numerical. What type is each column?

In [15]:
# finding types of each column
df.dtypes

tract                       int64
med_hh_income             float64
med_hh_income_universe      int64
ami_category               object
below_med_income           object
below_60pct_med_income     object
below_moderate_income      object
sup_dist                   object
csa                        object
spa                        object
ESRI_OID                    int64
Shape__Area               float64
Shape__Length             float64
dtype: object
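
The split between numeric and text columns can also be counted directly instead of read off the listing. The following is a minimal sketch using `select_dtypes`; the frame `sample` is a small illustrative stand-in built from a few values shown earlier, not the full CSV.

```python
import pandas as pd

# Illustrative three-row sample using a few of the dataset's column names.
sample = pd.DataFrame({
    "med_hh_income": [38892.0, 41027.0, 65000.0],
    "med_hh_income_universe": [1204, 903, 782],
    "ami_category": ["Very Low Income", "Very Low Income", "Low Income"],
    "below_med_income": ["Yes", "Yes", "Yes"],
})

# Columns pandas stores as numbers (int64, float64, ...).
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
# Everything else; in this sample these are the text columns.
text_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(len(numeric_cols), "numeric:", numeric_cols)
print(len(text_cols), "non-numeric:", text_cols)
```

Running the same two `select_dtypes` calls on `df` should give the numeric and non-numeric column counts for the real data.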

In [17]:
df.index

RangeIndex(start=0, stop=2495, step=1)

Something that will be important for our analysis later on is identifying median income for specific regions. Let's take a look at the different locations in this dataset:

In [20]:
df['csa'].unique()

array(['Wholesale District', 'Lincoln Heights', 'El Sereno',
       'Highland Park', 'University Hills', 'Boyle Heights',
       'City of Pomona', 'City of Diamond Bar', 'Rowland Heights',
       'City of Industry', 'City of Walnut', 'Covina', 'City of Covina',
       'City of San Dimas', 'Covina (Charter Oak)', 'City of Glendora',
       'Azusa', 'City of Azusa', 'City of Norwalk', 'City of Artesia',
       'City of Lakewood', 'City of Hawaiian Gardens',
       'City of Long Beach', 'City of Cerritos', 'Little Tokyo',
       'Chinatown', 'Downtown', 'Temple-Beaudry', 'Historic Filipinotown',
       'Westlake', 'City of Irwindale', 'City of Baldwin Park',
       'City of West Covina', 'City of Signal Hill', 'Pico-Union',
       'Hancock Park', 'Wilshire Center', 'Little Bangladesh',
       'Koreatown', 'Bassett', 'City of La Puente', 'West Puente Valley',
       'Valinda', 'San Jose Hills', 'Avocado Heights', 'North Whittier',
       'Hacienda Heights', 'Tujunga', 'Sun Valley', 'Sunlan

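Since all of our data is from Los Angeles County, a prefix such as 'Los Angeles - ' in front of a community name adds nothing, so we can keep just the city or community name instead.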

In [22]:
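# function to single out the city/community from the string,
# e.g. 'Los Angeles - El Sereno' becomes 'El Sereno'
def strip_prefix(csa):
    # split once on the first ' - ' and keep what follows;
    # names without ' - ' are returned unchanged
    return csa.split(' - ', 1)[-1]
df['csa'] = df['csa'].apply(strip_prefix)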
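
The same prefix stripping can be done without a Python-level function by using the pandas string accessor. This is a sketch under the assumption that the separator is always `' - '`; the series `csa` below holds three example names taken from the raw table, and `community` is a name introduced only for illustration.

```python
import pandas as pd

# Three example values in the raw (unstripped) form shown in the first table.
csa = pd.Series([
    "Los Angeles - Wholesale District",
    "Los Angeles - El Sereno",
    "City of Cerritos",
])

# Split once on the first " - " and keep the last piece;
# values without the separator come back unchanged.
community = csa.str.split(" - ", n=1).str[-1]
print(community.tolist())
```

For a column of this size the speed difference is negligible; the main benefit is that the intent reads directly from the one-line expression.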

In [23]:
df.head()

Unnamed: 0,tract,med_hh_income,med_hh_income_universe,ami_category,below_med_income,below_60pct_med_income,below_moderate_income,sup_dist,csa,spa,ESRI_OID,Shape__Area,Shape__Length
0,6037199700,38892.0,1204,Very Low Income,Yes,Yes,Yes,District 1,Wholesale District,SPA 4 - Metro,2347,10410500.0,13808.46324
1,6037199801,41027.0,903,Very Low Income,Yes,Yes,Yes,District 1,Lincoln Heights,SPA 4 - Metro,2348,3724107.0,9459.391827
2,6037199802,42500.0,612,Very Low Income,Yes,Yes,Yes,District 1,Lincoln Heights,SPA 4 - Metro,2349,3296129.0,8868.744225
3,6037199900,37232.0,845,Very Low Income,Yes,Yes,Yes,District 1,Lincoln Heights,SPA 4 - Metro,2350,4782361.0,10141.72802
4,6037201110,65000.0,782,Low Income,Yes,No,Yes,District 1,El Sereno,SPA 4 - Metro,2351,10992460.0,15893.38364


We should also check for null values.

In [27]:
# checking if csa column has nans
df['csa'].hasnans

False

In [28]:
# checking if ami_category column has nans
df['ami_category'].hasnans

True

In [29]:
# checking if med_hh_income column has nans
df['med_hh_income'].hasnans

True
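
Rather than testing one column at a time, the missing values in every column can be counted in a single call. A minimal sketch, assuming a small made-up frame (`sample` and `missing_per_column` are illustrative names, and the NaN placement is invented for the example):

```python
import numpy as np
import pandas as pd

# Illustrative sample: one row is missing both income and AMI category.
sample = pd.DataFrame({
    "csa": ["Miracle Mile", "Lincoln Heights", "El Sereno"],
    "med_hh_income": [np.nan, 41027.0, 65000.0],
    "ami_category": [None, "Very Low Income", "Low Income"],
})

# isna() marks each missing cell; sum() counts them per column.
missing_per_column = sample.isna().sum()

# Show only the columns that actually contain missing values.
print(missing_per_column[missing_per_column > 0])
```

Applied to `df`, this would list every column with at least one NaN together with its count.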

In [30]:
# finding all the rows in the dataframe where med_hh_income column is NaN
df.loc[ df['med_hh_income'].isna() ]

Unnamed: 0,tract,med_hh_income,med_hh_income_universe,ami_category,below_med_income,below_60pct_med_income,below_moderate_income,sup_dist,csa,spa,ESRI_OID,Shape__Area,Shape__Length
446,6037216301,,614,,,,,District 2,Miracle Mile,SPA 4 - Metro,2793,6049547.0,11309.81632
502,6037578100,,0,,,,,District 4,City of Long Beach,SPA 8 - South Bay,2849,18516370.0,18912.70949
504,6037599100,,79,,,,,District 4,Santa Catalina Island,SPA 8 - South Bay,2851,3698756000.0,554591.2599
761,6037222700,,108,,,,,District 2,Exposition Park,SPA 6 - South,3108,9012062.0,13506.62373
896,6037115103,,11,,,,,District 3,Northridge,SPA 2 - San Fernando,3243,16329320.0,22141.57499
1458,6037265301,,0,,,,,District 3,Westwood,SPA 5 - West,3805,17371240.0,20071.7787
1560,6037901003,,0,,,,,District 5,City of Lancaster,SPA 1 - Antelope Valley,3907,28092030.0,21201.25735
1658,6037273403,,1020,,,,,District 3,Venice,SPA 5 - West,4005,2869621.0,8586.103161
2052,6037920200,,0,,,,,District 5,Castaic,SPA 2 - San Fernando,4399,122208700.0,64176.72158
2072,6037980001,,0,,,,,District 5,City of Burbank,SPA 2 - San Fernando,4419,25081050.0,22534.65649


In [32]:
# the rows having NaN in med_hh_income appear to have NaNs in ami_category,
# below_med_income, below_60pct_med_income, and below_moderate_income as well
# (only the last expression in a cell is displayed, so the output is df.shape)
df.loc[ df['med_hh_income'].isna()].shape
df.shape

(2495, 13)
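
The observation in the comment above, that rows missing med_hh_income are also missing the derived columns, can be checked programmatically rather than by eye. The sketch below uses a made-up frame; `derived`, `missing_income`, and `all_derived_missing` are illustrative names, and only two of the four derived columns are included to keep it short.

```python
import numpy as np
import pandas as pd

# Illustrative sample: rows 0 and 2 lack an income value.
sample = pd.DataFrame({
    "med_hh_income": [np.nan, 41027.0, np.nan],
    "ami_category": [None, "Very Low Income", None],
    "below_med_income": [None, "Yes", None],
    "csa": ["Miracle Mile", "Lincoln Heights", "Venice"],
})

derived = ["ami_category", "below_med_income"]
missing_income = sample["med_hh_income"].isna()

# True only if every derived column is NaN in every row that lacks an income.
all_derived_missing = bool(sample.loc[missing_income, derived].isna().all().all())

print(int(missing_income.sum()), "rows without income;",
      "derived columns all missing:", all_derived_missing)
```

If the check holds on `df`, dropping rows with any NaN removes the same rows as dropping rows with a missing med_hh_income, provided no other column has missing values.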

In [33]:
# shape of df once dropping rows with NaN
df.dropna().shape

(2458, 13)

In [34]:
# dropping rows with nans
df.dropna(inplace = True)

In [35]:
df['med_hh_income'].hasnans

False

In [36]:
df['ami_category'].hasnans

False

To keep the place names short and consistent, we can eliminate redundant information. For example, in the csa column, many of the locations are described as 'City of ...'. We will therefore remove this description and only keep the name of the city instead. Note that this merges pairs such as 'Covina' and 'City of Covina' into a single name.

In [43]:
df['csa'].unique()

array(['Wholesale District', 'Lincoln Heights', 'El Sereno',
       'Highland Park', 'University Hills', 'Boyle Heights',
       'City of Pomona', 'City of Diamond Bar', 'Rowland Heights',
       'City of Industry', 'City of Walnut', 'Covina', 'City of Covina',
       'City of San Dimas', 'Covina (Charter Oak)', 'City of Glendora',
       'Azusa', 'City of Azusa', 'City of Norwalk', 'City of Artesia',
       'City of Lakewood', 'City of Hawaiian Gardens',
       'City of Long Beach', 'City of Cerritos', 'Little Tokyo',
       'Chinatown', 'Downtown', 'Temple-Beaudry', 'Historic Filipinotown',
       'Westlake', 'City of Irwindale', 'City of Baldwin Park',
       'City of West Covina', 'City of Signal Hill', 'Pico-Union',
       'Hancock Park', 'Wilshire Center', 'Little Bangladesh',
       'Koreatown', 'Bassett', 'City of La Puente', 'West Puente Valley',
       'Valinda', 'San Jose Hills', 'Avocado Heights', 'North Whittier',
       'Hacienda Heights', 'Tujunga', 'Sun Valley', 'Sunlan

In [44]:
# drop a leading 'City of ' from the string, as that is unnecessary
# (startswith avoids slicing names that only contain 'City of ' further in)
fncof = lambda x: x[len('City of '):] if x.startswith('City of ') else x
df['csa'] = df['csa'].apply(fncof)
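
Because the list of unique values contains both 'Covina' and 'City of Covina', removing the prefix makes some previously distinct names identical. The sketch below shows one way to strip the prefix with an anchored pattern and to list which names collapse together; the series `csa` holds four names from the output above, and `stripped` and `merged` are illustrative names.

```python
import pandas as pd

# Four names taken from the unique values listed above.
csa = pd.Series(["City of Covina", "Covina", "City of Azusa", "Boyle Heights"])

# "^" anchors the pattern so only a leading "City of " is removed.
stripped = csa.str.replace(r"^City of ", "", regex=True)

# For each stripped name, count how many distinct original names map to it.
merged = csa.groupby(stripped).nunique()

print(stripped.tolist())
print(merged[merged > 1].index.tolist())  # names that absorbed more than one original
```

Whether merging these entries is acceptable depends on the later analysis; if the distinction matters, the original column can be kept alongside the stripped one.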

In [45]:
# checking the unique values
df['csa'].unique()

array(['Wholesale District', 'Lincoln Heights', 'El Sereno',
       'Highland Park', 'University Hills', 'Boyle Heights', 'Pomona',
       'Diamond Bar', 'Rowland Heights', 'Industry', 'Walnut', 'Covina',
       'San Dimas', 'Covina (Charter Oak)', 'Glendora', 'Azusa',
       'Norwalk', 'Artesia', 'Lakewood', 'Hawaiian Gardens', 'Long Beach',
       'Cerritos', 'Little Tokyo', 'Chinatown', 'Downtown',
       'Temple-Beaudry', 'Historic Filipinotown', 'Westlake', 'Irwindale',
       'Baldwin Park', 'West Covina', 'Signal Hill', 'Pico-Union',
       'Hancock Park', 'Wilshire Center', 'Little Bangladesh',
       'Koreatown', 'Bassett', 'La Puente', 'West Puente Valley',
       'Valinda', 'San Jose Hills', 'Avocado Heights', 'North Whittier',
       'Hacienda Heights', 'Tujunga', 'Sun Valley', 'Sunland',
       'Lakeview Terrace', 'Shadow Hills', 'Pacoima', 'Country Club Park',
       'Melrose', 'Park La Brea', 'Carthay', 'Miracle Mile',
       'South Carthay', 'Crestview', 'Mid-city', 'La

Lastly, we can delete the columns we will not need: the identifier columns tract and ESRI_OID.

In [49]:
df = df.drop(columns=['tract', 'ESRI_OID'])
df.head()

Unnamed: 0,med_hh_income,med_hh_income_universe,ami_category,below_med_income,below_60pct_med_income,below_moderate_income,sup_dist,csa,spa,Shape__Area,Shape__Length
0,38892.0,1204,Very Low Income,Yes,Yes,Yes,District 1,Wholesale District,SPA 4 - Metro,10410500.0,13808.46324
1,41027.0,903,Very Low Income,Yes,Yes,Yes,District 1,Lincoln Heights,SPA 4 - Metro,3724107.0,9459.391827
2,42500.0,612,Very Low Income,Yes,Yes,Yes,District 1,Lincoln Heights,SPA 4 - Metro,3296129.0,8868.744225
3,37232.0,845,Very Low Income,Yes,Yes,Yes,District 1,Lincoln Heights,SPA 4 - Metro,4782361.0,10141.72802
4,65000.0,782,Low Income,Yes,No,Yes,District 1,El Sereno,SPA 4 - Metro,10992460.0,15893.38364


Let's take a final look at our dataset!

In [50]:
df

Unnamed: 0,med_hh_income,med_hh_income_universe,ami_category,below_med_income,below_60pct_med_income,below_moderate_income,sup_dist,csa,spa,Shape__Area,Shape__Length
0,38892.0,1204,Very Low Income,Yes,Yes,Yes,District 1,Wholesale District,SPA 4 - Metro,1.041050e+07,13808.463240
1,41027.0,903,Very Low Income,Yes,Yes,Yes,District 1,Lincoln Heights,SPA 4 - Metro,3.724107e+06,9459.391827
2,42500.0,612,Very Low Income,Yes,Yes,Yes,District 1,Lincoln Heights,SPA 4 - Metro,3.296129e+06,8868.744225
3,37232.0,845,Very Low Income,Yes,Yes,Yes,District 1,Lincoln Heights,SPA 4 - Metro,4.782361e+06,10141.728020
4,65000.0,782,Low Income,Yes,No,Yes,District 1,El Sereno,SPA 4 - Metro,1.099246e+07,15893.383640
...,...,...,...,...,...,...,...,...,...,...,...
2490,126450.0,1215,Above Moderate Income,No,No,No,District 4,Cerritos,SPA 7 - East,1.555650e+07,21274.227410
2491,107672.0,1352,Above Moderate Income,No,No,No,District 4,Cerritos,SPA 7 - East,1.421767e+07,15905.089170
2492,104439.0,1558,Above Moderate Income,No,No,No,District 4,Cerritos,SPA 7 - East,1.938903e+07,21218.412990
2493,131012.0,1216,Above Moderate Income,No,No,No,District 4,Cerritos,SPA 7 - East,1.866694e+07,19500.866810
