INFO 2950 - FINAL PROJECT (Avni, Aryana, and Ishneet)

In [153]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

In [154]:
aqi = pd.read_csv("daily_aqi_by_county_2018.csv")
pop = pd.read_csv("PopulationEstimates.csv")
inc = pd.read_csv("lapi1119.csv")

In [155]:
aqi.head()

Unnamed: 0,State Name,county Name,Date,AQI,Category,Defining Parameter,Number of Sites Reporting
0,Alabama,Baldwin,1/2/18,32,Good,PM2.5,1
1,Alabama,Baldwin,1/5/18,34,Good,PM2.5,1
2,Alabama,Baldwin,1/8/18,15,Good,PM2.5,1
3,Alabama,Baldwin,1/11/18,19,Good,PM2.5,1
4,Alabama,Baldwin,1/14/18,25,Good,PM2.5,1


This dataset doesn't require much cleaning. When we build our own dataset, we will average the daily AQI values into a single yearly AQI for each county so that it is consistent with the rest of our data, which is reported yearly. Since per capita income is calculated every July 1st, a yearly average centered on approximately that date will help keep the data consistent across the board.
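
The yearly averaging described above could look something like the sketch below. It runs on a small made-up sample that reuses the column names of aqi (the AQI values for the rows not shown in aqi.head() are invented for illustration), and yearly_aqi and "Mean AQI 2018" are hypothetical names, not part of the original notebook.

```python
import pandas as pd

# Tiny made-up sample with the same column names as the aqi dataframe
sample_aqi = pd.DataFrame({
    "State Name": ["Alabama", "Alabama", "Alabama", "Alaska"],
    "county Name": ["Baldwin", "Baldwin", "Clay", "Denali"],
    "Date": ["1/2/18", "1/5/18", "1/2/18", "1/2/18"],
    "AQI": [32, 34, 20, 10],
})

# Average the daily AQI values into one yearly value per state/county pair
yearly_aqi = (
    sample_aqi.groupby(["State Name", "county Name"], as_index=False)["AQI"]
    .mean()
    .rename(columns={"AQI": "Mean AQI 2018"})
)
print(yearly_aqi)
```

Grouping on both the state and the county name matters because the same county name can appear in more than one state.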

In [169]:
#pop.head() displays only a few of the 149 columns
pop.head()

Unnamed: 0,"Population estimates for the U.S., States, and counties, 2010-18 (see the second tab in this workbook for variable name descriptions)",Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 139,Unnamed: 140,Unnamed: 141,Unnamed: 142,Unnamed: 143,Unnamed: 144,Unnamed: 145,Unnamed: 146,Unnamed: 147,Unnamed: 148
0,These data were posted to the ERS website (at ...,,,,,,,,,,...,,,,,,,,,,
1,FIPS,State,Area_Name,Rural-urban_Continuum Code_2003,Rural-urban_Continuum Code_2013,Urban_Influence_Code_2003,Urban_Influence_Code_2013,Economic_typology_2015,CENSUS_2010_POP,ESTIMATES_BASE_2010,...,R_DOMESTIC_MIG_2017,R_DOMESTIC_MIG_2018,R_NET_MIG_2011,R_NET_MIG_2012,R_NET_MIG_2013,R_NET_MIG_2014,R_NET_MIG_2015,R_NET_MIG_2016,R_NET_MIG_2017,R_NET_MIG_2018
2,00000,US,United States,,,,,,308745538,308758105,...,,,,,,,,,,
3,01000,AL,Alabama,,,,,,4779736,4780138,...,0.4,1.2,0.5,1.2,1.6,0.6,0.6,0.8,1.1,1.9
4,01001,AL,Autauga County,2,2,2,2,0,54571,54574,...,1.1,0.7,6.0,-6.1,-3.9,2.0,-1.9,5.3,1.0,0.6


There are a few things to clean in the above dataset. The NaN values appear because the corresponding cells are blank in the raw file, either in the description row at the top (row 0) or in rows that mark a change in state. Additionally, the column headings shown are not the real ones from the raw dataset; the real headings are stored in row 1. Thus, row 0 and the current column headings need to be dropped, and row 1 should become the header. Furthermore, we only need the population estimates for 2018, so our dataset will include only that column.

In [170]:
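#dropping the first row (the description) of the population dataset
#.copy() gives us an independent dataframe instead of a slice of pop
pop_clean = pop.iloc[1:].copy()
#setting the column headings to the values stored in row 1
pop_clean.columns = pop_clean.iloc[0]
#dropping row 1 now that its values are the column headings
pop_clean = pop_clean.drop(1, axis = 0)
pop_clean.head(80)
#now we have proper column headings. The only relevant information we want from this dataset is the population estimate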

1,FIPS,State,Area_Name,Rural-urban_Continuum Code_2003,Rural-urban_Continuum Code_2013,Urban_Influence_Code_2003,Urban_Influence_Code_2013,Economic_typology_2015,CENSUS_2010_POP,ESTIMATES_BASE_2010,...,R_DOMESTIC_MIG_2017,R_DOMESTIC_MIG_2018,R_NET_MIG_2011,R_NET_MIG_2012,R_NET_MIG_2013,R_NET_MIG_2014,R_NET_MIG_2015,R_NET_MIG_2016,R_NET_MIG_2017,R_NET_MIG_2018
2,00000,US,United States,,,,,,308745538,308758105,...,,,,,,,,,,
3,01000,AL,Alabama,,,,,,4779736,4780138,...,0.4,1.2,0.5,1.2,1.6,0.6,0.6,0.8,1.1,1.9
4,01001,AL,Autauga County,2,2,2,2,0,54571,54574,...,1.1,0.7,6.0,-6.1,-3.9,2.0,-1.9,5.3,1.0,0.6
5,01003,AL,Baldwin County,4,3,5,2,5,182265,182264,...,22.0,24.3,16.3,17.6,22.9,20.2,17.9,21.5,22.5,24.8
6,01005,AL,Barbour County,6,6,6,6,3,27457,27457,...,-25.5,-9.1,0.3,-6.8,-8.1,-5.1,-15.5,-18.2,-25.0,-8.6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
77,02068,AK,Denali Borough,8,8,7,7,5,1826,1822,...,-9.7,-29.6,12.4,11.1,11.9,-20.2,2.1,48.9,2.9,-17.0
78,02070,AK,Dillingham Census Area,9,9,12,12,3,4847,4847,...,-22.0,0.4,4.5,-11.9,-9.9,-9.4,-11.2,-19.1,-21.8,0.6
79,02090,AK,Fairbanks North Star Borough,3,3,2,2,4,97581,97585,...,-23.4,-20.7,-15.4,10.2,-6.0,-30.1,-8.4,-1.6,-20.6,-18.5
80,02100,AK,Haines Borough,9,9,10,12,5,2508,2508,...,-6.3,-20.4,22.9,1.2,-4.3,-0.8,-16.2,4.8,-5.6,-19.6
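
As an alternative to fixing the headings after loading, the header row can be selected while reading the file. The sketch below is only an illustration on a made-up miniature of the file layout (a title line, a note line, then the real headings); pop_alt and raw_text are hypothetical names, and it assumes the real CSV has exactly two lines above the heading row, as the output of pop.head() suggests.

```python
import io
import pandas as pd

# Made-up miniature of the raw file layout: title line, note line, then the real header
raw_text = (
    "Population estimates title,,\n"
    "These data were posted to the ERS website,,\n"
    "FIPS,State,POP_ESTIMATE_2018\n"
    '01001,AL,"55,601"\n'
    '01003,AL,"218,022"\n'
)

# header=2 uses the third line as the column names and discards the lines above it;
# dtype=str keeps the leading zeros in the FIPS codes
pop_alt = pd.read_csv(io.StringIO(raw_text), header=2, dtype=str)
print(pop_alt.columns.tolist())
```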


In [172]:
#extracting the 2018 population estimates
pop_2018 = pop_clean["POP_ESTIMATE_2018"]
print("The number of NaN values in pop_2018 is", pop_2018.isnull().sum())
#viewing the full table shows that the dataset has one extra last row with no values
pop_clean.head(3274)
#dropping that extra row
pop_2018 = pop_2018[:-1]
print("pop_2018 without any NaN values", pop_2018)

The number of NaN values in pop_2018 is 1
pop_2018 without any NaN values 2       327,167,434
3         4,887,871
4            55,601
5           218,022
6            24,881
           ...     
3270         50,185
3271          8,364
3272         21,476
3273         32,158
3274         33,860
Name: POP_ESTIMATE_2018, Length: 3273, dtype: object


pop_2018 is now the cleaned, usable series for our own dataset. It includes population estimates for 2018 only, and the single NaN value has been removed. That NaN came from an extra last row that was added accidentally in the source, so removing it doesn't affect our sample size. Two things are important to note about this series: it still includes the national total and the totals for each state, and its values are stored as strings with thousands separators (dtype object) rather than as numbers.
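
Before the estimates can be used in a regression, the comma-separated strings have to become numbers and the total rows have to be separated from the county rows. The sketch below shows one possible way on a miniature built from the first rows shown above. It assumes that FIPS codes ending in "000" mark national or state totals, which matches the rows displayed here but has not been checked against the whole file; sample_pop and county_pop are hypothetical names.

```python
import pandas as pd

# Miniature of pop_clean using the first rows shown in the outputs above
sample_pop = pd.DataFrame({
    "FIPS": ["00000", "01000", "01001", "01003"],
    "Area_Name": ["United States", "Alabama", "Autauga County", "Baldwin County"],
    "POP_ESTIMATE_2018": ["327,167,434", "4,887,871", "55,601", "218,022"],
})

# Strip the thousands separators so the estimates become integers instead of strings
sample_pop["POP_ESTIMATE_2018"] = (
    sample_pop["POP_ESTIMATE_2018"].str.replace(",", "", regex=False).astype(int)
)

# Assumption: FIPS codes ending in "000" are national/state totals, so drop them
county_pop = sample_pop[~sample_pop["FIPS"].str.endswith("000")]
print(county_pop)
```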

In [140]:
inc.head()

Unnamed: 0,"Table 1. Per Capita Personal Income by County, 2016 - 2018",Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7
0,,Per capita personal income1,,,,Percent change from preceding period,,
1,,Dollars,,,Rank in State,Percent change,,Rank in State
2,,2016,2017.0,2018.0,2018,2017,2018.0,2018
3,United States,49870,51885.0,54446.0,--,4.0,4.9,--
4,,,,,,,,


We see the same problem with the income dataset as we did with the population estimates dataset. The real column headings are stored in the first rows of data (rows 0 to 2) rather than in the header, and the NaN values come from cells that were purposely left blank in the raw dataset to improve readability.

In [144]:
#working on a copy so that the raw inc dataframe stays unchanged
inc_clean = inc.copy()
#replacing the column headings with descriptive names based on rows 0 to 2
inc_clean.columns = ['State_County', 'Per capita personal income 2016', 'Per capita personal income 2017', 
                     'Per capita personal income 2018', 'Rank in State 2018', 'Percent change 2017', 'Percent change 2018',
                    'Percent change in rank 2018']
#dropping the three heading rows
inc_clean = inc_clean.iloc[3:]
inc_clean.head(75)
#There is again an issue with NaN values. This time they appear after each state because the original dataset left those rows blank

Unnamed: 0,State_County,Per capita personal income 2016,Per capita personal income 2017,Per capita personal income 2018,Rank in State 2018,Percent change 2017,Percent change 2018,Percent change in rank 2018
3,United States,49870,51885,54446,--,4.0,4.9,--
4,,,,,,,,
5,Alabama,39224,40467,42238,--,3.2,4.4,--
6,Autauga,39561,40450,41618,10,2.2,2.9,61
7,Baldwin,42907,43989,45596,4,2.5,3.7,55
...,...,...,...,...,...,...,...,...
73,,,,,,,,
74,Alaska,56016,56794,59420,--,1.4,4.6,--
75,Aleutians East Borough,57255,58210,56676,18,1.7,-2.6,29
76,Aleutians West Census Area,55302,52695,54385,20,-4.7,3.2,25


In [152]:
#dropping all rows that contain NaN values from this dataset
inc_clean = inc_clean.dropna()

#resetting the index to a sequential order
inc_clean = inc_clean.reset_index(drop=True)
inc_clean.head(73)

#extracting per capita personal income in 2018
inc_2018 = inc_clean["Per capita personal income 2018"]
print("This is the per capita personal income 2018 data we need", inc_2018)

This is the per capita personal income 2018 data we need 0        54,446
1        42,238
2        41,618
3        45,596
4        35,199
         ...   
3159     53,145
3160    251,728
3161     40,280
3162     48,184
3163     44,737
Name: Per capita personal income 2018, Length: 3164, dtype: object


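inc_clean is now a cleaned dataset, and inc_2018 contains the 2018 per capita personal income data that we need for our dataset. Like pop_2018, it still includes the national and state rows, and its values are stored as strings with thousands separators.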
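
Since the AQI data identifies each row by state and county name, the income rows will eventually need a state label as well. The sketch below shows one possible way to derive it from the blank separator rows, so it would have to run before dropna(). It assumes that the first row after every blank row is a state (or the national total), which matches the rows displayed above but has not been verified for the whole file; sample_inc, counties, and the "State" column are hypothetical names.

```python
import numpy as np
import pandas as pd

# Miniature of the income table before dropna(): blank rows separate the state blocks
sample_inc = pd.DataFrame({
    "State_County": ["United States", np.nan, "Alabama", "Autauga", "Baldwin",
                     np.nan, "Alaska", "Aleutians East Borough"],
    "Per capita personal income 2018": ["54,446", np.nan, "42,238", "41,618", "45,596",
                                        np.nan, "59,420", "56,676"],
})

# A row that directly follows a blank row (or is the very first row) starts a new block
is_blank = sample_inc["State_County"].isna()
starts_block = is_blank.shift(1, fill_value=True) & ~is_blank

# Carry each block's first name down to the rows beneath it
sample_inc["State"] = sample_inc["State_County"].where(starts_block).ffill()

# Keep county rows only, then convert the income strings to integers
counties = sample_inc[~is_blank & ~starts_block].copy()
counties["Per capita personal income 2018"] = (
    counties["Per capita personal income 2018"].str.replace(",", "", regex=False).astype(int)
)
print(counties)
```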