# Exploratory Data Analysis: Census Data

In this notebook, we explore census data from the United States Census Bureau. The dataset contains population data for US states and counties from 2010 to 2015.

In [1]:
import pandas as pd
import numpy as np

In [2]:
census_data = pd.read_csv('census.csv')
census_data.head()

|   | SUMLEV | REGION | DIVISION | STATE | COUNTY | STNAME | CTYNAME | CENSUS2010POP | ESTIMATESBASE2010 | POPESTIMATE2010 | ... | RDOMESTICMIG2011 | RDOMESTICMIG2012 | RDOMESTICMIG2013 | RDOMESTICMIG2014 | RDOMESTICMIG2015 | RNETMIG2011 | RNETMIG2012 | RNETMIG2013 | RNETMIG2014 | RNETMIG2015 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | 3 | 6 | 1 | 0 | Alabama | Alabama | 4779736 | 4780127 | 4785161 | ... | 0.002295 | -0.193196 | 0.381066 | 0.582002 | -0.467369 | 1.030015 | 0.826644 | 1.383282 | 1.724718 | 0.712594 |
| 1 | 50 | 3 | 6 | 1 | 1 | Alabama | Autauga County | 54571 | 54571 | 54660 | ... | 7.242091 | -2.915927 | -3.012349 | 2.265971 | -2.530799 | 7.606016 | -2.626146 | -2.722002 | 2.59227 | -2.187333 |
| 2 | 50 | 3 | 6 | 1 | 3 | Alabama | Baldwin County | 182265 | 182265 | 183193 | ... | 14.83296 | 17.647293 | 21.845705 | 19.243287 | 17.197872 | 15.844176 | 18.559627 | 22.727626 | 20.317142 | 18.293499 |
| 3 | 50 | 3 | 6 | 1 | 5 | Alabama | Barbour County | 27457 | 27457 | 27341 | ... | -4.728132 | -2.50069 | -7.056824 | -3.904217 | -10.543299 | -4.874741 | -2.758113 | -7.167664 | -3.978583 | -10.543299 |
| 4 | 50 | 3 | 6 | 1 | 7 | Alabama | Bibb County | 22915 | 22919 | 22861 | ... | -5.527043 | -5.068871 | -6.201001 | -0.177537 | 0.177258 | -5.088389 | -4.363636 | -5.403729 | 0.754533 | 1.107861 |
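The output above mixes two kinds of rows: the first row has the same value in STNAME and CTYNAME and carries SUMLEV 40, while the county rows carry SUMLEV 50. The sketch below rebuilds those five rows (reduced to four columns) and splits them on SUMLEV. Reading 40 as "state total" and 50 as "county" is an assumption drawn from these rows, and the names `sample`, `states` and `counties` are illustrative only.

```python
import pandas as pd

# The five rows shown above, reduced to a few columns.
sample = pd.DataFrame({
    'SUMLEV': [40, 50, 50, 50, 50],
    'STNAME': ['Alabama'] * 5,
    'CTYNAME': ['Alabama', 'Autauga County', 'Baldwin County',
                'Barbour County', 'Bibb County'],
    'CENSUS2010POP': [4779736, 54571, 182265, 27457, 22915],
})

# Assumption: SUMLEV 40 marks a state summary row, SUMLEV 50 a county row.
states = sample[sample['SUMLEV'] == 40]
counties = sample[sample['SUMLEV'] == 50]

print(sample['SUMLEV'].value_counts())
print(len(states), 'state row(s),', len(counties), 'county row(s)')
```

Keeping the two levels apart matters for the questions that follow, since a state total would otherwise be counted or ranked alongside its own counties.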


### State with the most counties

To find the state with the most counties, we count the occurrences of each value of 'STNAME' among the county-level rows (SUMLEV == 50), so that each state's own summary row is not counted as a county. The value_counts method returns the counts sorted in descending order by default, so the first label of its index is the state we are looking for.

In [5]:
census_data.loc[census_data['SUMLEV'] == 50, 'STNAME'].value_counts().index[0]

'Texas'
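As a small illustration of the counting step, the sketch below runs value_counts on a made-up table. The rows and the names `toy`, `county_rows` and `counts` are hypothetical and are not taken from census.csv.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from census.csv.
toy = pd.DataFrame({
    'SUMLEV': [40, 50, 50, 40, 50, 50, 50],
    'STNAME': ['State A', 'State A', 'State A',
               'State B', 'State B', 'State B', 'State B'],
})

# Count county-level rows per state; value_counts sorts descending by default.
county_rows = toy[toy['SUMLEV'] == 50]
counts = county_rows['STNAME'].value_counts()

print(counts)
print(counts.index[0])   # first label, i.e. the state with the highest count
print(counts.idxmax())   # same answer without relying on the sort order
```

Without the SUMLEV filter every state would gain one extra row from its own summary line. That does not change which state comes first when every state has exactly one such row, but the counts themselves would be off by one.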

### Three most populous states in the country in 2010

In [22]:
state_data = census_data[census_data['SUMLEV']==40]

In [31]:
state_data = state_data.sort_values(by='CENSUS2010POP', ascending=False)

In [36]:
populous_states = state_data['STNAME'].head(3).tolist()

In [37]:
print(populous_states)

['California', 'Texas', 'New York']
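The sort-then-take-the-top steps can also be written in a single call with nlargest. The sketch below uses made-up state rows and figures, and `toy_states`, `top3` and `top3_sorted` are hypothetical names. It is an alternative formulation, not the approach used in the cells above.

```python
import pandas as pd

# Made-up state-level rows; names and figures are illustrative only.
toy_states = pd.DataFrame({
    'STNAME': ['State A', 'State B', 'State C', 'State D'],
    'CENSUS2010POP': [500, 1200, 900, 300],
})

# nlargest keeps the n rows with the largest values in the given column.
top3 = toy_states.nlargest(3, 'CENSUS2010POP')['STNAME'].tolist()

# Equivalent result via an explicit descending sort.
top3_sorted = (toy_states.sort_values(by='CENSUS2010POP', ascending=False)
               ['STNAME'].head(3).tolist())

print(top3)
print(top3_sorted)
```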


### County with the largest change in population in the 2010 to 2015 period

In [40]:
# Keep county-level rows only (SUMLEV == 50). Comparing STNAME with CTYNAME
# would also drop any county whose name equals its state's name.
county_data = census_data.loc[census_data['SUMLEV'] == 50,
                              ['STNAME','CTYNAME','POPESTIMATE2015',
                               'POPESTIMATE2014','POPESTIMATE2013','POPESTIMATE2012',
                               'POPESTIMATE2011','POPESTIMATE2010']]

In [41]:
county_data.head()

|   | STNAME | CTYNAME | POPESTIMATE2015 | POPESTIMATE2014 | POPESTIMATE2013 | POPESTIMATE2012 | POPESTIMATE2011 | POPESTIMATE2010 |
|---|---|---|---|---|---|---|---|---|
| 1 | Alabama | Autauga County | 55347 | 55290 | 55038 | 55175 | 55253 | 54660 |
| 2 | Alabama | Baldwin County | 203709 | 199713 | 195126 | 190396 | 186659 | 183193 |
| 3 | Alabama | Barbour County | 26489 | 26815 | 26973 | 27159 | 27226 | 27341 |
| 4 | Alabama | Bibb County | 22583 | 22549 | 22512 | 22642 | 22733 | 22861 |
| 5 | Alabama | Blount County | 57673 | 57658 | 57734 | 57776 | 57711 | 57373 |


In [46]:
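# Use only the population columns: max/min across the string columns fails in recent pandas
pop = county_data.filter(like='POPESTIMATE')
index = (pop.max(axis=1) - pop.min(axis=1)).idxmax()
max_change = county_data.loc[index, 'CTYNAME'], county_data.loc[index, 'STNAME']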

In [47]:
print(max_change)

('Harris County', 'Texas')
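The measure of change used here is the spread between a county's highest and lowest yearly estimate, which is not always the same as the net change from 2010 to 2015. The sketch below computes both on the five county rows shown earlier. `sample`, `spread`, `net` and `result` are illustrative names, and the winner among these five rows says nothing about the full dataset.

```python
import pandas as pd

# The five county rows shown by county_data.head() above.
sample = pd.DataFrame(
    {
        'STNAME': ['Alabama'] * 5,
        'CTYNAME': ['Autauga County', 'Baldwin County', 'Barbour County',
                    'Bibb County', 'Blount County'],
        'POPESTIMATE2015': [55347, 203709, 26489, 22583, 57673],
        'POPESTIMATE2014': [55290, 199713, 26815, 22549, 57658],
        'POPESTIMATE2013': [55038, 195126, 26973, 22512, 57734],
        'POPESTIMATE2012': [55175, 190396, 27159, 22642, 57776],
        'POPESTIMATE2011': [55253, 186659, 27226, 22733, 57711],
        'POPESTIMATE2010': [54660, 183193, 27341, 22861, 57373],
    },
    index=[1, 2, 3, 4, 5],
)

# Select only the numeric population columns before taking row-wise max/min.
pop = sample.filter(like='POPESTIMATE')

# Spread: highest minus lowest estimate over the six years.
spread = pop.max(axis=1) - pop.min(axis=1)

# Net change: absolute difference between the 2015 and 2010 estimates.
net = (pop['POPESTIMATE2015'] - pop['POPESTIMATE2010']).abs()

result = sample.assign(spread=spread, net=net)[['CTYNAME', 'spread', 'net']]
print(result)
print(sample.loc[spread.idxmax(), 'CTYNAME'])
```

For Bibb County and Blount County the two measures differ, because the lowest or highest estimate falls in a middle year rather than at either end of the period.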
