# Final Project: States Data Wrangling

- **Vintage**:  2020
- **Geography Level**: State
- **Variables**:
    - **DP02_0116E**: Estimate of the population (5 years and over) who speak Spanish at home
    - **DP02_0116PE**: Percent of the population (5 years and over) who speak Spanish at home
- **Variables List**:  https://api.census.gov/data/2020/acs/acs5/profile/variables.html 
- **Supported Geographies**: https://api.census.gov/data/2020/acs/acs5/profile/geography.html

### ***Question***:  
- Get the number and percentage of people (5 years and over) who speak Spanish at home in each US state

## 1. Import necessary packages

In [1]:
import pandas as pd
import json
import requests

## 2. Build the API Request URL

- Base URL

In [2]:
base_url = "https://api.census.gov/data"

- Dataset Name

In [3]:
dataset_name = "/2020/acs/acs5/profile"

- Get Variables

    - **DP02_0116E**: Estimate of the population (5 years and over) who speak Spanish at home
    - **DP02_0116PE**: Percent of the population (5 years and over) who speak Spanish at home

In [4]:
get_variables = "?get=NAME,DP02_0116E,DP02_0116PE"

- Geography Levels 

    - Every state in the US

In [5]:
geography = "&for=state:*"

- Put it all together 

In [6]:
request_url = base_url + dataset_name + get_variables + geography
print("request_url = ", request_url)

request_url =  https://api.census.gov/data/2020/acs/acs5/profile?get=NAME,DP02_0116E,DP02_0116PE&for=state:*
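
String concatenation works here, but the query string can also be assembled from a dictionary, which makes it easier to add or swap variables later. The following is a minimal sketch using only the standard library; the `params` dictionary and the `safe` character set are illustrative choices, not part of the original notebook.

```python
from urllib.parse import urlencode

base_url = "https://api.census.gov/data"
dataset_name = "/2020/acs/acs5/profile"

# Query parameters as a dictionary (illustrative structure)
params = {
    "get": "NAME,DP02_0116E,DP02_0116PE",
    "for": "state:*",
}

# Keep commas, colons and asterisks unescaped so the URL stays readable
query_string = urlencode(params, safe=",:*")
request_url = base_url + dataset_name + "?" + query_string
print("request_url = ", request_url)
```

Under these assumptions the result is the same URL as the one printed above.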


## 3. Make the API call

In [7]:
# Time out rather than hang, and stop early on an HTTP error
r = requests.get(request_url, timeout=30)
r.raise_for_status()

api_results = r.json()
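
The decoded response is a list of lists whose first element is the header row, as the output in the next section shows. Before building a DataFrame it can help to confirm that shape. The helper below is a hypothetical sketch (`check_census_payload` is not part of any library), run here on a small sample copied from this notebook's output rather than on a live API call.

```python
def check_census_payload(payload):
    """Return (header, rows) after basic sanity checks on the decoded JSON."""
    if not isinstance(payload, list) or len(payload) < 2:
        raise ValueError("Expected a header row plus at least one data row")
    header, rows = payload[0], payload[1:]
    for row in rows:
        # Every data row should have one value per header field
        if len(row) != len(header):
            raise ValueError(f"Row length mismatch: {row}")
    return header, rows

# Sample payload copied from the first rows shown in this notebook
sample = [
    ["NAME", "DP02_0116E", "DP02_0116PE", "state"],
    ["Arkansas", "153429", "5.4", "05"],
    ["Washington", "602058", "8.5", "53"],
]
header, rows = check_census_payload(sample)
print(header)
print(len(rows), "data rows")
```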

## 4. Get the data into a DataFrame

In [8]:
data = pd.DataFrame(api_results)

print("Number of rows:", data.shape[0])
print("Number of columns:", data.shape[1])
data.head()

Number of rows: 53
Number of columns: 4


Unnamed: 0,0,1,2,3
0,NAME,DP02_0116E,DP02_0116PE,state
1,Arkansas,153429,5.4,05
2,Washington,602058,8.5,53
3,Kansas,207181,7.6,20
4,Oklahoma,269433,7.3,40


In [9]:
# Use the first row as the column names, then drop it

data.columns = data.iloc[0]

data = data.iloc[1:]

print("Number of rows:", data.shape[0])
print("Number of columns:", data.shape[1])
data.head()

Number of rows: 52
Number of columns: 4


Unnamed: 0,NAME,DP02_0116E,DP02_0116PE,state
1,Arkansas,153429,5.4,05
2,Washington,602058,8.5,53
3,Kansas,207181,7.6,20
4,Oklahoma,269433,7.3,40
5,Wisconsin,254258,4.6,55
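
An alternative to promoting the first row after the fact is to pass the header and the data rows separately when the DataFrame is created. This is a sketch on a few sample rows taken from the output above; `sample_results` and `sample_df` are illustrative names.

```python
import pandas as pd

sample_results = [
    ["NAME", "DP02_0116E", "DP02_0116PE", "state"],
    ["Arkansas", "153429", "5.4", "05"],
    ["Washington", "602058", "8.5", "53"],
    ["Kansas", "207181", "7.6", "20"],
]

# First element supplies the column names, the rest are the data rows
sample_df = pd.DataFrame(sample_results[1:], columns=sample_results[0])
print(sample_df.shape)
print(list(sample_df.columns))
```

With this approach the row index starts at 0 instead of 1, so later outputs would be numbered differently from the ones shown in this notebook.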


In [10]:
# Print data types
print("Data types: ")
data.dtypes

Data types: 


0
NAME           object
DP02_0116E     object
DP02_0116PE    object
state          object
dtype: object
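
Every column is `object` because the API returns all values as strings. If the counts and percentages are needed for arithmetic or sorting, they can be converted with `pd.to_numeric`. The sketch below uses a two-row sample (`sample_df` is an illustrative name) and leaves the FIPS code as text so that its leading zero is preserved.

```python
import pandas as pd

sample_df = pd.DataFrame({
    "NAME": ["Arkansas", "Washington"],
    "DP02_0116E": ["153429", "602058"],
    "DP02_0116PE": ["5.4", "8.5"],
    "state": ["05", "53"],
})

# Convert the count and percentage columns; leave the FIPS code as text
sample_df["DP02_0116E"] = pd.to_numeric(sample_df["DP02_0116E"])
sample_df["DP02_0116PE"] = pd.to_numeric(sample_df["DP02_0116PE"])
print(sample_df.dtypes)
```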

## 6. Add state abbreviations

### 6.1 Import CSV with state abbreviations

In [11]:
state_abb = pd.read_csv('Data/State_Abbreviations.csv')

print("Number of rows:", state_abb.shape[0])
print("Number of columns:", state_abb.shape[1])
state_abb.head()

Number of rows: 50
Number of columns: 2


Unnamed: 0,State_Name,State_Abbrev
0,Alabama,AL
1,Alaska,AK
2,Arizona,AZ
3,Arkansas,AR
4,California,CA


In [12]:
# Print data types
print("Data types: ")
state_abb.dtypes

Data types: 


State_Name      object
State_Abbrev    object
dtype: object

### 6.2. Removing Puerto Rico and the District of Columbia from `data` to match `state_abb`

In [13]:
states_to_delete = ['Puerto Rico', 'District of Columbia']
data.query("NAME not in @states_to_delete", inplace=True)

print("Number of rows:", data.shape[0])
print("Number of columns:", data.shape[1])
data.head()

Number of rows: 50
Number of columns: 4


Unnamed: 0,NAME,DP02_0116E,DP02_0116PE,state
1,Arkansas,153429,5.4,05
2,Washington,602058,8.5,53
3,Kansas,207181,7.6,20
4,Oklahoma,269433,7.3,40
5,Wisconsin,254258,4.6,55
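
The two names removed above account for the difference between the 52 API rows and the 50 rows in the abbreviations file. Rather than hard-coding them, the mismatch can be found with a set difference. This is a sketch on small made-up samples; with the real frames the two sets would come from `data["NAME"]` and `state_abb["State_Name"]`.

```python
# Small samples standing in for data["NAME"] and state_abb["State_Name"]
api_names = {"Arkansas", "Washington", "Puerto Rico", "District of Columbia"}
csv_names = {"Arkansas", "Washington", "Alabama"}

# Names returned by the API that the abbreviations file lacks
missing_from_csv = sorted(api_names - csv_names)
print(missing_from_csv)
```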


### 6.3. Merge both dataframes (data and state_abb)

- Name the left and right tables explicitly to avoid confusion

In [14]:
left_table = data
right_table = state_abb

- Name the join column in each table

In [15]:
left_table_join_field = 'NAME'
right_table_join_field = 'State_Name'

- Merge

In [16]:
df = pd.merge(left_table,       
                right_table,     
                left_on=left_table_join_field,
                right_on=right_table_join_field,
                how='left'                          # Type of Join:  Left
            )

print()
print("Left Table:  ", left_table.shape)
print("Right Table: ", right_table.shape)
print("Joined Dataframe: ", df.shape)
print()

df.head()


Left Table:   (50, 4)
Right Table:  (50, 2)
Joined Dataframe:  (50, 6)



Unnamed: 0,NAME,DP02_0116E,DP02_0116PE,state,State_Name,State_Abbrev
0,Arkansas,153429,5.4,05,Arkansas,AR
1,Washington,602058,8.5,53,Washington,WA
2,Kansas,207181,7.6,20,Kansas,KS
3,Oklahoma,269433,7.3,40,Oklahoma,OK
4,Wisconsin,254258,4.6,55,Wisconsin,WI
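
Matching shapes do not by themselves prove that every row found a partner, because a left join keeps unmatched rows and fills them with missing values. As a sketch, `pd.merge` can report this through its `indicator` and `validate` parameters. The frames below are small illustrative samples in which Puerto Rico deliberately has no match.

```python
import pandas as pd

left = pd.DataFrame({"NAME": ["Arkansas", "Kansas", "Puerto Rico"]})
right = pd.DataFrame({
    "State_Name": ["Arkansas", "Kansas"],
    "State_Abbrev": ["AR", "KS"],
})

checked = pd.merge(
    left, right,
    left_on="NAME", right_on="State_Name",
    how="left",
    indicator=True,          # adds a _merge column: both / left_only
    validate="one_to_one",   # raises if a join key is duplicated
)

# Rows from the left table that found no partner in the right table
unmatched = checked.loc[checked["_merge"] == "left_only", "NAME"].tolist()
print(unmatched)
```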


In [17]:
# Print data types
print("Data types: ")
df.dtypes

Data types: 


NAME            object
DP02_0116E      object
DP02_0116PE     object
state           object
State_Name      object
State_Abbrev    object
dtype: object

## 7. Cleaning

### 7.1. Dropping the duplicate name column

In [18]:
df.drop("NAME", axis='columns', inplace=True)

print("Number of rows:", df.shape[0])
print("Number of columns:", df.shape[1])
df.head()

Number of rows: 50
Number of columns: 5


Unnamed: 0,DP02_0116E,DP02_0116PE,state,State_Name,State_Abbrev
0,153429,5.4,05,Arkansas,AR
1,602058,8.5,53,Washington,WA
2,207181,7.6,20,Kansas,KS
3,269433,7.3,40,Oklahoma,OK
4,254258,4.6,55,Wisconsin,WI


### 7.2. Renaming columns

In [19]:
cols_to_rename = {
                   'DP02_0116E' : 'Language spoken at home (Spanish) (DP02_0116E)', 
                   'DP02_0116PE' : 'Language spoken at home (Spanish) - Percent (DP02_0116PE)', 
                   'state' : 'FIPS_State', 
                   'State_Abbrev' : 'State_Abbreviation'
                 }
df.rename(columns = cols_to_rename, inplace=True)

print("Number of rows:", df.shape[0])
print("Number of columns:", df.shape[1])
df.head()

Number of rows: 50
Number of columns: 5


Unnamed: 0,Language spoken at home (Spanish) (DP02_0116E),Language spoken at home (Spanish) - Percent (DP02_0116PE),FIPS_State,State_Name,State_Abbreviation
0,153429,5.4,05,Arkansas,AR
1,602058,8.5,53,Washington,WA
2,207181,7.6,20,Kansas,KS
3,269433,7.3,40,Oklahoma,OK
4,254258,4.6,55,Wisconsin,WI


### 7.3. Reordering columns

In [20]:
cols_to_keep = ['State_Name', 'State_Abbreviation', 'Language spoken at home (Spanish) (DP02_0116E)', 'Language spoken at home (Spanish) - Percent (DP02_0116PE)', 'FIPS_State']
df = df[cols_to_keep]

print("Number of rows:", df.shape[0])
print("Number of columns:", df.shape[1])
df.head()

Number of rows: 50
Number of columns: 5


Unnamed: 0,State_Name,State_Abbreviation,Language spoken at home (Spanish) (DP02_0116E),Language spoken at home (Spanish) - Percent (DP02_0116PE),FIPS_State
0,Arkansas,AR,153429,5.4,05
1,Washington,WA,602058,8.5,53
2,Kansas,KS,207181,7.6,20
3,Oklahoma,OK,269433,7.3,40
4,Wisconsin,WI,254258,4.6,55


In [21]:
# Print data types
print("Data types: ")
df.dtypes

Data types: 


State_Name                                                   object
State_Abbreviation                                           object
Language spoken at home (Spanish) (DP02_0116E)               object
Language spoken at home (Spanish) - Percent (DP02_0116PE)    object
FIPS_State                                                   object
dtype: object

## 8. Save the DataFrame as a CSV file

In [22]:
csv_file_to_create = "States_Data.csv"

filename_with_path = "Data/" + csv_file_to_create
df.to_csv(filename_with_path, index=False)
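
One thing to watch when the saved file is read back: by default `pd.read_csv` infers numeric types, so a FIPS code such as `05` comes back as the integer `5`. The sketch below writes a two-row sample to an in-memory buffer (instead of `Data/States_Data.csv`) and compares the default read with one that forces the column to text; `out`, `default_read` and `text_read` are illustrative names.

```python
import io
import pandas as pd

out = pd.DataFrame({
    "State_Name": ["Arkansas", "Washington"],
    "FIPS_State": ["05", "53"],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
out.to_csv(buffer, index=False)

# Default read: the FIPS column is inferred as integers
buffer.seek(0)
default_read = pd.read_csv(buffer)

# Forcing the column to str keeps the leading zero
buffer.seek(0)
text_read = pd.read_csv(buffer, dtype={"FIPS_State": str})

print(default_read["FIPS_State"].tolist())
print(text_read["FIPS_State"].tolist())
```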