# Final Project: Counties Data Wrangling

- **Vintage**:  2020
- **Geography Level**: County
- **Variables**:
    - **DP02_0116E**: Estimate of the population (5 years and over) who speak Spanish at home
    - **DP02_0116PE**: Percent of the population (5 years and over) who speak Spanish at home
- **Variables List**:  https://api.census.gov/data/2020/acs/acs5/profile/variables.html 
- **Supported Geographies**: https://api.census.gov/data/2020/acs/acs5/profile/geography.html

### ***Question***:  
- Get the number and percentage of people who speak Spanish at home in each California county

## 1. Import necessary packages

In [43]:
import pandas as pd
import json
import requests

## 2. Build the API Request URL

- Base URL

In [44]:
base_url = "https://api.census.gov/data"

- Dataset Name

In [45]:
dataset_name = "/2020/acs/acs5/profile"

- Get Variables

    - **DP02_0116E**: Estimate of the population (5 years and over) who speak Spanish at home
    - **DP02_0116PE**: Percent of the population (5 years and over) who speak Spanish at home

In [46]:
get_variables = "?get=NAME,DP02_0116E,DP02_0116PE"

- Geography Levels 

    - Every county in the state of California (FIPS State Code = 06)

In [47]:
geography = "&for=county:*&in=state:06"

- Put it all together 

In [48]:
request_url = base_url + dataset_name + get_variables + geography
print("request_url = ", request_url)

request_url =  https://api.census.gov/data/2020/acs/acs5/profile?get=NAME,DP02_0116E,DP02_0116PE&for=county:*&in=state:06
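
The same URL can be assembled by a small helper, which may be handy when the vintage, variables, or geography change. The sketch below is only an illustration: `build_census_url` and its parameter names are hypothetical and are not part of the Census API or of this notebook.

```python
def build_census_url(year, dataset, variables, for_clause, in_clause=None):
    """Assemble a Census API request URL from its parts (hypothetical helper)."""
    base_url = "https://api.census.gov/data"
    # Variables are sent as one comma-separated list in the `get` parameter
    url = f"{base_url}/{year}/{dataset}?get={','.join(variables)}&for={for_clause}"
    if in_clause:
        # The optional `in` clause restricts the geography, e.g. to one state
        url += f"&in={in_clause}"
    return url


request_url = build_census_url(
    2020,
    "acs/acs5/profile",
    ["NAME", "DP02_0116E", "DP02_0116PE"],
    for_clause="county:*",
    in_clause="state:06",
)
print("request_url = ", request_url)
```

Called with the values used in this section, the helper produces the same string as the concatenation above.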


## 3. Make the API call

In [49]:
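r = requests.get(request_url, timeout=30)
r.raise_for_status()  # fail early if the API returned an HTTP error

api_results = r.json()  # list of rows; the first row holds the column names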

## 4. Get the data into a Dataframe 

In [50]:
data = pd.DataFrame(api_results)

print("Number of rows:", data.shape[0])
print("Number of columns:", data.shape[1])
data.head()

Number of rows: 59
Number of columns: 5


Unnamed: 0,0,1,2,3,4
0,NAME,DP02_0116E,DP02_0116PE,state,county
1,"Alameda County, California",250597,16.0,06,001
2,"Alpine County, California",110,10.1,06,003
3,"Butte County, California",21187,10.0,06,007
4,"Colusa County, California",9998,50.2,06,011


In [51]:
# Promote the first row to column headers, then drop it from the data

data.columns = data.iloc[0]

data = data.iloc[1:].copy()  # copy so later column assignments act on an independent frame

print("Number of rows:", data.shape[0])
print("Number of columns:", data.shape[1])
data.head()

Number of rows: 58
Number of columns: 5


Unnamed: 0,NAME,DP02_0116E,DP02_0116PE,state,county
1,"Alameda County, California",250597,16.0,06,001
2,"Alpine County, California",110,10.1,06,003
3,"Butte County, California",21187,10.0,06,007
4,"Colusa County, California",9998,50.2,06,011
5,"Contra Costa County, California",195737,18.1,06,013
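
The header promotion above can also be done in one step when the DataFrame is created, since the API response is a list of rows whose first row holds the column names. A minimal sketch, where `sample_results` is a hand-typed stand-in for `api_results` that reuses two rows from the output above:

```python
import pandas as pd

# Stand-in for api_results: a header row followed by one row per county
sample_results = [
    ["NAME", "DP02_0116E", "DP02_0116PE", "state", "county"],
    ["Alameda County, California", "250597", "16.0", "06", "001"],
    ["Alpine County, California", "110", "10.1", "06", "003"],
]

# Split the header from the data rows and pass it as the column names
header, *rows = sample_results
sample_data = pd.DataFrame(rows, columns=header)

print("Number of rows:", sample_data.shape[0])
print("Number of columns:", sample_data.shape[1])
```

With this variant the index starts at 0 instead of 1, because the header row never becomes part of the data.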


## 6. Add Rural areas

### 6.1 Import the Excel file with the rural areas of California

In [52]:
rural_areas = pd.read_excel('Data/Rural_areas_California.xlsx', dtype={'FIPS_Code' : str, 'Rural_Status' : str})

print("Number of rows:", rural_areas.shape[0])
print("Number of columns:", rural_areas.shape[1])
rural_areas.head()

Number of rows: 58
Number of columns: 4


Unnamed: 0,FIPS_Code,Abbreviation,County_Name,Rural_Status
0,6001,CA,Alameda County,1
1,6003,CA,Alpine County,4
2,6005,CA,Amador County,4
3,6007,CA,Butte County,2
4,6009,CA,Calaveras County,6


In [53]:
# Print data types
print("Data types: ")
rural_areas.dtypes

Data types: 


FIPS_Code       object
Abbreviation    object
County_Name     object
Rural_Status    object
dtype: object
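
The preview shows `FIPS_Code` without a leading zero (6001 rather than 06001). If the stored strings really are four characters long, which is an assumption because the preview may simply not display the zero, they can be padded to the five-character county code (two digits for the state, three for the county) and compared with the `state` and `county` fields returned by the API. A minimal sketch on made-up frames:

```python
import pandas as pd

# Made-up sample of the rural-areas sheet, assuming the leading zero was lost
rural_sample = pd.DataFrame({
    "FIPS_Code": ["6001", "6003"],
    "County_Name": ["Alameda County", "Alpine County"],
})

# Made-up sample of the Census data with the codes as returned by the API
census_sample = pd.DataFrame({"state": ["06", "06"], "county": ["001", "003"]})

# Pad to five characters so the code has the form SSCCC
rural_sample["FIPS_Code"] = rural_sample["FIPS_Code"].str.zfill(5)

# Concatenate the state and county codes into the same five-character form
census_sample["FIPS_Code"] = census_sample["state"] + census_sample["county"]

print(rural_sample["FIPS_Code"].tolist())
print(census_sample["FIPS_Code"].tolist())
```

A code built this way could serve as a join key that does not depend on how county names are spelled.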

### 6.2. Splitting the NAME column of `data` so both dataframes share a join column

In [54]:
two_new_cols = ['County_Name', 'State_Name']

# Split on the first ', ' only; n must be passed by keyword in recent pandas versions
data[two_new_cols] = data['NAME'].str.split(', ', n=1, expand=True)

print("Number of rows:", data.shape[0])
print("Number of columns:", data.shape[1])
data.head()

Number of rows: 58
Number of columns: 7


Unnamed: 0,NAME,DP02_0116E,DP02_0116PE,state,county,County_Name,State_Name
1,"Alameda County, California",250597,16.0,06,001,Alameda County,California
2,"Alpine County, California",110,10.1,06,003,Alpine County,California
3,"Butte County, California",21187,10.0,06,007,Butte County,California
4,"Colusa County, California",9998,50.2,06,011,Colusa County,California
5,"Contra Costa County, California",195737,18.1,06,013,Contra Costa County,California


### 6.3. Merge both dataframes (data and rural_areas)

- Name the left and right tables explicitly so the merge call is easy to read

In [55]:
left_table = data
right_table = rural_areas

- Name the join column of each table (both use County_Name)

In [56]:
left_table_join_field = 'County_Name'
right_table_join_field = 'County_Name'

- Merge

In [57]:
df = pd.merge(left_table,       
                right_table,     
                left_on=left_table_join_field,
                right_on=right_table_join_field,
                how='left'                          # Type of Join:  Left
            )

print()
print("Left Table:  ", left_table.shape)
print("Right Table: ", right_table.shape)
print("Joined Dataframe: ", df.shape)
print()

df.head()


Left Table:   (58, 7)
Right Table:  (58, 4)
Joined Dataframe:  (58, 10)



Unnamed: 0,NAME,DP02_0116E,DP02_0116PE,state,county,County_Name,State_Name,FIPS_Code,Abbreviation,Rural_Status
0,"Alameda County, California",250597,16.0,06,001,Alameda County,California,6001,CA,1
1,"Alpine County, California",110,10.1,06,003,Alpine County,California,6003,CA,4
2,"Butte County, California",21187,10.0,06,007,Butte County,California,6007,CA,2
3,"Colusa County, California",9998,50.2,06,011,Colusa County,California,6011,CA,4
4,"Contra Costa County, California",195737,18.1,06,013,Contra Costa County,California,6013,CA,1
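
Because the merge above joins on county names, a spelling difference between the two sources would silently leave `Rural_Status` empty for that county. One way to check for this, sketched here on made-up frames, is to use the `validate` and `indicator` parameters of `pd.merge`. The Butte County row is left out of the right table on purpose to show what an unmatched county looks like.

```python
import pandas as pd

# Made-up left table (Census data) and right table (rural areas)
left_sample = pd.DataFrame({
    "County_Name": ["Alameda County", "Alpine County", "Butte County"],
    "DP02_0116E": ["250597", "110", "21187"],
})
right_sample = pd.DataFrame({
    "County_Name": ["Alameda County", "Alpine County"],
    "Rural_Status": ["1", "4"],
})

checked = pd.merge(
    left_sample,
    right_sample,
    on="County_Name",
    how="left",
    validate="one_to_one",  # raise if a county name is duplicated on either side
    indicator=True,         # add a _merge column: 'both' or 'left_only'
)

# Counties present in the left table that found no partner in the right table
unmatched = checked.loc[checked["_merge"] == "left_only", "County_Name"].tolist()
print("Unmatched counties:", unmatched)
```

An empty `unmatched` list, together with equal row counts before and after the merge, would indicate that every county found its rural status.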


In [58]:
# Print data types
print("Data types: ")
df.dtypes

Data types: 


NAME            object
DP02_0116E      object
DP02_0116PE     object
state           object
county          object
County_Name     object
State_Name      object
FIPS_Code       object
Abbreviation    object
Rural_Status    object
dtype: object
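
The dtypes above show that the two Census columns are stored as text (`object`), so sorting or arithmetic on them would treat the values as strings. One possible follow-up, sketched on a hand-typed sample, converts them with `pd.to_numeric`. The use of `errors="coerce"`, which turns unparseable values into `NaN`, is a choice made for this sketch rather than something the notebook prescribes.

```python
import pandas as pd

# Hand-typed sample using three rows from the output above, stored as text
sample = pd.DataFrame({
    "County_Name": ["Alameda County", "Alpine County", "Colusa County"],
    "DP02_0116E": ["250597", "110", "9998"],
    "DP02_0116PE": ["16.0", "10.1", "50.2"],
})

# Convert the estimate and percent columns from strings to numbers
for col in ["DP02_0116E", "DP02_0116PE"]:
    sample[col] = pd.to_numeric(sample[col], errors="coerce")

print(sample.dtypes)

# Numeric columns now sort by value instead of alphabetically
print(sample.sort_values("DP02_0116PE", ascending=False)["County_Name"].tolist())
```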

## 7. Cleaning

### 7.1. Dropping the redundant NAME column

In [59]:
df.drop('NAME', axis='columns', inplace=True)

print("Number of rows:", df.shape[0])
print("Number of columns:", df.shape[1])
df.head()

Number of rows: 58
Number of columns: 9


Unnamed: 0,DP02_0116E,DP02_0116PE,state,county,County_Name,State_Name,FIPS_Code,Abbreviation,Rural_Status
0,250597,16.0,06,001,Alameda County,California,6001,CA,1
1,110,10.1,06,003,Alpine County,California,6003,CA,4
2,21187,10.0,06,007,Butte County,California,6007,CA,2
3,9998,50.2,06,011,Colusa County,California,6011,CA,4
4,195737,18.1,06,013,Contra Costa County,California,6013,CA,1


### 7.2. Renaming columns

In [60]:
cols_to_rename = {
                   'DP02_0116E' : 'Language spoken at home (Spanish) (DP02_0116E)', 
                   'DP02_0116PE' : 'Language spoken at home (Spanish) - Percent (DP02_0116PE)', 
                   'state' : 'FIPS_State', 
                   'county' : 'FIPS_County',
                   'Abbreviation' : 'State_Abbreviation',
                   'FIPS_Code' : 'FIPS_Code_Full'
                 }
df.rename(columns = cols_to_rename, inplace=True)

print("Number of rows:", df.shape[0])
print("Number of columns:", df.shape[1])
df.head()

Number of rows: 58
Number of columns: 9


Unnamed: 0,Language spoken at home (Spanish) (DP02_0116E),Language spoken at home (Spanish) - Percent (DP02_0116PE),FIPS_State,FIPS_County,County_Name,State_Name,FIPS_Code_Full,State_Abbreviation,Rural_Status
0,250597,16.0,06,001,Alameda County,California,6001,CA,1
1,110,10.1,06,003,Alpine County,California,6003,CA,4
2,21187,10.0,06,007,Butte County,California,6007,CA,2
3,9998,50.2,06,011,Colusa County,California,6011,CA,4
4,195737,18.1,06,013,Contra Costa County,California,6013,CA,1


### 7.3. Reordering columns

In [61]:
cols_to_keep = ['County_Name', 'FIPS_State', 'FIPS_County', 'FIPS_Code_Full', 'Rural_Status', 'Language spoken at home (Spanish) (DP02_0116E)', 'Language spoken at home (Spanish) - Percent (DP02_0116PE)', 'State_Name', 'State_Abbreviation']
df = df[cols_to_keep]

print("Number of rows:", df.shape[0])
print("Number of columns:", df.shape[1])
df.head()

Number of rows: 58
Number of columns: 9


Unnamed: 0,County_Name,FIPS_State,FIPS_County,FIPS_Code_Full,Rural_Status,Language spoken at home (Spanish) (DP02_0116E),Language spoken at home (Spanish) - Percent (DP02_0116PE),State_Name,State_Abbreviation
0,Alameda County,06,001,6001,1,250597,16.0,California,CA
1,Alpine County,06,003,6003,4,110,10.1,California,CA
2,Butte County,06,007,6007,2,21187,10.0,California,CA
3,Colusa County,06,011,6011,4,9998,50.2,California,CA
4,Contra Costa County,06,013,6013,1,195737,18.1,California,CA


### 7.4. Convert Rural_Status codes to Rural/Urban labels

In [62]:
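# Codes that count as rural and as urban
rural_numbers = ['3', '4', '5', '6', '7', '8', '9', '10', '11', '12']
urban_numbers = ['1', '2']

# Map each code to its label by exact match, so '1' can never match inside '10', '11' or '12'
status_labels = {num: 'Rural' for num in rural_numbers}
status_labels.update({num: 'Urban' for num in urban_numbers})

# assign() returns a new frame, so no slice of an earlier frame is modified
df = df.assign(Rural_Status=df['Rural_Status'].replace(status_labels))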

# Print
print("Number of rows:", df.shape[0])
print("Number of columns:", df.shape[1])
df.head()

Number of rows: 58
Number of columns: 9


Unnamed: 0,County_Name,FIPS_State,FIPS_County,FIPS_Code_Full,Rural_Status,Language spoken at home (Spanish) (DP02_0116E),Language spoken at home (Spanish) - Percent (DP02_0116PE),State_Name,State_Abbreviation
0,Alameda County,06,001,6001,Urban,250597,16.0,California,CA
1,Alpine County,06,003,6003,Rural,110,10.1,California,CA
2,Butte County,06,007,6007,Urban,21187,10.0,California,CA
3,Colusa County,06,011,6011,Rural,9998,50.2,California,CA
4,Contra Costa County,06,013,6013,Urban,195737,18.1,California,CA
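
Status codes such as '1' and '10' share characters, so it matters whether the conversion matches whole values or substrings. The sketch below, on made-up codes, contrasts a substring replacement with an exact lookup. `label_for_code` is a hypothetical name introduced for the illustration.

```python
import pandas as pd

codes = pd.Series(["1", "2", "10", "12"])

# Substring replacement: every '1' character is replaced, even inside '10' and '12'
substring_result = codes.str.replace("1", "Urban", regex=False)
print(substring_result.tolist())  # ['Urban', '2', 'Urban0', 'Urban2']

# Exact lookup: each whole code is translated through a dictionary
label_for_code = {"1": "Urban", "2": "Urban", "10": "Rural", "12": "Rural"}
exact_result = codes.map(label_for_code)
print(exact_result.tolist())  # ['Urban', 'Urban', 'Rural', 'Rural']
```

With substring replacement the result depends on the order in which the codes are processed, while the exact lookup gives the same labels in any order.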


## 8. Save the Dataframe as a CSV file

In [63]:
csv_file_to_create = "Counties_Data.csv"

filename_with_path = "Data/" + csv_file_to_create
df.to_csv(filename_with_path, index=False)
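
When the saved file is read back later, pandas infers column types by default, so code columns such as `FIPS_State` would be parsed as numbers and lose their leading zeros. A minimal sketch of the effect and of one way around it follows. It writes to an in-memory buffer instead of `Data/Counties_Data.csv` so that it runs without touching the file system, and the one-row frame is made up.

```python
import io

import pandas as pd

# Made-up one-row frame with the code columns stored as text
sample = pd.DataFrame({
    "County_Name": ["Alameda County"],
    "FIPS_State": ["06"],
    "FIPS_County": ["001"],
})

# Write the CSV into an in-memory buffer (stand-in for the file on disk)
buffer = io.StringIO()
sample.to_csv(buffer, index=False)

# Default read: the codes are inferred as integers
buffer.seek(0)
as_default = pd.read_csv(buffer)

# Read with explicit string dtypes: the leading zeros are preserved
buffer.seek(0)
as_text = pd.read_csv(buffer, dtype={"FIPS_State": str, "FIPS_County": str})

print(as_default.loc[0, "FIPS_State"], as_text.loc[0, "FIPS_State"])
```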