# Notebook 4 - Education and Salary - a first overview.

This notebook builds a single dataframe from _Census.Gov_ data covering:
1. Education Levels - the percentage of the population aged 25 and over that holds a Bachelor's degree
2. Education Institution Presence - from the Firms API we use the "educational services" category (NAICS code 61), which gives the number of educational centres in each county (this includes kindergartens, schools of any level and universities)
3. Salary - The total income of the county

**PLEASE NOTE** While the aim of our research is to show a uniform pattern across all states of the United States, this notebook only considers the state of California in order to keep run times short.


We start by importing the necessary libraries and retrieving the data from the API. To keep the notebook small we analyse only the state of California here; a broader analysis will follow in a later step of our project.

In [25]:
#Required imports for the project
import requests # for api requests
import pandas as pd #tabular data

from bs4 import BeautifulSoup

In [26]:
# !! REMINDER TO TAKE API KEY OUT OF CODE BEFORE SUBMITTING !!
api_key = ""
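
One way to act on the reminder above is to read the key from an environment variable instead of typing it into the notebook. The sketch below assumes a variable named `CENSUS_API_KEY`, which is a hypothetical name chosen for illustration, and it falls back to an empty string as in the cell above.

```python
import os

def load_api_key(env_var="CENSUS_API_KEY"):
    """Return the API key stored in the given environment variable.

    The variable name is an assumption made for this sketch; an empty
    string is returned when the variable is not set.
    """
    return os.environ.get(env_var, "")

api_key = load_api_key()
```

With this approach the key never appears in the saved notebook, so nothing has to be removed before submitting.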

In [27]:
#Helper that converts a Census API response into a dataframe
def json_to_dataframe(response):
    rows = response.json()  # the first row holds the column names
    return pd.DataFrame(rows[1:], columns=rows[0])
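
The helper relies on the shape of the Census API payload: a JSON array of arrays whose first inner array holds the column names, as the `response.text` previews in this notebook show. The sketch below exercises the same logic without a network call. `FakeResponse` is a hypothetical stand-in for `requests.Response`, and the string form of the sample values is an assumption; the two counties and firm counts are taken from the tables in this notebook.

```python
import pandas as pd

class FakeResponse:
    """Hypothetical stand-in for requests.Response, used only for this sketch."""
    def __init__(self, payload):
        self._payload = payload

    def json(self):
        return self._payload

def json_to_dataframe(response):
    rows = response.json()  # first row = header, remaining rows = data
    return pd.DataFrame(rows[1:], columns=rows[0])

# Two sample rows in the header-first layout used by the API
payload = [
    ["GEO_ID", "FIRM", "state", "county"],
    ["0500000US06037", "2787", "06", "037"],
    ["0500000US06059", "1187", "06", "059"],
]
demo = json_to_dataframe(FakeResponse(payload))
print(demo.dtypes)
```

Every value in this sample is a string, which is why numeric columns such as `FIRM` need an explicit cast before sorting or arithmetic.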

In [28]:
#Setting up the API Query parameters for the educational services industry
params1 = {"NAICS2017" : 61}

In [29]:
#Requesting the json file from the census website using the api key
url = "https://api.census.gov/data/2017/ecnbasic?get=NAICS2017_LABEL,NAICS2017,GEO_ID,FIRM&for=county:*&key={}".format(api_key)
response = requests.request("GET", url, params=params1)

In [30]:
response.text[0:77]

'[["NAICS2017_LABEL","NAICS2017","GEO_ID","FIRM","NAICS2017","state","county"]'
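
The request above splits its parameters between the URL string and `params1`. As a possible alternative, the sketch below assembles the whole query in one place so it can be inspected before it is sent. `build_query` is a hypothetical helper, and the optional `in=state:06` predicate is an assumption about restricting the result to California on the server side; the notebook itself filters locally after downloading all counties.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.census.gov/data/2017/ecnbasic"

def build_query(api_key, state_fips=None):
    """Return the full request URL for the educational services query."""
    params = {
        "get": "NAICS2017_LABEL,NAICS2017,GEO_ID,FIRM",
        "for": "county:*",
        "NAICS2017": 61,
    }
    if state_fips is not None:
        # Assumed optional filter: only counties inside the given state
        params["in"] = "state:" + state_fips
    if api_key:
        params["key"] = api_key
    # Keep ':', '*' and ',' unescaped so the URL stays readable
    return BASE_URL + "?" + urlencode(params, safe=":*,")

print(build_query("", state_fips="06"))
```

The printed URL can be pasted into a browser to check the query before running the rest of the notebook.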

In [31]:
#Converting the ECN json response into a pandas dataframe
educational_services = json_to_dataframe(response)
df = pd.DataFrame(data = educational_services)
df['FIRM'] = df['FIRM'].astype(int)
df['state'] = df['state'].astype("string")
df_california1 = df[df['state'] == '06']

In [32]:
# Top 10 counties with the largest number of educational services
df_c_sort = df_california1\
.sort_values(by=['FIRM'], ascending=False)\
.head(10)
print('Top 10 counties with the largest number of educational services in California')
df_c_sort

Top 10 counties with the largest number of educational services in California


Unnamed: 0,NAICS2017_LABEL,NAICS2017,GEO_ID,FIRM,NAICS2017.1,state,county
289,Educational services,61,0500000US06037,2787,61,6,37
305,Educational services,61,0500000US06059,1187,61,6,59
320,Educational services,61,0500000US06073,1010,61,6,73
292,Educational services,61,0500000US06085,822,61,6,85
201,Educational services,61,0500000US06001,693,61,6,1
206,Educational services,61,0500000US06075,470,61,6,75
202,Educational services,61,0500000US06067,334,61,6,67
309,Educational services,61,0500000US06081,323,61,6,81
290,Educational services,61,0500000US06065,313,61,6,65
195,Educational services,61,0500000US06013,295,61,6,13
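
The same ranking can also be written with `DataFrame.nlargest`, which combines the sort and the `head` call. The sketch below uses four of the counties from the table above in shuffled order; the frame name `firms` is made up for illustration.

```python
import pandas as pd

# Four counties from the table above, deliberately out of order
firms = pd.DataFrame({
    "GEO_ID": ["0500000US06085", "0500000US06037",
               "0500000US06073", "0500000US06059"],
    "FIRM": [822, 2787, 1010, 1187],
})

# nlargest returns the n rows with the highest values in the given column
top2 = firms.nlargest(2, "FIRM")

# Equivalent result with the sort_values/head pattern used in the notebook
same = firms.sort_values(by="FIRM", ascending=False).head(2)
print(top2)
```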


In [33]:
#Setting the params
params2 = {"state" : "06"}

In [34]:
#Getting the ACS data
#Requesting the json file from the census website using the api key
url = "https://api.census.gov/data/2017/acs/acs1/profile?get=DP02_0064PE,DP02_0088PE,DP02_0123PE&for=county:*&key={}".format(api_key)
response2 = requests.request("GET", url)

In [35]:
#Seeing what the columns for the data are
response2.text[0:61]

'[["DP02_0064PE","DP02_0088PE","DP02_0123PE","state","county"]'

## Meaning of variables
- **DP02_0064PE** = Percent!!EDUCATIONAL ATTAINMENT!!Population 25 years and over!!Bachelor's degree
- **DP02_0088PE** = Percent!!PLACE OF BIRTH!!Total population!!Native!!Born in United States 
- **DP02_0123PE** = Percent!!ANCESTRY!!Total population!!American
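
The variable codes above can be kept in a dictionary and applied with `rename`, so the mapping lives in one place. This is a minimal sketch: the first label matches the column name used later in this notebook, while the other two short labels are hypothetical wordings chosen for illustration. The sample row reuses the values shown for county 013.

```python
import pandas as pd

# Mapping from ACS variable code to a readable column name.
# Only the first label is taken from this notebook; the other two are assumed.
ACS_LABELS = {
    "DP02_0064PE": "Percent of Population with a Bachelor's Degree",
    "DP02_0088PE": "Percent Born in United States",
    "DP02_0123PE": "Percent with American Ancestry",
}

sample = pd.DataFrame({
    "DP02_0064PE": ["26.7"],
    "DP02_0088PE": ["72.9"],
    "DP02_0123PE": ["2.7"],
    "county": ["013"],
})

# Columns that are not in the mapping (here: county) are left unchanged
readable = sample.rename(columns=ACS_LABELS)
print(readable.columns.tolist())
```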

In [36]:
pop_chars = json_to_dataframe(response2)
df = pd.DataFrame(data = pop_chars)
df_california2 = df[df['state'] == '06']

In [37]:
#Merging the two datasets
merged_census = pd.merge(df_california1, df_california2, on='county')
#Drop the state_y and NAICS2017 columns, then rename state_x to state
merged_census = merged_census.drop(columns=['state_y', 'NAICS2017'])
merged_census = merged_census.rename(columns={
    'state_x': 'state',
    'FIRM': 'Number of Educational Institutions',
    'DP02_0064PE': "Percent of Population with a Bachelor's Degree",
})
# The last two columns are currently unneeded but will be used later for contextual analysis.
merged_census.head()


Unnamed: 0,NAICS2017_LABEL,GEO_ID,Number of Educational Institutions,state,county,Percent of Population with a Bachelor's Degree,DP02_0088PE,DP02_0123PE
0,Educational services,0500000US06047,13,6,47,8.7,74.4,1.7
1,Educational services,0500000US06033,8,6,33,9.5,91.8,2.3
2,Educational services,0500000US06115,0,6,115,13.9,85.4,2.1
3,Educational services,0500000US06013,295,6,13,26.7,72.9,2.7
4,Educational services,0500000US06099,71,6,99,12.0,77.3,3.0
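
Merging on `county` alone works here because both frames were already filtered to California; county codes repeat across states, so the same merge on unfiltered data would pair unrelated counties. As a hedged sketch, the `validate` argument of `pd.merge` can guard against accidental duplicates in the key. The small frames below reuse three counties from this notebook, and the string codes are an assumption about the raw API format.

```python
import pandas as pd

firms = pd.DataFrame({"county": ["001", "013", "067"],
                      "FIRM": [693, 295, 334]})
education = pd.DataFrame({"county": ["001", "013", "067"],
                          "DP02_0064PE": ["26.9", "26.7", "20.0"]})

# validate="one_to_one" raises MergeError if a key repeats on either side
merged = pd.merge(firms, education, on="county", validate="one_to_one")

# Demonstrate the guard by deliberately duplicating one county
duplicated = pd.concat([education, education.iloc[[0]]], ignore_index=True)
try:
    pd.merge(firms, duplicated, on="county", validate="one_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True

print(len(merged), raised)
```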


## Getting the Data on Salary

In [38]:
#Requesting the json file from the census website using the api key
url= "https://api.census.gov/data/2021/acs/acs1?get=group(B08128)&for=county:*&key={}".format(api_key)
response3 = requests.request("GET", url)

In [39]:
salary = json_to_dataframe(response3)
df = pd.DataFrame(data = salary)
df['state'] = df['state'].astype("string")
df_california1 = df[df['state'] == '06']
# Keep only "B08128_002E" and "GEO_ID" columns
df_california1 = df_california1[['B08128_002E', 'GEO_ID']]
#Rename 'B08128_002E' to 'Total County Income'
df_california1 = df_california1.rename(columns={'B08128_002E': 'Total County Income'})
df_california1.head()

Unnamed: 0,Total County Income,GEO_ID
239,,0500000US06031
240,3187248.0,0500000US06037
241,,0500000US06055
242,1127740.0,0500000US06059
243,747123.0,0500000US06065
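
Several counties have no value in this column. If the column arrives as strings with nulls, which is an assumption about the raw payload, `pd.to_numeric` with `errors="coerce"` converts it to floats and turns anything unparseable into `NaN`. The sketch below rebuilds the five rows shown above under that assumption.

```python
import pandas as pd

# The five rows shown above, with values assumed to arrive as strings or None
raw = pd.DataFrame({
    "B08128_002E": [None, "3187248", None, "1127740", "747123"],
    "GEO_ID": ["0500000US06031", "0500000US06037", "0500000US06055",
               "0500000US06059", "0500000US06065"],
})

# Convert to a numeric dtype; missing or malformed entries become NaN
raw["Total County Income"] = pd.to_numeric(raw["B08128_002E"], errors="coerce")
print(raw["Total County Income"].dtype)
```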


In [40]:
# Merge the two dataframes on GEO_ID
merged_census = pd.merge(merged_census, df_california1, on='GEO_ID')
#Drop DP02_0088PE and DP02_0123PE columns
merged_census = merged_census.drop(columns=['DP02_0088PE', 'DP02_0123PE'])
merged_census

Unnamed: 0,NAICS2017_LABEL,GEO_ID,Number of Educational Institutions,state,county,Percent of Population with a Bachelor's Degree,Total County Income
0,Educational services,0500000US06047,13,6,47,8.7,
1,Educational services,0500000US06033,8,6,33,9.5,
2,Educational services,0500000US06115,0,6,115,13.9,
3,Educational services,0500000US06013,295,6,13,26.7,381018.0
4,Educational services,0500000US06099,71,6,99,12.0,169323.0
5,Educational services,0500000US06083,137,6,83,21.2,131636.0
6,Educational services,0500000US06097,161,6,97,22.6,156167.0
7,Educational services,0500000US06025,5,6,25,10.4,
8,Educational services,0500000US06001,693,6,1,26.9,574838.0
9,Educational services,0500000US06067,334,6,67,20.0,441328.0
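
Some rows in the merged table have no income value. Before any visualisation it may help to count those rows and decide how to treat them. The sketch below uses the first five rows of the table above; dropping the incomplete rows is only one option and is shown as an assumption, not as the approach this project has settled on.

```python
import pandas as pd

# First five rows of the merged table above (selected columns only)
merged = pd.DataFrame({
    "GEO_ID": ["0500000US06047", "0500000US06033", "0500000US06115",
               "0500000US06013", "0500000US06099"],
    "Total County Income": [None, None, None, 381018.0, 169323.0],
})

# Number of counties without an income value
missing = merged["Total County Income"].isna().sum()

# One possible treatment: keep only the rows that have an income value
complete = merged.dropna(subset=["Total County Income"])
print(missing, len(complete))
```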


# 🚧 Work in Progress - Data Visualisation: Is there a relation between GDP and Education Level? 🚧
@Seyi and @Alua will complete this section soon.