# ETL Project: Summary

The main objective of our analysis is to determine where bank deserts are located in the United States and which populations are affected by the lack of financial services in their area. We will explore who borrows, where they borrow, and why. Our analysis covers demographics and bank metrics for the year 2017 across 3,142 U.S. counties, excluding Puerto Rico.

## Data Extraction
We used three data sources to compile our database:
* **For bank metrics**, we used quarterly data from the FDIC Financial Data. CSV files can be downloaded [here](https://www5.fdic.gov/idasp/advSearch_warp_download_all.asp?intTab=1).
* **For demographics**, we used the 2017 estimates from the American Community Survey (2010 U.S. Census). CSV files can be downloaded [here](https://factfinder.census.gov/faces/nav/jsf/pages/index.xhtml).
* **For unemployment and median household income**, we used LAUS data from the Bureau of Labor Statistics. Excel files can be downloaded [here](https://www.bls.gov/lau/).

## Data Transformation
For each source, we:
* Selected the variables of interest for the analysis.
* Eliminated any duplicates or null values.
* Summed quarterly values to obtain annual observations for bank metrics.

*Note: For details on data transformation, please refer to the "Data_Transformation.ipynb" file*
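
The three steps listed above could look roughly like the following in pandas. This is a minimal sketch, not the code from the notebook: the column names (`fips`, `quarter`, `net_income`, `unused_col`), the values, and the helper `to_annual` are all hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical quarterly bank metrics; column names and values are made up.
quarterly = pd.DataFrame(
    {
        "fips": ["01001", "01001", "01001", "01001", "01003", "01003", "01003"],
        "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q1", "Q2"],
        "net_income": [10.0, 12.0, 11.0, 13.0, 5.0, 5.0, None],
        "unused_col": [0, 0, 0, 0, 0, 0, 0],
    }
)


def to_annual(df: pd.DataFrame) -> pd.DataFrame:
    """Select columns, drop duplicates and nulls, then sum quarters per county."""
    # Step 1: keep only the variables of interest.
    selected = df[["fips", "quarter", "net_income"]]
    # Step 2: remove exact duplicate rows and rows with missing values.
    cleaned = selected.drop_duplicates().dropna()
    # Step 3: sum the quarterly values into one annual observation per county.
    return cleaned.groupby("fips", as_index=False)["net_income"].sum()


annual = to_annual(quarterly)
print(annual)
```

In this toy input, the repeated Q1 row and the row with a missing value for county `01003` are dropped before the quarterly values are summed.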

## Data Loading
We created a Financial Deserts database in SQL, containing:
* Demographic data, using FIPS as the primary key.
* Income and unemployment data, using FIPS and State as a composite primary key.
* Bank metrics.
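
The key structure described above can be sketched with Python's built-in `sqlite3` module. The text does not name the SQL engine, the tables, or the columns, so the in-memory SQLite database and every table and column name below are assumptions made for illustration. The bank metrics table is left out because no key is stated for it.

```python
import sqlite3

# In-memory SQLite database, used here only for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical demographics table keyed by county FIPS code.
    CREATE TABLE demographics (
        fips TEXT PRIMARY KEY,
        pop_total INTEGER
    );
    -- Hypothetical income/unemployment table with a composite key.
    CREATE TABLE income_unemployment (
        fips TEXT NOT NULL,
        state TEXT NOT NULL,
        median_household_income INTEGER,
        unemployment_rate REAL,
        PRIMARY KEY (fips, state)
    );
    """
)

# Made-up rows for a single county.
conn.execute("INSERT INTO demographics VALUES (?, ?)", ("01001", 1000))
conn.execute(
    "INSERT INTO income_unemployment VALUES (?, ?, ?, ?)",
    ("01001", "AL", 50000, 4.0),
)

# The primary key rejects a second row with the same FIPS code.
try:
    conn.execute("INSERT INTO demographics VALUES (?, ?)", ("01001", 2000))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# FIPS is the column shared by both tables, so it serves as the join key.
rows = conn.execute(
    "SELECT d.fips, i.state, d.pop_total, i.unemployment_rate "
    "FROM demographics AS d "
    "JOIN income_unemployment AS i ON d.fips = i.fips"
).fetchall()
print(duplicate_rejected, rows)
```

Storing FIPS as text rather than as an integer is a design choice in this sketch: it preserves the leading zeros that some county codes carry.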

## Statistical Summary

In [6]:
# Set dependencies
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

# Paths to the transformed files
population = "Population.csv"
education = "Eductation.csv"
hh_income = "HouseholdIncome.csv"
income_unemployment = "Income_Unemployment_2017.csv"
internet_access = "InternetAccess.csv"

# Read the files into DataFrames
df_population = pd.read_csv(population)
df_education = pd.read_csv(education)
df_hh_income = pd.read_csv(hh_income)
df_income_unemployment = pd.read_csv(income_unemployment)
df_internet_access = pd.read_csv(internet_access)
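
One detail worth keeping in mind when reading county-level files: by default `pd.read_csv` parses an all-digit column as integers, which drops leading zeros from codes such as FIPS. The sketch below assumes the files contain a column named `FIPS`, which the text does not confirm, and uses a small made-up CSV in place of the real files.

```python
import io

import pandas as pd

# Made-up CSV content standing in for one of the transformed files.
csv_text = (
    "FIPS,County,pop_total\n"
    "01001,Example County A,55000\n"
    "56045,Example County B,7000\n"
)

# Default parsing: the FIPS column becomes integers and loses its leading zero.
as_numbers = pd.read_csv(io.StringIO(csv_text))

# Reading FIPS as a string keeps the five-character code intact.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"FIPS": str})

print(as_numbers["FIPS"].tolist())
print(as_text["FIPS"].tolist())
```

Whether this matters depends on how FIPS is stored in the database. If the key is a text column, reading it as a string keeps joins across sources consistent.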


In [25]:
# Count the counties and sum the population columns across all counties
count_counties = df_income_unemployment["County"].count()
total_population = df_population["pop_total"].sum()
female_pop = df_population["pop_female"].sum()
male_pop = df_population["pop_male"].sum()

# Collect the aggregates in a single-row DataFrame
summary = pd.DataFrame([{"Total Number of Counties": count_counties, "Total Population": total_population,
                         "Female Population": female_pop, "Male Population": male_pop}])


In [26]:
# Name the index, then transpose so that each statistic appears on its own row
summary = summary.rename_axis('Summary_Table')

summary_T = summary.T
summary_T


| Summary_Table | 0 |
| --- | --- |
| Female Population | 126852354 |
| Male Population | 119526965 |
| Total Number of Counties | 3141 |
| Total Population | 321004407 |
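
A quick consistency check on aggregates like these is to compare the total population with the sum of the female and male columns, county by county. The sketch below reuses the column names from the cells above (`pop_total`, `pop_female`, `pop_male`), but the data are made up and the check itself is a suggestion, not part of the original notebook.

```python
import pandas as pd

# Made-up figures for two counties; the second one is deliberately inconsistent.
df = pd.DataFrame(
    {
        "pop_total": [100, 250],
        "pop_female": [52, 125],
        "pop_male": [48, 120],
    }
)

# Difference between the reported total and the sum of its two components.
gap = df["pop_total"] - (df["pop_female"] + df["pop_male"])

# Rows where the components do not add up to the total.
mismatched = df[gap != 0]
print(mismatched)
```

Running the same comparison on the real `df_population` frame would show whether the three columns describe the same population universe.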
