# ETL-Project - A Case Study of Extract, Transform, and Load

# Resources:

- AIDS_deaths_state.csv
- webscraping: https://www.plannedparenthood.org/health-center


# Install

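- BeautifulSoup (pip package `beautifulsoup4`, imported as `bs4`)
- Requests
- Pandas
- os (part of the Python standard library, no install needed)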
# ======================================


# Extract
# (1) Planned Parenthood Dataset
# (2) AIDS Deaths Dataset


# ======================================



# Extract - Planned Parenthood Dataset

To make it more challenging, one of my data sources will be unstructured data. I will extract the location details of Planned Parenthood health centers from the plannedparenthood.org website to obtain the total number of centers per state.

I will create a scraper in Python that extracts the location details of the clinics for a given state.

Here is a list of fields the scraper will be extracting:

1. Center Name
2. City
3. State

As an example, below is a screenshot of all the health centers listed at https://www.plannedparenthood.org/health-center/mn:

*Note that each state has its own URL, for example:

<img src="img/scraper.png">

- https://www.plannedparenthood.org/health-center/ak
- https://www.plannedparenthood.org/health-center/al
- https://www.plannedparenthood.org/health-center/ar
- https://www.plannedparenthood.org/health-center/az
- https://www.plannedparenthood.org/health-center/ca
- https://www.plannedparenthood.org/health-center/co
- https://www.plannedparenthood.org/health-center/ct
- https://www.plannedparenthood.org/health-center/dc
- https://www.plannedparenthood.org/health-center/de
- https://www.plannedparenthood.org/health-center/fl
- https://www.plannedparenthood.org/health-center/ga
- https://www.plannedparenthood.org/health-center/hi
- https://www.plannedparenthood.org/health-center/ia
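Since every URL in the list above follows the same pattern (base address plus a lowercase two-letter code), the per-state links could be generated instead of typed out. This is a minimal sketch under that assumption; `BASE_URL`, `state_url`, and `abbreviations` are hypothetical names, and only the codes listed above are included.

```python
# Base address taken from the URLs listed above.
BASE_URL = "https://www.plannedparenthood.org/health-center"


def state_url(abbr):
    """Build the health-center URL for a two-letter code (hypothetical helper)."""
    return f"{BASE_URL}/{abbr.strip().lower()}"


# Only the codes shown in the list above; extend as needed.
abbreviations = ["AK", "AL", "AR", "AZ", "CA", "CO", "CT",
                 "DC", "DE", "FL", "GA", "HI", "IA"]

# Map each code to its page URL.
urls = {abbr: state_url(abbr) for abbr in abbreviations}

print(urls["AK"])
```

Each value in `urls` can then be passed to `requests.get` in the same way as the single Alaska URL used in the cells that follow.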

In [None]:
# Dependencies
from bs4 import BeautifulSoup
import requests
import pprint
import urllib.request as urllib2

In [None]:
# URL of page to be scraped
url = 'https://www.plannedparenthood.org/health-center/ak'
response = requests.get(url)
response.content



In [None]:
soup = BeautifulSoup(response.content, 'html.parser')
soup

In [None]:
# Examine the results, then determine element that contains sought info
print(soup.prettify())

# Transform

# ======================================

In [None]:
ul_lists = soup.find_all('span', {'class': 'address-region'})
ul_lists

In [None]:
AK = len(ul_lists)
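
The counting step above could be wrapped in a function so that the same logic runs for every state page. The sketch below is an illustration only: it assumes, as the cell above does, that each health center contributes exactly one `<span class="address-region">`. The sample markup, the `address-locality` class, and the names `count_centers` and `centers_by_state` are hypothetical, and the live page layout may differ.

```python
from bs4 import BeautifulSoup


def count_centers(html):
    """Count the <span class="address-region"> elements in one state page."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.find_all("span", {"class": "address-region"}))


# Hypothetical markup for illustration; not copied from the real site.
sample_html = """
<ul>
  <li><span class="address-locality">City A</span>,
      <span class="address-region">AK</span></li>
  <li><span class="address-locality">City B</span>,
      <span class="address-region">AK</span></li>
</ul>
"""

# In the real pipeline the HTML would come from response.content per state.
centers_by_state = {"AK": count_centers(sample_html)}
print(centers_by_state)
```

Collecting the counts in a dictionary keyed by state code, rather than one variable per state such as `AK`, makes it straightforward to turn the result into a DataFrame later.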


# Extract - AIDS Deaths Dataset

In [1]:
# Dependencies
import pandas as pd
import os

In [2]:
#set file path
file = 'AIDS_deaths_state.csv'

#load csv to df
df = pd.read_csv(file)


# Transform

# ======================================

In [3]:
g = df.groupby('Geography')
g

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x12289c2e8>

In [4]:
for Geography, Geography_df in g:
    print(Geography)
    print(Geography_df)

Alabama
       Indicator  Year Geography  FIPS                   Race         Sex  \
0    AIDS deaths  2000   Alabama     1  All races/ethnicities  Both sexes   
56   AIDS deaths  2001   Alabama     1  All races/ethnicities  Both sexes   
112  AIDS deaths  2002   Alabama     1  All races/ethnicities  Both sexes   
168  AIDS deaths  2003   Alabama     1  All races/ethnicities  Both sexes   
224  AIDS deaths  2004   Alabama     1  All races/ethnicities  Both sexes   
280  AIDS deaths  2005   Alabama     1  All races/ethnicities  Both sexes   
336  AIDS deaths  2006   Alabama     1  All races/ethnicities  Both sexes   
392  AIDS deaths  2007   Alabama     1  All races/ethnicities  Both sexes   
448  AIDS deaths  2008   Alabama     1  All races/ethnicities  Both sexes   
504  AIDS deaths  2009   Alabama     1  All races/ethnicities  Both sexes   
560  AIDS deaths  2010   Alabama     1  All races/ethnicities  Both sexes   
616  AIDS deaths  2011   Alabama     1  All races/ethnicities  Both 

       Indicator  Year Geography  FIPS                   Race         Sex  \
50   AIDS deaths  2000   Vermont    50  All races/ethnicities  Both sexes   
106  AIDS deaths  2001   Vermont    50  All races/ethnicities  Both sexes   
162  AIDS deaths  2002   Vermont    50  All races/ethnicities  Both sexes   
218  AIDS deaths  2003   Vermont    50  All races/ethnicities  Both sexes   
274  AIDS deaths  2004   Vermont    50  All races/ethnicities  Both sexes   
330  AIDS deaths  2005   Vermont    50  All races/ethnicities  Both sexes   
386  AIDS deaths  2006   Vermont    50  All races/ethnicities  Both sexes   
442  AIDS deaths  2007   Vermont    50  All races/ethnicities  Both sexes   
498  AIDS deaths  2008   Vermont    50  All races/ethnicities  Both sexes   
554  AIDS deaths  2009   Vermont    50  All races/ethnicities  Both sexes   
610  AIDS deaths  2010   Vermont    50  All races/ethnicities  Both sexes   
666  AIDS deaths  2011   Vermont    50  All races/ethnicities  Both sexes   

In [5]:
# I ran into issues with groupby here: grouping by Geography and calling
# sum() on Cases raised an error, while max() and min() on Cases did not,
# so I used Excel to group by Geography and sum the cases instead.

#set file path
file = 'AIDS_deaths_sum_state csv.csv'

#load csv to df
aids_group = pd.read_csv(file)
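
The comments in the cell above do not say which error `sum()` raised. One common cause of this symptom is a numeric column being read as text, for example because of thousands separators or placeholder strings. The sketch below shows how the aggregation might be done in pandas under that assumption. The sample rows are invented for illustration and are not taken from `AIDS_deaths_state.csv`; the `Cases` column name comes from the comment above.

```python
import pandas as pd

# Hypothetical sample standing in for AIDS_deaths_state.csv.
# The values and the text formatting of "Cases" are assumptions.
df = pd.DataFrame({
    "Geography": ["Alabama", "Alabama", "Alaska", "Alaska"],
    "Year": [2000, 2001, 2000, 2001],
    "Cases": ["1,200", "300", "15", None],
})

# Remove thousands separators, then turn anything non-numeric into NaN.
cases_text = df["Cases"].astype(str).str.replace(",", "", regex=False)
df["Cases"] = pd.to_numeric(cases_text, errors="coerce")

# Sum the cases per state; NaN values are skipped by sum().
sum_cases = (
    df.groupby("Geography", as_index=False)["Cases"]
      .sum()
      .rename(columns={"Cases": "sum_cases"})
)
print(sum_cases)
```

If the real column is already numeric, the conversion step is a no-op and the cause of the error lies elsewhere; checking `df.dtypes` first would confirm which case applies.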






In [6]:
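# Drop all U.S. territories that are not states:
# American Samoa
# Northern Mariana Islands
# U.S. Virgin Islands
# Puerto Rico
# Guam
# District of Columbia (a federal district, not a state)

# Dropping the rows listed above (left commented out, as it was not run).
# The names are values in the Geography column, not column labels, so the
# rows are filtered with isin() rather than dropped with drop(..., axis=1).
# Both spellings of Northern Mariana Islands are kept from the original list.
#non_states = ["American Samoa", "Northern Mariana Island", "U.S. Virgin Islands",
#              "Puerto Rico", "Northern Mariana Islands", "Guam", "District of Columbia"]
#aids_group = aids_group[~aids_group["Geography"].isin(non_states)]
# Display
aids_group.head()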




| Unnamed: 0 | Geography | sum_cases |
|---|---|---|
| 0 | Alabama | 3484 |
| 1 | Alaska | 187 |
| 2 | Arizona | 2882 |
| 3 | Arkansas | 1379 |
| 4 | California | 23692 |


# Load and Connect to PostgreSQL DB server using pgAdmin

- Open a new pgAdmin window
- Create a new database: aids_group
- Create the tables
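
The load step above was done by hand in pgAdmin. As a rough illustration of what creating a table and loading rows looks like in code, the sketch below uses an in-memory SQLite database as a stand-in, so it runs without a server. This is a substitution, not the PostgreSQL setup used in this project: with PostgreSQL, a driver such as `psycopg2` and real connection details would be needed. The table name `aids_deaths_by_state` and its column types are assumptions; the rows are the five shown by `aids_group.head()` above.

```python
import sqlite3

# First five rows shown by aids_group.head() above.
rows = [
    ("Alabama", 3484),
    ("Alaska", 187),
    ("Arizona", 2882),
    ("Arkansas", 1379),
    ("California", 23692),
]

# In-memory SQLite database as a stand-in for the PostgreSQL server.
conn = sqlite3.connect(":memory:")

# Hypothetical table layout; the real schema is not given in this README.
conn.execute(
    "CREATE TABLE aids_deaths_by_state ("
    "geography TEXT PRIMARY KEY, sum_cases INTEGER NOT NULL)"
)

# Parameterized insert of all rows in one call.
conn.executemany(
    "INSERT INTO aids_deaths_by_state (geography, sum_cases) VALUES (?, ?)",
    rows,
)
conn.commit()

# Read the data back to confirm the load worked.
row_count = conn.execute(
    "SELECT COUNT(*) FROM aids_deaths_by_state"
).fetchone()[0]
total = conn.execute(
    "SELECT SUM(sum_cases) FROM aids_deaths_by_state"
).fetchone()[0]
print(row_count, total)
```

Reading the row count back after the insert is a cheap check that the load step wrote what was expected before moving on to any joins with the health-center counts.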


# ======================================


<img src="img/db.png">