# Extract, Transform, and Load

Extract, transform, and load (ETL) is a common requirement in data engineering and data processing pipelines.
The data may come from structured databases, unstructured data stores, data APIs, and other sources.

ETL can be a simple data acquisition task, such as the one shown below.

![AutomatedDataAcquisition.png MISSING](../images/AutomatedDataAcquisition.png)

**Or it may be part of a larger process that accumulates data and information in support of advanced analytical systems.**

![AutomatedDataAcquisition_to_Analytics.png MISSING](../images/AutomatedDataAcquisition_to_Analytics.png)

---

## In the context of ETL, you now have the tools to perform this activity.

In the data loading lab, you read in three data files and then massaged the Pandas data frame to prepare the data for loading and to understand the semantics of the data.
You then loaded the database with data from the files.

What remains is to understand how to acquire data from a remote resource, such as the web or an API, and process it with Pandas.

Additionally, in this notebook we will see how to use the SQLAlchemy library to simplify data loading.

## Tasks:

 **Consider**:
 + https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)  
 
In the cells below, 

 1. Define a table for information about the world's countries.
 1. Describe some challenges you foresee with the data.
 1. Review and modify the code cells that pull the data from the web page's tables into a data frame.
 1. Load the data into your database.
 1. Test the loaded data with SQL queries.

### 1. Define Tables
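
As one possible starting point, here is a sketch of a table definition. It assumes the Wikipedia table offers a country name, a UN region, a UN subregion, two yearly population counts, and a percentage change. The column names and types are illustrative guesses, not the required answer. The statement is executed against an in-memory SQLite database only so the sketch can run anywhere; in the lab you would run your own DDL against PostgreSQL.

```python
import sqlite3

# Hypothetical column names and types; adjust them to your own design.
CREATE_COUNTRY_POPULATION = """
CREATE TABLE country_population (
    country            VARCHAR(100) PRIMARY KEY,
    region             VARCHAR(50),
    subregion          VARCHAR(50),
    population_2018    BIGINT,
    population_2019    BIGINT,
    population_change  REAL
)
"""

# SQLite stands in for PostgreSQL here so the DDL can be checked locally.
conn = sqlite3.connect(":memory:")
conn.execute(CREATE_COUNTRY_POPULATION)

# List the column names the database actually created.
columns = [row[1] for row in conn.execute("PRAGMA table_info(country_population)")]
print(columns)
```

`BIGINT` is used for the population columns because Pandas stores integers as 64-bit values by default.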

### 2. Describe the challenges

### 3. Data Scraping Code

In [None]:
# Import the library used to query a website
import requests
# Import the Beautiful Soup library to parse the data returned from the website
from bs4 import BeautifulSoup



In [None]:
# specify the url
url = "https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)"
# Open website URL and return the html to the variable 'response'
response = requests.get(url)
print(response.encoding)
print(response)

The response we get from the web is typically HTML content.
We can read the content of the server's response either as raw bytes or as decoded text.
Below, when a `BeautifulSoup` object is created from the HTML response, we explicitly pass the decoded text (`response.text`).

The encoding reported for this page is 'UTF-8', as printed by the cell above.

[Click here for additional documentation about the response object.](http://docs.python-requests.org/en/master/user/quickstart/#response-content)
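
To make the relationship between the raw bytes and `response.text` concrete, here is a small sketch that uses only the standard library. The byte string is a made-up stand-in for `response.content`, and `declared_encoding` plays the role of `response.encoding`; no network request is made.

```python
# A made-up HTML fragment standing in for the bytes of response.content.
raw_bytes = "<td>Côte d'Ivoire</td>".encode("utf-8")

# Stand-in for response.encoding as reported by the server.
declared_encoding = "utf-8"

# Decoding with the declared encoding is roughly what response.text gives you.
text = raw_bytes.decode(declared_encoding)
print(text)

# Decoding with the wrong encoding silently garbles non-ASCII characters.
garbled = raw_bytes.decode("latin-1")
print(garbled)
```

This is why it is worth printing `response.encoding` before parsing: country names with accented characters are the first thing to break when the encoding is wrong.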



In [None]:
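# Parse the HTML text in the 'response' variable and store it as a Beautiful Soup object.
# Naming the parser explicitly ("html.parser" ships with Python) keeps results consistent across machines.
soup = BeautifulSoup(response.text, "html.parser")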

### Inspect the page source to determine how to extract the table into its own soup object.

We see that the table tag has class settings of:
 * sortable 
 * wikitable 
 * jquery-tablesorter
 
```HTML
<table class="sortable wikitable jquery-tablesorter">
```

We want to focus on the `wikitable`.  
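
The `class` attribute is a space-separated list, so matching on `wikitable` means checking whether that one token is present. The sketch below illustrates the idea with Python's built-in `html.parser` instead of BeautifulSoup so that it runs without extra packages; the `TableClassCollector` class and the sample HTML are hypothetical. Keep in mind that a class such as `jquery-tablesorter` is usually added by JavaScript in the browser, so it may not appear in the HTML that `requests` downloads.

```python
from html.parser import HTMLParser


class TableClassCollector(HTMLParser):
    """Record the class tokens of every <table> start tag (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.table_classes = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            class_value = dict(attrs).get("class") or ""
            self.table_classes.append(class_value.split())


# Made-up page fragment with three tables.
sample_html = """
<table class="infobox"><tr><td>a</td></tr></table>
<table class="sortable wikitable"><tr><td>b</td></tr></table>
<table class="wikitable sortable static-row-numbers"><tr><td>c</td></tr></table>
"""

collector = TableClassCollector()
collector.feed(sample_html)

# Indexes of the tables whose class list contains the token "wikitable".
wikitable_indexes = [
    i for i, classes in enumerate(collector.table_classes) if "wikitable" in classes
]
print(wikitable_indexes)
```

Matching on a single token rather than the full attribute string keeps the selection working even when the page lists the classes in a different order.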

In [None]:
# We can fetch all tables with find_all()
all_tables = soup.find_all('table')
print(type(all_tables))
print(len(all_tables))


# This time, keep only tables with the 'wikitable' class and take the second match, [1]
right_table = soup.find_all('table', class_='wikitable')[1]
print(type(right_table))

#### Look at the first couple of rows

In [None]:
first_two_rows = right_table.find_all("tr")[0:2]

print("Header")
print("-"*30)
print(first_two_rows[0])

print("="*30)

print("First Data row")
print("-"*30)
print(first_two_rows[1])


`right_table` is a BeautifulSoup `Tag` object that represents the table.

**Examining the HTML table header, we find these columns:**

 * Country/Territory
 * UN continental region
 * UN statistical subregion
 * Population 2018
 * Population 2019
 * Change


#### TODO: Replace all the `#?` with one or more lines or portions of code.

In [None]:
# We will use the locale library so we can use atof and atoi
# to convert numeric strings (with thousands separators) to floats and integers, respectively.
import locale
locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')

country=[]
region=[]
subregion=[]
population_2018=[]
population_2019=[]
population_change=[]


# Notice we are skipping the header row in this table
for row in right_table.find_all("tr")[1:]:
    # For each row, pull out the td elements.
    cells = row.find_all('td')  # All data cells in this row

    if len(cells) > 2:  # Only extract information from body rows, not heading rows
        
        # For the country name, we need to find the name (text) in the country hyperlink (a)
        countr_cell = cells[0].find('a').find(string=True)
        country.append(countr_cell)

        # For the region name, we need to find the name (text) in the region hyperlink (a)
        region_text = cells[1].find('a').find(string=True)
        region.append(
            #?
        )
        
        # For the subregion name, we need to find the name (text) in the subregion hyperlink (a)
        subregion_text = #?
        #?

        print("Area: {},{},{}".format(countr_cell,region_text,subregion_text))        

        # Convert the population text to numeric data types
        #?
        #?
        
        
        change_pull = cells[5].find(string=True)
        print(change_pull)

        # Note: the mdash (or similar dash) character in the table needs to be changed
        # to a regular dash so the value can be parsed as a negative number
        numeric_string_pop = #?
        population_change.append(
                            locale.atof(numeric_string_pop)
                            )
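
The text-to-number conversions in the loop can be sketched independently of the scraping code. The helper names `to_int` and `to_percent` are hypothetical, and the sample strings are made up to resemble the table cells. This sketch strips the thousands separators by hand instead of calling `locale.atoi` and `locale.atof`, so it does not depend on the `en_US.UTF-8` locale being installed, and it treats the Unicode minus sign, the en dash, and the em dash all as a negative sign.

```python
# Characters that web pages commonly use in place of an ASCII hyphen-minus.
DASH_LIKE = ("\u2212", "\u2013", "\u2014")  # minus sign, en dash, em dash


def to_int(cell_text):
    """Convert text such as '1,427,647,786' to an int."""
    return int(cell_text.strip().replace(",", ""))


def to_percent(cell_text):
    """Convert text such as '+0.43%' (or the same with a dash-like sign) to a float."""
    cleaned = cell_text.strip().rstrip("%")
    for dash in DASH_LIKE:
        cleaned = cleaned.replace(dash, "-")
    return float(cleaned)


print(to_int("1,427,647,786"))
print(to_percent("+0.43%"))
print(to_percent("\u22120.25%"))
```

Whichever approach you use, print a few raw cells first, since a trailing `%`, a footnote marker, or stray whitespace will make the conversion raise a `ValueError`.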
    

##### Now that we have built all our columns, stack them into a data frame!

In [None]:
import pandas as pd

# Note, in the table definition above, we listed
# the country name first to use as a primary key
df=pd.DataFrame({
                    #?
                })


In [None]:
df.head()

In [None]:
df.tail()

### Check our column data types!
Does this match the data types we sketched out in the `CREATE TABLE` statement above?
If you need to adjust the definition, this would be the time.
Alternatively, we can adjust the columns using Pandas techniques.

In [None]:
df.dtypes

Once the Pandas data frame and the SQL table definition are in line with each other, we can load the data into the database.

---

### 4. Load the data into your database

This time, instead of loading the data manually, we are going to use the SQLAlchemy library.


In [None]:
import getpass
mypasswd = getpass.getpass()
username = #?
host = 'pgsql.dsa.lan'
database = 'dsa_student'

In [None]:
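# Then connect to the database
from sqlalchemy.engine.url import URL
from sqlalchemy import create_engine

# SQLAlchemy connection parameters
# Note: the dialect name is 'postgresql'; the older 'postgres' alias was removed in SQLAlchemy 1.4
postgres_db = {'drivername': 'postgresql',
               'username': username,
               'password': mypasswd,
               'host': host,
               'database': database}
# URL.create() is the supported way to build a URL object in SQLAlchemy 1.4 and later
engine = create_engine(URL.create(**postgres_db), echo=True)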
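
For reference, connection parameters like these correspond to a single database URL string. The sketch below builds such a string with the standard library only, using made-up credentials and the `postgresql` scheme name; `build_postgres_url` is a hypothetical helper, not part of SQLAlchemy. The point it illustrates is that special characters in a password must be percent-encoded when the URL is written as a string.

```python
from urllib.parse import quote_plus


def build_postgres_url(username, password, host, database):
    """Assemble a postgresql:// URL, escaping the user name and password."""
    return "postgresql://{}:{}@{}/{}".format(
        quote_plus(username), quote_plus(password), host, database
    )


# Made-up credentials; the host and database match the cell above.
url_string = build_postgres_url("student1", "p@ss/word", "pgsql.dsa.lan", "dsa_student")
print(url_string)
```

Passing the parameters as a dictionary, as the cell above does, avoids this escaping step because the URL object keeps each part separate.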



#### When you run the cell below, carefully examine the output so you see what the SQLAlchemy library is doing!

In [None]:

# Now that the SQLAlchemy engine is created, use the to_sql function to load the data frame
df.to_sql('country_population', # The table to load
          engine,             # The engine created above
          schema=username,    # The schema where the table lives, our pawprint
          if_exists='append', # If the table already exists, add the rows to the end of it
          index=False,        # The data frame's row index is not a table column, so do not write it
          chunksize=50)       # Write 50 records from the data frame at a time
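
To see what `if_exists='append'` and `chunksize` mean in practice, the sketch below imitates the behavior with the standard library's `sqlite3` module in place of PostgreSQL and SQLAlchemy. The `chunked` helper, the two-column table, and the sample rows are all hypothetical; this is an illustration of batched appends, not the code that `to_sql` actually runs.

```python
import sqlite3


def chunked(rows, size):
    """Yield successive slices of rows, each with at most size items."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


# Made-up rows; the real data comes from the data frame.
rows = [("Country {}".format(i), 1000 * i) for i in range(1, 121)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE country_population (country TEXT, population_2019 INTEGER)")

# Insert 50 rows per batch, similar in spirit to chunksize=50.
batch_sizes = []
for batch in chunked(rows, 50):
    conn.executemany("INSERT INTO country_population VALUES (?, ?)", batch)
    batch_sizes.append(len(batch))
conn.commit()

row_count = conn.execute("SELECT COUNT(*) FROM country_population").fetchone()[0]
print(batch_sizes, row_count)
```

Because the load appends, running the loading cell twice would insert every row twice unless the table has a primary key or unique constraint that rejects the duplicates.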


### 5. Test loaded data with SQL queries



```SQL
\x
-- Replace SSO with your own schema name (your pawprint)
select * from SSO.country_population limit 2;
```

---

#### TODO: Run the SQL in your database to verify the data was loaded.

If the data was not loaded, please restart from the top and carefully check and redo each step.



#### Now that the data is loaded, let's pull it back out!





In [None]:
df_backout = pd.read_sql_table(
    'country_population',
    con = engine,             # The engine created above
    schema= username   # The schema where the table lives, our pawprint
)

In [None]:
df_backout.head(10)

In [None]:
df_backout.tail(10)

# Save your notebook, then `File > Close and Halt`

---