# Extract, Transform, and Load

ETL is a common necessity for data engineering and data processing pipelines.
The source of the data may be other structured databases, unstructured data stores, data APIs, etc.

ETL can be a simple data acquisition task, such as shown below.

![AutomatedDataAcquisition.png MISSING](../images/AutomatedDataAcquisition.png)

**Or, it may be part of a larger process to accumulate data and information in support of advanced analytical systems.**

![AutomatedDataAcquisition_to_Analytics.png MISSING](../images/AutomatedDataAcquisition_to_Analytics.png)

---

## In the context of ETL, you now have the tools to perform this activity.

In the data loading lab, you read in three data files and then massaged the Pandas data frame to prepare the data for loading and to understand the semantics of the data.
You then loaded the database with data from the files.

We just need to understand how to acquire data from a remote resource, such as the web or an API, and process it with Pandas.

Additionally, in this notebook we will see how to use the SQLAlchemy library to simplify data loading.

## Tasks:

 **Consider**:
 + https://en.wikipedia.org/wiki/Land_use_statistics_by_country   
 
In the cells below, 

 1. Define a table for information about the world's countries.
 1. Describe some challenges you foresee with the data
 1. Review and modify code cells that pull down the data from the tables into a data frame
 1. Load the data into your database
 1. Test loaded data with SQL queries

### 1. Define Tables

### 2. Describe the challenges

### 3. Data Scraping Code

In [None]:
#import the library to query a website
import requests
# import Beautiful soup library to access functions to parse the data returned from the website
from bs4 import BeautifulSoup



In [None]:
# specify the url
url = "https://en.wikipedia.org/wiki/Land_use_statistics_by_country"
# Open website URL and return the html to the variable 'response'
response = requests.get(url)
print(response.encoding)
print(response)

The response we get from a web server is typically HTML content. 
We can read that content from the response object. 
Below, when a `BeautifulSoup` object is created from the HTML response, we explicitly pass the decoded text (`response.text`).

The encoding reported for this page is 'UTF-8', as printed by the cell above. 

[Click here for additional documentation about the response object.](http://docs.python-requests.org/en/master/user/quickstart/#response-content)
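
To see why the encoding matters, here is a minimal offline sketch. It relies on the fact that `response.text` is the raw bytes of `response.content` decoded with `response.encoding`; the sample string and the variable names are made up for illustration and are not taken from the page.

```python
# Hypothetical sample bytes standing in for response.content
raw_bytes = "Total area (km²): Côte d'Ivoire".encode("utf-8")

# Decoding with the correct encoding recovers the original text
good_text = raw_bytes.decode("utf-8")

# Decoding with the wrong encoding does not fail, it silently garbles
# every character that UTF-8 stores as more than one byte
bad_text = raw_bytes.decode("latin-1")

print(good_text)
print(bad_text)
```

If scraped text ever shows stray accented characters, checking `response.encoding` is a reasonable first step.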



In [None]:
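# Parse the HTML in the 'response' variable and store it as a BeautifulSoup object.
# "html.parser" is Python's built-in parser; naming it explicitly keeps the result
# the same regardless of which optional parsers are installed.
soup = BeautifulSoup(response.text, "html.parser")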

#####  Basic Inspection
Use the `prettify()` method to print the parsed document as indented, nested HTML.

In [None]:
print(soup.prettify())

We need to extract the table that lists the land use statistics for each country. 

This table is contained in one of the HTML tags.

We can work with the tags to extract the data inside them.  
`soup.<tag>` (for example, `soup.title`) returns the first matching element, including its opening and closing tags. 

Additionally, the `.string` attribute is the text between the tags.
Compare the two lines of output from the cell below.

In [None]:
print(soup.title)
print(soup.title.string)

**Identify the html tag**: 
The data is in a table. 
Right-click the table in your browser and choose the Inspect (or Inspect Element) option to identify the tag that holds the data. 

 * [Additional guide on webpage inspection](../resources/AnalyzingHTMLwithTheWebInspector.pdf)


<img src="../images/Wikipedia_Inspect_Screen.png">

**If we look at the inspected HTML source for the table,** 
abbreviated here to focus on the World row and the top two ranked rows of data.

```HTML
<table class="sortable wikitable jquery-tablesorter">

 <thead>
     <tr bgcolor="#ececec" valign="top">
         <th data-sort-type="number" class="headerSort" tabindex="0" role="columnheader button" title="Sort ascending">Rank</th>
         <th class="headerSort" tabindex="0" role="columnheader button" title="Sort ascending">Country</th>
         <th class="headerSort" tabindex="0" role="columnheader button" title="Sort ascending">Cultivated <br> land <br> (km<sup>2</sup>)</th>
<th class="headerSort" tabindex="0" role="columnheader button" title="Sort ascending">Cultivated <br> land <br> (%)</th>
         <th class="headerSort" tabindex="0" role="columnheader button" title="Sort ascending">Arable <br> land <br> (km<sup>2</sup>)</th>
         <th class="headerSort" tabindex="0" role="columnheader button" title="Sort ascending">Arable <br> land <br> (%)</th>
         <th class="headerSort" tabindex="0" role="columnheader button" title="Sort ascending">Permanent <br> crops <br> (km<sup>2</sup>)</th>
<th class="headerSort" tabindex="0" role="columnheader button" title="Sort ascending">Permanent <br> crops <br> (%)</th>
         <th class="headerSort" tabindex="0" role="columnheader button" title="Sort ascending">Other <br> lands <br> (km<sup>2</sup>)</th>
<th class="headerSort" tabindex="0" role="columnheader button" title="Sort ascending">Other <br> lands <br> (%)</th>
         <th class="headerSort" tabindex="0" role="columnheader button" title="Sort ascending">Total <br> area <br> (km<sup>2</sup>)</th>
         <th class="headerSort" tabindex="0" role="columnheader button" title="Sort ascending">Date
</th>
     </tr>
 </thead>
 <tbody>
<tr>
    <td>—</td>
    <td><span class="flagicon" style="padding-left:25px;">&nbsp;</span><b><a href="/wiki/World" title="World">World</a></b></td>
    <td>17,235,800</td>
    <td>11.6</td>
    <td>15,749,300</td>
    <td>10.6</td>
    <td>1,549,600</td>
    <td>1</td>
    <td>131.701.100</td>
    <td>88.4</td>
    <td>149,000,000</td>
    <td>2011
    </td>
</tr>
<tr>
    <td>1</td>
    <td><span class="flagicon"><img alt="" src="//upload.wikimedia.org/wikipedia/en/thumb/4/41/Flag_of_India.svg/23px-Flag_of_India.svg.png" decoding="async" width="23" height="15" class="thumbborder" srcset="//upload.wikimedia.org/wikipedia/en/thumb/4/41/Flag_of_India.svg/35px-Flag_of_India.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/4/41/Flag_of_India.svg/45px-Flag_of_India.svg.png 2x" data-file-width="1350" data-file-height="900">&nbsp;</span><a href="/wiki/India" title="India">India</a></td>
    <td>1,891,761</td>
    <td>57</td>
    <td>1,753,694</td>
    <td>52.8</td>
    <td>138,067</td>
    <td>4.2</td>
    <td>1,395,502</td>
    <td>43</td>
    <td>3,287,263</td>
    <td>2011
    </td>
</tr>
<tr>
    <td>2</td>
    <td><span class="flagicon"><img alt="" src="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/23px-Flag_of_the_United_States.svg.png" decoding="async" width="23" height="12" class="thumbborder" srcset="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/35px-Flag_of_the_United_States.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/46px-Flag_of_the_United_States.svg.png 2x" data-file-width="1235" data-file-height="650">&nbsp;</span><a href="/wiki/United_States" title="United States">United States</a></td>
    <td>1,681,826</td>
    <td>17.1</td>
    <td>1,652,028</td>
    <td>16.8</td>
    <td>29,798</td>
    <td>0.3</td>
    <td>8,151,691</td>
    <td>82.9</td>
    <td>9,833,517</td>
    <td>2011
    </td>
</tr>
<tr>
...
</tr></tbody><tfoot></tfoot></table>
```

We see that the table tag has class settings of:
 * sortable 
 * wikitable 
 * jquery-tablesorter
 
```HTML
<table class="sortable wikitable jquery-tablesorter">
```

We want to focus on the `wikitable`.  

In [None]:
# We can fetch all Tables with a find_all() 
all_tables=soup.find_all('table')
print(type(all_tables))
print(len(all_tables))


# We can find the first (only) occurrence 
right_table=soup.find('table', class_='wikitable')
print(type(right_table))

The returned `Tag` object is the table element.

**Examining the HTML Table Header, we have these columns**

 * Rank
 * Country
 * Cultivated Land km^2
 * Cultivated Land %
 * Arable Land km^2
 * Arable Land %
 * Permanent Crops km^2
 * Permanent Crops %
 * Other lands km^2
 * Other lands %
 * Total Area
 * Date

Therefore, a simple approach is to iterate through the HTML table rows (the `<tr>...</tr>` elements) and process the data cells in each one.

Reviewing the HTML above, we see we need to skip the headers and the "World" row.

Additionally, we will skip any row that falls outside the ranked rows, that is, any row where Rank is not a number.
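
The filtering rule can be tried offline on a tiny inline table before running it against the live page. The sketch below is only an illustration: it uses the standard library's `html.parser` instead of BeautifulSoup, the `TableRowCollector` class is a hypothetical helper, and the sample HTML is a cut-down version of the rows shown earlier.

```python
from html.parser import HTMLParser


class TableRowCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a <td> cell
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # header rows contain only <th>, so they stay empty
                self.rows.append(self._row)
            self._row = None


# \u2014 is the dash that the "World" row uses in place of a rank
sample_html = """
<table class="wikitable">
  <tr><th>Rank</th><th>Country</th><th>Cultivated land</th></tr>
  <tr><td>\u2014</td><td><b><a href="/wiki/World">World</a></b></td><td>17,235,800</td></tr>
  <tr><td>1</td><td><a href="/wiki/India">India</a></td><td>1,891,761</td></tr>
  <tr><td>2</td><td><a href="/wiki/United_States">United States</a></td><td>1,681,826</td></tr>
</table>
"""

collector = TableRowCollector()
collector.feed(sample_html)

# Keep only the rows whose first cell (Rank) is a number
ranked_rows = [row for row in collector.rows if row[0].isnumeric()]
for row in ranked_rows:
    print(row)
```

The header row drops out because it has no `<td>` cells, and the "World" row drops out because its rank is not numeric.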


In [None]:
# We will use the locale library so we can use 
# atoi and atof to convert strings such as "1,891,761" to integers and floats, respectively.
import locale
locale.setlocale( locale.LC_ALL, 'en_US.UTF-8' ) 

rank=[]
country=[]
cultivated_land_k=[]
cultivated_land_p=[]
arable_land_k=[]
arable_land_p=[]
permanent_crops_k=[]
permanent_crops_p=[]
other_land_k=[]
other_land_p=[]
total_area=[]
date_yr=[]


# Skip the first row because we don't need the headers
for row in right_table.find_all("tr")[1:]: 
    # For each row, pull out the td elements.
    cells = row.find_all('td') # The data cells for this row
    
    if len(cells)>2: # Only extract information from table body rows, not heading rows
        
        
        this_rank = cells[0].find(text=True)
        print("Processing rank {}".format(this_rank))
        
        # Only rows with a numeric rank are converted; skip anything else (e.g. the "World" row)
        if not this_rank.isnumeric():
            print("Non-Ranked, skipping")
            continue
        
        rank.append(locale.atoi(this_rank))
        
        # for the country name, we need to find the name (text) in the Country Hyperlink (a)
        countr_cell = cells[1].find('a').find(text=True)
        print(countr_cell)
        
        country.append(countr_cell)
        
        # Convert the data from text to numeric data types
        cultivated_land_k.append(locale.atoi(cells[2].find(text=True)))
        cultivated_land_p.append(locale.atof(cells[3].find(text=True)))
        
        arable_land_k.append(locale.atoi(cells[4].find(text=True)))
        arable_land_p.append(locale.atof(cells[5].find(text=True)))
        
        permanent_crops_k.append(locale.atoi(cells[6].find(text=True)))
        permanent_crops_p.append(locale.atof(cells[7].find(text=True)))
        
        # Note: this is converted to float because the Vatican City row has a non-integer value
        other_land_k.append(locale.atof(cells[8].find(text=True)))
        other_land_p.append(locale.atof(cells[9].find(text=True)))
        
        total_area.append(locale.atof(cells[10].find(text=True)))
        date_yr.append(locale.atoi(cells[11].find(text=True)))
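
One caveat with the loop above: `locale.setlocale` raises `locale.Error` when the `en_US.UTF-8` locale is not installed on the machine. A locale-free alternative is sketched below. The helper names (`parse_int`, `parse_float`, `parse_or_none`) are hypothetical, and the sketch assumes that commas appear only as thousands separators.

```python
def parse_int(text):
    """Convert a string such as '1,891,761' to an int."""
    return int(text.strip().replace(",", ""))


def parse_float(text):
    """Convert a string such as '52.8' or '1,395,502' to a float."""
    return float(text.strip().replace(",", ""))


def parse_or_none(converter, text):
    """Apply a converter, returning None for values that cannot be parsed."""
    try:
        return converter(text)
    except ValueError:
        return None


print(parse_int("1,891,761"))
print(parse_float("52.8"))
print(parse_int("2011\n"))                      # surrounding whitespace is ignored
print(parse_or_none(parse_int, "131.701.100"))  # malformed value from the World row
```

Returning `None` for malformed cells is one possible design choice; it keeps the column lists the same length so they can still be stacked into a data frame and inspected afterwards.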


##### Now that we have built all our columns, stack into a data frame!

In [None]:
import pandas as pd

# Note, in the table definition above, we listed 
# the country name first to use as a primary key

df=pd.DataFrame({'country': country,
                'rank': rank,
                'cultivated_land_k': cultivated_land_k,
                'cultivated_land_p': cultivated_land_p,
                'arable_land_k': arable_land_k,
                'arable_land_p': arable_land_p,
                'permanent_crops_k': permanent_crops_k,
                'permanent_crops_p': permanent_crops_p,
                'other_land_k': other_land_k,
                'other_land_p': other_land_p,
                'total_area': total_area,
                'date_yr': date_yr
                })


In [None]:
df.head()

In [None]:
df.tail()

### Check our column data types!
Does this match the data types we sketched out in the `CREATE TABLE` statement above?
If you need to adjust the definition, this would be the time.
Alternatively, we can adjust the columns using Pandas techniques.

In [None]:
df.dtypes

Once our Pandas data frame and the SQL table are in line with each other, we can load the data into the database.

---

### 4. Load the data into your database

This time, instead of the manual loading, we are going to use the SQLAlchemy library.


In [None]:
import getpass
mypasswd = getpass.getpass()
username = 'SSO'
host = 'pgsql.dsa.lan'
database = 'dsa_student'

In [None]:
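# Then connect to the DB
from sqlalchemy.engine.url import URL
from sqlalchemy import create_engine

# SQLAlchemy Connection Parameters
# Note: the dialect name is 'postgresql'; the older 'postgres' alias
# was removed in SQLAlchemy 1.4.
postgres_db = {'drivername': 'postgresql',
               'username': username,
               'password': mypasswd,
               'host': host,
               'database': database}
# URL.create() requires SQLAlchemy 1.4 or newer; on older versions use URL(**postgres_db)
engine = create_engine(URL.create(**postgres_db), echo=True)
del mypasswd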
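
The connection parameters above are combined into a URL of the general form `dialect://username:password@host/database`. The sketch below builds such a string by hand using only the standard library, mainly to show why special characters in a password need to be escaped. `build_postgres_url` is a hypothetical helper and the password is made up.

```python
from urllib.parse import quote_plus


def build_postgres_url(username, password, host, database):
    """Return a 'postgresql://user:password@host/database' connection string."""
    return "postgresql://{}:{}@{}/{}".format(
        quote_plus(username),  # escape characters such as '@' or '/'
        quote_plus(password),
        host,
        database,
    )


# Made-up credentials, for illustration only
example_url = build_postgres_url("SSO", "p@ss/word", "pgsql.dsa.lan", "dsa_student")
print(example_url)
```

`create_engine` also accepts a connection string like this directly, although passing the parts as separate parameters, as in the cell above, avoids having to do the escaping yourself.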


#### When you run the cell below, carefully examine the output so you see what the SQLAlchemy library is doing!

In [None]:

## With the SQLAlchemy engine created, use the data frame's to_sql function to load the table
df.to_sql('land_use_statistics', # The table to load
          engine,             # The engine created above
          schema= username,   # The schema where the table lives, our pawprint
          if_exists='append', # If the table already exists, append the rows to the end of it
          index=False,        # A pandas data frame has a row index, which we do not want to load
          chunksize=20)       # Load 20 records from the data frame at a time
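
Because the load uses `if_exists='append'`, running the cell a second time sends the same rows again. If the table declares `country` as its primary key, the database rejects the duplicates; without a key they would simply be stored twice. The sketch below illustrates the first case. It is a stand-in only: it uses an in-memory SQLite database from the standard library rather than the course PostgreSQL server, and it keeps just two of the columns.

```python
import sqlite3

# In-memory SQLite database standing in for the PostgreSQL server (assumption)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE land_use_statistics (country TEXT PRIMARY KEY, total_area REAL)"
)

rows = [("India", 3287263.0), ("United States", 9833517.0)]

# First load succeeds
conn.executemany("INSERT INTO land_use_statistics VALUES (?, ?)", rows)
conn.commit()

# Second load of the same rows violates the primary key
try:
    conn.executemany("INSERT INTO land_use_statistics VALUES (?, ?)", rows)
    conn.commit()
    second_load = "appended"
except sqlite3.IntegrityError:
    conn.rollback()
    second_load = "rejected"

row_count = conn.execute("SELECT COUNT(*) FROM land_use_statistics").fetchone()[0]
print(second_load, row_count)
```

If you need to reload from scratch, emptying the table first is one way to avoid the conflict.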


### 5. Test loaded data with SQL queries



```SQL
\x
select * from SSO.land_use_statistics limit 2;
```

---

```
-[ RECORD 1 ]-----+--------------
country           | India
rank              | 1
cultivated_land_k | 1891761
cultivated_land_p | 57
arable_land_k     | 1753694
arable_land_p     | 52.8
permanent_crops_k | 138067
permanent_crops_p | 4.2
other_land_k      | 1.3955e+06
other_land_p      | 43
total_area        | 3.28726e+06
date_yr           | 2011
-[ RECORD 2 ]-----+--------------
country           | United States
rank              | 2
cultivated_land_k | 1681826
cultivated_land_p | 17.1
arable_land_k     | 1652028
arable_land_p     | 16.8
permanent_crops_k | 29798
permanent_crops_p | 0.3
other_land_k      | 8.15169e+06
other_land_p      | 82.9
total_area        | 9.83352e+06
date_yr           | 2011
```
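
Beyond eyeballing a couple of records, a few aggregate queries make useful checks after a load. The sketch below runs three such checks against an in-memory SQLite database loaded with the two records shown above, restricted to four columns. SQLite is only a stand-in for the PostgreSQL server, and the rule that the cultivated and other land percentages sum to roughly 100 is an assumption based on the sample rows.

```python
import sqlite3

# In-memory SQLite database standing in for the PostgreSQL server (assumption)
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE land_use_statistics (
        country           TEXT PRIMARY KEY,
        cultivated_land_p REAL,
        other_land_p      REAL,
        date_yr           INTEGER
    )
""")
conn.executemany(
    "INSERT INTO land_use_statistics VALUES (?, ?, ?, ?)",
    [("India", 57.0, 43.0, 2011), ("United States", 17.1, 82.9, 2011)],
)

# Check 1: did we load the number of rows we expected?
row_count = conn.execute(
    "SELECT COUNT(*) FROM land_use_statistics"
).fetchone()[0]

# Check 2: are any key fields missing?
missing = conn.execute("""
    SELECT COUNT(*) FROM land_use_statistics
    WHERE country IS NULL OR date_yr IS NULL
""").fetchone()[0]

# Check 3: do the percentages add up to about 100 for every country?
bad_rows = conn.execute("""
    SELECT country FROM land_use_statistics
    WHERE ABS(cultivated_land_p + other_land_p - 100.0) > 0.5
""").fetchall()

print(row_count, missing, bad_rows)
```

The same `SELECT` statements can be run in `psql` against your own schema by prefixing the table name with the schema name.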

---






#### Now that the data is loaded, let's pull it back out!





In [None]:
df_backout = pd.read_sql_table(
    'land_use_statistics',
    con = engine,             # The engine created above
    schema= username   # The schema where the table lives, our pawprint
)

In [None]:
df_backout.head(10)

In [None]:
df_backout.tail(10)

# Save your notebook, then `File > Close and Halt`

---