# Introduction to ETL

### Introduction

Now that we know a little bit about data structures, we can focus on retrieving our data and getting it ready to use.  This process is very similar to the standard process you might see in production in the real world.

We can talk about this process, commonly called ETL (extract, transform, load), in three steps.

## 1. Extract the data

<img src="./extracting-fruits.png" width="60%" > 

The first step is retrieving our data.  This isn't so bad.  So far, we have used an API to perform this task.

In [7]:
import requests 
response = requests.get('https://data.texas.gov/resource/naix-2893.json?location_name=MAX%27S%20WINE%20DIVE')
revenues = response.json()
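
The query string in that url is percent-encoded by hand (`%27` for the apostrophe, `%20` for each space). As a minimal sketch, assuming we want to ask for other locations too, we could let Python build the url for us. The `build_url` helper and the `BASE_URL` name are our own inventions for illustration, not part of the API.

```python
from urllib.parse import quote, urlencode

# Endpoint taken from the request above.
BASE_URL = 'https://data.texas.gov/resource/naix-2893.json'

def build_url(location_name):
    """Return the API url filtered to one location (hypothetical helper)."""
    # quote_via=quote encodes spaces as %20 instead of the default +
    query = urlencode({'location_name': location_name}, quote_via=quote)
    return f'{BASE_URL}?{query}'

url = build_url("MAX'S WINE DIVE")
```

The resulting `url` can then be passed to `requests.get` just as in the cell above.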

## 2. Transform the data

Once we get that data, we generally want to make some modifications to it before we really use it.  We do this in two ways.

### A. Reduce our data

<img src="./reduce-fruits.png" width="60%">

Once we have the data, we generally have too much.  We only select the finest data...meaning the data we plan on using.  Let's look at all of the types of data we get for each month.

In [9]:
revenues[0].keys()

dict_keys(['beer_receipts', 'cover_charge_receipts', 'inside_outside_city_limits_code_y_n', 'liquor_receipts', 'location_address', 'location_city', 'location_county', 'location_name', 'location_number', 'location_state', 'location_zip', 'obligation_end_date_yyyymmdd', 'responsibility_begin_date_yyyymmdd', 'tabc_permit_number', 'taxpayer_address', 'taxpayer_city', 'taxpayer_county', 'taxpayer_name', 'taxpayer_number', 'taxpayer_state', 'taxpayer_zip', 'total_receipts', 'wine_receipts'])

There's a lot of extra data there that we don't need.  We'll make our future tasks easier by removing the data we don't want. 
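
One way to do this, sketched under a few assumptions, is to build a new dictionary that holds only the keys we care about. The `reduce_record` helper and the choice of `wanted_keys` are ours, and the sample record is trimmed down and partly made up so that the example runs without calling the API.

```python
# One record shaped like the API response.  Only location_name and
# total_receipts come from the lesson; the other values are made up.
record = {
    'location_name': "MAX'S WINE DIVE",
    'location_city': 'EXAMPLE CITY',
    'taxpayer_name': 'EXAMPLE TAXPAYER',
    'total_receipts': '100368',
}

# The keys we plan on using (our choice, not prescribed by the API).
wanted_keys = ['location_name', 'total_receipts']

def reduce_record(record, keys):
    """Return a new dictionary holding only the selected keys."""
    return {key: record[key] for key in keys}

reduced = reduce_record(record, wanted_keys)
```

To reduce every month at once, we could apply the same helper across the whole list, for example `[reduce_record(r, wanted_keys) for r in revenues]`.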

### B. Coerce our data

<img src="transforming-fruits.png" width="60%" />

Ok, so if we look a little more at some of the data we get back from our API, we'll notice that a lot of it is not in the correct format.

In [11]:
total_receipt = revenues[0]['total_receipts']
total_receipt

'100368'

That represents the total revenue for one of our months.  But it should be a number, not a string.  If it were a number, we could use it in calculations, such as taking a percentage of that revenue.  But we can't do that while working with a string.

In [17]:
.12 * total_receipt

TypeError: can't multiply sequence by non-int of type 'float'
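
Coercing the string to a number first makes the multiplication work. Here is a minimal sketch, assuming the API always sends whole-number strings for the receipt fields. The `coerce_receipts` helper is hypothetical, something we might write ourselves rather than anything the API provides.

```python
total_receipt = '100368'  # the string value shown above

# Convert the string to an integer so that arithmetic works.
total_receipt_number = int(total_receipt)
share = .12 * total_receipt_number  # no TypeError this time

def coerce_receipts(record):
    """Return a copy of record with every *_receipts value as an int."""
    return {key: int(value) if key.endswith('_receipts') else value
            for key, value in record.items()}

coerced = coerce_receipts({'location_name': "MAX'S WINE DIVE",
                           'total_receipts': '100368'})
```

Note that `int` raises a `ValueError` if a value is not a whole-number string, so real data may need extra handling.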

## 3. Load our data

<img src="./packaged-data.png" width="60%">

Finally, once we have just the data we want and our data is ready to use, we are ready to store it for later use.  Or we can just directly use that properly transformed data.
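
As one possible sketch of storing the data, we could write our cleaned records to a JSON file and read them back later. The file name `revenues.json` and both helper functions are assumptions made for illustration; a database would work just as well.

```python
import json
import os
import tempfile

# A reduced and coerced record like the one this lesson works toward.
clean_revenues = [{'location_name': "MAX'S WINE DIVE", 'total_receipts': 100368}]

def save_records(records, path):
    """Write the list of records to path as JSON."""
    with open(path, 'w') as f:
        json.dump(records, f)

def load_records(path):
    """Read the list of records back from path."""
    with open(path) as f:
        return json.load(f)

# Hypothetical file name; a temporary directory keeps the example tidy.
path = os.path.join(tempfile.mkdtemp(), 'revenues.json')
save_records(clean_revenues, path)
stored = load_records(path)
```

Because the receipts were already coerced to numbers, they come back from the file as numbers and are ready to use.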

### Summary

In this lesson we introduced the idea of extracting data, and then transforming the data to make it more suitable to our needs.  We do this by reducing the data to only what we need, and then coercing the data so that it is in an easier-to-use format.  Finally, we either store that data for later use or use it directly.

### Resources

[Tropicana Goodness of Juice](https://www.youtube.com/watch?v=lM7XK9QYkJ8&list=PL_Lnx8jkME7mCTrs2HIxxeNy1wpORGsu8&index=22)