# Cleaning Data: Intro to ETL

### Introduction

In this lesson, we will see how to take a large amount of data from an API and clean it.  A fancy term for this is extract, transform, and load (ETL).  Extract just means retrieving the data, which we already know how to do via an API, so nothing new there.  Load means saving the data, which we'll show at the end.

The transform part is more interesting.  We generally receive data in an awkward format, and we transform it into a format that makes our lives easier.  We transform data in two steps: (1) reduce the amount of data by throwing away what we don't need, and (2) coerce the remaining data into the correct format or datatype.

We'll walk you through this process, but we expect you to complete the review material, like looping through data, on your own.


We'll do this using the Texas Open Data Portal to explore restaurant revenue data.  This information is available via Texas's Open Data API, in its Mixed Beverage Receipts dataset.  Let's get started.

### 1. Extracting Our Data from an API

Now Max's Wine Dive is a restaurant with multiple locations in Texas.

<img src='./max-maps.png' width="50%">

Let's see what information we can find on Max's by using Texas's Open Data API.  Navigating to the [Mixed Beverage Receipts](https://dev.socrata.com/foundry/data.texas.gov/naix-2893) data, we see that we can search for specific restaurants using the `location_name` parameter.  Let's do that below.

In [1]:
url = "https://data.texas.gov/resource/naix-2893.json?location_name=MAX%27S%20WINE%20DIVE"
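
The `%27` and `%20` in the url are the percent-encoded forms of the apostrophe and the space in the restaurant's name.  As a minimal sketch, here is one way we could build the same url with Python's standard library.  The names `base_url`, `encoded_name` and `built_url` are just for illustration.

```python
from urllib.parse import quote

base_url = "https://data.texas.gov/resource/naix-2893.json"

# quote() percent-encodes characters that are not safe in a URL,
# such as the apostrophe (%27) and the space (%20).
encoded_name = quote("MAX'S WINE DIVE")

built_url = f"{base_url}?location_name={encoded_name}"
print(built_url)
```

Alternatively, `requests.get` accepts a `params` dictionary and handles the encoding for us.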

Use the url above to make a request to the API, and store the json results as `restaurant_receipts`.

In [2]:
import requests
response = requests.get(url)
restaurant_receipts = response.json()

In [3]:
len(restaurant_receipts)
# 61

61

In [4]:
restaurant_receipts

[{'taxpayer_number': '12727298569',
  'taxpayer_name': 'MWD AUSTIN DOWNTOWN, LLC',
  'taxpayer_address': '7026 OLD KATY RD STE 255',
  'taxpayer_city': 'HOUSTON',
  'taxpayer_state': 'TX',
  'taxpayer_zip': '77024',
  'taxpayer_county': '101',
  'location_number': '1',
  'location_name': "MAX'S WINE DIVE",
  'location_address': '207 SAN JACINTO BLVD STE 200',
  'location_city': 'AUSTIN',
  'location_state': 'TX',
  'location_zip': '78701',
  'location_county': '227',
  'inside_outside_city_limits_code_y_n': 'Y',
  'tabc_permit_number': 'MB944126',
  'responsibility_begin_date_yyyymmdd': '2016-05-13T00:00:00.000',
  'obligation_end_date_yyyymmdd': '2016-09-30T00:00:00.000',
  'liquor_receipts': '18265',
  'wine_receipts': '71497',
  'beer_receipts': '10606',
  'cover_charge_receipts': '0',
  'total_receipts': '100368'},
 {'taxpayer_number': '12727298569',
  'taxpayer_name': 'MWD AUSTIN DOWNTOWN, LLC',
  'taxpayer_address': '7026 OLD KATY RD STE 255',
  'taxpayer_city': 'HOUSTON',
  'tax

### 2. Understanding What's Returned

We see that we get back a list of 61 entries.  Let's see what's in these entries by taking a closer look at the first entry.

In [5]:
first_receipt = restaurant_receipts[0]
first_receipt

# {'beer_receipts': '10606',
#  'cover_charge_receipts': '0',
#  'inside_outside_city_limits_code_y_n': 'Y',
#  'liquor_receipts': '18265',
#  'location_address': '207 SAN JACINTO BLVD STE 200',
#  'location_city': 'AUSTIN',
#  'location_county': '227',
#  'location_name': "MAX'S WINE DIVE",
#  'location_number': '1',
#  'location_state': 'TX',
#  'location_zip': '78701',
#  'obligation_end_date_yyyymmdd': '2016-09-30T00:00:00.000',
#  'responsibility_begin_date_yyyymmdd': '2016-05-13T00:00:00.000',
#  'tabc_permit_number': 'MB944126',
#  'taxpayer_address': '7026 OLD KATY RD STE 255',
#  'taxpayer_city': 'HOUSTON',
#  'taxpayer_county': '101',
#  'taxpayer_name': 'MWD AUSTIN DOWNTOWN, LLC',
#  'taxpayer_number': '12727298569',
#  'taxpayer_state': 'TX',
#  'taxpayer_zip': '77024',
#  'total_receipts': '100368',
#  'wine_receipts': '71497'}

{'taxpayer_number': '12727298569',
 'taxpayer_name': 'MWD AUSTIN DOWNTOWN, LLC',
 'taxpayer_address': '7026 OLD KATY RD STE 255',
 'taxpayer_city': 'HOUSTON',
 'taxpayer_state': 'TX',
 'taxpayer_zip': '77024',
 'taxpayer_county': '101',
 'location_number': '1',
 'location_name': "MAX'S WINE DIVE",
 'location_address': '207 SAN JACINTO BLVD STE 200',
 'location_city': 'AUSTIN',
 'location_state': 'TX',
 'location_zip': '78701',
 'location_county': '227',
 'inside_outside_city_limits_code_y_n': 'Y',
 'tabc_permit_number': 'MB944126',
 'responsibility_begin_date_yyyymmdd': '2016-05-13T00:00:00.000',
 'obligation_end_date_yyyymmdd': '2016-09-30T00:00:00.000',
 'liquor_receipts': '18265',
 'wine_receipts': '71497',
 'beer_receipts': '10606',
 'cover_charge_receipts': '0',
 'total_receipts': '100368'}

Now looking at the first entry, it looks like the restaurant reports total alcohol revenue, as well as revenue for beer, wine, and liquor.  From the `location_number` and `location_address` attributes, it looks like this information is for a single Max's Wine Dive location.  And `responsibility_begin_date_yyyymmdd` and `obligation_end_date_yyyymmdd` could perhaps mark the time period.

We still have some questions though.  Are we sure we should be using `location_address` instead of `taxpayer_address`?  Are there multiple addresses in the data?  One way of getting a better sense of the data is to see the range of values in the fields that interest us.  So let's do the following.  Let's collect just the `location_address` values in a list called `location_addresses`.  Then we'll find the distinct values in that list.

In [6]:
restaurant_receipts[0].keys()

dict_keys(['taxpayer_number', 'taxpayer_name', 'taxpayer_address', 'taxpayer_city', 'taxpayer_state', 'taxpayer_zip', 'taxpayer_county', 'location_number', 'location_name', 'location_address', 'location_city', 'location_state', 'location_zip', 'location_county', 'inside_outside_city_limits_code_y_n', 'tabc_permit_number', 'responsibility_begin_date_yyyymmdd', 'obligation_end_date_yyyymmdd', 'liquor_receipts', 'wine_receipts', 'beer_receipts', 'cover_charge_receipts', 'total_receipts'])

In [7]:
restaurant_receipts[0]['location_address']

'207 SAN JACINTO BLVD STE 200'

In [8]:
[d['location_address'] for d in restaurant_receipts]

['207 SAN JACINTO BLVD STE 200',
 '207 SAN JACINTO BLVD STE 200',
 '207 SAN JACINTO BLVD STE 200',
 '207 SAN JACINTO BLVD STE 200',
 '207 SAN JACINTO BLVD STE 200',
 '207 SAN JACINTO BLVD STE 200',
 '207 SAN JACINTO BLVD STE 200',
 '207 SAN JACINTO BLVD STE 200',
 '207 SAN JACINTO BLVD STE 200',
 '207 SAN JACINTO BLVD STE 200',
 '207 SAN JACINTO BLVD STE 200',
 '207 SAN JACINTO BLVD STE 200',
 '207 SAN JACINTO BLVD STE 200',
 '207 SAN JACINTO BLVD STE 200',
 '207 SAN JACINTO BLVD STE 200',
 '207 SAN JACINTO BLVD STE 200',
 '207 SAN JACINTO BLVD STE 200',
 '207 SAN JACINTO BLVD STE 200',
 '207 SAN JACINTO BLVD STE 200',
 '207 SAN JACINTO BLVD STE 200',
 '207 SAN JACINTO BLVD STE 200',
 '207 SAN JACINTO BLVD STE 200',
 '207 SAN JACINTO BLVD STE 200',
 '207 SAN JACINTO BLVD STE 200',
 '207 SAN JACINTO BLVD STE 200',
 '207 SAN JACINTO BLVD STE 200',
 '207 SAN JACINTO BLVD STE 200',
 '207 SAN JACINTO BLVD STE 200',
 '207 SAN JACINTO BLVD STE 200',
 '207 SAN JACINTO BLVD STE 200',
 '207 SAN 

In [9]:
location_addresses = [d['location_address'] for d in restaurant_receipts]

set(location_addresses)
# {'207 SAN JACINTO BLVD STE 200', '3600 MCKINNEY AVE STE 100'}

{'207 SAN JACINTO BLVD STE 200', '3600 MCKINNEY AVE STE 100'}

Ok, so we do see two addresses listed here.  A Google search confirms that these match the addresses of two Max's Wine Dive locations.
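
A `set` tells us which addresses appear, but not how often.  If we also wanted a count per address, one option is `collections.Counter`.  The sketch below uses a small stand-in list, `sample_receipts`, rather than the real API results, so the counts are illustrative only.

```python
from collections import Counter

# Stand-in rows with the same 'location_address' key as the API results.
sample_receipts = [
    {'location_address': '207 SAN JACINTO BLVD STE 200'},
    {'location_address': '3600 MCKINNEY AVE STE 100'},
    {'location_address': '207 SAN JACINTO BLVD STE 200'},
]

# Counter maps each distinct address to the number of times it appears.
address_counts = Counter(d['location_address'] for d in sample_receipts)
print(address_counts)
```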

### 3. Reduce Our Data (Transform)

#### A. Reduce the number of items

Now, as we know, one of the unwieldy things about APIs is the amount of information that's returned.  So let's work on scoping down the amount of information we need to work with.  The first thing we can do is focus on just one restaurant, the one at the `3600 MCKINNEY AVE STE 100` address.  Select only those receipts with the address '3600 MCKINNEY AVE STE 100' and place them in a list called `dallas_maxs`, as this location is in Dallas.  Use Python to accomplish this.

In [10]:
[d for d in restaurant_receipts if d['location_address'] == '3600 MCKINNEY AVE STE 100']

[{'taxpayer_number': '32046798537',
  'taxpayer_name': 'MWD DALLAS UPTOWN, LLC',
  'taxpayer_address': '7026 OLD KATY RD STE 250',
  'taxpayer_city': 'HOUSTON',
  'taxpayer_state': 'TX',
  'taxpayer_zip': '77024',
  'taxpayer_county': '101',
  'location_number': '1',
  'location_name': "MAX'S WINE DIVE",
  'location_address': '3600 MCKINNEY AVE STE 100',
  'location_city': 'DALLAS',
  'location_state': 'TX',
  'location_zip': '75204',
  'location_county': '57',
  'inside_outside_city_limits_code_y_n': 'Y',
  'tabc_permit_number': 'MB917035',
  'responsibility_begin_date_yyyymmdd': '2015-08-11T00:00:00.000',
  'responsibility_end_date_yyyymmdd': '2017-08-21T00:00:00.000',
  'obligation_end_date_yyyymmdd': '2016-12-31T00:00:00.000',
  'liquor_receipts': '12257',
  'wine_receipts': '41093',
  'beer_receipts': '2832',
  'cover_charge_receipts': '0',
  'total_receipts': '56182'},
 {'taxpayer_number': '32046798537',
  'taxpayer_name': 'MWD DALLAS UPTOWN, LLC',
  'taxpayer_address': '7026 OLD

In [11]:
dallas_maxs = [d for d in restaurant_receipts if d['location_address'] == '3600 MCKINNEY AVE STE 100']


dallas_maxs[0]['location_address']
# '3600 MCKINNEY AVE STE 100'

'3600 MCKINNEY AVE STE 100'

In [12]:
len(dallas_maxs)
# 25

25
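
If we expected to filter by other addresses later, we could wrap the same list comprehension in a small function.  This is just a sketch: `receipts_for_address` is a hypothetical helper name, and `sample_receipts` is a stand-in for the real API results.

```python
def receipts_for_address(receipts, address):
    """Return only the receipts whose 'location_address' equals address."""
    return [r for r in receipts if r.get('location_address') == address]

# Stand-in rows; the real list comes from the API.
sample_receipts = [
    {'location_address': '207 SAN JACINTO BLVD STE 200', 'total_receipts': '100368'},
    {'location_address': '3600 MCKINNEY AVE STE 100', 'total_receipts': '56182'},
    {'location_address': '3600 MCKINNEY AVE STE 100', 'total_receipts': '9400'},
]

dallas_sample = receipts_for_address(sample_receipts, '3600 MCKINNEY AVE STE 100')
print(len(dallas_sample))
```

Using `r.get(...)` instead of `r[...]` means a row that lacks the key is simply skipped rather than raising a `KeyError`.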

#### B. Reduce the amount of data per item

Ok, so now that we have cut the number of items from 61 to 25, let's also limit the amount of information in each item.  Let's start by reminding ourselves what information is contained in each dictionary.  An easy way to do so is to use the `keys` method on our dictionary.

In [13]:
first_dallas_receipt = dallas_maxs[0]
first_dallas_receipt.keys()

dict_keys(['taxpayer_number', 'taxpayer_name', 'taxpayer_address', 'taxpayer_city', 'taxpayer_state', 'taxpayer_zip', 'taxpayer_county', 'location_number', 'location_name', 'location_address', 'location_city', 'location_state', 'location_zip', 'location_county', 'inside_outside_city_limits_code_y_n', 'tabc_permit_number', 'responsibility_begin_date_yyyymmdd', 'responsibility_end_date_yyyymmdd', 'obligation_end_date_yyyymmdd', 'liquor_receipts', 'wine_receipts', 'beer_receipts', 'cover_charge_receipts', 'total_receipts'])

Ok, so we can see that a lot of information is here.  Let's reduce our information by only including the `total_receipts`, the `responsibility_begin_date_yyyymmdd` and the `obligation_end_date_yyyymmdd`.

In [14]:
keys = ("total_receipts", "responsibility_begin_date_yyyymmdd", "obligation_end_date_yyyymmdd")
[{i:elem[i] for i in elem if i in keys} for elem in dallas_maxs]

[{'responsibility_begin_date_yyyymmdd': '2015-08-11T00:00:00.000',
  'obligation_end_date_yyyymmdd': '2016-12-31T00:00:00.000',
  'total_receipts': '56182'},
 {'responsibility_begin_date_yyyymmdd': '2015-08-11T00:00:00.000',
  'obligation_end_date_yyyymmdd': '2017-08-31T00:00:00.000',
  'total_receipts': '9400'},
 {'responsibility_begin_date_yyyymmdd': '2015-08-11T00:00:00.000',
  'obligation_end_date_yyyymmdd': '2016-06-30T00:00:00.000',
  'total_receipts': '50574'},
 {'responsibility_begin_date_yyyymmdd': '2015-08-11T00:00:00.000',
  'obligation_end_date_yyyymmdd': '2016-08-31T00:00:00.000',
  'total_receipts': '50305'},
 {'responsibility_begin_date_yyyymmdd': '2015-08-11T00:00:00.000',
  'obligation_end_date_yyyymmdd': '2015-09-30T00:00:00.000',
  'total_receipts': '66609'},
 {'responsibility_begin_date_yyyymmdd': '2015-08-11T00:00:00.000',
  'obligation_end_date_yyyymmdd': '2017-06-30T00:00:00.000',
  'total_receipts': '39535'},
 {'responsibility_begin_date_yyyymmdd': '2015-08-11T0

In [15]:
keys = ("total_receipts", "responsibility_begin_date_yyyymmdd", "obligation_end_date_yyyymmdd")

restaurant_revenues = [{i:elem[i] for i in elem if i in keys} for elem in dallas_maxs]
len(restaurant_revenues)
# 25

25
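
The comprehension above loops over every key in each dictionary and keeps the ones we want.  An equivalent approach is to loop over the wanted keys instead, skipping any that a record happens to lack.  Below is a minimal sketch of that idea, where `select_keys` is a hypothetical helper and `sample_record` is a shortened stand-in for one receipt.

```python
keys = ("total_receipts", "responsibility_begin_date_yyyymmdd", "obligation_end_date_yyyymmdd")

def select_keys(record, wanted):
    # Loop over the wanted keys and skip any that this record lacks.
    return {k: record[k] for k in wanted if k in record}

# Shortened stand-in for one receipt dictionary.
sample_record = {
    'location_city': 'DALLAS',
    'responsibility_begin_date_yyyymmdd': '2015-08-11T00:00:00.000',
    'obligation_end_date_yyyymmdd': '2016-12-31T00:00:00.000',
    'total_receipts': '56182',
}

print(select_keys(sample_record, keys))
```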

In [16]:
restaurant_revenues

[{'responsibility_begin_date_yyyymmdd': '2015-08-11T00:00:00.000',
  'obligation_end_date_yyyymmdd': '2016-12-31T00:00:00.000',
  'total_receipts': '56182'},
 {'responsibility_begin_date_yyyymmdd': '2015-08-11T00:00:00.000',
  'obligation_end_date_yyyymmdd': '2017-08-31T00:00:00.000',
  'total_receipts': '9400'},
 {'responsibility_begin_date_yyyymmdd': '2015-08-11T00:00:00.000',
  'obligation_end_date_yyyymmdd': '2016-06-30T00:00:00.000',
  'total_receipts': '50574'},
 {'responsibility_begin_date_yyyymmdd': '2015-08-11T00:00:00.000',
  'obligation_end_date_yyyymmdd': '2016-08-31T00:00:00.000',
  'total_receipts': '50305'},
 {'responsibility_begin_date_yyyymmdd': '2015-08-11T00:00:00.000',
  'obligation_end_date_yyyymmdd': '2015-09-30T00:00:00.000',
  'total_receipts': '66609'},
 {'responsibility_begin_date_yyyymmdd': '2015-08-11T00:00:00.000',
  'obligation_end_date_yyyymmdd': '2017-06-30T00:00:00.000',
  'total_receipts': '39535'},
 {'responsibility_begin_date_yyyymmdd': '2015-08-11T0

In [17]:
restaurant_revenues[0:2]

# [{'responsibility_begin_date_yyyymmdd': '2015-08-11T00:00:00.000',
#   'obligation_end_date_yyyymmdd': '2016-12-31T00:00:00.000',
#   'total_receipts': '56182'},
#  {'responsibility_begin_date_yyyymmdd': '2015-08-11T00:00:00.000',
#   'obligation_end_date_yyyymmdd': '2017-08-31T00:00:00.000',
#   'total_receipts': '9400'}]

[{'responsibility_begin_date_yyyymmdd': '2015-08-11T00:00:00.000',
  'obligation_end_date_yyyymmdd': '2016-12-31T00:00:00.000',
  'total_receipts': '56182'},
 {'responsibility_begin_date_yyyymmdd': '2015-08-11T00:00:00.000',
  'obligation_end_date_yyyymmdd': '2017-08-31T00:00:00.000',
  'total_receipts': '9400'}]

Now looking at the first two elements, we see that each end date falls on the last day of a month, but that `responsibility_begin_date_yyyymmdd` is always August 11, 2015.  It seems like this just marks the first time that this Max's location needed to submit information.  We don't need to know this, so let's remove it from our list.

In [18]:
[{i:elem[i] for i in elem if i!="responsibility_begin_date_yyyymmdd"} for elem in restaurant_revenues]

[{'obligation_end_date_yyyymmdd': '2016-12-31T00:00:00.000',
  'total_receipts': '56182'},
 {'obligation_end_date_yyyymmdd': '2017-08-31T00:00:00.000',
  'total_receipts': '9400'},
 {'obligation_end_date_yyyymmdd': '2016-06-30T00:00:00.000',
  'total_receipts': '50574'},
 {'obligation_end_date_yyyymmdd': '2016-08-31T00:00:00.000',
  'total_receipts': '50305'},
 {'obligation_end_date_yyyymmdd': '2015-09-30T00:00:00.000',
  'total_receipts': '66609'},
 {'obligation_end_date_yyyymmdd': '2017-06-30T00:00:00.000',
  'total_receipts': '39535'},
 {'obligation_end_date_yyyymmdd': '2017-02-28T00:00:00.000',
  'total_receipts': '43094'},
 {'obligation_end_date_yyyymmdd': '2017-05-31T00:00:00.000',
  'total_receipts': '34903'},
 {'obligation_end_date_yyyymmdd': '2016-11-30T00:00:00.000',
  'total_receipts': '41054'},
 {'obligation_end_date_yyyymmdd': '2016-07-31T00:00:00.000',
  'total_receipts': '51707'},
 {'obligation_end_date_yyyymmdd': '2017-07-31T00:00:00.000',
  'total_receipts': '39627'},


In [19]:
revenues_by_date = [{i:elem[i] for i in elem if i!="responsibility_begin_date_yyyymmdd"} for elem in restaurant_revenues]

len(revenues_by_date)
# 25

25

In [20]:
revenues_by_date[0:2]

[{'obligation_end_date_yyyymmdd': '2016-12-31T00:00:00.000',
  'total_receipts': '56182'},
 {'obligation_end_date_yyyymmdd': '2017-08-31T00:00:00.000',
  'total_receipts': '9400'}]

### 4.  Coerce the data (Still Transform)

Our final step will be to coerce our data to the correct format.  We'd like total receipts to be an integer and we'd like the end date to be of type `datetime`.

This is a little tricky, so let's do it with the first dictionary first.

In [21]:
first_rev_by_date = revenues_by_date[0]
first_rev_by_date

{'obligation_end_date_yyyymmdd': '2016-12-31T00:00:00.000',
 'total_receipts': '56182'}

We can go from a string to an integer by using the `int` function.  The `int` function is called a constructor because it's used to construct integers.  We can use it so long as we pass in a string that can be changed to an integer.

In [22]:
int('33')

33

In [23]:
int(first_rev_by_date['total_receipts'])

56182
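
If we pass `int` a string that cannot be changed to an integer, it raises a `ValueError`.  The receipts shown above all look like clean integer strings, but as a hedged sketch, here is one way we could guard against bad values.  The `to_int` helper and its `default` parameter are hypothetical, not part of the lesson's code.

```python
def to_int(value, default=None):
    """Convert a string to an int, returning default if it cannot be parsed."""
    try:
        return int(value)
    except (TypeError, ValueError):
        # TypeError covers values like None; ValueError covers strings like 'n/a'.
        return default

print(to_int('56182'))  # a clean integer string converts normally
print(to_int('n/a'))    # not a number, so we get the default
print(to_int('12.5'))   # int() does not accept decimal strings either
```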

Ok, now let's coerce the string into a datetime.  First we ask the great oracle Google how we can [convert a string into a datetime](https://www.google.com/search?q=datetime+from+string+python&oq=datetime+from+st&aqs=chrome.0.0j69i57j0l4.2653j0j7&sourceid=chrome&ie=UTF-8).  Then we follow the directions in the search results.  [One of those results](https://chrisalbon.com/python/basics/strings_to_datetime/) says we can convert with something like the following.

In [24]:
from datetime import datetime
start = '2011-01-03'
datetime.strptime(start, '%Y-%m-%d')

datetime.datetime(2011, 1, 3, 0, 0)

That gets us part of the way, but our date information also includes hours, minutes, and seconds.  So we can either remove that trailing data, or we can do some more searching on Google.  We go for the searching and wind up with the following.

In [25]:
from datetime import datetime
end_date = first_rev_by_date['obligation_end_date_yyyymmdd']
datetime.strptime(end_date, '%Y-%m-%dT%H:%M:%S.%f')

datetime.datetime(2016, 12, 31, 0, 0)
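
Because our date strings happen to be in ISO 8601 format, there is also a shortcut.  Assuming Python 3.7 or later, `datetime.fromisoformat` can parse the same string without a format specification.  The sketch below compares the two approaches on the string from our first dictionary.

```python
from datetime import datetime

end_date = '2016-12-31T00:00:00.000'

# Explicit format string, as in the cell above.
with_strptime = datetime.strptime(end_date, '%Y-%m-%dT%H:%M:%S.%f')

# Shortcut for ISO 8601 style strings (assumes Python 3.7+).
with_fromisoformat = datetime.fromisoformat(end_date)

print(with_strptime == with_fromisoformat)
```

The `strptime` version is more flexible, since it works for any layout we can describe with format codes.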

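Ok, now that we were able to accomplish this for the attributes of one dictionary, let's use a loop to coerce each dictionary in `revenues_by_date`.  In the loop below we convert `total_receipts` to an integer and rename the date key to `end_date`, leaving the date itself as a string.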

In [26]:
formatted_revenues = []
for revenue in revenues_by_date:
    # Coerce the receipts string to an integer; the date stays a string here.
    total = int(revenue['total_receipts'])
    formatted = {
        'total_receipts': total,
        'end_date': revenue['obligation_end_date_yyyymmdd'],
    }
    formatted_revenues.append(formatted)
formatted_revenues[0:2]

[{'total_receipts': 56182, 'end_date': '2016-12-31T00:00:00.000'},
 {'total_receipts': 9400, 'end_date': '2017-08-31T00:00:00.000'}]

In [27]:
formatted_revenues

[{'total_receipts': 56182, 'end_date': '2016-12-31T00:00:00.000'},
 {'total_receipts': 9400, 'end_date': '2017-08-31T00:00:00.000'},
 {'total_receipts': 50574, 'end_date': '2016-06-30T00:00:00.000'},
 {'total_receipts': 50305, 'end_date': '2016-08-31T00:00:00.000'},
 {'total_receipts': 66609, 'end_date': '2015-09-30T00:00:00.000'},
 {'total_receipts': 39535, 'end_date': '2017-06-30T00:00:00.000'},
 {'total_receipts': 43094, 'end_date': '2017-02-28T00:00:00.000'},
 {'total_receipts': 34903, 'end_date': '2017-05-31T00:00:00.000'},
 {'total_receipts': 41054, 'end_date': '2016-11-30T00:00:00.000'},
 {'total_receipts': 51707, 'end_date': '2016-07-31T00:00:00.000'},
 {'total_receipts': 39627, 'end_date': '2017-07-31T00:00:00.000'},
 {'total_receipts': 48239, 'end_date': '2016-05-31T00:00:00.000'},
 {'total_receipts': 45590, 'end_date': '2016-09-30T00:00:00.000'},
 {'total_receipts': 49965, 'end_date': '2017-01-31T00:00:00.000'},
 {'total_receipts': 0, 'end_date': '2015-08-31T00:00:00.000'},


In [28]:
len(formatted_revenues)

25

In [29]:
type(formatted_revenues[0]['total_receipts'])

int

In [30]:
type(formatted_revenues[0]['end_date'])

str
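
As the `type` check shows, `end_date` is still a string at this point; only `total_receipts` was converted.  When we do need real dates, for example to sort the receipts chronologically, we can parse the strings with `strptime`.  Here is a minimal sketch using the first few rows from the output above as `sample_revenues`.  The names `parsed` and `by_date` are just for illustration.

```python
from datetime import datetime

sample_revenues = [
    {'total_receipts': 56182, 'end_date': '2016-12-31T00:00:00.000'},
    {'total_receipts': 9400, 'end_date': '2017-08-31T00:00:00.000'},
    {'total_receipts': 50574, 'end_date': '2016-06-30T00:00:00.000'},
]

date_format = '%Y-%m-%dT%H:%M:%S.%f'

# Build new dictionaries where end_date is a datetime object.
parsed = [
    {'total_receipts': r['total_receipts'],
     'end_date': datetime.strptime(r['end_date'], date_format)}
    for r in sample_revenues
]

# Sort from the earliest end date to the latest.
by_date = sorted(parsed, key=lambda r: r['end_date'])
print([r['end_date'].strftime('%Y-%m') for r in by_date])
```

Keep in mind that `json.dump` cannot serialize `datetime` objects directly, so keeping the dates as strings is convenient if we plan to write the data to a JSON file.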

### 5. Store our data (Load)

Once we have our data in a good format, let's store it so that we can use it in some future research.  The following code is slightly confusing, but it's also freely available on the Internet.  So let's use it to write our data to a file.

In [31]:
import json
with open('maxs_revenues.json', 'w') as filehandle:  
    json.dump(formatted_revenues, filehandle)

In [32]:
filehandle

<_io.TextIOWrapper name='maxs_revenues.json' mode='w' encoding='cp949'>

We can easily check that we stored this data correctly by attempting to read it back.

In [33]:
with open('maxs_revenues.json') as json_file:  
    pulled_revenues = json.load(json_file)

In [34]:
pulled_revenues

[{'total_receipts': 56182, 'end_date': '2016-12-31T00:00:00.000'},
 {'total_receipts': 9400, 'end_date': '2017-08-31T00:00:00.000'},
 {'total_receipts': 50574, 'end_date': '2016-06-30T00:00:00.000'},
 {'total_receipts': 50305, 'end_date': '2016-08-31T00:00:00.000'},
 {'total_receipts': 66609, 'end_date': '2015-09-30T00:00:00.000'},
 {'total_receipts': 39535, 'end_date': '2017-06-30T00:00:00.000'},
 {'total_receipts': 43094, 'end_date': '2017-02-28T00:00:00.000'},
 {'total_receipts': 34903, 'end_date': '2017-05-31T00:00:00.000'},
 {'total_receipts': 41054, 'end_date': '2016-11-30T00:00:00.000'},
 {'total_receipts': 51707, 'end_date': '2016-07-31T00:00:00.000'},
 {'total_receipts': 39627, 'end_date': '2017-07-31T00:00:00.000'},
 {'total_receipts': 48239, 'end_date': '2016-05-31T00:00:00.000'},
 {'total_receipts': 45590, 'end_date': '2016-09-30T00:00:00.000'},
 {'total_receipts': 49965, 'end_date': '2017-01-31T00:00:00.000'},
 {'total_receipts': 0, 'end_date': '2015-08-31T00:00:00.000'},


In [35]:
len(pulled_revenues)

25

In [36]:
pulled_revenues[0:2]

[{'total_receipts': 56182, 'end_date': '2016-12-31T00:00:00.000'},
 {'total_receipts': 9400, 'end_date': '2017-08-31T00:00:00.000'}]

Ok, looks good!
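
Rather than comparing the printed lists by eye, we could also let Python check the round trip for us.  The sketch below writes a small stand-in list to a temporary directory (so it does not touch `maxs_revenues.json`), reads it back, and compares the two.  Passing `encoding='utf-8'` is our own choice here; without it, Python uses the platform default, which is why the file handle above reports `cp949`.

```python
import json
import os
import tempfile

sample_revenues = [
    {'total_receipts': 56182, 'end_date': '2016-12-31T00:00:00.000'},
    {'total_receipts': 9400, 'end_date': '2017-08-31T00:00:00.000'},
]

# Write to a temporary directory so the sketch does not overwrite real files.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sample_revenues.json')
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(sample_revenues, f)
    with open(path, encoding='utf-8') as f:
        reloaded = json.load(f)

# Integers and strings survive the JSON round trip unchanged.
print(reloaded == sample_revenues)
```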

### Summary

Ok congrats! We've now gone through the entire process of *extracting* our data from an API, *transforming* that data into a format that we want and then *loading data* into a file.  The trickiest part here is the transforming component.

After first understanding our data by looking at what we retrieved from the API, we tried to reduce this data.  First we reduced the number of entries, as we only wanted data from one restaurant location.  Then we reduced the information in each entry by looking at all of the information included in the dictionary with the `keys` method and then looping through our data to select only the key-value pairs that we wanted.  Finally, we coerced our data into a datatype and format that would be easiest for us to work with later on.

This process is very useful because we will likely want to access our data many times, and we don't want to have to clean that data each time we do.  Instead, let's just clean it once, and then we can access that cleaned-up, easier-to-work-with data as much as we want.
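
To make the cleaning repeatable, the reduce and coerce steps from this lesson could be gathered into a single function.  This is only a sketch under a few assumptions: `clean_receipts` is a hypothetical name, `sample_raw` is a two-row stand-in for the API results, and, as in the lesson, the end date is kept as a string.

```python
def clean_receipts(raw_receipts, address):
    """Reduce and coerce raw receipt dictionaries for one location."""
    cleaned = []
    for receipt in raw_receipts:
        # Reduce: skip receipts from other locations.
        if receipt['location_address'] != address:
            continue
        # Reduce and coerce: keep two fields, with receipts as an integer.
        cleaned.append({
            'total_receipts': int(receipt['total_receipts']),
            'end_date': receipt['obligation_end_date_yyyymmdd'],
        })
    return cleaned

# Two-row stand-in for the list returned by the API.
sample_raw = [
    {'location_address': '207 SAN JACINTO BLVD STE 200',
     'obligation_end_date_yyyymmdd': '2016-09-30T00:00:00.000',
     'total_receipts': '100368'},
    {'location_address': '3600 MCKINNEY AVE STE 100',
     'obligation_end_date_yyyymmdd': '2016-12-31T00:00:00.000',
     'total_receipts': '56182'},
]

print(clean_receipts(sample_raw, '3600 MCKINNEY AVE STE 100'))
```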