Copyright Scott Jensen, San Jose State University

<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" property="dct:title">This notebook</span> by <span xmlns:cc="http://creativecommons.org/ns#" property="cc:attributionName">Scott Jensen,Ph.D.</span> is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.

# Working With Files Part 1: Loading The Yelp Data

Although the Yelp dataset is not "Big" by commercial standards, for an academic dataset it's large: unzipped, it's approximately 10GB. Fortunately, Spark can work directly with compressed data in certain formats, and we will be loading data compressed using the bzip2 format (WinZip files will not work - don't try it).

Some of the data files, particularly the reviews and the user data files, are nearly 2GB even when compressed, so loading them from a home Internet connection is not possible for many students (keep in mind that if your ISP is a cable company, data download speeds are usually much faster than data upload speeds, and you would need to do both).  If you are curious about your Internet speed, see the <a href="https://www.att.com/support/speedtest/" target="_blank">AT&T speedtest</a> (there's also a link in Canvas) - you would have roughly 900 Mbps both directions when using a wired Ethernet (not Wi-Fi) connection on campus.

For this reason, we staged the zipped data on Amazon's S3 storage service and we will use the code in this notebook to load the data directly to your Databricks account (which is also on AWS's servers).  The code is designed so that if you re-run this code and the data is already in your account, it will not try to reload it.  This is also why you completed the Yelp dataset agreement exercise where you submitted the request to Yelp and agreed to be bound by their license.

We will walk through this notebook in class. Since the review and user files are rather large  (almost 2GB each when compressed), you will also need to run the notebook named `Building Review and User Tables` to create tables for the review and user data files.

### Step 1: Spinning up a cluster
To be able to run any cells in your notebook, you will need to be attached to a cluster.  If you completed the exercise in the *Intro to Jupyter and Python* video lecture, this is the same process.  From the toolbar on the
left-hand side, click on the `Compute` icon (it looks like a cloud and used to be named Clusters).  You won't have any clusters to start with, so click on the `+Create Cluster` button.
The defaults in that screen are fine, but you will need to enter a name for your cluster.  Once your cluster's status is 
running (and has a solid green circle by it), come back to your notebook and from the drop-down list in the top-left corner, attach 
your notebook to the cluster listed with the solid green circle.

### Step 2: Using a Databricks widget to enter the path to the data manifest

Since some of the data files are too large to manually upload even when compressed, you will be importing the data from where we temporarily staged it on AWS, but **you *MUST* complete the dataset agreement assignment in order to earn credit for *ANY* of the exercises or the team assignments which use the Yelp data**.  To bring the data from where we staged it on AWS to your Databricks account hosted on AWS (for free by Databricks! Yay!), the code below needs to know where it can download the data from.

When you run the next cell, a "widget" will appear **at the top of the notebook** that prompts you for the path to a data manifest.  
On the right-hand side of that widget bar you will see a pushpin (a.k.a. thumbtack) icon that will allow you to pin that to the top of the code window even when you scroll down.  Once you have entered the manifest's path, you can "unpin" the widget bar to make more screen real estate available.

We will be importing three files to your Databricks account, and eventually we will put the compressed files they contain in a directory named `/yelp` on the Databricks File System (DBFS) on your Databricks account. The manifest uses a JSON format and contains a JSON array with an object for each file to download.  Each JSON object provides three properties: the name of the file, an MD5 sum for the file, and a flag as to whether the file should be unzipped. The MD5 sum is a one-way hash that allows us to make sure the file download did not encounter any errors and is exactly the same as the file we originally staged on AWS.  The unzip flag is needed because, although we compress the data files using the bzip2 format, the review and user data files have been split by year (the year of the review or the year a user joined Yelp), so we zipped up the directories containing those files.  We need to unzip those directories before moving the files to the `/yelp` directory on DBFS. You may be wondering why a manifest file is used: it allows us to easily move or update the files, and we only need to provide you with the URL where you can find the current manifest.
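
As a rough sketch of the manifest structure, the following parses a made-up manifest with Python's `json` module. The property names (`id`, `data_list`, `download_list`, `name`, `md5`, `zipped`) are the ones the download code in Step 5 reads, but the bucket id, file names, and MD5 values here are placeholders and not the contents of the real manifest.

```python
import json

# Hypothetical manifest text: the bucket id, file names, and MD5 values are placeholders
manifest_text = '''
{
  "id": "example-bucket",
  "data_list": ["business.bz2", "review"],
  "download_list": [
    {"name": "business.bz2", "md5": "0123456789abcdef0123456789abcdef", "zipped": "false"},
    {"name": "review.zip", "md5": "fedcba9876543210fedcba9876543210", "zipped": "true"}
  ]
}
'''

# json.loads turns the JSON text into Python dictionaries and lists
manifest = json.loads(manifest_text)

# Each object in the download list describes one file to download
for item in manifest["download_list"]:
    action = "unzip, then move" if item["zipped"] == "true" else "move as-is"
    print(item["name"], "->", action)

# Collect just the names of the downloads that need to be unzipped
names_to_unzip = [item["name"] for item in manifest["download_list"] if item["zipped"] == "true"]
```

Because the manifest is ordinary JSON, the same loop works no matter how many files are listed.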

To see the widget (if you just imported the notebook), click on the arrow in the upper 
right-hand corner of the next code cell and select `Run Cell`.

#### The following cell is Step 2

In [0]:
dbutils.widgets.text("manifest_url","xxxxxxxxxxxxxxx","Enter the manifest URL:")

#### Step 3: Entering the file manifest URL

In the input box for the widget with the prompt "Enter the manifest URL", enter the URL we provide in class.

Once you have entered the URL in the widget, we can get started.

### Step 4: Creating a directory for your data files using the DBFS utilities

You are going to be running code to bring your data over from our bucket on Amazon's S3 to your account on Databricks.  As discussed previously, your data is stored in the Databricks File System, which is referred to as DBFS.  To be able to store your data in DBFS, you first need to create a directory in which to store the data. We are going to name that directory "yelp" (but without the quotation marks).

The commands for working with DBFS are very similar to Linux, but if you have not used Linux, don't panic, the number of commands we will be using can be counted on one hand (unless you are ET - he only had 2 fingers, but we are assuming you are an Earthling).  If you are a Mac user and have played around at the command line on your laptop, these commands will look familiar since the operating system for a Mac is a variant of Unix.  All of the commands for working with DBFS are in the dbutils library (which Databricks has conveniently already installed in your account when you spun up a cluster).  Although we will be using only a few commands, you don't need to memorize them, just run the following command in a blank cell and it will show you all of the available commands:

`dbutils.fs.help()`

That will display each file system method along with the syntax and a brief description.

The `dbutils` library contains Databricks utilities other than those for working with files, so we are calling the file system, or `fs`, set of methods within `dbutils`.  

The `dbutils.fs` methods are divided into file system utilities (fsutils) and mount methods.  You won't be mounting other drives, so you will only be using the fsutils methods. Run the next cell to see the syntax for the file commands.

In [0]:
dbutils.fs.help()

#### Step 4 continued ...
If you read the list of file utility methods, you will see that the method for creating a directory is named `mkdirs` and takes one parameter, a string with the path and name of the directory we want to create.  Since we are passing a string parameter, we enclose the text in quotes.  These can be single or double-quotes, but they need to match.

You may be wondering why the path starts with a slash.  If you are a Mac user familiar with the command line, this would seem natural, but if you are a PC user it may not.
The slash indicates the root.  In Windows, you specify a path by saying the drive letter, but in Linux (and on the Mac) there are no drive letters - drives are mounted as paths.  In DBFS, as in Linux, on the Mac, and for paths in URLs on the Internet, 
the separators for directories or folders on a path are forward slashes instead of the backslashes used in Windows.
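
To make the difference in path styles concrete, here is a small sketch using Python's standard `pathlib` module. It only inspects the text of the paths, so it runs anywhere and does not touch DBFS; the `C:` drive in the Windows example is just an illustration.

```python
from pathlib import PurePosixPath, PureWindowsPath

# A Linux/Mac/DBFS-style path: starts at the root "/" and uses forward slashes
posix_path = PurePosixPath("/yelp/review")
print(posix_path.parts)    # the root, followed by each directory on the path
print(posix_path.parent)   # the directory that contains "review"

# A Windows-style path: starts with a drive letter and uses backslashes
# (the C: drive here is only an example)
windows_path = PureWindowsPath(r"C:\yelp\review")
print(windows_path.drive)
print(windows_path.parts)
```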

Run the next cell to create your yelp directory.

In [0]:
dbutils.fs.mkdirs("/yelp")

#### **<span style="color:#22b922">Riddle me this</span>**: Why does the output have the value `True`?  

The `mkdirs` method has a Boolean return value, so instead we could have run the following code:

`result = dbutils.fs.mkdirs("/yelp")`

That would have created a new variable named `result` and assigned the value returned by `mkdirs` to that variable.  The value assigned to `result` would have been the Boolean value `True` (assuming the directory was created)
and then we could have printed the value of `result` with the following Python code:

`print(result)`

In the code we actually ran, we did not assign the value returned by the mkdirs method to a variable, so Jupyter printed out the value returned by the method call.

**PLEASE NOTE:** Although Jupyter printed out the returned value of our call to `mkdirs`, this is only because it's the last line in our code cell.  If we had additional code in that cell that wrote out or generated other values, the `True` returned by the call to `mkdirs` would not be shown.
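
The difference between capturing a returned value and letting the notebook display it can be sketched as follows. The function `pretend_mkdirs` is a hypothetical stand-in that simply returns `True` so the example can run outside Databricks; it is not part of `dbutils`.

```python
def pretend_mkdirs(path):
    # Hypothetical stand-in for dbutils.fs.mkdirs: it creates nothing and
    # simply returns True, so this sketch can run outside Databricks
    return True

# Capture the returned value in a variable and print it explicitly.
# print() shows the value no matter where the line appears in a cell.
result = pretend_mkdirs("/yelp")
print(result)

# Called without an assignment, the returned value is only displayed by the
# notebook when this call is the last line of the cell
pretend_mkdirs("/yelp")
```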

### Step 5: Importing the data

The code in the following cell will import the data files to your yelp directory.  While that cell is running (**it will probably take around 5 minutes**), let's talk about what is happening in that cell.
<ol>
  <li style="padding-bottom:5px;">First, we are importing the library named `requests` which allows us to retrieve data over the web.  This is not built into the core of Python, so it's a separate library.  However, since it's one that's commonly used, Databricks has it in your cluster already.  If this was a less commonly used library, you would need to first load the library on your cluster.</li>
  
  <li style="padding-bottom:5px;">We define a few variables, such as the part of the URL that's the same for all files located on AWS, where we will be downloading the data from.</li>

  <li style="padding-bottom:5px;">To make the process easier to understand, we break it into steps that can be defined as separate functions.  Each function is defined starting with `def` and the indenting in Python tells it when each function has ended.  The main code near the end of the cell is less than 10 lines long and calls the functions defined above.</li>

  <li style="padding-bottom:5px;">One of the first functions called is `get_manifest`, which reads the URL you entered in the widget at the top of the notebook and tries to download the manifest from that URL.  The manifest says which files are being downloaded.  It is a tiny JSON file that we store at a public URL, and it is a JSON object containing a name:value pair with the id for the bucket where the data is stored, and two name:value pairs with arrays as the values:
    <ul>
      <li style="padding-bottom:3px;">An array of the files or directories that will be created in the directory where we download the data.</li>
      <li style="padding-bottom:3px;">The second array contains a JSON object for each download file with:
        <ol>
          <li>The name of the file being downloaded</li>
          <li>The MD5 sum of the file. The MD5 sum is a hash of the file contents that returns a string - this is used to make sure the file downloaded is complete.</li>
          <li>A flag indicating if the downloaded file should be unzipped.  The review and user downloads are zipped files containing a directory of files.</li>
        </ol>
      </li>
    </ul>
  </li>
  <li style="padding-bottom:5px;">The code cycles through downloading and unzipping the files (as needed), and in our case the user and review files contain a directory of files compressed using the bzip2 format which Spark can load.  As each file is downloaded, the MD5 sum of the file is calculated and compared to the MD5 sum in the manifest to make sure the download is complete and data is not missing.</li>
</ol>  
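
The MD5 check described in the list above can be sketched in plain Python. This is a minimal illustration rather than the exact code in the next cell: it hashes a small temporary file in blocks, and the helper name `md5_of_file` and the sample data are made up for the example.

```python
import hashlib
import os
import tempfile

def md5_of_file(path, block_size=1024 * 1024):
    """Return the MD5 hex digest of a file, reading it one block at a time."""
    md5 = hashlib.md5()
    with open(path, "rb") as data_file:
        # iter() keeps calling read() until it returns an empty bytes object
        for block in iter(lambda: data_file.read(block_size), b""):
            md5.update(block)
    return md5.hexdigest()

# Write a small sample file so the sketch is runnable anywhere
sample_bytes = b"Emma,F,19738\n" * 1000
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(sample_bytes)
    sample_path = tmp.name

# This value plays the role of the MD5 sum stored in the manifest
expected_md5 = hashlib.md5(sample_bytes).hexdigest()
actual_md5 = md5_of_file(sample_path)
print("match" if actual_md5 == expected_md5 else "MISMATCH - the file may be corrupted")
os.remove(sample_path)
```

Reading in blocks means a file of nearly 2GB never has to fit in memory all at once while it is being hashed.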

When downloading the data, we cannot work directly in DBFS, so we download and unzip the data using the driver node and then move the data to DBFS.  Keep in mind that the location where we download the files
on the file system local to the driver node of the cluster disappears when the cluster terminates, but the files we move to DBFS are permanent and will be there again when you start a new cluster.

In [0]:
import pyspark.sql.functions as f
import requests
import re
import tarfile
import zipfile
import json
import hashlib

FORCE_DOWNLOAD = False

URL_HOST = ".s3-us-west-2.amazonaws.com/"
TEMP_DIR = "/yelptemp/"
ZIPPED_DIR = "/yelptemp/zipped/"
UNZIPPED_DIR = "/yelptemp/unzipped/"
DATA_DIR = "/yelp"

def clean_all():
  ' Removes the data directory if it exists on DBFS'
  try:
    dbutils.fs.rm(DATA_DIR,recurse=True)
  except Exception:
    pass
  

def get_manifest():
  ''' Returns the dictionary that's the download manifest based on the URL
      entered in the URL widget.
      If it's not a valid URL or returns a status code other than 200, an exception is raised.
      If the manifest is not valid JSON, or does not contain name:value pairs named
      id, data_list, and download_list, an exception is raised.  
  ''' 
  manifest_url = dbutils.widgets.get("manifest_url")
  response = requests.get(manifest_url)
  if response.status_code != 200:
    raise Exception(f"The manifest URL {manifest_url} returned a status of {response.status_code}")
  manifest = response.text
  try:
    manifest_dict = json.loads(manifest)
    # The manifest should have a data_list element and a download_list element.
    # Note: `"x" in d == False` is a chained comparison that never evaluates to True,
    # so we use `not in` to test for a missing key.
    if "data_list" not in manifest_dict:
      raise Exception("The manifest does not contain a data_list.")
    if "download_list" not in manifest_dict:
      raise Exception("The manifest does not contain a download_list.")
    return(manifest_dict)
  except json.JSONDecodeError as err:
    raise Exception("The manifest is not a valid JSON document.", err)
    

def check_data(manifest):
  ''' Function used to check if the data directory contains
      all of the files in the download. The manifest dictionary
      is passed as a parameter and is expected to contain a data_list containing
      the names of each file expected in the data directory.
      If a file is missing, this method returns False.
      If all of the files exist, it returns True.
  '''
  try:
    existing_list = dbutils.fs.ls(DATA_DIR)
    if len(existing_list) == 0:
      return(False)
    # dbutils.fs.ls lists directory names with a trailing slash,
    # so strip any trailing slash before comparing names
    existing_names = {info.name.rstrip("/") for info in existing_list}
    for file_name in manifest["data_list"]:
      if file_name.rstrip("/") not in existing_names:
        return(False)
    # looped through all of the required files and they are there
    return(True)
  except Exception:
    # The directory does not exist or the manifest does not contain a data_list
    return(False)
  
def get_bucket_id(manifest):
  ''' The manifest is expected to contain a name:value pair named id
      where the value is the bucket name on S3 where the files are
      staged.  If the id is missing or is a blank string, then an
      exception is raised, otherwise the bucket id is returned.
  '''
  try:
    bucket = manifest['id'].strip()
    if len(bucket) == 0:
      raise Exception("The id provided in the manifest was an empty string, but should be the name of the bucket being downloaded from.")
    else:
      return(bucket)
  except Exception as e:
    raise Exception("An error occurred in retrieving the bucket id from the manifest", e)
      
  
def download_file(manifest_item, bucket_id):
  ''' Given a dictionary from the download list, download the file to the
      temporary directory for downloading the file and check the
      MD5 sum to make sure it matches.
      If the MD5 sum does not match, an exception is raised, otherwise it prints
      that the file was successfully downloaded.
  '''
  file_name = manifest_item["name"]
  item_md5sum = manifest_item["md5"]
  request_url = "https://" + bucket_id + URL_HOST + file_name
  local_name = ZIPPED_DIR + file_name 
  print("requesting file from:", request_url)
  r = requests.get(request_url, stream=True)
  status_code = r.status_code
  # If the status code is 200, then we successfully retrieved the file
  if status_code != 200:
    raise Exception(f"The {file_name} download failed. A status code of {str(status_code)} was returned from the URL:{request_url}.")
  else: # write the file (the with statement closes the file for us)
    with open(local_name, 'wb') as file:
      for chunk in r.iter_content(chunk_size=4096):
        file.write(chunk)
  # check if the hash of the file downloaded matches the md5 sum in the manifest
  # (the hash is built up in blocks so a large file is never read into memory all at once)
  md5 = hashlib.md5()
  with open(local_name, 'rb') as data_file:
    for block in iter(lambda: data_file.read(1024 * 1024), b""):
      md5.update(block)
  md5sum = md5.hexdigest()
  if md5sum.lower() != item_md5sum.lower():
    raise Exception(f"The file {file_name} downloaded from {request_url} generated a MD5 sum of {md5sum} instead of the MD5 sum in the manifest ({item_md5sum}) so it may be corrupted and the processing was terminated.")
  else:
    print ("successfully downloaded:", file_name)

      
def process_file(manifest_item):
    ''' The file is now downloaded.  If the file is zipped,
        it first needs to be unzipped, and either way, moved
        to the DBFS data directory.
    '''
    local_name = ZIPPED_DIR + manifest_item["name"]
    local_path = "file:" + local_name
    # This is either True or False; str() lets the flag be either the text "true" or a JSON boolean
    is_zipped = str(manifest_item["zipped"]).lower() == "true"
    if is_zipped:
      with zipfile.ZipFile(local_name,"r") as zip_ref:
        zip_ref.extractall(UNZIPPED_DIR)
      untar_info = dbutils.fs.ls("file:" + UNZIPPED_DIR)
      # The zip file could contain a directory, a file, or more than 1 file,
      # so we loop through the file list, moving all of them to DBFS
      for info in untar_info:
        destination = DATA_DIR + "/" + info.name
        dbutils.fs.mv(info.path, destination, recurse=True)  
      dbutils.fs.rm(local_path)
    else: # file was not zipped (or should remain zipped), so just move it
        destination = DATA_DIR + "/" + manifest_item["name"]
        dbutils.fs.mv(local_path, destination)  
    print ("processed:", local_name)
    
                      
def load_data(manifest_list, bucket_id):
  ''' Loops through the files in the download list from the manifest and 
      downloads the file, verifies the MD5 sum is correct, unzips it if needed,  
      and moves the file or folder that was in it to the data directory.'''
  # Create the empty temporary directories
  try:
    dbutils.fs.rm("file:" + TEMP_DIR,recurse=True)
  except Exception:
    pass
  # Create the temporary local directory and sub-directories
  dbutils.fs.mkdirs("file:" + TEMP_DIR)
  dbutils.fs.mkdirs("file:" + ZIPPED_DIR)
  dbutils.fs.mkdirs("file:" + UNZIPPED_DIR)
  # Loop through the files to download
  for item in manifest_list:
    download_file(item, bucket_id)
    process_file(item)
  # Remove the temp directory used to unzip the files
  dbutils.fs.rm("file:" + TEMP_DIR, recurse=True)
  
  
# *******************************************  
# Run the Actual Routine to Load the Data
# This code uses the above defined functions
# *******************************************
if FORCE_DOWNLOAD == True:
  clean_all()
manifest_dict = get_manifest()
if check_data(manifest_dict) == False:
  bucket_id = get_bucket_id(manifest_dict)
  download_list = manifest_dict["download_list"]
  load_data(download_list, bucket_id)
else:
  print("All of the required files exist in the data directory already, so the download was not processed.")
print("Done")
  

### Step 6: Listing the Files Loaded

In the above cell that brought the data over to your Databricks account, we use the dbutils method to list the file contents of the directory.  Here you are going to use that utility again, but since it will be the only line in your cell (so it's the last line), you don't need to assign the returned value to a variable; the result will be printed as the output of the cell (like the `True` value returned when you created a directory).

In the following cell, add a line of code to list the `/yelp` directory.  You should see three files and two directories listed.

If we wanted to *use* the list returned by the `ls` method, we could assign it to a variable name.

In [0]:
# Add your code here and run it to list the contents of your /yelp directory on DBFS


### Step 7: Listing the Review and User Subdirectories
In the above cell when you listed the contents of the `/yelp` directory on DBFS, both `review` and `user` had a size of zero and ended with a slash, indicating they were directories and not files.  Copy the code you used in the above cell to list the `/yelp` directory (copy it twice actually), and modify the path passed as a parameter so that it first lists the contents of the `review` directory and then the contents of the `user` directory.

In [0]:
# Add code here and run it to list the contents of the /yelp/review directory on DBFS


In [0]:
# Add code here and run it to list the contents of the /yelp/user directory on DBFS


# Working With Files - Part 2: Getting the Category Definitions

Above you loaded the Yelp data as zipped files in the bzip2 format to save space.  In the following cell we will take a slightly different approach using the `urllib` module to read a JSON data file from a page on Yelp's website.  

In the dataset's business file, most businesses have a `categories` field which is a comma-separated list of the categories in which a business operates.  Some categories are at a high level (such as "Restaurants"), but the categories form a hierarchy with increasing levels of detail, so there are more specific categories too, such as "Dim Sum" which is within the "Chinese" category, which in turn is within the "Restaurants" category.  There are over 1500 categories (and growing).  As part of their "Fusion API", Yelp makes this list available to web developers who are creating apps that use Yelp data (and drive traffic to Yelp).  The page documenting the controlled vocabulary for categories can be found at the following website: <a href="https://www.yelp.com/developers/documentation/v3/all_category_list" target="_blank">https://www.yelp.com/developers/documentation/v3/all_category_list</a>.

On that site there's a link to a JSON file defining the hierarchy for this controlled vocabulary.  Although there are a lot of categories, as a JSON file this file is tiny compared to the Yelp data, so we don't need to compress it.
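
To illustrate how a hierarchy like this can be walked, here is a minimal sketch using three hand-written entries. The field names (`alias`, `title`, `parents`) are an assumption about how the category file is laid out, and the sample entries are written for this example rather than copied from Yelp's file.

```python
import json

# Hypothetical sample: the field names (alias, title, parents) are an assumption
# about the file's layout, and these three entries are written by hand
sample_json = '''
[
  {"alias": "restaurants", "title": "Restaurants", "parents": []},
  {"alias": "chinese", "title": "Chinese", "parents": ["restaurants"]},
  {"alias": "dimsum", "title": "Dim Sum", "parents": ["chinese"]}
]
'''

# Index the entries by alias so each parent can be looked up directly
categories = {entry["alias"]: entry for entry in json.loads(sample_json)}

def path_to_top(alias):
    """Follow the first parent of each category until a top-level category is reached."""
    titles = []
    while alias is not None:
        entry = categories[alias]
        titles.append(entry["title"])
        alias = entry["parents"][0] if entry["parents"] else None
    return titles

dim_sum_path = path_to_top("dimsum")
print(" < ".join(dim_sum_path))
```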

### Step 1: Run the code in the following cell to load the data (or see below for an alternate approach)

In [0]:
import urllib.request
import json

TEMP_DIR = "/bus4118d/"
DBFS_DIR = "/yelp/"
CAT_FILENAME = "categories.json"
CAT_URL = "https://www.yelp.com/developers/documentation/v3/all_category_list/categories.json"

# create the temp directory for downloading and the DBFS directory if they don't exist
dbutils.fs.mkdirs("file:"+TEMP_DIR)
dbutils.fs.mkdirs("dbfs:"+DBFS_DIR)
# If the file already exists, we will remove it
local_name = "file:"+TEMP_DIR+CAT_FILENAME
dbutils.fs.rm(local_name)
dbfs_name = "dbfs:"+DBFS_DIR+CAT_FILENAME
dbutils.fs.rm(dbfs_name)

# get the file and store it in the temp directory
# Note that with the request, the destination 
# directory is inherently on the local driver
urllib.request.urlretrieve(CAT_URL, TEMP_DIR+CAT_FILENAME)
# move the file to DBFS
dbutils.fs.mv(local_name, dbfs_name)
dbutils.fs.ls(DBFS_DIR)
# Create and show a DataFrame
df_categories = spark.read.json(DBFS_DIR+CAT_FILENAME, multiLine=True)
print("categories:",df_categories.count())
df_categories.show(50, truncate=False)

### Category Definitions - Alternate Approach
The categories.json file is a small file, so we could also load it through the GUI interface, and that's what we will do in this alternate approach.

To load the categories.json through the GUI, see the lecture slides for the class.  The slides walk through the following steps:
1. Download the categories.zip file from the Canvas module for this week and unzip the file.  You should now have a file named categories.json
2. Upload the file through the GUI using the Data option (click the Create Table button, then load the file, but don't create a table)

You will now have a file on the path: /FileStore/tables/categories.json

You want to move that file to the path: /yelp/categories.json

#### Step 2a: Use the `mv` method from file system methods in dbutils to move the file

In [0]:
# Add your code to move the file in this cell


#### Step 2b: List the files in /yelp

You have now moved the categories.json file into the same directory where you loaded the Yelp data above in Part 1. In the following cell, list the files again and you will now see the Yelp data and the categories.json file.

In [0]:
# Add your code to list the directory in this cell


# Working with Files - Part 3: Finding Gender Data

Yelp wants to create the best user experience possible and show authentic reviews.  They have a proprietary 
algorithm for ranking the reviews they show users.  The average star rating for a business is part of it,
but not the whole story.  The number of reviews is part of it, but not the whole story.  
Since many users will not read more than a couple reviews, having a good algorithm when ordering the reviews to 
show them to a user is critical to Yelp's business.  They need to always be thinking of how to make the ranking better
in order to improve the customer experience.

What if men and women review differently?  Is a 4-star rating from a man the same as a 4-star rating from a woman?
In other words, might men or women consistently rate businesses higher or lower?  Would this depend on the type
of business?  Are ratings by one gender more consistent than the other?  If so, could we be more certain of the validity
of the ranking of a business based on 5 reviews by women than we would by 5 reviews by men (or vice versa)?

If there is a difference, should that be taken into account when Yelp ranks businesses based on their ratings?

### Using First Name as a Proxy for Gender

We have a problem.  We don't have information about the user's gender.  However, we are curious and persistent.  Is there a proxy
we could use?  A proxy is a stand-in or substitute for something else.  Could a user's first name be a proxy for their gender?  What issues would we have if we use name
as a proxy for gender?

First, we need some data to associate user names with genders.  We can start searching on the web.  Possibly lists of baby names?
If you were to search for a while, you would find the Social Security Administration's (SSA) website with the 1000 most popular baby names
for girls and boys (at least in the U.S.), but we want more than the most popular names, we want to tie as many names as possible to a gender.
If you dig a little further, you'll find the SSA page titled <a href="https://www.ssa.gov/oact/babynames/limits.html" target="_blank">Beyond the Top 1000 Names</a>.

Read that page - they have a zip file there with national data as to every first name used to apply for a social security account and a count
of the number of men and women applying with that name, based on their date of birth.  If you hover your mouse over that link, you will see that the file can be downloaded 
from the following URL:  <a href="https://www.ssa.gov/oact/babynames/names.zip" target="_blank">https://www.ssa.gov/oact/babynames/names.zip</a>

We could download that file and unzip it (and you may want to do that after class), but what the zip file contains is a file for each year-of-birth, so 
for those little girls and boys born in 2017 who applied for a social security card, the file is named `yob2017.txt` (a copy is in this week's module in Canvas) and it contains data in the following
format:

Emma,F,19738 <br/>
Isabella,F,15100 <br/>
Sophia,F,14831 <br/>
Mia,F,13437 <br/>
Liam,M,18728 <br/>
Logan,M,13974 <br/>
Benjamin,M,13733

So if you just had a baby boy or girl and named her Isabella or named him Liam because you thought it would be unique, apparently so did everybody else.

As you can see, the file has 3 columns, the name, the gender (M or F), and the number of people with that name who were born in 2017 and applied for a social security card.  You should also note there are no headers.
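
Since the file has no header row, any code that reads it has to supply the column names. The following is a small sketch in plain Python (not Spark) that parses a few of the sample rows shown above; the column names `name`, `gender`, and `count` are chosen for this example.

```python
import csv
import io

# A few of the sample rows shown above, in the same layout as yob2017.txt (no header row)
sample_text = "Emma,F,19738\nIsabella,F,15100\nLiam,M,18728\n"

# Since the file has no headers, we supply the column names ourselves
# (name, gender, and count are names chosen for this sketch)
reader = csv.DictReader(io.StringIO(sample_text), fieldnames=["name", "gender", "count"])

# Every value is read as text, so convert the count column to an integer
rows = [{**row, "count": int(row["count"])} for row in reader]

for row in rows:
    print(row)

# With numeric counts we can aggregate, for example the total for the girls' names
total_girls = sum(row["count"] for row in rows if row["gender"] == "F")
```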

It does not say what year the data is from, but the year is in the name of each file, so in a later exercise we will do an *enriching transformation* to insert
that metadata into a new column in the DataFrame.
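
As a preview of the idea (in plain Python rather than as a Spark transformation), the year can be pulled out of a file name with a regular expression. The helper name `year_of_birth` is made up for this sketch.

```python
import re

def year_of_birth(file_name):
    """Return the four-digit year embedded in a name like yob2017.txt, or None."""
    match = re.search(r"yob(\d{4})\.txt$", file_name)
    return int(match.group(1)) if match else None

# Works on a bare file name or on a full path that ends with the file name
year = year_of_birth("dbfs:/ssa/data/yob2017.txt")
print(year)
```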

### Step 1: Downloading the data

We are again going to use the `urllib` module. As before with the Yelp categories data, for the download, all of the necessary code is already included below.

Using `dbutils.fs`, we will create a new directory named `ssa` where we are going to download the zip file from the SSA: `names.zip`.  However, we use the `file:` prefix for the path to say we want to create the directory local to the driver node for our cluster.  We need to do this because we cannot treat DBFS as a local file system.  Later we will move the zipped and unzipped data to DBFS.

In [0]:
import requests
import urllib.request

SSA_URL = "https://www.ssa.gov/oact/babynames/names.zip"
SSA_DIR ="/ssa/"
SSA_FILENAME = "names.zip"

# If the ssa directory exists, remove it
dbutils.fs.rm("file:"+SSA_DIR, recurse=True)
dbutils.fs.mkdirs("file:"+SSA_DIR)

urllib.request.urlretrieve(SSA_URL, SSA_DIR+SSA_FILENAME)

###  Step 2: Check if your file was written

Did the file end up in our `ssa` directory?  In the cell below, add code that uses the `ls` method from  `dbutils.fs` to list 
the contents of the `file:/ssa` directory which is a directory local to the driver node and not DBFS.

In [0]:
# Add your code here to list the /ssa directory (and run the code)


### Step 3: Unzipping the data
When we loaded our Yelp data we loaded bzip2 files because Spark does not read zip files, but here we
are still using Python (not Spark), to unzip the file, so that's not a problem.

We will need to import another Python module that has the functions to unzip a file, and we will
use the Databricks `dbutils` functions to create a subdirectory we are going to unzip the data files into. 
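
If you have not used Python's `zipfile` module before, the following self-contained sketch shows the same pattern on a tiny zip file that it builds itself. The file name `sample.zip` and its contents are stand-ins for this example, and everything happens in a temporary directory.

```python
import os
import tempfile
import zipfile

# Work in a throwaway directory so nothing existing is overwritten
work_dir = tempfile.mkdtemp()
zip_path = os.path.join(work_dir, "sample.zip")   # hypothetical stand-in for names.zip
extract_dir = os.path.join(work_dir, "data")

# Build a tiny zip file containing one text file
with zipfile.ZipFile(zip_path, "w") as zip_out:
    zip_out.writestr("yob2017.txt", "Emma,F,19738\nLiam,M,18728\n")

# List what the zip contains, then extract everything into the subdirectory
with zipfile.ZipFile(zip_path, "r") as zip_ref:
    member_names = zip_ref.namelist()
    zip_ref.extractall(extract_dir)

extracted_files = sorted(os.listdir(extract_dir))
print(member_names, extracted_files)
```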

Keep in mind that when we are running the zip commands, we are just using Python and not Spark.  Why does that matter?  The paths we provide
are not pointing to paths on DBFS (since the plain old Python does not "know" about DBFS).  When we use the path "/ssa", the zip command assumes
we are talking about the `ssa` directory off the root of the driver node.  If we use the path "/ssa" in a dbutils command, it assumes we are
referring to a DBFS path since we did not prefix the path with `file:`.
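
The prefix rule described above can be summarized in a few lines of Python. The function `where_dbutils_looks` is purely an illustration written for this notebook; it is not part of `dbutils`, and it only covers the path forms used here.

```python
def where_dbutils_looks(path):
    """Describe where a dbutils.fs command would look for a path,
    following the rule described in the text above.
    This is an illustration only; it is not part of dbutils."""
    if path.startswith("file:"):
        return "driver node (local file system)"
    # A dbfs: prefix, or no prefix at all, means DBFS
    return "DBFS"

for example in ["file:/ssa/", "dbfs:/ssa/", "/ssa/"]:
    print(example, "->", where_dbutils_looks(example))
```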

Once we unzip the data, we must move the directory to DBFS.  While data on DBFS will remain when our cluster shuts down, the files 
local to the driver will not exist once the cluster is shut down.

**After running the following cell, in Step 4 be sure to use the `ls` method to see the listing of all of the SSA files for each year**

In [0]:
import zipfile

SSA_SUBDIR = "data"
ssaDataDir = "file:" + SSA_DIR + SSA_SUBDIR
dbutils.fs.mkdirs(ssaDataDir) 

ssaNamesZip = SSA_DIR + SSA_FILENAME

with zipfile.ZipFile(ssaNamesZip,"r") as zip_ref:
    zip_ref.extractall(SSA_DIR + SSA_SUBDIR)
    
# Move the files to DBFS
dbutils.fs.rm("dbfs:"+SSA_DIR, recurse=True)
dbutils.fs.mv("file:"+SSA_DIR, "dbfs:"+SSA_DIR, recurse=True)

###  Step 4: Check if your data was unzipped properly

Did the files for each year's data end up in our `data` subdirectory under `ssa`?  Use the `ls` method from  `dbutils.fs` to list 
the contents of the `/ssa/data` directory.

In [0]:
# Add your code here to list the contents of the /ssa/data directory (and run it)


### Step 5: What does one of these data files look like?

Using the `head` method in `dbutils.fs` we can take a peek at the first "X" bytes.  We will use the default of 64K.

In [0]:
dbutils.fs.head("dbfs:/ssa/data/yob1880.txt")

### Step 6: Making it more human-readable
In the above command, the `head` function is just getting bytes.  For a more human-readable view of the head of the file, we can enclose the call to `dbutils.fs.head()` inside a call 
to the Python `print` function, and the `\r\n` characters, which represent line endings in the file, will now show each name on a separate line.
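
The effect is easy to see with an ordinary Python string. This sketch uses a few of the sample rows shown earlier (not the actual contents of `yob1880.txt`) with `\r\n` at the end of each line.

```python
# Sample text using rows shown earlier, with \r\n at the end of each line
sample = "Emma,F,19738\r\nIsabella,F,15100\r\nSophia,F,14831\r\n"

# repr() gives the form a notebook displays for a bare string result:
# the line endings appear as the literal characters \r\n
print(repr(sample))

# print() writes the characters themselves, so each name lands on its own line
print(sample)

# splitlines() breaks the text into one string per line
lines = sample.splitlines()
```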

**In the following cell**, enclose the call to the `head` function from the cell above in a call to the python `print` function.

In [0]:
# In this cell, add a call to the print function around the call to the dbutils.fs.head method


# Assignment Deliverable

* Make sure you have added and run code for those steps where you were supposed to list or print the results.
* Publish your notebook as described in the lecture video titled *Intro to Jupyter and Python* 
* Submit the published URL as the deliverable for this assignment