# Opening files from URLs

Why would you want to open files from URLs within a Jupyter Notebook or a Python script?

1. You want your code to be portable without needing to transport the data (like for teaching!)
2. You want to use a public data set but you don't need it or want it on your computer
3. You're using Google Colab, which can't see the files on your computer unless you upload them
4. You want to provide data to download within your notebook or script
5. You are unable to download data to your computer


<br>There are two different things you might want to do:

- Download the file to your computer (or to your Colab workspace) and then open it now or later. (You have the data even after you close the notebook.)

- Open the file from the web without downloading it. (You won't have it after you close your notebook - but you can save it to a file locally before closing your notebook if you need to.)

### Practice data
I've collected a few datasets for us to practice on today, from public sources.
- **fastestAnimals**: From **GitHub repo** for one of my other workshops. Top recorded speeds of the twenty fastest animals in the world. **Format: txt**
- **trafficCounts**: From **data.gov**. A close approximation to the actual number of vehicles passing through a given location in Chicago on an average weekday. **Format: csv**
- **covidSewer**: From **data.gov**. Concentrations of the COVID-19 virus gene in the Chicago sewer system, as measured at eight sewershed sites. **Format: xml**
- **missedWork**: From **UC Irvine Machine Learning Repository**. Records of absenteeism at work from July 2007 to July 2010 at a courier company in Brazil. **Format: zip**

Let's store the URL to each dataset as a string variable:

In [None]:
covidSewer = "https://data.cityofchicago.org/api/views/urdi-w8wq/rows.xml"
missedWork = "https://archive.ics.uci.edu/static/public/445/absenteeism+at+work.zip"
fastestAnimals = "https://raw.githubusercontent.com/nuitrcs/NextStepsInPython/master/pickleJson/fastestAnimals.txt"
trafficCounts = "https://data.cityofchicago.org/api/views/pfsx-4n4m/rows.csv"

### <br><br>Opening data without downloading it

First, let's try to open one of these files the way we would normally in Python. We'll try the least complex file, the txt file. The URL is saved as `fastestAnimals`. As a reminder, `f.read()` will turn the file into one long string.

In [None]:
with open(fastestAnimals, "r") as f:
    print(f.read())

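<br><br>It's always worth a try! But `open()` only understands paths on your own file system, so it raises an error when we hand it a URL.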

<br>We're going to import the **request** module from the package **urllib**. It is part of Python's standard library, so everyone already has it on their computer; it just needs to be imported.

In [None]:
from urllib import request

<br>Now we can open the file much as we normally would, only we will use the function `request.urlopen()`. We do not need to pass it a **mode** because it only reads from the URL - there is no mode for writing, since publishing data at a URL is more complicated than just writing to a filename.

In [None]:
with request.urlopen(fastestAnimals) as f:
    print(f.read())

<br><br>That worked! Sort of...

<br><br>Notice there is a "b" in front of the string and there are a couple of strange special characters in there. We know the tabs (`\t`) and the new lines (`\n`), but what about the `\xe2\x80\x93` and other strange codes?
<br><br>The `read()` method of the object returned by `request.urlopen()` always gives back a **bytes** object, not a string. A sequence like `\xe2\x80\x93` is the raw bytes of a single non-ASCII character (here, an en dash).
<br><br>Luckily, bytes objects have a simple method called `decode()` that interprets those bytes (as UTF-8 by default) and returns a string:

In [None]:
with request.urlopen(fastestAnimals) as f:
    str_f = f.read().decode()
print(str_f) 
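
<br>To see what `decode()` does without touching the web, here is a minimal sketch. The `raw` value is a made-up sample standing in for what `f.read()` returns, not the real contents of the file.

```python
# Hypothetical sample standing in for the bytes that f.read() returns.
raw = b"sample animal\t10\xe2\x80\x9320 km/h\n"

# decode() is a method of bytes objects; UTF-8 is its default encoding.
text = raw.decode("utf-8")
print(type(raw).__name__, "->", type(text).__name__)
print(text)

# The three bytes \xe2\x80\x93 are the UTF-8 encoding of one character: an en dash.
en_dash = b"\xe2\x80\x93".decode("utf-8")
print(en_dash)
```

If a file was saved in a different encoding, you can pass its name to `decode()`, for example `decode("latin-1")`.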

#### <br><br>Exercise 1
Write code to open and decode the other URL files: `trafficCounts` (csv), `covidSewer` (xml), and `missedWork` (zip). Which ones look clean enough to work with? Which ones would you want to look up a new solution to open?

<br>There is also the file method `readlines()` that you can use instead of `read()`. It returns a list of lines instead of just one long object, and each line is still a bytes object until you decode it. Try it out on any of the files that you think might benefit from that format.
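
<br>Here is a rough sketch of the `readlines()` pattern. To keep it runnable offline, a `BytesIO` object with made-up contents stands in for the response from `request.urlopen()`; the name `fake_response` and its data are assumptions for illustration only.

```python
from io import BytesIO

# Hypothetical stand-in for the object returned by request.urlopen().
fake_response = BytesIO(b"animal\tspeed\ncheetah\tfast\nsloth\tslow\n")

# readlines() returns a list of bytes objects, one per line,
# so we decode each line and strip its trailing newline.
lines = [line.decode().rstrip("\n") for line in fake_response.readlines()]
print(lines)
```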

### <br><br>Opening csv files from URLs in pandas

You can always parse a csv file as a string or list of lines or a dictionary, depending on what you need to do with the data (list comprehensions and dictionary comprehensions are great for this). However, your plan might be to open the csv file as a **pandas dataframe**.
<br><br>Let's try opening the `trafficCounts` URL as a pandas dataframe the same way we would open any csv file:

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv(trafficCounts)
print(df)

<br><br><br>I told you - it's always worth a try!
<br><br>The `read_csv()` function can automatically load data from URLs!
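
<br>If you don't need a dataframe, the string-parsing route mentioned at the start of this section can be sketched with the built-in **csv** module. The `csv_text` value and its column names are made up for illustration; in practice it would be the decoded string you read from the URL.

```python
import csv
from io import StringIO

# Hypothetical decoded csv text, standing in for f.read().decode().
csv_text = "street,vehicles\nMain St,1200\nOak Ave,850\n"

# DictReader uses the header row as keys and yields one dictionary per data row.
rows = list(csv.DictReader(StringIO(csv_text)))

# A dictionary comprehension turns the rows into a street -> count lookup.
counts = {row["street"]: int(row["vehicles"]) for row in rows}
print(counts)
```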

#### <br><br>Exercise 2.
Write code to open the `fastestAnimals` txt file using pandas.

<br>Which parameters can you pass to the `read_csv()` function to make the txt file work better as a dataframe? Check out the documentation to see all the options: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html.

### <br><br>Opening xml data from a URL
**xml** is a format used to share data between computer programs. It stores text data in a hierarchy using tags, similar to how HTML marks up a web page. (Maybe we'll do a Lunch Lesson on working with xml data in the future.)
<br><br>xml files from untrusted sources can be crafted to attack the program that parses them, so the usual recommendation is to open them with the **defusedxml** package. It is not part of Python's standard library, so if the import below fails you will need to install it first (for example with `pip install defusedxml`). It works mostly the same as the built-in library **xml**, but it guards against the best-known attacks.
<br><br>To open an xml file that is saved on your computer, we would normally use the function `ElementTree.parse()`, but it expects a filename or an open file, not a URL string. Instead, we will open it the same way we learned previously, using the `request.urlopen()` function from urllib, and then using `decode()` to convert it from a bytes object to a string. Then we'll use the `ElementTree.fromstring()` function to convert the string to an xml object. I know it seems like quite a few steps, but it will always be here in this notebook if you ever need it!

In [None]:
from defusedxml.ElementTree import fromstring

In [None]:
with request.urlopen(covidSewer) as f:    
    yuck = f.read().decode()
yuck_xml = fromstring(yuck)
print(yuck_xml)

<br>This is what we'd expect to see for an xml object: an `Element` that represents the root tag of the file. It can be further parsed using the methods and attributes of that Element, which work the same way in defusedxml as in the built-in **xml** library.
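
<br>As a minimal sketch of that further parsing, here is a tiny made-up xml string. The tag and attribute names (`response`, `row`, `site`, `level`) are assumptions for illustration and are not the real structure of the `covidSewer` file. The sketch uses the built-in `xml.etree.ElementTree` so it runs without installing anything; for data from the web you would swap in `fromstring` from defusedxml, as above.

```python
import xml.etree.ElementTree as ET  # stdlib; prefer defusedxml for untrusted data

# Hypothetical xml text, standing in for the decoded string from the URL.
xml_text = """<response>
  <row site="A"><date>2023-01-01</date><level>12.5</level></row>
  <row site="B"><date>2023-01-01</date><level>7.0</level></row>
</response>"""

root = ET.fromstring(xml_text)
print(root.tag)  # name of the root element

# iter() walks every matching element; get() reads an attribute
# and findtext() reads the text of a child tag.
records = [
    {
        "site": row.get("site"),
        "date": row.findtext("date"),
        "level": float(row.findtext("level")),
    }
    for row in root.iter("row")
]
print(records)
```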

### <br><br>Opening a zip file from a URL
A zip file needs to be unzipped! Zip files can contain one or many files of different file types. 
<br><br>We will use the `ZipFile()` function from the **zipfile** built-in library to open the `missedWork` zip file from a URL.

In [None]:
from zipfile import ZipFile

<br><br>Let's try what we know. We can open the URL in the `request.urlopen()` function, then read it in, then decode it to convert it from a bytes object, then unzip it.

In [None]:
with request.urlopen(missedWork) as f:
    absent_files = ZipFile(f.read().decode())
print(absent_files)

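<br><br>Ok, that didn't work. `decode()` tries to interpret the bytes as text, but a zip file is compressed binary data rather than text, so decoding typically fails with a `UnicodeDecodeError`. On top of that, `ZipFile()` expects a filename or a file-like object, not a string of file contents.
<br><br>Instead of decoding, we can wrap the bytes in a file-like object that lives in memory - `BytesIO()` from the built-in package **io**.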

In [None]:
from io import BytesIO

<br><br>We use almost the same logic: We can open the URL in the `request.urlopen()` function, then read it in, then wrap the bytes in `BytesIO()` instead of decoding them, then hand the result to `ZipFile()`.

In [None]:
with request.urlopen(missedWork) as f:
    absent_files = ZipFile(BytesIO(f.read()))
print(absent_files)

<br><br>This is a ZipFile object. We can use other methods in the **zipfile** library to do more with the files inside it, but here we'll only list their names.

In [None]:
absent_files.namelist()
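
<br>If you want to go one step further and read a file inside the archive, here is a rough sketch. It builds a small zip in memory to stand in for the downloaded bytes, so the member name `example.csv` and its contents are made up and are not what is inside `missedWork`.

```python
from io import BytesIO
from zipfile import ZipFile

# Build a small zip archive in memory to stand in for the downloaded bytes.
buffer = BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("example.csv", "id;hours\n1;4\n2;0\n")
zip_bytes = buffer.getvalue()

# Same pattern as above: wrap the raw bytes in BytesIO and hand that to ZipFile.
with ZipFile(BytesIO(zip_bytes)) as archive:
    names = archive.namelist()
    # read() returns the bytes of one member; text members can then be decoded.
    member_text = archive.read(names[0]).decode()

print(names)
print(member_text)
```

The decoded text can then be parsed like any other string.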

### <br><br>Downloading data from URLs
Take a look in your file tree on the left menu. We were able to work with all those files without downloading them onto our computer! Instead, we just loaded them into memory within Python. We can now work with that data in this notebook - filter it, analyze it, change it, visualize it - and just save the final products. If we close this notebook and open it in a year, we can reload the data from the URLs, as long as the data still lives at the same URL.

<br>Sometimes, though, we want to use a Python script to download data to our computer. Maybe that's the purpose of the script. 
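<br><br>There are a few ways to do this. In a notebook running on a Linux system like Google Colab, you can use the `!` to tell the notebook (not Python itself) that you're going to give a system command, and then use the command-line tool `wget`. This also works on a Mac if `wget` is installed; it does not come with macOS by default.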

In [None]:
!wget https://raw.githubusercontent.com/nuitrcs/NextStepsInPython/master/pickleJson/fastestAnimals.txt

<br>If the command succeeded, you should see the file pop up in your file tree (might take a minute).

<br><br>A second way to download data from a URL, which should work on all operating systems because it's Python and not a command-line tool, is a function in the **urllib** library called `request.urlretrieve()`. We already imported `request` from `urllib` in this notebook, so we're good to go.
<br><br>The function takes two arguments - the url to pull from, and the path where it should be saved. We will save the file right here in our current working directory, so our path will just be the filename:

In [None]:
request.urlretrieve(fastestAnimals, "secondFastestAnimals.txt")

<br>Hopefully that worked for you, and you see the file pop up in your file tree after a minute. Because we didn't give it a path to a different folder, the file is saved in the current working directory, which is normally the folder where this notebook is located. The message that got printed to the screen is the function's return value: a tuple that starts with the path to the file, followed by the response headers.
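
<br>Here is a small sketch of capturing what `request.urlretrieve()` returns. So that it runs without an internet connection, a temporary local file served through a `file://` URL stands in for the web address; the file names `source.txt` and `copy.txt` are made up for illustration.

```python
import tempfile
from pathlib import Path
from urllib import request

with tempfile.TemporaryDirectory() as folder:
    # A local file exposed as a file:// URL stands in for a web address.
    source = Path(folder) / "source.txt"
    source.write_text("cheetah\tfast\n")

    target = Path(folder) / "copy.txt"

    # urlretrieve() returns a tuple: the saved path and the response headers.
    saved_path, headers = request.urlretrieve(source.as_uri(), str(target))

    saved_text = target.read_text()
    print(saved_path)
    print(saved_text)
```

With a real URL, the call looks the same; only the first argument changes.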

#### <br><br>Exercise 3.
Write code to download the other files from the URLs we saved: `missedWork` (zip), `trafficCounts` (csv), and `covidSewer` (xml). Think up your own name for the files, or try leaving the second argument off for one file and find out where the file is saved and what it gets named if you don't give a path. If you're naming your own file, remember to give the filename the correct extension.

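Try saving one file to a different location on your computer by passing an absolute path. Note that Python does not expand `~` on its own, so a path like "~/Downloads/my_file.ext" needs to go through `os.path.expanduser()` first (or be written out in full). If you're on Colab, you can save it in the "sample_data" folder.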
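
<br>A minimal sketch of building such a path, assuming you have a `Downloads` folder in your home directory (the file name `my_file.ext` is just a placeholder):

```python
import os

# Python treats "~" as a literal folder name, so expand it to the home folder first.
target = os.path.expanduser("~/Downloads/my_file.ext")
print(target)
```

You could then pass `target` as the second argument to `request.urlretrieve()`. The folder has to exist already; the function will not create it for you.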

#### <br><br>How to get the correct URL from a GitHub repo
In the repo, click on the name of the file. Then click on the button `Raw` on the top right of the file. Copy the URL from that page. You can also right click on the `Raw` button and choose Copy link address, or the equivalent command.