# Tutorial 1.2 - Extracting Downloadable Files from Websites

*Reference for the source code and method:* https://towardsdatascience.com/how-to-web-scrape-with-python-in-4-minutes-bc49186a8460

> Sometimes the data we need is only available as downloadable .txt or .csv files. Downloading each file by hand and then processing it is tedious, so we can write a script to download the files for us.

In [1]:
import requests
import urllib.request

We have our table of interest at this link:

https://www.cde.ca.gov/ds/sd/sd/filesenr.asp

<div align="center">
<img src="images/enrollment_data2.png" width="100%">
</div>

Copying the link address from the hyperlink `enr19` yields:

http://dq.cde.ca.gov/dataquest/dlfile/dlfile.aspx?cLevel=School&cYear=2019-20&cCat=Enrollment&cPage=filesenr.asp

This is the file download address that we will use to request and download the file.
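The address above is a base URL followed by query parameters, and the school year appears in `cYear`. As a minimal sketch, the same address can be assembled from its parts. The helper name `build_enrollment_url` is hypothetical, and it assumes that other years use the same parameters as the link copied above.

```python
from urllib.parse import urlencode

# Base address and parameter names are copied from the download link above.
BASE_URL = 'http://dq.cde.ca.gov/dataquest/dlfile/dlfile.aspx'


def build_enrollment_url(year):
    """Return the download address for a school year such as '2019-20'.

    Hypothetical helper: assumes every year uses the same query parameters.
    """
    params = {
        'cLevel': 'School',
        'cYear': year,
        'cCat': 'Enrollment',
        'cPage': 'filesenr.asp',
    }
    # urlencode joins the pairs as key=value separated by '&'
    return BASE_URL + '?' + urlencode(params)
```

Calling `build_enrollment_url('2019-20')` reproduces the address copied from the `enr19` hyperlink.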

We can use the `urllib.request.urlretrieve()` function to download the file from the URL and save it under a file name we choose.

For example:

`urllib.request.urlretrieve('my_download_link', 'my_file_name')`

In [3]:
urllib.request.urlretrieve('http://dq.cde.ca.gov/dataquest/dlfile/dlfile.aspx?cLevel=School&cYear=2019-20&cCat=Enrollment&cPage=filesenr.asp', 'enr_2019_20.txt')

('enr_2019_20.txt', <http.client.HTTPMessage at 0x10426c390>)

**Done!** The file is saved in the current working directory, which is usually the folder that contains your script or notebook.
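The second argument to `urlretrieve()` can also be a full path, which lets us save the file somewhere other than the working directory. The sketch below builds such a path with `pathlib`. The helper name `destination_path` and the folder name `data` are assumptions made for illustration.

```python
from pathlib import Path


def destination_path(directory, filename):
    """Create `directory` if it does not exist and return the path for `filename`."""
    folder = Path(directory)
    # parents=True creates intermediate folders; exist_ok=True avoids an error on reuse
    folder.mkdir(parents=True, exist_ok=True)
    return str(folder / filename)


# Example (uncomment to download into a 'data' folder):
# import urllib.request
# url = 'http://dq.cde.ca.gov/dataquest/dlfile/dlfile.aspx?cLevel=School&cYear=2019-20&cCat=Enrollment&cPage=filesenr.asp'
# urllib.request.urlretrieve(url, destination_path('data', 'enr_2019_20.txt'))
```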

---

## Troubleshooting and Good Things to Know

1. Read through the website’s **Terms and Conditions** to understand how you can legally use the data. Many sites restrict how their data may be used, for example by prohibiting commercial use.

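2. A common error when requesting files directly in this way is a **failed or blocked request**.

For example, if we try to download the file from the address below, we get the error shown.
The `CERTIFICATE_VERIFY_FAILED` message means that Python could not verify the server's SSL certificate against the certificate authorities available locally, so the request is aborted before any data is downloaded. This is often a problem with the certificates of the local Python installation rather than with our permission to access the site.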

In [3]:
urllib.request.urlretrieve('https://www3.cde.ca.gov/demo-downloads/ce/cenroll1819.txt', 'enr_2019_20_mod.txt')

URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1076)>
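When downloading several files, one failed request should not stop the whole script. A minimal sketch, assuming we only want to report the failure and move on, is to catch `urllib.error.URLError`. The function name `try_download` and its `retrieve` parameter are hypothetical, and the sketch does not fix the certificate problem itself.

```python
import urllib.request
from urllib.error import URLError


def try_download(url, filename, retrieve=urllib.request.urlretrieve):
    """Attempt a download; return the saved path, or None if the request fails.

    `retrieve` defaults to urlretrieve. It is a parameter only so that the
    function can be exercised without a network connection.
    """
    try:
        saved_path, _headers = retrieve(url, filename)
    except URLError as err:
        # err.reason holds the underlying cause, e.g. an SSL verification error
        print(f'Could not download {url}: {err.reason}')
        return None
    return saved_path
```

Because `HTTPError` is a subclass of `URLError`, the same `except` clause also reports HTTP status errors such as 403 or 404.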

---

## Future Developments

In a future version of this tutorial, we will use the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) package to parse the HTML from the website and create a loop that automatically downloads the files of interest!

Here is a preview of the code:

In [4]:
import time  # to be used in the future for looping
from bs4 import BeautifulSoup  # parses the HTML returned by requests

In [5]:
url = 'https://www.cde.ca.gov/ds/sd/sd/filesenr.asp'
response = requests.get(url)
response

<Response [200]>

In [6]:
soup = BeautifulSoup(response.text, 'html.parser')

In [None]:
soup.find_all('a')  # list every <a> (link) tag on the page

In [8]:
one_a_tag = soup.find_all('a')[214]
link = one_a_tag['href']
link
# To loop over the files, step through the links by 2, starting at index 212
# https://www3.cde.ca.gov/demo-downloads/fycgr/cohort5year1819.txt

'http://dq.cde.ca.gov/dataquest/dlfile/dlfile.aspx?cLevel=School&cYear=2018-19&cCat=Enrollment&cPage=filesenr.asp'
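Selecting links by position (index 214, then every second link) breaks as soon as the page layout changes. As a sketch of an alternative, we can keep only the links whose address looks like the enrollment download shown above and derive a file name from the `cYear` parameter. The function name `enrollment_downloads` is hypothetical, and it assumes every enrollment link follows the same pattern as the one printed above.

```python
from urllib.parse import urlparse, parse_qs


def enrollment_downloads(hrefs):
    """Return (url, filename) pairs for links that look like enrollment downloads."""
    downloads = []
    for href in hrefs:
        if not href:
            continue  # some <a> tags have no href attribute
        parsed = urlparse(href)
        query = parse_qs(parsed.query)  # e.g. {'cYear': ['2018-19'], ...}
        is_download = parsed.path.lower().endswith('dlfile.aspx')
        if is_download and query.get('cCat') == ['Enrollment'] and 'cYear' in query:
            year = query['cYear'][0]
            filename = 'enr_' + year.replace('-', '_') + '.txt'
            downloads.append((href, filename))
    return downloads


# Example with the soup object from the cells above (uncomment to run):
# hrefs = [a.get('href') for a in soup.find_all('a')]
# for url, filename in enrollment_downloads(hrefs):
#     urllib.request.urlretrieve(url, filename)
#     time.sleep(1)  # pause between requests to avoid overloading the server
```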

Further resources:
https://stackoverflow.com/questions/44699682/how-to-save-a-file-downloaded-from-requests-to-another-directory