# About this notebook
This notebook is meant as a guideline for a mostly-live coding demo delivered to a class of introductory Data Science students. **If you try to execute the code in this notebook end-to-end, you are likely to be disappointed**.  If you expect correct grammar, complete sentences, and fully coherent comments throughout, you are likely to be disappointed _again_. The demo itself is intended to take 30 minutes; some items in this notebook may have to be omitted because I talk faster than I code.  But hey, ya gotta have goals.

Oh, and did I mention "Don't try to run the code inside the notebook as-is"?

Don't try to run the code inside the notebook as-is. 

No, really.

# Introduction/motivation
Our goals are: 
* Show a simple example of executing python code from the command line, since much of our class work to date has been in Jupyter notebooks
* Demonstrate tools/techniques that may be useful during work on capstone projects for parsing web pages and retrieving hosted datafiles.
* Demonstrate an incremental approach to building a script, making sure one piece works before moving on to the next

## Data
This is a Vancouver class, so for local flavor, we'll be using the data provided by [Mobi](https://mobibikes.ca) about their bikeshare system, at https://www.mobibikes.ca/en/system-data.  

## Tools
Our data page provides a list of links to files hosted on Google Drive.  In addition to some standard libraries we've seen before, we'll also be using: 
* requests (for accessing the URL - more info at https://requests.readthedocs.io/en/master/user/quickstart)
* Beautiful Soup (for exploring page content - https://www.crummy.com/software/BeautifulSoup/bs4/doc)
* gdown (for downloading Google Drive files - https://github.com/wkentaro/gdown)

## Setup
We won't have time to walk through the environment setup for this, but I'll provide an environment export file for later.  Most of the packages we need, except ```gdown```, are supplied in the base Anaconda install.  For a fresh virtualenv, you'll need to install: 
* requests
* lxml
* bs4 
* gdown (via conda-forge)

# Libraries
Before we do anything else, let's import the libraries we'll need:

In [1]:
from bs4 import BeautifulSoup
import gdown
import re  # needed later for the regex filters
import requests
import sys

# Parsing arguments
Our script will take a simple single argument, the URL for the page containing our data.  It could be enough to read that argument this way:

In [4]:
page_url=sys.argv[1]

...but we want to make sure the argument is actually provided, so, we'll wrap this in a try/except block, which we saw in our earlier Python lessons:

In [6]:
page_url=''
try:
    page_url = sys.argv[1]
except IndexError:
    print('No URL provided')
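
One gap in the block above: after printing the message, the script carries on with an empty `page_url`. Here's a minimal sketch of one way to stop early instead. The helper names `get_page_url` and `main` are made up for illustration and aren't part of the demo script:

```python
import sys


def get_page_url(argv):
    """Return the first command-line argument, or None if it is missing.

    Hypothetical helper; argv is expected to look like sys.argv.
    """
    try:
        return argv[1]
    except IndexError:
        return None


def main(argv):
    """Return an exit status: 0 if a URL was supplied, 1 otherwise."""
    page_url = get_page_url(argv)
    if page_url is None:
        print('No URL provided')
        return 1
    print(f'Will read {page_url}')
    return 0
```

In a script, the last line would typically be `sys.exit(main(sys.argv))`, so that the shell sees a non-zero status when the argument is missing.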

# Reading URL
The next step is to open the URL we've passed into the script.  We'll start out with the try/except this time, since we definitely want to catch things if they break.

In [None]:
page_req=''
print(f'Attempting to access {page_url}') 
try:
    page_req=requests.get(page_url)
except requests.exceptions.RequestException as e:
    print(f'Error accessing {page_url}:\n {e} ')

As before, start with some tests.  We expect an error from the first, no output (clean execution) from the second:

Now we can try out the real URL!

Well, that ain't good. It looks like the root SSL certificates in use by the Mobi site aren't in the default set used by our python environment.  While troubleshooting SSL issues is **super fun**, we don't really have the time, so we're going to take a shortcut.  This is not a good shortcut, but it does make the problem go away:

In [None]:
page_req=''
print(f'Attempting to access {page_url}') 
try:
    page_req=requests.get(page_url, verify=False) #Fix certs later. Demo purposes only.
except requests.exceptions.RequestException as e:
    print(f'Error accessing {page_url}:\n {e} ')

Warnings are better than errors.  And the warning's a real one; this is not something to do in a real-world situation, since skipping certificate verification creates security risks.

# In the Soup
Finally we can do what we came to do. We'll create an instance of BeautifulSoup using the content of our web page:

In [None]:
tomato_basil=BeautifulSoup(page_req.content)
# Tomato basil. It's a soup. Get it? Whee!

Unlike the previous warning message, we can make this one go away pretty easily.  Depending on your environment, the parser you're using might need to change, but this will do for now.

In [None]:
tomato_basil=BeautifulSoup(page_req.content,features='lxml')

From this point on, we can just choose which parts of our "soup" we want to display.  Something that they show early in the docs is how to get a well-formatted version of the whole page:

In [None]:
print(tomato_basil.prettify())

What we really want, though, are the links we saw before.  The quick way to get those is using find_all() on the anchor tags:

In [None]:
print(tomato_basil.find_all("a"))

There's also a shorthand version for using find_all when working with common elements like tags.  We'll use that from now on, because I hate typing.

In [None]:
print(tomato_basil("a"))

That's a good start, looks like we're getting a list of tags, and we can iterate through those.  To save ourselves a bit of scrolling while we focus on the output we want, we can also make use of the ```limit``` parameter:

In [None]:
for anchor in tomato_basil("a",limit=3):
    print(anchor)

Ok, so we're processing anchors one at a time, which gets us closer to our goal of extracting the Google Drive URLs.  Small problem: There are anchors that aren't links for Google Drive.  We can use a regular expression to match the ```href``` attributes: 

In [None]:
link_filter=re.compile(r'google\.com') #escape the dot so it only matches a literal "."
for anchor in tomato_basil("a",href=link_filter,limit=3):
    print(anchor)

Improvement, but there's a "hidden" link in there with no text.  Fortunately, we can filter on that with a regex, too.  The text enclosed by a tag is in the tag's ```string``` attribute.  We could filter just on the year (e.g. ```20\d{2}```), but if the naming convention changes later, or if we want to use this script elsewhere, then that's a bit short-sighted.  Instead, let's filter out everything that only contains blanks, by requiring at least one non-blank character in ```string```.

In [None]:
link_filter=re.compile(r'google\.com')
string_filter=re.compile(r'\S+') #small hack to skip "invisible" links
for anchor in tomato_basil("a",href=link_filter, string=string_filter, limit=3):
    print(anchor)
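
To see why these two patterns behave the way they do, it can help to try them outside of Beautiful Soup. As far as I understand the Beautiful Soup docs, a compiled regex passed as a filter is applied with its `search()` method, so a match anywhere in the value counts. The sketch below uses only the standard `re` module, with the patterns written as raw strings. The sample hrefs and link texts are invented for illustration, not copied from the Mobi page:

```python
import re

link_filter = re.compile(r'google\.com')
string_filter = re.compile(r'\S+')  # at least one non-whitespace character

# Made-up examples: one Google Drive style link, one ordinary link
sample_hrefs = [
    'https://drive.google.com/file/d/FAKE_ID/view?usp=sharing',
    'https://www.mobibikes.ca/en/system-data',
]
# Made-up link texts: real text, whitespace only, and empty
sample_strings = ['January 2019', '   ', '']

href_matches = [bool(link_filter.search(h)) for h in sample_hrefs]
string_matches = [bool(string_filter.search(s)) for s in sample_strings]

print(href_matches)    # [True, False]
print(string_matches)  # [True, False, False]
```

Only the first href and the first link text pass, which is the behavior we want from the two filters.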

Now we're finally ready to start using ```gdown``` to get all the Google Drive files!  Let's see what happens if we pass the link directly to the ```gdown.download()``` method. 

In [None]:
for anchor in tomato_basil("a",href=link_filter, string=string_filter, limit=3):
    gdown.download(anchor.get("href"))

Ok, so gdown won't take just _any_ Google Drive URL.  It expects a specific format. Ok, _fine_, what _ever_, geez Louise.   The warning message gives us a helpful hint about how the URL should look, and that's a pretty straightforward string replacement.

In [None]:
for anchor in tomato_basil("a", href=link_filter, string=string_filter):
    download_link=(
        anchor.get('href')
        .replace('file/d/','uc?id=')
        .replace('/view?usp=sharing','')
    ) #parentheses instead of backslashes: a stray space after a backslash is a syntax error
    gdown.download(download_link)

Victory!  You'll note we also removed the ```limit=3``` from our ```for``` loop.  That's confidence. ;-) 
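
The chained `replace()` calls work for the links on this page, but they quietly assume every link ends in `/view?usp=sharing`. A slightly more defensive sketch pulls the file ID out with a regex instead. The helper name `to_download_url` and the sample links are made up for illustration, and the `uc?id=` target format is simply the one produced by the replacement above:

```python
import re

# Capture everything after "/file/d/" up to the next "/", "?" or "#"
FILE_ID_PATTERN = re.compile(r'/file/d/([^/?#]+)')


def to_download_url(href):
    """Rewrite a '.../file/d/<id>/...' link as '.../uc?id=<id>'.

    Hypothetical helper; returns None when the link has no '/file/d/' part.
    """
    match = FILE_ID_PATTERN.search(href)
    if match is None:
        return None
    prefix = href[:match.start()]  # scheme and host, kept as-is
    return f'{prefix}/uc?id={match.group(1)}'


print(to_download_url('https://drive.google.com/file/d/FAKE_ID/view?usp=sharing'))
# https://drive.google.com/uc?id=FAKE_ID
```

Returning `None` for links that don't fit the pattern lets the loop skip them explicitly rather than handing `gdown` a URL it can't use.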

# Discussion
* Obviously, writing and testing this code took far longer than just manually clicking a bunch of links. In your capstone, however, or in your later jobs, you may find opportunities to use some of these tools to save yourself time, or just to have a repeatable non-clicky approach.

* The basic argument parsing we did with ```sys.argv``` is ok for quick work, but there are more sophisticated methods (e.g. [argparse](https://docs.python.org/3/library/argparse.html)) that can make your life easier and your code more readable.

* This implementation is ok for a quick functionality demo, but if you were getting paid to write a command-line utility, you'd be expected to use some of the techniques we talked about in earlier lectures to create a package.  One method for doing so is described [here](https://realpython.com/python-application-layouts/#one-off-script), but in general you can expect to follow your organization's standards.

* Repeatedly executing "python _my_script_ _my_parameters_" and printing results is ok for quick prototyping, but it's not a great way to test changes.  Back to the previous point, if you were being paid to write this code, you'd be expected to produce tests.  In fact, in many circumstances, you will benefit from writing some of your tests before writing any of the "real" code for your utility.  There will be more discussion of testing in later lectures.

* As discussed earlier in class, you'll probably read differing opinions about "scraping," which is what we're doing here.  This is admittedly small-scale stuff (a few requests spread out over half an hour), but even small things (e.g. accidentally executing this script 1000 times) can have big impacts.  When you're interacting with web pages like this, even if it seems harmless, keep in mind that you're doing things that might not fit the system designers' intent. You're a guest: be careful, be polite...and remember that there are fellow geeks on the other end of the wire who are responsible for keeping their services healthy, and there's no point in making their lives difficult. :)
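
To make the `argparse` point above a little more concrete, here's a minimal sketch of what the argument handling could look like. The argument name `page_url` mirrors the variable used earlier; the helper name `build_parser`, the description, and the help text are placeholders:

```python
import argparse


def build_parser():
    """Build a parser with one required positional argument (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description='Download data files linked from a web page.'
    )
    parser.add_argument('page_url', help='URL of the page that lists the data files')
    return parser


# Passing a list keeps the example self-contained; a real script would call
# parse_args() with no arguments so that the command line is used instead.
args = build_parser().parse_args(['https://www.mobibikes.ca/en/system-data'])
print(args.page_url)
```

With this in place, running the script with no arguments prints a usage message and exits with an error status, so the try/except around `sys.argv[1]` is no longer needed.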