# **Python Libraries - `requests`, `bs4`**

**Author: Eni Mustafaraj**

In this notebook I'll be showing you:

1. how to use the `requests` package to download an HTML file
2. how to install new packages directly from Jupyter
3. how to use `bs4` (BeautifulSoup) to parse the content of a simple HTML file

## Part 1: Using `requests`

We humans use a web browser to read HTML pages on the Web. Every time we visit a webpage, it is actually being transferred to our computer (and stored on our local drive). When we don't use a browser to visit websites, we can use libraries from a programming language to perform the same action as the browser.

In Python, we will use `requests` to download files from your web folder.

In [None]:
import requests

Use the `get` method to send a request to the server for the desired file. Check that the response was received.

In [None]:
response = requests.get("http://cs.wellesley.edu/~cs315/readings/index.html")
print(response.status_code)
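
The status code is a three-digit number defined by the HTTP standard: 200 means the request succeeded, while codes in the 400s and 500s signal client and server errors. As a small illustration that is not part of the original notebook, Python's standard `http` module can translate a code into its standard phrase. The helper name `describe_status` and the sample codes are made up for this sketch; they are not responses from the course server.

```python
from http import HTTPStatus

def describe_status(code):
    """Return the standard reason phrase for an HTTP status code."""
    return HTTPStatus(code).phrase

# Sample codes chosen for illustration only
for code in (200, 404, 500):
    print(code, describe_status(code))
```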

The variable `response` refers to an object, an instance of a class defined in the `requests` library. We can verify this with the Python function `type`:

In [None]:
type(response)

We use `dir` to look up the list of all attributes and methods of an object, class, or library:

In [None]:
print(dir(response))
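
The output of `dir` includes many names that start with an underscore, which are internal or special. Below is a minimal sketch for listing only the public names. It uses a built-in string as the example object so that it runs without a network connection, and the helper name `public_names` is made up for this illustration.

```python
def public_names(obj):
    """Return the names from dir(obj) that don't start with an underscore."""
    return [name for name in dir(obj) if not name.startswith("_")]

# A plain string stands in for the response object here
names = public_names("hello")
print(names)
```

The same helper should work on any object, so `public_names(response)` would list the public attributes and methods of the response.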

Look up the URL of the response:

In [None]:
response.url

Check if the desired phrase is in the response's content:

In [None]:
response.text.find("for protection") != -1 

# Why are we checking that the value of the expression on the left is different from -1?
# Answer: Because the string method find returns -1 when it doesn't find the substring in the text
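
To see the behavior of `find` without downloading anything, here is a small sketch on a made-up string (the sample text is an assumption, not the content of the real page). It also shows the `in` operator, which gives the same yes/no answer directly.

```python
# Made-up sample text standing in for response.text
text = "This folder is locked for protection of student work."
phrase = "for protection"

print(text.find(phrase))      # index where the phrase starts
print(text.find("missing"))   # -1, because the substring is absent
print(phrase in text)         # True, same answer as find(...) != -1
```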

### Use case: checking your CS server accounts

First, we need to get your accounts:

In [None]:
with open('sec02.txt') as inputF:
    lines = inputF.readlines()
    
print(lines)

Clean up the list of accounts, using a list comprehension:

In [None]:
accounts = [line.split('@')[0] for line in lines]
print(accounts)
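
Keep in mind that `readlines` keeps the newline character at the end of each line, so a blank line in the file would end up in the list as `'\n'`. The sketch below shows one way to guard against that. It assumes each line holds an email address; the sample lines are hypothetical, since the real `sec02.txt` is not shown here.

```python
# Hypothetical sample of what readlines() might return
sample_lines = ["wwellesley@wellesley.edu\n", "ab123@wellesley.edu\n", "\n"]

# Skip blank lines, keep the part before '@', and strip stray whitespace
sample_accounts = [
    line.split("@")[0].strip()
    for line in sample_lines
    if line.strip()
]
print(sample_accounts)
```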

Now that we have all your accounts, let's generate the URLs for them:

In [None]:
# generate URLs for all accounts via list comprehension and the format string syntax

urls = [f'http://cs.wellesley.edu/~{acc}/index.html' for acc in accounts]

for el in urls:
    print(el) 

We will now look up the text of each index.html file to check if it contains our desired phrase.

In [None]:
for acc in accounts:
    url = f"http://cs.wellesley.edu/~{acc}/index.html"
    response = requests.get(url)
    if response.status_code == 200:
        if response.text.find("for protection") != -1:
            print(acc, "SUCCESS")
        else:
            print(acc, "Didn't find phrase.")
    else:
        print(acc, "ERROR", response.reason)
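
The loop above stops with an exception if a request cannot be completed at all (for example, when the network is down). Here is a hedged sketch of a more defensive version. The function name `check_account`, the `fetch` parameter, and the 10-second timeout are choices made for this illustration; `timeout` and `requests.exceptions.RequestException` are part of the `requests` library.

```python
import requests

def check_account(acc, phrase="for protection", fetch=requests.get):
    """Return a short status string for one account's index.html page."""
    url = f"http://cs.wellesley.edu/~{acc}/index.html"
    try:
        # The timeout (in seconds) keeps one unresponsive server from hanging the loop
        response = fetch(url, timeout=10)
    except requests.exceptions.RequestException as err:
        # Covers connection errors, timeouts, and other request failures
        return f"ERROR {type(err).__name__}"
    if response.status_code != 200:
        return f"ERROR {response.reason}"
    return "SUCCESS" if phrase in response.text else "Didn't find phrase."
```

With the list from earlier, it could be used as `for acc in accounts: print(acc, check_account(acc))`. The `fetch` parameter is there only so the function can be tried with a stand-in instead of a real download.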

## Part 2: Install a new package

When our Python installation doesn't contain a package/module, we will get an error when importing it:

In [None]:
import textblob

Packages can be installed from the notebook itself: use the command `pip install` followed by the package name.

In [None]:
pip install textblob
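
If you would rather check for a package without triggering an import error, the standard library offers `importlib.util.find_spec`. The sketch below is one possible way to use it; the helper name `is_installed` is made up for this illustration.

```python
import importlib.util

def is_installed(package_name):
    """Return True if Python can locate the package, without importing it."""
    return importlib.util.find_spec(package_name) is not None

print(is_installed("json"))       # a standard-library module
print(is_installed("textblob"))   # depends on your installation
```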

## Part 3: BeautifulSoup

This is a library that helps parse HTML documents. 

In [None]:
import bs4

The class we will use is called `BeautifulSoup`, but since the name is long, we will import it under the shorter name `BS`.

In [None]:
from bs4 import BeautifulSoup as BS

First, I'm creating a simple function to get the content of HTML pages based on their URLs:

In [None]:
def getHTMLPage(url):
    """Given a url, get the HTML page content"""
    
    response = requests.get(url)
    if response.status_code == 200:
        return response.text
    else:
        print("Failure reason:", response.reason)
        return

Now I will get the content of the HTML file using the function I created:

In [None]:
url = "http://cs.wellesley.edu/~cs315/readings/index.html"
htmlPage = getHTMLPage(url)
print(htmlPage)

Let's check what value type is stored in `htmlPage`:

In [None]:
type(htmlPage)

The BeautifulSoup constructor will create the DOM (document object model) object:

In [None]:
domTree = BS(htmlPage, 'html.parser')
type(domTree)

**Note:** There is a difference between `htmlPage`, which is simply a string, and `domTree`, which is an object (an instance of the class `BeautifulSoup`).

In [None]:
print(dir(domTree))

Let us use the method `find` to find elements with a given tag:

In [None]:
domTree.find('p') # get a paragraph element

In [None]:
domTree.find('body') # get the body element

In [None]:
domTree.find('title') # get the title element

In [None]:
domTree.find('p').text # get the text of the p element
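
To experiment with `find` without depending on the course server, you can parse a small HTML string written by hand. The page below is invented for this sketch. It also uses `find_all`, a BeautifulSoup method not covered above, which returns every matching element instead of only the first one.

```python
from bs4 import BeautifulSoup as BS

# A tiny hand-written page, used here instead of a downloaded one
html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </body>
</html>
"""

tree = BS(html, "html.parser")

print(tree.find("title").text)   # text of the title element
print(tree.find("p").text)       # find returns only the first match
print(tree.find("h1"))           # None, since there is no h1 element

# find_all returns every matching element
for p in tree.find_all("p"):
    print(p.text)
```

Because `find` returns `None` when nothing matches, asking for `.text` on a missing element raises an error, so it is worth checking the result first.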

This was just to give you a taste of BeautifulSoup. We will continue working with it in future tutorials.