# Getting Data from the Internet

There is a ton of good data on the internet, but it can be hard to access.  In this lesson we will learn just enough about web scraping to... get in trouble.  

**Important**: stay out of trouble!

### Best Practices

1. Don't break anything.  Many rapid requests to smaller sites can overload the host server.
2. Use a published API if possible - it is more robust and usually much easier!
3. Respect the policy published at `robots.txt`.
4. Don't spoof your UserAgent (or try to trick the server into thinking you are a person)
5. Read the Terms of Service for the site and follow them.
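
The first practice above can be made concrete by pausing between requests. The sketch below is one possible way to do that, not a standard recipe: the helper name `fetch_politely`, its `fetch` argument, and the one-second default delay are all assumptions made for illustration.

```python
import time

def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between calls.

    `fetch` is any callable that takes a URL (for example `requests.get`).
    The default delay is an assumption; choose one suited to the site.
    """
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # wait before every request except the first
        results.append(fetch(url))
    return results
```

Passing the fetch function in as an argument keeps the pacing logic separate from the download itself, so it can be tried out with a stand-in function before pointing it at a real site.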

# Requests

`requests` is a Python package that lets you interact with the internet from your code!  There are other packages, but I find `requests` to be much easier to use.

In fact, getting the UCSD home page is as simple as
```
import requests
text = requests.get("https://ucsd.edu").text
```
But before we do that, we need to learn just a little bit more.

# Status Codes

When we request data from a website, the server responds with an HTTP status code.  The most common response is `200`, which means things went well.  Other times you will get a different status code saying something else happened - you might be familiar with `404`, which means the page wasn't found.

This great site lists HTTP status codes: [https://httpstat.us/](https://httpstat.us/).

But better yet, it has example pages that return a specific code, so you can test!  For example, https://httpstat.us/404 returns a `404`.
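
If you just want the name that goes with a code, Python's standard library already knows the registered ones. Here is a small sketch that needs no network access; the helper name `describe_status` is made up for this lesson.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short label such as '404 Not Found' for a known status code."""
    status = HTTPStatus(code)  # raises ValueError for codes the stdlib does not know
    return f"{status.value} {status.phrase}"

for code in [200, 403, 404, 429]:
    print(describe_status(code))
```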

In [None]:
import requests

r = requests.get("https://httpstat.us/404")
print(r.status_code)

In [None]:
r = requests.get("https://httpstat.us/404")
print(r.status_code)  # a bare expression is only displayed when it is the last line of a cell
r.text

You can check if the call went ok with `r.ok`, which returns a boolean: `True` when the status code is below 400.

After you run the code below, read up on each of the status codes at [https://httpstat.us/](https://httpstat.us/).

In [None]:
statusCodes = [200, 404, 403, 429]

for statusCode in statusCodes:
    r = requests.get("https://httpstat.us/" + str(statusCode))
    print(str(statusCode) + " ok: " + str(r.ok))

In [None]:
# Or raise an exception when there is a not-ok status code

r = requests.get("https://httpstat.us/404")
r.raise_for_status()
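
In a script you will usually want to catch that exception instead of letting it stop the program. The sketch below shows one way to do so. The helper name `status_or_error` is an assumption, and the response is built by hand (rather than fetched) so that the example runs without touching the network.

```python
import requests

def status_or_error(response):
    """Return 'ok' if the response succeeded, else the HTTPError message."""
    try:
        response.raise_for_status()
    except requests.exceptions.HTTPError as err:
        return str(err)
    return "ok"

# Hand-built response standing in for requests.get(...); no request is sent.
fake = requests.Response()
fake.status_code = 404
print(status_or_error(fake))
```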

# Robots.txt

Many sites have a published policy allowing or disallowing automatic access to their site.  They may also specify which user-agents are allowed to automatically access specific parts of the site. They use a text file called `robots.txt`, and you can learn more about it [here](https://moz.com/learn/seo/robotstxt).
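
To get a feel for how these rules are read, here is a small sketch that parses a made-up policy from a string instead of downloading one. The rules, the `BadBot` agent name, and the `example.com` URLs are all invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt: everyone is kept out of /private/,
# and the (hypothetical) agent BadBot is kept out of everything.
sample_rules = """
User-agent: *
Disallow: /private/

User-agent: BadBot
Disallow: /
""".splitlines()

sample_rp = RobotFileParser()
sample_rp.parse(sample_rules)  # parse lines directly, no network request

print(sample_rp.can_fetch("*", "https://example.com/index.html"))
print(sample_rp.can_fetch("*", "https://example.com/private/data.html"))
print(sample_rp.can_fetch("BadBot", "https://example.com/index.html"))
```

The first check is allowed; the other two are blocked, one by the path rule and one by the agent rule.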

The code below checks if the `robots.txt` file prohibits you from scraping the site.  Remember the best practices above - just because you aren't prohibited by the robots policy doesn't mean you can scrape the site!

In [None]:
from urllib.parse import urlparse
import urllib.robotparser

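# This code checks the robots.txt file.
# It returns True or False from the policy, or None if robots.txt could not be read.
def canFetch(url):

    parsed_uri = urlparse(url)
    # No trailing slash here, so the robots.txt URL below has a single "/"
    domain = '{uri.scheme}://{uri.netloc}'.format(uri=parsed_uri)

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(domain + "/robots.txt")
    try:
        rp.read()
        canFetchBool = rp.can_fetch("*", url)
    except Exception:
        canFetchBool = None
    
    return canFetchBool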

In [None]:
url = "http://blink.ucsd.edu/search"
canFetch(url)

In [None]:
url = "https://datascience.ucsd.edu/academics/undergraduate/"
canFetch(url)

# Getting the HTML

Now we can request a website!  Let's see what is on the UCSD Data Science undergraduate page.

In [None]:
url = "https://datascience.ucsd.edu/academics/undergraduate/"

r = requests.get(url)
    
urlText = r.text

Nchars = 10000
print(urlText[:Nchars]) # Print the first Nchars characters
print("\n\n... " + str(max(len(urlText) - Nchars, 0)) + " additional characters")


In [None]:
len(r.text)

# Cleaning

Wow, that is gross looking!  It is raw HTML, which the browser uses to make the viewable site.  To process it we can use [Beautiful Soup 4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/).

**Warning**: BeautifulSoup has changed quite a bit between versions, so make sure you are looking at documentation for the version you are using (4 here).

Let's follow [this example](https://codeburst.io/web-scraping-101-with-python-beautiful-soup-bb617be1f486) of using BeautifulSoup.

In [None]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(urlText, 'html.parser')

In [None]:
# let's check once more that robots.txt allows fetching this URL
# (this does not replace reading the Terms of Service)
canFetch(url)

In [None]:
page_response = requests.get(url,timeout=5)
# here, we fetch the content from the url, using the requests library

In [None]:
page_content = BeautifulSoup(page_response.content, "html.parser")
#we use the html parser to parse the url content and store it in a variable.

paragraphs = page_content.find_all("p")
# Here we get all content within <p> paragraphs

In [None]:
# let's see it one by one:
for paragraph in paragraphs:
    print(paragraph)

In [None]:
#let's get rid of all the html code:
for paragraph in paragraphs:
    print(paragraph.text)
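
The same `find_all` approach works for other tags. As a sketch, the cell below pulls paragraph text and link targets out of a small hand-written HTML string, so it runs without fetching anything. The HTML and the variable names are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hand-written HTML standing in for a downloaded page
sample_html = """
<html><body>
  <p>First paragraph with a <a href="/about">relative link</a>.</p>
  <p>Second paragraph with an <a href="https://example.com/">absolute link</a>.</p>
</body></html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Text of each paragraph, with the tags stripped
paragraph_texts = [p.get_text() for p in sample_soup.find_all("p")]

# Target of each link; href=True skips <a> tags that have no href attribute
link_targets = [a["href"] for a in sample_soup.find_all("a", href=True)]

print(paragraph_texts)
print(link_targets)
```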

# Next steps

From here you can do a number of different things!

* Choose a website that you like! Check if you can web scrape it!
* Get the text
* Pull text down and use NLP from last week (like sentiment analysis)
* Monitor a site daily for changes.
* Use the text to create your own search engine!
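
For the monitoring idea, one simple approach is to store a hash of the page text and compare it on the next visit. This is only a sketch under that assumption: the helper names `fingerprint` and `has_changed` are made up, and real pages often contain timestamps or ads that change on every load, so you may want to hash only the part you care about.

```python
import hashlib

def fingerprint(text):
    """Return a SHA-256 hex digest that identifies this exact text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def has_changed(previous_fingerprint, current_text):
    """Compare a stored fingerprint against freshly fetched text."""
    return fingerprint(current_text) != previous_fingerprint

# Stand-in strings; in practice these would come from r.text on different days
yesterday = fingerprint("Seminar on Friday at 3pm")
print(has_changed(yesterday, "Seminar on Friday at 3pm"))
print(has_changed(yesterday, "Seminar moved to Monday at 1pm"))
```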

In [None]:
# your code here!