# Getting Data from the Internet

There is a ton of good data on the internet, but it can be hard to access.  In this lesson we will learn just enough about web scraping to get in trouble.  

**Important**: stay out of trouble!

### Best Practices

1. Don't break anything.  Many rapid requests to smaller sites can overload the host server.
2. Use a published API if possible - it is more robust and usually much easier!
3. Respect the policy published at `robots.txt` 
4. Don't spoof your UserAgent (or try to trick the server into thinking you are a person)
5. Read the site's Terms of Service and follow them.
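
One way to follow the first rule is to pause between requests. The helper below is only a sketch: the name `fetch_politely` and the one-second default delay are assumptions made for illustration, not part of any library. It takes the fetching function as an argument, so it can wrap `requests.get` (introduced in the next section) or anything else that accepts a URL.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive requests.

    `fetch` is any callable that takes a URL, for example `requests.get`.
    `delay_seconds` is an assumed default; pick a value suited to the site.
    """
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # wait before every request after the first
        results.append(fetch(url))
    return results
```

With `requests` imported, a call such as `fetch_politely(list_of_urls, requests.get)` would fetch each page in turn with a pause in between.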

# Requests

`requests` is a Python package that lets you interact with the internet from your code!  There are other packages, but I find `requests` to be much easier to use.

In fact, getting the UCSD home page is as simple as
```
import requests
text = requests.get("https://ucsd.edu").text
```
But before we do that, we need to learn just a little bit more.

# Status Codes

When we request data from a website, the server responds with an HTTP status code.  The most common response is `200`, which means things went well.  Other times you will get a different status code saying something else happened - you might be familiar with `404`, which means the page wasn't found.
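
Status codes are grouped into families by their first digit. As a rough sketch, the standard library's `http.HTTPStatus` can turn a code into its standard phrase without making any request; the helper name `describe_status` and the family labels are assumptions made for this example.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short description such as '404 Not Found (client error)'."""
    families = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    phrase = HTTPStatus(code).phrase  # raises ValueError for unknown codes
    return f"{code} {phrase} ({families[code // 100]})"


for code in [200, 404, 403, 429]:
    print(describe_status(code))
```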

This great site lists HTTP status codes: [https://httpstat.us/](https://httpstat.us/).

Better yet, it has example pages that each return a specific code, so you can test!  For example, https://httpstat.us/404 returns a `404`.

In [14]:
import requests

r = requests.get("https://httpstat.us/404")
print(r.status_code)

404


In [15]:
r = requests.get("https://httpstat.us/404")
r.status_code
r.text

'404 Not Found'

You can check whether the call succeeded with `r.ok`, a boolean that is `True` when the status code is below 400.

After you run the code below, read up on each of the status codes at [https://httpstat.us/](https://httpstat.us/).

In [16]:
statusCodes = [200, 404, 403, 429]

for statusCode in statusCodes:
    r = requests.get("https://httpstat.us/" + str(statusCode))
    print(str(statusCode) + " ok: " + str(r.ok))

200 ok: True
404 ok: False
403 ok: False
429 ok: False


In [17]:
# Or raise an exception when there is a not-ok status code

r = requests.get("https://httpstat.us/404")
r.raise_for_status()

HTTPError: 404 Client Error: Not Found for url: https://httpstat.us/404

# Robots.txt

Many sites have a published policy allowing or disallowing automatic access to their site.  It uses a text file `robots.txt` and you can learn more about it [here](https://moz.com/learn/seo/robotstxt).

The code below checks whether the `robots.txt` file prohibits you from scraping the site.  Remember the best practices above - just because you aren't prohibited by the robots policy doesn't mean you can scrape the site!
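
To see how such a policy is interpreted without contacting any server, here is a small sketch that feeds a made-up policy straight into the standard library parser. The policy text and the `example.com` URLs are invented for illustration.

```python
import urllib.robotparser

# A made-up policy for illustration; real sites publish their own rules.
example_policy = """
User-agent: *
Disallow: /private/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(example_policy.splitlines())  # parse rules from text instead of downloading

# Ask whether a generic crawler ("*") may fetch each URL under this policy
private_ok = rp.can_fetch("*", "https://example.com/private/report.html")
index_ok = rp.can_fetch("*", "https://example.com/index.html")
print(private_ok, index_ok)
```

Under this policy the page inside `/private/` is disallowed and the index page is allowed.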

In [18]:
from urllib.parse import urlparse
import urllib.robotparser

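# Check whether the site's robots.txt allows any user agent ("*") to fetch url.
# Returns True or False, or None if robots.txt could not be read.
def canFetch(url):

    parsed_uri = urlparse(url)
    # No trailing slash here, so the robots.txt URL below has a single "/"
    domain = '{uri.scheme}://{uri.netloc}'.format(uri=parsed_uri)

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(domain + "/robots.txt")
    try:
        rp.read()
        canFetchBool = rp.can_fetch("*", url)
    except Exception:
        canFetchBool = None

    return canFetchBool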

In [19]:
url = "https://slate.com/bullpen/"
canFetch(url)

False

In [20]:
url = "http://dsc.ucsd.edu/node/10"
canFetch(url)

True

# Getting the HTML

Now we can request a website!  Let's see what is on the UCSD Data Science Events page.

In [21]:
url = "http://dsc.ucsd.edu/node/10"

r = requests.get(url)
    
urlText = r.text

Nchars = 10000
print(urlText[:Nchars]) # Print the first Nchars characters
print("\n\n... " + str(len(urlText)-Nchars) + " additional characters")


<!DOCTYPE html>
<html lang="en" dir="ltr" prefix="content: http://purl.org/rss/1.0/modules/content/  dc: http://purl.org/dc/terms/  foaf: http://xmlns.com/foaf/0.1/  og: http://ogp.me/ns#  rdfs: http://www.w3.org/2000/01/rdf-schema#  schema: http://schema.org/  sioc: http://rdfs.org/sioc/ns#  sioct: http://rdfs.org/sioc/types#  skos: http://www.w3.org/2004/02/skos/core#  xsd: http://www.w3.org/2001/XMLSchema# ">
  <head>
    <meta charset="utf-8" />
<meta name="Generator" content="Drupal 8 (https://www.drupal.org)" />
<meta name="MobileOptimized" content="width" />
<meta name="HandheldFriendly" content="true" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<link rel="shortcut icon" href="/core/misc/favicon.ico" type="image/vnd.microsoft.icon" />
<link rel="canonical" href="/node/10" />
<link rel="shortlink" href="/node/10" />
<link rel="revision" href="/node/10" />

    <title>Events | Data Science Undergraduate Program</title>
    <link rel="stylesheet" hre

In [22]:
len(r.text)

15827

# Cleaning

Wow, that is gross looking!  It is raw HTML, which the browser uses to make the viewable site.  To process it we can use [Beautiful Soup 4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/).

**Warning**: BeautifulSoup has changed quite a bit between versions, so make sure you are looking at documentation for the version you are using (4 here).

In [23]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(urlText, 'html.parser')

In [24]:
# Grab all the list items (most of them contain a link)
for item in soup.find_all('li'):
    print(item)

<li class="menu-item">
<a data-drupal-link-system-path="user/login" href="/user/login">Log in</a>
</li>
<li>
<a href="/">Home</a>
</li>
<li><a href="https://drive.google.com/file/d/0B4gMdSgWOaZ_TWh5TjRGRW5RT1k/view?usp=sharing">Data Science Town Hall Meeting</a>: 7 pm, Tuesday, May 30, 2017; Ledden Auditoriun
	</li>
<li><a href="https://drive.google.com/file/d/0B4gMdSgWOaZ_NFN4YzB0NFdlSEk/view?usp=sharing">Data Science Town Hall Meeting</a>: 2 pm, Tuesday, September 26, 2017; CSE 1202</li>
<li><a href="https://drive.google.com/file/d/1tpMmHNTuvJHHMitddFl50FJR2mXCnpsn/view?usp=sharing">Data Science Group Advising Session</a>: 3 pm, Tuesday, February 13th, CSE  1202<br/>
In this session, students will be advised about enrolling in data science classes for Spring 2018 </li>
<li><a href="https://drive.google.com/file/d/11bTF_C56Y_zI1dTTWiuTMPqi64aR4BFF/view?usp=sharing">End of the Year Town Hall Meeting</a>: 5-7 pm, Monday, April 30th, 2018;  Atkinson Hall Auditorium<br/>
	Course planning 
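
The list items above still carry their HTML tags. To pull out just the link text and target, one option is to search for `a` tags and read their `href` attribute. The sketch below runs on a small hand-written snippet (not the real page) so it can be tried without a network connection; the variable names are assumptions for this example.

```python
from bs4 import BeautifulSoup

# A small hand-written HTML snippet, not the real page.
html = """
<ul>
  <li><a href="/">Home</a></li>
  <li><a href="https://example.com/event">Town Hall</a>: 7 pm</li>
  <li>No link here</li>
</ul>
"""

snippet = BeautifulSoup(html, "html.parser")

# Collect (link text, href) pairs from every anchor that has an href attribute
links = [(a.get_text(strip=True), a["href"]) for a in snippet.find_all("a", href=True)]
print(links)
```

The same list comprehension should work on the `soup` object built from the real page.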

In [25]:
# Show the text
print(soup.get_text())













Events | Data Science Undergraduate Program







      Skip to main content
    






User account menu



Show — User account menu
Hide — User account menu


Log in







Data Science Undergraduate Program





Data Science Undergraduate Program


















Breadcrumb


Home












Events










Data Science Events
Data Science Town Hall Meeting: 7 pm, Tuesday, May 30, 2017; Ledden Auditoriun
	Data Science Town Hall Meeting: 2 pm, Tuesday, September 26, 2017; CSE 1202
Data Science Group Advising Session: 3 pm, Tuesday, February 13th, CSE  1202
In this session, students will be advised about enrolling in data science classes for Spring 2018 
End of the Year Town Hall Meeting: 5-7 pm, Monday, April 30th, 2018;  Atkinson Hall Auditorium
	Course planning for 2018-19 academic year will be discussed during the town hall meeting.


Interested in learning more about events happening on campus related to Data Science? Subscribe to our newsletter below!



<!--/*--

In [26]:
# That text had too much white space.  Let's try stripped_strings instead
for string in soup.stripped_strings:
    print(repr(string))

'Events | Data Science Undergraduate Program'
'Skip to main content'
'User account menu'
'Show — User account menu'
'Hide — User account menu'
'Log in'
'Data Science Undergraduate Program'
'Data Science Undergraduate Program'
'Breadcrumb'
'Home'
'Events'
'Data Science Events'
'Data Science Town Hall Meeting'
': 7 pm, Tuesday, May 30, 2017; Ledden Auditoriun'
'Data Science Town Hall Meeting'
': 2 pm, Tuesday, September 26, 2017; CSE 1202'
'Data Science Group Advising Session'
': 3 pm, Tuesday, February 13th, CSE  1202'
'In this session, students will be advised about enrolling in data science classes for Spring 2018'
'End of the Year Town Hall Meeting'
': 5-7 pm, Monday, April 30th, 2018;  Atkinson Hall Auditorium'
'Course planning for 2018-19 academic year will be discussed during the town hall meeting.'
'Interested in learning more about events happening on campus related to Data Science? Subscribe to our newsletter below!'
'<!--/*--><![CDATA[/* ><!--*/\n#mc_embed_signup{background:#f

# Next steps

From here you can do a number of different things!

* Scrape individual elements from a site ([example](https://codeburst.io/web-scraping-101-with-python-beautiful-soup-bb617be1f486))
* Pull text down and use NLP from last week (like sentiment analysis)
* Monitor a site daily for changes.
* Use the text to create your own search engine!
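
As a starting point for the monitoring idea, one simple approach is to store a hash of the page text and compare it on the next visit. This is a sketch under assumptions: the function names are hypothetical, and how you fetch the text and where you store the fingerprint are left up to you.

```python
import hashlib


def fingerprint(text):
    """Return a SHA-256 hex digest of the page text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def has_changed(new_text, previous_fingerprint):
    """True if the text differs from the version that produced the stored fingerprint."""
    return fingerprint(new_text) != previous_fingerprint
```

Hashing the output of `soup.get_text()` rather than the raw HTML may reduce false alarms from markup that changes on every load, though pages with dynamic text will still trigger it.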