# Worksheet 10 - Getting started with Web Scraping
  
Your Name: Hyeong-gi Hong      
Your Class: INST 447  
Your Section: 0101 (MWF) | 0102 (TTh)  
Your favorite flavor of Frozen Yogurt, Ice Cream, Sorbet, or Other: Vanilla  

## Reading Reflection  

Write a 75-word (+/- 15 words) response about the two assigned readings:


> “Bots Are Scraping Your Data For Cash Amid Murky Laws And Ethics.” Accessed March 15, 2018. https://www.fastcompany.com/40456140/bots-are-scraping-your-public-data-for-cash-amid-murky-laws-and-ethics-linkedin-hiq.

> Fiesler, Casey. “Law & Ethics of Scraping: What HiQ v LinkedIn Could Mean for Researchers Violating TOS.” Medium (blog), August 15, 2017. https://medium.com/@cfiesler/law-ethics-of-scraping-what-hiq-v-linkedin-could-mean-for-researchers-violating-tos-787bd3322540.

Prompting questions to get your brain moving:
<ol>
<li>What are your thoughts about the ethics of scraping?</li>
<li>How would you feel if you found that your content was being scraped?</li>
</ol>

In my opinion, people should follow the basic ethical rules of scraping, such as not taking information they are not authorized to access. As a student researcher, I have scraped a few websites for school research projects, but the data was publicly available and the sites did not say that scraping was prohibited. 
If scraping involved personal content of mine that should not be shared without my consent, I would be very uncomfortable. That would be not only illegal but also unethical.

In [1]:
import requests
from bs4 import BeautifulSoup
import re

## Scraping Testudo

URL: https://app.testudo.umd.edu/

## Check if we can scrape!

We need to answer all three of the questions below:
<ol>
<li>Is there a robots.txt file?</li>
<li>Is there a robots meta tag that prohibits scraping? (http://www.robotstxt.org/meta.html)</li>
<li>Is there an X-Robots-Tag in the header? (https://developers.google.com/search/reference/robots_meta_tag)</li>
</ol>

We will also check to see if there is a licensing agreement prohibiting the use of this data.

#### Step 1: Check for robots.txt

In [2]:
testudo_url = "https://app.testudo.umd.edu/robots.txt"

In [3]:
r = requests.get(testudo_url)

In [4]:
r.status_code

404
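
A 404 here means no robots.txt is published at that address, so the site states no crawl rules in that file. For a site that does return one, the standard library's `urllib.robotparser` can interpret the rules. The sketch below is only an illustration: the `allowed_by_robots` helper and the sample rules are made up for the example and are not Testudo's.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_text, url, user_agent="*"):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules, only for illustration.
sample_rules = """User-agent: *
Disallow: /private/
"""

print(allowed_by_robots(sample_rules, "https://example.com/private/page"))
print(allowed_by_robots(sample_rules, "https://example.com/soc/201808/INST"))
```

In the notebook, `robots_text` would be `r.text` from a request that returned status 200.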

#### Step 2: Is there a robots meta tag?

Meta tags can appear on any HTML page, so we should check every page we scrape. To make that easy, we should write a function that checks whether a page has robots meta tags.

In [5]:
testudo_url = "https://app.testudo.umd.edu/"
r = requests.get(testudo_url)
r.status_code

200

In [6]:
parsed = BeautifulSoup(r.text, 'lxml')
type(parsed)

bs4.BeautifulSoup

In [7]:
parsed.find_all('meta')

[<meta charset="utf-8"/>]
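
The only meta tag on this page sets the character encoding, so there is no robots meta tag. Because this check has to be repeated for every page, it could be wrapped in a function along the following lines. This is a sketch, not the worksheet's official solution: the name `robots_meta_directives` is made up, and it uses Python's built-in `'html.parser'` rather than `'lxml'` so that it runs without extra dependencies.

```python
import re

from bs4 import BeautifulSoup


def robots_meta_directives(html):
    """Return the lowercased directives from every <meta name="robots"> tag."""
    # 'html.parser' is the built-in parser; the worksheet itself uses 'lxml'.
    soup = BeautifulSoup(html, "html.parser")
    directives = []
    # Match the name attribute case-insensitively ("robots", "ROBOTS", ...).
    for tag in soup.find_all("meta", attrs={"name": re.compile(r"^robots$", re.I)}):
        content = tag.get("content", "")
        directives.extend(
            part.strip().lower() for part in content.split(",") if part.strip()
        )
    return directives


# Hypothetical pages, only for illustration.
blocked_page = '<html><head><meta name="robots" content="NOINDEX, nofollow"></head></html>'
open_page = '<html><head><meta charset="utf-8"/></head></html>'

print(robots_meta_directives(blocked_page))  # ['noindex', 'nofollow']
print(robots_meta_directives(open_page))     # []
```

An empty list means the page carries no robots meta tag; a list containing `noindex`, `nofollow`, or `none` is a signal to stop and reconsider.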

#### Step 3: Is there an X-Robots-Tag in the header?

In [8]:
'X-Robots-Tag' in r.headers.keys()

False
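
`r.headers` in requests is a case-insensitive dictionary, so the membership test above matches `x-robots-tag` in any capitalization. A plain `dict` does not behave that way. As a sketch, a helper that works on any mapping of headers and also returns the directives might look like this; the name `x_robots_directives` is made up, and it ignores the user-agent-specific form of the header.

```python
def x_robots_directives(headers):
    """Return lowercased X-Robots-Tag directives from a mapping of HTTP headers."""
    for name, value in headers.items():
        # Header names are case-insensitive, so compare in lowercase.
        if name.lower() == "x-robots-tag":
            return [part.strip().lower() for part in value.split(",") if part.strip()]
    return []


# Hypothetical header sets, only for illustration.
print(x_robots_directives({"Content-Type": "text/html"}))          # []
print(x_robots_directives({"x-robots-tag": "noindex, nofollow"}))  # ['noindex', 'nofollow']
```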

#### Not programming, but looking:
Are there policies or licensing agreements that prevent or allow our scraping of the data?  



There is a policy / rule, but it does not mention anything about scraping.

## What iSchool Classes are listed on Testudo for Fall 2018?

It is getting close to registration time. Wouldn't it be nice to be notified automatically when the listed courses change?  

The URL I have provided goes straight to the INST course listings for the Fall 2018 semester. We'll need to parse the page and get a list of the courses, their sections, and the times each section meets.

In [24]:
testudo_url = "https://app.testudo.umd.edu/soc/201808/INST"

### First get the page.

<ol><li>Get the page.</li>
<li>Check the response status.</li>
<li>Parse the response with BeautifulSoup</li>
<li>Check if there is a robots meta.</li>
<li>Check if there is a 'x-robots-tag' in the header response.</li></ol>

In [27]:
r = requests.get(testudo_url)
r.status_code

200

In [28]:
parsed = BeautifulSoup(r.text, 'lxml')

In [34]:
courses = parsed.find_all('div', attrs={'class': 'course'})

In [36]:
courses = parsed.select('div.course')

In [37]:
len(courses)

53

In [38]:
course = courses[0]

In [45]:
course_id = course.select_one('.course-id').text

In [54]:
#course.select_one('.course-title').text
course_title = course.find(attrs={'class':'course-title'}).text

In [55]:
course_credit = course.select_one('.course-min-credits').text

#### Find the elements to grab.

In your browser, use the inspector to find the element that contains each course.

What element contains each course?

*Your answer here*

#### Now get a list of each of those elements from the parsed html

How many do you get?

#### Let's test on the first course

Create a dictionary that contains:  
- Course ID
- Course Title
- Course Credits
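
One way to build that dictionary is sketched below. It reuses the class names selected in the cells above (`course-id`, `course-title`, `course-min-credits`); the sample HTML is a simplified stand-in for the real page, and `parse_course` is a made-up helper name.

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical markup that mimics one course container.
sample_html = """
<div class="course">
  <div class="course-id"> INST126 </div>
  <span class="course-title">Example Course Title</span>
  <span class="course-min-credits">3</span>
</div>
"""


def parse_course(course_tag):
    """Build the course dictionary from one div.course element."""
    return {
        "course_id": course_tag.select_one(".course-id").text.strip(),
        "course_title": course_tag.select_one(".course-title").text.strip(),
        "course_credits": course_tag.select_one(".course-min-credits").text.strip(),
    }


soup = BeautifulSoup(sample_html, "html.parser")
course_info = parse_course(soup.select_one("div.course"))
print(course_info)
```

Calling `.strip()` on each value removes the surrounding whitespace that often comes along with the text of an element.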

#### Loading sections:
The sections are kept on a separate page and loaded with JavaScript:  

> https://app.testudo.umd.edu/soc/201808/sections?courseIds=<course-id\>  
    
You need to replace the <course-id\> with the course id of the course whose sections you want to look up.


Make a request to get the sections for that first course that you worked with and parse that response.
We will then add the sections' information to the dictionary you created above for the course.

#### Make the request
With requests.get we can build the query string (the part after the '?') by passing a dictionary as the second argument (the params argument). This makes building complex queries much easier over time and prevents you from passing the same key multiple times.

We do this like:
> requests.get(url, {'key': 'value'})

In [57]:
r = requests.get('https://app.testudo.umd.edu/soc/201808/sections', 
            {'courseIds': 'INST126'})

In [59]:
r.status_code

200
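
To see exactly which URL a params dictionary produces, the request can be prepared without being sent. This is a small sketch using `requests.Request(...).prepare()`; the variable names are chosen for the example.

```python
import requests

base_url = "https://app.testudo.umd.edu/soc/201808/sections"
params = {"courseIds": "INST126"}

# Prepare the request without sending it, to inspect the final URL.
prepared = requests.Request("GET", base_url, params=params).prepare()
print(prepared.url)
# https://app.testudo.umd.edu/soc/201808/sections?courseIds=INST126
```

The same URL is also available after a real request as `r.url`.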

In [65]:
sect = BeautifulSoup(r.text, 'lxml')

#### Create a parsed 'soup' object from the response with BeautifulSoup

#### Get the section's container element

Go to your browser and find the container element that holds each section's info. Then create a list with each section in it:

In [66]:
section_dict = sect.select('div.delivery-f2f')

In [67]:
len(section_dict)

3

#### Now get the info for each section.
Save the section_id, instructor, and days with time that the class meets. 
Save each section into the dictionary that you used for the course.

You should end up with a data structure that looks like:
<pre>
[{'instructor': ['Instructor Name'],
  'meeting_place': 'BUILDING ROOM#',
  'meeting_time': 'DAYS TIME',
  'section_id': 'SECTION_ID'}]
</pre>
That is, a list that contains a dictionary for each section. Note that the value for the key 'instructor' is also a list, because sometimes there are multiple instructors for a section.

In [69]:
first_section = section_dict[0]

In [73]:
section_id = first_section.select_one('.section-id').text.strip()

In [76]:
first_section.select_one('.class-building').text.strip()

'ATL\n1113'
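
The building and room come back separated by a newline, while the target structure wants 'BUILDING ROOM#' on one line. A minimal sketch of the cleanup and of assembling one section dictionary follows. `clean_text` and `build_section` are made-up helper names, and the values passed in are the worksheet's placeholders, not scraped data.

```python
def clean_text(text):
    """Collapse any run of whitespace (including newlines) into one space."""
    return " ".join(text.split())


def build_section(section_id, instructors, building_text, time_text):
    """Assemble one section dictionary in the shape the worksheet asks for."""
    return {
        "section_id": clean_text(section_id),
        "instructor": [clean_text(name) for name in instructors],
        "meeting_place": clean_text(building_text),
        "meeting_time": clean_text(time_text),
    }


print(clean_text("ATL\n1113"))  # ATL 1113

# Placeholder values, in the worksheet's own notation.
section = build_section("SECTION_ID", ["Instructor Name"], "ATL\n1113", "DAYS\nTIME")
print(section["meeting_place"])  # ATL 1113
```

In the notebook, each argument would come from a `select_one(...).text` or `select(...)` call on the section element.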

#### Add the sections to the course dictionary

Add the sections information to the course information so that you end up with a structure like:

<pre>
{'course_credits': 'NUM CREDITS',
 'course_id': 'COURSE_ID',
 'course_title': 'COURSE_TITLE',
 'sections': [{'instructor': ['INSTRUCTOR NAME'],
   'meeting_place': 'BUILDING ROOM#',
   'meeting_time': 'DAYS TIME',
   'section_id': 'SECTION_ID'}]}
</pre>

### Collect them All!

Ok, so you've just collected the info for a single course. Now do it for all of the courses for INST.

The result should be a data structure that when printed looks like:

<pre>
[{'course_id': 'COURSE_ID',
 'course_title': 'COURSE_TITLE',
 'course_credits': 'NUM CREDITS',
 'sections': [{'instructor': ['INSTRUCTOR NAME'],
   'meeting_place': 'BUILDING ROOM#',
   'meeting_time': 'DAYS TIME',
   'section_id': 'SECTION_ID'}]}
]
</pre>
That is, a list of dictionaries, one per course. Each dictionary contains the information for the course and has a key called sections. The key sections contains a list of dictionaries that hold each section's information.
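
The overall loop can be sketched as follows. To keep the example runnable without network access, the section lookup is passed in as a function. `collect_courses`, `fake_fetch_sections`, and the `delay_seconds` parameter are all assumptions made for this illustration; in the notebook, the lookup would request and parse the sections page for each course id.

```python
import time


def collect_courses(courses, fetch_sections, delay_seconds=0.0):
    """Return a new list of course dicts, each with a 'sections' key added.

    courses: list of dicts that each have a 'course_id' key.
    fetch_sections: callable that takes a course id and returns a list of
        section dicts.
    delay_seconds: pause between lookups, to avoid hammering the server.
    """
    results = []
    for index, course in enumerate(courses):
        if index > 0 and delay_seconds > 0:
            time.sleep(delay_seconds)
        enriched = dict(course)  # copy, so the input list is left unchanged
        enriched["sections"] = fetch_sections(course["course_id"])
        results.append(enriched)
    return results


# Stand-in for the real request, so the sketch runs offline.
def fake_fetch_sections(course_id):
    return [{"section_id": "0101", "instructor": ["Instructor Name"],
             "meeting_place": "BUILDING ROOM#", "meeting_time": "DAYS TIME"}]


courses = [{"course_id": "INST126", "course_title": "COURSE_TITLE", "course_credits": "3"}]
all_courses = collect_courses(courses, fake_fetch_sections)
print(all_courses[0]["sections"][0]["section_id"])  # 0101
```

A small delay between requests is a common courtesy when one page load turns into dozens of follow-up requests.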