# Web scraping
A common data collection task is to collect data from web pages and transform them into an analysis ready format. In this exercise, you'll be scraping the [Informatics Course Information](https://www.washington.edu/students/crscat/info.html) page to ask some basic questions about the courses offered. To do so, we'll be using the [beautiful soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) package, which should have been included as part of the Anaconda python distribtuion. A great starting place for understanding the package is [this tutorial](https://www.dataquest.io/blog/web-scraping-tutorial-python/).

## Set up
In order to use packages in a script (that are downloaded as part of the Anaconda distribution), you will need to `import` them. There are a variety of approaches for importing packages: you can import an entire package, or import only some of its functions. It's common to import a package as an abbreviation for easier use with this syntax:

```
# Import a package as an abbreviation
import requests as r

# Only import some functions (as abbreviations)
from bs4 import BeautifulSoup as bs, SoupStrainer as ss
```

In [45]:
# Import the `requests` and `BeautifulSoup` packages using the code above
import requests as r
from bs4 import BeautifulSoup as bs, SoupStrainer as ss

## Scraping Content

In [46]:
# Use the `get` method of the requests library to fetch the page content
content = r.get("https://www.washington.edu/students/crscat/info.html").content
# Tells if the package was successfully loaded or not 200 means success YAY
# Status code starting with 4 or 5 means error and 2 means good
# page.status_code
# print content
# page.content
# We can also do .get.content


In [47]:
# Use bs to parse the HTML returned
soup = bs(content, 'html.parser')
# print what we parsed
#print(soup.prettify())

In [48]:
# We can now use the `find_all` method to find all course title elements
# Store the *text* of the course titles in variable
# Hint: You'll need to review the HTML to figure out how to identify them
# Hint: use a list comprehension!

# We can first select all the elements at the top level of the page using the children property of soup. 
# children generates list so we need to call list function
# list(parse.children)

# look up type for each child
# [type(item) for item in list(parse.children)]

title_objects = soup.findAll("b")

tit = title_objects[0]
# Just the string value corresponding the object
tit.text
# As a loop or in a list
# Another way to do it -----> titles = [obj.text for obj in title_objects]
titles = []
for obj in title_objects:
    titles.append(obj.text)
#titles

In [49]:
# We can now use the `find_all` method to find all course description elements
# Store the *text* of the course description in variable
# Hint: You'll need to review the HTML to figure out how to identify them
# Hint: you may have to skip certain elements...
desc_objects = soup.find_all('a')

descriptions = []
for i in range(1, 10):
  descriptions.append(desc_objects[i*10 + 1].text)

# print(descriptions)



## Data processing
Now that you have the data, we'll re-structure it so that we can easily ask questions about the data

In [50]:
# Create a dictionary where the *keys are course numbers*, and the values are *dictionaries* 
# with information about that course. Specifically, include the following values: 
#     - "title": title of the course (from above)
#     - "description": description of the course (from above)
#     - "credits": can be a string of the number of credits (some are a range)
#     - "level": 100, 200, 300, or 400 (an *integer*)
# Hint: start with an empty dictionary, and use a loop, keeping track of the *index* using the `enumerate` method
# Hint: think of creative ways to get the credits/level from your string 
# the `.find` method can help you find characters in a string
credits_list = []
course_numbers = []
level_list = []
requirements_list = []
for t in titles:
    start_credits = t.find('(')
    end_credits = t.find(')')
    credits_list.append(t[start_credits+1:end_credits])
    course_numbers.append(t[5:9])
    level_list.append(int(t[5:6]+ "00"))
    requirements_list.append(t[end_credits + 2:])

course_dicts = []
for i in range (1,9):
    course_dicts.append({"title":titles[i], "description":descriptions[i], "credits":credits_list[i], "level":level_list[i], "meets_requirements":requirements_list[i]})
final_dict = dict(zip(course_numbers, course_dicts))
final_dict           


{'101 ': {'title': 'INFO 102 Gender and Information Technology (5) I&S, DIV',
  'description': 'INFO 201 Technical Foundations (5) QSRIntroduces fundamental tools and technologies necessary to transform data into knowledge. Covers the full information lifecycle, including the collection, storage, analysis and visualization of data. Core competencies underlying this process, including functional programming, use of databases, data wrangling, version control, and command line proficiency, are acquired through real-world data-driven challenges.View course details in MyPlan: INFO 201',
  'credits': '5',
  'level': 100,
  'meets_requirements': 'I&S, DIV'},
 '102 ': {'title': 'INFO 180 Introduction to Data Science (4) QSR',
  'description': 'INFO 314 Computer Networks and Distributed Applications (5) NWBasic concepts of local and wide-area computer networking including an overview of services provided by networks, network topologies and hardware, packet switching, client/server architectures

## Asking questions of the data
Now we can filter the dataset to ask questions of interest

In [51]:
# How many courses are 300 level courses?
# Hint: use a list comprehension! 
l = [p for p in level_list if p == 300]
len(l)

16

In [52]:
# Write a function that takes in your courses object and a course level (100, 200, etc.) and 
# returns all of the *course titles* of courses that are that level

# Make sure to use a doc string to document your function
def find_level(obj, level):
    vals = obj.values()
    return [v['title'] for v in vals if v['level'] == level]
# Make sure to use a doc string to document your function

In [53]:
# Demonstrate that your function works by passing in the `courses` object and a course level
print(find_level(final_dict, 300))

['INFO 300 Research Methods (5)', 'INFO 310 Information Assurance and Cybersecurity (5) I&S, QSR']
