# Cleaning Quiz: Udacity's Course Catalog
In this activity, you're going to extract the following information from each course listing on Udacity's course catalog web page:
1. The course name - e.g. "Data Analyst"
2. The school the course belongs to - e.g. "School of Data Science"

### Step 1: Get text from Udacity's course catalog web page

In [24]:
# import statements
import requests
import re
from bs4 import BeautifulSoup

In [9]:
# fetch web page
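def fetch_webpage(url):
    try:
        r = requests.get(url, timeout=10)
        r.raise_for_status()  # treat HTTP error codes (4xx/5xx) as failures
        return r
    except requests.RequestException as e:
        # catch only request-related errors and return None explicitly
        print("Couldn't fetch the information from the url:", e)
        return None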

In [10]:
r = fetch_webpage('https://www.udacity.com/courses/all')
#r.text

In [22]:
# Try a regex to get rid of the HTML tags
#pattern = re.compile(r'<.*?>')
#print(pattern.sub('', r.text)) 

The regex does remove the HTML tags, but some unnecessary header text and JavaScript still remain.
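To see why tag stripping alone falls short, here is a minimal sketch using only the standard library. The `html` string is a made-up miniature page standing in for `r.text`, not the real catalog markup.

```python
import re

# Hypothetical miniature page, standing in for r.text
html = """<html><head><title>Catalog</title>
<script>var tracking = {id: 1};</script></head>
<body><h2 class="card__title__nd-name">Data Engineer</h2></body></html>"""

# Non-greedy match of everything between '<' and the next '>'
tag_pattern = re.compile(r'<.*?>')
stripped = tag_pattern.sub('', html)

# The tags are gone, but the <title> text and the script body survive
print(stripped)
```

The pattern only deletes the tags themselves, so whatever sat between `<script>` and `</script>` is left behind as plain text. An HTML parser avoids this because it understands the document structure.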

### Step 2: Use BeautifulSoup to remove HTML tags
Use the `"lxml"` parser rather than `"html5lib"`.

In [28]:
soup = BeautifulSoup(r.text, 'lxml')

### Step 3: Find all course summaries
Use BeautifulSoup's `find_all` method to select elements by tag type and class name. To view an item's HTML on the web page, right-click it and click "Inspect".

In [30]:
# Find all course summaries
summaries = soup.find_all("div", class_="card__title-container")
print('Number of Courses:', len(summaries))

Number of Courses: 250
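As a small illustration of how `find_all` filters by tag and class, the sketch below runs it on hand-written sample markup modeled on the catalog cards. The sample HTML and the variable names are assumptions for illustration, and it uses Python's built-in `"html.parser"` so it runs without `lxml` installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, modeled on the catalog's card structure
html = """
<div class="card__title-container">
  <h3 class="card__title__school greyed">School of Data Science</h3>
  <h2 class="card__title__nd-name">Data Engineer</h2>
</div>
<div class="card__title-container">
  <h3 class="card__title__school greyed">School of Data Science</h3>
  <h2 class="card__title__nd-name">Data Analyst</h2>
</div>
<div class="footer">Not a course</div>
"""

# "html.parser" ships with Python; the notebook itself uses "lxml"
sample_soup = BeautifulSoup(html, "html.parser")

# class_ has a trailing underscore because `class` is a Python keyword
by_find_all = sample_soup.find_all("div", class_="card__title-container")

# The same selection written as a CSS selector
by_select = sample_soup.select("div.card__title-container")

print(len(by_find_all), len(by_select))
```

Both calls should pick up the two card containers and skip the footer `div`, since its class does not match.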


### Step 4: Inspect the first summary to find selectors for the course name and school
Tip: `.prettify()` is a super helpful method BeautifulSoup provides to output HTML in a nicely indented form! Make sure to use `print()` so the whitespace is displayed properly.

In [36]:
# print the first summary in summaries
print(summaries[0].prettify())

<div class="card__title-container">
 <span class="catalog-card-tag--desktop">
  New
 </span>
 <h3 class="card__title__school greyed">
  School of Data Science
 </h3>
 <h2 class="card__title__nd-name">
  Data Engineer
 </h2>
</div>



Look for selectors that contain the course title and school name text you want to extract. Then, use the `select_one` method on the summary object to pull out the HTML matching those selectors. The method takes a CSS selector as its argument (awesome!). For reference, the `select` method returns a list of all tags that match the CSS selector.
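The difference between the two methods can be sketched on the markup from the first summary. This is an illustrative example, not part of the original notebook: the selector `h4.does-not-exist` is deliberately made up to show the no-match case, and `"html.parser"` is used so the snippet runs without `lxml`.

```python
from bs4 import BeautifulSoup

# Markup copied from the prettified first summary
html = """
<div class="card__title-container">
 <span class="catalog-card-tag--desktop">New</span>
 <h3 class="card__title__school greyed">School of Data Science</h3>
 <h2 class="card__title__nd-name">Data Engineer</h2>
</div>
"""
card = BeautifulSoup(html, "html.parser").div

first_match = card.select_one("h2.card__title__nd-name")  # a single Tag
all_matches = card.select("h2.card__title__nd-name")      # a list of Tags
no_match = card.select_one("h4.does-not-exist")           # None when nothing matches

print(type(first_match).__name__, len(all_matches), no_match)
```

Because `select_one` returns `None` when nothing matches, calling `.get_text()` directly on its result raises an `AttributeError` for any card that lacks the expected tag.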

Afterwards, don't forget to do some extra cleaning to isolate the names (get rid of the unnecessary HTML and surrounding whitespace).

In [46]:
# Extract course title
summaries[0].select_one('h2.card__title__nd-name').get_text().strip()

'Data Engineer'

In [47]:
# Extract school
summaries[0].select_one('h3.card__title__school').get_text().strip()

'School of Data Science'

### Step 5: Collect names and schools of ALL course listings
Reuse your code from the previous step, but now in a loop to extract the name and school from every course summary in `summaries`!

In [52]:
courses = []
for summary in summaries:
    courses.append(summary.select_one('h2.card__title__nd-name').get_text().strip())

In [51]:
# display results
print(len(courses), "course summaries found. Sample:")
courses[:20]

250 course summaries found. Sample:


['Data Engineer',
 'Data Analyst',
 'Introduction to Programming',
 'Deep Learning',
 'Full Stack Web Developer',
 'UX Designer',
 'Data Scientist',
 'Business Analytics',
 'Self Driving Car Engineer',
 'Programming for Data Science with Python',
 'Machine Learning Engineer',
 'C++',
 'Digital Marketing',
 'SQL',
 'AI Programming with Python',
 'Front End Web Developer',
 'AI Product Manager',
 'Cloud DevOps Engineer',
 'Artificial Intelligence for Trading',
 'DevOps Engineer for Microsoft Azure']
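The loop above collects only the course names. One possible way to gather the school alongside each name is sketched below. The helper `extract_course`, the sample markup, and the variable names are hypothetical, and the second sample card intentionally omits the school tag to show how a missing field could be handled.

```python
from bs4 import BeautifulSoup

def extract_course(summary):
    """Return (name, school) for one summary; a missing field becomes None."""
    name_tag = summary.select_one("h2.card__title__nd-name")
    school_tag = summary.select_one("h3.card__title__school")
    name = name_tag.get_text().strip() if name_tag is not None else None
    school = school_tag.get_text().strip() if school_tag is not None else None
    return name, school

# Hypothetical sample; in the notebook, the summaries come from soup.find_all(...)
html = """
<div class="card__title-container">
 <h3 class="card__title__school greyed">
  School of Data Science
 </h3>
 <h2 class="card__title__nd-name">
  Data Engineer
 </h2>
</div>
<div class="card__title-container">
 <h2 class="card__title__nd-name">
  Data Analyst
 </h2>
</div>
"""
sample_summaries = BeautifulSoup(html, "html.parser").find_all(
    "div", class_="card__title-container"
)

# One (name, school) tuple per course summary
course_pairs = [extract_course(s) for s in sample_summaries]
print(course_pairs)
```

In the notebook itself, the same helper could be applied to `summaries` in place of `sample_summaries`. Returning `None` for a missing tag is just one design choice; skipping incomplete cards would work as well.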