# Cleaning Quiz: Udacity's Course Catalog
It's your turn! Udacity's [course catalog page](https://www.udacity.com/courses/all) has changed since the last video was filmed. One notable change is the introduction of _schools_.

In this activity, you're going to perform similar actions with BeautifulSoup to extract the following information from each course listing on the page:
1. The course name - e.g. "Data Analyst"
2. The school the course belongs to - e.g. "School of Data Science"

**Note: All solution notebooks can be found by clicking on the Jupyter icon on the top left of this workspace.**

### Step 1: Get text from Udacity's course catalog web page
You can use the `requests` library to do this.
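
As a minimal sketch of a slightly more defensive fetch, the helper below wraps `requests.get` with a timeout and an HTTP status check. The name `fetch_html` and the 10-second default are assumptions made for illustration; they are not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the body of `url` as text, raising on HTTP errors.

    `fetch_html` and its default timeout are hypothetical, chosen for this sketch.
    """
    # Without a timeout, requests can wait indefinitely on an unresponsive server.
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return response.text


# Example call (requires network access):
# html = fetch_html("https://learndataengineering.com/p/all-courses")
```

Checking the status first means a failed request surfaces as an exception rather than as confusing parse results later on.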

You may have to scroll down past the JavaScript and CSS in the output of the last cell in this section to see the text.

**UPDATE: I had to change the web page used for this notebook because Udacity uses JavaScript to generate the course list. BeautifulSoup only parses the HTML that was fetched and does not run JavaScript, so content rendered afterwards is not available to it.**
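
The limitation can be reproduced offline. The sketch below uses a made-up page (not Udacity's actual markup) whose list is empty in the HTML and would only be filled in by a script running in a browser.

```python
from bs4 import BeautifulSoup

# Hypothetical page: the <ul> is empty in the HTML; a browser would fill it via the script.
html = """
<html><body>
  <ul id="courses"></ul>
  <script>
    document.getElementById("courses").innerHTML = "<li>Data Analyst</li>";
  </script>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# BeautifulSoup never executes the script, so the list has no items.
items = soup.find("ul", id="courses").find_all("li")
print(len(items))  # 0

# The course name exists only as text inside the script source.
print("Data Analyst" in soup.find("script").string)  # True
```

When a page behaves like this, the usual options are to find a statically rendered page (as done here) or to use a tool that drives a real browser.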

In [11]:
# import statements
import requests 
from bs4 import BeautifulSoup

In [44]:
# fetch web page
#r = requests.get('https://www.udacity.com/courses/all')
r = requests.get('https://learndataengineering.com/p/all-courses')

In [45]:
# display text from web page
print(r.text)


<!DOCTYPE html>
<html>
<head>
<link href='https://process.fs.teachablecdn.com/ADNupMnWyR7kCWRvm76Laz/resize=width:32,height:32/https://www.filepicker.io/api/file/pEKwD6EKSGgLHDxaM9yA' rel='icon' type='image/png'>
<link href='https://process.fs.teachablecdn.com/ADNupMnWyR7kCWRvm76Laz/resize=width:72,height:72/https://www.filepicker.io/api/file/pEKwD6EKSGgLHDxaM9yA' rel='apple-touch-icon' type='image/png'>
<link href='https://process.fs.teachablecdn.com/ADNupMnWyR7kCWRvm76Laz/resize=width:144,height:144/https://www.filepicker.io/api/file/pEKwD6EKSGgLHDxaM9yA' rel='apple-touch-icon' type='image/png'>
<link href='https://process.fs.teachablecdn.com/ADNupMnWyR7kCWRvm76Laz/resize=width:320,height:345/https://www.filepicker.io/api/file/tkoAhgcsTyakndIvJlvn' rel='apple-touch-startup-image' type='image/png'>
<link href='https://process.fs.teachablecdn.com/ADNupMnWyR7kCWRvm76Laz/resize=width:640,height:690/https://www.filepicker.io/api/file/tkoAhgcsTyakndIvJlvn' rel='apple-touch-startup-image' 

### Step 2: Use BeautifulSoup to remove HTML tags
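The original instructions recommend `"lxml"` rather than `"html5lib"`. The cell below uses Python's built-in `"html.parser"` instead, which needs no extra install.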

Again, you may have to scroll down past the JavaScript and CSS in the output of the last cell in this section to see the text. **Alternatively,** you can run the following two lines right before running `soup.get_text()`:

```python
for script in soup(["script", "style"]):
    script.decompose()
```
Read more about this in [this Stack Overflow question](https://stackoverflow.com/questions/22799990/beatifulsoup4-get-text-still-has-javascript).
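
The long runs of blank lines in the text output can also be trimmed. The following is a small sketch on an inline sample (the HTML is invented for illustration) that removes `script` and `style` elements and then asks `get_text` to strip each string and join the results with newlines.

```python
from bs4 import BeautifulSoup

# Invented sample with script/style noise and uneven whitespace.
html = """
<html><head>
  <title>All Courses</title>
  <style>body { margin: 0; }</style>
  <script>console.log("tracking");</script>
</head><body>

  <h3>  Docker Fundamentals  </h3>


  <h4>Learn all the fundamental Docker concepts with hands-on examples</h4>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Remove script and style elements so their contents cannot appear in the text.
for element in soup(["script", "style"]):
    element.decompose()

# strip=True trims each string and drops whitespace-only ones;
# separator="\n" puts each remaining string on its own line.
text = soup.get_text(separator="\n", strip=True)
print(text)
```

Whether `get_text()` includes script contents by default appears to depend on the BeautifulSoup version, so removing those elements explicitly is the safer choice.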

In [46]:
#soup = BeautifulSoup(r.text, "lxml")
soup = BeautifulSoup(r.content, 'html.parser')
#soup = BeautifulSoup(r.text, "html5lib")
print(soup.get_text())

















All Data Engineering Courses | Learn Data Engineering











































Academy




All Courses




Login




Sign Up














Explore each individual topicClick on each individual topic below to view the introduction video, read the complete syllabus and see a list of all the lessons.








 1. Data Engineering Basics









Introduction to Data Engineering





Introduction to Data Engineering with over 1 hour of videos including my journey here.





Andreas Kretz




%

COMPLETE














Computer Science Fundamentals





A complete guide of topics and resources you should know as a Data Engineer.





Andreas Kretz




%

COMPLETE














Introduction to Python





Learn all the fundamentals of Python to start coding quick





Amit Jain




%

COMPLETE














Python for Data Engineers





Learn all the Python topics a Data Engineer needs even if you don't have a coding background





Andreas Kretz




%

COMPL

### Step 3: Find all course summaries
Use BeautifulSoup's `find_all` method to select elements by tag type and class name. Just like in the video, you can right-click an item on the web page and click "Inspect" to view its HTML.
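
As a quick illustration of selecting by tag and class, the sketch below runs `find_all` on a trimmed-down sample modeled on the card markup of the page used in this notebook. The sample HTML is an assumption; the real page carries more classes and attributes.

```python
from bs4 import BeautifulSoup

# Simplified sample modeled on the page's course cards (assumed, not copied verbatim).
html = """
<div class="featured-product-card card-style-grid">
  <div class="featured-product-card__content">
    <h3 class="featured-product-card__content__title">Introduction to Python</h3>
  </div>
</div>
<div class="featured-product-card card-style-grid">
  <div class="featured-product-card__content">
    <h3 class="featured-product-card__content__title">Docker Fundamentals</h3>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Two equivalent spellings: the class_ keyword (class is reserved in Python)
# and an attribute dictionary.
by_keyword = soup.find_all("div", class_="featured-product-card__content")
by_dict = soup.find_all("div", {"class": "featured-product-card__content"})
print(len(by_keyword), len(by_dict))  # 2 2

# A single class name matches even when the element has several classes.
cards = soup.find_all("div", class_="card-style-grid")
print(len(cards))  # 2
```

Note that the class filter compares against whole class names, so `featured-product-card__content` does not match the outer `featured-product-card` elements.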

In [47]:
tags = {tag.name for tag in soup.find_all()}
tags

{'a',
 'body',
 'br',
 'button',
 'div',
 'footer',
 'h3',
 'h4',
 'head',
 'header',
 'html',
 'img',
 'li',
 'link',
 'main',
 'meta',
 'nav',
 'p',
 'script',
 'section',
 'span',
 'strong',
 'style',
 'title',
 'ul'}

In [48]:
# collect the distinct class attribute values used by <div> elements
class_list = set()

for div in soup.find_all("div"):
    # get() returns None when the tag has no class attribute
    classes = div.get("class")
    if classes:
        class_list.add(" ".join(classes))

print(class_list)

{'featured-product-card__meta', 'course-block block text block-custom-name-overview-headline', 'featured-product-card__progressbar hidden', 'block__featured-products__heading_text b-124815742-heading_color rich-text', 'course-block block button', 'block__columns b-124815741-content_width', 'featured-product-card__image-container', 'footer', 'featured-product-card card-style-grid block__column b-124815741-card_background_color b-124815741-card_border_color b-124815741-card_border_width b-124815741-card_border_radius b-124815741-card_text_alignment', 'root', 'featured-product-card card-style-grid block__column b-124815742-card_background_color b-124815742-card_border_color b-124815742-card_border_width b-124815742-card_border_radius b-124815742-card_text_alignment', 'featured-product-card__content', 'blocks-page blocks-page-course_sales_page_v2', 'course-block block featured_products', 'block__featured-products__heading_text b-124815740-heading_color rich-text', 'block__columns__fixed b-

In [71]:
# Find all course summaries
#summaries = soup.find_all('div', class_="course-summary-card")
#summaries = soup.find_all('div', class_="card_overview__sRYmr")
#summaries = soup.find_all('div', class_="card_body__5gAnU")
#summaries = soup.find_all('p', class_="card_summary__uxD6t")
#summaries = soup.find_all('div')
#summaries = soup.find_all("div", {"class": "featured-product-card__content"})
titles = soup.find_all("h3", {"class": "featured-product-card__content__title"})
summaries = soup.find_all("h4", {"class": "featured-product-card__content__subtitle"})

course_objects = soup.find_all("div", {"class": "featured-product-card__content"})
# print each course title followed by its indented description
for course in course_objects:
    print(course.select_one("h3").get_text().strip())
    print("\t", course.select_one("h4").get_text().strip())

print('Number of Courses:', len(summaries))
print(summaries[0].text)
for title in titles:
    print(title.text.strip())

Introduction to Data Engineering
	 Introduction to Data Engineering with over 1 hour of videos including my journey here.
Computer Science Fundamentals
	 A complete guide of topics and resources you should know as a Data Engineer.
Introduction to Python
	 Learn all the fundamentals of Python to start coding quick
Python for Data Engineers
	 Learn all the Python topics a Data Engineer needs even if you don't have a coding background
Docker Fundamentals
	 Learn all the fundamental Docker concepts with hands-on examples
Data Platform And Pipeline Design
	 Learn how to build data pipelines with templates and examples for Azure, GCP and Hadoop.
Platform & Pipelines Security
	 Learn the important security fundamentals for Data Engineering
Choosing Data Stores
	 Learn the different types of data stores and when to use which.
Schema Design Data Stores
	 Learn to define schemas for SQL, NoSQL databases and Data Warehouses
Modern Data Warehouses & Data Lakes
	 How to integrate a Data Lake with a

### Step 4: Inspect the first summary to find selectors for the course name and school
Tip: `.prettify()` is a helpful BeautifulSoup method that outputs HTML in a nicely indented form. Use `print()` so that the whitespace is displayed properly.

In [72]:
# print the first course card in course_objects
print(course_objects[0].prettify())

<div class="featured-product-card__content">
 <h3 class="featured-product-card__content__title" title="Introduction to Data Engineering">
  Introduction to Data Engineering
 </h3>
 <div aria-hidden="true" class="featured-product-card__progressbar hidden">
  <div aria-labelledby="percent-complete-1296272" aria-valuemax="100" aria-valuemin="0" class="featured-product-card__progressbar-fill" role="progressbar">
  </div>
 </div>
 <h4 class="featured-product-card__content__subtitle" title="Introduction to Data Engineering with over 1 hour of videos including my journey here.">
  Introduction to Data Engineering with over 1 hour of videos including my journey here.
 </h4>
 <div class="featured-product-card__meta">
  <div class="featured-product-card__meta__item featured-product-card__author">
   <img alt="Andreas Kretz" src="https://process.fs.teachablecdn.com/ADNupMnWyR7kCWRvm76Laz/resize=width:30,height:30/https://www.filepicker.io/api/file/Fj1QqQcCRBKKKfRizqg0"/>
   <p class="featured-pro

Look for selectors that contain the course title and school name text you want to extract (on the page used here, the course description stands in for the school). Then use the `select_one` method on the summary object to pull out the HTML matching those selectors. Afterwards, don't forget to do some extra cleaning to isolate the names (get rid of unnecessary HTML), as you saw in the last video.
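
`select_one` returns `None` when nothing matches, so chaining `.get_text()` directly raises an `AttributeError` on cards that lack an element. The sketch below shows one way to guard against that, using class-based CSS selectors; the helper name `text_or_none` and the sample HTML are hypothetical.

```python
from bs4 import BeautifulSoup


def text_or_none(parent, selector):
    """Return the stripped text of the first match for `selector`, or None.

    Hypothetical helper for this sketch; not part of BeautifulSoup.
    """
    element = parent.select_one(selector)
    return element.get_text(strip=True) if element is not None else None


# Sample card that has a title but no subtitle.
html = """
<div class="featured-product-card__content">
  <h3 class="featured-product-card__content__title">
    Introduction to Data Engineering
  </h3>
</div>
"""
card = BeautifulSoup(html, "html.parser").select_one(
    "div.featured-product-card__content"
)

title = text_or_none(card, "h3.featured-product-card__content__title")
subtitle = text_or_none(card, "h4.featured-product-card__content__subtitle")
print(title)     # Introduction to Data Engineering
print(subtitle)  # None
```

Selecting by class rather than by bare tag name (`"h3"`) is a little more verbose but keeps working if the card later gains another heading of the same level.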

In [73]:
# Extract course title
course_objects[0].select_one("h3").get_text().strip()

'Introduction to Data Engineering'

In [74]:
# Extract school/description
course_objects[0].select_one("h4").get_text().strip()

'Introduction to Data Engineering with over 1 hour of videos including my journey here.'

### Step 5: Collect names and schools of ALL course listings
Reuse your code from the previous step, but now in a loop to extract the name and school from every course card in `course_objects`.
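
One possible shape for such a loop is sketched below on a small inline sample. It stores each course as a dictionary and skips cards that are missing a title or description. The sample HTML and the dictionary keys are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Invented sample: three cards, the last one without a description.
html = """
<div class="featured-product-card__content">
  <h3>Introduction to Python</h3>
  <h4>Learn all the fundamentals of Python to start coding quick</h4>
</div>
<div class="featured-product-card__content">
  <h3>Docker Fundamentals</h3>
  <h4>Learn all the fundamental Docker concepts with hands-on examples</h4>
</div>
<div class="featured-product-card__content">
  <h3>Untitled Draft</h3>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

courses = []
for card in soup.find_all("div", class_="featured-product-card__content"):
    title = card.select_one("h3")
    description = card.select_one("h4")
    # Skip incomplete cards instead of failing on a missing element.
    if title is None or description is None:
        continue
    courses.append(
        {
            "title": title.get_text(strip=True),
            "description": description.get_text(strip=True),
        }
    )

print(len(courses))  # 2
print(courses[0]["title"])  # Introduction to Python
```

Dictionaries make the fields self-describing; tuples work equally well if you only need positional access.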

In [77]:
courses = []
for course in course_objects:
    # append the title and description of each course card to the courses list
    title = course.select_one("h3").get_text().strip()
    description = course.select_one("h4").get_text().strip()
    courses.append((title, description))

In [78]:
# display results
print(len(courses), "course summaries found. Sample:")
courses[:20]

26 course summaries found. Sample:


[('Introduction to Data Engineering',
  'Introduction to Data Engineering with over 1 hour of videos including my journey here.'),
 ('Computer Science Fundamentals',
  'A complete guide of topics and resources you should know as a Data Engineer.'),
 ('Introduction to Python',
  'Learn all the fundamentals of Python to start coding quick'),
 ('Python for Data Engineers',
  "Learn all the Python topics a Data Engineer needs even if you don't have a coding background"),
 ('Docker Fundamentals',
  'Learn all the fundamental Docker concepts with hands-on examples'),
 ('Data Platform And Pipeline Design',
  'Learn how to build data pipelines with templates and examples for Azure, GCP and Hadoop.'),
 ('Platform & Pipelines Security',
  'Learn the important security fundamentals for Data Engineering'),
 ('Choosing Data Stores',
  'Learn the different types of data stores and when to use which.'),
 ('Schema Design Data Stores',
  'Learn to define schemas for SQL, NoSQL databases and Data Wareho