# Cleaning Quiz: Udacity's Course Catalog
It's your turn! Udacity's [course catalog page](https://www.udacity.com/courses/all) has changed since the last video was filmed. One notable change is the introduction of _schools_.

In this activity, you're going to perform similar actions with BeautifulSoup to extract the following information from each course listing on the page:
1. The course name - e.g. "Data Analyst"
2. The school the course belongs to - e.g. "School of Data Science"

**Note: All solution notebooks can be found by clicking on the Jupyter icon on the top left of this workspace.**

### Step 1: Get text from Udacity's course catalog web page
You can use the `requests` library to do this.

You may have to scroll past the JavaScript and CSS in the output of the last cell in this section to see the text.
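
If the request fails, `r.text` may hold an error page rather than the catalog. The sketch below shows one way to guard against that. The helper name `fetch_html` and the 10-second timeout are assumptions for illustration, not part of the quiz.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the decoded body of `url`, raising on HTTP errors.

    `fetch_html` is a hypothetical helper; the timeout value is arbitrary.
    """
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of
    # silently handing an error page to the parser.
    response.raise_for_status()
    return response.text
```

Calling `fetch_html("https://www.udacity.com/courses/all")` would then return the same string as `r.text` in the cells below, or raise if the server reports an error.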

In [1]:
# import statements
import requests
from bs4 import BeautifulSoup

In [2]:
# fetch web page
r = requests.get("https://www.udacity.com/courses/all")

In [3]:
# display text from web page
print(r.text)

<!DOCTYPE html><html><head>
  <meta charset="utf-8">
  <script type="text/javascript" class="ng-star-inserted">window.NREUM||(NREUM={}),__nr_require=function(t,n,e){function r(e){if(!n[e]){var o=n[e]={exports:{}};t[e][0].call(o.exports,function(n){var o=t[e][1][n];return r(o||n)},o,o.exports)}return n[e].exports}if("function"==typeof __nr_require)return __nr_require;for(var o=0;o<e.length;o++)r(e[o]);return r}({1:[function(t,n,e){function r(t){try{s.console&&console.log(t)}catch(n){}}var o,i=t("ee"),a=t(15),s={};try{o=localStorage.getItem("__nr_flags").split(","),console&&"function"==typeof console.log&&(s.console=!0,o.indexOf("dev")!==-1&&(s.dev=!0),o.indexOf("nr_dev")!==-1&&(s.nrDev=!0))}catch(c){}s.nrDev&&i.on("internal-error",function(t){r(t.stack)}),s.dev&&i.on("fn-err",function(t,n,e){r(e.stack)}),s.dev&&(r("NR AGENT IN DEVELOPMENT MODE"),r("flags: "+a(s,function(t,n){return t}).join(", ")))},{}],2:[function(t,n,e){function r(t,n,e,r,s){try{p?p-=1:o(s||new UncaughtException(t,n,e

### Step 2: Use BeautifulSoup to remove HTML tags
Pass `"lxml"` as the parser rather than `"html5lib"`.

Again, you may have to scroll past the JavaScript and CSS in the output of the last cell in this section to see the text. **Alternatively,** you can run the following two lines just before calling `soup.get_text()`:

```python
for script in soup(["script", "style"]):
    script.decompose()
```
Read more about this [here](https://stackoverflow.com/questions/22799990/beatifulsoup4-get-text-still-has-javascript).
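
To see the effect of this clean-up in isolation, here is a minimal sketch on a small inline page. The HTML string is made up for illustration, and it uses the built-in `"html.parser"` only so that it runs without `lxml` installed. Passing `separator` and `strip=True` to `get_text()` also drops the runs of blank lines that otherwise appear between tags.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page standing in for r.text.
html = """
<html><head>
  <style>body { color: red; }</style>
  <script>console.log("tracking");</script>
</head><body>
  <h3> Data Analyst </h3>
  <h4>School of Data Science</h4>
</body></html>
"""

# "html.parser" ships with Python; the quiz itself uses "lxml".
soup = BeautifulSoup(html, "html.parser")

# Remove <script> and <style> elements so their contents
# do not show up in the extracted text.
for tag in soup(["script", "style"]):
    tag.decompose()

# strip=True trims each text fragment and skips empty ones;
# separator puts each remaining fragment on its own line.
text = soup.get_text(separator="\n", strip=True)
print(text)
```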

In [4]:
soup = BeautifulSoup(r.text, "lxml")

for script in soup(["script", "style"]):
    script.decompose()

print(soup.get_text())
# If this raises an error about lxml, install the lxml package first




优达学城课程分类_在线视频课程-优达学城(Udacity)官网

 选课指南  纳米学位  所有课程  企业业务  邀请红包  登录  注册  选课指南  纳米学位  所有课程  企业业务  邀请红包  免费注册 | 登录 扫码绑定微信，及时获得限时优惠通知随时掌握硅谷新课动态，成长快人一步！课程目录类别全部课程专项提升人工智能/无人驾驶机器学习/深度学习数据分析网站开发云计算VR/ARAndroid/iOS非技术类佐治亚理工学院计算机科学硕士类型纳米学位单项课程课程难度初级中级高级预估完成时间1个月以下1-3个月3个月以上 机器学习云部署 (英)高级快速掌握如何将机器学习模型部署在云端 Java 开发工程师 (英)初级学习主流后端语言，开发硅谷企业级应用，迈出成为后端工程师第一步 AI 产品经理 (英)初级驱动 AI 产品商业化落地，用 AI 为业务赋能，成为炙手可热稀缺人才。合作企业 Figure Eight  数据挖掘 求职直通班中级从数据分析、机器学习开始，掌握推荐系统和大数据计算框架 Spark 等核心技能，通过学习、项目、拓展、案例和求职辅导，系统完善对接职场需求，一站式直达职业目标合作企业 Tableau kaggle Starbucks IBM Watson Bertelsmann  市场营销分析 (英)初级掌握市场营销所需的基本数据技能，使用 Google Analytics 和 Data Studio 呈现你的结论 传感器融合 (英)高级和梅赛德斯-奔驰学习，成为掌握无人驾驶、物联网、机器人开发核心技术的抢手工程师。合作企业 Mercedes  数据洞察与说服技巧 (英)中级掌握数据可视化呈现商业洞察，练就用数据影响决策的说服技巧。合作企业 Tableau  云计算软件开发 (英)中级学习在 AWS 上构建全栈应用，并使用 Kubernetes 和 Serverless 框架分发采用微服务的应用，成为炙手可热的云计算软件开发工程师。 云计算 DevOps (英)中级学习在 AWS 上通过代码部署基础设施和应用，构建 CI/CD 管道，以及使用 Kubernetes 和其他现代工具运维化微服务，成为一名抢手的云计算 DevOps 工程师。 AI 求职直通班中级一站掌握机器学习、深度学习、

### Step 3: Find all course summaries
Use BeautifulSoup's `find_all` method to select elements by tag type and class name. Just like in the video, you can right-click an item on the web page and click "Inspect" to view its HTML.
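
A class filter in `find_all` matches an element if any one of its classes equals the given name, so cards that carry extra classes are still found. The sketch below illustrates this on a made-up snippet (the `sidebar` div and course names are hypothetical) and shows two equivalent spellings, the `class_` keyword and a CSS selector passed to `select`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two course cards with extra classes, one unrelated div.
html = """
<div class="course-summary-card row ng-star-inserted"><h3>Course A</h3></div>
<div class="course-summary-card row"><h3>Course B</h3></div>
<div class="sidebar row"><h3>Not a course</h3></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Three ways to ask for <div> elements with the class "course-summary-card".
by_dict = soup.find_all("div", {"class": "course-summary-card"})
by_keyword = soup.find_all("div", class_="course-summary-card")
by_css = soup.select("div.course-summary-card")

print(len(by_dict), len(by_keyword), len(by_css))  # prints: 2 2 2
```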

In [6]:
# Find all course summaries
summaries = soup.find_all("div", {"class":"course-summary-card"})
print('Number of Courses:', len(summaries))

Number of Courses: 189


### Step 4: Inspect the first summary to find selectors for the course name and school
Tip: `.prettify()` is a super helpful method BeautifulSoup provides to output HTML in a nicely indented form! Make sure to wrap it in `print()` so the whitespace is displayed properly.

In [7]:
# print the first summary in summaries
print(summaries[0].prettify())

<div _ngcontent-sc247="" class="course-summary-card row row-gap-medium ng-star-inserted">
 <div _ngcontent-sc247="" class="col-sm-3">
  <!-- -->
  <img _ngcontent-sc247="" alt="机器学习云部署 (英)" class="course-thumb img-responsive img-bordered center-block ng-star-inserted" height="170" src="https://static-assets.s3.cn-north-1.amazonaws.com.cn/degrees/nd895-cn/nd_card.jpg" width="290"/>
  <!-- -->
  <!-- -->
 </div>
 <div _ngcontent-sc247="" class="col-sm-9 course-text">
  <div _ngcontent-sc247="" class="row">
   <div _ngcontent-sc247="" class="col-sm-8">
    <h3 _ngcontent-sc247="" class="h-slim">
     <a _ngcontent-sc247="" href="/course/machine-learning-cloud-deployment-nanodegree--nd895-cn">
      机器学习云部署 (英)
     </a>
     <span _ngcontent-sc247="" class="badges">
      <!-- -->
     </span>
    </h3>
   </div>
   <div _ngcontent-sc247="" class="col-sm-4 hidden-xs">
    <!-- -->
    <span _ngcontent-sc247="" class="caption text-right ng-star-inserted">
     <span _ngcontent-sc247="" cla

Look for selectors that contain the course title and school name text you want to extract. Then, use the `select_one` method on the summary object to pull out the HTML matching those selectors. Afterwards, don't forget to do some extra cleaning to isolate the names (get rid of unnecessary HTML), as you saw in the last video.

In [8]:
# Extract course title
summaries[0].select_one("h3").get_text().strip()

'机器学习云部署 (英)'

In [12]:
# Extract school
summaries[0].select_one("h4").get_text().strip()
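# The Chinese version of the site has no h4 element, so this raises an error (the English version has one)
# When accessed from within China, the site automatically switches to the Chinese version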

AttributeError: 'NoneType' object has no attribute 'get_text'
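
The error occurs because `select_one` returns `None` when nothing matches, and `None` has no `get_text` method. One possible way to tolerate a missing element is sketched below. The helper name `text_or_none` and the sample card are assumptions for illustration, not part of the quiz solution.

```python
from bs4 import BeautifulSoup


def text_or_none(element, selector):
    """Return the stripped text of the first match for `selector`, or None.

    Hypothetical helper: it only guards against select_one returning None.
    """
    match = element.select_one(selector)
    return match.get_text().strip() if match is not None else None


# Hypothetical card that, like the Chinese site, has an <h3> but no <h4>.
card_html = """
<div class="course-summary-card">
  <h3><a href="/course/example"> Example Course </a></h3>
</div>
"""
card = BeautifulSoup(card_html, "html.parser").select_one("div.course-summary-card")

title = text_or_none(card, "h3")
school = text_or_none(card, "h4")
print((title, school))  # prints: ('Example Course', None)
```

With a guard like this, a loop over every summary can record `None` for the school instead of stopping at the first card that lacks one.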

### Step 5: Collect names and schools of ALL course listings
Reuse your code from the previous step, but now in a loop to extract the name and school from every course summary in `summaries`!

In [15]:
courses = []
for summary in summaries:
    # append the name of each summary to the courses list
    title = summary.select_one("h3").get_text().strip()
    # The school lookup is disabled because the Chinese site has no h4 element:
    # school = summary.select_one("h4").get_text().strip()
    # courses.append((title, school))
    courses.append(title)

In [16]:
# display results
print(len(courses), "course summaries found. Sample:")
courses[:20]

189 course summaries found. Sample:


['机器学习云部署 (英)',
 'Java 开发工程师 (英)',
 'AI 产品经理 (英)',
 '数据挖掘 求职直通班',
 '市场营销分析 (英)',
 '传感器融合 (英)',
 '数据洞察与说服技巧 (英)',
 '云计算软件开发 (英)',
 '云计算 DevOps (英)',
 'AI 求职直通班',
 'C++ 程序设计 (英)',
 '数据结构与算法 (英)',
 'Python 编程入门',
 '28 天入门 Python',
 '高频算法面试题精讲',
 'Python 人工智能入门',
 '数据分析（入门）',
 '数据工程师 (英)',
 '商业数据分析',
 'AI 量化投资 (英)']