# Web Scraping in Python to Get Information About Courses on Udacity
I wanted to see which courses on Udacity formed part of the different Nanodegrees. 
I first got a list of all the courses (accessible on https://www.udacity.com/courses/all) and then visited the URL for each course to extract info about the Nanodegree it belonged to.

Dataquest has an awesome tutorial on web scraping, which I recommend working through if you are new to it: https://www.dataquest.io/blog/web-scraping-beautifulsoup/

The code below uses BeautifulSoup to extract HTML elements from the content of the response returned when a web page is requested.

In [1]:
from bs4 import BeautifulSoup
import requests

response = requests.get("https://www.udacity.com/courses/all")
html_soup = BeautifulSoup(response.content, 'html.parser')

# Find every element that has an href attribute
anchors = html_soup.find_all(href=True)

# Keep only the hrefs that contain 'course' and '--' but not 'nanodegree'
links = []
for a in anchors:
    href = a['href']
    if 'course' in href and '--' in href and 'nanodegree' not in href:
        links.append(href)

# Remove duplicates by converting the list of links to a set, then back to a list
links = list(set(links))
links

['/course/artificial-intelligence-for-robotics--cs373',
 '/course/engagement-monetization-mobile-games--ud407',
 '/course/object-oriented-javascript--ud015',
 '/course/how-to-make-an-ios-app--ud607',
 '/course/model-building-and-validation--ud919',
 '/course/learn-unreal-vr-foundations--nd117',
 '/course/strengthen-your-linkedin-network-and-brand--ud242',
 '/course/data-analysis-with-r--ud651',
 '/course/high-performance-computing--ud281',
 '/course/data-structures-and-algorithms-in-swift--ud1011',
 '/course/javascript-design-patterns--ud989',
 '/course/differential-equations-in-action--cs222',
 '/course/embedded-systems--ud169',
 '/course/swift-for-beginners--ud1022',
 '/course/android-basics-data-storage--ud845',
 '/course/design-of-computer-programs--cs212',
 '/course/big-data-analytics-in-healthcare--ud758',
 '/course/robotics-software-engineer--nd209',
 '/course/android-auto-development--ud875C',
 '/course/android-performance--ud825',
 '/course/browser-rendering-optimization--ud86

In [2]:
len(links)

200

There are 200 unique links, one for each of the different courses on Udacity. These are the courses that can be taken without enrolling in a Nanodegree.
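The filtering and de-duplication from the first cell can also be written as a small pure function, which makes it easy to check without hitting the website. This is only a sketch: the names `is_course_link` and `unique_course_links` are hypothetical helpers, and the sample list is made up apart from the two course paths taken from the output above. Unlike the `set` approach, `dict.fromkeys` keeps the links in first-seen order.

```python
def is_course_link(href):
    """Return True for hrefs that contain 'course' and '--' but not 'nanodegree'."""
    return "course" in href and "--" in href and "nanodegree" not in href


def unique_course_links(hrefs):
    """Keep only course links and drop duplicates, preserving first-seen order."""
    return list(dict.fromkeys(h for h in hrefs if is_course_link(h)))


# Hypothetical sample of href values (first two paths appear in the output above)
sample_hrefs = [
    "/course/data-analysis-with-r--ud651",
    "/course/embedded-systems--ud169",
    "/course/data-analysis-with-r--ud651",  # duplicate, dropped
    "/course/some-nanodegree--nd000",       # made up, excluded by the 'nanodegree' check
    "/courses/all",                         # no '--', excluded
]

print(unique_course_links(sample_hrefs))
```

On the sample list this keeps only the two real course paths, each once.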

The scraping loop below visits each course URL and extracts the information needed. To find the relevant HTML tags and class names, I opened a course page in the browser, right-clicked the element of interest and selected Inspect to view the underlying HTML. Those tag and class names can then be passed to BeautifulSoup's find and find_all methods to get the information.
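Before running the full scraping loop, it can help to try `find` on a small piece of HTML. The sketch below uses a hypothetical fragment that borrows the tag and class names used in this notebook; the real page markup may differ or change over time. The helper `text_or_default` is also hypothetical, and for simplicity it matches on a single class name (`hero__nanodegree--title`) rather than the full class string.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, not copied from the real page
sample_html = """
<div>
  <h1 class="hero__course--title">Data Analysis with R</h1>
  <h3 class="h1 hero__nanodegree--title">Data Analyst</h3>
</div>
"""


def text_or_default(soup, tag, class_name, default="None"):
    """Return the stripped text of the first matching element, or a default."""
    element = soup.find(tag, class_=class_name)
    return element.get_text(strip=True) if element is not None else default


soup = BeautifulSoup(sample_html, "html.parser")
course_name = text_or_default(soup, "h1", "hero__course--title")
nanodegree = text_or_default(soup, "h3", "hero__nanodegree--title")
cost = text_or_default(soup, "h6", "hero__course--type")  # not in the fragment

print(course_name, nanodegree, cost, sep=" | ")
```

Checking for `None` explicitly does the same job as wrapping `.text` in a try/except, since `find` returns `None` when nothing matches.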

In [10]:
import time
from random import randint
from warnings import warn

# Initialise the lists to store the extracted info
course_names = []
course_urls = []
course_nanodegrees = []
course_costs = []

# Preparing the monitoring of the loop
start_time = time.time()
requests_count = 0

# Iterate through each link to extract the Nanodegree the course belongs to and its cost
for link in links:
    # Pause for a random 2 to 6 seconds between requests to avoid being blocked by the website
    time.sleep(randint(2, 6))
    url = "https://www.udacity.com" + link
    response = requests.get(url)
    content = response.content
    html_soup = BeautifulSoup(content, 'html.parser')
    
    # Monitor the requests
    requests_count += 1
    elapsed_time = time.time() - start_time
    print('Request:{}; Frequency: {} requests/s'.format(requests_count, requests_count/elapsed_time))
    #clear_output(wait = True)
    
    # Break the loop if the number of requests is greater than expected
    if requests_count > 250:
        warn('Number of requests was greater than expected.')  
        break 
    
    # find() returns None when an element is missing, so .text raises AttributeError
    try:
        nanodegree = html_soup.find('h3', class_="h1 hero__nanodegree--title").text
    except AttributeError:
        nanodegree = 'None'
    try:
        course_name = html_soup.find('h1', class_='hero__course--title').text
    except AttributeError:
        course_name = 'None'
    try:
        cost = html_soup.find('h6', class_='hero__course--type').text
    except AttributeError:
        cost = 'None'
    course_names.append(course_name)
    course_urls.append(url)
    course_nanodegrees.append(nanodegree)
    course_costs.append(cost)


Request:1; Frequency: 0.17667353821036022 requests/s
Request:2; Frequency: 0.21928629575410366 requests/s
Request:3; Frequency: 0.2584761215221361 requests/s
Request:4; Frequency: 0.22902981383188756 requests/s
Request:5; Frequency: 0.2365505733454123 requests/s
Request:6; Frequency: 0.2304757296488512 requests/s
Request:7; Frequency: 0.20998458772770595 requests/s
Request:8; Frequency: 0.216026468638043 requests/s
Request:9; Frequency: 0.22592739192625771 requests/s
Request:10; Frequency: 0.2294515947997564 requests/s
Request:11; Frequency: 0.21815650310835355 requests/s
Request:12; Frequency: 0.21035776320594227 requests/s
Request:13; Frequency: 0.2179981739328464 requests/s
Request:14; Frequency: 0.2245864723987871 requests/s
Request:15; Frequency: 0.21756654644337867 requests/s
Request:16; Frequency: 0.2154386827341679 requests/s
Request:17; Frequency: 0.21864592059903892 requests/s
Request:18; Frequency: 0.2127145987440594 requests/s
Request:19; Frequency: 0.20985016844926185 requ

Request:153; Frequency: 0.20987099732335124 requests/s
Request:154; Frequency: 0.20930971244899643 requests/s
Request:155; Frequency: 0.20906924865635534 requests/s
Request:156; Frequency: 0.20960853847818198 requests/s
Request:157; Frequency: 0.20883140428140656 requests/s
Request:158; Frequency: 0.20940643107885637 requests/s
Request:159; Frequency: 0.20996630001873517 requests/s
Request:160; Frequency: 0.21024522736406356 requests/s
Request:161; Frequency: 0.21076005370058362 requests/s
Request:162; Frequency: 0.21020951240462143 requests/s
Request:163; Frequency: 0.21074575319773056 requests/s
Request:164; Frequency: 0.2112887732028455 requests/s
Request:165; Frequency: 0.21140224397464308 requests/s
Request:166; Frequency: 0.21116858720093426 requests/s
Request:167; Frequency: 0.2106557535945539 requests/s
Request:168; Frequency: 0.21115507177616222 requests/s
Request:169; Frequency: 0.21147359184195708 requests/s
Request:170; Frequency: 0.21175417239352975 requests/s
Request:171;

In [22]:
#Convert the lists to a dataframe
import pandas as pd
courses_df = pd.DataFrame({'Name': course_names,
                          'URL': course_urls,
                          'Nanodegree': course_nanodegrees,
                          'Cost':course_costs})
courses_df = courses_df.sort_values('Nanodegree')
print(courses_df.info())
courses_df

<class 'pandas.core.frame.DataFrame'>
Int64Index: 200 entries, 176 to 145
Data columns (total 4 columns):
Name          200 non-null object
URL           200 non-null object
Nanodegree    200 non-null object
Cost          200 non-null object
dtypes: object(4)
memory usage: 7.8+ KB
None


Unnamed: 0,Name,URL,Nanodegree,Cost
176,Intro to Psychology,https://www.udacity.com/course/intro-to-psycho...,,Free Course
14,Android Basics: Data Storage,https://www.udacity.com/course/android-basics-...,Android Basics,Free Course
175,Android Basics: User Input,https://www.udacity.com/course/android-basics-...,Android Basics,Free Course
161,Android Basics: Multiscreen Apps,https://www.udacity.com/course/android-basics-...,Android Basics,Free Course
134,Java Programming Basics,https://www.udacity.com/course/java-programmin...,Android Basics,Free Course
81,Android Basics: Networking,https://www.udacity.com/course/android-basics-...,Android Basics,Free Course
127,Mobile Design and Usability for Android,https://www.udacity.com/course/mobile-design-a...,Android Developer,Free Course
32,Passwordless Login Solutions for Android,https://www.udacity.com/course/passwordless-lo...,Android Developer,Free Course
35,Localization Essentials,https://www.udacity.com/course/localization-es...,Android Developer,Free Course
129,Gradle for Android and Java,https://www.udacity.com/course/gradle-for-andr...,Android Developer,Free Course


In [23]:
#Save to csv file
courses_df.to_csv('Udacity_Courses.csv')

To see which courses belong to the Data Analyst Nanodegree, I filtered the courses data frame and stored the result in a new data frame.

In [24]:
data_analyst = courses_df[courses_df['Nanodegree'] == 'Data Analyst']
data_analyst

Unnamed: 0,Name,URL,Nanodegree,Cost
26,Software Analysis & Testing,https://www.udacity.com/course/software-analys...,Data Analyst,Free Course
115,A/B Testing,https://www.udacity.com/course/ab-testing--ud257,Data Analyst,Free Course
146,Human-Computer Interaction,https://www.udacity.com/course/human-computer-...,Data Analyst,Free Course
187,Data Visualization and D3.js,https://www.udacity.com/course/data-visualizat...,Data Analyst,Free Course
7,Data Analysis with R,https://www.udacity.com/course/data-analysis-w...,Data Analyst,Free Course
37,Data Analysis and Visualization,https://www.udacity.com/course/data-analysis-a...,Data Analyst,Free Course
173,App Marketing,https://www.udacity.com/course/app-marketing--...,Data Analyst,Free Course
49,Data Visualization in Tableau,https://www.udacity.com/course/data-visualizat...,Data Analyst,Free Course
121,Programming Languages,https://www.udacity.com/course/programming-lan...,Data Analyst,Free Course
164,Data Wrangling with MongoDB,https://www.udacity.com/course/data-wrangling-...,Data Analyst,Free Course
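
The same question can be asked for every Nanodegree at once by grouping the course names by their Nanodegree. The following is a minimal sketch using only the standard library; `group_by_nanodegree` is a hypothetical helper, and the sample lists hold just a few rows copied from the tables above rather than the full scraped data.

```python
from collections import defaultdict

# A few (name, Nanodegree) pairs copied from the tables above
sample_names = [
    "Android Basics: Data Storage",
    "Android Basics: User Input",
    "Localization Essentials",
    "A/B Testing",
]
sample_nanodegrees = [
    "Android Basics",
    "Android Basics",
    "Android Developer",
    "Data Analyst",
]


def group_by_nanodegree(names, nanodegrees):
    """Map each Nanodegree to the list of course names that belong to it."""
    grouped = defaultdict(list)
    for name, nanodegree in zip(names, nanodegrees):
        grouped[nanodegree].append(name)
    return dict(grouped)


grouped = group_by_nanodegree(sample_names, sample_nanodegrees)
for nanodegree, courses in grouped.items():
    print(nanodegree, len(courses), courses)
```

With the full `course_names` and `course_nanodegrees` lists from the scraping loop, the same function would give one entry per Nanodegree.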
