# edX Web Scraping Project 



### Patrick Masi-Phelps
### NYCDSA Cohort 10


#### This document shows the code used to scrape edX's courses offered in English as of July 28, 2017.

#### The purpose of this exercise was to scrape edX's website for information on the online courses currently offered, and then conduct exploratory data analysis on the scraped data. This information could be useful for educational institutions: a clear picture of the current supply and characteristics of MOOCs could better inform current and potential market participants. It could also be useful for students looking to understand the availability of alternative online options.

### 1. Import packages, initialize driver, and open csv writer

In [55]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import csv
import re

driver = webdriver.Chrome('path to driver')

#main edX page of all English courses -- this scraper excludes edX courses in other languages
driver.get('https://www.edx.org/course?language=English')

#open a new blank csv
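#newline='' (Python 3) stops the csv module from writing blank rows between records on some platforms
csv_file = open('courses_whole.csv', 'w', newline='')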

writer = csv.writer(csv_file)

#write column headers for each of the variables to scrape
writer.writerow(['course_link','price', 'title', 
                 'subject', 'level', 'institution', 'length', 
                 'prerequisites', 'short_description', 'effort'])
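
Because the header row and the data rows are written separately, the column order has to be kept in sync by hand. A minimal sketch of one way to avoid that, assuming each course is held as a dictionary keyed by column name: `csv.DictWriter` places every value under its matching header regardless of the order in which the keys were added. The helper `write_courses`, the sample row, and the in-memory buffer are illustrative assumptions, not part of the original scraper.

```python
import csv
import io

# Column names copied from the header row written above.
FIELDNAMES = ['course_link', 'price', 'title', 'subject', 'level',
              'institution', 'length', 'prerequisites',
              'short_description', 'effort']

def write_courses(rows, out_file):
    """Write an iterable of course dicts to out_file as CSV, header first."""
    # restval fills any column a row lacks with the 'Missing' marker
    # that the scraping functions in this notebook also use.
    writer = csv.DictWriter(out_file, fieldnames=FIELDNAMES, restval='Missing')
    writer.writeheader()
    for row in rows:
        writer.writerow(row)

# Hypothetical row whose keys are deliberately out of header order.
sample = {'title': 'Example Course',
          'course_link': 'https://example.org/course',
          'price': 'FREE'}

buffer = io.StringIO()          # stands in for the real csv file
write_courses([sample], buffer)
lines = buffer.getvalue().splitlines()
print(lines[0])                 # header row
print(lines[1])                 # data row, aligned to the header
```

With a real file, `out_file` would be the object returned by `open(...)`; the in-memory buffer is only there so the sketch runs without touching disk.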



### 2. Scrape the main course page 
#### Link: https://www.edx.org/course?language=English

In [4]:
###   this code scrapes the main course list page, returning a list of all English course links

#edX lists the total number of courses on the top of the main page. This scrapes that number.
num_classes_str = driver.find_element_by_xpath('//span[@class="js-count result-count"]').text

#convert total course number to an integer
num_classes = int((re.findall(r'\d+', num_classes_str))[0])

#initialize page number = 0
page = 0

###   this while loop scrolls down the main course page until all courses are loaded

while page < num_classes:
    
    #driver does an initial scroll down to bottom of the page
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    
    #this try command waits until it can see the "loading..." icon. Once it sees the icon, we add 1 to page counter
    #and continue at the top of the while loop to do another scroll
    try:
        WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, '//div[@class="loading"]')))
        page += 1
    
    #when the driver waits 10 seconds and still cannot see the "loading..." icon, it will raise an exception
    #at this point, we will be at bottom of the page, all courses visible, break out of the loop
    except Exception as e:
        print(e)
        print(page)
        break

#get a list of all course link xpath elements
courses = driver.find_elements_by_xpath('//div[@class="discovery-card-inner-wrapper"]/a[@class="course-link"]')
    
#initialize empty list
course_links = []

#for each course link element, grab the link itself (the href attribute) and append it to the course_links list
for course in courses:
    course_links.append(course.get_attribute('href'))



Message: 

1257
1257
<selenium.webdriver.remote.webelement.WebElement (session="aae2888fc5e42e2618d93dbacef69838", element="0.5939802481140002-3")>
<selenium.webdriver.remote.webelement.WebElement (session="aae2888fc5e42e2618d93dbacef69838", element="0.5939802481140002-4")>
<selenium.webdriver.remote.webelement.WebElement (session="aae2888fc5e42e2618d93dbacef69838", element="0.5939802481140002-5")>
<selenium.webdriver.remote.webelement.WebElement (session="aae2888fc5e42e2618d93dbacef69838", element="0.5939802481140002-6")>
1257
https://www.edx.org/course/introduction-web-accessibility-microsoft-dev240x
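
The loading logic in the cell above has two separable parts: reading the course count from the page text, and scrolling until the "loading..." icon stops appearing. The sketch below isolates both so they can be exercised without a browser. It is an illustration under stated assumptions rather than the scraper's actual code: `parse_count` and `scroll_until_loaded` are hypothetical helper names, the label text `'1,257 results'` is made up (the real wording on edX's page may differ), and the two callables stand in for the Selenium calls.

```python
import re

def parse_count(text):
    """Return the first integer found in text, ignoring thousands separators."""
    match = re.search(r'\d[\d,]*', text)
    if match is None:
        raise ValueError('no number found in %r' % text)
    return int(match.group().replace(',', ''))

def scroll_until_loaded(scroll, spinner_appeared, max_scrolls):
    """Scroll until the spinner stops appearing or max_scrolls is reached.

    scroll           -- callable that scrolls to the bottom of the page
    spinner_appeared -- callable returning True if the loading icon showed up
    Returns the number of scrolls that triggered a load.
    """
    loads = 0
    while loads < max_scrolls:
        scroll()
        if not spinner_appeared():
            break               # nothing left to load
        loads += 1
    return loads

# Stand-ins for the browser: the spinner shows three times, then stops.
answers = iter([True, True, True, False])
scrolls = []

cap = parse_count('1,257 results')      # hypothetical label text
loads = scroll_until_loaded(lambda: scrolls.append(1),
                            lambda: next(answers),
                            cap)
print(cap, loads, len(scrolls))
```

With a real driver, `scroll` could wrap the `driver.execute_script(...)` call and `spinner_appeared` could wrap the `WebDriverWait(...).until(...)` call, returning `False` when it raises `selenium.common.exceptions.TimeoutException`. Catching only that exception, rather than `Exception`, would keep unrelated errors visible.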


#### (Some optional test code)

In [27]:
###   optional testing code to scrape a sample of courses
###   course_links_test1 = ['https://www.edx.org/course/apr-italian-language-culture-wellesleyx-apita-x', 
###                         'https://www.edx.org/course/ramp-ap-biology-weston-high-school-bio101x-1',
###                         'https://www.edx.org/course/selling-ideas-how-influence-others-get-wharton-sellingideas101x-2']
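
As an alternative to hard-coding a few test links, a small reproducible sample can be drawn from the full list. This is a sketch only: `sample_links` is a hypothetical helper, and the placeholder URLs below stand in for the scraped `course_links` list.

```python
import random

def sample_links(links, k, seed=0):
    """Return up to k links chosen at random, reproducibly for a given seed."""
    rng = random.Random(seed)           # private generator, leaves global state alone
    return rng.sample(links, min(k, len(links)))

# Placeholder list standing in for the scraped course_links
all_links = ['https://example.org/course/%d' % i for i in range(20)]
test_links = sample_links(all_links, 3)
print(test_links)
```

Fixing the seed means a failed test run can be repeated on exactly the same pages.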

### 3. Scrape the individual course pages

In [56]:
###each of these functions returns the corresponding value from each edX course page###
###if the scraper can't find a value, it returns the string 'Missing'###

#get title of course
def get_title():
    try:
        title = driver.find_element_by_xpath('.//*[@id="course-info-page"]//h1[@id="course-intro-heading"]').text
    except:
        title = 'Missing'
    finally:
        return title

#get short description of course
def get_short_description():
    try:
        short_description = driver.find_element_by_xpath('.//*[@id="course-info-page"]//p[@class="course-intro-lead-in"]').text
    except:
        short_description = 'Missing'
    finally:
        return short_description

#get length of course (typically number of weeks, or total number of hours)
def get_length():
    try:
        length = driver.find_element_by_xpath('.//*[@id="course-summary-area"]//li[@data-field="length"]/span[2]').text
    except:
        length = 'Missing'   
    finally:
        return length

#get the effort of course (typically hours per week, or total course hours)
def get_effort():
    try:
        effort = driver.find_element_by_xpath('.//*[@id="course-summary-area"]//li[@data-field="effort"]//span[@class="block-list__desc"]').text
    except:
        effort = 'Missing'
    finally:
        return effort

#get the price of course. The first "try" only works for free courses. This grabs the text "FREE" by xpath
#to get the price of not-free courses, the "except, try" gets the unique "tag" icon, then jumps to the parent 
#span class, then to a sibling span class to get the price amount. Unfortunately, the price amount doesn't 
#have a unique identifier.

def get_price():
    try:
        price = driver.find_element_by_xpath('.//*[@id="course-summary-area"]//li[@data-field="price"]//span[@class="block-list__desc"]/span[@class="uppercase"]').text
    except:
        try:                               
            #fixed: removed a stray "]" that made the xpath invalid; .text is an attribute, not a method
            price = driver.find_element_by_xpath('.//*[@id="course-summary-area"]//span[@class="fa fa-tag fa-lg"]/../parent::span/following-sibling::span').text
        except:
            price = "Missing"
    finally:
        return price
    
#gets the institution
def get_institution():
    try:
        institution = driver.find_element_by_xpath('.//*[@id="course-summary-area"]//li[@data-field="school"]/span[2]/a').text
    except:
        institution = 'Missing'
    finally:
        return institution

#gets the subject
def get_subject():
    try:
        subject = driver.find_element_by_xpath('.//*[@id="course-summary-area"]//li[@data-field="subject"]/span[2]/a').text
    except:
        subject = 'Missing'
    finally:
        return subject

#gets the level (introductory, intermediate, advanced)
def get_level():
    try:
        level = driver.find_element_by_xpath('.//*[@id="course-summary-area"]//li[@data-field="level"]//span[@class="block-list__desc"]').text
    except:
        level = 'Missing'
    finally:
        return level

#gets the prerequisites, if any
def get_prerequisites():
    try:
        prerequisites = driver.find_element_by_xpath('.//*[@id="course-summary-area"]/div[2]/p').text
    except:
        try:
            #fixed: take .text so the csv gets the prerequisite text rather than a WebElement object
            prerequisites = driver.find_element_by_xpath('.//*[@id="course-summary-area"]/div[2]/ul/li[1]').text
        except:
            prerequisites = 'Missing'
    finally:
        return prerequisites

###  this for loop:
###     1) iterates through each of the course links
###     2) creates a new empty dictionary
###     3) directs the driver to the link
###     4) calls each of the scraping functions above and saves the return values in the dictionary
###     5) writes the dictionary values out to the csv

for course_link in course_links:
    course_dict = {}
    #reuse the driver opened in step 1 rather than launching a new browser for every course
    driver.get(course_link)
    
        
    course_dict['course_link'] = course_link
    course_dict['title'] = get_title()
    course_dict['short_description'] = get_short_description()
    course_dict['length'] = get_length()
    course_dict['effort'] = get_effort()
    course_dict['price'] = get_price()
    course_dict['institution'] = get_institution()
    course_dict['subject'] = get_subject()
    course_dict['level'] = get_level()
    course_dict['prerequisites'] = get_prerequisites()
    #write the values in the same order as the csv column headers from step 1
    writer.writerow([course_dict[column] for column in 
                     ['course_link', 'price', 'title', 
                      'subject', 'level', 'institution', 'length', 
                      'prerequisites', 'short_description', 'effort']])

#quit the browser and close the csv once all course info is scraped
driver.quit()
csv_file.close()
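
The scraping functions above all follow the same pattern: look up one xpath, return the element's text, and fall back to `'Missing'` on failure. A minimal sketch of how that pattern could be factored into one helper driven by a lookup table is shown below. `text_or_missing`, `FIELD_XPATHS`, `FakeElement`, and `FakeDriver` are hypothetical names introduced for illustration; the two xpaths are copied from `get_title` and `get_level`, and the fake driver exists only so the sketch runs without a browser.

```python
def text_or_missing(find, xpath, default='Missing'):
    """Return the text of the element located by xpath, or default on any lookup error."""
    try:
        return find(xpath).text
    except Exception:
        return default

# Column name -> xpath, copied from get_title and get_level above.
FIELD_XPATHS = {
    'title': './/*[@id="course-info-page"]//h1[@id="course-intro-heading"]',
    'level': './/*[@id="course-summary-area"]//li[@data-field="level"]'
             '//span[@class="block-list__desc"]',
}

class FakeElement:
    """Stand-in for a Selenium WebElement; only the .text attribute is modelled."""
    def __init__(self, text):
        self.text = text

class FakeDriver:
    """Stand-in for a Selenium driver backed by a dict of xpath -> text."""
    def __init__(self, elements):
        self._elements = elements

    def find_element_by_xpath(self, xpath):
        # A KeyError here plays the role of Selenium's NoSuchElementException.
        return FakeElement(self._elements[xpath])

# A fake page that has a title but no level field.
fake = FakeDriver({FIELD_XPATHS['title']: 'Example Course'})
row = {name: text_or_missing(fake.find_element_by_xpath, xpath)
       for name, xpath in FIELD_XPATHS.items()}
print(row)
```

Passing the lookup function in as `find` keeps the helper independent of the global `driver`, so the same code works against either a real browser session or a test double. The price and prerequisites fields need a second fallback xpath, so they would still need their own handling.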