# Webscraper FutureLearn
FutureLearn.com is an online learning platform offering courses on a variety of subjects. This scraper collects all the courses available on FutureLearn. The total run time is around an hour and a half.

## 1 Importing packages

In [40]:
import requests
from bs4 import BeautifulSoup
from datetime import datetime
import time
import csv

## 2 Preparation

### 2.1 Getting the max page number
Get the maximum number of pages in the 'all' category

In [43]:
#Start time so we know how long it takes to run the code
start_time = datetime.now()

#Request of the courses page showing all the courses. It scrapes the button on the bottom showing what the last page is
url = "https://www.futurelearn.com/courses"
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
max_page = int(soup.find('div', class_ = "a-content a-contiguous-top u-centered a-content--tight").find_all('li', class_="pagination-module_item__3XB-l")[-1].text)
max_page

95
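
Taking the last pagination item with `[-1]` assumes that item is always a page number. A slightly more defensive variant, sketched below, keeps only the numeric labels and takes their maximum. The helper name `max_page_from_labels` and the sample labels are hypothetical and only illustrate the idea.

```python
def max_page_from_labels(labels):
    """Return the largest numeric pagination label, ignoring non-numeric ones."""
    numbers = [int(label.strip()) for label in labels if label.strip().isdecimal()]
    if not numbers:
        raise ValueError("no numeric pagination labels found")
    return max(numbers)

# Hypothetical labels, as they might be read from the pagination <li> items
print(max_page_from_labels(["1", "2", "3", "…", "95", "Next"]))
```

In the cell above, the labels could be collected with `[li.text for li in soup.find_all('li', class_="pagination-module_item__3XB-l")]` and passed to this helper.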

### 2.2 Make the page urls
This code does not require an HTTP request since each link is predictable. The result is a list of dictionaries, each containing a page number and the matching page url.

In [44]:
def make_page_urls(max_page):
    page_urls = []
    for page in range(1, max_page+1):
        page_info = {}
        page_info['page'] = page
        page_info['page_url'] = f"https://www.futurelearn.com/courses?&page={page}#courses-grid-start"
        page_urls.append(page_info)
    return page_urls

In [45]:
page_urls = make_page_urls(max_page)
page_urls

[{'page': 1,
  'page_url': 'https://www.futurelearn.com/courses?&page=1#courses-grid-start'},
 {'page': 2,
  'page_url': 'https://www.futurelearn.com/courses?&page=2#courses-grid-start'},
 {'page': 3,
  'page_url': 'https://www.futurelearn.com/courses?&page=3#courses-grid-start'},
 {'page': 4,
  'page_url': 'https://www.futurelearn.com/courses?&page=4#courses-grid-start'},
 {'page': 5,
  'page_url': 'https://www.futurelearn.com/courses?&page=5#courses-grid-start'},
 {'page': 6,
  'page_url': 'https://www.futurelearn.com/courses?&page=6#courses-grid-start'},
 {'page': 7,
  'page_url': 'https://www.futurelearn.com/courses?&page=7#courses-grid-start'},
 {'page': 8,
  'page_url': 'https://www.futurelearn.com/courses?&page=8#courses-grid-start'},
 {'page': 9,
  'page_url': 'https://www.futurelearn.com/courses?&page=9#courses-grid-start'},
 {'page': 10,
  'page_url': 'https://www.futurelearn.com/courses?&page=10#courses-grid-start'},
 {'page': 11,
  'page_url': 'https://www.futurelearn.com/c
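
As an alternative to the f-string, the same kind of url can be assembled with the standard library, which takes care of escaping the query string. This is only a sketch: the helper name `build_page_url` is hypothetical, and it produces `?page=` rather than the `?&page=` used above, on the assumption that the site treats both forms the same way.

```python
from urllib.parse import urlencode, urlunsplit

def build_page_url(page):
    """Return the listing url for one page number."""
    query = urlencode({"page": page})
    # Parts: scheme, host, path, query string, fragment
    return urlunsplit(("https", "www.futurelearn.com", "/courses", query, "courses-grid-start"))

print(build_page_url(3))
```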

### 2.3 Getting the course urls
Each page contains up to 15 courses. The following code retrieves the course urls from every page, and the cell after it saves them to a csv file for later use. The sleep time between requests is 1 second. The list of inaccessible pages was empty during the last run; it is kept in the code in case the website changes and the scraper stops working.

In [46]:
def make_course_urls(page_urls):
    course_urls = []
    list_of_inaccessible_pages = []                
    for page_url in page_urls:
        try:
            r = requests.get(page_url['page_url'])
            soup = BeautifulSoup(r.text, "html.parser")
            courses = soup.find(class_="cardGrid-wrapper_2TvtF cardGrid-hasSideNav_1sLqj").find_all('div', class_="m-card Container-wrapper_1lZbP Container-grey_1l9VP")
            for course in courses:
                course_url = {}
                course_url['course_url'] = course.find_all('a')[0]['href']
                course_url['page'] = page_url['page']
                course_urls.append(course_url)
        except Exception:  # a bare except would also swallow Ctrl-C (KeyboardInterrupt)
            list_of_inaccessible_pages.append(page_url)
                
        time.sleep(1)
        print(f"Currently scraping page {page_url}")
    print(f"Inaccessible pages: {list_of_inaccessible_pages}")
    print(f"Number of unaccessible pages: {len(list_of_inaccessible_pages)}")
    print(f"Number of courses: {len(course_urls)}")
    print(f"First few courses: {course_urls[0:5]}")
    return course_urls

In [47]:
course_urls = make_course_urls(page_urls)

Currently scraping page {'page': 1, 'page_url': 'https://www.futurelearn.com/courses?&page=1#courses-grid-start'}
Currently scraping page {'page': 2, 'page_url': 'https://www.futurelearn.com/courses?&page=2#courses-grid-start'}
Inaccessible pages: []
Number of unaccessible pages: 0
Number of courses: 32
First few courses: [{'course_url': '/courses/anatomy-know-your-abdomen', 'page': 1}, {'course_url': '/courses/atmospheric-chemistry-planets-and-life-beyond-earth', 'page': 1}, {'course_url': '/courses/business-ethics', 'page': 1}, {'course_url': '/courses/human-disease-exploring-cancer-genetic-disease', 'page': 1}, {'course_url': '/courses/human-disease-lifestyle-environment', 'page': 1}]
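
The function above gives each page a single attempt and records it as inaccessible on any error. If temporary network failures turn out to be a problem, a small retry wrapper is one option. The sketch below is not part of the original scraper: `fetch_with_retries` and its parameters are hypothetical, and the function that performs the request is passed in so the wrapper does not depend on a particular HTTP library.

```python
import time

def fetch_with_retries(fetch, url, retries=3, delay=1.0, sleep=time.sleep):
    """Call fetch(url) up to `retries` times, pausing `delay` seconds after a failure.

    Returns the first successful result, or None if every attempt raised.
    """
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as error:
            print(f"Attempt {attempt}/{retries} failed for {url}: {error}")
            if attempt < retries:
                sleep(delay)
    return None
```

With `requests`, it could be called as `fetch_with_retries(lambda u: requests.get(u, timeout=10), page_url['page_url'])`; a result of `None` would then mark the page as inaccessible.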


In [48]:
with open("../../gen/input/futurelearn_course_urls.csv", "w", encoding = "UTF-8", newline = "") as csv_file:
    writer = csv.writer(csv_file, delimiter = ";")
    writer.writerow(['course_url', 'page'])
    for course_url in course_urls:
        writer.writerow([course_url['course_url'], course_url['page']])
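
Since the file is saved for later use, it may help to see how it can be read back. The sketch below assumes the same `;` delimiter and column names as the cell above; the function name `read_course_urls` is hypothetical, and an in-memory string stands in for the real file.

```python
import csv
import io

def read_course_urls(csv_file):
    """Read rows written with the ';' delimiter back into a list of dicts."""
    reader = csv.DictReader(csv_file, delimiter=";")
    # The csv module returns strings, so the page number is converted back to int
    return [{"course_url": row["course_url"], "page": int(row["page"])} for row in reader]

# Stand-in for the real file, using two rows from the output above
sample = io.StringIO(
    "course_url;page\n"
    "/courses/anatomy-know-your-abdomen;1\n"
    "/courses/business-ethics;1\n"
)
print(read_course_urls(sample))
```

For the real file, the same function can be given the handle from `open(path, encoding="UTF-8", newline="")`.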

## 3 Data Collection

### 3.1 Getting the course info
The function below retrieves the following data from each course page:
- Url: The complete url of the course
- Time: The exact time the data were scraped
- Page: The page from the category
- Category: The name of the category
- Header: The header of the course
- Enrollments: The number of enrollments of the course. Note that new courses do not have this information yet
- New: Dummy showing whether the course is new
- Star_rating: The star rating of the course
- Review_count: Variable showing how many reviews there are
- Description: Short description of the course
- Duration: Duration of the course in weeks
- Weekly_study: Study time required on a weekly basis
- Unlimited: Dummy showing whether you can access this course with a so-called "unlimited" subscription
- 100_online: Dummy showing whether the course is 100% online
- Free: Dummy showing whether the course is free; 'no' means it is not free
- Accreditation: Dummy showing whether the course is eligible for accreditation
- Part_of_Expert: Dummy showing whether the course is part of a larger expert course
- Name_school: The name of the school that teaches the course
- Endorsed: Dummy showing whether the course is endorsed by third parties

Although the course page contains more information (e.g. in-depth course descriptions, numerous teachers), the variables above are the ones relevant for an analysis. The sleep time is 1 second per retrieval.

In [49]:
def get_course_info(course_url):
    course_info = {}
    complete_url = f"https://www.futurelearn.com{course_url['course_url']}"
    r = requests.get(complete_url)
    soup = BeautifulSoup(r.text, "html.parser")
    
    #META-DATA
    ##url
    course_info['url'] = complete_url
    
    ##time
    course_info['Time'] = datetime.now()
    
    ##page
    course_info['page'] = course_url['page']
    
    ##category
    try:
        course_info['category'] = soup.find_all('li', class_ = "breadcrumbs-module_item__3SxlK")[1].text
    except:
        course_info['category'] = ''
    
    #HEADER
    try:
        course_info['Header'] = soup.find('h1').text
    except:
        course_info['Header'] = ''
        
    #NUMBER OF ENROLLMENTS
    try:
        course_info['Enrollments'] = soup.find('div', class_="spacer-module_default__3N2H9 spacer-module_vertical-4__5ZLo8").find('p', class_ = "text-module_wrapper__FfvIV text-module_black__2u5Rt text-module_sBreakpointSizexsmall__2Jlmd text-module_sBreakpointAlignmentleft__1MvbB text-module_isRegular__1K97K").text
    except:
        course_info['Enrollments'] = ''
    
    #NEW COURSE
    try:
        if soup.find('span', class_ = "Ribbon-module_wrapper__312EV Ribbon-module_coral__B1J53 Ribbon-module_isUppercase__1NTSw").text == 'New':
            course_info['New'] = 'yes'
        else:
            course_info['New'] = 'no' 
    except:
        course_info['New'] = 'no'

    #STAR RATING
    try:
        course_info['star_rating'] = soup.find('div', class_="PageHeader-content_1v6-E").find(class_="spacer-module_default__3N2H9 spacer-module_left-1__1AJxh").text.split()[0]
    except:
        course_info['star_rating'] = ''
    
    #NUM OF REVIEWS
    try:
        course_info['review_count'] = soup.find('div', class_="PageHeader-content_1v6-E").find('div', class_="ReviewStars-text_mSEFD").find('span').text
    except:
        course_info['review_count'] = ''
        
    #DESCRIPTION
    try:
        course_info['description'] = soup.find('div', class_ = "stack-module_wrapper__3ZERF").find(class_="text-module_wrapper__FfvIV text-module_black__2u5Rt text-module_sBreakpointSizemedium__2qitW text-module_mBreakpointSizemedium__1_OnK text-module_lBreakpointSizemedium__1Yq39 text-module_xlBreakpointSizemedium__1nNCx text-module_xxlBreakpointSizelarge__1uhrp text-module_sBreakpointAlignmentleft__1MvbB text-module_isRegular__1K97K").text
    except:
        course_info['description'] = ''
    
    #INFO COURSE
    ##Defined before the try block so the except branch can always use it
    table_headers_options = ['Duration', 'Weekly_study', 'Unlimited', '100_online', 'Free', 'Accreditation', 'Part_of_Expert']
    try:
        for li in soup.find('div', class_ = "PageHeader-keyInfoWrapper_39HpT").find_all(class_="keyInfo-module_itemText__3w63w"):
            ## Top part of cell
            table_header = li.find(class_ = "text-module_wrapper__FfvIV text-module_mediumGrey__1uvOt text-module_sBreakpointSizesmall__3K4b4 text-module_sBreakpointAlignmentleft__1MvbB text-module_isInline__m5cFK text-module_isRegular__1K97K").text
            ## Lower part of cell
            table_text = li.find(class_ = "keyInfo-module_content__1K_85").text
            if table_header == 'Duration':
                course_info['Duration'] = table_text
            if table_header == 'Weekly study':
                course_info['Weekly_study'] = table_text
            if table_header == 'Unlimited':
                course_info['Unlimited'] = 'yes'
            if table_header == '100% online':
                course_info['100_online'] = 'yes'
            if table_header == 'Digital upgrade':
                course_info['Free'] = 'yes'
            if table_header == 'Accreditation':
                course_info['Accreditation'] = 'yes'
            if table_header == "Included in an ExpertTrack":
                course_info['Part_of_Expert'] = 'yes'
        ##Looks whether a cell is missing
        for table_headers_option in table_headers_options:
            if table_headers_option not in course_info:
                course_info[table_headers_option] = 'no'
    ##If there's an error in loading the table, place empty values
    except Exception:
        for table_headers_option in table_headers_options:
            course_info[table_headers_option] = ''
    
    #NAME SCHOOL
    try:
        course_info['Name_school'] = soup.find('h2', class_="heading-module_wrapper__2dcxt heading-module_sBreakpointAlignmentleft__pCA_Y heading-module_sBreakpointSizelarge__SiUxO heading-module_black__Uge9G heading-module_isRegular__2NZyV").text
    except:
        course_info['Name_school'] = ''
    print(f"Currently scraping: {course_info['Header']}")
    
    #ENDORSERS (Check how often present)
    try:
        if soup.find('h2', class_ = "heading-module_wrapper__2dcxt heading-module_sBreakpointAlignmentcenter__2R3nG heading-module_sBreakpointSizelarge__SiUxO heading-module_black__Uge9G heading-module_isRegular__2NZyV").text == 'Endorsers and supporters':
            course_info['Endorsed'] = 'yes'
        else:
            course_info['Endorsed'] = 'no'
    except:
        course_info['Endorsed'] = 'no'
    time.sleep(1)
    return course_info
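
The function above repeats the same try/except pattern for almost every field. One way to shorten it, sketched here under the assumption that a failed lookup should always fall back to an empty string, is a small helper that runs the lookup and catches the typical errors. The name `safe_get` and the `FakeTag` class are hypothetical and only serve the illustration.

```python
def safe_get(getter, default=""):
    """Run getter() and return its result, or `default` if the lookup fails."""
    try:
        return getter()
    except (AttributeError, IndexError, KeyError, TypeError):
        # AttributeError: find() returned None; IndexError: list too short;
        # KeyError: attribute such as 'href' missing on a tag
        return default

class FakeTag:
    """Minimal stand-in for a parsed tag, so the sketch runs without a web page."""
    text = "Anatomy: Know Your Abdomen"

print(safe_get(lambda: FakeTag().text))
print(repr(safe_get(lambda: None.text)))
```

Inside `get_course_info`, a field could then be filled with a single line such as `course_info['Header'] = safe_get(lambda: soup.find('h1').text)`.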

### 3.2 Scraping each course page and saving it in csv
The following code uses the function above to retrieve the relevant data from each course page and stores them in a csv file. Note that the delimiter is ';'. The list of inaccessible courses was empty during the last run. The run takes about an hour to complete.

In [50]:
def write_data(course_urls):
    courses_info = []
    list_of_inaccessable_courses = []
    counter = 1
    length_course_urls = len(course_urls)
    with open("../../gen/output/futurelearn_data.csv", "w", encoding = "UTF-8", newline = "") as csv_file:
        writer = csv.writer(csv_file, delimiter = ";")
        writer.writerow(['Url', 'Time', 'Page', 'Category', 'Header', 'Enrollments', 'New', 'Star_rating', 'Review_count', 'Description', 'Duration', 'Weekly_study', 'Unlimited', '100_online', 'Free', 'Accreditation', 'Part_of_Expert', 'Name_school', 'Endorsed'])
        for course_url in course_urls:
            try:
                course_info = get_course_info(course_url)
                courses_info.append(course_info)
                writer.writerow([course_info['url'], course_info['Time'], course_info['page'], course_info['category'], course_info['Header'], course_info['Enrollments'], course_info['New'], course_info['star_rating'], course_info['review_count'], course_info['description'], course_info['Duration'], course_info['Weekly_study'], course_info['Unlimited'], course_info['100_online'], course_info['Free'], course_info['Accreditation'], course_info['Part_of_Expert'], course_info['Name_school'], course_info['Endorsed']])
            except Exception:  # a bare except would also swallow Ctrl-C (KeyboardInterrupt)
                list_of_inaccessable_courses.append(course_url)
            print(f'Processing... [{counter}/{length_course_urls}]')
            counter += 1
    print(f"Courses it couldn't access: {list_of_inaccessable_courses}")
    print(f"Couldn't access {len(list_of_inaccessable_courses)} courses")
    return courses_info
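
In `write_data`, the header row and the data row list the same 19 fields separately, so the two can drift apart when a field is added. A possible alternative, sketched below, keeps one mapping from csv column to `course_info` key and lets `csv.DictWriter` handle the ordering. The names `COLUMNS` and `write_rows` are hypothetical. Unlike the original, a missing key is written as an empty string instead of raising an error.

```python
import csv
import io

# Column name in the csv header -> key used in the course_info dictionary
COLUMNS = {
    'Url': 'url', 'Time': 'Time', 'Page': 'page', 'Category': 'category',
    'Header': 'Header', 'Enrollments': 'Enrollments', 'New': 'New',
    'Star_rating': 'star_rating', 'Review_count': 'review_count',
    'Description': 'description', 'Duration': 'Duration',
    'Weekly_study': 'Weekly_study', 'Unlimited': 'Unlimited',
    '100_online': '100_online', 'Free': 'Free', 'Accreditation': 'Accreditation',
    'Part_of_Expert': 'Part_of_Expert', 'Name_school': 'Name_school',
    'Endorsed': 'Endorsed',
}

def write_rows(csv_file, courses_info):
    """Write the header once, then one ';'-delimited row per course."""
    writer = csv.DictWriter(csv_file, fieldnames=list(COLUMNS), delimiter=";")
    writer.writeheader()
    for course_info in courses_info:
        writer.writerow({column: course_info.get(key, '') for column, key in COLUMNS.items()})

# In-memory buffer as a stand-in for the real output file
buffer = io.StringIO()
write_rows(buffer, [{'url': 'https://www.futurelearn.com/courses/business-ethics', 'page': 1}])
print(buffer.getvalue())
```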

In [51]:
write_data(course_urls)
end_time = datetime.now()
print("Time it took to scrape: " + str(end_time - start_time))

Currently scraping: Anatomy: Know Your Abdomen
Processing... [1/32]
Currently scraping: Atmospheric Chemistry: Planets and Life Beyond Earth
Processing... [2/32]
Currently scraping: Business Ethics: Exploring Big Data and Tax Avoidance
Processing... [3/32]
Currently scraping: Causes of Human Disease: Exploring Cancer and Genetic Disease
Processing... [4/32]
Currently scraping: Causes of Human Disease: Nutrition and Environment
Processing... [5/32]
Currently scraping: Causes of Human Disease: Transmitting and Fighting Infection
Processing... [6/32]
Currently scraping: Causes of Human Disease: Understanding Cardiovascular Disease
Processing... [7/32]
Currently scraping: Causes of Human Disease: Understanding Causes of Disease
Processing... [8/32]
Currently scraping: Clinical Bioinformatics: Unlocking Genomics in Healthcare
Processing... [9/32]
Currently scraping: Communication and Interpersonal Skills at Work
Processing... [10/32]
Currently scraping: Community Based Research: Getting Sta