# Webscraper

- What is a webscraper?
- Why is it useful?
- Is it legal?

## Project Assumptions

The projects asked the candidate to fetch all Coursera "course" information however the types of learning tracks are split into 3 categories:
- Course
- Specialization
- Professional Certificate

This adds an additional layer of complexity since we need to fetch the course names and type and deal with then appropriately.

I will assume that we are fetching courses from the main page (https://www.coursera.org/browse/data-science) and not any of the pages which contains only a single category of courses.

This is important as each of those 3 categories have different UI components. So we need to ensure that the data is fetched from unique locations depending on the category. An easy way to identify which of these the course is, is by the url. The following key word follows the category within the URL after the "coursera.org/" (see example url):
- Course ('learn')
    - https://www.coursera.org/learn/foundations-data
- Specialization ('specializations')
    - https://www.coursera.org/specializations/deep-learning
- Professional Certificate ('professional-certificates')
    - https://www.coursera.org/professional-certificates/ibm-data-science

In [13]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [157]:
# Imports

from bs4 import BeautifulSoup
import requests
import os
import pandas as pd

from IPython.display import Image, Markdown
from IPython.core.display import display, HTML

import urllib.parse
from selenium import webdriver

  from IPython.core.display import display, HTML


## Example of how to fetch and display HTML using BeautifulSoup

In [15]:
url = "https://web.ics.purdue.edu/~gchopra/class/public/pages/webdesign/05_simple.html"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')


display = False

if display:
    HTML(str(soup))

## Subproblem - Extract info from selected page

- Course name
- Course provider
- Course description
- num of Students enrolled


In [109]:
def get_rating_info(html_soup):   
    overall_ratings_str = html_soup.find_all("span", {"data-test": "number-star-rating"})[0].text
    overall_rating = float(overall_ratings_str.replace("stars", ""))

    ratings_str = html_soup.find_all("span", {"data-test": "ratings-count-without-asterisks"})[0].text
    rating = int(ratings_str.split(" ")[0].replace(",", ""))

    rating_info = {
        "rating": rating, 
        "overall_rating": overall_rating,
    }

    return rating_info


def get_enrolled_info(html_soup):
    num_students_enrolled_str = html_soup.find_all("div", {"class": "_1fpiay2"})[0].find("strong").find("span").text
    num_students_enrolled = int(num_students_enrolled_str.replace(",", ""))
    
    enrolled_info = {
        "num_students_enrolled": num_students_enrolled, 
    }
    
    z = {**x, **y}

    return enrolled_info
    
    
def  get_descirption_info(html_soup, specialized_url=False):
    
    if specialized_url:
        course_description = specialized_html_soup.find_all("div", {"class": "description"})[0].text
        
    else:
        course_description = html_soup.find_all("div", {"class": "m-t-1 description"})[0].find("div", {"class": "content-inner"}).find("p").text

    description_info = {
        "course_description": course_description, 

    }
    return description_info
    
    
def get_provider_info(html_soup):
    course_provider = html_soup.find_all("h3", {"class": "headline-4-text bold rc-Partner__title"})[0].text
    
    provider_info = {
        "course_provider": course_provider, 

    }
    return provider_info

    
    
def fetch_course_info_from_course_url(course_url):
    if "specializations" ==  course_url.split("/")[3]:
        specialized_url = True
    else:
        specialized_url = False
    print("specialized_url", specialized_url)
    
    response = requests.get(url)
    html_soup = BeautifulSoup(response.content, 'html.parser')
    
    # Get Rating Info
    rating_info = get_rating_info(html_soup)

    # Get Enrolled Info
    enrolled_info = get_enrolled_info(html_soup)
    
    # Get Provider Info
    provider_info = get_provider_info(html_soup)
    
    # Get Descritpion Info
    description_info = get_descirption_info(html_soup, specialized_url)
    
    # Merge all information into a single dictionary
    merged_dict = {**rating_info, **enrolled_info, **description_info, **provider_info}
    
    return merged_dict
    

url = "https://www.coursera.org/learn/process-mining"
merged_dict = fetch_course_info_from_course_url(url)
print(merged_dict)

url = "https://www.coursera.org/learn/data-management"
merged_dict = fetch_course_info_from_course_url(url)
print(merged_dict)

url = "https://www.coursera.org/specializations/practical-data-science"
merged_dict = fetch_course_info_from_course_url(url)
print(merged_dict)

specialized_url False
{'rating': 1121, 'overall_rating': 4.7, 'num_students_enrolled': 70978, 'course_description': 'Process mining is the missing link between model-based process analysis and data-oriented analysis techniques. Through concrete data sets and easy to use software the course provides data science knowledge that can be applied directly to analyze and improve processes in a variety of domains.', 'course_provider': 'Eindhoven University of Technology'}
specialized_url False
{'rating': 631, 'overall_rating': 4.7, 'num_students_enrolled': 31599, 'course_description': 'This course will provide learners with an introduction to research data management and sharing. After completing this course, learners will understand the diversity of data and their management needs across the research data lifecycle, be able to identify the components of good data management plans, and be familiar with best practices for working with data including the organization, documentation, and storage 

In [112]:
selected_category_name_str = "Data Science"
selected_category_name = selected_category_name_str.lower().replace(" ", "-")

possible_categories = ["physical-science-and-engineering", "data-science", "business"]

assert selected_category_name in possible_categories

course_url = f"https://www.coursera.org/browse/{selected_category_name}"
course_url

'https://www.coursera.org/browse/data-science'

In [113]:
response = requests.get(course_url)
full_course_browser_soup = BeautifulSoup(response.content, 'html.parser')

In [117]:
full_course_browser_soup

# COPYRIGHT

In [None]:
course_title = []
course_organization = []
course_URL =[]
course_Certificate_type = []
course_rating = []
course_difficulty = []
course_students_enrolled = []
course_image_URL = []
course_image_name=[]

In [130]:
url = "https://www.coursera.org/search?query=free&page=2&index=prod_all_launched_products_term_optimization&topic=Data%20Science"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

In [None]:
# soup

In [131]:
ss = soup.find_all("main", {"class": "css-i045df"})
ss

[<main class="css-i045df"><div class="cds-1 css-1p9qvnf cds-2 cds-7"><div class="cds-9 css-1kspkkz cds-10"><div class="cds-9 css-0 cds-11 cds-grid-item cds-56"><div class="css-qfpx85"></div></div></div></div><div class="css-wgtptl"><div class="cds-1 css-1p9qvnf cds-2 cds-7"><div class="cds-9 css-1kspkkz cds-10"><div class="cds-9 css-0 cds-11 cds-grid-item cds-56"><div class="cds-9 css-1winmd cds-10"><div class="cds-9 css-0 cds-11 cds-grid-item cds-56 cds-75"><div class="css-1cr1orv"><div class="css-fmsv16" data-testid="placeholder"></div><div class="css-fmsv16" data-testid="placeholder"></div><div class="css-fmsv16" data-testid="placeholder"></div><div class="css-fmsv16" data-testid="placeholder"></div><div class="css-fmsv16" data-testid="placeholder"></div><div class="css-fmsv16" data-testid="placeholder"></div><div class="css-fmsv16" data-testid="placeholder"></div></div></div><div class="cds-9 css-0 cds-11 cds-grid-item cds-56 cds-81"><div class="rc-SearchResultsHeader"><h1 class="c

In [136]:
soup.find_all("script", {"type": "application/ld+json"})


[]

## Wait time issue - Try Selinum

['Foundations: Data, Data, Everywhere',
 'IBM Applied AI',
 'Supervised Machine Learning: Regression and Classification',
 'DeepLearning.AI TensorFlow Developer',
 'Machine Learning Engineering for Production (MLOps)',
 'Ask Questions to Make Data-Driven Decisions',
 'Learn SQL Basics for Data Science',
 'Natural Language Processing',
 'IBM AI Engineering',
 'AI in Healthcare',
 'Business Analytics',
 'Preparing for Google Cloud Certification: Cloud Data Engineer']

In [294]:
card

<div aria-hidden="true" class="css-ilhc4l"><div class="css-1rj417c"><div class="css-17cbn3s"><div class="cds-71 css-1fhq39r cds-72 cds-78"><div class="cds-71 css-1vpgbgp cds-73 cds-grid-item"><div class="_1x9ons3"><div class="lazyload-wrapper"><img alt="IBM Skills Network" height="25" src="https://d3njjcbhbojbot.cloudfront.net/api/utilities/v1/imageproxy/http://coursera-university-assets.s3.amazonaws.com/bb/f5ced2bdd4437aa79f00eb1bf7fbf0/IBM-Logo-Blk---Square.png?auto=format%2Ccompress&amp;dpr=1&amp;w=25&amp;h=25&amp;q=40" srcset="https://d3njjcbhbojbot.cloudfront.net/api/utilities/v1/imageproxy/http://coursera-university-assets.s3.amazonaws.com/bb/f5ced2bdd4437aa79f00eb1bf7fbf0/IBM-Logo-Blk---Square.png?auto=format%2Ccompress&amp;dpr=2&amp;w=25&amp;h=25&amp;q=40 2x, https://d3njjcbhbojbot.cloudfront.net/api/utilities/v1/imageproxy/http://coursera-university-assets.s3.amazonaws.com/bb/f5ced2bdd4437aa79f00eb1bf7fbf0/IBM-Logo-Blk---Square.png?auto=format%2Ccompress&amp;dpr=3&amp;w=25&amp

In [298]:
card.find("h2", {"class": "cds-33 cds-35"})

In [308]:

# def auto_scrape(j, html_tag, class_tag, course_item):
#     bsoup = soup.find_all(html_tag, class_ = class_tag)[j].get_text()
#     if bsoup is None:
#         course_item.append("None")
#     else:
#         course_item.append(bsoup)

# def auto_scrape_imgURL(j, html_tag, class_tag, course_url_item, course_url_name):
#     x_div = soup.find_all('img', class_ ="product-photo")[j]
#     if x_div is None:
#         course_url_item.append("None")
#         course_url_name.append("None")
#     else:
#         course_url_item.append(x_div.get('src'))
#         course_url_name.append(x_div.get('alt'))

# def auto_scrape_URL(j, html_tag, class_tag, course_item):
#     cu= soup.find_all("a", class_='rc-DesktopSearchCard anchor-wrapper')[j]
#     if cu is None:
#         course_item.append("None")
#     else:
#         course_URL.append("https:/www.coursera.org/"+ cu.get('href'))
        
        
def value_to_float(value):
    if type(value) == float or type(value) == int:
        return value
    
    value = value.replace(",", "").replace(".", "").upper()
    print("value", value)
    if 'K' in value:
        if len(value) > 1:
            return float(value.replace('K', '')) * 1000
        return 1000.0
    if 'M' in value:
        if len(value) > 1:
            return float(value.replace('M', '')) * 1000000
        return 1000000.0
    if 'B' in value:
        return float(value.replace('B', '')) * 1000000000
    
    return 0.0


def get_coursera_page_url_by_page_number(page_number, topic):
    url_str = f"https://www.coursera.org/search?page={page_number}&index=prod_all_launched_products_term_optimization"
    topic_url_parsed_str = "&topic=" + urllib.parse.quote(topic)
    full_url = url_str + topic_url_parsed_str
    return full_url


def get_course_attributes(card):
    course_info = {}

    course_name = card.find("h2", {"class": "cds-119 css-bku0rr cds-121"}).text
    # course_name = card.find("h2", {"class": "cds-33 css-bku0rr cds-35"}).text 
    # course_name = card.find("h2", {"class": "ccds-33 css-bku0rr cds-35"}).text 
    


    course_info["course_name"] = course_name 
    
    try:

        # Find additional data fields    
        # course_info["course_rating"] = float(card.find("p", {"class": "cds-33 css-zl0kzj cds-35"}).text)
        course_info["course_rating"] = float(card.find("p", {"class": "cds-119 css-zl0kzj cds-121"}).text)
        

        # course_data = card.find_all("p", {"class": "cds-33 css-14d8ngk cds-35"})
        course_data = card.find_all("p", {"class": "cds-119 css-14d8ngk cds-121"})
        
        print("course_data", course_data)
        # Set Course Reviews
        course_reviews_str = course_data[0].text.replace(" reviews", "").replace("(", "").replace(")", "")
        print("course_reviews_str", course_reviews_str)
        course_reviews = value_to_float(course_reviews_str)
        course_info["course_reviews"] = course_reviews


#         split_course_data = course_data[1].text.split(" · ")
#         course_difficulty_level = split_course_data[0]
#         course_type = split_course_data[1]
#         course_period = split_course_data[2]

#         course_info["course_difficulty_level"] = course_difficulty_level 
#         course_info["course_type"] = course_type 
#         course_info["course_period"] = course_period 

    except Exception as e:
        print(f"Exception: {e}")
        return course_info
        
    return course_info

def get_all_course_card_info(course_card_soup):
    course_cards = course_card_soup.find_all("div", {"class": "css-ilhc4l"})
    
    course_information_dict = {}
    
    for card in course_cards:
        course_dict = get_course_attributes(card)
        print("course_dict", course_dict)
        course_name = course_dict.pop("course_name")
        course_information_dict[course_name] = course_dict
        
    print("done")

    # course_names = [for c in course_cards]
    # all_course_names.extend(course_names)
    
    return course_information_dict

# get_coursera_page_url_by_page_number(page_number=1, topic="Math and Logic")
# 'https://www.coursera.org/search?page=1&index=prod_all_launched_products_term_optimization&topic=Math%20and%20Logic'

In [300]:
course_cards = course_card_soup.find_all("div", {"class": "css-ilhc4l"})


# 
card = course_cards[1]
# card
# card.find("h2", {"class": "cds-33 css-bku0rr cds-35"}).text


get_course_attributes(card)

<div aria-hidden="true" class="css-ilhc4l"><div class="css-1rj417c"><div class="css-17cbn3s"><div class="cds-9 css-1fhq39r cds-10 cds-16"><div class="cds-9 css-1vpgbgp cds-11 cds-grid-item"><div class="_1x9ons3"><div class="lazyload-wrapper"><img alt="IBM Skills Network" height="25" src="https://d3njjcbhbojbot.cloudfront.net/api/utilities/v1/imageproxy/http://coursera-university-assets.s3.amazonaws.com/bb/f5ced2bdd4437aa79f00eb1bf7fbf0/IBM-Logo-Blk---Square.png?auto=format%2Ccompress&amp;dpr=1&amp;w=25&amp;h=25&amp;q=40" srcset="https://d3njjcbhbojbot.cloudfront.net/api/utilities/v1/imageproxy/http://coursera-university-assets.s3.amazonaws.com/bb/f5ced2bdd4437aa79f00eb1bf7fbf0/IBM-Logo-Blk---Square.png?auto=format%2Ccompress&amp;dpr=2&amp;w=25&amp;h=25&amp;q=40 2x, https://d3njjcbhbojbot.cloudfront.net/api/utilities/v1/imageproxy/http://coursera-university-assets.s3.amazonaws.com/bb/f5ced2bdd4437aa79f00eb1bf7fbf0/IBM-Logo-Blk---Square.png?auto=format%2Ccompress&amp;dpr=3&amp;w=25&amp;h

AttributeError: 'NoneType' object has no attribute 'text'

In [309]:
topic = "Data Science"

all_course_information = {}

browser = webdriver.Chrome()

for page_number in range(1,3):
    print(f"page_number: {page_number}")
    url = get_coursera_page_url_by_page_number(page_number, topic)
    print("url", url)

    browser.get(url)
    html = browser.page_source
    course_card_soup = BeautifulSoup(html, 'lxml')
    
    course_information_dict = get_all_course_card_info(course_card_soup)
    all_course_information.update(course_information_dict)
    

page_number: 1
url https://www.coursera.org/search?page=1&index=prod_all_launched_products_term_optimization&topic=Data%20Science


AttributeError: 'NoneType' object has no attribute 'text'

In [None]:
# #creating a dataframe gathering the list
# coursera_df = pd.DataFrame({'course_title': course_title,
#                    'course_URL':course_URL,
#                   'course_organization': course_organization,
#                   'course_Certificate_type': course_Certificate_type,
#                   'course_rating':course_rating,
#                    'course_difficulty':course_difficulty,
#                    'course_students_enrolled':course_students_enrolled,
#                     'course_icon':course_image_URL,
#                   'image_name':course_image_name})

# #writing into a .csv file
# coursera_df.to_csv('Coursera_catalog.csv')