# Webscraper

- What is a webscraper?
- Why is it useful?
- Is it legal?

## Project Assumptions

The project asked the candidate to fetch all Coursera "course" information; however, the learning tracks are split into 3 categories:
- Course
- Specialization
- Professional Certificate

This adds a layer of complexity, since we need to fetch each course's name and type and deal with them appropriately.

I will assume that we are fetching courses from the main page (https://www.coursera.org/browse/data-science) and not from any of the pages that contain only a single category of courses.

This is important because each of those 3 categories has different UI components, so we need to ensure that the data is fetched from the right location for each category. An easy way to identify a course's category is by its URL: the keyword that follows "coursera.org/" indicates the category (see the example URLs):
- Course ('learn')
    - https://www.coursera.org/learn/foundations-data
- Specialization ('specializations')
    - https://www.coursera.org/specializations/deep-learning
- Professional Certificate ('professional-certificates')
    - https://www.coursera.org/professional-certificates/ibm-data-science
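
The URL rule above can be sketched as a small lookup. This is only an illustration: the helper name `get_course_category` and the dictionary `CATEGORY_BY_KEYWORD` are assumptions, not part of the notebook, and the sketch assumes the keyword is always the first path segment.

```python
from urllib.parse import urlparse

# Mapping from the first URL path segment to the learning-track category,
# taken from the list above.
CATEGORY_BY_KEYWORD = {
    "learn": "Course",
    "specializations": "Specialization",
    "professional-certificates": "Professional Certificate",
}


def get_course_category(course_url):
    """Return the category for a Coursera URL, or None if the keyword is unknown."""
    path_segments = [segment for segment in urlparse(course_url).path.split("/") if segment]
    if not path_segments:
        return None
    return CATEGORY_BY_KEYWORD.get(path_segments[0])


print(get_course_category("https://www.coursera.org/learn/foundations-data"))
print(get_course_category("https://www.coursera.org/specializations/deep-learning"))
print(get_course_category("https://www.coursera.org/professional-certificates/ibm-data-science"))
```

Returning `None` for an unrecognised keyword (for example the `browse` pages) lets the caller decide whether to skip the URL or raise an error.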

In [310]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [311]:
# Imports

from bs4 import BeautifulSoup
import requests
import os
import pandas as pd

from IPython.display import Image, Markdown, display, HTML

import urllib.parse
from selenium import webdriver


## Example of how to fetch and display HTML using BeautifulSoup

In [15]:
url = "https://web.ics.purdue.edu/~gchopra/class/public/pages/webdesign/05_simple.html"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')


# Use a separate flag so the imported display() function is not shadowed
show_html = False

if show_html:
    display(HTML(str(soup)))

## Subproblem - Extract info from selected page

- Course name
- Course provider
- Course description
- Number of students enrolled


In [314]:
def get_rating_info(html_soup):   
    overall_ratings_str = html_soup.find_all("span", {"data-test": "number-star-rating"})[0].text
    overall_rating = float(overall_ratings_str.replace("stars", ""))

    ratings_str = html_soup.find_all("span", {"data-test": "ratings-count-without-asterisks"})[0].text
    rating = int(ratings_str.split(" ")[0].replace(",", ""))

    rating_info = {
        "rating": rating, 
        "overall_rating": overall_rating,
    }

    return rating_info


def get_enrolled_info(html_soup):
    num_students_enrolled_str = html_soup.find_all("div", {"class": "_1fpiay2"})[0].find("strong").find("span").text
    num_students_enrolled = int(num_students_enrolled_str.replace(",", ""))
    
    enrolled_info = {
        "num_students_enrolled": num_students_enrolled, 
    }
    
    return enrolled_info
    
    
def get_description_info(html_soup, specialized_url=False):
    # Specialization pages use a different layout for the description
    if specialized_url:
        course_description = html_soup.find_all("div", {"class": "description"})[0].text
    else:
        course_description = html_soup.find_all("div", {"class": "m-t-1 description"})[0].find("div", {"class": "content-inner"}).find("p").text

    description_info = {
        "course_description": course_description,
    }
    return description_info


def get_provider_info(html_soup):
    course_provider = html_soup.find_all("h3", {"class": "headline-4-text bold rc-Partner__title"})[0].text

    provider_info = {
        "course_provider": course_provider,
    }
    return provider_info


def fetch_course_info_from_course_url(course_url):
    # The first path segment after "coursera.org/" identifies the category
    specialized_url = course_url.split("/")[3] == "specializations"
    print("specialized_url", specialized_url)

    # Fetch the page passed in as course_url, not a global variable
    response = requests.get(course_url)
    html_soup = BeautifulSoup(response.content, 'html.parser')

    # Get Rating Info
    rating_info = get_rating_info(html_soup)

    # Get Enrolled Info
    enrolled_info = get_enrolled_info(html_soup)

    # Get Provider Info
    provider_info = get_provider_info(html_soup)

    # Get Description Info
    description_info = get_description_info(html_soup, specialized_url)

    # Merge all information into a single dictionary
    merged_dict = {**rating_info, **enrolled_info, **description_info, **provider_info}

    return merged_dict
    

In [315]:

url = "https://www.coursera.org/learn/process-mining"
merged_dict = fetch_course_info_from_course_url(url)
print(merged_dict)

url = "https://www.coursera.org/learn/data-management"
merged_dict = fetch_course_info_from_course_url(url)
print(merged_dict)

url = "https://www.coursera.org/specializations/practical-data-science"
merged_dict = fetch_course_info_from_course_url(url)
print(merged_dict)

specialized_url False
{'rating': 1121, 'overall_rating': 4.7, 'num_students_enrolled': 70978, 'course_description': 'Process mining is the missing link between model-based process analysis and data-oriented analysis techniques. Through concrete data sets and easy to use software the course provides data science knowledge that can be applied directly to analyze and improve processes in a variety of domains.', 'course_provider': 'Eindhoven University of Technology'}
specialized_url False
{'rating': 631, 'overall_rating': 4.7, 'num_students_enrolled': 31599, 'course_description': 'This course will provide learners with an introduction to research data management and sharing. After completing this course, learners will understand the diversity of data and their management needs across the research data lifecycle, be able to identify the components of good data management plans, and be familiar with best practices for working with data including the organization, documentation, and storage 

In [316]:
selected_category_name_str = "Data Science"
selected_category_name = selected_category_name_str.lower().replace(" ", "-")

possible_categories = ["physical-science-and-engineering", "data-science", "business"]

assert selected_category_name in possible_categories

course_url = f"https://www.coursera.org/browse/{selected_category_name}"
course_url

'https://www.coursera.org/browse/data-science'

In [415]:

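def value_to_float(value):
    # Numbers pass through unchanged
    if isinstance(value, (int, float)):
        return value

    # Treat a comma as a decimal separator and normalise the suffix to upper case
    value = value.replace(",", ".").upper()
    if 'K' in value:
        if len(value) > 1:
            return float(value.replace('K', '')) * 1000
        return 1000.0
    if 'M' in value:
        if len(value) > 1:
            return float(value.replace('M', '')) * 1000000
        return 1000000.0
    if 'B' in value:
        if len(value) > 1:
            return float(value.replace('B', '')) * 1000000000
        return 1000000000.0

    # No suffix: parse plain counts such as "512", falling back to 0.0
    try:
        return float(value)
    except ValueError:
        return 0.0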


def get_coursera_page_url_by_page_number(page_number, topic):
    url_str = f"https://www.coursera.org/search?page={page_number}&index=prod_all_launched_products_term_optimization"
    topic_url_parsed_str = "&topic=" + urllib.parse.quote(topic)
    full_url = url_str + topic_url_parsed_str
    return full_url


def get_course_attributes(card):
    course_info = {}

    course_name = card.find("h2", {"class": "cds-119 css-bku0rr cds-121"}).text

    course_info["name"] = course_name 
    course_info["rating"] = float(card.find("p", {"class": "cds-119 css-zl0kzj cds-121"}).text)
    course_review_data = card.find_all("p", {"class": "cds-119 css-14d8ngk cds-121"})[0].text

    # Set Course Reviews
    course_reviews_str = course_review_data.replace(" reviews", "").replace("(", "").replace(")", "")

    course_num_of_reviewers = value_to_float(course_reviews_str)
    course_info["num_of_reviewers"] = course_num_of_reviewers    

    return course_info

def get_all_course_card_info(course_card_soup):
    course_cards = course_card_soup.find_all("div", {"class": "css-ilhc4l"})
    
    course_information_list = []
    
    for card in course_cards:
        course_dict = get_course_attributes(card)
        course_information_list.append(course_dict)
    
    return course_information_list

# get_coursera_page_url_by_page_number(page_number=1, topic="Math and Logic")
# 'https://www.coursera.org/search?page=1&index=prod_all_launched_products_term_optimization&topic=Math%20and%20Logic'
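
The review-count parsing done by `get_course_attributes` and `value_to_float` can also be sketched with a single regular expression. This is an alternative illustration, not the notebook's method: `parse_review_count`, `REVIEW_PATTERN` and `SUFFIX_MULTIPLIERS` are hypothetical names, and the sketch assumes that a comma is a thousands separator (unlike `value_to_float`, which treats it as a decimal separator).

```python
import re

# Multipliers for the abbreviated suffixes seen on the search cards
SUFFIX_MULTIPLIERS = {"k": 1_000, "m": 1_000_000, "b": 1_000_000_000}

# Matches text such as "(92.1k reviews)" or "(512 reviews)"
REVIEW_PATTERN = re.compile(r"\(?\s*(\d[\d.,]*)\s*([kmb])?\s*reviews?\s*\)?", re.IGNORECASE)


def parse_review_count(review_text):
    """Return the number of reviews as an int, or None if the text cannot be parsed."""
    match = REVIEW_PATTERN.search(review_text)
    if match is None:
        return None
    number_str, suffix = match.groups()
    try:
        # Assumption: commas are thousands separators, so they are dropped
        number = float(number_str.replace(",", ""))
    except ValueError:
        return None
    multiplier = SUFFIX_MULTIPLIERS[suffix.lower()] if suffix else 1
    return round(number * multiplier)


for text in ["(92.1k reviews)", "(512 reviews)", "no reviews yet"]:
    print(text, "->", parse_review_count(text))
```

Returning `None` instead of `0.0` for unparseable text keeps "no data" distinguishable from "zero reviews" when the results are later loaded into a DataFrame.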

## Individual Card

In [413]:
url = get_coursera_page_url_by_page_number(1, topic)
browser.get(url)
html = browser.page_source
course_card_soup = BeautifulSoup(html, 'lxml')

course_cards = course_card_soup.find_all("div", {"class": "css-ilhc4l"})

card = course_cards[0]

course_data_objs = card.find_all("p", {"class": "cds-119 css-14d8ngk cds-121"})
course_data_objs

IndexError: list index out of range

In [412]:
# course_data_objs = card.find_all("p", {"class": "css-ilhc4l"})
# course_data_objs

[]
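
The `IndexError` above is most likely what Python raises when `course_cards[0]` is evaluated on an empty list, which happens whenever `find_all` matches nothing. A minimal sketch of a guard follows; `first_or_none` and `matched_cards` are hypothetical names that do not appear in the notebook.

```python
def first_or_none(items):
    """Return the first element of a sequence, or None if it is empty."""
    return items[0] if items else None


# An empty list stands in for a find_all call that matched nothing
matched_cards = []
card = first_or_none(matched_cards)

if card is None:
    print("No course cards found; check the selector or whether the page finished loading")
```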

## Full Code

In [398]:
topic = "Data Science"

list_of_courses = []
browser = webdriver.Chrome()

for page_number in range(1,3):
    print(f"page_number: {page_number}")
    url = get_coursera_page_url_by_page_number(page_number, topic)
    print("url", url)

    browser.get(url)
    html = browser.page_source
    course_card_soup = BeautifulSoup(html, 'lxml')
    
    # Check if we have more results; the section may be missing if the page did not load
    results_section = course_card_soup.find("div", {"data-e2e": "NumberOfResultsSection"})
    results_are_finished = (
        results_section is not None
        and results_section.text == "No results found for your search"
    )

    # Stop paging once the search reports no further results
    if results_are_finished:
        print("!!! Breaking !!!")
        break

    course_information_list = get_all_course_card_info(course_card_soup)
    list_of_courses.extend(course_information_list)


page_number: 1
url https://www.coursera.org/search?page=1&index=prod_all_launched_products_term_optimization&topic=Data%20Science
course_review_data (92.1k reviews)
course_reviews_str 92.1k
value 92.1K
course_dict {'name': 'Google Data Analytics', 'rating': 4.8, 'num_of_reviewers': 92100.0}
course_review_data (103.2k reviews)
course_reviews_str 103.2k
value 103.2K
course_dict {'name': 'IBM Data Science', 'rating': 4.6, 'num_of_reviewers': 103200.0}
course_review_data (58.3k reviews)
course_reviews_str 58.3k
value 58.3K
course_dict {'name': 'IBM Data Analyst', 'rating': 4.6, 'num_of_reviewers': 58300.0}
course_review_data (5.7k reviews)
course_reviews_str 5.7k
value 5.7K
course_dict {'name': 'Machine Learning', 'rating': 4.9, 'num_of_reviewers': 5700.0}
course_review_data (75.2k reviews)
course_reviews_str 75.2k
value 75.2K
course_dict {'name': 'Introduction to Data Science', 'rating': 4.6, 'num_of_reviewers': 75200.0}
course_review_data (51.2k reviews)
course_reviews_str 51.2k
value 51

| Unnamed: 0 | name | rating | num_of_reviewers | category |
|---|---|---|---|---|
| 0 | Google Data Analytics | 4.8 | 92100.0 | Data Science |
| 1 | IBM Data Science | 4.6 | 103200.0 | Data Science |
| 2 | IBM Data Analyst | 4.6 | 58300.0 | Data Science |
| 3 | Machine Learning | 4.9 | 5700.0 | Data Science |
| 4 | Introduction to Data Science | 4.6 | 75200.0 | Data Science |
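
The paging loop in the Full Code cell follows a simple pattern: fetch a page, stop when it has no results, otherwise accumulate. A minimal sketch of that pattern without Selenium is shown here, assuming a hypothetical `fetch_page` callable that maps a page number to a list of course dictionaries; `collect_all_pages` and `fake_pages` are illustrative names only.

```python
def collect_all_pages(fetch_page, max_pages):
    """Collect items page by page, stopping early at the first empty page.

    fetch_page is a hypothetical callable: page_number -> list of course dicts.
    """
    all_items = []
    for page_number in range(1, max_pages + 1):
        page_items = fetch_page(page_number)
        if not page_items:
            # An empty page is treated as "no more results"
            break
        all_items.extend(page_items)
    return all_items


# Stand-in for the Selenium fetch: two pages of fake results, then nothing
fake_pages = {
    1: [{"name": "Course A"}, {"name": "Course B"}],
    2: [{"name": "Course C"}],
}
courses = collect_all_pages(lambda page_number: fake_pages.get(page_number, []), max_pages=10)
print(len(courses))
```

Passing the fetch step in as a function keeps the stopping logic testable without opening a browser.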


In [403]:
courses_df = pd.DataFrame(list_of_courses)
courses_df["category"] = topic
courses_df.head()
print(courses_df.shape)

(24, 4)


In [418]:
courses_df.to_csv(f"{topic.lower().replace(' ', '-')}_course_info.csv", index=False)
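
If each run should produce its own file, the topic slug used above can be combined with a date. This is a sketch under assumptions: `build_csv_filename` is a hypothetical helper, and the day-month-year format is only inferred from the dated file name used in the upload cell further down.

```python
from datetime import date


def build_csv_filename(topic, run_date):
    """Build a file name such as 'data-science_course_info_15-10-2022.csv'."""
    topic_slug = topic.lower().replace(" ", "-")
    return f"{topic_slug}_course_info_{run_date.strftime('%d-%m-%Y')}.csv"


print(build_csv_filename("Data Science", date(2022, 10, 15)))
```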

## Save to Server

In [385]:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive


In [386]:
# access the drive
gauth = GoogleAuth()
gauth.LocalWebserverAuth()

drive = GoogleDrive(gauth)

InvalidConfigError: Invalid client secrets file ('Error opening file', 'client_secrets.json', 'No such file or directory', 2)

In [None]:

# the file you want to upload, here simple example
f = drive.CreateFile()
f.SetContentFile('course_category_15-10-2022.csv')

# upload the file
f.Upload()
print(f"title: {f['title']}, mimeType: {f['mimeType']}")

# read all files, the newly uploaded file will be there
file_list = drive.ListFile({'q': "'root' in parents and trashed=false"}).GetList()
for file1 in file_list:
    print(f"title: {file1['title']}, mimeType: {file1['mimeType']}")


In [None]:
# Look into using Google Drive as a server

import gspread
import os
gc = gspread.oauth(credentials_filename='/users/krzysztofpaszta/credentials.json')

os.chdir('/users/krzysztofpaszta/CSVtoGD')

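files = os.listdir()

for filename in files:
    # splitext copes with names that have no extension or several dots
    name, extension = os.path.splitext(filename)
    if extension == ".csv":
        sh = gc.create(name + ' TTF')
        # Read the CSV and close the file before uploading
        with open(filename, 'r') as csv_file:
            content = csv_file.read().encode('utf-8')
        gc.import_csv(sh.id, content)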

## Connect to a server

In [383]:
import pyodbc


In [384]:
pyodbc.drivers()

[]

In [381]:


conn = pyodbc.connect('Driver={SQL Server};'
                      r'Server=RON\SQLEXPRESS;'  # raw string keeps the backslash literal
                      'Database=test_database;'
                      'Trusted_Connection=yes;')
cursor = conn.cursor()

Error: ('01000', "[01000] [unixODBC][Driver Manager]Can't open lib 'SQL Server' : file not found (0) (SQLDriverConnect)")

In [None]:
# Create Table

create_table = False
if create_table:
    cursor.execute(
        '''

            CREATE TABLE products (
                product_id int primary key,
                product_name nvarchar(50),
                price int
            )
        '''
    )

In [None]:
# Add all rows from df

# for row in df.itertuples():
#     cursor.execute(
#         '''
#             INSERT INTO products (product_id, product_name, price)
#             VALUES (?,?,?)
#         ''',
#         row.product_id, 
#         row.product_name,
#         row.price
#                 )
conn.commit()
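
Since no ODBC driver was available above (`pyodbc.drivers()` returned an empty list), the same create/insert/commit flow can be tried locally with the standard-library `sqlite3` module. This is a stand-in sketch, not the SQL Server setup: the `courses` table and its columns are assumptions modelled on the scraped DataFrame, and the rows are copied from the output shown earlier.

```python
import sqlite3

# sqlite3 stands in for SQL Server here so the sketch runs without an ODBC driver
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()

# Hypothetical schema mirroring the columns of the scraped DataFrame
cursor.execute(
    """
    CREATE TABLE courses (
        name TEXT PRIMARY KEY,
        rating REAL,
        num_of_reviewers REAL,
        category TEXT
    )
    """
)

# Rows copied from the scraped output shown earlier
rows = [
    ("Google Data Analytics", 4.8, 92100.0, "Data Science"),
    ("IBM Data Science", 4.6, 103200.0, "Data Science"),
    ("IBM Data Analyst", 4.6, 58300.0, "Data Science"),
]

# Placeholders (?) let the driver handle quoting instead of string formatting
cursor.executemany(
    "INSERT INTO courses (name, rating, num_of_reviewers, category) VALUES (?, ?, ?, ?)",
    rows,
)
conn.commit()

cursor.execute("SELECT COUNT(*), MAX(rating) FROM courses")
row_count, best_rating = cursor.fetchone()
print(row_count, best_rating)
```

The `?` placeholder style matches the commented-out `INSERT` above, so the statements should carry over to a `pyodbc` cursor once a driver is installed, although column types would need adjusting for SQL Server.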