# **Algorithmic Methods for Data Mining: Homework 3**

**Author:** Miguel Angel Sanchez Cortes

*MSc. in Data Science, Sapienza University of Rome*

---

## **0. Uploading the Classes and Modules**

Before doing any kind of analysis, we need to import the classes and modules we will use throughout this work.

In [1]:
from modules.web_scraper import WebScraper
from modules.html_parser import HTMLParser
from modules.data_preprocesser import DataPreprocesser
from modules.search_engine import SearchEngine, TopKSearchEngine



---

## **1. Data Collection**

For this homework no dataset was provided; instead, we had to build our own. We were asked to collect relevant information about the courses listed on the [Find a Masters](https://www.findamasters.com/) website, specifically the Master's courses on the first **400** pages of its [MSc. Degrees](https://www.findamasters.com/masters-degrees/msc-degrees/) section.

To do this, we created two custom-made **Python classes** that performed *Web Scraping* and *HTML Parsing* on the first 400 pages of the [MSc. Degrees](https://www.findamasters.com/masters-degrees/msc-degrees/) section of the [Find a Masters](https://www.findamasters.com/) website and obtained as a result a **Pandas Dataframe** with all the relevant information for each course within these pages. Here is a brief description of each of the classes:

- `WebScraper`. This class performs web scraping on multiple pages of the [MSc. Degrees](https://www.findamasters.com/masters-degrees/msc-degrees/) section of the [Find a Masters](https://www.findamasters.com/) website. To do this it uses the following methods:

    - `scrape_urls()`. Scrapes the urls for all the courses within the [MSc. Degrees](https://www.findamasters.com/masters-degrees/msc-degrees/) section of the [Find a Masters](https://www.findamasters.com/) website and saves them in a text file.
    
    - `scrape_htmls()`. Scrapes and saves the HTML file for each URL contained in the text file produced by the `scrape_urls()` method.

- `HTMLParser`. This class parses each HTML file obtained with the `WebScraper.scrape_htmls()` method and obtains relevant information within these files. To do this it uses the following methods:

    - `parse_htmls()`. Parses the HTML file of each course obtained with the `WebScraper.scrape_htmls()` method and extracts: the name of the course, name of the university, name of the faculty, modality (MSc., full time/part time, online/on campus, etc.), description, start date, fees, duration, city, country, and URL. All this information is saved in `.tsv` files.

    - `get_dataframe()`. Builds a dataframe containing the parsed information of all the courses obtained with the `parse_htmls()` method.

For more information about the **implementation** of these classes and methods, please refer to their corresponding `web_scraper.py` and `html_parser.py` files contained in the `modules` directory of our repository.


### **1.1. Getting the list of Master's Degrees courses**

First, we web-scrape the [MSc. Degrees](https://www.findamasters.com/masters-degrees/msc-degrees/) section of the [Find a Masters](https://www.findamasters.com/) website. In particular, we collect the URLs of all the Master's courses listed on the first 400 pages of this section. Since each page lists 15 courses, we obtained $400 \times 15 = 6000$ unique master's degree URLs. The output is a `.txt` file in which each line corresponds to the URL of one master's course.

To do this, we used the `scrape_urls()` method of our `WebScraper` class:

In [2]:
#First, we have to initialize the WebScraper class by calling the constructor
web_scraper = WebScraper()

#As a second step, we can call the .scrape_urls() method to get the list of urls mentioned above. This method doesn't return anything. 
#Instead, the method saves the URLs in a text file called urls.txt
web_scraper.scrape_urls()


After this step we had a `urls.txt` file in which each line is the URL of a different course within the [MSc. Degrees](https://www.findamasters.com/masters-degrees/msc-degrees/) section of the [Find a Masters](https://www.findamasters.com/) website. To check that our code worked, we can print the first **three** lines of this file:

In [4]:
#Here we print the first 3 URLs from the urls.txt file
with open('data/urls.txt', 'r') as f:
    for _ in range(3):
        print(f.readline())


https://www.findamasters.com/masters-degrees/course/3d-design-for-virtual-environments-msc/?i93d2645c19223

https://www.findamasters.com/masters-degrees/course/accounting-and-finance-msc/?i321d3232c3891

https://www.findamasters.com/masters-degrees/course/accounting-accountability-and-financial-management-msc/?i132d7816c25522
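
The internals of `scrape_urls()` live in `web_scraper.py`. As a rough, standard-library-only sketch of the link extraction step, the snippet below pulls course URLs out of a listing page by keeping the anchors whose `href` contains `/masters-degrees/course/`, the pattern visible in the URLs printed above. The `CourseLinkExtractor` class and the `extract_course_urls()` helper are hypothetical names introduced for illustration, and the sample HTML is made up; the real page markup may differ.

```python
from html.parser import HTMLParser as StdHTMLParser  # stdlib parser, not our modules.html_parser class
from urllib.parse import urljoin

BASE_URL = "https://www.findamasters.com"


class CourseLinkExtractor(StdHTMLParser):
    """Collects the hrefs of <a> tags that point to a course page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        # Keep only course links, following the URL pattern seen in urls.txt
        if "/masters-degrees/course/" in href:
            url = urljoin(BASE_URL, href)
            if url not in self.links:  # the same course may be linked more than once
                self.links.append(url)


def extract_course_urls(listing_html):
    """Return the unique course URLs found in one listing page, in order of appearance."""
    extractor = CourseLinkExtractor()
    extractor.feed(listing_html)
    return extractor.links


# Made-up listing fragment: two links to the same course and one unrelated link
sample = """
<div>
  <a href="/masters-degrees/course/accounting-and-finance-msc/?i321d3232c3891">Accounting and Finance</a>
  <a href="/masters-degrees/course/accounting-and-finance-msc/?i321d3232c3891">More details</a>
  <a href="/about/">About</a>
</div>
"""
course_urls = extract_course_urls(sample)
print(course_urls)
```

Filtering on the URL pattern rather than on a CSS class keeps the sketch independent of the page styling, at the cost of needing the de-duplication step.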



### **1.2. Crawling the master's degree pages**

Once we got the URLs of the courses contained in the first 400 pages of the [MSc. Degrees](https://www.findamasters.com/masters-degrees/msc-degrees/) section of the [Find a Masters](https://www.findamasters.com/) website, we saved the HTML files of each individual URL. We performed the following steps:

1. Downloading the HTML corresponding to each of the collected URLs.

2. Saving the downloaded `HTML` files into folders, where each folder contains the `HTML` files of the courses listed on a given page (page 1, page 2, etc.) of the list of master's programs.
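
The page-to-folder mapping in step 2 follows directly from the 15 courses listed on each page. A minimal sketch of that arithmetic is shown below; the `html_path()` helper, the `data/htmls` root and the `page_N`/`course_N.html` naming are assumptions made for illustration and may not match the layout produced by `scrape_htmls()`.

```python
from pathlib import Path

COURSES_PER_PAGE = 15  # each listing page shows 15 courses


def html_path(course_number, root="data/htmls"):
    """Return the (hypothetical) path where the HTML of the n-th course is stored.

    course_number is 1-based and follows the order of the lines in urls.txt.
    """
    page = (course_number - 1) // COURSES_PER_PAGE + 1
    return Path(root) / f"page_{page}" / f"course_{course_number}.html"


for n in (1, 15, 16, 6000):
    print(n, html_path(n))
```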
   
To do this, we used the `scrape_htmls()` method of our `WebScraper` class:

In [3]:
#Here, we can call the .scrape_htmls() method to get the HTMLs of the URLs mentioned above. This method doesn't return anything.
#Instead, the method saves the HTMLs in a folder called htmls, where this folder contains subfolders for each page for the first 400 pages.
#!!DO NOT RUN THIS METHOD IF YOU ALREADY HAVE THE HTMLS IN THE FOLDER!!
web_scraper.scrape_htmls()


Connection error: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response')). Waiting 60 seconds and trying again.
Connection error: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response')). Waiting 60 seconds and trying again.


Note that this method takes approximately $10$ hours to run, since we have to download $6000$ HTML files (about $6$ seconds per file on average). We did not try to optimize the running time, because several issues forced us to slow down the web scraping:

1. When accessing the information, we encountered **Error 429: Too Many Requests** because we were sending requests **too fast**. This forced us to wait a fixed amount of time between requests to make sure each page was retrieved correctly.

2. We had a very *unstable* internet connection while downloading the HTML files. To make sure no file was lost, we waited $60$ seconds every time a **ConnectionError** occurred and then tried again.

3. Finally, even with a delay between requests, some of the HTML files we obtained contained only the text *"Just a moment..."*. We therefore increased the delay between requests to avoid this situation, which took a big toll on the running time of our final code.
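
The three situations above can be handled by one retry loop. The sketch below is only an illustration of that idea and not the code in `web_scraper.py`: `fetch_with_retries()` is a hypothetical helper, the `fetch` callable is injected so that the example runs without network access, and `max_retries=5` is an arbitrary choice. With the `requests` library one would pass `retry_on=(requests.exceptions.ConnectionError,)`.

```python
import time


def fetch_with_retries(fetch, url, retry_on=(ConnectionError,), retry_wait=60,
                       max_retries=5, sleep=time.sleep):
    """Call fetch(url) -> (status_code, text), waiting and retrying on transient failures."""
    for _ in range(max_retries):
        try:
            status, text = fetch(url)
        except retry_on as err:
            print(f"Connection error: {err}. Waiting {retry_wait} seconds and trying again.")
            sleep(retry_wait)
            continue
        # Too Many Requests, or a placeholder page instead of the course: wait and retry
        if status == 429 or "Just a moment..." in text:
            sleep(retry_wait)
            continue
        return text
    raise RuntimeError(f"Giving up on {url} after {max_retries} attempts")


# Simulated server: one dropped connection, one 429, one placeholder page, then success
responses = iter([
    ConnectionError("Connection aborted."),
    (429, ""),
    (200, "<title>Just a moment...</title>"),
    (200, "<html>course page</html>"),
])


def fake_fetch(url):
    item = next(responses)
    if isinstance(item, Exception):
        raise item
    return item


waits = []  # records the waits instead of actually sleeping
html = fetch_with_retries(fake_fetch, "https://example.org/course", sleep=waits.append)
print(html, waits)
```

Injecting `sleep` and `fetch` keeps the waiting policy testable in isolation from the actual downloads.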

We consulted the following sources to solve each one of these problems:

- [Problem HTTP error 403 in Python 3 Web Scraping](https://stackoverflow.com/questions/16627227/problem-http-error-403-in-python-3-web-scraping).

- [urllib2 HTTP error 429](https://stackoverflow.com/questions/13213048/urllib2-http-error-429).

- [Web Scraping using Python (and Beautiful Soup)](https://www.datacamp.com/tutorial/web-scraping-using-python).

### **1.3. Parsing the downloaded pages**

At this point, we have saved the HTML documents of all the master's degrees of interest. The next step is to extract the relevant information from each of these courses. The **structure** of the web page of a course is illustrated by the following image:


<img align = "right" src="https://raw.githubusercontent.com/Sapienza-University-Rome/ADM/master/2023/Homework_3/img/example.jpeg" width="650" height = "450"/>

<div style="text-align: left"> 

The information we desire to obtain for each course and their format is as follows:

1. Course Name (<span style='color:#CB0900'> **CourseName** </span>): `string`.

2. University (<span style='color:#007B04'> **UniversityName** </span>): `string`.

3. Faculty (<span style='color:orange'> **facultyName** </span>): `string`.

4. Full or Part Time (<span style='color:#1A2E81'> **isItFullTime** </span>): `string`.

5. Short Description (<span style='color:black'> **description** </span>): `string`.

6. Start Date (<span style='color:purple'> **startDate** </span>): `string`.

7. Fees (<span style='color:pink'> **fees** </span>): `string`.

8. Modality (<span style='color:#FF9797'> **modality** </span>): `string`.

9. Duration (<span style='color:#3ABC3C'> **duration** </span>): `string`.

10. City (<span style='color:#F93345'> **city** </span>): `string`.

11. Country (<span style='color:yellow'> **country** </span>): `string`.

12. Presence or online modality (<span style='color:#0874FF'> **administration** </span>): `string`.

13. Link to the page (**url**): `string`.
</div>

To obtain this information, we used the `parse_htmls()` method of our `HTMLParser` class. This method extracts the fields listed above from the HTML file of every course contained in the first 400 pages of the [MSc. Degrees](https://www.findamasters.com/masters-degrees/msc-degrees/) section of the [Find a Masters](https://www.findamasters.com/) website and saves them in a `.tsv` file, one per course. At the end we should have $6000$ `.tsv` files:

In [6]:
#First, we have to initialize the HTMLParser class by calling the constructor
html_parser = HTMLParser()

#As a second step, we can call the .parse_htmls() method to get the tsv files containing all the information mentioned before for each course.
#This method doesn't return anything. Instead, it saves one .tsv file per course in the data/tsvs folder
html_parser.parse_htmls()


To check that our code worked, we can print the first **three** `.tsv` files, i.e., the `.tsv` files of the first three courses on the first page of the [MSc. Degrees](https://www.findamasters.com/masters-degrees/msc-degrees/) section of the [Find a Masters](https://www.findamasters.com/) website:

In [8]:
#Here we print the first 3 .tsv files from the data/tsvs folder
for i in range(3):
    with open(f'data/tsvs/course_{i+1}.tsv', 'r') as f:
        print(f.read())


3D Design for Virtual Environments - MSc	Glasgow Caledonian University	School of Engineering and Built Environment	Full time	3D visualisation and animation play a role in many areas, and the popularity of these media just keeps growing. Digital animation provides the eye-catching special effects in the 21st century's favourite films and television shows; 3D design is also essential to everyday work in everything from computer games development, online virtual world development and industrial design to marketing, product design and architecture. GCU's programme in 3D Design for Virtual Environments will help you develop the skills to thrive in a successful career as a visual designer. The programme is practical and career-focused, oriented towards current industry needs, technology and practice. No prior knowledge of 3D design is required.	September	Please see the university website for further information on fees for this course.	MSc	1 year full-time	Glasgow	United Kingdom	On Campus	ht
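
As a hedged illustration of the `.tsv` format shown above, the sketch below writes one course record as a single tab-separated line and reads it back using the standard `csv` module. The helpers `to_tsv_line()` and `from_tsv_line()` are hypothetical and are not the methods of our `HTMLParser` class. The field values come from the first course above, except the description, which is a made-up placeholder containing a tab, and the URL, which is left empty.

```python
import csv
import io

# Column order of the fields listed above
FIELDS = ["courseName", "universityName", "facultyName", "isItFullTime",
          "description", "startDate", "fees", "modality", "duration",
          "city", "country", "administration", "url"]


def to_tsv_line(record):
    """Serialise one course record (a dict) as a single tab-separated line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    # Missing fields are written as empty strings so every line has 13 columns
    writer.writerow([record.get(field, "") for field in FIELDS])
    return buffer.getvalue()


def from_tsv_line(line):
    """Parse a line written by to_tsv_line back into a dict."""
    values = next(csv.reader(io.StringIO(line), delimiter="\t"))
    return dict(zip(FIELDS, values))


course = {
    "courseName": "3D Design for Virtual Environments - MSc",
    "universityName": "Glasgow Caledonian University",
    "facultyName": "School of Engineering and Built Environment",
    "isItFullTime": "Full time",
    "description": "Placeholder description with a\ttab inside.",  # made-up text
    "startDate": "September",
    "modality": "MSc",
    "duration": "1 year full-time",
    "city": "Glasgow",
    "country": "United Kingdom",
    "administration": "On Campus",
}

line = to_tsv_line(course)
round_trip = from_tsv_line(line)
print(round_trip["courseName"], "|", round_trip["city"])
```

Letting the `csv` module handle quoting means that a description containing a tab or a quote character does not shift the remaining columns.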

Once we've obtained the relevant data we can finally visualize it properly by creating a *Pandas DataFrame*. To do this we can use the `get_dataframe()` method of our `HTMLParser` class. This method joins all the `.tsv` files created before in a single DataFrame:

In [2]:
html_parser = HTMLParser()
#Here we can call the .get_dataframe() method to get the pandas dataframe containing all the information mentioned before for each course.
course_dataset = html_parser.get_dataframe()
#Here we print the first 5 rows of the dataframe
course_dataset.head()


Unnamed: 0,courseName,universityName,facultyName,isItFullTime,description,startDate,fees,modality,duration,city,country,administration,url
0,3D Design for Virtual Environments - MSc,Glasgow Caledonian University,School of Engineering and Built Environment,Full time,3D visualisation and animation play a role in ...,September,Please see the university website for further ...,MSc,1 year full-time,Glasgow,United Kingdom,On Campus,https://www.findamasters.com/masters-degrees/c...
1,Accounting and Finance - MSc,University of Leeds,Leeds University Business School,Full time,Businesses and governments rely on sound finan...,September,"UK: £18,000 (Total) International: £34,750 (To...",MSc,1 year full time,Leeds,United Kingdom,On Campus,https://www.findamasters.com/masters-degrees/c...
2,"Accounting, Accountability & Financial Managem...",King’s College London,King’s Business School,Full time,"Our Accounting, Accountability & Financial Man...",September,Please see the university website for further ...,MSc,1 year FT,London,United Kingdom,On Campus,https://www.findamasters.com/masters-degrees/c...
3,"Accounting, Financial Management and Digital B...",University of Reading,Henley Business School,Full time,Embark on a professional accounting career wit...,September,Please see the university website for further ...,MSc,1 year full time,Reading,United Kingdom,On Campus,https://www.findamasters.com/masters-degrees/c...
4,Addictions MSc,King’s College London,"Institute of Psychiatry, Psychology and Neuros...",Full time,Join us for an online session for prospective ...,September,Please see the university website for further ...,MSc,One year FT,London,United Kingdom,On Campus,https://www.findamasters.com/masters-degrees/c...


----

## **2. Search Engine**

Now that our data is cleaned and organized in a single dataset, we want to build a Search Engine that, given a query as input, returns the courses that match it. To do this we created **two** custom-made Python classes: one that pre-processes our data so that the Search Engine is efficient and accurate, and one that implements the Search Engine itself. Here is a brief description of each of the classes:

- `DataPreprocesser`: This class preprocesses the fees column and the text columns (removes punctuation/stopwords and performs tokenizing and lemmatizing) of the MSc courses dataset. To do this it uses the following methods:

    - `preprocess_fees_column()`. Finds and converts the course fees into Euros, obtains the maximum fee and saves it as a float in a new column.

    - `preprocess_text_column()`. Removes punctuation and stopwords from the text, and performs tokenizing and lemmatizing of every word.


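- `SearchEngine`: This class takes the preprocessed dataset and exposes a `query()` method that returns the courses matching a text query. Its `TopKSearchEngine` counterpart additionally returns a `similarity` score for each result.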
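
To make the idea behind `preprocess_fees_column()` concrete, here is a minimal sketch of finding the amounts in a fees string, converting them to Euros and keeping the maximum. It is not the implementation in `data_preprocesser.py`: the `max_fee_in_eur()` helper is hypothetical, only three currency symbols are recognised, and the exchange rates are fixed, illustrative values.

```python
import re

# Illustrative, hard-coded exchange rates to EUR (assumed values, not the ones used in our module)
RATES_TO_EUR = {"£": 1.15, "$": 0.92, "€": 1.0}

# A currency symbol followed by a number such as 915, 18,000 or 1,580.50
AMOUNT = re.compile(r"([£$€])\s?(\d[\d,]*(?:\.\d+)?)")


def max_fee_in_eur(fees_text):
    """Return the largest fee mentioned in the text, in Euros, or None if no amount is found."""
    amounts = [
        float(value.replace(",", "")) * RATES_TO_EUR[symbol]
        for symbol, value in AMOUNT.findall(fees_text)
    ]
    return max(amounts) if amounts else None


# Example strings adapted from the fees column
with_amounts = max_fee_in_eur("UK: £18,000 International: £34,750")
without_amounts = max_fee_in_eur(
    "Please see the university website for further information on fees for this course."
)
print(with_amounts, without_amounts)
```

Returning `None` when no amount is present keeps courses without published fees distinguishable from courses that are free.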


### **2.0. Preprocessing the text**

In [3]:
data_preprocesser = DataPreprocesser(course_dataset)
data_preprocesser.preprocess_text_column(column_name="description")
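
For readers who want a feel for what the text preprocessing does, the following standard-library sketch lowercases a description, strips punctuation, tokenizes on whitespace and removes stopwords. It is a simplification under stated assumptions: the stopword list is a tiny illustrative one, the lemmatizing step performed by `preprocess_text_column()` is omitted, and `preprocess()` is a hypothetical helper.

```python
import string

# Tiny illustrative stopword list; a real one would be much larger
STOPWORDS = {"a", "an", "and", "at", "for", "in", "is", "of", "the", "this", "to", "will", "you"}

# Translation table that replaces every punctuation character with a space
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(text):
    """Lowercase, remove punctuation, tokenize and drop stopwords (no lemmatizing)."""
    cleaned = text.lower().translate(PUNCT_TO_SPACE)
    return [token for token in cleaned.split() if token not in STOPWORDS]


tokens = preprocess("This MSc keeps you at the cutting edge of developments.")
print(tokens)
```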


In [10]:
search_engine = SearchEngine(data_preprocesser.dataset)
result = search_engine.query("advanced knowledge")
result


Unnamed: 0,courseName,universityName,description,url
1,Accounting and Finance - MSc,University of Leeds,Businesses and governments rely on sound finan...,https://www.findamasters.com/masters-degrees/c...
198,Master of Public Health (MPH) Online,Brunel University Online,Brunel’s Master of Public Health (online) has ...,https://www.findamasters.com/masters-degrees/c...
204,Master of Science in Mechanical Engineering,The Hong Kong University of Science and Techno...,The program is designed to benefit students wi...,https://www.findamasters.com/masters-degrees/c...
215,Materials Science and Engineering - MSc,University of Leeds,Materials science is at the forefront of provi...,https://www.findamasters.com/masters-degrees/c...
239,MSc Advanced Computer Science,University of Sheffield,This MSc keeps you at the cutting edge of deve...,https://www.findamasters.com/masters-degrees/c...
...,...,...,...,...
5872,Master of Science in Professional Nursing,Atlantic Technological University,The programme provides the nurse with a broad-...,https://www.findamasters.com/masters-degrees/c...
5904,Master of Science/Postgraduate Diploma in Civi...,The Hong Kong University of Science and Techno...,The program offers advanced civil engineering ...,https://www.findamasters.com/masters-degrees/c...
5905,Master of Science/Postgraduate Diploma in Envi...,The Hong Kong University of Science and Techno...,The program is meant to meet the needs of prac...,https://www.findamasters.com/masters-degrees/c...
5943,Master's course in Cognitive Science,University of Trento,The two-year CIMeC Master’s in Cognitive Scien...,https://www.findamasters.com/masters-degrees/c...
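
The implementation of `SearchEngine` is in `search_engine.py` and is not reproduced here. One common way to answer a query like the one above is an inverted index combined with a conjunctive (AND) match, where a course is returned only if its description contains every query term. The sketch below illustrates that technique under this assumption; the function names and the toy documents are made up.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of document ids whose token list contains it."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[token].add(doc_id)
    return index


def conjunctive_query(index, query_tokens):
    """Return the sorted ids of the documents that contain every query term."""
    postings = [index.get(token, set()) for token in query_tokens]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


# Toy, already preprocessed descriptions (made-up data)
docs = {
    0: ["advanced", "knowledge", "computing"],
    1: ["advanced", "practice"],
    2: ["knowledge", "skill", "advanced"],
}
index = build_inverted_index(docs)
matches = conjunctive_query(index, ["advanced", "knowledge"])
print(matches)
```

Because the index is built once, each query only needs to intersect a few posting sets instead of scanning all the descriptions.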


In [23]:
search_engine = TopKSearchEngine(data_preprocesser.dataset)
result = search_engine.query("advanced knowledge")
result


Unnamed: 0,courseName,universityName,description,url,similarity
651,Advanced Clinical Practice - MSc,Canterbury Christ Church University,Gain the knowledge and skills needed to become...,https://www.findamasters.com/masters-degrees/c...,0.344516
5242,Management and Digital Business (with Advanced...,Liverpool John Moores University,This Advanced Practice course provides an in-d...,https://www.findamasters.com/masters-degrees/c...,0.342258
753,Advanced Computing MSc,King’s College London,Our Advanced Computing MSc provides knowledge ...,https://www.findamasters.com/masters-degrees/c...,0.335546
857,Advanced Nurse Practitioner/Professional Pract...,University of the Highlands and Islands,Developed in partnership with expert clinical ...,https://www.findamasters.com/masters-degrees/c...,0.304013
698,Advanced Clinical Practice MSc,University of Greenwich,Develop your skills and deepen your knowledge ...,https://www.findamasters.com/masters-degrees/c...,0.297118
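
The `similarity` column ranks the courses by how close their description is to the query. How `TopKSearchEngine` computes it is not shown here; a typical choice, assumed in the sketch below, is the cosine similarity between tf-idf vectors, with a heap used to keep only the best $k$ results. All function names and the toy documents are hypothetical.

```python
import heapq
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return a tf-idf vector (dict term -> weight) per document, plus the idf table."""
    n_docs = len(docs)
    doc_freq = Counter(token for tokens in docs.values() for token in set(tokens))
    # The +1.0 keeps a non-zero weight for terms that appear in every document
    idf = {token: math.log(n_docs / count) + 1.0 for token, count in doc_freq.items()}
    vectors = {
        doc_id: {token: tf * idf[token] for token, tf in Counter(tokens).items()}
        for doc_id, tokens in docs.items()
    }
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(token, 0.0) for token, weight in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def top_k(docs, query_tokens, k=5):
    """Return up to k (similarity, doc_id) pairs with non-zero similarity, best first."""
    vectors, idf = tfidf_vectors(docs)
    query = {token: tf * idf.get(token, 0.0) for token, tf in Counter(query_tokens).items()}
    scored = ((cosine(query, vector), doc_id) for doc_id, vector in vectors.items())
    return heapq.nlargest(k, (pair for pair in scored if pair[0] > 0))


# Toy, already preprocessed descriptions (made-up data)
docs = {
    0: ["advanced", "knowledge"],
    1: ["advanced", "practice", "nursing"],
    2: ["marine", "biology"],
}
ranking = top_k(docs, ["advanced", "knowledge"], k=2)
print(ranking)
```

Unlike a conjunctive match, this ranking can return a course that contains only some of the query terms, just with a lower score.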


In [25]:
#Here we select the rows of the top-5 results once and extract the extra columns from them
top_rows = data_preprocesser.dataset.iloc[[651, 5242, 753, 857, 698]]
df_city = list(top_rows["city"])
df_country = list(top_rows["country"])
df_fees = list(top_rows["fees"])
duration = list(top_rows["duration"])
faculty = list(top_rows["facultyName"])


In [74]:
df_city


['Canterbury', 'Liverpool', 'London', 'Inverness', 'London']

In [28]:
#Here we add to the result dataframe the city, country, fees, duration and faculty of each course
result["city"] = df_city
result["country"] = df_country
result["fees"] = df_fees
result["duration"] = duration
result["facultyName"] = faculty

#Here we add also the full address
result['Full_Address'] = result['facultyName'] + ',' + result['universityName'] + ',' + \
                result['city'] + ',' + \
                result['country']



In [29]:
result


Unnamed: 0,courseName,universityName,description,url,similarity,city,country,fees,duration,facultyName,Full_Address
651,Advanced Clinical Practice - MSc,Canterbury Christ Church University,Gain the knowledge and skills needed to become...,https://www.findamasters.com/masters-degrees/c...,0.344516,Canterbury,United Kingdom,"UK Part time - £915 or £1,580 per 20 credit mo...",2 or 3 years part time,"Faculty of Medicine, Health and Social Care","Faculty of Medicine, Health and Social Care,Ca..."
5242,Management and Digital Business (with Advanced...,Liverpool John Moores University,This Advanced Practice course provides an in-d...,https://www.findamasters.com/masters-degrees/c...,0.342258,Liverpool,United Kingdom,Please see the university website for further ...,See course dates on website,Faculty of Business and Law,"Faculty of Business and Law,Liverpool John Moo..."
753,Advanced Computing MSc,King’s College London,Our Advanced Computing MSc provides knowledge ...,https://www.findamasters.com/masters-degrees/c...,0.335546,London,United Kingdom,Please see the university website for further ...,Full-time: One year,Faculty of Natural and Mathematical Sciences,"Faculty of Natural and Mathematical Sciences,K..."
857,Advanced Nurse Practitioner/Professional Pract...,University of the Highlands and Islands,Developed in partnership with expert clinical ...,https://www.findamasters.com/masters-degrees/c...,0.304013,Inverness,United Kingdom,Please see the university website for further ...,"1 year part time PGCert, 2 years part time PGD...","Science, Health and Engineering","Science, Health and Engineering,University of ..."
698,Advanced Clinical Practice MSc,University of Greenwich,Develop your skills and deepen your knowledge ...,https://www.findamasters.com/masters-degrees/c...,0.297118,London,United Kingdom,Please see the university website for further ...,3 years,School of Health Sciences,"School of Health Sciences,University of Greenw..."


In [30]:
import googlemaps
import os

#Here we get the API key from the environment variable
API_KEY = os.environ['GOOGLE_API_KEY']

#Here we initialize the Google Maps API client
gmaps = googlemaps.Client(key=API_KEY)

#Here we geocode each address only once and then extract both coordinates from the response
locations = result["Full_Address"].apply(lambda x: gmaps.geocode(x)[0]["geometry"]["location"])
result["lat"] = locations.apply(lambda loc: loc["lat"])
result["lng"] = locations.apply(lambda loc: loc["lng"])





In [31]:
result


Unnamed: 0,courseName,universityName,description,url,similarity,city,country,fees,duration,facultyName,Full_Address,lat,lng
651,Advanced Clinical Practice - MSc,Canterbury Christ Church University,Gain the knowledge and skills needed to become...,https://www.findamasters.com/masters-degrees/c...,0.344516,Canterbury,United Kingdom,"UK Part time - £915 or £1,580 per 20 credit mo...",2 or 3 years part time,"Faculty of Medicine, Health and Social Care","Faculty of Medicine, Health and Social Care,Ca...",51.279496,1.089876
5242,Management and Digital Business (with Advanced...,Liverpool John Moores University,This Advanced Practice course provides an in-d...,https://www.findamasters.com/masters-degrees/c...,0.342258,Liverpool,United Kingdom,Please see the university website for further ...,See course dates on website,Faculty of Business and Law,"Faculty of Business and Law,Liverpool John Moo...",53.403288,-2.973098
753,Advanced Computing MSc,King’s College London,Our Advanced Computing MSc provides knowledge ...,https://www.findamasters.com/masters-degrees/c...,0.335546,London,United Kingdom,Please see the university website for further ...,Full-time: One year,Faculty of Natural and Mathematical Sciences,"Faculty of Natural and Mathematical Sciences,K...",51.469949,-0.089087
857,Advanced Nurse Practitioner/Professional Pract...,University of the Highlands and Islands,Developed in partnership with expert clinical ...,https://www.findamasters.com/masters-degrees/c...,0.304013,Inverness,United Kingdom,Please see the university website for further ...,"1 year part time PGCert, 2 years part time PGD...","Science, Health and Engineering","Science, Health and Engineering,University of ...",57.471014,-4.230628
698,Advanced Clinical Practice MSc,University of Greenwich,Develop your skills and deepen your knowledge ...,https://www.findamasters.com/masters-degrees/c...,0.297118,London,United Kingdom,Please see the university website for further ...,3 years,School of Health Sciences,"School of Health Sciences,University of Greenw...",51.518173,-0.099507
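
Geocoding requests are billed and can return an empty list when an address is not recognised. The sketch below wraps any geocoding function so that each address is looked up at most once and unknown addresses yield `(None, None)` instead of raising an `IndexError`. The `make_geocoder()` helper is hypothetical, and the fake geocoder only mimics the `[0]["geometry"]["location"]` structure used in the cell above.

```python
from functools import lru_cache


def make_geocoder(geocode):
    """Wrap a geocode(address) -> list-of-results function with caching and a safe fallback."""

    @lru_cache(maxsize=None)
    def lookup(address):
        results = geocode(address)
        if not results:  # the address was not found
            return (None, None)
        location = results[0]["geometry"]["location"]
        return (location["lat"], location["lng"])

    return lookup


# Fake geocoder used here so that the example runs without an API key
calls = []


def fake_geocode(address):
    calls.append(address)
    if address == "nowhere":
        return []
    return [{"geometry": {"location": {"lat": 51.5, "lng": -0.1}}}]


lookup = make_geocoder(fake_geocode)
coords = [lookup(address) for address in ["London", "London", "nowhere"]]
print(coords, calls)
```

With the real client, `lookup = make_geocoder(gmaps.geocode)` would be the equivalent call, after which `result["Full_Address"].apply(lookup)` gives one `(lat, lng)` pair per row.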


In [114]:
import plotly.express as px

#Here we plot the map
fig = px.scatter_mapbox(
    result,
    lat="lat",
    lon="lng",
    hover_name="courseName",
    size="similarity",
    color="similarity",
    opacity=0.7,
    color_continuous_scale=px.colors.cyclical.Edge,
    hover_data=["universityName", "facultyName", "fees", "duration"],
)

fig.update_layout(mapbox={"style": "open-street-map", "zoom": 4}, margin={"t":0,"b":0,"l":0,"r":0}, hovermode='closest', coloraxis_colorbar={"title":"Similarity", "x":0, "y":0.5, "orientation":"v"})

fig.update_traces(hovertemplate="<br>".join([
    "%{hovertext}",
    "University: %{customdata[0]}",
    "Faculty: %{customdata[1]}",
    "Fees: %{customdata[2]}",
    "Duration: %{customdata[3]}"
]))
