# Data Preprocessing

Import libraries

In [71]:
import pandas as pd
from tqdm import tqdm
from keybert import KeyBERT

Read data

In [72]:
df1 = pd.read_csv('./data/edx_courses.csv')
df2 = pd.read_csv('./data/datacamp_courses_full.csv')

In [73]:
df1

Unnamed: 0,Course Name,Course URL,About This Course
0,CS50's Introduction to Computer Science,https://www.edx.org/learn/computer-science/har...,"This isCS50x, Harvard University's introductio..."
1,CS50's Introduction to Programming with Python,https://www.edx.org/learn/python/harvard-unive...,An introduction to programming using a languag...
2,CS50's Introduction to Artificial Intelligence...,https://www.edx.org/learn/artificial-intellige...,This course explores the concepts and algorith...
3,CS50's Introduction to Cybersecurity,https://www.edx.org/learn/cybersecurity/harvar...,This is CS50's introduction to cybersecurity f...
4,CS50's Web Programming with Python and JavaScript,https://www.edx.org/learn/web-development/harv...,"Topics include database design, scalability, s..."
...,...,...,...
995,算法设计与分析(高级) | Advanced Design and Analysis of ...,https://www.edx.org/learn/algorithms/peking-un...,No description available
996,Programming Reactive Systems,https://www.edx.org/learn/scala/ecole-polytech...,Reactive programming is a set of techniques fo...
997,Paradigms of Computer Programming – Abstractio...,https://www.edx.org/learn/computer-programming...,Louv1.2x and its predecessorLouv1.1xtogether g...
998,Z/OS REXX Programming,https://www.edx.org/learn/computer-programming...,This course is designed to teach you the basic...


In [74]:
df2

Unnamed: 0,Course Name,Course URL,Detailed Description
0,Introduction to Python,https://www.datacamp.com/courses/intro-to-pyth...,An Introduction to Python | Python has grown t...
1,Introduction to SQL,https://www.datacamp.com/courses/introduction-...,Get an Introduction to SQL in Two Hours | Much...
2,Understanding Artificial Intelligence,https://www.datacamp.com/courses/understanding...,Explore the basics of Artificial Intelligence ...
3,Introduction to Power BI,https://www.datacamp.com/courses/introduction-...,A Thorough Introduction to Power BI | In this ...
4,Introduction to R,https://www.datacamp.com/courses/free-introduc...,Learn R Programming | R programming language i...
...,...,...,...
567,Predicting CTR with Machine Learning in Python,https://www.datacamp.com/courses/predicting-ct...,Have you ever wondered how companies like Face...
568,Optimizing R Code with Rcpp,https://www.datacamp.com/courses/optimizing-r-...,"R is a great language for data science, but so..."
569,GDPR in Practice: Compliance and Fines,https://www.datacamp.com/courses/gdpr-in-pract...,Apply GDPR Concepts in Real Business Scenarios...
570,Scalable AI Models with PyTorch Lightning,https://www.datacamp.com/courses/scalable-ai-m...,Foundations of Scalable AI | This course takes...


Rename the description column and concatenate the datasets

In [75]:
df2 = df2.rename(columns={'Detailed Description': 'About This Course'})

In [76]:
df = pd.concat([df1, df2], axis=0).reset_index(drop=True)

In [77]:
df

Unnamed: 0,Course Name,Course URL,About This Course
0,CS50's Introduction to Computer Science,https://www.edx.org/learn/computer-science/har...,"This isCS50x, Harvard University's introductio..."
1,CS50's Introduction to Programming with Python,https://www.edx.org/learn/python/harvard-unive...,An introduction to programming using a languag...
2,CS50's Introduction to Artificial Intelligence...,https://www.edx.org/learn/artificial-intellige...,This course explores the concepts and algorith...
3,CS50's Introduction to Cybersecurity,https://www.edx.org/learn/cybersecurity/harvar...,This is CS50's introduction to cybersecurity f...
4,CS50's Web Programming with Python and JavaScript,https://www.edx.org/learn/web-development/harv...,"Topics include database design, scalability, s..."
...,...,...,...
1567,Predicting CTR with Machine Learning in Python,https://www.datacamp.com/courses/predicting-ct...,Have you ever wondered how companies like Face...
1568,Optimizing R Code with Rcpp,https://www.datacamp.com/courses/optimizing-r-...,"R is a great language for data science, but so..."
1569,GDPR in Practice: Compliance and Fines,https://www.datacamp.com/courses/gdpr-in-pract...,Apply GDPR Concepts in Real Business Scenarios...
1570,Scalable AI Models with PyTorch Lightning,https://www.datacamp.com/courses/scalable-ai-m...,Foundations of Scalable AI | This course takes...
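The rename-then-concatenate step can be illustrated on two toy frames. This is a minimal sketch: the frames `a` and `b` and their contents are made up for illustration and are not taken from the course data.

```python
import pandas as pd

# Two toy frames whose description columns have different names (illustrative data).
a = pd.DataFrame({"Course Name": ["A1", "A2"], "About This Course": ["x", "y"]})
b = pd.DataFrame({"Course Name": ["B1"], "Detailed Description": ["z"]})

# Align the column names first; otherwise concat keeps both columns and fills the gaps with NaN.
b = b.rename(columns={"Detailed Description": "About This Course"})

# Stack the rows and rebuild a clean 0..n-1 index.
combined = pd.concat([a, b], axis=0).reset_index(drop=True)
print(combined)
```

Without `reset_index(drop=True)`, the second frame's rows would keep their original labels and the combined index would contain duplicates.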


Check for missing and duplicated values

In [78]:
df.isna().sum()

Course Name          0
Course URL           0
About This Course    0
dtype: int64

In [79]:
df.duplicated().sum()

0
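Note that `df.duplicated()` only flags rows that are identical in every column. As a hedged sketch, assuming one also wanted to catch the same course listed under slightly different names, the check could be restricted to a key column with `subset`. The toy data below is invented for illustration.

```python
import pandas as pd

# Illustrative data: the first two rows share a URL but differ in name.
toy = pd.DataFrame({
    "Course Name": ["Intro to SQL", "Introduction to SQL", "Intro to R"],
    "Course URL": ["u1", "u1", "u2"],
})

# Rows identical in every column.
full_row_dupes = toy.duplicated().sum()

# Rows whose URL already appeared in an earlier row.
url_dupes = toy.duplicated(subset=["Course URL"]).sum()

print(full_row_dupes, url_dupes)
```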

Remove courses with no description

In [80]:
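# .copy() makes the filtered frame independent, so later column assignments do not trigger SettingWithCopyWarning
df = df[df["About This Course"] != "No description available"].copy()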
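A minimal sketch of this filtering pattern on invented data: a boolean mask selects the rows to keep, and calling `.copy()` on the result gives an independent frame to which new columns can be added without pandas warning about writing to a slice. The frame `toy` and its values are hypothetical.

```python
import pandas as pd

# Illustrative data with one placeholder description.
toy = pd.DataFrame(
    {"About This Course": ["Learn SQL", "No description available", "Learn R"]}
)

# Keep rows with a real description; .copy() detaches the result from `toy`.
kept = toy[toy["About This Course"] != "No description available"].copy()

# Adding a column to the independent copy leaves `toy` untouched.
kept["Description"] = kept["About This Course"].str.lower()
print(kept)
```

The filtered frame keeps the original row labels, which is why the index in the later outputs still runs up to 1570 even though fewer rows remain.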

Clean the description text (trim, lowercase, and collapse repeated whitespace)

In [81]:
df["Description"] = df["About This Course"].str.strip().str.lower().str.replace(r"\s+", " ", regex=True)

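To make the effect of the cleaning chain concrete, here is a small sketch on two invented strings. It shows that the chain normalizes case and spacing but does not repair words that were already fused in the source text.

```python
import pandas as pd

# Illustrative inputs: stray whitespace, a newline and tab, and a fused word.
raw = pd.Series(["  Learn   Python\n\tToday  ", "This isCS50x,  Harvard"])

cleaned = (
    raw.str.strip()                            # drop leading/trailing whitespace
       .str.lower()                            # normalize case
       .str.replace(r"\s+", " ", regex=True)   # collapse whitespace runs to one space
)
print(cleaned.tolist())  # ['learn python today', 'this iscs50x, harvard']
```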


Use `KeyBERT` for keyword extraction

In [82]:
kw_model = KeyBERT()
tqdm.pandas()

df["keywords"] = df["Description"].progress_apply(lambda x: kw_model.extract_keywords(x, top_n=10))
df["keywords"] = df["keywords"].apply(lambda x: [kw[0] for kw in x])
df

100%|██████████| 1557/1557 [11:44<00:00,  2.21it/s]


Unnamed: 0,Course Name,Course URL,About This Course,Description,keywords
0,CS50's Introduction to Computer Science,https://www.edx.org/learn/computer-science/har...,"This isCS50x, Harvard University's introductio...","this iscs50x, harvard university's introductio...","[programming, harvardx, courses, iscs50x, cs50..."
1,CS50's Introduction to Programming with Python,https://www.edx.org/learn/python/harvard-unive...,An introduction to programming using a languag...,an introduction to programming using a languag...,"[programming, python, cs50x, software, cs50p, ..."
2,CS50's Introduction to Artificial Intelligence...,https://www.edx.org/learn/artificial-intellige...,This course explores the concepts and algorith...,this course explores the concepts and algorith...,"[algorithms, python, handwriting, learning, se..."
3,CS50's Introduction to Cybersecurity,https://www.edx.org/learn/cybersecurity/harvar...,This is CS50's introduction to cybersecurity f...,this is cs50's introduction to cybersecurity f...,"[cybersecurity, threats, protect, usability, r..."
4,CS50's Web Programming with Python and JavaScript,https://www.edx.org/learn/web-development/harv...,"Topics include database design, scalability, s...","topics include database design, scalability, s...","[heroku, github, applications, cloud, projects..."
...,...,...,...,...,...
1567,Predicting CTR with Machine Learning in Python,https://www.datacamp.com/courses/predicting-ct...,Have you ever wondered how companies like Face...,have you ever wondered how companies like face...,"[ads, ad, learning, python, click, learn, targ..."
1568,Optimizing R Code with Rcpp,https://www.datacamp.com/courses/optimizing-r-...,"R is a great language for data science, but so...","r is a great language for data science, but so...","[rcpp, boost, performance, language, compiled,..."
1569,GDPR in Practice: Compliance and Fines,https://www.datacamp.com/courses/gdpr-in-pract...,Apply GDPR Concepts in Real Business Scenarios...,apply gdpr concepts in real business scenarios...,"[gdpr, compliance, privacy, regulation, data, ..."
1570,Scalable AI Models with PyTorch Lightning,https://www.datacamp.com/courses/scalable-ai-m...,Foundations of Scalable AI | This course takes...,foundations of scalable ai | this course takes...,"[ai, optimizers, optimized, optimize, learning..."
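For a single document, `extract_keywords` returns a list of `(keyword, score)` tuples, which is why the second assignment keeps only element `[0]` of each pair. The sketch below shows that post-processing step on hand-written tuples standing in for a real model call; the keywords and scores are made up and are not actual KeyBERT output.

```python
# Hand-written stand-in for kw_model.extract_keywords(text, top_n=3);
# the values are illustrative, not real model output.
raw_keywords = [("python", 0.62), ("programming", 0.55), ("software", 0.41)]

# Keep only the keyword string from each (keyword, score) pair.
keywords = [kw for kw, _score in raw_keywords]
print(keywords)  # ['python', 'programming', 'software']
```

Dropping the scores keeps the column simple, at the cost of losing the relevance ranking weights if they are needed later.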


Join the keywords into a single comma-separated string

In [83]:
df["keywords"] = df["keywords"].apply(lambda x: ", ".join(x))

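Storing the keywords as one string avoids writing Python list literals into the CSV. A minimal sketch of the join and its inverse, assuming no individual keyword contains the separator `", "`; the keyword list is illustrative.

```python
# Illustrative keyword list for one course.
keywords = ["gdpr", "compliance", "privacy"]

# Flatten to a plain string that survives a CSV round trip unchanged.
joined = ", ".join(keywords)

# Recover the list after reading the CSV back, using the same separator.
restored = joined.split(", ")

print(joined)  # gdpr, compliance, privacy
```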


In [84]:
df.to_csv('./data/courses.csv', index=False)