# Course Recommendation System using Udemy Dataset

**Algorithms**
* Cosine Similarity
* Linear Kernel

**Workflow**
* Import Dataset
* Vectorize Dataset
* Cosine Similarity Matrix
* ID Score
* Recommend

In [1]:
# Import dependencies
import pandas as pd # data analysis and manipulation tool
import numpy as np # array and numerical computing library
import neattext.functions as nfx # used to clean text data
from sklearn.feature_extraction.text import TfidfVectorizer # Convert a collection of raw documents to a matrix of TF-IDF features
from sklearn.feature_extraction.text import CountVectorizer # a method to convert text to numerical data
from sklearn.metrics.pairwise import cosine_similarity # measures the similarity between two vectors of an inner product space
from sklearn.metrics.pairwise import linear_kernel # Compute the linear kernel between X and Y

In [2]:
df = pd.read_csv("udemy_course_data.csv")
df.head()

Unnamed: 0,course_id,course_title,url,is_paid,price,num_subscribers,num_reviews,num_lectures,level,content_duration,published_timestamp,subject,profit,published_date,published_time,year,month,day
0,1070968,Ultimate Investment Banking Course,https://www.udemy.com/ultimate-investment-bank...,True,200,2147,23,51,All Levels,1.5 hours,2017-01-18T20:58:58Z,Business Finance,429400,2017-01-18,20:58:58Z,2017,1,18
1,1113822,Complete GST Course & Certification - Grow You...,https://www.udemy.com/goods-and-services-tax/,True,75,2792,923,274,All Levels,39 hours,2017-03-09T16:34:20Z,Business Finance,209400,2017-03-09,16:34:20Z,2017,3,9
2,1006314,Financial Modeling for Business Analysts and C...,https://www.udemy.com/financial-modeling-for-b...,True,45,2174,74,51,Intermediate Level,2.5 hours,2016-12-19T19:26:30Z,Business Finance,97830,2016-12-19,19:26:30Z,2016,12,19
3,1210588,Beginner to Pro - Financial Analysis in Excel ...,https://www.udemy.com/complete-excel-finance-c...,True,95,2451,11,36,All Levels,3 hours,2017-05-30T20:07:24Z,Business Finance,232845,2017-05-30,20:07:24Z,2017,5,30
4,1011058,How To Maximize Your Profits Trading Options,https://www.udemy.com/how-to-maximize-your-pro...,True,200,1276,45,26,Intermediate Level,2 hours,2016-12-13T14:57:18Z,Business Finance,255200,2016-12-13,14:57:18Z,2016,12,13


In [3]:
# convert price from Rupee to Dollar
df["price"] = df["price"] * 0.0121
df.head()

Unnamed: 0,course_id,course_title,url,is_paid,price,num_subscribers,num_reviews,num_lectures,level,content_duration,published_timestamp,subject,profit,published_date,published_time,year,month,day
0,1070968,Ultimate Investment Banking Course,https://www.udemy.com/ultimate-investment-bank...,True,2.42,2147,23,51,All Levels,1.5 hours,2017-01-18T20:58:58Z,Business Finance,429400,2017-01-18,20:58:58Z,2017,1,18
1,1113822,Complete GST Course & Certification - Grow You...,https://www.udemy.com/goods-and-services-tax/,True,0.9075,2792,923,274,All Levels,39 hours,2017-03-09T16:34:20Z,Business Finance,209400,2017-03-09,16:34:20Z,2017,3,9
2,1006314,Financial Modeling for Business Analysts and C...,https://www.udemy.com/financial-modeling-for-b...,True,0.5445,2174,74,51,Intermediate Level,2.5 hours,2016-12-19T19:26:30Z,Business Finance,97830,2016-12-19,19:26:30Z,2016,12,19
3,1210588,Beginner to Pro - Financial Analysis in Excel ...,https://www.udemy.com/complete-excel-finance-c...,True,1.1495,2451,11,36,All Levels,3 hours,2017-05-30T20:07:24Z,Business Finance,232845,2017-05-30,20:07:24Z,2017,5,30
4,1011058,How To Maximize Your Profits Trading Options,https://www.udemy.com/how-to-maximize-your-pro...,True,2.42,1276,45,26,Intermediate Level,2 hours,2016-12-13T14:57:18Z,Business Finance,255200,2016-12-13,14:57:18Z,2016,12,13


In [4]:
# convert course title to lower cases
df["course_title"] = df["course_title"].apply(lambda x: x.lower())

## Recommendation System

To build the course recommendation system, we work with the `course_title` column only. Because it is text data, we start by cleaning it with `neattext.functions`.

**`neattext`** is a simple Natural Language Processing package for cleaning and pre-processing text data. It can be used to clean sentences and to extract emails, phone numbers, web links, and emojis from them.
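
To make the two cleaning steps used in this notebook concrete, here is a minimal standard-library sketch of stopword removal and special-character removal. It is not how `neattext` is implemented: the tiny `STOPWORDS` set and both helper names are assumptions made purely for illustration.

```python
import re

# Hypothetical, deliberately tiny stopword set (an assumption for this sketch);
# neattext ships its own, much larger lists.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "the", "to", "your"}


def remove_stopwords_sketch(text: str) -> str:
    """Drop whitespace-separated tokens that appear in STOPWORDS."""
    return " ".join(tok for tok in text.split() if tok.lower() not in STOPWORDS)


def remove_special_characters_sketch(text: str) -> str:
    """Delete every character that is not an ASCII letter, digit, or whitespace."""
    return re.sub(r"[^A-Za-z0-9\s]", "", text)


title = "beginner to pro - financial analysis in excel 2017"
no_special = remove_special_characters_sketch(title)
cleaned = remove_stopwords_sketch(no_special)
print(cleaned)
```

Note that the second step takes the output of the first as its input; chaining the calls this way is what keeps both clean-ups in the final string.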

In [5]:
# List all the names (functions, constants, regex patterns) available in neattext.functions
dir(nfx)

['BTC_ADDRESS_REGEX',
 'CURRENCY_REGEX',
 'CURRENCY_SYMB_REGEX',
 'Counter',
 'DATE_REGEX',
 'EMAIL_REGEX',
 'EMOJI_REGEX',
 'HASTAG_REGEX',
 'MASTERCard_REGEX',
 'MD5_SHA_REGEX',
 'MOST_COMMON_PUNCT_REGEX',
 'NUMBERS_REGEX',
 'PHONE_REGEX',
 'PoBOX_REGEX',
 'SPECIAL_CHARACTERS_REGEX',
 'STOPWORDS',
 'STOPWORDS_de',
 'STOPWORDS_en',
 'STOPWORDS_es',
 'STOPWORDS_fr',
 'STOPWORDS_ru',
 'STOPWORDS_yo',
 'STREET_ADDRESS_REGEX',
 'TextFrame',
 'URL_PATTERN',
 'USER_HANDLES_REGEX',
 'VISACard_REGEX',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__generate_text',
 '__loader__',
 '__name__',
 '__numbers_dict',
 '__package__',
 '__spec__',
 '_lex_richness_herdan',
 '_lex_richness_maas_ttr',
 'clean_text',
 'defaultdict',
 'digit2words',
 'extract_btc_address',
 'extract_currencies',
 'extract_currency_symbols',
 'extract_dates',
 'extract_emails',
 'extract_emojis',
 'extract_hashtags',
 'extract_html_tags',
 'extract_mastercard_addr',
 'extract_md5sha',
 'extract_numbers',
 'extr

In [6]:
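# Remove stopwords
df["clean_title"] = df["course_title"].apply(nfx.remove_stopwords)

# Remove special characters
# Caution: this reads from "course_title" again, so it overwrites the
# stopword-free titles created above and the stopwords remain in "clean_title".
# To keep both clean-ups, apply this step to df["clean_title"] instead.
df["clean_title"] = df["course_title"].apply(nfx.remove_special_characters)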

df.head()

Unnamed: 0,course_id,course_title,url,is_paid,price,num_subscribers,num_reviews,num_lectures,level,content_duration,published_timestamp,subject,profit,published_date,published_time,year,month,day,clean_title
0,1070968,ultimate investment banking course,https://www.udemy.com/ultimate-investment-bank...,True,2.42,2147,23,51,All Levels,1.5 hours,2017-01-18T20:58:58Z,Business Finance,429400,2017-01-18,20:58:58Z,2017,1,18,ultimate investment banking course
1,1113822,complete gst course & certification - grow you...,https://www.udemy.com/goods-and-services-tax/,True,0.9075,2792,923,274,All Levels,39 hours,2017-03-09T16:34:20Z,Business Finance,209400,2017-03-09,16:34:20Z,2017,3,9,complete gst course certification grow your ...
2,1006314,financial modeling for business analysts and c...,https://www.udemy.com/financial-modeling-for-b...,True,0.5445,2174,74,51,Intermediate Level,2.5 hours,2016-12-19T19:26:30Z,Business Finance,97830,2016-12-19,19:26:30Z,2016,12,19,financial modeling for business analysts and c...
3,1210588,beginner to pro - financial analysis in excel ...,https://www.udemy.com/complete-excel-finance-c...,True,1.1495,2451,11,36,All Levels,3 hours,2017-05-30T20:07:24Z,Business Finance,232845,2017-05-30,20:07:24Z,2017,5,30,beginner to pro financial analysis in excel 2017
4,1011058,how to maximize your profits trading options,https://www.udemy.com/how-to-maximize-your-pro...,True,2.42,1276,45,26,Intermediate Level,2 hours,2016-12-13T14:57:18Z,Business Finance,255200,2016-12-13,14:57:18Z,2016,12,13,how to maximize your profits trading options


### Vectorize `clean_title` column

Text vectorization is the process of converting text into a numerical representation.

We decided not to stem or lemmatize the titles because they are short.
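
As a rough illustration of what a count-based vectorizer produces, the sketch below builds a vocabulary and a count matrix by hand for two of the course titles. It uses only the standard library and plain whitespace tokenization, which is an assumption: `CountVectorizer` applies its own tokenization rules and returns a sparse matrix rather than nested lists.

```python
from collections import Counter

titles = [
    "ultimate investment banking course",
    "complete gst course certification grow your ca practice",
]

# Count the tokens of each title once.
token_counts = [Counter(title.split()) for title in titles]

# The vocabulary is the sorted set of all distinct tokens; each one becomes a column.
vocabulary = sorted({token for counts in token_counts for token in counts})

# One row per title, one column per vocabulary word, each cell a raw count.
count_matrix = [[counts[word] for word in vocabulary] for counts in token_counts]

print(vocabulary)
for row in count_matrix:
    print(row)
```

Most cells are zero because each title uses only a few of the vocabulary words, which is why the real matrix is stored in a sparse format.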

In [7]:
# Vectorize the titles
count_vect = CountVectorizer()
cv_matrix = count_vect.fit_transform(df["clean_title"])
cv_matrix

<3683x3680 sparse matrix of type '<class 'numpy.int64'>'
	with 23448 stored elements in Compressed Sparse Row format>

### Cosine Similarity

Cosine similarity measures the similarity between two vectors of an inner product space. It is measured by the cosine of the angle between two vectors and determines whether two vectors are pointing in roughly the same direction. It is often used to measure document similarity in text analysis.
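
For two count vectors $A$ and $B$, the score is $\cos\theta = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$. The sketch below works this out by hand for the first two course titles. The function name and the whitespace tokenization are assumptions for illustration; the notebook itself relies on scikit-learn's `cosine_similarity`.

```python
import math
from collections import Counter


def cosine_similarity_sketch(a: str, b: str) -> float:
    """Cosine similarity between the token-count vectors of two strings."""
    va, vb = Counter(a.split()), Counter(b.split())
    # Only tokens present in both strings contribute to the dot product.
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty string has no direction; treat it as dissimilar
    return dot / (norm_a * norm_b)


score = cosine_similarity_sketch(
    "ultimate investment banking course",
    "complete gst course certification grow your ca practice",
)
print(round(score, 7))  # 0.1767767
```

The two titles share only the token `course` and contain 4 and 8 tokens respectively, so the score is $1 / (\sqrt{4}\,\sqrt{8}) \approx 0.1768$, which agrees with the `0.1767767` entry between the first two courses in the matrix output of this section.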

In [8]:
# Compute the pairwise cosine similarity matrix
cos_sim = cosine_similarity(cv_matrix)
cos_sim

array([[1.        , 0.1767767 , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.1767767 , 1.        , 0.        , ..., 0.        , 0.125     ,
        0.        ],
       [0.        , 0.        , 1.        , ..., 0.16903085, 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.16903085, ..., 1.        , 0.        ,
        0.31622777],
       [0.        , 0.125     , 0.        , ..., 0.        , 1.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.31622777, 0.        ,
        1.        ]])

In [9]:
cos_sim.shape

(3683, 3683)

### Recommend Course

In [10]:
# pick a course title from `course_title` column
title = "how to maximize your profits trading options"

In [11]:
# Map each course title to its row index
# (drop_duplicates() compares the index values, which are already unique, so no rows are dropped)
course_index = pd.Series(df.index, 
                        index=df["course_title"]).drop_duplicates()

course_index

course_title
ultimate investment banking course                                0
complete gst course & certification - grow your ca practice       1
financial modeling for business analysts and consultants          2
beginner to pro - financial analysis in excel 2017                3
how to maximize your profits trading options                      4
                                                               ... 
learn jquery from scratch - master of javascript library       3678
how to design a wordpress website with no coding at all        3679
learn and build using polymer                                  3680
css animations: create amazing effects on your website         3681
using modx cms to build websites: a beginner's guide           3682
Length: 3683, dtype: int64

In [12]:
# Get index of title
index = course_index[title]
index

4

In [13]:
# Get the row of similarity scores for the chosen title
cos_sim[index]

array([0.        , 0.13363062, 0.        , ..., 0.        , 0.13363062,
       0.13363062])

In [14]:
# Get a list of (course index, cosine similarity score) pairs
scores = list(enumerate(cos_sim[index]))
scores

[(0, 0.0),
 (1, 0.13363062095621217),
 (2, 0.0),
 (3, 0.13363062095621217),
 (4, 0.9999999999999997),
 (5, 0.1259881576697424),
 (6, 0.13363062095621217),
 (7, 0.13363062095621217),
 (8, 0.26726124191242434),
 (9, 0.1259881576697424),
 (10, 0.1259881576697424),
 (11, 0.3380617018914066),
 (12, 0.0),
 (13, 0.1259881576697424),
 (14, 0.2519763153394848),
 (15, 0.1259881576697424),
 (16, 0.0),
 (17, 0.13363062095621217),
 (18, 0.2519763153394848),
 (19, 0.0),
 (20, 0.2182178902359924),
 (21, 0.2519763153394848),
 (22, 0.1259881576697424),
 (23, 0.1543033499620919),
 (24, 0.11952286093343936),
 (25, 0.11952286093343936),
 (26, 0.0),
 (27, 0.1091089451179962),
 (28, 0.0),
 (29, 0.3585685828003181),
 (30, 0.3779644730092272),
 (31, 0.0),
 (32, 0.1091089451179962),
 (33, 0.28571428571428564),
 (34, 0.0),
 (35, 0.26726124191242434),
 (36, 0.20965696734438366),
 (37, 0.14285714285714282),
 (38, 0.1259881576697424),
 (39, 0.0),
 (40, 0.0),
 (41, 0.0),
 (42, 0.0),
 (43, 0.5976143046671968),
 (44,

In [15]:
# Sort the scores in descending order (highest first)
sorted_scores = sorted(scores, key=lambda x: x[1], reverse=True)
sorted_scores

[(4, 0.9999999999999997),
 (43, 0.5976143046671968),
 (463, 0.5714285714285713),
 (59, 0.5698028822981898),
 (416, 0.5669467095138407),
 (67, 0.5345224838248487),
 (118, 0.50709255283711),
 (387, 0.50709255283711),
 (113, 0.5039526306789696),
 (167, 0.5039526306789696),
 (1136, 0.47809144373375745),
 (68, 0.4629100498862757),
 (86, 0.4629100498862757),
 (1915, 0.4629100498862757),
 (147, 0.4558423058385518),
 (205, 0.4364357804719848),
 (410, 0.4364357804719848),
 (430, 0.4364357804719848),
 (187, 0.4285714285714285),
 (222, 0.4285714285714285),
 (697, 0.4285714285714285),
 (1850, 0.4285714285714285),
 (3589, 0.4285714285714285),
 (1796, 0.4193139346887673),
 (650, 0.40089186286863654),
 (704, 0.40089186286863654),
 (1140, 0.40089186286863654),
 (1147, 0.40089186286863654),
 (1380, 0.40089186286863654),
 (1546, 0.40089186286863654),
 (2884, 0.40089186286863654),
 (3067, 0.40089186286863654),
 (811, 0.3903600291794132),
 (1929, 0.3903600291794132),
 (2544, 0.3903600291794132),
 (30, 0.3

In [16]:
# get course index
# we are indexing from [1:] so as not to recommend the same course
selected_course_index = [i[0] for i in sorted_scores[1:]]
selected_course_index

[43,
 463,
 59,
 416,
 67,
 118,
 387,
 113,
 167,
 1136,
 68,
 86,
 1915,
 147,
 205,
 410,
 430,
 187,
 222,
 697,
 1850,
 3589,
 1796,
 650,
 704,
 1140,
 1147,
 1380,
 1546,
 2884,
 3067,
 811,
 1929,
 2544,
 30,
 44,
 46,
 71,
 195,
 330,
 426,
 514,
 553,
 738,
 798,
 991,
 1005,
 1123,
 1171,
 1687,
 1708,
 2828,
 3305,
 3336,
 3632,
 2100,
 29,
 96,
 138,
 154,
 206,
 510,
 580,
 803,
 1374,
 1799,
 2978,
 3319,
 50,
 57,
 379,
 444,
 742,
 829,
 1988,
 3034,
 3492,
 11,
 66,
 97,
 338,
 439,
 448,
 500,
 525,
 647,
 783,
 907,
 931,
 947,
 1521,
 1693,
 2032,
 2768,
 2927,
 3673,
 399,
 901,
 1116,
 1190,
 1806,
 1950,
 2852,
 3383,
 140,
 424,
 1131,
 84,
 236,
 273,
 295,
 369,
 378,
 696,
 766,
 940,
 950,
 1174,
 1220,
 1250,
 1257,
 1530,
 1582,
 1637,
 1644,
 1960,
 2122,
 2448,
 2895,
 3112,
 3279,
 3409,
 3511,
 3582,
 3638,
 2615,
 2663,
 3412,
 2040,
 2174,
 33,
 49,
 61,
 85,
 144,
 157,
 160,
 186,
 234,
 247,
 254,
 303,
 322,
 337,
 363,
 393,
 401,
 411,
 456,
 

In [17]:
# Get course cosine similarity score
# we are indexing from [1:] so as not to recommend the same course
selected_course_score = [i[1] for i in sorted_scores[1:]]
selected_course_score

[0.5976143046671968,
 0.5714285714285713,
 0.5698028822981898,
 0.5669467095138407,
 0.5345224838248487,
 0.50709255283711,
 0.50709255283711,
 0.5039526306789696,
 0.5039526306789696,
 0.47809144373375745,
 0.4629100498862757,
 0.4629100498862757,
 0.4629100498862757,
 0.4558423058385518,
 0.4364357804719848,
 0.4364357804719848,
 0.4364357804719848,
 0.4285714285714285,
 0.4285714285714285,
 0.4285714285714285,
 0.4285714285714285,
 0.4285714285714285,
 0.4193139346887673,
 0.40089186286863654,
 0.40089186286863654,
 0.40089186286863654,
 0.40089186286863654,
 0.40089186286863654,
 0.40089186286863654,
 0.40089186286863654,
 0.40089186286863654,
 0.3903600291794132,
 0.3903600291794132,
 0.3903600291794132,
 0.3779644730092272,
 0.3779644730092272,
 0.3779644730092272,
 0.3779644730092272,
 0.3779644730092272,
 0.3779644730092272,
 0.3779644730092272,
 0.3779644730092272,
 0.3779644730092272,
 0.3779644730092272,
 0.3779644730092272,
 0.3779644730092272,
 0.3779644730092272,
 0.37796

In [18]:
# We now locate the courses
rec_df = df.iloc[selected_course_index]
rec_df.head()

Unnamed: 0,course_id,course_title,url,is_paid,price,num_subscribers,num_reviews,num_lectures,level,content_duration,published_timestamp,subject,profit,published_date,published_time,year,month,day,clean_title
43,627540,options trading - how to win with weekly options,https://www.udemy.com/work-from-home-setup-you...,True,1.3915,7489,1190,25,Intermediate Level,1 hour,2015-10-22T21:54:28Z,Business Finance,861235,2015-10-22,21:54:28Z,2015,10,22,options trading how to win with weekly options
463,1276182,options trading foundation: your journey to co...,https://www.udemy.com/option-trading-foundatio...,True,1.1495,0,0,5,Intermediate Level,1 hour,2017-07-05T04:41:54Z,Business Finance,0,2017-07-05,04:41:54Z,2017,7,5,options trading foundation your journey to com...
59,1239068,how to buy cheap options - options trading pri...,https://www.udemy.com/options-black-scholes-mo...,True,2.42,658,2,19,All Levels,1 hour,2017-06-02T18:12:45Z,Business Finance,131600,2017-06-02,18:12:45Z,2017,6,2,how to buy cheap options options trading pric...
416,613944,how to trade options,https://www.udemy.com/how-to-trade-options/,True,0.5445,12,1,9,Intermediate Level,43 mins,2015-09-20T21:45:48Z,Business Finance,540,2015-09-20,21:45:48Z,2015,9,20,how to trade options
67,408440,how to win 97% of your options trades,https://www.udemy.com/how-to-win-97-percent-of...,True,1.5125,5050,461,26,All Levels,1.5 hours,2015-02-10T04:21:40Z,Business Finance,631250,2015-02-10,04:21:40Z,2015,2,10,how to win 97 of your options trades


In [19]:
# Add the similarity score as a new column; assign() returns a new DataFrame,
# which avoids pandas' SettingWithCopyWarning
rec_df = rec_df.assign(similarity_score=selected_course_score)

final_recommended_courses = rec_df[[
    'course_title', 'similarity_score', 'url', 'price', 
    'num_subscribers'
]]

final_recommended_courses


Unnamed: 0,course_title,similarity_score,url,price,num_subscribers
43,options trading - how to win with weekly options,0.597614,https://www.udemy.com/work-from-home-setup-you...,1.3915,7489
463,options trading foundation: your journey to co...,0.571429,https://www.udemy.com/option-trading-foundatio...,1.1495,0
59,how to buy cheap options - options trading pri...,0.569803,https://www.udemy.com/options-black-scholes-mo...,2.4200,658
416,how to trade options,0.566947,https://www.udemy.com/how-to-trade-options/,0.5445,12
67,how to win 97% of your options trades,0.534522,https://www.udemy.com/how-to-win-97-percent-of...,1.5125,5050
...,...,...,...,...,...
3674,building better apis with graphql,0.000000,https://www.udemy.com/building-better-apis-wit...,0.6050,555
3676,build a stock downloader with visual studio 20...,0.000000,https://www.udemy.com/csharpyahoostockdownloader/,0.2420,436
3677,jquery ui in action: build 5 jquery ui projects,0.000000,https://www.udemy.com/jquery-ui-practical-buil...,1.8150,382
3678,learn jquery from scratch - master of javascri...,0.000000,https://www.udemy.com/easy-jquery-for-beginner...,1.2100,1040
