# Final Project: Building a Recommender System Model
Name: Nazrul Effendy

Data: https://www.kaggle.com/datasets/khusheekapoor/coursera-courses-dataset-2021/data

### Download dataset:

In [1]:
!kaggle datasets download -d khusheekapoor/coursera-courses-dataset-2021

Dataset URL: https://www.kaggle.com/datasets/khusheekapoor/coursera-courses-dataset-2021
License(s): unknown
Downloading coursera-courses-dataset-2021.zip to /content
  0% 0.00/1.65M [00:00<?, ?B/s]
100% 1.65M/1.65M [00:00<00:00, 128MB/s]


### Unzip coursera-courses-dataset-2021.zip

In [2]:
!unzip coursera-courses-dataset-2021.zip

Archive:  coursera-courses-dataset-2021.zip
  inflating: Coursera.csv            


### Import the required libraries

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
import re
import nltk
from nltk.stem import WordNetLemmatizer

## Data Understanding

### Load the dataset and display its first 5 rows:

In [4]:
# Load the dataset
df = pd.read_csv("Coursera.csv")
df.head()

Unnamed: 0,Course Name,University,Difficulty Level,Course Rating,Course URL,Course Description,Skills
0,Write A Feature Length Screenplay For Film Or ...,Michigan State University,Beginner,4.8,https://www.coursera.org/learn/write-a-feature...,Write a Full Length Feature Film Script In th...,Drama Comedy peering screenwriting film D...
1,Business Strategy: Business Model Canvas Analy...,Coursera Project Network,Beginner,4.8,https://www.coursera.org/learn/canvas-analysis...,"By the end of this guided project, you will be...",Finance business plan persona (user experien...
2,Silicon Thin Film Solar Cells,École Polytechnique,Advanced,4.1,https://www.coursera.org/learn/silicon-thin-fi...,This course consists of a general presentation...,chemistry physics Solar Energy film lambda...
3,Finance for Managers,IESE Business School,Intermediate,4.8,https://www.coursera.org/learn/operational-fin...,"When it comes to numbers, there is always more...",accounts receivable dupont analysis analysis...
4,Retrieve Data using Single-Table SQL Queries,Coursera Project Network,Beginner,4.6,https://www.coursera.org/learn/single-table-sq...,In this course you'll learn how to effectively...,Data Analysis select (sql) database manageme...


### Check the dataset size:

In [5]:
print("Size of dataset: ", df.shape)

Size of dataset:  (3522, 7)


### Count the rows that duplicate another row:

In [6]:
print("Number of duplicated rows: ", df.duplicated().sum())

Number of duplicated rows:  98


### Show the data type of each column in the "df" dataset:

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3522 entries, 0 to 3521
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Course Name         3522 non-null   object
 1   University          3522 non-null   object
 2   Difficulty Level    3522 non-null   object
 3   Course Rating       3522 non-null   object
 4   Course URL          3522 non-null   object
 5   Course Description  3522 non-null   object
 6   Skills              3522 non-null   object
dtypes: object(7)
memory usage: 192.7+ KB
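
`df.info()` lists `Course Rating` with dtype `object`, which means the ratings are stored as strings rather than numbers. The notebook does not use the rating as a feature, but if numeric ratings were needed, a conversion along these lines might work. The sample values below, including the non-numeric placeholder, are assumptions for illustration only:

```python
import pandas as pd

# Hypothetical sample: ratings stored as text, one of them not a number.
ratings = pd.Series(["4.8", "4.1", "unknown"], name="Course Rating")

# errors="coerce" turns anything that cannot be parsed into NaN.
numeric_ratings = pd.to_numeric(ratings, errors="coerce")

print(numeric_ratings.dtype)
print(int(numeric_ratings.isna().sum()))
```

Counting the resulting NaN values is a quick way to see how many entries were not parseable before deciding how to handle them.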


### Count the null values in each column

In [8]:
df.isnull().sum()

Course Name,0
University,0
Difficulty Level,0
Course Rating,0
Course URL,0
Course Description,0
Skills,0


### View the dataset statistics:

In [9]:
df.describe()

Unnamed: 0,Course Name,University,Difficulty Level,Course Rating,Course URL,Course Description,Skills
count,3522,3522,3522,3522.0,3522,3522,3522
unique,3416,184,5,31.0,3424,3397,3424
top,Google Cloud Platform Fundamentals: Core Infra...,Coursera Project Network,Beginner,4.7,https://www.coursera.org/learn/gcp-fundamentals,This course introduces you to important concep...,Google Cloud Platform Big Data Cloud Infrast...
freq,8,562,1444,740.0,8,8,8


## Data Preparation

### Drop the rows that duplicate another row

In [10]:
df.drop_duplicates(inplace=True)
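
`drop_duplicates` keeps the first occurrence of each fully identical row and leaves the original index labels in place, so the index has gaps afterwards. A minimal sketch on a made-up three-row frame (the `toy` data is an assumption, not part of the Coursera dataset) shows the effect and how `reset_index(drop=True)` would renumber the rows:

```python
import pandas as pd

# Hypothetical toy data: the second row repeats the first one exactly.
toy = pd.DataFrame({
    "Course Name": ["Python Basics", "Python Basics", "Rigid Body Dynamics"],
    "Difficulty Level": ["Beginner", "Beginner", "Advanced"],
})

n_duplicates = int(toy.duplicated().sum())   # rows identical to an earlier row
deduped = toy.drop_duplicates()              # keeps the first occurrence
renumbered = deduped.reset_index(drop=True)  # optional: contiguous 0..n-1 index

print(n_duplicates)
print(deduped.index.tolist())
print(renumbered.index.tolist())
```

The notebook later looks rows up by position (`iloc`), so the gaps do not lead to wrong lookups, but a contiguous index can make label-based access less surprising.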

### Define a function that renames a dataset column

In [12]:
def rename_col(col_name):
    # replace every space in the column name with an underscore
    return col_name.replace(' ', '_')

### Show the column names before renaming, rename the columns, and show the column names after renaming:

In [13]:
print("Columns names before renaming: ", df.columns.to_list())
df.columns = [rename_col(col) for col in df.columns]
print("Columns names after renaming: ", df.columns.to_list())

Columns names before renaming:  ['Course Name', 'University', 'Difficulty Level', 'Course Rating', 'Course URL', 'Course Description', 'Skills']
Columns names after renaming:  ['Course_Name', 'University', 'Difficulty_Level', 'Course_Rating', 'Course_URL', 'Course_Description', 'Skills']
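
The same renaming can also be done without a helper function through pandas' vectorized string methods on the column index. A small sketch on an empty frame that only carries the original column names (the `demo` frame is hypothetical):

```python
import pandas as pd

# Empty frame that only carries the original column names.
cols = ["Course Name", "University", "Difficulty Level", "Course Rating",
        "Course URL", "Course Description", "Skills"]
demo = pd.DataFrame(columns=cols)

# Replace every space with an underscore in all column names at once.
demo.columns = demo.columns.str.replace(" ", "_", regex=False)

print(demo.columns.to_list())
```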


### Feature selection

### Define the selected features

In [14]:
features_selected = ["Course_Name", "Course_Description", "Skills", "Difficulty_Level"]

### Create a new dataset that contains only the selected features

In [15]:
new_df = df[features_selected]
new_df.head()

Unnamed: 0,Course_Name,Course_Description,Skills,Difficulty_Level
0,Write A Feature Length Screenplay For Film Or ...,Write a Full Length Feature Film Script In th...,Drama Comedy peering screenwriting film D...,Beginner
1,Business Strategy: Business Model Canvas Analy...,"By the end of this guided project, you will be...",Finance business plan persona (user experien...,Beginner
2,Silicon Thin Film Solar Cells,This course consists of a general presentation...,chemistry physics Solar Energy film lambda...,Advanced
3,Finance for Managers,"When it comes to numbers, there is always more...",accounts receivable dupont analysis analysis...,Intermediate
4,Retrieve Data using Single-Table SQL Queries,In this course you'll learn how to effectively...,Data Analysis select (sql) database manageme...,Beginner


### Concatenate the contents of all selected features into a single feature and display the first 5 rows of the resulting dataset

In [16]:
# work on an explicit copy so the assignments below do not raise SettingWithCopyWarning
new_df = new_df.copy()
new_df["description_key_words"] = ""
for col in features_selected:
    # append each selected column, separated by a single space
    new_df["description_key_words"] += " " + new_df[col]
new_df.head()


Unnamed: 0,Course_Name,Course_Description,Skills,Difficulty_Level,description_key_words
0,Write A Feature Length Screenplay For Film Or ...,Write a Full Length Feature Film Script In th...,Drama Comedy peering screenwriting film D...,Beginner,Write A Feature Length Screenplay For Film Or...
1,Business Strategy: Business Model Canvas Analy...,"By the end of this guided project, you will be...",Finance business plan persona (user experien...,Beginner,Business Strategy: Business Model Canvas Anal...
2,Silicon Thin Film Solar Cells,This course consists of a general presentation...,chemistry physics Solar Energy film lambda...,Advanced,Silicon Thin Film Solar Cells This course con...
3,Finance for Managers,"When it comes to numbers, there is always more...",accounts receivable dupont analysis analysis...,Intermediate,Finance for Managers When it comes to numbers...
4,Retrieve Data using Single-Table SQL Queries,In this course you'll learn how to effectively...,Data Analysis select (sql) database manageme...,Beginner,Retrieve Data using Single-Table SQL Queries ...
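
The loop above prepends a space before each selected column, so every combined string starts with a leading space. A compact sketch of the same idea on made-up rows (the `toy` frame and its contents are hypothetical):

```python
import pandas as pd

# Hypothetical two-row frame with the same four text columns.
toy = pd.DataFrame({
    "Course_Name": ["Python Basics", "Finance for Managers"],
    "Course_Description": ["Learn Python.", "Learn finance."],
    "Skills": ["python", "accounting"],
    "Difficulty_Level": ["Beginner", "Intermediate"],
})
features = ["Course_Name", "Course_Description", "Skills", "Difficulty_Level"]

# Start from empty strings and append " " + column, one column at a time.
combined = pd.Series("", index=toy.index)
for col in features:
    combined = combined + " " + toy[col]

print(repr(combined.iloc[0]))
```

The leading space is harmless here because the later preprocessing splits the text on whitespace.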


### Reduce the dataset to the Course_Name and description_key_words columns

In [17]:
new_df = new_df[["Course_Name", "description_key_words"]]
new_df

Unnamed: 0,Course_Name,description_key_words
0,Write A Feature Length Screenplay For Film Or ...,Write A Feature Length Screenplay For Film Or...
1,Business Strategy: Business Model Canvas Analy...,Business Strategy: Business Model Canvas Anal...
2,Silicon Thin Film Solar Cells,Silicon Thin Film Solar Cells This course con...
3,Finance for Managers,Finance for Managers When it comes to numbers...
4,Retrieve Data using Single-Table SQL Queries,Retrieve Data using Single-Table SQL Queries ...
...,...,...
3517,"Capstone: Retrieving, Processing, and Visualiz...","Capstone: Retrieving, Processing, and Visuali..."
3518,Patrick Henry: Forgotten Founder,Patrick Henry: Forgotten Founder “Give me lib...
3519,Business intelligence and data analytics: Gene...,Business intelligence and data analytics: Gen...
3520,Rigid Body Dynamics,Rigid Body Dynamics This course teaches dynam...


### Show the entry at position 5 (iloc 5) of the "description_key_words" column

In [18]:
new_df["description_key_words"].iloc[5]

' Building Test Automation Framework using Selenium and TestNG Selenium is one of the most widely used functional UI automation testing tools and TestNG is a brilliant testing framework.  Test automation frameworks are a set of guidelines or rules for writing test cases.  They can reduce maintenance costs and testing efforts and will provide a higher return on investment (ROI) for teams looking to optimize their processes.  Testing guidelines include coding standards, test-data management, defining object repositories, reporting guidelines, and logging strategies.  Through hands-on, practical experience, you will go through concepts writing reusable and structure code which is easy to maintain and understand, creating helper classes or utilities, write effective testcases, and generating reports and logs. maintenance  test case  test automation  screenshot  project  helper class  selenium  reusability  debugging  php computer-science software-development Beginner'

### Define my_lematizer and the PreprocessText function



In [19]:
import string

my_lematizer = WordNetLemmatizer()

def PreprocessText(text):
    # remove URLs first, while their hyphens are still intact
    cleaned_text = re.sub(r'https?://\S+|www\.\S+', ' ', text)
    # replace hyphens with spaces so hyphenated words are split
    cleaned_text = re.sub(r'-', ' ', cleaned_text)
    # remove HTML tags
    cleaned_text = re.sub(r'<.*?>', ' ', cleaned_text)
    # remove all digits
    cleaned_text = re.sub(r'[0-9]', '', cleaned_text)
    # remove parenthesised text such as "(ROI)"
    cleaned_text = re.sub(r"\([^()]*\)", "", cleaned_text)
    # remove mentions
    cleaned_text = re.sub(r'@\S+', '', cleaned_text)
    # expand the abbreviations ML and DL (whole words only, so "HTML" is left alone)
    cleaned_text = re.sub(r'\bML\b', ' Machine Learning ', cleaned_text)
    cleaned_text = re.sub(r'\bDL\b', ' Deep Learning ', cleaned_text)
    # remove punctuation
    cleaned_text = re.sub('[%s]' % re.escape(string.punctuation), '', cleaned_text)
    # lowercase and split on whitespace
    cleaned_text = cleaned_text.lower().split()
    # apply lemmatization
    cleaned_text = ' '.join([my_lematizer.lemmatize(word) for word in cleaned_text])

    return cleaned_text

### Download 'wordnet' and 'omw-1.4' from nltk

In [20]:
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

### Apply the text preprocessing to description_key_words and show the result at iloc 5

In [21]:
new_df["description_key_words"] = new_df["description_key_words"].apply(PreprocessText)
new_df["description_key_words"].iloc[5]

'building test automation framework using selenium and testng selenium is one of the most widely used functional ui automation testing tool and testng is a brilliant testing framework test automation framework are a set of guideline or rule for writing test case they can reduce maintenance cost and testing effort and will provide a higher return on investment for team looking to optimize their process testing guideline include coding standard test data management defining object repository reporting guideline and logging strategy through hand on practical experience you will go through concept writing reusable and structure code which is easy to maintain and understand creating helper class or utility write effective testcases and generating report and log maintenance test case test automation screenshot project helper class selenium reusability debugging php computer science software development beginner'
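
To see what the individual regular-expression steps contribute, the sketch below applies a reduced version of the cleaning to one short string. It deliberately leaves out the lemmatization so that it runs without NLTK data. `clean_without_lemmatizer` and the sample sentence are hypothetical and only mirror part of `PreprocessText`:

```python
import re
import string

def clean_without_lemmatizer(text):
    """Reduced cleaning: URLs, hyphens, digits, parentheses, punctuation, case."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # drop URLs
    text = re.sub(r"-", " ", text)                       # split hyphenated words
    text = re.sub(r"[0-9]", "", text)                    # drop digits
    text = re.sub(r"\([^()]*\)", "", text)               # drop "(...)" groups
    text = re.sub("[%s]" % re.escape(string.punctuation), "", text)
    return " ".join(text.lower().split())                # lowercase, squeeze spaces

sample = "Hands-on return on investment (ROI) for 3 teams, see https://example.com/a-b!"
print(clean_without_lemmatizer(sample))
```

In the notebook's output, the lemmatizer additionally reduces plural nouns to their singular form, which is why "tools" appears as "tool" and "hands" as "hand".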

### Vectorization

In [22]:
vectorizer = CountVectorizer(max_features=10000, stop_words='english')
vectors = vectorizer.fit_transform(new_df["description_key_words"]).toarray()

### Show the shape of the feature matrix and the vocabulary size

In [23]:
print("Shape of feature  matrix: ", vectors.shape)
print("Vocabulary size : ", len(vectorizer.vocabulary_))

Shape of feature  matrix:  (3424, 10000)
Vocabulary size :  10000
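
`CountVectorizer` turns each document into a row of word counts over a shared vocabulary. The idea can be illustrated in plain Python. This is a simplified sketch of the bag-of-words principle, not scikit-learn's implementation, and it skips stop-word removal and the `max_features` cap. The two documents and the name `count_vectors` are made up:

```python
from collections import Counter

docs = ["python programming basics", "python data python"]

# Vocabulary: every distinct token, sorted alphabetically.
vocabulary = sorted({token for doc in docs for token in doc.split()})

# One count vector per document, aligned with the vocabulary order.
count_vectors = []
for doc in docs:
    counts = Counter(doc.split())
    count_vectors.append([counts[token] for token in vocabulary])

print(vocabulary)
print(count_vectors)
```

With 3424 courses and 10000 features, the dense array produced by `.toarray()` holds about 34 million entries. `cosine_similarity` also accepts sparse input, so the dense conversion is optional.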


## Modeling

### Define the course_id_recommended function

In [24]:
def course_id_recommended(description, vectorizer, vectors, number_of_recommendation=5):
    # preprocess the query text
    description = [PreprocessText(description)]

    # vectorize the query with the fitted vectorizer
    vect = vectorizer.transform(description)

    # compute the cosine similarity between the query and every course vector
    similars_vectors = cosine_similarity(vect, vectors)[0]

    # sort the similarity values in ascending order (the result is a list of indices)
    ordered_similars_vectors = list(similars_vectors.argsort())

    # reverse the order so the most similar course comes first
    reverse_ordered_similars_vectors = ordered_similars_vectors[::-1]

    # keep the indices with the highest similarity; the slice starts at 1, so the
    # top-ranked index is skipped and number_of_recommendation - 1 indices are returned
    best_indexs = reverse_ordered_similars_vectors[1:number_of_recommendation]

    return best_indexs
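
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends on the direction of the count vectors rather than on document length. The sketch below, using NumPy only and made-up count vectors, ranks three documents against a query and shows how a slice starting at index 1 differs from one starting at 0. `rank_by_cosine` is a hypothetical helper, not part of the notebook:

```python
import numpy as np

def rank_by_cosine(query, matrix):
    """Return row indices of `matrix` (most similar first) and the scores."""
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    # Guard against division by zero for all-zero rows.
    scores = (matrix @ query) / np.where(norms == 0, 1.0, norms)
    # Stable sort on negated scores keeps the lower index first on ties.
    order = [int(i) for i in np.argsort(-scores, kind="stable")]
    return order, scores

# Hypothetical count vectors over the vocabulary [data, finance, python].
matrix = np.array([[0, 1, 0],   # doc 0: finance only
                   [1, 0, 1],   # doc 1: data + python
                   [0, 0, 2]])  # doc 2: python twice
query = np.array([0, 0, 1])     # query: "python"

order, scores = rank_by_cosine(query, matrix)
print(order)        # ranking, best match first
print(order[:2])    # top 2 including the best match
print(order[1:2])   # slice from 1: the best match is skipped
```

In `course_id_recommended`, the slice `[1:number_of_recommendation]` behaves like the last line: it returns `number_of_recommendation - 1` indices and leaves out the top-ranked course, which is why a call with 10 yields 9 recommendations.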

### Define the recommend_me function

In [25]:
def recommend_me(description):
    course_index = course_id_recommended(description, vectorizer, vectors, number_of_recommendation=10)
    # an empty list means there is nothing to recommend
    if course_index:
        course_to_recommend = list(new_df.iloc[course_index]["Course_Name"])
        print("Courses recommended to the user: ")
        print("------------------------------------------------------------------")
        for i, course in enumerate(course_to_recommend):
            print(f"\t{i+1}- {course}")
        print("------------------------------------------------------------------")
    else:
        print("No courses to recommend")

## Evaluation

### Show the top 9 courses recommended to the user for the keywords "Python programming"

In [26]:
recommend_me("Python programming")

Courses recommended to the user: 
------------------------------------------------------------------
	1- Python Programming Essentials
	2- Python Data Representations
	3- Python Basics
	4- Python Programming: A Concise Introduction
	5- Compose and Program Music in Python using Earsketch
	6- Crash Course on Python
	7- Python Functions, Files, and Dictionaries
	8- Python Data Structures
	9- Programming for Everybody (Getting Started with Python)
------------------------------------------------------------------
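
The check above is a visual one. One rough way to put a number on it is the share of recommended titles that mention the query keyword. This is only a proxy for relevance, and the criterion (a case-insensitive match on "python") is an assumption, not a metric defined by the notebook. The titles are copied from the output above, and `keyword_hit_rate` is a hypothetical helper:

```python
recommended = [
    "Python Programming Essentials",
    "Python Data Representations",
    "Python Basics",
    "Python Programming: A Concise Introduction",
    "Compose and Program Music in Python using Earsketch",
    "Crash Course on Python",
    "Python Functions, Files, and Dictionaries",
    "Python Data Structures",
    "Programming for Everybody (Getting Started with Python)",
]

def keyword_hit_rate(titles, keyword):
    """Fraction of titles containing `keyword`, ignoring case."""
    hits = sum(keyword.lower() in title.lower() for title in titles)
    return hits / len(titles)

print(keyword_hit_rate(recommended, "python"))
```

A title match says nothing about course quality or difficulty, so a labeled set of relevant courses would be needed for a proper precision figure.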
