## Import Library

The following code imports the libraries required to build the machine learning model for the recommendation system. Each library supports a specific part of the workflow: data handling (pandas, NumPy), text preprocessing (re, string, NLTK), feature extraction and similarity computation (scikit-learn), and file upload in Google Colab.

In [1]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from google.colab import files
import string
import re
import nltk
import pandas as pd
import numpy as np

In [2]:
nltk.download('punkt_tab')
nltk.download('stopwords')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

## Data Loading

In this stage, the dataset is downloaded from Kaggle and then read with pandas. Pandas is a Python library for manipulating and analyzing tabular data (DataFrame).

### Get Data from Kaggle

The following code uploads the Kaggle API key file (kaggle.json) and downloads the required dataset. Once the download finishes, the archive is extracted (unzip) so that it can be read with pandas.

In [3]:
files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"itsself","key":"<redacted>"}'}

In [4]:
! mkdir -p ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json
! kaggle datasets download gpreda/bbc-news

Dataset URL: https://www.kaggle.com/datasets/gpreda/bbc-news
License(s): CC0-1.0
Downloading bbc-news.zip to /content
  0% 0.00/3.64M [00:00<?, ?B/s]
100% 3.64M/3.64M [00:00<00:00, 928MB/s]


In [5]:
! unzip bbc-news.zip

Archive:  bbc-news.zip
  inflating: bbc_news.csv            


### Load Dataset

After the dataset has been downloaded and extracted, the next step is to read the file with the pandas function pd.read_csv(), which loads data from a CSV file into a DataFrame.

In [6]:
news = pd.read_csv("bbc_news.csv")
news.head()

Unnamed: 0,title,pubDate,guid,link,description
0,Ukraine: Angry Zelensky vows to punish Russian...,"Mon, 07 Mar 2022 08:01:56 GMT",https://www.bbc.co.uk/news/world-europe-60638042,https://www.bbc.co.uk/news/world-europe-606380...,The Ukrainian president says the country will ...
1,War in Ukraine: Taking cover in a town under a...,"Sun, 06 Mar 2022 22:49:58 GMT",https://www.bbc.co.uk/news/world-europe-60641873,https://www.bbc.co.uk/news/world-europe-606418...,"Jeremy Bowen was on the frontline in Irpin, as..."
2,Ukraine war 'catastrophic for global food',"Mon, 07 Mar 2022 00:14:42 GMT",https://www.bbc.co.uk/news/business-60623941,https://www.bbc.co.uk/news/business-60623941?a...,One of the world's biggest fertiliser firms sa...
3,Manchester Arena bombing: Saffie Roussos's par...,"Mon, 07 Mar 2022 00:05:40 GMT",https://www.bbc.co.uk/news/uk-60579079,https://www.bbc.co.uk/news/uk-60579079?at_medi...,The parents of the Manchester Arena bombing's ...
4,Ukraine conflict: Oil price soars to highest l...,"Mon, 07 Mar 2022 08:15:53 GMT",https://www.bbc.co.uk/news/business-60642786,https://www.bbc.co.uk/news/business-60642786?a...,Consumers are feeling the impact of higher ene...


## Exploratory Data Analysis

This stage performs *Exploratory Data Analysis* (EDA), which aims to:

- Analyze the characteristics of each feature in the dataset.
- Identify and handle *missing values* and duplicate rows.
- Examine the features to determine the most suitable modeling approach.

### Variable Description

The info() function is used to inspect basic information about the dataset: the data type of each column, the number of non-null values per column, and the overall memory usage of the DataFrame.

In [7]:
news.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42115 entries, 0 to 42114
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   title        42115 non-null  object
 1   pubDate      42115 non-null  object
 2   guid         42115 non-null  object
 3   link         42115 non-null  object
 4   description  42115 non-null  object
dtypes: object(5)
memory usage: 1.6+ MB


### Check for Missing and Duplicate Values

The dataset is checked for *missing values* and duplicate rows to confirm data quality before analysis and modeling. The check finds no *missing values* and no duplicate rows, so no data cleaning is needed at this point.

In [8]:
print("Total missing value: ", news.isnull().sum())
print("Total duplicates: ", news.duplicated().sum())

Total missing value:  title          0
pubDate        0
guid           0
link           0
description    0
dtype: int64
Total duplicates:  0


### Univariate Analysis

This stage performs univariate analysis to understand the distribution and characteristics of each feature on its own.

The following code converts the pubDate feature to the datetime64 data type so that time-based analysis can be done accurately and efficiently.

In [9]:
news["pubDate"] = pd.to_datetime(news["pubDate"])
print(news["pubDate"].dtype)

datetime64[ns]
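
The raw pubDate strings follow the RFC 2822 style commonly seen in RSS feeds (for example `Mon, 07 Mar 2022 08:01:56 GMT`). As a rough illustration of what the conversion does to a single value, the sketch below parses one such string with Python's standard library rather than pandas. The variable names are illustrative and are not part of the notebook.

```python
from email.utils import parsedate_to_datetime

# One raw value, copied from the first row of the dataset preview.
raw_pub_date = "Mon, 07 Mar 2022 08:01:56 GMT"

# parsedate_to_datetime understands RFC 2822 date strings such as this one.
parsed = parsedate_to_datetime(raw_pub_date)

print(parsed.year, parsed.month, parsed.day)      # date components
print(parsed.hour, parsed.minute, parsed.second)  # time components
```

Once the column holds datetime values, components such as the year are available through the `.dt` accessor, which the later cells rely on.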


In [10]:
news.head()

Unnamed: 0,title,pubDate,guid,link,description
0,Ukraine: Angry Zelensky vows to punish Russian...,2022-03-07 08:01:56,https://www.bbc.co.uk/news/world-europe-60638042,https://www.bbc.co.uk/news/world-europe-606380...,The Ukrainian president says the country will ...
1,War in Ukraine: Taking cover in a town under a...,2022-03-06 22:49:58,https://www.bbc.co.uk/news/world-europe-60641873,https://www.bbc.co.uk/news/world-europe-606418...,"Jeremy Bowen was on the frontline in Irpin, as..."
2,Ukraine war 'catastrophic for global food',2022-03-07 00:14:42,https://www.bbc.co.uk/news/business-60623941,https://www.bbc.co.uk/news/business-60623941?a...,One of the world's biggest fertiliser firms sa...
3,Manchester Arena bombing: Saffie Roussos's par...,2022-03-07 00:05:40,https://www.bbc.co.uk/news/uk-60579079,https://www.bbc.co.uk/news/uk-60579079?at_medi...,The parents of the Manchester Arena bombing's ...
4,Ukraine conflict: Oil price soars to highest l...,2022-03-07 08:15:53,https://www.bbc.co.uk/news/business-60642786,https://www.bbc.co.uk/news/business-60642786?a...,Consumers are feeling the impact of higher ene...


The following code retrieves the earliest and latest publication dates in the dataset.

In [11]:
print("Start Publish : ", news["pubDate"].min())
print("End Publish : ", news["pubDate"].max())

Start Publish :  2013-08-30 01:01:55
End Publish :  2024-12-04 00:05:52


The news articles are grouped by publication year to obtain the number of articles published in each year.

In [12]:
total_news_per_year = news["pubDate"].dt.year.value_counts().sort_index()
total_news_per_year

Unnamed: 0_level_0,count
pubDate,Unnamed: 1_level_1
2013,1
2017,1
2018,1
2019,1
2021,6
2022,12301
2023,15043
2024,14761


## Data Preparation

In this stage, the dataset is prepared for machine learning modeling. The main steps are:

- Selecting the news from 2024, which consists of 14,761 entries.
- Text processing, which cleans the text by removing punctuation, digits, and other unimportant characters and by converting it to lowercase.
- Feature extraction with TF-IDF (Term Frequency-Inverse Document Frequency) to represent the text content as numeric vectors.
- Computing the degree of similarity between news articles with cosine similarity.

In [13]:
news_2024 = news[news["pubDate"].dt.year == 2024]
news_2024.head()

Unnamed: 0,title,pubDate,guid,link,description
27288,Justin Welby: Political leaders should treat o...,2024-01-01 00:00:04,https://www.bbc.co.uk/news/uk-67844356,https://www.bbc.co.uk/news/uk-67844356?at_medi...,The Archbishop of Canterbury urges politicians...
27289,Almost three million tested for cancer in England,2024-01-01 00:09:56,https://www.bbc.co.uk/news/health-67841348,https://www.bbc.co.uk/news/health-67841348?at_...,Record numbers are being tested for cancer but...
27290,Household energy price rise of 5% comes into f...,2024-01-01 00:00:16,https://www.bbc.co.uk/news/business-67785266,https://www.bbc.co.uk/news/business-67785266?a...,A higher cap for the next three months adds £9...
27329,Primrose Hill stabbing: Harry Pitman named as ...,2024-01-01 17:11:13,https://www.bbc.co.uk/news/uk-england-london-6...,https://www.bbc.co.uk/news/uk-england-london-6...,"Harry Pitman, 16, was attacked on London's Pri..."
27330,Israel Supreme Court strikes down judicial ref...,2024-01-01 19:47:58,https://www.bbc.co.uk/news/world-middle-east-6...,https://www.bbc.co.uk/news/world-middle-east-6...,The controversial plans triggered nationwide p...


### Text Processing

This stage applies text processing to clean and prepare the news text before it is converted into numeric features. These preprocessing steps matter because they make the text consistent, which lets articles be compared reliably.

The following code defines two functions: cleaningText, which removes mentions, hashtags, URLs, digits, and punctuation, and casefoldingText, which converts the text to lowercase.

In [14]:
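def cleaningText(text):
  text = re.sub(r'@[A-Za-z0-9]+', '', text)  # remove @mentions
  text = re.sub(r'#[A-Za-z0-9]+', '', text)  # remove hashtags
  text = re.sub(r'RT[\s]', '', text)         # remove retweet markers
  text = re.sub(r"http\S+", '', text)        # remove URLs
  text = re.sub(r'[0-9]+', '', text)         # remove digits
  text = re.sub(r'[^\w\s]', '', text)        # remove characters that are neither word characters nor whitespace

  text = text.replace('\n', ' ')             # replace newlines with spaces
  text = text.translate(str.maketrans('', '', string.punctuation))  # remove remaining ASCII punctuation such as underscores
  text = text.strip(' ')                     # trim leading and trailing spaces
  return text

def casefoldingText(text):
  # convert the text to lowercase
  text = text.lower()
  return text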

The following code applies the cleaning functions defined above to the `description` feature. The goal is to make the text simpler and more consistent, which helps the later stages of feature extraction and similarity computation.

In [15]:
news_to_clean = news_2024[["description"]].copy()


news_to_clean["description_clean"] = news_to_clean["description"].apply(cleaningText)
news_to_clean["description_casefolding"] = news_to_clean["description_clean"].apply(casefoldingText)

news_to_clean.head()

Unnamed: 0,description,description_clean,description_casefolding
27288,The Archbishop of Canterbury urges politicians...,The Archbishop of Canterbury urges politicians...,the archbishop of canterbury urges politicians...
27289,Record numbers are being tested for cancer but...,Record numbers are being tested for cancer but...,record numbers are being tested for cancer but...
27290,A higher cap for the next three months adds £9...,A higher cap for the next three months adds t...,a higher cap for the next three months adds t...
27329,"Harry Pitman, 16, was attacked on London's Pri...",Harry Pitman was attacked on Londons Primrose...,harry pitman was attacked on londons primrose...
27330,The controversial plans triggered nationwide p...,The controversial plans triggered nationwide p...,the controversial plans triggered nationwide p...
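
To make the effect of the cleaning steps concrete, the sketch below runs them on a single made-up sentence. `clean_and_fold` is a hypothetical helper that chains the same operations as cleaningText and casefoldingText; it is not defined in the notebook, and the sample sentence is invented for illustration.

```python
import re
import string

def clean_and_fold(text):
    """Hypothetical helper mirroring cleaningText followed by casefoldingText."""
    text = re.sub(r"@[A-Za-z0-9]+", "", text)   # mentions
    text = re.sub(r"#[A-Za-z0-9]+", "", text)   # hashtags
    text = re.sub(r"RT[\s]", "", text)          # retweet markers
    text = re.sub(r"http\S+", "", text)         # URLs
    text = re.sub(r"[0-9]+", "", text)          # digits
    text = re.sub(r"[^\w\s]", "", text)         # symbols such as the pound sign and commas
    text = text.replace("\n", " ")
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text.strip(" ").lower()

sample = "A higher cap adds £94 to bills, says Ofgem."  # made-up sentence
cleaned = clean_and_fold(sample)
print(repr(cleaned))

# Collapsing repeated whitespace is an optional extra step, not part of the notebook.
collapsed = " ".join(cleaned.split())
print(repr(collapsed))
```

Removing "£94" leaves two consecutive spaces behind. This is harmless for the next stage, because the default TfidfVectorizer tokenizer extracts word tokens and ignores the whitespace between them.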


### Create new dataframe for Dataset

In this stage, the final dataset is built from two columns of the 2024 news data: title and description (after cleaning). This dataset is the input for the next stages: feature extraction with TF-IDF and similarity computation with cosine similarity.

In [16]:
dataset = pd.DataFrame({
    "title": news_2024["title"],
    "description": news_to_clean["description_casefolding"]
})

dataset

Unnamed: 0,title,description
27288,Justin Welby: Political leaders should treat o...,the archbishop of canterbury urges politicians...
27289,Almost three million tested for cancer in England,record numbers are being tested for cancer but...
27290,Household energy price rise of 5% comes into f...,a higher cap for the next three months adds t...
27329,Primrose Hill stabbing: Harry Pitman named as ...,harry pitman was attacked on londons primrose...
27330,Israel Supreme Court strikes down judicial ref...,the controversial plans triggered nationwide p...
...,...,...
42110,Highlights: Wales make history in Dublin,watch highlights as wales win in dublin for a...
42111,Gang jailed over £200m of cocaine in banana boxes,more than two tonnes of the class a drug was s...
42112,Scottish Budget presents huge challenges for SNP,finance secretary shona robison is preparing t...
42113,Celebrations as Wales make history qualifying ...,wales defeated the republic of ireland making...


### Feature Extraction (TF-IDF)

Feature extraction uses TF-IDF to convert the text in the description column into a numeric representation. TfidfVectorizer is configured to remove English stop words, with max_df=0.8 (ignore terms that appear in more than 80% of documents), min_df=10 (ignore terms that appear in fewer than 10 documents), and max_features=1000 (keep only the 1,000 most frequent remaining terms). The result is a TF-IDF matrix that represents each document by its important words and is ready for the cosine similarity computation.

In [17]:
tfidf = TfidfVectorizer(
    stop_words = "english",
    max_df=0.8,
    min_df=10,
    max_features=1000,
)

tfidf_matrix = tfidf.fit_transform(dataset["description"])

print("TF-IDF Matrix shape: ", tfidf_matrix.shape)
print(tfidf_matrix.toarray())

TF-IDF Matrix shape:  (14761, 1000)
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
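
For intuition about the numbers inside the matrix, here is a minimal, dependency-free sketch of the weighting. It uses the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalisation, which matches scikit-learn's documented defaults. It is a simplification: tokens are split on whitespace, and stop-word removal, max_df, min_df, and max_features are not applied. The function name and the toy documents are assumptions made for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalised TF-IDF vectors) for whitespace-tokenised docs."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = [counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        vectors.append([w / norm for w in weights])
    return vocabulary, vectors

# Toy corpus, invented for illustration.
toy_docs = ["wales win dublin", "wales make history", "budget challenges snp"]
vocabulary, vectors = tfidf_vectors(toy_docs)
first = dict(zip(vocabulary, vectors[0]))
print(round(first["wales"], 3), round(first["dublin"], 3))
```

In the first toy document, "wales" receives a lower weight than "dublin" because it also occurs in the second document. Terms that are absent from a document get a weight of 0, which is why most entries in the printed matrix are zeros.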


The following code builds a DataFrame that maps each news title to the TF-IDF weight of every feature word extracted from the description column. This DataFrame makes it easier to inspect which words carry weight for each article and how the text is represented.

In [18]:
tfidf_matrix_df = pd.DataFrame(
    tfidf_matrix.toarray(),
    columns=tfidf.get_feature_names_out(),
    index=dataset.title
)

tfidf_matrix_df.sample(5)

Unnamed: 0_level_0,able,abuse,access,according,accused,act,action,actor,actress,adam,...,worlds,worst,worth,writes,wrong,year,yearold,years,york,young
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Chris Mason: Will Sunak’s dubious tax claim stick in voters’ minds?,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Ron DeSantis drops out of presidential race,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
'Positives for Liverpool - but a big chance missed',0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Shock and a struggle to comprehend after US airman's Gaza protest death,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
'I'm not ready to die': Tourist survives holiday hippo attack,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Cosine Similarity

This stage computes the cosine similarity between documents from the TF-IDF matrix produced above. Cosine similarity measures how similar two news articles are by computing the cosine of the angle between their text vectors. Because TF-IDF weights are non-negative, the value ranges from 0 (no shared terms) to 1 (vectors pointing in the same direction, that is, identical term proportions). The results are used to recommend news with similar content.

In [19]:
cosine_sim = cosine_similarity(tfidf_matrix)
cosine_sim

array([[1., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 1.]])
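
The measure itself is the dot product of two vectors divided by the product of their lengths. The sketch below computes it for a few small vectors using only the standard library. The function name `cosine` is illustrative, and returning 0.0 for an all-zero vector is a choice made in this sketch.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # sketch-level choice for vectors with no non-zero terms
    return dot / (norm_u * norm_v)

print(cosine([1, 0], [1, 0]))  # same direction
print(cosine([1, 0], [0, 1]))  # no shared terms
print(cosine([1, 1], [1, 0]))  # partial overlap
```

Because the measure depends only on direction, a long description and a short one that use the same terms in the same proportions still score close to 1.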

In this stage, the cosine similarity results are placed in a DataFrame so that sample similarities between articles are easier to display and analyze. The DataFrame shows the similarity between one article and every other article in a tabular format, which is easier to read and to use in the recommendation step.



In [20]:
cosine_sim_df = pd.DataFrame(
    cosine_sim,
    index=dataset.title,
    columns=dataset.title,
)

print("Shape: ", cosine_sim_df.shape)

cosine_sim_df.sample(5, axis=1).sample(10, axis=0)

Shape:  (14761, 14761)


title,Gabriel Attal: Youngest French PM hopes to revive Macron's government,Gary Lineker invited to Wales after 'farmers' league' joke,Vennells accused of false statement on postmasters,US fugitive Nicholas Rossi extradited from Scotland,Manslaughter considered by Sicily yacht sinking investigators
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
The Meghan effect? Suits was 2023's most-streamed show in US,0.0,0.0,0.0,0.0,0.0
TV host Jay Blades charged with coercive behaviour,0.0,0.0,0.0,0.0,0.0
"Clapham attack: Abdul Shokoor Ezedi is being helped to hide, say police",0.0,0.0,0.0,0.0,0.0
"'A story of human trafficking, hope and love'",0.0,0.0,0.0,0.0,0.0
Do councils spend too much on diversity schemes?,0.0,0.0,0.0,0.0,0.0
Train strikes: All you need to know about services this week and in February,0.0,0.0,0.0,0.0,0.0
Al Fayed’s victims in France call for investigation,0.0,0.0,0.0,0.0,0.0
UK suspends some arms exports to Israel,0.0,0.0,0.0,0.0,0.0
Ryanair boss calls for two-drink limit at airports,0.0,0.0,0.0,0.0,0.0
Five golds among 10 GB medals on sensational Saturday,0.0,0.0,0.0,0.0,0.0


## Model Development: Content-Based Filtering

In this stage, a news recommendation function based on Content-Based Filtering (CBF) is developed using the cosine similarity matrix between articles.
The news_recommendation function works as follows:

- Takes as input the title of the article (news_title) for which recommendations are wanted.
- Checks that the title exists in the similarity data.
- Sorts the articles by similarity and takes the k + 1 with the highest cosine similarity values.
- Removes the input title from the results so that an article is not recommended for itself.
- Returns a DataFrame with the recommended titles and their descriptions.

This function lets users obtain articles whose content is similar to the selected article, which makes relevant news easier to find.

In [21]:
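def news_recommendation(news_title, similarity_data=cosine_sim_df, items=dataset[["title", "description"]], k=5):
  # return a message instead of a DataFrame when the title is unknown
  if news_title not in similarity_data.columns:
    return f"News title '{news_title}' is not found in the dataset."

  # positions of the k + 1 highest similarity scores, highest first
  index = similarity_data[news_title].to_numpy().argsort()[-k-1:][::-1]

  # titles at those positions
  closest = similarity_data.columns[index]

  # remove the input title so it is not recommended for itself
  closest = closest.drop(news_title, errors="ignore")

  # attach descriptions and keep at most k rows
  return pd.DataFrame(closest, columns=["title"]).merge(items, on="title").head(k)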

The following code runs the recommendation function, which returns a list of other articles relevant to the input title.

In [22]:
news_recommendation("Tom Cruise abseils off stadium roof in daring Olympic finale")

Unnamed: 0,title,description
0,Actor Chance Perdomo dies in motorcycle accide...,the ukus star was known for playing ambrose sp...
1,Queen Margrethe: Will abdication cause a rippl...,nordic monarchies are known to embrace moderni...
2,Sebastián Piñera: Former president of Chile di...,sebastián piñera became known abroad for overs...
3,Fans fume over Jason Donovan Rocky Horror no-show,fans said they would not have booked if they h...
4,Coronation Street's John Savident - who played...,the star played fred elliott a character best ...


## Evaluation

In this stage, the recommendation model is evaluated with the Precision@k and Average Precision metrics. Both metrics measure how well the model provides recommendations that are relevant and useful to the user.

The following code defines the ground truth, a list of relevant news titles used as the reference, and then runs news_recommendation for a given input title. The recommended titles are stored as a list in the variable recommendations.

In [23]:
ground_truth = ["Actor Chance Perdomo dies in motorcycle accident, aged 27",
                "Sebastián Piñera: Former president of Chile dies in helicopter crash"]

recommendations_df = news_recommendation("Tom Cruise abseils off stadium roof in daring Olympic finale")
recommendations = recommendations_df["title"].tolist()

### Precision@K

The following code defines the precision_at_k function to compute Precision@k, the proportion of titles in the top-k recommendations that also appear in the user's history or ground truth. The function compares the recommendation list with the ground truth, after trimming whitespace and lowercasing both, to measure how accurate the recommendations are. Precision@5 for this example is 0.40 (40%).

In [24]:
def precision_at_k(user_history, recommended_titles, k=5):
    user_history_set = set([title.strip().lower() for title in user_history])
    recommended_top_k = [title.strip().lower() for title in recommended_titles[:k]]

    hits = sum(1 for title in recommended_top_k if title in user_history_set)
    return hits / k

precision = precision_at_k(ground_truth, recommendations)
print(f"Precision@5: {precision:.2f}")

Precision@5: 0.40
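
In the run above, the two ground-truth titles appear at ranks 1 and 3 of the five recommendations. The short sketch below shows how the same ranking scores at different cut-offs. `relevance` and `precision_at` are illustrative names, with each flag marking whether the recommendation at that rank is in ground_truth.

```python
# True marks a recommendation that appears in ground_truth (ranks 1 and 3 in the run above).
relevance = [True, False, True, False, False]

def precision_at(relevance_flags, k):
    """Share of the first k recommendations that are relevant."""
    return sum(relevance_flags[:k]) / k

for k in (1, 3, 5):
    print(f"Precision@{k}: {precision_at(relevance, k):.2f}")
```

Precision@k only counts how many relevant items fall inside the cut-off; it does not reward placing them earlier within the top k.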


### Average Precision

The following code defines the average_precision function to compute Average Precision, the mean of the precision values at the ranks where relevant items appear in the recommendation list. The function adds up the precision each time a relevant recommendation appears and divides the sum by the total number of relevant items in the ground truth. Average Precision for this example is 0.83 (83%).

In [25]:
def average_precision(user_history, recommended_titles):
  hits = 0
  sum_precision = 0

  for i, title in enumerate(recommended_titles):
    if title in user_history:
      hits += 1
      # precision at rank i + 1, counted only where a relevant item appears
      precision_at_i = hits / (i + 1)
      sum_precision += precision_at_i

  # return 0.0 when no relevant item was recommended
  if hits == 0:
    return 0.0

  return sum_precision / len(user_history)

ap = average_precision(ground_truth, recommendations)
print(f"Average Precision: {ap:.2f}")

Average Precision: 0.83
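
The reported value follows from the ranks of the two relevant titles: (1/1 + 2/3) / 2 ≈ 0.83. The sketch below reproduces that arithmetic from relevance flags and contrasts it with a hypothetical ranking in which the same two titles appear last. The function name and the second ranking are assumptions made for illustration.

```python
def average_precision_from_flags(relevance_flags, n_relevant):
    """Average the precision values at each rank where a relevant item appears."""
    hits = 0
    total = 0.0
    for rank, is_relevant in enumerate(relevance_flags, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank
    return total / n_relevant if n_relevant else 0.0

observed = [True, False, True, False, False]   # hits at ranks 1 and 3, as in the run above
late_hits = [False, False, False, True, True]  # hypothetical: same hits at ranks 4 and 5

ap_observed = average_precision_from_flags(observed, 2)  # (1/1 + 2/3) / 2
ap_late = average_precision_from_flags(late_hits, 2)     # (1/4 + 2/5) / 2
print(ap_observed, ap_late)
```

Both rankings have the same Precision@5, but Average Precision is much lower when the relevant titles appear late, which is why the two metrics are reported together.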
