
- Build a Multinomial Naive Bayes classification model with the following requirements:
    1. Use the spam.csv data
    2. Use CountVectorizer features with stop_words enabled
    3. Evaluate the results

- Build a Multinomial Naive Bayes classification model with the following requirements:
    1. Use the spam.csv data
    2. Use TF-IDF features with stop_words enabled
    3. Evaluate the results and compare them with the results of Task no. 2.
    4. Conclude which feature type works best for the spam.csv data

In [79]:
# Load data
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import warnings
df = pd.read_csv('spam.csv', encoding='latin-1')
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   v1          5572 non-null   object
 1   v2          5572 non-null   object
 2   Unnamed: 2  50 non-null     object
 3   Unnamed: 3  12 non-null     object
 4   Unnamed: 4  6 non-null      object
dtypes: object(5)
memory usage: 217.8+ KB


   v1    v2                                                 Unnamed: 2  Unnamed: 3  Unnamed: 4
0  ham   Go until jurong point, crazy.. Available only ...  NaN         NaN         NaN
1  ham   Ok lar... Joking wif u oni...                      NaN         NaN         NaN
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...  NaN         NaN         NaN
3  ham   U dun say so early hor... U c already then say...  NaN         NaN         NaN
4  ham   Nah I don't think he goes to usf, he lives aro...  NaN         NaN         NaN


In [80]:
# Clean the data - keep only the columns we need
df = df[['v1', 'v2']]
df.columns = ['label', 'message']

# Drop rows with NaN values
df = df.dropna()

# Separate the features and the target
X = df['message']
y = df['label']

In [81]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a CountVectorizer with stop_words
vectorizer = CountVectorizer(stop_words='english')

# Transform the text into feature vectors
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Create and train the Multinomial Naive Bayes model
model = MultinomialNB()
model.fit(X_train_vec, y_train)



MultinomialNB(alpha=1.0, force_alpha=True, fit_prior=True, class_prior=None)
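
The fitted estimator reports alpha=1.0, which is the additive (Laplace) smoothing term. As a rough illustration of what that parameter does, the sketch below implements a multinomial Naive Bayes classifier by hand on a made-up four-message corpus. The toy data and the helper names `log_likelihood` and `predict` are assumptions for illustration only; this is not how scikit-learn is implemented internally.

```python
import math
from collections import Counter

# Hypothetical toy corpus (not taken from spam.csv).
train = [
    ("spam", "free prize win"),
    ("spam", "win free cash"),
    ("ham", "meeting tomorrow"),
    ("ham", "call tomorrow"),
]
alpha = 1.0  # same smoothing value as reported by the fitted model

vocab = sorted({w for _, text in train for w in text.split()})
class_docs = Counter(label for label, _ in train)
word_counts = {label: Counter() for label in class_docs}
for label, text in train:
    word_counts[label].update(text.split())


def log_likelihood(word, label):
    """log P(word | label) with additive smoothing."""
    total = sum(word_counts[label].values())
    return math.log((word_counts[label][word] + alpha) / (total + alpha * len(vocab)))


def predict(text):
    """Return the label with the highest log prior plus summed log likelihoods."""
    scores = {}
    for label in class_docs:
        score = math.log(class_docs[label] / len(train))
        for word in text.split():
            if word in vocab:  # out-of-vocabulary words are ignored
                score += log_likelihood(word, label)
        scores[label] = score
    return max(scores, key=scores.get)


print(predict("free tomorrow win"))
print(predict("meeting tomorrow"))
```

Without smoothing, "tomorrow" would have probability zero under the spam class and a single unseen word would veto that class entirely; with alpha=1.0 every vocabulary word keeps a small non-zero probability.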


In [82]:
# Predict on the test data
y_pred = model.predict(X_test_vec)

In [83]:
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

In [84]:
# Show the evaluation results
print("=" * 60)
print("MULTINOMIAL NAIVE BAYES MODEL EVALUATION")
print("=" * 60)
print(f"Accuracy: {accuracy:.4f}")
print("\nConfusion Matrix:")
print(conf_matrix)
print("\nClassification Report:")
print(class_report)



MULTINOMIAL NAIVE BAYES MODEL EVALUATION
Accuracy: 0.9839

Confusion Matrix:
[[959   6]
 [ 12 138]]

Classification Report:
              precision    recall  f1-score   support

         ham       0.99      0.99      0.99       965
        spam       0.96      0.92      0.94       150

    accuracy                           0.98      1115
   macro avg       0.97      0.96      0.96      1115
weighted avg       0.98      0.98      0.98      1115
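
To make the report easier to read, the following sketch recomputes its main figures directly from the confusion matrix printed above. Rows are true labels and columns are predicted labels, in the order (ham, spam). The helper `per_class_metrics` is a hypothetical name introduced here, not part of scikit-learn.

```python
# Confusion matrix from the CountVectorizer run above.
cm = [[959, 6],
      [12, 138]]
labels = ["ham", "spam"]


def per_class_metrics(cm, k):
    """Precision, recall, F1 and support for class index k."""
    tp = cm[k][k]
    fp = sum(cm[i][k] for i in range(len(cm))) - tp  # predicted k, actually another class
    fn = sum(cm[k]) - tp                             # actually k, predicted another class
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1, sum(cm[k])


rows = [per_class_metrics(cm, k) for k in range(len(labels))]
total = sum(r[3] for r in rows)
accuracy = sum(cm[k][k] for k in range(len(labels))) / total

# Macro average: plain mean over classes. Weighted average: mean weighted by support.
macro_precision = sum(r[0] for r in rows) / len(rows)
weighted_precision = sum(r[0] * r[3] for r in rows) / total

for name, (p, r, f1, support) in zip(labels, rows):
    print(f"{name:>5}  precision={p:.2f}  recall={r:.2f}  f1={f1:.2f}  support={support}")
print(f"accuracy={accuracy:.4f}  macro_p={macro_precision:.2f}  weighted_p={weighted_precision:.2f}")
```

Because ham makes up most of the test set, the weighted average stays close to the ham scores, while the macro average gives spam equal weight.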



In [85]:
# Show some additional information
print("\n" + "=" * 60)
print("ADDITIONAL INFORMATION")
print("=" * 60)
print(f"Training samples: {len(X_train)} ( {(len(X_train)/len(df))*100:.2f}% )")
print(f"Test samples: {len(X_test)} ( {(len(X_test)/len(df))*100:.2f}% )")
print(f"Number of features (vocabulary): {len(vectorizer.vocabulary_)}")
print("Class distribution in the test data:")
print(y_test.value_counts())


ADDITIONAL INFORMATION
Training samples: 4457 ( 79.99% )
Test samples: 1115 ( 20.01% )
Number of features (vocabulary): 7472
Class distribution in the test data:
label
ham     965
spam    150
Name: count, dtype: int64
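
The class distribution above is imbalanced, which affects how an accuracy figure should be read. As a small worked example using only the counts printed above, the sketch below computes the share of spam in the test set and the accuracy of a trivial baseline that always predicts the majority class. The variable names are illustrative.

```python
# Test-set class counts taken from the output above.
test_counts = {"ham": 965, "spam": 150}

total = sum(test_counts.values())
spam_share = test_counts["spam"] / total

# A baseline that labels every message with the most frequent class.
majority_label = max(test_counts, key=test_counts.get)
baseline_accuracy = test_counts[majority_label] / total

print(f"Spam share in the test set: {spam_share:.4f}")
print(f"Always-'{majority_label}' baseline accuracy: {baseline_accuracy:.4f}")
```

A classifier therefore needs to beat roughly 0.87 accuracy before it says anything useful about spam, which is why precision and recall for the spam class are worth checking alongside accuracy.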


**TF-IDF**

In [86]:
from sklearn.feature_extraction.text import TfidfVectorizer
# Create a TF-IDF vectorizer with stop_words
tfidf_vectorizer = TfidfVectorizer(stop_words='english')

# Transform the text into TF-IDF feature vectors
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

# Create and train the Multinomial Naive Bayes model with TF-IDF
model_tfidf = MultinomialNB()
model_tfidf.fit(X_train_tfidf, y_train)

MultinomialNB(alpha=1.0, force_alpha=True, fit_prior=True, class_prior=None)
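
For intuition about what the TF-IDF features contain, here is a hand-rolled sketch of the weighting on two made-up, already tokenized documents. It assumes the scikit-learn defaults as documented for `TfidfVectorizer` (smoothed idf, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row); the names `doc_freq` and `tfidf_vector` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical tokenized documents (stop words already removed).
docs = [["free", "prize"], ["free", "call"]]
n_docs = len(docs)

# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for doc in docs for term in set(doc))

# Smoothed inverse document frequency (assumed default behaviour).
idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in doc_freq.items()}


def tfidf_vector(doc):
    """Term frequency times idf, scaled so the vector has unit L2 norm."""
    tf = Counter(doc)
    raw = {term: tf[term] * idf[term] for term in tf}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {term: v / norm for term, v in raw.items()}


weights = tfidf_vector(docs[0])
print({term: round(w, 4) for term, w in weights.items()})
```

The term "free" appears in both documents, so it receives a lower weight than "prize". Unlike raw counts, the resulting features are fractional and every message is rescaled to the same length.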


In [87]:
# Predict on the test data
y_pred_tfidf = model_tfidf.predict(X_test_tfidf)

In [88]:
# Evaluate the TF-IDF model
accuracy_tfidf = accuracy_score(y_test, y_pred_tfidf)
conf_matrix_tfidf = confusion_matrix(y_test, y_pred_tfidf)
class_report_tfidf = classification_report(y_test, y_pred_tfidf)

In [89]:
print("\nTF-IDF VECTORIZER RESULTS:")
print("=" * 40)
print(f"Accuracy: {accuracy_tfidf:.4f}")
print(f"Number of features: {len(tfidf_vectorizer.vocabulary_)}")
print("\nConfusion Matrix:")
print(conf_matrix_tfidf)
print("\nClassification Report:")
print(class_report_tfidf)


TF-IDF VECTORIZER RESULTS:
Accuracy: 0.9668
Number of features: 7472

Confusion Matrix:
[[965   0]
 [ 37 113]]

Classification Report:
              precision    recall  f1-score   support

         ham       0.96      1.00      0.98       965
        spam       1.00      0.75      0.86       150

    accuracy                           0.97      1115
   macro avg       0.98      0.88      0.92      1115
weighted avg       0.97      0.97      0.96      1115



In [90]:
# ===== SHOW THE COMPARISON RESULTS =====
print("=" * 70)
print("MODEL COMPARISON: TF-IDF vs COUNT VECTORIZER")
print("=" * 70)

print("\nTF-IDF VECTORIZER RESULTS:")
print("=" * 40)
print(f"Accuracy: {accuracy_tfidf:.4f}")
print(f"Number of features: {len(tfidf_vectorizer.vocabulary_)}")
print("\nConfusion Matrix:")
print(conf_matrix_tfidf)
print("\nClassification Report:")
print(class_report_tfidf)

print("\nCOUNT VECTORIZER RESULTS:")
print("=" * 40)
print(f"Accuracy: {accuracy:.4f}")
print("\nConfusion Matrix:")
print(conf_matrix)
print("\nClassification Report:")
print(class_report)

MODEL COMPARISON: TF-IDF vs COUNT VECTORIZER

TF-IDF VECTORIZER RESULTS:
Accuracy: 0.9668
Number of features: 7472

Confusion Matrix:
[[965   0]
 [ 37 113]]

Classification Report:
              precision    recall  f1-score   support

         ham       0.96      1.00      0.98       965
        spam       1.00      0.75      0.86       150

    accuracy                           0.97      1115
   macro avg       0.98      0.88      0.92      1115
weighted avg       0.97      0.97      0.96      1115


COUNT VECTORIZER RESULTS:
Accuracy: 0.9839

Confusion Matrix:
[[959   6]
 [ 12 138]]

Classification Report:
              precision    recall  f1-score   support

         ham       0.99      0.99      0.99       965
        spam       0.96      0.92      0.94       150

    accuracy                           0.98      1115
   macro avg       0.97      0.96      0.96      1115
weighted avg       0.98      0.98      0.98      1115



In [91]:
# ===== DETAILED COMPARISON ANALYSIS =====
print("\n" + "=" * 70)
print("DETAILED COMPARISON ANALYSIS")
print("=" * 70)


DETAILED COMPARISON ANALYSIS


In [92]:
# Compute additional metrics for the comparison.
# With labels sorted as (ham, spam), the positive class here is spam.
def calculate_metrics(conf_matrix):
    tn, fp, fn, tp = conf_matrix.ravel()
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
    return precision, recall, f1

In [93]:
precision_tfidf, recall_tfidf, f1_tfidf = calculate_metrics(conf_matrix_tfidf)
precision_count, recall_count, f1_count = calculate_metrics(conf_matrix)

In [94]:
print(f"{'METRIC':<15} {'TF-IDF':<10} {'COUNT':<10} {'DIFF'}")
print("-" * 50)
print(f"{'Accuracy':<15} {accuracy_tfidf:<10.4f} {accuracy:<10.4f} {accuracy_tfidf-accuracy:+.4f}")
print(f"{'Precision':<15} {precision_tfidf:<10.4f} {precision_count:<10.4f} {precision_tfidf-precision_count:+.4f}")
print(f"{'Recall':<15} {recall_tfidf:<10.4f} {recall_count:<10.4f} {recall_tfidf-recall_count:+.4f}")
print(f"{'F1-Score':<15} {f1_tfidf:<10.4f} {f1_count:<10.4f} {f1_tfidf-f1_count:+.4f}")

METRIC          TF-IDF     COUNT      DIFF
--------------------------------------------------
Accuracy        0.9668     0.9839     -0.0170
Precision       1.0000     0.9583     +0.0417
Recall          0.7533     0.9200     -0.1667
F1-Score        0.8593     0.9388     -0.0795


In [95]:
# ===== CONCLUSION =====
print("\n" + "=" * 70)
print("CONCLUSION")
print("=" * 70)

if accuracy_tfidf > accuracy:
    print("TF-IDF performs BETTER than Count Vectorizer")
    improvement = ((accuracy_tfidf - accuracy) / accuracy) * 100
    print(f"   Relative accuracy improvement: {improvement:.2f}%")
elif accuracy_tfidf < accuracy:
    print("Count Vectorizer performs BETTER than TF-IDF")
    improvement = ((accuracy - accuracy_tfidf) / accuracy_tfidf) * 100
    print(f"   Relative accuracy improvement: {improvement:.2f}%")
else:
    print("Both methods perform THE SAME")

# "\n" (not "\D") starts the line break; "\D" is an invalid escape sequence
print("\nComparison details:")
print(f"   • TF-IDF Vectorizer: {accuracy_tfidf:.4f}")
print(f"   • Count Vectorizer:  {accuracy:.4f}")

# Analysis based on the other metrics
print("\nSpam detection metrics:")
print(f"   • TF-IDF precision: {precision_tfidf:.4f} (higher is better)")
print(f"   • TF-IDF recall: {recall_tfidf:.4f} (higher is better)")
print(f"   • Count precision: {precision_count:.4f}")
print(f"   • Count recall: {recall_count:.4f}")



CONCLUSION
Count Vectorizer performs BETTER than TF-IDF
   Relative accuracy improvement: 1.76%

Comparison details:
   • TF-IDF Vectorizer: 0.9668
   • Count Vectorizer:  0.9839

Spam detection metrics:
   • TF-IDF precision: 1.0000 (higher is better)
   • TF-IDF recall: 0.7533 (higher is better)
   • Count precision: 0.9583
   • Count recall: 0.9200


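**TF-IDF is better at avoiding false positives (ham classified as spam): 0 versus 6 for CountVectorizer. CountVectorizer misses far fewer spam messages (12 versus 37) and scores higher on accuracy, recall, and F1.**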
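
The trade-off behind this conclusion can be expressed as error rates. The sketch below derives the false positive rate (ham flagged as spam) and the false negative rate (spam let through) from the two confusion matrices reported earlier; `error_rates` is a hypothetical helper name.

```python
def error_rates(cm):
    """Return (false positive rate, false negative rate) for a 2x2 matrix.

    Rows are true labels and columns are predictions, ordered (ham, spam),
    so spam is treated as the positive class.
    """
    (tn, fp), (fn, tp) = cm
    return fp / (fp + tn), fn / (fn + tp)


# Confusion matrices copied from the outputs above.
cm_count = [[959, 6], [12, 138]]
cm_tfidf = [[965, 0], [37, 113]]

fpr_count, fnr_count = error_rates(cm_count)
fpr_tfidf, fnr_tfidf = error_rates(cm_tfidf)

print(f"Count : FPR={fpr_count:.4f}  FNR={fnr_count:.4f}")
print(f"TF-IDF: FPR={fpr_tfidf:.4f}  FNR={fnr_tfidf:.4f}")
```

Which feature type is preferable depends on which error is more costly: TF-IDF never blocks a legitimate message in this test set, while CountVectorizer lets through far less spam.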