# Assignment

### Task 1

Build a Multinomial Naive Bayes classification model with the following requirements:

1. Use the `spam.csv` dataset
2. Use `CountVectorizer` features with **stop_words** enabled
3. Evaluate the results

### Task 2

Build a Multinomial Naive Bayes classification model with the following requirements:

1. Use the `spam.csv` dataset
2. Use `TF-IDF` features with **stop_words** enabled
3. Evaluate the results and compare them with the results of Task 1.
4. Conclude which feature type works best for the `spam.csv` data

# Answers

### Task 1

In [76]:
import numpy as np
import pandas as pd

df = pd.read_csv('spam.csv', encoding='latin-1')

df.head()

| | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | | | |
| 1 | ham | Ok lar... Joking wif u oni... | | | |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | | | |
| 3 | ham | U dun say so early hor... U c already then say... | | | |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | | | |


In [77]:
# Keep only the first two columns, v1 (label) and v2 (text)
df = df.drop(columns=df.columns[2:])

df.head()

| | v1 | v2 |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [78]:
new_cols = {
    'v1': 'Labels',
    'v2': 'SMS'
}

df = df.rename(columns=new_cols)

df.head()

| | Labels | SMS |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [79]:
print(df['Labels'].value_counts())
print('\n')

print(df.info())
print('\n')

print(df.describe())

ham     4825
spam     747
Name: Labels, dtype: int64


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Labels  5572 non-null   object
 1   SMS     5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB
None


       Labels                     SMS
count    5572                    5572
unique      2                    5169
top       ham  Sorry, I'll call later
freq     4825                      30


In [80]:
new_labels = {
    'spam': 1,
    'ham': 0
}

df['Labels'] = df['Labels'].map(new_labels)

df.head()

| | Labels | SMS |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |


In [81]:
X = df['SMS'].values
y = df['Labels'].values

In [100]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=50)

# Initialize CountVectorizer with stop_words='english'
bow = CountVectorizer(stop_words='english')

X_train = bow.fit_transform(X_train)

X_test = bow.transform(X_test)

In [101]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(len(bow.get_feature_names_out()))

print(f'Data dimensions : {X_train.shape}')

# print(bow.get_feature_names_out())

7466
Data dimensions : (4457, 7466)


In [102]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Initialize and fit the model on the bag-of-words features
mnb = MultinomialNB()

mnb.fit(X_train, y_train)

# Predict and score on the training data
y_pred_train = mnb.predict(X_train)

acc_train = accuracy_score(y_train, y_pred_train)

# Predict and score on the test data
y_pred_test = mnb.predict(X_test)

acc_test = accuracy_score(y_test, y_pred_test)

print('Accuracy with Count Vectorizer')
print(f'Train accuracy: {acc_train}')
print(f'Test accuracy: {acc_test}')

Accuracy with Count Vectorizer
Train accuracy: 0.9946152120260264
Test accuracy: 0.9829596412556054
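
The label counts earlier in this task (4825 ham versus 747 spam) show that the classes are imbalanced, so accuracy alone may hide how well spam itself is detected. The following is a minimal sketch, not part of the original notebook, of how a confusion matrix and spam precision/recall could be reported. It runs on a tiny made-up corpus rather than `spam.csv`, and names such as `train_texts` and `X_train_toy` are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus for illustration only (not taken from spam.csv).
train_texts = [
    "win a free prize now",
    "free entry to win cash",
    "claim your free cash prize",
    "are we meeting for lunch today",
    "call me when you get home",
    "see you at the office tomorrow",
]
train_labels = [1, 1, 1, 0, 0, 0]  # 1 = spam, 0 = ham

test_texts = ["free cash prize waiting", "lunch at the office today"]
test_labels = [1, 0]

# Fit the vocabulary on the training texts only, then reuse it for the test texts.
vectorizer = CountVectorizer(stop_words="english")
X_train_toy = vectorizer.fit_transform(train_texts)
X_test_toy = vectorizer.transform(test_texts)

model = MultinomialNB().fit(X_train_toy, train_labels)
pred = model.predict(X_test_toy)

# Rows are true classes, columns are predicted classes, ordered [ham, spam].
cm = confusion_matrix(test_labels, pred, labels=[0, 1])
spam_precision = precision_score(test_labels, pred, pos_label=1)
spam_recall = recall_score(test_labels, pred, pos_label=1)

print(cm)
print(f"Spam precision: {spam_precision}, spam recall: {spam_recall}")
```

In the notebook itself, the same three metric calls could be applied to `y_test` and `y_pred_test`.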


### Task 2

Using the same data prepared in Task 1, we now measure the model's accuracy with TF-IDF features.

In [107]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

# Split the data into training and test sets
Xt_train, Xt_test, yt_train, yt_test = train_test_split(X, y, test_size=0.2, random_state=50)

# Initialize TfidfVectorizer with stop_words='english'
tfidf = TfidfVectorizer(stop_words='english')

# Fit TfidfVectorizer on Xt_train and transform it
Xt_train = tfidf.fit_transform(Xt_train)

# Transform Xt_test
# Why only transform? For the same reason as in experiment 3:
# the vectorizer must reuse the vocabulary and IDF weights learned from Xt_train and must not be refit on the test data,
# so that the test data stays unseen by the model
Xt_test = tfidf.transform(Xt_test)
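
The fit-on-train, transform-on-test rule above can be illustrated with a small sketch. The sentences and the names `toy_tfidf`, `train_docs`, and `test_docs` are hypothetical and are not taken from `spam.csv`. The point is that `transform` reuses the training vocabulary, so words that appear only in the test text are ignored instead of creating new columns.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy sentences for illustration only (not taken from spam.csv).
train_docs = ["free prize today", "meeting today at noon"]
test_docs = ["free lottery ticket"]

toy_tfidf = TfidfVectorizer(stop_words="english")

# fit_transform learns the vocabulary and IDF weights from the training text only.
train_matrix = toy_tfidf.fit_transform(train_docs)

# transform reuses that vocabulary; words never seen in training are dropped.
test_matrix = toy_tfidf.transform(test_docs)

print(sorted(toy_tfidf.vocabulary_))
print(train_matrix.shape, test_matrix.shape)
print("lottery" in toy_tfidf.vocabulary_)
```

Both matrices end up with the same number of columns, which is what allows a model trained on `train_matrix` to score `test_matrix`.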

In [108]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(len(tfidf.get_feature_names_out()))

print(f'Data dimensions : {Xt_train.shape}')

7466
Data dimensions : (4457, 7466)
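
The feature count matches Task 1 (7466) because both vectorizers tokenize and filter stop words the same way by default; only the values stored in the matrix differ. A minimal sketch on made-up sentences (the names `docs`, `count_vec`, and `tfidf_vec` are hypothetical) shows the shared vocabulary alongside the different weighting:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy sentences for illustration only (not taken from spam.csv).
docs = ["free prize free cash", "lunch meeting today", "free lunch today"]

count_vec = CountVectorizer(stop_words="english")
tfidf_vec = TfidfVectorizer(stop_words="english")

counts = count_vec.fit_transform(docs)
weights = tfidf_vec.fit_transform(docs)

# Same tokenizer and stop word list, so the learned vocabularies are identical.
same_vocab = count_vec.vocabulary_ == tfidf_vec.vocabulary_
same_shape = counts.shape == weights.shape

# CountVectorizer stores integer counts; TfidfVectorizer stores floats and,
# with its default norm='l2', scales every non-empty row to unit length.
row_norms = np.linalg.norm(weights.toarray(), axis=1)

print(same_vocab, same_shape)
print(counts.dtype, weights.dtype)
print(row_norms)
```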


In [111]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Initialize MultinomialNB
mnb = MultinomialNB()

# Fit the model
mnb.fit(Xt_train, yt_train)

# Predict on the training data
yt_pred_train = mnb.predict(Xt_train)

# Evaluate accuracy on the training data
acc_train_tfidf = accuracy_score(yt_train, yt_pred_train)

# Predict on the test data
yt_pred_test = mnb.predict(Xt_test)

# Evaluate accuracy on the test data
acc_test_tfidf = accuracy_score(yt_test, yt_pred_test)

# Print the evaluation results
print('Accuracy with TF-IDF')
print(f'Train accuracy: {acc_train_tfidf}')
print(f'Test accuracy: {acc_test_tfidf}')

Accuracy with TF-IDF
Train accuracy: 0.9842943684092439
Test accuracy: 0.9605381165919282


### Conclusion

Both experiments enabled the stop words parameter, once with Count Vectorizer and once with TF-IDF. On this train/test split (`test_size=0.2`, `random_state=50`), Count Vectorizer gives the higher accuracy on both sets, with a test accuracy of about 0.983 versus about 0.961 for TF-IDF. We therefore conclude that Count Vectorizer is the better feature for Multinomial Naive Bayes on the `spam.csv` data. The train and test accuracies for both features are:

> Accuracy with Count Vectorizer
- Train accuracy: 0.9946152120260264
- Test accuracy: 0.9829596412556054

> Accuracy with TF-IDF
- Train accuracy: 0.9842943684092439
- Test accuracy: 0.9605381165919282
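
The side-by-side comparison could also be produced in a single step. The helper below is a sketch, not the code used above: `compare_vectorizers` is a hypothetical name, and it assumes the same split settings as the notebook (`test_size=0.2`, `random_state=50`). It wraps each vectorizer and `MultinomialNB` in a scikit-learn pipeline so that the vectorizer is always fit on the training portion only.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def compare_vectorizers(texts, labels, test_size=0.2, random_state=50):
    """Return {name: (train_accuracy, test_accuracy)} for both feature types."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        texts, labels, test_size=test_size, random_state=random_state
    )
    candidates = {
        "count": CountVectorizer(stop_words="english"),
        "tfidf": TfidfVectorizer(stop_words="english"),
    }
    results = {}
    for name, vectorizer in candidates.items():
        # The pipeline fits the vectorizer and the classifier on X_tr only.
        pipeline = make_pipeline(vectorizer, MultinomialNB())
        pipeline.fit(X_tr, y_tr)
        results[name] = (
            accuracy_score(y_tr, pipeline.predict(X_tr)),
            accuracy_score(y_te, pipeline.predict(X_te)),
        )
    return results
```

With the notebook's variables this would be called as `compare_vectorizers(X, y)`, where `X` and `y` are the raw SMS texts and numeric labels defined in Task 1.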