# Lab Assignment JOBSHEET 4

## Tasks
1. Create a classification model using SVM for the voice.csv data. 
2. Create a Multinomial Naive Bayes classification model with the following conditions:
    1. Use the spam.csv data.
    2. Utilize CountVectorizer with stop words enabled.
    3. Evaluate the results.
3. Create another Multinomial Naive Bayes classification model with the following conditions:
    1. Use the spam.csv data.
    2. Employ TF-IDF features with stop words enabled.
    3. Evaluate the results and compare them with the results from Task #2.
    4. Provide a conclusion on which feature extraction method is best for the spam.csv dataset.

### 1. Create a classification model using SVM for the voice.csv data.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [2]:
data = pd.read_csv('voice.csv')

# Check the structure of the dataset
print(data.head())

# Separate features and target
X = data.iloc[:, :-1]  # Features
y = data.iloc[:, -1]   # Target

   meanfreq        sd    median       Q25       Q75       IQR       skew  \
0  0.059781  0.064241  0.032027  0.015071  0.090193  0.075122  12.863462   
1  0.066009  0.067310  0.040229  0.019414  0.092666  0.073252  22.423285   
2  0.077316  0.083829  0.036718  0.008701  0.131908  0.123207  30.757155   
3  0.151228  0.072111  0.158011  0.096582  0.207955  0.111374   1.232831   
4  0.135120  0.079146  0.124656  0.078720  0.206045  0.127325   1.101174   

          kurt    sp.ent       sfm  ...  centroid   meanfun    minfun  \
0   274.402906  0.893369  0.491918  ...  0.059781  0.084279  0.015702   
1   634.613855  0.892193  0.513724  ...  0.066009  0.107937  0.015826   
2  1024.927705  0.846389  0.478905  ...  0.077316  0.098706  0.015656   
3     4.177296  0.963322  0.727232  ...  0.151228  0.088965  0.017798   
4     4.333713  0.971955  0.783568  ...  0.135120  0.106398  0.016931   

     maxfun   meandom    mindom    maxdom   dfrange   modindx  label  
0  0.275862  0.007812  0.007812  

In [3]:
# Encode the string labels as integers; LabelEncoder sorts classes alphabetically,
# so female -> 0 and male -> 1
encoder = LabelEncoder()
y = encoder.fit_transform(y)

# Standardize the features
scaler = StandardScaler()
X = scaler.fit_transform(X)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
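The cell above fits the scaler on all rows before splitting, so the test rows contribute to the mean and variance used for scaling. One common way to avoid that is to split first and let a scikit-learn `Pipeline` fit the scaler on the training rows only. The following is a minimal sketch, not the notebook's method: it uses synthetic data from `make_classification` as a stand-in for voice.csv, and the names `X_demo`, `y_demo`, and `leak_free_model` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the voice.csv features (assumption: 20 numeric columns)
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=42)

# Split first so the test rows never influence the scaler statistics
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The pipeline fits StandardScaler on the training rows only, then trains the SVM
leak_free_model = make_pipeline(
    StandardScaler(), SVC(kernel='linear', random_state=42)
)
leak_free_model.fit(X_tr, y_tr)

# score() applies the training-set scaling to the test rows before predicting
demo_accuracy = leak_free_model.score(X_te, y_te)
print('Demo accuracy:', demo_accuracy)
```

With a pipeline, the same object can be passed to `cross_val_score`, which refits the scaler inside each fold.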

In [4]:
# Initialize and train the SVM model
svm_model = SVC(kernel='linear', random_state=42)  # Linear kernel is used for simplicity
svm_model.fit(X_train, y_train)

In [5]:
# Predict on the test set
y_pred = svm_model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)

# Print classification report
print('Classification Report:')
print(classification_report(y_test, y_pred))

# Print confusion matrix
print('Confusion Matrix:')
print(confusion_matrix(y_test, y_pred))

Accuracy: 0.9763406940063092
Classification Report:
              precision    recall  f1-score   support

           0       0.96      0.99      0.98       297
           1       0.99      0.97      0.98       337

    accuracy                           0.98       634
   macro avg       0.98      0.98      0.98       634
weighted avg       0.98      0.98      0.98       634

Confusion Matrix:
[[293   4]
 [ 11 326]]
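The figures in the classification report can be recovered directly from the confusion matrix, which helps when checking a report by hand. The sketch below recomputes accuracy, precision, and recall from the matrix printed above, using scikit-learn's convention that rows are true classes and columns are predicted classes. The variable names are illustrative only.

```python
import numpy as np

# Confusion matrix reported above: rows are true classes, columns are predictions
cm = np.array([[293, 4],
               [11, 326]])

cm_accuracy = np.trace(cm) / cm.sum()        # correct predictions / all predictions
precision = np.diag(cm) / cm.sum(axis=0)     # per class: correct / predicted as that class
recall = np.diag(cm) / cm.sum(axis=1)        # per class: correct / actually that class

print('accuracy :', cm_accuracy)
print('precision:', precision.round(2))
print('recall   :', recall.round(2))
```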


### 2. Create a Multinomial Naive Bayes classification model with the following conditions:
1. Use the spam.csv data.
2. Utilize CountVectorizer with stop words enabled.
3. Evaluate the results.

In [6]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [7]:
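# Load the dataset (this run uses voice.csv rather than the spam.csv named in the task,
# so the results below describe voice.csv)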
data = pd.read_csv('voice.csv')

# Separate features and target
X = data.iloc[:, :-1]  # Features
y = data.iloc[:, -1]   # Target

# Encode the string labels as integers (LabelEncoder sorts classes: female -> 0, male -> 1)
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
y = encoder.fit_transform(y)

# Join each row's numeric values into one space-separated string so CountVectorizer can tokenize it
X_text = X.applymap(str).apply(' '.join, axis=1)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_text, y, test_size=0.2, random_state=42)

In [8]:
from sklearn.feature_extraction.text import CountVectorizer

# Initialize CountVectorizer with stop words
vectorizer = CountVectorizer(stop_words='english')

# Fit and transform the training data
X_train_vectorized = vectorizer.fit_transform(X_train)

# Transform the testing data
X_test_vectorized = vectorizer.transform(X_test)

In [9]:
# Initialize and train the Multinomial Naive Bayes model
naive_bayes_model = MultinomialNB()
naive_bayes_model.fit(X_train_vectorized, y_train)

In [10]:
# Predict on the test set
y_pred = naive_bayes_model.predict(X_test_vectorized)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)

# Print classification report
print('Classification Report:')
print(classification_report(y_test, y_pred))

# Print confusion matrix
print('Confusion Matrix:')
print(confusion_matrix(y_test, y_pred))

Accuracy: 0.6545741324921136
Classification Report:
              precision    recall  f1-score   support

           0       0.60      0.77      0.68       297
           1       0.73      0.55      0.63       337

    accuracy                           0.65       634
   macro avg       0.67      0.66      0.65       634
weighted avg       0.67      0.65      0.65       634

Confusion Matrix:
[[229  68]
 [151 186]]
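The task statement for this part names spam.csv, a text dataset, while the cells above were run on voice.csv. As a rough illustration of the same CountVectorizer and Multinomial Naive Bayes steps applied to raw text, the sketch below uses a handful of made-up messages. The messages, the `spam`/`ham` label strings, and all variable names are assumptions for illustration and are not taken from spam.csv.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up messages standing in for spam.csv (assumption: one text and one label per row)
train_texts = [
    'win free prize now',
    'free cash prize claim now',
    'claim your free reward',
    'are we meeting for lunch today',
    'see you at home tonight',
    'call me when you get home',
]
train_labels = ['spam', 'spam', 'spam', 'ham', 'ham', 'ham']

# Learn the vocabulary on the training messages only, dropping English stop words
text_vectorizer = CountVectorizer(stop_words='english')
train_counts = text_vectorizer.fit_transform(train_texts)

text_model = MultinomialNB()
text_model.fit(train_counts, train_labels)

# New messages are transformed with the vocabulary learned from the training set
new_texts = ['free prize', 'lunch at home']
predictions = list(text_model.predict(text_vectorizer.transform(new_texts)))
print(predictions)
```

On real data the messages and labels would come from the CSV columns, followed by the same `train_test_split` and evaluation calls used above.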


### 3. Create another Multinomial Naive Bayes classification model with the following conditions:
1. Use the spam.csv data.
2. Employ TF-IDF features with stop words enabled.
3. Evaluate the results and compare them with the results from Task #2.
4. Provide a conclusion on which feature extraction method is best for the spam.csv dataset.

In [11]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [12]:
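# Load the dataset (this run uses voice.csv rather than the spam.csv named in the task,
# so the results below describe voice.csv)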
data = pd.read_csv('voice.csv')

# Separate features and target
X = data.iloc[:, :-1]  # Features
y = data.iloc[:, -1]   # Target

# Encode the string labels as integers (LabelEncoder sorts classes: female -> 0, male -> 1)
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
y = encoder.fit_transform(y)

# Join each row's numeric values into one space-separated string so TfidfVectorizer can tokenize it
X_text = X.applymap(str).apply(' '.join, axis=1)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_text, y, test_size=0.2, random_state=42)

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TF-IDF Vectorizer with stop words
vectorizer = TfidfVectorizer(stop_words='english')

# Fit and transform the training data
X_train_tfidf = vectorizer.fit_transform(X_train)

# Transform the testing data
X_test_tfidf = vectorizer.transform(X_test)
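To make the difference between the two feature extractors concrete, the sketch below vectorizes three made-up documents both ways. `CountVectorizer` returns raw term counts, while `TfidfVectorizer` down-weights terms that occur in many documents and, with its default `norm='l2'`, scales each row to unit length. The documents and variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up documents for illustration
docs = [
    'free prize free prize free',
    'free lunch today',
    'lunch meeting today',
]

counts = CountVectorizer(stop_words='english').fit_transform(docs).toarray()
tfidf = TfidfVectorizer(stop_words='english').fit_transform(docs).toarray()

print('Counts, first document:', counts[0])
print('TF-IDF, first document:', tfidf[0].round(3))

# Each TF-IDF row has Euclidean length 1 under the default L2 normalization
row_norms = np.linalg.norm(tfidf, axis=1)
print('TF-IDF row norms:', row_norms)
```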

In [14]:
# Initialize and train the Multinomial Naive Bayes model
naive_bayes_model_tfidf = MultinomialNB()
naive_bayes_model_tfidf.fit(X_train_tfidf, y_train)

In [16]:
# Predict on the test set
y_pred_tfidf = naive_bayes_model_tfidf.predict(X_test_tfidf)

# Calculate accuracy
accuracy_tfidf = accuracy_score(y_test, y_pred_tfidf)
print('Accuracy using TF-IDF:', accuracy_tfidf)

# Print classification report
print('Classification Report using TF-IDF:')
print(classification_report(y_test, y_pred_tfidf))

# Print confusion matrix
print('Confusion Matrix using TF-IDF:')
print(confusion_matrix(y_test, y_pred_tfidf))

Accuracy using TF-IDF: 0.6403785488958991
Classification Report using TF-IDF:
              precision    recall  f1-score   support

           0       0.58      0.81      0.68       297
           1       0.74      0.49      0.59       337

    accuracy                           0.64       634
   macro avg       0.66      0.65      0.64       634
weighted avg       0.67      0.64      0.63       634

Confusion Matrix using TF-IDF:
[[240  57]
 [171 166]]


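### CONCLUSION:
In the reported runs, CountVectorizer slightly outperforms TF-IDF with Multinomial Naive Bayes: accuracy is 0.655 versus 0.640, macro-averaged F1 is 0.65 versus 0.64, and recall for class 1 is 0.55 versus 0.49. TF-IDF is ahead only on class 0 recall (0.81 versus 0.77) and, marginally, on class 1 precision (0.74 versus 0.73). Of the two feature extraction methods, CountVectorizer is therefore the better choice in these experiments.

Both runs loaded voice.csv rather than spam.csv, so this conclusion applies to the voice.csv data; the comparison needs to be repeated on spam.csv to answer Task #3 for that dataset. Both Naive Bayes models also fall well short of the SVM from Task #1 (accuracy 0.976). A plausible reason is that voice.csv contains numeric acoustic features, and the text vectorizers treat the digits of each number as arbitrary tokens rather than as quantities.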
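The three confusion matrices reported above can be compared side by side with a few lines of code. The sketch below recomputes each accuracy from its matrix and picks the highest one. The dictionary keys are descriptive labels chosen here for illustration.

```python
# Confusion matrices copied from the outputs above (rows: true class, columns: predicted)
reported = {
    'SVM (Task 1)': [[293, 4], [11, 326]],
    'NB + CountVectorizer (Task 2)': [[229, 68], [151, 186]],
    'NB + TF-IDF (Task 3)': [[240, 57], [171, 166]],
}

accuracies = {}
for name, cm in reported.items():
    correct = cm[0][0] + cm[1][1]          # diagonal entries are correct predictions
    total = sum(sum(row) for row in cm)    # all test samples
    accuracies[name] = correct / total

for name, acc in accuracies.items():
    print(f'{name}: {acc:.4f}')

best = max(accuracies, key=accuracies.get)
print('Highest accuracy:', best)
```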