###**Mini-Project (Machine Learning)**###

---


##**Sentiment Analysis on SMS Classification Dataset**##

---



In [None]:
import pandas as pd

# Read the CSV as 'latin-1' rather than the default UTF-8
df = pd.read_csv('sms_spam.csv', encoding='latin-1')

In [None]:
# Displaying the first few rows
print(df.head())

                                             Message Classification
0  Go until jurong point, crazy.. Available only ...            ham
1                      Ok lar... Joking wif u oni...            ham
2  Free entry in 2 a wkly comp to win FA Cup fina...           spam
3  U dun say so early hor... U c already then say...            ham
4  Nah I don't think he goes to usf, he lives aro...            ham


In [None]:
print(df.shape)  # (number of rows, number of columns)

(580, 2)


####**Removing Stop Words**####

In [None]:
import nltk
from nltk.corpus import stopwords
import re

# Download the NLTK stopword list (a no-op if it is already present)
nltk.download('stopwords')

# Build the stopword set once instead of on every call
STOP_WORDS = set(stopwords.words('english'))

def preprocess_text(text):
    """Lowercase, strip digits and punctuation, then drop English stopwords."""
    text = text.lower()

    # Remove digits
    text = re.sub(r'\d+', '', text)

    # Remove punctuation (anything that is not a word character or whitespace)
    text = re.sub(r'[^\w\s]', '', text)

    # Drop stopwords; the text is already lowercase at this point
    return ' '.join(word for word in text.split() if word not in STOP_WORDS)

# Applying preprocessing to the dataset
df['Message'] = df['Message'].apply(preprocess_text)


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
df.head()

                                             Message Classification
0  go jurong point crazy available bugis n great ...            ham
1                            ok lar joking wif u oni            ham
2  free entry wkly comp win fa cup final tkts st ...           spam
3                u dun say early hor u c already say            ham
4        nah dont think goes usf lives around though            ham
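The last row above still contains `dont`, which hints that the order of the cleaning steps matters: once the apostrophe is stripped, the word no longer matches a stopword entry such as `don't`. The following is a minimal sketch of that effect, not the notebook's own function. It assumes a small hand-picked stopword set (`STOP_WORDS_DEMO`, a hypothetical name) standing in for NLTK's list, and the two helper functions are illustrative only.

```python
import re

# Small hand-picked stopword set standing in for NLTK's list (assumption for illustration)
STOP_WORDS_DEMO = {"i", "he", "to", "don't"}

def strip_then_filter(text):
    """Notebook order: remove punctuation first, then drop stopwords."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(w for w in text.split() if w not in STOP_WORDS_DEMO)

def filter_then_strip(text):
    """Alternative order: drop stopwords first, then remove punctuation."""
    kept = [w for w in text.lower().split() if w not in STOP_WORDS_DEMO]
    return re.sub(r"[^\w\s]", "", " ".join(kept))

sample = "Nah I don't think he goes to usf"
print(strip_then_filter(sample))   # "dont" survives because it no longer matches "don't"
print(filter_then_strip(sample))   # "don't" is removed before punctuation is stripped
```

Whether `dont` should be kept is a modelling choice; the sketch only shows why it appears in the cleaned text.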


In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [None]:
# Three vectorizers, one per feature representation
vectorizer1 = CountVectorizer(binary=True)   # Binary: 1 if the term occurs in the message, else 0
vectorizer2 = CountVectorizer(binary=False)  # Raw term counts
tfidf_vect = TfidfVectorizer()               # TF-IDF weights

In [None]:
# Transforming the data
X1 = vectorizer1.fit_transform(df['Message'])
X2 = vectorizer2.fit_transform(df['Message'])
X3 = tfidf_vect.fit_transform(df['Message'])

In [None]:
# Only the last expression in a cell is displayed, so the output below describes X3
X1
X2
X3

<580x2180 sparse matrix of type '<class 'numpy.float64'>'
	with 5874 stored elements in Compressed Sparse Row format>
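To make the difference between the binary and frequency representations concrete, here is a hand-rolled sketch on two toy messages. It is not how `CountVectorizer` is implemented; the messages and the helper names (`count_vector`, `binary_vector`) are made up for illustration, and tokenization is a plain whitespace split.

```python
from collections import Counter

# Toy messages (made up for illustration; not taken from sms_spam.csv)
messages = ["free free prize call", "call me later"]

# Vocabulary: sorted unique tokens, playing the role of a fitted vectorizer's vocabulary
vocab = sorted({tok for msg in messages for tok in msg.split()})

def count_vector(msg):
    """Number of times each vocabulary term occurs in the message."""
    counts = Counter(msg.split())
    return [counts[tok] for tok in vocab]

def binary_vector(msg):
    """1 if the vocabulary term occurs in the message at all, else 0."""
    return [1 if c > 0 else 0 for c in count_vector(msg)]

count_rows = [count_vector(m) for m in messages]
binary_rows = [binary_vector(m) for m in messages]

print(vocab)
print(count_rows)
print(binary_rows)
```

The two representations differ only where a term repeats within a message, as `free` does in the first toy message.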

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# Splitting the data into training and test sets.
# The same random_state on the same number of rows gives the same row split each time,
# so y_train and y_test are identical across the three calls.
X_train1, X_test1, y_train, y_test = train_test_split(X1, df['Classification'], test_size=0.25, random_state=1)
X_train2, X_test2, y_train, y_test = train_test_split(X2, df['Classification'], test_size=0.25, random_state=1)
X_train3, X_test3, y_train, y_test = train_test_split(X3, df['Classification'], test_size=0.25, random_state=1)

In [None]:
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.metrics import accuracy_score

In [None]:
# Classifiers: BernoulliNB for the binary features, MultinomialNB for counts and for TF-IDF
bnb = BernoulliNB()
mnb = MultinomialNB()
mnb2 = MultinomialNB()

In [None]:
# Fitting the model
bnb.fit(X_train1, y_train)

In [None]:
mnb.fit(X_train2, y_train)

In [None]:
mnb2.fit(X_train3, y_train)

In [None]:
# Predicting
pred1 = bnb.predict(X_test1)
pred2 = mnb.predict(X_test2)
pred3 = mnb2.predict(X_test3)

# Checking the Accuracy
print('Accuracy score for Binary-based Vectors is', accuracy_score(y_test, pred1))
print('Accuracy score for Frequency-based vectors is', accuracy_score(y_test, pred2))
print('Accuracy score for TFIDF based vectors is', accuracy_score(y_test, pred3))

Accuracy score for Binary-based Vectors is 0.896551724137931
Accuracy score for Frequency-based vectors is 0.9172413793103448
Accuracy score for TFIDF based vectors is 0.9172413793103448
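The accuracy scores can be translated back into message counts, which makes the size of the gap easier to judge. This short sketch uses only numbers already printed in the notebook (580 rows, `test_size=0.25`, and the three accuracy scores); the variable names are illustrative.

```python
# Values copied from the notebook output
n_rows = 580
test_size = 0.25
accuracies = {
    "binary": 0.896551724137931,
    "frequency": 0.9172413793103448,
    "tfidf": 0.9172413793103448,
}

# Size of the test set implied by the split
n_test = round(n_rows * test_size)

# Number of correctly classified test messages per representation
correct = {name: round(acc * n_test) for name, acc in accuracies.items()}

print(n_test)
print(correct)
print(correct["frequency"] - correct["binary"])
```

On a test set of this size, the difference between the best and worst score comes down to a handful of messages.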


####**Saving and Loading the Model and Vectorizer**####

In [None]:
import joblib

# Persist the trained BernoulliNB model (`bnb`) and its fitted CountVectorizer (`vectorizer1`)
joblib.dump(bnb, 'SentimentalSMS_model')
joblib.dump(vectorizer1, 'vectorizer1')

['vectorizer1']

In [None]:
# Load the trained model and vectorizer
loaded_model = joblib.load('SentimentalSMS_model')
loaded_vectorizer = joblib.load('vectorizer1')

# Sample text for prediction
new_text = ['Limited time offer! Get a free phone with your new plan. Call now!']

# Transform the new text
new_text_transformed = loaded_vectorizer.transform(new_text)

# Predict using the loaded model
new_prediction = loaded_model.predict(new_text_transformed)

print("Prediction of new text:", new_prediction)

Prediction of new text: ['spam']
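Saving the model and the vectorizer as two files works, but they must always be loaded and used together. One possible alternative, sketched here under assumptions and not used in this notebook, is to bundle both steps with scikit-learn's `make_pipeline` and persist a single object. The toy training data and the file name `sms_pipeline.joblib` are hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Toy training data (made up for illustration; not taken from sms_spam.csv)
texts = [
    "win free prize now",
    "free cash prize call now",
    "are we meeting for dinner",
    "see you at lunch tomorrow",
]
labels = ["spam", "spam", "ham", "ham"]

# Vectorizer and classifier bundled into one estimator
pipeline = make_pipeline(CountVectorizer(binary=True), BernoulliNB())
pipeline.fit(texts, labels)

# Save and reload the whole pipeline as one file (hypothetical file name)
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sms_pipeline.joblib")
    joblib.dump(pipeline, path)
    reloaded = joblib.load(path)

# The reloaded pipeline accepts raw strings directly
new_messages = ["free prize", "dinner tomorrow"]
predictions = list(reloaded.predict(new_messages))
print(predictions)
```

With this layout there is no separate vectorizer file to keep in sync with the model.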


In [None]:
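# predict_sentiment is not defined elsewhere in this notebook; the helper below is an
# assumed minimal version that reuses the preprocessing, vectorizer, and model from above
def predict_sentiment(message):
    cleaned = preprocess_text(message)
    features = loaded_vectorizer.transform([cleaned])
    return loaded_model.predict(features)[0]

# Testing with more sample inputs
sample_messages = [
    'Congratulations! You have won a $1000 cash prize. Call now to claim your prize.',
    'Hey, are we still meeting for dinner tonight?',
    'Limited time offer! Get a free phone with your new plan. Call now!',
    'Hi, just checking in. How\'s everything going?'
]

for msg in sample_messages:
    print(f"Sample: {msg}")
    print("Prediction:", predict_sentiment(msg))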

Sample: Congratulations! You have won a $1000 cash prize. Call now to claim your prize.
Prediction: spam
Sample: Hey, are we still meeting for dinner tonight?
Prediction: ham
Sample: Limited time offer! Get a free phone with your new plan. Call now!
Prediction: spam
Sample: Hi, just checking in. How's everything going?
Prediction: ham


In [None]:
# Examining the class distribution
print("Class distribution in training data:")
print(y_train.value_counts())

Class distribution in training data:
Classification
ham     234
spam    201
Name: count, dtype: int64


In [None]:
from sklearn.metrics import classification_report

# Predictions
predictions = bnb.predict(X_test1)

# Classification report
print("Classification Report:")
print(classification_report(y_test, predictions))

Classification Report:
              precision    recall  f1-score   support

         ham       0.81      1.00      0.90        66
        spam       1.00      0.81      0.90        79

    accuracy                           0.90       145
   macro avg       0.91      0.91      0.90       145
weighted avg       0.92      0.90      0.90       145



In [None]:
from sklearn.metrics import confusion_matrix

# Confusion matrix
cm = confusion_matrix(y_test, predictions, labels=['ham', 'spam'])
print("Confusion Matrix:")
print(cm)

Confusion Matrix:
[[66  0]
 [15 64]]
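The figures in the classification report can be recomputed directly from this confusion matrix. The sketch below does so with plain arithmetic, assuming the scikit-learn convention that rows are true labels and columns are predicted labels, both in the order `['ham', 'spam']`; the variable names are illustrative.

```python
# Confusion matrix copied from the output: rows = true labels, columns = predictions
cm = [[66, 0],
      [15, 64]]
labels = ["ham", "spam"]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

metrics = {}
for i, label in enumerate(labels):
    true_pos = cm[i][i]
    predicted = sum(row[i] for row in cm)  # column sum: everything predicted as this label
    actual = sum(cm[i])                    # row sum: everything that truly has this label
    metrics[label] = {
        "precision": true_pos / predicted,
        "recall": true_pos / actual,
    }

print(f"accuracy: {accuracy:.2f}")
for label, m in metrics.items():
    print(f"{label}: precision={m['precision']:.2f} recall={m['recall']:.2f}")
```

Read this way, all 15 errors are spam messages labelled as ham; no ham message was labelled as spam.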




---


####***Conclusion:***####
The SMS classification project reached roughly 90% test accuracy or higher with each of the three vectorization techniques. Here are the accuracy results for each method on the 145-message test set:

- **Binary-based Vectors:** 89.7%
- **Frequency-based Vectors:** 91.7%
- **TF-IDF Vectors:** 91.7%

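The frequency-based and TF-IDF methods both reached 91.7% accuracy, slightly ahead of the binary-based method at 89.7%. On a test set of 145 messages, that gap amounts to three messages (133 versus 130 classified correctly).

This suggests that the frequency-based and TF-IDF approaches classify these SMS messages somewhat more accurately. The margin is small, however, and the binary vectors were paired with BernoulliNB while the other two used MultinomialNB, so the comparison reflects both the features and the classifier.

The classification report and confusion matrix, computed for the BernoulliNB model on binary vectors, show that every 'ham' message was recognized (recall 1.00) and that no 'ham' message was flagged as 'spam' (spam precision 1.00). All errors run the other way: 15 of the 79 'spam' messages were labelled 'ham', giving a spam recall of 0.81 and a ham precision of 0.81. Overall, the frequency-based and TF-IDF methods gave the best accuracy on this task, making them reasonable first choices for future applications.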


---

