**Step 1: Importing the libraries**

In [54]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.metrics import confusion_matrix  # Import confusion_matrix separately
import nltk
from sklearn.model_selection import train_test_split, GridSearchCV



**Step 2: Loading the Dataset**
- I downloaded the dataset from this webpage: https://www.kaggle.com/datasets/team-ai/spam-text-message-classification
- We also have to specify the file's encoding when reading it, since the messages contain many special characters and symbols

In [74]:
# Load the full dataset; it is split into training and test sets later
df = pd.read_csv('H:/Mi unidad/4. Data Science/Postulación MG/DataSet/SPAM text message 20170820.csv', encoding='latin-1')
df


| | Category | Message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| ... | ... | ... |
| 5567 | spam | This is the 2nd time we have tried 2 contact u... |
| 5568 | ham | Will Ã¼ b going to esplanade fr home? |
| 5569 | ham | Pity, * was in mood for that. So...any other s... |
| 5570 | ham | The guy did some bitching but I acted like i'd... |
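
The `Ã¼` in row 5568 looks like what happens when UTF-8 bytes are decoded as Latin-1, so the file may actually be UTF-8 encoded. The sketch below reproduces that effect with the standard library only. The sample sentence assumes the original character was `ü`; it illustrates the mechanism and is not a check of the actual CSV.

```python
# Assumption: the original message contained "ü" (U+00FC) stored as UTF-8.
original = "Will \u00fc b going to esplanade fr home?"

# Decoding the UTF-8 bytes as Latin-1 turns the one character into two.
garbled = original.encode("utf-8").decode("latin-1")
print(garbled)

# Reversing the two steps recovers the intended text.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == original)
```

If that assumption holds, passing `encoding='utf-8'` to `pd.read_csv` would likely avoid the stray characters, although this has not been verified against the file itself.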


**Step 3: Cleaning the dataset**
- We will build a function that does the following: i. lowercases the text, ii. removes special characters and punctuation, iii. removes stopwords, and iv. joins the tokens back into text

In [75]:
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# word_tokenize relies on NLTK's Punkt tokenizer data; download it once
# (e.g. nltk.download('punkt')) if it is not already installed.

# Use a separate name so the imported stopwords module is not overwritten
stop_words = set(stopwords.words('english'))

def clean_text(text):
    # Lowercasing
    text = text.lower()

    # Removing special characters and punctuation
    text = ''.join(c for c in text if c.isalnum() or c.isspace())

    # Removing stopwords
    tokens = word_tokenize(text)
    tokens = [token for token in tokens if token not in stop_words]

    # Join tokens back to text
    text = ' '.join(tokens)

    return text

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\PC\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [76]:
# Apply the cleaning function to every message

df['Message'] = df['Message'].apply(clean_text)
df


| | Category | Message |
|---|---|---|
| 0 | ham | go jurong point crazy available bugis n great ... |
| 1 | ham | ok lar joking wif u oni |
| 2 | spam | free entry 2 wkly comp win fa cup final tkts 2... |
| 3 | ham | u dun say early hor u c already say |
| 4 | ham | nah dont think goes usf lives around though |
| ... | ... | ... |
| 5567 | spam | 2nd time tried 2 contact u u â750 pound prize ... |
| 5568 | ham | ã¼ b going esplanade fr home |
| 5569 | ham | pity mood soany suggestions |
| 5570 | ham | guy bitching acted like id interested buying s... |
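
Row 4 of the cleaned output shows a side effect of the step order: punctuation is stripped before stopwords are removed, so `don't` becomes `dont`, which no longer matches an entry in the stopword list. The sketch below illustrates the effect with a made-up sentence, a tiny hypothetical stopword set, and `str.split` in place of NLTK's tokenizer, so it runs without any downloads. It is not a replacement for `clean_text`.

```python
# Hypothetical mini stopword set for illustration (NLTK's list is much longer).
stop_words = {"i", "he", "to", "don't"}

def strip_then_filter(text):
    # Same order as clean_text: remove punctuation first, then stopwords.
    text = "".join(c for c in text.lower() if c.isalnum() or c.isspace())
    return " ".join(t for t in text.split() if t not in stop_words)

def filter_then_strip(text):
    # Alternative order: remove stopwords first, then punctuation.
    tokens = [t for t in text.lower().split() if t not in stop_words]
    text = " ".join(tokens)
    return "".join(c for c in text if c.isalnum() or c.isspace())

message = "I don't think he goes to usf"  # made-up example sentence
print(strip_then_filter(message))  # dont think goes usf
print(filter_then_strip(message))  # think goes usf
```

Whether the leftover `dont` tokens matter for the classifier is an open question; the sketch only shows why they appear.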


In [77]:
# Split messages and labels into training (80%) and test (20%) sets
train_df, test_df, train_labels, test_labels = train_test_split(
    df['Message'], df['Category'], test_size=0.2, random_state=42)

**Step 4: Feature extraction**

- We will use the Term Frequency-Inverse Document Frequency (TF-IDF) technique. 
- Term Frequency (TF) counts how many times a word appears in a message. Inverse Document Frequency (IDF) measures how rare a word is across all messages, so words that appear almost everywhere get a low weight. Combining TF and IDF gives each word a score that reflects how important it is in a particular message relative to the whole collection. We compute this for every word in every message to create numerical features that the model can use.
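
To make the scoring concrete, here is a minimal hand-rolled sketch on a hypothetical three-message corpus. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` and the unit-length (L2) scaling that scikit-learn documents as the `TfidfVectorizer` defaults. Tokenization is simplified to `str.split`, so treat it as an illustration of the idea rather than the library's exact behaviour.

```python
import math
from collections import Counter

# Hypothetical toy corpus; each string plays the role of one message.
docs = ["free prize call now", "call me now", "see you at home"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of messages that contain each word.
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def idf(word):
    # Smoothed inverse document frequency: rarer words get larger values.
    return math.log((1 + n_docs) / (1 + doc_freq[word])) + 1

def tfidf(tokens):
    # Raw term count times IDF, then scaled to unit Euclidean length.
    counts = Counter(tokens)
    weights = {w: c * idf(w) for w, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in weights.values()))
    return {w: v / norm for w, v in weights.items()}

scores = tfidf(tokenized[0])
for word, score in sorted(scores.items()):
    print(f"{word}: {score:.3f}")
```

In the first message, "free" and "prize" occur in only one message each, so they score higher than "call" and "now", which also occur in the second message.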

In [78]:
# Create a TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Fit the vectorizer on the training text data and transform it into numerical features
train_features = vectorizer.fit_transform(train_df)

# Transform the testing text data into numerical features
test_features = vectorizer.transform(test_df)


**Step 5: Training the classification model**
- Once the text is vectorized into numbers, we will use a supervised machine learning technique called Logistic Regression.
- Logistic Regression computes a weighted sum of the input features and passes it through a function that outputs a probability. If the probability is above a certain threshold (0.5 when calling scikit-learn's `predict`), it predicts the message as spam; otherwise, it predicts it as ham.
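
The probability-and-threshold idea can be sketched in a few lines of plain Python. The weights, bias, and feature values below are made up for illustration; in the notebook they are learned by `model.fit` from the TF-IDF features.

```python
import math

def sigmoid(z):
    # Map any real-valued score to a probability between 0 and 1.
    return 1.0 / (1.0 + math.exp(-z))

def predict(features, weights, bias, threshold=0.5):
    # Weighted sum of the features, squashed into a probability.
    z = bias + sum(w * x for w, x in zip(weights, features))
    prob_spam = sigmoid(z)
    label = "spam" if prob_spam >= threshold else "ham"
    return label, prob_spam

# Hypothetical weights for two features, e.g. TF-IDF scores of "free" and "home".
weights = [3.0, -2.0]
bias = -1.0

print(predict([0.9, 0.0], weights, bias))  # high "free" score -> spam
print(predict([0.0, 0.8], weights, bias))  # high "home" score -> ham
```

Lowering `threshold` would flag more messages as spam, trading precision for recall.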

In [79]:
# Create a logistic regression model
model = LogisticRegression(random_state=0)

# Train the model on the training data
model.fit(train_features, train_labels)


LogisticRegression(random_state=0)

In [80]:
# Predict the labels for the testing data
predictions = model.predict(test_features)

# Calculate the accuracy of the model
accuracy = accuracy_score(test_labels, predictions)
print("Accuracy:", accuracy)

# Generate a classification report
print("Classification Report:")
print(classification_report(test_labels, predictions))


Accuracy: 0.9587443946188341
Classification Report:
              precision    recall  f1-score   support

         ham       0.96      1.00      0.98       966
        spam       0.97      0.71      0.82       149

    accuracy                           0.96      1115
   macro avg       0.96      0.85      0.90      1115
weighted avg       0.96      0.96      0.96      1115



**Step 6: Findings**

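These results indicate that the model performs well overall, with an accuracy of about 96%. Because the test set is imbalanced (966 ham messages versus 149 spam messages), accuracy alone is optimistic: a model that always predicted "ham" would already reach about 87%. For the "ham" class, precision (0.96) and recall (1.00) are both high, so almost all legitimate messages are classified correctly. For the "spam" class, precision is high (0.97) but recall is lower (0.71), giving an F1-score of 0.82. In other words, messages flagged as spam are almost always spam, but about 29% of spam messages are false negatives, misclassified as ham. The confusion matrix below makes this concrete: 43 of the 149 spam messages were missed, while only 3 ham messages were flagged as spam.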

In [81]:
# Generate a confusion matrix
cm = confusion_matrix(test_labels, predictions)
print("Confusion Matrix:")
print(cm)

Confusion Matrix:
[[963   3]
 [ 43 106]]
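
The spam-class figures in the classification report can be recomputed directly from this matrix. The sketch below assumes scikit-learn's convention that rows are true labels and columns are predicted labels, with classes in sorted order (ham, spam), which is consistent with the support counts of 966 and 149. The variable names are chosen for illustration.

```python
# Confusion matrix copied from the output above.
matrix = [[963, 3],
          [43, 106]]

tn, fp = matrix[0]  # ham kept as ham, ham flagged as spam
fn, tp = matrix[1]  # spam missed, spam caught

spam_precision = tp / (tp + fp)
spam_recall = tp / (tp + fn)
spam_f1 = 2 * spam_precision * spam_recall / (spam_precision + spam_recall)
overall_accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"spam precision: {spam_precision:.2f}")   # 0.97
print(f"spam recall:    {spam_recall:.2f}")      # 0.71
print(f"spam F1:        {spam_f1:.2f}")          # 0.82
print(f"accuracy:       {overall_accuracy:.4f}") # 0.9587
```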
