1. **Imports & Dependencies:**

In [50]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

2. **Loading the Dataset:**

In [51]:
file_path = r"..\Assignment 2\SMSSpamCollection"
# Load the file as a tab-separated values (TSV) file
df = pd.read_csv(file_path, sep='\t', header=None, names=["Label", "Message"])

df.head()

|   | Label | Message |
|---|-------|---------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


3. **Encoding the Labels:**

In [52]:
print(df["Label"].value_counts())
df['Label'] = df['Label'].map({'ham': 0, 'spam': 1})
df.head()

Label
ham     4825
spam     747
Name: count, dtype: int64


|   | Label | Message |
|---|-------|---------|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |


4. **Data Cleaning Steps:** Stopword and punctuation removal are handled by a single preprocessing function, `preprocess_text`, which is applied to every message.

In [53]:
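# Requires the NLTK stopwords corpus; run nltk.download('stopwords') once if it is missing
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    """Lowercase, strip punctuation, and drop stopwords and single-character tokens."""
    text = text.lower()  # Convert to lowercase
    text = text.translate(str.maketrans('', '', string.punctuation))  # Remove punctuation
    words = text.split()  # Tokenize on whitespace
    # Remove stopwords and single-character tokens
    words = [word for word in words if word not in stop_words and len(word) > 1]
    return " ".join(words)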
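
As a rough illustration of what this function does to a single message, the sketch below re-implements it against a tiny hand-picked stopword set. `STOP_WORDS` is a hypothetical stand-in for NLTK's English list, so the example runs without downloading any corpus, and the `stop_words` parameter is added only for this illustration.

```python
import string

# Hypothetical stand-in for the NLTK English stopword list (assumption for this sketch).
STOP_WORDS = {"i", "he", "to", "so", "u", "don't"}

def preprocess_text(text, stop_words=STOP_WORDS):
    """Lowercase, strip punctuation, drop stopwords and single-character tokens."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    words = [w for w in text.split() if w not in stop_words and len(w) > 1]
    return " ".join(words)

sample = "Nah I don't think he goes to usf"
cleaned = preprocess_text(sample)
print(cleaned)
```

Because punctuation is stripped before the stopword check, `don't` has already become `dont` by the time it is compared, so it survives unless the stopword set also contains the apostrophe-free spelling.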

In [54]:
df['Processed_Message'] = df['Message'].apply(preprocess_text)
df.head()

|   | Label | Message | Processed_Message |
|---|-------|---------|-------------------|
| 0 | 0 | Go until jurong point, crazy.. Available only ... | go jurong point crazy available bugis great wo... |
| 1 | 0 | Ok lar... Joking wif u oni... | ok lar joking wif oni |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... | free entry wkly comp win fa cup final tkts 21s... |
| 3 | 0 | U dun say so early hor... U c already then say... | dun say early hor already say |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... | nah dont think goes usf lives around though |


In [55]:
df_final = df[['Label', 'Processed_Message']].rename(columns={'Processed_Message': 'Message'})
# Drop duplicate and missing rows, then renumber the index
df_final = df_final.drop_duplicates().dropna().reset_index(drop=True)
df_final.head()

|   | Label | Message |
|---|-------|---------|
| 0 | 0 | go jurong point crazy available bugis great wo... |
| 1 | 0 | ok lar joking wif oni |
| 2 | 1 | free entry wkly comp win fa cup final tkts 21s... |
| 3 | 0 | dun say early hor already say |
| 4 | 0 | nah dont think goes usf lives around though |


In [56]:
!git init

Initialized empty Git repository in E:/Sem 3/Assignment 2/.git/


In [57]:
!dvc init

Initialized DVC repository.

You can now commit the changes to git.

+---------------------------------------------------------------------+
|                                                                     |
|        DVC has enabled anonymous aggregate usage analytics.         |
|     Read the analytics documentation (and how to opt-out) here:     |
|             <https://dvc.org/doc/user-guide/analytics>              |
|                                                                     |
+---------------------------------------------------------------------+

What's next?
------------
- Check out the documentation: <https://dvc.org/doc>
- Get help and share ideas: <https://dvc.org/chat>
- Star us on GitHub: <https://github.com/iterative/dvc>


In [58]:
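# 60/20/20 split: hold out 20% for testing, then 25% of the remaining 80% for validation
train, test = train_test_split(df_final, test_size=0.2, random_state=21)
train, val = train_test_split(train, test_size=0.25, random_state=21)
# Save the full cleaned dataset and each split so DVC can track them
df_final.to_csv("raw_data.csv", index=False)
train.to_csv('train.csv', index=False)
val.to_csv('validation.csv', index=False)
test.to_csv('test.csv', index=False)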
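
The two chained calls give roughly a 60/20/20 split. The arithmetic can be sketched as follows, assuming scikit-learn rounds a fractional `test_size` up to a whole number of rows and leaves the remainder on the training side. `split_sizes` is a hypothetical helper written for this illustration, and 5106 is the sum of the three split totals reported in this notebook.

```python
import math

def split_sizes(n_rows, test_frac=0.2, val_frac_of_rest=0.25):
    """Return (train, validation, test) row counts for two chained splits."""
    n_test = math.ceil(n_rows * test_frac)         # first split: hold out the test set
    n_rest = n_rows - n_test                       # rows left for train + validation
    n_val = math.ceil(n_rest * val_frac_of_rest)   # second split: hold out validation
    n_train = n_rest - n_val
    return n_train, n_val, n_test

sizes = split_sizes(5106)
print(sizes)
```

Taking 25% of the remaining 80% is what makes the validation set about the same size as the test set.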

In [59]:
def print_data_distribution(name, data):
    """Print the per-label counts and the total number of rows for one split."""
    print(f"{name} data:\n")
    print(data['Label'].value_counts().to_string())
    print(f"Total data points = {len(data)}\n")

print_data_distribution("Training", train)
print_data_distribution("Validation", val)
print_data_distribution("Testing", test)

Training data:

Label
0    2684
1     379
Total data points = 3063

Validation data:

Label
0    895
1    126
Total data points = 1021

Testing data:

Label
0    896
1    126
Total data points = 1022
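
To make the class balance easier to compare across splits, the counts above can be turned into spam fractions. This is only a sketch: the numbers are copied by hand from the seed-21 output, and `spam_share` is a hypothetical helper.

```python
# Label counts copied from the seed-21 output above (0 = ham, 1 = spam).
splits = {
    "Training": {0: 2684, 1: 379},
    "Validation": {0: 895, 1: 126},
    "Testing": {0: 896, 1: 126},
}

def spam_share(counts):
    """Fraction of rows labelled spam (1) in one split."""
    return counts[1] / sum(counts.values())

for name, counts in splits.items():
    print(f"{name}: {spam_share(counts):.2%} spam")
```

All three splits come out at a little over 12% spam, so the plain random split happened to keep the classes in similar proportions here. Passing `stratify` to `train_test_split` would be the usual way to enforce that rather than rely on it.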



In [60]:
!dvc add raw_data.csv train.csv validation.csv test.csv
!git add raw_data.csv.dvc train.csv.dvc validation.csv.dvc test.csv.dvc .gitignore
!git commit -m "Added raw and split datasets with seed 21"


To track the changes with git, run:

	git add train.csv.dvc .gitignore validation.csv.dvc raw_data.csv.dvc test.csv.dvc

To enable auto staging, run:

	dvc config core.autostage true





[master (root-commit) ffcf9ac] Added raw and split datasets with seed 21
 8 files changed, 30 insertions(+)
 create mode 100644 .dvc/.gitignore
 create mode 100644 .dvc/config
 create mode 100644 .dvcignore
 create mode 100644 .gitignore
 create mode 100644 raw_data.csv.dvc
 create mode 100644 test.csv.dvc
 create mode 100644 train.csv.dvc
 create mode 100644 validation.csv.dvc


In [61]:
!git log

commit ffcf9ac83b46809e8ca42c6f413156fc071f802a
Author: msiddhesh <mahesh131400@gmail.com>
Date:   Tue Mar 4 19:40:22 2025 +0530

    Added raw and split datasets with seed 21


In [62]:
# Same 60/20/20 split with a different seed (77); raw_data.csv is rewritten with identical contents
train, test = train_test_split(df_final, test_size=0.2, random_state=77)
train, val = train_test_split(train, test_size=0.25, random_state=77)
df_final.to_csv("raw_data.csv", index=False)
train.to_csv('train.csv', index=False)
val.to_csv('validation.csv', index=False)
test.to_csv('test.csv', index=False)

In [63]:
print_data_distribution("Training", train)
print_data_distribution("Validation", val)
print_data_distribution("Testing", test)

Training data:

Label
0    2685
1     378
Total data points = 3063

Validation data:

Label
0    894
1    127
Total data points = 1021

Testing data:

Label
0    896
1    126
Total data points = 1022



In [64]:
!dvc add train.csv validation.csv test.csv
!git commit -am "Updated train/validation/test split with seed 77"


To track the changes with git, run:

	git add validation.csv.dvc test.csv.dvc train.csv.dvc

To enable auto staging, run:

	dvc config core.autostage true





[master a2b2943] Updated train/validation/test split with seed 77
 3 files changed, 6 insertions(+), 6 deletions(-)


In [65]:
!git log

commit a2b2943ab0569abcf24c9ff183e2f4122aae24bf
Author: msiddhesh <mahesh131400@gmail.com>
Date:   Tue Mar 4 19:40:42 2025 +0530

    Updated train/validation/test split with seed 77

commit ffcf9ac83b46809e8ca42c6f413156fc071f802a
Author: msiddhesh <mahesh131400@gmail.com>
Date:   Tue Mar 4 19:40:22 2025 +0530

    Added raw and split datasets with seed 21


In [66]:
!git checkout ffcf9ac83b46809e8ca42c6f413156fc071f802a
!dvc checkout

Note: switching to 'ffcf9ac83b46809e8ca42c6f413156fc071f802a'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at ffcf9ac Added raw and split datasets with seed 21


M       validation.csv
M       test.csv
M       train.csv
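
`git checkout` and `dvc checkout` only change files on disk, so DataFrames already in memory keep their previous contents until the files are read again. Below is a minimal sketch of re-reading the label counts from the restored files using only the standard library. `label_counts` is a hypothetical helper, and the file names are the ones written earlier in this notebook. Inside the notebook, calling `pd.read_csv` on each file would rebind `train`, `val` and `test` in the same spirit.

```python
import csv
from collections import Counter
from pathlib import Path

def label_counts(csv_path):
    """Count rows per value of the Label column in one CSV file."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return Counter(row["Label"] for row in csv.DictReader(handle))

for file_name in ("train.csv", "validation.csv", "test.csv"):
    path = Path(file_name)
    if not path.exists():  # lets the sketch run outside the project folder
        print(f"{file_name}: not found")
        continue
    counts = label_counts(path)
    print(f"{file_name}: {dict(counts)} total = {sum(counts.values())}")
```

Note that the labels come back as the strings `"0"` and `"1"` here, because the `csv` module does not infer column types.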


In [69]:
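# train, val and test are assumed to have been re-read from the checked-out CSV files before this call
print_data_distribution("Training", train)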
print_data_distribution("Validation", val)
print_data_distribution("Testing", test)

Training data:

Label
0    2684
1     379
Total data points = 3063

Validation data:

Label
0    895
1    126
Total data points = 1021

Testing data:

Label
0    896
1    126
Total data points = 1022



In [70]:
!git checkout a2b2943ab0569abcf24c9ff183e2f4122aae24bf
!dvc checkout

Previous HEAD position was ffcf9ac Added raw and split datasets with seed 21
HEAD is now at a2b2943 Updated train/validation/test split with seed 77


M       train.csv
M       validation.csv
M       test.csv


In [73]:
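# train, val and test are assumed to have been re-read from the checked-out CSV files before this call
print_data_distribution("Training", train)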
print_data_distribution("Validation", val)
print_data_distribution("Testing", test)

Training data:

Label
0    2685
1     378
Total data points = 3063

Validation data:

Label
0    894
1    127
Total data points = 1021

Testing data:

Label
0    896
1    126
Total data points = 1022

