In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# **MOUNTING GOOGLE DRIVE AND INSTALLING REQUIRED LIBRARIES**

## **Environment Setup (Google Drive + DVC)**

This block prepares the working environment for data version control.

&nbsp;&nbsp;&nbsp;• `drive.mount('/content/drive')` connects Google Drive to Colab for persistent storage.  
&nbsp;&nbsp;&nbsp;• Enables saving raw and processed datasets directly to Drive.  

&nbsp;&nbsp;&nbsp;• `!pip install dvc` installs Data Version Control (DVC).  
&nbsp;&nbsp;&nbsp;• DVC is used to track dataset versions (raw_data.csv, train.csv, validation.csv, test.csv).  
&nbsp;&nbsp;&nbsp;• Supports reproducible experiments and data version tracking.  


In [2]:
!pip install dvc

Collecting dvc
  Downloading dvc-3.66.1-py3-none-any.whl.metadata (17 kB)
Collecting celery (from dvc)
  Downloading celery-5.6.2-py3-none-any.whl.metadata (23 kB)
Collecting colorama>=0.3.9 (from dvc)
  Downloading colorama-0.4.6-py2.py3-none-any.whl.metadata (17 kB)
Collecting configobj>=5.0.9 (from dvc)
  Downloading configobj-5.0.9-py2.py3-none-any.whl.metadata (3.2 kB)
Collecting dpath<3,>=2.1.0 (from dvc)
  Downloading dpath-2.2.0-py3-none-any.whl.metadata (15 kB)
Collecting dulwich (from dvc)
  Downloading dulwich-1.0.0-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (5.4 kB)
Collecting dvc-data<3.19.0,>=3.18.0 (from dvc)
  Downloading dvc_data-3.18.2-py3-none-any.whl.metadata (5.0 kB)
Collecting dvc-http>=2.29.0 (from dvc)
  Downloading dvc_http-2.32.0-py3-none-any.whl.metadata (1.3 kB)
Collecting dvc-objects (from dvc)
  Downloading dvc_objects-5.2.0-py3-none-any.whl.metadata (3.9 kB)
Collecting dvc-render<2,>=1.0.1 (from dvc)
  Downloading dvc_render-1.0.2-py3-none-any.whl.met

## **Set Working Directory**

This block sets the project directory for the assignment.

&nbsp;&nbsp;&nbsp;• `%cd /content/drive/MyDrive/AMLAssignment/Assignment_2` changes the current working directory.  
&nbsp;&nbsp;&nbsp;• Ensures all data files and DVC metadata are stored inside the Assignment_2 folder.  
&nbsp;&nbsp;&nbsp;• Keeps the project organized and version-controlled within Google Drive.  


In [3]:
%cd /content/drive/MyDrive/AMLAssignment/Assignment_2


/content/drive/MyDrive/AMLAssignment/Assignment_2


## **Configure Git User**

This block configures Git identity for version control commits.

&nbsp;&nbsp;&nbsp;• `git config --global user.email` sets the global Git email.  
&nbsp;&nbsp;&nbsp;• `git config --global user.name` sets the global Git username.  
&nbsp;&nbsp;&nbsp;• Required for tracking commits when using DVC with Git.  
&nbsp;&nbsp;&nbsp;• Ensures proper authorship for dataset version changes.  




In [4]:
!git config --global user.email "pothanpranav@gmail.com"
!git config --global user.name "Pranav Pothan"


## **Initialize Git and DVC**

This block initializes version control for both code and data.

&nbsp;&nbsp;&nbsp;• `git init` initializes a new Git repository in the project directory.  
&nbsp;&nbsp;&nbsp;• `dvc init` initializes Data Version Control (DVC) for dataset tracking.  
&nbsp;&nbsp;&nbsp;• `git add .dvc .dvcignore` stages DVC configuration files for commit.  
&nbsp;&nbsp;&nbsp;• `git commit -m "Initialize DVC"` creates the first commit, enabling structured data version tracking.  


In [5]:
!git init
!dvc init
!git add .dvc .dvcignore
!git commit -m "Initialize DVC"


hint: Using 'master' as the name for the initial branch. This default branch name
hint: is subject to change. To configure the initial branch name to use in all
hint: 
hint: 	git config --global init.defaultBranch <name>
hint: 
hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and
hint: 'development'. The just-created branch can be renamed via this command:
hint: 
hint: 	git branch -m <name>
Initialized empty Git repository in /content/drive/MyDrive/AMLAssignment/Assignment_2/.git/
Initialized DVC repository.

You can now commit the changes to git.

DVC has enabled anonymous aggregate usage analytics.
Read the analytics documentation (and how to opt-out) here:

## **Create DVC Storage Directory**

This block creates a dedicated storage location for DVC-tracked data.

&nbsp;&nbsp;&nbsp;• `mkdir -p /content/drive/MyDrive/AMLAssignment/dvc_storage` creates a storage folder in Google Drive.  
&nbsp;&nbsp;&nbsp;• This directory will act as the remote storage backend for DVC.  
&nbsp;&nbsp;&nbsp;• Enables separation of compute (Colab runtime) and storage (Google Drive).  
&nbsp;&nbsp;&nbsp;• Supports versioned dataset backups and reproducibility.  


In [6]:
!mkdir -p /content/drive/MyDrive/AMLAssignment/dvc_storage


## **Configure DVC Remote (Google Drive Storage)**

This block configures Google Drive as the default DVC remote storage.

&nbsp;&nbsp;&nbsp;• `dvc remote add -d storage ...` sets the Drive folder as the default DVC remote.  
&nbsp;&nbsp;&nbsp;• `-d` marks it as the default remote for future push/pull operations.  
&nbsp;&nbsp;&nbsp;• `-f` forces overwrite if a remote with the same name already exists.  
&nbsp;&nbsp;&nbsp;• `.dvc/config` is added to Git to track remote configuration.  
&nbsp;&nbsp;&nbsp;• The commit ensures remote setup is version-controlled and reproducible.  


## **Bonus: DVC Remote Storage**
Bonus: Decoupling Compute and Storage

I configured a Google Drive folder as the DVC remote storage.
Git tracks only the small `.dvc` metadata files.
The actual dataset files are pushed to the remote folder in Google Drive.
This allows the data to be versioned independently of the compute runtime.


In [7]:
!dvc remote add -d storage /content/drive/MyDrive/AMLAssignment/dvc_storage -f
!git add .dvc/config
!git commit -m "Configure local Drive remote"
!dvc push
!dvc remote list


Setting 'storage' as a default remote.
[master 1a24dea] Configure local Drive remote
 1 file changed, 4 insertions(+)
Collecting          |0.00 [00:00,    ?entry/s]
Pushing
Everything is up to date.
storage /content/drive/MyDrive/AMLAssignment/dvc_storage        (default)

## **Load Raw SMS Dataset**

This block loads the original SMS Spam Collection dataset into a Pandas DataFrame.

&nbsp;&nbsp;&nbsp;• `pd.read_csv()` reads the dataset file into memory.  
&nbsp;&nbsp;&nbsp;• `sep="\t"` specifies that the file is tab-separated.  
&nbsp;&nbsp;&nbsp;• `header=None` indicates the dataset has no header row.  
&nbsp;&nbsp;&nbsp;• `names=["label", "text"]` assigns column names manually.  
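&nbsp;&nbsp;&nbsp;• `encoding="latin-1"` decodes the file as Latin-1, which never raises a decoding error; non-ASCII characters display correctly only if the file was actually written in that encoding.  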

&nbsp;&nbsp;&nbsp;• `df.shape` prints the dataset dimensions.  
&nbsp;&nbsp;&nbsp;• `df.head()` displays the first few rows for inspection.  
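
One thing worth checking after loading: under default quoting rules, a double quote at the start of a message opens a quoted field, and the parser can then merge several physical lines into one row. Whether that happens for this file is not shown in the notebook, so the sketch below only illustrates the effect on a made-up three-line sample using Python's standard `csv` module. The sample text and the helper name `count_rows` are hypothetical. `pd.read_csv` accepts the same `quoting=csv.QUOTE_NONE` option if quotes should be treated as ordinary characters.

```python
import csv
import io

# Hypothetical three-line sample: the first message starts with a double quote
# that is only closed at the end of the second line.
SAMPLE = 'ham\t"Hi\nspam\tWin cash now"\nham\tok\n'


def count_rows(text, quoting):
    """Count the records a tab-separated parser yields under a given quoting mode."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=quoting)
    return sum(1 for _ in reader)


physical_lines = len(SAMPLE.splitlines())
rows_default = count_rows(SAMPLE, csv.QUOTE_MINIMAL)  # quotes are interpreted
rows_literal = count_rows(SAMPLE, csv.QUOTE_NONE)     # quotes are ordinary characters

print("physical lines:", physical_lines)
print("rows with default quoting:", rows_default)
print("rows with QUOTE_NONE:", rows_literal)
```

Comparing the number of physical lines in the raw file with `df.shape[0]` is a quick way to see whether any rows were merged during loading.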


In [None]:
import pandas as pd

df = pd.read_csv(
    "SmSSpamCollection",   # exact file name
    sep="\t",
    header=None,
    names=["label", "text"],
    encoding="latin-1"
)

print("Dataset shape:", df.shape)
df.head()


Dataset shape: (5572, 2)


| | label | text |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


## **Create Structured Raw Data File**

This block converts labels into numeric format and saves a clean raw dataset.

&nbsp;&nbsp;&nbsp;• `map({"ham": 0, "spam": 1})` converts text labels into binary format.  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Ham → 0  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Spam → 1  

&nbsp;&nbsp;&nbsp;• The DataFrame is reduced to two columns:  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`text` → SMS message content  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`spam` → Binary target variable  

&nbsp;&nbsp;&nbsp;• `to_csv("raw_data.csv")` saves the cleaned dataset locally.  
&nbsp;&nbsp;&nbsp;• `index=False` prevents row indices from being stored in the file.  

This file is tracked with DVC in the next step.  


In [None]:
df["spam"] = df["label"].map({"ham": 0, "spam": 1})
df = df[["text", "spam"]]

df.to_csv("raw_data.csv", index=False)

print("raw_data.csv created successfully!")


raw_data.csv created successfully!


## **Track Raw Dataset with DVC**

This block tracks the raw dataset using Data Version Control (DVC).

&nbsp;&nbsp;&nbsp;• `dvc add raw_data.csv`  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Creates a `.dvc` file that tracks the dataset without storing large data inside Git.

&nbsp;&nbsp;&nbsp;• `git add raw_data.csv.dvc .gitignore`  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Adds the DVC metadata file to Git.  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Ensures the actual dataset file is ignored by Git.

&nbsp;&nbsp;&nbsp;• `git commit -m "Track raw_data.csv"`  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Commits the tracking configuration to version control.

&nbsp;&nbsp;&nbsp;• `dvc push`  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Uploads the dataset to the configured Google Drive remote storage.

This ensures data versioning is separated from source code while maintaining reproducibility.  
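
To make the idea of a `.dvc` pointer file concrete: DVC identifies each tracked file by a hash of its contents and records that hash in the small `.dvc` file that Git commits. The sketch below imitates this with `hashlib`, assuming MD5 over the raw file bytes. The helper `content_hash` and the sample byte strings are hypothetical, and this is not DVC's own code.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Return the hex MD5 digest of a file's bytes (assumed DVC-style content hash)."""
    return hashlib.md5(data).hexdigest()


# Hypothetical stand-ins for two versions of the same data file.
version_a = b"text,spam\nOk lar... Joking wif u oni...,0\n"
version_b = b"text,spam\nU dun say so early hor...,0\n"

hash_a = content_hash(version_a)
hash_b = content_hash(version_b)

# Identical bytes always give the same hash; changed content gives a different one.
print("same content, same hash:", hash_a == content_hash(version_a))
print("different content, different hash:", hash_a != hash_b)
```

Because the pointer file changes whenever the hash changes, committing it to Git ties each Git commit to one exact version of the data.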


In [None]:
!dvc add raw_data.csv
!git add raw_data.csv.dvc .gitignore
!git commit -m "Track raw_data.csv"
!dvc push


Adding...: 100% 1/1 [00:00<00:00,  3.17file/s]

To track the changes with git, run:

	git add .gitignore raw_data.csv.dvc

To enable auto staging, run:

	dvc config core.autostage true
[master e767d4d] Track raw_data.csv
 2 files changed, 6 insertions(+)
 create mode 100644 .gitignore
 create mode 100644 raw_data.csv.dvc
Collecting          |1.00 [00:00,  102entry/s]
Pushing

# Version 1

## **Split Dataset into Train / Validation / Test**

This block splits the dataset into training, validation, and test sets using stratified sampling.

&nbsp;&nbsp;&nbsp;• `train_test_split()` is used twice to create a 70%–15%–15% split.  

&nbsp;&nbsp;&nbsp;• First split:  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;70% → Training set  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;30% → Temporary (validation + test)

&nbsp;&nbsp;&nbsp;• Second split (`test_size=0.50` on the temporary set, so each half is 15% of the full dataset):  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;15% → Validation set  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;15% → Test set  

&nbsp;&nbsp;&nbsp;• `stratify=data["spam"]` keeps the ham/spam proportions the same in each split (the classes themselves remain imbalanced).  

&nbsp;&nbsp;&nbsp;• `random_state=42` ensures reproducibility of the split.  

&nbsp;&nbsp;&nbsp;• The splits are saved as:  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`train.csv`  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`validation.csv`  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`test.csv`  

This creates reproducible, stratified dataset splits for model training and evaluation.  
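
The split sizes can be worked out by hand. The sketch below assumes scikit-learn rounds the held-out portion up (`ceil(test_size * n_samples)`) and gives the remainder to the other part, which agrees with the row counts printed later in this notebook. The helper name `split_sizes` is hypothetical.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_first, n_second) for a float test_size, rounding the second part up."""
    n_second = math.ceil(test_size * n_samples)
    return n_samples - n_second, n_second


n_total = 5572                                  # rows reported by df.shape
n_train, n_temp = split_sizes(n_total, 0.30)    # first split: train vs temporary
n_val, n_test = split_sizes(n_temp, 0.50)       # second split: validation vs test

print(n_train, n_val, n_test)                   # 3900 836 836
print(round(n_train / n_total, 3),
      round(n_val / n_total, 3),
      round(n_test / n_total, 3))               # 0.7 0.15 0.15
```

The fractions are only approximately 70/15/15 because 5572 rows cannot be divided exactly in those proportions.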


In [None]:
from sklearn.model_selection import train_test_split

data = pd.read_csv("raw_data.csv")

train, val_test = train_test_split(
    data,
    test_size=0.30,
    random_state=42,
    stratify=data["spam"]
)

val, test = train_test_split(
    val_test,
    test_size=0.50,
    random_state=42,
    stratify=val_test["spam"]
)

train.to_csv("train.csv", index=False)
val.to_csv("validation.csv", index=False)
test.to_csv("test.csv", index=False)


## **Track Dataset Splits – Version 1 (Seed = 42)**

This block tracks the first version of the train/validation/test splits using DVC.

&nbsp;&nbsp;&nbsp;• `dvc add train.csv validation.csv test.csv`  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Creates `.dvc` files that track each dataset split.  

&nbsp;&nbsp;&nbsp;• `git add train.csv.dvc validation.csv.dvc test.csv.dvc .gitignore`  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Adds the three DVC metadata files and the updated `.gitignore` to Git (not the actual data files).  

&nbsp;&nbsp;&nbsp;• `git commit -m "Version 1 split (seed=42)"`  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Saves the first version of dataset splits in version control.  

&nbsp;&nbsp;&nbsp;• `dvc push`  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Uploads the tracked data files to the configured DVC remote (Google Drive storage).  

This completes tracking of the first dataset version generated using `random_state = 42`.  


In [None]:
!dvc add train.csv validation.csv test.csv
!git add train.csv.dvc validation.csv.dvc test.csv.dvc .gitignore
!git commit -m "Version 1 split (seed=42)"
!dvc push


Checking graph
Adding train.csv to cache:   0% 0/1 [00:00<?, ?file/s]
Adding...:  33% 1/3 [00:00<00:00,  3.12file/s]
Adding validation.csv to cache:   0% 0/1 [00:00<?, ?file/s]

# Version 2


## **Update Dataset Splits – Version 2 (Seed = 21)**

This block regenerates the train/validation/test splits using a different random seed.

&nbsp;&nbsp;&nbsp;• `random_state = 21`  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Changes which rows land in each split; the class proportions are preserved by stratification, not by the seed.  

&nbsp;&nbsp;&nbsp;• `stratify = data["spam"]`  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Ensures that the proportion of ham (0) and spam (1) messages remains consistent across splits.  

&nbsp;&nbsp;&nbsp;• Training Set → 70%  
&nbsp;&nbsp;&nbsp;• Validation Set → 15%  
&nbsp;&nbsp;&nbsp;• Test Set → 15%  

&nbsp;&nbsp;&nbsp;• `to_csv()`  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Overwrites the previous `train.csv`, `validation.csv`, and `test.csv` with the new split version.  

This creates a second dataset version with a different random split configuration.  
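
The effect of changing only the seed can be illustrated without scikit-learn. The following is a minimal sketch of stratified sampling using the standard library; it is not scikit-learn's implementation, and the function `stratified_split`, the helper `class_counts`, and the toy labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction, seed):
    """Return (train_idx, test_idx), sampling each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)                       # the seed decides the order
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


def class_counts(labels, indices):
    """Return (number of 0 labels, number of 1 labels) among the given indices."""
    return (sum(labels[i] == 0 for i in indices),
            sum(labels[i] == 1 for i in indices))


toy_labels = [0] * 87 + [1] * 13                   # toy data: 87 ham, 13 spam
train_a, test_a = stratified_split(toy_labels, 0.30, seed=42)
train_b, test_b = stratified_split(toy_labels, 0.30, seed=21)

print("class counts:", class_counts(toy_labels, train_a),
      class_counts(toy_labels, train_b))
print("same rows selected:", train_a == train_b)
```

In this sketch the per-class counts depend only on the class sizes and the fraction, so they are the same for both seeds, while the selected rows differ.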


In [None]:
train, val_test = train_test_split(
    data,
    test_size=0.30,
    random_state=21,
    stratify=data["spam"]
)

val, test = train_test_split(
    val_test,
    test_size=0.50,
    random_state=21,
    stratify=val_test["spam"]
)

train.to_csv("train.csv", index=False)
val.to_csv("validation.csv", index=False)
test.to_csv("test.csv", index=False)


## **Track Updated Splits – Version 2 (Seed = 21)**

This block tracks the updated dataset splits using DVC and Git.

&nbsp;&nbsp;&nbsp;• `dvc add train.csv validation.csv test.csv`  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Adds the updated split files to DVC for version control.  

&nbsp;&nbsp;&nbsp;• `git add train.csv.dvc validation.csv.dvc test.csv.dvc`  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Stages the new DVC tracking files for Git commit.  

&nbsp;&nbsp;&nbsp;• `git commit -m "Version 2 split (seed=21)"`  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Creates a new Git commit representing the second split version.  

&nbsp;&nbsp;&nbsp;• `dvc push`  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Pushes the updated data version to the configured remote storage (Google Drive).  

This completes versioning of the second dataset split configuration.  


In [None]:
!dvc add train.csv validation.csv test.csv
!git add train.csv.dvc validation.csv.dvc test.csv.dvc
!git commit -m "Version 2 split (seed=21)"
!dvc push


Checking graph
Adding train.csv to cache:   0% 0/1 [00:00<?, ?file/s]
Adding...:  33% 1/3 [00:00<00:00,  3.75file/s]
Adding validation.csv to cache:   0% 0/1 [00:00<?, ?file/s]

## **View Version History**

This block displays the commit history of the repository.

&nbsp;&nbsp;&nbsp;• `git log --oneline`  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Shows a compact list of the Git commits reachable from `HEAD`, newest first.  

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Each line contains:  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;– Short commit hash  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;– Commit message  

This helps identify the commits for the different dataset versions (e.g., Version 1 split with seed=42 and Version 2 split with seed=21) so they can be checked out later with `git checkout` followed by `dvc checkout`.
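
Looking up a commit by its message is less fragile than counting back with `HEAD~1`. The sketch below parses text in the `git log --oneline` format and finds the hash for a given version; the sample text is copied from the log output shown below (colour codes removed), and the helpers `parse_oneline_log` and `find_commit` are hypothetical.

```python
import re

# Sample text in `git log --oneline` format, taken from this notebook's output.
LOG_TEXT = """\
3470dc7 (HEAD -> master) Version 2 split (seed=21)
001e70a Version 1 split (seed=42)
e767d4d Track raw_data.csv
b41adf6 Configure local Drive remote
be09efb Initialize DVC
"""

# Short hash, optional "(refs)" decoration, then the commit message.
LINE_RE = re.compile(
    r"^(?P<hash>[0-9a-f]+)\s+(?:\((?P<refs>[^)]*)\)\s+)?(?P<message>.*)$"
)


def parse_oneline_log(text):
    """Return a list of (short_hash, message) tuples, newest first."""
    commits = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            commits.append((match["hash"], match["message"]))
    return commits


def find_commit(commits, needle):
    """Return the hash of the first commit whose message contains `needle`."""
    for commit_hash, message in commits:
        if needle in message:
            return commit_hash
    return None


commits = parse_oneline_log(LOG_TEXT)
print(find_commit(commits, "seed=42"))
print(find_commit(commits, "seed=21"))
```

The returned hash can then be passed to `git checkout` in place of a relative reference.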


In [None]:
!git log --oneline


3470dc7 (HEAD -> master) Version 2 split (seed=21)
001e70a Version 1 split (seed=42)
e767d4d Track raw_data.csv
b41adf6 Configure local Drive remote
be09efb Initialize DVC


## **Print Target Distribution**

This block defines a helper function to display the class distribution of the target variable (`spam`) in each dataset split.

&nbsp;&nbsp;&nbsp;• `print_distribution()`  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Iterates over the three split files:  

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;– `train.csv`  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;– `validation.csv`  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;– `test.csv`  

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;For each file:  

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;– Loads the dataset using Pandas  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;– Counts number of `0` values (Ham messages)  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;– Counts number of `1` values (Spam messages)  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;– Prints the distribution clearly  

This function is used after checking out each DVC version to verify that the class proportions are preserved across the dataset splits.


In [None]:
def print_distribution():
    for file in ["train.csv", "validation.csv", "test.csv"]:
        df = pd.read_csv(file)
        print(file)
        print("0:", (df["spam"] == 0).sum())
        print("1:", (df["spam"] == 1).sum())
        print()




## **Checkout Version 1 (Previous Split)**

This block checks out the earlier version of the dataset split using Git and DVC.

&nbsp;&nbsp;&nbsp;• `git checkout HEAD~1`  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Moves the repository to the previous commit (Version 1 split).  

&nbsp;&nbsp;&nbsp;• `dvc checkout`  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Restores the data files (`train.csv`, `validation.csv`, `test.csv`) corresponding to that specific Git commit.  

This ensures that the dataset files match the exact version tracked by DVC for the earlier random seed configuration.


In [None]:
# Checkout Version 1
!git checkout HEAD~1
!dvc checkout


Note: switching to 'HEAD~1'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at 001e70a Version 1 split (seed=42)
Building workspace index          |4.00 [00:00,  102entry/s]
Comparing indexes          |5.00 [00:00, 2.90kentry/s]
Applying changes          |3.00 [00:00,  21.2file/s]
M       test.csv
M       train.csv
M       validation.csv

## **Print Target Distribution – Version 1 Split**

This block prints the class distribution for the first data split (seed = 42).

&nbsp;&nbsp;&nbsp;• Displays the number of ham messages (0) in each file.  
&nbsp;&nbsp;&nbsp;• Displays the number of spam messages (1) in each file.  
&nbsp;&nbsp;&nbsp;• Applies to:
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;– `train.csv`  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;– `validation.csv`  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;– `test.csv`  

This confirms that stratified sampling preserved the class proportions in Version 1 of the dataset split: each file contains roughly 13.4% spam.


In [None]:
print("=== Version 1 Split ===")
print_distribution()

=== Version 1 Split ===
train.csv
0: 3377
1: 523

validation.csv
0: 724
1: 112

test.csv
0: 724
1: 112
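
Raw counts are easier to compare as proportions. The short sketch below recomputes the spam share for each file from the counts printed above; the helper name `spam_share` is hypothetical.

```python
# (ham, spam) counts copied from the Version 1 output above.
counts = {
    "train.csv": (3377, 523),
    "validation.csv": (724, 112),
    "test.csv": (724, 112),
}


def spam_share(ham, spam):
    """Fraction of messages labelled spam."""
    return spam / (ham + spam)


shares = {name: spam_share(*pair) for name, pair in counts.items()}
for name, share in shares.items():
    print(f"{name}: {share:.2%} spam")
```

All three files come out at about 13.4% spam, which is what stratification is meant to guarantee.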



## **Checkout Latest Version (Version 2 Split)**

This block restores the latest committed version of the dataset (seed = 21).

&nbsp;&nbsp;&nbsp;• `git checkout master` switches back to the latest commit.  
&nbsp;&nbsp;&nbsp;• `dvc checkout` restores the corresponding tracked data files.  

This ensures that:

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;– `train.csv`  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;– `validation.csv`  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;– `test.csv`  

match the updated split version stored in DVC.


In [None]:
!git checkout master
!dvc checkout


Previous HEAD position was 001e70a Version 1 split (seed=42)
Switched to branch 'master'
Building workspace index          |4.00 [00:00,  108entry/s]
Comparing indexes          |5.00 [00:00, 2.13kentry/s]
Applying changes          |3.00 [00:00,  25.6file/s]
M       test.csv
M       train.csv
M       validation.csv

## **Print Distribution – Version 2 Split**

This block prints the class distribution for the updated dataset split (seed = 21).

&nbsp;&nbsp;&nbsp;• Calls the `print_distribution()` function.  
&nbsp;&nbsp;&nbsp;• Displays the number of:

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;– 0s (Ham messages)  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;– 1s (Spam messages)  

for each file:

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;– `train.csv`  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;– `validation.csv`  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;– `test.csv`  

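This confirms that:

&nbsp;&nbsp;&nbsp;• Stratified sampling is preserved.  
&nbsp;&nbsp;&nbsp;• The class counts are identical to Version 1, which is expected: with the same split sizes, stratification fixes the per-class counts regardless of the seed.  
&nbsp;&nbsp;&nbsp;• Matching counts alone cannot show that DVC switched versions; the content comparisons in the following cells do that.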


In [None]:
print("=== Version 2 Split ===")
print_distribution()


=== Version 2 Split ===
train.csv
0: 3377
1: 523

validation.csv
0: 724
1: 112

test.csv
0: 724
1: 112



## **Checkout Version 1 and Inspect Data**

This block switches to the first version of the dataset split (seed = 42) and inspects the training data.

&nbsp;&nbsp;&nbsp;• `git checkout HEAD~1`  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;– Moves the repository to the previous commit (Version 1 split).

&nbsp;&nbsp;&nbsp;• `dvc checkout`  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;– Restores the corresponding data files (`train.csv`, `validation.csv`, `test.csv`) tracked by DVC.

&nbsp;&nbsp;&nbsp;• Loads `train.csv` into a DataFrame.

&nbsp;&nbsp;&nbsp;• Prints the first SMS message in the training set.

&nbsp;&nbsp;&nbsp;• Stores the first SMS in a variable (`first_sms_v1`) for comparison with the updated split.

This confirms that:

&nbsp;&nbsp;&nbsp;• DVC successfully restores the earlier data version.  
&nbsp;&nbsp;&nbsp;• The dataset content changes when switching between split versions.


In [None]:
!git checkout HEAD~1
!dvc checkout

v1_train = pd.read_csv("train.csv")
print("VERSION 1 (seed=42) first SMS:")
print(v1_train.iloc[0]["text"])
first_sms_v1 = v1_train.iloc[0]["text"]



Note: switching to 'HEAD~1'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at 001e70a Version 1 split (seed=42)
Building workspace index          |4.00 [00:00,  215entry/s]
Comparing indexes          |5.00 [00:00, 2.69kentry/s]
Applying changes          |3.00 [00:00,  19.3file/s]
M       test.csv
M       train.csv
M       validation.csv
VERSION 1 (seed=42) first SMS:
Goal! Arsenal 4 (Henry, 7 v Liverpool 2 Henry scores with a simple shot from 6 yards from a pass by Bergkamp to give

## **Checkout Version 2 and Compare Data**

This block switches back to the updated dataset split (seed = 21) and verifies the change in training data.

&nbsp;&nbsp;&nbsp;• `git checkout master`  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;– Returns the repository to the latest commit (Version 2 split).

&nbsp;&nbsp;&nbsp;• `dvc checkout`  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;– Restores the corresponding DVC-tracked data files for Version 2.

&nbsp;&nbsp;&nbsp;• Loads the updated `train.csv` into a DataFrame.

&nbsp;&nbsp;&nbsp;• Prints the first SMS message in the training set.

&nbsp;&nbsp;&nbsp;• Stores the first SMS in a variable (`first_sms_v2`) for comparison with Version 1.

This demonstrates that:

&nbsp;&nbsp;&nbsp;• Changing the random seed produces a different train/validation/test split.  
&nbsp;&nbsp;&nbsp;• DVC successfully manages and restores multiple data versions.  
&nbsp;&nbsp;&nbsp;• Each Git commit, together with its `.dvc` files, pins the exact split files, so either version can be restored on demand.


In [None]:
!git checkout master
!dvc checkout

v2_train = pd.read_csv("train.csv")
print("VERSION 2 (seed=21) first SMS:")
print(v2_train.iloc[0]["text"])
first_sms_v2 = v2_train.iloc[0]["text"]


Previous HEAD position was 001e70a Version 1 split (seed=42)
Switched to branch 'master'
Building workspace index          |4.00 [00:00,  217entry/s]
Comparing indexes          |5.00 [00:00, 2.67kentry/s]
Applying changes          |3.00 [00:00,  21.3file/s]
M       test.csv
M       train.csv
M       validation.csv
VERSION 2 (seed=21) first SMS:
YOUR CHANCE TO BE ON A REALITY FANTASY SHOW call now = 08707509020 Just 20p per min NTT Ltd, PO Box 1327 Croydon CR9 5WB 0870 is a national = rate call.


## **Compare Version 1 and Version 2 Training Sets**

This block checks whether the training datasets from Version 1 (seed = 42) and Version 2 (seed = 21) are identical.

&nbsp;&nbsp;&nbsp;• `v1_train.equals(v2_train)`  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;– Compares both DataFrames by shape and element-by-element, position by position.  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;– Returns `True` only if all rows and values are exactly the same, in the same order.  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;– Returns `False` if any difference exists.

Output Observed:

&nbsp;&nbsp;&nbsp;Are train sets identical? **False**

Interpretation:

&nbsp;&nbsp;&nbsp;• The two training splits are **not identical**.  
&nbsp;&nbsp;&nbsp;• Changing the random seed (42 → 21) produced a different stratified split.  
&nbsp;&nbsp;&nbsp;• DVC restored a different `train.csv` for each commit, so both versions are tracked separately.  
&nbsp;&nbsp;&nbsp;• Because `equals` is order-sensitive, `False` shows only that the two tables differ, not how much they overlap.

This verification step shows that the dataset versions are distinct and traceable, and that each can be restored with Git and DVC.


In [None]:
print("Are train sets identical?",
      v1_train.equals(v2_train))


Are train sets identical? False
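
Since `equals` only answers whether two tables match exactly, a complementary check is to compare the messages as sets and count how many are shared. The sketch below does this on made-up lists; the helper `overlap_summary` and the toy messages are hypothetical stand-ins for `v1_train["text"]` and `v2_train["text"]`.

```python
def overlap_summary(texts_a, texts_b):
    """Compare two collections of messages as sets, ignoring order and duplicates."""
    set_a, set_b = set(texts_a), set(texts_b)
    return {
        "shared": len(set_a & set_b),
        "only_a": len(set_a - set_b),
        "only_b": len(set_b - set_a),
    }


# Toy data: two "training sets" drawn from the same pool of messages.
toy_v1 = ["msg A", "msg B", "msg C", "msg D"]
toy_v2 = ["msg D", "msg A", "msg E", "msg F"]

summary = overlap_summary(toy_v1, toy_v2)
print(summary)
```

With the notebook's variables this would be called as `overlap_summary(v1_train["text"], v2_train["text"])`. Because both splits draw 70% of the same rows, a substantial shared count is expected even though the two tables are not equal.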


## **Compare First SMS Across Versions**

This block compares the first SMS message in the training set from Version 1 (seed = 42) and Version 2 (seed = 21).

&nbsp;&nbsp;&nbsp;• `first_sms_v1 == first_sms_v2`  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;– Compares the first row text from both versions.  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;– Returns `True` if both messages are exactly the same.  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;– Returns `False` if they differ.

Output Observed:

&nbsp;&nbsp;&nbsp;Are first SMS identical? **False**

Interpretation:

&nbsp;&nbsp;&nbsp;• The first SMS in Version 1 is different from the first SMS in Version 2.  
&nbsp;&nbsp;&nbsp;• This is consistent with the seed change having altered the training set, although a single-row comparison cannot show how much the composition differs.  
&nbsp;&nbsp;&nbsp;• DVC restored two distinct dataset versions.  
&nbsp;&nbsp;&nbsp;• Each split version can be recovered from its Git commit and the DVC remote.

This further validates that the data versioning workflow is functioning correctly.


In [None]:
print("\nAre first SMS identical?")
print(first_sms_v1 == first_sms_v2)



Are first SMS identical?
False
