# DVC Quickstart Assignment

Welcome! In this notebook, you'll get hands-on experience with DVC, a tool for versioning datasets and models.

**Objective:** Track a dataset and a model file using DVC in under 10 minutes.

## 1. Initialize DVC
Let's initialize DVC in your project folder. This will create the necessary configuration files. Before running the cell, comment out the DVC-related entries in `.gitignore`.

In [None]:
# Initialize DVC
!dvc init
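
To confirm that initialization worked, you can look for the files `dvc init` creates. The sketch below is a minimal check, assuming the default layout (a `.dvc/` directory containing a `config` file). `looks_like_dvc_repo` is a hypothetical helper name, not part of DVC.

```python
import os

def looks_like_dvc_repo(project_dir="."):
    """Return True if project_dir contains a .dvc/config file (hypothetical helper)."""
    return os.path.isfile(os.path.join(project_dir, ".dvc", "config"))

# Prints True once `dvc init` has been run in the current folder
print(looks_like_dvc_repo("."))
```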

## 2. Track a Dataset with DVC
Let's add a sample dataset to DVC tracking.

In [None]:
import pandas as pd
import numpy as np

n_rows = 2_000

print(f"Creating a dataset of {n_rows} rows.")

# Generate the synthetic data
# np.random.seed(42)  # For reproducibility
df = pd.DataFrame({
    'feature1': np.random.randint(0, 1000, size=n_rows),
    'feature2': np.random.randn(n_rows),           # float64
    'feature3': np.random.choice(['A', 'B', 'C', 'D'], size=n_rows),
    'feature4': np.random.uniform(0, 1, size=n_rows),
    'feature5': np.random.exponential(1.0, size=n_rows)
})

# Save as CSV
df.to_csv('sample_dataset.csv', index=False)

# Check the file size
import os
size_mb = os.path.getsize('sample_dataset.csv') / (1024 * 1024)
print(f"File created: sample_dataset.csv ({size_mb:.2f} MB)")

We don't want to push data files that are too large to Git. The sample above is small, but a larger file (for example ~150 MB) would be rejected by GitHub. DVC solves this problem.
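
A quick way to see whether a file would hit such a limit is to compare its size against a threshold before committing. The sketch below is only an illustration. `exceeds_limit` and `GITHUB_LIMIT_MB` are hypothetical names, and it assumes the limit is counted in units of 1024 * 1024 bytes, as in the size check earlier in this notebook.

```python
import os

GITHUB_LIMIT_MB = 100  # assumed threshold, taken from GitHub's 100 MB file size limit

def exceeds_limit(path, limit_mb=GITHUB_LIMIT_MB):
    """Return True if the file at `path` is larger than `limit_mb` (hypothetical helper)."""
    size_mb = os.path.getsize(path) / (1024 * 1024)
    return size_mb > limit_mb

# Only run the check if the sample file exists in the current folder
if os.path.exists('sample_dataset.csv'):
    print(exceeds_limit('sample_dataset.csv'))
```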

Example error after pushing a file that is too large:
"remote: error: File large_dataset.csv is 159.77 MB; this exceeds GitHub's file size limit of 100.00 MB"

In [None]:
# Track the dataset file with DVC
!dvc add sample_dataset.csv

In [None]:
!dvc diff

In [None]:
# Modify the dataset (rename a column) so that DVC has a change to track
df.rename(columns={'feature3': 'feature7'}, inplace=True)
df.to_csv('sample_dataset.csv', index=False)

In [None]:
!dvc add sample_dataset.csv
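
One way to picture why re-running `dvc add` records a new version: DVC identifies file contents by a hash, so any change to the bytes gives a different value. The sketch below illustrates the idea with Python's `hashlib`. It is not DVC's own code, `file_md5` is a hypothetical helper, and it assumes an MD5 digest of the raw file contents. It works on a throwaway file so your dataset is untouched.

```python
import hashlib
import os
import tempfile

def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, read in chunks (hypothetical helper)."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")

    # First version of the file
    with open(demo_path, "w") as f:
        f.write("feature1,feature3\n1,A\n")
    hash_before = file_md5(demo_path)

    # Renaming a column changes the bytes, and therefore the hash
    with open(demo_path, "w") as f:
        f.write("feature1,feature7\n1,A\n")
    hash_after = file_md5(demo_path)

print(hash_before != hash_after)
```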

## 3. Track a Model File with DVC
Let's create a simple model file and add it to DVC tracking.

In [None]:
# Create a simple model file
with open('model.pkl', 'wb') as f:
    f.write(b'Model binary content')
print('model.pkl created')

In [None]:
# Track the model file with DVC
!dvc add model.pkl

## 4. Check DVC Status
Let's check the status of tracked files.

In [None]:
# Check DVC status for tracked files
!dvc status

End of this quickstart: uncomment the DVC entries in `.gitignore`.

## 5. Modify the Dataset and Track Changes
Let's make another modification to the dataset and see how DVC detects it.

In [None]:
# Read the current dataset
df = pd.read_csv('sample_dataset.csv')
print(f"Current dataset shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")

In [None]:
# Modify the dataframe - add a new column and filter some rows
df['feature6'] = df['feature1'] * df['feature4']  # New computed column
df = df[df['feature1'] > 200]  # Filter rows where feature1 > 200

print(f"Modified dataset shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")

# Save the modified dataset
df.to_csv('sample_dataset.csv', index=False)
print("Dataset modified and saved!")

In [None]:
# Check what DVC detected as changed
!dvc diff

In [None]:
# Track the new version with DVC
!dvc add sample_dataset.csv

## 6. Commit Changes to Git
Now let's commit the DVC metadata files to Git. The actual data stays in the DVC cache.

In [None]:
# Check git status to see what files changed
!git status

In [None]:
# Stage the DVC files (not the actual data!)
!git add sample_dataset.csv.dvc .gitignore
!git status

In [None]:
# Commit the changes
!git commit -m "Update dataset: added feature6 column and filtered rows"

### Optional: Push to Git Remote
If you have a remote repository configured, you can push your changes:
```bash
# Push to remote repository
git push origin main
```

**Note:** The actual data file (`sample_dataset.csv`) is NOT pushed to Git because it's in `.gitignore`. Only the metadata file (`sample_dataset.csv.dvc`) is pushed, which contains the hash of the data version.
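
To check this in your own project, you can look for the data file in `.gitignore`. The sketch below is a simplified check, assuming `dvc add` wrote the entry as a plain line such as `/sample_dataset.csv`. `is_listed_in_gitignore` is a hypothetical helper and does not implement full gitignore pattern matching.

```python
import os

def is_listed_in_gitignore(gitignore_text, filename):
    """Return True if `filename` appears as a literal entry (hypothetical helper)."""
    entries = {line.strip() for line in gitignore_text.splitlines()}
    return filename in entries or f"/{filename}" in entries

# Only run the check if a .gitignore exists in the current folder
if os.path.exists(".gitignore"):
    with open(".gitignore") as f:
        print(is_listed_in_gitignore(f.read(), "sample_dataset.csv"))
```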