# DVC Quickstart Assignment

Welcome! In this notebook, you'll get hands-on experience with DVC, a tool for versioning datasets and models.

**Objective:** Track a dataset and a model file using DVC in under 10 minutes.

In [None]:
!pip install dvc

## 1. Initialize DVC
Let's initialize DVC in your project folder. This creates the necessary configuration files. Before you start, comment out the DVC-related entries in `.gitignore`.

In [None]:
# Initialize DVC
!dvc init
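
As a quick sanity check after initialization, the sketch below tests for the `.dvc/` directory that `dvc init` creates. The helper name `dvc_initialized` is hypothetical and only for illustration; it is not part of DVC.

```python
from pathlib import Path


def dvc_initialized(root: str = ".") -> bool:
    """Return True if `root` contains a `.dvc/` directory (hypothetical helper)."""
    # `dvc init` creates a `.dvc/` directory in the project root.
    return (Path(root) / ".dvc").is_dir()


print("DVC initialized:", dvc_initialized())
```

If this prints `False`, a common cause is that the folder is not a Git repository; `dvc init` expects one unless it is run with `--no-scm`.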

In [None]:
# Create a 100 MB file of zeros to stand in for a large dataset
!dd if=/dev/zero of=large_dataset.csv bs=1M count=100

## 2. Track a Dataset with DVC
Let's add a sample dataset to DVC tracking.

In [None]:
import pandas as pd
import numpy as np

n_rows = 2500000

print(f"Creating a dataset with {n_rows} rows.")

# Generate the synthetic data
# np.random.seed(42)  # Uncomment for reproducibility
df = pd.DataFrame({
    'feature1': np.random.randint(0, 1000, size=n_rows),
    'feature2': np.random.randn(n_rows),           # float64
    'feature3': np.random.choice(['A', 'B', 'C', 'D'], size=n_rows),
    'feature4': np.random.uniform(0, 1, size=n_rows),
    'feature5': np.random.exponential(1.0, size=n_rows)
})

# Save as CSV
df.to_csv('sample_dataset.csv', index=False)

# Check the file size
import os
size_mb = os.path.getsize('sample_dataset.csv') / (1024 * 1024)
print(f"File created: sample_dataset.csv ({size_mb:.1f} MB)")

We don't want to push a file this large (about 150 MB) to Git. DVC solves this by storing the data outside Git and committing only a small pointer file.

In [None]:
# Track the dataset file with DVC
!dvc add sample_dataset.csv
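
`dvc add` writes a small pointer file, `sample_dataset.csv.dvc`, that records an MD5 hash of the data along with its size and path, and it adds the data file to `.gitignore`. The sketch below illustrates the hashing idea only. `file_md5` is a hypothetical helper, not part of DVC, and the value DVC records may not match it exactly across DVC versions.

```python
import hashlib
import os


def file_md5(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks (hypothetical helper)."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large files never need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Only hash the dataset if the earlier cell has already created it.
if os.path.exists("sample_dataset.csv"):
    print(file_md5("sample_dataset.csv"))
```

You can compare the printed value with the `md5` field inside `sample_dataset.csv.dvc`. If the file changes, the hash changes, which is how DVC detects modified data.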

## 3. Track a Model File with DVC
Let's create a simple model file and add it to DVC tracking.

In [None]:
# Create a placeholder model file (dummy bytes, not a real pickle)
with open('model.pkl', 'wb') as f:
    f.write(b'Model binary content')
print('model.pkl created')

In [None]:
# Track the model file with DVC
!dvc add model.pkl

## 4. Check DVC Status
Let's check the status of tracked files.

In [None]:
# Check DVC status for tracked files
!dvc status
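
Once the status is clean, the files that belong in Git are the `.dvc` pointer files rather than the data itself. As a rough sketch, the hypothetical helper `dvc_pointer_files` below lists them in the current folder; it is plain Python and not a DVC API.

```python
from pathlib import Path


def dvc_pointer_files(root: str = ".") -> list:
    """List `.dvc` pointer files directly under `root` (hypothetical helper)."""
    # The is_file() check skips the `.dvc/` configuration directory,
    # which also matches the "*.dvc" pattern.
    return sorted(p.name for p in Path(root).glob("*.dvc") if p.is_file())


print(dvc_pointer_files())
```

After the steps in this notebook, the list would be expected to include `sample_dataset.csv.dvc` and `model.pkl.dvc`.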

End of the quickstart: uncomment the DVC entries in `.gitignore`.