[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gouthamgo/FineTuning/blob/main/lessons/module1_foundations/03_understanding_data.ipynb)

# 📊 Understanding Your Data (The Secret Sauce)

**Duration:** 1 hour  
**Level:** Beginner  
**Prerequisites:** Module 1, Lessons 1-2

---

## Hey Friend! Let's Talk About Data 🍳

Okay, so here's the thing about AI that nobody tells beginners:

**Your data matters WAYYYY more than your model.**

Seriously. I'm not kidding. Let me explain with an example:

Imagine you want to become a great chef. You have two options:

**Option A:** World's best kitchen + Rotten vegetables  
**Option B:** Okay kitchen + Fresh, quality ingredients

Which one makes better food? **Option B, every single time!**

That's EXACTLY how AI works:
- Your **model** = the kitchen
- Your **data** = the ingredients

Bad data = bad AI. Period.

So in this lesson, we're going to learn how to:
1. Look at your data (actually SEE what's in there)
2. Clean your data (remove the rotten stuff)
3. Prepare your data (chop it up the right way)
4. Split your data (for training and testing)

Let's do this! 🚀

## Step 1: Install What We Need

Same as before, we need the Hugging Face tools, plus a few helpers for analysis, plotting, and splitting:

In [None]:
!pip install -q datasets transformers pandas matplotlib seaborn scikit-learn

## Step 2: Load Some Real Data

Let's use the IMDB movie reviews dataset. It's perfect for learning because:
- It's free
- It's real data (actual movie reviews)
- It's labeled (we know which reviews are positive/negative)
- It's messy (just like real-world data!)

Let me show you:

In [None]:
from datasets import load_dataset
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Set up pretty plots
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("📥 Loading IMDB dataset...")
dataset = load_dataset('imdb')
print("✅ Done!\n")

# Let's see what we got
print("Here's what the dataset looks like:")
print(dataset)

## 🤔 What Are We Looking At?

You should see something like:
```
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})
```

Let me break this down:

- **train:** 25,000 reviews for training our model
- **test:** 25,000 reviews for testing how well it learned
- **unsupervised:** 50,000 extra reviews with no sentiment label (we won't use these in this lesson)
- **text:** The actual movie review (the words)
- **label:** 0 = negative review, 1 = positive review

Cool, right? Now let's actually LOOK at some examples:

In [None]:
# Let's see some actual reviews!
train_data = dataset['train']

# A split can be sorted by label (IMDB's train split starts with negative reviews),
# so find the first index of each label instead of scanning only the first 10 rows.
labels = list(train_data['label'])

print("\n" + "="*80)
print("🎬 EXAMPLE 1: A POSITIVE Review (label = 1)")
print("="*80)

# Find a positive review
pos_idx = labels.index(1)
print(f"\n{train_data[pos_idx]['text'][:500]}...")  # First 500 characters

print("\n" + "="*80)
print("🎬 EXAMPLE 2: A NEGATIVE Review (label = 0)")
print("="*80)

# Find a negative review
neg_idx = labels.index(0)
print(f"\n{train_data[neg_idx]['text'][:500]}...")  # First 500 characters

## 📊 Step 3: Explore Your Data (Like a Detective!)

Before we do ANYTHING with data, we need to understand it. Here's what we want to know:

1. **How much data do we have?** (More = better, usually)
2. **Is it balanced?** (Equal positive/negative reviews?)
3. **How long are the texts?** (Super important!)
4. **Any weird stuff?** (Missing data, HTML tags, etc.)

Let's investigate:

In [None]:
# Convert to pandas for easier analysis
df = pd.DataFrame(train_data)

print("📈 DATASET STATISTICS\n" + "="*50)
print(f"\n📦 Total examples: {len(df):,}")
print(f"\n😊 Positive reviews: {(df['label'] == 1).sum():,}")
print(f"😞 Negative reviews: {(df['label'] == 0).sum():,}")
print(f"\n⚖️ Balance: {(df['label'] == 1).sum() / len(df) * 100:.1f}% positive")

# Add text length as a new column
df['text_length'] = df['text'].apply(len)
df['word_count'] = df['text'].apply(lambda x: len(x.split()))

print(f"\n📝 Average review length: {df['text_length'].mean():.0f} characters")
print(f"📝 Average word count: {df['word_count'].mean():.0f} words")
print(f"\n📏 Shortest review: {df['text_length'].min()} characters")
print(f"📏 Longest review: {df['text_length'].max()} characters")
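
An average can hide a long tail, and that tail matters later when we pick a maximum length for tokenization. Here's a rough sketch of how percentiles show it, using only the standard library. The list `word_counts` is a made-up stand-in for `df['word_count']`, not real IMDB numbers:

```python
import statistics

# Hypothetical word counts standing in for df['word_count'] (not real IMDB numbers)
word_counts = [120, 95, 230, 180, 1500, 210, 160, 140, 300, 175]

# quantiles(n=100) returns the 99 cut points for percentiles 1..99
cuts = statistics.quantiles(word_counts, n=100, method='inclusive')
median, p90, p95 = cuts[49], cuts[89], cuts[94]

print(f"Mean:   {statistics.mean(word_counts):.0f} words")
print(f"Median: {median:.0f} words")
print(f"90th percentile: {p90:.0f} words")
print(f"95th percentile: {p95:.0f} words")
```

One very long review drags the mean well above the median here. On the real DataFrame, `df['word_count'].quantile([0.5, 0.9, 0.95])` should give you the same kind of summary.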

## 🎨 Let's Visualize This!

Numbers are cool, but pictures are better. Let's make some charts:

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Chart 1: Positive vs Negative
# sort_index() puts label 0 first, so the counts line up with the bar names below
label_counts = df['label'].value_counts().sort_index()
axes[0, 0].bar(['Negative', 'Positive'], label_counts.values, color=['#ff6b6b', '#51cf66'])
# No emoji in plot titles: matplotlib's default font can't render them
axes[0, 0].set_title('Review Distribution', fontsize=14, fontweight='bold')
axes[0, 0].set_ylabel('Count')
for i, v in enumerate(label_counts.values):
    axes[0, 0].text(i, v, f'{v:,}', ha='center', va='bottom', fontweight='bold')

# Chart 2: Text length distribution
axes[0, 1].hist(df['text_length'], bins=50, color='#4dabf7', edgecolor='black')
axes[0, 1].set_title('Review Length Distribution', fontsize=14, fontweight='bold')
axes[0, 1].set_xlabel('Characters')
axes[0, 1].set_ylabel('Count')
axes[0, 1].axvline(df['text_length'].mean(), color='red', linestyle='--', label='Average')
axes[0, 1].legend()

# Chart 3: Word count distribution
axes[1, 0].hist(df['word_count'], bins=50, color='#ff922b', edgecolor='black')
axes[1, 0].set_title('Word Count Distribution', fontsize=14, fontweight='bold')
axes[1, 0].set_xlabel('Words')
axes[1, 0].set_ylabel('Count')
axes[1, 0].axvline(df['word_count'].mean(), color='red', linestyle='--', label='Average')
axes[1, 0].legend()

# Chart 4: Length comparison by sentiment
df.boxplot(column='word_count', by='label', ax=axes[1, 1])
axes[1, 1].set_title('Word Count by Sentiment', fontsize=14, fontweight='bold')
axes[1, 1].set_xlabel('Label (0=Negative, 1=Positive)')
axes[1, 1].set_ylabel('Word Count')
plt.suptitle('')  # Remove default title

plt.tight_layout()
plt.show()

print("\n💡 What do these charts tell us?")
print("   1. The data is perfectly balanced (50/50 positive/negative)")
print("   2. Most reviews are between 100-400 words")
print("   3. Some reviews are SUPER long (outliers)")
print("   4. Positive and negative reviews have similar lengths")

## 🧹 Step 4: Clean Your Data

Real-world data is MESSY. Like, really messy. Let's look for problems:

In [None]:
print("🔍 Looking for data quality issues...\n")

# Check for missing data
print("1. Missing values:")
missing = df.isnull().sum()
print(missing)
if missing.sum() == 0:
    print("✅ No missing data! Nice.\n")

# Check for HTML tags (common in web-scraped data)
html_count = df['text'].str.contains('<br', regex=False).sum()
print(f"2. Reviews with HTML tags: {html_count:,}")
if html_count > 0:
    print("⚠️ We found HTML! Let's look at an example:")
    for text in df['text']:
        if '<br' in text:
            print(f"\n{text[:300]}...")
            break

# Check for duplicates
duplicates = df.duplicated(subset=['text']).sum()
print(f"\n3. Duplicate reviews: {duplicates}")

# Check for very short reviews (might be useless)
very_short = (df['word_count'] < 10).sum()
print(f"\n4. Very short reviews (<10 words): {very_short}")
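
Counting duplicates is a good start, but it also helps to know whether the copies agree on their label. Here's a small standard-library sketch of that idea. The `rows` list is hypothetical toy data standing in for the real DataFrame:

```python
from collections import defaultdict

# Hypothetical (text, label) rows standing in for the real DataFrame
rows = [
    ("Great movie!", 1),
    ("Terrible plot.", 0),
    ("Great movie!", 1),   # exact duplicate, same label
    ("It was fine.", 1),
    ("It was fine.", 0),   # duplicate text, conflicting label
]

# Group the labels seen for each distinct text
labels_by_text = defaultdict(set)
for text, label in rows:
    labels_by_text[text].add(label)

# Texts whose copies disagree on the label are worth inspecting by hand
conflicts = [text for text, labels in labels_by_text.items() if len(labels) > 1]

# Keep the first occurrence of each text, preserving order
seen = set()
deduped = []
for text, label in rows:
    if text not in seen:
        seen.add(text)
        deduped.append((text, label))

print(f"Conflicting duplicates: {conflicts}")
print(f"Rows before: {len(rows)}, after: {len(deduped)}")
```

On the DataFrame itself, `df.drop_duplicates(subset=['text'])` does the same keep-the-first-copy step in one call.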

## 🧼 Let's Clean That HTML!

Those `<br />` tags are HTML line breaks. They don't help our model learn. Let's remove them:

In [None]:
import re

def clean_text(text):
    """Remove HTML tags and extra whitespace"""
    # Remove HTML tags
    text = re.sub(r'<[^>]+>', ' ', text)
    # Remove extra whitespace
    text = ' '.join(text.split())
    return text

# Let's test it!
print("🧪 Testing our cleaning function:\n")
print("BEFORE:")
dirty_text = "This movie was great!<br /><br />I loved it so much."
print(dirty_text)

print("\nAFTER:")
clean = clean_text(dirty_text)
print(clean)

print("\n✅ Perfect! Now let's clean the whole dataset...")

# Clean all texts
df['clean_text'] = df['text'].apply(clean_text)

print("✅ Done! Let's compare:")
print("\nORIGINAL:")
print(df['text'].iloc[0][:200])
print("\nCLEANED:")
print(df['clean_text'].iloc[0][:200])
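
Web-scraped text sometimes also carries HTML entities like `&amp;` or `&quot;`. I haven't checked how common they are in IMDB, so treat this as a possible extension and not a required step. The function name `clean_text_with_entities` is my own, and it only uses the standard library:

```python
import html
import re

def clean_text_with_entities(text):
    """Hypothetical variant of clean_text that also decodes HTML entities."""
    # Remove HTML tags first, so a decoded "&lt;" is not mistaken for a tag
    text = re.sub(r'<[^>]+>', ' ', text)
    # Decode entities like &amp; and &quot; into the characters they stand for
    text = html.unescape(text)
    # Collapse runs of whitespace into single spaces
    return ' '.join(text.split())

sample = "Tom &amp; Jerry was &quot;great&quot;!<br /><br />Loved it."
print(clean_text_with_entities(sample))
```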

## ✂️ Step 5: Split Your Data Properly

This is CRUCIAL. Listen carefully:

**You CANNOT test your model on the same data you trained it on!**

Why? Imagine studying for an exam:
- If you memorize the exact questions, you'll get 100%
- But you didn't actually LEARN anything
- You just memorized!

Same with AI. We need:
1. **Training data** - Model learns from this
2. **Validation data** - We check progress during training
3. **Test data** - Final exam (model never sees this during training!)

Typical split: 80% train, 10% validation, 10% test

Good news: the IMDB dataset already comes split into train and test for us! It has no validation split, though, so let's learn how to make one:

In [None]:
from sklearn.model_selection import train_test_split

# Let's split our training data into train + validation
train_df = df.copy()

# Split: 90% train, 10% validation
train_subset, val_subset = train_test_split(
    train_df, 
    test_size=0.1,  # 10% for validation
    random_state=42,  # For reproducibility
    stratify=train_df['label']  # Keep same positive/negative ratio
)

print("✂️ Data Split Results:\n" + "="*50)
print(f"\n📚 Training set: {len(train_subset):,} examples")
print(f"   Positive: {(train_subset['label'] == 1).sum():,} ({(train_subset['label'] == 1).sum()/len(train_subset)*100:.1f}%)")
print(f"   Negative: {(train_subset['label'] == 0).sum():,} ({(train_subset['label'] == 0).sum()/len(train_subset)*100:.1f}%)")

print(f"\n🔍 Validation set: {len(val_subset):,} examples")
print(f"   Positive: {(val_subset['label'] == 1).sum():,} ({(val_subset['label'] == 1).sum()/len(val_subset)*100:.1f}%)")
print(f"   Negative: {(val_subset['label'] == 0).sum():,} ({(val_subset['label'] == 0).sum()/len(val_subset)*100:.1f}%)")

print("\n✅ Notice how the percentages are the same? That's what 'stratify' does!")
print("   It keeps the class balance consistent across splits.")
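
If you're curious what stratifying means in practice, here's a plain-Python sketch of the idea: split each label's rows separately, then combine. It's an illustration of the concept on hypothetical toy data, not how scikit-learn implements it. The function `stratified_split` and the 80/10/10 fractions are my own choices for the example:

```python
import random
from collections import defaultdict

def stratified_split(rows, fractions=(0.8, 0.1, 0.1), seed=42):
    """Split (text, label) rows into train/val/test, one label at a time."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[1]].append(row)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, val, test = [], [], []
    for label_rows in by_label.values():
        rng.shuffle(label_rows)
        n_train = int(len(label_rows) * fractions[0])
        n_val = int(len(label_rows) * fractions[1])
        train += label_rows[:n_train]
        val += label_rows[n_train:n_train + n_val]
        test += label_rows[n_train + n_val:]  # whatever is left goes to test
    return train, val, test

# Hypothetical toy data: 100 positive and 100 negative "reviews"
rows = [(f"review {i}", i % 2) for i in range(200)]
train, val, test = stratified_split(rows)

for name, part in [("train", train), ("val", val), ("test", test)]:
    positives = sum(label for _, label in part)
    print(f"{name}: {len(part)} rows, {positives / len(part):.0%} positive")

# Sanity check: no review should appear in more than one split
train_texts = {text for text, _ in train}
assert train_texts.isdisjoint(text for text, _ in val)
assert train_texts.isdisjoint(text for text, _ in test)
```

With scikit-learn, the usual way to get the same three-way split is to call `train_test_split` twice: once to carve off the test set, then again on the remainder for validation.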

## 🎯 Step 6: Prepare Data for Your Model

AI models don't understand words. They only understand numbers!

So we need to convert:
- "This movie is great!" → [2023, 3185, 2003, 2307, 999] (the exact numbers depend on the tokenizer)

This is called **tokenization**. Let's do it:

In [None]:
from transformers import AutoTokenizer

# Load a tokenizer (we'll use BERT's)
print("📥 Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
print("✅ Done!\n")

# Let's tokenize a sample text
sample_text = "This movie is absolutely fantastic! I loved it!"

print("🔤 Original text:")
print(f'   "{sample_text}"')

# Tokenize it
tokens = tokenizer.tokenize(sample_text)
print("\n🔢 Tokens (words/subwords):")
print(f"   {tokens}")

# Convert to IDs
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print("\n🔢 Token IDs (numbers the model uses):")
print(f"   {token_ids}")

# The easy way (does everything at once, and also adds the special [CLS] and [SEP] tokens)
# Note: return_tensors='pt' needs PyTorch installed
encoded = tokenizer(sample_text, padding=True, truncation=True, return_tensors='pt')
print("\n📦 Full encoding:")
print(encoded)
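
Padding and truncation only become visible once you have several texts of different lengths. Here's a toy sketch of what they do, written with plain lists so you can see every step. The function `pad_and_truncate`, the token IDs, and `pad_id=0` are all made up for illustration, and unlike a real tokenizer this toy ignores special tokens:

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Toy version of padding + truncation on lists of token IDs."""
    input_ids, attention_mask = [], []
    for ids in batch:
        ids = ids[:max_length]                    # truncation: cut long sequences
        n_pad = max_length - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)  # padding: fill short sequences
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical token IDs (made up for illustration, not real BERT output)
batch = [[11, 12, 13], [21, 22, 23, 24, 25, 26, 27]]
encoded_toy = pad_and_truncate(batch, max_length=5)
print(encoded_toy["input_ids"])       # [[11, 12, 13, 0, 0], [21, 22, 23, 24, 25]]
print(encoded_toy["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

With the real tokenizer, passing a list of texts with `padding='max_length'`, `truncation=True`, and `max_length=5` should produce the same shape of result. Also worth knowing: `padding=True` pads to the longest text in the batch, so on a single text it adds nothing.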

## 🎓 Key Lessons About Data

Let me summarize what we learned:

### 1. **Always Explore Your Data First** 🔍
   - Look at actual examples
   - Check for missing values
   - Visualize distributions
   - Understand what you're working with

### 2. **Clean Your Data** 🧹
   - Remove HTML/special characters
   - Handle missing values
   - Fix inconsistencies
   - Remove duplicates

### 3. **Check Balance** ⚖️
   - Are classes equally represented?
   - If not, you might need to balance them
   - Use stratification when splitting
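
IMDB is balanced, so we didn't need this here. But if your own dataset isn't, one common option is to weight the rare class more heavily during training. Here's a minimal sketch of the usual "balanced" heuristic on hypothetical labels. Whether and how you pass these weights to a loss function depends on your training setup:

```python
from collections import Counter

# Hypothetical imbalanced labels: 90 negative, 10 positive
labels = [0] * 90 + [1] * 10

counts = Counter(labels)
n_samples, n_classes = len(labels), len(counts)

# "Balanced" heuristic: weight = n_samples / (n_classes * class_count)
class_weights = {
    label: n_samples / (n_classes * count)
    for label, count in counts.items()
}
print(class_weights)
```

The rare class ends up with a weight of 5.0 and the common class with about 0.56, so each class contributes roughly equally overall.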

### 4. **Split Properly** ✂️
   - Train: Model learns from this (70-80%)
   - Validation: Monitor progress (10-15%)
   - Test: Final evaluation (10-15%)
   - NEVER test on training data!

### 5. **Tokenize for Your Model** 🔢
   - Models need numbers, not text
   - Use the right tokenizer for your model
   - Handle padding and truncation

---

## 🎉 You Did It!

You now understand data preparation - the MOST IMPORTANT part of fine-tuning!

Remember: **Garbage in = Garbage out**

Good data = Good AI. Every. Single. Time.

In the next lesson, we'll actually fine-tune a model using this clean, prepared data!

Ready? Let's go! 🚀

## 📚 Quick Reference

**Load dataset:**
```python
from datasets import load_dataset
dataset = load_dataset('dataset_name')
```

**Clean text:**
```python
import re
text = re.sub(r'<[^>]+>', ' ', text)  # Remove HTML
text = ' '.join(text.split())  # Remove extra spaces
```

**Split data:**
```python
from sklearn.model_selection import train_test_split
train, val = train_test_split(data, test_size=0.1, stratify=data['label'])
```

**Tokenize:**
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('model_name')
encoded = tokenizer(text, padding=True, truncation=True)
```

---

### 🎯 Practice Exercise

Try this yourself:

1. Load a different dataset (try 'emotion' or 'sst2')
2. Explore it like we did here
3. Clean any messy text
4. Create train/val/test splits
5. Tokenize a few examples

Post your results in the community! We'd love to see what you found!

**Next up:** Module 2, Lesson 1 - Your First Fine-Tuning! 🎉