# 📊 Data Exploration - SemEval 2026 Task 13

**Goal:** Understand the dataset and find patterns that distinguish human-written from AI-generated code

**Level:** ⭐ Beginner (30-60 minutes)

**What you'll learn:**
- Load and explore Parquet files
- Calculate basic statistics
- Create visualizations
- Identify patterns in the data

## 1. Setup

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set seed for reproducibility
SEED = 42
np.random.seed(SEED)

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("✅ Libraries loaded!")
print(f"🔒 Random seed: {SEED}")

## 2. Load Data

In [None]:
# Load training data
train_df = pd.read_parquet('../data/train_A.parquet')
val_df = pd.read_parquet('../data/validation_A.parquet')

print(f"Training samples: {len(train_df)}")
print(f"Validation samples: {len(val_df)}")
print(f"\nColumns: {list(train_df.columns)}")

# Show first few rows
train_df.head()
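
Before computing statistics, it can help to check the loaded frame for missing, empty, or duplicated samples. The sketch below is one possible check, not part of the original notebook. It assumes the `code` and `label` columns used later in this notebook; `quick_quality_report` and `demo_df` are hypothetical names, and the tiny frame is made-up data standing in for `train_df`.

```python
import pandas as pd


def quick_quality_report(df: pd.DataFrame, text_col: str = "code",
                         label_col: str = "label") -> dict:
    """Return a few basic data-quality counts for a labelled text dataset."""
    return {
        "rows": len(df),
        # Samples with no text at all
        "missing_text": int(df[text_col].isna().sum()),
        # Samples with no label
        "missing_label": int(df[label_col].isna().sum()),
        # Samples whose text is present but empty or whitespace-only
        "empty_text": int((df[text_col].str.strip() == "").sum()),
        # Samples whose text repeats an earlier row exactly
        "duplicate_text": int(df[text_col].duplicated().sum()),
    }


# Made-up frame for illustration only (not real task data)
demo_df = pd.DataFrame({
    "code": ["print(1)", "print(1)", "", None, "x = 2"],
    "label": [0, 1, 0, 1, 1],
})

report = quick_quality_report(demo_df)
print(report)
```

On the real data you would call `quick_quality_report(train_df)` and `quick_quality_report(val_df)` instead.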

## 3. Basic Statistics

In [None]:
# Class distribution
print("Class Distribution:")
print(train_df['label'].value_counts())
print("\nBalance (proportion per class):")
print(train_df['label'].value_counts(normalize=True))

# Visualize
plt.figure(figsize=(8, 5))
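# Sort by label so the first bar (green) is always 0=Human and the second (red) is 1=AI
train_df['label'].value_counts().sort_index().plot(kind='bar', color=['#66bb6a', '#ef5350'])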
plt.title('Class Distribution (0=Human, 1=AI)', fontsize=14, fontweight='bold')
plt.xlabel('Label')
plt.ylabel('Count')
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()

## 4. Code Length Analysis

In [None]:
# Add length features
train_df['code_length'] = train_df['code'].str.len()
train_df['num_lines'] = train_df['code'].str.count('\n') + 1

# Statistics by class
print("Code Length Statistics:")
print(train_df.groupby('label')['code_length'].describe())
print("\nNumber of Lines Statistics:")
print(train_df.groupby('label')['num_lines'].describe())

In [None]:
# Visualize distributions
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Code length
for label in [0, 1]:
    data = train_df[train_df['label'] == label]['code_length']
    axes[0].hist(data, bins=30, alpha=0.6, label=f'Label {label}')
axes[0].set_xlabel('Code Length (characters)')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Code Length Distribution')
axes[0].legend()

# Number of lines
for label in [0, 1]:
    data = train_df[train_df['label'] == label]['num_lines']
    axes[1].hist(data, bins=30, alpha=0.6, label=f'Label {label}')
axes[1].set_xlabel('Number of Lines')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Line Count Distribution')
axes[1].legend()

plt.tight_layout()
plt.show()
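
Length features are often right-skewed, although this notebook does not confirm that for this dataset. If a few very long samples squash the histograms above into one or two bins, one option is to cap values at a high percentile before plotting. This is a minimal sketch under that assumption; `clip_upper_tail` and `demo_lengths` are hypothetical names, and the numbers are made up.

```python
import pandas as pd


def clip_upper_tail(values: pd.Series, quantile: float = 0.99) -> pd.Series:
    """Cap values above the given quantile so outliers do not dominate a histogram."""
    cap = values.quantile(quantile)
    return values.clip(upper=cap)


# Made-up lengths with one extreme outlier
demo_lengths = pd.Series([10.0, 12.0, 15.0, 20.0, 5000.0])
clipped = clip_upper_tail(demo_lengths, quantile=0.8)
print(clipped.tolist())
```

In the plotting cell this would be used as `axes[0].hist(clip_upper_tail(data), bins=30, alpha=0.6, label=f'Label {label}')`. Clipping changes only the picture, so keep reporting statistics on the unclipped columns.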

## 5. Sample Code Inspection

In [None]:
# Show example human code
print("=" * 60)
print("EXAMPLE: Human-Written Code (Label 0)")
print("=" * 60)
human_sample = train_df[train_df['label'] == 0].iloc[0]['code']
print(human_sample)

print("\n" + "=" * 60)
print("EXAMPLE: AI-Generated Code (Label 1)")
print("=" * 60)
ai_sample = train_df[train_df['label'] == 1].iloc[0]['code']
print(ai_sample)

## 6. Your Turn! 🎯

**Tasks to try:**
1. Calculate average word length for human vs AI code
2. Count frequency of common keywords (def, class, if, for)
3. Analyze comment patterns (lines starting with #)
4. Create more visualizations

**Add your code below:**

In [1]:
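# Your exploration code here!
def word_length():
    '''
    Calculate average word length for human vs AI code.

    A "word" is any whitespace-separated token. Each sample gets the mean
    character length of its tokens.

    Returns:
        tuple: (human_avg, ai_avg), the mean per-sample value for each class
    '''
    # Total non-whitespace characters divided by token count gives the
    # mean token length per sample (NaN for empty or missing code)
    word_count = train_df['code'].str.split().str.len()
    char_count = train_df['code'].str.replace(r'\s+', '', regex=True).str.len()
    train_df['word_length'] = char_count / word_count.where(word_count > 0)

    human_avg = train_df[train_df['label'] == 0]['word_length'].mean()
    ai_avg = train_df[train_df['label'] == 1]['word_length'].mean()
    print(f"Human code average word length: {human_avg:.2f}")
    print(f"AI code average word length: {ai_avg:.2f}")

    # Visualize distributions
    plt.figure(figsize=(8, 5))

    for label in [0, 1]:
        data = train_df[train_df['label'] == label]['word_length'].dropna()
        plt.hist(data, bins=30, alpha=0.6, label=f'Label {label}')
    plt.xlabel('Average Word Length per Code Sample (characters)')
    plt.ylabel('Frequency')
    plt.title('Average Word Length Distribution')
    plt.legend()
    plt.tight_layout()
    plt.show()

    return human_avg, ai_avg


# Run the analysis
human_avg, ai_avg = word_length()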
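
Tasks 2 and 3 from the list above (keyword frequency and comment patterns) could be approached along these lines. This is a rough sketch, not a reference solution. It treats `#` as the comment marker, as the task description does, which only fits Python-style code, and it also counts keywords that appear inside strings or comments. `keyword_counts`, `comment_line_ratio`, and `demo_df` are hypothetical names, and the two samples are made up.

```python
import re

import pandas as pd

KEYWORDS = ["def", "class", "if", "for"]


def keyword_counts(code: str) -> dict:
    """Count whole-word occurrences of each keyword in one code sample."""
    return {kw: len(re.findall(rf"\b{kw}\b", code)) for kw in KEYWORDS}


def comment_line_ratio(code: str) -> float:
    """Fraction of non-empty lines whose first non-space character is '#'."""
    lines = [ln.strip() for ln in code.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    return sum(ln.startswith("#") for ln in lines) / len(lines)


# Made-up samples standing in for train_df
demo_df = pd.DataFrame({
    "code": [
        "# add two numbers\ndef add(a, b):\n    return a + b",
        "for i in range(3):\n    if i:\n        print(i)",
    ],
    "label": [0, 1],
})

# One column per keyword, aligned with the original rows
features = pd.DataFrame(
    demo_df["code"].map(keyword_counts).tolist(), index=demo_df.index
)
features["comment_ratio"] = demo_df["code"].map(comment_line_ratio)
features["label"] = demo_df["label"]

# Average of every feature per class
print(features.groupby("label").mean())
```

Replacing `demo_df` with `train_df` gives the per-class averages for the real data.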

## 7. Key Findings

**Document your observations:**
- What patterns did you notice?
- Are there clear differences between human and AI code?
- What features might be useful for classification?
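
To put a number on "clear differences", one simple option is a standardized mean difference (a Cohen's d style effect size) for each numeric feature. The sketch below uses the equal-weight pooled standard deviation, which is a simplification when the classes differ in size. `standardized_mean_difference` and `demo_df` are hypothetical names, and the data is made up.

```python
import numpy as np
import pandas as pd


def standardized_mean_difference(df: pd.DataFrame, feature: str,
                                 label_col: str = "label") -> float:
    """(mean of label 1 - mean of label 0) divided by the pooled standard deviation."""
    human = df.loc[df[label_col] == 0, feature]
    ai = df.loc[df[label_col] == 1, feature]
    # Equal-weight pooling of the two sample variances (assumes similar class sizes)
    pooled_std = np.sqrt((human.var(ddof=1) + ai.var(ddof=1)) / 2)
    if pooled_std == 0:
        return 0.0
    return float((ai.mean() - human.mean()) / pooled_std)


# Made-up feature values for illustration only
demo_df = pd.DataFrame({
    "code_length": [1.0, 2.0, 3.0, 3.0, 4.0, 5.0],
    "label": [0, 0, 0, 1, 1, 1],
})

effect = standardized_mean_difference(demo_df, "code_length")
print(f"Effect size for code_length: {effect:.2f}")
```

Values far from zero suggest the feature separates the classes. Values near zero suggest it carries little signal on its own, though it may still help in combination with other features.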

**Your notes:**
- 
- 
- 

---

## ✅ Next Steps

1. **Share your findings** - Open a PR with this notebook
2. **Try notebook 02** - Feature analysis
3. **Propose new features** - Based on what you discovered

**Great job exploring the data!** 🎉