# 01 – Data Exploration

In this notebook we will:

1. Load our train / validation / test splits  
2. Confirm data integrity (no missing values, correct columns)  
3. Spot-check random examples per emotion  
4. Examine class distributions per split (Seaborn plots)  
5. (Optional) Inspect text-length distributions  

All data files were generated by `scripts/1_load_data.py`.


In [7]:
# — Imports & Setup
from datasets import load_dataset
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# Seaborn styling
%matplotlib inline
sns.set_theme(style="whitegrid")

# Point to data directory
DATA_DIR = Path.cwd().parent / "data"

## 1. Load CSV Splits via `datasets`

We’ll use `load_dataset("csv", …)` to load our local files into a `DatasetDict`.

In [8]:
# — Load local CSVs
ds = load_dataset("csv", data_files={
    "train": str(DATA_DIR / "train.csv"),
    "validation": str(DATA_DIR / "validation.csv"),
    "test": str(DATA_DIR / "test.csv")
})

# Inspect dataset sizes and columns
print(ds)

Generating train split: 16000 examples [00:00, 267878.27 examples/s]
Generating validation split: 2000 examples [00:00, 147629.58 examples/s]
Generating test split: 2000 examples [00:00, 113676.02 examples/s]


DatasetDict({
    train: Dataset({
        features: ['text', 'emotion'],
        num_rows: 16000
    })
    validation: Dataset({
        features: ['text', 'emotion'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['text', 'emotion'],
        num_rows: 2000
    })
})


Now convert each split into a pandas DataFrame for detailed inspection.

In [9]:
# — To pandas
df_train = ds["train"].to_pandas()
df_val   = ds["validation"].to_pandas()
df_test  = ds["test"].to_pandas()

print("Train:", df_train.shape, "| Validation:", df_val.shape, "| Test:", df_test.shape)

Train: (16000, 2) | Validation: (2000, 2) | Test: (2000, 2)
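
As a quick sanity check on the split sizes printed above, the share of each split can be computed directly. This is a minimal sketch: the sizes are copied from the output above, and `split_proportions` is a hypothetical helper name, not something defined in `scripts/1_load_data.py`.

```python
# Split sizes copied from the shape printout above (assumed to be current).
split_sizes = {"train": 16000, "validation": 2000, "test": 2000}


def split_proportions(sizes):
    """Return each split's share of the total number of rows."""
    total = sum(sizes.values())
    return {name: n / total for name, n in sizes.items()}


proportions = split_proportions(split_sizes)
for name, share in proportions.items():
    print(f"{name}: {share:.0%}")
# train: 80%
# validation: 10%
# test: 10%
```

In the notebook itself the same helper could be fed `{name: len(split) for name, split in ds.items()}` instead of hard-coded numbers.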


## 2. Data Integrity Checks

Ensure each DataFrame has:
- Exactly two columns: `text` and `emotion`  
- No missing values  
- Correct data types

In [12]:
# Column names and types
display(df_train.dtypes)

# Missing values
for split, df in [("train", df_train), ("validation", df_val), ("test", df_test)]:
    print(f"{split}: {df.isna().sum().sum()} missing values")

text       object
emotion    object
dtype: object

train: 0 missing values
validation: 0 missing values
test: 0 missing values


> **Result:** No missing values; columns `text` (string) and `emotion` (string) look correct.
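
The checks in this section could also be bundled into one helper and run on every split, so that a column or type problem in validation or test is not missed. The sketch below is illustrative only: `find_integrity_problems` and the toy DataFrames are hypothetical, and the expected column names are taken from the `features` printed when the dataset was loaded.

```python
import pandas as pd

# Expected schema, taken from the features printed by the loader (assumption).
EXPECTED_COLUMNS = ["text", "emotion"]


def find_integrity_problems(df, expected_columns=EXPECTED_COLUMNS):
    """Return a list of human-readable problems; an empty list means the frame passed."""
    problems = []
    # Check column names and order.
    if list(df.columns) != list(expected_columns):
        problems.append(f"unexpected columns: {list(df.columns)}")
    # Check for missing values anywhere in the frame.
    n_missing = int(df.isna().sum().sum())
    if n_missing:
        problems.append(f"{n_missing} missing values")
    # Check that every non-missing value in the expected columns is a string.
    for col in expected_columns:
        if col in df.columns:
            all_strings = df[col].dropna().map(lambda v: isinstance(v, str)).all()
            if not all_strings:
                problems.append(f"non-string values in column '{col}'")
    return problems


# Toy frames standing in for the real splits.
toy_ok = pd.DataFrame({"text": ["i feel fine", "i feel sad"], "emotion": ["joy", "sadness"]})
toy_bad = pd.DataFrame({"text": ["i feel fine", None], "label": ["joy", 3]})

print(find_integrity_problems(toy_ok))   # no problems expected
print(find_integrity_problems(toy_bad))  # wrong column name and one missing value
```

In the notebook, the helper could be called in a loop over `df_train`, `df_val`, and `df_test`, with an `assert not problems` to fail fast.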

## 3. Spot-Check Random Examples

Sample a few random texts per emotion from the training set to manually check text and label quality.

In [None]:
n_samples = 3  # Number of examples to sample per emotion

for emotion in sorted(df_train.emotion.unique()):
    print(f"--- {emotion.upper()} ---")
    samples = df_train[df_train.emotion == emotion].sample(n_samples, random_state=42)
    for text in samples["text"]:
        print(f"# {text}")
    print("\n")


--- ANGER ---
# i ve been feeling a bit cranky with the kids this week cranky baby whiny year old demanding preschooler so i wanted to stop and remember how blessed i really am
# i feel frustrated sometimes with my mac lipsticks when i have to read names or open each of them to select shade
# i feeling stressed


--- FEAR ---
# i can feel the frantic beat of his heart but cookie s voice is surprisingly clear
# i feel a little suspicious
# i can t help but feeling weird when opening every closet in an apartment that somebody s still living in so i didn t


--- JOY ---
# i feel im rather innocent in that respect
# im feeling quite adventurous and tried out those drinks that i just normally read through the pages of pocketbooks
# im feeling much more positive about the impending move


--- LOVE ---
# i mean fuck i feel like i was way more considerate with customers and concerned about appearance and sanitiation snoozel pm but fine
# i remember a couple of years ago i was feeling romantic 

> **Observation:**
> - All text is lowercase.
> - Symbols and punctuation marks, including apostrophes, have been stripped. Apostrophes are sometimes dropped outright (`im`) and sometimes replaced by a space (`i ve`, `can t`); I assume other punctuation marks were handled the same way.
> - Some examples appear to be filed under the wrong emotion, e.g. under LOVE: "i mean fuck i feel like i was way more considerate with customers and concerned about appearance and sanitiation snoozel pm but fine".
> - Some texts are ungrammatical, e.g. under ANGER: "i feeling stressed".
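
The formatting observations above (lowercase only, no punctuation) come from a handful of samples, so it may be worth measuring them rather than eyeballing. The following is a rough sketch, with `text_format_stats` as a hypothetical helper; the three example strings are copied from the sampled output above.

```python
import string


def text_format_stats(texts):
    """Share of texts that contain any uppercase letter or any ASCII punctuation."""
    texts = list(texts)
    n = len(texts)
    if n == 0:
        return {"n_texts": 0, "share_with_uppercase": 0.0, "share_with_punctuation": 0.0}
    with_upper = sum(any(ch.isupper() for ch in t) for t in texts)
    with_punct = sum(any(ch in string.punctuation for ch in t) for t in texts)
    return {
        "n_texts": n,
        "share_with_uppercase": with_upper / n,
        "share_with_punctuation": with_punct / n,
    }


# Examples copied from the spot-check output above.
samples = [
    "i feeling stressed",
    "i feel a little suspicious",
    "im feeling much more positive about the impending move",
]
stats = text_format_stats(samples)
print(stats)
```

Running `text_format_stats(df_train["text"])` in the notebook would show whether the pattern holds for the whole training split. Note that `string.punctuation` only covers ASCII punctuation, so non-ASCII symbols would need a separate check.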

## TODO: Send this output to the chat assistant and ask it to write an observation about it