# 02 · Data-Preparation Demo

This notebook demonstrates the `twitter_airline_analysis.data_prep` pipeline. It:

1. Loads the raw CSV  
2. Shows a before/after sample of 10 tweets  
3. Saves the cleaned data to Parquet and prints the output path  

In [1]:
# standard imports
from twitter_airline_analysis.data_prep import load_raw, preprocess, save_parquet
import pandas as pd

# load the raw DataFrame
df_raw = load_raw()
print(f"Raw data: {df_raw.shape[0]:,} rows × {df_raw.shape[1]} columns")
df_raw.head(10)

Raw data: 14,640 rows × 15 columns


Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)
5,570300767074181121,negative,1.0,Can't Tell,0.6842,Virgin America,,jnardino,,0,@VirginAmerica seriously would pay $30 a fligh...,,2015-02-24 11:14:33 -0800,,Pacific Time (US & Canada)
6,570300616901320704,positive,0.6745,,0.0,Virgin America,,cjmcginnis,,0,"@VirginAmerica yes, nearly every time I fly VX...",,2015-02-24 11:13:57 -0800,San Francisco CA,Pacific Time (US & Canada)
7,570300248553349120,neutral,0.634,,,Virgin America,,pilot,,0,@VirginAmerica Really missed a prime opportuni...,,2015-02-24 11:12:29 -0800,Los Angeles,Pacific Time (US & Canada)
8,570299953286942721,positive,0.6559,,,Virgin America,,dhepburn,,0,"@virginamerica Well, I didn't…but NOW I DO! :-D",,2015-02-24 11:11:19 -0800,San Diego,Pacific Time (US & Canada)
9,570295459631263746,positive,1.0,,,Virgin America,,YupitsTate,,0,"@VirginAmerica it was amazing, and arrived an ...",,2015-02-24 10:53:27 -0800,Los Angeles,Eastern Time (US & Canada)


## Before / After Cleaning

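The table below places the first 10 tweets in their original form next to the cleaned `clean_text` column. Judging from the output, cleaning strips @mentions, lowercases the text, and replaces the ellipsis character (…) with three periods.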

In [2]:
# take a 10-row sample for demo
sample = df_raw.head(10).copy()

# apply the cleaning pipeline
df_tidy = preprocess(sample)

# display side-by-side 
pd.concat(
    [
        sample[["tweet_id", "text"]].rename(columns={"text": "original_text"}),
        df_tidy[["clean_text"]]
    ],
    axis=1
)

Unnamed: 0,tweet_id,original_text,clean_text
0,570306133677760513,@VirginAmerica What @dhepburn said.,what said.
1,570301130888122368,@VirginAmerica plus you've added commercials t...,plus you've added commercials to the experienc...
2,570301083672813571,@VirginAmerica I didn't today... Must mean I n...,i didn't today... must mean i need to take ano...
3,570301031407624196,@VirginAmerica it's really aggressive to blast...,"it's really aggressive to blast obnoxious ""ent..."
4,570300817074462722,@VirginAmerica and it's a really big bad thing...,and it's a really big bad thing about it
5,570300767074181121,@VirginAmerica seriously would pay $30 a fligh...,seriously would pay $30 a flight for seats tha...
6,570300616901320704,"@VirginAmerica yes, nearly every time I fly VX...","yes, nearly every time i fly vx this “ear worm..."
7,570300248553349120,@VirginAmerica Really missed a prime opportuni...,really missed a prime opportunity for men with...
8,570299953286942721,"@virginamerica Well, I didn't…but NOW I DO! :-D","well, i didn't...but now i do! :-d"
9,570295459631263746,"@VirginAmerica it was amazing, and arrived an ...","it was amazing, and arrived an hour early. you..."
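
The actual cleaning rules live inside `preprocess` and are not shown in this notebook. As a rough illustration only, the sketch below reproduces the transformations visible in the table above (mention removal, ellipsis normalisation, lowercasing, whitespace collapsing). The names `clean_tweet`, `MENTION_RE`, and `WHITESPACE_RE` are hypothetical and not part of `twitter_airline_analysis`; the real pipeline may apply further steps.

```python
import re

# Hypothetical patterns, inferred from the before/after table only.
MENTION_RE = re.compile(r"@\w+")      # @VirginAmerica, @dhepburn, ...
WHITESPACE_RE = re.compile(r"\s+")    # runs of spaces left behind by removals


def clean_tweet(text: str) -> str:
    """Approximate the observed cleaning: drop mentions, normalise, lowercase."""
    text = MENTION_RE.sub("", text)           # remove @mentions
    text = text.replace("\u2026", "...")      # ellipsis character -> three periods
    text = text.lower()                       # lowercase everything
    return WHITESPACE_RE.sub(" ", text).strip()  # collapse and trim whitespace


print(clean_tweet("@VirginAmerica What @dhepburn said."))
# what said.
print(clean_tweet("@virginamerica Well, I didn't\u2026but NOW I DO! :-D"))
# well, i didn't...but now i do! :-d
```

Both printed lines match rows 0 and 8 of the table; the other rows are truncated in the display, so the sketch cannot be checked against them.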


## Save to Parquet

Now we save the full cleaned dataset to Parquet and display the path.

In [3]:
# load & preprocess full dataset
full_raw  = load_raw()
full_tidy = preprocess(full_raw)

# save and capture the file path
out_path = save_parquet(full_tidy)
print(f"✅ Saved {len(full_tidy):,} rows to:\n{out_path}")

✅ Saved 14,640 rows to:
C:\Projects\twitter-airline-analysis\data\processed\tweets.parquet


In [4]:
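# sanity checks on the cleaned 10-row sample; a cell echoes only its last
# expression, so the null check is evaluated but its result is not displayed
df_tidy.isna().sum().sum() == 0
df_tidy.tweet_id.dtype == "int64"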


True
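
A notebook cell displays only the value of its last expression, so the `True` above reflects the dtype check alone. One way to make both checks visible is to turn them into assertions that fail loudly. The sketch below is a minimal version of that idea; `check_tidy` and the `demo` frame are hypothetical and assume only that the cleaned frame has a `tweet_id` column, as the output above suggests.

```python
import pandas as pd


def check_tidy(df: pd.DataFrame) -> None:
    """Raise AssertionError if the frame has missing values or a non-int64 tweet_id."""
    n_missing = int(df.isna().sum().sum())
    assert n_missing == 0, f"{n_missing} missing values remain"
    assert df["tweet_id"].dtype == "int64", (
        f"tweet_id is {df['tweet_id'].dtype}, expected int64"
    )


# Tiny stand-in frame; in the notebook you would pass df_tidy or full_tidy.
demo = pd.DataFrame(
    {
        "tweet_id": [570306133677760513, 570301130888122368],
        "clean_text": ["what said.", "plus you've added commercials"],
    }
)
check_tidy(demo)  # passes silently when both conditions hold
```

Calling `check_tidy(full_tidy)` after the save step would validate the full dataset rather than the 10-row sample.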

In [5]:
# ── Cell: regenerate_splits.py ──────────────────────────────────────────────
import pathlib as pl
import pandas as pd
from sklearn.model_selection import train_test_split

RAW_DIR      = pl.Path("../data/raw")
PROC_DIR     = pl.Path("../data/processed")
PROC_DIR.mkdir(parents=True, exist_ok=True)

df   = pd.read_csv(RAW_DIR / "twitter_airline_clean.csv")  # adjust to your raw file
X    = df["text"]
y    = df["label"]

# 60 % train, 20 % validation, 20 % test: hold out 40 %, then split it in half
# (adjust if you used other ratios earlier)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.40, stratify=y, random_state=42)
X_val,   X_test, y_val, y_test   = train_test_split(X_temp, y_temp, test_size=0.50, stratify=y_temp, random_state=42)

# Feather requires a default index, so reset the shuffled index on every frame
(pd.DataFrame({"text": X_val}).reset_index(drop=True)
   .to_feather(PROC_DIR / "X_val.ftr"))
(pd.DataFrame({"label": y_val}).reset_index(drop=True)
   .to_feather(PROC_DIR / "y_val.ftr"))
(pd.DataFrame({"text": X_test}).reset_index(drop=True)
   .to_feather(PROC_DIR / "X_test.ftr"))
(pd.DataFrame({"label": y_test}).reset_index(drop=True)
   .to_feather(PROC_DIR / "y_test.ftr"))

print("Validation / test splits written to", PROC_DIR.resolve())


FileNotFoundError: [Errno 2] No such file or directory: '..\\data\\raw\\twitter_airline_clean.csv'
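
The cell fails because `twitter_airline_clean.csv` is not present in `../data/raw`; which file it should point to is not stated here. A small guard can make this kind of failure easier to diagnose by listing the CSV files that do exist. The sketch below is one possible approach; `require_csv` is a hypothetical helper, and the demo runs in a temporary directory rather than the project tree.

```python
import pathlib as pl
import tempfile


def require_csv(raw_dir: pl.Path, filename: str) -> pl.Path:
    """Return raw_dir / filename, or fail with a message listing the CSVs present."""
    path = raw_dir / filename
    if not path.is_file():
        available = sorted(p.name for p in raw_dir.glob("*.csv"))
        raise FileNotFoundError(
            f"{path} not found; CSV files in {raw_dir}: {available or 'none'}"
        )
    return path


# Demo in a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    raw_dir = pl.Path(tmp)
    (raw_dir / "example.csv").write_text("text,label\nhello,neutral\n")

    found = require_csv(raw_dir, "example.csv")
    print(found.name)  # example.csv

    try:
        require_csv(raw_dir, "missing.csv")
    except FileNotFoundError as err:
        print(err)  # names the missing path and lists ['example.csv']
```

In the cell above, `pd.read_csv(require_csv(RAW_DIR, "twitter_airline_clean.csv"))` would then report which CSV files are actually available in `RAW_DIR`.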