# 03 · Exploratory Data Analysis – Airline Tweet Sentiment

> Goal: understand distributions, class balance, and candidate text‑based
> features to inform feature engineering and model selection for M4.

## 1  Data & Imports

* Load cleaned dataset `data/processed/tweets.parquet`  
* Inspect basic shape and column names  
* Verify no unexpected nulls after preprocessing
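
The checks above can be sketched as follows, assuming pandas with a parquet engine (pyarrow or fastparquet) is installed. The helper names `load_tweets` and `null_report` are hypothetical, and treating `negativereason` as legitimately nullable (it is normally filled only for negative tweets) is an assumption to confirm against the preprocessing step. A toy frame stands in for the real file so the snippet runs on its own.

```python
import pandas as pd


def load_tweets(path: str = "data/processed/tweets.parquet") -> pd.DataFrame:
    """Read the cleaned dataset; needs a parquet engine such as pyarrow."""
    return pd.read_parquet(path)


def null_report(df: pd.DataFrame, allowed_null=("negativereason",)) -> pd.Series:
    """Null counts for columns that are NOT expected to contain nulls."""
    counts = df.isna().sum()
    unexpected = (counts > 0) & ~counts.index.isin(list(allowed_null))
    return counts[unexpected]


# Toy stand-in for the real data (values are illustrative only).
demo = pd.DataFrame(
    {
        "text": ["late again", "thanks!", "ok flight"],
        "airline": ["Airline A", None, "Airline B"],
        "airline_sentiment": ["negative", "positive", "neutral"],
        "negativereason": ["Late Flight", None, None],
    }
)

print(demo.shape)             # (rows, columns)
print(demo.columns.tolist())  # column names
print(null_report(demo))      # only `airline` is flagged in this toy frame
```

In the notebook, `demo` would be replaced by `load_tweets()`; an empty report means no unexpected nulls survived preprocessing.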

## 2  Class Balance

* **Bar chart** – absolute counts of `airline_sentiment`  
* **Pie or bar chart** – class proportions (%) to make the imbalance visible  
* Note class‑weight implications for supervised models
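
A small sketch of the counts, proportions, and one common weighting heuristic. The toy labels mirror a 63/21/16 split purely for illustration; they are not read from the dataset. The weight formula `n_samples / (n_classes * count)` is the "balanced" heuristic that scikit-learn also uses for `class_weight="balanced"`; whether it is the right choice here is left open.

```python
import pandas as pd

# Toy labels mirroring a 63/21/16 split (illustrative, not the real data).
labels = pd.Series(
    ["negative"] * 63 + ["neutral"] * 21 + ["positive"] * 16,
    name="airline_sentiment",
)

counts = labels.value_counts()                   # absolute counts per class
shares = (counts / counts.sum() * 100).round(1)  # class proportions in %

# "Balanced" heuristic: weight_c = n_samples / (n_classes * count_c),
# so rarer classes receive proportionally larger weights.
class_weight = (counts.sum() / (len(counts) * counts)).to_dict()

print(pd.DataFrame({"count": counts, "share_pct": shares}))
print(class_weight)
```

In a notebook, `counts.plot.bar()` and `shares.plot.bar()` would render the two charts listed above (matplotlib required).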

## 3  Tweet Length Distributions

* Histogram – character length (`char_len`)  
* Histogram – word count (`word_count`)  
* Box‑and‑whisker to spot outliers  
* Summarize median / IQR – useful for padding/truncation decisions
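
A minimal sketch of the length features and their summary, assuming the raw tweet text lives in a column named `text` (an assumption; `char_len` and `word_count` may already exist after preprocessing). The helper `length_summary` is hypothetical, and the upper fence uses the conventional 1.5 × IQR box-plot rule.

```python
import pandas as pd

# Toy tweets (illustrative only).
demo = pd.DataFrame(
    {
        "text": [
            "late again",
            "lost my bag today",
            "thanks for the quick gate change",
            "flight delayed two hours and no update yet",
        ]
    }
)

# Derive the two length features from the raw text.
demo["char_len"] = demo["text"].str.len()
demo["word_count"] = demo["text"].str.split().str.len()


def length_summary(s: pd.Series) -> dict:
    """Median, IQR, and the upper box-plot fence (Q3 + 1.5 * IQR)."""
    q1, median, q3 = s.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    return {"median": median, "iqr": iqr, "upper_fence": q3 + 1.5 * iqr}


word_stats = length_summary(demo["word_count"])
char_stats = length_summary(demo["char_len"])
print(word_stats)
print(char_stats)
```

The upper fence gives one candidate truncation length: sequences longer than it are the outliers the box plot would flag.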

## 4  Negative‑Reason Breakdown

* Filter `airline_sentiment == "negative"`  
* Horizontal bar chart – frequency of each `negativereason`  
* Identify dominant complaint categories; merge sparse classes if needed
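
One way to sketch the filter and the sparse-class merge. The function name, the `min_count` threshold, and the `"Other"` bucket label are all hypothetical choices, and the toy rows are illustrative only.

```python
import pandas as pd


def negative_reason_counts(
    df: pd.DataFrame, min_count: int = 2, other_label: str = "Other"
) -> pd.Series:
    """Count reasons among negative tweets, folding rare ones into one bucket."""
    neg = df.loc[df["airline_sentiment"] == "negative", "negativereason"]
    counts = neg.value_counts()
    sparse = counts[counts < min_count].index
    # Keep frequent reasons as-is; replace sparse ones with `other_label`.
    merged = neg.where(~neg.isin(sparse), other_label)
    return merged.value_counts()


# Toy rows (illustrative only).
demo = pd.DataFrame(
    {
        "airline_sentiment": ["negative"] * 7 + ["positive"],
        "negativereason": [
            "Customer Service Issue",
            "Customer Service Issue",
            "Customer Service Issue",
            "Late Flight",
            "Late Flight",
            "Lost Luggage",
            "Bad Flight",
            None,
        ],
    }
)

reason_counts = negative_reason_counts(demo, min_count=2)
print(reason_counts)
```

`reason_counts.sort_values().plot.barh()` would then draw the horizontal bar chart in a notebook.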

## 5  Sentiment by Airline

* Heatmap – counts or % of (airline × sentiment)  
* Identify which airlines have a higher positive share  
* Evaluate `airline` as a model feature, or consider airline‑specific models
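
The table behind the heatmap can be sketched with `pandas.crosstab`. The airline names below are placeholders and the rows are toy data; the column name `airline` is assumed from the section title.

```python
import pandas as pd

# Toy rows with placeholder airline names (illustrative only).
demo = pd.DataFrame(
    {
        "airline": ["Airline A"] * 4 + ["Airline B"] * 4,
        "airline_sentiment": [
            "negative", "negative", "neutral", "positive",
            "negative", "positive", "positive", "positive",
        ],
    }
)

# Absolute counts per (airline, sentiment) cell.
cell_counts = pd.crosstab(demo["airline"], demo["airline_sentiment"])

# Row-normalised percentages: each airline's row sums to 100.
row_pct = (
    pd.crosstab(demo["airline"], demo["airline_sentiment"], normalize="index")
    .mul(100)
    .round(1)
)

print(cell_counts)
print(row_pct)
```

Row-normalised percentages make airlines with very different tweet volumes comparable; if seaborn is available, `seaborn.heatmap(row_pct, annot=True)` is one way to render the result.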

## 6  Top TF‑IDF Terms (Quick Peek)

* (Optional) Compute TF‑IDF per sentiment and list top 10 terms  
* Useful for qualitative feature sanity check

## 7  Key Insights

* Bullet list of 4‑5 actionable findings, e.g.  
  - Dataset is **63 % negative**, **21 % neutral**, **16 % positive** → apply stratified sampling or class weights  
  - Typical tweet length ≈ 100 characters / 15 words – safe for standard tokenization window  
  - “Customer Service Issue” & “Late Flight” dominate negative reasons – possible hierarchical label grouping  
  - **United** and **Delta** receive highest negative volume; **Southwest** skews more positive  
  - Top TF‑IDF terms confirm domain‑specific air‑travel vocabulary (e.g. *bag*, *gate*, *delayed*)
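
The stratified-sampling option from the first finding could be sketched with scikit-learn's `train_test_split`. The toy frame reproduces a 63/21/16 split for illustration; the 80/20 ratio and `random_state` are arbitrary assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with a 63/21/16 class split (illustrative, not the real data).
demo = pd.DataFrame(
    {
        "text": [f"tweet {i}" for i in range(1000)],
        "airline_sentiment": (
            ["negative"] * 630 + ["neutral"] * 210 + ["positive"] * 160
        ),
    }
)

# `stratify` keeps the class proportions (approximately) equal in both parts.
train_df, test_df = train_test_split(
    demo,
    test_size=0.2,
    stratify=demo["airline_sentiment"],
    random_state=42,
)

test_share = test_df["airline_sentiment"].value_counts(normalize=True)
print(len(train_df), len(test_df))
print(test_share.round(2))
```

Stratification only preserves the imbalance in each split; it does not correct it, so class weights or resampling may still be needed during training.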

## 8  Next Steps

* Finalize feature list (text, metadata)  
* Decide resampling / weighting strategy  
* Build baseline model notebook (`04_baseline_model.ipynb`)  
* Integrate findings into README & project roadmap