## 2. Data preprocessing & exploration
In this phase, the goal is to transform raw text data into a structured, clean format suitable for machine learning, and to gain insights into the dataset's characteristics, potential biases, and important features.

For this task, the following tools will be used:
- `pandas` for data handling
- `nltk` for natural language processing
- `matplotlib` and `seaborn` for visualization

>Unlike the usual convention of importing every package at the top, this notebook imports each package once, in the cell where it is first used.

### 2.1 Load dataset
The dataset is a CSV file with two columns: `text` (the input) and `label` (the output, i.e. the command class).

This section also downloads the NLTK data needed for the next step.

In [None]:
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
import nltk
import re


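# check whether each required NLTK resource exists locally; download it if not
# note: NLTK 3.9+ looks up "punkt_tab" instead of "punkt" when tokenizing,
# so add it to this list if word_tokenize raises a LookupError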
nltk_data = [
    {"package": "corpora/stopwords", "name": "stopwords"},
    {"package": "tokenizers/punkt", "name": "punkt"},
    {"package": "corpora/wordnet", "name": "wordnet"},
    {"package": "corpora/omw-1.4", "name": "omw-1.4"},
]

for data in nltk_data:
    try:
        nltk.data.find(data['package'])
    except LookupError:
        nltk.download(data['name'])

# load the dataset; stop here if the file is missing, since later cells need `df`
try:
    df = pd.read_csv("./data/dataset.csv")
except FileNotFoundError:
    print("[dataset]: file not found")
    raise
else:
    print("[dataset]: loaded successfully")
    print(f"[dataset]: initial shape: {df.shape}")
    print(df.head())
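
Beyond the shape and the first rows, it can help to confirm that the two expected columns are present and to count missing values and duplicate rows before any cleaning. The following is a minimal sketch under stated assumptions: `summarize_dataset` is a hypothetical helper, and the sample rows and label names are made up for illustration, not taken from the dataset.

```python
import pandas as pd

# Columns the dataset description says the CSV should contain
EXPECTED_COLUMNS = {"text", "label"}


def summarize_dataset(frame: pd.DataFrame) -> dict:
    """Hypothetical helper: report basic data-quality facts for the dataset."""
    missing_columns = EXPECTED_COLUMNS - set(frame.columns)
    if missing_columns:
        raise ValueError(f"missing columns: {sorted(missing_columns)}")
    return {
        "rows": len(frame),
        # total number of empty cells across both columns
        "missing_values": int(frame[["text", "label"]].isna().sum().sum()),
        # rows that repeat an earlier (text, label) pair
        "duplicate_rows": int(frame.duplicated(subset=["text", "label"]).sum()),
    }


# Tiny made-up frame, for illustration only
sample = pd.DataFrame({
    "text": ["turn on the light", "turn on the light", None],
    "label": ["light_on", "light_on", "light_off"],
})
report = summarize_dataset(sample)
print(report)
```

In the notebook, the same helper could be called as `summarize_dataset(df)` right after loading.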

### 2.2 Explore dataset
In this section, the distribution of the command classes is determined to help identify imbalanced classes that might require special handling (like oversampling, undersampling or class weighting). 

Because the dataset has the same number of samples for each class, little or no special handling is needed. In an imbalanced dataset, a class with few samples is at a disadvantage during training, because the model sees too few examples to learn it well.
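
Should an imbalance appear in a later version of the dataset, class weighting is one of the options mentioned above. A minimal sketch, assuming the common inverse-frequency heuristic `n_samples / (n_classes * class_count)` (the same formula scikit-learn uses for `class_weight="balanced"`); the function name and the example labels are hypothetical:

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {label: weight}, where weight = n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


# Made-up labels for illustration: "pause" is the rare class
example_labels = ["play"] * 6 + ["stop"] * 3 + ["pause"] * 1
weights = balanced_class_weights(example_labels)

for label, weight in weights.items():
    print(f"{label}: {weight:.3f}")
```

Rare classes receive weights above 1 and frequent classes below 1; with perfectly balanced classes every weight is exactly 1, which matches the observation that no special handling is needed here.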

In [None]:
print("[dataset]: class distribution")
print(df['label'].value_counts())
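
The raw counts can also be condensed into a single number. The sketch below computes the ratio between the largest and smallest class; `imbalance_ratio` is a hypothetical helper, the example series are made up, and any threshold for "too imbalanced" would be a judgment call rather than something the dataset prescribes.

```python
import pandas as pd


def imbalance_ratio(labels: pd.Series) -> float:
    """Hypothetical helper: size of the largest class divided by the smallest."""
    counts = labels.value_counts()
    return float(counts.max() / counts.min())


# Made-up label series for illustration only
balanced = pd.Series(["a", "b", "a", "b"])
skewed = pd.Series(["a"] * 9 + ["b"] * 3)

print(imbalance_ratio(balanced))
print(imbalance_ratio(skewed))
```

A ratio of 1 means every class has the same number of samples; in the notebook the call would be `imbalance_ratio(df['label'])`.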

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12,6))
sns.countplot(data=df, x="label")
plt.title('Distribution of Command Classes')
plt.xlabel('Command Class')
plt.ylabel('Number of Samples')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()