## 1. Data preprocessing & exploration
In this phase, the goal is to transform raw text data into a structured, clean format suitable for machine learning, and to gain insights into the dataset's characteristics, potential biases, and important features.

The following tools are used for this task:
- `pandas` for data handling
- `nltk` for natural language processing
- `matplotlib` and `seaborn` for visualization

> Unlike the usual convention of importing every package at the top, this notebook imports each package once, in the cell where it is first used.

### 1.1 Load dataset
The dataset is a CSV file with two columns: `text` (the input) and `label` (the output, i.e. the command class).
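
Before going further, it can help to confirm that the loaded frame really has this shape. The following is a minimal sketch, not part of the original notebook: `validate_dataset` is a hypothetical helper, the sample rows are made up for illustration, and dropping rows with missing values is an assumption about how such rows should be handled.

```python
import pandas as pd

# column names taken from the dataset description above
REQUIRED_COLUMNS = {"text", "label"}


def validate_dataset(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: check the expected columns and drop unusable rows."""
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"dataset is missing columns: {sorted(missing)}")
    # rows without a text or a label cannot be used for training (assumption)
    return df.dropna(subset=["text", "label"]).reset_index(drop=True)


# made-up rows, for illustration only
sample = pd.DataFrame(
    {
        "text": ["open the door", None, "play some music"],
        "label": ["open", "open", "play"],
    }
)
checked = validate_dataset(sample)
print(checked.shape)
```

Raising on a missing column fails early, which is usually easier to debug than a `KeyError` several cells later.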

The cell below also downloads the NLTK data needed for the cleaning step, if it is not already present.

In [None]:
import pandas as pd
import nltk


# check whether the required NLTK data exists; download it otherwise
# "package" is the resource path passed to nltk.data.find,
# "name" is the identifier passed to nltk.download
nltk_data = [
    {"package": "corpora/stopwords", "name": "stopwords"},
    {"package": "tokenizers/punkt", "name": "punkt"},
    {"package": "corpora/wordnet", "name": "wordnet"},
    {"package": "corpora/omw-1.4", "name": "omw-1.4"},
    {"package": "tokenizers/punkt_tab", "name": "punkt_tab"},
]

for data in nltk_data:
    try:
        nltk.data.find(data['package'])
    except LookupError:
        nltk.download(data['name'])

# load the dataset
dataset_path = "./data/dataset.csv"

try:
    df = pd.read_csv(dataset_path)
    print("[dataset]: loaded successfully")
    print(f"[dataset]: initial shape: {df.shape}")
    print(df.head())
except FileNotFoundError:
    print(f"[dataset]: file not found: {dataset_path}")
    # re-raise so later cells do not fail with an undefined `df`
    raise

### 1.2 Explore dataset
In this section, the distribution of the command classes is examined to identify imbalanced classes that might require special handling (such as oversampling, undersampling or class weighting).

Because the dataset has the same number of samples for each class, no special handling is needed here. Had some classes been under-represented, the model would see too few examples of them during training and might struggle to learn them effectively.
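
For a dataset where the counts were uneven, class weighting could be sketched as below. This is an illustration only: the labels are made up, and the weights follow the common "balanced" heuristic `n_samples / (n_classes * samples_in_class)`, which is an assumption about how weighting would be done rather than something this notebook uses.

```python
import pandas as pd

# made-up labels, for illustration only
labels = pd.Series(["open", "open", "open", "close", "play", "play"])

counts = labels.value_counts()
# "balanced" heuristic: rarer classes receive proportionally larger weights
class_weights = len(labels) / (counts.size * counts)
# ratio of the largest to the smallest class; 1.0 means perfectly balanced
imbalance_ratio = counts.max() / counts.min()

print(class_weights.round(2).to_dict())
print(f"imbalance ratio: {imbalance_ratio:.1f}")
```

With equal class counts, every weight comes out as 1.0 and the ratio is 1.0, which is the situation in this dataset.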

In [None]:
print("[dataset]: class distribution")
print(df['label'].value_counts())

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12,6))
sns.countplot(data=df, x="label")
plt.title('Distribution of Command Classes')
plt.xlabel('Command Class')
plt.ylabel('Number of Samples')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

### 1.3 Clean text with NLTK
Cleaning standardizes the text and reduces noise so that the model can focus on the relevant features.
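
To make the effect of the character filter concrete, here is a small standalone sketch of the lowercase-and-strip step on its own, using only the standard library. `strip_non_letters` is a hypothetical name and the sentences are made up; the regular expression is the same one used in `preprocess_text` below.

```python
import re


def strip_non_letters(text: str) -> str:
    """Lowercase the text, then drop everything except a-z and whitespace."""
    return re.sub(r"[^a-z\s]", "", text.lower())


print(strip_non_letters("Turn ON the lights at 7pm!"))  # turn on the lights at pm
print(strip_non_letters("Don't stop"))                  # dont stop
```

Note that digits and apostrophes are removed along with punctuation, so a contraction such as "don't" becomes "dont" and may no longer match an entry in the stopword list.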

In [None]:
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
import re

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    # convert all text to lowercase
    text = text.lower()
    # remove special characters and numbers to simplify the vocabulary
    text = re.sub(r'[^a-z\s]', '', text)
    # break the text down into individual words
    tokens = word_tokenize(text)
    # remove stopwords (low-information words like 'is', 'the', 'a')
    tokens = [word for word in tokens if word not in stop_words]
    # reduce words to their base form, e.g. 'commands' to 'command';
    # this shrinks the vocabulary and treats inflections of a word as the same.
    # lemmatize() assumes nouns by default, so verb forms such as 'running'
    # are left unchanged unless a part-of-speech tag is passed
    tokens = [lemmatizer.lemmatize(word) for word in tokens]

    return " ".join(tokens)

# apply preprocessing to text column
df['clean_text'] = df['text'].apply(preprocess_text)
df.head()

In [None]:
print('[dataset]: text preprocessing complete')
print(f"original: '{df['text'].iloc[0]}'")
print(f"cleaned: '{df['clean_text'].iloc[0]}'")

In [None]:
# save the cleaned dataset next to the original, without the row index
clean_dataset_path = dataset_path.replace(".csv", "-clean.csv")
df.to_csv(clean_dataset_path, index=False)
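
One thing to watch when the cleaned file is loaded again: a text made up entirely of stopwords is cleaned to an empty string, and pandas reads empty CSV fields back as `NaN` by default. The sketch below shows this with an in-memory buffer and made-up rows; `df_demo` and the other names are hypothetical, and whether to use `keep_default_na=False` or to drop such rows is a choice left to the later steps.

```python
import io

import pandas as pd

# made-up rows; the second text was cleaned down to an empty string
df_demo = pd.DataFrame(
    {"clean_text": ["open door", ""], "label": ["open", "close"]}
)

# write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
df_demo.to_csv(buffer, index=False)

buffer.seek(0)
default_read = pd.read_csv(buffer)  # empty field becomes NaN
buffer.seek(0)
safe_read = pd.read_csv(buffer, keep_default_na=False)  # empty field stays ""

print(default_read["clean_text"].isna().sum())
print(safe_read["clean_text"].tolist())
```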