# 2. Training
In this section, the clean dataset is split into two sets:

- training set
- testing set

The chosen model is trained on the training set.

In [None]:
import pandas as pd

clean_dataset_path = "./data/dataset-clean.csv"

try:
    df = pd.read_csv(clean_dataset_path)
except FileNotFoundError:
    # every later cell depends on df, so report the path and stop here
    print(f"[dataset]: file not found: {clean_dataset_path}")
    raise

df.head()

In [None]:
# cleaned text as the feature (input)
X = df['clean_text'].fillna('')
# label as the class (output)
y = df['label']

print(f"Shape of features(X): {X.shape}")
print(f"Shape of labels(y): {y.shape}")

### 2.1 Train-test split
- The testing set should never be exposed to the model during training.
- `stratify=y` is important for classification tasks to ensure that the proportion of classes is roughly the same in both training and test sets.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Size of training set: {len(X_train)}")
print(f"Size of testing set: {len(X_test)}")
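
The effect of `stratify` can be checked by comparing class proportions in the two sets. The sketch below uses made-up, imbalanced labels (the label names and counts are assumptions for illustration, not the project's dataset); the same comparison could be run on `y_train` and `y_test`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy, imbalanced labels (hypothetical data, not the project's dataset)
texts = pd.Series([f"sample text {i}" for i in range(100)])
labels = pd.Series(["greeting"] * 70 + ["command"] * 20 + ["question"] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# Share of each class in the training and testing sets
train_props = y_tr.value_counts(normalize=True).sort_index()
test_props = y_te.value_counts(normalize=True).sort_index()

print(pd.DataFrame({"train": train_props, "test": test_props}))
```

If the two columns are close, the split preserved the class balance.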

### 2.2 Model selection
To select an appropriate machine learning algorithm for text classification, several candidates are tested and the best-performing one is kept.

Since this is a classification problem, the candidates are:
- Naive Bayes
- Tree-based algorithms (Decision Trees, Random Forest, XGBoost)
- Logistic Regression
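
One way such a comparison might be run is with cross-validation over a few candidate models. This is a minimal sketch, not the notebook's actual selection procedure: the toy texts, the label names, and the choice of `cv=3` are all assumptions, and XGBoost is left out because it is a separate package.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data for illustration only (hypothetical, not the project's dataset)
texts = [
    "hello there", "good morning", "hi how are you",
    "hey nice to meet you", "good evening", "hello again",
    "open the file", "select that item", "close the window",
    "delete this item", "open the settings", "select the file",
]
labels = ["greeting"] * 6 + ["command"] * 6

candidates = {
    "naive_bayes": MultinomialNB(),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

mean_scores = {}
for name, model in candidates.items():
    # The vectorizer sits inside the pipeline, so each fold fits TF-IDF
    # only on that fold's training portion
    pipe = make_pipeline(TfidfVectorizer(), model)
    mean_scores[name] = cross_val_score(pipe, texts, labels, cv=3).mean()

for name, score in mean_scores.items():
    print(f"{name}: {score:.3f}")
```

On a dataset this small the scores are not meaningful; the point is only the shape of the comparison loop.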

The input (feature) is text, so it must first be transformed into numerical vectors that a classifier can map to an output class. `scikit-learn` offers feature extraction tools for this:
- `TfidfVectorizer` - converts raw text directly into TF-IDF features
- `TfidfTransformer` - converts an existing matrix of token counts into TF-IDF features
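
The relationship between the two tools can be seen on a tiny example: the vectorizer does in one step what `CountVectorizer` followed by the transformer does in two. The three sentences below are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical sample sentences
docs = ["open the file", "close the file", "what can you do"]

# One step: raw text -> TF-IDF matrix
tfidf_direct = TfidfVectorizer().fit_transform(docs)

# Two steps: raw text -> token counts -> TF-IDF matrix
counts = CountVectorizer().fit_transform(docs)
tfidf_two_step = TfidfTransformer().fit_transform(counts)

# Rows are documents, columns are vocabulary terms
print(tfidf_direct.shape)

# With default settings both routes give the same matrix
print(np.allclose(tfidf_direct.toarray(), tfidf_two_step.toarray()))
```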

There are two approaches for connecting these steps:
- Manual (procedural) flow - each step is set up independently and the tools are connected by hand
- Pipelines - the steps and their order are declared once, and the pipeline runs them automatically

#### 2.2.1 Selection
The selected tools for the workflow are:
- `TfidfVectorizer` - transforms text data into numerical features
- Naive Bayes (`MultinomialNB`) - classifies each input into one of several classes

#### 2.2.2 Approach 1 - Manual setup

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# fit vectorizer to training set

tfid = TfidfVectorizer(ngram_range=(1, 2), max_features=5000)

# the vectorizer is unsupervised, so it only needs the text, not the labels
tfid.fit(X_train)

tfid

In [None]:
from sklearn.naive_bayes import MultinomialNB

# use vectorized data to fit naive bayes for multi-class prediction

mnb = MultinomialNB(alpha=1.0)

X_transformed = tfid.transform(X_train)

# labels are passed as-is; only the text features are vectorized
mnb.fit(X=X_transformed, y=y_train)

In [None]:
from core.preprocessing import preprocess_text

def make_prediction(user_input):
    clean_user_input = preprocess_text(user_input)

    vectorized_input = tfid.transform([clean_user_input])

    return mnb.predict(vectorized_input)

# quick test

user_prompt = "what can you do?"

make_prediction(user_prompt)

#### 2.2.3 Approach 2 - Pipelining

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

# each step is a (name, estimator) pair; steps run in the order listed
pipeline_mnb = Pipeline(
    [
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2), max_features=5000)),
        ("clf", MultinomialNB(alpha=1.0)),
    ]
)

pipeline_mnb.fit(X_train, y_train)

pipeline_mnb

In [None]:
from core.preprocessing import preprocess_text


def make_prediction2(user_input):
    clean_user_input = preprocess_text(user_input)

    # the pipeline vectorizes the text itself, so pass the cleaned string directly
    return pipeline_mnb.predict([clean_user_input])


# quick test

user_prompt2 = "select that item"

make_prediction2(user_prompt2)
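
Given the same settings, the manual setup and the pipeline should produce identical predictions, since the pipeline only automates the same two steps. The sketch below checks this on a handful of made-up sentences (the texts and labels are assumptions, not the project's dataset).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical training and test sentences
train_texts = ["hello there", "good morning", "open the file", "select that item"]
train_labels = ["greeting", "greeting", "command", "command"]
new_texts = ["hello good morning", "select the file"]

# Approach 1: vectorize and classify by hand
vec = TfidfVectorizer(ngram_range=(1, 2))
clf = MultinomialNB(alpha=1.0)
clf.fit(vec.fit_transform(train_texts), train_labels)
manual_pred = clf.predict(vec.transform(new_texts))

# Approach 2: the same two steps declared as a pipeline
pipe = Pipeline(
    [
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
        ("clf", MultinomialNB(alpha=1.0)),
    ]
)
pipe.fit(train_texts, train_labels)
pipe_pred = pipe.predict(new_texts)

print(manual_pred, pipe_pred)
print(np.array_equal(manual_pred, pipe_pred))
```

The pipeline version has the practical advantage that raw text goes in at prediction time, so there is no separate vectorizer object to keep in sync with the classifier.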