# General Flow for Training/Fitting Models

In [None]:
!wget -q https://github.com/PSAM-5020-2025S-A/5020-utils/raw/main/src/data_utils.py

In [None]:
import pandas as pd

from sklearn.model_selection import train_test_split

from data_utils import RandomForestClassifier, StandardScaler
from data_utils import classification_error, display_confusion_matrix
from data_utils import object_from_json_url

### 3 Stages
- Data Prep: Encoding, Scaling, Clustering, sometimes Splitting into train/test datasets
- Modeling: `fit()` classifier
- Evaluation: `predict()` and measure error

#### Data Prep:
Do we need to split our data, or is it already split into train/test sets?

If it's already split, we fit the encoding, scaling and clustering objects on the `train` data (usually with the `fit_transform()` function), and then reuse those same fitted objects to encode, scale and cluster the `test` data (usually with the `transform()` function).

If the data is not split into two datasets, we can split it first and then follow the steps above. Alternatively, we can run `fit_transform()` for encoding, scaling and clustering on the entire dataset and only split the result afterwards. This is a bit easier to perform, but it lets information from the eventual `test` rows influence the encoder, scaler and clustering objects, and in turn the model, so it can add a bit of bias.
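
As a rough illustration of why `fit_transform()` goes with `train` and `transform()` goes with `test`, here is a minimal hand-rolled scaler for a single feature. `TinyScaler` and the flipper lengths are made up for this sketch; the real scaler objects do more than this, but the idea of learning statistics once and reusing them is the same.

```python
from statistics import mean, pstdev


class TinyScaler:
    """Hypothetical stand-in for a standard scaler (one feature only)."""

    def fit(self, values):
        # Learn the mean and population standard deviation from the given data.
        self.mean_ = mean(values)
        self.std_ = pstdev(values)
        return self

    def transform(self, values):
        # Reuse the statistics learned in fit(); nothing is recomputed here.
        return [(v - self.mean_) / self.std_ for v in values]

    def fit_transform(self, values):
        return self.fit(values).transform(values)


train_lengths = [180.0, 190.0, 200.0, 210.0]  # made-up train values
test_lengths = [195.0, 230.0]                 # made-up test values

scaler = TinyScaler()
train_scaled = scaler.fit_transform(train_lengths)  # statistics come from train only
test_scaled = scaler.transform(test_lengths)        # test reuses the train statistics

print(train_scaled)
print(test_scaled)
```

Calling `fit_transform()` on the test values instead would recompute the mean and standard deviation, so the two datasets would no longer be on the same scale.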

#### Modeling
Once we have `train` and `test` datasets that have been encoded, scaled and clustered, we can use the `train` dataset to fit a supervised model (classifier, regression, etc).

Here we will usually call a `fit()` function with the training dataset's features and, separately, its labels or outcome variable values. Something like `fit(features, labels)`.
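
A small sketch of the `fit(features, labels)` pattern, using scikit-learn's `RandomForestClassifier` directly rather than the `data_utils` version imported at the top of this notebook (which may behave a bit differently). The feature values, `n_estimators=20` and `random_state=0` are made up for the example.

```python
from sklearn.ensemble import RandomForestClassifier

# Made-up, already-scaled features: one row per sample, two columns per row.
train_features = [
    [-1.2, -0.8], [-1.0, -1.1], [-0.9, -0.7],
    [1.1, 0.9], [0.8, 1.2], [1.0, 1.0],
]
# One label per row, kept separate from the features.
train_labels = [0, 0, 0, 1, 1, 1]

# random_state only makes the sketch repeatable between runs.
model = RandomForestClassifier(n_estimators=20, random_state=0)
model.fit(train_features, train_labels)

# The fitted model remembers which labels it has seen.
print(model.classes_)
```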

#### Evaluation
We have a model we trained/fitted with the `train` dataset. Now we can measure how well it actually performs once it's used without the correct labels.

Here we usually call `predict()` with a dataset's features to get label or regression predictions.

We want to call `predict()` for both the `train` and `test` datasets, and then measure how close those predictions are to the actual labels and values that we have in our dataset.

Evaluating with the `train` dataset will tell us if the model is capable of learning anything about the data. Evaluating with the `test` dataset will tell us if the model is capable of learning patterns and trends beyond the data that is fed to it.

It's common for the model to perform better with the `train` data since it was trained using that data and labels, but the `test` dataset error is what's more important because it will tell us what kind of error to expect from data that the model hasn't seen.
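
To make the train/test comparison concrete, here is a minimal sketch of one common error measure: the fraction of predictions that don't match the labels. `error_rate` is a hypothetical helper written for this example, and the labels and predictions are made up; the `classification_error()` function used later in this notebook may be computed differently.

```python
def error_rate(labels, predictions):
    """Fraction of predictions that do not match the true labels."""
    wrong = sum(1 for actual, predicted in zip(labels, predictions) if actual != predicted)
    return wrong / len(labels)


# Made-up labels and predictions for data the model was trained on.
train_labels = [0, 0, 1, 1, 2, 2, 0, 1]
train_pred = [0, 0, 1, 1, 2, 2, 0, 1]

# Made-up labels and predictions for data the model has not seen.
test_labels = [0, 1, 2, 1]
test_pred = [0, 1, 1, 1]

train_error = error_rate(train_labels, train_pred)
test_error = error_rate(test_labels, test_pred)

print(train_error, test_error)
```

A large gap between the two numbers, with the `train` error much lower, is the usual sign that the model memorized its training data instead of learning something general.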

### Example

Classifying penguins based on measurements.

Let's load a dataset and look.

In [None]:
PENGUIN_URL = "https://raw.githubusercontent.com/PSAM-5020-2025S-A/5020-utils/refs/heads/main/datasets/json/penguins.json"
penguin_data = object_from_json_url(PENGUIN_URL)

display(penguin_data)

It doesn't have separate train and test data, so we can either:

#### Pre-process and then split:

<img src="./imgs/datasplit-00.jpg" width="720px"/>

OR

#### Split and then process:

<img src="./imgs/datasplit-01.jpg" width="720px"/>

In [None]:
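# Put the records in a DataFrame
# (assumes penguin_data is a list of per-penguin records)
penguin_df = pd.DataFrame.from_records(penguin_data)

# Encode the species label as integers, numbered in order of first appearance
# so the codes line up with penguin_df["species"].unique()
penguin_df["label"] = pd.factorize(penguin_df["species"])[0]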

### Split the Data

Using `train_test_split()`

In [None]:
# Split with train_test_split()
penguin_train, penguin_test = train_test_split(penguin_df, train_size=0.8)
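
The split above is random, so the rows that land in each dataset change from run to run. As an optional sketch, scikit-learn's `train_test_split()` also accepts `random_state` (to make the split repeatable) and `stratify` (to keep label proportions similar in both parts). The tiny DataFrame and the value `42` are made up for the example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up dataset: 10 rows, two labels with 5 rows each.
toy_df = pd.DataFrame({
    "measurement": [1.0, 1.1, 0.9, 1.2, 1.0, 3.0, 3.1, 2.9, 3.2, 3.0],
    "label": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
})

toy_train, toy_test = train_test_split(
    toy_df,
    train_size=0.8,
    random_state=42,           # same split every time this cell runs
    stratify=toy_df["label"],  # keep the label balance in both parts
)

print(toy_train.shape, toy_test.shape)
print(toy_test["label"].tolist())
```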

In [None]:
# Fit the scaler on the train features only, then scale them
penguin_scaler = StandardScaler()
train_scaled_df = penguin_scaler.fit_transform(penguin_train.drop(columns=["species", "label"]))

train_scaled_df["label"] = penguin_train["label"].values
display(train_scaled_df)
train_scaled_df.shape

In [None]:
# Scale the test features with the scaler fitted on train: transform(), not fit_transform()
test_scaled_df = penguin_scaler.transform(penguin_test.drop(columns=["species", "label"]))

test_scaled_df["label"] = penguin_test["label"].values
display(test_scaled_df)
test_scaled_df.shape

### Model/Fit

We can train our model now. We're going to use a `RandomForestClassifier` and `fit()` it with the training data.

In [None]:
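# Fit a RandomForestClassifier on the scaled training features and, separately, the labels
penguin_model = RandomForestClassifier()
penguin_model.fit(train_scaled_df.drop(columns=["label"]), train_scaled_df["label"])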

### Evaluate

We can now run predictions for both `train` and `test` data and measure error.

In [None]:
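# Predict labels for train and test features (the label column is dropped first)
train_pred = penguin_model.predict(train_scaled_df.drop(columns=["label"]))
test_pred = penguin_model.predict(test_scaled_df.drop(columns=["label"]))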

### Measure Error

In [None]:
# Measure classification error with classification_error()
display(classification_error(train_scaled_df["label"], train_pred))
display(classification_error(test_scaled_df["label"], test_pred))

### Look at Confusion (Matrix)

`display_confusion_matrix(labels, predictions, display_labels=unique_labels)`
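
As a rough idea of what a confusion matrix holds, here is a minimal sketch that counts (actual, predicted) pairs by hand. The labels and predictions are made up, and this is not how `display_confusion_matrix()` is implemented; it only shows the table being counted.

```python
from collections import Counter

# Made-up true labels and predictions for three classes.
labels = [0, 0, 1, 1, 2, 2]
predictions = [0, 1, 1, 1, 2, 0]

# Count how often each (actual, predicted) pair occurs.
counts = Counter(zip(labels, predictions))
classes = sorted(set(labels) | set(predictions))

# Rows are actual labels, columns are predicted labels.
matrix = [
    [counts[(actual, predicted)] for predicted in classes]
    for actual in classes
]

for row in matrix:
    print(row)
```

Correct predictions land on the diagonal; anything off the diagonal shows which classes are being mixed up with which.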

In [None]:
# Look at confusion matrices
display_confusion_matrix(train_scaled_df["label"], train_pred, display_labels=penguin_df["species"].unique().tolist())
display_confusion_matrix(test_scaled_df["label"], test_pred, display_labels=penguin_df["species"].unique().tolist())