# Part 2 - Machine Learning for Prediction

In [None]:
# !pip install pandas numpy scikit-learn imbalanced-learn matplotlib nltk seaborn torch torchvision torchaudio tqdm openai plotly xgboost transformers

### Fine-Tuning BERT

Initially, we assumed that the bulk of relevant information was contained in the textual columns, leading us to explore transformer-based models like BERT, Google’s pre-trained NLP model.

We fine-tuned BERT by unfreezing the last two layer blocks (out of a total of 32 layers) using the code in bert.py. Training ran smoothly with the required dependencies and sufficient memory for the specified BATCH_SIZE.
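Since bert.py is not reproduced here, the following is only a minimal sketch of what "unfreezing the last two layer blocks" could look like with the `transformers` library. The helper name `unfreeze_last_blocks` and the tiny randomly initialised configuration are assumptions made so the example runs without downloading weights; a real run would load a pre-trained checkpoint and add a trainable classification head on top.

```python
from transformers import BertConfig, BertModel


def unfreeze_last_blocks(model: BertModel, n_blocks: int = 2) -> BertModel:
    """Freeze every parameter, then re-enable gradients for the last n encoder blocks."""
    for param in model.parameters():
        param.requires_grad = False
    for block in model.encoder.layer[-n_blocks:]:
        for param in block.parameters():
            param.requires_grad = True
    return model


# Tiny random BERT (hypothetical sizes) so the sketch stays offline and fast.
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=4,
    num_attention_heads=2,
    intermediate_size=64,
)
model = unfreeze_last_blocks(BertModel(config), n_blocks=2)

# Parameter names look like "encoder.layer.<index>.<...>"; collect the trainable indices.
trainable = sorted(
    {name.split(".")[2] for name, p in model.named_parameters() if p.requires_grad}
)
print("trainable encoder blocks:", trainable)
```

Only the parameters with `requires_grad=True` need to be handed to the optimizer, which keeps memory use and training time down compared with full fine-tuning.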

Surprisingly, we found that XGBoost outperformed BERT. Although transformer models like BERT are expected to capture complex contextual relationships in text, XGBoost likely benefited from information outside the text data, giving it a notable edge.

Below are the results from the first three epochs using our latest configuration:

![BERT Training.](images/bert_training.png)

With appropriate learning rate decay, the highest validation accuracy I've
seen in any epoch was 94.5%. I decided to drop the model and use tree-based
models for the final prediction.

### Pre-processing Stages

The following classes will be used to preprocess data within an imblearn pipeline before training machine learning models.

-   **FillNA** - Filling all NA's with the string "na".

-   **MergeWithFoodNutrients** - Merge the food dataframes with the
    merged nutrients & food_nutrients dataframe, as in Part 1.

    -   *nutrient_min_freq* - Minimum number of different snacks a
        nutrient must be found in; otherwise its column is dropped.

-   **DropColumns** - Drops the columns passed in *columns*.

-   **NaiveBayesScores** - For a given textual column, adds either the
    Naive Bayes scores for each category (6 in total) or the count
    vectorization without applying the Naive Bayes model.

    -   *colname* - The required textual column.

    -   *preprocess_func* - Preprocessing function applied to the textual
        column before anything else.

    -   *vectorizer_kwgs* - kwargs for the vectorizer (as in
        `sklearn.feature_extraction.text.CountVectorizer` and
        `TfidfVectorizer`).

    -   *mode* - "scores" for the Naive Bayes scores, or "count" to
        vectorize the textual column without applying Naive Bayes. The
        latter may add a significant number of columns to the dataset
        (one for each unique word/n-gram). I'll be controlling the
        number of columns by tuning the vectorizer kwargs, such as
        removing stop words, stripping accents, and keeping only tokens
        that appear in at least `min_df` documents.

    -   *use_tfidf* - True for TfidfVectorizer, False for
        CountVectorizer (empirically works better).

-   **CleanAndListifyIngredients** - As described in Part 1: removes
    text inside () and [] and applies some regex cleaning to the
    ingredients. The words of a multi-word ingredient are joined by
    underscores, and different ingredients are separated by a single
    space " ".

    -   *keep_top_n* - Keep only the first n ingredients.

-   **StandardScale** - Standard scaler wrapper that bypasses the 'idx'
    column. Not important for tree-based models.

-   **StemDescription** - Strips the suffixes of the description words,
    as in `nltk.stem.snowball.SnowballStemmer`.

        A few rules:
        ILY  -----> ILI
        LY   -----> Nil
        SS   -----> SS
        S    -----> Nil
        ED   -----> E,Nil

-   **AddImportantTokens** - Not in use. Adding only tokens passing some
    importance threshold.

-   **AddCategoryTokensAppearance** - Not in use. Adding a column for
    each word in the categories, if they're found in a textual column.
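To make the "scores" mode of **NaiveBayesScores** more concrete, here is a minimal sketch under stated assumptions: the actual class lives in helpers/preprocess.py and is not shown here, so the toy data, the three illustrative categories, and the `description_nb_` column prefix are all hypothetical. The idea is to count-vectorize one textual column, fit a multinomial Naive Bayes model on the labels, and append one log-probability column per category.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy data; the real dataset has six categories and far more rows.
train = pd.DataFrame({
    "description": [
        "chocolate fudge cake", "vanilla cup cake",
        "sour gummy candy", "fruit hard candy",
        "salted potato chips", "corn tortilla chips",
    ],
    "category": ["cakes", "cakes", "candy", "candy", "chips", "chips"],
})

# Step 1: turn the text into token / n-gram counts.
vectorizer = CountVectorizer(ngram_range=(1, 2))
counts = vectorizer.fit_transform(train["description"])

# Step 2: fit Naive Bayes on the counts.
nb = MultinomialNB()
nb.fit(counts, train["category"])

# Step 3: one log-probability column per category, appended to the frame.
scores = pd.DataFrame(
    nb.predict_log_proba(counts),
    columns=[f"description_nb_{c}" for c in nb.classes_],
    index=train.index,
)
train_with_scores = pd.concat([train, scores], axis=1)
print(train_with_scores.filter(like="_nb_").round(2))
```

Scoring the same rows the model was fitted on is optimistic, so a production version would more likely compute these scores out-of-fold; this sketch skips that for brevity.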

After many trials, we decided to build the following pipe:

In [None]:
from helpers.preprocess import (
    FillNA,
    MergeWithFoodNutrients,
    CleanAndListifyIngredients,
    NaiveBayesScores,
    LogTransformation,
    DropColumns,
    StemDescription,
)
from imblearn.pipeline import Pipeline

steps = [
    FillNA(),
    MergeWithFoodNutrients(nutrient_min_freq=2),
    CleanAndListifyIngredients(),
    StemDescription(),
    NaiveBayesScores(
        colname="brand",
        preprocess_func=lambda x: x.replace(" ", ""),
        vectorizer_kwgs=dict(
            stop_words="english", ngram_range=(1, 6), strip_accents="unicode", min_df=20
        ),
        mode="count",
        use_tfidf=False,
    ),
    NaiveBayesScores(
        colname="description",
        vectorizer_kwgs=dict(
            stop_words="english", ngram_range=(1, 6), strip_accents="unicode", min_df=50
        ),
        mode="count",
        use_tfidf=False,
    ),
    NaiveBayesScores(
        colname="ingredients",
        vectorizer_kwgs=dict(
            stop_words="english", ngram_range=(1, 6), strip_accents="unicode", min_df=50, max_df=0.6
        ),
        mode="count",
        use_tfidf=False,
    ),
    NaiveBayesScores(
        colname="household_serving_fulltext",
        vectorizer_kwgs=dict(stop_words="english", ngram_range=(1, 6), strip_accents="unicode", min_df=50),
        mode="count",
        use_tfidf=False,
    ),
    LogTransformation(columns=["serving_size"]),
    DropColumns(columns=["serving_size_unit"]),
]


pipe = Pipeline([(f"{i}", step) for i, step in enumerate(steps)])

This pipeline produces a wide matrix (about 2000 columns) because we set mode="count" in NaiveBayesScores, which creates one count column for every token or n-gram that appears in at least min_df documents.
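The effect of `min_df` on the width of the matrix can be seen on a toy corpus. This is a small sketch with made-up product descriptions, not the project data; it only uses `CountVectorizer` directly rather than the NaiveBayesScores wrapper.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "milk chocolate bar", "dark chocolate bar", "chocolate chip cookie",
    "butter cookie", "sea salt potato chips", "kettle potato chips",
]

# Number of count columns produced for each min_df threshold.
widths = {}
for min_df in (1, 2, 3):
    vec = CountVectorizer(ngram_range=(1, 2), min_df=min_df)
    matrix = vec.fit_transform(docs)
    widths[min_df] = matrix.shape[1]
    print(f"min_df={min_df}: {matrix.shape[1]} columns -> {sorted(vec.vocabulary_)}")
```

Raising `min_df` prunes rare tokens and n-grams quickly, which is why the thresholds of 20 and 50 in the pipeline keep the matrix at a manageable width despite `ngram_range=(1, 6)`.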

In our exploration, we discovered that chocolate is a challenging category to predict and is also less frequent than the others. We attempted to address this with SMOTE, but the improvement was minimal. This may be because we applied oversampling early in the process, while we were still extracting features; applying it to the current feature set could potentially yield better results.

### ResNet18 Score Features

\*\* Code is based on
<https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html>

We’ll begin by leveraging the images in the dataset. Integrating the images with structured data wasn’t straightforward, so I experimented with two methods, using a ResNet18 model. All model weights were frozen except for the last layer or a few final layers. The model was trained for 100 epochs with a learning rate of 0.001, decaying by a factor of 0.1 every 7 epochs.

The two approaches tested were:

1. Fine-tuning ResNet18 by replacing the last layer with 6 output classes.

2. Fine-tuning ResNet18 by replacing the last layer with $n$ outputs (where $6 \le n \le 256$), then concatenating the "last layer" output with structured features and adding [FC -> BN -> ReLU] blocks, ending with 6 output scores.

![2nd Approach Illustration.](images/images_net.png)

The code is in helpers/resnet.py, although it’s been modified frequently. The ResNet18ForSnacks class represents the latest architecture I used.

Now, let’s see how well we can predict the class using the first approach (images only).

In [None]:
import pandas as pd
from sklearn.metrics import accuracy_score


rn18_train = pd.read_csv("data/_resnet18_train_features_fine_tuned.csv", index_col=0)

y = rn18_train['y']
# Each remaining column holds the score of one class; predict the highest-scoring class per row
y_pred = rn18_train.drop(columns='y').idxmax(axis=1).astype('int')
print(f'accuracy: {round(accuracy_score(y, y_pred), 2)}')

Not too bad, though not particularly strong either. We’ll include these scores as six new columns.

In [None]:
from sklearn.model_selection import train_test_split


def get_train_val_test():
    food_train = pd.read_csv("data/food_train.csv")

    features_df = food_train.drop("category", axis=1)
    labels_df = food_train["category"]

    # ResNet18 class scores, aligned to the training rows and added as six columns
    image_scores_df = (
        pd.read_csv("data/resnet18_food_train_features_fine_tuned.csv", index_col=0)
        .set_index(food_train.index)
        .drop(columns=["y"])
        .add_prefix("image_scores_")
    )

    features_df = pd.concat([features_df, image_scores_df], axis=1)

    # 80% for training, 20% held out
    X_train, X_val_test, y_train, y_val_test = train_test_split(
        features_df, labels_df, test_size=0.2, random_state=42
    )

    # Split the held-out 20% into validation (15% of all rows) and test (5%)
    X_val, X_test, y_val, y_test = train_test_split(
        X_val_test, y_val_test, test_size=0.25, random_state=42
    )

    return X_train, X_val, X_test, y_train, y_val, y_test

In the second approach, I couldn’t surpass 0.92 accuracy. Given that the data is primarily structured after preprocessing, I opted to rely on tree ensemble methods rather than a large unified network.

### Models

In [None]:
def get_test_set():
    X = pd.read_csv("data/food_test.csv")

    # ResNet18 class scores for the test rows, added as six columns
    image_scores_df = (
        pd.read_csv("data/resnet18_food_test_features_fine_tuned.csv", index_col=0)
        .set_index(X.index)
        .add_prefix("image_scores_")
    )

    X = pd.concat([X, image_scores_df], axis=1)

    return X

def save_predictions(model, path):
    labels = [
        "cakes_cupcakes_snack_cakes",
        "candy",
        "chips_pretzels_snacks",
        "chocolate",
        "cookies_biscuits",
        "popcorn_peanuts_seeds_related_snacks",
    ]

    # Load the test set and run it through the already-fitted pipeline
    X = get_test_set()
    X = pipe.transform(X)

    # Models trained on integer labels return class ids; map them back to category names
    X['pred_cat'] = model.predict(X)
    X['pred_cat'] = X['pred_cat'].apply(lambda x: x if isinstance(x, str) else labels[x])

    X[['idx', 'pred_cat']].to_csv(path, index=False)

##### The models we picked are the following:

-   Random Forest Classifier

-   XGBoost

-   Ensemble of XGBoosts

All models will use the pipeline above for preprocessing.

For the hyperparameter search, we combine X_train and X_val and run 5-fold cross-validation over them; the best model is refit on that combined set, and X_test is used to benchmark performance.
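As a quick sanity check on the split sizes, the two-stage `train_test_split` used in `get_train_val_test` (test_size=0.2, then test_size=0.25 of the held-out part) leaves 80% / 15% / 5% of the rows for train / validation / test. The sketch below reproduces this on 1000 synthetic rows rather than the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows, six balanced classes.
X = np.arange(1000).reshape(-1, 1)
y = np.arange(1000) % 6

# Same two-stage split as in get_train_val_test.
X_train, X_val_test, y_train, y_val_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_val_test, y_val_test, test_size=0.25, random_state=42
)

sizes = (len(X_train), len(X_val), len(X_test))
print("train / val / test:", sizes)
```

With only 5% of the rows in the test set, the benchmark accuracies reported later carry noticeable sampling noise, which is worth keeping in mind when comparing models that differ by a fraction of a percent.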

## 1st Model - Random Forest Classifier

Random Forest was chosen for its robustness and ability to capture complex relationships through aggregated decision trees, making it well-suited for tasks requiring feature importance insights and strong generalization.
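The feature importance insight mentioned above comes from the `feature_importances_` attribute of a fitted forest. The following sketch uses synthetic data in which, by construction, only the first three features are informative; the feature names `f0` to `f9` are placeholders.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# shuffle=False keeps the informative features in the first three columns.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=3, n_redundant=0,
    shuffle=False, random_state=42,
)
feature_names = [f"f{i}" for i in range(X.shape[1])]

rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X, y)

# Impurity-based importances; they sum to 1 across all features.
importances = (
    pd.Series(rf.feature_importances_, index=feature_names)
    .sort_values(ascending=False)
)
print(importances.head(5).round(3))
```

Impurity-based importances tend to favour high-cardinality features, so on a wide count matrix like ours they are best read as a rough ranking rather than an exact measure.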

In [None]:
X_train, X_val, X_test, y_train, y_val, y_test = get_train_val_test()
X = pd.concat([X_val, X_train], axis=0)
y = pd.concat([y_val, y_train], axis=0)
X = pipe.fit_transform(X, y)

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from joblib import dump

# Define the parameter grid
param_grid = {
    'n_estimators': [100, 200, 300, 400, 500],  # Number of trees in the forest
    'criterion': ['gini', 'entropy'],           # Split criterion
    'max_depth': [None, 10, 20, 30, 40],        # Maximum depth of the tree
    'min_samples_split': [2, 5, 10],            # Minimum samples required to split an internal node
    'min_samples_leaf': [1, 2, 4],              # Minimum number of samples required to be at a leaf node
    'bootstrap': [True, False],                 # Whether bootstrap samples are used
    'random_state': [42]                        # Random seed for reproducibility
}

# Create a RandomForestClassifier
rf_classifier = RandomForestClassifier()

# Create RandomizedSearchCV
random_search = RandomizedSearchCV(
    estimator=rf_classifier,
    param_distributions=param_grid,
    n_iter=50,            # Number of parameter settings that are sampled
    scoring='accuracy',   # Scoring metric for evaluation
    cv=5,                 # Cross-validation folds
    verbose=2,
)

# Fit the RandomizedSearchCV to your data
random_search.fit(X, y)


best_model = random_search.best_estimator_
dump(best_model, 'checkpoints/rf/best_model.joblib')

###### Benchmarking over test set:

In [None]:
from joblib import load

X_test = pipe.transform(X_test)

model = load('checkpoints/rf/best_model.joblib')

y_pred = model.predict(X_test)

# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

#### Saving predictions

In [None]:
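# save_predictions loads and transforms the test set itself; reassigning X here
# would overwrite the training matrix that the following cells still fit on
save_predictions(model, 'predictions/model01.csv')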

## 2nd Model - XGBoost

XGBoost was selected for its efficient gradient boosting framework, which
excels on structured data such as ours.

Working with structured data without trying the king of Kaggle at least
once would be a shame. Again, we start with a random search CV:

In [None]:
dmap = {
    "cakes_cupcakes_snack_cakes": 0,
    "candy": 1,
    "chips_pretzels_snacks": 2,
    "chocolate": 3,
    "cookies_biscuits": 4,
    "popcorn_peanuts_seeds_related_snacks": 5,
}
# Encode the category names as integer class ids (0-5) for XGBoost
y = y.map(dmap)
y_test = y_test.map(dmap)

Random Search CV:

In [None]:

from xgboost import XGBClassifier

param_grid = {
    'n_estimators': [100, 200, 300, 400, 500],
    'learning_rate': [0.01, 0.1, 0.3],
    'max_depth': [3, 5, 6, 7, 8],
    'min_child_weight': [1, 2, 4],
    'subsample': [0.8, 0.9, 1.0],
    'colsample_bytree': [0.8, 0.9, 1.0],
    'gamma': [0, 0.1, 0.3],
    'reg_alpha': [0, 0.1, 0.3],
    'reg_lambda': [0, 0.1, 0.3],
}

xgb = XGBClassifier(objective='multi:softmax', num_class=6, random_state=42)

random_search = RandomizedSearchCV(
    estimator=xgb,
    param_distributions=param_grid,
    n_iter=50,
    scoring='accuracy',
    cv=5,
    verbose=3,
    n_jobs=2,
    random_state=42
)

random_search.fit(X, y)

best_params = random_search.best_params_
best_model = random_search.best_estimator_


best_model.save_model('checkpoints/xgb/xgb.model')

Benchmarking over test set:

In [None]:
from xgboost import XGBClassifier


best_xgb = XGBClassifier()
best_xgb.load_model('checkpoints/xgb/xgb.model')

y_pred = best_xgb.predict(X_test)

# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Saving predictions:

In [None]:
save_predictions(best_xgb, 'predictions/model02.csv')

XGBoost outperformed Random Forest, leading us to choose an ensemble of XGBoost models with the optimal parameters from cross-validation as our final model. This ensemble aims to reduce the variance of a single model, enhancing prediction accuracy.

In [None]:
from sklearn.ensemble import BaggingClassifier

best_xgb = XGBClassifier()
best_xgb.load_model('checkpoints/xgb/xgb.model') # best xgb model

ensemble = BaggingClassifier(best_xgb, n_estimators=20, verbose=2, n_jobs=2, random_state=42)
ensemble.fit(X, y)


dump(ensemble, 'checkpoints/xgb_ensemble/ensemble.joblib')

Benchmarking over test set:

In [None]:
from sklearn.ensemble import BaggingClassifier


ensemble = load('checkpoints/xgb_ensemble/ensemble.joblib')
ensemble.set_params(n_jobs=1) # to avoid SEGFAULT in rstudio

y_pred = ensemble.predict(X_test)

# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Saving predictions:

In [None]:
save_predictions(ensemble, 'predictions/model03.csv')

# Conclusions

- **XGBoost outperformed Random Forest** on the same preprocessed features and with the same random-search budget.
- **Deep models** like BERT performed worse than XGBoost on this task.
- **Ensembles don’t always ensure better performance** on the test set.
- **Feature extraction** is crucial for strong performance, regardless of the model choice.
- **scikit-learn, pandas, and PyTorch** are robust frameworks for machine learning.
- **High-feature models** (with a feature count close to 10% of the sample size) performed well and were less prone to overfitting than expected.

---

**Thank you for reading!**