# Introduction

This code is expected to run on Colab, and should be run from top to bottom.

The user is also expected to upload `uw-cs480-winter23.zip` when running the code in the *Upload files in Google Colab* section.

Note that Colab may crash if it reaches its RAM limit. To avoid losing all progress when that happens, uncomment and run the `torch.save` and `torch.load` calls as you work through the notebook.


---
# Report

This code can be divided into 5 parts:
1. Load Data
2. Combine Text & Categorical Data
3. CNN Models
4. Weighted Majority Vote
5. Download Predictions

***Note:***
- Categorical columns refer to `['gender', 'baseColour', 'season', 'usage']`.
- Text column refers to `noisyTextDescription`.

***Part 2: Combine Text & Categorical Data***

Text and categorical data are combined into a single input before being fed into linear support vector classifiers (`LinearSVC` and `SGDClassifier`). Linear SVCs are widely regarded as strong text classification algorithms, hence our decision to use these models.

Preprocessing:
- Categorical columns: one-hot encoding on the categorical data (`OneHotEncoder`).
- Text column: computed the normalized term frequency matrix (`CountVectorizer` followed by `TfidfTransformer`). 
  - The parameters used for `CountVectorizer` enable accent removal, lowercasing, word tokenization, stop-word removal, and keeping only tokens of 2 or more alphanumeric characters (in addition to building our own vocabulary from the train dataset).
  - `TfidfTransformer` is used to transform our count matrix (i.e. the `CountVectorizer` output) into a normalized term frequency representation (`use_idf=False`).
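
To make the normalized term frequency step concrete, here is a minimal pure-Python sketch of what the count-then-normalize step computes for a single description. It assumes the default L2 normalization and the default token pattern; the vocabulary and sentence are made up, and stop-word and accent handling are left out.

```python
import math
import re
from collections import Counter

def normalized_tf(text, vocabulary):
    """L2-normalized term counts for one document over a fixed vocabulary."""
    # lowercase, then keep tokens of 2+ word characters (the default token pattern)
    tokens = re.findall(r"(?u)\b\w\w+\b", text.lower())
    counts = Counter(t for t in tokens if t in vocabulary)
    raw = [counts[term] for term in vocabulary]
    norm = math.sqrt(sum(c * c for c in raw))
    return [c / norm if norm else 0.0 for c in raw]

vocabulary = ["blue", "men", "shirt", "watch"]  # hypothetical vocabulary
vector = normalized_tf("Blue shirt for men, blue a", vocabulary)
print(vector)
```

Every row of the resulting matrix has unit length, so long and short descriptions end up on the same scale.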

Modeling & Bagging:

The preprocessed categorical and text columns are concatenated before being fed into our selected models, which achieve 81-82% validation accuracy. Bagging each model improves this accuracy by a further 0.5-1%. To keep our SVCs weak learners and to decrease computation time, each model is configured with `max_iter=5`.

This was a great improvement: fitting the models on categorical data alone yielded 55% validation accuracy, while fitting on text data alone yielded 78-79% validation accuracy. In addition, adding the categorical columns allows us to somewhat mitigate the noise in the text data, leading to better performance.

***Part 3: CNN Models***

In Part 3, I used 5 convolutional neural networks, each fed with different inputs. The CNN architecture is the same for all 5 models except for the first hidden fully connected layer, which is explained in detail at the end of this part.

The CNN architecture consists of 3 convolutional blocks. Each block consists of two 3x3 convolutional layers followed by a single 2x2 max pooling layer, except that the second block uses a 3x2 pooling layer (`nn.MaxPool2d((2,3), (2,3))` in the code) to avoid rounding, as our images are 60x80. A dropout probability of 0.25 is applied after each max pooling layer and a dropout probability of 0.5 is applied after the first hidden fully connected layer.
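
The pooling choice can be checked with a little arithmetic. The sketch below traces the spatial size through the three blocks, assuming the image tensors are laid out as 80 rows by 60 columns (the orientation that matches `128*5*10` in `fc1`) and that the `'same'`-padded convolutions leave height and width unchanged.

```python
def pooled(size, kernel):
    """Output length of a max-pool whose stride equals its kernel (floor division)."""
    return size // kernel

height, width = 80, 60             # assumed layout: 80 rows, 60 columns
pools = [(2, 2), (2, 3), (2, 2)]   # (height, width) kernels of the three blocks

for kernel_h, kernel_w in pools:
    height, width = pooled(height, kernel_h), pooled(width, kernel_w)
    print(height, width)

channels = 128
flattened = channels * height * width
print(flattened)
```

With a plain 2x2 pool in the second block, the width would go 60, 30, 15, 7 and lose a column to rounding in the last block; dividing by 3 in the middle keeps every division exact.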

For all CNNs below, the images are treated the same as follows:
- Images: The images are transformed into tensors before getting normalized using the mean and standard deviation of the entire train dataset.

* CNN1 - images & categorical data:
  - Categorical columns: As all the categorical columns contain string values, I lowercased and joined the strings of each training instance (e.g. `sandal men tan summer`) and then obtained a sentence embedding using the pretrained 25-dimensional GloVe embedding via the `zeugma` package.

* CNN2 - images & categorical data:
  - Categorical columns: one-hot encoding on the categorical data (`OneHotEncoder`).

* CNN3 - images only.

* CNN4 - images, categorical and text data:
  - Categorical columns: one-hot encoding on the categorical data (`OneHotEncoder`).
  - Text column: computed the normalized term frequency matrix (`CountVectorizer` followed by `TfidfTransformer`), as above.

* CNN5 - images, categorical and text data:
  - Categorical columns: lowercased and joined the strings of each training instance (e.g. `sandal men tan summer`) and then obtained a sentence embedding using the pretrained 25-dimensional GloVe embedding via the `zeugma` package.
  - Text column: computed the normalized term frequency matrix (`CountVectorizer` followed by `TfidfTransformer`), as above.

To make use of the text/categorical data in CNN1, CNN2, CNN4, and CNN5, we concatenate the extra data with the flattened convolutional output that feeds the first fully connected hidden layer, e.g. CNN3: `self.fc1 = nn.Linear(128*5*10,1024)` vs CNN5: `self.fc1 = nn.Linear(128*5*10+8462,1024)`, where the 8462 extra input nodes account for both the categorical and text inputs.
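
As a rough illustration of how the input width of `fc1` is put together, the sketch below concatenates a flattened image vector with extra features and tabulates the widths that appear in the `fc1` definitions of this notebook. The helper `fuse` is hypothetical, and the term-frequency column count is inferred by subtraction rather than stated in the report.

```python
def fuse(image_features, extra_features):
    """Concatenate flattened image features with the extra (tabular) features."""
    return list(image_features) + list(extra_features)

image_width = 128 * 5 * 10  # flattened output of the last convolutional block

# extra input widths taken from the fc1 definitions in this notebook
extra_width = {
    "CNN1": 25,              # 25-dimensional GloVe embedding
    "CNN2": 5 + 46 + 4 + 7,  # one-hot columns of the categorical fields
    "CNN3": 0,               # images only
    "CNN5": 8462,            # GloVe embedding + term-frequency columns
}
fc1_in_features = {name: image_width + extra for name, extra in extra_width.items()}

# inferred, not stated in the report: what remains after removing the embedding
tf_columns = extra_width["CNN5"] - extra_width["CNN1"]

fused = fuse([0.0] * image_width, [1.0] * extra_width["CNN1"])
print(len(fused), fc1_in_features, tf_columns)
```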


***Model Validation Performances***:
- Bagged LinearSVC: 82.80%.
- Bagged SGDClassifier: 82.28%.
- CNN1: 83.65% (best performing epoch: 11).
- CNN2: 84.13% (best performing epoch: 7).
- CNN3: 80.62% (best performing epoch: 7).
- CNN4: 90.29% (best performing epoch: 7).
- CNN5: 89.07% (best performing epoch: 7).

For the CNN models, the best performing epoch is the one with the lowest validation loss; we train for that many epochs when fitting the models on the full training set.
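
A minimal sketch of that selection rule, assuming one validation loss was recorded per epoch. The loss values and the helper name `best_epoch` are made up for illustration and are not results from this notebook.

```python
def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    best_index = min(range(len(val_losses)), key=val_losses.__getitem__)
    return best_index + 1

val_losses = [0.92, 0.71, 0.60, 0.55, 0.57, 0.61]  # made-up values
epochs_for_full_training = best_epoch(val_losses)
print(epochs_for_full_training)
```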

***Part 4: Weighted Majority Vote***

As the SVC models used here (`LinearSVC` and the hinge-loss `SGDClassifier`) do not provide probability estimates in Scikit-Learn, we instead use the normalized validation accuracies as weights in a weighted majority vote. For example, CNN3 has the lowest weight while CNN4 has the highest weight. This weighted approach allows us to generalize better while giving less weight to models with noisier predictions (like CNN3).

Using this method, we were able to achieve a 91% validation accuracy despite having only 2 models with ~90% validation accuracy.
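
The voting scheme can be sketched in a few lines of plain Python. The weights below are the validation accuracies listed above, normalized by dividing by their sum (an assumption about what "normalized" means here). The per-sample predictions and label names are hypothetical, and `weighted_majority_vote` is an illustrative helper, not the notebook's own function.

```python
from collections import defaultdict

def weighted_majority_vote(predictions, weights):
    """Pick, for each sample, the label with the largest total model weight."""
    n_samples = len(next(iter(predictions.values())))
    voted = []
    for i in range(n_samples):
        score = defaultdict(float)
        for model, labels in predictions.items():
            score[labels[i]] += weights[model]
        voted.append(max(score, key=score.get))
    return voted

val_accuracy = {
    "LinearSVC": 0.8280, "SGDClassifier": 0.8228, "CNN1": 0.8365,
    "CNN2": 0.8413, "CNN3": 0.8062, "CNN4": 0.9029, "CNN5": 0.8907,
}
total = sum(val_accuracy.values())
weights = {model: acc / total for model, acc in val_accuracy.items()}

# hypothetical predictions for two samples
predictions = {
    "LinearSVC": ["Shoes", "Belts"],
    "SGDClassifier": ["Shoes", "Belts"],
    "CNN1": ["Shoes", "Bags"],
    "CNN2": ["Sandal", "Watch"],
    "CNN3": ["Sandal", "Belts"],
    "CNN4": ["Sandal", "Watch"],
    "CNN5": ["Sandal", "Watch"],
}
voted = weighted_majority_vote(predictions, weights)
print(voted)
```

In the second sample, a plain vote would be tied three to three; the weights break the tie in favour of the models with higher validation accuracy.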

---
# Upload files in Google Colab
Upload the `uw-cs480-winter23.zip` file, then run the following commands to unzip it.


In [2]:
from google.colab import files
uploaded = files.upload()
%ls

!unzip uw-cs480-winter23.zip
!ls

Streaming output truncated to the last 5000 lines.
  inflating: noisy-images/noisy-images/58012.jpg  
  inflating: noisy-images/noisy-images/58013.jpg  
  inflating: noisy-images/noisy-images/58014.jpg  
  inflating: noisy-images/noisy-images/58015.jpg  
  inflating: noisy-images/noisy-images/58016.jpg  
  inflating: noisy-images/noisy-images/58017.jpg  
  inflating: noisy-images/noisy-images/58018.jpg  
  inflating: noisy-images/noisy-images/5802.jpg  
  inflating: noisy-images/noisy-images/58020.jpg  
  inflating: noisy-images/noisy-images/58022.jpg  
  inflating: noisy-images/noisy-images/58023.jpg  
  inflating: noisy-images/noisy-images/58024.jpg  
  inflating: noisy-images/noisy-images/58027.jpg  
  inflating: noisy-images/noisy-images/58028.jpg  
  inflating: noisy-images/noisy-images/5803.jpg  
  inflating: noisy-images/noisy-images/58030.jpg  
  inflating: noisy-images/noisy-images/58032.jpg  
  inflating: noisy-images/noisy-images/58035.jpg  
  inflating: noisy-

# Required Libraries

In [3]:
!pip install zeugma
!pip install dill

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting zeugma
  Downloading zeugma-0.49.tar.gz (9.9 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: zeugma
  Building wheel for zeugma (setup.py) ... done
  Created wheel for zeugma: filename=zeugma-0.49-py3-none-any.whl size=8809 sha256=9432af0866ba2c1f9a243ae4ffd6b30bec33c9e88f78d1c6f6f45e214448ea05
  Stored in directory: /root/.cache/pip/wheels/81/eb/a0/b3178c0a0fa7de13d140cc79f3106867f5c74540e2d66b85f7
Successfully built zeugma
Installing collected packages: zeugma
Successfully installed zeugma-0.49
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting dill
  Downloading dill-0.3.6-py3-none-any.whl (110 kB)
     110.5/110.5 kB 13.9 MB/s eta 0:00:00
Installing collected packages: dil

In [4]:
%matplotlib inline

import os.path
import shutil
from google.colab import files
import dill

import matplotlib.pyplot as plt
import seaborn as sns

import numpy as np
import pandas as pd

# used for preprocessing & defining models
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer
from sklearn.base import clone
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer

# for text embedding
from zeugma.embeddings import EmbeddingTransformer
glove = EmbeddingTransformer('glove')

# for combined text and categorical models
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier

# for images
from skimage import io
from tqdm import tqdm
import torch
from torch import nn, optim
from torch.utils.data import DataLoader
import torchvision
from torchvision.transforms import ToTensor
import torchvision.transforms as transforms
import torch.nn.functional as F

# ensemble learning
from sklearn.ensemble import BaggingClassifier



In [5]:
import time
import math

def timeSince(since):
    now = time.time()
    s = now - since
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)  

In [6]:
# Get cpu or gpu device for training.
device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
print(f"Using {device} device")

Using cuda device


# Part 1: Load Data

In [7]:
# define variables
cat_cols = ['gender', 'baseColour', 'season', 'usage']
text_cols = ['noisyTextDescription']
target = 'category'

In [8]:
def load_data(filename='train.csv'):
  df = pd.read_csv(filename, index_col='id')
  return df

# load train and test data
train_df = load_data(filename='train.csv')
test_df = load_data(filename='test.csv')

# split into X_train, y_train, X_val (X_val holds the test-set features)
X_train = train_df[cat_cols + text_cols]
y_train = train_df[target]
X_val = test_df

In [9]:
# define dictionaries used to encode classes into indices and vice versa
classes = train_df[target].unique()
class_to_idx = {c: i for i, c in enumerate(classes)}  # avoid shadowing the builtin id
idx_to_class = {i: c for i, c in enumerate(classes)}

# encode y_train into indices
y_train_transformed = np.vectorize(class_to_idx.get)(y_train)
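
For reference, a small pure-Python sketch of the same encode/decode round trip, using made-up labels. The decoding direction is what lets integer CNN predictions be mapped back to category names.

```python
labels = ["Shirts", "Watches", "Shirts", "Sandals"]  # hypothetical labels

# unique values in order of first appearance, like Series.unique()
classes = list(dict.fromkeys(labels))
class_to_idx = {c: i for i, c in enumerate(classes)}
idx_to_class = {i: c for c, i in class_to_idx.items()}

encoded = [class_to_idx[c] for c in labels]
decoded = [idx_to_class[i] for i in encoded]
print(encoded, decoded)
```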

# Part 2: Combine Text & Categorical Data

- Categorical data: applied one-hot encoding.
- Text data: computed normalized term frequency matrix of noisyTextDescription column.

Concatenate the preprocessed categorical and text data to be fitted later.

In [8]:
def squeeze(x):
  return x.squeeze()

def toarray(x):
  return x.toarray()

# one-hot encode categorical columns
cat_pipeline = Pipeline(steps=[
    ('one-hot',OneHotEncoder())
])

# get normalized term frequency matrix of noisyTextDescription column
text_pipeline = Pipeline(steps=[
    ('squeeze', FunctionTransformer(squeeze)),
    ('vect', CountVectorizer(stop_words='english', strip_accents='unicode')),
    ('tfidf', TfidfTransformer(use_idf=False)),
    ('toarray', FunctionTransformer(toarray)),
])

# combined transformation on all feature columns
col_trans = ColumnTransformer(transformers=[
    ('cat_pipeline', cat_pipeline, cat_cols),
    ('text_pipeline', text_pipeline, text_cols),
    ],
    n_jobs=-1)

## Model 1: LinearSVC

In [9]:
########## try with LinearSVC ##########
### base
clf = LinearSVC(max_iter=5)

clf_pipeline_LinearSVC = Pipeline(steps=[
    ('col_trans', col_trans),
    ('base', clf)
])

### bagging
estimator = clone(clf_pipeline_LinearSVC.steps[1][1])
bag_clf_pipeline_LinearSVC = Pipeline(steps=[
    ('col_trans', col_trans),
    ('bag', BaggingClassifier(estimator = estimator))
])
display(bag_clf_pipeline_LinearSVC)

# fit the bagged pipeline
bag_clf_pipeline_LinearSVC.fit(X_train, y_train)



## Model 2: SGDClassifier

In [10]:
########## try with SGDClassifier ##########
### base
clf = SGDClassifier(loss='hinge', penalty='l2',
                           alpha=1e-4, max_iter=5, tol=None)

clf_pipeline_SGDClassifier = Pipeline(steps=[
    ('col_trans', col_trans),
    ('base', clf)
])

### bagging
estimator = clone(clf_pipeline_SGDClassifier.steps[1][1])
bag_clf_pipeline_SGDClassifier = Pipeline(steps=[
    ('col_trans', col_trans),
    ('bag', BaggingClassifier(estimator = estimator))
])
display(bag_clf_pipeline_SGDClassifier)

# fit the bagged pipeline
bag_clf_pipeline_SGDClassifier.fit(X_train, y_train)

In [11]:
# # uncomment and run this if you'd like to save the models
# # highly recommend saving in case colab crashes due to reaching RAM usage limit
# torch.save(obj = bag_clf_pipeline_LinearSVC, f = 'test_bag_clf_pipeline_LinearSVC.pkl', pickle_module=dill)
# torch.save(obj = bag_clf_pipeline_SGDClassifier, f = 'test_bag_clf_pipeline_SGDClassifier.pkl', pickle_module=dill)

In [16]:
# # call garbage collector if necessary
# import gc
# gc.collect()

232

# Part 3: CNN Models

## Part 3.0: Useful Functions & Classes

In [10]:
# script parameters
batch_size = 64
log_interval = 100

In [11]:
# define CustomDataset class
class CustomDataset(torch.utils.data.Dataset):
    def __init__(self, root_dir, X_index, X, y = None, transform=None, return_extra_variables=True):
        self.X = X
        self.y = y
        self.root_dir = root_dir
        self.transform = transform
        self.X_index = X_index
        self.return_extra_variables = return_extra_variables

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        # image
        if torch.is_tensor(idx):
            idx = idx.tolist()
        img_name = os.path.join(self.root_dir, str(self.X_index[idx]) + '.jpg')
        image = io.imread(img_name)
        target = self.y[idx] if self.y is not None else 0 # placeholder label when no targets are given

        if self.transform:
            image = self.transform(image)

        # data variable could contain one-hot encoded/embedded categorical data and/or
        # term frequency matrix of noisyTextDescription column
        data = torch.tensor(self.X[idx].copy())

        if self.return_extra_variables:
          return (image, data), target
        return image, target

In [12]:
def prepare_dataloaders(transform, X_train_index, X_val_index, X_train, y_train, X_val, y_val=None, 
                        batch_size=batch_size, root_dir='noisy-images/noisy-images', return_extra_variables=True):
  # load training set
  trainset = CustomDataset(root_dir=root_dir, X=X_train, y=y_train, transform=transform, X_index=X_train_index, 
                           return_extra_variables=return_extra_variables)
  train_dataloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size, shuffle=True, num_workers=2)

  # load test set
  testset = CustomDataset(root_dir=root_dir, X=X_val, y=y_val, transform=transform, X_index=X_val_index, 
                          return_extra_variables=return_extra_variables)
  test_dataloader = torch.utils.data.DataLoader(testset, batch_size=batch_size, shuffle=False, num_workers=2)

  return train_dataloader, test_dataloader

# calculate mean and standard deviation of dataset
def batch_mean_and_sd(loader):
    
    cnt = 0
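    # start from zeros: torch.empty returns uninitialized memory, which can contaminate the first update
    fst_moment = torch.zeros(3)
    snd_moment = torch.zeros(3)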

    for train, _ in loader:
        images = train[0]
        b, c, h, w = images.shape
        nb_pixels = b * h * w
        sum_ = torch.sum(images, dim=[0, 2, 3])
        sum_of_square = torch.sum(images ** 2,
                                  dim=[0, 2, 3])
        fst_moment = (cnt * fst_moment + sum_) / (cnt + nb_pixels)
        snd_moment = (cnt * snd_moment + sum_of_square) / (cnt + nb_pixels)
        cnt += nb_pixels

    mean, std = fst_moment, torch.sqrt(snd_moment - fst_moment ** 2)        
    return mean,std
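
The update rule used in `batch_mean_and_sd` can be checked on plain numbers. The following is a sketch of the same running first and second moment computation for a single channel, with made-up batches; `running_mean_and_std` is an illustrative name.

```python
import math

def running_mean_and_std(batches):
    """Streaming mean and population standard deviation over batches of values."""
    count = 0
    first_moment = 0.0   # running E[x], starts at zero so the first update is well defined
    second_moment = 0.0  # running E[x^2]
    for batch in batches:
        n = len(batch)
        first_moment = (count * first_moment + sum(batch)) / (count + n)
        second_moment = (count * second_moment + sum(v * v for v in batch)) / (count + n)
        count += n
    return first_moment, math.sqrt(second_moment - first_moment ** 2)

mean_value, std_value = running_mean_and_std([[1.0, 2.0], [3.0, 4.0, 5.0]])
print(mean_value, std_value)
```

The result matches computing the mean and standard deviation over all five values at once, which is why the statistics do not depend on the batch size.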

In [13]:
# function to train CNN
def train(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    model.train()
    train_loss, correct = 0, 0
    for batch, (X, y) in enumerate(dataloader):
        y = y.to(device)
        if model.__class__.__name__ != 'CNN_deeper_dropout_v2':
          images = X[0].to(device)
          data = X[1].to(device)
          pred = model(images, data)
        else:
          X = X.to(device)
          pred = model(X)

        # Compute prediction error
        loss = loss_fn(pred, y)
        train_loss += loss.item()
        correct += (pred.argmax(1) == y).type(torch.float).sum().item()

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    average_train_loss = train_loss / num_batches
    accuracy = correct / size
    return accuracy, average_train_loss

In [14]:
# function to generate CNN predictions
def predict_images(dataloader, model):
    predictions = torch.Tensor([]).to(device)
    prob_predictions = torch.Tensor([]).to(device)
    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    model.eval()
    test_loss, correct = 0, 0
    with torch.no_grad():
        for batch, (X, y) in enumerate(dataloader):
            y = y.to(device)
            if model.__class__.__name__ != 'CNN_deeper_dropout_v2':
              images = X[0].to(device)
              data = X[1].to(device)
              pred = model(images, data)
            else:
              X = X.to(device)
              pred = model(X)

            correct += (pred.argmax(1) == y).type(torch.float).sum().item()
            predictions = torch.cat((predictions, pred.argmax(1)))
            prob_predictions = torch.cat((prob_predictions, torch.exp(pred))) # get probabilities
    accuracy = correct / size
    return accuracy, predictions.cpu().numpy(), prob_predictions.cpu().numpy()
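
The `torch.exp(pred)` call above relies on the models returning log-probabilities (the models defined later in this notebook end with `F.log_softmax`). A pure-Python sketch with made-up logits shows why exponentiating recovers probabilities that sum to one:

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax for one row of logits."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_sum for v in logits]

log_probs = log_softmax([2.0, 0.5, -1.0])  # made-up logits for three classes
probs = [math.exp(v) for v in log_probs]   # the step torch.exp(pred) performs
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted_class)
```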

## Part 3.1: Combine Images & Categorical Data

### Method 1: One-hot encoding on categorical data

- Images: transform to Tensor and normalize images using mean and std of whole train dataset.
- Categorical data: apply one-hot encoding.

In [15]:
# one hot encoding pipeline
cat_pipeline_onehot = Pipeline(steps=[
    ('one-hot',OneHotEncoder(sparse_output = False, dtype = np.float32))
])
col_trans_onehot = ColumnTransformer(transformers=[
    ('cat_pipeline', cat_pipeline_onehot, cat_cols),
    ],
    remainder='drop',
    n_jobs=-1)

clf_pipeline_onehot = Pipeline(steps=[
    ('col_trans', col_trans_onehot)
])

In [16]:
clf_pipeline_onehot.fit(X_train)
X_train_onehot = clf_pipeline_onehot.transform(X_train)
X_val_onehot = clf_pipeline_onehot.transform(X_val)

Before normalizing:

In [17]:
# transform to Tensor
transform = transforms.Compose([
    transforms.ToTensor(),
    ])

train_dataloader_onehot, test_dataloader_onehot = prepare_dataloaders(X_train=X_train_onehot, y_train=y_train_transformed, 
                                                        X_val=X_val_onehot, X_train_index = X_train.index, X_val_index = X_val.index, 
                                                        transform=transform)

In [18]:
mean, std = batch_mean_and_sd(train_dataloader_onehot)
print("mean and std: \n", mean, std)


mean and std: 
 tensor([0.8189, 0.8041, 0.7972]) tensor([0.2229, 0.2339, 0.2368])


Use the above mean and standard deviation to normalize our dataset.

After normalizing:

In [19]:
# define transformations
transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize(mean, std),
    ])

train_dataloader_onehot, test_dataloader_onehot = prepare_dataloaders(X_train=X_train_onehot, y_train=y_train_transformed, 
                                                        X_val=X_val_onehot, X_train_index = X_train.index, X_val_index = X_val.index,
                                                         transform=transform)

### Method 2: Categorical embeddings

- Images: transform to Tensor and normalize images using mean and std of whole train dataset.
- Categorical data: lowercase and concatenate the strings in all categorical columns before converting to embeddings using pretrained 25-dimensional GloVe embeddings.

In [20]:
# Here glove is a sklearn transformer with the standard transform method; it takes a list of sentences as input
# and outputs a design matrix, just like TfidfTransformer.
# You can get the resulting embeddings with:
# embeddings = glove.transform(['first sentence of the corpus', 'another sentence'])
# and embeddings would contain a 2 x N matrix, where N is the dimension of the chosen embedding.
def embed_categorical_columns(x):
  cat_cols_lowercase = x[cat_cols].apply(lambda col: col.astype(str).str.lower())
  cat_cols_joined = cat_cols_lowercase.apply(" ".join, axis=1)
  embed_col = glove.transform(cat_cols_joined)
  return embed_col
  # return pd.DataFrame(embed_col, index=x.index)

clf_pipeline_embed = Pipeline(steps=[
    ('embed_categorical_columns', FunctionTransformer(embed_categorical_columns)),
])
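
For intuition, sentence embeddings of this kind are commonly built by averaging the vectors of the words in the sentence. The sketch below assumes that aggregation (believed to be the default in `zeugma`, but not verified here) and uses a made-up 3-dimensional vocabulary instead of the real 25-dimensional GloVe vectors.

```python
def embed_sentence(sentence, word_vectors, dim):
    """Average the vectors of known words; unknown words are skipped."""
    known = [word_vectors[w] for w in sentence.lower().split() if w in word_vectors]
    if not known:
        return [0.0] * dim
    return [sum(values) / len(known) for values in zip(*known)]

# hypothetical toy vectors standing in for pretrained GloVe embeddings
toy_vectors = {
    "men": [1.0, 0.0, 2.0],
    "tan": [0.0, 1.0, 0.0],
    "summer": [2.0, 2.0, 1.0],
}
embedding = embed_sentence("Men Tan Summer Casual", toy_vectors, dim=3)
print(embedding)
```

Each row of the categorical data therefore becomes one fixed-length vector, whatever the number of words in the joined string.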

In [21]:
clf_pipeline_embed.fit(X_train)
X_train_embed = clf_pipeline_embed.transform(X_train)
X_val_embed = clf_pipeline_embed.transform(X_val)

In [22]:
# define transformations
transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize(mean, std),
    ])

train_dataloader_embed, test_dataloader_embed = prepare_dataloaders(X_train=X_train_embed, y_train=y_train_transformed, 
                                                        X_val=X_val_embed, X_train_index = X_train.index, X_val_index = X_val.index,
                                                        transform=transform)

## Part 3.2: Images Only

- Images: transform to Tensor and normalize images using mean and std of whole train dataset.

In [23]:
# transform to Tensor
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean, std),
    ])

train_dataloader_original, test_dataloader_original = prepare_dataloaders(X_train=X_train_embed, y_train=y_train_transformed, 
                                                        X_val=X_val_embed, X_train_index = X_train.index, X_val_index = X_val.index,
                                                        transform=transform,
                                                        return_extra_variables=False)

## Part 3.3: Combine Images, Categorical & Text Data

### Method 1: One-hot encoding on categorical data, TF on text data

- Images: transform to Tensor and normalize images using mean and std of whole train dataset.
- Categorical data: apply one-hot encoding.
- Text data: compute normalized term frequency matrix of noisyTextDescription column.

In [24]:
def squeeze(x):
  return x.squeeze()

def toarray(x):
  return x.toarray()

def change_dtype(x):
  return x.astype('float32')

# default token pattern
text_pipeline = Pipeline(steps=[
    ('squeeze', FunctionTransformer(squeeze)),
    ('vect', CountVectorizer(stop_words='english', strip_accents='unicode')),
    ('tfidf', TfidfTransformer(use_idf=False)),
    ('toarray', FunctionTransformer(toarray)),
    ('change_dtype', FunctionTransformer(change_dtype)),
])

# one hot encoding pipeline
cat_pipeline_onehot = Pipeline(steps=[
    ('one-hot',OneHotEncoder(sparse_output = False, dtype = np.float32))
])
col_trans_all = ColumnTransformer(transformers=[
    ('cat_pipeline', cat_pipeline_onehot, cat_cols),
    ('text_pipeline', text_pipeline, text_cols),
    ],
    remainder='drop',
    n_jobs=-1)

clf_pipeline_all = Pipeline(steps=[
    ('col_trans', col_trans_all)
])

In [25]:
clf_pipeline_all.fit(X_train)
X_train_all = clf_pipeline_all.transform(X_train)
X_val_all = clf_pipeline_all.transform(X_val)

In [26]:
# define transformations
transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize(mean, std),
    ])

train_dataloader_all, test_dataloader_all = prepare_dataloaders(X_train=X_train_all, y_train=y_train_transformed, 
                                                        X_val=X_val_all, X_train_index = X_train.index, X_val_index = X_val.index,
                                                        transform=transform)

### Method 2: Categorical embeddings, TF on text data

- Images: transform to Tensor and normalize images using mean and std of whole train dataset.
- Categorical data: lowercase and concatenate the strings in all categorical columns before converting to embeddings using pretrained 25-dimensional GloVe embeddings.
- Text data: compute normalized term frequency matrix of noisyTextDescription column.

In [27]:
cat_pipeline_embed = Pipeline(steps=[
    ('embed_categorical_columns', FunctionTransformer(embed_categorical_columns)),
])

col_trans_all_embed = ColumnTransformer(transformers=[
    ('cat_pipeline', cat_pipeline_embed, cat_cols + text_cols), # use all columns since function takes in whole dataframe
    ('text_pipeline', text_pipeline, text_cols),
    ],
    remainder='drop',
    n_jobs=-1)

clf_pipeline_all_embed = Pipeline(steps=[
    ('col_trans', col_trans_all_embed)
])

In [28]:
clf_pipeline_all_embed.fit(X_train)
X_train_all_embed = clf_pipeline_all_embed.transform(X_train)
X_val_all_embed = clf_pipeline_all_embed.transform(X_val)

In [29]:
# define transformations
transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize(mean, std),
    ])

train_dataloader_all_embed, test_dataloader_all_embed = prepare_dataloaders(X_train=X_train_all_embed, y_train=y_train_transformed, 
                                                        X_val=X_val_all_embed, X_train_index = X_train.index, X_val_index = X_val.index,
                                                        transform=transform)

## Part 3.4: Model Data


In [None]:
# negative log-likelihood loss (the models output log-probabilities via log_softmax)
loss_fn = nn.NLLLoss()
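
As a sanity check on what this loss measures, here is a pure-Python sketch of the mean negative log-likelihood over a batch, assuming the default mean reduction and no class weights. The probabilities and targets are made up.

```python
import math

def nll_loss(log_probs, targets):
    """Mean negative log-likelihood of the target class over a batch."""
    picked = [row[target] for row, target in zip(log_probs, targets)]
    return -sum(picked) / len(targets)

log_probs = [
    [math.log(0.7), math.log(0.2), math.log(0.1)],
    [math.log(0.1), math.log(0.8), math.log(0.1)],
]
targets = [0, 1]
loss = nll_loss(log_probs, targets)
print(loss)
```

Paired with a `log_softmax` output layer, this is the usual cross-entropy objective for multi-class classification.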

### Model 1: CNN with images and categorical embeddings

In [44]:
# Define model - hidden nodes = 1024, more blocks, added new pooling
class CNN_deeper_dropout_v4_add_cat_embed(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 64, 3, padding='same')
        self.conv1b = nn.Conv2d(64, 64, 3, padding='same')
        self.conv2 = nn.Conv2d(64, 128, 3, padding='same')
        self.conv2b = nn.Conv2d(128, 128, 3, padding='same')
        self.pool = nn.MaxPool2d(2, 2)
        self.pool2 = nn.MaxPool2d((2,3), (2,3))
        self.fc1 = nn.Linear(128*5*10+25,1024)
        self.fc2 = nn.Linear(1024, 27)
        self.dropout1 = nn.Dropout(p=0.25)
        self.dropout2 = nn.Dropout(p=0.5)

    def forward(self, image, cat_embed):
        image = F.relu(self.conv1(image))
        image = F.relu(self.conv1b(image))
        image = self.pool(image)
        image = self.dropout1(image)

        image = F.relu(self.conv2(image))
        image = F.relu(self.conv2b(image))
        image = self.pool2(image)
        image = self.dropout1(image)

        # third block: conv2b is applied again, so it shares weights with the second block
        image = F.relu(self.conv2b(image))
        image = F.relu(self.conv2b(image))
        image = self.pool(image)
        image = self.dropout1(image)

        x1 = torch.flatten(image, 1) # flatten all dimensions except batch
        x2 = cat_embed
        x = torch.cat((x1, x2), dim=1)
        
        x = F.relu(self.fc1(x))
        x = self.dropout2(x)
        x = F.log_softmax(self.fc2(x), 1)
        return x

In [37]:
# define model and optimizer
model_deeper_dropout_v4_add_cat_embed = CNN_deeper_dropout_v4_add_cat_embed().to(device)
optimizer_deeper_dropout_v4_add_cat_embed = torch.optim.Adam(model_deeper_dropout_v4_add_cat_embed.parameters())


model_optimizer_dict = {
    'test_CNN_deeper_dropout_v4_add_cat_embed': (model_deeper_dropout_v4_add_cat_embed, 
                                                 optimizer_deeper_dropout_v4_add_cat_embed, 
                                                 train_dataloader_embed, 
                                                 test_dataloader_embed),
    }

In [35]:
best_epoch = 11 # best performing epoch

for model_type, model_optimizer_train_test_pair in model_optimizer_dict.items():
  model = model_optimizer_train_test_pair[0]
  optimizer = model_optimizer_train_test_pair[1]
  train_dataloader = model_optimizer_train_test_pair[2]
  test_dataloader = model_optimizer_train_test_pair[3]
  print(model_type)
  for t in tqdm(range(best_epoch)):
    # train
    train_accuracy, average_train_loss = train(train_dataloader, model, loss_fn, optimizer)
    print(f"Epoch {t+1}:\t Train accuracy: {100*train_accuracy:0.1f}%")
  print('\n')
  torch.save(model, '{}.pt'.format(model_type))
  # files.download('{}.pt'.format(model_type))  

test_CNN_deeper_dropout_v4_add_cat_embed


  9%|▉         | 1/11 [00:39<06:35, 39.54s/it]

Epoch 1:	 Train accuracy: 56.3%


 18%|█▊        | 2/11 [01:03<04:34, 30.46s/it]

Epoch 2:	 Train accuracy: 74.6%


 27%|██▋       | 3/11 [01:30<03:48, 28.61s/it]

Epoch 3:	 Train accuracy: 78.1%


 36%|███▋      | 4/11 [01:53<03:06, 26.60s/it]

Epoch 4:	 Train accuracy: 79.9%


 45%|████▌     | 5/11 [02:15<02:30, 25.04s/it]

Epoch 5:	 Train accuracy: 81.3%


 55%|█████▍    | 6/11 [02:39<02:02, 24.47s/it]

Epoch 6:	 Train accuracy: 82.5%


 64%|██████▎   | 7/11 [03:01<01:35, 23.83s/it]

Epoch 7:	 Train accuracy: 83.1%


 73%|███████▎  | 8/11 [03:24<01:10, 23.36s/it]

Epoch 8:	 Train accuracy: 84.0%


 82%|████████▏ | 9/11 [03:47<00:46, 23.31s/it]

Epoch 9:	 Train accuracy: 84.3%


 91%|█████████ | 10/11 [04:10<00:23, 23.22s/it]

Epoch 10:	 Train accuracy: 84.9%


100%|██████████| 11/11 [04:32<00:00, 24.75s/it]

Epoch 11:	 Train accuracy: 85.3%








### Model 2: CNN with images and OHE categorical columns

In [45]:
# Define model - hidden nodes = 1024, more blocks, added new pooling
class CNN_deeper_dropout_v3_add_cat(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 64, 3, padding='same')
        self.conv1b = nn.Conv2d(64, 64, 3, padding='same')
        self.conv2 = nn.Conv2d(64, 128, 3, padding='same')
        self.conv2b = nn.Conv2d(128, 128, 3, padding='same')
        self.pool = nn.MaxPool2d(2, 2)
        self.pool2 = nn.MaxPool2d((2,3), (2,3))
        self.fc1 = nn.Linear(128*5*10+(5+46+4+7),1024)
        self.fc2 = nn.Linear(1024, 27)
        self.dropout1 = nn.Dropout(p=0.25)
        self.dropout2 = nn.Dropout(p=0.5)

    def forward(self, image, cat_data):
        image = F.relu(self.conv1(image))
        image = F.relu(self.conv1b(image))
        image = self.pool(image)
        image = self.dropout1(image)

        image = F.relu(self.conv2(image))
        image = F.relu(self.conv2b(image))
        image = self.pool2(image)
        image = self.dropout1(image)

        # third block: conv2b is applied again, so it shares weights with the second block
        image = F.relu(self.conv2b(image))
        image = F.relu(self.conv2b(image))
        image = self.pool(image)
        image = self.dropout1(image)

        x1 = torch.flatten(image, 1) # flatten all dimensions except batch
        x2 = cat_data
        x = torch.cat((x1, x2), dim=1)
        
        x = F.relu(self.fc1(x))
        x = self.dropout2(x)
        x = F.log_softmax(self.fc2(x), 1)
        return x

In [49]:
# define model and optimizer
model_deeper_dropout_v3_add_cat = CNN_deeper_dropout_v3_add_cat().to(device)
optimizer_deeper_dropout_v3_add_cat = torch.optim.Adam(model_deeper_dropout_v3_add_cat.parameters())

model_optimizer_dict = {
    'test_CNN_deeper_dropout_v3_add_cat': (model_deeper_dropout_v3_add_cat, 
                                           optimizer_deeper_dropout_v3_add_cat, 
                                           train_dataloader_onehot, 
                                           test_dataloader_onehot),
    }

In [50]:
best_epoch = 7 # best performing epoch

for model_type, model_optimizer_train_test_pair in model_optimizer_dict.items():
  model = model_optimizer_train_test_pair[0]
  optimizer = model_optimizer_train_test_pair[1]
  train_dataloader = model_optimizer_train_test_pair[2]
  test_dataloader = model_optimizer_train_test_pair[3]
  print(model_type)
  for t in tqdm(range(best_epoch)):
    # train
    train_accuracy, average_train_loss = train(train_dataloader, model, loss_fn, optimizer)
    print(f"Epoch {t+1}:\t Train accuracy: {100*train_accuracy:0.1f}%")
  print('\n')
  torch.save(model, '{}.pt'.format(model_type))
  # files.download('{}.pt'.format(model_type))  

test_CNN_deeper_dropout_v3_add_cat

 14%|█▍        | 1/7 [00:20<02:02, 20.39s/it]
Epoch 1:	 Train accuracy: 55.8%
 29%|██▊       | 2/7 [00:39<01:39, 19.88s/it]
Epoch 2:	 Train accuracy: 77.8%
 43%|████▎     | 3/7 [00:59<01:19, 19.95s/it]
Epoch 3:	 Train accuracy: 80.6%
 57%|█████▋    | 4/7 [01:21<01:02, 20.70s/it]
Epoch 4:	 Train accuracy: 82.3%
 71%|███████▏  | 5/7 [01:41<00:40, 20.46s/it]
Epoch 5:	 Train accuracy: 83.3%
 86%|████████▌ | 6/7 [02:03<00:21, 21.04s/it]
Epoch 6:	 Train accuracy: 84.3%
100%|██████████| 7/7 [02:25<00:00, 20.74s/it]
Epoch 7:	 Train accuracy: 85.1%

### Model 3: CNN with images only

In [47]:
# Define model - hidden nodes = 1024, more blocks, added new pooling
class CNN_deeper_dropout_v2(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 64, 3, padding='same')
        self.conv1b = nn.Conv2d(64, 64, 3, padding='same')
        self.conv2 = nn.Conv2d(64, 128, 3, padding='same')
        self.conv2b = nn.Conv2d(128, 128, 3, padding='same')
        self.pool = nn.MaxPool2d(2, 2)
        self.pool2 = nn.MaxPool2d((2,3), (2,3))
        self.fc1 = nn.Linear(128*5*10,1024)
        self.fc2 = nn.Linear(1024, 27)
        self.dropout1 = nn.Dropout(p=0.25)
        self.dropout2 = nn.Dropout(p=0.5)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv1b(x))
        x = self.pool(x)
        x = self.dropout1(x)

        x = F.relu(self.conv2(x))
        x = F.relu(self.conv2b(x))
        x = self.pool2(x)
        x = self.dropout1(x)

        # third block reuses conv2b, so it shares weights with the second block
        x = F.relu(self.conv2b(x))
        x = F.relu(self.conv2b(x))
        x = self.pool(x)
        x = self.dropout1(x)

        x = torch.flatten(x, 1) # flatten all dimensions except batch
        x = F.relu(self.fc1(x))
        x = self.dropout2(x)
        x = F.log_softmax(self.fc2(x), 1)
        return x
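
The `128*5*10` passed to `fc1` depends on the input image size and on the three pooling layers. One way to check it is the sketch below, which applies the pooling arithmetic in plain Python. The 80x60 (height x width) input is an assumption, since the image size is not stated in this section, and `pooled` and `flattened_size` are hypothetical helpers.

```python
def pooled(size, kernel):
    """Output length of a max-pool along one axis when stride == kernel, no padding."""
    return size // kernel


def flattened_size(height, width, channels=128):
    """Number of features after the three conv/pool blocks are flattened."""
    # 'same'-padded convolutions keep height and width unchanged,
    # so only the pooling layers shrink the feature map.
    pools = [(2, 2), (2, 3), (2, 2)]  # pool, pool2, pool as (kernel_h, kernel_w)
    for k_h, k_w in pools:
        height, width = pooled(height, k_h), pooled(width, k_w)
    return channels * height * width


print(flattened_size(80, 60))  # 6400 under the assumed 80x60 input
```

Under that assumption the final feature map is 10x5 rather than 5x10, but the product, and therefore the `fc1` input width, is the same.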

In [51]:
# define model and optimizer
model_deeper_dropout_v2 = CNN_deeper_dropout_v2().to(device)
optimizer_deeper_dropout_v2 = torch.optim.Adam(model_deeper_dropout_v2.parameters())

model_optimizer_dict = {
    'test_CNN_deeper_dropout_v2': (model_deeper_dropout_v2, 
                                   optimizer_deeper_dropout_v2, 
                                   train_dataloader_original, 
                                   test_dataloader_original),
    }

In [52]:
best_epoch = 7 # best performing epoch

for model_type, model_optimizer_train_test_pair in model_optimizer_dict.items():
  model = model_optimizer_train_test_pair[0]
  optimizer = model_optimizer_train_test_pair[1]
  train_dataloader = model_optimizer_train_test_pair[2]
  test_dataloader = model_optimizer_train_test_pair[3]
  print(model_type)
  for t in tqdm(range(best_epoch)):
    # train
    train_accuracy, average_train_loss = train(train_dataloader, model, loss_fn, optimizer)
    print(f"Epoch {t+1}:\t Train accuracy: {100*train_accuracy:0.1f}%")
  print('\n')
  torch.save(model, '{}.pt'.format(model_type))
  # files.download('{}.pt'.format(model_type))  

test_CNN_deeper_dropout_v2

 14%|█▍        | 1/7 [00:22<02:17, 22.97s/it]
Epoch 1:	 Train accuracy: 59.6%
 29%|██▊       | 2/7 [00:44<01:49, 21.93s/it]
Epoch 2:	 Train accuracy: 75.4%
 43%|████▎     | 3/7 [01:03<01:22, 20.73s/it]
Epoch 3:	 Train accuracy: 78.0%
 57%|█████▋    | 4/7 [01:23<01:01, 20.38s/it]
Epoch 4:	 Train accuracy: 79.5%
 71%|███████▏  | 5/7 [01:43<00:40, 20.19s/it]
Epoch 5:	 Train accuracy: 80.5%
 86%|████████▌ | 6/7 [02:03<00:20, 20.18s/it]
Epoch 6:	 Train accuracy: 81.4%
100%|██████████| 7/7 [02:26<00:00, 20.86s/it]
Epoch 7:	 Train accuracy: 81.8%

### Model 4: CNN with images, OHE categorical columns, and TF on text data

In [46]:
# Define model - hidden nodes = 1024, more blocks, added new pooling
class test_CNN_deeper_dropout_v5_add_cat_text(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 64, 3, padding='same')
        self.conv1b = nn.Conv2d(64, 64, 3, padding='same')
        self.conv2 = nn.Conv2d(64, 128, 3, padding='same')
        self.conv2b = nn.Conv2d(128, 128, 3, padding='same')
        self.pool = nn.MaxPool2d(2, 2)
        self.pool2 = nn.MaxPool2d((2,3), (2,3))
        self.fc1 = nn.Linear(128*5*10+8499,1024)
        self.fc2 = nn.Linear(1024, 27)
        self.dropout1 = nn.Dropout(p=0.25)
        self.dropout2 = nn.Dropout(p=0.5)

    def forward(self, image, cat_data):
        image = F.relu(self.conv1(image))
        image = F.relu(self.conv1b(image))
        image = self.pool(image)
        image = self.dropout1(image)

        image = F.relu(self.conv2(image))
        image = F.relu(self.conv2b(image))
        image = self.pool2(image)
        image = self.dropout1(image)

        # third block reuses conv2b, so it shares weights with the second block
        image = F.relu(self.conv2b(image))
        image = F.relu(self.conv2b(image))
        image = self.pool(image)
        image = self.dropout1(image)

        x1 = torch.flatten(image, 1) # flatten all dimensions except batch
        x2 = cat_data
        x = torch.cat((x1, x2), dim=1)
        
        x = F.relu(self.fc1(x))
        x = self.dropout2(x)
        x = F.log_softmax(self.fc2(x), 1)
        return x

In [44]:
# define model and optimizer
model_deeper_dropout_v5_add_cat_text = test_CNN_deeper_dropout_v5_add_cat_text().to(device)
optimizer_deeper_dropout_v5_add_cat_text = torch.optim.Adam(model_deeper_dropout_v5_add_cat_text.parameters())

model_optimizer_dict = {
    'test_CNN_deeper_dropout_v5_add_cat_text': (model_deeper_dropout_v5_add_cat_text, 
                                                optimizer_deeper_dropout_v5_add_cat_text, 
                                                train_dataloader_all, 
                                                test_dataloader_all),
    }

In [45]:
best_epoch = 7 # best performing epoch

for model_type, model_optimizer_train_test_pair in model_optimizer_dict.items():
  model = model_optimizer_train_test_pair[0]
  optimizer = model_optimizer_train_test_pair[1]
  train_dataloader = model_optimizer_train_test_pair[2]
  test_dataloader = model_optimizer_train_test_pair[3]
  print(model_type)
  for t in tqdm(range(best_epoch)):
    # train
    train_accuracy, average_train_loss = train(train_dataloader, model, loss_fn, optimizer)
    print(f"Epoch {t+1}:\t Train accuracy: {100*train_accuracy:0.1f}%")
  print('\n')
  torch.save(model, '{}.pt'.format(model_type))
  # files.download('{}.pt'.format(model_type))  

test_CNN_deeper_dropout_v5_add_cat_text

 14%|█▍        | 1/7 [00:22<02:14, 22.41s/it]
Epoch 1:	 Train accuracy: 62.2%
 29%|██▊       | 2/7 [00:44<01:51, 22.37s/it]
Epoch 2:	 Train accuracy: 86.7%
 43%|████▎     | 3/7 [01:05<01:27, 21.81s/it]
Epoch 3:	 Train accuracy: 90.9%
 57%|█████▋    | 4/7 [01:27<01:05, 21.76s/it]
Epoch 4:	 Train accuracy: 93.0%
 71%|███████▏  | 5/7 [01:48<00:42, 21.38s/it]
Epoch 5:	 Train accuracy: 94.5%
 86%|████████▌ | 6/7 [02:11<00:21, 21.95s/it]
Epoch 6:	 Train accuracy: 95.8%
100%|██████████| 7/7 [02:34<00:00, 22.01s/it]
Epoch 7:	 Train accuracy: 96.7%

### Model 5: CNN with images, categorical embeddings, and TF on text data

In [48]:
# Define model - hidden nodes = 1024, more blocks, added new pooling
class test_CNN_deeper_dropout_v6_add_cat_embed_text(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 64, 3, padding='same')
        self.conv1b = nn.Conv2d(64, 64, 3, padding='same')
        self.conv2 = nn.Conv2d(64, 128, 3, padding='same')
        self.conv2b = nn.Conv2d(128, 128, 3, padding='same')
        self.pool = nn.MaxPool2d(2, 2)
        self.pool2 = nn.MaxPool2d((2,3), (2,3))
        self.fc1 = nn.Linear(128*5*10+8462,1024)
        self.fc2 = nn.Linear(1024, 27)
        self.dropout1 = nn.Dropout(p=0.25)
        self.dropout2 = nn.Dropout(p=0.5)

    def forward(self, image, cat_data):
        image = F.relu(self.conv1(image))
        image = F.relu(self.conv1b(image))
        image = self.pool(image)
        image = self.dropout1(image)

        image = F.relu(self.conv2(image))
        image = F.relu(self.conv2b(image))
        image = self.pool2(image)
        image = self.dropout1(image)

        # third block reuses conv2b, so it shares weights with the second block
        image = F.relu(self.conv2b(image))
        image = F.relu(self.conv2b(image))
        image = self.pool(image)
        image = self.dropout1(image)

        x1 = torch.flatten(image, 1) # flatten all dimensions except batch
        x2 = cat_data
        x = torch.cat((x1, x2), dim=1)
        
        x = F.relu(self.fc1(x))
        x = self.dropout2(x)
        x = F.log_softmax(self.fc2(x), 1)
        return x

In [47]:
# define model and optimizer
model_deeper_dropout_v6_add_cat_embed_text = test_CNN_deeper_dropout_v6_add_cat_embed_text().to(device)
optimizer_deeper_dropout_v6_add_cat_embed_text = torch.optim.Adam(model_deeper_dropout_v6_add_cat_embed_text.parameters())

model_optimizer_dict = {
    'test_CNN_deeper_dropout_v6_add_cat_embed_text': (model_deeper_dropout_v6_add_cat_embed_text, 
                                                      optimizer_deeper_dropout_v6_add_cat_embed_text, 
                                                      train_dataloader_all_embed, 
                                                      test_dataloader_all_embed),
    }

In [48]:
best_epoch = 7 # best performing epoch

for model_type, model_optimizer_train_test_pair in model_optimizer_dict.items():
  model = model_optimizer_train_test_pair[0]
  optimizer = model_optimizer_train_test_pair[1]
  train_dataloader = model_optimizer_train_test_pair[2]
  test_dataloader = model_optimizer_train_test_pair[3]
  print(model_type)
  for t in tqdm(range(best_epoch)):
    # train
    train_accuracy, average_train_loss = train(train_dataloader, model, loss_fn, optimizer)
    print(f"Epoch {t+1}:\t Train accuracy: {100*train_accuracy:0.1f}%")
  print('\n')
  torch.save(model, '{}.pt'.format(model_type))
  # files.download('{}.pt'.format(model_type))  

test_CNN_deeper_dropout_v6_add_cat_embed_text

 14%|█▍        | 1/7 [00:21<02:06, 21.12s/it]
Epoch 1:	 Train accuracy: 60.5%
 29%|██▊       | 2/7 [00:41<01:43, 20.79s/it]
Epoch 2:	 Train accuracy: 82.2%
 43%|████▎     | 3/7 [01:04<01:26, 21.74s/it]
Epoch 3:	 Train accuracy: 88.0%
 57%|█████▋    | 4/7 [01:25<01:04, 21.51s/it]
Epoch 4:	 Train accuracy: 90.4%
 71%|███████▏  | 5/7 [01:46<00:42, 21.25s/it]
Epoch 5:	 Train accuracy: 92.3%
 86%|████████▌ | 6/7 [02:07<00:21, 21.33s/it]
Epoch 6:	 Train accuracy: 93.9%
100%|██████████| 7/7 [02:28<00:00, 21.22s/it]
Epoch 7:	 Train accuracy: 94.9%

# Part 4: Weighted Majority Vote

In [35]:
def get_predictions(models, X_test=X_val):
  predictions = np.array([])
  for item in models:
    if isinstance(item, tuple):
      model = item[0]
      if 'CNN' in model.__class__.__name__:
        test = item[1]
        _, prediction, _ = predict_images(dataloader=test, model=model)
      elif isinstance(item[1], tuple):
        Xtest, _ = item[1][0], item[1][1]
        prediction = model.predict(Xtest)
      else:
        test = item[1]
        prediction = model.predict(test)
    
    else:
      model = item
      prediction = model.predict(X_test)
      
    # flatten prediction
    prediction = prediction.flatten()
    
    # convert integer class indices back to string labels
    # (string/object arrays already hold labels, whatever their item size)
    if prediction.dtype.kind not in ('U', 'O'):
      prediction = np.vectorize(idx_to_class.get)(prediction.astype(int))
      # prediction = le.inverse_transform(prediction.astype(int))

    # update predictions
    predictions = np.vstack((predictions, prediction)) if len(predictions) != 0 else prediction

  return predictions

def calculate_weights(accuracies, n_models_per_type=(1, 1, 1, 1, 1)):
  # Divide each model's accuracy by the number of models of its type.
  # Accuracies beyond sum(n_models_per_type) are left unchanged.
  div_weights_by = np.array(n_models_per_type)
  weights = accuracies.copy()

  for i in range(len(div_weights_by)):
      start = sum(div_weights_by[:i])
      end = sum(div_weights_by[:i+1])
      weights[start:end] /= div_weights_by[i]

  return weights

def majority_vote(predictions, vote_type='hard', weights=None):
  predictions_df = pd.DataFrame(np.transpose(predictions))

  if vote_type =='hard':
    return predictions_df.mode(axis=1)[0] # if there are ties, take the category from left to right
  elif vote_type == 'soft':
    weights = weights / sum(weights) # normalize
    weights = weights.flatten()
    predictions_df_enc = pd.DataFrame(np.transpose(predictions)).replace(class_to_idx)
    predictions_df_enc = predictions_df_enc.apply(lambda x: np.bincount(x, weights=weights).argmax(), axis=1)
    predictions_df = predictions_df_enc.replace(idx_to_class)
    return predictions_df
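
In the `'soft'` branch above, each model casts one label vote per sample, the votes are summed using the normalized weights, and the class with the largest total wins. The sketch below shows the same idea for a single sample in plain Python. `weighted_vote` is a hypothetical helper, and its tie-breaking (first label seen) may differ from `np.bincount(...).argmax()`, which picks the lowest class index.

```python
from collections import defaultdict


def weighted_vote(labels, weights):
    """Return the label whose votes have the largest total weight."""
    total = sum(weights)
    scores = defaultdict(float)
    for label, weight in zip(labels, weights):
        scores[label] += weight / total  # normalize, as majority_vote does
    # max() keeps the first label among ties (dicts preserve insertion order)
    return max(scores, key=scores.get)


votes = ["Shoes", "Sandal", "Sandal"]
print(weighted_vote(votes, [0.9, 0.4, 0.4]))  # Shoes: 0.9 outweighs 0.4 + 0.4
print(weighted_vote(votes, [0.5, 0.5, 0.5]))  # Sandal: equal weights give a plain majority
```

Note that this is a weighted vote over predicted labels, not an average of class probabilities, which is also how the notebook's `'soft'` option operates.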

In [36]:
def format_predictions(predictions, index, target=target):
  predictions_df = predictions.to_frame()
  predictions_df.index = index
  predictions_df.columns = [target]
  return predictions_df

def download_predictions(predictions_df, filename):
  predictions_df.to_csv('{}.csv'.format(filename))
  files.download('{}.csv'.format(filename))

In [49]:
##### load the saved models if necessary (e.g. after a Colab restart) #####
### Image Models
model_deeper_dropout_v2 = torch.load('test_CNN_deeper_dropout_v2.pt', map_location=device)

### Combined Cat and Text Models
bag_clf_pipeline_LinearSVC = torch.load('test_bag_clf_pipeline_LinearSVC.pkl', pickle_module=dill, map_location=device)
bag_clf_pipeline_SGDClassifier = torch.load('test_bag_clf_pipeline_SGDClassifier.pkl', pickle_module=dill, map_location=device)

### Combined Cat and Image Models
model_deeper_dropout_v3_add_cat = torch.load('test_CNN_deeper_dropout_v3_add_cat.pt', map_location=device)
model_deeper_dropout_v4_add_cat_embed = torch.load('test_CNN_deeper_dropout_v4_add_cat_embed.pt', map_location=device)

### Combined Cat, Text and Image Models
model_deeper_dropout_v5_add_cat_text = torch.load('test_CNN_deeper_dropout_v5_add_cat_text.pt', map_location=device)
model_deeper_dropout_v6_add_cat_embed_text = torch.load('test_CNN_deeper_dropout_v6_add_cat_embed_text.pt', map_location=device)

In [50]:
cat_models = [
]

image_models = [
    (model_deeper_dropout_v2, test_dataloader_original),
]

cat_and_text_combined_models = [
    bag_clf_pipeline_LinearSVC, 
    bag_clf_pipeline_SGDClassifier,
]

cat_and_image_combined_models = [
    (model_deeper_dropout_v3_add_cat, test_dataloader_onehot),
    (model_deeper_dropout_v4_add_cat_embed, test_dataloader_embed),
]

cat_and_text_and_image_combined_models = [
    (model_deeper_dropout_v5_add_cat_text, test_dataloader_all),
    (model_deeper_dropout_v6_add_cat_embed_text, test_dataloader_all_embed),
]

all_models = cat_models + image_models + cat_and_text_combined_models + cat_and_image_combined_models + cat_and_text_and_image_combined_models

---

In [None]:
# If Colab keeps crashing on this line because it reaches the RAM limit, skip this line
# and follow the instructions in the text block below.
predictions = get_predictions(all_models, X_test=X_val)

Colab has a high risk of crashing when predicting with `LinearSVC` and `SGDClassifier`. If it keeps crashing, run only the following sections:

- Upload files in Google Colab (if necessary)
- Required Libraries (you can skip `!pip install` if the packages are already installed)
- Part 1: Load Data
- First 2 code blocks in Part 4: Weighted Majority Vote
- Only the `torch.load` calls for `LinearSVC` and `SGDClassifier`
- Only the initialization of `cat_and_text_combined_models`

Finally, run `predictions3 = get_predictions(cat_and_text_combined_models, X_test=X_val)` and save the result with `np.save`.

After that, run all the previous code blocks and continue running the notebook.

If Colab crashes again, `np.load` restores the saved predictions so they do not have to be recomputed. You can also download the saved `.npy` prediction files (e.g. `predictions3.npy`) and upload them again after a crash.
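
The save-then-reload workflow described above can be wrapped in a small helper, sketched below. `load_or_compute` is a hypothetical function, not part of the notebook. It relies only on `np.save`, `np.load`, and `os.path.exists`, and the demo writes to a temporary directory with a stand-in array instead of calling `get_predictions`.

```python
import os
import tempfile

import numpy as np


def load_or_compute(path, compute):
    """Load a cached array from `path`, or call `compute()` and cache the result."""
    if os.path.exists(path):
        # allow_pickle=True is needed because the label arrays have dtype=object
        return np.load(path, allow_pickle=True)
    result = compute()
    np.save(path, result)
    return result


calls = []


def fake_predictions():
    # Stand-in for get_predictions(...); records how often it actually runs.
    calls.append(1)
    return np.array([["Bags", "Shoes"], ["Bags", "Sandal"]], dtype=object)


with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, "predictions3.npy")
    first = load_or_compute(cache, fake_predictions)   # computes and saves
    second = load_or_compute(cache, fake_predictions)  # loads from disk

print(len(calls), (first == second).all())
```

In the notebook, `compute` could be something like `lambda: get_predictions(cat_and_text_combined_models, X_test=X_val)`, so the expensive call runs only when no saved file exists.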

In [12]:
# # if Colab keeps crashing on this line of code due to reaching RAM usage limit, please uncomment and run this.
# predictions3 = get_predictions(cat_and_text_combined_models, X_test=X_val)
# np.save('predictions3.npy', predictions3)

In [51]:
# # if Colab keeps crashing on this line of code due to reaching RAM usage limit, please uncomment and run this.
# predictions2 = get_predictions(image_models, X_test=X_val)
# np.save('predictions2.npy', predictions2)

In [52]:
# # if Colab keeps crashing on this line of code due to reaching RAM usage limit, please uncomment and run this.
# predictions4 = get_predictions(cat_and_image_combined_models, X_test=X_val)
# np.save('predictions4.npy', predictions4)

In [53]:
# # if Colab keeps crashing on this line of code due to reaching RAM usage limit, please uncomment and run this.
# predictions5 = get_predictions(cat_and_text_and_image_combined_models, X_test=X_val)
# np.save('predictions5.npy', predictions5)

In [56]:
# # if Colab keeps crashing on this line of code due to reaching RAM usage limit, please uncomment and run this.
# predictions2 = np.load('predictions2.npy', allow_pickle=True)
# predictions3 = np.load('predictions3.npy', allow_pickle=True)
# predictions4 = np.load('predictions4.npy', allow_pickle=True)
# predictions5 = np.load('predictions5.npy', allow_pickle=True)

In [61]:
# predictions = np.vstack((predictions2, predictions3, predictions4, predictions5))

In [62]:
# predictions

array([['Bottomwear', 'Shoes', 'Wallets', ..., 'Wallets', 'Topwear',
        'Topwear'],
       ['Bottomwear', 'Sandal', 'Bags', ..., 'Wallets', 'Topwear',
        'Topwear'],
       ['Bottomwear', 'Sandal', 'Bags', ..., 'Wallets', 'Topwear',
        'Topwear'],
       ...,
       ['Bottomwear', 'Sandal', 'Wallets', ..., 'Wallets', 'Topwear',
        'Topwear'],
       ['Bottomwear', 'Sandal', 'Bags', ..., 'Wallets', 'Topwear',
        'Topwear'],
       ['Bottomwear', 'Sandal', 'Bags', ..., 'Wallets', 'Topwear',
        'Topwear']], dtype=object)

---

In [63]:
val_accuracies = np.array([
    [0.80617718], 
    [0.82800074],
    [0.82282227],
    [0.84131681],
    [0.83650823],
    [0.90290364],
    [0.89069724]
    ])

weights = calculate_weights(val_accuracies)
final_predictions = majority_vote(predictions, vote_type='soft', weights=weights)
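
With the default `n_models_per_type`, `calculate_weights` leaves every accuracy unchanged, so the weights above equal the validation accuracies. The sketch below, in plain Python, shows what the division does when group sizes are supplied. The grouping `[1, 2, 2, 2]` (one image-only model followed by three pairs) is an assumption that mirrors the model lists defined earlier, and `per_type_weights` is a hypothetical stand-in for `calculate_weights`.

```python
def per_type_weights(accuracies, n_models_per_type):
    """Divide each accuracy by the number of models in its group."""
    weights = []
    start = 0
    for group_size in n_models_per_type:
        group = accuracies[start:start + group_size]
        weights.extend(acc / group_size for acc in group)
        start += group_size
    # accuracies not covered by any group keep their value
    weights.extend(accuracies[start:])
    return weights


val_accuracies = [0.80617718, 0.82800074, 0.82282227, 0.84131681,
                  0.83650823, 0.90290364, 0.89069724]
weights = per_type_weights(val_accuracies, [1, 2, 2, 2])
print([round(w, 3) for w in weights])
```

Under this grouping, each pair of models shares roughly one model's worth of weight, so a model type with two members does not count twice in the vote.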

In [64]:
final_predictions

0        Bottomwear
1            Sandal
2              Bags
3             Shoes
4           Wallets
            ...    
21623       Topwear
21624         Shoes
21625       Wallets
21626       Topwear
21627       Topwear
Length: 21628, dtype: object

# Part 5: Download Predictions

In [65]:
predictions_df = format_predictions(predictions=final_predictions, index=X_val.index)
download_predictions(predictions_df=predictions_df, filename='final_submission')
