# W207 Baseline Submission
## Group 2: Austin Jin, Matt Lyons, Chandni Shah

<b> Project Link:</b> https://www.kaggle.com/c/petfinder-pawpularity-score

<b> Project Description:</b><br>
"In this competition, you’ll analyze raw images and metadata to predict the “Pawpularity” of pet photos. Your task is to predict engagement with a pet's profile based on the photograph for that profile. You are also provided with hand-labelled metadata for each photo. The dataset for this competition therefore comprises both images and tabular data." <br><br>

"Tabular Metadata: Each pet photo is labeled with the value of 1 (Yes) or 0 (No) for each of the following features. These labels are not used for deriving the Pawpularity score.

- Focus - Pet stands out against uncluttered background, not too close / far.
- Eyes - Both eyes are facing front or near-front, with at least 1 eye / pupil decently clear.
- Face - Decently clear face, facing front or near-front.
- Near - Single pet taking up significant portion of photo (roughly over 50% of photo width or height).
- Action - Pet in the middle of an action (e.g., jumping).
- Accessory - Accompanying physical or digital accessory / prop (i.e. toy, digital sticker), excluding collar and leash.
- Group - More than 1 pet in the photo.
- Collage - Digitally-retouched photo (i.e. with digital photo frame, combination of multiple photos).
- Human - Human in the photo.
- Occlusion - Specific undesirable objects blocking part of the pet (i.e. human, cage or fence). Note that not all blocking objects are considered occlusion.
- Info - Custom-added text or labels (i.e. pet name, description).
- Blur - Noticeably out of focus or noisy, especially for the pet’s eyes and face. For Blur entries, “Eyes” column is always set to 0."

## Part I - Metadata EDA

### To better understand the 'petfinder-pawpularity-score' dataset, we performed some early EDA, starting with the following prerequisites:

## 1. Load in the packages

In [1]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from tensorflow import keras
from tensorflow.keras import datasets, layers, models
from tensorflow.keras.regularizers import l2
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.python.client import device_lib
import tensorflow as tf
from matplotlib import image
from glob import glob
import cv2

%matplotlib inline

import time
from matplotlib.ticker import MultipleLocator
from sklearn.pipeline import Pipeline
from sklearn.datasets import fetch_openml
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LinearRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from math import sqrt
import random  # used by set_seed() in the CNN section

In [5]:
%%capture

# Perform a for loop to browse through the 'petfinder-pawpularity-score' directory and print out all file names:
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [6]:
# Define the source path for the Pawpularity contest data, load the .csv metadata into DataFrames, and collect the .jpg image paths into lists:
# path = '../input/petfinder-pawpularity-score/'

train_df = pd.read_csv('./train.csv')
test_df = pd.read_csv('./test.csv')

train_jpg = glob("./train/*.jpg")
test_jpg = glob("./test/*.jpg")

In [5]:
# Printing the dimensions for the train metadata.
print('train_df dimensions: ', train_df.shape)
print('train_df column names: ', train_df.columns.values.tolist())

# Adding a space in between the dimensions for the train and test metadata
print('')

# Printing the dimensions for the test metadata
print('test_df dimensions: ',test_df.shape)
print('test_df column names: ', test_df.columns.values.tolist())

# After printing the shape of the train_df and test_df DataFrames, we have noticed that the train_df has 9912 rows and 14 columns whereas the test_df only has 8 rows and 13 columns. It is also worth mentioning that the test_df dataframe doesn't contain the pawpularity score.

train_df dimensions:  (9912, 14)
train_df column names:  ['Id', 'Subject Focus', 'Eyes', 'Face', 'Near', 'Action', 'Accessory', 'Group', 'Collage', 'Human', 'Occlusion', 'Info', 'Blur', 'Pawpularity']

test_df dimensions:  (8, 13)
test_df column names:  ['Id', 'Subject Focus', 'Eyes', 'Face', 'Near', 'Action', 'Accessory', 'Group', 'Collage', 'Human', 'Occlusion', 'Info', 'Blur']


### After printing the shapes of the train_df and test_df dataframes, we can see that train_df has 9912 rows and 14 columns, while test_df has only 8 rows and 13 columns. Notably, test_df does not include the Pawpularity score. We therefore explore the metadata in train_df further, since it is the dataset we will use to build our models, and use test_df only for practice predictions:

In [9]:
# Display the first 10 rows of the train_df dataframe
train_df.head(10)

### We noticed that train_df still contains the Ids of the photos, an identifier column that we will not use as a feature when building the models. Because it is also useful to look at the distribution of the target variable, the Pawpularity score (which ranges from 1 to 100), we plotted a simple histogram:

In [10]:
# Distribution for Pawpularity Scores

sns.set(rc={'figure.figsize':(15,5)})
fig = plt.figure()
sns.histplot(data=train_df, x='Pawpularity', bins=100)
plt.axvline(train_df['Pawpularity'].mean(), c='red', ls='-', lw=3, label='Mean Pawpularity')
plt.axvline(train_df['Pawpularity'].median(),c='blue',ls='-',lw=3, label='Median Pawpularity')
plt.title('Distribution of Pawpularity Scores', fontsize=20, fontweight='bold')
plt.legend()
plt.show()

### The histogram shows that the distribution of Pawpularity scores is skewed. Interestingly, there is a small bump close to zero and a second spike at a score of 100, with a count of close to 300. Since EDA alone cannot tell us why so many photos score exactly 100, we have decided to keep the following questions in mind:

##### - Was there something unique about the animals such as their age, color, or breed that was most desirable by the people visiting the site?
##### - Did it have to do with the way in which photos were taken that were leading to more clicks and thus a higher Pawpularity score?
##### - Did it have to do with the Pawpularity score itself?
##### - Were there any outliers that need to be removed from the training data to improve the models that were built?
##### - Was there perhaps any noise in the dataset that caused the huge increase in pawpularity scores of 100?

### Since we are unable to find the actual answer through EDA alone, we plan to develop concrete ML models that show which features have the most impact on high Pawpularity scores, in order to further explain these bumps.

In [11]:
# Describe the distribution of the train dataframe in a numerical way

train_df[['Pawpularity']].describe()

In [12]:
# Put column names into a list
feature_variables = train_df.columns.values.tolist()

# For each feature variable, doesn't include Id and Pawpularity by using [1:-1]
# Display a boxplot and distribution plot against pawpularity
for variable in feature_variables[1:-1]:
    fig, ax = plt.subplots(1,2)
    sns.boxplot(data=train_df, x=variable, y='Pawpularity', ax=ax[0])
    sns.histplot(train_df, x="Pawpularity", hue=variable, kde=True, ax=ax[1])
    plt.suptitle(variable, fontsize=20, fontweight='bold')
    plt.show()

### As the charts show, the distribution of Pawpularity scores is very similar whether each feature is 0 or 1, which suggests that the metadata features have little influence on the Pawpularity score. We therefore expect to need the images themselves rather than the .csv metadata alone, a conclusion we would not have reached without this EDA. We will focus on analyzing the pixels for the remainder of the baseline.
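One way to put a number on how similar the score distributions are is to compare the mean score of photos with and without a given flag. The sketch below does this on made-up toy data; the helper name `mean_gap` and the toy values are assumptions for illustration and are not taken from train.csv.

```python
import numpy as np

def mean_gap(flag, score):
    """Return mean(score | flag == 1) - mean(score | flag == 0)."""
    flag = np.asarray(flag)
    score = np.asarray(score, dtype=float)
    return score[flag == 1].mean() - score[flag == 0].mean()

# Toy data (made up for illustration, not taken from train.csv)
toy_flag = [1, 0, 1, 0, 1, 0]
toy_score = [40, 38, 35, 37, 42, 39]

gap = mean_gap(toy_flag, toy_score)
print(round(gap, 2))
```

On the real data, the same helper could be called as `mean_gap(train_df[col], train_df['Pawpularity'])` for each metadata column. A gap that is small relative to the spread of the scores would support the reading of the charts.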

## Part II - Pixel EDA

### Before resizing the images to a uniform size, we have decided to explore the image data by taking a look at the first image in the train_jpg dataset and plotting that initial image:

In [37]:
print(train_jpg[0])

In [38]:
path_image = train_jpg[0]
array_image = plt.imread(path_image) 
print(array_image.shape)

plt.imshow(array_image)
plt.title('Initial Training Image') 
plt.axis('off')
plt.show()

### Next, we have attached a Pawpularity score as the title next to each image:

In [39]:
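for x in range(3):
    path_image = train_jpg[x]
    array_image = plt.imread(path_image)
    print("The image {}'s dimensions are: {}".format(x, array_image.shape))
    # Look up the score by Id; each file is named <Id>.jpg
    image_id = Path(path_image).stem
    score = train_df.loc[train_df['Id'] == image_id, 'Pawpularity'].iloc[0]
    plt.imshow(array_image)
    plt.title('Pawpularity: {}'.format(score))
    plt.axis('off')
    plt.show()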

### After gaining an initial sense of how the images look, we resize them to a uniform size. In this transformation, we also add white padding to the images so that they are not distorted by the resizing.
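The resizing script itself is not shown in this notebook, so the following is only a minimal sketch of the padding step, under the assumption that each image is centered on a white square canvas before being scaled. The helper name `pad_to_square` is hypothetical.

```python
import numpy as np

def pad_to_square(img, fill=255):
    """Center an H x W (x C) image on a square canvas filled with `fill` (255 = white)."""
    h, w = img.shape[:2]
    side = max(h, w)
    # Canvas keeps the channel dimension (if any) and the dtype of the input
    canvas = np.full((side, side) + img.shape[2:], fill, dtype=img.dtype)
    top = (side - h) // 2
    left = (side - w) // 2
    canvas[top:top + h, left:left + w] = img
    return canvas

# Toy 2 x 4 black RGB image standing in for a real photo
toy = np.zeros((2, 4, 3), dtype=np.uint8)
padded = pad_to_square(toy)
print(padded.shape)
```

After padding, a call such as `cv2.resize(padded, (300, 300))` would bring every image to the same size without stretching the pet. The 300 x 300 target matches the reshape used in the cells below.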

In [7]:
## Read in the training and test data, including the grayscale 1-D train data for the baseline

train_path = './train_resized'
train_bw_path = './train_resized_bw'
test_path = './test'

train_jpg = glob(train_path + "/*.jpg")
train_bw_jpg = glob(train_bw_path + "/*.jpg")
test_jpg = glob(test_path + "/*.jpg")


train_images = [cv2.imread(file) for file in train_jpg]
train_bw_images_1d = [cv2.imread(file, 0).flatten(order = 'C') for file in train_bw_jpg] # 0 for grayscale, C for row-style flattening
test_images = [cv2.imread(file) for file in test_jpg]

In [8]:
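X = np.array(train_bw_images_1d)
X = X / 255
# glob() is not guaranteed to return files in train_df row order, so look up each label by image Id
# (assumes the resized files keep their original <Id>.jpg names)
image_ids = [Path(file).stem for file in train_bw_jpg]
Y = train_df.set_index('Id').loc[image_ids, 'Pawpularity'].to_numpy()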

In [46]:
X.shape

### Examples of Transformed Images, Top Scoring Images, and Bottom Scoring Images:

In [None]:
#examples of transformed images
pltnum = 0
plt.figure(figsize=(100,100))

for i in range(3):
    pltnum += 1
    plt.subplot(1, 3, pltnum)
    plt.imshow(X[i].reshape(300,300), cmap='gray')

In [None]:
#examples of score = 100

y_100 = np.where(Y == 100)


pltnum = 0
plt.figure(figsize=(100,100))

for i in y_100[0][:3]:
    pltnum += 1
    plt.subplot(1, 3, pltnum)
    plt.imshow(X[i].reshape(300,300), cmap='gray')

In [None]:
#example of scores in 75th percentile
y_75quant = np.where(Y == np.percentile(Y, 75))

pltnum = 0
plt.figure(figsize=(100,100))

for i in y_75quant[0][:3]:
    pltnum += 1
    plt.subplot(1, 3, pltnum)
    plt.imshow(X[i].reshape(300,300), cmap='gray')

In [None]:
#example of scores in 25th percentile
y_25quant = np.where(Y == np.percentile(Y, 25))

pltnum = 0
plt.figure(figsize=(100,100))

for i in y_25quant[0][:3]:
    pltnum += 1
    plt.subplot(1, 3, pltnum)
    plt.imshow(X[i].reshape(300,300), cmap='gray')
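The two percentile cells above select images whose score is exactly equal to the percentile value. `np.percentile` interpolates by default, so the value it returns may not occur in the data at all, in which case no image is selected. A hedged alternative is to pick the scores closest to the percentile value; the helper name `nearest_to_percentile` and the toy scores are assumptions for illustration.

```python
import numpy as np

def nearest_to_percentile(scores, q, n=3):
    """Return indices of the n scores closest to the q-th percentile value."""
    scores = np.asarray(scores, dtype=float)
    target = np.percentile(scores, q)
    # Stable sort keeps the original order among equally distant scores
    order = np.argsort(np.abs(scores - target), kind="stable")
    return order[:n]

# Toy scores: the 75th percentile is interpolated and matches no element
toy_scores = [10, 20, 30, 40]
idx = nearest_to_percentile(toy_scores, 75, n=2)
print(np.percentile(toy_scores, 75), idx)
```

With the notebook's arrays this could be used as `for i in nearest_to_percentile(Y, 75): ...` in place of the `np.where` lookup.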

## Part III - Baseline Model

In [11]:
train_bw_images_1d, test_data, train_labels, test_labels = train_test_split(X,Y, test_size = .2, random_state = 42)
print(train_bw_images_1d.shape)
print(train_labels.shape)
print(test_data.shape)
print(test_labels.shape)

(7929, 90000)
(7929,)
(1983, 90000)
(1983,)


### KNN Regression Model:

In [None]:
# Score RMSE for a range of k values, with particular interest in k = 9

def KNN(k_values):
    for val in k_values:
        KNN_model = KNeighborsRegressor(n_neighbors=val)
        KNN_model.fit(train_bw_images_1d, train_labels)
        test_predict = KNN_model.predict(test_data)
        print("For k = ", val, ", the RMSE is: ", sqrt(mean_squared_error(test_labels, test_predict)), "\n")
        
k_values = [1, 5, 9, 11, 55, 175, 201, 301, 501]
KNN(k_values)

### The KNN regression reaches its lowest RMSE between 201 and 301 neighbors (k = 201, RMSE = 21.006; k = 301, RMSE = 21.0111). However, this is only slightly lower than the RMSE with 9 neighbors (k = 9, RMSE = 21.901). We judge that the small improvement is not worth the extra computation of 200+ neighbors, so we will move forward with the k = 9 KNN regression model.
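These RMSE values are easier to judge next to a constant predictor. As k grows, a uniformly weighted KNN regressor averages over more and more neighbors and so drifts toward predicting the training mean. The sketch below computes the RMSE of that constant baseline on made-up toy labels; the toy values are assumptions and are not the notebook's data.

```python
from math import sqrt

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Toy labels (made up for illustration)
toy_train = [30, 40, 50]
toy_valid = [20, 40, 60]

# Constant baseline: always predict the training mean
mean_pred = sum(toy_train) / len(toy_train)
baseline = rmse(toy_valid, [mean_pred] * len(toy_valid))
print(round(baseline, 3))
```

Running the same calculation with `train_labels.mean()` and `test_labels` would show how much of the KNN score comes from the images rather than from the overall score distribution.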

### Linear Regression Model:

In [None]:
# Fit a linear regression model (LRM) on the flattened pixel values

lr_model = LinearRegression()
lr_model.fit(train_bw_images_1d, train_labels)
lr_model.intercept_, lr_model.coef_

In [None]:
lr_model.score(test_data, test_labels)

In [None]:
lr_predict = lr_model.predict(test_data)
print("LR RMSE is: ", sqrt(mean_squared_error(test_labels, lr_predict)))

### The linear regression model performs poorly compared to the KNN models: its RMSE of 31.681 is much higher than that of any KNN model. It also has a negative R-squared score on the held-out data, which means its predictions fit worse than a horizontal line at the mean of the labels. This is likely overfitting, since the model has 90,000 pixel features but only 7,929 training images.
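To see how R-squared can go negative, recall that it is defined as 1 minus the ratio of the residual sum of squares to the total sum of squares around the mean of the true labels. The sketch below applies that definition to made-up toy values; the helper name `r2` and the numbers are for illustration only.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

toy_true = [20, 40, 60]
flat = r2(toy_true, [40, 40, 40])  # always predicting the mean
bad = r2(toy_true, [70, 40, 10])   # predictions that move the wrong way
print(flat, bad)
```

Predicting the mean gives a score of exactly 0, so any model whose squared error exceeds that of the mean predictor scores below 0.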

### Based on these results, we will move forward with the KNN Regression as our baseline model. The RMSE for our baseline model is 21.901.

### Our next steps will be to build a CNN model, which we hope will be able to better handle the complexity of the images and, in return, lower the RMSE.

### CNN Model:

In [9]:
# Setting the file path of each image

train_df["path"] = train_df["Id"].apply(lambda x: "./train/" + x + ".jpg")
test_df["path"] = test_df["Id"].apply(lambda x: "./test/" + x + ".jpg")

In [10]:
AUTOTUNE = tf.data.experimental.AUTOTUNE
BATCH_SIZE = 64
IMG_SIZE = 224
target = 'Pawpularity'
seed = 0

def set_seed(seed=seed):
    """Utility function to use for reproducibility.
    :param seed: Random seed
    :return: None
    """
    np.random.seed(seed)
    random.seed(seed)
    tf.random.set_seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    os.environ['TF_DETERMINISTIC_OPS'] = '1'


def set_display():
    """Function sets display options for charts and pd.DataFrames.
    """
    # Plots display settings
    plt.style.use('fivethirtyeight')
    plt.rcParams['figure.figsize'] = 12, 8
    plt.rcParams.update({'font.size': 14})
    # DataFrame display settings
    pd.set_option('display.max_columns', None)
    pd.set_option('display.max_rows', None)
    pd.options.display.float_format = '{:.4f}'.format


def id_to_path(img_id: str, dir: str):
    """Function returns a path to an image file.
    :param img_id: Image Id
    :param dir: Path to the directory with images
    :return: Image file path
    """
    return os.path.join(dir, f'{img_id}.jpg')


@tf.function
def get_image(path: str) -> tf.Tensor:
    """Function loads image from a file and preprocesses it.
    :param path: Path to image file
    :return: Tensor with preprocessed image
    """
    ## Decoding the image
    image = tf.image.decode_jpeg(tf.io.read_file(path), channels=3)

    ## Resizing image
    image = tf.cast(tf.image.resize_with_pad(image, IMG_SIZE, IMG_SIZE), dtype=tf.int32)

    return image


@tf.function
def process_dataset(path: str, label: int) -> tuple:
    """Function returns preprocessed image and label.
    :param path: Path to image file
    :param label: Pawpularity score (regression target)
    :return: tf.Tensor with preprocessed image, numeric label
    """
    return get_image(path), label


@tf.function
def get_dataset(x, y=None) -> tf.data.Dataset:
    """Function creates batched optimized dataset for the model
    out of an array of file paths and (optionally) class labels.
    :param x: Input data for the model (array of file paths)
    :param y: Target values for the model (array of Pawpularity scores)
    :return TensorFlow Dataset object
    """
    if y is not None:
        ds = tf.data.Dataset.from_tensor_slices((x, y))
        return ds.map(process_dataset, num_parallel_calls=AUTOTUNE) \
            .batch(BATCH_SIZE).prefetch(buffer_size=AUTOTUNE)
    else:
        ds = tf.data.Dataset.from_tensor_slices(x)
        return ds.map(get_image, num_parallel_calls=AUTOTUNE) \
            .batch(BATCH_SIZE).prefetch(buffer_size=AUTOTUNE)

def plot_history(hist):
    """Function plots a chart with training and validation metrics.
    :param hist: Tensorflow history object from model.fit()
    """
    # Losses and metrics
    loss = hist.history['loss']
    val_loss = hist.history['val_loss']
    rmse = hist.history['root_mean_squared_error']
    val_rmse = hist.history['val_root_mean_squared_error']

    # Epochs to plot along x axis
    x_axis = range(1, len(loss) + 1)

    fig, (ax1, ax2) = plt.subplots(nrows=2, ncols=1, sharex=True)

    ax1.plot(x_axis, loss, 'bo', label='Training')
    ax1.plot(x_axis, val_loss, 'ro', label='Validation', alpha=0.3)
    ax1.set_title('MSE Loss')
    ax1.legend()

    ax2.plot(x_axis, rmse, 'bo', label='Training')
    ax2.plot(x_axis, val_rmse, 'ro', label='Validation', alpha=0.3)
    ax2.set_title('Root Mean Squared Error')
    ax2.set_xlabel('Epochs')
    ax2.legend()

    plt.tight_layout()
    plt.show()

In [11]:
# Splitting train into train and validation sets

train_subset, valid_subset = train_test_split(
    train_df[['path', target]],
    test_size=.2, shuffle=True, random_state=0
)

In [12]:
train_ds = get_dataset(x=train_subset['path'], y=train_subset[target])
valid_ds = get_dataset(x=valid_subset['path'], y=valid_subset[target])
test_ds = get_dataset(x=test_df['path'])


In [15]:
# Creating the model

def get_model():
    
    ## Setting the Inputs
    inputs = keras.Input(shape=(224, 224, 3))
    x = inputs
    
    ## Preprocessing Layers
    
    ### Rescaling
    x = keras.layers.experimental.preprocessing.Rescaling(1./255)(x)
    
    ## Data Augmentation
    x = keras.layers.experimental.preprocessing.RandomFlip("horizontal_and_vertical")(x)
    x = keras.layers.experimental.preprocessing.RandomRotation(0.2)(x)
    x = keras.layers.experimental.preprocessing.RandomTranslation(0.2,0.2)(x)
    
    ## Convolutional Layers
    
    ### First CNN layer
    x = keras.layers.Conv2D(filters=96, kernel_size=3, strides=2, padding='same', kernel_initializer=tf.keras.initializers.HeNormal())(x)
    x = keras.layers.Activation('relu')(x)
    x = keras.layers.MaxPool2D(2)(x)

    ### Second CNN layer
    x = keras.layers.Conv2D(filters=128, kernel_size=3, strides=2, padding='same', kernel_initializer=tf.keras.initializers.HeNormal())(x)
    x = keras.layers.BatchNormalization()(x)
    x = keras.layers.Activation('relu')(x)
    x = keras.layers.MaxPool2D(2)(x)
    
    ### Third CNN layer
    x = keras.layers.Conv2D(filters=256, kernel_size=3, strides=2, padding='same', kernel_initializer=tf.keras.initializers.HeNormal())(x)
    x = keras.layers.BatchNormalization()(x)
    x = keras.layers.Activation('relu')(x)
    x = keras.layers.MaxPool2D(2)(x)

    ## Flattening the layer
    x = keras.layers.Flatten()(x)
    
    ## Fully Connected (Dense) Layers
    
    ### First Fully Connected layer w/ Dropout
    x = keras.layers.Dense(128, activation='relu', kernel_initializer=tf.keras.initializers.HeNormal())(x)
    x = keras.layers.Dropout(0.2)(x)
    
    ## Output layer
    output = keras.layers.Dense(1)(x)

    ## Returning the model
    return keras.Model(inputs=inputs, outputs=output)
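As a sanity check on the architecture above, the spatial size of the feature maps can be traced by hand. This is a minimal sketch that assumes the Keras defaults: a convolution with `padding='same'` outputs `ceil(size / stride)`, and `MaxPool2D(2)` uses a stride equal to its pool size with no padding. The helper names are hypothetical.

```python
from math import ceil

def conv_same(size, stride):
    """Spatial output size of a convolution with padding='same'."""
    return ceil(size / stride)

def pool_valid(size, pool):
    """Spatial output size of max pooling with stride == pool size and no padding."""
    return (size - pool) // pool + 1

size = 224
for filters in (96, 128, 256):
    size = conv_same(size, stride=2)
    size = pool_valid(size, pool=2)
    print(filters, size)

# Flatten output and parameter count of the Dense(128) layer (weights + biases)
flat_features = size * size * 256
dense_params = flat_features * 128 + 128
print(flat_features, dense_params)
```

Each conv and pool pair shrinks the image by a factor of about four, so the 224 x 224 input reaches the dense layer as a very small grid. The printed values can be compared against `model.summary()`.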

In [1]:
# Fitting the model

def compile_and_fit(model):
    
    # Creating an exponential decay for learning rate

    LEARNING_RATE = 1e-3
    DECAY_STEPS = 100
    DECAY_RATE = 0.99

    lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
        initial_learning_rate=LEARNING_RATE,
        decay_steps=DECAY_STEPS, decay_rate=DECAY_RATE,
        staircase=True
    )
    
    # Creating an early stopper

    early_stop = tf.keras.callbacks.EarlyStopping(
        monitor='val_loss', patience=5, restore_best_weights=True
    )
    
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=lr_schedule),
        loss=tf.keras.losses.MeanSquaredError(),
        metrics=[tf.keras.metrics.RootMeanSquaredError()]
    )
    
    history = model.fit(
        train_ds, 
        validation_data=valid_ds,
        epochs=50,
        use_multiprocessing=True, workers=-1,
        callbacks=[early_stop]
    )
    
    return model, history
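The effect of the decay settings above can be checked with a little arithmetic. With `staircase=True`, `ExponentialDecay` multiplies the initial rate by `decay_rate` once every `decay_steps` optimizer steps. The sketch below reproduces that formula in plain Python; the function name `staircase_lr` is hypothetical, and the step count assumes 7,929 training rows (as in the earlier 80/20 split) with a batch size of 64.

```python
from math import ceil

def staircase_lr(step, initial=1e-3, decay_steps=100, decay_rate=0.99):
    """Learning rate after `step` optimizer steps with staircase exponential decay."""
    return initial * decay_rate ** (step // decay_steps)

# Assumed sizes: 7,929 training rows, batches of 64, at most 50 epochs
steps_per_epoch = ceil(7929 / 64)
total_steps = steps_per_epoch * 50
lr_end = staircase_lr(total_steps)
print(steps_per_epoch, total_steps, lr_end)
```

Under these assumptions the learning rate falls to roughly half of its starting value by the last epoch, so the schedule is fairly gentle. Early stopping may end training before that point.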

In [2]:
# Getting the model

keras.backend.clear_session()

model = get_model()
model.summary()

In [None]:
# Fitting the model

model, history = compile_and_fit(model)
# predictions = model.predict(valid_ds, use_multiprocessing=True, workers=os.cpu_count())

Epoch 1/50


In [None]:
# Plotting the loss and RMSE of the model

plot_history(history)

In [None]:
# Using the model to predict on the test data

test_df[target] = model.predict(
    test_ds, use_multiprocessing=True, workers=os.cpu_count()
)

In [None]:
# Saving the submission file

test_df[['Id', target]].to_csv('submission.csv', index=False)
test_df[['Id', target]].head()

In [None]:
# Binning Pawpularity into quantile-based columns to test models
train_df['two_bin_pawp'] = pd.qcut(train_df['Pawpularity'], q=2, labels=False)
train_df = train_df.astype({"two_bin_pawp": str})

train_df['four_bin_pawp'] = pd.qcut(train_df['Pawpularity'], q=4, labels=False)
train_df = train_df.astype({"four_bin_pawp": str})

train_df['ten_bin_pawp'] = pd.qcut(train_df['Pawpularity'], q=10, labels=False)
train_df = train_df.astype({"ten_bin_pawp": str})