<a href="https://colab.research.google.com/github/laumek/MLEND-yummy-ML-Dish-classification-project/blob/main/MLEND_miniproject_preprocessing_basic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ECS7020P mini-project - basic part

# 1 Problem formulation

For this component of the coursework, we build a machine learning pipeline that takes as input a photo of a dish containing either rice or chips and outputs a prediction of which of the two appears in the picture. The problem is interesting because it lets us explore data preparation and feature operations (selection, extraction, etc.) on raw image data, and their impact on the quality of the model.

# 2 Transformation stage and Feature Extraction

The pipeline first transforms the images by padding them to a square and resizing them to a common size, so that every image is processed in the same way.
Then, two functions are defined to retrieve three features, which we will use for our models:

- get_yellow_component: This function counts the yellow pixels in an RGB image. The image is converted to HSV, the hue channel is rescaled to the 0-255 range, and the pixels whose hue lies strictly between the thresholds t1 and t2 (27 and 33 by default) are counted. The count is returned as a single feature. This is relevant for dish classification because yellow tones are associated with particular ingredients and cooking styles (fried chips, for instance, tend to be golden), so the feature should help to separate the two classes.

- GMLC_features: This function extracts Gray-Level Co-occurrence Matrix (GLCM) features from the grayscale version of an RGB image, prioritising texture information by eliminating colour-related disparities. The dissimilarity and correlation features obtained describe the spatial relationships among pixel intensities. These texture features have the potential to discriminate between rice and chips, as the two types of dishes may show distinctive textural patterns.
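
To make the hue thresholds t1 and t2 more concrete, the following is a minimal sketch of how they map onto colours. It uses the standard-library `colorsys` module as a stand-in for skimage's `rgb2hsv` (both return hue in [0, 1]); the helper names and the two test colours are hypothetical and not part of the notebook.

```python
import colorsys

def hue_on_255_scale(r, g, b):
    """Hue of an RGB colour (components in [0, 1]), rescaled to 0-255 and truncated."""
    h, _, _ = colorsys.rgb_to_hsv(r, g, b)
    return int(h * 255)

def is_counted_as_yellow(r, g, b, t1=27, t2=33):
    """Apply the same strict inequalities as the mask: t1 < hue < t2."""
    return t1 < hue_on_255_scale(r, g, b) < t2

# The default thresholds expressed in degrees on the usual 0-360 hue wheel.
t1_deg = 27 / 255 * 360  # roughly 38 degrees
t2_deg = 33 / 255 * 360  # roughly 47 degrees

golden = (1.0, 0.7, 0.0)       # hypothetical golden, chip-like colour
pure_yellow = (1.0, 1.0, 0.0)  # hue of 60 degrees

golden_counted = is_counted_as_yellow(*golden)
pure_yellow_counted = is_counted_as_yellow(*pure_yellow)
```

Under these assumptions the default window covers orange-yellow hues: the golden colour is counted, whereas pure yellow falls outside the window.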

The resulting features will then be standardised using StandardScaler() and concatenated into a numpy array.
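
The standardisation and concatenation step can be sketched as follows. The feature values are made up for illustration, and the scaling is written out in plain NumPy; it applies the same formula that `StandardScaler()` uses by default (subtract the column mean, divide by the population standard deviation).

```python
import numpy as np

# Hypothetical feature values for four images (made-up numbers).
yellow_px = np.array([1200.0, 300.0, 4500.0, 800.0])
dissimilarity = np.array([6.1, 9.4, 4.8, 7.7])
correlation = np.array([0.91, 0.78, 0.95, 0.84])

# One row per image, one column per feature.
X = np.column_stack([yellow_px, dissimilarity, correlation])

# Column-wise standardisation: zero mean and unit variance per feature.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
```

Standardising matters here because the pixel count is several orders of magnitude larger than the two GLCM features, which would otherwise dominate distance-based models such as an SVM.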


In [1]:
#Resizing

def make_it_square(I, pad=0):
    N, M, C = I.shape
    if N > M:
        Is = [np.pad(I[:, :, i], [(0, 0), (0, N - M)], 'constant', constant_values=pad) for i in range(C)]
    else:
        Is = [np.pad(I[:, :, i], [(0, M - N), (0, 0)], 'constant', constant_values=pad) for i in range(C)]
    return np.array(Is).transpose([1, 2, 0])

def resize_img(I, size=[200, 200]):
    N, M, C = I.shape
    Ir = [sp_transform.resize(I[:, :, i], size) for i in range(C)]
    return np.array(Ir).transpose([1, 2, 0])
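
As a quick check of the padding logic, here is a small sketch that pads the whole array in one `np.pad` call instead of channel by channel. `pad_to_square` is a hypothetical helper written for this illustration; it is intended to behave like `make_it_square` (padding at the bottom or on the right), not to replace it.

```python
import numpy as np

def pad_to_square(img, pad=0):
    """Pad the shorter spatial side of an (N, M, C) image at its end so that N == M."""
    n, m, _ = img.shape
    side = max(n, m)
    return np.pad(img, [(0, side - n), (0, side - m), (0, 0)],
                  mode='constant', constant_values=pad)

# A 2x4 "landscape" image with 3 channels, filled with ones.
landscape = np.ones((2, 4, 3))
squared = pad_to_square(landscape)  # shape becomes (4, 4, 3)
```

The original pixels stay in the top rows and the added rows are filled with the padding value, so the aspect ratio of the dish is preserved when the image is later resized to 200 x 200.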

In [2]:
# Function to get the yellow component level of images
def get_yellow_component(I,t1=27, t2=33):
  Ihsv = (rgb2hsv(I)*255).astype('uint8')
  mask = (Ihsv[:,:,0]<t2)*(Ihsv[:,:,0]>t1)
  Ypx = mask.sum()
  return Ypx

# Function to get gray-level co-occurrence matrix (GLCM) texture features of images

def GMLC_features(I):
  Ig = (rgb2gray(I)*255).astype('uint8')
  glcm = graycomatrix(Ig, distances=[5], angles=[0], levels=256,
                        symmetric=True, normed=True)
  dissimilarity = graycoprops(glcm, 'dissimilarity')[0, 0]
  correlation = graycoprops(glcm, 'correlation')[0, 0]
  return dissimilarity, correlation
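
To illustrate what the dissimilarity feature measures, the sketch below builds a symmetric, normalised co-occurrence matrix for angle 0 by hand and computes the dissimilarity as the sum of P(i, j) * |i - j|. It is a NumPy-only illustration on a toy 4-level image with distance 1; the function names are hypothetical, and the notebook itself relies on `graycomatrix` and `graycoprops`.

```python
import numpy as np

def glcm_horizontal(img, distance=1, levels=4):
    """Symmetric, normalised co-occurrence matrix for pixel pairs along rows (angle 0)."""
    left = img[:, :-distance].ravel()
    right = img[:, distance:].ravel()
    counts = np.zeros((levels, levels))
    # Count how often grey level `left` has grey level `right` to its right.
    np.add.at(counts, (left, right), 1)
    counts = counts + counts.T  # symmetric: count each pair in both directions
    return counts / counts.sum()  # normalise to probabilities

def glcm_dissimilarity(p):
    """Sum of P(i, j) * |i - j| over all grey-level pairs."""
    i, j = np.indices(p.shape)
    return (p * np.abs(i - j)).sum()

toy = np.array([[0, 0, 1, 1],
                [0, 0, 1, 1],
                [0, 2, 2, 2],
                [2, 2, 3, 3]])

P = glcm_horizontal(toy)
d = glcm_dissimilarity(P)
```

A smooth image concentrates its mass on the diagonal of the matrix and gives a low dissimilarity, while a grainy texture spreads mass away from the diagonal and gives a higher value.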

# 3 Dataset


In this section, the full Yummy dataset will be loaded and filtered to contain only images with rice or potato chips. Then a column 'Rice_or_chips' will be created, encoding 1 for rows containing rice and 0 for rows containing chips.

In [None]:
!pip install mlend



In [None]:
!pip install opencv-python



In [None]:
!pip install tqdm




In [None]:
!pip install imbalanced-learn



In [None]:
import mlend
from mlend import download_yummy, yummy_load

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import shutil
import glob
from tqdm import tqdm
from imblearn.over_sampling import SMOTE

from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import plot_tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, precision_score, recall_score, roc_curve

from skimage.feature import ORB
from skimage.feature import graycomatrix, graycoprops
from skimage import exposure
from skimage.color import rgb2hsv, rgb2gray
from skimage import transform as sp_transform
import cv2

In [None]:
from google.colab import drive

In [None]:
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
datadir = download_yummy(save_to = '/content/drive/MyDrive/Data/MLEnd/full', verbose=1, overwrite=False)

Downloading 3250 image files from https://github.com/MLEndDatasets/Yummy
100%|3250\3250|003250.jpg
Done!


In [None]:
MLENDYD_df = pd.read_csv('/content/drive/MyDrive/Data/MLEnd/full/yummy/MLEndYD_image_attributes_benchmark.csv').set_index('filename')

In [None]:
MLENDYD_df

| filename | Diet | Cuisine_org | Cuisine | Dish_name | Home_or_restaurant | Ingredients | Healthiness_rating | Healthiness_rating_int | Likeness | Likeness_int | Benchmark_A |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 000001.jpg | non_vegetarian | japanese | japanese | chicken_katsu_rice | marugame_udon | rice,chicken_breast,spicy_curry_sauce | neutral | 3.0 | like | 4.0 | Train |
| 000002.jpg | non_vegetarian | english | english | english_breakfast | home | eggs,bacon,hash_brown,tomato,bread,tomato,bake... | unhealthy | 2.0 | like | 4.0 | Train |
| 000003.jpg | non_vegetarian | chinese | chinese | spicy_chicken | jinli_flagship_branch | chili,chicken,peanuts,sihuan_peppercorns,green... | neutral | 3.0 | strongly_like | 5.0 | Train |
| 000004.jpg | vegetarian | indian | indian | gulab_jamun | home | sugar,water,khoya,milk,salt,oil,cardamon,ghee | unhealthy | 2.0 | strongly_like | 5.0 | Train |
| 000005.jpg | non_vegetarian | indian | indian | chicken_masala | home | chicken,lemon,turmeric,garam_masala,coriander_... | healthy | 4.0 | strongly_like | 5.0 | Train |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 003246.jpg | vegetarian | indian | indian | zeera_rice | home | 1_cup_basmati_rice,2_cups_water,2_tablespoons_... | healthy | 4.0 | strongly_like | 5.0 | Train |
| 003247.jpg | vegetarian | indian | indian | paneer_and_dal | home | fried_cottage_cheese,ghee,lentils,milk,wheat_f... | healthy | 4.0 | strongly_like | 5.0 | Test |
| 003248.jpg | vegetarian | indian | indian | samosa | home | potato,onion,peanut,salt,turmeric_powder,red_c... | very_unhealthy | 1.0 | like | 4.0 | Test |
| 003249.jpg | vegan | indian | indian | fruit_milk | home | kiwi,banana,apple,milk | very_healthy | 5.0 | strongly_like | 5.0 | Train |


The MLENDYD_df dataframe contains 3250 rows, one per picture, and 11 columns. Only the Ingredients column will be used for now.

In [None]:
filtered_df = MLENDYD_df[MLENDYD_df['Ingredients'].str.contains('rice|chip|fries', case=False, na=False)]

In [None]:
len(filtered_df)

867
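
The behaviour of this filter can be checked on a toy dataframe. The sketch below uses made-up rows and the same `str.contains` arguments; it shows that the match is a case-insensitive substring match and that missing ingredients are treated as non-matching because of `na=False`.

```python
import pandas as pd

# Hypothetical rows, made up for illustration.
toy = pd.DataFrame(
    {'Ingredients': ['rice,chicken', 'Potato_Chips,salt', 'french_fries',
                     'cooked_rice_noodles,water', 'bread,butter', None]},
    index=['a.jpg', 'b.jpg', 'c.jpg', 'd.jpg', 'e.jpg', 'f.jpg'],
)

# Same pattern and arguments as the notebook cell.
mask = toy['Ingredients'].str.contains('rice|chip|fries', case=False, na=False)
kept = toy[mask]
```

Because the pattern matches substrings, ingredients such as rice noodles are kept as well, which is worth keeping in mind when interpreting the labels.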

In [None]:
check_chips = filtered_df[filtered_df['Ingredients'].str.contains('chips')]
# check_chips

Other types of 'chips' appear, in particular chocolate chips. These need to be removed.

In [None]:
choc_chip = filtered_df['Ingredients'].str.contains('choc|chocolate') & filtered_df['Ingredients'].str.contains('chip')
rows_choc_chip= filtered_df[choc_chip]
filtered_df = filtered_df.drop(rows_choc_chip.index)

In [None]:
len(filtered_df)

858

In [None]:
filtered_df.isnull().sum()

Diet                      0
Cuisine_org               0
Cuisine                   0
Dish_name                 0
Home_or_restaurant        0
Ingredients               0
Healthiness_rating        1
Healthiness_rating_int    1
Likeness                  2
Likeness_int              2
Benchmark_A               0
dtype: int64

Rows containing null values will be removed. This should not affect our analysis, as at most three of the 858 rows have a missing value.

In [None]:
rows_with_null = filtered_df[filtered_df.Healthiness_rating_int.isnull() | filtered_df.Likeness_int.isnull()]
filtered_df = filtered_df.drop(rows_with_null.index)
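
For reference, the same rows can be removed with `DataFrame.dropna` restricted to the two rating columns. The sketch below compares both approaches on a toy dataframe with made-up values; it is an equivalent formulation, not the code used in the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings with two missing values in different rows.
toy = pd.DataFrame(
    {'Healthiness_rating_int': [3.0, np.nan, 4.0, 2.0],
     'Likeness_int': [4.0, 5.0, np.nan, 3.0]},
    index=['a.jpg', 'b.jpg', 'c.jpg', 'd.jpg'],
)

# Approach used in the notebook: build a mask, then drop the matching index.
rows_with_null = toy[toy.Healthiness_rating_int.isnull() | toy.Likeness_int.isnull()]
mask_version = toy.drop(rows_with_null.index)

# Equivalent one-liner.
dropna_version = toy.dropna(subset=['Healthiness_rating_int', 'Likeness_int'])
```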

In [None]:
filtered_df = filtered_df.drop(['Likeness', 'Healthiness_rating'], axis=1) #These are duplicated columns, providing strings instead of integers.

In [None]:
filtered_df

| filename | Diet | Cuisine_org | Cuisine | Dish_name | Home_or_restaurant | Ingredients | Healthiness_rating_int | Likeness_int | Benchmark_A |
|---|---|---|---|---|---|---|---|---|---|
| 000001.jpg | non_vegetarian | japanese | japanese | chicken_katsu_rice | marugame_udon | rice,chicken_breast,spicy_curry_sauce | 3.0 | 4.0 | Train |
| 000016.jpg | vegan | indian | indian | khichdi | home | rice,spices,herbs | 4.0 | 3.0 | Test |
| 000020.jpg | vegetarian | indian | indian | lentil-based_vegetable_stew_with__rice | home | ingredients:\nfor_cooking_rice:\n1_cup_rice_(a... | 4.0 | 4.0 | Test |
| 000021.jpg | non_vegetarian | asian | asian | biryani | home | mutton,rice,onion,tomato,red_chilli_powder,sal... | 4.0 | 5.0 | Train |
| 000022.jpg | vegetarian | indian | indian | rice_beetroot_curry | home | rice,beetroot,salt,spices | 5.0 | 3.0 | Train |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 003235.jpg | non_vegetarian | singapore | singaporean | singapore_style_noodles | asda | cooked_rice_noodles,roasted_chicken,water,red_... | 5.0 | 4.0 | Test |
| 003236.jpg | non_vegetarian | german/turkish | german_turkish | german_doner_kebab | gdk | lettuce,tomato,onion,red_cabbage,bread,yoghurt... | 2.0 | 5.0 | Train |
| 003243.jpg | vegetarian | british | british | pan-fried_beef_with_rice | restaurant | rice,corn,beef,red_cabbage | 4.0 | 2.0 | Train |
| 003244.jpg | vegetarian | italian | italian | khichdi | home | rice,split_yellow_mung_beans,salt,cumin_seeds,... | 4.0 | 3.0 | Train |


Finally, the column that will be used as label is created.

In [None]:
# Label 0 = chips (same 'chip'/'fries' patterns as the filter above), 1 = rice
filtered_df['Rice_or_chips'] = np.where(filtered_df['Ingredients'].str.contains('chip|fries', case=False), 0, 1)
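
A minimal sketch of how `np.where` turns a text match into a binary label, and how the resulting class balance can be inspected. The rows are made up, and `label_rice_or_chips` is a hypothetical helper that takes the chips pattern as a parameter.

```python
import numpy as np
import pandas as pd

def label_rice_or_chips(ingredients, chips_pattern):
    """Return 0 where the ingredients match the chips pattern and 1 otherwise."""
    return np.where(ingredients.str.contains(chips_pattern), 0, 1)

# Hypothetical rows, made up for illustration.
toy = pd.DataFrame(
    {'Ingredients': ['rice,spices', 'fish,chips', 'rice,beetroot', 'potato_chips,salt', 'rice']},
    index=['a.jpg', 'b.jpg', 'c.jpg', 'd.jpg', 'e.jpg'],
)

toy['Rice_or_chips'] = label_rice_or_chips(toy['Ingredients'], 'chip')

# Number of images per class: index 1 is rice, index 0 is chips.
class_counts = toy['Rice_or_chips'].value_counts()
```

Checking the class counts at this point is useful because an imbalance between the two classes affects both the choice of metric and whether resampling is needed.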

In [None]:
image_folder = '/content/drive/MyDrive/Data/MLEnd/full/yummy/filtered_images'

os.makedirs(image_folder, exist_ok=True)

image_files = filtered_df.index.tolist()

image_paths = [os.path.join(image_folder, filename) for filename in image_files]


In [None]:
for filename in filtered_df.index:
    original_path = os.path.join(datadir, 'MLEndYD_images', filename)
    new_path = os.path.join(image_folder, filename)
    shutil.copy(original_path, new_path)

print(f"Filtered images copied to: {image_folder}")


Filtered images copied to: /content/drive/MyDrive/Data/MLEnd/full/yummy/filtered_images


In [None]:
# Verify that the number of images in the new folder matches the number of rows
sample_path = '/content/drive/MyDrive/Data/MLEnd/full/yummy/filtered_images/*.jpg'
files = glob.glob(sample_path)
len(files)

856
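
Comparing counts confirms the totals but not the individual files. A slightly stricter check, sketched below under the assumption that the dataframe index holds the expected file names, compares the two sets of names directly. The example runs in a temporary directory with made-up file names so that it is self-contained.

```python
import glob
import os
import tempfile

# Hypothetical expected file names (stand-in for filtered_df.index).
expected = ['000001.jpg', '000016.jpg', '000020.jpg']

with tempfile.TemporaryDirectory() as folder:
    # Simulate a copy step in which one file did not arrive.
    for name in expected[:2]:
        open(os.path.join(folder, name), 'wb').close()

    on_disk = {os.path.basename(p) for p in glob.glob(os.path.join(folder, '*.jpg'))}

    missing = sorted(set(expected) - on_disk)     # expected but not copied
    unexpected = sorted(on_disk - set(expected))  # in the folder but not in the dataframe
```

Both lists being empty would indicate that the folder and the dataframe describe exactly the same images.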

# 4 Results