## Data Analysis Mathematics, Algorithms and Modeling

# AI Powered Recipe Recommendation System 

### Team : Group 3
| Student No  | First Name                  | Last Name     |
|-------------|-----------------------------|---------------|
| 9041129     | Nidhi                       | Ahir          |
| 9016986     | Keerthi                     | Gonuguntla    |
| 9027375     | Khushbu                     | Lad           |

#### Introduction

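This presentation focuses on the data cleaning techniques covered in the last class notes, mainly data reduction: missing value ratio, low variance filter, high correlation filter, PCA, Random Forest feature importance, and backward feature elimination.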

### Dataset & Programming Requirements

##### Rectangular Dataset : files
1. RAW_recipes.csv
2. RAW_interactions.csv

##### Import Libraries

In [1]:
import numpy as np
import pandas as pd 
from scipy import stats
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from scipy.stats import zscore
from statsmodels.tsa.seasonal import seasonal_decompose
import matplotlib as mpl
mpl.rcParams['agg.path.chunksize'] = 10000
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

##### RawRecipe : Dataset in classes and methods

In [2]:
class RawRecipe:
    def __init__(self):
        self.file_path = './Dataset/RAW_recipes.csv'
        self.data = None
    
    # Loads the data from a CSV file.
    def load_data(self):
        self.data = pd.read_csv(self.file_path)
        print(f"---> STEP 1 : Loads the data from a CSV file. \r\n")
        print(f"RAW_recipes.csv : Data loaded successfully.")
        print(f"Total Records : {self.data.shape[0]} \r\n")
        return self.data

##### RAW_interactions : Dataset in classes and methods

In [3]:
class RecepeInteraction:
    def __init__(self):
        self.file_path = './Dataset/RAW_interactions.csv'
        self.data = None
    
    # Loads the data from a CSV file.
    def load_data(self):
        self.data = pd.read_csv(self.file_path)
        print(f"---> STEP 1 : Loads the data from a CSV file. \r\n")
        print(f"RAW_interactions.csv : Data loaded successfully.")
        print(f"Total Records : {self.data.shape[0]} \r\n")
        return self.data
    
    # Returns the first five rows so the caller (or notebook cell) can display them.
    def view_sample_data(self):
        return self.data.head(5)

    # Data quality : Null Check
    def check_null_values(self):
        print(f"---> STEP 2 : Null Check for data \r\n")
        if self.data is not None:
            nulls = self.data.isnull().sum()
            print(nulls)
            return nulls
        else:
            print("Data not loaded.")
    # Data quality : Duplicate Check (recipe_id values that appear more than once)
    def check_duplicate_values(self):
        print(f"\r\n---> STEP 3 : Duplicate data Check for recipe \r\n")
        if self.data is not None:
            counts = self.data["recipe_id"].value_counts()
            dupl = (counts[counts>1]).reset_index()
            dupl.columns = ["recipe_id", "Count"]
            print(dupl)
            return dupl
        else:
            print("Data not loaded.")
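
The duplicate check above counts how often each `recipe_id` occurs. In an interactions table a recipe is expected to appear once per review, so a repeated `recipe_id` is not necessarily a duplicate record. A minimal sketch, assuming toy data (all values below are made up), of how the two cases can be told apart:

```python
import pandas as pd

# Toy interactions table; values are hypothetical
toy = pd.DataFrame({
    "user_id":   [10, 11, 10, 12],
    "recipe_id": [1, 1, 1, 2],
    "rating":    [5, 4, 5, 3],
})

# Recipes reviewed more than once (normal for interaction data)
counts = toy["recipe_id"].value_counts()
repeated_recipes = counts[counts > 1]

# Rows that repeat the same (user_id, recipe_id) pair, which is closer
# to a true duplicate; keep=False marks every row of the repeated pair
duplicate_pairs = toy[toy.duplicated(subset=["user_id", "recipe_id"], keep=False)]

print(repeated_recipes)
print(duplicate_pairs)
```

Whether a repeated user and recipe pair should be removed depends on the task; this only shows how to find such rows.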

#### The main function : Initialise class objects & load data

In [4]:
if __name__ == "__main__":

    # Create an instance of the RecepeInteraction  class and load data
    interactionData = RecepeInteraction()
    interactionData.load_data()

    # Create an instance of the RawRecipe class and load data
    recepeData = RawRecipe()
    recepeData.load_data()

---> STEP 1 : Loads the data from a CSV file. 

RAW_interactions.csv : Data loaded successfully.
Total Records : 1132367 

---> STEP 1 : Loads the data from a CSV file. 

RAW_recipes.csv : Data loaded successfully.
Total Records : 231637 



#### Merge datasets based on recipe ID

In [5]:
# Merge data using the common recipe ID field (inner join by default)
merged_data = pd.merge(recepeData.data, interactionData.data, left_on='id', right_on='recipe_id')
print("Data Merged Successfully")
merged_data.head(2)

Data Merged Successfully


|   | name | id | minutes | contributor_id | submitted | tags | nutrition | n_steps | steps | description | ingredients | n_ingredients | user_id | recipe_id | date | rating | review |
|---|------|----|---------|----------------|-----------|------|-----------|---------|-------|-------------|-------------|---------------|---------|-----------|------|--------|--------|
| 0 | arriba baked winter squash mexican style | 137739 | 55 | 47892 | 2005-09-16 | ['60-minutes-or-less', 'time-to-make', 'course... | [51.5, 0.0, 13.0, 0.0, 2.0, 0.0, 4.0] | 11 | ['make a choice and proceed with recipe', 'dep... | autumn is my favorite time of year to cook! th... | ['winter squash', 'mexican seasoning', 'mixed ... | 7 | 4470 | 137739 | 2006-02-18 | 5 | I used an acorn squash and recipe#137681 Swee... |
| 1 | arriba baked winter squash mexican style | 137739 | 55 | 47892 | 2005-09-16 | ['60-minutes-or-less', 'time-to-make', 'course... | [51.5, 0.0, 13.0, 0.0, 2.0, 0.0, 4.0] | 11 | ['make a choice and proceed with recipe', 'dep... | autumn is my favorite time of year to cook! th... | ['winter squash', 'mexican seasoning', 'mixed ... | 7 | 593927 | 137739 | 2010-08-21 | 5 | This was a nice change. I used butternut squas... |
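
The merge above relies on the `pd.merge` default of an inner join, so only recipes with at least one interaction survive, and each recipe row is repeated once per review. A small sketch of that behaviour, assuming toy frames with made-up values; the `validate` argument is an optional safeguard that is not part of the original cell:

```python
import pandas as pd

# Toy stand-ins for the two files (values are hypothetical)
recipes = pd.DataFrame({"id": [1, 2, 3], "name": ["soup", "salad", "stew"]})
interactions = pd.DataFrame({
    "recipe_id": [1, 1, 3],
    "user_id": [10, 11, 12],
    "rating": [5, 4, 3],
})

# Inner join: one output row per interaction with a matching recipe.
# validate raises MergeError if recipe ids are not unique on the left side.
merged = pd.merge(
    recipes, interactions,
    left_on="id", right_on="recipe_id",
    how="inner", validate="one_to_many",
)

print(merged.shape)
# After the join, id and recipe_id carry identical values in every row
print((merged["id"] == merged["recipe_id"]).all())
```

Because `id` and `recipe_id` are identical after the join, one of the two columns is redundant in the merged frame.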


#### Missing value Ratio

In [6]:
missing_value_ratio = (merged_data.isnull().sum() / len(merged_data))
missing_value_percentage = missing_value_ratio * 100
pd.set_option('display.float_format', '{:.10f}'.format)

missing_values_table = pd.DataFrame({
    'Column Name': missing_value_ratio.index,
    'Ratio': missing_value_ratio.values,
    'Percentage': missing_value_percentage.values
})
missing_values_table

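|   | Column Name    | Ratio        | Percentage   |
|---|----------------|--------------|--------------|
| 0 | name           | 8.831e-07    | 8.83106e-05  |
| 1 | id             | 0.0          | 0.0          |
| 2 | minutes        | 0.0          | 0.0          |
| 3 | contributor_id | 0.0          | 0.0          |
| 4 | submitted      | 0.0          | 0.0          |
| 5 | tags           | 0.0          | 0.0          |
| 6 | nutrition      | 0.0          | 0.0          |
| 7 | n_steps        | 0.0          | 0.0          |
| 8 | steps          | 0.0          | 0.0          |
| 9 | description    | 0.0207618202 | 2.0761820152 |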


The results show null values in only three columns: name, description and review. The description and review columns are free text and are not mandatory features in this dataset, so their nulls can stay. The "name" column must have a value, so we drop the records where it is null. With a ratio of 8.831e-07, this affects a single record.

In [7]:
merged_data = merged_data.dropna(subset=['name'])
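
The missing value ratio and the targeted drop can be checked on a tiny example. This is a sketch on made-up data, not the project dataset; it shows that `dropna(subset=...)` removes rows only when the mandatory column is null and leaves nulls in optional columns untouched:

```python
import pandas as pd

# Toy frame with hypothetical values
toy = pd.DataFrame({
    "name": ["soup", None, "stew", "cake"],
    "description": ["warm", "fresh", None, None],
    "rating": [5, 4, 3, 5],
})

# Fraction of missing values per column; isnull().mean() equals isnull().sum() / len(df)
ratio = toy.isnull().mean()
print(ratio)

# Drop only the rows where the mandatory column is missing
cleaned = toy.dropna(subset=["name"])
print(len(toy) - len(cleaned), "row(s) dropped")
```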

#### Low Variance Filter

In [8]:
numerical_features = merged_data.select_dtypes(include=['number'])
variance = numerical_features.var()
print(variance)

id                      17003803127.6721191406
minutes              77378371288826.5937500000
contributor_id     4589619737965439.0000000000
n_steps                          33.8687988906
n_ingredients                    13.6154322700
user_id          251429104826142400.0000000000
recipe_id               17003803127.6721191406
rating                            1.5995814482
dtype: float64


1. Among the numerical features, those with variance above 100 are the identifier columns (id, contributor_id, user_id, recipe_id) and minutes. Identifiers are labels rather than measurements, so their variance says nothing about how informative they are.
2. The number of steps and the number of ingredients have considerably lower variance. However, these columns describe the recipe itself and cannot be ignored when performing ML tasks.
3. Rating has the lowest variance among all numerical features. However, rating is an integral part of the data, as it holds categorical values between 1 and 5.
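
Raw variance depends on the scale of each column, which is why the ID columns and minutes dominate the output above. One common way to make the comparison fairer is to rescale every feature to the same range before filtering. The sketch below assumes toy data, a hypothetical `constant_flag` column, and an arbitrary threshold of 0.01; none of these come from the project:

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import MinMaxScaler

# Toy data (hypothetical): two varying features and one constant column
toy = pd.DataFrame({
    "minutes": [10, 500, 30, 45, 60, 20],
    "n_steps": [3, 12, 5, 7, 9, 4],
    "constant_flag": [1, 1, 1, 1, 1, 1],
})

# Rescale each column to [0, 1] so variances are comparable
scaled = pd.DataFrame(MinMaxScaler().fit_transform(toy), columns=toy.columns)

# Keep only features whose scaled variance exceeds the (assumed) threshold
selector = VarianceThreshold(threshold=0.01)
selector.fit(scaled)
kept = list(scaled.columns[selector.get_support()])

print(scaled.var(ddof=0))
print("Kept features:", kept)
```

On this toy frame only the constant column falls below the threshold. On the real data the threshold would need to be chosen with the feature meanings in mind, as the notes above do for rating.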

#### High Correlation Filter

In [9]:
correlation_matrix = numerical_features.corr().abs()
upper_triangle = correlation_matrix.where(
    np.triu(np.ones(correlation_matrix.shape), k=1).astype(bool)
)
print(upper_triangle)

threshold = 0.9
highly_correlated = [column for column in upper_triangle.columns if any(upper_triangle[column] > threshold)]

print("Features to remove due to high correlation:", highly_correlated)

df_filtered = merged_data.drop(columns=highly_correlated)

                id      minutes  contributor_id      n_steps  n_ingredients  \
id             NaN 0.0031643677    0.1029516836 0.0567002909   0.0181197736   
minutes        NaN          NaN    0.0001325212 0.0004375931   0.0010589317   
contributor_id NaN          NaN             NaN 0.0277206877   0.0053791485   
n_steps        NaN          NaN             NaN          NaN   0.3802954944   
n_ingredients  NaN          NaN             NaN          NaN            NaN   
user_id        NaN          NaN             NaN          NaN            NaN   
recipe_id      NaN          NaN             NaN          NaN            NaN   
rating         NaN          NaN             NaN          NaN            NaN   

                    user_id    recipe_id       rating  
id             0.1000590419 1.0000000000 0.0135653999  
minutes        0.0005947115 0.0031643677 0.0010534285  
contributor_id 0.1027458073 0.1029516836 0.0122144900  
n_steps        0.0516858362 0.0567002909 0.0211706262  
n_ingred
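
The correlation of 1.0 between `id` and `recipe_id` in the matrix above is what the filter is designed to catch. A minimal sketch on synthetic data (the column values are randomly generated, only the column names mirror the dataset) shows the same upper-triangle logic flagging the duplicated column:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic data: recipe_id is an exact copy of id, n_steps is unrelated
toy = pd.DataFrame({"id": rng.integers(1, 1000, size=200)})
toy["recipe_id"] = toy["id"]
toy["n_steps"] = rng.integers(1, 20, size=200)

# Absolute correlations, keeping only the upper triangle so each pair is seen once
corr = toy.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))

# A column is flagged if it correlates above the threshold with any earlier column
threshold = 0.9
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
print("Features to remove due to high correlation:", to_drop)
```

Using the upper triangle means the later column of a correlated pair is dropped and the earlier one is kept.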

#### Principal Component Analysis (PCA)


In [10]:
# Extracting only numeric columns for PCA
numeric_data = merged_data.select_dtypes(include=[float, int])

# Standardizing the data
scaler = StandardScaler()
data_scaled = scaler.fit_transform(numeric_data)

# Applying PCA, choosing 2 components as an example
pca = PCA(n_components=2)
principal_components = pca.fit_transform(data_scaled)

# Creating a DataFrame with the principal components
pca_data = pd.DataFrame(principal_components, columns=['PC1', 'PC2'])

# Displaying the explained variance and the transformed data
print("Explained Variance Ratio:", pca.explained_variance_ratio_)
print("Principal Components:\n", pca_data)

Explained Variance Ratio: [0.25726245 0.1719967 ]
Principal Components:
                   PC1           PC2
0       -0.3247712722 -0.2419663842
1       -0.3245840519 -0.2418344166
2       -0.3247160209 -0.2419274387
3       -1.2904069975 -0.1049440968
4       -1.4926938349 -0.4891884359
...               ...           ...
1132361  1.4020954602 -0.8122966214
1132362  1.5375237071  0.1228669409
1132363  1.4096947961 -0.9129287057
1132364  1.2480004672 -1.2202289191
1132365  2.0226483570 -0.4325397473

[1132366 rows x 2 columns]
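
The two components above explain roughly 43% of the variance together (0.257 + 0.172). A common way to pick the number of components is to look at the cumulative explained variance instead of fixing it at 2. The following is a sketch on synthetic data, and the 90% target is an assumption chosen for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic data: six columns, two of them nearly redundant
X = rng.normal(size=(500, 6))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=500)

# Standardize, then fit PCA with all components retained
X_scaled = StandardScaler().fit_transform(X)
pca = PCA().fit(X_scaled)

# Smallest number of components whose cumulative ratio reaches the target
cumulative = np.cumsum(pca.explained_variance_ratio_)
target = 0.90
n_needed = int(np.searchsorted(cumulative, target) + 1)

print(cumulative.round(3))
print("Components needed for 90% variance:", n_needed)
```

scikit-learn also accepts a fraction directly, for example `PCA(n_components=0.90)`, which selects the number of components needed to reach that share of variance.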


#### Random Forest Trees

In [15]:
from sklearn.ensemble import RandomForestRegressor

def random_forest_feature_importance(data, target_column='rating'):
    """
    Use Random Forest to evaluate feature importance.
    """
    # Selecting the features and target variable
    X = data.select_dtypes(include=['number']).drop(columns=[target_column])
    y = data[target_column]

    # Instantiate the RandomForestRegressor
    rf = RandomForestRegressor(n_estimators=100, random_state=42)
    rf.fit(X, y)
    
    # Get feature importances
    feature_importances = pd.Series(rf.feature_importances_, index=X.columns)
    feature_importances_sorted = feature_importances.sort_values(ascending=False)
    
    print("Feature importances:\n", feature_importances_sorted)
    return feature_importances_sorted

# Apply Random Forest Feature Importance to the cleaned merged data
feature_importances = random_forest_feature_importance(merged_data, target_column='rating')

Feature importances:
 user_id          0.3792770853
contributor_id   0.1524160286
id               0.1005026234
recipe_id        0.1004300101
minutes          0.1004252063
n_steps          0.0899135655
n_ingredients    0.0770354808
dtype: float64
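
Impurity-based importances from tree ensembles tend to favour numeric features with many distinct values, which may partly explain why the ID columns rank so high above. Permutation importance on held-out data is a common cross-check. The sketch below uses synthetic data and a hypothetical `random_id` column that carries no signal by construction:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 600

# Synthetic data: the target depends on n_steps only
X = pd.DataFrame({
    "n_steps": rng.integers(1, 20, size=n),
    "random_id": rng.permutation(n),  # unique, ID-like, unrelated to the target
})
y = 0.2 * X["n_steps"] + rng.normal(scale=0.5, size=n)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)

# Impurity-based importances (computed from the training data)
impurity = pd.Series(rf.feature_importances_, index=X.columns)

# Permutation importances: drop in test score when one column is shuffled
perm = permutation_importance(rf, X_test, y_test, n_repeats=5, random_state=0)
permutation = pd.Series(perm.importances_mean, index=X.columns)

print(pd.DataFrame({"impurity": impurity, "permutation": permutation}))
```

If an ID-like column scores high on impurity importance but near zero on permutation importance, it is probably not a useful predictor.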


#### Backward/Forward Feature Elimination/Selection

In [29]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

def backward_elimination(data, target_column='rating', threshold=0.05):
    """
    Drop the least important feature, one at a time, until every remaining
    feature has a Random Forest importance of at least `threshold`.
    """
    # Define the target and features
    X = data.select_dtypes(include=['number']).drop(columns=[target_column])
    y = data[target_column]

    # Train-test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Instantiate the RandomForestRegressor
    rf = RandomForestRegressor(n_estimators=100, random_state=42)
    rf.fit(X_train, y_train)
    
    # Get feature importances
    feature_importances = pd.Series(rf.feature_importances_, index=X.columns)
    
    # Backward Elimination: loop while the weakest feature is below the threshold
    # (the original test on max() never removed anything); keep at least one feature
    while len(feature_importances) > 1 and feature_importances.min() < threshold:
        # Remove the least important feature
        least_important_feature = feature_importances.idxmin()
        X_train = X_train.drop(columns=[least_important_feature])
        X_test = X_test.drop(columns=[least_important_feature])
        
        # Re-train the model
        rf.fit(X_train, y_train)
        feature_importances = pd.Series(rf.feature_importances_, index=X_train.columns)
    
    print(f"Features selected after Backward Elimination: {X_train.columns}")
    return X_train, X_test

# Apply Backward Elimination
X_train_selected, X_test_selected = backward_elimination(merged_data, target_column='rating')


Features selected after Backward Elimination: Index(['id', 'minutes', 'contributor_id', 'n_steps', 'n_ingredients',
       'user_id', 'recipe_id'],
      dtype='object')
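
The function above removes features based on importance. scikit-learn also provides `SequentialFeatureSelector`, which covers both forward selection and backward elimination by comparing cross-validated scores. The sketch below is an alternative technique, not the method used in this notebook; it assumes synthetic data, a linear model for speed, and a hypothetical target of two features:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 300

# Synthetic data: the target depends on the first two columns only
X = pd.DataFrame({
    "n_steps": rng.normal(size=n),
    "n_ingredients": rng.normal(size=n),
    "noise_a": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
})
y = 2.0 * X["n_steps"] - 1.5 * X["n_ingredients"] + rng.normal(scale=0.3, size=n)

# Run the selector in both directions and collect the chosen columns
selected = {}
for direction in ("forward", "backward"):
    sfs = SequentialFeatureSelector(
        LinearRegression(), n_features_to_select=2, direction=direction, cv=5
    )
    sfs.fit(X, y)
    selected[direction] = list(X.columns[sfs.get_support()])

print(selected)
```

Forward selection starts from no features and adds the best one at each step, while backward elimination starts from all features and removes the least useful one at each step.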
