# **Recommender Function**


## Summary

This notebook contains the Recommender Function for our Wine Recommender System. It lives in a separate file so that the Recommender System can be used and tested quickly.

The Recommender System will take in the user preferences and output the recommended type of wine based on those preferences. The top 3 most highly rated wines of that type from the country given in the inputs will be recommended to them.
- If no such wines are available, the top 3 most highly rated wines of that type across the whole catalogue are given to the user instead.
- If the country given by the user has fewer than 3 wines of the predicted type, the remaining slots are filled with the other top-rated wines of that type from the rest of the catalogue.
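
The fallback rules above can be sketched in plain Python. This is a minimal illustration, not the notebook's implementation: `pick_top_three` and the toy `wines` list are hypothetical, and the list is assumed to be already filtered to the predicted wine type.

```python
def pick_top_three(wines, country, n=3):
    """Return up to n wines, preferring `country` and topping up from other countries."""
    # rank everything by rating, best first
    ranked = sorted(wines, key=lambda w: w["rating"], reverse=True)
    # best wines from the requested country (possibly none)
    local = [w for w in ranked if w["country"] == country][:n]
    # fill any remaining slots with the best wines from elsewhere
    others = [w for w in ranked if w["country"] != country]
    return local + others[: n - len(local)]


# toy catalogue (made-up values)
wines = [
    {"name": "A", "country": "Chile", "rating": 4.7},
    {"name": "B", "country": "Greece", "rating": 4.1},
    {"name": "C", "country": "Portugal", "rating": 4.4},
    {"name": "D", "country": "Australia", "rating": 4.3},
]

picks = pick_top_three(wines, "Greece")
print([w["name"] for w in picks])  # ['B', 'A', 'C']
```

The sketch lists the wines from the requested country first; re-sorting the final three by rating is an equally reasonable presentation choice.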

## Importing necessary packages

In [97]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import re

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# To ignore warnings in the notebook
import warnings
warnings.filterwarnings("ignore")

In [98]:
# import data
df = pd.read_csv('XWines_Full_100K_wines.csv')

## Preprocessing and Data Cleaning

### Grapes

In [99]:
# creating a function to strip the list formatting (square brackets and quotation marks) from string columns
# to be used on Grapes, Harmonize and Vintages
# inputs: data is the dataframe, column_names is a list of column names
def clean_column(data, column_names):
    
    for column_name in column_names:

        if column_name in ['Grapes', 'Harmonize']:
            # extracting every item enclosed in single quotes
            data[column_name] = data[column_name].apply(lambda x: re.findall(r"'(.*?)'", x))

            # convert the list of words back to a string
            data[column_name] = data[column_name].apply(lambda x: ', '.join(x))
        
        else: 
            # removing the square brackets
            data[column_name] = data[column_name].apply(lambda x: str(x).strip('[]'))

    return data


df = clean_column(df, ['Grapes', 'Harmonize', 'Vintages'])
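
As a quick illustration of what the two cleaning branches do to a single value, here is a small sketch. The sample strings are assumptions about how the raw columns look (a Python-style list stored as text), not rows taken from the dataset.

```python
import re

# assumed raw values: lists stored as text
raw_grapes = "['Syrah/Shiraz', 'Merlot']"
raw_vintages = "[2020, 2019, 2018]"

# Grapes / Harmonize branch: pull out every item between single quotes, then re-join
grapes = ", ".join(re.findall(r"'(.*?)'", raw_grapes))

# Vintages branch: only the surrounding square brackets are removed
vintages = raw_vintages.strip("[]")

print(grapes)    # Syrah/Shiraz, Merlot
print(vintages)  # 2020, 2019, 2018
```

One caveat of this pattern: an item that itself contains an apostrophe would not be captured cleanly.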

In [100]:
# creating a function to count the comma-separated items in each entry
# inputs: data is the dataframe, column_names is the list of columns to convert to counts

def get_counts(data, column_names):
    for column_name in column_names:
        data[column_name] = data[column_name].apply(lambda x: len(x.split(', ')))

    return data

df = get_counts(df, ['Grapes'])

### Harmonize

In [101]:
# grouping similar foods, and foods listed under near-identical names, into shared categories
red_meat = ['Beef', 'Pork', 'Lamb', 'Veal', 'Meat', 'Ham', 'Red Meat']
white_meat = ['Chicken', 'Poultry', 'Duck', 'Cold Cuts']
cheese = ['Mild Cheese', 'Medium-cured Cheese', 'Cheese', 'Soft Cheese', 'Maturated Cheese', 'Hard Cheese', 'Goat Cheese', 'Blue Cheese']
seafood = ['Shellfish', 'Rich Fish', 'Lean Fish', 'Fish', 'Codfish', 'Seafood']
italian = ['Pasta', 'Risotto', 'Tagliatelle', 'Lasagna', 'Eggplant Parmigiana', 'Pizza']
dessert = ['Sweet Dessert', 'Fruit Dessert', 'Dessert', 'Citric Dessert', 'Cake', 'Soufflé', 'Chocolate', 'Spiced Fruit Cake']
vegetarian = ['Vegetarian', 'Mushrooms', 'Salad', 'Beans', 'Baked Potato', 'Chestnut']
snacks = ['Snack', 'French Fries', 'Fruit', 'Cookies']
others = ['Sushi', 'Sashimi', 'Yakissoba', 'Asian Food', 'Roast', 'Tomato Dishes', 'Cream', 'Curry Chicken', 'Barbecue', 'Light Stews', 'Paella', 'Grilled', 'Dried Fruits']
appetizer = ['Appetizer', 'Aperitif']

In [102]:
# the order of names must match the order of list_of_lists, since the two are zipped together below
list_of_lists = [red_meat, white_meat, cheese, seafood, italian, dessert, vegetarian, snacks, others, appetizer]
names = ['Red Meat', 'White Meat', 'Cheese', 'Seafood', 'Italian', 'Dessert', 'Vegetarian', 'Snacks', 'Others', 'Appetizer']

# define a function to re-assign the categories for each row
def reassign_categories(row):
    # splitting the food in the string and making it a list
    food_list = row.split(', ')

    # iterate through the list and re-assign the categories
    for i in range(len(food_list)):
        for lst, name in zip(list_of_lists, names):
            if food_list[i] in lst:
                food_list[i] = name

    # remove repeated food categories for each row
    new_row = list(set(food_list))

    # joining the list back into a string
    new_row = ', '.join(new_row)

    return new_row

# apply the function to each row of the DataFrame
df['Harmonize'] = df['Harmonize'].apply(reassign_categories)
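
The same regrouping idea can also be expressed as a dictionary lookup. The sketch below is an alternative illustration, not the notebook's code: `CATEGORY_OF` is a hypothetical, deliberately shortened mapping and `regroup` is a hypothetical helper. It uses `dict.fromkeys` to drop duplicates while keeping first-seen order, whereas the order of a `set` of strings can differ between Python runs.

```python
# hypothetical, shortened food -> category mapping
CATEGORY_OF = {
    "Beef": "Red Meat", "Lamb": "Red Meat",
    "Chicken": "White Meat", "Poultry": "White Meat",
    "Pasta": "Italian", "Pizza": "Italian",
}


def regroup(pairings):
    """Map each food to its category and drop repeats, keeping first-seen order."""
    foods = pairings.split(", ")
    # foods without a mapping (e.g. 'Game Meat') are kept unchanged
    mapped = [CATEGORY_OF.get(food, food) for food in foods]
    return ", ".join(dict.fromkeys(mapped))


print(regroup("Beef, Lamb, Pasta, Game Meat"))  # Red Meat, Italian, Game Meat
```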

### Body

In [103]:
# removing '-bodied' from body column
df['Body'] = df['Body'].str.replace('-bodied', '')

### Type

In [104]:
# replace dessert/port to just dessert wine
df['Type'] = df['Type'].str.replace('Dessert/Port', 'Dessert')

### Countries

In [105]:
# getting countries that appeared more than 100 times
country_counts = df['Country'].value_counts()
filtered_countries = country_counts[country_counts > 100]
df = df[df['Country'].isin(filtered_countries.index)]
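
To see the frequency filter on a small scale, here is a toy run of the same pattern. The data frame `toy` and the threshold `min_count` are made up for illustration; the notebook itself uses a threshold of 100.

```python
import pandas as pd

# toy data: France x3, Italy x2, Peru x1
toy = pd.DataFrame({"Country": ["France", "France", "France", "Italy", "Italy", "Peru"]})
min_count = 1  # illustrative threshold only

# count rows per country, then keep rows whose country appears more than min_count times
counts = toy["Country"].value_counts()
kept = toy[toy["Country"].isin(counts[counts > min_count].index)]

print(sorted(kept["Country"].unique()))  # ['France', 'Italy']
```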

### Data Splitting

In [106]:
# splitting data into catalogue and training data
df = df.sample(frac = 1, random_state = 100)
catalogue = df[:80000]
df = df[80000:]

## Using the Best Model

In [107]:
df = df[['Type', 'Grapes', 'Harmonize', 'ABV', 'Body', 'Acidity', 'Country']]

In [108]:
# one-hot encoding the Harmonize column
one_hot = df['Harmonize'].str.get_dummies(', ')

# Rename the columns with the 'Harmonize_' prefix
one_hot = one_hot.add_prefix('Harmonize_')

# Concatenate the original DataFrame with the one-hot encoded DataFrame
df = pd.concat([df, one_hot], axis=1)

# Drop the original Harmonize column
df = df.drop(columns = 'Harmonize')
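
For readers unfamiliar with `str.get_dummies`, the sketch below shows what it produces for a multi-valued text column. The three-row frame `toy` is invented for the example.

```python
import pandas as pd

# toy multi-valued column, items separated by ', '
toy = pd.DataFrame({"Harmonize": ["Red Meat, Cheese", "Seafood", "Cheese, Seafood"]})

# one indicator column per distinct item, prefixed for readability
flags = toy["Harmonize"].str.get_dummies(", ").add_prefix("Harmonize_")

print(flags.columns.tolist())
# ['Harmonize_Cheese', 'Harmonize_Red Meat', 'Harmonize_Seafood']
```

Each row then has a 1 in every column whose food appears in its original string, so a wine that pairs with several foods sets several indicators at once.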

In [109]:
# splitting the data into training and test
X = df.drop(columns = ['Type'])
y = df['Type']

# train_test_split on dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 100)

In [110]:
# function to get all categorical variables

def getCategorical(X_train, data):
    categorical_variables = []
    
    for column in X_train.columns:
        if data[column].dtype == "object":
            categorical_variables.append(column)

    return categorical_variables

In [111]:
# function to create a transformer to encode categorical variables

def transformer(categorical_variables):
    # One-hot encoding
    enc_rf = OneHotEncoder(sparse_output = False, handle_unknown = "ignore")

    transformer_rf = ColumnTransformer([
        ("categorical", enc_rf, categorical_variables)
    ], remainder="passthrough")

    return transformer_rf

In [112]:
# function to transform data

def transformData(X_train, X_test, transformer_rf):

    # fit the encoder on the training data only, then reuse the fitted encoder on the test data
    X_train_encoded_rf = pd.DataFrame(transformer_rf.fit_transform(X_train), columns = transformer_rf.get_feature_names_out())
    X_test_encoded_rf = pd.DataFrame(transformer_rf.transform(X_test), columns = transformer_rf.get_feature_names_out())
    
    return [X_train_encoded_rf, X_test_encoded_rf]
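
A common convention with encoders is to learn the categories from the training data and only apply them to held-out data. The sketch below illustrates that pattern on made-up arrays (`train` and `test` are assumptions, not the notebook's data), including what `handle_unknown="ignore"` does with a category that was never seen during fitting.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# toy single-column data; 'Greece' does not occur in the training rows
train = np.array([["France"], ["Italy"], ["France"]], dtype=object)
test = np.array([["Italy"], ["Greece"]], dtype=object)

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)  # categories are learned from the training rows only

# transform (not fit_transform) keeps the column layout learned from train
encoded_test = encoder.transform(test).toarray()
print(encoded_test.tolist())  # [[0.0, 1.0], [0.0, 0.0]]
```

The unseen category is encoded as a row of zeros, so the test matrix keeps exactly the columns the model was trained on.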

In [113]:
# function to rename the column to increase readability

def renameCol(categorical_variables, X_train_encoded_rf, X_test_encoded_rf):
    
    X_train_encoded_rf.columns = X_train_encoded_rf.columns.str.replace(re.compile(r'categorical__|remainder__'), '', regex = True)
    X_test_encoded_rf.columns = X_test_encoded_rf.columns.str.replace(re.compile(r'categorical__|remainder__'), '', regex = True)

    return [X_train_encoded_rf, X_test_encoded_rf]


In [114]:
# function that combines all the above functions into a function called preprocess
def preprocess(X_train, X_test, data):
    
    # use the getCategorical function to get categorical variables in the dataset
    categorical_variables = getCategorical(X_train, data)
    
    # use transformer function to create the transformer
    transformer_rf = transformer(categorical_variables)
    
    # use transformData function
    X_train_encoded_rf, X_test_encoded_rf = transformData(X_train, X_test, transformer_rf)

    # renaming the columns for readability
    X_train_encoded_rf, X_test_encoded_rf = renameCol(categorical_variables, X_train_encoded_rf, X_test_encoded_rf)

    return [X_train_encoded_rf, X_test_encoded_rf, transformer_rf]

## Random Forest Classifier

In [115]:
# unpacking values
X_train_encoded_rf, X_test_encoded_rf, transformer_rf = preprocess(X_train, X_test, df)

In [116]:
clf = RandomForestClassifier(criterion = 'entropy', 
                            max_depth = 15, 
                            min_samples_leaf = 2, 
                            min_samples_split = 5, 
                            n_estimators = 700,
                            class_weight = 'balanced', 
                            random_state = 100)

clf.fit(X_train_encoded_rf, y_train)

## Recommender Function

In [117]:
# define a function to get user inputs for recommendation

def get_inputs():

    body_options = ['Full', 'Light', 'Medium', 'Very full', 'Very light']
    acidity_options = ['High', 'Low', 'Medium']
    country_options = ['Argentina', 'Australia', 'Austria', 'Brazil', 'Bulgaria', 'Canada', 'Chile', 'Croatia', 'Czech Republic', 'France', 'Georgia', 'Germany', 'Greece', 'Hungary', 'Israel', 'Italy', 'Mexico', 'Moldova', 'New Zealand', 'Portugal', 'Romania', 'Russia', 'South Africa', 'Spain', 'Switzerland', 'United States', 'Uruguay']
    harmonize_options = ['White Meat', 'Red Meat', 'Game Meat', 'Vegetarian', 'Spicy Food', 'Seafood', 'Dessert', 'Cheese', 'Cured Meat', 'Snacks', 'Appetizer', 'Italian', 'Others']

    while True:
        body = input(f'Enter the body of the wine you want. You have the following choices: {", ".join(body_options)}.')
        body = body.capitalize()
        if body in body_options:
            break

    while True:
        acidity = input(f'Enter the acidity level of the wine you want. You have the following choices: {", ".join(acidity_options)}.')
        acidity = acidity.title()
        if acidity in acidity_options:
            break

    while True:
        country = input(f'Enter the country you want your wine from. You have the following choices: {", ".join(country_options)}.')
        country = country.title()
        if country in country_options:
            break

    while True:
        try:
            grapes = int(input('Enter the number of grapes you want in your wine. More grapes means the taste of the wine may be more complex.'))
            break
        except ValueError:
            pass

    while True:
        try:
            abv = float(input('Enter the desired alcohol percentage of your wine.'))
            break
        except ValueError:
            pass

    while True:
        harmonize = input(f'Enter the type of food(s) you want your wine to pair with. You have the following choices: {", ".join(harmonize_options)}.')
        harmonize = harmonize.split(', ')
        harmonize = [word.title() for word in harmonize]
        
        for food in harmonize:
            if food not in harmonize_options:
                break
        else:
            break

    body = 'Body_' + body
    acidity = 'Acidity_' + acidity 
    country = 'Country_' + country
    harmonize = ['Harmonize_' + word for word in harmonize]

    return body, acidity, country, grapes, abv, harmonize
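
The validation inside the prompts can also be pulled into a small pure function, which makes it checkable without typing into `input()`. This is a sketch under assumptions: `match_option` is a hypothetical helper, not part of the notebook, and it matches case-insensitively rather than relying on `capitalize()` or `title()`.

```python
def match_option(raw, options):
    """Return the option matching `raw` (ignoring case and outer spaces), else None."""
    lookup = {option.lower(): option for option in options}
    return lookup.get(raw.strip().lower())


body_options = ["Full", "Light", "Medium", "Very full", "Very light"]

print(match_option("  very FULL ", body_options))  # Very full
print(match_option("heavy", body_options))         # None
```

A prompt loop could then keep asking until the helper returns something other than `None`.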


In [118]:
# define a function to get average ratings of wines
def get_avg_ratings():
    
    # importing ratings dataset
    ratings = pd.read_csv('XWines_Full_21M_ratings.csv')
    
    # calculating average ratings
    avg_ratings = ratings.groupby('WineID')['Rating'].mean().to_frame('Avg_Ratings').reset_index()
    avg_ratings['Avg_Ratings'] = avg_ratings['Avg_Ratings'].round(2)

    return avg_ratings
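
To make the averaging step concrete, here is a toy version of the same group-and-merge pattern. The frames `ratings` and `catalogue_toy` are invented; only the column names `WineID`, `Rating` and `Avg_Ratings` follow the notebook.

```python
import pandas as pd

# toy ratings: wine 1 has two ratings, wine 2 has three, wine 3 has none
ratings = pd.DataFrame({"WineID": [1, 1, 2, 2, 2], "Rating": [4.0, 5.0, 3.0, 3.5, 4.0]})
catalogue_toy = pd.DataFrame({"WineID": [1, 2, 3], "WineName": ["A", "B", "C"]})

# mean rating per wine, rounded to 2 decimal places
avg = ratings.groupby("WineID")["Rating"].mean().round(2).to_frame("Avg_Ratings").reset_index()

# an inner merge keeps only wines that have at least one rating
rated = pd.merge(catalogue_toy, avg, how="inner", on="WineID")
print(rated["WineName"].tolist())  # ['A', 'B']
```

Because the merge is an inner join, wines without any rating drop out of the result and therefore cannot be recommended.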

In [123]:
# define a function to give the user their recommendation
def get_recommendations(clf = clf, catalogue = catalogue):
    
    # using get_inputs function to get inputs to use in the model
    body, acidity, country, grapes, abv, harmonize = get_inputs()

    dict_1 = {
        body: 1,
        acidity: 1,
        country: 1,
        'ABV': abv,
        'Grapes': grapes
    }

    dict_2 = {key: 1 for key in harmonize}

    input_dict = {** dict_1, ** dict_2}

    # Convert the input dictionary to a pandas DataFrame
    input_data = pd.DataFrame.from_dict(input_dict, orient='index').T

    # Get the columns from training_data
    columns = X_train_encoded_rf.columns

    # Reindex input_data with the columns from training_data
    input_data = input_data.reindex(columns, axis=1)

    # Fill in the missing columns with a value of 0
    input_data = input_data.fillna(0)

    # Generating prediction
    predict = clf.predict(input_data)[0]

    # getting average ratings of wines
    avg_ratings = get_avg_ratings()

    # combining with catalogue
    combined = pd.merge(catalogue, avg_ratings, how = 'inner', on = 'WineID')

    # getting the list of food for filtering
    harmonize_list = [string.replace('Harmonize_', '') for string in harmonize]

    # subsetting dataset to get only predicted wine type
    recommendations = combined[(combined['Type'] == predict)]
    
    # filtering the ABV to 1 sd above and below the user input for more personalisation
    sd = recommendations['ABV'].std()
    upper_bound = abv + sd
    lower_bound = abv - sd

    # getting country input from user
    country = country.replace('Country_', '')

    # keep wines whose ABV is within 1 sd of the user input and that pair with at least one food in harmonize_list
    # sort by average rating, then take the top 3 wines from the country the user chose
    # if there are fewer than 3, the remaining slots are filled with wines from other countries
    recommendations = recommendations[recommendations['ABV'].between(lower_bound, upper_bound, inclusive = 'both')]
    recommendations = recommendations[recommendations['Harmonize'].str.contains('|'.join(harmonize_list))].sort_values(by = 'Avg_Ratings', ascending = False).reset_index(drop = True)
    recommendations1 = recommendations.copy()
    recommendations1 = recommendations1[recommendations1['Country'] == country].head(3)

    print(f'Based on your preferences, the recommended type of wine is {predict} wine.')
    
    if recommendations1.empty:
        print(f'There are unfortunately no {predict} wines from {country} in our catalogue. Here are the top 3 {predict} wines with the highest ratings that you can try instead!')
        return recommendations.head(3)
    
    elif len(recommendations1) == 1:
        additional_rows = recommendations[~recommendations['WineID'].isin(recommendations1['WineID'])].head(2)
        recommendations1 = pd.concat([recommendations1, additional_rows], axis = 0)
        recommendations1 = recommendations1.sort_values(by = 'Avg_Ratings', ascending = False).reset_index(drop = True)
        print(f'There is only 1 {predict} wine from {country}. Here are 2 other {predict} wines that you can try!')

        return recommendations1
    
    elif len(recommendations1) == 2:
        additional_rows = recommendations[~recommendations['WineID'].isin(recommendations1['WineID'])].head(1)
        recommendations1 = pd.concat([recommendations1, additional_rows], axis = 0)
        recommendations1 = recommendations1.sort_values(by = 'Avg_Ratings', ascending = False).reset_index(drop = True)
        print(f'There are only 2 {predict} wines from {country}. Here is 1 other {predict} wine that you can try!')

        return recommendations1
    
    else:
        print(f'Here are the top 3 highly rated {predict} wines from {country} that you can try!')
        return recommendations1.reset_index(drop = True)
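
The step that lines a single user input up with the model's training columns can be illustrated on its own. In this sketch `training_columns` and `user_choice` are made-up stand-ins for the encoded training columns and the input dictionary, and `reindex` with `fill_value=0` is one compact way to get the same effect as reindexing and then filling missing values.

```python
import pandas as pd

# hypothetical encoded training columns
training_columns = ["Body_Full", "Body_Light", "Country_Chile", "Country_Greece",
                    "ABV", "Grapes", "Harmonize_Cheese"]

# hypothetical user input, keyed by the encoded column names
user_choice = {"Body_Full": 1, "Country_Greece": 1, "ABV": 12.5, "Grapes": 2, "Harmonize_Cheese": 1}

# one-row frame in training-column order; columns the user did not set become 0
row = pd.DataFrame([user_choice]).reindex(columns=training_columns, fill_value=0)

print(row.iloc[0].tolist())  # [1.0, 0.0, 0.0, 1.0, 12.5, 2.0, 1.0]
```

Any key in the input that is not a training column is silently dropped by `reindex`, which is worth keeping in mind when an option offered to the user has no matching encoded column.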

In [125]:
# getting recommendations
get_recommendations()

Based on your preferences, the recommended type of wine is Red wine.
There are unfortunately no wines from Greece in our catalogue. Here are the top 3 Red wines with the highest ratings that you can try instead!


| | WineID | WineName | Type | Elaborate | Grapes | Harmonize | ABV | Body | Acidity | Code | Country | RegionID | RegionName | WineryID | WineryName | Website | Vintages | Avg_Ratings |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 166737 | Private Reserve Syrah | Red | Varietal/100% | 1 | Game Meat, Red Meat, White Meat | 11.5 | Full | High | CL | Chile | 2269 | Curico Valley | 39952 | Galan Vineyards-Vitivinicola Siete Tazas | http://viñagalan.cl | 2019, 2018, 2017, 2016, 2015, 2014, 2013, 2012... | 4.7 |
| 1 | 175479 | Eduard Old Vine Shiraz | Red | Varietal/100% | 1 | Game Meat, Red Meat, White Meat | 11.5 | Very full | High | AU | Australia | 2097 | Barossa Valley | 62956 | Kalleske | http://www.kalleske.com | 2020, 2019, 2018, 2017, 2016, 2015, 2014, 2013... | 4.36 |
| 2 | 103547 | Collares Tinto | Red | Varietal/100% | 1 | Red Meat, White Meat, Italian | 11.8 | Full | High | PT | Portugal | 1051 | Colares | 12790 | Viúva Gomes | http://www.adegaviuvagomes.com | 2019, 2018, 2017, 2016, 2015, 2014, 2013, 2012... | 4.36 |
