# Task 4. Build a beer recommender system based on the BeerAdvocate dataset

## Downloading and unzipping data

In [None]:
# Change directory for kaggle JSON
import os
import warnings

os.chdir("/Users/pan_ae/Study/ml academy/task_4/task_4")

In [9]:
# Create a kaggle folder
!mkdir -p ~/.kaggle

# Copy kaggle.json to created folder
!cp kaggle.json ~/.kaggle/

In [None]:
os.chdir("/content")

In [10]:
# Restrict the API token to the current user (the Kaggle CLI warns otherwise)
!chmod 600 ~/.kaggle/kaggle.json

In [11]:
# Download the required dataset
!kaggle datasets download -d thedevastator/1-5-million-beer-reviews-from-beer-advocate

1-5-million-beer-reviews-from-beer-advocate.zip: Skipping, found more recently modified local copy (use --force to force download)


In [12]:
# Unzip our dataset
from zipfile import ZipFile
from tqdm import tqdm


file_to_extract = "1-5-million-beer-reviews-from-beer-advocate.zip"

# Open your .zip file
with ZipFile(file=file_to_extract) as zip_file:

    # Loop over each file and extract them
    for file in tqdm(iterable=zip_file.namelist(), total=len(zip_file.namelist())):
        zip_file.extract(member=file)

  0%|          | 0/1 [00:00<?, ?it/s]

100%|██████████| 1/1 [00:00<00:00,  1.40it/s]


## Data Preprocessing

In [4]:
import pandas as pd
import numpy as np
import time
import warnings
from typing import List
warnings.filterwarnings("ignore")

In [5]:
start_time = time.time()
df = pd.read_csv("data/beer_reviews.csv")
end_time = time.time()

print(f"Elapsed time to read csv-file is: {end_time - start_time}")
df.head(10)

Elapsed time to read csv-file is: 5.08476710319519


Unnamed: 0,index,brewery_id,brewery_name,review_time,review_overall,review_aroma,review_appearance,review_profilename,beer_style,review_palate,review_taste,beer_name,beer_abv,beer_beerid
0,0,10325,Vecchio Birraio,1234817823,1.5,2.0,2.5,stcules,Hefeweizen,1.5,1.5,Sausa Weizen,5.0,47986
1,1,10325,Vecchio Birraio,1235915097,3.0,2.5,3.0,stcules,English Strong Ale,3.0,3.0,Red Moon,6.2,48213
2,2,10325,Vecchio Birraio,1235916604,3.0,2.5,3.0,stcules,Foreign / Export Stout,3.0,3.0,Black Horse Black Beer,6.5,48215
3,3,10325,Vecchio Birraio,1234725145,3.0,3.0,3.5,stcules,German Pilsener,2.5,3.0,Sausa Pils,5.0,47969
4,4,1075,Caldera Brewing Company,1293735206,4.0,4.5,4.0,johnmichaelsen,American Double / Imperial IPA,4.0,4.5,Cauldron DIPA,7.7,64883
5,5,1075,Caldera Brewing Company,1325524659,3.0,3.5,3.5,oline73,Herbed / Spiced Beer,3.0,3.5,Caldera Ginger Beer,4.7,52159
6,6,1075,Caldera Brewing Company,1318991115,3.5,3.5,3.5,Reidrover,Herbed / Spiced Beer,4.0,4.0,Caldera Ginger Beer,4.7,52159
7,7,1075,Caldera Brewing Company,1306276018,3.0,2.5,3.5,alpinebryant,Herbed / Spiced Beer,2.0,3.5,Caldera Ginger Beer,4.7,52159
8,8,1075,Caldera Brewing Company,1290454503,4.0,3.0,3.5,LordAdmNelson,Herbed / Spiced Beer,3.5,4.0,Caldera Ginger Beer,4.7,52159
9,9,1075,Caldera Brewing Company,1285632924,4.5,3.5,5.0,augustgarage,Herbed / Spiced Beer,4.0,4.0,Caldera Ginger Beer,4.7,52159


Column descriptions from the [dataset page](https://www.kaggle.com/datasets/thedevastator/1-5-million-beer-reviews-from-beer-advocate/data) on Kaggle:

|Column name| Description                                         |
|---|---------------|
|brewery_name| The name of the brewery that made the beer. (String) |
|review_time|The date and time of the review. Kaggle lists the type as String, but the sample rows above contain integers that look like Unix timestamps.|
|review_overall|The reviewer's overall rating of the beer on a scale of 1 to 5. (Float)|
|review_aroma|The reviewer's rating of the beer's aroma on a scale of 1 to 5. (Float)|
|review_appearance|The reviewer's rating of the beer's appearance on a scale of 1 to 5. (Float)|
|review_profilename|The reviewer's username. (String)|
|beer_style|The style of beer. (String)|
|review_palate|The reviewer's rating of the beer's palate on a scale of 1 to 5. (Float)|
|review_taste|The reviewer's rating of the beer's taste on a scale of 1 to 5. (Float)|
|beer_name|The name of the beer. (String)|
|beer_abv|The alcohol by volume of the beer. (Float)|
|brewery_id||
|beer_beerid||
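
The sample rows above show `review_time` as integers such as `1234817823`, which look like Unix timestamps in seconds rather than formatted strings. Assuming that is the case, a minimal sketch of converting them to datetimes could look like this (the toy frame `sample` is illustrative, with values copied from the rows above):

```python
import pandas as pd

# Toy frame with review_time values copied from the sample rows of the dataset
sample = pd.DataFrame({"review_time": [1234817823, 1235915097, 1293735206]})

# Assumption: the integers are Unix timestamps in seconds
sample["review_datetime"] = pd.to_datetime(sample["review_time"], unit="s")
print(sample)
```

The recommender below does not use this column, so the conversion is only relevant if review dates are analyzed later.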

Checking for null values

In [6]:
df.isna().sum()

index                     0
brewery_id                0
brewery_name             15
review_time               0
review_overall            0
review_aroma              0
review_appearance         0
review_profilename      348
beer_style                0
review_palate             0
review_taste              0
beer_name                 0
beer_abv              67785
beer_beerid               0
dtype: int64

Select the columns needed for the recommender: the beer, the reviewer, and the overall rating

In [7]:
# Copy the slice so that the in-place operations below modify an independent dataframe
df_1 = df[["beer_name", "review_profilename", "review_overall"]].copy()
df_1.head()

Unnamed: 0,beer_name,review_profilename,review_overall
0,Sausa Weizen,stcules,1.5
1,Red Moon,stcules,3.0
2,Black Horse Black Beer,stcules,3.0
3,Sausa Pils,stcules,3.0
4,Cauldron DIPA,johnmichaelsen,4.0


Drop duplicated and null values

In [8]:
df_1.drop_duplicates(inplace=True)

In [9]:
df_1.isna().sum()

beer_name               0
review_profilename    343
review_overall          0
dtype: int64

In [10]:
df_1.dropna(inplace=True)

Count the total number of reviews for each beer by grouping on `beer_name`

In [11]:
beer_rating_count = (
    df_1.groupby(by=["beer_name"])["review_overall"]
    .count()
    .reset_index()
    .rename(columns={"review_overall": "review_overall_count"})
    )

In [12]:
beer_rating_count.head()

Unnamed: 0,beer_name,review_overall_count
0,! (Old Ale),1
1,"""100""",5
2,"""100"" Pale Ale",1
3,"""12"" Belgian Golden Strong Ale",2
4,"""33"" Export",3


Join the review counts back onto the ratings dataframe

In [13]:
rating_with_review_overall_count = df_1.merge(beer_rating_count, on="beer_name", how="left")
rating_with_review_overall_count.head(10)

Unnamed: 0,beer_name,review_profilename,review_overall,review_overall_count
0,Sausa Weizen,stcules,1.5,1
1,Red Moon,stcules,3.0,1
2,Black Horse Black Beer,stcules,3.0,1
3,Sausa Pils,stcules,3.0,1
4,Cauldron DIPA,johnmichaelsen,4.0,1
5,Caldera Ginger Beer,oline73,3.0,9
6,Caldera Ginger Beer,Reidrover,3.5,9
7,Caldera Ginger Beer,alpinebryant,3.0,9
8,Caldera Ginger Beer,LordAdmNelson,4.0,9
9,Caldera Ginger Beer,augustgarage,4.5,9


Keep only the beers with more than 50 reviews, so that the similarity estimates rest on enough ratings

In [14]:
rating_popular_beer = rating_with_review_overall_count[rating_with_review_overall_count["review_overall_count"] > 50]
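
The count, merge, and filter steps above can also be written without the intermediate dataframe. The sketch below uses `groupby(...).transform("count")` on toy data; the names `ratings`, `popular`, and `MIN_REVIEWS` are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy ratings; column names follow the notebook
ratings = pd.DataFrame({
    "beer_name": ["A", "A", "A", "B", "C", "C"],
    "review_profilename": ["u1", "u2", "u3", "u1", "u2", "u3"],
    "review_overall": [4.0, 3.5, 5.0, 2.0, 4.5, 4.0],
})

# transform("count") returns one value per row, aligned with the original index,
# so no separate merge is needed
ratings["review_overall_count"] = (
    ratings.groupby("beer_name")["review_overall"].transform("count")
)

# Hypothetical threshold for the toy data; the notebook uses 50
MIN_REVIEWS = 1
popular = ratings[ratings["review_overall_count"] > MIN_REVIEWS]
print(popular)
```

Both approaches should give the same rows; the `transform` variant simply avoids building and joining a second table.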

In [51]:
PATH = "data/rating_popular_beer.csv"
rating_popular_beer.to_csv(PATH)

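Create an interaction matrix (beers as rows, users as columns) from the ratings of popular beers. `pivot_table` averages repeated ratings of the same beer by the same user, and missing ratings are filled with 0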

In [17]:
interaction_matrix = rating_popular_beer.pivot_table(index='beer_name', columns='review_profilename', values='review_overall')
interaction_matrix.fillna(0, inplace=True)
interaction_matrix.head()

beer_name \ review_profilename,0110x011,01Ryan10,02maxima,03SVTCobra,04101Brewer,05Harley,0Naught0,0beerguy0,0runkp0s,0to15,...,zutmin,zwalk8,zwan,zwoehr,zymrgy,zymurgy4all,zymurgywhiz,zythus,zyzygy,zzajjber
"""Old Yeltsin"" Imperial Stout",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"""Shabadoo"" Black & Tan Ale",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
# 100,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
#9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
't Gaverhopke Extra,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
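
To make the behavior of the pivot above concrete, here is a small sketch on toy data (the frame `ratings` and the result `toy_matrix` are illustrative). It shows two things that are easy to miss: duplicate (beer, user) pairs are averaged, and a missing rating becomes 0 after `fillna`.

```python
import pandas as pd

ratings = pd.DataFrame({
    "beer_name": ["A", "A", "A", "B"],
    "review_profilename": ["u1", "u1", "u2", "u2"],
    "review_overall": [4.0, 3.0, 5.0, 2.0],
})

# pivot_table aggregates duplicates with the mean by default:
# (A, u1) was rated 4.0 and 3.0, so the cell becomes 3.5
toy_matrix = ratings.pivot_table(
    index="beer_name", columns="review_profilename", values="review_overall"
)

# (B, u1) has no rating; after fillna it is indistinguishable from a rating of 0
toy_matrix = toy_matrix.fillna(0)
print(toy_matrix)
```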


In [18]:
from scipy.sparse import csr_matrix
from fuzzywuzzy import process

Convert the interaction matrix to a compressed sparse row (CSR) matrix. Each row is one beer's rating vector across all users

In [19]:
beer_features_df_matrix = csr_matrix(interaction_matrix.values)

A user may misspell a beer's name or remember it only partially. The following function resolves each input to the closest name in the interaction matrix using fuzzy matching

In [21]:
def get_beers_list(beers: List[str]) -> List[str]:
    """
    Maps each beer name to a name present in the interaction matrix.

    Exact matches are kept; any other name is replaced by its closest fuzzy match
    in the index of the global `interaction_matrix`.

    Parameters:
    - beers (List[str]): Beer names as entered by the user.

    Returns:
    - user_beers (List[str]): Beer names that exist in the interaction matrix index.
    """
    user_beers = []
    for beer in beers:
        if beer in interaction_matrix.index:
            user_beers.append(beer)
        else:
            closest_match = process.extractOne(beer, interaction_matrix.index)[0]
            user_beers.append(closest_match)
    return user_beers

Example of usage:

In [22]:
get_beers_list(["hieneken"])

['Heineken Dark Lager']
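
The function above always returns some match, even for input that resembles nothing in the catalog. As a dependency-free illustration of the same idea with a rejection threshold, the sketch below uses the standard library's `difflib` instead of `fuzzywuzzy`; the helper `closest_name`, the `cutoff` value, and the short `catalog` list are assumptions made for the example.

```python
from difflib import get_close_matches
from typing import List, Optional


def closest_name(query: str, names: List[str], cutoff: float = 0.6) -> Optional[str]:
    """Return the name most similar to `query`, or None if nothing passes `cutoff`."""
    # Compare lowercase forms so that capitalization does not affect the score
    lowered = {name.lower(): name for name in names}
    matches = get_close_matches(query.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[matches[0]] if matches else None


catalog = ["Heineken", "Budweiser", "Guinness Draught"]  # illustrative names
print(closest_name("hieneken", catalog))
print(closest_name("zzzz", catalog))
```

Returning `None` for a poor match lets the caller ask the user to retype the name instead of silently recommending from an unrelated beer.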

## Modeling

### ItemKNN Model

In [30]:
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import mean_squared_error, mean_absolute_error

In [31]:
# Initializes an Item-based k-Nearest Neighbors (ItemKNN) model.
model_knn = NearestNeighbors(metric='cosine', algorithm='brute')
model_knn.fit(beer_features_df_matrix)
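
With `metric='cosine'`, the distance between two beers is one minus the cosine of the angle between their rating vectors. A small NumPy sketch of that quantity follows (the helper `cosine_distance` and the three toy vectors are illustrative):

```python
import numpy as np


def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine distance = 1 - cosine similarity."""
    return 1.0 - float(a @ b) / float(np.linalg.norm(a) * np.linalg.norm(b))


# Each vector holds one beer's ratings across four users; 0 means "no rating"
beer_a = np.array([4.0, 0.0, 5.0, 0.0])
beer_b = np.array([2.0, 0.0, 2.5, 0.0])  # same users, ratings halved
beer_c = np.array([0.0, 3.0, 0.0, 4.0])  # rated by a disjoint set of users

print(cosine_distance(beer_a, beer_b))
print(cosine_distance(beer_a, beer_c))
```

Two beers rated by the same users in the same proportions are at distance close to 0 regardless of scale, while beers with no reviewers in common are at distance 1.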

In [32]:
def get_recommendations_knn(user_beers: List[str], beers_length: int):
    """
    Generates recommendations based on user's beer preferences.

    Parameters:
    - user_beers (List[str]): List of beers representing user preferences.
    - beers_length (int): Number of beers to include in recommendations.

    Returns:
    - distances (ndarray): Array of distances to nearest neighbors.
    - indices (ndarray): Array of indices of nearest neighbors.
    """
    # Build the user preference vector by summing the rating rows of the user's beers
    user_preferences = np.zeros((1, len(interaction_matrix.columns)))
    for beer in user_beers:
        if beer in interaction_matrix.index:
            user_preferences[0, :] += interaction_matrix.loc[beer, :].values

    # Retrieve 7 recommendations plus one neighbor for each of the user's own beers
    distances, indices = model_knn.kneighbors(user_preferences, n_neighbors=7 + beers_length)
    return distances, indices
        

In [33]:
def print_recommendations(user_beers: List[str]) -> None:
    """
    Prints recommendations based on user's beer preferences.

    Parameters:
    - user_beers (List[str]): List of beers representing user preferences.
    """
    user_beers = get_beers_list(user_beers)
    beers_length = len(user_beers)
    distances, indices = get_recommendations_knn(user_beers, beers_length)
    # Skip the first `beers_length` neighbors, which are assumed to be the user's own beers
    for i in range(beers_length, len(distances.flatten())):
        idx = i - beers_length + 1
        print(f"{idx}) {interaction_matrix.index[indices.flatten()[i]]} with the distance {distances.flatten()[i]}")
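
The loop above skips the first `beers_length` neighbors on the assumption that they are the query beers themselves. That usually holds for a single beer, but it is not guaranteed when several rating rows are summed. A sketch of a variant that filters by name instead is shown below; `pick_recommendations` is a hypothetical helper, and the arrays are toy values shaped like the output of `kneighbors`.

```python
from typing import List, Sequence, Tuple

import numpy as np


def pick_recommendations(
    names: Sequence[str],
    distances: np.ndarray,
    indices: np.ndarray,
    user_beers: List[str],
    top_n: int = 7,
) -> List[Tuple[str, float]]:
    """Keep the nearest neighbors whose names are not in `user_beers`."""
    excluded = set(user_beers)
    picked = []
    for dist, idx in zip(distances.ravel(), indices.ravel()):
        name = names[idx]
        if name not in excluded:
            picked.append((name, float(dist)))
        if len(picked) == top_n:
            break
    return picked


# Toy arrays shaped like the output of `kneighbors` for one query row
names = ["Lager A", "Lager B", "Stout C", "Ale D"]
distances = np.array([[0.10, 0.20, 0.35, 0.60]])
indices = np.array([[1, 0, 3, 2]])

print(pick_recommendations(names, distances, indices, user_beers=["Lager A"], top_n=2))
```

In the notebook, `names` would correspond to `interaction_matrix.index`.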

Testing

In [34]:
user_beers = ["Kronenbourg 1664", "hoegaarden", "Baltika #3 Classic", "Heineken"]
print_recommendations(user_beers=user_beers)

1) Carlsberg Beer with the distance 0.6131883004427712
2) Peroni Nastro Azzurro with the distance 0.6132715827997527
3) Birra Moretti with the distance 0.6194675862814257
4) Tiger Beer with the distance 0.6222505966243081
5) Pilsner Urquell with the distance 0.6229378563978542
6) Leffe Blonde with the distance 0.6256516299453568
7) Warsteiner Premium Verum with the distance 0.6268916620733997


In [35]:
user_beers = ["Kronenbourg 1664"]
print_recommendations(user_beers=user_beers)

1) Peroni Nastro Azzurro with the distance 0.6806705514684273
2) Carlsberg Beer with the distance 0.6807936334419276
3) Birra Moretti with the distance 0.6956345079726951
4) Stella Artois with the distance 0.6989859989548483
5) Tsingtao with the distance 0.7021077084409177
6) Grolsch Premium Lager with the distance 0.7030299833275766
7) Bitburger Premium Pils with the distance 0.7069491406939292


In [36]:
user_beers = ["Budweiser"]
print_recommendations(user_beers=user_beers)

1) Bud Light with the distance 0.5946882523399587
2) Heineken Lager Beer with the distance 0.5992587856939435
3) Samuel Adams Boston Lager with the distance 0.6007752202047341
4) Miller High Life with the distance 0.6221019094202828
5) Pabst Blue Ribbon (PBR) with the distance 0.6339194924067815
6) Guinness Draught with the distance 0.6348481225354892
7) Corona Extra with the distance 0.6366962833306351


In general, the recommended beers are close in character to the query beers

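Calculate metrics: for each of the user's beers, predict its rating vector as the mean over its 8 nearest neighbors (the beer itself included) and compare it with the actual vector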

In [37]:
user_beers = ["Kronenbourg 1664", "hoegaarden", "Baltika #3 Classic", "Heineken"]
user_beers = get_beers_list(user_beers)
beers_length = len(user_beers)
distances, indices = get_recommendations_knn(user_beers, beers_length)

In [38]:
predicted_ratings_dict = {}

# Predict ratings for user's beers
for beer in user_beers:
    # Find nearest neighbors for the current beer
    distances, indices = model_knn.kneighbors(interaction_matrix.loc[beer,:].values.reshape(1, -1), n_neighbors=8)
    
    # Get ratings of these neighbors from interaction_matrix
    neighbor_ratings = interaction_matrix.iloc[indices.flatten()]
    
    # Average ratings of these neighbors
    predicted_ratings = neighbor_ratings.mean(axis=0)
    
    # Save predicted ratings in the dictionary
    predicted_ratings_dict[beer] = predicted_ratings

# Convert data to numpy arrays to use with mean_squared_error function
actual_ratings = interaction_matrix.loc[user_beers, :].values

# Collect predicted ratings from the dictionary into a numpy array
predicted_ratings = np.vstack([predicted_ratings_dict[beer] for beer in user_beers])

# Calculate RMSE and MAE
rmse = np.sqrt(mean_squared_error(actual_ratings, predicted_ratings))
mae = mean_absolute_error(actual_ratings, predicted_ratings)

print("RMSE:", rmse, "\nMAE:", mae)

RMSE: 0.2594626542145182 
MAE: 0.04376863900631407
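
The metrics above are computed over every cell of the matrix, including the zeros that stand for missing ratings, so unrated cells contribute to the error. A sketch of restricting the error to observed ratings is given below, under the assumption that 0 always means "not rated"; the helper `masked_errors` and the toy arrays are illustrative.

```python
import numpy as np


def masked_errors(actual: np.ndarray, predicted: np.ndarray) -> tuple:
    """RMSE and MAE over the cells where `actual` is non-zero."""
    mask = actual != 0
    diff = actual[mask] - predicted[mask]
    rmse = float(np.sqrt(np.mean(diff ** 2)))
    mae = float(np.mean(np.abs(diff)))
    return rmse, mae


actual = np.array([[4.0, 0.0, 3.0],
                   [0.0, 5.0, 0.0]])
predicted = np.array([[3.0, 0.5, 3.0],
                      [0.2, 4.0, 0.1]])

rmse, mae = masked_errors(actual, predicted)
print("RMSE:", rmse, "MAE:", mae)
```

Only the three rated cells enter the calculation here, so the figures describe how well known ratings are reproduced rather than how many cells are empty.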


Saving the model

In [39]:
from joblib import dump, load

In [41]:
FILE_PATH = 'models/model_knn.joblib'

dump(model_knn, FILE_PATH)

['models/model_knn.joblib']

In [None]:
# Load the model, if needed
model_knn = load(FILE_PATH)

### ALS model

In [42]:
from implicit.als import AlternatingLeastSquares

In [43]:
model_als = AlternatingLeastSquares(factors=50)
model_als.fit(beer_features_df_matrix.T)

100%|██████████| 15/15 [01:05<00:00,  4.38s/it]


In [44]:
def get_recommendations(user_beers: List[str]) -> List[tuple]:
    """
    Generates recommendations based on user's beer preferences using ALS model.

    Parameters:
    - user_beers (List[str]): List of beers representing user preferences.

    Returns:
    - top_recommendations (List[tuple]): List of top recommended beer indices and their scores.
    """
    # Get indices of user's beers
    user_indices = [interaction_matrix.index.get_loc(beer) for beer in user_beers]

    # Get recommendations for each user beer separately
    recommendations = []
    for idx in user_indices:
        item_recommendations = model_als.similar_items(idx, N=7+1)
        recommendations.append(item_recommendations)
    
    # Combine recommendation scores
    combined_recommendations = {}
    for rec in recommendations:
        for item_id, score in zip(rec[0], rec[1]):
            if item_id not in combined_recommendations: 
                combined_recommendations[item_id] = 0
            combined_recommendations[item_id] += score
    
    # Remove user's beers from recommendations
    for idx in user_indices:
        if idx in combined_recommendations: 
            del combined_recommendations[idx]
    
    # Select top 7 recommendations based on their scores
    top_recommendations = sorted(combined_recommendations.items(), key=lambda x: x[1], reverse=True)[:7]
    
    return top_recommendations
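
The score-combining step in the function above can be illustrated in isolation. The sketch below sums per-beer similarity scores with `collections.Counter`; the helper `combine_scores` and the hard-coded scores are hypothetical and do not come from the ALS model.

```python
from collections import Counter
from typing import Dict, List, Tuple


def combine_scores(
    per_beer: List[Dict[int, float]], exclude: List[int], top_n: int = 7
) -> List[Tuple[int, float]]:
    """Sum similarity scores per item, drop excluded items, return the top_n."""
    totals: Counter = Counter()
    for scores in per_beer:
        totals.update(scores)  # Counter.update adds values for repeated keys
    for item_id in exclude:
        totals.pop(item_id, None)
    return totals.most_common(top_n)


# Hypothetical similar-item scores for two query beers (item ids 0 and 1)
per_beer = [
    {0: 1.0, 2: 0.875, 3: 0.5},
    {1: 1.0, 3: 0.75, 4: 0.625},
]
print(combine_scores(per_beer, exclude=[0, 1], top_n=2))
```

Summing favors items that are similar to several of the query beers: item 3 is not the top match for either beer, yet it ranks first once the scores are combined.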


In [45]:
def print_recommendations(user_beers: List[str]) -> None:
    """
    Prints recommendations based on user's beer preferences using ALS model.

    Parameters:
    - user_beers (List[str]): List of beers representing user preferences.
    """
    user_beers = get_beers_list(user_beers)
    recommendations = get_recommendations(user_beers)
    for i, rec in enumerate(recommendations):
        print(f"{i+1}) {interaction_matrix.index[rec[0]]} with score {rec[1]}")

Testing

In [46]:
user_beers = ["Kronenbourg 1664", "hoegaarden", "Baltika #3 Classic", "Heineken"]
print_recommendations(user_beers=user_beers)

1) Verboden Vrucht  / Fruit Defendu (Forbidden Fruit) with score 0.9415228962898254
2) Leffe Radieuse with score 0.8986030220985413
3) Leffe Tripel with score 0.896060049533844
4) Leffe Vieille Cuvée with score 0.8890465497970581
5) Judas with score 0.878032386302948
6) La Guillotine with score 0.8703019022941589
7) Svyturys Ekstra with score 0.8647470474243164


In [47]:
user_beers = ["Kronenbourg 1664"]
print_recommendations(user_beers=user_beers)

1) Carlsberg Beer with score 0.8620048761367798
2) Kingfisher Premium Lager with score 0.8595719337463379
3) Birra Moretti with score 0.8549344539642334
4) Staropramen Lager with score 0.8296495079994202
5) Tsingtao with score 0.8290836215019226
6) Singha with score 0.8225492835044861
7) Bavaria Beer / Pilsener with score 0.8225173950195312


In [48]:
user_beers = ["Budweiser"]
print_recommendations(user_beers=user_beers)

1) Coors Light with score 0.889082670211792
2) Bud Light with score 0.8869562745094299
3) Miller High Life with score 0.8835870027542114
4) Corona Extra with score 0.8785228133201599
5) Miller Lite with score 0.8426086902618408
6) Coors with score 0.8414738178253174
7) Heineken Lager Beer with score 0.8324257135391235


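For single-beer queries the two methods give largely overlapping recommendations (for example, Carlsberg Beer, Birra Moretti and Tsingtao for Kronenbourg 1664). For the four-beer query they diverge: ItemKNN returns mostly pale lagers, while ALS returns mostly Belgian ales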

Let's calculate metrics

In [49]:
# Predicting ratings for all users and beers
predicted_ratings = model_als.user_factors @ model_als.item_factors.T

# Convert data to numpy arrays to use with mean_squared_error function
actual_ratings = interaction_matrix.T.to_numpy()

# Calculate RMSE and MAE
rmse = np.sqrt(mean_squared_error(actual_ratings, predicted_ratings))
mae = mean_absolute_error(actual_ratings, predicted_ratings)

print("RMSE:", rmse, "\nMAE:", mae)

RMSE: 0.3089069238947742 
MAE: 0.05114991179952786


Save the model

In [185]:
FILE_PATH = 'models/model_als.joblib'

dump(model_als, FILE_PATH)

['models/model_als.joblib']

In [186]:
# Load the model, if needed
model_als = load(FILE_PATH)

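According to these metrics, the ItemKNN model performs better (RMSE 0.259 and MAE 0.044, versus 0.309 and 0.051 for ALS), although the ItemKNN figures were computed on four beers only and the ALS figures on the full matrix. We will continue working with ItemKNN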

## In addition

Save the interaction matrix so that it can be reused without recomputing the pivot

In [52]:
INTERACTION_MATRIX_FILE_PATH = 'data/interaction_matrix.csv'

interaction_matrix.to_csv(INTERACTION_MATRIX_FILE_PATH)
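
When the saved matrix is loaded again, the beer names need to be restored as the index. A minimal round-trip sketch on a toy matrix follows; it writes to an in-memory buffer rather than to `data/interaction_matrix.csv`, and `toy_matrix` and `restored` are illustrative names.

```python
import io

import pandas as pd

toy_matrix = pd.DataFrame(
    [[4.0, 0.0], [0.0, 3.5]],
    index=pd.Index(["Beer A", "Beer B"], name="beer_name"),
    columns=["user_1", "user_2"],
)

# Write to an in-memory buffer instead of a file to keep the sketch self-contained
buffer = io.StringIO()
toy_matrix.to_csv(buffer)
buffer.seek(0)

# index_col=0 restores beer names as the index rather than a regular column
restored = pd.read_csv(buffer, index_col=0)
print(restored)
```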