# Capstone Project: Wine Recommender
---
**Book 1: Problem Statement & Data Cleaning**<br>
Book 2: Exploratory Data Analysis, Preprocessing & Feature Engineering<br>
Book 3: Modelling, Conclusion & Recommendation<br>
Author: Lee Wan Xian

## Problem Statement

Our client is in the wine retail business. Its customers report decision fatigue when choosing from the large catalogue of wines on offer. To improve customer experience and engagement, the client has tasked us with developing a wine recommender system that suggests a few suitable wines to each customer, which should ease the burden of choosing.

## Contents
- [Background](#Background)
- [Dataset Information](#Dataset-Information)
- [Python Libraries](#Python-Libraries)
- [Data Cleaning](#Data-Cleaning)

## Background

There are more than 10,000 wine grape varieties in the world ([source](https://www.masterclass.com/articles/what-are-the-different-types-of-wine-grapes-a-guide-to-the-various-types-of-red-and-white-wine-grapes-in-the-world)). This could easily translate to more than 10,000 varieties of wine out in the global market. With such a wide variety available, it is easy to understand how a typical consumer would struggle to find the wine that suits their palate.

The client has received feedback from customers who felt that browsing the wine catalogue wasted their time and energy. Even after purchase, customers were unsure whether the wine they chose was the right one, which made for a negative shopping experience. The client has therefore engaged us to build a wine recommender system that gives customers good suggestions to choose from.

As a team of data professionals, we will use the wine catalogue provided by the client together with the reviews and ratings given by professional sommeliers. These datasets form the database needed to build our recommender system. In addition, we will use a list of descriptive terms for wine traits taken from the RoboSomm wine wheels, which were compiled by researchers and wine enthusiasts ([source](https://towardsdatascience.com/robosomm-chapter-3-wine-embeddings-and-a-wine-recommender-9fc678f1041e)). These terms let customers select specific wine traits in the recommender system, which in turn narrows the list of recommended wines they have to choose from.

The points/ratings stated in the wine reviews follow the 100-points system ([source](https://winefolly.com/tips/wine-ratings-explained/)).

Points/Ratings|Wine Quality
---|---
95-100|Classics
90-94|Superior
85-89|Good
80-84|Above Average
70-79|Average
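
The bands in the table can be applied to a column of scores with `pandas.cut`. The following is a minimal sketch, assuming the bands do not overlap (so 85-89 is Good and a score of 90 counts as Superior). The helper name `points_to_band` and the sample scores are hypothetical and not part of the project code.

```python
import pandas as pd

# Band edges and labels follow the table above.
# Each bin is right-inclusive: (69, 79], (79, 84], (84, 89], (89, 94], (94, 100].
BAND_EDGES = [69, 79, 84, 89, 94, 100]
BAND_LABELS = ["Average", "Above Average", "Good", "Superior", "Classics"]


def points_to_band(points: pd.Series) -> pd.Series:
    """Map 100-point scores to the quality bands in the table (hypothetical helper)."""
    return pd.cut(points, bins=BAND_EDGES, labels=BAND_LABELS)


# Made-up scores for illustration only.
sample_points = pd.Series([72, 84, 88, 90, 97])
bands = points_to_band(sample_points)
print(bands.tolist())
```

Scores below 70 fall outside every bin and would come back as `NaN` in this sketch.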

## Dataset Information

### Datasets used

There are 2 datasets included in the `data` folder for this project.

* [`winemag-data-130k-v2`](../data/winemag-data-130k-v2.csv): Wine reviews scraped from the [WineEnthusiast](https://www.winemag.com/?s=&drink_type=wine) website by Zack Thoutt ([Source](https://www.kaggle.com/datasets/zynicide/wine-reviews)).
* [`descriptor_mapping`](../data/descriptor_mapping.csv): List of standardized descriptors derived from the RoboSomm wine wheels. This list is available on Roald Schuring's GitHub ([Source](https://github.com/RoaldSchuring/wine_recommender/blob/master/descriptor_mapping.csv)).

### Data Dictionary

#### Wine reviews dataframe

Feature|Description
---|---
country|The country that the wine is from
description|The review given to the wine
designation|The vineyard within the winery where the grapes that made the wine are from
points|The number of points/ratings WineEnthusiast rated the wine on a scale of 1-100
price|The cost for a bottle of the wine
province|The province or state that the wine is from
region_1|The wine growing area in a province or state (e.g. Napa)
region_2|The specific region within a wine growing area (e.g. Rutherford inside the Napa Valley)
taster_name|Name of the wine taster
taster_twitter_handle|The wine taster's Twitter handle
title|Name, year & vineyard of the wine. This is the key feature for differentiating wines from each other
variety|Type of wine (e.g. Pinot Noir)
winery|The place where the wine was made

#### Wine traits descriptor mapping dataframe

Feature|Description
---|---
raw descriptor|The raw descriptive terms used to describe a wine's traits
level_3|The 3rd level descriptive terms from RoboSomm wine wheels. This is the most detailed trait layer.
level_2|The 2nd level descriptive terms from RoboSomm wine wheels
level_1|The 1st level descriptive terms from RoboSomm wine wheels. This is the most generic trait layer.

## Python Libraries

In [1]:
import pandas as pd
import numpy as np

## Data Cleaning

In [2]:
# Import datasets
df_wine = pd.read_csv('../data/winemag-data-130k-v2.csv', index_col = 0)
descriptor_map = pd.read_csv('../data/descriptor_mapping.csv')

### Dataframe: Wine Reviews

In [3]:
print(df_wine.shape)
df_wine.info()
df_wine.head()

(129971, 13)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 129971 entries, 0 to 129970
Data columns (total 13 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   country                129908 non-null  object 
 1   description            129971 non-null  object 
 2   designation            92506 non-null   object 
 3   points                 129971 non-null  int64  
 4   price                  120975 non-null  float64
 5   province               129908 non-null  object 
 6   region_1               108724 non-null  object 
 7   region_2               50511 non-null   object 
 8   taster_name            103727 non-null  object 
 9   taster_twitter_handle  98758 non-null   object 
 10  title                  129971 non-null  object 
 11  variety                129970 non-null  object 
 12  winery                 129971 non-null  object 
dtypes: float64(1), int64(1), object(11)
memory usage: 13.9+ MB


| |country|description|designation|points|price|province|region_1|region_2|taster_name|taster_twitter_handle|title|variety|winery
|---|---|---|---|---|---|---|---|---|---|---|---|---|---
|0|Italy|Aromas include tropical fruit, broom, brimston...|Vulkà Bianco|87|NaN|Sicily & Sardinia|Etna|NaN|Kerin O’Keefe|@kerinokeefe|Nicosia 2013 Vulkà Bianco (Etna)|White Blend|Nicosia
|1|Portugal|This is ripe and fruity, a wine that is smooth...|Avidagos|87|15.0|Douro|NaN|NaN|Roger Voss|@vossroger|Quinta dos Avidagos 2011 Avidagos Red (Douro)|Portuguese Red|Quinta dos Avidagos
|2|US|Tart and snappy, the flavors of lime flesh and...|NaN|87|14.0|Oregon|Willamette Valley|Willamette Valley|Paul Gregutt|@paulgwine|Rainstorm 2013 Pinot Gris (Willamette Valley)|Pinot Gris|Rainstorm
|3|US|Pineapple rind, lemon pith and orange blossom ...|Reserve Late Harvest|87|13.0|Michigan|Lake Michigan Shore|NaN|Alexander Peartree|NaN|St. Julian 2013 Reserve Late Harvest Riesling ...|Riesling|St. Julian
|4|US|Much like the regular bottling from 2012, this...|Vintner's Reserve Wild Child Block|87|65.0|Oregon|Willamette Valley|Willamette Valley|Paul Gregutt|@paulgwine|Sweet Cheeks 2012 Vintner's Reserve Wild Child...|Pinot Noir|Sweet Cheeks


In [4]:
# Check for missing values
df_wine.isna().sum().sort_values(ascending=False)

region_2                 79460
designation              37465
taster_twitter_handle    31213
taster_name              26244
region_1                 21247
price                     8996
country                     63
province                    63
variety                      1
description                  0
points                       0
title                        0
winery                       0
dtype: int64
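
Raw counts are easier to judge as a share of all rows. In the output above, `region_2` is missing in 79,460 of 129,971 reviews, roughly 61%. The sketch below shows one way to compute such shares with `isna().mean()`. It runs on a small made-up frame (`toy`) rather than on `df_wine`, so the numbers it prints are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_wine; all values are made up for illustration.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "price": [15.0, np.nan, 14.0, 65.0],
    "region_2": [np.nan, np.nan, "Willamette Valley", np.nan],
})

# isna() gives a boolean frame; its column-wise mean is the fraction of missing values.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```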

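Reviews with missing values give an incomplete picture of a wine, which limits what they can tell us about the similarity between wines or wine tasters, and the recommender system cannot use such records for training. We therefore remove every review that has a missing value in any column. This is a strict filter: `region_2` alone is missing in 79,460 reviews, and only 22,387 of the original 129,971 reviews remain afterwards.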

In [5]:
# Remove reviews with missing data
df_wine.dropna(inplace=True)
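
Because `dropna()` removes a row as soon as any single column is missing, it can be worth recording the row count before and after the call. The following is a minimal sketch on a made-up frame (`toy`), not the notebook's own code. It uses the non-inplace form so that the original frame stays available for comparison.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_wine; all values are made up for illustration.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "price": [15.0, np.nan, 14.0, 65.0],
    "region_2": ["Napa", "Sonoma", np.nan, "Willamette Valley"],
})

rows_before = len(toy)
toy_complete = toy.dropna()  # drops every row with at least one missing value
rows_after = len(toy_complete)

print(f"kept {rows_after} of {rows_before} rows")
```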

In [6]:
# Check for duplicated data records
df_wine.duplicated().value_counts()

False    20493
True      1894
dtype: int64

Because this is a recommender use case, we would ideally retain the integrity of the database. Duplicated records make up only about 8% of the remaining dataset (1,894 of 22,387 reviews). The recommender should be able to manage multiple identical records ([source](https://github.com/apple/turicreate/issues/1433)) and should not be affected by them. Thus, we will leave the duplicated records untouched.
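
For reference, `duplicated()` flags only the repeats after the first occurrence of a row, so the `True` count above is the number of extra copies rather than the number of rows involved. The sketch below illustrates the difference on a made-up frame (`toy`), using `keep=False` to list every copy, including the first.

```python
import pandas as pd

# Toy frame with repeated rows; all values are made up for illustration.
toy = pd.DataFrame({
    "title": ["A", "A", "B", "C", "C", "C"],
    "points": [90, 90, 88, 91, 91, 91],
})

dup_mask = toy.duplicated()                    # True for repeats after the first occurrence
dup_share = dup_mask.mean()                    # fraction of rows flagged as repeats
all_copies = toy[toy.duplicated(keep=False)]   # every row that has an identical twin

print(f"{dup_mask.sum()} of {len(toy)} rows flagged ({dup_share:.1%})")
print(f"{len(all_copies)} rows belong to a duplicated group")
```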

In [7]:
# Reorder the columns & remove 'taster_twitter_handle' as it's unnecessary
col = ['taster_name', 'title', 'points', 'variety', 'designation', 'winery', 'country', 'province', 'region_1', 'region_2', 'price', 'description']
df_wine_ordered = df_wine[col]

It is ideal for the wine recommender to recommend good quality wines instead of inferior ones. Thus, we will only extract wine reviews with points equal to 88 or above.

In [8]:
df_wine_clean = df_wine_ordered.loc[df_wine_ordered['points'] >= 88, :].reset_index(drop=True)
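
The same filter can be tried on a small made-up frame to confirm that the threshold is inclusive and that `reset_index(drop=True)` renumbers the surviving rows from 0. This is only an illustrative sketch; the constant name `MIN_POINTS` is hypothetical.

```python
import pandas as pd

# Toy frame standing in for df_wine_ordered; all values are made up for illustration.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "points": [87, 88, 92, 85],
})

MIN_POINTS = 88  # same threshold as the cell above; the constant name is hypothetical

# >= keeps wines scoring exactly 88; reset_index(drop=True) discards the old row labels.
toy_clean = toy.loc[toy["points"] >= MIN_POINTS, :].reset_index(drop=True)
print(toy_clean)
```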

### Dataframe: Wine Trait Descriptors

In [9]:
print(descriptor_map.shape)
descriptor_map.info()
descriptor_map.head()

(1015, 4)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1015 entries, 0 to 1014
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   raw descriptor  1015 non-null   object
 1   level_3         1015 non-null   object
 2   level_2         1015 non-null   object
 3   level_1         1015 non-null   object
dtypes: object(4)
memory usage: 31.8+ KB


| |raw descriptor|level_3|level_2|level_1
|---|---|---|---|---
|0|abras|abrasive|high_tannin|tannin
|1|acacia|acacia|flowery|flower
|2|acacia_flower|acacia|flowery|flower
|3|aciddriven|acid_driven|high_acid|acid
|4|aggress|aggressive|high_acid|acid


There are no missing values in this dataframe.

In [10]:
# Check for duplicated records
descriptor_map.duplicated().value_counts()

False    1015
dtype: int64

There are no duplicate records in this dataframe.
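
The check above compares whole rows. If the mapping is later used as a lookup table keyed on `raw descriptor`, it may also be worth confirming that the key column on its own is unique. The sketch below uses the five rows shown in the `head()` output; the `tokens` list is hypothetical, and how the mapping is actually applied is left to the later books.

```python
import pandas as pd

# The five rows shown in the head() output above.
toy_map = pd.DataFrame({
    "raw descriptor": ["abras", "acacia", "acacia_flower", "aciddriven", "aggress"],
    "level_3": ["abrasive", "acacia", "acacia", "acid_driven", "aggressive"],
    "level_2": ["high_tannin", "flowery", "flowery", "high_acid", "high_acid"],
    "level_1": ["tannin", "flower", "flower", "acid", "acid"],
})

# A lookup table needs unique keys, which a whole-row duplicate check does not guarantee.
keys_are_unique = toy_map["raw descriptor"].is_unique

# Build a raw -> level_3 dictionary and normalise some hypothetical tokens.
lookup = toy_map.set_index("raw descriptor")["level_3"].to_dict()
tokens = ["acacia_flower", "aggress", "oak"]  # "oak" is not among these five rows
normalised = [lookup.get(token) for token in tokens]

print(keys_are_unique, normalised)
```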

**Export `df_wine_clean` dataframe into pickle file**

In [11]:
# Export the dataframe into pickle file  
df_wine_clean.to_pickle('../data/df_wine_clean.pkl')
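
A quick round trip with `pd.read_pickle` can confirm that an exported frame loads back unchanged. The sketch below writes a made-up frame to a temporary directory instead of `../data`, so the file name `toy_clean.pkl` is hypothetical and no project file is touched.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for df_wine_clean; all values are made up for illustration.
toy = pd.DataFrame({"title": ["A", "B"], "points": [88, 92]})

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = Path(tmp_dir) / "toy_clean.pkl"  # hypothetical file name
    toy.to_pickle(pkl_path)
    restored = pd.read_pickle(pkl_path)

# equals() compares values, dtypes and index in one call.
round_trip_ok = restored.equals(toy)
print(round_trip_ok)
```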

### Summary

**Wine Reviews**: All reviews with missing values were removed, as were all reviews with fewer than 88 points. Duplicated records were kept. The columns have been reordered for readability. The `taster_twitter_handle` column has been dropped as it serves the same purpose as `taster_name`.

**Wine Trait Descriptors**: No data cleaning was required as there are no duplicated records or missing data values.

*Please proceed to Book 2 for EDA, Preprocessing & Feature Engineering*