# Import Libraries

In [1]:
# import libraries
import pandas as pd
import numpy as np

In [2]:
# import datasets
wine_df = pd.read_csv('../data/winemag-data-130k-v2.csv', usecols=lambda column: column != 'Unnamed: 0')
descriptors_df = pd.read_csv('../data/descriptor_mapping.csv', usecols=lambda column: column != 'Unnamed: 0')

In [3]:
# remove the column-width limit so long text (e.g. descriptions) is shown in full
pd.set_option('display.max_colwidth', None)

___

# Cleaning Wine Reviews Dataset

In [4]:
# examine shape of dataset
wine_df.shape

(129971, 13)

In [5]:
# examine the columns and datatypes
wine_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 129971 entries, 0 to 129970
Data columns (total 13 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   country                129908 non-null  object 
 1   description            129971 non-null  object 
 2   designation            92506 non-null   object 
 3   points                 129971 non-null  int64  
 4   price                  120975 non-null  float64
 5   province               129908 non-null  object 
 6   region_1               108724 non-null  object 
 7   region_2               50511 non-null   object 
 8   taster_name            103727 non-null  object 
 9   taster_twitter_handle  98758 non-null   object 
 10  title                  129971 non-null  object 
 11  variety                129970 non-null  object 
 12  winery                 129971 non-null  object 
dtypes: float64(1), int64(1), object(11)
memory usage: 12.9+ MB


In [6]:
wine_df.head(2)

| |country|description|designation|points|price|province|region_1|region_2|taster_name|taster_twitter_handle|title|variety|winery|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|0|Italy|Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressive, offering unripened apple, citrus and dried sage alongside brisk acidity.|Vulkà Bianco|87|NaN|Sicily & Sardinia|Etna|NaN|Kerin O’Keefe|@kerinokeefe|Nicosia 2013 Vulkà Bianco (Etna)|White Blend|Nicosia|
|1|Portugal|This is ripe and fruity, a wine that is smooth while still structured. Firm tannins are filled out with juicy red berry fruits and freshened with acidity. It's already drinkable, although it will certainly be better from 2016.|Avidagos|87|15.0|Douro|NaN|NaN|Roger Voss|@vossroger|Quinta dos Avidagos 2011 Avidagos Red (Douro)|Portuguese Red|Quinta dos Avidagos|


## Check for Duplicates

In [7]:
# count the number of duplicates based on title
num_duplicates = wine_df.duplicated(subset=['title']).sum()
num_duplicates

11131

In [8]:
# calculate the percentage of duplicate rows
num_rows = len(wine_df)
percent_duplicates = (num_duplicates / num_rows) * 100
percent_duplicates

8.564218171746004

___Notes:___

About 8.6% of the records (11,131 of 129,971) share a title with an earlier record. This share is small and is not expected to significantly affect the quality or accuracy of the recommender system, so the records are retained.
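As a minimal sketch of how the title-based duplicate count behaves, the toy example below uses hypothetical rows (not taken from the dataset) and hypothetical variable names. It also shows `drop_duplicates` as one possible alternative in case the repeated titles should be removed rather than retained; this is an illustration, not a step the notebook performs.

```python
import pandas as pd

# Toy data: hypothetical rows for illustration only
toy_df = pd.DataFrame({
    "title": ["Wine A 2011", "Wine B 2013", "Wine A 2011", "Wine A 2011"],
    "points": [87, 90, 87, 88],
})

# With the default keep='first', every repeat after the first occurrence is flagged
dup_mask = toy_df.duplicated(subset=["title"])
num_dups = int(dup_mask.sum())
pct_dups = num_dups / len(toy_df) * 100

# Alternative to retaining the rows: keep only the first row for each title
deduped_df = toy_df.drop_duplicates(subset=["title"], keep="first")

print(num_dups, pct_dups, len(deduped_df))
```

Note that matching on `title` alone treats two different reviews of the same wine as duplicates, which is one reason retaining them can be a reasonable choice.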

## Drop taster_twitter_handle

In [9]:
# drop the taster_twitter_handle column
wine_df = wine_df.drop('taster_twitter_handle', axis=1)

___Notes:___

The reviewer is already identified by taster_name, so taster_twitter_handle is redundant for this analysis and is dropped.

## Check for null values

In [10]:
# check for null values
wine_df.isnull().sum()

country           63
description        0
designation    37465
points             0
price           8996
province          63
region_1       21247
region_2       79460
taster_name    26244
title              0
variety            1
winery             0
dtype: int64

In [11]:
# drop every row that has a null in any column
wine_df.dropna(inplace=True)

___Notes:___

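Rows with missing values add little to the recommender system or the analysis, so any row containing a null is dropped. Because region_2 alone is null in 79,460 of the 129,971 rows, this step removes at least 61% of the dataset.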
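The sketch below illustrates, on hypothetical toy rows, how `dropna()` with no arguments compares with restricting the check to selected columns via `subset`. The choice of columns in `subset` is an assumption for illustration only; the notebook itself drops on all columns.

```python
import numpy as np
import pandas as pd

# Toy data: hypothetical rows for illustration only
toy_nulls_df = pd.DataFrame({
    "country": ["Italy", "US", None, "France"],
    "price": [np.nan, 15.0, 20.0, 30.0],
    "region_2": [None, "Napa", None, None],
})

# Count rows that contain at least one null in any column
rows_with_null = int(toy_nulls_df.isnull().any(axis=1).sum())

# Behaviour used in the notebook: drop a row if ANY column is null
all_cols_df = toy_nulls_df.dropna()

# Illustrative alternative: only require the listed columns to be present
subset_df = toy_nulls_df.dropna(subset=["country", "price"])

print(rows_with_null, len(all_cols_df), len(subset_df))
```

Checking `isnull().any(axis=1).sum()` before dropping gives a quick estimate of how many rows a blanket `dropna()` will remove.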

## Filter for wine ratings > 88 points

In [12]:
# Filter the dataframe to only include wines with ratings greater than 88
wine_df = wine_df[wine_df['points'] > 88]

___Notes:___

The recommender system should only recommend high-quality wines, so only wines scoring above 88 points (that is, 89 or higher) are retained. The Wine Points System table in the data dictionary below describes the score bands.
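To make the threshold explicit, here is a small sketch on hypothetical toy rows showing that the strict comparison excludes wines scoring exactly 88. The constant name `MIN_POINTS` is an assumption introduced for illustration.

```python
import pandas as pd

# Toy data: hypothetical rows for illustration only
ratings_df = pd.DataFrame({
    "title": ["Wine A", "Wine B", "Wine C", "Wine D"],
    "points": [87, 88, 89, 93],
})

MIN_POINTS = 88  # hypothetical name; wines must score strictly above this

# Boolean mask: True where points > 88, so a score of exactly 88 is excluded
high_quality_df = ratings_df.loc[ratings_df["points"].gt(MIN_POINTS)]

print(high_quality_df["points"].tolist())
```

Because `points` is an integer column, `> 88` and `>= 89` select the same rows.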

## Export dataset

In [13]:
# export new, cleaned dataframe
wine_df.to_csv('../data/filtered_wine.csv', index=False)

___

# Cleaning Descriptors Dataset

In [14]:
# examine the columns and datatypes
descriptors_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1015 entries, 0 to 1014
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   raw descriptor  1015 non-null   object
 1   level_3         1015 non-null   object
 2   level_2         1015 non-null   object
 3   level_1         1015 non-null   object
dtypes: object(4)
memory usage: 31.8+ KB


In [15]:
# check for null values
descriptors_df.isnull().sum()

raw descriptor    0
level_3           0
level_2           0
level_1           0
dtype: int64

___

# Data Dictionary

## Wine Reviews

|Feature|Type|Description|
|---|---|---|
|country|object|The country where the wine was produced.|
|description|object|Description of the wine's characteristics and tasting notes.|
|designation|object|The vineyard within the winery where the grapes for the wine were sourced from.|
|points|int64|The number of points assigned to the wine on a scale of 1-100 by the wine reviewer.|
|price|float64|The price of a bottle of the wine.|
|province|object|The province or state within the country where the wine was produced.|
|region_1|object|The first-level region within the province or state where the wine was produced (e.g. Napa Valley).|
|region_2|object|The second-level region within the province or state where the wine was produced (e.g. Rutherford within Napa Valley).|
|taster_name|object|The name of the wine reviewer.|
|taster_twitter_handle|object|The Twitter handle of the wine reviewer (dropped during cleaning).|
|title|object|The title of the wine review.|
|variety|object|The type of grape used to produce the wine.|
|winery|object|The name of the winery that produced the wine.|

## Wine Points System

Source: https://www.wine-searcher.com/wine-scores

|Points|Description|
|---|---|
|95-100|Wines are benchmark examples or ‘classic’.|
|90-94|Wines are ‘superior’ to ‘exceptional’.|
|85-89|Wines are ‘good’ to ‘very good’.|
|80-84|Wines are ‘above average’ to ‘good’.|
|70-79|Wines are flawed and taste average.|
|60-69|Wines are flawed and not recommended but drinkable.|
|50-59|Wines are flawed and undrinkable.|

## Descriptors

|Feature|Type|Description|
|---|---|---|
|raw descriptor|object|A descriptor for a sensory attribute of the food or drink being evaluated (e.g. "sweetness", "acidity", "herbaceousness", etc.).|
|level_3|object|A subcategory of the descriptor that provides additional detail about the attribute being evaluated (e.g. "fruit sweetness" for the "sweetness" descriptor).|
|level_2|object|A broader category that groups together related descriptors and level_3 subcategories (e.g. "flavor").|
|level_1|object|The highest level of categorization that groups together the level_2 categories (e.g. "aroma and flavor").|