# Parallel Processing of Wine Review Data

## Objective

Can we recreate these $5$ basic wine type groupings?

<p align="center">
<img src="images/Different-Types-of-Wine-v2.jpg" alt="wine types" width="300">
</p>

## Define Wine Classifications

**Helper functions to classify wines into the $9$ basic styles and $5$ basic types**

In [1]:
from collections import namedtuple
import pandas as pd

WINE_STYLE = namedtuple('WINE_STYLE', ['name', 'varieties'])

# wine varieties by wine style
sparkling = WINE_STYLE('sparkling', ['Cava','Prosecco', 'Crémant','Champagne', 'Spumante', 'Sparkling Blend'])

light_white = WINE_STYLE('light_white',
                         ['Albariño','Aligoté','Assyrtiko','Chablis','Chasselas','Chenin Blanc','Cortese','Friulano','Sauvignon Vert','Garganega','Grenache Blanc','Muscadet','Melon','Picpoul de Pinet','Pinot Blanc','Pinot Grigio','Pinot Gris','Verdejo','Verdicchio','Xarel-lo',]\
                          + ['Erbaluce','Grüner Veltliner','Sancerre','Sauvignon Blanc','Vermentino','Vinho Verde']
                        )
full_white = WINE_STYLE('full_white', ['Chardonnay','Marsanne','Sémillon','Trebbiano','Viognier','White Rioja','Pecorino', 'White Blend'])
aromatic_white = WINE_STYLE('aromatic_white', ['Gewürztraminer','Müller-Thurgau','Moschofilero','Muscat','Muscat Blanc','Moscato','Riesling','Torrontés',])
rose = WINE_STYLE('rose', ['Rosé', 'Rose'])
light_red = WINE_STYLE('light_red', ['Schiava', 'Gamay', 'Pinot Noir','Counoise', 'St. Laurent','Cinsaut','Primitivo','Blaufränkisch','Barolo'])
medium_red = WINE_STYLE('medium_red', ['Grenache','Garnacha','Valpolicella Blend','Bobal','Carménère','Carignan','Cabernet Franc','Mencía', 'Sangiovese','Negroamaro','Rhône-style Red Blend', 'Rhône/GSM Blend','Barbera','Merlot','Montepulciano','Zinfandel','Marquette','Chambourcin','Petite Pearl','Red Blend', 'Portuguese Red'])
full_red = WINE_STYLE('full_red', ['Tempranillo','Nebbiolo',"Nero d'Avola",'Aglianico','Malbec','Bordeaux Blend','Cabernet Sauvignon','Syrah','Shiraz','Priorat','Touriga Franca','Pinotage','Petit Verdot','Mourvèdre','Touriga Nacional','Petite Sirah','Sagrantino','Tannat','Bordeaux-style Red Blend'])
dessert = WINE_STYLE('dessert', ['Ice Wine','Late Harvest','Madeira','Malvasia','Sauternes','Sherry','Tokaji','Vin Santo','White Port','Port','Porto','Marsala','Noble Rot','Passito','Freisa'])

all_varieties = pd.Series(sparkling.varieties + light_white.varieties + full_white.varieties + aromatic_white.varieties + rose.varieties + light_red.varieties + medium_red.varieties + full_red.varieties + dessert.varieties)

# assign style
styles = [sparkling, light_white, full_white, aromatic_white, rose, light_red, medium_red, full_red, dessert]
def to_style(wine):
    """Return the style name (e.g. 'light_red') of the first style that lists this variety."""
    return next(style.name for style in styles if wine in style.varieties)

def to_type(wine):
    """Return the basic type (e.g. 'red'): the last underscore-separated token of the style name."""
    return to_style(wine).split('_')[-1]


**Verify that no wine variety is repeated across the styles**

In [2]:
# verify that no variety appears in more than one style
assert all_varieties[all_varieties.duplicated()].count() == 0
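
If this assertion ever fails, it helps to know which varieties are listed more than once. A minimal sketch using only the standard library, with a small hypothetical list standing in for `all_varieties` (the sample values are assumptions, not the notebook's data):

```python
from collections import Counter

# Hypothetical stand-in for all_varieties; 'Muscat' is deliberately listed twice.
varieties = ['Cava', 'Prosecco', 'Muscat', 'Riesling', 'Muscat']

# Count every variety and keep the ones that occur more than once.
duplicates = sorted(name for name, count in Counter(varieties).items() if count > 1)
print(duplicates)
```

Printing the offending names before asserting makes a failed check easier to act on.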

**Verify the Type & Style Helpers**

In [3]:
assert to_style('Pinot Noir') == 'light_red'
assert to_type('Pinot Noir') == 'red'
to_style('Pinot Noir'), to_type('Pinot Noir')

('light_red', 'red')
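
The `to_style` and `to_type` helpers scan the style lists on every call, and `next` raises `StopIteration` when a variety is not listed anywhere. One possible alternative is to build a variety-to-style dictionary once and look varieties up with a default. The sketch below assumes two abbreviated styles for illustration; `lookup_style` and `lookup_type` are hypothetical names, not part of the notebook.

```python
from collections import namedtuple

WINE_STYLE = namedtuple('WINE_STYLE', ['name', 'varieties'])

# Two abbreviated styles, for illustration only.
styles = [
    WINE_STYLE('light_red', ['Gamay', 'Pinot Noir']),
    WINE_STYLE('aromatic_white', ['Riesling', 'Muscat']),
]

# Build the variety -> style name mapping once.
style_by_variety = {
    variety: style.name for style in styles for variety in style.varieties
}

def lookup_style(wine, default=None):
    """Hypothetical helper: style name, or `default` for an unlisted variety."""
    return style_by_variety.get(wine, default)

def lookup_type(wine, default=None):
    """Hypothetical helper: last token of the style name, e.g. 'red'."""
    style = lookup_style(wine)
    return default if style is None else style.split('_')[-1]

print(lookup_style('Pinot Noir'), lookup_type('Pinot Noir'))
print(lookup_style('Unknown Grape'))
```

A dictionary lookup takes constant time per variety, which may matter when the helper is mapped over roughly 100,000 rows.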

## Load Wine Review Data

Libraries

In [4]:
import numpy as np

from IPython.display import Markdown, display

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme()

Source: Kaggle [Wine Reviews](https://www.kaggle.com/datasets/zynicide/wine-reviews)

In [5]:
wine_df = pd.read_parquet('files/wine_review.parquet.gzip')
wine_df.info()
wine_df[['title', 'winery', 'year', 'variety', 'description', 'preprocessed_description']].head()

<class 'pandas.core.frame.DataFrame'>
Index: 100538 entries, 0 to 129970
Data columns (total 16 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   country                   100538 non-null  object 
 1   description               100538 non-null  object 
 2   points                    100538 non-null  int64  
 3   price                     93522 non-null   float64
 4   taster_name               100538 non-null  object 
 5   title                     100538 non-null  object 
 6   variety                   100538 non-null  object 
 7   winery                    100538 non-null  object 
 8   year                      100538 non-null  int64  
 9   wine_style                100538 non-null  object 
 10  type                      100538 non-null  object 
 11  quality                   100538 non-null  object 
 12  classification            100538 non-null  object 
 13  location                  100538 non-null  object

| | title | winery | year | variety | description | preprocessed_description |
|---|---|---|---|---|---|---|
| 0 | Nicosia 2013 Vulkà Bianco (Etna) | Nicosia | 2013 | White Blend | Aromas include tropical fruit, broom, brimston... | aroma include tropical fruit broom brimstone... |
| 1 | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Quinta dos Avidagos | 2011 | Portuguese Red | This is ripe and fruity, a wine that is smooth... | do ripe fruity wine smooth structure firm tann... |
| 2 | Rainstorm 2013 Pinot Gris (Willamette Valley) | Rainstorm | 2013 | Pinot Gris | Tart and snappy, the flavors of lime flesh and... | rainstorm tart snappy flavor lime flesh rind d... |
| 3 | St. Julian 2013 Reserve Late Harvest Riesling ... | St. Julian | 2013 | Riesling | Pineapple rind, lemon pith and orange blossom ... | pineapple rind lemon pith orange blossom start... |
| 4 | Sweet Cheeks 2012 Vintner's Reserve Wild Child... | Sweet Cheeks | 2012 | Pinot Noir | Much like the regular bottling from 2012, this... | vintner like regular bottling come rough tanni... |


## Feature Extraction

**Vectorize**

In [23]:
corpus = wine_df.preprocessed_description

In [31]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer, HashingVectorizer

vectorizer = TfidfVectorizer(max_features=2**12, stop_words="english")
X = vectorizer.fit_transform(corpus)
print(X.shape)
vectorizer.get_feature_names_out()[:50]


(100538, 4096)


array(['10', '16th', '18th', '1970', '19th', '2010', '2012', '2013',
       '2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021',
       '2022', '2023', '2025', '2030', 'abbreviate', 'abeja', 'ability',
       'able', 'abound', 'abrasive', 'abrupt', 'abruptly', 'absolute',
       'absolutely', 'absorb', 'abundance', 'abundant', 'abv', 'acacia',
       'accent', 'accentuate', 'acceptable', 'accessible',
       'accompaniment', 'accompany', 'accord', 'account', 'achieve',
       'achievement', 'acid', 'acidic', 'acidity', 'acquire', 'acre'],
      dtype=object)

In [36]:
np.array(list(filter(lambda col: col[:1].isalpha(), vectorizer.get_feature_names_out())))[:50]

array(['abbreviate', 'abeja', 'ability', 'able', 'abound', 'abrasive',
       'abrupt', 'abruptly', 'absolute', 'absolutely', 'absorb',
       'abundance', 'abundant', 'abv', 'acacia', 'accent', 'accentuate',
       'acceptable', 'accessible', 'accompaniment', 'accompany', 'accord',
       'account', 'achieve', 'achievement', 'acid', 'acidic', 'acidity',
       'acquire', 'acre', 'acrid', 'act', 'action', 'actually', 'add',
       'addition', 'additional', 'adequate', 'admirable', 'admirably',
       'adorn', 'advanced', 'advantage', 'aeration', 'affect',
       'affordable', 'affordably', 'african', 'afternoon', 'aftertaste'],
      dtype='<U15')
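
The filter above removes non-alphabetic tokens only after fitting, so tokens such as `'2013'` still take up some of the 4096 feature slots. One possible alternative is to exclude them during tokenization through the `token_pattern` parameter of `TfidfVectorizer`. The sketch below uses only `re` to show what a candidate pattern would keep; the pattern and the sample text are assumptions, not something the notebook uses.

```python
import re

# Assumed pattern: a letter, then one or more word characters.
# [^\W\d_] means "a word character that is neither a digit nor an underscore".
LETTER_FIRST = r"(?u)\b[^\W\d_]\w+\b"

sample = "2013 vintage with 19th century charm and bright acidity rosé"

# Tokens that begin with a digit ('2013', '19th') are skipped entirely.
tokens = re.findall(LETTER_FIRST, sample)
print(tokens)
```

Passing `token_pattern=LETTER_FIRST` to `TfidfVectorizer` should apply the same rule at fit time, although this has not been checked against the wine review data.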

## Cluster Analysis

### KMeans Clustering

See [Clustering documents with TFIDF and KMeans](https://www.kaggle.com/code/jbencina/clustering-documents-with-tfidf-and-kmeans).
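
As a rough illustration of the approach in that reference, the sketch below clusters TF-IDF vectors with scikit-learn's `KMeans`. The toy corpus is hypothetical and stands in for `wine_df.preprocessed_description`; `n_clusters=2` suits the toy data, whereas the notebook's objective would suggest `5`.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for wine_df.preprocessed_description (hypothetical text).
toy_corpus = [
    "crisp citrus lemon acidity",
    "lemon lime crisp mineral",
    "dark cherry tannin oak",
    "oak tannin plum dark",
]

# Vectorize the documents into a sparse TF-IDF matrix.
X_toy = TfidfVectorizer().fit_transform(toy_corpus)

# Fix random_state so repeated runs give the same assignment.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_toy)
print(labels)
```

On the full data, something like `pd.crosstab(labels, wine_df['type'])` could show how closely the clusters follow the five basic types. That comparison is a suggestion and was not run here.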

### Agglomerative Clustering