# Parallel Processing of Wine Review Data

## Objective

Can we recreate these $5$ basic wine type groupings?

<p align="center">
<img src="images/Different-Types-of-Wine-v2.jpg" alt="wine types" width="300">
</p>

## Define Wine Classifications

**Helper functions to classify wines into the $9$ basic styles and $5$ basic types**

In [14]:
from collections import namedtuple
import pandas as pd

WINE_STYLE = namedtuple('WINE_STYLE', ['name', 'varieties'])

# wine varieties by wine style
sparkling = WINE_STYLE('sparkling', ['Cava','Prosecco', 'Crémant','Champagne', 'Spumante', 'Sparkling Blend'])

light_white = WINE_STYLE('light_white',
                         ['Albariño','Aligoté','Assyrtiko','Chablis','Chasselas','Chenin Blanc','Cortese','Friulano','Sauvignon Vert','Garganega','Grenache Blanc','Muscadet','Melon','Picpoul de Pinet','Pinot Blanc','Pinot Grigio','Pinot Gris','Verdejo','Verdicchio','Xarel-lo',]\
                          + ['Erbaluce','Grüner Veltliner','Sancerre','Sauvignon Blanc','Vermentino','Vinho Verde']
                        )
full_white = WINE_STYLE('full_white', ['Chardonnay','Marsanne','Sémillon','Trebbiano','Viognier','White Rioja','Pecorino', 'White Blend'])
aromatic_white = WINE_STYLE('aromatic_white', ['Gewürztraminer','Müller-Thurgau','Moschofilero','Muscat','Muscat Blanc','Moscato','Riesling','Torrontés',])
rose = WINE_STYLE('rose', ['Rosé', 'Rose'])
light_red = WINE_STYLE('light_red', ['Schiava', 'Gamay', 'Pinot Noir','Counoise', 'St. Laurent','Cinsaut','Primitivo','Blaufränkisch','Barolo'])
medium_red = WINE_STYLE('medium_red', ['Grenache','Garnacha','Valpolicella Blend','Bobal','Carménère','Carignan','Cabernet Franc','Mencía', 'Sangiovese','Negroamaro','Rhône-style Red Blend', 'Rhône/GSM Blend','Barbera','Merlot','Montepulciano','Zinfandel','Marquette','Chambourcin','Petite Pearl','Red Blend', 'Portuguese Red'])
full_red = WINE_STYLE('full_red', ['Tempranillo','Nebbiolo',"Nero d'Avola",'Aglianico','Malbec','Bordeaux Blend','Cabernet Sauvignon','Syrah','Shiraz','Priorat','Touriga Franca','Pinotage','Petit Verdot','Mourvèdre','Touriga Nacional','Petite Sirah','Sagrantino','Tannat','Bordeaux-style Red Blend'])
dessert = WINE_STYLE('dessert', ['Ice Wine','Late Harvest','Madeira','Malvasia','Sauternes','Sherry','Tokaji','Vin Santo','White Port','Port','Porto','Marsala','Noble Rot','Passito','Freisa'])

all_varieties = pd.Series(sparkling.varieties + light_white.varieties + full_white.varieties + aromatic_white.varieties + rose.varieties + light_red.varieties + medium_red.varieties + full_red.varieties + dessert.varieties)

# assign style
styles = [sparkling, light_white, full_white, aromatic_white, rose, light_red, medium_red, full_red, dessert]
def to_style(wine):
    """Return the style name (e.g. 'light_red') of the first style that lists this variety."""
    return next(style.name for style in styles if wine in style.varieties)

def to_type(wine):
    """Return the basic type (e.g. 'red'): the last '_'-separated token of the style name."""
    return to_style(wine).split('_')[-1]


**Verify that no wine variety is repeated across the styles**

In [15]:
# verify that no variety appears in more than one style's list
assert all_varieties[all_varieties.duplicated()].count() == 0
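
When the assertion above fails, it does not say which variety is repeated. A minimal sketch of one way to list the offenders, using `collections.Counter` on a small, hypothetical set of lists (the names `style_lists` and `find_duplicates` are illustrative and not part of the notebook):

```python
from collections import Counter

def find_duplicates(names):
    """Return the sorted names that occur more than once."""
    counts = Counter(names)
    return sorted(name for name, n in counts.items() if n > 1)

# Hypothetical toy data: 'Muscat' is deliberately listed under two styles.
style_lists = {
    'aromatic_white': ['Muscat', 'Riesling'],
    'dessert': ['Muscat', 'Port'],
}
flat = [variety for varieties in style_lists.values() for variety in varieties]
duplicates = find_duplicates(flat)
print(duplicates)
```

In the notebook itself, `all_varieties[all_varieties.duplicated()].tolist()` should give the same kind of listing directly from the pandas Series.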

**Verify the Type & Style Helpers**

In [16]:
assert to_style('Pinot Noir') == 'light_red'
assert to_type('Pinot Noir') == 'red'
to_style('Pinot Noir'), to_type('Pinot Noir')

('light_red', 'red')
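
Both helpers scan the style lists on every call and raise `StopIteration` for a variety that is not listed, which matters if the review data contains varieties outside these lists. A sketch of a lookup-table alternative, assuming unknown varieties should map to `None` (that choice, the abbreviated `STYLES` table, and the names `VARIETY_TO_STYLE`, `lookup_style`, and `lookup_type` are assumptions for illustration):

```python
# Abbreviated, illustrative subset of the style lists defined above.
STYLES = {
    'sparkling': ['Cava', 'Prosecco'],
    'light_white': ['Pinot Grigio', 'Sauvignon Blanc'],
    'rose': ['Rosé', 'Rose'],
    'light_red': ['Gamay', 'Pinot Noir'],
    'full_red': ['Malbec', 'Syrah'],
}

# Invert once: variety -> style name, so each lookup is a single dict access.
VARIETY_TO_STYLE = {
    variety: style
    for style, varieties in STYLES.items()
    for variety in varieties
}

def lookup_style(wine):
    """Return the style name, or None when the variety is not listed."""
    return VARIETY_TO_STYLE.get(wine)

def lookup_type(wine):
    """Return the basic type (last token of the style name), or None."""
    style = lookup_style(wine)
    return None if style is None else style.split('_')[-1]

print(lookup_style('Pinot Noir'), lookup_type('Pinot Noir'))
print(lookup_type('Unlisted Variety'))
```

With a pandas Series of varieties, `Series.map(VARIETY_TO_STYLE)` would apply the same table in one step and leave `NaN` for unlisted varieties.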

## Load Preprocessed Reviews

**Libraries**

In [17]:
import numpy as np

from IPython.display import Markdown, display

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme()

Source: Kaggle [Wine Reviews](https://www.kaggle.com/datasets/zynicide/wine-reviews)

In [26]:
corpus = pd.read_parquet('files/wine_review.parquet.gzip', columns=['preprocessed_description']).preprocessed_description
corpus

0           aroma include tropical fruit broom brimstone...
1         do ripe fruity wine smooth structure firm tann...
2         rainstorm tart snappy flavor lime flesh rind d...
3         pineapple rind lemon pith orange blossom start...
4         vintner like regular bottling come rough tanni...
                                ...                        
129966    note honeysuckle cantaloupe sweeten deliciousl...
129967    citation citation give decade bottle age prior...
129968    drain gravel soil give wine crisp dry characte...
129969    dry style crisp acidity weight solid powerful ...
129970    dit big rich dry power intense spiciness round...
Name: preprocessed_description, Length: 100538, dtype: object

## Feature Extraction

**Vectorize**

In [27]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=2**12, stop_words="english")
tf_v = vectorizer.fit_transform(corpus)
tf_v.shape

(100538, 4096)

**Inspect Features**

In [28]:
vectorizer.get_feature_names_out()[:50]

array(['10', '16th', '18th', '1970', '19th', '2010', '2012', '2013',
       '2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021',
       '2022', '2023', '2025', '2030', 'abbreviate', 'abeja', 'ability',
       'able', 'abound', 'abrasive', 'abrupt', 'abruptly', 'absolute',
       'absolutely', 'absorb', 'abundance', 'abundant', 'abv', 'acacia',
       'accent', 'accentuate', 'acceptable', 'accessible',
       'accompaniment', 'accompany', 'accord', 'account', 'achieve',
       'achievement', 'acid', 'acidic', 'acidity', 'acquire', 'acre'],
      dtype=object)

**Drop Number Features**

In [29]:
features = np.array(list(filter(lambda col: col[:1].isalpha(), vectorizer.get_feature_names_out())))
features[:50]

array(['abbreviate', 'abeja', 'ability', 'able', 'abound', 'abrasive',
       'abrupt', 'abruptly', 'absolute', 'absolutely', 'absorb',
       'abundance', 'abundant', 'abv', 'acacia', 'accent', 'accentuate',
       'acceptable', 'accessible', 'accompaniment', 'accompany', 'accord',
       'account', 'achieve', 'achievement', 'acid', 'acidic', 'acidity',
       'acquire', 'acre', 'acrid', 'act', 'action', 'actually', 'add',
       'addition', 'additional', 'adequate', 'admirable', 'admirably',
       'adorn', 'advanced', 'advantage', 'aeration', 'affect',
       'affordable', 'affordably', 'african', 'afternoon', 'aftertaste'],
      dtype='<U15')
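
The filter keeps every feature whose first character is alphabetic. Because `str.isalpha` is Unicode-aware, accented tokens such as `île` survive while `16th` and `2010` are dropped. A small sketch on a hand-picked token list (the list is illustrative, not taken from the fitted vocabulary as a whole):

```python
# Illustrative tokens; the same predicate is used in the cell above.
tokens = ['10', '16th', '2010', 'abv', 'acidity', 'île']

kept = [t for t in tokens if t[:1].isalpha()]
dropped = [t for t in tokens if not t[:1].isalpha()]

print(kept)
print(dropped)
```

Only the first character is tested, so a token that starts with a letter but contains digits later would still be kept.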

**Feature Matrix**

In [24]:
X = pd.DataFrame(tf_v.toarray(), columns=vectorizer.get_feature_names_out())[features]
X

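| | abbreviate | abeja | ability | able | abound | abrasive | abrupt | abruptly | absolute | absolutely | ... | zesty | zin | zinfandel | zing | zingy | zip | zippy | zone | zull | île |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 100533 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 100534 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 100535 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 100536 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |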
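
Calling `tf_v.toarray()` materializes every cell of the matrix: at 100538 rows by 4096 columns of 8-byte floats that is roughly 3.3 GB, almost all of it zeros. A sketch of selecting the alphabetic columns while staying sparse, assuming SciPy and NumPy are available; the tiny matrix and the names `feature_names`, `tf_small`, and `X_sparse` are hypothetical stand-ins for the notebook's objects:

```python
import numpy as np
from scipy import sparse

# Hypothetical stand-ins for vectorizer.get_feature_names_out() and tf_v.
feature_names = np.array(['10', '2010', 'acidity', 'cherry', 'oak'])
tf_small = sparse.csr_matrix(np.array([
    [0.5, 0.0, 0.7, 0.0, 0.5],
    [0.0, 0.3, 0.0, 0.9, 0.3],
]))

# Integer positions of the columns whose name starts with a letter.
keep_idx = np.flatnonzero([name[:1].isalpha() for name in feature_names])

# Column selection on the sparse matrix; no dense copy is created.
X_sparse = tf_small[:, keep_idx]
kept_names = feature_names[keep_idx]

print(X_sparse.shape, list(kept_names))
```

If a labeled frame is still wanted, `pd.DataFrame.sparse.from_spmatrix(X_sparse, columns=kept_names)` builds one backed by sparse columns. scikit-learn estimators such as `KMeans` also accept the sparse matrix directly.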


## Cluster Analysis

### KMeans Clustering

See [Clustering documents with TFIDF and KMeans](https://www.kaggle.com/code/jbencina/clustering-documents-with-tfidf-and-kmeans).

### Agglomerative Clustering
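
A minimal sketch of what this step could look like with scikit-learn's `AgglomerativeClustering`, again on a hypothetical six-document corpus. The estimator is given a dense array here, and hierarchical clustering generally scales poorly with the number of samples, so on the full review set it would likely need to run on a sample. The corpus, `n_clusters=2`, and the Ward linkage are assumptions for illustration.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus standing in for a sample of the reviews.
sample_docs = [
    'cherry plum tannin oak',
    'plum cherry tannin spice',
    'oak tannin cherry',
    'lemon lime citrus crisp',
    'citrus lime crisp acidity',
    'lemon crisp acidity citrus',
]

# Dense TF-IDF array; acceptable here because the sample is tiny.
X_dense = TfidfVectorizer().fit_transform(sample_docs).toarray()

# Ward linkage merges the pair of clusters that least increases variance.
agg = AgglomerativeClustering(n_clusters=2, linkage='ward')
agg_labels = agg.fit_predict(X_dense)

print(agg_labels)
```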