# NLP Feature Extraction

### Load Subset of Clean Wine Reviews

See [data preparation](wine_review-data_preparation.ipynb) for details on the prepared dataset.

**Libraries**

In [8]:
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [9]:
wine_df_subset = pd.read_parquet('files/wine_review_subset.parquet.gzip')
wine_df_subset.info()
wine_df_subset.head()

<class 'pandas.core.frame.DataFrame'>
Index: 1000 entries, 32797 to 11496
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   country         1000 non-null   object 
 1   description     1000 non-null   object 
 2   points          1000 non-null   int64  
 3   price           964 non-null    float64
 4   taster_name     1000 non-null   object 
 5   title           1000 non-null   object 
 6   variety         1000 non-null   object 
 7   winery          1000 non-null   object 
 8   year            1000 non-null   int64  
 9   wine_style      1000 non-null   object 
 10  type            1000 non-null   object 
 11  quality         1000 non-null   object 
 12  classification  1000 non-null   object 
 13  location        1000 non-null   object 
 14  band            964 non-null    object 
 15  tokens          1000 non-null   object 
dtypes: float64(1), int64(2), object(13)
memory usage: 132.8+ KB


| | country | description | points | price | taster_name | title | variety | winery | year | wine_style | type | quality | classification | location | band | tokens |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 32797 | US | Perfumed in rose and violet, with a sauvage sc... | 90 | 30.0 | Virginie Boone | Row Eleven 2013 Dutton Sanchietti Pinot Noir (... | Pinot Noir | Row Eleven | 2013 | light_red | red | high | New World | California | super | perfum rose violet sauvag scent musk distinct ... |
| 105190 | US | This is a very lush Cab, made in the modern st... | 92 | 70.0 | Unknown | Robert Craig 2004 Cabernet Sauvignon (Mount Ve... | Cabernet Sauvignon | Robert Craig | 2004 | full_red | red | high | New World | California | luxury | thi lush cab made modern style all element cul... |
| 89652 | Australia | Full-bodied and plush in texture, this offers ... | 85 | 8.0 | Joe Czerwinski | Yellow Tail 2015 Chardonnay (South Eastern Aus... | Chardonnay | Yellow Tail | 2015 | full_white | white | medium | New World | South Eastern Australia | value | full-bodi plush textur offer plenti apricot ma... |
| 115814 | US | Where else in this country can you find $12 Pi... | 87 | 12.0 | Paul Gregutt | Eola Hills 2006 Pinot Noir (Oregon) | Pinot Noir | Eola Hills | 2006 | light_red | red | medium | New World | Oregon | value | where els countri find 12 pinot actual variet ... |
| 93320 | US | This shows an old-fashioned Spring Mountain dr... | 87 | 55.0 | Unknown | Bougetz 2010 Amaryllis Cabernet Sauvignon (Spr... | Cabernet Sauvignon | Bougetz | 2010 | full_red | red | medium | New World | California | luxury | thi show old-fashion spring mountain dryness t... |


**Group Features by Type**

In [10]:
num_cols = wine_df_subset.select_dtypes(np.number).columns.to_list()
num_cols

['points', 'price', 'year']

In [11]:
cat_cols = wine_df_subset.select_dtypes(object).columns.drop('tokens').to_list()
cat_cols

['country',
 'description',
 'taster_name',
 'title',
 'variety',
 'winery',
 'wine_style',
 'type',
 'quality',
 'classification',
 'location',
 'band']

### Feature Extraction

In [12]:
wine_df_subset.tokens.iloc[0]

'perfum rose violet sauvag scent musk distinct wine soft silki palat it featur mix red cherri sage full bodi expans lengthi finish'

**Bag-of-words** counts how many times each token occurs in each document, ignoring word order.

In [13]:
cvect = CountVectorizer()
dtm = cvect.fit_transform(wine_df_subset.tokens)
pd.DataFrame(dtm.toarray(), columns = cvect.get_feature_names_out()).head()

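| | 000 | 05 | 07 | 08 | 10 | 100 | 1000 | 11 | 114 | 12 | ... | you | young | yountvil | your | youth | zest | zesti | zingi | zip | zippi |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |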


**TF-IDF** weights each token by how often it occurs in a document, scaled down by how common the token is across the corpus.

In [14]:
# Use separate names so the bag-of-words vectorizer and matrix are not overwritten
tvect = TfidfVectorizer()
tfidf_dtm = tvect.fit_transform(wine_df_subset.tokens)
pd.DataFrame(tfidf_dtm.toarray(), columns = tvect.get_feature_names_out()).head()

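| | 000 | 05 | 07 | 08 | 10 | 100 | 1000 | 11 | 114 | 12 | ... | you | young | yountvil | your | youth | zest | zesti | zingi | zip | zippi |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.190698 | ... | 0.0 | 0.116853 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |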
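To make the weights in the table less opaque, the sketch below recomputes TF-IDF by hand and compares it with `TfidfVectorizer`. It assumes scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`, raw term counts), under which the inverse document frequency is `ln((1 + n) / (1 + df)) + 1` and each row is scaled to unit length. The toy documents are made-up stand-ins, not rows from the dataset.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy stand-ins for the stemmed `tokens` column (hypothetical, not dataset rows)
docs = [
    "soft silki palat red cherri",
    "red cherri red plum",
    "plush textur apricot",
]

# Term frequency: raw counts per document
counts = CountVectorizer().fit_transform(docs).toarray().astype(float)
n_docs = counts.shape[0]

# Document frequency: number of documents containing each token
df = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (scikit-learn's default form)
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Weight counts by idf, then scale each row to unit L2 norm
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Compare against the library result
sk = TfidfVectorizer().fit_transform(docs).toarray()
matches = np.allclose(manual, sk)
print(matches)
```

Because of the row normalization, values are comparable across reviews of different lengths, and a token that appears in few reviews gets a larger weight than an equally frequent token that appears in many.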


TODO: look into **word2vec** (Google's word-embedding model)

# Next
- [train initial model](wine_review-baseline_model.ipynb)