## Wine Review Text Classifier

I'm going to build a model that predicts Wine Enthusiast scores for a set of
wines. I'm using the following data set from Kaggle:

https://www.kaggle.com/zynicide/wine-reviews?select=winemag-data_first150k.csv

This data was originally scraped from the Wine Enthusiast website. My plan
is to incorporate text analytics, since the data set has
descriptions for each wine.

For the first attempt, I'll use a logistic regression model to classify, based on its description, whether a wine scores above 90 points (label 1) or not (label 0).

In [1]:
# Import Libraries
import pandas as pd
import numpy as np
import nltk
import re


In [2]:
# Download text corpora
nltk.download()


showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [3]:
# List out text corpora
from nltk.book import *
from nltk.corpus import treebank


*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


### Data set

This data set consists of 2 csv files. 
1. One with 150k records
2. One with 130k records 

After initially exploring the data, I found three things in the csv files
that needed to be fixed.

#### Remove quotation marks

There were quotation marks in the description field. This
caused a problem when loading the data into a Spark dataframe, because
the description column was being split where it shouldn't have been.

#### Remove record with new line

There was a record with '\r\n' in the description.
I don't know why it was there, but I replaced it with a space as well.

The way I did this was to import the csv into a pandas dataframe and apply
a replace function to the records. With a large enough dataset this may not
work, since the whole file has to fit in memory; in that case it would be
smarter to process the file row by row using open(). For now, I made the
necessary adjustments and saved them as new csv files. These new files will
be used for the Spark dataframes.
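
For reference, here is a rough sketch of what the row-by-row approach could look like with the standard `csv` module. The helper names `clean_description` and `clean_csv_stream` are made up for illustration, and the demo runs on a tiny in-memory file rather than the real csv files.

```python
import csv
import io

def clean_description(text):
    """Strip double quotes and turn Windows line breaks into a space."""
    return text.replace('"', '').replace('\r\n', ' ')

def clean_csv_stream(src, dst, column='description'):
    """Copy CSV rows from src to dst one at a time, cleaning one column."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        row[column] = clean_description(row[column])
        writer.writerow(row)

# Tiny in-memory demo; real files would be opened with open(path, newline='')
raw = 'points,description\r\n90,"A ""bold"" red\r\nwith spice"\r\n'
out = io.StringIO()
clean_csv_stream(io.StringIO(raw, newline=''), out)
print(out.getvalue())
```

Only one row is held in memory at a time, so the file size matters much less than it does with a full dataframe load.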

#### Remove ‚Äî characters

This was common throughout the two files. I tried removing these using regex in the code, but I wasn't able to get it to work. Rather than go down a rabbit hole, I simply removed the characters manually in the csv file.
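
One thing that might have worked instead of regex is a plain literal replacement. This is only a sketch on made-up text: the garbled sequence below is an assumption modeled on the `Äì` fragment searched for further down in this notebook, and the real files may contain other variants.

```python
import pandas as pd

# Assumed garbled sequence (hypothetical; check the actual csv contents)
GARBLED = '‚Äì'

# Made-up reviews standing in for the description column
reviews = pd.Series(['Drink 2020‚Äì2025.', 'No stray characters here.'])

# regex=False treats the pattern as literal text, so nothing needs escaping
cleaned = reviews.str.replace(GARBLED, '-', regex=False)
print(cleaned.tolist())
```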

In [4]:
# Load 150k file into pandas dataframe
wine150k = pd.read_csv('/Users/Matt/Desktop/Programming/DataScience/TextPractice/WineReviews/winemag-data_first150k.csv')


In [5]:
# Remove quotation marks from all description records
wine150k['description'] = wine150k['description'].apply(lambda row: row.replace('"', ''))


In [6]:
# One record has '\r\n' in the description; replace it with a space
wine150k['description'] = wine150k['description'].apply(lambda row: row.replace('\r\n', ' '))


In [7]:
# Preview wine150k
wine150k.head()


| | Unnamed: 0 | country | description | designation | points | price | province | region_1 | region_2 | variety | winery |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | US | This tremendous 100% varietal wine hails from ... | Martha's Vineyard | 96 | 235.0 | California | Napa Valley | Napa | Cabernet Sauvignon | Heitz |
| 1 | 1 | Spain | Ripe aromas of fig, blackberry and cassis are ... | Carodorum Selección Especial Reserva | 96 | 110.0 | Northern Spain | Toro | | Tinta de Toro | Bodega Carmen Rodríguez |
| 2 | 2 | US | Mac Watson honors the memory of a wine once ma... | Special Selected Late Harvest | 96 | 90.0 | California | Knights Valley | Sonoma | Sauvignon Blanc | Macauley |
| 3 | 3 | US | This spent 20 months in 30% new French oak, an... | Reserve | 96 | 65.0 | Oregon | Willamette Valley | Willamette Valley | Pinot Noir | Ponzi |
| 4 | 4 | France | This is the top wine from La Bégude, named aft... | La Brûlade | 95 | 66.0 | Provence | Bandol | | Provence red blend | Domaine de la Bégude |


In [8]:
# Load 130k file into pandas dataframe
wine130k = pd.read_csv('/Users/Matt/Desktop/Programming/DataScience/TextPractice/WineReviews/winemag-data_130k_v2.csv')


In [9]:
# Remove quotation marks from all description records
wine130k['description'] = wine130k['description'].apply(lambda row: row.replace('"', ''))


In [10]:
# Preview wine130k
wine130k.head()


| | Unnamed: 0 | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | | Sicily & Sardinia | Etna | | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |
| 1 | 1 | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | 87 | 15.0 | Douro | | | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos |
| 2 | 2 | US | Tart and snappy, the flavors of lime flesh and... | | 87 | 14.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Rainstorm 2013 Pinot Gris (Willamette Valley) | Pinot Gris | Rainstorm |
| 3 | 3 | US | Pineapple rind, lemon pith and orange blossom ... | Reserve Late Harvest | 87 | 13.0 | Michigan | Lake Michigan Shore | | Alexander Peartree | | St. Julian 2013 Reserve Late Harvest Riesling ... | Riesling | St. Julian |
| 4 | 4 | US | Much like the regular bottling from 2012, this... | Vintner's Reserve Wild Child Block | 87 | 65.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Sweet Cheeks 2012 Vintner's Reserve Wild Child... | Pinot Noir | Sweet Cheeks |


Since the 130k dataframe has 3 additional fields, I'll need to remove those
before combining the data.

In [11]:
# Remove columns in 130k file that aren't in 150k file
wine130k.drop(['taster_name','taster_twitter_handle','title'], axis=1, inplace=True)
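
Rather than typing the extra column names by hand, they could probably be computed from the two frames. A small sketch on empty toy frames (`small_150k` and `small_130k` are stand-ins I made up, not the real data):

```python
import pandas as pd

# Toy frames standing in for wine150k and wine130k (columns only)
small_150k = pd.DataFrame(columns=['country', 'description', 'points'])
small_130k = pd.DataFrame(columns=['country', 'description', 'points',
                                   'taster_name', 'taster_twitter_handle', 'title'])

# Columns present in the 130k-style frame but missing from the 150k-style one
extra_cols = small_130k.columns.difference(small_150k.columns)
aligned = small_130k.drop(columns=extra_cols)

print(list(extra_cols))
print(list(aligned.columns))
```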


In [52]:
# Combine both wine files into one dataframe
# https://stackoverflow.com/questions/40397206/how-can-i-combineconcatenate-two-data-frames-with-the-same-column-name-in-java
wine_combined = pd.concat([wine150k, wine130k])

wine_combined.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 280901 entries, 0 to 129970
Data columns (total 11 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   Unnamed: 0   280901 non-null  int64  
 1   country      280833 non-null  object 
 2   description  280901 non-null  object 
 3   designation  197701 non-null  object 
 4   points       280901 non-null  int64  
 5   price        258210 non-null  float64
 6   province     280833 non-null  object 
 7   region_1     234594 non-null  object 
 8   region_2     111464 non-null  object 
 9   variety      280900 non-null  object 
 10  winery       280901 non-null  object 
dtypes: float64(1), int64(2), object(8)
memory usage: 25.7+ MB


I want to rename the 'Unnamed: 0' column to 'ID', just so it's a cleaner
column name to work with. I'll copy the column under the new name and drop the old one.

In [13]:
# Rename column 'Unnamed: 0' to 'ID'
wine_combined['ID'] = wine_combined['Unnamed: 0']

# Drop old column
wine_combined.drop(['Unnamed: 0'], axis=1, inplace=True)
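
An alternative to copy-and-drop would be `DataFrame.rename`, which does the same thing in one step and leaves the column where it was. A minimal sketch on a made-up two-row frame:

```python
import pandas as pd

# Made-up frame standing in for wine_combined
toy = pd.DataFrame({'Unnamed: 0': [0, 1], 'points': [96, 87]})

# rename returns a new frame and keeps the column in its original position
renamed = toy.rename(columns={'Unnamed: 0': 'ID'})
print(list(renamed.columns))
```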


In [14]:
# Create label column that will be used in the text classifier
# (1 if points is strictly greater than 90, otherwise 0)
wine_combined['Above_90'] = np.where(wine_combined.points > 90, 1, 0)
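
The choice of comparison decides which class a wine with exactly 90 points lands in. A quick sketch with made-up scores around the cutoff shows the difference between `>` and `>=`:

```python
import numpy as np

points = np.array([89, 90, 91])  # made-up scores around the cutoff

strictly_above = np.where(points > 90, 1, 0)   # 90 is labeled 0
at_or_above = np.where(points >= 90, 1, 0)     # 90 is labeled 1

print(strictly_above.tolist(), at_or_above.tolist())
```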


### To do: review this with text8 first to understand what it does with the wine data

In [None]:
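# Frequency of words across all wine descriptions
# (tokenize first; passing the column directly would count whole descriptions, not words)
dist = FreqDist(word for text in wine_combined['description'] for word in nltk.word_tokenize(text))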


In [None]:
# Unique words
vocab1 = dist.keys()


In [None]:
# Words longer than 5 characters that appear more than 10 times
frequent_words = [v for v in vocab1 if dist[v] > 10 and len(v) > 5]
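
To see why tokenizing matters here, this is a small sketch using `collections.Counter`, which NLTK's `FreqDist` is built on. The descriptions are made up, and a simple regex tokenizer is swapped in for `nltk.word_tokenize` so the example runs without any corpus downloads.

```python
import re
from collections import Counter

# Made-up descriptions standing in for wine_combined['description']
descriptions = [
    'Ripe cherry and plum flavors.',
    'Ripe plum with a long finish.',
]

# Counting the strings themselves tallies whole descriptions
by_description = Counter(descriptions)

# Tokenizing first tallies individual words
# (simple regex tokenizer used here in place of nltk.word_tokenize)
words = [w for text in descriptions for w in re.findall(r'[a-z]+', text.lower())]
by_word = Counter(words)

print(len(by_description))
print(by_word['ripe'], by_word['plum'], by_word['cherry'])
```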


In [None]:
# Normalize and Stem words
porter = nltk.PorterStemmer()


In [None]:
# NLTK built in tokenizers
sent_fox = "The fox jumped over the brown lazy dog on his way to eat the chicken nuggets!"
WToken = nltk.word_tokenize(sent_fox)


In [15]:
text = wine_combined.iloc[10].description
SToken = nltk.sent_tokenize(text)

print(text)
print(SToken)

Elegance, complexity and structure come together in this drop-dead gorgeous winethat ranks among Italy's greatest whites. It opens with sublime yellow spring flower, aromatic herb and orchard fruit scents. The creamy, delicious palate seamlessly combines juicy white peach, ripe pear and citrus flavors while white almond and savory mineral notes grace the lingering finish.
["Elegance, complexity and structure come together in this drop-dead gorgeous winethat ranks among Italy's greatest whites.", 'It opens with sublime yellow spring flower, aromatic herb and orchard fruit scents.', 'The creamy, delicious palate seamlessly combines juicy white peach, ripe pear and citrus flavors while white almond and savory mineral notes grace the lingering finish.']


In [17]:
# Import sklearn packages for model
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score


### Text classifier features

Ideally, the model would use the wine description as features alongside the other columns in the data (price, winery, etc.). For now, though, I'll use only the description so I can get a better understanding of building a text classifier.

In [21]:
# Create train and test groups with features (description) and label (Above_90)
x_train, x_test, y_train, y_test = train_test_split(
    wine_combined['description'],
    wine_combined['Above_90'],
    random_state=0,
)


In [22]:
# min_df sets the minimum number of reviews a term must appear in to be kept

# ngram_range also takes sequences of adjacent words as features,
# e.g. 'really good', 'very bad', etc.
# The upper bound can be set above 2 as well!

# Chose min_df=10 and ngram_range=(1,3) because it returned a large number of
# features that would still process
wine_vect = CountVectorizer(min_df=10, ngram_range=(1,3)).fit(x_train)
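
To make `min_df` and `ngram_range` more concrete, here is a toy sketch on three made-up reviews. The values `min_df=2` and `ngram_range=(1, 2)` are chosen only to keep the output small; they are not the settings used for the wine data.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['really good wine', 'really good value', 'very bad wine']  # made-up reviews

# Keep unigrams and bigrams that appear in at least 2 of the 3 documents
toy_vect = CountVectorizer(min_df=2, ngram_range=(1, 2)).fit(docs)
kept = sorted(toy_vect.vocabulary_)
print(kept)

# Each row is a document, each column the count of one kept term
counts = toy_vect.transform(['really good wine', 'very bad wine']).toarray()
print(counts.tolist())
```

Terms such as 'value' or 'very bad' show up in only one document, so they are dropped before the model ever sees them.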


In [23]:
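# Get number of features (words and n-grams) learned from the training reviews
# Note: scikit-learn 1.2+ replaces get_feature_names() with get_feature_names_out()
len(wine_vect.get_feature_names())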


191267

In [24]:
# Create sparse matrix of term counts for the training reviews
# Each row represents one wine review
# Each column is 0 or > 0 depending on how many times the term
# appears in that specific review!
x_train_vectorized = wine_vect.transform(x_train)


In [25]:
# Create the logistic regression model and fit it to the data
model = LogisticRegression(max_iter=10000)
model.fit(x_train_vectorized, y_train)


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=10000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [31]:
# Predict class labels for samples in x_test
# https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.predict_proba
predictions = model.predict(wine_vect.transform(x_test))

print('Predictions:', predictions[:10])


Predictions: [0 1 0 1 0 0 1 0 0 0]


In [32]:
# Predict the probability of the sample for each class in the model
# classes are ordered by model.classes_
# model.classes_
probabilities = model.predict_proba(wine_vect.transform(x_test))

print('Probabilities:', probabilities[:10])


Probabilities: [[9.98606498e-01 1.39350169e-03]
 [4.55966857e-03 9.95440331e-01]
 [9.88303122e-01 1.16968776e-02]
 [1.55098856e-02 9.84490114e-01]
 [9.99580605e-01 4.19395438e-04]
 [9.98153628e-01 1.84637218e-03]
 [1.75506505e-02 9.82449349e-01]
 [8.18574031e-01 1.81425969e-01]
 [9.99776526e-01 2.23474398e-04]
 [8.31647345e-01 1.68352655e-01]]


In [33]:
# Evaluate the model with area under the curve (AUC)
# Note: this scores the hard 0/1 predictions rather than the predicted probabilities
print('Area Under Curve:', roc_auc_score(y_test, predictions))


Area Under Curve: 0.9040313583282931
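
AUC computed from hard 0/1 predictions can differ from AUC computed from predicted probabilities, because thresholding throws away the ranking within each class. A small sketch with made-up labels and scores (not the wine data) shows the gap:

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]             # made-up labels
scores = [0.1, 0.4, 0.45, 0.8]    # made-up predicted probabilities of class 1
hard = [1 if s >= 0.5 else 0 for s in scores]  # thresholded 0/1 predictions

auc_from_scores = roc_auc_score(y_true, scores)
auc_from_labels = roc_auc_score(y_true, hard)
print(auc_from_scores, auc_from_labels)
```

In this toy case the scores rank every positive above every negative, but the thresholded labels lose one of the positives.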


In [41]:
# Get all feature names included in the model, in alphabetical order
feature_names = np.array(wine_vect.get_feature_names())

feature_names[:10]

array(['000', '000 bottles', '000 case', '000 case product',
       '000 case production', '000 cases', '000 cases imported',
       '000 cases it', '000 cases made', '000 cases of'], dtype='<U33')

In [43]:
# Sort the coefficients from the model
# argsort returns the feature indices (into the alphabetical list above)
# ordered from the most negative coefficient to the most positive
sorted_coef_index = model.coef_[0].argsort()

sorted_coef_index

array([ 94680, 141611,  57278, ...,   1626,   1659,   1654])

In [44]:
print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:-11:-1]]))


Smallest Coefs:
['lacks' 'simple' 'everyday' 'ripasso' 'prosecco' 'drink up' 'alsace'
 'dolcetto' '89' 'gewürztraminer']

Largest Coefs:
['92' '93' '90 92' '2030' 'gorgeous' '2025' '91 93' 'superb' 'terrific'
 'beautiful']
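
The indexing above is easier to follow on a tiny example. This sketch uses four made-up terms and coefficients (not values from the fitted model) to show what `argsort` and the reversed slice return:

```python
import numpy as np

# Made-up terms and coefficients for illustration
terms = np.array(['simple', 'gorgeous', 'lacks', 'superb'])
coefs = np.array([-1.5, 2.0, -3.0, 1.0])

order = coefs.argsort()            # indices from most negative to most positive
smallest = terms[order[:2]]        # two most negative coefficients
largest = terms[order[:-3:-1]]     # two most positive, largest first

print(smallest.tolist(), largest.tolist())
```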



In [45]:
# Import TF-IDF to down-weight terms that appear in many reviews
# For this model, we'll import it to set a min_df
from sklearn.feature_extraction.text import TfidfVectorizer


In [46]:
# We'll require features to show up in at least 5 reviews (min_df=5)
# This reduces features from 34,960 to 11,454
# Turns out AUC scores for min_df=3,5,10 are all lower
new_vect = TfidfVectorizer(min_df=5).fit(x_train)


In [47]:
# https://gist.github.com/larsmans/3745866
# https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
# Caution: .toarray() builds a dense matrix, which may not fit in memory
# with this many rows and features
vect_df = pd.DataFrame(x_train_vectorized.toarray(), columns=wine_vect.get_feature_names())


In [57]:
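# Keep only reviews whose description contains "92"
df92 = wine_combined[wine_combined.description.str.contains("92")].copy()
# Strip multi-digit numbers and number ranges (e.g. 90-92) from the descriptions
df92['description'] = df92.description.str.replace(r'[0-9]+[-]?[–. ]?[0-9]+', '', regex=True)
df92.to_csv(r'/Users/Matt/Desktop/Programming/DataScience/TextPractice/WineReviews/test.csv', index=False)
# Check that no hyphenated number ranges remain
df92.description.str.findall(r'[0-9]+[-][0-9]+')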


1166      []
4566      []
5839      []
10349     []
11804     []
          ..
129174    []
129175    []
129176    []
129577    []
129679    []
Name: description, Length: 992, dtype: object

In [49]:
# Replace all white-space characters with the digit "9":
txt = "The rain in Spain"
x = re.sub(r"\s", "9", txt)
print(x)


The9rain9in9Spain


In [54]:
# Check whether any descriptions still contain the garbled "Äì" characters
garbled_rows = wine_combined[wine_combined.description.str.contains("Äì")].copy()

garbled_rows.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 0 entries
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Unnamed: 0   0 non-null      int64  
 1   country      0 non-null      object 
 2   description  0 non-null      object 
 3   designation  0 non-null      object 
 4   points       0 non-null      int64  
 5   price        0 non-null      float64
 6   province     0 non-null      object 
 7   region_1     0 non-null      object 
 8   region_2     0 non-null      object 
 9   variety      0 non-null      object 
 10  winery       0 non-null      object 
dtypes: float64(1), int64(2), object(8)
memory usage: 0.0+ bytes


In [56]:
df92.iloc[290].description

'–. Barrel sample. A very dry and tannic wine, this is hard edged, showing distinct minerality and a dry, gravelly feel. It has herbal notes, followed by rich spice and dark, brooding fruits. A powerful wine with a sense of extraction.'