## Wine Review Text Classifier

I'm going to build a model that predicts Wine Enthusiast scores for a set of
wines. I'm using the following data set from Kaggle:

https://www.kaggle.com/zynicide/wine-reviews?select=winemag-data_first150k.csv

This data was originally scraped from the Wine Enthusiast website. My plan 
is to incorporate text analytics, since the data set has 
descriptions for each wine.

For the first attempt, I'll use a logistic regression model to classify whether a wine scores above 90 points (1) or not (0), based on its description.

Later on, I could try to predict the wine score itself, as well as add in lemmatization/stemming.
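As a rough idea of what stemming would do to the review text, here is a minimal sketch using NLTK's `PorterStemmer`. The sample sentence and the `stem_text` helper are made up for illustration and are not part of the notebook.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_text(text):
    """Lowercase, split on whitespace, and stem each token (hypothetical helper)."""
    return ' '.join(stemmer.stem(token) for token in text.lower().split())

# Made-up review fragment
sample = 'Ripe wines with bright flavors'
stemmed = stem_text(sample)
print(stemmed)
```

Stemming collapses plural and singular forms into one feature, which would shrink the vocabulary the classifier has to learn from.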

In [1]:
# Import Libraries
import pandas as pd
import numpy as np
import nltk
import re


In [2]:
# Download text corpora
nltk.download()


showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [3]:
# List out text corpora
from nltk.book import *
from nltk.corpus import treebank


*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


### Data set

This data set consists of 2 csv files. 
1. One with 150k records
2. One with 130k records 

After initially exploring the data, I found that there were 
three things in the csv files that need to be fixed. 

#### Remove Quotation marks

There were quotation marks in the description field. This 
caused a problem when loading the data into a Spark dataframe, because 
the description column was being split where it shouldn't have been.

#### Remove record with new line

There was a record with '\r\n' in the description.
I don't know why it was there, but I replaced it with a space.

I made these fixes by importing each csv into a pandas dataframe and applying
a replace function to the records. For a file too large to fit in memory 
this wouldn't work, and it would be smarter to process the file 
incrementally (with open(), for example). For now, I made the necessary adjustments 
and saved them as new csv files. These new files will be used for the Spark 
dataframes.
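For a file that doesn't fit in memory, one option is to let pandas read it in chunks and clean each chunk as it passes through. This is only a sketch: the helper names, the chunk size, and the tiny in-memory CSV are assumptions for illustration, not the files used in this notebook.

```python
import io
import pandas as pd

def clean_description(series):
    """Strip double quotes and replace Windows line breaks with a space."""
    return (series.str.replace('"', '', regex=False)
                  .str.replace('\r\n', ' ', regex=False))

def clean_csv_in_chunks(source, out_handle, chunksize=50_000):
    """Clean `description` chunk by chunk so the whole file never sits in memory."""
    for i, chunk in enumerate(pd.read_csv(source, chunksize=chunksize)):
        chunk['description'] = clean_description(chunk['description'])
        # Write the header only once, with the first chunk
        chunk.to_csv(out_handle, header=(i == 0), index=False)

# Tiny in-memory stand-in for a real file
raw = 'description,points\n"A ""bold"" red",91\n"Light and crisp",88\n'
out = io.StringIO()
clean_csv_in_chunks(io.StringIO(raw), out, chunksize=1)
print(out.getvalue())
```

With real files, `source` would be a path and `out_handle` an open file object.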

#### Remove special characters

The character sequence ‚Äî (which looks like a mis-decoded em dash) was common throughout the two files. I tried removing it from each file using a lambda function with a regex, but that didn't work. Instead, I modified the description field in the wine_combined dataframe below.
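One way to remove a fixed character sequence like this is a literal (non-regex) replacement. The sketch below uses made-up sample rows, and it assumes the sequence appears in the loaded strings exactly as it does on screen, which depends on how the file was decoded.

```python
import pandas as pd

# Hypothetical sample rows; the real descriptions come from the Kaggle files
sample = pd.Series(['Ripe fruit ‚Äî with firm tannins', 'Clean and crisp', None])

# regex=False treats the pattern as a literal string, so no escaping is needed
cleaned = sample.str.replace('‚Äî', ' ', regex=False)

# Collapse the repeated spaces left behind by the replacement
cleaned = cleaned.str.replace(r'\s+', ' ', regex=True).str.strip()
print(cleaned.tolist())
```

The `.str` accessor skips missing values, so rows without a description pass through unchanged.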

In [93]:
# Load 150k file into pandas dataframe
wine150k = pd.read_csv('/Users/Matt/Desktop/Programming/DataScience/TextPractice/WineReviews/winemag-data_first150k.csv')


In [94]:
# Remove quotation marks from all description records
wine150k['description'] = wine150k['description'].apply(lambda row: row.replace('"', ''))


In [95]:
# One record has '\r\n' in the description; replace it with a space
wine150k['description'] = wine150k['description'].apply(lambda row: row.replace('\r\n', ' '))


In [101]:
# Preview wine150k
wine150k.head()


| Unnamed: 0.1 | Unnamed: 0 | country | description | designation | points | price | province | region_1 | region_2 | variety | winery |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | US | This tremendous 100% varietal wine hails from ... | Martha's Vineyard | 96 | 235.0 | California | Napa Valley | Napa | Cabernet Sauvignon | Heitz |
| 1 | 1 | Spain | Ripe aromas of fig, blackberry and cassis are ... | Carodorum Selección Especial Reserva | 96 | 110.0 | Northern Spain | Toro | | Tinta de Toro | Bodega Carmen Rodríguez |
| 2 | 2 | US | Mac Watson honors the memory of a wine once ma... | Special Selected Late Harvest | 96 | 90.0 | California | Knights Valley | Sonoma | Sauvignon Blanc | Macauley |
| 3 | 3 | US | This spent 20 months in 30% new French oak, an... | Reserve | 96 | 65.0 | Oregon | Willamette Valley | Willamette Valley | Pinot Noir | Ponzi |
| 4 | 4 | France | This is the top wine from La Bégude, named aft... | La Brûlade | 95 | 66.0 | Provence | Bandol | | Provence red blend | Domaine de la Bégude |


In [102]:
# Load 130k file into pandas dataframe
wine130k = pd.read_csv('/Users/Matt/Desktop/Programming/DataScience/TextPractice/WineReviews/winemag-data_130k_v2.csv')


In [103]:
# Remove quotation marks from all description records
wine130k['description'] = wine130k['description'].apply(lambda row: row.replace('"', ''))


In [105]:
# Preview wine130k
wine130k.head()


| Unnamed: 0.1 | Unnamed: 0 | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | | Sicily & Sardinia | Etna | | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |
| 1 | 1 | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | 87 | 15.0 | Douro | | | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos |
| 2 | 2 | US | Tart and snappy, the flavors of lime flesh and... | | 87 | 14.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Rainstorm 2013 Pinot Gris (Willamette Valley) | Pinot Gris | Rainstorm |
| 3 | 3 | US | Pineapple rind, lemon pith and orange blossom ... | Reserve Late Harvest | 87 | 13.0 | Michigan | Lake Michigan Shore | | Alexander Peartree | | St. Julian 2013 Reserve Late Harvest Riesling ... | Riesling | St. Julian |
| 4 | 4 | US | Much like the regular bottling from 2012, this... | Vintner's Reserve Wild Child Block | 87 | 65.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Sweet Cheeks 2012 Vintner's Reserve Wild Child... | Pinot Noir | Sweet Cheeks |


Since the 130k dataframe has 3 additional fields, I'll need to remove those
before combining the data.

In [106]:
# Remove columns in 130k file that aren't in 150k file
wine130k.drop(['taster_name','taster_twitter_handle','title'], axis=1, inplace=True)
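Rather than typing the extra column names by hand, the difference between the two column sets can be computed. A small sketch with made-up miniature frames (`small_150k` and `small_130k` are illustrative names, not the real dataframes):

```python
import pandas as pd

# Hypothetical miniature frames that mimic the mismatch in columns
small_150k = pd.DataFrame(columns=['country', 'description', 'points'])
small_130k = pd.DataFrame(columns=['country', 'description', 'points',
                                   'taster_name', 'title'])

# Columns present in the second frame but not in the first
extra_cols = small_130k.columns.difference(small_150k.columns)
print(list(extra_cols))

# Drop them so both frames share the same layout
aligned = small_130k.drop(columns=extra_cols)
print(list(aligned.columns))
```

This keeps the drop step correct even if the set of extra columns changes in a later version of the file.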


In [107]:
# Combine both wine files into one dataframe (280K records)
# https://stackoverflow.com/questions/40397206/how-can-i-combineconcatenate-two-data-frames-with-the-same-column-name-in-java
wine_combined = pd.concat([wine150k, wine130k])
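By default `pd.concat` keeps each frame's own row index, so row labels repeat in the combined frame. That is harmless for the steps in this notebook, but it matters if rows are later selected by label. A toy illustration with made-up values:

```python
import pandas as pd

first = pd.DataFrame({'points': [96, 95]})
second = pd.DataFrame({'points': [87, 88]})

# Default behaviour: the original indexes are stacked as-is
stacked = pd.concat([first, second])

# ignore_index=True builds a fresh 0..n-1 index instead
reindexed = pd.concat([first, second], ignore_index=True)

print(stacked.index.tolist())
print(reindexed.index.tolist())
```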


I want to rename the 'Unnamed: 0' column to 'ID', just so it's a cleaner 
column name to work with. I'll copy the column under the new name and drop the old one.

In [108]:
# Rename column 'Unnamed: 0' to 'ID'
wine_combined['ID'] = wine_combined['Unnamed: 0']

# Drop old column
wine_combined.drop(['Unnamed: 0'], axis=1, inplace=True)
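An alternative to copy-and-drop is `DataFrame.rename`, which does the same job in one step. A sketch on a made-up two-row frame:

```python
import pandas as pd

df = pd.DataFrame({'Unnamed: 0': [0, 1], 'points': [96, 87]})

# rename returns a new frame; the column keeps its original position
renamed = df.rename(columns={'Unnamed: 0': 'ID'})
print(renamed.columns.tolist())
```

The one visible difference is column order: copy-and-drop moves 'ID' to the end of the frame, while `rename` leaves it where it was.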


In [109]:
# Create label column that will be used in the text classifier
wine_combined['Above_90'] = np.where(wine_combined.points > 90, 1, 0)
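The comparison in the cell above is strict, so a wine scoring exactly 90 is labeled 0. A quick check on three made-up scores shows how the two possible cut-offs differ:

```python
import numpy as np
import pandas as pd

points = pd.Series([89, 90, 91])

strict = np.where(points > 90, 1, 0)      # 90 itself falls in class 0
inclusive = np.where(points >= 90, 1, 0)  # 90 itself falls in class 1

print(strict, inclusive)
```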


In [110]:
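# Remove numbers from the description: multi-digit numbers, decimals and numeric ranges
# regex=True is required in pandas 2.0+, where str.replace treats the pattern literally by default
wine_combined['description'] = wine_combined.description.str.replace(r'[0-9]+[-]?[–. ]?[0-9]+', '', regex=True)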


In [74]:
# Import sklearn packages for model
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score


### Text classifier features

Ideally, the model would use the wine description as features alongside the other columns in the data (price, winery, etc). However, for now I'll use only the description, so I can get a better understanding of building a text classifier.
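As a sketch of what combining text with another column could look like later, scikit-learn's `ColumnTransformer` can vectorize the description and scale the price in one pipeline. The toy frame, the median imputation, and the choice of price as the extra feature are all assumptions for illustration, not what this notebook does.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up rows; None mimics a missing price
toy = pd.DataFrame({
    'description': ['gorgeous and superb', 'simple everyday wine',
                    'stunning structure', 'lacks depth'],
    'price': [120.0, 12.0, None, 9.0],
    'Above_90': [1, 0, 1, 0],
})

features = ColumnTransformer([
    # A string column name hands CountVectorizer a 1-D series of text
    ('text', CountVectorizer(), 'description'),
    # A list of names hands the numeric pipeline a 2-D frame
    ('price', make_pipeline(SimpleImputer(strategy='median'), StandardScaler()), ['price']),
])

clf = Pipeline([('features', features), ('model', LogisticRegression(max_iter=1000))])
clf.fit(toy[['description', 'price']], toy['Above_90'])

toy_predictions = clf.predict(toy[['description', 'price']])
print(toy_predictions)
```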

In [111]:
# Create test and train groups with features (description) and label (Above_90)
x_train, x_test, y_train, y_test = train_test_split(wine_combined['description']
,wine_combined['Above_90']
,random_state=0)


In [112]:
# min_df=10 drops any term that appears in fewer than 10 reviews

# ngram_range=(1,3) builds features from single words plus runs of
# two or three adjacent words, ie. 'really good', 'very bad', etc.

# Chose min_df=10 and ngram_range=(1,3) because it returned a large number of
# features that would still process
wine_vect = CountVectorizer(min_df=10, ngram_range=(1,3)).fit(x_train)
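To see what `min_df` and `ngram_range` do, here is a toy run on three made-up documents, with smaller settings than the ones used above so the result fits on one line:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['ripe fruit', 'ripe fruit and oak', 'oak']

# min_df=2 keeps only n-grams found in at least two documents;
# ngram_range=(1, 2) considers single words and adjacent word pairs
vect = CountVectorizer(min_df=2, ngram_range=(1, 2)).fit(docs)

vocab = sorted(vect.vocabulary_)
print(vocab)

# One row per document, one column per kept n-gram (alphabetical order)
counts = vect.transform(docs)
print(counts.toarray())
```

'and', 'fruit and' and 'and oak' each occur in only one document, so they are filtered out, while the bigram 'ripe fruit' survives because it occurs in two.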


In [125]:
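# Get the number of n-gram features learned from the training reviews
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)
len(wine_vect.get_feature_names_out())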


188786

In [114]:
# Create a sparse matrix of n-gram counts for the training reviews
# Each row represents one wine review
# Each column holds how many times one n-gram occurs in that review
# (0 if it doesn't occur)
x_train_vectorized = wine_vect.transform(x_train)


In [115]:
# Create the logistic regression model and fit it to the data
model = LogisticRegression(max_iter=10000)
model.fit(x_train_vectorized, y_train)


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=10000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [116]:
# Predict class labels for samples in x_test
# https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.predict_proba
predictions = model.predict(wine_vect.transform(x_test))

print('Predictions:', predictions[:10])


Predictions: [0 1 0 1 0 0 1 0 0 0]


In [117]:
# Predict the probability of the sample for each class in the model
# classes are ordered by model.classes_
# model.classes_
probabilities = model.predict_proba(wine_vect.transform(x_test))

print('Probabilities:', probabilities[:10])


Probabilities: [[9.97067612e-01 2.93238782e-03]
 [2.86824175e-02 9.71317582e-01]
 [9.84349604e-01 1.56503964e-02]
 [1.59148064e-02 9.84085194e-01]
 [9.99439435e-01 5.60565268e-04]
 [9.98908832e-01 1.09116755e-03]
 [1.41290010e-02 9.85870999e-01]
 [8.48089914e-01 1.51910086e-01]
 [9.99706111e-01 2.93888747e-04]
 [8.51475366e-01 1.48524634e-01]]
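The two outputs above are linked: for this kind of binary model, `predict` returns the class whose `predict_proba` column is larger. A sketch on a made-up one-feature data set (the name `toy_model` and the data are illustrative only):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data, only to show how the two methods relate
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

toy_model = LogisticRegression().fit(X, y)
proba = toy_model.predict_proba(X)

# Column order of predict_proba follows toy_model.classes_
from_proba = toy_model.classes_[proba.argmax(axis=1)]

print(toy_model.classes_)
print(from_proba)
print(toy_model.predict(X))
```

In the notebook's output, the second number in each row is therefore the estimated probability of class 1 (Above_90).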


In [118]:
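# Evaluate the model with area under the ROC curve (AUC)
# Note: this passes hard 0/1 predictions, so the score reflects a single
# threshold; passing probabilities[:, 1] would score the full ROC curve
print('Area Under Curve:', roc_auc_score(y_test, predictions))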


Area Under Curve: 0.899908508777753
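The cell above passes hard class labels to `roc_auc_score`. The function accepts them, but ROC AUC is normally computed from scores such as the class 1 probability, and the two inputs can give different numbers. A toy comparison with made-up scores (not results from this model):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.2, 0.4, 0.45, 0.9])   # hypothetical P(class 1)
hard_labels = (scores >= 0.5).astype(int)  # what predict() would return

# Every positive outranks every negative, so the ranking is perfect
auc_from_scores = roc_auc_score(y_true, scores)

# Thresholding throws away the ranking within each predicted class
auc_from_labels = roc_auc_score(y_true, hard_labels)

print(auc_from_scores, auc_from_labels)
```

In this notebook the equivalent call would be `roc_auc_score(y_test, probabilities[:, 1])`.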


In [119]:
# Get all feature names included in the model, in alphabetical order
# (get_feature_names_out() already returns a numpy array)
feature_names = wine_vect.get_feature_names_out()


In [120]:
# Sort the coefficients from the model
# argsort returns the feature indices ordered from the most negative coefficient
# (pushes hardest toward class 0) to the most positive (pushes hardest toward class 1)
sorted_coef_index = model.coef_[0].argsort()
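A small numpy-only example of how `argsort` and the two slices work together. The feature names are borrowed from the review vocabulary, but the coefficient values are made up:

```python
import numpy as np

names = np.array(['drink up', 'gorgeous', 'simple', 'superb'])
coefs = np.array([-1.2, 2.0, -0.7, 1.5])   # made-up values

# Indices ordered from the most negative to the most positive coefficient
order = coefs.argsort()

most_negative = names[order[:2]]      # first two: strongest push toward class 0
most_positive = names[order[:-3:-1]]  # last two, reversed: strongest push toward class 1

print(most_negative)
print(most_positive)
```

A large negative coefficient is just as influential as a large positive one; the sign only tells which class it favors.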


In [121]:
print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:-11:-1]]))


Smallest Coefs:
['drink up' 'lacks' 'tannins drink through' 'simple' 'everyday'
 'drink now' 'prosecco' 'ripasso' 'alsace' 'stalky']

Largest Coefs:
['barrel sample' 'gorgeous' 'terrific' 'superb' 'not heavy' 'stunning'
 'beautiful' 'marvelous' 'beautifully' 'very fine']

