# Wine Enthusiast: What factors determine a wine's rating?

Many factors can influence a wine's quality and taste. The most commonly cited are a wine's age and price. But what factors really influence a wine's rating? To answer this question, we analyzed reviews of a variety of wines by sommeliers and professional wine tasters to determine whether features such as variety, origin, or even the taster's sentiment affect a wine's rating.

**Objective**: *What factors have the strongest influence on a wine's star rating?*



## Dataset Description

The dataset, which was obtained from Wine Enthusiast, contains 14 feature columns and 129,971 records, of which 118,840 are unique. It contains the wine reviews, ratings, and other features such as origin country, variety, and region.

In [None]:
# Mounting Google Drive
from google.colab import drive
drive.mount("/content/drive")

## Feature Descriptions
* **country:** Origin country of the wine
* **description:** Review by taster
* **designation:** Name given by producer
* **points:** Rating given by taster
* **price:** Price of wine in USD
* **province:** Province or state (US) of origin
* **region_1:** Region of origin
* **region_2:** Region of origin (duplicate)
* **taster_name:** Wine taster name
* **taster_twitter_handle:** Wine taster twitter username
* **title:** Wine name
* **variety:** Wine type
* **winery:** Winery that produced wine

## Importing Packages

In [None]:
# Importing packages
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

import re
import string

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

from wordcloud import WordCloud, STOPWORDS
from textblob import TextBlob
import nltk
nltk.download('punkt')
nltk.download('stopwords')

import spacy # used spaCy for text preprocessing

import gensim
from gensim import corpora

# libraries for visualization
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis
%matplotlib inline

In [None]:
# Reading the dataset
wine = pd.read_csv('/content/drive/My Drive/MSBA_Colab_2020/Text_Mining/Wine_Enthusiast_Data.csv')

## Exploratory Data Analysis


In [None]:
# Examining first 5 rows
wine.head()

In [None]:
# Examining number of rows and columns
wine.shape

In [None]:
# Unique number of wine titles
len(wine.title.unique())

In [None]:
# Datatypes and column names
wine.info()

In [None]:
# Checking for null values
wine.isna().sum()

In [None]:
# Checking for duplicates
wine.duplicated().sum()

## Data Cleaning

The data is prepared for analysis by filling or dropping null values and extracting the production year from the title column. The continuous variables Price, Star Rating, and Year are then binned so that they can be used as categorical features in the models.
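
The two cleaning steps can be tried on a small frame first. The sketch below is a minimal illustration under stated assumptions: the `toy` DataFrame and its values are hypothetical, and only the column names mirror the dataset. It fills a missing price with the mean of its country/variety group and pulls the first `20xx` year out of the title.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the notebook's dataset.
toy = pd.DataFrame({
    "country": ["US", "US", "France", "France"],
    "variety": ["Merlot", "Merlot", "Gamay", "Gamay"],
    "price": [20.0, None, 30.0, 50.0],
    "title": ["Alpha 2012 Merlot", "Beta 2014 Merlot", "Gamma Gamay", "Delta 2009 Gamay"],
})

# Fill missing prices with the mean price of the same country/variety group.
group_mean = toy.groupby(["country", "variety"])["price"].transform("mean")
toy["price"] = toy["price"].fillna(group_mean)

# Extract the first 20xx year from the title; rows without a match become NaN.
# str.extract keeps the row index, so years stay aligned with their rows.
toy["year"] = pd.to_numeric(toy["title"].str.extract(r"\b(20\d{2})", expand=False))

print(toy[["price", "year"]])
```

Because both operations return Series that share the frame's index, the results stay attached to the right rows even when earlier filtering has left gaps in the index.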

In [None]:
# Dropping unneeded columns
wine = wine.drop(['Unnamed: 0','designation','province','region_1','taster_name','taster_twitter_handle','region_2','winery'],axis=1)
wine.head()

In [None]:
# Finding avg price per country and variety to replace null values in price column
price_mean = wine.groupby(['country', 'variety'])['price'].transform('mean')
# Assigning the result back instead of calling fillna(inplace=True) on a column selection
wine['price'] = wine['price'].fillna(price_mean)

In [None]:
# Checking remaining null values in price columns
wine.price.isna().sum()

# 31 price values were not populated

In [None]:
# Dropping remaining null values and zeros in price column
wine = wine[pd.notnull(wine['price'])]
wine = wine[wine.price != 0]
wine.price

In [None]:
# Creating new column with year produced
# Extracting the first 20xx year from the title; str.extract keeps the row index,
# so each year stays aligned with its wine even though rows were dropped above
wine['year'] = pd.to_numeric(wine['title'].str.extract(r'\b(20\d{2})', expand=False))

# Dropping null values in year column
wine = wine.dropna(subset=['year'],axis=0)

# Changing year to integer
wine['year'] = wine.year.astype(int)

wine.head(1)

### Price Bins

Wine prices range from USD4.00 to USD3,300.

Bins: 
* low = 4.0-18.0
* mid = 18.0-26.0
* high = 26.0-42.0
* ultra = 42.0-3300.0
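
Quantile-based binning places the bin edges at the quartiles of the data, so each label receives about the same number of wines. The sketch below uses a hypothetical list of prices (not taken from the dataset) to show how `pd.qcut` derives the edges and how `retbins=True` exposes them.

```python
import pandas as pd

# Hypothetical prices, chosen only to illustrate quartile binning.
prices = pd.Series([4, 10, 15, 18, 20, 25, 26, 30, 40, 42, 100, 3300], dtype=float)
labels = ["low", "mid", "high", "ultra"]

# q=4 splits at the 25th, 50th and 75th percentiles; retbins returns the edges.
price_bin, edges = pd.qcut(prices, q=4, labels=labels, retbins=True)

print(edges)                     # the five bin edges, from minimum to maximum
print(price_bin.value_counts())  # each label receives the same number of rows here
```

A single extreme price such as 3,300 only stretches the last bin; it does not shift the other edges, which is one reason quantile bins suit a skewed price distribution.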

In [None]:
# Finding price range
wine['price'].describe()

In [None]:
# Creating bins and labels for price data
bin_labels_price = ['low','mid','high','ultra']
wine['price_bin'] = pd.qcut(wine['price'],q=4, labels=bin_labels_price) #dividing data into 4 bins
wine.price_bin.value_counts()

### Star Rating Bins
Wine ratings range from 80 to 100 points.

Star ratings range from 1 to 5. The points are split into five quantile-based bins, so each star rating covers roughly the same number of wines.

In [None]:
# Finding ratings range
wine['points'].describe()

In [None]:
# Creating bins and labels for ratings data
bin_labels_points = ['1','2','3','4','5']
wine['star_rating'] = pd.qcut(wine['points'],q=5,labels=bin_labels_points) #dividing data into 5 bins (star ratings)
wine.star_rating.value_counts()

### Year Bins
The years range from 2000 to 2017. 

The years are binned in groups of roughly 5 years. Each bin includes its upper bound, so the label '2005-2010' covers 2006 through 2010.
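
The boundary behavior of fixed-width bins can be checked directly. In this minimal sketch the `years` values are hypothetical boundary cases; the bin edges and labels are the ones used in this notebook. By default `pd.cut` builds right-closed intervals, so a boundary year falls into the earlier bin.

```python
import pandas as pd

# Hypothetical boundary years to show where each one lands.
years = pd.Series([2000, 2005, 2006, 2010, 2011, 2015, 2016, 2017])
labels = ['2000-2005', '2005-2010', '2010-2015', '2015 and later']

# Intervals are (1999, 2005], (2005, 2010], (2010, 2015], (2015, 2018].
year_bin = pd.cut(years, bins=[1999, 2005, 2010, 2015, 2018], labels=labels)

print(pd.DataFrame({"year": years, "year_bin": year_bin}))
```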

In [None]:
# Finding year range
wine.year.describe()

In [None]:
# Creating bins and labels for year data
bin_labels_year = ['2000-2005','2005-2010','2010-2015','2015 and later']
wine['year_bin'] = pd.cut(wine['year'],bins=[1999,2005,2010,2015,2018], labels=bin_labels_year)

wine.year_bin.value_counts()

In [None]:
# Viewing result
wine.head()

## Sentiment Analysis

A sentiment analysis is conducted to identify any polarity in the wine reviews. After the polarity score is obtained, it is binned so that it can be used as a categorical feature in the regression model. The bin groupings are as follows:

* Polarity score of -1 to 0 (inclusive): negative
* Polarity score above 0 up to 1: positive

A column is added to the dataframe in order to keep track of the sentiment. Lastly, polarity in reviews is visualized by star rating and country.
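
TextBlob polarity scores lie between -1 and 1, so the edge values deserve a quick check when binning. The sketch below uses hypothetical scores to show that `pd.cut` with bins `[-1, 0, 1]` leaves a score of exactly -1 unlabeled unless `include_lowest=True` is passed, and that a neutral score of 0 is counted as negative.

```python
import pandas as pd

# Hypothetical polarity scores covering both edges and the midpoint.
scores = pd.Series([-1.0, -0.2, 0.0, 0.35, 1.0])
labels = ["negative", "positive"]

# Default: intervals are (-1, 0] and (0, 1], so -1.0 falls outside every bin.
default_bins = pd.cut(scores, bins=[-1, 0, 1], labels=labels)

# include_lowest=True closes the first interval on the left as well.
closed_bins = pd.cut(scores, bins=[-1, 0, 1], labels=labels, include_lowest=True)

print(pd.DataFrame({"score": scores, "default": default_bins, "closed": closed_bins}))
```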

In [None]:
# Creating functions to determine polarity

def detect_polarity(text):
    
    # Converts the text into a TextBlob object and then returns
    # the polarity.
    blob = TextBlob(text)
    
    # return the polarity
    return blob.sentiment.polarity

In [None]:
# Applying function to every row in data set
wine.head()
wine['polarity_score'] = wine.description.apply(detect_polarity)
wine.head()

In [None]:
# Binning polarity scores
polarity_labels = ['negative','positive']
# include_lowest=True keeps a score of exactly -1 in the 'negative' bin
wine['polarity'] = pd.cut(wine['polarity_score'], bins=[-1,0,1], labels=polarity_labels, include_lowest=True)

In [None]:
# Exporting for separate model code
wine.to_csv('wine_full_data.csv')

### Visualizations
The charts below show the average polarity by star rating, country, and year.

In [None]:
# Displaying Graphs of polarity by Points
points_polarity = wine.groupby('star_rating')['polarity_score'].mean()
plt.bar(points_polarity.index, points_polarity)
plt.xlabel('Star Rating')
plt.ylabel('Polarity Level')
plt.title('Polarity by Star Rating')
plt.show()

In [None]:
# Displaying Graphs of polarity by Country
wine.country.value_counts()

# Keeping only countries with more than 50 entries
country_over_50 = wine[wine.country.map(wine.country.value_counts()) > 50]

# Grouping country by average polarity
country_polarity = country_over_50.groupby('country')['polarity_score'].mean().sort_values(ascending=False)

# Creating bar chart
plt.bar(country_polarity.index, country_polarity, color = 'c')
plt.xticks(rotation = 90)
plt.title('Average Polarity by Country')
plt.xlabel('Country')
plt.ylabel('Polarity')
plt.show()

In [None]:
# Displaying Graphs of polarity by year
year_polarity = wine.groupby('year')['polarity_score'].mean()
plt.plot(year_polarity.index, year_polarity, linewidth=4)
plt.xlabel('Year')
plt.ylabel('Polarity Level')
plt.title('Polarity Over Time')
plt.show()

## LDA and Gensim Model
An LDA model was fitted with gensim to identify topics and their top keywords in the *Description* column of the dataset, and pyLDAvis was used to visualize the topics. Coherence scores were then compared across models to identify the optimal number of topics.
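
Picking the number of topics from a list of coherence scores comes down to mapping the position of the best score back to a topic count. The helper below is a hypothetical sketch (`best_num_topics` and the example scores are not part of the notebook); it assumes the scores were computed for `start`, `start + step`, `start + 2*step`, and so on.

```python
def best_num_topics(coherence_values, start=2, step=1):
    """Return (num_topics, coherence) for the highest coherence score."""
    best_index = max(range(len(coherence_values)), key=coherence_values.__getitem__)
    return start + best_index * step, coherence_values[best_index]

# Hypothetical coherence scores for 2, 3, 4, 5 and 6 topics.
scores = [0.31, 0.34, 0.33, 0.39, 0.36]
num_topics, score = best_num_topics(scores, start=2, step=1)
print(num_topics, score)
```

With `start=2` and `step=1`, list position 3 corresponds to five topics. The highest score is not always the best choice in practice; a smaller model with a similar score is often easier to interpret.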

In [None]:
# Cleaning text: removing punctuation, digit-only tokens, and words of
# 3 characters or fewer, then lower-casing the result

def clean_text(text):
    # Translation table that deletes every punctuation character
    table = str.maketrans('', '', string.punctuation)
    text1 = text.translate(table)

    textArr = text1.split()
    # Keeping tokens longer than 3 characters that are not purely digits
    text2 = ' '.join([w for w in textArr if not w.isdigit() and len(w) > 3])

    return text2.lower()

In [None]:
# Lower-casing the reviews and removing punctuation, numbers, and short words
wine['description_gensim'] = wine.description.apply(clean_text)

# Checking for Null Values
wine.description_gensim.isna().sum()
    # 0 Null

In [None]:
# Creating column with number of words in the original description:
wine['Num_words_text'] = wine['description'].apply(lambda x:len(str(x).split())) 
wine.columns

In [None]:
# Creating bins and labels for ratings data
bin_labels_points = ['1','2','3','4','5']
wine['star_rating'] = pd.qcut(wine['points'],q=5,labels=bin_labels_points) #dividing data into 5 bins (star ratings)

In [None]:
# Printing number of records per star rating
print('-------Dataset --------')
print(wine['star_rating'].value_counts())
print(len(wine))
print('-------------------------')

# Printing number of short reviews
max_review_data_sentence_length  = wine['Num_words_text'].max()
mask = (wine['Num_words_text'] < 100) & (wine['Num_words_text'] >=20)

wine_short_reviews = wine[mask]
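# Sampling 200 reviews per star rating; random_state makes the sample reproducible
wine_sampled = wine_short_reviews.groupby('star_rating').apply(lambda x: x.sample(n=200, random_state=42)).reset_index(drop = True)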

print('No of Short reviews')
print(len(wine_short_reviews))

In [None]:
# Further removing stop words 
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

def remove_stopwords(text):
    textArr = text.split(' ')
    rem_text = " ".join([i for i in textArr if i not in stop_words])
    return rem_text

# Obtaining sample and removing stopwords
wine_sampled['description_gensim']=wine_sampled['description_gensim'].apply(remove_stopwords)
wine_sampled.head()

In [None]:
# Defining lemmatization function
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

def lemmatization(texts, allowed_postags=('NOUN', 'ADJ')):
    # Keeping only the lemmas of tokens whose part of speech is allowed
    output = []
    for sent in texts:
        doc = nlp(sent)
        output.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return output

In [None]:
# Converting text samples to list
text_list=wine_sampled['description_gensim'].tolist()
print(text_list[1])

In [None]:
# Tokenizing reviews 
tokenized_reviews = lemmatization(text_list)
print(tokenized_reviews[1])

In [None]:
# Storing tokenized reviews in dictionary
dictionary = corpora.Dictionary(tokenized_reviews)

# Creating dtm
doc_term_matrix = [dictionary.doc2bow(rev) for rev in tokenized_reviews]

In [None]:
# Fitting LDA model
LDA = gensim.models.ldamodel.LdaModel

lda_model = LDA(corpus=doc_term_matrix, id2word=dictionary, num_topics=10, random_state=100,
                chunksize=1000, passes=50,iterations=100)

In [None]:
# Print the 10 topics
lda_model.print_topics()

In [None]:
# Visualize the topics
#https://github.com/bmabey/pyLDAvis
#https://speakerdeck.com/bmabey/visualizing-topic-models
pyLDAvis.enable_notebook()

vis = gensimvis.prepare(lda_model, doc_term_matrix, dictionary)
vis

In [None]:
#How will we know that this LDA model is good: perplexity versus coherence.
#log_perplexity() returns the per-word likelihood bound; perplexity is 2**(-bound),
#so a higher (less negative) bound means lower perplexity and a better fit.
#total_docs is left at its default so it matches the size of the sampled corpus.
print('\nPer-word likelihood bound: ', lda_model.log_perplexity(doc_term_matrix))
#The higher the coherence, the better the model.
#Compute Coherence Score
from gensim.models.coherencemodel import CoherenceModel
coherence_model_lda = CoherenceModel(model=lda_model, texts=tokenized_reviews, dictionary=dictionary , coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)

In [None]:
#Compute the coherence scores by varying the number of topics

def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=3):
    """
    Compute c_v coherence for various number of topics

    Parameters:
    ----------
    dictionary : Gensim dictionary
    corpus : Gensim corpus
    texts : List of input texts
    limit : Max num of topics

    Returns:
    -------
    model_list : List of LDA topic models
    coherence_values : Coherence values corresponding to the LDA model with respective number of topics
    """
    
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        model = gensim.models.ldamodel.LdaModel(corpus=corpus, num_topics=num_topics, id2word=dictionary)
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())

    return model_list, coherence_values

In [None]:
model_list, coherence_values = compute_coherence_values(dictionary=dictionary, corpus=doc_term_matrix, texts=tokenized_reviews, start=2, limit=50, step=1)

In [None]:
# Show graph
limit=50; start=2; step=1
x = range(start, limit, step)
plt.plot(x, coherence_values)
plt.xlabel("Num Topics")
plt.ylabel("Coherence score")
# Passing a list so the whole string is used as a single legend label
plt.legend(["coherence_values"], loc='best')
plt.show()

In [None]:
# Print the coherence scores
for m, cv in zip(x, coherence_values):
    print("Num Topics =", m, " has Coherence Value of", round(cv, 4))

In [None]:
# Select the model and print the topics
# model_list[3] is the 5-topic model (start=2, step=1 gives 2, 3, 4, 5 topics)
optimal_model = model_list[3]
model_topics = optimal_model.show_topics(formatted=False)
optimal_model.print_topics(num_words=10)

In [None]:
# Visualize the topics
#After reducing the number of topics, there is not much overlapping
pyLDAvis.enable_notebook()
vis = gensimvis.prepare(optimal_model, doc_term_matrix, dictionary)
vis

In [None]:
# Removing gensim columns
wine = wine.drop(['description_gensim','Num_words_text'],axis=1)

## Topic Modeling: NMF

The Gensim Model found the optimal number of topics to be five. Hence, five topics are used when identifying topics and keywords with non-negative matrix factorization (NMF). After the topics are identified, they are named according to their keywords. Lastly, word clouds are generated to visualize the most important keywords, both for the entire dataset and for the top five countries.
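
Reading topics out of a fitted factorization relies on two array operations: sorting each topic's word weights to find its keywords, and taking the row-wise maximum of the document-topic matrix to label each review. The sketch below illustrates both on hypothetical arrays (the vocabulary, weights, and `top_words` helper are made up for illustration and do not come from the fitted model).

```python
import numpy as np

# Hypothetical vocabulary and NMF-style topic-word weights (2 topics x 6 words).
vocab = np.array(["cherry", "oak", "tannin", "citrus", "crisp", "apple"])
components = np.array([
    [0.9, 0.7, 0.8, 0.0, 0.1, 0.05],
    [0.0, 0.1, 0.05, 0.9, 0.6, 0.7],
])

def top_words(topic_weights, vocab, n=3):
    """Return the n highest-weighted words, strongest first."""
    # argsort is ascending, so take the last n indices and reverse them.
    top_idx = np.argsort(topic_weights)[-n:][::-1]
    return vocab[top_idx].tolist()

for i, topic in enumerate(components, start=1):
    print(f"Topic {i}: {top_words(topic, vocab)}")

# Hypothetical document-topic weights; argmax(axis=1) picks the dominant topic per row.
doc_topic = np.array([[0.8, 0.1], [0.2, 0.5]])
dominant = doc_topic.argmax(axis=1)
print(dominant)
```

Note that `argsort()[-n:]` on its own lists the strongest word last; reversing the slice puts it first, which is usually easier to read when naming topics.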

In [None]:
#TfidfVectorizer performs a count vectorizer beforehand.
tf = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')

In [None]:
# Creating document term matrix of wine.description
dtm = tf.fit_transform(wine['description']) # Call fit_transform on the description column
dtm

In [None]:
# Creating NMF
nmf_model = NMF(n_components=5,random_state=42)
nmf_model

In [None]:
# Fitting model
nmf_model.fit(dtm)

In [None]:
# Obtaining number of features
len(tf.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2

In [None]:
# Obtaining topic names and words
feature_names = tf.get_feature_names_out()
for index,topic in enumerate(nmf_model.components_):
    print(f'THE TOP 5 WORDS FOR TOPIC #{index+1}')
    print([feature_names[i] for i in topic.argsort()[-5:]])
    print('\n')

In [None]:
# Naming topic classes
wine_classification = {0: 'Red & Oaky Wine', 1: 'Fruity Wines', 2:'Rich & Ripe Wines', 3: 'Bouquet Wines', 4: 'Dry Wines'}

# Attaching topics to original dtm
topic_labels = nmf_model.transform(dtm)

# Grab the index position of the most representative topic by calling argmax().
topic_labels[0].argmax()

# Placing Topic classification of each wine into an array
topic_labels.argmax(axis=1)

# Creating new column for topic class
wine['topic_class'] = topic_labels.argmax(axis=1)

# Adding topic classification label
wine['topic_class'] = wine['topic_class'].map(wine_classification)
wine.topic_class.value_counts()

In [None]:
# WordCloud for entire dataset

# concatenate all the reviews into one single string 
full_text = ' '.join(wine['description'])

my_stop_words = ["https", "co", "RT", 'aren', 'couldn', 'didn', 'doesn', 'don', 'hadn', 'hasn', 'haven', 'isn', 'let', 'll', 'mustn', 're', 'rt', 'shan', 'shouldn', 've', 'wasn', 'weren', 'won', 'wouldn'] + list(STOPWORDS)
cloud_no_stopword = WordCloud(background_color='white', stopwords=my_stop_words).generate(full_text)

    # SYNTAX: WordCloud(background_color='white', stopwords= x ).generate(your list of words)

plt.imshow(cloud_no_stopword, interpolation='bilinear')
plt.axis('off')
plt.show()

In [None]:
# NMF by Country
# Identifying countries with the most wine records
wine.country.value_counts().nlargest(5)

In [None]:
# Word Cloud by Country:
# concatenate all the reviews into one single string
countries = ['US', 'France', 'Italy', 'Spain', 'Portugal']

def country_wordcloud(countries):
  for x in countries:
    print(x)
    df = wine[wine.country == x]
    full_text = ' '.join(df['description'])

    my_stop_words = ["drink","wine","https", "co", "RT", 'aren', 'couldn', 'didn', 'doesn', 'don', 'hadn', 'hasn', 'haven', 'isn', 'let', 'll', 'mustn', 're', 'rt', 'shan', 'shouldn', 've', 'wasn', 'weren', 'won', 'wouldn'] + list(STOPWORDS)
    cloud_no_stopword = WordCloud(background_color='white', stopwords=my_stop_words).generate(full_text)

        # SYNTAX: WordCloud(background_color='white', stopwords= x ).generate(your list of words)

    plt.imshow(cloud_no_stopword, interpolation='bilinear')
    plt.axis('off')
    plt.show()
country_wordcloud(countries)

## Logistic Regression
A multinomial logistic regression is performed to identify the top features that influence a wine's star rating.

### Preparing Data

The data is prepared to fit the regression. First, countries are replaced by continents (regions) to avoid an excessive number of features, which can lead to overfitting. Then, One Hot Encoding is used to create dummy variables for the categorical variables so that the regression can be run. During this process, any remaining null values are removed.
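
One-hot encoding turns each category into its own 0/1 column. The sketch below uses `pd.get_dummies` as a lightweight stand-in for scikit-learn's `OneHotEncoder` (a swapped-in technique, chosen here only because it needs no extra setup), on a hypothetical frame with a deliberately non-contiguous index to show that the encoded rows keep their original labels.

```python
import pandas as pd

# Hypothetical rows; the gaps in the index mimic rows dropped during cleaning.
toy = pd.DataFrame(
    {"region": ["Europe", "Asia", "Europe"], "price_bin": ["low", "high", "mid"]},
    index=[10, 42, 97],
)

# One 0/1 column per category, named <column>_<category>.
encoded = pd.get_dummies(toy, dtype=int)

print(encoded)
```

Keeping the original index matters whenever the encoded frame is later joined back to other columns, because pandas aligns joins on index labels rather than on row position.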

In [None]:
# Replacing country with region (continent)
# Assigning the result back instead of using inplace=True on a column selection
wine['country'] = wine['country'].replace(
    ['Italy', 'Portugal','Spain', 'France', 'Germany', 'Austria', 'Hungary', 'Greece', 
     'Romania', 'Czech Republic', 'Slovenia', 'Luxembourg', 'Croatia', 'England', 'Bulgaria', 
     'Switzerland', 'Bosnia and Herzegovina', 'Ukraine', 'Slovakia', 'Serbia', 'Moldova', 
     'Cyprus', 'Macedonia'], 'Europe')
wine['country'] = wine['country'].replace(['US', 'Canada', 'Mexico'], 'North America')
wine['country'] = wine['country'].replace(['Argentina', 'Chile', 'Uruguay', 'Brazil', 'Peru'], 'South America')
wine['country'] = wine['country'].replace(['South Africa', 'Egypt', 'Morocco'], 'Africa')
wine['country'] = wine['country'].replace(['Israel', 'Lebanon', 'India', 'Armenia', 'China', 'Turkey', 'Georgia'], 'Asia')
wine['country'] = wine['country'].replace(['New Zealand', 'Australia'], 'Australia')

# Renaming column to region
wine.rename(columns={'country':'region'}, inplace=True)

# Dropping nulls
wine = wine.dropna(subset=['region'], axis=0)

In [None]:
# Dropping unneeded columns (keeping a copy of the full data for later use)
wine_original = wine.copy()
wine = wine.drop(['title','description','points','year','price','variety','polarity_score'], axis=1)

In [None]:
# Checking for nulls
wine.isna().sum()

In [None]:
# Previewing result
wine.head(1)

In [None]:
# Selecting data to be encoded
encode = wine.drop('star_rating', axis=1)

#Encoding - one hot encoder
enc = OneHotEncoder()
wine_enc = enc.fit_transform(encode)
col_names = enc.get_feature_names_out(encode.columns)

#Creating dataframe from encoded values; toarray() converts the sparse matrix to a dense array
#Keeping the original row index so that later index-based joins stay aligned
wine_enc = pd.DataFrame(wine_enc.toarray(), columns=col_names, index=encode.index)
wine_enc.head()

### Logistic Regression Model
The logistic regression model is fitted to predict star ratings for wines that are not included in the dataset. More importantly, it identifies the features that have the strongest influence on a wine's star rating by ranking the absolute values of their coefficients.
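
A multinomial logistic regression in scikit-learn stores one row of coefficients per class, so a feature has a separate coefficient for every star rating. The sketch below shows one possible way to summarize them, assuming the mean absolute coefficient across classes is an acceptable importance score; this is an illustrative choice, not the ranking used in this notebook, and the coefficient values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical coefficient matrix: one row per class, one column per feature.
feature_names = ["price_bin_low", "price_bin_ultra", "polarity_positive"]
coef_matrix = np.array([
    [ 1.2, -0.9, -0.1],   # lowest star rating
    [ 0.1,  0.0,  0.0],   # middle star rating
    [-1.3,  1.1,  0.2],   # highest star rating
])

# Averaging absolute values across classes gives one score per feature.
importance = (
    pd.Series(np.abs(coef_matrix).mean(axis=0), index=feature_names)
    .sort_values(ascending=False)
)

print(importance)
```

Looking at a single row instead answers a narrower question: which features push a wine toward or away from that one star rating.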

In [None]:
#Identify feature and label columns
feature_cols = wine_enc

X = feature_cols
y = wine.star_rating

In [None]:
#Splitting the Dataset into 65% train and 35% test data
#test_size must be a fraction (0.35); an integer would be read as a number of rows
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.35, random_state=35)

#Dropping index
X_test = X_test.reset_index(drop=True)
y_test = y_test.reset_index(drop=True)

In [None]:
#Scaling data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [None]:
#Fitting model
logreg = LogisticRegression(multi_class='multinomial', solver='newton-cg').fit(X_train, y_train)

#Making predictions
y_pred = logreg.predict(X_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))

In [None]:
#Classification report to measure performance
print(classification_report(y_test, y_pred, digits=4))

In [None]:
#Calling coefficients
logreg.coef_

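#Storing coefficients in dictionary with feature names
#Note: logreg.coef_ has one row per class; row 0 holds the coefficients for the first class (star rating '1')
coef_dict = {}

for coef, feat in zip(logreg.coef_[0,:], feature_cols):
  coef_dict[feat] = coef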

#storing result in DataFrame including signs to show positive or negative influence
coef = pd.DataFrame.from_dict(coef_dict,orient='index', columns=['coefficient'])

#Viewing results
coef.sort_values(by='coefficient', ascending=False)

In [None]:
# Obtaining the coefficients' absolute values to rank the features' influence
abs(coef).sort_values(by='coefficient', ascending=False)

## kNN based Recommender System

A recommender system is built that suggests five preferred wines based on a custom input.
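
The idea behind a nearest-neighbor recommender can be shown without any model class: compute the distance from the query to every wine and keep the closest ones. The sketch below is a minimal NumPy version with hypothetical titles and one-hot vectors; it uses Euclidean distance, which is also the default metric of scikit-learn's `NearestNeighbors`.

```python
import numpy as np

# Hypothetical wines described by four one-hot style features.
titles = ["Wine A", "Wine B", "Wine C", "Wine D"]
X = np.array([
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [1, 0, 1, 0],
], dtype=float)
query = np.array([1, 0, 1, 0], dtype=float)

def nearest(X, query, k=2):
    """Return (row index, distance) pairs for the k rows closest to query."""
    distances = np.linalg.norm(X - query, axis=1)
    # A stable sort keeps the original row order among ties.
    order = np.argsort(distances, kind="stable")[:k]
    return [(int(i), float(distances[i])) for i in order]

result = nearest(X, query, k=3)
for rank, (idx, dist) in enumerate(result, start=1):
    print(f"{rank}. {titles[idx]} (distance {dist:.2f})")
```

With one-hot features, many wines share identical vectors and therefore tie at distance zero, so the recommendations among them depend on row order rather than on any real preference.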

In [None]:
#Importing module
from sklearn.neighbors import NearestNeighbors

In [None]:
#Adding random wine data
rand_wine_data = {
    'region_Africa':0, 'region_Asia':0, 'region_Australia':0, 'region_Europe':1,
       'region_North America':0, 'region_South America':0, 'price_bin_high':1,
       'price_bin_low':0, 'price_bin_mid':0, 'price_bin_ultra':0,
       'year_bin_2000-2005':0, 'year_bin_2005-2010':0, 'year_bin_2010-2015':1,
       'year_bin_2015 and later':0, 'topic_class_Bouquet Wines':0, 'topic_class_Dry Wines':1,
       'topic_class_Fruity Wines':0, 'topic_class_Red & Oaky Wine':0,
       'topic_class_Rich & Ripe Wines':0, 'polarity_negative':0, 'polarity_positive':1, 
       'star_rating':4
}

rand_wine = pd.DataFrame(data=rand_wine_data, index=[0])
rand_wine

In [None]:
# Joining label and feature columns
wine_enc = wine_enc.join(y, how='inner')
wine_enc.head(1)

In [None]:
# Re-adding title column
wine_enc['title'] = wine_original['title']

# Dropping nulls
wine_enc.dropna(subset=['title'], axis=0, inplace=True)

# Selecting feature variables 
feature_cols = wine_enc.drop(['title'],axis=1)
X_kNN = feature_cols
y_kNN = wine_enc['title']

In [None]:
# Using NearestNeighbors model and kneighbors() method to find k neighbors.
# Setting n_neighbors = 5 to find 5 similar wines 
neigh = NearestNeighbors(n_neighbors=5, algorithm='auto')
neigh.fit(X_kNN)

# Reordering the input columns to match the column order the model was fitted on
distances, indices = neigh.kneighbors(rand_wine[X_kNN.columns])

In [None]:
# Printing the top 5 wine recommendations with their distances:
print('Recommendations based on the selected wine:\n')
for i in range(len(distances.flatten())):
  print('{0}. {1} (distance: {2:.2f})'.format(i+1, wine_enc['title'].iloc[indices.flatten()[i]],distances.flatten()[i]))