# Introduction to the problem
The following scenario is purely fictional and is intended to interest people in the problem we are trying to solve.

*Assume that you are running a website where people can post reviews of drinks such as wine, beer, and spirits (whiskey, gin etc). As an administrator of the website you want to keep things transparent and the reviews objective, so you allow each reviewer to taste and review only one kind of drink, preferably wine, since this is what the website is most famous for. However, you (the administrator) are growing suspicious that some reviewers are cross drinking and posting reviews for other drinks. You want to stop this behaviour but you don't want to upset your tasters because they are good at their job, so you hire a data scientist to build an ML model to identify tasters who are cross drinking and reviewing other drinks.*

The domain we are working on is a corpus of wine and beer reviews collected from a website called [Wine Enthusiast](https://www.winemag.com/). The writers of our corpus are people who review and rate wines, beers and spirits professionally. Some of these tasters specialize in one drink (most of them in wine), while others embrace cross drinking and delve into other beverage categories.

Let's get started!

# Problem methodology
This notebook will focus on building a machine learning model, trained on wine and spirits reviews, to recognize tasters who are cross drinking and reviewing other beverages. The model will be evaluated against a collection of beer reviews where it will try to identify tasters based on their unique style.

Based on some thorough inspection of the data, we saw that there are 19 wine tasters, 2 beer tasters and 1 spirits taster. The 2 beer tasters also review wines, so they would be the people we would like to identify. The problem, then, boils down to a binary classification problem. We will use the wine dataset and the spirits dataset and we will train a model to identify two categories, **wine_taster** and **spirits_taster**.

# Goal
The goals of this project are to:

1. train a machine learning model on wine and spirits reviews
2. use the model on a completely new dataset of beer reviews to identify tasters who have reviewed wines
3. build an understanding of the features that are considered important for this task
4. quantify how good the predictions are

# Implementation

## 0. Notebook details
* Datasets source: We will be using three different datasets that were collected from the [same website](https://www.winemag.com/) but acquired differently. The wine reviews dataset was downloaded from [Kaggle](https://www.kaggle.com/zynicide/wine-reviews) thanks to the user zackthoutt, who did all the hard work of scraping the data. The beer and spirits review datasets were downloaded using Zack's scraper, which can be found on his [github page](https://github.com/zackthoutt/wine-deep-learning).
* Metadata: Because the data have been collected the same way, their metadata are similar.
    - country: The country that the wine is from, String
    - description: A few sentences from a sommelier describing the wine's taste, smell, look, feel, String
    - designation: The vineyard within the winery where the grapes that made the wine are from, String
    - points: The number of points WineEnthusiast rated the wine on a scale of 1-100 (though they say they only post reviews for wines that score >=80), Numeric
    - price: The cost for a bottle of the wine, Numeric
    - province: The province or state that the wine is from, String
    - region_1: The wine growing area in a province or state (e.g. Napa), String
    - region_2: Sometimes there are more specific regions specified within a wine growing area (e.g. Rutherford inside the Napa Valley), but this value can sometimes be blank, String
    - variety: The type of grapes used to make the wine (e.g. Pinot Noir), String
    - winery: The winery that made the wine, String

## 1. Imports

In [1]:
import os
import nltk
import re
import pickle
import spacy
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC, LinearSVC
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.metrics import classification_report, f1_score, accuracy_score, confusion_matrix
from sklearn.pipeline import Pipeline
# learning_curve, cross_val_score and KFold live in sklearn.model_selection
# (the old sklearn.learning_curve and sklearn.cross_validation modules were removed in 0.20)
from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV, KFold, learning_curve
from sklearn.feature_selection import SelectKBest, f_regression, f_classif
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier

sns.set(color_codes=True)
nlp = spacy.load('en')  # spaCy 2.x shortcut; spaCy 3+ needs the full model name, e.g. 'en_core_web_sm'
%matplotlib inline



In [2]:
os.chdir('C://Users/vasileios.vyzas/Documents/workspace/Projects/Miscellaneous/wine_critic_recognizer/')
# os.chdir('/home/fykos/Documents/workspace/wine_critic_recognizer/')

In [3]:
spirits = pd.read_json('data/raw/spirits.json')
wine = pd.read_csv('data/raw/winemag-data-130k-v2.csv')

## 2. Preprocessing and cleaning
Given that the drinks we are working with are very different, features such as country, points, price, province etc. are not useful. Therefore, we are only interested in the description column which later will give us enough information to identify reviewers.

In [4]:
# drop all unnecessary columns from both datasets
wine = wine.drop(['Unnamed: 0', 'country', 'points', 'price', 'province', 'title', 'designation', 'region_1', 'region_2', 'taster_twitter_handle', 'variety', 'winery'], axis = 1)
spirits = spirits.drop(['country', 'points', 'price', 'province', 'title', 'designation', 'region_1', 'region_2', 'taster_twitter_handle', 'variety', 'winery'], axis = 1)

In [5]:
# the wine dataset has more than 100000 rows, and using all of them for the classification
# will create an imbalanced classification problem.
wine_less_rows = wine.copy()
wine_less_rows = wine_less_rows[:5000]
wine_less_rows['label'] = 'wine_taster'
spirits['label'] = 'spirits_taster'

In [6]:
# combine wines and spirits
all_drinks = pd.concat([wine_less_rows, spirits])
all_drinks.reset_index(inplace=True)
all_drinks.drop(['taster_name', 'index'], axis=1, inplace=True)

In [7]:
all_drinks.groupby('label').describe()

Summary of the `description` column per label:

| label | count | unique | top | freq |
|---|---|---|---|---|
| spirits_taster | 4422 | 4369 | The muted bouquet is slow in offering woody, l... | 2 |
| wine_taster | 5000 | 4985 | There's a touch of toasted almond at the start... | 2 |


In [8]:
# There seem to be some duplicate records in the dataset. Let's remove them
all_drinks.drop_duplicates(subset='description', keep='last', inplace=True)

In [9]:
# checking again the unique number of records
all_drinks.groupby('label').describe()

Summary of the `description` column per label, after removing duplicates:

| label | count | unique | top | freq |
|---|---|---|---|---|
| spirits_taster | 4369 | 4369 | The mild aroma hints at vanilla and stone frui... | 1 |
| wine_taster | 4985 | 4985 | The 18 months of aging in 90% American oak rea... | 1 |


## 3. Feature engineering
As raw text, the description column cannot be used directly to recognize the authors of wine and spirits reviews. However, we can use the individual descriptions to create features that represent the writing style and linguistic patterns the authors follow.
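
To make the idea of "writing style" features concrete, here is a minimal sketch that uses only the standard library. The function name `simple_style_features` and the sample sentence are hypothetical and are not part of the notebook; the notebook itself relies on spaCy for the more advanced counts.

```python
import re

def simple_style_features(text):
    """Return a few basic style features for one review (illustrative only)."""
    # Split into sentences on ., ! or ? followed by whitespace or end of text
    sentences = [s for s in re.split(r'[.!?]+(?:\s+|$)', text) if s.strip()]
    # Words are runs of letters, digits or apostrophes
    words = re.findall(r"[A-Za-z0-9']+", text)
    # Punctuation is anything that is not a word character, whitespace or apostrophe
    punctuation = re.findall(r"[^\w\s']", text)
    return {
        'word_count': len(words),
        'sentence_count': len(sentences),
        'punctuation_count': len(punctuation),
        'avg_word_length': sum(len(w) for w in words) / len(words) if words else 0.0,
    }

sample = "Ripe cherry and oak. Smooth, long finish!"
features = simple_style_features(sample)
print(features)
```

Each review becomes a small vector of numbers like this one, and it is these numbers, rather than the words themselves, that the classifier sees.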

In [10]:
def normalize(review):
    review_letters = re.sub('[^a-zA-Z]', ' ', str(review))
    review_letters = review_letters.lower()
    return (" ".join(review_letters.split()))

In [11]:
def remove_stopwords(review):
    stop_words = set(stopwords.words('english'))
    ls = [word for word in review.split() if word not in stop_words]
    return (" ".join(ls))

In [12]:
def stemming(review):
    stemmer = PorterStemmer()
    stemmed = [stemmer.stem(word) for word in review.split()]
    return (" ".join(stemmed))

In [13]:
processed_reviews = []
reviews = all_drinks['description']
for review in reviews:
    processed_reviews.append(remove_stopwords(normalize(review)))

In [14]:
tfidf_vectorizer = TfidfVectorizer(stop_words='english', min_df=5, max_df = 0.95, ngram_range=(1,2))
tfidf_matrix = tfidf_vectorizer.fit_transform(processed_reviews)

In [15]:
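# features holds a list of all the terms in the tfidf's vocabulary in the same order as the columns in the matrix
# note: scikit-learn >= 1.0 provides get_feature_names_out(), and get_feature_names() was removed in 1.2
features = tfidf_vectorizer.get_feature_names()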

In [16]:
weights = np.asarray(tfidf_matrix.mean(axis=0)).ravel().tolist()
weights_df = pd.DataFrame({'term':features, 'weights':weights})
weights_df = weights_df.sort_values(by='weights', ascending=False)

In [17]:
important_terms = weights_df['term'][:20].tolist()

In [18]:
def count_of_important_words(review):
    count = 0
    for word in normalize(review).split():
        if word in important_terms:
            count+=1
    return count

In [19]:
all_drinks['count_of_important_words'] = all_drinks['description'].map(count_of_important_words)
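
The counting step above can be sketched on its own with a hand-picked term list. The terms and the review below are hypothetical; in the notebook the terms come from the TF-IDF ranking. Because the vectorizer was built with `ngram_range=(1,2)`, the top terms may include two-word phrases, and a token-by-token lookup such as this one can never match those.

```python
import re

# Hypothetical "important terms"; in the notebook these come from the TF-IDF ranking
example_terms = {'cherry', 'oak', 'finish', 'black cherry'}

def count_term_hits(review, terms):
    """Count single-word tokens of the review that appear in `terms`."""
    tokens = re.sub('[^a-zA-Z]', ' ', review).lower().split()
    return sum(1 for token in tokens if token in terms)

review = "Black cherry and oak lead to a long finish."
hits = count_term_hits(review, example_terms)
print(hits)  # 3: 'cherry', 'oak' and 'finish'; the bigram 'black cherry' is never matched
```

Using a set instead of a list for the terms keeps each lookup constant-time, which may matter when the function is mapped over thousands of reviews.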

In [20]:
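# note: this redefines normalize() from In [10]; this version keeps digits and is the one used from here on
def normalize(text):
    letters = re.sub("[^a-zA-Z0-9]", " ", str(text))
    words = letters.lower().strip()
    return words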

In [21]:
# count number of words used in a review
# normalize a review to remove punctuation in order to count accurately
all_drinks['description_length'] = all_drinks['description'].apply(lambda text: len(str(normalize(text)).split()))

In [22]:
# total number of characters (letters, digits, punctuation and whitespace) in a description
all_drinks['count_of_characters'] = all_drinks['description'].apply(lambda text: len(str(text)))

In [23]:
all_drinks['average_length_of_words'] = all_drinks['count_of_characters'] / all_drinks['description_length']

In [24]:
def punctuation_counter(review):
    count = 0
    doc = nlp(review)
    
    for token in doc:
        if token.is_punct:
            count+=1
    return count   

In [25]:
# count of punctuation symbols used
all_drinks['number_of_punctuation'] = all_drinks['description'].map(punctuation_counter)

In [26]:
# function to find and count the number of nouns in a wine review
def noun_getter(review):
    count = 0
    doc = nlp(review)
    
    for token in doc:
        if token.pos_ == 'NOUN':
            count+=1
    return count        

In [27]:
all_drinks['number_of_nouns'] = all_drinks['description'].map(noun_getter)

In [28]:
# function to find and count the number of noun phrases used in a review
def noun_chunks_getter(review):
    count = 0
    doc = nlp(str(review))
    
    for token in doc.noun_chunks:
        if len(str(token.text).split()) > 1:
            count+=1
    return count

In [29]:
all_drinks['number_of_noun_phrases'] = all_drinks['description'].map(noun_chunks_getter)

In [30]:
# function to find and count the number of verbs in a wine review
def verb_getter(review):
    count = 0
    doc = nlp(review)
    
    for token in doc:
        if token.pos_ == 'VERB':
            count+=1
    return count        

In [31]:
all_drinks['number_of_verbs'] = all_drinks['description'].map(verb_getter)

In [32]:
# function to find and count the number of adjectives in a wine review
def adj_getter(review):
    count = 0
    doc = nlp(review)
    
    for token in doc:
        if token.pos_ == 'ADJ':
            count+=1
    return count        

In [33]:
all_drinks['number_of_adj'] = all_drinks['description'].map(adj_getter)

In [34]:
def count_sentences(review):
    doc = nlp(review)
    return (len([sentence for sentence in doc.sents]))

In [35]:
all_drinks['count_of_sentences'] = all_drinks['description'].map(count_sentences)

In [None]:
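def find_sentence_length_with_pos_tags(review):
    # maps sentence length (in spaCy tokens) -> (noun, verb, adjective) counts of that sentence
    # note: sentences with the same token count overwrite each other
    doc = nlp(review)
    pos_tags_dict = dict()
    for sentence in doc.sents:
        doc1 = nlp(str(sentence))
        pos_tags_dict[len(doc1)] = (noun_getter(str(sentence)), verb_getter(str(sentence)), adj_getter(str(sentence)))
    return pos_tags_dict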

In [None]:
max_length_list = []
min_length_list = []
nouns_list = []
verb_list = []
adj_list = []

for review in all_drinks['description']:
    d = find_sentence_length_with_pos_tags(review)
    max_length = ((max(k for k, v in d.items())))
    min_length_list.append((min(k for k, v in d.items())))
    noun_count, verb_count, adj_count = d[max_length]
    
    max_length_list.append(max_length)
    nouns_list.append(noun_count)
    verb_list.append(verb_count)
    adj_list.append(adj_count)

In [None]:
all_drinks['number_of_characters_in_largest_sentence'] = max_length_list
all_drinks['number_of_characters_in_smallest_sentence'] = min_length_list
all_drinks['number_of_nouns_in_largest_sentence'] = nouns_list
all_drinks['number_of_verbs_in_largest_sentence'] = verb_list
all_drinks['number_of_adjectives_in_largest_sentence'] = adj_list

In [None]:
stop_words = set(stopwords.words('english'))
all_drinks['count_of_stopwords'] = all_drinks['description'].apply(lambda x: len([w for w in str(x).lower().split() if w in stop_words]) )

In [None]:
all_drinks['count_title_case_words'] = all_drinks['description'].apply(lambda x: len([w for w in str(x).replace('I','i').replace('A','a').split() if w.istitle() == True]) )

In [None]:
# definition of function found here: https://medium.com/@dimitrisspathis/exploring-linguistic-patterns-in-best-selling-book-series-100290c94242
def automated_readability_index(characters, words, sentences):
    ari = 4.71 * (characters/words) + 0.5 * (words/sentences) - 21.43
    return ari

In [None]:
all_drinks['automated_readability_index'] = all_drinks.apply(lambda x: automated_readability_index(x['count_of_characters'], x['description_length'], x['count_of_sentences']), axis = 1)
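
As a quick sanity check of the readability formula, here is a worked example with made-up counts (100 characters, 20 words, 2 sentences). The numbers are hypothetical and only illustrate the arithmetic.

```python
def automated_readability_index(characters, words, sentences):
    # Same formula as in the cell above
    return 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43

# Hypothetical review: 100 characters, 20 words, 2 sentences
# 4.71 * (100 / 20) + 0.5 * (20 / 2) - 21.43 = 23.55 + 5 - 21.43
ari_example = automated_readability_index(100, 20, 2)
print(round(ari_example, 2))
```

Note that the notebook passes `count_of_characters`, which includes spaces and punctuation, whereas the index is usually defined on letters and digits only, so the absolute values here are probably best read as a relative style signal rather than as a grade level.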

In [None]:
# function to find and count the number of adverbs in a wine review
def adv_getter(review):
    count = 0
    doc = nlp(review)
    
    for token in doc:
        if token.pos_ == 'ADV':
            count+=1
    return count        

In [None]:
all_drinks['count_of_adverbs'] = all_drinks['description'].map(adv_getter)

In [None]:
def lexical_density(nouns, verbs, adjectives, adverbs, words):
    ld = ((nouns + verbs + adjectives + adverbs) / words) * 100
    return ld

In [None]:
all_drinks['lexical_density'] = all_drinks.apply(lambda x: lexical_density(x['number_of_nouns'], x['number_of_verbs'], x['number_of_adj'], x['count_of_adverbs'], x['description_length']), axis = 1)
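
Lexical density is the share of content words (nouns, verbs, adjectives, adverbs) among all words, expressed as a percentage. A small worked example with hypothetical counts:

```python
def lexical_density(nouns, verbs, adjectives, adverbs, words):
    # Share of content words as a percentage of all words
    return ((nouns + verbs + adjectives + adverbs) / words) * 100

# Hypothetical review of 20 words with 6 nouns, 2 verbs, 3 adjectives and 1 adverb
ld_example = lexical_density(6, 2, 3, 1, 20)
print(ld_example)
```

Here 12 of the 20 words are content words, so the density is about 60%. Since the part-of-speech counts come from spaCy tokens and the word count from whitespace splitting, values slightly above 100 are possible in principle.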

In [None]:
# feature engineering operations take too long to complete
# saving the dataframe to save time in future reads
all_drinks.to_csv('data/modified/wine_and_spirits_reviews_with_generated_features.csv', encoding='utf-8', index = False)

## 4. Data exploration & visualization

In [None]:
all_drinks = pd.read_csv('data/modified/wine_and_spirits_reviews_with_generated_features.csv')

In [None]:
all_drinks.hist(column='description_length', by='label', bins=50, figsize=(10, 6));

In [None]:
all_drinks.groupby('label')['description_length'].describe()

In [None]:
all_drinks.hist(column='number_of_punctuation', by='label', bins=20, figsize=(10, 6));

In [None]:
all_drinks.hist(column='number_of_nouns', by='label', bins=30, figsize=(10, 6));

## 5. Model building & evaluation
Below we split the dataset into a training set (80% of the observations) and a test set (20%). Technically, this test set could be called a validation set, because the real test dataset, as mentioned before, is the beer reviews dataset. That dataset is used last and has no involvement in the training phase; it is effectively a new, unseen dataset.

The main goal in this part is to test various learning algorithms and build a model that generalizes well. For the model building phase we use GridSearchCV to find the best parameters for each estimator. GridSearchCV scores every parameter combination with cross validation, which gives us a good idea of how well the model generalizes.

In [None]:
drinks = pd.read_csv('data/modified/wine_and_spirits_reviews_with_generated_features.csv')

In [None]:
# will drop the description feature because I don't need it anymore
drinks = drinks.drop(['description'], axis = 1)

In [None]:
drinks['label'] = drinks['label'].map({'wine_taster': 1, 'spirits_taster': 0})

In [None]:
drinks.head()

In [None]:
X, y = drinks.loc[:, drinks.columns != 'label'], drinks.loc[:,'label']

In [None]:
# splitting the dataset into 80% for the training and 20% for the test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### 5.1 Cross validation experiments
For this part, three algorithms were tested: a Logistic Regression classifier, a Random Forest classifier and a Support Vector Machine classifier. The parameter ranges used in GridSearchCV were found by trial and error and a lot of research online.

The Logistic Regression parameter ranges are based on this [kaggle notebook](https://www.kaggle.com/joparga3/2-tuning-parameters-for-logistic-regression/code), and the Random Forest ranges on another [kaggle notebook](https://www.kaggle.com/hadend/tuning-random-forest-parameters).

*Note to self: SelectKBest selects the top k features that have maximum relevance with the target variable. It takes two parameters as input arguments, "k" (obviously) and the score function to rate the relevance of every feature with the target variable. For example, for a regression problem, you can supply "feature_selection.f_regression" and for a classification problem, you can supply "feature_selection.f_classif".*

*You can use SelectKBest and GridSearchCV together using a Pipeline with an estimator as the second step. The pipeline applies the first step by choosing the best k features and transforms the input data to have only these features. After transformation, this is then fit with your estimator. The GridSearchCV helps you to tune the "number of features to be selected" and the hyperparameter of the estimator, by selecting the parameters that give the best score on validation data.*
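
The note above can be tried out on a small synthetic dataset. This is only a sketch: the data comes from `make_classification` rather than from the reviews, and the names `demo_pipeline`, `demo_grid` and `demo_search` are made up for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the review features (not the notebook's data)
X_demo, y_demo = make_classification(n_samples=200, n_features=8,
                                     n_informative=3, random_state=0)

demo_pipeline = Pipeline([
    ('scale', StandardScaler()),
    ('kbest', SelectKBest(f_classif)),
    ('lr', LogisticRegression()),
])

# "<step name>__<parameter>" addresses a parameter of one pipeline step
demo_grid = {'kbest__k': [2, 4, 8], 'lr__C': [0.1, 1, 10]}

# Every (k, C) combination is scored with 5-fold cross validation
demo_search = GridSearchCV(demo_pipeline, demo_grid, cv=5)
demo_search.fit(X_demo, y_demo)

print(demo_search.best_params_)
print(round(demo_search.best_score_, 3))
```

Because the scaler and the feature selector sit inside the pipeline, they are refit on the training folds of every split, so the cross validation score is not contaminated by information from the held-out fold.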

In [None]:
kbest = SelectKBest(f_classif)
stdScaler = StandardScaler()
pipeline = Pipeline([('stdScaler', stdScaler), ('kbest', kbest), ('lr', LogisticRegression())])
param_grid = {'kbest__k': list(range(1, 21)), 'lr__C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_train, y_train)

In [None]:
print("Best parameters set found on development set:")
print()
print(grid_search.best_params_)
print()
print("Grid scores on development set:")
print()
means = grid_search.cv_results_['mean_test_score']
stds = grid_search.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, grid_search.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r"% (mean, std * 2, params))
print()

print("Detailed classification report:")
print()
print("The model is trained on the full development set.")
print("The scores are computed on the full evaluation set.")
print()
y_true, y_pred = y_test, grid_search.predict(X_test)
print(classification_report(y_true, y_pred))
print()

In [None]:
kbest = SelectKBest(f_classif)
stdScaler = StandardScaler()
pipeline = Pipeline([('stdScaler', stdScaler), ('kbest', kbest), ('clf', RandomForestClassifier())])
param_grid = {
    'kbest__k': list(range(1, 21)),
    'clf__n_estimators': [5, 40, 42, 100],
    'clf__max_depth': [5, 6],
    'clf__min_samples_split': [5, 10],
    'clf__min_samples_leaf': [3, 5],
    'clf__max_leaf_nodes': [14, 15],
}
grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_train, y_train)

In [None]:
print("Best parameters set found on development set:")
print()
print(grid_search.best_params_)
print()
print("Grid scores on development set:")
print()
means = grid_search.cv_results_['mean_test_score']
stds = grid_search.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, grid_search.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r"% (mean, std * 2, params))
print()

print("Detailed classification report:")
print()
print("The model is trained on the full development set.")
print("The scores are computed on the full evaluation set.")
print()
y_true, y_pred = y_test, grid_search.predict(X_test)
print(classification_report(y_true, y_pred))
print()

In [None]:
kbest = SelectKBest(f_classif)
stdScaler = StandardScaler()
pipeline = Pipeline([('kbest', kbest), ('stdScaler', stdScaler), ('clf', svm.SVC(kernel='linear'))])
# note: SVC ignores gamma when kernel='linear', so the gamma values below only repeat the same fit
param_grid = {'kbest__k': list(range(1, 21)), 'clf__C': [0.001, 0.01, 0.1, 1, 10], 'clf__gamma': [0.001, 0.01, 0.1, 1]}
grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_train, y_train)

In [None]:
print("Best parameters set found on development set:")
print()
print(grid_search.best_params_)
print()
print("Grid scores on development set:")
print()
means = grid_search.cv_results_['mean_test_score']
stds = grid_search.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, grid_search.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r"% (mean, std * 2, params))
print()

print("Detailed classification report:")
print()
print("The model is trained on the full development set.")
print("The scores are computed on the full evaluation set.")
print()
y_true, y_pred = y_test, grid_search.predict(X_test)
print(classification_report(y_true, y_pred))
print()

### 5.2 Testing the best parameters found for the SVM model on the test dataset

In [None]:
# there is no particular reason in the choice of algorithms, just curiosity of performance
kbest = SelectKBest(f_classif, k=20)
stdScaler = StandardScaler()
pipeline = Pipeline([('kbest', kbest), ('stdScaler', stdScaler), ('clf', svm.SVC(kernel='linear', C = 10, gamma = 0.001))])
pipeline.fit(X_train, y_train)

In [None]:
svm_prediction = pipeline.predict(X_test)

In [None]:
print('SVM accuracy', accuracy_score(y_test, svm_prediction))
print('SVM confusion matrix\n', confusion_matrix(y_test, svm_prediction))
print('(row=expected, col=predicted)')

In [None]:
print(classification_report(y_test, svm_prediction))

Based on the output of the confusion matrix, the SVM model is working decently well. It correctly classified 834 of the 1004 wine reviews and 574 of the 867 spirits reviews in the test set.
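
The counts quoted above are enough to recompute the headline metrics by hand. The sketch below uses only those four numbers; the off-diagonal cells of the confusion matrix follow by subtraction.

```python
# Counts taken from the text above; off-diagonal cells follow by subtraction
wine_total, wine_correct = 1004, 834
spirits_total, spirits_correct = 867, 574

wine_missed = wine_total - wine_correct           # wine reviews predicted as spirits
spirits_missed = spirits_total - spirits_correct  # spirits reviews predicted as wine

accuracy = (wine_correct + spirits_correct) / (wine_total + spirits_total)
wine_recall = wine_correct / wine_total
spirits_recall = spirits_correct / spirits_total
# Precision for wine: of everything predicted as wine, how much really was wine
wine_precision = wine_correct / (wine_correct + spirits_missed)

print(f"accuracy       = {accuracy:.3f}")
print(f"wine recall    = {wine_recall:.3f}")
print(f"spirits recall = {spirits_recall:.3f}")
print(f"wine precision = {wine_precision:.3f}")
```

The overall accuracy comes out at roughly 75%, with wine reviews recognized noticeably more reliably (about 83%) than spirits reviews (about 66%).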

## 6. Moment of Truth: Testing the model with beer reviews

In [None]:
beer = pd.read_json('data/raw/beers.json')
beer = beer.copy()

In [None]:
processed_reviews = []
reviews = beer['description']
for review in reviews:
    processed_reviews.append(remove_stopwords(normalize(review)))

In [None]:
tfidf_vectorizer = TfidfVectorizer(stop_words='english', min_df=5, max_df = 0.95, ngram_range=(1,2))
tfidf_matrix = tfidf_vectorizer.fit_transform(processed_reviews)

In [None]:
# features holds a list of all the words in the tfidf's vocabulary in the same order as the column in the matrix
features = tfidf_vectorizer.get_feature_names()

In [None]:
weights = np.asarray(tfidf_matrix.mean(axis=0)).ravel().tolist()
weights_df = pd.DataFrame({'term':features, 'weights':weights})
weights_df = weights_df.sort_values(by='weights', ascending=False)

In [None]:
important_terms = weights_df['term'][:20].tolist()

In [None]:
def count_of_important_words(review):
    count = 0
    for word in normalize(review).split():
        if word in important_terms:
            count+=1
    return count

In [None]:
beer['count_of_important_words'] = beer['description'].map(count_of_important_words)

In [None]:
beer['count_of_adverbs'] = beer['description'].map(adv_getter)

In [None]:
# normalize before counting words, as was done for the training data
beer['description_length'] = beer['description'].apply(lambda text: len(str(normalize(text)).split()))

beer['count_of_characters'] = beer['description'].apply(lambda text: len(text))

beer['average_length_of_words'] = beer['count_of_characters'] / beer['description_length']

beer['number_of_punctuation'] = beer['description'].map(punctuation_counter)

beer['number_of_nouns'] = beer['description'].map(noun_getter)

beer['number_of_noun_phrases'] = beer['description'].map(noun_chunks_getter)

beer['number_of_verbs'] = beer['description'].map(verb_getter)

beer['number_of_adj'] = beer['description'].map(adj_getter)

beer['count_of_sentences'] = beer['description'].map(count_sentences)

max_length_list = []
min_length_list = []
nouns_list = []
verb_list = []
adj_list = []

for review in beer['description']:
    d = find_sentence_length_with_pos_tags(review)
    max_length = ((max(k for k, v in d.items())))
    min_length_list.append((min(k for k, v in d.items())))
    noun_count, verb_count, adj_count = d[max_length]
    
    max_length_list.append(max_length)
    nouns_list.append(noun_count)
    verb_list.append(verb_count)
    adj_list.append(adj_count)

beer['number_of_characters_in_largest_sentence'] = max_length_list
beer['number_of_characters_in_smallest_sentence'] = min_length_list
beer['number_of_nouns_in_largest_sentence'] = nouns_list
beer['number_of_verbs_in_largest_sentence'] = verb_list
beer['number_of_adjectives_in_largest_sentence'] = adj_list

stop_words = set(stopwords.words('english'))
beer['count_of_stopwords'] = beer['description'].apply(lambda x: len([w for w in str(x).lower().split() if w in stop_words]) )

beer['count_of_upper_case_words'] = beer['description'].apply(lambda x: len([w for w in str(x).replace('I','i').replace('A','a').split() if w.isupper() == True]) )

beer['count_title_case_words'] = beer['description'].apply(lambda x: len([w for w in str(x).replace('I','i').replace('A','a').split() if w.istitle() == True]) )

In [None]:
beer['lexical_density'] = beer.apply(lambda x: lexical_density(x['number_of_nouns'], x['number_of_verbs'], x['number_of_adj'], x['count_of_adverbs'], x['description_length']), axis = 1)

In [None]:
beer['automated_readability_index'] = beer.apply(lambda x: automated_readability_index(x['count_of_characters'], x['description_length'], x['count_of_sentences']), axis = 1)

In [None]:
beer.drop(['description'], axis = 1, inplace = True)

In [None]:
beer.drop(['country', 'designation', 'points', 'price', 'province', 'region_1', 'region_2', 'taster_twitter_handle', 'title', 'variety', 'winery'], axis = 1, inplace=True)

In [None]:
beer.drop('taster_name', axis =1, inplace=True)

In [None]:
# keep only the columns the model was trained on, in the same order
predictions = pipeline.predict(beer[X_train.columns])

In [None]:
y = np.bincount(predictions)
ii = np.nonzero(y)[0]

In [None]:
list(zip(ii,y[ii]))
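
The two cells above count how many reviews fall into each predicted class. An equivalent and perhaps more readable way to get the same tally is sketched below with the standard library; the prediction values are hypothetical placeholders, not the model's actual output.

```python
from collections import Counter

# Hypothetical predictions: 1 = wine_taster, 0 = spirits_taster
example_predictions = [1, 1, 1, 1, 1]
label_names = {1: 'wine_taster', 0: 'spirits_taster'}

# Tally predictions by class name instead of by integer code
counts = Counter(label_names[p] for p in example_predictions)
print(dict(counts))
```

With the notebook's NumPy array, `predictions.tolist()` could be passed in place of the example list.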

The classifier assigned every review in the beer dataset to class 1, the wine_taster class. As mentioned in the introduction, the beer reviews were written by two reviewers who also review wines. Surprisingly, the SVM model identified all of them correctly.

# Lessons learned and goals achieved
Following are some observations and things I learned while doing this exercise:

1. We managed to successfully train an SVM model to recognize wine reviewers.

    - Feature engineering played a key role in this task. With more research we can probably generate more and better features.
    - Cross validation and parameter search with GridSearchCV were very useful but extremely time consuming. Logistic Regression had the fastest training time, taking about 3-4 minutes. Random Forest took about 12-15 minutes and SVM took about 35 minutes.
2. We successfully used the model on a completely new dataset of beer reviews. The model classified every review in that dataset as wine_taster, meaning that all the reviews were attributed to wine reviewers.
3. The SVM model achieved an accuracy of 75%. Initially, with about 10 features and a plain train-test split, we had achieved 70%. By introducing more important features and performing parameter tuning in combination with cross validation we managed to increase the accuracy to 75%.