# Can we find some distinguishing features in wine descriptions across the continents?

## Have a look at my topic modelling for wine descriptions notebook first for more background story if interested...


So, start with the usual stuff for basic NLP preprocessing:

In [None]:
import pandas as pd
from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer

Use the dataset with more features and, supposedly, fewer duplicates:

In [None]:
df = pd.read_csv('../input/winemag-data-130k-v2.csv') 
df.shape

In [None]:
df.head(2)

Drop the duplicates that were not supposed to be there (or so I understood from the data description..):

In [None]:
#could maybe use more sophisticated comparison across several columns but...
df = df.drop_duplicates('description') 
df.shape
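As a rough sketch of the "more sophisticated comparison" mentioned in the comment above, `drop_duplicates` also accepts a list of columns via `subset`. The toy frame and the choice of `title` and `winery` as extra columns are assumptions for illustration, not something taken from the real data:

```python
import pandas as pd

# Toy frame standing in for the wine data; the extra columns are assumed names.
toy = pd.DataFrame({
    "description": ["Ripe cherry.", "Ripe cherry.", "Ripe cherry.", "Crisp apple."],
    "title": ["Wine A 2015", "Wine A 2015", "Wine B 2016", "Wine C 2017"],
    "winery": ["Alpha", "Alpha", "Beta", "Gamma"],
})

# Same text alone: every later row with a repeated description is dropped.
by_description = toy.drop_duplicates("description")

# Same text AND same title/winery: only the exact repeat is dropped.
by_several = toy.drop_duplicates(subset=["description", "title", "winery"])

print(by_description.shape, by_several.shape)
```

Which variant is right depends on whether an identical description for two different wines counts as a duplicate; for this notebook the stricter description-only rule is what gets used.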

Stopwords, similar to my wine description topic modelling notebook..

In [None]:
from string import punctuation

stop_words = set(stopwords.words('english')) 
stop_words = stop_words.union(set(punctuation)) 
stop_words.update(["\'s", "n't"])

In [None]:
descriptions = df['description']

This time I need to keep the lemmatized tokens both as a list and as a single string per document, as input to different algorithms later. First the lemmatization; the second version gets built further down.

In [None]:
lemmatizer = WordNetLemmatizer()
text_tokens = [[lemmatizer.lemmatize(word) for word in word_tokenize(description.lower()) if word not in stop_words] for description in descriptions]


To add the bi-grams and tri-grams to the data:

In [None]:
import gensim
#https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/
# Build the bigram and trigram models
bigram = gensim.models.Phrases(text_tokens, min_count=5, threshold=100)
trigram = gensim.models.Phrases(bigram[text_tokens], threshold=100)
# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod = gensim.models.phrases.Phraser(bigram) 
trigram_mod = gensim.models.phrases.Phraser(trigram)


In [None]:
print(trigram_mod[bigram_mod[text_tokens[4]]])

In [None]:
# use .iloc: after drop_duplicates the index labels may no longer match list positions
print(descriptions.iloc[4])

In [None]:
text_tokens = [trigram_mod[bigram_mod[text]] for text in text_tokens]

In [None]:
print(text_tokens[4])

In [None]:
texts = [" ".join(tokens) for tokens in text_tokens] 
df["description2"] = texts 
df["description2_tokens"] = text_tokens
df.head(2)

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, f_classif 
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

In [None]:
df["country"].value_counts()

There is a pretty big class imbalance here. If I want to classify documents by country, the US has far more reviews than most others. But counting the European countries together, Europe comes quite close to the US in size. Since my goal was to try to build a binary classifier, that seems like a reasonable start.

So I decided to group all European countries to a single continent. Found a list of countries and their metadata at: [Github](https://gist.github.com/pamelafox/986163)

This is what the list looks like after I parsed all the countries with continent="Europe":

In [None]:
europe = ['Andorra', 'Albania', 'Austria', 'Belgium', 'Bulgaria', 'Belarus', 
          'Czech Republic', 'Germany', 'Denmark', 'Estonia', 'Finland', 
          'France', 'Greece', 'Hungary', 'Republic of Ireland', 'Iceland', 
          'Italy', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Latvia', 
          'Macedonia', 'Malta', 'Kingdom of the Netherlands', 'Norway', 
          'Poland', 'Portugal', 'Romania', 'Russia', 'Sweden', 'Slovenia', 
          'Slovakia', 'San Marino', 'Ukraine', 'Vatican City', 
          'Bosnia and Herzegovina', 'Croatia', 'Moldova', 'Monaco', 
          'Montenegro', 'Serbia', 'Spain', 'Switzerland', 'United Kingdom']

Just for kicks, here is Asia, built the same way from the same GitHub country list:

In [None]:
asia = ['Afghanistan', 'Armenia', 'Azerbaijan', 'Bangladesh', 'Bahrain', 
        'Brunei Darussalam', 'Bhutan', "People's Republic of China", 
        'Cyprus', 'Georgia', 'Indonesia', 'Israel', 'India', 'Iraq', 
        'Iran', 'Jordan', 'Japan', 'Kyrgyzstan', 'North Korea', 
        'South Korea', 'Kuwait', 'Lebanon', 'Myanmar', 'Mongolia', 
        'Maldives', 'Malaysia', 'Nepal', 'Oman', 'Philippines', 
        'Pakistan', 'Qatar', 'Saudi Arabia', 'Singapore', 'Syria', 
        'Thailand', 'Tajikistan', 'Turkmenistan', 'Turkey', 
        'Uzbekistan', 'Vietnam', 'Yemen', 'Cambodia', 'East Timor', 
        'Kazakhstan', 'Laos', 'Sri Lanka', 'United Arab Emirates']

Now that I have a list of European countries, I can create a new column in my dataframe that labels wines from European countries as "Europe" and wines from the US as "US". While I am at it, do Asia as well... First the function to do the labeling:

In [None]:
#https://stackoverflow.com/questions/26886653/pandas-create-new-column-based-on-values-from-other-columns
def country_to_continent(row): 
    if row["country"] == "US":
        return "US"
    if row["country"] in europe:
        return "Europe"
    if row["country"] in asia:
        return "Asia" 
    return "Other"

# and to apply it, row by row
df["continent"] = df.apply(country_to_continent, axis=1)
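The row-wise `apply` above works fine, but it calls a Python function once per row. A possible alternative, sketched here on a toy frame with shortened, hypothetical country lists, is to build one country-to-continent dictionary and let `Series.map` do the lookup:

```python
import pandas as pd

# Shortened, hypothetical lists; the notebook's full `europe` and `asia`
# lists would be used in their place.
europe_small = ["France", "Italy", "Spain"]
asia_small = ["India", "Israel"]

# One dictionary lookup per country instead of one function call per row.
continent_of = {"US": "US"}
continent_of.update({country: "Europe" for country in europe_small})
continent_of.update({country: "Asia" for country in asia_small})

toy = pd.DataFrame({"country": ["US", "Italy", "Chile", "India", None]})
# Countries missing from the dictionary (and missing values) become "Other".
toy["continent"] = toy["country"].map(continent_of).fillna("Other")
print(toy)
```

Both approaches should give the same labels; the dictionary version just tends to be quicker on a frame with over a hundred thousand rows.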

In [None]:
df["continent"].value_counts()

So now I have a much more equal spread of US vs EU countries and datapoints. Should make for better training data. Now to filter myself a new dataframe with only these datapoints.

In [None]:
#https://stackoverflow.com/questions/11869910/pandas-filter-rows-of-dataframe-with-operator-chaining
us_eu = df[(df.continent == "US") | (df.continent == "Europe")] 
len(us_eu)

As visible above, I still have a total of about 106k entries. That should make a decent training/test set. Quick look:

In [None]:
us_eu.head(4)

Now to turn the "continent" values into integer labels for the algorithm:

In [None]:
from sklearn.preprocessing import LabelEncoder
# encode class values as integers
encoder = LabelEncoder()
encoded_continent = encoder.fit_transform(us_eu.continent)


And to check what we get: label 0 = EU and 1 = US (LabelEncoder assigns the integers in sorted label order, and "Europe" sorts before "US"). There are still the correct number of rows, 55724 for EU and 50448 for US.

In [None]:
import numpy as np

unique, counts = np.unique(encoded_continent, return_counts=True) 
print(unique)
print(counts)
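Rather than inferring the mapping from the counts, the encoder can be asked directly. A minimal sketch on toy labels (the `toy_` names are made up for this example):

```python
from sklearn.preprocessing import LabelEncoder

toy_labels = ["US", "Europe", "US", "Europe", "Europe"]
toy_encoder = LabelEncoder()
toy_encoded = toy_encoder.fit_transform(toy_labels)

# classes_ lists the original labels in the order of their integer codes.
print(list(toy_encoder.classes_))
print(list(toy_encoded))

# inverse_transform maps integer predictions back to readable labels.
print(list(toy_encoder.inverse_transform([0, 1])))
```

The same two calls on the notebook's `encoder` would confirm which integer stands for which continent.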

In [None]:
descriptions2 = us_eu['description2']

I use the "descriptions2" column as the features for the learning algorithms. The encoded_continent values are the target labels to predict.

In [None]:
from sklearn.model_selection import train_test_split

features_train, features_test, labels_train, labels_test = train_test_split(descriptions2, encoded_continent, test_size=0.2, random_state=2)

In [None]:
print(features_train)

Now to turn the preprocessed descriptions into TFIDF vectors to use as the actual features.


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5) 
features_train_transformed = vectorizer.fit_transform(features_train) 
features_test_transformed = vectorizer.transform(features_test)


The feature names from the vectorizer are the actual words (tokens) that the TFIDF vectorizer identified and processed. So the list below maps an index to a position in the feature vectors created above. This will become clearer further down this notebook. In any case, the printout below shows how the data could still use a lot of cleaning. But that is not always necessary, so let's see first.

In [None]:
vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2

So the transformation below illustrates the token mapping from above feature names. For example, the first number in the tuple (0, 23683) refers to the first document (index 0 in the document list). The second number of (0, 23683) refers to the word at index 23683 in the feature names list above. The number following is the TFIDF score of the word for that document.

In [None]:
print(features_test_transformed[0])

In [None]:
feature_names = np.array(vectorizer.get_feature_names_out())
print(len(feature_names))

In [None]:
def top_tfidf_feats(row, features, top_n=25):
    topn_ids = np.argsort(row)[::-1][:top_n]
    top_feats = [(features[i], row[i]) for i in topn_ids]
    df = pd.DataFrame(top_feats)
    df.columns = ['feature', 'tfidf']
    return df

In [None]:
#this prints it for the first document in the set
arr = features_test_transformed[0].toarray() 
top_tfidf_feats(arr[0], feature_names)

Now, let's try to train some classifiers. MultinomialNB seems to be somewhat popular for this type of task (yes, I just Googled it...):

In [None]:
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB() 
clf.fit(features_train_transformed, labels_train)


In [None]:
from sklearn.metrics import accuracy_score

y_pred = clf.predict(features_test_transformed) 
y_true = labels_test
accuracy_score(y_true, y_pred)


In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

#print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred, target_names=["EU", "US"]))
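The confusion matrix is commented out above, so here is a quick sketch of how to read one, using a handful of hypothetical labels with the same 0 = EU, 1 = US encoding:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 0 = EU, 1 = US.
toy_true = [0, 0, 0, 1, 1, 1, 1]
toy_pred = [0, 0, 1, 1, 1, 1, 0]

# Rows are the true classes, columns are the predicted classes.
toy_cm = confusion_matrix(toy_true, toy_pred)
print(toy_cm)

# For two classes, ravel() unpacks the four cells row by row.
eu_as_eu, eu_as_us, us_as_eu, us_as_us = toy_cm.ravel()
print("EU reviews mistaken for US:", eu_as_us)
print("US reviews mistaken for EU:", us_as_eu)
```

The off-diagonal cells show which direction the mistakes go, something a single accuracy number hides.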



Can we see what the algorithm finds to be the most predictive features (the ones most contributing to predictions)?


[StackOverflow](https://stackoverflow.com/questions/11116697/how-to-get-most-informative-features-for-scikit-learn-classifiers) to the rescue:

In [None]:
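def show_most_informative_features(vectorizer, clf, n=20):
    # get_feature_names() was removed in scikit-learn 1.2
    feature_names = vectorizer.get_feature_names_out()
    # MultinomialNB lost coef_ in scikit-learn 1.1; for two classes the old
    # coef_[0] was feature_log_prob_[1], so fall back to that when needed
    coefs = clf.coef_[0] if hasattr(clf, "coef_") else clf.feature_log_prob_[1]
    coefs_with_fns = sorted(zip(coefs, feature_names))
    top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])
    for (coef_1, fn_1), (coef_2, fn_2) in top:
        print("\t%.4f\t%-15s\t\t%.4f\t%-15s" % (coef_1, fn_1, coef_2, fn_2))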


In [None]:
print(len(y_pred))
print(len(us_eu))

In [None]:
features_train.head(2)

Oddly, the describe() function still claims there are duplicates. Maybe the original wording is just slightly different, and lemmatization plus stopword removal makes the texts identical? It seems like a complex string to just randomly repeat, though.

In [None]:
features_train.describe()

In [None]:
show_most_informative_features(vectorizer, clf)

In [None]:
len(features_test)

So let's take a look at the data. What do the predictions look like, and what do the associated descriptions look like in English and in preprocessed form?

In [None]:
last_df = pd.DataFrame()

#last_df['feat_descs'] = Series(features_test, index=df1.index)

count = 0
countries = [] 
countries_pred = [] 
descriptions = [] 
descriptions2 = []

# Series.iteritems() and DataFrame.get_value() were removed from pandas; use items() and .at
for idx, value in features_test.items():
    countries.append(us_eu.at[idx, "country"])
    countries_pred.append(y_pred[count])
    descriptions.append(us_eu.at[idx, "description"])
    descriptions2.append(value)
    count += 1
last_df["country"] = countries 
last_df["country_pred"] = countries_pred 
last_df["desc"] = descriptions 
last_df["desc2"] = descriptions2


In [None]:
last_df.head(10)

The above table shows how the US reviews are consistently classified as US (=1) and EU countries as EU (=0). I cannot directly see anything specific about the text that makes the classifier do so. But let's see. I try a few different classifiers as well, just to see how well they generally do:

In [None]:
from sklearn.ensemble import RandomForestClassifier
estimators = [10, 20, 30, 40, 50] 
min_splits = [5, 10, 20, 30, 40, 50] 
min_leafs = [1, 2, 5, 10, 20, 30]

best_acc = 0
best_rf = None
for estimator in estimators:
    for min_split in min_splits: 
        for min_leaf in min_leafs:
            print("estimators=", estimator, "min_split=", min_split, " min_leaf=", min_leaf)
            clf = RandomForestClassifier(n_estimators=estimator, min_samples_leaf=min_leaf, min_samples_split=min_split)
            clf.fit(features_train_transformed, labels_train) 
            pred = clf.predict(features_test_transformed) 
            accuracy = accuracy_score(labels_test, pred) 
            print(accuracy)
            if accuracy > best_acc:
                best_acc = accuracy
                best_rf = clf
                print("found better:", best_acc, ", ", best_rf)


Yes, grid search with the sklearn methods would serve nicely. But I like to get some control on what I measure and print from that. Anyway..
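For comparison, a rough sketch of what the sklearn route could look like. It runs on synthetic data from `make_classification` instead of the wine features, and the smaller grid is an assumption to keep it quick. Note that `GridSearchCV` scores each setting by cross-validation on the training data, not on the held-out test set as the loop above does:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the TFIDF features and continent labels.
toy_X, toy_y = make_classification(n_samples=200, n_features=20, random_state=0)

param_grid = {
    "n_estimators": [10, 20],
    "min_samples_split": [5, 10],
    "min_samples_leaf": [1, 5],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="accuracy",
    cv=3,
)
search.fit(toy_X, toy_y)

# cv_results_ keeps every combination, so per-setting scores can still be printed.
for params, score in zip(search.cv_results_["params"],
                         search.cv_results_["mean_test_score"]):
    print(params, round(score, 3))
print("best:", search.best_params_, round(search.best_score_, 3))
```

So the control over what gets printed is not really lost, it just lives in `cv_results_`.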

In [None]:
print("best:")
print(best_acc)
print(best_rf)

In [None]:
from sklearn.metrics import accuracy_score

y_pred = best_rf.predict(features_test_transformed) 
y_true = labels_test
accuracy_score(y_true, y_pred)


In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

#print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred, target_names=["EU", "US"]))

The results seem quite similar for Random Forest. KNN next:

In [None]:
from sklearn.neighbors import KNeighborsClassifier

neighbour_items = [2, 5, 10, 15, 20, 25, 30]

for n in neighbour_items:
    print("running with n=", n)
    clf = KNeighborsClassifier(n_neighbors=n, weights="distance") 
    clf.fit(features_train_transformed, labels_train)
    
    pred = clf.predict(features_test_transformed) 
    
    accuracy = accuracy_score(labels_test, pred) 
    
    print(accuracy)

Not bad. Finally, the grand SVM of traditional ML algos. Optimize the C parameter first. Note that the loop fits on only the first tenth of the training set:

In [None]:
from time import time
from sklearn import svm
from sklearn.metrics import precision_score 
from sklearn.metrics import recall_score

#tried with smaller values, not good
c_list = [500, 1000, 10000, 20000]

best_c = None 
best_accuracy = 0 

for C in c_list:
    print("training with C="+str(C)) 
    clf = svm.SVC(kernel='rbf', C=C)

    t0 = time()

    end = int(features_train_transformed.shape[0]/10)
    clf.fit(features_train_transformed[:end], labels_train[:end]) 
    
    print("training time:", str(round(time()-t0, 3)), "s")

    t0 = time()

    pred = clf.predict(features_test_transformed) 
    print("prediction time:", str(round(time()-t0, 3)), "s")

    accuracy = clf.score(features_test_transformed, labels_test) 
    print("accuracy:"+str(accuracy)) 
    print("precision:"+str(precision_score(labels_test, pred))) 
    print("recall:"+str(recall_score(labels_test, pred)))

    if accuracy > best_accuracy: 
        best_accuracy = accuracy 
        best_c = C

print("best:"+str(best_accuracy)+" with "+str(best_c))


Then the gamma parameter, using the best C found above:

In [None]:
from time import time
from sklearn import svm
from sklearn.metrics import precision_score 
from sklearn.metrics import recall_score

#tried with smaller values, not good
gamma_list = [0.001, 0.01, 0.1, 1]

best_gamma = None 
best_accuracy = 0

for gamma in gamma_list:
    print("training with gamma="+str(gamma))
    clf = svm.SVC(kernel='rbf', C=best_c, gamma=gamma)

    t0 = time()

    end = int(features_train_transformed.shape[0]/10)
    clf.fit(features_train_transformed[:end], labels_train[:end]) 
    
    print("training time:", str(round(time()-t0, 3)), "s")

    t0 = time()

    pred = clf.predict(features_test_transformed) 
    
    print("prediction time:", str(round(time()-t0, 3)), "s")

    accuracy = clf.score(features_test_transformed, labels_test) 
    print("accuracy:"+str(accuracy)) 
    print("precision:"+str(precision_score(labels_test, pred))) 
    print("recall:"+str(recall_score(labels_test, pred)))

    if accuracy > best_accuracy: 
        best_accuracy = accuracy 
        best_gamma = gamma

print("best:"+str(best_accuracy)+" with "+str(best_gamma))



And now for something (completely) different. Let's explore EU vs US texts as a whole to see what their defining characteristics are. Maybe that tells us something about why the classifiers seem to be doing well?

In [None]:
europe_df = df[df["continent"] == "Europe"] 
print(len(europe_df))
us_df = df[df["continent"] == "US"] 
print(len(us_df))

First, let's collect all EU wine reviews into a single long document.

In [None]:
# join once instead of concatenating row by row (this drops the leading space)
eu_descriptions2 = " ".join(europe_df["description2"])
print(len(eu_descriptions2))

And the same for US:

In [None]:
# join once instead of concatenating row by row (this drops the leading space)
us_descriptions2 = " ".join(us_df["description2"])
print(len(us_descriptions2))

So what does TFIDF do? It weights words in a document so that higher weight words occur in that document relatively often, but much more rarely across the whole document set. So highly weighted words should be important to that document, and much less frequent in other documents. In other words, the words ranked high by TFIDF can be expected to be descriptive of that document.
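To put numbers on that, here is a plain-Python sketch of the weighting on three tiny made-up documents. It assumes the formulas that, as far as I know, scikit-learn uses with the settings in this notebook (smoothed idf, `sublinear_tf=True`, L2 normalisation); the helper `tfidf` is hypothetical and not part of any library:

```python
import math
from collections import Counter

toy_docs = [
    "cherry cherry tannin barolo".split(),
    "cherry oak vanilla".split(),
    "apple citrus oak".split(),
]
n_docs = len(toy_docs)

# Document frequency: in how many documents does each word occur?
doc_freq = Counter(word for doc in toy_docs for word in set(doc))

def tfidf(doc):
    """Return {word: weight} for one tokenized document."""
    weights = {}
    for word, tf in Counter(doc).items():
        sublinear_tf = 1.0 + math.log(tf)                          # dampened term count
        idf = math.log((1 + n_docs) / (1 + doc_freq[word])) + 1.0  # smoothed idf
        weights[word] = sublinear_tf * idf
    norm = math.sqrt(sum(w * w for w in weights.values()))         # L2 normalisation
    return {word: w / norm for word, w in weights.items()}

second = tfidf(toy_docs[1])
for word, weight in sorted(second.items(), key=lambda item: -item[1]):
    print(f"{word:8s} {weight:.3f}")
```

In the second document each word occurs once, yet "vanilla" comes out on top because it appears in no other document, while "cherry" and "oak" are shared and get the lower weight.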

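To find descriptive words in EU vs US wine reviews, I now create a new document set with just two documents: one with all EU wine reviews, and another with all US wine reviews. What TFIDF weights high in the EU document should then be descriptive of the EU wines, and what it weights high in the US document should be descriptive of the US wines. One thing to keep in mind: with only two documents, the max_df=0.5 setting used below drops every word that appears in both, so only words exclusive to one side get ranked at all.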

In [None]:
country_docs = [eu_descriptions2, us_descriptions2]

In [None]:
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5) 
country_docs_transformed = vectorizer.fit_transform(country_docs)
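A small check of what these settings do with only two documents, on made-up stand-ins for the two big texts. With `max_df=0.5` and two documents, a word may appear in at most one of them, so anything shared is removed from the vocabulary:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_country_docs = [
    "cherry tannin barolo perlage",      # stands in for all EU reviews
    "cherry tannin zinfandel carneros",  # stands in for all US reviews
]
toy_vec = TfidfVectorizer(sublinear_tf=True, max_df=0.5)
toy_vec.fit(toy_country_docs)

# Only words that occur in exactly one of the two documents survive.
print(sorted(toy_vec.vocabulary_))
```

This would explain why the top lists tend to be full of grape, producer and region names: generic tasting words are used on both sides of the Atlantic and never make it into the ranking.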

In [None]:
#https://buhrmann.github.io/tfidf-analysis.html
feature_names = np.array(vectorizer.get_feature_names_out())
print(len(feature_names))
arr = country_docs_transformed[0].toarray()
eu_top_df = top_tfidf_feats(arr[0], feature_names) 
print(eu_top_df)

What do these mean? I have no idea but Google works:

barolo = [Italian wine](https://en.wikipedia.org/wiki/Barolo)

grabby = Oddly this actually seems to be a word used to describe wine taste (as I was hoping to find)

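perlage = [wine from Venice (Italy)](http://www.perlagewines.com/en/), or at least a producer by that name; as far as I can tell the word itself is also the Italian term for the fine bubbles in sparkling wine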

late 2017 = not even going to Google this, seems to refer to wine from a specific time. why in Europe? no idea 

bread crust = huh? seems also to be a wording used to describe some wine taste. which is nice.

choppy = ? couldn't quite figure this out

pinot nero = [Italian name for Pinot noir](http://www.winegeeks.com/grapes/309)


In [None]:
arr = country_docs_transformed[1].toarray() 
us_top_df = top_tfidf_feats(arr[0], feature_names) 
print(us_top_df)

Again, what do these mean? Again, I have no idea but Google still works:

petite sirah = .. [growing mainly in California](http://winefolly.com/review/petite-sirah-wine-guide/)

creek = there seem to be many related to wine but if you limit to US and EU, California comes up again

carneros = [some kind of US wine alliance](https://www.carneros.com/)

paso roble = [more winemaking in California](https://pasowine.com/)

ava = [wine growing region in the US](https://en.wikipedia.org/wiki/American_Viticultural_Area)

... and so on

So what is the summary on this? The classifier seems to work reasonably well. Looking at the TFIDF highlighted words at the end, a number of them might give the classifier an advantage: names of wines from specific countries or regions, or the names of the regions themselves. I believe that with plenty of filtering and reviewing of the different results, it would be possible to get many more insights into these topics. I am not sure what the classifier would be useful for in itself, but perhaps it works as a process to learn something from the data?