Thank you to Dostal Jakub's kernel (https://www.kaggle.com/dostalj/wine-classification-informative-words) for the preliminary EDA showing that points, rather than price, are a reliable metric against which to rank wines. His kernel also provides the code used below to subset the top and bottom percentages of wines.

Dostal uses feature selection to identify the words most relevant to determining price, whereas this kernel focuses more on word tokenization, basic NLP using term frequencies, and data visualization with word clouds.

In [7]:
#Load requisite packages
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
import string
from wordcloud import WordCloud
from collections import Counter
import re
import nltk
from nltk.stem.porter import PorterStemmer

Data Load and Cleaning

In [8]:
fname = '../input/winemag-data_first150k.csv'
df = pd.read_csv(fname)

Subset top and bottom 25% of wine by points awarded

In [9]:
# drop all rows where any column is na 
df_clean = df.dropna(axis = 0, how = 'any')
# sort data frame by points in descending order
df_sorted = df_clean.sort_values(by = 'points', ascending = False)
# number of wines 
num_wines = df_sorted.shape[0]
cnt_25_pct = int(0.25*num_wines)
# subset top and bottom 25% of wines by points
best = df_sorted.head(cnt_25_pct)
worst = df_sorted.tail(cnt_25_pct)
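
The head/tail split above cuts at a fixed row count, so wines tied on points at the boundary can land on either side of the cut. One possible alternative, sketched here on a small made-up table (the `toy` frame and the `best_q`/`worst_q` names are illustrative assumptions, not part of the original kernel), is to cut at quantile thresholds so that tied wines stay together:

```python
import pandas as pd

# Hypothetical toy data standing in for the cleaned wine reviews.
toy = pd.DataFrame({
    "description": ["a", "b", "c", "d", "e", "f", "g", "h"],
    "points": [80, 82, 85, 85, 88, 90, 90, 95],
})

# Point thresholds at the 25th and 75th percentiles.
lo_cut = toy["points"].quantile(0.25)
hi_cut = toy["points"].quantile(0.75)

# Keep every wine at or beyond each threshold, so ties are never split.
best_q = toy[toy["points"] >= hi_cut]
worst_q = toy[toy["points"] <= lo_cut]

print(len(best_q), len(worst_q))
```

With this approach the two subsets are not guaranteed to hold exactly 25% of the rows each, which may or may not matter for the comparison that follows.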

Extract descriptions from each set 

In [10]:
# extract all words in descriptions for the best 25% wines - bag of words
best_descs = best['description'].values
best_descs = " ".join(best_descs)

# extract all words in descriptions for the worst 25% wines - bag of words
worst_descs = worst['description'].values
worst_descs = " ".join(worst_descs)

Define functions to tokenize words

In [11]:
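def tokenize(text):
    text = text.lower()
    # replace punctuation, digits and line breaks with spaces;
    # re.escape keeps characters like ']' and '\' literal inside the class
    text = re.sub('[' + re.escape(string.punctuation) + r'0-9\r\t\n]', ' ', text)
    tokens = nltk.word_tokenize(text)
    tokens = [w for w in tokens if len(w) > 2]  # drop very short tokens
    tokens = [w for w in tokens if w not in ENGLISH_STOP_WORDS]
    return tokens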

def stemwords(words):
    stemmer = PorterStemmer()
    words = [stemmer.stem(w) for w in words] # stem words 
    return words
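
To make the cleanup step concrete, here is a minimal, dependency-free sketch of what happens to a single description. The sample sentence is made up, `str.split()` stands in for `nltk.word_tokenize`, and stop word removal and stemming are left out, so this only approximates the functions above:

```python
import re
import string

# Hypothetical sample text, not taken from the dataset.
sample = "Ripe aromas of fig, blackberry & cassis; 100% delicious!\nDrink now."

# Lowercase, then replace punctuation, digits and line breaks with spaces.
pattern = "[" + re.escape(string.punctuation) + r"0-9\r\t\n]"
cleaned = re.sub(pattern, " ", sample.lower())

# Split on whitespace (an assumption standing in for nltk.word_tokenize)
# and drop tokens of two characters or fewer.
tokens = [w for w in cleaned.split() if len(w) > 2]
print(tokens)
```

The digits and symbols disappear, and the short word "of" is dropped by the length filter before any stop word list is consulted.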

Tokenize and create counters of words

In [None]:
# create counter of best words 
best_words = stemwords(tokenize(best_descs))
best_ctr = Counter(best_words)
# create counter of worst words 
worst_words = stemwords(tokenize(worst_descs))
worst_ctr = Counter(worst_words)

Create word clouds:

In [None]:
# cloud for best words
wordcloud = WordCloud()
wordcloud.fit_words(best_ctr)

fig=plt.figure(figsize=(5, 3))   # Prepare a plot 5x3 inches
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

# cloud for worst words
wordcloud.fit_words(worst_ctr)

fig=plt.figure(figsize=(5, 3))   # Prepare a plot 5x3 inches
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

It is evident that common words like 'wine', 'flavor' and 'finish' are overwhelming the clouds. These words do not distinguish the two sets; they are simply common in wine descriptions regardless of sentiment. Let's recreate the clouds, excluding the 7 words most commonly used across all descriptions.

In [None]:
# extract all descriptions - create bag of words 
all_descs = df_sorted['description'].values
all_descs = " ".join(all_descs)
# tokenize bag of words 
all_words = stemwords(tokenize(all_descs))
# create counter 
all_ctr = Counter(all_words)
# find the 7 most common words across all descriptions
top_all = all_ctr.most_common(7)
top_all_words = [x[0] for x in top_all]

Find new lists of descriptors excluding the above

In [None]:
new_best_words = [w for w in best_words if w not in top_all_words]
new_best_ctr = Counter(new_best_words)
new_worst_words = [w for w in worst_words if w not in top_all_words]
new_worst_ctr = Counter(new_worst_words)
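
The filtering step can be illustrated on toy token lists. The lists below are invented, and the common words are counted over the two lists combined rather than over the full dataset, so this is only a sketch of the idea. It also stores the excluded words in a set, which makes each membership check cheaper than scanning a list when the token lists are large:

```python
from collections import Counter

# Hypothetical stemmed tokens standing in for best_words and worst_words.
best_toy = ["wine", "rich", "wine", "flavor", "eleg", "rich"]
worst_toy = ["wine", "flavor", "sweet", "wine", "thin", "flavor"]

# Count words across both lists and take the 2 most common.
all_ctr_toy = Counter(best_toy + worst_toy)
top_common = {w for w, _ in all_ctr_toy.most_common(2)}

# Rebuild the counters without the overly common words.
best_ctr_toy = Counter(w for w in best_toy if w not in top_common)
worst_ctr_toy = Counter(w for w in worst_toy if w not in top_common)

print(top_common)
print(best_ctr_toy, worst_ctr_toy)
```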

New Clouds

In [None]:
wordcloud = WordCloud()
wordcloud.fit_words(new_best_ctr)

fig=plt.figure(figsize=(5, 3))   # Prepare a plot 5x3 inches
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

In [None]:
wordcloud = WordCloud()
wordcloud.fit_words(new_worst_ctr)

fig=plt.figure(figsize=(5, 3))   # Prepare a plot 5x3 inches
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

More relevant descriptors are starting to show: words like "wine" and "flavor" no longer overwhelm the picture, and words like "rich" and "sweet" now appear. However, neutral terms like "acid" and "finish" still show up, and ideally we would eliminate them. For this, we should use a TF-IDF implementation. More to come!
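
As a rough preview of that idea, here is a minimal pure-Python sketch of TF-IDF that treats each set of wines as one document. The token lists and the `tf_idf` helper are illustrative assumptions, and the formula is the plain textbook version (term frequency times log of N over document frequency), not necessarily what a later version of this kernel would use:

```python
import math
from collections import Counter

# Hypothetical stemmed tokens, one "document" per set of wines.
docs = {
    "best": ["rich", "acid", "finish", "rich", "eleg"],
    "worst": ["sweet", "acid", "finish", "thin", "sweet"],
}

def tf_idf(docs):
    """Return {doc_name: {word: tf * log(N / df)}} for token lists."""
    n_docs = len(docs)
    counts = {name: Counter(tokens) for name, tokens in docs.items()}
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter()
    for ctr in counts.values():
        doc_freq.update(ctr.keys())
    scores = {}
    for name, ctr in counts.items():
        total = sum(ctr.values())
        scores[name] = {
            w: (c / total) * math.log(n_docs / doc_freq[w])
            for w, c in ctr.items()
        }
    return scores

scores = tf_idf(docs)
print(scores["best"])
print(scores["worst"])
```

Words that appear in both documents, such as "acid" and "finish", get a score of zero here, while words unique to one set keep a positive weight. Note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF by default, so shared words would be down-weighted there rather than zeroed out.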