This notebook explores the wine reviews dataset, which contains wine varieties, vineyards, countries, and review text.
I have attempted a simple xgboost model to predict wine varieties from a sample set of reviews.

In [None]:
from io import StringIO
import requests
import json
import pandas as pd

In [None]:
winedata = pd.read_csv('../input/winemag-data_first150k.csv')
winedata.describe().T

All wine ratings fall between 80 and 100, because the data source website does not publish reviews for wines scoring below 80 points. The cheapest wine is priced at 4 dollars and the costliest at 2300 dollars. It would be interesting to explore the reviews pivoting on price.


**Correlation between wine price and ratings**

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="white", color_codes=True)
%matplotlib inline

winedata.plot(kind="scatter", x="points", y="price")

There are very few high-priced wines in the dataset. There are some low-cost wines to which reviewers have given a 100-point rating; we might look at the reviewer density for them.
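
A minimal sketch of how the price and rating relationship could be quantified, and how the low-cost 100-point wines could be counted. The `toy` DataFrame is made-up stand-in data for `winedata`, and `low_price_cutoff` is a hypothetical threshold, not something taken from the dataset.

```python
import pandas as pd

# Made-up stand-in for winedata; the values are for illustration only.
toy = pd.DataFrame({
    "points": [100, 100, 95, 88, 85, 80],
    "price": [65.0, 1400.0, 120.0, 30.0, 12.0, 4.0],
})

# Rank-based (Spearman) correlation is less sensitive to a few very expensive wines.
corr = toy[["points", "price"]].corr(method="spearman").loc["points", "price"]

# Hypothetical cutoff for "low cost"; tune it against the real price distribution.
low_price_cutoff = 100
cheap_top_rated = toy[(toy["points"] == 100) & (toy["price"] < low_price_cutoff)]

print(f"Spearman correlation: {corr:.2f}")
print(f"100-point wines under {low_price_cutoff} dollars: {len(cheap_top_rated)}")
```

On the real data, replacing `toy` with `winedata` should work as long as the `points` and `price` columns are present; rows with a missing price are excluded from the correlation by pandas.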

**By Wine varieties**

In [None]:
#Let's look at the count of reviews for each wine variety
value_counts = winedata["variety"].value_counts()
value_counts.head()

In [None]:
value_counts.tail()


Some wine varieties have just a single record in our sample. We can ignore these for now and focus on the ones that have a significant presence in our dataset.
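
One way to drop the rare varieties could look like the sketch below. The `toy` DataFrame and the "Rare Grape" label are made up for illustration, and `min_reviews` is a hypothetical threshold that would need tuning on the real data.

```python
import pandas as pd

# Made-up stand-in for winedata; "Rare Grape" is an invented variety name.
toy = pd.DataFrame(
    {"variety": ["Chardonnay"] * 3 + ["Pinot Noir"] * 2 + ["Rare Grape"]}
)

# Hypothetical minimum number of reviews a variety needs to be kept.
min_reviews = 2

counts = toy["variety"].value_counts()
common_varieties = counts[counts >= min_reviews].index

# Keep only the rows whose variety clears the threshold.
filtered = toy[toy["variety"].isin(common_varieties)]
print(filtered["variety"].value_counts())
```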

**By Wine Country**

In [None]:
#Let's visualize the number of wine reviews by country
import matplotlib.pyplot as plt
import seaborn as sns
color = sns.color_palette()
%matplotlib inline

pd.options.mode.chained_assignment = None
pd.options.display.max_columns = 999

plt.figure(figsize=(20,7))
sns.countplot(x="country", data=winedata)
# countplot puts the categories on the x axis and the counts on the y axis
plt.xlabel("Country", fontsize=12)
plt.ylabel("Review Count", fontsize=12)
plt.xticks(rotation=90)
plt.title("Count of Reviews by country", fontsize=15)
plt.show()

In [None]:
#Reviews distribution by country
ReviewCountbyCountry = pd.DataFrame(winedata["country"].value_counts())
ReviewCountbyCountry.describe().T

There are 48 countries in total. A quarter of them have fewer than 6 wine reviews. Let's focus on 11 of the countries with the most reviews for further analysis and a deeper dive.

In [None]:
country_list = ['US','Italy','France','Spain','Chile','Argentina','Portugal','Australia','New Zealand','Germany','South Africa']
sub_data1 = winedata[winedata['country'].isin(country_list)]
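
As an alternative to typing the country list by hand, it could be derived from the review counts. This is only a sketch: `toy` is made-up stand-in data for `winedata`, and `top_n` is a hypothetical parameter.

```python
import pandas as pd

# Made-up stand-in for winedata.
toy = pd.DataFrame(
    {"country": ["US"] * 4 + ["Italy"] * 3 + ["France"] * 2 + ["Spain"]}
)

# Hypothetical number of countries to keep.
top_n = 2

# value_counts sorts by count in descending order, so head() gives the top countries.
top_countries = toy["country"].value_counts().head(top_n).index.tolist()
subset = toy[toy["country"].isin(top_countries)]
print(top_countries)
```

Deriving the list this way keeps it in sync with the data if the input file changes.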

**Wine Prices by Country**

In [None]:
plt.figure(figsize=(40,12))
sns.set_context("paper", font_scale=2.5)    
sns.violinplot(x="country", y="price", data=sub_data1, inner=None)

New Zealand and South Africa mainly have a wide variety of low-cost wines. On the other hand, France and the US have some high-end wines costing close to 2000 dollars.

In [None]:
#Let's look at the vineyard (designation), province and wine variety of the costliest wine reviewed in our dataset.
sub_data1[sub_data1['price'] == 2300]

The vineyard is missing, but we know the province and region to visit next time, and we should certainly try the Bordeaux-style Red Blend variety when we are in France!

In [None]:
#Let's look at the price variation for this wine variety in our dataset.
Bordeaux_style_redblend = sub_data1[sub_data1['variety'] == 'Bordeaux-style Red Blend']
Bordeaux_style_redblend.describe().T

This wine variety seems to be available across a broad price range, from 7 to 2300 dollars.

In [None]:
#Let's look at a few reviews of this variety across the dataset.
Bordeaux_style_redblend.head(5)

We see a variety of reviews in the description field for a single variety of wine. However, the reviews share common keywords; for example, "fruity" and "Merlot" occur frequently in this sample.
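
A rough sketch of how the common keywords could be counted rather than spotted by eye. The review snippets and the tiny stop-word set are made up for illustration; on the real data the `description` column would take their place.

```python
import re
from collections import Counter

# Made-up review snippets standing in for the description column.
descriptions = [
    "A fruity blend led by Merlot with soft tannins.",
    "Ripe and fruity, this Merlot-based wine shows black currant.",
    "Firm tannins and a fruity finish.",
]

# Tiny hypothetical stop-word list; a real one would be much longer.
stop_words = {"a", "and", "by", "with", "this", "the"}

def tokenize(text):
    """Lowercase the text and return its alphabetic words, minus stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in stop_words]

# Count each word once per review so that long reviews do not dominate.
doc_freq = Counter()
for text in descriptions:
    doc_freq.update(set(tokenize(text)))

print(doc_freq.most_common(3))
```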

In [None]:
#Let's look at the sample wines from the US.
wine_US = sub_data1[sub_data1['country'] == 'US']

In [None]:
wine_US.head()

In [None]:
#Now let's look at the wine varieties in the US
value_counts = wine_US["variety"].value_counts()
value_counts.head()

In [None]:
#Reviews by province in US
plt.figure(figsize=(10,5))
plt.rc('xtick', labelsize=10)
plt.rc('ytick', labelsize=10)
sns.countplot(x="province", data=wine_US)
# Provinces are on the x axis, counts on the y axis; font sizes large enough to read
plt.xlabel("Province", fontsize=12)
plt.ylabel("Review Count", fontsize=12)
plt.title("Count of Reviews by province in US", fontsize=15)
plt.xticks(rotation=90)
plt.show()

In [None]:
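#Let's build a model to predict four of the most-reviewed wine varieties in US reviews.
varietylist = ['Pinot Noir','Cabernet Sauvignon','Chardonnay','Syrah']
#Take a copy so that columns can be added later without modifying a slice of wine_US
subdata = wine_US[wine_US['variety'].isin(varietylist)].copy()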

In [None]:
#Review count for the selected wine varieties in US
plt.figure(figsize=(10,5))
plt.rc('xtick', labelsize=10)
plt.rc('ytick', labelsize=10)
sns.countplot(x="variety", data=subdata)
# Varieties are on the x axis, counts on the y axis
plt.xlabel("Variety", fontsize=12)
plt.ylabel("Review Count", fontsize=12)
plt.title("Count of Reviews by variety in US", fontsize=15)
plt.xticks(rotation=90)
plt.show()

In [None]:
#Let's plot wine prices for the selected wine varieties in the US
plt.figure(figsize=(40,12))
sns.set_context("paper", font_scale=2.5)    
sns.violinplot(x="variety", y="price", data=subdata, inner=None)

The Chardonnay variety includes some costly wines. Pinot Noir and Syrah appear to have many options in the low-cost range.
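
The violin plot gives a visual impression; a per-variety summary table could back it with numbers. A minimal sketch, assuming a made-up `toy` DataFrame in place of `subdata` (the prices are invented):

```python
import pandas as pd

# Made-up stand-in for subdata; the prices are invented for illustration.
toy = pd.DataFrame({
    "variety": ["Chardonnay", "Chardonnay", "Pinot Noir", "Pinot Noir", "Syrah", "Syrah"],
    "price": [20.0, 200.0, 25.0, 45.0, 18.0, 40.0],
})

# The median shows the typical price; the max exposes the costly outliers.
summary = toy.groupby("variety")["price"].agg(["median", "max"])
print(summary)
```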

**Wine Tasting with tfidf and xgboost**

In [None]:
#encoding the labels 
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
label_encoder = label_encoder.fit(subdata['variety'])
label_encoded_y = label_encoder.transform(subdata['variety'])
subdata['encoded_winevariety'] = label_encoded_y
subdata.head()

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(
    min_df=5, max_features=100, strip_accents='unicode',lowercase =True,
    analyzer='word', token_pattern=r'\w+', use_idf=True, 
    smooth_idf=True, sublinear_tf=True, stop_words = 'english').fit(subdata["description"])

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
features = tfidf.get_feature_names_out()
print(features)
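
Since reviews often name the grape, it may be worth checking whether the variety names themselves show up in the vocabulary; if they do, the model could be reading the label straight from the text. This is a sketch of that check, not a confirmed finding: the `features` list below is a made-up stand-in for the real tfidf vocabulary.

```python
varietylist = ['Pinot Noir', 'Cabernet Sauvignon', 'Chardonnay', 'Syrah']

# Made-up vocabulary standing in for the real tfidf feature names.
features = ["acidity", "cabernet", "cherry", "chardonnay", "oak", "pinot", "tannins"]

# Split each variety name into lowercase tokens, matching the vectorizer's lowercasing.
label_tokens = {token for name in varietylist for token in name.lower().split()}

# Features that are also part of a label name are candidates for leakage.
leaky_features = sorted(label_tokens.intersection(features))
print(leaky_features)
```

If the real vocabulary contains such tokens, one option is to add them to the vectorizer's stop words and compare the accuracy before and after.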

In [None]:
X_tfidf_text = tfidf.transform(subdata["description"])
subdata_2 = pd.DataFrame(X_tfidf_text.toarray())
subdata = subdata.reset_index()
subdata_2['encoded_winevariety'] = subdata['encoded_winevariety']
#Also adding variety for better readability
subdata_2['variety'] = subdata['variety']

In [None]:
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import train_test_split
seed = 7

#Split into train and test
test_size = 0.2
y = subdata_2['encoded_winevariety']
X = subdata_2.drop(['encoded_winevariety','variety'], axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=seed)
# define the model; it is fit on the training data in the next cell
import xgboost as xgb
clf = xgb.XGBClassifier(max_depth=3, n_estimators=300, learning_rate=0.05)

In [None]:
clf.fit(X_train, y_train)

In [None]:
y_pred = clf.predict(X_test)

In [None]:
#Measuring accuracy
from sklearn.metrics import accuracy_score

# accuracy_score expects the true labels first, then the predictions
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

The accuracy is 87%, which I need to check and investigate because it doesn't feel right. Note that this was a very small sample of the original dataset; the results will vary once I add all wine varieties from all countries to the prediction.
Any feedback will add to my learning. There is more to explore and learn.
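
A first step in investigating an accuracy figure could be to compare it with a majority-class baseline and to look at a confusion table. The sketch below uses made-up label arrays in place of `y_test` and `y_pred`; the numbers it prints say nothing about the real model.

```python
import pandas as pd

# Made-up encoded labels standing in for y_test and y_pred.
y_true = pd.Series([0, 0, 0, 0, 1, 1, 2, 2, 2, 3], name="actual")
y_hat = pd.Series([0, 0, 0, 1, 1, 1, 2, 2, 0, 3], name="predicted")

# Baseline: always predict the most frequent class in the test labels.
majority_class = y_true.value_counts().idxmax()
baseline_acc = (y_true == majority_class).mean()

# Model accuracy: the share of predictions that match the true label.
model_acc = (y_true == y_hat).mean()

# Rows are actual classes, columns are predicted classes.
confusion = pd.crosstab(y_true, y_hat)

print(f"Baseline accuracy: {baseline_acc:.2f}")
print(f"Model accuracy: {model_acc:.2f}")
print(confusion)
```

If the classes are heavily imbalanced, the baseline alone can be high, so the gap between the two numbers is more informative than the model accuracy by itself.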