## Introduction

This notebook explores the "Wine Reviews" dataset. I would like to answer questions such as the following:

* Do certain varieties of wine receive higher ratings?
* Do different tasters rate similar wines differently? 
* Do specific countries produce "better" wines?
* Is there any relationship between rating and price of a wine?
* Are there certain descriptions that are associated with higher ratings?

Since this is the first kernel that I am publishing on Kaggle, any feedback is appreciated!

In [None]:
import numpy as np 

import pandas as pd 
pd.set_option('display.max_columns', None) # To display all columns
pd.set_option("display.max_rows",100)

import matplotlib.pyplot as plt 
%matplotlib inline 

import seaborn as sns 
sns.set_style('whitegrid') 

# plotly.plotly moved to the separate chart_studio package in plotly 4;
# it is not used below, so it is not imported here
import plotly.graph_objs as go
import plotly.figure_factory as ff

from wordcloud import WordCloud, STOPWORDS

data = pd.read_csv("../input/winemag-data-130k-v2.csv", low_memory=False)
data.head()

In [None]:
# drop the first column, which only holds the row index saved in the CSV
data = data.iloc[:, 1:]

In [None]:
data.shape

In [None]:
# remove duplicates
data.drop_duplicates(inplace=True)
data.shape

In [None]:
# print the distinct values of every column
for col in data:
    print(col, data[col].unique())

### Points vs. Price

In [None]:
data.describe()

In [None]:
# recent seaborn versions require x and y as keyword arguments
sns.regplot(x="points", y="price", data=data)

# or plt.scatter(data.points, data.price)

In [None]:
sns.boxplot(x="points", y="price", data=data)

The average price appears to rise with the rating. The direction of this relationship is unclear, though: tasters might be biased in their reviews if they know the price of the wine, or wines that receive good reviews might be priced higher because of those reviews.
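To put a number on the trend seen in the plots, one option is a rank correlation, which is less sensitive to a handful of very expensive bottles than a straight-line fit. The following is a minimal sketch that assumes only the `points` and `price` columns; the helper name `rank_correlation` and the toy frame are made up for illustration and are not values from the dataset.

```python
import numpy as np
import pandas as pd


def rank_correlation(df, x="points", y="price"):
    """Spearman-style rank correlation: Pearson correlation of the ranks."""
    # Drop rows where either value is missing (price may be missing for some wines).
    pair = df[[x, y]].dropna()
    return pair[x].rank().corr(pair[y].rank())


# Toy data for illustration only; NOT values from the wine dataset.
toy = pd.DataFrame({
    "points": [80, 85, 88, 90, 95, 100],
    "price": [10.0, 12.0, np.nan, 30.0, 80.0, 2000.0],
})

rho = rank_correlation(toy)
print(f"rank correlation on toy data: {rho:.2f}")
```

On the real frame this would be called as `rank_correlation(data)`. A value close to 1 would support the impression that price and rating move together, but it would not say anything about which one drives the other.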

In [None]:
data[data.price>=2000]

Out of curiosity: the most expensive wine is a French Bordeaux! Overall, 5 of the 6 most expensive wines come from France.

### Points vs. Variety

In [None]:
# aggregate only the numeric columns; text columns have no mean or median
data.groupby("variety")[["points", "price"]].agg(['mean', 'median', 'min', 'max', 'count'])

There seems to be a wide range of ratings and prices within each variety.

In [None]:
data.groupby("variety").mean(numeric_only=True).sort_values(by="points", ascending=False)

In [None]:
data.groupby("variety").median(numeric_only=True).sort_values(by="points", ascending=False)

In [None]:
data.groupby("variety").mean(numeric_only=True).sort_values(by="price", ascending=False)

In [None]:
data.groupby("variety").points.count().sort_values(ascending = False)

In [None]:
data.groupby("variety").points.count().sort_values(ascending = False).describe()

There are 707 wine varieties, half of which have fewer than 5 reviews. Pinot Noir is the variety with the most reviews - 12,278.
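Because so many varieties have only a few reviews, their average ratings are noisy, and a ranking by mean can be dominated by varieties reviewed once or twice. A possible remedy is to rank only varieties with a minimum number of reviews. This is a sketch under that assumption; the helper name `rank_varieties`, the threshold and the toy frame are hypothetical.

```python
import pandas as pd


def rank_varieties(df, min_reviews=5):
    """Mean points per variety, keeping only varieties with enough reviews."""
    stats = df.groupby("variety")["points"].agg(["mean", "count"])
    # Varieties with very few reviews give noisy averages, so filter them out.
    stats = stats[stats["count"] >= min_reviews]
    return stats.sort_values(by="mean", ascending=False)


# Toy data for illustration only; NOT values from the wine dataset.
toy = pd.DataFrame({
    "variety": ["A", "A", "A", "B", "C", "C", "C"],
    "points": [90, 92, 91, 99, 85, 86, 87],
})

ranked = rank_varieties(toy, min_reviews=3)
print(ranked)
```

In the toy frame, variety "B" has the highest score but only one review, so it is left out of the ranking. On the real frame the call would be `rank_varieties(data, min_reviews=5)`, with the threshold chosen to taste.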

### Points vs. Taster

In [None]:
data.groupby("taster_name").points.count().sort_values(ascending = False)

In [None]:
data.groupby("taster_name").points.mean().sort_values(ascending = False)

In [None]:
data.groupby("taster_name").points.agg(["min","max","mean","count"])

In [None]:
sns.boxplot(x="taster_name", y="points", data=data)
plt.xticks(rotation=90)

Each taster has completed a different number of reviews, ranging between 6 and 23,560 reviews. Overall, tasters seem to cover most of the range of ratings between 80 and 100. The tasters that have a lower ceiling tend to be the ones who have completed fewer reviews.
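Since tasters differ in both how much they review and how they use the scale, one way to compare ratings across tasters is to express each rating relative to that taster's own mean and spread. The sketch below assumes the `taster_name` and `points` columns; the helper `standardize_by_taster`, the new column `points_z` and the toy frame are made up for illustration.

```python
import pandas as pd


def standardize_by_taster(df):
    """Add a column with each rating as a z-score within its taster's reviews."""
    grouped = df.groupby("taster_name")["points"]
    out = df.copy()
    # transform() returns values aligned with the original rows.
    out["points_z"] = (df["points"] - grouped.transform("mean")) / grouped.transform("std")
    return out


# Toy data for illustration only; NOT values from the wine dataset.
toy = pd.DataFrame({
    "taster_name": ["X", "X", "X", "Y", "Y"],
    "points": [80, 90, 100, 88, 92],
})

scaled = standardize_by_taster(toy)
print(scaled)
```

A `points_z` of 1 means the wine scored one standard deviation above what that taster usually gives, whatever their personal range is. For tasters with very few reviews the standard deviation is unreliable, so these values should be read with care.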

### Points vs. Country

In [None]:
sns.boxplot(x="country", y="points", data=data)
plt.xticks(rotation=90)

### Points vs. Reviews

In [None]:
top = data[data.points >= 95]
top.head()

In [None]:
bottom = data[data.points <= 85]
bottom.head()

In [None]:
wordcloud = WordCloud(width = 1000, height = 500).generate(' '.join(top.description))
plt.figure(figsize=(15,8))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()


In [None]:
wordcloud = WordCloud(width = 1000, height = 500).generate(' '.join(bottom.description))
plt.figure(figsize=(15,8))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

The most frequent words in the reviews of the highest rated wines (>=95 points) and the lowest rated wines (<=85 points) are not very informative. If we exclude common words (e.g. wine) and neutral words (e.g. note), the 3 words that appear most often for the higher rated wines are: fruit, tannin, rich. For the lower rated wines they are: dry, sweet, light/simple.
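Word clouds show what is frequent in each group, but not what differs between the groups. A complementary approach is to compare, word by word, the share of reviews that mention it in each group. The sketch below uses only the standard library; the helper names `word_shares` and `distinctive_words`, the stop-word set and the example sentences are all hypothetical.

```python
import re
from collections import Counter


def word_shares(descriptions, stopwords=frozenset()):
    """Share of descriptions that mention each word at least once."""
    counts = Counter()
    n = 0
    for text in descriptions:
        n += 1
        # Lower-case, split on non-letters, and count each word once per review.
        words = set(re.findall(r"[a-z]+", text.lower())) - set(stopwords)
        counts.update(words)
    return {word: c / n for word, c in counts.items()}


def distinctive_words(top_texts, bottom_texts, stopwords=frozenset(), k=10):
    """Words whose share differs most between the groups (top minus bottom)."""
    top = word_shares(top_texts, stopwords)
    bottom = word_shares(bottom_texts, stopwords)
    diff = {w: top.get(w, 0.0) - bottom.get(w, 0.0) for w in set(top) | set(bottom)}
    return sorted(diff.items(), key=lambda item: item[1], reverse=True)[:k]


# Made-up sentences for illustration only; NOT reviews from the dataset.
top_texts = ["Rich fruit and firm tannins", "Rich, ripe fruit"]
bottom_texts = ["Simple and light wine", "Light fruit, a bit sweet"]
stop = {"and", "a", "wine", "bit"}

result = distinctive_words(top_texts, bottom_texts, stop, k=10)
print(result)
```

Words at the start of the list lean towards the first group and words at the end lean towards the second. With the notebook's frames this could be called as `distinctive_words(top.description, bottom.description, STOPWORDS)`.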

I hope this is helpful. Please let me know if you have any comments!