## V2 added Duplicate inspection

**Intention of this analysis:**

This dataset is well suited to several kinds of analysis, and I'm currently shifting from R to Python. 

**Reason:**

R is a great language for fast and efficient analysis of datasets. Everything I'm going to do in this general analysis I could also have done in R. Unfortunately, R is currently not mature enough for Deep Learning, and working through an analysis of an interesting dataset is a helpful way for me to get a better feeling for Python. 

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

First, read the data and get a first impression of what we could possibly do with it. 

In [None]:
Wine = pd.read_csv('../input/winemag-data-130k-v2.csv')

In [None]:
Wine.head()

In [None]:
Wine.tail()

In [None]:
Wine.info()

In [None]:
Wine.describe()

**First Impression**

Overall, we can identify several different dimensions and only a few metrics (points and price). As the author of this dataset already mentioned, only wines rated with at least 80 points are included. There are also some NaN values throughout the dataset.

**Which questions come to mind that we could try to answer?**

* How many different countries are represented in this dataset?
* Which 10 countries are represented most often in this dataset?
* How many tasters are represented in this dataset?
* What is the point range of the top 10 tasters?
* Is there a relation between given points and prices?


## Check whether there are duplicates that may distort the results

I will look especially at the description as well as the taster and the title. 
In short: I check whether a wine title appears more than once with the same taster and description. 
I also do this because duplicates may distort the results later on if I want to do some NLP (Natural Language Processing) or even Deep Learning. 

In [None]:
Duplicates_Wines = Wine.groupby(['title', 'taster_name', 'description'])
# we're counting how often each combination occurs, so any column can be used for the count
Duplicates_Wines['points'].count().reset_index().sort_values(by = 'points', ascending=False).head(n = 10)

In [None]:
# Let's look at one example to check that the code above worked as intended:
Wine[Wine['title'] == 'Benmarl 2014 Slate Hill Red (New York)']

We could identify some duplicates. In the next step we will drop all of them so that they do not distort any further taster analysis.

In [None]:
# Create new Frame for Duplicates

WineDuplicates = Duplicates_Wines['points'].count().reset_index().sort_values(by = 'points', ascending=False)
WineDuplicates = WineDuplicates[WineDuplicates['points'] > 1]


In [None]:
# review the smallest counts in the dataframe: none should be less than 2
WineDuplicates.tail()

In [None]:
# drop duplicates from main dataframe
Wine.drop_duplicates(subset = ['title', 'taster_name', 'description'], inplace = True)

In [None]:
len(Wine.index)
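
One detail worth cross-checking in the duplicate handling above: `groupby` leaves out rows whose key contains NaN, while `duplicated` and `drop_duplicates` treat NaN keys as equal. So the grouped count can show fewer duplicates than are actually dropped. A minimal sketch on made-up toy data (the frame `toy` and its values are hypothetical, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical), shaped like the three key columns used above.
toy = pd.DataFrame({
    'title':       ['A', 'A', 'B', 'B', 'C'],
    'taster_name': ['X', 'X', np.nan, np.nan, 'Y'],
    'description': ['d1', 'd1', 'd2', 'd2', 'd3'],
    'points':      [90, 90, 88, 88, 85],
})
keys = ['title', 'taster_name', 'description']

# groupby skips rows whose key contains NaN, so only the 'A' pair is counted.
group_sizes = toy.groupby(keys)['points'].count()
dupes_seen_by_groupby = int((group_sizes > 1).sum())

# duplicated() treats NaN keys as equal, so the 'B' pair is flagged as well.
dupes_seen_by_duplicated = int(toy.duplicated(subset=keys).sum())

# drop_duplicates keeps the first row of each duplicated key combination.
deduplicated = toy.drop_duplicates(subset=keys)
print(dupes_seen_by_groupby, dupes_seen_by_duplicated, len(deduplicated))
```

Since `taster_name` contains NaN values in this dataset, the grouped overview is probably best read as a lower bound on the number of duplicates.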

**How many different countries are represented in this dataset?**

In [None]:
Wine["country"].nunique()  # nunique() ignores NaN, which unique() would count as an extra country

**Which 10 countries are represented most often in this dataset?**

In [None]:
Wine["country"].value_counts().head(n = 10)

In [None]:
# iloc[:10] so that the plot shows the same top 10 as the table above
sns.countplot(x = "country", data = Wine, order = Wine["country"].value_counts().iloc[:10].index)
plt.title('Countries with the most wine representations')
plt.tight_layout()

**Which countries have the best-rated wines?**

In [None]:
Wine_Country_Grouping = Wine.groupby('country')
Wine_Country_Grouping_List = Wine_Country_Grouping['points'].mean().reset_index()
# sort values and keep only the top 10
Wine_Country_Grouping_List = Wine_Country_Grouping_List.sort_values(by = 'points', ascending = False).iloc[:10]
# while testing, the results were all quite similar, so a visualization does not really help at this point
Wine_Country_Grouping_List


Interestingly, the countries with the highest-rated wines are not the same as the countries with the most representations. The only exception is France.  
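
One possible reason for this mismatch is that a mean over only a handful of wines is not very reliable. A minimal sketch, on made-up toy data, of how the mean could be shown next to the number of reviews and filtered by a minimum count (the frame `toy` and the cut-off `min_reviews` are hypothetical choices, not values from the dataset):

```python
import pandas as pd

# Toy data (hypothetical): one country has a single, very high rating.
toy = pd.DataFrame({
    'country': ['US', 'US', 'US', 'France', 'France', 'England'],
    'points':  [88, 90, 86, 91, 89, 95],
})
min_reviews = 2  # hypothetical cut-off, to be tuned for the real data

# Mean rating and number of reviews per country, side by side.
by_country = toy.groupby('country')['points'].agg(['mean', 'count'])

# Keep only countries with enough reviews, then rank them by mean rating.
reliable = by_country[by_country['count'] >= min_reviews].sort_values('mean', ascending=False)
print(reliable)
```

On the real data the same pattern would be applied to `Wine` instead of `toy`.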

**How many tasters are represented in this dataset?**

In [None]:
Wine["taster_name"].nunique()  # nunique() ignores NaN, which unique() would count as an extra taster

**Who are the top 10 tasters?**

In [None]:
# rename_axis + reset_index(name=...) yields the columns 'Taster' and 'count' regardless of the pandas version
top_ten_taster = Wine["taster_name"].value_counts().iloc[:10].rename_axis('Taster').reset_index(name = 'count')
top_ten_taster

In [None]:
print('Share of all reviews written by the top 10 tasters: {:.1%}'.format(top_ten_taster['count'].sum() / len(Wine.index)))

Those top 10 tasters look quite interesting, especially because they already account for nearly 75% of all the tastings. 
Let's take a deeper look at those ten.

In [None]:
# Keep only the rows written by one of the top 10 tasters
# (isin replaces the row-wise DataFrame.append loop; append was removed in pandas 2.0)
Wine_Taster = Wine[Wine['taster_name'].isin(top_ten_taster['Taster'])].reset_index(drop = True)

Check whether everything worked smoothly with the new dataframe.

In [None]:
Wine_Taster.info()

Looks great ;-). Let's take a deeper look at those 10 tasters.
How many different wines have been tasted?

In [None]:
len(Wine_Taster['title'].unique())

Quite a high number of unique wines. It is still unclear whether the titles contain any typos. This could be analysed with NLP later on. For now, let's concentrate on the tasters. 
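
Before reaching for NLP, a cheap first check could be to normalize case and whitespace and see whether the number of unique titles shrinks. A minimal sketch, assuming such trivial variants exist at all; the second and third titles below are made up for illustration:

```python
import pandas as pd

# The first title appears in the dataset; the other two are hypothetical.
titles = pd.Series([
    'Benmarl 2014 Slate Hill Red (New York)',
    'benmarl 2014  Slate Hill Red (New York) ',
    'Hypothetical Estate 2012 White (Somewhere)',
])

# Trim the ends, lower-case, and collapse repeated whitespace.
normalized = (titles.str.strip()
                    .str.casefold()
                    .str.replace(r'\s+', ' ', regex=True))

raw_unique = titles.nunique()
normalized_unique = normalized.nunique()
print(raw_unique, normalized_unique)
```

If both numbers are equal on the real `title` column, case and spacing are not the problem and real typos would need a fuzzier comparison.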

**Is their wine rating based on the price of the wine?**

The prices vary quite drastically, so I can't fill the missing values with the average price. Instead I will remove the rows where the price is NaN. 

In [None]:
price_not_null = Wine_Taster['price'].notnull()
Wine_Taster_Prices = Wine_Taster[price_not_null]
Wine_Taster_Grouped = Wine_Taster_Prices.groupby('taster_name')

**What is the range of the top 10 tasters' ratings?**

Do they always rate nearly identically, or can their ratings differ a lot?
"Top 10 tasters" means the tasters with the most reviews in the dataset. 


In [None]:
Wine_Taster_Points = pd.DataFrame()
Wine_Taster_Points['minPoints'] = Wine_Taster_Grouped['points'].min()
Wine_Taster_Points['maxPoints'] = Wine_Taster_Grouped['points'].max()
Wine_Taster_Points

Their rating ranges are nearly identical. Therefore I assume that they rate the wines from a neutral perspective. 
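
Minimum and maximum only describe the extremes, so two tasters can share the same range and still rate quite differently in between. A minimal sketch, on made-up toy data, that adds the mean and standard deviation per taster (the frame `toy` and its values are hypothetical):

```python
import pandas as pd

# Toy data (hypothetical): both tasters span 80 to 100 points,
# but taster 'Y' stays close to 90 most of the time.
toy = pd.DataFrame({
    'taster_name': ['X', 'X', 'X', 'Y', 'Y', 'Y', 'Y', 'Y'],
    'points':      [80, 90, 100, 80, 100, 90, 90, 90],
})

# One row per taster with range, centre and spread of the ratings.
summary = toy.groupby('taster_name')['points'].agg(['min', 'max', 'mean', 'std'])
print(summary)
```

Applied to `Wine_Taster_Grouped['points']`, the `std` column would show whether the identical ranges also come with a similar spread.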

In [None]:
Wine_Taster_Grouped['price'].describe()

All of them tasted cheap as well as expensive wines. I can't see any anomaly or preference for any taster so far. 
Let's check which countries their wines mostly came from.
Since we're checking the country and not the price, we can go back to the regular Wine_Taster dataset, because it is irrelevant whether a price is available or not.

In [None]:
sns.countplot(x = "country", data = Wine_Taster, order = Wine_Taster["country"].value_counts().iloc[:10].index)
plt.title('Countries with the most wine representations (top 10 tasters)')
plt.tight_layout()

Same result as with the general dataset *Wine*.

In general, I can't find any anomalies or insights based on the tasters. 
The only thing that comes to mind is checking the relation between points and price. Since those ten people account for a large number of ratings, we can try to find out whether there is a relation between their ratings and the price of the wine.
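
One way to put a number on such a relation is a correlation coefficient. A minimal sketch on made-up toy data (the frame `toy` and its values are hypothetical): Pearson measures the linear relation, while a rank correlation (Spearman, computed here as Pearson on the ranks) is less sensitive to a few very expensive wines.

```python
import pandas as pd

# Toy data (hypothetical): price rises with points, with one extreme price
# and one missing price.
toy = pd.DataFrame({
    'points': [80, 84, 87, 90, 93, 96],
    'price':  [10.0, 15.0, None, 40.0, 90.0, 400.0],
})

# Rows without a price cannot contribute to the correlation.
priced = toy.dropna(subset=['price'])

# Pearson: strength of the linear relation.
pearson = priced['points'].corr(priced['price'])

# Spearman: Pearson correlation of the ranks, i.e. the monotonic relation.
spearman = priced['points'].rank().corr(priced['price'].rank())

print(round(pearson, 2), round(spearman, 2))
```

On the real data the same two lines would be applied to `Wine_Taster` after dropping the rows without a price.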


In [None]:
Wine_Taster_Points_Grouped = Wine_Taster.groupby('points')
sns.barplot(x = 'points', y = 'price', data = Wine_Taster_Points_Grouped['price'].mean().reset_index())
plt.tight_layout()

In [None]:
sns.barplot(x = 'points', y = 'price', data = Wine_Taster_Points_Grouped['price'].min().reset_index())
plt.tight_layout()

In [None]:
sns.barplot(x = 'points', y = 'price', data = Wine_Taster_Points_Grouped['price'].max().reset_index())
plt.tight_layout()

Based on the **mean** and **min** we can see that there seems to be a relation between the points and the price of the wine. 
In the last bar plot (the maximum price) we can see that there are also some outliers, which overall have no big influence on the majority. 
Let's do one more plot that shows each of the tasters we looked at individually, to see whether each taster has their own rating behaviour. 

In [None]:
# Get an overview of each taster individually
# To build such a FacetGrid I used the example from: 
### http://seaborn.pydata.org/examples/many_facets.html ###
Wine_Taster_PT_Grouped = Wine_Taster.groupby(['taster_name', 'points'])
grid = sns.FacetGrid(Wine_Taster_PT_Grouped['price'].mean().reset_index(), 
                     col="taster_name", hue="taster_name", 
                     col_wrap=4, height=3)  # `height` replaces the older `size` argument
grid.map(plt.plot, "points", "price")
grid.fig.tight_layout(w_pad=1.5)

It seems that we found some relation between the tasters and their ratings. Overall, we can say that the price does have some influence on the rating. Unfortunately, I do not know enough about wine rating to say how the process works. 
E.g.: do the tasters know the price of the wine before they try it?
Knowing that, we could identify two possible outcomes of this analysis: 

**If they know the price before tasting:**

Most likely they are influenced by the price, and the chance of a higher rating increases.

**If they don't know the price before tasting:**

Depending on the rating, the producer could potentially raise the price of the wine for experienced wine lovers, especially for the more expensive wines.
