## Introduction

In this notebook we look for the features of wines that have the best taste/price ratio, where taste is expressed in points. Although everyone has a different taste, we will treat taste points as a measure of quality, which we want to maximize.

In [1]:
# Import libraries
import numpy as np
import pandas as pd

import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.figure_factory as ff

import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

In [8]:
wines = pd.read_csv("../input/winemag-data-130k-v2.csv")

## Content

1. [Missing data](#missing-data)<br>
2. [Prices](#prices)<br>
    2.1. [Price distribution](#price-distribution)<br>
    2.2. [Impact of price on taste points](#impact-on-taste-points)<br>
    2.3. [Points/price ratio](#points-price-ratio)<br>
    2.4. [Variety's influence on price and taste points](#variety-influence)<br>
3. [Countries](#countries)<br>
    3.1. [Countries with the best taste/price ratio](#countries-best)<br>
    3.2. [Countries and varieties](#countries-and-varieties)<br>
4. [Looking for the best wine](#looking-best)
5. [Conclusion](#conclusion)

<a id='missing-data'></a>
## Missing data

Let us first have a look at missing values.

In [9]:
ax = sns.heatmap(wines.isnull().T, xticklabels=False, cbar=False)
ax.set_title('Missing data')
plt.show()

The heatmap shows a lot of missing values in the columns 'designation', 'region_2', 'taster_name' and 'taster_twitter_handle'. I assume the remaining features are more important, so we will drop these columns. Similarly, we will drop observations with no price, since our analysis is based on finding the best taste/price ratio. We will also drop the column 'Unnamed: 0'.

In [10]:
wines = wines.drop(['designation', 'region_2', 'taster_name', 'taster_twitter_handle', 'Unnamed: 0'], axis=1)
wines = wines[wines['price'].notnull()]
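
As a numeric complement to the heatmap, the share of missing values per column can be computed directly. The sketch below uses a made-up miniature table, and the 0.5 threshold is an assumption for illustration; the notebook itself picks the columns to drop by hand.

```python
import numpy as np
import pandas as pd

# Made-up miniature of the wine table (values are illustrative only).
toy = pd.DataFrame({
    "country": ["US", "France", "Italy", "Spain"],
    "region_2": [np.nan, np.nan, np.nan, "Napa"],
    "price": [20.0, np.nan, 35.0, 12.0],
    "points": [88, 91, 90, 85],
})

# Fraction of missing values per column, largest first.
missing_share = toy.isnull().mean().sort_values(ascending=False)

# Drop columns that are mostly empty (0.5 is an assumed threshold),
# then drop rows without a price.
sparse_cols = missing_share[missing_share > 0.5].index
cleaned = toy.drop(columns=sparse_cols).dropna(subset=["price"])

print(missing_share)
print(cleaned)
```

Running the same two lines on `wines` before the drop would list the columns in order of how sparse they are.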

<a id='prices'></a>
## Prices

<a id='price-distribution'></a>
#### Price distribution

Let us now have a look at the distribution of prices.<br>
Note! Observations with a price of 200 or higher are not plotted.

In [11]:
price_hist = go.Histogram(
                x=wines[wines['price'] < 200]['price'], 
                nbinsx = 40
            )

data = [price_hist]

layout = go.Layout(
    title='Price distribution',
    autosize=False,
    width=800,
    height=350
)

fig = go.Figure(data=data, layout=layout)
py.iplot(fig)

<a id='impact-on-taste-points'></a>
#### Impact of price on taste points

Let us first check whether there is a correlation between price and taste (points).<br>
Note! Observations with a price higher than 1000 are not plotted.

In [12]:
plt.subplots(figsize=(12,2))
ax = sns.regplot(x="price", y="points", data=wines, scatter_kws={'alpha':0.4, 's': 13}, fit_reg=False)
ax.set_ylim(79, 101)
ax.set_xlim(0, 1000)
ax.set_yticks([80, 85, 90, 95, 100])
ax.set_title('Impact of price on taste(points)')
plt.show()

From the plot we can conclude that wines with higher prices tend, on average, to get higher taste points. Although wines in lower price ranges tend to get fewer points on average, there are still many that perform equally well (in terms of taste points) as the ones in higher price ranges.
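
The visual impression can also be backed by a number. The following is a minimal sketch on made-up data (the values are not taken from the dataset) that computes the Pearson correlation between price and points, once on the raw price and once on its logarithm. Trying the logarithm is an assumption on my side, motivated by the heavy right skew of prices.

```python
import numpy as np
import pandas as pd

# Made-up prices and points, for illustration only.
toy = pd.DataFrame({
    "price": [8.0, 12.0, 20.0, 35.0, 60.0, 150.0, 400.0],
    "points": [84, 86, 87, 90, 89, 93, 96],
})

# Pearson correlation on the raw price and on log(price).
raw_corr = toy["price"].corr(toy["points"])
log_corr = np.log(toy["price"]).corr(toy["points"])

print(round(raw_corr, 2), round(log_corr, 2))
```

On the real data the same two calls would use `wines['price']` and `wines['points']`.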

Let us define 3 categories for price ranges:<br>
* below 14 - 'low'
* from 14 to below 49 - 'medium'
* 49 and above - 'high'

In [13]:
# creating price_range feature
wines['price_range'] = 'high'
wines.loc[wines['price'] < 49, 'price_range'] = 'medium'
wines.loc[wines['price'] < 14, 'price_range'] = 'low'
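
The same binning can be written with `pd.cut`, which makes the boundaries explicit. This is an alternative sketch rather than the notebook's method; the sample prices are made up, and `right=False` is chosen so that the bins match the two `<` comparisons above.

```python
import numpy as np
import pandas as pd

# Made-up prices that sit on and around the two boundaries.
prices = pd.Series([9.0, 13.99, 14.0, 30.0, 48.99, 49.0, 120.0])

# right=False makes each bin closed on the left: [0, 14), [14, 49), [49, inf)
price_range = pd.cut(
    prices,
    bins=[0, 14, 49, np.inf],
    labels=["low", "medium", "high"],
    right=False,
)

print(price_range.tolist())
# ['low', 'low', 'medium', 'medium', 'medium', 'high', 'high']
```

`pd.cut` returns an ordered categorical, so the categories keep their low/medium/high order when sorted or grouped.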

Let us now have a look at the boxplots:

In [14]:
price_range_cats = ['low', 'medium', 'high']
price_range_boxes = []

for cat in price_range_cats:
    obj = go.Box(
        y=wines[wines['price_range'] == cat]['points'],
        name=cat
    )
    price_range_boxes.append(obj)

layout = go.Layout(
    title='Points distribution in price categories',
    autosize=False,
    width=800,
    height=350
)

fig = go.Figure(data=price_range_boxes, layout=layout)
py.iplot(fig)

As we have already seen, the number of points depends on price. But again, it is interesting to notice that the medium price category (below $49 per bottle) has an upper fence (96 points) very close to that of the high price category (97 points), although it shows a larger spread in points.
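
For reference, the upper fence of a boxplot under the usual Tukey convention is the largest observation that does not exceed Q3 + 1.5 × IQR. The sketch below computes it with NumPy on made-up points; I am assuming Plotly follows this convention by default, and its quartile interpolation may differ slightly from `np.percentile`.

```python
import numpy as np

# Made-up taste points with one clear outlier.
points = np.array([80, 82, 84, 85, 86, 87, 88, 90, 100])

# Quartiles and the whisker limit Q3 + 1.5 * IQR.
q1, q3 = np.percentile(points, [25, 75])
limit = q3 + 1.5 * (q3 - q1)

# The upper fence is the largest observation inside the limit.
upper_fence = points[points <= limit].max()
outliers = points[points > limit]

print(upper_fence, outliers)
```

Here Q1 = 84 and Q3 = 88, so the limit is 94, the upper fence is 90, and the value 100 is drawn as an outlier.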

<a id='points-price-ratio'></a>
#### Points/price ratio

Let us now create a feature that defines how many taste points we get for 1 unit of price.

In [15]:
wines['taste/price'] = wines['points']/wines['price']
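
A small worked example shows how the ratio behaves. The three wines below are hypothetical. Because points in this dataset only span roughly 80 to 100 while prices vary far more, the ratio is driven mostly by price.

```python
import pandas as pd

# Three hypothetical wines, for illustration only.
toy = pd.DataFrame({
    "title": ["Wine A", "Wine B", "Wine C"],
    "price": [10.0, 40.0, 200.0],
    "points": [85, 92, 98],
})

# Taste points obtained per unit of price.
toy["taste/price"] = toy["points"] / toy["price"]

# Rank the wines from best to worst ratio.
ranked = toy.sort_values("taste/price", ascending=False)
print(ranked)
```

Wine A gets 8.5 points per unit of price, Wine B about 2.3 and Wine C about 0.49, so the cheapest wine ranks first even though it has the fewest points.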

In [16]:
plt.subplots(figsize=(12,2))
ax1 = sns.regplot(x="taste/price", y="points", data=wines[wines['price_range'] == 'high'], label="high",
                 scatter_kws={'alpha':0.5, 's': 3, 'color': 'green', 'marker': 'o'}, fit_reg=False)
ax2 = sns.regplot(x="taste/price", y="points", data=wines[wines['price_range'] == 'medium'], label="medium",
                 scatter_kws={'alpha':0.5, 's': 3, 'color': 'blue', 'marker': 'o'}, fit_reg=False)
ax3 = sns.regplot(x="taste/price", y="points", data=wines[wines['price_range'] == 'low'], label="low",
                 scatter_kws={'alpha':0.5, 's': 3, 'color': 'red', 'marker': 'o'}, fit_reg=False)

plt.xlim(0, 22)
plt.yticks([80, 85, 90, 95, 100])
plt.title('Relation between number of points and taste/price ratio')
plt.legend(markerscale=3)
plt.show()

This is just another way to visualize what we have already seen: based on this chart, cheaper wines can be relatively good, and expensive wines can be overpriced.

<a id='variety-influence'></a>
#### Variety's influence on price and taste points

Let us check whether some varieties are preferred over others.

In [17]:
# Top 10 most frequent varieties (despite the name, only 10 are kept here)
wine_variety_15 = wines['variety'].value_counts()[:10]

In [18]:
price_variety_box = []

for variety in wine_variety_15.index:
    obj = go.Box(
        x=wines[(wines['variety'] == variety)]['price'],
        name=variety,
        orientation = 'h',
    )
    price_variety_box.append(obj)
    
layout = go.Layout(
    title='Price distribution in varieties',
    autosize=False,
    width=800,
    height=400,
    xaxis=dict(autorange=False,range=[4, 150])
)

fig = go.Figure(data=price_variety_box, layout=layout)
py.iplot(fig)


points_variety_box = []

for variety in wine_variety_15.index:
    obj = go.Box(
        x=wines[(wines['variety'] == variety)]['points'],
        name=variety,
        orientation = 'h',
    )
    points_variety_box.append(obj)
    
layout = go.Layout(
    title='Points distribution in varieties',
    autosize=False,
    width=800,
    height=400
)

fig = go.Figure(data=points_variety_box, layout=layout)
py.iplot(fig)

If we look at the distribution of taste points for different varieties, we can notice differences: for example, 'Pinot Noir' is rated a bit better than 'Merlot'. However, the price distribution shows that prices for 'Pinot Noir' are also higher. We have already seen that price and taste points are correlated (especially in the low and medium price ranges), so price may explain part of the difference in taste points.

<a id='countries'></a>
## Countries

Let us imagine a situation: we know nothing about wine and want to pick a good one based on the country of origin, but we do not want to overpay. Which country do we choose?

<a id='countries-best'></a>
#### Countries with the best taste/price ratio

We consider a wine with 80+ taste points to be a good wine.<br>
So, let us simply choose the countries with the highest median taste/price ratio.

In [19]:
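# numeric_only=True skips text columns, which recent pandas versions refuse to aggregate
wines_country_median = wines.groupby('country').median(numeric_only=True).sort_values('taste/price', ascending=False)[:15]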
texts = []
for country in wines_country_median.index:
    text = 'Median points: {0}<br>Median price: {1}'.format(wines_country_median.loc[country, 'points'],
                                              wines_country_median.loc[country, 'price'])
    texts.append(text)

data = [
    go.Bar(x=wines_country_median.index, y=wines_country_median['taste/price'],
           textfont=dict(size=16, color='#333'),
           text=texts,
           marker=dict(
               color=wines_country_median['taste/price'],
               colorscale = 'Electric',
               line=dict(
                    color='rgba(50,25,25,0.5)',
                    width=1.5)
           ))
]

annot = dict(
            x=11,
            y=9,
            xref='x',
            yref='y',
            text="Hover to see the country's median price and points",
            showarrow=False
        )

layout = go.Layout(
    autosize=False,
    width=800,
    height=400,
    title='Countries with the highest taste/price ratio',
    annotations=[annot]
)

fig = go.Figure(data=data, layout=layout)
py.iplot(fig)

Note! Since we have no data on wines that were given fewer than 80 points, we assume that the captured pattern is similar across the whole range of taste points.
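
A median taken over very few wines can be misleading, so it may be worth looking at the number of wines behind each country's ratio. The sketch below is an optional check on made-up data; the `min_wines` cutoff is an assumption of mine and is not part of the analysis above.

```python
import pandas as pd

# Made-up ratios for three hypothetical countries.
toy = pd.DataFrame({
    "country": ["A", "A", "A", "B", "C", "C"],
    "taste/price": [6.0, 5.0, 7.0, 9.0, 4.0, 3.0],
})

# Median ratio and number of wines per country, best ratio first.
summary = (
    toy.groupby("country")["taste/price"]
       .agg(median_ratio="median", n_wines="count")
       .sort_values("median_ratio", ascending=False)
)

# Keep only countries with enough wines (assumed cutoff).
min_wines = 2
reliable = summary[summary["n_wines"] >= min_wines]

print(summary)
print(reliable)
```

In this toy table country B leads with a single wine and drops out once the cutoff is applied.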

<a id='countries-and-varieties'></a>
#### Countries and varieties

We have found the countries with the highest ratio of taste points to price. But what if we are a little more experienced with wine and want to pick a wine of a specific variety? To answer this, we will find the country with the highest median taste/price ratio for each of the 15 most popular wine varieties.

In [20]:
wine_variety_15 = wines['variety'].value_counts()[:15]

columns=['Variety', 'Country', 'Median taste/price', 'Median price', 'Median points']
wine_variety_df = pd.DataFrame(columns=columns)

for i, variety in enumerate(wine_variety_15.index):
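    # numeric_only=True skips text columns; iloc[0] keeps the country with the best median ratio
    country_data = wines[wines['variety'] == variety].groupby('country').median(numeric_only=True).sort_values('taste/price', 
                                                                                              ascending=False).iloc[0]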
    
    wine_variety_df.loc[i] = [variety, country_data.name, country_data['taste/price'], 
                              country_data['price'], country_data['points']]

wine_variety_df
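
The per-variety loop can also be expressed as a single grouped computation. This is an alternative sketch on made-up data, not the notebook's own code: it takes the median ratio per (variety, country) pair and keeps the top country for each variety.

```python
import pandas as pd

# Made-up ratios for two varieties, for illustration only.
toy = pd.DataFrame({
    "variety": ["Merlot", "Merlot", "Merlot", "Riesling", "Riesling"],
    "country": ["Chile", "Chile", "US", "Germany", "Austria"],
    "taste/price": [7.0, 9.0, 5.0, 4.0, 6.0],
})

# Median ratio for every (variety, country) pair.
medians = (
    toy.groupby(["variety", "country"])["taste/price"]
       .median()
       .reset_index()
)

# Sort by ratio and keep the first (best) country per variety.
best = (
    medians.sort_values("taste/price", ascending=False)
           .drop_duplicates("variety")
           .set_index("variety")
)

print(best)
```

On the full data one would first restrict `wines` to the 15 most frequent varieties, for example with `isin`.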

<a id='looking-best'></a>
## Looking for the best wine

Now we want to be amazed by wine's taste! So we will be looking for wines, which have at least 95 taste points.

First of all, let us have a look at the countries that produce such wines.

In [21]:
wines_points_95_summary = wines[wines['points'] >= 95].groupby('country').count()
wines_points_95_summary = wines_points_95_summary[['description']]
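# numeric_only=True restricts the mean to numeric columns (points, price, taste/price)
wines_points_95_summary = wines_points_95_summary.merge(wines[wines['points'] >= 95].groupby('country').mean(numeric_only=True), 
                                                        left_index = True, right_index = True)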
wines_points_95_summary = wines_points_95_summary.rename(columns={'description': 'Number of wines', 
                                                                  'points': 'Mean taste points', 'price': 'Mean price', 
                                                                  'taste/price': 'Mean taste/price'})

In [22]:
wines_points_95_summary

From the table, we see that the US, France, Italy, Austria and Portugal have the largest number of wines that reached our threshold. We can also notice that Austrian wines are on average much cheaper than, for example, French or German ones. Similarly, Austria has the highest mean taste/price ratio.
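
The summary table is built from a count followed by a merge and a rename. The same table can be produced in one step with pandas named aggregation, as in this sketch on made-up data (the output column names here are hypothetical):

```python
import pandas as pd

# Made-up top-rated wines, for illustration only.
toy = pd.DataFrame({
    "country": ["US", "US", "Austria", "France"],
    "description": ["a", "b", "c", "d"],
    "points": [96, 95, 95, 97],
    "price": [100.0, 80.0, 40.0, 300.0],
})
toy["taste/price"] = toy["points"] / toy["price"]

# Count and means per country in a single aggregation.
top = toy[toy["points"] >= 95]
summary = top.groupby("country").agg(
    n_wines=("description", "count"),
    mean_points=("points", "mean"),
    mean_price=("price", "mean"),
    mean_ratio=("taste/price", "mean"),
)

print(summary)
```

Each keyword names an output column and pairs a source column with an aggregation function, so no merge or rename is needed.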

Now, let us plot the wines.<br>
Feel free to zoom in and hover to see the wine's name and country of origin.

In [24]:
wines_points_95 = wines[wines['points'] >= 95]
wines_price_95_countries = wines_points_95['country'].unique()
traces = []

for country in wines_price_95_countries:
    trace = go.Scatter(
        x = wines_points_95[wines_points_95['country'] == country]['price'],
        y = wines_points_95[wines_points_95['country'] == country]['points'],
        mode = 'markers',
        marker = dict(
            size = 8
        ),
        text = wines_points_95[wines_points_95['country'] == country]['title'],
        name = country
    )
    traces.append(trace)

layout = go.Layout(
    autosize=False,
    width=800,
    height=400,
    title='Wines with the highest number of taste points',
    hovermode= 'closest'
)

fig = go.Figure(data=traces, layout=layout)
py.iplot(fig)

<a id='conclusion'></a>
## Conclusion

What would our conclusion be?<br>
In short, more expensive wines have more taste points on average; however, it is possible to find a good wine at a reasonable price.