## Introduction 

This is an exploratory data analysis project based on the winery data downloaded from Kaggle. This first project aims to answer a few questions:

- What is the average rating of wine for each country, and which country has the best-rated wines?
- What is the average price of wine from each country?
- Which country has the largest number of wineries producing Chardonnay?
- Which country has the largest number of wineries producing Pinot Noir?
- A wine map :) 


## Data importing and cleaning

In [1]:
# import data

import pandas as pd
import seaborn as sns
import matplotlib

listings=pd.read_csv('../input/winemag-data_first150k.csv',encoding="latin-1") 

In [2]:
# count the non-null values in each column to start with

listings.count(axis=0)

In [3]:
listings.head(n=10)

In [4]:
# Remove the 'region_2' column since it has too many NA values, then drop rows with any remaining NA
listings_new=listings.drop('region_2',axis=1)
listings_new=listings_new.dropna(axis=0,how='any')


In [5]:
# check how many rows are still left for exploration

listings_new.count(axis=0)
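
Before dropping rows, it can help to see which columns contribute the missing values. The following is a minimal sketch on a small made-up frame (the values are invented for illustration and are not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the wine listings; all values are invented
toy = pd.DataFrame({
    'country': ['US', 'France', None, 'Spain'],
    'price': [20.0, np.nan, 15.0, 30.0],
    'region_2': ['Sonoma', None, None, None],
})

# Number of missing values per column, largest first
missing_per_column = toy.isnull().sum().sort_values(ascending=False)
print(missing_per_column)

# Drop the sparsest column, then any row that still has a missing value
toy_clean = toy.drop('region_2', axis=1).dropna(axis=0, how='any')
print(len(toy), '->', len(toy_clean), 'rows')
```

Running the same `isnull().sum()` check on `listings` would show whether `region_2` really is the main source of missing values.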

## Exploratory Data Analysis

In [6]:
# Average price of wine produced in each country 

import numpy as np
import matplotlib.pyplot as plt
sns.boxplot(x='country',y='price',data=listings_new)
ax=plt.gca()
ax.set_xticklabels(ax.get_xticklabels(), rotation=90, ha='right')
ax.set_ybound(lower=None, upper=300)
plt.show()

In [8]:
# explore the relationship between wine ratings and price
sns.lmplot(x='points',y='price',data=listings_new)
plt.show()
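
The regression plot shows the trend visually. A numeric companion is the Pearson correlation between the two columns. This is a rough sketch on invented numbers, not a result from the dataset:

```python
import pandas as pd

# Invented ratings and prices, for illustration only
toy = pd.DataFrame({
    'points': [80, 85, 90, 95],
    'price': [10.0, 18.0, 35.0, 60.0],
})

# Pearson correlation coefficient between rating and price
r = toy['points'].corr(toy['price'])
print(round(r, 3))
```

On the real data the same call would be `listings_new['points'].corr(listings_new['price'])`.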

In [9]:
average_price=pd.pivot_table(listings_new, values='price',index='country',aggfunc='mean').round(2)
average_price
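
The pivot table lists countries in alphabetical order, which makes the cheapest and most expensive countries hard to spot. One possible alternative, sketched here on invented numbers, is to group by country and sort the means:

```python
import pandas as pd

# Invented prices, for illustration only
toy = pd.DataFrame({
    'country': ['France', 'France', 'Spain', 'Spain', 'US'],
    'price': [40.0, 50.0, 20.0, 30.0, 35.0],
})

# Mean price per country, most expensive first
avg_price_sorted = (toy.groupby('country')['price']
                       .mean()
                       .round(2)
                       .sort_values(ascending=False))
print(avg_price_sorted)
```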

In [10]:
# Average rating of wines from each country
sns.boxplot(x='country',y='points',data=listings_new)
ax=plt.gca()
ax.set_xticklabels(ax.get_xticklabels(), rotation=90, ha='right')
ax.set_ybound(lower=None, upper=None)
plt.show()

In [11]:
average_points=pd.pivot_table(listings_new, values='points',index='country',aggfunc='mean').round(0)
average_points

In [12]:
# Country having the largest number of wineries producing Chardonnay

Chardonnay=listings_new[listings_new['variety']=='Chardonnay']
Chardonnay_max=pd.pivot_table(Chardonnay,values='variety',index='country',aggfunc='count')
Chardonnay_max
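
Note that `aggfunc='count'` counts review rows, so a winery with several reviewed Chardonnays is counted several times. If the goal is the number of distinct wineries, one option is to count unique values of a winery column. The sketch below assumes such a column exists and is named `winery` (it is not used elsewhere in this notebook), and it runs on made-up rows:

```python
import pandas as pd

# Invented rows; 'winery' is an assumed column name
toy = pd.DataFrame({
    'country': ['US', 'US', 'US', 'France', 'France'],
    'variety': ['Chardonnay'] * 5,
    'winery': ['A', 'A', 'B', 'C', 'D'],
})

chardonnay = toy[toy['variety'] == 'Chardonnay']

# Rows per country (what the pivot table above reports)
review_counts = chardonnay.groupby('country')['variety'].count()

# Distinct wineries per country
winery_counts = chardonnay.groupby('country')['winery'].nunique()

print(review_counts)
print(winery_counts)
```

In this toy example the US has three Chardonnay reviews but only two distinct wineries, so the two measures can rank countries differently.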

In [13]:
# The Chardonnay wines with the highest rating

Chardonnay_highrate=Chardonnay['points'].max()
Chardonnay[Chardonnay['points']==Chardonnay_highrate]

In [14]:
# Country having the largest number of wineries producing Pinot Noir

Pinot_Noir=listings_new[listings_new['variety']=='Pinot Noir']
Pinot_Noir_max=pd.pivot_table(Pinot_Noir,values='variety',index='country',aggfunc='count')
Pinot_Noir_max

In [15]:
# The Pinot Noir wines with the highest rating

Pinot_Noir_highrate=Pinot_Noir['points'].max()
Pinot_Noir[Pinot_Noir['points']==Pinot_Noir_highrate]

In [16]:
# Count the number of wineries in each US state

listings_US=listings_new[listings_new['country']=='US']
states=pd.pivot_table(listings_US,values='designation',index='province',aggfunc='count')
states

### Wine Map-US

In [17]:
import numpy as np

# as_matrix() was removed in pandas 1.0; .values returns the counts as a 1-D array
num_wineries=states['designation'].values
states_name=np.asarray(['AZ','CA','CO','CT','ID','IA','KY','MA',
                        'MI','MO','NV','NJ','NM','NY','NC','OH',
                        'OR','PA','TX','VT','VA','WA'])
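
The hard-coded array above only lines up with `num_wineries` if the provinces appear in exactly that alphabetical order. A more defensive variant maps each province name to its code and drops anything without a match. The sketch below uses a partial, hand-written lookup table (`state_abbrev` is a name introduced for this example) and an invented non-state label, `'America'`, to show the filtering:

```python
import pandas as pd

# Partial lookup; extend it with every state present in the data
state_abbrev = {'California': 'CA', 'Oregon': 'OR',
                'Washington': 'WA', 'New York': 'NY'}

# Toy counts indexed by province, shaped like the `states` pivot table
toy_states = pd.DataFrame(
    {'designation': [5, 100, 40, 60]},
    index=pd.Index(['America', 'California', 'Oregon', 'Washington'],
                   name='province'),
)

# Map province names to codes; names missing from the lookup become NaN
codes = toy_states.index.map(state_abbrev)
has_code = codes.notnull()

# Keep only provinces that resolved to a state code
locations = codes[has_code].tolist()
z = toy_states.loc[has_code, 'designation'].tolist()

print(locations)
print(z)
```

Because `locations` and `z` are built from the same filtered rows, they stay aligned even if the set of provinces changes.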

In [18]:
# plotly.plotly moved to the chart_studio package in plotly 4 and is not needed for offline plots
import plotly.graph_objs as go
from plotly.offline import iplot, init_notebook_mode

pd.options.mode.chained_assignment = None
init_notebook_mode()

# colorscale positions must run from 0 to 1
winery_scale=[[0, 'rgb(232, 213, 255)'], [1, 'rgb(218, 188, 255)']]

data = [dict(
        type = 'choropleth',
        autocolorscale = False,
        colorscale = winery_scale,
        showscale = False,
        locations = states_name,
        locationmode = 'USA-states',
        z = num_wineries,
        marker = dict(
            line = dict(
                color = 'rgb(255, 255, 255)',
                width = 2)
            )
        )]

layout = dict(
         title = 'Wineries in US',
         geo = dict(
             scope = 'usa',
             projection = dict(type = 'albers usa'),
             countrycolor = 'rgb(255, 255, 255)',
             showlakes = True,
             lakecolor = 'rgb(255, 255, 255)')
         )

figure = dict(data = data, layout = layout)
iplot(figure)

## Conclusion 

- French wines have the highest average rating compared to wines from other countries.
- French wines also have the highest average price: about 45.5 USD per bottle. If you are looking for a bargain, try wines from Argentina or Spain, whose average prices are both below 30 USD.
- France has the largest number of wineries producing Chardonnay, and the US has the largest number producing Pinot Noir.
- Krug Champagne is the highest-rated Chardonnay (100 pts), and the Pinot Noir from the Williams Selyem winery in Russian River Valley is the highest-rated Pinot Noir. Cheers!
- A glance at the US wine map suggests that most wineries recorded in the data are on the West Coast and in the Northeast. Wine lovers can plan their trips accordingly :)

## Acknowledgement

- Dataset: Wine Reviews, provided by zackthoutt on Kaggle
- Geomap plotting idea and code from Abigail Larion's notebook