# **Final Project**

## **1.Collecting Data**

### **1.What subject is your data about?**

The data is about wine reviews from around the world. Each review records, among other things, the number of points WineEnthusiast gave the wine, the country the wine is from, the winery that made it, and the cost of a bottle.

### **2.What is the source of your data?**

My team took the dataset from [Kaggle](https://www.kaggle.com/).

### **3.Do authors of this data allow you to use like this? You can check the data license**

Yes. This is a public dataset on Kaggle that I can download, and its license permits non-commercial use as long as the authors are credited and any derived work is shared under the same license.

Data license: `CC BY-NC-SA 4.0`

### **4.How did authors collect data?**

According to the dataset description, the data was scraped from `WineEnthusiast` during the week of June 15th, 2017. The code for the scraper can be found [here](https://github.com/zackthoutt/wine-deep-learning) for anyone with more specific questions about data collection.

## 2.Exploring Data 

In [1]:
import sys
sys.executable

'c:\\Users\\asus\\miniconda3\\envs\\min_ds-env\\python.exe'

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import re
from datetime import datetime, timedelta

### **1.How many rows and how many columns?**

In [9]:
# load data
wine_df = pd.read_csv('winemag-data-130k-v2.csv', index_col = 0)
wine_df.head()

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks


In [10]:
num_rows = len(wine_df)
num_columns = len(wine_df.columns)
print('Number of rows:', num_rows)
print('Number of columns:', num_columns)

Number of rows: 129971
Number of columns: 13


### **2.What is the meaning of each row?**

**Each row is one wine review: it gives the price, country, winery, and other information about a single wine.**

### **3.Are there duplicated rows?**

In [11]:
num_duplicated_rows = len(wine_df) - len(wine_df.drop_duplicates())
print('Duplicated:', num_duplicated_rows)

Duplicated: 9983
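A minimal sketch of how duplicated rows could be inspected and dropped, shown on a small made-up frame (`toy` and its values are hypothetical, not rows from the dataset). `DataFrame.duplicated()` marks every repeat of an earlier row, so its sum should agree with the length difference computed above.

```python
import pandas as pd

# Hypothetical toy frame: the third row repeats the first one exactly
toy = pd.DataFrame({
    "title": ["Wine A 2013", "Wine B 2011", "Wine A 2013"],
    "points": [87, 90, 87],
})

# True for every row that repeats an earlier row
dup_mask = toy.duplicated()
num_dups = int(dup_mask.sum())

# Keep the first occurrence of each row and renumber the index
deduped = toy.drop_duplicates().reset_index(drop=True)

print("Duplicated:", num_dups)
print(deduped)
```

Passing `keep=False` to `duplicated()` would instead mark every copy, which can help when looking at the repeated rows side by side before deciding to drop them.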


### **4.What is the meaning of each column?**

The data consists of 13 fields:

- *Points*: the number of points WineEnthusiast rated the wine on a scale of 1-100 (though they say they only post reviews for wines that score >=80)
- *Title*: the title of the wine review, which often contains the vintage if you're interested in extracting that feature
- *Variety*: the type of grapes used to make the wine (e.g. Pinot Noir)
- *Description*: a few sentences from a sommelier describing the wine's taste, smell, look, feel, etc.
- *Country*: the country that the wine is from
- *Province*: the province or state that the wine is from
- *Region 1*: the wine growing area in a province or state (e.g. Napa)
- *Region 2*: a more specific region within a wine growing area (e.g. Rutherford inside the Napa Valley); this value is sometimes blank
- *Winery*: the winery that made the wine
- *Designation*: the vineyard within the winery where the grapes that made the wine are from
- *Price*: the cost for a bottle of the wine 
- *Taster Name*: name of the person who tasted and reviewed the wine
- *Taster Twitter Handle*: Twitter handle of the person who tasted and reviewed the wine
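The *Title* field is described as often containing the vintage. A possible way to pull it out is sketched below, assuming the vintage is a four-digit year starting with 19 or 20; the pattern and the variable names are illustrative choices, and the three titles are the ones shown in the first rows of the data.

```python
import pandas as pd

# Titles taken from the first rows displayed earlier
titles = pd.Series([
    "Nicosia 2013 Vulkà Bianco (Etna)",
    "Quinta dos Avidagos 2011 Avidagos Red (Douro)",
    "Rainstorm 2013 Pinot Gris (Willamette Valley)",
])

# Assumption: the vintage is a standalone 4-digit year beginning with 19 or 20
year_pattern = r"\b((?:19|20)\d{2})\b"

# str.extract returns the first match per title (NaN when there is none)
vintage = pd.to_numeric(titles.str.extract(year_pattern, expand=False), errors="coerce")

print(vintage.tolist())
```

Titles without a year would come out as missing values, and a winery name that itself contains a year would be picked up first, so the result is worth checking by eye on the full data.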

### 5.What is the current data type of each column? Are there columns having inappropriate data types?

In [12]:
# check current data type of each column
wine_df.dtypes

country                   object
description               object
designation               object
points                     int64
price                    float64
province                  object
region_1                  object
region_2                  object
taster_name               object
taster_twitter_handle     object
title                     object
variety                   object
winery                    object
dtype: object

**No column has an inappropriate data type: `points` and `price` are numeric, and the remaining columns hold text.**
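Because `object` is a catch-all dtype in pandas, one way to double-check that a text column really holds only strings is to look at the Python types of its non-missing values. This is a small sketch on a made-up frame; the helper name `python_types_per_column` and the `toy` frame are assumptions for illustration, and with the real data the helper would be called on `wine_df`.

```python
import pandas as pd


def python_types_per_column(df):
    """Map each column name to the set of Python types among its non-missing values."""
    return {col: {type(v) for v in df[col].dropna()} for col in df.columns}


# Hypothetical toy frame with one missing value
toy = pd.DataFrame({
    "country": ["Italy", None, "US"],
    "variety": ["White Blend", "Pinot Gris", "Riesling"],
})

types_found = python_types_per_column(toy)
print(types_found)
```

A column whose set contains more than one type (for example both `str` and `float`) would be a candidate for cleaning.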

### 6.With each numerical column, how are values distributed?
- What is the percentage of `missing values`?
- `Min`? `max`? Are they `abnormal`?

In [26]:
# get numerical columns
numerical_cols = wine_df.select_dtypes(include='number')
numerical_cols

Unnamed: 0,points,price
0,87,
1,87,15.0
2,87,14.0
3,87,13.0
4,87,65.0
...,...,...
129966,90,28.0
129967,90,75.0
129968,90,30.0
129969,90,32.0


In [39]:
# calculate percentage of missing value
missing_percentage = numerical_cols.isnull().sum() * 100 / num_rows
# calculate min value
min_val = numerical_cols.min()
# calculate max value
max_val = numerical_cols.max()
# display
data = [missing_percentage, min_val, max_val]
index_name = ['Percentage of missing value', 'Min', 'Max']
numerical_cols_df = pd.DataFrame(data, index = index_name, columns = numerical_cols.columns)
numerical_cols_df

Unnamed: 0,points,price
Percentage of missing value,0.0,6.921544
Min,80.0,4.0
Max,100.0,3300.0
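The table gives the minimum and maximum but does not by itself say whether they are abnormal. One common rule of thumb, used here as an assumption rather than as the method of this project, is to flag values outside 1.5 times the interquartile range. The sketch below applies it to a handful of prices: the values from the first rows shown earlier plus the maximum from the table, so it does not reflect the real price distribution.

```python
import numpy as np
import pandas as pd

# Illustrative sample: prices from the first rows plus the reported maximum
prices = pd.Series([np.nan, 15.0, 14.0, 13.0, 65.0, 3300.0], name="price")

# Quartiles ignore missing values by default
q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1

# Fences of the 1.5 * IQR rule
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Comparisons with NaN are False, so missing prices are never flagged
outliers = prices[(prices < lower_fence) | (prices > upper_fence)]

print("Fences:", lower_fence, upper_fence)
print(outliers.tolist())
```

A flagged price is not necessarily an error, since some bottles really are that expensive; the rule only points at values worth a closer look.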


### 7.With each categorical column, how are values distributed?
- What is the percentage of `missing values`?
- How many `different values`? Show a few
- Are they `abnormal`?

In [41]:
# get categorical columns
categorical_cols = wine_df.select_dtypes(exclude='number')
categorical_cols

Unnamed: 0,country,description,designation,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,US,"Tart and snappy, the flavors of lime flesh and...",,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks
...,...,...,...,...,...,...,...,...,...,...,...
129966,Germany,Notes of honeysuckle and cantaloupe sweeten th...,Brauneberger Juffer-Sonnenuhr Spätlese,Mosel,,,Anna Lee C. Iijima,,Dr. H. Thanisch (Erben Müller-Burggraef) 2013 ...,Riesling,Dr. H. Thanisch (Erben Müller-Burggraef)
129967,US,Citation is given as much as a decade of bottl...,,Oregon,Oregon,Oregon Other,Paul Gregutt,@paulgwine,Citation 2004 Pinot Noir (Oregon),Pinot Noir,Citation
129968,France,Well-drained gravel soil gives this wine its c...,Kritt,Alsace,Alsace,,Roger Voss,@vossroger,Domaine Gresser 2013 Kritt Gewurztraminer (Als...,Gewürztraminer,Domaine Gresser
129969,France,"A dry style of Pinot Gris, this is crisp with ...",,Alsace,Alsace,,Roger Voss,@vossroger,Domaine Marcel Deiss 2012 Pinot Gris (Alsace),Pinot Gris,Domaine Marcel Deiss


In [42]:
# calculate percentage of missing value
missing_percentage = categorical_cols.isnull().sum() * 100 / num_rows
# different values
num_diff_vals = categorical_cols.nunique()
diff_vals = categorical_cols.apply(lambda col: col.dropna().unique())
# display
data = [missing_percentage, num_diff_vals, diff_vals]
index_name = ['Percentage of missing values', 'Number of different values', 'Different values']
categorical_cols_df = pd.DataFrame(data, index = index_name, columns = categorical_cols.columns)
categorical_cols_df

Unnamed: 0,country,description,designation,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
Percentage of missing values,0.048472,0.0,28.825661,0.048472,16.347493,61.136715,20.192197,24.015357,0.0,0.000769,0.0
Number of different values,43,119955,37979,425,1229,17,19,15,118840,707,16757
Different values,"[Italy, Portugal, US, Spain, France, Germany, ...","[Aromas include tropical fruit, broom, brimsto...","[Vulkà Bianco, Avidagos, Reserve Late Harvest,...","[Sicily & Sardinia, Douro, Oregon, Michigan, N...","[Etna, Willamette Valley, Lake Michigan Shore,...","[Willamette Valley, Napa, Sonoma, Central Coas...","[Kerin O’Keefe, Roger Voss, Paul Gregutt, Alex...","[@kerinokeefe, @vossroger, @paulgwine , @wines...","[Nicosia 2013 Vulkà Bianco (Etna), Quinta dos...","[White Blend, Portuguese Red, Pinot Gris, Ries...","[Nicosia, Quinta dos Avidagos, Rainstorm, St. ..."
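In the `taster_twitter_handle` values shown above, `@paulgwine ` appears with a trailing space, which is one kind of abnormality to look for in categorical columns. A minimal sketch of such a check follows, using the three handles visible in the output; the variable names are illustrative.

```python
import pandas as pd

# Handles as they appear in the output above, plus one missing value
handles = pd.Series(["@kerinokeefe", "@vossroger", "@paulgwine ", None])

# Work on the non-missing values only
non_null = handles.dropna()

# A value is padded when stripping whitespace changes it
padded = non_null[non_null != non_null.str.strip()]

print(padded.tolist())
```

Applying `str.strip()` to the column during preprocessing would remove the padding, so that the same taster is not counted under two different handles.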


## 3. Asking meaningful questions

### Question 1

***Write question***

**Benefit:**

### Question 2

***Write question***

**Benefit:**

### Question 3

***Write question***

**Benefit:**

## 4. Preprocessing + analyzing data to answer each question

### Question 1:

### Preprocessing

#### Explanation:

In [None]:
# Code

### Data Analysis

#### Explanation:

In [None]:
# Code

### Data Visualization

In [None]:
# Code

***Comments:***

### Question 2:

### Preprocessing

#### Explanation:

In [None]:
# Code

### Data Analysis

#### Explanation:

In [None]:
# Code

### Data Visualization

In [None]:
# Code

***Comments:***

### Question 3:

### Preprocessing

#### Explanation:

In [None]:
# Code

### Data Analysis

#### Explanation:

In [None]:
# Code

### Data Visualization

In [None]:
# Code

***Comments:***

## 5.Reflection

## 6.References