# Wine Price Prediction - Data Cleansing and Feature Creation - v2

- **v1** - Initial feature set
- **v2** - Added *age* feature

## Data and Use case

[**Wine Reviews** - 130k wine reviews with variety, location, winery, price, and description](https://www.kaggle.com/zynicide/wine-reviews/home)

This dataset is available on Kaggle and contains around 130k wine reviews. The data was scraped from [WineEnthusiast](http://www.winemag.com/?s=&drink_type=wine) on November 22nd, 2017.

I plan to use this dataset to develop a model that predicts wine price for a specified set of parameters, such as wine variety, region, and desired quality. Such a model could be integrated into a mobile application that suggests a price range during wine shopping, without the need for an online search.

Let's load the transformed dataset from COS.

In [1]:
import pandas as pd

df_data_1 = pd.read_csv('wine-data-transformed.zip')
df_data_1.head()

Unnamed: 0,country,points,price,province,variety,year
0,Italy,87,,Sicily & Sardinia,White Blend,2013.0
1,Portugal,87,15.0,Douro,Portuguese Red,2011.0
2,US,87,14.0,Oregon,Pinot Gris,2013.0
3,US,87,13.0,Michigan,Riesling,2013.0
4,US,87,65.0,Oregon,Pinot Noir,2012.0


Column descriptions:
- country - The country that the wine is from
- points - The number of points WineEnthusiast rated the wine on a scale of 1-100 (though they say they only post reviews for wines that score >=80)
- price - The cost for a bottle of the wine
- province - The province or state that the wine is from
- variety - The type of grapes used to make the wine (e.g. Pinot Noir)
- year - The vintage we extracted from the review title

## Data Cleansing

In some process models, Data Cleansing is a separate task. It is closely tied to Feature Creation but also draws on findings from the Initial Data Exploration task. Because the actual data transformations are implemented in the Feature Creation deliverable, Data Cleansing is treated as part of the Feature Creation task in this process model.

### Null values

Let's check the number of null values in each column.

In [2]:
def checkMissingValues(df):
    # Count nulls per column and express them as a percentage of all rows
    total = df.isnull().sum().sort_values(ascending=False)
    percent = (df.isnull().sum() / len(df) * 100).sort_values(ascending=False)
    stats = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])

    print("Statistics for empty values in data:\n")
    # Show only the columns that actually contain nulls
    print(stats[stats['Total'] != 0])

checkMissingValues(df_data_1)

Statistics for empty values in data:

          Total   Percent
price      8388  6.992622
year       4285  3.572173
province     59  0.049185
country      59  0.049185
variety       1  0.000834


Let's drop every row that contains a null value.

In [3]:
df_data_1 = df_data_1.dropna()

### Price Outliers
Let's filter out price outliers. I hesitated to do this, but in our scenario it is highly unlikely that somebody would buy a bottle of wine costing hundreds of dollars without prior knowledge or some research.

In [4]:
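# .copy() yields an independent frame, so adding columns later does not raise SettingWithCopyWarning
df_data_1 = df_data_1[df_data_1.price <= 158].copy()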
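
The notebook does not state how the cutoff of 158 was chosen. One common way to derive an upper bound like this is Tukey's rule, Q3 + 1.5 × IQR. The sketch below applies that rule to made-up prices; `upper_fence` and the toy values are hypothetical, and the rule is not necessarily the one used for this dataset.

```python
import pandas as pd

# Toy prices, made up for illustration; not taken from the wine dataset.
prices = pd.Series([10.0, 12.0, 15.0, 18.0, 20.0, 25.0, 30.0, 40.0, 500.0])

def upper_fence(series, k=1.5):
    """Return Q3 + k * IQR, a common upper bound for flagging outliers."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    return q3 + k * (q3 - q1)

cutoff = upper_fence(prices)
# Keep only the prices at or below the fence.
filtered = prices[prices <= cutoff]
```

A fixed percentile (for example `prices.quantile(0.99)`) is another option when the price distribution is heavily skewed.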

## Feature Creation

Feature Creation and Feature Engineering are among the most important tasks in machine learning, since they strongly affect model performance. This also holds for deep learning, although to a lesser extent. Existing features can be transformed, or new features can be derived from them.

In [5]:
from sklearn import preprocessing

### Encode Categorical Text Values

In [6]:
def encodeCategorical(df, column):
    le = preprocessing.LabelEncoder()
    
    df[column + "_code"] = le.fit_transform(df[column])
    
encodeCategorical(df_data_1, 'country')
encodeCategorical(df_data_1, 'province')
encodeCategorical(df_data_1, 'variety')

df_data_1.head()

Unnamed: 0,country,points,price,province,variety,year,country_code,province_code,variety_code
1,Portugal,87,15.0,Douro,Portuguese Red,2011.0,30,106,435
2,US,87,14.0,Oregon,Pinot Gris,2013.0,39,261,421
3,US,87,13.0,Michigan,Riesling,2013.0,39,212,463
4,US,87,65.0,Oregon,Pinot Noir,2012.0,39,261,425
5,Spain,87,15.0,Northern Spain,Tempranillo-Merlot,2011.0,36,255,570
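
If the model is later used in an application, user input such as a country name has to be translated into the same codes that were used for training. The sketch below keeps an explicit label-to-code dictionary for that purpose. It swaps `LabelEncoder` for a plain dictionary that, like `LabelEncoder`, assigns codes in sorted label order; `build_label_mapping` and the toy frame are hypothetical names used only for illustration.

```python
import pandas as pd

def build_label_mapping(series):
    """Map each distinct label to an integer code, in sorted label order."""
    return {label: code for code, label in enumerate(sorted(series.unique()))}

# Toy data, made up for illustration.
toy = pd.DataFrame({"country": ["US", "Italy", "US", "Spain"]})

country_to_code = build_label_mapping(toy["country"])
toy["country_code"] = toy["country"].map(country_to_code)

# Reverse lookup, useful for turning a code back into a readable label.
code_to_country = {code: label for label, code in country_to_code.items()}
```

Storing the mapping next to the trained model would let the application encode new input consistently without refitting an encoder.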


### Encode Review Points

According to the WineEnthusiast site:
>Ratings reflect what our editors felt about a particular product. Beyond the rating, we encourage you to read the accompanying tasting note to learn about a product’s special characteristics.
>  
>98–100 Classic The pinnacle of quality.  
>94–97 Superb A great achievement.  
>90–93 Excellent Highly recommended.  
>87–89 Very Good Often good value; well recommended.  
>83–86 Good Suitable for everyday consumption; often good value.  
>80–82 Acceptable Can be employed in casual, less-critical circumstances.  
>  
>Products deemed Unacceptable (receiving a rating below 80 points) are not reviewed.

Let's map each point value to one of the six groups above.

In [7]:
def encodePoints(df, column):
    switcher = {
        80: 0, 81: 0, 82: 0,
        83: 1, 84: 1, 85: 1, 86: 1,
        87: 2, 88: 2, 89: 2,
        90: 3, 91: 3, 92: 3, 93: 3,
        94: 4, 95: 4, 96: 4, 97: 4,
        98: 5, 99: 5, 100: 5
    }
    df[column + "_code"] = df[column].apply(lambda x: switcher.get(x))
    
encodePoints(df_data_1, 'points')
df_data_1.head()

Unnamed: 0,country,points,price,province,variety,year,country_code,province_code,variety_code,points_code
1,Portugal,87,15.0,Douro,Portuguese Red,2011.0,30,106,435,2
2,US,87,14.0,Oregon,Pinot Gris,2013.0,39,261,421,2
3,US,87,13.0,Michigan,Riesling,2013.0,39,212,463,2
4,US,87,65.0,Oregon,Pinot Noir,2012.0,39,261,425,2
5,Spain,87,15.0,Northern Spain,Tempranillo-Merlot,2011.0,36,255,570,2
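
As an alternative to spelling out every score in a dictionary, the same six groups can be expressed as bin edges. This is a minimal sketch using `pd.cut` on made-up scores; the `edges` list is an assumption derived from the rating bands quoted above.

```python
import pandas as pd

# Scores at the lower and upper end of each rating band, made up for illustration.
points = pd.Series([80, 82, 83, 86, 87, 89, 90, 93, 94, 97, 98, 100])

# Right-inclusive bins: (79, 82], (82, 86], (86, 89], (89, 93], (93, 97], (97, 100]
edges = [79, 82, 86, 89, 93, 97, 100]

# labels=False returns the integer index of the bin each score falls into.
points_code = pd.cut(points, bins=edges, labels=False)
```

Scores outside the 80 to 100 range would come back as NaN, which mirrors the dictionary lookup returning no value for unknown scores.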


In [8]:
df_data_1.columns

Index(['country', 'points', 'price', 'province', 'variety', 'year',
       'country_code', 'province_code', 'variety_code', 'points_code'],
      dtype='object')

### Encode Vintage Year

Let's create a feature **age** that splits vintages into several groups:
- up to 5 years old
- up to 10 years old
- up to 20 years old
- up to 40 years old
- 40+ years old

In [9]:
def encodeYear(df, column):
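    def getAge(x):
        # Upper bounds are inclusive, so ages 10, 20 and 40 land in the intended group
        if x <= 5: return 0
        if x <= 10: return 1
        if x <= 20: return 2
        if x <= 40: return 3
        if x > 40: return 4
        return 0  # fallback for values that cannot be compared, such as NaN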
            
    df["age_code"] = df[column].apply(lambda x: getAge(2017 - x))
    
encodeYear(df_data_1, 'year')
df_data_1.head()

Unnamed: 0,country,points,price,province,variety,year,country_code,province_code,variety_code,points_code,age_code
1,Portugal,87,15.0,Douro,Portuguese Red,2011.0,30,106,435,2,1
2,US,87,14.0,Oregon,Pinot Gris,2013.0,39,261,421,2,0
3,US,87,13.0,Michigan,Riesling,2013.0,39,212,463,2,0
4,US,87,65.0,Oregon,Pinot Noir,2012.0,39,261,425,2,0
5,Spain,87,15.0,Northern Spain,Tempranillo-Merlot,2011.0,36,255,570,2,1
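
The age groups listed above can also be written as bin edges, which makes the boundary handling explicit. This is a minimal sketch using `pd.cut` on made-up vintages; `REFERENCE_YEAR` and `edges` are hypothetical names, and 2017 is assumed as the reference year because that is when the reviews were scraped.

```python
import pandas as pd

REFERENCE_YEAR = 2017  # assumption: the year the reviews were scraped

# Vintages chosen to sit on both sides of every group boundary (made up).
years = pd.Series([2016.0, 2012.0, 2011.0, 2007.0, 2006.0,
                   1997.0, 1996.0, 1977.0, 1976.0])
ages = REFERENCE_YEAR - years  # 1, 5, 6, 10, 11, 20, 21, 40, 41

# Right-inclusive bins: up to 5, up to 10, up to 20, up to 40, older than 40.
edges = [float("-inf"), 5, 10, 20, 40, float("inf")]
age_code = pd.cut(ages, bins=edges, labels=False)
```

Each bin includes its upper edge, so a 10-year-old wine lands in the "up to 10 years old" group and a 40-year-old wine in the "up to 40 years old" group.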


## Drop Unused Columns

In [10]:
df_data_1 = df_data_1.drop(['country', 'points', 'province', 'variety', 'year'], axis=1)

## Save Results

Save the processed data as a local file.

In [12]:
df_data_1.to_csv('wine-data-features.v2.zip', index=False, compression='zip')
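
A quick way to confirm that the archive was written correctly is to read it back and compare it with the frame in memory. The sketch below does this round trip on a tiny made-up frame in a temporary directory; the file name `features-demo.zip` is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Tiny stand-in for the feature frame, made up for illustration.
features = pd.DataFrame({"price": [15.0, 14.0], "points_code": [2, 2]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "features-demo.zip")  # hypothetical file name
    features.to_csv(path, index=False, compression="zip")
    # read_csv infers zip compression from the file extension.
    restored = pd.read_csv(path)

# True when values, column names and dtypes all match.
round_trip_ok = restored.equals(features)
```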