In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from math import sqrt

from feature_engine.encoding import OneHotEncoder, OrdinalEncoder, RareLabelEncoder
from lightgbm import LGBMRegressor
from sklearn.ensemble import (
    AdaBoostRegressor,
    BaggingRegressor,
    ExtraTreesRegressor,
    GradientBoostingRegressor,
    HistGradientBoostingRegressor,
    RandomForestRegressor,
)
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeRegressor, ExtraTreeRegressor

In [5]:
spain = pd.read_csv('spanish_wines_processed.csv')

### Train/Test Split

In [6]:
X_train, X_test, y_train, y_test = train_test_split(
    spain[['Winery','Region','Sub Region','Year','Type','Alcohol','Review Score','Main Variety','Variety Class']],
    spain['Price'],
    test_size=0.2,
    random_state=0,
)

X_train.shape, X_test.shape

((13455, 9), (3364, 9))

### Pipeline

In [7]:
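spain_finalpipe = Pipeline([

    ## Rare Label Encoding: group infrequent Type/Region categories (below 5% of rows) into one label
    ('rare_encoder',
     RareLabelEncoder(tol=0.05, n_categories=4, variables=['Type','Region'])),

    ## Ordered Ordinal Encoding: replace high-cardinality categories with integers ordered by mean target (Price)
    ('ordinal_encoder',
     OrdinalEncoder(encoding_method='ordered', variables=['Winery','Sub Region','Main Variety'], unseen='encode')),

    ## One Hot Encoding: dummy columns for the low-cardinality variables, dropping the last category of each
    ('encoder_rare_label',
     OneHotEncoder(variables=['Region','Type','Variety Class'], drop_last=True)),

    ## Scaling: rescale every feature to the [0, 1] range
    ('minmax_scaler',
     MinMaxScaler()),

    ## Modelling: gradient-boosted trees with L1 regularisation
    ('lgbm_regressor',
     LGBMRegressor(n_estimators=200, num_leaves=30, reg_alpha=0.5, reg_lambda=0))
])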

In [9]:
spain_finalpipe.fit(X_train,y_train)

In [13]:
# For a regressor, Pipeline.score returns R² (train first, then test)
print(spain_finalpipe.score(X_train, y_train))
print(spain_finalpipe.score(X_test, y_test))

0.7522756240718298
0.5632321685382102
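
The two values above are R² scores, and they differ by roughly 0.19. A minimal sketch of how that gap could be reported alongside RMSE, using the `mean_squared_error` and `sqrt` imports from the first cell, is shown below. The helper name `fit_report`, the synthetic data and the stand-in `GradientBoostingRegressor` are assumptions made so the example runs on its own; they are not part of the original notebook.

```python
import numpy as np
from math import sqrt
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def fit_report(model, X_train, y_train, X_test, y_test):
    """Return R² and RMSE on both splits, plus the train-minus-test R² gap."""
    report = {}
    for name, X, y in [("train", X_train, y_train), ("test", X_test, y_test)]:
        pred = model.predict(X)
        report[f"r2_{name}"] = r2_score(y, pred)
        report[f"rmse_{name}"] = sqrt(mean_squared_error(y, pred))
    report["r2_gap"] = report["r2_train"] - report["r2_test"]
    return report


# Synthetic stand-in data (NOT the wine dataset), used only so the sketch runs.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = 3.0 * X[:, 0] + rng.normal(scale=1.0, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
demo_model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
demo_report = fit_report(demo_model, X_tr, y_tr, X_te, y_te)
print(demo_report)
```

With the notebook's own objects, the call would be `fit_report(spain_finalpipe, X_train, y_train, X_test, y_test)`. RMSE is expressed in the units of `Price`, which makes the error easier to interpret than R² alone.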


### Conclusions

Comparing the R² scores on the training set (0.75) and the test set (0.56) shows that the model overfits the training data.

Some overfitting is expected from flexible tree-based ensembles such as the one used here, but a gap of this size calls for further investigation, for example stronger regularisation or cross-validated hyperparameter tuning.

Although the current model falls short of predicting wine prices accurately, the process produced valuable insights.

One important finding comes from the feature importance metrics obtained from the fitted model.

The results indicate that the winery where the wine is produced is a stronger predictor of price than the grape variety.

Additionally, ratings provided by experts appear to be stronger predictors of price than alcohol content or the year of production.

These findings are not conclusive, but they indicate where further work on this dataset should focus to improve price prediction.

Several additional factors could affect the price of wine, including pH or acidity, residual sugar, and the optimal point of consumption. These variables were not collected during the web scraping process.

The price of wine is also influenced by numerous uncontrollable factors, such as the variability in production resulting from weather and soil conditions, the unpredictable effects of ageing on wine quality and price, and economic factors like market fluctuations, supply and demand.

Given the sheer size of the wine industry, particularly in Spain, it is likely that additional data would be necessary to sufficiently train algorithms for accurately predicting wine prices.