# Linear Regression to Predict the Price of Wine

We will build a linear regression model to predict the log price of wine (LPRICE) from the following variables:

- VINT (vintage year)
- TIME_SV (time since vintage in years)
- DEGREES (average temperature in degrees Celsius during the growing season, April to September)
- WRAIN (winter rain in the months preceding the vintage, October to March)
- HRAIN (harvest rain, August to September)
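
Written out, and assuming no separate intercept term (the code in this notebook passes the predictor columns to the model as they are), the relationship to be estimated is:

$$
\text{LPRICE}_i = \beta_1\,\text{VINT}_i + \beta_2\,\text{WRAIN}_i + \beta_3\,\text{DEGREES}_i + \beta_4\,\text{HRAIN}_i + \beta_5\,\text{TIME\_SV}_i + \varepsilon_i
$$

Here $i$ indexes vintages, the $\beta_j$ are the coefficients to estimate, and $\varepsilon_i$ is the error term. These symbols are notation chosen for this sketch and are not taken from wine-analytics.pdf.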

### If you haven't already, please read wine-analytics.pdf to understand the math behind linear regression.
***

Let's import the relevant libraries and load the wine data:

In [1]:
import pandas as pd
import statsmodels.api as sm

In [2]:
wine = pd.read_csv('wine.csv')
wine.head()

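| | VINT | LPRICE | WRAIN | DEGREES | HRAIN | TIME_SV |
|---|---|---|---|---|---|---|
| 0 | 1952 | -0.99868 | 600 | 17.1167 | 160 | 31 |
| 1 | 1953 | -0.4544 | 690 | 16.7333 | 80 | 30 |
| 2 | 1954 | NaN | 430 | 15.3833 | 180 | 29 |
| 3 | 1955 | -0.80796 | 502 | 17.15 | 130 | 28 |
| 4 | 1956 | NaN | 440 | 15.65 | 140 | 27 |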


In [3]:
wine.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38 entries, 0 to 37
Data columns (total 6 columns):
VINT       38 non-null int64
LPRICE     27 non-null float64
WRAIN      38 non-null int64
DEGREES    38 non-null float64
HRAIN      38 non-null int64
TIME_SV    38 non-null int64
dtypes: float64(2), int64(4)
memory usage: 1.9 KB


In [5]:
wine.isnull().sum()

VINT        0
LPRICE     11
WRAIN       0
DEGREES     0
HRAIN       0
TIME_SV     0
dtype: int64

Our y variable, LPRICE, has 11 missing values. Let's drop those rows.

In [9]:
wine = wine.dropna()
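
`dropna()` with no arguments drops every row that has a missing value in any column. Only LPRICE has missing values here, so the result is the same, but restricting the check to the target column makes the intent explicit. A minimal sketch on the five rows shown by `wine.head()` above (the `sample` and `cleaned` names are made up for this illustration):

```python
import numpy as np
import pandas as pd

# The five rows printed by wine.head() earlier in this notebook
sample = pd.DataFrame({
    "VINT": [1952, 1953, 1954, 1955, 1956],
    "LPRICE": [-0.99868, -0.4544, np.nan, -0.80796, np.nan],
    "WRAIN": [600, 690, 430, 502, 440],
    "DEGREES": [17.1167, 16.7333, 15.3833, 17.15, 15.65],
    "HRAIN": [160, 80, 180, 130, 140],
    "TIME_SV": [31, 30, 29, 28, 27],
})

# Drop only the rows whose target value is missing
cleaned = sample.dropna(subset=["LPRICE"])

print(len(sample), "rows before,", len(cleaned), "rows after")
print(cleaned["VINT"].tolist())
```

On the full data the equivalent call would be `wine.dropna(subset=["LPRICE"])`.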

### Define the X and Y variables
***

In [10]:
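# Response: log price of the wine
y = wine.LPRICE
# Predictors: every remaining column (no intercept column is added)
x = wine.drop("LPRICE", axis=1)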
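
Note that `x` contains no column of ones, and `sm.OLS` does not add an intercept by itself. In the rows displayed by `wine.head()`, VINT and TIME_SV sum to 1983 every time, so the two columns together can stand in for an intercept. Whether that holds for every row of wine.csv is an assumption worth checking. A small sketch of that check on the three displayed rows that have a price (the `check` frame and the `totals` name are illustrative only):

```python
import pandas as pd

# VINT and TIME_SV for the displayed rows with a non-missing LPRICE
check = pd.DataFrame({
    "VINT": [1952, 1953, 1955],
    "TIME_SV": [31, 30, 28],
})

# If this sum is the same in every row, VINT and TIME_SV are perfectly
# collinear with a constant column
totals = check["VINT"] + check["TIME_SV"]
print(totals.unique())

is_constant = totals.nunique() == 1
print("VINT + TIME_SV constant:", is_constant)
```

On the full data the same check would read `(wine.VINT + wine.TIME_SV).nunique() == 1`. If it holds, the individual VINT and TIME_SV coefficients should not be interpreted separately. A more conventional design would keep one of the two columns and add an explicit intercept with `sm.add_constant`.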

### Fit the Linear Regression Model
***

In [11]:
# Ordinary least squares; sm.OLS does not add an intercept on its own
winelm = sm.OLS(y, x)
winelm_results = winelm.fit()
winelm_results.summary()

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Dep. Variable | LPRICE | R-squared | 0.828 |
| Model | OLS | Adj. R-squared | 0.796 |
| Method | Least Squares | F-statistic | 26.39 |
| Date | Tue, 12 Mar 2019 | Prob (F-statistic) | 4.06e-08 |
| Time | 15:58:37 | Log-Likelihood | -1.7963 |
| No. Observations | 27 | AIC | 13.59 |
| Df Residuals | 22 | BIC | 20.07 |
| Df Model | 4 | | |
| Covariance Type | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| VINT | -0.0061 | 0.001 | -7.195 | 0.000 | -0.008 | -0.004 |
| WRAIN | 0.0012 | 0.000 | 2.421 | 0.024 | 0.000 | 0.002 |
| DEGREES | 0.6164 | 0.095 | 6.476 | 0.000 | 0.419 | 0.814 |
| HRAIN | -0.0039 | 0.001 | -4.781 | 0.000 | -0.006 | -0.002 |
| TIME_SV | 0.0177 | 0.007 | 2.392 | 0.026 | 0.002 | 0.033 |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Omnibus | 1.969 | Durbin-Watson | 2.787 |
| Prob(Omnibus) | 0.374 | Jarque-Bera (JB) | 1.102 |
| Skew | 0.038 | Prob(JB) | 0.576 |
| Kurtosis | 2.013 | Cond. No. | 3570.0 |
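
Several numbers in the summary are linked by simple arithmetic, which makes a useful sanity check when reading it. The sketch below recomputes adjusted R-squared, AIC, and BIC from values printed in the table. It assumes the standard definitions (AIC = 2k - 2 log L, BIC = k ln n - 2 log L, with k the number of estimated coefficients) and an adjusted R-squared that treats the model as containing an intercept. Because the inputs are rounded, the results can only match the table approximately.

```python
import math

# Values read from the summary table above
n = 27              # No. Observations
df_resid = 22       # Df Residuals
k = n - df_resid    # number of estimated coefficients (5)
r2 = 0.828          # R-squared
loglik = -1.7963    # Log-Likelihood

# Adjusted R-squared penalises R-squared for the number of coefficients
adj_r2 = 1 - (1 - r2) * (n - 1) / df_resid

# Information criteria: lower is better when comparing models
aic = 2 * k - 2 * loglik
bic = k * math.log(n) - 2 * loglik

print(f"adj. R-squared ~ {adj_r2:.3f}")
print(f"AIC ~ {aic:.2f}")
print(f"BIC ~ {bic:.2f}")
```

AIC and BIC reproduce the table's 13.59 and 20.07. Adjusted R-squared lands within about 0.001 of the reported 0.796, the small gap coming from R-squared itself being rounded to three decimals.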


### Conclusion: The linear model fits this data well, with an R-squared of 0.828 (adjusted 0.796) and every coefficient significant at the 5% level