# Wine type prediction

In [None]:
import pandas as pd
import random
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import numpy as np
from collections import Counter
from sklearn import model_selection,linear_model,metrics
import seaborn as sns
from statistics import mean
from matplotlib import style

## Introduction

Alcoholic drinks are popular, and wine plays an important role among them. In this project I use a dataset from Kaggle to predict the type of a wine (red or white) from the other 12 variables: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, and quality. All variables except quality and type are continuous numerical variables. This project is published on GitHub under the name [Wine_quality_analysis](https://github.com/harrysyz99/Wine_quality_analysis).

The project has a public [GitHub repo](https://github.com/harrysyz99/Wine_quality_analysis) and uses the default MIT license. If you want to run the code directly without setting up the environment or file paths, clone the repo to your local machine.

## Dataset

This is a free, public Kaggle dataset originally named [Wine Quality](https://www.kaggle.com/datasets/rajyellow46/wine-quality). According to its description, it comes from the UCI dataset website, where it has the same name, [Wine Quality on UCI](https://archive.ics.uci.edu/ml/datasets/wine+quality); both links are also listed in the references at the end. The dataset is ordered by wine type, but the two types have different sample sizes, so I need to do some splitting first when preprocessing the data.

The next cell loads the data. Some variables contain missing values (NA), so to avoid possible conflicts I use the dropna() function to drop every row with an NA from the dataset.

In [None]:
winedataset = pd.read_csv("winequalityN.csv").dropna()
winedataset
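Before dropping rows it can help to see how many values are actually missing per column. The following is a minimal sketch on a made-up toy table (the values and the name `toy` are illustrative assumptions, not the wine data):

```python
import numpy as np
import pandas as pd

# Toy table standing in for the wine data; values are invented for illustration.
toy = pd.DataFrame({
    "type": ["white", "red", "white"],
    "pH": [3.1, np.nan, 3.3],
    "alcohol": [9.5, 10.1, np.nan],
})

# Count missing values per column before deciding to drop anything.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# dropna() removes every row that contains at least one missing value.
cleaned = toy.dropna()
print(len(toy), "rows before,", len(cleaned), "rows after")
```

The same two calls could be applied to the DataFrame returned by `pd.read_csv` to check how many rows `dropna()` removes.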

In [None]:
typecount = winedataset.groupby("type").size()
typecount

As mentioned before, the two types of wine have very different sample sizes. Before splitting the data into training and test sets, I therefore randomly select 1593 of the 4870 white wines, so that both types have the same sample size and the split is not biased toward one type.

In [None]:
plt.bar(x = ["red","white"], height=[typecount["red"],typecount["white"]], color = ["red","blue"])
plt.title("Figure 1: number of red and white wines")

In [None]:
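white_all = winedataset[winedataset.type == "white"]
red = winedataset[winedataset.type == "red"]
# Randomly pick as many white wines as there are red wines (1593 of 4870 here)
index = random.sample(range(len(white_all)), len(red))
white = white_all.iloc[index]
white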
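The sampling above gives a different subset on every run because no seed is set. As a hedged alternative, the sketch below downsamples every class to the size of the smallest one with a fixed seed. The helper name `balance_by_downsampling` and the toy data are assumptions made for illustration, not part of the original project.

```python
import pandas as pd

def balance_by_downsampling(df, label_col, seed=0):
    """Downsample every class to the size of the smallest class (hypothetical helper)."""
    n_min = df[label_col].value_counts().min()
    parts = [
        group.sample(n=n_min, random_state=seed)  # fixed seed -> reproducible subset
        for _, group in df.groupby(label_col)
    ]
    return pd.concat(parts)

# Toy data standing in for the wine table; values are invented.
toy = pd.DataFrame({
    "type": ["white"] * 6 + ["red"] * 2,
    "alcohol": [9.0, 9.5, 10.0, 10.5, 11.0, 11.5, 9.8, 10.2],
})

balanced = balance_by_downsampling(toy, "type")
print(balanced["type"].value_counts())
```

With a fixed `random_state`, rerunning the notebook selects the same white wines each time, which makes later scores comparable between runs.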

In [None]:
finaldata = pd.concat([white,red])
finaldata

In [None]:
finaldata.dtypes

In [None]:
list(finaldata.columns.values)

The dataset above is the final dataset, which contains all the information needed.

## First insight of the dataset

In this section I use a number of plots to show the trends, relationships, and statistical features of the dataset.

In [None]:
finaldata["quality"].value_counts().plot(kind='bar',
                                         color = ['yellow','black', 'red', 'green', 'blue', 'cyan', 'purple'],
                                         alpha = 0.6)
plt.xlabel("quality of the wine")
plt.ylabel("frequency")
plt.title("Figure 2: frequency of each wine quality grade")

According to the plot above, the most common quality grade is 6.

Next, I look at the relationship between wine type and each of the other variables.

In [None]:
sns.boxplot(x = "type", y ="quality",data = finaldata).set_title('Figure 3: boxplot for wine type and quality')

From the plot above, the distribution of quality is almost the same for the two types of wine, so the quality variable does not need any further processing.

In [None]:
sns.boxplot(x = "type", y ='fixed acidity',data = finaldata).set_title('Figure 4:boxplot for wine type and fixed acidity')

In [None]:
data1 = white['fixed acidity']
data2 = red['fixed acidity']
plt.figure(figsize=(8,6))
plt.hist(data1, bins=100, alpha=0.5, label="white")
plt.hist(data2, bins=100, alpha=0.5, label="red")
plt.legend()
plt.title("Figure 5: overlapping histogram between type and fixed acidity")

In [None]:
sns.boxplot(x = "type", y ='volatile acidity',data = finaldata).set_title('Figure 6:boxplot for wine type and volatile acidity')

In [None]:
data1 = white['volatile acidity']
data2 = red['volatile acidity']
plt.figure(figsize=(8,6))
plt.hist(data1, bins=100, alpha=0.5, label="white")
plt.hist(data2, bins=100, alpha=0.5, label="red")
plt.legend()
plt.title("Figure 7: overlapping histogram between type and volatile acidity")

In [None]:
sns.boxplot(x = "type", y ='citric acid',data = finaldata).set_title('Figure 8:boxplot for wine type and citric acid')

In [None]:
data1 = white['citric acid']
data2 = red['citric acid']
plt.figure(figsize=(8,6))
plt.hist(data1, bins=100, alpha=0.5, label="white")
plt.hist(data2, bins=100, alpha=0.5, label="red")
plt.legend()
plt.title("Figure 9: overlapping histogram between type and citric acid")

In [None]:
sns.boxplot(x = "type", y ='residual sugar',data = finaldata).set_title('Figure 10:boxplot for wine type and residual sugar')

In [None]:
data1 = white['residual sugar']
data2 = red['residual sugar']
plt.figure(figsize=(8,6))
plt.hist(data1, bins=100, alpha=0.5, label="white")
plt.hist(data2, bins=100, alpha=0.5, label="red")
plt.legend()
plt.title("Figure 11: overlapping histogram between type and residual sugar")

In [None]:
sns.boxplot(x = "type", y ='chlorides',data = finaldata).set_title('Figure 12:boxplot for wine type and chlorides')

In [None]:
data1 = white['chlorides']
data2 = red['chlorides']
plt.figure(figsize=(8,6))
plt.hist(data1, bins=100, alpha=0.5, label="white")
plt.hist(data2, bins=100, alpha=0.5, label="red")
plt.legend()
plt.title("Figure 13: overlapping histogram between type and chlorides")

In [None]:
sns.boxplot(x = "type", y ='free sulfur dioxide',data = finaldata).set_title('Figure 14:boxplot for wine type and free sulfur dioxide')

In [None]:
data1 = white['free sulfur dioxide']
data2 = red['free sulfur dioxide']
plt.figure(figsize=(8,6))
plt.hist(data1, bins=100, alpha=0.5, label="white")
plt.hist(data2, bins=100, alpha=0.5, label="red")
plt.legend()
plt.title("Figure 15: overlapping histogram between type and free sulfur dioxide")

In [None]:
sns.boxplot(x = "type", y ='total sulfur dioxide',data = finaldata).set_title('Figure 16:boxplot for wine type and total sulfur dioxide')

In [None]:
data1 = white['total sulfur dioxide']
data2 = red['total sulfur dioxide']
plt.figure(figsize=(8,6))
plt.hist(data1, bins=100, alpha=0.5, label="white")
plt.hist(data2, bins=100, alpha=0.5, label="red")
plt.legend()
plt.title("Figure 17: overlapping histogram between type and total sulfur dioxide")

In [None]:
sns.boxplot(x = "type", y ='density',data = finaldata).set_title('Figure 18:boxplot for wine type and density')

In [None]:
data1 = white['density']
data2 = red['density']
plt.figure(figsize=(8,6))
plt.hist(data1, bins=100, alpha=0.5, label="white")
plt.hist(data2, bins=100, alpha=0.5, label="red")
plt.legend()
plt.title("Figure 19: overlapping histogram between type and density")

In [None]:
sns.boxplot(x = "type", y ='pH',data = finaldata).set_title('Figure 20:boxplot for wine type and pH')

In [None]:
data1 = white['pH']
data2 = red['pH']
plt.figure(figsize=(8,6))
plt.hist(data1, bins=100, alpha=0.5, label="white")
plt.hist(data2, bins=100, alpha=0.5, label="red")
plt.legend()
plt.title("Figure 21: overlapping histogram between type and pH")

In [None]:
sns.boxplot(x = "type", y ='sulphates',data = finaldata).set_title('Figure 22:boxplot for wine type and sulphates')

In [None]:
data1 = white['sulphates']
data2 = red['sulphates']
plt.figure(figsize=(8,6))
plt.hist(data1, bins=100, alpha=0.5, label="white")
plt.hist(data2, bins=100, alpha=0.5, label="red")
plt.legend()
plt.title("Figure 23: overlapping histogram between type and sulphates")

In [None]:
sns.boxplot(x = "type", y ='alcohol',data = finaldata).set_title('Figure 24:boxplot for wine type and alcohol')

In [None]:
data1 = white['alcohol']
data2 = red['alcohol']
plt.figure(figsize=(8,6))
plt.hist(data1, bins=100, alpha=0.5, label="white")
plt.hist(data2, bins=100, alpha=0.5, label="red")
plt.legend()
plt.title("Figure 25: overlapping histogram between type and alcohol")

According to the plots above, quality, alcohol, and citric acid do not differ much between white and red wine, while sulphates, pH, and chlorides show only small differences. I will therefore start with some of the variables that do not differ much. I also drew the overlapping histograms to further check whether the two types share the same distribution.
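Reading differences off 24 plots is subjective, so a numeric summary can complement them. The sketch below ranks features by the gap between the class means divided by the overall standard deviation. This measure, the helper name `standardized_mean_gap`, and the toy data are assumptions chosen for illustration, not something the original analysis used.

```python
import pandas as pd

def standardized_mean_gap(df, label_col):
    """Gap between class means, in units of the overall standard deviation (hypothetical helper)."""
    means = df.groupby(label_col).mean(numeric_only=True)
    spread = df.drop(columns=label_col).std(numeric_only=True)
    gap = (means.max() - means.min()) / spread
    return gap.sort_values(ascending=False)

# Toy data: feature "a" has the same mean in both classes, feature "b" does not.
toy = pd.DataFrame({
    "type": ["white", "white", "red", "red"],
    "a": [1.0, 2.0, 1.0, 2.0],
    "b": [1.0, 2.0, 11.0, 12.0],
})

gap = standardized_mean_gap(toy, "type")
print(gap)
```

Features near the top of the ranking separate the two types most clearly; features near zero behave alike in both types.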

## Data Splitting

In [None]:
y = finaldata["type"]
trainx, testx,trainy,testy = model_selection.train_test_split(finaldata,y,test_size = 0.3)
trainx = trainx.drop(columns = "type")
testx = testx.drop(columns = "type")
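The split above is random on every run and does not force both wine types to appear in equal proportion in the training and test sets. A minimal sketch of a reproducible, stratified split follows, using the `stratify` and `random_state` parameters of `train_test_split`. The toy data and the seed value 42 are illustrative assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for the balanced wine table; values are invented.
toy = pd.DataFrame({
    "type": ["white", "red"] * 10,
    "pH": [3.0 + 0.01 * i for i in range(20)],
})

X = toy.drop(columns="type")   # features only, so the label cannot leak into X
y = toy["type"]

# stratify=y keeps the class proportions equal in both parts;
# random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

print(y_train.value_counts().to_dict())
print(y_test.value_counts().to_dict())
```

Dropping the label column before the split, as done here, avoids the two separate `drop` calls afterwards.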

In [None]:
trainx

The type variable is originally of object type. To use Lasso regression, linear regression, and logistic regression, the target variable needs to be converted into a dummy variable, as follows.

In [None]:
trainy = pd.get_dummies(trainy)
trainy

In [None]:
trainy = trainy.drop(columns = "red")

In [None]:
trainy

In [None]:
testx

In [None]:
testy = pd.get_dummies(testy)

In [None]:
testy = testy.drop(columns= "red")

In [None]:
testy
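For a two-class label, the get_dummies-then-drop steps above amount to a single 0/1 column that is 1 for white wine. A hedged sketch of a shorter equivalent on toy labels (the names `labels` and `is_white` are illustrative):

```python
import pandas as pd

# Toy labels standing in for the "type" column.
labels = pd.Series(["white", "red", "white", "red"], name="type")

# Direct encoding: 1 for white, 0 for red.
is_white = (labels == "white").astype(int)

# The two-step route used above, for comparison.
via_dummies = pd.get_dummies(labels).drop(columns="red")["white"].astype(int)

print(is_white.tolist())
print(via_dummies.tolist())
```

The direct form returns a one-dimensional Series, which is the target shape scikit-learn estimators expect.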

## Initial model

In [None]:
trainx

For the first model I use the variables that do not differ much between red and white wine: quality, alcohol, and citric acid.

In [None]:
train_new = trainx[["quality", "alcohol", "citric acid"]]
train_new

In [None]:
test_new = testx[["quality", "alcohol", "citric acid"]]
test_new

In [None]:
lgm = linear_model.LinearRegression()
lgm.fit(train_new, trainy)

In [None]:
print(lgm.coef_)

In [None]:
lgmtrain1 = lgm.score(train_new,trainy)
lgmtrain1

In [None]:
lgmtest1 = lgm.score(test_new,testy)
lgmtest1

In [None]:
MSEtrainlgm = metrics.mean_squared_error(trainy, lgm.predict(train_new))
MSEtestlgm = metrics.mean_squared_error(testy,lgm.predict(test_new))
MSEtestlgm

In [None]:
lasso = linear_model.Lasso()
lasso.fit(train_new, trainy)

In [None]:
print(lasso.coef_)

In [None]:
lassotrain1= lasso.score(train_new,trainy)

In [None]:
lassotest1 = lasso.score(test_new,testy)

In [None]:
MSEtrainlasso = metrics.mean_squared_error(trainy, lasso.predict(train_new))
MSEtestlasso = metrics.mean_squared_error(testy,lasso.predict(test_new))
MSEtestlasso

In [None]:
logit = linear_model.LogisticRegression(solver="sag")
logit.fit(train_new, trainy)

In [None]:
print(logit.coef_)

In [None]:
logittrain1 =logit.score(train_new,trainy)

In [None]:
logittest1 = logit.score(test_new,testy)

In [None]:
MSEtrainlogit = metrics.mean_squared_error(trainy, logit.predict(train_new))
MSEtestlogit = metrics.mean_squared_error(testy,logit.predict(test_new))
MSEtestlogit

In [None]:
data = {'test':[MSEtestlgm,MSEtestlogit,MSEtestlasso],'train':[MSEtrainlgm,MSEtrainlogit,MSEtrainlasso]}
pd.DataFrame(data, index =['Linear Reg', 'Logistic', 'Lasso'])
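When reading this table, note that for 0/1 labels and 0/1 predictions (as produced by the logistic model's `predict`), the MSE is simply the fraction of misclassified samples, that is, 1 minus the accuracy. For the linear and Lasso models the predictions are continuous, so their MSE is not directly an error rate. Also, in scikit-learn `score` returns R² for regressors such as `LinearRegression` and `Lasso`, but mean accuracy for `LogisticRegression`. A small sketch with made-up labels:

```python
import numpy as np
from sklearn import metrics

# Made-up 0/1 labels and hard 0/1 predictions.
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1])

mse = metrics.mean_squared_error(y_true, y_pred)
acc = metrics.accuracy_score(y_true, y_pred)

# Each wrong prediction contributes (1 - 0)^2 = 1, each correct one 0,
# so the MSE equals the misclassification rate.
print(mse, 1 - acc)
```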

In [None]:
coef = {'X1':[lgm.coef_[0][0],lasso.coef_[0],logit.coef_[0][0]],'X2':[lgm.coef_[0][1],lasso.coef_[1],logit.coef_[0][1]],'X3':[lgm.coef_[0][2],lasso.coef_[2],logit.coef_[0][2]]}
pd.DataFrame(coef, index =['Linear Reg','Lasso', 'Logistic' ])

Judging from the MSE values above, this first model is not adequate, which suggests that the variables were not chosen properly. To make the model more accurate, we need to involve more variables and remove any variables that are not suitable. To find more features of the variables, I draw more plots below.

In [None]:
def best_fit_slope_and_intercept(xs,ys):
    m = (((mean(xs)*mean(ys)) - mean(xs*ys)) /
         ((mean(xs)*mean(xs)) - mean(xs*xs)))
    
    b = mean(ys) - m*mean(xs)

    return m, b
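The function above implements the closed-form least-squares slope and intercept for a single predictor. As a sanity check, the sketch below restates the same formula with NumPy and compares it with `np.polyfit` on points that lie exactly on y = 2x + 1. The test points are invented for illustration.

```python
import numpy as np

def best_fit_slope_and_intercept(xs, ys):
    """Closed-form simple linear regression: returns (slope, intercept)."""
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    m = ((xs.mean() * ys.mean() - (xs * ys).mean())
         / (xs.mean() ** 2 - (xs * xs).mean()))
    b = ys.mean() - m * xs.mean()
    return m, b

# Points on the line y = 2x + 1.
xs = np.array([1, 2, 3, 4])
ys = np.array([3, 5, 7, 9])

m, b = best_fit_slope_and_intercept(xs, ys)
m_np, b_np = np.polyfit(xs, ys, deg=1)  # NumPy's least-squares line fit
print(m, b)
print(m_np, b_np)
```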

In [None]:
style.use('ggplot')
fig, ax = plt.subplots(figsize=(10, 8))
ax.scatter(x = finaldata["total sulfur dioxide"],y =[finaldata["free sulfur dioxide"]-finaldata["free sulfur dioxide"].mean()],c = "red",alpha = 0.3)
ax.scatter(x = finaldata["total sulfur dioxide"],y =[finaldata["sulphates"]-finaldata["sulphates"].mean()], c= "blue",alpha = 0.3)
pop_a = mpatches.Patch(color='red', label='free sulfur dioxide',alpha = 0.3)
pop_b = mpatches.Patch(color='blue', label='sulphates',alpha = 0.3)
pop_c = mpatches.Patch(color='black', label='regression line for sulphate')
pop_d = mpatches.Patch(color="orange", label = "regression line for the free sulfur dioxide")
plt.title("Figure 26: scatter plot between total sulfur dioxide and free sulfur dioxide, and sulphates")
ax.legend(handles=[pop_b,pop_a,pop_c,pop_d],prop={"size":20})
m1,b1 = best_fit_slope_and_intercept(finaldata["total sulfur dioxide"],finaldata["sulphates"])
best_fit_line = [(m1*x)+b1 for x in finaldata["total sulfur dioxide"]]
m2,b2 = best_fit_slope_and_intercept(finaldata["total sulfur dioxide"],finaldata["free sulfur dioxide"])
best_fit_line2 = [(m2*x)+b2 for x in finaldata["total sulfur dioxide"]]
plt.plot(finaldata["total sulfur dioxide"], best_fit_line, c = "black")
plt.plot(finaldata["total sulfur dioxide"],best_fit_line2,c = "orange")


According to this scatter plot, there is no relationship between sulphates and total sulfur dioxide. However, to rule out a possible scaling artifact, I draw a separate scatter plot of sulphates against total sulfur dioxide.

In [None]:
plt.scatter(x = finaldata["total sulfur dioxide"],y =[finaldata["sulphates"]-finaldata["sulphates"].mean()], c= "blue",alpha = 0.3)
plt.title("Figure 27:scatter plot of total sulfur dioxide and sulphates")
plt.xlabel("total sulfur dioxide")
plt.ylabel("sulphates")

This scatter plot confirms that the relationship between sulphates and total sulfur dioxide is very weak.

In [None]:
fig, ax = plt.subplots(figsize=(10, 8))
ax.scatter(x = finaldata["pH"],y =finaldata["fixed acidity"],c = "red",alpha = 0.3)
ax.scatter(x = finaldata["pH"],y =finaldata["volatile acidity"], c= "blue",alpha = 0.3)
ax.scatter(x = finaldata["pH"],y =finaldata["citric acid"], c= "green",alpha = 0.3)
pop_a = mpatches.Patch(color='red', label='fixed acidity',alpha = 0.3)
pop_b = mpatches.Patch(color='blue', label='volatile acidity',alpha = 0.3)
pop_c = mpatches.Patch(color='green', label='citric acid',alpha = 0.3)
pop_d = mpatches.Patch(color="orange", label=' regression line for fixed acidity',alpha = 0.3)
pop_e = mpatches.Patch(color="black", label='regression line for volatile acidity',alpha = 0.3)
pop_f = mpatches.Patch(color="purple", label='regression line for citric acid',alpha = 0.3)

m1,b1 = best_fit_slope_and_intercept(finaldata["pH"],finaldata["fixed acidity"])
best_fit_line = [(m1*x)+b1 for x in finaldata["pH"]]
m2,b2 = best_fit_slope_and_intercept(finaldata["pH"],finaldata["volatile acidity"])
best_fit_line2 = [(m2*x)+b2 for x in finaldata["pH"]]
m3,b3 = best_fit_slope_and_intercept(finaldata["pH"],finaldata["citric acid"])
best_fit_line3 = [(m3*x)+b3 for x in finaldata["pH"]]
plt.plot(finaldata["pH"], best_fit_line, c = "orange")
plt.plot(finaldata["pH"],best_fit_line2,c = "black")
plt.plot(finaldata["pH"],best_fit_line3,c = "purple")
plt.xlabel("pH")

plt.title("Figure 28: scatter plot between pH level and 3 acid concentrations")
ax.legend(handles=[pop_a,pop_b,pop_c,pop_d,pop_e,pop_f])

This scatter plot shows a similar trend between pH and volatile acidity and between pH and citric acid. As before, I draw two separate scatter plots for these two relationships to rule out a possible scaling artifact.

In [None]:
plt.scatter(x = finaldata["pH"],y =[finaldata["citric acid"]-finaldata["citric acid"].mean()], c= "green",alpha = 0.3)
plt.xlabel("pH")
plt.ylabel("citric acid")
plt.title("Figure 29:scatter plot between pH and citric acid")

In [None]:
plt.scatter(x = finaldata["pH"],y =[finaldata["volatile acidity"]-finaldata["volatile acidity"].mean()], c= "blue",alpha = 0.3)
plt.xlabel("pH")
plt.ylabel("volatile acidity")
plt.title("Figure 30:scatter plot between pH and volatile acidity")

According to the two separate scatter plots above, some relationship exists between pH and volatile acidity and between pH and citric acid, so I take these relationships into consideration when modifying the model.

## Improved model

As found previously, there are relationships between the acid variables and pH, and between sulfur dioxide and sulphates. I will therefore use "fixed acidity", "pH", "citric acid", "total sulfur dioxide", "sulphates", and "volatile acidity" as the variables of the new model.

In [None]:
trainx_final = trainx[["fixed acidity","pH","citric acid","total sulfur dioxide","sulphates","volatile acidity"]]
testx_final = testx[["fixed acidity","pH","citric acid","total sulfur dioxide","sulphates","volatile acidity"]]

In [None]:
lgm2 = linear_model.LinearRegression()
lgm2.fit(trainx_final, trainy)

In [None]:
lgm2.coef_

In [None]:
lgmtrain2 = lgm2.score(trainx_final,trainy)
lgmtrain2

In [None]:
lgmtest2= lgm2.score(testx_final,testy)
lgmtest2

In [None]:
MSEtrainlgmfinal = metrics.mean_squared_error(trainy, lgm2.predict(trainx_final))
MSEtestlgmfinal = metrics.mean_squared_error(testy,lgm2.predict(testx_final))
MSEtestlgmfinal

In [None]:
lasso2 = linear_model.Lasso()
lasso2.fit(trainx_final, trainy)

In [None]:
lasso2.coef_

In [None]:
lassotest2= lasso2.score(testx_final,testy)

In [None]:
lassotrain2 = lasso2.score(trainx_final,trainy)

In [None]:
MSEtrainlassofinal = metrics.mean_squared_error(trainy, lasso2.predict(trainx_final))
MSEtestlassofinal = metrics.mean_squared_error(testy,lasso2.predict(testx_final))
MSEtestlassofinal

Since the features in our model are unscaled, the solver "sag" in LogisticRegression is not appropriate. According to the scikit-learn API documentation, the solver "newton-cg" suits this situation, so in this new model I use that solver to fit the model.

In [None]:
logit2 = linear_model.LogisticRegression(solver="newton-cg")
logit2.fit(trainx_final, trainy)
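Switching the solver is one option. Another, shown here only as a hedged alternative and not as what this project does, is to standardize the features first, which addresses the scaling issue that the scikit-learn documentation raises for "sag". The sketch uses synthetic data with two features on very different scales. All names and values are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200

# Two synthetic features on very different scales.
x_small = rng.normal(0.0, 0.1, n)
x_large = rng.normal(100.0, 40.0, n)
X = np.column_stack([x_small, x_large])
y = (x_large > 100.0).astype(int)  # label depends only on the large-scale feature

# The scaler is fitted inside the pipeline, so it only ever sees training data.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="sag", max_iter=1000, random_state=0),
)
model.fit(X, y)
acc = model.score(X, y)
print(acc)
```

Putting the scaler in a pipeline means the same transformation is applied automatically when calling `predict` or `score` on the test set.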

In [None]:
logit2.coef_

In [None]:
logittrain2 = logit2.score(trainx_final,trainy)
logittrain2

In [None]:
logittest2 = logit2.score(testx_final,testy)
logittest2

In [None]:
MSEtrainlogitfinalfinal = metrics.mean_squared_error(trainy, logit2.predict(trainx_final))
MSEtestlogitfinalfinal = metrics.mean_squared_error(testy,logit2.predict(testx_final))
MSEtestlogitfinalfinal

In [None]:
data = {'test':[MSEtestlgm,MSEtestlogit,MSEtestlasso],'train':[MSEtrainlgm,MSEtrainlogit,MSEtrainlasso]}
old = pd.DataFrame(data, index =['Linear Reg', 'Logistic', 'Lasso'])
data_final = {'test':[MSEtestlgmfinal,MSEtestlogitfinalfinal,MSEtestlassofinal],'train':[MSEtrainlgmfinal,MSEtrainlogitfinalfinal,MSEtrainlassofinal]}
new = pd.DataFrame(data_final, index =['Linear Reg', 'Logistic', 'Lasso'])


In [None]:
old 

In [None]:
new

In [None]:
data_compare = {'test':[lgmtest1,lgmtest2,lassotest1,lassotest2,logittest1,logittest2],'train':[lgmtrain1,lgmtrain2,lassotrain1,lassotrain2,logittrain1,logittrain2]}
compare = pd.DataFrame(data_compare, index =['Linear Reg old','Linear Reg new','Lasso old','Lasso new','Logistic old','Logistic new'])
compare

In [None]:
coef = {'X1':[lgm2.coef_[0][0],lasso2.coef_[0],logit2.coef_[0][0]],'X2':[lgm2.coef_[0][1],lasso2.coef_[1],logit2.coef_[0][1]],'X3':[lgm2.coef_[0][2],lasso2.coef_[2],logit2.coef_[0][2]],'X4':[lgm2.coef_[0][3],lasso2.coef_[3],logit2.coef_[0][3]],'X5':[lgm2.coef_[0][4],lasso2.coef_[4],logit2.coef_[0][4]],'X6':[lgm2.coef_[0][5],lasso2.coef_[5],logit2.coef_[0][5]]}
pd.DataFrame(coef, index =['Linear Reg', 'Lasso', 'Logistic'])

Comparing the old models with the new ones, the overall accuracy increases, and the final accuracy is around 97% for the logistic model, which means this model is relatively good for prediction. In addition, the MSE values of the new models are smaller than those of the old models, which also indicates that the overall model accuracy is good enough.

In addition, comparing the coefficients of the new and old models, the Lasso coefficients are no longer all 0 this time, so we can conclude that the new model is better than the old one.
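One general reason a Lasso model can end up with all-zero coefficients is the penalty strength: `Lasso()` uses `alpha=1.0` by default, which is large relative to a 0/1 target. The sketch below illustrates this on synthetic data. It is an illustration of the mechanism under invented data, not a diagnosis of the models in this notebook.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Synthetic data: three standard-normal features, 0/1 target driven by the first.
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(float)

strong = Lasso(alpha=1.0).fit(X, y)   # default penalty strength
weak = Lasso(alpha=0.01).fit(X, y)    # much weaker penalty

print(strong.coef_)  # penalty dominates the weak signal in a 0/1 target
print(weak.coef_)    # the informative first feature keeps a nonzero weight
```

Trying a few smaller `alpha` values, or selecting one by cross-validation with `LassoCV`, is a common way to check whether zero coefficients reflect the data or only the penalty.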

## References
Dataset:
1. https://www.kaggle.com/datasets/rajyellow46/wine-quality/code
2. https://www.kaggle.com/code/brendangberkman/wine-type-prediction (a prediction project similar to mine; I did not copy or paraphrase any code from it, but I list it in case some of the code is similar)
3. https://archive.ics.uci.edu/ml/datasets/wine+quality

Other references:
1. https://blog.csdn.net/qq_31279347/article/details/82795405
2. https://blog.csdn.net/wanglingli95/article/details/78887771
3. https://datacarpentry.org/python-socialsci/11-joins/index.html
4. https://www.cnblogs.com/yanjy-onlyone/p/11288098.html
5. https://www.geeksforgeeks.org/how-to-convert-categorical-data-to-binary-data-in-python/
6. https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
7. https://scikit-learn.org/stable/glossary.html#term-random_state
8. https://stackoverflow.com/questions/36856428/attributeerror-function-object-has-no-attribute-bar-in-pandas
9. https://stackoverflow.com/questions/52404971/get-a-list-of-categories-of-categorical-variable-python-pandas
10. https://cmdlinetips.com/2019/03/how-to-make-grouped-boxplots-in-python-with-seaborn/
11. https://www.javatpoint.com/how-to-create-a-dataframes-in-python
12. https://moonbooks.org/Articles/How-to-add-a-legend-for-a-scatter-plot-in-matplotlib-/
13. https://pythonprogramming.net/how-to-program-best-fit-line-machine-learning-tutorial/
14. https://python-graph-gallery.com/3-control-color-of-barplots
15. https://scikit-learn.org/stable/modules/linear_model.html
16. https://git-lfs.github.com/
17. https://towardsdatascience.com/linear-regression-in-python-9a1f5f000606
18. https://stackoverflow.com/questions/42406233/how-to-add-title-to-seaborn-boxplot
19. https://stackabuse.com/change-figure-size-in-matplotlib/
20. https://www.adamsmith.haus/python/answers/how-to-change-the-font-size-of-a-matplotlib-legend-in-python
21. https://datavizpyr.com/overlapping-histograms-with-matplotlib-in-python/