<a href="https://colab.research.google.com/github/meghanakolluri/Regression-models-red-wine-dataset/blob/master/REGRESSION_on_red_wine_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**RED WINE QUALITY DATASET: REGRESSION AND CLASSIFICATION MODELS (LINEAR REGRESSION, DECISION TREES, LOGISTIC REGRESSION)**

STEP 1: *Importing libraries*.
We first import the required libraries: NumPy, pandas, and the scikit-learn modules used in the steps below.

In [0]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from math import sqrt

STEP 2: *Reading the data*


In [0]:
from google.colab import files

In [0]:
uploaded=files.upload()

Saving winequality-red.csv to winequality-red.csv


In [0]:
data = pd.read_csv('winequality-red.csv')

In [0]:
data.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


In [0]:
data.shape

(1599, 12)

STEP 3: *Selecting input and output features*, i.e., the independent and dependent variables for the regression tasks.

In [0]:
features = ['fixed acidity','volatile acidity','citric acid','residual sugar','chlorides',
            'free sulfur dioxide','total sulfur dioxide','density','pH','sulphates','alcohol']

In [0]:
target = ['quality']


Check whether there are null values in the dataset

In [0]:
data.isnull().any()

fixed acidity           False
volatile acidity        False
citric acid             False
residual sugar          False
chlorides               False
free sulfur dioxide     False
total sulfur dioxide    False
density                 False
pH                      False
sulphates               False
alcohol                 False
quality                 False
dtype: bool
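
`any()` only reports whether a column contains at least one missing value. A minimal sketch, using a small made-up DataFrame rather than the wine data, of how `isnull().sum()` would additionally give the count per column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (not the wine data) with one missing value.
toy = pd.DataFrame({
    "alcohol": [9.4, np.nan, 9.8],
    "quality": [5, 5, 6],
})

has_null = toy.isnull().any()     # True/False per column
null_counts = toy.isnull().sum()  # number of missing values per column

print(has_null)
print(null_counts)
```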

In [0]:
x = data[features]
y = data[target]

STEP 4: *Perform train_test_split*

In [0]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=200)
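
As a sanity check on the split sizes: with `test_size=0.33` and 1599 rows, scikit-learn rounds the test set size up, which is consistent with the 528 test rows reported by `y_test.describe()` further down. A small sketch on a placeholder index array (the array merely stands in for the real rows):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(1599)  # placeholder for the 1599 wine samples

train_rows, test_rows = train_test_split(rows, test_size=0.33, random_state=200)

print(len(train_rows), len(test_rows))
```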


STEP 5: *Fit on training set*

In [0]:
regressor = LinearRegression()
regressor.fit(x_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
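
A fitted `LinearRegression` exposes the learned weights through `coef_` and `intercept_`. The sketch below uses synthetic data with known weights (an assumption for illustration, not the wine data) to show that the fit recovers them:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
# Noise-free target with known weights 2 and -3 and intercept 1.
y_demo = 2.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1] + 1.0

demo_model = LinearRegression().fit(X_demo, y_demo)

print(demo_model.coef_)       # close to [2, -3]
print(demo_model.intercept_)  # close to 1
```

The same attributes on `regressor` hold one weight per entry in `features`, in the same order.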

STEP 6: *Predict on testing data*

In [0]:
y_prediction = regressor.predict(x_test)
print(y_prediction[:5])
print('*'*40)
print(y_test[:5])

[[5.6362557 ]
 [5.73580131]
 [5.51754284]
 [5.48101339]
 [5.69866009]]
****************************************
      quality
366         7
1325        6
133         6
1418        5
1258        6


In [0]:
y_test.describe()

| | quality |
|---|---|
| count | 528.0 |
| mean | 5.575758 |
| std | 0.753196 |
| min | 3.0 |
| 25% | 5.0 |
| 50% | 5.5 |
| 75% | 6.0 |
| max | 8.0 |


STEP 7: Evaluate the linear regression model using *root-mean-square error (RMSE)*. Lower values mean smaller prediction errors.

In [0]:
RMSE = sqrt(mean_squared_error(y_true=y_test, y_pred=y_prediction))
print(RMSE)

0.6053813640486719
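
RMSE is the square root of the mean squared difference between predictions and true values. As a hand check of the formula, the sketch below applies it to just the five predictions and targets printed in STEP 6; it is not expected to reproduce the full test-set value above.

```python
from math import sqrt

# First five predictions and true values, copied from the STEP 6 output.
predicted = [5.6362557, 5.73580131, 5.51754284, 5.48101339, 5.69866009]
actual = [7, 6, 6, 5, 6]

squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
rmse_first_five = sqrt(sum(squared_errors) / len(squared_errors))

print(round(rmse_first_five, 3))
```

The largest contribution comes from the first sample, a quality-7 wine predicted at about 5.64.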


**DECISION TREE**: *Fit a new regression model to the training set*

In [0]:
regressor = DecisionTreeRegressor(max_depth=50)
regressor.fit(x_train, y_train)

DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=50,
                      max_features=None, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, presort='deprecated',
                      random_state=None, splitter='best')

*Perform prediction using decision tree regressor*

In [0]:
y_prediction = regressor.predict(x_test)
y_prediction[:5]

array([7., 6., 5., 5., 6.])

In [0]:
y_test[:5]

| | quality |
|---|---|
| 366 | 7 |
| 1325 | 6 |
| 133 | 6 |
| 1418 | 5 |
| 1258 | 6 |


*Evaluate the decision tree regressor using root-mean-square error (RMSE)*

In [0]:
RMSE = sqrt(mean_squared_error(y_true=y_test, y_pred=y_prediction))

In [0]:
print(RMSE)

0.7687061147858074


When comparing two or more regression models on the same test set, the one with the lowest RMSE is preferred. Here, linear regression (RMSE of about 0.605) performs better than the decision tree (RMSE of about 0.769).
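
One plausible reason for the gap, offered here as an assumption rather than something the notebook verifies, is that a tree with `max_depth=50` is effectively unconstrained and memorises the training set. The sketch below illustrates the effect on synthetic noisy data (not the wine data) by comparing train and test RMSE for a shallow and a deep tree:

```python
import numpy as np
from math import sqrt
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(400, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.5, size=400)  # noisy signal

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33, random_state=0)

results = {}
for depth in (3, 50):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    train_rmse = sqrt(mean_squared_error(y_tr, tree.predict(X_tr)))
    test_rmse = sqrt(mean_squared_error(y_te, tree.predict(X_te)))
    results[depth] = (train_rmse, test_rmse)
    print(f"max_depth={depth}: train RMSE={train_rmse:.3f}, test RMSE={test_rmse:.3f}")
```

A training RMSE near zero combined with a much larger test RMSE is the usual sign of overfitting; running the same comparison on `x_train` and `x_test` would show whether it applies here.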

**CLASSIFICATION MODEL**

In [0]:
from sklearn.tree import DecisionTreeClassifier

*Copy the original dataset into a new DataFrame*

In [0]:
data_classifier = data.copy()

In [0]:
data_classifier.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


In [0]:
data_classifier['quality'].dtype

dtype('int64')

*Convert the task to a classification problem*

For binary classification, we add a new column called q_label. It holds 0 or 1 depending on the quality value:

1 - good (quality above 6.5, i.e. 7 or higher) and
0 - bad (quality 6 or lower)

In [0]:
data_classifier['q_label'] = (data_classifier['quality'] > 6.5)*1

In [0]:
data_classifier['q_label']

0       0
1       0
2       0
3       0
4       0
       ..
1594    0
1595    0
1596    0
1597    0
1598    0
Name: q_label, Length: 1599, dtype: int64
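
The comparison `quality > 6.5` produces a boolean Series, and multiplying by 1 turns it into 0/1 integers. A small sketch on made-up quality scores, which also shows how `value_counts()` could be used to inspect the class balance:

```python
import pandas as pd

# Hypothetical quality scores, not taken from the dataset.
quality = pd.Series([3, 5, 6, 7, 8, 5])

q_label = (quality > 6.5) * 1              # same expression as in the notebook
q_label_alt = (quality > 6.5).astype(int)  # equivalent, more explicit

print(q_label.tolist())
print(q_label.value_counts())
```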

*Selecting the input and output features for classification*

In [0]:
features = ['fixed acidity','volatile acidity','citric acid','residual sugar','chlorides',
            'free sulfur dioxide','total sulfur dioxide','density','pH','sulphates','alcohol']
target_classifier = ['q_label']

In [0]:
X = data_classifier[features]
y = data_classifier[target_classifier]


*Perform train test split*

In [0]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=324)
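
When one class is much rarer than the other, a plain random split can leave the train and test sets with different class proportions. `train_test_split` accepts a `stratify` argument to preserve them. A sketch on synthetic labels (the 14% positive rate is an arbitrary assumption, not measured from the wine data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 140 + [0] * 860)  # assumed 14% positives

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=324, stratify=y_demo
)

# Overall, train, and test positive rates should be nearly identical.
print(y_demo.mean(), y_tr.mean(), y_te.mean())
```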

*Fit on train set*

In [0]:
wine_quality_classifier = DecisionTreeClassifier(max_leaf_nodes=20, random_state=0)

In [0]:
wine_quality_classifier.fit(X_train, y_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=20,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=0, splitter='best')

*Predict on test data*

In [0]:
prediction = wine_quality_classifier.predict(X_test)
print(prediction[:5])
print('*'*10)
print(y_test['q_label'][:5])

[0 0 0 1 0]
**********
1144    0
1532    0
618     0
205     1
1384    0
Name: q_label, dtype: int64


*Measure accuracy of the classifier*

In [0]:
accuracy_score(y_true=y_test, y_pred=prediction)

0.9034090909090909
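
Accuracy alone can flatter a classifier when the classes are imbalanced, because always predicting the majority class already scores well. The sketch below uses made-up labels to show the effect and how `confusion_matrix` and `recall_score` expose it:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical labels: 8 "bad" wines and 2 "good" wines.
y_true_demo = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# A trivial model that always predicts the majority class.
y_pred_demo = [0] * len(y_true_demo)

acc = accuracy_score(y_true_demo, y_pred_demo)
rec = recall_score(y_true_demo, y_pred_demo)
cm = confusion_matrix(y_true_demo, y_pred_demo)

print(acc)  # 0.8, despite never finding a good wine
print(rec)  # 0.0
print(cm)   # rows: true class, columns: predicted class
```

Comparing the classifier's accuracy with `1 - y_test['q_label'].mean()`, the score of an always-"bad" model, would put the figure above in context.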

**LOGISTIC REGRESSION**

In [0]:
from sklearn.linear_model import LogisticRegression

In [0]:
data_classifier.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | q_label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | 0 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 | 0 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 | 0 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 | 0 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | 0 |


*Selecting the input and output features for classification tasks*

In [0]:
features = ['fixed acidity','volatile acidity','citric acid','residual sugar','chlorides',
            'free sulfur dioxide','total sulfur dioxide','density','pH','sulphates','alcohol']
target_classifier = ['q_label']

In [0]:
X = data_classifier[features]
y = data_classifier[target_classifier]

*Perform train test split*

In [0]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=324)

*Fit on train set*

In [0]:
logistic_regression = LogisticRegression()
logistic_regression.fit(X_train, y_train)

  y = column_or_1d(y, warn=True)
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
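
The two messages above come from scikit-learn: the first asks for a 1-D target, and the second reports that the `lbfgs` solver hit `max_iter` before converging. A common remedy, sketched here on synthetic data from `make_classification` (a stand-in, not the wine data), is to flatten the target with `ravel()` and standardise the features in a pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 11 features, like the wine inputs.
X_demo, y_demo = make_classification(n_samples=300, n_features=11, random_state=0)
y_column = y_demo.reshape(-1, 1)  # column-shaped target, like data[['q_label']]

# Scale features first, then fit logistic regression with a higher iteration cap.
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logreg.fit(X_demo, y_column.ravel())  # ravel() gives the expected 1-D shape

train_accuracy = scaled_logreg.score(X_demo, y_demo)
print(train_accuracy)
```

In the notebook this would correspond to passing `y_train.values.ravel()` to `fit`; the resulting accuracy may differ from the value reported below.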

*Predict on test data*

In [0]:
prediction = logistic_regression.predict(X_test)
print(prediction[:5])
print(y_test[:5])

[0 0 0 0 0]
      q_label
1144        0
1532        0
618         0
205         1
1384        0


*Measure accuracy of the classifier*

In [0]:
accuracy_score(y_true=y_test, y_pred=prediction)

0.8806818181818182
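
Both classifiers were scored on a single train/test split, so the difference between 0.903 and 0.881 may partly reflect that particular split. A sketch of comparing the two model types with 5-fold cross-validation, again on synthetic stand-in data rather than the wine file:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=500, n_features=11, random_state=0)

models = {
    "decision tree": DecisionTreeClassifier(max_leaf_nodes=20, random_state=0),
    "logistic regression": LogisticRegression(max_iter=1000),
}

cv_scores = {}
for name, model in models.items():
    # One accuracy value per fold; the spread hints at split-to-split variation.
    cv_scores[name] = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    print(f"{name}: mean={cv_scores[name].mean():.3f}, std={cv_scores[name].std():.3f}")
```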