
**Project Description**
The dataset is related to red and white variants of the Portuguese "Vinho Verde" wine. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

This dataset can be viewed as a **classification task**. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones). Also, we are not sure whether all input variables are relevant, so it could be interesting to test feature selection methods.


**Attribute Information**
Input variables (based on physicochemical tests):
1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol
Output variable (based on sensory data):
12 - quality (score between 0 and 10)
An interesting thing to do is to set an arbitrary cutoff for the dependent variable (wine quality), e.g. 7 or higher is classified as 'good/1' and the remainder as 'not good/0'.
This lets you practice hyperparameter tuning on e.g. decision tree algorithms while looking at the ROC curve and the AUC value.
You need to build a classification model.
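
The cutoff idea described above can be sketched as follows. This is only an illustration on synthetic data (the `demo` frame, its single `alcohol` feature, and the noise model are assumptions, not the real wine table); it shows the binarization at 7 and how an AUC value is obtained from a decision tree's predicted probabilities.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the wine table (NOT the real data): one informative
# feature and an integer quality score clipped to the range 3..8.
rng = np.random.default_rng(0)
alcohol = rng.normal(10.4, 1.0, size=500)
raw_score = 5.6 + 0.8 * (alcohol - 10.4) + rng.normal(0, 0.7, size=500)
quality = np.clip(np.round(raw_score), 3, 8).astype(int)
demo = pd.DataFrame({"alcohol": alcohol, "quality": quality})

# Cutoff: 7 or higher -> good (1), everything else -> not good (0).
demo["good"] = (demo["quality"] >= 7).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    demo[["alcohol"]], demo["good"],
    test_size=0.25, stratify=demo["good"], random_state=0,
)

# A shallow tree; max_depth is the kind of hyperparameter one would tune.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

# AUC is computed from the predicted probability of the positive class.
auc = roc_auc_score(y_te, tree.predict_proba(X_te)[:, 1])
print(f"share of good wines: {demo['good'].mean():.2f}, test AUC: {auc:.2f}")
```

Using `stratify` keeps the share of good wines similar in both splits, which matters when the positive class is small.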


In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix,classification_report,accuracy_score
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')

In [None]:
url = "https://raw.githubusercontent.com/nallagondu/ML-Datasets/main/Red%20Wine/winequality-red.csv"

df = pd.read_csv(url)
df


In [None]:
#checking Dimension of Dataset
df.shape

In [None]:
df.columns

In [None]:
df.columns.tolist()

In [None]:
df.dtypes

In [None]:
df.isnull().sum()

No missing values were found in the dataset.

In [None]:
#Basic information about data
df.info()

In [None]:
df.info(verbose=False)

In [None]:
sns.heatmap(df.isnull())

In [None]:
df['quality'].unique()

In [None]:
df['quality'].nunique()

In [None]:
for i in df.columns:
  print(df[i].value_counts())
  print("\n")

These are the value counts of all columns, and we can see that no column contains blanks.

In [None]:
df.shape[0]

In [None]:
df.duplicated().any()

In [None]:
df[df.duplicated()==True].shape[0]

We found 240 duplicate records, so we drop them.

In [None]:
df.drop_duplicates(inplace=True)

In [None]:
df[df.duplicated()==True].shape[0]


In [None]:
df.describe()

The minimum wine quality is 3 and the maximum is 8.

The mean alcohol content of the red wines in the data is about 10.4.

There is a large gap between the mean and the maximum of **total sulfur dioxide** (mean 46.83, max 289.0) and of **free sulfur dioxide** (mean 15.89, max 72.0), which points to right-skewed distributions with outliers.
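
A large gap between mean and maximum can be checked more systematically with the 1.5 x IQR rule. The sketch below uses a synthetic right-skewed series (the lognormal parameters and the series name are assumptions chosen only for illustration) and counts the values above the upper fence.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed values standing in for a sulfur dioxide column.
rng = np.random.default_rng(1)
s = pd.Series(rng.lognormal(mean=3.5, sigma=0.7, size=1000),
              name="total sulfur dioxide (synthetic)")

# Quartiles and the interquartile range.
q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1

# Values above Q3 + 1.5 * IQR are flagged as high outliers.
upper = q3 + 1.5 * iqr
n_high = int((s > upper).sum())

print(f"mean={s.mean():.1f}, max={s.max():.1f}, "
      f"upper fence={upper:.1f}, values above fence={n_high}")
```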

In [None]:
plt.figure(figsize=(14,10))
qty = df['quality'].value_counts()
sns.barplot(x=qty.index,y=qty.values,order=qty.index,palette='Dark2')
plt.title("Feature Distributions",fontsize = 14)
for index,value in enumerate(qty.values):
  plt.text(index,value, value, fontsize=14)
plt.tight_layout()
plt.show()


Most wines have a quality of 5 or 6, so the target feature is clearly imbalanced.

We will therefore apply a technique for handling class imbalance.
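
One way to quantify the imbalance is to look at class shares and at the "balanced" class weights that scikit-learn derives from them. This is a sketch on a made-up 900/100 label vector (`y_demo` is hypothetical); it is not the resampling approach used later in this notebook, only a complementary view.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up binary target: 900 samples of class 0 and 100 of class 1.
y_demo = np.array([0] * 900 + [1] * 100)
classes = np.unique(y_demo)

# Share of each class in the data.
share = np.bincount(y_demo) / len(y_demo)

# "balanced" weights are n_samples / (n_classes * class_count),
# so the rare class gets a proportionally larger weight.
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)

print("shares :", dict(zip(classes.tolist(), share.round(2).tolist())))
print("weights:", dict(zip(classes.tolist(), weights.round(3).tolist())))
```

Many scikit-learn classifiers accept `class_weight="balanced"` directly, which applies the same weighting during training.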



In [None]:
ax = sns.countplot(x='quality',data = df)
print(df['quality'].value_counts())

In [None]:
#checking for numerical columns
numerical_col = []
for i in df.dtypes.index:
  if df.dtypes[i]!="object":
    numerical_col.append(i)
print("Numerical columns:", numerical_col)

In [None]:
#Distribution of the numerical columns
#Let's check how the data is distributed in the first four columns (2x2 grid)
plt.figure(figsize = (10, 6), facecolor = "white")
plotnumber = 1
for col in numerical_col:
  if plotnumber<=4:
    ax = plt.subplot(2,2, plotnumber)
    sns.histplot(df[col], kde = True, color = "m")  # replaces the deprecated sns.distplot
    plt.xlabel(col, fontsize = 12)
    plt.yticks(rotation = 0,fontsize = 10)
  plotnumber+=1
plt.tight_layout()

In [None]:
#Let's check for outliers by plotting box plots
plt.figure(figsize = (10, 6), facecolor = "white")
plotnumber = 1
for col in numerical_col:
  if plotnumber<=4:
    ax = plt.subplot(2,2, plotnumber)
    sns.boxplot(df[col], palette = "Dark2")
    plt.xlabel(col, fontsize = 15)
    plt.yticks(rotation = 0, fontsize = 10)
  plotnumber+=1
plt.tight_layout()



In [None]:
df.skew()

In [None]:
def numerical_plt(column):
    plt.figure(figsize=(13.5,10))
    plt.subplot(2,1,1)
    sns.boxplot(x="quality",y=column, data=df, palette="Dark2")
    plt.title(f"{column.title()} vs Quality Analysis",fontweight="black",size=25,pad=10,)

    plt.subplot(2,1,2)
    sns.histplot(x=column,kde=True,hue="quality",data=df, palette="bright")
    skew = df[column].skew()
    plt.title(f"Skewness of {column.title()} Feature is: {round(skew,3)}",fontweight="black",size=20,pad=10)
    plt.tight_layout()
    plt.show()

In [None]:
#df['fixed acidity'] = np.cbrt(df['fixed acidity'])


In [None]:
#df['residual sugar'] = np.cbrt(df['residual sugar'])

In [None]:
#df['chlorides'] = np.cbrt(df['chlorides'])

In [None]:
df.skew()

In [None]:
numerical_plt("fixed acidity")

In [None]:
numerical_plt('volatile acidity')

In [None]:
numerical_plt('citric acid')

In [None]:
numerical_plt('residual sugar')

In [None]:
numerical_plt('chlorides')

In [None]:
numerical_plt('free sulfur dioxide')

In [None]:
numerical_plt('total sulfur dioxide')

In [None]:
numerical_plt('density')

In [None]:
numerical_plt('pH')

In [None]:
numerical_plt('sulphates')

In [None]:
numerical_plt('alcohol')

In [None]:
#Correlation between the input features
columns = df.columns.tolist()
columns.remove('quality') # quality is the discrete target, so it is excluded


In [None]:
correlation  = df[columns].corr()
plt.figure(figsize=(14,10))
sns.heatmap(correlation, fmt=".2g", annot=True, cmap='YlOrRd_r',linewidths = .5 )
plt.tight_layout()
plt.show()

Several features are highly correlated with other features.
Fixed acidity is highly correlated with citric acid and density.
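
Reading strong pairs off a heatmap can be backed up by listing them programmatically. The following is a minimal sketch on synthetic columns (`a`, `b`, `c` and the 0.6 threshold are assumptions for illustration), where `b` is built to correlate strongly with `a`.

```python
from itertools import combinations

import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
base = rng.normal(size=300)
demo = pd.DataFrame({
    "a": base,
    "b": base * 0.9 + rng.normal(scale=0.3, size=300),  # strongly tied to "a"
    "c": rng.normal(size=300),                            # independent of both
})

corr = demo.corr()

# Visit each unordered pair of columns once and keep its correlation.
pairs = {(left, right): corr.loc[left, right]
         for left, right in combinations(corr.columns, 2)}

# Keep only pairs whose absolute correlation exceeds the chosen threshold.
strong = {pair: r for pair, r in pairs.items() if abs(r) > 0.6}

for (left, right), r in strong.items():
    print(f"{left} vs {right}: r = {r:.2f}")
```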


In [None]:
df.describe()

In [None]:
df.skew()

In [None]:
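#Perform a one-way ANOVA test to analyze feature importance for wine quality
from scipy import stats

f_scores = {}
p_values = {}

for column in columns:
  # Compare the feature's values across the quality classes (one group per score)
  groups = [group[column].values for _, group in df.groupby('quality')]
  f_score,p_value = stats.f_oneway(*groups)
  f_scores[column] = f_score
  p_values[column] = p_value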

In [None]:
plt.figure(figsize=(15,6))
keys = list(f_scores.keys())
values = list(f_scores.values())
sns.barplot(x=keys,y=values, palette = 'pastel')
plt.title("ANOVA Test", fontweight = "bold",size = 20,pad = 15)
plt.xticks(rotation = 90)

for index,value in enumerate(values):
  plt.text(index,value,int(value), ha= "center", va = 'bottom',fontweight="bold",size = 15)
plt.show()

In [None]:
#ANOVA TEST Comparing F_Score and P_values

df_test = pd.DataFrame({"Features":keys,"F_score": values})
df_test["P_values"] = list(p_values.values())

In [None]:
df_test
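
For a one-way ANOVA used as a feature score, the groups are the feature's values split by the class label. The sketch below, on synthetic data with three made-up classes, computes the F statistic that way with `scipy.stats.f_oneway` and compares it with scikit-learn's `f_classif`, which computes the same per-feature statistic.

```python
import numpy as np
from scipy import stats
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(6)
labels = np.repeat([0, 1, 2], 50)                      # three made-up quality classes
informative = rng.normal(loc=labels * 1.0, scale=1.0)  # mean shifts with the class
noise = rng.normal(size=labels.size)                   # unrelated to the class
X_demo = np.column_stack([informative, noise])

# One-way ANOVA per feature: split the feature's values by class label.
manual_f = [
    stats.f_oneway(*[X_demo[labels == c, j] for c in np.unique(labels)]).statistic
    for j in range(X_demo.shape[1])
]

# f_classif returns one F statistic and one p-value per feature.
sk_f, sk_p = f_classif(X_demo, labels)

print("manual :", np.round(manual_f, 2))
print("sklearn:", np.round(sk_f, 2))
```

A feature whose mean differs between classes gets a large F statistic, while an unrelated feature stays near 1.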

In [None]:
#Computing Skewness of Each Numerical Attributes.

df1 = df.copy()
columns = df.columns.tolist()
columns.remove("quality")

In [None]:
skew_df = df[columns].skew().to_frame().rename(columns={0:"skewness"})
skew_df

Analysis of Skewness Values

Fixed acidity (0.941041): Moderately skewed to the right.

Volatile acidity (0.729279): Moderately skewed to the right.

Citric acid (0.312726): Slightly skewed to the right, close to symmetric.

Residual sugar (4.548153): Highly skewed to the right.

Chlorides (5.502487): Highly skewed to the right.

Free sulfur dioxide (1.226579): Highly skewed to the right.

Total sulfur dioxide (1.540368): Highly skewed to the right.

Density (0.044778): Nearly symmetric.

pH (0.232032): Nearly symmetric.

Sulphates (2.406505): Highly skewed to the right.

Alcohol (0.859841): Moderately skewed to the right.

To handle skewness, especially in highly skewed features, we can apply one of the following transformations:

Log Transformation: Apply a logarithmic transformation to reduce right skewness.

Square Root Transformation: Apply a square root transformation to reduce moderate right skewness.

Box-Cox Transformation: A more flexible transformation that can handle both positive and negative skewness.
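
The effect of these transformations can be compared directly by measuring skewness before and after. This is a small sketch on a synthetic, strictly positive, right-skewed sample (`x_demo` is an assumption; note that `scipy.stats.boxcox` requires strictly positive input).

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed, strictly positive sample.
rng = np.random.default_rng(3)
x_demo = rng.lognormal(mean=0.0, sigma=1.0, size=2000)

# Box-Cox estimates its lambda from the data when none is given.
boxcox_values, fitted_lambda = stats.boxcox(x_demo)

skews = {
    "raw": stats.skew(x_demo),
    "log": stats.skew(np.log(x_demo)),
    "sqrt": stats.skew(np.sqrt(x_demo)),
    "box-cox": stats.skew(boxcox_values),
}

for name, value in skews.items():
    print(f"{name:8s} skew = {value:.3f}")
print(f"fitted Box-Cox lambda = {fitted_lambda:.3f}")
```

On data like this the square root reduces the skew only partly, while the log and Box-Cox results end up much closer to symmetric.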

##**Skewness Transformation Analysis Using Different Techniques**

In [None]:
from scipy import special, stats  # special provides boxcox1p

skewness_transformation = {}

for col in columns:
    transformed_log = np.log(df[col])  # Log Transformation (gives -inf where a value is 0)
    transformed_boxcox = special.boxcox1p(df[col], 0.15)     # Box-Cox Transformation with lambda=0.15
    transformed_inverse = 1 / df[col]   # Inverse Transformation (gives inf where a value is 0)
    transformed_cbrt = np.cbrt(df[col]) # Cube Root Transformation
    transformed_yeojohnson, _ = stats.yeojohnson(df[col])
    # Create a dictionary for the skewness values of each transformation
    transformation_skewness = {
        "Log Transformation": stats.skew(transformed_log),
        "Box-Cox Transformation": stats.skew(transformed_boxcox),
        "Inverse Transformation": stats.skew(transformed_inverse),
        "Cube Root Transformation": stats.skew(transformed_cbrt),
        "Yeo Johnson Transformation": stats.skew(transformed_yeojohnson)}

    # Store the transformation skewness values for the column
    skewness_transformation[col] = transformation_skewness

In [None]:
df_ref = pd.DataFrame.from_dict(skewness_transformation, orient= 'index')
df_ref = pd.concat([skew_df["skewness"],  df_ref], axis=1)
df_ref

**Here I'm using the Yeo-Johnson transformation**

In [None]:
for col in columns:
  transformed_col,_ = stats.yeojohnson(df[col])
  df[col]= transformed_col

In [None]:
df.sample(4)

In [None]:
#for col in columns:
#  if df[col].min() < 0:
  #  df[col] = df[col] + abs(df[col].min()) +1

 # transformed_col, _ = stats.boxcox(df[col])
 # df[col]= transformed_col

##**Splitting the Target Variable into Two Groups**

In [None]:
skewed_features = df.skew().sort_values(ascending=False)
print(skewed_features)

In [None]:
df["quality"].unique()

In [None]:
# Here I divide quality into two groups: Bad (6 or lower) and Good (7 or higher)
qty_test = [0,6.5,10]
grouped_by = ['Bad', 'Good']
df["quality"] = pd.cut(df["quality"],bins = qty_test,labels = grouped_by)
df["quality"].unique()

In [None]:
df.head()

In [None]:
df.head(15)

In [None]:
#Encoding the target variable as integers (Bad -> 0, Good -> 1)
df["quality"] = df["quality"].map({"Bad":0,"Good":1}).astype(int)
df.head(5)

In [None]:
df.sample(10)

In [None]:
x = df.drop('quality',axis=1)
y = df['quality']

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x = pd.DataFrame(scaler.fit_transform(x), columns = x.columns)
x

In [None]:
#Finding the variance inflation factor of each scaled column, i.e. VIF_i = 1/(1 - R_i^2)
from  statsmodels.stats.outliers_influence import variance_inflation_factor
vif = pd.DataFrame()
vif["VIF values"] = [variance_inflation_factor(x.values, i) for i  in range (len(x.columns))]
vif["Features"] = x.columns
vif
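
To make the formula behind the VIF concrete: for each column, regress it on all other columns, take the resulting R², and compute 1/(1 - R²). The sketch below does this with scikit-learn on synthetic data (the helper `vif_table` and the columns `a`, `b`, `c` are hypothetical). It fits an intercept, so its numbers match `variance_inflation_factor` only when the inputs are centered, as they are after standard scaling.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def vif_table(frame):
    """Return one VIF per column, computed as 1 / (1 - R^2)."""
    rows = []
    for col in frame.columns:
        others = frame.drop(columns=col)
        # R^2 of predicting this column from all remaining columns.
        r2 = LinearRegression().fit(others, frame[col]).score(others, frame[col])
        rows.append({"Features": col, "VIF values": 1.0 / (1.0 - r2)})
    return pd.DataFrame(rows)


rng = np.random.default_rng(4)
a = rng.normal(size=400)
demo = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.2, size=400),  # nearly a copy of "a"
    "c": rng.normal(size=400),                 # independent
})

print(vif_table(demo))
```

Columns that can be predicted well from the others (here `a` and `b`) get large values, while an independent column stays close to 1.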

In [None]:
#Dropping the fixed acidity column to reduce multicollinearity
x.drop("fixed acidity", axis = 1, inplace = True )

In [None]:
# Again checking VIF values to confirm whether the multicollinearity still exists or not

vif = pd.DataFrame()
vif["VIF values"] = [variance_inflation_factor(x.values, i) for i  in range (len(x.columns))]
vif["Features"] = x.columns
vif

In [None]:
"""So, we have solved the multicollinearity issue. We can now move ahead with model building."""

In [None]:
!pip install imbalanced-learn

In [None]:
from imblearn.over_sampling import SMOTE
SM = SMOTE()
x1,y1 = SM.fit_resample(x, y)

In [None]:
# Checking values count of target column
y.value_counts()

In [None]:
# Checking values count of target column
y1.value_counts()
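
The idea behind SMOTE is to create new minority samples by interpolating between a minority point and one of its nearest minority neighbours. The following is a simplified sketch of that interpolation step only (the function `smote_like` and the synthetic `minority` array are hypothetical, and this is not the imbalanced-learn implementation).

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_like(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(minority)
    # Column 0 is each point itself (distance 0), so it is dropped.
    neighbor_idx = nn.kneighbors(minority, return_distance=False)[:, 1:]

    # Pick a random base point and one of its k neighbours for every new sample.
    base = rng.integers(0, len(minority), size=n_new)
    picked = neighbor_idx[base, rng.integers(0, k, size=n_new)]

    # Move a random fraction of the way from the base point to the neighbour.
    gap = rng.random((n_new, 1))
    return minority[base] + gap * (minority[picked] - minority[base])


rng = np.random.default_rng(5)
minority = rng.normal(loc=2.0, size=(30, 2))  # 30 made-up minority samples
synthetic = smote_like(minority, n_new=70)
print(synthetic.shape)
```

Because every synthetic point lies on a segment between two existing minority points, the new samples stay inside the region already covered by the minority class.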

**Finding the best random state**

In [None]:
from sklearn.model_selection  import train_test_split
from sklearn.ensemble  import RandomForestClassifier
from sklearn.metrics  import accuracy_score

maxAccu = 0
maxRS = 0
for i in range(1, 200):
  x_train,x_test,y_train,y_test = train_test_split(x1,y1,test_size = 0.25,random_state = i)
  RFR = RandomForestClassifier()
  RFR.fit(x_train, y_train)
  pred = RFR.predict(x_test)
  acc = accuracy_score(y_test, pred)
  if acc>maxAccu:
    maxAccu = acc
    maxRS = i
print("best accuracy is ", maxAccu, "at random_state", maxRS)

In [None]:
x_train,x_test,y_train,y_test = train_test_split(x1,y1,test_size = 0.30,random_state = maxRS)  # split the SMOTE-resampled data used in the search above

**Classification Algorithms**

In [None]:
from  sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier ,AdaBoostClassifier,BaggingClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, accuracy_score
from sklearn.model_selection import cross_val_score

##Random Forest Classifier

In [None]:
# Checking Accuracy for Random ForestClassifier
RFC = RandomForestClassifier()
RFC.fit(x_train,y_train)
predRFC = RFC.predict(x_test)
print(accuracy_score(y_test,predRFC))
print("Confusion Matrix" , confusion_matrix(y_test,predRFC))
print(classification_report(y_test,predRFC))

**Logistic Regression**

In [None]:
# Checking Accuracy for Logistic Regression
LR = LogisticRegression()
LR.fit(x_train,y_train)
predLR = LR.predict(x_test)
print(accuracy_score(y_test,predLR))
print("Confusion Matrix" , confusion_matrix(y_test,predLR))
print(classification_report(y_test,predLR))

**SVC - Support Vector Machine Classifier**

In [None]:
svc  = SVC()
svc.fit(x_train,y_train)
predsvc = svc.predict(x_test)
print(accuracy_score(y_test,predsvc))
print("Confusion Matrix" , confusion_matrix(y_test,predsvc))
print(classification_report(y_test,predsvc))

**Gradient Boosting Classifier**

In [None]:
GB  = GradientBoostingClassifier()
GB.fit(x_train,y_train)
predGB = GB.predict(x_test)
print(accuracy_score(y_test,predGB))
print("Confusion Matrix" , confusion_matrix(y_test,predGB))
print(classification_report(y_test,predGB))

**AdaBoost Classifier**

In [None]:
# Check accuracy for AdaBoost Classifier

ABC  = AdaBoostClassifier()
ABC.fit(x_train,y_train)
predABC = ABC.predict(x_test)
print(accuracy_score(y_test,predABC))
print("Confusion Matrix" , confusion_matrix(y_test,predABC))
print(classification_report(y_test,predABC))

**Extra Trees Classifier**

In [None]:
ET  = ExtraTreesClassifier()
ET.fit(x_train,y_train)
predET = ET.predict(x_test)
print(accuracy_score(y_test,predET))
print("Confusion Matrix" , confusion_matrix(y_test,predET))
print(classification_report(y_test,predET))

##Bagging Classifier

In [None]:

BC  = BaggingClassifier()
BC.fit(x_train,y_train)
predBC = BC.predict(x_test)
print(accuracy_score(y_test,predBC))
print("Confusion Matrix" , confusion_matrix(y_test,predBC))
print(classification_report(y_test,predBC))

**Cross Validation Score**

In [None]:
from sklearn.model_selection import cross_val_score

In [None]:
# Checking CV Score for Random Forest Classifier

score = cross_val_score(RFC,x1,y1)
print(score)
print(score.mean())
print("Difference between accuracy score and cross validation score is : ", accuracy_score(y_test,predRFC) - score.mean())

In [None]:
#Checking CV score for Logistic Regression
score = cross_val_score(LR,x1,y1)
print(score)
print(score.mean())
print("Difference between accuracy score and cross validation score is : ", accuracy_score(y_test,predLR) - score.mean())

In [None]:
#Checking CV score for Gradient Boosting Classifier
score = cross_val_score(GB,x1,y1)
print(score)
print(score.mean())
print("Difference between accuracy score and cross validation score is : ", accuracy_score(y_test,predGB) - score.mean())

In [None]:
#Checking CV score for SVC
score = cross_val_score(svc,x1,y1)
print(score)
print(score.mean())
print("Difference between accuracy score and cross validation score is : ", accuracy_score(y_test,predsvc) - score.mean())

In [None]:
#Checking CV score for Ada Boosting Classifier
score = cross_val_score(ABC,x1,y1)
print(score)
print(score.mean())
print("Difference between accuracy score and cross validation score is : ", accuracy_score(y_test,predABC) - score.mean())

In [None]:
#Checking CV score for Bagging Classifier
score = cross_val_score(BC,x1,y1)
print(score)
print(score.mean())
print("Difference between accuracy score and cross validation score is : ", accuracy_score(y_test,predBC) - score.mean())

In [None]:
#Checking CV score for ExtraTree Classifier
score = cross_val_score(ET,x1,y1)
print(score)
print(score.mean())
print("Difference between accuracy score and cross validation score is : ", accuracy_score(y_test,predET) - score.mean())
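
The per-model cross-validation cells above all follow the same pattern, which can also be written as one loop over a dictionary of models. This is a sketch on synthetic data from `make_classification` (the data, the two models, and the `cv_means` dictionary are assumptions for illustration); it uses a stratified, seeded splitter so the folds are reproducible.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced binary problem standing in for the wine data.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, weights=[0.85, 0.15], random_state=0
)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Stratified folds keep the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

cv_means = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
    cv_means[name] = scores.mean()
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

Reporting the standard deviation next to the mean shows how much the score varies between folds.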

In [None]:
#Extra Trees Classifier
from sklearn.model_selection import GridSearchCV

perameters = { 'criterion': ['gini','entropy'],
              'random_state': [10,50,1000],
               'max_depth': [None,10,20],  # 0 is not a valid depth; None means no limit
               'n_jobs': [-2, -1,1],
               'n_estimators' : [50,100,200,300]}

In [None]:
GCV = GridSearchCV(ExtraTreesClassifier(), perameters, cv = 5)

In [None]:
GCV.fit(x_train,y_train)

In [None]:
GCV.best_params_
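
After a grid search, the refitted winner is available as `best_estimator_` (with the default `refit=True`), so the best parameters do not have to be retyped by hand. A minimal sketch on synthetic data, with a deliberately small and hypothetical grid (`small_grid`) to keep it fast:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data standing in for the prepared wine features.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# Only real hyperparameters are searched; the seed is fixed on the estimator.
small_grid = {"max_depth": [None, 5], "n_estimators": [25, 50]}
search = GridSearchCV(ExtraTreesClassifier(random_state=0), small_grid, cv=3)
search.fit(X_tr, y_tr)

# best_estimator_ is already refitted on the whole training set.
best_model = search.best_estimator_
test_acc = accuracy_score(y_te, best_model.predict(X_te))
print(search.best_params_, round(test_acc, 3))
```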

In [None]:
Final_model = ExtraTreesClassifier(criterion= 'gini',max_depth = 20,n_estimators=100, n_jobs = -2, random_state=10)
Final_model.fit(x_train,y_train)
pred_final = Final_model.predict(x_test)
acc = accuracy_score(y_test,pred_final)
print(acc * 100)

In [None]:
from sklearn.tree import DecisionTreeClassifier

dtree = DecisionTreeClassifier()

In [None]:
param_grid = {"max_depth":[3,5,10, None],
              "min_samples_split":[2,5,10],
              "min_samples_leaf":[1,2,4],
              "criterion":["gini","entropy"]
}

In [None]:
dtree_grid_search  = GridSearchCV(dtree, param_grid, cv = 5, n_jobs = -1,scoring='accuracy')

In [None]:
dtree_grid_search.fit(x_train,y_train)

In [None]:
best_parameters_Dtree = dtree_grid_search.best_params_
best_parameters_Dtree

In [None]:
Final_model_dtree  = DecisionTreeClassifier(criterion= 'gini',max_depth = 5,min_samples_leaf= 4, min_samples_split=5)
Final_model_dtree.fit(x_train,y_train)

In [None]:
dtree_pred_final = Final_model_dtree.predict(x_test)
dtree_acc = accuracy_score(y_test,dtree_pred_final)
print(dtree_acc * 100)

In [None]:

rfc = RandomForestClassifier()

In [None]:
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'criterion': ['gini', 'entropy']
}

In [None]:
rsc_grid_search = GridSearchCV(estimator=rfc, param_grid=param_grid, cv=5, n_jobs=-1, scoring='accuracy')

In [None]:
rsc_grid_search.fit(x_train,y_train)

In [None]:
best_parameters_rfc = rsc_grid_search.best_params_
best_parameters_rfc

In [None]:
Final_model_rfc  = RandomForestClassifier(criterion= 'gini',min_samples_leaf= 1, min_samples_split=2,n_estimators= 100)
Final_model_rfc.fit(x_train,y_train)

In [None]:
rfc_pred_final = Final_model_rfc.predict(x_test)
rfc_acc = accuracy_score(y_test,rfc_pred_final)
print(rfc_acc * 100)