---
<center><h1>Wine Quality Prediction</h1></center>
<center><h3>Part of 30 Days 30 ML Projects Challenge</h3></center>

---

## 1) Understanding Problem Statement
---

Wine is a complex and diverse beverage that is enjoyed worldwide. The quality of wine is influenced by numerous factors such as grape variety, climate, soil conditions and winemaking techniques. Accurately assessing wine quality is essential for both producers and consumers. While traditional methods rely on human experts and sensory evaluations, machine learning models can provide valuable insights and predictions based on objective data. 

The project falls under **Classication Machine Learning Problem**. The goal of this project is **to develop a wine quality prediction model to predict the quality of wine based on various chemical and physical attributes.

## 2) Understanding Data
---

The project uses Wine Quality Data which contains several variables (independent variables) such as **acidity, alcohol content, density, pH and more.** The outcome variable or dependent variable is the **quality** score which ranges from 0 to 10 and is based on sensory data. The quality score serves as the ground truth for training and evaluating the predictive model.

The dataset encompasses **11 independent variables** each representing a specific aspect of wine composition and the 12th variable, **quality**, quantifying the overall perceived quality of the wine which are shown as follows: 

- 1. fixed acidity
- 2. volatile acidity
- 3. citric acid
- 4. residual sugar
- 5. chlorides
- 6. free sulfur dioxide
- 7. total sulfur dioxide
- 8. density
- 9. pH
- 10. sulphates
- 11. alcohol
- 12. quality (score between 0 and 10)

## 3) Getting System Ready
---
Importing required libraries


In [1]:
import numpy as np
import pandas as pd

# for model buidling
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

## 4) Data Eyeballing
---

### Laoding Data

In [2]:
wine_data = pd.read_csv('Datasets/Day6_Wine_Quality_Data.csv') 

In [3]:
wine_data

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.700,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5
1,7.8,0.880,0.00,2.6,0.098,25.0,67.0,0.99680,3.20,0.68,9.8,5
2,7.8,0.760,0.04,2.3,0.092,15.0,54.0,0.99700,3.26,0.65,9.8,5
3,11.2,0.280,0.56,1.9,0.075,17.0,60.0,0.99800,3.16,0.58,9.8,6
4,7.4,0.700,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5
...,...,...,...,...,...,...,...,...,...,...,...,...
1594,6.2,0.600,0.08,2.0,0.090,32.0,44.0,0.99490,3.45,0.58,10.5,5
1595,5.9,0.550,0.10,2.2,0.062,39.0,51.0,0.99512,3.52,0.76,11.2,6
1596,6.3,0.510,0.13,2.3,0.076,29.0,40.0,0.99574,3.42,0.75,11.0,6
1597,5.9,0.645,0.12,2.0,0.075,32.0,44.0,0.99547,3.57,0.71,10.2,5


In [4]:
print('The size of Dataframe is: ', wine_data.shape)
print('-'*100)
print('The Column Name, Record Count and Data Types are as follows: ')
wine_data.info()
print('-'*100)

The size of Dataframe is:  (1599, 12)
----------------------------------------------------------------------------------------------------
The Column Name, Record Count and Data Types are as follows: 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               15

In [5]:
# Defining numerical & categorical columns
numeric_features = [feature for feature in wine_data.columns if wine_data[feature].dtype != 'O']
categorical_features = [feature for feature in wine_data.columns if wine_data[feature].dtype == 'O']

# print columns
print('We have {} numerical features : {}'.format(len(numeric_features), numeric_features))
print('\nWe have {} categorical features : {}'.format(len(categorical_features), categorical_features))

We have 12 numerical features : ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol', 'quality']

We have 0 categorical features : []


In [6]:
print('Missing Value Presence in different columns of DataFrame are as follows : ')
print('-'*100)
total=wine_data.isnull().sum().sort_values(ascending=False)
percent=(wine_data.isnull().sum()/wine_data.isnull().count()*100).sort_values(ascending=False)
pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])

Missing Value Presence in different columns of DataFrame are as follows : 
----------------------------------------------------------------------------------------------------


Unnamed: 0,Total,Percent
fixed acidity,0,0.0
volatile acidity,0,0.0
citric acid,0,0.0
residual sugar,0,0.0
chlorides,0,0.0
free sulfur dioxide,0,0.0
total sulfur dioxide,0,0.0
density,0,0.0
pH,0,0.0
sulphates,0,0.0


In [7]:
print('Summary Statistics of numerical features for DataFrame are as follows:')
print('-'*100)
wine_data.describe()

Summary Statistics of numerical features for DataFrame are as follows:
----------------------------------------------------------------------------------------------------


Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
count,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0
mean,8.319637,0.527821,0.270976,2.538806,0.087467,15.874922,46.467792,0.996747,3.311113,0.658149,10.422983,5.636023
std,1.741096,0.17906,0.194801,1.409928,0.047065,10.460157,32.895324,0.001887,0.154386,0.169507,1.065668,0.807569
min,4.6,0.12,0.0,0.9,0.012,1.0,6.0,0.99007,2.74,0.33,8.4,3.0
25%,7.1,0.39,0.09,1.9,0.07,7.0,22.0,0.9956,3.21,0.55,9.5,5.0
50%,7.9,0.52,0.26,2.2,0.079,14.0,38.0,0.99675,3.31,0.62,10.2,6.0
75%,9.2,0.64,0.42,2.6,0.09,21.0,62.0,0.997835,3.4,0.73,11.1,6.0
max,15.9,1.58,1.0,15.5,0.611,72.0,289.0,1.00369,4.01,2.0,14.9,8.0


In [9]:
wine_data['quality'].value_counts()

quality
5    681
6    638
7    199
4     53
8     18
3     10
Name: count, dtype: int64

## 5) Data Cleaning and Preprocessing
---

### Label Binarization (0 or 1)

In [15]:
wine_data['quality'] = wine_data['quality'].apply(lambda y_value: 1 if y_value >= 7 else 0)

In [16]:
wine_data.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,0
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,0
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,0
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,0
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,0


## 5) Model Building
---

### Creating Feature Matrix (Independent Variables) & Target Variable (Dependent Variable)

In [19]:
# separating the data and labels
X = loan_data.drop(columns = ['Loan_ID','Loan_Status'], axis=1) # Feature matrix
y = loan_data['Loan_Status'] # Target variable

In [20]:
X

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area
1,1,1,1,1,0,4583,1508.0,128.0,360.0,1.0,0
2,1,1,0,1,1,3000,0.0,66.0,360.0,1.0,2
3,1,1,0,0,0,2583,2358.0,120.0,360.0,1.0,2
4,1,0,0,1,0,6000,0.0,141.0,360.0,1.0,2
5,1,1,2,1,1,5417,4196.0,267.0,360.0,1.0,2
...,...,...,...,...,...,...,...,...,...,...,...
609,0,0,0,1,0,2900,0.0,71.0,360.0,1.0,0
610,1,1,4,1,0,4106,0.0,40.0,180.0,1.0,0
611,1,1,1,1,0,8072,240.0,253.0,360.0,1.0,2
612,1,1,2,1,0,7583,0.0,187.0,360.0,1.0,2


In [21]:
y

1      0
2      1
3      1
4      1
5      1
      ..
609    1
610    1
611    1
612    1
613    0
Name: Loan_Status, Length: 480, dtype: int64

### Data Standardization

In [22]:
scaler = StandardScaler()

In [23]:
scaler.fit(X)

In [24]:
standardized_data = scaler.transform(X)

In [25]:
standardized_data

array([[ 0.46719815,  0.73716237,  0.11235219, ...,  0.27554157,
         0.41319694, -1.31886834],
       [ 0.46719815,  0.73716237, -0.70475462, ...,  0.27554157,
         0.41319694,  1.25977445],
       [ 0.46719815,  0.73716237, -0.70475462, ...,  0.27554157,
         0.41319694,  1.25977445],
       ...,
       [ 0.46719815,  0.73716237,  0.11235219, ...,  0.27554157,
         0.41319694,  1.25977445],
       [ 0.46719815,  0.73716237,  0.92945899, ...,  0.27554157,
         0.41319694,  1.25977445],
       [-2.14041943, -1.35655324, -0.70475462, ...,  0.27554157,
        -2.42015348, -0.02954695]])

In [26]:
X = standardized_data

In [27]:
X

array([[ 0.46719815,  0.73716237,  0.11235219, ...,  0.27554157,
         0.41319694, -1.31886834],
       [ 0.46719815,  0.73716237, -0.70475462, ...,  0.27554157,
         0.41319694,  1.25977445],
       [ 0.46719815,  0.73716237, -0.70475462, ...,  0.27554157,
         0.41319694,  1.25977445],
       ...,
       [ 0.46719815,  0.73716237,  0.11235219, ...,  0.27554157,
         0.41319694,  1.25977445],
       [ 0.46719815,  0.73716237,  0.92945899, ...,  0.27554157,
         0.41319694,  1.25977445],
       [-2.14041943, -1.35655324, -0.70475462, ...,  0.27554157,
        -2.42015348, -0.02954695]])

### Train-Test Split

In [28]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=45)

In [29]:
print(X.shape, X_train.shape, X_test.shape)

(480, 11) (384, 11) (96, 11)


In [30]:
print(y.shape, y_train.shape, y_test.shape)

(480,) (384,) (96,)


### Model Comparison : Training & Evaluation

In [31]:
models = [LogisticRegression, SVC, DecisionTreeClassifier, RandomForestClassifier]
accuracy_scores = []
precision_scores = []
recall_scores = []
f1_scores = []

for model in models:
    classifier = model().fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    
    accuracy_scores.append(accuracy_score(y_test, y_pred))
    precision_scores.append(precision_score(y_test, y_pred))
    recall_scores.append(recall_score(y_test, y_pred))
    f1_scores.append(f1_score(y_test, y_pred))

In [32]:
classification_metrics_df = pd.DataFrame({
    "Model": ["Logistic Regression", "SVM", "Decision Tree", "Random Forest"],
    "Accuracy": accuracy_scores,
    "Precision": precision_scores,
    "Recall": recall_scores,
    "F1 Score": f1_scores
})

classification_metrics_df.set_index('Model', inplace=True)
classification_metrics_df

Unnamed: 0_level_0,Accuracy,Precision,Recall,F1 Score
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Logistic Regression,0.760417,0.752941,0.969697,0.847682
SVM,0.75,0.75,0.954545,0.84
Decision Tree,0.65625,0.761905,0.727273,0.744186
Random Forest,0.75,0.769231,0.909091,0.833333


### Inference
In the context of Loan Eligibility Prediction:

1. **Logistic Regression** demonstrates the highest recall (0.97) indicating its effectiveness in identifying eligible applicants. However, precision (0.75) and F1 score (0.85) show a trade-off between accuracy and false positives.

2. **SVM** maintains a high recall (0.95) with a slightly lower precision (0.75). It's a balanced choice for minimizing false negatives while controlling false positives.

3. **Decision Tree** has the lowest accuracy (0.65) among the models. It provides good precision (0.75) but struggles with recall (0.73) leading to a moderate F1 score (0.74).

4. **Random Forest** strikes a balance between precision (0.77) and recall (0.89) resulting in a reasonable F1 score (0.83) and overall accuracy (0.74).

In summary, Logistic Regression excels in recall but sacrifices precision. SVM offers a balanced approach while Random Forest strikes a compromise between precision and recall.


**`Note:`** Choosing the most suitable model depends on the specific objectives of the Finance company. If minimizing false negatives (approving loans for eligible applicants) is crucial, Logistic Regression or SVM may be preferred. If a balance between precision and recall is desired, Random Forest offers a reasonable compromise. Further model evaluation and fine-tuning may be necessary to optimize performance for the specific business goals.