---
<center><h1>Wine Quality Prediction</h1></center>
<center><h3>Part of 30 Days 30 ML Projects Challenge</h3></center>

---

## 1) Understanding Problem Statement
---

Wine is a complex and diverse beverage that is enjoyed worldwide. The quality of wine is influenced by numerous factors such as grape variety, climate, soil conditions and winemaking techniques. Accurately assessing wine quality is essential for both producers and consumers. While traditional methods rely on human experts and sensory evaluations, machine learning models can provide valuable insights and predictions based on objective data. 

This project is a **classification machine learning problem**. The goal is **to develop a model that predicts the quality of wine from its chemical and physical attributes.**

## 2) Understanding Data
---

The project uses Wine Quality Data which contains several variables (independent variables) such as **acidity, alcohol content, density, pH and more.** The outcome variable or dependent variable is the **quality** score which ranges from 0 to 10 and is based on sensory data. The quality score serves as the ground truth for training and evaluating the predictive model.

The dataset comprises **11 independent variables**, each representing a specific aspect of wine composition, and a 12th variable, **quality**, which quantifies the overall perceived quality of the wine. The variables are:

- 1. fixed acidity
- 2. volatile acidity
- 3. citric acid
- 4. residual sugar
- 5. chlorides
- 6. free sulfur dioxide
- 7. total sulfur dioxide
- 8. density
- 9. pH
- 10. sulphates
- 11. alcohol
- 12. quality (score between 0 and 10)

## 3) Getting System Ready
---
Importing required libraries


In [None]:
import numpy as np
import pandas as pd

# for model building
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

## 4) Data Eyeballing
---

### Loading Data

In [None]:
wine_data = pd.read_csv('Datasets/Day6_Wine_Quality_Data.csv') 

In [None]:
wine_data

In [None]:
print('The size of Dataframe is: ', wine_data.shape)
print('-'*100)
print('The Column Name, Record Count and Data Types are as follows: ')
wine_data.info()
print('-'*100)

In [None]:
# Defining numerical & categorical columns
numeric_features = [feature for feature in wine_data.columns if wine_data[feature].dtype != 'O']
categorical_features = [feature for feature in wine_data.columns if wine_data[feature].dtype == 'O']

# print columns
print('We have {} numerical features : {}'.format(len(numeric_features), numeric_features))
print('\nWe have {} categorical features : {}'.format(len(categorical_features), categorical_features))

In [None]:
print('Missing Value Presence in different columns of DataFrame are as follows : ')
print('-'*100)
total=wine_data.isnull().sum().sort_values(ascending=False)
percent=(wine_data.isnull().sum()/wine_data.isnull().count()*100).sort_values(ascending=False)
pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])

In [None]:
print('Summary Statistics of numerical features for DataFrame are as follows:')
print('-'*100)
wine_data.describe()

In [None]:
wine_data['quality'].value_counts()

## 5) Data Cleaning and Preprocessing
---

### Label Binarization (0 or 1)

In [None]:
wine_data['quality'] = wine_data['quality'].apply(lambda y_value: 1 if y_value >= 7 else 0)

In [None]:
wine_data.head()
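
The thresholding rule above (a score of 7 or more counts as good wine) can also be written as a vectorized comparison. A minimal sketch, assuming a small hypothetical `toy_quality` Series that stands in for the real `quality` column:

```python
import pandas as pd

# Hypothetical quality scores standing in for wine_data['quality']
toy_quality = pd.Series([3, 5, 6, 7, 8, 4])

# Row-wise version, same rule as the cell above: 1 if score >= 7, else 0
binarized_apply = toy_quality.apply(lambda y_value: 1 if y_value >= 7 else 0)

# Vectorized equivalent: boolean comparison cast to integers
binarized_vectorized = (toy_quality >= 7).astype(int)

print(binarized_vectorized.tolist())  # [0, 0, 0, 1, 1, 0]
```

Both forms give the same labels; the vectorized one avoids a Python-level call per row.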

## 6) Model Building
---

### Creating Feature Matrix (Independent Variables) & Target Variable (Dependent Variable)

In [None]:
# separating the data and labels
X = wine_data.drop(columns=['quality']) # Feature matrix
y = wine_data['quality'] # Target variable

In [None]:
X

In [None]:
y

### Data Standardization

In [None]:
scaler = StandardScaler()

In [None]:
scaler.fit(X)

In [None]:
standardized_data = scaler.transform(X)

In [None]:
standardized_data

In [None]:
X = standardized_data

In [None]:
X
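
To make the standardization step concrete: `StandardScaler` subtracts each column's mean and divides by its population standard deviation. A small sketch on a hypothetical `toy_features` array (made-up values, not wine data) comparing the manual formula with the scaler:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical two-column feature matrix, for illustration only
toy_features = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])

# Manual z-score: (x - column mean) / column standard deviation (ddof=0)
manual = (toy_features - toy_features.mean(axis=0)) / toy_features.std(axis=0)

# The same transformation via StandardScaler
scaled = StandardScaler().fit_transform(toy_features)

print(np.allclose(manual, scaled))  # True
```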

### Train-Test Split

In [None]:
# train_test_split is already imported above; stratify=y keeps the class ratio the same in both splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=45)

In [None]:
print(X.shape, X_train.shape, X_test.shape)

In [None]:
print(y.shape, y_train.shape, y_test.shape)
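
The cells above fit the scaler on the full feature matrix before splitting, so test-set statistics influence the transformation. A common alternative is to split first and fit the scaler on the training rows only. A minimal sketch on synthetic data, where `X_demo`, `y_demo` and `demo_scaler` are hypothetical stand-ins rather than the notebook's variables:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 11 wine features and binary labels (not real data)
rng = np.random.default_rng(45)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 11))
y_demo = np.array([0] * 160 + [1] * 40)

# Split first so the test rows stay unseen while the scaler is fitted
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=45
)

# Fit on the training rows only, then apply the same transform to both splits
demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)

print(X_tr_scaled.shape, X_te_scaled.shape)
```

With this ordering, the test split is transformed using the training mean and standard deviation, which mirrors how unseen wines would be handled at prediction time.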

### Model Comparison : Training & Evaluation

In [None]:
# Model classes to compare; each is instantiated with its default hyperparameters
models = [LogisticRegression, SVC, DecisionTreeClassifier, RandomForestClassifier]
accuracy_scores = []
precision_scores = []
recall_scores = []
f1_scores = []

for model in models:
    # Train on the training split and predict on the held-out test split
    classifier = model().fit(X_train, y_train)
    y_pred = classifier.predict(X_test)

    # Precision, recall and F1 are computed for the positive class (label 1)
    accuracy_scores.append(accuracy_score(y_test, y_pred))
    precision_scores.append(precision_score(y_test, y_pred))
    recall_scores.append(recall_score(y_test, y_pred))
    f1_scores.append(f1_score(y_test, y_pred))

In [None]:
classification_metrics_df = pd.DataFrame({
    "Model": ["Logistic Regression", "SVM", "Decision Tree", "Random Forest"],
    "Accuracy": accuracy_scores,
    "Precision": precision_scores,
    "Recall": recall_scores,
    "F1 Score": f1_scores
})

classification_metrics_df.set_index('Model', inplace=True)
classification_metrics_df
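
To show how the four metrics in the table relate to each other, they can be derived by hand from a confusion matrix. A sketch with hypothetical labels (`y_true_demo` and `y_pred_demo` are made up for illustration, not model output):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical ground truth and predictions (1 = high-quality wine)
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred_demo = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# For binary labels, ravel() yields counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # share of predicted positives that are correct
recall = tp / (tp + fn)                     # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```

In this toy case accuracy is 0.7 while recall is only 0.5, which illustrates how a model can look accurate overall and still miss half of the positive class.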

### Inference

Four machine learning models were evaluated for wine quality prediction: Logistic Regression, SVM, Decision Tree and Random Forest. Among these, **Random Forest performed the best**, with the highest accuracy (92.50%) and an F1 score of 68.42%. It excelled in precision, so a wine it flags as high quality is likely to be one. However, **all models struggled with recall**, meaning each of them missed a share of the high-quality wines. Decision Tree showed a more balanced trade-off between precision and recall, making it a strong contender for practical applications.

**`Note:`** The choice of model should align with the specific objectives, emphasizing precision (Random Forest) or comprehensive identification (Decision Tree) of high-quality wines. Further model evaluation and fine-tuning may be necessary to optimize performance for the specific business goals.
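
One possible way to approach the fine-tuning mentioned in the note is a randomized hyperparameter search scored on F1 rather than accuracy, using the `RandomizedSearchCV` class imported earlier. The sketch below runs on synthetic data; the search space in `param_distributions` is an illustrative assumption, not a set of tuned or recommended values:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic, imbalanced binary data standing in for the wine dataset
X_demo, y_demo = make_classification(
    n_samples=400, n_features=11, n_informative=5,
    weights=[0.85, 0.15], random_state=45,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=45
)

# Illustrative search space (assumed values, chosen only for the example)
param_distributions = {
    "n_estimators": [25, 50, 100],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "class_weight": [None, "balanced"],
}

# Sample 5 random combinations and score each with 3-fold cross-validated F1
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=45),
    param_distributions=param_distributions,
    n_iter=5,
    scoring="f1",
    cv=3,
    random_state=45,
)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Scoring on F1 (or recall) instead of accuracy steers the search toward settings that find more of the minority class, which is where the models above were weakest.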