## Project - Implement Data Science Process Lifecycle on Red Wine Quality Dataset

<img src="wine.gif">

### Business Understanding

The red wine industry has grown rapidly in recent years as social drinking has become more common. Producers increasingly use quality certifications to promote their products, but certification is time-consuming and relies on assessments by human experts, which makes it expensive. The price of red wine also depends on the rather subjective judgment of wine tasters, whose opinions can vary considerably. Physicochemical tests are another vital part of certification and quality assessment: they are laboratory-based and measure properties such as acidity, pH, and sugar content. If the quality perceived by human tasters can be related to a wine’s chemical properties, then certification, quality assessment, and quality assurance can be made more controlled, which would be of interest to the red wine market. This project aims to determine which features are the best indicators of red wine quality and to generate insights into how each of them affects the quality predicted by our model.


### Analytic Approach

**Q1. What is the analytical approach that you would take for this project? Why do you think it's the right approach?**

For this project, we will use a **predictive modeling approach** using supervised machine learning, specifically **regression analysis**. Since our goal is to predict the quality of red wine based on various physicochemical properties, regression is the appropriate choice as it allows us to model the relationship between independent variables (features) and the dependent variable (wine quality).

The approach involves:
- Exploring and cleaning the dataset.
- Selecting relevant features based on correlation analysis.
- Implementing two regression models (e.g., Linear Regression and Random Forest Regression) to compare performance.
- Evaluating models using R², RMSE, and MAE.
- Determining the most important features affecting wine quality.
    


### Data Requirements

**Q2. What kind of data do we require for predicting red wine quality and for determining the best quality indicators?**

We require **structured numerical data** that includes:
- **Physicochemical properties** such as acidity, sugar content, alcohol percentage, pH, etc.
- **Quality ratings** as the target variable (rated on a scale from 0 to 10).

Key features expected to impact quality:
- **Alcohol content**: Higher alcohol levels often correlate with better quality.
- **pH and acidity**: Balanced acidity is essential for wine taste.
- **Sulfur dioxide levels**: Sulfur dioxide helps preserve wine, but excessive amounts can impair taste.
    


### Data Collection

**Q3. From where do we get our data?**

The dataset is sourced from the **UCI Machine Learning Repository**, specifically the **Wine Quality Dataset**. The data consists of physicochemical tests conducted on different wine samples along with their quality ratings assigned by wine tasters.
    


### Data Understanding

**Q4. From where are red wine samples obtained?**

Red wine samples in this dataset come from the **Vinho Verde** wine region in northern Portugal. Physicochemical tests were performed on each sample to measure its properties.

**Q5. How can knowing the impact of each feature help in making better wine?**

Understanding which features influence quality can help winemakers optimize production by adjusting factors such as:
- Fermentation conditions (e.g., temperature, yeast selection).
- Chemical composition (e.g., balancing acidity and sugar levels).
- Storage and aging processes.
    


### Data Preparation

We will perform the following steps:

#### 1. Load the dataset and inspect its structure
```python
import pandas as pd

# Load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
df = pd.read_csv(url, delimiter=";")

# Display dataset info
df.info()
df.head()
```

#### 2. Check for missing values
```python
# Check for missing values
df.isnull().sum()
```

#### 3. Handle outliers (Using IQR method)
```python
# Keep only the rows whose value in `column` lies within the 1.5 * IQR fences
def remove_outliers(data, column):
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return data[(data[column] >= lower_bound) & (data[column] <= upper_bound)]

# Filter on every feature in turn; 'quality' is the last column and is excluded.
# Note: each pass recomputes the quartiles on the rows that survived the previous pass.
for col in df.columns[:-1]:
    df = remove_outliers(df, col)
```
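
Because the loop above recomputes its fences on the rows left over from the previous column, the number of rows removed can depend on column order. As a possible alternative, the sketch below computes all fences once on the unfiltered data and drops a row if any feature falls outside them. The helper name `iqr_inlier_mask` and the small synthetic frame are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def iqr_inlier_mask(frame, columns, k=1.5):
    """Return a boolean Series that is True for rows inside the IQR fences of every listed column."""
    q1 = frame[columns].quantile(0.25)
    q3 = frame[columns].quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Compare each column against its own fences; keep a row only if all columns pass
    inside = frame[columns].ge(lower, axis="columns") & frame[columns].le(upper, axis="columns")
    return inside.all(axis=1)

# Synthetic stand-in for the wine data (hypothetical values, not the real dataset)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "alcohol": rng.normal(10.4, 1.0, 200),
    "pH": rng.normal(3.3, 0.15, 200),
    "quality": rng.integers(3, 9, 200),
})
demo.loc[0, "alcohol"] = 25.0  # Plant one obvious outlier

feature_cols = [c for c in demo.columns if c != "quality"]
mask = iqr_inlier_mask(demo, feature_cols)
cleaned = demo[mask].reset_index(drop=True)
print(f"Rows before: {len(demo)}, rows after: {len(cleaned)}")
```

In the notebook, the same helper could be applied to `df` with `df.columns[:-1]` as the column list. Comparing the row counts of both variants is a quick way to see how aggressive each one is.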

#### 4. Implement Correlation Heatmap
```python
import seaborn as sns
import matplotlib.pyplot as plt

# Plot correlation heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(), annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Feature Correlation Heatmap")
plt.show()
```
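
The heatmap shows every pairwise correlation, but for feature selection the column of interest is the one for `quality`. A minimal sketch of ranking features by the absolute value of that correlation follows. The function name `rank_by_target_correlation` and the synthetic data (including the way the synthetic quality score is generated) are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd

def rank_by_target_correlation(frame, target="quality"):
    """Return feature-target Pearson correlations, strongest (by absolute value) first."""
    corr = frame.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.loc[order]

# Synthetic stand-in for the wine data (hypothetical relationship, not real results)
rng = np.random.default_rng(42)
n = 500
alcohol = rng.normal(10.4, 1.0, n)
volatile_acidity = rng.normal(0.5, 0.15, n)
ph = rng.normal(3.3, 0.15, n)
quality = 5.6 + 0.8 * (alcohol - 10.4) - 2.0 * (volatile_acidity - 0.5) + rng.normal(0, 0.5, n)

demo = pd.DataFrame({
    "alcohol": alcohol,
    "volatile acidity": volatile_acidity,
    "pH": ph,
    "quality": quality,
})

ranked = rank_by_target_correlation(demo)
print(ranked)
```

In the notebook, `rank_by_target_correlation(df)` would give the equivalent ranking for the real features. Pearson correlation only captures linear association, so a low value does not rule out a non-linear effect.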
    


### Modeling

We will implement **two regression models**:

#### 1. Multivariable Linear Regression
```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Define features and target
X = df.drop(columns=["quality"])
y = df["quality"]

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Linear Regression Model
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

# Predictions
y_pred_lr = lr_model.predict(X_test)
```

#### 2. Random Forest Regression
```python
from sklearn.ensemble import RandomForestRegressor

# Train Random Forest Model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Predictions
y_pred_rf = rf_model.predict(X_test)
```
    


### Model Evaluation

We will evaluate models using **R² Score, RMSE, and MAE**.
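
For reference, the three metrics are conventionally defined as follows. The notation is introduced here and is not taken from the dataset documentation: $y_i$ is the true quality score, $\hat{y}_i$ the predicted score, $\bar{y}$ the mean of the true scores, and $n$ the number of test samples.

$$
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}, \qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}, \qquad
\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert
$$

A higher R² is better, while lower RMSE and MAE are better. RMSE and MAE are both expressed in quality-score units, and RMSE penalizes large errors more heavily than MAE.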

```python
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# Define evaluation function
def evaluate_model(y_true, y_pred, model_name):
    r2 = r2_score(y_true, y_pred)
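    # Take the square root of the MSE explicitly; the `squared=False` argument
    # was deprecated and later removed in newer scikit-learn releases
    rmse = mean_squared_error(y_true, y_pred) ** 0.5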
    mae = mean_absolute_error(y_true, y_pred)
    print(f"{model_name} Performance:")
    print(f"R² Score: {r2:.2f}")
    print(f"RMSE: {rmse:.2f}")
    print(f"MAE: {mae:.2f}")
    print("-" * 40)

# Evaluate both models
evaluate_model(y_test, y_pred_lr, "Linear Regression")
evaluate_model(y_test, y_pred_rf, "Random Forest")
```
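
Printed blocks are hard to compare side by side, so one option is to collect the metrics into a small table with one row per model. The sketch below shows the idea under stated assumptions: the helper `metrics_row`, the labels "Model A" and "Model B", and the prediction arrays are all hypothetical placeholders, not results from the wine data.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

def metrics_row(y_true, y_pred):
    """Return R2, RMSE, and MAE for one set of predictions as a dict."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "R2": r2_score(y_true, y_pred),
        "RMSE": float(np.sqrt(mse)),
        "MAE": mean_absolute_error(y_true, y_pred),
    }

# Hypothetical true scores and predictions, for illustration only
y_true = np.array([5, 6, 5, 7, 6, 4])
pred_a = np.array([5.2, 5.8, 5.4, 6.5, 6.1, 4.6])
pred_b = np.array([5.0, 6.3, 5.1, 6.8, 5.7, 4.2])

# One row per model, one column per metric
comparison = pd.DataFrame.from_dict(
    {
        "Model A": metrics_row(y_true, pred_a),
        "Model B": metrics_row(y_true, pred_b),
    },
    orient="index",
)
print(comparison.round(3))
```

In the notebook, the two rows would instead be built from `y_test` with `y_pred_lr` and `y_pred_rf`.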
    

*TODO*

- Use three metrics (R², RMSE, and MAE) to evaluate model prediction performance.
- Compare these three metrics for the two models and analyze the performance.
- Calculate feature importance scores for the top features that help predict wine quality, and visualize them.
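
As a starting point for the feature importance item, a fitted `RandomForestRegressor` exposes impurity-based scores through its `feature_importances_` attribute. The sketch below is one possible way to rank them. It fits a separate forest on synthetic data so that it runs on its own; the data, the variable names `X_demo`, `y_demo`, and `forest`, and the way the synthetic target is generated are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the wine features (hypothetical values, not real results)
rng = np.random.default_rng(0)
n = 400
X_demo = pd.DataFrame({
    "alcohol": rng.normal(10.4, 1.0, n),
    "sulphates": rng.normal(0.65, 0.15, n),
    "pH": rng.normal(3.3, 0.15, n),
})
# The synthetic target depends on alcohol only, plus noise
y_demo = 5.6 + 0.8 * (X_demo["alcohol"] - 10.4) + rng.normal(0, 0.3, n)

forest = RandomForestRegressor(n_estimators=100, random_state=42)
forest.fit(X_demo, y_demo)

# Pair each score with its column name and sort from most to least important
importances = pd.Series(
    forest.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)
print(importances)
```

In the notebook, the same two lines could be applied to `rf_model` and `X_train.columns`, and `importances.plot.barh()` would give a quick bar chart. Impurity-based scores are computed on the training data, so it may be worth cross-checking them against another measure before drawing conclusions.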

### Conclusion

*TODO*