# IMPLEMENTATION DETAILS
### TYPE - ML Classification
### PROJECT - "RED WINE QUALITY"

# A. DATA ASSESSMENT PROCESS

## 1. DATASET SUMMARY
### - It is a 'Red Wine' dataset whose features are used to predict the quality of the wine.
### - There are 1599 observations and 12 features in the dataset.
### - Total 'float' type features are 11, 'integer' type features are 1.
#### - float type : 'fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol'.
#### - int type : 'quality'.
### - For machine learning, 'quality' is the target (dependent) variable and the other 11 features are the inputs (independent variables).

## 2. FEATURE DESCRIPTIONS
### Table - 'red_wine' using 'winequality_red.csv'
### * Independent Features
#### - **'fixed acidity'** {float} : ***Acids that are always present***.
#### - **'volatile acidity'** {float} : ***Gaseous acids; high values leave an unpleasant vinegar taste.***
#### - **'citric acid'** {float} : ***A preservative added in small quantity to increase acidity and for freshness and flavor.***
#### - **'residual sugar'** {float} : ***The amount of sugar left after fermentation.***
#### - **'chlorides'** {float} : ***The amount of salt present in the wine.***
#### - **'free sulfur dioxide'** {float} : ***It prevents growth of microbes and oxidation of the wine.***
#### - **'total sulfur dioxide'** {float} : ***The amount of free and bound forms of SO2.***
#### - **'density'** {float} : ***Density of the wine; higher-density wines tend to taste sweeter.***
#### - **'pH'** {float} : ***Acidity level of the wine.***
#### - **'sulphates'** {float} : ***An additive to wine which is antimicrobial and antioxidant.***
#### - **'alcohol'** {float} : ***The amount of alcohol present in wine.***

### * Dependent Feature
#### - **'quality'** {int} : ***It indicates the quality of the wine***.
##### This feature is the target variable for Machine Learning Based Classification Type Problems.

## 3. DATA ISSUES
### Table - red_wine
#### 1. * Dirty Data (Low quality)
##### A. Completeness
##### - No incompleteness issue
##### B. Validity
##### - Total duplicate observations - 240.
##### C. Accuracy
##### - No inaccuracy issue
##### D. Consistency
##### - No inconsistency issue

#### 2. * Messy Data (Untidy / Structural)
##### - No messy data issue
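
The completeness and duplicate checks above can be sketched as follows. This is a minimal illustration on a made-up toy frame, not the project's actual assessment code; the helper name `assess_quality` is hypothetical.

```python
import pandas as pd


def assess_quality(df: pd.DataFrame) -> dict:
    """Count missing cells and fully duplicated rows (hypothetical helper)."""
    return {
        "missing_cells": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
    }


# Toy frame standing in for the red_wine table (values are made up).
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, 9.4, 10.0],
    "pH": [3.51, 3.20, 3.51, 3.16],
    "quality": [5, 5, 5, 6],
})
report = assess_quality(toy)
print(report)
```

`duplicated()` marks only the second and later copies of a row, so the count matches the number of rows that `drop_duplicates()` would remove.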

# B. PRE-PROCESSING / DATA CLEANING

## 1. BASIC DATA PRE-PROCESSING APPLIED
### - Dropping duplicated observations : 240
### - Resetting the index of the dataframe to discard the missing indices.
### - Splitting the pre-processed dataframe into Train, Validation, and Test datasets.
### - Finally saving the Train, Validation, and Test datasets into CSV and PKL files for further analysis.

## 2. RESULTS OF BASIC DATA PRE-PROCESSING
### - Dropped 240 duplicate observations, ***successful***.
### - Index reset, ***successful***.

## 3. SAVING RESULTS TO CSV AND PKL FILES 
### - Splitting the pre-processed dataframe into Train, Validation, and Test datasets, ***successful***.
### - Pre-Processed Train, Validation, and Test datasets stored in CSV and PKL files, ***successful***.
#### - Train (wine_quality_train_pp.csv, wine_quality_train_pp.pkl)
#### - Validation (wine_quality_valid_pp.csv, wine_quality_valid_pp.pkl)
#### - Test (wine_quality_test_pp.csv, wine_quality_test_pp.pkl)
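
A minimal sketch of the pre-processing steps above (drop duplicates, reset the index, split, save). The split seed, the use of `train_test_split`, and the helper names are assumptions; the source only states the resulting files. Synthetic data stands in for the real table.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def preprocess_and_split(df, holdout_size, seed=46):
    """Drop duplicate rows, reset the index, then split into train/valid/test."""
    clean = df.drop_duplicates().reset_index(drop=True)
    train, rest = train_test_split(clean, test_size=2 * holdout_size, random_state=seed)
    valid, test = train_test_split(rest, test_size=holdout_size, random_state=seed)
    return train, valid, test


def save_split(df, stem, out_dir):
    """Write one split as both CSV and pickle."""
    out_dir = Path(out_dir)
    df.to_csv(out_dir / f"{stem}.csv", index=False)
    df.to_pickle(out_dir / f"{stem}.pkl")


# Synthetic stand-in: 60 unique rows plus 10 repeated rows.
rng = np.random.default_rng(46)
base = pd.DataFrame({"alcohol": rng.normal(10, 1, 60), "quality": rng.integers(3, 9, 60)})
raw = pd.concat([base, base.iloc[:10]], ignore_index=True)

train, valid, test = preprocess_and_split(raw, holdout_size=10)
with tempfile.TemporaryDirectory() as tmp:
    save_split(train, "wine_quality_train_pp", tmp)
    reloaded = pd.read_pickle(Path(tmp) / "wine_quality_train_pp.pkl")
```

Passing an integer `test_size` fixes the hold-out sets at an exact row count rather than a fraction.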

# C. EDA CONCLUSIONS

## a. Uni-Variate EDA conclusions

### . Numerical Features
#### - There are no missing values in any of the features.
#### - Most features have moderately to highly skewed distributions, i.e., they are not Normally Distributed.
#### - Outliers vary in the range of (0.08-9.23) % for the features.
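
The skewness and outlier percentages can be computed roughly as below. The 1.5 x IQR fence is an assumption (the source mentions IQR but not the multiplier), and the data is synthetic.

```python
import numpy as np
import pandas as pd


def skew_and_outlier_pct(df, k=1.5):
    """Per-column skewness and share of points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = df.quantile(0.25), df.quantile(0.75)
    iqr = q3 - q1
    outside = (df < q1 - k * iqr) | (df > q3 + k * iqr)
    return pd.DataFrame({"skew": df.skew(), "outlier_pct": 100 * outside.mean()})


# Synthetic columns: one roughly symmetric, one right-skewed.
rng = np.random.default_rng(46)
demo = pd.DataFrame({
    "symmetric": rng.normal(0, 1, 1000),
    "right_skewed": rng.lognormal(0, 1, 1000),
})
summary = skew_and_outlier_pct(demo)
print(summary.round(2))
```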

### . Categorical Features 
#### - 'quality' is considered as a categorical type feature, instead of integer type.
#### - The mode of 'quality' is label '5' with 491 observations, and there are no missing values.

## b. Bi/Multi Variate EDA conclusions

### . Numerical-Numerical
#### -> 'quality' vs others
##### - High : positive(none), negative(none)
##### - Moderate : positive(alcohol), negative(volatile acidity)
##### - Low : positive(fixed acidity, citric acid, sulphates), negative(chlorides, total sulfur dioxide, density)
##### - Very Low : positive(residual sugar), negative(free sulfur dioxide, pH)
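
A sketch of how the correlation bands above could be derived. The band thresholds (0.7 / 0.3 / 0.1) are assumptions chosen for illustration; the source does not state the cut-offs it used. Data is synthetic.

```python
import numpy as np
import pandas as pd


def correlation_with_target(df, target):
    """Pearson correlation of each feature with the target, strongest first."""
    corr = df.corr()[target].drop(target)
    return corr.reindex(corr.abs().sort_values(ascending=False).index)


def strength_band(r, high=0.7, moderate=0.3, low=0.1):
    """Label a correlation by magnitude (thresholds are hypothetical)."""
    a = abs(r)
    if a >= high:
        return "high"
    if a >= moderate:
        return "moderate"
    if a >= low:
        return "low"
    return "very low"


rng = np.random.default_rng(46)
n = 500
strong = rng.normal(size=n)
weak = rng.normal(size=n)
demo = pd.DataFrame({
    "strong": strong,
    "weak": weak,
    "quality": 2.0 * strong + 0.1 * weak + rng.normal(size=n),
})
ranked = correlation_with_target(demo, "quality")
bands = ranked.map(strength_band)
```

The sign of each coefficient gives the positive/negative grouping and the band gives the High/Moderate/Low/Very Low row.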

### . Categorical-Numerical
#### -> 'quality' vs others
##### - Outliers are present for different labels of quality.
##### - Features contributing to high quality wine are alcohol, sulphates, fixed acidity, citric acid, residual sugar.
##### - Features contributing to low quality wine are pH, chlorides, volatile acidity, total sulfur dioxide, density, free sulfur dioxide.

# D. FEATURE ENGINEERING

## A. Observations
#### 1. Target feature "quality" should be remapped as "low" for (3,4,5) and "high" for (6,7,8) labels.
#### 2. "SMOTE" oversampling technique should be applied to balance the class distribution.
#### 3. Outliers are present in the range of [0-10] %. Outlier handling techniques should be applied.
#### 4. Data in numerical columns is skewed. Features should be treated with transformation operations.
#### 5. Check if the features are multi-collinear or not.
#### 6. Feature scaling techniques must be applied to bring the dataset into a suitable form for model building.
#### 7. Feature selection techniques must be applied to have better performance of the model.

## B. Steps of Feature Engineering
#### 1. Binarization of target feature 'quality' as 'Low' and 'High' for Train, Validation and Test dataset.
#### 2. Applying the SMOTE oversampling technique on the Train dataset to balance the distribution of target feature 'quality'.
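
Steps 1 and 2 can be sketched as follows. The binarization follows the mapping stated in the observations (3-5 as low, 6-8 as high). The oversampler is a simplified, hand-written illustration of the SMOTE idea for two classes (interpolating between a minority sample and one of its minority-class neighbours); it is an assumption for illustration and not the library implementation the project used, which is typically `SMOTE` from the `imbalanced-learn` package.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors


def binarize_quality(quality: pd.Series) -> pd.Series:
    """Map scores 3-5 to 'low' and 6-8 to 'high'."""
    return quality.map(lambda q: "low" if q <= 5 else "high")


def smote_like_oversample(X, y, k=5, seed=46):
    """Balance two classes by interpolating between minority-class neighbours."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    labels, counts = np.unique(y, return_counts=True)
    minority = labels[np.argmin(counts)]
    n_new = counts.max() - counts.min()
    X_min = X[y == minority]
    # Ask for k + 1 neighbours because each point is its own nearest neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    neighbours = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    base = rng.integers(0, len(X_min), n_new)
    picked = neighbours[base, rng.integers(0, k, n_new)]
    gap = rng.random((n_new, 1))
    synthetic = X_min[base] + gap * (X_min[picked] - X_min[base])
    return np.vstack([X, synthetic]), np.concatenate([y, np.full(n_new, minority)])


# Synthetic stand-in: 40 'low' wines and 20 'high' wines with 3 features.
rng = np.random.default_rng(0)
quality = pd.Series([5] * 40 + [6] * 12 + [7] * 8)
y = binarize_quality(quality)
X = rng.normal(size=(len(y), 3))
X_bal, y_bal = smote_like_oversample(X, y)
```

Oversampling is applied to the Train split only, so the Validation and Test sets keep their original class distribution.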

#### 3. Outlier Detection and Handling using Capping Technique
##### - 'fixed acidity', 'volatile acidity','citric acid','residual sugar','chlorides','free sulfur dioxide','total sulfur dioxide','density','pH','sulphates','alcohol'

##### - Detect and handle outliers in numerical features of Train dataset only.
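
A minimal sketch of IQR-based capping learned on the Train split. The 1.5 multiplier and the helper names are assumptions; the toy values are made up.

```python
import pandas as pd


def fit_iqr_caps(train, k=1.5):
    """Learn per-column lower/upper fences from the training data."""
    q1, q3 = train.quantile(0.25), train.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def cap(df, lower, upper):
    """Clip each column to its fence instead of dropping the row."""
    return df.clip(lower=lower, upper=upper, axis=1)


# Toy column with one obvious high outlier (values are made up).
train = pd.DataFrame({"chlorides": [0.07, 0.08, 0.08, 0.09, 0.10, 0.60]})
lower, upper = fit_iqr_caps(train)
train_capped = cap(train, lower, upper)
n_changed = int((train_capped["chlorides"] != train["chlorides"]).sum())
```

Capping keeps the row count unchanged, which matters when the dataset is as small as this one.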

#### 4. Feature Transformation (to be applied in pipeline)
#### - Checking various feature transformations: (Log), (Square), (Reciprocal), (Square Root), (Exponential), (Yeo-Johnson)
##### - 'fixed acidity', 'volatile acidity','citric acid','residual sugar','chlorides','free sulfur dioxide','total sulfur dioxide','density','pH','sulphates','alcohol'

##### - 'Yeo-Johnson' Transformation is performing better in comparison to other transformations.
##### - Apply 'Yeo-Johnson' on numerical features of Train, Validation, and Test dataset.
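
The Yeo-Johnson step can be sketched with scikit-learn's `PowerTransformer`. The data here is synthetic, and `standardize=False` is an assumption made so that scaling stays a separate step; the transformer is fitted on Train and only applied to the other splits.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Synthetic right-skewed column standing in for a wine feature.
rng = np.random.default_rng(46)
train = pd.DataFrame({"residual sugar": rng.lognormal(1.0, 0.6, 800)})
valid = pd.DataFrame({"residual sugar": rng.lognormal(1.0, 0.6, 100)})

pt = PowerTransformer(method="yeo-johnson", standardize=False)
train_t = pd.DataFrame(pt.fit_transform(train), columns=train.columns)  # fit on Train only
valid_t = pd.DataFrame(pt.transform(valid), columns=valid.columns)      # reuse Train lambdas

skew_before = train["residual sugar"].skew()
skew_after = train_t["residual sugar"].skew()
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```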

#### 5. Multi-Collinearity Check using VIF 
##### - VIF is too high for all features except 4 ('total sulfur dioxide', 'free sulfur dioxide', 'volatile acidity', 'citric acid').
##### - High multi-collinearity exists among the remaining features.
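
For reference, a compact way to compute VIF is to read it off the diagonal of the inverse correlation matrix, which equals 1 / (1 - R²) for each feature regressed on the others with an intercept. This is a sketch on synthetic data and not necessarily the routine used in the project; values from a computation without an intercept term can be considerably larger.

```python
import numpy as np
import pandas as pd


def vif(df):
    """VIF per column: diagonal of the inverse of the correlation matrix."""
    corr = df.corr().to_numpy()
    return pd.Series(np.diag(np.linalg.inv(corr)), index=df.columns, name="VIF")


# Synthetic data: 'a_plus_noise' is almost a copy of 'a'; 'b' is independent.
rng = np.random.default_rng(46)
a = rng.normal(size=500)
b = rng.normal(size=500)
demo = pd.DataFrame({"a": a, "b": b, "a_plus_noise": a + 0.1 * rng.normal(size=500)})
scores = vif(demo)
print(scores.round(2))
```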

#### 6. Scaling (StandardScaler / MinMaxScaler) (to be applied in pipeline)

##### - 'fixed acidity', 'volatile acidity','citric acid','residual sugar','chlorides','free sulfur dioxide','total sulfur dioxide','density','pH','sulphates','alcohol'
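##### - Apply 'StandardScaler' on numerical features of Train, Validation, and Test dataset to bring them to a common scale (zero mean, unit variance). Scaling does not reduce skew; that is handled by the Yeo-Johnson transformation in step 4.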

#### 7. Feature Selection Techniques (to be applied in pipeline)
##### - Apply SelectKBest with mutual_info_classif to select the top k features for Model Building using estimators.
##### - Treat Train, Validation, and Test dataset with feature selection strategy.
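
Steps 4, 6 and 7 are marked "to be applied in pipeline"; one possible arrangement is sketched below. The step names, `k=5`, the synthetic data, and the choice of `LogisticRegression` as the final estimator are assumptions for illustration.

```python
from functools import partial

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer, StandardScaler

# Synthetic stand-in with 11 input features, like the wine data.
X, y = make_classification(n_samples=400, n_features=11, n_informative=5, random_state=46)
X_train, y_train, X_valid, y_valid = X[:300], y[:300], X[300:], y[300:]

pipe = Pipeline([
    ("yeo_johnson", PowerTransformer(method="yeo-johnson", standardize=False)),
    ("scale", StandardScaler()),
    # partial() pins the seed so the mutual information scores are repeatable.
    ("select", SelectKBest(partial(mutual_info_classif, random_state=46), k=5)),
    ("clf", LogisticRegression(random_state=46)),
])
pipe.fit(X_train, y_train)          # every step is fitted on Train only
valid_acc = pipe.score(X_valid, y_valid)
n_kept = int(pipe.named_steps["select"].get_support().sum())
```

Wrapping the steps in a `Pipeline` means the Validation and Test sets are only ever transformed with parameters learned from Train, which avoids leakage during cross-validation.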

# E. ML MODELS IMPLEMENTATIONS & RESULTS

## A. Simple Model (using pipeline)

#### 1. Feature Engineering. (Target mapping and labeling)
#### 2. No hyperparameter tuning used.

#### 3. Dataset Size:
- Train Dataset size : (1159,12)
- Validation Dataset size : (100,12)

#### 4. Model: LogisticRegression(random_state=46)
- Train Acc: 75.0647 %, Validation Acc: 70.0 %
- CV Score (k=10): 74.9843 %, StratifiedKFold
- ***Acceptable performance for Train and Validation sets.***
- ***Performing well on both datasets.***

#### 5. Model: RandomForestClassifier(random_state=46)
- Train Acc: 100.0 %, Validation Acc: 69.0 % 
- CV Score (k=10): 75.7534 %, StratifiedKFold
- ***Model has shown overfitting behavior.***
- ***Not performing well on Validation set.***
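
The Train / Validation / CV comparison above can be reproduced in outline as follows. This is a sketch on synthetic data, so the numbers will differ from those reported; the `shuffle=True, random_state=46` settings for the folds are taken from the tuning section and assumed here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=600, n_features=11, n_informative=5, random_state=46)
X_train, y_train, X_valid, y_valid = X[:500], y[:500], X[500:], y[500:]

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=46)
models = {
    "log_reg": LogisticRegression(random_state=46),
    "random_forest": RandomForestClassifier(random_state=46),
}

results = {}
for name, model in models.items():
    # Cross-validated accuracy on Train, then a single fit for Train/Validation accuracy.
    cv_acc = cross_val_score(model, X_train, y_train, cv=cv, scoring="accuracy").mean()
    model.fit(X_train, y_train)
    results[name] = {
        "train_acc": model.score(X_train, y_train),
        "valid_acc": model.score(X_valid, y_valid),
        "cv_acc": cv_acc,
    }
```

A large gap between `train_acc` and `cv_acc`, as reported for the untuned random forest, is the usual sign of overfitting.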

## B. Feature Engineered Model (using pipeline)

#### 1. FE: target labeling, data balancing (SMOTE), outlier handling (IQR and Capping), numerical feature transformation (Yeo-Johnson), scaling (StandardScaler), feature selection (SelectKBest, mutual_info_classif, k='all').
#### 2. Some hyperparameters tuned.

#### 3. Dataset Size:
- Final Train Dataset size : (1230,12)
- Validation Dataset size : (100,12)

#### 4. Model: LogisticRegression(C=0.5, random_state=46)
- Train Acc: 76.1789 %, Validation Acc: 74.0 %
- CV Score (k=10): 76.1789 %, StratifiedKFold
- ***Acceptable performance for Train and Validation sets.***
- ***Performing well on both datasets.***

#### 5. Model: RandomForestClassifier(n_estimators=200, max_depth=3, random_state=46)
- Train Acc: 77.3984 %, Validation Acc: 76.0 %
- CV Score (k=10): 76.2602 %, StratifiedKFold
- ***Acceptable performance for Train and Validation sets.***
- ***Performing well on Validation set.***

## C. Best Tuned Model (using pipeline)

#### 1. Models tuned with GridSearchCV using StratifiedKFold(n_splits=10, shuffle=True, random_state=46)

- 'Log_Reg':LogisticRegression(C=1.0, max_iter=50, penalty='l2', solver='lbfgs', random_state=46),

- 'KN_CLF':KNeighborsClassifier(algorithm='brute', metric='euclidean', n_neighbors=17, weights='distance'),

- 'SV_CLF':SVC(C=0.1, degree=2, gamma='auto', kernel='rbf', random_state=46),

- 'DT_CLF':DecisionTreeClassifier(criterion='entropy', max_depth=5, min_impurity_decrease=0.0, min_samples_split=0.3, splitter='random', random_state=46),

- 'BAG_CLF':BaggingClassifier(bootstrap=True, estimator=DecisionTreeClassifier(), max_samples=0.5, n_estimators=200, oob_score=True, random_state=46),

- 'RF_CLF':RandomForestClassifier(bootstrap=True, criterion='entropy', max_depth=5, max_samples=0.5, n_estimators=200, oob_score=True, random_state=46),

- 'GB_CLF':GradientBoostingClassifier(learning_rate=0.01, max_depth=5, n_estimators=200, subsample=0.5, random_state=46),

- 'HGB_CLF':HistGradientBoostingClassifier(learning_rate=0.1, max_depth=5, max_iter=200, max_leaf_nodes=25, random_state=46),

- 'XGB_CLF':XGBClassifier(eta=0.1, gamma=0.01, max_depth=5, n_estimators=50, subsample=0.5, objective='binary:logistic',
                        eval_metric='auc', seed=46)
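
A minimal sketch of how one of the models above might be tuned with `GridSearchCV`. The parameter grid is a deliberately tiny, hypothetical one and the data is synthetic; the source does not list the grids that were actually searched.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X, y = make_classification(n_samples=300, n_features=11, n_informative=5, random_state=46)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=46)
param_grid = {  # hypothetical grid, kept small so the example runs quickly
    "learning_rate": [0.01, 0.1],
    "max_depth": [3, 5],
}
search = GridSearchCV(
    GradientBoostingClassifier(n_estimators=50, subsample=0.5, random_state=46),
    param_grid=param_grid,
    scoring="accuracy",
    cv=cv,
)
search.fit(X, y)
best_params = search.best_params_
best_cv_acc = search.best_score_   # mean accuracy over the 10 folds
print(best_params, round(best_cv_acc, 4))
```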
    

#### 2. Model's Performance Comparison

##### **Model**    ---> ***Acc_Score (%)***

##### 1. **GB_CLF**   --->  ***77.2358***
##### 2. **XGB_CLF**  --->  ***77.0732***
##### 3. **RF_CLF**   --->  ***76.6667***
##### 4. **BAG_CLF**  --->  ***76.5854***
##### 5. **Log_Reg**  --->  ***76.2602***
##### 6. **KN_CLF**   --->  ***76.1789***
##### 7. **HGB_CLF**  --->  ***76.0163***
##### 8. **SV_CLF**   --->  ***75.3659***
##### 9. **DT_CLF**   --->  ***72.3577***

#### 3. Best Model and Parameters

- **GradientBoostingClassifier** ***(learning_rate=0.01, max_depth=5, n_estimators=200, subsample=0.5, random_state=46)***

#### 4. Best Model Results

- StratifiedKFold(n_splits=10, random_state=46, shuffle=True), Cross Validation Score : 77.2358 %, dataset size : (1230,12)
- Accuracy Score on Validation Data : 75.0 %, dataset size : (100,12)
- ***Accuracy Score on Test Data : 76.0 %***, dataset size : (100,12)

## D. Production Model

- **GradientBoostingClassifier** ***(learning_rate=0.01, max_depth=5, n_estimators=200, subsample=0.5, random_state=46)***
- StratifiedKFold(n_splits=10, random_state=46, shuffle=True), Cross Validation Score : 76.3831 %, dataset size : (1334,12)
- ***Accuracy Score on Test Data : 77.0 %***, dataset size : (100,12)

## E. Gradio App Deployment Files
- App development is done using 'Production Model' and 'Production Data'.
- Production Model saved as 'wine_quality_mdl_prod.pkl' file.
- Xtrain set is saved as 'wine_quality_X_prod.pkl' file, used to access the unique values and ranges of the features.
- Test set is saved as 'wine_quality_FE_prod_test.pkl' file, used to test the app.
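
Saving and reloading the production model could look like the sketch below. Using the standard `pickle` module is an assumption (the source only gives the `.pkl` file name), and the model is fitted on synthetic data and written to a temporary directory.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=200, n_features=11, n_informative=5, random_state=46)
model = GradientBoostingClassifier(
    learning_rate=0.01, max_depth=5, n_estimators=200, subsample=0.5, random_state=46
).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "wine_quality_mdl_prod.pkl"
    with open(path, "wb") as fh:
        pickle.dump(model, fh)        # what the training side would write
    with open(path, "rb") as fh:
        restored = pickle.load(fh)    # what the app would read at start-up

same_predictions = np.array_equal(model.predict(X), restored.predict(X))
```

Pickled scikit-learn models should be loaded with the same library version that wrote them, and only from trusted sources.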