# IMPLEMENTATION DETAILS
### TYPE - ML Regression
### PROJECT - "Big Mart Sales"

# A. DATA ASSESSMENT PROCESS

## 1. DATASET SUMMARY
### - The 'Big Mart Sales' dataset provides item and outlet features used to predict the sales of products at different stores.
### - Train dataset has 8523 observations and 12 features.
### - Test dataset has 5681 observations and 11 features.

### - The Train dataset has 4 'float' type features, 1 'integer' type feature, and 7 'object' type features.
#### - float type : 'Item_Weight', 'Item_Visibility', 'Item_MRP', 'Item_Outlet_Sales'.
#### - int type : 'Outlet_Establishment_Year'.
#### - object type : 'Item_Identifier', 'Item_Fat_Content', 'Item_Type', 'Outlet_Identifier', 'Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type'.

### - For machine learning applications 'Item_Outlet_Sales' feature is the target variable (dependent) and the other 11 features are input (independent).
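The type counts and the target/input separation above can be checked with a short pandas sketch. The two-row frame below is only an illustrative stand-in for 'bms_train.csv' (its values are assumptions, not the real data); on the real file the same checks would run on the loaded dataframe.

```python
import pandas as pd
from pandas.api import types as ptypes

# Illustrative two-row stand-in with the 12 documented columns (values are assumptions).
df = pd.DataFrame({
    "Item_Identifier": ["FDA15", "DRC01"],
    "Item_Weight": [9.3, 5.92],
    "Item_Fat_Content": ["Low Fat", "Regular"],
    "Item_Visibility": [0.016, 0.019],
    "Item_Type": ["Dairy", "Soft Drinks"],
    "Item_MRP": [249.81, 48.27],
    "Outlet_Identifier": ["OUT049", "OUT018"],
    "Outlet_Establishment_Year": [1999, 2009],
    "Outlet_Size": ["Medium", "Medium"],
    "Outlet_Location_Type": ["Tier 1", "Tier 3"],
    "Outlet_Type": ["Supermarket Type1", "Supermarket Type2"],
    "Item_Outlet_Sales": [3735.14, 443.42],
})

TARGET = "Item_Outlet_Sales"

# Separate the dependent (target) feature from the 11 independent features.
X = df.drop(columns=TARGET)
y = df[TARGET]

# Group columns by their storage type.
float_cols = [c for c in df.columns if ptypes.is_float_dtype(df[c])]
int_cols = [c for c in df.columns if ptypes.is_integer_dtype(df[c])]
other_cols = [c for c in df.columns if not ptypes.is_numeric_dtype(df[c])]

print(len(float_cols), len(int_cols), len(other_cols))  # 4 1 7
print(X.shape)  # (2, 11)
```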

## 2. FEATURE DESCRIPTIONS
### Table - 'bms_train' using 'bms_train.csv'
### - 'Item_Identifier' feature {object type, independent feature} - ***Unique identifier number assigned to each item.***
### - 'Item_Weight' feature {float type, independent feature} - ***Item weight in gms.***
### - 'Item_Fat_Content' feature {object type, independent feature} - ***Item fat content.***
### - 'Item_Visibility' feature {float type, independent feature} - ***Percentage of the total display area of all items in a store allocated to the particular item. Placement value of each item: (0 - 'Far & Behind') and (1 - 'Near & Front')***
### - 'Item_Type' feature {object type, independent feature} - ***Type of food category item belongs to.***
### - 'Item_MRP' feature {float type, independent feature} - ***Max. Retail Price of the item in the outlet.***
### - 'Outlet_Identifier' feature {object type, independent feature} - ***Unique outlet identifier.***
### - 'Outlet_Establishment_Year' feature {int type, independent feature} - ***Year of outlet establishment.***
### - 'Outlet_Size' feature {object type, independent feature} - ***Size of the outlet.***
### - 'Outlet_Location_Type' feature {object type, independent feature} - ***Tier of city in which outlet is located.***
### - 'Outlet_Type' feature {object type, independent feature} - ***Type of outlet.***
### - 'Item_Outlet_Sales' feature {float type, dependent feature} - ***Sales of the item from the outlet.***
#### -- This feature is the target variable for Machine Learning Based Regression Type Problems.

## 3. DATA ISSUES
### Table - bms_train
#### 1. * Dirty Data (Low quality)
##### A. Completeness
##### - Missing Values: Item_Weight=1463, Outlet_Size=2410
##### B. Validity
##### - No duplicate observations
##### C. Accuracy
##### - No inaccuracy issue
##### D. Consistency
##### - "Item_Fat_Content" feature values to be marked as 'Non Edible' where the context is 'Non Consumables'.
##### - "Item_Fat_Content" feature values 'LF','low fat' must be mapped to 'Low Fat' and 'reg' must be mapped to 'Regular'.
##### - "Item_Type" feature values of 'Dairy' to be marked as 'Dairy Drinks' and 'Dairy Foods', where the context is 'Drinks' and 'Foods' respectively.

#### 2. * Messy Data (Untidy / Structural)
##### - No messy data issue

# B. PRE-PROCESSING / DATA CLEANING

## 1. Pre-Processing Level - I.

### - Nothing to pre-process
### - Splitting the dataframe into Train, Validation, and Test datasets using 'bms_train.csv', ***successful***.

### - Saving the split data into CSV and PKL files, ***successful***.
#### . Train ('bms_train_init.csv','bms_train_init.pkl')
#### . Validation ('bms_valid_init.csv','bms_valid_init.pkl')
#### . Test ('bms_test_init.csv','bms_test_init.pkl')
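A minimal sketch of the split-and-save step, assuming `train_test_split` with a fixed seed was used; the document does not state the splitting method or seed, only the file names and (later) the resulting sizes of 8323, 100, and 100 rows. The dataframe is synthetic and the files are written to a temporary folder.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as 'bms_train.csv'.
rng = np.random.default_rng(46)
df = pd.DataFrame({
    "Item_MRP": rng.uniform(30, 270, 8523),
    "Item_Outlet_Sales": rng.uniform(30, 13000, 8523),
})

# Hold out 100 rows for test, then 100 of the remainder for validation (assumed strategy).
rest, test = train_test_split(df, test_size=100, random_state=46)
train, valid = train_test_split(rest, test_size=100, random_state=46)

# Save each split as CSV and PKL, using the file names listed above.
with tempfile.TemporaryDirectory() as tmp:
    for name, part in {"train": train, "valid": valid, "test": test}.items():
        part.to_csv(Path(tmp) / f"bms_{name}_init.csv", index=False)
        part.to_pickle(Path(tmp) / f"bms_{name}_init.pkl")
    saved = sorted(p.name for p in Path(tmp).iterdir())

print(train.shape, valid.shape, test.shape)  # (8323, 2) (100, 2) (100, 2)
```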

## 2. Pre-Processing Level - II. (Train, Validation, and Test datasets)

### - Creating a high level item category as 'Item_Category' using 'Item_Identifier' feature by selecting first two letters of the feature values. (FD, DR, NC) ---> (Foods, Drinks, Non Consumables), ***successful***.
### - Imputing missing values for the Features, ***successful***.
#### -- 'Item_Weight': using mapping of 'Item_Identifier' and 'Item_Weight' wherever available.
##### --- for a new item, using the median weight of its 'Item_Category' (derived feature).
#### -- 'Outlet_Size': using the mode of 'Outlet_Size' for the corresponding 'Outlet_Type'.
### - Marking the 'Item_Fat_Content' value as 'Non Edible', where 'Item_Category' is 'Non Consumables', ***successful***. 
### - Remapping the feature values of "Item_Fat_Content" ('LF','low fat') to 'Low Fat', 'reg' to 'Regular', ***successful***.
### - Marking the 'Dairy' values of 'Item_Type' as 'Dairy Drinks' and 'Dairy Foods' where the 'Item_Category' is 'Drinks' and 'Foods' respectively, ***successful***.
### - Creating new feature 'Outlet_Age' from 'Outlet_Establishment_Year' by expression = 2013-year, ***successful***.
### - Correcting 'Item_Visibility': values of 0 are replaced with the mean, ***successful***.
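The Level II steps can be sketched as a single function. This is an illustration under stated assumptions, not the project's actual code: the helper name `preprocess`, the `ref` argument (the frame the statistics are taken from), restricting the 'Item_Type' relabelling to 'Dairy' rows, and using the mean of the non-zero visibilities are all assumptions. The sample rows are synthetic.

```python
import numpy as np
import pandas as pd

CATEGORY_MAP = {"FD": "Foods", "DR": "Drinks", "NC": "Non Consumables"}
FAT_MAP = {"LF": "Low Fat", "low fat": "Low Fat", "reg": "Regular"}


def _mode_or_nan(s):
    """Most frequent non-null value, or NaN when the group has none."""
    m = s.mode()
    return m.iloc[0] if not m.empty else np.nan


def preprocess(df, ref):
    """Clean `df` using statistics computed on `ref` (hypothetical helper)."""
    out, ref = df.copy(), ref.copy()
    for frame in (out, ref):
        frame["Item_Category"] = frame["Item_Identifier"].str[:2].map(CATEGORY_MAP)

    # Item_Weight: known weight of the same item first, then the category median.
    known = ref.groupby("Item_Identifier")["Item_Weight"].first()  # first non-null
    out["Item_Weight"] = out["Item_Weight"].fillna(out["Item_Identifier"].map(known))
    cat_median = ref.groupby("Item_Category")["Item_Weight"].median()
    out["Item_Weight"] = out["Item_Weight"].fillna(out["Item_Category"].map(cat_median))

    # Outlet_Size: mode of the sizes seen for the same Outlet_Type.
    size_mode = ref.groupby("Outlet_Type")["Outlet_Size"].agg(_mode_or_nan)
    out["Outlet_Size"] = out["Outlet_Size"].fillna(out["Outlet_Type"].map(size_mode))

    # Item_Fat_Content: unify labels, then mark non-consumables.
    out["Item_Fat_Content"] = out["Item_Fat_Content"].replace(FAT_MAP)
    out.loc[out["Item_Category"] == "Non Consumables", "Item_Fat_Content"] = "Non Edible"

    # Item_Type: split 'Dairy' by high-level category (assumed to apply to 'Dairy' only).
    dairy = out["Item_Type"] == "Dairy"
    out.loc[dairy & (out["Item_Category"] == "Drinks"), "Item_Type"] = "Dairy Drinks"
    out.loc[dairy & (out["Item_Category"] == "Foods"), "Item_Type"] = "Dairy Foods"

    out["Outlet_Age"] = 2013 - out["Outlet_Establishment_Year"]

    # Item_Visibility: replace zeros with the mean of the non-zero values (assumption).
    mean_vis = ref.loc[ref["Item_Visibility"] > 0, "Item_Visibility"].mean()
    out["Item_Visibility"] = out["Item_Visibility"].replace(0, mean_vis)
    return out


train = pd.DataFrame({
    "Item_Identifier": ["FDA15", "FDA15", "DRC01", "NCD19", "DRC01"],
    "Item_Weight": [9.3, np.nan, 5.92, 8.93, np.nan],
    "Item_Fat_Content": ["LF", "low fat", "reg", "Low Fat", "Regular"],
    "Item_Visibility": [0.016, 0.0, 0.019, 0.02, 0.05],
    "Item_Type": ["Dairy", "Dairy", "Dairy", "Household", "Dairy"],
    "Outlet_Establishment_Year": [1999, 2009, 1999, 1987, 1985],
    "Outlet_Size": ["Medium", np.nan, "Medium", "High", "Medium"],
    "Outlet_Type": ["Supermarket Type1"] * 4 + ["Supermarket Type3"],
})

clean = preprocess(train, ref=train)

# An item unseen in `ref` falls back to the median weight of its category.
new_item = train.iloc[[0]].assign(Item_Identifier="FDZ99", Item_Weight=np.nan)
new_clean = preprocess(new_item, ref=train)
```

Taking the statistics from a reference frame (typically the Train split) keeps the Validation and Test splits from influencing the imputed values.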

### - Saving the pre-processed data into CSV and PKL files, ***successful***.
#### . Train ('bms_train_pp.csv','bms_train_pp.pkl')
#### . Validation ('bms_valid_pp.csv','bms_valid_pp.pkl')
#### . Test ('bms_test_pp.csv','bms_test_pp.pkl')

# C. EDA CONCLUSIONS

## a. Uni-Variate EDA conclusions

### . Numerical Features
#### - There are no missing values in any of the features.
#### - Some features have highly skewed distributions, i.e., they are not normally distributed.
#### - The share of outliers per feature ranges from 0% to 2.18%.

### . Categorical Features 
#### - 'Item_Identifier' feature uniquely identifies each product and has a lot of categories; it can be dropped for model building.
#### - 'Outlet_Identifier' feature uniquely identifies each store; it can be dropped for model building.
#### - 'Item_Type' has a lot of unique categorical values and needs to be handled.
#### - 'Outlet_Establishment_Year' feature is used to create the new feature 'Outlet_Age', so it can be dropped for model building.

## b. Bi-Variate EDA conclusions

### . Numerical-Numerical correlation ('Item_Outlet_Sales' vs others)
#### - High : positive(), negative()
#### - Moderate : positive(Item_MRP), negative()
#### - Low : positive(), negative(Item_Visibility)
#### - Very Low : positive(Item_Weight, Outlet_Age), negative()


### . Categorical-Numerical ('Item_Outlet_Sales' vs others)
#### - Outliers are present for different labels of the categorical features.
#### - Food items contribute more to sales than Drinks and Non Consumables.
#### - Regular Fat items give more sales than Low Fat and Non Edible items, but more units are sold for Low Fat items.
#### - The most units are sold for 'Fruits and Vegetables', 'Snack Foods', and 'Household'. The highest sales are for 'Seafood', 'Starchy Foods', and 'Dairy Foods'.
#### - Outlet OUT027 has the highest sales, and outlets OUT010 and OUT019 have the lowest sales.
#### - Medium size outlets have more sales, but more items are sold in Small size outlets.
#### - The highest sales are in Tier 2 location outlets, but more units are sold in Tier 3 location outlets.
#### - Supermarket Type 3 outlets give more sales, but more items are sold in Supermarket Type 1 outlets.
#### - The oldest outlet (28 Yrs) has higher sales than the others, while the outlet aged 15 Yrs has the lowest sales.


### . Categorical-Categorical ('Item_Outlet_Sales' as Value across categorical features)
#### - Foods with Regular Fat and Drinks with Low Fat give more sales.
#### - Within Foods, Drinks, and Non Consumables, the item types 'Seafood', 'Hard Drinks', and 'Household' respectively contribute more to sales.
#### - Items in each category have highest sales in Medium size, Tier 2, Supermarket Type 3 outlets.

#### - Regular Fat items are contributing more for sales, except for few Low Fat Items.
#### - Regular Fat have higher sales in Medium size, Tier 2 and Tier 1 type, and Supermarket Type 3 outlets.
#### - Non Edible items are giving more sales in Medium size, Tier 2 type, and Supermarket Type 3 outlets.

#### - Higher sales for Medium size, Tier 3, and Supermarket Type 3 outlets.
#### - Higher sales for Small size, Tier 2 outlets.
#### - Supermarket Type 1 has higher sales for Medium size than for High size outlets.

#### - Sale of Grocery Store is lower in all location types.

# D. FEATURE ENGINEERING

## a. Observations

#### 1. Drop non relevant features from the dataset.
#### 2. Outliers handling using IQR method and Capping.
#### 3. Feature transformation checks.
#### 4. Scaling for numerical float type features.
#### 5. Categorical data encoding using Ordinal Encoding and One Hot Encoding Techniques.
#### 6. Feature hasher for high cardinality categorical features to reduce dimensions. (Not Used)
#### 7. Feature selection to increase the performance of the model.

## b. Steps of Feature Engineering
#### 1. Drop non relevant features from the observations.

#### 2. Outlier Detection and Handling using Capping Technique

##### - Detect and handle outliers in numerical features of Train dataset only.
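A minimal sketch of IQR-based capping, assuming the conventional 1.5 x IQR fences; the document does not state the multiplier, and the helper names and sample values are hypothetical.

```python
import pandas as pd


def iqr_bounds(s, k=1.5):
    """Lower and upper fences: Q1 - k*IQR and Q3 + k*IQR (k=1.5 is an assumption)."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def cap_outliers(df, columns, k=1.5):
    """Clip each listed column to its IQR fences; returns the capped frame and the fences."""
    out = df.copy()
    bounds = {}
    for col in columns:
        lo, hi = iqr_bounds(out[col], k)
        out[col] = out[col].clip(lower=lo, upper=hi)
        bounds[col] = (lo, hi)
    return out, bounds


# Synthetic numerical feature with one extreme value.
train = pd.DataFrame({"Item_Visibility": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0]})

lo, hi = iqr_bounds(train["Item_Visibility"])
outlier_pct = 100 * ((train["Item_Visibility"] < lo) | (train["Item_Visibility"] > hi)).mean()

capped, bounds = cap_outliers(train, ["Item_Visibility"])
print(outlier_pct)                       # 10.0
print(capped["Item_Visibility"].max())   # 14.5
```

Capping keeps every row and only pulls extreme values back to the fence, which is why it can be limited to the Train dataset without changing the number of observations.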
#### 3. Feature Transformation (to be applied in pipeline)

#### Checking various feature transformations: (Log), (Square), (Reciprocal), (Square Root), (Exponential), (Yeo-Johnson)
##### - 'Yeo-Johnson' Transformation performs better than the other transformations.
##### - Apply 'Yeo-Johnson' on numerical features of the Train, Validation, and Test datasets.
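One way to check the effect of the Yeo-Johnson transformation is to compare skewness before and after. The sketch below uses scikit-learn's `PowerTransformer` on synthetic right-skewed data standing in for a numerical feature; the column name and the data are assumptions. The transformer is fitted on the Train split only and then reused for Validation.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Synthetic right-skewed values standing in for a numerical feature.
rng = np.random.default_rng(46)
train = pd.DataFrame({"Item_MRP": rng.lognormal(mean=4.0, sigma=0.6, size=1000)})
valid = pd.DataFrame({"Item_MRP": rng.lognormal(mean=4.0, sigma=0.6, size=100)})

# Fit the transformation on Train only, then apply the fitted lambda to Validation.
pt = PowerTransformer(method="yeo-johnson", standardize=False)
train_t = pt.fit_transform(train)
valid_t = pt.transform(valid)

skew_before = train["Item_MRP"].skew()
skew_after = pd.Series(train_t[:, 0]).skew()
print(f"skew before: {skew_before:.3f}, after: {skew_after:.3f}")
```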
#### 4. Scaling (StandardScaler / MinMaxScaler) (to be applied in pipeline)

##### - Apply 'StandardScaler' on numerical features of the Train, Validation, and Test datasets to standardize them to zero mean and unit variance (scaling does not change the skew).
#### 5. Categorical features encoding using Ordinal encoding and OneHot encoding techniques

#### 6. Feature hasher (Not Used)

#### 7. Feature Selection Techniques (to be applied in pipeline)
##### - Apply SelectKBest with mutual_info_regression to select the top k features for model building with the estimators.
##### - Apply the same feature selection to the Train, Validation, and Test datasets.
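The encoding, scaling, and selection steps can be chained in one scikit-learn pipeline so that they are fitted on Train and reused unchanged on Validation and Test. The sketch below is an assumption-laden illustration: the column subset, the category order for 'Outlet_Size', `k=3`, and the synthetic data are not taken from the project.

```python
from functools import partial

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Synthetic stand-in data; the target depends strongly on Item_MRP.
rng = np.random.default_rng(46)
n = 200
X = pd.DataFrame({
    "Item_MRP": rng.uniform(30, 270, n),
    "Item_Visibility": rng.uniform(0.01, 0.3, n),
    "Outlet_Size": rng.choice(["Small", "Medium", "High"], n),
    "Outlet_Type": rng.choice(["Grocery Store", "Supermarket Type1"], n),
})
y = 15 * X["Item_MRP"] + rng.normal(0, 200, n)

# Column-wise preprocessing: scale numbers, order sizes, one-hot encode outlet types.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["Item_MRP", "Item_Visibility"]),
    ("ord", OrdinalEncoder(categories=[["Small", "Medium", "High"]]), ["Outlet_Size"]),
    ("ohe", OneHotEncoder(handle_unknown="ignore"), ["Outlet_Type"]),
])

# Keep the k columns with the highest mutual information with the target.
selector = SelectKBest(score_func=partial(mutual_info_regression, random_state=46), k=3)

features = Pipeline([("preprocess", preprocess), ("select", selector)])
X_sel = features.fit_transform(X, y)

support = features.named_steps["select"].get_support()
print(X_sel.shape)  # (200, 3)
```

Calling `features.transform(...)` on the Validation and Test frames afterwards applies the fitted scaler, encoders, and selected columns without refitting.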

## c. Saving the feature engineered train and test datasets into CSV and PKL files
#### - Train dataset is saved after outlier detection and handling step.
##### - 'bms_FE_train_final.csv'
##### - 'bms_FE_train_final.pkl'

#### - Validation and Test data.
##### - 'bms_FE_valid_final.csv'
##### - 'bms_FE_valid_final.pkl'
##### - 'bms_FE_test_final.csv'
##### - 'bms_FE_test_final.pkl'

# E. ML MODELS IMPLEMENTATIONS & RESULTS

## A. Simple Model (using pipeline)

#### 1. Feature Engineering. (Column Transformation, Standard Scaler, Ordinal Encoder, One Hot Encoder, SelectKBest {mutual info regression})
#### 2. No hyperparameter tuning used.

#### 3. Dataset Size:
- Train Dataset size : (8323, 11)
- Validation Dataset size : (100, 11)

#### 4. Model: LinearRegression()
- Train Dataset R2 score : 0.3536, RMSE : 1372.1487 

- Validation Dataset R2 score : 0.2358, RMSE : 1303.4476

- ***Acceptable baseline performance on the Train and Validation sets.***
- ***Slight overfitting to the Train dataset: R2 drops from 0.3536 on Train to 0.2358 on Validation.***

#### 5. Cross-Validation Score (n_splits=5, shuffle=True, random_state=46)
- Mean R2 Score: 0.3466
- Mean RMSE Score: 1378.9928
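The cross-validation setup above can be sketched with `cross_validate`, reusing the stated `KFold` parameters. The data is synthetic and the pipeline is reduced to a scaler plus `LinearRegression`, so the scores it prints are unrelated to the figures reported in this section.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (assumption): linear signal plus unit-variance noise.
rng = np.random.default_rng(46)
X = rng.normal(size=(500, 4))
y = X @ np.array([3.0, -2.0, 0.5, 0.0]) + rng.normal(scale=1.0, size=500)

cv = KFold(n_splits=5, shuffle=True, random_state=46)
model = make_pipeline(StandardScaler(), LinearRegression())

# scikit-learn reports RMSE as a negated score, so the sign is flipped back below.
scores = cross_validate(
    model, X, y, cv=cv,
    scoring={"r2": "r2", "rmse": "neg_root_mean_squared_error"},
)
mean_r2 = scores["test_r2"].mean()
mean_rmse = -scores["test_rmse"].mean()
print(f"Mean R2: {mean_r2:.4f}, Mean RMSE: {mean_rmse:.4f}")
```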

## B. Best Tuned Model (using pipeline)

#### 1. Models tuned using GridSearchCV with KFold(n_splits=5, shuffle=True, random_state=46); the selected hyperparameters are listed below
- 'Lin_Reg':LinearRegression(),
    
- 'Lasso':Lasso(alpha=0.5, max_iter=1000),
    
- 'Ridge':Ridge(alpha=0.05, max_iter=1500), 

- 'KN_REG':KNeighborsRegressor(algorithm='brute', metric='euclidean', n_neighbors=17, weights='uniform'),

- 'SV_REG':SVR(C=1.0, degree=2, gamma='scale', kernel='linear'),

- 'DT_REG':DecisionTreeRegressor(criterion='squared_error', max_depth=5, min_impurity_decrease=0.0, min_samples_split=0.3, splitter='best', random_state=46),

- 'BAG_REG':BaggingRegressor(bootstrap=True, estimator=KNeighborsRegressor(), max_samples=0.25, n_estimators=200, oob_score=True, random_state=46),

- 'RF_REG':RandomForestRegressor(bootstrap=True, criterion='squared_error', max_depth=5, max_samples=0.25, n_estimators=100, oob_score=True, random_state=46),

- 'GB_REG':GradientBoostingRegressor(criterion='squared_error',learning_rate=0.1, max_depth=3, n_estimators=50, subsample=0.75, random_state=46),

- 'HGB_REG':HistGradientBoostingRegressor(learning_rate=0.1, max_depth=3, max_iter=50, max_leaf_nodes=20, l2_regularization=0.1, random_state=46),

- 'XGB_REG':XGBRegressor(objective='reg:squarederror', eval_metric='rmse', seed=46, eta=0.1, gamma=0.01, max_depth=3, n_estimators=50, subsample=0.75)
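A minimal sketch of how one of these models could be tuned with `GridSearchCV` and the stated `KFold` settings. The parameter grid is hypothetical (the document lists only the selected values, not the searched ranges), and the data is synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic non-linear regression data (assumption).
rng = np.random.default_rng(46)
X = rng.uniform(size=(300, 4))
y = 10 * X[:, 0] + 5 * np.sin(6 * X[:, 1]) + rng.normal(scale=0.5, size=300)

cv = KFold(n_splits=5, shuffle=True, random_state=46)

# Hypothetical search grid; only the final values appear in the model list above.
param_grid = {"max_depth": [3, 5], "n_estimators": [50, 100]}

search = GridSearchCV(
    RandomForestRegressor(bootstrap=True, max_samples=0.25, random_state=46),
    param_grid=param_grid,
    cv=cv,
    scoring="r2",
)
search.fit(X, y)

print(search.best_params_)
print(f"Best mean CV R2: {search.best_score_:.4f}")
```

Repeating the same search for each estimator and comparing `best_score_` would give a ranking of the kind shown in the comparison that follows.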


#### 2. Model's Performance Comparison

##### **Model**    ---> ***R2_Score***

##### 1. **RF_REG**   --->  ***0.5885***
##### 2. **HGB_REG**  --->  ***0.5877***
##### 3. **XGB_REG**  --->  ***0.5876***
##### 4. **GB_REG**   --->  ***0.5868***
##### 5. **BAG_REG**  --->  ***0.5525***
##### 6. **KN_REG**   --->  ***0.5455***
##### 7. **DT_REG**   --->  ***0.5011***
##### 8. **Ridge**    --->  ***0.3491***
##### 9. **Lin_Reg**  --->  ***0.3466***
##### 10. **SV_REG**  --->  ***0.3267***
##### 11. **Lasso**   --->  ***-0.0132***

#### 3. Best Model and Parameters

- **RandomForestRegressor** ***(bootstrap=True, criterion='squared_error', max_depth=5, max_samples=0.25, n_estimators=100, oob_score=True, random_state=46)***


#### 4. Best Model Results

- KFold(n_splits=5, shuffle=True, random_state=46), Mean R2 Score: 0.58849, Mean RMSE Score: 1094.27541, dataset size : (8323, 11)

- Validation Data : Mean R2 Score: 0.6415, Mean RMSE Score: 892.7377, dataset size : (100,11)
- ***Test Data : Mean R2 Score: 0.5689, Mean RMSE Score: 1232.2241***, dataset size : (100,11)

## C. Production Model

- **RandomForestRegressor** ***(bootstrap=True, criterion='squared_error', max_depth=5, max_samples=0.25, n_estimators=100, oob_score=True, random_state=46)***
- KFold(n_splits=10, shuffle=True, random_state=46), Mean R2 Score: 0.5884, Mean RMSE Score: 1092.5346, dataset size : (8423, 11)
- ***Test Data : Mean R2 Score: 0.5693, Mean RMSE Score: 1231.6046***, dataset size : (100,11)

## D. Gradio App Deployment Files

- App development is done using 'Production Model' and 'Production Data'.
- Production Model saved as 'bms_mdl_prod.pkl' file.
- The X train set is saved as 'bms_X_prod.pkl'; it is used to access the unique values and ranges of the features.
- The Test set is saved as 'bms_FE_test_final.pkl'; it is used to test the app.
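A minimal sketch of saving and reloading a fitted model as a PKL file, assuming the standard `pickle` module; the document does not say which serialization library was used. The model is fitted on synthetic data and written to a temporary folder under the file name given above.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (assumption) to obtain a fitted model.
rng = np.random.default_rng(46)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

model = RandomForestRegressor(
    max_depth=5, max_samples=0.25, n_estimators=100, random_state=46
).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "bms_mdl_prod.pkl"
    # Serialize the fitted model, then load it back as the app would at start-up.
    with open(path, "wb") as fh:
        pickle.dump(model, fh)
    with open(path, "rb") as fh:
        restored = pickle.load(fh)

# The restored model should reproduce the original predictions.
same = np.allclose(model.predict(X), restored.predict(X))
print(same)  # True
```

Pickled scikit-learn models are generally only safe to load with the same library version that produced them, so the app environment should pin that version.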