# DATA ASSESSMENT PROCESS and PRE-PROCESSING - {"BIGMART SALES" DATASET}

# A. DATA ASSESSMENT PROCESS

## 1. DATASET SUMMARY
### - The 'Big Mart Sales' dataset provides item and outlet features that are used to predict the sales of each product at each store.
### - Train dataset has 8523 observations and 12 features.
### - Test dataset has 5681 observations and 11 features.

### - The train dataset has 4 'float' type features, 1 'integer' type feature, and 7 'object' type features.
#### - float type : 'Item_Weight', 'Item_Visibility', 'Item_MRP', 'Item_Outlet_Sales'.
#### - int type : 'Outlet_Establishment_Year'.
#### - object type : 'Item_Identifier', 'Item_Fat_Content', 'Item_Type', 'Outlet_Identifier', 'Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type'.
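The type breakdown above can be reproduced with `select_dtypes`. The sketch below is only illustrative: it uses two made-up rows with the 12 column names listed above instead of loading 'bms_train.csv', and the variable names are assumptions, not part of the original notebook.

```python
import pandas as pd

# Two made-up rows (illustrative values only) with the 12 column names from the summary.
sample = pd.DataFrame({
    "Item_Identifier": ["FDA15", "DRC01"],
    "Item_Weight": [9.3, 5.92],
    "Item_Fat_Content": ["Low Fat", "Regular"],
    "Item_Visibility": [0.016, 0.019],
    "Item_Type": ["Dairy", "Soft Drinks"],
    "Item_MRP": [249.81, 48.27],
    "Outlet_Identifier": ["OUT049", "OUT018"],
    "Outlet_Establishment_Year": [1999, 2009],
    "Outlet_Size": ["Medium", "Medium"],
    "Outlet_Location_Type": ["Tier 1", "Tier 3"],
    "Outlet_Type": ["Supermarket Type1", "Supermarket Type2"],
    "Item_Outlet_Sales": [3735.14, 443.42],
})

# Group the column names by storage type.
float_cols = sample.select_dtypes(include="float64").columns.tolist()
int_cols = sample.select_dtypes(include="int64").columns.tolist()
text_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(len(float_cols), len(int_cols), len(text_cols))
```

On the real file, the same three calls applied to the loaded dataframe should list the features named above.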

### - For machine learning applications, 'Item_Outlet_Sales' is the target (dependent) variable and the other 11 features are the input (independent) variables.

## 2. FEATURE DESCRIPTIONS
### Table - 'bms_train' using 'bms_train.csv'
### - 'Item_Identifier' feature {object type, independent feature} - ***Unique identifier number assigned to each item.***
### - 'Item_Weight' feature {float type, independent feature} - ***Item weight in gms.***
### - 'Item_Fat_Content' feature {object type, independent feature} - ***Item fat content.***
### - 'Item_Visibility' feature {float type, independent feature} - ***Percentage of the total display area of all items in a store that is allocated to this particular item. Placement value of each item: (0 - 'Far & Behind') and (1 - 'Near & Front').***
### - 'Item_Type' feature {object type, independent feature} - ***Type of food category item belongs to.***
### - 'Item_MRP' feature {float type, independent feature} - ***Maximum Retail Price (MRP) of the item in the outlet.***
### - 'Outlet_Identifier' feature {object type, independent feature} - ***Unique outlet identifier.***
### - 'Outlet_Establishment_Year' feature {int type, independent feature} - ***Year of outlet establishment.***
### - 'Outlet_Size' feature {object type, independent feature} - ***Size of the outlet.***
### - 'Outlet_Location_Type' feature {object type, independent feature} - ***Tier of city in which outlet is located.***
### - 'Outlet_Type' feature {object type, independent feature} - ***Type of outlet.***
### - 'Item_Outlet_Sales' feature {float type, dependent feature} - ***Sales of the item from the outlet.***
#### -- This feature is the target variable for machine learning regression problems.

## 3. DATA ISSUES
### Table - bms_train
#### 1. * Dirty Data (Low quality)
##### A. Completeness
##### - Missing Values: Item_Weight=1463, Outlet_Size=2410
##### B. Validity
##### - No duplicate observations
##### C. Accuracy
##### - No inaccuracy issue
##### D. Consistency
##### - "Item_Fat_Content" feature values should be marked as 'Non Edible' for items that are 'Non Consumables'.
##### - "Item_Fat_Content" feature values 'LF' and 'low fat' must be mapped to 'Low Fat', and 'reg' must be mapped to 'Regular'.
##### - "Item_Type" feature values should be marked as 'DR_Dairy' and 'FD_Dairy' where the item is in 'Drinks' and 'Foods' respectively.
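The completeness, validity, and consistency checks listed for 'bms_train' can be sketched with a few pandas calls. The toy dataframe below is an assumption made for illustration (four invented rows), so the counts it produces are not the real ones reported above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for 'bms_train' (invented rows, only the columns needed for the checks).
toy = pd.DataFrame({
    "Item_Identifier": ["FDA15", "DRC01", "NCD19", "FDX07"],
    "Item_Weight": [9.3, np.nan, 8.93, 19.2],
    "Item_Fat_Content": ["Low Fat", "reg", "LF", "low fat"],
    "Outlet_Size": ["Medium", None, "High", None],
})

# Completeness: number of missing values per column.
missing_per_column = toy.isna().sum()

# Validity: number of fully duplicated rows.
n_duplicates = int(toy.duplicated().sum())

# Consistency: distinct labels reveal spelling variants such as 'LF' and 'reg'.
fat_labels = sorted(toy["Item_Fat_Content"].unique())

print(missing_per_column)
print(n_duplicates)
print(fat_labels)
```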

#### 2. * Messy Data (Untidy / Structural)
##### - No messy data issue

# B. PRE-PROCESSING / DATA CLEANING

## 1. Pre-Processing Level - I.

### - Nothing to pre-process
### - Splitting the dataframe loaded from 'bms_train.csv' into Train, Validation, and Test datasets, ***successful***.
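A minimal sketch of a three-way split is shown here. The split ratios and the random seed are not stated in this document, so the 70/15/15 proportions, the seed, and the function name `split_train_valid_test` are all assumptions chosen for illustration.

```python
import pandas as pd

def split_train_valid_test(df, valid_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle the rows once, then cut them into three disjoint parts.

    The fractions and the seed are illustrative assumptions.
    """
    shuffled = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    n_test = int(round(len(shuffled) * test_frac))
    n_valid = int(round(len(shuffled) * valid_frac))
    test = shuffled.iloc[:n_test]
    valid = shuffled.iloc[n_test:n_test + n_valid]
    train = shuffled.iloc[n_test + n_valid:]
    return train, valid, test

# Stand-in for the dataframe read from 'bms_train.csv'.
demo = pd.DataFrame({"Item_MRP": range(100)})
train, valid, test = split_train_valid_test(demo)
print(len(train), len(valid), len(test))
```

Shuffling once and slicing keeps the three parts disjoint and makes the split reproducible for a fixed seed.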

### - Saving the split data into CSV and PKL files, ***successful***.
#### . Train ('bms_train_init.csv','bms_train_init.pkl')
#### . Validation ('bms_valid_init.csv','bms_valid_init.pkl')
#### . Test ('bms_test_init.csv','bms_test_init.pkl')
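Saving each split in both formats can be sketched as follows. The helper name `save_csv_and_pkl` is hypothetical, and a temporary directory is used here only so that the example leaves no files behind.

```python
import tempfile
from pathlib import Path

import pandas as pd

def save_csv_and_pkl(df, directory, stem):
    """Write df as '<stem>.csv' and '<stem>.pkl' inside directory (hypothetical helper)."""
    directory = Path(directory)
    csv_path = directory / f"{stem}.csv"
    pkl_path = directory / f"{stem}.pkl"
    df.to_csv(csv_path, index=False)  # plain text, portable
    df.to_pickle(pkl_path)            # keeps dtypes exactly
    return csv_path, pkl_path

# Stand-in for one of the split dataframes.
demo = pd.DataFrame({"Item_MRP": [249.81, 48.27]})

with tempfile.TemporaryDirectory() as tmp:
    csv_path, pkl_path = save_csv_and_pkl(demo, tmp, "bms_train_init")
    restored = pd.read_pickle(pkl_path)
    roundtrip_ok = restored.equals(demo)

print(csv_path.name, pkl_path.name, roundtrip_ok)
```

The pickle file preserves column types on reload, while the CSV copy stays readable by other tools.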

## 2. Pre-Processing Level - II. (Train, Validation, and Test datasets)

### - Creating a high-level item category feature, 'Item_Category', from the first two letters of the 'Item_Identifier' values: (FD, DR, NC) ---> (Foods, Drinks, Non Consumables), ***successful***.
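The prefix-to-category mapping can be sketched like this. The function name `add_item_category` and the sample identifiers are assumptions; the mapping itself is the one given above.

```python
import pandas as pd

# Mapping taken from the step description: first two letters of 'Item_Identifier'.
CATEGORY_BY_PREFIX = {"FD": "Foods", "DR": "Drinks", "NC": "Non Consumables"}

def add_item_category(df):
    """Return a copy of df with the derived 'Item_Category' column."""
    out = df.copy()
    out["Item_Category"] = out["Item_Identifier"].str[:2].map(CATEGORY_BY_PREFIX)
    return out

demo = pd.DataFrame({"Item_Identifier": ["FDA15", "DRC01", "NCD19"]})
result = add_item_category(demo)
print(result)
```

Any identifier with an unexpected prefix would map to a missing value, which makes such rows easy to spot.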
### - Imputing missing values for the Features, ***successful***.
#### -- 'Item_Weight': using the known weight of the same 'Item_Identifier' wherever available.
##### --- if the item is new (no known weight), then using the median weight of its 'Item_Category' (derived feature).
#### -- 'Outlet_Size': using the mode of 'Outlet_Size' for the same 'Outlet_Type'.
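The two imputation rules can be sketched as below. This is one possible reading of the steps, not the notebook's actual code: the function names are hypothetical, the lookups are computed on the same dataframe that is being filled, and the category median is taken after the identifier-based fill.

```python
import numpy as np
import pandas as pd

def impute_item_weight(df):
    """Fill 'Item_Weight' from the same item first, then from the category median."""
    out = df.copy()
    # Known weight per identifier: first non-missing weight seen for that item.
    known = (
        out.dropna(subset=["Item_Weight"])
        .groupby("Item_Identifier")["Item_Weight"]
        .first()
    )
    out["Item_Weight"] = out["Item_Weight"].fillna(out["Item_Identifier"].map(known))
    # Fallback for items with no known weight: median weight of the item's category.
    category_median = out.groupby("Item_Category")["Item_Weight"].transform("median")
    out["Item_Weight"] = out["Item_Weight"].fillna(category_median)
    return out

def _mode_or_nan(values):
    modes = values.mode()  # mode() ignores missing values
    return modes.iloc[0] if not modes.empty else np.nan

def impute_outlet_size(df):
    """Fill 'Outlet_Size' with the most frequent size of the same 'Outlet_Type'."""
    out = df.copy()
    mode_by_type = out.groupby("Outlet_Type")["Outlet_Size"].agg(_mode_or_nan)
    out["Outlet_Size"] = out["Outlet_Size"].fillna(out["Outlet_Type"].map(mode_by_type))
    return out

# Invented rows for illustration.
demo = pd.DataFrame({
    "Item_Identifier": ["FDA15", "FDA15", "FDB20", "FDC30", "FDD40", "DRC01"],
    "Item_Category": ["Foods", "Foods", "Foods", "Foods", "Foods", "Drinks"],
    "Item_Weight": [9.0, np.nan, 11.0, np.nan, 13.0, 5.0],
    "Outlet_Type": ["Grocery Store"] * 3 + ["Supermarket Type1"] * 3,
    "Outlet_Size": ["Small", "Small", None, "Medium", "Medium", None],
})
filled = impute_outlet_size(impute_item_weight(demo))
print(filled[["Item_Identifier", "Item_Weight", "Outlet_Size"]])
```

When the same rules are applied to the validation and test splits, one option is to compute the lookups on the training split only and reuse them, which this sketch does not show.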
### - Marking the 'Item_Fat_Content' value as 'Non Edible', where 'Item_Category' is 'Non Consumables', ***successful***. 
### - Remapping the feature values of "Item_Fat_Content" ('LF','low fat') to 'Low Fat', 'reg' to 'Regular', ***successful***.
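The two 'Item_Fat_Content' corrections can be sketched together. The function name `clean_fat_content` and the sample rows are assumptions; the label mapping and the 'Non Edible' rule come from the steps above.

```python
import pandas as pd

# Spelling variants listed in the step description.
FAT_CONTENT_MAP = {"LF": "Low Fat", "low fat": "Low Fat", "reg": "Regular"}

def clean_fat_content(df):
    """Normalise the fat-content labels and flag non-consumable items."""
    out = df.copy()
    out["Item_Fat_Content"] = out["Item_Fat_Content"].replace(FAT_CONTENT_MAP)
    # Fat content is not meaningful for non-consumable items.
    non_consumable = out["Item_Category"] == "Non Consumables"
    out.loc[non_consumable, "Item_Fat_Content"] = "Non Edible"
    return out

# Invented rows for illustration.
demo = pd.DataFrame({
    "Item_Fat_Content": ["LF", "low fat", "reg", "Low Fat", "Regular"],
    "Item_Category": ["Foods", "Drinks", "Foods", "Non Consumables", "Foods"],
})
cleaned = clean_fat_content(demo)
print(cleaned["Item_Fat_Content"].tolist())
```

The order of the two operations does not affect the result, because the remapping never touches the 'Non Edible' label.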
### - Marking the 'Item_Type' as 'Dairy Drinks' and 'Dairy Foods' where the 'Item_Category' is 'Drinks' and 'Foods' respectively, ***successful***.
### - Creating a new feature 'Outlet_Age' from 'Outlet_Establishment_Year', computed as 2013 - 'Outlet_Establishment_Year', ***successful***.
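The age calculation is a single subtraction. In the sketch below the reference year 2013 comes from the step above, while the function name `add_outlet_age` and the sample years are assumptions.

```python
import pandas as pd

REFERENCE_YEAR = 2013  # reference year stated in the step description

def add_outlet_age(df, reference_year=REFERENCE_YEAR):
    """Return a copy of df with 'Outlet_Age' in years."""
    out = df.copy()
    out["Outlet_Age"] = reference_year - out["Outlet_Establishment_Year"]
    return out

demo = pd.DataFrame({"Outlet_Establishment_Year": [1985, 1999, 2009]})
aged = add_outlet_age(demo)
print(aged)
```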
### - Correcting 'Item_Visibility': replacing values of 0 with the mean visibility, ***successful***.
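A minimal sketch of the visibility correction follows. The step does not say which mean is used, so this example assumes the mean of the non-zero values across the whole dataframe; the function name `fix_zero_visibility` is also hypothetical.

```python
import pandas as pd

def fix_zero_visibility(df):
    """Replace zero 'Item_Visibility' with the mean of the non-zero values (assumed rule)."""
    out = df.copy()
    is_zero = out["Item_Visibility"] == 0
    nonzero_mean = out.loc[~is_zero, "Item_Visibility"].mean()
    out.loc[is_zero, "Item_Visibility"] = nonzero_mean
    return out

# Invented values for illustration.
demo = pd.DataFrame({"Item_Visibility": [0.0, 0.02, 0.04, 0.0]})
fixed = fix_zero_visibility(demo)
print(fixed["Item_Visibility"].tolist())
```

Using the mean over all rows, zeros included, would give a smaller replacement value, so the choice is worth stating explicitly.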

### - Saving the pre-processed data into CSV and PKL files, ***successful***.
#### . Train ('bms_train_pp.csv','bms_train_pp.pkl')
#### . Validation ('bms_valid_pp.csv','bms_valid_pp.pkl')
#### . Test ('bms_test_pp.csv','bms_test_pp.pkl')