
<table class="table table-bordered">
    <tr>
        <th style="text-align:center; width:25%"><img src='https://www.nus.edu.sg/images/default-source/base/logo.png' style="width: 250px; height: 125px; "></th>
        <th style="text-align:center;"><h1>Applied Machine Learning</h1><h2>Project 1 - Data Preparation </h2><h3></h3></th>
    </tr>
</table>

We will use the supermarket data for a regression task. The data (`supermarket.csv`) were collected at various supermarket outlets and stores in different cities. The aim is to build a predictive model of the sales of each product at a particular outlet. Using this model, the supermarket management team will try to understand which properties of products and outlets play a key role in increasing sales.

Detailed information (i.e. column description) is provided below.

* **Item_Weight:** Weight of product
* **Item_Fat_Content:** Whether the product is low fat or not
* **Item_Visibility:** The % of total display area of all products in a store allocated to the particular product
* **Item_Type:** The category to which the product belongs
* **Item_MRP:** Maximum Retail Price (list price) of the product
* **Outlet_Identifier:** Unique store ID
* **Outlet_Establishment_Year:** The year in which store was established
* **Outlet_Size:** The size of the store in terms of ground area covered
* **Outlet_Location_Type:** The type of city in which the store is located
* **Outlet_Type:** Whether the outlet is just a grocery store or some sort of supermarket
*  <font color='red'> **Item_Outlet_Sales:** Sales of the product in the particular store. This is the TARGET variable. </font>

In this project, you are required to explore and prepare data for ML models by completing the steps below:
* Data Loading and Exploration
* Handling Missing Values if required
* Data Transformation
* Build a simple Linear Regression model


### 1. Data Loading and Exploration
* load the data into a dataframe
* explore both numerical data and categorical data

In [None]:
# import required libraries
import numpy as np
import pandas as pd
import re #regular expression
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# link to google drive
import sys, os
if 'google.colab' in sys.modules:
    # mount google drive
    from google.colab import drive
    drive.mount('/content/gdrive')
    path_to_file = '/content/gdrive/My Drive/Applied_ML/Projects'
    os.chdir(path_to_file)
    !pwd

Mounted at /content/gdrive
/content/gdrive/My Drive/Applied_ML/Projects


In [None]:
# load dataset
df = pd.read_csv('supermarket.csv')

# backup copy
df_backup = df.copy()

# data overview
df.head()

Unnamed: 0,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


In [None]:
# inspect data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8337 entries, 0 to 8336
Data columns (total 11 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Weight                8337 non-null   float64
 1   Item_Fat_Content           8337 non-null   object 
 2   Item_Visibility            8337 non-null   float64
 3   Item_Type                  8337 non-null   object 
 4   Item_MRP                   8337 non-null   float64
 5   Outlet_Identifier          8337 non-null   object 
 6   Outlet_Establishment_Year  8337 non-null   int64  
 7   Outlet_Size                5955 non-null   object 
 8   Outlet_Location_Type       8337 non-null   object 
 9   Outlet_Type                8337 non-null   object 
 10  Item_Outlet_Sales          8337 non-null   float64
dtypes: float64(4), int64(1), object(6)
memory usage: 716.6+ KB


In [None]:
# numeric data
df_num = df.select_dtypes(['int64', 'float64'])
df_num.head()

Unnamed: 0,Item_Weight,Item_Visibility,Item_MRP,Outlet_Establishment_Year,Item_Outlet_Sales
0,9.3,0.016047,249.8092,1999,3735.138
1,5.92,0.019278,48.2692,2009,443.4228
2,17.5,0.01676,141.618,1999,2097.27
3,19.2,0.0,182.095,1998,732.38
4,8.93,0.0,53.8614,1987,994.7052


In [None]:
# categorical data
df_cat = df.select_dtypes(['object'])
df_cat.head()

Unnamed: 0,Item_Fat_Content,Item_Type,Outlet_Identifier,Outlet_Size,Outlet_Location_Type,Outlet_Type
0,Low Fat,Dairy,OUT049,Medium,Tier 1,Supermarket Type1
1,Regular,Soft Drinks,OUT018,Medium,Tier 3,Supermarket Type2
2,Low Fat,Meat,OUT049,Medium,Tier 1,Supermarket Type1
3,Regular,Fruits and Vegetables,OUT010,,Tier 3,Grocery Store
4,Low Fat,Household,OUT013,High,Tier 3,Supermarket Type1
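
Beyond `head()`, summary statistics for the numerical columns and label counts for the categorical columns are a common next step when exploring data. The following is a minimal sketch on a toy frame: the column names are borrowed from the dataset, but the values are made up for illustration and the variable names (`toy`, `num_summary`, `label_counts`) are hypothetical.

```python
import pandas as pd

# Toy frame with two of the notebook's column names; the values are made up
toy = pd.DataFrame({
    'Item_MRP': [249.8, 48.3, 141.6, 182.1],
    'Outlet_Size': ['Medium', 'Medium', None, 'High'],
})

# Numerical columns: count, mean, std, min, quartiles, max
num_summary = toy.select_dtypes('number').describe()

# Categorical column: number of distinct labels (missing values excluded)
n_labels = toy['Outlet_Size'].nunique()

# Frequency of each label, keeping missing values as their own row
label_counts = toy['Outlet_Size'].value_counts(dropna=False)

print(num_summary)
print(n_labels)
print(label_counts)
```

On the real data, the same calls could be applied to `df_num` and to each column of `df_cat`.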


### 2. Missing Values
* for categorical data, replace the missing value with a constant string or the most frequent category;
* for numerical data, replace the missing value with a constant number or the mean / median value.

In [None]:
# finding NA values
df.isnull().sum()

Unnamed: 0,0
Item_Weight,0
Item_Fat_Content,0
Item_Visibility,0
Item_Type,0
Item_MRP,0
Outlet_Identifier,0
Outlet_Establishment_Year,0
Outlet_Size,2382
Outlet_Location_Type,0
Outlet_Type,0


In [None]:
# find the most frequent value of df['Outlet_Size']
df['Outlet_Size'].value_counts()

Outlet_Size,count
Medium,2676
Small,2362
High,917


In [None]:
# replacing NA values with 'Medium'
df['Outlet_Size'] = df['Outlet_Size'].fillna('Medium')
# rechecking for NA values
df.isnull().sum()

Unnamed: 0,0
Item_Weight,0
Item_Fat_Content,0
Item_Visibility,0
Item_Type,0
Item_MRP,0
Outlet_Identifier,0
Outlet_Establishment_Year,0
Outlet_Size,0
Outlet_Location_Type,0
Outlet_Type,0
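
The cells above fill the missing `Outlet_Size` values with the most frequent category computed over the whole dataset. A variant that is sometimes preferred is to learn the fill value from the training rows only and then reuse it on the test rows, so that no information from the test split enters the preparation step. The sketch below illustrates that idea on toy data; the frames `train` and `test` and their values are assumptions for illustration, not the notebook's actual split.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the Outlet_Size column (values are illustrative only)
train = pd.DataFrame({'Outlet_Size': ['Medium', 'Small', np.nan, 'Medium', 'High']})
test = pd.DataFrame({'Outlet_Size': [np.nan, 'Small']})

# Learn the most frequent category from the training rows only
# (mode() ignores missing values by default)
fill_value = train['Outlet_Size'].mode()[0]

# Apply the same fill value to both splits
train_filled = train.fillna({'Outlet_Size': fill_value})
test_filled = test.fillna({'Outlet_Size': fill_value})

print(fill_value)
print(test_filled)
```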


### 3. Data Transformation
* Split Data into Train Data and Test Data
* Transform the input data for both X_train and X_test

In [None]:
# target and features
y = df['Item_Outlet_Sales']
X = df.drop(['Item_Outlet_Sales'], axis=1)

In [None]:
# split train & test
from sklearn.model_selection import train_test_split
X_train1, X_test1, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=13)

In [None]:
# numerical attributes
num_attribs = list(X.select_dtypes(['int64', 'float64']))
num_attribs

['Item_Weight', 'Item_Visibility', 'Item_MRP', 'Outlet_Establishment_Year']

In [None]:
# categorical attributes
cat_attribs = list(X.select_dtypes(['object']))
cat_attribs

['Item_Fat_Content',
 'Item_Type',
 'Outlet_Identifier',
 'Outlet_Size',
 'Outlet_Location_Type',
 'Outlet_Type']

In [None]:
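# scaling and encoding (both are fitted on the training split only)
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
scaler = StandardScaler()   # scale numerical attributes to zero mean and unit variance
encoder = OrdinalEncoder()  # encode categorical attributes as integer codes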

X_train = X_train1.copy()
X_train[num_attribs] = scaler.fit_transform(X_train1[num_attribs])
X_train[cat_attribs] = encoder.fit_transform(X_train1[cat_attribs])
X_train

Unnamed: 0,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type
3746,1.289876,0.0,-1.015636,9.0,1.382161,8.0,-0.121304,2.0,0.0,1.0
800,0.548981,0.0,-0.634897,9.0,-0.459708,3.0,1.319654,1.0,2.0,2.0
7046,-0.250714,1.0,0.522466,13.0,-0.197346,8.0,-0.121304,2.0,0.0,1.0
4838,-0.001951,0.0,-0.322011,14.0,-0.717803,5.0,-1.562262,1.0,2.0,3.0
4595,0.795946,0.0,0.396974,4.0,-0.716735,6.0,0.719255,2.0,1.0,1.0
...,...,...,...,...,...,...,...,...,...,...
5876,1.148753,1.0,-0.713850,6.0,0.050976,6.0,0.719255,2.0,1.0,1.0
866,0.666584,1.0,1.082638,6.0,-0.473278,9.0,0.118855,1.0,0.0,1.0
7696,-1.312663,0.0,-0.682361,9.0,0.594563,3.0,1.319654,1.0,2.0,2.0
74,-1.052762,1.0,-0.818975,5.0,-0.410632,8.0,-0.121304,2.0,0.0,1.0


In [None]:
# apply the scaler and encoder fitted on the training split to the test split
X_test = X_test1.copy()
X_test[num_attribs] = scaler.transform(X_test1[num_attribs])
X_test[cat_attribs] = encoder.transform(X_test1[cat_attribs])
X_test

Unnamed: 0,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type
4339,-1.324424,0.0,-0.164612,13.0,-0.416052,1.0,-1.322103,0.0,2.0,1.0
1384,-1.462018,0.0,-0.251385,3.0,0.169184,2.0,1.079494,1.0,1.0,1.0
6081,0.066812,0.0,-0.196575,14.0,0.080123,8.0,-0.121304,2.0,0.0,1.0
7461,0.901788,0.0,0.076208,4.0,1.270440,2.0,1.079494,1.0,1.0,1.0
1106,-1.026890,0.0,0.603793,3.0,-1.430380,6.0,0.719255,2.0,1.0,1.0
...,...,...,...,...,...,...,...,...,...,...
1754,-0.168393,0.0,-0.988134,5.0,1.166784,1.0,-1.322103,0.0,2.0,1.0
3040,-1.511411,0.0,1.274495,1.0,-0.667617,0.0,-0.001224,1.0,2.0,0.0
559,0.913548,1.0,-1.286513,14.0,-1.620921,8.0,-0.121304,2.0,0.0,1.0
5904,-0.909287,1.0,0.046844,2.0,-1.366667,1.0,-1.322103,0.0,2.0,1.0
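
The same fit-on-train, transform-on-test pattern can also be expressed with a single `ColumnTransformer`, which routes each column group to its own transformer. This is a sketch under assumptions, not the notebook's implementation: the toy frames `X_tr` and `X_te` and their values are made up, and only one numerical and one categorical column are used.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Toy rows standing in for X_train1 / X_test1 (values are illustrative only)
X_tr = pd.DataFrame({'Item_MRP': [100.0, 200.0, 300.0],
                     'Outlet_Size': ['Small', 'Medium', 'High']})
X_te = pd.DataFrame({'Item_MRP': [200.0],
                     'Outlet_Size': ['Medium']})

# One transformer per column group, mirroring the scaler / encoder pair above
preprocess = ColumnTransformer([
    ('num', StandardScaler(), ['Item_MRP']),
    ('cat', OrdinalEncoder(), ['Outlet_Size']),
])

# Statistics and category codes are learned from the training rows only
X_tr_prepared = preprocess.fit_transform(X_tr)

# The test rows reuse the fitted statistics and codes
X_te_prepared = preprocess.transform(X_te)

print(X_tr_prepared)
print(X_te_prepared)
```

`OrdinalEncoder` assigns integer codes in sorted label order (here High, Medium, Small), so a linear model treats the codes as if they were evenly spaced numbers. For categories without a natural order, one-hot encoding is a common alternative.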


### 4. Build a Simple Linear Regression Model

In [None]:
# Linear Regression Model
from sklearn import linear_model
lm_reg = linear_model.LinearRegression()
lm_reg.fit(X_train, y_train)

In [None]:
# print out the model coefficients and intercept
print(lm_reg.coef_)
print(lm_reg.intercept_)

[  -4.50787055   56.2732685   -71.09253767   -2.22449604  848.84646458
   65.84962399   47.44823629 -371.72760864 -206.14389184  756.11683626]
1519.8164676409458


In [None]:
from sklearn.metrics import mean_absolute_error
print('train_mae:', mean_absolute_error(lm_reg.predict(X_train), y_train),
      '\n test_mae:', mean_absolute_error(lm_reg.predict(X_test), y_test))

# on average, the model's predictions are off by roughly 800 (in the units of Item_Outlet_Sales)

train_mae: 834.3658463258274 
 test_mae: 797.7098547388285


In [None]:
# R squared value
lm_reg.score(X_test, y_test)

# the model explains about 51.9% of the variance of Item_Outlet_Sales on the test set

0.5192037219357589
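
To make the two metrics above concrete, the sketch below computes the mean absolute error and R squared by hand on a small made-up example. The arrays `y_true` and `y_pred` are hypothetical values chosen so the arithmetic is easy to follow; they are not taken from the supermarket data.

```python
import numpy as np

# Made-up targets and predictions (illustrative only)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])

# Mean absolute error: average size of the prediction errors
mae = np.mean(np.abs(y_true - y_pred))

# R squared: 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)         # squared prediction errors
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared deviations from the mean
r2 = 1 - ss_res / ss_tot

print(mae, r2)
```

An R squared of 1 would mean perfect predictions, while 0 corresponds to always predicting the mean of the targets.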

### (Additional) Attempt with feature selection and interaction effect

In [None]:
# correlation of each scaled / encoded feature with the target, on the training split
train_df = pd.concat([X_train, y_train], axis=1)
train_df.corr()['Item_Outlet_Sales'].sort_values()

Unnamed: 0,Item_Outlet_Sales
Item_Visibility,-0.134391
Outlet_Size,-0.089834
Outlet_Establishment_Year,0.00769
Item_Weight,0.011458
Item_Type,0.014416
Item_Fat_Content,0.01705
Outlet_Location_Type,0.060481
Outlet_Identifier,0.178786
Outlet_Type,0.369826
Item_MRP,0.551476


In [None]:
# keep the four features with the largest absolute correlation with the target
X_train2 = X_train[['Item_MRP', 'Outlet_Identifier', 'Item_Visibility', 'Outlet_Type']]
X_test2 = X_test[['Item_MRP', 'Outlet_Identifier', 'Item_Visibility', 'Outlet_Type']]

# new model
lm_reg2 = linear_model.LinearRegression()
lm_reg2.fit(X_train2, y_train)

# mae
print('train_mae:', mean_absolute_error(lm_reg2.predict(X_train2), y_train),
      '\n test_mae:', mean_absolute_error(lm_reg2.predict(X_test2), y_test))

# rsq
lm_reg2.score(X_test2, y_test)


train_mae: 845.8583844317438 
 test_mae: 807.4696872426134


0.507437034370394

In [None]:
# linear regression with degree-2 polynomial features:
# interaction_only=False adds squared terms as well as pairwise interactions
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(interaction_only=False, include_bias=False)
X_train_interaction = poly.fit_transform(X_train2)
X_test_interaction = poly.transform(X_test2)

# Train the model with the polynomial (interaction and squared) features
model = linear_model.LinearRegression()
model.fit(X_train_interaction, y_train)

# mae
print('train_mae:', mean_absolute_error(model.predict(X_train_interaction), y_train),
      '\n test_mae:', mean_absolute_error(model.predict(X_test_interaction), y_test))

# rsq
model.score(X_test_interaction, y_test)

train_mae: 779.0163923959155 
 test_mae: 735.297847824815


0.5661823325544586