# Prediction of sales

### Problem Statement
[The dataset](https://drive.google.com/file/d/1B07fvYosBNdIwlZxSmxDfeAf9KaygX89/view?usp=sharing) represents sales data for 1559 products across 10 stores in different cities. Attributes of each product and store are also available. The aim is to build a predictive model that estimates the sales of each product at a particular store.

|Variable|Description|
|:-------------|:-------------|
|Item_Identifier|Unique product ID|
|Item_Weight|Weight of product|
|Item_Fat_Content|Whether the product is low fat or not|
|Item_Visibility|The % of total display area of all products in a store allocated to the particular product|
|Item_Type|The category to which the product belongs|
|Item_MRP|Maximum Retail Price (list price) of the product|
|Outlet_Identifier|Unique store ID|
|Outlet_Establishment_Year|The year in which store was established|
|Outlet_Size|The size of the store in terms of ground area covered|
|Outlet_Location_Type|The type of city in which the store is located|
|Outlet_Type|Whether the outlet is just a grocery store or some sort of supermarket|
|Item_Outlet_Sales|Sales of the product in the particular store. This is the outcome variable to be predicted.|

Please note that the data may have missing values as some stores might not report all the data due to technical glitches. Hence, it will be required to treat them accordingly.



### In the following weeks, we will explore the problem in the following stages:

1. **Hypothesis Generation – understanding the problem better by brainstorming possible factors that can impact the outcome**
2. **Data Exploration – looking at categorical & continuous feature summaries and making inferences about the data**
3. **Data Cleaning – imputing missing values in the data and checking for outliers**
4. **Feature Engineering – modifying existing variables and/or creating new ones for analysis**
5. **Model Building – making predictive models on the data**
---------

## 4. Feature Engineering

1. Resolve the remaining issues in the data to make it ready for analysis.
2. Create some new variables from the existing ones.





### Create a broad category of Type of Item

The `Item_Type` variable has many categories, so a broader grouping might prove useful in analysis. Look at `Item_Identifier`, the unique ID of each item: it starts with either FD, DR or NC. Judging by the item categories, these prefixes appear to stand for Food, Drinks and Non-Consumables.

**Task:** Use the `Item_Identifier` variable to create a new column.

In [1]:
import pandas as pd

data = pd.read_csv('regression_exercise_cleaned.csv', index_col=0)
data.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,Small,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


In [2]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8523 entries, 0 to 8522
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Identifier            8523 non-null   object 
 1   Item_Weight                8523 non-null   float64
 2   Item_Fat_Content           8523 non-null   object 
 3   Item_Visibility            8523 non-null   float64
 4   Item_Type                  8523 non-null   object 
 5   Item_MRP                   8523 non-null   float64
 6   Outlet_Identifier          8523 non-null   object 
 7   Outlet_Establishment_Year  8523 non-null   int64  
 8   Outlet_Size                8523 non-null   object 
 9   Outlet_Location_Type       8523 non-null   object 
 10  Outlet_Type                8523 non-null   object 
 11  Item_Outlet_Sales          8523 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 865.6+ KB


In [3]:
data['Broad_Item_Type'] = data['Item_Identifier'].str[:2]

In [4]:
data.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales,Broad_Item_Type
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138,FD
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228,DR
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27,FD
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,Small,Tier 3,Grocery Store,732.38,FD
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052,NC
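
The two-letter prefixes can also be mapped onto readable labels. The sketch below is only an illustration: it assumes the prefixes mean Food, Drinks and Non-Consumables as guessed above, and the `prefix_labels` dictionary and the toy frame are hypothetical names, not part of the dataset. The rest of this notebook keeps the raw two-letter codes.

```python
import pandas as pd

# Toy frame with identifiers taken from the first rows shown above
toy = pd.DataFrame({"Item_Identifier": ["FDA15", "DRC01", "FDN15", "NCD19"]})

# Hypothetical mapping from prefix to readable label (assumed meaning of the codes)
prefix_labels = {"FD": "Food", "DR": "Drinks", "NC": "Non-Consumable"}

prefix = toy["Item_Identifier"].str[:2]
toy["Broad_Item_Type"] = prefix.map(prefix_labels)

# Any prefix missing from the mapping becomes NaN, so check before moving on
unmapped = toy.loc[toy["Broad_Item_Type"].isna(), "Item_Identifier"].tolist()
print(toy)
print("unmapped identifiers:", unmapped)
```

Using `map` rather than `replace` makes an unexpected prefix visible as a missing value instead of letting it pass through unchanged.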


### Determine the years of operation of a store

**Task:** Make a new column depicting the years of operation of a store (i.e. how long the store has existed).

In [5]:
# referenced: https://stackoverflow.com/questions/26788854/pandas-get-the-age-from-a-date-example-date-of-birth

In [6]:
import datetime as DT
import io
import numpy as np
import pandas as pd

In [7]:
pd.options.mode.chained_assignment = 'warn'

In [8]:
now = pd.Timestamp('now')
data['Outlet_Establishment_Year'] = pd.to_datetime(data['Outlet_Establishment_Year'], format='%Y')
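# whole years since establishment, approximated from the day count
# (casting a timedelta to a year unit with .astype('<m8[Y]') may raise in newer pandas releases)
data['Years_of_Operation'] = (now - data['Outlet_Establishment_Year']).dt.days // 365.25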
data.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales,Broad_Item_Type,Years_of_Operation
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999-01-01,Medium,Tier 1,Supermarket Type1,3735.138,FD,23.0
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009-01-01,Medium,Tier 3,Supermarket Type2,443.4228,DR,13.0
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999-01-01,Medium,Tier 1,Supermarket Type1,2097.27,FD,23.0
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998-01-01,Small,Tier 3,Grocery Store,732.38,FD,24.0
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987-01-01,High,Tier 3,Supermarket Type1,994.7052,NC,35.0
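
Because only the establishment year is known, the same quantity can also be computed with plain integer arithmetic, which avoids the datetime conversion and keeps `Outlet_Establishment_Year` numeric. This is a minimal sketch on a toy frame; `reference_year` is a hypothetical name, and 2022 is assumed only because it reproduces the values in the output above.

```python
import pandas as pd

# Establishment years taken from the first rows shown above
toy = pd.DataFrame({"Outlet_Establishment_Year": [1999, 2009, 1998, 1987]})

# Assumed reference year; a fixed value keeps the feature reproducible across reruns
reference_year = 2022

toy["Years_of_Operation"] = reference_year - toy["Outlet_Establishment_Year"]
print(toy)
```

A fixed reference year means the feature does not change each time the notebook is rerun, which is what happens with `pd.Timestamp('now')`.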


### Modify categories of Item_Fat_Content

**Task:** The categories of the `Item_Fat_Content` variable are represented inconsistently (e.g. both `Low Fat` and `LF`). This should be corrected.

In [9]:
data['Item_Fat_Content'].unique()

array(['Low Fat', 'Regular', 'low fat', 'LF', 'reg'], dtype=object)

In [10]:
# referenced: https://stackoverflow.com/questions/39602824/pandas-replace-string-with-another-string

data['Item_Fat_Content'] = data['Item_Fat_Content'].replace({'low fat':'Low Fat', 'LF':'Low Fat','reg':'Regular'})

In [11]:
data['Item_Fat_Content'].unique()

array(['Low Fat', 'Regular'], dtype=object)
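
An alternative that also guards against labels not seen yet is to normalise case and whitespace first and then map through a lookup table. This is a sketch under assumptions: the `canonical` dictionary is a hypothetical helper, and the sample values are the five labels listed above.

```python
import pandas as pd

fat = pd.Series(["Low Fat", "Regular", "low fat", "LF", "reg"])

# Hypothetical lookup keyed on the lower-cased, stripped label
canonical = {
    "low fat": "Low Fat",
    "lf": "Low Fat",
    "regular": "Regular",
    "reg": "Regular",
}

cleaned = fat.str.strip().str.lower().map(canonical)

# Labels missing from the lookup turn into NaN instead of passing through silently
assert cleaned.notna().all(), "unexpected Item_Fat_Content label"
print(cleaned.unique())
```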

**Task:** Some items are non-consumables, and a fat content should not be specified for them. Create a separate category for these observations.

In [12]:
# Look at fat_content for Non-Consumables

data.groupby(['Broad_Item_Type','Item_Fat_Content'])['Item_Fat_Content'].count()

Broad_Item_Type  Item_Fat_Content
DR               Low Fat              728
                 Regular               71
FD               Low Fat             3190
                 Regular             2935
NC               Low Fat             1599
Name: Item_Fat_Content, dtype: int64

In [13]:
# Look at fat_content for Non-Consumables

NC_data = data[data['Broad_Item_Type']=='NC']
NC_data

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales,Broad_Item_Type,Years_of_Operation
4,NCD19,8.930000,Low Fat,0.000000,Household,53.8614,OUT013,1987-01-01,High,Tier 3,Supermarket Type1,994.7052,NC,35.0
16,NCB42,11.800000,Low Fat,0.008596,Health and Hygiene,115.3492,OUT018,2009-01-01,Medium,Tier 3,Supermarket Type2,1621.8888,NC,13.0
22,NCB30,14.600000,Low Fat,0.025698,Household,196.5084,OUT035,2004-01-01,Small,Tier 2,Supermarket Type1,1587.2672,NC,18.0
25,NCD06,13.000000,Low Fat,0.099887,Household,45.9060,OUT017,2007-01-01,Small,Tier 2,Supermarket Type1,838.9080,NC,15.0
31,NCS17,18.600000,Low Fat,0.080829,Health and Hygiene,96.4436,OUT018,2009-01-01,Medium,Tier 3,Supermarket Type2,2741.7644,NC,13.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8500,NCQ42,20.350000,Low Fat,0.000000,Household,125.1678,OUT017,2007-01-01,Small,Tier 2,Supermarket Type1,1907.5170,NC,15.0
8502,NCH43,8.420000,Low Fat,0.070712,Household,216.4192,OUT045,2002-01-01,Small,Tier 2,Supermarket Type1,3020.0688,NC,20.0
8504,NCN18,13.384736,Low Fat,0.124111,Household,111.7544,OUT027,1985-01-01,Medium,Tier 3,Supermarket Type3,4138.6128,NC,37.0
8516,NCJ19,18.600000,Low Fat,0.118661,Others,58.7588,OUT018,2009-01-01,Medium,Tier 3,Supermarket Type2,858.8820,NC,13.0


In [14]:
# replace fat content for non-consumable items with NA
# referenced: https://stackoverflow.com/questions/19226488/change-one-value-based-on-another-value-in-pandas

data.loc[data.Broad_Item_Type == 'NC', 'Item_Fat_Content'] = "NA"

In [15]:
# check that fat contents were replaced properly

NC_data = data[data['Broad_Item_Type']=='NC']
NC_data

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales,Broad_Item_Type,Years_of_Operation
4,NCD19,8.930000,,0.000000,Household,53.8614,OUT013,1987-01-01,High,Tier 3,Supermarket Type1,994.7052,NC,35.0
16,NCB42,11.800000,,0.008596,Health and Hygiene,115.3492,OUT018,2009-01-01,Medium,Tier 3,Supermarket Type2,1621.8888,NC,13.0
22,NCB30,14.600000,,0.025698,Household,196.5084,OUT035,2004-01-01,Small,Tier 2,Supermarket Type1,1587.2672,NC,18.0
25,NCD06,13.000000,,0.099887,Household,45.9060,OUT017,2007-01-01,Small,Tier 2,Supermarket Type1,838.9080,NC,15.0
31,NCS17,18.600000,,0.080829,Health and Hygiene,96.4436,OUT018,2009-01-01,Medium,Tier 3,Supermarket Type2,2741.7644,NC,13.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8500,NCQ42,20.350000,,0.000000,Household,125.1678,OUT017,2007-01-01,Small,Tier 2,Supermarket Type1,1907.5170,NC,15.0
8502,NCH43,8.420000,,0.070712,Household,216.4192,OUT045,2002-01-01,Small,Tier 2,Supermarket Type1,3020.0688,NC,20.0
8504,NCN18,13.384736,,0.124111,Household,111.7544,OUT027,1985-01-01,Medium,Tier 3,Supermarket Type3,4138.6128,NC,37.0
8516,NCJ19,18.600000,,0.118661,Others,58.7588,OUT018,2009-01-01,Medium,Tier 3,Supermarket Type2,858.8820,NC,13.0


In [16]:
# check that fat contents were replaced properly

data.groupby(['Broad_Item_Type','Item_Fat_Content'])['Item_Fat_Content'].count()

Broad_Item_Type  Item_Fat_Content
DR               Low Fat              728
                 Regular               71
FD               Low Fat             3190
                 Regular             2935
NC               NA                  1599
Name: Item_Fat_Content, dtype: int64
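
One caveat with the literal string `"NA"`: with default settings, `pd.read_csv` treats it as a missing-value marker, so the category would be lost if the frame were written to CSV and read back before encoding. In this notebook the column is one-hot encoded before export, so the issue does not arise here. The sketch below demonstrates the behaviour on a toy frame (illustrative values only).

```python
import io
import pandas as pd

toy = pd.DataFrame({
    "Broad_Item_Type": ["FD", "NC"],
    "Item_Fat_Content": ["Low Fat", "NA"],
})

# Write the frame to an in-memory CSV
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default parsing: the literal string "NA" is read back as a missing value
default_read = pd.read_csv(io.StringIO(buffer.getvalue()))

# keep_default_na=False switches off the default missing-value markers, so "NA" survives
literal_read = pd.read_csv(io.StringIO(buffer.getvalue()), keep_default_na=False)

print(default_read["Item_Fat_Content"].isna().tolist())
print(literal_read["Item_Fat_Content"].tolist())
```

A label that is not a default missing-value marker (for example a hypothetical `"Non-Edible"`) would sidestep the problem entirely.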

### Numerical and One-Hot Encoding of Categorical variables

Since scikit-learn algorithms accept only numerical variables, we need to **convert all categorical variables into numeric types.** 

- If the variable is ordinal, we can simply map its values onto numbers.
- If the variable is nominal (its values have no natural order), we need to one-hot encode it, i.e. create dummy variables.
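
The two cases can be sketched on a toy frame as follows. The values are taken from the columns described above; the `size_order` dictionary is an illustrative name, and `dtype=int` is passed so the dummy columns come out as 0/1 integers.

```python
import pandas as pd

toy = pd.DataFrame({
    "Outlet_Size": ["Medium", "Small", "High"],
    "Outlet_Type": ["Supermarket Type1", "Grocery Store", "Supermarket Type1"],
})

# Ordinal: the categories have a natural order, so map them onto increasing integers
size_order = {"Small": 1, "Medium": 2, "High": 3}
toy["Outlet_Size"] = toy["Outlet_Size"].map(size_order)

# Nominal: no order is assumed, so each category gets its own 0/1 indicator column
encoded = pd.get_dummies(toy, columns=["Outlet_Type"], dtype=int)
print(encoded)
```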

In [17]:
data.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales,Broad_Item_Type,Years_of_Operation
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999-01-01,Medium,Tier 1,Supermarket Type1,3735.138,FD,23.0
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009-01-01,Medium,Tier 3,Supermarket Type2,443.4228,DR,13.0
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999-01-01,Medium,Tier 1,Supermarket Type1,2097.27,FD,23.0
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998-01-01,Small,Tier 3,Grocery Store,732.38,FD,24.0
4,NCD19,8.93,,0.0,Household,53.8614,OUT013,1987-01-01,High,Tier 3,Supermarket Type1,994.7052,NC,35.0


In [18]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8523 entries, 0 to 8522
Data columns (total 14 columns):
 #   Column                     Non-Null Count  Dtype         
---  ------                     --------------  -----         
 0   Item_Identifier            8523 non-null   object        
 1   Item_Weight                8523 non-null   float64       
 2   Item_Fat_Content           8523 non-null   object        
 3   Item_Visibility            8523 non-null   float64       
 4   Item_Type                  8523 non-null   object        
 5   Item_MRP                   8523 non-null   float64       
 6   Outlet_Identifier          8523 non-null   object        
 7   Outlet_Establishment_Year  8523 non-null   datetime64[ns]
 8   Outlet_Size                8523 non-null   object        
 9   Outlet_Location_Type       8523 non-null   object        
 10  Outlet_Type                8523 non-null   object        
 11  Item_Outlet_Sales          8523 non-null   float64       
 12  Broad_

In [19]:
#  0   Item_Identifier            8523 non-null   object            # identifier (no natural order) - label-encoded below
#  1   Item_Weight                8523 non-null   float64      
#  2   Item_Fat_Content           8523 non-null   object            # nominal - create dummy variable
#  3   Item_Visibility            8523 non-null   float64      
#  4   Item_Type                  8523 non-null   object            # nominal - create dummy variable 
#  5   Item_MRP                   8523 non-null   float64       
#  6   Outlet_Identifier          8523 non-null   object            # identifier (no natural order) - label-encoded below
#  7   Outlet_Establishment_Year  8523 non-null   datetime64[ns]    
#  8   Outlet_Size                8523 non-null   object            # ordinal - map values onto size
#  9   Outlet_Location_Type       8523 non-null   object            # nominal - create dummy variable 
#  10  Outlet_Type                8523 non-null   object            # nominal - create dummy variable 
#  11  Item_Outlet_Sales          8523 non-null   float64       
#  12  Broad_Item_Type            8523 non-null   object            # nominal - create dummy variable 
#  13  Years_of_Operation         8523 non-null   float64      

In [20]:
# Get values of ordinal variable Outlet_Size

data['Outlet_Size'].unique()

array(['Medium', 'Small', 'High'], dtype=object)

In [21]:
# Map values onto numbers

data = data.replace({"Outlet_Size" : {'Small' : 1, 'Medium' : 2, 'High' : 3}})

In [22]:
# Check that values are replaced

data.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales,Broad_Item_Type,Years_of_Operation
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999-01-01,2,Tier 1,Supermarket Type1,3735.138,FD,23.0
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009-01-01,2,Tier 3,Supermarket Type2,443.4228,DR,13.0
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999-01-01,2,Tier 1,Supermarket Type1,2097.27,FD,23.0
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998-01-01,1,Tier 3,Grocery Store,732.38,FD,24.0
4,NCD19,8.93,,0.0,Household,53.8614,OUT013,1987-01-01,3,Tier 3,Supermarket Type1,994.7052,NC,35.0


In [23]:
# label-encode 'Item_Identifier'
# note: identifiers have no natural order, so the resulting integers are arbitrary labels, not true ordinal values

from sklearn import preprocessing

le = preprocessing.LabelEncoder()
data['Item_Identifier'] = le.fit_transform(data.Item_Identifier.values)

In [24]:
# encode values for 'Outlet_Identifier'

le = preprocessing.LabelEncoder()
data['Outlet_Identifier'] = le.fit_transform(data.Outlet_Identifier.values)
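
To see what `LabelEncoder` actually assigns, it helps to inspect the fitted encoder. The sketch below uses a toy series with four of the outlet IDs shown earlier, so the codes differ from those in the full dataset. It illustrates that the codes are simply positions in the sorted list of unique labels and that the fitted encoder can translate them back.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

outlets = pd.Series(["OUT049", "OUT018", "OUT049", "OUT010", "OUT013"])

le = LabelEncoder()
codes = le.fit_transform(outlets)

# classes_ holds the sorted unique labels; each code is an index into it
print(list(le.classes_))
print(codes.tolist())

# inverse_transform recovers the original identifiers from the codes
restored = le.inverse_transform(codes)
print(restored.tolist())
```

Because the notebook reuses the name `le` for both columns, the first fitted encoder is overwritten; keeping one encoder per column (under separate names) would be needed to decode both later.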

In [25]:
# check that values were encoded

data.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales,Broad_Item_Type,Years_of_Operation
0,156,9.3,Low Fat,0.016047,Dairy,249.8092,9,1999-01-01,2,Tier 1,Supermarket Type1,3735.138,FD,23.0
1,8,5.92,Regular,0.019278,Soft Drinks,48.2692,3,2009-01-01,2,Tier 3,Supermarket Type2,443.4228,DR,13.0
2,662,17.5,Low Fat,0.01676,Meat,141.618,9,1999-01-01,2,Tier 1,Supermarket Type1,2097.27,FD,23.0
3,1121,19.2,Regular,0.0,Fruits and Vegetables,182.095,0,1998-01-01,1,Tier 3,Grocery Store,732.38,FD,24.0
4,1297,8.93,,0.0,Household,53.8614,1,1987-01-01,3,Tier 3,Supermarket Type1,994.7052,NC,35.0


In [26]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8523 entries, 0 to 8522
Data columns (total 14 columns):
 #   Column                     Non-Null Count  Dtype         
---  ------                     --------------  -----         
 0   Item_Identifier            8523 non-null   int32         
 1   Item_Weight                8523 non-null   float64       
 2   Item_Fat_Content           8523 non-null   object        
 3   Item_Visibility            8523 non-null   float64       
 4   Item_Type                  8523 non-null   object        
 5   Item_MRP                   8523 non-null   float64       
 6   Outlet_Identifier          8523 non-null   int32         
 7   Outlet_Establishment_Year  8523 non-null   datetime64[ns]
 8   Outlet_Size                8523 non-null   int64         
 9   Outlet_Location_Type       8523 non-null   object        
 10  Outlet_Type                8523 non-null   object        
 11  Item_Outlet_Sales          8523 non-null   float64       
 12  Broad_

In [27]:
# transform nominal variables into Dummy Variables

cat_feats = data.dtypes[data.dtypes == 'object'].index.tolist()
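# dtype=int keeps the 0/1 indicators shown below (newer pandas versions default to booleans)
df_dummy = pd.get_dummies(data[cat_feats], dtype=int)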
df_dummy

Unnamed: 0,Item_Fat_Content_Low Fat,Item_Fat_Content_NA,Item_Fat_Content_Regular,Item_Type_Baking Goods,Item_Type_Breads,Item_Type_Breakfast,Item_Type_Canned,Item_Type_Dairy,Item_Type_Frozen Foods,Item_Type_Fruits and Vegetables,...,Outlet_Location_Type_Tier 1,Outlet_Location_Type_Tier 2,Outlet_Location_Type_Tier 3,Outlet_Type_Grocery Store,Outlet_Type_Supermarket Type1,Outlet_Type_Supermarket Type2,Outlet_Type_Supermarket Type3,Broad_Item_Type_DR,Broad_Item_Type_FD,Broad_Item_Type_NC
0,1,0,0,0,0,0,0,1,0,0,...,1,0,0,0,1,0,0,0,1,0
1,0,0,1,0,0,0,0,0,0,0,...,0,0,1,0,0,1,0,1,0,0
2,1,0,0,0,0,0,0,0,0,0,...,1,0,0,0,1,0,0,0,1,0
3,0,0,1,0,0,0,0,0,0,1,...,0,0,1,1,0,0,0,0,1,0
4,0,1,0,0,0,0,0,0,0,0,...,0,0,1,0,1,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8518,1,0,0,0,0,0,0,0,0,0,...,0,0,1,0,1,0,0,0,1,0
8519,0,0,1,1,0,0,0,0,0,0,...,0,1,0,0,1,0,0,0,1,0
8520,0,1,0,0,0,0,0,0,0,0,...,0,1,0,0,1,0,0,0,0,1
8521,0,0,1,0,0,0,0,0,0,0,...,0,0,1,0,0,1,0,0,1,0


In [28]:
# drop the nominal variables from the original dataset

numeric_df = data.drop(cat_feats, axis=1)
numeric_df.shape

(8523, 9)

In [29]:
# merge the numeric and dummy variables into one dataset

transformed_df = pd.concat([numeric_df, df_dummy], axis=1)
transformed_df.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Visibility,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Item_Outlet_Sales,Years_of_Operation,Item_Fat_Content_Low Fat,...,Outlet_Location_Type_Tier 1,Outlet_Location_Type_Tier 2,Outlet_Location_Type_Tier 3,Outlet_Type_Grocery Store,Outlet_Type_Supermarket Type1,Outlet_Type_Supermarket Type2,Outlet_Type_Supermarket Type3,Broad_Item_Type_DR,Broad_Item_Type_FD,Broad_Item_Type_NC
0,156,9.3,0.016047,249.8092,9,1999-01-01,2,3735.138,23.0,1,...,1,0,0,0,1,0,0,0,1,0
1,8,5.92,0.019278,48.2692,3,2009-01-01,2,443.4228,13.0,0,...,0,0,1,0,0,1,0,1,0,0
2,662,17.5,0.01676,141.618,9,1999-01-01,2,2097.27,23.0,1,...,1,0,0,0,1,0,0,0,1,0
3,1121,19.2,0.0,182.095,0,1998-01-01,1,732.38,24.0,0,...,0,0,1,1,0,0,0,0,1,0
4,1297,8.93,0.0,53.8614,1,1987-01-01,3,994.7052,35.0,0,...,0,0,1,0,1,0,0,0,0,1


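**By now, all variables should be numeric.** The one exception is `Outlet_Establishment_Year`, which was converted to `datetime64[ns]` above and needs to be dropped or converted back to an integer year before modelling.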
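
A quick way to verify this is to list the columns whose dtype is not numeric. The sketch below runs on a toy frame with illustrative values; in the notebook above, `Outlet_Establishment_Year` was converted to datetime, so such a check would report it.

```python
import pandas as pd

toy = pd.DataFrame({
    "Item_MRP": [249.8092, 48.2692],
    "Outlet_Size": [2, 2],
    "Outlet_Establishment_Year": pd.to_datetime(["1999", "2009"], format="%Y"),
})

# Columns whose dtype is not numeric (datetime and object columns show up here)
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)

# One option: turn the datetime back into an integer year
toy["Outlet_Establishment_Year"] = toy["Outlet_Establishment_Year"].dt.year
remaining = toy.select_dtypes(exclude="number").columns.tolist()
print(remaining)
```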

---------
### Exporting Data

**Task:** Save the processed data to your local machine as a CSV file.

In [30]:
transformed_df.to_csv("regression_exercise_cleaned_transformed.csv")

# Variable Selection Tutorial

In [32]:
transformed_df.columns

Index(['Item_Identifier', 'Item_Weight', 'Item_Visibility', 'Item_MRP',
       'Outlet_Identifier', 'Outlet_Establishment_Year', 'Outlet_Size',
       'Item_Outlet_Sales', 'Years_of_Operation', 'Item_Fat_Content_Low Fat',
       'Item_Fat_Content_NA', 'Item_Fat_Content_Regular',
       'Item_Type_Baking Goods', 'Item_Type_Breads', 'Item_Type_Breakfast',
       'Item_Type_Canned', 'Item_Type_Dairy', 'Item_Type_Frozen Foods',
       'Item_Type_Fruits and Vegetables', 'Item_Type_Hard Drinks',
       'Item_Type_Health and Hygiene', 'Item_Type_Household', 'Item_Type_Meat',
       'Item_Type_Others', 'Item_Type_Seafood', 'Item_Type_Snack Foods',
       'Item_Type_Soft Drinks', 'Item_Type_Starchy Foods',
       'Outlet_Location_Type_Tier 1', 'Outlet_Location_Type_Tier 2',
       'Outlet_Location_Type_Tier 3', 'Outlet_Type_Grocery Store',
       'Outlet_Type_Supermarket Type1', 'Outlet_Type_Supermarket Type2',
       'Outlet_Type_Supermarket Type3', 'Broad_Item_Type_DR',
       'Broad_Item_Typ