<a href="https://colab.research.google.com/github/patelmedha/Prediction-of-Product-Sales/blob/main/Prediction_Of_Product_Sales_Preprocessing_Machine_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**Prediction of Product Sales-Preprocessing ML**
**Author: Medha Patel**

##**Project Overview**
- Before splitting the data, drop duplicates and fix inconsistencies in the categorical data. (This can also be done after the split, but for this project it is performed before the split.)
- Identify the features (X) and target (y): Assign the "Item_Outlet_Sales" column as your target and the rest of the relevant variables as your features matrix.
- Perform a train test split
- Create a preprocessing object to prepare the dataset for Machine Learning
- Make sure your imputation of missing values occurs after the train test split using SimpleImputer.
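
The last bullet matters because the fill value used for imputation should be learned from the training rows only. The following is a minimal sketch of that idea, assuming a hypothetical toy column (the values are invented for illustration and are not taken from the dataset):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical toy feature with missing values (not the project data)
X = np.array([[1.0], [2.0], [np.nan], [4.0], [5.0], [np.nan], [7.0], [8.0]])

X_train, X_test = train_test_split(X, random_state=42)

# Fit the imputer on the training rows only, so the test rows
# never influence the fill value (avoids data leakage)
imputer = SimpleImputer(strategy='mean')
imputer.fit(X_train)

X_train_imp = imputer.transform(X_train)
X_test_imp = imputer.transform(X_test)

# The learned fill value equals the mean of the non-missing training values
train_mean = np.nanmean(X_train)
print(imputer.statistics_[0], train_mean)
```

The same fitted imputer is then reused on the test rows, which is what the pipelines later in this notebook do.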

###Data Dictionary

  - **Item_Identifier**: Unique product ID
  - **Item_Weight**: Weight of product
  - **Item_Fat_Content**: Whether the product is low fat or regular
  - **Item_Visibility**: The percentage of total display area of all products in store allocated to the particular product
  - **Item_Type**: The category to which the product belongs
  - **Item_MRP**: Maximum Retail Price (list price) of the product
  - **Outlet_Identifier**: Unique store ID
  - **Outlet_Establishment_Year**: The year in which store was established
  - **Outlet_Size**: The size of the store in terms of ground area covered
  - **Outlet_Location_Type**: The type of area in which the store is located
  - **Outlet_Type**: Whether the outlet is a grocery store or some sort of supermarket
  - **Item_Outlet_Sales**: Sales of product in particular store. This is the target variable to be predicted





## **Import Libraries**

In [1]:
#Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')


from sklearn.model_selection import train_test_split

from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder

from sklearn.pipeline import make_pipeline

from sklearn import set_config
# Display estimators as diagrams and return pandas DataFrames from transformers
set_config(display='diagram', transform_output='pandas')

##**Load and Inspect Data**

### **Load Data**

In [2]:
#Load Data
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
file_url = '/content/drive/MyDrive/CodingDojo/01-Fundamentals/PROJECT: PREDICTION OF PRODUCT SALES/Data/sales_predictions_2023.csv'

df = pd.read_csv(file_url)

#Copy of Dataframe
df2 = df.copy()

###**Inspect Data**

####**shape**

In [4]:
df2.shape
print(f'There are {df2.shape[0]} rows, and {df2.shape[1]} columns.')

There are 8523 rows, and 12 columns.


####**info()**

In [5]:
#Info()
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Identifier            8523 non-null   object 
 1   Item_Weight                7060 non-null   float64
 2   Item_Fat_Content           8523 non-null   object 
 3   Item_Visibility            8523 non-null   float64
 4   Item_Type                  8523 non-null   object 
 5   Item_MRP                   8523 non-null   float64
 6   Outlet_Identifier          8523 non-null   object 
 7   Outlet_Establishment_Year  8523 non-null   int64  
 8   Outlet_Size                6113 non-null   object 
 9   Outlet_Location_Type       8523 non-null   object 
 10  Outlet_Type                8523 non-null   object 
 11  Item_Outlet_Sales          8523 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 799.2+ KB


####**head()**

In [6]:
#Head()
df2.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


####**describe()**

In [7]:
#Descriptive statistics for numeric columns
df2.describe(include='number')

Unnamed: 0,Item_Weight,Item_Visibility,Item_MRP,Outlet_Establishment_Year,Item_Outlet_Sales
count,7060.0,8523.0,8523.0,8523.0,8523.0
mean,12.857645,0.066132,140.992782,1997.831867,2181.288914
std,4.643456,0.051598,62.275067,8.37176,1706.499616
min,4.555,0.0,31.29,1985.0,33.29
25%,8.77375,0.026989,93.8265,1987.0,834.2474
50%,12.6,0.053931,143.0128,1999.0,1794.331
75%,16.85,0.094585,185.6437,2004.0,3101.2964
max,21.35,0.328391,266.8884,2009.0,13086.9648


In [8]:
#Descriptive statistics for categoric columns
df2.describe(include='object')

Unnamed: 0,Item_Identifier,Item_Fat_Content,Item_Type,Outlet_Identifier,Outlet_Size,Outlet_Location_Type,Outlet_Type
count,8523,8523,8523,8523,6113,8523,8523
unique,1559,5,16,10,3,3,4
top,FDW13,Low Fat,Fruits and Vegetables,OUT027,Medium,Tier 3,Supermarket Type1
freq,10,5089,1232,935,2793,3350,5577


##**Performing Preprocessing Data**

-No Columns to drop.

In [9]:
# Checking for Duplicates
df2.duplicated().sum()

0

-There are 0 duplicates

In [10]:
# Checking missing values
df2.isna().sum()

Item_Identifier                 0
Item_Weight                  1463
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  2410
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales               0
dtype: int64

####**Data Consistency**

##### **Data Consistency- Categorical Columns**




In [11]:
# Save the list of categorical column names
categorical_col = df2.select_dtypes('object').columns
categorical_col

Index(['Item_Identifier', 'Item_Fat_Content', 'Item_Type', 'Outlet_Identifier',
       'Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type'],
      dtype='object')

In [12]:
# Check the value counts of each categorical column for inconsistent labels
for col in categorical_col:
  print(f'Value Counts for {col}')
  print(df2[col].value_counts())
  print('\n')

Value Counts for Item_Identifier
FDW13    10
FDG33    10
NCY18     9
FDD38     9
DRE49     9
         ..
FDY43     1
FDQ60     1
FDO33     1
DRF48     1
FDC23     1
Name: Item_Identifier, Length: 1559, dtype: int64


Value Counts for Item_Fat_Content
Low Fat    5089
Regular    2889
LF          316
reg         117
low fat     112
Name: Item_Fat_Content, dtype: int64


Value Counts for Item_Type
Fruits and Vegetables    1232
Snack Foods              1200
Household                 910
Frozen Foods              856
Dairy                     682
Canned                    649
Baking Goods              648
Health and Hygiene        520
Soft Drinks               445
Meat                      425
Breads                    251
Hard Drinks               214
Others                    169
Starchy Foods             148
Breakfast                 110
Seafood                    64
Name: Item_Type, dtype: int64


Value Counts for Outlet_Identifier
OUT027    935
OUT013    932
OUT049    930
OUT046    930


In [13]:
# Item_Fat_Content- fix the values
df2['Item_Fat_Content'] = df2['Item_Fat_Content'].replace(['low fat' , 'LF'], 'Low Fat')
df2['Item_Fat_Content'] = df2['Item_Fat_Content'].replace('reg' , 'Regular')
df2['Item_Fat_Content'].value_counts()

Low Fat    5517
Regular    3006
Name: Item_Fat_Content, dtype: int64
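
The two `replace` calls above could also be written with a single mapping dictionary, which keeps every correction in one place. This is only a sketch of an alternative, using a hypothetical five-row sample of the labels rather than the full column; `fat_map` is a name introduced here for illustration:

```python
import pandas as pd

# Hypothetical sample of the inconsistent labels seen in Item_Fat_Content
fat = pd.Series(['Low Fat', 'LF', 'low fat', 'reg', 'Regular'])

# One dictionary documents every correction in a single place
fat_map = {'LF': 'Low Fat', 'low fat': 'Low Fat', 'reg': 'Regular'}
fat_clean = fat.replace(fat_map)

print(sorted(fat_clean.unique()))  # ['Low Fat', 'Regular']
```

The counts reported above are consistent with the merge: 5089 + 316 + 112 = 5517 for Low Fat and 2889 + 117 = 3006 for Regular.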

In [14]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Identifier            8523 non-null   object 
 1   Item_Weight                7060 non-null   float64
 2   Item_Fat_Content           8523 non-null   object 
 3   Item_Visibility            8523 non-null   float64
 4   Item_Type                  8523 non-null   object 
 5   Item_MRP                   8523 non-null   float64
 6   Outlet_Identifier          8523 non-null   object 
 7   Outlet_Establishment_Year  8523 non-null   int64  
 8   Outlet_Size                6113 non-null   object 
 9   Outlet_Location_Type       8523 non-null   object 
 10  Outlet_Type                8523 non-null   object 
 11  Item_Outlet_Sales          8523 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 799.2+ KB


#####**Data Consistency- Numerical Columns**

In [15]:
# Save the list of numerical column names
numerical_col = df2.select_dtypes(['int', 'float']).columns
numerical_col

Index(['Item_Weight', 'Item_Visibility', 'Item_MRP',
       'Outlet_Establishment_Year', 'Item_Outlet_Sales'],
      dtype='object')

In [16]:
# Check the value counts of each numerical column
for col in numerical_col:
  print(f'Value Counts for {col}')
  print(df2[col].value_counts())
  print('\n')

Value Counts for Item_Weight
12.150    86
17.600    82
13.650    77
11.800    76
15.100    68
          ..
7.275      2
7.685      1
9.420      1
6.520      1
5.400      1
Name: Item_Weight, Length: 415, dtype: int64


Value Counts for Item_Visibility
0.000000    526
0.076975      3
0.162462      2
0.076841      2
0.073562      2
           ... 
0.013957      1
0.110460      1
0.124646      1
0.054142      1
0.044878      1
Name: Item_Visibility, Length: 7880, dtype: int64


Value Counts for Item_MRP
172.0422    7
170.5422    6
196.5084    6
188.1872    6
142.0154    6
           ..
97.3384     1
83.1934     1
96.6752     1
152.6682    1
75.4670     1
Name: Item_MRP, Length: 5938, dtype: int64


Value Counts for Outlet_Establishment_Year
1985    1463
1987     932
1999     930
1997     930
2004     930
2002     929
2009     928
2007     926
1998     555
Name: Outlet_Establishment_Year, dtype: int64


Value Counts for Item_Outlet_Sales
958.7520     17
1342.2528    16
703.0848     15
1845

##**Defining X and y**

In [17]:
# Check for null values in the target column Item_Outlet_Sales
df2['Item_Outlet_Sales'].isna().sum()

0

- There are 0 null values in target column. 

### Define X and y

In [18]:
## Define X and y
target = 'Item_Outlet_Sales'

X = df2.drop(columns=target).copy()
y = df2[target].copy()
X.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1


### Train-Test-Split

In [19]:
# Performing a train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

In [20]:
X_train.shape

(6392, 11)

In [21]:
X_test.shape

(2131, 11)
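
The two shapes above follow from the default split: when neither `test_size` nor `train_size` is passed, `train_test_split` holds out 25% of the rows and rounds the test count up. A quick arithmetic check, using only the row count reported earlier:

```python
import math

n_rows = 8523      # rows reported by df2.shape
test_size = 0.25   # scikit-learn's default when no size is given

# The test count is rounded up; the remainder goes to training
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test

print(n_train, n_test)  # 6392 2131
```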

In [22]:
X_train.dtypes

Item_Identifier               object
Item_Weight                  float64
Item_Fat_Content              object
Item_Visibility              float64
Item_Type                     object
Item_MRP                     float64
Outlet_Identifier             object
Outlet_Establishment_Year      int64
Outlet_Size                   object
Outlet_Location_Type          object
Outlet_Type                   object
dtype: object

## Create 3 Pipelines 

The data will be divided into three subsets, and each subset will be preprocessed differently:

- numeric columns: 
  - Item_Weight, Item_Visibility, Item_MRP, Outlet_Establishment_Year
- ordinal categorical columns: 
  - Item_Fat_Content, Outlet_Size, Outlet_Location_Type
- nominal categorical columns: 
  - Item_Identifier, Item_Type, Outlet_Identifier, Outlet_Type


### 1. Numeric

In [23]:
# PREPROCESSING PIPELINE FOR NUMERIC DATA

# Save list of number column names
num_cols = X_train.select_dtypes("number").columns
print("Numeric Columns:", num_cols)

# Transformers
impute_mean = SimpleImputer(strategy='mean')
scaler = StandardScaler()

# Pipeline
num_pipe = make_pipeline(impute_mean, scaler)
num_pipe

# Tuple
numeric_tuple = ('numeric',num_pipe, num_cols)

Numeric Columns: Index(['Item_Weight', 'Item_Visibility', 'Item_MRP',
       'Outlet_Establishment_Year'],
      dtype='object')


### 2. Ordinal

In [25]:
# PREPROCESSING PIPELINE FOR ORDINAL DATA

# Save list of ordinal column names
ordinal_cols = ['Item_Fat_Content', 'Outlet_Size', 'Outlet_Location_Type']

# Ordered Category Lists
Item_Fat_Content_list = ['Low Fat', 'Regular']
Outlet_Size_list = ['Small', 'Medium', 'High']
Outlet_Location_list = ['Tier 1', 'Tier 2', 'Tier 3']


# Transformers

ord_encoder = OrdinalEncoder(categories=[Item_Fat_Content_list, Outlet_Size_list, Outlet_Location_list])
freq_imputer = SimpleImputer(strategy='most_frequent')

# Ordinal codes are integers (0, 1, 2, ...) whose range grows with the number
# of categories, so scale them to match the other numeric features
scaler_ord = StandardScaler()

# Pipeline
ord_pipe = make_pipeline(freq_imputer, ord_encoder, scaler_ord)

# Tuple
ord_tuple = ('ordinal',ord_pipe, ordinal_cols)
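
To make the effect of the ordered category lists concrete, the sketch below encodes a few hypothetical `Outlet_Size` values with the same ordering as `Outlet_Size_list`. The position of each label in the list becomes its code; `size_order` and `sizes` are names introduced here for illustration:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Same ordering as Outlet_Size_list above
size_order = ['Small', 'Medium', 'High']
encoder = OrdinalEncoder(categories=[size_order])

# Hypothetical sample values (one column, three rows)
sizes = np.array([['High'], ['Small'], ['Medium']], dtype=object)
codes = encoder.fit_transform(sizes)

print(codes.ravel())  # [2. 0. 1.]
```

Without an explicit `categories` argument the encoder would sort the labels alphabetically, which would place High before Medium and Small.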

### 3. Nominal

In [26]:
# PREPROCESSING PIPELINE FOR ONE-HOT-ENCODED DATA

# Save list of nominal column names
nominal_cols = X_train.select_dtypes('object').drop(columns=ordinal_cols).columns

# Transformers

missing_imputer = SimpleImputer(strategy='constant', fill_value='missing')
# `sparse` was renamed to `sparse_output` in scikit-learn 1.2 and later removed
ohe_encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

# Pipeline
nom_pipe = make_pipeline(missing_imputer , ohe_encoder)

# Tuple
ohe_tuple = ('categorical',nom_pipe, nominal_cols)

## Transform the Features

###1.Numeric


In [27]:
#Fit pipeline on NUMERIC training data
num_pipe.fit(X_train[num_cols])


In [30]:
#Transform NUMERIC Training Data
X_train_num_tf = num_pipe.transform(X_train[num_cols])
X_train_num_tf.head()


Unnamed: 0,Item_Weight,Item_Visibility,Item_MRP,Outlet_Establishment_Year
4776,0.817249,-0.712775,1.828109,1.327849
7510,0.55634,-1.291052,0.603369,1.327849
5828,-0.131512,1.813319,0.244541,0.136187
5327,-1.169219,-1.004931,-0.952591,0.732018
4810,1.528819,-0.965484,-0.33646,0.493686


In [31]:
#Confirm null values are 0
X_train_num_tf.isna().sum().sum()

0

In [32]:
#Transform Numeric Testing Data
X_test_num_tf = num_pipe.transform(X_test[num_cols])
X_test_num_tf.head()

Unnamed: 0,Item_Weight,Item_Visibility,Item_MRP,Outlet_Establishment_Year
7503,0.3310089,-0.776646,-0.998816,-1.293807
2957,-1.179892,0.100317,-1.585194,-0.102145
7031,0.3784469,-0.482994,-1.595784,0.136187
1084,4.213344e-16,-0.41544,0.506592,-1.532139
856,-0.6426567,-1.047426,0.886725,0.732018


In [33]:
# Confirm null values are 0
X_test_num_tf.isna().sum().sum()


0

In [35]:
X_test_num_tf.describe()

Unnamed: 0,Item_Weight,Item_Visibility,Item_MRP,Outlet_Establishment_Year
count,2131.0,2131.0,2131.0,2131.0
mean,-0.03678593,0.009778,-0.063075,-0.012057
std,1.00908,1.036354,0.975964,0.990431
min,-1.972108,-1.291052,-1.748367,-1.532139
25%,-0.8934853,-0.764242,-0.780635,-1.293807
50%,4.213344e-16,-0.242938,-0.150932,0.136187
75%,0.7342321,0.557572,0.636041,0.732018
max,2.003199,4.793662,1.989768,1.327849
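
The test-set means and standard deviations above are close to, but not exactly, 0 and 1. That is expected, because the scaler learned its mean and standard deviation from the training rows only. Values such as 4.213344e-16 in `Item_Weight` are presumably rows where a missing weight was filled with the training mean, which scales to approximately zero. A small sketch with hypothetical numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical train and test values for a single feature
train = np.array([[1.0], [2.0], [3.0], [4.0]])
test = np.array([[2.0], [6.0]])

# The scaler only ever sees the training rows
scaler = StandardScaler().fit(train)
train_tf = scaler.transform(train)
test_tf = scaler.transform(test)

# Training data is centred exactly; test data generally is not
print(train_tf.mean(), test_tf.mean())
```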


###2.Ordinal

In [36]:
#Fit pipeline on ORDINAL training data
ord_pipe.fit(X_train[ordinal_cols])

In [37]:
#Transform ORDINAL Training Data
X_train_ord_tf = ord_pipe.transform(X_train[ordinal_cols])
X_test_ord_tf = ord_pipe.transform(X_test[ordinal_cols])
X_train_ord_tf.head()

Unnamed: 0,Item_Fat_Content,Outlet_Size,Outlet_Location_Type
4776,-0.740321,0.287374,1.084948
7510,1.350766,0.287374,1.084948
5828,1.350766,0.287374,-1.384777
5327,-0.740321,-1.384048,-0.149914
4810,-0.740321,0.287374,-0.149914


###3.Nominal

In [38]:
#Fit pipeline on NOMINAL training data
nom_pipe.fit(X_train[nominal_cols])

In [39]:
#Transform the training and test data
X_train_ohe_tf = nom_pipe.transform(X_train[nominal_cols])
X_test_ohe_tf = nom_pipe.transform(X_test[nominal_cols])
X_train_ohe_tf.head()


Unnamed: 0,Item_Identifier_DRA12,Item_Identifier_DRA24,Item_Identifier_DRA59,Item_Identifier_DRB01,Item_Identifier_DRB13,Item_Identifier_DRB24,Item_Identifier_DRB25,Item_Identifier_DRB48,Item_Identifier_DRC01,Item_Identifier_DRC12,...,Outlet_Identifier_OUT019,Outlet_Identifier_OUT027,Outlet_Identifier_OUT035,Outlet_Identifier_OUT045,Outlet_Identifier_OUT046,Outlet_Identifier_OUT049,Outlet_Type_Grocery Store,Outlet_Type_Supermarket Type1,Outlet_Type_Supermarket Type2,Outlet_Type_Supermarket Type3
4776,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
7510,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
5828,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
5327,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4810,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0


##Combining Our Processed Data

Training Data
  - X_train_num_tf
  - X_train_ord_tf
  - X_train_ohe_tf

Test Data
  - X_test_num_tf
  - X_test_ord_tf
  - X_test_ohe_tf
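
Concatenating these pieces column-wise works because `set_config(transform_output='pandas')` keeps the original row index on every transformed frame, and `pd.concat(..., axis=1)` aligns rows by index label. The sketch below uses hypothetical two-row frames to show both the aligned case and what would happen if one piece had lost its index:

```python
import pandas as pd

# Hypothetical transformed pieces that share the same shuffled row index
num_part = pd.DataFrame({'Item_MRP': [1.8, 0.6]}, index=[4776, 7510])
ord_part = pd.DataFrame({'Outlet_Size': [0.3, 0.3]}, index=[4776, 7510])

combined = pd.concat([num_part, ord_part], axis=1)
print(combined.shape)  # (2, 2)

# If one piece had a fresh 0..n-1 index instead, concat would align on
# mismatched labels, producing extra rows filled with NaN
reset_part = ord_part.reset_index(drop=True)
misaligned = pd.concat([num_part, reset_part], axis=1)
print(misaligned.shape)  # (4, 2)
```

This is why the null-value checks after each concatenation are a useful sanity test.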

#### Training Data

In [40]:
#Re-combining training data
X_train_tf = pd.concat([X_train_num_tf, X_train_ord_tf,
                               X_train_ohe_tf], axis=1)

In [41]:
X_train_tf.head()

Unnamed: 0,Item_Weight,Item_Visibility,Item_MRP,Outlet_Establishment_Year,Item_Fat_Content,Outlet_Size,Outlet_Location_Type,Item_Identifier_DRA12,Item_Identifier_DRA24,Item_Identifier_DRA59,...,Outlet_Identifier_OUT019,Outlet_Identifier_OUT027,Outlet_Identifier_OUT035,Outlet_Identifier_OUT045,Outlet_Identifier_OUT046,Outlet_Identifier_OUT049,Outlet_Type_Grocery Store,Outlet_Type_Supermarket Type1,Outlet_Type_Supermarket Type2,Outlet_Type_Supermarket Type3
4776,0.817249,-0.712775,1.828109,1.327849,-0.740321,0.287374,1.084948,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
7510,0.55634,-1.291052,0.603369,1.327849,1.350766,0.287374,1.084948,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
5828,-0.131512,1.813319,0.244541,0.136187,1.350766,0.287374,-1.384777,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
5327,-1.169219,-1.004931,-0.952591,0.732018,-0.740321,-1.384048,-0.149914,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4810,1.528819,-0.965484,-0.33646,0.493686,-0.740321,0.287374,-0.149914,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0


In [45]:
X_train_tf.dtypes

Item_Weight                      float64
Item_Visibility                  float64
Item_MRP                         float64
Outlet_Establishment_Year        float64
Item_Fat_Content                 float64
                                  ...   
Outlet_Identifier_OUT049         float64
Outlet_Type_Grocery Store        float64
Outlet_Type_Supermarket Type1    float64
Outlet_Type_Supermarket Type2    float64
Outlet_Type_Supermarket Type3    float64
Length: 1587, dtype: object

In [46]:
X_train_tf.isna().sum().sum()

0

- All features in train set are numeric and there are 0 null values.

#### Test Data

In [47]:
#Re-combining test data
X_test_tf = pd.concat([X_test_num_tf, X_test_ord_tf,
                               X_test_ohe_tf], axis=1)

In [50]:
X_test_tf.dtypes


Item_Weight                      float64
Item_Visibility                  float64
Item_MRP                         float64
Outlet_Establishment_Year        float64
Item_Fat_Content                 float64
                                  ...   
Outlet_Identifier_OUT049         float64
Outlet_Type_Grocery Store        float64
Outlet_Type_Supermarket Type1    float64
Outlet_Type_Supermarket Type2    float64
Outlet_Type_Supermarket Type3    float64
Length: 1587, dtype: object

In [51]:
X_test_tf.head()

Unnamed: 0,Item_Weight,Item_Visibility,Item_MRP,Outlet_Establishment_Year,Item_Fat_Content,Outlet_Size,Outlet_Location_Type,Item_Identifier_DRA12,Item_Identifier_DRA24,Item_Identifier_DRA59,...,Outlet_Identifier_OUT019,Outlet_Identifier_OUT027,Outlet_Identifier_OUT035,Outlet_Identifier_OUT045,Outlet_Identifier_OUT046,Outlet_Identifier_OUT049,Outlet_Type_Grocery Store,Outlet_Type_Supermarket Type1,Outlet_Type_Supermarket Type2,Outlet_Type_Supermarket Type3
7503,0.3310089,-0.776646,-0.998816,-1.293807,-0.740321,1.958796,1.084948,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2957,-1.179892,0.100317,-1.585194,-0.102145,-0.740321,-1.384048,-1.384777,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
7031,0.3784469,-0.482994,-1.595784,0.136187,1.350766,0.287374,-1.384777,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
1084,4.213344e-16,-0.41544,0.506592,-1.532139,1.350766,0.287374,1.084948,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
856,-0.6426567,-1.047426,0.886725,0.732018,1.350766,-1.384048,-0.149914,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


In [52]:
X_test_tf.isna().sum().sum()

0

- All features in test set are numeric and there are 0 null values.