# Data-Preprocessing and Manipulation

This file will consist of **2** parts:
1) Pre-processing with **data visualizations** in mind
2) Pre-processing with **machine learning** in mind

## Part 1

**Importing** necessary libraries

In [220]:
import pandas as pd
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.stats.outliers_influence import variance_inflation_factor


**Importing** the dataset

In [172]:
df = pd.read_csv("Walmart Data Analysis and Forcasting.csv")
df

Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment
0,1,05-02-2010,1643690.90,0,42.31,2.572,211.096358,8.106
1,1,12-02-2010,1641957.44,1,38.51,2.548,211.242170,8.106
2,1,19-02-2010,1611968.17,0,39.93,2.514,211.289143,8.106
3,1,26-02-2010,1409727.59,0,46.63,2.561,211.319643,8.106
4,1,05-03-2010,1554806.68,0,46.50,2.625,211.350143,8.106
...,...,...,...,...,...,...,...,...
6430,45,28-09-2012,713173.95,0,64.88,3.997,192.013558,8.684
6431,45,05-10-2012,733455.07,0,64.89,3.985,192.170412,8.667
6432,45,12-10-2012,734464.36,0,54.47,4.000,192.327265,8.667
6433,45,19-10-2012,718125.53,0,56.47,3.969,192.330854,8.667


### Part 1 - Pre-Processing for Data Visualizations

In [173]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6435 entries, 0 to 6434
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Store         6435 non-null   int64  
 1   Date          6435 non-null   object 
 2   Weekly_Sales  6435 non-null   float64
 3   Holiday_Flag  6435 non-null   int64  
 4   Temperature   6435 non-null   float64
 5   Fuel_Price    6435 non-null   float64
 6   CPI           6435 non-null   float64
 7   Unemployment  6435 non-null   float64
dtypes: float64(5), int64(2), object(1)
memory usage: 402.3+ KB


Changing the data type of **Date** to datetime

In [226]:
df['Date'] = pd.to_datetime(df['Date'])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6435 entries, 0 to 6434
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   Store         6435 non-null   int64         
 1   Date          6435 non-null   datetime64[ns]
 2   Weekly_Sales  6435 non-null   float64       
 3   Holiday_Flag  6435 non-null   int64         
 4   Temperature   6435 non-null   float64       
 5   Fuel_Price    6435 non-null   float64       
 6   CPI           6435 non-null   float64       
 7   Unemployment  6435 non-null   float64       
 8   Year          6435 non-null   int64         
 9   Month         6435 non-null   int64         
 10  Day           6435 non-null   int64         
 11  Season        6435 non-null   object        
 12  Holiday_Name  6435 non-null   object        
dtypes: datetime64[ns](1), float64(5), int64(5), object(2)
memory usage: 653.7+ KB
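
The raw strings in the CSV preview (05-02-2010, 12-02-2010, 19-02-2010, 26-02-2010) step forward one week at a time, which suggests they are day-first, while the parsed previews further down show 2010-05-02 and 2010-12-02 for the first two rows. If that reading is right, passing an explicit format to `pd.to_datetime` removes the ambiguity. A minimal sketch on four strings copied from the preview (the names `raw` and `parsed` are only illustrative):

```python
import pandas as pd

# Four date strings copied from the first rows of the raw CSV preview
raw = pd.Series(["05-02-2010", "12-02-2010", "19-02-2010", "26-02-2010"])

# Assumption: the file stores dates as day-month-year
parsed = pd.to_datetime(raw, format="%d-%m-%Y")

# With the explicit format every row lands in February, one week apart
print(parsed.dt.month.tolist())
print(parsed.diff().dropna().dt.days.tolist())
```

On the full dataframe the equivalent call would be `pd.to_datetime(df['Date'], format='%d-%m-%Y')`; the Month, Day, and Season columns derived afterwards would change accordingly.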


#### **Splitting Out** the **Date** column into its components

Adding **Year** column

In [175]:
df['Year'] = pd.to_datetime(df['Date']).dt.year
df

Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment,Year
0,1,2010-05-02,1643690.90,0,42.31,2.572,211.096358,8.106,2010
1,1,2010-12-02,1641957.44,1,38.51,2.548,211.242170,8.106,2010
2,1,2010-02-19,1611968.17,0,39.93,2.514,211.289143,8.106,2010
3,1,2010-02-26,1409727.59,0,46.63,2.561,211.319643,8.106,2010
4,1,2010-05-03,1554806.68,0,46.50,2.625,211.350143,8.106,2010
...,...,...,...,...,...,...,...,...,...
6430,45,2012-09-28,713173.95,0,64.88,3.997,192.013558,8.684,2012
6431,45,2012-05-10,733455.07,0,64.89,3.985,192.170412,8.667,2012
6432,45,2012-12-10,734464.36,0,54.47,4.000,192.327265,8.667,2012
6433,45,2012-10-19,718125.53,0,56.47,3.969,192.330854,8.667,2012


Adding **Month** Column

In [176]:
df['Month'] = pd.to_datetime(df['Date']).dt.month
df

Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment,Year,Month
0,1,2010-05-02,1643690.90,0,42.31,2.572,211.096358,8.106,2010,5
1,1,2010-12-02,1641957.44,1,38.51,2.548,211.242170,8.106,2010,12
2,1,2010-02-19,1611968.17,0,39.93,2.514,211.289143,8.106,2010,2
3,1,2010-02-26,1409727.59,0,46.63,2.561,211.319643,8.106,2010,2
4,1,2010-05-03,1554806.68,0,46.50,2.625,211.350143,8.106,2010,5
...,...,...,...,...,...,...,...,...,...,...
6430,45,2012-09-28,713173.95,0,64.88,3.997,192.013558,8.684,2012,9
6431,45,2012-05-10,733455.07,0,64.89,3.985,192.170412,8.667,2012,5
6432,45,2012-12-10,734464.36,0,54.47,4.000,192.327265,8.667,2012,12
6433,45,2012-10-19,718125.53,0,56.47,3.969,192.330854,8.667,2012,10


Adding **Day** Column

In [177]:
df['Day'] = pd.to_datetime(df['Date']).dt.day
df

Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment,Year,Month,Day
0,1,2010-05-02,1643690.90,0,42.31,2.572,211.096358,8.106,2010,5,2
1,1,2010-12-02,1641957.44,1,38.51,2.548,211.242170,8.106,2010,12,2
2,1,2010-02-19,1611968.17,0,39.93,2.514,211.289143,8.106,2010,2,19
3,1,2010-02-26,1409727.59,0,46.63,2.561,211.319643,8.106,2010,2,26
4,1,2010-05-03,1554806.68,0,46.50,2.625,211.350143,8.106,2010,5,3
...,...,...,...,...,...,...,...,...,...,...,...
6430,45,2012-09-28,713173.95,0,64.88,3.997,192.013558,8.684,2012,9,28
6431,45,2012-05-10,733455.07,0,64.89,3.985,192.170412,8.667,2012,5,10
6432,45,2012-12-10,734464.36,0,54.47,4.000,192.327265,8.667,2012,12,10
6433,45,2012-10-19,718125.53,0,56.47,3.969,192.330854,8.667,2012,10,19


Adding **Seasons**

In [178]:
df['Season'] = ''
df.loc[((df['Month'] == 12) & (df['Day'] >= 21)) | ((df['Month'] <= 2) | ((df['Month'] == 3) & (df['Day'] < 21))), 'Season'] = 'Winter'
df.loc[((df['Month'] == 3) & (df['Day'] >= 21)) | ((df['Month'] >= 4) & (df['Month'] < 6)) | ((df['Month'] == 6) & (df['Day'] < 21)), 'Season'] = 'Spring'
df.loc[((df['Month'] == 6) & (df['Day'] >=21)) | ((df['Month'] >= 7) & (df['Month'] < 9)) | ((df['Month'] == 9) & (df['Day'] < 21)), 'Season'] = 'Summer'
df.loc[((df['Month'] == 9) & (df['Day'] >=21)) | ((df['Month'] >=10) & (df['Month'] < 12)) | ((df['Month'] == 12) & (df['Day'] < 21)), 'Season'] = 'Fall'
df

Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment,Year,Month,Day,Season
0,1,2010-05-02,1643690.90,0,42.31,2.572,211.096358,8.106,2010,5,2,Spring
1,1,2010-12-02,1641957.44,1,38.51,2.548,211.242170,8.106,2010,12,2,Fall
2,1,2010-02-19,1611968.17,0,39.93,2.514,211.289143,8.106,2010,2,19,Winter
3,1,2010-02-26,1409727.59,0,46.63,2.561,211.319643,8.106,2010,2,26,Winter
4,1,2010-05-03,1554806.68,0,46.50,2.625,211.350143,8.106,2010,5,3,Spring
...,...,...,...,...,...,...,...,...,...,...,...,...
6430,45,2012-09-28,713173.95,0,64.88,3.997,192.013558,8.684,2012,9,28,Fall
6431,45,2012-05-10,733455.07,0,64.89,3.985,192.170412,8.667,2012,5,10,Spring
6432,45,2012-12-10,734464.36,0,54.47,4.000,192.327265,8.667,2012,12,10,Fall
6433,45,2012-10-19,718125.53,0,56.47,3.969,192.330854,8.667,2012,10,19,Fall


In [179]:
df.Season.value_counts()

Spring    1755
Summer    1665
Fall      1530
Winter    1485
Name: Season, dtype: int64
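
The four boolean masks used for the seasons can also be written as one small function, which makes the cut-off dates (the 21st of March, June, September, and December) easier to check. This is only a sketch with the same boundaries; `season_for` is a hypothetical helper, not something from the notebook:

```python
import pandas as pd

def season_for(month: int, day: int) -> str:
    """Map a (month, day) pair to a season, switching on the 21st of Mar/Jun/Sep/Dec."""
    key = (month, day)  # tuples compare month first, then day
    if key < (3, 21) or key >= (12, 21):
        return "Winter"
    if key < (6, 21):
        return "Spring"
    if key < (9, 21):
        return "Summer"
    return "Fall"

# A few sample dates around the boundaries
dates = pd.to_datetime(["2010-02-19", "2010-03-21", "2010-06-20", "2010-09-28", "2010-12-21"])
seasons = [season_for(d.month, d.day) for d in dates]
print(seasons)
```

On the real dataframe this could be applied row by row, for example with a list comprehension over `zip(df['Month'], df['Day'])`.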

In [180]:
# blank_seasons = df[df['Season'] == '']
# unique_months = blank_seasons['Month'].unique()
# print(unique_months)


In [181]:
# for i, row in df.iterrows():
#     if row['Season'] == '':
#         df.loc[i, 'Season'] = 'Winter'

# df

Adding a column that specifies which **Holiday** the 'Holiday_Flag' is referencing

In [182]:
# Define the conditions and corresponding values for the 'Holiday_Name' column
conditions = [
    df['Date'].isin(['12-02-2010', '11-02-2011', '10-02-2012', '08-02-2013']),
    df['Date'].isin(['10-09-2010', '09-09-2011', '07-09-2012', '06-09-2013']),
    df['Date'].isin(['26-11-2010', '25-11-2011', '23-11-2012', '29-11-2013']),
    df['Date'].isin(['31-12-2010', '30-12-2011', '28-12-2012', '27-12-2013'])
]
values = ['Super Bowl', 'Labour Day', 'Thanksgiving', 'Christmas']

# Use numpy.select() to assign the corresponding holiday name based on the conditions
df['Holiday_Name'] = np.select(conditions, values, default='No Holiday')
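
Comparing a datetime column against date strings relies on pandas guessing the string format. A more explicit variant, sketched here on a toy frame, parses both sides with the same format before calling `isin`. The toy data, the `FMT` constant, and the `holidays` dictionary are assumptions made for illustration; the holiday dates are a subset of the ones listed above:

```python
import numpy as np
import pandas as pd

FMT = "%d-%m-%Y"  # assumption: dates are written day-month-year

# Toy frame with one ordinary week and three holiday weeks
toy = pd.DataFrame({
    "Date": pd.to_datetime(
        ["05-02-2010", "12-02-2010", "10-09-2010", "26-11-2010"], format=FMT
    )
})

# Holiday name -> dates, parsed with the same format as the column
holidays = {
    "Super Bowl": ["12-02-2010", "11-02-2011", "10-02-2012"],
    "Labour Day": ["10-09-2010", "09-09-2011", "07-09-2012"],
    "Thanksgiving": ["26-11-2010", "25-11-2011", "23-11-2012"],
    "Christmas": ["31-12-2010", "30-12-2011", "28-12-2012"],
}
conditions = [
    toy["Date"].isin(pd.to_datetime(dates, format=FMT)) for dates in holidays.values()
]

# np.select picks the first matching name per row, else the default
toy["Holiday_Name"] = np.select(conditions, list(holidays.keys()), default="No Holiday")
print(toy)
```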

Now, I will write this df to a new CSV that will be used to create the data visualizations later.

In [184]:
df.to_csv('visual_df.csv', index=False)


### Part 2 - Pre-Processing for Machine Learning

Creating a **copy** of the previous df that I'll manipulate for my **ML** models

In [185]:
df_ml = df.copy()

In [186]:
df_ml

Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment,Year,Month,Day,Season,Holiday_Name
0,1,2010-05-02,1643690.90,0,42.31,2.572,211.096358,8.106,2010,5,2,Spring,No Holiday
1,1,2010-12-02,1641957.44,1,38.51,2.548,211.242170,8.106,2010,12,2,Fall,Super Bowl
2,1,2010-02-19,1611968.17,0,39.93,2.514,211.289143,8.106,2010,2,19,Winter,No Holiday
3,1,2010-02-26,1409727.59,0,46.63,2.561,211.319643,8.106,2010,2,26,Winter,No Holiday
4,1,2010-05-03,1554806.68,0,46.50,2.625,211.350143,8.106,2010,5,3,Spring,No Holiday
...,...,...,...,...,...,...,...,...,...,...,...,...,...
6430,45,2012-09-28,713173.95,0,64.88,3.997,192.013558,8.684,2012,9,28,Fall,No Holiday
6431,45,2012-05-10,733455.07,0,64.89,3.985,192.170412,8.667,2012,5,10,Spring,No Holiday
6432,45,2012-12-10,734464.36,0,54.47,4.000,192.327265,8.667,2012,12,10,Fall,No Holiday
6433,45,2012-10-19,718125.53,0,56.47,3.969,192.330854,8.667,2012,10,19,Fall,No Holiday


**Dropping** the original **Date** column

In [187]:
df_ml = df_ml.drop(columns=['Date'])
df_ml

Unnamed: 0,Store,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment,Year,Month,Day,Season,Holiday_Name
0,1,1643690.90,0,42.31,2.572,211.096358,8.106,2010,5,2,Spring,No Holiday
1,1,1641957.44,1,38.51,2.548,211.242170,8.106,2010,12,2,Fall,Super Bowl
2,1,1611968.17,0,39.93,2.514,211.289143,8.106,2010,2,19,Winter,No Holiday
3,1,1409727.59,0,46.63,2.561,211.319643,8.106,2010,2,26,Winter,No Holiday
4,1,1554806.68,0,46.50,2.625,211.350143,8.106,2010,5,3,Spring,No Holiday
...,...,...,...,...,...,...,...,...,...,...,...,...
6430,45,713173.95,0,64.88,3.997,192.013558,8.684,2012,9,28,Fall,No Holiday
6431,45,733455.07,0,64.89,3.985,192.170412,8.667,2012,5,10,Spring,No Holiday
6432,45,734464.36,0,54.47,4.000,192.327265,8.667,2012,12,10,Fall,No Holiday
6433,45,718125.53,0,56.47,3.969,192.330854,8.667,2012,10,19,Fall,No Holiday


In [188]:
df_ml.isnull().sum()

Store           0
Weekly_Sales    0
Holiday_Flag    0
Temperature     0
Fuel_Price      0
CPI             0
Unemployment    0
Year            0
Month           0
Day             0
Season          0
Holiday_Name    0
dtype: int64

Creating **Dummy Variables** for the **Store** Column

In [189]:
df_ml = pd.get_dummies(df_ml, columns=['Store'], prefix='Store', drop_first=True)
df_ml


Unnamed: 0,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment,Year,Month,Day,Season,...,Store_36,Store_37,Store_38,Store_39,Store_40,Store_41,Store_42,Store_43,Store_44,Store_45
0,1643690.90,0,42.31,2.572,211.096358,8.106,2010,5,2,Spring,...,0,0,0,0,0,0,0,0,0,0
1,1641957.44,1,38.51,2.548,211.242170,8.106,2010,12,2,Fall,...,0,0,0,0,0,0,0,0,0,0
2,1611968.17,0,39.93,2.514,211.289143,8.106,2010,2,19,Winter,...,0,0,0,0,0,0,0,0,0,0
3,1409727.59,0,46.63,2.561,211.319643,8.106,2010,2,26,Winter,...,0,0,0,0,0,0,0,0,0,0
4,1554806.68,0,46.50,2.625,211.350143,8.106,2010,5,3,Spring,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6430,713173.95,0,64.88,3.997,192.013558,8.684,2012,9,28,Fall,...,0,0,0,0,0,0,0,0,0,1
6431,733455.07,0,64.89,3.985,192.170412,8.667,2012,5,10,Spring,...,0,0,0,0,0,0,0,0,0,1
6432,734464.36,0,54.47,4.000,192.327265,8.667,2012,12,10,Fall,...,0,0,0,0,0,0,0,0,0,1
6433,718125.53,0,56.47,3.969,192.330854,8.667,2012,10,19,Fall,...,0,0,0,0,0,0,0,0,0,1
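
To make the effect of `drop_first=True` concrete: the first category (Store 1 here) gets no column of its own and becomes the baseline, represented by zeros in every remaining dummy column. A small sketch on made-up data (the `toy` frame is hypothetical, and `dtype=int` is passed only so the output shows 0/1 regardless of pandas version):

```python
import pandas as pd

# Hypothetical three-store example
toy = pd.DataFrame({
    "Store": [1, 2, 3, 1],
    "Weekly_Sales": [10.0, 20.0, 30.0, 40.0],
})

encoded = pd.get_dummies(
    toy, columns=["Store"], prefix="Store", drop_first=True, dtype=int
)

# Store_1 is dropped; rows for store 1 are all zeros in Store_2 and Store_3
print(encoded)
```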


**Cyclically Encoding** the **Year, Month, and Day** Columns
- Find more information about cyclical encoding [here](https://towardsdatascience.com/cyclical-features-encoding-its-about-time-ce23581845ca).

In [190]:
# First step is to look up how many days each month has (February is fixed at 28)
df_ml['days_in_month'] = np.where(
    df_ml['Month'] == 2, 28,
    np.where(df_ml['Month'].isin([4, 6, 9, 11]), 30, 31)
)
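
The lookup above treats February as 28 days in every year, although 2012 is a leap year. If leap years matter, pandas can derive the month length from the year and month together. This is a sketch of that alternative on a toy frame, not what the notebook does; the `toy` and `first_of_month` names are illustrative:

```python
import pandas as pd

# Toy frame mirroring the Year and Month columns
toy = pd.DataFrame({"Year": [2010, 2012, 2012], "Month": [2, 2, 4]})

# Build the first day of each month, then ask pandas for that month's length
first_of_month = pd.to_datetime(
    toy.rename(columns={"Year": "year", "Month": "month"}).assign(day=1)
)
toy["days_in_month"] = first_of_month.dt.days_in_month

print(toy["days_in_month"].tolist())
```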

In [191]:
# Create cyclical encodings
df_ml['year_sin'] = np.sin(2*np.pi*df_ml['Year']/df_ml['Year'].max())
df_ml['year_cos'] = np.cos(2*np.pi*df_ml['Year']/df_ml['Year'].max())
df_ml['month_sin'] = np.sin(2*np.pi*df_ml['Month']/12)
df_ml['month_cos'] = np.cos(2*np.pi*df_ml['Month']/12)
df_ml['day_sin'] = np.sin(2*np.pi*df_ml['Day']/df_ml['days_in_month'])
df_ml['day_cos'] = np.cos(2*np.pi*df_ml['Day']/df_ml['days_in_month'])
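
The point of the sine/cosine pair is that values at the two ends of a cycle end up next to each other: December and January are as close as January and February, which is not true of the raw numbers 12 and 1. A minimal sketch for months (`cyclical_encode` is a hypothetical helper written for this illustration):

```python
import numpy as np

def cyclical_encode(values, period):
    """Return the (sin, cos) coordinates of values placed on a circle of the given period."""
    angle = 2 * np.pi * np.asarray(values, dtype=float) / period
    return np.sin(angle), np.cos(angle)

months = np.array([1, 2, 12])  # January, February, December
m_sin, m_cos = cyclical_encode(months, 12)
points = np.column_stack([m_sin, m_cos])

# Distance on the circle: Dec -> Jan versus Jan -> Feb
dist_dec_jan = np.linalg.norm(points[2] - points[0])
dist_jan_feb = np.linalg.norm(points[0] - points[1])
print(dist_dec_jan, dist_jan_feb)
```

The same reasoning does not carry over cleanly to a column such as Year, which does not wrap around.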


Now, we **Drop** the **Year, Month, Day, and days_in_month** Columns

In [192]:
df_ml.drop(columns=(['Year', 'Month', 'Day', 'days_in_month']), inplace=True)

Create **Dummy Variables** for the **Season** Column

In [193]:
df_ml = pd.get_dummies(df_ml, columns=['Season'], prefix='Season', drop_first=True)


In [195]:
df_ml

Unnamed: 0,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment,Holiday_Name,Store_2,Store_3,Store_4,...,Store_45,year_sin,year_cos,month_sin,month_cos,day_sin,day_cos,Season_Spring,Season_Summer,Season_Winter
0,1643690.90,0,42.31,2.572,211.096358,8.106,No Holiday,0,0,0,...,0,-6.245670e-03,0.99998,5.000000e-01,-8.660254e-01,0.394356,0.918958,1,0,0
1,1641957.44,1,38.51,2.548,211.242170,8.106,Super Bowl,0,0,0,...,0,-6.245670e-03,0.99998,-2.449294e-16,1.000000e+00,0.394356,0.918958,0,0,0
2,1611968.17,0,39.93,2.514,211.289143,8.106,No Holiday,0,0,0,...,0,-6.245670e-03,0.99998,8.660254e-01,5.000000e-01,-0.900969,-0.433884,0,0,1
3,1409727.59,0,46.63,2.561,211.319643,8.106,No Holiday,0,0,0,...,0,-6.245670e-03,0.99998,8.660254e-01,5.000000e-01,-0.433884,0.900969,0,0,1
4,1554806.68,0,46.50,2.625,211.350143,8.106,No Holiday,0,0,0,...,0,-6.245670e-03,0.99998,5.000000e-01,-8.660254e-01,0.571268,0.820763,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6430,713173.95,0,64.88,3.997,192.013558,8.684,No Holiday,0,0,0,...,1,-2.449294e-16,1.00000,-1.000000e+00,-1.836970e-16,-0.406737,0.913545,0,0,0
6431,733455.07,0,64.89,3.985,192.170412,8.667,No Holiday,0,0,0,...,1,-2.449294e-16,1.00000,5.000000e-01,-8.660254e-01,0.897805,-0.440394,1,0,0
6432,734464.36,0,54.47,4.000,192.327265,8.667,No Holiday,0,0,0,...,1,-2.449294e-16,1.00000,-2.449294e-16,1.000000e+00,0.897805,-0.440394,0,0,0
6433,718125.53,0,56.47,3.969,192.330854,8.667,No Holiday,0,0,0,...,1,-2.449294e-16,1.00000,-8.660254e-01,5.000000e-01,-0.651372,-0.758758,0,0,0


**Dropping** the **Holiday_Name** Column

I'm dropping this column since the heavy majority of its values are 'No Holiday'. We also already have the 'Holiday_Flag' column, which captures the information we need.

In [202]:
df_ml.drop(columns='Holiday_Name', inplace=True)


In [203]:
df_ml

Unnamed: 0,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment,Store_2,Store_3,Store_4,Store_5,...,Store_45,year_sin,year_cos,month_sin,month_cos,day_sin,day_cos,Season_Spring,Season_Summer,Season_Winter
0,1643690.90,0,42.31,2.572,211.096358,8.106,0,0,0,0,...,0,-6.245670e-03,0.99998,5.000000e-01,-8.660254e-01,0.394356,0.918958,1,0,0
1,1641957.44,1,38.51,2.548,211.242170,8.106,0,0,0,0,...,0,-6.245670e-03,0.99998,-2.449294e-16,1.000000e+00,0.394356,0.918958,0,0,0
2,1611968.17,0,39.93,2.514,211.289143,8.106,0,0,0,0,...,0,-6.245670e-03,0.99998,8.660254e-01,5.000000e-01,-0.900969,-0.433884,0,0,1
3,1409727.59,0,46.63,2.561,211.319643,8.106,0,0,0,0,...,0,-6.245670e-03,0.99998,8.660254e-01,5.000000e-01,-0.433884,0.900969,0,0,1
4,1554806.68,0,46.50,2.625,211.350143,8.106,0,0,0,0,...,0,-6.245670e-03,0.99998,5.000000e-01,-8.660254e-01,0.571268,0.820763,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6430,713173.95,0,64.88,3.997,192.013558,8.684,0,0,0,0,...,1,-2.449294e-16,1.00000,-1.000000e+00,-1.836970e-16,-0.406737,0.913545,0,0,0
6431,733455.07,0,64.89,3.985,192.170412,8.667,0,0,0,0,...,1,-2.449294e-16,1.00000,5.000000e-01,-8.660254e-01,0.897805,-0.440394,1,0,0
6432,734464.36,0,54.47,4.000,192.327265,8.667,0,0,0,0,...,1,-2.449294e-16,1.00000,-2.449294e-16,1.000000e+00,0.897805,-0.440394,0,0,0
6433,718125.53,0,56.47,3.969,192.330854,8.667,0,0,0,0,...,1,-2.449294e-16,1.00000,-8.660254e-01,5.000000e-01,-0.651372,-0.758758,0,0,0


In [206]:
df_ml.columns

Index(['Weekly_Sales', 'Holiday_Flag', 'Temperature', 'Fuel_Price', 'CPI',
       'Unemployment', 'Store_2', 'Store_3', 'Store_4', 'Store_5', 'Store_6',
       'Store_7', 'Store_8', 'Store_9', 'Store_10', 'Store_11', 'Store_12',
       'Store_13', 'Store_14', 'Store_15', 'Store_16', 'Store_17', 'Store_18',
       'Store_19', 'Store_20', 'Store_21', 'Store_22', 'Store_23', 'Store_24',
       'Store_25', 'Store_26', 'Store_27', 'Store_28', 'Store_29', 'Store_30',
       'Store_31', 'Store_32', 'Store_33', 'Store_34', 'Store_35', 'Store_36',
       'Store_37', 'Store_38', 'Store_39', 'Store_40', 'Store_41', 'Store_42',
       'Store_43', 'Store_44', 'Store_45', 'year_sin', 'year_cos', 'month_sin',
       'month_cos', 'day_sin', 'day_cos', 'Season_Spring', 'Season_Summer',
       'Season_Winter'],
      dtype='object')

**Writing** the new df to a **CSV** to be used for Machine Learning

In [207]:
df_ml.to_csv('ml_df.csv', index=False)


Before moving on, let's check out the **multicollinearity** of the features

In [217]:
# Calculate VIF for each feature
vif = pd.DataFrame()
vif["Feature"] = df_ml.columns
vif["VIF"] = [variance_inflation_factor(df_ml.values, i) for i in range(len(df_ml.columns))]
sorted_vif = vif.sort_values(by='VIF', ascending=False)

print(sorted_vif)


          Feature           VIF
51       year_cos  26459.420220
4             CPI    858.023205
32       Store_28     98.257034
16       Store_12     97.299080
42       Store_38     95.930984
14       Store_10     95.401832
8         Store_4     95.285525
17       Store_13     94.675201
38       Store_34     93.000174
46       Store_42     91.803915
37       Store_33     91.673216
21       Store_17     91.217632
48       Store_44     90.627445
23       Store_19     80.059757
27       Store_23     80.046538
28       Store_24     80.038705
22       Store_18     79.518937
44       Store_40     79.190747
33       Store_29     79.179865
30       Store_26     78.724828
19       Store_15     78.395057
31       Store_27     73.757799
39       Store_35     71.919805
26       Store_22     71.782063
0    Weekly_Sales     59.059744
50       year_sin     27.584792
5    Unemployment     20.169159
18       Store_14     13.434171
49       Store_45     12.338811
20       Store_16      8.217338
11      

Features with VIF values above roughly **5-10** are generally considered to exhibit **high multicollinearity**. In the next session, I will have to find a way to deal with the extremely high VIF scores for some of these features.