# **Week 5 Readings**

_John Andrew Dixon_

---

**Setup**

In [55]:
# Import necessary Python modules
import pandas as pd

In [56]:
# Load data and verify
url = "https://docs.google.com/spreadsheets/d/e/2PACX-1vSdSJuFHcoz8G3NZPlYpavtY8IjFJDczqqEukadW_rEfumnbd5kpF9H0e9vS9kxrnglCYiwLJy4_PXK/pub?output=csv"
df = pd.read_csv(url)
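# Fill missing `Outlet_Size` values with 'Small' (assignment avoids
# the chained in-place call that newer pandas versions warn about)
df['Outlet_Size'] = df['Outlet_Size'].fillna('Small')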
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Identifier            8523 non-null   object 
 1   Item_Weight                7060 non-null   float64
 2   Item_Fat_Content           8523 non-null   object 
 3   Item_Visibility            8523 non-null   float64
 4   Item_Type                  8523 non-null   object 
 5   Item_MRP                   8523 non-null   float64
 6   Outlet_Identifier          8523 non-null   object 
 7   Outlet_Establishment_Year  8523 non-null   int64  
 8   Outlet_Size                8523 non-null   object 
 9   Outlet_Location_Type       8523 non-null   object 
 10  Outlet_Type                8523 non-null   object 
 11  Item_Outlet_Sales          8523 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 799.2+ KB


---

## **Stating the Supervised Machine Learning Problem**

In [57]:
# Create the Features Matrix, AKA a 2D grid where rows are samples
# and columns are features. This chooses the columns
# `Outlet_Size`, `Outlet_Location_Type`, and `Outlet_Type` as the
# features of the Features Matrix. All rows are chosen as the
# samples.
features = list(df.columns)[8:-1]
X = df.loc[:, features]
X

| | Outlet_Size | Outlet_Location_Type | Outlet_Type |
|---|---|---|---|
| 0 | Medium | Tier 1 | Supermarket Type1 |
| 1 | Medium | Tier 3 | Supermarket Type2 |
| 2 | Medium | Tier 1 | Supermarket Type1 |
| 3 | Small | Tier 3 | Grocery Store |
| 4 | High | Tier 3 | Supermarket Type1 |
| ... | ... | ... | ... |
| 8518 | High | Tier 3 | Supermarket Type1 |
| 8519 | Small | Tier 2 | Supermarket Type1 |
| 8520 | Small | Tier 2 | Supermarket Type1 |
| 8521 | Medium | Tier 3 | Supermarket Type2 |


In [58]:
# Create the Target Vector, AKA a 1D column where rows are samples
# and the column is what you wish to predict. Below sets
# `Item_Outlet_Sales` as the column to predict.
# All rows are chosen as the samples.
y = df["Item_Outlet_Sales"]
y

0       3735.1380
1        443.4228
2       2097.2700
3        732.3800
4        994.7052
          ...    
8518    2778.3834
8519     549.2850
8520    1193.1136
8521    1845.5976
8522     765.6700
Name: Item_Outlet_Sales, Length: 8523, dtype: float64

## **Train Test Split (Model Validation)**

Brief explanation:
1. Split a dataset into two sets: training set (default 75%), testing set (default 25%).
2. Train with the training set only.
3. Test with the testing set only. This is to simulate how well it'll do with new data.

A note on Data Leakage:
- It happens when any data from the testing set is used to train the model.
- DO NOT ALLOW THIS TO HAPPEN! Train _**only**_ with data from the training set.

Here's how to implement it:

In [59]:
# Import the TTS from sklearn
from sklearn.model_selection import train_test_split

# Perform the TTS
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Output the lengths to see if they match
print(len(X_train), len(y_train), len(X_test), len(y_test))

6392 6392 2131 2131
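
The 75%/25% default can also be written out explicitly. The following is a minimal sketch on made-up toy data (the names `X_demo` and `y_demo` are hypothetical, not part of the dataset above) that passes `test_size=0.25` and checks that no sample lands in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples with 2 features each
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

# test_size=0.25 spells out the default 75/25 split
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)
print(len(X_tr), len(X_te))

# Each target value is unique here, so an empty intersection
# confirms that no sample appears in both sets
overlap = set(y_tr.tolist()) & set(y_te.tolist())
print(len(overlap))
```

When the sample count is not divisible by four, the test set size is rounded up, which is why 8523 rows above split into 6392 and 2131.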


## **Types of Features**

There are three types of features:
1. Numeric Features
    - Int or Float
    - Ex: Price, mpg, IQ
2. Ordinal Features
    - Int or Strings
    - Ex: Star ratings, Grades
3. Categorical (Nominal) Features
    - Int or Strings
    - Ex: Color, Car model

All must be transformed into a number to be used in Machine Learning.

### **Transforming Numeric Features**

These are already numeric, but they may need to be scaled in some way.

### **Transforming Ordinal Features**

These types of features need to be mapped to numerical values that preserve their order.

In [60]:
# Replacement dictionary to transform ordinal to numerical
sizes = {
    'Small': 0,
    'Medium': 1,
    'High': 2
}

# Replace the ordinal features with numerical values
X_train['Outlet_Size'] = X_train['Outlet_Size'].replace(sizes)
X_test['Outlet_Size'] = X_test['Outlet_Size'].replace(sizes)
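
As an alternative to a replacement dictionary, scikit-learn's `OrdinalEncoder` can perform the same mapping when the category order is passed in explicitly. This is only a sketch on a made-up column (`sizes_demo` is a hypothetical name), not the approach used in these notes.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy column with the same labels as `Outlet_Size`
sizes_demo = pd.DataFrame({"Outlet_Size": ["Small", "High", "Medium", "Small"]})

# List the categories from lowest to highest so the integer codes
# follow the intended order instead of alphabetical order
encoder = OrdinalEncoder(categories=[["Small", "Medium", "High"]])
encoded = encoder.fit_transform(sizes_demo)

print(encoded.ravel().tolist())
```

Without the `categories` argument the encoder would sort the labels alphabetically, which would put 'High' before 'Medium' and 'Small'.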

### **Transforming Categorical (Nominal) Features**

A Categorical feature cannot be transformed into a numerical value in the same way as an Ordinal feature. Machine learning algorithms interpret higher numbers as greater, which would impose an order that nominal categories do not have. To handle this, categorical features can be transformed through one-hot encoding, which creates one binary column per category, as shown below.

![image.png](attachment:image.png)

This doesn't need to be done manually. A future lesson will teach an easier way of doing this.
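
To make the idea concrete, here is a minimal sketch using pandas' `get_dummies` on a made-up `Color` column (the data and the name `colors` are illustrative assumptions). Each category becomes its own 0/1 column, and every row has exactly one 1.

```python
import pandas as pd

# Hypothetical nominal feature
colors = pd.DataFrame({"Color": ["Red", "Green", "Blue", "Green"]})

# One binary column is created per category
one_hot = pd.get_dummies(colors, columns=["Color"], dtype=int)
print(one_hot)

# Each row sums to 1 because a sample belongs to exactly one category
print(one_hot.sum(axis=1).tolist())
```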

## **Standardization and Scaling Data**

### **Scale**
Scaling means changing the range of values. This does not change the shape of the distribution.
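
One common way to change the range is min-max scaling, which maps the smallest value to 0 and the largest to 1. The sketch below is an illustration only: it uses a handful of made-up square-footage values and compares scikit-learn's `MinMaxScaler` with the formula `(x - min) / (max - min)`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical square-footage values, one feature column
sqft = np.array([[200.0], [425.0], [675.0], [984.0], [727.0]])

# Scale with scikit-learn
scaled = MinMaxScaler().fit_transform(sqft)

# Scale by hand with (x - min) / (max - min)
manual = (sqft - sqft.min()) / (sqft.max() - sqft.min())

print(np.allclose(scaled, manual))
print(scaled.ravel().round(3))
```

The order and relative spacing of the values are preserved; only the range changes.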

### **Standardize**
This is a kind of scaling. It means rescaling the values so that the distribution has a mean of 0 and a standard deviation of 1. The shape of the distribution is unchanged, so standardized data is not necessarily normally distributed. The original units are lost here.
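
Standardizing amounts to subtracting the mean and dividing by the standard deviation, `z = (x - mean) / std`. As a rough check of that formula, this sketch applies it by hand to a few made-up prices (the `prices` array is an assumption for illustration) and compares the result with `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical prices, one feature column
prices = np.array([[906442.0], [272629.0], [824862.0], [720344.0], [879679.0]])

# Manual z-score; NumPy's std() uses the population formula (ddof=0),
# which is also what StandardScaler uses
manual_z = (prices - prices.mean()) / prices.std()

# Same transformation with scikit-learn
sk_z = StandardScaler().fit_transform(prices)

print(np.allclose(manual_z, sk_z))
```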

### **Standardizing Data in Python**

In [61]:
# Get the StandardScaler from sklearn
from sklearn.preprocessing import StandardScaler

# Load sample data
url ="https://docs.google.com/spreadsheets/d/e/2PACX-1vQA0bolmqwmt6RsuxHjFoTQLPS5obsBKoZScSpVw_M4Tcb4VMrnnN3eirk3C4mFYWhCMc1r8jLaceKT/pub?output=csv"

df = pd.read_csv(url)

# Verify data loaded
df.head()

| | Sold | SqFt | Price |
|---|---|---|---|
| 0 | 1 | 200 | 906442 |
| 1 | 0 | 425 | 272629 |
| 2 | 1 | 675 | 824862 |
| 3 | 1 | 984 | 720344 |
| 4 | 0 | 727 | 879679 |


In [62]:
# Assign target vector
y = df["Sold"]

# Assign feature matrix
X = df.drop(columns="Sold")

# TTS model validation
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Explore
X_train.describe().round()

| | SqFt | Price |
|---|---|---|
| count | 75.0 | 75.0 |
| mean | 564.0 | 524950.0 |
| std | 285.0 | 274185.0 |
| min | 114.0 | 109277.0 |
| 25% | 320.0 | 272804.0 |
| 50% | 588.0 | 503613.0 |
| 75% | 836.0 | 786078.0 |
| max | 997.0 | 995878.0 |


In [63]:
# Instantiate the scaler
scaler = StandardScaler()

# Fit the scaler ONLY on the training data to prevent
# data leakage
scaler.fit(X_train)

# Use the Scaler to Transform Both the Training and Testing Data
# The test data is transformed based on the Scaler that was fit with 
# the training data
train_scaled = scaler.transform(X_train)
test_scaled = scaler.transform(X_test)

In [64]:
# Convert to DataFrame for ease of exploration
pd.DataFrame(train_scaled).describe().round(2)

| | 0 | 1 |
|---|---|---|
| count | 75.0 | 75.0 |
| mean | -0.0 | 0.0 |
| std | 1.01 | 1.01 |
| min | -1.59 | -1.53 |
| 25% | -0.86 | -0.93 |
| 50% | 0.09 | -0.08 |
| 75% | 0.96 | 0.96 |
| max | 1.53 | 1.73 |
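
The reported `std` of 1.01 rather than exactly 1 is expected: `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas' `describe()` reports the sample standard deviation (`ddof=1`). With 75 rows the ratio is `sqrt(75/74)`, roughly 1.0067. The sketch below reproduces this on randomly generated stand-in data (the `demo` frame is an assumption, not the real dataset).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for a 75-row training set
rng = np.random.default_rng(0)
demo = pd.DataFrame({"SqFt": rng.uniform(100, 1000, size=75)})

scaled_demo = pd.DataFrame(
    StandardScaler().fit_transform(demo), columns=demo.columns
)

# Population std (what the scaler normalizes to) vs sample std
# (what pandas reports by default)
pop_std = scaled_demo["SqFt"].std(ddof=0)
sample_std = scaled_demo["SqFt"].std(ddof=1)

print(round(pop_std, 2), round(sample_std, 2))
```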


> _**Note:**_ One-hot encoded features do not need to be scaled, but ordinal encoded features should be. 
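
One possible way to scale some columns while leaving others untouched is scikit-learn's `ColumnTransformer`. The sketch below is an illustration under assumed, made-up data: the column names and the `to_scale` list are hypothetical. It standardizes an ordinal-encoded column and a numeric column and passes the one-hot columns through unchanged.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical, already-encoded features
demo = pd.DataFrame({
    "risk_ordinal": [0, 1, 2, 1],                     # ordinal-encoded
    "income": [20000.0, 35000.0, 50000.0, 65000.0],   # numeric
    "gender_f": [1, 0, 0, 1],                         # one-hot
    "gender_m": [0, 1, 1, 0],                         # one-hot
})

# Columns chosen for scaling (an assumption for this example)
to_scale = ["risk_ordinal", "income"]

# Scale the listed columns; pass every other column through as-is
ct = ColumnTransformer(
    [("scale", StandardScaler(), to_scale)],
    remainder="passthrough",
)
out = ct.fit_transform(demo)

# Scaled columns come first, followed by the passthrough columns
print(np.round(out, 2))
```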

## **Ordinal and OneHotEncoder in Python**

### **Why Encode Features?**

Machine learning models operate only on numbers. So, all non-numerical data must be encoded as numerical data.

### **Ordinal Encoding**

If there is an order to a categorical feature (i.e. the categorical feature is ordinal, not nominal), just encode its values as ordered numbers.

e.g. 1 for 'Low', 2 for 'Medium', 3 for 'High'

### **What is One-Hot Encoding?**

The following image conveys what `One-Hot Encoding` is:

![image.png](attachment:image.png)

### **Ordinal & One-Hot Encoding in Python**

In [65]:
# import pandas as pd
# from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_selector
from sklearn.preprocessing import OneHotEncoder

In [66]:
# Load sample data
url ="https://docs.google.com/spreadsheets/d/e/2PACX-1vQoW9-O6pTib6OdlFKWpX2JCU9CqJatagDxTXT_LWVwcTqd8zIfjNeW0QR2hUKp2OHkUVmYsdl2VSbY/pub?output=csv"

df = pd.read_csv(url)

# Verify data loaded
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 33 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Unnamed: 0          1000 non-null   int64  
 1   State               1000 non-null   object 
 2   Lat                 1000 non-null   float64
 3   Lng                 1000 non-null   float64
 4   Area                1000 non-null   object 
 5   Children            993 non-null    float64
 6   Age                 1000 non-null   int64  
 7   Income              1000 non-null   float64
 8   Marital             1000 non-null   object 
 9   Gender              1000 non-null   object 
 10  ReAdmis             1000 non-null   int64  
 11  VitD_levels         1000 non-null   float64
 12  Doc_visits          1000 non-null   int64  
 13  Full_meals_eaten    1000 non-null   int64  
 14  vitD_supp           1000 non-null   int64  
 15  Soft_drink          1000 non-null   int64  
 16  Initial

In [67]:
# The column Complication_risk is ordinal so it must be encoded
df['Complication_risk'].value_counts()

Medium    462
High      312
Low       222
Med         4
Name: Complication_risk, dtype: int64

In [68]:
# Translations to apply to the Ordinal Features
translations = {
    'Low': 0,
    'Med': 1, 'Medium': 1,
    'High': 2
} 

# Apply the mapping
df['Complication_risk'] = df['Complication_risk'].replace(translations)

# Verify
df['Complication_risk'].value_counts()

1    466
2    312
0    222
Name: Complication_risk, dtype: int64

In [69]:
# Assign target vector
y = df["Additional_charges"]

# Assign feature matrix
X = df.drop(columns=["Unnamed: 0", "Additional_charges"])

# TTS model validation
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [70]:
# Select only columns that are of type Object
# these columns are nominal and will be the target of the 
# One-hot encoding
nominal_cols = make_column_selector(dtype_include='object')
nominal_cols(X_train)

['State', 'Area', 'Marital', 'Gender', 'Initial_admin', 'Services']
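
For a closer look at what the selector does, this small sketch builds a made-up frame (`demo` and its values are hypothetical) and shows that a selector is a callable that returns the names of the columns matching the requested dtype.

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Hypothetical frame; the text columns are stored explicitly as object dtype
demo = pd.DataFrame({
    "State": pd.Series(["AL", "CA"], dtype="object"),
    "Age": [60, 78],
    "Income": [8459.99, 22669.31],
    "Gender": pd.Series(["f", "m"], dtype="object"),
})

object_selector = make_column_selector(dtype_include="object")
number_selector = make_column_selector(dtype_include="number")

# Calling a selector on a DataFrame returns a list of column names
print(object_selector(demo))
print(number_selector(demo))
```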

In [71]:
# Create a subset of data for only nominal columns
train_nominal_data = X_train[nominal_cols(X_train)]
test_nominal_data = X_test[nominal_cols(X_test)]

In [72]:
# Instantiate the One Hot Encoder
one_hot_encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

# Fit the encoder on the training data
one_hot_encoder.fit(train_nominal_data)

# Transform the training and testing data
train_ohe = one_hot_encoder.transform(train_nominal_data)
test_ohe = one_hot_encoder.transform(test_nominal_data)

In [73]:
# Get the column names as generated by the One Hot Encoder
ohe_column_names = one_hot_encoder.get_feature_names_out(train_nominal_data.columns)
train_ohe_df = pd.DataFrame(train_ohe, columns=ohe_column_names)
test_ohe_df = pd.DataFrame(test_ohe, columns=ohe_column_names)
train_ohe_df

Unnamed: 0,State_AK,State_AL,State_AR,State_AZ,State_CA,State_CO,State_CT,State_DC,State_FL,State_GA,...,Gender_f,Gender_m,Gender_male,Initial_admin_Elective Admission,Initial_admin_Emergency Admission,Initial_admin_Observation Admission,Services_Blood Work,Services_CT Scan,Services_Intravenous,Services_MRI
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
1,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
745,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
746,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
747,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
748,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0


In [77]:
# Create a selector to get only the numeric columns on a dataframe.
# This will be used to select only the numeric data in the original
# testing and training data. Once selected, they will be concatenated
# with the features created after the one hot encoder was applied
numeric_cols = make_column_selector(dtype_include="number")

# Reset the index count on the X_train and X_test data
train_num_cols = X_train[numeric_cols(X_train)].reset_index(drop=True)
test_num_cols = X_test[numeric_cols(X_test)].reset_index(drop=True)

# Finally, concatenate the new one hot encoded columns to the original 
# numeric columns (training with training, testing with testing)
X_train_processed = pd.concat([train_num_cols, train_ohe_df], axis=1) 
X_test_processed = pd.concat([test_num_cols, test_ohe_df], axis=1)
X_train_processed

Unnamed: 0,Lat,Lng,Children,Age,Income,ReAdmis,VitD_levels,Doc_visits,Full_meals_eaten,vitD_supp,...,Gender_f,Gender_m,Gender_male,Initial_admin_Elective Admission,Initial_admin_Emergency Admission,Initial_admin_Observation Admission,Services_Blood Work,Services_CT Scan,Services_Intravenous,Services_MRI
0,36.16307,-86.66510,2.0,60,8459.99,0,19.034162,5,1,0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
1,34.96594,-87.12179,5.0,78,22669.31,0,15.903388,7,1,0,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0
2,36.24648,-83.51232,1.0,60,25536.25,0,18.225040,4,1,0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
3,45.42189,-97.91165,7.0,82,94863.57,0,15.809932,5,0,2,...,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
4,42.33661,-83.28292,0.0,37,30898.36,0,20.640410,5,1,0,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
745,42.05701,-77.43901,1.0,32,4788.93,0,19.029312,6,1,0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
746,40.47773,-86.38658,4.0,27,29461.62,0,15.293840,5,0,0,...,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
747,40.56510,-81.07429,0.0,57,79094.04,0,19.459084,5,0,0,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
748,33.97472,-118.35549,0.0,56,25697.12,0,15.871725,5,1,0,...,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
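
Resetting the index before concatenating matters because `pd.concat` with `axis=1` aligns rows by index label, and the rows of a train/test split keep their shuffled original labels. The sketch below uses made-up frames (`num_part` and `ohe_part` are hypothetical names) to show what happens with and without the reset.

```python
import pandas as pd

# Hypothetical numeric columns that kept shuffled labels from a split
num_part = pd.DataFrame({"Age": [60, 78, 37]}, index=[412, 7, 981])

# Hypothetical one-hot columns rebuilt from an array, so labels are 0..2
ohe_part = pd.DataFrame({"Gender_f": [1.0, 0.0, 1.0]})

# Without a reset the labels do not match: rows are not paired up
# and the gaps are filled with NaN
misaligned = pd.concat([num_part, ohe_part], axis=1)

# With a reset both frames share labels 0..2 and line up row by row
aligned = pd.concat([num_part.reset_index(drop=True), ohe_part], axis=1)

print(misaligned.shape, int(misaligned.isna().sum().sum()))
print(aligned.shape, int(aligned.isna().sum().sum()))
```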
