# 4 Pre-Processing and Training Data<a id='4_Pre-Processing_and_Training_Data'></a>

## 4.1 Contents<a id='4.1_Contents'></a>
* [4 Pre-Processing and Training Data](#4_Pre-Processing_and_Training_Data)
  * [4.1 Contents](#4.1_Contents)
  * [4.2 Introduction](#4.2_Introduction)
  * [4.3 Imports](#4.3_Imports)
  * [4.4 Load Data](#4.4_Load_Data)
  * [4.5 Explore the Data](#4.5_Explore_Data)
  * [4.6 Encode categorical values](#4.6_Encoding)
    * [4.6.1 Replace 'Yes', 'No' valuess](#4.6.1_Replace)  
    * [4.6.2 Encode other Categorical Variables](#4.6.2_Other)    
  * [4.7 Scaling numerical featuresl](#4.7_Scaling)
  * [4.8 Create new featuresl](#4.8_Create)   
  * [4.9 Feature selectionl](#4.9_Feature)   
  * [4.10 Train/Test Splitl](#4.10_Split)    
  * [4.14 Summary](#4.14_Summary)


## 4.2 Introduction<a id='4.2_Introduction'></a>

This notebook comprehensively outlines the process of preparing our dataset for machine learning applications. The workflow includes importing the necessary libraries, loading and exploring the data, conducting feature engineering, encoding categorical variables, scaling numerical features, generating new features, selecting relevant features, and finally, splitting the dataset into training and testing sets. Each step is designed to ensure that the dataset is optimized and ready for building robust and effective machine learning models.

## 4.3 Imports<a id='4.3_Imports'></a>

In [1]:
import pandas as pd
import numpy as np
import os
import pickle
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import __version__ as sklearn_version
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV, learning_curve
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_regression
import datetime

#from library.sb_utils import save_file

## 4.4 Load Data<a id='4.4_Load_Data'></a>

In [2]:
churn_data = pd.read_csv('churn_data_step3_features.csv')
churn_data.head().T

Unnamed: 0,0,1,2,3,4
Gender,0,1,1,1,0
Senior Citizen,0,0,0,0,0
Partner,Yes,No,No,No,No
Dependents,No,No,No,No,No
Tenure,1,34,2,45,2
Phone Service,No,Yes,Yes,No,Yes
Multiple Lines,No phone service,No,No,No phone service,No
Internet Service,DSL,DSL,DSL,DSL,Fiber optic
Online Security,No,Yes,Yes,Yes,No
Online Backup,Yes,No,Yes,No,No


## 4.5 Explore the Data<a id='4.5_Explore_Data'></a>

In [3]:
churn_data.describe()

Unnamed: 0,Gender,Senior Citizen,Tenure,Monthly Charges,Total Charges
count,7043.0,7043.0,7043.0,7043.0,7043.0
mean,0.504756,0.162147,32.371149,64.761692,2279.734304
std,0.500013,0.368612,24.559481,30.090047,2266.79447
min,0.0,0.0,0.0,18.25,0.0
25%,0.0,0.0,9.0,35.5,398.55
50%,1.0,0.0,29.0,70.35,1394.55
75%,1.0,0.0,55.0,89.85,3786.6
max,1.0,1.0,72.0,118.75,8684.8


In [4]:
churn_data.shape

(7043, 20)

In [5]:
churn_data.isnull().sum()

Gender                 0
Senior Citizen         0
Partner                0
Dependents             0
Tenure                 0
Phone Service          0
Multiple Lines         0
Internet Service       0
Online Security        0
Online Backup          0
Device Protection      0
Tech Support           0
Streaming TV           0
Streaming Movies       0
Contract               0
Paperless Billing      0
Payment Method         0
Monthly Charges        0
Total Charges          0
Churn                544
dtype: int64

In [6]:
churn_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 20 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Gender             7043 non-null   int64  
 1   Senior Citizen     7043 non-null   int64  
 2   Partner            7043 non-null   object 
 3   Dependents         7043 non-null   object 
 4   Tenure             7043 non-null   int64  
 5   Phone Service      7043 non-null   object 
 6   Multiple Lines     7043 non-null   object 
 7   Internet Service   7043 non-null   object 
 8   Online Security    7043 non-null   object 
 9   Online Backup      7043 non-null   object 
 10  Device Protection  7043 non-null   object 
 11  Tech Support       7043 non-null   object 
 12  Streaming TV       7043 non-null   object 
 13  Streaming Movies   7043 non-null   object 
 14  Contract           7043 non-null   object 
 15  Paperless Billing  7043 non-null   object 
 16  Payment Method     7043 

In [7]:
for x in churn_data.columns:
    print(f'{x} has value counts -- {churn_data[x].nunique()}')

Gender has value counts -- 2
Senior Citizen has value counts -- 2
Partner has value counts -- 2
Dependents has value counts -- 2
Tenure has value counts -- 73
Phone Service has value counts -- 2
Multiple Lines has value counts -- 3
Internet Service has value counts -- 3
Online Security has value counts -- 3
Online Backup has value counts -- 3
Device Protection has value counts -- 3
Tech Support has value counts -- 3
Streaming TV has value counts -- 3
Streaming Movies has value counts -- 3
Contract has value counts -- 3
Paperless Billing has value counts -- 2
Payment Method has value counts -- 4
Monthly Charges has value counts -- 1585
Total Charges has value counts -- 6531
Churn has value counts -- 2


In [8]:
# Copy churn_data to a new DataFrame df
df = churn_data.copy()

In [9]:
unique_values_partner = df['Payment Method'].unique()
print(unique_values_partner)

['Electronic check' 'Mailed check' 'Bank transfer (automatic)'
 'Credit card (automatic)']


## 4.6 Encode categorical values<a id='4.6_Encoding'></a>

### 4.6.1 Replace 'Yes', 'No' values<a id='4.6.1_Replace'></a>

In [10]:
# List of columns to convert
columns = ['Partner', 'Dependents', 'Phone Service', 'Multiple Lines', 'Online Security' , 
           'Online Backup', 'Device Protection', 'Tech Support', 'Streaming TV', 
           'Streaming Movies', 'Paperless Billing', 'Churn']

# Replace 'Yes' with 1 and 'No' with 0 and 'No internet service' with 0
df[columns] = df[columns].replace({'Yes': 1, 'No': 0, 'No internet service': 0, 'No phone service':0})

columns = ['Partner', 'Dependents', 'Phone Service', 'Multiple Lines', 'Online Security' , 
           'Online Backup', 'Device Protection', 'Tech Support', 'Streaming TV', 
           'Streaming Movies', 'Paperless Billing']

# Convert datatype to integer 
df[columns] = df[columns].astype(int)

df.head()

Unnamed: 0,Gender,Senior Citizen,Partner,Dependents,Tenure,Phone Service,Multiple Lines,Internet Service,Online Security,Online Backup,Device Protection,Tech Support,Streaming TV,Streaming Movies,Contract,Paperless Billing,Payment Method,Monthly Charges,Total Charges,Churn
0,0,0,1,0,1,0,0,DSL,0,1,0,0,0,0,Month-to-month,1,Electronic check,29.85,29.85,0.0
1,1,0,0,0,34,1,0,DSL,1,0,1,0,0,0,One year,0,Mailed check,56.95,1889.5,0.0
2,1,0,0,0,2,1,0,DSL,1,1,0,0,0,0,Month-to-month,1,Mailed check,53.85,108.15,1.0
3,1,0,0,0,45,0,0,DSL,1,0,1,1,0,0,One year,0,Bank transfer (automatic),42.3,1840.75,0.0
4,0,0,0,0,2,1,0,Fiber optic,0,0,0,0,0,0,Month-to-month,1,Electronic check,70.7,151.65,1.0


### 4.6.2 Encode other Categorical Variables<a id='4.6.2_Other'></a>

In [11]:
# List of specific columns
columns = ['Internet Service', 'Contract', 'Payment Method']

# Check and clean column names (if necessary)
df.columns = df.columns.str.strip()

# Get unique values for each column and print them
for column in columns:
    if column in df.columns:
        unique_values = df[column].unique()
        print(f"Unique values in the '{column}' column:")
        print(unique_values)
        print()  # Print a newline for better readability
    else:
        print(f"Column '{column}' not found in the DataFrame.")

Unique values in the 'Internet Service' column:
['DSL' 'Fiber optic' 'No']

Unique values in the 'Contract' column:
['Month-to-month' 'One year' 'Two year']

Unique values in the 'Payment Method' column:
['Electronic check' 'Mailed check' 'Bank transfer (automatic)'
 'Credit card (automatic)']



In [12]:
columns = ['Internet Service', 'Contract', 'Payment Method']

# Perform one-hot encoding on the specified columns
df = pd.get_dummies(df, columns=columns, drop_first=True)

df.head()

Unnamed: 0,Gender,Senior Citizen,Partner,Dependents,Tenure,Phone Service,Multiple Lines,Online Security,Online Backup,Device Protection,...,Monthly Charges,Total Charges,Churn,Internet Service_Fiber optic,Internet Service_No,Contract_One year,Contract_Two year,Payment Method_Credit card (automatic),Payment Method_Electronic check,Payment Method_Mailed check
0,0,0,1,0,1,0,0,0,1,0,...,29.85,29.85,0.0,0,0,0,0,0,1,0
1,1,0,0,0,34,1,0,1,0,1,...,56.95,1889.5,0.0,0,0,1,0,0,0,1
2,1,0,0,0,2,1,0,1,1,0,...,53.85,108.15,1.0,0,0,0,0,0,0,1
3,1,0,0,0,45,0,0,1,0,1,...,42.3,1840.75,0.0,0,0,1,0,0,0,0
4,0,0,0,0,2,1,0,0,0,0,...,70.7,151.65,1.0,1,0,0,0,0,1,0


## 4.7 Scale numerical features <a id='4.7_Scaling'></a>

In [13]:
#Feature scaling
from sklearn.preprocessing import StandardScaler

# Initialize the scaler
scaler = StandardScaler()

# Scale the numerical features
numeric_features = df.select_dtypes(include=['float64', 'int64']).columns
df[numeric_features] = scaler.fit_transform(df[numeric_features])

# Display the first few rows of the dataframe after scaling
print(df.head())

     Gender  Senior Citizen  Partner  Dependents    Tenure  Phone Service  \
0 -1.009559       -0.439916        1           0 -1.277445              0   
1  0.990532       -0.439916        0           0  0.066327              1   
2  0.990532       -0.439916        0           0 -1.236724              1   
3  0.990532       -0.439916        0           0  0.514251              0   
4 -1.009559       -0.439916        0           0 -1.236724              1   

   Multiple Lines  Online Security  Online Backup  Device Protection  ...  \
0               0                0              1                  0  ...   
1               0                1              0                  1  ...   
2               0                1              1                  0  ...   
3               0                1              0                  1  ...   
4               0                0              0                  0  ...   

   Monthly Charges  Total Charges     Churn  Internet Service_Fiber optic 

## 4.8 Create new features <a id='4.8_Create'></a>

In [14]:
#Create new features 
from sklearn.preprocessing import PolynomialFeatures

# Create interaction terms
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
interaction_terms = poly.fit_transform(df[numeric_features])

# Convert the interaction terms to a DataFrame
interaction_df = pd.DataFrame(interaction_terms, columns=poly.get_feature_names(numeric_features))

# Merge the interaction terms back into the original dataframe
df = pd.concat([df, interaction_df], axis=1)

# Display the first few rows of the dataframe after adding interaction terms
print(df.head())

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

## 4.9 Feature selection <a id='4.9_Feature'></a>

In [15]:
#Feature selection (removing highly correlated features)

# Calculate the correlation matrix
correlation_matrix = df.corr().abs()

# Select upper triangle of correlation matrix
upper = correlation_matrix.where(pd.np.triu(pd.np.ones(correlation_matrix.shape), k=1).astype(pd.np.bool))

# Find features with correlation greater than 0.95
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]

# Drop highly correlated features
df.drop(to_drop, axis=1, inplace=True)

# Display the first few rows of the dataframe after feature selection
print(df.head())

     Gender  Senior Citizen  Partner  Dependents    Tenure  Phone Service  \
0 -1.009559       -0.439916        1           0 -1.277445              0   
1  0.990532       -0.439916        0           0  0.066327              1   
2  0.990532       -0.439916        0           0 -1.236724              1   
3  0.990532       -0.439916        0           0  0.514251              0   
4 -1.009559       -0.439916        0           0 -1.236724              1   

   Multiple Lines  Online Security  Online Backup  Device Protection  ...  \
0               0                0              1                  0  ...   
1               0                1              0                  1  ...   
2               0                1              1                  0  ...   
3               0                1              0                  1  ...   
4               0                0              0                  0  ...   

   Monthly Charges  Total Charges     Churn  Internet Service_Fiber optic 

  upper = correlation_matrix.where(pd.np.triu(pd.np.ones(correlation_matrix.shape), k=1).astype(pd.np.bool))
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  upper = correlation_matrix.where(pd.np.triu(pd.np.ones(correlation_matrix.shape), k=1).astype(pd.np.bool))


## 4.10 Train/Test Split<a id='4.10_Split'></a>

In [16]:
# Split the data into train and test sets based on whether 'Churn' is NaN
train_df = df[df['Churn'].notna()]  # Train set: rows where 'Churn' is not NaN
test_df = df[df['Churn'].isna()]    # Test set: rows where 'Churn' is NaN

# Separate features (X) and target (y) for the train set
X_train = train_df.drop(columns='Churn')
y_train = train_df['Churn']

# Separate features (X) and target (y) for the test set
X_test = test_df.drop(columns='Churn')
y_test = test_df['Churn']

In [17]:
#Check the `dtypes` attribute of `X_train` to verify all features are numeric
X_train.dtypes

Gender                                    float64
Senior Citizen                            float64
Partner                                     int32
Dependents                                  int32
Tenure                                    float64
Phone Service                               int32
Multiple Lines                              int32
Online Security                             int32
Online Backup                               int32
Device Protection                           int32
Tech Support                                int32
Streaming TV                                int32
Streaming Movies                            int32
Paperless Billing                           int32
Monthly Charges                           float64
Total Charges                             float64
Internet Service_Fiber optic                uint8
Internet Service_No                         uint8
Contract_One year                           uint8
Contract_Two year                           uint8


In [18]:
X_test.dtypes

Gender                                    float64
Senior Citizen                            float64
Partner                                     int32
Dependents                                  int32
Tenure                                    float64
Phone Service                               int32
Multiple Lines                              int32
Online Security                             int32
Online Backup                               int32
Device Protection                           int32
Tech Support                                int32
Streaming TV                                int32
Streaming Movies                            int32
Paperless Billing                           int32
Monthly Charges                           float64
Total Charges                             float64
Internet Service_Fiber optic                uint8
Internet Service_No                         uint8
Contract_One year                           uint8
Contract_Two year                           uint8


## 4.14 Summary

In this notebook, we systematically prepare a dataset for machine learning applications through a series of detailed steps. We begin by importing essential libraries required for data manipulation, visualization, and machine learning. Next, we load and explore the churn dataset, performing initial exploratory data analysis to understand its structure and content. We then proceed with feature engineering, where we encode categorical variables, scale numerical features, and generate new interaction terms to capture potential non-linear relationships in the data. Additionally, we conduct feature selection to remove highly correlated features that could negatively impact model performance. Finally, we split the dataset into training and testing sets based on the presence of NaN values in the 'Churn' column. This comprehensive approach ensures that the dataset is clean, well-structured, and optimized for building robust and effective machine learning models.