# Module 3 - Classification

## Churn prediction project

For this project we'll use the **[Telco Customer Churn](https://www.kaggle.com/datasets/blastchar/telco-customer-churn?resource=download)** dataset, available on Kaggle.

The file has already been downloaded and placed in this repository.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv('./WA_Fn-UseC_-Telco-Customer-Churn.csv')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


## Data Preparation

In [4]:
df.columns = df.columns.str.lower().str.replace(' ','_')

In [5]:
df_categorical_columns = list(df.dtypes[df.dtypes == 'object'].index)

In [6]:
for colum in df_categorical_columns:
    df[colum] = df[colum].str.lower().str.replace(' ','_')

In [7]:
df.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| customerid | 7590-vhveg | 5575-gnvde | 3668-qpybk | 7795-cfocw | 9237-hqitu |
| gender | female | male | male | male | female |
| seniorcitizen | 0 | 0 | 0 | 0 | 0 |
| partner | yes | no | no | no | no |
| dependents | no | no | no | no | no |
| tenure | 1 | 34 | 2 | 45 | 2 |
| phoneservice | no | yes | yes | no | yes |
| multiplelines | no_phone_service | no | no | no_phone_service | no |
| internetservice | dsl | dsl | dsl | dsl | fiber_optic |
| onlinesecurity | no | yes | yes | yes | no |


In [8]:
df.dtypes

customerid           object
gender               object
seniorcitizen         int64
partner              object
dependents           object
tenure                int64
phoneservice         object
multiplelines        object
internetservice      object
onlinesecurity       object
onlinebackup         object
deviceprotection     object
techsupport          object
streamingtv          object
streamingmovies      object
contract             object
paperlessbilling     object
paymentmethod        object
monthlycharges      float64
totalcharges         object
churn                object
dtype: object

The totalcharges column should be numeric, but pandas loaded it as an object column.

To understand why, we can try converting it to a numeric type and look at the result

In [9]:
pd.to_numeric(df.totalcharges)

ValueError: Unable to parse string "_" at position 488

Pandas is telling us that some values contain only "_". These were spaces in the original file, and they became underscores when we replaced spaces above.

To solve this we can do the following:

In [10]:
tc = pd.to_numeric(df.totalcharges, errors='coerce') # this will set the errors to NaN

In [11]:
tc.isnull().sum()

11

Now we can see which rows are null and decide how to handle them

In [12]:
df[tc.isnull()][['customerid', 'totalcharges']]

| | customerid | totalcharges |
|---|---|---|
| 488 | 4472-lvygi | _ |
| 753 | 3115-czmzd | _ |
| 936 | 5709-lvoeq | _ |
| 1082 | 4367-nuyao | _ |
| 1340 | 1371-dwpaz | _ |
| 3331 | 7644-omvmy | _ |
| 3826 | 3213-vvolg | _ |
| 4380 | 2520-sgtta | _ |
| 5218 | 2923-arzlg | _ |
| 6670 | 4075-wkniu | _ |


Now that we have identified the problem, we can apply the same conversion to the actual column

In [13]:
df.totalcharges = pd.to_numeric(df.totalcharges, errors='coerce') # this will set the errors to NaN

In [14]:
# and fill with 0 the null values
df.totalcharges = df.totalcharges.fillna(0)

In [15]:
df.totalcharges.isnull().sum()

0

And now there are no null values and all the values are numeric as we can see:

In [16]:
df.dtypes

customerid           object
gender               object
seniorcitizen         int64
partner              object
dependents           object
tenure                int64
phoneservice         object
multiplelines        object
internetservice      object
onlinesecurity       object
onlinebackup         object
deviceprotection     object
techsupport          object
streamingtv          object
streamingmovies      object
contract             object
paperlessbilling     object
paymentmethod        object
monthlycharges      float64
totalcharges        float64
churn                object
dtype: object
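
The totalcharges cleanup can be reproduced end to end on a toy Series. This is a minimal sketch: the values are invented, and it assumes the problematic entries in the raw file are single-space strings, which is what the "_" error above suggests.

```python
import pandas as pd

# Toy stand-in for the totalcharges column (invented values, not the real data).
raw = pd.Series(['29.85', ' ', '1889.5', ' '])

# Same normalization as in the notebook: spaces become underscores.
normalized = raw.str.lower().str.replace(' ', '_')

# errors='coerce' turns anything unparseable into NaN instead of raising.
numeric = pd.to_numeric(normalized, errors='coerce')
n_missing = int(numeric.isnull().sum())

# Fill the gaps with 0, as done above.
cleaned = numeric.fillna(0)
print(n_missing, cleaned.tolist())
```

Filling with 0 is the choice made in this notebook; whether 0 is the right replacement depends on what a blank total charge means for the business.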

Finally, let's check our target column, **churn**

In [17]:
df.churn.head()

0     no
1     no
2    yes
3     no
4    yes
Name: churn, dtype: object

The values are yes and no, but the model needs them as 1s and 0s

We can convert them with the following trick

In [18]:
df.churn = (df.churn == 'yes').astype(int)

In [19]:
df.churn.head()

0    0
1    0
2    1
3    0
4    1
Name: churn, dtype: int64

Now we have 1s and 0s instead of yes and no
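
To make the trick explicit, here is a small sketch on the first five labels shown above. The `map` variant is an alternative added for comparison, not something the notebook uses.

```python
import pandas as pd

churn = pd.Series(['no', 'no', 'yes', 'no', 'yes'])  # first five values shown above

# The comparison yields booleans; astype(int) maps True -> 1 and False -> 0.
encoded = (churn == 'yes').astype(int)
print(encoded.tolist())

# An explicit mapping gives the same result here and leaves NaN
# for any unexpected label, which makes stray values easier to spot.
mapped = churn.map({'yes': 1, 'no': 0})
print(mapped.tolist())
```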

## Setting the Validation Framework

We previously calculated the size of the dataframes by doing this:

In [25]:
n = len(df)
n_val = int(len(df) * 0.2)
n_test = int(len(df) * 0.2)
n_train = n - n_val - n_test

n_test, n_val, n_train, (n_test + n_val + n_train), n

(1408, 1408, 4227, 7043, 7043)

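Now we are going to use the scikit-learn library to get an equivalent 60/20/20 split. We split twice: first we hold out 20% of the data for testing, then we take 25% of the remaining 80% for validation, which is 20% of the full dataset. The resulting sizes differ by one row from the manual calculation because of how the split sizes are rounded.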

In [20]:
from sklearn.model_selection import train_test_split

In [26]:
train_test_split?

Signature:
train_test_split(
    *arrays,
    test_size=None,
    train_size=None,
    random_state=None,
    shuffle=True,
    stratify=None,
)
Docstring:
Split arrays or matrices into random train and test subsets.

Quick utility that wraps input validation,
``next(ShuffleSplit().split(X, y))``, and application to input data
into a single call for splitting (and optionally subsampling) data into a
one-liner.

Read more in the :ref:`User Guide <cross_validation>`.

Parameters
----------
*arrays : sequence of indexables with sa

In [31]:
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)

In [32]:
len(df_full_train), len(df_test)

(5634, 1409)

In [33]:
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)

In [34]:
len(df_train), len(df_val)

(4225, 1409)
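
The two-step split can be checked on a stand-in array with the same number of rows. This is only a sketch of the size arithmetic; `rows` is a hypothetical placeholder for the dataframe.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(7043)  # stand-in for the 7043 rows of df

# Step 1: hold out 20% of everything for the final test.
full_train, test = train_test_split(rows, test_size=0.2, random_state=1)

# Step 2: 25% of the remaining 80% is 20% of the original data.
train, val = train_test_split(full_train, test_size=0.25, random_state=1)

sizes = (len(train), len(val), len(test))
print(sizes)  # (4225, 1409, 1409), matching the lengths shown above
```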

In [35]:
df_train = df_train.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)

Now we can create the target vectors

In [37]:
y_train = df_train.churn.values
y_val = df_val.churn.values
y_test = df_test.churn.values

And delete this data from the train, val and test dataframes

In [38]:
del df_train['churn']
del df_val['churn']
del df_test['churn']

## EDA (Exploratory Data Analysis)

In [40]:
df_full_train = df_full_train.reset_index(drop=True)

First we check whether there are any missing values

In [41]:
df_full_train.isnull().sum()

customerid          0
gender              0
seniorcitizen       0
partner             0
dependents          0
tenure              0
phoneservice        0
multiplelines       0
internetservice     0
onlinesecurity      0
onlinebackup        0
deviceprotection    0
techsupport         0
streamingtv         0
streamingmovies     0
contract            0
paperlessbilling    0
paymentmethod       0
monthlycharges      0
totalcharges        0
churn               0
dtype: int64

In [42]:
df_full_train.churn.value_counts()

churn
0    4113
1    1521
Name: count, dtype: int64

In [43]:
# normalize=True returns proportions instead of counts
df_full_train.churn.value_counts(normalize=True)

churn
0    0.730032
1    0.269968
Name: proportion, dtype: float64

In this case the value 0.27 (0.269968 rounded) is the **"Churn Rate"**

We could also calculate this value by getting the mean value of the churn

In [45]:
df_full_train.churn.mean()

0.26996805111821087

We could call it the **Global Churn Rate**

In [47]:
global_churn_rate = df_full_train.churn.mean()
round(global_churn_rate, 2)

0.27

This means that 27% of the users are churning.

Now let's take a look at the other columns/variables

In [48]:
df_full_train.dtypes

customerid           object
gender               object
seniorcitizen         int64
partner              object
dependents           object
tenure                int64
phoneservice         object
multiplelines        object
internetservice      object
onlinesecurity       object
onlinebackup         object
deviceprotection     object
techsupport          object
streamingtv          object
streamingmovies      object
contract             object
paperlessbilling     object
paymentmethod        object
monthlycharges      float64
totalcharges        float64
churn                 int64
dtype: object

We want to have two lists of columns:
- Numerical columns
- Categorical columns

In [49]:
# Numerical
numerical = ['tenure', 'monthlycharges', 'totalcharges']

In [51]:
categorical = ['gender', 'seniorcitizen', 'partner', 'dependents',
       'phoneservice', 'multiplelines', 'internetservice',
       'onlinesecurity', 'onlinebackup', 'deviceprotection', 'techsupport',
       'streamingtv', 'streamingmovies', 'contract', 'paperlessbilling',
       'paymentmethod']

We can see the number of unique values on the categorical columns. This will help when we encode these values for training our model

In [53]:
df_full_train[categorical].nunique()

gender              2
seniorcitizen       2
partner             2
dependents          2
phoneservice        2
multiplelines       3
internetservice     3
onlinesecurity      3
onlinebackup        3
deviceprotection    3
techsupport         3
streamingtv         3
streamingmovies     3
contract            3
paperlessbilling    2
paymentmethod       4
dtype: int64

## Feature Importance: Churn rate and risk ratio

Feature importance analysis (part of EDA) helps us identify which features actually affect our target variable

- Churn rate
- Risk ratio
- Mutual information

#### **Churn rate**

For this we can look at the churn rate within different groups

In [55]:
df_full_train.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| customerid | 5442-pptjy | 6261-rcvns | 2176-osjuv | 6161-erdgd | 2364-ufrom |
| gender | male | female | male | male | male |
| seniorcitizen | 0 | 0 | 0 | 0 | 0 |
| partner | yes | no | yes | yes | no |
| dependents | yes | no | no | yes | no |
| tenure | 12 | 42 | 71 | 71 | 30 |
| phoneservice | yes | yes | yes | yes | yes |
| multiplelines | no | no | yes | yes | no |
| internetservice | no | dsl | dsl | dsl | dsl |
| onlinesecurity | no_internet_service | yes | yes | yes | yes |


For example we can see the churn rate based on gender

In [61]:
churn_rate_male = df_full_train[df_full_train['gender'] == 'male']['churn'].mean()
print(f'Male churn rate {churn_rate_male:f}')

Male churn rate 0.263214


In [62]:
churn_rate_female = df_full_train[df_full_train['gender'] == 'female']['churn'].mean()
print(f'Female churn rate {churn_rate_female:f}')

Female churn rate 0.276824


The churn rate is almost the same for both genders, only slightly higher for females

In [63]:
global_churn_rate

0.26996805111821087

We can do the same with another column, for example partner

In [67]:
churn_partner = df_full_train[df_full_train.partner == 'yes']['churn'].mean()
print(f'The churn for people with partner is: {churn_partner:f}')

The churn for people with partner is: 0.205033


In [75]:
churn_no_partner = df_full_train[df_full_train.partner == 'no']['churn'].mean()
print(f'The churn for people with NO partner is: {churn_no_partner:f}')

The churn for people with NO partner is: 0.329809


In this case the churn rate differs noticeably depending on whether the client has a partner

We can also calculate the difference between the global churn rate and each of these group churn rates

In [69]:
global_churn_rate - churn_partner

0.06493474245795922

In [70]:
global_churn_rate - churn_no_partner

-0.05984095297455855

If we calculate the same difference for the gender groups, we can see that the difference is really small

In [72]:
global_churn_rate - churn_rate_male

0.006754520462819769

In [73]:
global_churn_rate - churn_rate_female

-0.006855983216553063

So we can use the difference as a way to evaluate **feature importance**

1. Difference

Global Churn - Group Churn, where group churn is the churn rate among customers sharing a specific value of a column

- if the **difference is > 0**, the global churn is higher, so this group is **less likely** to churn
- if the **difference is < 0**, the group churn is higher, so this group is **more likely** to churn

#### **Risk Ratio**

Instead of the difference, we can calculate the ratio of the group churn to the global churn and use it to evaluate feature importance

Risk = Group / Global

- if **Risk is > 1** the group is more likely to churn
- if **Risk is < 1** the group is less likely to churn

In [74]:
churn_no_partner / global_churn_rate

1.2216593879412643

In [76]:
churn_partner / global_churn_rate

0.7594724924338315
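
Both measures can be wrapped in one small helper. The sketch below uses the rates printed above; `diff_and_risk` is a hypothetical name introduced only for illustration.

```python
def diff_and_risk(group_rate, global_rate):
    """Return (global - group, group / global) for one group."""
    return global_rate - group_rate, group_rate / global_rate

global_rate = 0.26996805111821087  # global churn rate from above
groups = {'partner=yes': 0.205033, 'partner=no': 0.329809}  # rounded rates printed above

for name, rate in groups.items():
    diff, risk = diff_and_risk(rate, global_rate)
    verdict = 'more likely to churn' if risk > 1 else 'less likely to churn'
    print(f'{name}: diff={diff:+.3f} risk={risk:.2f} -> {verdict}')
```

The two measures always agree on direction: a positive difference (global minus group) goes with a risk below 1, and a negative difference with a risk above 1.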

We can think of the Difference and Risk Ratio calculations as the equivalent of running the following SQL query:

```
SELECT
    gender,
    AVG(churn),
    AVG(churn) - global_churn AS diff,
    AVG(churn) / global_churn AS risk

FROM
    data
GROUP BY
    gender
```

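And we could do it for each of the features in the dataset. Note that the query computes the difference as group minus global, the opposite sign of the cells above, so a positive diff now means the group churns more than average. Let's see how we could do this in pandas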

In [86]:
for c in categorical:
    print('#######################')
    print(c)
    df_group = df_full_train.groupby(c)['churn'].agg(['mean', 'sum'])
    df_group['diff'] = df_group['mean'] - global_churn_rate
    df_group['risk'] = df_group['mean'] / global_churn_rate
    print(df_group)
    print('#######################')
    print()

#######################
gender
            mean  sum      diff      risk
gender                                   
female  0.276824  774  0.006856  1.025396
male    0.263214  747 -0.006755  0.974980
#######################

#######################
seniorcitizen
                   mean   sum      diff      risk
seniorcitizen                                    
0              0.242270  1144 -0.027698  0.897403
1              0.413377   377  0.143409  1.531208
#######################

#######################
partner
             mean  sum      diff      risk
partner                                   
no       0.329809  967  0.059841  1.221659
yes      0.205033  554 -0.064935  0.759472
#######################

#######################
dependents
                mean   sum      diff      risk
dependents                                    
no          0.313760  1245  0.043792  1.162212
yes         0.165666   276 -0.104302  0.613651
#######################

#######################
phoneservice
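
Instead of printing one table per column, the results can be collected into a single frame and sorted by risk. The following is a sketch on invented toy data; the column values and the `summary` name are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    'partner': ['yes', 'no', 'no', 'yes', 'no', 'yes'],
    'contract': ['two_year', 'month-to-month', 'month-to-month',
                 'one_year', 'month-to-month', 'two_year'],
    'churn': [0, 1, 1, 0, 0, 0],
})
global_rate = toy.churn.mean()

frames = []
for c in ['partner', 'contract']:
    g = toy.groupby(c)['churn'].agg(['mean', 'count'])
    g['diff'] = g['mean'] - global_rate  # group minus global, as in the loop above
    g['risk'] = g['mean'] / global_rate
    g.index = [f'{c}={v}' for v in g.index]  # label each row as column=value
    frames.append(g)

# One table for all columns, highest-risk groups first.
summary = pd.concat(frames).sort_values('risk', ascending=False)
print(summary)
```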

So far we have analyzed how the values of an individual variable relate to the likelihood of churn.

Now we will see how to compare variables with each other, to understand which ones are more important

## Feature Importance: Mutual information

Mutual information tells us how much we can learn about one variable if we know the value of another

In [87]:
from sklearn.metrics import mutual_info_score

In [88]:
mutual_info_score(df_full_train.churn, df_full_train.contract) # we use column contract as an example

0.0983203874041556

In [89]:
# if we do the same with another column like gender
mutual_info_score(df_full_train.churn, df_full_train.gender)

0.0001174846211139946

We can clearly see that contract has a higher score than gender, which means that contract will be more informative when talking about churn compared to gender
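
To see what the score measures, it can be recomputed by hand from the joint and marginal frequencies and compared with scikit-learn. This is a sketch on invented labels; `mutual_info_manual` is a hypothetical helper, and it uses the natural logarithm, which is what `mutual_info_score` uses.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

# Toy labels, invented for illustration only.
churn = np.array([1, 1, 1, 0, 0, 0, 0, 0])
contract = np.array(['m', 'm', 'm', 'm', 'y', 'y', 'y', 'y'])

def mutual_info_manual(x, y):
    """Sum over value pairs of p(x,y) * log(p(x,y) / (p(x) * p(y)))."""
    total = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            p_xy = np.mean((x == xv) & (y == yv))
            if p_xy > 0:  # empty cells contribute nothing
                total += p_xy * np.log(p_xy / (np.mean(x == xv) * np.mean(y == yv)))
    return total

manual = mutual_info_manual(churn, contract)
library = mutual_info_score(churn, contract)
print(manual, library)
```

A score of 0 means the two variables are independent; larger values mean that knowing one reduces more of the uncertainty about the other.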

We can apply this to all categorical variables to compare them

In [90]:
def mutual_info_churn_score(series):
    return mutual_info_score(df_full_train.churn, series)

In [94]:
mi = df_full_train[categorical].apply(mutual_info_churn_score).sort_values(ascending=False)
mi

contract            0.098320
onlinesecurity      0.063085
techsupport         0.061032
internetservice     0.055868
onlinebackup        0.046923
deviceprotection    0.043453
paymentmethod       0.043210
streamingtv         0.031853
streamingmovies     0.031581
paperlessbilling    0.017589
dependents          0.012346
partner             0.009968
seniorcitizen       0.009410
multiplelines       0.000857
phoneservice        0.000229
gender              0.000117
dtype: float64

With this result we can see which categorical variables have the most impact on the target variable.

This matters because, when we analyzed variables individually, partner looked like it could be important, but compared with the rest its score is low

Next we will analyze the numerical variables

## Feature importance: Correlation

To measure the dependency between a numerical variable and the target we use correlation.

In particular we measure the **Correlation Coefficient** (Pearson's, which is the default method of `corrwith`)

In [96]:
df_full_train[numerical].corrwith(df_full_train.churn)

tenure           -0.351885
monthlycharges    0.196805
totalcharges     -0.196353
dtype: float64

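In this result, a negative correlation means that as the variable increases, the chance of churn decreases, and a positive correlation means the chance of churn increases.

So the more months a client has had the service (tenure), the less likely the client is to churn, while higher monthly charges go with a higher chance of churn.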
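
The coefficient itself is the covariance of the two variables divided by the product of their standard deviations. Here is a minimal sketch on invented data, with a hypothetical `pearson` helper, to show where the sign comes from.

```python
import numpy as np

# Toy data, invented for illustration: longer tenure, less churn.
tenure = np.array([1, 2, 5, 12, 24, 36, 48, 72], dtype=float)
churn = np.array([1, 1, 1, 0, 1, 0, 0, 0], dtype=float)

def pearson(x, y):
    """Covariance of x and y divided by the product of their standard deviations."""
    xc, yc = x - x.mean(), y - y.mean()  # center both variables
    return (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

r = pearson(tenure, churn)
print(r)  # negative: churners sit mostly below the mean tenure
```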

## One-hot encoding

We will use scikit-learn to encode categorical values

In [97]:
from sklearn.feature_extraction import DictVectorizer

In [98]:
# let's look at a sample to one-hot encode
df_train[['gender', 'contract']].iloc[:10]

| | gender | contract |
|---|---|---|
| 0 | female | two_year |
| 1 | male | month-to-month |
| 2 | female | month-to-month |
| 3 | female | month-to-month |
| 4 | female | two_year |
| 5 | male | month-to-month |
| 6 | male | month-to-month |
| 7 | female | month-to-month |
| 8 | female | two_year |
| 9 | female | month-to-month |


In [100]:
# now convert it to a dict
df_train[['gender', 'contract']].iloc[:10].to_dict(orient='records')

[{'gender': 'female', 'contract': 'two_year'},
 {'gender': 'male', 'contract': 'month-to-month'},
 {'gender': 'female', 'contract': 'month-to-month'},
 {'gender': 'female', 'contract': 'month-to-month'},
 {'gender': 'female', 'contract': 'two_year'},
 {'gender': 'male', 'contract': 'month-to-month'},
 {'gender': 'male', 'contract': 'month-to-month'},
 {'gender': 'female', 'contract': 'month-to-month'},
 {'gender': 'female', 'contract': 'two_year'},
 {'gender': 'female', 'contract': 'month-to-month'}]

In [112]:
# same conversion for the first 100 rows, stored in a variable
dicts = df_train[['gender', 'contract']].iloc[:100].to_dict(orient='records')

Now we have a list of dictionaries where each dictionary has two key/value pairs

In [113]:
dv = DictVectorizer(sparse=False) # we don't want a sparse matrix here, since we want to see all the 1 and 0 values

In [114]:
dv.fit(dicts)

In [118]:
dv.get_feature_names_out()

array(['contract=month-to-month', 'contract=one_year',
       'contract=two_year', 'gender=female', 'gender=male'], dtype=object)

In [115]:
dv.transform(dicts)

array([[0., 0., 1., 1., 0.],
       [1., 0., 0., 0., 1.],
       [1., 0., 0., 1., 0.],
       [1., 0., 0., 1., 0.],
       [0., 0., 1., 1., 0.],
       [1., 0., 0., 0., 1.],
       [1., 0., 0., 0., 1.],
       [1., 0., 0., 1., 0.],
       [0., 0., 1., 1., 0.],
       [1., 0., 0., 1., 0.],
       [0., 0., 1., 1., 0.],
       [1., 0., 0., 0., 1.],
       [0., 0., 1., 1., 0.],
       [1., 0., 0., 1., 0.],
       [1., 0., 0., 1., 0.],
       [1., 0., 0., 0., 1.],
       [0., 0., 1., 1., 0.],
       [1., 0., 0., 1., 0.],
       [0., 1., 0., 0., 1.],
       [0., 0., 1., 0., 1.],
       [1., 0., 0., 0., 1.],
       [0., 1., 0., 1., 0.],
       [1., 0., 0., 1., 0.],
       [0., 0., 1., 1., 0.],
       [1., 0., 0., 0., 1.],
       [0., 0., 1., 0., 1.],
       [1., 0., 0., 1., 0.],
       [1., 0., 0., 1., 0.],
       [1., 0., 0., 1., 0.],
       [0., 1., 0., 1., 0.],
       [1., 0., 0., 0., 1.],
       [1., 0., 0., 0., 1.],
       [0., 1., 0., 1., 0.],
       [0., 1., 0., 1., 0.],
       [1., 0.

Let's do exactly the same, but adding a numerical variable

In [119]:
# now convert it to a dict
dicts = df_train[['gender', 'contract', 'tenure']].iloc[:100].to_dict(orient='records')

In [121]:
dv.fit(dicts)

In [123]:
dv.get_feature_names_out()

array(['contract=month-to-month', 'contract=one_year',
       'contract=two_year', 'gender=female', 'gender=male', 'tenure'],
      dtype=object)

In [125]:
dv.transform(dicts)

array([[ 0.,  0.,  1.,  1.,  0., 72.],
       [ 1.,  0.,  0.,  0.,  1., 10.],
       [ 1.,  0.,  0.,  1.,  0.,  5.],
       [ 1.,  0.,  0.,  1.,  0.,  5.],
       [ 0.,  0.,  1.,  1.,  0., 18.],
       [ 1.,  0.,  0.,  0.,  1.,  4.],
       [ 1.,  0.,  0.,  0.,  1.,  1.],
       [ 1.,  0.,  0.,  1.,  0.,  1.],
       [ 0.,  0.,  1.,  1.,  0., 72.],
       [ 1.,  0.,  0.,  1.,  0.,  6.],
       [ 0.,  0.,  1.,  1.,  0., 72.],
       [ 1.,  0.,  0.,  0.,  1., 17.],
       [ 0.,  0.,  1.,  1.,  0., 66.],
       [ 1.,  0.,  0.,  1.,  0.,  2.],
       [ 1.,  0.,  0.,  1.,  0.,  4.],
       [ 1.,  0.,  0.,  0.,  1.,  3.],
       [ 0.,  0.,  1.,  1.,  0., 71.],
       [ 1.,  0.,  0.,  1.,  0., 32.],
       [ 0.,  1.,  0.,  0.,  1., 53.],
       [ 0.,  0.,  1.,  0.,  1., 56.],
       [ 1.,  0.,  0.,  0.,  1., 61.],
       [ 0.,  1.,  0.,  1.,  0., 41.],
       [ 1.,  0.,  0.,  1.,  0.,  1.],
       [ 0.,  0.,  1.,  1.,  0.,  3.],
       [ 1.,  0.,  0.,  0.,  1.,  3.],
       [ 0.,  0.,  1.,  0

So DictVectorizer understands that the tenure variable is numerical and doesn't try to one-hot encode it

Now let's try with all variables

In [126]:
# now convert it to a dict
train_dicts = df_train[categorical + numerical].to_dict(orient='records')

In [128]:
train_dicts[0]

{'gender': 'female',
 'seniorcitizen': 0,
 'partner': 'yes',
 'dependents': 'yes',
 'phoneservice': 'yes',
 'multiplelines': 'yes',
 'internetservice': 'fiber_optic',
 'onlinesecurity': 'yes',
 'onlinebackup': 'yes',
 'deviceprotection': 'yes',
 'techsupport': 'yes',
 'streamingtv': 'yes',
 'streamingmovies': 'yes',
 'contract': 'two_year',
 'paperlessbilling': 'yes',
 'paymentmethod': 'electronic_check',
 'tenure': 72,
 'monthlycharges': 115.5,
 'totalcharges': 8425.15}

In [129]:
dv.fit(train_dicts)

In [133]:
dv.get_feature_names_out() # -> Now it is a much larger list of columns

array(['contract=month-to-month', 'contract=one_year',
       'contract=two_year', 'dependents=no', 'dependents=yes',
       'deviceprotection=no', 'deviceprotection=no_internet_service',
       'deviceprotection=yes', 'gender=female', 'gender=male',
       'internetservice=dsl', 'internetservice=fiber_optic',
       'internetservice=no', 'monthlycharges', 'multiplelines=no',
       'multiplelines=no_phone_service', 'multiplelines=yes',
       'onlinebackup=no', 'onlinebackup=no_internet_service',
       'onlinebackup=yes', 'onlinesecurity=no',
       'onlinesecurity=no_internet_service', 'onlinesecurity=yes',
       'paperlessbilling=no', 'paperlessbilling=yes', 'partner=no',
       'partner=yes', 'paymentmethod=bank_transfer_(automatic)',
       'paymentmethod=credit_card_(automatic)',
       'paymentmethod=electronic_check', 'paymentmethod=mailed_check',
       'phoneservice=no', 'phoneservice=yes', 'seniorcitizen',
       'streamingmovies=no', 'streamingmovies=no_internet_service',

In [134]:
dv.transform(train_dicts)

array([[0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 1.00000e+00,
        7.20000e+01, 8.42515e+03],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        1.00000e+01, 1.02155e+03],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        5.00000e+00, 4.13650e+02],
       ...,
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        2.00000e+00, 1.90050e+02],
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 0.00000e+00,
        2.70000e+01, 7.61950e+02],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        9.00000e+00, 7.51650e+02]])

In [137]:
# another way to do it, with fewer lines of code:

X_train = dv.fit_transform(train_dicts) # equivalent to calling fit and then transform

After this we can move forward with validation data

In [138]:
val_dicts = df_val[categorical + numerical].to_dict(orient='records')

In [139]:
X_val = dv.transform(val_dicts)

In [140]:
X_val

array([[0.0000e+00, 0.0000e+00, 1.0000e+00, ..., 1.0000e+00, 7.1000e+01,
        4.9734e+03],
       [1.0000e+00, 0.0000e+00, 0.0000e+00, ..., 0.0000e+00, 1.0000e+00,
        2.0750e+01],
       [1.0000e+00, 0.0000e+00, 0.0000e+00, ..., 0.0000e+00, 1.0000e+00,
        2.0350e+01],
       ...,
       [1.0000e+00, 0.0000e+00, 0.0000e+00, ..., 1.0000e+00, 1.8000e+01,
        1.0581e+03],
       [1.0000e+00, 0.0000e+00, 0.0000e+00, ..., 0.0000e+00, 1.0000e+00,
        9.3300e+01],
       [1.0000e+00, 0.0000e+00, 0.0000e+00, ..., 0.0000e+00, 3.0000e+00,
        2.9285e+02]])