# Machine Learning with Python: Part 2

In this tutorial, we will work on a dataset about bank loans. We will use the data to predict whether a loan will be given to an applicant. 

We will learn about the following topics in this tutorial:
- Encoding categorical variables
- Cross-validation
- Hyperparameter tuning
- Feature selection

Let's start by loading the dataset and taking a look at the data.

In [11]:
import pandas as pd

df = pd.read_csv("loans.csv")
df.head()

Unnamed: 0,gender,married,dependents,education,self_employed,applicant_income,coapplicant_income,loan_amount,term,credit_history,area,loan_given
0,Male,No,0,Graduate,No,584900,0.0,15000000,360.0,1.0,Urban,Y
1,Male,Yes,1,Graduate,No,458300,150800.0,12800000,360.0,1.0,Rural,N
2,Male,Yes,0,Graduate,Yes,300000,0.0,6600000,360.0,1.0,Urban,Y
3,Male,Yes,0,Not Graduate,No,258300,235800.0,12000000,360.0,1.0,Urban,Y
4,Male,No,0,Graduate,No,600000,0.0,14100000,360.0,1.0,Urban,Y


In [12]:
df["loan_given"].value_counts(normalize=True)

loan_given
Y    0.687296
N    0.312704
Name: proportion, dtype: float64

In [13]:
df["credit_history"].value_counts(normalize=True)

credit_history
1.0    0.842199
0.0    0.157801
Name: proportion, dtype: float64

In [14]:
df["dependents"].value_counts(normalize=True)

dependents
0     0.575960
1     0.170284
2     0.168614
3+    0.085142
Name: proportion, dtype: float64

In [15]:
df["education"].value_counts(normalize=True)

education
Graduate        0.781759
Not Graduate    0.218241
Name: proportion, dtype: float64

In [16]:
df["area"].value_counts(normalize=True)

area
Semiurban    0.379479
Urban        0.328990
Rural        0.291531
Name: proportion, dtype: float64

The dataset contains information about loan applicants, including their gender, marital status, number of dependents, education level, employment status, income, loan amount, loan term, credit history, area, and whether a loan was given or not.

The column `loan_given` is our target variable (the variable we want to predict). The other columns are our features (the variables we use to predict the target variable).

# Cleaning the Data

Before we can use the data for machine learning, we need to clean it. Let's handle missing values first.

In [17]:
df.isna().sum() / len(df) * 100

gender                2.117264
married               0.488599
dependents            2.442997
education             0.000000
self_employed         5.211726
applicant_income      0.000000
coapplicant_income    0.000000
loan_amount           0.000000
term                  2.280130
credit_history        8.143322
area                  0.000000
loan_given            0.000000
dtype: float64

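The column with the highest share of missing values is `credit_history`. Still, it is only missing 8.14% of the values, so we can drop the rows with missing values. Because the missing values are spread over several columns, this removes more than 8.14% of the rows: 499 rows remain afterwards, as the later outputs show.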

In [18]:
df.dropna(inplace=True)
df.isna().sum() / len(df) * 100

gender                0.0
married               0.0
dependents            0.0
education             0.0
self_employed         0.0
applicant_income      0.0
coapplicant_income    0.0
loan_amount           0.0
term                  0.0
credit_history        0.0
area                  0.0
loan_given            0.0
dtype: float64
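
Per-column percentages can understate how many rows `dropna` removes, because the missing values in different columns often fall in different rows. The sketch below illustrates this on a small made-up DataFrame. The `toy` frame and its values are assumptions for illustration, not the loans data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: each column misses exactly one value, each in a different row
toy = pd.DataFrame({
    "gender": ["Male", np.nan, "Female", "Male", "Female"],
    "term": [360.0, 360.0, np.nan, 180.0, 360.0],
    "credit_history": [1.0, 0.0, 1.0, np.nan, 1.0],
})

# Share of missing values per column
missing_per_column = toy.isna().mean() * 100

# Share of rows that contain at least one missing value
rows_lost = toy.isna().any(axis=1).mean() * 100

print(missing_per_column)
print(f"Rows removed by dropna: {rows_lost:.0f}%")

# dropna() keeps only the rows without any missing value
cleaned = toy.dropna()
print(len(cleaned), "rows remain")
```

Each column here misses one value in five, yet three of the five rows are dropped. Comparing `len(df)` before and after `dropna` is a quick way to check this on real data.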

In [19]:
df.duplicated().sum()

0

# Encoding Categorical Variables

So far, we have only worked with numerical variables. However, many datasets contain categorical variables. Categorical variables are variables that have a limited number of possible values. For example, the variable `self_employed` in our dataset has two possible values: `Yes` and `No`. The variable `area` has three possible values: `Urban`, `Rural`, and `Semiurban`.

Most machine learning algorithms, including those in scikit-learn, expect numerical input and cannot work with categorical variables directly. For the computer, the values `Yes` and `No` are just strings. Therefore, we need to convert categorical variables into numerical variables. This process is called **encoding**.

There are two main ways to encode categorical variables: **label encoding** and **one-hot encoding**.

## Label Encoding

Label encoding is the process of converting each value of a categorical variable into a number. For example, we can convert the values `Yes` and `No` into the numbers `1` and `0`.

Label encoding is suitable for ordinal variables where there is a meaningful order or hierarchy among the categories. For instance, consider a variable `size` with categories `Small`, `Medium`, and `Large`. We can convert these categories into the numbers `0`, `1`, and `2` because there is a meaningful order among them.

However, label encoding is not suitable for nominal variables where there is no meaningful order among the categories. For instance, consider a variable `colour` with categories `Red`, `Green`, and `Blue`. Converting these categories into the numbers `0`, `1`, and `2` would suggest an order that does not exist, because we cannot say that one colour is greater than another.
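
For an ordinal variable, the numbers should follow the meaningful order rather than the alphabetical one. The sketch below uses scikit-learn's `OrdinalEncoder`, a class this tutorial does not otherwise cover, with an explicit category order for the hypothetical `size` variable from the text.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal column taken from the example in the text
sizes = pd.DataFrame({"size": ["Small", "Large", "Medium", "Small"]})

# Pass the order explicitly; otherwise the categories are sorted,
# which for strings means alphabetical order (Large, Medium, Small)
encoder = OrdinalEncoder(categories=[["Small", "Medium", "Large"]])
encoded = encoder.fit_transform(sizes)

print(encoded.ravel())  # Small -> 0, Medium -> 1, Large -> 2
```

The result is a 2D float array with one column per encoded variable.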

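**For our dataset, label encoding is not suitable because our categorical variables are nominal variables**. Nevertheless, let's see how to do label encoding in Python using the `LabelEncoder` class from the `sklearn.preprocessing` module. Note that scikit-learn documents `LabelEncoder` as a tool for encoding the target `y`; we apply it to feature columns here only for illustration.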

In [20]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
area = le.fit_transform(df["area"])
display(area)

array([2, 0, 2, 2, 2, 2, 2, 1, 2, 1, 2, 2, 0, 2, 2, 2, 0, 2, 2, 1, 1, 1,
       2, 2, 2, 0, 1, 0, 2, 1, 2, 1, 2, 2, 1, 2, 2, 2, 1, 1, 1, 1, 2, 2,
       1, 1, 0, 2, 2, 2, 2, 0, 0, 1, 1, 2, 2, 2, 1, 2, 1, 1, 1, 2, 2, 2,
       1, 1, 1, 2, 2, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 2,
       2, 2, 1, 2, 1, 0, 1, 0, 2, 1, 1, 1, 0, 0, 2, 2, 1, 1, 1, 1, 0, 2,
       1, 0, 0, 2, 1, 1, 2, 1, 2, 2, 0, 1, 0, 0, 2, 0, 2, 1, 2, 1, 1, 2,
       1, 0, 2, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 2, 2, 1, 1, 1, 1, 0, 0,
       0, 1, 2, 1, 0, 1, 0, 2, 1, 1, 2, 2, 1, 1, 2, 0, 2, 1, 1, 1, 2, 0,
       2, 1, 0, 1, 2, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 2, 2, 0, 1,
       2, 2, 2, 1, 2, 1, 2, 0, 1, 2, 0, 0, 2, 0, 1, 1, 0, 1, 0, 1, 2, 2,
       2, 2, 0, 1, 1, 1, 1, 2, 1, 2, 1, 2, 2, 0, 0, 1, 0, 1, 0, 0, 1, 2,
       1, 1, 2, 0, 2, 2, 0, 2, 0, 2, 0, 2, 0, 1, 1, 0, 2, 1, 0, 1, 1, 0,
       0, 0, 0, 1, 2, 2, 2, 1, 0, 2, 1, 0, 0, 2, 1, 1, 2, 0, 1, 0, 0, 0,
       1, 0, 2, 2, 1, 1, 1, 2, 0, 0, 1, 1, 0, 1, 1,

`LabelEncoder` has two main methods: `fit` and `transform`. The `fit` method learns the mapping between the categories and the numbers. The `transform` method applies the mapping to the categories and converts them into numbers.

`fit_transform` is a shortcut that combines `fit` and `transform` into a single method.
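
To make the two steps concrete, the sketch below calls `fit` and `transform` separately on a short made-up list of areas (the list is an assumption for illustration). It also shows the `classes_` attribute, which holds the learned mapping, and `inverse_transform`, which maps the numbers back to the categories.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of area values
areas = ["Urban", "Rural", "Urban", "Semiurban"]

le_demo = LabelEncoder()
le_demo.fit(areas)  # learn the mapping: categories are sorted, then numbered from 0

# The position of each category in classes_ is its numeric code
print(le_demo.classes_)

codes = le_demo.transform(areas)  # apply the learned mapping
print(codes)

restored = le_demo.inverse_transform(codes)  # map the codes back to the categories
print(restored)
```

Because the categories are sorted alphabetically, `Rural` becomes 0, `Semiurban` becomes 1, and `Urban` becomes 2, which matches the array shown above.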

### Task: encode the variables `married`, `education`, and `self_employed` using label encoding.

Challenge: Try to encode all of them in a single line of code.

In [21]:
df[["married", "education", "self_employed"]].apply(le.fit_transform)

Unnamed: 0,married,education,self_employed
0,0,0,0
1,1,0,0
2,1,0,1
3,1,1,0
4,0,0,0
...,...,...,...
609,0,0,0
610,1,0,0
611,1,0,0
612,1,0,0


## One-Hot Encoding

One-hot encoding is the process of converting each value of a categorical variable into a vector of zeros and ones. For example, we can convert the values `Yes` and `No` into the vectors `[1, 0]` and `[0, 1]`.

One-hot encoding represents each category as a separate binary feature (or column). Each feature corresponds to a specific category and takes a value of 1 if the observation belongs to that category and 0 otherwise. It is suitable for nominal variables where there is no order or hierarchy among the categories.

We can use one-hot encoding for all categorical variables in our dataset. Let's see how we can do one-hot encoding in Python using the `OneHotEncoder` class from the `sklearn.preprocessing` module.

In [22]:
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder()
area = ohe.fit_transform(df[["area"]])
display(area)
display(area.toarray())

<499x3 sparse matrix of type '<class 'numpy.float64'>'
	with 499 stored elements in Compressed Sparse Row format>

array([[0., 0., 1.],
       [1., 0., 0.],
       [0., 0., 1.],
       ...,
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 1., 0.]])

`OneHotEncoder` returns a sparse matrix by default. A sparse matrix stores only the non-zero entries, which saves memory when most of the values are zeros. However, sparse matrices are not very easy to read.

To visualize a sparse matrix, we can convert it into a regular matrix using the `toarray` method.

The sparse format is useful when dealing with large datasets that contain a lot of zeros. For our dataset, the sparse format is not necessary.

### Task: encode all the categorical variables in our dataset using one-hot encoding.

The function `get_dummies` from pandas can also be used to do one-hot encoding. It returns a regular pandas `DataFrame` rather than a sparse matrix.

In [23]:
pd.get_dummies(df["area"])

Unnamed: 0,Rural,Semiurban,Urban
0,False,False,True
1,True,False,False
2,False,False,True
3,False,False,True
4,False,False,True
...,...,...,...
609,True,False,False
610,True,False,False
611,False,False,True
612,False,False,True


It can also be used to do one-hot encoding of the entire dataset at once.

In [24]:
categorical_variables = ["gender", "dependents", "married", "education", "self_employed", "area"]
pd.get_dummies(df, columns=categorical_variables)

Unnamed: 0,applicant_income,coapplicant_income,loan_amount,term,credit_history,loan_given,gender_Female,gender_Male,dependents_0,dependents_1,...,dependents_3+,married_No,married_Yes,education_Graduate,education_Not Graduate,self_employed_No,self_employed_Yes,area_Rural,area_Semiurban,area_Urban
0,584900,0.0,15000000,360.0,1.0,Y,False,True,True,False,...,False,True,False,True,False,True,False,False,False,True
1,458300,150800.0,12800000,360.0,1.0,N,False,True,False,True,...,False,False,True,True,False,True,False,True,False,False
2,300000,0.0,6600000,360.0,1.0,Y,False,True,True,False,...,False,False,True,True,False,False,True,False,False,True
3,258300,235800.0,12000000,360.0,1.0,Y,False,True,True,False,...,False,False,True,False,True,True,False,False,False,True
4,600000,0.0,14100000,360.0,1.0,Y,False,True,True,False,...,False,True,False,True,False,True,False,False,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
609,290000,0.0,7100000,360.0,1.0,Y,True,False,True,False,...,False,True,False,True,False,True,False,True,False,False
610,410600,0.0,4000000,180.0,1.0,Y,False,True,False,False,...,True,False,True,True,False,True,False,True,False,False
611,807200,24000.0,25300000,360.0,1.0,Y,False,True,False,True,...,False,False,True,True,False,True,False,False,False,True
612,758300,0.0,18700000,360.0,1.0,Y,False,True,False,False,...,False,False,True,True,False,True,False,False,False,True


## Binary Encoding

Do we need to encode binary variables? For instance, the column `self_employed` has two possible values: `Yes` and `No`. Can we just use them as they are?

For features, the answer is no*. scikit-learn models generally expect a numeric feature matrix, so passing the strings `Yes` and `No` as feature values typically raises an error. A binary variable still needs to be encoded, but a single column of 0s and 1s is enough. One-hot encoding also works and produces two columns, one per value. Its advantage is that it extends naturally to new categories. For example, if a third value `Maybe` appeared in the column `self_employed`, one-hot encoding would simply produce a third column.

\* _The target is different: scikit-learn classifiers accept string class labels such as `Y` and `N`, so encoding the target is optional. For features, it is safest to always encode binary variables._


In [25]:
pd.get_dummies(df, columns=["gender"])

Unnamed: 0,married,dependents,education,self_employed,applicant_income,coapplicant_income,loan_amount,term,credit_history,area,loan_given,gender_Female,gender_Male
0,No,0,Graduate,No,584900,0.0,15000000,360.0,1.0,Urban,Y,False,True
1,Yes,1,Graduate,No,458300,150800.0,12800000,360.0,1.0,Rural,N,False,True
2,Yes,0,Graduate,Yes,300000,0.0,6600000,360.0,1.0,Urban,Y,False,True
3,Yes,0,Not Graduate,No,258300,235800.0,12000000,360.0,1.0,Urban,Y,False,True
4,No,0,Graduate,No,600000,0.0,14100000,360.0,1.0,Urban,Y,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...
609,No,0,Graduate,No,290000,0.0,7100000,360.0,1.0,Rural,Y,True,False
610,Yes,3+,Graduate,No,410600,0.0,4000000,180.0,1.0,Rural,Y,False,True
611,Yes,1,Graduate,No,807200,24000.0,25300000,360.0,1.0,Urban,Y,False,True
612,Yes,2,Graduate,No,758300,0.0,18700000,360.0,1.0,Urban,Y,False,True
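
As a rough check of how a model reacts to raw strings, the sketch below fits a decision tree on a made-up binary string feature and catches the resulting error. It then encodes the feature as a single 0/1 column with `drop_first=True`. The `toy` frame and `target` list are assumptions for illustration, and the exact error message may differ between scikit-learn versions.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data with one binary string feature
toy = pd.DataFrame({"self_employed": ["Yes", "No", "No", "Yes"]})
target = [1, 0, 0, 1]

# Raw strings as feature values: the fit is expected to fail
try:
    DecisionTreeClassifier().fit(toy, target)
    raw_strings_accepted = True
except ValueError as error:
    raw_strings_accepted = False
    print("Fit failed:", error)

# drop_first=True keeps a single column for a two-valued variable;
# dtype=int gives 0/1 instead of False/True
encoded = pd.get_dummies(toy, columns=["self_employed"], drop_first=True, dtype=int)
print(encoded)

# The numeric column is accepted
model = DecisionTreeClassifier().fit(encoded, target)
```

The single column `self_employed_Yes` carries the same information as the pair `self_employed_No` and `self_employed_Yes`.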


## Combining Numerical and Categorical Variables

Once we encode all the categorical variables, we can combine them with the numerical variables to create a single matrix. This matrix will be our feature matrix.

Let's see how to do that in Python. Let's use `pd.get_dummies` because it is much easier to use than `OneHotEncoder`.

In [26]:
df_encoded = pd.get_dummies(df, columns=categorical_variables)
X = df_encoded.drop("loan_given", axis=1).values
# We do not need to encode the target variable, but it makes evaluation easier later
y = df_encoded["loan_given"].apply(lambda x: 1 if x == "Y" else 0).values
print(X.shape)
display(df_encoded.head())

(499, 20)


Unnamed: 0,applicant_income,coapplicant_income,loan_amount,term,credit_history,loan_given,gender_Female,gender_Male,dependents_0,dependents_1,...,dependents_3+,married_No,married_Yes,education_Graduate,education_Not Graduate,self_employed_No,self_employed_Yes,area_Rural,area_Semiurban,area_Urban
0,584900,0.0,15000000,360.0,1.0,Y,False,True,True,False,...,False,True,False,True,False,True,False,False,False,True
1,458300,150800.0,12800000,360.0,1.0,N,False,True,False,True,...,False,False,True,True,False,True,False,True,False,False
2,300000,0.0,6600000,360.0,1.0,Y,False,True,True,False,...,False,False,True,True,False,False,True,False,False,True
3,258300,235800.0,12000000,360.0,1.0,Y,False,True,True,False,...,False,False,True,False,True,True,False,False,False,True
4,600000,0.0,14100000,360.0,1.0,Y,False,True,True,False,...,False,True,False,True,False,True,False,False,False,True


### Task: train a decision tree classifier to predict whether a loan will be given to an applicant. Split the data into training and test sets. Use a classification report to evaluate the model.

# Cross-Validation

In the previous tutorial, we split the data into training and test sets. We trained the model on the training set and evaluated it on the test set. This approach is called **holdout validation**.

Holdout validation is a good approach, but it has one major drawback: it only uses part of the data for training. In our case, we used 80% of the data for training and 20% for testing. This means that we are not using 20% of the data for training. Moreover, the random split might have put most of the difficult examples in the training set and most of the easy examples in the test set. This would make the model look better than it actually is.

To solve this problem, we can use **cross-validation**. Cross-validation is a technique that allows us to use all the data for training and testing. It also allows us to evaluate the model on multiple test sets instead of just one test set.

There are many types of cross-validation. In this tutorial, we will use **k-fold cross-validation**. In k-fold cross-validation, we split the data into `k` folds (also called splits). Then, we train the model `k` times. Each time, we use a different fold for testing and the remaining folds for training. Finally, we average the results of the k models to get the final result.
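
The procedure can be sketched by hand with the `KFold` class before turning to a helper function. The example below uses synthetic data from `make_classification` rather than the loans dataset, and the names `X_demo`, `y_demo`, and `fold_scores` are made up for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the loans dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []

for train_idx, test_idx in kfold.split(X_demo):
    # Train a fresh model on four folds
    model = DecisionTreeClassifier(random_state=0)
    model.fit(X_demo[train_idx], y_demo[train_idx])
    # Evaluate it on the remaining fold
    predictions = model.predict(X_demo[test_idx])
    fold_scores.append(f1_score(y_demo[test_idx], predictions))

# Average the per-fold results to get the final estimate
print(f"Mean F1: {np.mean(fold_scores):.3f}, std: {np.std(fold_scores):.3f}")
```

Every sample ends up in the test fold exactly once. When an integer such as `cv=5` is passed for a classifier, scikit-learn's helpers use stratified folds, which also keep the class proportions similar in each fold.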

The most common choices are `k = 5` and `k = 10`. Let's see how to do it in Python using the `cross_validate` function from the `sklearn.model_selection` module.

In [27]:
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Create a Decision Tree classifier
clf = DecisionTreeClassifier()
# Define the scoring metrics
metrics = ["accuracy", "precision", "recall", "f1"]
# Perform 5-fold cross-validation
cv_scores = cross_validate(clf, X, y, cv=5, scoring=metrics)
# Print the cross-validation scores
print("Cross-Validation Scores:", cv_scores)

Cross-Validation Scores: {'fit_time': array([0.00645256, 0.00422764, 0.00407076, 0.00340724, 0.00397491]), 'score_time': array([0.00836945, 0.01041698, 0.00835848, 0.00801325, 0.00623274]), 'test_accuracy': array([0.71      , 0.72      , 0.71      , 0.73      , 0.73737374]), 'test_precision': array([0.78571429, 0.75      , 0.76712329, 0.84745763, 0.828125  ]), 'test_recall': array([0.79710145, 0.88235294, 0.82352941, 0.73529412, 0.77941176]), 'test_f1': array([0.79136691, 0.81081081, 0.79432624, 0.78740157, 0.8030303 ])}


`cross_validate` takes as input a model, a feature matrix, a target vector, the number of folds, and, optionally, a scoring function (metrics to evaluate the model). It returns a dictionary containing the scores of the model on each fold.

Have a look at the dictionary returned by `cross_validate` to understand what is in there.

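The most important things to look at are the mean and the standard deviation of the scores. The mean tells us how good the model is on average. The standard deviation tells us how consistent the model is. If the standard deviation is high, it means that the model performs very differently on different folds. Note that a `DecisionTreeClassifier` created without `random_state` is not fully deterministic, so the scores can change slightly from one run to the next.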

In [32]:
for metric in metrics:
    metric_key = f"test_{metric}"
    print(f"Mean {metric} : {cv_scores[metric_key].mean():.3f}, std: {cv_scores[metric_key].std():.3f}")

Mean accuracy : 0.709, std: 0.013
Mean precision : 0.785, std: 0.028
Mean recall : 0.795, std: 0.026
Mean f1 : 0.789, std: 0.005


## Getting the Models

By default, the `cross_validate` function only returns the scores of the model. If we want to get the models themselves, we can set the parameter `return_estimator` to `True`.

In this case, the returned dictionary will include a key `estimator` that contains the list of models -- one for each fold. 

We can use these models to make predictions on new data.

In [33]:
cv_scores = cross_validate(clf, X, y, cv=5, scoring=metrics, return_estimator=True)
models = cv_scores["estimator"]
print(models)

[DecisionTreeClassifier(), DecisionTreeClassifier(), DecisionTreeClassifier(), DecisionTreeClassifier(), DecisionTreeClassifier()]


## Getting the Model with the Best Score

We can also get the model with the best performance according to a specific metric.

Let's get the model with the highest F1 score.

In [34]:
import numpy as np

# Get the scores for the specified metric
scores = cv_scores["test_f1"]
# Find the index of the model with the best performance
best_model_index = np.argmax(scores)
# Get the best model
best_model = cv_scores["estimator"][best_model_index]
print(best_model)

DecisionTreeClassifier()


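We can now use it to make predictions on new data. This notebook does not define a separate test set, so the first five rows of `X` stand in for new data below.

In [35]:
# `X_test` is not defined in this notebook; reuse a few rows of X as a stand-in for new data
best_model.predict(X[:5])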

### Task: train a logistic regression model to predict whether a loan will be given to an applicant. Use 5-fold cross-validation to evaluate the model.

Use F1 score as the evaluation metric. Print the mean and standard deviation of the F1 scores. Also print the highest F1 score.

In [36]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
cv_scores = cross_validate(clf, X, y, cv=5, scoring="f1")
print("Cross-Validation Scores:", cv_scores)
print(f"Mean F1: {cv_scores['test_score'].mean():.3f}, std: {cv_scores['test_score'].std():.3f}, max: {cv_scores['test_score'].max():.3f}")

Cross-Validation Scores: {'fit_time': array([0.04668546, 0.01450491, 0.00780964, 0.01877356, 0.01489949]), 'score_time': array([0.00360179, 0.00300455, 0.        , 0.        , 0.00370669]), 'test_score': array([0.79268293, 0.8       , 0.80239521, 0.80952381, 0.81927711])}
Mean F1: 0.805, std: 0.009, max: 0.819


# Hyperparameter Tuning

So far, we have used the default hyperparameters of the models. Hyperparameters are parameters that are not learned during training. They are set before training and remain constant during training. For example, the maximum depth of a decision tree is a hyperparameter. Hyperparameters can have a big impact on the performance of the model.

Each model has its own set of hyperparameters. To check the hyperparameters of a model, we can check the documentation of the corresponding sklearn class. For example, the documentation of the `DecisionTreeClassifier` class can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html).

We will learn how to optimise the hyperparameters of a model using grid search. Grid search is a technique that allows us to find the best combination of hyperparameters among the values we specify. It works by trying all combinations of those values and selecting the best one.

Let's see how to do that in Python using the `GridSearchCV` class from the `sklearn.model_selection` module.

In [37]:
from sklearn.model_selection import GridSearchCV

# Define the model
clf = DecisionTreeClassifier()
# Define the parameter grid
param_grid = {"max_depth": [1, 3, 5, 7, 9]}
# Perform grid search with cross-validation
grid_search = GridSearchCV(clf, param_grid, cv=5, scoring="f1")
# Fit the grid search to the data
grid_search.fit(X, y)
# Print the best parameter and best score
print("Best Parameter:", grid_search.best_params_)
print("Best Score:", grid_search.best_score_)


Best Parameter: {'max_depth': 1}
Best Score: 0.8723105935438292


We first define the model we are going to optimise. Then, we define the hyperparameters we want to optimise. We put the hyperparameters in a dictionary. The keys of the dictionary are the names of the hyperparameters and the values are the values we want to try. In this case, we want to try 5 values for the maximum depth of the decision tree: 1, 3, 5, 7, and 9.

The `GridSearchCV` class takes as input the model, the hyperparameters, the number of folds, and the scoring function. Then, it performs cross-validation for each model considering all possible combinations of hyperparameters. 
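
Conceptually, grid search is a loop over all combinations with a cross-validation run inside. The sketch below spells this out with `ParameterGrid` and `cross_val_score` on synthetic data. It is a simplified illustration of the idea, not how `GridSearchCV` is implemented internally, and the names `X_demo`, `y_demo`, and `results` are made up.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import ParameterGrid, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the loans dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# ParameterGrid expands the dictionary into every combination of values
grid = ParameterGrid({"max_depth": [1, 3, 5], "min_samples_leaf": [1, 5]})

results = []
for params in grid:
    model = DecisionTreeClassifier(random_state=0, **params)
    # Mean F1 over 5 folds for this combination
    mean_f1 = cross_val_score(model, X_demo, y_demo, cv=5, scoring="f1").mean()
    results.append((mean_f1, params))

# Keep the combination with the highest mean score
best_score, best_params = max(results, key=lambda item: item[0])
print("Combinations tried:", len(results))
print("Best parameters:", best_params, "with mean F1", round(best_score, 3))
```

With 3 values for one hyperparameter and 2 for the other, the loop evaluates 6 combinations, each with 5 model fits.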

We can then get the best hyperparameters using the `best_params_` attribute, the best score using `best_score_`, and also the best model using `best_estimator_`.
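
Beyond the single best result, the fitted search object also stores the scores of every combination in `cv_results_`. The sketch below, on synthetic data with made-up names, puts a few of those columns into a DataFrame and uses `best_estimator_` for predictions. It relies on the default `refit=True`, which retrains the best configuration on all the data passed to `fit`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the loans dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [1, 3, 5]},
    cv=5,
    scoring="f1",
)
search.fit(X_demo, y_demo)

# One row per combination, with the mean and spread of the fold scores
results_table = pd.DataFrame(search.cv_results_)[
    ["param_max_depth", "mean_test_score", "std_test_score", "rank_test_score"]
]
print(results_table)

# The best model is already refitted and ready to predict
best_tree = search.best_estimator_
predictions = best_tree.predict(X_demo[:5])
print(predictions)
```

Looking at the whole table shows whether the best setting wins clearly or only by a margin smaller than the standard deviation across folds.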

We can optimise multiple hyperparameters at the same time. Let's optimise the maximum depth and the minimum number of samples required at a leaf node (`min_samples_leaf`).

In [38]:
# Define the model
clf = DecisionTreeClassifier()
# Define the parameter grid
param_grid = {"max_depth": [1, 3, 5, 7, 9], "min_samples_leaf": [1, 5, 10]}
# Perform grid search with cross-validation
grid_search = GridSearchCV(clf, param_grid, cv=5, scoring="f1")
# Fit the grid search to the data
grid_search.fit(X, y)
# Print the best parameter and best score
print("Best Parameter:", grid_search.best_params_)
print("Best Score:", grid_search.best_score_)


Best Parameter: {'max_depth': 1, 'min_samples_leaf': 1}
Best Score: 0.8723105935438292


In this case, we have 5 values for the maximum depth and 3 values for the minimum number of samples per leaf. This means that grid search tries 5 × 3 = 15 combinations of hyperparameters, each evaluated with 5-fold cross-validation.

The best score did not change. For the values we tried, `min_samples_leaf` made no difference to the best model, so the default value is good enough here.

Can we get a better score?

### Task: select at least one more hyperparameter and optimise it. Print the best hyperparameters and the best score.

Try to beat the current best score! (0.872)

### Task: now optimise the hyperparameters of a logistic regression model. Print the best hyperparameters and the best score.

Can you outperform the decision tree classifier?

### Task: compare the best decision tree classifier and the best logistic regression model. Use a classification report and a confusion matrix to evaluate the models.

What are the main differences?

There are other ways of optimising hyperparameters. In sklearn, you can also use [RandomizedSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html), which samples a fixed number of combinations instead of trying all of them. Have a look at the documentation to learn more about it. [This tutorial](https://scikit-learn.org/stable/modules/grid_search.html) provides a nice overview of the different methods.
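
As a starting point, a randomized search could look like the sketch below. It runs on synthetic data, and the parameter ranges and `n_iter=10` are arbitrary choices for illustration, not recommendations. The distributions come from `scipy.stats`, which is installed alongside scikit-learn.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the loans dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Distributions to sample from: randint(low, high) draws integers in [low, high)
param_distributions = {
    "max_depth": randint(1, 10),
    "min_samples_leaf": randint(1, 11),
}

random_search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions,
    n_iter=10,       # number of sampled combinations
    cv=5,
    scoring="f1",
    random_state=0,  # makes the sampled combinations reproducible
)
random_search.fit(X_demo, y_demo)

print("Best Parameter:", random_search.best_params_)
print("Best Score:", random_search.best_score_)
```

The cost is controlled by `n_iter` rather than by the size of the grid, which helps when there are many hyperparameters or wide ranges.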

# Feature Selection

We have just learned how to optimise the hyperparameters of a model. We can also optimise the features used to train the model. This is called **feature selection**. Feature selection is the process of selecting the most important features for a model. It is useful because it can reduce complexity and improve performance.

There are [many methods](https://scikit-learn.org/stable/modules/feature_selection.html) for feature selection. In this tutorial, we will learn about `SelectKBest`. `SelectKBest` is a method that selects the `k` best features according to a specific metric.

Let's see how it works in sklearn.

In [39]:
from sklearn.feature_selection import SelectKBest, chi2

# Number of features to select
k = 5
selector = SelectKBest(k=k, score_func=chi2)
X_selected = selector.fit_transform(X, y)
# Get the selected feature indices
feature_indices = selector.get_support(indices=True)
# Print the selected features
selected_features = df_encoded.drop("loan_given", axis=1).columns[feature_indices]
print("Selected Features:", selected_features)

Selected Features: Index(['applicant_income', 'coapplicant_income', 'loan_amount',
       'credit_history', 'area_Semiurban'],
      dtype='object')


In [40]:
selector.scores_

array([5.85157485e+05, 3.83022278e+05, 2.84823158e+06, 1.03759099e-01,
       2.06423262e+01, 1.38548898e+00, 2.96649708e-01, 7.01438670e-02,
       3.17649089e-01, 1.80573704e+00, 2.06099137e-01, 3.54696473e+00,
       1.91579885e+00, 7.61999520e-01, 2.89413279e+00, 4.97884540e-02,
       3.10275873e-01, 3.91869869e+00, 8.32637164e+00, 1.70340585e+00])

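We are using `chi2` as the scoring function. `chi2` (chi-squared) is a statistical test that measures the dependence between two variables. In this case, it measures the dependence between each feature and the target. The higher the score, the stronger the estimated dependence between the feature and the target. Note that `chi2` requires non-negative feature values and that its scores grow with the scale of a feature. This is why the income and loan amount columns, with values in the hundreds of thousands or millions, receive by far the largest scores here. Scores are easiest to compare between features measured on similar scales.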

We can check the scores of each feature using the `scores_` attribute.

In [41]:
print(selector.scores_)

[5.85157485e+05 3.83022278e+05 2.84823158e+06 1.03759099e-01
 2.06423262e+01 1.38548898e+00 2.96649708e-01 7.01438670e-02
 3.17649089e-01 1.80573704e+00 2.06099137e-01 3.54696473e+00
 1.91579885e+00 7.61999520e-01 2.89413279e+00 4.97884540e-02
 3.10275873e-01 3.91869869e+00 8.32637164e+00 1.70340585e+00]
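
A bare array of scores is hard to interpret, and `chi2` scores depend on the units of each feature. The sketch below, on made-up data, pairs the scores with the column names and then multiplies one column by 1000 to show that its score grows by the same factor. The `toy` frame, its columns, and the random values are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import chi2

rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=300)

# Hypothetical features: a 0/1 flag related to the target and an unrelated income column
toy = pd.DataFrame({
    "flag": (rng.random(300) < 0.3 + 0.4 * y_demo).astype(int),
    "income": rng.integers(1_000, 5_000, size=300),
})

# chi2 returns one score and one p-value per feature
scores, p_values = chi2(toy, y_demo)

# Attach the column names and sort from highest to lowest score
ranking = pd.Series(scores, index=toy.columns).sort_values(ascending=False)
print(ranking)

# Same data, but income expressed in units that are 1000 times smaller
rescaled = toy.assign(income=toy["income"] * 1000)
rescaled_scores, _ = chi2(rescaled, y_demo)
print(pd.Series(rescaled_scores, index=toy.columns))
```

Only the unit of `income` changes between the two calls, yet its score is multiplied by 1000 while the score of `flag` stays the same. Bringing features to a comparable scale before scoring, or using a different scoring function, avoids this effect.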


### Task: select the 5 best features and train a decision tree classifier using these features. Print the classification report and the confusion matrix.

### Task: do the same thing with a logistic regression model. Which model performs better with fewer features?