
## **Perform machine learning on something close to real data**

In this notebook I apply machine learning to the Home Credit dataset from Kaggle.

### Problem 1: Confirmation of competition contents

*  **What to learn:** The transaction information of the clients
*  **What to predict:** The repayment ability of each client
*  **Submission file:** For each `SK_ID_CURR` in the test set, you must predict a probability for the `TARGET` variable. The file should contain a header and have the following format:
```
SK_ID_CURR,TARGET
100001,0.1
100005,0.9
100013,0.2
etc.
```
*  **Evaluation metric:** Submissions are evaluated on the area under the ROC curve between the predicted probability and the observed target.
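
The evaluation metric above depends only on how the predictions rank the clients, not on the absolute probability values. A minimal sketch with made-up labels and probabilities (not taken from the Home Credit data), assuming scikit-learn's `roc_auc_score`:

```python
from sklearn.metrics import roc_auc_score

# Toy ground truth and predicted probabilities (illustrative values only)
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]

# 3 of the 4 (positive, negative) pairs are ranked correctly
auc = roc_auc_score(y_true, y_prob)
print(auc)  # 0.75

# Halving every probability keeps the ranking, so the AUC is unchanged
auc_rescaled = roc_auc_score(y_true, [p / 2 for p in y_prob])
print(auc_rescaled)  # 0.75
```

This is why the submission should contain probabilities rather than 0/1 labels: thresholding collapses most of the ranking information.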


### Problem 2: Learning and Verification

In [1]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score

# Loading csv
df = pd.read_csv('application_train.csv')

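# Drop every row that contains at least one missing value
df_clean = df.dropna()

# Names of the categorical (object-typed) columns, used later for count encoding
categorical_features = df_clean.select_dtypes('object').columns.tolist()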

# Separating variables
X = df_clean.drop(columns=['TARGET'])
y = df_clean['TARGET']

In [4]:
!pip install category-encoders



In [5]:
from category_encoders import CountEncoder
# Encoding values
X = CountEncoder(cols=categorical_features).fit_transform(X)

In [5]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Standardizing
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

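# Fit a LightGBM classifier (stored as `reg`, although it is a classifier)
from lightgbm import LGBMClassifier

reg = LGBMClassifier(random_state=5)
reg.fit(X_train_scaled, y_train)

# Predict hard 0/1 class labels
reg_predict = reg.predict(X_test_scaled)

print("ACC: ", accuracy_score(y_true=y_test, y_pred=reg_predict))
# NOTE: this AUC is computed from hard labels, not probabilities;
# reg.predict_proba(X_test_scaled)[:, 1] would score the ranking instead.
print("ROC: ", roc_auc_score(y_test, reg_predict))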

[LightGBM] [Info] Number of positive: 407, number of negative: 6044
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.006073 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 10556
[LightGBM] [Info] Number of data points in the train set: 6451, number of used features: 109
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.063091 -> initscore=-2.698008
[LightGBM] [Info] Start training from score -2.698008
ACC:  0.9432821943282195
ROC:  0.49926181102362205




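**Accuracy** is high, but that is mostly a product of class imbalance: the training log shows 407 positives out of 6,451 rows (about 6.3%), so a model that predicts 0 for every client would already be close to 94% accurate. The ROC AUC of about 0.50, computed here from hard class labels, indicates that the predicted labels barely separate the two classes.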
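
With roughly 6% positives in the training log, accuracy and AUC can tell very different stories. A small sketch with made-up labels and hypothetical probabilities (not the notebook's actual predictions) illustrates how a constant prediction scores, and how thresholding can hide a good ranking:

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Toy labels with the same kind of imbalance: 94 negatives, 6 positives
y_true = np.array([0] * 94 + [1] * 6)

# A "model" that always predicts the majority class
always_zero = np.zeros_like(y_true)
acc = accuracy_score(y_true, always_zero)
auc_labels = roc_auc_score(y_true, always_zero)
print(acc, auc_labels)  # 0.94 0.5

# Hypothetical probabilities: positives score higher, but all stay below 0.5
y_prob = np.where(y_true == 1, 0.30, 0.05)
hard_labels = (y_prob >= 0.5).astype(int)  # every label becomes 0
auc_proba = roc_auc_score(y_true, y_prob)
print(auc_proba)  # 1.0, the ranking itself is perfect
```

In this toy case the hard labels give an AUC of 0.5 while the probabilities give 1.0, so the metric should be fed `predict_proba` output when the model provides it.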

### Problem 3: Estimation for test data

In [6]:
# Load the test CSV
df_test = pd.read_csv('application_test.csv')

# Rows with missing values are kept: the submission needs one prediction per SK_ID_CURR

# Count-encode the categorical columns
# NOTE: the encoder is re-fitted on the test set here; reusing the encoder
# fitted on the training data would keep the encoding consistent.
test_X = CountEncoder(cols=categorical_features).fit_transform(df_test)

# Standardizing
# NOTE: fit_transform re-fits the training scaler on the test data;
# scaler.transform(test_X) would reuse the training statistics.
test_X_test_scaled = scaler.fit_transform(test_X)

# Predict hard 0/1 labels
# NOTE: the competition asks for probabilities, i.e. reg.predict_proba(...)[:, 1]
test_reg_predict = reg.predict(test_X_test_scaled)

kgl_submission = pd.concat([df_test['SK_ID_CURR'], pd.Series(test_reg_predict, name='TARGET')], axis=1)
kgl_submission.to_csv('kggl_submission.csv', index=False)



In [7]:
kgl_submission

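|       | SK_ID_CURR | TARGET |
|-------|------------|--------|
| 0     | 100001     | 0      |
| 1     | 100005     | 0      |
| 2     | 100013     | 0      |
| 3     | 100028     | 0      |
| 4     | 100038     | 0      |
| ...   | ...        | ...    |
| 48739 | 456221     | 0      |
| 48740 | 456222     | 0      |
| 48741 | 456223     | 0      |
| 48742 | 456224     | 0      |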
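
Since the competition asks for probabilities, one way to build the submission is to fit every preprocessing step on the training data only, reuse it on the test data, and write `predict_proba` output. The sketch below is only an illustration under stated assumptions: the toy frames and the `CONTRACT` and `INCOME` columns are hypothetical, a pandas `value_counts` mapping stands in for `CountEncoder`, and `LogisticRegression` stands in for the LightGBM model.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Toy stand-ins for application_train.csv / application_test.csv
train = pd.DataFrame({
    "SK_ID_CURR": [1, 2, 3, 4, 5, 6],
    "CONTRACT": ["cash", "cash", "revolving", "cash", "revolving", "cash"],
    "INCOME": [100.0, 150.0, 80.0, 200.0, 60.0, 120.0],
    "TARGET": [0, 0, 1, 0, 1, 0],
})
test = pd.DataFrame({
    "SK_ID_CURR": [7, 8, 9],
    "CONTRACT": ["cash", "revolving", "other"],
    "INCOME": [110.0, 70.0, 90.0],
})
features = ["CONTRACT", "INCOME"]

# Count encoding learned from the training rows only
counts = train["CONTRACT"].value_counts()

def encode(frame):
    out = frame[features].copy()
    # Categories never seen in training get a count of 0
    out["CONTRACT"] = out["CONTRACT"].map(counts).fillna(0)
    return out

# Fit the scaler and the model on the training data
scaler = StandardScaler().fit(encode(train))
model = LogisticRegression().fit(scaler.transform(encode(train)), train["TARGET"])

# Reuse the fitted steps on the test data and keep the positive-class probability
proba = model.predict_proba(scaler.transform(encode(test)))[:, 1]
submission = pd.DataFrame({"SK_ID_CURR": test["SK_ID_CURR"], "TARGET": proba})
print(submission)
```

Calling `submission.to_csv(..., index=False)` on this frame would produce the header and two-column layout described in Problem 1.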


### Problem 4: Feature engineering to improve on the baseline model

In [2]:
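# Drop every row that contains a missing value
# (the imputers in the following cells therefore receive complete data)
clean_df = df.dropna()
# Separate features and target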
X4 = clean_df.drop(columns=['TARGET'])
y4 = clean_df['TARGET']

In [9]:
# imputation
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score

# Pattern 1: mean imputation on the numerical columns only
imp_mean = SimpleImputer(strategy='mean')
# select only numerical columns
X_numerical = X4.select_dtypes(include=np.number)
# apply the imputer to the numerical data
imp_X_numerical = imp_mean.fit_transform(X_numerical)

# splitting the data into training and test
X_train1, X_test1, y_train1, y_test1 = train_test_split(imp_X_numerical, y4, test_size=0.25, random_state=42)

# standardizing
scaler = StandardScaler()
scaler.fit(X_train1)
X_train_scaled1 = scaler.transform(X_train1)
X_test_scaled1 = scaler.transform(X_test1)

# fitting
from lightgbm import LGBMClassifier
lgbm = LGBMClassifier(random_state=5)
lgb = lgbm.fit(X_train_scaled1, y_train1)

# predicting
reg_pred1 = lgb.predict(X_test_scaled1)

print("Accuracy: ", accuracy_score(y_test1, reg_pred1))

[LightGBM] [Info] Number of positive: 407, number of negative: 6044
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.003135 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 10422
[LightGBM] [Info] Number of data points in the train set: 6451, number of used features: 94
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.063091 -> initscore=-2.698008
[LightGBM] [Info] Start training from score -2.698008
Accuracy:  0.9428172942817294
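
For reference, this is what mean imputation does on data that actually contains gaps. A minimal sketch on a toy array (illustrative values only), where the imputer is fitted on the training part and then reused on new rows:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy training matrix with one missing value in the second column
X_train_toy = np.array([[1.0, np.nan],
                        [3.0, 4.0],
                        [5.0, 6.0]])

imp = SimpleImputer(strategy="mean")
X_train_filled = imp.fit_transform(X_train_toy)
print(X_train_filled[0, 1])  # 5.0, the mean of the observed values 4.0 and 6.0

# New rows are filled with the column means learned from the training data
X_new = np.array([[np.nan, np.nan]])
X_new_filled = imp.transform(X_new)
print(X_new_filled)  # [[3. 5.]]
```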




In [12]:
# Pattern 2: impute numerical and categorical columns separately, then one-hot encode the categorical ones
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

# separate numerical and categorical
X_numerical2 = X4.select_dtypes(include=np.number)
X_categorical2 = X4.select_dtypes(exclude=np.number)

# impute numerical, median strategy
imp_median   = SimpleImputer(strategy='median')
imp_X_numerical2 = imp_median.fit_transform(X_numerical2)

# impute categorical, using the most frequent strategy
imp_most_freq = SimpleImputer(strategy='most_frequent')
imp_X_categorical2 = imp_most_freq.fit_transform(X_categorical2)

# One hot encoding
enc1 = OneHotEncoder(handle_unknown='ignore', sparse_output=False)  # Use sparse_output=false for dense array output
enc_imp_X_categorical = enc1.fit_transform(imp_X_categorical2)

# Combine the imputed numerical data and the one-hot encoded categorical data
imp_X1 = np.hstack((imp_X_numerical2, enc_imp_X_categorical))

# splitting the data into training and test
X_train2, X_test2, y_train2, y_test2 = train_test_split(imp_X1, y4, test_size=0.25, random_state=42)

# standardizing
scaler = StandardScaler()
scaler.fit(X_train2)
X_train_scaled2 = scaler.transform(X_train2)
X_test_scaled2 = scaler.transform(X_test2)

# fitting
from lightgbm import LGBMClassifier
lgbm1 = LGBMClassifier(random_state=5)
lgb1 = lgbm1.fit(X_train_scaled2, y_train2)

# predicting
reg_pred2 = lgb1.predict(X_test_scaled2)

print("Accuracy: ", accuracy_score(y_test2, reg_pred2))

[LightGBM] [Info] Number of positive: 407, number of negative: 6044
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.008371 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 10749
[LightGBM] [Info] Number of data points in the train set: 6451, number of used features: 203
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.063091 -> initscore=-2.698008
[LightGBM] [Info] Start training from score -2.698008
Accuracy:  0.9442119944211994




In [6]:
# Pattern 3: most-frequent imputation on all columns, then one-hot encoding.
# The encoder output is kept sparse because converting it with toarray()
# needed more RAM than was available.

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
imp_mf = SimpleImputer(strategy='most_frequent')

# Fill missing values in every column with that column's most frequent value
imp_X2 = imp_mf.fit_transform(X4)

# One hot encoding
from sklearn.preprocessing import OneHotEncoder
enc2 = OneHotEncoder(handle_unknown='ignore', sparse_output=True)
enc_imp_X_2 = enc2.fit_transform(imp_X2)

# splitting the data into training and test
X_train3, X_test3, y_train3, y_test3 = train_test_split(enc_imp_X_2, y4, test_size=0.25, random_state=42)
# standardizing; with_mean=False because a sparse matrix cannot be centered without densifying it
scaler = StandardScaler(with_mean=False)
scaler.fit(X_train3)
X_train_scaled3 = scaler.transform(X_train3)
X_test_scaled3 = scaler.transform(X_test3)
# fitting
from lightgbm import LGBMClassifier
lgbm2 = LGBMClassifier(random_state=5)
lgb2 = lgbm2.fit(X_train_scaled3, y_train3)
# predicting
reg_pred3 = lgb2.predict(X_test_scaled3)

print("Accuracy: ", accuracy_score(y_test3, reg_pred3))

[LightGBM] [Info] Number of positive: 407, number of negative: 6044
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.040690 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3502
[LightGBM] [Info] Number of data points in the train set: 6451, number of used features: 1751
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.063091 -> initscore=-2.698008
[LightGBM] [Info] Start training from score -2.698008
Accuracy:  0.9437470943747094




In [8]:
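# Pattern 4: constant-value imputation on all columns, then one-hot encoding.
# The encoder output is kept sparse because converting it with toarray()
# needed more RAM than was available.

imp_cnst = SimpleImputer(strategy='constant')
# Replace missing values with a constant placeholder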
imp_X3 = imp_cnst.fit_transform(X4)

# One hot encoding
from sklearn.preprocessing import OneHotEncoder
enc3 = OneHotEncoder(handle_unknown='ignore', sparse_output=True)
enc_imp_X_3 = enc3.fit_transform(imp_X3)

# splitting the data into training and test
X_train4, X_test4, y_train4, y_test4 = train_test_split(enc_imp_X_3, y4, test_size=0.25, random_state=42)
# standardizing
scaler = StandardScaler(with_mean=False)
scaler.fit(X_train4)
X_train_scaled4 = scaler.transform(X_train4)
X_test_scaled4 = scaler.transform(X_test4)
# fitting
from lightgbm import LGBMClassifier
lgbm3 = LGBMClassifier(random_state=5)
lgb3 = lgbm3.fit(X_train_scaled4, y_train4)
# predicting
reg_pred4 = lgb3.predict(X_test_scaled4)

print("Accuracy: ", accuracy_score(y_test4, reg_pred4))

[LightGBM] [Info] Number of positive: 407, number of negative: 6044
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.040071 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3502
[LightGBM] [Info] Number of data points in the train set: 6451, number of used features: 1751
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.063091 -> initscore=-2.698008
[LightGBM] [Info] Start training from score -2.698008
Accuracy:  0.9437470943747094




In [13]:
# Using Ridge Classifier
from sklearn.linear_model import RidgeClassifier

model = RidgeClassifier()
model.fit(X_train2, y_train2)
y_pred_rgd = model.predict(X_test2)

print("Accuracy:", accuracy_score(y_test2, y_pred_rgd))

Accuracy: 0.9446768944676894


In [14]:
#using Linear SVC

from sklearn.svm import LinearSVC

model = LinearSVC(max_iter=1000, random_state=5)
model.fit(X_train2, y_train2)
y_pred_svc = model.predict(X_test2)

print("Accuracy:", accuracy_score(y_test2, y_pred_svc))

Accuracy: 0.9446768944676894
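
`RidgeClassifier` and `LinearSVC` have no `predict_proba`, but their `decision_function` scores can be passed to `roc_auc_score`, which matches the competition metric more closely than accuracy. A minimal sketch on synthetic imbalanced data from `make_classification` (an assumption for illustration, not the Home Credit data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic binary problem with about 10% positives
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=10, n_clusters_per_class=1,
    weights=[0.9, 0.1], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, stratify=y_syn, random_state=42
)

models = {
    "ridge": make_pipeline(StandardScaler(), RidgeClassifier()),
    "linear_svc": make_pipeline(StandardScaler(), LinearSVC(max_iter=5000, random_state=5)),
}

aucs = {}
for name, clf in models.items():
    clf.fit(X_tr, y_tr)
    # Continuous scores: larger values mean "more likely positive"
    scores = clf.decision_function(X_te)
    aucs[name] = roc_auc_score(y_te, scores)
    print(name, aucs[name])
```

Comparing models on these scores avoids the situation where two classifiers tie on accuracy simply because both predict the majority class almost everywhere.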


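For the feature engineering, I used imputation and one-hot encoding. From my observations, accuracy stays high and almost constant (about 0.943 to 0.944) across all the `SimpleImputer` strategies I tried, and switching to another model (`RidgeClassifier`, `LinearSVC`) changes it only marginally (0.9447). This is largely expected: `dropna()` had already removed every row with a missing value, so the imputers had nothing to fill, and with only about 6% positives the accuracy is likely dominated by the majority class.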
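
One likely reason the imputation strategies score the same is that the rows with missing values were dropped before the imputers ran. A minimal sketch on a toy frame with hypothetical columns `A` and `B` shows that two strategies give identical results after `dropna()` but differ when the gaps are kept:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with one gap in each column
df_toy = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, 10.0],
    "B": [2.0, 5.0, np.nan, 8.0],
})

# After dropna() only complete rows remain, so there is nothing to impute
complete = df_toy.dropna()
mean_after = SimpleImputer(strategy="mean").fit_transform(complete)
median_after = SimpleImputer(strategy="median").fit_transform(complete)
same_after_drop = np.array_equal(mean_after, median_after)
print(same_after_drop)  # True

# On the full frame the strategies fill column A differently (mean vs median)
mean_full = SimpleImputer(strategy="mean").fit_transform(df_toy)
median_full = SimpleImputer(strategy="median").fit_transform(df_toy)
same_on_full = np.array_equal(mean_full, median_full)
print(same_on_full)  # False
```

Imputing the original `df` instead of the already cleaned frame would let the strategies actually differ, and would also keep far more training rows.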