1. What to Learn and What to Predict
The Home Credit Default Risk competition focuses on predicting the likelihood of a person defaulting on a loan. The goal is to build a machine learning model that can predict the probability of a loan applicant defaulting on their credit obligations.

Objective: You will build a classification model to predict whether a customer will default on their loan based on various features provided in the dataset, such as:

Personal and financial information of the applicant (e.g., age, income, number of children).
Credit history and loan-specific data.
Other features related to the applicant’s behavior and loan application.
2. What Kind of File Should I Create and Submit to Kaggle?
The file that you will submit to Kaggle should contain the predicted probabilities for each applicant's default status (either 1 for default or 0 for no default). The competition specifically requires you to predict the likelihood of default for each applicant in the test set.

Required file format:

The submission file must be a CSV file.
It should contain two columns:
SK_ID_CURR: A unique identifier for each loan applicant (same as the ID column in the dataset).
TARGET: The predicted probability of default for each applicant (a value between 0 and 1).
The file should look like this:

SK_ID_CURR	    TARGET
1000001	        0.02345
1000002	        0.75432
1000003	        0.00123
...	...
3. How Will Submissions Be Evaluated?
The submissions are evaluated based on AUC-ROC (Area Under the Receiver Operating Characteristic Curve) score.

AUC-ROC measures the ability of the model to distinguish between the two classes (i.e., default or non-default). An AUC of 0.5 means the model is performing no better than random guessing, while an AUC of 1.0 means the model is perfect.
In this competition, the evaluation metric is used to assess the model’s performance on the test data, which is not visible to participants until after submission.


In [6]:
import pandas as pd

# Load datasets
train = pd.read_csv('C:/Users/Mercy/Downloads/application_train.csv')
test = pd.read_csv('C:/Users/Mercy/Downloads/application_train.csv')

# Check data
print(train.head())
print(test.head())


   SK_ID_CURR  TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR  \
0      100002       1         Cash loans           M            N   
1      100003       0         Cash loans           F            N   
2      100004       0    Revolving loans           M            Y   
3      100006       0         Cash loans           F            N   
4      100007       0         Cash loans           M            N   

  FLAG_OWN_REALTY  CNT_CHILDREN  AMT_INCOME_TOTAL  AMT_CREDIT  AMT_ANNUITY  \
0               Y             0          202500.0    406597.5      24700.5   
1               N             0          270000.0   1293502.5      35698.5   
2               Y             0           67500.0    135000.0       6750.0   
3               Y             0          135000.0    312682.5      29686.5   
4               Y             0          121500.0    513000.0      21865.5   

   ...  FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21  \
0  ...                 0             

In [8]:
# Separate numeric and non-numeric columns
numeric_cols = train.select_dtypes(include=['number']).columns
categorical_cols = train.select_dtypes(include=['object']).columns

# Fill missing values
train[numeric_cols] = train[numeric_cols].fillna(train[numeric_cols].mean())
train[categorical_cols] = train[categorical_cols].fillna(train[categorical_cols].mode().iloc[0])

test[numeric_cols] = test[numeric_cols].fillna(test[numeric_cols].mean())
test[categorical_cols] = test[categorical_cols].fillna(test[categorical_cols].mode().iloc[0])


In [9]:
print(train.isnull().sum().sum())  # Should return 0
print(test.isnull().sum().sum())  # Should return 0


0
0


In [13]:
# Define features and target
features = [col for col in train.columns if col not in ['TARGET', 'SK_ID_CURR']]  # Exclude target and ID columns
X_train = train[features]
y_train = train['TARGET']


In [14]:
# Prepare test dataset (use the same features as training data)
X_test = test[features]


In [16]:
# Identify categorical columns
categorical_cols = train.select_dtypes(include=['object']).columns
print(categorical_cols)


Index(['NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY',
       'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE',
       'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'OCCUPATION_TYPE',
       'WEEKDAY_APPR_PROCESS_START', 'ORGANIZATION_TYPE', 'FONDKAPREMONT_MODE',
       'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE', 'EMERGENCYSTATE_MODE'],
      dtype='object')


In [17]:
# One-hot encode categorical features
train = pd.get_dummies(train, columns=categorical_cols, drop_first=True)
test = pd.get_dummies(test, columns=categorical_cols, drop_first=True)

# Align train and test datasets (to ensure they have the same columns)
train, test = train.align(test, join='inner', axis=1)


In [18]:
# Define features and target again
X_train = train.drop(columns=['TARGET', 'SK_ID_CURR'])
y_train = train['TARGET']
X_test = test.drop(columns=['TARGET', 'SK_ID_CURR'], errors='ignore')  # Test doesn't have 'TARGET'


In [19]:
from sklearn.linear_model import LogisticRegression

# Train logistic regression model
model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_train, y_train)

# Predict probabilities
test['TARGET'] = model.predict_proba(X_test)[:, 1]

# Create submission file
submission = test[['SK_ID_CURR', 'TARGET']]
submission.to_csv('submission.csv', index=False)

print("Submission file created.")


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Submission file created.


In [20]:
# Create submission file
submission = test[['SK_ID_CURR', 'TARGET']]
submission.to_csv('submission.csv', index=False)

print("Submission file created: submission.csv")


Submission file created: submission.csv


In [22]:
import numpy as np  # Import numpy for numerical operations
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

# Stratified K-Fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

auc_scores = []
for train_idx, val_idx in skf.split(X_train, y_train):
    X_tr, X_val = X_train.iloc[train_idx], X_train.iloc[val_idx]
    y_tr, y_val = y_train.iloc[train_idx], y_train.iloc[val_idx]
    
    model.fit(X_tr, y_tr)
    preds = model.predict_proba(X_val)[:, 1]
    auc = roc_auc_score(y_val, preds)
    auc_scores.append(auc)

print("Mean AUC:", np.mean(auc_scores))  # Use np.mean to compute the average AUC


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

Mean AUC: 0.6252831841790717
