<a href="https://colab.research.google.com/github/jigjid/github_task/blob/main/Credit_information_learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**Preprocessing**

In [None]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.impute import SimpleImputer

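def Encoder(df):
    """Label-encode every object/category column of df in place and return df."""
    columnsToEncode = list(df.select_dtypes(include=['category','object']))
    for feature in columnsToEncode:
        try:
            # Fresh encoder per column; cast to str so NaN mixed with strings can still be sorted
            df[feature] = LabelEncoder().fit_transform(df[feature].astype(str))
        except (TypeError, ValueError) as err:
            print('Error encoding ' + feature + ': ' + str(err))
    return df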
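
`LabelEncoder` assigns integer codes from whatever values it sees at fit time, so encoding the train and test tables in separate calls can map the same category to different integers. A minimal sketch of one alternative, assuming pandas only: take the category list from the training column and reuse it for the test column. The helper name `encode_with_train_categories`, its `columns` argument, and the toy frames are hypothetical and not part of the notebook.

```python
import pandas as pd


def encode_with_train_categories(train_df, test_df, columns):
    """Integer-encode the given columns using categories seen in train_df.

    Values that appear only in test_df (and missing values) are encoded as -1.
    Returns encoded copies; the inputs are left untouched.
    """
    train_out, test_out = train_df.copy(), test_df.copy()
    for col in columns:
        # Sorted unique non-null training values define the code of each category.
        categories = sorted(train_df[col].dropna().unique())
        train_out[col] = pd.Categorical(train_df[col], categories=categories).codes
        test_out[col] = pd.Categorical(test_df[col], categories=categories).codes
    return train_out, test_out


# Toy data, made up for illustration.
toy_train = pd.DataFrame({"CODE_GENDER": ["M", "F", "M"], "AMT": [1.0, 2.0, 3.0]})
toy_test = pd.DataFrame({"CODE_GENDER": ["F", "XNA"], "AMT": [4.0, 5.0]})

enc_train, enc_test = encode_with_train_categories(toy_train, toy_test, ["CODE_GENDER"])
print(enc_train["CODE_GENDER"].tolist())  # [1, 0, 1]
print(enc_test["CODE_GENDER"].tolist())   # [0, -1]
```

The `-1` code marks a category that was never seen during training, which makes such rows easy to spot before modelling.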

#**Credit Information**
**Problem 1:** Confirmation of competition details.

- What do we learn and what do we predict?

    Many people struggle to get loans because of insufficient or non-existent credit histories. We learn from each applicant's application data and predict how likely the applicant is to have difficulty repaying the loan, so that reliable borrowers can be identified.


- What kind of file to create and submit to Kaggle

    For each SK_ID_CURR in the test set, you must predict a probability for the TARGET variable. The file should contain a header and have the following format:
    
    ```
    SK_ID_CURR,TARGET
    100001,0.1
    100005,0.9
    100013,0.2
    etc.
    ```

  
- What metric is used to evaluate the submissions?

    Submissions are evaluated on the area under the ROC curve (ROC AUC) between the predicted probability and the observed target.
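
Since the submission holds probabilities rather than class labels, a ranking metric is a natural fit, and the validation cells later in this notebook report `roc_auc_score`. As a sketch of what that number means, the area under the ROC curve equals the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as one half. The function `pairwise_auc` and the toy values below are illustrative only.

```python
def pairwise_auc(y_true, y_score):
    """ROC AUC as the share of positive/negative pairs that are ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_auc(labels, scores)
print(auc)  # 0.75
```

The double loop is only meant for intuition; `sklearn.metrics.roc_auc_score` computes the same quantity efficiently on large data.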

#**Create a Baseline Model**

**Problem 2:** Learning and Verification.

In [None]:
app_train = pd.read_csv('application_train.csv')
app_test = pd.read_csv('application_test.csv')

print("Number of samples {} and feature {} in application_train.".format(len(app_train),len(app_train.columns)))
print("Number of samples {} and feature {} in application_test.".format(len(app_test),len(app_test.columns)))
print(app_test.columns)

app_train.dropna(axis = 1, how='any', inplace=True)
y_train = app_train['TARGET']
app_train = app_train.drop("TARGET", axis=1)

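# Keep only the columns that survived dropna in the training set; copy so later in-place edits are safe
app_test = app_test[app_train.columns].copy()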
print(app_train.head(10))

app_train = Encoder(app_train)
app_test = Encoder(app_test)

scaler = StandardScaler()
scaler.fit(app_train)
X_train = scaler.transform(app_train)
X_test = scaler.transform(app_test)

x_tr, x_val, y_tr, y_val = train_test_split(X_train, y_train, random_state=123)

lr = LogisticRegression(C=0.0001)
lr.fit(x_tr, y_tr)

Number of samples 218121 and feature 122 in application_train.
Number of samples 23376 and feature 121 in application_test.
Index(['SK_ID_CURR', 'NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR',
       'FLAG_OWN_REALTY', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL', 'AMT_CREDIT',
       'AMT_ANNUITY', 'AMT_GOODS_PRICE',
       ...
       'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20',
       'FLAG_DOCUMENT_21', 'AMT_REQ_CREDIT_BUREAU_HOUR',
       'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK',
       'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT',
       'AMT_REQ_CREDIT_BUREAU_YEAR'],
      dtype='object', length=121)
   SK_ID_CURR NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY  \
0      100002         Cash loans           M            N               Y   
1      100003         Cash loans           F            N               N   
2      100004    Revolving loans           M            Y               Y   
3      100006         Cash loans       

LogisticRegression(C=0.0001)
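
The cell above fits the scaler on the full training table before splitting off the validation rows. A possible variation, sketched here on synthetic data from `make_classification` (not the competition files), wraps the scaler and the classifier in a scikit-learn pipeline so that the scaling statistics are learned from the training split only. All variable names are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the application table.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=123
)
X_tr_demo, X_val_demo, y_tr_demo, y_val_demo = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=123
)

# fit() learns the scaler's mean/std from the training split only,
# then trains the logistic regression on the scaled features.
baseline = make_pipeline(StandardScaler(), LogisticRegression(C=0.0001))
baseline.fit(X_tr_demo, y_tr_demo)

# predict_proba applies the same scaling before scoring the validation rows.
val_proba = baseline.predict_proba(X_val_demo)[:, 1]
demo_auc = roc_auc_score(y_val_demo, val_proba)
print("validation ROC AUC on synthetic data:", demo_auc)
```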

**Problem 3:** Estimation for test data.

In [None]:
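# Validation score: predicted probability of the positive class on the hold-out split
y_val_pred = lr.predict_proba(x_val)[:, 1]
print("roc auc score", roc_auc_score(y_val, y_val_pred))

# Predict on the test set and write the submission file
y_test_pred = lr.predict_proba(X_test)[:, 1]
data = app_test[["SK_ID_CURR"]].copy()  # copy to avoid SettingWithCopyWarning
data['TARGET'] = y_test_pred
data.to_csv('submission.csv', index=False)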
print(data)

print("BEST SCORE: 0.64511  V2")

roc auc score 0.6050017564581778
       SK_ID_CURR    TARGET
0          100001  0.063240
1          100005  0.104338
2          100013  0.069369
3          100028  0.077455
4          100038  0.091834
...           ...       ...
23371      269955  0.071629
23372      269958  0.082096
23373      269960  0.069842
23374      269966  0.072206
23375      269975  0.059860

[23376 rows x 2 columns]
BEST SCORE: 0.64511  V2
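
Before uploading, it can help to confirm that the file matches the format the competition asks for: a header row with `SK_ID_CURR,TARGET` and one probability per ID. A small sketch of such a check, using an in-memory CSV with made-up values in place of `submission.csv`; the function name `check_submission` is hypothetical.

```python
import io

import pandas as pd


def check_submission(df):
    """Raise ValueError if df does not look like a valid submission table."""
    if list(df.columns) != ["SK_ID_CURR", "TARGET"]:
        raise ValueError("columns must be exactly SK_ID_CURR, TARGET")
    if df["SK_ID_CURR"].duplicated().any():
        raise ValueError("each SK_ID_CURR must appear only once")
    if df["TARGET"].isna().any():
        raise ValueError("TARGET contains missing values")
    if not df["TARGET"].between(0.0, 1.0).all():
        raise ValueError("TARGET must be a probability in [0, 1]")
    return True


# Made-up rows standing in for the real submission file.
demo_csv = io.StringIO("SK_ID_CURR,TARGET\n100001,0.1\n100005,0.9\n100013,0.2\n")
demo_submission = pd.read_csv(demo_csv)
print(check_submission(demo_submission))  # True
```

For the real file, the same function could be called on `pd.read_csv('submission.csv')`.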




#**Feature Engineering**

**Problem 4:** Feature Engineering.

In [None]:
application_train = pd.read_csv('application_train.csv')
application_test = pd.read_csv('application_test.csv')

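# numeric_only=True skips string columns (needed in pandas >= 2.0, where corr() no longer drops them silently)
correlations = application_train.corr(numeric_only=True)["TARGET"].sort_values()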

print('Most Positive Correlations:\n', correlations.tail(15))
print('\nMost Negative Correlations:\n', correlations.head(15))

print('\nFrom the result above, we can see that the variables most correlated with TARGET are \
DAYS_BIRTH, EXT_SOURCE_3, EXT_SOURCE_2 and EXT_SOURCE_1.\n')


ext_data = application_train[['TARGET', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH']]
ext_data_corrs = ext_data.corr()
print(ext_data_corrs)


y_ext = ext_data['TARGET']
X_train = ext_data.drop(columns=['TARGET'])
X_test = application_test[X_train.columns]

print(X_train.columns)

imputer = SimpleImputer(strategy="median")
X_train = imputer.fit_transform(X_train)
# Reuse the medians learned from the training data instead of refitting on the test set
X_test = imputer.transform(X_test)

scaler = StandardScaler()
scaler.fit(X_train)
train = scaler.transform(X_train)
test = scaler.transform(X_test)


x_tr, x_val, y_tr, y_val = train_test_split(train, y_ext, random_state=123)

log_reg = LogisticRegression(C=0.0001)
log_reg.fit(x_tr, y_tr)

log_reg_pred = log_reg.predict_proba(x_val)[:, 1]
print("roc auc score", roc_auc_score(y_val, log_reg_pred))

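# Predict on the scaled test features, since the model was trained on scaled data
y_test_pred = log_reg.predict_proba(test)[:, 1]

data = application_test[["SK_ID_CURR"]].copy()  # copy to avoid SettingWithCopyWarning
data['TARGET'] = y_test_pred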
data.to_csv('submission.csv', index=False)
print(data)

print("BEST SCORE: 0.69758  V3")

Most Positive Correlations:
 LIVE_CITY_NOT_WORK_CITY        0.032112
DEF_60_CNT_SOCIAL_CIRCLE       0.032249
DEF_30_CNT_SOCIAL_CIRCLE       0.032963
OWN_CAR_AGE                    0.041047
DAYS_REGISTRATION              0.041197
REG_CITY_NOT_LIVE_CITY         0.043131
FLAG_DOCUMENT_3                0.044692
FLAG_EMP_PHONE                 0.045297
REG_CITY_NOT_WORK_CITY         0.050257
DAYS_ID_PUBLISH                0.052886
DAYS_LAST_PHONE_CHANGE         0.054783
REGION_RATING_CLIENT           0.058438
REGION_RATING_CLIENT_W_CITY    0.060609
DAYS_BIRTH                     0.078254
TARGET                         1.000000
Name: TARGET, dtype: float64

Most Negative Correlations:
 EXT_SOURCE_3                 -0.177147
EXT_SOURCE_2                 -0.160568
EXT_SOURCE_1                 -0.154153
FLOORSMAX_AVG                -0.046965
FLOORSMAX_MEDI               -0.046909
FLOORSMAX_MODE               -0.045893
DAYS_EMPLOYED                -0.044247
AMT_GOODS_PRICE              -0.039332
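
The external-source features contain missing values, which is why the cell above imputes them before scaling. One way to keep the train and test tables consistent is to compute the fill values on the training data and reuse them for the test data. A minimal pandas sketch with made-up numbers (the frames `toy_train` and `toy_test` are not from the competition files):

```python
import numpy as np
import pandas as pd

toy_train = pd.DataFrame({"EXT_SOURCE_1": [0.2, np.nan, 0.6, 0.4],
                          "EXT_SOURCE_2": [0.5, 0.7, np.nan, 0.1]})
toy_test = pd.DataFrame({"EXT_SOURCE_1": [np.nan, 0.9],
                         "EXT_SOURCE_2": [0.3, np.nan]})

# Medians come from the training rows only (NaN is skipped by default).
train_medians = toy_train.median()

# fillna with a Series fills each column with the value under the matching label.
train_filled = toy_train.fillna(train_medians)
test_filled = toy_test.fillna(train_medians)  # reuse the training medians

print(train_medians.to_dict())
print(test_filled)
```

Here the missing test values are filled with 0.4 and 0.5, the medians of the observed training values, rather than with statistics taken from the test table itself.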


