### NOTEBOOK COVERS
1. Understand the requirement/Problem statement
2. Preprocessing
3. EDA
4. Imputation
5. Standardization
6. Encodings
7. Pipeline
8. Models
9. Fine-tuning
10. Cross-validation
11. Grid search
12. Pickling
13. Stacking
14. Voting
15. Streamlit: create UI and link
16. Deployment

### 1. Understand the requirement/Problem statement

Credit Risk in the Fintech Industry. You are required to build and train a model that distinguishes Fully Paid loans from Charged-off loans in the loan dataset.
Task:
Build this model based on the details in this document and submit it.


Consider these factors before building the models:
1.	Use the specific source or dataset shared with you to assess credit risk.
2.	What is your intended data split ratio for the training, validation, and test sets of the loan dataset? How do you plan to ensure randomness in this split?
3.	Do you plan to explore the importance of these components further?
4.	Do you anticipate class imbalance in the 'loan_status' feature? Its values are:
	- Fully paid: the applicant has fully paid the loan (the principal and the interest).
	- Charged-off: the applicant has not paid the installments on time for a long period, i.e. the client has defaulted on the loan.
	If so, how will you address this imbalance?
5.	Will you normalize the features? If yes, what normalization techniques do you have in mind?
6.	Do you intend to perform data preprocessing tasks such as outlier detection, missing value handling, or feature selection before training your model?
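
Questions 2 and 4 above can be explored with a stratified split. The following is a minimal sketch on a made-up toy frame, not the real loan data: the 70/15/15 ratio, the label strings, and the `random_state` value are all assumptions chosen for illustration. Stratifying on `loan_status` keeps the class ratio similar in every subset, and a fixed seed makes the random split reproducible.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the loan data (hypothetical values, imbalanced on purpose)
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "loan_amnt": rng.integers(1_000, 40_000, size=1_000),
    "loan_status": rng.choice(["Fully Paid", "Charged Off"], size=1_000, p=[0.8, 0.2]),
})

X = toy.drop(columns="loan_status")
y = toy["loan_status"]

# 70% train, then split the remaining 30% evenly into validation and test
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42
)

# Subset sizes and the class ratio in the training set
print(len(X_train), len(X_val), len(X_test))
print(y_train.value_counts(normalize=True).round(2))
```

If the class ratio printed here is strongly skewed in the real data, options such as class weights or resampling are worth comparing on the validation set.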

### 2. Preprocessing

In [6]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder,LabelEncoder
from sklearn.preprocessing import MinMaxScaler

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression

ImportError: cannot import name 'int' from 'numpy' (/opt/anaconda3/lib/python3.8/site-packages/numpy/__init__.py)

In [7]:
conda update --all

In [None]:
data = pd.read_csv("../data/train_loan_data.csv")

In [None]:
data # to see sample data

In [None]:
data.info() # to view entire column info of data

In [None]:
data.describe() # to get all the stats details of numerical columns

In [None]:
data.shape #(rows x columns)

In [None]:
data.duplicated().sum()

In [None]:
data.isna().sum()

### 3. Exploratory Data Analysis (EDA)

In [None]:
sns.histplot(data['loan_amnt'])

In [None]:
data.verification_status.unique()

In [None]:
data.loan_status.unique()

In [None]:
data.term.unique()

In [None]:
data.sub_grade.unique()

In [None]:
data.purpose.unique()

In [None]:
data.application_type.unique()

In [None]:
data.initial_list_status.unique()

In [None]:
data.home_ownership.unique()

In [None]:
data.grade.unique()

In [None]:
data.select_dtypes(exclude=np.number)

### Handling Missing values

    1. Drop the row
    2. Replace with a statistical property (mean, median, or mode)
    3. Replace with an imputed value
    4. Create a new category label


Before dropping any rows or columns, analyze how significant these columns are for identifying credit risk (domain expertise required).
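
As a rough illustration of the four options listed above, here is a sketch on a tiny made-up frame. The column names are borrowed from this notebook, but the values are invented, and `KNNImputer` is only one possible choice for option 3, picked here as an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy frame; values are invented for illustration
toy = pd.DataFrame({
    "mort_acc": [0.0, 1.0, np.nan, 2.0, 15.0],
    "tot_cur_bal": [1000.0, 2500.0, 3000.0, 4000.0, 90000.0],
    "emp_length": ["10+ years", np.nan, "2 years", "2 years", np.nan],
})

# 1. Drop every row that contains a missing value
dropped = toy.dropna()

# 2. Replace with a statistic: median for numeric, mode for categorical
stat_filled = toy.copy()
stat_filled["mort_acc"] = stat_filled["mort_acc"].fillna(toy["mort_acc"].median())
stat_filled["emp_length"] = stat_filled["emp_length"].fillna(toy["emp_length"].mode()[0])

# 3. Replace with an imputed value estimated from similar rows (numeric columns only)
knn_filled = KNNImputer(n_neighbors=2).fit_transform(toy[["mort_acc", "tot_cur_bal"]])

# 4. Create a new category label for the missing entries
labelled = toy.copy()
labelled["emp_length"] = labelled["emp_length"].fillna("missing")

print(len(dropped), stat_filled["mort_acc"].tolist(), labelled["emp_length"].tolist())
```

Dropping rows is the simplest option but discards data, which matters when a column such as `emp_title` has thousands of nulls.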

In [None]:
# Employment length in years. 
# Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years.

data.emp_length.unique() # 4588 nulls

In [None]:
# The job title supplied by the Borrower when applying for the loan.
data.emp_title.unique() # 5018 nulls

In [None]:
# Number of currently active bankcard accounts.
data.num_actv_bc_tl.unique() # 3948 nulls

In [None]:
# Number of mortgage accounts.
data.mort_acc.unique() # 2771 nulls

In [None]:
# Total current balance of all accounts
data.tot_cur_bal.unique() # 3948 nulls

In [None]:
# Number of public record bankruptcies.
data.pub_rec_bankruptcies.unique() # 31 nulls

In [None]:
# Revolving line utilization rate, 
#or the amount of credit the borrower is using relative to all available revolving credit.

data.revol_util.unique() # 53 nulls

In [None]:
# The loan title provided by the borrower
data.title.unique() # 970 nulls

### Analyze outliers in the numerical columns to decide whether to replace missing values with the mean or the median

In [None]:
data['num_actv_bc_tl'].plot(kind='box') # Has outliers, so impute with the median

In [None]:
data['num_actv_bc_tl'].mean()

In [None]:
data['num_actv_bc_tl'].median()

In [None]:
data['mort_acc'].plot(kind='box') # Has outliers, so impute with the median

In [None]:
data['mort_acc'].mean()

In [None]:
data['mort_acc'].median()

In [None]:
data.tot_cur_bal.plot(kind='box') # Has outliers, so impute with the median

In [None]:
data.tot_cur_bal.mean()

In [None]:
data.tot_cur_bal.median()

In [None]:
data.pub_rec_bankruptcies.plot(kind='box') # Has outliers, so impute with the median

In [None]:
data.pub_rec_bankruptcies.mean() 

In [None]:
data.pub_rec_bankruptcies.median()

In [None]:
data.revol_util.plot(kind='box') # Has outliers, so impute with the median

In [None]:
data.revol_util.mean() 

In [None]:
data.revol_util.median()
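
The box plots above are read by eye. A numeric check can back up the choice of the median: the sketch below uses a made-up series (not a real column) and the common 1.5 × IQR rule, which is an assumption about how "outlier" is defined here. It shows how a single extreme value pulls the mean away from the median.

```python
import pandas as pd

# Hypothetical right-skewed column with one extreme value
s = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 60], name="num_actv_bc_tl")

# Interquartile range and the usual 1.5 * IQR fences
q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

print("mean:", s.mean(), "median:", s.median(), "outliers:", outliers.tolist())
```

When the mean sits well above the median like this, filling nulls with the mean would insert values that few real rows have, so the median is the safer default.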

## Split data into numeric and categorical

### Numerical data analysis, with imputation to fill NAs and min-max scaling to normalize the data

In [None]:
numerical_data = data.select_dtypes(include=np.number)

In [None]:
numerical_data.info()

In [None]:
numerical_data

In [None]:
numerical_data.isna().sum()

In [None]:
clone_numeric_data = numerical_data.copy()

In [None]:
numerical_data.columns

In [None]:
# Median imputation for every numeric column; fit_transform fits and transforms in one step
imputer = SimpleImputer(missing_values=np.nan, strategy='median')
imputed_numeric_data = imputer.fit_transform(clone_numeric_data[numerical_data.columns])

In [None]:
imputed_numeric_data

In [None]:
imputed_numeric_dataframe = pd.DataFrame(imputed_numeric_data,columns=numerical_data.columns)

In [None]:
imputed_numeric_dataframe

In [None]:
imputed_numeric_dataframe.info()
numerical_data.info()

In [None]:
imputed_numeric_dataframe.isna().sum()
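
The cells above fit the imputer on the full dataset. Once the data is split, a common practice is to learn the fill value from the training rows only and reuse it on the other subsets, so that no information leaks from validation or test data. A minimal sketch with invented arrays standing in for one numeric column:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical train/test slices of a single numeric column
train = np.array([[1.0], [2.0], [np.nan], [100.0]])
test = np.array([[np.nan], [5.0]])

imputer = SimpleImputer(strategy="median")
train_filled = imputer.fit_transform(train)  # learns the median from train only
test_filled = imputer.transform(test)        # reuses the train median, no refit

print(imputer.statistics_, test_filled.ravel())
```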

### Categorical data analysis, with imputation to fill NAs and one-hot encoding to convert categories to numeric columns

In [None]:
categorical_data = data.select_dtypes(exclude = np.number)

In [None]:
categorical_data.info()

In [None]:
categorical_data.isna().sum()

In [None]:
clone_categorical_data = categorical_data.copy()

In [None]:
categorical_data.columns

In [None]:
# Fill missing categorical values with the constant label 'missing'.
# Only emp_length, emp_title and title contain nulls, so the other columns pass through unchanged.
imputer = SimpleImputer(strategy='constant', fill_value='missing')
imputed_categorical_data = imputer.fit_transform(clone_categorical_data[categorical_data.columns])

In [None]:
imputed_categorical_data

In [None]:
imputed_categorical_dataframe = pd.DataFrame(imputed_categorical_data,columns=categorical_data.columns)

In [None]:
imputed_categorical_dataframe

In [None]:
imputed_categorical_dataframe.info()
categorical_data.info()

In [None]:
imputed_categorical_dataframe.isna().sum()

In [None]:
# sparse_output replaces the deprecated sparse argument (scikit-learn >= 1.2)
onehotencoder = OneHotEncoder(sparse_output=False, categories='auto')

In [None]:
categorical_data.columns

In [None]:
X = onehotencoder.fit_transform(imputed_categorical_dataframe)

In [None]:
X

In [None]:
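# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is its replacement
column_names = onehotencoder.get_feature_names_out()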

In [None]:
column_names

In [None]:
imputed_categorical_dataframe = pd.DataFrame(X, columns=column_names)

In [None]:
imputed_categorical_dataframe

In [None]:
# class CustomTransformer(BaseEstimator, TransformerMixin):
#     """Thin wrapper around SimpleImputer with a configurable strategy."""
#     def __init__(self, strategy):
#         self.strategy = strategy
#
#     def fit(self, X, y=None):  # y=None keeps it usable inside a Pipeline
#         self.imputer = SimpleImputer(missing_values=np.nan, strategy=self.strategy)
#         self.imputer.fit(X)
#         return self
#
#     def transform(self, X):
#         return self.imputer.transform(X)

### 4. Imputation  5. Standardization  6. Encodings

In [None]:
numerical_imputer = Pipeline([
    ('imputation', SimpleImputer(strategy='median')),  # median, as chosen in the outlier analysis
    ('scale', MinMaxScaler())
])

categorical_imputer = Pipeline([
    ('imputation', SimpleImputer(strategy='constant', fill_value='missing')),
    # sparse_output replaces the deprecated sparse argument (scikit-learn >= 1.2)
    ('onehot', OneHotEncoder(sparse_output=False, handle_unknown='ignore'))
])

In [None]:

preprocess_full = ColumnTransformer(
    transformers=[
        ('numerical_preprocessing', numerical_imputer, numerical_data.columns),
        ('categorical_preprocessing', categorical_imputer, ['emp_length', 'emp_title', 'title'])
    ]
)
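
The transformer above lists three categorical columns by hand, and any column not listed is dropped by default. As an alternative sketch, columns can be selected by dtype with `make_column_selector`, after separating the target so it is never encoded as a feature. The toy frame and the step names below are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical toy frame with one numeric column, one categorical column and the target
toy = pd.DataFrame({
    "loan_amnt": [1000.0, 5000.0, np.nan, 20000.0],
    "term": ["36 months", "60 months", np.nan, "36 months"],
    "loan_status": ["Fully Paid", "Charged Off", "Fully Paid", "Fully Paid"],
})
X = toy.drop(columns="loan_status")  # keep the target out of the features

numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", MinMaxScaler()),
])
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# Route columns by dtype instead of listing them by name
preprocess = ColumnTransformer([
    ("num", numeric_steps, make_column_selector(dtype_include=np.number)),
    ("cat", categorical_steps, make_column_selector(dtype_exclude=np.number)),
])

features = preprocess.fit_transform(X)
print(features.shape)
```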



### 7. Pipeline

In [None]:
pipe = Pipeline([
    ('preprocess', preprocess_full),
    ('model', LogisticRegression())
])
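
The pipeline is defined above but not yet fitted. The following is a sketch of how fitting and scoring could look, using synthetic data with an invented rule linking `int_rate` to `loan_status`. The column names, the `class_weight="balanced"` setting, and `max_iter=1000` are assumptions, not choices taken from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Synthetic data: higher rates are more likely to be charged off (hypothetical rule)
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    "int_rate": rng.uniform(5, 30, size=n),
    "grade": rng.choice(["A", "B", "C"], size=n),
})
noisy_rate = toy["int_rate"] + rng.normal(0, 3, size=n)
toy["loan_status"] = np.where(noisy_rate > 22, "Charged Off", "Fully Paid")
toy.loc[::25, "int_rate"] = np.nan  # inject some missing values

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", MinMaxScaler())]), ["int_rate"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["grade"]),
])
model = Pipeline([
    ("preprocess", preprocess),
    # class_weight="balanced" reweights classes inversely to their frequency
    ("model", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

X = toy.drop(columns="loan_status")
y = toy["loan_status"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Preprocessing statistics are learned from the training rows only
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(round(accuracy, 3))
```

Because imputation and scaling live inside the pipeline, calling `fit` on the training rows and `score` on the held-out rows avoids leaking test statistics into preprocessing.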



### 8. Models

In [None]:
from sklearn import set_config
set_config(display='diagram')  
display(pipe)

In [None]:
filled = preprocess_full.fit_transform(data)

In [None]:
filled

In [None]:
# One-hot encoding adds columns, so take the names from the fitted transformer rather than data.columns
pd.DataFrame(filled, columns=preprocess_full.get_feature_names_out()).info()

In [None]:
data.columns

In [None]:
categorical_data

In [None]:
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
df = imputer.fit_transform(clone_numeric_data)  # mean imputation only works on numeric columns

df