### Title: 
# FINAL TEST

### Description:

In this notebook we process the new data and measure the accuracy of all the trained models on it.

### Authors:
#### Hugo Cesar Octavio del Sueldo
#### Jose Lopez Galdon

### Date:
04/12/2020

### Version:
1.0

***

### Libraries

In [None]:
    # Numpy & Pandas to work with the DF
import numpy as np
import pandas as pd

    # Pre-processing
from sklearn import preprocessing, metrics

    # Visualize DF
from IPython.display import display, HTML

    # Load models
import pickle

## Load data

In [None]:
    # To automate the work as much as possible, we parameterize the code: in this case, we create an object with
    # the name of the file to load
name = ''

data = pd.read_csv(f'../data/01_raw/{name}.csv',           # Path: an f-string built from the variable name
                   low_memory = False)                     # low_memory = False avoids mixed-dtype warnings

## Data processing

### Select target

In [None]:
    # Merge Default & Charged Off
data['loan_status'] = data['loan_status'].replace({'Default':'Charged Off'})

    # We will only select those observations with "Fully Paid" & "Charged Off"
data_binary = data[(data['loan_status'] == "Fully Paid") | (data['loan_status'] == "Charged Off")]

    # Now, we will transform into 0 & 1
dummy_dict = {"Fully Paid":0, "Charged Off":1}

    # Finally, we apply the dictionary to the dataset
data = data_binary.replace({"loan_status": dummy_dict})
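
The target selection above can be checked on a toy frame. This is only a sketch: the rows are invented, and `isin` and `map` are used as equivalent alternatives to the two comparisons and the dictionary `replace` in the cell above.

```python
import pandas as pd

# Toy data (invented rows); the labels mirror the ones used in the cell above
toy = pd.DataFrame({
    "loan_status": ["Fully Paid", "Default", "Current", "Charged Off", "Fully Paid"]
})

# Merge "Default" into "Charged Off"
toy["loan_status"] = toy["loan_status"].replace({"Default": "Charged Off"})

# Keep only the two classes of interest
toy_binary = toy[toy["loan_status"].isin(["Fully Paid", "Charged Off"])].copy()

# Map the labels to 0 and 1
toy_binary["loan_status"] = toy_binary["loan_status"].map({"Fully Paid": 0, "Charged Off": 1})

encoded = toy_binary["loan_status"].tolist()
print(encoded)
```

The "Current" row is dropped and the "Default" row ends up in the positive class.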

### Select columns

In [None]:
    # Select the columns
columns = ['loan_status', 'funded_amnt', 'term', 'int_rate', 'emp_length', 'home_ownership', 'annual_inc', 'addr_state', 
           'inq_last_6mths', 'open_acc', 'revol_bal', 'revol_util', 'total_acc', 'acc_open_past_24mths', 'avg_cur_bal',
           'bc_open_to_buy', 'bc_util', 'mo_sin_old_il_acct', 'mo_sin_old_rev_tl_op', 'mo_sin_rcnt_rev_tl_op', 'mo_sin_rcnt_tl',
           'mort_acc', 'num_actv_bc_tl', 'num_actv_rev_tl', 'num_bc_tl', 'num_il_tl', 'num_tl_op_past_12m', 'pct_tl_nvr_dlq',
           'total_il_high_credit_limit', 'debt_settlement_flag']

    # Create a new dataset (an explicit copy, so the column assignments below do not raise SettingWithCopyWarning)
data = data[columns].copy()

#### Columns transformations

In [None]:
    # With apply and a lambda function we drop the last character and convert the variable into a float
    # in a single line.
data['int_rate'] = data['int_rate'].apply(lambda x: float(x[:-1]))

    # Now we convert the revol_util variable into a categorical, apply the same kind of lambda function as above
    # to drop the last character, and cast the result to float
data['revol_util'] = data['revol_util'].astype('category')
data['revol_util'] = data['revol_util'].apply(lambda x: x[:-1])
data['revol_util'] = data['revol_util'].astype('float64')

    # We create an object named columns_categ with the variables that we want to convert into categorical
columns_categ = ["emp_length", "home_ownership", "loan_status", "addr_state", "term", "debt_settlement_flag"]
    
    # Below, we transform the variables into categorical with the astype function.
data[columns_categ] = data[columns_categ].astype('category')

    # With apply and a lambda function we replace each category of these variables with its integer code
data[columns_categ] = data[columns_categ].apply(lambda x: x.cat.codes)

data.info()
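
The two kinds of transformation above can be illustrated on a toy frame. This is a sketch with invented values; `.str.rstrip("%")` is shown as an alternative to the lambda (it passes missing values through untouched), not as the method used in this notebook.

```python
import pandas as pd

# Toy data (invented values) shaped like the raw columns
toy = pd.DataFrame({
    "int_rate": ["13.56%", "7.9%", "21.0%"],
    "term": [" 36 months", " 60 months", " 36 months"],
})

# Strip the trailing "%" and cast to float
toy["int_rate"] = toy["int_rate"].str.rstrip("%").astype("float64")

# Categorical codes: categories are sorted, then numbered 0..K-1
toy["term"] = toy["term"].astype("category").cat.codes

rates = toy["int_rate"].tolist()
codes = toy["term"].tolist()
print(rates, codes)
```

Note that `cat.codes` assigns -1 to missing values, so a NaN in a categorical column becomes a regular integer and is not removed by a later `dropna`.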

### Drop NaN

In [None]:
data = data.dropna()

### Feature Scaling

In [None]:
data.select_dtypes(include = ['float64', 'int64']).columns

In [None]:
    # Instance of the scaler
scl = preprocessing.StandardScaler()

    # Numeric columns to scale (fill in with the columns listed by the previous cell)
columns = []

    # Fit the scaler and transform the selected columns
data[columns] = scl.fit_transform(data[columns])

    # Check results
display(HTML(data.head().to_html()))
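
As a sketch of what the scaling step computes, the snippet below standardizes the numeric columns of a toy frame by hand. `StandardScaler` applies z = (x - mean) / std with the population standard deviation (`ddof=0`); the toy values and the name `numeric_cols` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (invented values); "term" mimics a category code stored as int8
toy = pd.DataFrame({
    "funded_amnt": [1000.0, 2000.0, 3000.0],
    "term": pd.Series([0, 1, 0], dtype="int8"),
})

# Same dtype filter as the cell above: the int8 category codes are left out
numeric_cols = toy.select_dtypes(include=["float64", "int64"]).columns.tolist()

# z = (x - mean) / std, using the population standard deviation (ddof=0)
means = toy[numeric_cols].mean()
stds = toy[numeric_cols].std(ddof=0)
toy[numeric_cols] = (toy[numeric_cols] - means) / stds

scaled = toy["funded_amnt"].to_numpy()
print(numeric_cols, scaled)
```

Because the category codes are stored as `int8`, the `float64`/`int64` filter leaves them unscaled.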

### Data Y & Data X

In [None]:
    # Set X data
X = data.drop("loan_status", axis = 1)

    # Set y data
y = data["loan_status"]

    # Check dimensions
X.shape, y.shape

### One Hot Encoding

In [None]:
    # Select those categorical columns
columns_categ = ['term', 'home_ownership', 'emp_length', 'addr_state', 'debt_settlement_flag']
    
    # Below, we transform the variables into categorical with the astype function.
X[columns_categ] = X[columns_categ].astype('category')

    # Check the results
X.info()

In [None]:
    # One-hot encoding, dropping the first level of each variable so that K categories yield K-1 columns
X = pd.get_dummies(X, drop_first=True)

    # Check results
display(HTML(X.head().to_html()))
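
One thing worth checking at this step is that the dummy columns match the ones the models were trained on. If a category is absent from the new data, `get_dummies` with `drop_first=True` would drop a different level. The sketch below fixes the levels before encoding; `train_levels` is a hypothetical list standing in for the levels seen at training time, and the rows are invented.

```python
import pandas as pd

# Toy data (invented rows); note that "MORTGAGE" does not appear
toy = pd.DataFrame({
    "int_rate": [10.5, 7.2, 15.0],
    "home_ownership": ["RENT", "OWN", "RENT"],
})

# Hypothetical: the levels present when the models were trained
train_levels = ["MORTGAGE", "OWN", "RENT"]

# Declaring the full set of levels keeps the dummy columns stable
toy["home_ownership"] = pd.Categorical(toy["home_ownership"], categories=train_levels)

# drop_first=True removes the first declared level ("MORTGAGE"), leaving K-1 columns
encoded = pd.get_dummies(toy, drop_first=True)

dummy_columns = list(encoded.columns)
print(dummy_columns)
```

Without the explicit levels, the same call would drop "OWN" and produce a single `home_ownership_RENT` column.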

***

## Testing

### Logistic Regression

In [None]:
    # Parametrize
filename = 'logistic_regression'

    # Load model
with open(f'../data/04_models/{filename}.sav', 'rb') as file:
    model = pickle.load(file)

    # Print accuracy
result = model.score(X, y)
print(result)
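
Assuming the pickled objects are scikit-learn-style classifiers, `score(X, y)` returns the mean accuracy, that is, the fraction of rows whose prediction equals the true label. A minimal sketch with invented labels and predictions:

```python
import numpy as np

# Invented labels and predictions, for illustration only
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])

# Accuracy = share of predictions that match the true label
accuracy = float(np.mean(y_true == y_pred))
print(accuracy)
```

Four of the five predictions match, so the accuracy is 4/5.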

### Random Forest

In [None]:
    # Parametrize
filename = 'random_forest'

    # Load model
with open(f'../data/04_models/{filename}.sav', 'rb') as file:
    model = pickle.load(file)

    # Print accuracy
result = model.score(X, y)
print(result)

### XGBoost

In [None]:
    # Parametrize
filename = 'xgboost_tuned'

    # Load model
with open(f'../data/04_models/{filename}.sav', 'rb') as file:
    model = pickle.load(file)

    # Print accuracy
result = model.score(X, y)
print(result)

### Support Vector Machine

In [None]:
    # Parametrize
filename = 'svm'

    # Load model
with open(f'../data/04_models/{filename}.sav', 'rb') as file:
    model = pickle.load(file)

    # Print accuracy
result = model.score(X, y)
print(result)
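
Since the four cells above differ only in the file name, the loading step could be wrapped in a small helper. This is a sketch, not part of the original workflow: `load_models` is a hypothetical function, and the demo pickles plain dictionaries into a temporary folder in place of fitted models.

```python
import pickle
import tempfile
from pathlib import Path


def load_models(model_dir, names):
    """Unpickle <model_dir>/<name>.sav for each name and return {name: object}."""
    loaded = {}
    for name in names:
        with open(Path(model_dir) / f"{name}.sav", "rb") as fh:
            loaded[name] = pickle.load(fh)
    return loaded


# Self-contained demo: plain dicts stand in for the fitted models
with tempfile.TemporaryDirectory() as tmp:
    for name in ("logistic_regression", "random_forest"):
        with open(Path(tmp) / f"{name}.sav", "wb") as fh:
            pickle.dump({"model": name}, fh)
    demo = load_models(tmp, ["logistic_regression", "random_forest"])

print(demo)
```

With the real files, one would iterate over `load_models('../data/04_models', [...]).items()` and print `model.score(X, y)` for each entry.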