<h1 align="center"> Lending Club Loan Data: Modeling </h1> <br>

### Capstone project:

#### Anomaly detection/prediction enhanced by Natural language processing (NLP) techniques  <br><br>

##### Problem: <br><br>
1) In the financial domain, anomaly detection solutions typically cannot use the text data available for customers, such as emails, chats with customer agents, survey responses, and financial transaction information, so this data is not channeled into better anomaly predictions. <br><br> 2) Numerical and text data are used separately. Training the NLP model to catch anomalies was the most difficult part. <br><br>

##### Solution: <br><br>
1) Design and implement a prototype machine learning solution that can ingest both numeric and text data and make better predictions. <br><br> 2) Train the model by converting the dataset into text that NLP techniques can process. <br><br>



##### Approach: <br><br>
The goal is to create NLP embeddings for this dataset of loans issued from 2007 to 2018 and use them to predict loan default. The data contains both numeric and non-numeric columns. The NLP techniques doc2vec and ELMo are used to create the vectors, which are then used in the models.<br><br>


##### More details: <br><br>
For more details, please refer to the presentation slides.


## Company Information:
Lending Club is a peer-to-peer lending company based in the United States.

1) Investors provide funds for potential borrowers. <br>
2) Investors earn a profit depending on the risk they take (the borrower's credit score). <br>
3) Lending Club provides the "bridge" between investors and borrowers. For more information, please refer to the article below. <br><br>

<a href="https://en.wikipedia.org/wiki/Lending_Club"> Lending Club Information </a>





<img src="http://echeck.org/wp-content/uploads/2016/12/Showing-how-the-lending-club-works-and-makes-money-1.png"><br><br>



In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
from pprint import pprint
import sys
import string

In [2]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

In [3]:
import dill
import pickle


## Environment Creation:
<a id="Environment_creation"></a>
#Environment_creation



A new environment was created for TensorFlow: <BR>
<BR>
1) Created a new, empty virtual environment called tf36 for TensorFlow at the command prompt. <BR>
2) Installed Python 3.6. <BR>
3) The environment folder is at user/pramodpaul/anaconda3/envs/tf36    (delete this tf36 folder if you have major issues with Anaconda). <BR>
4) environment.yml (a YAML file) contains the versions of the software. <BR>
5) Activate tf36. <BR>
6) Restart the machine.  <BR>
7) In Jupyter Notebook, change the kernel environment to conda:tf36.  <BR>
8) To install packages in this environment, follow the URL below.     <BR>
https://jakevdp.github.io/blog/2017/12/05/installing-python-packages-from-jupyter/  <BR>
Example:   <BR>
#import sys    <BR>
#!conda install --yes --prefix {sys.prefix} gensim <BR>
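
The install example above can also be assembled in Python, which makes it easy to confirm that the package goes into the environment of the running kernel. This is a minimal sketch: the helper name `conda_install_command` is hypothetical, and actually running the command assumes the `conda` executable is on the PATH.

```python
import sys

def conda_install_command(package, prefix=sys.prefix):
    """Return the argument list that installs `package` into the environment at `prefix`."""
    return ["conda", "install", "--yes", "--prefix", prefix, package]

cmd = conda_install_command("gensim")
print(" ".join(cmd))

# To execute it (assumption: conda is available on PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```

Using `sys.prefix` ties the install to the interpreter behind the current kernel rather than to whichever environment is first on the shell PATH.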
   

In [137]:
## Data Dictionary
## Please refer to the attached LCDataDictionary_preprocessing.xlsx   ### modified from the original
# Lending Club - peer-to-peer lending company
#  Investors can invest in loans in various loan portfolios,  personal, home, 

In [4]:
from gensim.models import Word2Vec
from gensim.models import Doc2Vec
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.test.utils import common_texts, get_tmpfile
from keras.preprocessing.text import Tokenizer
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn import metrics
from keras.layers import *
from keras.models import *
from keras import initializers, regularizers, constraints, optimizers, layers
from keras.initializers import *
from keras.optimizers import *
import keras.backend as K
from keras.callbacks import *
import tensorflow as tf
from keras.preprocessing.sequence import pad_sequences
from smart_open import smart_open
import datetime 
#from keras.utils import multi_gpu_model

import os
import time
import gc
import re
import random
import keras

import pickle
import dill

Using TensorFlow backend.


In [5]:
import pandas as pd
import numpy as np
import spacy
from tqdm import tqdm
import re
import time
import pickle
pd.set_option('display.max_colwidth', 200)

## Getting data ready for modeling

In [None]:
### looking at individual applications only, not joint

In [43]:
##data_NLP_single = data_NLP[data_NLP['application_type'] == 'Individual']

In [44]:
#pickle_out = open("data_NLP_Single.pickle","wb")
#pickle.dump(data_NLP_single, pickle_out)
#pickle_out.close()

In [6]:
## load the saved file back
with open("data_NLP_Single.pickle", "rb") as pickle_in:
    data_NLP_Single_new = pickle.load(pickle_in)



## Train

In [7]:
data_NLP_s =        data_NLP_Single_new

In [8]:
data_NLP_s.year.unique()

array([2018, 2016, 2015, 2017, 2013, 2012, 2014, 2011, 2010, 2009, 2008,
       2007])

In [9]:
### 2007 - 2017 (2018 is held out for testing)
train_year_filter = [2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017]

In [10]:
Train = data_NLP_s[data_NLP_s['year'].isin(train_year_filter)]

In [11]:
Train.head(5)

Unnamed: 0,acc_now_delinq,annual_income,chargeoff_within_12_mths,collections_12_mths_ex_med,debt_settlement_flag,dti,emp_length,emp_title,grade,hardship_flag,...,settlement_term,tot_coll_amt,loan_condition,loan_status_InGracePeriod,loan_status_ChargedOff,loan_status_DMC_ChargedOff,member_id,issue_d,application_type,year
495242,0.0,high,0.0,0.0,N,low,four years,lead mold maker,medium,N,...,,0.0,1,0,0,0,495243,Sep-2016,Individual,2016
495243,0.0,medium,0.0,0.0,N,low,one year,Compliance Manager,medium,N,...,,3841.0,0,0,1,0,495244,Sep-2016,Individual,2016
495248,0.0,medium,0.0,0.0,N,low,more than ten years,Management meat department,high,N,...,,82.0,0,0,1,0,495249,Sep-2016,Individual,2016
495250,0.0,low,0.0,0.0,N,medium,two years,Enrichment Manager,medium,N,...,,0.0,1,0,0,0,495251,Sep-2016,Individual,2016
495251,0.0,lowest,0.0,0.0,N,low,three years,Teacher,high,N,...,,0.0,1,0,0,0,495252,Sep-2016,Individual,2016


In [12]:
Train['loan_condition'].unique()

array([1, 0])

In [13]:
y_train = Train['loan_condition']

In [14]:
y_train.shape

(722736,)

In [15]:
#y_train = y_train.head(500)

In [16]:
y_train= y_train.head(100000)

In [17]:
Train_temp = Train

In [316]:
#X_train = Train_temp.drop('loan_condition',axis =1)

In [377]:
#X_train.shape

In [18]:
X_train = Train.head(100000)

In [19]:
X_train.shape

(100000, 32)

In [20]:
de = X_train[X_train['loan_condition'] == 1]   ### good loans only; used to check the class balance

In [21]:
de.shape


(68593, 32)
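
With 68,593 good loans among the 100,000 training rows (about 68.6%), the sample leans toward good loans. The class shares can be read off directly with `value_counts(normalize=True)`. This is a small sketch on made-up labels; the series below is hypothetical and only stands in for `X_train['loan_condition']`.

```python
import pandas as pd

# Hypothetical labels: 1 = good loan, 0 = bad loan.
labels = pd.Series([1, 1, 1, 0, 1, 0, 1, 1, 0, 1], name="loan_condition")

# Fraction of rows in each class.
share = labels.value_counts(normalize=True)
good_share = float(share.loc[1])
print(share)
```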

In [23]:
X_train.shape

(100000, 32)

In [19]:
y_train.dtype

dtype('int64')

## Test batch

In [24]:
# 2018 only
test_year_filter = [2018]

In [25]:
Test = data_NLP_s[data_NLP_s['year'].isin(test_year_filter)]
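
The train and test frames are separated by issue year, so the test loans are all later than the training loans. The sketch below shows the same idea on a toy frame and checks that no year lands on both sides. The frame `loans` and its values are hypothetical stand-ins for `data_NLP_s`.

```python
import pandas as pd

# Toy frame standing in for data_NLP_s (hypothetical values).
loans = pd.DataFrame({
    "year": [2007, 2012, 2016, 2017, 2018, 2018],
    "loan_condition": [1, 0, 1, 1, 0, 1],
})

train_years = list(range(2007, 2018))   # 2007..2017 inclusive
test_years = [2018]

train = loans[loans["year"].isin(train_years)]
test = loans[loans["year"].isin(test_years)]

# No year should appear on both sides of the split.
overlap = set(train["year"]) & set(test["year"])
print(len(train), len(test), overlap)
```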

In [26]:
Test.head()

Unnamed: 0,acc_now_delinq,annual_income,chargeoff_within_12_mths,collections_12_mths_ex_med,debt_settlement_flag,dti,emp_length,emp_title,grade,hardship_flag,...,settlement_term,tot_coll_amt,loan_condition,loan_status_InGracePeriod,loan_status_ChargedOff,loan_status_DMC_ChargedOff,member_id,issue_d,application_type,year
0,0.0,low,0.0,0.0,N,low,more than ten years,Chef,medium,N,...,,0.0,1,0,0,0,1,Dec-2018,Individual,2018
1,0.0,high,0.0,0.0,N,medium,more than ten years,Postmaster,medium,N,...,,1208.0,1,0,0,0,2,Dec-2018,Individual,2018
2,0.0,medium,0.0,0.0,N,low,six years,Administrative,medium,N,...,,0.0,1,0,0,0,3,Dec-2018,Individual,2018
3,0.0,high,0.0,0.0,N,low,more than ten years,IT Supervisor,medium,N,...,,686.0,1,0,0,0,4,Dec-2018,Individual,2018
4,0.0,medium,0.0,0.0,N,medium,more than ten years,Mechanic,medium,N,...,,0.0,1,0,0,0,5,Dec-2018,Individual,2018


In [27]:
y_test = Test['loan_condition']

In [28]:
y_test.shape

(390790,)

In [47]:
#y_test = y_test.head(100)

In [48]:
#y_test.shape

In [21]:
#Test_temp = Test 

In [29]:
Bad_loan_test = Test[Test['loan_condition'] == 0]

In [30]:
Bad_loan_t = Bad_loan_test.head(1000)

In [31]:
Good_loan_test = Test[Test['loan_condition'] == 1]

In [32]:
Good_loan_t = Good_loan_test.head(1000)

In [33]:
X_test = pd.concat([Good_loan_t, Bad_loan_t])   # DataFrame.append was removed in pandas 2.0

In [34]:
X_test.shape

(2000, 32)
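
The test set above takes the first 1000 good loans and the first 1000 bad loans and stacks them, which gives an evenly split evaluation set. A minimal sketch of that step on a toy frame follows; the frame, the `member_id` values, and `n_per_class = 2` are hypothetical.

```python
import pandas as pd

# Toy frame standing in for Test (hypothetical values).
test = pd.DataFrame({
    "member_id": range(1, 11),
    "loan_condition": [1, 1, 1, 0, 1, 1, 0, 1, 0, 1],
})

n_per_class = 2  # the notebook uses 1000

# First n good loans followed by the first n bad loans.
balanced = pd.concat([
    test[test["loan_condition"] == 1].head(n_per_class),
    test[test["loan_condition"] == 0].head(n_per_class),
])
y = balanced["loan_condition"]
print(balanced.shape, y.value_counts().to_dict())
```

Note that `head` keeps the first rows in file order; drawing a random sample per class instead would be a different design choice from the one used here.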

In [35]:
y_test = X_test['loan_condition']

In [36]:
y_test.shape

(2000,)

In [37]:
nlp_csv_file = 'NLP_Preprocessing_Final_CSV_file.csv'
LCDD = pd.read_csv(nlp_csv_file, encoding = 'utf-8',delimiter = ',')
# If you don't specify the type encoding as `utf-8`, you're going to have a difficult time when you try to convert to SQL.
LCDD.head(2)

Unnamed: 0,LendingClub_Column_name,NLP_use,customer,Add_words_1,ToDo,Add_words_2,Modified_Description
0,acc_now_delinq,1,The default customer,has delinquent accounts.,,,The number of accounts on which the borrower is now delinquent is
1,annual_income,1,The default customer,has,change to high/med/low,annual income.,The self-reported annual income provided by the borrower during registration is


In [38]:
LCDD.columns

Index(['LendingClub_Column_name', 'NLP_use', 'customer', 'Add_words_1', 'ToDo',
       'Add_words_2', 'Modified_Description'],
      dtype='object')

In [39]:
LCDD.replace(np.nan,'')

Unnamed: 0,LendingClub_Column_name,NLP_use,customer,Add_words_1,ToDo,Add_words_2,Modified_Description
0,acc_now_delinq,1,The default customer,has delinquent accounts.,,,The number of accounts on which the borrower is now delinquent is
1,annual_income,1,The default customer,has,change to high/med/low,annual income.,The self-reported annual income provided by the borrower during registration is
2,chargeoff_within_12_mths,1,The default customer,has charge off default conditions.,,,Number of charge-offs within 12 months is
3,collections_12_mths_ex_med,1,The default customer,has collections within twelve months excluding health expenses.,,,Number of collections in 12 months excluding medical collections is
4,debt_settlement_flag,1,The default customer,who was charged off has debt settlement status plan.,if value is y or 1,,"Flags whether or not the borrower, who has charged-off, is working with a debt-settlement company."
5,dti,1,The default customer,has,change to high/med/low,Debt-To-Income Ratio,"A ratio calculated using the borrower’s total monthly debt payments on the total debt. obligations, excluding mortgage and the requested LendingClub loan, divided by the borrower’s self-reported m..."
6,emp_length,1,The default customer,employment length in years is,value,,Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years.
7,emp_title,1,The default customer,employment title is,value,,The date the listing will expire
8,grade,1,The default customer,has,change to high/med/low,credit score,LendingClub assigned loan grade
9,hardship_flag,1,The default customer,is on hardship plan.,,,Flags whether or not the customer is on a hardship plan


In [38]:
#pickle_out = open("LCDD.pickle","wb")
#pickle.dump(LCDD, pickle_out)
#pickle_out.close()

In [38]:
#pickle_in = open("LCDD.pickle", "rb")
#LCDD = pickle.load(pickle_in)

In [40]:
LCDD.head()

Unnamed: 0,LendingClub_Column_name,NLP_use,customer,Add_words_1,ToDo,Add_words_2,Modified_Description
0,acc_now_delinq,1,The default customer,has delinquent accounts.,,,The number of accounts on which the borrower is now delinquent is
1,annual_income,1,The default customer,has,change to high/med/low,annual income.,The self-reported annual income provided by the borrower during registration is
2,chargeoff_within_12_mths,1,The default customer,has charge off default conditions.,,,Number of charge-offs within 12 months is
3,collections_12_mths_ex_med,1,The default customer,has collections within twelve months excluding health expenses.,,,Number of collections in 12 months excluding medical collections is
4,debt_settlement_flag,1,The default customer,who was charged off has debt settlement status plan.,if value is y or 1,,"Flags whether or not the borrower, who has charged-off, is working with a debt-settlement company."


In [41]:
To_Clean_list = LCDD.LendingClub_Column_name.values

In [42]:
# Strip spaces from the column names so they match the DataFrame columns
Clean_list = [str(n).replace(' ', '') for n in To_Clean_list]
    

In [43]:
Clean_list;

In [44]:
#keys = LCDD['LendingClub_Column_name'].values
Keys = Clean_list

valu = LCDD['Add_words_1'].values

DD_temp_dict =   dict(zip(Keys,valu))
#DD_temp_dict['avg_cur_bal']

In [45]:
DD_temp_dict['loan_condition']

'loan condition is'
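
The lookup built above maps each cleaned column name to the sentence fragment used for that column. The sketch below reproduces the idea with plain dictionaries. The rows are hypothetical and only mirror the `LendingClub_Column_name` and `Add_words_1` columns, and the helper name `build_phrase_lookup` is an assumption.

```python
# Hypothetical rows mirroring two columns of the data dictionary.
rows = [
    {"LendingClub_Column_name": "acc_now_delinq ", "Add_words_1": "has delinquent accounts."},
    {"LendingClub_Column_name": " emp_title", "Add_words_1": "employment title is"},
    {"LendingClub_Column_name": "loan_condition", "Add_words_1": "loan condition is"},
]

def build_phrase_lookup(rows):
    """Map each column name (spaces removed) to its sentence fragment."""
    return {
        str(r["LendingClub_Column_name"]).replace(" ", ""): r["Add_words_1"]
        for r in rows
    }

phrases = build_phrase_lookup(rows)
print(phrases["loan_condition"])
```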

In [46]:
NLP_LIST = [
'acc_now_delinq',
'annual_income', 'chargeoff_within_12_mths',
'collections_12_mths_ex_med',
'debt_settlement_flag',
'dti',
'emp_length',
'emp_title',
'grade',
'hardship_flag',
'hardship_reason',
'home_ownership',
'interest_rate',
'loan_status',
'mort_acc',
'num_accts_ever_120_pd',
'num_tl_120dpd_2m',
'num_tl_30dpd',
'num_tl_90g_dpd_24m',
'percent_bc_gt_75',
'pub_rec_bankruptcies',
'settlement_status',
'settlement_term',
'tot_coll_amt',
'loan_condition',
'loan_status_InGracePeriod',
'loan_status_ChargedOff',
'loan_status_DMC_ChargedOff',
'member_id',
'year',
'application_type']

In [61]:
### preparing X_train for ELMo

default_test_final = X_train

In [62]:
X_train.shape

(100000, 32)

In [63]:
FirstDataRow = default_test_final.iloc[0]
#data.select_dtypes(exclude=[np.number])
DataRowtemp = list(FirstDataRow)

In [64]:
SecondDataRow = default_test_final.iloc[1]
list(SecondDataRow)

[0.0,
 'medium',
 0.0,
 0.0,
 'N',
 'low',
 'one year',
 'Compliance Manager',
 'medium',
 'N',
 nan,
 'MORTGAGE',
 'highest',
 'Charged Off',
 1.0,
 0.0,
 0.0,
 0.0,
 0.0,
 25.0,
 1.0,
 nan,
 nan,
 3841.0,
 0,
 0,
 1,
 0,
 495244,
 'Sep-2016',
 'Individual',
 2016]

In [65]:
default_test_final

Unnamed: 0,acc_now_delinq,annual_income,chargeoff_within_12_mths,collections_12_mths_ex_med,debt_settlement_flag,dti,emp_length,emp_title,grade,hardship_flag,...,settlement_term,tot_coll_amt,loan_condition,loan_status_InGracePeriod,loan_status_ChargedOff,loan_status_DMC_ChargedOff,member_id,issue_d,application_type,year
495242,0.0,high,0.0,0.0,N,low,four years,lead mold maker,medium,N,...,,0.0,1,0,0,0,495243,Sep-2016,Individual,2016
495243,0.0,medium,0.0,0.0,N,low,one year,Compliance Manager,medium,N,...,,3841.0,0,0,1,0,495244,Sep-2016,Individual,2016
495248,0.0,medium,0.0,0.0,N,low,more than ten years,Management meat department,high,N,...,,82.0,0,0,1,0,495249,Sep-2016,Individual,2016
495250,0.0,low,0.0,0.0,N,medium,two years,Enrichment Manager,medium,N,...,,0.0,1,0,0,0,495251,Sep-2016,Individual,2016
495251,0.0,lowest,0.0,0.0,N,low,three years,Teacher,high,N,...,,0.0,1,0,0,0,495252,Sep-2016,Individual,2016
495253,0.0,high,0.0,0.0,N,low,six years,Owner,high,N,...,,0.0,1,0,0,0,495254,Sep-2016,Individual,2016
495254,0.0,high,0.0,0.0,N,medium,more than ten years,Owner,medium,N,...,,0.0,1,0,0,0,495255,Sep-2016,Individual,2016
495256,0.0,low,0.0,0.0,N,low,more than ten years,Supervisor Assistant,low,N,...,,55.0,1,0,0,0,495257,Sep-2016,Individual,2016
495257,0.0,high,0.0,0.0,N,low,more than ten years,Sr. Business Analyst,medium,N,...,,0.0,1,0,0,0,495258,Sep-2016,Individual,2016
495258,0.0,high,0.0,0.0,N,medium,six years,"Director, Talent Acquisition - HR",medium,N,...,,0.0,1,0,0,0,495259,Sep-2016,Individual,2016


In [66]:
#  IMPORTANT _  DO NOT DELETE
Loans_col_name_row = default_test_final.iloc[0:1]
Loans_col_name_row_list = list(Loans_col_name_row)

In [67]:
Loans_temp_dict = dict(zip(Loans_col_name_row_list,DataRowtemp))


In [68]:
Loans_temp_dict

{'acc_now_delinq': 0.0,
 'annual_income': 'high',
 'chargeoff_within_12_mths': 0.0,
 'collections_12_mths_ex_med': 0.0,
 'debt_settlement_flag': 'N',
 'dti': 'low',
 'emp_length': 'four years',
 'emp_title': 'lead mold maker',
 'grade': 'medium',
 'hardship_flag': 'N',
 'hardship_reason': nan,
 'home_ownership': 'MORTGAGE',
 'interest_rate': 'medium',
 'loan_status': 'Current',
 'mort_acc': 3.0,
 'num_accts_ever_120_pd': 0.0,
 'num_tl_120dpd_2m': 0.0,
 'num_tl_30dpd': 0.0,
 'num_tl_90g_dpd_24m': 0.0,
 'percent_bc_gt_75': 75.0,
 'pub_rec_bankruptcies': 1.0,
 'settlement_status': nan,
 'settlement_term': nan,
 'tot_coll_amt': 0.0,
 'loan_condition': 1,
 'loan_status_InGracePeriod': 0,
 'loan_status_ChargedOff': 0,
 'loan_status_DMC_ChargedOff': 0,
 'member_id': 495243,
 'issue_d': 'Sep-2016',
 'application_type': 'Individual',
 'year': 2016}

In [69]:
Loans_col_name_row_list

['acc_now_delinq',
 'annual_income',
 'chargeoff_within_12_mths',
 'collections_12_mths_ex_med',
 'debt_settlement_flag',
 'dti',
 'emp_length',
 'emp_title',
 'grade',
 'hardship_flag',
 'hardship_reason',
 'home_ownership',
 'interest_rate',
 'loan_status',
 'mort_acc',
 'num_accts_ever_120_pd',
 'num_tl_120dpd_2m',
 'num_tl_30dpd',
 'num_tl_90g_dpd_24m',
 'percent_bc_gt_75',
 'pub_rec_bankruptcies',
 'settlement_status',
 'settlement_term',
 'tot_coll_amt',
 'loan_condition',
 'loan_status_InGracePeriod',
 'loan_status_ChargedOff',
 'loan_status_DMC_ChargedOff',
 'member_id',
 'issue_d',
 'application_type',
 'year']

In [87]:
# UNCOMMENT THIS FOR CREATING X_TEST
default_test_final = X_test
default_test_final.shape

(2000, 32)

In [70]:
####REMOVING THE LIST from the 19 columns and sending as a single list
import re
import nltk

In [88]:
WORD_BOWL_LIST = []
Word_bowl = []
Binary_No_list = [ 0, 'n', 'N',np.nan]
Binary_Yes_list = [1,'y', 'Y']
sp = ' '
DC = ''
pe = '.'


for z in range(len(default_test_final)):   # every row of the current frame (100000 for train, 2000 for test)
    DF_DATA_ROW= default_test_final.iloc[z]   ##### point default_test_final at X_train or X_test before running
    DF_DATA_ROW_LIST = list(DF_DATA_ROW)
    DF_Loans_temp_dict = dict(zip(Loans_col_name_row_list,DF_DATA_ROW_LIST))
    #print(z)

    for i in DF_Loans_temp_dict:  
        D_Loan_col_name = i
        if (DF_Loans_temp_dict['loan_condition'] == 0):
               DC = 'The default customer'
        else:
               DC = 'The customer'
                
        if D_Loan_col_name == 'acc_now_delinq':
            if DF_Loans_temp_dict[D_Loan_col_name] > 0:
                append1 = DC + sp + DD_temp_dict[D_Loan_col_name]
                Word_bowl.append(append1)
            else:
                Word_bowl.append('')
                                        
        elif D_Loan_col_name ==  'annual_income' :
            if DF_Loans_temp_dict[D_Loan_col_name] != '':
                append2 = DC + sp + DD_temp_dict[D_Loan_col_name] + sp + DF_Loans_temp_dict[D_Loan_col_name] + sp + str('annual income.')
                Word_bowl.append(append2)        
            else:
                Word_bowl.append('') 
    
    
        elif D_Loan_col_name ==  'chargeoff_within_12_mths' :
            if DF_Loans_temp_dict[D_Loan_col_name] > 0 :
                append3 = DC + sp + DD_temp_dict[D_Loan_col_name]
                Word_bowl.append(append3)        
            else:
                Word_bowl.append('')       
                                
        elif D_Loan_col_name ==  'collections_12_mths_ex_med' :
            if DF_Loans_temp_dict[D_Loan_col_name] > 0 :
                append4 = DC + sp + DD_temp_dict[D_Loan_col_name] 
                Word_bowl.append(append4)        
            else:
                Word_bowl.append('')
                                       
        elif D_Loan_col_name ==  'debt_settlement_flag' :
            if DF_Loans_temp_dict[D_Loan_col_name] != 'N' :
                append4 = DC + sp + DD_temp_dict[D_Loan_col_name] 
                Word_bowl.append(append4)        
            else:
                Word_bowl.append('')        
                                
        elif D_Loan_col_name ==  'dti' :
            if DF_Loans_temp_dict[D_Loan_col_name] != '' :
                append5 = DC + sp + DD_temp_dict[D_Loan_col_name] + sp + DF_Loans_temp_dict[D_Loan_col_name] + sp + 'debt to income ratio.' 
                Word_bowl.append(append5)        
            else:
                Word_bowl.append('')         
                     
        elif D_Loan_col_name ==  'emp_length' :
            if DF_Loans_temp_dict[D_Loan_col_name] != '' :
                append6 = DC + sp + DD_temp_dict[D_Loan_col_name] + sp + str(DF_Loans_temp_dict[D_Loan_col_name]) + pe
                Word_bowl.append(append6)        
            else:
                Word_bowl.append('')          
                
        elif D_Loan_col_name ==  'emp_title' :
            if DF_Loans_temp_dict[D_Loan_col_name] != '' :
                append7 = DC + sp + DD_temp_dict[D_Loan_col_name] + sp + str(DF_Loans_temp_dict[D_Loan_col_name]) + pe 
                Word_bowl.append(append7)        
            else:
                Word_bowl.append('')      
                
        elif D_Loan_col_name ==  'grade' :
            if DF_Loans_temp_dict[D_Loan_col_name] != '' :
                append8 = DC + sp + DD_temp_dict[D_Loan_col_name] + sp + DF_Loans_temp_dict[D_Loan_col_name] + sp + 'credit score.'  
                Word_bowl.append(append8)        
            else:
                Word_bowl.append('')       
                
        elif D_Loan_col_name ==  'hardship_flag' :  ### has to fix nan values
            if DF_Loans_temp_dict[D_Loan_col_name] == 'Y' :
                append9 = DC + sp + DD_temp_dict[D_Loan_col_name] + sp + DF_Loans_temp_dict[D_Loan_col_name]
                append10 = DC + sp + DD_temp_dict['hardship_reason'] + sp + str(DF_Loans_temp_dict['hardship_reason']) + pe
                Word_bowl.append(append9)
                Word_bowl.append(append10)  
            else:
                Word_bowl.append('')            
                
                
        #elif D_Loan_col_name ==  'hardship_reason' :
            #if DF_Loans_temp_dict[D_Loan_col_name] != '' :
                #append10 = DC + sp + DD_temp_dict[D_Loan_col_name] + sp + str(DF_Loans_temp_dict[D_Loan_col_name]) + sp  
                #Word_bowl.append(append10)        
            #else:
                #Word_bowl.append('')           
                
                
        elif D_Loan_col_name ==  'home_ownership' :  ### has to fix 
            if DF_Loans_temp_dict[D_Loan_col_name] != '' :            
                append11 = DC + sp + DD_temp_dict[D_Loan_col_name] + sp + DF_Loans_temp_dict[D_Loan_col_name] + pe
                Word_bowl.append(append11)        
            else:
                Word_bowl.append('')             
                
    
    
        elif D_Loan_col_name ==  'interest_rate' :
            if DF_Loans_temp_dict[D_Loan_col_name] != '' :
                append12 = DC + sp + DD_temp_dict[D_Loan_col_name] + sp + DF_Loans_temp_dict[D_Loan_col_name] + pe
                Word_bowl.append(append12)        
            else:
                Word_bowl.append('')  
    
    
        elif D_Loan_col_name ==  'loan_status' :
            if DF_Loans_temp_dict[D_Loan_col_name] != '' :
                append13 = DC + sp + DD_temp_dict[D_Loan_col_name] + sp + DF_Loans_temp_dict[D_Loan_col_name] + pe
                Word_bowl.append(append13)        
            else:
                Word_bowl.append('')  
    
    
        elif D_Loan_col_name ==  'mort_acc' :
            if DF_Loans_temp_dict[D_Loan_col_name] != '' :
                append14 = DC + sp + DD_temp_dict[D_Loan_col_name]  
                Word_bowl.append(append14)        
            else:
                Word_bowl.append('')
                
        elif D_Loan_col_name ==  'num_accts_ever_120_pd' :
            if DF_Loans_temp_dict[D_Loan_col_name] != 0 :
                append15 = DC + sp + DD_temp_dict[D_Loan_col_name]  
                Word_bowl.append(append15)        
            else:
                Word_bowl.append('')        
                
                
        elif D_Loan_col_name ==  'num_tl_120dpd_2m' :
            if DF_Loans_temp_dict[D_Loan_col_name] != 0 :
                append16 = DC + sp + DD_temp_dict[D_Loan_col_name]  
                Word_bowl.append(append16)        
            else:
                Word_bowl.append('')        
                       
                
        elif D_Loan_col_name ==  'num_tl_30dpd' :
            if DF_Loans_temp_dict[D_Loan_col_name] != 0 :
                append17 = DC + sp + DD_temp_dict[D_Loan_col_name]  
                Word_bowl.append(append17)        
            else:
                Word_bowl.append('')     
        
        
        elif D_Loan_col_name ==  'num_tl_90g_dpd_24m' :
            if DF_Loans_temp_dict[D_Loan_col_name] != 0 :
                append18 = DC + sp + DD_temp_dict[D_Loan_col_name]  
                Word_bowl.append(append18)        
            else:
                Word_bowl.append('')  
        
        
        elif D_Loan_col_name ==  'pub_rec_bankruptcies' :
            if DF_Loans_temp_dict[D_Loan_col_name] != 0 :
                append19 = DC + sp + 'has' +    sp + DD_temp_dict[D_Loan_col_name]  
                Word_bowl.append(append19)        
            else:
                Word_bowl.append('')           
                
    #Word_bowl.append('NEXT.')    
    teLIST = list(Word_bowl)
    processed =     ''.join(teLIST)
    processed = processed.lower()  
    processed = re.sub('[^a-zA-Z]', ' ', processed )  
    processed = re.sub(r'\s+', ' ', processed)
    WORD_BOWL_LIST.append(processed)   
    Word_bowl = []
    
    
    
#Word_bowl            
#WORD_BOWL_LIST          
        
#Word_bowl            
#DF_WORD_BOWL_LIST                 
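
The long `if`/`elif` chain above applies one small rule per column and then normalizes the joined text. The same steps can be sketched with a rule table. This is an illustration, not the notebook's code: `PHRASES`, `RULES`, `is_missing`, and `row_to_text` are hypothetical names, only three columns are covered, and unlike the loop above this sketch skips missing values instead of writing them into the sentence.

```python
import math
import re

# Hypothetical fragments; in the notebook they come from DD_temp_dict.
PHRASES = {
    "acc_now_delinq": "has delinquent accounts.",
    "annual_income": "has",
    "emp_title": "employment title is",
}

def is_missing(value):
    """True for None, empty strings, and float NaN."""
    return value is None or value == "" or (isinstance(value, float) and math.isnan(value))

# Each rule turns (subject, phrase, value) into a sentence, or "" when nothing applies.
RULES = {
    "acc_now_delinq": lambda s, p, v: f"{s} {p}" if v > 0 else "",
    "annual_income": lambda s, p, v: f"{s} {p} {v} annual income.",
    "emp_title": lambda s, p, v: f"{s} {p} {v}.",
}

def row_to_text(row, subject="The customer"):
    """Build the sentences for one row, then lowercase and strip non-letters."""
    parts = []
    for column, rule in RULES.items():
        value = row.get(column)
        if is_missing(value):
            continue
        parts.append(rule(subject, PHRASES[column], value))
    text = "".join(parts).lower()
    text = re.sub(r"[^a-z]", " ", text)      # punctuation and digits become spaces
    return re.sub(r"\s+", " ", text)         # collapse runs of whitespace

row = {"acc_now_delinq": 0.0, "annual_income": "low", "emp_title": "Chef"}
print(row_to_text(row))
```

Keeping the rules in a dictionary means a new column needs one new entry rather than another branch, and the normalization lives in a single place.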
    

In [136]:
WORD_BOWL_LIST[0:3]

['the customer has low annual income the customer has low debt to income ratio the customer employment length in years is more than ten years the customer employment title is chef the customer has medium credit score the customer home is on rent the customer interest rate is high the customer current loan status is current the customer has mortgage accounts the customer has public record bankruptcies ',
 'the customer has high annual income the customer has medium debt to income ratio the customer employment length in years is more than ten years the customer employment title is postmaster the customer has medium credit score the customer home is on mortgage the customer interest rate is highest the customer current loan status is current the customer has mortgage accounts the customer has public record bankruptcies ',
 'the customer has medium annual income the customer has low debt to income ratio the customer employment length in years is six years the customer employment title is a

In [82]:
X_train_word_bowl_list = WORD_BOWL_LIST

In [83]:
len(X_train_word_bowl_list)

100000

In [90]:
X_test_word_bowl_list = WORD_BOWL_LIST

In [91]:
len(X_test_word_bowl_list)

2000

In [234]:
#len(Default_Word_bowl)

1584

In [68]:

processed_article = X_train_word_bowl_list

In [52]:
import re  
import nltk

In [None]:
###http://www.claudiobellei.com/2018/01/07/backprop-word2vec-python/
#after tokenising, we can do the mapping from text to one-hot encoded context and center words 
#using the function corpus2io 
#which uses the auxiliary function to_categorical (copied from the Keras repository).

In [75]:
documents = all_words

In [76]:
len(documents)

1

In [292]:
#documents

In [172]:
## References
###https://towardsdatascience.com/understand-how-to-transfer-your-paragraph-to-vector-by-doc2vec-1e225ccf102

#https://github.com/makcedward/nlp/blob/master/sample/embeddings/nlp-embeddings-document-doc2vec.ipynb

# Doc2vec impl

# Distributed Bag of Words version of Paragraph Vector (PV-DBOW)
Instead of predicting the next word, PV-DBOW uses a paragraph vector to classify the words in the document. 
During training, it samples a list of words and forms a classifier that predicts whether each word belongs 
to the document, so that the vectors can be learned.


In [76]:
#import Doc2Vec
from gensim.models import Doc2Vec
import logging
import gzip


from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument


logging.basicConfig(format= '%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [77]:
pwd

'/Users/pramodpaul/Documents/nlp/sample'

In [78]:
import sys, os
def add_aion(curr_path=None):
    if curr_path is None:
        dir_path = os.getcwd()
        target_path = os.path.dirname(dir_path)
        if target_path not in sys.path:
            print('Added %s into sys.path.' % (target_path))
            sys.path.insert(0, target_path)
            
add_aion()

Added /Users/pramodpaul/Documents/nlp into sys.path.


In [79]:
import aion

In [80]:
from aion.embeddings.doc2vec import Doc2VecEmbeddings

# import sys

In [81]:
doc2vec_embs = Doc2VecEmbeddings()

In [84]:
## During training, a list of words is sampled and a classifier is formed to decide whether a word
# belongs to the document, so that the vectors can be learned.

Default_train_tokens = doc2vec_embs.build_vocab(documents=X_train_word_bowl_list)
doc2vec_embs.train(Default_train_tokens)

2019-06-11 19:48:40,507 : INFO : using concatenative 6300-dimensional layer1
2019-06-11 19:48:40,508 : INFO : collecting all words and their counts
2019-06-11 19:48:40,510 : INFO : PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
2019-06-11 19:48:40,640 : INFO : PROGRESS: at example #10000, processed 705111 words (5457333/s), 2271 word types, 10000 tags
2019-06-11 19:48:40,754 : INFO : PROGRESS: at example #20000, processed 1419362 words (6316956/s), 3282 word types, 20000 tags
2019-06-11 19:48:40,871 : INFO : PROGRESS: at example #30000, processed 2133021 words (6132532/s), 4083 word types, 30000 tags
2019-06-11 19:48:40,990 : INFO : PROGRESS: at example #40000, processed 2847682 words (6046453/s), 4775 word types, 40000 tags
2019-06-11 19:48:41,107 : INFO : PROGRESS: at example #50000, processed 3562868 words (6164549/s), 5313 word types, 50000 tags
2019-06-11 19:48:41,231 : INFO : PROGRESS: at example #60000, processed 4279320 words (5788131/s), 5847 word types

2019-06-11 19:49:34,289 : INFO : EPOCH 2 - PROGRESS: at 84.53% examples, 82622 words/s, in_qsize 8, out_qsize 0
2019-06-11 19:49:35,325 : INFO : EPOCH 2 - PROGRESS: at 87.32% examples, 81451 words/s, in_qsize 8, out_qsize 0
2019-06-11 19:49:36,325 : INFO : EPOCH 2 - PROGRESS: at 91.10% examples, 81372 words/s, in_qsize 8, out_qsize 0
2019-06-11 19:49:37,345 : INFO : EPOCH 2 - PROGRESS: at 94.27% examples, 80776 words/s, in_qsize 7, out_qsize 0
2019-06-11 19:49:38,350 : INFO : EPOCH 2 - PROGRESS: at 97.47% examples, 80255 words/s, in_qsize 7, out_qsize 0
2019-06-11 19:49:38,922 : INFO : worker thread finished; awaiting finish of 3 more threads
2019-06-11 19:49:38,925 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-06-11 19:49:38,962 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-06-11 19:49:39,004 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-06-11 19:49:39,005 : INFO : EPOCH - 2 : training on 7125443 raw words (

2019-06-11 19:50:41,077 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-06-11 19:50:41,119 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-06-11 19:50:41,148 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-06-11 19:50:41,149 : INFO : EPOCH - 4 : training on 7125443 raw words (2104907 effective words) took 32.5s, 64713 effective words/s
2019-06-11 19:50:42,190 : INFO : EPOCH 5 - PROGRESS: at 4.00% examples, 78865 words/s, in_qsize 7, out_qsize 0
2019-06-11 19:50:43,196 : INFO : EPOCH 5 - PROGRESS: at 8.35% examples, 84671 words/s, in_qsize 7, out_qsize 0
2019-06-11 19:50:44,198 : INFO : EPOCH 5 - PROGRESS: at 13.10% examples, 89614 words/s, in_qsize 7, out_qsize 0
2019-06-11 19:50:45,200 : INFO : EPOCH 5 - PROGRESS: at 17.58% examples, 90831 words/s, in_qsize 7, out_qsize 0
2019-06-11 19:50:46,217 : INFO : EPOCH 5 - PROGRESS: at 21.61% examples, 89484 words/s, in_qsize 7, out_qsize 0
2019-06-11 19:50:47,270 : INFO :

2019-06-11 19:51:42,020 : INFO : EPOCH 2 - PROGRESS: at 10428.74% words, 75592 words/s, in_qsize 7, out_qsize 0
2019-06-11 19:51:43,046 : INFO : EPOCH 2 - PROGRESS: at 14210.32% words, 77565 words/s, in_qsize 7, out_qsize 0
2019-06-11 19:51:44,127 : INFO : EPOCH 2 - PROGRESS: at 18513.50% words, 80192 words/s, in_qsize 7, out_qsize 0
2019-06-11 19:51:45,146 : INFO : EPOCH 2 - PROGRESS: at 22553.85% words, 81786 words/s, in_qsize 7, out_qsize 0
2019-06-11 19:51:46,217 : INFO : EPOCH 2 - PROGRESS: at 25941.52% words, 80231 words/s, in_qsize 7, out_qsize 0
2019-06-11 19:51:47,236 : INFO : EPOCH 2 - PROGRESS: at 29852.18% words, 81001 words/s, in_qsize 7, out_qsize 0
2019-06-11 19:51:48,281 : INFO : EPOCH 2 - PROGRESS: at 33241.97% words, 80119 words/s, in_qsize 7, out_qsize 0
2019-06-11 19:51:49,297 : INFO : EPOCH 2 - PROGRESS: at 36758.75% words, 79975 words/s, in_qsize 7, out_qsize 0
2019-06-11 19:51:50,311 : INFO : EPOCH 2 - PROGRESS: at 40540.88% words, 80404 words/s, in_qsize 7, out_

2019-06-11 19:52:48,286 : INFO : EPOCH 4 - PROGRESS: at 52277.80% words, 88570 words/s, in_qsize 7, out_qsize 0
2019-06-11 19:52:49,297 : INFO : EPOCH 4 - PROGRESS: at 55015.25% words, 86614 words/s, in_qsize 7, out_qsize 0
2019-06-11 19:52:50,340 : INFO : EPOCH 4 - PROGRESS: at 57754.64% words, 84753 words/s, in_qsize 7, out_qsize 0
2019-06-11 19:52:51,365 : INFO : EPOCH 4 - PROGRESS: at 61663.34% words, 84790 words/s, in_qsize 7, out_qsize 0
2019-06-11 19:52:52,372 : INFO : EPOCH 4 - PROGRESS: at 64532.55% words, 83575 words/s, in_qsize 7, out_qsize 0
2019-06-11 19:52:53,376 : INFO : EPOCH 4 - PROGRESS: at 67008.10% words, 82072 words/s, in_qsize 7, out_qsize 0
2019-06-11 19:52:54,377 : INFO : EPOCH 4 - PROGRESS: at 69874.87% words, 81209 words/s, in_qsize 7, out_qsize 0
2019-06-11 19:52:55,401 : INFO : EPOCH 4 - PROGRESS: at 72741.71% words, 80317 words/s, in_qsize 7, out_qsize 0
2019-06-11 19:52:56,410 : INFO : EPOCH 4 - PROGRESS: at 76130.29% words, 80096 words/s, in_qsize 7, out_

2019-06-11 19:53:54,218 : INFO : EPOCH 6 - PROGRESS: at 86955.56% words, 79788 words/s, in_qsize 7, out_qsize 0
2019-06-11 19:53:55,266 : INFO : EPOCH 6 - PROGRESS: at 89563.27% words, 78830 words/s, in_qsize 7, out_qsize 0
2019-06-11 19:53:56,355 : INFO : EPOCH 6 - PROGRESS: at 92824.18% words, 78383 words/s, in_qsize 4, out_qsize 0
2019-06-11 19:53:56,363 : INFO : worker thread finished; awaiting finish of 3 more threads
2019-06-11 19:53:56,384 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-06-11 19:53:56,402 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-06-11 19:53:56,421 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-06-11 19:53:56,422 : INFO : EPOCH - 6 : training on 7125443 raw words (2104402 effective words) took 26.8s, 78517 effective words/s
2019-06-11 19:53:57,460 : INFO : EPOCH 7 - PROGRESS: at 3126.70% words, 67980 words/s, in_qsize 7, out_qsize 0
2019-06-11 19:53:58,516 : INFO : EPOCH 7 - PROGRESS:

2019-06-11 19:54:54,095 : INFO : EPOCH 9 - PROGRESS: at 16427.56% words, 90749 words/s, in_qsize 7, out_qsize 0
2019-06-11 19:54:55,134 : INFO : EPOCH 9 - PROGRESS: at 20729.17% words, 91301 words/s, in_qsize 7, out_qsize 0
2019-06-11 19:54:56,179 : INFO : EPOCH 9 - PROGRESS: at 25159.13% words, 92052 words/s, in_qsize 8, out_qsize 0
2019-06-11 19:54:57,181 : INFO : EPOCH 9 - PROGRESS: at 29330.97% words, 92272 words/s, in_qsize 7, out_qsize 0
2019-06-11 19:54:58,196 : INFO : EPOCH 9 - PROGRESS: at 33502.34% words, 92343 words/s, in_qsize 7, out_qsize 0
2019-06-11 19:54:59,200 : INFO : EPOCH 9 - PROGRESS: at 37280.94% words, 91617 words/s, in_qsize 7, out_qsize 0
2019-06-11 19:55:00,304 : INFO : EPOCH 9 - PROGRESS: at 40149.95% words, 88108 words/s, in_qsize 7, out_qsize 0
2019-06-11 19:55:01,359 : INFO : EPOCH 9 - PROGRESS: at 43150.55% words, 85939 words/s, in_qsize 7, out_qsize 0
2019-06-11 19:55:02,381 : INFO : EPOCH 9 - PROGRESS: at 45627.71% words, 83344 words/s, in_qsize 7, out_

In [85]:
# Encode the training and test documents with the trained doc2vec model.
X_train = X_train_word_bowl_list



In [86]:
X_train_t = doc2vec_embs.encode(documents=X_train)



In [92]:
X_test = X_test_word_bowl_list

In [93]:
X_test_t = doc2vec_embs.encode(documents=X_test)

# Model testing & evaluation 

In [94]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(solver='newton-cg', max_iter=1000)

model.fit(X_train_t, y_train)




LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=1000, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='newton-cg',
          tol=0.0001, verbose=0, warm_start=False)

In [95]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

y_pred = model.predict(X_test_t)

print('Accuracy:%.2f%%' % (accuracy_score(y_test, y_pred)*100))
print('Classification Report:')
print(classification_report(y_test, y_pred))



Accuracy:77.00%
Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.54      0.70      1000
           1       0.69      1.00      0.81      1000

   micro avg       0.77      0.77      0.77      2000
   macro avg       0.84      0.77      0.76      2000
weighted avg       0.84      0.77      0.76      2000



In [97]:
from sklearn.metrics import f1_score
print(f1_score(y_test, y_pred))

0.8127035830618893
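
The accuracy and F1 figures above are linked through the confusion matrix. The sketch below recomputes them in plain Python, treating class 1 as the positive class, which is the default for `f1_score`. The counts `tp`, `fn`, `tn` and `fp` are hypothetical: they were chosen only to be consistent with the rounded report and are not values read from the notebook.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 for the positive class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts (assumption): 1000 samples per class, as in the report.
tp, fn = 998, 2      # class 1 samples predicted as 1 / as 0
tn, fp = 542, 458    # class 0 samples predicted as 0 / as 1

accuracy, precision, recall, f1 = binary_metrics(tp, fp, fn, tn)
print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

With counts like these, almost half of the class 0 loans are predicted as class 1, which is why accuracy sits well below the class 1 recall.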


# ELMo Implementation

In [98]:
#  ELMO
import pandas as pd
import numpy as np
import spacy
from tqdm import tqdm
import re
import time
import pickle
pd.set_option('display.max_colwidth', 200)

In [58]:
pip install tensorflow_hub

Note: you may need to restart the kernel to use updated packages.


In [59]:
pip install "tensorflow>=1.13.0"

Note: you may need to restart the kernel to use updated packages.


In [124]:
import tensorflow_hub as hub
import tensorflow as tf



W0605 19:27:09.199778 140736696550336 __init__.py:56] Some hub symbols are not available because TensorFlow version is less than 1.14


In [125]:


elmo = hub.Module("https://tfhub.dev/google/elmo/2", trainable=True)

I0605 19:27:14.111370 140736696550336 resolver.py:79] Using /var/folders/dt/25s994dd6ml5f913nxyr3hbw0000gn/T/tfhub_modules to cache modules.
I0605 19:27:14.121935 140736696550336 resolver.py:398] Downloading TF-Hub Module 'https://tfhub.dev/google/elmo/2'.
I0605 19:27:46.422348 140736696550336 resolver.py:122] Downloading https://tfhub.dev/google/elmo/2: 20.35MB
I0605 19:28:11.755907 140736696550336 resolver.py:122] Downloading https://tfhub.dev/google/elmo/2: 30.35MB
I0605 19:28:27.824897 140736696550336 resolver.py:122] Downloading https://tfhub.dev/google/elmo/2: 40.35MB
I0605 19:29:05.756186 140736696550336 resolver.py:122] Downloading https://tfhub.dev/google/elmo/2: 60.35MB
I0605 19:29:21.574662 140736696550336 resolver.py:122] Downloading https://tfhub.dev/google/elmo/2: 70.35MB
I0605 19:29:41.386104 140736696550336 resolver.py:122] Downloading https://tfhub.dev/google/elmo/2: 80.35MB
I0605 19:30:09.999058 140736696550336 resolver.py:122] Downloading https://tfhub.dev/google/elm


W0605 19:41:31.387362 140736696550336 deprecation.py:323] From /Users/pramodpaul/anaconda3/envs/tf36/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py:3632: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.


In [126]:
# An arbitrary example sentence
x = ["Roasted ants are a popular snack in Colombia"]

# Extract ELMo features 
embeddings = elmo(x, signature="default", as_dict=True)["elmo"]

embeddings.shape

INFO:tensorflow:Saver not created because there are no variables in the graph to restore



TensorShape([Dimension(1), Dimension(8), Dimension(1024)])

In [None]:
# Reference: https://www.analyticsvidhya.com/blog/2019/03/learn-to-use-elmo-to-extract-features-from-text/

In [97]:
y = ["Roasted ants are a popular snack in Colombia"]

In [90]:
def elmo_vectors(x):
    """Return one mean-pooled ELMo vector per sentence in `x` (anything with .tolist())."""
    embeddings = elmo(x.tolist(), signature="default", as_dict=True)["elmo"]

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(tf.tables_initializer())
        # Average the token vectors over the token axis (axis 1)
        return sess.run(tf.reduce_mean(embeddings, 1))

In [94]:
def No_List_elmo_vectors(x):
    """Same as elmo_vectors, but `x` is already a plain list of sentences."""
    NL_embeddings = elmo(x, signature="default", as_dict=True)["elmo"]

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(tf.tables_initializer())
        # Average the token vectors over the token axis (axis 1)
        return sess.run(tf.reduce_mean(NL_embeddings, 1))
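
Both helpers reduce the `[batch, tokens, 1024]` ELMo output to one vector per sentence with `tf.reduce_mean(..., 1)`. A minimal NumPy sketch of that pooling step follows; the tiny array sizes and the `lengths` argument are assumptions made for illustration. It also shows a length-aware variant, because a plain mean over axis 1 includes padded positions when the sentences in a batch differ in length.

```python
import numpy as np

def mean_pool(token_vectors):
    """Average over the token axis: (batch, tokens, dim) -> (batch, dim)."""
    return token_vectors.mean(axis=1)

def masked_mean_pool(token_vectors, lengths):
    """Average only the first lengths[i] token vectors of each sentence."""
    _, tokens, _ = token_vectors.shape
    # mask[i, t] is 1.0 for real tokens and 0.0 for padding
    positions = np.arange(tokens)[None, :]
    mask = (positions < np.asarray(lengths)[:, None]).astype(token_vectors.dtype)
    summed = (token_vectors * mask[:, :, None]).sum(axis=1)
    return summed / mask.sum(axis=1, keepdims=True)

# Toy batch (assumption): 2 sentences, up to 4 tokens, 3 dimensions instead of 1024.
batch = np.ones((2, 4, 3), dtype=np.float32)
batch[1, 2:, :] = 0.0          # second sentence: 2 real tokens, then padding
lengths = [4, 2]

plain = mean_pool(batch)                   # padding drags the second row down
masked = masked_mean_pool(batch, lengths)  # padding is ignored
print(plain.shape, masked.shape)
```

When every input is a single sentence, as in the cells above, the two variants give the same result; the difference only matters for batches of mixed length.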


In [96]:
#No_List_elmo_vectors(y)

In [99]:
X_train_word_bowl_list[0]

'the customer has high annual income the customer has low debt to income ratio the customer employment length in years is four years the customer employment title is lead mold maker the customer has medium credit score the customer home is on mortgage the customer interest rate is medium the customer current loan status is current the customer has mortgage accounts the customer has public record bankruptcies '

In [175]:
X_train_final_elmo_vector_list = []

In [126]:
processed_article = ''

In [124]:
import re
import nltk

In [142]:
#X_train_final_elmo_vector_list     ### 1 row only here

[array([[-0.4923636 , -0.38836855, -0.3297606 , ..., -0.38820732,
          0.08969769,  0.30809253],
        [-0.87431043, -0.3207273 , -0.23001891, ..., -0.19320996,
          0.13567975,  0.04842685],
        [-0.17113915,  0.14358462, -0.0649466 , ..., -0.05151573,
          0.01388504,  0.13684303],
        ...,
        [-0.17113926,  0.14358456, -0.06494633, ..., -0.05151572,
          0.01388501,  0.13684303],
        [ 0.12321753, -0.32549638, -0.46529433, ...,  0.34293908,
          0.05647948,  0.3826136 ],
        [-0.02840841, -0.04353214,  0.04130161, ...,  0.02583167,
         -0.01429833, -0.0165042 ]], dtype=float32)]

In [99]:
X_train_elmo_embed_list = []

In [182]:
len(X_train_word_bowl_list)

500

In [101]:
X_train_final_elmo_vector_list = []

In [102]:
# Slices are half-open: [0:40] covers rows 0 to 39, and [40:100] covers rows 40 to 99.

In [None]:
for item in X_train_word_bowl_list[0:50]:
    # NOTE: list(item) splits the string into single characters, so each
    # character is passed to ELMo as its own input; [item] would pass the
    # whole description as one sentence.
    embed = elmo(list(item), signature="default", as_dict=True)["elmo"]
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(tf.tables_initializer())
        # Average over the token axis (axis 1)
        vector_l = sess.run(tf.reduce_mean(embed, 1))
    X_train_final_elmo_vector_list.append(vector_l)
    print(len(X_train_final_elmo_vector_list))

INFO:tensorflow:Saver not created because there are no variables in the graph to restore



In [186]:
print(len(X_train_final_elmo_vector_list))


50


In [187]:
X_train_final_elmo_vector_list[0]

array([[-0.4923636 , -0.38836855, -0.3297606 , ..., -0.38820732,
         0.08969769,  0.30809253],
       [-0.87431043, -0.3207273 , -0.23001891, ..., -0.19320996,
         0.13567975,  0.04842685],
       [-0.17113915,  0.14358462, -0.0649466 , ..., -0.05151573,
         0.01388504,  0.13684303],
       ...,
       [-0.17113926,  0.14358456, -0.06494633, ..., -0.05151572,
         0.01388501,  0.13684303],
       [ 0.12321753, -0.32549638, -0.46529433, ...,  0.34293908,
         0.05647948,  0.3826136 ],
       [-0.02840841, -0.04353214,  0.04130161, ...,  0.02583167,
        -0.01429833, -0.0165042 ]], dtype=float32)
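
The multi-row array for a single loan description is what one would expect if the description reached ELMo as many separate inputs. In plain Python, `list()` applied to a string yields its individual characters, whereas wrapping the string in brackets yields a one-element list. A small sketch with a made-up `description` string (an assumption, shortened from the sample above):

```python
description = "the customer has high annual income"

as_characters = list(description)   # one entry per character
as_document = [description]         # one entry holding the whole text

print(len(as_characters), as_characters[:5])
print(len(as_document))
```

Passing a list like `as_document` to `elmo(...)` would be expected to treat the description as a single sentence, so mean pooling would then return one 1024-dimensional row per loan.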

In [63]:
import dill
import pandas as pd
import pickle

In [99]:
# Load the ELMo vectors for training rows 350 to 399 (per the file name)
with open("X_train_elmo_vector_list_350to400.pickle", "rb") as pickle_in:
    X_train_Elmo_350to400 = pickle.load(pickle_in)

In [101]:
# Load the ELMo vectors for training rows 300 to 349 (per the file name)
with open("X_train_elmo_vector_list_300to350.pickle", "rb") as pickle_in:
    X_train_Elmo_300to350 = pickle.load(pickle_in)

In [103]:
# Load the ELMo vectors for training rows 400 to 449 (per the file name)
with open("X_train_elmo_vector_list_400to450.pickle", "rb") as pickle_in:
    X_train_Elmo_400to450 = pickle.load(pickle_in)

In [102]:
# Load the ELMo vectors for training rows 450 to 499 (per the file name)
with open("X_train_elmo_vector_list_450to500.pickle", "rb") as pickle_in:
    X_train_Elmo_450to500 = pickle.load(pickle_in)

In [100]:
# Load the pickled ELMo vectors for the first 150 training rows (per the file name)
with open("X_train_first150.pickle", "rb") as pickle_in:
    X_train_Elmo_first150 = pickle.load(pickle_in)


In [120]:
# Load the ELMo vectors for the first 100 test rows (per the file name)
with open("X_test_elmo_vector_list_first100.pickle", "rb") as pickle_in:
    X_test_Elmo_new = pickle.load(pickle_in)
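
The vectors above were computed in slices and stored in one pickle file per slice. A minimal sketch of that save-and-reload pattern is shown below. The helper names `save_chunks` and `load_chunks`, the file-name pattern and the toy data are assumptions, and a temporary directory stands in for the working directory.

```python
import os
import pickle
import tempfile

def save_chunks(items, chunk_size, directory, prefix):
    """Pickle items[start:stop] into one file per chunk and return the paths."""
    paths = []
    for start in range(0, len(items), chunk_size):
        stop = min(start + chunk_size, len(items))
        path = os.path.join(directory, f"{prefix}_{start}to{stop}.pickle")
        with open(path, "wb") as handle:
            pickle.dump(items[start:stop], handle)
        paths.append(path)
    return paths

def load_chunks(paths):
    """Load the chunks in order and join them back into one list."""
    items = []
    for path in paths:
        with open(path, "rb") as handle:
            items.extend(pickle.load(handle))
    return items

# Toy stand-in for a list of per-document vectors (assumption).
vectors = [[float(i), float(i) + 0.5] for i in range(120)]

with tempfile.TemporaryDirectory() as tmp_dir:
    chunk_paths = save_chunks(vectors, 50, tmp_dir, "X_train_elmo_vector_list")
    restored = load_chunks(chunk_paths)

print(len(chunk_paths), len(restored))
```

Generating the paths from the slice bounds keeps the chunks in order and makes a missing slice easy to spot before the pieces are concatenated.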

## Train and test with ELMo word vectors

In [129]:
y_train = Train['loan_condition']

In [121]:
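# Once we have all the vectors, concatenate them into a single array
# with np.concatenate((a, b), axis=0).
# NOTE: no chunk for rows 150 to 299 is loaded above, and X_test_Elmo_new
# (test-set vectors, per its file name) is appended to the training data here.
X_train_Elmo_new = np.concatenate((X_train_Elmo_first150, X_train_Elmo_300to350, X_train_Elmo_350to400,
                                   X_train_Elmo_400to450, X_train_Elmo_450to500, X_test_Elmo_new), axis=0)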

In [123]:
# Join the per-document blocks along axis 0 so that every row vector becomes one row of a 2-D array
elmo_train_new = np.concatenate(X_train_Elmo_new, axis=0)

In [124]:
len(elmo_train_new)

8550

In [125]:
elmo_train_new[0]

array([-0.02840841, -0.04353216,  0.04130161, ...,  0.02583167,
       -0.01429834, -0.0165042 ], dtype=float32)

In [133]:
X_train_Elmo_new[0]

array([[-0.02840841, -0.04353216,  0.04130161, ...,  0.02583167,
        -0.01429834, -0.0165042 ],
       [-0.2682776 ,  0.13302805, -0.23178042, ..., -0.05466669,
         0.22285652,  0.11577968],
       [-0.02840841, -0.04353216,  0.04130161, ...,  0.02583167,
        -0.01429834, -0.0165042 ],
       ...,
       [-0.02840841, -0.04353216,  0.04130161, ...,  0.02583167,
        -0.01429834, -0.0165042 ],
       [-0.02840841, -0.04353216,  0.04130161, ...,  0.02583167,
        -0.01429834, -0.0165042 ],
       [-0.14906923,  0.02673437, -0.16956924, ...,  0.11148135,
         0.05151133, -0.17183973]], dtype=float32)

In [122]:
len(X_train_Elmo_new)

450

In [135]:
elmo_train_new[0]

array([-0.02840841, -0.04353216,  0.04130161, ...,  0.02583167,
       -0.01429834, -0.0165042 ], dtype=float32)

In [127]:
elmo_train_new.shape

(8550, 1024)
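
Concatenating a list of 2-D arrays along axis 0 stacks all of their rows, so the row count of the result is the total number of rows, not the number of documents. The sketch below uses toy sizes (450 arrays of 19 rows and 8 columns, an assumption chosen because 450 × 19 = 8550) to contrast that with averaging each array first, which gives one row per document and therefore lines up with one label per document.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in (assumption): 450 documents, each with 19 row vectors of width 8.
per_document = [rng.normal(size=(19, 8)).astype(np.float32) for _ in range(450)]

# Stacking every row, as np.concatenate(..., axis=0) does.
stacked = np.concatenate(per_document, axis=0)

# Alternative: average each document's rows first, giving one row per document.
pooled = np.stack([rows.mean(axis=0) for rows in per_document])

print(stacked.shape)  # (8550, 8)
print(pooled.shape)   # (450, 8)
```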

In [126]:
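# NOTE: this takes the labels of the first 8550 loans, while the 8550 feature
# rows come from 450 arrays with several rows each, so features and labels
# do not appear to be aligned one-to-one.
y_train_t = y_train.head(8550)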


In [135]:
y_train_t.shape

(8550,)

In [128]:
from sklearn.model_selection import train_test_split

xtrain, xtest, ytrain, ytest = train_test_split(elmo_train_new, y_train_t, random_state=42, test_size=0.2)


# Model Building and Evaluation

In [129]:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score


In [130]:
lreg = LogisticRegression()

In [131]:
lreg.fit(xtrain, ytrain)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [132]:
lreg_preds = lreg.predict(xtest)

In [133]:
print(f1_score(ytest, lreg_preds))

0.829051044878383


In [134]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report


print('Accuracy:%.2f%%' % (accuracy_score(ytest, lreg_preds)*100))
print('Classification Report:')
print(classification_report(ytest,lreg_preds))

Accuracy:70.82%
Classification Report:
              precision    recall  f1-score   support

           0       0.50      0.00      0.00       499
           1       0.71      1.00      0.83      1211

   micro avg       0.71      0.71      0.71      1710
   macro avg       0.60      0.50      0.42      1710
weighted avg       0.65      0.71      0.59      1710
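
A useful reference point for the report above is the accuracy of always predicting the larger class. The short sketch below computes that baseline from the support column (499 and 1211); the function name `majority_baseline_accuracy` is an assumption.

```python
def majority_baseline_accuracy(class_counts):
    """Accuracy obtained by always predicting the most frequent class."""
    total = sum(class_counts.values())
    return max(class_counts.values()) / total

# Support values taken from the classification report above.
support = {0: 499, 1: 1211}

baseline = majority_baseline_accuracy(support)
print(f"Majority-class baseline accuracy: {baseline * 100:.2f}%")
```

The baseline rounds to the same 70.82% as the reported accuracy, which, together with the 0.00 recall for class 0, indicates that the model predicts class 1 for almost every sample.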

