# ROADMAP

**[<font color='green'>1. Import Required APIs and Data</font>](#link1)**
<br>**[<font color='green'>2. Deduplication</font>](#link2)**
- [Function: Duplicates( )](#link2.1)
- [Execution: Duplicates( )](#link2.2)

<br>**[<font color='green'>3. Handle Multiple Query Times (ctime)</font>](#link3)**
- [Function: Ctime( )](#link3.1)
- [Execution: Ctime( )](#link3.2)

<br>**[<font color='green'>4. Preliminary Data Report</font>](#link4)**
- [Function: Exploration( )](#link4.1)
- [Execution: Exploration( )](#link4.2)

<br>**[<font color='green'>5. Features Derived from Missing Values and Outliers</font>](#link5)**

<br>**[<font color='green'>6. Drop Features with Too Many Missing Values</font>](#link6)**
- [Class: DropNA( )](#link5.1)
- [Execution: DropNA( )](#link5.2)

<br>**[<font color='darkred'>6. Feature Engineering</font>](#link6)**
- [6.0 ID, Observation Point and Label(y)](#link6.0)
    - Extract ID, Observation Point and y from JIBEN_sheet
    - Merge with Loan, Card and Stdcard   
    
- [6.1 Classify the Feature Types](#link6.1)
    - Check time range and the corresponding samples
    - Classification
- [6.2 Loan Features Deep Mining](#link6.2)    
- [6.3 Fill NA](#link6.3)
    - dates & months: 1900.01.01 or 1900.01
    - amounts: 0
    - counts: ?
    - types: ?
- [6.4 Generate](#link6.4)
    - nunique of financeorg
    - whether `financeorg` contains 浦发
    - latest24state
    - sum of amounts
- [6.5 Cross \& Dot Products](#link6.5)
- [6.6 Time Filter](#link6.5)

**<font color='red'>OUTLIERS IN JIBEN</font>**

<a id='link1'>**<font color='green'>1. Import API, Read Data and Reduce Memory</font>**</a>

In [4]:
import pandas as pd
import numpy as np
import time
import math
import h5py
import missingno as msn
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 100)

import sklearn
# sklearn.show_versions()

import Memory as mm

In [5]:
import multiprocessing as mp

In [6]:
start = time.time()

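# Rows 779441-779508 are skipped and any remaining malformed lines are dropped.
# Note: error_bad_lines was deprecated in pandas 1.3 and removed in 2.0
# (on_bad_lines='skip' is the replacement there).
jiben = pd.read_csv('refer/jiben_vars_new_20200424.csv',skiprows=list(range(779441,779509)),
                     encoding="ISO-8859-1",error_bad_lines=False,low_memory=False)
jiben = mm.reduce_mem_usage(jiben)
# Map the least frequent value of 'bad' and all missing labels to 'N'
strange_bad = jiben['bad'].value_counts().index[-1]
jiben['bad'] = jiben['bad'].replace({strange_bad:'N'}).fillna('N')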

# card = pd.read_csv('refer/kw_poc_card_20200424.csv')
# card = mm.reduce_mem_usage(card)

loan = pd.read_csv('refer/kw_poc_loan_20200424.csv')
loan = mm.reduce_mem_usage(loan)

end = time.time()
print('time cost {}seconds'.format(np.round(end-start,2)))

Reduce Memory Usage Function Reports:
Memory usage of dataframe is 1259.57 MB
Memory usage after optimization is: 869.85 MB
Decreased by 30.9%
 
Reduce Memory Usage Function Reports:
Memory usage of dataframe is 1448.35 MB
Memory usage after optimization is: 1024.19 MB
Decreased by 29.3%
 
time cost 64.83seconds
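
The `Memory` module behind `mm.reduce_mem_usage` is not shown in this notebook. As a rough illustration of how a reduction of this kind can be obtained, the sketch below downcasts numeric columns to the smallest dtype that holds their values. The function name `downcast_numeric` and the toy frame are hypothetical, and the real module may work differently.

```python
import numpy as np
import pandas as pd

def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for mm.reduce_mem_usage: downcast numeric columns."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            # e.g. int64 -> int16 when the value range allows it
            out[col] = pd.to_numeric(out[col], downcast='integer')
        elif pd.api.types.is_float_dtype(out[col]):
            # float64 -> float32; this trades precision for memory
            out[col] = pd.to_numeric(out[col], downcast='float')
    return out

demo = pd.DataFrame({'cnt': np.arange(1000, dtype='int64'),
                     'amt': np.linspace(0.0, 1.0, 1000)})
small = downcast_numeric(demo)

before = demo.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print('Decreased by {:.1f}%'.format(100 * (before - after) / before))
```

Downcasting floats loses precision, so amount columns that feed ratios or standard deviations may be better left as `float64`.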


<a id='link2'>**<font color='green'>2. Check, Drop and Save Duplicated Samples</font>**</a>
- <a id='link2.1'>2.1 Function: Duplicates( )</a>

In [None]:
def Duplicates(i,return_dict,df,job_name):
    
    start = time.time()
    
    # Flag rows that fully duplicate an earlier row (the first occurrence is kept)
    dup_samples_cond = df.duplicated(keep='first')
    dup_samples = df.loc[dup_samples_cond]
    dup_sample_cnt = dup_samples.shape[0]
    
    # Drop the duplicates; they are returned separately so they can be saved
    df = df.loc[~dup_samples_cond]
    
    print('abandon completely duplicated samples')
    print('find {} duplicate rows'.format(dup_sample_cnt))
    
    end = time.time()
    print('{} duplicates processing finished, time cost {}seconds'.format(job_name,np.round(end-start,2)))
    
    # Results go back through the shared dict when run in a subprocess
    return_dict[i]=[df,dup_samples]
    
    return df,dup_samples

<a id='link2'>**<font color='green'>2. Check, Drop and Save Duplicated Samples</font>**</a>
- <a id='link2.2'>2.2 Execution: Duplicates( )</a>
    - **Multiprocessing**
    - **about 10~12min**

In [None]:
start = time.time()
all_sheets = [loan]
sheets_name = ['loan']
if __name__ == '__main__':
    manager = mp.Manager()
    return_dict = manager.dict()
    jobs = []
    for i in range(1):
        p = mp.Process(target=Duplicates, args=(i,return_dict,all_sheets[i],sheets_name[i]))
        jobs.append(p)
        p.start()

    for proc in jobs:
        proc.join()
        
end = time.time()
print('total time cost of duplicates processing takes {}seconds'.format(np.round(end-start,2)))

In [None]:
loan_dup = return_dict[0]
del loan

In [None]:
loan_dup[0].shape

<a id='link3'>**<font color='green'>3. Deal With Multiple Ctime</font>**</a>
- <a id='link3.1'>3.1 Function: Ctime( )</a>

In [None]:
def ctime(procnum,return_dict,job_name,df,
          id_name='id',index_name='index'):
    
    start = time.time()
    # find all ids have multiple ctimes
    group = df.groupby(id_name)['ctime'].nunique()
    cond = group>1
    ids = group.loc[cond].index
    
    
    for i in ids:
        cond = df[id_name]==i
        
        process_df = df.loc[cond]
        df.drop(process_df.index,inplace=True)
        process_cond = process_df.sort_values(by=['ctime',index_name],
                                              ascending=False).duplicated(subset=index_name,
                                                                          keep='first')==False
        process_df = process_df.loc[process_cond]
        df = pd.concat([df,process_df],axis=0)
        
    return_dict[procnum] = df
    
    end = time.time()
    print('dealing ctime of {} takes {}seconds'.format(job_name,np.round(end-start,2)))
    
    return df
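
The loop in `ctime` filters, drops and concatenates the whole frame once per ID, which is probably why the execution step takes several minutes. A vectorized sketch of the same rule (for IDs with more than one `ctime`, keep only the latest row per `index`) is shown below. The function name `keep_latest_ctime` and the toy data are hypothetical, and the result has not been compared with the loop on the real data.

```python
import pandas as pd

def keep_latest_ctime(df, id_name='id', index_name='index'):
    # Number of distinct ctime values per id, broadcast back to every row
    n_ctime = df.groupby(id_name)['ctime'].transform('nunique')
    multi = df.loc[n_ctime > 1]
    single = df.loc[~(n_ctime > 1)]  # rows of ids with one ctime stay untouched

    # Latest ctime first, then keep the first row per (id, index) pair
    latest = (multi.sort_values(by=['ctime', index_name], ascending=False)
                   .drop_duplicates(subset=[id_name, index_name], keep='first'))
    return pd.concat([single, latest], axis=0)

demo = pd.DataFrame({
    'id':    [1, 1, 1, 2, 2],
    'index': [10, 10, 11, 20, 20],
    'ctime': ['2020.01.01', '2020.02.01', '2020.02.01', '2020.03.01', '2020.03.01'],
})
result = keep_latest_ctime(demo)
print(result.sort_index())
```

In the toy frame, ID 1 has two query times, so the older row for `index` 10 is removed. ID 2 has a single query time and keeps both of its rows, as in the loop version.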

<a id='link3'>**<font color='green'>3. Deal With Multiple Ctime</font>**</a>
- <a id='link3.2'>3.2 Execution: Ctime( )</a>
- **about 8~10 mins**

In [None]:
start = time.time()
all_sheets = [card_dup[0]]
sheets_name = ['card']
if __name__ == '__main__':
    manager = mp.Manager()
    return_dict = manager.dict()
    jobs = []
    for i in range(1):
        p = mp.Process(target=ctime, args=(i,return_dict,sheets_name[i],all_sheets[i]))
        jobs.append(p)
        p.start()
    
    for proc in jobs:
        proc.join()
        
end = time.time()
print('total time cost of ctime processing spend {}seconds'.format(np.round(end-start,2)))

In [None]:
ctime_card = return_dict[0]

In [None]:
card_dup[0].shape

In [None]:
del card_dup
del return_dict,all_sheets

In [None]:
ctime_card.shape

<font color='orange'>**Loan Features Classification**</font>

In [7]:
keys_loan=['id']
types_loan = ['financeorg','financetype','type','currency','guaranteetype']
dates_loan = ['opendate','enddate','stateenddate','stateendmonth',
         'scheduledpaymentdate','recentpaydate']
months_loan = ['latest24monthpaymentbeginmonth','latest24monthpaymentendmonth',
          'latest5yearoverduebeginmonth','latest5yearoverdueendmonth']
counts_loan = ['paymentrating','paymentcyc','remainpaymentcyc','curroverduecyc']
states_loan = ['state','class5state','latest24state']
amounts_loan = ['creditlimitamount','balance','scheduledpaymentamount',
           'actualpaymentamount','curroverdueamount','overdue31to60amount',
           'overdue61to90amount','overdue91to180amount','overdueover180amount']

<font color='orange'>**Card Features Classification**</font>

In [8]:
keys_card = ['id']
types_card = ['financeorg','financetype','currency','guaranteetype']
dates_card = ['opendate','stateenddate','stateendmonth','scheduledpaymentdate','recentpaydate']
months_card = ['latest24monthpaymentbeginmonth','latest24monthpaymentendmonth',
          'latest5yearoverduebeginmonth','latest5yearoverdueendmonth']
counts_card = ['curroverduecyc']
states_card = ['state','latest24state']
amounts_card = ['creditlimitamount','sharecreditlimitamount','usedcreditlimitamount',
           'latest6monthusedavgamount','usedhighestamount','scheduledpaymentamount',
           'actualpaymentamount','curroverdueamount','overdue31to60amount',
           'overdue61to90amount','overdue91to180amount','overdueover180amount']

<a id='link4'>**<font color='green'>4. Preliminary Data Report</font>**</a>
- <a id='link4.1'>4.1 Function: Exploration( )</a>

In [None]:
def Exploration(df,almost_empty=0.9):

    start = time.time()
    
    n,m = df.shape[0],df.shape[1] 
    dtypes = pd.DataFrame(df.dtypes)
    
    # Missing Values, Nunique, Data Type
    dtypes.columns = ['data types']
    null = pd.DataFrame(df.isnull().sum())
    null.columns = ['none cnt']
    null['none percentage'] = (null['none cnt']/n).apply(lambda x: "{0:.2f}%".format(x * 100))
    null['empty'] = 1*(null['none cnt']==n)
    null['>='+str(int(almost_empty*100))+'% empty']=1*(null['none cnt']/n>=almost_empty)
    
    nunique = pd.DataFrame(df.nunique())
    nunique.columns = ['n_unique']
    
    exploration = pd.concat([dtypes,null,nunique],axis=1)
    
    
    
    end = time.time()
    
    print('function finished one time, time cost {}seconds'.format(np.round(end-start,2)))
    
#   return [exploration,total_dup_cnt,unique_dup_cnt]
    return exploration

<a id='link4'>**<font color='green'>4. Data Report</font>**</a>
- <a id='link4.2'>4.2 Execution: Exploration( )</a>

In [None]:
jiben_ep = Exploration(jiben)
card_ep = Exploration(ctime_card)
loan_ep = Exploration(ctime_loan)
stdcard_ep = Exploration(ctime_stdcard)
query_ep = Exploration(ctime_query)

<a id='link6'>**<font color='darkred'>6. Feature Engineering</font>**</a>
- <a id='link6.0'>**6.0 Extract ID, Observation Point and Label**</a>
    - **Extract from Jiben sheet**
    - **Merge with Loan, Card and Stdcard**
    
<br><font color='red'>**Warning: `how='inner'` is used only for the small-sample version. For the full data set, use `how='right'` or `how='left'`.**</font>
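
To make the difference concrete, here is a minimal sketch on toy data (the frames `labels` and `records` are made up for illustration). An inner merge silently drops records whose `id` has no label row, while a right merge keeps them with a missing `bad` value that can be inspected or filtered later.

```python
import pandas as pd

labels = pd.DataFrame({'id': [1, 2, 3], 'bad': ['G', 'B', 'N']})
records = pd.DataFrame({'id': [1, 1, 4], 'balance': [100.0, 50.0, 70.0]})

# inner: only ids present on both sides survive
inner = pd.merge(left=labels, right=records, how='inner', on='id')

# right: every record row survives; indicator=True adds a _merge column
right = pd.merge(left=labels, right=records, how='right', on='id', indicator=True)

print(inner)
print(right)
```

With `how='right'`, the later filter `bad.isin(['G','B','N'])` then decides explicitly which unmatched records to discard.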

In [9]:
observation_point = jiben[['id','bad','nasrdw_recd_date']]
loan_merged = pd.merge(right = loan, 
                       left = observation_point,
                       how='right',
                       on='id')

In [None]:
observation_point = jiben[['id','bad','nasrdw_recd_date']]
card_merged = pd.merge(right = card,
                       left = observation_point,
                       how='right',
                       on='id')
# del ctime_card

In [10]:
cond1 = loan_merged['bad'].isin(['G','B','N'])
loan_filtered = loan_merged.loc[cond1]

In [11]:
loan_filtered.shape

(4351772, 37)

In [12]:
del loan_merged

<a id='link5'>**<font color='green'>5. Features Derived from Missing Values and Outliers</font>**</a>

In [None]:
def NA_OL_Records_Detailed(df,target_cols):
    
    start = time.time()
        
    df['ol_neg'] = (df[target_cols].fillna(0)<0).sum(axis=1)
    df['na_record'] = df.isnull().sum(axis=1)
    
    cols = target_cols[:7]
    
    thres = list(df[cols].fillna(0).quantile(0.99).values)
    print(thres)
    
    ol_mark = 1*(df[cols].fillna(0)>=thres)

    ol_cols = [i+'_99ol' for i in cols]

    df[ol_cols] = ol_mark
    
    ol_records_g = df.groupby('id')[['na_record']+['ol_neg']+ol_cols].sum()
    
    values = pd.DataFrame(ol_records_g.values)
    values.columns = ['na_record']+['ol_neg']+ol_cols
    ol_records_df = pd.concat([pd.DataFrame(ol_records_g.index),
                               values],axis=1)
    
    end = time.time()
    print('na and outliers record takes {}seconds'.format(np.round(end-start,2)))
    return ol_records_df

In [None]:
ol_loan_df = NA_OL_Records_Detailed(loan_filtered,amounts_loan)

In [None]:
ol_loan_df.shape

In [None]:
ol_loan_df.to_csv('Loan_processed/loan_ol.csv',index=False,encoding='GBK')

**Fill Missing Values**

In [13]:
class FillNA():
    
    def __init__(self,df,dates,months,types,states,amounts):
        
        self.df = df.copy()
        self.types = types
        self.dates = dates
        self.months = months
        self.fts = list(df.columns)
#        self.counts = counts
        self.states = states
        self.amounts = amounts
        
    def checkFeatures(self):
        
        drop = set(self.dates)-set(self.fts)
        self.dates = list(set(self.dates)-drop)
                
        drop = set(self.months)-set(self.fts)
        self.months = list(set(self.months)-drop)            
            
        drop = set(self.types)-set(self.fts)
        self.types = list(set(self.types)-drop)
        
        drop = set(self.states)-set(self.fts)
        self.states = list(set(self.states)-drop)
        
        drop = set(self.amounts)-set(self.fts)
        self.amounts = list(set(self.amounts)-drop)        
        
    def FillDates(self,value='1900.01.01'):
        
        for i in self.dates:
            self.df[i].fillna(value,inplace=True)
        
    def FillMonths(self,value='1900.01'):

        for i in self.months:
            self.df[i].fillna(value,inplace=True)
            
    def FillAmounts(self,value=0):
        
        for i in self.amounts:
            self.df[i].fillna(value,inplace=True)
        
    def FillTypes(self,value='nothing'):
        for i in self.types:
            self.df[i].fillna(value,inplace=True)
    
    def Fill24States(self,value='*'):
        self.df['latest24state'].fillna(value,inplace=True)
        
    def FillStates(self,value='nothing'):
        for i in self.states:
            self.df[i].fillna(value,inplace=True)
        
    def FillCurrOverdueCyc(self,value=0):
        # use the caller-supplied fill value instead of a hard-coded 0
        self.df['curroverduecyc'].fillna(value,inplace=True)
    
    def Ignite(self):
        self.checkFeatures()
        self.FillDates()
        self.FillMonths()
        self.FillAmounts()
        self.FillTypes()
        self.Fill24States()
        self.FillStates()
        self.FillCurrOverdueCyc()

In [14]:
fill = FillNA(loan_filtered,dates_loan,months_loan,types_loan,states_loan,amounts_loan)
fill.Ignite()

In [15]:
loan_fillna = fill.df

In [16]:
del fill

In [17]:
del loan_filtered

<a id='link6'>**<font color='darkred'>6. Feature Engineering</font>**</a>
- <a id='link6.2'>**6.2 Loan Features Deep Mining**</a>
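    - (1) `opendate` has no missing values; `paymentcyc` has 461865 missing values
    - (2) `enddate` is calculated as `opendate` + `paymentcyc`
    - (3 & 4) If `state` is 1 or 5, `enddate` has a normal value; if `state` is 3, `enddate` is NA
    - State end date (`stateenddate`): the date up to which the latest state has been updated; it is <= the query time and usually about one month earlier
    - The combination of `'financeorg','financetype','type','currency','guaranteetype'` takes 53474 distinct values, so rare unique values are to be binned; `financeorg` alone has 1701 unique values
    - However, each unique value of `financeorg` appears in fewer than 1% of IDs. Merging the values with frequency < 1/1000 would create one merged level with a very large frequency while all the others stay <= 1%, so `financeorg` is left unprocessed for now
    - For the other four categorical variables, low-frequency unique values are merged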
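
A minimal sketch of merging low-frequency levels of a categorical column, assuming frequency is measured as the share of IDs in which a level appears. The function name `merge_rare_levels`, the threshold `min_id_share` and the replacement label `'other'` are illustrative choices, not values taken from this notebook.

```python
import pandas as pd

def merge_rare_levels(df, col, id_name='id', min_id_share=0.01, other='other'):
    # Share of distinct ids in which each level of `col` appears
    n_ids = df[id_name].nunique()
    id_share = df.groupby(col)[id_name].nunique() / n_ids
    rare = id_share.index[id_share < min_id_share]
    # Keep frequent levels, replace rare ones with a single merged label
    return df[col].where(~df[col].isin(rare), other)

demo = pd.DataFrame({'id': [1, 2, 3, 4],
                     'currency': ['CNY', 'CNY', 'CNY', 'USD']})
merged = merge_rare_levels(demo, 'currency', min_id_share=0.5)
print(merged.tolist())
```

As the notes above point out for `financeorg`, it is worth checking the frequency of the merged level afterwards, since it can end up dominating the column.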
    

**Time Filter**
- latest opendate to observation point

In [None]:
class TimeFilter():
    
    def __init__(self,df):
        
        self.df = df.copy()
        self.cols = df.columns
        
    
    def ObsTimeReformat(self):
        
        # Reformat Observation
        
        start = time.time()
        # this is the observation start point, stored as YYYYMMDD
        obs_str = self.df['nasrdw_recd_date'].astype(str)
        
        # slice the fixed-width string instead of splitting it into single characters
        year = obs_str.str[0:4]
        month = obs_str.str[4:6]
        date = obs_str.str[6:8]
        
        obs_date = year+'.'+month+'.'+date
        self.df['nasrdw_recd_date'] = obs_date
        
        end = time.time()
        print('observation start date reformat finished, time cost {}seconds'.format(np.round(end-start,2)))

    def DateDFMerge(self):
        
        start = time.time()
        # find the latest opendate, group by IDs
        g1 = self.df.groupby('id')['opendate'].max()
        l_opendate_df = pd.DataFrame([g1.index.values,g1.values]).T
        l_opendate_df.columns = ['id','latest_opendate']
        cond = l_opendate_df.duplicated(keep='first')==False
        l_opendate_df = l_opendate_df.loc[cond]
        
        cond = self.df[['id','nasrdw_recd_date']].duplicated(keep='first')==False
        g2 = self.df[['id','nasrdw_recd_date']].loc[cond]
        self.window_df = pd.merge(left=l_opendate_df,right=g2,left_on='id',right_on='id',how='outer')
        
        end = time.time()
        print('window df merging finished, time cost {}seconds'.format(np.round(end-start,2)))   
        

    # WindowLabel() must be run after ObsTimeReformat() and DateDFMerge()
    def WindowLabel(self):
        
        start = time.time()
        obs_split = self.window_df['nasrdw_recd_date'].str.split(pat='.',expand=True)
        
        year = obs_split.iloc[:,0].astype(int)
        month = obs_split.iloc[:,1]
        date = obs_split.iloc[:,2]
        
        self.window_df['window_time1']=(year-2).astype(str)+'.'+month+'.'+date
        self.window_df['window_time2']=(year-4).astype(str)+'.'+month+'.'+date
        self.window_df['window_time3']=(year-6).astype(str)+'.'+month+'.'+date
        
        self.window_df['window_label1'] = 1*(self.window_df['latest_opendate']>=self.window_df['window_time1'])
        self.window_df['window_label2'] = 1*((self.window_df['latest_opendate']<self.window_df['window_time1'])&(self.window_df['latest_opendate']>=self.window_df['window_time2']))
        self.window_df['window_label3'] = 1*((self.window_df['latest_opendate']<self.window_df['window_time2'])&(self.window_df['latest_opendate']>=self.window_df['window_time3']))
        self.window_df['window_label4'] = 1*(self.window_df['latest_opendate']<self.window_df['window_time3'])
        
        self.window_filter_df = pd.merge(left=self.df,
                                         right=self.window_df.drop('nasrdw_recd_date',axis=1),
                                         left_on='id',
                                          right_on='id',
                                          how='left')
        end = time.time()
        print('window label generation and merging finished, time cost {}seconds'.format(np.round(end-start,2)))
            
    def Ignite(self):
        self.ObsTimeReformat()
        self.DateDFMerge()
        self.WindowLabel()
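
The class above compares dates as `YYYY.MM.DD` strings, which works because that format sorts lexicographically. As an alternative sketch, the same 2/4/6-year windows can be computed on real datetimes. The function name `window_labels` is hypothetical, and it returns one label from 1 to 4 per row rather than the four 0/1 columns produced by `WindowLabel`.

```python
import pandas as pd

def window_labels(latest_opendate, obs_date, fmt='%Y.%m.%d'):
    opened = pd.to_datetime(latest_opendate, format=fmt)
    obs = pd.to_datetime(obs_date, format=fmt)
    # Start at the oldest window (4) and move one window closer for each
    # cutoff (6, 4 and 2 years before observation) the open date reaches
    label = pd.Series(4, index=opened.index)
    for years in (6, 4, 2):
        cutoff = obs - pd.DateOffset(years=years)
        label -= (opened >= cutoff).astype(int)
    return label

demo = pd.DataFrame({
    'latest_opendate': ['2019.01.01', '2017.06.01', '2015.01.01', '2010.01.01'],
    'nasrdw_recd_date': ['2020.04.24'] * 4,
})
labels = window_labels(demo['latest_opendate'], demo['nasrdw_recd_date'])
print(labels.tolist())
```

One difference from the string version: for an observation date of 29 February, `DateOffset` moves the cutoff to 28 February in non-leap years, whereas the string arithmetic keeps a non-existent `02.29`.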

In [None]:
window_loan = TimeFilter(loan_fillna)
window_loan.Ignite()

In [None]:
loan_window_g = window_loan.window_df.groupby('id')[['window_label1',
                                                     'window_label2',
                                                     'window_label3',
                                                     'window_label4']].min()
loan_window_label = pd.concat([pd.DataFrame(loan_window_g.index.values),
                               pd.DataFrame(loan_window_g.values)],axis=1)
loan_window_label.columns = ['id','window1','window2','window3','window4']
loan_window_label.to_csv('Loan_processed/loan_window.csv',index=False,encoding='GBK')

In [None]:
del loan_window_g,loan_window_label

In [None]:
loan_window = window_loan.window_filter_df

In [None]:
del loan_fillna,window_loan

**Volatilities**

In [None]:
class Volatilities():
    
    def __init__(self,df,amounts):
        
        self.df = df
        self.amounts = amounts
        self.fts = list(df.columns)
    
#     def checkFeatures(self):
        
#         drop = set(self.amounts)-set(self.fts)
#         self.amounts = list(set(self.amounts)-drop)  
        
            
    def Range_Amount(self):       
        
        start = time.time()
        
        g1 = self.df.groupby('id')[self.amounts].max()
        g2 = self.df.groupby('id')[self.amounts].min()
        
        ids1 = g1.index.values
        ids2 = g2.index.values
        
        values1 = g1.values
        values2 = g2.values
        
        self.max_amount = pd.concat([pd.DataFrame(ids1),
                                     pd.DataFrame(values1)],axis=1)
        self.min_amount = pd.concat([pd.DataFrame(ids2),
                                     pd.DataFrame(values2)],axis=1)
        
        max_cols = ['max_'+i for i in self.amounts]
        min_cols = ['min_'+i for i in self.amounts]
        
        self.max_amount.columns = ['id']+max_cols
        self.min_amount.columns = ['id']+min_cols
        
        end = time.time()
        print('range_creditlimitamount finished, time cost {}seconds'.format(np.round(end-start,2))) 
        
    def StdVariance(self):
        
        start = time.time()
        
        g = self.df.groupby('id')[self.amounts[0]].std()
        self.stdvar_df = pd.DataFrame([g.index.values,g.values]).T
        self.stdvar_df.columns=['id','std_whole_'+self.amounts[0]]
        for i in self.amounts[1:]:
            g = self.df.groupby('id')[i].std()
            var = pd.DataFrame([g.index.values,g.values]).T
            var.columns = ['id','stdvar_'+i]
            self.stdvar_df = pd.merge(left=self.stdvar_df,right=var,on='id',how='outer')
        self.stdvar_df.fillna(0,inplace=True)

        end = time.time()
        print('stdvariance finished, time cost {}seconds'.format(np.round(end-start,2)))
        
    def Rate(self):
        
        start = time.time()
        
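        # assumes self.amounts[0] is 'creditlimitamount', which is the denominator
        denominator = self.df['creditlimitamount']
        
        numerator_cols = self.amounts[1:]
        numerator = self.df[numerator_cols]
        
        # a zero credit limit yields inf (x/0) or NaN (0/0); both are treated as rate 0
        rates_df = numerator.div(denominator,axis=0)
        rates_df = rates_df.replace([np.inf,-np.inf],np.nan).fillna(0)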
        
        rates_df = pd.concat([self.df['id'],rates_df],axis=1)
        
        mean_rates_g = rates_df.groupby('id')[numerator_cols].mean()
        ids = mean_rates_g.index.values
        values = mean_rates_g.values
        self.mean_rates_df = pd.concat([pd.DataFrame(ids),
                                        pd.DataFrame(values)],axis=1)
        self.mean_rates_df.columns = ['id']+['mean_rate_'+i for i in numerator_cols]
        
        end = time.time()
        print('rate finished, time cost {}seconds'.format(np.round(end-start,2)))
        
    def Ignite(self):
        
#         self.checkFeatures()
        self.Range_Amount()
        self.StdVariance()
        self.Rate()

**Takes about half an hour to run.**

In [None]:
v = Volatilities(loan_fillna,amounts_loan)
v.Ignite()

In [None]:
loan_stdvar = v.stdvar_df
loan_max_amt = v.max_amount
loan_min_amt = v.min_amount
loan_rate = v.mean_rates_df

In [None]:
loan_volatility = pd.merge(loan_stdvar,loan_rate,on='id',how='inner')
loan_volatility = pd.merge(loan_volatility,loan_max_amt,on='id',how='inner')
loan_volatility = pd.merge(loan_volatility,loan_min_amt,on='id',how='inner')

In [None]:
loan_volatility.to_csv('Loan_processed/loan_volatlity.csv',index=False)

In [None]:
del v,loan_volatility,loan_stdvar,loan_max_amt,loan_min_amt,loan_rate

<a id='link6'>**<font color='darkred'>6. Feature Engineering</font>**</a>
- <a id='link6.4'>**6.4 Feature Generation**</a>
    - nunique of `financeorg`
    - whether `financeorg` contains `浦发`
    - `latest24state`: count and max
    - `num_of_0`: sum and max
    - `state`: whether it contains '逾期' (overdue) and '呆帐' (bad debt)
    - sum of amounts of each id
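
Most of the features listed here are per-ID aggregates. A compact way to sketch several of them in one pass is pandas named aggregation (available since pandas 0.25). The toy frame below is invented for illustration. The list of bad states is the one used in `cnt_bad_state` in the class that follows.

```python
import pandas as pd

demo = pd.DataFrame({
    'id': [1, 1, 2],
    'financeorg': ['A', 'B', 'A'],
    'state': ['1|正常', '5|呆帐', '1|正常'],
    'balance': [100.0, 50.0, 70.0],
})
bad_states = ['2|冻结', '3|止付', '5|呆帐']

features = (demo.assign(is_bad=demo['state'].isin(bad_states))
                .groupby('id')
                .agg(num_record=('financeorg', 'size'),        # rows per id
                     n_financeorg=('financeorg', 'nunique'),   # distinct lenders
                     cnt_bad=('is_bad', 'sum'),                # records in a bad state
                     sum_balance=('balance', 'sum'))           # total balance
                .reset_index())
print(features)
```

This returns one row per `id` with all columns already aligned, so no chain of `pd.merge` calls is needed. Note that `'size'` counts rows, while the class counts non-null values of the `index` column.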

In [None]:
class Generate():
    
    def __init__(self,df,types,states,amounts,counts):
        
        self.df = df.copy()
        self.amounts = amounts
        self.types = types
        self.states = states
        self.counts = counts
        self.fts = list(df.columns)
        
    def checkFeatures(self):
        
        drop = set(self.amounts)-set(self.fts)
        self.amounts = list(set(self.amounts)-drop)  
        
        drop = set(self.types)-set(self.fts)
        self.types = list(set(self.types)-drop)  
        
        drop = set(self.states)-set(self.fts)
        self.states = list(set(self.states)-drop) 
        
        drop = set(self.counts)-set(self.fts)
        self.counts = list(set(self.counts)-drop) 
        
    def Num_Records(self):
        
        start = time.time()
        
        g = self.df.groupby('id')['index'].count()
        df1 = pd.DataFrame([g.index.values,g.values]).T
        df1.columns=['id','num_record']
        
        self.num_record_df = df1
        
        end = time.time()
        print('Num_record finished, time cost {}seconds'.format(np.round(end-start,2)))
        
    def n_financeorg(self):
        
        if 'financeorg' not in self.types:
            print('financeorg not kept, skip n_financeorg')
            return None
        
        start = time.time()
        
        financeorg_n = self.df.groupby('id')['financeorg'].nunique()
        ids = financeorg_n.index.values
        values = financeorg_n.values
        self.n_financeorg_df = pd.DataFrame([ids,values]).T
        self.n_financeorg_df.columns = ['id','n_financeorg']
        
        end = time.time()
        print('n_financeorg finished, time cost {}seconds'.format(np.round(end-start,2)))
        
    def pufa_financeorg(self):
        
        if 'financeorg' not in self.types:
            print('financeorg not kept, skip pufa_financeorg')
            return None
        
        start = time.time()
        
        cond = self.df['financeorg']=='浦发银行信用卡中心'
        self.df['if_pufa'] = cond*1
        g = self.df.groupby('id')['if_pufa'].max()
        
        ids = g.index.values
        values = g.values
        
        self.if_pufa_df = pd.concat([pd.DataFrame(ids),pd.DataFrame(values)],axis=1)
        self.if_pufa_df.columns = ['id','if_pufa']

        end = time.time()
        print('pufa_financeorg finished, time cost {}seconds'.format(np.round(end-start,2)))
        
    def n_type(self):
        
        if 'type' not in self.types:
            print('type not kept, skip n_type')
            return None
        
        start = time.time()
        
        # count distinct values of 'type' (previously grouped 'financeorg' by mistake)
        n_type = self.df.groupby('id')['type'].nunique()
        ids = n_type.index.values
        values = n_type.values
        self.n_type_df = pd.DataFrame([ids,values]).T
        self.n_type_df.columns = ['id','n_types']
        
        end = time.time()
        print('n_types finished, time cost {}seconds'.format(np.round(end-start,2)))
        
    def n_currency(self):
        
        if 'currency' not in self.types:
            print('currency not kept, skip n_currency')
            return None        
        
        start = time.time()
        
        n_currency = self.df.groupby('id')['currency'].nunique()
        ids = n_currency.index.values
        values = n_currency.values
        self.n_currency_df = pd.DataFrame([ids,values]).T
        self.n_currency_df.columns = ['id','n_currency']
        
        end = time.time()
        print('n_currency finished, time cost {}seconds'.format(np.round(end-start,2)))   
        
    def n_guaranteetype(self):
        
        if 'guaranteetype' not in self.types:
            print('guaranteetype not kept, skip n_guaranteetype')
            return None
        
        start = time.time()
        
        g = self.df.groupby('id')['guaranteetype'].nunique()
        ids = g.index.values
        values = g.values
        self.n_guaranteetype_df = pd.DataFrame([ids,values]).T
        self.n_guaranteetype_df.columns = ['id','n_guaranteetype']
        
        end = time.time()
        print('n_guaranteetype finished, time cost {}seconds'.format(np.round(end-start,2)))  
        
    def cnt_bad_state(self):
        
        if 'state' not in self.states:
            print('state not kept, skip cnt_bad_state')
            return None 
        
        start = time.time()
        
        #stdcard ['4|销户', '6|未激活', '1|正常', '2|冻结', '3|止付', '5|呆帐']
        #loan
        
        bads = self.df['state'].isin(['2|冻结','5|呆帐','3|止付'])==True
        bads = pd.concat([self.df['id'],bads],axis=1)
        bads.columns = ['id','cnt_bad']
        g = bads.groupby('id')['cnt_bad'].sum()
        ids = g.index.values
        values = g.values                       
        self.cnt_bad_state_df=pd.DataFrame([ids,values]).T
        self.cnt_bad_state_df.columns = ['id','cnt_bad']
        
        end = time.time()
        print('cnt_bad_state finished, time cost {}seconds'.format(np.round(end-start,2))) 
        
    def worst_class5state(self):
        
        if 'class5state' not in self.states:
            print('class5state not kept, skip worst_class5state')
            return None
        
        start = time.time()     
        
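        # map class5state labels to an ordinal severity scale (higher is worse);
        # replace() returns a new frame, so the slice of self.df is not modified in place
        temp = self.df[['id','class5state']].replace({'1|正常':0,'nothing':1,'9|未知':1,'2|关注':2,'3|次级':3,'4|可疑':4,'5|损失':5})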
        g = temp.groupby('id')['class5state'].max()
        ids = g.index.values
        values = g.values
        self.worst_class5state_df=pd.DataFrame([ids,values]).T
        self.worst_class5state_df.columns = ['id','worst_class5state']        
        
        cond = self.worst_class5state_df.duplicated(keep='first')==False
        self.worst_class5state_df = self.worst_class5state_df.loc[cond]
        
        end = time.time()
        print('worst_class5state finished, time cost {}seconds'.format(np.round(end-start,2)))   
        
    def cnt_bad_class5state(self):
        
        if 'class5state' not in self.states:
            print('class5state not kept, skip cnt_bad_class5state')
            return None
        
        start = time.time() 
        
        # same ordinal mapping as in worst_class5state; levels 3-5 count as bad
        temp = self.df[['id','class5state']].replace({'1|正常':0,'nothing':1,'9|未知':1,'2|关注':2,'3|次级':3,'4|可疑':4,'5|损失':5})
        cond = temp['class5state'].isin([3,4,5])
        temp = pd.concat([temp['id'],cond],axis=1)
        temp.columns=['id','class5state']
        
        g = temp.groupby('id')['class5state'].sum()
        ids = g.index.values
        values = g.values
        self.cnt_bad_class5state_df=pd.DataFrame([ids,values]).T
        self.cnt_bad_class5state_df.columns = ['id','class5state']
        
        end = time.time()
        print('cnt_bad_class5state finished, time cost {}seconds'.format(np.round(end-start,2)))  
        
    def sum_curroverduecyc(self):
        
        if 'curroverduecyc' not in self.counts:
            print('curroverduecyc not kept, skip sum_curroverduecyc')
            return None    
        
        start = time.time()        
        
        g = self.df.groupby('id')['curroverduecyc'].sum()
        ids = g.index.values
        values = g.values
        self.sum_curroverduecyc_df=pd.DataFrame([ids,values]).T
        self.sum_curroverduecyc_df.columns = ['id','sum_curroverduecyc']        

        end = time.time()
        print('sum_curroverduecyc finished, time cost {}seconds'.format(np.round(end-start,2)))  
        
    def latest24state(self):
        
        start = time.time()        
        
        ids = self.df['id'].values
        cond = self.df['latest24state'].str.split(pat='',expand=True).fillna('0').replace({'':'0',
                                                                                          'D':'0',
                                                                                          'C':'0',
                                                                                          'N':'0',
                                                                                          'G':'0',
                                                                                          '/':'0',
                                                                                          '*':'0',
                                                                                          '#':'0'}).astype('int32')
        default_cnt = (cond!=0).sum(axis=1).values
        max_default = cond.max(axis=1).values
        self.latest24state_df = pd.DataFrame([ids,default_cnt,max_default]).T
        self.latest24state_df.columns=['id','cnt_24state','max_24state']
        
        g1 = self.latest24state_df.groupby('id')['cnt_24state'].sum()
        g2 = self.latest24state_df.groupby('id')['max_24state'].max()
        
        self.latest24state_cnt = pd.DataFrame([g1.index.values,g1.values]).T
        self.latest24state_max = pd.DataFrame([g2.index.values,g2.values]).T
        
        self.latest24state_cnt.columns = ['id','cnt_24state']
        self.latest24state_max.columns = ['id','max_24state']
        
        end = time.time()
        print('latest24state finished, time cost {}seconds'.format(np.round(end-start,2)))        
        
    def SumAmount(self):
        
        start = time.time()
        
        g = self.df.groupby('id')[self.amounts].sum()
        # prefix every aggregated column with 'sum_' and turn the 'id' index back into a column
        self.sumamount_df = g.add_prefix('sum_').reset_index()
        
        end = time.time()
        print('Sum amount finished, time cost {}seconds'.format(np.round(end-start,2)))                  
        
    def Ignite(self):
        self.checkFeatures()
        self.Num_Records()
        self.n_financeorg()
        self.pufa_financeorg()
        self.n_type()
        self.n_currency()
        self.n_guaranteetype()
        self.cnt_bad_state()
        self.worst_class5state()
        self.cnt_bad_class5state()
        self.sum_curroverduecyc()
#         self.latest24state()
        self.SumAmount()  

In [None]:
gene = Generate(loan_fillna,types_loan,states_loan,amounts_loan,counts_loan)
gene.Ignite()

In [None]:
# gene.latest24state()
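
The disabled `latest24state()` step splits each 24-month status string into single characters, maps non-numeric codes to 0, then counts the non-zero months and takes the worst code per `id`. A minimal sketch of that idea on toy data follows. It assumes that every non-digit character can be treated as 0 (the notebook maps `D`, `C`, `N`, `G`, `/`, `*` and `#` to 0); the helper name `state_stats` and the sample frame are hypothetical.

```python
import pandas as pd

def state_stats(state):
    """Return (number of non-zero months, worst code) for one status string."""
    # Assumption: digits are overdue levels; any other character counts as 0.
    codes = [int(ch) if ch.isdigit() else 0 for ch in str(state)]
    if not codes:
        return 0, 0
    return sum(c != 0 for c in codes), max(codes)

# Hypothetical toy data: one row per account, possibly several accounts per id.
toy = pd.DataFrame({
    'id': ['a', 'a', 'b'],
    'latest24state': ['NNN12//3', '**1', 'NNNN'],
})

stats = toy['latest24state'].apply(state_stats)
toy['cnt_24state'] = [s[0] for s in stats]
toy['max_24state'] = [s[1] for s in stats]

# One row per id: total non-zero months and the worst code seen.
per_id = toy.groupby('id').agg(cnt_24state=('cnt_24state', 'sum'),
                               max_24state=('max_24state', 'max')).reset_index()
print(per_id)
```

Working character by character avoids building a wide intermediate frame with one column per month, which may matter on a table of this size.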

In [None]:
loan_gene = gene.n_financeorg_df
loan_gene = pd.merge(left=loan_gene,right=gene.num_record_df,on='id',how='inner')
loan_gene = pd.merge(left=loan_gene,right=gene.if_pufa_df,on='id',how='inner')
loan_gene = pd.merge(left=loan_gene,right=gene.n_type_df,on='id',how='inner')
loan_gene = pd.merge(left=loan_gene,right=gene.n_currency_df,on='id',how='inner')
loan_gene = pd.merge(left=loan_gene,right=gene.n_guaranteetype_df,on='id',how='inner')
loan_gene = pd.merge(left=loan_gene,right=gene.cnt_bad_state_df,on='id',how='inner')
loan_gene = pd.merge(left=loan_gene,right=gene.worst_class5state_df,on='id',how='inner')
loan_gene = pd.merge(left=loan_gene,right=gene.cnt_bad_class5state_df,on='id',how='inner')
loan_gene = pd.merge(left=loan_gene,right=gene.sum_curroverduecyc_df,on='id',how='inner')
loan_gene = pd.merge(left=loan_gene,right=gene.sumamount_df,on='id',how='inner')
# loan_gene = pd.merge(left=loan_gene,right=gene.latest24state_cnt,on='id',how='inner')
# loan_gene = pd.merge(left=loan_gene,right=gene.latest24state_max,on='id',how='inner')
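
The chain of `pd.merge` calls above repeats the same inner join on `id`. One possible way to express it is a fold over a list of frames with `functools.reduce`. The sketch below uses small hypothetical frames in place of `gene.n_financeorg_df`, `gene.num_record_df` and the others; `merge_on_id` is an illustrative helper, not part of the notebook.

```python
from functools import reduce
import pandas as pd

def merge_on_id(frames, how='inner'):
    """Merge a list of DataFrames on 'id', left to right."""
    return reduce(lambda left, right: pd.merge(left, right, on='id', how=how), frames)

# Hypothetical stand-ins for the per-id feature frames built by Generate.
f1 = pd.DataFrame({'id': ['a', 'b'], 'n_financeorg': [2, 1]})
f2 = pd.DataFrame({'id': ['a', 'b'], 'num_record': [5, 3]})
f3 = pd.DataFrame({'id': ['b', 'a'], 'if_pufa': [0, 1]})

merged = merge_on_id([f1, f2, f3])
print(merged)
```

Adding or removing a feature frame then means editing one list entry instead of one merge line.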

In [None]:
loan_gene.head()

In [None]:
loan_gene.to_csv('Loan_processed/loan_gene.csv',index=False,encoding='GBK')

In [None]:
del loan_gene,gene

**Time Related Features**
- **`cutDF()` is very slow: about 4300 seconds on the loan table (4286.15 seconds in the run below)!**

In [71]:
class TimeRelated():
    
    def __init__(self,df,types,states,amounts,counts):
        
        self.whole_df = df.copy()
        self.amounts = amounts
        self.types = types
        self.states = states
        self.counts = counts
        self.df = self.whole_df[['id','index','opendate']+self.amounts]
        
    def Convert(self):
        
        # work on an explicit copy so the assignments below do not hit a slice of whole_df
        self.df = self.df.copy()
        self.df['opendate2'] = pd.to_datetime(self.df['opendate'])
        self.df.drop('opendate',axis=1,inplace=True)
        self.fts = list(self.df.columns)
        
    def cutDF(self):
        
        start = time.time()
        g1 = self.df.groupby('id')
        
        
        latest_g = g1.apply(lambda x:x.sort_values('opendate2',ascending=False).head(1))
        print('latest records finished')
        print()
        
        latest_3_g = g1.apply(lambda x:x.sort_values('opendate2',ascending=False).head(3))
        print('latest 3 records finished')
        print()
        
        latest_6_g = g1.apply(lambda x:x.sort_values('opendate2',ascending=False).head(6))
        print('latest 6 records finished')
        print()
        
        self.latest_df = pd.DataFrame(latest_g.values)
        self.latest_3_df = pd.DataFrame(latest_3_g.values)
        self.latest_6_df = pd.DataFrame(latest_6_g.values)
        
        
        self.latest_df.columns = self.fts
        self.latest_3_df.columns = self.fts
        self.latest_6_df.columns = self.fts
        
        self.latest_df.iloc[:,1:] = self.latest_df.iloc[:,1:].astype(int)
        self.latest_3_df.iloc[:,1:] = self.latest_3_df.iloc[:,1:].astype(int)
        self.latest_6_df.iloc[:,1:] = self.latest_6_df.iloc[:,1:].astype(int)
        
        end = time.time()
        print('DF has been cut, time cost {}seconds'.format(np.round(end-start,2)))
        
    def SumAmounts(self):
        
        start = time.time()
        
        print('start sum')
        sum3_g = self.latest_3_df.groupby('id')[self.amounts].sum()
        sum6_g = self.latest_6_df.groupby('id')[self.amounts].sum()
        
        id3 = pd.DataFrame(sum3_g.index.values)
        id6 = pd.DataFrame(sum6_g.index.values)

        values3 = pd.DataFrame(sum3_g.values)
        values6 = pd.DataFrame(sum6_g.values)
        
        cols3 = ['sum_'+i+'_3' for i in self.amounts]
        cols6 = ['sum_'+i+'_6' for i in self.amounts]
        
        self.sumamount_df3 = pd.concat([id3,values3],axis=1)
        self.sumamount_df6 = pd.concat([id6,values6],axis=1)
        
        self.sumamount_df3.columns = (['id']+cols3)
        self.sumamount_df6.columns = (['id']+cols6)
        
        end1 = time.time()
        print('sum finished, time cost {}seconds'.format(np.round(end1-start,2)))
        print()
        
    def MeanAmounts(self): 
        
        start = time.time()
        print('start mean')
        mean3_g = self.latest_3_df.groupby('id')[self.amounts].mean()
        mean6_g = self.latest_6_df.groupby('id')[self.amounts].mean()
        
        id3 = pd.DataFrame(mean3_g.index.values)
        id6 = pd.DataFrame(mean6_g.index.values)

        values3 = pd.DataFrame(mean3_g.values)
        values6 = pd.DataFrame(mean6_g.values)
        
        cols3 = ['mean_'+i+'_3' for i in self.amounts]
        cols6 = ['mean_'+i+'_6' for i in self.amounts]
        
        self.meanamount_df3 = pd.concat([id3,values3],axis=1)
        self.meanamount_df6 = pd.concat([id6,values6],axis=1)
        
        self.meanamount_df3.columns = (['id']+cols3)
        self.meanamount_df6.columns = (['id']+cols6)
        
        end1 = time.time()
        print('mean finished, time cost {}seconds'.format(np.round(end1-start,2)))
        print()
    
    def NumRecords(self):
        
        start = time.time()
        print('start record')
        num3_g = self.latest_3_df.groupby('id')['index'].count()
        num6_g = self.latest_6_df.groupby('id')['index'].count()
        
        id3 = pd.DataFrame(num3_g.index.values)
        id6 = pd.DataFrame(num6_g.index.values)

        values3 = pd.DataFrame(num3_g.values)
        values6 = pd.DataFrame(num6_g.values)
        
        cols3 = ['num_record_3']
        cols6 = ['num_record_6']
        
        self.numrecord_df3 = pd.concat([id3,values3],axis=1)
        self.numrecord_df6 = pd.concat([id6,values6],axis=1)
        
        self.numrecord_df3.columns = (['id']+cols3)
        self.numrecord_df6.columns = (['id']+cols6)
        
        end1 = time.time()
        print('numrecord finished, time cost {}seconds'.format(np.round(end1-start,2)))
        print()
    

In [72]:
a = TimeRelated(loan_fillna,types_loan,states_loan,amounts_loan,counts_loan)
a.Convert()
a.cutDF() 

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  from ipykernel import kernelapp as app
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,


latest records finished

latest 3 records finished

latest 6 records finished

DF has been cut, time cost 4286.15seconds
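
Most of the time in `cutDF()` probably goes into `groupby(...).apply(...)`, which sorts every group separately in Python, three times over. A commonly used alternative is to sort the whole frame once and take `groupby(...).head(n)`. The sketch below shows that idea on toy data; `latest_n` is a hypothetical helper, it has not been timed on the notebook's data, and rows with identical dates may come out in a different order than in the original.

```python
import pandas as pd

def latest_n(df, n, by='id', date_col='opendate2'):
    """Keep the n most recent rows per group using a single global sort."""
    ordered = df.sort_values([by, date_col], ascending=[True, False])
    return ordered.groupby(by, sort=False).head(n).reset_index(drop=True)

# Hypothetical toy data with the same key columns as the notebook.
toy = pd.DataFrame({
    'id': ['a', 'a', 'a', 'b'],
    'opendate2': pd.to_datetime(['2019-01-01', '2020-03-01',
                                 '2018-06-01', '2017-05-05']),
    'balance': [10, 20, 30, 40],
})

latest_1 = latest_n(toy, 1)   # most recent record per id
latest_2 = latest_n(toy, 2)   # two most recent records per id
print(latest_2)
```

This keeps the original column names and dtypes, so the later `astype(int)` step and the manual column renaming would not be needed in this variant.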


In [73]:
a.SumAmounts()

start sum
sum finished, time cost 6.58seconds



In [74]:
a.MeanAmounts()

start mean
mean finished, time cost 6.36seconds



In [75]:
a.NumRecords()

start record
numrecord finished, time cost 5.15seconds
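
`SumAmounts()`, `MeanAmounts()` and `NumRecords()` follow the same pattern: group by `id`, aggregate, then rebuild the column names by hand. A compact sketch of the same three feature families follows. The helper `window_features` and the toy frame are hypothetical, and it counts rows with `size()`, whereas the notebook counts non-null values of `index`.

```python
import pandas as pd

def window_features(latest_df, amounts, suffix):
    """Sum, mean and record count per id, with flat column names."""
    grouped = latest_df.groupby('id')
    sums = grouped[amounts].sum().add_prefix('sum_').add_suffix('_' + suffix)
    means = grouped[amounts].mean().add_prefix('mean_').add_suffix('_' + suffix)
    counts = grouped.size().rename('num_record_' + suffix)
    # All three pieces share the same 'id' index, so concat aligns them.
    return pd.concat([sums, means, counts], axis=1).reset_index()

# Hypothetical toy data standing in for a.latest_3_df.
toy = pd.DataFrame({
    'id': ['a', 'a', 'b'],
    'balance': [10, 30, 5],
    'creditlimitamount': [100, 200, 50],
})

feats = window_features(toy, ['balance', 'creditlimitamount'], '3')
print(feats)
```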



In [91]:
latest_df,sumamount_df3,sumamount_df6,meanamount_df3,meanamount_df6,numrecord_df3,numrecord_df6

| Unnamed: 0 | id | sum_creditlimitamount_6 | sum_balance_6 | sum_scheduledpaymentamount_6 | sum_actualpaymentamount_6 | sum_curroverdueamount_6 | sum_overdue31to60amount_6 | sum_overdue61to90amount_6 | sum_overdue91to180amount_6 | sum_overdueover180amount_6 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 00001690617d31704261808cee46c21d | 898 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 00001fa8a2f1407958617d40ccf5289b | 8200 | 4290 | 483 | 483 | 0 | 0 | 0 | 0 | 0 |
| 2 | 000029dd22864abc4062ae0cb8b0b779 | 270000 | 270000 | 2595 | 2595 | 0 | 0 | 0 | 0 | 0 |
| 3 | 000032d505d98892988d1900869bf1be | 49600 | 28851 | 8254 | 8254 | 0 | 0 | 0 | 0 | 0 |
| 4 | 00003e25c0754ca7a89f10a3a0f1958b | 10000 | 10000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 758216 | ffff0e693b759e7177cb0dd1404884ad | 877 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 758217 | ffff3db7a932f36e9b550a6fe0f97fc3 | 2499 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 758218 | ffff7d672362f5bcb25aa5063b9b44b7 | 8800 | 6698 | 460 | 460 | 0 | 0 | 0 | 0 | 0 |
| 758219 | ffff7f01f127afeabdee0e5fb605d2a3 | 34800 | 20033 | 2463 | 2463 | 0 | 0 | 0 | 0 | 0 |


<a id='link6'>**<font color='darkred'>6. Feature Engineering</font>**</a>
- <a id='link6.5'>**6.5 Cross \& Dot Products**</a>
    - **DON'T USE `financeorg`! THIS IS HORRIFYING!!**
    - **<font color='green'>Function: Combine less frequent values</font>**
    - **<font color='green'>The columns to one-hot encode are `financetype`, `type`, `state`, `currency`, `class5state` and `guaranteetype`</font>**
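
Combining less frequent values here means: count each category, take a quantile of those counts as the threshold, and replace every category at or below it with one shared label. A minimal sketch on toy data follows; `merge_rare`, the label `'combi'` and the sample series are illustrative assumptions. The class in the next cell additionally skips columns with only a few distinct values.

```python
import pandas as pd

def merge_rare(series, quantile=0.05, label='combi'):
    """Replace categories whose count is at or below the given quantile of counts."""
    counts = series.value_counts()
    threshold = counts.quantile(quantile)
    rare = counts.index[counts <= threshold]
    # Keep frequent values, overwrite rare ones with the shared label.
    merged = series.where(~series.isin(rare), label)
    return merged, list(rare)

# Hypothetical categorical column with two rare values, 'D' and 'E'.
toy = pd.Series(['A'] * 50 + ['B'] * 30 + ['C'] * 15 + ['D'] * 4 + ['E'] * 1)

merged, rare = merge_rare(toy, quantile=0.25)
print(rare)
print(merged.value_counts())
```

Note that the threshold is a quantile of the per-category counts, not a share of rows, so a larger quantile merges more categories.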

In [None]:
class ObjCrossAmounts():
    
    def __init__(self,df,obj_cols,amounts,combi_per,warning=60):
        
        self.df = df.copy()
        self.obj_cols = obj_cols
        self.amounts = amounts
        self.combi_per = combi_per
        self.fts = self.df.columns
        self.warning = warning
        self.n = df.shape[0]
        
    def checkFeatures(self):
        
        start = time.time()
        
        drop = set(self.obj_cols)-set(self.fts)
        self.obj_cols = list(set(self.obj_cols)-drop)

        drop = set(self.amounts)-set(self.fts)
        self.amounts = list(set(self.amounts)-drop)        

        end = time.time()
    
        print('Features Checking Finished, time cost {}seconds'.format(np.round(end-start,2)))
        print()
        
    def MergeLowFrequency(self,nunique_thres=5):

        start = time.time()
        
        # to record what values are combined in obj_cols
        self.replace_record = {}
        
        for i in self.obj_cols:
            
            if self.df[i].nunique()<=nunique_thres:
                print('feature #{}# nunique is less than {}, no need to be merged'.format(i,nunique_thres))
                print()
                continue
                
            print('now merging feature #{}#'.format(i))    
            print()
            v_C = self.df[i].value_counts(sort=False)
            thres = v_C.quantile(self.combi_per)
            cond = v_C<=thres
            to_replace = v_C.loc[cond].index.values
            self.replace_record[i] = to_replace
            self.df[i] = self.df[i].replace(to_replace,'combi_'+i)
            
        end = time.time()
        print('Merging Low-Frequency Values Finished, time cost {}seconds'.format(np.round(end-start,2)))
        print()
        
    def ObjCombination(self):
        
        # generate a new column to store the combination
        start = time.time()
        self.df['obj_cross'] = '_vs_'
        
        # generate a column for combinations
        self.df['obj_cross'] = self.df['obj_cross']+self.df[self.obj_cols].astype(str).sum(axis=1)
        
#         for i in self.obj_cols:
#             self.df['obj_cross'] = self.df['obj_cross']+self.df[i]
        
        # extract 'id' and 'obj_cross' columns as a new dataframe
        # following used for amount vs. categorical
        self.combi_df = self.df[['id','obj_cross']].copy()
        self.combi_df['obj_cross'] = self.combi_df['obj_cross'].fillna('nothing')
        self.unique_combi = self.df['obj_cross'].unique()
        self.m = len(self.unique_combi)
        
        assert self.m<=self.warning, "Warning: too many combinations, total {} (limit {}), try a larger combi_per".format(self.m,self.warning)
        end = time.time()
        print('Objectives Combination Finished, time cost {}seconds'.format(np.round(end-start,2)))  
        print()
        
    def ColGeneration(self):
        
        start = time.time()
        
        onehot_matrix = np.zeros((self.n,self.m))
        combi_col = self.combi_df['obj_cross'].values
        for i in range(self.n):
            pos = np.where(self.unique_combi==combi_col[i])[0][0]
            onehot_matrix[i,pos] = 1
        
#         OneHot_df = pd.DataFrame(onehot_matrix,columns=self.unique_combi)
#         self.combi_df = pd.concat([self.combi_df,OneHot_df],axis=1)
        
        self.onehot_matrix = onehot_matrix
        
        end = time.time()
        print('One-Hot Columns Generation Finished, time cost {}seconds'.format(np.round(end-start,2)))
        print()
        
    def DotProduct(self,amount_col):
        
        start = time.time()
        
        amount_values = self.df[amount_col].values
        amount_cross_onehot = self.onehot_matrix*amount_values[:,np.newaxis]
        
#         columns = ['_'.join([amount_col,i]) for i in self.unique_combi]
        columns = [amount_col+i for i in self.unique_combi]
        
        df = pd.DataFrame(amount_cross_onehot,columns=columns)
        self.combi_df = pd.concat([self.combi_df,df],axis=1)
        
        end = time.time()
        print('Amount Col {} VS One-Hot Dot Products Finished, time cost {}seconds'.format(amount_col,np.round(end-start,2)))
        print()
    
    def CompressMultipleIDs(self):
        
        start = time.time()
        
        sum_cols = list(self.combi_df.columns)
        sum_cols.remove('id')
        sum_cols.remove('obj_cross')
        
        grouped = self.combi_df.groupby('id')[sum_cols].sum()
        IDs = pd.DataFrame(grouped.index.values,columns=['id'])
        Values = pd.DataFrame(grouped.values,columns=sum_cols)
        self.compressed_df = pd.concat([IDs,Values],axis=1)
        
        end = time.time()
        print('Compress Multiple ID Records to One Finished, time cost {}seconds'.format(np.round(end-start,2)))
        print()        
    
    def Ignite(self):
        
        start = time.time()
        self.checkFeatures()
        self.MergeLowFrequency()
        self.ObjCombination()
        self.ColGeneration()
        
        for i in self.amounts:
            self.DotProduct(i)
        
        self.CompressMultipleIDs()
        
        end = time.time()
        print('Whole Procedure Finished, time cost {}seconds'.format(np.round(end-start,2))) 
        print()  
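
`ColGeneration()` fills the one-hot matrix row by row, and `DotProduct()` then multiplies it by an amount column before `CompressMultipleIDs()` sums per `id`. The sketch below shows the same three steps for a single categorical column without a Python loop over rows, using `pd.factorize` to get the column position of every row. `cross_amount` and the toy frame are hypothetical, and the sketch covers one amount column only.

```python
import numpy as np
import pandas as pd

def cross_amount(df, cat_col, amount_col):
    """Sum amount_col per id, split by the categories of cat_col."""
    codes, uniques = pd.factorize(df[cat_col].astype(str))
    onehot = np.zeros((len(df), len(uniques)))
    onehot[np.arange(len(df)), codes] = 1      # one 1 per row, set in a single step
    # Broadcasting: each row of the one-hot matrix is scaled by that row's amount.
    crossed = onehot * df[amount_col].to_numpy()[:, np.newaxis]
    cols = [amount_col + '_vs_' + u for u in uniques]
    out = pd.DataFrame(crossed, columns=cols)
    out.insert(0, 'id', df['id'].to_numpy())
    return out.groupby('id', as_index=False).sum()

# Hypothetical toy data: two accounts for id 'a', one for id 'b'.
toy = pd.DataFrame({
    'id': ['a', 'a', 'b'],
    'currency': ['CNY', 'USD', 'CNY'],
    'balance': [100.0, 20.0, 7.0],
})

crossed = cross_amount(toy, 'currency', 'balance')
print(crossed)
```

Building the result from NumPy arrays also sidesteps index alignment in `pd.concat`, which can matter when the input frame does not have a default 0..n-1 index.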

In [None]:
types_loan+states_loan

In [None]:
obj_cols = ['financetype','type','currency','guaranteetype','state','class5state']

loan_cross = ObjCrossAmounts(df = loan_fillna,
                             obj_cols = ['financetype'],
                             amounts=amounts_loan,
                             combi_per=0.05)
loan_cross.Ignite()
loan_dot_pro1 = loan_cross.compressed_df
col_name =[i + obj_cols[0] for i in list(loan_dot_pro1.columns)[1:]]  #[1:] to exclude id
col_name = ['id']+col_name # put back 'id'
loan_dot_pro1.columns = col_name
del loan_cross

In [None]:
loan_dot_pro1.to_csv('Loan_processed/loan_cross1.csv',index=False,encoding='GBK')

In [None]:
del loan_dot_pro1

In [None]:
obj_cols = ['financetype','type','currency','guaranteetype','state','class5state']

loan_cross = ObjCrossAmounts(df = loan_fillna,
                               obj_cols = ['type'],
                               amounts=amounts_loan,
                               combi_per=0.05)
loan_cross.Ignite()
loan_dot_pro2 = loan_cross.compressed_df
col_name =[i + obj_cols[1] for i in list(loan_dot_pro2.columns)[1:]]  #[1:] to exclude id
col_name = ['id']+col_name # put back 'id'
loan_dot_pro2.columns = col_name

In [None]:
loan_dot_pro2.to_csv('Loan_processed/loan_cross2.csv',index=False,encoding='GBK')

In [None]:
del loan_dot_pro2,loan_cross

In [None]:
obj_cols = ['financetype','type','currency','guaranteetype','state','class5state']

loan_cross = ObjCrossAmounts(df = loan_fillna,
                             obj_cols = ['currency'],
                             amounts=amounts_loan,
                             combi_per=0.05)
loan_cross.Ignite()
loan_dot_pro3 = loan_cross.compressed_df
col_name =[i + obj_cols[2] for i in list(loan_dot_pro3.columns)[1:]]  #[1:] to exclude id
col_name = ['id']+col_name # put back 'id'
loan_dot_pro3.columns = col_name

In [None]:
loan_dot_pro3.to_csv('Loan_processed/loan_cross3.csv',index=False,encoding='GBK')

In [None]:
del loan_cross,loan_dot_pro3

In [None]:
obj_cols = ['financetype','type','currency','guaranteetype','state','class5state']

loan_cross = ObjCrossAmounts(df = loan_fillna,
                             obj_cols = ['guaranteetype'],
                             amounts=amounts_loan,
                             combi_per=0.05)
loan_cross.Ignite()
loan_dot_pro4 = loan_cross.compressed_df
col_name =[i + obj_cols[3] for i in list(loan_dot_pro4.columns)[1:]]  #[1:] to exclude id
col_name = ['id']+col_name # put back 'id'
loan_dot_pro4.columns = col_name

In [None]:
loan_dot_pro4.to_csv('Loan_processed/loan_cross4.csv',index=False,encoding='GBK')

In [None]:
del loan_cross,loan_dot_pro4

In [None]:
obj_cols = ['financetype','type','currency','guaranteetype','state','class5state']

loan_cross = ObjCrossAmounts(df = loan_fillna,
                               obj_cols = ['state'],
                               amounts=amounts_loan,
                               combi_per=0.05)
loan_cross.Ignite()
loan_dot_pro5 = loan_cross.compressed_df
col_name =[i + obj_cols[4] for i in list(loan_dot_pro5.columns)[1:]]  #[1:] to exclude id
col_name = ['id']+col_name # put back 'id'
loan_dot_pro5.columns = col_name

In [None]:
loan_dot_pro5.to_csv('Loan_processed/loan_cross5.csv',index=False,encoding='GBK')

In [None]:
del loan_cross,loan_dot_pro5

In [None]:
obj_cols = ['financetype','type','currency','guaranteetype','state','class5state']

loan_cross = ObjCrossAmounts(df = loan_fillna,
                               obj_cols = ['class5state'],
                               amounts=amounts_loan,
                               combi_per=0.05)
loan_cross.Ignite()
loan_dot_pro6 = loan_cross.compressed_df
col_name =[i + obj_cols[5] for i in list(loan_dot_pro6.columns)[1:]]  #[1:] to exclude id
col_name = ['id']+col_name # put back 'id'
loan_dot_pro6.columns = col_name

In [None]:
loan_dot_pro6.to_csv('Loan_processed/loan_cross6.csv',index=False,encoding='GBK')

In [None]:
del loan_cross,loan_dot_pro6
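
The six cells above differ only in the categorical column and the output file number, so they could be driven by one loop. The sketch below shows the shape of such a loop. `run_all` and `add_suffix_except_id` are hypothetical helpers, and `fake_build` is a stand-in for building an `ObjCrossAmounts` object for one column, calling `Ignite()` and returning its `compressed_df`.

```python
import pandas as pd

def add_suffix_except_id(df, suffix):
    """Append suffix to every column name except 'id'."""
    return df.rename(columns={c: c + suffix for c in df.columns if c != 'id'})

def run_all(build_cross, obj_cols, out_pattern=None):
    """Call build_cross(col) for each column, rename, and optionally save to CSV."""
    results = {}
    for k, col in enumerate(obj_cols, start=1):
        crossed = add_suffix_except_id(build_cross(col), col)
        if out_pattern is not None:
            # e.g. out_pattern='Loan_processed/loan_cross{}.csv'
            crossed.to_csv(out_pattern.format(k), index=False, encoding='GBK')
        results[col] = crossed
    return results

# Hypothetical stand-in for the real per-column feature builder.
def fake_build(col):
    return pd.DataFrame({'id': ['a'], 'balance_vs_x': [1.0]})

res = run_all(fake_build, ['financetype', 'type'])
print(res['type'].columns.tolist())
```

Unlike the cells above, this sketch keeps every result in memory; on the full loan table it may be preferable to write each frame to disk and drop it before the next iteration.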