# 7.06 - feature selection extended - (with p-values!)

In [1]:
import pandas as pd
import numpy as np

import statsmodels.api as sm
# note: load_boston was removed in scikit-learn 1.2, so this import needs an older version
from sklearn.datasets import load_boston

In [2]:
x = load_boston()

In [4]:
x.keys()

dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])

In [5]:
# housing prices
y = x['target']

In [7]:
X = pd.DataFrame(x['data'], columns=x['feature_names'])

In [8]:
X

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.0900,1.0,296.0,15.3,396.90,4.98
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.90,9.14
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.90,5.33
...,...,...,...,...,...,...,...,...,...,...,...,...,...
501,0.06263,0.0,11.93,0.0,0.573,6.593,69.1,2.4786,1.0,273.0,21.0,391.99,9.67
502,0.04527,0.0,11.93,0.0,0.573,6.120,76.7,2.2875,1.0,273.0,21.0,396.90,9.08
503,0.06076,0.0,11.93,0.0,0.573,6.976,91.0,2.1675,1.0,273.0,21.0,396.90,5.64
504,0.10959,0.0,11.93,0.0,0.573,6.794,89.3,2.3889,1.0,273.0,21.0,393.45,6.48


$$\hat{y} = c + \hat{\beta_{0}}x_0 + \hat{\beta_{1}}x_1 + \hat{\beta_{2}}x_2 + ... + \hat{\beta_{n}}x_n$$

* $\hat{\beta_{n}}$ : coefficients that our model needs to calculate
* $x_n$ : our feature data
* $\hat{y}$ : the model's prediction of our target variable; the observed target is $y = \hat{y} + \epsilon$, where $\epsilon$ is the residual
* $c$ : bias or intercept
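
The equation above can be tried out on a small scale. The following is a minimal sketch, not the notebook's own method: it uses synthetic data (the true values 2.0, 3.0 and -1.5 are made up for illustration) and `np.linalg.lstsq` to find the intercept and coefficients that minimise the sum of squared residuals.

```python
import numpy as np

# Synthetic data (an assumption for illustration): y = 2 + 3*x0 - 1.5*x1 + noise
rng = np.random.default_rng(0)
n_rows = 200
features = rng.normal(size=(n_rows, 2))
noise = rng.normal(scale=0.1, size=n_rows)
target = 2.0 + 3.0 * features[:, 0] - 1.5 * features[:, 1] + noise

# Prepend a column of ones so the first fitted value plays the role of c
design = np.column_stack([np.ones(n_rows), features])

# Least-squares solution: minimises the sum of squared residuals
coefficients, *_ = np.linalg.lstsq(design, target, rcond=None)
intercept, beta_0, beta_1 = coefficients

predictions = design @ coefficients  # y-hat
residuals = target - predictions     # epsilon

print("intercept:", intercept)
print("beta_0:", beta_0, "beta_1:", beta_1)
```

The column of ones is what `sm.add_constant` adds to the feature table further down; without it the fit is forced through the origin.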

# without constant

In [17]:
model = sm.OLS(y, X).fit()
model.summary()

Dep. Variable:,y,R-squared (uncentered):,0.959
Model:,OLS,Adj. R-squared (uncentered):,0.958
Method:,Least Squares,F-statistic:,891.3
Date:,"Thu, 04 Mar 2021",Prob (F-statistic):,0.0
Time:,10:09:40,Log-Likelihood:,-1523.8
No. Observations:,506,AIC:,3074.0
Df Residuals:,493,BIC:,3128.0
Df Model:,13,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
CRIM,-0.0929,0.034,-2.699,0.007,-0.161,-0.025
ZN,0.0487,0.014,3.382,0.001,0.020,0.077
INDUS,-0.0041,0.064,-0.063,0.950,-0.131,0.123
CHAS,2.8540,0.904,3.157,0.002,1.078,4.630
NOX,-2.8684,3.359,-0.854,0.394,-9.468,3.731
RM,5.9281,0.309,19.178,0.000,5.321,6.535
AGE,-0.0073,0.014,-0.526,0.599,-0.034,0.020
DIS,-0.9685,0.196,-4.951,0.000,-1.353,-0.584
RAD,0.1712,0.067,2.564,0.011,0.040,0.302

Omnibus:,204.082,Durbin-Watson:,0.999
Prob(Omnibus):,0.0,Jarque-Bera (JB):,1374.225
Skew:,1.609,Prob(JB):,3.9e-299
Kurtosis:,10.404,Cond. No.,8500.0


# with constant

In [11]:
X_added_constant = sm.add_constant(X)

In [13]:
model = sm.OLS(y, X_added_constant).fit()

In [16]:
model.summary()

Dep. Variable:,y,R-squared:,0.741
Model:,OLS,Adj. R-squared:,0.734
Method:,Least Squares,F-statistic:,108.1
Date:,"Thu, 04 Mar 2021",Prob (F-statistic):,6.72e-135
Time:,09:36:08,Log-Likelihood:,-1498.8
No. Observations:,506,AIC:,3026.0
Df Residuals:,492,BIC:,3085.0
Df Model:,13,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,36.4595,5.103,7.144,0.000,26.432,46.487
CRIM,-0.1080,0.033,-3.287,0.001,-0.173,-0.043
ZN,0.0464,0.014,3.382,0.001,0.019,0.073
INDUS,0.0206,0.061,0.334,0.738,-0.100,0.141
CHAS,2.6867,0.862,3.118,0.002,0.994,4.380
NOX,-17.7666,3.820,-4.651,0.000,-25.272,-10.262
RM,3.8099,0.418,9.116,0.000,2.989,4.631
AGE,0.0007,0.013,0.052,0.958,-0.025,0.027
DIS,-1.4756,0.199,-7.398,0.000,-1.867,-1.084

Omnibus:,178.041,Durbin-Watson:,1.078
Prob(Omnibus):,0.0,Jarque-Bera (JB):,783.126
Skew:,1.521,Prob(JB):,8.84e-171
Kurtosis:,8.281,Cond. No.,15100.0
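
The two summaries report very different R² values: 0.959 ("uncentered") without a constant and 0.741 with one. For a model without a constant, statsmodels measures the total variation around zero instead of around the mean of `y`, so the two numbers are not comparable. The sketch below illustrates the effect on synthetic data; the data, the helper `r_squared` and all variable names are assumptions made for this example.

```python
import numpy as np


def r_squared(y_true, y_pred, centered=True):
    """1 - SS_res / SS_total, with SS_total taken around the mean
    (centered) or around zero (uncentered)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    baseline = y_true.mean() if centered else 0.0
    ss_total = np.sum((y_true - baseline) ** 2)
    return 1.0 - ss_res / ss_total


# Synthetic data: strictly positive feature, target with a large offset
rng = np.random.default_rng(1)
x = rng.uniform(5, 10, size=300)
y = 20.0 + 2.0 * x + rng.normal(scale=3.0, size=300)

# Fit WITHOUT a constant: y ~ b * x
b = (x @ y) / (x @ x)
pred_no_const = b * x

# Fit WITH a constant: y ~ c + b * x
design = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
pred_with_const = design @ coef

r2_with_const = r_squared(y, pred_with_const)
r2_no_const_uncentered = r_squared(y, pred_no_const, centered=False)
r2_no_const_centered = r_squared(y, pred_no_const, centered=True)

print("with constant, centered R2:     ", r2_with_const)
print("without constant, uncentered R2:", r2_no_const_uncentered)
print("without constant, centered R2:  ", r2_no_const_centered)
```

In this setup the model without a constant fits worse, yet its uncentered R² comes out higher, which is the same pattern as in the two summaries. Comparing the models on the same (centered) definition, or on AIC/BIC, avoids the trap.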


* **R²**:
    * ratio between explained variance (that my model is able to explain) and total variance
    * or in math: $$R^2 = 1 - \frac{SS_\text{res}}{SS_\text{total}} = 1 - \frac{\sum_{i}(y_i - f_i)^2}{\sum_i(y_i - \bar{y})^2}$$
    
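    * compares the sum of squared residuals of your model with the sum of squared residuals of a "dumb" model that would just predict a horizontal line at the mean $\bar{y}$ of all values.
    * the model without a constant reports an *uncentered* R², where $\bar{y}$ in the denominator is replaced by 0; this is why its 0.959 cannot be compared with the 0.741 of the model with a constant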
    
* **Adjusted R²**:
    * math: $$R^2_\text{adj} = 1-(1-R^2)\frac{n-1}{n-p-1}$$
        * $n$: number of observations (rows)
        * $p$: number of features
    * adjusts R² for the fact that R² will naturally increase if you add more features, even if they add little explanatory power to your model
    * adjusted R² only increases when the new feature adds enough explanatory power to outweigh the penalty for the extra parameter
    * so it penalizes non-useful predictors
    * when **R²** and **adj. R²** are vastly different: hint that you have variables that can be omitted
    * when **R²** $\approx$ **adj. R²**: good selection of variables
* **F-statistic**:
    * **Overall** significance of the linear regression
    * $H_0$: all feature coefficients are zero, or: my OLS model and an intercept-only model would perform equally well
    * $H_1$: at least one coefficient is non-zero, so my OLS model performs better than the intercept-only model
* **Prob (F-statistic)**:
    * the associated p-value
    * if p is small (e.g. below 0.05), the overall regression is meaningful
    
* **AIC/BIC**:
    * AIC stands for *Akaike’s Information Criterion*
    * used for model selection.
    * penalizes every additional parameter, so a new variable must improve the fit enough to pay for itself.
    * It is calculated as $\text{AIC} = 2k - 2\ln(\hat{L})$, where $k$ is the number of estimated parameters and $\ln(\hat{L})$ is the log-likelihood of the model.
    * A lower AIC implies a better model.
    * BIC stands for *Bayesian Information Criterion* and is a variant of AIC whose penalty grows with the number of observations $n$: $\text{BIC} = k\ln(n) - 2\ln(\hat{L})$.
    
    
* **Prob(Omnibus)**: 
    * tests whether the residuals are normally distributed (base assumption for OLS)
    * again a hypothesis test: $H_0$ = they are normally distributed
    * when the residuals are close to normal, Prob(Omnibus) is close to 1
    * if very low (close to 0), the OLS assumption is not satisfied
    
* **Durbin Watson**:
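    * checks whether the residuals are autocorrelated; the statistic ranges from 0 to 4, a value close to 2 indicates no autocorrelation, and values well below 2 point to positive autocorrelation
    * it does not test whether the variance of the errors is constant ("homoscedasticity"); "[heteroscedasticity](https://en.wikipedia.org/wiki/Heteroscedasticity)" is a separate issue that can also pose a big problem for linear regression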
    
* **Prob(Jarque-Bera)**
    * should be in line with Prob(Omnibus); a small value here (i.e. a large Jarque-Bera statistic) indicates that the residuals are not normally distributed
    
* **Skew**
    * a measure of data symmetry. We want to see something close to zero, as expected for normally distributed residuals. Note that this value also drives the Omnibus. The summary with a constant reports a skew of 1.521, so the residuals are noticeably right-skewed.
    
* **Kurtosis**
    * measure of "peakiness" and tail weight of the residual distribution.
    * A normal distribution has a kurtosis of 3.
    * Kurtosis well above 3 (8.281 in the summary with a constant) points to heavy tails, i.e. more extreme residuals (outliers) than a normal distribution would produce.
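
Several of the numbers in the summary can be re-derived by hand, which is a useful sanity check on the definitions. The sketch below plugs the values printed in the summary with a constant into the standard textbook formulas; the F-statistic and Jarque-Bera formulas are not stated in these notes and are added here as assumptions about how the summary values are defined. Because the inputs are rounded, the results only match up to rounding.

```python
import math

# Values read off the summary of the model with a constant
n_obs = 506               # No. Observations
n_features = 13           # Df Model
r2 = 0.741                # R-squared (already rounded in the summary)
log_likelihood = -1498.8  # Log-Likelihood
skew = 1.521
kurtosis = 8.281          # non-excess kurtosis (normal distribution = 3)

df_resid = n_obs - n_features - 1  # Df Residuals = 492

# Adjusted R²
adj_r2 = 1 - (1 - r2) * (n_obs - 1) / df_resid

# F-statistic: explained vs. unexplained variance per degree of freedom
f_stat = (r2 / n_features) / ((1 - r2) / df_resid)

# AIC and BIC; k counts the 13 coefficients plus the constant
k = n_features + 1
aic = 2 * k - 2 * log_likelihood
bic = k * math.log(n_obs) - 2 * log_likelihood

# Jarque-Bera statistic built from skew and kurtosis
jarque_bera = n_obs / 6 * (skew ** 2 + (kurtosis - 3) ** 2 / 4)

print(f"adj. R2     = {adj_r2:.3f}")
print(f"F-statistic = {f_stat:.1f}")
print(f"AIC         = {aic:.1f}")
print(f"BIC         = {bic:.1f}")
print(f"Jarque-Bera = {jarque_bera:.1f}")
```

The summary reports 0.734, 108.1, 3026, 3085 and 783.126 for these quantities; small differences come from the rounded inputs.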


### single feature statistics

* **p-value**: tells us whether that feature / predictor is meaningful. It is the result of a statistical hypothesis test:
    * $H_0$: the feature doesn't have an effect on the target - the coefficient is zero, $\beta_{i} = 0$
    * $H_1$: the feature does have an effect on the target, $\beta_{i} \neq 0$
    * $t_i = \frac{\hat{\beta_{i}}}{\hat{\sigma_{i}}}$ (observed), where $\hat{\sigma_{i}}$ is the standard error of $\hat{\beta_{i}}$ (the `std err` column)
    * the p-value is then the probability of obtaining a $|t|$ as large as or larger than the observed one if $H_0$ were true
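
The test above can be reproduced by hand. This is a minimal sketch on synthetic data (the features `useful` and `useless` and all numbers are assumptions for illustration), using the usual OLS standard-error formula and the t-distribution from `scipy.stats`; it is meant to mirror the `coef`, `std err`, `t` and `P>|t|` columns of the summary, not to replace `model.summary()`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_rows = 200
useful = rng.normal(size=n_rows)   # really influences the target
useless = rng.normal(size=n_rows)  # pure noise, true coefficient is 0
target = 1.0 + 2.0 * useful + rng.normal(size=n_rows)

# Design matrix: constant, useful feature, useless feature
design = np.column_stack([np.ones(n_rows), useful, useless])
coef, *_ = np.linalg.lstsq(design, target, rcond=None)

# Standard errors from the residual variance and (X'X)^-1
residuals = target - design @ coef
df_resid = n_rows - design.shape[1]        # "Df Residuals"
sigma2 = residuals @ residuals / df_resid  # estimated error variance
cov = sigma2 * np.linalg.inv(design.T @ design)
std_err = np.sqrt(np.diag(cov))

# Observed t and two-sided p-value under H0: coefficient = 0
t_values = coef / std_err
p_values = 2 * stats.t.sf(np.abs(t_values), df_resid)

for name, c, se, t, p in zip(["const", "useful", "useless"],
                             coef, std_err, t_values, p_values):
    print(f"{name:8s} coef={c:7.3f} std err={se:6.3f} t={t:7.3f} P>|t|={p:.3f}")
```

The `useful` feature gets a very small p-value, while the `useless` one will usually (though not always, by the nature of the test) end up well above 0.05, which is the signal used to drop features.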


## more sources
* good Cross Validated (Stack Exchange) answer on how the values are calculated, [link](https://stats.stackexchange.com/questions/5135/interpretation-of-rs-lm-output)
* very good explanation for all the values in that summary() [here](https://www.accelebrate.com/blog/interpreting-results-from-linear-regression-is-the-data-appropriate)

# Lesson 2, return to case study "healthcare for all" (mail marketing)! 

In [20]:
import pandas as pd
import numpy as np

pd.set_option('display.max_columns', None)

import warnings
warnings.filterwarnings('ignore')


In [22]:

ls files_for_lesson_and_activities/

categorical.csv  numerical.csv.zip     target.csv
numerical.csv    regression_data1.csv


In [24]:
numerical = pd.read_csv('files_for_lesson_and_activities/numerical.csv')
categorical = pd.read_csv('files_for_lesson_and_activities/categorical.csv')
targets = pd.read_csv('files_for_lesson_and_activities/target.csv')

In [26]:
targets['TARGET_B'].value_counts()

0    90569
1     4843
Name: TARGET_B, dtype: int64

### we will first try to predict whether someone will respond to our mail

In [30]:
data = pd.concat([numerical, targets], axis=1)
data = data.drop( ['TARGET_D'], axis=1)

In [31]:
data.head()

Unnamed: 0,TCODE,AGE,INCOME,WEALTH1,HIT,MALEMILI,MALEVET,VIETVETS,WWIIVETS,LOCALGOV,STATEGOV,FEDGOV,WEALTH2,POP901,POP902,POP903,POP90C1,POP90C2,POP90C3,POP90C4,POP90C5,ETH1,ETH2,ETH3,ETH4,ETH5,ETH6,ETH7,ETH8,ETH9,ETH10,ETH11,ETH12,ETH13,ETH14,ETH15,ETH16,AGE901,AGE902,AGE903,AGE904,AGE905,AGE906,AGE907,CHIL1,CHIL2,CHIL3,AGEC1,AGEC2,AGEC3,AGEC4,AGEC5,AGEC6,AGEC7,CHILC1,CHILC2,CHILC3,CHILC4,CHILC5,HHAGE1,HHAGE2,HHAGE3,HHN1,HHN2,HHN3,HHN4,HHN5,HHN6,MARR1,MARR2,MARR3,MARR4,HHP1,HHP2,DW1,DW2,DW3,DW4,DW5,DW6,DW7,DW8,DW9,HV1,HV2,HV3,HV4,HU1,HU2,HU3,HU4,HU5,HHD1,HHD2,HHD3,HHD4,HHD5,HHD6,HHD7,HHD8,HHD9,HHD10,HHD11,HHD12,ETHC1,ETHC2,ETHC3,ETHC4,ETHC5,ETHC6,HVP1,HVP2,HVP3,HVP4,HVP5,HVP6,HUR1,HUR2,RHP1,RHP2,RHP3,RHP4,HUPA1,HUPA2,HUPA3,HUPA4,HUPA5,HUPA6,HUPA7,RP1,RP2,RP3,RP4,MSA,ADI,DMA,IC1,IC2,IC3,IC4,IC5,IC6,IC7,IC8,IC9,IC10,IC11,IC12,IC13,IC14,IC15,IC16,IC17,IC18,IC19,IC20,IC21,IC22,IC23,HHAS1,HHAS2,HHAS3,HHAS4,MC1,MC2,MC3,TPE1,TPE2,TPE3,TPE4,TPE5,TPE6,TPE7,TPE8,TPE9,PEC1,PEC2,TPE10,TPE11,TPE12,TPE13,LFC1,LFC2,LFC3,LFC4,LFC5,LFC6,LFC7,LFC8,LFC9,LFC10,OCC1,OCC2,OCC3,OCC4,OCC5,OCC6,OCC7,OCC8,OCC9,OCC10,OCC11,OCC12,OCC13,EIC1,EIC2,EIC3,EIC4,EIC5,EIC6,EIC7,EIC8,EIC9,EIC10,EIC11,EIC12,EIC13,EIC14,EIC15,EIC16,OEDC1,OEDC2,OEDC3,OEDC4,OEDC5,OEDC6,OEDC7,EC1,EC2,EC3,EC4,EC5,EC6,EC7,EC8,SEC1,SEC2,SEC3,SEC4,SEC5,AFC1,AFC2,AFC3,AFC4,AFC5,AFC6,VC1,VC2,VC3,VC4,ANC1,ANC2,ANC3,ANC4,ANC5,ANC6,ANC7,ANC8,ANC9,ANC10,ANC11,ANC12,ANC13,ANC14,ANC15,POBC1,POBC2,LSC1,LSC2,LSC3,LSC4,VOC1,VOC2,VOC3,HC1,HC2,HC3,HC4,HC5,HC6,HC7,HC8,HC9,HC10,HC11,HC12,HC13,HC14,HC15,HC16,HC17,HC18,HC19,HC20,HC21,MHUC1,MHUC2,AC1,AC2,CARDPROM,NUMPROM,CARDPM12,NUMPRM12,RAMNTALL,NGIFTALL,CARDGIFT,MINRAMNT,MAXRAMNT,LASTGIFT,TIMELAG,AVGGIFT,CONTROLN,HPHONE_D,RFA_2F,CLUSTER2,TARGET_B
0,0,60.0,5,9,0,0,39,34,18,10,2,1,5,992,264,332,0,35,65,47,53,92,1,0,0,11,0,0,0,0,0,0,0,11,0,0,0,39,48,51,40,50,54,25,31,42,27,11,14,18,17,13,11,15,12,11,34,25,18,26,10,23,18,33,49,28,12,4,61,7,12,19,198,276,97,95,2,2,0,0,7,7,0,479,635,3,2,86,14,96,4,7,38,80,70,32,84,16,6,2,5,9,15,3,17,50,25,0,0,0,2,7,13,27,47,0,1,61,58,61,15,4,2,0,0,14,1,0,0,2,5,17,73,0.0,177.0,682.0,307,318,349,378,12883,13,23,23,23,15,1,0,0,1,4,25,24,26,17,2,0,0,2,28,4,51,1,46,54,3,88,8,0,0,0,0,0,0,4,1,13,14,16,2,45,56,64,50,64,44,62,53,99,0,0,9,3,8,13,9,0,3,9,3,15,19,5,4,3,0,3,41,1,0,7,13,6,5,0,4,9,4,1,3,10,2,1,7,78,2,0,120,16,10,39,21,8,4,3,5,20,3,19,4,0,0,0,18,39,0,34,23,18,16,1,4,0,23,0,0,5,1,0,0,0,0,0,2,0,3,74,88,8,0,4,96,77,19,13,31,5,14,14,31,54,46,0,0,90,0,10,0,0,0,33,65,40,99,99,6,2,10,7,27,74,6,14,240.0,31,14,5.0,12.0,10.0,4,7.741935,95515,0,4,39,0
1,1,46.0,6,9,16,0,15,55,11,6,2,1,9,3611,940,998,99,0,0,50,50,67,0,0,31,6,4,2,6,4,14,0,0,2,0,1,4,34,41,43,32,42,45,32,33,46,21,13,14,33,23,10,4,2,11,16,36,22,15,12,1,5,4,21,75,55,23,9,69,4,3,24,317,360,99,99,0,0,0,0,0,0,0,5468,5218,12,10,96,4,97,3,9,59,94,88,55,95,5,4,1,3,5,4,2,18,44,5,0,0,0,97,98,98,98,99,94,0,83,76,73,21,5,0,0,0,4,0,0,0,91,91,91,94,4480.0,13.0,803.0,1088,1096,1026,1037,36175,2,6,2,5,15,14,13,10,33,2,5,2,5,15,14,14,10,32,6,2,66,3,56,44,9,80,14,0,0,0,0,0,0,6,0,2,24,32,12,71,70,83,58,81,57,64,57,99,99,0,22,24,4,21,13,2,1,6,0,4,1,0,3,1,0,6,13,1,2,8,18,11,4,3,4,10,7,11,1,6,2,1,16,69,5,2,160,5,5,12,21,7,30,20,14,24,4,24,10,0,0,0,8,15,0,55,10,11,0,0,2,0,3,1,1,2,3,1,1,0,3,0,0,0,42,39,50,7,27,16,99,92,53,5,10,2,26,56,97,99,0,0,0,96,0,4,0,0,0,99,0,99,99,99,20,4,6,5,12,32,6,13,47.0,3,1,10.0,25.0,25.0,18,15.666667,148535,0,2,1,0
2,1,61.611649,3,1,2,0,20,29,33,6,8,1,1,7001,2040,2669,0,2,98,49,51,96,2,0,0,2,0,0,0,0,0,0,0,2,0,0,0,35,43,46,37,45,49,23,35,40,25,13,20,19,16,13,10,8,15,14,30,22,19,25,10,23,21,35,44,22,6,2,63,9,9,19,183,254,69,69,1,6,5,3,3,3,0,497,546,2,1,78,22,93,7,18,36,76,65,30,86,14,7,2,5,11,17,3,17,60,18,0,1,0,0,1,6,18,50,0,4,36,49,51,14,5,4,2,24,11,2,3,6,0,2,9,44,0.0,281.0,518.0,251,292,292,340,11576,32,18,20,15,12,2,0,0,1,20,19,24,18,16,2,0,0,1,28,8,31,11,38,62,8,74,22,0,0,0,0,0,2,2,1,21,19,24,6,61,65,73,59,70,56,78,62,82,99,4,10,5,2,6,12,0,1,9,5,18,20,5,7,6,0,11,33,4,3,2,12,3,3,2,0,7,8,3,3,6,7,1,8,74,3,1,120,22,20,28,16,6,5,3,1,23,1,16,6,0,0,0,10,21,0,28,23,32,8,1,14,1,5,0,0,7,0,0,0,0,0,1,0,0,2,84,96,3,0,0,92,65,29,9,22,3,12,23,50,69,31,0,0,0,6,35,44,0,15,22,77,17,97,92,9,2,6,5,26,63,6,14,202.0,27,14,2.0,16.0,5.0,12,7.481481,15078,1,4,60,0
3,0,70.0,1,4,2,0,23,14,31,3,0,3,0,640,160,219,0,8,92,54,46,61,0,0,11,32,6,2,0,0,0,0,0,31,0,0,1,32,40,44,34,43,47,25,45,35,20,15,25,17,17,12,7,7,20,17,30,14,19,25,11,23,23,27,50,30,15,8,63,9,6,23,199,283,85,83,3,4,1,0,2,0,2,1000,1263,2,1,48,52,93,7,6,36,73,61,30,84,16,6,3,3,21,12,4,13,36,13,0,0,0,10,25,50,69,92,10,15,42,55,50,15,5,4,0,9,42,4,0,5,1,8,17,34,9340.0,67.0,862.0,386,388,396,423,15130,27,12,4,26,22,5,0,0,4,35,5,6,12,30,6,0,0,5,22,14,26,20,46,54,3,58,36,0,0,0,0,0,6,0,0,17,13,15,0,43,69,81,53,68,45,33,31,0,99,23,17,3,0,6,6,0,0,13,42,12,0,0,0,42,0,6,3,0,0,0,23,3,3,6,0,3,3,3,3,3,0,3,6,87,0,0,120,28,12,14,27,10,3,5,0,19,1,17,0,0,0,0,13,23,0,14,40,31,16,0,1,0,13,0,0,4,0,0,0,3,0,0,0,0,29,67,56,41,3,0,94,43,27,4,38,0,10,19,39,45,55,0,0,45,22,17,0,0,16,23,77,22,93,89,16,2,6,6,27,66,6,14,109.0,16,7,2.0,11.0,10.0,9,6.8125,172556,1,4,41,0
4,0,78.0,3,2,60,1,28,9,53,26,3,2,9,2520,627,761,99,0,0,46,54,2,98,0,0,1,0,0,0,0,0,0,0,0,0,0,0,33,45,50,36,46,50,27,34,43,23,14,21,13,15,20,12,5,13,15,34,19,19,31,7,27,16,26,57,36,24,14,42,17,9,33,235,323,99,98,0,0,0,0,0,0,0,576,594,4,3,90,10,97,3,0,42,82,49,22,92,8,20,3,17,9,23,1,1,1,0,21,58,19,0,1,2,16,67,0,2,45,52,53,16,6,0,0,0,9,0,0,0,25,58,74,83,5000.0,127.0,528.0,240,250,293,321,9836,24,29,23,13,4,4,0,0,2,21,30,22,16,4,5,0,0,3,35,8,11,14,20,80,4,73,22,1,1,0,0,0,3,1,2,1,24,27,3,76,61,73,51,65,49,80,31,81,99,10,17,8,2,6,15,3,7,22,2,9,0,7,2,2,0,6,1,5,2,2,12,2,7,6,4,15,29,4,3,26,3,2,7,49,12,1,120,16,20,30,13,3,12,5,2,26,1,20,7,1,1,1,15,28,4,9,16,53,20,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,65,99,0,0,0,90,45,18,25,34,0,1,3,6,33,67,0,0,9,14,72,3,0,0,99,1,21,99,96,6,2,7,11,43,113,10,25,254.0,37,8,3.0,15.0,15.0,14,6.864865,7112,1,2,26,0


### get rid of the imbalance by sampling

#### downsampling

In [33]:
# split the rows by class
category_0 = data[data['TARGET_B'] == 0]
category_1 = data[data['TARGET_B'] == 1]

# randomly keep only as many class-0 rows as there are class-1 rows
category_0 = category_0.sample(len(category_1))

In [34]:
category_0

Unnamed: 0,TCODE,AGE,INCOME,WEALTH1,HIT,MALEMILI,MALEVET,VIETVETS,WWIIVETS,LOCALGOV,STATEGOV,FEDGOV,WEALTH2,POP901,POP902,POP903,POP90C1,POP90C2,POP90C3,POP90C4,POP90C5,ETH1,ETH2,ETH3,ETH4,ETH5,ETH6,ETH7,ETH8,ETH9,ETH10,ETH11,ETH12,ETH13,ETH14,ETH15,ETH16,AGE901,AGE902,AGE903,AGE904,AGE905,AGE906,AGE907,CHIL1,CHIL2,CHIL3,AGEC1,AGEC2,AGEC3,AGEC4,AGEC5,AGEC6,AGEC7,CHILC1,CHILC2,CHILC3,CHILC4,CHILC5,HHAGE1,HHAGE2,HHAGE3,HHN1,HHN2,HHN3,HHN4,HHN5,HHN6,MARR1,MARR2,MARR3,MARR4,HHP1,HHP2,DW1,DW2,DW3,DW4,DW5,DW6,DW7,DW8,DW9,HV1,HV2,HV3,HV4,HU1,HU2,HU3,HU4,HU5,HHD1,HHD2,HHD3,HHD4,HHD5,HHD6,HHD7,HHD8,HHD9,HHD10,HHD11,HHD12,ETHC1,ETHC2,ETHC3,ETHC4,ETHC5,ETHC6,HVP1,HVP2,HVP3,HVP4,HVP5,HVP6,HUR1,HUR2,RHP1,RHP2,RHP3,RHP4,HUPA1,HUPA2,HUPA3,HUPA4,HUPA5,HUPA6,HUPA7,RP1,RP2,RP3,RP4,MSA,ADI,DMA,IC1,IC2,IC3,IC4,IC5,IC6,IC7,IC8,IC9,IC10,IC11,IC12,IC13,IC14,IC15,IC16,IC17,IC18,IC19,IC20,IC21,IC22,IC23,HHAS1,HHAS2,HHAS3,HHAS4,MC1,MC2,MC3,TPE1,TPE2,TPE3,TPE4,TPE5,TPE6,TPE7,TPE8,TPE9,PEC1,PEC2,TPE10,TPE11,TPE12,TPE13,LFC1,LFC2,LFC3,LFC4,LFC5,LFC6,LFC7,LFC8,LFC9,LFC10,OCC1,OCC2,OCC3,OCC4,OCC5,OCC6,OCC7,OCC8,OCC9,OCC10,OCC11,OCC12,OCC13,EIC1,EIC2,EIC3,EIC4,EIC5,EIC6,EIC7,EIC8,EIC9,EIC10,EIC11,EIC12,EIC13,EIC14,EIC15,EIC16,OEDC1,OEDC2,OEDC3,OEDC4,OEDC5,OEDC6,OEDC7,EC1,EC2,EC3,EC4,EC5,EC6,EC7,EC8,SEC1,SEC2,SEC3,SEC4,SEC5,AFC1,AFC2,AFC3,AFC4,AFC5,AFC6,VC1,VC2,VC3,VC4,ANC1,ANC2,ANC3,ANC4,ANC5,ANC6,ANC7,ANC8,ANC9,ANC10,ANC11,ANC12,ANC13,ANC14,ANC15,POBC1,POBC2,LSC1,LSC2,LSC3,LSC4,VOC1,VOC2,VOC3,HC1,HC2,HC3,HC4,HC5,HC6,HC7,HC8,HC9,HC10,HC11,HC12,HC13,HC14,HC15,HC16,HC17,HC18,HC19,HC20,HC21,MHUC1,MHUC2,AC1,AC2,CARDPROM,NUMPROM,CARDPM12,NUMPRM12,RAMNTALL,NGIFTALL,CARDGIFT,MINRAMNT,MAXRAMNT,LASTGIFT,TIMELAG,AVGGIFT,CONTROLN,HPHONE_D,RFA_2F,CLUSTER2,TARGET_B
71763,0,49.000000,4,5,2,0,24,26,21,9,16,1,3,898,222,264,99,0,0,48,52,1,99,0,0,0,0,0,0,0,0,0,0,0,0,0,0,27,36,40,30,40,44,33,33,43,24,18,27,25,12,7,7,5,15,13,34,22,16,20,7,17,13,18,69,45,24,10,43,14,6,37,279,338,96,96,0,0,0,0,0,0,0,537,518,4,4,68,32,95,5,0,57,84,50,33,92,8,23,3,20,8,19,3,0,0,1,27,62,10,0,0,0,4,61,0,1,22,46,51,20,6,0,0,1,31,0,0,0,45,60,71,84,6640.0,351.0,560.0,239,274,266,297,9277,30,22,18,26,2,0,2,0,0,17,24,25,31,0,0,3,0,0,37,17,4,10,49,51,4,70,21,6,6,0,0,0,2,2,0,14,20,21,1,71,76,86,68,79,66,91,48,76,99,12,9,3,3,11,23,2,2,19,0,12,6,8,3,0,0,7,13,1,0,2,17,8,4,7,0,21,13,4,1,9,16,1,3,64,7,0,120,13,20,27,16,11,8,4,3,28,2,25,4,0,0,0,11,24,0,26,0,21,45,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,91,98,1,0,2,89,44,17,7,14,9,20,29,70,75,25,0,0,38,0,44,15,0,3,95,5,91,99,98,6,2,2,1,16,46,6,14,61.00,6,3,5.00,15.0,15.0,19,10.166667,17712,0,3,34,0
56573,2,79.000000,5,9,0,0,26,25,47,6,4,1,6,766,162,376,99,0,0,44,56,91,6,1,2,2,0,0,0,0,0,0,0,0,0,0,1,29,32,38,37,41,47,12,54,33,13,24,32,11,8,7,6,11,32,15,26,16,11,15,8,14,41,37,22,9,1,0,35,12,12,41,124,191,30,30,0,70,69,68,6,6,0,425,434,4,4,27,73,89,11,0,17,43,30,12,54,46,5,0,5,26,39,16,10,63,18,1,5,0,0,0,0,0,30,0,4,14,40,43,11,4,8,61,0,6,1,66,0,1,44,95,97,3720.0,59.0,563.0,205,208,231,251,13124,30,27,29,11,3,0,0,0,0,25,31,26,12,7,0,0,0,0,21,3,34,12,72,28,8,88,5,3,3,0,0,0,3,1,0,8,14,17,0,51,72,85,63,83,58,65,47,80,0,6,11,14,3,10,21,1,0,11,2,11,10,1,6,1,0,6,23,4,4,5,20,7,1,4,4,7,3,10,0,4,0,1,0,80,15,0,120,9,18,33,15,7,11,7,2,20,0,8,13,0,0,0,14,36,0,14,0,41,26,8,6,2,6,0,0,5,0,1,1,0,0,0,1,0,6,71,91,0,4,4,91,40,6,3,20,0,0,0,12,12,88,0,0,81,0,13,3,0,0,99,0,97,99,96,5,3,1,7,31,74,5,12,71.03,24,11,0.03,6.0,5.0,3,2.959583,71689,1,3,48,0
89694,2,59.000000,7,8,2,0,35,29,38,13,5,5,9,2123,592,729,0,99,0,48,52,98,1,0,1,2,0,0,0,0,0,0,0,1,0,0,1,33,41,43,33,44,47,32,37,43,21,11,17,31,15,11,9,5,15,14,33,25,13,21,7,19,16,30,53,33,13,5,63,11,6,20,217,291,85,83,5,11,6,3,0,0,0,892,1002,3,3,75,25,94,6,9,49,81,66,38,92,8,10,2,8,8,16,3,26,58,13,0,0,0,3,8,37,68,92,1,1,65,60,61,16,4,9,2,3,14,8,3,1,13,22,58,83,5560.0,245.0,622.0,375,422,479,529,17217,17,9,19,16,21,11,3,1,4,11,9,16,17,27,13,1,2,5,17,3,54,10,47,53,9,77,20,2,2,0,0,0,0,1,2,30,16,30,20,33,65,76,57,74,54,64,42,86,99,3,33,15,4,10,16,0,1,5,2,8,1,3,2,1,7,5,2,1,5,1,16,10,2,1,1,14,16,9,7,13,5,5,12,54,9,2,140,4,7,24,23,4,19,20,12,18,2,25,3,0,0,0,17,35,2,29,18,38,4,0,4,5,3,0,0,13,1,0,0,0,0,2,1,0,2,67,97,1,1,1,98,60,15,7,15,1,12,30,78,95,5,0,2,61,1,37,0,0,1,98,2,93,99,99,8,2,7,8,11,28,6,13,34.00,6,3,5.00,7.0,5.0,3,5.666667,108784,1,4,16,0
76296,0,61.611649,4,2,0,0,51,21,21,6,0,0,9,1341,336,582,99,0,0,45,55,95,3,1,0,5,0,0,0,0,0,0,0,4,0,0,1,33,46,51,37,49,53,28,45,32,23,13,19,16,13,13,13,13,18,20,28,19,15,38,26,36,39,26,34,20,10,4,45,19,15,21,141,230,70,69,8,27,20,16,0,0,0,299,310,2,2,51,49,90,10,5,32,58,38,19,78,22,13,3,10,14,34,3,22,51,23,1,1,0,0,0,0,0,5,0,8,33,48,47,12,4,11,16,0,21,9,18,0,1,2,16,49,3850.0,83.0,527.0,179,258,222,314,9879,47,21,15,6,7,0,3,0,0,27,20,25,10,12,0,6,0,0,45,11,18,27,48,52,0,78,18,0,0,0,0,0,2,2,0,8,11,16,4,26,58,69,49,67,41,60,53,22,0,35,10,5,1,11,8,2,0,18,2,18,13,1,9,2,1,4,35,0,2,1,22,2,4,4,2,8,6,7,0,6,0,0,3,82,8,0,120,20,22,40,14,0,1,3,2,16,0,16,2,0,0,0,23,51,3,21,28,21,24,0,5,0,11,0,0,7,0,0,1,0,1,0,1,0,0,78,97,1,0,2,76,36,6,8,37,0,0,0,18,22,78,0,0,69,0,31,0,0,0,99,0,99,99,92,4,2,8,10,6,15,5,12,20.00,3,1,3.00,10.0,7.0,2,6.666667,62411,1,3,50,0
91936,1,56.000000,5,9,0,0,30,45,8,16,0,3,8,950,277,313,0,0,99,50,50,98,0,1,0,1,0,0,0,0,0,0,0,1,1,0,0,33,41,43,33,43,46,29,35,42,23,12,20,27,21,10,8,3,9,17,33,25,16,17,3,15,10,31,59,34,12,6,66,8,4,22,237,303,93,93,1,6,5,4,0,0,0,838,866,4,3,87,13,98,2,43,46,88,75,37,94,6,8,3,6,9,8,2,23,64,11,0,0,0,0,3,28,62,92,0,0,63,60,62,17,5,2,4,0,7,2,4,0,16,50,88,94,3800.0,111.0,617.0,394,452,471,499,16904,8,19,17,19,19,18,0,0,0,3,21,14,21,21,20,0,0,0,31,2,59,0,20,80,3,85,11,0,0,0,0,0,4,0,15,18,20,24,3,74,68,80,55,78,52,63,62,0,99,0,24,7,4,3,14,0,4,8,0,18,10,4,2,3,0,8,37,7,1,2,8,0,0,0,4,11,11,2,6,16,0,3,3,74,4,0,120,13,12,32,13,10,17,3,3,24,2,14,11,0,0,0,16,30,0,45,19,8,0,1,2,0,28,0,0,1,0,1,7,0,0,0,1,0,0,82,96,0,0,4,97,92,31,15,23,0,3,7,46,64,36,0,0,71,3,7,19,0,0,13,87,67,99,98,8,3,4,7,17,42,6,12,52.00,4,4,5.00,21.0,21.0,8,13.000000,77910,0,1,41,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32009,2,82.000000,4,9,0,0,34,24,15,7,4,1,6,1076,318,382,0,0,99,50,50,99,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,31,39,41,33,42,45,28,34,45,20,12,25,26,15,11,5,5,15,14,35,25,11,16,6,14,15,31,53,30,10,3,71,7,5,17,215,282,59,59,0,0,0,0,0,0,0,652,717,2,1,88,12,93,7,0,45,83,75,40,93,7,5,1,4,8,12,1,24,64,12,0,0,0,1,5,16,38,73,0,1,50,54,56,16,5,0,0,40,8,0,0,3,3,3,12,61,3160.0,213.0,567.0,349,363,361,378,12580,17,13,20,26,21,2,0,0,0,10,18,22,24,24,3,0,0,0,22,8,33,10,34,66,6,70,28,0,0,0,0,0,0,2,5,49,23,28,1,89,74,81,66,81,66,88,52,99,99,0,7,3,0,8,17,0,4,2,0,23,21,10,4,4,0,18,36,8,3,1,10,2,6,0,0,1,11,2,0,7,4,1,4,84,0,0,120,14,15,53,10,4,0,4,3,22,2,20,3,0,0,0,19,34,2,24,20,15,26,1,8,2,3,0,0,12,0,0,0,0,0,1,0,0,0,83,99,0,0,1,96,73,34,8,16,2,14,37,77,88,12,0,0,2,12,39,31,0,16,95,5,7,99,90,5,3,3,7,27,68,6,12,124.00,12,9,3.00,16.0,16.0,2,10.333333,24637,0,1,44,0
82632,1,61.611649,2,6,3,0,29,40,25,1,1,0,5,712,197,290,0,0,99,48,52,91,1,0,0,15,0,0,0,0,0,0,0,13,0,0,2,33,42,46,36,46,50,27,49,35,16,14,21,19,15,11,10,11,18,22,30,15,15,30,18,29,30,31,40,22,8,4,58,11,13,18,166,244,71,70,12,19,7,5,0,0,0,452,478,2,2,72,28,88,12,10,36,68,53,28,85,15,8,2,6,13,25,2,21,51,19,0,0,0,0,0,1,8,36,0,2,49,55,56,13,4,18,1,9,9,11,4,3,0,3,32,54,0.0,53.0,588.0,285,349,318,384,12540,25,20,16,21,11,5,1,0,0,12,17,22,26,16,7,1,0,0,29,5,42,7,40,60,7,76,18,1,1,0,0,1,4,1,1,46,17,18,1,58,71,81,62,79,59,80,60,99,99,0,4,4,1,7,10,0,0,8,2,21,29,6,8,4,0,3,66,2,0,1,12,3,1,2,0,1,2,2,1,1,1,0,2,93,2,0,120,14,23,41,13,4,4,1,1,20,2,17,3,0,0,0,15,29,1,40,6,25,23,1,11,0,16,0,0,4,1,0,1,0,0,0,1,0,3,69,88,11,0,2,92,60,13,6,46,1,5,7,18,24,76,0,1,82,1,9,4,0,4,94,6,97,99,89,5,2,6,5,25,65,6,13,163.00,11,4,5.00,21.0,21.0,6,14.818182,61661,1,1,44,0
70834,1,46.000000,2,1,10,0,42,14,18,13,0,0,9,504,142,211,0,99,0,52,48,80,2,0,0,33,0,0,0,0,0,0,0,22,0,0,11,32,40,44,35,45,47,26,38,48,14,11,28,16,14,13,12,6,15,15,43,17,10,23,9,21,28,33,39,20,7,1,58,17,7,18,166,239,59,59,2,4,2,0,0,0,0,445,455,2,2,67,33,87,13,9,35,67,52,24,82,18,11,3,9,20,16,4,19,49,12,0,1,1,0,0,0,1,38,0,7,19,43,45,13,5,4,0,37,20,4,0,9,0,0,18,54,0.0,291.0,770.0,195,230,233,265,9941,34,29,11,17,8,0,0,0,0,23,32,16,16,13,0,0,0,0,29,6,33,10,38,62,13,61,25,14,14,0,0,0,0,0,0,4,24,26,7,60,60,63,56,57,50,68,21,90,99,4,3,4,3,14,9,3,4,21,0,20,5,8,7,0,17,7,7,4,4,6,10,4,3,15,3,3,17,0,3,13,0,0,4,78,5,0,120,17,13,44,21,3,2,0,1,32,1,26,5,0,0,0,22,42,0,14,27,18,6,0,3,0,6,2,0,2,5,0,0,0,0,0,0,0,8,48,73,25,0,2,90,59,34,7,39,0,8,11,48,52,48,0,0,87,4,3,0,0,6,99,0,99,99,91,5,2,5,11,16,48,7,24,150.00,8,1,10.00,25.0,15.0,9,18.750000,185477,1,3,53,0
65768,0,29.000000,2,2,0,0,23,29,11,20,6,2,9,1663,465,615,0,99,0,49,51,97,1,0,1,2,0,0,0,0,0,0,0,1,0,0,1,34,41,43,33,43,45,29,30,43,27,11,19,30,21,11,6,3,12,13,34,26,15,14,5,11,22,27,51,30,10,2,68,8,4,19,206,270,72,72,0,28,28,27,0,0,0,955,1035,3,4,64,36,93,7,2,44,76,69,40,90,10,4,1,3,11,16,2,24,64,10,0,1,0,2,6,41,93,99,0,7,67,69,63,15,4,3,24,0,9,1,26,0,24,30,60,98,3360.0,201.0,618.0,506,551,550,626,19626,8,7,16,18,33,8,3,5,2,1,4,14,16,42,11,4,6,2,7,1,51,2,55,45,9,88,9,1,1,0,0,0,0,2,0,31,19,28,14,48,74,86,63,85,63,58,56,61,99,2,28,19,5,20,16,1,1,2,1,5,0,0,2,2,4,5,8,3,2,4,16,12,4,1,1,8,16,9,3,7,10,1,11,64,6,1,160,0,1,12,26,6,40,15,1,32,3,22,8,0,0,0,11,23,0,28,8,12,16,0,10,0,10,0,0,7,0,0,1,0,0,2,1,0,2,62,96,1,2,0,98,77,25,4,10,0,7,34,99,99,0,0,0,58,1,42,0,0,0,99,0,99,99,99,10,4,7,2,5,12,4,10,15.00,1,1,15.00,15.0,15.0,9,15.000000,122354,0,1,15,0


In [35]:
data = pd.concat([category_0, category_1], axis=0)
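
The downsampling steps above can be sketched on a toy table. The frame `toy`, its `feature` column and the `random_state` value are assumptions for illustration; the notebook cells do not fix a seed, so their sample changes on every run. The final shuffle is an optional extra so that the two classes are not stored as two sorted blocks.

```python
import pandas as pd

# Toy frame standing in for `data`: 95 rows of class 0, 5 rows of class 1
toy = pd.DataFrame({
    "feature": range(100),
    "TARGET_B": [0] * 95 + [1] * 5,
})

majority = toy[toy["TARGET_B"] == 0]
minority = toy[toy["TARGET_B"] == 1]

# Draw as many majority rows as there are minority rows;
# random_state makes the draw repeatable
majority_down = majority.sample(n=len(minority), random_state=42)

# Recombine, then shuffle the rows and rebuild the index
balanced = (
    pd.concat([majority_down, minority], axis=0)
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)

print(balanced["TARGET_B"].value_counts())
```

Downsampling throws away most of the majority class (here 90 of 95 rows), so it is best applied to the training data only, with the test data left at its original class ratio.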

In [36]:
data

Unnamed: 0,TCODE,AGE,INCOME,WEALTH1,HIT,MALEMILI,MALEVET,VIETVETS,WWIIVETS,LOCALGOV,STATEGOV,FEDGOV,WEALTH2,POP901,POP902,POP903,POP90C1,POP90C2,POP90C3,POP90C4,POP90C5,ETH1,ETH2,ETH3,ETH4,ETH5,ETH6,ETH7,ETH8,ETH9,ETH10,ETH11,ETH12,ETH13,ETH14,ETH15,ETH16,AGE901,AGE902,AGE903,AGE904,AGE905,AGE906,AGE907,CHIL1,CHIL2,CHIL3,AGEC1,AGEC2,AGEC3,AGEC4,AGEC5,AGEC6,AGEC7,CHILC1,CHILC2,CHILC3,CHILC4,CHILC5,HHAGE1,HHAGE2,HHAGE3,HHN1,HHN2,HHN3,HHN4,HHN5,HHN6,MARR1,MARR2,MARR3,MARR4,HHP1,HHP2,DW1,DW2,DW3,DW4,DW5,DW6,DW7,DW8,DW9,HV1,HV2,HV3,HV4,HU1,HU2,HU3,HU4,HU5,HHD1,HHD2,HHD3,HHD4,HHD5,HHD6,HHD7,HHD8,HHD9,HHD10,HHD11,HHD12,ETHC1,ETHC2,ETHC3,ETHC4,ETHC5,ETHC6,HVP1,HVP2,HVP3,HVP4,HVP5,HVP6,HUR1,HUR2,RHP1,RHP2,RHP3,RHP4,HUPA1,HUPA2,HUPA3,HUPA4,HUPA5,HUPA6,HUPA7,RP1,RP2,RP3,RP4,MSA,ADI,DMA,IC1,IC2,IC3,IC4,IC5,IC6,IC7,IC8,IC9,IC10,IC11,IC12,IC13,IC14,IC15,IC16,IC17,IC18,IC19,IC20,IC21,IC22,IC23,HHAS1,HHAS2,HHAS3,HHAS4,MC1,MC2,MC3,TPE1,TPE2,TPE3,TPE4,TPE5,TPE6,TPE7,TPE8,TPE9,PEC1,PEC2,TPE10,TPE11,TPE12,TPE13,LFC1,LFC2,LFC3,LFC4,LFC5,LFC6,LFC7,LFC8,LFC9,LFC10,OCC1,OCC2,OCC3,OCC4,OCC5,OCC6,OCC7,OCC8,OCC9,OCC10,OCC11,OCC12,OCC13,EIC1,EIC2,EIC3,EIC4,EIC5,EIC6,EIC7,EIC8,EIC9,EIC10,EIC11,EIC12,EIC13,EIC14,EIC15,EIC16,OEDC1,OEDC2,OEDC3,OEDC4,OEDC5,OEDC6,OEDC7,EC1,EC2,EC3,EC4,EC5,EC6,EC7,EC8,SEC1,SEC2,SEC3,SEC4,SEC5,AFC1,AFC2,AFC3,AFC4,AFC5,AFC6,VC1,VC2,VC3,VC4,ANC1,ANC2,ANC3,ANC4,ANC5,ANC6,ANC7,ANC8,ANC9,ANC10,ANC11,ANC12,ANC13,ANC14,ANC15,POBC1,POBC2,LSC1,LSC2,LSC3,LSC4,VOC1,VOC2,VOC3,HC1,HC2,HC3,HC4,HC5,HC6,HC7,HC8,HC9,HC10,HC11,HC12,HC13,HC14,HC15,HC16,HC17,HC18,HC19,HC20,HC21,MHUC1,MHUC2,AC1,AC2,CARDPROM,NUMPROM,CARDPM12,NUMPRM12,RAMNTALL,NGIFTALL,CARDGIFT,MINRAMNT,MAXRAMNT,LASTGIFT,TIMELAG,AVGGIFT,CONTROLN,HPHONE_D,RFA_2F,CLUSTER2,TARGET_B
71763,0,49.000000,4,5,2,0,24,26,21,9,16,1,3,898,222,264,99,0,0,48,52,1,99,0,0,0,0,0,0,0,0,0,0,0,0,0,0,27,36,40,30,40,44,33,33,43,24,18,27,25,12,7,7,5,15,13,34,22,16,20,7,17,13,18,69,45,24,10,43,14,6,37,279,338,96,96,0,0,0,0,0,0,0,537,518,4,4,68,32,95,5,0,57,84,50,33,92,8,23,3,20,8,19,3,0,0,1,27,62,10,0,0,0,4,61,0,1,22,46,51,20,6,0,0,1,31,0,0,0,45,60,71,84,6640.0,351.0,560.0,239,274,266,297,9277,30,22,18,26,2,0,2,0,0,17,24,25,31,0,0,3,0,0,37,17,4,10,49,51,4,70,21,6,6,0,0,0,2,2,0,14,20,21,1,71,76,86,68,79,66,91,48,76,99,12,9,3,3,11,23,2,2,19,0,12,6,8,3,0,0,7,13,1,0,2,17,8,4,7,0,21,13,4,1,9,16,1,3,64,7,0,120,13,20,27,16,11,8,4,3,28,2,25,4,0,0,0,11,24,0,26,0,21,45,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,91,98,1,0,2,89,44,17,7,14,9,20,29,70,75,25,0,0,38,0,44,15,0,3,95,5,91,99,98,6,2,2,1,16,46,6,14,61.00,6,3,5.00,15.0,15.0,19,10.166667,17712,0,3,34,0
56573,2,79.000000,5,9,0,0,26,25,47,6,4,1,6,766,162,376,99,0,0,44,56,91,6,1,2,2,0,0,0,0,0,0,0,0,0,0,1,29,32,38,37,41,47,12,54,33,13,24,32,11,8,7,6,11,32,15,26,16,11,15,8,14,41,37,22,9,1,0,35,12,12,41,124,191,30,30,0,70,69,68,6,6,0,425,434,4,4,27,73,89,11,0,17,43,30,12,54,46,5,0,5,26,39,16,10,63,18,1,5,0,0,0,0,0,30,0,4,14,40,43,11,4,8,61,0,6,1,66,0,1,44,95,97,3720.0,59.0,563.0,205,208,231,251,13124,30,27,29,11,3,0,0,0,0,25,31,26,12,7,0,0,0,0,21,3,34,12,72,28,8,88,5,3,3,0,0,0,3,1,0,8,14,17,0,51,72,85,63,83,58,65,47,80,0,6,11,14,3,10,21,1,0,11,2,11,10,1,6,1,0,6,23,4,4,5,20,7,1,4,4,7,3,10,0,4,0,1,0,80,15,0,120,9,18,33,15,7,11,7,2,20,0,8,13,0,0,0,14,36,0,14,0,41,26,8,6,2,6,0,0,5,0,1,1,0,0,0,1,0,6,71,91,0,4,4,91,40,6,3,20,0,0,0,12,12,88,0,0,81,0,13,3,0,0,99,0,97,99,96,5,3,1,7,31,74,5,12,71.03,24,11,0.03,6.0,5.0,3,2.959583,71689,1,3,48,0
89694,2,59.000000,7,8,2,0,35,29,38,13,5,5,9,2123,592,729,0,99,0,48,52,98,1,0,1,2,0,0,0,0,0,0,0,1,0,0,1,33,41,43,33,44,47,32,37,43,21,11,17,31,15,11,9,5,15,14,33,25,13,21,7,19,16,30,53,33,13,5,63,11,6,20,217,291,85,83,5,11,6,3,0,0,0,892,1002,3,3,75,25,94,6,9,49,81,66,38,92,8,10,2,8,8,16,3,26,58,13,0,0,0,3,8,37,68,92,1,1,65,60,61,16,4,9,2,3,14,8,3,1,13,22,58,83,5560.0,245.0,622.0,375,422,479,529,17217,17,9,19,16,21,11,3,1,4,11,9,16,17,27,13,1,2,5,17,3,54,10,47,53,9,77,20,2,2,0,0,0,0,1,2,30,16,30,20,33,65,76,57,74,54,64,42,86,99,3,33,15,4,10,16,0,1,5,2,8,1,3,2,1,7,5,2,1,5,1,16,10,2,1,1,14,16,9,7,13,5,5,12,54,9,2,140,4,7,24,23,4,19,20,12,18,2,25,3,0,0,0,17,35,2,29,18,38,4,0,4,5,3,0,0,13,1,0,0,0,0,2,1,0,2,67,97,1,1,1,98,60,15,7,15,1,12,30,78,95,5,0,2,61,1,37,0,0,1,98,2,93,99,99,8,2,7,8,11,28,6,13,34.00,6,3,5.00,7.0,5.0,3,5.666667,108784,1,4,16,0
76296,0,61.611649,4,2,0,0,51,21,21,6,0,0,9,1341,336,582,99,0,0,45,55,95,3,1,0,5,0,0,0,0,0,0,0,4,0,0,1,33,46,51,37,49,53,28,45,32,23,13,19,16,13,13,13,13,18,20,28,19,15,38,26,36,39,26,34,20,10,4,45,19,15,21,141,230,70,69,8,27,20,16,0,0,0,299,310,2,2,51,49,90,10,5,32,58,38,19,78,22,13,3,10,14,34,3,22,51,23,1,1,0,0,0,0,0,5,0,8,33,48,47,12,4,11,16,0,21,9,18,0,1,2,16,49,3850.0,83.0,527.0,179,258,222,314,9879,47,21,15,6,7,0,3,0,0,27,20,25,10,12,0,6,0,0,45,11,18,27,48,52,0,78,18,0,0,0,0,0,2,2,0,8,11,16,4,26,58,69,49,67,41,60,53,22,0,35,10,5,1,11,8,2,0,18,2,18,13,1,9,2,1,4,35,0,2,1,22,2,4,4,2,8,6,7,0,6,0,0,3,82,8,0,120,20,22,40,14,0,1,3,2,16,0,16,2,0,0,0,23,51,3,21,28,21,24,0,5,0,11,0,0,7,0,0,1,0,1,0,1,0,0,78,97,1,0,2,76,36,6,8,37,0,0,0,18,22,78,0,0,69,0,31,0,0,0,99,0,99,99,92,4,2,8,10,6,15,5,12,20.00,3,1,3.00,10.0,7.0,2,6.666667,62411,1,3,50,0
91936,1,56.000000,5,9,0,0,30,45,8,16,0,3,8,950,277,313,0,0,99,50,50,98,0,1,0,1,0,0,0,0,0,0,0,1,1,0,0,33,41,43,33,43,46,29,35,42,23,12,20,27,21,10,8,3,9,17,33,25,16,17,3,15,10,31,59,34,12,6,66,8,4,22,237,303,93,93,1,6,5,4,0,0,0,838,866,4,3,87,13,98,2,43,46,88,75,37,94,6,8,3,6,9,8,2,23,64,11,0,0,0,0,3,28,62,92,0,0,63,60,62,17,5,2,4,0,7,2,4,0,16,50,88,94,3800.0,111.0,617.0,394,452,471,499,16904,8,19,17,19,19,18,0,0,0,3,21,14,21,21,20,0,0,0,31,2,59,0,20,80,3,85,11,0,0,0,0,0,4,0,15,18,20,24,3,74,68,80,55,78,52,63,62,0,99,0,24,7,4,3,14,0,4,8,0,18,10,4,2,3,0,8,37,7,1,2,8,0,0,0,4,11,11,2,6,16,0,3,3,74,4,0,120,13,12,32,13,10,17,3,3,24,2,14,11,0,0,0,16,30,0,45,19,8,0,1,2,0,28,0,0,1,0,1,7,0,0,0,1,0,0,82,96,0,0,4,97,92,31,15,23,0,3,7,46,64,36,0,0,71,3,7,19,0,0,13,87,67,99,98,8,3,4,7,17,42,6,12,52.00,4,4,5.00,21.0,21.0,8,13.000000,77910,0,1,41,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95298,2,45.000000,5,9,0,0,45,28,37,9,2,3,2,2649,671,1098,0,99,1,46,54,94,1,1,1,8,0,0,0,0,0,0,0,6,0,0,1,39,50,55,42,52,55,23,45,37,18,10,17,16,11,11,14,20,18,19,33,16,14,37,17,35,33,37,30,17,7,3,49,17,16,17,147,226,79,75,6,20,14,8,6,6,0,653,752,4,4,57,43,88,12,33,28,61,44,17,72,28,11,3,9,18,27,6,18,45,30,0,1,0,2,5,14,33,83,1,10,18,43,44,12,4,16,4,1,23,11,8,0,13,38,79,94,6780.0,13.0,803.0,194,218,252,292,12177,40,21,13,17,7,1,1,1,0,26,29,16,20,6,0,1,2,0,36,14,34,17,60,40,13,79,19,0,0,0,0,0,0,2,0,31,17,26,9,47,41,54,31,50,29,44,42,35,0,20,12,16,4,8,11,0,5,16,1,17,2,1,7,0,0,19,5,5,5,2,15,7,6,5,3,11,3,8,5,9,2,3,12,68,6,0,120,11,24,30,20,7,5,2,2,19,1,15,5,0,0,0,20,45,2,28,12,37,12,2,5,2,12,0,0,5,2,1,1,0,0,1,1,0,5,35,95,3,0,2,92,42,15,4,22,0,8,12,43,71,29,0,0,65,11,20,0,0,3,94,4,17,99,95,6,2,8,8,33,81,6,13,238.07,30,16,0.07,17.0,17.0,7,7.935667,154544,0,1,52,1
95309,0,51.000000,5,6,1,1,32,43,24,7,5,6,6,8361,2324,3112,99,0,0,50,50,90,1,1,7,9,1,1,1,3,0,0,0,6,0,0,2,30,35,37,31,41,43,28,50,36,14,10,36,24,11,8,8,4,22,20,33,15,10,17,5,14,19,36,46,27,9,3,66,12,4,18,187,266,70,68,1,22,21,19,1,0,1,1693,1692,5,6,72,28,87,13,8,42,75,64,35,85,15,7,2,5,14,15,7,22,57,11,0,1,0,23,68,91,97,99,0,4,49,56,55,14,4,14,8,7,12,3,13,0,58,80,93,97,6920.0,67.0,862.0,423,467,457,492,17493,8,13,16,28,27,6,3,0,1,5,11,13,28,32,7,2,0,1,16,6,48,4,86,14,17,83,11,0,0,0,0,0,3,2,0,59,22,26,5,67,77,86,68,82,66,76,56,92,99,1,13,18,8,16,15,0,2,8,1,13,3,2,2,1,0,10,14,4,2,3,18,9,5,2,1,8,6,6,9,7,5,6,8,70,4,0,140,2,9,23,25,12,22,6,5,22,3,17,7,1,1,2,16,32,1,43,8,24,16,1,4,1,10,1,1,6,3,1,1,1,0,1,1,0,6,58,90,3,3,3,98,71,19,1,4,36,76,81,85,86,14,1,1,68,1,31,0,0,0,99,0,99,99,98,12,2,3,4,13,36,4,10,35.00,3,2,5.00,15.0,15.0,4,11.666667,171302,1,1,20,1
95398,0,86.000000,5,9,0,1,32,21,26,9,1,0,9,2368,651,930,99,0,0,50,50,85,12,0,3,1,1,0,1,1,0,0,0,1,0,0,1,36,42,45,37,44,47,21,34,39,27,12,21,21,19,14,9,4,16,13,28,24,19,18,5,16,26,33,41,25,10,4,61,7,4,28,172,254,69,65,0,30,30,29,0,0,0,934,975,5,5,75,25,98,2,10,29,70,64,27,85,15,3,1,2,16,17,4,14,57,14,2,9,1,1,2,37,86,99,0,3,58,61,57,14,4,7,24,0,2,1,22,0,30,94,97,97,5080.0,111.0,617.0,476,529,511,562,20261,6,9,16,21,31,12,3,1,1,2,8,17,18,34,15,4,1,1,25,2,74,1,32,68,5,87,8,0,0,0,0,0,2,2,1,18,20,22,2,67,75,80,70,78,69,83,60,89,0,3,25,9,4,13,11,0,3,7,0,14,8,3,3,1,1,3,23,2,2,5,19,5,7,1,2,8,12,6,2,9,1,0,5,76,9,0,140,4,4,27,23,8,20,14,5,20,0,15,9,0,1,0,16,32,0,21,26,26,5,0,1,0,34,1,0,1,1,0,1,0,0,0,0,0,5,72,92,1,3,4,98,72,24,9,19,0,1,13,53,89,11,12,7,79,0,18,2,0,0,99,0,99,99,99,8,4,8,5,29,71,6,16,144.00,10,4,5.00,25.0,20.0,15,14.400000,78831,0,3,3,1
95403,0,58.000000,4,9,0,0,24,46,20,6,1,2,5,1663,450,581,0,1,99,50,50,99,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,30,40,43,33,44,47,32,37,43,20,13,24,23,13,10,8,9,13,17,33,24,12,23,11,23,20,30,51,34,15,4,68,5,6,20,205,286,81,81,2,10,9,7,0,0,0,585,606,3,2,79,21,97,3,13,45,77,70,40,90,10,6,1,4,12,12,3,28,57,15,0,0,0,0,0,8,25,62,0,1,65,64,63,15,4,7,4,8,9,3,7,2,0,2,23,82,0.0,107.0,613.0,281,342,326,376,11157,27,16,17,22,12,4,1,0,0,14,18,19,26,15,5,2,0,0,23,2,46,9,38,62,4,69,15,0,0,0,0,0,4,12,0,53,22,23,1,71,71,78,63,75,62,78,66,79,95,3,9,5,7,10,15,1,2,13,15,12,6,4,3,16,0,6,12,5,1,11,12,3,4,3,1,18,4,4,2,6,1,2,17,59,13,2,120,12,9,45,12,9,11,2,2,29,3,24,4,0,0,0,13,25,1,43,19,21,8,0,1,0,44,0,0,3,0,6,0,0,0,0,1,0,0,85,97,1,0,2,97,75,27,9,28,3,10,17,40,50,50,0,0,35,28,11,16,0,10,51,48,46,99,97,7,4,5,6,22,51,4,8,139.00,12,6,3.00,20.0,20.0,10,11.583333,84678,0,1,56,1


In [37]:
data = data.sample(frac=1)

#### upsampling

In [None]:
# again rebuilding dataframe like above
data = pd.concat([numerical, targets], axis=1)
data = data.drop(['TARGET_D'], axis=1)

In [None]:
# separating it with respect to TARGET_B value
category_0 = data[data['TARGET_B'] == 0]
category_1 = data[data['TARGET_B'] == 1]

In [None]:
# randomly sampling category_1, blowing it up to the length of the majority class
category_1 = category_1.sample(len(category_0), replace=True)
print(category_1.shape)

data = pd.concat([category_0, category_1], axis=0)
# shuffling the data
data = data.sample(frac=1)
print(data['TARGET_B'].value_counts())
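
The cells above balance the classes by resampling the minority class (`TARGET_B == 1`) with replacement until it matches the majority class. The opposite approach, shrinking the majority class to the size of the minority class, could be sketched roughly as follows. The small `toy` frame and the helper name `downsample_majority` are made up for illustration; only the `TARGET_B` column name comes from the notebook.

```python
import pandas as pd

def downsample_majority(df, target="TARGET_B", random_state=42):
    """Return a shuffled frame in which every class keeps as many rows as the rarest one."""
    counts = df[target].value_counts()
    n_min = int(counts.min())  # size of the smallest class
    parts = [
        # sampling WITHOUT replacement, so no row is duplicated
        df[df[target] == label].sample(n_min, random_state=random_state)
        for label in counts.index
    ]
    # concatenate the per-class samples and shuffle the result
    return pd.concat(parts, axis=0).sample(frac=1, random_state=random_state)

# toy stand-in for the notebook's `data` frame (values are made up)
toy = pd.DataFrame({
    "feature": range(10),
    "TARGET_B": [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
})
balanced = downsample_majority(toy)
print(balanced["TARGET_B"].value_counts())
```

Downsampling discards majority rows instead of duplicating minority rows, so the resulting frame is smaller but contains no repeated observations.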

# SMOTE (Recap)

In [38]:
from imblearn.over_sampling import SMOTE

data = pd.concat([numerical, targets], axis=1)
data = data.drop(['TARGET_D'], axis=1)

smote = SMOTE()
y = data['TARGET_B']
X = data.drop(['TARGET_B'], axis=1)


X_sm, y_sm = smote.fit_resample(X, y)
y_sm.value_counts()

1    90569
0    90569
Name: TARGET_B, dtype: int64
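
As a reminder of the idea behind SMOTE: each synthetic row is placed somewhere on the line segment between a minority-class sample and one of its nearest minority-class neighbours. The NumPy sketch below illustrates only that interpolation step on made-up 2-D points. The function name `smote_like_samples` and its parameters are hypothetical, and this is not imblearn's implementation.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # pairwise Euclidean distances between all minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)                # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]  # indices of the k nearest per point

    base = rng.integers(0, len(X_min), size=n_new)             # random starting points
    picked = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour for each
    gap = rng.random((n_new, 1))                               # position along the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# made-up minority-class points (corners of the unit square)
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_samples(minority, n_new=5)
print(synthetic.shape)  # (5, 2)
```

Because every synthetic point is a convex combination of two existing minority points, it always stays inside the region already covered by the minority class.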

# TomekLinks

In [39]:
from imblearn.under_sampling import TomekLinks

data = pd.concat([numerical, targets], axis=1)
data = data.drop(['TARGET_D'], axis=1)

y = data['TARGET_B']
X = data.drop(['TARGET_B'], axis=1)
tl = TomekLinks(sampling_strategy='majority')
X_tl, y_tl = tl.fit_resample(X, y)
y_tl.value_counts()

0    87970
1     4843
Name: TARGET_B, dtype: int64
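
In the counts above only class 0 shrinks (to 87970 rows) while class 1 keeps all 4843 rows. A Tomek link is a pair of samples from different classes that are each other's nearest neighbour, and with the `'majority'` strategy only the majority-class member of each pair is dropped. The sketch below illustrates this on made-up 1-D data. The helper `tomek_link_pairs` is hypothetical and uses a brute-force distance matrix, which would not scale to the full dataset.

```python
import numpy as np

def tomek_link_pairs(X, y):
    """Return index pairs (i, j), i < j, that are mutual nearest neighbours with different labels."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # pairwise Euclidean distances; the diagonal is masked out
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    nearest = dists.argmin(axis=1)  # nearest neighbour of every sample
    return [
        (i, int(j)) for i, j in enumerate(nearest)
        if nearest[j] == i and y[i] != y[j] and i < j
    ]

# made-up 1-D data: class 0 is the majority
X_toy = np.array([[0.0], [1.2], [2.0], [2.1], [5.0]])
y_toy = np.array([0, 0, 0, 1, 1])
links = tomek_link_pairs(X_toy, y_toy)
# drop the majority-class (label 0) member of every link
drop = {i if y_toy[i] == 0 else j for i, j in links}
keep = [i for i in range(len(y_toy)) if i not in drop]
print(links, keep)
```

Removing these borderline majority samples cleans up the class boundary but, as the counts show, does not balance the classes on its own.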