
# Purpose of this lesson

* Apply the concept of p-values learnt in the previous lesson to the feature selection process
* Return to our case study and start building a model to solve the classification problem

## learning objectives

After this lesson, you will be able to

* Use **p-values** for feature selection
* Resolve **data imbalance** with **upsampling** techniques
* Resolve **data imbalance** with **downsampling** techniques


# Lesson 1

* Use the `statsmodels` library to build a simple regression model
* Conduct feature selection using p-values



### Boston housing dataset from sklearn

* Goal: predict the **median house value** based on the features provided in the dataset
* We skip data cleaning here, since the only purpose is to demonstrate p-values

In [22]:
import pandas as pd
import numpy as np

pd.set_option("display.max_columns", None)
import warnings
warnings.filterwarnings('ignore')

import statsmodels.api as sm
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
from sklearn.datasets import load_boston

In [23]:
x = load_boston()
# show x

In [24]:
y = x['target']

In [25]:
X = pd.DataFrame(x.data, columns = x['feature_names'])

In [26]:
X.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33


**RECAP**
What are we doing here?
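$$\hat{y} = \hat{\beta_{0}}x_0 + \hat{\beta_{1}}x_1 + \hat{\beta_{2}}x_2 + ... + \hat{\beta_{n}}x_n$$

* $\hat{\beta_{1}}, ..., \hat{\beta_{n}}$ : coefficients that our model needs to estimate
* $x_1, ..., x_n$ : our feature data
* $\hat{y}$ : the model's prediction of our target variable $y$; the difference $y - \hat{y}$ is the error term $\epsilon$
* $\hat{\beta_{0}}$ : bias or intercept, with $x_0 = 1$ for every row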
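
To make the role of the column of ones concrete, here is a minimal sketch using NumPy least squares on made-up data. The toy data and all variable names are assumptions for illustration only; they are not taken from the Boston dataset or from statsmodels.

```python
import numpy as np

# Hypothetical toy data: target = 2 + 3*x1 - 1*x2, without noise
rng = np.random.default_rng(0)
features = rng.normal(size=(50, 2))
target = 2.0 + 3.0 * features[:, 0] - 1.0 * features[:, 1]

# Prepend x_0 = 1 so the intercept is estimated like any other coefficient
design = np.column_stack([np.ones(len(features)), features])

# Ordinary least squares: minimise ||design @ beta - target||^2
beta_hat, *_ = np.linalg.lstsq(design, target, rcond=None)
print(beta_hat)  # intercept first, then one coefficient per feature
```

Without the column of ones, the fitted plane would be forced through the origin and the intercept of 2 could not be recovered.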

In [16]:
# statsmodels does not add an intercept on its own, so we add a column of ones
# (the x_0 above) to our dataframe and store the result in a new variable
# the statsmodels model offers many more statistical outputs and features

X_added_constant = sm.add_constant(X)

In [18]:
model = sm.OLS(y, X_added_constant).fit()

In [19]:
model.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.741
Model:,OLS,Adj. R-squared:,0.734
Method:,Least Squares,F-statistic:,108.1
Date:,"Wed, 03 Mar 2021",Prob (F-statistic):,6.72e-135
Time:,21:43:49,Log-Likelihood:,-1498.8
No. Observations:,506,AIC:,3026.0
Df Residuals:,492,BIC:,3085.0
Df Model:,13,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,36.4595,5.103,7.144,0.000,26.432,46.487
CRIM,-0.1080,0.033,-3.287,0.001,-0.173,-0.043
ZN,0.0464,0.014,3.382,0.001,0.019,0.073
INDUS,0.0206,0.061,0.334,0.738,-0.100,0.141
CHAS,2.6867,0.862,3.118,0.002,0.994,4.380
NOX,-17.7666,3.820,-4.651,0.000,-25.272,-10.262
RM,3.8099,0.418,9.116,0.000,2.989,4.631
AGE,0.0007,0.013,0.052,0.958,-0.025,0.027
DIS,-1.4756,0.199,-7.398,0.000,-1.867,-1.084

0,1,2,3
Omnibus:,178.041,Durbin-Watson:,1.078
Prob(Omnibus):,0.0,Jarque-Bera (JB):,783.126
Skew:,1.521,Prob(JB):,8.84e-171
Kurtosis:,8.281,Cond. No.,15100.0


# What do we need to know here?

* **R²**:
    * ratio between explained variance (that my model is able to explain) and total variance
    * or in math: $$R^2 = 1 - \frac{SS_\text{res}}{SS_\text{total}} = 1 - \frac{\sum_{i}(y_i - f_i)^2}{\sum_i(y_i - \bar{y})^2}$$
    
    * compares the residual sum of squares of your model (with $f_i$ the model's prediction for observation $i$) with that of a "dumb" model that would just be a horizontal line at the mean $\bar{y}$ of all values.
    
* **Adjusted R²**:
    * math: $$R^2_\text{adj} = 1-(1-R^2)\frac{n-1}{n-p-1}$$
        * $n$: number of observations (rows)
        * $p$: number of features
    * adjusts R² for the fact that R² will naturally increase if you add more features, even if they add little explanatory power to your model
    * adjusted R² only increases when the new feature explains enough additional variance to outweigh the penalty for the extra parameter
    * so it penalizes non-useful predictors
    * when **R²** and **adj R²** are vastly different: hint that you have variables that can be omitted
    * when **R²** $\approx$ **adj. R²**: good selection of variables
* **F-statistic**:
    * **Overall** significance of the linear regression
    * $H_0$: My data are best fit by a model with a constant only, or: My OLS model and an intercept-only model would perform equally well 
    * $H_1$: My OLS model fits the data better than the intercept-only model
* **Prob (F-statistic)**: 	
    * the associated p-value
    * if p is small, overall regression is meaningful
    
* **AIC/BIC**:
    * AIC stands for *Akaike’s Information Criterion*
    * used for model selection.
    * penalizes model complexity: every variable added to the regression equation has to improve the fit enough to justify the extra parameter.
    * It is calculated as $AIC = 2k - 2\ln(L)$, where $k$ is the number of estimated parameters and $L$ is the likelihood of the overall model.
    * A lower AIC implies a better model.
    * BIC stands for *Bayesian Information Criterion* and is a variant of AIC with a more severe penalty that grows with the number of observations $n$: $BIC = k\ln(n) - 2\ln(L)$.
    
    
* **Prob(Omnibus)**: 
    * measures whether the residuals are normally distributed (a base assumption for OLS inference)
    * again a hypothesis test: $H_0$ = they are normally distributed
    * a value close to 1 means the residuals are consistent with a normal distribution
    * if very low (close to 0), as it is here, the OLS assumption is not satisfied
    
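* **Durbin Watson**:
    * checks whether the residuals are autocorrelated, i.e. correlated with their neighbours in row order; the statistic ranges from 0 to 4, and a value close to 2 indicates no autocorrelation
    * it does not test whether the variance of the errors is constant ("homoscedasticity"); the opposite, "[heteroscedasticity](https://en.wikipedia.org/wiki/Heteroscedasticity)", can also pose a big problem for linear regression and needs its own check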
    
* **Prob(Jarque-Bera)**
    * Should be in line with Prob(Omnibus); a small value here (that is, a large JB statistic) indicates that the residuals are not normally distributed
    
* **Skew**
    * A measure of data symmetry. We want to see something close to zero, indicating that the residual distribution is symmetric, as a normal distribution is. Note that this value also drives the Omnibus. Here the skew is 1.521, so the residuals are clearly right-skewed.
    
* **Kurtosis**
    * measure of how heavy the tails of the distribution are (often loosely described as "peakiness").
    * A normal distribution has a kurtosis of 3.
    * Greater kurtosis means more residuals far from zero than a normal distribution would produce, so the 8.281 here points to outliers rather than to a better model.
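
As a sanity check on the formulas, the sketch below recomputes adjusted R², AIC and BIC from the numbers printed in the first summary (R² = 0.741, 506 observations, 13 features, log-likelihood = -1498.8). The function names are made up for this example. It assumes the common definitions $AIC = 2k - 2\ln(L)$ and $BIC = k\ln(n) - 2\ln(L)$, with the intercept counted as a parameter ($k = 14$).

```python
import math

def adjusted_r2(r2, n_obs, n_features):
    """Adjusted R² as defined above: 1 - (1 - R²)(n - 1)/(n - p - 1)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_features - 1)

def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k ln(n) - 2 ln(L)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood

# Values read off the first summary: 13 features plus 1 intercept = 14 parameters
full_adj_r2 = adjusted_r2(0.741, 506, 13)
full_aic = aic(-1498.8, 14)
full_bic = bic(-1498.8, 14, 506)
print(round(full_adj_r2, 3), round(full_aic), round(full_bic))  # 0.734 3026 3085
```

The results agree with the `Adj. R-squared`, `AIC` and `BIC` entries of the summary table.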


### single feature statistics

* **p-value**: tells us whether that feature / predictor is meaningful. It is the result of a statistical hypothesis test:
    * $H_0$: the feature doesn't have an effect on the target - the coefficient is zero, $\beta_{i} = 0$
    * $H_1$: the feature does have an effect on the target, $\beta_{i} \neq 0$
    * $t_i = \frac{\hat{\beta_{i}}}{\hat{\sigma_{i}}}$ (observed), where $\hat{\sigma_{i}}$ is the standard error of the coefficient
    * the p-value is then the probability of obtaining a $|t|$ as large as or larger than the observed one if $H_0$ were true
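
The following sketch retraces this calculation for `INDUS`, using the coefficient and standard error printed in the summary. As a simplifying assumption it uses the standard normal distribution instead of the t-distribution; with 492 residual degrees of freedom the two are nearly identical. The helper name `two_sided_p_normal` is made up for this example.

```python
import math

def two_sided_p_normal(t_stat):
    """Two-sided p-value P(|Z| >= |t|), with Z standard normal (approximation)."""
    return math.erfc(abs(t_stat) / math.sqrt(2))

# Coefficient and standard error of INDUS as printed in the summary table
coef, std_err = 0.0206, 0.061

t_indus = coef / std_err          # observed t statistic
p_indus = two_sided_p_normal(t_indus)
print(round(t_indus, 2), round(p_indus, 2))
```

The result is close to the reported `t` of 0.334 and `P>|t|` of 0.738; the small differences come from rounding in the printed table and from the normal approximation.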


## more sources
* good Cross Validated (Stack Exchange) answer on how the values are calculated, [link](https://stats.stackexchange.com/questions/5135/interpretation-of-rs-lm-output)
* very good explanation for all the values in that summary() [here](https://www.accelebrate.com/blog/interpreting-results-from-linear-regression-is-the-data-appropriate)

Both `INDUS` (p = 0.738) and `AGE` (p = 0.958) have p-values far above the usual 0.05 threshold, so we would drop them:

In [20]:
X = X.drop(['INDUS','AGE'], axis=1)
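
The same selection can be made programmatically instead of by reading the table. The sketch below works on p-values copied by hand from the `P>|t|` column (only the rows visible in the printed summary); the threshold `alpha = 0.05` is an assumption, since the lesson does not fix a significance level explicitly.

```python
# p-values copied from the P>|t| column of the summary above
p_values = {
    "CRIM": 0.001, "ZN": 0.001, "INDUS": 0.738, "CHAS": 0.002,
    "NOX": 0.000, "RM": 0.000, "AGE": 0.958, "DIS": 0.000,
}

alpha = 0.05  # assumed significance level

# Keep the names of all features whose p-value exceeds the threshold
to_drop = [name for name, p in p_values.items() if p > alpha]
print(to_drop)  # ['INDUS', 'AGE']
```

With a fitted statsmodels result, the attribute `model.pvalues` holds the same column as a pandas Series, so the hand-copied dictionary would not be needed.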

# Activity 1

* In the lesson, we used a linear regression model to check which variables are significant, and we removed two variables that were not significant based on their p-values. Now build a model on the remaining data and check whether any other variables turn out to be insignificant.
* How does the R squared measure differ from adjusted R squared? Compare the values of R squared and adjusted R squared in the two models.


In [27]:
X_added_constant = sm.add_constant(X)
model = sm.OLS(y,X_added_constant).fit()
model.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.741
Model:,OLS,Adj. R-squared:,0.734
Method:,Least Squares,F-statistic:,108.1
Date:,"Wed, 03 Mar 2021",Prob (F-statistic):,6.72e-135
Time:,22:55:17,Log-Likelihood:,-1498.8
No. Observations:,506,AIC:,3026.0
Df Residuals:,492,BIC:,3085.0
Df Model:,13,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,36.4595,5.103,7.144,0.000,26.432,46.487
CRIM,-0.1080,0.033,-3.287,0.001,-0.173,-0.043
ZN,0.0464,0.014,3.382,0.001,0.019,0.073
INDUS,0.0206,0.061,0.334,0.738,-0.100,0.141
CHAS,2.6867,0.862,3.118,0.002,0.994,4.380
NOX,-17.7666,3.820,-4.651,0.000,-25.272,-10.262
RM,3.8099,0.418,9.116,0.000,2.989,4.631
AGE,0.0007,0.013,0.052,0.958,-0.025,0.027
DIS,-1.4756,0.199,-7.398,0.000,-1.867,-1.084

0,1,2,3
Omnibus:,178.041,Durbin-Watson:,1.078
Prob(Omnibus):,0.0,Jarque-Bera (JB):,783.126
Skew:,1.521,Prob(JB):,8.84e-171
Kurtosis:,8.281,Cond. No.,15100.0


In [28]:
X_added_constant = X_added_constant.drop(['INDUS','AGE'], axis=1)
model = sm.OLS(y,X_added_constant).fit()
model.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.741
Model:,OLS,Adj. R-squared:,0.735
Method:,Least Squares,F-statistic:,128.2
Date:,"Wed, 03 Mar 2021",Prob (F-statistic):,5.54e-137
Time:,22:56:31,Log-Likelihood:,-1498.9
No. Observations:,506,AIC:,3022.0
Df Residuals:,494,BIC:,3072.0
Df Model:,11,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,36.3411,5.067,7.171,0.000,26.385,46.298
CRIM,-0.1084,0.033,-3.307,0.001,-0.173,-0.044
ZN,0.0458,0.014,3.390,0.001,0.019,0.072
CHAS,2.7187,0.854,3.183,0.002,1.040,4.397
NOX,-17.3760,3.535,-4.915,0.000,-24.322,-10.430
RM,3.8016,0.406,9.356,0.000,3.003,4.600
DIS,-1.4927,0.186,-8.037,0.000,-1.858,-1.128
RAD,0.2996,0.063,4.726,0.000,0.175,0.424
TAX,-0.0118,0.003,-3.493,0.001,-0.018,-0.005

0,1,2,3
Omnibus:,178.43,Durbin-Watson:,1.078
Prob(Omnibus):,0.0,Jarque-Bera (JB):,787.785
Skew:,1.523,Prob(JB):,8.6e-172
Kurtosis:,8.3,Cond. No.,14700.0


* when adjusted R² and R² approach each other after you have omitted a feature, you have selected your features well (here adjusted R² rises from 0.734 to 0.735 while R² stays at 0.741)
* [more on adjusted R2 and R2](https://towardsdatascience.com/the-enigma-of-adjusted-r-squared-57b01edac9f) and [here](https://www.analyticsvidhya.com/blog/2020/07/difference-between-r-squared-and-adjusted-r-squared/p)
* look at the F-statistic: it improved from 108.1 to 128.2, and AIC dropped from 3026 to 3022

# Lesson 2, return to case study "healthcare for all" (mail marketing)! 

In [55]:
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None)
import warnings
warnings.filterwarnings('ignore')

numerical = pd.read_csv('files_for_lesson_and_activities/numerical.csv')
categorical = pd.read_csv('files_for_lesson_and_activities/categorical.csv')
targets = pd.read_csv('files_for_lesson_and_activities/target.csv')
print(targets['TARGET_B'].value_counts())

0    90569
1     4843
Name: TARGET_B, dtype: int64


* Huge imbalance!

What if we build the model right away, without managing the imbalance?

* Binary classification problems often involve identifying some rare but critical event, such as fraud/intrusion detection, process failures, and medical diagnosis/monitoring.
* In such cases there is generally a huge imbalance in the representation of the two classes.
* And the class that is of **interest** is usually **under**-represented.

* Let's take a simple example. In our case, category 0 is represented 90569 times (which is 94.9% of the total samples) while category 1 is represented 4843 times. Even if we skip data cleaning, data processing and model building entirely and simply predict 0 for every case, we are correct over 94% of the time. But such a "model" never identifies a single donor, so its high accuracy is worthless on new data for which we have to make predictions.

* A conventional model trained on imbalanced data tends not to make reliable predictions for the minority class.
* The **model will be biased towards the class that has more representation**.
* The minority class might be treated **as noise** by the model.
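
The effect described above can be verified with a few lines of arithmetic. The sketch below evaluates a hypothetical "model" that always predicts 0, using the class counts printed by `value_counts()`; the variable names are chosen for this example.

```python
n_negative = 90569  # rows with TARGET_B == 0
n_positive = 4843   # rows with TARGET_B == 1

# A "model" that always predicts 0 gets every negative right
# and misses every positive
true_negatives = n_negative
false_negatives = n_positive
true_positives = 0

accuracy = true_negatives / (n_negative + n_positive)
recall = true_positives / (true_positives + false_negatives)
print(f"accuracy = {accuracy:.3f}, recall = {recall:.3f}")  # accuracy = 0.949, recall = 0.000
```

This is why accuracy alone is a misleading metric on imbalanced data: recall on the class of interest is zero.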


# Prepare data for prediction: Who is more likely to donate?
* We first want to predict who is more likely to donate, so we drop `TARGET_D` for now (we include it later!)
* We only use the numerical features at first, for demonstration purposes
* We need to resample the data (up or down)

In [32]:
# For demonstration purposes we will use only numerical features
data = pd.concat([numerical, targets], axis=1)

# Dropping target D as this would be the target later, after we predict who is more likely to donate
data = data.drop(['TARGET_D'], axis=1)
data.head()

Unnamed: 0,TCODE,AGE,INCOME,WEALTH1,HIT,MALEMILI,MALEVET,VIETVETS,WWIIVETS,LOCALGOV,STATEGOV,FEDGOV,WEALTH2,POP901,POP902,POP903,POP90C1,POP90C2,POP90C3,POP90C4,POP90C5,ETH1,ETH2,ETH3,ETH4,ETH5,ETH6,ETH7,ETH8,ETH9,ETH10,ETH11,ETH12,ETH13,ETH14,ETH15,ETH16,AGE901,AGE902,AGE903,AGE904,AGE905,AGE906,AGE907,CHIL1,CHIL2,CHIL3,AGEC1,AGEC2,AGEC3,AGEC4,AGEC5,AGEC6,AGEC7,CHILC1,CHILC2,CHILC3,CHILC4,CHILC5,HHAGE1,HHAGE2,HHAGE3,HHN1,HHN2,HHN3,HHN4,HHN5,HHN6,MARR1,MARR2,MARR3,MARR4,HHP1,HHP2,DW1,DW2,DW3,DW4,DW5,DW6,DW7,DW8,DW9,HV1,HV2,HV3,HV4,HU1,HU2,HU3,HU4,HU5,HHD1,HHD2,HHD3,HHD4,HHD5,HHD6,HHD7,HHD8,HHD9,HHD10,HHD11,HHD12,ETHC1,ETHC2,ETHC3,ETHC4,ETHC5,ETHC6,HVP1,HVP2,HVP3,HVP4,HVP5,HVP6,HUR1,HUR2,RHP1,RHP2,RHP3,RHP4,HUPA1,HUPA2,HUPA3,HUPA4,HUPA5,HUPA6,HUPA7,RP1,RP2,RP3,RP4,MSA,ADI,DMA,IC1,IC2,IC3,IC4,IC5,IC6,IC7,IC8,IC9,IC10,IC11,IC12,IC13,IC14,IC15,IC16,IC17,IC18,IC19,IC20,IC21,IC22,IC23,HHAS1,HHAS2,HHAS3,HHAS4,MC1,MC2,MC3,TPE1,TPE2,TPE3,TPE4,TPE5,TPE6,TPE7,TPE8,TPE9,PEC1,PEC2,TPE10,TPE11,TPE12,TPE13,LFC1,LFC2,LFC3,LFC4,LFC5,LFC6,LFC7,LFC8,LFC9,LFC10,OCC1,OCC2,OCC3,OCC4,OCC5,OCC6,OCC7,OCC8,OCC9,OCC10,OCC11,OCC12,OCC13,EIC1,EIC2,EIC3,EIC4,EIC5,EIC6,EIC7,EIC8,EIC9,EIC10,EIC11,EIC12,EIC13,EIC14,EIC15,EIC16,OEDC1,OEDC2,OEDC3,OEDC4,OEDC5,OEDC6,OEDC7,EC1,EC2,EC3,EC4,EC5,EC6,EC7,EC8,SEC1,SEC2,SEC3,SEC4,SEC5,AFC1,AFC2,AFC3,AFC4,AFC5,AFC6,VC1,VC2,VC3,VC4,ANC1,ANC2,ANC3,ANC4,ANC5,ANC6,ANC7,ANC8,ANC9,ANC10,ANC11,ANC12,ANC13,ANC14,ANC15,POBC1,POBC2,LSC1,LSC2,LSC3,LSC4,VOC1,VOC2,VOC3,HC1,HC2,HC3,HC4,HC5,HC6,HC7,HC8,HC9,HC10,HC11,HC12,HC13,HC14,HC15,HC16,HC17,HC18,HC19,HC20,HC21,MHUC1,MHUC2,AC1,AC2,CARDPROM,NUMPROM,CARDPM12,NUMPRM12,RAMNTALL,NGIFTALL,CARDGIFT,MINRAMNT,MAXRAMNT,LASTGIFT,TIMELAG,AVGGIFT,CONTROLN,HPHONE_D,RFA_2F,CLUSTER2,TARGET_B
0,0,60.0,5,9,0,0,39,34,18,10,2,1,5,992,264,332,0,35,65,47,53,92,1,0,0,11,0,0,0,0,0,0,0,11,0,0,0,39,48,51,40,50,54,25,31,42,27,11,14,18,17,13,11,15,12,11,34,25,18,26,10,23,18,33,49,28,12,4,61,7,12,19,198,276,97,95,2,2,0,0,7,7,0,479,635,3,2,86,14,96,4,7,38,80,70,32,84,16,6,2,5,9,15,3,17,50,25,0,0,0,2,7,13,27,47,0,1,61,58,61,15,4,2,0,0,14,1,0,0,2,5,17,73,0.0,177.0,682.0,307,318,349,378,12883,13,23,23,23,15,1,0,0,1,4,25,24,26,17,2,0,0,2,28,4,51,1,46,54,3,88,8,0,0,0,0,0,0,4,1,13,14,16,2,45,56,64,50,64,44,62,53,99,0,0,9,3,8,13,9,0,3,9,3,15,19,5,4,3,0,3,41,1,0,7,13,6,5,0,4,9,4,1,3,10,2,1,7,78,2,0,120,16,10,39,21,8,4,3,5,20,3,19,4,0,0,0,18,39,0,34,23,18,16,1,4,0,23,0,0,5,1,0,0,0,0,0,2,0,3,74,88,8,0,4,96,77,19,13,31,5,14,14,31,54,46,0,0,90,0,10,0,0,0,33,65,40,99,99,6,2,10,7,27,74,6,14,240.0,31,14,5.0,12.0,10.0,4,7.741935,95515,0,4,39,0
1,1,46.0,6,9,16,0,15,55,11,6,2,1,9,3611,940,998,99,0,0,50,50,67,0,0,31,6,4,2,6,4,14,0,0,2,0,1,4,34,41,43,32,42,45,32,33,46,21,13,14,33,23,10,4,2,11,16,36,22,15,12,1,5,4,21,75,55,23,9,69,4,3,24,317,360,99,99,0,0,0,0,0,0,0,5468,5218,12,10,96,4,97,3,9,59,94,88,55,95,5,4,1,3,5,4,2,18,44,5,0,0,0,97,98,98,98,99,94,0,83,76,73,21,5,0,0,0,4,0,0,0,91,91,91,94,4480.0,13.0,803.0,1088,1096,1026,1037,36175,2,6,2,5,15,14,13,10,33,2,5,2,5,15,14,14,10,32,6,2,66,3,56,44,9,80,14,0,0,0,0,0,0,6,0,2,24,32,12,71,70,83,58,81,57,64,57,99,99,0,22,24,4,21,13,2,1,6,0,4,1,0,3,1,0,6,13,1,2,8,18,11,4,3,4,10,7,11,1,6,2,1,16,69,5,2,160,5,5,12,21,7,30,20,14,24,4,24,10,0,0,0,8,15,0,55,10,11,0,0,2,0,3,1,1,2,3,1,1,0,3,0,0,0,42,39,50,7,27,16,99,92,53,5,10,2,26,56,97,99,0,0,0,96,0,4,0,0,0,99,0,99,99,99,20,4,6,5,12,32,6,13,47.0,3,1,10.0,25.0,25.0,18,15.666667,148535,0,2,1,0
2,1,61.611649,3,1,2,0,20,29,33,6,8,1,1,7001,2040,2669,0,2,98,49,51,96,2,0,0,2,0,0,0,0,0,0,0,2,0,0,0,35,43,46,37,45,49,23,35,40,25,13,20,19,16,13,10,8,15,14,30,22,19,25,10,23,21,35,44,22,6,2,63,9,9,19,183,254,69,69,1,6,5,3,3,3,0,497,546,2,1,78,22,93,7,18,36,76,65,30,86,14,7,2,5,11,17,3,17,60,18,0,1,0,0,1,6,18,50,0,4,36,49,51,14,5,4,2,24,11,2,3,6,0,2,9,44,0.0,281.0,518.0,251,292,292,340,11576,32,18,20,15,12,2,0,0,1,20,19,24,18,16,2,0,0,1,28,8,31,11,38,62,8,74,22,0,0,0,0,0,2,2,1,21,19,24,6,61,65,73,59,70,56,78,62,82,99,4,10,5,2,6,12,0,1,9,5,18,20,5,7,6,0,11,33,4,3,2,12,3,3,2,0,7,8,3,3,6,7,1,8,74,3,1,120,22,20,28,16,6,5,3,1,23,1,16,6,0,0,0,10,21,0,28,23,32,8,1,14,1,5,0,0,7,0,0,0,0,0,1,0,0,2,84,96,3,0,0,92,65,29,9,22,3,12,23,50,69,31,0,0,0,6,35,44,0,15,22,77,17,97,92,9,2,6,5,26,63,6,14,202.0,27,14,2.0,16.0,5.0,12,7.481481,15078,1,4,60,0
3,0,70.0,1,4,2,0,23,14,31,3,0,3,0,640,160,219,0,8,92,54,46,61,0,0,11,32,6,2,0,0,0,0,0,31,0,0,1,32,40,44,34,43,47,25,45,35,20,15,25,17,17,12,7,7,20,17,30,14,19,25,11,23,23,27,50,30,15,8,63,9,6,23,199,283,85,83,3,4,1,0,2,0,2,1000,1263,2,1,48,52,93,7,6,36,73,61,30,84,16,6,3,3,21,12,4,13,36,13,0,0,0,10,25,50,69,92,10,15,42,55,50,15,5,4,0,9,42,4,0,5,1,8,17,34,9340.0,67.0,862.0,386,388,396,423,15130,27,12,4,26,22,5,0,0,4,35,5,6,12,30,6,0,0,5,22,14,26,20,46,54,3,58,36,0,0,0,0,0,6,0,0,17,13,15,0,43,69,81,53,68,45,33,31,0,99,23,17,3,0,6,6,0,0,13,42,12,0,0,0,42,0,6,3,0,0,0,23,3,3,6,0,3,3,3,3,3,0,3,6,87,0,0,120,28,12,14,27,10,3,5,0,19,1,17,0,0,0,0,13,23,0,14,40,31,16,0,1,0,13,0,0,4,0,0,0,3,0,0,0,0,29,67,56,41,3,0,94,43,27,4,38,0,10,19,39,45,55,0,0,45,22,17,0,0,16,23,77,22,93,89,16,2,6,6,27,66,6,14,109.0,16,7,2.0,11.0,10.0,9,6.8125,172556,1,4,41,0
4,0,78.0,3,2,60,1,28,9,53,26,3,2,9,2520,627,761,99,0,0,46,54,2,98,0,0,1,0,0,0,0,0,0,0,0,0,0,0,33,45,50,36,46,50,27,34,43,23,14,21,13,15,20,12,5,13,15,34,19,19,31,7,27,16,26,57,36,24,14,42,17,9,33,235,323,99,98,0,0,0,0,0,0,0,576,594,4,3,90,10,97,3,0,42,82,49,22,92,8,20,3,17,9,23,1,1,1,0,21,58,19,0,1,2,16,67,0,2,45,52,53,16,6,0,0,0,9,0,0,0,25,58,74,83,5000.0,127.0,528.0,240,250,293,321,9836,24,29,23,13,4,4,0,0,2,21,30,22,16,4,5,0,0,3,35,8,11,14,20,80,4,73,22,1,1,0,0,0,3,1,2,1,24,27,3,76,61,73,51,65,49,80,31,81,99,10,17,8,2,6,15,3,7,22,2,9,0,7,2,2,0,6,1,5,2,2,12,2,7,6,4,15,29,4,3,26,3,2,7,49,12,1,120,16,20,30,13,3,12,5,2,26,1,20,7,1,1,1,15,28,4,9,16,53,20,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,65,99,0,0,0,90,45,18,25,34,0,1,3,6,33,67,0,0,9,14,72,3,0,0,99,1,21,99,96,6,2,7,11,43,113,10,25,254.0,37,8,3.0,15.0,15.0,14,6.864865,7112,1,2,26,0


#### Activity 2

Research and discuss some potential causes of data imbalance.

**Solution**

* Predicting a rare event, which is usually the case with binary classification problems (as discussed in the lesson)
* Biased sampling / sampling that was not randomized - imbalance can also be due to the way the samples were collected or drawn from the problem domain, e.g. if you were collecting data from a company but took most of the samples from employees in one particular department
* Measurement errors could be another potential cause



## Downsample

In [33]:
category_0 = data[data['TARGET_B'] == 0]
category_1 = data[data['TARGET_B'] == 1]

In [34]:
# just sample from the majority randomly with the size of the minority class:
category_0 = category_0.sample(len(category_1))
print(category_0.shape)
print(category_1.shape)

(4843, 316)
(4843, 316)


In [36]:
# rebuild the dataframe
data = pd.concat([category_0, category_1], axis=0)

In [37]:
data

Unnamed: 0,TCODE,AGE,INCOME,WEALTH1,HIT,MALEMILI,MALEVET,VIETVETS,WWIIVETS,LOCALGOV,STATEGOV,FEDGOV,WEALTH2,POP901,POP902,POP903,POP90C1,POP90C2,POP90C3,POP90C4,POP90C5,ETH1,ETH2,ETH3,ETH4,ETH5,ETH6,ETH7,ETH8,ETH9,ETH10,ETH11,ETH12,ETH13,ETH14,ETH15,ETH16,AGE901,AGE902,AGE903,AGE904,AGE905,AGE906,AGE907,CHIL1,CHIL2,CHIL3,AGEC1,AGEC2,AGEC3,AGEC4,AGEC5,AGEC6,AGEC7,CHILC1,CHILC2,CHILC3,CHILC4,CHILC5,HHAGE1,HHAGE2,HHAGE3,HHN1,HHN2,HHN3,HHN4,HHN5,HHN6,MARR1,MARR2,MARR3,MARR4,HHP1,HHP2,DW1,DW2,DW3,DW4,DW5,DW6,DW7,DW8,DW9,HV1,HV2,HV3,HV4,HU1,HU2,HU3,HU4,HU5,HHD1,HHD2,HHD3,HHD4,HHD5,HHD6,HHD7,HHD8,HHD9,HHD10,HHD11,HHD12,ETHC1,ETHC2,ETHC3,ETHC4,ETHC5,ETHC6,HVP1,HVP2,HVP3,HVP4,HVP5,HVP6,HUR1,HUR2,RHP1,RHP2,RHP3,RHP4,HUPA1,HUPA2,HUPA3,HUPA4,HUPA5,HUPA6,HUPA7,RP1,RP2,RP3,RP4,MSA,ADI,DMA,IC1,IC2,IC3,IC4,IC5,IC6,IC7,IC8,IC9,IC10,IC11,IC12,IC13,IC14,IC15,IC16,IC17,IC18,IC19,IC20,IC21,IC22,IC23,HHAS1,HHAS2,HHAS3,HHAS4,MC1,MC2,MC3,TPE1,TPE2,TPE3,TPE4,TPE5,TPE6,TPE7,TPE8,TPE9,PEC1,PEC2,TPE10,TPE11,TPE12,TPE13,LFC1,LFC2,LFC3,LFC4,LFC5,LFC6,LFC7,LFC8,LFC9,LFC10,OCC1,OCC2,OCC3,OCC4,OCC5,OCC6,OCC7,OCC8,OCC9,OCC10,OCC11,OCC12,OCC13,EIC1,EIC2,EIC3,EIC4,EIC5,EIC6,EIC7,EIC8,EIC9,EIC10,EIC11,EIC12,EIC13,EIC14,EIC15,EIC16,OEDC1,OEDC2,OEDC3,OEDC4,OEDC5,OEDC6,OEDC7,EC1,EC2,EC3,EC4,EC5,EC6,EC7,EC8,SEC1,SEC2,SEC3,SEC4,SEC5,AFC1,AFC2,AFC3,AFC4,AFC5,AFC6,VC1,VC2,VC3,VC4,ANC1,ANC2,ANC3,ANC4,ANC5,ANC6,ANC7,ANC8,ANC9,ANC10,ANC11,ANC12,ANC13,ANC14,ANC15,POBC1,POBC2,LSC1,LSC2,LSC3,LSC4,VOC1,VOC2,VOC3,HC1,HC2,HC3,HC4,HC5,HC6,HC7,HC8,HC9,HC10,HC11,HC12,HC13,HC14,HC15,HC16,HC17,HC18,HC19,HC20,HC21,MHUC1,MHUC2,AC1,AC2,CARDPROM,NUMPROM,CARDPM12,NUMPRM12,RAMNTALL,NGIFTALL,CARDGIFT,MINRAMNT,MAXRAMNT,LASTGIFT,TIMELAG,AVGGIFT,CONTROLN,HPHONE_D,RFA_2F,CLUSTER2,TARGET_B
11738,1002,61.611649,5,9,0,0,39,50,42,15,6,7,9,615,169,206,99,0,0,47,53,44,50,0,5,0,2,0,0,1,0,3,0,0,0,0,0,39,45,47,36,45,49,26,15,54,31,13,10,25,26,15,7,4,5,6,40,29,20,17,4,15,17,25,58,38,13,5,60,10,6,25,240,299,99,92,0,0,0,0,0,0,0,899,933,5,5,95,5,97,3,0,45,82,68,35,93,7,10,1,8,9,14,1,8,27,9,11,34,4,0,1,24,89,98,0,0,88,64,66,17,4,0,0,0,5,0,0,0,36,99,99,99,5560.0,245.0,622.0,459,491,493,533,16155,3,7,28,20,30,7,0,3,0,3,0,25,24,36,9,0,4,0,20,0,61,5,36,64,5,84,9,7,7,0,0,0,0,0,0,14,28,31,12,67,71,82,63,82,62,82,68,99,0,0,21,17,4,14,14,0,2,8,0,6,3,6,7,0,0,5,10,13,9,3,13,9,3,5,2,3,12,2,11,15,6,7,8,62,3,0,140,8,3,26,27,0,19,17,16,20,0,26,10,0,0,0,17,39,0,50,0,42,0,0,0,0,2,0,0,3,1,0,0,0,0,0,0,0,12,75,91,0,0,9,96,68,20,11,16,0,0,0,99,99,0,0,0,67,0,33,0,0,0,99,0,99,99,99,9,3,3,5,15,36,6,12,60.00,5,2,10.00,15.0,15.0,3,12.000000,108540,0,2,5,0
66101,2,82.000000,2,9,9,0,28,0,61,6,2,0,9,568,180,238,99,0,0,48,52,98,0,0,2,1,2,0,0,0,0,0,0,0,0,0,1,47,53,55,45,53,56,18,35,42,23,7,9,18,17,21,15,13,17,9,38,17,19,38,14,34,24,45,32,18,6,3,73,5,8,14,159,237,96,53,0,3,3,0,0,0,0,1600,1680,13,4,99,1,98,2,25,23,76,71,21,88,12,2,1,1,8,20,1,15,50,33,0,0,0,21,59,96,99,99,2,0,67,65,65,13,4,3,0,0,1,0,0,0,50,50,50,50,1600.0,51.0,602.0,588,758,649,758,29691,7,0,16,11,31,22,7,3,4,0,0,11,16,21,32,11,4,5,35,0,78,0,34,66,4,88,2,0,0,0,0,0,8,2,0,18,17,24,9,53,78,90,68,84,68,99,82,0,99,0,20,22,0,35,11,0,0,2,0,8,0,2,0,0,0,0,25,0,2,9,21,17,4,0,3,9,7,2,0,6,2,0,6,77,9,0,150,4,2,27,23,0,34,11,9,18,4,18,4,0,0,0,12,28,0,0,25,61,0,0,0,0,10,0,0,14,13,0,0,0,0,2,0,0,8,59,99,0,0,0,96,54,13,12,25,0,0,3,19,99,0,39,0,52,0,48,0,0,0,99,0,99,99,99,12,4,12,10,29,84,7,28,163.00,24,12,2.00,15.0,5.0,6,6.791667,13493,1,4,7,0
86196,1,70.000000,5,9,12,0,30,21,41,4,4,2,9,2600,697,820,99,0,0,51,49,99,0,0,1,2,0,0,0,0,0,0,0,1,0,0,1,31,45,48,34,46,49,32,35,42,23,12,19,18,19,18,10,4,12,16,32,25,15,22,5,20,13,34,53,37,19,10,65,7,4,25,221,312,91,87,4,9,5,5,0,0,0,1553,2193,4,4,80,20,97,3,0,43,85,74,36,94,6,7,1,6,8,12,2,26,56,16,0,0,0,39,51,65,81,99,25,1,78,78,72,16,4,5,4,0,11,4,5,0,17,48,83,93,7160.0,291.0,770.0,439,504,545,575,22732,15,14,16,9,22,7,7,2,7,10,15,16,8,25,8,8,2,7,24,4,67,6,30,70,6,83,10,1,1,0,0,0,2,5,0,2,21,22,1,68,63,75,52,73,50,44,27,29,99,9,25,20,5,18,14,1,1,6,1,6,1,2,1,2,0,2,11,3,4,4,18,13,6,3,6,9,8,8,2,4,4,2,9,74,7,1,153,1,3,18,30,4,32,13,10,29,6,21,12,0,0,0,15,30,0,21,18,41,4,1,16,0,6,0,0,1,0,2,1,0,0,1,0,0,3,68,92,3,1,3,97,82,38,10,30,0,9,12,26,49,51,1,3,95,2,3,0,0,0,99,0,83,99,99,10,3,8,8,13,31,6,14,43.00,7,5,5.00,10.0,10.0,2,6.142857,134585,1,4,19,0
43662,1,46.000000,1,9,0,0,38,15,50,10,4,1,2,2763,808,1255,0,0,99,47,53,99,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,44,53,56,44,52,55,20,39,40,21,8,16,15,13,16,20,13,16,16,32,21,15,42,18,39,32,39,28,16,6,2,61,10,12,17,145,220,78,77,2,6,4,3,0,0,0,494,575,3,3,78,22,43,57,91,23,64,55,18,81,19,5,1,4,15,26,3,16,49,34,0,0,0,1,3,9,18,49,0,3,34,48,51,12,2,4,2,14,12,5,4,1,5,14,47,80,0.0,63.0,505.0,201,255,256,312,11807,34,28,15,12,8,3,1,0,0,22,27,18,16,11,4,1,0,0,51,8,36,13,40,60,5,80,14,0,0,0,0,0,3,3,0,47,23,29,13,53,47,56,39,52,36,69,57,73,78,5,13,11,3,13,12,0,0,12,2,15,10,5,4,2,0,13,22,3,2,2,17,5,6,3,1,5,11,5,3,10,5,1,12,68,5,0,120,9,20,37,15,7,8,5,2,17,2,13,4,0,0,0,19,40,1,11,23,52,7,1,8,3,18,0,1,6,3,1,7,0,0,3,0,1,6,81,94,0,0,6,93,48,10,8,35,1,5,10,33,44,56,0,0,76,5,11,4,0,3,38,62,32,99,95,8,2,7,11,16,39,6,13,52.00,4,2,10.00,16.0,16.0,9,13.000000,69888,0,1,59,0
61868,2,86.000000,1,6,4,0,39,16,39,3,0,3,9,1645,448,829,99,0,0,44,56,97,0,0,2,2,0,1,0,0,0,0,0,1,0,0,0,39,44,47,41,47,49,14,39,39,22,10,23,17,14,16,13,7,19,14,30,23,14,25,11,23,42,34,24,11,4,1,51,13,9,27,123,198,28,25,0,71,71,64,0,0,0,1897,2131,5,6,77,23,95,5,3,17,54,44,13,74,26,4,1,3,16,36,4,11,62,24,0,0,0,44,79,96,98,99,14,1,31,47,51,11,4,32,39,0,1,2,21,0,68,93,98,99,1600.0,51.0,602.0,406,519,476,664,26951,15,13,15,28,11,12,0,2,5,4,2,4,39,20,18,0,4,9,21,0,58,1,60,40,7,85,9,3,0,3,0,0,1,2,1,13,24,30,7,76,75,91,62,88,59,48,42,99,99,0,13,20,4,13,25,0,1,4,1,8,2,3,5,1,0,8,24,7,3,7,12,9,4,3,0,8,6,7,2,6,2,2,6,80,3,1,132,6,7,31,20,5,24,7,3,12,1,6,8,0,0,0,12,27,1,13,22,50,5,0,5,0,16,0,0,3,4,1,10,0,2,1,3,2,15,67,88,1,3,8,96,48,8,3,22,17,22,26,42,75,25,47,18,83,2,11,0,0,3,72,28,99,99,99,12,3,11,9,14,61,4,25,47.00,5,2,3.00,16.0,16.0,15,9.400000,8439,0,1,8,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95298,2,45.000000,5,9,0,0,45,28,37,9,2,3,2,2649,671,1098,0,99,1,46,54,94,1,1,1,8,0,0,0,0,0,0,0,6,0,0,1,39,50,55,42,52,55,23,45,37,18,10,17,16,11,11,14,20,18,19,33,16,14,37,17,35,33,37,30,17,7,3,49,17,16,17,147,226,79,75,6,20,14,8,6,6,0,653,752,4,4,57,43,88,12,33,28,61,44,17,72,28,11,3,9,18,27,6,18,45,30,0,1,0,2,5,14,33,83,1,10,18,43,44,12,4,16,4,1,23,11,8,0,13,38,79,94,6780.0,13.0,803.0,194,218,252,292,12177,40,21,13,17,7,1,1,1,0,26,29,16,20,6,0,1,2,0,36,14,34,17,60,40,13,79,19,0,0,0,0,0,0,2,0,31,17,26,9,47,41,54,31,50,29,44,42,35,0,20,12,16,4,8,11,0,5,16,1,17,2,1,7,0,0,19,5,5,5,2,15,7,6,5,3,11,3,8,5,9,2,3,12,68,6,0,120,11,24,30,20,7,5,2,2,19,1,15,5,0,0,0,20,45,2,28,12,37,12,2,5,2,12,0,0,5,2,1,1,0,0,1,1,0,5,35,95,3,0,2,92,42,15,4,22,0,8,12,43,71,29,0,0,65,11,20,0,0,3,94,4,17,99,95,6,2,8,8,33,81,6,13,238.07,30,16,0.07,17.0,17.0,7,7.935667,154544,0,1,52,1
95309,0,51.000000,5,6,1,1,32,43,24,7,5,6,6,8361,2324,3112,99,0,0,50,50,90,1,1,7,9,1,1,1,3,0,0,0,6,0,0,2,30,35,37,31,41,43,28,50,36,14,10,36,24,11,8,8,4,22,20,33,15,10,17,5,14,19,36,46,27,9,3,66,12,4,18,187,266,70,68,1,22,21,19,1,0,1,1693,1692,5,6,72,28,87,13,8,42,75,64,35,85,15,7,2,5,14,15,7,22,57,11,0,1,0,23,68,91,97,99,0,4,49,56,55,14,4,14,8,7,12,3,13,0,58,80,93,97,6920.0,67.0,862.0,423,467,457,492,17493,8,13,16,28,27,6,3,0,1,5,11,13,28,32,7,2,0,1,16,6,48,4,86,14,17,83,11,0,0,0,0,0,3,2,0,59,22,26,5,67,77,86,68,82,66,76,56,92,99,1,13,18,8,16,15,0,2,8,1,13,3,2,2,1,0,10,14,4,2,3,18,9,5,2,1,8,6,6,9,7,5,6,8,70,4,0,140,2,9,23,25,12,22,6,5,22,3,17,7,1,1,2,16,32,1,43,8,24,16,1,4,1,10,1,1,6,3,1,1,1,0,1,1,0,6,58,90,3,3,3,98,71,19,1,4,36,76,81,85,86,14,1,1,68,1,31,0,0,0,99,0,99,99,98,12,2,3,4,13,36,4,10,35.00,3,2,5.00,15.0,15.0,4,11.666667,171302,1,1,20,1
95398,0,86.000000,5,9,0,1,32,21,26,9,1,0,9,2368,651,930,99,0,0,50,50,85,12,0,3,1,1,0,1,1,0,0,0,1,0,0,1,36,42,45,37,44,47,21,34,39,27,12,21,21,19,14,9,4,16,13,28,24,19,18,5,16,26,33,41,25,10,4,61,7,4,28,172,254,69,65,0,30,30,29,0,0,0,934,975,5,5,75,25,98,2,10,29,70,64,27,85,15,3,1,2,16,17,4,14,57,14,2,9,1,1,2,37,86,99,0,3,58,61,57,14,4,7,24,0,2,1,22,0,30,94,97,97,5080.0,111.0,617.0,476,529,511,562,20261,6,9,16,21,31,12,3,1,1,2,8,17,18,34,15,4,1,1,25,2,74,1,32,68,5,87,8,0,0,0,0,0,2,2,1,18,20,22,2,67,75,80,70,78,69,83,60,89,0,3,25,9,4,13,11,0,3,7,0,14,8,3,3,1,1,3,23,2,2,5,19,5,7,1,2,8,12,6,2,9,1,0,5,76,9,0,140,4,4,27,23,8,20,14,5,20,0,15,9,0,1,0,16,32,0,21,26,26,5,0,1,0,34,1,0,1,1,0,1,0,0,0,0,0,5,72,92,1,3,4,98,72,24,9,19,0,1,13,53,89,11,12,7,79,0,18,2,0,0,99,0,99,99,99,8,4,8,5,29,71,6,16,144.00,10,4,5.00,25.0,20.0,15,14.400000,78831,0,3,3,1
95403,0,58.000000,4,9,0,0,24,46,20,6,1,2,5,1663,450,581,0,1,99,50,50,99,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,30,40,43,33,44,47,32,37,43,20,13,24,23,13,10,8,9,13,17,33,24,12,23,11,23,20,30,51,34,15,4,68,5,6,20,205,286,81,81,2,10,9,7,0,0,0,585,606,3,2,79,21,97,3,13,45,77,70,40,90,10,6,1,4,12,12,3,28,57,15,0,0,0,0,0,8,25,62,0,1,65,64,63,15,4,7,4,8,9,3,7,2,0,2,23,82,0.0,107.0,613.0,281,342,326,376,11157,27,16,17,22,12,4,1,0,0,14,18,19,26,15,5,2,0,0,23,2,46,9,38,62,4,69,15,0,0,0,0,0,4,12,0,53,22,23,1,71,71,78,63,75,62,78,66,79,95,3,9,5,7,10,15,1,2,13,15,12,6,4,3,16,0,6,12,5,1,11,12,3,4,3,1,18,4,4,2,6,1,2,17,59,13,2,120,12,9,45,12,9,11,2,2,29,3,24,4,0,0,0,13,25,1,43,19,21,8,0,1,0,44,0,0,3,0,6,0,0,0,0,1,0,0,85,97,1,0,2,97,75,27,9,28,3,10,17,40,50,50,0,0,35,28,11,16,0,10,51,48,46,99,97,7,4,5,6,22,51,4,8,139.00,12,6,3.00,20.0,20.0,10,11.583333,84678,0,1,56,1


In [38]:
# shuffle the data, so that the 1's of TARGET_B are not all at the end
data = data.sample(frac=1)
data['TARGET_B'].value_counts()

1    4843
0    4843
Name: TARGET_B, dtype: int64

## Upsampling

In [41]:
# again rebuilding dataframe like above
data = pd.concat([numerical, targets], axis=1)
data = data.drop(['TARGET_D'], axis=1)

# separating it with respect to the TARGET_B value
category_0 = data[data['TARGET_B'] == 0]
category_1 = data[data['TARGET_B'] == 1]

# randomly sampling category_1, blowing it up to the length of the majority class
category_1 = category_1.sample(len(category_0), replace=True)
print(category_1.shape)

data = pd.concat([category_0, category_1], axis=0)
#shuffling the data
data = data.sample(frac=1)
print(data['TARGET_B'].value_counts())

(90569, 316)
1    90569
0    90569
Name: TARGET_B, dtype: int64
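
Both resampling recipes can be wrapped into one reusable function. The helper below is a hypothetical sketch (the name `balance_classes` and its parameters are not part of the lesson); it adds a fixed `random_state` so the result is repeatable, and it is demonstrated on a small made-up dataframe instead of the donor data.

```python
import pandas as pd

def balance_classes(df, target, how="down", random_state=42):
    """Randomly down- or upsample every class to the same number of rows."""
    counts = df[target].value_counts()
    n = int(counts.min() if how == "down" else counts.max())
    parts = [
        # sample with replacement only when a class has to grow
        grp.sample(n=n, replace=(how == "up" and len(grp) < n),
                   random_state=random_state)
        for _, grp in df.groupby(target)
    ]
    # concatenate and shuffle so the classes are not stored in blocks
    return (pd.concat(parts)
              .sample(frac=1, random_state=random_state)
              .reset_index(drop=True))

# Made-up toy data: 10 rows of class 0 and 3 rows of class 1
toy = pd.DataFrame({"feature": range(13), "label": [0] * 10 + [1] * 3})
down = balance_classes(toy, "label", how="down")
up = balance_classes(toy, "label", how="up")
print(down["label"].value_counts().to_dict())
print(up["label"].value_counts().to_dict())
```

After downsampling both classes have 3 rows, after upsampling both have 10.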


# Activity 3

Despite the advantage of balancing classes, these techniques also have their weaknesses (there is no free lunch). Brainstorm and discuss some of the disadvantages of upsampling and downsampling.


#### solution

* The simplest implementation of over-sampling is to duplicate random records from the minority class, which can cause overfitting.
* In down-sampling, the simplest technique involves removing random records from the majority class, which can cause loss of information.
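
To get a feeling for how much duplication random upsampling creates, the sketch below mimics the upsampling step with plain integers standing in for the 4843 minority rows. It uses Python's `random` module with a fixed seed purely for illustration.

```python
import random

random.seed(0)  # fixed seed so the illustration is repeatable

minority_ids = list(range(4843))                    # stand-ins for the minority rows
upsampled = random.choices(minority_ids, k=90569)   # sampling with replacement

n_unique = len(set(upsampled))
n_duplicates = len(upsampled) - n_unique
print(n_unique, n_duplicates)
```

Since there are only 4843 distinct rows to draw from, at least 85726 of the 90569 upsampled rows are exact copies. A model can memorise these copies, which is the overfitting risk mentioned above.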




# Recap SMOTE

* Randomly pick a point from the minority class.
* Compute the k-nearest neighbors (for some pre-specified k) of this point within the minority class.
* Add k new points somewhere between the chosen point and each of its neighbors.
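
The interpolation idea can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the implementation used by `imblearn`: the function name `smote_like_samples` is made up, each new point uses one randomly chosen neighbour, and the toy points are invented.

```python
import numpy as np

def smote_like_samples(minority, k=3, n_new=5, seed=0):
    """Create n_new synthetic points by interpolating towards minority neighbours."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))               # pick a random minority point
        dist = np.linalg.norm(minority - minority[i], axis=1)
        neighbours = np.argsort(dist)[1:k + 1]        # skip position 0, the point itself
        j = rng.choice(neighbours)                    # pick one of its k neighbours
        lam = rng.random()                            # position on the segment, in [0, 1)
        synthetic.append(minority[i] + lam * (minority[j] - minority[i]))
    return np.array(synthetic)

# Invented minority points at the corners of the unit square
toy_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_points = smote_like_samples(toy_minority)
print(new_points)
```

Every synthetic point lies on a line segment between two existing minority points, so unlike random upsampling the new rows are not exact duplicates.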


In [53]:
from imblearn.over_sampling import SMOTE

data = pd.concat([numerical, targets], axis=1)
data = data.drop(['TARGET_D'], axis=1)

smote = SMOTE()
y = data['TARGET_B']
X = data.drop(['TARGET_B'], axis=1)


# fit_sample was removed in newer imblearn releases; fit_resample is the current name
X_sm, y_sm = smote.fit_resample(X, y)
y_sm.value_counts()

1    90569
0    90569
Name: TARGET_B, dtype: int64

# Recap Downsampling with TomekLinks

* Tomek links are pairs of very close instances that belong to opposite classes. Removing the majority-class instance of each pair increases the space between the two classes, which makes the classification task easier.
* This does not make the two classes equal in size; it only removes the majority-class points that sit right next to minority-class points.
* good kaggle source [link](https://www.kaggle.com/rafjaa/resampling-strategies-for-imbalanced-datasets)
* article about a lot of undersampling / oversampling techniques [link](https://machinelearningmastery.com/undersampling-algorithms-for-imbalanced-classification/)
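To make the definition concrete, here is a small NumPy sketch that treats two points as a Tomek link when they are each other's nearest neighbour and have different labels. The helper name `tomek_link_majority_indices` and the 1-D toy data are assumptions for illustration; `imblearn` should be used for real data.

```python
import numpy as np

def tomek_link_majority_indices(X, y, majority_label):
    """Return indices of majority samples that form a Tomek link with a minority sample."""
    # Pairwise Euclidean distances; the diagonal is set to inf so a point is never its own neighbour
    diffs = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)
    nearest = dist.argmin(axis=1)

    to_remove = []
    for i, j in enumerate(nearest):
        mutual = nearest[j] == i       # i and j are each other's nearest neighbour
        opposite = y[i] != y[j]        # and they belong to different classes
        if mutual and opposite and y[i] == majority_label:
            to_remove.append(i)
    return np.array(to_remove, dtype=int)

# Hypothetical toy data: the classes only touch around x = 2.0
X = np.array([[0.0], [0.6], [1.0], [1.9], [2.0], [3.0]])
y = np.array([0, 0, 0, 0, 1, 1])

drop = tomek_link_majority_indices(X, y, majority_label=0)
X_clean = np.delete(X, drop, axis=0)
y_clean = np.delete(y, drop)
print(drop, y_clean)
```

In this toy case only the majority point at 1.9 is removed, so the classes end up at three versus two samples: the boundary is cleaner, but the data is not balanced.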


In [54]:
from imblearn.under_sampling import TomekLinks

data = pd.concat([numerical, targets], axis=1)
data = data.drop(['TARGET_D'], axis=1)

y = data['TARGET_B']
X = data.drop(['TARGET_B'], axis=1)
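# pass the strategy by keyword; newer imblearn releases reject it as a positional argument
tl = TomekLinks(sampling_strategy='majority')
# fit_sample was removed in newer imblearn releases; fit_resample is the current name
X_tl, y_tl = tl.fit_resample(X, y)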
y_tl.value_counts()

0    87970
1     4843
Name: TARGET_B, dtype: int64

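This was not very effective here: only 2,599 majority-class rows were removed (90,569 down to 87,970), so the classes are still heavily imbalanced. The article on undersampling algorithms [link](https://machinelearningmastery.com/undersampling-algorithms-for-imbalanced-classification/) recommends combining Tomek links with other resampling methods.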


# LAB | Handling data imbalance classification

For this lab and in the next lessons, we will build a model for a customer churn binary classification problem. You will be using the `files_for_lab/Customer-Churn.csv` file.

### Scenario

You are working as an analyst with an internet service provider. You are provided with historical data about your company's customers and their churn trends. Your task is to build a machine learning model that will help the company identify customers who are more likely to default/churn, and thus prevent losses from such customers.

### Instructions

In this lab, we will first look at the degree of imbalance in the data and then correct it using the techniques we learned in class.

Here is the list of steps to follow (building a simple model without balancing the data):

1. Import the required libraries and modules.
2. Read the data into Python and call the dataframe `churnData`.
3. Check the datatypes of all the columns in the data. You will see that the column `TotalCharges` is of object type. Convert this column to a numeric type using the `pd.to_numeric` function.
4. Check for null values in the dataframe. Replace the null values.
5. Use the following features: `tenure`, `SeniorCitizen`, `MonthlyCharges` and `TotalCharges`:
    * Scale the features using either a normalizer or a standard scaler.
    * Split the data into a training set and a test set.
    * Fit a logistic regression model on the training data.
    * Check the accuracy on the test data.

Note: So far we have not balanced the data.

### Managing imbalance in the dataset

1. Check for the imbalance.
2. Use the resampling strategies covered in class for upsampling and downsampling to balance the two classes.
3. After each resampling, refit the model and check its accuracy.
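One possible way to check for the imbalance is to look at the absolute and relative class counts. This is a sketch on a made-up stand-in for the `Churn` column; the values and proportions are assumptions, and the real data will differ.

```python
import pandas as pd

# Hypothetical stand-in for churnData['Churn']; the real proportions will differ
churn = pd.Series(["No"] * 8 + ["Yes"] * 2, name="Churn")

counts = churn.value_counts()                   # absolute class sizes
shares = churn.value_counts(normalize=True)     # relative class sizes
imbalance_ratio = counts.max() / counts.min()   # majority size divided by minority size

print(counts)
print(shares)
print("imbalance ratio:", imbalance_ratio)
```

A ratio close to 1 means the classes are roughly balanced; the larger it gets, the more the resampling steps in this lab matter.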


In [None]:
# Building a simple model without balancing the data

# Reading the data

import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

pd.set_option('display.max_columns', None)
import warnings
warnings.filterwarnings('ignore')
churnData = pd.read_csv('Customer-Churn.csv')
churnData.head()

# Processing data

churnData.dtypes
churnData['TotalCharges']  = pd.to_numeric(churnData['TotalCharges'], errors='coerce')
churnData.isna().sum()
churnData['TotalCharges'] = churnData['TotalCharges'].fillna(np.mean(churnData['TotalCharges']))

X = churnData[['tenure', 'SeniorCitizen','MonthlyCharges', 'TotalCharges']]
# name the target y (lowercase) so that it matches the train_test_split call below
y = churnData['Churn']
transformer = StandardScaler().fit(X)
scaled_x = transformer.transform(X)

# Building the model

X_train, X_test, y_train, y_test = train_test_split(scaled_x, y, test_size=0.33)
classification = LogisticRegression(random_state=0, solver='lbfgs', multi_class='ovr').fit(X_train, y_train)
classification.score(X_test, y_test)

# Managing imbalance
# upsampling

counts = churnData['Churn'].value_counts()
# upsample the 'Yes' class to the size of the 'No' class; index by label, not by position
yes = churnData[churnData['Churn']=='Yes'].sample(counts['No'], replace=True)
no = churnData[churnData['Churn']=='No']
data = pd.concat([yes,no], axis=0)
data = data.sample(frac=1)
data['Churn'].value_counts()

X = data[['tenure', 'SeniorCitizen','MonthlyCharges', 'TotalCharges']]
y = pd.DataFrame(data['Churn'])
transformer = StandardScaler().fit(X)
scaled_x = transformer.transform(X)
X_train, X_test, y_train, y_test = train_test_split(scaled_x, y, test_size=0.33)
classification = LogisticRegression(random_state=0, solver='lbfgs', multi_class='ovr').fit(X_train, y_train)
classification.score(X_test, y_test)

# downsampling

yes = churnData[churnData['Churn']=='Yes']
no = churnData[churnData['Churn']=='No']
no = no.sample(len(yes))
data = pd.concat([yes,no], axis=0)
data = data.sample(frac=1)
data['Churn'].value_counts()

X = data[['tenure', 'SeniorCitizen','MonthlyCharges', 'TotalCharges']]
y = pd.DataFrame(data['Churn'])
transformer = StandardScaler().fit(X)
scaled_x = transformer.transform(X)
X_train, X_test, y_train, y_test = train_test_split(scaled_x, y, test_size=0.33)
classification = LogisticRegression(random_state=0, solver='lbfgs', multi_class='ovr').fit(X_train, y_train)
classification.score(X_test, y_test)