
## t/udom/2017/11787
## bsc te

# DATA IMBALANCE
#### Data imbalance is a problem in classification tasks where the classes are not represented equally.
#### This notebook covers 4 different methods for dealing with imbalanced datasets:

* Change the performance metric
* Oversampling minority class
* Undersampling majority class
* Change the algorithm

In [233]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plot
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, recall_score
data=pd.read_csv('traindata.csv')
data.head(5)

Unnamed: 0,continue_drop,student_id,gender,caste,mathematics_marks,english_marks,science_marks,science_teacher,languages_teacher,guardian,internet,school_id,total_students,total_toilets,establishment_year
0,continue,s01746,M,BC,0.666,0.468,0.666,7,6,other,True,305,354,86.0,1986.0
1,continue,s16986,M,BC,0.172,0.42,0.172,8,10,mother,False,331,516,15.0,1996.0
2,continue,s00147,F,BC,0.212,0.601,0.212,1,4,mother,False,311,209,14.0,1976.0
3,continue,s08104,F,ST,0.434,0.611,0.434,2,5,father,True,364,147,28.0,1911.0
4,continue,s11132,F,SC,0.283,0.478,0.283,1,10,mother,True,394,122,15.0,1889.0


In [171]:
# check for null values in the columns
data.isnull().sum()

continue_drop           0
student_id              0
gender                  0
caste                   0
mathematics_marks       0
english_marks           0
science_marks           0
science_teacher         0
languages_teacher       0
guardian                0
internet                0
school_id               0
total_students          0
total_toilets         312
establishment_year    312
dtype: int64

In [172]:
cleanup_nums = {"continue_drop":{"continue":1,"drop":0},
                "gender":{"F":0,"M":1},
                "caste":{"BC":0,"SC":1,"OC":2,"ST":3},
                "guardian":{"mother":0,"father":1,"other":2,"mixed":3}
               }
data.replace(cleanup_nums, inplace=True)
data.internet = data.internet.astype(int)


In [173]:
# drop the identifier columns and the two columns with missing values
data.drop(['student_id', 'total_toilets', 'school_id', 'establishment_year'], axis=1, inplace=True)

In [174]:
print(data.continue_drop.value_counts())

1    16384
0      806
Name: continue_drop, dtype: int64


#### Measuring how imbalanced the data is: drop observations as a percentage of continue observations

In [175]:
(len(data.loc[data.continue_drop==0])) / (len(data.loc[data.continue_drop == 1])) * 100

4.91943359375

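#### We can see that the classes are very imbalanced: there are only 4.92 drop observations for every 100 continue observations, so the drop class makes up about 4.69% of the dataset (806 of 17190 rows)!
#### This is a problem because many machine learning models are designed to maximize overall accuracy, which they can achieve by mostly predicting the majority class.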
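The ratio above compares the minority count with the majority count. To read the imbalance as a share of the whole dataset instead, `value_counts(normalize=True)` is a convenient alternative. The sketch below rebuilds only the label column from the class counts printed earlier (16384 and 806); the Series `labels` is a stand-in for `data.continue_drop`, not the real data.

```python
import pandas as pd

# Stand-in for data.continue_drop, rebuilt from the class counts shown above
labels = pd.Series([1] * 16384 + [0] * 806, name="continue_drop")

# Share of each class in the whole dataset, as a percentage
class_share = labels.value_counts(normalize=True) * 100
print(class_share.round(2))

# Ratio of minority to majority, as computed in the cell above
minority_to_majority = (labels == 0).sum() / (labels == 1).sum() * 100
print(round(minority_to_majority, 2))
```

The two numbers answer different questions: the first is the share of all rows that belong to each class, the second is the size of the minority class relative to the majority class.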


In [176]:
y = data.continue_drop
X = data.drop('continue_drop', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=27)

In [177]:
# DummyClassifier that always predicts the most frequent class (1 = continue)
dummy = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
dummy_pred = dummy.predict(X_test)

# checking unique labels
print('Unique predicted labels: ', (np.unique(dummy_pred)))
# checking accuracy
print('Test score: ', accuracy_score(y_test, dummy_pred))

Unique predicted labels:  [1]
Test score:  0.9469520707305723


As expected, our accuracy score for classifying all students as continuing is 94.7%!
As the Dummy Classifier predicts only Class 1, it is clearly not a good option for our objective of correctly identifying students who drop out.

Let's see how logistic regression performs on this dataset.

In [178]:
# Modeling the data as is
# Train model
lr = LogisticRegression(solver='liblinear').fit(X_train, y_train)
 
# Predict on test set
lr_pred = lr.predict(X_test)
# Checking accuracy
accuracy_score(y_test, lr_pred)

0.9469520707305723

Logistic regression gives the same result as the Dummy Classifier! The two accuracy scores are identical,
so we have to find a better method for dealing with the imbalance.

Let's see if we can apply some techniques for dealing with class imbalance to improve these results.

### 1. Change the performance metric
Accuracy is not the best metric to use when evaluating imbalanced datasets as it can be misleading. Metrics that can provide better insight include:

* Confusion Matrix: a table showing correct predictions and types of incorrect predictions.
* Precision: the number of true positives divided by all positive predictions. Precision is also called Positive Predictive Value. It is a measure of a classifier's exactness. Low precision indicates a high number of false positives.
* Recall: the number of true positives divided by the number of positive values in the test data. Recall is also called Sensitivity or the True Positive Rate. It is a measure of a classifier's completeness. Low recall indicates a high number of false negatives.
* F1 Score: the harmonic mean of precision and recall.

In [179]:
# Checking unique values
predictions = pd.DataFrame(lr_pred)
predictions[0].value_counts()

1    4298
Name: 0, dtype: int64

In [180]:
# f1 score
f1_score(y_test, lr_pred)

0.972753346080306

In [181]:
# confusion matrix
pd.DataFrame(confusion_matrix(y_test, lr_pred))

Unnamed: 0,0,1
0,0,228
1,0,4070


In [182]:
recall_score(y_test, lr_pred)

1.0
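The scores above can be reproduced by hand from the confusion matrix, which makes it easier to see what each metric measures. The sketch below uses the four counts printed for logistic regression; `recall_drop` is an extra quantity, not computed in the notebook, that applies the recall formula to class 0 instead of class 1.

```python
# Counts taken from the logistic regression confusion matrix above
# (rows are true labels, columns are predicted labels; class 1 = continue)
tn, fp = 0, 228      # true class 0: predicted 0, predicted 1
fn, tp = 0, 4070     # true class 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Recall for class 0 (drop): share of true dropouts that the model found
recall_drop = tn / (tn + fp)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f} recall_drop={recall_drop:.4f}")
```

In scikit-learn, `recall_score(y_test, lr_pred, pos_label=0)` should report the same class 0 recall, since `pos_label` selects which class is treated as positive.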

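The confusion matrix is the most informative metric here: it shows that the model never predicts class 0 (drop), so all 228 dropouts in the test set are misclassified. Recall and F1 are computed for class 1 (continue) by default, which is why they still look high. Let's try another method and see whether it does better.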
### 2. Change the algorithm
While in every machine learning problem it's a good rule of thumb to try a variety of algorithms,
it can be especially beneficial with imbalanced datasets. Decision trees frequently perform well on imbalanced data.
They work by learning a hierarchy of if/else questions, which can force both classes to be addressed.
Here we use a random forest, which is an ensemble of decision trees.

In [183]:
from sklearn.ensemble import RandomForestClassifier
# train model
rfc = RandomForestClassifier(n_estimators=10).fit(X_train, y_train)

# predict on test set
rfc_pred = rfc.predict(X_test)

accuracy_score(y_test, rfc_pred)

1.0

In [184]:
# f1 score
f1_score(y_test, rfc_pred)

1.0

In [185]:
# confusion matrix
pd.DataFrame(confusion_matrix(y_test, rfc_pred))

Unnamed: 0,0,1
0,228,0
1,0,4070


In [186]:
# recall score
recall_score(y_test, rfc_pred)

1.0
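A perfect score on a single train/test split is worth double-checking. One common check, not part of the original notebook, is stratified k-fold cross-validation, which keeps the class ratio the same in every fold. The sketch below uses synthetic data from `make_classification` as a stand-in for the student data, so its scores say nothing about the real dataset; the names `X_demo`, `y_demo` and `minority_f1` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in data (about 95% class 1), NOT the student data
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.05, 0.95], random_state=27
)

# Stratified folds preserve the class ratio in each train/test split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=27)
model = RandomForestClassifier(n_estimators=10, random_state=27)

# Score on F1 for the minority class (label 0) rather than on accuracy
minority_f1 = make_scorer(f1_score, pos_label=0)
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring=minority_f1)
print(scores.mean(), scores.std())
```

If the scores vary a lot between folds, the single-split result is probably optimistic.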

The random forest does much better than logistic regression here: accuracy, F1 and recall are all 1.0,
and the confusion matrix shows that all 228 drop cases in the test set are classified correctly.
Let's move on to the third method.
## Resampling Techniques
### 3. Undersampling Majority Class
Undersampling can be defined as removing some observations of the majority class. Undersampling can be a good choice when you have a ton of data (think millions of rows). A drawback to consider is that we are removing information that may be valuable. The cells below keep all 806 drop rows and sample the continue rows down to the same number.

In [187]:
from sklearn.utils import resample

In [192]:
# Separate input features and target
data=pd.read_csv('traindata.csv')
cleanup_nums = {"continue_drop":{"continue":1,"drop":0},
                "gender":{"F":0,"M":1},
                "caste":{"BC":0,"SC":1,"OC":2,"ST":3},
                "guardian":{"mother":0,"father":1,"other":2,"mixed":3}
               }
data.replace(cleanup_nums, inplace=True)
data.internet = data.internet.astype(int)
y = data.continue_drop
# drop the identifier columns and the two columns with missing values
data.drop(['student_id', 'total_toilets', 'school_id', 'establishment_year'], axis=1, inplace=True)
# `data` keeps continue_drop; X holds the input features only
X = data.drop('continue_drop', axis=1)

# setting up testing and training sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=27)

In [193]:
# concatenate our training data back together
X = pd.concat([X_train, y_train], axis=1)
X.head()

Unnamed: 0,gender,caste,mathematics_marks,english_marks,science_marks,science_teacher,languages_teacher,guardian,internet,total_students,continue_drop
664,1,0,0.73,0.502,0.73,6,3,0,0,469,1
13326,1,1,0.337,0.283,0.337,2,6,2,1,96,1
11579,0,0,0.416,0.533,0.416,0,4,0,1,193,1
5105,0,1,0.422,0.436,0.422,1,6,0,0,470,1
5514,1,0,0.633,0.885,0.633,6,2,0,1,155,1


In [219]:
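# separate minority and majority classes
# note: `data` is the full dataset, so these rows include the test set;
# resampling the training frame `X` from the previous cell would avoid that leakage
not_drop = data[data.continue_drop==1]  # majority class (continue)
drop = data[data.continue_drop==0]      # minority class (drop)


# downsample majority to the minority count (drawn with replacement here;
# replace=False is the usual choice when undersampling)
continue_upsampled = resample(not_drop,replace=True,n_samples=len(drop),random_state=27) # reproducible results
# combine minority and downsampled majority
upsampled = pd.concat([drop, continue_upsampled])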

# check new class counts
upsampled.continue_drop.value_counts()

1    806
0    806
Name: continue_drop, dtype: int64

In [201]:
# trying logistic regression again with the balanced dataset
y_train = upsampled.continue_drop
X_train = upsampled.drop('continue_drop', axis=1)

upsampled = LogisticRegression(solver='liblinear').fit(X_train, y_train)

upsampled_pred = upsampled.predict(X_test)

In [202]:
# Checking accuracy
accuracy_score(y_test, upsampled_pred)

0.6463471382038157

In [203]:
# f1 score
f1_score(y_test, upsampled_pred)

0.7753473248595921

In [204]:
# confusion matrix
pd.DataFrame(confusion_matrix(y_test, upsampled_pred))

Unnamed: 0,0,1
0,155,73
1,1447,2623


In [205]:
recall_score(y_test, upsampled_pred)

0.6444717444717445

Our accuracy score decreased after resampling, but the model is now predicting both classes more equally, 
making it an improvement over our plain logistic regression above.
### 4. Oversampling Minority Class
Oversampling can be defined as adding more copies of the minority class. 
Oversampling can be a good choice when you don't have a ton of data to work with. 
But a drawback to oversampling is that it can cause overfitting and poor generalization to your test set.
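To keep the two resampling directions apart, here is a minimal sketch on a toy frame. Oversampling draws minority rows with replacement up to the majority count, while undersampling draws majority rows without replacement down to the minority count. The frame `toy` and the other names are made up for illustration.

```python
import pandas as pd
from sklearn.utils import resample

# Toy frame: 8 majority rows (label 1) and 2 minority rows (label 0)
toy = pd.DataFrame({"feature": range(10),
                    "label": [1] * 8 + [0] * 2})
majority = toy[toy.label == 1]
minority = toy[toy.label == 0]

# Oversampling: draw minority rows WITH replacement up to the majority count
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=27)
oversampled = pd.concat([majority, minority_up])

# Undersampling: draw majority rows WITHOUT replacement down to the minority count
majority_down = resample(majority, replace=False,
                         n_samples=len(minority), random_state=27)
undersampled_toy = pd.concat([majority_down, minority])

print(oversampled.label.value_counts().to_dict())
print(undersampled_toy.label.value_counts().to_dict())
```

In a real workflow the resampling would normally be applied to the training split only, so that the test set keeps the original class ratio and shares no rows with the training set.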

In [223]:
# upsample minority to the majority count
drop_downsampled = resample(drop,
                                replace = True, # sample with replacement
                                n_samples = len(not_drop), # match majority n
                                random_state = 27) # reproducible results

# combine majority and upsampled minority
downsampled = pd.concat([drop_downsampled, not_drop])

# checking counts
downsampled.continue_drop.value_counts()

1    16384
0    16384
Name: continue_drop, dtype: int64

In [224]:
# trying logistic regression again with the oversampled dataset

y_train = downsampled.continue_drop
X_train = downsampled.drop('continue_drop', axis=1)

undersampled = LogisticRegression(solver='liblinear').fit(X_train, y_train)

undersampled_pred = undersampled.predict(X_test)

In [225]:
# Checking accuracy
accuracy_score(y_test, undersampled_pred)

0.6512331316891578

In [226]:
# f1 score
f1_score(y_test, undersampled_pred)

0.7807517917215152

In [227]:
# confusion matrix
pd.DataFrame(confusion_matrix(y_test, undersampled_pred))

Unnamed: 0,0,1
0,130,98
1,1401,2669


In [228]:
recall_score(y_test, undersampled_pred)

0.6557739557739558
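Because `recall_score` reports recall for class 1 only, it helps to compute recall for both classes when comparing the models. The sketch below copies the four confusion matrices printed in this notebook and derives per-class recall from them; the helper `per_class_recall` is illustrative, not a scikit-learn function.

```python
# Confusion matrices copied from the outputs above
# (rows: true class 0/1, columns: predicted class 0/1)
matrices = {
    "plain logistic regression": [[0, 228], [0, 4070]],
    "random forest": [[228, 0], [0, 4070]],
    "first resampled model": [[155, 73], [1447, 2623]],
    "second resampled model": [[130, 98], [1401, 2669]],
}

def per_class_recall(cm):
    """Return (recall for class 0, recall for class 1) from a 2x2 matrix."""
    (tn, fp), (fn, tp) = cm
    return tn / (tn + fp), tp / (tp + fn)

for name, cm in matrices.items():
    drop_recall, continue_recall = per_class_recall(cm)
    print(f"{name}: drop recall={drop_recall:.3f}, "
          f"continue recall={continue_recall:.3f}")
```

Both resampled logistic regression models find more than half of the dropouts in the test set, compared with none for the plain model, at the cost of misclassifying roughly a third of the continuing students.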

### Conclusion
We covered 4 different methods for dealing with imbalanced datasets:

* Change the performance metric
* Oversampling minority class
* Undersampling majority class
* Change the algorithm

For this dataset, the best method was changing the algorithm: the random forest scored well on every metric we checked and handled the class imbalance, 
while logistic regression did not, even after resampling.