<a href="https://colab.research.google.com/github/nicolaiberk/Imbalanced/blob/master/ImbalancedProblem.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Why we should care about Imbalanced Data

**Now that you have a general idea of what imbalanced data is, we will get our hands dirty and figure out why it is a problem. We will use a dataset of tweets that were annotated using crowd-coding, taken from [this paper](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3026393) by Fridolin Linder. The tweets are classified according to whether they discuss the refugee movements in 2015. If you have used Colab before, you can skip the text below and move directly to the code.**

---

The purpose of this notebook is to showcase some code for supervised learning with imbalanced data, and to give you the opportunity to play around with it a bit. You can run a cell by clicking into it and hitting [Ctrl+Enter] on Windows or [Cmd+Enter] on Mac OS, or by clicking the small white 'play' symbol in the upper left corner next to the code. You can run the entire notebook with [Ctrl/Cmd + F9] and all cells preceding the current one with [Ctrl/Cmd + F8]. For more info on Colab, check out the [introduction notebook](https://colab.research.google.com/#).

In [38]:
import pandas as pd

tweets = pd.read_csv("https://www.dropbox.com/s/gv56nu1ptrp63ps/annotated_german_refugee_tweets.csv?dl=1")

In [39]:
tweets.describe()

| | Unnamed: 0 | tweet_id | annotation |
|---|---|---|---|
| count | 24420.0 | 24420.0 | 24420.0 |
| mean | 12222.371089 | 6.181582e+17 | 0.029771 |
| std | 7055.634936 | 5.356872e+16 | 0.169958 |
| min | 0.0 | 1393165000.0 | 0.0 |
| 25% | 6115.75 | 5.868252e+17 | 0.0 |
| 50% | 12221.5 | 6.224619e+17 | 0.0 |
| 75% | 18331.25 | 6.571315e+17 | 0.0 |
| max | 24441.0 | 7.290384e+17 | 1.0 |


In [40]:
tweets.head()

| | Unnamed: 0 | tweet_id | annotation | text |
|---|---|---|---|---|
| 0 | 0 | 1393164707 | 0.0 | "@mayers glückwunsch!!" |
| 1 | 1 | 1550399954 | 0.0 | "http://twitpic.com/3io4f - Ein neues Netbook ... |
| 2 | 2 | 1910339412 | 0.0 | "Städte- und Gemeindebund fordert rasche Verso... |
| 3 | 3 | 2069362297 | 0.0 | "Kennt jemand ein gut sortiertes Online-Archiv... |
| 4 | 4 | 4504479876 | 0.0 | Ehemaliges Sanatorium im Harz, 33 Zimmer, Schw... |


In [41]:
tweets[tweets.annotation > 0].head()

| | Unnamed: 0 | tweet_id | annotation | text |
|---|---|---|---|---|
| 19 | 19 | 12892061548150785 | 1.0 | Ich weiß nicht was drin steht, aber der Tweet.... |
| 37 | 37 | 114913181557735425 | 1.0 | RT @danielmack: Geht's noch!? Die Arschlöcher ... |
| 76 | 76 | 264089569828421632 | 1.0 | RT @MaHa21721755: Hungernde Flüchtline dürfen ... |
| 77 | 77 | 264401692806746113 | 1.0 | Diese Woche trotz Frost Salate gefunden, die u... |
| 79 | 79 | 265169505188212736 | 1.0 | RT @KapuzenAuf: Wow: Im Land der Frühaufsteher... |


A first inspection of the data shows that it contains more than 24,000 observations, each with an index, a tweet ID, the annotation value, and the text of the tweet. The outcome of interest, whether the tweet discusses the German refugee movements in 2015, is fairly rare: only about *3%* of tweets are about this topic (see the mean of `annotation` in the second code cell). Note, however, that if we faced a *real* classification problem (in the sense that the data is not yet annotated), we would not know the extent of this imbalance.
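To make the imbalance explicit, the class shares can be tabulated directly. The sketch below does not download the data; it uses a stand-in series whose 727 positive cases are an assumption reconstructed from the count (24,420) and mean (0.029771) in the `describe()` output above. With the real data you would call `value_counts` on `tweets.annotation` instead.

```python
import pandas as pd

# Stand-in for tweets.annotation (assumption): 727 positives out of 24,420 tweets,
# reconstructed from the count and mean reported by describe() above.
n_total, n_positive = 24420, 727
annotation = pd.Series([1.0] * n_positive + [0.0] * (n_total - n_positive))

# Absolute and relative class frequencies
class_counts = annotation.value_counts()
class_shares = annotation.value_counts(normalize=True)

print(class_counts)
print(class_shares.round(4))
```

The share of the positive class is simply the mean of the 0/1 annotation, which is why `describe()` already hinted at the imbalance.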

---


## The Problem with Imbalanced Data

Many, if not most, classification problems in the social sciences involve imbalanced data. But why can't we just train a classifier on a randomly sampled set of 2000 annotated observations? Let's try!

In [45]:
# load relevant packages
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RandomizedSearchCV
from sklearn.linear_model import LogisticRegression as LogReg
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# generate a random sample of 2000 tweets
rndsmpl = tweets.sample(n=2000,  random_state=42)

# divide into training and test set
X_train, X_test, y_train, y_test = train_test_split(
    rndsmpl.text,
    rndsmpl.annotation,
    test_size=0.20,
    random_state=42
)
X_train = np.array(X_train)
X_test = np.array(X_test)

# fit
pipe = Pipeline([('count', CountVectorizer()), ('LogReg', LogReg())])
pipe.fit(X_train, y_train)

# classify test set and show performance
y_pred = pipe.predict(X_test)
print("Accuracy: ", accuracy_score(y_test, y_pred))

Accuracy:  0.9725


**We can see that the accuracy of the classifier is actually *a-ma-zing*, classifying more than 97% of our samples correctly!**

So why would we bother *at all* with all this complicated sampling? The issue becomes visible once we delve deeper into our classifier's predictions:

In [46]:
pd.crosstab(y_test, y_pred)

| annotation | col_0 = 0.0 |
|---|---|
| 0.0 | 389 |
| 1.0 | 11 |


This table shows how the 400 observations in the test set are annotated (rows) and how our classifier judges their content (columns). We can see that there is one row for each label (389 unrelated tweets and 11 tweets about refugee movements) but just one column, for predictions of unrelated content. What is going on?

In [48]:
sum(y_pred)

0.0

Calculating the sum of our predicted labels, we can see that **not a single observation** has been classified as content about refugees!
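Metrics that focus on the positive class expose this immediately. The following is a minimal sketch using stand-in arrays (an assumption that mirrors the test set counts above: 389 unrelated tweets, 11 refugee tweets, and all-zero predictions) rather than the fitted pipeline itself. The names `y_true` and `y_hat` are illustrative.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Stand-in arrays mirroring the test set above (assumption):
# 389 unrelated tweets, 11 refugee tweets, and a classifier that always predicts 0.
y_true = np.array([0] * 389 + [1] * 11)
y_hat = np.zeros_like(y_true)

# labels=[0, 1] forces a 2x2 matrix even though class 1 is never predicted
cm = confusion_matrix(y_true, y_hat, labels=[0, 1])
print(cm)

# zero_division=0 returns 0 instead of warning when no positives are predicted
precision = precision_score(y_true, y_hat, zero_division=0)
recall = recall_score(y_true, y_hat)
f1 = f1_score(y_true, y_hat, zero_division=0)
print("Precision:", precision, "Recall:", recall, "F1:", f1)
```

Passing `labels=[0, 1]` keeps the empty column for the positive class visible, which `pd.crosstab` drops when a class is never predicted.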

**THE CLASSIFIER IS CHEATING!!**

![](https://media1.giphy.com/media/zz2a5ctsXTzkidQVSM/giphy.gif?cid=ecf05e47tmun8g4semdtsqem51bish1mvy2z0c4glgbdfjqt&rid=giphy.gif&ct=g)

The classifier just does what it is told, namely to obtain the best fit for the data. Because misclassifying eleven labels is not very costly, it simply assigns every tweet to the majority class, which makes up roughly 97% of the data, as there is no better fit for the data.
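One way to see how little the accuracy score tells us here is to compare it with a baseline that ignores the text entirely. The sketch below uses scikit-learn's `DummyClassifier` on stand-in labels with the test set counts from above (an assumption; the placeholder features and the name `baseline` are illustrative, and the model is fitted and scored on the same labels purely for demonstration).

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Stand-in labels with the test set class counts from above (assumption): 389 vs. 11
y_true = np.array([0] * 389 + [1] * 11)
X_placeholder = np.zeros((len(y_true), 1))  # features are ignored by the dummy model

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_placeholder, y_true)

baseline_accuracy = accuracy_score(y_true, baseline.predict(X_placeholder))
print("Baseline accuracy:", baseline_accuracy)
```

A model that never looks at the tweets reaches the same accuracy as the logistic regression above, so accuracy alone cannot tell the two apart.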