In [1]:
import pandas as pd

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)

# Task 1

### Read data

In [2]:
df2 = pd.read_csv('./data/Task1_2.csv', sep=';')
print(df2.shape)

(4070, 9)


In [3]:
df2.head()

Unnamed: 0,POUG,TRE,ID,ZUB,VOL,UIO,VBNM,Type,OIN
0,1,1.75,0,t,f,uuuu,t,n,17.92
1,0,0.29,1,f,f,wwww,f,n,16.92
2,1,0.0,2,f,f,wwww,t,n,31.25
3,0,0.335,3,f,f,uuuu,f,n,48.17
4,0,0.5,4,t,f,wwww,f,n,32.33


In [4]:
df1 = pd.read_csv('./data/Task1_1.csv', sep=';')
print(df1.shape)

(4070, 11)


In [5]:
df1.head()

Unnamed: 0,ID,UKL,GJAH,ZIK,HUI,ERZ,CDx,BJZHD,NKJUD,LPI,BJKG
0,0,160,oooo,x,oooo,www,5.0,vvvv,80.0,800000.0,qqqq
1,1,153,rrr,,uuu,pppp,0.0,mmm,200.0,2000000.0,qqqq
2,2,5,oooo,x,oooo,www,19.0,hh,96.0,960000.0,hh
3,3,9,oooo,,oooo,www,120.0,kkk,0.0,0.0,qqq
4,4,40,oooo,y,oooo,www,0.0,mmm,232.0,2320000.0,qqqq


### Preprocess

Before merging `df1` and `df2`, we need to ensure the records are unique. As shown below, each DataFrame contains 370 duplicate rows, where a "duplicate" is a row whose values match an earlier row in every column.

In [6]:
print(f"Number of duplicates in df1: {df1.duplicated().sum()}")
print(f"Number of duplicates in df2: {df2.duplicated().sum()}")

Number of duplicates in df1: 370
Number of duplicates in df2: 370


In [7]:
df1 = df1.drop_duplicates().reset_index(drop=True)
df2 = df2.drop_duplicates().reset_index(drop=True)

assert df1.duplicated().sum() == 0  # sanity check
assert df2.duplicated().sum() == 0

print(df1.shape)
print(df2.shape)

(3700, 11)
(3700, 9)
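
Note that dropping full-row duplicates does not by itself guarantee that the join key `ID` is unique: two rows could share an `ID` while differing in another column. A minimal sketch of that extra check, using a small hypothetical frame (`toy` and its values are made up for illustration, not taken from the task data):

```python
import pandas as pd

# Toy data (hypothetical): rows 0 and 1 are exact duplicates, while rows 2 and 3
# share an ID but differ in another column.
toy = pd.DataFrame({
    "ID": [0, 0, 1, 1, 2],
    "UKL": [160, 160, 153, 999, 5],
})

deduped = toy.drop_duplicates().reset_index(drop=True)

# Full-row deduplication removes only the exact copy.
n_rows = len(deduped)

# The join key can still repeat, so check it separately.
key_is_unique = deduped["ID"].is_unique

# keep=False flags every row involved in a key collision, not just the later ones.
conflicts = deduped[deduped.duplicated(subset=["ID"], keep=False)]

print(n_rows, key_is_unique)
print(conflicts)
```

On the real data, the same check would be `df1["ID"].is_unique` and `df2["ID"].is_unique` after the deduplication step.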


The instructions don't state the join type explicitly, so I assume an inner join on `ID` is desired. The merged frame has 3700 rows, the same as each input, which is consistent with a 100% match rate.

In [8]:
df = df1.merge(df2, on='ID', how='inner')
print(df.shape)

(3700, 19)
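
One way to make the match rate explicit, rather than inferring it from the row count, is to use the `validate` and `indicator` parameters of `DataFrame.merge`. The sketch below uses two small hypothetical frames (`left` and `right`, with made-up values) and an outer join so that unmatched keys stay visible:

```python
import pandas as pd

# Hypothetical inputs: IDs 0, 1, 2 appear on both sides; 3 and 4 on one side only.
left = pd.DataFrame({"ID": [0, 1, 2, 3], "UKL": [160, 153, 5, 9]})
right = pd.DataFrame({"ID": [0, 1, 2, 4], "TRE": [1.75, 0.29, 0.0, 0.5]})

# validate="one_to_one" raises pandas.errors.MergeError if ID repeats on either side.
# indicator=True adds a "_merge" column: "both", "left_only", or "right_only".
audit = left.merge(right, on="ID", how="outer", validate="one_to_one", indicator=True)

# Count rows per merge outcome and compute the share of keys found on both sides.
merge_counts = audit["_merge"].value_counts()
match_rate = (audit["_merge"] == "both").mean()

print(merge_counts)
print(f"match rate: {match_rate:.1%}")
```

If every key matched, all rows would be labelled `both` and the inner and outer joins would return the same rows.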


In [9]:
df.head()

Unnamed: 0,ID,UKL,GJAH,ZIK,HUI,ERZ,CDx,BJZHD,NKJUD,LPI,BJKG,POUG,TRE,ZUB,VOL,UIO,VBNM,Type,OIN
0,0,160,oooo,x,oooo,www,5.0,vvvv,80.0,800000.0,qqqq,1,1.75,t,f,uuuu,t,n,17.92
1,1,153,rrr,,uuu,pppp,0.0,mmm,200.0,2000000.0,qqqq,0,0.29,f,f,wwww,f,n,16.92
2,2,5,oooo,x,oooo,www,19.0,hh,96.0,960000.0,hh,1,0.0,f,f,wwww,t,n,31.25
3,3,9,oooo,,oooo,www,120.0,kkk,0.0,0.0,qqq,0,0.335,f,f,uuuu,f,n,48.17
4,4,40,oooo,y,oooo,www,0.0,mmm,232.0,2320000.0,qqqq,0,0.5,t,f,wwww,f,n,32.33


### EDA

As the instructions suggested, and the data below confirms, this is a binary classification problem. Importantly, the target variable `Type` is highly imbalanced (i.e. the distribution of classes is highly unequal): ~92.5% of cases are "y", while only ~7.5% are "n". While this phenomenon is common in real-world applications, it also poses some modeling challenges.

To address this, I focus on metrics that break performance down by class (e.g. precision, recall, F1 score, confusion matrix) rather than "global" indicators like accuracy. The reason is that a model could be about 92.5% accurate here simply by predicting the dominant class for every observation, and that would be a poor model.

In [10]:
df['Type'].value_counts(normalize=True)

y    0.925405
n    0.074595
Name: Type, dtype: float64
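
To illustrate why accuracy alone is misleading at this class balance, here is a minimal sketch of a majority-class baseline. The labels are hypothetical (40 made-up observations chosen to reproduce the 92.5% / 7.5% split), and the confusion matrix and recall are computed with plain pandas rather than a modeling library:

```python
import pandas as pd

# Hypothetical labels mirroring the imbalance: 37 "y" and 3 "n" (92.5% / 7.5%).
y_true = pd.Series(["y"] * 37 + ["n"] * 3, name="actual")

# Baseline: always predict the most frequent class.
majority = y_true.mode()[0]
y_pred = pd.Series([majority] * len(y_true), name="predicted")

accuracy = (y_true == y_pred).mean()

# Confusion matrix: rows are actual classes, columns are predicted classes.
# reindex adds the never-predicted class as a column of zeros.
labels = ["n", "y"]
confusion = (
    pd.crosstab(y_true, y_pred)
    .reindex(index=labels, columns=labels, fill_value=0)
)

# Recall per class = correct predictions for that class / actual count of that class.
recall = {c: confusion.loc[c, c] / confusion.loc[c].sum() for c in labels}

print(f"accuracy: {accuracy:.3f}")
print(confusion)
print(recall)
```

The baseline reaches high accuracy while never identifying a single "n" case, which is exactly what per-class recall exposes.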