# 1.0 An end-to-end classification problem



## 1.1 Dataset description



We'll look at individual income in the United States. The data comes from the **1994 census** and contains information on each individual's **marital status**, **age**, **type of work**, and more. The **target column**, the value we want to predict, is whether an individual makes more than 50k a year or at most 50k a year.

You can download the data from the [University of California, Irvine's website](http://archive.ics.uci.edu/ml/datasets/Adult).

## 1.2 Load Libraries, Train and Validation Sets

In [None]:
!pip install wandb

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.neighbors import LocalOutlierFactor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import roc_auc_score
from sklearn.tree import plot_tree
import wandb

In [None]:
!wandb login --relogin

In [None]:
# save_code=True tracks the notebook's code and syncs it with W&B
run = wandb.init(project="Week08_Example_01", save_code=True)

In [None]:
local_path = run.use_artifact("week_07_data_segregation/train_data.csv:latest").file()
df_train = pd.read_csv(local_path)

local_path = run.use_artifact("week_07_data_segregation/test_data.csv:latest").file()
df_test = pd.read_csv(local_path)

In [None]:
df_train.head()

In [None]:
df_test.head()

## 1.3 Train and Validation Split

In [None]:
# split df_train into train and validation sets (df_test stays untouched)
x_train, x_val, y_train, y_val = train_test_split(df_train.drop(labels="high_income",axis=1),
                                                    df_train["high_income"],
                                                    test_size=0.30,
                                                    random_state=41,
                                                    shuffle=True,
                                                    stratify=df_train["high_income"])

In [None]:
print("x train: {}".format(x_train.shape))
print("y train: {}".format(y_train.shape))
print("x val: {}".format(x_val.shape))
print("y val: {}".format(y_val.shape))
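
The `stratify` argument used above keeps the class proportions roughly equal in both subsets. The following is a minimal sketch on hypothetical toy labels (75 negatives, 25 positives, not the census data) that shows the effect; only `train_test_split` and its parameters come from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels (hypothetical): 75 negatives and 25 positives
y = np.array([0] * 75 + [1] * 25)
X = np.arange(len(y)).reshape(-1, 1)

# Same settings as the notebook's split, applied to the toy data
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.30, random_state=41, shuffle=True, stratify=y
)

# Share of the positive class in the full set and in each subset
full_ratio = y.mean()
train_ratio = y_tr.mean()
val_ratio = y_va.mean()
print(full_ratio, round(train_ratio, 3), round(val_ratio, 3))
```

Without `stratify`, a small validation set could end up with noticeably more or fewer positives than the training set, which makes validation metrics harder to interpret.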

## 1.4 Outlier Removal

In [None]:
# Check the int64 columns for outliers.
# LOF is fit on the training split only, so no information leaks from validation.

# numeric features
x = x_train.select_dtypes("int64").copy()

# identify outliers in the dataset (fit_predict returns -1 for outliers, 1 for inliers)
lof = LocalOutlierFactor()
outlier = lof.fit_predict(x)
mask = outlier != -1

print("X_train shape [original]: {}".format(x_train.shape))
print("X_train shape [outlier removal]: {}".format(x_train.loc[mask,:].shape))

# keep only the inliers, in both features and target
x_train = x_train.loc[mask,:].copy()
y_train = y_train[mask].copy()
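
To see what `fit_predict` returns, here is a small sketch on synthetic data (an assumption for illustration, not the census features): a Gaussian cluster plus one point placed far away. `LocalOutlierFactor` labels inliers with `1` and outliers with `-1`, which is why the mask above keeps `outlier != -1`.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Synthetic data (hypothetical): 50 clustered points plus one distant point
rng = np.random.default_rng(0)
cluster = rng.normal(loc=0.0, scale=1.0, size=(50, 2))
far_point = np.array([[100.0, 100.0]])
X = np.vstack([cluster, far_point])

# fit_predict returns 1 for inliers and -1 for outliers
lof = LocalOutlierFactor()
labels = lof.fit_predict(X)
mask = labels != -1

# Keep only the rows flagged as inliers
X_clean = X[mask]
print(labels[-1], X.shape, X_clean.shape)
```

Note that LOF may also flag a few points inside the cluster, so the number of removed rows can be larger than one.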

## 1.5 Encoding target variable

If a categorical target variable needs to be encoded for a classification predictive modeling problem, then the [LabelEncoder class](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) can be used.

In [None]:
# define a categorical encoding for target variable
le = LabelEncoder()

# fit and transform y_train
y_train = le.fit_transform(y_train)

# transform y_val with the encoder fitted on y_train (avoiding data leakage)
y_val = le.transform(y_val)

print("Classes: {}".format(le.classes_))

In [None]:
# just in case you need the inverse transformation
le.inverse_transform([0, 1])

In [None]:
# sampling of transformed target variable
print(y_train[:5],y_val[-6:-1])
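
As a quick illustration of the fit-on-train, transform-on-validation pattern, the sketch below runs `LabelEncoder` on a few hand-written labels. The label strings `"<=50K"` and `">50K"` are assumptions; the actual values in `high_income` may be spelled differently.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical raw labels; the real column may use different strings
y_train_raw = ["<=50K", ">50K", "<=50K", "<=50K"]
y_val_raw = [">50K", "<=50K"]

le = LabelEncoder()
y_train_enc = le.fit_transform(y_train_raw)  # learn the classes from train only
y_val_enc = le.transform(y_val_raw)          # reuse the learned mapping

# classes_ is sorted, so its index gives the encoded integer
print(list(le.classes_))
print(y_train_enc.tolist(), y_val_enc.tolist())
print(le.inverse_transform([0, 1]).tolist())
```

Because `classes_` is sorted, `"<=50K"` maps to 0 and `">50K"` maps to 1 in this toy case.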

## 1.6 Encoding independent variables [Experiment]

In [None]:
# 
# just an experimentation
#

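# drop="first" removes one redundant column per feature
# (sparse_output replaces the older `sparse` argument in scikit-learn >= 1.2)
onehot = OneHotEncoder(sparse_output=False,drop="first")
# fit using x_train only
onehot.fit(x_train["sex"].values.reshape(-1,1))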

x_val_aux = x_val.copy()
x_train_aux = x_train.copy()

# transform train and val
x_train_aux[onehot.get_feature_names_out()] = onehot.transform(x_train_aux["sex"].values.reshape(-1,1))
x_val_aux[onehot.get_feature_names_out()] = onehot.transform(x_val_aux["sex"].values.reshape(-1,1))

x_val_aux.head()

In [None]:
onehot.get_feature_names_out()

In [None]:
onehot.inverse_transform([[0]])

## 1.7 Encoding independent variables

In [None]:
# review which columns are categorical
x_train.select_dtypes("object").columns.to_list()

In [None]:
# 8 columns have dtype object; one-hot encode each of them

# encode every object column with an encoder fitted on x_train only
for name in x_train.select_dtypes("object").columns.to_list():
  # sparse_output replaces the older `sparse` argument in scikit-learn >= 1.2
  onehot = OneHotEncoder(sparse_output=False,drop="first")
  # fit using x_train
  onehot.fit(x_train[name].values.reshape(-1,1))

  # transform train and validation
  x_train[onehot.get_feature_names_out()] = onehot.transform(x_train[name].values.reshape(-1,1))
  x_val[onehot.get_feature_names_out()] = onehot.transform(x_val[name].values.reshape(-1,1))

In [None]:
x_train.head()

In [None]:
x_val.head()

In [None]:
cols=['workclass','education','marital_status','occupation',
      'relationship','race','sex','native_country']

x_train.drop(labels=cols,axis=1,inplace=True)
x_val.drop(labels=cols,axis=1,inplace=True)

In [None]:
x_train.head()

In [None]:
x_val.head()

## 1.8 Modeling & tuning

In [None]:
# create a pipeline with a single decision tree step
pipe = Pipeline([("classifier", DecisionTreeClassifier())])

# train on the training split
pipe.fit(x_train,y_train)

# predict on the validation split
predict = pipe.predict(x_val)

In [None]:
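# confusion matrix, transposed relative to scikit-learn's default to match the slides:
# confusion_matrix expects (y_true, y_pred); passing (predict, y_val) puts
# predictions on the rows and true labels on the columns
#             true label
#               1     0     
# predict  1    TP    FP
#          0    FN    TN
#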

confusion_matrix(predict,y_val,
                 labels=[1,0])
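
The effect of swapping the arguments can be checked on a small hand-made example (the labels below are hypothetical). With the documented order `confusion_matrix(y_true, y_pred)`, rows are true labels; with the arguments swapped, rows are predictions, which is the transpose.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: TP=1, FN=2, FP=1, TN=2
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0]

# Default orientation: rows = true labels, columns = predictions
standard = confusion_matrix(y_true, y_pred, labels=[1, 0])

# Swapped arguments: rows = predictions, columns = true labels
swapped = confusion_matrix(y_pred, y_true, labels=[1, 0])

print(standard.tolist())  # [[TP, FN], [FP, TN]]
print(swapped.tolist())   # [[TP, FP], [FN, TN]]
```

Either orientation is fine as long as the axis labels in any plot match the one in use.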

In [None]:
print(accuracy_score(y_val, predict))
print(classification_report(y_val,predict))

In [None]:
fig, ax = plt.subplots(1,1,figsize=(7,4))

ConfusionMatrixDisplay(confusion_matrix(predict,y_val,labels=[1,0]),
                       display_labels=[">50k","<=50k"],).plot(values_format=".0f",ax=ax)

ax.set_xlabel("True Label")
ax.set_ylabel("Predicted Label")
plt.show()

In [None]:
roc_auc_score(y_val, predict, average="macro")
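
`roc_auc_score` accepts either hard labels or continuous scores, and the two can give different values. The sketch below uses hypothetical scores to show the gap; with the pipeline above, scores could be obtained with `pipe.predict_proba(x_val)[:, 1]` instead of `pipe.predict(x_val)`.

```python
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and predicted probabilities of the positive class
y_true = [0, 0, 0, 1, 1, 1]
scores = [0.1, 0.2, 0.7, 0.4, 0.8, 0.9]

# Hard labels obtained by thresholding the scores at 0.5
hard = [int(s >= 0.5) for s in scores]

auc_scores = roc_auc_score(y_true, scores)  # uses the full ranking
auc_hard = roc_auc_score(y_true, hard)      # uses a single operating point
print(round(auc_scores, 3), round(auc_hard, 3))
```

Thresholded labels collapse the ROC curve to one point, so the AUC computed from them usually understates how well the model ranks the examples.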

In [None]:
# draw the fitted classification tree (plot_tree is imported at the top of the notebook)
fig, ax = plt.subplots(1,1, figsize=(15, 10))
plot_tree(pipe["classifier"], 
          filled=True, 
          rounded=True, 
          class_names=["<=50k", ">50k"],
          feature_names=list(x_val.columns), ax=ax)
plt.show()