# 1.0 An end-to-end classification problem (ETL)



## 1.1 Dataset description



We'll be looking at individual income in the United States. The **data** comes from the **1994 census** and contains information on each individual's **marital status**, **age**, **type of work**, and more. The **target column**, the value we want to predict, indicates whether an individual earns **at most 50k a year** or **more than 50k a year**.
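
To make the target concrete, here is a minimal sketch on toy rows that encodes the two income classes as 0/1 and checks the class balance. The column name `high_income` matches the one used later in this notebook, but the label strings `<=50K` and `>50K` are an assumption about how the raw file spells them, so adjust them to what `df.high_income.unique()` actually returns.

```python
import pandas as pd

# Toy rows for illustration only; the label strings are assumed, not read from raw_data.csv.
toy = pd.DataFrame({"high_income": ["<=50K", ">50K", "<=50K", "<=50K"]})

# Encode the target: 1 when the individual earns more than 50k, 0 otherwise.
# str.strip() guards against labels stored with leading or trailing spaces.
toy["target"] = (toy["high_income"].str.strip() == ">50K").astype(int)

# Share of each class; a quick way to spot class imbalance.
class_share = toy["target"].value_counts(normalize=True).sort_index()
print(class_share)
```

A strongly skewed share is worth knowing early, because it affects how the data should be split and which metrics are meaningful later on.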

You can download the data from the [University of California, Irvine's website](http://archive.ics.uci.edu/ml/datasets/Adult).

Let's take the following steps:

1. Load Libraries
2. Fetch Data, including EDA
3. Pre-processing
4. Data Segregation

<center><img width="600" src="https://drive.google.com/uc?export=view&id=1a-nyAPNPiVh-Xb2Pu2t2p-BhSvHJS0pO"></center>

## 1.2 Install and load libraries

In [None]:
!pip install pandas-profiling==3.1.0

In [None]:
!pip install wandb

In [None]:
import wandb
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from pandas_profiling import ProfileReport
import tempfile
import os

## 1.3 Exploratory Data Analysis (EDA)

### 1.3.1 Log in to wandb


In [None]:
# Login to Weights & Biases
!wandb login --relogin

### 1.3.2 Download raw_data artifact from Wandb

In [None]:
# save_code=True saves this notebook's code to Weights & Biases along with the run
run = wandb.init(project="decision_tree", save_code=True)

In [None]:
# download the latest version of the raw_data.csv artifact
artifact = run.use_artifact("decision_tree/raw_data.csv:latest")

# create a dataframe from the artifact
df = pd.read_csv(artifact.file())

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.describe()
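
By default `describe()` summarizes only the numeric columns, so categorical columns such as `workclass` are left out. The sketch below, on a toy frame with assumed values, shows one way to get a summary of the non-numeric columns as well.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration; the values are made up.
toy = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["Private", "Private", "State-gov"],
})

# Default behavior: numeric columns only (count, mean, std, quartiles, ...).
numeric_summary = toy.describe()

# Excluding numeric dtypes leaves the text columns (count, unique, top, freq).
categorical_summary = toy.describe(exclude=[np.number])

print(numeric_summary)
print(categorical_summary)
```

On the real data, the same call would be `df.describe(exclude=[np.number])`.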

### 1.3.3 Pandas Profiling

In [None]:
ProfileReport(df, title="Pandas Profiling Report", explorative=True)

### 1.3.4 Manual EDA

In [None]:
# Count the duplicated rows
df.duplicated().sum()

In [None]:
# Delete duplicated rows, then confirm that none remain
df.drop_duplicates(inplace=True)
df.duplicated().sum()
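
As a small illustration of what the two cells above do, the sketch below runs the same calls on a toy frame with made-up values. `duplicated()` flags every row that is identical to an earlier row, so the first occurrence is kept and only the repeats are counted.

```python
import pandas as pd

# Toy frame for illustration: rows 1 and 3 repeat row 0.
toy = pd.DataFrame({
    "age": [39, 39, 50, 39],
    "sex": ["Male", "Male", "Female", "Male"],
})

# Number of rows that are exact copies of an earlier row.
n_dupes = toy.duplicated().sum()

# Non-inplace variant: returns a new frame and leaves `toy` untouched.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_dupes, len(toy), len(deduped))
```

The notebook uses `inplace=True`, which modifies `df` directly; the non-inplace form shown here is an alternative that keeps the original frame available for comparison.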

In [None]:
# How does income vary by sex?
pd.crosstab(df.high_income, df.sex, margins=True, normalize=False)

In [None]:
# How does income vary by sex and race?
pd.crosstab(df.high_income, [df.sex, df.race], margins=True)
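
Raw counts are hard to compare when the groups differ in size. A possible complement, sketched here on toy rows with assumed labels, is to normalize each column so the table shows the share of each income class within each sex.

```python
import pandas as pd

# Toy rows for illustration; labels and proportions are made up.
toy = pd.DataFrame({
    "high_income": ["<=50K", "<=50K", "<=50K", ">50K", "<=50K", ">50K"],
    "sex": ["Female", "Female", "Female", "Female", "Male", "Male"],
})

# normalize="columns" divides each column by its total,
# so every column sums to 1 and groups of different sizes become comparable.
shares = pd.crosstab(toy.high_income, toy.sex, normalize="columns")
print(shares)
```

On the real data the equivalent call would be `pd.crosstab(df.high_income, df.sex, normalize="columns")`.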

In [None]:
%matplotlib inline

sns.catplot(x="sex", 
            hue="race", 
            col="high_income",
            data=df, kind="count",
            height=4, aspect=.7)
plt.show()

In [None]:
g = sns.catplot(x="sex", 
                hue="workclass", 
                col="high_income",
                data=df, kind="count",
                height=4, aspect=.7)

g.savefig("HighIncome_Sex_Workclass.png", dpi=100)

run.log(
    {"High_Income vs Sex vs Workclass": wandb.Image("HighIncome_Sex_Workclass.png")}
)

In [None]:
df.isnull().sum()
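
A zero count from `isnull()` does not always mean the data is complete. In the UCI distribution of this dataset, unknown values are typically written as a `?` string rather than left empty; whether the `raw_data.csv` artifact keeps that convention is an assumption to verify. The sketch below, on toy rows, shows one way to surface such placeholders.

```python
import pandas as pd

# Toy rows for illustration; " ?" mimics a placeholder with a leading space.
toy = pd.DataFrame({
    "workclass": ["Private", " ?", "State-gov"],
    "occupation": [" ?", "Sales", " ?"],
})

# The placeholder is an ordinary string, so isnull() does not see it.
nulls_before = toy.isnull().sum()

# Strip surrounding whitespace, then turn "?" into a real missing value.
stripped = toy.apply(lambda col: col.str.strip())
cleaned = stripped.mask(stripped == "?")

nulls_after = cleaned.isnull().sum()
print(nulls_before)
print(nulls_after)
```

The `apply` step here assumes every column holds text; on the full dataframe it would need to be restricted to the text columns.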

In [None]:
run.finish()