# 1.0 An end-to-end classification problem (Part I)



## 1.1 Dataset description



We'll be looking at individual income in the United States. The **data** is from the **1994 census** and contains information on each individual's **marital status**, **age**, **type of work**, and more. The **target column**, the value we want to predict, is whether an individual makes at most 50k a year or **more than 50k a year**.
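
To make the target concrete, here is a minimal sketch of how the two income classes could be turned into a 0/1 label. The toy `labels` series and the `target` name are assumptions for illustration only. The notebook itself keeps the original string labels in this part.

```python
import pandas as pd

# Toy labels written the way they appear in the raw file
# (assumption: values carry a leading space after the comma delimiter).
labels = pd.Series([" <=50K", " >50K", " <=50K"], name="high_income")

# 1 means "more than 50k a year", 0 means "50k a year or less".
target = (labels.str.strip() == ">50K").astype(int)
print(target.tolist())
```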

You can download the data from the [University of California, Irvine's website](http://archive.ics.uci.edu/ml/datasets/Adult).

Let's take the following steps:

1. Load Libraries
2. Fetch Data, including EDA
3. Pre-processing
4. Data Segregation

<center><img width="600" src="https://drive.google.com/uc?export=view&id=1a-nyAPNPiVh-Xb2Pu2t2p-BhSvHJS0pO"></center>

## 1.2 Load libraries

In [None]:
import wandb
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from pandas_profiling import ProfileReport
from dataprep.eda import create_report, plot_diff
from sklearn.model_selection import train_test_split
import tempfile
import os

## 1.3 Get data & Exploratory Data Analysis (EDA)

### 1.3.1 Create the raw_data artifact

In [None]:
# columns used 
columns = ['age', 'workclass', 'fnlwgt', 'education', 'education_num',
           'marital_status', 'occupation', 'relationship', 'race', 
           'sex','capital_gain', 'capital_loss', 'hours_per_week',
           'native_country','high_income']
# importing the dataset
income = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
                   header=None,
                   names=columns)
income.head()
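
One quirk worth knowing: the fields in `adult.data` are separated by a comma followed by a space, so text values loaded as above keep a leading space. The sketch below uses two abridged, made-up rows in that layout (not real records) to show the effect of the `skipinitialspace` option of `pd.read_csv`. Whether to apply it here or later in pre-processing is a design choice this notebook leaves open.

```python
import io
import pandas as pd

# Two abridged, illustrative rows in the same ", "-separated layout.
sample = (
    "39, State-gov, Bachelors, <=50K\n"
    "50, Private, HS-grad, >50K\n"
)
cols = ["age", "workclass", "education", "high_income"]

# Default parsing keeps the space that follows each comma.
raw = pd.read_csv(io.StringIO(sample), header=None, names=cols)

# skipinitialspace=True drops whitespace right after the delimiter.
clean = pd.read_csv(io.StringIO(sample), header=None, names=cols,
                    skipinitialspace=True)

print(repr(raw.loc[0, "workclass"]))
print(repr(clean.loc[0, "workclass"]))
```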

In [None]:
income.to_csv("raw_data.csv",index=False)

In [None]:
!wandb login --relogin

In [None]:
# Send raw_data.csv to Wandb, storing it as an artifact
!wandb artifact put \
      --name week_07_eda/raw_data.csv \
      --type raw_data \
      --description "The raw data from 1994 US Census" raw_data.csv

### 1.3.2 Download raw_data artifact from Wandb

In [None]:
# save_code=True tracks changes to the notebook and syncs them with Wandb
run = wandb.init(project="week_07_eda", save_code=True)

In [None]:
# download the latest version of the raw_data.csv artifact
artifact = run.use_artifact("week_07_eda/raw_data.csv:latest")

# create a dataframe from the artifact
df = pd.read_csv(artifact.file())

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.describe()

### 1.3.3 DataPrep

In [None]:
create_report(df).show()

### 1.3.4 Pandas Profiling

In [None]:
ProfileReport(df, title="Pandas Profiling Report", explorative=True)

In [None]:
# There are duplicated rows
df.duplicated().sum()

In [None]:
# Delete duplicated rows
df.drop_duplicates(inplace=True)
df.duplicated().sum()
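
As a small sketch of what the two cells above do, the toy frame below (hypothetical data) contains one repeated row. `duplicated()` flags every row that is identical to an earlier one, and `drop_duplicates()` keeps the first occurrence. Note that the original index labels are kept, so resetting the index afterwards is optional.

```python
import pandas as pd

# Toy frame: the second row repeats the first.
toy = pd.DataFrame({"age": [39, 39, 50],
                    "sex": ["Male", "Male", "Female"]})

# Count rows that are exact copies of an earlier row.
n_dupes = int(toy.duplicated().sum())

# Keep the first occurrence of each row, then tidy the index.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_dupes, len(toy), len(deduped))
```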


### 1.3.5 Manual EDA

In [None]:
# how can the sex column help us?
pd.crosstab(df.high_income,df.sex,margins=True)

In [None]:
# income vs [sex & race]?
pd.crosstab(df.high_income,[df.sex,df.race])
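
Raw counts are hard to compare when the groups differ in size. One possible refinement, sketched here on made-up toy data, is the `normalize` argument of `pd.crosstab`, which turns each column into proportions so the share of high earners within each group can be read directly.

```python
import pandas as pd

# Toy data (hypothetical): three men, two women.
toy = pd.DataFrame({
    "sex": ["Male", "Male", "Male", "Female", "Female"],
    "high_income": [">50K", "<=50K", "<=50K", "<=50K", "<=50K"],
})

# normalize="columns" makes every column sum to 1.
shares = pd.crosstab(toy.high_income, toy.sex, normalize="columns")
print(shares)
```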

In [None]:
%matplotlib inline

sns.catplot(x="sex", 
            hue="race", 
            col="high_income",
            data=df, kind="count",
            height=4, aspect=.7)
plt.show()

In [None]:
g = sns.catplot(x="sex", 
                hue="workclass", 
                col="high_income",
                data=df, kind="count",
                height=4, aspect=.7)

g.savefig("HighIncome_Sex_Workclass.png", dpi=100)

run.log(
        {
            "High_Income vs Sex vs Workclass": wandb.Image("HighIncome_Sex_Workclass.png")
        }
    )

In [None]:
df.isnull().sum()
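
A zero count from `isnull()` does not necessarily mean the data is complete. The Adult dataset marks unknown values with a `?` string, which pandas treats as ordinary text when the file is loaded as above. The sketch below, on a hypothetical toy frame, counts such placeholders per column. The column names mirror the dataset, but the rows are made up.

```python
import pandas as pd

# Toy frame with "?" placeholders (leading spaces as in the raw file).
toy = pd.DataFrame({
    "age": [39, 54, 28],
    "workclass": [" Private", " ?", " ?"],
    "occupation": [" Sales", " ?", " Tech-support"],
})

# Only text columns can hold the placeholder.
text_cols = toy.select_dtypes(exclude="number").columns
stripped = toy[text_cols].apply(lambda s: s.str.strip())

# isnull() sees nothing missing, while the placeholder count does.
n_null = int(toy.isnull().sum().sum())
placeholder_counts = stripped.eq("?").sum()

print(n_null)
print(placeholder_counts.to_dict())
```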

## 1.4 Data segregation (train/test split)

In [None]:
splits = {}
splits["train"], splits["test"] = train_test_split(df,
                                                   test_size=0.30,
                                                   random_state=41,
                                                   stratify=df["high_income"])
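
The `stratify` argument keeps the class proportions of the target the same in both splits. Here is a minimal check on a toy frame (hypothetical data with 20% high earners), using the same `test_size` and `random_state` as above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame: 80 low-income rows and 20 high-income rows.
toy = pd.DataFrame({
    "x": range(100),
    "high_income": ["<=50K"] * 80 + [">50K"] * 20,
})

train, test = train_test_split(toy,
                               test_size=0.30,
                               random_state=41,
                               stratify=toy["high_income"])

# Share of high earners in each split.
train_share = (train["high_income"] == ">50K").mean()
test_share = (test["high_income"] == ">50K").mean()
print(len(train), len(test), train_share, test_share)
```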

In [None]:
# Save the artifacts. We use a temporary directory so we do not leave
# any trace behind

with tempfile.TemporaryDirectory() as tmp_dir:

    # Use split_df as the loop variable so the full dataframe df is not overwritten
    for split, split_df in splits.items():

        # Build the artifact name from the name of the split
        artifact_name = f"data_{split}.csv"

        # Get the path on disk within the temp directory
        temp_path = os.path.join(tmp_dir, artifact_name)

        # Save then upload to W&B
        split_df.to_csv(temp_path, index=False)

        artifact = wandb.Artifact(
            name=artifact_name,
            type="raw_data",
            description=f"{split} split of dataset week_07_eda/raw_data.csv:latest",
        )
        artifact.add_file(temp_path)

        run.log_artifact(artifact)

        # This waits for the artifact to be uploaded to W&B. If you
        # do not add this, the temp directory might be removed before
        # W&B had a chance to upload the datasets, and the upload
        # might fail
        artifact.wait()
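
The reason for waiting inside the `with` block can be seen without W&B at all. The sketch below, using toy dataframes and no upload, shows that files written to a `tempfile.TemporaryDirectory` exist only while the block is open. The `written` and `gone` names are illustrative.

```python
import os
import tempfile
import pandas as pd

# Toy splits standing in for the real train and test data.
splits = {"train": pd.DataFrame({"a": [1, 2]}),
          "test": pd.DataFrame({"a": [3]})}
written = {}

with tempfile.TemporaryDirectory() as tmp_dir:
    for split, split_df in splits.items():
        path = os.path.join(tmp_dir, f"data_{split}.csv")
        split_df.to_csv(path, index=False)
        # Anything that needs the file (such as an upload) must finish here.
        written[split] = os.path.exists(path)

# Leaving the block deletes the directory and everything in it.
gone = not os.path.exists(tmp_dir)
print(written, gone)
```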

### 1.4.1 Download the train and test artifacts

In [None]:
# download the latest version of the data_train.csv and data_test.csv artifacts
artifact_train = run.use_artifact("week_07_eda/data_train.csv:latest")
artifact_test = run.use_artifact("week_07_eda/data_test.csv:latest")

# create a dataframe from each artifact
df_train = pd.read_csv(artifact_train.file())
df_test  = pd.read_csv(artifact_test.file())

In [None]:
print("Train: {}".format(df_train.shape))
print("Test: {}".format(df_test.shape))

In [None]:
plot_diff([df_train,df_test])

In [None]:
run.finish()