# Load Data

This week we will use the Adult dataset to predict a person's income. The dataset is available from the UCI Machine Learning Repository: [Adult Data Set](https://archive.ics.uci.edu/ml/datasets/adult). It consists of 48,842 entries, each containing the following information about an individual:

- **age**: the age of an individual.

- **workclass**: a general term to represent the employment status of an individual.

- **fnlwgt**: final weight, the number of people the census believes the entry represents.

- **education**: the highest level of education an individual has completed.

- **education_num**: the highest level of education an individual has completed, encoded as a number.

- **marital_status**: marital status of an individual.

- **occupation**: the general type of occupation of an individual.

- **relationship**: the individual's relationship to others, for example Husband, Wife, or Own-child.

- **race**: race of an individual.

- **sex**: sex of an individual.

- **capital_gain**: capital gains for an individual.

- **capital_loss**: capital loss for an individual.

- **hours_per_week**: the number of hours an individual works per week.

- **native_country**: country of origin for an individual.

- **income**: whether or not an individual makes more than 50,000 dollars annually.


In [2]:
from sklearn.datasets import fetch_openml
import ipytest
ipytest.autoconfig()

adult = fetch_openml('adult', version=2, as_frame=True)
X = adult['data']
y = adult['target']

In [56]:
X

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capitalgain | capitalloss | hoursperweek | native-country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 1 | 0 | 2 | United-States |
| 1 | 3 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 0 | United-States |
| 2 | 2 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 2 | United-States |
| 3 | 3 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 2 | United-States |
| 4 | 1 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 2 | Cuba |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 48837 | 2 | Private | 215419 | Bachelors | 13 | Divorced | Prof-specialty | Not-in-family | White | Female | 0 | 0 | 2 | United-States |
| 48838 | 4 | | 321403 | HS-grad | 9 | Widowed | | Other-relative | Black | Male | 0 | 0 | 2 | United-States |
| 48839 | 2 | Private | 374983 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Husband | White | Male | 0 | 0 | 3 | United-States |
| 48840 | 2 | Private | 83891 | Bachelors | 13 | Divorced | Adm-clerical | Own-child | Asian-Pac-Islander | Male | 2 | 0 | 2 | United-States |


## Homework 1 - Split data

Select the numeric features into `X_num`, then split that data into training and testing sets.
Call the feature splits `X_num_train` and `X_num_test`, and the target splits `y_num_train` and `y_num_test`.
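
As a rough illustration of the two steps, here is a sketch on a small made-up frame rather than the Adult data. It assumes `select_dtypes` is an acceptable way to pick numeric columns; the `toy_` names, the 25% test size, and the `random_state` are illustrative choices, not requirements of the exercise.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the Adult data: two numeric columns and one categorical.
toy_X = pd.DataFrame({
    "age": [25, 38, 28, 44, 18, 34, 29, 63],
    "hours-per-week": [40, 50, 40, 40, 30, 30, 40, 32],
    "workclass": ["Private", "Private", "Local-gov", "Private",
                  None, "Private", None, "Self-emp-not-inc"],
})
toy_y = pd.Series(["<=50K", "<=50K", ">50K", ">50K",
                   "<=50K", "<=50K", "<=50K", ">50K"], name="income")

# Keep only the columns with a numeric dtype.
toy_X_num = toy_X.select_dtypes(include="number")

# Hold out 25% of the rows; random_state makes the split repeatable.
toy_X_num_train, toy_X_num_test, toy_y_train, toy_y_test = train_test_split(
    toy_X_num, toy_y, test_size=0.25, random_state=42
)

print(toy_X_num_train.shape, toy_X_num_test.shape)
```

Passing the features and the target to the same `train_test_split` call keeps their rows aligned.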

In [None]:
%%ipytest
def test_X_num_train_X_num_test_same():
    assert X_num_train.shape[1] == X_num_test.shape[1], "The number of columns in X_num_train and X_num_test are not the same"

def test_X_num_train_smaller_than_X_num():
    assert X_num_train.shape[0] < X_num.shape[0], "X_num_train is not smaller than X_num"

def test_X_num_train_columns_less_than_X_columns():
    assert X_num_train.shape[1] < X.shape[1], "X_num_train does not have fewer columns than X"

...                                                                                          [100%]
3 passed in 0.01s


## Homework 2 - Logistic Regression with Numeric Features

Create a logistic regression model (name the model `lr`) to predict whether an adult will earn more than $50k a year. Use only the numeric columns in the dataset. What is the score?
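
The fit-and-score pattern can be sketched on synthetic numeric data. Everything named `demo_` below is hypothetical, and `max_iter=1000` is an assumption intended to give the solver room to converge; on the real data, unscaled columns such as `fnlwgt` may still trigger convergence warnings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic numeric features; the label depends on the first column only.
demo_X = rng.normal(size=(200, 3))
demo_y = np.where(demo_X[:, 0] > 0, ">50K", "<=50K")

demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X, demo_y, test_size=0.25, random_state=0
)

# Fit on the training rows only, then score on the held-out rows.
demo_lr = LogisticRegression(max_iter=1000)
demo_lr.fit(demo_X_train, demo_y_train)

# For classifiers, score() returns mean accuracy.
demo_accuracy = demo_lr.score(demo_X_test, demo_y_test)
print(f"accuracy: {demo_accuracy:.3f}")
```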

In [72]:
%%ipytest
import pytest

def test_lr_works():
    assert lr.score(X_num_test, y_num_test) > 0.7, "The logistic regression model does not work"

# Scoring on the full X, which still contains the non-numeric columns, should raise a ValueError
def test_lr_fails():
    with pytest.raises(ValueError):
        lr.score(X, y)

..                                                                                           [100%]
2 passed in 0.16s


## Homework 3 - XGBoost

Create an XGBoost model (call it `xg`) to predict the same outcome, using the same data split. What is the score?
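
The test cell for this exercise scores `xg` against `y_num_test == '>50K'`, which suggests the model is expected to be fit on a boolean target rather than on the raw string labels. The sketch below shows only that label conversion on a made-up series. The fitting lines are left as comments because they assume xgboost is installed and that `XGBClassifier` with default settings is the intended model.

```python
import pandas as pd

# Toy target with the same two labels as the Adult data.
toy_y = pd.Series(["<=50K", ">50K", "<=50K", ">50K", "<=50K"], name="class")

# Element-wise comparison yields a boolean Series: True means ">50K".
toy_y_bool = toy_y == ">50K"
print(toy_y_bool.tolist())

# With xgboost installed, fitting might then look like this (not run here):
#   from xgboost import XGBClassifier
#   xg = XGBClassifier()
#   xg.fit(X_num_train, y_num_train == ">50K")
```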

In [75]:
%%ipytest

def test_xg_works():
    assert xg.score(X_num_test, y_num_test == '>50K') > 0.7, "The xgboost model does not work"

def test_xg_fails():
    with pytest.raises(ValueError):
        xg.score(X, y)

def test_xg_is_fit():
    assert hasattr(xg, 'classes_'), "The xgboost model is not fitted"

...                                                                                          [100%]
3 passed in 0.02s


## Homework 4 - Pipeline

Create a categorical pipeline (called `cat_pipeline`) that fills in missing values with 'Other' and then dummy encodes the result, using a maximum of 10 categories.

Remember to call `set_config(transform_output='pandas')` so that each step returns a pandas DataFrame.

In [85]:
%%ipytest
import pandas as pd
import numpy as np

def test_cat_pipeline_basic():
    data = pd.DataFrame({'a': ['a', 'b', 'c', 'd', 'e']})
    pipe = cat_pipeline
    res = pipe.fit_transform(data)
    print(res)
    assert res.shape[1] > 1, "The number of columns is not correct"

.                                                                                            [100%]
1 passed in 0.01s


## Homework 5 - Logistic Regression Pipeline

Create a numeric pipeline (`num_pipeline`) to standardize the numeric columns.

Use a column transformer (`col_transformer`) to apply the numeric pipeline to the numeric columns and the categorical pipeline (`cat_pipeline`) to the categorical columns.

Make a final pipeline (`pipeline`) that applies the column transformer and then a logistic regression model.

In [102]:
%%ipytest

from pandas.testing import assert_frame_equal

def test_num_pipeline_basic():
    data = pd.DataFrame({'a': [1, 2, 3, 4, 5]})
    pipe = num_pipeline
    res = pipe.fit_transform(data)
    print(res)
    assert_frame_equal(res, 
        pd.DataFrame({'a': np.array([-1.41421356, -0.70710678, 0., 0.70710678, 1.41421356])})
    )

def test_cat_pipeline_basic():
    data = pd.DataFrame({'a': ['a', 'b', 'c', 'd', 'e']})
    pipe = cat_pipeline
    res = pipe.fit_transform(data)
    print(res)
    assert_frame_equal(res,
        pd.DataFrame({'a_b': [0, 1, 0, 0, 0], 'a_c': [0, 0, 1, 0, 0], 'a_d': [0, 0, 0, 1, 0], 'a_e': [0, 0, 0, 0, 1]},
        dtype='float64')
    )

def test_col_transformer_basic():
    res = col_transformer.fit_transform(X)
    columns = res.columns
    print(res)
    assert len(columns) > 60

def test_lr_pipeline():
    assert pipeline.score(X, y) > 0.5

....                                                                                         [100%]
4 passed in 0.66s


## Homework 6 - XGBoost take 2

Create an XGBoost model (`xg2`) using all of the columns. What is the score?

In [104]:
%%ipytest

def test_xg2_works():
    assert xg2.score(X, y == '>50K') > 0.7, "The xgboost model does not work"

.                                                                                            [100%]
1 passed in 0.05s
