# Adult dataset
 
This is a standard imbalanced machine learning dataset.

The dataset is credited to Ronny Kohavi and Barry Becker and was drawn from 1994 United States Census Bureau data. The task is to use personal details such as education level to predict whether an individual earns more or less than $50,000 per year.

The Adult dataset is from the Census Bureau and the task is to predict whether a given adult makes more than $50,000 a year based attributes such as education, hours of work per week, etc..

— Scaling Up The Accuracy Of Naive-bayes Classifiers: A Decision-tree Hybrid, 1996.

The dataset provides 14 input variables that are a mixture of categorical, ordinal, and numerical data types. The complete list of variables is as follows:

Age.
Workclass.
Final Weight.
Education.
Education Number of Years.
Marital-status.
Occupation.
Relationship.
Race.
Sex.
Capital-gain.
Capital-loss.
Hours-per-week.
Native-country.

The dataset contains missing values that are marked with a question mark character (?).

Across the predefined train and test files there are 48,842 rows of data in total; 3,620 of them contain missing values, leaving 45,222 complete rows.
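How the question-mark marker is parsed matters in practice. The sketch below uses a small made-up sample (not rows from the real file) in the same comma-plus-space layout, and compares reading it with and without the pandas `skipinitialspace` option. The column subset and the sample values are assumptions for illustration only.

```python
import io
import pandas as pd

# Hypothetical three-row sample, laid out like the real file: ", " between fields
sample = (
    "39, State-gov, Bachelors, <=50K\n"
    "54, ?, HS-grad, >50K\n"
    "28, Private, ?, <=50K\n"
)
columns = ["age", "workclass", "education", "income"]

# Without skipinitialspace the field is " ?" (leading space), which does not match "?"
naive = pd.read_csv(io.StringIO(sample), header=None, names=columns, na_values="?")

# skipinitialspace=True drops the space after each delimiter, so "?" is recognized
clean = pd.read_csv(io.StringIO(sample), header=None, names=columns,
                    na_values="?", skipinitialspace=True)

naive_missing = int(naive.isna().sum().sum())
clean_missing = int(clean.isna().sum().sum())
complete_rows = len(clean.dropna())
print(naive_missing, clean_missing, complete_rows)
```

If the first count is zero while the second is not, a later `dropna()` on the naively parsed frame would silently keep every row.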

There are two class values, '>50K' and '<=50K', meaning it is a binary classification task. The classes are imbalanced, with a skew toward the '<=50K' class label.

'<=50K': majority class, approximately 75%.
'>50K': minority class, approximately 25%.

Given that the class imbalance is not severe and that both class labels are equally important, it is common to use classification accuracy or classification error to report model performance on this dataset.

Using predefined train and test sets, reported good classification error is approximately 14 percent or a classification accuracy of about 86 percent. This might provide a target to aim for when working on this dataset.
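To put the 86 percent target in context, it helps to know what a trivial model would score. The sketch below assumes the class counts printed by the loading cell in this document (training file only) and computes the accuracy of always predicting the majority class; the variable names are illustrative.

```python
# Class counts as printed by the loading cell in this document (training file only)
counts = {"<=50K": 24720, ">50K": 7841}

total = sum(counts.values())

# A majority-class baseline always predicts the most frequent label
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

print("Majority class: %s" % majority_label)
print("Baseline accuracy: %.3f" % baseline_accuracy)
```

Any model worth keeping should clearly beat this baseline, since accuracy alone can look respectable on imbalanced data.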



In [17]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import mglearn
import requests
import io

# load and summarize the dataset
from collections import Counter

# Download the CSV file from GitHub (the URL must point to the raw version of the file)
url = "https://raw.githubusercontent.com/maruseppe/Adult-dataset-IMPL/main/adult.data.txt"
download = requests.get(url).content

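# Read the downloaded content into a pandas DataFrame.
# Note: fields in this file are separated by ", ", so missing values arrive as ' ?'
# (with a leading space) and na_values='?' does not match them. Passing
# skipinitialspace=True would fix that, but it would also change the row count and
# the column names that get_dummies produces later in this notebook.
dataframe = pd.read_csv(io.StringIO(download.decode('utf-8')), header=None, na_values='?', index_col=False,
                        names=['age', 'workclass', 'fnlwgt', 'education', 'education-num',
                               'marital-status', 'occupation', 'relationship', 'race', 'gender',
                               'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
                               'income'])
# drop rows with missing values (as loaded here, no row is dropped: see the note above)
dataframe = dataframe.dropna()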
# summarize the shape of the dataset
print(dataframe.shape)
# summarize the class distribution
target = dataframe.values[:,-1]
counter = Counter(target)
for k,v in counter.items():
    per = v / len(target) * 100
    print('Class=%s, Count=%d, Percentage=%.3f%%' % (k, v, per))

# For illustration purposes, we only select some of the columns
data = dataframe[['age', 'workclass', 'education', 'gender', 'hours-per-week',
'occupation', 'income']]
# IPython.display allows nice output formatting within the Jupyter notebook
display(data.head())

(32561, 15)
Class= <=50K, Count=24720, Percentage=75.919%
Class= >50K, Count=7841, Percentage=24.081%


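|   | age | workclass | education | gender | hours-per-week | occupation | income |
|---|-----|-----------|-----------|--------|----------------|------------|--------|
| 0 | 39 | State-gov | Bachelors | Male | 40 | Adm-clerical | <=50K |
| 1 | 50 | Self-emp-not-inc | Bachelors | Male | 13 | Exec-managerial | <=50K |
| 2 | 38 | Private | HS-grad | Male | 40 | Handlers-cleaners | <=50K |
| 3 | 53 | Private | 11th | Male | 40 | Handlers-cleaners | <=50K |
| 4 | 28 | Private | Bachelors | Female | 40 | Prof-specialty | <=50K |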


# Feature Engineering

## One-Hot-Encoding (Dummy Variables)


By far the most common way to represent categorical variables is using one-hot encoding
or one-out-of-N encoding, also known as dummy variables. The idea behind
dummy variables is to replace a categorical variable with one or more new features
that can have the values 0 and 1. The values 0 and 1 make sense in the formula for
linear binary classification (and for all other models in scikit-learn), and we can
represent any number of categories by introducing one new feature per category.

For example, suppose the workclass feature has the possible values "Government
Employee", "Private Employee", "Self Employed", and "Self Employed Incorporated".
To encode these four possible values, we create four new features, called "Government
Employee", "Private Employee", "Self Employed", and "Self Employed
Incorporated". A feature is 1 if workclass for this person has the corresponding
value and 0 otherwise, so exactly one of the four new features will be 1 for each data
point. This is why this is called one-hot or one-out-of-N encoding.
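The four-value workclass example can be sketched on a toy DataFrame. The rows below are made up for illustration, and `dtype=int` is only used so that the output shows 0 and 1 rather than booleans.

```python
import pandas as pd

# Hypothetical toy data using the four workclass values from the text
toy = pd.DataFrame({
    "workclass": [
        "Government Employee",
        "Private Employee",
        "Self Employed",
        "Self Employed Incorporated",
        "Private Employee",
    ]
})

# One new 0/1 column per category; dtype=int prints 0/1 instead of True/False
encoded = pd.get_dummies(toy, dtype=int)

# Exactly one of the four new features is 1 in every row
ones_per_row = encoded.sum(axis=1)

print(encoded)
print(ones_per_row.tolist())
```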

A good way to check the contents of a column is to use the value_counts function of a
pandas Series (the type of a single column in a DataFrame), which shows the unique
values and how often each appears. The result of the following code shows that there
are exactly two values for gender in this dataset, Male and Female, meaning the data
is already in a good format.


There is a very simple way to encode the data in pandas, using the get_dummies function.
The get_dummies function automatically transforms all columns that have
object type (like strings) or are categorical. It is also important to call get_dummies on a DataFrame containing
both the training and the test data. This ensures that categorical values are represented in the same way in the training set
and the test set.
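The following sketch illustrates why encoding the training and test data separately can go wrong, and one way to encode them together. The two tiny DataFrames are hypothetical; using `pd.concat` with `keys` to split the result again is just one possible approach.

```python
import pandas as pd

# Hypothetical splits: the test set contains a category the training set lacks
train = pd.DataFrame({"workclass": ["Private", "State-gov", "Private"]})
test = pd.DataFrame({"workclass": ["Private", "Self-emp-inc"]})

# Encoding each split on its own yields different column sets
train_cols = list(pd.get_dummies(train).columns)
test_cols = list(pd.get_dummies(test).columns)
print(train_cols)
print(test_cols)

# Encoding the combined data gives both splits the same columns
combined = pd.concat([train, test], keys=["train", "test"])
combined_dummies = pd.get_dummies(combined)
train_enc = combined_dummies.loc["train"]
test_enc = combined_dummies.loc["test"]
print(list(train_enc.columns) == list(test_enc.columns))
```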

Afterwards, we can use the values attribute to convert the data_dummies DataFrame into a
NumPy array, so that a machine learning model can be trained on it. Be careful to separate
the target variable (which is now encoded in two income columns) from the data
before training a model.

In [22]:
print(data.gender.value_counts())

print("Original features:\n", list(data.columns), "\n")
data_dummies = pd.get_dummies(data)
print("Features after get_dummies:\n", list(data_dummies.columns))
data_dummies.head()

features = data_dummies.loc[:, 'age':'occupation_ Transport-moving']
# Extract NumPy arrays
X = features.values
y = data_dummies['income_ >50K'].values
print("X.shape: {} y.shape: {}".format(X.shape, y.shape))

 Male      21790
 Female    10771
Name: gender, dtype: int64
Original features:
 ['age', 'workclass', 'education', 'gender', 'hours-per-week', 'occupation', 'income'] 

Features after get_dummies:
 ['age', 'hours-per-week', 'workclass_ ?', 'workclass_ Federal-gov', 'workclass_ Local-gov', 'workclass_ Never-worked', 'workclass_ Private', 'workclass_ Self-emp-inc', 'workclass_ Self-emp-not-inc', 'workclass_ State-gov', 'workclass_ Without-pay', 'education_ 10th', 'education_ 11th', 'education_ 12th', 'education_ 1st-4th', 'education_ 5th-6th', 'education_ 7th-8th', 'education_ 9th', 'education_ Assoc-acdm', 'education_ Assoc-voc', 'education_ Bachelors', 'education_ Doctorate', 'education_ HS-grad', 'education_ Masters', 'education_ Preschool', 'education_ Prof-school', 'education_ Some-college', 'gender_ Female', 'gender_ Male', 'occupation_ ?', 'occupation_ Adm-clerical', 'occupation_ Armed-Forces', 'occupation_ Craft-repair', 'occupation_ Exec-managerial', 'occupation_ Farming-fishing

# Model 

## Logistic Regression



In [30]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

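# Hold out a test set (by default 25% of the rows); random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# Default settings (max_iter=100) on unscaled features trigger the convergence warning shown below
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
# score() returns the classification accuracy on the held-out data
print("Test score: {:.2f}".format(logreg.score(X_test, y_test)))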

Test score: 0.81


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
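
The warning above suggests two remedies: raising max_iter or scaling the data. A minimal sketch of both, assuming synthetic data in place of the real X and y (the feature ranges, the label rule, and the names X_demo and y_demo are made up for illustration), could look like this:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: two wide-range numeric features and one 0/1 dummy
rng = np.random.RandomState(0)
n = 500
age = rng.randint(17, 90, size=n)
hours = rng.randint(1, 99, size=n)
flag = rng.randint(0, 2, size=n)
X_demo = np.column_stack([age, hours, flag]).astype(float)
# Made-up labeling rule plus noise, only so that there is something to learn
y_demo = (0.03 * age + 0.05 * hours + flag + rng.normal(size=n) > 4.5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# Standardize the features, then fit logistic regression with a higher iteration cap
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)
score = pipe.score(X_te, y_te)
print("Test score: {:.2f}".format(score))
```

Putting the scaler inside a pipeline means it is fit on the training split only, so no information from the test split leaks into preprocessing.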
