# Predicting Income From Census Information

In this lesson, we will build a machine learning model on some real data.

We will solve the following problem:

Based on a person's demographic information, can you predict whether they make more or less than 50K a year?

We will use this dataset: https://archive.ics.uci.edu/ml/datasets/Census+Income

It contains information like a person's age, country, education, and more.

# Download the Data

The website provides TWO sets of data. The first, called `adult.data` is what we will use as our training and validation sets.

The second, called `adult.test` is what we will use to evaluate the performance of our machine learning model.


**Exercise 1: Why can we NOT `adult.test` as training data in addition to evaluation data?**

**Exercise 1**: We use our training set to tune our parameters, and our validation set to tune our hyperparameters. If we used the contents of `adult.test` for either of these, then we would no longer have data to test on which we haven't tuned our model specifically to be good at classifying. We need to test on completely new data that we haven't been allowed to train on (otherwise we are simply seeing how well our model fit the training data we gave it).

In [2]:
import requests
import os

def fetch_data(url, filepath):
    r = requests.get(url)
    with open(filepath, 'wb') as fd:
        for chunk in r.iter_content(chunk_size=128):
            fd.write(chunk)

# Download adult.data and adult.test into into /tmp.
if not os.path.exists("tmp"):
    os.makedirs("tmp")
adult_data_path = 'tmp/adult.data'
fetch_data(
    'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data',
    adult_data_path,
)

adult_test_path = 'tmp/adult.test'
fetch_data(
    'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test',
    adult_test_path,
)

# Read the Data

We will read the data using a Python library called Pandas, which can be used to manipulate CSV data
and also interoperates very nicely other Python libraries like numpy (numeric computation), scipy (scientific computation), scikit-learn (machine learning), matplotlib (plotting/visualization).

Basically, we read the CSV into a tabular data structure called a **DataFrame**.

In [3]:
import pandas as pd
import numpy as np

"""
Taken from https://archive.ics.uci.edu/ml/datasets/Census+Income

age: continuous. 
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked. 
fnlwgt: continuous. 
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool. 
education-num: continuous. 
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse. 
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces. 
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried. 
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black. 
sex: Female, Male. 
capital-gain: continuous. 
capital-loss: continuous. 
hours-per-week: continuous. 
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

label_raw: >50K, <=50K. 
"""
col_names = [
    'age',
    'workclass',
    'fnlwgt',
    'education',
    'education_num',
    'marital_status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'capital_gain',
    'capital_loss',
    'hours_per_week',
    'native_country',
    'label',
]


adult_data_raw = pd.read_csv(adult_data_path, header=None, names=col_names)
adult_test_raw = pd.read_csv(adult_test_path, header=None, names=col_names)

adult_data_raw

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,label
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
5,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K
6,49,Private,160187,9th,5,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0,0,16,Jamaica,<=50K
7,52,Self-emp-not-inc,209642,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,45,United-States,>50K
8,31,Private,45781,Masters,14,Never-married,Prof-specialty,Not-in-family,White,Female,14084,0,50,United-States,>50K
9,42,Private,159449,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,5178,0,40,United-States,>50K


# Preprocess the Data

Like all datasets, the census dataset needs some preprocessing to clean it.

In [4]:
def preprocess(df_raw):
    # We re-write the label column so that it is 1 if the person makes > 50K and 0 if they
    # make <= 50K.
    LABEL_POSITIVE = '>50K'
    df = df_raw.copy()
    df.label = df.label.str.contains(LABEL_POSITIVE).astype(int)
    # Let's get rid of the fnlwgt column because it is meaningless.
    df = df.drop('fnlwgt', 1)
    return df

# The test data has junk on the first line.
adult_test_raw_drop = adult_test_raw.drop(0, axis=0)

adult_data = preprocess(adult_data_raw)
adult_test = preprocess(adult_test_raw_drop)

# Explore the Data

Before doing any machine learning, let's compute some basic statistics and visualizations from
the dataset.

After running the two cells below, here are some observations that we can make:

1. There are roughly 3x as many people who make <= 50K than > 50K
2. Almost all the people in the dataset are from the United States
3. People who work more hours per week tend to make more money
4. White people tend to make more money than black people


**
Exercise 2: Run the two cells below and make 4-5 more observations like the ones above.
**

**Exercise 2:**
1. Married people are more likely to make >50k than people who have never married or have seperated
2. Executives and Specialized Professionals tend to make more money than other occupations
3. Most of the people from the data set are white
4. A much higher proportion of men are paid higher than 50k than women
5. People with higher amounts of education tend to get paid more.

In [13]:
"""
Let's start by counting the number of examples (overall, positive, and negative)
and number of features.
"""

(num_examples, num_features) = adult_data.shape
(num_positive_examples, _) = (adult_data[adult_data.label == 1]).shape
(num_negative_examples, _) = (adult_data[adult_data.label == 0]).shape
print(
    """
    Number of Examples: {}
    Number of Features: {}
    Number of Positive Examples: {}
    Number of Negative Examples: {}
    """.format(
        num_examples,
        num_features - 1, # The label is not a feature so subtract if off.
        num_positive_examples,
        num_negative_examples,
    )
)


    Number of Examples: 32561
    Number of Features: 13
    Number of Positive Examples: 7841
    Number of Negative Examples: 24720
    


In [34]:
"""
Let's make some bar charts and histograms to see how different features
are related to the label.
"""
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt

def histograms(df, feature_name):
    colors = ['blue', 'green']
    labels = ['negative', 'positive']
    for i in (0, 1):
        values = df[df.label == i][feature_name]
        plt.hist(values, normed=1, facecolor=colors[i], alpha=0.3, label=labels[i])
    
    plt.legend(prop={'size': 10})

    plt.xlabel(feature_name)
    plt.ylabel('Frequency')
    plt.title('Histogram of feature = {}'.format(feature_name))
    plt.show()

def bar_charts(df, feature_name):
    colors = ['blue', 'green']
    labels = ['negative', 'positive']
    objects = sorted(df[feature_name].unique())
    for i in (0, 1):
        sub_df = df[df.label == i]
        N = sub_df.shape[0]
        values = sub_df.groupby([feature_name])[feature_name].count() / N
        values_as_dict = {k: v for (k, v) in values.items()}
        frequencies = [values_as_dict.get(k, 0) for k in objects]
        y_pos = np.arange(len(objects))
        plt.bar(y_pos, frequencies, facecolor=colors[i], align='center', alpha=0.3, label=labels[i])
        plt.xticks(y_pos, objects, rotation=90)
    plt.legend(prop={'size': 10})
    plt.ylabel('Frequency')
    plt.title('Bar chart of feature = {}'.format(feature_name))
    plt.show()
'''
histograms(adult_data, 'age')
bar_charts(adult_data, 'education')
bar_charts(adult_data, 'education_num')
bar_charts(adult_data, 'marital_status')
bar_charts(adult_data, 'occupation')
bar_charts(adult_data, 'relationship')
bar_charts(adult_data, 'race')
bar_charts(adult_data, 'sex')
histograms(adult_data, 'capital_gain')
histograms(adult_data, 'capital_loss')
histograms(adult_data, 'hours_per_week')
bar_charts(adult_data, 'native_country')
'''
print(len(list(filter(lambda a: True or a == ' Assoc-acdm', adult_data["education"]))))
1067 / 32561

32561


0.032769263843248055

# Prepare the Data for Machine Learning

Now that we've visualized the data, we want to feed it into a machine learning model.

However, recall that supervised machine learning models take the form ($\mathbf{x} \in \mathbb{R}^d$):

$$
f(\mathbf{x}) = y
$$

Where $y \in \{0, 1, ..., K\}$ for multi-classification ($K = 2$ is binary classification) and $y \in \mathbb{R}$ for regression.

**Exercise 3: What kind of problem are we solving here: Classification (in which case, what is $K$?) or regression?**

In order to do machine learning, we need a design matrix. That is, we need to turn our data into a matrix $X \in \mathbb{R}^{m \times d}$ where $m$ is the number of training examples and $d$ is the dimensionality of the feature vector.

The challenge, however, is that many of our feature values are not numbers. For example, one occupation in our dataset is called 'Tech-support', which is clearly not a number.

For continuous features, like  age and capital gain, this is trivial because they are already numbers.

For categorical features, like occupation and race, this is harder. Let's see how we can turn a categorical feature into a number.

## String Indexing and One-Hot Encoding
One approach you might take is **string indexing**. Basically, if there are $C$ categories, then we assign each category a numerical value between $1$ and $C$. For example, suppose there are three countries: "United States", "China", and "Russia". Then we can assign them numbers like "United States" = 1, "China" = 2, and "Russia" = 3.

This is still not right though, because it imposes an order on a categorical feature, which doesn't make sense. Consequently, we will use **one-hot encoding**. Basically, will create $C$ features: "is-united-states", "is-china", and "is-russia", each of which is binary.

With one-hot encoding, we can easily turn a categorical feature into a set of binary features and enable the machine learning model to learn them.

**
Exercise 4: What is the feature index for "education= Bachelors"? What is the feature value for index 75?
**

**Exercise 3:** We're solving binary classification (K=2). Our two categories are $>50$k and $<50$k. The regression equivalent of this problem would be "given these features of a person, predict their yearly income"

**Exercise 4:** The feature index for Education=Bachelors is 12. Feature index 75 is "Occupation = Farming-Fishing"

In [9]:
# First, let's separate the label from the features.
def separate_label(df, label_col='label'):
    labels = df[label_col]
    df = df.drop(label_col, axis=1)
    return (df, labels)

adult_data_features, adult_data_labels = separate_label(adult_data)
adult_test_features, adult_test_labels = separate_label(adult_test)

# Now, let's one-hot encode categorical features
# DictVectorizer does one-hot encoding on all string columns, so we'll use that.
from sklearn.feature_extraction import DictVectorizer

vectorizer = DictVectorizer(sparse=False)
adult_data_matrix_unscaled = vectorizer.fit_transform(adult_data_features.to_dict('records'))
adult_test_matrix_unscaled = vectorizer.transform(adult_test_features.to_dict('records'))

# This maps feature name to feature index.
vectorizer.vocabulary_

print(vectorizer.vocabulary_["education= Bachelors"])
print([key for key in vectorizer.vocabulary_.keys() if vectorizer.vocabulary_[key] == 75])

12
['occupation= Farming-fishing']


## Scaling to Zero-Mean Unit-Variance

If you look at the matrix, you'll notice that there are some features with very large values. For example, the capital gains can be quite large because it is measured in dollars.

Such large-valued features can mess up the gradient descent algorithm because it will take large steps in that dimension, but small steps in other dimensions.

To mitigate this issue, we will normalize all the features so that they have zero mean and standard variance.

In [10]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
adult_data_matrix = scaler.fit_transform(adult_data_matrix_unscaled)
adult_test_matrix = scaler.transform(adult_test_matrix_unscaled)

pd.DataFrame(adult_data_matrix)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,97,98,99,100,101,102,103,104,105,106
0,0.030671,0.148453,-0.216660,-0.171753,-0.193487,-0.116092,-0.072016,-0.10165,-0.142272,-0.126645,...,0.703071,-0.244450,-0.174295,-0.262097,-0.014664,-1.516792,-0.188389,-0.290936,4.907700,-0.02074
1,0.837109,-0.145920,-0.216660,-0.171753,-0.193487,-0.116092,-0.072016,-0.10165,-0.142272,-0.126645,...,0.703071,-0.244450,-0.174295,-0.262097,-0.014664,-1.516792,-0.188389,3.437186,-0.203761,-0.02074
2,-0.042642,-0.145920,-0.216660,-0.171753,-0.193487,-0.116092,-0.072016,-0.10165,-0.142272,-0.126645,...,0.703071,-0.244450,-0.174295,-0.262097,-0.014664,0.659286,-0.188389,-0.290936,-0.203761,-0.02074
3,1.057047,-0.145920,-0.216660,-0.171753,5.168316,-0.116092,-0.072016,-0.10165,-0.142272,-0.126645,...,0.703071,-0.244450,-0.174295,-0.262097,-0.014664,0.659286,-0.188389,-0.290936,-0.203761,-0.02074
4,-0.775768,-0.145920,-0.216660,-0.171753,-0.193487,-0.116092,-0.072016,-0.10165,-0.142272,-0.126645,...,-1.422331,-0.244450,-0.174295,-0.262097,-0.014664,0.659286,-0.188389,-0.290936,-0.203761,-0.02074
5,-0.115955,-0.145920,-0.216660,-0.171753,-0.193487,-0.116092,-0.072016,-0.10165,-0.142272,-0.126645,...,-1.422331,-0.244450,-0.174295,-0.262097,-0.014664,0.659286,-0.188389,-0.290936,-0.203761,-0.02074
6,0.763796,-0.145920,-0.216660,-0.171753,-0.193487,-0.116092,-0.072016,-0.10165,-0.142272,7.896091,...,-1.422331,-0.244450,-0.174295,-0.262097,-0.014664,0.659286,-0.188389,-0.290936,-0.203761,-0.02074
7,0.983734,-0.145920,-0.216660,-0.171753,-0.193487,-0.116092,-0.072016,-0.10165,-0.142272,-0.126645,...,0.703071,-0.244450,-0.174295,-0.262097,-0.014664,-1.516792,-0.188389,3.437186,-0.203761,-0.02074
8,-0.555830,1.761142,-0.216660,-0.171753,-0.193487,-0.116092,-0.072016,-0.10165,-0.142272,-0.126645,...,-1.422331,-0.244450,-0.174295,-0.262097,-0.014664,0.659286,-0.188389,-0.290936,-0.203761,-0.02074
9,0.250608,0.555214,-0.216660,-0.171753,-0.193487,-0.116092,-0.072016,-0.10165,-0.142272,-0.126645,...,0.703071,-0.244450,-0.174295,-0.262097,-0.014664,0.659286,-0.188389,-0.290936,-0.203761,-0.02074


# Machine Learning

Finally, our data is ready for machine learning.

Let's train a logistic regression model and evaluate its performance.

In [11]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(adult_data_matrix, adult_data_labels)
adult_test_predictions = lr.predict(adult_test_matrix)

from sklearn.metrics import accuracy_score, precision_score, recall_score
def evaluate_predictions(y_true,
                         
                         
                         
                         y_pred):
    print('Accuracy = {}'.format(accuracy_score(y_true, y_pred)))
    precision, recall = precision_score(y_true, y_pred), recall_score(y_true, y_pred)
    print('Precision = {}'.format(precision))
    print('Recall = {}'.format(recall))
    print('F1 = {}'.format(2 * (precision * recall) / (precision + recall)))
    
evaluate_predictions(adult_test_labels, adult_test_predictions)

Accuracy = 0.8293102389288127
Precision = 0.876499647141849
Recall = 0.32293291731669266
F1 = 0.47197415922477676


In [12]:
# Let's see which features are the most important.
# Important features will have high magnitude weights.

feature_importance_pairs = []

feature_name_for_index = {v: k for (k, v) in vectorizer.vocabulary_.items()}
for feature_index in range(len(lr.coef_[0])):
    feature_name = feature_name_for_index[feature_index]
    coefficient = abs(lr.coef_[0][feature_index])
    importance = abs(coefficient)
    feature_importance_pairs.append((feature_name, importance))

# Features, sorted by decreasing importance.
sorted(feature_importance_pairs, key=lambda x: -x[1])

[('capital_gain', 2.3461377062763673),
 ('marital_status= Married-civ-spouse', 0.7555458266976693),
 ('marital_status= Never-married', 0.5359247620209571),
 ('education= Preschool', 0.46269955284240605),
 ('education_num', 0.36822744014392195),
 ('hours_per_week', 0.3643604921192923),
 ('age', 0.34252231321901794),
 ('relationship= Own-child', 0.29796179224946356),
 ('occupation= Priv-house-serv', 0.27689479473081763),
 ('relationship= Wife', 0.26030913189920296),
 ('capital_loss', 0.2598518434641501),
 ('occupation= Other-service', 0.25715013243801416),
 ('occupation= Exec-managerial', 0.2520478061413022),
 ('marital_status= Divorced', 0.22709448336110877),
 ('sex= Female', 0.20397258584516992),
 ('sex= Male', 0.20397258584515293),
 ('occupation= Farming-fishing', 0.17776937625673844),
 ('relationship= Not-in-family', 0.17558832729516222),
 ('occupation= Prof-specialty', 0.16447397393636262),
 ('education= Bachelors', 0.14510968086341397),
 ('occupation= Handlers-cleaners', 0.14085397

**Exercise 5: For four of the important features above, come up with an explanation for why they might be important. For example, if age should be an important feature because people gain more experience as they age and tend to make more money.**

**Exercise 6: The education = 'Assoc-acdm' feature seems like it should have high importance because it indicates soembody's education. However, its importance is very low. Why do you think this is?**

**Exercise 5**: 
1. Capital Gain: This one is the most obvious - you make capital gain if when you sell capital (stocks, real estate), you make profit. Firstly, the people who trade in capital are generally richer (Wall Street, Land Developers), and secondly, their business is doing well and making money if they are making capital gain (and not capital loss)

2. marital_status = Married-civ-spouse: Married people tend to be older, and have more secure lives with higher income. They will have had time to advance in rank, and probably live a financially secure life (since they're married, instead of just partners). They also get tax breaks. It's unclear if one's spouse's income is reported under their own income.

3. education=Preschool: This has a high weight - preschoolers don't work, and so therefore make <50k.

4. Hours Per Week - A person who works long hours will, obviously, make more money than someone who works less hours with their salary. Similarly, top executives and CEOs have long work hours, whereas most middle-class workers will work about the same (unless they work overtime, in which case they would make more money).

**Exercise 6**: The feature 'Assoc-acdm' feature's weight could be explained by a few possibilities. First, the amount of data that has this label is very small, only 3% of the input data falls under this category. Secondly, and more importantly, if we look at the bar graphs above, in the data given, of the input vectors with 'education = Assoc-acdm', about the same number were making >50k and <50k (the two bars were basicaly on top of each other). Therefore, this indicates that counter to our belief, at least in this data, this label cannot be used to distinguish between < 50k and > 50k.

# Your Challenge

Now that we've trained a simple machine learning model and measured its performance, your goal is to train a better
machine learning model.

Here are some tricks for you to try:

1. Select hyperparameters for the Logistic Regression (http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
2. Select those hyperparameters with search + cross validation (http://scikit-learn.org/stable/modules/grid_search.html)
3. Use a more powerful classifier like Support Vector Machines (http://scikit-learn.org/stable/modules/svm.html#classification), Decision Trees (http://scikit-learn.org/stable/modules/tree.html), and Gradient Boosted Decision Trees (http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html).
4. Drop features that you don't think will be useful


**Exercise 7: Beat the F1 score above**

In [45]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
param_grid = [
  {'C': [1, 10, 100, 1000], 'kernel': ['linear', 'poly', 'rbf', 'sigmoid', 'precomputed']},
 ]

svm = SVC()
searchedSVM = GridSearchCV(svm, param_grid)
searchedSVM.fit(adult_data_matrix, adult_data_labels)
adult_test_predictions2 = searchedSVM.predict(adult_test_matrix)
evaluate_predictions(adult_test_labels, adult_test_predictions2)

ValueError: X should be a square kernel matrix

In [43]:
evaluate_predictions(adult_test_labels, adult_test_predictions2)

Accuracy = 0.8299858731036177
Precision = 0.8083524027459954
Recall = 0.3673946957878315
F1 = 0.5051841258491241
