<a href="https://colab.research.google.com/github/konradbachusz/tensorflow-notes/blob/master/LinearClassification_Census.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Income Classification using a Linear Classifier

Modified from original code here: https://www.tensorflow.org/tutorials/wide

### Make the notebook compatible with both Python 2 and 3

http://python-future.org/compatible_idioms.html

In [0]:
from __future__ import absolute_import, division, print_function

In [2]:
!pip uninstall tensorflow

Uninstalling tensorflow-2.1.0:
  Would remove:
    /usr/local/bin/estimator_ckpt_converter
    /usr/local/bin/saved_model_cli
    /usr/local/bin/tensorboard
    /usr/local/bin/tf_upgrade_v2
    /usr/local/bin/tflite_convert
    /usr/local/bin/toco
    /usr/local/bin/toco_from_protos
    /usr/local/lib/python2.7/dist-packages/tensorflow-2.1.0.dist-info/*
    /usr/local/lib/python2.7/dist-packages/tensorflow/*
    /usr/local/lib/python2.7/dist-packages/tensorflow_core/*
Proceed (y/n)? y
  Successfully uninstalled tensorflow-2.1.0


In [3]:
!pip install tensorflow==1.5

Collecting tensorflow==1.5
  Downloading https://files.pythonhosted.org/packages/69/6d/09d4fbeedbafbc6768a94901f14ace4153adba4c2e2c6e6f2080f4a5d1a7/tensorflow-1.5.0-cp27-cp27mu-manylinux1_x86_64.whl (44.4MB)
Collecting tensorflow-tensorboard<1.6.0,>=1.5.0
  Downloading https://files.pythonhosted.org/packages/cd/ba/d664f7c27c710063b1cdfa0309db8fba98952e3a1ba1991ed98efffe69ed/tensorflow_tensorboard-1.5.1-py2-none-any.whl (3.0MB)
Collecting bleach==1.5.0
  Downloading https://files.pythonhosted.org/packages/33/70/86c5fec937ea4964184d4d6c4f0b9551564f821e1c3575907639036d9b90/bleach-1.5.0-py2.py3-none-any.whl
Collecting html5lib==0.9999999
  Downloading https://files.pythonhosted.org/packages/ae/ae/bcb60402c60932b32dfaf19bb53870b29eda2cd17551ba5639219fb5ebf9/html5lib-0.9999999.tar.gz (889kB)
Building 

In [4]:
"%tensorflow_version 1.5"

'%tensorflow_version 1.5'

In [0]:
import pandas as pd
from six.moves import urllib
import shutil
import tensorflow as tf

In [6]:
print(tf.__version__)
print(pd.__version__)

1.5.0
0.24.2


### Set up the file names where the training data and the test data are to be stored

Note that you'll have to manually create the "census" directory under the current working directory

In [0]:
TRAIN_FILE_NAME = "census/adult.data"
TEST_FILE_NAME = "census/adult.test"

### Download and store the training and test data from the UCI Machine Learning Repository

There are a whole host of interesting datasets here: https://archive.ics.uci.edu/ml/index.php

In [11]:
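# Create the target directory; -p makes this a no-op if it already exists.
# Each "!" command runs in its own subshell, so "!cd" would not persist
# between lines, and "adult.data" must be a file, not a directory.
!mkdir -p census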


In [12]:
urllib.request.urlretrieve(
        "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
        TRAIN_FILE_NAME)

('census/adult.data', <httplib.HTTPMessage instance at 0x7f96a331e0a0>)

In [13]:
urllib.request.urlretrieve(
        "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test",
        TEST_FILE_NAME)

('census/adult.test', <httplib.HTTPMessage instance at 0x7f96a331e4b0>)

### The columns in the census data

In [0]:
CSV_COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education_num",
    "marital_status", "occupation", "relationship", "race", "gender",
    "capital_gain", "capital_loss", "hours_per_week", "native_country",
    "income_bracket"
]

### Read training data into a dataframe

Sample and explore the data to understand what information is available. This will also be used to set up feature columns which will serve as an input to our linear classifier.

In [0]:
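# NOTE: skiprows=1 skips the first line of the file. adult.data has no header
# line, so this drops the first record; the first row shown by df.head() is
# therefore the second record in the file.
df = pd.read_csv(
      TRAIN_FILE_NAME,
      names=CSV_COLUMNS,
      skipinitialspace=True,
      skiprows=1)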

In [17]:
df.head()

| | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | gender | capital_gain | capital_loss | hours_per_week | native_country | income_bracket |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 1 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 2 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
| 4 | 37 | Private | 284582 | Masters | 14 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 0 | 0 | 40 | United-States | <=50K |


### Choose only those columns which seem relevant to predicting income

* Removed the "fnlwgt" column. It is a sample weight: the number of people the census takers believe that observation represents.
* Removed "capital_gain" and "capital_loss". Continuous, dense columns like these usually work better with neural networks than with a linear model.

In [0]:
TRIMMED_REORDERED_COLUMNS = [
    "age", "workclass", "education", "education_num",
    "marital_status", "relationship", "race", "gender", "occupation", 
    "hours_per_week", "native_country", "income_bracket"
]

In [0]:
df = df[TRIMMED_REORDERED_COLUMNS]

In [20]:
df.head()

| | age | workclass | education | education_num | marital_status | relationship | race | gender | occupation | hours_per_week | native_country | income_bracket |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 50 | Self-emp-not-inc | Bachelors | 13 | Married-civ-spouse | Husband | White | Male | Exec-managerial | 13 | United-States | <=50K |
| 1 | 38 | Private | HS-grad | 9 | Divorced | Not-in-family | White | Male | Handlers-cleaners | 40 | United-States | <=50K |
| 2 | 53 | Private | 11th | 7 | Married-civ-spouse | Husband | Black | Male | Handlers-cleaners | 40 | United-States | <=50K |
| 3 | 28 | Private | Bachelors | 13 | Married-civ-spouse | Wife | Black | Female | Prof-specialty | 40 | Cuba | <=50K |
| 4 | 37 | Private | Masters | 14 | Married-civ-spouse | Wife | White | Female | Exec-managerial | 40 | United-States | <=50K |


### Feature columns with categorical values

Find the unique values in a column and set up a categorical feature for those columns

In [21]:
df['gender'].unique()

array(['Male', 'Female'], dtype=object)

In [22]:
df['race'].unique()

array(['White', 'Black', 'Asian-Pac-Islander', 'Amer-Indian-Eskimo',
       'Other'], dtype=object)

In [23]:
df['education'].unique()

array(['Bachelors', 'HS-grad', '11th', 'Masters', '9th', 'Some-college',
       'Assoc-acdm', 'Assoc-voc', '7th-8th', 'Doctorate', 'Prof-school',
       '5th-6th', '10th', '1st-4th', 'Preschool', '12th'], dtype=object)

In [24]:
df['marital_status'].unique()

array(['Married-civ-spouse', 'Divorced', 'Married-spouse-absent',
       'Never-married', 'Separated', 'Married-AF-spouse', 'Widowed'],
      dtype=object)

In [25]:
df['relationship'].unique()

array(['Husband', 'Not-in-family', 'Wife', 'Own-child', 'Unmarried',
       'Other-relative'], dtype=object)

In [26]:
df['workclass'].unique()

array(['Self-emp-not-inc', 'Private', 'State-gov', 'Federal-gov',
       'Local-gov', '?', 'Self-emp-inc', 'Without-pay', 'Never-worked'],
      dtype=object)

### Set up categorical feature columns

Use *tf.feature_column.categorical_column_with_vocabulary_list* if the categorical columns have a finite set of values that we know in advance

In [0]:
gender = tf.feature_column.categorical_column_with_vocabulary_list(
    "gender", ["Female", "Male"])

race = tf.feature_column.categorical_column_with_vocabulary_list(
    "race", ['White', 'Black', 'Asian-Pac-Islander', 'Amer-Indian-Eskimo', 'Other'])

education = tf.feature_column.categorical_column_with_vocabulary_list(
    "education", [
        "Bachelors", "HS-grad", "11th", "Masters", "9th",
        "Some-college", "Assoc-acdm", "Assoc-voc", "7th-8th",
        "Doctorate", "Prof-school", "5th-6th", "10th", "1st-4th",
        "Preschool", "12th"
    ])

marital_status = tf.feature_column.categorical_column_with_vocabulary_list(
    "marital_status", [
        "Married-civ-spouse", "Divorced", "Married-spouse-absent",
        "Never-married", "Separated", "Married-AF-spouse", "Widowed"
    ])

relationship = tf.feature_column.categorical_column_with_vocabulary_list(
    "relationship", [
        "Husband", "Not-in-family", "Wife", "Own-child", "Unmarried",
        "Other-relative"
    ])

workclass = tf.feature_column.categorical_column_with_vocabulary_list(
    "workclass", [
        "Self-emp-not-inc", "Private", "State-gov", "Federal-gov",
        "Local-gov", "?", "Self-emp-inc", "Without-pay", "Never-worked"
    ])
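
To make the mapping concrete, here is a minimal pure-Python sketch of what a vocabulary-list column does conceptually: each known value maps to its position in the list, and anything else falls back to a default ID (TensorFlow documents -1 as the default when no out-of-vocabulary buckets are requested). The helpers `vocabulary_lookup` and `one_hot` are hypothetical names used only for illustration; this is not TensorFlow's implementation.

```python
def vocabulary_lookup(value, vocabulary, default_value=-1):
    """Return the position of `value` in `vocabulary`, or `default_value` if absent."""
    index_by_value = {v: i for i, v in enumerate(vocabulary)}
    return index_by_value.get(value, default_value)


def one_hot(index, depth):
    """Encode an index as a one-hot list; an out-of-range index gives all zeros."""
    return [1 if i == index else 0 for i in range(depth)]


RACE_VOCABULARY = [
    "White", "Black", "Asian-Pac-Islander", "Amer-Indian-Eskimo", "Other"
]

# A value that is in the vocabulary maps to its list position.
black_id = vocabulary_lookup("Black", RACE_VOCABULARY)
# A value that is not in the vocabulary maps to the default ID.
unknown_id = vocabulary_lookup("Unknown", RACE_VOCABULARY)

print("Black -> %d %s" % (black_id, one_hot(black_id, len(RACE_VOCABULARY))))
print("Unknown -> %d %s" % (unknown_id, one_hot(unknown_id, len(RACE_VOCABULARY))))
```

The order of the vocabulary list only determines which ID each value receives; for a linear model each ID gets its own weight, so the order carries no numeric meaning.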

### Columns with continuous values

Use *tf.feature_column.numeric_column* to set up columns which have values in a numeric range

In [0]:
age = tf.feature_column.numeric_column("age")

education_num = tf.feature_column.numeric_column("education_num")

hours_per_week = tf.feature_column.numeric_column("hours_per_week")

### Bucketed columns

Sometimes the relationship between a continuous feature and the label is not linear. A person's income may grow with age in the early stage of one's career, then the growth may slow at some point, and finally the income decreases after retirement. In this scenario, using the **raw age as a real-valued feature column** might not be a good choice because the model can only learn one of the three cases:

* Income always increases at some rate as age grows (positive correlation),
* Income always decreases at some rate as age grows (negative correlation), or
* Income stays the same regardless of age (no correlation)

If we want to **learn the fine-grained correlation** between income and each age group separately, we can leverage bucketization. 

Bucketization is a process of dividing the entire range of a continuous feature into a set of consecutive bins/buckets, and then converting the original numerical feature into a bucket ID (as a categorical feature) depending on which bucket that value falls into. 

In [0]:
age_buckets = tf.feature_column.bucketized_column(
    age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])
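
The bucket assignment itself can be sketched in plain Python. This assumes the left-inclusive convention described in the TensorFlow documentation for `bucketized_column` (a value equal to a boundary falls into the upper bucket); the helper `bucket_id` is a hypothetical name and the sample ages are made up for illustration.

```python
import bisect

# Same boundaries as the age_buckets column above.
AGE_BOUNDARIES = [18, 25, 30, 35, 40, 45, 50, 55, 60, 65]


def bucket_id(value, boundaries):
    """Return the index of the bucket that `value` falls into.

    bisect_right counts the boundaries that are <= value, so a value equal
    to a boundary is placed in the bucket that starts at that boundary.
    """
    return bisect.bisect_right(boundaries, value)


# 10 boundaries split the number line into 11 buckets (IDs 0 through 10).
num_buckets = len(AGE_BOUNDARIES) + 1

sample_ages = [17, 18, 38, 50, 70]
sample_bucket_ids = [bucket_id(age, AGE_BOUNDARIES) for age in sample_ages]

for age, bid in zip(sample_ages, sample_bucket_ids):
    print("age %d -> bucket %d" % (age, bid))
```

Because each bucket becomes its own categorical value, a linear model can learn a separate weight per age range instead of a single slope for age.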

### Categorical column values might change over time

In [30]:
df['occupation'].unique()

array(['Exec-managerial', 'Handlers-cleaners', 'Prof-specialty',
       'Other-service', 'Adm-clerical', 'Sales', 'Craft-repair',
       'Transport-moving', 'Farming-fishing', 'Machine-op-inspct',
       'Tech-support', '?', 'Protective-serv', 'Armed-Forces',
       'Priv-house-serv'], dtype=object)

In [31]:
df['native_country'].unique()

array(['United-States', 'Cuba', 'Jamaica', 'India', '?', 'Mexico',
       'South', 'Puerto-Rico', 'Honduras', 'England', 'Canada', 'Germany',
       'Iran', 'Philippines', 'Italy', 'Poland', 'Columbia', 'Cambodia',
       'Thailand', 'Ecuador', 'Laos', 'Taiwan', 'Haiti', 'Portugal',
       'Dominican-Republic', 'El-Salvador', 'France', 'Guatemala',
       'China', 'Japan', 'Yugoslavia', 'Peru',
       'Outlying-US(Guam-USVI-etc)', 'Scotland', 'Trinadad&Tobago',
       'Greece', 'Nicaragua', 'Vietnam', 'Hong', 'Ireland', 'Hungary',
       'Holand-Netherlands'], dtype=object)

### Categorical columns with unknown values

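If you don't know the full set of values a categorical column can take in advance, use *tf.feature_column.categorical_column_with_hash_bucket*, which hashes every column value to an integer bucket ID in the range [0, hash_bucket_size).

The IDs are not guaranteed to be unique: two different values can land in the same bucket (a collision). Collisions become less likely as hash_bucket_size grows relative to the number of distinct values.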
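
As a rough illustration of hashing into buckets and of how likely a collision is, the sketch below uses MD5 from the standard library as a stand-in hash (TensorFlow uses its own fingerprint function, so the actual bucket IDs will differ). `hash_bucket` and `collision_probability` are hypothetical helpers, and the probability estimate assumes the hash spreads values uniformly across buckets.

```python
import hashlib


def hash_bucket(value, hash_bucket_size):
    """Map a string to a bucket ID in [0, hash_bucket_size) using MD5 (stand-in hash)."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % hash_bucket_size


def collision_probability(num_values, hash_bucket_size):
    """Probability that at least two of `num_values` distinct values share a bucket.

    Assumes a uniform hash (the classic birthday-problem calculation).
    """
    p_no_collision = 1.0
    for i in range(num_values):
        p_no_collision *= 1.0 - float(i) / hash_bucket_size
    return 1.0 - p_no_collision


bucket = hash_bucket("Exec-managerial", 1000)
print("Exec-managerial -> bucket %d" % bucket)

# Distinct values seen in the training data: 15 occupations, 42 native countries.
p_occupation = collision_probability(15, 1000)
p_country = collision_probability(42, 1000)
print("P(any collision), 15 values in 1000 buckets: %.3f" % p_occupation)
print("P(any collision), 42 values in 1000 buckets: %.3f" % p_country)
```

Under the uniform-hash assumption, any single pair of values rarely collides, but the chance that at least one pair collides somewhere grows quickly with the number of distinct values, which is the reason to size hash_bucket_size generously.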

In [0]:
occupation = tf.feature_column.categorical_column_with_hash_bucket(
    "occupation", hash_bucket_size=1000)

native_country = tf.feature_column.categorical_column_with_hash_bucket(
    "native_country", hash_bucket_size=1000)

### Base columns which use the raw values from the dataset

In [0]:
base_columns = [
    gender, race, marital_status, workclass, occupation,
    native_country, age_buckets, education
]

### Crossed columns express more complex relationships between data

Some relationships between individual features and the output may be hard to define. Two or more features considered together might have a more direct impact on the output. Feature crosses are **engineered features** that allow you to specify this more complex relationship.

Education and occupation considered together can be a better predictor of income than either of them alone.

In [0]:
crossed_columns = [
    tf.feature_column.crossed_column(
        ["education", "occupation"], hash_bucket_size=1000),
    tf.feature_column.crossed_column(
        [age_buckets, "education", "occupation"], hash_bucket_size=1000),
    tf.feature_column.crossed_column(
        ["native_country", "occupation"], hash_bucket_size=1000)
]
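
Conceptually, a feature cross treats each combination of values as one new categorical value and hashes it into a bucket. The sketch below illustrates the idea with MD5 as a stand-in hash; TensorFlow uses its own fingerprint function, so real bucket IDs will differ. `crossed_bucket` and the small value lists are hypothetical and only for illustration.

```python
import hashlib
import itertools


def crossed_bucket(values, hash_bucket_size):
    """Hash a combination of feature values into one bucket ID (stand-in hash)."""
    key = "_X_".join(values)
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % hash_bucket_size


educations = ["Bachelors", "HS-grad", "Masters"]
occupations = ["Exec-managerial", "Sales"]

# Every (education, occupation) pair becomes its own categorical value.
combinations = list(itertools.product(educations, occupations))
cross_ids = {combo: crossed_bucket(combo, 1000) for combo in combinations}

for combo in combinations:
    print("%s x %s -> bucket %d" % (combo[0], combo[1], cross_ids[combo]))
```

A linear model can then learn a separate weight for, say, "Masters x Exec-managerial", which the individual education and occupation weights cannot express on their own.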

### Continuous valued columns

If these columns are dense they are more suitable for deep neural networks

In [0]:
deep_columns = [
    education_num,
    hours_per_week
]

### The input function in an estimator maps the features and the corresponding labels

The TensorFlow helper *tf.estimator.inputs.pandas_input_fn* allows us to specify the feature data as a pandas dataframe and the labels as a pandas Series.

The input function specifies the features and labels for training and evaluation.

In [0]:
def input_fn(file_name, num_epochs, shuffle):
  # Read the raw CSV; skiprows=1 skips the first line of the file
  df = pd.read_csv(
      file_name,
      names=CSV_COLUMNS,
      skipinitialspace=True,
      skiprows=1)
  df = df[TRIMMED_REORDERED_COLUMNS]

  # Remove rows that contain NaN elements
  df = df.dropna(how="any", axis=0)

  # Use numeric labels to represent incomes below (0) and above (1) 50K.
  # The substring test also matches labels with trailing punctuation.
  labels = df["income_bracket"].apply(lambda x: ">50K" in x).astype(int)

  # Build an input function that feeds the dataframe to the estimator in batches
  return tf.estimator.inputs.pandas_input_fn(
      x=df,
      y=labels,
      batch_size=100,
      num_epochs=num_epochs,
      shuffle=shuffle,
      num_threads=5)
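
The label conversion in `input_fn` uses a substring test rather than an exact comparison. A small sketch of why that matters, assuming (as is the case in the UCI copy, though worth verifying on your download) that the labels in adult.test carry a trailing period while those in adult.data do not. `to_binary_label` is a hypothetical helper mirroring the lambda above.

```python
def to_binary_label(income_bracket):
    """Return 1 for incomes above 50K and 0 otherwise, using a substring test."""
    return int(">50K" in income_bracket)


# Label spellings as assumed for the two files.
train_style = ["<=50K", ">50K"]
test_style = ["<=50K.", ">50K."]

train_labels = [to_binary_label(v) for v in train_style]
test_labels = [to_binary_label(v) for v in test_style]

# An exact comparison would mislabel every ">50K." row as 0.
exact_match_labels = [int(v == ">50K") for v in test_style]

print("train labels: %s" % train_labels)
print("test labels: %s" % test_labels)
print("exact-match test labels: %s" % exact_match_labels)
```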

In [0]:
MODEL_DIR = "./linear_classifier"

In [0]:
!mkdir linear_classifier

### Remove the old saved model so we generate entirely new parameters

In [0]:
# ignore_errors=True avoids an exception if the directory does not exist yet
shutil.rmtree(MODEL_DIR, ignore_errors=True)

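### Create the linear classifier from the base columns

Only the base columns are passed in here; for this model, they are the ones which really affect the output. The crossed and continuous columns defined above are not passed to this estimator.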

In [41]:
linear_estimator = tf.estimator.LinearClassifier(
        model_dir=MODEL_DIR, feature_columns=base_columns)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_task_type': 'worker', '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f96a32dfbd0>, '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_service': None, '_num_ps_replicas': 0, '_tf_random_seed': None, '_master': '', '_num_worker_replicas': 1, '_task_id': 0, '_log_step_count_steps': 100, '_model_dir': './linear_classifier', '_save_summary_steps': 100}


In [42]:
linear_estimator.train(
      input_fn=input_fn(TRAIN_FILE_NAME, num_epochs=None, shuffle=True),
      steps=1000)

INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Saving checkpoints for 1 into ./linear_classifier/model.ckpt.
INFO:tensorflow:loss = 69.31474, step = 1
INFO:tensorflow:global_step/sec: 195.673
INFO:tensorflow:loss = 40.331852, step = 101 (0.513 sec)
INFO:tensorflow:global_step/sec: 237.359
INFO:tensorflow:loss = 37.552864, step = 201 (0.422 sec)
INFO:tensorflow:global_step/sec: 237.359
INFO:tensorflow:loss = 32.54548, step = 301 (0.421 sec)
INFO:tensorflow:global_step/sec: 236.392
INFO:tensorflow:loss = 37.57751, step = 401 (0.423 sec)
INFO:tensorflow:global_step/sec: 238.763
INFO:tensorflow:loss = 32.417934, step = 501 (0.419 sec)
INFO:tensorflow:global_step/sec: 233.505
INFO:tensorflow:loss = 30.631018, step = 601 (0.429 sec)
INFO:tensorflow:global_step/sec: 234.785
INFO:tensorflow:loss = 30.466763, step = 701 (0.425 sec)
INFO:tensorflow:global_step/sec: 233.102
INFO:tensorflow:loss = 34.140953, step = 801 (0.429 sec)
INFO:tensorflow:global_step/sec: 232.716
INFO:tensorfl

<tensorflow.python.estimator.canned.linear.LinearClassifier at 0x7f96a32dffd0>

### Evaluate the test data and predict the income levels of the adults

In [43]:
results = linear_estimator.evaluate(
      input_fn=input_fn(TEST_FILE_NAME, num_epochs=1, shuffle=False),
      steps=None)

INFO:tensorflow:Starting evaluation at 2020-04-20-17:05:27
INFO:tensorflow:Restoring parameters from ./linear_classifier/model.ckpt-1000
INFO:tensorflow:Finished evaluation at 2020-04-20-17:05:33
INFO:tensorflow:Saving dict for global step 1000: accuracy = 0.83784777, accuracy_baseline = 0.76377374, auc = 0.88623196, auc_precision_recall = 0.6957885, average_loss = 0.34862748, global_step = 1000, label/mean = 0.23622628, loss = 34.82211, prediction/mean = 0.23022261


In [44]:
for key in sorted(results):
    print("%s: %s" % (key, results[key]))

accuracy: 0.83784777
accuracy_baseline: 0.76377374
auc: 0.88623196
auc_precision_recall: 0.6957885
average_loss: 0.34862748
global_step: 1000
label/mean: 0.23622628
loss: 34.82211
prediction/mean: 0.23022261
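
To put the accuracy in context: `accuracy_baseline` is the accuracy obtained by always predicting the majority class, so it can be recomputed from `label/mean`. The short calculation below uses only the numbers reported above; the variable names are chosen for illustration.

```python
# Values copied from the evaluation output above.
label_mean = 0.23622628   # fraction of test examples labelled >50K
accuracy = 0.83784777

# Always predicting the majority class is right for the larger of the two fractions.
accuracy_baseline = max(label_mean, 1.0 - label_mean)

# Improvement of the trained model over that trivial predictor.
lift = accuracy - accuracy_baseline

print("baseline accuracy: %.4f" % accuracy_baseline)
print("improvement over baseline: %.4f" % lift)
```

The recomputed baseline agrees with the reported `accuracy_baseline` up to float32 rounding, and the model improves on it by roughly 7 percentage points.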
