# Classification Exercise

We'll be working with some California Census Data, we'll be trying to use various features of an individual to predict what class of income they belogn in (>50k or <=50k). 

Here is some information about the data:

<table>
<thead>
<tr>
<th>Column Name</th>
<th>Type</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>age</td>
<td>Continuous</td>
<td>The age of the individual</td>
</tr>
<tr>
<td>workclass</td>
<td>Categorical</td>
<td>The type of employer the  individual has (government,  military, private, etc.).</td>
</tr>
<tr>
<td>fnlwgt</td>
<td>Continuous</td>
<td>The number of people the census  takers believe that observation  represents (sample weight). This  variable will not be used.</td>
</tr>
<tr>
<td>education</td>
<td>Categorical</td>
<td>The highest level of education  achieved for that individual.</td>
</tr>
<tr>
<td>education_num</td>
<td>Continuous</td>
<td>The highest level of education in  numerical form.</td>
</tr>
<tr>
<td>marital_status</td>
<td>Categorical</td>
<td>Marital status of the individual.</td>
</tr>
<tr>
<td>occupation</td>
<td>Categorical</td>
<td>The occupation of the individual.</td>
</tr>
<tr>
<td>relationship</td>
<td>Categorical</td>
<td>Wife, Own-child, Husband,  Not-in-family, Other-relative,  Unmarried.</td>
</tr>
<tr>
<td>race</td>
<td>Categorical</td>
<td>White, Asian-Pac-Islander,  Amer-Indian-Eskimo, Other, Black.</td>
</tr>
<tr>
<td>gender</td>
<td>Categorical</td>
<td>Female, Male.</td>
</tr>
<tr>
<td>capital_gain</td>
<td>Continuous</td>
<td>Capital gains recorded.</td>
</tr>
<tr>
<td>capital_loss</td>
<td>Continuous</td>
<td>Capital Losses recorded.</td>
</tr>
<tr>
<td>hours_per_week</td>
<td>Continuous</td>
<td>Hours worked per week.</td>
</tr>
<tr>
<td>native_country</td>
<td>Categorical</td>
<td>Country of origin of the  individual.</td>
</tr>
<tr>
<td>income</td>
<td>Categorical</td>
<td>"&gt;50K" or "&lt;=50K", meaning  whether the person makes more  than \$50,000 annually.</td>
</tr>
</tbody>
</table>

### THE DATA

In [1]:
import pandas as pd

In [2]:
inc = pd.read_csv('census_data.csv')

In [3]:
inc.head()

Unnamed: 0,age,workclass,education,education_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country,income_bracket
0,39,State-gov,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


**TensorFlow won't be able to understand strings as labels, you'll need to use pandas .apply() method to apply a custom function that converts them to 0s and 1s.**

**Convert the Label column to 0s and 1s instead of strings.**

In [4]:
inc['income_bracket'].unique()

array([' <=50K', ' >50K'], dtype=object)

In [5]:
def label_enc(string):
    if string == ' <=50K':
        return 0
    else:
        return 1

In [6]:
inc['income_bracket'] = inc['income_bracket'].apply(label_enc)

In [7]:
inc.head()

Unnamed: 0,age,workclass,education,education_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country,income_bracket
0,39,State-gov,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,0
1,50,Self-emp-not-inc,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,0
2,38,Private,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,0
3,53,Private,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,0
4,28,Private,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,0


In [8]:
inc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 14 columns):
age               32561 non-null int64
workclass         32561 non-null object
education         32561 non-null object
education_num     32561 non-null int64
marital_status    32561 non-null object
occupation        32561 non-null object
relationship      32561 non-null object
race              32561 non-null object
gender            32561 non-null object
capital_gain      32561 non-null int64
capital_loss      32561 non-null int64
hours_per_week    32561 non-null int64
native_country    32561 non-null object
income_bracket    32561 non-null int64
dtypes: int64(6), object(8)
memory usage: 3.5+ MB


### Perform a Train Test Split on the Data

In [9]:
from sklearn.model_selection import train_test_split

In [10]:
X_train, X_test, y_train,y_test = train_test_split(inc.drop(['income_bracket'],axis=1),inc['income_bracket'],random_state = 101, test_size = 0.3)

### Create the Feature Columns for tf.esitmator

**Take note of categorical vs continuous values!**

In [11]:
import tensorflow as tf

In [12]:
inc.columns

Index(['age', 'workclass', 'education', 'education_num', 'marital_status',
       'occupation', 'relationship', 'race', 'gender', 'capital_gain',
       'capital_loss', 'hours_per_week', 'native_country', 'income_bracket'],
      dtype='object')

**Create the tf.feature_columns for the categorical values. Use vocabulary lists or just use hash buckets.**

In [13]:
workclass = tf.feature_column.categorical_column_with_vocabulary_list('workclass',inc['workclass'].unique())
education = tf.feature_column.categorical_column_with_vocabulary_list('education',inc['education'].unique())
maritalstatus = tf.feature_column.categorical_column_with_vocabulary_list('marital_status',inc['marital_status'].unique())
occupation = tf.feature_column.categorical_column_with_vocabulary_list('occupation',inc['occupation'].unique())
relationship = tf.feature_column.categorical_column_with_vocabulary_list('relationship',inc['relationship'].unique())
race = tf.feature_column.categorical_column_with_vocabulary_list('race',inc['race'].unique())
gender = tf.feature_column.categorical_column_with_vocabulary_list('gender',inc['gender'].unique())
nativecountry = tf.feature_column.categorical_column_with_vocabulary_list('native_country',inc['native_country'].unique())


**Create the continuous feature_columns for the continuous values using numeric_column**

In [14]:
age = tf.feature_column.numeric_column('age')
educationnum = tf.feature_column.numeric_column('education_num')
capital_gain = tf.feature_column.numeric_column('capital_gain')
capital_loss = tf.feature_column.numeric_column('capital_loss')
hoursperweek = tf.feature_column.numeric_column('hours_per_week')

**Put all these variables into a single list.**

In [15]:
feat_cols = [workclass,education,maritalstatus,occupation,relationship,race,gender,nativecountry,
            age,educationnum,capital_gain,capital_loss,hoursperweek]

### Create Input Function


In [16]:
input_func = tf.estimator.inputs.pandas_input_fn(X_train,y_train,batch_size=64,num_epochs=1000,shuffle = True)

#### Create your model with tf.estimator

**Create a LinearClassifier.(If you want to use a DNNClassifier, keep in mind you'll need to create embedded columns out of the cateogrical feature that use strings)**

In [17]:
model = tf.estimator.LinearClassifier(feature_columns=feat_cols,n_classes=2)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': '/tmp/tmpqhptgnpb', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fa9ab1fada0>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}


**Train your model on the data**

In [18]:
model.train(input_func,steps=5000)

Instructions for updating:
To construct input pipelines, use the `tf.data` module.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
INFO:tensorflow:Saving checkpoints for 0 into /tmp/tmpqhptgnpb/model.ckpt.
INFO:tensorflow:loss = 44.36142, step = 1
INFO:tensorflow:global_step/sec: 178.541
INFO:tensorflow:loss = 179.31288, step = 101 (0.566 sec)
INFO:tensorflow:global_step/sec: 300.191
INFO:tensorflow:loss = 416.16324, step = 201 (0.330 sec)
INFO:tensorflow:global_step/sec: 304.008
INFO:tensorflow:loss = 68.63394, step = 301 (0.331 sec)
INFO:tensorflow:global_step/sec: 269.834
INFO:tensorflow:loss = 324.6959, step = 401 (0.371 sec)
INFO:te

<tensorflow.python.estimator.canned.linear.LinearClassifier at 0x7fa9ab1fa940>

### Evaluation

**Create a prediction input function**

In [19]:
pred_input_func = tf.estimator.inputs.pandas_input_fn(X_test,shuffle=False)

** Use model.predict() and pass in your input function. This will produce a generator of predictions, which you can then transform into a list, with list() **

In [20]:
pred = list(model.predict(pred_input_func))

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmpqhptgnpb/model.ckpt-5000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.


In [21]:
pred

[{'logits': array([-1.1775234], dtype=float32),
  'logistic': array([0.23549777], dtype=float32),
  'probabilities': array([0.76450217, 0.23549779], dtype=float32),
  'class_ids': array([0]),
  'classes': array([b'0'], dtype=object)},
 {'logits': array([-9.398579], dtype=float32),
  'logistic': array([8.283487e-05], dtype=float32),
  'probabilities': array([9.9991715e-01, 8.2834864e-05], dtype=float32),
  'class_ids': array([0]),
  'classes': array([b'0'], dtype=object)},
 {'logits': array([-0.50953954], dtype=float32),
  'logistic': array([0.37530148], dtype=float32),
  'probabilities': array([0.6246985 , 0.37530148], dtype=float32),
  'class_ids': array([0]),
  'classes': array([b'0'], dtype=object)},
 {'logits': array([-11.138866], dtype=float32),
  'logistic': array([1.4536017e-05], dtype=float32),
  'probabilities': array([9.9998546e-01, 1.4536019e-05], dtype=float32),
  'class_ids': array([0]),
  'classes': array([b'0'], dtype=object)},
 {'logits': array([41.283638], dtype=float3

**Create a list of only the class_ids key values from the prediction list of dictionaries, these are the predictions you will use to compare against the real y_test values.**

In [22]:
classids = []
for i in pred:
    classids.append(int(i['class_ids']))
    

In [23]:
classids

[0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 1,
 1,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 1,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 1,
 0,
 1,
 1,
 0,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,


In [24]:
from sklearn.metrics import classification_report

In [25]:
print(classification_report(y_test,classids))

              precision    recall  f1-score   support

           0       0.89      0.89      0.89      7436
           1       0.65      0.64      0.64      2333

   micro avg       0.83      0.83      0.83      9769
   macro avg       0.77      0.76      0.77      9769
weighted avg       0.83      0.83      0.83      9769



## Using DNNClassifier

In [26]:
workclass = tf.feature_column.embedding_column(workclass,dimension=len(inc['workclass'].unique()))
education = tf.feature_column.embedding_column(education,dimension=len(inc['education'].unique()))
maritalstatus = tf.feature_column.embedding_column(maritalstatus,len(inc['marital_status'].unique()))
occupation = tf.feature_column.embedding_column(occupation,dimension=len(inc['occupation'].unique()))
relationship = tf.feature_column.embedding_column(relationship,dimension=len(inc['relationship'].unique()))
race = tf.feature_column.embedding_column(race,dimension=len(inc['race'].unique()))
gender = tf.feature_column.embedding_column(gender,dimension=len(inc['gender'].unique()))
nativecountry = tf.feature_column.embedding_column(nativecountry,dimension=len(inc['native_country'].unique()))


In [27]:
feat_cols = [workclass,education,maritalstatus,occupation,relationship,race,gender,nativecountry,
            age,educationnum,capital_gain,capital_loss,hoursperweek]

In [28]:
model1 = tf.estimator.DNNClassifier(hidden_units=[100,200,100],feature_columns=feat_cols)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': '/tmp/tmpu0w7j1hf', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fa9aaf26c18>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}


In [29]:
model1.train(input_func,steps=5000)

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into /tmp/tmpu0w7j1hf/model.ckpt.
INFO:tensorflow:loss = 547.7559, step = 1
INFO:tensorflow:global_step/sec: 122.993
INFO:tensorflow:loss = 56.15202, step = 101 (0.815 sec)
INFO:tensorflow:global_step/sec: 201.83
INFO:tensorflow:loss = 15.421727, step = 201 (0.500 sec)
INFO:tensorflow:global_step/sec: 178.443
INFO:tensorflow:loss = 23.862686, step = 301 (0.556 sec)
INFO:tensorflow:global_step/sec: 217.027
INFO:tensorflow:loss = 27.26506, step = 401 (0.462 sec)
INFO:tensorflow:global_step/sec: 214.224
INFO:tensorflow:loss = 28.002129, step = 501 (0.465 sec)
INFO:tensorflow:global_step/sec: 245.862
INFO:tensorflow:loss = 14.085717, step = 601 (0.406 sec)
INFO:tensorflow:global_step/sec: 205.382
INFO:tensorflow:loss

<tensorflow.python.estimator.canned.dnn.DNNClassifier at 0x7fa9aaf26550>

In [30]:
pred = list(model.predict(pred_input_func))
classids = []
for i in pred:
    classids.append(int(i['class_ids']))

print(classification_report(y_test,classids))   

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmpqhptgnpb/model.ckpt-5000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
              precision    recall  f1-score   support

           0       0.89      0.89      0.89      7436
           1       0.65      0.64      0.64      2333

   micro avg       0.83      0.83      0.83      9769
   macro avg       0.77      0.76      0.77      9769
weighted avg       0.83      0.83      0.83      9769



**Both the DNNClassifier as well as the LinearClassification classes worked the same.**