# Classification Exercise

We'll be working with some California Census Data, we'll be trying to use various features of an individual to predict what class of income they belogn in (>50k or <=50k). 

Here is some information about the data:

<table>
<thead>
<tr>
<th>Column Name</th>
<th>Type</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>age</td>
<td>Continuous</td>
<td>The age of the individual</td>
</tr>
<tr>
<td>workclass</td>
<td>Categorical</td>
<td>The type of employer the  individual has (government,  military, private, etc.).</td>
</tr>
<tr>
<td>fnlwgt</td>
<td>Continuous</td>
<td>The number of people the census  takers believe that observation  represents (sample weight). This  variable will not be used.</td>
</tr>
<tr>
<td>education</td>
<td>Categorical</td>
<td>The highest level of education  achieved for that individual.</td>
</tr>
<tr>
<td>education_num</td>
<td>Continuous</td>
<td>The highest level of education in  numerical form.</td>
</tr>
<tr>
<td>marital_status</td>
<td>Categorical</td>
<td>Marital status of the individual.</td>
</tr>
<tr>
<td>occupation</td>
<td>Categorical</td>
<td>The occupation of the individual.</td>
</tr>
<tr>
<td>relationship</td>
<td>Categorical</td>
<td>Wife, Own-child, Husband,  Not-in-family, Other-relative,  Unmarried.</td>
</tr>
<tr>
<td>race</td>
<td>Categorical</td>
<td>White, Asian-Pac-Islander,  Amer-Indian-Eskimo, Other, Black.</td>
</tr>
<tr>
<td>gender</td>
<td>Categorical</td>
<td>Female, Male.</td>
</tr>
<tr>
<td>capital_gain</td>
<td>Continuous</td>
<td>Capital gains recorded.</td>
</tr>
<tr>
<td>capital_loss</td>
<td>Continuous</td>
<td>Capital Losses recorded.</td>
</tr>
<tr>
<td>hours_per_week</td>
<td>Continuous</td>
<td>Hours worked per week.</td>
</tr>
<tr>
<td>native_country</td>
<td>Categorical</td>
<td>Country of origin of the  individual.</td>
</tr>
<tr>
<td>income</td>
<td>Categorical</td>
<td>"&gt;50K" or "&lt;=50K", meaning  whether the person makes more  than \$50,000 annually.</td>
</tr>
</tbody>
</table>

## Follow the Directions in Bold. If you get stuck, check out the solutions lecture.

### THE DATA

** Read in the census_data.csv data with pandas**

In [1]:
import pandas as pd

In [2]:
census = pd.read_csv('census_data.csv')

In [3]:
census.head()

Unnamed: 0,age,workclass,education,education_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country,income_bracket
0,39,State-gov,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


** TensorFlow won't be able to understand strings as labels, you'll need to use pandas .apply() method to apply a custom function that converts them to 0s and 1s. This might be hard if you aren't very familiar with pandas, so feel free to take a peek at the solutions for this part.**

** Convert the Label column to 0s and 1s instead of strings.**

In [5]:
census['income_bracket'] = census['income_bracket'].apply(lambda l: (0 if l == ' <=50K' else 1))

In [6]:
census.head()

Unnamed: 0,age,workclass,education,education_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country,income_bracket
0,39,State-gov,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,0
1,50,Self-emp-not-inc,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,0
2,38,Private,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,0
3,53,Private,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,0
4,28,Private,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,0


### Perform a Train Test Split on the Data

In [7]:
from sklearn.model_selection import train_test_split

In [11]:
x_data = census.drop('income_bracket',axis=1)
x_data.head()

Unnamed: 0,age,workclass,education,education_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country
0,39,State-gov,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States
1,50,Self-emp-not-inc,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States
2,38,Private,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States
3,53,Private,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States
4,28,Private,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba


In [13]:
y = census['income_bracket']
y.head()

0    0
1    0
2    0
3    0
4    0
Name: income_bracket, dtype: int64

In [14]:
X_train, X_test, y_train, y_test = train_test_split(x_data, y,test_size=0.3,
                                                    random_state=101)

### Create the Feature Columns for tf.esitmator

** Take note of categorical vs continuous values! **

In [15]:
x_data.columns

Index(['age', 'workclass', 'education', 'education_num', 'marital_status',
       'occupation', 'relationship', 'race', 'gender', 'capital_gain',
       'capital_loss', 'hours_per_week', 'native_country'],
      dtype='object')

** Import Tensorflow **

In [16]:
import tensorflow as tf

** Create the tf.feature_columns for the categorical values. Use vocabulary lists or just use hash buckets. **

In [34]:
workclass = tf.feature_column.categorical_column_with_hash_bucket('workclass',hash_bucket_size=10)
education = tf.feature_column.categorical_column_with_hash_bucket('education',hash_bucket_size=20)
marital_status = tf.feature_column.categorical_column_with_hash_bucket('marital_status',hash_bucket_size=10)
occupation = tf.feature_column.categorical_column_with_hash_bucket('occupation',hash_bucket_size=20)
relationship = tf.feature_column.categorical_column_with_hash_bucket('relationship',hash_bucket_size=20)
race = tf.feature_column.categorical_column_with_hash_bucket('race',hash_bucket_size=10)
gender = tf.feature_column.categorical_column_with_vocabulary_list('gender', ['Male','Female'])
country = tf.feature_column.categorical_column_with_hash_bucket('native_country', hash_bucket_size=50)

In [35]:
len(x_data['native_country'].unique())

42

** Create the continuous feature_columns for the continuous values using numeric_column **

In [38]:
age = tf.feature_column.numeric_column('age')
education_num = tf.feature_column.numeric_column('education_num')
gain = tf.feature_column.numeric_column('capital_gain')
loss = tf.feature_column.numeric_column('capital_loss')
hours = tf.feature_column.numeric_column('hours_per_week')


** Put all these variables into a single list with the variable name feat_cols **

In [39]:
feat_cols = [age, workclass, education, education_num, marital_status, occupation, relationship, race, gender, gain,
            loss, hours, country]

### Create Input Function

** Batch_size is up to you. But do make sure to shuffle!**

In [41]:
input_func = tf.estimator.inputs.pandas_input_fn(x_data, y, batch_size=100,num_epochs=1000,shuffle=True)

#### Create your model with tf.estimator

**Create a LinearClassifier.(If you want to use a DNNClassifier, keep in mind you'll need to create embedded columns out of the cateogrical feature that use strings, check out the previous lecture on this for more info.)**

In [42]:
model = tf.estimator.LinearClassifier(feature_columns=feat_cols)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_save_checkpoints_steps': None, '_model_dir': '/var/folders/d2/0mftypjn6lj4_6tsb6y_1y_r0000gn/T/tmpbjdzoifw', '_session_config': None, '_log_step_count_steps': 100, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_tf_random_seed': 1}


** Train your model on the data, for at least 5000 steps. **

In [43]:
model.train(input_func,steps=10000)

INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Saving checkpoints for 1 into /var/folders/d2/0mftypjn6lj4_6tsb6y_1y_r0000gn/T/tmpbjdzoifw/model.ckpt.
INFO:tensorflow:loss = 69.3147, step = 1
INFO:tensorflow:global_step/sec: 197.509
INFO:tensorflow:loss = 901.35, step = 101 (0.507 sec)
INFO:tensorflow:global_step/sec: 222.414
INFO:tensorflow:loss = 88.7023, step = 201 (0.450 sec)
INFO:tensorflow:global_step/sec: 282.298
INFO:tensorflow:loss = 1856.72, step = 301 (0.356 sec)
INFO:tensorflow:global_step/sec: 235.247
INFO:tensorflow:loss = 202.503, step = 401 (0.426 sec)
INFO:tensorflow:global_step/sec: 217.832
INFO:tensorflow:loss = 69.2856, step = 501 (0.457 sec)
INFO:tensorflow:global_step/sec: 226.311
INFO:tensorflow:loss = 51.2451, step = 601 (0.442 sec)
INFO:tensorflow:global_step/sec: 228.171
INFO:tensorflow:loss = 97.7285, step = 701 (0.438 sec)
INFO:tensorflow:global_step/sec: 227.929
INFO:tensorflow:loss = 30.5383, step = 801 (0.440 sec)
INFO:tensorflow:global_step/s

INFO:tensorflow:global_step/sec: 223.803
INFO:tensorflow:loss = 142.423, step = 8401 (0.444 sec)
INFO:tensorflow:global_step/sec: 234.806
INFO:tensorflow:loss = 47.2486, step = 8501 (0.426 sec)
INFO:tensorflow:global_step/sec: 229.258
INFO:tensorflow:loss = 43.7737, step = 8601 (0.436 sec)
INFO:tensorflow:global_step/sec: 229.638
INFO:tensorflow:loss = 48.3143, step = 8701 (0.435 sec)
INFO:tensorflow:global_step/sec: 235.08
INFO:tensorflow:loss = 51.2621, step = 8801 (0.427 sec)
INFO:tensorflow:global_step/sec: 225.855
INFO:tensorflow:loss = 35.7998, step = 8901 (0.441 sec)
INFO:tensorflow:global_step/sec: 227.234
INFO:tensorflow:loss = 32.0009, step = 9001 (0.440 sec)
INFO:tensorflow:global_step/sec: 229.413
INFO:tensorflow:loss = 24.9169, step = 9101 (0.436 sec)
INFO:tensorflow:global_step/sec: 228.505
INFO:tensorflow:loss = 30.8114, step = 9201 (0.438 sec)
INFO:tensorflow:global_step/sec: 221.324
INFO:tensorflow:loss = 22.5735, step = 9301 (0.452 sec)
INFO:tensorflow:global_step/sec

<tensorflow.python.estimator.canned.linear.LinearClassifier at 0x10e7ef048>

### Evaluation

** Create a prediction input function. Remember to only supprt X_test data and keep shuffle=False. **

In [44]:
pred_fn = tf.estimator.inputs.pandas_input_fn(X_test, batch_size=100,shuffle=False

** Use model.predict() and pass in your input function. This will produce a generator of predictions, which you can then transform into a list, with list() **

In [45]:
pred_gen = model.predict(input_fn=pred_fn)

** Each item in your list will look like this: **

In [46]:
predictions = list(pred_gen)

INFO:tensorflow:Restoring parameters from /var/folders/d2/0mftypjn6lj4_6tsb6y_1y_r0000gn/T/tmpbjdzoifw/model.ckpt-10000


** Create a list of only the class_ids key values from the prediction list of dictionaries, these are the predictions you will use to compare against the real y_test values. **

In [51]:
final_preds = []
for pred in predictions:
    final_preds.append(pred['class_ids'][0])

In [52]:
final_preds

[0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 1,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 1,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 1,
 0,
 1,
 1,
 0,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,


** Import classification_report from sklearn.metrics and then see if you can figure out how to use it to easily get a full report of your model's performance on the test data. **

In [53]:
from sklearn.metrics import classification_report

In [55]:
print(classification_report(y_test,final_preds))

             precision    recall  f1-score   support

          0       0.86      0.94      0.90      7436
          1       0.74      0.53      0.62      2333

avg / total       0.83      0.84      0.83      9769



# Great Job!