# Income Prediction

We'll be working with California census data, using various features of an individual to predict which income class they belong to (>50K or <=50K).

Here is some information about the data:

<table>
<thead>
<tr>
<th>Column Name</th>
<th>Type</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>age</td>
<td>Continuous</td>
<td>The age of the individual</td>
</tr>
<tr>
<td>workclass</td>
<td>Categorical</td>
<td>The type of employer the individual has (government, military, private, etc.).</td>
</tr>
<tr>
<td>fnlwgt</td>
<td>Continuous</td>
<td>The number of people the census takers believe that observation represents (sample weight). This variable will not be used.</td>
</tr>
<tr>
<td>education</td>
<td>Categorical</td>
<td>The highest level of education achieved by that individual.</td>
</tr>
<tr>
<td>education_num</td>
<td>Continuous</td>
<td>The highest level of education in numerical form.</td>
</tr>
<tr>
<td>marital_status</td>
<td>Categorical</td>
<td>Marital status of the individual.</td>
</tr>
<tr>
<td>occupation</td>
<td>Categorical</td>
<td>The occupation of the individual.</td>
</tr>
<tr>
<td>relationship</td>
<td>Categorical</td>
<td>Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.</td>
</tr>
<tr>
<td>race</td>
<td>Categorical</td>
<td>White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.</td>
</tr>
<tr>
<td>gender</td>
<td>Categorical</td>
<td>Female, Male.</td>
</tr>
<tr>
<td>capital_gain</td>
<td>Continuous</td>
<td>Capital gains recorded.</td>
</tr>
<tr>
<td>capital_loss</td>
<td>Continuous</td>
<td>Capital losses recorded.</td>
</tr>
<tr>
<td>hours_per_week</td>
<td>Continuous</td>
<td>Hours worked per week.</td>
</tr>
<tr>
<td>native_country</td>
<td>Categorical</td>
<td>Country of origin of the individual.</td>
</tr>
<tr>
<td>income_bracket</td>
<td>Categorical</td>
<td>"&gt;50K" or "&lt;=50K", indicating whether the person makes more than \$50,000 annually.</td>
</tr>
</tbody>
</table>

### The Data

In [1]:
import pandas as pd

In [2]:
data = pd.read_csv('census_data.csv')

In [3]:
data.head()

Unnamed: 0,age,workclass,education,education_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country,income_bracket
0,39,State-gov,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K




**Convert the label column (`income_bracket`) to 0s and 1s instead of strings.**

In [4]:
data.income_bracket.describe()

count      32561
unique         2
top        <=50K
freq       24720
Name: income_bracket, dtype: object

In [5]:
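def label_fix(label):
    # The raw labels appear to carry a leading space, hence ' <=50K'.
    # Any other value (i.e. ' >50K') is mapped to 1.
    if label == ' <=50K':
        return 0
    else:
        return 1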

In [6]:
data['income_bracket'] = data['income_bracket'].apply(label_fix)

In [7]:
data.income_bracket.describe()

count    32561.000000
mean         0.240810
std          0.427581
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max          1.000000
Name: income_bracket, dtype: float64
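The same conversion can be sketched as a single vectorized expression. This is only an illustration on a hypothetical three-row `raw` Series, assuming the raw labels carry a leading space as the comparison in `label_fix` suggests; the name `encoded` is also made up for this sketch.

```python
import pandas as pd

# Hypothetical sample mimicking the raw labels (note the leading space).
raw = pd.Series([' <=50K', ' >50K', ' <=50K'], name='income_bracket')

# Strip surrounding whitespace, then mark the '>50K' class as 1, all else as 0.
encoded = (raw.str.strip() == '>50K').astype(int)

print(encoded.tolist())
```

One difference from `label_fix`: this version maps anything that is not `>50K` to 0, while `label_fix` maps anything that is not ` <=50K` to 1. On clean two-class data the results agree.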

### Perform a Train Test Split on the Data

In [8]:
from sklearn.model_selection import train_test_split

In [9]:
X_data = data.drop('income_bracket', axis=1)
y_label = data['income_bracket']
X_train, X_test, y_train, y_test = train_test_split(X_data, y_label, test_size=0.3, random_state=101)
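As a quick sanity check on the split, the expected row counts can be worked out by hand. This is a rough sketch, assuming scikit-learn rounds a fractional `test_size` up and gives the remaining rows to the training set; `n_rows`, `n_test`, and `n_train` are names introduced here for illustration.

```python
import math

n_rows = 32561     # row count reported by describe() earlier
test_size = 0.3    # fraction passed to train_test_split

# Assumption: the test fraction is rounded up, the rest goes to training.
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test

print(n_train, n_test)
```

The test count should line up with the total support shown in the classification report at the end of the notebook.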

### Create the Feature Columns 



In [10]:
data.columns
# categorical: 'workclass', 'education', 'marital_status', 'occupation',
# 'relationship', 'race', 'gender', 'native_country'

# continuous: 'age', 'education_num', 'capital_gain', 'capital_loss', 'hours_per_week'

Index(['age', 'workclass', 'education', 'education_num', 'marital_status',
       'occupation', 'relationship', 'race', 'gender', 'capital_gain',
       'capital_loss', 'hours_per_week', 'native_country', 'income_bracket'],
      dtype='object')

**Import TensorFlow**

In [11]:
import tensorflow as tf



In [12]:
workclass = tf.feature_column.categorical_column_with_hash_bucket('workclass', hash_bucket_size=1000)
education = tf.feature_column.categorical_column_with_hash_bucket('education', hash_bucket_size=1000)
marital_status = tf.feature_column.categorical_column_with_hash_bucket('marital_status', hash_bucket_size=1000)
occupation = tf.feature_column.categorical_column_with_hash_bucket('occupation', hash_bucket_size=1000)
relationship = tf.feature_column.categorical_column_with_hash_bucket('relationship', hash_bucket_size=1000)
race = tf.feature_column.categorical_column_with_hash_bucket('race', hash_bucket_size=1000)
gender = tf.feature_column.categorical_column_with_vocabulary_list('gender', ['Female', 'Male'])
native_country = tf.feature_column.categorical_column_with_hash_bucket('native_country', hash_bucket_size=1000)
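A hash-bucket column needs no vocabulary: each string is hashed and reduced to one of `hash_bucket_size` integer ids, so two different categories can occasionally share a bucket. The sketch below illustrates the idea only. It uses MD5 from the standard library as a stand-in, which is not the hash function TensorFlow uses, and `bucket_id` is a hypothetical helper.

```python
import hashlib

def bucket_id(value: str, num_buckets: int = 1000) -> int:
    """Map a string to a stable bucket index in [0, num_buckets)."""
    # MD5 is an illustrative stand-in, not TensorFlow's actual hash.
    digest = hashlib.md5(value.encode('utf-8')).hexdigest()
    return int(digest, 16) % num_buckets

for category in ['Private', 'State-gov', 'Self-emp-not-inc']:
    print(category, bucket_id(category))
```

With only a handful of distinct values per column, 1000 buckets makes collisions unlikely. A column with a known, small set of values (such as `gender`) can use a vocabulary list instead.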

In [13]:
age = tf.feature_column.numeric_column('age')
education_num = tf.feature_column.numeric_column('education_num')
capital_gain = tf.feature_column.numeric_column('capital_gain')
capital_loss = tf.feature_column.numeric_column('capital_loss')
hours_per_week = tf.feature_column.numeric_column('hours_per_week')

In [14]:
feat_cols = [workclass, education, marital_status, occupation, relationship,
             race, gender, native_country, age, education_num, capital_gain,
             capital_loss, hours_per_week]

### Create Input Function


In [15]:
input_func = tf.estimator.inputs.pandas_input_fn(x=X_train, y=y_train, batch_size=1000, num_epochs=None, shuffle=True)
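To make the arguments concrete, here is a simplified pure-Python sketch of what an input function with `num_epochs=None` and `shuffle=True` does conceptually: it keeps reshuffling the rows and handing out batches until the caller stops asking. `batch_stream` is a hypothetical helper, and the real `pandas_input_fn` is queue-based and may behave differently at epoch boundaries.

```python
import random
from itertools import islice

def batch_stream(rows, batch_size, num_epochs=None, shuffle=True, seed=0):
    """Yield lists of rows; num_epochs=None repeats the data indefinitely."""
    rng = random.Random(seed)
    epoch = 0
    while num_epochs is None or epoch < num_epochs:
        order = list(rows)
        if shuffle:
            rng.shuffle(order)          # new order for every pass
        for start in range(0, len(order), batch_size):
            yield order[start:start + batch_size]
        epoch += 1

# Take five batches from an endless stream over ten hypothetical rows.
batches = list(islice(batch_stream(range(10), batch_size=4), 5))
print([len(b) for b in batches])
```

Because the stream never ends on its own, the number of training steps (not the data) decides when training stops.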

### Create your model 



In [16]:
model = tf.estimator.LinearClassifier(feature_columns=feat_cols)


INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': '/var/folders/2v/xzpgsprx4dgft47fp23gnl_r0000gn/T/tmp2x2nfx2j', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x10c536470>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}


**Train your model**

In [17]:
model.train(input_fn=input_func, steps=5000)

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 1 into /var/folders/2v/xzpgsprx4dgft47fp23gnl_r0000gn/T/tmp2x2nfx2j/model.ckpt.
INFO:tensorflow:loss = 693.1464, step = 1
INFO:tensorflow:global_step/sec: 40.2943
INFO:tensorflow:loss = 464.79443, step = 101 (2.490 sec)
INFO:tensorflow:global_step/sec: 37.1132
INFO:tensorflow:loss = 2064.2388, step = 201 (2.688 sec)
INFO:tensorflow:global_step/sec: 40.5909
INFO:tensorflow:loss = 6154.034, step = 301 (2.463 sec)
INFO:tensorflow:global_step/sec: 44.6772
INFO:tensorflow:loss = 3241.856, step = 401 (2.238 sec)
INFO:tensorflow:global_step/sec: 46.1439
INFO:tensorflow:loss = 555.3562, step = 501 (2.167 sec)
INFO:tensorflow:global_step/sec: 39.716
INFO:tensorflow:loss = 422.79926, step = 601 (2.519 sec)
INFO:tensorflow:gl

<tensorflow.python.estimator.canned.linear.LinearClassifier at 0x1a1e2b72b0>
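The first logged loss, 693.1464, has a simple explanation if two assumptions hold: the weights start at zero, and the logged loss is the sum (not the mean) of the per-row log loss over the batch. Under those assumptions every row is predicted with probability 0.5 and contributes log(2), as this small check shows.

```python
import math

batch_size = 1000   # batch size passed to the input function

# With all weights at zero the predicted probability is 0.5 for every row,
# so each row contributes -log(0.5) = log(2) to the summed log loss.
initial_loss = batch_size * math.log(2)

print(round(initial_loss, 3))
```

The small gap to the logged value is consistent with float32 arithmetic. It also means the loss values in the log scale with the batch size and should not be compared across runs that use different batch sizes.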

### Evaluation


In [18]:
pred_input_func = tf.estimator.inputs.pandas_input_fn(x=X_test, batch_size=len(X_test), shuffle=False)

**Prediction**

In [19]:
predictions = list(model.predict(input_fn=pred_input_func))

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /var/folders/2v/xzpgsprx4dgft47fp23gnl_r0000gn/T/tmp2x2nfx2j/model.ckpt-5000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.


In [20]:
predictions[0]

{'class_ids': array([0]),
 'classes': array([b'0'], dtype=object),
 'logistic': array([0.28895798], dtype=float32),
 'logits': array([-0.9004502], dtype=float32),
 'probabilities': array([0.71104205, 0.28895798], dtype=float32)}
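The fields in this dictionary are related: `logistic` is the sigmoid of `logits`, `probabilities` holds the pair (1 - p, p), and `class_ids` is the more probable class. A short check using the numbers above (the 0.5 decision threshold is an assumption here, and `sigmoid` is a helper written for this sketch):

```python
import math

def sigmoid(z: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-z))

logit = -0.9004502             # 'logits' value from the output above
p_positive = sigmoid(logit)    # should match 'logistic'
p_negative = 1.0 - p_positive  # first entry of 'probabilities'

# Assumed decision rule: predict class 1 when p_positive exceeds 0.5.
predicted_class = int(p_positive > 0.5)
print(p_positive, p_negative, predicted_class)
```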

In [21]:
# Collect the predicted class id (0 or 1) from each prediction dict.
final_preds = [pred['class_ids'][0] for pred in predictions]

In [22]:
final_preds[:10]

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

In [23]:
results = pd.DataFrame({'Predicted >50K': final_preds, 'Actual >50K': y_test})

In [24]:
results

Unnamed: 0,Predicted >50K,Actual >50K
22357,0,0
26009,0,0
20734,0,0
17695,0,0
27908,0,1
27225,0,0
13108,0,0
27552,0,0
14043,0,0
30313,0,0


In [25]:
from sklearn.metrics import classification_report

In [26]:
print(classification_report(y_test, final_preds))

             precision    recall  f1-score   support

          0       0.87      0.93      0.90      7436
          1       0.73      0.57      0.64      2333

avg / total       0.84      0.85      0.84      9769
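For reference, the per-class numbers in this report come from simple counts of true positives, false positives, and false negatives. The sketch below computes them by hand for one class on a tiny made-up example; `binary_report`, `y_true`, and `y_pred` are hypothetical names and data, not part of the notebook.

```python
def binary_report(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Made-up labels: two true positives, one false positive, one false negative.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0]
print(binary_report(y_true, y_pred))
```

Read this way, the recall of 0.57 for class 1 says the model misses a little under half of the individuals who actually earn more than 50K.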



In [27]:
from sklearn.metrics import accuracy_score
print("Accuracy score: {}".format([accuracy_score(y_test, final_preds)]))

Accuracy score: [0.8467601596888116]
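Because the classes are imbalanced, it helps to compare this accuracy with a trivial baseline that always predicts the majority class (<=50K). A small sketch using the support counts from the classification report; the variable names are introduced here for illustration.

```python
support_low = 7436    # test rows labelled 0 (<=50K), from the report
support_high = 2333   # test rows labelled 1 (>50K), from the report

# Accuracy of always predicting the majority class.
baseline_accuracy = support_low / (support_low + support_high)
model_accuracy = 0.8467601596888116   # value printed above

print(round(baseline_accuracy, 4), round(model_accuracy - baseline_accuracy, 4))
```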
