# Homework 5 Part I: Spam Classification in SciKit-Learn

This assignment uses data from https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection

Data processing was inspired by https://www.kaggle.com/overflow012/d/uciml/sms-spam-collection-dataset/text-preprocessing-classification

Before getting started, run the cell below to upgrade SciKit-Learn to 0.19.1, then choose Kernel | Restart in Jupyter.

In [1]:
! pip install -U scikit-learn

Requirement already up-to-date: scikit-learn in /usr/local/lib/python2.7/dist-packages
You are using pip version 9.0.3, however version 10.0.0 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


In [150]:
import pandas as pd
import numpy as np
import tensorflow as tf

####
# Helper function:
#  Return the k most frequently appearing keywords in the dataframe
def top_k(data_df, vec, k):
    X = vec.fit_transform(data_df['sms'].values)
    labels = vec.get_feature_names()
    return pd.DataFrame(columns = labels, data = X.toarray()).sum().sort_values(ascending = False)[:k]

sms_df = pd.read_csv('spam.csv', encoding='latin-1')
sms_df.columns = ['class', 'sms', 'a', 'b', 'c']

## Step 1.1 Data Wrangling

Clean up sms_df: drop the unused columns 'a', 'b', and 'c', and lowercase the sms text.
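As a minimal sketch of this step, assuming a toy DataFrame with the same five column names (the rows are invented for illustration and are not taken from spam.csv), the cleanup could look like this with pandas string methods:

```python
import pandas as pd

# Toy stand-in for sms_df; the rows are made up for illustration.
toy_df = pd.DataFrame({
    'class': ['ham', 'spam'],
    'sms': ['Ok lar... Joking', 'FREE entry to WIN'],
    'a': [None, None],
    'b': [None, None],
    'c': [None, None],
})

# Drop the unused columns, then lowercase every message.
toy_df = toy_df.drop(['a', 'b', 'c'], axis=1)
toy_df['sms'] = toy_df['sms'].str.lower()

print(toy_df)
```

Only the 'class' and 'sms' columns remain, and the text is lowercased in place.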

In [151]:
## TODO: Data wrangling / cleaning
sms_df = sms_df.drop(['a','b','c'], axis = 1)

def clean_txt(x):
    s = str(x)
    return s.lower()

txt = u'\n'.join(sms_df.sms.values.tolist()).encode('utf-8')
sms = txt.split('\n')
sms_vals = list(map(clean_txt, sms))
sms_df.sms = pd.Series(sms_vals)

## Step 1.1 Results

In [152]:
sms_df

Unnamed: 0,class,sms
0,ham,"go until jurong point, crazy.. available only ..."
1,ham,ok lar... joking wif u oni...
2,spam,free entry in 2 a wkly comp to win fa cup fina...
3,ham,u dun say so early hor... u c already then say...
4,ham,"nah i don't think he goes to usf, he lives aro..."
5,spam,freemsg hey there darling it's been 3 week's n...
6,ham,even my brother is not like to speak with me. ...
7,ham,as per your request 'melle melle (oru minnamin...
8,spam,winner!! as a valued network customer you have...
9,spam,had your mobile 11 months or more? u r entitle...


In [44]:
sms_df.groupby('class').describe()

class,sms count,sms unique,sms top,sms freq
ham,4825,4515,"sorry, i'll call later",30
spam,747,653,please call our customer service representativ...,4


## Step 1.2. Vectorizing the Text

In [153]:
## TODO: Generate feature vectors
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(decode_error = 'ignore', stop_words = 'english')


## Let's see the most frequent terms in spam

In [154]:
top_spam = top_k(sms_df[sms_df['class'] == 'spam'], vec, 30)

top_spam

free          224
txt           163
ur            144
mobile        127
text          125
stop          121
claim         113
reply         104
www            98
prize          93
just           78
cash           76
won            76
uk             74
150p           71
send           70
new            69
nokia          67
win            64
urgent         63
tone           60
week           60
50             57
contact        56
service        56
msg            54
com            54
18             51
16             51
guaranteed     50
dtype: int64

## Vs ham...

In [155]:
top_ham = top_k(sms_df[sms_df['class'] == 'ham'], vec, 30)

top_ham

gt       318
lt       316
just     293
ok       287
ll       265
ur       241
know     236
good     233
got      232
like     232
come     227
day      209
time     201
love     199
going    169
home     165
want     164
lor      162
need     158
sorry    157
don      151
da       150
today    139
later    135
dont     132
did      129
send     129
think    128
pls      123
hi       122
dtype: int64
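The top_k helper converts the whole count matrix to a dense array before summing. One possible alternative, sketched here on made-up messages, sums the sparse matrix column by column and recovers the term order from `vocabulary_`. The name `top_k_sparse` is hypothetical and not part of the assignment.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def top_k_sparse(texts, vec, k):
    """Return the k most frequent terms as (term, count) pairs."""
    X = vec.fit_transform(texts)
    # Column sums of the sparse matrix, flattened to a 1-D array.
    totals = np.asarray(X.sum(axis=0)).ravel()
    # vocabulary_ maps term -> column index; sort terms by that index.
    terms = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
    # Stable sort so ties keep their column (alphabetical) order.
    order = np.argsort(-totals, kind='stable')[:k]
    return [(terms[i], int(totals[i])) for i in order]

# Invented messages, for illustration only.
demo_texts = ['free prize free', 'free call now', 'call me']
top2 = top_k_sparse(demo_texts, CountVectorizer(), 2)
print(top2)
```

This avoids allocating a dense documents-by-terms array, which may matter if the vocabulary is large.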

## Step 1.2.2 Regularize URLs and Numbers

Import _regularize_ here, and use *regularize_urls* and *regularize_numbers*
on the columns.
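The _regularize_ module is supplied with the assignment and its source is not shown here. As a rough sketch of what such helpers might do, assuming they replace URLs and digit runs with the `_url_` and `_num_` tokens that appear in the results below, a regex-based version could look like this. The function names and patterns are illustrative assumptions, not the module's actual implementation.

```python
import re

# Assumed patterns; the real regularize module may use different ones.
URL_RE = re.compile(r'(?:https?://|www\.)\S+')
NUM_RE = re.compile(r'\d+')

def sketch_regularize_urls(texts):
    """Replace anything that looks like a URL with the token _url_."""
    return [URL_RE.sub(' _url_ ', t) for t in texts]

def sketch_regularize_numbers(texts):
    """Replace each run of digits with the token _num_."""
    return [NUM_RE.sub(' _num_ ', t) for t in texts]

# Invented messages, for illustration only.
msgs = ['visit www.example.com now', 'call 0906 to claim 900 prize']
# URLs go first so digits inside a URL are not rewritten separately.
out = sketch_regularize_numbers(sketch_regularize_urls(msgs))
print(out)
```

Because the underscore counts as a word character, CountVectorizer's default token pattern keeps `_url_` and `_num_` as single tokens.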

In [51]:
# TODO: Regularize/tokenize URLs and numbers
from regularize import regularize_urls
from regularize import regularize_numbers
r_urls = regularize_urls(sms_df.sms)
r_nums = regularize_numbers(r_urls)
sms_df.sms = pd.Series(r_nums)
X = vec.fit_transform(sms_df.sms.values)
ix_s, ix_h = sms_df['class'] == 'spam', sms_df['class'] == 'ham'
top30_spam, top30_ham = top_k(sms_df[ix_s], vec, 30), top_k(sms_df[ix_h], vec, 30)

## Step 1.2.2 Results

Re-run the CountVectorizer, re-create vector X, and re-compute the top-30 spam terms.  Output the top-30 spam terms.

In [56]:
# TODO: Top-30 spam terms
top30_spam

_num_         3289
free           228
txt            165
_url_          147
ur             144
mobile         129
stop           126
text           125
claim          113
reply          104
prize           92
just            78
won             76
cash            76
nokia           71
send            70
win             70
new             69
urgent          63
week            60
tone            59
box             57
msg             56
service         56
contact         56
guaranteed      50
ppm             49
customer        49
mins            47
phone           46
dtype: int64

## Step 1.3 Creating Features

Take the top-30 spam + top-30 ham words, and create a new CountVectorizer,
called *relevant_vec*, which _only_ includes those words.
See http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html.
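As a small sketch of a fixed-vocabulary vectorizer, assuming two short illustrative word lists in place of the real top-30 lists: the lists overlap, so the union is deduplicated first and each word gets exactly one column.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative subsets, not the real top-30 lists.
top_spam_terms = ['free', 'txt', 'ur', 'just']
top_ham_terms = ['ok', 'just', 'ur']

# Deduplicated, sorted union of both lists.
vocab = sorted(set(top_spam_terms) | set(top_ham_terms))

# With vocabulary= set, only these words are counted; all others are ignored.
demo_vec = CountVectorizer(vocabulary=vocab)
X_demo = demo_vec.fit_transform(['free free txt', 'ok just ok']).toarray()

print(vocab)
print(X_demo)
```

The columns of `X_demo` follow the order of `vocab`, so column 0 counts 'free', column 1 counts 'just', and so on.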

In [156]:
# TODO: Vector of 'important' words
#relevant_vec
rv_df = pd.concat([pd.DataFrame(top30_ham), pd.DataFrame(top30_spam)]).reset_index()
rv_l = rv_df[rv_df.columns[0]].values.tolist()
rv_vocabulary = np.unique(rv_l).tolist()
relevant_vec = CountVectorizer(decode_error = 'ignore', stop_words = 'english', vocabulary=rv_vocabulary)

In [157]:
sms_df['sms'].values.tolist()

['go until jurong point, crazy.. available only in bugis n great world la e buffet... cine there got amore wat...',
 'ok lar... joking wif u oni...',
 "free entry in 2 a wkly comp to win fa cup final tkts 21st may 2005. text fa to 87121 to receive entry question(std txt rate)t&c's apply 08452810075over18's",
 'u dun say so early hor... u c already then say...',
 "nah i don't think he goes to usf, he lives around here though",
 "freemsg hey there darling it's been 3 week's now and no word back! i'd like some fun you up for it still? tb ok! xxx std chgs to send, \xc3\xa5\xc2\xa31.50 to rcv",
 'even my brother is not like to speak with me. they treat me like aids patent.',
 "as per your request 'melle melle (oru minnaminunginte nurungu vettam)' has been set as your callertune for all callers. press *9 to copy your friends callertune",
 'winner!! as a valued network customer you have been selected to receivea \xc3\xa5\xc2\xa3900 prize reward! to claim call 09061701461. claim code kl341. va

In [158]:
import sklearn.model_selection as ms
from sklearn.feature_extraction.text import TfidfTransformer
import numpy as np

# X is the feature array, based off relevant words
X = relevant_vec.fit_transform(sms_df['sms'].values).toarray()

# Compute the length of each sms message, normalized
# by max length
Xlen = np.zeros((X.shape[0],1))
inx = 0
for v in sms_df['sms'].values:
        Xlen[inx,0] = len(v)
        inx += 1
Xlen = Xlen / max(Xlen)
# Add the length as another feature
X = np.hstack((X, Xlen))

y = np.array((sms_df['class'] == 'spam').astype(int))

# Now we split...
X_train, X_test, y_train, y_test = ms.train_test_split(X, 
                                                    y, test_size=0.2, random_state=42)

X_train

array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.09120879],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.15054945],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.05054945],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.04945055],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.02857143],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.03846154]])
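The normalized length feature built in the cell above can also be written without an explicit loop. A minimal sketch on invented messages, where `counts` stands in for the keyword count matrix:

```python
import numpy as np

# Invented messages and keyword counts, for illustration only.
msgs = ['ok', 'free entry', 'win']
counts = np.array([[0, 1], [1, 0], [0, 0]], dtype=float)

# Message lengths as a column vector, scaled so the longest message is 1.0.
lengths = np.array([len(m) for m in msgs], dtype=float).reshape(-1, 1)
lengths = lengths / lengths.max()

# Append the length column to the right of the count features.
X_demo = np.hstack((counts, lengths))
print(X_demo)
```

The last column of `X_demo` plays the same role as the final column of X_train shown above.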

## Step 1.4 Classifier Evaluation

In [159]:
from sklearn.tree import DecisionTreeClassifier as DTC
from sklearn.svm import SVC
import sklearn.model_selection as ms
from sklearn.linear_model import LogisticRegression as LR
import numpy as np

# Results, as a list of dictionaries
classifier_results = []

## Sample depth-2 decision tree
dt_model = DTC(max_depth=2)
dt_model.fit(X_train, y_train)
y_pred_test = dt_model.predict(X_test)
test_score = dt_model.score(X_test, y_test)
classifier_results.append({'Classifier': 'DecTree', 'Depth': 2, 'Score': test_score})

# TODO: Code for creating and testing classifiers mentioned in Step 1.4 of HW document
# Results, as a list of dictionaries
classifier_results = []

# 1. Classifier=DecTree: 
#     - Decision Tree Classifiers with max_depth 1-5 and random_state=42. 
#       (5 different classifiers)
def make_dtc(max_d):
    dt_model = DTC(criterion='gini', splitter='best', max_depth=max_d, random_state=42)
    dt_model.fit(X_train, y_train)
    y_pred_test = dt_model.predict(X_test)
    test_score = dt_model.score(X_test, y_test)
    return {'Classifier': 'DecTree', 'Depth': max_d, 'Score': test_score}

# 2. Classifier=LogReg-L1 and Classifier=LogReg-L2: 
#     - Logistic Regression with parameters solver='liblinear', penalty='l1', and random_state=42,
#       and with solver='liblinear', penalty='l2', and random_state=42  (2 different classifiers)
def make_lr(pl):
    lr_model = LR(penalty='l' + str(pl), random_state=42, solver='liblinear')
    lr_model.fit(X_train, y_train)
    y_pred_test = lr_model.predict(X_test)
    test_score = lr_model.score(X_test, y_test)
    return {'Classifier': 'LogReg-L' + str(pl), 'Depth': None, 'Score': test_score}

# 3. Classifier=SVM: 
#     - Support Vector Machines (SVC) with random_state=42.
def make_svc():
    svm_model = SVC(C=1.0, kernel='rbf', random_state=42)
    svm_model.fit(X_train, y_train)
    y_pred_test = svm_model.predict(X_test)
    test_score = svm_model.score(X_test, y_test)
    return {'Classifier': 'SVC', 'Depth': None, 'Score': test_score}


# Fit classifiers
classifier_results = list(map(make_dtc, [x for x in range(1,6)]))  
classifier_results.append(make_lr(1))
classifier_results.append(make_lr(2))
classifier_results.append(make_svc())

## Step 1.4 Results

In [160]:
pd.DataFrame(classifier_results)

Unnamed: 0,Classifier,Depth,Score
0,DecTree,1.0,0.865471
1,DecTree,2.0,0.885202
2,DecTree,3.0,0.900448
3,DecTree,4.0,0.914798
4,DecTree,5.0,0.918386
5,LogReg-L1,,0.95157
6,LogReg-L2,,0.946188
7,SVC,,0.940807
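Because the dataset is imbalanced, these accuracies are easier to read against a majority-class baseline. The sketch below uses the class counts reported by describe() in Step 1.1; `baseline` is an illustrative name, and the exact baseline on the test split may differ slightly from the whole-dataset figure.

```python
# Class counts from the describe() output in Step 1.1.
n_ham, n_spam = 4825, 747

# Accuracy of a classifier that always predicts 'ham'.
baseline = n_ham / float(n_ham + n_spam)

# Scores copied from the results table above.
scores = {'DecTree depth 1': 0.865471, 'LogReg-L1': 0.95157, 'SVC': 0.940807}

print('baseline: %.4f' % baseline)
for name in sorted(scores):
    print('%s: %+.4f vs baseline' % (name, scores[name] - baseline))
```

On these numbers the depth-1 tree sits at roughly the baseline level, while the logistic regression and SVC models are clearly above it.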


## Step 2.0 Ensembles

In [161]:
from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.ensemble import AdaBoostClassifier as ABC
from sklearn.ensemble import BaggingClassifier as BC

ensemble_names = ['RandomForest', 'Bag-DecTree', 'Bag-LogReg-L1', 'Bag-LogReg-L2', 'Bag-SVM', 
                  'Boost-DecTree', 'Boost-LogReg-L1', 'Boost-LogReg-L2', 'Boost-SVM']

# 1. RandomForestClassifier
def make_rfc():
    clf = RFC(n_estimators=31, random_state=314) 
    clf.fit(X_train, y_train)
    y_pred_test = clf.predict(X_test)
    test_score = clf.score(X_test, y_test)
    return {'Classifier': 'RandomForest', 'Depth': None, 'Score': test_score}

# 2. BaggingClassifier
def make_bag_dtc():
    clf = DTC(random_state=42)
    clf_bc = BC(clf, n_estimators=31, random_state=314)
    clf_bc.fit(X_train, y_train)
    y_pred_test = clf_bc.predict(X_test)
    test_score = clf_bc.score(X_test, y_test)
    return {'Classifier': 'Bag-DecTree', 'Depth': None, 'Score': test_score}

def make_bag_lr(pl):
    clf = LR(penalty='l' + str(pl), random_state=42, solver='liblinear')
    clf_bc = BC(clf, n_estimators=31, random_state=314)
    clf_bc.fit(X_train, y_train)
    y_pred_test = clf_bc.predict(X_test)
    test_score = clf_bc.score(X_test, y_test)
    return {'Classifier': 'Bag-LogReg-L' + str(pl), 
            'Depth': None, 'Score': test_score}

def make_bag_svc():
    clf = SVC(C=1.0, kernel='rbf', random_state=42)
    clf_bc = BC(clf, n_estimators=31, random_state=314)
    clf_bc.fit(X_train, y_train)
    y_pred_test = clf_bc.predict(X_test)
    test_score = clf_bc.score(X_test, y_test)
    return {'Classifier': 'Bag-SVM', 'Depth': None, 'Score': test_score}


# 3. AdaBoostClassifiers
def make_boost_dtc():
    clf = DTC(random_state=42)
    clf_bc = ABC(clf, n_estimators=31, random_state=314, algorithm='SAMME')
    clf_bc.fit(X_train, y_train)
    y_pred_test = clf_bc.predict(X_test)
    test_score = clf_bc.score(X_test, y_test)
    return {'Classifier': 'Boost-DecTree', 'Depth': None, 'Score': test_score}

def make_boost_lr(pl):
    clf = LR(penalty='l' + str(pl), random_state=42, solver='liblinear')
    clf_bc = ABC(clf, n_estimators=31, random_state=314, algorithm='SAMME')
    clf_bc.fit(X_train, y_train)
    y_pred_test = clf_bc.predict(X_test)
    test_score = clf_bc.score(X_test, y_test)
    return {'Classifier': 'Boost-LogReg-L' + str(pl), 'Depth': None, 'Score': test_score}

def make_boost_svc():
    clf = SVC(C=1.0, kernel='rbf', random_state=42)
    clf_bc = ABC(clf, n_estimators=31, random_state=314, algorithm='SAMME')
    clf_bc.fit(X_train, y_train)
    y_pred_test = clf_bc.predict(X_test)
    test_score = clf_bc.score(X_test, y_test)
    return {'Classifier': 'Boost-SVM', 'Depth': None, 'Score': test_score}

## Compute ensemble classifier results here

In [162]:
classifier_results.append(make_rfc())
classifier_results.append(make_bag_dtc())
classifier_results.append(make_bag_lr(1))
classifier_results.append(make_bag_lr(2))
classifier_results.append(make_bag_svc())
classifier_results.append(make_boost_dtc())
classifier_results.append(make_boost_lr(1))
classifier_results.append(make_boost_lr(2))
classifier_results.append(make_boost_svc())

## Step 2.0 Results

In [163]:
pd.DataFrame(classifier_results)

Unnamed: 0,Classifier,Depth,Score
0,DecTree,1.0,0.865471
1,DecTree,2.0,0.885202
2,DecTree,3.0,0.900448
3,DecTree,4.0,0.914798
4,DecTree,5.0,0.918386
5,LogReg-L1,,0.95157
6,LogReg-L2,,0.946188
7,SVC,,0.940807
8,RandomForest,,0.956054
9,Bag-DecTree,,0.956054
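The make_* functions in Steps 1.4 and 2.0 all repeat the same fit-then-score pattern. One way to factor that out, sketched here on synthetic data from make_classification rather than the SMS features, is a single helper that takes a name and an unfitted classifier. The helper name `evaluate` and the synthetic dataset are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def evaluate(name, clf, X_tr, X_te, y_tr, y_te, depth=None):
    """Fit clf on the training split and return one result row."""
    clf.fit(X_tr, y_tr)
    return {'Classifier': name, 'Depth': depth, 'Score': clf.score(X_te, y_te)}

# Synthetic stand-in for the SMS feature matrix and labels.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=42)
# train_test_split returns X_train, X_test, y_train, y_test in that order.
split = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

results = [
    evaluate('DecTree', DecisionTreeClassifier(max_depth=d, random_state=42),
             *split, depth=d)
    for d in range(1, 4)
]
# LogisticRegression uses an L2 penalty by default.
results.append(evaluate('LogReg-L2',
                        LogisticRegression(solver='liblinear', random_state=42),
                        *split))

for row in results:
    print(row)
```

Each ensemble could then be evaluated by passing the wrapped estimator to the same helper, which keeps the result rows in one consistent shape.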


## Step 3.0 Neural Networks

In [164]:
from sklearn.linear_model import Perceptron
from sklearn.neural_network import MLPClassifier

# 1. Perceptron with random_state=42.
def make_ppt():
    clf_bc = Perceptron(random_state=42)
    clf_bc.fit(X_train, y_train)
    y_pred_test = clf_bc.predict(X_test)
    test_score = clf_bc.score(X_test, y_test)
    return {'Classifier': 'Perceptron', 'Depth': None, 'Score': test_score}
  
# 2. MLPClassifier with 3 hidden nodes in one layer, and random_state=42.
# 3. MLPClassifier with 10 hidden nodes in one layer, and random_state=42.
# 4. MLPClassifier with 10 hidden nodes in each of 3 layers, and random_state=42.
def mlp(layers):
    return MLPClassifier(hidden_layer_sizes=layers, random_state=42)
nn_models = [mlp((3,)), mlp((10,)), mlp((10, 10, 10))]

def make_mlp(ix):
    clf_bc = nn_models[ix]
    clf_bc.fit(X_train, y_train)
    y_pred_test = clf_bc.predict(X_test)
    test_score = clf_bc.score(X_test, y_test)
    return {'Classifier': 'MLPClassifier' + str(ix), 'Depth': None, 'Score': test_score}

classifier_results.append(make_ppt())
classifier_results.append(make_mlp(0))
classifier_results.append(make_mlp(1))
classifier_results.append(make_mlp(2))

## Step 3.0 Results

In [165]:
pd.DataFrame(classifier_results)

Unnamed: 0,Classifier,Depth,Score
0,DecTree,1.0,0.865471
1,DecTree,2.0,0.885202
2,DecTree,3.0,0.900448
3,DecTree,4.0,0.914798
4,DecTree,5.0,0.918386
5,LogReg-L1,,0.95157
6,LogReg-L2,,0.946188
7,SVC,,0.940807
8,RandomForest,,0.956054
9,Bag-DecTree,,0.956054
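MLPClassifier's `hidden_layer_sizes` takes one entry per hidden layer. In Python, parentheses alone do not create a tuple, so a single hidden layer is written most clearly with a trailing comma. A short illustration in plain Python:

```python
one_layer_int = (3)           # parentheses only: this is the integer 3
one_layer_tuple = (3,)        # trailing comma: a one-element tuple
three_layers = (10, 10, 10)   # three hidden layers of 10 nodes each

print(type(one_layer_int).__name__, type(one_layer_tuple).__name__)
print(len(one_layer_tuple), len(three_layers))
```

The tuple form makes the number of layers explicit: its length is the layer count and each entry is that layer's width.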


## Step 4.0 TensorFlow

In [212]:
#! pip install tensorflow
import tensorflow as tf

# The first cell installs TensorFlow and imports the package. 
# For this part of the homework, you will use X_train and y_train created in Step 1.3.

import tempfile
model_dir = tempfile.mkdtemp()
tf.set_random_seed(42)

In [266]:
# TODO: Define TensorFlow columns
# Define TensorFlow columns (features) for each of the top-occurring keys in 
# spam and ham.  Also add an additional column for the length. 

# Store these columns in a list.
vocabulary = [x.encode('utf-8') for x in rv_vocabulary]
vocabulary.append('length')

# TODO: Create function input_fn(x,y)
# Create a function input_fn that takes parameters x (2D np array of features) 
# and y (1D np array of labels).  This should create a tensor for each column 
# of the 2D array x. You can think of this as creating a tensor for each feature. 
# This function should return a tuple of the dictionary of the tensors created 
# from the columns and a tensor created from the second input y.
def input_fn(X, y):
    fc, lbl = {}, tf.constant(y) 
    for i in range(len(vocabulary)):
        fc[vocabulary[i]] = tf.constant(X[:, i])
    return fc, lbl

# TODO: Create function test_input_fn()
# Create a function test_input_fn that takes no arguments, but returns the output 
# of passing in the test set and labels to input_fn. 
def test_input_fn():
    return input_fn(X_test, y_test)


# TODO: Create function train_input_fn()
# Create a similar function train_input_fn that does the same thing except 
# passes in the training set and labels.
def train_input_fn():
    return input_fn(X_train, y_train)
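To make the shape of the input_fn return value concrete, here is a sketch that builds the same column-name-to-column dictionary with plain NumPy arrays standing in for `tf.constant` tensors. The helper name `columns_as_dict` and the three-column demo matrix are assumptions for illustration.

```python
import numpy as np

def columns_as_dict(X, names):
    """Map each feature name to the matching column of X."""
    return {name: X[:, i] for i, name in enumerate(names)}

# Two invented rows with three features: two keyword counts and a length.
names = ['free', 'ok', 'length']
X_demo = np.array([[2.0, 0.0, 0.5],
                   [0.0, 1.0, 1.0]])
y_demo = np.array([1, 0])

# Same (features, labels) tuple shape that input_fn returns.
features, labels = columns_as_dict(X_demo, names), y_demo

for name in names:
    print(name, features[name])
print('labels', labels)
```

The order of `names` must match the column order of the feature matrix, which is why the 'length' entry is appended last in the vocabulary list above.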

## Step 4.3.1

#### Data Check for Step 4.3 part 1.
1. Create a **DNNClassifier** with two hidden layers of 5 units each, and run for 1000 steps. 
2. Under the Markdown Cell saying “Step 4.3.1 Results”, run the train operation over the training data. 
3. In the next Cell, run the evaluate operation over the test data.  
4. In the next Cell, sort the results of the **evaluate** operation by key, and output the keys and their values.  
5. For reference, here is an example of using the DNNClassifier. Note the accuracy.

In [292]:
from tensorflow.contrib.learn import DNNClassifier as DNN
from tensorflow.contrib.learn import LinearClassifier  as LC
from tensorflow.contrib.layers import real_valued_column  as rvc

tf.set_random_seed(42) 
fc = [rvc(x) for x in vocabulary]

In [293]:
# TODO: Create DNNClassifier
# Build a DNN with 2 hidden layers and 5 nodes in each hidden layer.
clf_dnn= DNN(feature_columns=fc, hidden_units=[5,5], n_classes=2)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_save_checkpoints_secs': 600, '_num_ps_replicas': 0, '_keep_checkpoint_max': 5, '_task_type': None, '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fb3a3617410>, '_model_dir': '/tmp/tmpMvbIRF', '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_session_config': None, '_tf_random_seed': None, '_save_summary_steps': 100, '_environment': 'local', '_num_worker_replicas': 0, '_task_id': 0, '_log_step_count_steps': 100, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_evaluation_master': '', '_master': ''}


## Step 4.3.1 Results

In [296]:
# TODO: train
clf_dnn.fit(input_fn=train_input_fn, steps = 1000)



INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmpMvbIRF/model.ckpt-1000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 1001 into /tmp/tmpMvbIRF/model.ckpt.
INFO:tensorflow:loss = 0.117591195, step = 1001
INFO:tensorflow:global_step/sec: 319.088
INFO:tensorflow:loss = 0.11597445, step = 1101 (0.315 sec)
INFO:tensorflow:global_step/sec: 445.905
INFO:tensorflow:loss = 0.114658564, step = 1201 (0.225 sec)
INFO:tensorflow:global_step/sec: 488.296
INFO:tensorflow:loss = 0.1136109, step = 1301 (0.203 sec)
INFO:tensorflow:global_step/sec: 466.618
INFO:tensorflow:loss = 0.11271803, step = 1401 (0.214 sec)
INFO:tensorflow:global_step/sec: 429.795
INFO:tensorflow:loss = 0.1118606, step = 1501 (0.234 sec)
INFO:tensorflow:global_step/sec: 477.062
INFO:tensorflow:loss = 0.11115308, step = 1601 (0.208 sec)
INFO:tensorflow:global_step/sec: 459.059

DNNClassifier(params={'head': <tensorflow.contrib.learn.python.learn.estimators.head._BinaryLogisticHead object at 0x7fb3c0977fd0>, 'hidden_units': [5, 5], 'feature_columns': (_RealValuedColumn(column_name='_num_', dimension=1, default_value=None, dtype=tf.float32, normalizer=None), _RealValuedColumn(column_name='_url_', dimension=1, default_value=None, dtype=tf.float32, normalizer=None), _RealValuedColumn(column_name='box', dimension=1, default_value=None, dtype=tf.float32, normalizer=None), _RealValuedColumn(column_name='cash', dimension=1, default_value=None, dtype=tf.float32, normalizer=None), _RealValuedColumn(column_name='claim', dimension=1, default_value=None, dtype=tf.float32, normalizer=None), _RealValuedColumn(column_name='come', dimension=1, default_value=None, dtype=tf.float32, normalizer=None), _RealValuedColumn(column_name='contact', dimension=1, default_value=None, dtype=tf.float32, normalizer=None), _RealValuedColumn(column_name='customer', dimension=1, default_value=N

In [297]:
# TODO: evaluate
results_dnn = clf_dnn.evaluate(input_fn=test_input_fn, steps = 1)



INFO:tensorflow:Starting evaluation at 2018-04-16-01:42:55
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmpMvbIRF/model.ckpt-2000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Evaluation [1/1]
INFO:tensorflow:Finished evaluation at 2018-04-16-01:42:56
INFO:tensorflow:Saving dict for global step 2000: accuracy = 0.95784754, accuracy/baseline_label_mean = 0.13452914, accuracy/threshold_0.500000_mean = 0.95784754, auc = 0.9440622, auc_precision_recall = 0.873392, global_step = 2000, labels/actual_label_mean = 0.13452914, labels/prediction_mean = 0.13037567, loss = 0.15212546, precision/positive_threshold_0.500000_mean = 0.912, recall/positive_threshold_0.500000_mean = 0.76


In [303]:
# TODO: results
for key in sorted(results_dnn):
    print("%s: %s" % (key, results_dnn[key]))

accuracy: 0.95784754
accuracy/baseline_label_mean: 0.13452914
accuracy/threshold_0.500000_mean: 0.95784754
auc: 0.9440622
auc_precision_recall: 0.873392
global_step: 2000
labels/actual_label_mean: 0.13452914
labels/prediction_mean: 0.13037567
loss: 0.15212546
precision/positive_threshold_0.500000_mean: 0.912
recall/positive_threshold_0.500000_mean: 0.76


## Step 4.3.2

#### Data Check for Step 4.3 part 2.
1. Create a **LinearClassifier** and run for 1000 steps. 
2. Under the Markdown Cell saying “Step 4.3.2 Results”, run the train operation over the training data. 
3. In the next Cell, run the evaluate operation over the test data.  
4. In the next Cell, sort the results of the evaluate operation by key, and output the keys and their values.  Note the accuracy.

In [279]:
# TODO: Create LinearClassifier
clf_lc = LC(feature_columns=fc)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_save_checkpoints_secs': 600, '_num_ps_replicas': 0, '_keep_checkpoint_max': 5, '_task_type': None, '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fb3b9db9890>, '_model_dir': '/tmp/tmphx7kdm', '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_session_config': None, '_tf_random_seed': None, '_save_summary_steps': 100, '_environment': 'local', '_num_worker_replicas': 0, '_task_id': 0, '_log_step_count_steps': 100, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_evaluation_master': '', '_master': ''}


## Step 4.3.2 Results

In [280]:
# TODO: train
clf_lc.fit(input_fn=train_input_fn, steps = 1000)

INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 1 into /tmp/tmphx7kdm/model.ckpt.
INFO:tensorflow:loss = 0.69314593, step = 1
INFO:tensorflow:global_step/sec: 207.344
INFO:tensorflow:loss = 0.29134312, step = 101 (0.485 sec)
INFO:tensorflow:global_step/sec: 291.846
INFO:tensorflow:loss = 0.23684062, step = 201 (0.341 sec)
INFO:tensorflow:global_step/sec: 475.935
INFO:tensorflow:loss = 0.21105811, step = 301 (0.210 sec)
INFO:tensorflow:global_step/sec: 504.286
INFO:tensorflow:loss = 0.19564131, step = 401 (0.198 sec)
INFO:tensorflow:global_step/sec: 462.83
INFO:tensorflow:loss = 0.18521579, step = 501 (0.217 sec)
INFO:tensorflow:global_step/sec: 507.473
INFO:tensorflow:loss = 0.17761028, step = 601 (0.196 sec)
INFO:tensorflow:global_step/sec: 598.415
INFO:tensorflow:loss = 0.17177033, step = 701 (0.168 sec)
INFO:tensorflow:global_step

LinearClassifier(params={'gradient_clip_norm': None, 'head': <tensorflow.contrib.learn.python.learn.estimators.head._BinaryLogisticHead object at 0x7fb3b9db9ad0>, 'joint_weights': False, 'optimizer': None, 'feature_columns': [_RealValuedColumn(column_name='_num_', dimension=1, default_value=None, dtype=tf.float32, normalizer=None), _RealValuedColumn(column_name='_url_', dimension=1, default_value=None, dtype=tf.float32, normalizer=None), _RealValuedColumn(column_name='box', dimension=1, default_value=None, dtype=tf.float32, normalizer=None), _RealValuedColumn(column_name='cash', dimension=1, default_value=None, dtype=tf.float32, normalizer=None), _RealValuedColumn(column_name='claim', dimension=1, default_value=None, dtype=tf.float32, normalizer=None), _RealValuedColumn(column_name='come', dimension=1, default_value=None, dtype=tf.float32, normalizer=None), _RealValuedColumn(column_name='contact', dimension=1, default_value=None, dtype=tf.float32, normalizer=None), _RealValuedColumn(co

In [305]:
# TODO: evaluate
results_lc  = clf_lc.evaluate(input_fn=test_input_fn, steps = 1)

INFO:tensorflow:Starting evaluation at 2018-04-16-01:46:19
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmphx7kdm/model.ckpt-1000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Evaluation [1/1]
INFO:tensorflow:Finished evaluation at 2018-04-16-01:46:20
INFO:tensorflow:Saving dict for global step 1000: accuracy = 0.94170403, accuracy/baseline_label_mean = 0.13452914, accuracy/threshold_0.500000_mean = 0.94170403, auc = 0.93237996, auc_precision_recall = 0.8503746, global_step = 1000, labels/actual_label_mean = 0.13452914, labels/prediction_mean = 0.13494353, loss = 0.17869483, precision/positive_threshold_0.500000_mean = 0.96703297, recall/positive_threshold_0.500000_mean = 0.58666664


In [306]:
# TODO: results
for key in sorted(results_lc):
    print("%s: %s" % (key, results_lc[key]))

accuracy: 0.94170403
accuracy/baseline_label_mean: 0.13452914
accuracy/threshold_0.500000_mean: 0.94170403
auc: 0.93237996
auc_precision_recall: 0.8503746
global_step: 1000
labels/actual_label_mean: 0.13452914
labels/prediction_mean: 0.13494353
loss: 0.17869483
precision/positive_threshold_0.500000_mean: 0.96703297
recall/positive_threshold_0.500000_mean: 0.58666664
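The two evaluations report similar accuracy but quite different precision and recall. As a rough worked check, the sketch below recovers approximate confusion counts from the reported metrics. It assumes a test set of 1115 messages containing 150 spam, which is inferred from the 20% split of 5572 messages and the reported label mean of 0.13452914 rather than stated in the notebook. The function name is illustrative.

```python
def confusion_from_metrics(n_total, n_positive, precision, recall):
    """Recover integer confusion counts from precision and recall."""
    tp = int(round(recall * n_positive))          # spam correctly flagged
    predicted_positive = int(round(tp / precision))
    fp = predicted_positive - tp                  # ham flagged as spam
    fn = n_positive - tp                          # spam that was missed
    tn = n_total - n_positive - fp                # ham correctly passed
    accuracy = (tp + tn) / float(n_total)
    return {'tp': tp, 'fp': fp, 'fn': fn, 'tn': tn, 'accuracy': accuracy}

# Assumed test-set sizes, inferred from the split and the label mean.
N_TEST, N_SPAM = 1115, 150

# Precision and recall copied from the two evaluate() outputs above.
dnn = confusion_from_metrics(N_TEST, N_SPAM, 0.912, 0.76)
lin = confusion_from_metrics(N_TEST, N_SPAM, 0.96703297, 0.58666664)

print('DNN   ', dnn)
print('Linear', lin)
```

Under these assumptions the recomputed accuracies agree with the reported ones, and the counts suggest the linear model flags less ham as spam but misses noticeably more spam than the DNN.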
