# TensorFlow Sprint Challenge

In this challenge, you'll use TensorFlow with a newly released dataset to model authorship of Victorian Era literature.

http://archive.ics.uci.edu/ml/datasets/Victorian+Era+Authorship+Attribution

We will use this data to classify authors based on the semantic meaning of their text. All questions include at least the minimum imports needed, but you're welcome to import more or change as you wish.

## Question 1 - Load and summarize the data

The UCI link has information about the data and a zip file, but you can also get direct links to the data as csv files from this site: https://dataworks.iupui.edu/handle/11243/23

Hint - pandas has good CSV reading functionality.

Another hint - you want the "train" data, as it comes with author labels.

After you load the data, validate that it's loaded by printing *basic* information - how many observations you have, a few example observations, etc. Don't spend long on this!

In [0]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [0]:
data = pd.read_csv('https://dataworks.iupui.edu/bitstream/handle/11243/23/Gungor_2018_VictorianAuthorAttribution_data.csv?sequence=3&isAllowed=y')

In [3]:
print(data.head())

                                                text
0  nt it seems te me how much money is he worth a...
1  to talk about why you heard of such a case as ...
2  my foot on the ground and said i believe you d...
3  hour or wait for miss oh wait for by all means...
4  will not listen to such words now go and remem...


In [0]:
df = pd.read_csv('https://dataworks.iupui.edu/bitstream/handle/11243/23/Gungor_2018_VictorianAuthorAttribution_data-train.csv?sequence=2&isAllowed=y')

In [5]:
print(df.head())

                                                text  author
0  ou have time to listen i will give you the ent...       1
1  wish for solitude he was twenty years of age a...       1
2  and the skirt blew in perfect freedom about th...       1
3  of san and the rows of shops opposite impresse...       1
4  an hour s walk was as tiresome as three in a s...       1


In [6]:
print(df.shape)

(53678, 2)


In [7]:
df['author'].value_counts()

8     6914
26    4441
14    2696
37    2387
45    2312
21    2307
39    2266
48    1825
33    1742
19    1543
4     1483
15    1460
43    1266
38    1163
25    1159
9     1108
18    1078
42    1022
30     972
50     914
1      912
41     911
28     823
10     755
32     703
36     693
17     660
35     659
29     645
12     627
46     605
20     587
22     495
13     485
44     468
23     455
34     453
40     430
6      407
11     383
2      382
24     380
27     306
3      213
16     183
Name: author, dtype: int64

In [8]:
df['author'][:5400].value_counts()

8    2003
4    1483
1     912
6     407
2     382
3     213
Name: author, dtype: int64
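
The counts above suggest that the rows are grouped by author: the first 5,400 rows cover only six authors. A leading slice of the frame will therefore miss most of the classes. The following is a minimal sketch, on a hypothetical toy frame (not the real data), of drawing a per-author sample instead. It assumes pandas 1.1 or newer, where `DataFrameGroupBy.sample` is available.

```python
import pandas as pd

# Hypothetical toy frame that mimics the layout: rows sorted by author.
toy = pd.DataFrame({
    "text": ["passage %d" % i for i in range(30)],
    "author": [1] * 10 + [2] * 10 + [3] * 10,
})

# A leading slice only ever sees the first author.
head_slice = toy[:10]

# Sampling inside each author group keeps every author represented.
stratified = toy.groupby("author").sample(frac=0.5, random_state=42)

print(head_slice["author"].unique().tolist())          # [1]
print(sorted(stratified["author"].unique().tolist()))  # [1, 2, 3]
print(len(stratified))                                 # 15
```

The same call on `df` would replace the `[:sample_size]` slices, at the cost of needing to keep the sampled texts and labels aligned by index.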

## Question 2 - Encode the text

Authorship is about text, but computers need numbers - use TensorFlow and the techniques we've learned this week to represent the text in a quantitative way. There isn't a single right way to do this, but your choice will affect how you process the data - if you need to tokenize or if you'll handle whole sentences, etc.

Hint - even with TensorFlow Hub, you may have to reduce the data (either in number of observations or the size of the text for any given observation) to make this tractable.

Another hint - you want your encoding to somehow say something about *semantics*, that is the meaning of the text.

In [0]:
!pip install -q keras

In [10]:
import keras
from keras.models import Sequential
from keras.layers import Dense

Using TensorFlow backend.


In [0]:
import tensorflow as tf
import tensorflow_hub as hub
import os
import re

In [12]:
sentence_module = hub.Module(
    "https://tfhub.dev/google/universal-sentence-encoder/2")
print(sentence_module)

INFO:tensorflow:Using /tmp/tfhub_modules to cache modules.
<tensorflow_hub.module.Module object at 0x7f2191843710>


In [0]:
sample_size = round(df.shape[0] * .3)

In [0]:
sample_size = int(sample_size)

In [15]:
df['text'][:sample_size].tolist()[:5]

['ou have time to listen i will give you the entire story he said it may form the basis of a future novel and prove quite as interesting as one of your own invention i had the time to listen of course one has time for anything and everything agreeable in the best place to hear the tale was in a victoria and with my good on the box with the coachman we set out at once on a drive to the as the recital was only half through when we reached the house we postponed the remainder while we stopped there for an excellent lunch on the way back to my friend continued and finished the story it was indeed quite suitable for use and i told my friend with thanks that i should at once put it in shape for my readers i said i should make a few alterations in it for the sake of dramatic interest but in the main would follow the lines he had given me it would spoil my romance were i to answer on this page the question that must be uppermost in the reader s mind i have already revealed almost too much of t

In [52]:
sentences = df['text'][:sample_size].tolist()

embeddings = sentence_module(sentences)
with tf.Session() as sess:
  sess.run(tf.global_variables_initializer())
  sess.run(tf.tables_initializer())
  encoded_sentences = sess.run(embeddings)

print(encoded_sentences)

INFO:tensorflow:Saver not created because there are no variables in the graph to restore
[[-0.04503655  0.06227405  0.02272171 ...  0.04139075  0.0147866
  -0.0621862 ]
 [-0.0148914   0.05193076  0.04208258 ... -0.03243674  0.02725637
  -0.05898011]
 [ 0.01468     0.05377739  0.02667934 ... -0.04516049  0.04419674
  -0.06183365]
 ...
 [ 0.0092257   0.058184    0.04633651 ...  0.0229649   0.0304555
  -0.00558858]
 [ 0.02641287  0.03589829  0.03901041 ...  0.0375279   0.04315262
  -0.03402424]
 [-0.01654375  0.00387224  0.01902384 ... -0.02258296 -0.01426152
  -0.01774742]]


In [53]:
# Step 3 - Measure the distance to explore the similarity
from sklearn.neighbors import DistanceMetric

dist = DistanceMetric.get_metric('euclidean')
print(dist.pairwise(encoded_sentences))

[[0.         0.64645548 0.69450181 ... 0.86708374 0.76081009 0.88258559]
 [0.64645548 0.         0.44446305 ... 0.8903237  0.82460346 0.85702439]
 [0.69450181 0.44446305 0.         ... 0.89131159 0.81931864 0.80659785]
 ...
 [0.86708374 0.8903237  0.89131159 ... 0.         0.77365944 0.82975704]
 [0.76081009 0.82460346 0.81931864 ... 0.77365944 0.         0.60986384]
 [0.88258559 0.85702439 0.80659785 ... 0.82975704 0.60986384 0.        ]]


In [0]:
dist_list = dist.pairwise(encoded_sentences)

In [55]:
dist_list.shape

(16103, 16103)

In [0]:
dist_list_0 = dist_list[0].tolist()

In [57]:
print(dist_list_0)

[0.0, 0.6464554803380002, 0.6945018085542586, 0.6837492586582161, 0.6171772969781352, 0.7071877765560203, 0.6972850157809583, 0.7643597301208352, 0.8523303152044617, 0.9136939042655516, 0.8801729722951432, 0.7033903659815138, 0.8144596135210751, 0.755337300285431, 0.7108733052033523, 0.7542925386862106, 0.8144483057035874, 0.7304115149573346, 0.7304812099619332, 0.7939150986315473, 0.7654095837248188, 0.7041833720551482, 0.7337386861070481, 0.7925053260521167, 0.8024900881395284, 0.7908366876132704, 0.8202991582167742, 0.7964521982346308, 0.7905953073179044, 0.8465729929529253, 0.8131024235797285, 0.7170414402690245, 0.8401727798552707, 0.7502220678312326, 0.7878292003405876, 0.8309143528884708, 0.7902285645104954, 0.7627957487002095, 0.7919721047435536, 0.9307502211003252, 0.875769320438826, 0.7632449989732155, 0.8308437277162277, 0.7497996326264719, 0.7979045352695461, 0.8103179173409809, 0.712683523696039, 0.703286270381562, 0.8596204578033644, 0.7919469760401194, 0.7311214639612912

In [58]:
sorted_dist = sorted(dist_list[0])
  
print(sorted_dist)

[0.0, 0.4787512786770313, 0.4971023789232943, 0.517749180257867, 0.521216451241154, 0.529399827752297, 0.5387961059174892, 0.5445428376166618, 0.5524048361605484, 0.5605203502459385, 0.563733506503018, 0.5654328789158386, 0.5678410958852931, 0.577055555905335, 0.5826076461568733, 0.5838100521391687, 0.5838152432739475, 0.5845259289861265, 0.5846218860804581, 0.5852761607170008, 0.585340735631759, 0.5858062846280987, 0.5859243194150496, 0.5877498236089156, 0.5878920642813337, 0.5888990104852014, 0.5909477699296012, 0.5915375550106576, 0.5924590907612551, 0.5937516262580115, 0.5942644239362461, 0.5951323909591001, 0.5959517576125644, 0.596289264865758, 0.5965408074766435, 0.5968906761698527, 0.596943157297419, 0.598006038757015, 0.5988287795450017, 0.5998009567497146, 0.599991013439398, 0.6004305548444665, 0.6005381789132828, 0.6020615409703937, 0.6022009869963224, 0.6035553144664076, 0.6038621433850295, 0.6043584710817181, 0.6046381499428236, 0.6069171010300909, 0.6071797012948276, 0.60

In [59]:
sorted_dist[1]

0.4787512786770313

In [60]:
sorted_dist[1] == dist_list[1]

array([False, False, False, ..., False, False, False])

In [61]:
for i in range(len(dist_list_0)):
  if sorted_dist[1] == dist_list_0[i]:
    print(i)
  if sorted_dist[2] == dist_list_0[i]:
    print(i)
  if sorted_dist[3] == dist_list_0[i]:
    print(i)
  if sorted_dist[4] == dist_list_0[i]:
    print(i)

2614
9544
9876
10181
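
The loop above prints the matching rows in ascending index order, not nearest-first, and it relies on exact float equality. A possible alternative, sketched here on a small made-up array, is to sort the distances with `np.argsort`. The helper name `nearest_neighbors` is hypothetical; it computes distances for one query row only, so the full pairwise matrix is not needed.

```python
import numpy as np

def nearest_neighbors(embeddings, query_index, k=4):
    """Return indices and distances of the k rows closest to the query row."""
    diffs = embeddings - embeddings[query_index]
    distances = np.linalg.norm(diffs, axis=1)   # Euclidean distance to the query
    order = np.argsort(distances)               # nearest first
    order = order[order != query_index][:k]     # drop the query itself
    return order, distances[order]

# Toy 2-D "embeddings" for illustration only.
toy = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 0.0], [0.0, 2.0]])
idx, dist = nearest_neighbors(toy, 0, k=2)
print(idx)   # [2 3]
print(dist)  # [1. 2.]
```

With the real data this would be called as `nearest_neighbors(encoded_sentences, 0)`.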


In [62]:
print(df['text'][0])
print(df['text'][2614])

ou have time to listen i will give you the entire story he said it may form the basis of a future novel and prove quite as interesting as one of your own invention i had the time to listen of course one has time for anything and everything agreeable in the best place to hear the tale was in a victoria and with my good on the box with the coachman we set out at once on a drive to the as the recital was only half through when we reached the house we postponed the remainder while we stopped there for an excellent lunch on the way back to my friend continued and finished the story it was indeed quite suitable for use and i told my friend with thanks that i should at once put it in shape for my readers i said i should make a few alterations in it for the sake of dramatic interest but in the main would follow the lines he had given me it would spoil my romance were i to answer on this page the question that must be uppermost in the reader s mind i have already revealed almost too much of the

In [103]:
len(sentences)

16103

In [63]:
print(df['author'][0])
print(df['author'][2614])
print(df['author'][341])
print(df['author'][3217])
print(df['author'][3871])

1
4
1
6
8


## Question 3 - Fit and evaluate a model

Now that you have the data represented quantitatively, the magic begins! Build a neural network model that classifies authors based on the semantic encodings of their text. You should do the following steps:

1. Make sure you know what is X (the independent variable, what you predict with) and what is y (the dependent variable/label, what you want to predict)
2. You should use train_test_split to have both training and testing data, otherwise you have no way to tell whether your neural network has overfit or generalizes well
3. You'll have to change how the label is represented - remember one-hot encoding, and check out `to_categorical`
4. When you add layers to the network - the main dimensions that matter are `input_dim` for the first layer and the dimensionality of the output layer

Feel free to look things up and use what resources you can find. The official documentation for these libraries is generally good (if overwhelming), and [this article](https://machinelearningmastery.com/multi-class-classification-tutorial-keras-deep-learning-library/) shows a baseline model using Keras for a classification problem (with images, but the output encoding and layer should be similar).

The result of this problem should be an accuracy score reported from evaluating the model with the testing data. Getting any model that runs and produces a score is great. If you want to try to improve your accuracy (stretch goal), you can play with the model parameters, but you may also have to revisit how you made the encodings (in particular how you reduced the data to keep the encodings tractable).

My initial attempts had accuracy around 0.2 (using default 75%/25% train/test split) - with a bit of work, I was able to get to 0.50946. Keep in mind that there are quite a few authors in this dataset, so accuracy far from 1 can still be much better than random.
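
To put "much better than random" into numbers, the arithmetic below computes two simple baselines from the figures printed earlier in this notebook (53,678 rows, 45 distinct authors, 6,914 rows for the most frequent author). It is a rough sketch over the full training file; a reduced sample would give different values.

```python
# Figures taken from df.shape and df['author'].value_counts() above.
n_rows = 53678
n_authors = 45
largest_class = 6914   # author 8

# Guessing uniformly at random among the authors.
uniform_baseline = 1.0 / n_authors
# Always predicting the most frequent author.
majority_baseline = largest_class / float(n_rows)

print("uniform guess:  %.4f" % uniform_baseline)   # 0.0222
print("majority class: %.4f" % majority_baseline)  # 0.1288
```

Any model worth keeping should beat the majority-class figure, not just the uniform one.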

In [0]:
from sklearn.model_selection import train_test_split
from keras.models import Sequential, Model
from keras.layers import Dense
from keras.utils import to_categorical  # Makes "one-hot" encoding from label
from keras import models, regularizers, layers, optimizers, losses, metrics

In [65]:
encoded_sentences

array([[-0.04503655,  0.06227405,  0.02272171, ...,  0.04139075,
         0.0147866 , -0.0621862 ],
       [-0.0148914 ,  0.05193076,  0.04208258, ..., -0.03243674,
         0.02725637, -0.05898011],
       [ 0.01468   ,  0.05377739,  0.02667934, ..., -0.04516049,
         0.04419674, -0.06183365],
       ...,
       [ 0.0092257 ,  0.058184  ,  0.04633651, ...,  0.0229649 ,
         0.0304555 , -0.00558858],
       [ 0.02641287,  0.03589829,  0.03901041, ...,  0.0375279 ,
         0.04315262, -0.03402424],
       [-0.01654375,  0.00387224,  0.01902384, ..., -0.02258296,
        -0.01426152, -0.01774742]], dtype=float32)

In [66]:
type(encoded_sentences)

numpy.ndarray

In [87]:
X = pd.DataFrame(data=encoded_sentences)
y_orig = df['author'][:sample_size]

print(X.shape)
print(y_orig.shape)

# one_hot_y_train = to_categorical(y_train)
# one_hot_y_test = to_categorical(y_test)

# print(one_hot_y_train.shape)
# print(one_hot_y_test.shape)

y = to_categorical(y_orig)
print(y.shape)

(16103, 512)
(16103,)
(16103, 15)


In [0]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [89]:
y_test.shape

(5314, 15)

In [0]:
# MODEL

# model = Sequential()
# model.add(layers.Dense(256, kernel_regularizer=regularizers.l1(0.001), activation='relu', input_shape=(512,)))
# model.add(layers.Dropout(0.5))
# model.add(layers.Dense(256, kernel_regularizer=regularizers.l1(0.001), activation='relu'))
# model.add(layers.Dropout(0.5))
# model.add(layers.Dense(9, activation='softmax'))

model = Sequential()
model.add(layers.Dense(256, activation='relu', input_shape=(512,)))
model.add(layers.Dense(256, activation='relu'))
model.add(layers.Dense(y_test.shape[1], activation='softmax'))

In [91]:
# FIT / TRAIN model

NumEpochs = 20
BatchSize = 32

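# NOTE: the labels are one-hot and the output layer is softmax, so
# 'categorical_crossentropy' is the matching loss. With 'binary_crossentropy',
# Keras of this vintage reports per-element binary accuracy for 'accuracy',
# which inflates the score printed below.
model.compile(optimizer='rmsprop',loss='binary_crossentropy', metrics=['accuracy'])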
history = model.fit(X_train, y_train, epochs=NumEpochs, batch_size=BatchSize, validation_data=(X_test, y_test))

results = model.evaluate(X_test, y_test)
print("_"*100)
print("Test Loss and Accuracy")
print("results ", results)

history_dict = history.history
history_dict.keys()

Train on 10789 samples, validate on 5314 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
____________________________________________________________________________________________________
Test Loss and Accuracy
('results ', [0.1099202516925124, 0.9730272465152403])


['acc', 'loss', 'val_acc', 'val_loss']
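
The accuracy above is far higher than the figures quoted in the question, which is worth a sanity check. As far as I know, Keras releases from this period resolve `metrics=['accuracy']` to element-wise binary accuracy when the loss is `binary_crossentropy`. The NumPy sketch below does not call Keras; it re-implements both metrics on made-up data to show how much an uninformative 15-class predictor scores under each one.

```python
import numpy as np

def binary_accuracy(y_true, y_pred):
    """Fraction of individual one-hot entries matched after rounding."""
    return float(np.mean(y_true == np.round(y_pred)))

def categorical_accuracy(y_true, y_pred):
    """Fraction of rows whose highest-scoring class is the true class."""
    return float(np.mean(y_true.argmax(axis=1) == y_pred.argmax(axis=1)))

n_classes = 15
rng = np.random.default_rng(0)
labels = rng.integers(0, n_classes, size=1000)
y_true = np.eye(n_classes)[labels]

# An uninformative predictor: the same uniform probabilities for every row.
y_pred = np.full((1000, n_classes), 1.0 / n_classes)

bin_acc = binary_accuracy(y_true, y_pred)
cat_acc = categorical_accuracy(y_true, y_pred)
print("binary accuracy:      %.4f" % bin_acc)  # 0.9333
print("categorical accuracy: %.4f" % cat_acc)
```

Every rounded prediction is 0, which matches 14 of the 15 entries in each row, so binary accuracy starts at about 93% before the model has learned anything. Comparing `argmax` of `predictions` with `argmax` of `y_test` gives the multi-class accuracy the question asks for.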

In [92]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_12 (Dense)             (None, 256)               131328    
_________________________________________________________________
dense_13 (Dense)             (None, 256)               65792     
_________________________________________________________________
dense_14 (Dense)             (None, 15)                3855      
Total params: 200,975
Trainable params: 200,975
Non-trainable params: 0
_________________________________________________________________


In [93]:
# PREDICT

predictions = model.predict(X_test)
# Each entry in predictions is a vector of length 15 (one probability per one-hot column)
print(predictions.shape)

result = y_test
# result['Prediction'] = predictions


# pd.DataFrame({'Predictions':round(predictions), 'Actual': y_test})

(5314, 15)


In [94]:
result.shape

(5314, 15)

In [95]:
predictions.shape

(5314, 15)

In [96]:
result[:10]

array([[0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.]],
      dtype=float32)

In [97]:
predictions

array([[1.52687899e-12, 1.35028069e-08, 4.63556034e-11, ...,
        1.67272862e-08, 1.68991501e-05, 1.12824591e-06],
       [3.59618400e-19, 7.08826109e-10, 1.55839976e-11, ...,
        2.39691245e-09, 3.69294706e-09, 2.86776245e-01],
       [1.74036138e-19, 1.18500370e-13, 1.98924679e-11, ...,
        5.92902650e-15, 8.03810794e-13, 8.58102894e-11],
       ...,
       [9.34188733e-17, 3.56163739e-08, 4.80010059e-11, ...,
        1.42993195e-11, 6.11410200e-10, 4.43464751e-06],
       [3.78054098e-17, 9.99999762e-01, 6.31302843e-10, ...,
        1.91707136e-11, 2.63803302e-11, 8.04879330e-10],
       [4.38841299e-14, 8.07706840e-07, 1.06651126e-03, ...,
        9.43406908e-10, 4.03907325e-06, 2.17337068e-02]], dtype=float32)

In [98]:
predict_df = pd.DataFrame(predictions)
print(predict_df.head())

             0             1             2             3             4   \
0  1.526879e-12  1.350281e-08  4.635560e-11  4.556976e-06  9.996707e-01   
1  3.596184e-19  7.088261e-10  1.558400e-11  3.685557e-10  3.983215e-07   
2  1.740361e-19  1.185004e-13  1.989247e-11  1.025595e-11  1.686405e-05   
3  6.181573e-20  9.265514e-15  2.765942e-12  1.087064e-13  1.226413e-09   
4  7.904010e-15  6.238552e-08  9.443553e-10  3.064893e-08  2.234280e-05   

             5             6             7         8             9   \
0  1.884628e-12  9.942929e-09  1.811043e-12  0.000200  5.401257e-06   
1  2.171632e-19  4.909606e-13  3.008544e-19  0.713223  3.272156e-17   
2  7.406608e-20  2.426437e-14  1.854507e-19  0.999983  5.031192e-10   
3  1.583277e-20  1.380037e-13  3.066609e-20  1.000000  1.776554e-10   
4  4.073388e-15  5.437043e-09  1.091543e-14  0.999976  3.801667e-09   

             10            11            12            13            14  
0  7.909071e-05  2.170902e-05  1.672729e-08  1.6

In [99]:
predict_df.round().astype(int)

   0  1  2  3  4  5  6  7  8  9  10  11  12  13  14
0  0  0  0  0  1  0  0  0  0  0   0   0   0   0   0
1  0  0  0  0  0  0  0  0  1  0   0   0   0   0   0
2  0  0  0  0  0  0  0  0  1  0   0   0   0   0   0
3  0  0  0  0  0  0  0  0  1  0   0   0   0   0   0
4  0  0  0  0  0  0  0  0  1  0   0   0   0   0   0
5  0  0  0  0  0  0  0  0  1  0   0   0   0   0   0
6  0  0  0  0  0  0  0  0  1  0   0   0   0   0   0
7  0  0  0  0  0  0  0  0  1  0   0   0   0   0   0
8  0  0  0  0  0  0  0  0  1  0   0   0   0   0   0
9  0  0  0  0  0  0  0  0  0  0   0   0   0   0   1


# Summary
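 * Loaded data into a pandas DataFrame
 * Initial sample size of 10% to reduce processing time
 * Universal Sentence Encoder embeddings from TensorFlow Hub
 * One-hot encoded authors for the model
 * Built an NN model with 3 dense layers
 * Val_acc around 96.5% on 10% of data
 * Re-ran everything with a 20% sample and val_acc improved to 98.2%
 * Caveat: the model was compiled with `binary_crossentropy`, so these figures are most likely per-element binary accuracy over the one-hot columns, not multi-class accuracy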

# Other / Old Code

In [80]:
print(df.head())

                                                text  author
0  ou have time to listen i will give you the ent...       1
1  wish for solitude he was twenty years of age a...       1
2  and the skirt blew in perfect freedom about th...       1
3  of san and the rows of shops opposite impresse...       1
4  an hour s walk was as tiresome as three in a s...       1


In [0]:
X = df['text']
y = df['author']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)


In [0]:
# Step 3 - Create datasets (Only take up to 150 words)
# train_text = X_train.tolist()
# train_text = [' '.join(t.split()[0:150]) for t in train_text]
# train_text = np.array(train_text, dtype=object)[:, np.newaxis]
# train_label = y_train.tolist()

# test_text = X_test.tolist()
# test_text = [' '.join(t.split()[0:150]) for t in test_text]
# test_text = np.array(test_text, dtype=object)[:, np.newaxis]
# test_label = y_test.tolist()

In [0]:
# # Step 4 - Get and initialize ELMo
# elmo_model = hub.Module("https://tfhub.dev/google/elmo/2", trainable=True)
# sess.run(tf.global_variables_initializer())
# sess.run(tf.tables_initializer())

In [0]:
# # Step 5 - Define a Keras-compatible ElmoEmbedding Layer
# def ElmoEmbedding(x):
#     return elmo_model(tf.squeeze(tf.cast(x, tf.string)), signature="default",
#                       as_dict=True)["default"]

In [0]:
# # Step 6 - Build and train! Or buy more compute...
# input_text = layers.Input(shape=(1,), dtype=tf.string)
# embedding = layers.Lambda(ElmoEmbedding, output_shape=(1024,))(input_text)
# dense = layers.Dense(256, activation='relu')(embedding)
# pred = layers.Dense(1, activation='sigmoid')(dense)

# model = Model(inputs=[input_text], outputs=pred)

# model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# model.summary()

# model.fit(train_text, 
#           train_label,
#           validation_data=(test_text, test_label),
#           epochs=5,
#           batch_size=32)