First, pull in the data from Dropbox.

In [1]:
!wget https://www.dropbox.com/s/viz3bmc8cil4w1y/train_data.csv?dl=1
!wget https://www.dropbox.com/s/07wu5by7llczd36/test_data.csv?dl=1
!wget https://www.dropbox.com/s/80gdinsmalrrcll/sample_solution.csv?dl=1

--2023-04-19 04:23:23--  https://www.dropbox.com/s/viz3bmc8cil4w1y/train_data.csv?dl=1
Resolving www.dropbox.com (www.dropbox.com)... 162.125.5.18, 2620:100:601d:18::a27d:512
Connecting to www.dropbox.com (www.dropbox.com)|162.125.5.18|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: /s/dl/viz3bmc8cil4w1y/train_data.csv [following]
--2023-04-19 04:23:23--  https://www.dropbox.com/s/dl/viz3bmc8cil4w1y/train_data.csv
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc2ec34981fb0dec39ba29049f80.dl.dropboxusercontent.com/cd/0/get/B6dLi9Jua8JzJWmnr5hDuUCugrKJwsSF8Zpw6Gh46NCO3O2IdCIfy0iVVJNDeAoXX76V6oC2x1bJGVVf4Nf0xRYvwarRBrkVcx-yO6P0t0zfjIQ8jLZSbKYC_YfQQGGH6grtWu5zs2jcinvXBGLZL9lxZbSkiW48dlnVbgW8CbcAKQ/file?dl=1# [following]
--2023-04-19 04:23:24--  https://uc2ec34981fb0dec39ba29049f80.dl.dropboxusercontent.com/cd/0/get/B6dLi9Jua8JzJWmnr5hDuUCugrKJwsSF8Zpw6Gh46NCO3O2IdCIfy0iVVJNDeAoXX76V6oC2

**Read in the data**

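First we read in the files that hold the training and test data. To get a feel for what's inside, we select the first 5 rows and then the full text of the first document. A notebook cell displays only its last expression, so only that text appears in the output: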

In [2]:
import pandas as pd
train_data = pd.read_csv('train_data.csv?dl=1', encoding = 'latin-1')
test_data = pd.read_csv('test_data.csv?dl=1', encoding = 'latin-1')
train_data.iloc[0:5]
train_data.Text.iloc[0]

'From pvconway cudnvr denver colorado edu Subject TIN files coutours Lines 15 Hi I am working on a project that needs to create contour lines from random data points The work that I have done so far tells me that I need to look into Triangulated Irregular Networks TIN the Delauney criiterion and the Krige method Does anyone have any suggestions for references programs and hopefully source code for creating contours Any help with this or any surface modeling would be greatly appreciated I can be reached at the addresses below Paul Conway PVCONWAY COPPER DENVER COLORADO EDU PVCONWAY CUDNVR DENVER COLORADO EDU'

**Preprocess the text**

Now we'll process the text and convert the words to sequences of integers, keeping only the 15,000 most common words (`max_words` below).

In [3]:
from tensorflow.keras.preprocessing.text import Tokenizer
texts = train_data['Text']
test_sentences = test_data['Text']
labels = train_data['Label']

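# RNN specific
max_words = 15000  # vocabulary size limit passed to the tokenizer
token = Tokenizer(num_words=max_words)
token.fit_on_texts(texts)
vocab_size = max_words + 1  # embedding input size; index 0 is reserved for padding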

sequences = token.texts_to_sequences(texts)
test_sequences = token.texts_to_sequences(test_sentences)



In [4]:
sequences[0]

[14,
 2375,
 567,
 15,
 29,
 757,
 364,
 32,
 211,
 599,
 8,
 127,
 587,
 16,
 4,
 902,
 10,
 816,
 2,
 1048,
 12264,
 32,
 14,
 1726,
 234,
 716,
 1,
 175,
 10,
 8,
 21,
 378,
 56,
 319,
 2302,
 63,
 10,
 8,
 174,
 2,
 255,
 135,
 13159,
 3156,
 757,
 1,
 6,
 1,
 1316,
 108,
 171,
 21,
 62,
 1569,
 12,
 1628,
 578,
 6,
 3032,
 446,
 413,
 12,
 3321,
 62,
 197,
 22,
 17,
 25,
 62,
 1782,
 9166,
 48,
 18,
 1680,
 1015,
 8,
 39,
 18,
 3094,
 33,
 1,
 2532,
 1117,
 535,
 7020,
 2375,
 567,
 15,
 2375,
 567,
 15]
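
To make the word-to-integer mapping concrete, here is a minimal pure-Python sketch of the idea: rank words by frequency, give the most frequent word index 1, and drop any word whose rank falls outside the vocabulary limit. The helpers `build_word_index` and `to_sequence` are hypothetical names, and the lowercasing and whitespace splitting are simplifying assumptions. The Keras `Tokenizer` does more (for example, it filters punctuation), so treat this as an illustration rather than a reimplementation.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to a rank: 1 for the most frequent word, 2 for the next, and so on."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() sorts by descending count
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequence(text, word_index, num_words):
    """Convert text to integers, dropping unknown words and ranks >= num_words."""
    ranks = (word_index.get(word) for word in text.lower().split())
    return [r for r in ranks if r is not None and r < num_words]

toy_texts = ["the cat sat", "the cat ran", "the dog"]
toy_index = build_word_index(toy_texts)
toy_seq = to_sequence("the dog sat", toy_index, num_words=3)
print(toy_index["the"], toy_index["cat"])
print(toy_seq)
```

In this toy corpus "the" and "cat" are the two most frequent words, so with `num_words=3` only they survive; "dog" and "sat" are dropped from the sequence. This is why small integers such as 1, 2, and 8 appear so often in `sequences[0]`: they stand for very common words.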

**Pad the sequences**

Next we zero-pad the sequences: short sequences have 0s prepended so that each sequence is exactly 100 integers long. Note that long sequences are trimmed as well. With the default `truncating='pre'`, `pad_sequences` drops tokens from the start, so only the last 100 words are kept.

In [5]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
seq_len = 100
X = pad_sequences(sequences, maxlen=seq_len)
X_test = pad_sequences(test_sequences, maxlen=seq_len)
print(X.shape)
print(X[0])
print(len(sequences[0]))

(11314, 100)
[    0     0     0     0     0     0     0     0     0    14  2375   567
    15    29   757   364    32   211   599     8   127   587    16     4
   902    10   816     2  1048 12264    32    14  1726   234   716     1
   175    10     8    21   378    56   319  2302    63    10     8   174
     2   255   135 13159  3156   757     1     6     1  1316   108   171
    21    62  1569    12  1628   578     6  3032   446   413    12  3321
    62   197    22    17    25    62  1782  9166    48    18  1680  1015
     8    39    18  3094    33     1  2532  1117   535  7020  2375   567
    15  2375   567    15]
91
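
The first document has 91 tokens, which is why 9 zeros appear at the front of `X[0]`. The sketch below mimics the default behaviour of `pad_sequences` (`padding='pre'`, `truncating='pre'`) on toy lists in plain Python. The helper name `pad_or_truncate` is hypothetical and only illustrates the rule; it is not how Keras implements it.

```python
def pad_or_truncate(seq, maxlen, value=0):
    """Mimic the pad_sequences defaults: pad at the front, truncate from the front."""
    if len(seq) >= maxlen:
        # Too long: keep only the LAST maxlen items
        return seq[len(seq) - maxlen:]
    # Too short: prepend the padding value
    return [value] * (maxlen - len(seq)) + seq

short_seq = pad_or_truncate([7, 8, 9], maxlen=5)
long_seq = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short_seq)  # [0, 0, 7, 8, 9]
print(long_seq)   # [3, 4, 5, 6, 7]
```

To keep the beginning of long documents instead, `pad_sequences` accepts `truncating='post'`.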


**Make y one-hot**

Now we will convert the label to a one-hot representation:

In [6]:
import numpy as np
## Make y one-hot ##
y = np.zeros( (len(labels), len(np.unique(labels)) ) )
for l in np.unique(labels):
    pos_inds = np.where(labels == l)[0]
    y[pos_inds,l] = 1

num_classes = y.shape[1]
print(y.shape)

(11314, 20)
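
The loop above can also be written as a single indexing operation. The sketch below shows the idea on a toy label array. It assumes the labels are integers running from 0 to K-1, and the names `toy_labels` and `toy_y` are made up for illustration.

```python
import numpy as np

toy_labels = np.array([2, 0, 1, 2])        # hypothetical integer class labels
toy_num_classes = int(toy_labels.max()) + 1  # assumes labels run from 0 to K-1

# Row i of the identity matrix is the one-hot vector for class i,
# so indexing it with the label array builds the whole matrix at once.
toy_y = np.eye(toy_num_classes)[toy_labels]
print(toy_y)

# Taking the argmax of each row recovers the original labels.
recovered = toy_y.argmax(axis=1)
print(recovered)
```

The same `argmax` step is how a row of predicted class probabilities can be turned back into a single predicted label.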


**Set up and train the RNN using the _functional_ API**

In [7]:
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, Dropout, Embedding, Input, LSTM

num_timesteps = X.shape[1]  # X.shape[0] = number of samples, X.shape[1] = number of time steps

## Functional API specific
inputs = Input(shape=(num_timesteps,))  # named `inputs` to avoid shadowing the built-in input()

x = Embedding(input_dim=vocab_size, output_dim=128, name='embedding')(inputs)
x = LSTM(units=128)(x)
x = Dropout(0.5)(x)
x = Dense(256, activation='relu')(x)
x = Dropout(0.5)(x)
x = Dense(units=num_classes, activation='softmax', name='output')(x)  # one unit per class

model = Model(inputs=inputs, outputs=x)

In [8]:
model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 100)]             0         
                                                                 
 embedding (Embedding)       (None, 100, 128)          1920128   
                                                                 
 lstm (LSTM)                 (None, 128)               131584    
                                                                 
 dropout (Dropout)           (None, 128)               0         
                                                                 
 dense (Dense)               (None, 256)               33024     
                                                                 
 dropout_1 (Dropout)         (None, 256)               0         
                                                                 
 output (Dense)              (None, 20)                5140  
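
As a check on the summary above, the parameter counts can be reproduced by hand. This sketch assumes the standard LSTM layout of four gates, each with input weights, recurrent weights, and a bias. The variable names are chosen for illustration and repeat the sizes used in the model.

```python
vocab_size = 15000 + 1   # max_words + 1, as set during preprocessing
embed_dim = 128
lstm_units = 128
dense_units = 256
num_classes = 20

# One embed_dim-sized vector per vocabulary index
embedding_params = vocab_size * embed_dim
# 4 gates, each with input weights, recurrent weights, and a bias vector
lstm_params = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)
# Dense layers: weight matrix plus bias
dense_params = lstm_units * dense_units + dense_units
output_params = dense_units * num_classes + num_classes

print(embedding_params, lstm_params, dense_params, output_params)
# 1920128 131584 33024 5140
```

The embedding layer holds most of the parameters, so changing the vocabulary size or the embedding dimension has the largest effect on model size.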

In [9]:
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(X,y,epochs=15, validation_split=0.2, batch_size = 256)


Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<keras.callbacks.History at 0x7f45700c4ee0>

In [10]:
test_preds = model.predict(X_test, batch_size=256)
submission_template = pd.read_csv('sample_solution.csv?dl=1')

for j in range(test_preds.shape[1]):
  submission_template.iloc[:,j+1] = test_preds[:,j]

submission_template.to_csv('submission.csv', index=False)





**Download the submission and submit it to Kaggle**

Download your predictions and submit them to the [Kaggle](https://www.kaggle.com/c/bmi-707-rnn/) leaderboard!

In [11]:
from google.colab import files
files.download('submission.csv')


**Possible Modifications**

- More LSTM layers
- Use a GRU instead of an LSTM
- Use dropout within the LSTM. Read the [docs](https://keras.io/layers/recurrent/#lstm) carefully for how to do this; it works differently from normal dropout.
- Add a dense layer after the LSTM
- Change the dimension of the embedding layer
- Change the preprocessing steps (sequence length, vocab size, etc.)
- Try [bidirectional](https://keras.io/layers/wrappers/#bidirectional) units