<b>Google Colab</b> <a href="https://colab.research.google.com/github/kirillzyusko/deeplearning/blob/master/7/lab.ipynb">link</a>

Authorize Google and Kaggle

In [3]:
from googleapiclient.discovery import build
import io, os
from googleapiclient.http import MediaIoBaseDownload
from google.colab import auth
auth.authenticate_user()
drive_service = build('drive', 'v3')
results = drive_service.files().list(
        q="name = 'kaggle.json'", fields="files(id)").execute()
kaggle_api_key = results.get('files', [])
filename = "/content/.kaggle/kaggle.json"
os.makedirs(os.path.dirname(filename), exist_ok=True)
request = drive_service.files().get_media(fileId=kaggle_api_key[0]['id'])
fh = io.FileIO(filename, 'wb')
downloader = MediaIoBaseDownload(fh, request)
done = False
while done is False:
    status, done = downloader.next_chunk()
    print("Download %d%%." % int(status.progress() * 100))
os.chmod(filename, 0o600)  # octal: read/write for the owner only



Download 100%.
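
The intent of the final `os.chmod` call in the cell above is to make the key file readable and writable by its owner only. Python takes the mode as a plain integer, so that permission is normally written as the octal literal `0o600`; the decimal number `600` is a different bit pattern. A minimal sketch on a throwaway temporary file (not the real `kaggle.json`), assuming a POSIX system such as the Colab VM:

```python
import os
import stat
import tempfile

# Create a throwaway file that stands in for kaggle.json (illustrative only).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    key_path = tmp.name

# 0o600 is octal: read/write for the owner, nothing for group or others.
os.chmod(key_path, 0o600)
mode_bits = stat.S_IMODE(os.stat(key_path).st_mode)
print(oct(mode_bits))

# The decimal literal 600 corresponds to a different set of permission bits.
print(oct(600))  # 0o1130

os.remove(key_path)
```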


Verify that we are authorized and have access to Kaggle:

In [4]:
%ls /content/.kaggle/

kaggle.json*


# **Part 1: Download dataset, extract**

Download dataset:

In [5]:
!mkdir -p ~/.kaggle
!cp /content/.kaggle/kaggle.json ~/.kaggle/kaggle.json
!kaggle datasets download lakshmi25npathi/imdb-dataset-of-50k-movie-reviews -p /content/kaggle/imdb

Downloading imdb-dataset-of-50k-movie-reviews.zip to /content/kaggle/imdb
 97% 25.0M/25.7M [00:00<00:00, 38.9MB/s]
100% 25.7M/25.7M [00:00<00:00, 37.6MB/s]


Extract .zip

In [6]:
!unzip kaggle/imdb/imdb-dataset-of-50k-movie-reviews.zip -d data

Archive:  kaggle/imdb/imdb-dataset-of-50k-movie-reviews.zip
  inflating: data/IMDB Dataset.csv   


Files:

In [7]:
%ls data

'IMDB Dataset.csv'


Read data using pandas:

In [8]:
import pandas as pd

df = pd.read_csv('data/IMDB Dataset.csv')
print(df.shape)

df.head()

(50000, 2)


| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


Imports

In [0]:
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import SimpleRNN
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import Bidirectional
from tensorflow.keras.layers import BatchNormalization
from tensorflow.keras.layers import Dropout

from tensorflow.keras.utils import to_categorical
from sklearn.utils import shuffle
import matplotlib.pyplot as plt

In [0]:
dictionary_length = 10000
input_length = 100

tokenizer = Tokenizer(num_words=dictionary_length)
tokenizer.fit_on_texts(df.review.values)

In [11]:
post_seq = tokenizer.texts_to_sequences(df.review.values)

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

Found 124252 unique tokens.
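
The tokenizer reports 124252 unique tokens, but because it was created with `num_words=dictionary_length`, only the most frequent words survive when reviews are converted to integer sequences. The sketch below imitates that behaviour in plain Python. It is a simplification (whitespace splitting, no punctuation filtering), not the Keras implementation, and `build_word_index` and `texts_to_sequences` are hypothetical helper names.

```python
from collections import Counter


def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_sequences(texts, word_index, num_words):
    """Keep only words whose index is below num_words; drop the rest."""
    return [[word_index[w] for w in text.lower().split()
             if w in word_index and word_index[w] < num_words]
            for text in texts]


texts = ["good good good movie", "bad movie", "good plot"]
word_index = build_word_index(texts)
sequences = texts_to_sequences(texts, word_index, num_words=3)

print(len(word_index))  # 4 distinct words in the toy corpus
print(sequences)        # [[1, 1, 1, 2], [2], [1]]
```

The full `word_index` still contains every word seen during fitting; the cut-off only applies when sequences are produced.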


In [0]:
post_seq_padded = pad_sequences(post_seq, maxlen=input_length)
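
With its default settings, `pad_sequences` pads short sequences with zeros at the front and truncates long ones from the front, so every review ends up with exactly `input_length` entries. A small plain-Python sketch of that default behaviour (the helper name `pad_pre` is hypothetical):

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items."""
    trimmed = seq[-maxlen:]
    return [value] * (maxlen - len(trimmed)) + trimmed


short = pad_pre([5, 6], maxlen=4)
long_ = pad_pre([1, 2, 3, 4, 5, 6], maxlen=4)

print(short)  # [0, 0, 5, 6]
print(long_)  # [3, 4, 5, 6]
```

For reviews longer than 100 tokens this means the model sees only the final 100 tokens.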

In [0]:
import numpy as np

x_original = post_seq_padded
x_original = np.array(x_original)

y_original = df['sentiment'].replace({ 'positive': 1, 'negative': 0 }).values
y_original = np.array(y_original)

x, y = shuffle(x_original, y_original, random_state=23)

In [0]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5, random_state=42)
x_train, x_val, y_train, y_val = train_test_split(x_train, y_train, test_size=0.2, random_state=42)
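
The two `train_test_split` calls first hold out half of the data for testing and then carve a validation set out of the remaining half. The arithmetic below works out the resulting sizes for the 50000 reviews loaded earlier; it uses integer percentages and assumes the sizes divide evenly, which they do here.

```python
n_total = 50000

# First split: test_size=0.5
n_test = n_total * 50 // 100
n_remaining = n_total - n_test

# Second split: test_size=0.2 of what is left becomes the validation set
n_val = n_remaining * 20 // 100
n_train = n_remaining - n_val

print(n_train, n_val, n_test)  # 20000 5000 25000
```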

# **Part 2: RNN with LSTM**

In [0]:
model = Sequential()
model.add(Embedding(dictionary_length, 8, input_length=input_length))
model.add(Bidirectional(LSTM(16, return_sequences=False))) # dropout=0.2, recurrent_dropout=0.2
model.add(Dense(1, activation='sigmoid'))

model.summary()

Model: "sequential_10"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_10 (Embedding)     (None, 100, 8)            80000     
_________________________________________________________________
bidirectional_18 (Bidirectio (None, 32)                3200      
_________________________________________________________________
dense_18 (Dense)             (None, 1)                 33        
Total params: 83,233
Trainable params: 83,233
Non-trainable params: 0
_________________________________________________________________
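
The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard LSTM count (four gates, each with input weights, recurrent weights and a bias) and doubles it for the bidirectional wrapper; the helper names are hypothetical.

```python
def embedding_params(vocab_size, dim):
    return vocab_size * dim


def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias vector
    return 4 * ((input_dim + units + 1) * units)


def dense_params(input_dim, units):
    return input_dim * units + units


emb = embedding_params(10000, 8)   # 80000
bilstm = 2 * lstm_params(8, 16)    # 3200 (forward + backward LSTM)
dense = dense_params(32, 1)        # 33 (input is the 2 * 16 concatenated states)

total = emb + bilstm + dense
print(emb, bilstm, dense, total)   # 80000 3200 33 83233
```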


In [0]:
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

In [0]:
model.fit(x=x_train, y=y_train, batch_size=256, verbose=1, epochs=10, validation_data=(x_val, y_val))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f47d0ec9c50>

In [0]:
model.evaluate(x_test, y_test)



[0.542818009853363, 0.8339599967002869]

# **Part 3: Using GloVe**

In [0]:
!wget "http://nlp.stanford.edu/data/glove.6B.zip"

--2020-04-03 21:06:25--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2020-04-03 21:06:26--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2020-04-03 21:06:26--  http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


2020-0

In [0]:
!mkdir glove
!unzip glove.6B.zip -d glove

Archive:  glove.6B.zip
  inflating: glove/glove.6B.50d.txt  
  inflating: glove/glove.6B.100d.txt  
  inflating: glove/glove.6B.200d.txt  
  inflating: glove/glove.6B.300d.txt  


In [0]:
glove_dir = 'glove'

embeddings_index = {}
# Each line holds a word followed by its 100 vector components.
with open(os.path.join(glove_dir, 'glove.6B.100d.txt'), encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

Found 400000 word vectors.


In [0]:
embedding_dim = 100

embedding_matrix = np.zeros((dictionary_length, embedding_dim))
for word, i in word_index.items():
    if i >= dictionary_length:
        continue  # only the most frequent words have a row in the matrix
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # Words not found in the embedding index stay all-zeros.
        embedding_matrix[i] = embedding_vector
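
The loop above leaves a zero row for every kept word that has no GloVe vector, and ignores words whose index falls outside the dictionary. A toy sketch with made-up two-dimensional vectors shows both cases and counts how many kept words are covered. All names prefixed with `toy_` and all vector values are illustrative assumptions, and plain lists stand in for the NumPy array.

```python
toy_word_index = {"good": 1, "movie": 2, "zyx": 3, "rare": 4}
toy_embeddings = {"good": [0.1, 0.2], "movie": [0.3, 0.4], "rare": [0.5, 0.6]}
toy_dictionary_length = 4  # valid row indices are 0..3
toy_dim = 2

matrix = [[0.0] * toy_dim for _ in range(toy_dictionary_length)]
covered = 0
for word, i in toy_word_index.items():
    if i >= toy_dictionary_length:
        continue  # "rare" is outside the dictionary and gets no row
    vector = toy_embeddings.get(word)
    if vector is not None:
        matrix[i] = list(vector)
        covered += 1

kept = sum(1 for i in toy_word_index.values() if i < toy_dictionary_length)
print(matrix)          # row 0 (padding) and row 3 ("zyx") stay all-zero
print(covered, kept)   # 2 3
```

Running the same count on the real `embedding_matrix` would show how much of the 10000-word dictionary GloVe actually covers.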

In [0]:
model = Sequential()
model.add(Embedding(dictionary_length, embedding_dim, input_length=input_length))
model.add(Bidirectional(LSTM(16))) # dropout=0.2, recurrent_dropout=0.2
model.add(Dense(1, activation='sigmoid'))

model.summary()

model.layers[0].set_weights([embedding_matrix])
model.layers[0].trainable = False

# try using different optimizers and different optimizer configs
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

model.fit(x_train, y_train, batch_size=32, epochs=10, verbose=1, validation_data=(x_val, y_val))
score, acc = model.evaluate(x_test, y_test)

print('Test accuracy:', acc)

Model: "sequential_13"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_11 (Embedding)     (None, 100, 100)          1000000   
_________________________________________________________________
bidirectional_19 (Bidirectio (None, 32)                14976     
_________________________________________________________________
dense_19 (Dense)             (None, 1)                 33        
Total params: 1,015,009
Trainable params: 1,015,009
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test accuracy: 0.8490399718284607


# **Part 4: New NN architecture**

In [0]:
model = Sequential()
model.add(Embedding(dictionary_length, embedding_dim, input_length=input_length))
model.add(Bidirectional(LSTM(32, return_sequences=True))) # dropout=0.2, recurrent_dropout=0.2
model.add(Bidirectional(LSTM(32, return_sequences=True)))
model.add(Bidirectional(LSTM(32)))
model.add(Dense(128, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

model.summary()

model.layers[0].set_weights([embedding_matrix])
model.layers[0].trainable = False

# try using different optimizers and different optimizer configs
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

model.fit(x_train, y_train, batch_size=32, epochs=10, verbose=1, validation_data=(x_val, y_val))
score, acc = model.evaluate(x_test, y_test)

print('Test accuracy:', acc)

Model: "sequential_14"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_12 (Embedding)     (None, 100, 100)          1000000   
_________________________________________________________________
bidirectional_20 (Bidirectio (None, 100, 64)           34048     
_________________________________________________________________
bidirectional_21 (Bidirectio (None, 100, 64)           24832     
_________________________________________________________________
bidirectional_22 (Bidirectio (None, 64)                24832     
_________________________________________________________________
dense_20 (Dense)             (None, 128)               8320      
_________________________________________________________________
dense_21 (Dense)             (None, 32)                4128      
_________________________________________________________________
dense_22 (Dense)             (None, 1)               

# **Part 5: DeepMoji**


In [15]:
!git clone https://github.com/bfelbo/DeepMoji.git

Cloning into 'DeepMoji'...
remote: Enumerating objects: 281, done.
remote: Total 281 (delta 0), reused 0 (delta 0), pack-reused 281
Receiving objects: 100% (281/281), 110.54 MiB | 33.72 MiB/s, done.
Resolving deltas: 100% (142/142), done.
Checking out files: 100% (66/66), done.


In [16]:
%cd DeepMoji

/content/DeepMoji


In [17]:
%ls

data/      emoji_overview.png  examples/  model/     scripts/  tests/
deepmoji/  emoji_unicode.csv   LICENSE    README.md  setup.py


In [18]:
import sys
from os.path import abspath, dirname

sys.path.insert(0, '/content/DeepMoji')
sys.path.insert(0, '/content/DeepMoji/deepmoji')
sys.path.insert(0, '/content/DeepMoji/examples')

print(sys.path)

['/content/DeepMoji/examples', '/content/DeepMoji/deepmoji', '/content/DeepMoji', '', '/env/python', '/usr/lib/python36.zip', '/usr/lib/python3.6', '/usr/lib/python3.6/lib-dynload', '/usr/local/lib/python3.6/dist-packages', '/usr/lib/python3/dist-packages', '/usr/local/lib/python3.6/dist-packages/IPython/extensions', '/root/.ipython']


In [23]:
%cd scripts

/content/DeepMoji/scripts


In [19]:
from __future__ import print_function
import os
from subprocess import call

curr_folder = os.path.basename(os.path.normpath(os.getcwd()))

weights_filename = 'deepmoji_weights.hdf5'
weights_folder = 'model'
weights_path = '{}/{}'.format(weights_folder, weights_filename)
if curr_folder == 'scripts':
    weights_path = '../' + weights_path
weights_download_link = 'https://www.dropbox.com/s/xqarafsl6a8f9ny/deepmoji_weights.hdf5?dl=0#'


MB_FACTOR = float(1 << 20)


def prompt():
    while True:
        valid = {
            'y': True,
            'ye': True,
            'yes': True,
            'n': False,
            'no': False,
        }
        if 'TRAVIS' in os.environ:
            choice = 'yes'
        else:
            choice = input().lower()
        if choice in valid:
            return valid[choice]
        else:
            print('Please respond with \'y\' or \'n\' (or \'yes\' or \'no\')')


download = True
if os.path.exists(weights_path):
    print('Weight file already exists at {}. Would you like to redownload it anyway? [y/n]'.format(weights_path))
    download = prompt()
    already_exists = True
else:
    already_exists = False

if download:
    print('About to download the pretrained weights file from {}'.format(weights_download_link))
    if not already_exists:
        print('The size of the file is roughly 85MB. Continue? [y/n]')
    else:
        os.unlink(weights_path)

    if already_exists or prompt():
        print('Downloading...')

        # urllib.urlretrieve(weights_download_link, weights_path)
        # with open(weights_path,'wb') as f:
        #     f.write(requests.get(weights_download_link).content)

        # downloading using wget due to issues with urlretrieve and requests
        sys_call = 'wget {} -O {}'.format(weights_download_link, os.path.abspath(weights_path))
        print("Running system call: {}".format(sys_call))
        call(sys_call, shell=True)

        if os.path.getsize(weights_path) / MB_FACTOR < 80:
            raise ValueError("Download finished, but the resulting file is too small! " +
                             "It\'s only {} bytes.".format(os.path.getsize(weights_path)))
        print('Downloaded weights to {}'.format(weights_path))
else:
    print('Exiting.')

About to download the pretrained weights file from https://www.dropbox.com/s/xqarafsl6a8f9ny/deepmoji_weights.hdf5?dl=0#
The size of the file is roughly 85MB. Continue? [y/n]
y
Downloading...
Running system call: wget https://www.dropbox.com/s/xqarafsl6a8f9ny/deepmoji_weights.hdf5?dl=0# -O /content/DeepMoji/model/deepmoji_weights.hdf5
Downloaded weights to model/deepmoji_weights.hdf5


In [20]:
%ls

data/      emoji_overview.png  examples/  model/     scripts/  tests/
deepmoji/  emoji_unicode.csv   LICENSE    README.md  setup.py


In [5]:
!pip3 uninstall -y tensorflow
!pip3 install tensorflow==1.13.1

Uninstalling tensorflow-2.2.0rc2:
  Successfully uninstalled tensorflow-2.2.0rc2
Collecting tensorflow==1.13.1
  Downloading https://files.pythonhosted.org/packages/77/63/a9fa76de8dffe7455304c4ed635be4aa9c0bacef6e0633d87d5f54530c5c/tensorflow-1.13.1-cp36-cp36m-manylinux1_x86_64.whl (92.5MB)
     |████████████████████████████████| 92.5MB 63kB/s 
Collecting tensorflow-estimator<1.14.0rc0,>=1.13.0
  Downloading https://files.pythonhosted.org/packages/bb/48/13f49fc3fa0fdf916aa1419013bb8f2ad09674c275b4046d5ee669a46873/tensorflow_estimator-1.13.0-py2.py3-none-any.whl (367kB)
     |████████████████████████████████| 368kB 27.3MB/s 
Collecting tensorboard<1.14.0,>=1.13.0
  Downloading https://files.pythonhosted.org/packages/0f/39/bdd75b08a6fba41f098b6cb091b9e8c7a80e1b4d679a581a0ccd17b10373/tensorboard-1.13.1-py3-none-any.whl (3.2MB)
     |████████████████████████████████| 3.2MB 35.7MB/s 
Collecting mock>=2.0.0
  Downloading https://files.pythonhosted.org/packages/cd/7

In [21]:
import tensorflow as tf
print(tf.__version__)

1.13.1


In [0]:
"""Trains the DeepMoji architecture on the IMDB sentiment classification task.
   This is a simple example of using the architecture without the pretrained model.
   The architecture is designed for transfer learning - it should normally
   be used with the pretrained model for optimal performance.
"""
from __future__ import print_function
import numpy as np
from keras.preprocessing import sequence
from keras.datasets import imdb
from deepmoji.model_def import deepmoji_architecture

# Seed for reproducibility
np.random.seed(1337)

batch_size = 256

print('Build model...')
model = deepmoji_architecture(nb_classes=2, nb_tokens=dictionary_length, maxlen=input_length)
model.summary()

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

print('Train...')
model.fit(x_train, y_train, batch_size=batch_size, epochs=10,
          validation_data=(x_val, y_val))
score, acc = model.evaluate(x_test, y_test, batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)

Build model...
Instructions for updating:
Colocations handled automatically by placer.


Using TensorFlow backend.


Model: "DeepMoji"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, 100)          0                                            
__________________________________________________________________________________________________
embedding (Embedding)           (None, 100, 256)     2560000     input_1[0][0]                    
__________________________________________________________________________________________________
activation_1 (Activation)       (None, 100, 256)     0           embedding[0][0]                  
__________________________________________________________________________________________________
bi_lstm_0 (Bidirectional)       (None, 100, 1024)    3149824     activation_1[0][0]               
___________________________________________________________________________________________