## Text Generation Using patent abstracts from patent search for `neural network`

### Files required:

1. `neural_network_patent_query.csv`

2. `train-embeddings.h5`

Copy the above files to your drive from [this](https://drive.google.com/drive/folders/1cbAesB-eejsRKdCHpnFSyXiu81Y5a5HU?usp=sharing) link.

### Read the dataset 

Use variable name for the dataframe as data

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=email%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdocs.test%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive.photos.readonly%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fpeopleapi.readonly&response_type=code

Enter your authorization code:
··········
Mounted at /content/drive


In [0]:
import pandas as pd
import numpy as np

In [0]:
data = pd.read_csv("/content/drive/My Drive/AIML/Sequence Models/neural_network_patent_query.csv")

### Check 1

In [10]:
data.head()

Unnamed: 0,patent_abstract,patent_date,patent_number,patent_title
0,""" A """"Barometer"""" Neuron enhances stability in...",1996-07-09,5535303,"""""""Barometer"""" neuron for a neural network"""
1,""" This invention is a novel high-speed neural ...",1993-10-19,5255349,"""Electronic neural network for solving """"trave..."
2,An optical information processor for use as a ...,1995-01-17,5383042,3 layer liquid crystal neural network with out...
3,A method and system for intelligent control of...,2001-01-02,6169981,3-brain architecture for an intelligent decisi...
4,A method and system for intelligent control of...,2003-06-17,6581048,3-brain architecture for an intelligent decisi...


In [11]:
data.shape

(3522, 4)

Now, all the patent abstract data is in `data['patent_abstract']`

For ease of access, assign a variable name `abstracts` to `data['patent_abstract']`

In [0]:
abstracts = data['patent_abstract']

In [13]:
print (abstracts)

0       " A ""Barometer"" Neuron enhances stability in...
1       " This invention is a novel high-speed neural ...
2       An optical information processor for use as a ...
3       A method and system for intelligent control of...
4       A method and system for intelligent control of...
5       " An unknown object is non-destructively and q...
6       " An unknown object is non-destructively and q...
7       A security system comprised of a device for mo...
8       A target recognition system and method wherein...
9       A target recognition system and method wherein...
10      A supervised procedure for obtaining weight va...
11      A method of accelerating the training of an ar...
12      An apparatus is described herein. The apparatu...
13      Approaches for accurate neural network trainin...
14      Embodiments are generally directed to neural n...
15      Systems and methods using a neural network bas...
16      A system for real-time analysis of weld qualit...
17      A meth

### Tokenize the text

Initialize the Tokenizer class with variable name `tokenizer`

Use tokenizer.fit_on_texts(`<list of texts>`) on `abstracts`

In [0]:
from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(lower=True,filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n')

In [0]:
tokenizer.fit_on_texts(abstracts)

### Run the below code to extract insights from tokenizer

In [0]:
word_idx = tokenizer.word_index
idx_word = tokenizer.index_word
word_counts = tokenizer.word_counts

### Total no.of words in given dataset acc. to Tokenizer

Run the below code

In [0]:
num_words = len(word_idx) + 1

In [27]:
print ("Total no.of words = %d" %(num_words))

Total no.of words = 11755


### The given pre-trained model `train-embeddings.h5` is on 16192 tokens, hence take num_words as 16192 to use the pre-trained model.

Run the below code.

In [0]:
num_words = 16192

### Encode words to integers using texts_to_sequences in keras

Use variable name `sequences`

In [30]:
sequences = tokenizer.texts_to_sequences(abstracts)
#print(sequences)

[[2, 5727, 54, 3123, 2026, 9, 2, 7, 6, 17, 26, 118, 53, 25, 2, 1380, 373, 1520, 17, 2960, 96, 8405, 5, 200, 1380, 794, 9, 2, 3791, 1380, 1955, 1440, 1, 5727, 54, 221, 25, 2, 6714, 3538, 16, 228, 17, 120, 26, 8406, 2, 4088, 3791, 4, 1380, 5, 2, 885, 353, 25, 2, 4509, 3539, 3, 3791, 4, 1380, 22, 82, 2, 240, 171, 160, 3, 2961, 1, 5727, 54, 2357, 5, 1, 17, 86, 1331, 173, 86, 1138, 1, 160, 3, 2961, 3, 1, 4509, 488, 4, 217, 2, 537, 1349, 16, 2961, 18, 27, 5, 1, 17, 22, 1621, 1, 17, 5, 2, 171, 49, 31, 16, 952, 180, 886, 80, 1, 228, 160, 3, 2961, 3, 1, 4509, 488], [80, 71, 8, 2, 967, 187, 366, 7, 6, 34, 116, 10, 1622, 1, 3339, 8407, 4, 93, 625, 496, 832, 148, 89, 2, 967, 855, 308, 904, 2, 407, 198, 135, 703, 172, 1745, 1, 644, 388, 3, 1, 515, 55, 25, 1, 84, 3, 8408, 5, 23, 5001, 1, 135, 8, 2794, 13, 218, 1407, 191, 165, 55, 25, 1408, 1, 116, 1745, 114, 468, 258, 79, 21, 3, 22, 3792, 215, 3, 1, 515, 2095, 4, 905, 2358, 22, 2453, 48, 968], [11, 222, 48, 116, 10, 108, 25, 2, 162, 61, 771, 89, 2, 

### Consider only abstracts greater than 70 words

Run the below code

In [0]:
seq_lengths = [len(x) for x in sequences]
over_idx = [i for i, l in enumerate(seq_lengths) if l > 70]

new_texts = []
new_sequences = []

# Only keep sequences with more than training length tokens
for i in over_idx:
    new_texts.append(abstracts[i])
    new_sequences.append(sequences[i])

Now, we have abstracts in new_texts and words encoded to integers in new_sequences.

### Generate features and labels

If trainining_length is 50, take every 49 sequence as feature and every last word of each 50 sequence as label.

Run the below code to generate features and labels.

In [0]:
features = []
labels = []

training_length = 50
# Iterate through the sequences of tokens
for each_sequence in new_sequences:
    
    # Create multiple training examples from each sequence
    for i in range(training_length, len(each_sequence)):
        # Extract the features and label
        extract = each_sequence[i - training_length: i + 1]
        
        # Set the features and label
        features.append(extract[:-1])
        labels.append(extract[-1])

In [37]:
print("There are %d sequences." %(len(features)))

There are 294130 sequences.


### Split into train and validation sets

1. Shuffle the features and labels accordingly.

2. Split into train and validation sets. Use variable names X_train, X_valid, y_train and y_valid accordingly. Consider 70:30

3. Convert y_train and y_valid to one-hot encodings

In [0]:
#from sklearn.utils import shuffle
#features,labels = shuffle(features, labels, random_state=42)

In [0]:
from sklearn.model_selection import train_test_split

In [0]:
X_train,X_valid,y_train_labels,y_valid_labels = train_test_split(features, labels, test_size=0.3)

In [54]:
print(features[0])
print(labels[0])

[55, 25, 26, 24, 389, 1081, 8, 3948, 5, 11, 60, 4909, 9, 1, 5601, 41, 63, 22, 9, 1078, 50, 1, 6438, 63, 5, 5602, 3807, 55, 25, 3325, 177, 4, 1006, 94, 19, 395, 1844, 788, 255, 2781, 2662, 5603, 167, 753, 1845, 4, 16, 146, 65, 255]
5


In [55]:
print(X_train[0])
print(y_train[0])

[13, 1108, 27, 42, 10, 1, 15, 50, 13, 1, 423, 6, 335, 13, 204, 21, 73, 5, 2, 54, 1, 124, 3, 1, 2021, 45, 323, 57, 55, 25, 572, 50, 6388, 24, 1, 45, 915, 39, 23, 257, 95, 51, 1, 492, 85, 3, 1, 423, 6, 1]
1


In [0]:
# Using int8 for memory savings
y_train = np.zeros((len(y_train_labels), num_words), dtype=np.int8)
y_valid = np.zeros((len(y_valid_labels), num_words), dtype=np.int8)

# One hot encoding of labels
for example_index, word_index in enumerate(y_train_labels):
    y_train[example_index, word_index] = 1

for example_index, word_index in enumerate(y_valid_labels):
    y_valid[example_index, word_index] = 1

### Check 2

Run the below code to check some features and their corresponding labels.

In [57]:
for i, sequence in enumerate(X_train[:5]):
    text = []
    for idx in sequence:
        text.append(idx_word[idx])
        
    print('Features: ' + ' '.join(text) + '\n')
    print('Label: ' + idx_word[np.argmax(y_train[i])] + '\n')

Features: by executing signal processing for the input signals by the recurrent network formed by units each corresponding to a neuron the features of the sequential time series pattern such as voice signals fluctuating on the time axis can be extracted through learning the coupling state of the recurrent network the

Label: present

Features: pressure a neural network is trained with occlusion obtained data whereafter the trained coefficients are utilized to implement the wedge pressure estimator a flow directed catheter is utilized to transduce the pressure waveform which is then input to the processing computer through a analog to digital data acquisition board the

Label: data

Features: also be carried out in the spatial domain by omitting the fourier transformation and power spectrum calculation the roi is then scaled for input into a neural network trained to detect microcalcifications the neural network outputs rois with detected microcalcifications the method and system can al

### Build Model

#### Consider the following details while building the model.

Embedding dimension = 100

64 LSTM cells in one layer with return_sequences as `False`

Fully connected layer with 64 units on top of LSTM

'relu' activation

Drop out for regularization

Output Dense layer with size of num_words for matching the size of one-hot encoding of each word

'softmax' activation

Categorical cross entropy loss

Metric accuracy

In [0]:
from keras.models import Sequential, load_model
from keras.layers import LSTM, Dense, Dropout, Embedding, Masking, Bidirectional
from keras.optimizers import Adam

In [59]:
model = Sequential()

# Embedding layer
model.add(
    Embedding(
        input_dim=len(word_idx) + 1,
        output_dim=100,
        weights=None,
        trainable=True))

# Recurrent layer
model.add(
    LSTM(
        64, return_sequences=False, dropout=0.1,
        recurrent_dropout=0.1))

# Fully connected layer
model.add(Dense(64, activation='relu'))

# Dropout for regularization
model.add(Dropout(0.5))

# Output layer
model.add(Dense(num_words, activation='softmax'))

# Compile the model
model.compile(
    optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, None, 100)         1175500   
_________________________________________________________________
lstm_1 (LSTM)                (None, 64)                42240     
_________________________________________________________________
dense_1 (Dense)              (None, 64)                4160      
_________________________________________________________________
dropout_1 (Dropout)          (None, 64)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 16192)             1052480   
Total params: 2,274,380
Trainable params: 2,274,380
Non-trainable params: 0
_________________________________________________________________


In [62]:
X_train = np.array(X_train)
X_valid = np.array(X_valid)
print (X_train.shape)

(205891, 50)


In [63]:
h = model.fit(X_train, y_train, epochs = 20, batch_size = 2000, 
          validation_data = (X_valid, y_valid), 
          verbose = 1)

Train on 205891 samples, validate on 88239 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<!-- ### Training -->

### Load in Pre-Trained Model
We can load in a model trained for 150 epochs and train this model for another 20 epochs.

1. Import `load_model` from `keras.models`

2. Load the model file `train-embeddings.h5` using `load_model`. Use variable name `model`.

3. Do model.fit() on training and validation sets to train the model. Consider batch_size as 2048 and epochs as 20.

In [65]:
from keras.models import load_model

# Load in model and demonstrate training
model = load_model('/content/drive/My Drive/AIML/Sequence Models/train-embeddings.h5')
h = model.fit(X_train, y_train, epochs = 4, batch_size = 2048, 
          validation_data = (X_valid, y_valid), 
          verbose = 1)

Train on 205891 samples, validate on 88239 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


### Generate Text 

Run this to check the text output by the model. This function randomly generates input of length 50 words for the model and then generates the next 50 words. 

In [66]:
seed_length=50
new_words=50
diversity=1
n_gen=1

import random

# Choose a random sequence
seq = random.choice(sequences)

# Choose a random starting point
seed_idx = random.randint(0, len(seq) - seed_length - 10)
# Ending index for seed
end_idx = seed_idx + seed_length

gen_list = []

for n in range(n_gen):
    # Extract the seed sequence
    seed = seq[seed_idx:end_idx]
    original_sequence = [idx_word[i] for i in seed]
    generated = seed[:] + ['#']

    # Find the actual entire sequence
    actual = generated[:] + seq[end_idx:end_idx + new_words]
        
    # Keep adding new words
    for i in range(new_words):

        # Make a prediction from the seed
        preds = model.predict(np.array(seed).reshape(1, -1))[0].astype(np.float64)

        # Diversify
        preds = np.log(preds) / diversity
        exp_preds = np.exp(preds)

        # Softmax
        preds = exp_preds / sum(exp_preds)

        # Choose the next word
        probas = np.random.multinomial(1, preds, 1)[0]

        next_idx = np.argmax(probas)

        # New seed adds on old word
        #             seed = seed[1:] + [next_idx]
        seed += [next_idx]
        generated.append(next_idx)
    # Showing generated and actual abstract
    n = []

    for i in generated:
        n.append(idx_word.get(i, '< --- >'))

    gen_list.append(n)

a = []

for i in actual:
    a.append(idx_word.get(i, '< --- >'))

a = a[seed_length:]

gen_list = [gen[seed_length:seed_length + len(a)] for gen in gen_list]

print ' '.join(original_sequence)
print "\n"
# print gen_list
print ' '.join(gen_list[0][1:])
# print a

on a representation of n grams in the text sequence and a second representation of the input text sequence generated by a first neural network the model also includes a decoder which sequentially predicts terms of the canonical form based on the first and second representations and a predicted prefix


generate to vector outputs is 41 signal compounds each n graphemes with at the at a convolutional using into to the device of a vectors number
