# Autocorrect Keras Notebook
<a href="https://colab.research.google.com/github/hacksaremeta/IS-Sentence-Completion/blob/datasets/src/autocorrect_keras.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Sentence completion using an LSTM RNN (with TF2 Keras)
For a general idea of how this works, see [TensorFlow Docs: Text generation with an RNN](https://www.tensorflow.org/text/tutorials/text_generation) and [Will Koehrsen: RNNs by Example in Python](https://towardsdatascience.com/recurrent-neural-networks-by-example-in-python-ffd204f99470).
In this case the input sequences consist of words instead of characters, and the RNN predicts the next word.
We therefore use the Keras Tokenizer to convert each sentence into a sequence of integers, one integer per word.
After tokenization, each word is converted to a feature vector using pre-trained word embeddings in Keras.
We then train the network by giving it n words (features) from the PubMed training data and having it predict the (n+1)-th word (label) in the sequence.
The predicted word is compared to the actual word in the training data, and backpropagation is used to adjust the network weights.
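
The split into n features and one label can be illustrated with a small sliding-window sketch. This is not the notebook's `DataUtils.extract_features_and_labels`, whose implementation is not shown here. The function name `extract_windows` and the toy integer sequences are hypothetical and only demonstrate the idea on already-tokenized input.

```python
from typing import List, Tuple


def extract_windows(
    sequences: List[List[int]], n: int
) -> Tuple[List[List[int]], List[int]]:
    """Slide a window of n tokens over each sequence.

    Every run of n consecutive tokens becomes one feature row,
    and the token that follows it becomes the matching label.
    (Hypothetical helper, for illustration only.)
    """
    features: List[List[int]] = []
    labels: List[int] = []
    for seq in sequences:
        # Sequences with n or fewer tokens yield no (feature, label) pair.
        for start in range(len(seq) - n):
            features.append(seq[start:start + n])
            labels.append(seq[start + n])
    return features, labels


# Toy tokenized "abstracts": integers stand in for words.
toy_sequences = [[4, 8, 15, 16, 23, 42], [7, 9]]
toy_features, toy_labels = extract_windows(toy_sequences, n=3)

for feature, label in zip(toy_features, toy_labels):
    print(feature, "->", label)
```

With n = 3, the first toy sequence produces three training pairs, and the second is skipped because it is too short to supply both a full window and a label.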

In [6]:
%run dataset.ipynb

[INFO] DataManager: Dataset already exists, skipping fetch


In [7]:
import logging
import os
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Masking, Dense, Dropout
from tensorflow.keras.preprocessing.text import Tokenizer

In [8]:
if __name__ == "__main__":
    # Init logging
    logging.basicConfig(level=logging.DEBUG, format='[%(levelname)s] %(name)s: %(message)s')
    log = logging.getLogger("Main")

    # Create DataManager in '../res/datasets' folder
    data_folder = os.path.join("..", "res", "datasets")
    dman = DataManager("mymail@example.com", data_folder)

    dataset_name = "RNA Dataset"
    queries = ["RNA", "mRNA", "tRNA"]

    # Gather a maximum of 5 abstracts for each query (last argument)
    # I would suggest around 5 - 20 abstracts in total for the small data sets
    # and maybe 500 - 5000 for the final ones, but we'll have to test
    # since that depends on how long it takes to train the network
    # This only queries PubMed if the data is not already present
    dman.create_dataset(queries, dataset_name, 5)

    # Load the dataset
    abstracts = dman.load_full_dataset(dataset_name)
    abstracts_mrna = dman.load_query_from_dataset(dataset_name, queries[1])

    ab = dman.remove_punctuation(dataset_name)

    assert(len(ab) > 0)
    log.debug(f"First extracted abstract: {ab[0]}")

    # Tokenize abstracts
    # See https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer
    # Filters slightly modified (compared to the default in the docs) to keep punctuation
    # Lowercasing is required for the pre-trained embeddings
    tokenizer = Tokenizer(
        num_words=None,
        filters='#$%&()*+-,<=>@[\\]^_`{|}~\t\n',
        lower=True,
        split=' '
    )

    tokenizer.fit_on_texts(ab)

    # Generates list of lists of integers
    # Can be reversed with the sequences_to_texts() function of the tokenizer
    sequences = tokenizer.texts_to_sequences(ab)

    assert(len(sequences) > 0)
    log.debug(f"First tokenized sequence: {sequences[0]}")

    # Prepare training data
    # Extract features and labels
    # Number of words used as context before the predicted word: num_pred
    num_pred = 20
    features, labels = DataUtils.extract_features_and_labels(sequences, num_pred)

    assert(len(features) > 0 and len(labels) > 0)
    log.debug(f"First extracted feature: {tokenizer.sequences_to_texts(features)[0]} {features[0]}")
    log.debug(f"First extracted label: {tokenizer.index_word[labels[0]]} [{labels[0]}]")

    # Create Keras LSTM RNN model
    pass


[INFO] DataManager: Dataset already exists, skipping fetch
[DEBUG] Main: First extracted abstract: Long noncoding RNA nuclear paraspeckle assembly transcript 1 (lncRNA NEAT1) is abnormally expressed in numerous tumors and functions as an oncogene, but the role of NEAT1 in laryngocarcinoma is largely unknown. Our study validated that NEAT1 expression was markedly upregulated in laryngocarcinoma tissues and cells. Downregulation of NEAT1 dramatically suppressed cell proliferation and invasion through inhibiting miR-524-5p expression. Additionally, NEAT1 overexpression promoted cell growth and metastasis, while overexpression of miR-524-5p could reverse the effect. NEAT1 increased the expression of histone deacetylase 1 gene (HDAC1) via sponging miR-524-5p. Mechanistically, overexpression of HDAC1 recovered the cancer-inhibiting effects of miR-524-5p mimic or NEAT1 silence by deacetylation of tensin homolog deleted on chromosome ten (PTEN) and inhibiting AKT signal pathway. Moreover, in v