# Sentence Completion (TUD IS Project)

<a href="https://colab.research.google.com/github/hacksaremeta/IS-Sentence-Completion/blob/datasets/is_autocomplete.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Table of contents
* 1 [Introduction (TODO)](#introduction)  
* 2 [Training data preparation](#data_preparation)  
    * 2.1 [DataManager class](#data_manager)  
    * 2.2 [DataUtils class](#data_utils)  
* 3 [Keras Implementation](#impl_keras)  
    * 3.1 [Data preparation](#keras_preparation)  
    * 3.2 [The neural network](#keras_rnn)  

<a id="data_preparation"></a>
## Training data preparation

In order to fetch data from PubMed and save it into different datasets as well as to load those datasets, some functionality is needed. This functionality will be provided by the [DataManager class](#data_manager).
The loaded dataset then has to be prepared for training the neural network. This includes tokenization, label and feature extraction and encoding, all of which is handled by the [DataUtils class](#data_utils).
TODO: more explanation / documentation ...

<a id="data_manager"></a>
### DataManager Class

- Provides functionality regarding data including fetch, persistence and TF2/Keras preparation utils

In [1]:
import os, json, logging, string
from Bio import Entrez, Medline

In [2]:
class DataManager():
    """Provides fetch, save and load functionality for datasets in json format"""
    
    def __init__(self, email, root_dir):
        self.email = email
        self.root_dir = root_dir
        self.log = logging.getLogger(self.__class__.__name__)

    def _exists_dataset(self, name):
        """Checks whether a dataset with the given name exists"""
        if not os.path.isdir(self.root_dir):
            return False
            
        for file in os.listdir(self.root_dir):
            if file.endswith(".json"):
                with open(os.path.join(self.root_dir, file), 'r') as f:
                    content = json.load(f)
                    if content["name"] == name:
                        return True
        return False

    def _fetch_papers(self, query : str, limit : int) -> 'list[dict]':
        """Retrieves data from PubMed"""
        Entrez.email = self.email
        record = Entrez.read(Entrez.esearch(db="pubmed", term=query, retmax=limit))
        idlist = record["IdList"]

        self.log.info("\nFound %d records for %s." % (len(idlist), query.strip()))

        records = Medline.parse(Entrez.efetch(db="pubmed", id=idlist, rettype="medline", retmode = "text"))
        return list(records)

    def _fetch_abstracts(self, query : str, limit : int) -> 'list[str]':
        """Retrieves abstracts from PubMed"""
        papers = self._fetch_papers(query, limit)
        list_of_abstracts = [p['AB'] for p in papers]

        return list_of_abstracts
        
    def create_dataset(self, queries : 'list[str]', name : str, limit=50, overwrite=False) -> None:
        """
        Wraps other methods in this class
        Creates a dataset from multiple queries
        Does nothing if the dataset is already present (param overwrite)
        Limits every query to <limit> results
        """
        exists_dataset = self._exists_dataset(name)
        if not exists_dataset or (exists_dataset and overwrite):
            self.log.info("Dataset does not exist, fetching from PubMed...")

            res = dict()
            res["name"] = name
            res["data"] = list()
            
            for q in queries:
                q_data = dict()
                q_data["query"] = q
                q_data["abstracts"] = self._fetch_abstracts(q, limit)
                res["data"].append(q_data)
            
            self._save_dataset(res, name)
        else:
            self.log.info("Dataset already exists, skipping fetch")

    def _save_dataset(self, dataset: dict, name : str) -> None:
        """
        Creates a file <name>.json in the dataset directory
        For JSON file structure see below
        Param dataset has a structure analogous to the JSON file
        """
        if not os.path.isdir(self.root_dir):
            os.makedirs(self.root_dir)

        with open(os.path.join(self.root_dir, name + ".json"), 'w') as f:
            json.dump(dataset, f, indent=2)
        
    def load_full_dataset(self, name : str) -> 'list[str]':
        """
        Finds the file that matches given <name> in JSON information,
        parses it, loading all abstracts into a list (one string for each abstract)
        and returns it (Error if dataset doesn't exist)
        """

        if  not self._exists_dataset(name):
            self.log.info("Dataset does not exist")
            
        else:
           with open(os.path.join(self.root_dir, name+'.json'), 'r') as file:

                abstract_list=[]
                jsonObject = json.load(file)

                data_list= jsonObject['data']

                for item in data_list:
                    abstract_list.extend(item['abstracts'])
                return abstract_list

    def load_query_from_dataset(self, name : str, query : str) -> 'list[str]':
        """Like load_full_dataset but only loads abstracts for a single query"""


        result = self._exists_dataset(name)

        if  result:

            with open(os.path.join(self.root_dir, name+'.json'), 'r') as file:

                query_abstracts=[]
                jsonObject = json.load(file)
                data_list= jsonObject['data']

                q_names = [x['query'] for x in data_list]

                if query not in q_names:
                    self.log.info("The Query that you are searching for,does not exist in the Dataset")
                else:

                      for queries in data_list:
                            if queries['query'] == query:
                              query_abstracts.extend(queries['abstracts'])
                              return query_abstracts

        else:
             self.log.info("Dataset does not exist")


    def remove_punctuation(self, name:str) -> 'list[str]':


            abstracts_list= self.load_full_dataset(name)

            for text in abstracts:

                text = text.translate(str.maketrans('', '', string.punctuation))
                abstracts_list.append(text)


            return  abstracts_list

<a id="data_utils"></a>
### DataUtils Class
- Static class providing utility functions to prepare data for training

In [3]:
import numpy as np
from typing import Any
from sklearn.model_selection import train_test_split

In [4]:
# TODO: unify method param types (all np.array instead of list)
class DataUtils():
    """Provides utility functions for data preparation"""
    
    @staticmethod
    def extract_features_and_labels(sequences : 'list[list[Any]]', train_len : int) -> 'tuple[list[Any], list[Any]]':
        """
        Extracts features of size <train_len> from the sequences
        Also extracts every (<train_len>+1)-th word as labels
        Returns tuple(features, labels)
        """
        features = []
        labels = []
        for s in sequences:
            for i in range(train_len, len(s)):

                # Extract <train_len> + 1 words and
                # shift by 1 after each iteration
                # That way it generates a lot of training
                # samples from a relatively small amount of data
                ex = s[i-train_len : i+1]

                # First <train_len> words are features
                features.append(ex[:-1])
                
                # (<train_len>+1)-th word is label
                labels.append(ex[-1])
        
        return (features, labels)
             
    @staticmethod
    def encode_data(labels: 'list[Any]', num_code_words : int) -> np.array:
        """
        One-hot encode labels using numpy to
        improve the training speed of the network
        """

        # Use numpy for better compatibility and performance
        # Data type: 8bit integers for binary numbers (0, 1)
        # Could be optimized in space by using single bits instead
        # But that adds overhead in calculation (tradeoff time - space)
        # Since we want improved training speed we just use
        # numpys smallest data type byte/uint8 here
        labels_encoded = np.zeros((len(labels), num_code_words), dtype=np.uint8)

        # One-hot encode
        for i, word in enumerate(labels):
            labels_encoded[i, word] = 1
            
        return labels_encoded
    
    # Uses Scikit-learn here; maybe replace with own method in the future
    @staticmethod
    def split_data(features: np.array, labels: np.array, _test_size=0.2) -> Any:
        """
        Splits features and labels into training and validation data sets
        Returns: (features_training, features_validation, labels_training, labels_validation)
        """
        return train_test_split(features, labels, test_size=_test_size)

<a id="impl_keras"></a>
## Keras Implementation (LSTM RNN)  
For the general methodology regarding Keras neural networks see [Tensorflow Docs: Text generation with an RNN](https://www.tensorflow.org/text/tutorials/text_generation), [Sanchit Tanwar: Building our first neural network in keras](https://towardsdatascience.com/building-our-first-neural-network-in-keras-bdc8abbc17f5) and [Will Koehrsen: Recurrent Neural Networks by Example in Python](https://towardsdatascience.com/recurrent-neural-networks-by-example-in-python-ffd204f99470).
In this case the sequences given are words instead of characters and the RNN predicts the next word.
Therefore we use the Keras Tokenizer to convert sentences to vectors of word representatives (integers).
After tokenization each 'word' will be converted to a feature vector using Keras pre-trained embeddings.
Then we train the network by giving it n 'words' (features) from the PubMed training data and having it predict the (n+1)-th word (label) in the sequence.
The predicted word is then compared to the actual word present in the training data and back-propagation is used to tweak the network layers.

<a id="keras_preparation"></a>
### Data preparation

In [7]:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Masking, Dense, Dropout
from tensorflow.keras.preprocessing.text import Tokenizer

In [16]:
if __name__ == "__main__":
    # Init logging
    logging.basicConfig(level=logging.DEBUG, format='[%(levelname)s] %(name)s: %(message)s')
    log = logging.getLogger("Main")

    # Create DataManager in '../res/datasets' folder
    data_folder = os.path.join("..", "res", "datasets")
    dman = DataManager("mymail@example.com", data_folder)

    dataset_name = "RNA Dataset"
    queries = ["RNA", "mRNA", "tRNA"]

    # Gather maximum of 100 abstracts for each query
    # I would suggest around 5 - 20 abstracts in total for the small data sets
    # and maybe 500 - 5000 for the final ones but we'll have to test
    # since that depends on how long it takes to train the network
    # This only queries PubMed if data if the data is not already present
    dman.create_dataset(queries, dataset_name, 5)

    # Load the dataset
    abstracts = dman.load_full_dataset(dataset_name)
    abstracts_mrna = dman.load_query_from_dataset(dataset_name, queries[1])

    ab = dman.remove_punctuation(dataset_name)

    assert(len(ab) > 0)
    log.debug(f"First extracted abstract: {ab[0]}")

    # Tokenize abstracts
    # See https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer
    # Filters slightly modified (comp. to docs) to keep punctuation
    # Lowercase has to be used for pre-trained embeddings
    tokenizer = Tokenizer(
        num_words=None, 
        filters='#$%&()*+-,<=>@[\\]^_`{|}~\t\n',
        lower = True, split = ' '
    )

    tokenizer.fit_on_texts(ab)

    # Generates list of lists of integers
    # Can be reversed with the sequences_to_texts() function of the tokenizer
    sequences = tokenizer.texts_to_sequences(ab)

    assert(len(sequences) > 0)
    log.debug(f"First tokenized sequence: {sequences[0]}")

    # Prepare data for input to RNN
    # Extract features and labels
    # Number of words before prediction: num_pred
    num_pred = 20
    features, labels = DataUtils.extract_features_and_labels(sequences, 20)

    assert(len(features) > 0 and len(labels) > 0)
    log.debug(f"First extracted feature: {tokenizer.sequences_to_texts(features)[0]} {features[0]}")
    log.debug(f"First extracted label: {tokenizer.index_word[labels[0]]} [{labels[0]}]")

    # One-hot encode data for improved training performance
    num_code_words = len(tokenizer.index_word) + 1
    labels_encoded = DataUtils.encode_data(labels, num_code_words)
    
    assert(len(labels_encoded) > 0)
    log.debug(f"First one-hot encoded label: [0 ... {labels_encoded[0][labels[0]]} (at index {labels[0]}) ... 0]")

    # Final log for prepared data
    log.info(f"Loaded {labels_encoded.shape[0]} sequences"
             f" with an encoded length of ~{labels_encoded.shape[1] // 8} bytes per sequence")
    
    # Convert features to numpy array
    # This is necessary for input to the RNN
    features = np.array(features)
    
    # Split dataset into training and validation sets
    features_training, features_validation, labels_training, labels_validation = \
    DataUtils.split_data(features, labels_encoded, 0.1)
    
    assert(len(features_training) > 0 and len(features_validation) > 0 and
           len(labels_training) > 0 and len(labels_validation) > 0)
    log.info(f"Size of training data: {features_training.shape[0]} sequences")
    log.info(f"Size of validation data: {features_validation.shape[0]} sequences")
    
    log.info("Training data preparation finished")

[INFO] DataManager: Dataset already exists, skipping fetch
[DEBUG] Main: First extracted abstract: Long noncoding RNA nuclear paraspeckle assembly transcript 1 (lncRNA NEAT1) is abnormally expressed in numerous tumors and functions as an oncogene, but the role of NEAT1 in laryngocarcinoma is largely unknown. Our study validated that NEAT1 expression was markedly upregulated in laryngocarcinoma tissues and cells. Downregulation of NEAT1 dramatically suppressed cell proliferation and invasion through inhibiting miR-524-5p expression. Additionally, NEAT1 overexpression promoted cell growth and metastasis, while overexpression of miR-524-5p could reverse the effect. NEAT1 increased the expression of histone deacetylase 1 gene (HDAC1) via sponging miR-524-5p. Mechanistically, overexpression of HDAC1 recovered the cancer-inhibiting effects of miR-524-5p mimic or NEAT1 silence by deacetylation of tensin homolog deleted on chromosome ten (PTEN) and inhibiting AKT signal pathway. Moreover, in v

<a id="keras_rnn"></a>
### The neural Network

In [None]:
# TODO: create neural network using Keras
pass

# Training the model
# history = model.fit(features_training, labels_training,
# validation_data=(features_validation, labels_validation), epochs=100, batch_size=64)