# Working demonstration of the KINS injector using the Splice Junction dataset

Notebook organisation:
1. [**imports and utility functions**](#imports)
2. [**dataset description, analysis and preprocessing**](#dataset-description,-analysis-and-preprocessing)
3. [**injection**](#injection) (if you are interested only in the injection mechanism, skip the other parts)
4. [**training and evaluation**](#training-and-evaluation)

Note: an Internet connection is required to download the dataset.
If the files of the Splice Junction dataset are already in the `data` folder, no Internet connection is needed.

<a id='imports'></a>
## Imports and utility functions

Some necessary imports:
- __os__ to use other resources in this repository
- __pandas__ for data retrieval and statistics
- __scikit-learn__ for the train/test split
- __tensorflow__ for neural networks
- __psyki__ for symbolic knowledge injection
- __typing__ for type hints

In [1]:
# Standard library
import os
from typing import Callable

# Third-party libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
from tensorflow.python.framework.random_seed import set_seed

# PSyKI: logic parsing and symbolic knowledge injection
from psyki.logic import Theory
from psyki.logic.prolog import TuProlog
from psyki.ski import Injector

# Local modules of this repository
from knowledge import PATH as KNOWLEDGE_PATH
from data import SpliceJunction

In [2]:
SPLICE_KNOWLEDGE_FILE = str(KNOWLEDGE_PATH / SpliceJunction.knowledge_file_name)

# Training parameters

SEED = 0
EPOCHS = 20
BATCH_SIZE = 32
VERBOSE = 1

## Dataset description, analysis and preprocessing
(If you are interested only in the injection part, you can skip this section.)

Download the dataset (if not already present).
Each sample consists of a sequence of 60 nucleotides and a label.
Exons are the parts of the DNA sequence retained after splicing; introns are the parts that are spliced out.
The label can be:
- __ei__: exon-intron boundary (donor site), where a retained exon is followed by an intron that is spliced out;
- __ie__: intron-exon boundary (acceptor site), where an intron that is spliced out is followed by a retained exon;
- __n__: neither, the sequence contains no splice junction.

In [3]:
not_processed = SpliceJunction.get_train()

A first look at the dataset:
- 3190 samples in total, 2552 in the training set and 638 in the test set;
- There are three columns: the first is the label, the second is the sample identifier and the third is the sequence;
- Classes are distributed as follows:
    - ei: 768
    - ie: 767
    - n: 1655

In [4]:
not_processed

|      | 0  | 1 | 2 |
|------|----|---|---|
| 0    | EI | ATRINS-DONOR-521 | CCAGCTGCATCACAGGAGGCCAGCGAGCAGGTCTGTTCCAAGGGCC... |
| 1    | EI | ATRINS-DONOR-905 | AGACCCGCCGGGAGGCGGAGGACCTGCAGGGTGAGCCCCACCGCCC... |
| 2    | EI | BABAPOE-DONOR-30 | GAGGTGAAGGACGTCCTTCCCCAGGAGCCGGTGAGAAGCGCAGTCG... |
| 3    | EI | BABAPOE-DONOR-867 | GGGCTGCGTTGCTGGTCACATTCCTGGCAGGTATGGGGCGGGGCTT... |
| 4    | EI | BABAPOE-DONOR-2817 | GCTCAGCCCCCAGGTCACCCAGGAACTGACGTGAGTGTCCCCATCC... |
| ...  | ... | ... | ... |
| 3185 | N  | ORAHBPSBD-NEG-2881 | TCTCTTCCCTTCCCCTCTCTCTTTCTTTCTTTTCTCTCCTCTTCTC... |
| 3186 | N  | ORAINVOL-NEG-2161 | GAGCTCCCAGAGCAGCAAGAGGGCCAGCTGAAGCACCTGGAGAAGC... |
| 3187 | N  | ORARGIT-NEG-241 | TCTCGGGGGCGGCCGGCGCGGCGGGGAGCGGTCCCCGGCCGCGGCC... |
| 3188 | N  | TARHBB-NEG-541 | ATTCTACTTAGTAAACATAATTTCTTGTGCTAGATAACCAAATTAA... |


### Data preprocessing

- We discard column 1 (the sample identifier) and split the sequence in column 2 into 60 features, one per nucleotide position.
- We then expand the 60 features into **4 x 60 = 240** binary features: each position gets four indicators, one each for A, C, G and T, set to 1 when the nucleotide at that position can be that base. This is a multi-hot rather than a one-hot encoding because the dataset contains symbols other than A, C, G and T (e.g., the symbol M denotes a nucleotide that can be either A or C), and such a symbol sets more than one indicator.
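The encoding can be sketched as follows. This is an illustration under stated assumptions, not the code behind `SpliceJunction.get_processed_dataset`: the helpers `multi_hot` and `feature_names` are hypothetical, the ambiguity table is a subset of the standard IUPAC nucleotide codes, and the column naming (`X_30a` for position -30 up to `X30t` for position +30) is inferred from the column headers of the processed frame.

```python
from typing import List

BASES = "ACGT"

# Subset of the IUPAC nucleotide codes: symbol -> bases it may stand for.
# Assumption: which ambiguity symbols actually occur depends on the data file.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "S": "CG", "D": "AGT", "N": "ACGT",
}


def multi_hot(sequence: str) -> List[int]:
    """Encode each symbol as four 0/1 indicators, in the order A, C, G, T."""
    bits: List[int] = []
    for symbol in sequence.upper():
        allowed = IUPAC[symbol]  # raises KeyError on a symbol outside the table
        bits.extend(int(base in allowed) for base in BASES)
    return bits


def feature_names(length: int) -> List[str]:
    """Name columns from position -length/2 to +length/2, skipping position 0."""
    half = length // 2
    positions = [f"_{i}" for i in range(half, 0, -1)]
    positions += [str(i) for i in range(1, half + 1)]
    return [f"X{pos}{base.lower()}" for pos in positions for base in BASES]


encoded = multi_hot("ACMN")
names = feature_names(60)
print(encoded)
print(len(names), names[0], names[-1])
```

An unambiguous symbol sets exactly one of its four indicators, while an ambiguous one such as M sets two, which is why the encoding is called multi-hot.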

In [5]:
processed_dataset = SpliceJunction.get_processed_dataset(not_processed)
train, test = train_test_split(processed_dataset, test_size=0.2, random_state=SEED, stratify=processed_dataset.iloc[:, -1])
train

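|      | X_30a | X_30c | X_30g | X_30t | X_29a | X_29c | X_29g | X_29t | X_28a | X_28c | ... | X28t | X29a | X29c | X29g | X29t | X30a | X30c | X30g | X30t | 240 |
|------|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2165 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | ... | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | n |
| 1665 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | n |
| 2559 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | n |
| 1452 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | ie |
| 1037 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ie |
| ...  | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 467  | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | ei |
| 1948 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | n |
| 2938 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | n |
| 3099 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | n |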


<a id='injection'></a>
## Injection

### Knowledge

In [6]:
knowledge = TuProlog.from_file(SPLICE_KNOWLEDGE_FILE)
theory = Theory(knowledge, train)

In [7]:
for rule_name in sorted(set([rule.lhs.predication for rule in theory.formulae])):
    print(rule_name)

class
ei_stop
exon_intron
ie_stop
intron_exon
m_of_n
not_both_zero
pyramidine_rich


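We want to make some of the rules trainable, so that they can be adjusted during training.
The names must match those printed above, including the spelling `pyramidine_rich` used in the knowledge file.
We decide to train:
- `class`
- `exon_intron`
- `intron_exon`
- `pyramidine_rich`

In [8]:
theory.set_formulae_trainable(['class', 'exon_intron', 'intron_exon', 'pyramidine_rich'])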
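Because rules are selected by name, a misspelled name is easy to overlook. The guard below is a small sketch using a hypothetical helper, `missing_rule_names`, which is not part of PSyKI. It compares the requested names with the rule names printed earlier. How `set_formulae_trainable` itself treats unknown names is not shown in this notebook, so the check assumes nothing about it.

```python
from typing import Iterable, List


def missing_rule_names(requested: Iterable[str], available: Iterable[str]) -> List[str]:
    """Return the requested names that do not appear among the available ones."""
    available_set = set(available)
    return [name for name in requested if name not in available_set]


# Rule names as printed from the theory earlier in this notebook.
available = [
    "class", "ei_stop", "exon_intron", "ie_stop",
    "intron_exon", "m_of_n", "not_both_zero", "pyramidine_rich",
]

# The spelling used in the knowledge file is found.
ok = missing_rule_names(["class", "exon_intron", "intron_exon", "pyramidine_rich"], available)
# The alternative spelling "pyrimidine_rich" is reported as missing.
typo = missing_rule_names(["class", "pyrimidine_rich"], available)
print(ok, typo)
```

In the notebook, the `available` list could instead be built from `theory.formulae`, in the same way the rule names are printed above.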