# Working demonstration of KINS injector using the Census Income dataset

Notebook organisation:
1. [**imports and utility functions**](#imports)
2. [**dataset description, analysis and preprocessing**](#dataset)
3. [**injection**](#injection) (if you are only interested in the injection mechanism, skip the other parts)
4. [**training and evaluation**](#training)

Note: an Internet connection is required to download the dataset, unless the Census Income files are already in the `data` folder.

<a id='imports'></a>
## Imports and utility functions

Some necessary imports:
- __os__ to use other resources in this repository
- __pandas__ for data retrieval and statistics
- __tensorflow__ for neural networks
- __psyki__ for symbolic knowledge injection
- __knowledge__ and __data__ (local modules of this repository) for the knowledge file path and the dataset helpers

In [1]:
import os
import pandas as pd
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense
from tensorflow.python.framework.random_seed import set_seed
from psyki.logic import Theory
from psyki.logic.prolog import TuProlog
from psyki.ski import Injector

os.getcwd()
from knowledge import PATH as KNOWLEDGE_PATH
from data import CensusIncome

Constant definitions.

In [2]:
CENSUS_KNOWLEDGE_FILE = str(KNOWLEDGE_PATH / CensusIncome.knowledge_file_name)

# Activation functions used for building the uneducated predictor

ACTIVATION: str = "relu"
LAST_ACTIVATION: str = "sigmoid"

# Training parameters

SEED = 0
EPOCHS = 20
BATCH_SIZE = 32
VERBOSE = 1

Function that creates an uneducated predictor, i.e., a plain fully connected network without injected knowledge, with one hidden layer per entry of `neurons_per_hidden_layer`.

In [3]:
def create_uneducated_predictor(input_shape: tuple, outputs: int, neurons_per_hidden_layer: list[int]) -> Model:
    predictor_input = Input(input_shape)
    x = predictor_input
    for neurons in neurons_per_hidden_layer:
        x = Dense(neurons, activation=ACTIVATION)(x)
    x = Dense(outputs, activation=LAST_ACTIVATION)(x)
    return Model(predictor_input, x)

<a id='dataset'></a>

## Dataset description, analysis and preprocessing
(If you are interested only in the injection part you can skip this section)

Download the train and test sets (if not already present).
The dataset contains general information about individuals (e.g., age, sex, education, etc.) and their yearly income (i.e., more or less than 50,000 USD).

In [4]:
not_processed_train = CensusIncome.get_train()
not_processed_test = CensusIncome.get_test()

A first look at the dataset.

In [5]:
not_processed_train

Unnamed: 0,Age,WorkClass,Fnlwgt,Education,EducationNumeric,MaritalStatus,Occupation,Relationship,Ethnicity,Sex,CapitalGain,CapitalLoss,HoursPerWeek,NativeCountry,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
32557,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
32558,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
32559,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K


As you can see, we are dealing with different data types. In particular:
- Age, Fnlwgt, CapitalGain, CapitalLoss and HoursPerWeek are continuous (integer) features;
- EducationNumeric is ordinal;
- Sex is binary;
- the remaining features are nominal (WorkClass, Education, MaritalStatus, Occupation, Relationship, Ethnicity, NativeCountry).

OK. All feature names are self-explanatory except Fnlwgt. What on earth is that?
Fnlwgt stands for FinalWeight, and a popular belief is that it indicates the estimated number of people represented by the row.
However, if we simply sum this feature over the whole dataset (train and test), the value is...

In [6]:
f'{sum(not_processed_train.Fnlwgt) + sum(not_processed_test.Fnlwgt):,}'

'12,358,746,784'

so more than 12 billion, which exceeds the Earth's entire population! (And this dataset dates from the 90s and concerns only the US.)
Therefore, this cannot be the correct interpretation of this feature.

The actual meaning of Fnlwgt is much more complicated. If we look at the original data description available [here](https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.names) we can read:

> Description of fnlwgt (final weight)
>
> The weights on the CPS files are controlled to independent estimates of the
> civilian noninstitutional population of the US.  These are prepared monthly
> for us by Population Division here at the Census Bureau.  We use 3 sets of
> controls.
> These are:
>
> 1. A single cell estimate of the population 16+ for each state.
> 2. Controls for Hispanic Origin by age and sex.
> 3. Controls by Race, age and sex.
>
> We use all three sets of controls in our weighting program and "rake" through
> them 6 times so that by the end we come back to all the controls we used.
>
> The term estimate refers to population totals derived from CPS by creating
> "weighted tallies" of any specified socio-economic characteristics of the
> population.
>
> People with similar demographic characteristics should have
> similar weights. There is one important caveat to remember
> about this statement. That is that since the CPS sample is
> actually a collection of 51 state samples, each with its own
> probability of selection, the statement only applies within
> state.

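Long story short, it is a sampling weight derived from demographic characteristics: people with similar demographics (within the same state) get similar weights, so it adds little beyond the other features.
We can safely ignore it in our study.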
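
The "raking" mentioned in the quoted description is commonly known as iterative proportional fitting: the weights are rescaled in turn so that the weighted totals match each set of control totals. The following is only a minimal sketch of that idea in plain Python. The function name `rake`, the toy respondents and the control totals are made up for illustration and do not reproduce the Census Bureau's actual procedure.

```python
def rake(records, controls, passes=6):
    """Rescale record weights so weighted totals match each control margin.

    records:  list of dicts with categorical keys and a 'weight' entry
    controls: {attribute: {category: target_total}}
    passes:   number of sweeps over all the margins
    """
    weights = [r["weight"] for r in records]
    for _ in range(passes):
        for attribute, targets in controls.items():
            # Current weighted total of every category of this attribute
            totals = {category: 0.0 for category in targets}
            for record, weight in zip(records, weights):
                totals[record[attribute]] += weight
            # Scale each weight so that the category totals hit the targets
            weights = [
                weight * targets[record[attribute]] / totals[record[attribute]]
                for record, weight in zip(records, weights)
            ]
    return weights


# Toy sample: four respondents, each starting with weight 1
sample = [
    {"sex": "F", "age": "young", "weight": 1.0},
    {"sex": "F", "age": "old", "weight": 1.0},
    {"sex": "M", "age": "young", "weight": 1.0},
    {"sex": "M", "age": "old", "weight": 1.0},
]
# Hypothetical population totals (both margins sum to 100)
controls = {"sex": {"F": 60.0, "M": 40.0}, "age": {"young": 30.0, "old": 70.0}}

raked = rake(sample, controls)
```

After raking, the weights of the two female respondents add up to the `F` control total and the weights of the two young respondents add up to the `young` control total. This is why such weights depend only on demographic cells and carry no extra signal about income.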

### Data preprocessing

- Fnlwgt is discarded;
- Education is discarded as well because EducationNumeric carries the same information;
- Sex is mapped into 0 (Male) and 1 (Female);
- Income is mapped into 0 (<=50K) and 1 (>50K) as well;
- The remaining nominal features are one-hot encoded (WorkClass, MaritalStatus, Occupation, Relationship, NativeCountry).
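
The steps listed above can be sketched with plain pandas. This is an illustrative approximation on a tiny hand-made frame, not the code of `CensusIncome.get_processed_dataset`: the helper name `preprocess` is hypothetical, and the sketch does not reproduce the lower-case column naming (e.g. `MaritalStatus_married_civ_spouse`) visible in the processed dataset.

```python
import pandas as pd

# Tiny hand-made frame with a subset of the raw column names
raw = pd.DataFrame({
    "Age": [39, 50, 28],
    "Fnlwgt": [77516, 83311, 338409],
    "Education": ["Bachelors", "Bachelors", "Bachelors"],
    "EducationNumeric": [13, 13, 13],
    "MaritalStatus": ["Never-married", "Married-civ-spouse", "Married-civ-spouse"],
    "Sex": ["Male", "Male", "Female"],
    "income": ["<=50K", "<=50K", "<=50K"],
})


def preprocess(frame: pd.DataFrame, nominal: list[str]) -> pd.DataFrame:
    # Drop the sampling weight and the redundant textual education level
    out = frame.drop(columns=["Fnlwgt", "Education"])
    # Binary encodings for Sex and for the target
    out["Sex"] = out["Sex"].map({"Male": 0, "Female": 1})
    out["income"] = out["income"].map({"<=50K": 0, ">50K": 1})
    # One-hot encode the nominal features
    out = pd.get_dummies(out, columns=nominal, dtype=int)
    # Keep the target as the last column
    return out[[c for c in out.columns if c != "income"] + ["income"]]


processed = preprocess(raw, nominal=["MaritalStatus"])
```

Keeping `income` as the last column matters here because the training cells select the features with `iloc[:, :-1]` and the target with `iloc[:, -1]`.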

In [7]:
processed_dataset = CensusIncome.get_processed_dataset(pd.concat((not_processed_train, not_processed_test), axis=0))
train = processed_dataset.iloc[:not_processed_train.shape[0], :]
test = processed_dataset.iloc[not_processed_train.shape[0]:, :]
train.describe()

Unnamed: 0,Age,EducationNumeric,Sex,CapitalGain,CapitalLoss,HoursPerWeek,WorkClass_unknown,WorkClass_federal_gov,WorkClass_local_gov,WorkClass_never_worked,...,NativeCountry_puerto_rico,NativeCountry_scotland,NativeCountry_south,NativeCountry_taiwan,NativeCountry_thailand,NativeCountry_trinadad_tobago,NativeCountry_united_states,NativeCountry_vietnam,NativeCountry_yugoslavia,income
count,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0,...,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0
mean,38.581647,10.080679,0.330795,1077.648844,87.30383,40.437456,0.056386,0.029483,0.064279,0.000215,...,0.003501,0.000369,0.002457,0.001566,0.000553,0.000584,0.895857,0.002058,0.000491,0.24081
std,13.640433,2.57272,0.470506,7385.292085,402.960219,12.347429,0.23067,0.169159,0.245254,0.014661,...,0.059068,0.019194,0.049507,0.039546,0.023506,0.024149,0.305451,0.045316,0.022162,0.427581
min,17.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,28.0,9.0,0.0,0.0,0.0,40.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
50%,37.0,10.0,0.0,0.0,0.0,40.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
75%,48.0,12.0,1.0,0.0,0.0,45.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
max,90.0,16.0,1.0,99999.0,4356.0,99.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


How are the features correlated with the target variable income?

In [8]:
train.corr().income.sort_values(ascending=False)

income                              1.000000
MaritalStatus_married_civ_spouse    0.444696
Relationship_husband                0.401035
EducationNumeric                    0.335154
Age                                 0.234037
                                      ...   
Occupation_other_service           -0.156348
Relationship_not_in_family         -0.188497
Sex                                -0.215980
Relationship_own_child             -0.228532
MaritalStatus_never_married        -0.318440
Name: income, Length: 91, dtype: float64
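
By default, `DataFrame.corr()` computes the Pearson coefficient between every pair of columns. As a reminder of what the numbers above measure, here is a small sketch in plain Python. The two toy columns are made up and are not taken from the dataset.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sum of the products of the deviations from the means
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Toy data (made up): a binary feature and a binary target
married = [1, 1, 1, 0, 0, 0, 0, 1]
income = [1, 1, 0, 0, 0, 0, 1, 0]

r = pearson(married, income)
print(f"{r:.4f}")
```

A value close to 1 or -1 indicates a strong linear relation with the target, while a value close to 0 indicates almost none. A feature with low correlation can still be useful in combination with others.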

<a id='injection'></a>
## Injection

### Knowledge

In [9]:
knowledge = TuProlog.from_file(CENSUS_KNOWLEDGE_FILE)
theory = Theory(knowledge, train)

# You can also create a theory in one single line providing the file path of the knowledge instead of the knowledge itself.
# theory = Theory(CENSUS_KNOWLEDGE_FILE, train)


This knowledge is extracted from a decision tree trained on the train set.
The overall accuracy of the tree is 84.9% on the train set.
It consists of the following 10 rules:

In [10]:
for rule in theory.formulae:
    print(f"{rule.rhs} -> {rule.lhs.args.last}")

EducationNumeric > 12.0, MaritalStatus_married_civ_spouse > 0.0 -> 1.0
EducationNumeric < 12.0, CapitalGain < 5119.0, CapitalLoss < 1820.0 -> 0.0
EducationNumeric > 12.0, MaritalStatus_married_civ_spouse < 0.0, CapitalGain < 7073.0 -> 0.0
EducationNumeric > 12.0, MaritalStatus_married_civ_spouse < 0.0, CapitalGain > 7073.0 -> 1.0
CapitalGain < 5119.0, CapitalLoss > 1820.0, MaritalStatus_married_civ_spouse < 0.0 -> 0.0
CapitalGain < 5119.0, CapitalLoss > 1820.0, MaritalStatus_married_civ_spouse > 0.0, EducationNumeric < 8.0 -> 0.0
CapitalGain < 5119.0, CapitalLoss > 1820.0, MaritalStatus_married_civ_spouse > 0.0, EducationNumeric > 8.0 -> 1.0
CapitalGain > 7073.0 -> 1.0
MaritalStatus_married_civ_spouse < 0.0 -> 0.0
True -> 1.0
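
Each printed line reads as "if all the conditions on the left hold, the class on the right is predicted". To make that reading concrete, here is a small sketch that parses one printed line and checks its body against a single record. It works on the printed text only: `parse_rule`, `body_holds` and the example individual are hypothetical helpers for illustration, not part of psyki, and this is not how the injector encodes the rules into the network.

```python
import operator

OPERATORS = {">": operator.gt, "<": operator.lt}


def parse_rule(text: str):
    """Split 'A > 1.0, B < 2.0 -> 1.0' into ([(feature, op, threshold)], label)."""
    body, label = text.rsplit(" -> ", 1)
    conditions = []
    if body.strip() != "True":  # 'True' marks a rule with an empty body
        for literal in body.split(", "):
            feature, symbol, threshold = literal.split()
            conditions.append((feature, symbol, float(threshold)))
    return conditions, float(label)


def body_holds(conditions, record: dict) -> bool:
    """True when the record satisfies every condition of the rule body."""
    return all(OPERATORS[symbol](record[feature], threshold)
               for feature, symbol, threshold in conditions)


rule = "EducationNumeric > 12.0, MaritalStatus_married_civ_spouse > 0.0 -> 1.0"
conditions, label = parse_rule(rule)

# Hypothetical individual: EducationNumeric 13 (Bachelors) and married
person = {"EducationNumeric": 13, "MaritalStatus_married_civ_spouse": 1}
print(body_holds(conditions, person), label)
```

For this individual the body of the first rule holds, so the rule suggests class 1.0 (income above 50K).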


It is possible to mark a rule as trainable, i.e., the weights and biases of the neurons corresponding to that logic rule are updated by the training process.
Usually, making all rules trainable slows training down but should yield higher accuracy.
In this case, we choose not to train the rules.

In [11]:
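# You can make all rules trainable by running the following line.
# Note: with the psyki version used for this run the method is missing, so the call
# raises the AttributeError shown below and the rest of this cell is not executed.
theory.set_all_formulae_trainable()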

# You can also make specific rules trainable by running the following line
theory.set_formulae_trainable(["class"])  # class is the name of the rule (i.e., the name of the predicate)

# To make all rules not trainable (a.k.a. static) run the following line
theory.set_all_formulae_static()

AttributeError: 'Theory' object has no attribute 'set_all_formulae_trainable'

### The actual injection is as simple as that

In [None]:
set_seed(SEED)
# The input shape is a 1-tuple: every column except the target (the last one) is a feature
uneducated = create_uneducated_predictor((train.shape[1] - 1,), 1, [10])
injector = Injector.kins(uneducated)
educated = injector.inject(theory)

Done!

<a id='training'></a>
## Training and evaluation

From now on, everything works as in any ordinary ML project.

In [None]:
educated.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
history_educated = educated.fit(train.iloc[:, :-1], train.iloc[:, -1], epochs=EPOCHS, batch_size=BATCH_SIZE, verbose=VERBOSE)

In [None]:
_, acc = educated.evaluate(test.iloc[:, :-1], test.iloc[:, -1])
print(f'test set accuracy of the educated predictor: {acc*100:.2f}%')

What about the uneducated predictor?

In [None]:
uneducated.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
history_uneducated = uneducated.fit(train.iloc[:, :-1], train.iloc[:, -1], epochs=EPOCHS, batch_size=BATCH_SIZE, verbose=VERBOSE)

In [None]:
_, acc = uneducated.evaluate(test.iloc[:, :-1], test.iloc[:, -1])
print(f'test set accuracy of the uneducated predictor: {acc*100:.2f}%')
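
To put the two accuracies in context, it helps to compare them with a majority-class baseline. According to `train.describe()`, the mean of `income` on the train set is 0.24081, so always predicting `<=50K` is already right about three times out of four there. The sketch below computes that baseline and shows how an accuracy value can be obtained from sigmoid outputs, assuming the usual 0.5 decision threshold. The helper names and the toy probabilities are made up for illustration.

```python
def accuracy(probabilities, labels, threshold=0.5):
    """Share of samples whose thresholded probability matches the 0/1 label."""
    predictions = [1 if p > threshold else 0 for p in probabilities]
    hits = sum(int(p == y) for p, y in zip(predictions, labels))
    return hits / len(labels)


def majority_baseline(positive_rate):
    """Accuracy of always predicting the most frequent class."""
    return max(positive_rate, 1.0 - positive_rate)


# Mean of the income column reported by train.describe()
baseline = majority_baseline(0.24081)
print(f"majority-class baseline on the train set: {baseline * 100:.2f}%")

# Made-up sigmoid outputs and labels, only to exercise the function
toy_accuracy = accuracy([0.9, 0.2, 0.6, 0.4], [1, 0, 0, 0])
```

The class balance of the test set may differ slightly, so the baseline should be recomputed on `test` before drawing conclusions about either predictor.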