## Creating a Konfuzio project from scratch

## Creating a project and the prerequisites

This section summarizes the [quickstart video tutorial](https://help.konfuzio.com/tutorials/quickstart/index.html). Refer to it for more detail.

To create a project, go to the main server page and press "Create a project". Give it a name and save it. 

Before adding any documents, we need to create labels for future annotations. A label can cover any type of information that is to be extracted from the documents, e.g. "Total", "Description", "Net sum".

To create a label, go to Home > Labels > + Add. 

Each label needs a name. The "Multiple" checkbox allows the label to occur more than once in a document.

The threshold sets the minimum confidence (how sure the model is about its prediction) required for a label to be assigned. For example, a threshold of 0.1 means the label is only assigned if the model's confidence is at least 10%.
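The effect of a threshold can be illustrated with a minimal sketch in plain Python. The prediction tuples and the `apply_threshold` helper are hypothetical and only mimic the behaviour described above; they are not Konfuzio output or part of the Konfuzio API.

```python
# Hypothetical (label name, confidence) pairs, not real Konfuzio output.
predictions = [("Total", 0.93), ("Net sum", 0.08), ("Description", 0.10)]


def apply_threshold(predictions, threshold):
    """Keep only predictions whose confidence is at least `threshold`."""
    return [(label, conf) for label, conf in predictions if conf >= threshold]


kept = apply_threshold(predictions, threshold=0.1)
print(kept)  # [('Total', 0.93), ('Description', 0.1)]
```

With a threshold of 0.1, the "Net sum" prediction (8% confidence) is dropped, while "Description" (exactly 10%) is kept.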

The data type specifies what kind of information the label holds, e.g. text or a percentage.

After filling in all the necessary fields, save the label.

Labels are grouped into label sets. We need at least one label set because every label has to be assigned to one.

To create a label set, go to Home > Label sets > + Add. Alternatively, use the automatically created label set whose name is similar to the project's.

After creating or selecting a label set, add all the necessary labels to it.

If you create a new label set, you also need to assign it a category.

After all changes, save the resulting set. 

## Uploading the documents

Ideally, a dataset should contain at least a hundred documents. They can be divided into a training set and a test set at a ratio of 80% to 20%.
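An 80%/20% split can be prepared before uploading, for example with the sketch below. The file names and the `split_documents` helper are hypothetical; on the server, the assignment itself is still done per document as described later in this section.

```python
import random


def split_documents(names, train_share=0.8, seed=42):
    """Shuffle the names reproducibly and split them into training and test lists."""
    shuffled = sorted(names)  # sort first so the result does not depend on input order
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_share)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical file names used only for illustration.
names = [f"invoice_{i:03d}.pdf" for i in range(100)]
train, test = split_documents(names)
print(len(train), len(test))  # 80 20
```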

All the documents must fit a certain set of requirements:

- each file has to contain exactly one document (e.g. no multiple documents scanned into a single file);
- if a document is longer than one page, the pages have to be in order, or at least the actual first page has to come first;
- the supported formats are PNG and PDF;
- each document has to be assigned a category.
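The format requirement can be checked locally before uploading. The sketch below only inspects file extensions, with hypothetical file names and a hypothetical `unsupported_files` helper. The other requirements (one document per file, page order, category) still need a manual review.

```python
from pathlib import Path

# Formats listed above as supported.
SUPPORTED_SUFFIXES = {".pdf", ".png"}


def unsupported_files(paths):
    """Return the paths whose extension is neither PDF nor PNG (case-insensitive)."""
    return [str(p) for p in paths if Path(p).suffix.lower() not in SUPPORTED_SUFFIXES]


# Hypothetical file names used only for illustration.
candidates = ["invoice_001.pdf", "scan_002.PNG", "notes.docx"]
print(unsupported_files(candidates))  # ['notes.docx']
```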

Preferably, a balanced dataset contains single-page and multi-page documents in roughly equal amounts, and the split across categories should also be balanced.
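One way to check the balance is to count documents per category and per length class. The sketch below assumes you keep a simple list of (category, page count) pairs for your files; both the data and the category names are hypothetical.

```python
from collections import Counter

# Hypothetical metadata: (category, number of pages) for each document.
documents = [
    ("Invoice", 1),
    ("Invoice", 3),
    ("Receipt", 1),
    ("Receipt", 2),
    ("Invoice", 1),
]

# Count documents per category and per length class.
per_category = Counter(category for category, _ in documents)
per_length = Counter(
    "single-page" if pages == 1 else "multi-page" for _, pages in documents
)

print(dict(per_category))  # {'Invoice': 3, 'Receipt': 2}
print(dict(per_length))    # {'single-page': 3, 'multi-page': 2}
```

Large differences between the counts suggest adding more documents of the underrepresented kind.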

To add documents, go to the Documents page and upload all the necessary files. After that, you can open any document in the smartview by clicking on its title and start labelling it there.

To assign a label to a span, click on any square selection, choose the appropriate label and label set, and save. This creates an annotation.

After labelling all the documents, assign each of them to either the training or the test set. This is done in the "Set a status" dropdown menu above the document opened in the smartview.

## Training Extraction AI

When all the documents are uploaded and processed, we can start training. Go to Home > Extraction AIs and select Train extraction AI in the upper right corner. Select the desired category and save the instance. The AI is then queued for training, which starts as soon as server capacity is available. Once training has finished, the status is updated to "Training finished".

After reviewing the evaluation results, you may want to repeat the training cycle with new documents added in the same way.

## Training Categorization AI

After training all the necessary Extraction AI instances, we can move on to training the Categorization AI.

To create categories, go to Home > Categories > + Add, give the new category a name and save it.

To train the Categorization AI on the documents, you need to assign categories to them. This can be done in the Category column on the Documents page. After that, go to Home > Categorization AIs and select Train categorization AI in the upper right corner. Select the project and save the instance. The AI is then queued for training, which starts as soon as server capacity is available. Once training has finished, the status is updated to "Training finished".

After reviewing the evaluation results, you may want to repeat the training cycle with new documents added in the same way.

## Alternate training options

Before you start, make sure you have **installed** and **initialized** the konfuzio_sdk package as shown in the readme of the [repository](https://github.com/konfuzio-ai/Python-SDK).

In [None]:
!pip install konfuzio-sdk

In [None]:
!konfuzio_sdk init

Importing necessary libraries and packages:

In [1]:
import os

import numpy as np
import pandas as pd
import tensorflow as tf

from collections import Counter
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.preprocessing.text import Tokenizer
from konfuzio_sdk.data import Project, Document
from nltk import word_tokenize
from PIL import Image
from tqdm import tqdm

Setting seed for reproducibility purposes:

In [2]:
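seed_value = 42
os.environ['PYTHONHASHSEED'] = str(seed_value)  # read at interpreter start-up, so this only affects subprocesses
np.random.seed(seed_value)  # seed NumPy's global random generator
tf.random.set_seed(seed_value)  # seed TensorFlow's global random generator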

We will use a multilayer perceptron built with the Keras library and a vocabulary built with `Counter`.
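As a rough illustration of the vocabulary idea, the sketch below counts tokens with `Counter` and maps the most frequent ones to integer indices. It is not the tutorial's own implementation: the `build_vocabulary` and `encode` helpers, the sample texts and the `max_size` parameter are assumptions, and plain `str.split` stands in for a real tokenizer to keep the example self-contained.

```python
from collections import Counter


def build_vocabulary(texts, max_size=5):
    """Map the `max_size` most frequent lower-cased tokens to indices starting at 1."""
    counts = Counter(token for text in texts for token in text.lower().split())
    # Index 0 is left free for tokens that are not in the vocabulary.
    return {
        token: index
        for index, (token, _) in enumerate(counts.most_common(max_size), start=1)
    }


def encode(text, vocabulary):
    """Replace each token with its index, using 0 for unknown tokens."""
    return [vocabulary.get(token, 0) for token in text.lower().split()]


# Hypothetical document texts used only for illustration.
texts = ["Invoice total 100 EUR", "Total net sum 80 EUR", "Receipt total 20 EUR"]
vocabulary = build_vocabulary(texts)
encoded = encode("Total 999 EUR", vocabulary)
print(encoded)  # [1, 0, 2]
```

Integer sequences like this can then be turned into fixed-size feature vectors for a perceptron.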