klondike-AI/klassifier

Introduction

The Klondike Classifier predicts a target field using the model generated by the train command.

⚙️ Set up the environment

▶️ STEP 1:

Download the latest release from this project page and unzip the file into /etc/klondike_classifier.

▶️ STEP 2:

Install these requirements:

  • cd /etc/klondike_classifier
  • pip3 install -r requirements.txt
  • python3 -m nltk.downloader stopwords (or python3 -m nltk.downloader all to download every NLTK package)
  • chmod 777 -R treetagger
  • chmod 777 -R CNN_tuning
  • pip3 install -e git+https://github.com/ildiopantofola/nonconformist.git@master#egg=nonconformist
  • python3 -m spacy download en_core_web_lg
  • python3 -m spacy download fr_core_news_lg
  • python3 -m spacy download de_core_news_lg
  • python3 -m spacy download it_core_news_lg
  • python3 -m spacy download es_core_news_lg
  • python3 -m spacy download pt_core_news_lg
  • python3 -m spacy download nb_core_news_lg
  • python3 -m spacy download da_core_news_lg
  • cd treetagger && unzip lib.zip
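
After running the install steps above, a quick check can confirm that the packages are importable. The following is a minimal sketch, not part of the project: the helper name missing_packages is hypothetical, and it relies only on the fact that spaCy models installed with "spacy download" are regular Python packages named after the model.

```python
import importlib.util

# Model package names taken from the install steps above.
SPACY_MODELS = [
    "en_core_web_lg", "fr_core_news_lg", "de_core_news_lg", "it_core_news_lg",
    "es_core_news_lg", "pt_core_news_lg", "nb_core_news_lg", "da_core_news_lg",
]


def missing_packages(names):
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Hypothetical check: the libraries plus every spaCy model listed above.
    missing = missing_packages(["nltk", "spacy"] + SPACY_MODELS)
    print("Missing packages:", ", ".join(missing) if missing else "none")
```

Any name printed as missing points at the install command that still has to be run.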

▶️ STEP 3:

Go to https://huggingface.co/neuraly/bert-base-italian-cased-sentiment, open the "Files and versions" tab, then download these files and copy them into the pretrained_models folder:

  • config.json
  • special_tokens_map.json
  • tf_model.h5
  • tokenizer_config.json
  • vocab.txt
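
To confirm that every file from the list above was copied, a small check like the one below can help. It is only a sketch: the function name is hypothetical, and the path /etc/klondike_classifier/pretrained_models assumes that the pretrained_models folder sits directly inside the install directory from STEP 1.

```python
from pathlib import Path

# File names taken from the list above.
REQUIRED_FILES = [
    "config.json",
    "special_tokens_map.json",
    "tf_model.h5",
    "tokenizer_config.json",
    "vocab.txt",
]


def missing_model_files(folder):
    """Return the required file names that are not present in the folder."""
    folder = Path(folder)
    return [name for name in REQUIRED_FILES if not (folder / name).is_file()]


if __name__ == "__main__":
    # Assumed location of the folder; adjust it to your installation.
    missing = missing_model_files("/etc/klondike_classifier/pretrained_models")
    print("Missing files:", ", ".join(missing) if missing else "none")
```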

✏️ Configuration

utilities/connection.json contains the connections to the MySQL databases that the classifier uses to read data during training and to write the results of the predictions.

In utilities/connection_service.json you can configure the connection to the table that contains various services: a service is a specific classifier with its trained model.

In utilities/connection_cron.json you can configure the connection to the cron table, which can contain several versions of the same service.

In the services table you can configure the source table with data to train your classifier in these columns:

  • training_table: table name
  • training_table_key: key column of the table
  • training_columns: list of columns read to train the classifier
  • training_where: conditions used to filter the rows of the table
  • training_target: column with the attribute to predict
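
One way to picture how these columns fit together is as the parts of a single SELECT statement. The sketch below is an illustration under assumptions, not the query the classifier actually runs: build_training_query is a hypothetical helper, and the comma-separated format of training_columns is taken from the example INSERT in this README.

```python
def build_training_query(service):
    """Assemble a SELECT statement from one row of the services table."""
    # training_columns is assumed to be a comma-separated list of column names.
    columns = [c.strip() for c in service["training_columns"].split(",") if c.strip()]
    selected = [service["training_table_key"], *columns, service["training_target"]]
    column_sql = ", ".join(f"`{name}`" for name in selected)
    query = f"SELECT {column_sql} FROM `{service['training_table']}`"
    # training_where is optional (the column is nullable in the schema).
    where = (service.get("training_where") or "").strip()
    if where:
        query += f" WHERE {where}"
    return query


if __name__ == "__main__":
    service = {
        "training_table": "tickets",
        "training_table_key": "ticketid",
        "training_columns": "ticket_title,description",
        "training_where": "ticketcategories <> 'test'",
        "training_target": "ticketcategories",
    }
    print(build_training_query(service))
```

Because training_where is inserted verbatim, this only makes sense when the services table is written by trusted administrators.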

You can configure the connection to training_table in utilities/connection.json

In utilities/connection_predictions.json there is the connection to the ai_classified table that contains the predictions.

CREATE TABLE `services` (
  `id` int(11) NOT NULL,
  `training_table` varchar(50) NOT NULL DEFAULT '',
  `training_table_key` varchar(50) NOT NULL DEFAULT '',
  `training_columns` text NOT NULL,
  `training_where` text,
  `training_target` varchar(50) NOT NULL DEFAULT '',
  `parameters` json DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE `cron` (
  `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
  `serviceid` int(11) NOT NULL,
  `planned` timestamp NULL DEFAULT CURRENT_TIMESTAMP,
  `started` timestamp NULL DEFAULT '0000-00-00 00:00:00',
  `ended` timestamp NULL DEFAULT '0000-00-00 00:00:00',
  `status` int(1) NOT NULL DEFAULT '0',
  `training_result` text,
  PRIMARY KEY (`id`),
  KEY `serviceid` (`serviceid`,`status`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE `ai_classified` (
  `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
  `crmid` int(19) NOT NULL,
  `cronid` int(11) NOT NULL,
  `guessed` longtext,
  `guessed_time` timestamp NULL DEFAULT CURRENT_TIMESTAMP,
  `applied` int(1) NOT NULL DEFAULT '0',
  `applied_time` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
  PRIMARY KEY (`id`),
  KEY `applied` (`applied`),
  KEY `crmid` (`crmid`,`cronid`,`applied`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Example of a classifier configuration that predicts the category of a list of tickets:

INSERT INTO `services` (`id`, `training_table`, `training_table_key`, `training_columns`, `training_where`, `training_target`, `parameters`)
VALUES
	(1, 'tickets', 'ticketid', 'ticket_title,description', 'ticketcategories <> \'test\' and createdtime >= \"2015-01-01 00:00:00\"', 'ticketcategories', '{\"lemming\": true, \"language\": \"italian\", \"disable_CNN\": true, \"min_cardinality\": 20}');
INSERT INTO `cron` (`id`, `serviceid`, `planned`, `started`, `ended`, `status`, `training_result`)
VALUES
	(1, 1, '2022-06-03 18:00:00', NULL, '0000-00-00 00:00:00', 0, NULL);
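
The parameters value in the INSERT above is a JSON string escaped by hand. As an alternative, it can be generated from a Python dictionary. The sketch below is illustrative: encode_parameters and KNOWN_KEYS are hypothetical names, the key list comes from the "Service parameters" section of this README, and the unknown-key check is an extra guard that the classifier itself may not perform.

```python
import json

# Keys documented in the "Service parameters" section of this README.
KNOWN_KEYS = {"min_cardinality", "language", "lemming", "stemming", "disable_CNN", "CNN_config"}


def encode_parameters(parameters):
    """Serialize service parameters to JSON, rejecting undocumented keys."""
    unknown = sorted(set(parameters) - KNOWN_KEYS)
    if unknown:
        raise ValueError(f"Unknown service parameters: {', '.join(unknown)}")
    return json.dumps(parameters)


example = {"lemming": True, "language": "italian", "disable_CNN": True, "min_cardinality": 20}
encoded = encode_parameters(example)
print(encoded)
# prints {"lemming": true, "language": "italian", "disable_CNN": true, "min_cardinality": 20}
```

Passing the resulting string as a bound query parameter of your MySQL driver avoids the manual backslash escaping.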

Service parameters

In the parameters column of the services table you can configure:

  • min_cardinality to exclude from the dataset the rows whose cardinality is lower than this value
  • language (""/"italian"/"english") to interpret the text in a specific language
  • lemming (true/false) to enable lemmatization
  • stemming (true/false) to enable stemming
  • disable_CNN (true/false) to skip the CNN in the train command
  • CNN_config, a JSON object with these attributes (the text after each --> is a description, not part of the JSON)
{
    "NB_WORDS" : 10000,                                     -->  Number of words in the dictionary
    "NB_EPOCHS" : 30,                                       -->  Number of epochs
    "BATCH_SIZE" : 512,                                     -->  Size of the batches used in mini-batch gradient descent
    "MAX_LEN" : 400,                                        -->  Maximum number of words in a sequence
    "EMBEDDING_DIM" : 150,                                  -->  Number of dimensions of the GloVe word embeddings
    "NB_CONVOLUTION_FILTERS" : 128,                         -->  Number of convolution filters
    "CONVOLUTION_KERNEL_SIZE" : 4,                          -->  Convolution kernel size
    "LABEL_SMOOTHING" : 0.3,                                -->  Label smoothing factor
    "EARLYSTOPPING_PATIENCE" : 10,                          -->  Number of epochs without improvement in the monitored value before training stops
    "EARLYSTOPPING_MONITOR_PARAM" : "val_loss",             -->  Value monitored for early stopping
    "DROPOUT_PROB" : 0.5,                                   -->  Dropout probability in the CNN
    "PARAMS_AUTOTUNING" : false,                            -->  Enables CNN hyperparameter autotuning via Keras Tuner
    "MULTIGROUP_CNN" : true,                                -->  Enables the CNN MultiGroup custom embeddings mode
    "MG_GLOVE_EMB_FILE" : "itwiki_20180420_300d.txt",       -->  CNN MultiGroup GloVe embeddings file name (must be inside the embeddings folder)
    "MG_FASTTEXT_EMB_FILE" : "embed_wiki_it_1.3M_52D.vec",  -->  CNN MultiGroup FastText embeddings file name (must be inside the embeddings folder)
    "MG_GLOVE_EMB_DIM" : 300,                               -->  CNN MultiGroup GloVe embedding vector dimension
    "MG_FASTTEXT_EMB_DIM" : 52                              -->  CNN MultiGroup FastText embedding vector dimension
}
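
The min_cardinality option above can be read as a filter on rare target values. The sketch below shows that reading under a stated assumption: "cardinality" is taken to mean the number of rows that share the same target value, which this README does not spell out. The function name is hypothetical and this is not the project's own code.

```python
from collections import Counter


def filter_by_min_cardinality(rows, target, min_cardinality):
    """Keep only the rows whose target value occurs at least min_cardinality times."""
    counts = Counter(row[target] for row in rows)
    return [row for row in rows if counts[row[target]] >= min_cardinality]


if __name__ == "__main__":
    tickets = [
        {"ticketid": 1, "ticketcategories": "billing"},
        {"ticketid": 2, "ticketcategories": "billing"},
        {"ticketid": 3, "ticketcategories": "hardware"},
    ]
    # With a threshold of 2, the single "hardware" ticket is dropped.
    print(filter_by_min_cardinality(tickets, "ticketcategories", 2))
```

Under this reading, the value 20 used in the example service would remove every category with fewer than 20 tickets before training.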

🖥️ Usage

👩‍🏫 Train

The train command tests several algorithms, then keeps the most accurate model.

python3 CRM_classifier.py --train --from_db --cron_id <CRONID>

Example: training service 1

python3 CRM_classifier.py --train --from_db --cron_id 1
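
The idea of testing several algorithms and keeping the most accurate one can be sketched in a few lines. This is a toy illustration, not the project's training code: the candidate classifiers, the validation samples, and the names accuracy and select_best are all hypothetical, and the real command may use a different metric or validation scheme.

```python
def accuracy(predict, samples):
    """Fraction of (text, label) samples for which predict(text) equals label."""
    if not samples:
        raise ValueError("At least one validation sample is required")
    correct = sum(1 for text, label in samples if predict(text) == label)
    return correct / len(samples)


def select_best(candidates, validation):
    """Score every candidate on the validation set and return the best name with all scores."""
    scores = {name: accuracy(predict, validation) for name, predict in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    validation = [
        ("printer broken", "hardware"),
        ("cannot login", "account"),
        ("screen cracked", "hardware"),
    ]
    candidates = {
        "always_hardware": lambda text: "hardware",
        "keyword": lambda text: "account" if "login" in text else "hardware",
    }
    best, scores = select_best(candidates, validation)
    print(best, scores)
```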

🔮 Predict

The classify command predicts a target field using the model generated by the train command. To predict the target value of a new row, insert the row into training_table with the training_columns populated, then run this command with the id of that row.

python3 CRM_classifier.py --classify --from_db --cron_id <CRONID> --table <TABLENAME> --target <TARGETKEY> --id <ID>

Example: classifying a row with service 1

python3 CRM_classifier.py --classify --from_db --cron_id 1 --table tickets --target ticketid --id 54723
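
When several rows have to be classified, the command above can be assembled programmatically. The sketch below only builds the argument list from the flags shown in this README. The helper names are hypothetical, and running the script from /etc/klondike_classifier is an assumption based on STEP 1.

```python
import subprocess


def classify_command(cron_id, table, target, row_id):
    """Build the argument list for one classify call, using the flags from this README."""
    return [
        "python3", "CRM_classifier.py", "--classify", "--from_db",
        "--cron_id", str(cron_id),
        "--table", table,
        "--target", target,
        "--id", str(row_id),
    ]


def classify_rows(cron_id, table, target, row_ids, workdir="/etc/klondike_classifier"):
    """Run the classify command once per row id (workdir is an assumed install path)."""
    for row_id in row_ids:
        subprocess.run(classify_command(cron_id, table, target, row_id), cwd=workdir, check=True)


if __name__ == "__main__":
    # Only prints the command; call classify_rows(...) to actually execute it.
    print(" ".join(classify_command(1, "tickets", "ticketid", 54723)))
```

Passing the arguments as a list avoids shell quoting issues with table or column names.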