## What is Ludwig

Ludwig is a toolbox built on top of TensorFlow that lets you train and test deep learning models without writing code.

All you need to provide is a CSV file containing your data, a list of columns to use as inputs, and a list of columns to use as outputs; Ludwig will do the rest. Simple commands can be used to train models, both locally and in a distributed way, and to use them to predict on new data.
(Note: HDF5 and JSON input are also available.)

Developed by Uber, it is released under the open source Apache License 2.0.

Available at https://github.com/uber/ludwig

### Installation and steps

#### Install:   
pip install ludwig  

python -m spacy download en


#### Train:    
Prepare your data in a CSV file and define the input and output features in a model definition YAML file.

#### Predict:    
Use a previously trained model to predict the output targets.

#### Visualize:    
Ludwig comes with many visualization options for understanding the performance of deep learning models and comparing their predictions.



## Dataset examined

In our examples we will use a subset of patent data taken from EPO PATSTAT (patentdata.csv).

The data contains 5,000 EP applications with the following fields:

APPLN_AUTH - patent filing office (EPO)

APPLN_ID - unique application id

EARLIEST_FILING_YEAR - priority year

APPLN_ABSTRACT - patent abstract

IPC4 - first IPC class, leading 4 characters

PSN_NAME - standardized name of the first applicant













In [6]:
import pandas as pd

# Load the PATSTAT sample and preview the first rows
data = pd.read_csv("patentdata.csv")
data.head()


Unnamed: 0,EARLIEST_FILING_YEAR,APPLN_ABSTRACT,PSN_NAME,IPC4
0,1999,A mobile station (MS) that comprises an interf...,NOKIA CORPORATION,G06K
1,1991,Methods are disclosed for the production of an...,MRC (MEDICAL RESEARCH COUNCIL),G01N
2,1999,The detector includes scintillators (S1 to S5)...,PHILIPS INTELLECTUAL PROPERTY & STANDARDS,G01T
3,1999,The specification describes techniques for wir...,LUCENT TECHNOLOGIES,H01L
4,1999,The specification describes source/drain conta...,LUCENT TECHNOLOGIES,H01L


In [5]:
data.describe()

Unnamed: 0,EARLIEST_FILING_YEAR
count,5000.0
mean,2005.1648
std,4.096217
min,1979.0
25%,2005.0
50%,2007.0
75%,2007.0
max,2008.0


### Data definition in Ludwig

As explained above, training a model in Ludwig is straightforward: you provide a CSV dataset and a model definition YAML file.

The model definition contains a list of input features and a list of output features. You specify the names of the CSV columns that are inputs to your model, together with their datatypes, and the names of the CSV columns that are outputs, i.e. the target variables the model will learn to predict. Ludwig then composes a deep learning model accordingly and trains it for you.

Currently the available datatypes in Ludwig are:

binary

numerical

category

set

bag

sequence

text

timeseries

image

The model definition can contain additional information, in particular how to preprocess each column in the CSV.
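As a rough illustration of the feature lists described above, the following Python sketch builds feature entries as plain dictionaries and rejects any datatype that is not in the list given in this section. The helper name `make_feature` is hypothetical and is not part of Ludwig.

```python
# Datatypes listed in this section
LUDWIG_DATATYPES = {
    "binary", "numerical", "category", "set", "bag",
    "sequence", "text", "timeseries", "image",
}


def make_feature(name, datatype):
    """Return a feature entry, rejecting datatypes not listed above.

    Hypothetical helper for illustration; it is not part of Ludwig.
    """
    if datatype not in LUDWIG_DATATYPES:
        raise ValueError(f"unknown datatype {datatype!r} for column {name!r}")
    return {"name": name, "type": datatype}


input_features = [
    make_feature("EARLIEST_FILING_YEAR", "numerical"),
    make_feature("APPLN_ABSTRACT", "text"),
]
output_features = [make_feature("IPC4", "category")]
print(input_features, output_features)
```

A misspelled datatype such as `make_feature("IPC4", "string")` raises a `ValueError` before any training is attempted.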


### Training

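From the patents dataset we will use the abstract, the filing year and the applicant's standardized name to predict the content of the IPC4 field. The columns involved are:

EARLIEST_FILING_YEAR,APPLN_ABSTRACT,PSN_NAME,IPC4

A raw record, with its fields in the order listed under "Dataset examined", looks like this:

"EP",1,1999,"a lot of text","G06K","NOKIA CORPORATION"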

The model is trained by calling:

ludwig train [options]

  --data_csv DATA_CSV   input data CSV file. If it has a split column, it will
                        be used for splitting (0: train, 1: validation, 2:
                        test), otherwise the dataset will be randomly split
                        
  -mdf --model_definition_file MODEL_DEFINITION_FILE
                        YAML file describing the model. 
                        
  --output_directory OUTPUT_DIRECTORY
                        directory that contains the results                        
                        

(refer to: https://uber.github.io/ludwig/user_guide/)

We will use a multi-input model defined in the modeldefinition.yaml file.


The structure of the model definition file is a dictionary with five keys:


input_features: []

combiner: {}

output_features: []

training: {}

preprocessing: {}




input_features:
    -
        name: EARLIEST_FILING_YEAR
        type: numerical
    -
        name: APPLN_ABSTRACT
        type: text
        missing_value_strategy: ''
    -
        name: PSN_NAME
        type: text
        missing_value_strategy: ''

output_features:
    -
        name: IPC4
        type: category
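A model definition only works if every feature name matches a column in the CSV. The sketch below mirrors the definition above as a Python dictionary (leaving out `missing_value_strategy`) and compares its names against a CSV header using only the standard library. The function name `missing_columns` and the in-memory sample header are assumptions made for illustration.

```python
import csv
import io

# The model definition above, written as a Python dictionary
model_definition = {
    "input_features": [
        {"name": "EARLIEST_FILING_YEAR", "type": "numerical"},
        {"name": "APPLN_ABSTRACT", "type": "text"},
        {"name": "PSN_NAME", "type": "text"},
    ],
    "output_features": [{"name": "IPC4", "type": "category"}],
}


def missing_columns(definition, csv_file):
    """Return feature names that do not appear in the CSV header row."""
    header = next(csv.reader(csv_file))
    features = definition["input_features"] + definition["output_features"]
    return [f["name"] for f in features if f["name"] not in header]


# Stand-in for open("patentdata.csv", newline=""); header taken from data.head()
sample = io.StringIO("EARLIEST_FILING_YEAR,APPLN_ABSTRACT,PSN_NAME,IPC4\n")
print(missing_columns(model_definition, sample))  # prints []
```

An empty list means every feature has a matching column; any name returned points to a typo in either the YAML file or the CSV header.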



Start the training by typing the following command in your console:

#### ludwig train --data_csv patent_data.csv --model_definition_file model_definition.yaml --output_directory patresults1

or run the batch file 

### ludwig_run_pat.bat


(Note: you can distribute the training of your models using Horovod, which allows training on a single machine with multiple GPUs as well as on multiple machines.)
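If you prefer to start the training from Python rather than from a batch file, one option is to assemble the same command and hand it to the standard `subprocess` module. This is only a sketch: `build_train_command` is a hypothetical helper, and the flags are the ones used in the train command above.

```python
import shlex


def build_train_command(data_csv, model_definition_file, output_directory):
    """Assemble the argument list for `ludwig train` with the flags shown above."""
    return [
        "ludwig", "train",
        "--data_csv", data_csv,
        "--model_definition_file", model_definition_file,
        "--output_directory", output_directory,
    ]


command = build_train_command("patent_data.csv", "model_definition.yaml", "patresults1")
print(" ".join(shlex.quote(part) for part in command))

# To launch the training (requires Ludwig to be installed):
# import subprocess
# subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems when a path contains spaces.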

## What happens


After training, Ludwig will create a directory under results containing the trained model with its hyperparameters and summary statistics of the training process. 

Inside the folders you will find many intermediate files, in particular one HDF5 file and one JSON file. The HDF5 file contains the data mapped to numpy ndarrays, while the JSON file contains the mappings from the values in the tensors to their original labels.
When you rerun the training or move on to the next steps, these files are reused to save time.

Data can be split into train, validation and test sets in several ways: either by providing a column named SPLIT or by providing three separate datasets (--data_train_csv, --data_validation_csv, --data_test_csv).
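The split codes described in the `--data_csv` help text (0: train, 1: validation, 2: test) can be generated ahead of time so that the split is reproducible. The following is a minimal sketch using only the standard library; the function name `assign_splits` and the 70/10/20 proportions are assumptions chosen for illustration.

```python
import random


def assign_splits(n_rows, validation_pct=10, test_pct=20, seed=42):
    """Return one split code per row: 0 = train, 1 = validation, 2 = test."""
    n_validation = n_rows * validation_pct // 100
    n_test = n_rows * test_pct // 100
    codes = [1] * n_validation + [2] * n_test
    codes += [0] * (n_rows - len(codes))  # everything else is training data
    random.Random(seed).shuffle(codes)    # fixed seed keeps the split repeatable
    return codes


splits = assign_splits(5000)
print(splits.count(0), splits.count(1), splits.count(2))  # prints 3500 500 1000
```

With the DataFrame loaded earlier, the codes could be stored with something like `data["SPLIT"] = assign_splits(len(data))` before writing the CSV back out; check which column name your Ludwig version expects.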

Other important files are:

description.json - a file containing a description of the training process with all the information to reproduce it.

training_statistics.json - a file containing records of all measures and losses for each epoch.

model - a directory containing model hyperparameters, weights, checkpoints and logs (for TensorBoard).
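Because training_statistics.json records measures per epoch, it can also be inspected directly from Python. The sketch below finds the epoch with the lowest validation loss. The nested layout it relies on (split, then output feature, then measure, then one value per epoch) is an assumption and should be checked against your own file; `best_epoch` is a hypothetical helper and the numbers are toy values.

```python
import json


def best_epoch(stats, split="validation", feature="IPC4", measure="loss"):
    """Return (epoch, value) with the lowest value of `measure`, epochs counted from 1.

    Assumes stats[split][feature][measure] is a list with one entry per epoch.
    That layout is an assumption; inspect your own file before relying on it.
    """
    values = stats[split][feature][measure]
    index = min(range(len(values)), key=values.__getitem__)
    return index + 1, values[index]


# Toy data in the assumed layout; with a real run you would load the file with
# json.load() from path/to/training_statistics.json instead.
stats = json.loads('{"validation": {"IPC4": {"loss": [2.9, 2.1, 1.8, 1.9]}}}')
print(best_epoch(stats))  # prints (3, 1.8)
```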


## Visualize training results


You can visualize the training results using one of the several options available in the visualize tool:


ludwig visualize -v (type of graph) -ts/ps path to stats.json

#### ludwig visualize --visualization learning_curves --training_statistics path/to/training_statistics.json

or run 

#### ludwig_vis_pat.bat



Other visualizations available:

https://uber.github.io/ludwig/user_guide/#visualizations


confusion_matrix

compare_performance

compare_classifiers_performance_from_pred

### How to read loss and accuracy

The lower the loss, the better the model. The loss is calculated on the training and validation sets, and it indicates how well the model is doing on each of them. Unlike accuracy, loss is not a percentage. It is a summation of the errors made for each example in the training or validation set.

Loss is often used in the training process to find the "best" parameter values for your model.

Accuracy takes a more applied perspective. Once you have found the optimized parameters, you use this metric to evaluate how accurate your model's predictions are compared to the true data.

The Hits@K measure counts a prediction as correct if the true label is among the model's first k predictions.
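The difference between these measures is easier to see on a toy example. The sketch below uses cross-entropy as the loss, which is a common choice for category outputs but is an assumption here, and computes accuracy as Hits@1. The probabilities are invented for illustration and the function names are hypothetical.

```python
import math


def cross_entropy(probabilities, true_label):
    """Loss for one example: negative log of the probability given to the true label."""
    return -math.log(probabilities[true_label])


def hits_at_k(probabilities, true_label, k):
    """True if the true label is among the k most probable labels."""
    ranked = sorted(probabilities, key=probabilities.get, reverse=True)
    return true_label in ranked[:k]


# Toy predictions: (class probabilities, true IPC4 label)
examples = [
    ({"G06K": 0.7, "H01L": 0.2, "G01N": 0.1}, "G06K"),
    ({"G06K": 0.5, "H01L": 0.4, "G01N": 0.1}, "H01L"),
    ({"G06K": 0.6, "H01L": 0.3, "G01N": 0.1}, "G01N"),
]

total_loss = sum(cross_entropy(p, y) for p, y in examples)
accuracy = sum(hits_at_k(p, y, 1) for p, y in examples) / len(examples)
hits_at_2 = sum(hits_at_k(p, y, 2) for p, y in examples) / len(examples)
print(round(total_loss, 3), round(accuracy, 3), round(hits_at_2, 3))
# prints 3.576 0.333 0.667
```

The second example is wrong under accuracy but correct under Hits@2, and the third contributes most of the loss because the model gave the true label only 10% probability.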



## Predict

If you want your previously trained model to predict target output values, you can type the following command in your console:

#### ludwig predict --data_csv path/to/data.csv --model_path /path/to/model -od output directory

Running this command will return model predictions and some test performance statistics if the new dataset contains ground truth information to compare to. Those can be visualized by the visualize tool, which can also be used to compare performances and predictions of different models, for instance:


#### ludwig visualize --visualization compare_performance -ps path/to/test_statistics_model_1.json path/to/test_statistics_model_2.json




Further reading:

https://dev.to/chrishunt/code-free-machine-learning-with-ludwig-2gap

https://towardsdatascience.com/metrics-to-evaluate-your-machine-learning-algorithm-f10ba6e38234

### ludwig experiment

Note that with ludwig experiment it is possible to run training and prediction in a single step:

#### ludwig experiment  --data_csv cars.csv --model_definition_file modeldef.yaml --output_directory results




The folder contains two versions of the batch programs that run Ludwig (ludwig_run_pat.bat and ludwig_run_pat_ipc4v2.bat) and visualize the results (ludwig_vis_pat.bat and ludwig_vis_pat_ipc4v2.bat).

The Word document LUDWIG-quickgraphs.docx shows the results. Note the two different models.



Improvements:

At model level:

        level: word
        encoder: parallel_cnn
        preprocessing:
          word_format: english_tokenize

At training level:

training:
  batch_size: 128
  epochs: 1000
  early_stop: 50
  learning_rate: 0.003
  optimizer:
    type: adagrad    
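The `epochs: 1000` and `early_stop: 50` settings work together: training is capped at 1000 epochs but may end sooner once validation performance stops improving. The sketch below is a simplified patience rule written for illustration, not Ludwig's implementation; the use of validation loss as the measure of improvement is an assumption, and `stopping_epoch` is a hypothetical name.

```python
def stopping_epoch(validation_losses, early_stop=50):
    """Return the epoch (counted from 1) at which training would stop.

    Simplified patience rule: stop once `early_stop` consecutive epochs
    have passed without a new best validation loss.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(validation_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= early_stop:
            return epoch
    return len(validation_losses)  # patience never ran out


# Toy curve: improves for three epochs, then stalls
losses = [2.0, 1.5, 1.2, 1.3, 1.25, 1.4, 1.3]
print(stopping_epoch(losses, early_stop=3))  # prints 6
```

A larger `early_stop` value tolerates longer plateaus at the cost of extra training time.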
