<img src="../assets/UpLabel.png" width="400" align="left"><br><br><br><br>

# Introduction to UpLabel

UpLabel is a lightweight, modular, Python-based tool that supports your machine learning tasks by making the data labeling process more efficient and structured. UpLabel is presented and tested in the MLADS session *"Distributed and Automated Data Labeling using Active Learning: Insights from the Field"*.
 
## Session Description
High-quality training data is essential for succeeding at any supervised machine learning task. There are numerous open source tools that allow for a structured approach to labeling. Instead of randomly choosing the data to label, we use machine learning itself to continuously improve the quality of the training data. Based on the expertise of the labelers and the complexity of the data, labeling tasks can be distributed in an intelligent way. Using a real-world example from one of our customers, we will show how to apply the latest technology to optimize the task of labeling data for NLP problems.

## Software Component and User Flow
The following images illustrate the software component flow and the labeler's user flow.

### Software Component Flow
---
<p><img src="../assets/MLADS_Components.png" width="60%" align="center"></p>

### User Flow
---
<p><img src="../assets/MLADS_UserFlow.png" width="60%" align="center"></p>

### Prepare Workspace

The required libraries are loaded below; most of them are imported by the main script.

In [None]:
import matplotlib.pyplot as plt  # the plotting interface conventionally aliased as plt
import sys

sys.path.append('../code')
import main

In [None]:
%matplotlib inline

## Task Setup

There are two ways to work through this session:
1. Use our example data (German news data).
2. Use your own data, if you brought some.

#### If you want to use our example:
- Use `lab` as your project reference below (see step *"Run Iteration #0"*). The example case will be loaded.
- Set the `dir` parameter to the folder where the lab data is located, e.g. `C:/uplabel/data/lab/`

#### If you brought your own data:
- Either create a task config: copy the code below, save it as `params.yml`, and place it in a subfolder of `task`
- The task can be named as you like
- Or simply rename the folder "sample" to your desired project name and use the sample file in it
- Set the `dir` parameter to the folder where your data will be located

```yaml
data:
    dir: ~/[YOUR DIRECTORY GOES HERE]/[projectname]
    source: input.txt
    cols: ['text','label']
    extras: []
    target_column: label
    text_column: text
parameters:
    task: cat
    language: de
    labelers: 3
    min_split_size: 0
    max_split_size: 300
    quality: 1
    estimate_clusters: True
    quality_size: 0.1
    overlap_size: 0.1
```
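Before running an iteration, it can help to sanity-check the config values. The following is only a minimal sketch: `check_task_config` is a hypothetical helper, not part of UpLabel, and it works on a plain dictionary that mirrors the YAML above rather than on the file itself. It also assumes that `quality_size` and `overlap_size` are fractions between 0 and 1, which the example values suggest but the tool may define differently.

```python
# Hypothetical helper (not part of UpLabel): basic consistency checks
# on a task config given as a dict that mirrors params.yml.
def check_task_config(config: dict) -> list:
    """Return a list of human-readable problems; an empty list means no issue was found."""
    problems = []
    data = config.get("data", {})
    params = config.get("parameters", {})

    # The text and target columns should be among the declared columns.
    cols = data.get("cols", [])
    for key in ("text_column", "target_column"):
        if data.get(key) not in cols:
            problems.append(f"data.{key}={data.get(key)!r} is not listed in data.cols")

    # The split size bounds should not contradict each other.
    if params.get("min_split_size", 0) > params.get("max_split_size", 0):
        problems.append("parameters.min_split_size exceeds parameters.max_split_size")

    # Assumption: both sizes are fractions of the data, so they lie in [0, 1].
    for key in ("quality_size", "overlap_size"):
        value = params.get(key, 0)
        if not 0 <= value <= 1:
            problems.append(f"parameters.{key}={value} is outside [0, 1]")

    # At least one labeler is needed to distribute splits.
    if params.get("labelers", 1) < 1:
        problems.append("parameters.labelers must be at least 1")

    return problems


# The example config from above, written as a Python dict.
example_config = {
    "data": {
        "dir": "~/[YOUR DIRECTORY GOES HERE]/[projectname]",
        "source": "input.txt",
        "cols": ["text", "label"],
        "extras": [],
        "target_column": "label",
        "text_column": "text",
    },
    "parameters": {
        "task": "cat",
        "language": "de",
        "labelers": 3,
        "min_split_size": 0,
        "max_split_size": 300,
        "quality": 1,
        "estimate_clusters": True,
        "quality_size": 0.1,
        "overlap_size": 0.1,
    },
}

for problem in check_task_config(example_config):
    print(problem)
```

If you load your own `params.yml` into a dictionary, you can pass it to the same function; any printed line points at a value worth double-checking before you start labeling.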


In [None]:
project_name = 'news_en'

## Run Iteration #0

- This is the start of the initial iteration of the UpLabel process. 
- Feel free to create your own project by adding a parameter file to `\tasks` and your data to `\data\[project name]`. Don't forget to update the `project_name` variable above with the name of your task.

Note: you can add `debug_iter_id=X` to repeat an iteration, where X is your iteration number.

In [None]:
main.Main(project_name)

## Fun part: label your data

- After the first iteration, you can start labeling your data
- You can find the data splits in the folder you set in the `dir` parameter
- Files are named as follows:
    - `[original file name]-it_[iteration number]-split_[split number].xlsx`, like `data-it_1-split_1.xlsx`
- Open your data and label it!
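If you want to keep track of which split files belong to which iteration, the naming scheme above can be parsed with a regular expression. This is a minimal sketch based only on the pattern described in this notebook; `parse_split_filename` and `SplitFile` are hypothetical names and not part of UpLabel.

```python
import re
from typing import NamedTuple

# Matches "[original file name]-it_[iteration number]-split_[split number].xlsx"
SPLIT_FILE_PATTERN = re.compile(
    r"^(?P<name>.+)-it_(?P<iteration>\d+)-split_(?P<split>\d+)\.xlsx$"
)


class SplitFile(NamedTuple):
    name: str        # original file name without the suffixes
    iteration: int   # iteration number
    split: int       # split number


def parse_split_filename(filename: str) -> SplitFile:
    """Split a file name of the form described above into its three parts."""
    match = SPLIT_FILE_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"{filename!r} does not follow the split naming scheme")
    return SplitFile(
        name=match["name"],
        iteration=int(match["iteration"]),
        split=int(match["split"]),
    )


print(parse_split_filename("data-it_1-split_1.xlsx"))
```

Combined with a directory listing of your `dir` folder, this lets you group the files by iteration or pick out the splits assigned to a specific labeler.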

## Run Iteration #1

In [None]:
main.Main(project_name)

## Label some more!

## Run Iteration #2

In [None]:
main.Main(project_name)