# Manual labelling of data

This document demonstrates an example of labelling text data. Freshly gathered data, or even a dataset that has been available for a long time, may come without labels. Many machine learning algorithms need labels as the response variable to train a model on. Even the rest, the unsupervised algorithms, can make use of at least a small volume of labeled validation data. Labelling some part of a dataset is therefore useful.

## Example dataset

As an example we will use 100 randomly selected answers from the publicly available dataset [Yahoo Answers!](https://drive.google.com/drive/folders/0Bz8a_Dbh9Qhbfll6bVpmNUtUcFdjYmF2SEpmZUZUcVNiMUw1TWN6RDV3a0JHT3kxLVhVR2M). This popular dataset has been used in many text classification papers. Although it already has labels, it serves well for the purpose of this showcase. Our task is to select labels out of 10 possible classes. In the original dataset each answer has just 1 class selected, but our tool allows selecting multiple classes.

In [1]:
import pickle
# Load data with pickle library
with open('data/text_data.pickle', 'rb') as saved_data:
    text_list = pickle.load(saved_data)
    
# Often we want to label only a smaller volume of data, so we might want to sample it.
# It is useful to create ids so the labeled part of the dataset can be matched back to the rest.
# Here we simply use the values 1 to len(text_list) as ids.
id_list = list(range(1, len(text_list) + 1))
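
The comments above mention sampling the data while keeping ids, so that the labeled subset can be matched back to the full dataset. A minimal sketch of one way to do that follows. The helper `sample_with_ids` and the stand-in data are hypothetical and not part of `manual_labelling`; the only requirement is that `texts` and `ids` are parallel lists.

```python
import random


def sample_with_ids(texts, ids, k, seed=123):
    """Draw k (id, text) pairs without replacement, reproducibly for a given seed."""
    rng = random.Random(seed)      # local generator, leaves global random state untouched
    pairs = list(zip(ids, texts))  # keep every text attached to its id
    return rng.sample(pairs, k)


# Stand-in data; in the notebook these would be text_list and id_list.
texts = [f"answer {n}" for n in range(1, 11)]
ids = list(range(1, len(texts) + 1))

sample = sample_with_ids(texts, ids, k=3)
sample_ids = [i for i, _ in sample]
sample_texts = [t for _, t in sample]
```

Because each sampled text travels with its id, labels collected for the sample can later be joined back to the unlabeled remainder by id.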

Here are the 10 classes that we can select from:

In [2]:
label_list = ['Culture', 'Science and Mathematics', 'Health', 'Education and Reference',
              'Computers and Internet', 'Sports', 'Entertainment', 'Business and Finance',
              'Family and Relationships', 'Government and Politics']

### Instructions

Once we start the labelling function, it loops through the given data list. Each text is presented together with a list of options, and the user's task is to select one of them by typing its number and pressing Enter. The menu contains three control options followed by the labels:

- `-2` (Exit): the program shuts down. All records labeled so far are stored in the created database, so the user can return later and continue with the next unlabeled record.
- `-1` (Add new class): a prompt asks for a new label, which the user must provide. The label is assigned to the currently displayed record and from then on is shown in the menu as a new option.
- `0` (Finish): use this when you are done with the current record. If no label was given, the record is skipped and left out of the labeled database; otherwise the selections are recorded in the database.
- `1` to the length of `label_list`: the labels themselves, one number per label.

After each selected label the same menu is displayed again, asking for another label, until the record has 3 labels, since we decided to allow up to 3 labels per record. If fewer than 3 labels are appropriate, the user must press 0 to finish labelling the current record. Otherwise the program asks 3 times for each record and then continues with the next record automatically.
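
The menu logic for a single record can be sketched as below. This is not the actual code of `manual_labelling`, only an illustration of the rules described above. The function `label_one_record`, its `read_input` parameter and the duplicate handling are assumptions made for this sketch. Input is passed in as a callable, so the loop can be driven by `input` interactively or by scripted answers.

```python
EXIT, ADD_NEW, FINISH = -2, -1, 0


def label_one_record(read_input, label_list, max_labels=3):
    """Collect up to max_labels labels for one record.

    read_input is a callable returning the next typed line (e.g. input).
    Returns (labels, exit_requested). label_list is extended in place
    when the user adds a new class.
    """
    labels = []
    while len(labels) < max_labels:
        try:
            choice = int(read_input())
        except ValueError:
            continue  # not a number: show the menu again
        if choice == EXIT:
            return labels, True  # the caller decides what to store before exiting
        if choice == FINISH:
            break
        if choice == ADD_NEW:
            new_label = read_input().strip()
            label_list.append(new_label)  # becomes a menu option from now on
            labels.append(new_label)      # and is assigned to the current record
        elif 1 <= choice <= len(label_list):
            label = label_list[choice - 1]
            if label not in labels:  # assumption: the same label is not stored twice
                labels.append(label)
        # any other number is ignored and the menu is shown again
    return labels, False


# Scripted session: add the class "Travel", then pick options 1 and 2.
classes = ['Culture', 'Health']
answers = iter(['-1', 'Travel', '1', '2'])
labels, exit_requested = label_one_record(lambda: next(answers), classes)
```

In this session the loop ends without the user pressing 0, because the third label reaches the limit of 3. An empty `labels` list with `exit_requested` set to `False` corresponds to a skipped record.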

In [3]:
import manual_labelling
random_seed = 123
# If we are starting to label, create the database from scratch.
manual_labelling.start_labelling('data/database.sqlite', text_list, id_list, label_list, True, random_seed, True)

# To continue with a previously labeled database, change the last argument of the function to False.
# manual_labelling.start_labelling('data/database.sqlite', text_list, id_list, label_list, True, random_seed, False)

----------------------------------------
NUMBER OF LABELS GIVEN TO CURRENT TEXT: 0/3.
Already given labels: []
----------------------------------------
TEXT FOR LABELLING: 
 101.4 is really not that high.  Just make sure he's drinking plenty of fluids and taking some Tylenol.  Keep cold wash clothes on his forehead as well.  Keep an eye on the temperature and if it gets past 103 or so, I'd go to the ER.
----------------------------------------
PLEASE SELECT ONE OF FOLLOWING OPTIONS:
-2.) Exit
-1.) Add new class
0.) Press 0 to finish labelling of current text
1.) Culture
2.) Science and Mathematics
3.) Health
4.) Education and Reference
5.) Computers and Internet
6.) Sports
7.) Entertainment
8.) Business and Finance
9.) Family and Relationships
10.) Government and Politics
All used labels:
['Culture', 'Science and Mathematics', 'Health', 'Education and Reference', 'Computers and Internet', 'Sports', 'Entertainment', 'Business and Finance', 'Family and Relationships', 'Government and Pol

At this point we have a database with a few labeled records:

In [4]:
import pandas as pd
import sqlite3
import ast

# Load the labeled validation data
db = sqlite3.connect("data/database.sqlite")
df = pd.read_sql_query("SELECT * FROM label_table", db)
db.close()

# Now we can access the labeled data and even transform the labels into lists.
# The labels column holds each list as a string, so parse it with ast.literal_eval.
list_labels = df["labels"].apply(ast.literal_eval)
for text, labels in zip(df["text"], list_labels):
    print(f"Text: \n {text}")
    print(f"Labels: \n {labels}")
    print("*" * 40)

Text: 
 One suggestion, the white text can be a bit bright and glaring to the eyes when first arriving to your site. Perhaps use a shade lighter #CCCCCC instead of #FFFFFF. Other than that, good job!
Labels: 
 ['Computers and Internet']
****************************************
Text: 
 WELL IVE HAD ALOT OF BAD THINGS HAPPEN TO ME BUT THE WORST  2 THINGS 2 YRS AGO MY DAD DIED THE OTHER 2005 FOUND OUT I HAVE BREAST CANCER... GETTING TREATMENTS NOW CHEMO SURGERY RADATION HERPECTION...NOT OVER IT NEVER WILL BE,,,,DAD WONT BE OVER HIM EITHER,,, BUT I MANAGE TO GO THROUGH EACH DAY ...THINKING THERE IS A REASON FOR ME TO BE HERE,,,,AND I DONT NO WHAT IT COULD BE,,
Labels: 
 ['Health', 'Family and Relationships']
****************************************
Text: 
 Yes....wait till you get married........lot less problems in life.
Labels: 
 ['Family and Relationships', 'Education and Reference']
****************************************
Text: 
 Below are examples of tools available on the Web right 