# Demo

The labeling process consists of two phases:
1. Preparing numerical knowledge bases (NKB): The data can be extracted from a knowledge graph (City Data, DBpedia, Wikidata) or from your own custom data (Open Data).
2. Labeling queries (bags of numbers, BON): Searching for the most relevant numerical attributes in the NKB with respect to a specific numerical similarity (SemanticTyper, DSL, Distribution-Based Similarity (DBS)).

## Configuration
We will be working with:
- 4 datasets: City Data, DBpedia, Wikidata, Open Data
- 5 evaluation methods: SemanticTyper, DSL, DBS1 (Manhattan), DBS2 (Euclidean), DBSinf (Chebyshev)
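The three DBS variants differ only in the distance applied to two distributions. The sketch below illustrates that idea under stated assumptions: each bag of numbers is resampled into a fixed number of quantiles, and the two quantile profiles are compared with the Manhattan, Euclidean, or Chebyshev distance. The helpers `quantile_profile` and `minkowski_distance` are hypothetical names introduced here for illustration. They are not functions of the `labeling` module, and the transformation used by the actual DBS implementation may differ.

```python
import math


def quantile_profile(values, size=100):
    """Resample a bag of numbers into `size` evenly spaced quantiles.

    Assumption: linear interpolation between sorted values. This is an
    illustrative transform, not necessarily the one used by the demo code.
    """
    if size < 2:
        raise ValueError("size must be at least 2")
    data = sorted(values)
    if not data:
        raise ValueError("empty bag of numbers")
    if len(data) == 1:
        return [float(data[0])] * size
    profile = []
    for i in range(size):
        pos = i * (len(data) - 1) / (size - 1)  # fractional index into data
        lo = int(math.floor(pos))
        hi = min(lo + 1, len(data) - 1)
        frac = pos - lo
        profile.append(data[lo] * (1 - frac) + data[hi] * frac)
    return profile


def minkowski_distance(p, q, order):
    """Distance between two equal-length profiles.

    order=1 is Manhattan, order=2 is Euclidean, order=math.inf is Chebyshev.
    """
    diffs = [abs(a - b) for a, b in zip(p, q)]
    if order == math.inf:
        return max(diffs)
    return sum(d ** order for d in diffs) ** (1.0 / order)


# Compare two small bags of numbers under the three distances.
heights = quantile_profile([1.7, 1.65, 1.7, 1.55, 1.71, 1.65, 1.88])
years = quantile_profile([2018, 1965, 1987, 1999, 2017])
for name, order in [("l1", 1), ("l2", 2), ("linf", math.inf)]:
    print(name, minkowski_distance(heights, years, order))
```

A smaller distance means the two bags of numbers have more similar distributions, which is how the ranked tables later in this demo should be read.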

In [None]:
import io_worker
import labeling
from attribute_obj import NumericalAttribute

dataset_names = ["all", "city", "dbpedia", "wikidata", "open"]
methods = ["semantictyper", "dsl", "dbs1", "dbs2", "dbsinf"]

config_dataname = "all"
config_trans_size = 100

## Prepare NKB
There are two types of dataset: txt and csv.
1. Txt: City Data and Open Data are organized in the same txt format as SemanticTyper. Each dataset contains n source files, and each source has m numerical attributes. The first line of a source file is the number of numerical attributes. The next two lines are the semantic label and the number of values of the first numerical attribute. The following lines are the values of that attribute, where the first value on each line is the number of characters of the numerical value. The remaining lines hold the same information for the next numerical attributes.
2. Csv: DBpedia and Wikidata are organized in csv format. Each csv file has one numerical column, and the file name is the semantic label.
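As a rough illustration of the csv layout described above, the sketch below reads a single-column csv file and uses the file name (without extension) as the semantic label. `load_csv_attribute` is a hypothetical helper written for this example. It is not the loader in `io_worker`, and it assumes that non-numeric cells such as a header row can simply be skipped.

```python
import csv
from pathlib import Path


def load_csv_attribute(path):
    """Return (label, values) for a one-column csv file.

    Assumptions: the label is the file stem, the numbers are in the first
    column, and rows that do not parse as floats are ignored.
    """
    path = Path(path)
    values = []
    with path.open(newline="") as handle:
        for row in csv.reader(handle):
            if not row:
                continue  # skip blank lines
            try:
                values.append(float(row[0]))
            except ValueError:
                continue  # skip a header or other non-numeric cell
    return path.stem, values
```

For example, a file named `height.csv` containing one number per line would yield the label `height` together with its list of values.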


In [2]:
io_worker.print_status("Data Loading: ")
labels = set()
knowledge_base = []

# "all" selects every dataset; any other value selects that dataset only
if config_dataname == "all":
    process_datasets = dataset_names[1:]
else:
    process_datasets = [config_dataname]

for data_name in process_datasets:
    temp_labels, temp_kb = io_worker.load_numerical_dataset(data_name)
    labels = labels.union(temp_labels)
    knowledge_base.extend(temp_kb)
    io_worker.print_status("Overall: %d(labels) - %d(columns) | Load: %s: %d(labels) - %d(columns) " %
                           (len(labels), len(knowledge_base), data_name, len(temp_labels), len(temp_kb)))



Data Loading: 
Overall: 30(labels) - 300(columns) | Load: city: 30(labels) - 300(columns) 


Overall: 226(labels) - 2329(columns) | Load: dbpedia: 203(labels) - 2029(columns) 


Overall: 380(labels) - 4017(columns) | Load: wikidata: 169(labels) - 1688(columns) 


Overall: 428(labels) - 4517(columns) | Load: open: 50(labels) - 500(columns) 


## Semantic Labeling
Prepare the unknown queries:
1. BON: the query is given as a bag of numbers.
2. CSV file: users can customize the example queries in "./data/test/".

In [3]:
# Query:
# Bag of Numbers
query_1 = NumericalAttribute("Unknown", [1.7, 1.65, 1.7, 1.55, 1.71, 1.65, 1.88])
query_2 = NumericalAttribute("Unknown", [0.88, 0.92, 0.17, 0.65, 0.66, 0.90, 0.88, 0.72, 0.76, 0.99])
query_3 = NumericalAttribute("Unknown", [2018, 1965, 1987, 1999, 2017, 2015, 2011, 2012, 2012, 2011])

# CSV files
queries = io_worker.load_queries("test")
queries = [query_1, query_2, query_3] + queries
io_worker.print_status("Queries: %d" % len(queries))

Queries: 7


## Labeling Results using DBS1
The results show the top 10 most relevant semantic labels in the NKB using DBS1.
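The ranking step can be pictured as scoring every attribute in the knowledge base against the query and keeping the k smallest distances. The following is a minimal sketch of that idea, not the code behind `labeling.get_labels`: `top_k_labels` and `mean_gap` are hypothetical names, the knowledge base is assumed to be a plain list of `(label, values)` pairs, and the toy distance stands in for DBS1. It also keeps one entry per attribute and does not merge columns that share a label.

```python
import heapq


def top_k_labels(query, knowledge_base, distance, k=10):
    """Return the k (distance, label) pairs with the smallest distance.

    knowledge_base: iterable of (label, values) pairs (an assumed layout).
    distance: any callable taking two lists of numbers.
    """
    scored = ((distance(query, values), label) for label, values in knowledge_base)
    return heapq.nsmallest(k, scored)


def mean_gap(a, b):
    """Toy distance for illustration only: absolute difference of the means."""
    return abs(sum(a) / len(a) - sum(b) / len(b))


toy_kb = [("height", [1.6, 1.7, 1.8]), ("year", [1990, 2005]), ("width", [2.5, 3.5])]
for rank, (dist, label) in enumerate(top_k_labels([1.7, 1.65], toy_kb, mean_gap, k=2), 1):
    print(rank, label, round(dist, 3))
```

Using `heapq.nsmallest` avoids sorting the whole knowledge base when only the first few ranks are displayed.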

In [4]:
# Semantic labeling using the Manhattan distance DBS1 (l1)
sem_method, sem_result = labeling.get_labels(queries, knowledge_base, "dbs1", config_trans_size)
io_worker.print_labeling(sem_method, sem_result)


Method: dbs1

Query 1: Unknown [1.55, 1.65, 1.65, 1.70, 1.70, ...]
+------+-----------------+--------------------------------+
| Rank |    Distance     |             Label              |
| 1    | 10.887          | height                         |
+------+-----------------+--------------------------------+
| 2    | 16.892          | refractive_index               |
+------+-----------------+--------------------------------+
| 3    | 55.750          | aprRainInch                    |
+------+-----------------+--------------------------------+
| 4    | 58.198          | density                        |
+------+-----------------+--------------------------------+
| 5    | 64.549          | periapsis                      |
+------+-----------------+--------------------------------+
| 6    | 69.740          | decRainInch                    |
+------+-----------------+--------------------------------+
| 7    | 80.618          | width                          |
+------+-----------------+-------

## Labeling Results (Compare Labeling Methods)
A comparison of semantic labeling between the p-value-based similarity methods and DBS.

In [5]:
# Run every method on the same queries and print the rankings side by side
compare_methods = ["dbs1", "dbs2", "dbsinf", "semantictyper", "dsl"]
io_worker.print_labeling_compare_mode(
    [labeling.get_labels(queries, knowledge_base, method, config_trans_size)
     for method in compare_methods])


Query 1: Unknown [1.55, 1.65, 1.65, 1.70, 1.70, ...]
+------+-----------------+-----------------+-----------------+-----------------+-----------------+
| Rank |      dbs1       |      dbs2       |     dbsinf      |  semantictyper  |       dsl       |
| 1    | height          | height          | height          | conversion_to_S | conversion_to_S |
|      |                 |                 |                 | I_unit          | I_unit          |
+------+-----------------+-----------------+-----------------+-----------------+-----------------+
| 2    | refractive_inde | refractive_inde | refractive_inde | perimeter       | perimeter       |
|      | x               | x               | x               |                 |                 |
+------+-----------------+-----------------+-----------------+-----------------+-----------------+
| 3    | aprRainInch     | aprRainInch     | humanDevelopmen | impactFactor    | aprRainInch     |
|      |                 |                 | tIndex     