# Data Quality

This notebook shows the performance of the pipeline in terms of precision and recall.

The main aim is to measure the performance of the activities that are delegated to external services, such as ChatGPT or other LLMs. Data quality measurements for classic tasks, like dropping HTML tags and semi-empty columns, are skipped because we trust the tools Python gives us.
Each dataset has its own experiment run.
This notebook focuses mainly on column naming, with a brief look at other tasks, such as record linkage, that are not part of the final project.

### How data quality is measured

Regardless of the task we want to perform, we measure the similarity between two tables: the input and the one we expect.
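
The notebook does not spell out the exact metric, so the following is only a minimal sketch under one assumption: each table is treated as a set of `(row, column name, value)` cells, and precision and recall are computed over those cells. The function names and the list-of-dicts table representation are illustrative and are not part of the pipeline.

```python
def table_cells(rows):
    """Flatten a table (a list of dicts, one per row) into a set of cells."""
    return {
        (row_index, column, value)
        for row_index, row in enumerate(rows)
        for column, value in row.items()
    }


def table_precision_recall(produced, expected):
    """Compare two tables cell by cell.

    precision = matching cells / cells in the produced table
    recall    = matching cells / cells in the expected table
    """
    produced_cells = table_cells(produced)
    expected_cells = table_cells(expected)
    matching = produced_cells & expected_cells
    precision = len(matching) / len(produced_cells) if produced_cells else 0.0
    recall = len(matching) / len(expected_cells) if expected_cells else 0.0
    return precision, recall


# Hypothetical example: one of the two columns received the wrong name.
expected = [{"Company Name": "Amazon", "Location": "Seattle, Washington"}]
produced = [{"Company Name": "Amazon", "Industry": "Seattle, Washington"}]
precision, recall = table_precision_recall(produced, expected)
print(precision, recall)  # 0.5 0.5
```

With this definition a misnamed column costs all of its cells, so the same measure can be reused for tasks other than column naming.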

## Column naming

Column naming is the task of assigning a name to each column of a table.
There are two different scenarios:
- column names are provided: in this case we have an ontology
- column names are not provided

Following the study "Column Type Annotation using ChatGPT" (https://arxiv.org/abs/2306.00745), we consider only the case where an ontology is provided, since it gives the most accurate results.
The user is also asked to provide some inputs: the ontology, the importance of each ontology item, and its type (title, prose, numeric).
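
The ontologies printed later in this notebook map each column name to a `type` and an `importance`. As a hedged sketch, such a user-supplied ontology could be sanity-checked before a job is created. The `validate_ontology` helper and the `KNOWN_TYPES` set are assumptions made for illustration: the set only lists the types that appear in this notebook, not necessarily everything the pipeline accepts.

```python
# Types seen in the ontologies used in this notebook. The full list accepted
# by the pipeline is not documented here, so this set is an assumption.
KNOWN_TYPES = {"TITLE", "PROSE", "DATE", "MONEY", "NAME"}


def validate_ontology(ontology, known_types=KNOWN_TYPES):
    """Return a list of human-readable problems found in an ontology dict."""
    problems = []
    for name, spec in ontology.items():
        if not isinstance(spec, dict):
            problems.append(f"{name}: expected an object with 'type' and 'importance'")
            continue
        if spec.get("type") not in known_types:
            problems.append(f"{name}: unknown type {spec.get('type')!r}")
        importance = spec.get("importance")
        # bool is a subclass of int, so it is excluded explicitly.
        if not isinstance(importance, int) or isinstance(importance, bool):
            problems.append(f"{name}: importance must be an integer")
    return problems


example_ontology = {
    "Company Name": {"type": "TITLE", "importance": 9},
    "Revenue": {"type": "MONEY", "importance": 2},
}
print(validate_ontology(example_ontology))  # [] means no problems were found
```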

## Experiments

Initialization: create the objects used to read the datasets and the ontologies.

In [2]:
import json

from filesystem import DataReader
from proxy import PipelineClient
from IPython.display import JSON

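# Reader for the datasets stored under datasets/ and client for the pipeline service
dr = DataReader.DataReader("datasets/")
pc = PipelineClient.PipelineClient(base_url="http://localhost:8080")

# Load the ontologies, keyed by domain (e.g. "companies", "finance", "hotels")
ontologies_file_path = "ontologies.json"
with open(ontologies_file_path, 'r') as file:
    ontologies = json.load(file)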

### Companies 2
This is a simple dataset of data extracted manually from the web, using one XPath expression for each attribute to extract.

In [10]:
experiment_name = "companies_2"
df_companies = dr.read_dataset(experiment_name)
companies_ontology = ontologies["companies"]

In [3]:
df_companies

Unnamed: 0,0,1,2,3,4
0,Diversified Financials,1939,$276.1B,Berkshire Hathaway,"Omaha, Nebraska"
1,Banking,1984,$208.1B,ICBC,"Beijing, China"
2,Oil & Gas Operations,1933,$400.4B,Saudi Arabian Oil Company (Saudi Aramco),"Dhahran, Saudi Arabia"
3,Banking and Financial Services,2000,$124.5B,JPMorgan Chase,"New York, New York"
4,Banking,2014,$202.1B,China Construction Bank,"Beijing, China"
5,Retail and Wholesale,1994,$469.8B,Amazon,"Seattle, Washington"
6,"Semiconductors, Electronics, Electrical Engine...",1976,$378.7B,Apple,"Cupertino, California"
7,Banking,1979,$181.4B,Agricultural Bank of China,"Beijing, China"
8,Banking and Financial Services,1904,$96.8B,Bank of America,"Charlotte, North Carolina"
9,Consumer Durables,1937,$281.7B,Toyota Motor,"Toyota, Japan"


In [13]:
print(json.dumps(companies_ontology, indent=2))

{
  "Company Name": {
    "type": "TITLE",
    "importance": 9
  },
  "Location": {
    "type": "TITLE",
    "importance": 5
  },
  "Foundation": {
    "type": "DATE",
    "importance": 3
  },
  "Industry": {
    "type": "TITLE",
    "importance": 6
  },
  "Revenue": {
    "type": "MONEY",
    "importance": 2
  }
}


Then, set job parameters

In [61]:
new_job = pc.create_new_job(job_name=experiment_name, ontology=companies_ontology)
new_job_id = new_job["jobId"]

Add the table

In [62]:
pc.upload_table(table_name=experiment_name, df=df_companies, job_id=new_job_id, columns_to_ignore=[])

Run the pipeline

In [3]:
pc.start_job(new_job_id)
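
The three steps above (create the job, upload the table, start the job) are repeated for every dataset in this notebook. As a sketch, they could be grouped in a single helper. `run_experiment` is a hypothetical name; it only calls the `DataReader` and `PipelineClient` methods already used in this notebook. The two stand-in classes exist solely so the example can run without a live pipeline service.

```python
def run_experiment(dr, pc, ontologies, experiment_name, ontology_key, columns_to_ignore=()):
    """Read a dataset, create a job for it, upload the table and start the job."""
    df = dr.read_dataset(experiment_name)
    ontology = ontologies[ontology_key]
    job_id = pc.create_new_job(job_name=experiment_name, ontology=ontology)["jobId"]
    pc.upload_table(
        table_name=experiment_name,
        df=df,
        job_id=job_id,
        columns_to_ignore=list(columns_to_ignore),
    )
    pc.start_job(job_id)
    return job_id


class _RecordingClient:
    """Stand-in for PipelineClient that only records the calls it receives."""

    def __init__(self):
        self.calls = []

    def create_new_job(self, job_name, ontology):
        self.calls.append(("create_new_job", job_name))
        return {"jobId": "job-1"}

    def upload_table(self, table_name, df, job_id, columns_to_ignore):
        self.calls.append(("upload_table", table_name, job_id, columns_to_ignore))

    def start_job(self, job_id):
        self.calls.append(("start_job", job_id))


class _FakeReader:
    """Stand-in for DataReader that returns a placeholder instead of a DataFrame."""

    def read_dataset(self, name):
        return f"<table {name}>"


# Dry run with the stand-ins; in the notebook, pass the real `dr` and `pc`.
fake_pc = _RecordingClient()
job_id = run_experiment(_FakeReader(), fake_pc, {"companies": {}}, "companies_2", "companies")
print(job_id, fake_pc.calls)
```

With the real objects, the call for this experiment would be `run_experiment(dr, pc, ontologies, "companies_2", "companies")`.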

#### Results
The pipeline reached 100% accuracy on this dataset.
In this basic scenario, where the user has a simple dataset with few missing values and well-formatted data, we obtain the maximum accuracy.

In [4]:
df_companies_result = dr.read_result(experiment_name)
df_companies_result

Unnamed: 0,Industry,Foundation,Revenue,Company Name,Location
0,Diversified Financials,1939,$276.1B,Berkshire Hathaway,"Omaha, Nebraska"
1,Banking,1984,$208.1B,ICBC,"Beijing, China"
2,Oil & Gas Operations,1933,$400.4B,Saudi Arabian Oil Company (Saudi Aramco),"Dhahran, Saudi Arabia"
3,Banking and Financial Services,2000,$124.5B,JPMorgan Chase,"New York, New York"
4,Banking,2014,$202.1B,China Construction Bank,"Beijing, China"
5,Retail and Wholesale,1994,$469.8B,Amazon,"Seattle, Washington"
6,"Semiconductors, Electronics, Electrical Engine...",1976,$378.7B,Apple,"Cupertino, California"
7,Banking,1979,$181.4B,Agricultural Bank of China,"Beijing, China"
8,Banking and Financial Services,1904,$96.8B,Bank of America,"Charlotte, North Carolina"
9,Consumer Durables,1937,$281.7B,Toyota Motor,"Toyota, Japan"


### Companies 3
This is another plain example with manually extracted web data. This experiment shows that we may end up with multiple columns with the same name.
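
The notebook does not say how repeated names are handled downstream. As an illustrative sketch only, repeated labels in a header could be detected, and optionally made unique with a numeric suffix, like this. Both helper names and the sample header are hypothetical.

```python
from collections import Counter


def duplicated_names(column_names):
    """Return the names assigned to more than one column, with their counts."""
    counts = Counter(column_names)
    return {name: count for name, count in counts.items() if count > 1}


def disambiguate(column_names):
    """Append a numeric suffix to repeated names so the labels differ."""
    seen = Counter()
    unique = []
    for name in column_names:
        seen[name] += 1
        unique.append(name if seen[name] == 1 else f"{name}_{seen[name]}")
    return unique


# Hypothetical header in which two columns were both labelled "Location".
header = ["Company Name", "Location", "Location", "Revenue"]
print(duplicated_names(header))  # {'Location': 2}
print(disambiguate(header))      # ['Company Name', 'Location', 'Location_2', 'Revenue']
```

The suffix scheme is deliberately simple: it does not check whether a generated label such as `Location_2` already exists in the header.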

### Finance 1


In [22]:
experiment_name = "finance_1"
df_finance = dr.read_dataset(experiment_name)
finance_ontology = ontologies["finance"]

In [23]:
df_finance

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,531,532,533,534,535,536,537,538,539,540
0,2023-11-07 18:48:26,consulenti-del-lavoro-sinsedia-il-consiglio-na...,https://www.ansa.it/sito/notizie/ordini_profes...,https://www.ansa.it/sito/notizie/economia/prof...,//div[@class='article-teaser']/div[@class='art...,1,"<meta charset=""UTF-8""> </meta>","Consulenti del lavoro, s'insedia il Consiglio ...","<meta content=""width=device-width, initial-sca...","<link as=""style"" href=""/sito/cssnew/1024093008...",...,,,,,,,,,,
1,2023-11-08 16:05:38,cnos-fap-grandi-chance-per-la-filiera-tecnico-...,https://www.ansa.it/sito/notizie/fisco_lavoro/...,https://www.ansa.it/sito/notizie/economia/prof...,//div[@class='article-teaser']/div[@class='art...,1,"<meta charset=""UTF-8""> </meta>","Cnos-Fap, 'grandi chance per la filiera tecnic...","<meta content=""width=device-width, initial-sca...","<link as=""style"" href=""/sito/cssnew/1024093008...",...,,,,,,,,,,
2,2023-11-11 00:14:31,ad-aosta-lunedi-convegno-della-cassa-dottori-c...,https://www.ansa.it/sito/notizie/casse_previde...,https://www.ansa.it/sito/notizie/economia/prof...,//div[@class='article-teaser']/div[@class='art...,1,"<meta charset=""UTF-8""> </meta>",Ad Aosta lunedì convegno della Cassa dottori c...,"<meta content=""width=device-width, initial-sca...","<link as=""style"" href=""/sito/cssnew/1024093008...",...,,,,,,,,,,
3,2023-11-11 00:14:32,calderone-per-le-pensioni-rinvio-o-stretta-sol...,https://www.ansa.it/sito/notizie/ordini_profes...,https://www.ansa.it/sito/notizie/economia/prof...,//div[@class='article-teaser']/div[@class='art...,1,"<meta charset=""UTF-8""> </meta>","Calderone, per le pensioni rinvio, o stretta s...","<meta content=""width=device-width, initial-sca...","<link as=""style"" href=""/sito/cssnew/1024093008...",...,,,,,,,,,,
4,2023-11-11 00:14:33,manovra-confprofessioni-incentivi-ad-autonomi-...,https://www.ansa.it/sito/notizie/fisco_lavoro/...,https://www.ansa.it/sito/notizie/economia/prof...,//div[@class='article-teaser']/div[@class='art...,1,"<meta charset=""UTF-8""> </meta>","Manovra: Confprofessioni, incentivi ad autonom...","<meta content=""width=device-width, initial-sca...","<link as=""style"" href=""/sito/cssnew/1024093008...",...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
139,2023-11-08 16:05:38,alfredo-granata-nominato-direttore-generale-di...,https://www.ansa.it/sito/notizie/casse_previde...,https://www.ansa.it/sito/notizie/economia/prof...,//div[@class='article-teaser']/div[@class='art...,6,"<meta charset=""UTF-8""> </meta>",Alfredo Granata nominato direttore generale di...,"<meta content=""width=device-width, initial-sca...","<link as=""style"" href=""/sito/cssnew/1024093008...",...,,,,,,,,,,
140,2023-11-07 18:48:26,dottori-commercialisti-l8-novembre-la-cassa-in...,https://www.ansa.it/sito/notizie/casse_previde...,https://www.ansa.it/sito/notizie/economia/prof...,//div[@class='article-teaser']/div[@class='art...,7,"<meta charset=""UTF-8""> </meta>","Dottori commercialisti, l'8 novembre la Cassa ...","<meta content=""width=device-width, initial-sca...","<link as=""style"" href=""/sito/cssnew/1024093008...",...,,,,,,,,,,
141,2023-11-08 16:05:39,maltempo-commercialisti-chiedono-lo-stop-alle-...,https://www.ansa.it/sito/notizie/ordini_profes...,https://www.ansa.it/sito/notizie/economia/prof...,//div[@class='article-teaser']/div[@class='art...,7,"<meta charset=""UTF-8""> </meta>","Maltempo, commercialisti chiedono lo stop alle...","<meta content=""width=device-width, initial-sca...","<link as=""style"" href=""/sito/cssnew/1024093008...",...,,,,,,,,,,
142,2023-11-07 18:48:26,notai-parte-oggi-il-registro-volontario-dei-te...,https://www.ansa.it/sito/notizie/ordini_profes...,https://www.ansa.it/sito/notizie/economia/prof...,//div[@class='article-teaser']/div[@class='art...,8,"<meta charset=""UTF-8""> </meta>","Notai, parte oggi il registro volontario dei t...","<meta content=""width=device-width, initial-sca...","<link as=""style"" href=""/sito/cssnew/1024093008...",...,,,,,,,,,,


In [24]:
print(json.dumps(finance_ontology, indent=2))

{
  "Title": {
    "type": "TITLE",
    "importance": 8
  },
  "Publication Date": {
    "type": "DATE",
    "importance": 6
  },
  "Article Content": {
    "type": "PROSE",
    "importance": 2
  },
  "Category": {
    "type": "TITLE",
    "importance": 4
  },
  "Author": {
    "type": "NAME",
    "importance": 2
  }
}


In [25]:
new_job = pc.create_new_job(job_name=experiment_name, ontology=finance_ontology)
new_job_id = new_job["jobId"]

In [26]:
pc.upload_table(table_name=experiment_name, df=df_finance, job_id=new_job_id, columns_to_ignore=[0, 1, 2])

In [27]:
pc.start_job(new_job_id)

### Finance 2

In [None]:
experiment_name = "finance_2"
df_finance = dr.read_dataset(experiment_name)
finance_ontology = ontologies["finance"]

In [None]:
df_finance

In [None]:
print(json.dumps(finance_ontology, indent=2))

In [None]:
new_job = pc.create_new_job(job_name=experiment_name, ontology=finance_ontology)
new_job_id = new_job["jobId"]
pc.upload_table(table_name=experiment_name, df=df_finance, job_id=new_job_id, columns_to_ignore=[0, 1, 2])
pc.start_job(new_job_id)

### Hotels 1

In [None]:
experiment_name = "hotels_1"
df_hotels = dr.read_dataset(experiment_name)
hotels_ontology = ontologies["hotels"]

In [None]:
df_hotels

In [None]:
print(json.dumps(hotels_ontology, indent=2))

In [None]:
new_job = pc.create_new_job(job_name=experiment_name, ontology=hotels_ontology)
new_job_id = new_job["jobId"]
pc.upload_table(table_name=experiment_name, df=df_hotels, job_id=new_job_id, columns_to_ignore=[0, 1])
pc.start_job(new_job_id)

### Hotels 2

In [None]:
experiment_name = "hotels_2"
df_hotels = dr.read_dataset(experiment_name)
hotels_ontology = ontologies["hotels"]

In [None]:
df_hotels

In [None]:
print(json.dumps(hotels_ontology, indent=2))

In [None]:
new_job = pc.create_new_job(job_name=experiment_name, ontology=hotels_ontology)
new_job_id = new_job["jobId"]
pc.upload_table(table_name=experiment_name, df=df_hotels, job_id=new_job_id, columns_to_ignore=[0, 1, 2])
pc.start_job(new_job_id)

### Jobs 1

### Jobs 2

### Real estate 1

### Restaurants 1

### Restaurants 2