# S-MatCNGenPy

This is a step-by-step walkthrough of the Python implementation of the **S-MatCNGenPy** method, developed in \[1\]. Its main goal is to support references to the database schema in keyword search over databases. Note that some queries, such as the one below, refer not only to values stored in the database but also to the structure of the schema itself.

```
    movies by Will Smith
```
- **`movies`**: the Movie relation
- **`Will`, `Smith`**: values of the Name attribute of the Person table


#### Important Readings

> [\[1\]](https://drive.google.com/file/d/1ZnljlKss9a8M_RDqseTYfZbQCjDhcJkk/view) MARTINS, Paulo Rodrigo O.; DA SILVA, Altigran Soares. *Uma Abordagem para Suporte a Referências ao Esquema em Consultas por Palavras-Chave em Bancos de Dados Relacionais*. Undergraduate thesis (Computer Science), Universidade Federal do Amazonas, 2017.

> [\[2\]]() DE OLIVEIRA, Pericles; DA SILVA, Altigran; DE MOURA, Edleno. *Match-Based Candidate Network Generation for Keyword Queries over Relational Databases*. In: 2018 IEEE 34th International Conference on Data Engineering (ICDE). IEEE, 2018. Accepted for publication.

> [\[3\]](https://dl.acm.org/citation.cfm?id=1989383) BERGAMASCHI, Sonia et al. *Keyword search over relational databases: a metadata approach*. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data. ACM, 2011. p. 565-576.

In [None]:
import psycopg2
from psycopg2 import sql
from pprint import pprint as pp
from collections import defaultdict, deque
import string
import itertools
import copy
from math import log1p
import ast
import gc

import nltk 
#nltk.download('wordnet')
#nltk.download('omw')
#nltk.download('stopwords')

from nltk.corpus import stopwords
from nltk.corpus import wordnet as wn
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()

import gensim.models.keyedvectors as word2vec
from gensim.models import KeyedVectors


stw_set = set(stopwords.words('english')) - {'will'}

# Connect to an existing database
conn = psycopg2.connect("dbname=imdb user=imdb password=imdb")

# Open a cursor to perform database operations
cur = conn.cursor()

## Preprocessing

Before receiving any queryset, the system runs a preprocessing step, which is responsible for building two inverted indexes:

* **wordHash**: a table that maps each term in the database to its **IAF (Inverse Attribute Frequency)** and to every table, column, and CTID in which the term occurs. Note: the CTID is the physical location of a row within its table in PostgreSQL, used to fetch a tuple quickly.
```python
wordHash['term'] = ( IAF , { 'table': { 'column' : [ctid] } } )
```
* **attributeHash**: a table that stores, for each attribute (treated as a document), its norm and its number of distinct words.
```python
attributeHash['table']['column'] = ( norm , num_distinct_words )
```
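As an illustration of the two structures above, the sketch below builds them from a handful of in-memory rows instead of a database scan. The rows, the CTID strings, and the name `build_toy_index` are assumptions made for this example; stopword removal and the embedding-based filtering of tables and columns are left out.

```python
import string

# Hypothetical rows standing in for a database scan: (table, column, ctid, text)
toy_rows = [
    ("movie", "title", "(0,1)", "Men in Black"),
    ("movie", "title", "(0,2)", "Black Panther"),
    ("person", "name", "(0,1)", "Will Smith"),
]

def build_toy_index(rows):
    word_hash = {}       # term -> {table: {column: [ctid, ...]}}
    distinct_words = {}  # table -> {column: set of terms}
    for table, column, ctid, text in rows:
        for raw in text.lower().split():
            word = raw.strip(string.punctuation)
            if not word:
                continue
            # Record where the term occurs
            word_hash.setdefault(word, {}).setdefault(table, {}).setdefault(column, []).append(ctid)
            distinct_words.setdefault(table, {}).setdefault(column, set()).add(word)
    # The norm starts at 0; it can only be computed once the IAFs are known
    attribute_hash = {
        table: {column: (0, len(words)) for column, words in columns.items()}
        for table, columns in distinct_words.items()
    }
    return word_hash, attribute_hash

toy_word_hash, toy_attribute_hash = build_toy_index(toy_rows)
print(toy_word_hash["black"])
print(toy_attribute_hash)
```

At this point each `wordHash` entry holds only the locations; the IAF is attached in a later step.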

### Building the Inverted Indexes

The indexes are built in three steps. First, the procedure ```createInvertedIndex(embeddingModel)``` scans the database and partially fills ```wordHash```, leaving only the IAF of each term to be computed. This procedure also stores in ```attributeHash``` the number of distinct words of each attribute.

Next, the IAF of each term is computed by ```processIAF(wordHash,attributeHash)```. Finally, the attribute (document) norms are computed by ```processNormsOfAttributes(wordHash,attributeHash,embeddingModel)```.
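The last two steps can be sketched on toy data as follows. The formula mirrors the one used in this notebook, IAF = ln(1 + total attributes / attributes containing the term), and the norm is accumulated the same way, by adding the IAF once per indexed occurrence. The toy dictionaries and the names `toy_process_iaf` and `toy_process_norms` are assumptions for this example and work from the index instead of rescanning the database.

```python
from math import log1p

# Toy partial index: term -> {table: {column: [ctid, ...]}}
toy_word_hash = {
    "black": {"movie": {"title": ["(0,1)", "(0,2)"]}},
    "smith": {"person": {"name": ["(0,1)"]}, "movie": {"title": ["(0,3)"]}},
}
# Toy attribute table: table -> {column: (norm, num_distinct_words)}
toy_attribute_hash = {"movie": {"title": (0, 2)}, "person": {"name": (0, 1)}}

def toy_process_iaf(word_hash, attribute_hash):
    # Total number of indexed attributes (columns) across all tables
    total_attributes = sum(len(cols) for cols in attribute_hash.values())
    result = {}
    for term, locations in word_hash.items():
        attrs_with_term = sum(len(cols) for cols in locations.values())
        result[term] = (log1p(total_attributes / attrs_with_term), locations)
    return result

def toy_process_norms(word_hash, attribute_hash):
    # Add the IAF of a term once per occurrence recorded in the index
    norms = {t: {c: 0.0 for c in cols} for t, cols in attribute_hash.items()}
    for term, (iaf, locations) in word_hash.items():
        for table, cols in locations.items():
            for column, ctids in cols.items():
                norms[table][column] += iaf * len(ctids)
    return {
        t: {c: (norms[t][c], attribute_hash[t][c][1]) for c in cols}
        for t, cols in attribute_hash.items()
    }

indexed = toy_process_iaf(toy_word_hash, toy_attribute_hash)
with_norms = toy_process_norms(indexed, toy_attribute_hash)
print(indexed["black"][0], indexed["smith"][0])
print(with_norms)
```

A term that occurs in fewer attributes gets a higher IAF: here `black` appears in one of the two attributes and `smith` in both, so `black` weighs more.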

In [None]:
#Word2Vec
def loadWordEmbeddingsModel(filename = "word_embeddings/word2vec/GoogleNews-vectors-negative300.bin"):
    model = KeyedVectors.load_word2vec_format(filename,
                                                       binary=True, limit=500000)
    return model


#GloVe
#def loadWordEmbeddingsModel(filename = "word_embeddings/word2vec/GoogleNews-vectors-negative300.bin"):
#    model = KeyedVectors.load_word2vec_format(filename, limit=500000)
#    return model

In [None]:
embeddingModel = loadWordEmbeddingsModel()

In [None]:
#Although 'id' is in the word embedding model, we know this field should not be indexed
#'id' in embeddingModel

In [None]:
def createInvertedIndex(embeddingModel):
    #Output: wordHash (Term Index) and attributeHash, with the structures below
    #wordHash['word'] = { 'table': { 'column': [ctid] } }
    #attributeHash['table']['column'] = ( norm , num_distinct_words )

    '''
    The Term Index is built in a preprocessing step that scans only
    once all the relations over which the queries will be issued.
    '''
    
    wordHash = {}
    attributeHash = {}
    
    
    # Get list of tablenames
    cur.execute("SELECT DISTINCT tablename FROM pg_tables WHERE schemaname!='pg_catalog' AND schemaname !='information_schema';")
    for table in cur.fetchall():
        table_name = table[0]
        
        if table_name not in embeddingModel:
            print('TABLE ',table_name, 'SKIPPED')
            continue
        
        print('INDEXING TABLE ',table_name)
        
        attributeHash[table_name] = {}
        
        #Get all tuples for this tablename
        cur.execute(
            sql.SQL("SELECT ctid, * FROM {};").format(sql.Identifier(table_name))
            #NOTE: sql.SQL is needed to specify this parameter as table name (can't be passed as execute second parameter)
        )
        printSkippedColumns = True
        for row in cur.fetchall(): 
            for column in range(1,len(row)):
                column_name = cur.description[column][0] 
                
                if column_name not in embeddingModel or column_name=='id':
                    if printSkippedColumns:
                        print('\tCOLUMN ',column_name,' SKIPPED')
                    continue
                
                ctid = row[0]

                for word in [word.strip(string.punctuation) for word in str(row[column]).lower().split()]:
                    
                    #Ignoring STOPWORDS
                    if word in stw_set:
                        continue

                    #If the word entry doesn't exist, it is initialized (setdefault method),
                    #then the location of this word is appended
                    wordHash.setdefault(word, {})                    
                    wordHash[word].setdefault( table_name , {} )
                    wordHash[word][table_name].setdefault( column_name , [] ).append(ctid)
                    
                    attributeHash[table_name].setdefault(column_name,(0,set()))
                    attributeHash[table_name][column_name][1].add(word)
            printSkippedColumns=False
        
        #Count words
        
        for (column_name,(norm,wordSet)) in attributeHash[table_name].items():
            num_distinct_words = len(wordSet)
            wordSet.clear()
            attributeHash[table_name][column_name] = (norm,num_distinct_words)
        

    print ('INVERTED INDEX CREATED')
    return (wordHash,attributeHash)

In [None]:
(wordHash,attributeHash) = createInvertedIndex(embeddingModel)

In [None]:
#pp(wordHash['denzel'])

In [None]:
#pp(attributeHash)

In [None]:
def processIAF(wordHash,attributeHash):
    
    total_attributes = sum([len(attribute) for attribute in attributeHash.values()])
    
    for (term, values) in wordHash.items():
        
        attributes_with_this_term = sum([len(attribute) for attribute in wordHash[term].values()])
        
        IAF = log1p(total_attributes/attributes_with_this_term)
                
        wordHash[term] = (IAF,values)
    print('IAF PROCESSED')

In [None]:
processIAF(wordHash,attributeHash)

In [None]:
#pp(wordHash['denzel'])

In [None]:
def processNormsOfAttributes(wordHash,attributeHash,embeddingModel):
  
    # Get list of tablenames
    cur.execute("SELECT DISTINCT tablename FROM pg_tables WHERE schemaname!='pg_catalog' AND schemaname !='information_schema';")
    for table in cur.fetchall():
        table_name = table[0]
        
        if table_name not in embeddingModel:
            print('TABLE ',table_name, 'SKIPPED')
            continue
        
        print('PROCESSING TABLE ',table_name)
        
        #Get all tuples for this tablename
        cur.execute(
            sql.SQL("SELECT ctid, * FROM {};").format(sql.Identifier(table_name))
            #NOTE: sql.SQL is needed to specify this parameter as table name (can't be passed as execute second parameter)
        )
        
        printSkippedColumns = False
        for row in cur.fetchall():
            for column in range(1,len(row)):
                column_name = cur.description[column][0]  
                
                if column_name not in embeddingModel or column_name=='id':
                    if printSkippedColumns:
                        print('\tCOLUMN ',column_name,' SKIPPED')
                    continue
                
                ctid = row[0]

                for word in [word.strip(string.punctuation) for word in str(row[column]).lower().split()]:
                    
                    #Ignoring STOPWORDS
                    if word in stw_set:
                        continue
                    
                    (prevNorm,num_distinct_words)=attributeHash[table_name][column_name]
                    
                    IAF = wordHash[word][0]
                    
                    Norm = prevNorm + IAF
                    
                    attributeHash[table_name][column_name]=(Norm,num_distinct_words)
            printSkippedColumns = False

    print ('NORMS OF ATTRIBUTES PROCESSED')

In [None]:
processNormsOfAttributes(wordHash,attributeHash,embeddingModel)

In [None]:
#pp(wordHash['denzel'])

In [None]:
#pp(attributeHash)

## Main

Query processing is carried out in the stages described in the sections below.

In [None]:
def getQuerySets(filename='querysets/queryset_imdb_martins.txt'):
    QuerySet = []
    with open(filename,encoding='utf-8-sig') as f:
        for line in f.readlines():
            
            #The line below removes words not in the OLIVEIRA experiments
            #Q = [word.strip(string.punctuation) for word in line.split() if word not in ['title','dr.',"here's",'char','name'] and word not in stw_set]  
            
            Q = [word.strip(string.punctuation) for word in line.lower().split() if word not in stw_set]  
            
            QuerySet.append(Q)
    return QuerySet
        
QuerySet = getQuerySets()
QuerySet

### Tuple-set Retrieval
This stage retrieves the sets of tuples that contain each keyword, called tuple-sets. The `TSFind` algorithm, which performs this process, can be divided into three parts:
* **Tuple retrieval:** finds the sets of tuples that contain each word of the queryset. This information was already precomputed in the `wordHash` inverted index.
* **Tuple intersection:** performed by the `TSInter` algorithm, this part finds the tuples that contain more than one keyword. It also ensures that a tuple-set `TABLE{word}` contains only tuples with the word `word` and no other word of the queryset. This property is required to find the minimal cover (in the query match creation stage).
* **Tuple-set creation:** condenses the results. Instead of listing every tuple that contains the keywords, we only need to know which columns contain each of the words. Tuple-sets therefore have the structure below, in which `schemaWords` is always empty for tuple-sets and `valueWords` holds the keywords found in the column:
```python
TupleSet = ('table','column', frozenset({schemaWords}), frozenset({valueWords}))
```
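The three parts above can be sketched on toy data. This is a simplified variant written for illustration, not the `TSInter` cell of this notebook: it computes intersections on the input sets and drops termsets left without tuples. The tuples in `toy_P` and the name `toy_ts_inter` are assumptions.

```python
import itertools

def toy_ts_inter(P):
    # P maps each termset (frozenset of keywords) to the tuples containing it
    remaining = {termset: set(tuples) for termset, tuples in P.items()}
    merged = {}
    for ki, kj in itertools.combinations(list(P), 2):
        shared = P[ki] & P[kj]
        if shared:
            # Tuples holding both termsets move to the larger termset
            merged.setdefault(ki | kj, set()).update(shared)
            remaining[ki] -= shared
            remaining[kj] -= shared
    if merged:
        merged = toy_ts_inter(merged)
    remaining.update(merged)
    # Keep only termsets that still have tuples
    return {termset: tuples for termset, tuples in remaining.items() if tuples}

# Part 1 (retrieval): tuples per keyword, as (table, column, ctid)
toy_P = {
    frozenset({"james"}): {("person", "name", "(0,1)"), ("character", "name", "(0,7)")},
    frozenset({"bond"}): {("character", "name", "(0,7)"), ("movie", "title", "(0,3)")},
}

# Part 2 (intersection)
toy_P = toy_ts_inter(toy_P)

# Part 3 (condensing): drop the ctids and keep one entry per column
toy_Rq = {
    (table, column, frozenset(), termset)
    for termset, tuples in toy_P.items()
    for (table, column, ctid) in tuples
}
print(toy_Rq)
```

The row of `character.name` that contains both keywords ends up only in the tuple-set for `{james, bond}`, so the single-keyword tuple-sets no longer include it.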

In [None]:
def TSFind(Q):
    #Input:  A keyword query Q=[k1, k2, . . . , km]
    #Output: Set of non-free and non-empty tuple-sets Rq

    '''
    The tuple-set Rki contains the tuples of Ri that contain all
    terms of K and no other keywords from Q
    '''
    
    #Part 1: Find sets of tuples containing each keyword
    global P
    P = {}
    for keyword in Q:
        tupleset = set()
        
        if keyword not in wordHash:
            continue
        
        for (table,attributes) in wordHash.get(keyword)[1].items():
            for (attribute,ctids) in attributes.items():
                for ctid in ctids:
                    tupleset.add( (table,attribute,ctid) )
        P[frozenset([keyword])] = tupleset
    
    #Part 2: Find sets of tuples containing larger termsets
    P = TSInter(P)
    
    #Part 3:Build tuple-sets
    Rq = set()
    
    schemaWords = frozenset()
    for valueWords , tuples in P.items():
        for (table,attribute,ctid) in tuples:
            Rq.add( (table,attribute,schemaWords,valueWords) )
    #print ('TUPLE SETS CREATED')
    return Rq


def TSInter(P):
    #Input: A Set of non-empty tuple-sets for each keyword alone P 
    #Output: The Set P, but now including larger termsets (process Intersections)

    '''
    Termset is any non-empty subset K of the terms of a query Q        
    '''
    
    Pprev = {}
    Pprev=copy.deepcopy(P)
    Pcurr = {}

    combinations = [x for x in itertools.combinations(Pprev.keys(),2)]
    for ( Ki , Kj ) in combinations:
        Tki = Pprev[Ki]
        Tkj = Pprev[Kj]
        
        X = Ki | Kj
        Tx = Tki & Tkj        
        
        if len(Tx) > 0:            
            Pcurr[X]  = Tx            
            Pprev[Ki] = Tki - Tx         
            Pprev[Kj] = Tkj - Tx
            
    if Pcurr != {}:
        Pcurr = copy.deepcopy(TSInter(Pcurr))
        
    #Pprev = Pprev U Pcurr
    Pprev.update(Pcurr)     
    return Pprev   

In [None]:
Q = ['actor', 'james', 'bond']
Rq = TSFind(Q)
pp(Rq)

### Schema-set Creation

This stage creates the schema-sets, a structure analogous to the tuple-sets from the previous stage. Here the process is also divided into three parts:
* **Schema element mapping (*Schema Matching*):** measures the similarity between the words of the queryset and the schema elements (relation and attribute names).
* **Adjacent term analysis:** checks the relationships between keywords, since a keyword that refers to a schema element often restricts the domain of the adjacent keywords. Example: in "Actor James Bond", the word James is restricted to a Person name rather than a Movie name.
* **Schema-set creation:** formats the results to match the tuple-set structure, as shown below (for schema-sets, the value words are always empty, and the column is `'*'` when the keyword matches the table name):
```python
SchemaSet = ('table','column', frozenset({schemaWords}), frozenset())
```

#### Similarities for Schema Matching

To map words to schema elements, both spelling-based and semantic similarity metrics are used.
The Jaccard coefficient measures the overlap between the character sets of two words, which makes it suited to spelling similarities such as abbreviations or typos.

The semantic metrics, on the other hand, use the WordNet lexical database to find similarities in meaning. The NLTK toolkit provides a number of semantic metrics [here](http://www.nltk.org/howto/wordnet.html "WordNet Interface"). The main ones are Path Similarity and Wu-Palmer Similarity. The first is based on the shortest path between two word senses in the WordNet relation graph, while the second is based on the depth of their closest common ancestor.
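To make the spelling-based metric concrete, the sketch below applies a character-set Jaccard coefficient with a threshold to a toy schema. The schema, the keywords, and the names `jaccard_chars` and `toy_schema_match` are assumptions for this example, and the WordNet and embedding metrics are left out so that it runs without extra downloads.

```python
def jaccard_chars(word_a, word_b):
    # Jaccard coefficient over the sets of characters of each word
    a, b = set(word_a), set(word_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical schema: table -> list of columns
toy_schema = {"movie": ["title", "year"], "person": ["name"]}

def toy_schema_match(keywords, schema, threshold=0.8):
    matches = set()
    for keyword in keywords:
        for table, columns in schema.items():
            # '*' marks a match against the table name itself
            if jaccard_chars(table, keyword) >= threshold:
                matches.add((table, "*", frozenset({keyword}), frozenset()))
            for column in columns:
                if jaccard_chars(column, keyword) >= threshold:
                    matches.add((table, column, frozenset({keyword}), frozenset()))
    return matches

print(jaccard_chars("movie", "movies"))
print(toy_schema_match(["movies", "smith"], toy_schema))
```

Because the coefficient ignores character order and repetition, anagrams score 1.0; this is one reason to combine it with the semantic metrics rather than rely on it alone.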

In [None]:
def wordnet_similarity(wordA,wordB):
    
    A = set(wn.synsets(wordA))
    B = set(wn.synsets(wordB))
    
    wupSimilarities = [0]
    pathSimilarities = [0]
    for (sense1,sense2) in itertools.product(A,B):        
        wupSimilarities.append(wn.wup_similarity(sense1,sense2) or 0)
        pathSimilarities.append(wn.path_similarity(sense1,sense2) or 0)
    return max(max(wupSimilarities),max(pathSimilarities))

def jaccard_similarity(wordA,wordB):
    
    A = set(wordA)
    B = set(wordB)
    
    return len(A & B ) / len(A | B)

In [None]:
def getSchemaGraph():
    #Output: A Schema Graph G  with the structure below:
    # G['node'] = edges
    # G['table'] = { 'foreign_table' : (direction, column, foreign_column) }
    
    
    G = {} 
    cur.execute("SELECT tablename FROM pg_tables WHERE schemaname!='pg_catalog' AND schemaname !='information_schema';")
    for table in cur.fetchall():
        G.setdefault(table[0],{})
    
    query = '''
        SELECT DISTINCT
            tc.table_name, kcu.column_name,
            ccu.table_name AS foreign_table_name, ccu.column_name AS foreign_column_name
        FROM information_schema.table_constraints AS tc
            JOIN information_schema.key_column_usage AS kcu
                ON tc.constraint_name = kcu.constraint_name
            JOIN information_schema.constraint_column_usage AS ccu
                ON ccu.constraint_name = tc.constraint_name
        WHERE constraint_type = 'FOREIGN KEY'
    '''
    cur.execute(query)
    relations = cur.fetchall()
    
    for (table,column,foreign_table,foreign_column) in relations:
        G[table][foreign_table] = (1,column, foreign_column)
        G[foreign_table][table] = (-1,foreign_column,column)
    print ('SCHEMA CREATED')
    return G
G = getSchemaGraph()
G

In [None]:
def createEmbeddingsHash(model,attributeHash,weight=0.5):
    
    wordEmbeddingsHashA = {}
    
    for table in attributeHash:
        
        if table not in model:
            continue
        
        wordEmbeddingsHashA[table]={word.lower() for word,sim in model.most_similar(table)}
        #wordEmbeddingsHashA[table]={wnl.lemmatize(word).lower() for word,sim in model.most_similar(table)}
            
        for column in attributeHash[table]:
            if column not in model or column=='id':
                continue
            wordEmbeddingsHashA[column]={wnl.lemmatize(word).lower() for word,sim in model.most_similar(column)}
            #wordEmbeddingsHashA[column]={wnl.lemmatize(word).lower() for word,sim in model.most_similar(column)}
    
    wordEmbeddingsHashB = copy.deepcopy(wordEmbeddingsHashA)
    
    for table in attributeHash:
        
        if table not in model:
            continue
        
        for column in attributeHash[table]:
            
            if column not in model or column=='id':
                continue
            
            similarSet = { wnl.lemmatize(word).lower() for word,sim in model.most_similar(positive=(table,column))}
            wordEmbeddingsHashB[column].update(similarSet)
            
    G = getSchemaGraph()
    for tableA in G:
        
        if tableA not in model:
            continue
        
        for tableB in G[tableA]:
            
            if tableB not in model:
                continue
            
            similarSet = { wnl.lemmatize(word).lower() for word,sim in model.most_similar(positive=(tableA,tableB))}
            wordEmbeddingsHashB[tableA].update(similarSet)
            wordEmbeddingsHashB[tableB].update(similarSet)
            
            
            
    wordEmbeddingsHashC = copy.deepcopy(wordEmbeddingsHashA)
    
    for table in attributeHash:
        
        if table not in model:
            continue
        
        for column in attributeHash[table]:
            
            if column not in model or column=='id':
                continue
            
            avg_vec = (model[table]*weight + model[column]*(1-weight))   
            similarSet = { wnl.lemmatize(word).lower() 
                          for word,sim in model.similar_by_vector(avg_vec)}
            wordEmbeddingsHashC[column].update(similarSet)
            
    G = getSchemaGraph()
    for tableA in G:
        
        if tableA not in model:
            continue
        
        for tableB in G[tableA]:
            
            if tableB not in model:
                continue
            
            avg_vec = (model[tableA]*weight + model[tableB]*(1-weight))
            similarSet = { wnl.lemmatize(word).lower() 
                          for word,sim in model.similar_by_vector(avg_vec)}
            wordEmbeddingsHashC[tableA].update(similarSet)            
    
    return wordEmbeddingsHashA,wordEmbeddingsHashB,wordEmbeddingsHashC

In [None]:
def embedding10_similarity(schema,word,wordEmbeddingsHash):
    if schema not in wordEmbeddingsHash:
        return 0
    
    #lemmatize is used to remove plural form   wnl.lemmatize('wolves')='wolf'
    if wnl.lemmatize(word) in wordEmbeddingsHash[schema]:
        return 1
    else:
        return 0        

In [None]:
def embedding_similarity(wordA,wordB,model):
    if wordA not in model or wordB not in model:
        return 0
    return model.similarity(wordA,wordB)

#### Algoritmo para Criação dos Schema-Sets

In [None]:
def word_similarity(schema_term,word,
                    wn_sim=True, jaccard_sim=True,
                    emb_sim=False,  emb_model=None,
                    emb10_sim=False, emb10_hash=None):
    
    
    sim_list=[0]
    
    if wn_sim:
        sim_list.append( wordnet_similarity(schema_term,word) )

    if jaccard_sim:
        sim_list.append( jaccard_similarity(schema_term,word) )

    if emb_sim and emb_model is not None:
        sim_list.append( embedding_similarity(schema_term,word,emb_model) )

    sim = max(sim_list) 

    if emb10_sim and emb10_hash is not None:
        if embedding10_similarity(schema_term,word,emb10_hash) == 0:
            sim=0
        else:
            if len(sim_list)==1:
                sim=1

    return sim

In [None]:
def SchSFind(Q,threshold=0.8, 
             sim_args={}):    
    S = []
    for keyword in Q:
        for (table,values) in attributeHash.items():
            
            sim = word_similarity(table,keyword,**sim_args)
            
            if sim >= threshold:
                S.append( (table,'*',{keyword},sim) )
            
            for attribute in values.keys():
                
                if(attribute=='id'):
                    continue
                
                sim = word_similarity(attribute,keyword,**sim_args)
                
                if sim >= threshold:
                    S.append( (table,attribute,{keyword},sim) )
    #S = SchSInter(S)

    #print ('SCHEMA SETS CREATED')
    valueWords = frozenset()
    Sq = {(table,attribute,frozenset(schemaWords),valueWords) for (table,attribute,schemaWords,sim) in S}
        
    return Sq

In [None]:
wordEmbeddingsModel = embeddingModel  #reuse the model loaded earlier instead of reading the file again
(wordEmbeddingsHashA,wordEmbeddingsHashB,wordEmbeddingsHashC) = createEmbeddingsHash(wordEmbeddingsModel,attributeHash,weight=0.5)

In [None]:
Q = ['actor', 'james', 'bond']
SimilarityCoeficient = 0.799999999999
Sq = SchSFind(Q,SimilarityCoeficient,{'emb10_sim':True,'emb10_hash':wordEmbeddingsHashB})
Sq

### Query Match Creation

The previous stages, schema-set and tuple-set creation, identified which relations hold some information about the keywords. In this stage, the goal is to combine those tuple-sets and schema-sets into query matches that give the user a complete, minimal, and relevant answer.

The `QMGen` algorithm finds the combinations of tuple-sets/schema-sets that form a total and minimal cover (`MinimalCover`) of the queryset.
- **Total**: every keyword must appear in at least one of the tuple-sets/schema-sets of the query match.
- **Minimal**: no tuple-set/schema-set can be removed from the query match while still covering the whole queryset.
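The two conditions can be checked on a small example. The sketch below is a self-contained restatement of the idea, with hypothetical tuple-sets and schema-sets for the query `actor james bond`; the name `is_minimal_cover` and the table names are assumptions.

```python
def is_minimal_cover(candidate, query):
    # Keywords covered by each tuple-set/schema-set of the candidate
    termsets = [schema_words | value_words
                for _, _, schema_words, value_words in candidate]
    target = set(query)
    # Total: together, the elements cover exactly the query keywords
    if set().union(*termsets) != target:
        return False
    # Minimal: dropping any single element must break the cover
    for i in range(len(termsets)):
        rest = termsets[:i] + termsets[i + 1:]
        if set().union(*rest) == target:
            return False
    return True

Q = ["actor", "james", "bond"]
actor_schema = ("person", "*", frozenset({"actor"}), frozenset())
james_bond = ("character", "name", frozenset(), frozenset({"james", "bond"}))
james_only = ("person", "name", frozenset(), frozenset({"james"}))

print(is_minimal_cover([actor_schema, james_bond], Q))              # total and minimal
print(is_minimal_cover([actor_schema, james_bond, james_only], Q))  # total, not minimal
print(is_minimal_cover([actor_schema, james_only], Q))              # not total
```

In the second call `james_only` is redundant, and in the third the keyword `bond` is left uncovered, so only the first candidate qualifies as a query match.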

In [None]:
def MinimalCover(MC, Q):
    #Input:  A subset MC (Match Candidate) to be checked as total and minimal cover
    #Output: If the match candidate is a TOTAL and MINIMAL cover

    Subset = [schemaWords|valueWords for table,attribute,schemaWords,valueWords in MC]
    u = set().union(*Subset)    
    
    isTotal = (u == set(Q))
    for element in Subset:
        
        new_u = list(Subset)
        new_u.remove(element)
        
        new_u = set().union(*new_u)
        
        if new_u == set(Q):
            return False
    
    return isTotal

In [None]:
def QMGen(Q,Rq):
    #Input:  A keyword query Q, The set of non-empty non-free tuple-sets Rq
    #Output: The set Mq of query matches for Q
    
    '''
    Query match is a set of tuple-sets that, if properly joined,
    can produce networks of tuples that fulfill the query. They
    can be thought as the leaves of a Candidate Network.
    
    '''
    
    Mq = []
    for i in range(1,len(Q)+1):
        for subset in itertools.combinations(Rq,i):            
            if(MinimalCover(subset,Q)):
                #print('----------------------------------------------\nM')
                #pp(set(subset))
                #print('\n')
                M = MInter(set(subset))
                #pp(M)
                Mq.append(M)
                
                
    return Mq

def MInter(M):
    #print('M',M)
    Mprev = copy.deepcopy(M)
    Mcurr = set()

    combinations = [x for x in itertools.combinations(Mprev,2)]

    
    for ( (tableA,attributeA,schemaWordsA,valueWordsA) , (tableB,attributeB,schemaWordsB,valueWordsB) ) in combinations:
          
        #skip if the tables differ or if both tuple-sets have value words mapped
        if (tableA!=tableB) or (len(valueWordsA)>0 and len(valueWordsB)>0):
            continue             
        
        tableC=tableA
        
        if len(valueWordsA)>0:
            attributeC=attributeA
        else:
            attributeC=attributeB
        
        schemaWordsC = schemaWordsA|schemaWordsB
        valueWordsC  = valueWordsA | valueWordsB #at most one of them is non-empty here
        
        Mcurr.add( (tableC,attributeC,frozenset(schemaWordsC),frozenset(valueWordsC)) )
        
        Mprev = Mprev - {(tableA,attributeA,schemaWordsA,valueWordsA)}
        Mprev = Mprev - {(tableB,attributeB,schemaWordsB,valueWordsB)}
            
    if len(Mcurr)>0:
        Mcurr = copy.deepcopy(MInter(Mcurr))
        
    Mprev.update(Mcurr)     
    return Mprev   

In [None]:
Q = ['actor', 'james', 'bond']

Rq = TSFind(Q)

SimilarityCoeficient = 0.799999999999
Sq = SchSFind(Q,SimilarityCoeficient,{'emb10_sim':True,'emb10_hash':wordEmbeddingsHashB})

Mq= QMGen(Q,Rq|Sq)

for element in Mq:
    pp(element)
    print()

In [None]:
def QMRank(Mq,mi,smi,sim_args={}):
    Ranking = []
    for M in Mq:
        cosprod = schemaprod = 1
        thereIsValueTerms = thereIsSchemaTerms = False
        
        for (table,attribute,schemaWords,valueWords) in M:           
            
            if (len(valueWords)>0):
                
                thereIsValueTerms=True
                
                (norm_attribute,distinct_terms) = attributeHash[table][attribute]

                wsum = 0

                for term in valueWords:

                    IAF = wordHash[term][0]

                    ctids = wordHash[term][1][table][attribute]
                    fkj = len(ctids)

                    if fkj>0:

                        TF = log1p(fkj) / log1p(distinct_terms)

                        wsum = wsum + TF*IAF
                
                cos = wsum/norm_attribute
                cosprod *= cos
                
            if (len(schemaWords)>0):
                
                thereIsSchemaTerms=True
                
                if(attribute == '*'):
                    schemaElement = table
                else:
                    schemaElement = attribute
                
                schemasum = 0
                
                for term in schemaWords:
                    schemasum+=word_similarity(schemaElement,term,**sim_args)
                
                schemaprod *= schemasum
                
        valuescore = schemascore = 0
        
        # The size of the query match is not considered in this ranking; it is taken into account when ranking CNs.
        #score = 1/len(M)
        score = 1.0
        
        if thereIsValueTerms:
            valuescore = mi * cosprod 
            score*=valuescore
        
        if thereIsSchemaTerms:
           
            schemascore = smi * schemaprod
            score*=schemascore
            
        Ranking.append( (M,score,schemascore,valuescore) )
    return sorted(Ranking,key=lambda x: x[1],reverse=True)

In [None]:
mi = 46457610.86662768
smi = 1

RankedMq = QMRank(Mq,mi,smi)


for (j, (M,score,schemascore,valuescore) ) in enumerate(RankedMq):
    if j>10:
        break
    print('QM #%d' % (j+1))
    print('Schema Score:',"%.8f" % schemascore,
          '\nValue Score: ',"%.8f" % valuescore,
          '\n|M|: ',"%02d (not used to compute the total score)" % len(M),
          '\nTotal Score: ',"%.8f" % score)
    pp(M)
    print('----------------------------------------------------------------------\n')

### Candidate Network Creation and Ranking

The previous stage produced the query matches, which contain all the information the user needs. The next step is to find ways to connect these pieces of information into an answer for the user. These connections, called candidate networks, are derived from the referential integrity constraints of the database, also known as foreign keys.

Candidate network creation uses two graphs:
- **Schema Graph**: a graph that represents the database and serves as the basis for the match graph. Its vertices are the free tuple-sets associated with each relation of the database, and its edges are the referential integrity constraints.

    The Schema Graph is implemented as a dictionary that maps each vertex to its adjacent vertices. It also stores information about each edge, such as its direction and which attributes of the two tables take part in the referential constraint. The structure of the Schema Graph is shown below:
   
```python
    G['table'] = { 'foreign_table' : (direction, column, foreign_column) }
```

Since there are different ways to connect the information associated with the keywords, several candidate networks are generated. In most cases, however, only one of them contains an answer that is relevant to the user. For this reason, this stage ranks the candidate networks by relevance.
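The Schema Graph structure described above can be illustrated without a database connection. In the sketch below the foreign keys are listed by hand; the table names, the column names, and `build_toy_schema_graph` are assumptions for this example.

```python
# Hypothetical foreign keys: (table, column, foreign_table, foreign_column)
toy_foreign_keys = [
    ("casting", "movie_id", "movie", "id"),
    ("casting", "person_id", "person", "id"),
]

def build_toy_schema_graph(tables, foreign_keys):
    graph = {table: {} for table in tables}
    for table, column, foreign_table, foreign_column in foreign_keys:
        # Direction 1 follows the foreign key, -1 goes against it
        graph[table][foreign_table] = (1, column, foreign_column)
        graph[foreign_table][table] = (-1, foreign_column, column)
    return graph

toy_G = build_toy_schema_graph(["movie", "person", "casting"], toy_foreign_keys)
print(toy_G["casting"])
print(toy_G["movie"])
```

Since each pair of tables maps to a single edge in this representation, a second foreign key between the same two tables would overwrite the first.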

In [None]:
def getSchemaGraph():
    #Output: A Schema Graph G  with the structure below:
    # G['node'] = edges
    # G['table'] = { 'foreign_table' : (direction, column, foreign_column) }
    
    
    G = {} 
    cur.execute("SELECT tablename FROM pg_tables WHERE schemaname!='pg_catalog' AND schemaname !='information_schema';")
    for table in cur.fetchall():
        G.setdefault(table[0],{})
    
    sql = '''
        SELECT DISTINCT
            tc.table_name, kcu.column_name,
            ccu.table_name AS foreign_table_name, ccu.column_name AS foreign_column_name             
        FROM
            information_schema.table_constraints AS tc
            JOIN information_schema.key_column_usage AS kcu 
                ON tc.constraint_name = kcu.constraint_name
            JOIN information_schema.constraint_column_usage AS ccu 
                ON ccu.constraint_name = tc.constraint_name
        WHERE constraint_type = 'FOREIGN KEY'
    '''
    cur.execute(sql)
    relations = cur.fetchall()
    
    for (table,column,foreign_table,foreign_column) in relations:
        G[table][foreign_table] = (1,column, foreign_column)
        G[foreign_table][table] = (-1,foreign_column,column)
    return G

In [None]:
G = getSchemaGraph()
G


- **Match Graph**: a graph built from a query match and the schema graph. In the match graph, tuple-sets/schema-sets are also modeled as vertices. To build it, the tuple-sets/schema-sets of the query match are added to the schema graph. A tuple-set of a table x has the same relationships (edges) as the vertex x.

```python
    Gts['table'] = { 'foreign_table' : (direction, column, foreign_column) }

    Gts[('s','table','column', frozenset({words}))] = { 'foreign_table' : (direction, column, foreign_column) }
```
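
As a minimal illustration of the structure above, the sketch below builds a match graph for a hypothetical three-table schema (`person`, `casting`, `movie`) and a query match holding a single tuple-set. The table names, column names and the helper `toy_match_graph` are assumptions made for this example; nodes use the `(table, attribute, schemaWords, valueWords)` layout that `MatchGraph` unpacks.

```python
import copy

# Hypothetical schema graph: casting references both person and movie.
toy_G = {
    'person':  {'casting': (-1, 'id', 'person_id')},
    'movie':   {'casting': (-1, 'id', 'movie_id')},
    'casting': {'person': (1, 'person_id', 'id'),
                'movie':  (1, 'movie_id', 'id')},
}

def toy_match_graph(G, M):
    """Add every tuple-set/schema-set of M to a copy of G as a new vertex."""
    Gts = copy.deepcopy(G)
    for node in M:
        table = node[0]
        # The new vertex inherits the edges of its base table ...
        Gts[node] = copy.deepcopy(G[table])
        # ... and each neighbour receives the reverse edge back to it.
        for neighbour, (direction, column, foreign_column) in G[table].items():
            Gts[neighbour][node] = (-direction, foreign_column, column)
    return Gts

# Tuple-set for the keywords "will smith" found in person.name (assumed).
ts = ('person', 'name', frozenset(), frozenset({'will', 'smith'}))
toy_Gts = toy_match_graph(toy_G, [ts])

print(toy_Gts[ts])             # edges inherited from the vertex 'person'
print(toy_Gts['casting'][ts])  # reverse edge from 'casting' to the tuple-set
```

The original schema graph is left untouched because the function works on a deep copy, so the same `G` can be reused for every query match.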

In [None]:
def MatchGraph(Rq, G, M):
    #Input:  The set of non-empty non-free tuple-sets Rq,
    #        The Schema Graph G,
    #        A Query Match M
    #Output: A Schema Graph Gts  with the structure below:
    # G['node'] = edges
    # G['table'] = { 'foreign_table' : (direction, column, foreign_column) }

    '''
    A Match Subgraph Gts[M] is a subgraph of G that contains:
        The set of free tuple-sets of G
        The query match M
    '''
    
    Gts = copy.deepcopy(G)
    
    #Insert non-free nodes
    for (table ,attribute, schemaWords, valueWords) in M:
        Gts[(table ,attribute, schemaWords, valueWords)]=copy.deepcopy(Gts[table])
        for foreign_table , (direction,column,foreign_column) in Gts[(table ,attribute, schemaWords, valueWords)].items():
            Gts[foreign_table][(table ,attribute, schemaWords, valueWords)] = (direction*(-1),foreign_column,column)
    return Gts 

In [None]:
M = RankedMq[0][0]
Gts = MatchGraph(Rq|Sq, G, M)

print('QM:')
pp(M)
print('\nGts:')
pp(Gts)

#### Algorithm for Creating and Ranking Candidate Networks

To create a candidate network, the `SingleCN` algorithm searches the match graph for a minimal path that visits all non-free tuple-sets/schema-sets of the query match.

This path must be:
- **Minimal:** guaranteed by a shortest-path algorithm based on breadth-first search (BFS).
- **Total:** the `containsMatch` function guarantees that all tuple-sets/schema-sets of the query match are visited.
- **Sound:** a joining network of tuple-sets is considered sound if it does not contain a subtree of the form $R^K - S^L - R^M$, where $R$ and $S$ are relations and the schema graph has an edge $R \rightarrow S$.
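
The soundness condition can be illustrated on a plain path of table names. The sketch below assumes the edge layout produced by `getSchemaGraph` (direction `1` when the source table holds the foreign key, `-1` when the target table does) and a hypothetical `person`/`casting` schema. It rejects `R - S - R` when `S` holds the foreign key to `R`, since each `S` tuple can then join only one `R` tuple. The helper `toy_is_sound` is a simplified stand-in written for this example and compares table names only.

```python
# Hypothetical edges: (direction, column, foreign_column); direction 1 means
# the source table holds the foreign key, -1 means the target table does.
toy_edges = {
    'person':  {'casting': (-1, 'id', 'person_id')},
    'casting': {'person': (1, 'person_id', 'id')},
}

def toy_is_sound(edges, path):
    """Reject R - S - R when S holds the foreign key that points to R."""
    for a, b, c in zip(path, path[1:], path[2:]):
        if a == c and edges[a][b][0] == -1:
            return False
    return True

# Two casting rows may share one person, so this path is accepted.
sound_path = toy_is_sound(toy_edges, ['casting', 'person', 'casting'])
# One casting row points to a single person, so this path is rejected.
unsound_path = toy_is_sound(toy_edges, ['person', 'casting', 'person'])
print(sound_path, unsound_path)
```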

The ranking of candidate networks is now done partly in the query match ranking stage. All that remains is to penalize large candidate networks by dividing the score by their size.
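
The size penalty can be sketched with a few made-up candidate networks. The paths and scores below are assumptions chosen only to show how a long network with a high query match score can fall below a shorter one.

```python
# Hypothetical (candidate network, query match score) pairs.
toy_cns = [
    (['person', 'casting', 'movie'], 0.90),
    (['person'], 0.40),
    (['person', 'casting', 'movie', 'casting', 'person'], 0.95),
]

# Penalize large networks: divide the score by the number of nodes.
toy_ranked = sorted(
    ((cn, score / len(cn)) for cn, score in toy_cns),
    key=lambda pair: pair[1],
    reverse=True,
)

for cn, cn_score in toy_ranked:
    print(len(cn), round(cn_score, 3))
```

In this toy input the five-node network has the highest raw score but ends up last after the division.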

In [None]:
def containsMatch(Ji,M):
    for relation in M:
        if relation not in Ji:
            return False
    return True

def isJNTSound(Gts,Ji):
    if len(Ji)<3:
        return True
    
    for i in range(len(Ji)-2):
        
        if type(Ji[i]) is str:
            tableA = Ji[i]
        else:
            (tableA,attributeA,schemaWordsA,valueWordsA) = Ji[i]
            
        if type(Ji[i+2]) is str:
            tableB = Ji[i+2]
        else:
            (tableB,attributeB,schemaWordsB,valueWordsB) = Ji[i+2]         
            
        if tableA==tableB:
            edge_info = Gts[Ji[i]][Ji[i+1]]
            if(edge_info[0] == -1):
                return False
    return True

In [None]:
def SingleCN(FM,Gts,Tmax,showLog=False):  
  
    if showLog:
        print('================================================================================\nSINGLE CN')
        print('Tmax ',Tmax)
        print('FM')
        pp(FM)

        print('\n\nGts')
        pp(Gts)
        print('\n\n')
    
    F = deque()

    first_element = list(FM)[0]
    J = [first_element]
    
    if len(FM)==1:
        return J
    
    F.append(J)
    
    while F:
        J = F.popleft()           
        u = J[-1]
        
        sortedAdjacents = sorted(Gts[u].items(),key=lambda x : type(x[0]) is str)
        
        if showLog:
            print('--------------------------------------------\nPartial CN')
            print('J ',J,'\n')

            print('\nAdjacents:')
            pp(Gts[u].items())
            
            print('\nSorted Adjacents:')
            pp(sortedAdjacents)
            
            print('F:')
            pp(F)
    
        for (adjacent,edge_info) in sortedAdjacents:
            if showLog:
                pp(adjacent)
                print('is str',(type(adjacent) is str),'notinJ',(adjacent not in J))
            if (type(adjacent) is str) or (adjacent not in J):
                Ji = J + [adjacent]
                
                
                if (Ji not in F) and (len(Ji)<Tmax) and (isJNTSound(Gts,Ji)):
                    
                    if showLog:
                        print('isSound:')
                    
                    if(containsMatch(Ji,FM)):
                        
                        if showLog:
                            print('--------------------------------------------\nGenerated CN')
                            print('J ',Ji,'\n')
                        
                        return Ji
                    else:
                        F.append(Ji)

In [None]:
SingleCN(M,Gts,10)

In [None]:
[x for x in range(10)][:5]

In [None]:
def MatchCN(G,Sq,Rq,RankedMq,topK=10):    
    Cns = []                        
    for  (M,score,schemascore,valuescore) in RankedMq[:topK]:
        Gts = MatchGraph(Rq|Sq, G, M)
        Cn = SingleCN(M,Gts,10)
        if(Cn is not None):
            
            
            # Divide the score by the size of the CN (second part of the ranking)
            
            CnScore = score/len(Cn)
            
            Cns.append( (Cn,Gts,M,CnScore,schemascore,valuescore) )
    
    # Sort the CNs by CnScore
    RankedCns=sorted(Cns,key=lambda x: x[3],reverse=True)
    
    return RankedCns

In [None]:
RankedCns = MatchCN(G,Sq,Rq,RankedMq)
for (j, (Cn,Gts,M,score,schemascore,valuescore) ) in enumerate(RankedCns):
    if j>10:
        break
    print(j+1,'ª CN')
    print('Schema Score:',"%.8f" % schemascore,
          '\nValue Score: ',"%.8f" % valuescore,
          '\n|Cn|: ',"%02d (Considered for the Total Score)" % len(Cn),
          '\nTotal Score: ',"%.8f" % score)
    pp(Cn)
    print('----------------------------------------------------------------------\n')

In [None]:
def getSQLfromCN(Gts,Cn):
    #print('CN:\n',Cn)
    
    selected_attributes = [] 
    tables = []
    conditions=[]
    relationships = []
    
    for i in range(len(Cn)):
        
        if(type(Cn[i]) is str):
            tableA = Cn[i]
            attrA=''
            valueWords=[]
        else:
            (tableA,attrA, _ ,valueWords) = Cn[i]             
                
        A = 't' + str(i)
        
        if(attrA != ''):
            selected_attributes.append(A +'.'+ attrA)
        
        tables.append(tableA+' '+A)
            
        # keyword handling
        for term in valueWords:
            condition = 'CAST('+A +'.'+ attrA + ' AS VARCHAR) ILIKE \'%' + term + '%\''
            conditions.append(condition)
        
        if(i<len(Cn)-1):
            if(type(Cn[i+1]) is str):
                tableB = Cn[i+1]
            else:
                (tableB,attrB, _ , _ )=Cn[i+1]
                  
            B = 't'+str(i+1)
            
            edge_info = Gts[Cn[i]][Cn[i+1]]
            (direction,joining_attrA,joining_attrB) = edge_info
            
            relationships.append( (A,B) )
            
            condition = A + '.' + joining_attrA + ' = ' + B + '.' + joining_attrB         
            conditions.append(condition)
    
    tables_id = ['t'+str(i)+'.__search_id' for i in range(len(tables))]
    
    relationshipsText = ['('+str(a)+'.__search_id'+','+str(b)+'.__search_id'+')' for (a,b) in relationships]
    
    
    sqlText = 'SELECT '
    sqlText +=' ('+', '.join(tables_id)+') AS Tuples '
    if len(relationships)>0:
        sqlText +=', ('+', '.join(relationshipsText)+') AS Relationships'
        
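    # Only emit the attribute list and the WHERE clause when they are non-empty,
    # otherwise the generated SQL would be syntactically invalid
    if selected_attributes:
        sqlText += ' , ' + ' , '.join(selected_attributes)
    
    sqlText +=' FROM ' + ', '.join(tables)
    if conditions:
        sqlText +=' WHERE ' + ' AND '.join(conditions)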
    '''
    print('SELECT:\n',selected_attributes)
    print('TABLES:\n',tables)
    print('CONDITIONS:')
    pp(conditions)
    print('RELATIONSHIPS:')
    pp(relationships)
    '''    
    #print('SQL:\n',sql)
    return sqlText

In [None]:
for (j, (Cn,Gts,M,score,schemascore,valuescore) ) in enumerate(RankedCns):
    pp(Cn)
    print('\n',getSQLfromCN(Gts,Cn))
    print('\n--------------------------------------------')
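
`getSQLfromCN` concatenates the keywords directly into the SQL text. As a hedged alternative, the sketch below builds the keyword conditions with `%s` placeholders, which psycopg2 fills in when the parameters are passed as the second argument of `cur.execute`. The helper name `keyword_conditions`, the alias `t0` and the `person.name` column are assumptions made for this example.

```python
def keyword_conditions(alias, attribute, value_words):
    """Return ILIKE conditions with placeholders plus their parameters."""
    conditions = []
    params = []
    # Sort the keywords so the generated SQL is deterministic.
    for term in sorted(value_words):
        conditions.append(
            'CAST({}.{} AS VARCHAR) ILIKE %s'.format(alias, attribute))
        params.append('%' + term + '%')
    return conditions, params

conds, params = keyword_conditions('t0', 'name', {'will', 'smith'})
where_clause = ' AND '.join(conds)
print(where_clause)
print(params)

# With an open cursor, the query could then be run as:
# cur.execute('SELECT * FROM person t0 WHERE ' + where_clause, params)
```

Keeping the wildcard characters inside the parameters, rather than in the query text, avoids having to escape `%` in the SQL string and leaves quoting of the keywords to the driver.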

In [None]:
def getGoldenStandards():
    goldenStandards = {}
    for i in range(1,51):
        filename = 'golden_standards/0'+str(i).zfill(2) +'.txt'
        with open(filename) as f:

            listOfTuples = []
            Q = ()
            for i, line in enumerate(f.readlines()):
              
                line_without_comment =line.split('#')[0]
                
                if(i==2):
                    comment_of_line = line.split('#')[1]
                    
                    #Remove words not in OLIVEIRA experiments
                    Q = tuple([word for word in comment_of_line.split() if word not in ['title','dr.',"here's",'char','name'] and word not in stw_set])
                
                if line_without_comment:                    
                    
                    # literal_eval only accepts Python literals, unlike eval
                    relevantResult = ast.literal_eval(line_without_comment)
                    listOfTuples.append( relevantResult )
            
            goldenStandards[Q]=listOfTuples
            
    return goldenStandards


goldenStandards = getGoldenStandards()

In [None]:
def evaluateCN(CnResult,goldenStandard):
    '''
    print('Check whether they are equal:\n')
    print('Result: \n',CnResult)
    print('Golden Result: \n',goldenStandard)
    '''
    
    tuplesOfCNResult =  set(CnResult[0])
    
    tuplesOfStandard =  set(goldenStandard[0])
        
    #Check if the CN result have all tuples in golden standard
    if tuplesOfCNResult.issuperset(tuplesOfStandard) == False:
        return False
    
    
    relationshipsOfCNResult = CnResult[1]
    
    for goldenRelationship in goldenStandard[1]:
        
        (A,B) = goldenRelationship
        
        if (A,B) not in relationshipsOfCNResult and (B,A) not in relationshipsOfCNResult:
            return False
        
    return True


def evaluanteResult(Result,Query):
    
    goldenStandard = goldenStandards[tuple(Query)]
    
    for goldenRow in goldenStandard:

        found = False

        for row in Result:
            if evaluateCN(row,goldenRow):
                found = True

        if not found:
            return False
        
    return True
            
            
x=[('(39292828,5360667,21231023)', '("(39292828,5360667)","(5360667,21231023)")', 'Hamill, Mark', 'Luke Skywalker'), ('(39292828,5360749,21231023)', '("(39292828,5360749)","(5360749,21231023)")', 'Hamill, Mark', 'Luke Skywalker'), ('(39292828,5360752,21231023)', '("(39292828,5360752)","(5360752,21231023)")', 'Hamill, Mark', 'Luke Skywalker'), ('(39292828,5360753,21231023)', '("(39292828,5360753)","(5360753,21231023)")', 'Hamill, Mark', 'Luke Skywalker')]
q = ['hamill', 'skywalker']

def normalizeResult(ResultFromDatabase):
    normalizedResult = []
    
    for row in ResultFromDatabase:        
        if type(row[0]) == int:
            tuples = [row[0]]
        else:
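            tuples = ast.literal_eval(str(row[0]))
        
        try:
            relationships = ast.literal_eval(row[1])
            relationships = [ast.literal_eval(element) for element in relationships]
        except Exception:
            # row[1] is not a relationship list (e.g. a plain attribute value)
            relationships = []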
            
        
        normalizedResult.append( (tuples,relationships) )
    return normalizedResult

normX = normalizeResult(x)

evaluanteResult(normX,q)

In [None]:
def getRelevantPosition(RankedCns,Q):
    
    # MatchCN returns tuples longer than four fields; only Cn and Gts are needed here
    for (position,(Cn,Gts,*_)) in enumerate(RankedCns):

        #print('CN:\n')
        #pp(Cn)
        
        SQL = getSQLfromCN(Gts,Cn)

        #print(SQL)
        
        cur.execute(SQL)
        Results = cur.fetchall()

        NResults = normalizeResult(Results)

        Relevance = evaluanteResult(NResults,Q)

        if Relevance:
            return position+1

    return -1

### Further below is the execution for other CNs (query sets)

# Execution

In [None]:
def preProcessing(emb_model="word_embeddings/word2vec/GoogleNews-vectors-negative300.bin"):
    global wordHash
    global attributeHash
    global wordEmbeddingsModel
    global wordEmbeddingsHashA
    global wordEmbeddingsHashB
    global wordEmbeddingsHashC
    
    wordEmbeddingsModel=loadWordEmbeddingsModel(emb_model)
    
    (wordHash,attributeHash) = createInvertedIndex(wordEmbeddingsModel)
    processIAF(wordHash,attributeHash)
    processNormsOfAttributes(wordHash,attributeHash,wordEmbeddingsModel)
    
    (wordEmbeddingsHashA,wordEmbeddingsHashB,wordEmbeddingsHashC) = createEmbeddingsHash(wordEmbeddingsModel,attributeHash,weight=0.5)
    
    print('PRE-PROCESSING STAGE FINISHED')

In [None]:
def main(mi,smi,sim_args={},showLog=False,querySetFileName='querysets/queryset_imdb_martins.txt'):   
    QuerySets = getQuerySets(querySetFileName)
    maxscores = (list(),list())
    for (i,Q) in enumerate(QuerySets):
       
        print('QUERY-SET ',Q,'\n')
        
        print('FINDING TUPLE-SETS')
        Rq = TSFind(Q)
        print(len(Rq),'TUPLE-SETS CREATED\n')
        
        print('FINDING SCHEMA-SETS')
        SimilarityThreshold = 0.799999999999
        Sq = SchSFind(Q,SimilarityThreshold,sim_args)

        print(len(Sq),' SCHEMA-SETS CREATED\n')
        
        print('GENERATING QUERY MATCHES')
        Mq = QMGen(Q,Sq|Rq)
        print (len(Mq),'QUERY MATCHES CREATED\n')
        
        RankedMq = QMRank(Mq,mi,smi)
        
         
        '''    
        for (j, (M,score , ( valuescore , schemascore , tam )) ) in enumerate(RankedMq):
            if j>10:
                break
            print(j+1,'ª QM')
            print('Value Score: ',"%.8f" % valuescore,'\nSchema Score:',"%.8f" % schemascore, '\n|M|: ',tam,'\nTotal Score: ',"%.8f" % score)
            pp(M)
            print('----------------------------------------------------------------------\n')
        '''    
        Mq=[M for (M,score , ( valuescore , schemascore , tam )) in RankedMq][:20]
        
        if showLog:
            for M in Mq[:20]:
                pp(M)
                print('\n\n')
        
        print('GENERATING CANDIDATE NETWORKS')
        G = getSchemaGraph()
        
        Cns = MatchCN(G,Rq,Sq,Mq)
        
        print (len(Cns),'CANDIDATE NETWORKS CREATED\n')
        
        if showLog:
            for Cn in Cns[:20]:
                pp(Cn[0])
                print('\n\n')
                #pp(Cn[1])
                #print('\n\n\n==================================================================================\n')
                
        print('RANKING CANDIDATE NETWORKS')
        RankedCns = CNRank(Cns,mi,smi)
        for (j,Cn) in enumerate(RankedCns):
            if j>10:
                break
            print(j+1,'ª CN')
            print('Value Score: ',"%.8f" % Cn[4][0],'\nSchema Score:',"%.8f" % Cn[4][1], '\n|Cn|: ',Cn[4][2],'\nTotal Score: ',"%.8f" % Cn[3])
            pp(Cn[0])
            print('----------------------------------------------------------------------\n')
        
            maxscores[0].append(Cn[4][0])
            maxscores[1].append(Cn[4][1])
        gc.collect()
        
        print('==========================================================================\
==========================================================================\
==========================================================================\
==========================================================================\
==========================================================================\
==========================================================================')
    return maxscores

In [None]:
#pp(wordHash['denzel'])

In [None]:
#pp(attributeHash)

In [None]:
#pp(wordEmbeddingsHashA)

In [None]:
#pp(wordEmbeddingsHashB)

In [None]:
mi = 0.90/1.9372498568291752e-06
mi

In [None]:
preProcessing()

In [None]:
mi = 464576.1086662768
smi = 1
maxscores = main(mi,smi)
maxscores

### The value of the constant mi was chosen by inspecting the maxscores, in order to normalize the result

In [None]:
#max(maxscores[0]), max(maxscores[1])

## Experiment
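
The experiment scores the schema mappings returned by `SchSFind` against a list of expected mappings, using precision, recall and F1. A minimal sketch of the metric computation with made-up counts follows; the helper name `toy_prf` is an assumption for this example, and the `-1` sentinel for undefined metrics mirrors the convention used in this notebook.

```python
def toy_prf(tp, fp, fn):
    """Precision, recall and F1 from raw counts (-1 when undefined)."""
    try:
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        # F1 is the harmonic mean of precision and recall.
        f1 = 2 * precision * recall / (precision + recall)
    except ZeroDivisionError:
        return (-1, -1, -1)
    return (precision, recall, f1)

# 8 correct mappings, 2 spurious ones, 4 expected mappings not found.
toy_metrics = toy_prf(8, 2, 4)
print(toy_metrics)
```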

In [None]:
def experimento(mi,smi,threshold,goldenMappings,sim_args={},
                showLog=False,querySetFileName='querysets/queryset_imdb_martins.txt'):   
    QuerySets = getQuerySets(querySetFileName)
    
    goldenMappings = goldenMappings.copy()
    TP=[]
    FP=[]
    FN=[]
    
    for (i,Q) in enumerate(QuerySets):
               
        Sq = SchSFind(Q,threshold,sim_args)

        for schema_mapping in Sq:

            if schema_mapping in goldenMappings:
                TP.append(schema_mapping)
                goldenMappings.remove(schema_mapping)
            else:
                FP.append(schema_mapping)

    FN=goldenMappings

    #print('TP: ')
    #pp(TP)

    #print('FP: ')
    #pp(FP)

    #print('FN: ')
    #pp(FN)
    
    tp=len(TP)
    fp=len(FP)
    fn=len(FN)
    
    #print(tp,fp,fn)
    try:
        precision = tp/(tp+fp)
        recall = tp/(tp+fn)
        # F1 is the harmonic mean of precision and recall
        f1 = 2*(precision*recall)/(precision+recall)
    except ZeroDivisionError:
        precision=recall=f1=-1
    
    return (precision,recall,f1,TP,FP,FN)
                    
                    
            

In [None]:
goldenMappings = [
    #William Smith nickname
    ('s', 'person', 'name', frozenset({'nickname'})),
    ('s', 'character', 'name', frozenset({'nickname'})),
    #protagonist sound music
    ('s', 'character', '*', frozenset({'protagonist'})),
    #character Forrest Gump
    ('s', 'character', '*', frozenset({'character'})),
    #script of Casablanca
    ('s', 'casting', 'note', frozenset({'script'})),
    #best movie award James Cameron
    ('s', 'movie', '*', frozenset({'movie'})),
    #actor James Bond
    ('s', 'person', '*', frozenset({'actor'})),
    #flick Ellen Page thriller
    ('s', 'movie', '*', frozenset({'flick'})),
    #movie Terry Gilliam Benicio del Toro Dr gonzo
    ('s', 'movie', '*', frozenset({'movie'})),
    #director artificial intelligent Haley Joel Osment
    #Trivia Don Quixote
    #Movie Steven Spielberg
    ('s', 'movie', '*', frozenset({'movie'})),
    #German fellow actor Mel Gibson
    ('s', 'person', '*', frozenset({'actor'})),
    #Fellowship Ring King Towers
    #Lord of the Rings films
    ('s', 'movie', '*', frozenset({'films'})),
    #Director John Hughes Matthew Broderick 1986
    #cast Friends
    ('s', 'casting', '*', frozenset({'cast'})),
    #Henry Fonda mine
    #name of actress in Lara Croft film
    ('s', 'character', 'name', frozenset({'name'})),
    ('s', 'person', 'name', frozenset({'name'})),
    ('s', 'person', '*', frozenset({'actress'})),
    ('s', 'movie', '*', frozenset({'film'})),
    #Russell Crowe gladiator char name
    ('s', 'character', '*', frozenset({'character'})),
    ('s', 'character', 'name', frozenset({'name'})),
    ('s', 'person', 'name', frozenset({'name'})),
    #Darth Vader
    #Norman Bates
    #Atticus surname
    ('s', 'character', 'name', frozenset({'surname'})),
    ('s', 'person', 'name', frozenset({'surname'})),
    #social network
    #Space Odyssey Adventure year
    #Chihiro animation
    #actor Draco Harry Potter
    ('s', 'person', '*', frozenset({'actor'})),
]

In [None]:
#print(("%.2f; %.2f; %.2f; %.2f;" % (threshold,precision,recall,f1)).replace('.',','))

In [None]:
preProcessing()

In [None]:
model=wordEmbeddingsModel
model.most_similar(positive=('person','movie'))

In [None]:
(wordEmbeddingsHashA,wordEmbeddingsHashB,
 wordEmbeddingsHashC) = createEmbeddingsHash(wordEmbeddingsModel,
                                             attributeHash,weight=0.5)
pp(wordEmbeddingsHashC['person'])

In [None]:
#wn_sim=True, jaccard_sim=True,
#emb_sim=False,  emb_model=None,
#emb10_sim=False, emb10_hash=None
#wordEmbeddingsHashB

#sim_args={}
#sim_args={'wn_sim':False,'jaccard_sim':False,'emb10_sim':True,'emb10_hash':wordEmbeddingsHashB}
#sim_args={'emb10_sim':True,'emb10_hash':wordEmbeddingsHashC}
#sim_args={'emb_sim':True,'emb_model':wordEmbeddingsModel}
results = []
threshold=0.9
for threshold in [x/100 for x in range(50,101)][::5]:
    print('threshold:', threshold)
    for weight in [x/100 for x in range(50,101)][::5]:
        (wordEmbeddingsHashA,wordEmbeddingsHashB,
     wordEmbeddingsHashC) = createEmbeddingsHash(wordEmbeddingsModel,
                                                 attributeHash,weight=weight)

        sim_args={'emb10_sim':True,'emb10_hash':wordEmbeddingsHashC}

        precision,recall,f1,TP,FP,FN = experimento(mi,smi,threshold,goldenMappings,sim_args=sim_args)

        results.append( (weight,precision,recall,f1) )

        print(("%.2f; %.2f; %.2f; %.2f; %d; %d; %d" % 
               (weight,precision,recall,f1,len(TP),len(FP),len(FN))).replace('.',','))

        #print('threshold',threshold)
        #print('precision',precision)
        #print('recall',recall)
        #print('f1',f1)

        if False:
            print('TP')
            pp(TP)

            print('FP')
            pp(FP)

            print('FN')
            pp(FN)



In [None]:
#wn_sim=True, jaccard_sim=True,
#emb_sim=False,  emb_model=None,
#emb10_sim=False, emb10_hash=None
#wordEmbeddingsHashB

#sim_args={}
#sim_args={'wn_sim':False,'jaccard_sim':False,'emb10_sim':True,'emb10_hash':wordEmbeddingsHashB}
#sim_args={'emb10_sim':True,'emb10_hash':wordEmbeddingsHashC}
#sim_args={'emb_sim':True,'emb_model':wordEmbeddingsModel}
results = []
for weight in [x/100 for x in range(50,101)][::5]:
    (wordEmbeddingsHashA,wordEmbeddingsHashB,
 wordEmbeddingsHashC) = createEmbeddingsHash(wordEmbeddingsModel,
                                             attributeHash,weight=weight)
    
    sim_args={'wn_sim':False,'jaccard_sim':False,'emb10_sim':True,'emb10_hash':wordEmbeddingsHashC}
    
    precision,recall,f1,TP,FP,FN = experimento(mi,smi,threshold,goldenMappings,sim_args=sim_args)
    
    results.append( (weight,precision,recall,f1) )
    
    print(("%.2f; %.2f; %.2f; %.2f; %d; %d; %d" % 
           (weight,precision,recall,f1,len(TP),len(FP),len(FN))).replace('.',','))
    
    #print('threshold',threshold)
    #print('precision',precision)
    #print('recall',recall)
    #print('f1',f1)
    
    if False:
        print('TP')
        pp(TP)

        print('FP')
        pp(FP)

        print('FN')
        pp(FN)
    


## Analyzing only the classic query sets, which should yield NO FALSE-POSITIVE schema mappings

In [None]:
#wn_sim=True, jaccard_sim=True,
#emb_sim=False,  emb_model=None,
#emb10_sim=False, emb10_hash=None
#wordEmbeddingsHashB

sim_args={}
#sim_args={'wn_sim':False,'jaccard_sim':False,'emb10_sim':True,'emb10_hash':wordEmbeddingsHashB}
#sim_args={'emb10_sim':True,'emb10_hash':wordEmbeddingsHashA}
#sim_args={'emb_sim':True,'emb_model':wordEmbeddingsModel}
results = []
for threshold in [x/100 for x in range(50,101)][::5]:
    precision,recall,f1,TP,FP,FN = experimento(mi,smi,threshold,goldenMappings,sim_args=sim_args,
                                               querySetFileName='querysets/queryset_imdb_spark.txt')
    
    results.append( (threshold,precision,recall,f1) )
    
    print(("%.2f; %.2f; %.2f; %.2f; %d; %d; %d" % (threshold,precision,recall,f1,len(TP),len(FP),len(FN))).replace('.',','))
    
    #print('threshold',threshold)
    #print('precision',precision)
    #print('recall',recall)
    #print('f1',f1)
    
    if True:
        #print('TP')
        #pp(TP)

        print('FP')
        pp(FP)

        #print('FN')
        #pp(FN)