# S-MatCNGenPy

Este é um passo-a-passo da implementação em python do método **S-MatCNGenPy**, desenvolvido no trabalho \[1\]. O seu principal objetivo é garantir o suporte a referência ao esquema de dados em busca por palavras-chave em banco de dados. Observe que algumas consultas, como visto abaixo, não estão relacionadas apenas a valores do banco de dados, mas a própria estrutura do esquema.

```
    filmes do Will Smith
```
- **`filmes`** : relação Movie
- **`Will`, `Smith`** : instâncias da tabela Person(Name) 


#### Leituras Importantes

> [\[1\]](https://drive.google.com/file/d/1ZnljlKss9a8M_RDqseTYfZbQCjDhcJkk/view) MARTINS, Paulo Rodrigo O.; DA SILVA, Altigran Soares. *Uma Abordagem para Suporte a Referências ao Esquema em Consultas por Palavras-Chave em Bancos de Dados Relacionais*. Trabalho de Conclusão de Curso (Ciência da Computação), Universidade Federal do Amazonas, 2017. 

> [\[2\]]() DE OLIVEIRA, Pericles; DA SILVA, Altigran; DE MOURA, Edleno. *Match-Based Candidate Network Generation for Keyword Queries over Relational Databases*. In: Data Engineering (ICDE), 2018 IEEE 34st International Conference on. IEEE, 2016. Aceito pra Pubicação

> [\[3\]](https://dl.acm.org/citation.cfm?id=1989383) BERGAMASCHI, Sonia et al. *Keyword search over relational databases: a metadata approach*. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data. ACM, 2011. p. 565-576.

In [1]:
import psycopg2
from psycopg2 import sql
from pprint import pprint as pp
from collections import defaultdict
import string
import itertools
import copy
from math import log1p
from queue import deque
import ast
import gc
from queue import deque

import nltk 
#nltk.download('wordnet')
#nltk.download('omw')
#nltk.download('stopwords')

from nltk.corpus import stopwords
from nltk.corpus import wordnet as wn

stw_set = set(stopwords.words('english')) - {'will'}

# Connect to an existing database
conn = psycopg2.connect("dbname=imdb user=postgres")

# Open a cursor to perform database operations
cur = conn.cursor()

## Pré-processamento

Antes mesmo de receber os querysets, o sistema passa por um pré-processamento, que é responsavél pela criação de dois índices invertidos:

* **wordHash**: tabela que associa cada termo do banco de dados com o seu **IAF (Inverse Attribute Frequency)** e também referencia todas Tabelas, Colunas e CTIDs em que a palavra ocorre. Nota: o CTID é o endereço físico de uma linha em uma tabela, utilizado para encontrar rapidamente uma tupla.
```python
wordHash['term'] = ( IAF , { 'table': { 'column' : [ctid] } } )
```
* **attributeHash**: tabela que para cada atributo (documento), armazena a sua norma e o número de palavras distintas.
```python
attributeHash['table']['column'] = ( norm , num_distinct_words )
```

### Criação dos Índices Invertidos

O processo de criação é realizado em três etapas. Primeiramente, o procedimento ```createInvertedIndex()``` faz uma varredura no banco de dados e preenche parcialmente o ```wordHash```, faltando apenas calcular os IAFs para cada termo. Além disso, este procedimento também ele também armazena no ```attributeHash``` o total de palavras distintas para cada atributo.

Em seguida, os IAFs de cada termo são processados através do método ```processIAF(wordHash,attributeHash)```. Por último, as normas dos atributos (documentos) são calculadas no método ```processNormsOfAttributes(wordHash,attributeHash)```.

In [2]:
def createInvertedIndex():
    #Output: wordHash (Term Index) with this structure below
    #map['word'] = [ 'table': ( {column} , ['ctid'] ) ]

    '''
    The Term Index is built in a preprocessing step that scans only
    once all the relations over which the queries will be issued.
    '''
    
    wordHash = {}
    attributeHash = {}
    
    # Get list of tablenames
    cur.execute("SELECT DISTINCT tablename FROM pg_tables WHERE schemaname!='pg_catalog' AND schemaname !='information_schema';")
    for table in cur.fetchall():
        table_name = table[0]
        print('INDEXING TABLE ',table_name)
        
        attributeHash[table_name] = {}
        
        #Get all tuples for this tablename
        cur.execute(
            sql.SQL("SELECT ctid, * FROM {};").format(sql.Identifier(table_name))
            #NOTE: sql.SQL is needed to specify this parameter as table name (can't be passed as execute second parameter)
        )

        for row in cur.fetchall():
            for column in range(1,len(row)):
                column_name = cur.description[column][0]   
                ctid = row[0]

                for word in [word.strip(string.punctuation) for word in str(row[column]).lower().split()]:
                    
                    #Ignoring STOPWORDS
                    if word in stw_set:
                        continue

                    #If word entry doesn't exists, it will be inicialized (setdefault method),
                    #Append the location for this word
                    wordHash.setdefault(word, {})                    
                    wordHash[word].setdefault( table_name , {} )
                    wordHash[word][table_name].setdefault( column_name , [] ).append(ctid)
                    
                    attributeHash[table_name].setdefault(column_name,(0,set()))
                    attributeHash[table_name][column_name][1].add(word)
        
        #Count words
        
        for (column_name,(norm,wordSet)) in attributeHash[table_name].items():
            num_distinct_words = len(wordSet)
            wordSet.clear()
            attributeHash[table_name][column_name] = (norm,num_distinct_words)
        

    print ('INVERTED INDEX CREATED')
    return (wordHash,attributeHash)

(wordHash,attributeHash) = createInvertedIndex()

INDEXING TABLE  casting
INDEXING TABLE  role
INDEXING TABLE  person
INDEXING TABLE  movie
INDEXING TABLE  character
INVERTED INDEX CREATED


In [3]:
pp(wordHash['denzel'])

{'casting': {'note': ['(2394,52)', '(3822,72)']},
 'character': {'name': ['(858,89)']},
 'person': {'name': ['(206,57)',
                     '(589,99)',
                     '(615,91)',
                     '(722,53)',
                     '(987,44)',
                     '(1211,109)',
                     '(1257,105)',
                     '(1409,17)',
                     '(1670,26)',
                     '(1840,55)',
                     '(1959,105)',
                     '(2177,95)']}}


In [4]:
pp(attributeHash['movie'])

{'__search_id': (0, 181706),
 'episode_nr': (0, 1),
 'episode_of_id': (0, 1),
 'id': (0, 181706),
 'imdb_id': (0, 1),
 'imdb_index': (0, 13),
 'kind_id': (0, 1),
 'phonetic_code': (0, 18016),
 'production_year': (0, 123),
 'season_nr': (0, 1),
 'series_years': (0, 1),
 'title': (0, 79535)}


In [5]:
def processIAF(wordHash,attributeHash):
    
    total_attributes = sum([len(attribute) for attribute in attributeHash.values()])
    
    for (term, values) in wordHash.items():
        
        attributes_with_this_term = sum([len(attribute) for attribute in wordHash[term].values()])
        
        IAF = log1p(total_attributes/attributes_with_this_term)
                
        wordHash[term] = (IAF,values)
    print('IAF PROCESSED')

processIAF(wordHash,attributeHash)

IAF PROCESSED


In [6]:
pp(wordHash['denzel'])

(2.614959778036198,
 {'casting': {'note': ['(2394,52)', '(3822,72)']},
  'character': {'name': ['(858,89)']},
  'person': {'name': ['(206,57)',
                      '(589,99)',
                      '(615,91)',
                      '(722,53)',
                      '(987,44)',
                      '(1211,109)',
                      '(1257,105)',
                      '(1409,17)',
                      '(1670,26)',
                      '(1840,55)',
                      '(1959,105)',
                      '(2177,95)']}})


In [7]:
def processNormsOfAttributes(wordHash,attributeHash):
  
    # Get list of tablenames
    cur.execute("SELECT DISTINCT tablename FROM pg_tables WHERE schemaname!='pg_catalog' AND schemaname !='information_schema';")
    for table in cur.fetchall():
        table_name = table[0]
        print('PROCESSING TABLE ',table_name)
        
        #Get all tuples for this tablename
        cur.execute(
            sql.SQL("SELECT ctid, * FROM {};").format(sql.Identifier(table_name))
            #NOTE: sql.SQL is needed to specify this parameter as table name (can't be passed as execute second parameter)
        )

        for row in cur.fetchall():
            for column in range(1,len(row)):
                column_name = cur.description[column][0]   
                ctid = row[0]

                for word in [word.strip(string.punctuation) for word in str(row[column]).lower().split()]:
                    
                    #Ignoring STOPWORDS
                    if word in stw_set:
                        continue
                    
                    (prevNorm,num_distinct_words)=attributeHash[table_name][column_name]
                    
                    IAF = wordHash[word][0]
                    
                    Norm = prevNorm + IAF
                    
                    attributeHash[table_name][column_name]=(Norm,num_distinct_words)
                    

    print ('NORMS OF ATTRIBUTES PROCESSED')

processNormsOfAttributes(wordHash,attributeHash)

PROCESSING TABLE  casting
PROCESSING TABLE  role
PROCESSING TABLE  person
PROCESSING TABLE  movie
PROCESSING TABLE  character
NORMS OF ATTRIBUTES PROCESSED


In [8]:
pp(attributeHash['movie'])

{'__search_id': (665691.1324713555, 181706),
 'episode_nr': (187705.0247137609, 1),
 'episode_of_id': (187705.0247137609, 1),
 'id': (510658.84283990104, 181706),
 'imdb_id': (187705.0247137609, 1),
 'imdb_index': (186715.63296297463, 13),
 'kind_id': (271455.1406490868, 1),
 'phonetic_code': (406945.19711942895, 18016),
 'production_year': (413884.3735530032, 123),
 'season_nr': (187705.0247137609, 1),
 'series_years': (187705.0247137609, 1),
 'title': (1190111.4197476357, 79535)}


## Main

O processamento das consultas é realizado em 

In [9]:
def getQuerySets():
    QuerySet = []
    with open('querysets/queryset_imdb_martins.txt') as f:
        for line in f.readlines():
            
            #The line bellow Remove words not in OLIVEIRA experiments
            Q = [word.strip(string.punctuation) for word in line.split() if word not in ['title','dr.',"here's",'char','name'] and word not in stw_set]  
            
            #Q = [word.strip(string.punctuation) for word in line.split() if word not in stw_set]  
            
            QuerySet.append(Q)
    return QuerySet
        
QuerySet = getQuerySets()
QuerySet

[['will', 'smith'],
 ['sound', 'music'],
 ['forrest', 'gump'],
 ['casablanca'],
 ['best', 'movie', 'award', 'James', 'Cameron'],
 ['actor', 'James', 'Bond'],
 ['movie', 'Ellen', 'Page', 'thriller'],
 ['movie', 'Terry', 'Gilliam', 'Benicio', 'del', 'Toro', 'Dr', 'gonzo'],
 ['director', 'artificial', 'intelligent', 'Haley', 'Joel', 'Osment'],
 ['Trivia', 'Don', 'Quixote'],
 ['Movie', 'Steven', 'Spielberg'],
 ['German', 'fellow', 'actor', 'Mel', 'Gibson'],
 ['actor',
  'The',
  'Fellowship',
  'Ring',
  'The',
  'Return',
  'King',
  'The',
  'Two',
  'Towers'],
 ['Director', 'John', 'Hughes', 'Matthew', 'Broderick', '1986'],
 ['cast', 'Friends'],
 ['forrest', 'gump'],
 ['henry', 'fonda', 'mine'],
 ['russell', 'crowe', 'gladiator'],
 ['darth', 'vader'],
 ['norman', 'bates'],
 ['atticus', 'finch']]

### Recuperação de Tuple-sets
Esta etapa consiste em recuperar conjuntos de tuplas que contém cada palavra-chave, chamados de tuple-sets. O algoritmo `TSFind`, que realiza esse processo, pode ser é divido em três partes: 
* **Recuperação de tuplas:** Essa parte consiste em encontrar os conjuntos de tuplas que contém cada uma das palavras do Queryset. Essas informações já foram pré-processadas no índice invertido `wordHash`.
* **Interseção de tuplas:** Esta parte acontece no algoritmo `TSInter` e é responsável por encontrar tuplas que contém mais de uma das palavras-chave. Além disso, esta etapa irá garantir que os tuple-sets `TABLE{word}` contenham apenas a palavra `word` e nenhuma outra palavra do queryset. Esta propriedade é necessária para encontrar a cobertura mínima (etapa de criação de query matches). 
* **Criação de tuple-sets:** Esta parte irá condensar os resultados. Em vez de listar todas as tuplas que contenham as palavras-chave, precisamos apenas saber quais colunas possuem cada uma das palavras. Por isso, os tuple-sets terão a estrutura (o primeiro atributo refere-se a *value* ou *schema*):
```python
TupleSet = ('v','table','column', frozenset({words}))
```

In [10]:
def TSFind(Q):
    #Input:  A keyword query Q=[k1, k2, . . . , km]
    #Output: Set of non-free and non-empty tuple-sets Rq

    '''
    The tuple-set Rki contains the tuples of Ri that contain all
    terms of K and no other keywords from Q
    '''
    
    #Part 1: Find sets of tuples containing each keyword
    global P
    P = {}
    for keyword in Q:
        tupleset = set()
        
        if keyword not in wordHash:
            continue
        
        for (table,attributes) in wordHash.get(keyword)[1].items():
            for (attribute,ctids) in attributes.items():
                for ctid in ctids:
                    tupleset.add( (table,attribute,ctid) )
        P[frozenset([keyword])] = tupleset
    
    #Part 2: Find sets of tuples containing larger termsets
    P = TSInter(P)
    
    #Part 3:Build tuple-sets
    Rq = set()
    for keyword , tuples in P.items():
        for (table,attribute,ctid) in tuples:
            Rq.add( ('v',table,attribute,keyword) )
    print ('TUPLE SETS CREATED')
    return Rq


def TSInter(P):
    #Input: A Set of non-empty tuple-sets for each keyword alone P 
    #Output: The Set P, but now including larger termsets (process Intersections)

    '''
    Termset is any non-empty subset K of the terms of a query Q        
    '''
    
    Pprev = {}
    Pprev=copy.deepcopy(P)
    Pcurr = {}

    combinations = [x for x in itertools.combinations(Pprev.keys(),2)]
    for ( Ki , Kj ) in combinations:
        Tki = Pprev[Ki]
        Tkj = Pprev[Kj]
        
        X = Ki | Kj
        Tx = Tki & Tkj        
        
        if len(Tx) > 0:            
            Pcurr[X]  = Tx            
            Pprev[Ki] = Tki - Tx         
            Pprev[Kj] = Tkj - Tx
            
    if Pcurr != {}:
        Pcurr = copy.deepcopy(TSInter(Pcurr))
        
    #Pprev = Pprev U Pcurr
    Pprev.update(Pcurr)     
    return Pprev   

Q = ['actor', 'james', 'bond']
Rq = TSFind(Q)
pp(Rq)

TUPLE SETS CREATED
{('v', 'casting', 'note', frozenset({'bond'})),
 ('v', 'casting', 'note', frozenset({'james'})),
 ('v', 'casting', 'note', frozenset({'james', 'bond'})),
 ('v', 'casting', 'note', frozenset({'actor'})),
 ('v', 'casting', 'note', frozenset({'james', 'actor'})),
 ('v', 'character', 'name', frozenset({'james'})),
 ('v', 'character', 'name', frozenset({'bond'})),
 ('v', 'character', 'name', frozenset({'james', 'bond'})),
 ('v', 'character', 'name', frozenset({'actor'})),
 ('v', 'character', 'name', frozenset({'james', 'actor'})),
 ('v', 'movie', 'title', frozenset({'james'})),
 ('v', 'movie', 'title', frozenset({'bond'})),
 ('v', 'movie', 'title', frozenset({'james', 'bond'})),
 ('v', 'movie', 'title', frozenset({'actor'})),
 ('v', 'person', 'name', frozenset({'james'})),
 ('v', 'person', 'name', frozenset({'james', 'actor'})),
 ('v', 'person', 'name', frozenset({'bond'})),
 ('v', 'person', 'name', frozenset({'james', 'bond'})),
 ('v', 'role', 'role', frozenset({'actor'}

### Criação Schema-sets

Esta etapa consiste na criação dos Schema-sets, que é uma estrutura análoga aos tuple-sets vistos na etapa anterior. Aqui, o processo também é divido em três partes: 
* **Mapeamento de Elementos do Esquema (*Schema Matching*):** Essa parte consiste em analisar a similaridade entre as palavras do querysets e elementos do esquema (nomes de relações e atributos).
* **Análise de Termos Adjacentes:** Esta parte verifica as relações entre as palavras chave, muitas vezes uma palavras-chave relacioada a elemento do esquema delimita o domínio das palavras-chave adjacentes. Ex: Actor James Bond delimita a palavra James para nome de Pessoa, em vez de nome de Filme.
* **Criação de Schema-sets:** Esta parte irá formatar os resultados para ficarem semelhantes à estrutura de tuple-sets, seguindo a estrutura a seguir (o primeiro atributo refere-se a *value* ou *schema*):
```python
SchemaSet = ('s','table','column', frozenset({words}))
```

#### Similaridades para o Schema-Matching

Para o mapeamento de palavras para elementos do esquema, foram utilizadas métricas de similaridade de escrita e semântica.
O Coeficiente de Jaccard é uma métrica que avalia a interseção entre duas palavras, sendo ideal para similaridades de escrita, como abreviações ou erros de digitação. 

Por outro lado, as métricas semânticas utilizam o dicionário léxico WordNet para encontrar similaridades de sentido. O pacote de ferramentas NLTK disponibiliza uma série de métricas semânticas [aqui](http://www.nltk.org/howto/wordnet.html "WordNet Interface"). Entre elas, as principais são a Path Similarity e a Wu-Palmer Similarity. A primeira métrica procura encontrar a menor distância entre duas palavras, no grafo de relações do WordNet, enquanto a segunda analisa o ancestral comum mais próximo entre duas palavras.

In [11]:
def wordNetSimilarity(wordA,wordB):
    
    A = set(wn.synsets(wordA))
    B = set(wn.synsets(wordB))
    
    wupSimilarities = [0]
    pathSimilarities = [0]
    for (sense1,sense2) in itertools.product(A,B):        
        wupSimilarities.append(wn.wup_similarity(sense1,sense2) or 0)
        pathSimilarities.append(wn.path_similarity(sense1,sense2) or 0)
    return max(max(wupSimilarities),max(pathSimilarities))

def jaccard_similarity(wordA,wordB):
    
    A = set(wordA)
    B = set(wordB)
    
    return len(A & B ) / len(A | B)
    
def wordSimilarity(wordA,wordB):
    return max( (jaccard_similarity(wordA,wordB),wordNetSimilarity(wordA,wordB)) )

In [12]:
set(wn.synsets('come'))

{Synset('arrive.v.01'),
 Synset('come.v.01'),
 Synset('come.v.03'),
 Synset('come.v.04'),
 Synset('come.v.05'),
 Synset('come.v.06'),
 Synset('come.v.09'),
 Synset('come.v.10'),
 Synset('come.v.11'),
 Synset('come.v.13'),
 Synset('come.v.15'),
 Synset('come.v.16'),
 Synset('come.v.20'),
 Synset('come.v.21'),
 Synset('derive.v.05'),
 Synset('do.v.04'),
 Synset('fall.v.04'),
 Synset('hail.v.02'),
 Synset('issue_forth.v.01'),
 Synset('occur.v.02'),
 Synset('semen.n.01'),
 Synset('total.v.01')}

#### Algoritmo para Criação dos Schema-Sets

In [13]:
def SchSFind(Q,threshold):
    S = []
    
    for (position,keyword) in enumerate(Q):
        for (table,values) in attributeHash.items():
            
            sim = wordSimilarity(keyword,table)
            if sim >= threshold:
                S.append( (table,'*',{keyword},position,sim) )
            
            for attribute in values.keys():
                
                if(attribute=='id'):
                    continue
                
                sim = wordSimilarity(keyword,attribute)
                
                if sim >= threshold:
                    S.append( (table,attribute,{keyword},position,sim) )
    #S = SchSInter(S)

    print ('SCHEMA SETS CREATED')
    Sq = {('s',table,attribute,frozenset(keywords)) for (table,attribute,keywords,position,sim) in S}
        
    return Sq

'''
Em vez de interseções, deve ser feita uma análise dos adjacentes..

def SchSInter(S):
    
    Scurr= S.copy()
    
    somethingChanged = False

    combinations = [x for x in itertools.combinations(Scurr,2)]
    
    for ( A , B ) in combinations:    
    
        (tableA,attributeA,wordsA,positionA,simA) = A
        (tableB,attributeB,wordsB,positionB,simB) = B
        
        if A not in Scurr or B not in Scurr:
            continue
        
        if tableA == tableB and abs(positionA-positionB)<=1:
            print('A:\n',A)
            print('B:\n',B)
            
            AB = (tableA, '*' , wordsA | wordsB, max((positionA,positionB)) , max((simA,simB)) )
            
            Scurr.remove(A)
            Scurr.remove(B)
            Scurr.append(AB)
            
            somethingChanged = True 
   
    if somethingChanged:
        return SchSInter(Scurr)
    
    return Scurr
'''
''

''

In [14]:
Q = QuerySet[5] = ['actor', 'james', 'bond']
SimilarityCoeficient = 0.799999999999
Sq = SchSFind(Q,SimilarityCoeficient)
Sq

SCHEMA SETS CREATED


{('s', 'casting', 'note', frozenset({'bond'})),
 ('s', 'character', '*', frozenset({'bond'})),
 ('s', 'movie', 'title', frozenset({'bond'})),
 ('s', 'person', '*', frozenset({'actor'}))}

### Criação de Query Matches

As etapas anteriores, de criação de schema-sets e tuple-sets, foram responsáveis por identificar quais relações possuem alguma informação sobre as palavras-chave. Nesta etapa de criação de full matches, o objetivo é combinar esses tuple-sets e schema-sets para se obter uma resposta completa, mínima e relevante para o usuário. 

O algoritmo `QMGen` é responsável por encontrar combinações de tuple-sets/schema-sets que compõem uma cobertura mínima (`MinimalCover`) sobre o queryset.
- **Total**: Cada palavra-chave deve estar presente em ao menos uma das tuplas da query-match.
- **Mínima**: Não é possível remover nenhum tuple-set/schema-set da query-match e manter a cobertura total sobre o queryset.

In [15]:
def MinimalCover(MC, Q):
    #Input:  A subset MC (Match Candidate) to be checked as total and minimal cover
    #Output: If the match candidate is a TOTAL and MINIMAL cover

    Subset = [termset for category,table,attribute,termset in MC]
    u = set().union(*Subset)    
    
    isTotal = (u == set(Q))
    for element in Subset:
        
        new_u = list(Subset)
        new_u.remove(element)
        
        new_u = set().union(*new_u)
        
        if new_u == set(Q):
            return False
    
    return isTotal

In [16]:
def QMGen(Q,Rq):
    #Input:  A keyword query Q, The set of non-empty non-free tuple-sets Rq
    #Output: The set Mq of query matches for Q
    
    '''
    Query match is a set of tuple-sets that, if properly joined,
    can produce networks of tuples that fulfill the query. They
    can be thought as the leaves of a Candidate Network.
    
    '''
    
    Mq = []
    for i in range(1,len(Q)+1):
        for subset in itertools.combinations(Rq,i):
            if(MinimalCover(subset,Q)):
                Mq.append(set(subset))
    return Mq

In [17]:
Q =['actor', 'draco', 'harry','potter']

Rq = TSFind(Q)
SimilarityCoeficient = 0.799999999999
Sq = SchSFind(Q,SimilarityCoeficient)

Mq = QMGen(Q,Sq|Rq)
for match in Mq:
    pp(match)
    print('\n\n')

TUPLE SETS CREATED
SCHEMA SETS CREATED
{('s', 'person', '*', frozenset({'actor'})),
 ('v', 'character', 'name', frozenset({'draco'})),
 ('v', 'movie', 'title', frozenset({'harry', 'potter'}))}



{('v', 'character', 'name', frozenset({'actor'})),
 ('v', 'character', 'name', frozenset({'draco'})),
 ('v', 'movie', 'title', frozenset({'harry', 'potter'}))}



{('v', 'character', 'name', frozenset({'draco'})),
 ('v', 'movie', 'title', frozenset({'harry', 'potter'})),
 ('v', 'movie', 'title', frozenset({'actor'}))}



{('v', 'casting', 'note', frozenset({'actor'})),
 ('v', 'character', 'name', frozenset({'draco'})),
 ('v', 'movie', 'title', frozenset({'harry', 'potter'}))}



{('v', 'character', 'name', frozenset({'draco'})),
 ('v', 'movie', 'title', frozenset({'harry', 'potter'})),
 ('v', 'person', 'name', frozenset({'actor'}))}



{('v', 'character', 'name', frozenset({'draco'})),
 ('v', 'movie', 'title', frozenset({'harry', 'potter'})),
 ('v', 'role', 'role', frozenset({'actor'}))}



{(

### Criação de Candidate Networks

Na etapa anterior, obteve-se as full matches, que compreendem todas as informações necessárias para o usuário. O próximo passo é encontrar maneiras de conectar estas informações para formar uma resposta para o usuário. Estas conexões, chamadas de candidate networks, são derivadas das restrições de integridade referencial do banco de dados, também conhecidas como chaves estrangeiras.

A criação de candidate networks utiliza dois grafos:
- **Schema Graph**: vértice que representa o banco de dados e é utilizado como base para o match graph. Ele contém como vértices os free tuple-sets associados a cada relação do banco de dados e como arestas as restrições de integridade referencial.

    O Schema Graph foi implementado como um dicionário, no qual cada vértice aponta para um outro vértice. Além disso, também é armazenada informações sobre as arestas, como direção e quais atributos entre as tabelas tem a relação de restrição referencial. A estrutura do Schema Graph pode ser observada a seguir:
   
```python
    G['table'] = { 'foreign_table' : (direction, column, foreign_column) }
```

In [18]:
def getSchemaGraph():
    #Output: A Schema Graph G  with the structure below:
    # G['node'] = edges
    # G['table'] = { 'foreign_table' : (direction, column, foreign_column) }
    
    
    G = {} 
    cur.execute("SELECT tablename FROM pg_tables WHERE schemaname!='pg_catalog' AND schemaname !='information_schema';")
    for table in cur.fetchall():
        G.setdefault(table[0],{})
    
    sql = "SELECT DISTINCT                 tc.table_name, kcu.column_name,                 ccu.table_name AS foreign_table_name, ccu.column_name AS foreign_column_name             FROM information_schema.table_constraints AS tc              JOIN information_schema.key_column_usage AS kcu                 ON tc.constraint_name = kcu.constraint_name             JOIN information_schema.constraint_column_usage AS ccu                 ON ccu.constraint_name = tc.constraint_name             WHERE constraint_type = 'FOREIGN KEY'"
    cur.execute(sql)
    relations = cur.fetchall()
    
    for (table,column,foreign_table,foreign_column) in relations:
        G[table][foreign_table] = (1,column, foreign_column)
        G[foreign_table][table] = (-1,foreign_column,column)
    print ('SCHEMA CREATED')
    return G
G = getSchemaGraph()
G

SCHEMA CREATED


{'casting': {'movie': (1, 'movie_id', 'id'),
  'person': (1, 'person_id', 'id'),
  'character': (1, 'person_role_id', 'id'),
  'role': (1, 'role_id', 'id')},
 'character': {'casting': (-1, 'id', 'person_role_id')},
 'person': {'casting': (-1, 'id', 'person_id')},
 'role': {'casting': (-1, 'id', 'role_id')},
 'movie': {'casting': (-1, 'id', 'movie_id')}}


- **Match Graph**: grafo gerado a partir de uma query match e o schema graph. No entanto, no match graph tuple-sets/schema-sets também são modelados como vértices. Para criá-lo, adiciona-se ao schema graph os tuple-sets/schema-sets presentes na query match. Um tuple-set de uma tabela x terá os mesmos relacionamentos (arestas) que o vértice x.

```python
    Gts['table'] = { 'foreign_table' : (direction, column, foreign_column) }

    Gts[('s','table','column', frozenset({words}))] = { 'foreign_table' : (direction, column, foreign_column) }
```

In [19]:
def MatchGraph(Rq, G, M):
    #Input:  The set of non-empty non-free tuple-sets Rq,
    #        The Schema Graph G,
    #        A Query Match M
    #Output: A Schema Graph Gts  with the structure below:
    # G['node'] = edges
    # G['table'] = { 'foreign_table' : (direction, column, foreign_column) }

    '''
    A Match Subgraph Gts[M] is a subgraph of G that contains:
        The set of free tuple-sets of G
        The query match M
    '''
    
    Gts = copy.deepcopy(G)
    
    tables = set()
    #Insert non-free nodes
    for (category,table ,attribute, keywords) in M:
        Gts[(category,table,attribute,keywords)]=copy.deepcopy(Gts[table])
        for foreign_table , (direction,column,foreign_column) in Gts[(category,table,attribute,keywords)].items():
            Gts[foreign_table][(category,table,attribute,keywords)] = (direction*(-1),foreign_column,column)

    return Gts 

In [20]:
M = {('s', 'person', '*', frozenset({'actor'})),
 ('v', 'character', 'name', frozenset({'draco'})),
 ('v', 'movie', 'title', frozenset({'potter', 'harry'}))}

Gts = MatchGraph(Rq|Sq, G, M)
pp(Gts)

{'casting': {'character': (1, 'person_role_id', 'id'),
             'movie': (1, 'movie_id', 'id'),
             'person': (1, 'person_id', 'id'),
             'role': (1, 'role_id', 'id'),
             ('s', 'person', '*', frozenset({'actor'})): (1, 'person_id', 'id'),
             ('v', 'character', 'name', frozenset({'draco'})): (1,
                                                                'person_role_id',
                                                                'id'),
             ('v', 'movie', 'title', frozenset({'harry', 'potter'})): (1,
                                                                       'movie_id',
                                                                       'id')},
 'character': {'casting': (-1, 'id', 'person_role_id')},
 'movie': {'casting': (-1, 'id', 'movie_id')},
 'person': {'casting': (-1, 'id', 'person_id')},
 'role': {'casting': (-1, 'id', 'role_id')},
 ('s', 'person', '*', frozenset({'actor'})): {'casting': (-1,
             

#### Algoritmo para Criação de Candidate Networks

Para criar uma candidate network, o algoritmo `SingleCN` procura um caminho mínimo no match graph que visite todas os non-free tuple-sets/schema-sets da query match. 

Este caminho deve ser:
- **Mínimo:** garantido através do algoritmo de caminho mínima baseado em busca por largura (BFS).
- **Total:** a função `containsMatch` garante que todos os tuple-sets/schema-sets da query match sejam visitados.
- **Seguro (*Sound*):** uma joining networks of tuple-sets é considerado sound se ela não contém uma subárvore na forma $R^K - S^L - R^M $, na qual $R$ e $S$ são relações e o schema graph tem uma aresta $R \rightarrow S$.

In [21]:
def containsMatch(Ji,M):
    for relation in M:
        if relation not in Ji:
            return False
    return True

def isJNTSound(Gts,Ji):
    if len(Ji)<3:
        return True
    
    for i in range(len(Ji)-2):
        
        if type(Ji[i]) is str:
            tableA = Ji[i]
        else:
            tableA = Ji[i][0]
            
        if type(Ji[i+2]) is str:
            tableB = Ji[i+2]
        else:
            tableB = Ji[i+2][0]          
            
        if tableA==tableB:
            edge_info = Gts[Ji[i]][Ji[i+1]]
            if(edge_info[0] == -1):
                return False
    return True

In [22]:
def SingleCN(FM,Gts,Tmax):    
    '''
    print('================================================================================\nSINGLE CN')
    print('Tmax ',Tmax)
    print('FM')
    pp(FM)
    
    print('\n\nGts')
    pp(Gts)
    print('\n\n')
    '''
    F = deque()

    first_element = list(FM)[0]
    J = [first_element]
    
    if len(FM)==1:
        return J
    
    F.append(J)
    
    while F:
        J = F.popleft()           
        u = J[-1]
        '''
        print('--------------------------------------------\nParctial CN')
        print('J ',J,'\n')
        
        print('\nAdjacents:')
        pp(Gts[u].items())
        '''
        for (adjacent,edge_info) in Gts[u].items():
            if (type(adjacent) is str) or (adjacent not in J):
                Ji = J + [adjacent]
                if (Ji not in F) and (len(Ji)<Tmax) and (isJNTSound(Gts,Ji)):
                    if(containsMatch(Ji,FM)):
                        '''
                        print('--------------------------------------------\nGenerated CN')
                        print('J ',Ji,'\n')
                        '''
                        return Ji
                    else:
                        F.append(Ji)

In [23]:
SingleCN(M,Gts,10)

[('v', 'movie', 'title', frozenset({'harry', 'potter'})),
 'casting',
 ('s', 'person', '*', frozenset({'actor'})),
 'casting',
 ('v', 'character', 'name', frozenset({'draco'}))]

In [24]:
def MatchCN(G,Sq,Rq,Mq):    
    Cns = []                        
    for M in Mq:
        Gts = MatchGraph(Rq|Sq, G, M)
        Cn = SingleCN(M,Gts,10)
        if(Cn is not None):
            Cns.append( (Cn,Gts,M) )
    return Cns


Cns = MatchCN(G,Rq,Sq,Mq)
for Cn in Cns:
    pp(Cn)
    print('--------------------------------------------------------------\n')

([('v', 'movie', 'title', frozenset({'harry', 'potter'})),
  'casting',
  ('s', 'person', '*', frozenset({'actor'})),
  'casting',
  ('v', 'character', 'name', frozenset({'draco'}))],
 {'casting': {'character': (1, 'person_role_id', 'id'),
              'movie': (1, 'movie_id', 'id'),
              'person': (1, 'person_id', 'id'),
              'role': (1, 'role_id', 'id'),
              ('s', 'person', '*', frozenset({'actor'})): (1,
                                                           'person_id',
                                                           'id'),
              ('v', 'character', 'name', frozenset({'draco'})): (1,
                                                                 'person_role_id',
                                                                 'id'),
              ('v', 'movie', 'title', frozenset({'harry', 'potter'})): (1,
                                                                        'movie_id',
                                        

                                                             'id'),
                                                   'person': (1,
                                                              'person_id',
                                                              'id'),
                                                   'role': (1, 'role_id', 'id'),
                                                   ('s', 'person', '*', frozenset({'actor'})): (1,
                                                                                                'person_id',
                                                                                                'id'),
                                                   ('v', 'character', 'name', frozenset({'draco'})): (1,
                                                                                                      'person_role_id',
                                                                                                      'id'

  ('v', 'casting', 'note', frozenset({'potter'})): {'character': (1,
                                                                  'person_role_id',
                                                                  'id'),
                                                    'movie': (1,
                                                              'movie_id',
                                                              'id'),
                                                    'person': (1,
                                                               'person_id',
                                                               'id'),
                                                    'role': (1,
                                                             'role_id',
                                                             'id'),
                                                    ('v', 'character', 'name', frozenset({'draco'})): (1,
                                            

  'role': {'casting': (-1, 'id', 'role_id')},
  ('s', 'person', '*', frozenset({'actor'})): {'casting': (-1,
                                                           'id',
                                                           'person_id')},
  ('v', 'character', 'name', frozenset({'harry'})): {'casting': (-1,
                                                                 'id',
                                                                 'person_role_id')},
  ('v', 'character', 'name', frozenset({'draco'})): {'casting': (-1,
                                                                 'id',
                                                                 'person_role_id')},
  ('v', 'person', 'name', frozenset({'potter'})): {'casting': (-1,
                                                               'id',
                                                               'person_id')}},
 {('s', 'person', '*', frozenset({'actor'})),
  ('v', 'character', 'name', frozenset({'

             ('v', 'casting', 'note', frozenset({'actor'})): (-1,
                                                              'id',
                                                              'person_id')},
  'role': {'casting': (-1, 'id', 'role_id'),
           ('v', 'casting', 'note', frozenset({'potter'})): (-1,
                                                             'id',
                                                             'role_id'),
           ('v', 'casting', 'note', frozenset({'actor'})): (-1,
                                                            'id',
                                                            'role_id')},
  ('v', 'casting', 'note', frozenset({'potter'})): {'character': (1,
                                                                  'person_role_id',
                                                                  'id'),
                                                    'movie': (1,
                                             

                                                             'movie_id',
                                                             'id'),
                                                   'person': (1,
                                                              'person_id',
                                                              'id'),
                                                   'role': (1, 'role_id', 'id'),
                                                   ('v', 'character', 'name', frozenset({'draco'})): (1,
                                                                                                      'person_role_id',
                                                                                                      'id'),
                                                   ('v', 'movie', 'title', frozenset({'potter'})): (1,
                                                                                                    'movie_id',
                    

### Ranking de Candidate Networks

Como existem diferentes maneiras de se conectar as informações associadas as palavras-chave, várias candidate networks serão geradas. Entretanto, na maioria das vezes, apenas uma delas contém uma resposta relevante para o usuário. Por este motivo, esta esta etapa irá avaliar e ranquear as candidate networks por relevância.

In [25]:
def CNRank(Cns,mi,smi):
    Ranking = []
    for (Cn,Gts,M) in Cns:
        cosprod = 1
        valuecont = 0

        schemaprod = 1
        schemacont = 0
        
        for relation in Cn:
            if(type(relation) is str):
                continue
            
            (category,table,attribute,predicates) = relation
            
            if (category == 'v'):
                
                valuecont+=1
                
                if predicates == frozenset(['']):
                    continue

                (norm_attribute,distinct_terms) = attributeHash[table][attribute]

                wsum = 0

                for term in predicates:

                    IAF = wordHash[term][0]

                    ctids = wordHash[term][1][table][attribute]
                    fkj = len(ctids)

                    if fkj>0:

                        TF = log1p(fkj) / log1p(distinct_terms)

                        wsum = wsum + TF*IAF
                
                cos = wsum/norm_attribute
                cosprod *= cos
            elif (category == 's'):
                
                schemacont+=1
                
                if(attribute == '*'):
                    schemaTerm = table
                else:
                    schemaTerm = attribute
                
                schemasum = 0
                
                for term in predicates:
                    schemasum+=wordSimilarity(term, schemaTerm)
                
                schemaprod *= schemasum
                
            
        
        valuescore = schemascore = 0
        
        score = 1/len(Cn)
        
        if valuecont>0:
            valuescore = mi * cosprod 
            score*=valuescore
        
        if schemacont>0:
            schemascore = smi * schemaprod
            score*=schemascore
            
        Ranking.append((Cn,Gts,M,score , ( valuescore , schemascore , len(Cn) )  ))
    return sorted(Ranking,key=lambda x: x[3],reverse=True)

In [26]:
mi = 464576.1086662768
smi = 1

SortedCn = CNRank(Cns,mi,smi)
SortedCn

for Cn in SortedCn:
    print('Value Score: ',"%.8f" % Cn[4][0],'\nSchema Score:',"%.8f" % Cn[4][1], '\n|Cn|: ',Cn[4][2],'\nTotal Score: ',"%.8f" % Cn[3])
    pp(Cn[0])
    print('----------------------------------------------------------------------\n')

Value Score:  0.00000012 
Schema Score: 0.80000000 
|Cn|:  3 
Total Score:  0.00000003
[('v', 'character', 'name', frozenset({'draco'})),
 ('v', 'casting', 'note', frozenset({'harry', 'potter'})),
 ('s', 'person', '*', frozenset({'actor'}))]
----------------------------------------------------------------------

Value Score:  0.00000018 
Schema Score: 0.80000000 
|Cn|:  5 
Total Score:  0.00000003
[('v', 'movie', 'title', frozenset({'harry', 'potter'})),
 'casting',
 ('s', 'person', '*', frozenset({'actor'})),
 'casting',
 ('v', 'character', 'name', frozenset({'draco'}))]
----------------------------------------------------------------------

Value Score:  0.00000000 
Schema Score: 0.00000000 
|Cn|:  5 
Total Score:  0.00000000
[('v', 'role', 'role', frozenset({'actor'})),
 'casting',
 'movie',
 ('v', 'casting', 'note', frozenset({'harry', 'potter'})),
 ('v', 'character', 'name', frozenset({'draco'}))]
----------------------------------------------------------------------

Value Score:

### Mais abaixo tem a execução para outras CNS (querysets)

# Execução

In [27]:
def preProcessing():
    global wordHash
    global attributeHash
    (wordHash,attributeHash) = createInvertedIndex()
    processIAF(wordHash,attributeHash)
    processNormsOfAttributes(wordHash,attributeHash)
    print('PRE-PROCESSING STAGE FINISHED')

In [28]:
def main(mi,smi):   
    QuerySets = getQuerySets()
    maxscores = (list(),list())
    for (i,Q) in enumerate(QuerySets):
       
        print('QUERY-SET ',Q,'\n')
        
        print('FINDING TUPLE-SETS')
        Rq = TSFind(Q)
        print(len(Rq),'TUPLE-SETS CREATED\n')
        
        print('FINDING SCHEMA-SETS')
        SimilarityThreshold = 0.799999999999
        Sq = SchSFind(Q,SimilarityThreshold)
        print(len(Sq),' SCHEMA-SETS CREATED\n')
        
        print('GENERATING QUERY MATCHES')
        Mq = QMGen(Q,Sq|Rq)
        print (len(Mq),'QUERY MATCHES CREATED\n')
        '''
        for M in Mq[:20]:
            pp(M)
            print('\n\n')
        '''
        print('GENERATING CANDIDATE NETWORKS')
        G = getSchemaGraph()
        
        Cns = MatchCN(G,Rq,Sq,Mq)
        
        print (len(Cns),'CANDIDATE NETWORKS CREATED\n')
        
        '''
        for Cn in Cns[:20]:
            pp(Cn[0])
            print('\n\n')
            #pp(Cn[1])
            #print('\n\n\n==================================================================================\n')
        '''
        print('RANKING CANDIDATE NETWORKS')
        RankedCns = CNRank(Cns,mi,smi)
        for (j,Cn) in enumerate(RankedCns):
            print(j+1,'ª CN')
            print('Value Score: ',"%.8f" % Cn[4][0],'\nSchema Score:',"%.8f" % Cn[4][1], '\n|Cn|: ',Cn[4][2],'\nTotal Score: ',"%.8f" % Cn[3])
            pp(Cn[0])
            print('----------------------------------------------------------------------\n')
        
            maxscores[0].append(Cn[4][0])
            maxscores[1].append(Cn[4][1])
        gc.collect()
        
        print('==========================================================================\
==========================================================================\
==========================================================================\
==========================================================================\
==========================================================================\
==========================================================================')
    return maxscores

In [29]:
mi = 0.90/1.9372498568291752e-06
mi

464576.1086662768

In [30]:
mi = 464576.1086662768
smi = 1
maxscores = main(mi,smi)
maxscores

QUERY-SET  ['will', 'smith'] 

FINDING TUPLE-SETS
TUPLE SETS CREATED
12 TUPLE-SETS CREATED

FINDING SCHEMA-SETS
SCHEMA SETS CREATED
1  SCHEMA-SETS CREATED

GENERATING QUERY MATCHES
24 QUERY MATCHES CREATED

GENERATING CANDIDATE NETWORKS
SCHEMA CREATED
24 CANDIDATE NETWORKS CREATED

RANKING CANDIDATE NETWORKS
1 ª CN
Value Score:  0.90004087 
Schema Score: 0.00000000 
|Cn|:  1 
Total Score:  0.90004087
[('v', 'character', 'name', frozenset({'smith', 'will'}))]
----------------------------------------------------------------------

2 ª CN
Value Score:  0.88388259 
Schema Score: 0.00000000 
|Cn|:  1 
Total Score:  0.88388259
[('v', 'person', 'name', frozenset({'smith', 'will'}))]
----------------------------------------------------------------------

3 ª CN
Value Score:  0.82548012 
Schema Score: 0.00000000 
|Cn|:  1 
Total Score:  0.82548012
[('v', 'movie', 'title', frozenset({'smith', 'will'}))]
----------------------------------------------------------------------

4 ª CN
Value Score:  

SCHEMA SETS CREATED
0  SCHEMA-SETS CREATED

GENERATING QUERY MATCHES
18 QUERY MATCHES CREATED

GENERATING CANDIDATE NETWORKS
SCHEMA CREATED
18 CANDIDATE NETWORKS CREATED

RANKING CANDIDATE NETWORKS
1 ª CN
Value Score:  0.50758332 
Schema Score: 0.00000000 
|Cn|:  1 
Total Score:  0.50758332
[('v', 'character', 'name', frozenset({'gump', 'forrest'}))]
----------------------------------------------------------------------

2 ª CN
Value Score:  0.23511653 
Schema Score: 0.00000000 
|Cn|:  1 
Total Score:  0.23511653
[('v', 'movie', 'title', frozenset({'gump', 'forrest'}))]
----------------------------------------------------------------------

3 ª CN
Value Score:  0.00000010 
Schema Score: 0.00000000 
|Cn|:  2 
Total Score:  0.00000005
[('v', 'character', 'name', frozenset({'gump'})),
 ('v', 'casting', 'note', frozenset({'forrest'}))]
----------------------------------------------------------------------

4 ª CN
Value Score:  0.00000007 
Schema Score: 0.00000000 
|Cn|:  2 
Total Score:  0

QUERY-SET  ['movie', 'Ellen', 'Page', 'thriller'] 

FINDING TUPLE-SETS
TUPLE SETS CREATED
5 TUPLE-SETS CREATED

FINDING SCHEMA-SETS
SCHEMA SETS CREATED
1  SCHEMA-SETS CREATED

GENERATING QUERY MATCHES
0 QUERY MATCHES CREATED

GENERATING CANDIDATE NETWORKS
SCHEMA CREATED
0 CANDIDATE NETWORKS CREATED

RANKING CANDIDATE NETWORKS
QUERY-SET  ['movie', 'Terry', 'Gilliam', 'Benicio', 'del', 'Toro', 'Dr', 'gonzo'] 

FINDING TUPLE-SETS
TUPLE SETS CREATED
10 TUPLE-SETS CREATED

FINDING SCHEMA-SETS
SCHEMA SETS CREATED
1  SCHEMA-SETS CREATED

GENERATING QUERY MATCHES
0 QUERY MATCHES CREATED

GENERATING CANDIDATE NETWORKS
SCHEMA CREATED
0 CANDIDATE NETWORKS CREATED

RANKING CANDIDATE NETWORKS
QUERY-SET  ['director', 'artificial', 'intelligent', 'Haley', 'Joel', 'Osment'] 

FINDING TUPLE-SETS
TUPLE SETS CREATED
7 TUPLE-SETS CREATED

FINDING SCHEMA-SETS
SCHEMA SETS CREATED
0  SCHEMA-SETS CREATED

GENERATING QUERY MATCHES
0 QUERY MATCHES CREATED

GENERATING CANDIDATE NETWORKS
SCHEMA CREATED
0 CANDIDAT

10 TUPLE-SETS CREATED

FINDING SCHEMA-SETS
SCHEMA SETS CREATED
0  SCHEMA-SETS CREATED

GENERATING QUERY MATCHES
18 QUERY MATCHES CREATED

GENERATING CANDIDATE NETWORKS
SCHEMA CREATED
18 CANDIDATE NETWORKS CREATED

RANKING CANDIDATE NETWORKS
1 ª CN
Value Score:  0.50758332 
Schema Score: 0.00000000 
|Cn|:  1 
Total Score:  0.50758332
[('v', 'character', 'name', frozenset({'gump', 'forrest'}))]
----------------------------------------------------------------------

2 ª CN
Value Score:  0.23511653 
Schema Score: 0.00000000 
|Cn|:  1 
Total Score:  0.23511653
[('v', 'movie', 'title', frozenset({'gump', 'forrest'}))]
----------------------------------------------------------------------

3 ª CN
Value Score:  0.00000010 
Schema Score: 0.00000000 
|Cn|:  2 
Total Score:  0.00000005
[('v', 'character', 'name', frozenset({'gump'})),
 ('v', 'casting', 'note', frozenset({'forrest'}))]
----------------------------------------------------------------------

4 ª CN
Value Score:  0.00000007 
Schema S

TUPLE SETS CREATED
13 TUPLE-SETS CREATED

FINDING SCHEMA-SETS
SCHEMA SETS CREATED
0  SCHEMA-SETS CREATED

GENERATING QUERY MATCHES
54 QUERY MATCHES CREATED

GENERATING CANDIDATE NETWORKS
SCHEMA CREATED
54 CANDIDATE NETWORKS CREATED

RANKING CANDIDATE NETWORKS
1 ª CN
Value Score:  0.00000027 
Schema Score: 0.00000000 
|Cn|:  2 
Total Score:  0.00000013
[('v', 'casting', 'note', frozenset({'crowe', 'russell'})),
 ('v', 'movie', 'title', frozenset({'gladiator'}))]
----------------------------------------------------------------------

2 ª CN
Value Score:  0.00000025 
Schema Score: 0.00000000 
|Cn|:  2 
Total Score:  0.00000012
[('v', 'casting', 'note', frozenset({'crowe', 'russell'})),
 ('v', 'character', 'name', frozenset({'gladiator'}))]
----------------------------------------------------------------------

3 ª CN
Value Score:  0.00000043 
Schema Score: 0.00000000 
|Cn|:  5 
Total Score:  0.00000009
[('v', 'movie', 'title', frozenset({'gladiator'})),
 'casting',
 'movie',
 'casting',
 

SCHEMA CREATED
7 CANDIDATE NETWORKS CREATED

RANKING CANDIDATE NETWORKS
1 ª CN
Value Score:  0.45226215 
Schema Score: 0.00000000 
|Cn|:  1 
Total Score:  0.45226215
[('v', 'character', 'name', frozenset({'darth', 'vader'}))]
----------------------------------------------------------------------

2 ª CN
Value Score:  0.00000011 
Schema Score: 0.00000000 
|Cn|:  5 
Total Score:  0.00000002
[('v', 'character', 'name', frozenset({'darth'})),
 'casting',
 'movie',
 'casting',
 ('v', 'character', 'name', frozenset({'vader'}))]
----------------------------------------------------------------------

3 ª CN
Value Score:  0.00000009 
Schema Score: 0.00000000 
|Cn|:  5 
Total Score:  0.00000002
[('v', 'character', 'name', frozenset({'darth'})),
 'casting',
 'movie',
 'casting',
 ('v', 'movie', 'title', frozenset({'vader'}))]
----------------------------------------------------------------------

4 ª CN
Value Score:  0.00000002 
Schema Score: 0.00000000 
|Cn|:  2 
Total Score:  0.00000001
[('v', 

SCHEMA SETS CREATED
0  SCHEMA-SETS CREATED

GENERATING QUERY MATCHES
9 QUERY MATCHES CREATED

GENERATING CANDIDATE NETWORKS
SCHEMA CREATED
9 CANDIDATE NETWORKS CREATED

RANKING CANDIDATE NETWORKS
1 ª CN
Value Score:  0.41060044 
Schema Score: 0.00000000 
|Cn|:  1 
Total Score:  0.41060044
[('v', 'character', 'name', frozenset({'finch', 'atticus'}))]
----------------------------------------------------------------------

2 ª CN
Value Score:  0.00000008 
Schema Score: 0.00000000 
|Cn|:  2 
Total Score:  0.00000004
[('v', 'person', 'name', frozenset({'atticus'})),
 ('v', 'casting', 'note', frozenset({'finch'}))]
----------------------------------------------------------------------

3 ª CN
Value Score:  0.00000015 
Schema Score: 0.00000000 
|Cn|:  5 
Total Score:  0.00000003
[('v', 'person', 'name', frozenset({'atticus'})),
 'casting',
 'movie',
 'casting',
 ('v', 'person', 'name', frozenset({'finch'}))]
----------------------------------------------------------------------

4 ª CN
Value 

([0.9000408681151703,
  0.8838825931151858,
  0.8254801175644624,
  0.5969858199439341,
  0.3915028453185758,
  0.5460286266504614,
  0.5038087186085582,
  0.3845481210644917,
  3.7157771999365327e-07,
  3.3390872032545067e-07,
  2.8471285265364694e-07,
  2.4150959195220537e-07,
  2.2283563922190843e-07,
  5.182390742957327e-07,
  4.781679040446292e-07,
  4.6570215814793726e-07,
  4.296932360287161e-07,
  1.700864300356993e-07,
  3.970887307286442e-07,
  3.663851212884536e-07,
  3.649769491118624e-07,
  3.2797710806856413e-07,
  1.73162518970903e-07,
  2.7965516429843574e-07,
  0.8704237934349187,
  0.5400611396845142,
  0,
  0,
  0,
  0,
  0.32782188051941474,
  0.32782188051941474,
  0.4776658800181151,
  0.4776658800181151,
  0.39275791341680366,
  0.39275791341680366,
  0.3708787715128575,
  0.3708787715128575,
  0.21223925916509942,
  0.21223925916509942,
  0.31176834236730705,
  0.31176834236730705,
  0.23922333830826092,
  0.23922333830826092,
  0.18059784379647015,
  0.18059784

### Observando os maxscores que eu atribui valores a constante mi para normalizar o resultado

In [31]:
max(maxscores[0]), max(maxscores[1])

(0.9000408681151703, 0.875)

In [32]:
def detailedInfo(Q):   
    print('QUERY-SET ',Q,'\n')

    print('FINDING TUPLE-SETS')
    Rq = TSFind(Q)
    print(len(Rq),'TUPLE-SETS CREATED\n')

    print('FINDING SCHEMA-SETS')
    SimilarityThreshold = 0.799999999999
    Sq = SchSFind(Q,SimilarityThreshold)
    print(len(Sq),' SCHEMA-SETS CREATED\n')

    print('GENERATING QUERY MATCHES')
    Mq = QMGen(Q,Sq|Rq)
    print (len(Mq),'QUERY MATCHES CREATED\n')
    
    for M in Mq[:20]:
        pp(M)
        print('\n\n')
    
    print('GENERATING CANDIDATE NETWORKS')
    G = getSchemaGraph()

    Cns = MatchCN(G,Rq,Sq,Mq)

    print (len(Cns),'CANDIDATE NETWORKS CREATED\n')

    '''
    for Cn in Cns[:20]:
        pp(Cn[0])
        print('\n\n')
        #pp(Cn[1])
        #print('\n\n\n==================================================================================\n')
    '''
    print('RANKING CANDIDATE NETWORKS')
    mi = 2.7E+12
    smi = 1
    RankedCns = CNRank(Cns,mi,smi)
    for (j,Cn) in enumerate(RankedCns):
        print(j+1,'ª CN')
        print('Value Score: ',"%.8f" % Cn[4][0],'\nSchema Score:',"%.8f" % Cn[4][1], '\n|Cn|: ',Cn[4][2],'\nTotal Score: ',"%.8f" % Cn[3])
        pp(Cn[0])
        print('----------------------------------------------------------------------\n')

    gc.collect()

    print('==========================================================================\
==========================================================================\
==========================================================================\
==========================================================================\
==========================================================================\
==========================================================================')

In [33]:
detailedInfo(['ellen','page'])

QUERY-SET  ['ellen', 'page'] 

FINDING TUPLE-SETS
TUPLE SETS CREATED
9 TUPLE-SETS CREATED

FINDING SCHEMA-SETS
SCHEMA SETS CREATED
0  SCHEMA-SETS CREATED

GENERATING QUERY MATCHES
17 QUERY MATCHES CREATED

{('v', 'person', 'name', frozenset({'page', 'ellen'}))}



{('v', 'movie', 'title', frozenset({'page'})),
 ('v', 'person', 'name', frozenset({'ellen'}))}



{('v', 'character', 'name', frozenset({'page'})),
 ('v', 'person', 'name', frozenset({'ellen'}))}



{('v', 'person', 'name', frozenset({'ellen'})),
 ('v', 'person', 'name', frozenset({'page'}))}



{('v', 'casting', 'note', frozenset({'page'})),
 ('v', 'person', 'name', frozenset({'ellen'}))}



{('v', 'character', 'name', frozenset({'ellen'})),
 ('v', 'movie', 'title', frozenset({'page'}))}



{('v', 'character', 'name', frozenset({'page'})),
 ('v', 'character', 'name', frozenset({'ellen'}))}



{('v', 'character', 'name', frozenset({'ellen'})),
 ('v', 'person', 'name', frozenset({'page'}))}



{('v', 'casting', 'note', frozens

## Todos os mapeamentos de elementos para esquema

In [34]:
for Q in getQuerySets():
    Sq = SchSFind(Q,0.8)
    print(Q)
    pp(Sq)
    print('\n\n')
    

SCHEMA SETS CREATED
['will', 'smith']
{('s', 'movie', 'title', frozenset({'will'}))}



SCHEMA SETS CREATED
['sound', 'music']
{('s', 'character', 'name', frozenset({'sound'})),
 ('s', 'person', 'name', frozenset({'sound'})),
 ('s', 'role', '*', frozenset({'music'})),
 ('s', 'role', 'role', frozenset({'music'}))}



SCHEMA SETS CREATED
['forrest', 'gump']
set()



SCHEMA SETS CREATED
['casablanca']
set()



SCHEMA SETS CREATED
['best', 'movie', 'award', 'James', 'Cameron']
{('s', 'movie', '*', frozenset({'movie'})),
 ('s', 'person', '*', frozenset({'best'})),
 ('s', 'role', '*', frozenset({'best'})),
 ('s', 'role', 'role', frozenset({'best'}))}



SCHEMA SETS CREATED
['actor', 'James', 'Bond']
{('s', 'casting', 'note', frozenset({'Bond'})),
 ('s', 'character', '*', frozenset({'Bond'})),
 ('s', 'movie', 'title', frozenset({'Bond'})),
 ('s', 'person', '*', frozenset({'actor'}))}



SCHEMA SETS CREATED
['movie', 'Ellen', 'Page', 'thriller']
{('s', 'movie', '*', frozenset({'movie'}))}



S

In [35]:
wordHash['actor']

(2.151762203259462,
 {'casting': {'note': ['(221,124)',
    '(277,47)',
    '(381,38)',
    '(565,94)',
    '(568,64)',
    '(758,99)',
    '(985,84)',
    '(995,99)',
    '(1052,68)',
    '(1061,100)',
    '(1151,8)',
    '(1339,76)',
    '(1437,48)',
    '(1440,67)',
    '(1535,20)',
    '(1571,114)',
    '(1828,29)',
    '(1862,115)',
    '(1873,65)',
    '(2293,60)',
    '(2494,42)',
    '(2777,28)',
    '(3089,42)',
    '(3299,125)',
    '(3350,22)',
    '(3633,42)',
    '(3821,30)',
    '(3828,68)',
    '(3919,30)',
    '(3942,49)',
    '(4016,83)',
    '(4135,55)',
    '(4211,100)',
    '(4400,42)',
    '(4401,47)',
    '(4597,42)',
    '(4653,37)',
    '(4894,8)',
    '(4928,118)',
    '(4940,36)',
    '(5083,97)',
    '(5185,63)',
    '(5278,37)',
    '(5289,18)',
    '(5483,31)',
    '(5490,77)',
    '(5602,95)',
    '(5682,48)',
    '(5702,36)',
    '(5782,39)']},
  'role': {'role': ['(0,1)']},
  'person': {'name': ['(1728,31)']},
  'movie': {'title': ['(3,95)',
    '(20,83)