<img src="./images/qinsti.png" align="left" alt="drawing" width="100"/>
<br><br>
<div align="left">
    <h2>Intelligent Tagging - Demo</h2>
</div>



### Obtain the API Key
- via registering at https://permid.org/

### Documentation
- https://developers.refinitiv.com/open-permid/intelligent-tagging-restful-api/docs


In [44]:
with open("my-permid-api-key.txt",'r') as f:
    # Strip the trailing newline so it does not end up in the request header
    api_key = f.readline().strip()
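
Reading the key can go subtly wrong if the file starts with a blank line or is empty. A minimal sketch of a more defensive loader follows; the helper name `load_api_key` is hypothetical and not part of the notebook or of the PermID API.

```python
import os
import tempfile

def load_api_key(path):
    """Return the first non-empty line of the key file, stripped of whitespace."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            key = line.strip()
            if key:
                return key
    raise ValueError("No API key found in %s" % path)

# Usage example: a temporary file stands in for my-permid-api-key.txt
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("\n  dummy-key-123  \n")
demo_key = load_api_key(tmp.name)
os.remove(tmp.name)
print(repr(demo_key))
```

Raising on an empty file surfaces the problem at load time rather than as a failed request later.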

In [45]:
import requests
import os
import sys
calais_url = "https://api-eit.refinitiv.com/permid/calais"

## Tagging function


In [46]:
def tagit(input_file, output_dir,access_token,
          content_type = "text/raw",
          output_format = "xml/rdf",
          content_class = "news",
          omit_original = "true"):
    try:
        if not os.path.exists(input_file):
            print('The file [%s] does not exist' % input_file)
            return
        if not os.path.exists(output_dir):
            os.makedirs(output_dir)
        headers = {"X-AG-Access-Token" : access_token, 
                   "Content-Type" : content_type,
                   "outputformat" : output_format,
                   "x-calais-contentClass": content_class,
                   "omitOutputtingOriginalText": omit_original
        }
        sendFile(input_file, headers, output_dir,output_format)
    except Exception as e:
        print ('Error in connect ' , e)

        

## Call Cloud TRIT for tagging

In [47]:
#Invoke the API
def sendFile(file_name, headers, output_dir,output_format):
    print("Tagging :" ,file_name)
    with open(file_name, 'rb') as input_data:
        response = requests.post(calais_url,
                                 data=input_data,
                                 headers=headers,
                                 timeout=80)
        print ('status code: %s' % response.status_code)
    content = response.text
    if response.status_code == 200:
        saveFile(file_name, output_dir, content, output_format)
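
`sendFile` skips any response other than 200 without saving or retrying. One possible way to cope with transient failures is to retry a few times. The sketch below is an assumption, not something the notebook or the API documentation prescribes: it retries on every non-200 status, and `send` is a hypothetical callable returning a `(status_code, text)` pair, for example a thin wrapper around the `requests.post` call above.

```python
import time

def send_with_retry(send, attempts=3, delay_seconds=0.0):
    """Call send() until it returns status 200 or the attempts run out.

    Returns the last (status_code, text) pair seen.
    """
    status, text = None, ""
    for attempt in range(1, attempts + 1):
        status, text = send()
        if status == 200:
            break
        print("attempt %d failed with status %s" % (attempt, status))
        if attempt < attempts:
            time.sleep(delay_seconds)
    return status, text

# Usage example with a fake sender that fails twice, then succeeds
_responses = iter([(500, ""), (503, ""), (200, "<rdf/>")])
result = send_with_retry(lambda: next(_responses), attempts=3)
print(result)
```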


## Save the Tagged Output

In [48]:
#Save the Output
def saveFile(file_name, output_dir, content, output_format):
    # Map the requested output format to a file extension
    output_extension = ""
    if output_format=="xml/rdf" :
        output_extension = ".xml"
    if output_format=="application/json" :
        output_extension = ".json"
    if output_format=="text/n3" :
        output_extension = ".n3"
    # splitext keeps dots inside the base name ("my.article.txt" -> "my.article")
    output_file_name = os.path.splitext(os.path.basename(file_name))[0]
    output_file_name = output_file_name + output_extension
    # The context manager closes the file even if the write fails
    with open(os.path.join(output_dir, output_file_name), 'wb') as output_file:
        output_file.write(content.encode('utf-8'))
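
The format-to-extension mapping in `saveFile` can also be written as a lookup table, which makes the output name easy to check without touching the disk. A minimal sketch; `EXTENSIONS` and `output_name` are hypothetical names, and only the three formats handled above are assumed.

```python
import os

# Output formats handled by saveFile, mapped to file extensions
EXTENSIONS = {"xml/rdf": ".xml", "application/json": ".json", "text/n3": ".n3"}

def output_name(file_name, output_format):
    """Build the output file name; unknown formats get no extension."""
    base = os.path.splitext(os.path.basename(file_name))[0]
    return base + EXTENSIONS.get(output_format, "")

print(output_name("./data/tagging/article.txt", "xml/rdf"))  # article.xml
```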



## Tag a single document(txt)

In [49]:
input_dir    = "./data/tagging/"
output_dir    = input_dir
temp_file     = input_dir+ "article.txt"
tagit(temp_file, output_dir, api_key,
              content_type = "text/raw",
              output_format = "xml/rdf",
              content_class = "news",
              omit_original = "true")           


Tagging : ./data/tagging/article.txt
status code: 200


## Tag a set of files from a specific folder

Specify the input directory, the output directory, and the API key.

In [50]:
input_dir    = "./data/tagging/batch/input/"
output_dir    = "./data/tagging/batch/output/"

def tag_docs():
    # os.walk also visits sub-folders, so build each path from the folder being visited
    for _path,_dir,names in os.walk(input_dir):
        files       = [os.path.join(_path, name) for name in names]
        n           = len(files)
        for i in range(n):
            print(files[i], "processing ", i+1, " out of ", n,  "files")
            tagit(files[i], output_dir, api_key,
              content_type = "text/raw",
              output_format = "xml/rdf",
              content_class = "news",
              omit_original = "true")

    return None
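
The directory traversal can be tried on its own before any API call is made. The sketch below lists the files a batch run would pick up; `list_input_files` and the `.txt` filter are assumptions added for illustration, since the notebook itself sends every file it finds.

```python
import os
import tempfile

def list_input_files(root, extension=".txt"):
    """Return the sorted paths of all files under root with the given extension."""
    found = []
    for folder, _subdirs, names in os.walk(root):
        for name in names:
            if name.lower().endswith(extension):
                found.append(os.path.join(folder, name))
    return sorted(found)

# Usage example on a throwaway directory tree
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "nested"))
    for rel in ("article1.txt", "notes.md", os.path.join("nested", "article2.txt")):
        with open(os.path.join(root, rel), "w") as f:
            f.write("sample text")
    demo_files = [os.path.relpath(p, root) for p in list_input_files(root)]
print(demo_files)
```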



## Tag the Documents

In [51]:
tag_docs()

./data/tagging/batch/input/article1.txt processing  1  out of  3 files
Tagging : ./data/tagging/batch/input/article1.txt
status code: 200
./data/tagging/batch/input/article2.txt processing  2  out of  3 files
Tagging : ./data/tagging/batch/input/article2.txt
status code: 200
./data/tagging/batch/input/article3.txt processing  3  out of  3 files
Tagging : ./data/tagging/batch/input/article3.txt
status code: 200


## Analyzing output - Single Document

We will use **rdflib** and **CalaisModel** to analyze the output. Let us import all the relevant modules

* Download http://www.opencalais.com/calaismodel-abstraction-layer/
* Install for Anaconda
* Install rdflib: *conda install -c conda-forge rdflib* 
* Methods in abstraction layer 
  - getCalaisObjectById
  - getCalaisObjectByType 
  - getAllTypes
  - getAllCalaisObjects


In [52]:
import numpy as np


In [53]:
from calaisModel import CalaisModel
from collections import defaultdict
from pprint import pprint
import pdb
import rdflib
output_dir    = "./data/tagging/"

temp_output  = output_dir+ "article.xml"


### How many triples ?

In [54]:
g  = rdflib.Graph()
g.parse(temp_output)
print(f"The number of elements in the graphs is {len(g)}")


The number of elements in the graphs is 1487


### Which object types ?

In [55]:
cm = CalaisModel(temp_output)
bj = cm.getAllCalaisObjects()

# Get all object types
for object_type in cm.getAllTypes():
    print(object_type)


http://s.opencalais.com/1/type/sys/InstanceInfo
http://s.opencalais.com/1/type/em/e/Company
http://s.opencalais.com/1/type/tag/Confidence
http://s.opencalais.com/1/type/sys/RelevanceInfo
http://s.opencalais.com/1/type/em/r/CompanyInvestment
http://s.opencalais.com/1/type/em/e/Person
http://s.opencalais.com/1/type/tag/SocialTag
http://s.opencalais.com/1/type/sys/DocInfo
http://s.opencalais.com/1/type/er/Company
http://s.opencalais.com/1/type/em/e/ProvinceOrState
http://s.opencalais.com/1/type/er/Geo/ProvinceOrState
http://s.opencalais.com/1/type/em/e/IndustryTerm
http://s.opencalais.com/1/type/em/r/Acquisition
http://s.opencalais.com/1/type/em/r/Alliance
http://s.opencalais.com/1/type/em/r/CompanyTechnology
http://s.opencalais.com/1/type/er/Geo/Country
http://s.opencalais.com/1/type/em/e/Technology
http://s.opencalais.com/1/type/em/e/Country
http://s.opencalais.com/1/type/em/e/City
http://s.opencalais.com/1/type/lid/DefaultLangId
http://s.opencalais.com/1/type/em/r/BusinessRelation
http
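
The type URIs share a common prefix followed by a category path and a type name, so grouping them by that path gives a quicker overview than the flat list. A minimal sketch using a few URIs copied from the output above; `group_types` is a hypothetical helper, and the split on `/type/` is an assumption based only on the URIs shown.

```python
from collections import defaultdict

def group_types(type_uris):
    """Group type URIs by the path between '/type/' and the final segment."""
    groups = defaultdict(list)
    for uri in type_uris:
        _, _, tail = uri.partition("/type/")
        category, _, name = tail.rpartition("/")
        groups[category].append(name)
    return dict(groups)

sample = [
    "http://s.opencalais.com/1/type/em/e/Company",
    "http://s.opencalais.com/1/type/em/e/Person",
    "http://s.opencalais.com/1/type/em/r/Acquisition",
    "http://s.opencalais.com/1/type/er/Company",
    "http://s.opencalais.com/1/type/er/Geo/Country",
]
grouped = group_types(sample)
for category, names in sorted(grouped.items()):
    print(category, names)
```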

### How many objects ?

In [39]:
obj = cm.getAllCalaisObjects()
print(len(cm.getAllCalaisObjects()))

206


In [56]:
id = "http://d.opencalais.com/comphash-1/9669602a-43aa-34e4-814b-a6ee6100e216"
temp = cm.getCalaisObjectById(id)
print(list(temp.getLiterals()))
print(list(temp.getReferences()))
print(temp.getObjectId())
print(temp.getType())
print(list(temp.getLiterals()))
print(list(temp.getReferences()))
print(list(temp.getExternalURIs()))


['http://s.opencalais.com/1/pred/recognizedas', 'http://s.opencalais.com/1/pred/name', 'http://s.opencalais.com/1/pred/confidencelevel', 'http://s.opencalais.com/1/pred/nationality', 'http://s.opencalais.com/1/pred/forenduserdisplay']
[]
http://d.opencalais.com/comphash-1/9669602a-43aa-34e4-814b-a6ee6100e216
http://s.opencalais.com/1/type/em/e/Company
['http://s.opencalais.com/1/pred/recognizedas', 'http://s.opencalais.com/1/pred/name', 'http://s.opencalais.com/1/pred/confidencelevel', 'http://s.opencalais.com/1/pred/nationality', 'http://s.opencalais.com/1/pred/forenduserdisplay']
[]
[]


### Get all the companies in the document

In [57]:
#Extract all companies 
companies = cm.getCalaisObjectByType("http://s.opencalais.com/1/type/er/Company")

# Print the name and PermID of each resolved company
for company in companies:
    temp= company.getLiterals()
    print(temp['http://s.opencalais.com/1/pred/name'],temp['http://s.opencalais.com/1/pred/permid'])


['GOLDEN GATE CAPITAL, INC.'] ['5000046422']
['EMC CORPORATION'] ['4295903890']
['SECUREWORKS CORP.'] ['5048024231']
['SonicWALL B.V.'] ['5000690970']
['PEROT SYSTEMS CORPORATION'] ['5000067065']
['VMWARE, INC.'] ['4295907347']
['NTT DATA CORPORATION'] ['4295877060']
['DELL INC.'] ['4295906157']
['QUEST SOFTWARE INC.'] ['4295914668']
['FRANCISCO PARTNERS, L.P.'] ['4296392467']
['ELLIOTT MANAGEMENT CORPORATION'] ['4295985165']
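
Each literal value comes back as a list, as the output shows. For downstream lookups it can be convenient to flatten the result into a name-to-PermID dictionary. A minimal sketch, assuming each company's literals behave like a dict of lists as above; the two sample entries are copied from the output, and `first_literal` is a hypothetical helper.

```python
NAME = "http://s.opencalais.com/1/pred/name"
PERMID = "http://s.opencalais.com/1/pred/permid"

def first_literal(literals, predicate):
    """Return the first value stored under predicate, or None if absent."""
    values = literals.get(predicate) or []
    return values[0] if values else None

# Plain dicts standing in for company.getLiterals() results
sample_literals = [
    {NAME: ["DELL INC."], PERMID: ["4295906157"]},
    {NAME: ["EMC CORPORATION"], PERMID: ["4295903890"]},
]
permid_by_name = {
    first_literal(lit, NAME): first_literal(lit, PERMID) for lit in sample_literals
}
print(permid_by_name)
```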


### Serialize to an N-Triples file

In [42]:
temp_output_nt    = output_dir+ "article.nt"
g.serialize(temp_output_nt, format="nt")


### Which PermIDs ?

In [43]:
perm_id = rdflib.URIRef(u'http://s.opencalais.com/1/pred/permid')
for s, p, o in g:
    if p == perm_id:        
        if o.startswith("http"):
            print(o)
        else:
            print("https://permid.org/1-"+o)


https://permid.org/1-4295906157
https://permid.org/1-505062
https://permid.org/1-4295985165
https://permid.org/1-5000690970
https://permid.org/1-404011
https://permid.org/1-4296392467
https://permid.org/1-404011
https://permid.org/1-4295877060
https://permid.org/1-5048024231
https://permid.org/1-5000067065
https://permid.org/1-4295914668
https://permid.org/1-100148
https://permid.org/1-5000046422
https://permid.org/1-404011
https://permid.org/1-100319
https://permid.org/1-100299
https://permid.org/1-4295903890
https://permid.org/1-4295907347
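
The list above contains repeats (`1-404011` appears three times), so counting distinct PermIDs needs a deduplication step. A minimal sketch; `to_permid_url` and `unique_in_order` are hypothetical helpers, the URL prefix is the one used in the loop above, and the sample values are illustrative inputs rather than a full copy of the graph.

```python
def to_permid_url(value):
    """Return value unchanged if it is already a URL, else prefix it."""
    value = str(value)
    if value.startswith("http"):
        return value
    return "https://permid.org/1-" + value

def unique_in_order(items):
    """Drop duplicates while keeping first-seen order."""
    return list(dict.fromkeys(items))

raw = ["4295906157", "404011", "https://permid.org/1-404011", "404011"]
urls = unique_in_order(to_permid_url(v) for v in raw)
print(len(urls), urls)
```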


## Storing RDF Output in Neo4j

1. Create an empty Neo4j store (http://dist.neo4j.org/neo4j-enterprise-3.2.2-windows.zip)
2. Create an RDF document that stores the data from the input files



### Step 1
Copy neosemantics-3.2.0.1-beta.jar into the Neo4j plugins folder.

* CREATE INDEX ON :Resource(uri);
* CREATE INDEX ON :URI(uri);
* CREATE INDEX ON :BNode(uri);
* CREATE INDEX ON :Class(uri);
* Use the relevant plugin for RDF import


### Step 2  : Import the RDF