
# **Conceptual Graph-Based Recommendation System for PRM Issues: OL Approach**
Welcome to this notebook, which explores and implements a recommendation system for Project Risk Management (PRM) issues. The system builds on Conceptual Graphs and uses an Online Learning (OL) approach to provide tailored solutions for PRM challenges.

In this notebook, we will delve into the foundations of conceptual graphs, their role in recommendation systems, and how the online learning paradigm enhances their effectiveness. We will walk through the step-by-step process of building and deploying this recommendation system, aiming to empower PRM professionals with a powerful tool to make informed decisions and optimize their strategies.

By the end of this notebook, you will have a comprehensive understanding of how to implement and customize this recommendation system to suit your specific PRM needs. Let's embark on this exciting journey towards revolutionizing your PRM strategies!

## Dependencies Installation

In [1]:
!pip install pdfplumber
!pip install autocorrect
!pip install stanza
!pip install PyMuPDF
!pip install transformers==4.12.0
!pip install PyPDF2
!pip install textacy
!pip install rouge
!pip install sentence-transformers
!pip install faiss-cpu --no-cache

Collecting pdfplumber
  Downloading pdfplumber-0.11.0-py3-none-any.whl (56 kB)
Collecting pdfminer.six==20231228 (from pdfplumber)
  Downloading pdfminer.six-20231228-py3-none-any.whl (5.6 MB)
Collecting pypdfium2>=4.18.0 (from pdfplumber)
  Downloading pypdfium2-4.29.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.8 MB)
Installing collected packages: pypdfium2, pdfminer.six, pdfplumber
Successfully installed pdfminer.six-20231228 pdfplumber-0.11.0 pypdfium2-4.29.0
Collecting autocorrect
  Downloading autocorrect-2.6.1.tar.gz (622 kB)

In [2]:
from rouge import Rouge
import  pdfplumber
import string
import re
import stanza
from transformers import pipeline
import pandas as pd
#Download the Stanza model for your desired language (e.g., English)
stanza.download('en')
nlp = stanza.Pipeline(lang='en', processors='tokenize,pos')
#fix_spelling = pipeline("text2text-generation",model="oliverguhr/spelling-correction-english-base")



Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.8.0.json:   0%|   …

Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.8.0/models/default.zip:   0%|          | 0…

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.8.0.json:   0%|   …

# Extract Concepts

* After obtaining the raw text data, the next step was data engineering: distilling the essential information from the raw text into a well-organized dataframe that serves as the foundation for the knowledge graph and ontology.

* To start this phase, we used the Stanza matcher to identify and extract key concepts, configuring it with patterns that correspond to those concepts. We then reviewed and corrected incorrect matches to improve its accuracy for our specific case, which let us extract all relevant concepts.
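
As a minimal sketch of pattern-based concept extraction, the snippet below matches numbered section headings with a regular expression. The sample text and the helper name `extract_headings` are illustrative assumptions, not part of the original pipeline.

```python
import re

# Matches headings such as "11.1", "11.1.1" or "11.1.1.1" at the start of a line,
# followed by the heading text on the same line.
HEADING_RE = re.compile(r'^(11(?:\.\d+)+)\s+(.+)$', re.MULTILINE)

def extract_headings(page_text):
    """Return (section_number, heading_text) pairs found in a page of text."""
    return [(number, title.strip()) for number, title in HEADING_RE.findall(page_text)]

# Hypothetical page content, used only for illustration
sample = (
    "11.1 PLAN RISK MANAGEMENT\n"
    "Some body text.\n"
    "11.1.1 PLAN RISK MANAGEMENT: INPUTS\n"
    "11.1.1.1 PROJECT CHARTER\n"
)
headings = extract_headings(sample)
for number, title in headings:
    print(number, title)
```

Anchoring the pattern at the start of a line is a design choice that avoids picking up section numbers cited in the middle of a sentence.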

In [None]:
import fitz  # PyMuPDF
import re
import torch
from transformers import BertTokenizer, BertForMaskedLM

# Load the BERT model
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForMaskedLM.from_pretrained(model_name)

def extract_titles_and_descriptions_from_pdf(pdf_file):
    pdf_document = fitz.open(pdf_file)
    titles_with_descriptions = []

    for page_num in range(len(pdf_document)):
        page = pdf_document[page_num]
        page_text = page.get_text()
        page_text=page_text.replace('—','\n')
        # Use a regular expression to find heading patterns "11.x", "11.x.x" or "11.x.x.x"
        matches = re.findall(r'(11(?:\.\d+)+)\s+(.+)', page_text)
        for match in matches:
            title = match[0]
            description = match[1]

            # Ask BERT to predict a token for a [MASK] placed in front of the description
            masked_text = f"[MASK] {description}"
            input_ids = tokenizer.encode(masked_text, add_special_tokens=True, return_tensors="pt")
            mask_index = input_ids[0].tolist().index(tokenizer.mask_token_id)

            with torch.no_grad():
                predictions = model(input_ids)
            predicted_token_id = torch.argmax(predictions.logits[0, mask_index]).item()
            predicted_word = tokenizer.decode(predicted_token_id)

            # Replace the mask with the predicted word in the description.
            # Note: `description` itself contains no "[MASK]", so this call leaves it unchanged.
            description = description.replace("[MASK]", predicted_word)

            # Append the title and the full description to the list
            titles_with_descriptions.append(f"{title} {description}")

    pdf_document.close()
    return titles_with_descriptions

# Example usage
pdf_file_path = '/kaggle/input/prm-pmbok6-2017/PRM-PMBOK6-2017 (1).pdf'
titles_with_descriptions = extract_titles_and_descriptions_from_pdf(pdf_file_path)

In [None]:

def extract_Figure_from_pdf(pdf_file):
    pdf_document = fitz.open(pdf_file)
    titles_with_descriptions = []

    for page_num in range(len(pdf_document)):
        page = pdf_document[page_num]
        page_text = page.get_text()
        page_text=page_text.replace('—','\n')
        #print(page_text)
        # Use a regular expression to find figure labels of the form "Figure x-y" at the start of a line
        matches = re.findall(r'^(Figure)+\s([1-9][0-9]?|100)+-+([1-9][0-9]?|100)+.', page_text, re.MULTILINE)
        for match in matches:
            title = match[0]
            title2 = match[1]
            description = match[2]

            # Ask BERT to predict a token for a [MASK] placed in front of the figure number
            masked_text = f"[MASK] {description}"
            input_ids = tokenizer.encode(masked_text, add_special_tokens=True, return_tensors="pt")
            mask_index = input_ids[0].tolist().index(tokenizer.mask_token_id)

            with torch.no_grad():
                predictions = model(input_ids)
            predicted_token_id = torch.argmax(predictions.logits[0, mask_index]).item()
            predicted_word = tokenizer.decode(predicted_token_id)

            # Replace the mask with the predicted word.
            # Note: `description` (the figure number) contains no "[MASK]", so this call leaves it unchanged.
            description = description.replace("[MASK]", predicted_word)

            # Append the figure label (chapter number and figure number) to the list
            titles_with_descriptions.append(f"{title} {title2}-{description}")

    pdf_document.close()
    return titles_with_descriptions

# Example usage
pdf_file_path = '/kaggle/input/prm-pmbok6-2017/PRM-PMBOK6-2017 (1).pdf'
Figure = extract_Figure_from_pdf(pdf_file_path)

# Extract using pdfplumber into text corpus

In [None]:
pdf_path = '/kaggle/input/prm-pmbok6-2017/PRM-PMBOK6-2017 (1).pdf'
project_risk_management = ''
with pdfplumber.open(pdf_path) as pdf:
    for i in range(0,64):
        # Drop characters with a font size above 30
        page = pdf.pages[i].filter(lambda obj: not (obj["object_type"] == "char" and obj["size"] > 30))
        # Guard against pages for which no text is returned
        project_risk_management += page.extract_text() or ''
project_risk_management = project_risk_management.replace('\n','\n ')
project_risk_management = project_risk_management.replace('  ',' ')
project_risk_management = project_risk_management.lower()

# Auto correct

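We correct the corpus with the `oliverguhr/spelling-correction-english-base` spelling-correction model. This step is disabled in the cells below (the code is commented out).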

In [None]:
#fix_spelling = pipeline("text2text-generation",model="oliverguhr/spelling-correction-english-base")

In [None]:
""""#Example text (you can replace this with your corpus)
#Process the corpus text
doc = nlp(project_risk_management)
l=[]
#Iterate through each sentence
for sentence in doc.sentences:
    # Access the text of the sentence
    sentence_text = " ".join([word.text for word in sentence.words])

#Print the sentence text
    l.append(fix_spelling(sentence_text,max_length=2048)[0]['generated_text'])
project_risk_management=" ".join(l)    """

# Extract definition using gpt2

In [None]:
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer, TextDataset, DataCollatorForLanguageModeling, Trainer, TrainingArguments

# Load the pretrained GPT-2 model and tokenizer
model_name = "gpt2-medium"  # Choose a GPT-2 variant
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

In [None]:
with open('output.txt','w') as file:
    file.write(project_risk_management)

In [None]:
# Load and preprocess the fine-tuning dataset
# (here, the corpus written to output.txt in the previous cell)
dataset = TextDataset(
    tokenizer=tokenizer,
    file_path='/kaggle/working/output.txt',
    block_size=128,  # Adjust block size as needed
)

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=False,
)

# Configure training arguments
training_args = TrainingArguments(
    output_dir='./fine-tuned-gpt2',  # Specify the output directory
    overwrite_output_dir=True,
    num_train_epochs=20,  # Adjust the number of training epochs
    per_device_train_batch_size=16,  # Adjust batch size as needed
    save_steps=10_000,  # Specify how often to save the model
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

# Fine-tune the model
trainer.train()

# Save the fine-tuned model
trainer.save_model()

# You can now use the fine-tuned model for text generation tasks

In [None]:
# Remove numbers from each element while keeping spaces
cleaned_concept_list = [' '.join(filter(str.isalpha, element.split())) for element in titles_with_descriptions]

In [None]:
# Load the fine-tuned model
'''fine_tuned_model = GPT2LMHeadModel.from_pretrained('./fine-tuned-gpt2')  # Load from the fine-tuned model directory

# Set the model to evaluation mode
fine_tuned_model.eval()
definitions={}
for i in range(len(cleaned_concept_list)):
    
# Define a starting prompt for text generation
    prompt = "define the " + cleaned_concept_list[i] + " concept."

# Tokenize the prompt
    input_ids = tokenizer.encode(prompt, return_tensors="pt")

# Generate text using the fine-tuned model
    output = fine_tuned_model.generate(input_ids, max_length=200, num_return_sequences=1)

# Decode and print the generated text
    generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
    definitions[cleaned_concept_list[i]] = generated_text
#print(definitions)'''

# Create concepts dataframe
* **We created our initial dataframe with 6 columns:**
1. **risk_concepts**: contains the name of the concept.
2. **Relation_Type**: contains the type of relation with another concept or process (e.g. has input, is SubClass of, ...).
3. **Definition**: contains the isDefinedBy part (the definition of the concept) plus the isDescribedIn part.
4. **Clean_definition**: contains only the isDefinedBy part, i.e. the definition of the concept.
5. **Figure**: contains the isDescribedIn part that refers to a figure.
6. **Described_in**: contains the isDescribedIn part that refers to a section.
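
To illustrate how a `Relation_Type` value can be derived from a concept heading, here is a small sketch. It assumes that a heading separates the process from its part with a colon (as in "plan risk management: inputs"); the helper name `relation_type` and the sample headings are hypothetical.

```python
def relation_type(concept):
    """Return 'Has <part>' when the heading contains a colon, else an empty string."""
    if ":" not in concept:
        return ""
    # Everything after the first colon names the related part (inputs, outputs, ...)
    part = concept.split(":", 1)[1].strip()
    return f"Has {part}" if part else ""

# Hypothetical headings, used only for illustration
print(relation_type("11.1.1 plan risk management: inputs"))
print(relation_type("11.1 plan risk management"))
```

Splitting on the first colon and stripping whitespace avoids depending on exactly one space after the colon.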

In [None]:
data_frame = pd.DataFrame()
data_frame['risk_concepts'] = [c for c in  titles_with_descriptions]
data_frame['Relation_Type'] = ""
data_frame['Definition'] = ""
data_frame['Clean_definition'] = ""
data_frame['Figure'] = ""
data_frame['Described_in'] = ""

# Extract and add Relation_Type

In [None]:
for idx, row in data_frame.iterrows():
    if ":" in str(row.risk_concepts) :
            key=str(row.risk_concepts)[str(row.risk_concepts).index(":")+2:]
            data_frame.at[idx, 'Relation_Type'] = f'Has {key}'

# Add the Definition

In [None]:
data_df=pd.read_csv("/kaggle/input/definition-vf/definition_vf.csv")

In [None]:
for idx, row in data_frame.iterrows():
    concept=' '.join(filter(str.isalpha, str(row.risk_concepts).split()))
    def_string=""
    data_frame.at[idx,'risk_concepts'] = concept.lower()
    for i in data_df[data_df["Concept"]==concept.lower()]["Definition"].values:
        def_string=str(i)
    data_frame.at[idx,'Definition'] = def_string
        
#data_frame.at[len(concept_risk)-1, 'Definition']=text[text.find(data_frame.iloc[-1]['risk_concepts'])+ len(concept_risk[-1]):]

In [None]:
#for i in range(len(titles_with_descriptions) - 1):
    #data_frame['Definition'][i] = definitions[titles_with_descriptions[i]]
#data_frame.at[len(concept_risk)-1, 'Definition']=text[text.find(data_frame.iloc[-1]['risk_concepts'])+ len(concept_risk[-1]):]

In [None]:
data_frame['Clean_definition'] = data_frame['Definition']

# Extract the "described in section" and "figure" 

In [None]:
import pandas as pd
from transformers import GPT2Tokenizer, GPT2Model
import torch
import re

# Assume the CSV file has already been loaded into a dataframe df
# df = pd.read_csv("def.csv")

# Load the pretrained GPT-2 tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Load the pretrained GPT-2 model
model = GPT2Model.from_pretrained("gpt2")

# Extract every section number of the form "x.x.x.x" that follows "described in section"
def extraire_numeros_sections(texte):
    # Find every occurrence of "described in section" in the text
    indices = [match.start() for match in re.finditer("described in section", texte)]

    numeros_sections = set()  # A set keeps the section numbers unique

    for start_idx in indices:
        # Take the text that follows "described in section"
        texte_apres_section = texte[start_idx + len("described in section"):].strip()

        # Append the end-of-sequence token to the text
        texte_apres_section += tokenizer.eos_token

        # Tokenize the text
        texte_enc = tokenizer(texte_apres_section, return_tensors="pt")

        # Encode it with GPT-2 (the outputs are not used by the regex extraction below)
        with torch.no_grad():
            outputs = model(**texte_enc)

        # Use a regular expression to extract the section numbers (x.x.x.x)
        numeros_section_match = re.findall(r'described in section (\d+\.\d+\.\d+\.\d+)', texte)

        if numeros_section_match:
            numeros_sections.update(numeros_section_match)  # "update" adds the elements to the set

    return ["described in section "+ str(i) for i in list(numeros_sections)]
# Apply the function to create a column that holds the extracted section references
data_frame["Described_in"] = data_frame["Definition"].apply(extraire_numeros_sections)
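
The section references can also be collected with a regular expression alone. The sketch below is an illustrative alternative, not the notebook's method: the helper name `section_references`, the looser pattern (any depth of `x.y...` numbering) and the sample sentence are assumptions.

```python
import re

# Captures the section number that follows the phrase "described in section"
SECTION_RE = re.compile(r'described in section (\d+(?:\.\d+)+)')

def section_references(definition):
    """Return the unique section numbers cited in a definition, sorted for stable output."""
    return sorted(set(SECTION_RE.findall(definition)))

# Hypothetical definition text, used only for illustration
sample = (
    "the risk register (described in section 11.2.3.1) is updated; "
    "the risk report is described in section 11.2.3.2 and described in section 11.2.3.1."
)
refs = section_references(sample)
print(refs)
```

Sorting the set gives a deterministic order, which makes the resulting column easier to compare between runs.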



In [None]:
data_frame["Described_in"][1]

In [None]:
item_list = []
for item in data_frame["Definition"]:
    item_list.append(item.split())
item_list_not_splitted = []
for item in data_frame["Definition"]:
    item_list_not_splitted.append(item)

In [None]:
# Extract every "described in section X" mention and remember the row it came from
described_list = []
index_list = []
for i, words in enumerate(item_list):
    # Stop three words early so that words[j+3] always exists
    for j in range(len(words) - 3):
        if words[j] in ("described", "(described") and words[j+1] == "in" and words[j+2] == "section":
            described_list.append(" ".join(words[j:j+4]))
            index_list.append(i)

# Strip the reference keywords from the clean definition of those rows
clean_col = data_frame.columns.get_loc('Clean_definition')
for i in index_list:
    data_frame.iloc[i, clean_col] = item_list_not_splitted[i].replace("described in section", "").replace("section","").replace("figure","")

In [None]:
# Extract every "figure X" mention and remember the row it came from
figure_list = []
figure_index_list = []
for i, words in enumerate(item_list):
    # Stop one word early so that words[j+1] always exists
    for j in range(len(words) - 1):
        if words[j] == "figure":
            figure_list.append(words[j] + " " + words[j+1])
            figure_index_list.append(i)

# Append each figure reference to the Figure column of its row
fig_col = data_frame.columns.get_loc('Figure')
for row_idx, figure in zip(figure_index_list, figure_list):
    data_frame.iloc[row_idx, fig_col] += " " + figure

# Remove "described in section" and "figure" from clean Definition

In [None]:
clean_col = data_frame.columns.get_loc('Clean_definition')
for i in figure_index_list:
    # Chain the replacements so that each one builds on the previous result
    data_frame.iloc[i, clean_col] = (item_list_not_splitted[i]
                                     .replace("described in section", "")
                                     .replace("section", "")
                                     .replace("figure", ""))

# Pre-processing phase 

In [None]:
processing = pd.DataFrame()
processing['risk_concepts'] = data_frame['risk_concepts']
processing['Relation_Type'] = data_frame['Relation_Type']
processing['Definition'] = data_frame['Definition']
processing['Clean_definition'] = ""
processing['Figure'] = data_frame['Figure']
processing['Described_in'] = data_frame['Described_in']

In [None]:
tmp = list(data_frame['Clean_definition'])

* **Removing digits and punctuation**

In [None]:
pattern = string.punctuation.replace('.','')

In [None]:
for i in range(len(tmp)):
    ###### PUNCTUATION #########################
    tmp[i] = tmp[i].replace('uu','')
    tmp[i] = tmp[i].replace('\n','')
    tmp[i] = tmp[i].replace('—','')
    tmp[i] = tmp[i].replace(',',' ')
    # Remove all punctuation except '.', which is kept as the sentence separator
    tmp[i] = tmp[i].translate(str.maketrans('', '', pattern))
    ###### Digits #############################
    # Replace every digit with a space
    tmp[i] = re.sub(r'\d', ' ', tmp[i])
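
The cleaning steps above can be sketched as a single function. This is a simplified illustration rather than the notebook's exact code: the name `clean_text`, the final whitespace collapse and the sample sentence are assumptions.

```python
import re
import string

# All punctuation except '.', which is kept so sentences can still be split later
PUNCTUATION_TO_DROP = string.punctuation.replace('.', '')

def clean_text(text):
    """Remove punctuation (except '.') and digits, then collapse repeated whitespace."""
    text = text.replace(',', ' ')
    text = text.translate(str.maketrans('', '', PUNCTUATION_TO_DROP))
    text = re.sub(r'\d', ' ', text)
    return ' '.join(text.split())

# Hypothetical input, used only for illustration
cleaned = clean_text("risk (see 11.2), e.g. cost-overrun!")
print(cleaned)
```

Note that removing hyphens joins the two halves of a compound word, which may or may not be desirable for later triplet extraction.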

In [None]:
# Assign the whole column at once instead of writing through a chained .iloc
processing['Clean_definition'] = tmp

* **Removing special characters**

In [None]:
def remove_special_characters(text):
    """
        Remove special characters, including symbols, emojis, and other graphic characters
    """
    emoji_pattern = re.compile(
        '['
        u'\U0001F600-\U0001F64F'  # emoticons
        u'\U0001F300-\U0001F5FF'  # symbols & pictographs
        u'\U0001F680-\U0001F6FF'  # transport & map symbols
        u'\U0001F1E0-\U0001F1FF'  # flags (iOS)
        u'\U00002702-\U000027B0'
        u'\U000024C2-\U0001F251'
        ']+',
        flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)
processing['Clean_definition'] = processing['Clean_definition'].apply(lambda x: remove_special_characters(x))
processing['Clean_definition'] = processing['Clean_definition'].apply(lambda s : s.lower())

In [None]:
processing

# Load the REBEL model

REBEL is a text2text (seq2seq) model trained by Babelscape by fine-tuning BART. It translates a raw input sentence containing entities and implicit relations into a set of triplets that state those relations explicitly. It has been trained on more than 200 different relation types.

In [None]:
Triplets = pd.DataFrame()
Triplets['risk_concepts'] = data_frame['risk_concepts']
Triplets['head'] = ""
Triplets['type'] = ""
Triplets['tail'] = ""

In [None]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

In [None]:
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("Babelscape/rebel-large")
model = AutoModelForSeq2SeqLM.from_pretrained("Babelscape/rebel-large")

# From text to KB

The next step is to write a function that is able to parse the strings generated by REBEL and transform them into relation triplets. This function must take into account additional new tokens used while training the model.

In [None]:
def extract_relations_from_model_output(text):
    relations = []
    subject, relation, object_ = '', '', ''
    text = text.strip()
    current = 'x'
    text_replaced = text.replace("<s>", "").replace("<pad>", "").replace("</s>", "")
    for token in text_replaced.split():
        if token == "<triplet>":
            current = 't'
            if relation != '':
                relations.append({
                    'head': subject.strip(),
                    'type': relation.strip(),
                    'tail': object_.strip()
                })
                relation = ''
            subject = ''
        elif token == "<subj>":
            current = 's'
            if relation != '':
                relations.append({
                    'head': subject.strip(),
                    'type': relation.strip(),
                    'tail': object_.strip()
                })
            object_ = ''
        elif token == "<obj>":
            current = 'o'
            relation = ''
        else:
            if current == 't':
                subject += ' ' + token
            elif current == 's':
                object_ += ' ' + token
            elif current == 'o':
                relation += ' ' + token
    if subject != '' and relation != '' and object_ != '':
        relations.append({
            'head': subject.strip(),
            'type': relation.strip(),
            'tail': object_.strip()
        })
    return relations
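
To make the linearized format concrete, here is a compact, independent sketch of the same idea. It assumes the output layout `<triplet> head <subj> tail <obj> relation`, where one head can be followed by several `<subj> ... <obj> ...` pairs; the function name `parse_rebel` and the sample string are hypothetical and not real model output.

```python
import re

def parse_rebel(text):
    """Parse a REBEL-style linearized string into head/type/tail dictionaries."""
    # Drop the sequence markers added by the tokenizer
    text = re.sub(r'</?s>|<pad>', ' ', text)
    triples = []
    # Each chunk after "<triplet>" starts with a head entity
    for chunk in text.split('<triplet>')[1:]:
        head, *pairs = chunk.split('<subj>')
        for pair in pairs:
            if '<obj>' not in pair:
                continue  # incomplete pair, skip it
            tail, relation = pair.split('<obj>', 1)
            if head.strip() and tail.strip() and relation.strip():
                triples.append({'head': head.strip(),
                                'type': relation.strip(),
                                'tail': tail.strip()})
    return triples

# Hypothetical linearized output, used only for illustration
sample = ("<s><triplet> risk register <subj> project documents <obj> part of "
          "<subj> risk <obj> main subject</s>")
triples = parse_rebel(sample)
for t in triples:
    print(t)
```

Splitting on the marker tokens keeps the parser short, at the cost of silently skipping malformed pairs instead of reporting them.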

In [None]:
class KB():
    def __init__(self):
        self.relations = []

    def are_relations_equal(self, r1, r2):
        return all(r1[attr] == r2[attr] for attr in ["head", "type", "tail"])

    def exists_relation(self, r1):
        return any(self.are_relations_equal(r1, r2) for r2 in self.relations)

    def add_relation(self, r):
        if not self.exists_relation(r):
            self.relations.append(r)

    def save(self):
        # Return the stored relations as a list of dicts
        return list(self.relations)

In [None]:
def from_small_text_to_kb(text, verbose=False):
    kb = KB()

    # Tokenize text
    model_inputs = tokenizer(text, max_length=512, padding=True, truncation=True,
                            return_tensors='pt')
    if verbose:
        print(f"Num tokens: {len(model_inputs['input_ids'][0])}")

    # Generate
    gen_kwargs = {
        "max_length": 1000,
        "length_penalty": 0,
        "num_beams": 3,
        "num_return_sequences": 3
    }
    
    generated_tokens = model.generate(
        **model_inputs,
        **gen_kwargs,
    )
    
    decoded_preds = tokenizer.batch_decode(generated_tokens, skip_special_tokens=False)
    
    # create kb
    for sentence_pred in decoded_preds:
        relations = extract_relations_from_model_output(sentence_pred)
        for r in relations:
            kb.add_relation(r)

    return kb

In [None]:
from tqdm import tqdm

In [None]:
relation_list =[]
for item in tqdm(processing['Clean_definition']):
    kb = from_small_text_to_kb(item, verbose=False) #verbose = True to return the number of tokens in each sentence
    relation_list.append(kb.save())

In [None]:
dict_list =[]
for item in tqdm(relation_list):
    for dic in item:
        dict_list.append(dic)
triplet = pd.DataFrame(dict_list, columns=['head', 'type', 'tail'])

In [None]:
triplet

In [None]:
unique_triplets = triplet.drop_duplicates()

In [None]:
unique_triplets

In [None]:
tiplet_unique = triplet['head'].unique()

In [None]:
final_triplet_unique = triplet.drop_duplicates(keep='first')

In [None]:
final_triplet_unique = final_triplet_unique.reset_index(drop=True)

In [None]:
final_triplet_unique

In [None]:
import textacy
import nltk
from nltk.corpus import stopwords
import copy

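# Download the stopwords and tokenizer data (if not already downloaded)
nltk.download('stopwords')
nltk.download('punkt')  # needed by word_tokenize (newer NLTK releases may ask for 'punkt_tab' instead)
from nltk.tokenize import word_tokenize, sent_tokenize
import spacy
# Requires the model to be installed first: python -m spacy download en_core_web_lg
nlp_spacy = spacy.load("en_core_web_lg")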


In [None]:
# Join the definitions with a space so the last word of one does not fuse with the first word of the next
text = ' '.join(processing['Clean_definition'])

In [None]:
example_sent = text
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(example_sent)
# Keep only the tokens that are not stopwords (case-insensitive)
filtered_sentence = [w for w in word_tokens if w.lower() not in stop_words]
test_text = ' '.join(filtered_sentence)

# Extract subject verb and object using Textacy

In [None]:
import textacy
triple_list = []
for sentence in test_text.split('.'):
    t1 = nlp_spacy(sentence)
    # subject_verb_object_triples returns a generator, so materialize it before testing for emptiness
    triple_to_list = list(textacy.extract.subject_verb_object_triples(t1))
    if triple_to_list:
        triple_list.append(pd.DataFrame(triple_to_list))
svo=pd.concat(triple_list, axis=0) # stack all per-sentence dataframes on top of one another
svo.columns=['subject','verb','object'] # rename the columns of the final dataframe

In [None]:
svo

In [None]:
svo = svo.reset_index(drop=True)

In [None]:
final_svo = copy.deepcopy(svo) 

In [None]:
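def to_text(cell):
    # A cell may hold a spaCy Token/Span or a list of tokens, depending on the textacy version
    if isinstance(cell, (spacy.tokens.Token, spacy.tokens.Span)):
        return cell.text
    if isinstance(cell, (list, tuple)):
        return " ".join(t.text if hasattr(t, "text") else str(t) for t in cell)
    return cell

# Rows yielded by iterrows() are copies, so convert each column in place instead
for c in final_svo.columns:
    final_svo[c] = final_svo[c].apply(to_text)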

In [None]:
# Convert any remaining spaCy Token cells to plain strings.
# Assigning whole columns avoids chained indexing, which may write to a copy.
for c in ['subject', 'verb', 'object']:
    final_svo[c] = final_svo[c].apply(
        lambda item: item.text if isinstance(item, spacy.tokens.token.Token) else item)

In [None]:
concept_list = []
concept_index = []
for _,row in final_svo.iterrows():
    item = row['subject']
    for i in tiplet_unique:
        if (i in item):
            concept_list.append(item)
            concept_index.append(list(final_svo['subject']).index(item))
svo_unique = dict(zip(concept_index, concept_list))

In [None]:
final_svo_unique = final_svo.drop_duplicates(keep='first')


In [None]:
final_svo_unique

In [None]:
final_svo = final_svo.reset_index(drop=True)

In [None]:
final_svo['subject'].unique()

# Final data frame

In [None]:
# Described_in holds a list of references per row; join each list into one comma-separated string
processing['Described_in'] = processing['Described_in'].apply(",".join)

In [None]:
processing['Described_in']

In [None]:
df_final = pd.DataFrame()
df_final['Concepts'] = data_frame['risk_concepts']
df_final['Type_relation'] = ''
df_final['Concept_of_type_relation'] = ''
df_final['Definition'] = processing['Clean_definition']
df_final['Synonym'] = ''
df_final['Reference'] = processing['Figure'] + processing['Described_in']
df_final['Process_name'] = ''

# Adding final_triplet_unique relation and its range


In [None]:
for idx1,row1 in final_triplet_unique.iterrows():
    for idx2,row2 in df_final.iterrows():
        if row1['head'] in row2['Concepts']:
            df_final.at[idx2,'Type_relation']=row1['type']
            #row2['Type_relation'] = row1['type']
            df_final.at[idx2,'Concept_of_type_relation']=row1['tail']

# Adding svo_unique  and its range


In [None]:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")
for key, value in svo_unique.items():
    for key2,row in df_final.iterrows():
        if value in row['Concepts']:  
            df_final.at[key2,'Process_name'] = stemmer.stem(final_svo.iloc[key]['verb'])+" "+(final_svo.iloc[key]['object'])

# Similarity case

In [None]:
from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity

# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

In [None]:
import torch

In [None]:

def similarity(sentence1,sentence2):
    # Tokenize and encode the sentences
    inputs1 = tokenizer(sentence1, return_tensors="pt", padding=True, truncation=True)
    inputs2 = tokenizer(sentence2, return_tensors="pt", padding=True, truncation=True)

    # Get the embeddings
    with torch.no_grad():
        outputs1 = model(**inputs1)
        outputs2 = model(**inputs2)

    # Take the embeddings of the [CLS] token
    embeddings1 = outputs1.last_hidden_state[:, 0, :]
    embeddings2 = outputs2.last_hidden_state[:, 0, :]

    # Calculate cosine similarity
    similarity_score = cosine_similarity(embeddings1, embeddings2)[0][0]

    # Compare the similarity score
    return similarity_score 

In [None]:
sentence1 = "i read 7 books"
sentence2 = "i have 5 books"
similarity(sentence1,sentence2)
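
For reference, cosine similarity itself can be sketched without any model. The example also shows why comparing a floating-point score with a tolerance is safer than testing for exact equality with 1; the helper names and the tolerance value are assumptions chosen for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def nearly_identical(u, v, tol=1e-6):
    """True when the cosine similarity is within `tol` of 1."""
    return abs(cosine(u, v) - 1.0) <= tol

parallel = cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])    # same direction
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])            # unrelated directions
print(parallel, orthogonal)
print(nearly_identical([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

Because of rounding, two vectors that point in the same direction may score slightly below or above 1, so a tolerance makes the synonym test more robust.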

In [None]:
df_final = df_final.drop_duplicates()

In [None]:
syn_col = df_final.columns.get_loc('Synonym')
for i in range(0, (df_final.shape[0])-1):
    text1= df_final.iloc[i]['Definition']
    print(i)  # progress indicator
    for j in range(i+1,df_final.shape[0]):
        text2=df_final.iloc[j]['Definition']

        # A floating-point cosine similarity is rarely exactly 1, so compare with a tolerance
        if abs(similarity(text1,text2) - 1) < 1e-6:
            synonyms = df_final.iloc[i, syn_col] + ',' + df_final.iloc[j]['Concepts']
            # Replace digits with spaces and write back with a single .iloc call
            df_final.iloc[i, syn_col] = re.sub(r'\d', ' ', synonyms)

In [None]:
df_final

# Evaluation 


In [None]:
all_data_combined = df_final['Concepts'].tolist() + df_final['Type_relation'].tolist()+ df_final['Definition'].tolist()+df_final['Synonym'].tolist()+df_final['Reference'].tolist()+df_final['Process_name'].tolist()

In [None]:
ch=''
for e in all_data_combined:
    ch += '. '+str(e)

In [None]:
proc = project_risk_management

In [None]:
rouge = Rouge()
score=rouge.get_scores(ch, proc, avg=True)
print('Precision = '+str(score['rouge-1']['p']))
print('Recall = ' +str(score['rouge-1']['r']))
print('f-measure = '+str(score['rouge-1']['f']))
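
As a rough guide to what the ROUGE-1 numbers mean, the sketch below computes unigram precision, recall and F-measure by hand. It is a simplified illustration and may differ in detail from the `rouge` package (tokenization, smoothing); the function name `rouge1` and the sample sentences are assumptions.

```python
from collections import Counter

def rouge1(hypothesis, reference):
    """Unigram-overlap precision, recall and F-measure between two strings."""
    hyp = Counter(hypothesis.split())
    ref = Counter(reference.split())
    # Clipped overlap: each word counts at most as often as it appears in both texts
    overlap = sum((hyp & ref).values())
    p = overlap / max(sum(hyp.values()), 1)
    r = overlap / max(sum(ref.values()), 1)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return {'p': p, 'r': r, 'f': f}

# Hypothetical sentences, used only for illustration
scores = rouge1("risk register lists risks",
                "the risk register lists identified risks")
print(scores)
```

Precision measures how much of the generated text appears in the reference, while recall measures how much of the reference is covered by the generated text.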

# Semantic Search


Semantic search seeks to improve search accuracy by understanding the meaning of the search query. In contrast to traditional search engines, which only find documents based on lexical matches, semantic search can also match synonyms.

This makes search more complete, because it tries to capture what the user is asking instead of simply matching keywords to pages. The idea is to embed all entries in your corpus (sentences, paragraphs, or documents) into a vector space. At search time, the query is embedded into the same vector space and the closest embeddings from your corpus are retrieved. These entries should have a high semantic overlap with the query.
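
The embed-then-search idea can be sketched with toy vectors and plain Python. The three-dimensional vectors stand in for real sentence embeddings, and the function name `toy_search` is hypothetical; a real system would obtain the vectors from an embedding model.

```python
def toy_search(query_vec, corpus_vecs, top_k):
    """Return the indices of the top_k corpus vectors ranked by dot product with the query."""
    scored = [
        (sum(q * c for q, c in zip(query_vec, vec)), idx)
        for idx, vec in enumerate(corpus_vecs)
    ]
    # Highest score first
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:top_k]]

# Made-up embeddings for three corpus entries and one query
corpus_vecs = [
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 0.0, 1.0],
]
query_vec = [0.9, 0.1, 0.0]
hits = toy_search(query_vec, corpus_vecs, top_k=2)
print(hits)
```

This brute-force scan compares the query with every entry, which is fine for small corpora; dedicated index libraries exist to make the same lookup fast at scale.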

In [None]:
from sentence_transformers import SentenceTransformer
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
import numpy as np
import faiss
import time
from sentence_transformers import CrossEncoder
from pprint import pprint

In [None]:
model = SentenceTransformer('msmarco-distilbert-base-dot-prod-v3')
model.to(device)

In [None]:
Definition_list = df_final.Definition.tolist()

**FAISS (Facebook AI Similarity Search)** is a library for efficient similarity search over dense vectors, such as the embeddings of documents. Unlike traditional search engines built around exact keyword or hash-based lookups, it finds the vectors closest to a query vector and scales to large collections. The `IndexFlatIP` index used below performs an exact inner-product search.
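
One point worth keeping in mind with inner-product search is that vector length affects the ranking. The sketch below uses made-up two-dimensional vectors to show that ranking by raw inner product and ranking by cosine similarity (inner product of normalized vectors) can disagree; all names are illustrative assumptions.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]

def best_match(query, corpus):
    """Index of the corpus vector with the largest inner product with the query."""
    return max(range(len(corpus)), key=lambda i: dot(query, corpus[i]))

corpus = [[3.0, 0.0], [1.0, 1.0]]   # the first vector is long, the second is aligned with the query
query = [1.0, 1.0]

raw_best = best_match(query, corpus)
cos_best = best_match(normalize(query), [normalize(v) for v in corpus])
print(raw_best, cos_best)
```

Whether to normalize depends on how the embedding model was trained: a model tuned for dot-product scoring is typically used without normalization, while cosine-based models call for unit-length vectors.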

In [None]:
encoded_data = model.encode(Definition_list)
encoded_data = np.asarray(encoded_data.astype('float32'))
index = faiss.IndexIDMap(faiss.IndexFlatIP(768))
index.add_with_ids(encoded_data, np.array(range(0, len(df_final))))
faiss.write_index(index, 'Definition.index')


In [None]:
def fetch_related_definition(dataframe_idx):
    info = df_final.iloc[dataframe_idx]
    meta_dict = dict()
    meta_dict['Concepts'] = info['Concepts']
    meta_dict['Definition'] = info['Definition']
    return meta_dict

def search(query, top_k, index, model):
    t = time.time()
    query_vector = model.encode([query])
    # index.search returns (scores, ids), both ordered from best to worst match
    _, ids = index.search(query_vector, top_k)
    print('>>>> Results in Total Time: {}'.format(time.time()-t))
    # De-duplicate while keeping the ranking order; FAISS pads missing results with -1
    top_k_ids = [idx for idx in dict.fromkeys(ids[0].tolist()) if idx != -1]
    results =  [fetch_related_definition(idx) for idx in top_k_ids]
    return results

In [None]:
query="plan risk management"

In [None]:
results=search(query, top_k=5, index=index, model=model)
print("\n")
for result in results:
    print('\t',result)

In [None]:
## Load our cross-encoder. Use fast tokenizer to speed up the tokenization
cross_model = CrossEncoder('cross-encoder/ms-marco-TinyBERT-L-6', max_length=512)

In [None]:
def cross_score(model_inputs):
    scores = cross_model.predict(model_inputs)
    return scores

model_inputs = [[query,item['Definition']] for item in results]
scores = cross_score(model_inputs)
# Sort the results by cross-encoder score in decreasing order
ranked_results = sorted(
    ({'Definition': inp['Definition'], 'Score': score} for inp, score in zip(results, scores)),
    key=lambda r: r['Score'], reverse=True)
print("\n")
for result in ranked_results:
    # pprint prints the result itself and returns None, so it must not be wrapped in print()
    pprint(result)