# Tool Architecture

## Contextualizing the Problem in ML
To contextualize the problem, it has been divided into goals, each situated in a different area of requirements engineering.
- Goal 1. Support change-impact analysis.
- Goal 2. Domain mapping and ontology creation.
    - a. Requirement analysis.
- Goal 3. Trace or elicit safety-related aspects from existing norms and standards.
    - a. Requirement elicitation.
- Goal 4. Facilitate effective tracing, reuse, and analysis of system requirements to prevent safety violations.
    - a. Classify safety violations.


## Import Data

The dataset originates from PURE, a requirements collection formatted in XML. All XML files share a common namespace, 'req_document.xsd', which simplifies writing a single function that parses the XML files into a dataframe. This dataframe includes columns for the XML tag (which indicates whether the text is a requirement or informational), the text itself, the path of the entry in the XML tree, and the associated ID, which stands for the requirement number.
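The parsing step described above can be sketched as follows. This is a minimal illustration, not the project's own `parse_xml.process_xml_with_namespace`, whose implementation is not shown here; the helper names `strip_ns` and `xml_to_rows` and the sample XML are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document using the same default namespace.
SAMPLE_XML = """<req_document xmlns="req_document.xsd">
  <title>Sample specification</title>
  <p id="1">
    <title>Scope</title>
    <req id="1.1">The system shall log every braking event.</req>
  </p>
</req_document>"""


def strip_ns(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.split('}', 1)[-1]


def xml_to_rows(element, parent_path=""):
    """Walk the tree depth-first, yielding one dict per element that has text."""
    tag = strip_ns(element.tag)
    path = f"{parent_path}/{tag}" if parent_path else tag
    text = (element.text or "").strip()
    if text:
        yield {"tag": tag, "text": text, "id": element.get("id"), "path": path}
    for child in element:
        yield from xml_to_rows(child, path)


root = ET.fromstring(SAMPLE_XML)
rows = list(xml_to_rows(root))
for row in rows:
    print(row)
```

Passing `rows` to `pd.DataFrame(rows)` would give a dataframe with the tag, text, id, and path columns.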

TODO:
- Test parsing quality against other sources
- Import all files in a folder into class instances


In [3]:
# import modules
import os
import pprint
import re
import string
import sys
import xml.etree.ElementTree as ET

import gensim
import nltk
import pandas as pd

# Add the src directory to sys.path
sys.path.append(os.path.abspath(r'C:\dev\NLP2RE_Sandbox\src'))


print(sys.version)
print(sys.executable)

3.11.5 | packaged by Anaconda, Inc. | (main, Sep 11 2023, 13:26:23) [MSC v.1916 64 bit (AMD64)]
c:\ProgramData\anaconda3\python.exe


In [4]:
# Specify the path to your XML file and namespace
xml_file_path = r'C:\dev\NLP2RE_Sandbox\data\work_data\2007-ertms.xml'
namespace = {'ns': 'req_document.xsd'}

In [5]:
# Parse .xml to df
from Utils import parse_xml

# import utils.ParseXML as ParseXML
df = parse_xml.process_xml_with_namespace(xml_file_path, namespace)
df.tail(10)

Unnamed: 0,tag,text,id,path
621,meaning,Transmission of ETCS information from a train ...,,req_document/p/glossary/glossary_item/meaning
622,term,Train trip,,req_document/p/glossary/glossary_item/term
623,meaning,"Is used when a train passes a ""danger"" signal,...",,req_document/p/glossary/glossary_item/meaning
624,term,Warning,,req_document/p/glossary/glossary_item/term
625,meaning,Audible and/or visual indication to alert the ...,,req_document/p/glossary/glossary_item/meaning
626,term,Wheelslip,,req_document/p/glossary/glossary_item/term
627,meaning,When a traction-driven wheel loses adhesion wi...,,req_document/p/glossary/glossary_item/meaning
628,term,Wheelslide,,req_document/p/glossary/glossary_item/term
629,meaning,When a braked wheel loses adhesion with the ra...,,req_document/p/glossary/glossary_item/meaning
630,title,Other technical functions,11.0,req_document/p/title


## Clean Data
Having extracted the text from the XML file into a workable format (a dataframe), the next step is to clean up and tokenize it.
This is less straightforward than it might seem, since care is needed to preserve special words such as hyphenated terms (e-governance, non-functional, …).


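First, let’s tokenize the documents. The aim is also to remove common words and words that appear only once in the corpus, although the corpus output further down shows that common words such as 'a' and 'the' are still present at this stage: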


In [6]:
# Create 'text_clean' attribute in df (list of tokens)
from Utils import clean_data

df['text_clean'] = df['text'].apply(clean_data.preprocess_data_str)

df.head()

Unnamed: 0,tag,text,id,path,text_clean
0,title,ERTMS/ETCS Functional Requirements Specificati...,,req_document/title,"[ertm, etc, function, requir, specif, fr]"
1,version,5.00,,req_document/version,"[5, 00]"
2,issue_date,2007-06-21,,req_document/issue_date,"[2007, 06, 21]"
3,file_number,ERA/ERTMS/003204,,req_document/file_number,"[era, ertm, 003204]"
4,change_date,2007-06-21,,req_document/change_log/change_log_item/change...,"[2007, 06, 21]"
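One way to keep hyphenated terms intact during tokenization is a regular expression that treats a hyphen between alphanumeric runs as part of the token. The sketch below is an illustration only and is not the project's `clean_data.preprocess_data_str`; the stop-word list, the `tokenize` helper, and the sample sentences are assumptions. It also shows removal of common words and of words that occur only once in the corpus.

```python
import re
from collections import Counter

# Hypothetical stop-word list for illustration; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "is", "with"}

# A token is a run of letters/digits, optionally joined by single hyphens.
TOKEN_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def tokenize(text):
    """Lowercase the text and split it into tokens, keeping hyphenated words whole."""
    return [tok for tok in TOKEN_RE.findall(text.lower()) if tok not in STOP_WORDS]


docs = [
    "The e-governance portal is a non-functional requirement.",
    "A non-functional requirement constrains the portal.",
]
tokenized = [tokenize(d) for d in docs]

# Drop tokens that occur only once in the whole corpus.
counts = Counter(tok for doc in tokenized for tok in doc)
filtered = [[tok for tok in doc if counts[tok] > 1] for doc in tokenized]
print(tokenized)
print(filtered)
```

Dropping once-only words on such a small sample removes "e-governance" itself, which shows why the frequency threshold should be chosen with the corpus size in mind.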


## Corpus


In [10]:
# Create a corpus
# clean_text = clean_data.df_tokenize(df['text_clean'], 2)
clean_text = df['text_clean']
pprint.pprint(clean_text)

0              [ertm, etc, function, requir, specif, fr]
1                                                [5, 00]
2                                         [2007, 06, 21]
3                                    [era, ertm, 003204]
4                                         [2007, 06, 21]
                             ...                        
626                                          [wheelslip]
627    [when, a, traction, driven, wheel, lose, adhes...
628                                          [wheelslid]
629    [when, a, brake, wheel, lose, adhes, with, the...
630                           [other, technic, function]
Name: text_clean, Length: 432, dtype: object


## Dictionary

In [11]:
# Create a Dictionary: associate each word in the corpus with a unique integer ID.
# This dictionary defines the vocabulary of all words that our processing knows about.
#
# Input:  clean_text, an iterable of token lists (one list per document).
# Output: doc_dictionary, a gensim Dictionary mapping each unique token to an ID.

from gensim import corpora

doc_dictionary = corpora.Dictionary(clean_text)
pprint.pprint(doc_dictionary.token2id)



{'': 14,
 '0': 15,
 '00': 6,
 '003204': 11,
 '03': 577,
 '06': 8,
 '1': 105,
 '12': 452,
 '2': 16,
 '2007': 9,
 '21': 10,
 '24': 437,
 '3': 17,
 '4': 388,
 '5': 7,
 '500': 100,
 '541': 578,
 '6': 453,
 '7': 389,
 '8': 475,
 'a': 18,
 'abil': 589,
 'abl': 87,
 'about': 318,
 'absolut': 544,
 'accept': 264,
 'accid': 438,
 'accord': 209,
 'accordingli': 344,
 'account': 354,
 'accuraci': 375,
 'acknowledg': 153,
 'acoust': 398,
 'action': 236,
 'activ': 106,
 'actual': 385,
 'adapt': 218,
 'addit': 252,
 'adhes': 334,
 'adjust': 600,
 'administr': 603,
 'advanc': 585,
 'advisori': 547,
 'after': 154,
 'afterward': 155,
 'against': 328,
 'ahead': 319,
 'air': 265,
 'alert': 493,
 'all': 19,
 'allow': 79,
 'alreadi': 465,
 'also': 335,
 'an': 190,
 'and': 20,
 'ani': 107,
 'anoth': 331,
 'appear': 494,
 'appertain': 422,
 'appli': 108,
 'applic': 52,
 'approach': 628,
 'appropri': 186,
 'approv': 564,
 'are': 337,
 'area': 91,
 'as': 109,
 'ask': 282,
 'aspect': 597,
 'assess': 439,
 'assi

## BOW

Each document is converted to a bag of words: a list of (word ID, frequency) pairs.

In [13]:
bow = [doc_dictionary.doc2bow(text) for text in clean_text]
pprint.pprint(bow)

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1)],
 [(6, 1), (7, 1)],
 [(8, 1), (9, 1), (10, 1)],
 [(0, 1), (11, 1), (12, 1)],
 [(8, 1), (9, 1), (10, 1)],
 [(13, 1)],
 [(0, 1),
  (1, 1),
  (3, 2),
  (4, 3),
  (14, 1),
  (15, 1),
  (16, 1),
  (17, 1),
  (18, 1),
  (19, 1),
  (20, 1),
  (21, 1),
  (22, 1),
  (23, 1),
  (24, 1),
  (25, 2),
  (26, 2),
  (27, 2),
  (28, 1),
  (29, 2),
  (30, 1),
  (31, 1),
  (32, 1),
  (33, 1),
  (34, 1),
  (35, 1),
  (36, 1),
  (37, 1),
  (38, 1),
  (39, 1),
  (40, 1),
  (41, 1),
  (42, 1),
  (43, 2),
  (44, 1),
  (45, 1),
  (46, 4),
  (47, 1),
  (48, 2),
  (49, 1),
  (50, 1),
  (51, 1)],
 [(1, 4),
  (3, 2),
  (4, 5),
  (5, 3),
  (14, 1),
  (20, 2),
  (25, 1),
  (26, 1),
  (29, 1),
  (32, 3),
  (33, 5),
  (35, 1),
  (40, 1),
  (42, 2),
  (46, 5),
  (48, 2),
  (52, 4),
  (53, 4),
  (54, 1),
  (55, 1),
  (56, 2),
  (57, 1),
  (58, 1),
  (59, 1),
  (60, 2),
  (61, 2),
  (62, 1),
  (63, 4),
  (64, 2),
  (65, 1),
  (66, 2),
  (67, 2),
  (68, 1),
  (69, 3),
  (7
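To make the relationship between the dictionary and the bag-of-words vectors concrete, here is a plain-Python sketch of the same idea. It is not gensim's implementation (gensim assigns IDs in a different order), and the helpers `build_token2id` and `to_bow` and the sample tokens are assumptions made for illustration.

```python
from collections import Counter


def build_token2id(texts):
    """Assign each distinct token the next free integer ID, in order of first appearance."""
    token2id = {}
    for doc in texts:
        for tok in doc:
            token2id.setdefault(tok, len(token2id))
    return token2id


def to_bow(doc, token2id):
    """Return sorted (token ID, count) pairs; tokens missing from the vocabulary are skipped."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())


texts = [["train", "brake", "train"], ["brake", "signal"]]
token2id = build_token2id(texts)
bow_vectors = [to_bow(doc, token2id) for doc in texts]

print(token2id)
print(bow_vectors)
print(to_bow(["signal", "wheel"], token2id))  # "wheel" is not in the vocabulary
```

Each pair reads as "token with this ID occurs this many times in the document", which is the same shape as the `bow` output above.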

# Topic Modelling

## Sample
https://radimrehurek.com/gensim/models/ldamodel.html

In [28]:
import gensim
from gensim.corpora.dictionary import Dictionary

common_texts = [['computer', 'computer', 'human', 'interface'],
                ['survey', 'user', 'computer', 'system', 'response', 'time'],
                ['eps', 'user', 'interface', 'system'],
                ['system', 'human', 'system', 'eps'],
                ['user', 'response', 'time'],
                ['trees'],
                ['graph', 'trees'],
                ['graph', 'minors', 'trees'],
                ['graph', 'minors', 'survey']]
# Build the dictionary (token -> ID) from a list of texts
common_dictionary = Dictionary(common_texts)
pprint.pprint(common_dictionary.token2id)


{'computer': 0,
 'eps': 8,
 'graph': 10,
 'human': 1,
 'interface': 2,
 'minors': 11,
 'response': 3,
 'survey': 4,
 'system': 5,
 'time': 6,
 'trees': 9,
 'user': 7}


In [29]:
common_BOW = [common_dictionary.doc2bow(text) for text in common_texts]

pprint.pprint(common_BOW)

[[(0, 2), (1, 1), (2, 1)],
 [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
 [(2, 1), (5, 1), (7, 1), (8, 1)],
 [(1, 1), (5, 2), (8, 1)],
 [(3, 1), (6, 1), (7, 1)],
 [(9, 1)],
 [(9, 1), (10, 1)],
 [(9, 1), (10, 1), (11, 1)],
 [(4, 1), (10, 1), (11, 1)]]


In [30]:
# Train the model on the corpus.
lda = gensim.models.ldamodel.LdaModel(common_BOW, num_topics=3, id2word=common_dictionary)
topics = lda.print_topics()

# Print topics with words instead of IDs
for topic in topics:
    print(topic)


(0, '0.142*"trees" + 0.130*"minors" + 0.130*"graph" + 0.129*"user" + 0.126*"response" + 0.125*"time" + 0.038*"system" + 0.037*"computer" + 0.036*"eps" + 0.036*"human"')
(1, '0.247*"computer" + 0.144*"human" + 0.142*"interface" + 0.130*"trees" + 0.048*"graph" + 0.043*"minors" + 0.042*"user" + 0.041*"system" + 0.041*"survey" + 0.041*"time"')
(2, '0.189*"system" + 0.101*"user" + 0.100*"survey" + 0.100*"eps" + 0.099*"graph" + 0.060*"computer" + 0.059*"response" + 0.059*"interface" + 0.059*"trees" + 0.059*"time"')


In [31]:
# Create a new corpus, made of previously unseen documents.
other_texts = [
    ['computer', 'time', 'graph'],
    ['survey', 'response', 'eps'],
    ['human', 'system', 'computer']
]
other_corpus = [common_dictionary.doc2bow(text) for text in other_texts]

unseen_doc = other_corpus[0]
vector = lda[unseen_doc]  # get topic probability distribution for a document

pprint.pprint(vector)

[(0, 0.5582571), (1, 0.3399654), (2, 0.10177751)]


## Practice 

In [None]:
import gensim

lda_model = gensim.models.ldamodel.LdaModel(bow, num_topics=7, id2word=doc_dictionary)

pprint.pprint(lda_model.print_topics())

## Computing Model Perplexity
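The LDA model (lda_model) created above can be used to compute perplexity, a measure of how well the model predicts the corpus; lower perplexity is better. Note that gensim's log_perplexity() returns the per-word likelihood bound rather than the perplexity itself. The perplexity is 2^(-bound), so a bound closer to zero is better. It can be computed with the following script:


In [None]:
bound = lda_model.log_perplexity(bow)  # per-word likelihood bound, not the perplexity itself
print('\nPer-word bound: ', bound, '\nPerplexity: ', 2 ** (-bound))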

## Finding Optimal Number of Topics for LDA
We can find the optimal number of topics for LDA by training several LDA models with different numbers of topics and picking the one with the highest coherence value.
The following function, coherence_values_computation(), trains multiple LDA models and returns them together with their corresponding coherence scores:

In [None]:
from gensim.models import CoherenceModel
import matplotlib.pyplot as plt


def coherence_values_computation(dictionary, corpus, texts, limit, start=2, step=3):
    """Train one LDA model per topic count and return the models with their c_v coherence."""
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        # Use the function arguments rather than the notebook globals
        model = gensim.models.ldamodel.LdaModel(corpus, num_topics=num_topics, id2word=dictionary)
        model_list.append(model)
        # Score every model inside the loop so there is one value per topic count
        coherencemodel = CoherenceModel(
            model=model, texts=texts, dictionary=dictionary, coherence='c_v'
        )
        coherence_values.append(coherencemodel.get_coherence())
    return model_list, coherence_values


limit = 50
start = 1
step = 8

model_list, coherence_values = coherence_values_computation(
    dictionary=doc_dictionary, corpus=bow, texts=list(clean_text),
    start=start, limit=limit, step=step
)
x = range(start, limit, step)
plt.plot(x, coherence_values)
plt.xlabel("Num Topics")
plt.ylabel("Coherence score")
plt.legend(["coherence_values"], loc='best')
plt.show()
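Once there is one coherence score per candidate topic count, picking the best model is a matter of pairing the two sequences and taking the maximum. The sketch below uses made-up scores purely for illustration; `best_num_topics` is a hypothetical helper, and the length check reflects the fact that plotting and selection both need exactly one score per topic count.

```python
def best_num_topics(topic_counts, coherence_values):
    """Return the (num_topics, coherence) pair with the highest coherence."""
    if len(topic_counts) != len(coherence_values):
        raise ValueError("need exactly one coherence value per topic count")
    return max(zip(topic_counts, coherence_values), key=lambda pair: pair[1])


# Hypothetical scores for illustration only; real values come from the trained models.
example_counts = list(range(1, 50, 8))  # 1, 9, 17, 25, 33, 41, 49
example_scores = [0.31, 0.42, 0.47, 0.45, 0.40, 0.38, 0.36]

num_topics, score = best_num_topics(example_counts, example_scores)
print(f"Best: {num_topics} topics (coherence {score:.2f})")
```

With the notebook's variables, `best_num_topics(list(x), coherence_values)` would return the winning topic count, and the matching model sits at the same position in `model_list`.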

In [None]:
import pyLDAvis
import pyLDAvis.gensim_models  # this module was named pyLDAvis.gensim in older pyLDAvis releases

pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(lda_model, bow, doc_dictionary)
vis

In the pyLDAvis output, each bubble on the left-hand side represents a topic; the larger the bubble, the more prevalent the topic. A good topic model tends to show fairly big, non-overlapping bubbles scattered throughout the chart.

https://www.tutorialspoint.com/gensim/gensim_documents_and_lda_model.htm


## Query the model using new, unseen documents

In [None]:
other_texts = [
    ['computer', 'time', 'graph'],
    ['survey', 'response', 'eps'],
    ['human', 'system', 'computer']
]
# doc2bow silently drops tokens that are not in doc_dictionary, so query
# tokens should be preprocessed the same way as the training text
other_corpus = [doc_dictionary.doc2bow(text) for text in other_texts]

unseen_doc = other_corpus[0]
vector = lda_model[unseen_doc]  # get topic probability distribution for a document

pprint.pprint(vector)

# Get the (topic ID, probability) tuple with the highest probability
topic_with_highest_value = max(vector, key=lambda x: x[1])

print("Tuple with the highest value:", topic_with_highest_value)
print("Probability:", topic_with_highest_value[1])
print("Topic:", lda_model.print_topic(topic_with_highest_value[0]))
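Because the dictionary was built from stemmed tokens, raw query words may not be part of its vocabulary, in which case they contribute nothing to the bag of words. A small check like the following can reveal that before querying; `split_by_vocabulary` and the vocabulary excerpt are hypothetical and only illustrate the idea.

```python
def split_by_vocabulary(tokens, token2id):
    """Separate tokens the dictionary knows from those it would ignore."""
    known = [tok for tok in tokens if tok in token2id]
    unknown = [tok for tok in tokens if tok not in token2id]
    return known, unknown


# Hypothetical excerpt of a stemmed vocabulary, for illustration only.
example_token2id = {"train": 0, "brake": 1, "system": 2, "time": 3}

known, unknown = split_by_vocabulary(["computer", "time", "graph"], example_token2id)
print("known:", known)
print("unknown:", unknown)
```

With the real objects, `doc_dictionary.token2id` can be passed as the second argument. A query whose tokens are mostly unknown yields a topic distribution that says little about the query itself.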

## Test Query

In [None]:
# Filter the DataFrame to include only rows where 'tag' is equal to "req"
df_req = df[df['tag'] == 'req'].copy()

# Display the new DataFrame
df_req.head(30)

In [None]:
def get_topic_distribution(df, dictionary, lda_model):
    """
    Create a new column 'prob_dist' containing the topic distribution for each document in the DataFrame.

    Parameters:
    - df (pandas.DataFrame): Input DataFrame containing text data.
    - dictionary (gensim.corpora.Dictionary): Gensim dictionary object.
    - lda_model (gensim.models.ldamodel.LdaModel): Trained LDA model.

    Returns:
    - df_with_prob_dist (pandas.DataFrame): DataFrame with an additional column 'prob_dist'.
    """
    # Work on a copy so the caller's DataFrame is left untouched
    df_with_prob_dist = df.copy()

    # For each row: convert the tokens to a bag of words, then infer its topic distribution
    df_with_prob_dist['prob_dist'] = df_with_prob_dist['text_clean'].apply(
        lambda tokens: lda_model[dictionary.doc2bow(tokens)]
    )

    return df_with_prob_dist


# Usage example:
df_req = get_topic_distribution(df_req, doc_dictionary, lda_model)
df_req.head(30)
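The lists stored in 'prob_dist' are sparse: gensim omits topics whose probability falls below the model's minimum_probability threshold, so different rows can hold lists of different lengths. For comparison or further analysis it can help to expand them to a fixed length and to extract the dominant topic. The helpers `to_dense` and `dominant_topic` below are a sketch under that assumption, with made-up numbers.

```python
def to_dense(topic_distribution, num_topics):
    """Expand a sparse [(topic_id, prob), ...] list into a fixed-length list."""
    dense = [0.0] * num_topics
    for topic_id, prob in topic_distribution:
        dense[topic_id] = float(prob)
    return dense


def dominant_topic(topic_distribution):
    """Return the topic ID with the highest probability, or None for an empty list."""
    if not topic_distribution:
        return None
    return max(topic_distribution, key=lambda pair: pair[1])[0]


# Hypothetical distribution over 7 topics; the missing topics were filtered out.
example = [(0, 0.12), (2, 0.81), (4, 0.05)]
dense = to_dense(example, num_topics=7)
top = dominant_topic(example)
print(dense)
print(top)
```

Applied to the dataframe, `df_req['prob_dist'].apply(dominant_topic)` would give one topic ID per requirement.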