# Tool Architecture

## Contextualizing the problem in ML
The problem is divided into four goals, each placed in a different area of requirements engineering:
- Goal 1. Support change-impact analysis.
- Goal 2. Domain mapping and ontology creation.
    - a. Requirement analysis.
- Goal 3. Trace or elicit safety-related aspects from existing norms and standards.
    - a. Requirement elicitation.
- Goal 4. Facilitate effective tracing, reuse, and analysis of system requirements to prevent safety violations.
    - a. Classify safety violations.


## Import Data

The dataset originates from PURE, a collection of requirements documents formatted in XML. All XML files share the same namespace, `req_document.xsd`, which makes it straightforward to write a single function that parses them into a dataframe. The dataframe has one row per text entry, with columns for the XML tag (indicating whether the text corresponds to a requirement or to information), the text itself, the path of the entry in the XML tree, and the associated ID, which is the requirement number.
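As a rough illustration of that parsing step, the sketch below walks an XML tree with `xml.etree.ElementTree` and collects one row per text entry. It is not the actual `parse_xml.process_xml_with_namespace` helper, whose code is not shown here. The function name `xml_to_rows`, the sample document, and the placement of the `id` attribute are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def strip_ns(tag):
    """Drop the '{namespace}' prefix that ElementTree puts in front of tag names."""
    return tag.split('}', 1)[-1]


def xml_to_rows(xml_text):
    """Return one dict per element that carries non-empty text (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    rows = []

    def walk(element, parent_path):
        tag = strip_ns(element.tag)
        path = f"{parent_path}/{tag}" if parent_path else tag
        text = (element.text or "").strip()
        if text:
            rows.append({"tag": tag, "text": text,
                         "id": element.get("id"), "path": path})
        for child in element:
            walk(child, path)

    walk(root, "")
    return rows


# Made-up sample document; the real PURE schema may differ.
sample = """<req_document xmlns="req_document.xsd">
  <title><title>DEMO PROJECT</title></title>
  <version>1.0</version>
  <p><title id="1.0">INTRODUCTION</title>
     <text_body>The system shall log every access.</text_body></p>
</req_document>"""

rows = xml_to_rows(sample)
for row in rows:
    print(row)
```

Passing `rows` to `pd.DataFrame(rows)` would then give a dataframe with `tag`, `text`, `id`, and `path` columns.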

TODO: 
- Missing test of quality with other sources
- Import all in folder into instances of classes


In [2]:
import xml.etree.ElementTree as ET
import pandas as pd
import sys
import string
import re
import nltk
import pprint
from Utils import parse_xml

print(sys.version)
print(sys.executable)

# Specify the path to your XML file
xml_file_path = r'C:\dev\NLP-Sandbox\PURE\requirements-xml\0000 - cctns.xml'

# Define the namespace
namespace = {'ns': 'req_document.xsd'}

# import utils.ParseXML as ParseXML
df = parse_xml.process_xml_with_namespace(xml_file_path, namespace)
df.head(30)

3.11.5 | packaged by Anaconda, Inc. | (main, Sep 11 2023, 13:26:23) [MSC v.1916 64 bit (AMD64)]
c:\ProgramData\anaconda3\python.exe


Unnamed: 0,tag,text,id,path
0,title,E-GOVERNANCE MISSION MODE PROJECT (MMP),,req_document/title/title
1,title,CRIME & CRIMINAL TRACKING NETWORK AND SYSTEMS ...,,req_document/title/title
2,title,FUNCTIONAL REQUIREMENTS SPECIFICATION V1.0 (DR...,,req_document/title/title
3,title,MINISTRY OF HOME AFFAIRS GOVERNMENT OF INDIA,,req_document/title/title
4,version,1.0,,req_document/version
5,title,INTRODUCTION,1.0,req_document/p/title
6,title,The Functional Requirements Specifications (FR...,,req_document/p/text_body
7,title,FUNCTIONAL OVERVIEW,2.0,req_document/p/title
8,title,CCTNS V1.0 functionality is designed to focus ...,,req_document/p/text_body
9,title,DESCRIPTION OF THE MODULES AND FUNCTIONAL REQU...,3.0,req_document/p/title


## Clean Data
Having retrieved the text from the XML file into a workable format (a dataframe), we now need to clean up and tokenize the text.
This is not as straightforward as it might seem, since care is needed to preserve special words such as those joined by hyphens (e-governance, non-functional, …).
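One way to keep hyphenated words intact is to tokenize with a regular expression that treats internal hyphens as part of a token. The sketch below is an assumption about how this could be done, not the code in `Utils.clean_data`. The name `tokenize_keep_hyphens` and the tiny stopword set are hypothetical stand-ins; a real pipeline would more likely use a full stopword list such as the one shipped with NLTK.

```python
import re

# Tiny illustrative stopword set (assumption, not a complete list).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "be"}

# One or more alphanumeric runs, optionally joined by single hyphens.
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def tokenize_keep_hyphens(text):
    """Lowercase, drop punctuation, keep hyphenated words whole, remove stopwords."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [token for token in tokens if token not in STOPWORDS]


tokens = tokenize_keep_hyphens(
    "The E-Governance project defines non-functional requirements."
)
print(tokens)
```

A stemmer could then be applied to each token, with the caveat that stemming hyphenated tokens may need special handling.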


First, let’s tokenize the documents, remove common words as well as words that only appear once in the corpus:

### Remove punctuation, tokenize, and remove stopwords


In [3]:
from Utils import clean_data

clean_data_case = "lsa_preprocess"

if clean_data_case == "lsa_preprocess":
    df['text_clean'] = df['text'].apply(lambda x: clean_data.preprocess_data_str(x))
else:
    df['text_clean'] = df['text'].apply(lambda x: clean_data.clean_text(x.lower(),False,False))

# Display the sub-dataframe
df.head()

Unnamed: 0,tag,text,id,path,text_clean
0,title,E-GOVERNANCE MISSION MODE PROJECT (MMP),,req_document/title/title,"[e, govern, mission, mode, project, mmp]"
1,title,CRIME & CRIMINAL TRACKING NETWORK AND SYSTEMS ...,,req_document/title/title,"[crime, crimin, track, network, system, cctn]"
2,title,FUNCTIONAL REQUIREMENTS SPECIFICATION V1.0 (DR...,,req_document/title/title,"[function, requir, specif, v1, 0, draft]"
3,title,MINISTRY OF HOME AFFAIRS GOVERNMENT OF INDIA,,req_document/title/title,"[ministri, home, affair, govern, india]"
4,version,1.0,,req_document/version,"[1, 0]"


## Create Corpus 


In [4]:
def df_tokenized_2_corpus(df_column, min_word_freq=1):
    """
    Process a DataFrame attribute containing a list of tokenized data.

    Parameters:
    - df_column (pandas.Series): DataFrame column containing a list of tokenized data.
    - min_word_freq (int): Frequency threshold; only tokens that occur more than this many times are kept.

    Returns:
    - processed_corpus (list of lists): Processed corpus after filtering based on word frequencies.
    """
    from collections import defaultdict

    # Count how often each token occurs across the whole column
    frequency = defaultdict(int)
    for text_list in df_column:
        for token in text_list:
            frequency[token] += 1

    # Keep only tokens that occur more than min_word_freq times
    processed_corpus = [[token for token in text_list if frequency[token] > min_word_freq] for text_list in df_column]
    
    return processed_corpus

# Example usage:
corpus = df_tokenized_2_corpus(df['text_clean'], 2)
pprint.pprint(corpus)

[['e', 'mode'],
 ['crime', 'crimin', 'track', 'network', 'system', 'cctn'],
 ['function', 'requir', 'specif', 'v1', '0'],
 ['home'],
 ['1', '0'],
 [],
 ['function',
  'requir',
  'specif',
  'report',
  'provid',
  'detail',
  'descript',
  'function',
  'requir',
  'version',
  'cctn',
  'key',
  'guid',
  'principl',
  'function',
  'design',
  'cctn',
  'v1',
  '0',
  'critic',
  'function',
  'provid',
  'valu',
  'polic',
  'personnel',
  'cut',
  'turn',
  'improv',
  'area',
  'investig',
  'crime',
  'crimin'],
 ['function', 'overview'],
 ['cctn',
  'v1',
  '0',
  'function',
  'design',
  'valu',
  'record',
  'citizen',
  'within',
  'crime',
  'investig',
  'area',
  'base',
  'guid',
  'principl',
  'state',
  'differ',
  'function',
  'block',
  'identifi',
  'detail',
  'function',
  'block'],
 ['descript', 'modul', 'function', 'requir'],
 ['function',
  'cctn',
  'applic',
  'provid',
  'valu',
  'polic',
  'personnel',
  'oper',
  'cut',
  'eas',
  'day',
  'day',
  'op

## Dictionary

In [7]:
from gensim import corpora

def corpus_2_dictionary(corpus):
    """
    Associate each word in the corpus with a unique integer ID.
    This dictionary defines the vocabulary of all words that our processing knows about.

    Parameters:
    - corpus (list of lists of str): Tokenized documents to be indexed.

    Returns:
    - A gensim Dictionary mapping each unique token to an integer ID.
    """
    return corpora.Dictionary(corpus)

dictionary = corpus_2_dictionary(corpus)
pprint.pprint(dictionary.token2id)



{'0': 8,
 '1': 14,
 '2': 269,
 '20': 193,
 '3': 246,
 '9241': 190,
 'abil': 69,
 'abl': 147,
 'accept': 109,
 'access': 126,
 'account': 210,
 'achiev': 143,
 'act': 48,
 'action': 114,
 'activ': 140,
 'adapt': 249,
 'addit': 100,
 'administr': 129,
 'advanc': 70,
 'alert': 95,
 'allow': 152,
 'also': 71,
 'altern': 197,
 'applic': 43,
 'appropri': 151,
 'architectur': 270,
 'area': 15,
 'attempt': 148,
 'attribut': 110,
 'audit': 128,
 'avail': 144,
 'avoid': 213,
 'base': 34,
 'behaviour': 250,
 'block': 35,
 'browser': 127,
 'cach': 283,
 'capabl': 130,
 'capac': 263,
 'care': 241,
 'case': 64,
 'categori': 124,
 'cctn': 2,
 'certain': 211,
 'chang': 157,
 'citizen': 36,
 'clear': 203,
 'clearli': 225,
 'click': 183,
 'color': 238,
 'colour': 185,
 'common': 174,
 'commun': 237,
 'complaint': 49,
 'compon': 248,
 'configur': 99,
 'connect': 276,
 'consist': 244,
 'contain': 222,
 'content': 186,
 'context': 115,
 'control': 149,
 'core': 271,
 'court': 65,
 'creat': 101,
 'crime': 3

## BOW

Each document is converted to a bag-of-words vector: a list of `(token ID, frequency)` pairs.
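To show what this representation contains, here is a pure-Python sketch that builds a token-to-ID mapping and counts token occurrences per document. It only mimics the idea behind gensim's `Dictionary` and `doc2bow`; the helper names `build_token2id` and `to_bow` are hypothetical, and the IDs gensim assigns may differ from the ones produced here.

```python
from collections import Counter


def build_token2id(corpus):
    """Assign each distinct token an integer ID in order of first appearance."""
    token2id = {}
    for document in corpus:
        for token in document:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


def to_bow(document, token2id):
    """Return sorted (token ID, frequency) pairs; unknown tokens are ignored."""
    counts = Counter(token2id[token] for token in document if token in token2id)
    return sorted(counts.items())


toy_corpus = [["crime", "track", "system"], ["function", "requir", "function"]]
toy_token2id = build_token2id(toy_corpus)
toy_bow = [to_bow(document, toy_token2id) for document in toy_corpus]
print(toy_bow)

# A token that is not in the vocabulary simply disappears from the vector.
unseen_bow = to_bow(["system", "requirements"], toy_token2id)
print(unseen_bow)
```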

In [66]:
bow = [dictionary.doc2bow(text) for text in corpus]
pprint.pprint(bow)

[[(0, 1), (1, 1)],
 [(2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
 [(8, 1), (9, 1), (10, 1), (11, 1), (12, 1)],
 [(13, 1)],
 [(8, 1), (14, 1)],
 [],
 [(2, 2),
  (3, 1),
  (4, 1),
  (8, 1),
  (9, 4),
  (10, 2),
  (11, 1),
  (12, 1),
  (15, 1),
  (16, 1),
  (17, 1),
  (18, 1),
  (19, 1),
  (20, 1),
  (21, 1),
  (22, 1),
  (23, 1),
  (24, 1),
  (25, 1),
  (26, 1),
  (27, 1),
  (28, 2),
  (29, 1),
  (30, 1),
  (31, 1),
  (32, 1)],
 [(9, 1), (33, 1)],
 [(2, 1),
  (3, 1),
  (8, 1),
  (9, 3),
  (12, 1),
  (15, 1),
  (19, 1),
  (20, 1),
  (21, 1),
  (23, 1),
  (27, 1),
  (31, 1),
  (34, 1),
  (35, 2),
  (36, 1),
  (37, 1),
  (38, 1),
  (39, 1),
  (40, 1),
  (41, 1)],
 [(9, 1), (10, 1), (18, 1), (42, 1)],
 [(2, 1),
  (9, 2),
  (17, 1),
  (25, 1),
  (26, 2),
  (28, 1),
  (31, 1),
  (43, 1),
  (44, 2),
  (45, 1),
  (46, 2)],
 [(47, 1)],
 [(23, 1),
  (26, 4),
  (34, 1),
  (36, 2),
  (42, 1),
  (45, 1),
  (47, 1),
  (48, 1),
  (49, 2),
  (50, 1),
  (51, 1),
  (52, 1),
  (53, 1),
  (54, 1),
  (5

# Model

## Gensim Compare

In [50]:
from gensim import models

# train the model
tfidf = models.TfidfModel(bow)

# transform the "system requirements" string
words = "system requirements".lower().split()
print(tfidf[dictionary.doc2bow(words)])

[(6, 1.0)]


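The ``tfidf`` model returns a list of tuples, where the first entry is
the token ID and the second entry is the tf-idf weight. Tokens that occur
in more documents of the training corpus are weighted lower. Only ``system``
(ID 6) appears in the output because the query is not stemmed: ``requirements``
is not in the dictionary, which stores the stem ``requir`` instead.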
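To make the weighting concrete, here is a minimal pure-Python sketch of one common tf-idf scheme: term frequency times log2(N / document frequency), followed by L2 normalization. This is assumed to mirror what `TfidfModel` does with its default settings, but it is an illustration, not gensim's implementation. The function name `tfidf_vector` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def tfidf_vector(doc_tokens, corpus_tokens):
    """Return {token: weight} using tf * log2(N / df), L2-normalized."""
    n_docs = len(corpus_tokens)

    # Document frequency: in how many documents does each token occur?
    doc_freq = Counter()
    for tokens in corpus_tokens:
        doc_freq.update(set(tokens))

    # Raw weights; tokens unknown to the corpus are ignored.
    counts = Counter(token for token in doc_tokens if token in doc_freq)
    weights = {token: count * math.log2(n_docs / doc_freq[token])
               for token, count in counts.items()}

    # Normalize to unit length so documents of different sizes are comparable.
    norm = math.sqrt(sum(weight * weight for weight in weights.values()))
    if norm == 0:
        return {}
    return {token: weight / norm for token, weight in weights.items()}


toy_corpus = [["system", "requir"], ["system", "design"], ["polic", "station"]]

single = tfidf_vector(["system", "requirements"], toy_corpus)
mixed = tfidf_vector(["system", "polic"], toy_corpus)
print(single)
print(mixed)
```

Under this scheme a query with a single known token always ends up with weight 1.0 after normalization, which would be consistent with the `[(6, 1.0)]` output above. In `mixed`, the rarer token `polic` receives a higher weight than `system`, which occurs in two of the three documents.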

You can save trained models to disk and later load them back, either to
continue training on new training documents or to transform new documents.

``gensim`` offers a number of different models/transformations.
For more, see the gensim "Topics and Transformations" tutorial.

Once you've created the model, you can do all sorts of cool stuff with it.
For example, to transform the whole corpus via TfIdf and index it, in
preparation for similarity queries:




In [51]:
from gensim import similarities

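# num_features must equal the vocabulary size, i.e. the number of tokens in the dictionary
index = similarities.SparseMatrixSimilarity(tfidf[bow], num_features=len(dictionary))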

To query the similarity of a query document ``query_document`` against every document in the corpus, first convert it to a bag-of-words vector:



In [52]:
query_document = 'system engineering'.split()
query_bow = dictionary.doc2bow(query_document)
pprint.pprint(query_bow)

[(6, 1)]


## Transforming vectors

### Comparing 

From now on, tfidf is treated as a read-only object that can be used to convert any vector from the old representation (bag-of-words integer counts) to the new representation (TfIdf real-valued weights).
Once the transformation model has been initialized, it can be used on any vectors (provided they come from the same vector space, of course), even if they were not used in the training corpus at all. This is achieved by a process called folding-in for LSA, by topic inference for LDA etc.

In [57]:
from gensim import models
tfidf = models.TfidfModel(bow)
lsi = models.LsiModel(bow, id2word=dictionary, num_topics=2)


Now suppose a user types in the query “The help should be accessible to the users both in the offline and online mode”. We would like to sort our corpus documents in decreasing order of relevance to this query. Unlike modern search engines, here we concentrate only on a single aspect of possible similarities: the apparent semantic relatedness of the texts (words). No hyperlinks, no random-walk static ranks, just a semantic extension over the boolean keyword match:

In [61]:
doc = "The help should be accessible to the users both in the offline and online mode"
# Note: the query is only lowercased and split here; it is not stemmed like the corpus tokens
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lsi = lsi[vec_bow]  # convert the query to LSI space
print(vec_lsi)

[(0, -0.019687872533270735), (1, 0.0026600405458696618)]


In addition, we will be considering cosine similarity to determine the similarity of two vectors. Cosine similarity is a standard measure in Vector Space Modeling, but wherever the vectors represent probability distributions, different similarity measures may be more appropriate.
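For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it for sparse vectors given as lists of `(ID, weight)` pairs, the same shape gensim uses for documents. The function name `cosine_similarity` is our own, and this is an illustration rather than the code gensim runs internally.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors of (ID, weight) pairs."""
    a = dict(vec_a)
    b = dict(vec_b)

    # Only IDs present in both vectors contribute to the dot product.
    dot = sum(weight * b[idx] for idx, weight in a.items() if idx in b)
    norm_a = math.sqrt(sum(weight * weight for weight in a.values()))
    norm_b = math.sqrt(sum(weight * weight for weight in b.values()))

    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_a * norm_b)


# Vectors pointing in the same direction score close to 1, orthogonal ones score 0.
same_direction = cosine_similarity([(0, 1.0), (1, 2.0)], [(0, 2.0), (1, 4.0)])
orthogonal = cosine_similarity([(0, 1.0)], [(1, 1.0)])
print(same_direction, orthogonal)
```

Because the measure depends only on direction, scaling a vector does not change its similarity to any other vector.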


### Initializing query structures
To prepare for similarity queries, we need to index all documents that we want to compare against subsequent queries. In our case, they are the same documents used for training LSI, converted to 2-D LSA space. But that’s only incidental; we might also be indexing a different corpus altogether.



In [62]:
from gensim import similarities
index = similarities.MatrixSimilarity(lsi[bow])  # transform corpus to LSI space and index it

pprint.pprint(index)

<gensim.similarities.docsim.MatrixSimilarity object at 0x000002BA5E816690>


### Performing queries

To obtain similarities of our query document against the indexed documents:

In [63]:
sims = index[vec_lsi]  # perform a similarity query against the corpus
# pprint.pprint(list(enumerate(sims)))  # print (document_number, document_similarity) 2-tuples

sims = sorted(enumerate(sims), key=lambda item: -item[1])
for doc_position, doc_score in sims:
    print(doc_score, corpus[doc_position])
   


0.99999696 ['access', 'case', 'activ', 'case', 'relat', 'document', 'data', 'also', 'need', 'store', 'audit', 'trail', 'ensur', 'data']
0.9999914 ['use', 'colour', 'colour', 'use', 'care', 'take', 'account', 'human', 'capabl', 'colour', 'mean', 'inform', 'color', 'mean', 'user', 'may', 'certain', 'color', 'color', 'color']
0.99998677 ['make', 'navig', 'descript', 'navig', 'design', 'help', 'user', 'understand', 'gener', 'guidanc', 'achiev', 'descript', 'iso', '9241']
0.9999215 ['search', 'modul', 'cctn', 'give', 'polic', 'personnel', 'abil', 'execut', 'advanc', 'search', 'case', 'use', 'search', 'function', 'polic', 'personnel', 'search', 'particular', 'type', 'crime', 'properti', 'etc', 'also', 'give', 'user', 'abil', 'custom', 'result', 'view', 'crimin', 'case', 'make', 'report', 'easi', 'polic', 'enabl', 'execut', 'differ', 'type', 'queri', 'report', 'relat', 'etc']
0.9997348 ['system', 'support', 'tier', 'requir']
0.99929965 ['system', 'design', 'perform', 'polic', 'station', 'conn