![](../reports/presentations/20231205/1.png)

# Pipeline

## Overview

![](../reports/presentations/20231205/2.png)
![](../reports/presentations/20231205/3.png)
![](../reports/presentations/20231205/4.png)

## Preprocessing
#### Tasks:
- text transformation (is the data available in the required format, or does it
need to be transformed or even generated in that format?)
- text cleaning (e.g. remove stop words, lemmatize)
- extraction of desired information (e.g. sentences, noun phrases, certain
entities, activities of a process)
- feature engineering (e.g. are features highly correlated, and can they be combined?
is a combination of certain features more insightful for the given problem?)
- feature enrichment (are there additional features that are not in the data but
seem necessary or advantageous to include? can they be collected or generated?)

### Formatting

In [1]:
# External imports
import os
import sys
import spacy


# Get the current working directory (assuming the notebook is in the notebooks folder)
current_dir = os.getcwd()

# Add the parent directory (project root) to the Python path
parent_dir = os.path.dirname(current_dir)
sys.path.append(parent_dir)

# Relative imports
from src.preprocess import txt_to_df, rmv_and_rplc, chng_stpwrds, lmtz_and_rmv_stpwrds
from src.modelling import gt_mtchs
from src.utils import prnt_brk
from src.evaluation import cnstrnts_gs, sbert_smlarty, sbert_smlarty_cmpntns

In [2]:
# Define variables to use as keys
coffee = "Coffee"
cdm_ren = "CDM/Renewables"

# Define file paths
file_paths_input = {
    coffee: os.path.join('..', 'data', 'coffee', 'input-coffee.txt'),
    cdm_ren: os.path.join('..', 'data', 'cdm', 'input-cdm-amsia190-reduced.txt'),    
}

# Split text into columns
data = {key: txt_to_df(path) for key, path in file_paths_input.items()}

In [3]:
data[coffee]

| | Section | Raw | Processed |
|---|---|---|---|
| 0 | | <!--\n\tSources used for this handbook:\n\t\t-... | |
| 1 | Coffee Roasting Handbook 1st Edition Exclusive: | | |
| 2 | About our Coffee: | \nWe roast our own coffee in the coffeehouse o... | |
| 3 | Controls and Basic Settings: | | |
| 4 | Controls and Basic Settings:/Power Switch: | \nThe power switch is the upper-left knob on t... | |
| 5 | Controls and Basic Settings:/Heater Control: | \nThe knob at the bottom-left of the control p... | |
| 6 | Controls and Basic Settings:/Ammeter: | \nThe ammeter is on the upper-left of the cont... | |
| 7 | Controls and Basic Settings:/Blower Control: | \nThe blower is multi-purpose; it moves air th... | |
| 8 | Controls and Basic Settings:/Thermometer: | \nThe bean temperature displayed on the thermo... | |
| 9 | Controls and Basic Settings:/Circuit Breaker: | \nThe circuit breaker is at the roaster’s back... | |


In [4]:
print(data[coffee].iloc[14].Raw)


In general, we first distinguish between three roasting degrees: light, medium and dark. Secondly, we have acknowledge the coffee pile height in the tray as that play a big role in the temperature of the roasting ovens which are the third and last constraint that needs to be abided to.

For better understanding, we describe temperature rules with boundary temperatures t_min and t_max and distinguish between:
	-> open brackets, e.g., (t_min, t_max), this means that the boundary temperatures ARE NOT included in the rules.
	-> closed brackets, e.g., [t_min, t_max], this means that the boundary temperatures ARE included in the rules.

	-> Light Roast:
	Goes through roasting oven 1,2 and 3.
	Light roasts are light brown with no oil on the bean surface, with a toasted grain taste and noticeable acidity. A common misconception is that Light Roasts don’t have as much caffeine as their darker, bolder counterparts. However, the truth is exactly the opposite! As beans roast, the caffeine slowly 

### Cleaning

#### Remove indents and specific literals

In [5]:
remove = ["\n", "\t","-->", "->"]
replace = {}

for case, df in data.items():
    df["Processed"] = df["Raw"].apply(rmv_and_rplc, remove=remove, replace=replace)
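
The helper `rmv_and_rplc` lives in `src.preprocess` and is not shown in this notebook. Judging from the printed result of the next cell, every removed literal appears to be replaced by a single space rather than deleted outright. The sketch below reproduces that behaviour under this assumption; the name `remove_and_replace` is hypothetical and not part of the project.

```python
def remove_and_replace(text, remove=(), replace=None):
    """Sketch of a cleaning helper (assumed behaviour, not the project's code).

    Each literal in `remove` is replaced by one space, then every
    key of `replace` is substituted by its value.
    """
    if not isinstance(text, str):
        # Leave missing values (None, NaN) untouched.
        return text
    for literal in remove:
        text = text.replace(literal, " ")
    for old, new in (replace or {}).items():
        text = text.replace(old, new)
    return text


# Order matters: "-->" has to be listed before "->", otherwise a stray "-" is left behind.
print(repr(remove_and_replace("a\n\t-> b", remove=["\n", "\t", "-->", "->"])))
```

Because the signature takes the text first and the options as keyword arguments, a function like this can be passed to `Series.apply` in the same way as in the cell above.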

In [6]:
text = data[coffee].iloc[14].Processed
print(text)

 In general, we first distinguish between three roasting degrees: light, medium and dark. Secondly, we have acknowledge the coffee pile height in the tray as that play a big role in the temperature of the roasting ovens which are the third and last constraint that needs to be abided to.  For better understanding, we describe temperature rules with boundary temperatures t_min and t_max and distinguish between:    open brackets, e.g., (t_min, t_max), this means that the boundary temperatures ARE NOT included in the rules.    closed brackets, e.g., [t_min, t_max], this means that the boundary temperatures ARE included in the rules.     Light Roast:  Goes through roasting oven 1,2 and 3.  Light roasts are light brown with no oil on the bean surface, with a toasted grain taste and noticeable acidity. A common misconception is that Light Roasts don’t have as much caffeine as their darker, bolder counterparts. However, the truth is exactly the opposite! As beans roast, the caffeine slowly coo

#### Define stop words

In [7]:
# Define custom stop words
add_stpwrds = []
non_stpwrds = [
    "above",
    "all",
    "and",
    "at",
    "before",
    "below",
    "between",
    "both",
    "can",
    "else",
    "even",
    "for",
    "if",
    "last",
    "least",
    "less",
    "must",
    "next",
    "no",
    "none",
    "not",
    "nothing",
    "only",
    "otherwise",
    "over",
    "or",
    "should",
    "than",
    "then",
    "to",
    "up"
]

# Add and remove custom stop words globally on the language class returned by spacy.util.get_lang_class('en')
stpwrds = chng_stpwrds(add=add_stpwrds, remove=non_stpwrds, remove_numbers=True, verbose=True)

# Uncomment this line to restore the default set of stop words
# stpwrds = chng_stpwrds(restore_default=True, verbose=True)

eight successfuly added to removal list!
eleven successfuly added to removal list!
fifteen successfuly added to removal list!
fifty successfuly added to removal list!
first successfuly added to removal list!
five successfuly added to removal list!
forty successfuly added to removal list!
four successfuly added to removal list!
hundred successfuly added to removal list!
nine successfuly added to removal list!
one successfuly added to removal list!
six successfuly added to removal list!
sixty successfuly added to removal list!
ten successfuly added to removal list!
third successfuly added to removal list!
three successfuly added to removal list!
twelve successfuly added to removal list!
twenty successfuly added to removal list!
two successfuly added to removal list!
Stop word [ above ] successfully removed!
Stop word [ all ] successfully removed!
Stop word [ and ] successfully removed!
Stop word [ at ] successfully removed!
Stop word [ before ] successfully removed!
Stop word [ below ] s
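
Why these words are taken off the stop word list: words such as "if", "not", "less" and "than", as well as number words, carry the meaning of a constraint, so filtering them out would destroy the information the pipeline is supposed to extract. The following is a minimal, library-free sketch of that idea. It does not reproduce `chng_stpwrds`; the function names and the tiny stop word set are illustrative assumptions, not spaCy's actual list.

```python
# Small illustrative subset, NOT spaCy's full English stop word list.
DEFAULT_STOP_WORDS = {"a", "the", "this", "our", "is", "if", "not", "less", "than", "two", "three"}
NUMBER_WORDS = {"one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten"}


def change_stop_words(base, add=(), remove=(), remove_numbers=False):
    """Return a new stop word set with custom additions and removals."""
    stop_words = set(base)
    stop_words |= {word.lower() for word in add}
    stop_words -= {word.lower() for word in remove}
    if remove_numbers:
        # Number words stay in the text, so they must not be stop words.
        stop_words -= NUMBER_WORDS
    return stop_words


def drop_stop_words(tokens, stop_words):
    """Keep only the tokens that are not stop words (case-insensitive)."""
    return [token for token in tokens if token.lower() not in stop_words]


tokens = "if the capacity is not less than two MW".split()
custom = change_stop_words(DEFAULT_STOP_WORDS, remove=["if", "not", "less", "than"], remove_numbers=True)

print(drop_stop_words(tokens, DEFAULT_STOP_WORDS))  # constraint words are lost
print(drop_stop_words(tokens, custom))              # constraint words survive
```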

#### Remove stop words and lemmatize

In [8]:
for case, df in data.items():
    df['Doc'] = df['Processed'].apply(lmtz_and_rmv_stpwrds, model='en_core_web_lg', verbose=True)

< ! --   Sources used for this handbook :      Employee Handbook Coffeehouse Five , 323 Market Plaza Greenwood , IN 46142 , 317.300.4330      Quest Coffee Roaster Handbook , First Edition April 2021 , amended May 2021      Copper Moon Coffee , https://www.coppermooncoffee.com/blogs/newsroom/what-is-the-difference-between-light-medium-and-dark-roast-coffee   >
< ! --   source use for handbook :      Employee Handbook Coffeehouse Five , 323 Market Plaza Greenwood , 46142 , 317.300.4330      Quest Coffee Roaster Handbook , First Edition April 2021 , amend May 2021      Copper Moon Coffee , https://www.coppermooncoffee.com/blogs/newsroom/what-is-the-difference-between-light-medium-and-dark-roast-coffee   >




We roast our own coffee in the coffeehouse on a weekly basis .
roast coffee coffeehouse weekly basis .

There are two primary things to know about our coffee roasting

## Modelling
#### Tasks:
- rule based entity matching
- extraction of process steps, variables and constraints
- formatting to fit into desired output

### Extracting process steps and variables

In [10]:
process_steps, variables = (), ()

# TODO: Extraction of process steps
# Summarisation libraries

# TODO: Extraction of variables ?

### Extracting constraints

In [11]:
text = data[cdm_ren].iloc[2].Doc
prnt_brk(text,linebreak=100)

  to validate applicability project activity , organisation need to follow specific evaluation proce
dure . if fail one step , evaluation procedure terminate result non - applicability and methodology 
shall not apply .    first , shall evaluate if project involve new installation or if project replac
e exist onsite fossil - fuel - fire generation . if one condition hold true , next step to check if 
grid connection present or present at time crediting period . if unit to supply electricity newly co
nnect to grid at time crediting period , evaluation procedure terminate result non - applicability .
   if project not limit to supply individual household stand - electricity system involve household 
grid connection prior to project activity , need to evaluate if one following three exception hold t
rue .    ( 1 ) sum instal capacity all renewable energy unit connect to grid supply household less t
han 15 MW .   ( 2 ) project involve renewable energy - base lighting application and emissi
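
The output above is cut every 100 characters regardless of word boundaries, which is why words such as "procedure" are split across lines. A minimal sketch of such a fixed-width break, assuming the input can be converted with `str()`; the names `chunk_text` and `print_with_breaks` are hypothetical and not the project's `prnt_brk`:

```python
def chunk_text(text, width=100):
    """Split text into consecutive pieces of at most `width` characters."""
    text = str(text)
    return [text[start:start + width] for start in range(0, len(text), width)]


def print_with_breaks(text, linebreak=100):
    """Print the text with a hard line break every `linebreak` characters."""
    for piece in chunk_text(text, linebreak):
        print(piece)


print_with_breaks("if fail one step , evaluation procedure terminate", linebreak=20)
```

If breaking inside words is not wanted, `textwrap.fill(text, width=100)` from the standard library wraps at whitespace instead.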

In [24]:
# TODO: Define more patterns

pattern = [
    {"LEMMA": "great"},
    {"LEMMA": "than"}
]

pattern_name = "GREATER_THAN"

# TODO: Generalize gt_mtchs to hold multiple patterns

matches = gt_mtchs(text, pattern, pattern_name)

[(14628693206119734160, 329, 331), (14628693206119734160, 349, 351)]
Match: 14628693206119734160 329 331
Span: great than
Context: project activity great than 4 w
Match: 14628693206119734160 349 351
Span: great than
Context: power plant great than 4 w


In [None]:
"""
Suggestions for patterns:

01. Requirements keywords: shall, should, must, duties, requirements, require, condition, constraint
02. Numerical value (in combination with unit)
03. Conditional statement keywords: if, then, else, otherwise, case, and, or
04. Comparison keywords: greater than, below, between
05. Comparison operators: <, >, >=, <=, !=, ==
"""

In [25]:
# TODO: Transform matches into formalized structure

# TODO: Compose process steps, variables and matches into constraints

In [3]:
constraints_dummy_cdm = {
    # Exact match
    "c1": "({check project type}, {check connection type}, {directly follows}, {project type == new installation OR project type == replacing existing fossil-fuel-fired generation})",
    # Close match
    "c2": "({check power plant type}, {check hydro power plant}, {directly follows}, {power plant type == hydro power})",
    # No match
    "c3": "({check combustion}, {check emissions}, {directly follows}, {fuel == coal})"
}
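
Each constraint is written as a 4-tuple of brace-delimited components: the step it starts from, the step it leads to, the relation between the two steps, and the condition. A minimal sketch for splitting such a string into its parts, assuming that the components themselves never contain curly braces; `Constraint` and `parse_constraint` are hypothetical names, not project code:

```python
import re
from typing import NamedTuple


class Constraint(NamedTuple):
    step_from: str
    step_to: str
    relation: str
    condition: str


# One component is everything between a "{" and the next "}".
_COMPONENT = re.compile(r"\{([^{}]*)\}")


def parse_constraint(raw):
    """Split '({a}, {b}, {relation}, {condition})' into its four components."""
    parts = [part.strip() for part in _COMPONENT.findall(raw)]
    if len(parts) != 4:
        raise ValueError(f"expected 4 components, found {len(parts)}: {raw!r}")
    return Constraint(*parts)


parsed = parse_constraint(
    "({check combustion}, {check emissions}, {directly follows}, {fuel == coal})"
)
print(parsed.step_from, "|", parsed.step_to, "|", parsed.relation, "|", parsed.condition)
```

Splitting the string this way makes it possible to compare the two steps and the condition separately instead of only comparing whole constraint strings.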

## Evaluation
#### Tasks:
- similarity scores of word embeddings of constraints
- final score

### Calculating similarity scores

In [4]:
# Define file paths
file_paths_gs = {
    coffee: os.path.join('..', 'data', 'coffee', 'output-coffee.txt'),
    cdm_ren: os.path.join('..', 'data', 'cdm', 'output-cdm-amsia190-reduced.txt'),    
}

# Transform content to dictionary
constraints_gs_cdm = cnstrnts_gs(file_paths_gs[cdm_ren])
constraints_gs_cdm

{'c1': '({check project type}, {check connection type}, {directly follows}, {project type == new installation OR project type == replacing existing fossil-fuel-fired generation})',
 'c2': '({check connection type}, {check power plant type}, {directly follows}, {connection type == limited to supplying individual households with stand-alone electricity systems AND connection type != new grid connections planned at any time during the crediting period})',
 'c3': '({check connection type}, {check grid exceptions}, {directly follows}, {connection type != limited to supplying individual households with stand-alone electricity systems AND connection type != new grid connections planned at any time during the crediting period})',
 'c4': '({check grid exceptions}, {check power plant type}, {directly follows}, {sum of installed capacities of all renewable energy units < 15 MW OR (project involves renewable energy-based lighting applications AND emission reductions per system < 5 tonnes of CO2e a

In [5]:
similarity_scores = sbert_smlarty(constraints_dummy_cdm, constraints_gs_cdm, model='all-mpnet-base-v2')

for key, value in similarity_scores.items():
    print(f'{key}: {value}')

c1 - c1: 1.000000238418579
c1 - c2: 0.6574954986572266
c1 - c3: 0.5637484788894653
c1 - c4: 0.4421740174293518
c1 - c5: 0.6409293413162231
c1 - c6: 0.5762197971343994
c1 - c7: 0.4834948778152466
c1 - c8: 0.5600879192352295
c1 - c9: 0.6285613775253296
c1 - c10: 0.5241575241088867
c1 - c11: 0.42186567187309265
c1 - c12: 0.8325357437133789
c1 - c13: 0.45722901821136475
c1 - c14: 0.39302337169647217
c1 - c15: 0.38831067085266113
c1 - c16: 0.46394315361976624
c1 - c17: 0.43315309286117554
c1 - c18: 0.5277394652366638
c1 - c19: 0.46391743421554565
c2 - c1: 0.5672578811645508
c2 - c2: 0.6056338548660278
c2 - c3: 0.4182375371456146
c2 - c4: 0.49076879024505615
c2 - c5: 0.9075765013694763
c2 - c6: 0.9548299312591553
c2 - c7: 0.6894903779029846
c2 - c8: 0.5379923582077026
c2 - c9: 0.5423135757446289
c2 - c10: 0.49241816997528076
c2 - c11: 0.4490783214569092
c2 - c12: 0.5065698027610779
c2 - c13: 0.2578161954879761
c2 - c14: 0.4193108081817627
c2 - c15: 0.6174739599227905
c2 - c16: 0.457644104957
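
`sbert_smlarty` is not shown in this notebook. Assuming it embeds every constraint string and compares the embeddings by cosine similarity, which is the usual choice for Sentence-BERT models, the pairwise score table can be sketched as follows. The vectors are toy values and not model output, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pairwise_scores(extracted, gold):
    """Score every extracted constraint against every gold standard constraint.

    Keys follow the 'extracted - gold' convention of the output above.
    """
    return {
        f"{e_id} - {g_id}": cosine_similarity(e_vec, g_vec)
        for e_id, e_vec in extracted.items()
        for g_id, g_vec in gold.items()
    }


# Toy embeddings for illustration only.
extracted = {"c1": [1.0, 0.0], "c2": [1.0, 1.0]}
gold = {"c1": [2.0, 0.0], "c2": [0.0, 3.0]}

for key, value in pairwise_scores(extracted, gold).items():
    print(f"{key}: {value}")
```

Cosine similarity is bounded by 1, so a value such as 1.000000238418579 for two identical strings is most likely single-precision rounding and can be read as an exact match.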

In [20]:
matches_step_1, matches_step_2, matches_constraints = sbert_smlarty_cmpntns(similarity_scores, constraints_dummy_cdm, constraints_gs_cdm, threshold=0.5, model='paraphrase-MiniLM-L6-v2', matching_mode='unique')

In [21]:
matches_step_1

| | Constraint pair | Group similarity | From | To | Similarity |
|---|---|---|---|---|---|
| 0 | c1 - c1 | 1.0 | check project type | check project type | 1.0 |
| 1 | c2 - c6 | 0.95483 | check power plant type | check power plant type | 1.0 |
| 2 | c3 - c7 | 0.517645 | check combustion | check hydro power plant conditions | 0.287319 |


In [22]:
matches_step_2

| | Constraint pair | Group similarity | From | To | Similarity |
|---|---|---|---|---|---|
| 0 | c1 - c1 | 1.0 | check connection type | check connection type | 1.0 |
| 1 | c2 - c6 | 0.95483 | check hydro power plant | check hydro power plant conditions | 0.949709 |
| 2 | c3 - c7 | 0.517645 | check emissions | check heat and power cogeneration | 0.447099 |


In [23]:
matches_constraints

| | Constraint pair | Group similarity | From | To | Similarity |
|---|---|---|---|---|---|
| 0 | c1 - c1 | 1.0 | project type == new installation OR project ty... | project type == new installation OR project ty... | 1.0 |
| 1 | c2 - c6 | 0.95483 | power plant type == hydro power | power plant type == hydro power plant | 0.964375 |
| 2 | c3 - c7 | 0.517645 | fuel == coal | (reservoir == existing AND volume of reservoir... | 0.252244 |


In [20]:
print("PERFECT MATCH")
print('c1 - c1', similarity_scores['c1 - c1'])
print("Dummy: ", constraints_dummy_cdm['c1'])
print("GS: ", constraints_gs_cdm['c1'])

print("\nCLOSE MATCH")
print('c2 - c6', similarity_scores['c2 - c6'])
print("Dummy: ", constraints_dummy_cdm['c2'])
print("GS: ", constraints_gs_cdm['c6'])

print("\nNO MATCH")
print('c3 - c8', similarity_scores['c3 - c8'])
print("Dummy: ", constraints_dummy_cdm['c3'])
print("GS: ", constraints_gs_cdm['c8'])

PERFECT MATCH
c1 - c1 0.9999998
Dummy:  ({check project type}, {check connection type}, {directly follows}, {project type == new installation OR project type == replacing existing fossil-fuel-fired generation})
GS:  ({check project type}, {check connection type}, {directly follows}, {project type == new installation OR project type == replacing existing fossil-fuel-fired generation})

CLOSE MATCH
c2 - c6 0.9708441
Dummy:  ({check power plant type}, {check hydro power plant}, {directly follows}, {power plant type == hydro power})
GS:  ({check power plant type}, {check hydro power plant conditions}, {directly follows}, {power plant type == hydro power plant})

NO MATCH
c3 - c8 0.6591883
Dummy:  ({check combustion}, {check emissions}, {directly follows}, {fuel == coal})
GS:  ({check heat and power cogeneration}, {check non-renewable components}, {directly follows}, {system != combined heat and power cogeneration})


### Calculating final score

In [None]:
# TODO: Define final score
"""
Proposal: Several metrics.

(1) Take the highest available score for each extracted constraint and compute the mean over all extracted constraints.

(2) Divide the number of extracted constraints with at least one similarity score above THRESHOLD (e.g. 0.9) by the number of constraints in the GS. Count each GS constraint only once: if c1 - c2 scores 0.95 and c2 - c2 scores 0.92, only the higher-scoring pair (0.95) counts, which leaves the extracted c2 without a match for c2 from the GS.

(3) Penalize duplicates and extracted constraints that are not in the GS.

(4) Aggregate (1) - (3).
"""

## Next steps


![](../reports/presentations/20231205/5.png)
![](../reports/presentations/20231205/6.png)

Rule based extraction of process steps?

A: 

Relevance of negated constraints?

A: 


Specifics of the calculation of similarity scores?
    
Threshold

A:

"directly follows"

A:

Structure

A:

Calculation of final score?

A:

Literature recommendations? 

A:


![](../reports/presentations/20231205/7.png)