## DP-1 Oliver Adler demo

Nakoniec som sa rozhodol pre hľadanie implicitných vzťahov medzi vzormi. Používal som zatiaľ iba organizačné vzory.

In [415]:
import os
import pandas as pd
import numpy as np
from itertools import product
import transformers
import tensorflow as tf
from sklearn.metrics.pairwise import cosine_similarity,euclidean_distances
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

from transformers import LongformerTokenizerFast, \
LongformerModel, LongformerTokenizer, LongformerConfig, Trainer, TrainingArguments, EvalPrediction, AutoTokenizer, AutoModel
from transformers.models.longformer.modeling_longformer import LongformerPreTrainedModel, LongformerClassificationHead

from tensorflow.keras import layers
from tensorflow.keras import losses

Pre toto demo som použil vzory z jazyka ProjectManagementPatternLanguage. Z každého vzoru vezmem jeho názov, počiatočný kontext (= problém ktorý rieši) a jeho konečný kontext. Text rozdeľujem podľa schémy, ktorú vzory z tejto kolekcie majú. 

In [352]:
directory = "patterns"
starting_contexts = []
resulting_contexts = []
pattern_names = []
for filename in os.listdir(directory):
    print(filename)
    with open(directory+"/"+filename) as f:
        lines = f.readlines()
        pattern_names.append(lines[0].replace('\n',''))
        starting_context = ""
        
        resulting_context = ""
        starting_context_flag = False
        resulting_context_flag = False
        for line in lines:
            #end of resulting context statement
            if "âœ¥" in line and resulting_context_flag:
                break
            
            #beginning of starting context statement
            if "âœ¥" in line and not starting_context_flag:
                starting_context_flag = True
                continue
                
            #beginning of resulting context statement
            if "Therefore:" in line and starting_context_flag:
                starting_context_flag = False
                resulting_context_flag = True
                continue

            if starting_context_flag and line != "\n":
                starting_context+=line

            if resulting_context_flag and line != "\n":
                resulting_context+=line
      
        starting_contexts.append(starting_context.translate({ord('\n'): None}))
        resulting_contexts.append(resulting_context.translate({ord('\n'): None}))

BuildPrototypes.txt
CommunityOfTrust.txt
CompletionHeadroom.txt
DayCare.txt
DeveloperControlsProcess.txt
DevelopmentEpisode.txt
DontInteruptAnInterrupt.txt
GetOnWithIt.txt
ImpliedRequirements.txt
IncrementalIntegration.txt
InformalLaborPlan.txt
InterruptsUnjamBlocking.txt
MercenaryAnalyst.txt
NamedStableBases.txt
PrivateWorld.txt
ProgrammingEpisode.txt
RecommitmentMeeting.txt
SacrificeOnePerson.txt
SizeTheSchedule.txt
SomeoneAlwaysMakesProgress.txt
SurrogateCustomer.txt
TakeNoSmallSlips.txt
TeamPerTask.txt
WorkFlowsInward.txt
WorkQueue.txt
WorkSplit.txt


In [353]:
patterns=[]
for i,pattern_name in enumerate(pattern_names):
    patterns.append(pattern_name)
    patterns.append(starting_contexts[i])
    patterns.append(resulting_contexts[i])

In [358]:
d = {"pattern_names": pattern_names, 'starting_contexts': starting_contexts, "resulting_contexts": resulting_contexts}
df = pd.DataFrame(data=d)

In [359]:
df

Unnamed: 0,pattern_names,starting_contexts,resulting_contexts
0,Build Prototypes,A project must test requirements and design de...,Build an isolated prototype solution whose pur...
1,Community Of Trust,It is essential that the people in a team trus...,"Do things that explicitly demonstrate trust, s..."
2,Completion Headroom,Every project must commit to delivery on a few...,Project work group completion dates from remai...
3,Day Care,Your experts are spending all their time mento...,"Put one expert in charge of all the novices, l..."
4,Developer Controls Process,"A development culture, like any culture, can b...",Make the Developer the focal point of process ...
5,Development Episode,It's important to build on the collective stre...,Approach all development as a group activity a...
6,Don't Interrupt An Interrupt,It's important to balance a desire that Someon...,"If a developer is already working in ""interrup..."
7,Get On With It,You can't wait until you have every last requi...,As soon as you have confidence about some proj...
8,Implied Requirements,A commitment implies an agreement between peop...,Select and name chunks of functionality. Use n...
9,Incremental Integration,"For iterative development to work well, it is ...",Provide a mechanism to allow developers to bui...


Zo vzorov vytváram dvojice, kde pre dvojicu vzorov X a Y, budem mať v dataframe ich názvy, počiatočný kontext vzoru X a konečný kontext vzoru Y 

In [360]:
df_new = pd.DataFrame(list(product(df.pattern_names, df.pattern_names)), columns=['pattern_name1','pattern_name2'])
df_new

Unnamed: 0,pattern_name1,pattern_name2
0,Build Prototypes,Build Prototypes
1,Build Prototypes,Community Of Trust
2,Build Prototypes,Completion Headroom
3,Build Prototypes,Day Care
4,Build Prototypes,Developer Controls Process
...,...,...
671,Work Split,Take No Small Slips
672,Work Split,Team Per Task
673,Work Split,Work Flows Inward
674,Work Split,Work Queue


In [361]:
df_new["resulting_context"] = df_new.apply(lambda row: df[df["pattern_names"]==str(row.pattern_name1)].resulting_contexts.item(), axis=1)
df_new

Unnamed: 0,pattern_name1,pattern_name2,resulting_context
0,Build Prototypes,Build Prototypes,Build an isolated prototype solution whose pur...
1,Build Prototypes,Community Of Trust,Build an isolated prototype solution whose pur...
2,Build Prototypes,Completion Headroom,Build an isolated prototype solution whose pur...
3,Build Prototypes,Day Care,Build an isolated prototype solution whose pur...
4,Build Prototypes,Developer Controls Process,Build an isolated prototype solution whose pur...
...,...,...,...
671,Work Split,Take No Small Slips,Divide a task into an urgent and deferred comp...
672,Work Split,Team Per Task,Divide a task into an urgent and deferred comp...
673,Work Split,Work Flows Inward,Divide a task into an urgent and deferred comp...
674,Work Split,Work Queue,Divide a task into an urgent and deferred comp...


In [362]:
df_new["starting_context"] = df_new.apply(lambda row: df[df["pattern_names"]==str(row.pattern_name2)].starting_contexts.item(), axis=1)
df_new

Unnamed: 0,pattern_name1,pattern_name2,resulting_context,starting_context
0,Build Prototypes,Build Prototypes,Build an isolated prototype solution whose pur...,A project must test requirements and design de...
1,Build Prototypes,Community Of Trust,Build an isolated prototype solution whose pur...,It is essential that the people in a team trus...
2,Build Prototypes,Completion Headroom,Build an isolated prototype solution whose pur...,Every project must commit to delivery on a few...
3,Build Prototypes,Day Care,Build an isolated prototype solution whose pur...,Your experts are spending all their time mento...
4,Build Prototypes,Developer Controls Process,Build an isolated prototype solution whose pur...,"A development culture, like any culture, can b..."
...,...,...,...,...
671,Work Split,Take No Small Slips,Divide a task into an urgent and deferred comp...,It's difficult to know how long a project shou...
672,Work Split,Team Per Task,Divide a task into an urgent and deferred comp...,Large distractions (usually called crises) mus...
673,Work Split,Work Flows Inward,Divide a task into an urgent and deferred comp...,An organization must seek a structure that bes...
674,Work Split,Work Queue,Divide a task into an urgent and deferred comp...,"It is difficult to do linear, monochronic sche..."


In [363]:
#Vymažem záznamy, kde X a Y sú rovnaké vzory
df_final = df_new[df_new.pattern_name1 != df_new.pattern_name2]
df_final

Unnamed: 0,pattern_name1,pattern_name2,resulting_context,starting_context
1,Build Prototypes,Community Of Trust,Build an isolated prototype solution whose pur...,It is essential that the people in a team trus...
2,Build Prototypes,Completion Headroom,Build an isolated prototype solution whose pur...,Every project must commit to delivery on a few...
3,Build Prototypes,Day Care,Build an isolated prototype solution whose pur...,Your experts are spending all their time mento...
4,Build Prototypes,Developer Controls Process,Build an isolated prototype solution whose pur...,"A development culture, like any culture, can b..."
5,Build Prototypes,Development Episode,Build an isolated prototype solution whose pur...,It's important to build on the collective stre...
...,...,...,...,...
670,Work Split,Surrogate Customer,Divide a task into an urgent and deferred comp...,It is important to exchange ideas and clarify ...
671,Work Split,Take No Small Slips,Divide a task into an urgent and deferred comp...,It's difficult to know how long a project shou...
672,Work Split,Team Per Task,Divide a task into an urgent and deferred comp...,Large distractions (usually called crises) mus...
673,Work Split,Work Flows Inward,Divide a task into an urgent and deferred comp...,An organization must seek a structure that bes...


## Skladanie vzorov do sekvencií (nápad)

V tomto bode som skúsil zo sekvencie opísanej v jazyku vzorov ProjectManagementPatternLanguage 
(A Story About Project Management - https://sites.google.com/a/scrumplop.org/published-patterns/Organizational-Patterns-of-Agile-Software-Development/bookoutline/thepatternlanguages/organizationdesignpatterns/projectmanagementpatternlanguage?authuser=0). 


Z príbehu som vybral postupnosti vzorov tak ako išli za sebou. Napríklad prvý je spomínaný vzor BuildPrototypes a následne bol aplikovaný SizeTheSchedule. Takže do tabuľky spravím záznam sequence==1 pre záznam kde sa nachádza tento pár. Keďže som sa chcel zamerať na to kedy, respektíve po aplikovaní akého vzrou je môžné/vhodné aplikovať druhý vzor, bral som tento vzťah len pre dvojicu po sebe idúcich vzorov v príbehu. Teda, po SizeTheSchedule nasleduje vzor NamedStableBases. Dvojica SizeTheSchedule a NamedStableBases budú mať sequence==1, ale pár BuildPrototypes a NamedStableBases bude mať 0 (Nemusí to byť správne uvažovanie). 

In [364]:
df_final["sequence"]=0
df_final

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_final["sequence"]=0


Unnamed: 0,pattern_name1,pattern_name2,resulting_context,starting_context,sequence
1,Build Prototypes,Community Of Trust,Build an isolated prototype solution whose pur...,It is essential that the people in a team trus...,0
2,Build Prototypes,Completion Headroom,Build an isolated prototype solution whose pur...,Every project must commit to delivery on a few...,0
3,Build Prototypes,Day Care,Build an isolated prototype solution whose pur...,Your experts are spending all their time mento...,0
4,Build Prototypes,Developer Controls Process,Build an isolated prototype solution whose pur...,"A development culture, like any culture, can b...",0
5,Build Prototypes,Development Episode,Build an isolated prototype solution whose pur...,It's important to build on the collective stre...,0
...,...,...,...,...,...
670,Work Split,Surrogate Customer,Divide a task into an urgent and deferred comp...,It is important to exchange ideas and clarify ...,0
671,Work Split,Take No Small Slips,Divide a task into an urgent and deferred comp...,It's difficult to know how long a project shou...,0
672,Work Split,Team Per Task,Divide a task into an urgent and deferred comp...,Large distractions (usually called crises) mus...,0
673,Work Split,Work Flows Inward,Divide a task into an urgent and deferred comp...,An organization must seek a structure that bes...,0


In [365]:
#vytvorené dvojice z príbehu
indexes = []

indexes.append(df_final.index[(df_final.pattern_name1=="Build Prototypes") & (df_final.pattern_name2=="Size The Schedule")])
indexes.append(df_final.index[(df_final.pattern_name1=="Size The Schedule") & (df_final.pattern_name2=="Named Stable Bases")])
indexes.append(df_final.index[(df_final.pattern_name1=="Named Stable Bases") & (df_final.pattern_name2=="Private World")])
indexes.append(df_final.index[(df_final.pattern_name1=="Private World") & (df_final.pattern_name2=="Developer Controls Process")])
indexes.append(df_final.index[(df_final.pattern_name1=="Developer Controls Process") & (df_final.pattern_name2=="Work Flows Inward")])
indexes.append(df_final.index[(df_final.pattern_name1=="Developer Controls Process") & (df_final.pattern_name2=="Work Queue")])
indexes.append(df_final.index[(df_final.pattern_name1=="Developer Controls Process") & (df_final.pattern_name2=="Informal Labor Plan")])
indexes.append(df_final.index[(df_final.pattern_name1=="Developer Controls Process") & (df_final.pattern_name2=="Programming Episode")])
indexes.append(df_final.index[(df_final.pattern_name1=="Work Queue") & (df_final.pattern_name2=="Take No Small Slips")])
indexes.append(df_final.index[(df_final.pattern_name1=="Informal Labor Plan") & (df_final.pattern_name2=="Take No Small Slips")])
indexes.append(df_final.index[(df_final.pattern_name1=="Programming Episode") & (df_final.pattern_name2=="Take No Small Slips")])
indexes.append(df_final.index[(df_final.pattern_name1=="Take No Small Slips") & (df_final.pattern_name2=="Recommitment Meeting ")])
indexes.append(df_final.index[(df_final.pattern_name1=="Take No Small Slips") & (df_final.pattern_name2=="Team Per Task")])
indexes.append(df_final.index[(df_final.pattern_name1=="Take No Small Slips") & (df_final.pattern_name2=="Sacrifice One Person ")])
indexes.append(df_final.index[(df_final.pattern_name1=="Team Per Task") & (df_final.pattern_name2=="Someone Always Makes Progress ")])

indexes

[Int64Index([18], dtype='int64'),
 Int64Index([481], dtype='int64'),
 Int64Index([352], dtype='int64'),
 Int64Index([368], dtype='int64'),
 Int64Index([127], dtype='int64'),
 Int64Index([128], dtype='int64'),
 Int64Index([114], dtype='int64'),
 Int64Index([119], dtype='int64'),
 Int64Index([645], dtype='int64'),
 Int64Index([281], dtype='int64'),
 Int64Index([411], dtype='int64'),
 Int64Index([562], dtype='int64'),
 Int64Index([568], dtype='int64'),
 Int64Index([563], dtype='int64'),
 Int64Index([591], dtype='int64')]

In [366]:
for index in indexes:
    df_final.loc[index,'sequence']=1
df_final[df_final['sequence']==1]

Unnamed: 0,pattern_name1,pattern_name2,resulting_context,starting_context,sequence
18,Build Prototypes,Size The Schedule,Build an isolated prototype solution whose pur...,Both overly ambitious schedules and overly gen...,1
114,Developer Controls Process,Informal Labor Plan,Make the Developer the focal point of process ...,A schedule of developer work tasks can both as...,1
119,Developer Controls Process,Programming Episode,Make the Developer the focal point of process ...,Programming is the act of deciding now what wi...,1
127,Developer Controls Process,Work Flows Inward,Make the Developer the focal point of process ...,An organization must seek a structure that bes...,1
128,Developer Controls Process,Work Queue,Make the Developer the focal point of process ...,"It is difficult to do linear, monochronic sche...",1
281,Informal Labor Plan,Take No Small Slips,Let individuals devise their own short-term pl...,It's difficult to know how long a project shou...,1
352,Named Stable Bases,Private World,Stabilize system interfaces â€” the architectu...,How can we balance the need for developers to ...,1
368,Private World,Developer Controls Process,Provide a mechanism where developers can maint...,"A development culture, like any culture, can b...",1
411,Programming Episode,Take No Small Slips,Develop a program in discrete episodes. Select...,It's difficult to know how long a project shou...,1
481,Size The Schedule,Named Stable Bases,Reward developers for negotiating a schedule t...,It is important to integrate software frequent...,1


## Použitie Neurónovej siete

Zo získaných dát som sa pokúsil spraviť demo neúronovej siete na klasifikáciu dvojíc textov. Teda podobnosť medzi konečným kontextom (resulting_context) vzoru X a začiatočným kontextom (starting_context) vzoru Y. Kde ak tieto kontexy sú podobné, je možné povedať že vzor Y môže byť použitý po vzore X. Zároveň by sieť pracovala s informáciou o tom či už boli v sekvencií čo som robil v prechádzajúcom úseku kódu.

Avšak model, ktorý by dokázal pozerať na sémantiku textov, v mojom prípade to bol BERT respektíve model na ňom založený, dokázal zobrať iba texty do veľkosti 509, čo je príliš málo na moje textové opisy. Existujú riešenia založené na modeli BERT, ktoré zvládnu aj dlhšie texty ako BigBird alebo LongFormer, tie sa mi však nepodarilo spojazdniť zatiaľ. 

In [176]:
max(df_final["starting_context"].apply(len))

1506

In [216]:
val_df["starting_context"].apply(len)

147     324
148    1506
149     232
151     861
152    1346
153     663
154    1003
155    1075
156     492
157     709
158    1281
159     644
160     476
161     324
162    1506
163     232
164     449
166    1346
167     663
168    1003
169    1075
170     492
171     709
172    1281
173     644
174     476
175     324
176    1506
177     232
178     449
179     861
181     663
182    1003
183    1075
184     492
185     709
186    1281
187     644
188     476
189     324
190    1506
191     232
192     449
193     861
194    1346
Name: starting_context, dtype: int64

## Iné riešenie (bez neurónovej siete)

Za použitia BERT modelu all-mpnet-base-v2, ktorý je natrénovaný a pripravený na použitie, viem vytvoriť vektorovú reprezentáciu textu, kde je zachovaná aj jeho sémantika. Vektory oboch kontextov viem porovnávať na základe metrík kosínusová podobnosť alebo euklidovksá vzdialenosť. Tieto mezriky podobností vektorov môžem použiť ako vstupy do ML modelu.

In [416]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-mpnet-base-v2')


In [294]:
df_final

Unnamed: 0,pattern_name1,pattern_name2,resulting_context,starting_context,sequence
1,Build Prototypes,Developer Controls Process,Build an isolated prototype solution whose pur...,"A development culture, like any culture, can b...",0
2,Build Prototypes,Informal Labor Plan,Build an isolated prototype solution whose pur...,A schedule of developer work tasks can both as...,0
3,Build Prototypes,Named Stable Bases,Build an isolated prototype solution whose pur...,It is important to integrate software frequent...,0
4,Build Prototypes,Private World,Build an isolated prototype solution whose pur...,How can we balance the need for developers to ...,0
5,Build Prototypes,Programming Episode,Build an isolated prototype solution whose pur...,Programming is the act of deciding now what wi...,0
...,...,...,...,...,...
190,Work Queue,Size The Schedule,Produce a schedule that is simply a prioritize...,Both overly ambitious schedules and overly gen...,0
191,Work Queue,Someone Always Makes Progress,Produce a schedule that is simply a prioritize...,It is important to keep a team moving forward ...,0
192,Work Queue,Take No Small Slips,Produce a schedule that is simply a prioritize...,It's difficult to know how long a project shou...,1
193,Work Queue,Team Per Task,Produce a schedule that is simply a prioritize...,Large distractions (usually called crises) mus...,0


In [302]:
#vytváranie vektorových reprezentácií oboch kontextov
df_final['starting_embedding'] = df_final['starting_context'].apply(model.encode)
df_final['resulting_embedding'] = df_final['resulting_context'].apply(model.encode)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_final['starting_embedding'] = df_final['starting_context'].apply(model.encode)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_final['resulting_embedding'] = df_final['resulting_context'].apply(model.encode)


In [374]:
df_final

Unnamed: 0,pattern_name1,pattern_name2,resulting_context,starting_context,sequence,starting_embeddings,resulting_embeddings
1,Build Prototypes,Community Of Trust,Build an isolated prototype solution whose pur...,It is essential that the people in a team trus...,0,"[0.033992566, 0.0021839319, -0.0064396067, 0.0...","[0.026106788, 0.04424294, 0.006456062, 0.00941..."
2,Build Prototypes,Completion Headroom,Build an isolated prototype solution whose pur...,Every project must commit to delivery on a few...,0,"[-0.06432862, 0.046334974, -0.015850563, -0.08...","[0.026106788, 0.04424294, 0.006456062, 0.00941..."
3,Build Prototypes,Day Care,Build an isolated prototype solution whose pur...,Your experts are spending all their time mento...,0,"[0.069971845, 0.056567065, -0.00016744515, -0....","[0.026106788, 0.04424294, 0.006456062, 0.00941..."
4,Build Prototypes,Developer Controls Process,Build an isolated prototype solution whose pur...,"A development culture, like any culture, can b...",0,"[0.04957857, 0.05668401, -0.046187162, -0.0080...","[0.026106788, 0.04424294, 0.006456062, 0.00941..."
5,Build Prototypes,Development Episode,Build an isolated prototype solution whose pur...,It's important to build on the collective stre...,0,"[0.0035809937, 0.0012670304, -0.019009784, 0.0...","[0.026106788, 0.04424294, 0.006456062, 0.00941..."
...,...,...,...,...,...,...,...
670,Work Split,Surrogate Customer,Divide a task into an urgent and deferred comp...,It is important to exchange ideas and clarify ...,0,"[0.07129714, 0.05089531, -0.022737782, 0.02519...","[0.0035704507, 0.05868317, -0.023359573, -0.02..."
671,Work Split,Take No Small Slips,Divide a task into an urgent and deferred comp...,It's difficult to know how long a project shou...,0,"[0.015812408, 0.10102768, -0.009773148, -0.043...","[0.0035704507, 0.05868317, -0.023359573, -0.02..."
672,Work Split,Team Per Task,Divide a task into an urgent and deferred comp...,Large distractions (usually called crises) mus...,0,"[0.0011973408, 0.04903567, 0.011451123, -0.010...","[0.0035704507, 0.05868317, -0.023359573, -0.02..."
673,Work Split,Work Flows Inward,Divide a task into an urgent and deferred comp...,An organization must seek a structure that bes...,0,"[0.04887739, 0.0011259597, -0.029870065, -0.04...","[0.0035704507, 0.05868317, -0.023359573, -0.02..."


In [376]:
#počítanie kosínusovej podobnosti

cos_sim = []
for pairs in df_final[['resulting_embeddings','starting_embeddings']].values:
    cos_sim.append(cosine_similarity([pairs[0],pairs[1]]))
cos_values = []
for values in cos_sim:
    cos_values.append(values[0][1])

[array([[0.9999999 , 0.25257915],
        [0.25257915, 0.99999994]], dtype=float32),
 array([[0.9999999 , 0.14763771],
        [0.14763771, 1.        ]], dtype=float32),
 array([[0.9999999 , 0.22450677],
        [0.22450677, 1.0000002 ]], dtype=float32),
 array([[0.9999999 , 0.27709526],
        [0.27709526, 1.0000001 ]], dtype=float32),
 array([[0.9999999 , 0.06495934],
        [0.06495934, 1.0000001 ]], dtype=float32),
 array([[0.9999999 , 0.12164262],
        [0.12164262, 1.0000004 ]], dtype=float32),
 array([[0.9999999 , 0.25244573],
        [0.25244573, 1.0000002 ]], dtype=float32),
 array([[0.9999999 , 0.23166391],
        [0.23166391, 0.9999998 ]], dtype=float32),
 array([[0.9999999 , 0.29673555],
        [0.29673555, 0.99999976]], dtype=float32),
 array([[0.9999999, 0.1845297],
        [0.1845297, 1.0000002]], dtype=float32),
 array([[0.9999999 , 0.14992702],
        [0.14992702, 1.        ]], dtype=float32),
 array([[0.9999999 , 0.24772869],
        [0.24772869, 1.0000002 ]], 

In [404]:
#počítanie euklidovskej vzdialenosti


euc_dist = []
for pairs in df_final[['resulting_embeddings','starting_embeddings']].values:
    euc_dist.append(euclidean_distances([pairs[0],pairs[1]]))
euc_dist

euc_values = []
for values in euc_dist:
    euc_values.append(values[0][1])

[array([[0.       , 1.2226373],
        [1.2226373, 0.       ]], dtype=float32),
 array([[0.       , 1.3056511],
        [1.3056511, 0.       ]], dtype=float32),
 array([[0.       , 1.2453861],
        [1.2453861, 0.       ]], dtype=float32),
 array([[0.       , 1.2024182],
        [1.2024182, 0.       ]], dtype=float32),
 array([[0.       , 1.3675092],
        [1.3675092, 0.       ]], dtype=float32),
 array([[0.       , 1.3254112],
        [1.3254112, 0.       ]], dtype=float32),
 array([[0.       , 1.2227463],
        [1.2227463, 0.       ]], dtype=float32),
 array([[0.       , 1.2396258],
        [1.2396258, 0.       ]], dtype=float32),
 array([[0.       , 1.1859716],
        [1.1859716, 0.       ]], dtype=float32),
 array([[0.       , 1.2770828],
        [1.2770828, 0.       ]], dtype=float32),
 array([[0.       , 1.3038965],
        [1.3038965, 0.       ]], dtype=float32),
 array([[0.      , 1.226598],
        [1.226598, 0.      ]], dtype=float32),
 array([[0.      , 1.257617],
  

In [378]:
df_final['cos_similarity']=cos_values

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_final['cos_similarity']=cos_values


In [409]:
df_final['euc_distance']=euc_values

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_final['euc_distance']=euc_values


Výsledná tabuľka s metrikami

In [410]:
df_final

Unnamed: 0,pattern_name1,pattern_name2,resulting_context,starting_context,sequence,starting_embeddings,resulting_embeddings,cos_similarity,euc_distance
1,Build Prototypes,Community Of Trust,Build an isolated prototype solution whose pur...,It is essential that the people in a team trus...,0,"[0.033992566, 0.0021839319, -0.0064396067, 0.0...","[0.026106788, 0.04424294, 0.006456062, 0.00941...",0.252579,1.222637
2,Build Prototypes,Completion Headroom,Build an isolated prototype solution whose pur...,Every project must commit to delivery on a few...,0,"[-0.06432862, 0.046334974, -0.015850563, -0.08...","[0.026106788, 0.04424294, 0.006456062, 0.00941...",0.147638,1.305651
3,Build Prototypes,Day Care,Build an isolated prototype solution whose pur...,Your experts are spending all their time mento...,0,"[0.069971845, 0.056567065, -0.00016744515, -0....","[0.026106788, 0.04424294, 0.006456062, 0.00941...",0.224507,1.245386
4,Build Prototypes,Developer Controls Process,Build an isolated prototype solution whose pur...,"A development culture, like any culture, can b...",0,"[0.04957857, 0.05668401, -0.046187162, -0.0080...","[0.026106788, 0.04424294, 0.006456062, 0.00941...",0.277095,1.202418
5,Build Prototypes,Development Episode,Build an isolated prototype solution whose pur...,It's important to build on the collective stre...,0,"[0.0035809937, 0.0012670304, -0.019009784, 0.0...","[0.026106788, 0.04424294, 0.006456062, 0.00941...",0.064959,1.367509
...,...,...,...,...,...,...,...,...,...
670,Work Split,Surrogate Customer,Divide a task into an urgent and deferred comp...,It is important to exchange ideas and clarify ...,0,"[0.07129714, 0.05089531, -0.022737782, 0.02519...","[0.0035704507, 0.05868317, -0.023359573, -0.02...",0.220626,1.248498
671,Work Split,Take No Small Slips,Divide a task into an urgent and deferred comp...,It's difficult to know how long a project shou...,0,"[0.015812408, 0.10102768, -0.009773148, -0.043...","[0.0035704507, 0.05868317, -0.023359573, -0.02...",0.437349,1.060803
672,Work Split,Team Per Task,Divide a task into an urgent and deferred comp...,Large distractions (usually called crises) mus...,0,"[0.0011973408, 0.04903567, 0.011451123, -0.010...","[0.0035704507, 0.05868317, -0.023359573, -0.02...",0.475158,1.024541
673,Work Split,Work Flows Inward,Divide a task into an urgent and deferred comp...,An organization must seek a structure that bes...,0,"[0.04887739, 0.0011259597, -0.029870065, -0.04...","[0.0035704507, 0.05868317, -0.023359573, -0.02...",0.314848,1.170599


Najväčšiu podobnosť mali vzory CompletionHeadRoom a WorkSplit čo keď sa pozrieme do ích textových opisov tak je vidieť, že tam naozaj existuje nejaký vzťah

In [418]:
df_final.sort_values(by=['cos_similarity'],ascending=False).head(20)

Unnamed: 0,pattern_name1,pattern_name2,resulting_context,starting_context,sequence,starting_embeddings,resulting_embeddings,cos_similarity,euc_distance
77,Completion Headroom,Work Split,Project work group completion dates from remai...,A work group has an obligation to make its eff...,0,"[0.015930163, 0.03433592, -0.0064756554, -0.05...","[-0.02881442, 0.105386145, -0.0058587473, -0.0...",0.727226,0.738612
602,Work Flows Inward,Developer Controls Process,Work should flow in to developer from stakehol...,"A development culture, like any culture, can b...",0,"[0.04957857, 0.05668401, -0.046187162, -0.0080...","[0.03404403, 0.020410828, -0.044784002, 0.0222...",0.655913,0.829562
352,Named Stable Bases,Private World,Stabilize system interfaces â€” the architectu...,How can we balance the need for developers to ...,1,"[0.0037527636, 0.13019443, -0.008576251, -0.04...","[-0.028876815, 0.034207717, -0.014542314, -0.0...",0.610649,0.882441
30,Community Of Trust,Developer Controls Process,"Do things that explicitly demonstrate trust, s...","A development culture, like any culture, can b...",0,"[0.04957857, 0.05668401, -0.046187162, -0.0080...","[0.013892247, 0.0677837, -0.011958739, 0.01426...",0.602927,0.891149
427,Recommitment Meeting,Interrupts Unjam Blocking,Assemble a meeting of interested management an...,A comprehensive scheduling plan is difficult i...,0,"[-0.0021273368, 0.07085707, -0.023154281, -0.0...","[-0.011295064, 0.060159247, -0.019415945, -0.0...",0.598939,0.895613
178,Don't Interrupt An Interrupt,Team Per Task,"If a developer is already working in ""interrup...",Large distractions (usually called crises) mus...,0,"[0.0011973408, 0.04903567, 0.011451123, -0.010...","[-0.03205411, 0.03237927, -0.04433392, -0.0210...",0.597559,0.897152
479,Size The Schedule,Interrupts Unjam Blocking,Reward developers for negotiating a schedule t...,A comprehensive scheduling plan is difficult i...,0,"[-0.0021273368, 0.07085707, -0.023154281, -0.0...","[0.020484738, 0.051352836, -0.035016783, -0.00...",0.592164,0.903146
244,Incremental Integration,Informal Labor Plan,Provide a mechanism to allow developers to bui...,A schedule of developer work tasks can both as...,0,"[0.017129539, 0.06161713, -0.04943146, -0.0459...","[-0.017080508, 0.059721574, -0.05366238, -0.04...",0.584082,0.91205
271,Informal Labor Plan,Interrupts Unjam Blocking,Let individuals devise their own short-term pl...,A comprehensive scheduling plan is difficult i...,0,"[-0.0021273368, 0.07085707, -0.023154281, -0.0...","[-0.018673072, 0.08835107, -0.023630276, -0.00...",0.563295,0.934565
489,Size The Schedule,Take No Small Slips,Reward developers for negotiating a schedule t...,It's difficult to know how long a project shou...,0,"[0.015812408, 0.10102768, -0.009773148, -0.043...","[0.020484738, 0.051352836, -0.035016783, -0.00...",0.555571,0.942793


Naopak najmenej podobné kontexty mali MercenaryAnalyst a NamedStableBases.

In [419]:
df_final.sort_values(by=['cos_similarity']).head(20)

Unnamed: 0,pattern_name1,pattern_name2,resulting_context,starting_context,sequence,starting_embeddings,resulting_embeddings,cos_similarity,euc_distance
325,Mercenary Analyst,Named Stable Bases,"Hire a technical writer, proficient in the nec...",It is important to integrate software frequent...,0,"[-0.008631551, 0.04980158, -0.008348653, -0.05...","[0.008096245, 0.038123347, -0.0067579434, -0.0...",0.009815,1.407256
298,Interrupts Unjam Blocking,Mercenary Analyst,If a role is about to block on a critical reso...,Technical documentation is the dirty work ever...,0,"[0.020133274, -0.014507973, -0.016232308, 0.00...","[0.009520725, -0.01611056, -0.0016354743, -0.0...",0.012062,1.405658
442,Sacrifice One Person,Build Prototypes,Assign just one person to it until it gets han...,A project must test requirements and design de...,0,"[0.05032039, 0.019018563, 0.0053541255, -0.029...","[0.00020712917, -0.003472787, -0.018456511, 0....",0.025786,1.395861
566,Take No Small Slips,Surrogate Customer,Prefer a single large slip to several small sl...,It is important to exchange ideas and clarify ...,0,"[0.07129714, 0.05089531, -0.022737782, 0.02519...","[-0.018102376, -0.0074234153, 0.008787064, -0....",0.029477,1.393214
454,Sacrifice One Person,Mercenary Analyst,Assign just one person to it until it gets han...,Technical documentation is the dirty work ever...,0,"[0.020133274, -0.014507973, -0.016232308, 0.00...","[0.00020712917, -0.003472787, -0.018456511, 0....",0.038279,1.386882
211,Implied Requirements,Day Care,Select and name chunks of functionality. Use n...,Your experts are spending all their time mento...,0,"[0.069971845, 0.056567065, -0.00016744515, -0....","[0.056498844, 0.0061101937, -0.013270222, -0.0...",0.041281,1.384716
225,Implied Requirements,Sacrifice One Person,Select and name chunks of functionality. Use n...,"Small distractions can add up, and sap the str...",0,"[0.010107814, -0.04778266, -0.017414702, -0.01...","[0.056498844, 0.0061101937, -0.013270222, -0.0...",0.047398,1.380291
549,Take No Small Slips,Day Care,Prefer a single large slip to several small sl...,Your experts are spending all their time mento...,0,"[0.069971845, 0.056567065, -0.00016744515, -0....","[-0.018102376, -0.0074234153, 0.008787064, -0....",0.049555,1.378728
637,Work Queue,Named Stable Bases,Produce a schedule that is simply a prioritize...,It is important to integrate software frequent...,0,"[-0.008631551, 0.04980158, -0.008348653, -0.05...","[0.023706136, 0.051402166, -0.029743375, -0.00...",0.050903,1.37775
17,Build Prototypes,Sacrifice One Person,Build an isolated prototype solution whose pur...,"Small distractions can add up, and sap the str...",0,"[0.010107814, -0.04778266, -0.017414702, -0.01...","[0.026106788, 0.04424294, 0.006456062, 0.00941...",0.052221,1.376792


Jeden zo spôsobov ako použiť získané data s ML je napríklad použitie regresie, v tomto prípade som skúsil logistickú regresiu kde ako vstup idú metriky podbnosti. Samozrejme model potrebuje aj label-y takže som skúsil použiť moje predtým vytvorené informácie, či sú vzoru v postupnosti alebo nie. Ale nakoľko som použil len jednu sekvenciu (aj to možno nie s úplne správnym uvažovaním) model nedokázal na testovacej vzroke správne určiť 3 dvojice vzorov, ktoré som identifikoval ako páry v postupnosti

In [411]:
X = df_final[['cos_similarity','euc_distance']]
y = df_final['sequence']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=16)


In [412]:
logreg = LogisticRegression(random_state=16)
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)

In [414]:
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
cnf_matrix

array([[160,   0],
       [  3,   0]], dtype=int64)

## Čo ďalej

V prvom rade rožšíriť databázu, spravovať aj ostatné jazkyky vzorov, prípadne ich príbehy o postupnosti (sekvencii). Taktiež nájsť riešnie ako použiť modely neurónovej siete na klasifikáciu párov textov (kontextov) na základe ich podobnosti.

Pri prípade používania alternatívneho riešenia (text ako vektor). Navrhnutie metódy, ktorá bude schopná klasifikovať vzťahy vzorov na zákade ich vektorovej podobnosti