# Introduction - Using COSINE Metric

In this notebook we demonstrate the use of the **VSM (Vector Space Model)** technique from Information Retrieval to recover trace links between Test Cases and Bug Reports.

We model our study as follows:

* Each bug report's title, summary and description compose a single query.
* Each test case's content is treated as a single document to be retrieved for a query.
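
The modeling above can be illustrated with a minimal, dependency-free sketch: TF-IDF weights are learned from the documents (test cases), each query (bug report) is projected onto that vocabulary, and every pair is scored by cosine similarity. The texts, the names `TC_A`, `TC_B`, `BR_X`, `BR_Y`, the whitespace tokenizer and the smoothed IDF formula are all assumptions made for illustration; the notebook's `VSM` class may tokenize and weight terms differently.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer (assumption; the notebook uses its own tokenizer).
    return text.lower().split()


def build_idf(corpus_tokens):
    # Smoothed inverse document frequency, computed over the documents only.
    n_docs = len(corpus_tokens)
    doc_freq = Counter(term for tokens in corpus_tokens for term in set(tokens))
    return {term: math.log((1 + n_docs) / (1 + count)) + 1
            for term, count in doc_freq.items()}


def tfidf(tokens, idf):
    # Terms outside the document vocabulary are ignored, then the vector is L2-normalized.
    tf = Counter(t for t in tokens if t in idf)
    weights = {t: count * idf[t] for t, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


def cosine(u, v):
    # Both vectors have unit length, so the cosine reduces to a dot product.
    return sum(w * v.get(t, 0.0) for t, w in u.items())


# Hypothetical toy texts standing in for test cases (documents) and bug reports (queries).
documents = {
    "TC_A": "scroll through a long web page",
    "TC_B": "open the downloads panel and click the button",
}
queries = {
    "BR_X": "page content is not scrollable when you scroll left",
    "BR_Y": "open button in the downloads panel does nothing",
}

doc_tokens = {name: tokenize(text) for name, text in documents.items()}
idf = build_idf(list(doc_tokens.values()))
doc_vecs = {name: tfidf(tokens, idf) for name, tokens in doc_tokens.items()}
query_vecs = {name: tfidf(tokenize(text), idf) for name, text in queries.items()}

# One score per (test case, bug report) pair.
sim = {(tc, br): cosine(dv, qv)
       for tc, dv in doc_vecs.items()
       for br, qv in query_vecs.items()}

for pair, score in sorted(sim.items()):
    print(pair, round(score, 3))
```

In this toy setup each bug report shares terms with only one test case, so the other pair scores zero.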

# Import Libraries

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from mod_finder_util import mod_finder_util
mod_finder_util.add_modules_origin_search_path()

import pandas as pd

from modules.models_runner.tc_br_models_runner import TC_BR_Runner
from modules.models_runner.tc_br_models_runner import TC_BR_Models_Hyperp
from modules.utils import aux_functions
from modules.utils import firefox_dataset_p2 as fd

import modules.utils.tokenizers as tok

from modules.models.vsm import VSM

from IPython.display import display

import warnings; warnings.simplefilter('ignore')

# Load Datasets

In [3]:
tcs = list(range(37, 59))
orc = fd.Tc_BR_Oracles.read_oracle_expert_df()
orc_subset = orc[orc.index.isin(tcs)]
#aux_functions.highlight_df(orc_subset)

OracleExpert.shape: (195, 91)


In [4]:
tcs = [13,37,60,155]
brs = [1292566,1267501]

testcases = fd.Datasets.read_testcases_df()
testcases = testcases[testcases.TC_Number.isin(tcs)]
bugreports = fd.Datasets.read_selected_bugreports_df()
bugreports = bugreports[bugreports.Bug_Number.isin(brs)]

print('tc.shape: {}'.format(testcases.shape))
print('br.shape: {}'.format(bugreports.shape))

TestCases.shape: (195, 12)
SelectedBugReports.shape: (91, 18)
tc.shape: (4, 12)
br.shape: (2, 18)


In [5]:
print(bugreports.iloc[0,:].Summary)
bugreports

New Private Browsing start-page overflows off the *left side of the window* (making content unscrollable) for small window sizes


| | Bug_Number | Summary | Platform | Component | Version | Creation_Time | Whiteboard | QA_Whiteboard | First_Comment_Text | First_Comment_Creation_Time | Status | Product | Priority | Resolution | Severity | Is_Confirmed | br_name | br_desc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 1267501 | New Private Browsing start-page overflows off ... | Unspecified | Private Browsing | 48 Branch | 2016-04-26T01:12:11Z | [fxprivacy] | | STR: 1. Open a new private browsing window. ... | 2016-04-26T01:12:11Z | RESOLVED | Firefox | P1 | FIXED | normal | True | BR_1267501_SRC | 1267501 New Private Browsing start-page overfl... |
| 32 | 1292566 | The "open" button in the subview for temporari... | Unspecified | Downloads Panel | 50 Branch | 2016-08-05T14:16:47Z | [fxprivacy] | | The Downloads Panel subview for blocked downlo... | 2016-08-05T14:16:47Z | VERIFIED | Firefox | P2 | FIXED | normal | True | BR_1292566_SRC | 1292566 The "open" button in the subview for t... |


In [6]:
testcases

| | TC_Number | TestDay | Feature_ID | Firefox_Feature | Gen_Title | Crt_Nr | Title | Preconditions | Steps | Expected_Result | tc_name | tc_desc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12 | 13 | 20160603 + 20160624 + 20161014 | 1 | New Awesome Bar | Awesome Bar Search | 1 | Default State | | 1. Launch Firefox.\t\n2. No AwesomeBar Entry | 1. Firefox launches without any issues.\n2. UR... | TC_13_TRG | 13 20160603 + 20160624 + 20161014 1 New Awesom... |
| 36 | 37 | 20160603 + 20160708 | 3 | APZ - Async Scrolling | APZ - Async Scrolling | 1 | Scroll through a long web page | - make sure layers.async-pan-zoom.enabled is t... | 1. Launch Firefox.\t\n2. Open: https://en.wiki... | 1. \n2.\n3. The scrolling is smooth, without a... | TC_37_TRG | 37 20160603 + 20160708 3 APZ - Async Scrolling... |
| 59 | 60 | 20160722 | 4 | Browser Customization | browser customization | 2 | Install and use complete themes | | 1. Install a few complete themes.\n2. Restart ... | 1. The user is able to initiate installation p... | TC_60_TRG | 60 20160722 4 Browser Customization browser cu... |
| 154 | 155 | 20161028 | 15 | Downloads Dropmaker | downloads dropmaker | 3 | The downloads button works properly no matter ... | | 1. Launch Firefox with a clean profile\t\n2. C... | 1. Firefox is successfully launched\n2. Custom... | TC_155_TRG | 155 20161028 15 Downloads Dropmaker downloads ... |


# Running VSM Model

In [7]:
corpus = testcases.tc_desc
query = bugreports.br_desc
test_cases_names = testcases.tc_name
bug_reports_names = bugreports.br_name

vsm_hyperp = TC_BR_Models_Hyperp.get_vsm_model_hyperp()
vsm_model = VSM(**vsm_hyperp)
vsm_model.set_name('VSM_Model_TC_BR')
vsm_model.recover_links(corpus, query, test_cases_names, bug_reports_names)

 ..Total processing time: 1.08 seconds


In [8]:
vsm_model.get_sim_matrix().shape

(4, 2)

In [9]:
sim_matrix = vsm_model.get_sim_matrix()
aux_functions.highlight_df(sim_matrix)

| tc_name | BR_1267501_SRC | BR_1292566_SRC |
|---|---|---|
| TC_13_TRG | 0.186374 | 0.0510993 |
| TC_37_TRG | 0.369537 | 0.0698247 |
| TC_60_TRG | 0.0277002 | 0.0337788 |
| TC_155_TRG | 0.0628534 | 0.750349 |
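
Each column of the similarity matrix holds the scores of one bug report against every test case, so a candidate trace link can be read off by ranking a column. The sketch below copies the scores shown above and keeps the best-scoring test case per bug report. The top-1 cut and the helper name `rank_test_cases` are illustrative assumptions, not necessarily the selection strategy used elsewhere in this project.

```python
# Scores copied from the similarity matrix above, keyed by bug report, then test case.
sim_scores = {
    "BR_1267501_SRC": {
        "TC_13_TRG": 0.186374,
        "TC_37_TRG": 0.369537,
        "TC_60_TRG": 0.0277002,
        "TC_155_TRG": 0.0628534,
    },
    "BR_1292566_SRC": {
        "TC_13_TRG": 0.0510993,
        "TC_37_TRG": 0.0698247,
        "TC_60_TRG": 0.0337788,
        "TC_155_TRG": 0.750349,
    },
}


def rank_test_cases(scores):
    # Hypothetical helper: sort (test case, score) pairs from most to least similar.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Keep only the best-scoring test case for each bug report.
best_links = {br: rank_test_cases(scores)[0][0] for br, scores in sim_scores.items()}
print(best_links)
```

With these scores, BR_1267501_SRC (a scrolling problem) is paired with TC_37_TRG (APZ - Async Scrolling), and BR_1292566_SRC (Downloads Panel) with TC_155_TRG (Downloads Dropmaker).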


### TF-IDF Application

In [10]:
print(bugreports.Summary.values[0])
print(bugreports.First_Comment_Text.values[0])

New Private Browsing start-page overflows off the *left side of the window* (making content unscrollable) for small window sizes
STR:  1. Open a new private browsing window.  2. Resize the window to be skinny, say 300-400px wide.  3. Try to scroll around horizontally to read the page's contents (using the scrollbars).  ACTUAL RESULTS: - If you scroll all the way to the left, you'll see that the page's contents overflow off the left side of the viewport, to the extent that they're unscrollable and hence unreadable. - If you scroll all the way to the right, you'll see that the page's background-color ends abruptly, and some text protrudes past that.   EXPECTED RESULTS: * Contents should be scrollable/readable. * No awkward background-color-ending in the region of the viewport that is scrollable.


In [23]:
q = query.values[0]

# Use a distinct name so the `tok` module imported earlier is not shadowed
tokenizer = vsm_model.vectorizer.tokenizer
words_list = tokenizer(q)

from collections import Counter
wordcount = Counter(words_list).most_common()

# Build the term-frequency table directly from the (term, count) pairs
df = pd.DataFrame(wordcount, columns=['term', 'tf'])

print(df.shape)

display(df.head())

(56, 2)


| | term | tf |
|---|---|---|
| 0 | private | 3 |
| 1 | browsing | 3 |
| 2 | content | 3 |
| 3 | window | 3 |
| 4 | scroll | 3 |


In [12]:
vsm_model.get_terms_matrix().shape

(4, 98)

In [24]:
vsm_model.get_query_vector().shape

(2, 98)
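
These shapes are consistent with computing the similarity matrix as a matrix product: a (4, 98) document matrix times the transpose of a (2, 98) query matrix gives a (4, 2) result. The sketch below shows this on toy data, assuming every row is L2-normalized so that the dot product equals the cosine. The vectors and the helper names are hypothetical, and whether `VSM` computes its scores exactly this way is not shown in this notebook.

```python
import math


def l2_normalize(row):
    # Scale a vector to unit length; all-zero vectors are returned unchanged.
    norm = math.sqrt(sum(x * x for x in row))
    return [x / norm for x in row] if norm else list(row)


def similarity_matrix(doc_rows, query_rows):
    # result[i][j] is the dot product of document i and query j.
    return [[sum(d * q for d, q in zip(doc, query)) for query in query_rows]
            for doc in doc_rows]


# Toy stand-ins: 4 documents and 2 queries over a 3-term vocabulary.
docs = [l2_normalize(r) for r in [[1, 0, 0], [1, 1, 0], [0, 0, 2], [0, 3, 4]]]
queries = [l2_normalize(r) for r in [[1, 1, 0], [0, 0, 1]]]

sim = similarity_matrix(docs, queries)
print(len(sim), len(sim[0]))  # rows = documents, columns = queries
```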

In [13]:
terms_matrix = pd.DataFrame(vsm_model.get_terms_matrix())
#terms_matrix.index = test_cases_names
aux_functions.highlight_df(terms_matrix)

Unnamed: 0,0
0,"(0, 56)	0.17630788676932263  (0, 8)	0.4472483972441562  (0, 10)	0.28547285848937193  (0, 75)	0.2236241986220781  (0, 24)	0.17630788676932263  (0, 81)	0.2236241986220781  (0, 54)	0.17630788676932263  (0, 44)	0.28547285848937193  (0, 33)	0.28547285848937193  (0, 9)	0.2236241986220781  (0, 32)	0.2236241986220781  (0, 40)	0.17630788676932263  (0, 89)	0.2236241986220781  (0, 26)	0.2236241986220781  (0, 78)	0.2236241986220781  (0, 34)	0.2236241986220781  (0, 58)	0.17630788676932263"
1,"(0, 10)	0.15348803449178755  (0, 44)	0.07674401724589378  (0, 33)	0.07674401724589378  (0, 40)	0.09479412911203762  (0, 58)	0.18958825822407524  (0, 5)	0.24046866586850485  (0, 7)	0.24046866586850485  (0, 74)	0.3607029988027573  (0, 73)	0.3607029988027573  (0, 47)	0.12023433293425242  (0, 93)	0.12023433293425242  (0, 48)	0.24046866586850485  (0, 83)	0.24046866586850485  (0, 86)	0.24046866586850485  (0, 20)	0.24046866586850485  (0, 57)	0.09479412911203762  (0, 35)	0.12023433293425242  (0, 92)	0.12023433293425242  (0, 53)	0.12023433293425242  (0, 94)	0.12023433293425242  (0, 6)	0.12023433293425242  (0, 42)	0.3607029988027573  (0, 80)	0.12023433293425242  (0, 21)	0.12023433293425242  (0, 79)	0.12023433293425242  (0, 41)	0.12023433293425242  (0, 69)	0.12023433293425242"
2,"(0, 56)	0.0638579654582327  (0, 24)	0.0638579654582327  (0, 54)	0.0638579654582327  (0, 12)	0.2554318618329308  (0, 22)	0.16199146405648335  (0, 37)	0.16199146405648335  (0, 90)	0.08099573202824167  (0, 18)	0.4049786601412084  (0, 85)	0.7289615882541751  (0, 71)	0.08099573202824167  (0, 38)	0.16199146405648335  (0, 91)	0.08099573202824167  (0, 0)	0.08099573202824167  (0, 36)	0.08099573202824167  (0, 66)	0.08099573202824167  (0, 72)	0.08099573202824167  (0, 31)	0.0638579654582327  (0, 43)	0.08099573202824167  (0, 39)	0.16199146405648335  (0, 70)	0.08099573202824167  (0, 65)	0.08099573202824167  (0, 1)	0.08099573202824167  (0, 46)	0.08099573202824167  (0, 64)	0.08099573202824167  (0, 25)	0.08099573202824167  (0, 4)	0.08099573202824167  (0, 3)	0.08099573202824167  (0, 76)	0.08099573202824167  (0, 49)	0.08099573202824167"
3,"(0, 10)	0.1479248251978957  (0, 44)	0.04930827506596524  (0, 33)	0.09861655013193048  (0, 57)	0.06090552932508972  (0, 12)	0.06090552932508972  (0, 31)	0.06090552932508972  (0, 28)	0.6952584704459103  (0, 30)	0.15450188232131337  (0, 13)	0.46350564696394014  (0, 97)	0.07725094116065669  (0, 68)	0.07725094116065669  (0, 50)	0.07725094116065669  (0, 63)	0.07725094116065669  (0, 88)	0.07725094116065669  (0, 15)	0.07725094116065669  (0, 67)	0.07725094116065669  (0, 16)	0.07725094116065669  (0, 51)	0.07725094116065669  (0, 77)	0.07725094116065669  (0, 23)	0.15450188232131337  (0, 29)	0.07725094116065669  (0, 27)	0.07725094116065669  (0, 55)	0.07725094116065669  (0, 61)	0.07725094116065669  (0, 84)	0.15450188232131337  (0, 60)	0.07725094116065669  (0, 59)	0.07725094116065669  (0, 82)	0.07725094116065669  (0, 45)	0.07725094116065669  (0, 52)	0.07725094116065669  (0, 62)	0.07725094116065669  (0, 96)	0.07725094116065669  (0, 2)	0.07725094116065669  (0, 17)	0.15450188232131337  (0, 95)	0.07725094116065669  (0, 19)	0.07725094116065669  (0, 87)	0.07725094116065669  (0, 11)	0.07725094116065669  (0, 14)	0.07725094116065669"


In [14]:
len(vsm_model.vectorizer.get_feature_names())

98

In [29]:
import modules.utils.tokenizers as tok

tokenizer = tok.WordNetBased_LemmaTokenizer()
tokens = [tokenizer(doc) for doc in bugreports.br_desc]

# Collect the distinct tokens across all bug reports, in sorted order
final_tokens = sorted({t for token_list in tokens for t in token_list})
print(final_tokens)

dff = pd.DataFrame(final_tokens)
print(dff.shape)
#display(dff)

['abruptly', 'actual', 'addition', 'additional', 'around', 'ask', 'awkward', 'blocked', 'branch', 'browsing', 'button', 'case', 'clarify', 'confirmation', 'content', 'contents', 'designed', 'designer', 'dialog', 'downloads', 'end', 'expected', 'experience', 'extent', 'firefox', 'fixed', 'for', 'fxprivacy', 'hence', 'horizontally', 'if', 'left', 'making', 'malware', 'many', 'message', 'modal', 'nan', 'needed', 'new', 'no', 'normal', 'open', 'overflow', 'page', 'panel', 'past', 'potentially', 'presence', 'previous', 'private', 'protrudes', 'read', 'region', 'regression', 'replacement', 'resize', 'resolved', 'results', 'right', 'say', 'scroll', 'scrollable', 'scrollbars', 'see', 'show', 'side', 'size', 'skinny', 'small', 'special', 'str', 'subview', 'surely', 'temporarily', 'text', 'the', 'time', 'true', 'try', 'uncommon', 'unreadable', 'unscrollable', 'unspecified', 'unwanted', 'user', 'using', 'verified', 'viewport', 'wanted', 'way', 'wide', 'window']
(93, 1)


In [30]:
X = vsm_model.vectorizer.transform(bugreports.br_desc)
df1 = pd.DataFrame(X.T.toarray())
df1.index = vsm_model.vectorizer.get_feature_names()
df1.rename(columns={0:'BR_1267501_SRC'}, inplace=True)
print(df1.shape)
aux_functions.highlight_df(df1)

(98, 2)


| | BR_1267501_SRC | 1 |
|---|---|---|
| able | 0.0 | 0.0 |
| active | 0.0 | 0.0 |
| aero | 0.0 | 0.0 |
| appearance | 0.0 | 0.0 |
| appears | 0.0 | 0.0 |
| apz | 0.0 | 0.0 |
| arrow | 0.0 | 0.0 |
| async | 0.0 | 0.0 |
| awesome | 0.0 | 0.0 |
| awesomebar | 0.0 | 0.0 |


In [31]:
query_vec = vsm_model.get_query_vector()
query_vec = pd.DataFrame(query_vec)
query_vec.index = bug_reports_names
query_vec

| br_name | 0 |
|---|---|
| BR_1267501_SRC | (0, 96)\t0.5501920389608993\n (0, 92)\t0.18... |
| BR_1292566_SRC | (0, 91)\t0.1590805021550478\n (0, 86)\t0.15... |


In [33]:
vsm_model.docs_feats_df

| | mrw | dl |
|---|---|---|
| TC_13_TRG | [awesome, firefox, launch, bar, url, awesomebar] | 23 |
| TC_37_TRG | [scroll, scrolling, key, make, true, apz] | 43 |
| TC_60_TRG | [theme, complete, browser, install, installati... | 52 |
| TC_155_TRG | [downloads, button, customize, dropmaker, colo... | 65 |
| BR_1267501_SRC | [scroll, window, page, new, using, true] | 78 |
| BR_1292566_SRC | [downloads, panel, button, open, previous, user] | 66 |
