# Introduction - Using COSINE Metric

This notebook demonstrates how **LSI (Latent Semantic Indexing)**, an Information Retrieval technique, can be used to recover trace links between Test Cases and Bug Reports.

We model our study as follows:

* The title, summary and description of each bug report together form a single query.
* The full content of each test case is treated as a single document to be retrieved for a query.
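The query/document setup above can be sketched end to end on toy data. This is a minimal illustration, not the `LSI` class used later in the notebook: the whitespace tokenizer, the raw term counts, the toy texts and the helper names `tokenize`, `count_matrix` and `lsi_cosine` are all assumptions made for the example.

```python
import numpy as np


def tokenize(text):
    # Simplified stand-in for the notebook's lemma tokenizer (assumption).
    return text.lower().split()


def count_matrix(texts, vocab):
    # Rows are texts, columns are vocabulary terms; unknown terms are dropped.
    index = {term: i for i, term in enumerate(vocab)}
    counts = np.zeros((len(texts), len(vocab)))
    for row, text in enumerate(texts):
        for term in tokenize(text):
            if term in index:
                counts[row, index[term]] += 1.0
    return counts


def lsi_cosine(docs, queries, k=None):
    # The vocabulary is built from the documents (test cases) only.
    vocab = sorted({term for doc in docs for term in tokenize(doc)})
    doc_counts = count_matrix(docs, vocab)
    query_counts = count_matrix(queries, vocab)

    # SVD of the document-by-term matrix, truncated to k components.
    u, s, vt = np.linalg.svd(doc_counts, full_matrices=False)
    k = len(s) if k is None else k
    doc_vecs = u[:, :k] * s[:k]            # documents in the latent space
    query_vecs = query_counts @ vt[:k].T   # queries folded into the same space

    # Cosine similarity between every document and every query.
    doc_norms = np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    query_norms = np.linalg.norm(query_vecs, axis=1, keepdims=True)
    doc_unit = doc_vecs / np.maximum(doc_norms, 1e-12)
    query_unit = query_vecs / np.maximum(query_norms, 1e-12)
    return doc_unit @ query_unit.T         # shape: (n_docs, n_queries)


# Toy texts, loosely inspired by the data set but invented for this sketch.
toy_test_cases = [
    "launch firefox awesome bar url",
    "launch firefox scroll long web page smooth scrolling",
    "install complete theme restart firefox",
    "firefox downloads button panel open",
]
toy_bug_reports = [
    "private browsing page overflows window scroll horizontally",
    "open button in downloads panel subview",
]

sims = lsi_cosine(toy_test_cases, toy_bug_reports)
print(sims.round(3))
print(sims.argmax(axis=0))  # index of the best-matching test case per bug report
```

Each column of `sims` corresponds to one bug report (query) and each row to one test case (document), which is the same orientation as the similarity matrix produced later in the notebook.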

# Import Libraries

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from mod_finder_util import mod_finder_util
mod_finder_util.add_modules_origin_search_path()

import pandas as pd

from modules.models_runner.tc_br_models_runner import TC_BR_Runner
from modules.models_runner.tc_br_models_runner import TC_BR_Models_Hyperp
from modules.utils import aux_functions
from modules.utils import firefox_dataset_p2 as fd

import modules.utils.tokenizers as tok

from modules.models.lsi import LSI

from IPython.display import display

import warnings; warnings.simplefilter('ignore')

# Load Datasets

In [3]:
tcs = list(range(37, 59))
orc = fd.Tc_BR_Oracles.read_oracle_expert_df()
orc_subset = orc[orc.index.isin(tcs)]
#aux_functions.highlight_df(orc_subset)

OracleExpert.shape: (195, 91)


In [4]:
tcs = [13,37,60,155]
brs = [1292566,1267501]

testcases = fd.Datasets.read_testcases_df()
testcases = testcases[testcases.TC_Number.isin(tcs)]
bugreports = fd.Datasets.read_selected_bugreports_df()
bugreports = bugreports[bugreports.Bug_Number.isin(brs)]

print('tc.shape: {}'.format(testcases.shape))
print('br.shape: {}'.format(bugreports.shape))

TestCases.shape: (195, 12)
SelectedBugReports.shape: (91, 18)
tc.shape: (4, 12)
br.shape: (2, 18)


In [5]:
print(bugreports.iloc[0,:].Summary)
bugreports

New Private Browsing start-page overflows off the *left side of the window* (making content unscrollable) for small window sizes


Unnamed: 0,Bug_Number,Summary,Platform,Component,Version,Creation_Time,Whiteboard,QA_Whiteboard,First_Comment_Text,First_Comment_Creation_Time,Status,Product,Priority,Resolution,Severity,Is_Confirmed,br_name,br_desc
6,1267501,New Private Browsing start-page overflows off ...,Unspecified,Private Browsing,48 Branch,2016-04-26T01:12:11Z,[fxprivacy],,STR: 1. Open a new private browsing window. ...,2016-04-26T01:12:11Z,RESOLVED,Firefox,P1,FIXED,normal,True,BR_1267501_SRC,1267501 New Private Browsing start-page overfl...
32,1292566,"The ""open"" button in the subview for temporari...",Unspecified,Downloads Panel,50 Branch,2016-08-05T14:16:47Z,[fxprivacy],,The Downloads Panel subview for blocked downlo...,2016-08-05T14:16:47Z,VERIFIED,Firefox,P2,FIXED,normal,True,BR_1292566_SRC,"1292566 The ""open"" button in the subview for t..."


In [6]:
testcases

Unnamed: 0,TC_Number,TestDay,Feature_ID,Firefox_Feature,Gen_Title,Crt_Nr,Title,Preconditions,Steps,Expected_Result,tc_name,tc_desc
12,13,20160603 + 20160624 + 20161014,1,New Awesome Bar,Awesome Bar Search,1,Default State,,1. Launch Firefox.\t\n2. No AwesomeBar Entry,1. Firefox launches without any issues.\n2. UR...,TC_13_TRG,13 20160603 + 20160624 + 20161014 1 New Awesom...
36,37,20160603 + 20160708,3,APZ - Async Scrolling,APZ - Async Scrolling,1,Scroll through a long web page,- make sure layers.async-pan-zoom.enabled is t...,1. Launch Firefox.\t\n2. Open: https://en.wiki...,"1. \n2.\n3. The scrolling is smooth, without a...",TC_37_TRG,37 20160603 + 20160708 3 APZ - Async Scrolling...
59,60,20160722,4,Browser Customization,browser customization,2,Install and use complete themes,,1. Install a few complete themes.\n2. Restart ...,1. The user is able to initiate installation p...,TC_60_TRG,60 20160722 4 Browser Customization browser cu...
154,155,20161028,15,Downloads Dropmaker,downloads dropmaker,3,The downloads button works properly no matter ...,,1. Launch Firefox with a clean profile\t\n2. C...,1. Firefox is successfully launched\n2. Custom...,TC_155_TRG,155 20161028 15 Downloads Dropmaker downloads ...


# Running LSI Model

In [7]:
corpus = testcases.tc_desc
query = bugreports.br_desc
test_cases_names = testcases.tc_name
bug_reports_names = bugreports.br_name

lsi_hyperp = TC_BR_Models_Hyperp.get_lsi_model_hyperp()
lsi_model = LSI(**lsi_hyperp)
lsi_model.set_name('LSI_Model_TC_BR')
lsi_model.recover_links(corpus, query, test_cases_names, bug_reports_names)

 ..Total processing time: 1.12 seconds


In [8]:
lsi_model.get_sim_matrix().shape

(4, 2)

In [9]:
sim_matrix = lsi_model.get_sim_matrix()
aux_functions.highlight_df(sim_matrix)

tc_name,BR_1267501_SRC,BR_1292566_SRC
TC_13_TRG,0.470123,0.0679615
TC_37_TRG,0.932144,0.092866
TC_60_TRG,0.0698727,0.0449254
TC_155_TRG,0.158545,0.997955
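One way to read this matrix is to rank the test cases for each bug report and keep the pairs whose score clears a cut-off. The sketch below copies the values printed above into plain Python; the helper names `rank_test_cases` and `links_above` and the 0.9 threshold are hypothetical choices for illustration, not something the notebook or its `LSI` class prescribes.

```python
# Similarity values copied from the matrix printed above.
sim = {
    "TC_13_TRG":  {"BR_1267501_SRC": 0.470123,  "BR_1292566_SRC": 0.0679615},
    "TC_37_TRG":  {"BR_1267501_SRC": 0.932144,  "BR_1292566_SRC": 0.092866},
    "TC_60_TRG":  {"BR_1267501_SRC": 0.0698727, "BR_1292566_SRC": 0.0449254},
    "TC_155_TRG": {"BR_1267501_SRC": 0.158545,  "BR_1292566_SRC": 0.997955},
}


def rank_test_cases(sim, br_name):
    # Test cases ordered from most to least similar to the given bug report.
    return sorted(sim, key=lambda tc: sim[tc][br_name], reverse=True)


def links_above(sim, threshold):
    # (test case, bug report) pairs whose similarity reaches the threshold.
    return [(tc, br)
            for tc, row in sim.items()
            for br, score in row.items()
            if score >= threshold]


ranking = rank_test_cases(sim, "BR_1267501_SRC")
links = links_above(sim, threshold=0.9)  # 0.9 is an arbitrary example cut-off
print(ranking)
print(links)
```

With these values the top-ranked test case is `TC_37_TRG` for `BR_1267501_SRC` and `TC_155_TRG` for `BR_1292566_SRC`.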


### TF-IDF Application

In [10]:
print(bugreports.Summary.values[0])
print(bugreports.First_Comment_Text.values[0])

New Private Browsing start-page overflows off the *left side of the window* (making content unscrollable) for small window sizes
STR:  1. Open a new private browsing window.  2. Resize the window to be skinny, say 300-400px wide.  3. Try to scroll around horizontally to read the page's contents (using the scrollbars).  ACTUAL RESULTS: - If you scroll all the way to the left, you'll see that the page's contents overflow off the left side of the viewport, to the extent that they're unscrollable and hence unreadable. - If you scroll all the way to the right, you'll see that the page's background-color ends abruptly, and some text protrudes past that.   EXPECTED RESULTS: * Contents should be scrollable/readable. * No awkward background-color-ending in the region of the viewport that is scrollable.


In [11]:
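from collections import Counter

q = query.values[0]

# Reuse the tokenizer attached to the LSI vectorizer; a separate name avoids
# shadowing the `tok` module alias imported at the top of the notebook.
query_tokenizer = lsi_model.vectorizer.tokenizer
words_list = query_tokenizer(q)

# Term frequencies of the first query, most frequent first
wordcount = Counter(words_list).most_common()
print(wordcount)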

[('private', 3), ('browsing', 3), ('content', 3), ('window', 3), ('scroll', 3), ('page', 3), ('new', 2), ('overflow', 2), ('side', 2), ('unscrollable', 2), ('results', 2), ('if', 2), ('way', 2), ('left', 2), ('see', 2), ('viewport', 2), ('making', 1), ('small', 1), ('size', 1), ('unspecified', 1), ('branch', 1), ('fxprivacy', 1), ('nan', 1), ('str', 1), ('open', 1), ('resize', 1), ('skinny', 1), ('say', 1), ('wide', 1), ('try', 1), ('around', 1), ('horizontally', 1), ('read', 1), ('using', 1), ('scrollbars', 1), ('actual', 1), ('extent', 1), ('hence', 1), ('unreadable', 1), ('right', 1), ('end', 1), ('abruptly', 1), ('text', 1), ('protrudes', 1), ('past', 1), ('expected', 1), ('contents', 1), ('no', 1), ('awkward', 1), ('region', 1), ('scrollable', 1), ('resolved', 1), ('firefox', 1), ('fixed', 1), ('normal', 1), ('true', 1)]
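The counts above are raw term frequencies. A TF-IDF weighting scales each of them by how rare the term is across documents and then normalizes the vector. The sketch below shows that computation in plain Python, using the smoothed idf `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is the default form in scikit-learn's `TfidfVectorizer`. Whether `lsi_model.vectorizer` uses exactly these settings is not shown in this notebook, so treat that as an assumption. The two token lists are toy data.

```python
import math
from collections import Counter


def tfidf_vectors(token_lists):
    n_docs = len(token_lists)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    # Smoothed inverse document frequency.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1.0
           for term, count in doc_freq.items()}

    vectors = []
    for tokens in token_lists:
        tf = Counter(tokens)
        weights = {term: freq * idf[term] for term, freq in tf.items()}
        # L2-normalize so that every document vector has unit length.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


# Toy token lists (assumption): 'firefox' occurs in both, the rest in one only.
doc_a = ["scroll", "scroll", "page", "window", "firefox"]
doc_b = ["downloads", "panel", "button", "firefox"]

vectors = tfidf_vectors([doc_a, doc_b])
for term, weight in sorted(vectors[0].items(), key=lambda item: -item[1]):
    print(term, round(weight, 3))
```

In `doc_a`, `scroll` ends up with the largest weight because it is both frequent in the document and absent from the other one, while `firefox` gets the smallest weight because it appears everywhere.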


In [12]:
lsi_model.get_svd_matrix().shape

(4, 4)

In [13]:
svd_matrix = pd.DataFrame(lsi_model.get_svd_matrix())
#svd_matrix.index = test_cases_names
aux_functions.highlight_df(svd_matrix)

Unnamed: 0,0,1,2,3
0,0.720291,-0.0532987,-0.127135,0.679835
1,0.635239,-0.332496,-0.399589,-0.57118
2,0.180721,0.926883,-0.30391,-0.125959
3,0.48101,0.170679,0.832273,-0.216379


In [14]:
len(lsi_model.vectorizer.get_feature_names())

98

In [16]:
import modules.utils.tokenizers as tok

tokenizer = tok.WordNetBased_LemmaTokenizer()

# Collect the distinct tokens over all bug report descriptions
final_tokens = set()
for doc in bugreports.br_desc:
    final_tokens.update(tokenizer(doc))

final_tokens = sorted(final_tokens)
print(final_tokens)

dff = pd.DataFrame(final_tokens)
print(dff.shape)
#display(dff)

['abruptly', 'actual', 'addition', 'additional', 'around', 'ask', 'awkward', 'blocked', 'branch', 'browsing', 'button', 'case', 'clarify', 'confirmation', 'content', 'contents', 'designed', 'designer', 'dialog', 'downloads', 'end', 'expected', 'experience', 'extent', 'firefox', 'fixed', 'for', 'fxprivacy', 'hence', 'horizontally', 'if', 'left', 'making', 'malware', 'many', 'message', 'modal', 'nan', 'needed', 'new', 'no', 'normal', 'open', 'overflow', 'page', 'panel', 'past', 'potentially', 'presence', 'previous', 'private', 'protrudes', 'read', 'region', 'regression', 'replacement', 'resize', 'resolved', 'results', 'right', 'say', 'scroll', 'scrollable', 'scrollbars', 'see', 'show', 'side', 'size', 'skinny', 'small', 'special', 'str', 'subview', 'surely', 'temporarily', 'text', 'the', 'time', 'true', 'try', 'uncommon', 'unreadable', 'unscrollable', 'unspecified', 'unwanted', 'user', 'using', 'verified', 'viewport', 'wanted', 'way', 'wide', 'window']
(93, 1)


In [20]:
X = lsi_model.vectorizer.transform(bugreports.br_desc)
df1 = pd.DataFrame(X.T.toarray())
df1.index = lsi_model.vectorizer.get_feature_names()
df1.rename(columns={0:'BR_1267501_SRC',1:'BR_1292566_SRC'}, inplace=True)
print(df1.shape)
aux_functions.highlight_df(df1)

(98, 2)


Unnamed: 0,BR_1267501_SRC,BR_1292566_SRC
able,0.0,0.0
active,0.0,0.0
aero,0.0,0.0
appearance,0.0,0.0
appears,0.0,0.0
apz,0.0,0.0
arrow,0.0,0.0
async,0.0,0.0
awesome,0.0,0.0
awesomebar,0.0,0.0


In [21]:
query_vec = lsi_model._query_vector
query_vec = pd.DataFrame(query_vec)
query_vec.index = bug_reports_names
query_vec

br_name,0,1,2,3
BR_1267501_SRC,0.340724,-0.096243,-0.132655,-0.119206
BR_1292566_SRC,0.377781,0.133219,0.603442,-0.201805


In [23]:
from sklearn.metrics import pairwise
results = pd.DataFrame(pairwise.cosine_similarity(X=svd_matrix, Y=query_vec))
results.index = test_cases_names
results.rename(columns={0:bug_reports_names.values[0],
                        1:bug_reports_names.values[1]}, inplace=True)
aux_functions.highlight_df(results)

tc_name,BR_1267501_SRC,BR_1292566_SRC
TC_13_TRG,0.470123,0.0679615
TC_37_TRG,0.932144,0.092866
TC_60_TRG,0.0698727,0.0449254
TC_155_TRG,0.158545,0.997955
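The table above can be checked by hand from the numbers already printed in this notebook. The sketch below copies the rows of the SVD matrix and the two query vectors into plain Python and recomputes the cosine of each pair. It assumes the SVD rows 0 to 3 follow the order of `test_cases_names` (TC_13, TC_37, TC_60, TC_155), since the index assignment is commented out in that cell.

```python
import math

# Rows of the SVD matrix as printed above, labelled in test-case order (assumption).
svd_rows = {
    "TC_13_TRG":  [0.720291, -0.0532987, -0.127135, 0.679835],
    "TC_37_TRG":  [0.635239, -0.332496, -0.399589, -0.57118],
    "TC_60_TRG":  [0.180721, 0.926883, -0.30391, -0.125959],
    "TC_155_TRG": [0.48101, 0.170679, 0.832273, -0.216379],
}

# Query vectors as printed above.
query_rows = {
    "BR_1267501_SRC": [0.340724, -0.096243, -0.132655, -0.119206],
    "BR_1292566_SRC": [0.377781, 0.133219, 0.603442, -0.201805],
}


def cosine(u, v):
    # cos(u, v) = (u . v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


recomputed = {
    tc: {br: cosine(doc_vec, query_vec) for br, query_vec in query_rows.items()}
    for tc, doc_vec in svd_rows.items()
}

for tc, row in recomputed.items():
    print(tc, {br: round(score, 4) for br, score in row.items()})
```

Up to the rounding of the printed inputs, the recomputed values agree with both this table and the output of `lsi_model.get_sim_matrix()` earlier in the notebook.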


In [24]:
import numpy as np

tokenizer = tok.WordNetBased_LemmaTokenizer()

# Flatten the tokens of all test case descriptions into one list
final_tokens = [t for doc in testcases.tc_desc for t in tokenizer(doc)]

unique_tokens = np.unique(final_tokens)
print(unique_tokens)
print(len(unique_tokens))

['able' 'active' 'aero' 'after' 'all' 'appearance' 'appears' 'apz' 'arrow'
 'async' 'awesome' 'awesomebar' 'bar' 'blue' 'browser' 'button' 'changed'
 'clean' 'click' 'color' 'complete' 'completed' 'config' 'ctrl'
 'customization' 'customize' 'default' 'disabled' 'display' 'download'
 'downloads' 'drag' 'dropmaker' 'enabled' 'entry' 'firefox' 'home' 'http'
 'initiate' 'install' 'installation' 'installed' 'issue' 'jerkiness' 'key'
 'latest' 'launch' 'launched' 'least' 'lightweight' 'long' 'make'
 'manager' 'matter' 'menu' 'mode' 'mouse' 'nan' 'navigation' 'new' 'no'
 'on' 'once' 'open' 'page' 'panel' 'perform' 'place' 'placed' 'positioned'
 'previous' 'previously' 'process' 'profile' 'properly' 'rendering'
 'replaces' 'restart' 'restarted' 'scroll' 'scrolling' 'search' 'section'
 'select' 'set' 'smooth' 'space' 'state' 'successfully' 'sure' 'tab' 'the'
 'theme' 'true' 'turn' 'ui' 'url' 'use' 'user' 'using' 'web' 'wheel'
 'white' 'windows' 'without' 'work']
106


### Term-by-Document Matrix - SVD Matrix

In [26]:
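# Assuming svd_model is scikit-learn's TruncatedSVD, components_ has one row per
# latent component, so the transpose below is a term-by-component matrix. The
# TC_* labels assigned below name component positions, not individual test cases.
df = pd.DataFrame(lsi_model.svd_model.components_.T)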
df.index = lsi_model.vectorizer.get_feature_names()
df.rename(columns={0:'TC_13_TRG',1:'TC_37_TRG',2:'TC_60_TRG',3:'TC_155_TRG'}, inplace=True)
print(df.shape)
aux_functions.highlight_df(df)

(98, 4)


Unnamed: 0,TC_13_TRG,TC_37_TRG,TC_60_TRG,TC_155_TRG
able,0.0123381,0.0749508,-0.0256177,-0.0119869
active,0.0123381,0.0749508,-0.0256177,-0.0119869
aero,0.0313209,0.0131635,0.0669119,-0.0196397
appearance,0.0123381,0.0749508,-0.0256177,-0.0119869
appears,0.0123381,0.0749508,-0.0256177,-0.0119869
apz,0.128757,-0.0798242,-0.100001,-0.161379
arrow,0.0643787,-0.0399121,-0.0500006,-0.0806894
async,0.128757,-0.0798242,-0.100001,-0.161379
awesome,0.27154,-0.0237988,-0.0591763,0.357246
awesomebar,0.13577,-0.0118994,-0.0295881,0.178623


In [27]:
lsi_model.docs_feats_df

Unnamed: 0,mrw,dl
TC_13_TRG,"[awesome, firefox, launch, bar, url, awesomebar]",23
TC_37_TRG,"[scroll, scrolling, key, make, true, apz]",43
TC_60_TRG,"[theme, complete, browser, install, installati...",52
TC_155_TRG,"[downloads, button, customize, dropmaker, colo...",65
BR_1267501_SRC,"[scroll, window, page, new, using, true]",78
BR_1292566_SRC,"[downloads, panel, button, open, previous, user]",66
