# Continuous Integration

## 1. Automatic Fault Detection

The goal is to quickly find which Change-List (CL) in a range of revisions caused a regression.

<img src="CL.png" alt="Drawing" style="width: 500px;"/>

Tests are executed periodically, for example, every $N$ CLs. In this case, at $CL_{G}$ all tests are passing (*Green*) and at $CL_{R}$ some tests are failing (*Red*). Hence, there has to be some **Culprit CL** in the range $[CL_{G}, CL_{R}]$ that caused a regression.

One way to find the culprit is to search the range $[CL_{G}, CL_{R}]$, for example with *binary search*. However, to obtain results faster, we can use machine learning models that tell us where to look first, that is, which **CLs** are most likely to have caused the regression.
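
The binary search over the range can be sketched as follows. This is a minimal illustration, not the pipeline's actual implementation: `find_culprit` and the `is_green` callback (which stands in for running the test suite at a given CL) are hypothetical names, and the sketch assumes a single regression that stays red in every later CL.

```python
from typing import Callable, List, Sequence


def find_culprit(cls: Sequence[int], is_green: Callable[[int], bool]) -> int:
    """Return the first CL in `cls` at which the tests turn red.

    Assumes cls[0] is green, cls[-1] is red, and every CL after the
    culprit stays red (one persistent regression).
    """
    lo, hi = 0, len(cls) - 1  # invariant: cls[lo] is green, cls[hi] is red
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_green(cls[mid]):
            lo = mid
        else:
            hi = mid
    return cls[hi]


# Hypothetical range: CL 100 is green, CL 116 is red, CL 109 broke the build.
revisions = list(range(100, 117))
runs: List[int] = []  # records which CLs were actually tested


def is_green(cl: int) -> bool:
    runs.append(cl)
    return cl < 109


culprit = find_culprit(revisions, is_green)
print(f"culprit: CL {culprit}, test runs needed: {len(runs)}")
```

With 16 candidate CLs between green and red, the search needs 4 test runs instead of up to 15 for a linear scan. A model that ranks CLs by suspiciousness aims to reduce this further by testing the most likely culprits first.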

This way, we are able to notify the developers responsible for those changes and catch/correct them early on.

## 2. Optimizing Regression Testing

Above, we showed how CLs can be prioritized to avoid testing "unnecessary" or "non-revealing" CLs. However, if a CL is very suspicious but the tests take 8+ hours to show whether the change is defective, the ranking is of little use.

So in parallel to Defect Prediction, the way Regression Testing is done can be optimized by applying these techniques:

- **Test Case Minimisation**
- **Test Case Selection**
- **Test Case Prioritization**

This way, we propose to reduce the lag time between a commit and project status feedback by a significant amount, saving time and resources.
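
As a rough illustration of prioritization and selection, the sketch below orders tests by predicted failure probability per minute of run time and then keeps the tests that fit in a time budget. The `CandidateTest` class, the probabilities and the durations are all hypothetical; this greedy rule is one simple heuristic, not the method used in this project.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CandidateTest:
    name: str
    fail_prob: float  # hypothetical model output in [0, 1]
    minutes: float    # hypothetical run time


def prioritize(tests: List[CandidateTest]) -> List[CandidateTest]:
    """Order tests by expected failures found per minute, highest first."""
    return sorted(tests, key=lambda t: t.fail_prob / t.minutes, reverse=True)


def select(tests: List[CandidateTest], budget_minutes: float) -> List[CandidateTest]:
    """Greedily keep the highest-priority tests that fit in the time budget."""
    chosen, used = [], 0.0
    for t in prioritize(tests):
        if used + t.minutes <= budget_minutes:
            chosen.append(t)
            used += t.minutes
    return chosen


suite = [
    CandidateTest("integration", 0.30, 60.0),
    CandidateTest("parser", 0.20, 5.0),
    CandidateTest("smoke", 0.05, 1.0),
    CandidateTest("ui", 0.10, 30.0),
]

ordered = [t.name for t in prioritize(suite)]
selected = [t.name for t in select(suite, budget_minutes=40.0)]
print("priority order:", ordered)
print("selected within 40 minutes:", selected)
```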

## 3. Notification

Now we can rank **CLs** by suspiciousness and notify developers shortly after they submit a change, so they can fix it, take a closer look, or apply more tests to a specific change list.

***

# Implementation 

These algorithms are applied in the following situations: 
1. **Open Source Projects**: datasets that are available online (in this case, one dataset for defect prediction, *Bugzilla*, and another for test case failure prediction, *Closure-Compiler*)
2. **Real-World Data**: source control log obtained over 10 years from a company.

In [1]:
import pandas as pd
import numpy as np

## 1. Features used in Defect Prediction:

From the paper "*Deep Learning for Just-In-Time Defect Prediction*" (2015), by Xinli Yang et al.

- **id**: Unique identifier of CL
- **author**: Name of the developer responsible for the change
- **timestamp**: Commit time
- **ns**: The number of modified subsystems
- **nd**: The number of modified directories
- **nf**: The number of modified files
- **entropy**: Distribution of modified code across each file
- **la**: Lines of code added
- **ld**: Lines of code deleted
- **lt**: Lines of code in a file before the change
- **fix**: Whether or not the change is a defect fix
- **ndev**: The number of developers that changed the modified files
- **age**: The average time interval between the last and the current change
- **nuc**: The number of unique changes to the modified files
- **exp**: Developer experience
- **rexp**: Recent developer experience
- **sexp**: Developer experience on a subsystem
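
To make the **entropy** feature concrete, the sketch below computes the Shannon entropy of the modified lines spread across the files of one change. The function name and the `normalize` option are assumptions for illustration; the dataset may use a different variant (for example, one normalised by the number of files).

```python
import math
from typing import Sequence


def change_entropy(lines_per_file: Sequence[int], normalize: bool = False) -> float:
    """Shannon entropy (base 2) of how a change's modified lines are spread over files."""
    total = sum(lines_per_file)
    # Files with no modified lines contribute nothing to the entropy.
    probs = [n / total for n in lines_per_file if n > 0]
    h = -sum(p * math.log2(p) for p in probs)
    if normalize and len(probs) > 1:
        h /= math.log2(len(probs))  # scale to [0, 1]
    return h


even = change_entropy([10, 10, 10, 10])    # change spread evenly over 4 files
focused = change_entropy([37, 1, 1, 1])    # change concentrated in one file
print(f"even: {even:.3f}, focused: {focused:.3f}")
```

A change scattered evenly across many files has high entropy, while a change concentrated in one file has entropy close to 0.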

### Label
- **Per change**: Suspiciousness, a binary label: 1 (defect-inducing) or 0 (clean)

## Open Source Dataset - Bugzilla

In [2]:
bugzilla = pd.read_csv('../jit/input/bugzilla.csv')

In [3]:
bugzilla.head()

Unnamed: 0,transactionid,commitdate,ns,nm,nf,entropy,la,ld,lt,fix,ndev,pd,npt,exp,rexp,sexp,bug
0,3,2001/12/12 17:41,1,1,3,0.57938,0.09362,0.0,480.666667,1,14,596,0.666667,143,133.5,129,1
1,7,1999/10/12 12:57,1,1,1,0.0,0.0,0.0,398.0,1,1,0,1.0,140,140.0,137,1
2,8,2002/5/15 16:55,3,3,52,0.739279,0.183477,0.208913,283.519231,0,23,15836,0.75,984,818.65,978,0
3,9,2002/1/21 15:37,1,1,8,0.685328,0.016039,0.01288,514.375,1,21,1281,1.0,579,479.25,550,0
4,10,2001/12/19 16:44,2,2,38,0.769776,0.091829,0.072746,366.815789,1,21,6565,0.763158,413,313.25,405,0


In [4]:
print(f"This data set contains {bugzilla.shape[0]} instances and {bugzilla.shape[1]} features")

This data set contains 4620 instances and 17 features


In [5]:
target_count = bugzilla['bug'].value_counts()

# Use the counts directly: indexing the Series with a position only works by
# coincidence here, because the class labels happen to be 0 and 1.
minority = target_count.min()
majority = target_count.max()

print('Minority class:', minority)
print('Majority class:', majority)
print('Proportion:', round(minority / majority, 2), ': 1')

Minority class: 1696
Majority class: 2924
Proportion: 0.58 : 1


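**NOTE**: The dataset is unbalanced. A classifier that always predicts the majority class already reaches 2924 / 4620 ≈ 63% accuracy on the original class distribution (the 0.58 above is the minority-to-majority ratio, not an accuracy). Accuracy measured on data with this distribution should therefore be compared against about 63%; anything lower is no better than a trivial predictor.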
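
As a worked check using the class counts printed above, the sketch below separates the minority-to-majority ratio from the accuracy of a trivial classifier that always predicts the majority class. The variable names are illustrative.

```python
# Class counts taken from the value_counts() output above.
counts = {0: 2924, 1: 1696}

minority = min(counts.values())
majority = max(counts.values())
total = minority + majority

ratio = minority / majority          # what the "Proportion" line reports
baseline_accuracy = majority / total  # always predicting the majority class

print(f"minority : majority ratio = {ratio:.2f} : 1")
print(f"majority-class baseline accuracy = {baseline_accuracy:.3f}")
```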

## Machine Learning Model - Logistic Regression

### Train test split 

In [6]:
bugzilla = bugzilla.drop(columns=['transactionid', 'commitdate'])

In [7]:
from sklearn.model_selection import train_test_split


y: np.ndarray = bugzilla.pop('bug').values
X: np.ndarray = bugzilla.values
labels = pd.unique(y)
trnX, tstX, trnY, tstY = train_test_split(X, y, train_size=0.7, stratify=y)


In [8]:
bugzilla.head()

Unnamed: 0,ns,nm,nf,entropy,la,ld,lt,fix,ndev,pd,npt,exp,rexp,sexp
0,1,1,3,0.57938,0.09362,0.0,480.666667,1,14,596,0.666667,143,133.5,129
1,1,1,1,0.0,0.0,0.0,398.0,1,1,0,1.0,140,140.0,137
2,3,3,52,0.739279,0.183477,0.208913,283.519231,0,23,15836,0.75,984,818.65,978
3,1,1,8,0.685328,0.016039,0.01288,514.375,1,21,1281,1.0,579,479.25,550
4,2,2,38,0.769776,0.091829,0.072746,366.815789,1,21,6565,0.763158,413,313.25,405


### Preprocessing 

#### Min-Max Scaler

In [9]:
from sklearn import preprocessing

scaler = preprocessing.MinMaxScaler()
scaler.fit(trnX)
trnX = scaler.transform(trnX)
tstX = scaler.transform(tstX)

#### Data Balancing

In [10]:
from imblearn.under_sampling import RandomUnderSampler

print(' before:  shape %s', str(trnX.shape))
sampler = RandomUnderSampler(sampling_strategy='majority')
# fit_sample is the old name; recent imbalanced-learn releases use fit_resample
trnX, trnY = sampler.fit_resample(trnX, trnY)
print(' after: shape %s', str(trnX.shape))

Using TensorFlow backend.


 before:  shape %s (3234, 14)
 after: shape %s (2374, 14)


### Logistic Regression Model 

In [11]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(solver='liblinear', C=1, random_state=42)
model.fit(trnX, trnY)
prdY = model.predict(tstX)

#### Training Set - 10-fold Cross Validation

In [12]:
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import cross_val_score

accuracy = cross_val_score(estimator=model, X=trnX, y=trnY, cv=10, n_jobs=-1)
loss = cross_val_score(estimator=model, X=trnX, y=trnY, cv=10, n_jobs=-1, scoring='neg_log_loss')

In [13]:
print(f"Training set\n\nAccuracy: {accuracy.mean()}\nLoss: {loss.mean()}")

Training set

Accuracy: 0.6466226997127965
Loss: -0.6313816542513869


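The cross-validated accuracy is about 65%. It is computed on the undersampled training set, where both classes are equally frequent, so the chance level here is 50%.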

#### Test Set Scores 

In [14]:
from sklearn import metrics
from sklearn.metrics import classification_report

tst_acc = metrics.accuracy_score(tstY, prdY)
# labels must be passed by keyword in recent scikit-learn versions
cnf_mtx = metrics.confusion_matrix(tstY, prdY, labels=labels)

In [15]:
print(f"Test set\n\nAccuracy: {tst_acc}\n\nConfusion Matrix:\n{cnf_mtx}\n\nclassification report\n {classification_report(tstY, prdY, labels=labels)}")
print()

Test set

Accuracy: 0.6587301587301587

Confusion Matrix:
[[313 196]
 [277 600]]

classification report
               precision    recall  f1-score   support

           1       0.53      0.61      0.57       509
           0       0.75      0.68      0.72       877

    accuracy                           0.66      1386
   macro avg       0.64      0.65      0.64      1386
weighted avg       0.67      0.66      0.66      1386




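The test set accuracy (about 0.66) is close to the cross-validated training accuracy (about 0.65). Ideally these values stay close together, since a large gap would indicate that the model is overfitting the training set. Note, however, that the test set keeps the original class imbalance, where always predicting the majority class gives 877 / 1386 ≈ 0.63, so the model is only slightly better than that trivial baseline.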

Now we have a trained model that predicts how suspicious a change is, that is, how likely it is to be defective, with the statistics shown above.

**NOTE**: By using XGBoost and Neural Networks, these statistics are improved. Here, only Logistic Regression Models are trained for simplicity and speed.

---

## 2. Features for Test Case Prioritization

Machine learning models can prioritize tests according to a chosen criterion. One such criterion is the ability of a test case to find faults, which can be predicted *a priori*.

## Open Source Dataset

From Palma et al. (2018); 5 projects were retrieved.

### Features
- **Version**: version under test.
- **TestID**: unique test identifier.
- **TestName**: test name.
- **Status**: test case status - Modified/New 
- **ST**: *Size of tests*. Number of lines of code.
- **MC**: *Method Coverage*, the ratio of the nr. of methods called by a test case from the previous version and the total number of methods in the source code.
- **BC**: *Basic Counting*, the nr. of unique method calls in the test trace from the current release that also appear in the previous failing sequences for that test case. 
- **HD**: *Hamming distance*, the nr. of positions at which two sequences of equal length differ.
- **ED**: *Edit distance* (Levenshtein distance), the min nr. of edit operations (insertions, deletions and substitutions) required to convert one sequence into another.
- **CMC**: *Changed Method Coverage*, ratio between the nr. of changed methods from the previous version and the total nr. of methods in the source code.
- **TM**: *Traditional Historical Fault Detection Metric*, obtained by counting the nr. of versions for which a test case has failed previously. 
- **IBC**: *Improved Basic Counting*, combination of **BC** and **HD**.
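
The two distance metrics can be sketched on method-call sequences as follows. The traces are hypothetical, and the dataset's authors may compute **HD** and **ED** differently (for example on sets of calls rather than ordered sequences); this only illustrates the textbook definitions.

```python
from typing import Sequence


def hamming(a: Sequence, b: Sequence) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: Sequence, b: Sequence) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute (or match)
        prev = curr
    return prev[-1]


# Hypothetical method-call traces of one test in two versions.
old_trace = ["parse", "check", "emit"]
new_trace = ["parse", "optimize", "check", "emit"]

edit_dist = levenshtein(old_trace, new_trace)
ham_dist = hamming(["parse", "check", "emit"], ["parse", "lint", "emit"])
print("edit distance:", edit_dist)
print("Hamming distance:", ham_dist)
```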



### Additional features to consider
- **size**: test size in minutes
- **txt**: Text-based features: a textual representation of the test case (obtained by using topic modelling), which can be used to cluster test cases.

### Label:
- **Result**: test outcome; 1 = fail, 0 = pass

In [50]:
ClosureCompiler = pd.read_csv('../pred-rep-master/tanzeem_noor-promise17_data/Closure-Compiler Metrics Raw_Data.csv')

In [51]:
print(f"This data set contains {ClosureCompiler.shape[0]} instances and {ClosureCompiler.shape[1]} features")

This data set contains 3329 instances and 13 features


In [52]:
ClosureCompiler.head(10)

Unnamed: 0,Version,TestID,TestName,Result,Status,ST,MC,BC,HD,ED,CMC,TM,IBC
0,1,C_1_M_1,com.google.javascript.jscomp.IntegrationTest::...,1,Modified,9,1866,0.950697,0.14595,638738,0.0,0,0.950697
1,1,C_2_M_1,com.google.javascript.jscomp.RemoveUnusedVarsT...,1,Modified,4,672,0.995536,0.249005,262767,0.0,0,0.995536
2,1,C_2_M_2,com.google.javascript.jscomp.RemoveUnusedVarsT...,1,Modified,4,667,0.995502,0.25011,261775,0.0,0,0.995502
3,1,C_2_M_3,com.google.javascript.jscomp.RemoveUnusedVarsT...,1,Modified,4,693,0.995671,0.249962,266981,0.0,0,0.995671
4,1,C_2_M_4,com.google.javascript.jscomp.RemoveUnusedVarsT...,1,Modified,3,672,0.995536,0.249431,262767,0.0,0,0.995536
5,1,C_3_M_1,com.google.javascript.jscomp.CommandLineRunner...,1,New,3,1630,0.941104,0.150986,558971,0.0,0,0.941104
6,1,C_3_M_2,com.google.javascript.jscomp.CommandLineRunner...,0,New,3,2031,0.928607,0.130447,694723,0.0,0,0.928607
7,1,C_3_M_3,com.google.javascript.jscomp.CommandLineRunner...,0,Modified,2,1674,0.94325,0.145455,573755,0.0,0,0.94325
8,1,C_3_M_4,com.google.javascript.jscomp.CommandLineRunner...,0,Modified,4,2187,0.942387,0.130274,747954,0.002286,0,0.942387
9,1,C_3_M_5,com.google.javascript.jscomp.CommandLineRunner...,1,Modified,4,1598,0.943054,0.151367,548219,0.0,0,0.943054


In [53]:
target_count = ClosureCompiler['Result'].value_counts()
print(target_count)
# Use the counts directly: indexing the Series with a position only works by
# coincidence here, because the class labels happen to be 0 and 1.
minority = target_count.min()
majority = target_count.max()

print('Minority class:', minority)
print('Majority class:', majority)
print('Proportion:', round(minority / majority, 2), ': 1')

0    3099
1     230
Name: Result, dtype: int64
Minority class: 230
Majority class: 3099
Proportion: 0.07 : 1


In [54]:
pass_t = ClosureCompiler[ClosureCompiler['Result']==0]
fail_t = ClosureCompiler[ClosureCompiler['Result']==1]


## Machine Learning Model 


In [55]:
ClosureCompiler = ClosureCompiler.drop(columns=['TestID', 'TestName'])

In [56]:
ClosureCompiler.head()

Unnamed: 0,Version,Result,Status,ST,MC,BC,HD,ED,CMC,TM,IBC
0,1,1,Modified,9,1866,0.950697,0.14595,638738,0.0,0,0.950697
1,1,1,Modified,4,672,0.995536,0.249005,262767,0.0,0,0.995536
2,1,1,Modified,4,667,0.995502,0.25011,261775,0.0,0,0.995502
3,1,1,Modified,4,693,0.995671,0.249962,266981,0.0,0,0.995671
4,1,1,Modified,3,672,0.995536,0.249431,262767,0.0,0,0.995536


### Categorical Variable encoding

Most ML algorithms operate on numeric inputs, so categorical values must be encoded as numbers before training.

In [57]:
from sklearn.preprocessing import LabelEncoder

lb_make = LabelEncoder()
ClosureCompiler["Status"] = lb_make.fit_transform(ClosureCompiler["Status"])


In [58]:
ClosureCompiler.head(10)

Unnamed: 0,Version,Result,Status,ST,MC,BC,HD,ED,CMC,TM,IBC
0,1,1,0,9,1866,0.950697,0.14595,638738,0.0,0,0.950697
1,1,1,0,4,672,0.995536,0.249005,262767,0.0,0,0.995536
2,1,1,0,4,667,0.995502,0.25011,261775,0.0,0,0.995502
3,1,1,0,4,693,0.995671,0.249962,266981,0.0,0,0.995671
4,1,1,0,3,672,0.995536,0.249431,262767,0.0,0,0.995536
5,1,1,1,3,1630,0.941104,0.150986,558971,0.0,0,0.941104
6,1,0,1,3,2031,0.928607,0.130447,694723,0.0,0,0.928607
7,1,0,0,2,1674,0.94325,0.145455,573755,0.0,0,0.94325
8,1,0,0,4,2187,0.942387,0.130274,747954,0.002286,0,0.942387
9,1,1,0,4,1598,0.943054,0.151367,548219,0.0,0,0.943054


### Train test split 

In [60]:
from sklearn.model_selection import train_test_split


y: np.ndarray = ClosureCompiler.pop('Result').values
X: np.ndarray = ClosureCompiler.values
labels = pd.unique(y)
trnX, tstX, trnY, tstY = train_test_split(X, y, train_size=0.7, stratify=y)


### Preprocessing 

#### Min-Max Scaler

In [61]:
from sklearn import preprocessing

scaler = preprocessing.MinMaxScaler()
scaler.fit(trnX)
trnX = scaler.transform(trnX)
tstX = scaler.transform(tstX)

#### Data Balancing

In [62]:
from imblearn.under_sampling import RandomUnderSampler

print(' before:  shape %s', str(trnX.shape))
sampler = RandomUnderSampler(sampling_strategy='majority')
# fit_sample is the old name; recent imbalanced-learn releases use fit_resample
trnX, trnY = sampler.fit_resample(trnX, trnY)
print(' after: shape %s', str(trnX.shape), str(trnY.shape))

 before:  shape %s (2330, 9)
 after: shape %s (322, 9) (322,)


### Logistic Regression Model 

In [63]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(solver='liblinear', C=1, random_state=42)
model.fit(trnX, trnY)
prdY = model.predict(tstX)

#### Training Set - 10-fold Cross Validation

In [64]:
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import cross_val_score

accuracy = cross_val_score(estimator=model, X=trnX, y=trnY, cv=10, n_jobs=-1)
loss = cross_val_score(estimator=model, X=trnX, y=trnY, cv=10, n_jobs=-1, scoring='neg_log_loss')

In [65]:
print(f"Training set\nAccuracy: {accuracy.mean()}\nLoss: {loss.mean()}")

Training set
Accuracy: 0.5805871212121212
Loss: -0.6640055800495815


#### Test Set Scores 

In [66]:
from sklearn import metrics
from sklearn.metrics import classification_report

tst_acc = metrics.accuracy_score(tstY, prdY)
# labels must be passed by keyword in recent scikit-learn versions
cnf_mtx = metrics.confusion_matrix(tstY, prdY, labels=labels)

In [67]:
print(f"Test set\n\nAccuracy: {tst_acc}\n\nConfusion Matrix:\n{cnf_mtx}\n\nclassification report\n {classification_report(tstY, prdY, labels=labels)}")
print()

Test set

Accuracy: 0.5285285285285285

Confusion Matrix:
[[ 50  19]
 [452 478]]

classification report
               precision    recall  f1-score   support

           1       0.10      0.72      0.18        69
           0       0.96      0.51      0.67       930

    accuracy                           0.53       999
   macro avg       0.53      0.62      0.42       999
weighted avg       0.90      0.53      0.64       999




Precision is defined as the number of true positives divided by the number of true positives plus the number of false positives. For class 1, the precision is around 10% (50 / (50 + 452)), which indicates that many tests were classified as Fail when they actually passed.
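
The class 1 figures in the report can be re-derived from the confusion matrix printed above. The sketch below assumes the rows are the true classes and the columns the predicted classes, both in the order [1 (fail), 0 (pass)], which matches the supports of 69 and 930 in the report.

```python
# Confusion matrix copied from the test-set output above.
cnf_mtx = [[50, 19],
           [452, 478]]

tp, fn = cnf_mtx[0]  # true fails: predicted fail / predicted pass
fp, tn = cnf_mtx[1]  # true passes: predicted fail / predicted pass

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fn + fp + tn)

print(f"precision={precision:.2f} recall={recall:.2f} "
      f"f1={f1:.2f} accuracy={accuracy:.4f}")
```

The high recall and low precision show that the model catches most failing tests but only by flagging almost half of the passing ones as failures.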

### Make new prediction with unseen data

Using last version that belongs to this dataset

In [71]:
NewPred = pd.read_csv('../pred-rep-master/tanzeem_noor-promise17_data/CloserCompilerNew.csv')

In [72]:
Pred = NewPred.drop(columns=['TestID','TestName', 'Result'])

# Reuse the encoder fitted on the training data so the category codes match
Pred["Status"] = lb_make.transform(Pred["Status"])

In [73]:
Pred.head()

Unnamed: 0,Version,ST,MC,BC,HD,ED,CMC,TM,IBC
0,129,7,15,0.2,0.057619,140,0,0,0.2
1,129,7,15,0.2,0.057619,140,0,0,0.2
2,129,41,15,0.2,0.057619,140,0,0,0.2
3,129,9,15,0.2,0.057619,140,0,0,0.2
4,129,7,7,0.285714,0.046197,140,0,0,0.285714


#### Determine each test case failure probability, predicted by the model

In [74]:
Pred = scaler.transform(Pred)
prd = model.predict_proba(Pred)[:,1]
prd

array([0.31832292, 0.31832292, 0.26361079, 0.3149399 , 0.33948433,
       0.30633938, 0.29973517, 0.30549315, 0.29973517, 0.46969864,
       0.36879604, 0.36879604, 0.52338241, 0.52338241, 0.52338241,
       0.52107289, 0.52107289, 0.46078749, 0.51281151, 0.44501358,
       0.43154228, 0.50695073, 0.31316415, 0.47486421, 0.47286474,
       0.46044862, 0.43402948, 0.44061027, 0.43402948, 0.42824552,
       0.42824552, 0.42824552, 0.42824552, 0.42824552, 0.42824552,
       0.42824552, 0.42824552, 0.47284806, 0.45923139, 0.42628346,
       0.49344488, 0.49344488, 0.49344488, 0.49344488, 0.49344488])

In [75]:
NewPred['FailProb'] = np.round(prd*100, 1)

In [76]:
NewPred.head()

Unnamed: 0,Version,TestID,TestName,Result,Status,ST,MC,BC,HD,ED,CMC,TM,IBC,FailProb
0,129,C_1_M_1,com.google.javascript.jscomp.IntegrationTest::...,0,New,7,15,0.2,0.057619,140,0,0,0.2,31.8
1,129,C_1_M_2,com.google.javascript.jscomp.IntegrationTest::...,0,New,7,15,0.2,0.057619,140,0,0,0.2,31.8
2,129,C_1_M_3,com.google.javascript.jscomp.IntegrationTest::...,0,Modified,41,15,0.2,0.057619,140,0,0,0.2,26.4
3,129,C_1_M_4,com.google.javascript.jscomp.IntegrationTest::...,1,New,9,15,0.2,0.057619,140,0,0,0.2,31.5
4,129,C_1_M_5,com.google.javascript.jscomp.IntegrationTest::...,0,New,7,7,0.285714,0.046197,140,0,0,0.285714,33.9


#### Sort DataFrame by highest probability to lowest, while conserving the rest of the columns

In [77]:
Order = NewPred.sort_values(by=['FailProb'], ascending=False)
Order

Unnamed: 0,Version,TestID,TestName,Result,Status,ST,MC,BC,HD,ED,CMC,TM,IBC,FailProb
12,129,C_13_M_1,com.google.javascript.jscomp.parsing.JsDocInfo...,0,New,6,21,0.904762,0.215569,140,0,0,0.904762,52.3
14,129,C_13_M_3,com.google.javascript.jscomp.parsing.JsDocInfo...,0,New,6,21,0.904762,0.215569,140,0,0,0.904762,52.3
13,129,C_13_M_2,com.google.javascript.jscomp.parsing.JsDocInfo...,0,New,6,21,0.904762,0.215569,140,0,0,0.904762,52.3
16,129,C_13_M_5,com.google.javascript.jscomp.parsing.JsDocInfo...,0,Modified,2,23,0.869565,0.214987,142,0,0,0.869565,52.1
15,129,C_13_M_4,com.google.javascript.jscomp.parsing.JsDocInfo...,0,Modified,2,23,0.869565,0.214987,142,0,0,0.869565,52.1
18,129,C_14_M_1,com.google.javascript.rhino.jstype.FunctionTyp...,0,Modified,5,8,0.875,0.132932,140,0,0,0.875,51.3
21,129,C_14_M_4,com.google.javascript.rhino.jstype.FunctionTyp...,0,Modified,8,8,0.875,0.132932,140,0,0,0.875,50.7
44,129,C_9_M_5,com.google.javascript.jscomp.CheckRequiresForC...,0,New,3,32,0.75,0.483431,161,0,0,0.75,49.3
43,129,C_9_M_4,com.google.javascript.jscomp.CheckRequiresForC...,0,New,3,32,0.75,0.483431,161,0,0,0.75,49.3
40,129,C_9_M_1,com.google.javascript.jscomp.CheckRequiresForC...,0,New,3,32,0.75,0.483431,161,0,0,0.75,49.3


This way, for the next version the model provides a list of tests sorted by their predicted probability of failing, and therefore of revealing faults. Ideally, the model would assign a probability close to 1 to the test at index 3, which is the one that failed, and close to 0 to all the others. It does not, so the model has to be improved; fine tuning or other models may raise its accuracy and precision.

This model has many limitations, starting with the very unbalanced dataset, which may leave it no better than running the tests in random order.
A tie-breaking criterion also has to be defined to order tests with equal failure probabilities.
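
One possible tie-breaking rule is sketched below, using four rows from the sorted table above. Breaking ties by smaller test size (**ST**) and then by **TestID** is purely an assumption for illustration; any deterministic secondary key would serve the same purpose.

```python
# Four rows taken from the sorted table above (TestID, FailProb, ST).
tests = [
    {"TestID": "C_13_M_3", "FailProb": 52.3, "ST": 6},
    {"TestID": "C_14_M_1", "FailProb": 51.3, "ST": 5},
    {"TestID": "C_13_M_5", "FailProb": 52.1, "ST": 2},
    {"TestID": "C_13_M_1", "FailProb": 52.3, "ST": 6},
]

# Highest failure probability first; ties go to the smaller test, and any
# remaining ties are resolved by TestID so the order is fully deterministic.
ranked = sorted(tests, key=lambda t: (-t["FailProb"], t["ST"], t["TestID"]))
order = [t["TestID"] for t in ranked]
print(order)
```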

In [40]:
Order.loc[3]

Version                                                   129
TestID                                                C_1_M_4
TestName    com.google.javascript.jscomp.IntegrationTest::...
Result                                                      1
Status                                                    New
ST                                                          9
MC                                                         15
BC                                                        0.2
HD                                                   0.057619
ED                                                        140
CMC                                                         0
TM                                                          0
IBC                                                       0.2
FailProb                                                 24.4
Name: 3, dtype: object

### Final Remarks

Although the results are far from optimal, they come from only one dataset, one version, few preprocessing techniques and a single machine learning model.

The goal of this notebook is to sketch the work to be done on the real dataset, where the same analysis will be performed, hopefully with better results.