In [1]:
import toolkit

### First, we gather the dataset. This is a history of Java static code metrics and change metrics for our project.

In [2]:
# Used to indicate where the data should be gathered and stored
rootDirectory = '../dataSets/okhttpStudy/'

# Call gatherTimeMetrics and measure Java, Indent and Change metrics 
# on .java files from the git project's repository
metricsData = toolkit.data.gatherTimeMetrics(rootDirectory, 'https://github.com/square/okhttp.git', rootDirectory+'okhttp/', '*/*.java *.java', ['java'], skipEvery=50)


In [3]:
metricsData['data']

Unnamed: 0,entity,age-months,n-authors,n-revs,added,deleted,fractal-value,soc,netchurn,cbo,...,assignmentsQty,mathOperationsQty,variablesQty,maxNestedBlocks,anonymousClassesQty,subClassesQty,lambdasQty,uniqueWordsQty,modifiers,time
0,okhttp/src/test/java/okhttp3/internal/http2/Ht...,0,2,6,90,61,0.28,8,29,34,...,262,12,261,3,6,-2,0,391,24,2019-07-25
1,okhttp-sse/src/test/java/okhttp3/internal/sse/...,0,2,2,3,8,0.50,12,-5,12,...,12,3,10,0,0,0,0,45,17,2019-07-25
2,okhttp/src/test/java/okhttp3/internal/http2/Ht...,0,2,6,128,6,0.44,14,122,86,...,231,7,214,8,22,-9,1,438,16,2019-07-25
3,okhttp/src/test/java/okhttp3/CacheTest.java,1,3,4,43,9,0.63,7,34,37,...,295,136,282,1,6,-4,3,561,14,2019-07-25
4,okhttp/src/test/java/okhttp3/ConnectionCoalesc...,1,2,2,105,7,0.50,6,98,20,...,34,0,32,1,2,-1,3,182,16,2019-07-25
5,okhttp/src/test/java/okhttp3/EventListenerTest...,1,3,4,8,18,0.63,28,-10,68,...,154,11,141,5,12,-5,3,239,13,2019-07-25
6,okhttp/src/test/java/okhttp3/internal/ws/WebSo...,1,2,4,17,17,0.50,24,0,40,...,89,8,82,2,14,-7,3,248,10,2019-07-25
7,okhttp/src/test/java/okhttp3/CallTest.java,1,4,10,95,19,0.64,32,76,118,...,435,35,377,10,52,-25,8,731,1,2019-07-25
8,okhttp/src/test/java/okhttp3/DispatcherTest.java,1,2,2,5,6,0.50,16,-1,16,...,36,0,36,2,0,0,4,120,17,2019-07-25
9,okhttp/src/test/java/okhttp3/internal/connecti...,1,3,6,49,22,0.50,8,27,13,...,28,0,28,1,0,0,0,91,17,2019-07-25


### How many points in time did we sample?

In [4]:
print metricsData['times']

47


### How many features and samples are in our dataset?

In [5]:
print metricsData['data'].shape

(2404, 50)


### How many unique source files were measured?

In [6]:
print metricsData['data']['entity'].nunique()

843


In [7]:
print metricsData['data'].head

<bound method DataFrame.head of                                                entity  age-months  n-authors  \
0   okhttp/src/test/java/okhttp3/internal/http2/Ht...           0          2   
1   okhttp-sse/src/test/java/okhttp3/internal/sse/...           0          2   
2   okhttp/src/test/java/okhttp3/internal/http2/Ht...           0          2   
3         okhttp/src/test/java/okhttp3/CacheTest.java           1          3   
4   okhttp/src/test/java/okhttp3/ConnectionCoalesc...           1          2   
5   okhttp/src/test/java/okhttp3/EventListenerTest...           1          3   
6   okhttp/src/test/java/okhttp3/internal/ws/WebSo...           1          2   
7          okhttp/src/test/java/okhttp3/CallTest.java           1          4   
8    okhttp/src/test/java/okhttp3/DispatcherTest.java           1          2   
9   okhttp/src/test/java/okhttp3/internal/connecti...           1          3   
10       okhttp/src/test/java/okhttp3/DuplexTest.java           1          2   
11      

# Change Metrics
### Let's see what affects the net churn of files 
### Which types of files have net churn above and below the mean net churn?

In [12]:
# We split the data into 5 equally-sized groups, 
# then perform cross-validation while gradually adding these groups to the training set

# i.e. the train-test splits are with groups of size:
# 1-4, 2-3, 3-2, 4-1

# We omit visualization of decision trees to save space,
# but they can be shown with visualize=True as above
folds = 5

from sklearn.tree import DecisionTreeRegressor
modelInstance = DecisionTreeRegressor(max_leaf_nodes=32)
modelSimpler = DecisionTreeRegressor(max_leaf_nodes=16)
churnModelMoreFolds = toolkit.refinement.makeAndUpdateModel(rootDirectory, metricsData['data'], folds, 'netchurn', modelInstance, modelSimpler, scoreOnly=False) 

Response variable was netchurn
Model.score: 0.480526
                 name  importance
4             deleted    0.622082
39  mathOperationsQty    0.239641
Model.score: 0.799831
      name  importance
4  deleted    0.633197
Model.score: -2.927660
      name  importance
4  deleted    0.633644
Model.score: 0.725848
      name  importance
4  deleted    0.661978
3    added    0.303112
Model.score: 0.628061
      name  importance
1  deleted    0.878375
Model.score: 0.617432
      name  importance
1  deleted    0.779267
Model.score: 0.617468
      name  importance
1  deleted    0.723325
0    added    0.260470
Model.score: 0.608597
      name  importance
1  deleted    0.667823
0    added    0.305019
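
The expanding train-test splits described in the cell comments (1-4, 2-3, 3-2, 4-1) can be sketched as below. This is a minimal illustration that assumes the rows are already in time order; `expanding_splits` is a hypothetical helper, and how `toolkit.refinement.makeAndUpdateModel` actually partitions the rows is not shown in this notebook.

```python
def expanding_splits(n_samples, folds):
    """Yield (train_indices, test_indices) for folds - 1 expanding splits.

    Hypothetical helper: the rows are cut into `folds` consecutive groups,
    and the training set grows by one group at each step.
    """
    # Boundaries of `folds` nearly equal consecutive groups.
    bounds = [(i * n_samples) // folds for i in range(folds + 1)]
    for k in range(1, folds):
        train = list(range(0, bounds[k]))        # first k groups
        test = list(range(bounds[k], n_samples))  # remaining groups
        yield train, test


# Ten samples in five groups give the 1-4, 2-3, 3-2, 4-1 pattern.
for train, test in expanding_splits(n_samples=10, folds=5):
    print("%d train, %d test" % (len(train), len(test)))
```

With `folds = 5` this yields four splits, which is consistent with the eight `Model.score` lines above if each split is scored once per model.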


### Some observations:
- The model has very good Precision, Recall and F1-Score: net churn above/below the mean is classified very well by this model
- ROC area under curve is very high: very little compromise between false negative rate and false positive rate
- The model says that the features influencing net churn are (strongest to weakest):
    - Number of lines added
    - Number of lines deleted    
- The 'net churn below the mean' class is over-represented in our data (3 times as many samples as the other class)
    - However, the model still performs well without any steps taken to address class imbalance (e.g. under/over-sampling)
- Interpretation of the visualized decision tree is straightforward:
    - 63% of samples were files with less than 34 lines added 
        - These samples had net churn less than the mean
        - Some of these may be very stable files (over the history of the project)
    - The files with net churn greater than the mean had more than 108 lines added
    - Within this group, there are several subgroups with varying levels of churn

### Each of the subsets still exhibits class imbalance (but not with the same ratio)
### In particular, the 2nd train-test split has the most balanced classes (2:1) among the five splits

### We still see very good performance, and the same features are important throughout
### How far can we go? Let's cross-validate on every sampled time!

In [14]:
folds = metricsData['times'] # This is an attribute of the measured data set: the number of time points measured
churnModelMoreFolds = toolkit.refinement.makeAndUpdateModel(rootDirectory, metricsData['data'], folds, 'netchurn', modelInstance, scoreOnly=False) 

Response variable was netchurn
Model.score: -0.289683
   name  importance
10  rfc        0.32
Model.score: 0.821747
      name  importance
4  deleted    0.571122
3    added    0.260062
Model.score: 0.688150
      name  importance
4  deleted    0.656151
Model.score: 0.186992
      name  importance
4  deleted    0.504836
3    added    0.313540
Model.score: 0.405772
      name  importance
3    added    0.432621
4  deleted    0.285739
Model.score: -0.289683
      name  importance
3    added    0.515350
4  deleted    0.247184
Model.score: -0.096491
      name  importance
3    added    0.481192
4  deleted    0.251432
Model.score: -1.482270
      name  importance
3    added    0.463905
4  deleted    0.227354
Model.score: 0.300699
      name  importance
3    added    0.363405
4  deleted    0.349587
Model.score: -0.324413
      name  importance
3    added    0.518063
4  deleted    0.282610
Model.score: -1.735756
      name  importance
3    added    0.517756
4  deleted    0.273139
Model.score: -

### Results of this step are omitted for printing. However, the large cross-validation can be run to see them.

### The individual data sets used for training and testing are quite small and imbalanced.
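
Several of the scores above are negative. For scikit-learn regressors, `score` returns the coefficient of determination (R²), which drops below zero whenever the predictions are worse than always predicting the mean of the test set. The sketch below, assuming the printed `Model.score` is that value, computes R² by hand on made-up numbers to show how a small test set with a few badly missed points produces a negative score.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / float(len(y_true))
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Illustrative values only; these are not taken from the okhttp data.
y_true = [0.0, 10.0, -10.0, 5.0]
close_predictions = [1.0, 9.0, -8.0, 4.0]
poor_predictions = [50.0, -40.0, 30.0, -20.0]
mean_predictions = [sum(y_true) / len(y_true)] * len(y_true)

print("close: %.3f" % r_squared(y_true, close_predictions))
print("mean:  %.3f" % r_squared(y_true, mean_predictions))
print("poor:  %.3f" % r_squared(y_true, poor_predictions))
```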

### Many of the same relationships still show up.

### Why is 'added' a much more important factor than 'deleted'? 

In [16]:
print "Mean", metricsData['data']['netchurn'].mean()
print "Variance", metricsData['data']['netchurn'].var()
print "Standard deviation", metricsData['data']['netchurn'].std()
print "Max", metricsData['data']['netchurn'].max()
print "Min", metricsData['data']['netchurn'].min()

Mean -5.468386023294509
Variance 12577.661505342396
Standard deviation 112.15017389795878
Max 996
Min -2190


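### The mean net churn is slightly negative (about -5.5 lines per sample), so on average slightly more lines were deleted than added. The spread is far larger than the mean (standard deviation of about 112 lines).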

### Some files must experience more churn than others. We know from some of the motivating literature that defects can be correlated with large pre-release churn.

### Let's make some categories of binned churn data and classify them 

In [17]:
churnBinnedCategories = ['churnLow','churnMedium','churnHigh']
dataSetUpdated = toolkit.utilities.addBinnedResponseCategory(metricsData['data'], 'netchurn', churnBinnedCategories)
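
How `addBinnedResponseCategory` chooses its bin edges is not shown in this notebook. As one possible scheme, the sketch below uses equal-width bins between the minimum and maximum of the response; `bin_equal_width` is a hypothetical helper, and the sample values are illustrative apart from the minimum (-2190) and maximum (996) reported above.

```python
def bin_equal_width(values, labels):
    """Assign each value to one of len(labels) equal-width bins.

    Hypothetical helper; the toolkit may bin differently
    (for example by quantiles).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / float(len(labels))
    binned = []
    for v in values:
        idx = int((v - lo) / width) if width > 0 else 0
        idx = min(idx, len(labels) - 1)  # the maximum falls in the last bin
        binned.append(labels[idx])
    return binned


labels = ['churnLow', 'churnMedium', 'churnHigh']
sample = [-2190, -300, -5, 0, 29, 122, 996]
binned = bin_equal_width(sample, labels)
for value, label in zip(sample, binned):
    print("%6d -> %s" % (value, label))
```

Under this assumption a single extreme value stretches the range, so most samples near zero land in the same bin, which is one way a strong class imbalance can arise.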

In [18]:
from sklearn.tree import DecisionTreeClassifier
modelInstance = DecisionTreeClassifier(max_leaf_nodes=8, criterion='entropy')
churnModelCategories = toolkit.refinement.makeAndUpdateModel(rootDirectory, dataSetUpdated, 2, churnBinnedCategories, modelInstance, modelSimpler, visualize=False, scoreOnly=False) 


Response variable was ['churnLow', 'churnMedium', 'churnHigh']
Model.score: 0.967527
accuracy_score: 0.967527
             precision    recall  f1-score   support

          0       0.00      0.00      0.00         0
          1       0.71      0.84      0.77        76
          2       0.99      0.98      0.98      1125

avg / total       0.97      0.97      0.97      1201

roc_auc_score cannot be computed for this test set
      name  importance
4  deleted    0.719285
3    added    0.197433
Model.score: 0.606690
accuracy_score: 0.975853
             precision    recall  f1-score   support

          0       0.00      0.00      0.00         0
          1       0.79      0.87      0.82        76
          2       0.99      0.98      0.99      1125

avg / total       0.98      0.98      0.98      1201

roc_auc_score cannot be computed for this test set
      name  importance
1  deleted    0.598854
0    added    0.343842




### Now we're seeing something interesting. The vast majority of the files exhibit very low amounts of churn. A select few files receive most of the lines added/deleted. Does the class imbalance impact the validity of this model? Let's try more cross-validation to see.
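
One quick check on the effect of class imbalance is to compare the reported accuracy with a majority-class baseline, meaning the accuracy obtained by always predicting the most frequent class. The sketch below uses the support counts from the first classification report above (0, 76 and 1125 samples for classes 0, 1 and 2).

```python
# Support counts copied from the first classification report above.
support = {0: 0, 1: 76, 2: 1125}

total = sum(support.values())
majority_class = max(support, key=support.get)
baseline_accuracy = support[majority_class] / float(total)

print("majority class %d, baseline accuracy %.4f"
      % (majority_class, baseline_accuracy))
```

The baseline comes out at roughly 0.94, so the reported accuracy of about 0.97 is a smaller gain than it first appears. The per-class precision and recall for class 1 are more informative here.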

In [19]:
folds = 3
churnModelCategories = toolkit.refinement.makeAndUpdateModel(rootDirectory, dataSetUpdated, folds, churnBinnedCategories, modelInstance, modelSimpler, visualize=False, scoreOnly=False) 


Response variable was ['churnLow', 'churnMedium', 'churnHigh']
Model.score: 0.978750
accuracy_score: 0.978750
             precision    recall  f1-score   support

          0       0.00      0.00      0.00         1
          1       0.69      0.79      0.73        28
          2       0.99      0.99      0.99       771

avg / total       0.98      0.98      0.98       800

roc_auc_score: 0.753069
      name  importance
4  deleted    0.691191
3    added    0.239364
Model.score: 0.978750
accuracy_score: 0.978750
             precision    recall  f1-score   support

          0       0.00      0.00      0.00         0
          1       0.89      0.85      0.87        66
          2       0.99      0.99      0.99       734

avg / total       0.98      0.98      0.98       800

roc_auc_score cannot be computed for this test set
      name  importance
4  deleted     0.73088
3    added     0.26912
Model.score: 0.642650
accuracy_score: 0.987500
             precision    recall  f1-score   su

### Let's look at this from another point of view. What characterises the files which have the most lines added?

In [20]:
addedModel = toolkit.refinement.makeAndUpdateModel(rootDirectory, metricsData['data'], 2, 'added', modelInstance, modelSimpler, visualize=False, scoreOnly=False) 


Response variable was added
Model.score: 0.945878
accuracy_score: 0.945878
             precision    recall  f1-score   support

          0       0.95      0.98      0.96       890
          1       0.93      0.85      0.89       311

avg / total       0.95      0.95      0.95      1201

roc_auc_score: 0.915371
       name  importance
6  netchurn    0.626919
3   deleted    0.342857
Model.score: 0.788486
       name  importance
2  netchurn    0.685577
1   deleted    0.278838


### Net churn and deleted lines are strongly related. What do we find if we're not allowed to use these in our decision tree?
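
In the ten sample rows displayed near the top of this notebook, `netchurn` equals `added` minus `deleted`. If that holds for the whole dataset (an assumption, since only those rows are visible here), then `added` can be reconstructed exactly from `netchurn` and `deleted`, which would explain why these two features dominate. The sketch below checks the identity on those ten rows.

```python
# (added, deleted, netchurn) copied from the ten rows shown in Out[3].
rows = [
    (90, 61, 29), (3, 8, -5), (128, 6, 122), (43, 9, 34), (105, 7, 98),
    (8, 18, -10), (17, 17, 0), (95, 19, 76), (5, 6, -1), (49, 22, 27),
]

# netchurn == added - deleted, so added == netchurn + deleted.
all_consistent = all(added - deleted == netchurn
                     for added, deleted, netchurn in rows)
print("netchurn == added - deleted for every row: %s" % all_consistent)
```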

In [23]:
alteredData = metricsData['data'].drop(['netchurn','deleted'], axis=1)
addedModel = toolkit.refinement.makeAndUpdateModel(rootDirectory, alteredData, 2, 'added', modelInstance, modelSimpler, visualize=False, scoreOnly=False) 


Response variable was added
Model.score: 0.822648
accuracy_score: 0.822648
             precision    recall  f1-score   support

          0       0.87      0.90      0.88       890
          1       0.68      0.60      0.64       311

avg / total       0.82      0.82      0.82      1201

roc_auc_score: 0.750643
     name  importance
2  n-revs    0.714335
Model.score: 0.344658
     name  importance
0  n-revs    0.708057


### The model uses n-revs as the most important feature, but it does not classify '# lines added above the mean' very well
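
The weak result for the minority class can be read directly from the report: the F1-score is the harmonic mean of precision and recall, so it is pulled towards the lower of the two. The sketch below recomputes it from the rounded precision (0.68) and recall (0.60) reported for class 1 above; `f1_from` is a hypothetical helper name.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Rounded values for class 1 from the report above.
f1_minority = f1_from(0.68, 0.60)
print("F1 for class 1: %.4f" % f1_minority)
```

The result is about 0.64, in line with the reported F1-score given that the inputs are themselves rounded.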

In [26]:
alteredData2 = metricsData['data'].drop(['netchurn','deleted','n-revs'], axis=1)
addedModel2 = toolkit.refinement.makeAndUpdateModel(rootDirectory, alteredData2, 2, 'added', modelInstance, modelSimpler, visualize=False, scoreOnly=False) 


Response variable was added
Model.score: 0.773522
accuracy_score: 0.773522
             precision    recall  f1-score   support

          0       0.79      0.96      0.86       890
          1       0.67      0.25      0.36       311

avg / total       0.75      0.77      0.73      1201

roc_auc_score: 0.603492
            name  importance
2  fractal-value    0.505913
Model.score: 0.141858
            name  importance
1  fractal-value    0.474911


### With n-revs removed, the model relies on fractal-value and still has problems identifying the minority class (recall of 0.25). Next, we also drop n-authors.

In [47]:
alteredData3 = metricsData['data'].drop(['netchurn','deleted','n-revs','n-authors'], axis=1)
addedModel3 = toolkit.refinement.makeAndUpdateModel(rootDirectory, alteredData3, 3, 'added', modelInstance, modelSimpler, visualize=False, scoreOnly=False) 

Response variable was added
Model.score: 0.810000
accuracy_score: 0.810000
             precision    recall  f1-score   support

          0       0.83      0.96      0.89       640
          1       0.57      0.21      0.30       160

avg / total       0.78      0.81      0.77       800

roc_auc_score: 0.583594
            name  importance
1  fractal-value    0.539915
Model.score: 0.752500
accuracy_score: 0.752500
             precision    recall  f1-score   support

          0       0.76      0.96      0.85       572
          1       0.70      0.23      0.34       228

avg / total       0.74      0.75      0.70       800

roc_auc_score: 0.594804
            name  importance
1  fractal-value    0.562188
8   totalMethods    0.225422
Model.score: 0.076671
            name  importance
0  fractal-value    0.481302
Model.score: 0.080657
            name  importance
0  fractal-value    0.550253


### fractal-value is derived from n-revs and n-authors

### Let's get rid of it and build a regression model for nline

# Java Code Metrics

### This model will predict the CBO (Coupling between objects)

In [45]:
from sklearn.tree import DecisionTreeRegressor
modelInstanceR = DecisionTreeRegressor(max_leaf_nodes=64)
modelInstanceRsimpler = DecisionTreeRegressor(max_leaf_nodes=32)
alteredData4 = metricsData['data']
nlineModelR = toolkit.refinement.makeAndUpdateModel(rootDirectory, alteredData4, 2, 'cbo', modelInstanceR, modelInstanceRsimpler, visualize=False, scoreOnly=False)

Response variable was cbo
Model.score: 0.404154
  name  importance
9  dit    0.711125
Model.score: 0.437415
  name  importance
1  dit    0.723992


### In the output above, the most important feature for predicting coupling between objects is DIT (depth of inheritance tree), with an importance of about 0.71. RFC (Response for a Class), which counts the number of unique method invocations in a class, does not appear among the listed features.

### The number of unique method invocations in a class might still be expected to relate to coupling between objects. What happens if we drop RFC from our dataframe?

In [42]:
from sklearn.tree import DecisionTreeRegressor
modelInstanceR = DecisionTreeRegressor(max_leaf_nodes=64)
modelInstanceRsimpler = DecisionTreeRegressor(max_leaf_nodes=32)
alteredData4 = metricsData['data'].drop(['rfc'],axis=1)
nlineModelR = toolkit.refinement.makeAndUpdateModel(rootDirectory, alteredData4, 2, 'cbo', modelInstanceR, modelInstanceRsimpler, visualize=False, scoreOnly=False)

Response variable was cbo
Model.score: 0.438750
  name  importance
9  dit    0.718641
Model.score: 0.362411
  name  importance
0  dit    0.732338


### With RFC dropped, the output again lists DIT (depth of inheritance tree) as the most important feature for predicting coupling between objects (importance of about 0.72), and the scores change little (0.44 and 0.36).

### This model will predict the LCOM (Lack of Cohesion of Methods)

In [35]:
from sklearn.tree import DecisionTreeRegressor
modelInstanceR = DecisionTreeRegressor(max_leaf_nodes=16)
modelInstanceRsimpler = DecisionTreeRegressor(max_leaf_nodes=16)
alteredData4 = metricsData['data']
nlineModelR = toolkit.refinement.makeAndUpdateModel(rootDirectory, alteredData4, 2, 'lcom', modelInstanceR, modelInstanceRsimpler, visualize=False, scoreOnly=False)

Response variable was lcom
Model.score: 0.411732
             name  importance
14  publicMethods    0.566104
24  privateFields    0.196105
Model.score: 0.416864
            name  importance
1  publicMethods    0.565160
2  privateFields    0.196252


### The output lists publicMethods (the number of public methods in a class) as the most important feature for predicting lack of cohesion of methods (LCOM), with an importance of about 0.57, followed by privateFields (about 0.20).

### What happens if we drop 'totalMethods' from our dataframe?

In [36]:
from sklearn.tree import DecisionTreeRegressor
modelInstanceR = DecisionTreeRegressor(max_leaf_nodes=16)
modelInstanceRsimpler = DecisionTreeRegressor(max_leaf_nodes=16)
alteredData4 = metricsData['data'].drop(['totalMethods'],axis=1)
nlineModelR = toolkit.refinement.makeAndUpdateModel(rootDirectory, alteredData4, 2, 'lcom', modelInstanceR, modelInstanceRsimpler, visualize=False, scoreOnly=False)

Response variable was lcom
Model.score: 0.479870
             name  importance
13  publicMethods    0.566165
20    totalFields    0.196126
Model.score: 0.686717
            name  importance
1  publicMethods    0.565219
2    totalFields    0.196871


### With totalMethods dropped, publicMethods is still the most important feature for predicting lack of cohesion of methods (about 0.57), and totalFields is second (about 0.20).

### This model predicts the size of files in terms of lines of code (loc)

In [37]:
from sklearn.tree import DecisionTreeRegressor
modelInstanceR = DecisionTreeRegressor(max_leaf_nodes=32)
modelInstanceRsimpler = DecisionTreeRegressor(max_leaf_nodes=16)
alteredData4 = metricsData['data']
nlineModelR = toolkit.refinement.makeAndUpdateModel(rootDirectory, alteredData4, 2, 'loc', modelInstanceR, modelInstanceRsimpler, visualize=False, scoreOnly=False)

Response variable was loc
Model.score: 0.924679
              name  importance
38  assignmentsQty    0.809094
Model.score: 0.909478
             name  importance
7  assignmentsQty    0.816106


### The most important feature for predicting size of a file by lines of code is the total number of assignments.

### This model predicts the WMC (Weighted Methods per Class), or McCabe's complexity.

In [38]:
from sklearn.tree import DecisionTreeRegressor
modelInstanceR = DecisionTreeRegressor(max_leaf_nodes=16)
modelInstanceRsimpler = DecisionTreeRegressor(max_leaf_nodes=16)
alteredData4 = metricsData['data']
nlineModelR = toolkit.refinement.makeAndUpdateModel(rootDirectory, alteredData4, 3, 'wmc', modelInstanceR, modelInstanceRsimpler, visualize=False, scoreOnly=False)

Response variable was wmc
Model.score: 0.901793
   name  importance
30  loc    0.776272
Model.score: 0.897073
   name  importance
30  loc     0.93211
Model.score: 0.932606
  name  importance
4  loc    0.926246
Model.score: 0.917882
  name  importance
4  loc    0.924661


### In this case study, we have used the toolkit to do the following:
- Gather the okhttp dataset
- Create models of net churn: regression on the raw value, and classification above and below the mean and in 3 binned categories (low, medium, high)
- Create regression models to analyze coupling between objects (CBO), lack of cohesion of methods (LCOM), lines of code (loc), and weighted methods per class (WMC).