# Software Evolution - Practical Session
## Laws of software evolution, code counting, code duplication and dependency analysis
## Academic year 2022-2023

### Write your answers under the questions that are present in this notebook  

#### Note: Print the final output of each cell in this notebook

### Read Section 1 and Section 2 in the provided description document before proceeding with the following section 

In [2]:
import os
import json
from tqdm import tqdm
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
import subprocess
import math
from pandas import option_context
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

## 2. Verifying laws of software evolution

In [3]:
#Enter the path where the Eucalyptus project is present
eucalyptus_project_path = ''

### 2.1 Data preprocessing
1. Get all the tags present in the Eucalyptus project using git
2. Filter out the tags that do not correspond to official releases

Code Hint to get the tags that are present in the project and read the terminal output:

command = f'git -C {eucalyptus_project_path} tag -l --format="%(refname:short)" | sort -r'   
process = subprocess.Popen(command, stdout=subprocess.PIPE, shell=True)   
all_tags = list(line.strip().decode("utf-8") for line in process.stdout)

In [None]:
# Your code here
# Get all the tags and print them

print(all_tags)

##### Filter the tags based on semantic versioning
1. Write a regex statement to select the tags that obey semantic versioning (hint pattern = '^v?[0-9]+\.[0-9]+(\.[0-9]+)?$')   
2. Print the selected major.minor.patch versions and major.minor versions 
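
A minimal sketch of the filtering step, assuming the tags are already available as a list of strings. It reuses the pattern from the hint above; the helper names `version_key` and `split_release_tags` and the example tags are hypothetical. Sorting on a numeric key avoids the lexicographic ordering produced by `sort -r`, which would place `v4.10` before `v4.2`.

```python
import re

# Pattern from the hint: optional "v", then major.minor with an optional patch part.
SEMVER_PATTERN = re.compile(r'^v?[0-9]+\.[0-9]+(\.[0-9]+)?$')

def version_key(tag):
    """Turn 'v4.10.2' into (4, 10, 2) so tags sort numerically, not alphabetically."""
    return tuple(int(part) for part in tag.lstrip('v').split('.'))

def split_release_tags(tags):
    """Return (major.minor.patch tags, major.minor tags), each sorted by version."""
    releases = [tag for tag in tags if SEMVER_PATTERN.match(tag)]
    patch_versions = sorted((t for t in releases if t.count('.') == 2), key=version_key)
    minor_versions = sorted((t for t in releases if t.count('.') == 1), key=version_key)
    return patch_versions, minor_versions

# Hypothetical tag names, for illustration only (not the real Eucalyptus tag list).
example_tags = ['v4.10.0', 'v4.2', 'v4.2.1', 'v4.2.0-rc1', 'nightly-build', '3.1', 'v4.10']
patch_versions, minor_versions = split_release_tags(example_tags)
print(patch_versions)  # ['v4.2.1', 'v4.10.0']
print(minor_versions)  # ['3.1', 'v4.2', 'v4.10']
```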

In [None]:
# Your tag filtering code here

### 2.2 Law of Increasing Growth
1. Fetch the total lines of code, total blank lines, total comment lines and total number of files (hint: SUM field in the output) for each tag along with their release date using CLOC
2. Separate the data as follows (also mentioned in section 2.1 of the description document)  
    a) Consider all the three-component versions (major.minor.patch) - Dataset A  
    b) Consider only the minor versions (major.minor) - Dataset B  

Note: For example, if there is a tag like 2.1.0.1, you can treat it as 2.1.0, provided that no 2.1.0 tag exists in the data. If both 2.1.0 and 2.1.0.1 exist, you can ignore the latter.
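
One possible way to apply the rule in this note, sketched under a few assumptions: the tags are purely numeric apart from an optional leading `v`, and when several four-component tags share the same base, only the first one encountered is kept. The function name `normalise_four_component_tags` and the example tags are hypothetical.

```python
def normalise_four_component_tags(tags):
    """Map each kept tag to the version label used in the analysis."""
    def strip_v(tag):
        return tag[1:] if tag.startswith('v') else tag

    existing = {strip_v(tag) for tag in tags}
    selected = {}  # original tag -> version label
    for tag in tags:
        parts = strip_v(tag).split('.')
        if len(parts) == 4:
            base = '.'.join(parts[:3])
            # Keep 2.1.0.1 as 2.1.0 only if no 2.1.0 tag exists
            # and no earlier four-component tag already took that label.
            if base not in existing and base not in selected.values():
                selected[tag] = base
        elif len(parts) in (2, 3):
            selected[tag] = strip_v(tag)
    return selected

# Hypothetical tags, for illustration only.
example = ['2.1.0.1', '2.2.0', '2.2.0.1', '2.1']
print(normalise_four_component_tags(example))
# {'2.1.0.1': '2.1.0', '2.2.0': '2.2.0', '2.1': '2.1'}
```

Keeping the original tag as the dictionary key means it can still be passed to git, while the value is the label to plot.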

Code hint:  
To set the project to a required tag - 

command = f'git -C {eucalyptus_project_path} reset --hard {tag}'   
os.system(command)   
command = f'CLOC/cloc --git {eucalyptus_project_path} --json' # invoke cloc application to read loc   
process = subprocess.Popen(command, stdout=subprocess.PIPE, shell=True)   
data_as_str = process.stdout.read()   

For each tag, pass the command to the terminal, invoke cloc tool to get the necessary data in the required format (json, md,...), read the terminal output and store the data
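
As a sketch of the "read the terminal output and store the data" step, the JSON text returned by `process.stdout.read()` can be parsed with the `json` module and the totals taken from its `SUM` entry. The function name `parse_cloc_summary` and the numbers in `sample_output` are made up for illustration.

```python
import json

def parse_cloc_summary(data_as_str):
    """Extract the SUM totals from the JSON text printed by `cloc --json`."""
    if isinstance(data_as_str, bytes):  # process.stdout.read() returns bytes
        data_as_str = data_as_str.decode('utf-8')
    report = json.loads(data_as_str)
    total = report['SUM']
    return {
        'code': total['code'],
        'comment': total['comment'],
        'blank': total['blank'],
        'nfiles': total['nFiles'],
    }

# Made-up numbers in the shape of a cloc JSON report, for illustration only.
sample_output = b'''{
  "header": {"n_files": 3, "n_lines": 180},
  "Java": {"nFiles": 2, "blank": 20, "comment": 30, "code": 100},
  "Python": {"nFiles": 1, "blank": 5, "comment": 5, "code": 20},
  "SUM": {"blank": 25, "comment": 35, "code": 120, "nFiles": 3}
}'''
summary = parse_cloc_summary(sample_output)
print(summary)  # {'code': 120, 'comment': 35, 'blank': 25, 'nfiles': 3}
```

Collecting one such dictionary per tag gives rows that can be turned into a DataFrame later.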

In [5]:
#Your code here    

For each tag

1. get the release date
2. combine it with the results obtained in the previous cell (lines of code, comments ...)

The final output of this cell should have tag, its release date and total #lines of code, total #comments, total #files, total #blanks.
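
A minimal sketch of combining the two sources, assuming the release dates arrive as `tag|YYYY-MM-DD` lines (the format produced by `%(refname:short)|%(creatordate:short)`) and the CLOC totals are stored in a dictionary keyed by tag. The helper names `parse_tag_dates` and `combine`, the dates and the totals are all hypothetical.

```python
from datetime import date

def parse_tag_dates(lines):
    """Map each tag to a datetime.date, given 'tag|YYYY-MM-DD' lines."""
    tag_dates = {}
    for raw in lines:
        line = raw.decode('utf-8') if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line:
            continue
        tag, _, day = line.partition('|')
        tag_dates[tag] = date.fromisoformat(day)
    return tag_dates

def combine(tag_dates, cloc_totals):
    """Build one record per tag that has both a date and CLOC totals, oldest first."""
    rows = [
        {'tag': tag, 'date': tag_dates[tag], **cloc_totals[tag]}
        for tag in cloc_totals if tag in tag_dates
    ]
    return sorted(rows, key=lambda row: row['date'])

# Made-up dates and totals, for illustration only.
tag_lines = [b'v4.2.1|2016-01-15\n', b'v4.2.0|2015-10-22\n']
cloc_totals = {
    'v4.2.0': {'code': 1000, 'comment': 200, 'blank': 150, 'nfiles': 40},
    'v4.2.1': {'code': 1100, 'comment': 210, 'blank': 160, 'nfiles': 42},
}
rows = combine(parse_tag_dates(tag_lines), cloc_totals)
for row in rows:
    print(row['tag'], row['date'], row['code'])
```

The resulting list of dictionaries can be passed directly to `pd.DataFrame(rows)`.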

In [None]:
command = f'git -C {eucalyptus_project_path} tag -l --format="%(refname:short)|%(creatordate:short)" | sort -r' # to get the release date
process = subprocess.Popen(command, stdout=subprocess.PIPE, shell=True)
# Your code here

For easy visualization in the next task, separate your data into two sets:
1. One containing the details (tag, release date, total lines of code, total comments and so on) for versions of the form major.minor
2. One containing the same details for versions of the form major.minor.patch 

In [None]:
# Your code here to get the details corresponding to versions of the form major.minor

In [None]:
# Your code here to get the details corresponding to versions of the form major.minor.patch

#### Visualization
Hint: Convert the dates in the data to proper datetime format rather than having it as a string
1. x-axis = version, y-axis = number  
    i) for tags of the type major.minor.patch  
    ii) for tags of the type major.minor  

2. x-axis = date, y-axis = number  
    i) for tags of the type major.minor.patch  
    ii) for tags of the type major.minor  

For easy comparison, place the plots with versions on the x-axis in the top row, at (0,0) and (0,1), and the corresponding plots with dates on the x-axis directly below them, at (1,0) and (1,1) 

E.g. if you are using DataFrame:  
ax = df[['code','blank','comment','nfiles','major.minor']].plot(x='major.minor', figsize=(18,10), ax=axes[0,0], legend=False)   
ax.tick_params(axis='both',labelsize=15)  
ax.set_xlabel('major.minor version', fontsize=16)

In [None]:
fig,axes = plt.subplots(2,2)

# Your plotting code here

### Questions:
1. Is there any difference between the plot with dates on the x-axis and the plot with tags on the x-axis? If yes, what is the difference? If no, why are they the same?

2. Which type of plot is preferable for software evolution analysis?   
    a) date in x-axis  
    b) tag in x-axis  
Why?

3. Choose an option regarding the growth of the software by considering Dataset B. Motivate your choice using a 1d regression plot (below). **Note**: Do not include comments and blank lines of code     
    a) Linear  
    b) Sub-linear  
    c) Super-linear  

4. Report the root mean squared error (RMSE) and standard error and R-squared (R2) values that were obtained through the above 1d regression plot.
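
A sketch of how the regression metrics could be computed, under stated assumptions: `np.polyfit` is used for the least-squares fit, the predictor is the number of days since the first release, and "standard error" is taken to mean the residual standard error with n - p degrees of freedom (check this against the definition in the description document). The function name `fit_polynomial` and the data are hypothetical, and the function expects more data points than fitted parameters.

```python
import numpy as np

def fit_polynomial(x, y, degree=1):
    """Least-squares polynomial fit; returns coefficients, RMSE, standard error and R2."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    coefficients = np.polyfit(x, y, degree)       # highest power first
    predicted = np.polyval(coefficients, x)
    residual_ss = float(np.sum((y - predicted) ** 2))
    total_ss = float(np.sum((y - y.mean()) ** 2))
    n_params = degree + 1
    return {
        'coefficients': coefficients,
        'rmse': (residual_ss / len(y)) ** 0.5,
        # Assumed definition: residual standard error with n - p degrees of freedom.
        'standard_error': (residual_ss / (len(y) - n_params)) ** 0.5,
        'r2': 1.0 - residual_ss / total_ss,
    }

# Made-up data, for illustration only.
days = [0, 90, 200, 310, 400]          # days since the first release
loc = [1000, 1900, 3100, 4050, 5100]   # lines of code
linear = fit_polynomial(days, loc, degree=1)
quadratic = fit_polynomial(days, loc, degree=2)
print('linear R2:', round(linear['r2'], 3), 'RMSE:', round(linear['rmse'], 1))
print('quadratic term:', quadratic['coefficients'][0])
```

Comparing the linear fit with a quadratic one is one way to motivate the choice: a clearly positive quadratic term hints at super-linear growth, a clearly negative one at sub-linear growth, and a term near zero at linear growth.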

Plot your 1d regression plot in the following cell  

In [4]:
plt.figure(figsize=(7,7))
# Your code for regression plot

### Correlation 
Correlation is generally used to analyse the relationship between variables. Here, analyse the relationship between the number of lines of code and the number of files using Spearman correlation and Pearson correlation by considering Dataset A. Report the correlation up to 3 decimal places.
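
To illustrate the difference between the two coefficients, here is a small NumPy sketch. Pearson measures linear association, while Spearman is Pearson applied to the ranks of the values, so it only measures monotonic association. The helper names and the data are hypothetical; with a DataFrame, `df.corr(method='pearson')` and `df.corr(method='spearman')` serve the same purpose.

```python
import numpy as np

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind='stable')
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    for value in np.unique(values):
        tied = values == value
        ranks[tied] = ranks[tied].mean()
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    return float(np.corrcoef(x, y)[0, 1])

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Made-up data: LOC grows monotonically but not linearly with the file count.
nfiles = [10, 20, 30, 40, 50]
loc = [1000, 4000, 9000, 16000, 25000]
print(round(pearson(nfiles, loc), 3))   # 0.981
print(round(spearman(nfiles, loc), 3))  # 1.0
```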

In [None]:
# Your code for correlation

### Questions:
5. Do you find any difference in correlation values between Pearson and Spearman? Which one is preferable for this use case? Why?

6. Based on the above correlation value, how much is the number of lines of code related to the number of files?

### Prediction
Consider Dataset B for this task. Drop the last two data points in "number of lines of code" (LOC) (i.e. drop (LOC) corresponding to v4.4.1 and v4.4.2) and forecast the values for (LOC) for v4.4.1 and v4.4.2 using a basic linear/polynomial regression model.
1. Drop the last two data points
2. Build a basic regression model
3. Ask the model to forecast the next two data points
4. Plot the LOC original and forecasted in the same plot. x-axis = date, y-axis = number of lines of code. The plot should have the original evolution line and the fitted line as well. 
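
The four steps above can be sketched as follows. This version swaps in `np.polyfit` for the `LinearRegression` and `PolynomialFeatures` classes imported at the top of the notebook; both solve the same least-squares problem. The function name `forecast_held_out`, the dates and the LOC values are hypothetical, and the predictor is assumed to be days since the first release.

```python
from datetime import date, timedelta
import numpy as np

def forecast_held_out(dates, loc, holdout=2, degree=1):
    """Fit on all but the last `holdout` points, then predict every date.

    Returns (fitted values for the training points, forecasts for the held-out points).
    """
    # Use days since the first release as the numeric predictor.
    days = np.array([(d - dates[0]).days for d in dates], dtype=float)
    loc = np.asarray(loc, dtype=float)
    coefficients = np.polyfit(days[:-holdout], loc[:-holdout], degree)
    fitted = np.polyval(coefficients, days)
    return fitted[:-holdout], fitted[-holdout:]

# Made-up releases every 100 days with perfectly linear growth, for illustration only.
dates = [date(2020, 1, 1) + timedelta(days=100 * i) for i in range(6)]
loc = [1000 + 5 * 100 * i for i in range(6)]
fitted, forecast = forecast_held_out(dates, loc, holdout=2, degree=1)
print('held-out LOC:', loc[-2:])
print('forecast LOC:', np.round(forecast, 1))
```

For the plot, the original `loc` values and the concatenation of `fitted` and `forecast` can both be drawn against `dates`.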

In [None]:
# Your code here

In [None]:
# Your code here

In [None]:
# Your code here

In [None]:
# Your plotting code here

### Questions:
7. What is the polynomial degree that you adopted to build the model? Why? 

8. What is the coefficient of determination? (R-squared)

9. What is the root mean squared error (RMSE) of the model? (hint: consider the error between the full original curve and the fitted curve)

10. What is the standard error obtained?

11. What information do RMSE and R2 (R-squared) convey?

12. Based on the R2 score, standard error and RMSE obtained during training, how well does the regression model fit the data? Which degree would you adopt? Motivate your choice

### Filtering on coding language

Consider the **prominent languages** used in this software project and plot the distribution of their LOC in a pie chart for the first and the last versions. If the language distribution is too small to visualize, you can plot them in a separate figure or just analyse the numerical data

In [None]:
# Your code to get the data for LOC of first and last versions

In [None]:
# Your pie chart code here

### Question:
13. Do you find any significant difference in the distribution of the languages used in the software project between its first and last versions? If so, what is the difference and how much is it?

### Law of increasing growth for coding languages

Get the LOC for each of the considered prominent languages (as above) for each version of the form major.minor.patch  
1) Plot the date (x-axis) vs LOC (y-axis) 
2) Plot the date (x-axis) vs proportional LOC (y-axis). Proportional is LOC of a language/total number of LOC
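
A minimal sketch of the proportional LOC computation, assuming the parsed CLOC JSON report of one version is available as a dictionary with one entry per language plus a `SUM` entry. The function name `language_shares` and the numbers are made up.

```python
def language_shares(cloc_report, languages):
    """Return LOC of each language divided by the total LOC of the version."""
    total = cloc_report['SUM']['code']
    return {
        lang: cloc_report.get(lang, {}).get('code', 0) / total
        for lang in languages
    }

# Made-up report for a single version, for illustration only.
report = {
    'Java': {'nFiles': 8, 'blank': 90, 'comment': 120, 'code': 800},
    'C': {'nFiles': 3, 'blank': 20, 'comment': 30, 'code': 150},
    'Python': {'nFiles': 1, 'blank': 5, 'comment': 5, 'code': 50},
    'SUM': {'nFiles': 12, 'blank': 115, 'comment': 155, 'code': 1000},
}
shares = language_shares(report, ['Java', 'C', 'Perl'])
print(shares)
```

A language that is absent from a version gets a share of 0, which keeps the per-language series aligned across versions.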

In [None]:
# Your code to get the LOC for each prominent language of each version

In [None]:
# Your plotting code here

### Question:
14. Do the prominent languages you considered obey the Law of Increasing Growth?

15. Does this software project obey the Law of Increasing Growth?

### Modify the CLOC parameters

1. Exclude all blank lines and verify if the Law of Increasing Growth still holds good for this filtering

In [None]:
# Your code and plot here

2. Exclude all comment lines and verify if the Law of Increasing Growth still holds good for this filtering

In [None]:
# Your code and plot here

3. Exclude all non-code files (or consider only the prominent coding languages used in the project) and verify if the Law of Increasing Growth still holds good for this filtering

In [None]:
# Your code and plot here

### 2.3 Law of Continuing Change
1. Using the CLOC tool, find the number of lines of code that were added, modified, removed and left unchanged (same) between two consecutive versions
2. Consider all the three component versions (major.minor.patch) - Dataset A
3. Consider only the minor versions (major.minor) - Dataset B
4. Create two plots (one for Dataset A and another for Dataset B) that show all the features together, with date on the x-axis and the number of lines on the y-axis.
5. Also plot each feature in its own subplot

Code hint:  
For each pair of tags, set the original project to the required tag and the copy of the original project to the next tag.  
To get the required data in json format - "command = f'CLOC/cloc --git --diff {first_project_path} {second_project_path} --json'"
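
A sketch of the pairing and parsing steps. The JSON layout is an assumption to verify against your own `cloc --diff --json` output: the code below expects a `SUM` object with `added`, `modified`, `removed` and `same` entries, each holding a `code` count. The helper names and the sample numbers are hypothetical.

```python
import json

def consecutive_pairs(versions):
    """Pair each version with the one that follows it (versions must be sorted)."""
    return list(zip(versions, versions[1:]))

def parse_cloc_diff(data_as_str):
    """Extract total added/modified/removed/same code lines from a cloc diff report."""
    if isinstance(data_as_str, bytes):
        data_as_str = data_as_str.decode('utf-8')
    totals = json.loads(data_as_str)['SUM']  # assumed layout, see the note above
    return {
        change: totals[change]['code']
        for change in ('added', 'modified', 'removed', 'same')
    }

# Hypothetical version list and made-up diff numbers, for illustration only.
pairs = consecutive_pairs(['v4.0.0', 'v4.0.1', 'v4.1.0'])
print(pairs)  # [('v4.0.0', 'v4.0.1'), ('v4.0.1', 'v4.1.0')]

sample_diff = '''{"SUM": {
  "added":    {"nFiles": 2, "blank": 4, "comment": 6, "code": 120},
  "modified": {"nFiles": 5, "blank": 0, "comment": 3, "code": 40},
  "removed":  {"nFiles": 1, "blank": 2, "comment": 1, "code": 15},
  "same":     {"nFiles": 30, "blank": 0, "comment": 200, "code": 900}
}}'''
print(parse_cloc_diff(sample_diff))
# {'added': 120, 'modified': 40, 'removed': 15, 'same': 900}
```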

In [None]:
eucalyptus_project_path = '' # your project path here
eucalyptus_copy_project_path = '' # your path to the copy of your project here
major_minor_patch_versions = '' # set of versions that you would like to consider for this analysis

In [None]:
# Your code here to get the data

In [None]:
# Your code here to get the dates corresponding to the tags

In [None]:
# Your plotting code here (all the features in a single plot)

In [None]:
# Your plotting code here, different plot for each feature
fig,axes = plt.subplots(2,2)

### Questions:
1. Does the Law of Continuing Change hold here? Support your answer empirically.

2. Does the Law of Increasing Growth also hold here?

### Law of continuing change for coding languages
Consider the prominent languages used in this software project for versions of the form major.minor.patch
1. Obtain the number of lines of code that are added, modified, removed and same between two consecutive versions
2. Make plots for each parameter (added, modified, removed and same) with date on x-axis  
    a) y-axis number of lines of code   
    b) y-axis proportional number of lines of code (number of lines of code of that language/total number of lines of code) 

In [None]:
# Your code here

In [None]:
# Your plotting code here
fig,axes = plt.subplots(2,2)

### Question:
3. Does the Law of Continuing Change hold for all the considered prominent languages? Comment on the rate of growth.

4. Does this software project obey the Law of Continuing Change?