## *WARNING*
<ins>Before running this script, make sure that you have followed the steps described [here](https://github.com/pwr-pbr23/M6#preparation-for-reproduction).</ins>
## Accessing files for reproduction
To access the files, we need to mount Google Drive and change the working directory. Mounting the drive opens a pop-up window; follow the steps it shows.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
%cd /content/drive/MyDrive/M6/DeepLineDP/script

/content/drive/MyDrive/M6/DeepLineDP/script


In [3]:
!ls

DeepLineDP_model.py			line-level-baseline
export_data_for_line_level_baseline.py	my_util.py
file-level-baseline			preprocess_data.py
generate_prediction_cross_projects.py	__pycache__
generate_prediction.py			train_model.py
get_evaluation_result.R			train_word2vec.py


The previous command should have returned:

```
condacolab_install.log			my_util.py
DeepLineDP_model.py			new_preprocessing_methods.py
export_data_for_line_level_baseline.py	preprocess_data.py
file-level-baseline			__pycache__
generate_prediction_cross_projects.py	Rplots.pdf
generate_prediction.py			run_py_files.ipynb
get_evaluation_result.R			train_model.py
line-level-baseline			train_word2vec.py
```
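
The comparison between the actual and the expected listing can also be done programmatically. The following is a minimal sketch, not part of the original notebook: `EXPECTED` is a hand-copied subset of the listing above (scripts and baseline folders only), and `missing_entries` is a hypothetical helper name.

```python
from pathlib import Path

# Names copied by hand from the expected listing (scripts and baseline folders only).
EXPECTED = {
    "DeepLineDP_model.py", "export_data_for_line_level_baseline.py",
    "generate_prediction.py", "generate_prediction_cross_projects.py",
    "get_evaluation_result.R", "my_util.py", "new_preprocessing_methods.py",
    "preprocess_data.py", "train_model.py", "train_word2vec.py",
    "file-level-baseline", "line-level-baseline",
}

def missing_entries(directory, expected=EXPECTED):
    """Return the expected names that are absent from the directory, sorted."""
    present = {entry.name for entry in Path(directory).iterdir()}
    return sorted(set(expected) - present)

# Run from the script directory; an empty list means nothing is missing.
print(missing_entries("."))
```

Applied to the `!ls` output shown in cell 3, this check would report `new_preprocessing_methods.py` as missing.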

## Installing necessary libraries - 2 min
Because the working directory is on the mounted Google Drive, the result files are saved permanently and the steps below do not have to be repeated each time. The libraries are the exception: this step <ins>needs to be run before each new session</ins>.

In [4]:
!pip install gensim==3.8.3
!pip install joblib==1.0.1
!pip install more-itertools==8.10.0
!pip install numpy==1.24.0
!pip install pyxdameraulevenshtein==1.5.3
!pip install pandas==1.3.3
!pip install pillow==8.3.2
!pip install python-dateutil==2.8.2
!pip install pytz==2021.3
!pip install scikit-learn==1.0
!pip install scipy==1.7.1
!pip install six==1.16.0
!pip install smart-open==5.2.1
!pip install threadpoolctl==3.0.0
!pip install tqdm==4.62.3
!pip install typing-extensions==3.10.0.2

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting numpy==1.24.0
  Using cached numpy-1.24.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.3 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.22.4
    Uninstalling numpy-1.22.4:
      Successfully uninstalled numpy-1.22.4
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
yfinance 0.2.18 requires pytz>=2022.5, but you have pytz 2021.3 which is incompatible.
torchvision 0.15.1+cu118 requ
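
Because pip reports dependency conflicts here, it may be worth confirming which versions actually ended up installed. The following is a minimal sketch using only the standard library's `importlib.metadata`; `PINNED` is a hand-copied subset of the pins from the install cell, and `check_pins` is a hypothetical helper name, not part of the original notebook.

```python
from importlib import metadata

# A subset of the pins from the install cell, copied by hand.
PINNED = {
    "gensim": "3.8.3",
    "numpy": "1.24.0",
    "pandas": "1.3.3",
    "scikit-learn": "1.0",
}

def check_pins(pins):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches

# Print one line per package whose installed version differs from the pin.
for name, (expected, installed) in check_pins(PINNED).items():
    print(f"{name}: expected {expected}, found {installed}")
```

No output means that every listed package matches its pin.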

## preprocess_data.py - 9 min
The preprocessed data used to train the models is saved in `../datasets/preprocessed_data/`.

In [6]:
!python preprocess_data.py

  0% 0/9 [00:00<?, ?it/s]finish release activemq-5.0.0
finish release activemq-5.1.0
finish release activemq-5.2.0
finish release activemq-5.3.0
finish release activemq-5.8.0
 11% 1/9 [01:15<10:07, 75.93s/it]finish release camel-1.4.0
finish release camel-2.9.0
finish release camel-2.10.0
finish release camel-2.11.0
 22% 2/9 [03:17<11:59, 102.74s/it]finish release derby-10.2.1.6
finish release derby-10.3.1.4
finish release derby-10.5.1.1
 33% 3/9 [04:22<08:34, 85.73s/it] finish release groovy-1_5_7
finish release groovy-1_6_BETA_1
finish release groovy-1_6_BETA_2
 44% 4/9 [04:44<05:02, 60.53s/it]finish release hbase-0.94.0
finish release hbase-0.95.0
finish release hbase-0.95.2
 56% 5/9 [05:36<03:48, 57.17s/it]finish release hive-0.9.0
finish release hive-0.10.0
finish release hive-0.12.0
 67% 6/9 [06:21<02:39, 53.12s/it]finish release jruby-1.1
finish release jruby-1.4.0
finish release jruby-1.5.0
finish release jruby-1.7.0.preview1
 78% 7/9 [06:55<01:34, 47.06s/it]finish release luc

## train_word2vec.py - 3 min
Creates a word2vec model for each project and saves it in `../output/Word2Vec_model`.

In [7]:
!python train_word2vec.py activemq
!python train_word2vec.py camel
!python train_word2vec.py derby
!python train_word2vec.py groovy
!python train_word2vec.py hbase
!python train_word2vec.py hive
!python train_word2vec.py jruby
!python train_word2vec.py lucene
!python train_word2vec.py wicket

save word2vec model at path ../output/Word2Vec_model//activemq-50dim.bin done
save word2vec model at path ../output/Word2Vec_model//camel-50dim.bin done
save word2vec model at path ../output/Word2Vec_model//derby-50dim.bin done
save word2vec model at path ../output/Word2Vec_model//groovy-50dim.bin done
save word2vec model at path ../output/Word2Vec_model//hbase-50dim.bin done
save word2vec model at path ../output/Word2Vec_model//hive-50dim.bin done
save word2vec model at path ../output/Word2Vec_model//jruby-50dim.bin done
save word2vec model at path ../output/Word2Vec_model//lucene-50dim.bin done
save word2vec model at path ../output/Word2Vec_model//wicket-50dim.bin done
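
The same nine project names are repeated in most cells of this notebook. As an alternative to listing every command, the calls could be generated in a loop. This is a sketch under assumptions: `build_commands` and `run_all` are hypothetical helpers, and the scripts are assumed to accept exactly the arguments shown in the cells.

```python
import subprocess
import sys

# Project names taken from the cells in this notebook.
DATASETS = ["activemq", "camel", "derby", "groovy", "hbase",
            "hive", "jruby", "lucene", "wicket"]

def build_commands(script, flag=None, datasets=DATASETS):
    """Build one argument list per dataset: [python, script, (flag), name]."""
    prefix = [sys.executable, script] + ([flag] if flag else [])
    return [prefix + [name] for name in datasets]

def run_all(script, flag=None, datasets=DATASETS):
    """Run the script once per dataset and stop at the first failure."""
    for cmd in build_commands(script, flag, datasets):
        subprocess.run(cmd, check=True)
```

For example, `run_all("train_word2vec.py")` would correspond to the cell above, and `run_all("train_model.py", "-dataset")` to the training cell. With `check=True`, a failing project raises an exception instead of being skipped silently.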


## train_model.py - 51 min
Trains the model, which is then saved in `../output/model/DeepLineDP/`.

In [None]:
!python train_model.py -dataset activemq
!python train_model.py -dataset camel
!python train_model.py -dataset derby
!python train_model.py -dataset groovy
!python train_model.py -dataset hbase
!python train_model.py -dataset hive
!python train_model.py -dataset jruby
!python train_model.py -dataset lucene
!python train_model.py -dataset wicket


^C
load Word2Vec for activemq finished
  0% 0/10 [00:00<?, ?it/s]activemq - at epoch: 1
 10% 1/10 [00:35<05:19, 35.46s/it]activemq - at epoch: 2
 20% 2/10 [01:09<04:38, 34.86s/it]activemq - at epoch: 3
 30% 3/10 [01:44<04:03, 34.80s/it]activemq - at epoch: 4
 40% 4/10 [02:19<03:29, 34.94s/it]activemq - at epoch: 5
 50% 5/10 [02:54<02:54, 34.92s/it]activemq - at epoch: 6
 60% 6/10 [03:29<02:19, 34.95s/it]activemq - at epoch: 7
 70% 7/10 [04:05<01:45, 35.31s/it]activemq - at epoch: 8
 80% 8/10 [04:42<01:11, 35.63s/it]activemq - at epoch: 9
 90% 9/10 [05:18<00:35, 35.77s/it]activemq - at epoch: 10
100% 10/10 [05:53<00:00, 35.39s/it]
load Word2Vec for camel finished
  0% 0/10 [00:00<?, ?it/s]camel - at epoch: 1
 10% 1/10 [00:37<05:35, 37.31s/it]camel - at epoch: 2
 20% 2/10 [01:14<04:59, 37.49s/it]camel - at epoch: 3
 30% 3/10 [01:51<04:21, 37.29s/it]camel - at epoch: 4
 40% 4/10 [02:28<03:42, 37.05s/it]camel - at epoch: 5
 50% 5/10 [03:05<03:05, 37.10s/it]camel - at epoch: 6
 60% 6/10 [0
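
Training takes a long time and the log above starts with an interrupt (`^C`), so before moving on it may help to confirm that the model directory is not empty. A minimal sketch, assuming only that trained models are written somewhere under `../output/model/DeepLineDP/`; `has_files` is a hypothetical helper, and the check says nothing about whether every project was trained.

```python
from pathlib import Path

def has_files(directory):
    """True if the directory exists and contains at least one entry."""
    path = Path(directory)
    return path.is_dir() and any(path.iterdir())

# Path taken from the description of train_model.py above.
MODEL_DIR = "../output/model/DeepLineDP/"

if not has_files(MODEL_DIR):
    print(f"No trained models found in {MODEL_DIR}; run train_model.py first.")
```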

## generate_prediction.py - 14 min
Takes the model from `../output/model/DeepLineDP/`.

Saves the output in `../output/intermediate_output/DeepLineDP/within-release/`.

Saves the predictions in `../output/prediction/DeepLineDP/within-release/`.

In [None]:
!python generate_prediction.py -dataset activemq
!python generate_prediction.py -dataset camel
!python generate_prediction.py -dataset derby
!python generate_prediction.py -dataset groovy
!python generate_prediction.py -dataset hbase
!python generate_prediction.py -dataset hive
!python generate_prediction.py -dataset jruby
!python generate_prediction.py -dataset lucene
!python generate_prediction.py -dataset wicket

load Word2Vec for activemq finished
generating prediction of release: activemq-5.2.0
100% 1286/1286 [00:28<00:00, 44.95it/s]
finished release activemq-5.2.0
generating prediction of release: activemq-5.3.0
100% 1487/1487 [00:33<00:00, 44.50it/s]
finished release activemq-5.3.0
generating prediction of release: activemq-5.8.0
100% 1927/1927 [00:43<00:00, 44.72it/s]
finished release activemq-5.8.0
load Word2Vec for camel finished
generating prediction of release: camel-2.10.0
100% 2559/2559 [00:57<00:00, 44.66it/s]
finished release camel-2.10.0
generating prediction of release: camel-2.11.0
100% 2832/2832 [00:58<00:00, 48.30it/s]
finished release camel-2.11.0
load Word2Vec for derby finished
generating prediction of release: derby-10.5.1.1
100% 1931/1931 [01:02<00:00, 30.95it/s]
finished release derby-10.5.1.1
load Word2Vec for groovy finished
generating prediction of release: groovy-1_6_BETA_2
100% 701/701 [00:16<00:00, 42.45it/s]
finished release groovy-1_6_BETA_2
load Word2Vec for hba

## generate_prediction_cross_projects.py - 1h
Takes the model from `../output/model/DeepLineDP/`.

Saves the output in `../output/intermediate_output/DeepLineDP/cross-project/`.

Saves the predictions in `../output/prediction/DeepLineDP/cross-project/`.


In [None]:
!python generate_prediction_cross_projects.py -dataset activemq
!python generate_prediction_cross_projects.py -dataset camel
!python generate_prediction_cross_projects.py -dataset derby
!python generate_prediction_cross_projects.py -dataset groovy
!python generate_prediction_cross_projects.py -dataset hbase
!python generate_prediction_cross_projects.py -dataset hive
!python generate_prediction_cross_projects.py -dataset jruby
!python generate_prediction_cross_projects.py -dataset lucene
!python generate_prediction_cross_projects.py -dataset wicket

load Word2Vec for activemq finished
using model from activemq-5.0.0 to generate prediction of camel-2.10.0
100% 2559/2559 [00:27<00:00, 92.95it/s] 
finished release camel-2.10.0
using model from activemq-5.0.0 to generate prediction of camel-2.11.0
100% 2832/2832 [00:28<00:00, 99.62it/s] 
finished release camel-2.11.0
using model from activemq-5.0.0 to generate prediction of derby-10.5.1.1
100% 1931/1931 [00:41<00:00, 46.02it/s]
finished release derby-10.5.1.1
using model from activemq-5.0.0 to generate prediction of groovy-1_6_BETA_2
100% 701/701 [00:08<00:00, 82.26it/s]
finished release groovy-1_6_BETA_2
using model from activemq-5.0.0 to generate prediction of hbase-0.95.2
100% 1136/1136 [00:29<00:00, 38.60it/s]
finished release hbase-0.95.2
using model from activemq-5.0.0 to generate prediction of hive-0.12.0
 25% 501/2029 [00:11<00:33, 45.33it/s]
Traceback (most recent call last):
  File "/content/drive/MyDrive/DeepLineDP/script/generate_prediction_cross_projects.py", line 200, in

## Running baselines

In [None]:
%cd file-level-baseline/

/content/drive/.shortcut-targets-by-id/1OcjA0LK1Qm_lHEuCd7dBq7dDu_zIfeSZ/DeepLineDP/script/file-level-baseline


In [None]:
# !python Bi-LSTM-baseline.py -data activemq -train
# !python Bi-LSTM-baseline.py -data camel -train
!python Bi-LSTM-baseline.py -data derby -train
!python Bi-LSTM-baseline.py -data groovy -train
!python Bi-LSTM-baseline.py -data hbase -train
!python Bi-LSTM-baseline.py -data hive -train
!python Bi-LSTM-baseline.py -data jruby -train
# !python Bi-LSTM-baseline.py -data lucene -train
# !python Bi-LSTM-baseline.py -data wicket -train

training model of derby
100% 40/40 [19:25<00:00, 29.13s/it]
finished training model of derby
training model of groovy
100% 40/40 [09:54<00:00, 14.85s/it]
finished training model of groovy
training model of hbase
100% 40/40 [13:22<00:00, 20.07s/it]
finished training model of hbase
training model of hive
100% 40/40 [20:14<00:00, 30.36s/it]
finished training model of hive
training model of jruby
100% 40/40 [09:30<00:00, 14.26s/it]
finished training model of jruby


## CNN_baseline - 51 min


In [None]:
!python CNN-baseline.py -data activemq -train
!python CNN-baseline.py -data camel -train
!python CNN-baseline.py -data derby -train
!python CNN-baseline.py -data groovy -train
!python CNN-baseline.py -data hbase -train
!python CNN-baseline.py -data hive -train
!python CNN-baseline.py -data jruby -train
!python CNN-baseline.py -data lucene -train
!python CNN-baseline.py -data wicket -train

Traceback (most recent call last):
  File "/content/drive/.shortcut-targets-by-id/1OcjA0LK1Qm_lHEuCd7dBq7dDu_zIfeSZ/DeepLineDP/script/file-level-baseline/CNN-baseline.py", line 328, in <module>
    train_model(proj_name)
  File "/content/drive/.shortcut-targets-by-id/1OcjA0LK1Qm_lHEuCd7dBq7dDu_zIfeSZ/DeepLineDP/script/file-level-baseline/CNN-baseline.py", line 190, in train_model
    loss_df = pd.read_csv(loss_dir+dataset_name+'-Bi-LSTM-loss_record.csv')
  File "/usr/local/lib/python3.9/dist-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/pandas/io/parsers/readers.py", line 586, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python3.9/dist-packages/pandas/io/parsers/readers.py", line 482, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/usr/local/lib/python3.9/dist-packages/pandas/io/parsers/readers.py", line 811, in __init__
    self._engin

In [None]:
!python DBN-baseline.py -data activemq -train
!python DBN-baseline.py -data camel -train
!python DBN-baseline.py -data derby -train
!python DBN-baseline.py -data groovy -train
!python DBN-baseline.py -data hbase -train
!python DBN-baseline.py -data hive -train
!python DBN-baseline.py -data jruby -train
!python DBN-baseline.py -data lucene -train
!python DBN-baseline.py -data wicket -train

[START] Pre-training step:
>> Epoch 1 finished 	RBM Reconstruction error 653.252310
>> Epoch 2 finished 	RBM Reconstruction error 23.595779
>> Epoch 3 finished 	RBM Reconstruction error 18.608305
>> Epoch 4 finished 	RBM Reconstruction error 18.536358
>> Epoch 5 finished 	RBM Reconstruction error 18.884763
>> Epoch 6 finished 	RBM Reconstruction error 19.000650
>> Epoch 7 finished 	RBM Reconstruction error 18.990482
>> Epoch 8 finished 	RBM Reconstruction error 18.998615
>> Epoch 9 finished 	RBM Reconstruction error 19.047700
>> Epoch 10 finished 	RBM Reconstruction error 19.145483
>> Epoch 11 finished 	RBM Reconstruction error 19.274725
>> Epoch 12 finished 	RBM Reconstruction error 19.403844
>> Epoch 13 finished 	RBM Reconstruction error 19.520811
>> Epoch 14 finished 	RBM Reconstruction error 19.625667
>> Epoch 15 finished 	RBM Reconstruction error 19.720794
>> Epoch 16 finished 	RBM Reconstruction error 19.810261
>> Epoch 17 finished 	RBM Reconstruction error 19.901074
>> Epoch 18 

In [None]:
!python BoW-baseline.py -data activemq -train
!python BoW-baseline.py -data camel -train
!python BoW-baseline.py -data derby -train
!python BoW-baseline.py -data groovy -train
!python BoW-baseline.py -data hbase -train
!python BoW-baseline.py -data hive -train
!python BoW-baseline.py -data jruby -train
!python BoW-baseline.py -data lucene -train
!python BoW-baseline.py -data wicket -train

In [None]:
!pip install rpy2==3.5.1
%load_ext rpy2.ipython

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting rpy2==3.5.1
  Downloading rpy2-3.5.1.tar.gz (201 kB)
     201.7/201.7 KB 5.4 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: rpy2
  Building wheel for rpy2 (setup.py) ... done
  Created wheel for rpy2: filename=rpy2-3.5.1-cp39-cp39-linux_x86_64.whl size=317891 sha256=616826ab89d2244a794060bd13cbc0f6c7442dca513964df668a4260009fe3d3
  Stored in directory: /root/.cache/pip/wheels/09/e7/bc/33685b60ab54dba969596dd87244ee9f4c2e83dff9a53d4f20
Successfully built rpy2
Installing collected packages: rpy2
  Attempting uninstall: rpy2
    Found existing installation: rpy2 3.5.5
    Uninstalling rpy2-3.5.5:
      Successfully uninstalled rpy2-3.5.5
Successfully installed rpy2-3.5.1


In [None]:
%%R
install.packages("tidyverse", dependencies=TRUE)
install.packages("gridExtra", dependencies=TRUE)
install.packages("ModelMetrics", dependencies=TRUE)
install.packages("reshape2", dependencies=TRUE)
install.packages("pROC", dependencies=TRUE)
install.packages("effsize", dependencies=TRUE)
install.packages("ScottKnottESD", dependencies=TRUE)
install.packages("caret", dependencies=TRUE)

UsageError: Cell magic `%%R` not found.
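
This `UsageError` most likely means that the rpy2 extension was not loaded in the session that ran the cell, in which case none of the `install.packages` calls were executed. One possible workaround is to install the R packages through `Rscript` instead of the `%%R` magic. The sketch below is an assumption, not the original workflow: `install_command` is a hypothetical helper, and the `repos` URL (the CRAN cloud mirror) is chosen for illustration.

```python
import subprocess

# Package names copied from the %%R cell above.
R_PACKAGES = ["tidyverse", "gridExtra", "ModelMetrics", "reshape2",
              "pROC", "effsize", "ScottKnottESD", "caret"]

def install_command(packages, repo="https://cloud.r-project.org"):
    """Build an Rscript call that installs the given CRAN packages."""
    quoted = ", ".join(f'"{name}"' for name in packages)
    expr = f'install.packages(c({quoted}), dependencies=TRUE, repos="{repo}")'
    return ["Rscript", "-e", expr]

# Uncomment to run the installation (this can take a long time):
# subprocess.run(install_command(R_PACKAGES), check=True)
```

Passing `repos` explicitly avoids the interactive mirror prompt that `install.packages` may otherwise show in a non-interactive session.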


In [None]:
!Rscript get_evaluation_result.R

System has not been booted with systemd as init system (PID 1). Can't operate.
Failed to create bus connection: Host is down
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
✔ ggplot2 3.4.1     ✔ purrr   1.0.1
✔ tibble  3.2.1     ✔ dplyr   1.1.1
✔ tidyr   1.3.0     ✔ stringr 1.4.1
✔ readr   2.1.4     ✔ forcats 1.0.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
In system("timedatectl", intern = TRUE) :
  running command 'timedatectl' had status 1
Error in library(gridExtra) : there is no package called ‘gridExtra’
Execution halted