* [Setting things up](#setting-things-up)
* [Pearson matrixes](#pearson-matrixes)
    * [1K merges](#pearson-matrixes)
    * [100k merges](#pearson-matrixes-100k-merges)
* [Cooccurrences](#cooccurrences)
    * [1K merges](#cooccurrences)
    * [100k merges](#cooccurrences-100k-merges)
* [Java vs Python](#java-vs-python)

<a id="setting-things-up"></a>

## Setting things up
### Settings and imports

In [46]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [4]:
import os
from pandas import DataFrame
from typing import List

from metrics.vector import pearson, cooccurences, summarize_cooccurences
from dataprep.split.merge import Merge, MergeList, read_merges
from metrics.matrix import pearson_matrix, cooccurence_matrix

### Util methods

In [5]:
def first_n_merges(merges: List[MergeList], n: int) -> List[MergeList]:
    return list(map(lambda l: l[:n], merges))

### Setting up paths

In [8]:
HOME = '/home/lv71161/hlibbabii'

In [9]:
PATH_TO_BPE_CODES_FOLDER = os.path.join(HOME, 'log-recommender-dataprep/bpe-codes')

In [10]:
PATH_TO_BPE_CODES_FOLDER_CASE = os.path.join(PATH_TO_BPE_CODES_FOLDER, 'case')
PATH_TO_BPE_CODES_FOLDER_NOCASE = os.path.join(PATH_TO_BPE_CODES_FOLDER, 'nocase')

### Loading BPE merges

In [11]:
full_merges_list_10_case = [read_merges(os.path.join(PATH_TO_BPE_CODES_FOLDER_CASE, f'10_{chunk}.txt')) for chunk in range(10)]
full_merges_list_20_case = [read_merges(os.path.join(PATH_TO_BPE_CODES_FOLDER_CASE, f'20_{chunk}.txt')) for chunk in range(5)]
full_merges_list_10_nocase = [read_merges(os.path.join(PATH_TO_BPE_CODES_FOLDER_NOCASE, f'10_{chunk}.txt')) for chunk in range(10)]
full_merges_list_20_nocase = [read_merges(os.path.join(PATH_TO_BPE_CODES_FOLDER_NOCASE, f'20_{chunk}.txt')) for chunk in range(5)]

#### Checking contents of some lists...

In [12]:
full_merges_list_10_case[0][:10]

[('e', 'r'): (10083236, 0),
 ('i', 'n'): (8162641, 1),
 ('o', 'n'): (7676160, 2),
 ('o', 'r'): (7601275, 3),
 ('e', 't'): (6853557, 4),
 ('a', 't'): (6463709, 5),
 ('e', 'n'): (6436192, 6),
 ('e', 's'): (5622049, 7),
 ('t', 'h'): (4936728, 8),
 ('a', 'l'): (4753658, 9)]

In [13]:
full_merges_list_20_nocase[4][:10]

[('e', 'r'): (21493741, 0),
 ('i', 'n'): (21026960, 1),
 ('r', 'e'): (17729548, 2),
 ('o', 'n'): (17659507, 3),
 ('s', 't'): (15902251, 4),
 ('o', 'r'): (15230588, 5),
 ('a', 't'): (14963476, 6),
 ('e', 'n'): (14152630, 7),
 ('t', 'h'): (12383901, 8),
 ('l', 'i'): (11245604, 9)]

<a id="pearson-matrixes"></a>

## Pearson matrixes

### 1k merges

In [14]:
m = pearson_matrix(first_n_merges(full_merges_list_10_case, 1000))
DataFrame(m).round(3)

If this happens often in your code, it can cause performance problems 
(results will be correct in all cases). 
The reason for this is probably some large input arguments for a wrapped
 function (e.g. large strings).
THIS IS A JOBLIB ISSUE. If you can, kindly provide the joblib's team with an
 example so that they can fix the problem.
  """Entry point for launching an IPython kernel.


Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,1.0,0.909,0.976,0.977,0.941,0.94,0.97,0.928,0.981,0.95
1,0.909,1.0,0.931,0.928,0.927,0.952,0.923,0.953,0.904,0.923
2,0.976,0.931,1.0,0.99,0.959,0.961,0.986,0.952,0.968,0.964
3,0.977,0.928,0.99,1.0,0.954,0.959,0.987,0.946,0.967,0.963
4,0.941,0.927,0.959,0.954,1.0,0.924,0.953,0.916,0.934,0.929
5,0.94,0.952,0.961,0.959,0.924,1.0,0.957,0.983,0.946,0.953
6,0.97,0.923,0.986,0.987,0.953,0.957,1.0,0.946,0.963,0.957
7,0.928,0.953,0.952,0.946,0.916,0.983,0.946,1.0,0.937,0.96
8,0.981,0.904,0.968,0.967,0.934,0.946,0.963,0.937,1.0,0.954
9,0.95,0.923,0.964,0.963,0.929,0.953,0.957,0.96,0.954,1.0


In [15]:
m = pearson_matrix(first_n_merges(full_merges_list_10_nocase, 1000))
DataFrame(m).round(3)

If this happens often in your code, it can cause performance problems 
(results will be correct in all cases). 
The reason for this is probably some large input arguments for a wrapped
 function (e.g. large strings).
THIS IS A JOBLIB ISSUE. If you can, kindly provide the joblib's team with an
 example so that they can fix the problem.
  """Entry point for launching an IPython kernel.


Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,1.0,0.956,0.916,0.917,0.914,0.884,0.957,0.919,0.925,0.989
1,0.956,1.0,0.961,0.958,0.963,0.936,0.987,0.96,0.955,0.954
2,0.916,0.961,1.0,0.988,0.993,0.961,0.958,0.988,0.924,0.915
3,0.917,0.958,0.988,1.0,0.989,0.962,0.962,0.993,0.931,0.918
4,0.914,0.963,0.993,0.989,1.0,0.965,0.961,0.989,0.929,0.916
5,0.884,0.936,0.961,0.962,0.965,1.0,0.931,0.963,0.959,0.886
6,0.957,0.987,0.958,0.962,0.961,0.931,1.0,0.964,0.957,0.958
7,0.919,0.96,0.988,0.993,0.989,0.963,0.964,1.0,0.931,0.919
8,0.925,0.955,0.924,0.931,0.929,0.959,0.957,0.931,1.0,0.925
9,0.989,0.954,0.915,0.918,0.916,0.886,0.958,0.919,0.925,1.0


In [16]:
m = pearson_matrix(first_n_merges(full_merges_list_20_case, 1000))
DataFrame(m).round(3)

Unnamed: 0,0,1,2,3,4
0,1.0,0.995,0.978,0.988,0.996
1,0.995,1.0,0.977,0.988,0.996
2,0.978,0.977,1.0,0.984,0.978
3,0.988,0.988,0.984,1.0,0.989
4,0.996,0.996,0.978,0.989,1.0


In [17]:
m = pearson_matrix(first_n_merges(full_merges_list_10_nocase, 1000))
DataFrame(m).round(3)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,1.0,0.956,0.916,0.917,0.914,0.884,0.957,0.919,0.925,0.989
1,0.956,1.0,0.961,0.958,0.963,0.936,0.987,0.96,0.955,0.954
2,0.916,0.961,1.0,0.988,0.993,0.961,0.958,0.988,0.924,0.915
3,0.917,0.958,0.988,1.0,0.989,0.962,0.962,0.993,0.931,0.918
4,0.914,0.963,0.993,0.989,1.0,0.965,0.961,0.989,0.929,0.916
5,0.884,0.936,0.961,0.962,0.965,1.0,0.931,0.963,0.959,0.886
6,0.957,0.987,0.958,0.962,0.961,0.931,1.0,0.964,0.957,0.958
7,0.919,0.96,0.988,0.993,0.989,0.963,0.964,1.0,0.931,0.919
8,0.925,0.955,0.924,0.931,0.929,0.959,0.957,0.931,1.0,0.925
9,0.989,0.954,0.915,0.918,0.916,0.886,0.958,0.919,0.925,1.0


<a id="pearson-matrixes-100k-merges"></a>

### 100K merges

In [13]:
m = pearson_matrix(first_n_merges(full_merges_list_10_case, 100000))
DataFrame(m).round(3)

If this happens often in your code, it can cause performance problems 
(results will be correct in all cases). 
The reason for this is probably some large input arguments for a wrapped
 function (e.g. large strings).
THIS IS A JOBLIB ISSUE. If you can, kindly provide the joblib's team with an
 example so that they can fix the problem.
  """Entry point for launching an IPython kernel.


Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,1.0,0.922,0.979,0.981,0.95,0.949,0.975,0.939,0.984,0.957
1,0.922,1.0,0.941,0.939,0.937,0.959,0.934,0.96,0.918,0.934
2,0.979,0.941,1.0,0.992,0.965,0.967,0.989,0.959,0.973,0.97
3,0.981,0.939,0.992,1.0,0.961,0.965,0.989,0.954,0.972,0.969
4,0.95,0.937,0.965,0.961,1.0,0.935,0.96,0.928,0.944,0.94
5,0.949,0.959,0.967,0.965,0.935,1.0,0.964,0.986,0.955,0.96
6,0.975,0.934,0.989,0.989,0.96,0.964,1.0,0.954,0.969,0.964
7,0.939,0.96,0.959,0.954,0.928,0.986,0.954,1.0,0.947,0.966
8,0.984,0.918,0.973,0.972,0.944,0.955,0.969,0.947,1.0,0.961
9,0.957,0.934,0.97,0.969,0.94,0.96,0.964,0.966,0.961,1.0


In [14]:
m = pearson_matrix(first_n_merges(full_merges_list_10_nocase, 100000))
DataFrame(m).round(3)

If this happens often in your code, it can cause performance problems 
(results will be correct in all cases). 
The reason for this is probably some large input arguments for a wrapped
 function (e.g. large strings).
THIS IS A JOBLIB ISSUE. If you can, kindly provide the joblib's team with an
 example so that they can fix the problem.
  """Entry point for launching an IPython kernel.


Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,1.0,0.962,0.927,0.928,0.925,0.899,0.963,0.93,0.935,0.99
1,0.962,1.0,0.966,0.964,0.968,0.945,0.989,0.966,0.961,0.961
2,0.927,0.966,1.0,0.99,0.994,0.967,0.964,0.99,0.934,0.926
3,0.928,0.964,0.99,1.0,0.991,0.967,0.968,0.994,0.94,0.929
4,0.925,0.968,0.994,0.991,1.0,0.97,0.967,0.991,0.939,0.927
5,0.899,0.945,0.967,0.967,0.97,1.0,0.941,0.968,0.965,0.901
6,0.963,0.989,0.964,0.968,0.967,0.941,1.0,0.969,0.963,0.964
7,0.93,0.966,0.99,0.994,0.991,0.968,0.969,1.0,0.941,0.93
8,0.935,0.961,0.934,0.94,0.939,0.965,0.963,0.941,1.0,0.935
9,0.99,0.961,0.926,0.929,0.927,0.901,0.964,0.93,0.935,1.0


In [15]:
m = pearson_matrix(first_n_merges(full_merges_list_20_case, 100000))
DataFrame(m).round(3)

If this happens often in your code, it can cause performance problems 
(results will be correct in all cases). 
The reason for this is probably some large input arguments for a wrapped
 function (e.g. large strings).
THIS IS A JOBLIB ISSUE. If you can, kindly provide the joblib's team with an
 example so that they can fix the problem.
  """Entry point for launching an IPython kernel.


Unnamed: 0,0,1,2,3,4
0,1.0,0.996,0.982,0.99,0.997
1,0.996,1.0,0.981,0.99,0.996
2,0.982,0.981,1.0,0.987,0.981
3,0.99,0.99,0.987,1.0,0.991
4,0.997,0.996,0.981,0.991,1.0


In [16]:
m = pearson_matrix(first_n_merges(full_merges_list_20_nocase, 100000))
DataFrame(m).round(3)

If this happens often in your code, it can cause performance problems 
(results will be correct in all cases). 
The reason for this is probably some large input arguments for a wrapped
 function (e.g. large strings).
THIS IS A JOBLIB ISSUE. If you can, kindly provide the joblib's team with an
 example so that they can fix the problem.
  """Entry point for launching an IPython kernel.


Unnamed: 0,0,1,2,3,4
0,1.0,0.996,0.992,0.969,0.966
1,0.996,1.0,0.993,0.97,0.966
2,0.992,0.993,1.0,0.967,0.963
3,0.969,0.97,0.967,1.0,0.939
4,0.966,0.966,0.963,0.939,1.0


<a id="cooccurrences"></a>

### Cooccurrences

#### 1K merges

In [18]:
m=cooccurence_matrix(first_n_merges(full_merges_list_10_case, 1000), first_n_merges(full_merges_list_10_case, 1000))
DataFrame(m).round(3)

If this happens often in your code, it can cause performance problems 
(results will be correct in all cases). 
The reason for this is probably some large input arguments for a wrapped
 function (e.g. large strings).
THIS IS A JOBLIB ISSUE. If you can, kindly provide the joblib's team with an
 example so that they can fix the problem.
  """Entry point for launching an IPython kernel.


Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,1.0,0.712,0.786,0.82,0.76,0.76,0.786,0.751,0.818,0.772
1,0.712,1.0,0.737,0.734,0.744,0.765,0.72,0.781,0.729,0.708
2,0.786,0.737,1.0,0.821,0.803,0.783,0.814,0.778,0.804,0.765
3,0.82,0.734,0.821,1.0,0.777,0.778,0.83,0.749,0.805,0.779
4,0.76,0.744,0.803,0.777,1.0,0.744,0.784,0.758,0.77,0.729
5,0.76,0.765,0.783,0.778,0.744,1.0,0.762,0.809,0.783,0.745
6,0.786,0.72,0.814,0.83,0.784,0.762,1.0,0.769,0.799,0.755
7,0.751,0.781,0.778,0.749,0.758,0.809,0.769,1.0,0.769,0.761
8,0.818,0.729,0.804,0.805,0.77,0.783,0.799,0.769,1.0,0.748
9,0.772,0.708,0.765,0.779,0.729,0.745,0.755,0.761,0.748,1.0


In [19]:
m=cooccurence_matrix(first_n_merges(full_merges_list_10_nocase, 1000), first_n_merges(full_merges_list_10_nocase, 1000))
DataFrame(m).round(3)

If this happens often in your code, it can cause performance problems 
(results will be correct in all cases). 
The reason for this is probably some large input arguments for a wrapped
 function (e.g. large strings).
THIS IS A JOBLIB ISSUE. If you can, kindly provide the joblib's team with an
 example so that they can fix the problem.
  """Entry point for launching an IPython kernel.


Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,1.0,0.755,0.72,0.722,0.708,0.711,0.742,0.718,0.746,0.8
1,0.755,1.0,0.753,0.757,0.76,0.76,0.832,0.767,0.817,0.752
2,0.72,0.753,1.0,0.844,0.837,0.793,0.776,0.844,0.76,0.743
3,0.722,0.757,0.844,1.0,0.839,0.826,0.781,0.856,0.77,0.751
4,0.708,0.76,0.837,0.839,1.0,0.807,0.794,0.829,0.774,0.738
5,0.711,0.76,0.793,0.826,0.807,1.0,0.736,0.799,0.778,0.713
6,0.742,0.832,0.776,0.781,0.794,0.736,1.0,0.783,0.804,0.79
7,0.718,0.767,0.844,0.856,0.829,0.799,0.783,1.0,0.761,0.745
8,0.746,0.817,0.76,0.77,0.774,0.778,0.804,0.761,1.0,0.765
9,0.8,0.752,0.743,0.751,0.738,0.713,0.79,0.745,0.765,1.0


In [20]:
m=cooccurence_matrix(first_n_merges(full_merges_list_20_case, 1000), first_n_merges(full_merges_list_20_case, 1000))
DataFrame(m).round(3)

Unnamed: 0,0,1,2,3,4
0,1.0,0.864,0.825,0.84,0.88
1,0.864,1.0,0.85,0.849,0.877
2,0.825,0.85,1.0,0.858,0.841
3,0.84,0.849,0.858,1.0,0.859
4,0.88,0.877,0.841,0.859,1.0


In [21]:
m=cooccurence_matrix(first_n_merges(full_merges_list_20_nocase, 1000), first_n_merges(full_merges_list_20_nocase, 1000))
DataFrame(m).round(3)

Unnamed: 0,0,1,2,3,4
0,1.0,0.851,0.86,0.789,0.855
1,0.851,1.0,0.884,0.805,0.859
2,0.86,0.884,1.0,0.812,0.856
3,0.789,0.805,0.812,1.0,0.77
4,0.855,0.859,0.856,0.77,1.0


<a id="cooccurrences-100k-merges"></a>

#### 1K vs 5K merges

In [22]:
m=cooccurence_matrix(first_n_merges(full_merges_list_10_case, 1000), first_n_merges(full_merges_list_10_case, 5000))
DataFrame(m).round(3)

If this happens often in your code, it can cause performance problems 
(results will be correct in all cases). 
The reason for this is probably some large input arguments for a wrapped
 function (e.g. large strings).
THIS IS A JOBLIB ISSUE. If you can, kindly provide the joblib's team with an
 example so that they can fix the problem.
  """Entry point for launching an IPython kernel.


Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,1.0,0.842,0.917,0.927,0.886,0.871,0.911,0.865,0.922,0.891
1,0.812,1.0,0.847,0.836,0.849,0.862,0.83,0.882,0.835,0.825
2,0.903,0.858,1.0,0.924,0.905,0.891,0.929,0.886,0.918,0.894
3,0.919,0.851,0.925,1.0,0.886,0.884,0.936,0.864,0.914,0.887
4,0.877,0.874,0.901,0.883,1.0,0.862,0.88,0.865,0.876,0.859
5,0.881,0.898,0.91,0.895,0.876,1.0,0.887,0.925,0.902,0.883
6,0.918,0.846,0.925,0.946,0.889,0.885,1.0,0.879,0.919,0.886
7,0.866,0.897,0.877,0.87,0.858,0.923,0.868,1.0,0.868,0.887
8,0.917,0.851,0.918,0.919,0.875,0.889,0.905,0.876,1.0,0.882
9,0.871,0.834,0.877,0.873,0.845,0.859,0.863,0.876,0.874,1.0


In [23]:
m=cooccurence_matrix(first_n_merges(full_merges_list_10_nocase, 1000), first_n_merges(full_merges_list_10_nocase, 5000))
DataFrame(m).round(3)

If this happens often in your code, it can cause performance problems 
(results will be correct in all cases). 
The reason for this is probably some large input arguments for a wrapped
 function (e.g. large strings).
THIS IS A JOBLIB ISSUE. If you can, kindly provide the joblib's team with an
 example so that they can fix the problem.
  """Entry point for launching an IPython kernel.


Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,1.0,0.863,0.831,0.815,0.817,0.82,0.86,0.816,0.844,0.914
1,0.867,1.0,0.839,0.85,0.855,0.852,0.921,0.854,0.908,0.868
2,0.842,0.86,1.0,0.927,0.922,0.912,0.879,0.923,0.867,0.865
3,0.838,0.878,0.936,1.0,0.934,0.933,0.88,0.935,0.884,0.862
4,0.821,0.869,0.921,0.931,1.0,0.927,0.888,0.909,0.876,0.847
5,0.839,0.872,0.914,0.93,0.921,1.0,0.864,0.91,0.895,0.854
6,0.878,0.937,0.866,0.862,0.884,0.851,1.0,0.86,0.919,0.901
7,0.838,0.875,0.932,0.942,0.919,0.917,0.884,1.0,0.87,0.863
8,0.851,0.917,0.856,0.867,0.873,0.883,0.915,0.854,1.0,0.874
9,0.902,0.856,0.841,0.835,0.84,0.823,0.884,0.832,0.862,1.0


In [24]:
m=cooccurence_matrix(first_n_merges(full_merges_list_20_case, 1000), first_n_merges(full_merges_list_20_case, 5000))
DataFrame(m).round(3)

If this happens often in your code, it can cause performance problems 
(results will be correct in all cases). 
The reason for this is probably some large input arguments for a wrapped
 function (e.g. large strings).
THIS IS A JOBLIB ISSUE. If you can, kindly provide the joblib's team with an
 example so that they can fix the problem.
  """Entry point for launching an IPython kernel.


Unnamed: 0,0,1,2,3,4
0,1.0,0.961,0.934,0.941,0.965
1,0.958,1.0,0.932,0.933,0.969
2,0.929,0.936,1.0,0.944,0.937
3,0.937,0.94,0.947,1.0,0.951
4,0.953,0.955,0.928,0.939,1.0


In [25]:
m=cooccurence_matrix(first_n_merges(full_merges_list_20_nocase, 1000), first_n_merges(full_merges_list_20_nocase, 5000))
DataFrame(m).round(3)

If this happens often in your code, it can cause performance problems 
(results will be correct in all cases). 
The reason for this is probably some large input arguments for a wrapped
 function (e.g. large strings).
THIS IS A JOBLIB ISSUE. If you can, kindly provide the joblib's team with an
 example so that they can fix the problem.
  """Entry point for launching an IPython kernel.


Unnamed: 0,0,1,2,3,4
0,1.0,0.94,0.945,0.878,0.946
1,0.939,1.0,0.954,0.882,0.943
2,0.95,0.956,1.0,0.89,0.943
3,0.878,0.873,0.884,1.0,0.859
4,0.929,0.924,0.927,0.859,1.0


#### 100K merges

In [17]:
m=cooccurence_matrix(first_n_merges(full_merges_list_10_case, 100000), first_n_merges(full_merges_list_10_case, 100000))
DataFrame(m).round(3)

If this happens often in your code, it can cause performance problems 
(results will be correct in all cases). 
The reason for this is probably some large input arguments for a wrapped
 function (e.g. large strings).
THIS IS A JOBLIB ISSUE. If you can, kindly provide the joblib's team with an
 example so that they can fix the problem.
  """Entry point for launching an IPython kernel.


Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,1.0,0.45,0.47,0.488,0.472,0.471,0.462,0.454,0.482,0.472
1,0.45,1.0,0.452,0.453,0.453,0.461,0.44,0.462,0.442,0.444
2,0.47,0.452,1.0,0.483,0.476,0.465,0.468,0.458,0.466,0.463
3,0.488,0.453,0.483,1.0,0.483,0.481,0.474,0.465,0.473,0.472
4,0.472,0.453,0.476,0.483,1.0,0.467,0.473,0.46,0.469,0.461
5,0.471,0.461,0.465,0.481,0.467,1.0,0.458,0.468,0.463,0.469
6,0.462,0.44,0.468,0.474,0.473,0.458,1.0,0.454,0.458,0.453
7,0.454,0.462,0.458,0.465,0.46,0.468,0.454,1.0,0.452,0.455
8,0.482,0.442,0.466,0.473,0.469,0.463,0.458,0.452,1.0,0.464
9,0.472,0.444,0.463,0.472,0.461,0.469,0.453,0.455,0.464,1.0


In [18]:
m=cooccurence_matrix(first_n_merges(full_merges_list_10_nocase, 100000), first_n_merges(full_merges_list_10_nocase, 100000))
DataFrame(m).round(3)

If this happens often in your code, it can cause performance problems 
(results will be correct in all cases). 
The reason for this is probably some large input arguments for a wrapped
 function (e.g. large strings).
THIS IS A JOBLIB ISSUE. If you can, kindly provide the joblib's team with an
 example so that they can fix the problem.
  """Entry point for launching an IPython kernel.


Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,1.0,0.417,0.395,0.414,0.41,0.413,0.4,0.392,0.408,0.435
1,0.417,1.0,0.395,0.408,0.409,0.403,0.41,0.4,0.406,0.404
2,0.395,0.395,1.0,0.422,0.425,0.41,0.391,0.414,0.389,0.398
3,0.414,0.408,0.422,1.0,0.442,0.432,0.406,0.429,0.4,0.412
4,0.41,0.409,0.425,0.442,1.0,0.435,0.415,0.427,0.414,0.406
5,0.413,0.403,0.41,0.432,0.435,1.0,0.402,0.414,0.408,0.412
6,0.4,0.41,0.391,0.406,0.415,0.402,1.0,0.406,0.397,0.399
7,0.392,0.4,0.414,0.429,0.427,0.414,0.406,1.0,0.396,0.394
8,0.408,0.406,0.389,0.4,0.414,0.408,0.397,0.396,1.0,0.402
9,0.435,0.404,0.398,0.412,0.406,0.412,0.399,0.394,0.402,1.0


In [61]:
m=cooccurence_matrix(first_n_merges(full_merges_list_20_case, 100000), first_n_merges(full_merges_list_20_case, 100000))
DataFrame(m).round(3)

Unnamed: 0,0,1,2,3,4
0,1.0,0.534,0.523,0.522,0.526
1,0.534,1.0,0.536,0.531,0.526
2,0.523,0.536,1.0,0.536,0.528
3,0.522,0.531,0.536,1.0,0.52
4,0.526,0.526,0.528,0.52,1.0


In [63]:
m=cooccurence_matrix(first_n_merges(full_merges_list_20_case, 100000), first_n_merges(full_merges_list_20_nocase, 100000))
DataFrame(m).round(3)

Unnamed: 0,0,1,2,3,4
0,1.0,0.484,0.479,0.454,0.471
1,0.484,1.0,0.492,0.466,0.477
2,0.479,0.492,1.0,0.466,0.479
3,0.454,0.466,0.466,1.0,0.445
4,0.471,0.477,0.479,0.445,1.0


In [64]:
m=cooccurence_matrix(first_n_merges(full_merges_list_10_case, 50000), first_n_merges(full_merges_list_10_case, 100000))
DataFrame(m).round(3)

If this happens often in your code, it can cause performance problems 
(results will be correct in all cases). 
The reason for this is probably some large input arguments for a wrapped
 function (e.g. large strings).
THIS IS A JOBLIB ISSUE. If you can, kindly provide the joblib's team with an
 example so that they can fix the problem.
  """Entry point for launching an IPython kernel.


Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,1.0,0.618,0.651,0.662,0.643,0.632,0.642,0.627,0.671,0.644
1,0.609,1.0,0.618,0.614,0.615,0.629,0.604,0.637,0.61,0.609
2,0.651,0.625,1.0,0.662,0.653,0.64,0.656,0.634,0.656,0.644
3,0.666,0.623,0.665,1.0,0.654,0.646,0.657,0.637,0.659,0.649
4,0.644,0.626,0.659,0.658,1.0,0.629,0.652,0.633,0.651,0.637
5,0.637,0.644,0.644,0.654,0.633,1.0,0.631,0.652,0.645,0.642
6,0.644,0.614,0.655,0.657,0.652,0.629,1.0,0.631,0.65,0.636
7,0.626,0.644,0.632,0.636,0.63,0.649,0.626,1.0,0.631,0.635
8,0.663,0.611,0.65,0.651,0.639,0.631,0.641,0.624,1.0,0.643
9,0.644,0.613,0.641,0.646,0.631,0.634,0.627,0.629,0.645,1.0


#### 50K vs 100K merges

In [65]:
m=cooccurence_matrix(first_n_merges(full_merges_list_10_nocase, 50000), first_n_merges(full_merges_list_10_nocase, 100000))
DataFrame(m).round(3)

If this happens often in your code, it can cause performance problems 
(results will be correct in all cases). 
The reason for this is probably some large input arguments for a wrapped
 function (e.g. large strings).
THIS IS A JOBLIB ISSUE. If you can, kindly provide the joblib's team with an
 example so that they can fix the problem.
  """Entry point for launching an IPython kernel.


Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,1.0,0.596,0.567,0.581,0.575,0.572,0.569,0.567,0.591,0.61
1,0.58,1.0,0.563,0.57,0.575,0.566,0.584,0.569,0.585,0.569
2,0.562,0.571,1.0,0.606,0.607,0.59,0.567,0.603,0.568,0.572
3,0.582,0.59,0.613,1.0,0.625,0.612,0.576,0.619,0.585,0.583
4,0.57,0.578,0.608,0.615,1.0,0.605,0.581,0.607,0.591,0.573
5,0.587,0.591,0.598,0.625,0.626,1.0,0.583,0.604,0.603,0.592
6,0.565,0.591,0.563,0.569,0.582,0.564,1.0,0.581,0.577,0.57
7,0.556,0.574,0.598,0.608,0.607,0.592,0.574,1.0,0.569,0.563
8,0.573,0.583,0.557,0.566,0.576,0.572,0.568,0.562,1.0,0.569
9,0.598,0.569,0.564,0.57,0.566,0.562,0.561,0.561,0.576,1.0


In [66]:
m=cooccurence_matrix(first_n_merges(full_merges_list_20_case, 50000), first_n_merges(full_merges_list_20_case, 100000))
DataFrame(m).round(3)

If this happens often in your code, it can cause performance problems 
(results will be correct in all cases). 
The reason for this is probably some large input arguments for a wrapped
 function (e.g. large strings).
THIS IS A JOBLIB ISSUE. If you can, kindly provide the joblib's team with an
 example so that they can fix the problem.
  """Entry point for launching an IPython kernel.


Unnamed: 0,0,1,2,3,4
0,1.0,0.706,0.686,0.698,0.702
1,0.711,1.0,0.707,0.713,0.708
2,0.7,0.715,1.0,0.722,0.709
3,0.704,0.713,0.718,1.0,0.708
4,0.704,0.707,0.697,0.701,1.0


In [67]:
m=cooccurence_matrix(first_n_merges(full_merges_list_20_nocase, 50000), first_n_merges(full_merges_list_20_nocase, 100000))
DataFrame(m).round(3)

If this happens often in your code, it can cause performance problems 
(results will be correct in all cases). 
The reason for this is probably some large input arguments for a wrapped
 function (e.g. large strings).
THIS IS A JOBLIB ISSUE. If you can, kindly provide the joblib's team with an
 example so that they can fix the problem.
  """Entry point for launching an IPython kernel.


Unnamed: 0,0,1,2,3,4
0,1.0,0.659,0.653,0.627,0.653
1,0.666,1.0,0.671,0.638,0.661
2,0.667,0.683,1.0,0.643,0.673
3,0.628,0.637,0.634,1.0,0.62
4,0.645,0.652,0.648,0.609,1.0


<a id='java-vs-python'></a>

### Cooccurrences accross all chunks

#### 1K merges

In [26]:
counts = cooccurences(*first_n_merges(full_merges_list_10_case, 1000))
fraction_of_merges_by_n_chunks_they_are_in = summarize_cooccurences(counts)
assert sum(fraction_of_merges_by_n_chunks_they_are_in.values()) - 1.0 < 0.0001
sorted(fraction_of_merges_by_n_chunks_they_are_in.items(), key=lambda s:s[1], reverse=True)

[(10, 0.511),
 (9, 0.1197),
 (8, 0.056),
 (7, 0.0553),
 (1, 0.0491),
 (5, 0.0475),
 (6, 0.0462),
 (3, 0.0414),
 (2, 0.0374),
 (4, 0.0364)]

In [27]:
counts = cooccurences(*first_n_merges(full_merges_list_10_nocase, 1000))
fraction_of_merges_by_n_chunks_they_are_in = summarize_cooccurences(counts)
assert sum(fraction_of_merges_by_n_chunks_they_are_in.values()) - 1.0 < 0.0001
sorted(fraction_of_merges_by_n_chunks_they_are_in.items(), key=lambda s:s[1], reverse=True)

[(10, 0.526),
 (9, 0.0972),
 (8, 0.0704),
 (5, 0.0615),
 (7, 0.0469),
 (6, 0.045),
 (1, 0.0446),
 (3, 0.0372),
 (2, 0.0368),
 (4, 0.0344)]

In [28]:
counts = cooccurences(*first_n_merges(full_merges_list_20_case, 1000))
fraction_of_merges_by_n_chunks_they_are_in = summarize_cooccurences(counts)
assert sum(fraction_of_merges_by_n_chunks_they_are_in.values()) - 1.0 < 0.0001
sorted(fraction_of_merges_by_n_chunks_they_are_in.items(), key=lambda s:s[1], reverse=True)

[(5, 0.733), (4, 0.1072), (3, 0.0582), (1, 0.0544), (2, 0.0472)]

In [29]:
counts = cooccurences(*first_n_merges(full_merges_list_20_nocase, 1000))
fraction_of_merges_by_n_chunks_they_are_in = summarize_cooccurences(counts)
assert sum(fraction_of_merges_by_n_chunks_they_are_in.values()) - 1.0 < 0.0001
sorted(fraction_of_merges_by_n_chunks_they_are_in.items(), key=lambda s:s[1], reverse=True)

[(5, 0.699), (4, 0.1256), (1, 0.0676), (3, 0.0558), (2, 0.052)]

#### 100K merges

In [46]:
counts = cooccurences(*first_n_merges(full_merges_list_10_case, 100000))
fraction_of_merges_by_n_chunks_they_are_in = summarize_cooccurences(counts)
assert sum(fraction_of_merges_by_n_chunks_they_are_in.values()) - 1.0 < 0.0001
sorted(fraction_of_merges_by_n_chunks_they_are_in.items(), key=lambda s:s[1], reverse=True)

[(1, 0.253172),
 (10, 0.21054),
 (2, 0.098872),
 (3, 0.074004),
 (9, 0.073179),
 (4, 0.061148),
 (8, 0.059384),
 (7, 0.057988),
 (5, 0.056525),
 (6, 0.055188)]

In [47]:
counts = cooccurences(*first_n_merges(full_merges_list_10_nocase, 100000))
fraction_of_merges_by_n_chunks_they_are_in = summarize_cooccurences(counts)
assert sum(fraction_of_merges_by_n_chunks_they_are_in.values()) - 1.0 < 0.0001
sorted(fraction_of_merges_by_n_chunks_they_are_in.items(), key=lambda s:s[1], reverse=True)

[(1, 0.303114),
 (10, 0.17421),
 (2, 0.109752),
 (3, 0.075564),
 (9, 0.061839),
 (4, 0.061824),
 (5, 0.055955),
 (8, 0.054952),
 (7, 0.052024),
 (6, 0.050766)]

In [48]:
counts = cooccurences(*first_n_merges(full_merges_list_20_case, 100000))
fraction_of_merges_by_n_chunks_they_are_in = summarize_cooccurences(counts)
assert sum(fraction_of_merges_by_n_chunks_they_are_in.values()) - 1.0 < 0.0001
sorted(fraction_of_merges_by_n_chunks_they_are_in.items(), key=lambda s:s[1], reverse=True)

[(5, 0.34366), (1, 0.285122), (2, 0.131988), (4, 0.127552), (3, 0.111678)]

In [50]:
counts = cooccurences(*first_n_merges(full_merges_list_20_nocase, 100000))
fraction_of_merges_by_n_chunks_they_are_in = summarize_cooccurences(counts)
assert sum(fraction_of_merges_by_n_chunks_they_are_in.values()) - 1.0 < 0.0001
sorted(fraction_of_merges_by_n_chunks_they_are_in.items(), key=lambda s:s[1], reverse=True)

[(1, 0.339992), (5, 0.28922), (2, 0.136876), (4, 0.123368), (3, 0.110544)]

### Learning bpe merges

In [47]:
%%bash
cd "$HOME/dataprep2"
python dataprep/__main__.py learn-bpe 40 -p "/home/lv71161/hlibbabii/raw_datasets/allamanis/chunks_nodup_en_only/01/0" --no-case --ext "java"

dataprep 1.0.0-alpha.1
--- Learning bpe codes...


0it [00:00, ?it/s]


In [48]:
%%bash
cd "$HOME/dataprep2"
python dataprep/__main__.py --version
python dataprep/__main__.py learn-bpe 10000 -p "/home/lv71161/hlibbabii/raw_datasets/allamanis/chunks_nodup_en_only/01/0" --no-case --ext "java"

dataprep 1.0.0-alpha.1
--- Learning bpe codes...


  0%|          | 0/9960 [00:00<?, ?it/s]  0%|          | 1/9960 [00:00<1:05:21,  2.54it/s]  0%|          | 2/9960 [00:00<1:06:36,  2.49it/s]  0%|          | 3/9960 [00:01<1:06:03,  2.51it/s]  0%|          | 4/9960 [00:01<1:06:21,  2.50it/s]  0%|          | 5/9960 [00:02<1:07:59,  2.44it/s]  0%|          | 6/9960 [00:02<1:01:41,  2.69it/s]  0%|          | 7/9960 [00:02<1:05:08,  2.55it/s]  0%|          | 8/9960 [00:03<1:06:53,  2.48it/s]  0%|          | 9/9960 [00:03<1:05:18,  2.54it/s]  0%|          | 10/9960 [00:03<1:04:23,  2.58it/s]  0%|          | 11/9960 [00:04<1:06:15,  2.50it/s]  0%|          | 12/9960 [00:04<1:05:16,  2.54it/s]  0%|          | 13/9960 [00:05<1:03:36,  2.61it/s]  0%|          | 14/9960 [00:05<1:04:28,  2.57it/s]  0%|          | 15/9960 [00:05<1:08:26,  2.42it/s]  0%|          | 16/9960 [00:06<1:07:59,  2.44it/s]  0%|          | 17/9960 [00:06<1:05:28,  2.53it/s]  0%|          | 18/9960 [00:07<1:02:31,  2.65it/s]  0%|          | 19/9960 [00:07

In [None]:
%%bash
cd "$HOME/dataprep2"
python dataprep/__main__.py --version
python dataprep/__main__.py learn-bpe 10 -p "/home/lv71161/hlibbabii/raw_datasets/allamanis/chunks_nodup_en_only/01/1" --no-case --ext "java"

In [None]:
%%bash
cd "$HOME/dataprep2"
python dataprep/__main__.py --version
python dataprep/__main__.py learn-bpe 10000 -p "/home/lv71161/hlibbabii/raw_datasets/allamanis/chunks_nodup_en_only/01/1" --no-case --ext "java"

In [None]:
%%bash
cd "$HOME/dataprep2"
python dataprep/__main__.py --version
python dataprep/__main__.py learn-bpe 10 -p "/home/lv71161/hlibbabii/raw_datasets/allamanis/chunks_nodup_en_only/23/2" --no-case --ext "java"

In [None]:
%%bash
cd "$HOME/dataprep2"
python dataprep/__main__.py --version
python dataprep/__main__.py learn-bpe 10000 -p "/home/lv71161/hlibbabii/raw_datasets/allamanis/chunks_nodup_en_only/23/2" --no-case --ext "java"

### Splitting base vocab with different bpe codes

In [31]:
import subprocess, os
from subprocess import PIPE
from typing import List, Dict

def cross_prep(datasets: List[Dict[str, str]], n_merges: int, output: str):
    for i in range(len(datasets)):
        for j in range(len(datasets)):
            code = datasets[j]['code']
            
            command_nocase = ["python", "dataprep/__main__.py", "bpe", f"{code}_nocase-{n_merges}", "-p", datasets[i]['path'], "-o", output, "--no-case", "--ext", datasets[i]['ext']]
            command = ["python", "dataprep/__main__.py", "bpe", f"{code}-{n_merges}", "-p", datasets[i]['path'], "-o", output, "--ext", datasets[i]['ext']]
            
            p1 = subprocess.run(command_nocase, cwd=os.path.join(os.environ['HOME'], 'dataprep'), stdout=PIPE, stderr=PIPE, check=True, universal_newlines=True)
            print(p1.stdout, p1.stderr)
            p2 = subprocess.run(command, cwd=os.path.join(os.environ['HOME'], 'dataprep'), stdout=PIPE, stderr=PIPE, check=True, universal_newlines=True)
            print(p2.stdout)

In [None]:
import os

RAW_DATASETS="/home/lv71161/hlibbabii/raw_datasets"
OUTPUT="/home/lv71161/hlibbabii/prep-1.0.0-alpha.0"

cross_prep([{'path': os.path.join(RAW_DATASETS, 'allamanis/'), 'ext': 'java', 'code': ''},
            {'path': os.path.join(RAW_DATASETS, 'allamanis/small_chunk'), 'ext': 'java', 'code': 'small_chunk'},
            {'path': os.path.join(RAW_DATASETS, 'c'), 'ext': 'c', 'code': 'c'},
            {'path': os.path.join(RAW_DATASETS, 'multilang'), 'ext': 'c|java|py', 'code': 'multilang'},
           ], 1000, OUTPUT)

#### 1K merges

In [30]:
from dataprep import bperegistry
from metrics.matrix import merge_similarity_rate_matrix
from pandas import DataFrame

vocab_list = [bperegistry.load_base_vocab(f'{i}-1000') for i in range(9)]
merges_list = [bperegistry.load_bpe_merges(f'{i}-1000') for i in range(9)]

AttributeError: module 'dataprep.bperegistry' has no attribute 'load_base_vocab'

In [46]:
vocab_list;

In [45]:
from dataprep.split import bpe_encode

bpe_encode.encode(vocab_list[0], merges_list[0]);

In [41]:
m = merge_similarity_rate_matrix(vocab_list[:3], merges_list[:3])

DataFrame(m).round(3)

If this happens often in your code, it can cause performance problems 
(results will be correct in all cases). 
The reason for this is probably some large input arguments for a wrapped
 function (e.g. large strings).
THIS IS A JOBLIB ISSUE. If you can, kindly provide the joblib's team with an
 example so that they can fix the problem.
  """Entry point for launching an IPython kernel.


Unnamed: 0,0,1,2
0,1.0,0.932,0.939
1,0.933,1.0,0.937
2,0.945,0.941,1.0


### Small vs big  java chunk

In [None]:
small_chunk_merges = read_merges(os.path.join(HOME, ".config/dataprep/1.0.0-alpha/bpe/small_chunk_19-05-22T17-59-55/10000/merges.txt"))

pearson(new_merges0[:10000], small_chunk_merges[:10000])

### N tokens for cross-bpe: java chunk 1 and small python corpus

In [None]:
chunk1_p0_10000=os.path.join(HOME, "raw_datasets/allamanis/chunks_nodup_en_only/01/0_19-05-22T09-25-33_preprocessed_00900_0-10000")

In [31]:
%%bash -s "$chunk1_p0_10000"
source "$HOME/.bashrc"
echo "Calculating N tokens in $1"
tc "$1"

354976336


In [32]:
chunk1_ppython_10000=os.path.join(HOME, "raw_datasets/allamanis/chunks_nodup_en_only/01/0_19-05-22T09-25-33_preprocessed_00900_python-10000")

In [33]:
%%bash -s "$chunk1_ppython_10000"
source "$HOME/.bashrc"
echo "Calculating N tokens in $1"
tc "$1"

Calculating N tokens in /home/lv71161/hlibbabii/raw_datasets/allamanis/chunks_nodup_en_only/01/0_19-05-22T09-25-33_preprocessed_00900_python-10000
391342265


In [36]:
chunk1_p0_10000=os.path.join(HOME, "raw_datasets/python_19-05-20T00-48-25_preprocessed_00900_0-10000")

In [37]:
%%bash -s "$chunk1_p0_10000"
source "$HOME/.bashrc"
echo "Calculating N tokens in $1"
tc "$1"

Calculating N tokens in /home/lv71161/hlibbabii/raw_datasets/python_19-05-20T00-48-25_preprocessed_00900_0-10000
128214089


In [34]:
python_ppython_10000=os.path.join(HOME, "raw_datasets/python_19-05-20T00-48-25_preprocessed_00900_python-10000")

In [35]:
%%bash -s "$python_ppython_10000"
source "$HOME/.bashrc"
echo "Calculating N tokens in $1"
tc "$1"

Calculating N tokens in /home/lv71161/hlibbabii/raw_datasets/python_19-05-20T00-48-25_preprocessed_00900_python-10000
121685704
