# Predictive Modeling for Library Migration

## Hypotheses

* H1: If many other projects have removed a library, a project is more likely to migrate away from that library.
* H2: If many other projects have migrated away from a library, a project is more likely to migrate away from that library.
* H3: If a project's use of a library does not align well with current best practices, the project is more likely to migrate away from that library.
* H4: If a project uses several same-domain libraries simultaneously, it is more likely to consolidate its usage into a single library.
* H5: A project is more likely to use a library that its upstream projects already use.
* H6: If a library is not actively maintained, a project is more likely to migrate away from that library.
* H7: If a library has unpatched security vulnerabilities, a project is more likely to migrate away from that library.
* H8: If a library has an unusual license, a project is more likely to migrate away from that library.

## Model

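First, we fix a set of libraries of interest $L$ (for example, several libraries from the same domain). For these libraries, we extract all dependency changes from a large collection of projects $\mathcal{P}$: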

$$
\begin{align}
\Delta L   &= \{\langle t,p,c,f,l^-,l^+,v^-,v^+ \rangle\}, p\in \mathcal{P}, x \in \Delta L \Rightarrow x.l^- \in L \lor x.l^+ \in L\\
\Delta L^+ &= \{x | x \in \Delta L \land x.l^+ \in L \land x.l^- = \emptyset \}\\
\Delta L^- &= \{x | x \in \Delta L \land x.l^- \in L \land x.l^+ = \emptyset\}
\end{align}
$$

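Here $t$ is the time of the change, $p$ is the project, $c$ is the commit, and $f$ is the dependency configuration file that was modified. $l^-$ and $l^+$ are the removed and added libraries (either may be empty), and $v^-$ and $v^+$ are presumably the corresponding versions.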
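
As a minimal sketch of the set definitions above, the following code partitions change records into $\Delta L$, $\Delta L^+$ and $\Delta L^-$. The `DependencyChange` class, its field names, and `split_changes` are hypothetical and introduced only for illustration; an empty library ($\emptyset$) is assumed to be represented by `None`.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set, Tuple


@dataclass(frozen=True)
class DependencyChange:
    """One tuple <t, p, c, f, l-, l+, v-, v+> (hypothetical representation)."""
    t: int                      # time of the change, e.g. a Unix timestamp
    p: str                      # project
    c: str                      # commit
    f: str                      # modified dependency configuration file
    l_minus: Optional[str]      # removed library, None if nothing was removed
    l_plus: Optional[str]       # added library, None if nothing was added
    v_minus: Optional[str] = None
    v_plus: Optional[str] = None


def split_changes(
    changes: Iterable[DependencyChange], libs: Set[str]
) -> Tuple[List[DependencyChange], List[DependencyChange], List[DependencyChange]]:
    """Return (delta_L, delta_L_plus, delta_L_minus) for the library set libs."""
    # Delta L: changes that touch a library of interest on either side.
    delta = [x for x in changes if x.l_minus in libs or x.l_plus in libs]
    # Delta L+: pure additions of a library in L.
    added = [x for x in delta if x.l_plus in libs and x.l_minus is None]
    # Delta L-: pure removals of a library in L.
    removed = [x for x in delta if x.l_minus in libs and x.l_plus is None]
    return delta, added, removed


example_libs = {"org.json:json", "com.google.code.gson:gson"}
example_changes = [
    DependencyChange(1, "proj-a", "c1", "pom.xml", None, "org.json:json"),
    DependencyChange(2, "proj-a", "c2", "pom.xml", "com.google.code.gson:gson", None),
    DependencyChange(3, "proj-b", "c3", "pom.xml", "org.json:json", "com.google.code.gson:gson"),
    DependencyChange(4, "proj-b", "c4", "pom.xml", None, "junit:junit"),
]
delta_L, delta_L_plus, delta_L_minus = split_changes(example_changes, example_libs)
```

Note that a change replacing one library with another (the third record) belongs to $\Delta L$ but to neither $\Delta L^+$ nor $\Delta L^-$ under these definitions.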

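We use a logistic regression model to fit a function $f$ (not to be confused with the configuration file $f$ above) that satisfies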

$$
\begin{cases}
f(x) = 1, x \in \Delta L^+ \\
f(x) = 0, x \in \Delta L^-
\end{cases}
$$
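
For reference, the standard form of logistic regression can be written as follows. The feature vector $\phi(x)$, the weights $w$ and the bias $b$ are notation assumed here and do not appear in the original text; the 0.5 threshold is the usual default, not a choice stated above.

$$
\Pr(x \in \Delta L^+ \mid \phi(x)) = \sigma\left(w^\top \phi(x) + b\right) = \frac{1}{1 + e^{-(w^\top \phi(x) + b)}},
\qquad
f(x) = \begin{cases} 1, & \sigma\left(w^\top \phi(x) + b\right) \ge 0.5 \\ 0, & \text{otherwise} \end{cases}
$$

Under this form, a positive weight on a feature means that larger values of that feature make an addition more likely than a removal, which is how the fitted coefficients can be read against the hypotheses.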

For each $x \in \Delta L$, let $l$ be the library that was added or removed, and compute the following features:

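1. The time $t$ of the change minus the time of the most recent release of $l$ before $t$. (Captures whether the library is being maintained; tests H6.)
2. The time $t$ of the change minus the time of the first release of $l$. (Captures the age of the library; tests H6.)
3. The number of known security vulnerabilities of $l$ at the time of the change. (Captures the library's vulnerability status; tests H7.)
4. The license of $l$, one-hot encoded over the licenses that occur among the libraries in $L$. (Tests H8.)
5. Whether the project's transitive dependencies already include $l$ at the time of the change. (Tests H5.)
6. Whether $l$ is already declared in another dependency configuration file of the project at the time of the change. (Tests H4.)
7. The global retention rate of $l$ in $\mathcal{P}$ at the time of the change (1 - number of removals / number of additions). (Tests H1.)
8. The inflow ratio of $l$ among confirmed migrations at the time of the change (in-degree / out-degree on the migration graph). (Tests H2.)
9. The mean Pointwise Mutual Information (PMI) between $l$ and all remaining dependencies at the time of the change:
    $$
    \frac{1}{|x.f|}\sum_{l'\in x.f} \log \frac{p(l,l')}{p(l)p(l')}
    $$
    The probabilities in the formula are estimated from all dependency configuration files in $\mathcal{P}$. (Tests H3.)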
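
The mean PMI feature can be sketched as follows. The function names are hypothetical. Two choices here are assumptions that the text does not specify: probabilities are estimated as simple file-level frequencies, and pairs that never co-occur (whose PMI would be $-\infty$) contribute 0 to the sum. The mean is taken over the other dependencies in the file rather than over $|x.f|$.

```python
import math
from collections import Counter
from itertools import combinations
from typing import Iterable, List, Set, Tuple


def estimate_counts(files: Iterable[Set[str]]) -> Tuple[int, Counter, Counter]:
    """Count, over all configuration files, how often each library and each
    unordered pair of libraries is declared."""
    n = 0
    single: Counter = Counter()
    pair: Counter = Counter()
    for deps in files:
        n += 1
        libs = set(deps)
        single.update(libs)
        pair.update(frozenset(q) for q in combinations(sorted(libs), 2))
    return n, single, pair


def mean_pmi(lib: str, file_deps: List[str], n: int,
             single: Counter, pair: Counter) -> float:
    """Mean PMI between lib and the other dependencies declared in one file."""
    others = [d for d in file_deps if d != lib]
    if not others:
        return 0.0
    total = 0.0
    for other in others:
        joint = pair[frozenset((lib, other))]
        if joint == 0:
            continue  # assumption: treat undefined PMI (no co-occurrence) as 0
        p_joint = joint / n
        p_lib = single[lib] / n
        p_other = single[other] / n
        total += math.log(p_joint / (p_lib * p_other))
    return total / len(others)


# Toy corpus of four configuration files.
corpus = [{"a", "b"}, {"a", "b"}, {"a", "c"}, {"c"}]
n_files, single_counts, pair_counts = estimate_counts(corpus)
pmi_ab = mean_pmi("a", ["a", "b"], n_files, single_counts, pair_counts)
```

In the toy corpus, `a` and `b` co-occur more often than independence would predict, so `pmi_ab` is positive. A smoothed estimate may be preferable when many pairs are rare.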

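Note: computing the features above requires further investigation into how to obtain security vulnerability data. Possible sources include GitHub Advisories and the data sources used in prior studies.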
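
The retention rate (feature 7) and the inflow ratio (feature 8) can be sketched as below. This is not the implementation behind `model.get_library_retention_rate` or `model.get_library_inflow_rate`, which is not shown in this notebook. The event tuples and function names are hypothetical. The sketch assumes that only events strictly before $t$ are counted, that each confirmed migration counts as one edge, and that undefined ratios are returned as NaN (which would need to be imputed before fitting).

```python
import math
from typing import List, Tuple

# (time, library, kind) with kind in {"add", "remove"}: hypothetical format
ChangeEvent = Tuple[int, str, str]
# (time, source library, target library): hypothetical format
Migration = Tuple[int, str, str]


def retention_rate(events: List[ChangeEvent], lib: str, t: int) -> float:
    """1 - removals / additions of lib, counting only events before time t."""
    added = sum(1 for ts, l, kind in events if l == lib and ts < t and kind == "add")
    removed = sum(1 for ts, l, kind in events if l == lib and ts < t and kind == "remove")
    if added == 0:
        return math.nan  # assumption: undefined when lib was never added
    return 1 - removed / added


def inflow_ratio(migrations: List[Migration], lib: str, t: int) -> float:
    """In-degree / out-degree of lib on the migration graph before time t."""
    in_deg = sum(1 for ts, src, dst in migrations if dst == lib and ts < t)
    out_deg = sum(1 for ts, src, dst in migrations if src == lib and ts < t)
    if out_deg == 0:
        return math.nan  # assumption: undefined when nobody migrated away
    return in_deg / out_deg


events = [(1, "a", "add"), (2, "a", "add"), (3, "a", "remove"), (5, "a", "add")]
migrations = [(1, "b", "a"), (2, "c", "a"), (3, "a", "d")]
rate_a = retention_rate(events, "a", 4)      # 2 additions, 1 removal before t=4
ratio_a = inflow_ratio(migrations, "a", 10)  # 2 incoming, 1 outgoing
```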

In [None]:
import datautil
import pandas as pd
dep_change = datautil.select_dependency_changes_all()
dep_change.to_csv('data/migration_changes.csv')
print(len(dep_change))

In [None]:
import model
import numpy as np
lib = "org.json:json"
migrations_to_lib = model.get_migration_to_library(lib).values
migrations_from_lib = model.get_migration_from_library(lib).values

In [None]:
# Feature arrays (features 1, 2, 7 and 8) and labels f.
index1 = np.array([])
index2 = np.array([])
index7 = np.array([])
index8 = np.array([])
f = np.array([])

# Label 1: migrations to lib; label 0: migrations away from lib.
# All features are computed for both classes so the arrays stay aligned with f.
for label, migrations in ((1, migrations_to_lib), (0, migrations_from_lib)):
    for migration in migrations:
        index1 = np.append(index1, model.index_1(migration, lib))
        index2 = np.append(index2, model.index_2(migration, lib))
        index7 = np.append(index7, model.get_library_retention_rate(migration, lib))
        index8 = np.append(index8, model.get_library_inflow_rate(migration, lib))
        f = np.append(f, label)

print(len(index1))

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve, roc_curve, auc
from sklearn.metrics import classification_report
from matplotlib import pyplot
from matplotlib import pylab
import numpy as np
import time

def plot_pr(auc_score, precision, recall, label=None):
    pylab.figure(num=None, figsize=(6, 5))
    pylab.xlim([0.0, 1.0])
    pylab.ylim([0.0, 1.0])
    pylab.xlabel('Recall')
    pylab.ylabel('Precision')
    pylab.title('P/R (AUC=%0.2f) / %s' % (auc_score, label))
    pylab.fill_between(recall, precision, alpha=0.5)
    pylab.grid(True, linestyle='-', color='0.75')
    pylab.plot(recall, precision, lw=1)
    pylab.show()

start_time = time.time()
# Feature matrix: only features 1 and 2 are used for now.
x = np.c_[index1, index2]
# x = index2.reshape(-1, 1)

# Repeat a random 80/20 split testNum times and average the test accuracy.
average = 0
testNum = 10
for i in range(testNum):
    X_train, X_test, y_train, y_test = train_test_split(x, f, test_size=0.2)
    lr = LogisticRegression()
    lr.fit(X_train, y_train)
    y_pred = lr.predict(X_test)
    p = np.mean(y_pred == y_test)  # accuracy on this split
    print(p)
    average += p

# The report and the P/R curve use the model and test set of the last split only.
answer = lr.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, answer)
report = answer > 0.5
print(classification_report(y_test, report, target_names=['neg', 'pos']))
print("average accuracy:", average / testNum)
print("time spent:", time.time() - start_time)
# Plot with the actual area under the P/R curve instead of a hard-coded 0.5.
plot_pr(auc(recall, precision), precision, recall, "pos")