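#####  Challenge 1: Complete the environment setup in Step 1 so that the node encoding in Step 2 can run.
#####  Challenge 2: Code vulnerability detection is a graph-level binary classification problem: on top of the node encodings, the model must also produce a graph-level encoding.
#####        Implement a Readout layer (src/process/model.py) that derives the encoding of a code graph from its node encodings, so that the model training and evaluation in Step 3 can be completed.
#####  Challenge 3: Using all of the training data, modify the graph representation of the code or the model so that prediction accuracy reaches at least 55% (extra credit is awarded according to the accuracy gain, up to full marks).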
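As a starting point for Challenge 2, here is a minimal sketch of one possible Readout layer. It is not the reference solution: the class name `MeanMaxReadout`, the `batch` vector (graph index of each node), and the choice of mean plus max pooling are all assumptions, and the actual interface expected in src/process/model.py is not shown in this notebook. It uses only plain PyTorch and assumes every graph has at least one node.

```python
import torch
import torch.nn as nn


class MeanMaxReadout(nn.Module):
    """Hypothetical readout: pool node encodings per graph, then classify."""

    def __init__(self, in_dim):
        super().__init__()
        # Mean and max pooling are concatenated, hence 2 * in_dim inputs.
        self.classifier = nn.Linear(2 * in_dim, 1)

    def pool(self, x, batch, num_graphs=None):
        """x: [num_nodes, in_dim]; batch: [num_nodes] graph index per node."""
        if num_graphs is None:
            num_graphs = int(batch.max().item()) + 1
        # Per-graph sums and node counts give the mean encoding.
        sums = torch.zeros(num_graphs, x.size(1), dtype=x.dtype, device=x.device)
        sums.index_add_(0, batch, x)
        counts = torch.zeros(num_graphs, dtype=x.dtype, device=x.device)
        counts.index_add_(0, batch, torch.ones_like(batch, dtype=x.dtype))
        mean = sums / counts.clamp(min=1).unsqueeze(1)
        # Per-graph element-wise maximum (assumes no graph is empty).
        maxes = torch.stack(
            [x[batch == g].max(dim=0).values for g in range(num_graphs)]
        )
        return torch.cat([mean, maxes], dim=1)

    def forward(self, x, batch, num_graphs=None):
        graph_emb = self.pool(x, batch, num_graphs)
        # One vulnerability probability per graph.
        return torch.sigmoid(self.classifier(graph_emb)).squeeze(-1)


# Toy input: three nodes, the first two in graph 0 and the last in graph 1.
toy_x = torch.tensor([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
toy_batch = torch.tensor([0, 0, 1])
readout = MeanMaxReadout(in_dim=2)
pooled = readout.pool(toy_x, toy_batch)
probs = readout(toy_x, toy_batch)
print(pooled)
print(probs.shape)
```

Mean pooling summarizes the whole function while max pooling keeps the strongest per-feature signal. Other readouts (sum pooling, attention, or a convolutional head) may work better for Challenge 3.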

In [None]:
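# 1. Environment setup
# Install and activate the Python environment; the verified version is 3.7.2
!python --version 
!conda create -n devign python=3.7.2 # Do not run this command again; comment it out once it succeeds
# Note: each `!` command runs in its own subshell, so this activation does not carry over to later cells;
#   launch Jupyter from the activated environment (or register it as a kernel) instead
!conda activate devign    
# Install torch, torch-geometric, torch-sparse, and torch-scatter to match the current machine
#   For installation instructions, see https://pytorch.org/get-started/locally/ 
# Install the Python packages this experiment depends on
!pip install -r requirements.txt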
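Before moving on to Step 2, it can help to confirm that the kernel actually sees the packages installed above. The following is a small sketch using only the standard library. The import names in `REQUIRED` (`torch_geometric`, `torch_sparse`, `torch_scatter`) are the usual import names for those packages, and the helper `missing_packages` is a hypothetical name introduced here.

```python
import importlib.util
import sys

# Import names of the packages mentioned in the setup cell.
REQUIRED = ["torch", "torch_geometric", "torch_sparse", "torch_scatter"]


def missing_packages(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("Python:", sys.version.split()[0])
print("Missing packages:", missing_packages(REQUIRED))
```

An empty list means all four packages are importable from the current kernel. It does not check that their versions are compatible with each other.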

In [1]:
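# The raw code dataset is stored in data/raw (Paths.raw)
# Each record contains:
# project name, commit_id, target (whether the code is vulnerable), func (the function's source text)
# Let's explore how the raw data is structured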
import pandas as pd
dataset = pd.read_json("data/raw/dataset.json")
dataset.head(5)

Unnamed: 0,project,commit_id,target,func
0,FFmpeg,973b1a6b9070e2bf17d17568cbaf4043ce931f51,0,static av_cold int vdadec_init(AVCodecContext ...
1,FFmpeg,321b2a9ded0468670b7678b7c098886930ae16b2,0,static int transcode(AVFormatContext **output_...
2,FFmpeg,5d5de3eba4c7890c2e8077f5b4ae569671d11cf8,0,"static void v4l2_free_buffer(void *opaque, uin..."
3,FFmpeg,32bf6550cb9cc9f487a6722fe2bfc272a93c1065,0,"int ff_get_wav_header(AVFormatContext *s, AVIO..."
4,FFmpeg,57d77b3963ce1023eaf5ada8cba58b9379405cc8,0,"int av_opencl_buffer_write(cl_mem dst_cl_buf, ..."
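Since the task is binary classification, the label balance is worth checking before training. The sketch below assumes the column names shown above (`project`, `target`). The helper `label_summary` and the toy rows, including the placeholder project name `other_project`, are made up for illustration and are not taken from the dataset.

```python
import pandas as pd


def label_summary(df, label_col="target", group_col="project"):
    """Return per-project label counts and the overall positive rate."""
    counts = df.groupby([group_col, label_col]).size().unstack(fill_value=0)
    positive_rate = df[label_col].mean()
    return counts, positive_rate


# Toy frame with the same column names as the raw dataset (values are made up).
toy = pd.DataFrame({
    "project": ["FFmpeg", "FFmpeg", "other_project", "other_project"],
    "target": [0, 1, 1, 1],
    "func": ["int a() {}", "int b() {}", "int c() {}", "int d() {}"],
})
counts, positive_rate = label_summary(toy)
print(counts)
print(f"positive rate: {positive_rate:.2f}")
```

On the real data, call `label_summary(dataset)` after the `pd.read_json` cell. A positive rate far from 0.5 would mean that accuracy alone is a weak metric for Challenge 3.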


In [2]:
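# Use the joern tool to convert the code data into CPGs (code property graphs)
# Generation is slow and requires its own environment setup, so it is omitted here
# If you are interested, you can read more about joern: https://docs.joern.io/home
# The generated graph data is stored in data/cpg (Paths.cpg), with every 100 records written to one file
# Extract the code graph data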
!tar -xvzf cpg.tar.gz


data/cpg/
data/cpg/36_cpg.pkl
data/cpg/38_cpg.pkl
data/cpg/22_cpg.pkl
data/cpg/2_cpg.pkl
data/cpg/4_cpg.pkl
data/cpg/26_cpg.pkl
data/cpg/13_cpg.pkl
data/cpg/19_cpg.pkl
data/cpg/16_cpg.pkl
data/cpg/20_cpg.pkl
data/cpg/12_cpg.pkl
data/cpg/30_cpg.pkl
data/cpg/8_cpg.pkl
data/cpg/24_cpg.pkl
data/cpg/33_cpg.pkl
data/cpg/37_cpg.pkl
data/cpg/29_cpg.pkl
data/cpg/10_cpg.pkl
data/cpg/28_cpg.pkl
data/cpg/25_cpg.pkl
data/cpg/3_cpg.pkl
data/cpg/32_cpg.pkl
data/cpg/6_cpg.pkl
data/cpg/15_cpg.pkl
data/cpg/11_cpg.pkl
data/cpg/1_cpg.pkl
data/cpg/34_cpg.pkl
data/cpg/7_cpg.pkl
data/cpg/35_cpg.pkl
data/cpg/0_cpg.pkl
data/cpg/21_cpg.pkl
data/cpg/5_cpg.pkl
data/cpg/9_cpg.pkl
data/cpg/17_cpg.pkl
data/cpg/31_cpg.pkl
data/cpg/18_cpg.pkl
data/cpg/14_cpg.pkl
data/cpg/27_cpg.pkl
data/cpg/23_cpg.pkl
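The archive lists its 39 files in arbitrary order. If you want to walk through them in numeric order (for example, to inspect them one by one), a small helper along these lines may be useful. `sorted_cpg_files` is a hypothetical name, and the sketch only assumes the `<n>_cpg.pkl` naming visible in the listing above.

```python
import re
from pathlib import Path

# Matches file names such as "12_cpg.pkl" and captures the numeric prefix.
_CPG_NAME = re.compile(r"^(\d+)_cpg\.pkl$")


def sorted_cpg_files(names):
    """Return the '<n>_cpg.pkl' paths ordered by their numeric prefix."""
    indexed = []
    for name in names:
        match = _CPG_NAME.match(Path(name).name)
        if match:  # silently skip anything that is not a CPG pickle
            indexed.append((int(match.group(1)), name))
    return [name for _, name in sorted(indexed)]


example = ["data/cpg/10_cpg.pkl", "data/cpg/2_cpg.pkl", "data/cpg/0_cpg.pkl"]
print(sorted_cpg_files(example))
```

On disk, the input could come from `[str(p) for p in Path("data/cpg").iterdir()]`. A plain string sort would instead place `10_cpg.pkl` before `2_cpg.pkl`.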


In [2]:
# 2. Encoding the code text of graph nodes
# Train the word2vec model used to encode node text, then encode the text of each node
import main
main.setup()
main.embed_task()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 97 entries, 5619 to 6222
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   target  97 non-null     int64 
 1   func    97 non-null     object
 2   Index   97 non-null     int64 
 3   cpg     97 non-null     object
dtypes: int64(2), object(2)
memory usage: 89.3 KB
Saving input dataset 8_cpg with size 97.
<class 'pandas.core.frame.DataFrame'>
Int64Index: 98 entries, 1373 to 1962
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   target  98 non-null     int64 
 1   func    98 non-null     object
 2   Index   98 non-null     int64 
 3   cpg     98 non-null     object
dtypes: int64(2), object(2)
memory usage: 87.2 KB
CPG cut - original nodes: 229 to max: 205
Saving input dataset 2_cpg with size 98.
<class 'pandas.core.frame.DataFrame'>
Int64Index: 99 entries, 23940 to 24609
Data columns (total 4 columns):
 #   Column  Non-Null
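To make the idea behind this step concrete, the sketch below shows one common way to turn node text into fixed-size features: average the word vectors of the node's tokens, then cap the number of nodes per graph. It is an illustration only, not the code in `main.embed_task()`. The tokenizer, the toy `vectors` table (standing in for a trained word2vec model), and the zero padding are assumptions. Only the limit of 205 nodes comes from the "CPG cut" log line above.

```python
import re

MAX_NODES = 205  # taken from the "CPG cut ... to max: 205" log line


def tokenize(code):
    """Split code text into identifiers, numbers, and single punctuation marks."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def encode_node(code, vectors, dim):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    known = [vectors[tok] for tok in tokenize(code) if tok in vectors]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]


def encode_graph(node_codes, vectors, dim, max_nodes=MAX_NODES):
    """Encode each node, truncate to max_nodes, and zero-pad shorter graphs."""
    rows = [encode_node(code, vectors, dim) for code in node_codes[:max_nodes]]
    rows += [[0.0] * dim for _ in range(max_nodes - len(rows))]
    return rows


# Toy lookup table standing in for a trained word2vec model (values are made up).
vectors = {"int": [1.0, 0.0], "x": [0.0, 1.0]}
node_vec = encode_node("int x = 0;", vectors, dim=2)
graph = encode_graph(["int", "x", "int x"], vectors, dim=2, max_nodes=2)
print(node_vec)
print(graph)
```

Averaging discards token order. Challenge 3 leaves room to try richer node encoders.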

In [4]:
# 3. Model training and validation
main.process_task(True, False)

The model has 481,200 trainable parameters
<class 'pandas.core.frame.DataFrame'>
Int64Index: 98 entries, 2 to 665
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   input   98 non-null     object
 1   target  98 non-null     int64 
dtypes: int64(1), object(1)
memory usage: 5.4 KB
<class 'pandas.core.frame.DataFrame'>
Int64Index: 99 entries, 6879 to 7686
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   input   99 non-null     object
 1   target  99 non-null     int64 
dtypes: int64(1), object(1)
memory usage: 5.4 KB
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100 entries, 7688 to 8327
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   input   100 non-null    object
 1   target  100 non-null    int64 
dtypes: int64(1), object(1)
memory usage: 5.5 KB
<class 'pandas.core.frame.DataFrame'>
Int64Index: 98 entries

  dataset = dataset.append(load(data_sets_dir, ds_file))


<class 'pandas.core.frame.DataFrame'>
Int64Index: 99 entries, 10452 to 11029
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   input   99 non-null     object
 1   target  99 non-null     int64 
dtypes: int64(1), object(1)
memory usage: 5.4 KB
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100 entries, 11031 to 11665
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   input   100 non-null    object
 1   target  100 non-null    int64 
dtypes: int64(1), object(1)
memory usage: 5.5 KB
<class 'pandas.core.frame.DataFrame'>
Int64Index: 97 entries, 11677 to 12421
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   input   97 non-null     object
 1   target  97 non-null     int64 
dtypes: int64(1), object(1)
memory usage: 5.3 KB
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100 entries, 12429 to 13199
Data columns (t

  dataset = dataset.append(load(data_sets_dir, ds_file))


<class 'pandas.core.frame.DataFrame'>
Int64Index: 99 entries, 14359 to 15157
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   input   99 non-null     object
 1   target  99 non-null     int64 
dtypes: int64(1), object(1)
memory usage: 5.4 KB
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100 entries, 15174 to 15908
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   input   100 non-null    object
 1   target  100 non-null    int64 
dtypes: int64(1), object(1)
memory usage: 5.5 KB
<class 'pandas.core.frame.DataFrame'>
Int64Index: 99 entries, 15920 to 16972
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   input   99 non-null     object
 1   target  99 non-null     int64 
dtypes: int64(1), object(1)
memory usage: 5.4 KB
<class 'pandas.core.frame.DataFrame'>
Int64Index: 99 entries, 16973 to 17585
Data columns (to

  dataset = dataset.append(load(data_sets_dir, ds_file))


<class 'pandas.core.frame.DataFrame'>
Int64Index: 103 entries, 19842 to 20565
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   input   103 non-null    object
 1   target  103 non-null    int64 
dtypes: int64(1), object(1)
memory usage: 5.6 KB
<class 'pandas.core.frame.DataFrame'>
Int64Index: 98 entries, 20578 to 21227
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   input   98 non-null     object
 1   target  98 non-null     int64 
dtypes: int64(1), object(1)
memory usage: 5.4 KB
<class 'pandas.core.frame.DataFrame'>
Int64Index: 98 entries, 1373 to 1962
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   input   98 non-null     object
 1   target  98 non-null     int64 
dtypes: int64(1), object(1)
memory usage: 5.4 KB
Splitting Dataset
train with cuda.


  dataset = dataset.append(load(data_sets_dir, ds_file))
  train = train_false.append(train_true)
  val = val_false.append(val_true)
  test = test_false.append(test_true)


NotImplementedError: 
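The repeated `dataset.append(...)` and `train_false.append(train_true)` lines in the output above look like source lines echoed by pandas deprecation warnings: `DataFrame.append` was deprecated in pandas 1.4 and removed in 2.0. The sketch below shows how a per-class split of this kind can be written with `pd.concat` instead. The function name `stratified_split`, the 80/10/10 ratios, and the seed are assumptions, not the repository's actual settings.

```python
import pandas as pd


def stratified_split(df, label_col="target", train_frac=0.8, val_frac=0.1, seed=0):
    """Split each class separately, then recombine with pd.concat."""
    parts = {"train": [], "val": [], "test": []}
    for _, group in df.groupby(label_col):
        shuffled = group.sample(frac=1.0, random_state=seed)
        n_train = int(len(shuffled) * train_frac)
        n_val = int(len(shuffled) * val_frac)
        parts["train"].append(shuffled.iloc[:n_train])
        parts["val"].append(shuffled.iloc[n_train:n_train + n_val])
        parts["test"].append(shuffled.iloc[n_train + n_val:])
    # pd.concat replaces the chained DataFrame.append calls.
    return tuple(pd.concat(parts[name]) for name in ("train", "val", "test"))


# Toy data: 10 negative and 10 positive samples (values are made up).
toy = pd.DataFrame({"input": list(range(20)), "target": [0] * 10 + [1] * 10})
train, val, test = stratified_split(toy)
print(len(train), len(val), len(test))
```

Splitting per class keeps the label ratio the same in all three subsets, which matters when accuracy is the headline metric.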