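# final.ipynb: Solving the final problem with the trained decision trees
Prerequisites: run `train.ipynb` first, and preferably `waveform.py` as well. If you ran `model.ipynb` instead, add `play` to the model file names accordingly.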

In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import torch
import h5py
from sklearn import preprocessing
from sklearn import ensemble
from utils import loadData, saveData, getNum, getPePerWF, saveans, lossfunc_eval, lossfunc_train
import lightgbm as lgb
from tqdm import tqdm

Load the test set.

In [2]:
testpath = "data/final.h5"
testWF = loadData(testpath, 'test')

Structure of data:
<HDF5 dataset "Waveform": shape (12178193,), type "|V2008"> Waveform /Waveform
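The `|V2008` type in the output above means that each record is an opaque 2008-byte compound value. As a rough illustration only, the sketch below builds a NumPy structured dtype of the same size. The field names `ChannelID` and `Waveform` appear later in this notebook; `EventID`, the integer widths, and the 1000-sample waveform length are assumptions chosen so that the sizes add up to 2008, not facts read from `data/final.h5`.

```python
import numpy as np

# Hypothetical record layout: 4 + 4 + 1000 * 2 = 2008 bytes.
# Only the field names ChannelID and Waveform come from the notebook;
# EventID, the integer widths and the sample count are assumptions.
recordDtype = np.dtype([
    ("EventID", "<i4"),
    ("ChannelID", "<i4"),
    ("Waveform", "<i2", (1000,)),
])

# Build three toy records filled with the value 918,
# the threshold used by the fallback cell later in the notebook.
toyWF = np.zeros(3, dtype=recordDtype)
toyWF["EventID"] = [1, 1, 2]
toyWF["ChannelID"] = [0, 5, 0]
toyWF["Waveform"] = 918

# Indexing by field name returns plain arrays, one row per record.
print(recordDtype.itemsize)     # size of one record in bytes
print(toyWF["Waveform"].shape)  # (records, samples per waveform)
```

If the real layout is similar, `testWF['Waveform']` would be a 2-D array, which is consistent with the `axis=1` reductions used in the fallback cell.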


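If you have already run `waveform.py`, run the first of the two code cells below; otherwise, run the second (commented-out) one.

Either way, this produces the features needed by the first decision tree and prepares the inputs for the second.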

In [3]:
numPEW, wfIndices = getNum(testWF)
with h5py.File('./train/final_wf.h5', 'r') as ipt:
    intTestWF = ipt['Waveform']['intWF'][...]
    pointsPerTestWF = ipt['Waveform']['pointsPerWF'][...]
    pePerTestWFCalc = ipt['Waveform']['pePerWFCalc'][...]
    meanPeTimePerTestWF = ipt['Waveform']['meanPeTimePerWF'][...]


In [5]:
# NOTE: this fallback does not compute meanPeTimePerTestWF, which a later cell needs.
# numPEW, wfIndices = getNum(testWF)
# splitWFChannels = np.split(testWF['ChannelID'], wfIndices[1:-1])
# denoisedTestWF = np.where(testWF['Waveform'] < 918, 918-testWF['Waveform'], 0)
# intTestWF = np.sum(denoisedTestWF, axis=1)
# pointsPerTestWF = np.sum(denoisedTestWF > 0, axis=1)
# pePerTestWFCalc = np.empty(denoisedTestWF.shape[0])
# for index, waveform in enumerate(tqdm(denoisedTestWF)):
#    wfArgmax = getPePerWF(denoisedTestWF[index], pointsPerTestWF[index])
#    pePerTestWFCalc[index] = wfArgmax.shape[0]

100%|██████████| 12178193/12178193 [39:55<00:00, 5083.12it/s] 
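To make the commented-out fallback concrete, here is a minimal sketch of its first feature computations on a toy array. The threshold 918 is taken from the cell above; the sample values are invented for illustration, and `getPePerWF` from `utils` is not reproduced because its implementation is not shown here.

```python
import numpy as np

# Two toy waveforms of 8 samples each (invented values).
toyWaveforms = np.array([
    [918, 917, 910, 905, 915, 918, 919, 918],
    [918, 918, 918, 900, 918, 918, 918, 918],
])

# Samples below 918 are treated as signal and flipped to positive heights;
# everything at or above 918 is set to 0.
denoised = np.where(toyWaveforms < 918, 918 - toyWaveforms, 0)

# Integral of each denoised waveform (sum over samples).
intWF = np.sum(denoised, axis=1)

# Number of samples that carry signal in each waveform.
pointsPerWF = np.sum(denoised > 0, axis=1)

print(intWF)        # [25 18]
print(pointsPerWF)  # [4 1]
```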


Use the first decision tree to predict the number of PEs in each waveform, `pePerWF`.

In [4]:
gbmForPePerWF = lgb.Booster(model_file='./modelPePerWF.txt')
pePerWF = gbmForPePerWF.predict(
    np.stack(
        (intTestWF, pointsPerTestWF, pePerTestWFCalc),
        axis=1
    )
)
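`np.stack(..., axis=1)` turns the three per-waveform arrays into a single matrix with one row per waveform and one column per feature, which is the 2-D layout that `Booster.predict` accepts. A small sketch with invented numbers follows; presumably the column order has to match the order used when the model was trained in `train.ipynb`, which is not shown here.

```python
import numpy as np

# Toy per-waveform features (invented values), one entry per waveform.
intWF = np.array([25, 18, 40])
pointsWF = np.array([4, 1, 6])
peCalc = np.array([1.0, 1.0, 2.0])

# axis=1 places each input array in its own column.
features = np.stack((intWF, pointsWF, peCalc), axis=1)

print(features.shape)  # (3, 3): 3 waveforms, 3 features
print(features[0])     # all three features of the first waveform
```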

Compute the five features required by the second decision tree.

In [5]:
# Split the per-waveform PE predictions into one chunk per event.
splitPePerTestWFFinal = np.split(pePerWF, wfIndices[1:-1])
numEvents = len(splitPePerTestWFFinal)  # 4000 for this test set
peTotal = np.empty(numEvents)
peMean = np.empty(numEvents)
peStd = np.empty(numEvents)
for index, pePerTestWFFinalChunk in enumerate(tqdm(splitPePerTestWFFinal)):
    peTotal[index] = np.sum(pePerTestWFFinalChunk)
    peMean[index] = np.mean(pePerTestWFFinalChunk)
    peStd[index] = np.std(pePerTestWFFinalChunk)

# Same grouping for the mean PE time; the nan-aware reductions skip NaN entries.
splitMeanPeTimePerTestWF = np.split(meanPeTimePerTestWF, wfIndices[1:-1])
peTimeMean = np.empty(numEvents)
peTimeStd = np.empty(numEvents)
for index, meanPeTimePerTestWFFinalChunk in enumerate(tqdm(splitMeanPeTimePerTestWF)):
    peTimeMean[index] = np.nanmean(meanPeTimePerTestWFFinalChunk)
    peTimeStd[index] = np.nanstd(meanPeTimePerTestWFFinalChunk)


100%|██████████| 4000/4000 [00:00<00:00, 12073.58it/s]
100%|██████████| 4000/4000 [00:00<00:00, 5236.77it/s]
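The per-event grouping relies on `np.split` cutting a flat per-waveform array at the given boundary indices. The sketch below assumes that `wfIndices` holds the start offset of every event followed by the total number of waveforms, which would explain the `[1:-1]` slice; the actual return value of `getNum` in `utils` is not shown here, so treat that layout as an assumption. It also shows how `np.nanmean` skips waveforms whose mean PE time is NaN.

```python
import numpy as np

# Assumed layout: start offset of each event, then the total length.
wfIndices = np.array([0, 2, 5, 6])

# Toy per-waveform values (invented), 6 waveforms spread over 3 events.
pePerWF = np.array([1.0, 3.0, 2.0, 2.0, 2.0, 5.0])
meanPeTime = np.array([300.0, np.nan, 280.0, 290.0, 300.0, 310.0])

# Dropping the leading 0 and the trailing total leaves the interior
# boundaries, so np.split returns exactly one chunk per event.
peChunks = np.split(pePerWF, wfIndices[1:-1])
timeChunks = np.split(meanPeTime, wfIndices[1:-1])

peTotal = np.array([np.sum(chunk) for chunk in peChunks])
peStd = np.array([np.std(chunk) for chunk in peChunks])
peTimeMean = np.array([np.nanmean(chunk) for chunk in timeChunks])

print(len(peChunks))  # 3
print(peTotal)        # [4. 6. 5.]
print(peStd)          # [1. 0. 0.]
print(peTimeMean)     # [300. 290. 310.]
```

An event whose waveforms are all NaN would still yield NaN (with a `RuntimeWarning`), so it may be worth checking the feature arrays with `np.isnan` before prediction.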


Feed these features into the second decision tree to obtain the final answer, the momentum p.

In [6]:
gbmForP = lgb.Booster(model_file='./modelP.txt')
answerP = gbmForP.predict(
    np.stack(
        (peTotal, peMean, peStd, peTimeMean, peTimeStd),
        axis=1
    )
)

Save the answer in the standard format. Done!

In [7]:
saveans(answerP, './ans/ans10.h5')