# First Order Differential Model

This model expands upon the simple set of features used in `n3060_std` by adding their frame-to-frame differential: the difference between the current frame and the previous frame observed.

#### *Why include differences?*

The image and audio processes in the field depend on numerous factors (wind, time of day, weather),
so the values of the features generated at any one time are not stationary processes:
each process is the cumulative sum of its variations rather than a fluctuation around a fixed level.
In time-series analysis, such processes are handled by differencing,
whereby we use the rate of change of the features as inputs to the model.

To achieve this, the model needs to have not only the most recent frame available, but also
the previous frame. It then finds the difference between these and proceeds as with n3060_std in
feature extraction.
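
As a minimal sketch of the idea (not the project's feature code), the snippet below simulates a feature as a random walk, i.e. the cumulative sum of small variations, and shows that taking the frame-to-frame difference recovers those variations. The function name `first_difference` and the simulated data are assumptions made purely for illustration.

```python
import random
import statistics

def first_difference(values):
    """Return the frame-to-frame differences x[t] - x[t-1]."""
    return [curr - prev for prev, curr in zip(values, values[1:])]

# Simulate a feature as a random walk: the cumulative sum of small variations.
rng = random.Random(0)
steps = [rng.gauss(0.0, 1.0) for _ in range(2000)]
level, feature = 0.0, []
for s in steps:
    level += s
    feature.append(level)

diffs = first_difference(feature)

# Differencing recovers the underlying variations (all but the first one).
max_error = max(abs(d - s) for d, s in zip(diffs, steps[1:]))
print("max reconstruction error:", max_error)
print("std of raw feature:", statistics.pstdev(feature))
print("std of differences:", statistics.pstdev(diffs))
```

The differenced series has one element fewer than the input, which is why the model needs the previous frame before it can produce its first prediction.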

#### *Potential Deployment Issues*

The dataset that was generated here assumes that the model can operate at a constant 30 frames per second - so the difference observed is simply the difference in the underlying processes over 1/30th of a second. If the model when deployed is not able to obtain 30 frames per second then it is likely that the results will be different to those when training the model since the differences will occur over a longer time period.

A potential way to mitigate this in deployment is to scale the numerical difference by the time between frames. For example, if the deployed frame rate is only 15 frames per second then there is 1/15 second between frames, twice the interval used in training, and the differenced feature values should therefore be halved.
Essentially we are assuming that:

$$\frac{\partial_{\text{process}}}{\partial_{\text{time}}} = c$$
for some constant c.

Numerically this is:
$$\frac{\Delta_{\text{process}}}{\Delta_{\text{time}}} = c$$
where we have $\Delta_{\text{time}}$ as the time between frames.
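
A small sketch of this rescaling, under the constant-rate assumption above, might look as follows. The names `TRAIN_FPS` and `rescale_difference` are hypothetical and do not come from the project code.

```python
TRAIN_FPS = 30.0  # frame rate assumed when the training data was generated

def rescale_difference(diff, deployed_fps, train_fps=TRAIN_FPS):
    """Rescale a frame-to-frame difference observed at deployed_fps to the
    time step used in training (1 / train_fps seconds)."""
    dt_train = 1.0 / train_fps
    dt_deployed = 1.0 / deployed_fps
    # constant rate c = diff / dt_deployed, so the equivalent difference
    # over one training time step is c * dt_train
    return diff * dt_train / dt_deployed

# at 15 fps the observed difference spans twice the training interval,
# so it is scaled by one half (about 0.4 here)
halved = rescale_difference(0.8, deployed_fps=15)
print(halved)
```

Only the differenced features (the `d_` columns) would be rescaled in this way; the level features are unaffected by the frame rate.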

#### To Replicate Results:
Use `general_constructor.py` which is designed to generate data to be used for training models when given a certain dataset configuration (stored in the directory supplied for argument `-m`).

You can also specify where to save the output csv file with `-s`, the database to draw labels from with `-db`, and the total number of frames to generate with `-n`.
```
python common/model/training/general_constructor.py \
-s "/home/matthew/Documents/GapWatch/common/model/training/n3060_dif/n3060_dif.csv" \
-m "common.model.training.n3060_dif.n3060_dif" \
-db "common/data/labels/app/frames.db" \
-n "4000"
```
The frames are sampled such that each class has an equal number of frames in the output dataset (regardless of the frequency of video clips stored in the label database .db file), which should allow for fairer inference. This behaviour can be changed by editing `general_constructor.py` and changing the `frame_target` array to the desired weighting.

The `general_constructor.py` file first builds an index of frames to search through, such that each class is equally represented in the data. If a class is less frequent in the labels data (as "Danger" was) then it samples more frames from the videos of that particular class.
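
The class balancing described above can be sketched roughly as follows. This is an illustration only: the clip tuples, the file names and the function `build_balanced_index` are hypothetical, sampling with replacement is an assumption, and the real logic in `general_constructor.py` may differ.

```python
import random
from collections import Counter

def build_balanced_index(clips, n_total, seed=0):
    """clips: list of (label, video_url, first_frame, last_frame) tuples.
    Returns n_total (label, video_url, frame) entries, split evenly by label."""
    rng = random.Random(seed)
    by_label = {}
    for label, url, first, last in clips:
        by_label.setdefault(label, []).append((url, first, last))
    per_class = n_total // len(by_label)
    index = []
    for label, label_clips in sorted(by_label.items()):
        for _ in range(per_class):
            # rarer classes have fewer clips, so each clip is drawn more often
            url, first, last = rng.choice(label_clips)
            index.append((label, url, rng.randint(first, last)))
    return index

# hypothetical labelled clips: "Danger" is much rarer than "No_Danger"
clips = [
    ("Danger", "a.mp4", 100, 160),
    ("No_Danger", "a.mp4", 161, 900),
    ("No_Danger", "b.mp4", 0, 1200),
    ("No_Danger", "c.mp4", 0, 800),
]
index = build_balanced_index(clips, n_total=4000)
counts = Counter(label for label, _, _ in index)
print(counts)
```

Because the single "Danger" clip is short, many of its frames are sampled repeatedly, which is the trade-off of forcing equal class counts.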

Once the index of frames has been constructed:

* Open the .mp4 file in OpenCV and seek to the desired frame
* Perform resizing and feature extraction according to the database configuration
  * E.g. may need to include some previous frames to get the difference between them
  * These transformations are generally run in parallel if Intel MKL is installed, as quite often they boil down to simple matrix operations
* Convert the .mp4 file to .wav format with ffmpeg (this is usually single threaded)
* Seek to the desired frames in the audio file output by ffmpeg and extract the features again as defined in the database configuration file
* Close the video and audio tracks and save this chunk to the csv file. Repeat for all labels stored in the .db file.

In [1]:
# load the model and data, plus imports and model-specific settings
import numpy as np
import pandas as pd
from pathlib import Path

# standard libraries
import importlib, argparse, os.path, os, sys, time, sqlite3, subprocess, progressbar

# attempt to import the dataset configuration itself
# Add the git root directory to python path
# unfortunately need to do this manually because Jupyter
# sets the current working directory to the location of the notebook .ipynb file
git_root = "/home/matthew/Documents/GapWatch"
model_module = "common.model.training.n3060_dif.n3060_dif"
data_path = "common/model/training/n3060_dif/n3060_dif.csv"
sys.path.insert(0,git_root)

# custom helper scripts
import common.model.training.training_utils as tu

# import the model
m = importlib.import_module(model_module)

# read the data
df = pd.read_csv(os.path.join(git_root,data_path))

# headers should be the same as m.const_header()
# to keep headers consistent throughout the project
headers = df.columns.values
print("Dataframe Header: \n{}".format(headers))

Dataframe Header: 
['label' 'video_url' 'frame' 'mean' 'd_mean' 'var' 'd_var' 'kurt' 'd_kurt'
 'skew' 'd_skew' 'mfcc_0' 'mfcc_1' 'mfcc_2' 'mfcc_3' 'mfcc_4' 'mfcc_5'
 'mfcc_6' 'mfcc_7' 'mfcc_8' 'mfcc_9' 'd_mfcc_0' 'd_mfcc_1' 'd_mfcc_2'
 'd_mfcc_3' 'd_mfcc_4' 'd_mfcc_5' 'd_mfcc_6' 'd_mfcc_7' 'd_mfcc_8'
 'd_mfcc_9']


In [2]:
# select the X and y to be used for modeling
ml_df = df.copy() # make a copy
X_cols = headers[3:] # numeric features
y_cols = headers[0] # "label"

# NOTE: Perform standardisation inside the model pipeline
# as should not use test data for mean and standard deviation estimates!

# Models

We will consider the following models:

1. Logistic Regression
2. Decision Tree Classifier

## Train-Valid-Test Split
Split the frames by `video_url` as this will ensure that the unseen data doesn't contain frames from videos that the model has been exposed to. This is particularly important because some of the features are video dependent (eg. lighting, weather etc) and the model should not have seen any of those criteria if the frame is truly "unseen".
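
The grouping idea can be illustrated with a small self-contained sketch in plain Python. The function `split_by_group` and the example video names are hypothetical; the point is only that each video lands in exactly one of the three sets.

```python
import random

def split_by_group(groups, train_prop=0.5, val_prop=0.25, seed=0):
    """Assign each unique group (e.g. a video URL) to exactly one of
    train / validation / test and return the three sets of groups."""
    unique = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_train = int(train_prop * len(unique))
    n_val = int(val_prop * len(unique))
    train = set(unique[:n_train])
    val = set(unique[n_train:n_train + n_val])
    test = set(unique[n_train + n_val:])
    return train, val, test

# hypothetical frame table: one video URL per frame, 8 videos in total
frame_videos = ["v{}.mp4".format(i % 8) for i in range(40)]
train, val, test = split_by_group(frame_videos)

# frames are then selected by the set their video belongs to
train_rows = [i for i, v in enumerate(frame_videos) if v in train]
print(len(train), len(val), len(test), len(train_rows))
```

Note that the proportions apply to videos rather than frames, so the resulting frame counts per set only approximately follow the 50/25/25 split.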

In [3]:
# get the video urls to determine unique videos
video_urls = ml_df.loc[:,headers[1]].unique()

# randomly sample
np.random.seed(0)
train_prop = 0.5 # train on half the videos; the remainder is split between validation and test
val_prop   = 0.25
train_idx = np.random.choice(len(video_urls),int(np.floor(train_prop*len(video_urls))),
                           replace=False)
val_idx_full   = np.setdiff1d(np.arange(len(video_urls)), train_idx)

val_idx = np.random.choice(val_idx_full,int(np.floor(val_prop*len(video_urls))),
                           replace=False)
test_idx = np.setdiff1d(val_idx_full, val_idx)

# find the videos corresponding to these indices
train_videos = video_urls[train_idx]
val_videos   = video_urls[val_idx]
test_videos   = video_urls[test_idx]


ml_train = ml_df.loc[ml_df[headers[1]].isin(train_videos),:]
ml_val   = ml_df.loc[ml_df[headers[1]].isin(val_videos)  ,:]
ml_test  = ml_df.loc[ml_df[headers[1]].isin(test_videos) ,:]

print("Train set shape: ", ml_train.shape)
print("Validation set shape: ",ml_val.shape)
print("Test set shape: ",ml_test.shape)

Train set shape:  (2062, 31)
Validation set shape:  (1018, 31)
Test set shape:  (956, 31)


## Logistic Regression
### Training

In [None]:
# load scikit learn logistic regression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, ShuffleSplit
from sklearn.metrics import confusion_matrix

# Pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler # (x-mu)/sigma

# hide convergence warnings
import warnings
warnings.filterwarnings("ignore")

# show progress
import progressbar

def score_model(clf, X_test, y_test, t=None):
    # get predictions
    if t is not None:
        # user defined threshold (a threshold of 0.0 is still honoured)
        y_score = clf.predict_proba(X_test)[:,1]
        y_hat = np.where(y_score > t, 1, 0)
    else:
        y_hat = clf.predict(X_test)
    # get the confusion matrix
    tn, fp, fn, tp = confusion_matrix(y_test, y_hat).ravel()
    # get accuracy
    acc  = (tn+tp)/(tn+fp+fn+tp)
    # get sensitivity
    sens = tp/(tp+fn)
    # get specificity
    spec = tn/(tn+fp)
    return acc, sens, spec
    

# perform bootstrapping to observe which coefficients can be dropped
b = 200
n_train = 0.5 # train on half, test on half

# storage for accuracy, sensitivity, specificity (for both train and test to check for overfitting)
scores = np.zeros((b, 6))
bs = ShuffleSplit(n_splits = b,
                 random_state=0,
                 test_size=0.5)
conf = .9 # form a 90% confidence interval to include significant variables
X_cols_0 = X_cols.copy() # make a copy to prevent weirdness happening
stop_cond = False
while not stop_cond:
    # define the response and predictors
    X = ml_train[X_cols_0]
    y = np.where(ml_train[y_cols]=="Danger",1,0) # 1: Danger, 0 No_Danger
    # iterate training the model and performing variable selection
    params = np.zeros((b,1+len(X_cols_0))) # storage for coefficients
    iter = 0
    with progressbar.ProgressBar(max_value=b) as bar:
        for train_index, test_index in bs.split(X):
            X_train, X_test = X.iloc[train_index,:], X.iloc[test_index,:]
            y_train, y_test = y[train_index], y[test_index]
            # fit the classifier
            lr_pipe = Pipeline([('scaler', StandardScaler()),
                  ('lr', LogisticRegression(random_state=0,
                                           solver="sag"))])
            lr_pipe.fit(X_train, y_train)
            # Automatically tune the threshold to the training data
            y_hat = lr_pipe.predict_proba(X_train)[:, 1]
            p, r, thresholds = tu.precision_recall_curve(y_train, y_hat)
            target_r = 0.95
            t_opt = thresholds[np.argmin(np.abs(r-target_r))]
            
            # record the model parameters
            params[iter,:] = np.squeeze(np.hstack((lr_pipe.named_steps['lr'].intercept_[:,None],
                                                   lr_pipe.named_steps['lr'].coef_)))
            # score the model on train data first
            scores[iter,:3] = score_model(lr_pipe, X_train, y_train, t=t_opt)
            # score the model on test data
            scores[iter,3:] = score_model(lr_pipe, X_test, y_test, t=t_opt)
            # iteration counter
            bar.update(iter)
            iter +=1
    # check the parameters for ones that are not significantly different to zero and drop
    # then refit the model
    # We cannot drop the intercept!
    up = np.percentile(params[:,1:], 100*(1-(1-conf)/2), axis=0)
    down = np.percentile(params[:,1:], 100*(1-conf)/2, axis=0)
    keep = np.where((up>0)&(down<0),False, True)
    not_keep = np.where((up>0)&(down<0),True, False)
    X_cols_1 = X_cols_0[keep]
    print("Variables Kept:\n{}\n".format(X_cols_1))
    print("Variables Removed:\n{}\n".format(X_cols_0[not_keep]))
    # update the columns to use in the model
    X_cols_0 = X_cols_1
    if False in keep:
        stop_cond = False
    else:
        print("Stopping condition reached, no more variables to drop")
        stop_cond = True
        
    


In [None]:
# plotting
import matplotlib.pyplot as plt
import seaborn as sns

# boxplot of the parameters
fig = plt.figure(figsize = (8,8))
ax = fig.add_subplot(2, 2, 1)
ax.boxplot(params)
ax.axhline(y=0, c="b")
x_ax_labs = np.insert(X_cols_0,0,"intercept") 
ax.set_xticklabels(x_ax_labs, rotation = 90)


# boxplots of the scores, grouped by metric (train vs test)
ax = fig.add_subplot(2, 2, 2)
ax.boxplot(scores[:,[0,3]])
ax.set_xticklabels(["Training\nAccuracy", "Testing\nAccuracy"])

ax = fig.add_subplot(2, 2, 3)
ax.boxplot(scores[:,[1,4]])
ax.set_xticklabels(["Training\nSensitivity", "Testing\nSensitivity"])

ax = fig.add_subplot(2, 2, 4)
ax.boxplot(scores[:,[2,5]])
ax.set_xticklabels(["Training\nSpecificity", "Testing\nSpecificity"])

plt.subplots_adjust(hspace=0.3)
plt.savefig("lr.pdf")
plt.show()

## Logistic Regression
### Threshold Selection

In [None]:
from sklearn.metrics import roc_curve, precision_recall_curve, auc, make_scorer, recall_score, accuracy_score, precision_score, confusion_matrix

X_train, X_test = ml_train[X_cols_1], ml_val[X_cols_1]
y_train, y_test = np.where(ml_train[y_cols]=="Danger",1,0), np.where(ml_val[y_cols]=="Danger",1,0)
lr_pipe = Pipeline([('scaler', StandardScaler()),
                  ('lr', LogisticRegression(random_state=0,
                                           solver="sag"))])
lr_pipe.fit(X_train, y_train)
# get y_hat
y_hat = lr_pipe.predict_proba(X_test)[:, 1]

p, r, thresholds = tu.precision_recall_curve(y_test, y_hat)

target_r = 0.95
t_opt_lr = thresholds[np.argmin(np.abs(r-target_r))]
print("Threshold to achieve Sensitivity of {}: {}".format(target_r, t_opt_lr))

tu.precision_recall_threshold(p, r, thresholds, y_hat, y_test, t= t_opt_lr)
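
For illustration, threshold selection for a target sensitivity can be sketched in plain Python as below. This is a simplified variant, not the `training_utils` implementation: it returns the largest threshold whose recall is at least the target (predicting 1 when the score is greater than or equal to the threshold), whereas the cell above picks the threshold whose recall is nearest to the target. The function name and the toy data are assumptions.

```python
def threshold_for_recall(y_true, y_score, target_recall=0.95):
    """Return the largest threshold whose recall is at least target_recall,
    predicting 1 when score >= threshold. Returns None if there are no
    positive labels."""
    n_pos = sum(y_true)
    if n_pos == 0:
        return None
    # walk the candidate thresholds from strictest to most lenient
    for t in sorted(set(y_score), reverse=True):
        tp = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s >= t)
        if tp / n_pos >= target_recall:
            return t
    return None

# toy labels and scores (hypothetical)
y_true_demo = [0, 0, 1, 0, 1, 1, 0, 1]
y_score_demo = [0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]
t_demo = threshold_for_recall(y_true_demo, y_score_demo, target_recall=0.75)
print(t_demo)
```

Lowering the threshold raises sensitivity at the cost of more false alarms, which is why the precision-recall and ROC curves are inspected alongside the chosen value.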

In [None]:
tu.plot_precision_recall_vs_threshold(p, r, thresholds)

In [None]:
fpr, tpr, auc_thresholds = tu.roc_curve(y_test, y_hat)
print("Area under ROC curve: {}".format(auc(fpr, tpr))) # AUC of ROC
tu.plot_roc_curve(fpr, tpr, 'recall_optimized')

## Decision Tree Classifier
### Training

In [None]:
# load scikit learn decision tree classifier
from sklearn.tree import DecisionTreeClassifier

# parallel
from multiprocessing import Pool
from itertools import repeat

X = ml_train[X_cols]
y = np.where(ml_train[y_cols]=="Danger",1,0) # 1: Danger, 0 No_Danger

# repeated random splits to observe how performance varies with tree depth
b = 200
n_train = 0.5 # train on half, test on half
# storage for accuracy, sensitivity, specificity (for both train and test to check for overfitting)
nk = 10 # number of different depths to try
depths = np.arange(nk)+1 # zero depth tree doesnt make sense

scores = np.zeros((b, nk, 6))
bs = ShuffleSplit(n_splits = b,
                 random_state=0,
                 test_size=0.5)
start = time.time()
# evaluate performance in parallel: each worker receives one (train, test) split
def return_scores(split, depths):
    train_index, test_index = split
    # select the rows for this particular split
    # (previously the globals left over from the last loop iteration were used)
    X_train, X_test = X.iloc[train_index,:], X.iloc[test_index,:]
    y_train, y_test = y[train_index], y[test_index]
    scores = np.zeros((len(depths),6))
    for i in range(len(depths)):
        # fit the classifier
        dt_pipe = Pipeline([('scaler', StandardScaler()),
                 ('dt', DecisionTreeClassifier(random_state=0,
                        max_depth=depths[i]))])
        dt_pipe.fit(X_train, y_train)
        # threshold optimisation
        y_hat = dt_pipe.predict_proba(X_train)[:, 1]
        p, r, thresholds = tu.precision_recall_curve(y_train, y_hat)
        target_r = 0.95
        t_opt = thresholds[np.argmin(np.abs(r-target_r))]
        # score the model on train data first
        scores[i,:3] = score_model(dt_pipe, X_train, y_train, t=t_opt)
        # score the model on test data
        scores[i,3:] = score_model(dt_pipe, X_test, y_test, t=t_opt)
    return scores
with Pool(os.cpu_count()) as pool:
    res = pool.starmap(return_scores,
                        zip(bs.split(X),
                            repeat(depths)))
print("Search Complete in {} seconds".format(time.time()-start))
scores = np.array(res)

# get the confidence intervals
up = np.percentile(scores, 95, axis=0)
med = np.percentile(scores, 50, axis=0)
lo = np.percentile(scores, 5, axis=0)

# Plot 1: Accuracy with depth
metric = "Accuracy"
fig = plt.figure(figsize = (8,8))
ax = fig.add_subplot(3, 1, 1)
ax.fill_between(depths, up[:,0], lo[:,0], color="b")
ax.plot(depths, med[:,0], c="b", label="Train {}".format(metric))
ax.fill_between(depths, up[:,3], lo[:,3], color="r")
ax.plot(depths, med[:,3], c="r", label="Test {}".format(metric))
ax.legend()


# Plot 2: Sensitivity with depth
metric = "Sensitivity"
ax = fig.add_subplot(3, 1, 2)
ax.fill_between(depths, up[:,1], lo[:,1], color="b")
ax.plot(depths, med[:,1], c="b", label="Train {}".format(metric))
ax.fill_between(depths, up[:,4], lo[:,4], color="r")
ax.plot(depths, med[:,4], c="r", label="Test {}".format(metric))
ax.legend()

# Plot 3: Specificity with depth
metric = "Specificity"
ax = fig.add_subplot(3, 1, 3)
ax.fill_between(depths, up[:,2], lo[:,2], color="b")
ax.plot(depths, med[:,2], c="b", label="Train {}".format(metric))
ax.fill_between(depths, up[:,5], lo[:,5], color="r")
ax.plot(depths, med[:,5], c="r", label="Test {}".format(metric))
ax.legend()


plt.subplots_adjust(hspace=0.3)
plt.savefig("dt.pdf")
plt.show()


## Decision Tree
### Threshold Selection

In [None]:
# optimal depth appears to be 3 as overfitting happens afterwards
max_depth=3
X_train, X_test = ml_train[X_cols], ml_val[X_cols]
y_train, y_test = np.where(ml_train[y_cols]=="Danger",1,0), np.where(ml_val[y_cols]=="Danger",1,0)
clf = DecisionTreeClassifier(random_state=0,
                        max_depth=max_depth).fit(X_train,y_train)
# get y_hat
y_hat = clf.predict_proba(X_test)[:, 1]

p, r, thresholds = tu.precision_recall_curve(y_test, y_hat)

target_r = 0.95
t_opt_dt = thresholds[np.argmin(np.abs(r-target_r))]
print("Threshold to achieve Sensitivity of {}: {}".format(target_r, t_opt_dt))

tu.precision_recall_threshold(p, r, thresholds, y_hat, y_test, t= t_opt_dt)

# Save Models in Pipeline
Save the models so that they can be easily reloaded

In [None]:
from joblib import dump, load

# config utilities
from common.data.labels.app.config_utils import JSONPropertiesFile

# logistic regression
# make sure we can automatically select the variables in X that LR expects
# don't want to be creating a dataframe every time this pipeline is called as that will get
# expensive quickly
# load in the model headers

# attempt to remove any previous config or pkl files
def cleanup(path):
    while os.path.isfile(path):
        # file deletions appear to fail sometimes,
        # so retry until the file is gone
        os.remove(path)
        time.sleep(0.01)

m_loc = "common.model.training.n3060_dif.n3060_dif" # dataset configuration file
lr_model_store = os.path.join(git_root,"common/model/training/n3060_dif/n3060_dif_lr.pkl")

scaler_step_name = "scaler"
headers = m.const_header()
lr_sel_headers = [True if i in X_cols_0 else False for i in headers]

# write a config file
lr_CONFIG_FILE_LOC = "lr_config.json"

cleanup(lr_CONFIG_FILE_LOC)
cleanup(lr_model_store)

lr_default_properties = {
    'name'        : 'lr',
    'headers'     : headers,
    'sel_headers' : lr_sel_headers,
    'model_store' : lr_model_store,
    'm_loc'       : m_loc,
    'n_prev'      : 1,
    'scaler'      : scaler_step_name
}
config_file = JSONPropertiesFile(lr_CONFIG_FILE_LOC, lr_default_properties)
config = config_file.get()
config_file.set(config)
# dump the logistic regression model
X_train, X_test = ml_train[X_cols_0], ml_val[X_cols_0] # recall that these are the variables it uses
y_train, y_test = np.where(ml_train[y_cols]=="Danger",1,0), np.where(ml_val[y_cols]=="Danger",1,0)

lr_pipe = Pipeline([(scaler_step_name, StandardScaler()),
                  ('lr', LogisticRegression(random_state=0,
                                           solver="sag"))])
lr_pipe.fit(X_train, y_train)
lr_pipe.score(X_test, y_test)
dump(lr_pipe, lr_model_store)


# store decision tree model
dt_model_store = os.path.join(git_root, "common/model/training/n3060_dif/n3060_dif_dt.pkl")
headers = m.const_header()
dt_sel_headers = [True if i in X_cols else False for i in headers]

# write a config file
dt_CONFIG_FILE_LOC = "dt_config.json"

cleanup(dt_CONFIG_FILE_LOC)
cleanup(dt_model_store)

dt_default_properties = {
    'name'        : 'dt',
    'headers'     : headers,
    'sel_headers' : dt_sel_headers,
    'model_store' : dt_model_store,
    'm_loc'       : m_loc,
    'n_prev'      : 1,
    'scaler'      : scaler_step_name  
}
config_file = JSONPropertiesFile(dt_CONFIG_FILE_LOC, dt_default_properties)
config = config_file.get()
config_file.set(config)
# dump the decision tree classifier model
X_train, X_test = ml_train[X_cols], ml_val[X_cols] # recall that these are the variables it uses
y_train, y_test = np.where(ml_train[y_cols]=="Danger",1,0), np.where(ml_val[y_cols]=="Danger",1,0)

dt_pipe = Pipeline([(scaler_step_name, StandardScaler()),
                 ('dt', DecisionTreeClassifier(random_state=0,
                        max_depth=max_depth))])
dt_pipe.fit(X_train, y_train)
dt_pipe.score(X_test, y_test)
dump(dt_pipe, dt_model_store)

# Compare Models
Simulated live inference.
Find a video in the test set with a Dangerous scene and observe how:

1. Predictors change
2. Model predictions change

## Compare Models:
### Previously seen footage
How do the models perform when used on a video that they have seen before?


In [None]:
# select a video
import common.model.training.general_demo as demo
# vid_url = ml_test.loc[(ml_test[headers[0]]=="Danger")&(ml_test[headers[2]] > 100), headers[1]].values[0]
vid_url = ml_train.loc[(ml_train[headers[0]]=="Danger")&(ml_train[headers[2]] > 100), headers[1]].values[0]
out_file = os.path.join(git_root,"common/model/training/n3060_dif/n3060_dif_inference.csv") # save the demo data here
# do the demo
print("Performing inference on video: \n{}".format(vid_url))
demo_df = demo.inference_demo(vid_url, out_file, [dt_CONFIG_FILE_LOC,
                                                 lr_CONFIG_FILE_LOC])

In [None]:
# summary statistics of the demo output
demo_df.describe()

In [None]:
y_hat_lr = demo_df["lr"]
y_hat_dt = demo_df["dt"]
frames = np.arange(len(y_hat_lr))
y_true = np.where((frames>=6915)&(frames<=6935),1,0)
fig = plt.figure(figsize = (8,8))
ax = fig.add_subplot(1, 1, 1)
ax.plot(frames, y_true, color="g", label="True Label")
ax.plot(frames, y_hat_dt, color="r", label="dt")
ax.plot(frames, y_hat_lr, color="b", label="lr")
ax.set_xlim((6500,7000))
ax.legend()


In [None]:
y_hat_lr_score = lr_pipe.predict_proba(ml_df.loc[ml_df[headers[1]]==vid_url,X_cols_0])[:,1]
y_hat_dt_score = dt_pipe.predict_proba(ml_df.loc[ml_df[headers[1]]==vid_url,X_cols])[:,1]
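# convert the string labels to 0/1 so they share an axis with the predictions
labels = np.where(ml_df.loc[ml_df[headers[1]]==vid_url,headers[0]]=="Danger",1,0) # 1: Danger, 0: No_Danger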

# convert the scores to predictions using threshold
y_hat_lr = np.where(y_hat_lr_score>t_opt_lr,1,0)
y_hat_dt = np.where(y_hat_dt_score>t_opt_dt,1,0)

frames = ml_df.loc[ml_df[headers[1]]==vid_url, headers[2]]
fig = plt.figure(figsize = (16,8))
ax = fig.add_subplot(1, 2, 1)
ax.set_title("Video: \n{}\nLabel Prediction\nfrom Training Data".format(vid_url))
ax.plot(frames, y_hat_lr, color="b", label="lr prediction")
ax.plot(frames, y_hat_dt, color="r", label="dt prediction")
ax.plot(frames, labels, color = "g", label="True")
ax.set_xlim((6500,7000))
ax.legend()

ax = fig.add_subplot(1, 2, 2)
ax.set_title("Video: \n{}\nLabel Probability\nfrom Training Data".format(vid_url))
ax.plot(frames, y_hat_lr_score, color="b", label="lr probability")
ax.plot(frames, y_hat_dt_score, color="r", label="dt probability")
ax.plot(frames, labels, color = "g", label="True")
ax.set_xlim((6000,8000))
ax.legend()

## Compare Models:
### Previously Unseen Footage

In [None]:
# select a video
import common.model.training.general_demo as demo
vid_url = ml_test.loc[(ml_test[headers[0]]=="Danger")&(ml_test[headers[2]] > 100), headers[1]].values[0]
out_file = os.path.join(git_root,"common/model/training/n3060_dif/n3060_dif_inference.csv") # save the demo data here
# do the demo
print("Performing inference on video: \n{}".format(vid_url))
demo_df = demo.inference_demo(vid_url, out_file, [dt_CONFIG_FILE_LOC,
                                                 lr_CONFIG_FILE_LOC])

In [None]:
y_hat_lr = demo_df["lr"]
y_hat_dt = demo_df["dt"]
frames = np.arange(len(y_hat_lr))
y_true = np.where((frames>=1686)&(frames<=2120),1,0)
fig = plt.figure(figsize = (8,8))
ax = fig.add_subplot(1, 1, 1)
ax.plot(frames, y_true, color="g", label="True Label")
ax.plot(frames, y_hat_dt, color="r", label="dt")
ax.plot(frames, y_hat_lr, color="b", label="lr")
ax.set_xlim((1500,2500))
ax.legend()

In [None]:
y_hat_lr_score = lr_pipe.predict_proba(ml_df.loc[ml_df[headers[1]]==vid_url,X_cols_0])[:,1]
y_hat_dt_score = dt_pipe.predict_proba(ml_df.loc[ml_df[headers[1]]==vid_url,X_cols])[:,1]
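# convert the string labels to 0/1 so they share an axis with the predictions
labels = np.where(ml_df.loc[ml_df[headers[1]]==vid_url,headers[0]]=="Danger",1,0) # 1: Danger, 0: No_Danger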

# convert the scores to predictions using threshold
y_hat_lr = np.where(y_hat_lr_score>t_opt_lr,1,0)
y_hat_dt = np.where(y_hat_dt_score>t_opt_dt,1,0)

frames = ml_df.loc[ml_df[headers[1]]==vid_url, headers[2]]
fig = plt.figure(figsize = (16,8))
ax = fig.add_subplot(1, 2, 1)
ax.set_title("Video: \n{}\nLabel Prediction\nfrom Test Data".format(vid_url))
ax.plot(frames, y_hat_lr, color="b", label="lr prediction")
ax.plot(frames, y_hat_dt, color="r", label="dt prediction")
ax.plot(frames, labels, color = "g", label="True")
ax.set_xlim((1500,2500))
ax.legend()

ax = fig.add_subplot(1, 2, 2)
ax.set_title("Video: \n{}\nLabel Probability\nfrom Test Data".format(vid_url))
ax.plot(frames, y_hat_lr_score, color="b", label="lr probability")
ax.plot(frames, y_hat_dt_score, color="r", label="dt probability")
ax.plot(frames, labels, color = "g", label="True")
ax.set_xlim((1000,3000))
ax.legend()