# Sensor-Based Linear Regression

In this notebook, we perform several linear regressions which we call "sensor-based". This means that the features included make essential use of all 5,160 IceCube sensors. In essence, each data point in `X_train` fed into the `LinearRegression` object will be a 5160-tuple where the $i$th entry provides information about the (in)activation of the $i$th sensor for that event.

Since the raw data cannot be fed directly into `sklearn`'s `LinearRegression`, we first define functions that extract these features and return the processed data for the regressor. So that our results in this notebook are consistent with those in the others, we train and test using the data in batch 10. As batch 10 (like every other batch) consists of 200,000 events, each with up to thousands of pulses, we ran the functions below on the raw batch 10 data once and saved the results as `.parquet` files. The processed data is then loaded directly with `pandas.read_parquet` to save time.
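
To make the two feature types concrete, here is a minimal vectorized sketch on a toy detector with 6 sensors instead of 5,160. The pulse table, the column name `charge`, and the helper `toy_features` are assumptions made for illustration; the functions defined below in this notebook are the ones actually used.

```python
import numpy as np
import pandas as pd

N_SENSORS = 6  # toy value for illustration; the real detector has 5160 sensors

# Hypothetical pulse table: one row per pulse, indexed by event_id.
# The column name "charge" is an assumption about the raw data layout.
pulses = pd.DataFrame(
    {
        "sensor_id": [0, 2, 2, 5, 1, 1],
        "charge": [0.5, 1.0, 0.25, 2.0, 0.75, 0.5],
        "auxiliary": [False, False, False, True, False, False],
    },
    index=pd.Index([10, 10, 10, 10, 11, 11], name="event_id"),
)


def toy_features(pulses, n_sensors, aux_incl=False):
    """Return (binary, chargesum) DataFrames with one row per event."""
    # keep every event id, even if all of its pulses are auxiliary
    event_ids = np.unique(pulses.index.values)
    if not aux_incl:
        pulses = pulses[~pulses["auxiliary"]]

    rows = np.searchsorted(event_ids, pulses.index.values)  # row of each pulse
    cols = pulses["sensor_id"].values                       # column of each pulse

    binary = np.zeros((len(event_ids), n_sensors), dtype=int)
    binary[rows, cols] = 1  # 1 if the sensor registered at least one pulse

    chargesum = np.zeros((len(event_ids), n_sensors))
    np.add.at(chargesum, (rows, cols), pulses["charge"].values)  # accumulate repeats

    columns = range(n_sensors)
    return (pd.DataFrame(binary, index=event_ids, columns=columns),
            pd.DataFrame(chargesum, index=event_ids, columns=columns))


binary, chargesum = toy_features(pulses, N_SENSORS)
```

In event 10 the pulse on sensor 5 is auxiliary, so with `aux_incl=False` it contributes to neither feature set.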

# Importing Modules and Defining Feature Extraction Functions

In [3]:
# import modules
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from mae import angular_dist_score

In [2]:
def raw_event_to_proc_binary(event, aux_incl=False):
    """
    Given an event, this function returns a processed
    5160-tuple with the ith entry being 0 if that sensor_id
    was not pinged during this event, and a 1 otherwise
    """
    # drop auxiliary pulses unless they were explicitly requested
    if not aux_incl:
        event = event[~event.auxiliary]
    
    # array to be returned
    proc = np.zeros((5160,))
    
    # find the sensors that got pinged, modify proc accordingly
    sensors = np.unique(event.sensor_id.values)
    for sensor in sensors:
        proc[sensor] = 1
    
    return proc

In [3]:
def raw_batch_to_proc_binary(batch, aux_incl=False):
    """
    Given a (sub)batch, this function returns a processed
    pandas DataFrame whose rows are the processed events
    according to raw_event_to_proc_binary
    """
    # DataFrame to be returned
    event_ids = np.unique(batch.index)
    df = pd.DataFrame(0, index=event_ids, columns=[i for i in range(5160)])

    # run the raw_event_to_proc_binary function on each event
    count = 0
    for event_id in event_ids:
        df.loc[event_id] = raw_event_to_proc_binary(batch.loc[event_id], aux_incl=aux_incl)
        if count % 1000 == 0:
            print('Working on',count)
        count += 1
    return df

In [44]:
def raw_event_to_proc_chargesum(event, aux_incl=False):
    """
    Given an event, this function returns a processed
    5160-tuple with the ith entry being the sum of all
    charges across all pulses registered by that sensor
    in this event
    """
    # drop auxiliary pulses unless they were explicitly requested
    if not aux_incl:
        event = event[~event.auxiliary]
    
    # array to be returned
    proc = np.zeros((5160,))
    
    # find the sensors that got pinged, modify proc accordingly
    event = event.drop(['time','auxiliary'], axis=1).groupby('sensor_id').sum()
    for sensor in event.index:
        proc[sensor] = event.loc[sensor].values[0]
    
    return proc

In [45]:
def raw_batch_to_proc_chargesum(batch, aux_incl=False):
    """
    Given a (sub)batch, this function returns a processed
    pandas DataFrame whose rows are the processed events
    according to raw_event_to_proc_chargesum
    """
    # DataFrame to be returned
    event_ids = np.unique(batch.index)
    # initialise with floats, since charge sums are not integers
    df = pd.DataFrame(0.0, index=event_ids, columns=range(5160))

    # run the raw_event_to_proc_chargesum function on each event
    count = 0
    for event_id in event_ids:
        df.loc[event_id] = raw_event_to_proc_chargesum(batch.loc[event_id], aux_incl=aux_incl)
        if count % 1000 == 0:
            print('Working on',count)
        count += 1
    return df

In [2]:
# load batch of our data
batch10 = pd.read_parquet('../batches_train/batch_10.parquet')
sensor_geom = pd.read_csv('../sensor_geometry.csv')
meta10 = pd.read_parquet('../batches_train/batch10_meta.parquet')
batch10_true_directions = meta10[['azimuth', 'zenith']]

# list of unique event ids
event_ids = np.sort(np.unique(batch10.index))

# Model 0: Baseline 

In [54]:
az_avg = np.mean(meta10['azimuth'].values)
ze_avg = np.mean(meta10['zenith'].values)

In [None]:
# Train/test split of batch 10, followed by k-fold cross-validation.
# This follows the Erdos lecture notes on k-fold cross-validation, with k = 5.
# Every split uses random_state = 134.

In [8]:
from sklearn.model_selection import train_test_split, KFold
y_train, y_test = train_test_split(batch10_true_directions,
                                   shuffle=True,
                                   test_size=.25,
                                   random_state=134)

In [14]:
kfold = KFold(n_splits=5,
              shuffle=True,
              random_state=134)
# cross validation on model 0
maes_0 = []
for train_index, test_index in kfold.split(y_train):
    # assign X_tt, y_tt and X_ho, y_ho
    y_tt = y_train.iloc[train_index]
    y_ho = y_train.iloc[test_index]
    
    # "fit" our model 
    az_avg = np.mean(y_tt['azimuth'])
    ze_avg = np.mean(y_tt['zenith'])
    
    # predict 
    az_pred = az_avg*np.ones((len(y_ho),))
    ze_pred = ze_avg*np.ones((len(y_ho),))
    
    # get error according to custom error function
    err = angular_dist_score(y_ho['azimuth'].values, 
                             y_ho['zenith'].values,
                             az_pred,
                             ze_pred)
    maes_0.append(err)

In [16]:
# This code was run locally and I'm saving the result here for 
# future use without needing to run it
maes_0 = [1.5653397534634008, 1.5664915922174312, 1.5770938834588555, 1.570667153939643, 1.5712505530785261]
avg_mae_0 = np.mean(maes_0)
print("Average CV mae for model_0 (baseline) is:", avg_mae_0)

Average CV mae for model_0 (baseline) is: 1.5701685872315714
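
The error function `angular_dist_score` comes from a local `mae` module that is not shown in this notebook. The following is a minimal sketch of what such a score typically computes, assuming it is the mean angle between the true and predicted unit vectors defined by (azimuth, zenith); the name `angular_dist_sketch` is hypothetical and the actual module may differ in detail.

```python
import numpy as np


def angular_dist_sketch(az_true, zen_true, az_pred, zen_pred):
    """Mean angle between true and predicted directions (assumed definition)."""
    sa1, ca1 = np.sin(az_true), np.cos(az_true)
    sz1, cz1 = np.sin(zen_true), np.cos(zen_true)
    sa2, ca2 = np.sin(az_pred), np.cos(az_pred)
    sz2, cz2 = np.sin(zen_pred), np.cos(zen_pred)

    # dot product of the two unit vectors in Cartesian coordinates
    scalar_prod = sz1 * sz2 * (ca1 * ca2 + sa1 * sa2) + cz1 * cz2

    # guard against round-off pushing the value outside [-1, 1]
    scalar_prod = np.clip(scalar_prod, -1.0, 1.0)

    return np.mean(np.arccos(scalar_prod))


same = angular_dist_sketch(np.array([1.0]), np.array([0.5]),
                           np.array([1.0]), np.array([0.5]))
opposite = angular_dist_sketch(np.array([0.0]), np.array([0.0]),
                               np.array([0.0]), np.array([np.pi]))
orthogonal = angular_dist_sketch(np.array([0.0]), np.array([np.pi / 2]),
                                 np.array([0.0]), np.array([0.0]))
```

If the score is measured in radians as in this sketch, the baseline average of about 1.5702 sits very close to $\pi/2 \approx 1.5708$, the angle between two orthogonal directions.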


# Model 1: Sensor binary data

In [3]:
# Create processed binary data from batch10 using 
# batch10_proc_binary = raw_batch_to_proc_binary(batch10)
# but for convenience I've already run this and stored the 
# result as a .parquet file
batch10_proc_binary = pd.read_parquet('../batches_train/batch10_proc_binary.parquet')

In [5]:
# the targets are the azimuth (az) and zenith (ze)
# which we extract from the provided meta data
batch10_true_directions = meta10[['azimuth', 'zenith']]

In [9]:
# This cell is used to downsize the data for debugging purposes
# so it runs faster. Comment out to run on full dataset. 
batch10_proc_binary = batch10_proc_binary[0:1000]
batch10_true_directions = batch10_true_directions[0:1000]

In [6]:
# Now we train test split on the whole batch10
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(batch10_proc_binary, 
                                                    batch10_true_directions,
                                                    shuffle=True,
                                                    test_size=.25,
                                                    random_state=134)

In [14]:
# Train/test split of batch 10, followed by k-fold cross-validation.
# This follows the Erdos lecture notes on k-fold cross-validation, with k = 5.
# Every split uses random_state = 134.

In [15]:
# on our training set we now perform k-fold cross validation
# we use k = 5 and random seed 134
from sklearn.model_selection import KFold
kfold = KFold(n_splits=5, 
              shuffle=True,
              random_state=134)


In [17]:
### CROSS-VALIDATION ###

# Defining model 1
from sklearn.linear_model import LinearRegression
model_1 = LinearRegression(copy_X=True)

# cross validation on model 1
maes_1 = []
for train_index, test_index in kfold.split(X_train, y_train):
    # assign X_tt, y_tt and X_ho, y_ho
    X_tt = X_train.iloc[train_index]
    y_tt = y_train.iloc[train_index]
    X_ho = X_train.iloc[test_index]
    y_ho = y_train.iloc[test_index]
    
    # fit our model 
    model_1.fit(X_tt, y_tt)
    
    # predict 
    pred = model_1.predict(X_ho)
    az_pred = pred[:,0]
    ze_pred = pred[:,1]
    
    # get error according to custom error function
    err = angular_dist_score(y_ho['azimuth'].values, 
                             y_ho['zenith'].values,
                             az_pred,
                             ze_pred)
    maes_1.append(err)

In [24]:
# This code was run on the Great Lakes Cluster to save compute time
# and therefore we define maes here ourselves to be the output of that
# job (the job could be run locally from this notebook and output the
# same result, it would just take > 30 mins)
maes_1 = [1.5615248170191087, 1.5610988936989836, 1.5701054461389465, 1.5631204382457082, 1.5665744120854157]
avg_mae_1 = np.mean(maes_1)
print("Average CV mae of model_1:", avg_mae_1)

Average CV mae of model_1: 1.5644848014376325
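
The same five-fold loop recurs for models 2, 3, 4, and 6. As a sketch of how it could be factored out, the hypothetical helper `cv_scores` below wraps the loop for any multi-output regressor. The data are synthetic and `toy_score` is only a stand-in for `angular_dist_score`, so the numbers it produces are not comparable to the results recorded in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold


def cv_scores(model, X, y, score_fn, n_splits=5, random_state=134):
    """Return one hold-out score per fold for a two-target regressor."""
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    scores = []
    for train_index, test_index in kfold.split(X):
        # clone so that every fold starts from an unfitted copy of the model
        fitted = clone(model).fit(X.iloc[train_index], y.iloc[train_index])
        pred = fitted.predict(X.iloc[test_index])
        y_ho = y.iloc[test_index]
        scores.append(score_fn(y_ho['azimuth'].values,
                               y_ho['zenith'].values,
                               pred[:, 0],
                               pred[:, 1]))
    return scores


def toy_score(az_true, ze_true, az_pred, ze_pred):
    # placeholder metric for this sketch only, NOT the angular distance
    return np.mean(np.abs(az_true - az_pred) + np.abs(ze_true - ze_pred))


# synthetic stand-in data: 50 events, 4 binary "sensor" features
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.integers(0, 2, size=(50, 4)))
y_toy = pd.DataFrame({'azimuth': rng.uniform(0, 2 * np.pi, 50),
                      'zenith': rng.uniform(0, np.pi, 50)})

scores = cv_scores(LinearRegression(), X_toy, y_toy, toy_score)
```

Model 5 would need a small variation, since it fits azimuth and zenith with two separate single-target regressors.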


# Model 2: sensor chargesum

In [None]:
# Load the data
batch10_proc_chargesum = pd.read_parquet('../batches_train/batch10_proc_chargesum.parquet')
meta10 = pd.read_parquet('../batches_train/batch10_meta.parquet')
batch10_true_directions = meta10[['azimuth', 'zenith']]

In [None]:
# Create train test split
from sklearn.model_selection import train_test_split, KFold
X_train, X_test, y_train, y_test = train_test_split(batch10_proc_chargesum, 
                                                    batch10_true_directions, 
                                                    shuffle=True,
                                                    test_size=.25,
                                                    random_state=134)

In [None]:
# Cross-validation on model_2

# Defining model 2
from sklearn.linear_model import LinearRegression
model_2 = LinearRegression(copy_X=True)

# cross validation on model 2
maes_2 = []
kfold = KFold(n_splits=5,
              shuffle=True,
              random_state=134)
for train_index, test_index in kfold.split(X_train, y_train):
    # assign X_tt, y_tt and X_ho, y_ho
    X_tt = X_train.iloc[train_index]
    y_tt = y_train.iloc[train_index]
    X_ho = X_train.iloc[test_index]
    y_ho = y_train.iloc[test_index]
    
    # fit our model 
    model_2.fit(X_tt, y_tt)
    
    # predict 
    pred = model_2.predict(X_ho)
    az_pred = pred[:,0]
    ze_pred = pred[:,1]
    
    # get error according to custom error function
    err = angular_dist_score(y_ho['azimuth'].values, 
                             y_ho['zenith'].values,
                             az_pred,
                             ze_pred)
    maes_2.append(err)

In [51]:
# This code was run on the Great Lakes Cluster to save compute time
# and therefore we define maes here ourselves to be the output of that
# job (the job could be run locally from this notebook and output the
# same result, it would just take > 30 mins)
maes_2 = [1.5660103902013034, 1.565257340669138, 1.576880378243591, 1.568163212969159, 1.5706258379714457]
avg_mae_2 = np.mean(maes_2)
print("Average CV mae of model_2:", avg_mae_2)

Average CV mae of model_2: 1.5693874320109273


# Model 3: sensor binary and num-clusters

The processed data in this model combines the binary sensor activation features of model 1 with the result of a cluster analysis. We add indicator variables to the data that record whether the event was clustered into 1, 2, 3, 4, or 5 cluster(s).
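
The processed file is loaded ready-made below, so the exact encoding is not visible in this notebook. As a minimal sketch of one way such indicator variables could be appended, assuming a per-event cluster count is already available (the series `num_clusters` and the `clusters_` column prefix are hypothetical):

```python
import pandas as pd

# toy binary features for 3 events and 3 sensors (the real data has 5160 columns)
X_binary = pd.DataFrame([[1, 0, 1], [0, 1, 0], [1, 1, 1]], index=[10, 11, 12])

# hypothetical per-event cluster counts produced by some clustering step
num_clusters = pd.Series([1, 3, 5], index=[10, 11, 12], name='num_clusters')

# a categorical dtype makes get_dummies emit all five columns,
# even for cluster counts that do not occur in this sample
as_category = num_clusters.astype(pd.CategoricalDtype([1, 2, 3, 4, 5]))
indicators = pd.get_dummies(as_category, prefix='clusters', dtype=int)

# append the indicators to the sensor features, aligned on the event index
X_model_3 = X_binary.join(indicators)

# scikit-learn rejects mixed int/str column names, so make them all strings
X_model_3.columns = X_model_3.columns.astype(str)
```

With all five indicators present, the indicator columns sum to one in every row and are therefore collinear with the intercept; whether the stored features drop one level is not stated here.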

In [2]:
batch10_binary_num_clusters = pd.read_parquet('../batches_train/batch10_binary_num_cluster.parquet')
meta10 = pd.read_parquet('../batches_train/batch10_meta.parquet')
batch10_true_directions = meta10[['azimuth', 'zenith']]

In [10]:
# Create train test split
from sklearn.model_selection import train_test_split, KFold
X_train, X_test, y_train, y_test = train_test_split(batch10_binary_num_clusters, 
                                                    batch10_true_directions, 
                                                    shuffle=True,
                                                    test_size=.25,
                                                    random_state=134)

In [13]:
# Cross-validation on model_3

# Defining model 3
from sklearn.linear_model import LinearRegression
model_3 = LinearRegression(copy_X=True)

# cross validation on model 3
maes_3 = []
kfold = KFold(n_splits=5,
              shuffle=True,
              random_state=134)
for train_index, test_index in kfold.split(X_train, y_train):
    # assign X_tt, y_tt and X_ho, y_ho
    X_tt = X_train.iloc[train_index]
    y_tt = y_train.iloc[train_index]
    X_ho = X_train.iloc[test_index]
    y_ho = y_train.iloc[test_index]
    
    # fit our model 
    model_3.fit(X_tt, y_tt)
    
    # predict 
    pred = model_3.predict(X_ho)
    az_pred = pred[:,0]
    ze_pred = pred[:,1]
    
    # get error according to custom error function
    err = angular_dist_score(y_ho['azimuth'].values, 
                             y_ho['zenith'].values,
                             az_pred,
                             ze_pred)
    maes_3.append(err)

In [18]:
# This code was run on the Great Lakes Cluster to save compute time
# and therefore we define maes here ourselves to be the output of that
# job (the job could be run locally from this notebook and output the
# same result, it would just take > 30 mins)
maes_3 = [1.5614726430586565, 1.5612771545643698, 1.570034175849675, 1.5630600262171235, 1.5665136339700907]
avg_mae_3 = np.mean(maes_3)
print("Average CV mae of model_3:", avg_mae_3)

Average CV mae of model_3: 1.564471526731983


# Model 4: Sensor chargesum and num-clusters

The processed data in this model combines the sensor chargesum features of model 2 with the result of a cluster analysis. We add indicator variables to the data that record whether the event was clustered into 1, 2, 3, 4, or 5 cluster(s).

In [None]:
batch10_chargesum_num_clusters = pd.read_parquet('../batches_train/batch10_chargesum_num_cluster.parquet')
meta10 = pd.read_parquet('../batches_train/batch10_meta.parquet')
batch10_true_directions = meta10[['azimuth', 'zenith']]

In [None]:
# Create train test split
from sklearn.model_selection import train_test_split, KFold
X_train, X_test, y_train, y_test = train_test_split(batch10_chargesum_num_clusters, 
                                                    batch10_true_directions, 
                                                    shuffle=True,
                                                    test_size=.25,
                                                    random_state=134)

In [None]:
# Cross-validation on model_4

# Defining model 4
from sklearn.linear_model import LinearRegression
model_4 = LinearRegression(copy_X=True)

# cross validation on model 4
maes_4 = []
kfold = KFold(n_splits=5,
              shuffle=True,
              random_state=134)
for train_index, test_index in kfold.split(X_train, y_train):
    # assign X_tt, y_tt and X_ho, y_ho
    X_tt = X_train.iloc[train_index]
    y_tt = y_train.iloc[train_index]
    X_ho = X_train.iloc[test_index]
    y_ho = y_train.iloc[test_index]
    
    # fit our model 
    model_4.fit(X_tt, y_tt)
    
    # predict 
    pred = model_4.predict(X_ho)
    az_pred = pred[:,0]
    ze_pred = pred[:,1]
    
    # get error according to custom error function
    err = angular_dist_score(y_ho['azimuth'].values, 
                             y_ho['zenith'].values,
                             az_pred,
                             ze_pred)
    maes_4.append(err)

In [15]:
# This code was run on the Great Lakes Cluster to save compute time
# and therefore we define maes here ourselves to be the output of that
# job (the job could be run locally from this notebook and output the
# same result, it would just take > 30 mins)
maes_4 = [1.5659610694146888, 1.5652911088990729, 1.5768272185107415, 1.5677951311907221, 1.5705259073284994]
avg_mae_4 = np.mean(maes_4)
print("Average CV mae of model_4:", avg_mae_4)

Average CV mae of model_4: 1.569280087068745


# Model 5: Binary sensor data with MAE (mean absolute error)

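This model uses the same binary sensor activation data as model 1, but it is fitted with `sklearn.linear_model.SGDRegressor` and `loss='epsilon_insensitive'` instead of ordinary least squares. This loss is linear in the absolute error once the error exceeds a margin `epsilon`, so it approximates mean absolute error (MAE); it coincides with MAE only when `epsilon=0`, whereas the cell below keeps the default `epsilon=0.1`. Azimuth and zenith are fitted by two separate regressors because `SGDRegressor` handles a single target.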
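
As a small illustration of how the epsilon-insensitive loss relates to absolute error, the sketch below evaluates the per-sample loss by hand for two values of `epsilon`. The function `epsilon_insensitive` and the sample values are made up for this example, and `mae_regressor` only shows how `epsilon=0` could be passed; it is not the configuration used for the recorded results.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor


def epsilon_insensitive(y_true, y_pred, epsilon):
    """Per-sample loss: zero inside the margin, linear outside it."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - epsilon)


y_true = np.array([0.0, 1.0, 2.0])
y_pred = np.array([0.05, 1.5, 1.0])

# residuals smaller than 0.1 are ignored with the default margin
loss_default = epsilon_insensitive(y_true, y_pred, epsilon=0.1)

# with a zero margin the loss is exactly the absolute error
loss_mae = epsilon_insensitive(y_true, y_pred, epsilon=0.0)

# a regressor configured to minimise absolute error (plus its default penalty)
mae_regressor = SGDRegressor(loss='epsilon_insensitive',
                             epsilon=0.0,
                             max_iter=50000)
```

Note that `SGDRegressor` also applies an L2 penalty by default, so even with `epsilon=0` the objective is a regularised absolute error.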

In [1]:
from sklearn.linear_model import SGDRegressor
model_5_az = SGDRegressor(loss='epsilon_insensitive',
                       max_iter=50000)
model_5_ze = SGDRegressor(loss='epsilon_insensitive',
                       max_iter=50000)

In [None]:
batch10_proc_binary = pd.read_parquet('../batches_train/batch10_proc_binary.parquet')
meta10 = pd.read_parquet('../batches_train/batch10_meta.parquet')
batch10_true_directions = meta10[['azimuth', 'zenith']]

In [None]:
from sklearn.model_selection import train_test_split, KFold
X_train, X_test, y_train, y_test = train_test_split(batch10_proc_binary, 
                                                    batch10_true_directions, 
                                                    shuffle=True,
                                                    test_size=.25,
                                                    random_state=134)

In [None]:
# Cross-validation on model_5
maes_5 = []
kfold = KFold(n_splits=5,
              shuffle=True,
              random_state=134)
for train_index, test_index in kfold.split(X_train, y_train):
    # assign X_tt, y_tt and X_ho, y_ho
    X_tt = X_train.iloc[train_index]
    y_tt = y_train.iloc[train_index]
    X_ho = X_train.iloc[test_index]
    y_ho = y_train.iloc[test_index]
    
    # fit our model 
    model_5_az.fit(X_tt, y_tt['azimuth'])
    model_5_ze.fit(X_tt, y_tt['zenith'])
    
    # predict 
    az_pred = model_5_az.predict(X_ho)
    ze_pred = model_5_ze.predict(X_ho)
    
    # get error according to custom error function
    err = angular_dist_score(y_ho['azimuth'].values, 
                             y_ho['zenith'].values,
                             az_pred,
                             ze_pred)
    maes_5.append(err)

In [16]:
# This code was run on the Great Lakes Cluster to save compute time
# and therefore we define maes here ourselves to be the output of that
# job (the job could be run locally from this notebook and output the
# same result, it would just take > 5 mins)
maes_5 = [1.5623128167452556, 1.561854543414195, 1.569177896837212, 1.5639737578049215, 1.5672470043197075]
avg_mae_5 = np.mean(maes_5)
print("Average CV mae of model_5:", avg_mae_5)

Average CV mae of model_5: 1.5649132038242584


# Model 6: Adding in Katja's event features

For this model we append the time-based best-fit-line predictions `az_t_pred` and `ze_t_pred` as two extra columns to the `batch10_proc_binary` DataFrame.
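
The combined file is loaded ready-made below. A minimal sketch of the append step, assuming the line-fit predictions live in a DataFrame indexed by event id (the name `line_fit` and all values here are hypothetical):

```python
import pandas as pd

# toy binary features for 2 events and 3 sensors (the real data has 5160 columns)
X_binary = pd.DataFrame([[1, 0, 1], [0, 1, 0]],
                        index=pd.Index([10, 11], name='event_id'))

# hypothetical line-fit predictions, deliberately stored in a different row order
line_fit = pd.DataFrame({'az_t_pred': [0.3, 4.1],
                         'ze_t_pred': [1.2, 2.0]},
                        index=pd.Index([11, 10], name='event_id'))

# join aligns rows on event_id rather than on position
X_model_6 = X_binary.join(line_fit, how='left')

# scikit-learn rejects mixed int/str column names, so make them all strings
X_model_6.columns = X_model_6.columns.astype(str)
```

Joining on the index rather than concatenating by position protects against the two tables being sorted differently.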

In [None]:
batch10_proc_model_6 = pd.read_parquet('../batches_train/batch10_proc_model_6.parquet')
meta10 = pd.read_parquet('../batches_train/batch10_meta.parquet')
batch10_true_directions = meta10[['azimuth', 'zenith']]

In [None]:
# Create train test split
from sklearn.model_selection import train_test_split, KFold
X_train, X_test, y_train, y_test = train_test_split(batch10_proc_model_6, 
                                                    batch10_true_directions, 
                                                    shuffle=True,
                                                    test_size=.25,
                                                    random_state=134)

In [None]:
# Cross-validation on model_6

# Defining model 6
from sklearn.linear_model import LinearRegression
model_6 = LinearRegression(copy_X=True)

# cross validation on model 6
maes_6 = []
kfold = KFold(n_splits=5,
              shuffle=True,
              random_state=134)
for train_index, test_index in kfold.split(X_train, y_train):
    # assign X_tt, y_tt and X_ho, y_ho
    X_tt = X_train.iloc[train_index]
    y_tt = y_train.iloc[train_index]
    X_ho = X_train.iloc[test_index]
    y_ho = y_train.iloc[test_index]
    
    # fit our model 
    model_6.fit(X_tt, y_tt)
    
    # predict 
    pred = model_6.predict(X_ho)
    az_pred = pred[:,0]
    ze_pred = pred[:,1]
    
    # get error according to custom error function
    err = angular_dist_score(y_ho['azimuth'].values, 
                             y_ho['zenith'].values,
                             az_pred,
                             ze_pred)
    maes_6.append(err)

In [17]:
# This code was run on the Great Lakes Cluster to save compute time
# and therefore we define maes here ourselves to be the output of that
# job (the job could be run locally from this notebook and output the
# same result, it would just take > 30 mins)
maes_6 = [1.5085980143977527, 1.5081281425142634, 1.5139578294166005, 1.5093864031516566, 1.5137255315568392]
avg_mae_6 = np.mean(maes_6)
print("Average CV mae of model_6:", avg_mae_6)

Average CV mae of model_6: 1.5107591842074226


# Final Test of Model 6

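In the 5-fold cross-validation above, model 6 has the lowest average error (1.5108, against 1.5645 to 1.5702 for models 0 through 5), so we expect it to have the lowest generalization error. Below we train this model on the full training set `X_train, y_train` and evaluate it on the held-out test set `X_test, y_test`.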

In [None]:
# Load data
batch10_proc_model_6 = pd.read_parquet('../data/batch10_proc_model_6.parquet')
meta10 = pd.read_parquet('../data/batch10_meta.parquet')
batch10_true_directions = meta10[['azimuth', 'zenith']]

# Create train test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(batch10_proc_model_6,
                                                    batch10_true_directions,
                                                    shuffle=True,
                                                    test_size=.25,
                                                    random_state=134)

# Create and train the model
from sklearn.linear_model import LinearRegression
model_6 = LinearRegression(copy_X=True)
model_6.fit(X_train, y_train)

# Predict on the test set
pred = model_6.predict(X_test)
az_pred = pred[:,0]
ze_pred = pred[:,1]

# get error according to custom error function
err = angular_dist_score(y_test['azimuth'].values,
                         y_test['zenith'].values,
                         az_pred,
                         ze_pred)


In [18]:
# We ran the code above on the Great Lakes Cluster to save compute time
# and we record the result here
err = 1.5120286550902189
print("Error of model 6 on full test set:", err)

Error of model 6 on full test set: 1.5120286550902189


# Conclusions

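Model 6 performs better on average under 5-fold cross-validation (1.5108) than the baseline model 0 (1.5702), which simply predicts the mean of the training set's target values. Its error on the held-out test set (1.5120) is close to its cross-validation average. Models 1 through 5 each improve on the baseline by less than 0.006.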
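
The comparison can be reproduced from the averages printed in the sections above. The snippet below only collects those recorded values and ranks them; nothing is refitted.

```python
# average 5-fold CV scores, copied from the printed outputs of each section
avg_cv_mae = {
    'model_0': 1.5701685872315714,
    'model_1': 1.5644848014376325,
    'model_2': 1.5693874320109273,
    'model_3': 1.564471526731983,
    'model_4': 1.569280087068745,
    'model_5': 1.5649132038242584,
    'model_6': 1.5107591842074226,
}

# rank models from lowest (best) to highest average error
ranking = sorted(avg_cv_mae, key=avg_cv_mae.get)
best = ranking[0]

# improvement of the best model over the mean-prediction baseline
improvement = avg_cv_mae['model_0'] - avg_cv_mae[best]

for name in ranking:
    print(f"{name}: {avg_cv_mae[name]:.4f}")
print(f"Improvement of {best} over baseline: {improvement:.4f}")
```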