# Compile all event features used for linear regression here

In this notebook, we import the features we have precomputed and add a few others that seem useful. The precomputed features are: 
1. A time best fit line based on physics literature: the velocity of this line (vx_t, vy_t, vz_t), the angles predicted from it (az_t_pred, ze_t_pred), and the mae of this line (mae_t).
2. A PCA best fit line: the velocity of the line (vx_pca, vy_pca, vz_pca) and the angles predicted from it (az_pca_pred, ze_pca_pred).
3. The number of clusters corresponding to each event.

We normalize these features so that every velocity has norm 1, every azimuth lies in [0, 2\pi], and every zenith lies in [0, \pi]. If the zenith is 0 or \pi, the azimuth is set to 0. 
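
For reference, the feature rows displayed later in this notebook are consistent with the standard spherical-coordinate relation between a unit direction vector and its angles. For example, in the first displayed row vz_t = 0.524635, which matches cos(1.01851) for ze_t_pred = 1.01851. This convention is an assumption inferred from the displayed values, not something stated in the precomputed files:

$$
(v_x, v_y, v_z) = (\sin(ze)\cos(az),\ \sin(ze)\sin(az),\ \cos(ze))
$$

Under this convention, when the zenith is 0 or \pi we have \sin(ze) = 0, so the horizontal components vanish and the azimuth has no effect on the direction. That is why it can be set to 0 without losing information.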

We compute a few extra features in this notebook, namely: 
1. The mse of the time best fit line, stored both squared and not squared (mse_squared, mse).
2. The absolute value of the dot product of the two best fit lines (dot_product).
3. The number of clusters as a categorical variable.
4. A categorical variable (mse_cat) that splits the data at a cutoff mse into two groups, one with "good" and one with "bad" mse.
5. For each event, which "side" of the detector most of the points are located on (categorical variables for x, y, and z). 

We then save all this data to use in our linear regressions.

In [175]:
# Import packages 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import math

In [176]:
# Load the precomputed data: time best fit line, PCA line, cluster counts, and mse values
data_t = pd.read_csv("C:/Users/k_vsl/Documents/Erdos/Boot Camp/ice-cube-katja/best-fit-batch10.csv")
data_pca = pd.read_csv("C:/Users/k_vsl/Documents/Erdos/Boot Camp/ice-cube-katja/batch10_pca_directions_aux_false.csv")
data_clusters = pd.read_csv("C:/Users/k_vsl/Documents/Erdos/Boot Camp/ice-cube-katja/batch10_aux_false_clusters.csv")
mses = pd.read_csv("C:/Users/k_vsl/Documents/Erdos/Boot Camp/ice-cube-katja/mse.csv")

In [177]:
event_ids = np.unique(data_t.event_id.values)

In [178]:
fixed_data = pd.DataFrame(index = event_ids, columns = ["event_id", "vx_t", "vy_t", "vz_t", "az_t_pred", "ze_t_pred", "mae_t", "mse_squared", "mse", "vx_pca", "vy_pca", "vz_pca", "az_pca_pred", "ze_pca_pred", "az_true", "ze_true"])

# Normalize Angles/Velocities: 

Standardize the angles and velocities for each best fit line so that velocities have norm 1 and the angles lie in az: [0, 2\pi] and ze: [0, \pi]. Also ensure that directions which are purely vertical have azimuth 0. 

In [158]:
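# Function that normalizes the velocity and wraps zenith and azimuth 
def normalize(vx, vy, vz, az, ze): 
    # Scale the velocity to unit length; a zero vector is left unchanged
    mod2 = np.square(vx) + np.square(vy) + np.square(vz)
    if mod2 == 0:
        mod = 1
    else: 
        mod = np.sqrt(mod2)
    vx_n = vx/mod
    vy_n = vy/mod
    vz_n = vz/mod
    if vz_n >= 1: 
        # Pointing straight up: the azimuth is undefined, so set it to 0
        ze = 0
        az = 0
    elif vz_n <= -1: 
        # Pointing straight down
        ze = math.pi
        az = 0
    else: 
        # Shift negative angles into az: [0, 2*pi] and ze: [0, pi]
        if az < 0: 
            az = az + 2 * math.pi
        if ze < 0: 
            ze = ze + math.pi
    return (vx_n, vy_n, vz_n, az, ze)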
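
The row-by-row loops below call `normalize` once per event. As a possible alternative, here is a minimal vectorized sketch of the same rules that works on whole columns at once. The name `normalize_arrays` is hypothetical and is not used elsewhere in this notebook.

```python
import numpy as np

def normalize_arrays(vx, vy, vz, az, ze):
    """Vectorized sketch of normalize(); inputs are 1-D arrays of equal length."""
    v = np.column_stack([vx, vy, vz]).astype(float)
    norm = np.linalg.norm(v, axis=1)
    norm[norm == 0] = 1.0            # leave zero vectors unchanged, as normalize() does
    v /= norm[:, None]

    az = np.array(az, dtype=float)   # np.array copies, so the inputs are not modified
    ze = np.array(ze, dtype=float)
    az[az < 0] += 2 * np.pi          # shift negative azimuths into [0, 2*pi]
    ze[ze < 0] += np.pi              # shift negative zeniths into [0, pi]

    up = v[:, 2] >= 1                # pointing straight up
    down = v[:, 2] <= -1             # pointing straight down
    ze[up] = 0.0
    ze[down] = np.pi
    az[up | down] = 0.0              # azimuth is undefined for vertical lines
    return v[:, 0], v[:, 1], v[:, 2], az, ze
```

With this, a whole file could be processed in one call, for example `normalize_arrays(data_pca.vx, data_pca.vy, data_pca.vz, data_pca.az, data_pca.ze)`, and the results assigned to `fixed_data` column by column, provided the rows are aligned with the `fixed_data` index.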

In [179]:
# Standardize pca data
i = 0
for row in data_pca.itertuples(): 
    event_id = row.event_id
    vx = row.vx
    vy = row.vy
    vz = row.vz
    az = row.az
    ze = row.ze
    vx_pca, vy_pca, vz_pca, az_pca_pred, ze_pca_pred = normalize(vx,vy,vz, az, ze)
    # Use a single .loc[row, column] call; chained indexing may write to a copy
    fixed_data.loc[event_id, "vx_pca"] = vx_pca
    fixed_data.loc[event_id, "vy_pca"] = vy_pca
    fixed_data.loc[event_id, "vz_pca"] = vz_pca
    fixed_data.loc[event_id, "az_pca_pred"] = az_pca_pred
    fixed_data.loc[event_id, "ze_pca_pred"] = ze_pca_pred
    
    i += 1
    if i % 10000 == 0: 
        print("On index " + str(i))

On index 10000
On index 20000
On index 30000
On index 40000
On index 50000
On index 60000
On index 70000
On index 80000
On index 90000
On index 100000
On index 110000
On index 120000
On index 130000
On index 140000
On index 150000
On index 160000
On index 170000
On index 180000
On index 190000
On index 200000


In [180]:
# Standardize time best fit data
i = 0
for row in data_t.itertuples():
    event_id = row.event_id
    vx = row.vx
    vy = row.vy
    vz = row.vz
    az = row.az_pred
    ze = row.ze_pred
    az_true = row.az_true
    ze_true = row.ze_true
    mae = row.mae
    vx_t, vy_t, vz_t, az_t_pred, ze_t_pred = normalize(vx, vy, vz, az, ze)
    # Use a single .loc[row, column] call; chained indexing may write to a copy
    fixed_data.loc[event_id, "vx_t"] = vx_t
    fixed_data.loc[event_id, "vy_t"] = vy_t
    fixed_data.loc[event_id, "vz_t"] = vz_t
    fixed_data.loc[event_id, "az_t_pred"] = az_t_pred
    fixed_data.loc[event_id, "ze_t_pred"] = ze_t_pred
    # The true angles and the mae are copied over unchanged
    fixed_data.loc[event_id, "az_true"] = az_true
    fixed_data.loc[event_id, "ze_true"] = ze_true
    fixed_data.loc[event_id, "mae_t"] = mae
    if i % 10000 == 0:
        print("The index is " + str(i))
    i += 1

The index is 0
The index is 10000
The index is 20000
The index is 30000
The index is 40000
The index is 50000
The index is 60000
The index is 70000
The index is 80000
The index is 90000
The index is 100000
The index is 110000
The index is 120000
The index is 130000
The index is 140000
The index is 150000
The index is 160000
The index is 170000
The index is 180000
The index is 190000


# Clustering

In [181]:
# Add number of clusters (assigned by position, so data_clusters must be in the same row order as fixed_data)
fixed_data['num_clusters'] = data_clusters.num_clusters.values

In [162]:
fixed_data.head()

Unnamed: 0,event_id,vx_t,vy_t,vz_t,az_t_pred,ze_t_pred,mae_t,mse_squared,mse,vx_pca,vy_pca,vz_pca,az_pca_pred,ze_pca_pred,az_true,ze_true,num_clusters
29296372,,0.840768,0.133669,0.524635,0.157664,1.01851,2.595228,,,0.950603,0.151131,0.271134,0.157664,1.296226,3.517532,1.616892,1.0
29296374,,0.211115,-0.119066,0.970182,5.769669,0.244814,0.045669,,,0.195457,-0.145903,0.969798,5.641945,0.246393,5.775634,0.199164,4.0
29296414,,-0.368366,-0.91147,-0.183109,4.328315,1.754944,1.12855,,,-0.358119,-0.886117,-0.294191,4.328315,1.869405,3.715111,0.769641,1.0
29296416,,0.539816,-0.666199,0.514566,5.393375,1.030295,2.349623,,,0.629551,-0.776942,0.005176,5.393375,1.565621,2.613874,1.399564,1.0
29296437,,0.10565,-0.889777,0.443999,4.830573,1.11074,0.635174,,,0.007387,-0.906071,0.423061,4.720542,1.133976,4.110431,1.054067,2.0


# Dot Product

In [182]:
# Get dot product of v_t and v_pca for each entry
dot_product = np.zeros(len(event_ids))
i = 0
for row in fixed_data.itertuples():
    vx_t = row.vx_t
    vy_t = row.vy_t
    vz_t = row.vz_t
    vx_pca = row.vx_pca
    vy_pca = row.vy_pca
    vz_pca = row.vz_pca
    dot = np.dot((vx_t, vy_t, vz_t), (vx_pca, vy_pca, vz_pca))
    dot_product[i] = np.abs(dot)
    i = i + 1
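
The loop above computes one dot product per row. A vectorized sketch of the same feature is shown here, assuming the six velocity columns are already filled in. The helper name `abs_dot_product` is hypothetical. Taking the absolute value follows the loop above, presumably because a best fit line has no preferred orientation, so v and -v describe the same line.

```python
import numpy as np
import pandas as pd

def abs_dot_product(df):
    """Row-wise |v_t . v_pca| for a frame holding both sets of velocity columns."""
    # fixed_data was created empty, so its columns have object dtype; cast to float
    v_t = df[["vx_t", "vy_t", "vz_t"]].to_numpy(dtype=float)
    v_pca = df[["vx_pca", "vy_pca", "vz_pca"]].to_numpy(dtype=float)
    # einsum "ij,ij->i" multiplies elementwise and sums over the three components
    return np.abs(np.einsum("ij,ij->i", v_t, v_pca))

demo = pd.DataFrame({
    "vx_t": [1.0, 0.0, 0.6], "vy_t": [0.0, 1.0, 0.8], "vz_t": [0.0, 0.0, 0.0],
    "vx_pca": [-1.0, 0.0, 0.6], "vy_pca": [0.0, 0.0, 0.8], "vz_pca": [0.0, 1.0, 0.0],
})
demo_dots = abs_dot_product(demo)  # antiparallel, orthogonal, and identical lines
```

Since both velocities have norm 1 after normalization, the result is the absolute cosine of the angle between the two lines: values near 1 mean the time fit and the PCA fit agree.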

In [183]:
print("The min of the dot product is " + str(dot_product.min()))
print("The max of the dot product is " + str(dot_product.max()))
print("The mean of the dot product is " + str(dot_product.mean()))

The min of the dot product is 3.399761412093555e-05
The max of the dot product is 1.0
The mean of the dot product is 0.9395973408285097


In [184]:
fixed_data['dot_product'] = dot_product

# MSE and MSE Squared

In [168]:
mses.columns

Index(['Unnamed: 0', 'mse', 'not_squared'], dtype='object')

In [185]:
# Add mse values for each event, filling columns "mse_squared", "mse" (assigned by position, so mses must be in the same row order as fixed_data)
mse_squared = mses.mse.values
mse = mses.not_squared.values
fixed_data["mse_squared"] = mse_squared
fixed_data["mse"] = mse

In [170]:
fixed_data.head(20)

Unnamed: 0,event_id,vx_t,vy_t,vz_t,az_t_pred,ze_t_pred,mae_t,mse_squared,mse,vx_pca,vy_pca,vz_pca,az_pca_pred,ze_pca_pred,az_true,ze_true,num_clusters,dot_product
29296372,,0.840768,0.133669,0.524635,0.157664,1.01851,2.595228,232206.0,481.87754,0.950603,0.151131,0.271134,0.157664,1.296226,3.517532,1.616892,1.0,0.961684
29296374,,0.211115,-0.119066,0.970182,5.769669,0.244814,0.045669,1174334.0,1083.66696,0.195457,-0.145903,0.969798,5.641945,0.246393,5.775634,0.199164,4.0,0.999517
29296414,,-0.368366,-0.91147,-0.183109,4.328315,1.754944,1.12855,15525.22,124.600239,-0.358119,-0.886117,-0.294191,4.328315,1.869405,3.715111,0.769641,1.0,0.993457
29296416,,0.539816,-0.666199,0.514566,5.393375,1.030295,2.349623,494803.3,703.422549,0.629551,-0.776942,0.005176,5.393375,1.565621,2.613874,1.399564,1.0,0.860102
29296437,,0.10565,-0.889777,0.443999,4.830573,1.11074,0.635174,4716322.0,2171.709365,0.007387,-0.906071,0.423061,4.720542,1.133976,4.110431,1.054067,2.0,0.99482
29296449,,-0.082325,0.517377,0.851788,1.728593,0.551407,1.928713,1009317.0,1004.647895,-0.083098,0.522234,0.848744,1.728593,0.557191,4.037991,1.567998,2.0,0.999983
29296484,,-0.726709,0.370045,-0.578758,2.670607,2.188002,0.110184,387794.1,622.731152,-0.79526,0.316625,-0.517021,2.762691,2.114163,2.798409,2.155779,2.0,0.994318
29296493,,0.0,-0.0,-1.0,0.0,3.141593,1.740515,243110.8,493.062638,0.0,0.0,-1.0,0.0,3.141593,4.653443,1.401078,1.0,1.0
29296496,,0.403821,0.285576,0.869123,0.615532,0.51737,0.065858,5390553.0,2321.75651,0.40626,0.296204,0.864416,0.62999,0.526808,0.590728,0.452531,3.0,0.999929
29296505,,0.184698,0.479771,0.857733,1.203312,0.539952,0.431063,2593434.0,1610.414243,0.182321,0.473594,0.861666,1.203312,0.532254,0.26674,0.280125,2.0,0.99997


In [186]:
fixed_data["event_id"] = fixed_data.index

In [187]:
fixed_data.reset_index(drop = True)

Unnamed: 0,event_id,vx_t,vy_t,vz_t,az_t_pred,ze_t_pred,mae_t,mse_squared,mse,vx_pca,vy_pca,vz_pca,az_pca_pred,ze_pca_pred,az_true,ze_true,num_clusters,dot_product
0,29296372,0.840768,0.133669,0.524635,0.157664,1.01851,2.595228,2.322060e+05,481.877540,0.950603,0.151131,0.271134,0.157664,1.296226,3.517532,1.616892,1.0,0.961684
1,29296374,0.211115,-0.119066,0.970182,5.769669,0.244814,0.045669,1.174334e+06,1083.666960,0.195457,-0.145903,0.969798,5.641945,0.246393,5.775634,0.199164,4.0,0.999517
2,29296414,-0.368366,-0.91147,-0.183109,4.328315,1.754944,1.12855,1.552522e+04,124.600239,-0.358119,-0.886117,-0.294191,4.328315,1.869405,3.715111,0.769641,1.0,0.993457
3,29296416,0.539816,-0.666199,0.514566,5.393375,1.030295,2.349623,4.948033e+05,703.422549,0.629551,-0.776942,0.005176,5.393375,1.565621,2.613874,1.399564,1.0,0.860102
4,29296437,0.10565,-0.889777,0.443999,4.830573,1.11074,0.635174,4.716322e+06,2171.709365,0.007387,-0.906071,0.423061,4.720542,1.133976,4.110431,1.054067,2.0,0.994820
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
199995,32567581,0.906319,0.244411,0.344746,0.263408,1.218828,0.139888,7.603994e+05,872.008808,0.910331,0.248632,0.330878,0.26662,1.233562,0.353034,1.329428,2.0,0.999887
199996,32567639,0.694877,0.278872,-0.662855,0.381649,2.295422,2.898224,7.919173e+05,889.897365,0.683626,0.264641,-0.680162,0.369348,2.31878,3.437011,1.079144,2.0,0.999686
199997,32567659,-0.17537,0.212403,0.961317,2.260982,0.279052,2.414195,2.484241e+06,1576.147445,-0.182435,0.220959,0.958068,2.260982,0.290614,1.112257,2.569093,2.0,0.999933
199998,32567680,0.535383,0.550844,-0.640263,0.799631,2.265636,0.14919,1.210059e+04,110.002675,0.406316,0.801088,-0.439506,1.101401,2.025845,0.898184,2.139322,1.0,0.940208


In [144]:
# to_csv returns None when given a path, so there is nothing to assign
fixed_data.to_csv('all-feature.csv', index = True) 

# Investigate cutoffs for MSE and dot product

In [None]:
# Code to figure out best cutoff for mse
cutoffs = np.arange(720,740, 1)
n = len(cutoffs)
bad_mse = np.zeros(n)
good_mse = np.zeros(n)
for i in range(0,n): 
    cutoff = cutoffs[i]
    X_bad_mse = fixed_data[fixed_data.mse > cutoff]
    X_good_mse = fixed_data[fixed_data.mse < cutoff]
    bad_mse[i] = X_bad_mse.mae_t.values.mean()
    good_mse[i] = X_good_mse.mae_t.values.mean()

In [145]:
# Graph for different cutoffs
plt.figure(figsize=(8,6))

plt.scatter(cutoffs,
               bad_mse, 
           c = 'orange')
plt.scatter(cutoffs,
               good_mse, 
           c = 'blue')

plt.xticks(fontsize=10)
plt.yticks(fontsize=10)

plt.show()


Result: A cutoff of 721 makes the most sense, as it minimizes the mae in the "bad" group.

In [188]:
# Get categorical variable for mse
i = 0
mse_cat = np.zeros(len(fixed_data))
for row in fixed_data.itertuples(): 
    mse = row.mse
    if mse <= 721:
        mse_cat[i] = np.uint8(0)
    else:
        mse_cat[i] = np.uint8(1)
    i += 1
fixed_data['mse_cat'] = mse_cat
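
The same categorical variable can also be built without a loop. This is a minimal sketch, where the function name `mse_category` and the constant `MSE_CUTOFF` are hypothetical names introduced for illustration. It writes the comparison as "not (mse <= cutoff)" so that missing values fall into category 1, as they do in the loop above.

```python
import numpy as np
import pandas as pd

MSE_CUTOFF = 721  # cutoff chosen from the search above

def mse_category(mse, cutoff=MSE_CUTOFF):
    """Return 0 where mse <= cutoff ("good") and 1 otherwise ("bad")."""
    mse = pd.to_numeric(pd.Series(mse))
    # ~(mse <= cutoff) rather than (mse > cutoff): NaN then maps to 1, like the loop
    return (~(mse <= cutoff)).astype(np.uint8)

demo_cat = mse_category([100.0, 721.0, 721.5, 2000.0])
```

One difference from the loop: this returns a `uint8` column, whereas the loop stores its values in a float array created by `np.zeros`.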

In [None]:
# Code to figure out dot_product cutoffs
cutoffs = np.arange(0,1, .01)
n = len(cutoffs)
bad_dot = np.zeros(n)
good_dot = np.zeros(n)
for i in range(0,n): 
    cutoff = cutoffs[i]
    X_bad_dot = fixed_data[fixed_data.dot_product > cutoff]
    X_good_dot = fixed_data[fixed_data.dot_product < cutoff]
    # Mean mae_t on each side of the dot product cutoff
    bad_dot[i] = X_bad_dot.mae_t.values.mean()
    good_dot[i] = X_good_dot.mae_t.values.mean()

In [None]:
# Graph for different cutoffs
plt.figure(figsize=(8,6))

plt.scatter(cutoffs,
               bad_dot, 
           c = 'orange')
plt.scatter(cutoffs,
               good_dot, 
           c = 'blue')

plt.xticks(fontsize=10)
plt.yticks(fontsize=10)

plt.show()

Result: The mse appeared to give better cutoffs than the dot product.

# Clusters as Categorical Variable

In [208]:
# Get categorical variable for num_clusters
fixed_data = fixed_data.set_index("event_id")
W = pd.get_dummies(fixed_data['num_clusters'], prefix = 'cat')
W['event_id'] = fixed_data.index.values
W = W.set_index("event_id")
data = pd.merge(fixed_data, W, left_index = True, right_index = True)

In [153]:
data.head()

Unnamed: 0,event_id_x,vx_t,vy_t,vz_t,az_t_pred,ze_t_pred,mae_t,mse_squared,mse,vx_pca,...,cat_3.0,cat_4.0,cat_5.0,cat_6.0,cat_7.0,cat_8.0,cat_9.0,cat_10.0,cat_11.0,event_id_y
29296372,29296372,0.840768,0.133669,0.524635,0.157664,1.01851,2.595228,232206.0,481.87754,0.950603,...,0,0,0,0,0,0,0,0,0,29296372
29296374,29296374,0.211115,-0.119066,0.970182,5.769669,0.244814,0.045669,1174334.0,1083.66696,0.195457,...,0,1,0,0,0,0,0,0,0,29296374
29296414,29296414,-0.368366,-0.91147,-0.183109,4.328315,1.754944,1.12855,15525.22,124.600239,-0.358119,...,0,0,0,0,0,0,0,0,0,29296414
29296416,29296416,0.539816,-0.666199,0.514566,5.393375,1.030295,2.349623,494803.3,703.422549,0.629551,...,0,0,0,0,0,0,0,0,0,29296416
29296437,29296437,0.10565,-0.889777,0.443999,4.830573,1.11074,0.635174,4716322.0,2171.709365,0.007387,...,0,0,0,0,0,0,0,0,0,29296437


# Determine where in IceCube the data points lean 

This feature is an effort to determine which side of the cone the detections are coming from. (If the detections all happen on one side of the detector, then potentially only one side of the cone is within the detector at all.)

# Find where the data points are skewing and make categorical variables
For each axis (x, y, z), determine whether more than 50% of the points have a positive coordinate. 

In [190]:
# Function that, given an event, returns which side of the detector the data is skewed to
def find_quad_event(event, geometry): 
    stats = event.reset_index()
    n = len(event)
    
    # Counts of points whose sensor has a positive x, y, or z coordinate
    id_x = 0
    id_y = 0
    id_z = 0
    for i in range(0,n):
        # sensor_id is used as a row position in the geometry table
        sensor_id = stats.loc[i, "sensor_id"]
        sensor = geometry.iloc[sensor_id]
        
        if sensor.x > 0: 
            id_x += 1
        if sensor.y > 0: 
            id_y += 1
        if sensor.z > 0: 
            id_z += 1
        
    # Fraction of points on the positive side of each axis
    per_x = id_x / n
    per_y = id_y / n
    per_z = id_z / n
    
    # Category is 1 when more than half of the points are on the positive side
    cat_x = int(per_x > .5)
    cat_y = int(per_y > .5)
    cat_z = int(per_z > .5)
        
    return (per_x, per_y, per_z, cat_x, cat_y, cat_z)

In [199]:
# Separate into quadrants (If most detections have (x > 0), (y > 0), (z > 0))

# load batch10
batch10 = pd.read_parquet("C:/Users/k_vsl/Documents/Erdos/IceCubeData/batch_10.parquet")
sensors = pd.read_csv("C:/Users/k_vsl/Documents/Erdos/IceCubeData/sensor_geometry.csv")

def find_quad(aux_incl, batch, geometry):
    if aux_incl == False: 
        batch_aux = batch[batch.auxiliary==False]
    else: 
        batch_aux = batch
    event_ids = np.unique(data_t.event_id.values)
    n = len(event_ids)
    print(n)
    
    df = pd.DataFrame(index = event_ids, columns = ['per_x', 'per_y', 'per_z', 'cat_x', 'cat_y', 'cat_z'])
    
    
    # Loop through the events and populate the data frame
    for i in range(0, n): 
        event_id = event_ids[i]
        # Passing a list keeps the result a DataFrame even if the event has a single row
        event = batch_aux.loc[[event_id]]
        stats = find_quad_event(event, geometry)
        
        df.loc[event_id, "per_x"] = stats[0]
        df.loc[event_id, "per_y"] = stats[1]
        df.loc[event_id, "per_z"] = stats[2]
        df.loc[event_id, "cat_x"] = stats[3]
        df.loc[event_id, "cat_y"] = stats[4]
        df.loc[event_id, "cat_z"] = stats[5]
        if (i % 1000 == 0): 
            print("Testing complete for " + str(i) + " events")
            
    return df
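
Looping over 200,000 events in Python is slow. Below is a sketch of a vectorized alternative that computes the same six columns with one lookup and one `groupby`. The function name `find_quad_vectorized` is hypothetical. Like `find_quad_event`, it treats `sensor_id` as a row position in the geometry table, and it assumes the batch is indexed by event id with `sensor_id` and `auxiliary` columns, as the code above implies.

```python
import numpy as np
import pandas as pd

def find_quad_vectorized(batch, geometry, aux_incl=False):
    """Per-event fraction of hits with positive x, y, z, plus 0/1 majority flags."""
    hits = batch if aux_incl else batch[~batch["auxiliary"].astype(bool)]
    # Positional lookup of sensor coordinates, one row per hit
    xyz = geometry[["x", "y", "z"]].to_numpy()[hits["sensor_id"].to_numpy()]
    positive = pd.DataFrame(
        (xyz > 0).astype(float),
        index=hits.index,
        columns=["per_x", "per_y", "per_z"],
    )
    # The mean of 0/1 indicators within an event is the fraction of positive hits
    per = positive.groupby(level=0).mean()
    cat = (per > 0.5).astype(int).rename(columns=lambda c: c.replace("per_", "cat_"))
    return per.join(cat)

# Tiny made-up example: three sensors, two events (one auxiliary hit is dropped)
demo_geometry = pd.DataFrame({
    "sensor_id": [0, 1, 2],
    "x": [1.0, -1.0, 2.0], "y": [-1.0, -2.0, 3.0], "z": [5.0, -5.0, -1.0],
})
demo_batch = pd.DataFrame(
    {"sensor_id": [0, 1, 2, 0, 1], "auxiliary": [False, False, False, False, True]},
    index=pd.Index([10, 10, 10, 11, 11], name="event_id"),
)
demo_quad = find_quad_vectorized(demo_batch, demo_geometry)
```

Unlike `find_quad`, this sketch returns a row for every event present in the batch rather than only those listed in `data_t`; the later merge on the index would then keep only the matching events.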

In [200]:
df = find_quad(False, batch10, sensors)

200000
Testing complete for 0 events
Testing complete for 1000 events
Testing complete for 2000 events
Testing complete for 3000 events
Testing complete for 4000 events
Testing complete for 5000 events
Testing complete for 6000 events
Testing complete for 7000 events
Testing complete for 8000 events
Testing complete for 9000 events
Testing complete for 10000 events
Testing complete for 11000 events
Testing complete for 12000 events
Testing complete for 13000 events
Testing complete for 14000 events
Testing complete for 15000 events
Testing complete for 16000 events
Testing complete for 17000 events
Testing complete for 18000 events
Testing complete for 19000 events
Testing complete for 20000 events
Testing complete for 21000 events
Testing complete for 22000 events
Testing complete for 23000 events
Testing complete for 24000 events
Testing complete for 25000 events
Testing complete for 26000 events
Testing complete for 27000 events
Testing complete for 28000 events
Testing complete for

In [260]:
final = pd.merge(data, df, left_index = True, right_index = True)

In [263]:
# to_csv returns None when given a path, so there is nothing to assign
final.to_csv('features-final.csv', index = True) 