Like probably many of you, I was *very* surprised when I saw the final results this morning. As the discussion [here](https://www.kaggle.com/c/seti-breakthrough-listen/discussion/266385) has shown, the key to the large gap between the winning solution and second place was what the winning team called "magic #2". This method allowed them to remove the background noise completely for at least some samples. Achieving this would be a kind of Holy Grail for experimental physics, so I really wanted to understand the method. At first, I couldn't wrap my head around their explanation, so I started to reimplement it on my own.
 
DISCLAIMER: even though I think that in the end it is some kind of leak, the winning team's solution is absolutely brilliant. I would have never figured this out on my own. And of course, I'm not 100% sure that this is all there is to "magic #2". This is just my attempt at an explanation.
 
By now, the following is clear to me:

The key is that the true measured background signals used in this competition span a far wider frequency range than 256 bins. The organizers therefore cut the data into bands of 256 bins each and used these as samples. However, for reasons unclear to me, they used *severely* overlapping bands. As a result, for most samples in the datasets, part of the same data can be found again in another sample. If only one of the two contains a needle, subtracting the other removes the shared background perfectly wherever they overlap.
The only remaining problem is that each image in a sample is independently normalized to mean=0, std=1. This needs to be reversed.
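
To see why the per-image normalization is not fatal, here is a minimal sketch on synthetic data (not the competition data). The names `wide`, `image_norm`, `column_norm_2d`, `win_a`, `win_b` and the 60-bin offset are made up for illustration, and the sketch assumes the organizers' normalization is a plain affine rescaling of each image. Under that assumption, re-normalizing every frequency column to zero mean and unit L2 norm cancels the per-image offset and scale, so two overlapping windows agree again where they overlap.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic wide-band background: 273 time steps x 400 frequency bins,
# with a frequency-dependent offset so that different windows get
# different per-image means.
wide = rng.normal(size=(273, 400)) + np.linspace(0.0, 5.0, 400)

def image_norm(img):
    """Normalize a whole image to mean 0, std 1 (assumed preprocessing)."""
    return (img - img.mean()) / img.std()

def column_norm_2d(img):
    """Normalize each frequency column (along time) to mean 0, L2 norm 1."""
    c = img - img.mean(axis=0, keepdims=True)
    return c / np.sqrt((c ** 2).sum(axis=0, keepdims=True))

# Two 256-bin windows cut from the same background, 60 bins apart,
# each normalized independently.
shift = 60
win_a = image_norm(wide[:, 0:256])
win_b = image_norm(wide[:, shift:shift + 256])

# In the overlap the image-normalized windows disagree ...
raw_diff = np.abs(win_a[:, shift:] - win_b[:, :256 - shift]).max()

# ... but after column normalization they agree up to rounding error.
col_diff = np.abs(
    column_norm_2d(win_a)[:, shift:] - column_norm_2d(win_b)[:, :256 - shift]
).max()

print("max difference after image normalization: ", raw_diff)
print("max difference after column normalization:", col_diff)
```

The reason is simple algebra: for a column `c` and image statistics `m`, `s`, subtracting the column mean from `(c - m) / s` removes `m`, and dividing by the column's L2 norm removes `s`.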

I will demonstrate this for one sample:

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
from tqdm import tqdm

In [None]:
INPUT_DIR = "../input/seti-breakthrough-listen"
df_train = pd.read_csv(os.path.join(INPUT_DIR, "train_labels.csv"))
df_subm = pd.read_csv(os.path.join(INPUT_DIR, "sample_submission.csv"))
df_train_pos = df_train[df_train.target == 1]

Helper functions for loading and displaying samples:

In [None]:
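def load_example(idx):
    ''' loads sample idx from train, falling back to test if it is not found there '''
    try:
        x = np.load(os.path.join(INPUT_DIR, "train", idx[0], idx + ".npy"))
    except FileNotFoundError:  # catch only the missing-file case, not every error
        x = np.load(os.path.join(INPUT_DIR, "test",  idx[0], idx + ".npy"))
    return x.astype(np.float32)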


def show_example(x, p=0):
    
    x = x.reshape(-1, 256)
    x = np.clip(x, np.percentile(x, p), np.percentile(x, 100-p)) # clip for better contrast
    
    fig, ax = plt.subplots()
    fig.set_size_inches(18, 3)
    ax.set_xticks(np.arange(1,6)*273)
    ax.set_yticks([])
    ax.grid(True)
    ax.imshow(x.T, aspect="auto", cmap="Greys")

Let's take a look at a sample that contains a needle with a very low SNR, definitely not visible by eye: the positive sample at index 5 (zero-based) in train. Time runs horizontally, frequency vertically:

In [None]:
idx = df_train_pos.id.iloc[5]
print(idx)
x0 = load_example(idx)
show_example(x0, 1)

My "renormalization" is similar to, but probably not identical to, what the winning team did. I start with one normalized reference column and check whether this column can be found somewhere else in the data:

In [None]:
def column_norm(x):
    ''' normalizes each column of 273 pixels in each of the 6 images in sample x separately 
    to mean==0 and L2 norm==1 '''
    
    xn  = x - np.mean(x, axis=1, keepdims=True) # remove mean
    xn /= np.sqrt(np.sum(xn**2, axis=1, keepdims=True)) # normalize
    return xn  

def find_similar_column(col0, xn1):
    ''' calculates cosine similarity between the normalized reference column col0 and all columns
        in the column-normalized sample xn1 '''
    
    return np.array([ [ np.dot(col0, col1) for col1 in img.T ] for img in xn1 ])
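
As a side note, the nested list comprehension in `find_similar_column` can also be written as a single `np.einsum` call. The sketch below is an alternative under the assumption that samples have shape `(6, 273, 256)`; the name `find_similar_column_vec` and the random test data are made up for illustration. It also shows how `np.unravel_index` turns the flat `argmax` into an (image, column) pair.

```python
import numpy as np

def find_similar_column_vec(col0, xn1):
    """Dot product of col0 (273,) with every column of xn1 (6, 273, 256).

    Returns an array of shape (6, 256). For column-normalized input this
    is the cosine similarity.
    """
    return np.einsum("t,itf->if", col0, xn1)

# Random stand-in for a column-normalized sample.
rng = np.random.default_rng(0)
x = rng.normal(size=(6, 273, 256))
xn = x - x.mean(axis=1, keepdims=True)
xn /= np.sqrt((xn ** 2).sum(axis=1, keepdims=True))

col0 = xn[2, :, 100]  # reference column taken from image 2, column 100
cs = find_similar_column_vec(col0, xn)

# Same result as the loop-based version.
loop = np.array([[np.dot(col0, col1) for col1 in img.T] for img in xn])

# Convert the flat argmax into (image index, column index).
img_i, col_i = np.unravel_index(cs.argmax(), cs.shape)
print(cs.shape, img_i, col_i)
```

Because the similarity array has one row per image and one entry per frequency column, the column of the best match is the second component returned by `np.unravel_index`.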

Here is the column-normalized version of the sample from above. Still no needle visible:

In [None]:
xn0 = column_norm(x0)
show_example(xn0, 1)

Now let's search the dataset for a copy of normalized column 128 in the first image of our sample. I use column 128 because it's in the middle of the image. To keep the running time short, I search only a small subset of the full 60,000-sample dataset, since I know from earlier runs where the match will be ;).

In [None]:
for idx in tqdm(df_train.id.iloc[50500:51000]):
    xn1 = column_norm(load_example(idx))
    cs = find_similar_column(xn0[0,:,128], xn1)
    csm = cs.max()
    if csm > 0.9:
        print(idx, csm, cs.argmax() % 256)  # cs has shape (6, 256), so flat index % 256 is the column

Found a perfect match: sample d87cb86179e9d02. Let's look at it. It is identical to 004933b94083be2 where they overlap; it's only shifted by 128-68=60 frequency bins:

In [None]:
x1 = load_example("d87cb86179e9d02")
show_example(x1, 1)

Normalize it and subtract from the normalized sample xn0. Increasing the contrast a little, the needles are easily visible. The noise has been perfectly removed because of the leak. A 2-layer MLP could find these needles:

In [None]:
xn1 = column_norm(x1)
show_example(xn0 - np.roll(xn1, 128-68, axis=2), 10)
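
One detail about the subtraction: `np.roll` wraps around, so the first 60 columns of the rolled array come from the far end of the other sample and are not part of the true overlap. The sketch below illustrates this on synthetic 2D data (one image instead of six, so the frequency axis is `axis=1`). The names `a`, `b`, `column_norm_2d` and the injected rectangle are made up for illustration and are not the competition data.

```python
import numpy as np

rng = np.random.default_rng(1)
shift = 60            # same offset as in the example above
n_t, n_f = 273, 256   # time steps, frequency bins

def column_norm_2d(img):
    """Normalize each frequency column (along time) to mean 0, L2 norm 1."""
    c = img - img.mean(axis=0, keepdims=True)
    return c / np.sqrt((c ** 2).sum(axis=0, keepdims=True))

# Two overlapping windows cut from one wide synthetic background.
wide = rng.normal(size=(n_t, n_f + shift))
a = wide[:, :n_f].copy()           # plays the role of x0
b = wide[:, shift:shift + n_f]     # plays the role of x1

# Weak synthetic "needle": a rectangle in time and frequency, only in a.
a[100:200, 150:153] += 0.5

# Align b to a and subtract.
diff = column_norm_2d(a) - np.roll(column_norm_2d(b), shift, axis=1)

# Columns below `shift` were wrapped around by np.roll: no real overlap there.
diff[:, :shift] = 0.0

# Per-column residual: close to zero except in the columns touched by the needle.
resid = np.abs(diff).max(axis=0)
needle_cols = np.flatnonzero(resid > 1e-6)
print(needle_cols)
```

In this synthetic setup, only the columns touched by the rectangle survive the subtraction inside the overlap; everything else cancels to rounding error.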

Increasing the contrast some more, numerical rounding errors become visible everywhere except where the artificial signals were inserted, obviously as rectangles:

In [None]:
show_example(xn0 - np.roll(xn1, 128-68, axis=2), 40)

That's all for now. Hope this clears up the magic a bit.
Again: congrats to Team Watercooled. Brilliant detective work!
