#### <p style="text-align: center; font-family:cm; font-size:1.8em;">Machine Learning for Dark Matter Signal Classification</p>

<p style="font-family:cm; font-size:1.3em; text-align: center"><b>Owner:</b> Lucas Rocha Castro</p>

<p style="font-family:cm; font-size:1em;"> 
In the search for a possible Dark Matter (DM) particle, astrophysical observations are one of the pillars of indirect DM detection. When a target is observed, however, noise can contaminate the measured signal, so it is important to develop strategies that separate potential DM signals from noisy, unusable data. In this project, I established a simplified yet physically relevant separation scheme based on characteristics of the received signal, such as the energy received, the angle at which it was intercepted, and the particle's velocity, among others.
Using this scheme, I generated simulated data points with basic Python functions and trained a basic Machine Learning (ML) algorithm to evaluate whether it could distinguish the two types of signal.
It is a simple yet useful project. As a Physics undergraduate, my main objective is to study ML and other computer science topics while integrating them with my current research topics: Dark Matter, Cosmology and Astrophysics.
</p>

---

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import maxwell

#### <p style="text-align: center; font-family:cm; font-size:1.8em;">Data simulation</p>

<p style="font-family:cm; font-size:1em;"> 
I will follow a simplified model to distinguish genuine Dark Matter signals from background noise. For the <b>Dark Matter</b> signal: <br>
- $E$ (energy) follows a normal distribution peaked at ~50 GeV, a reasonable value for indirect detection channels such as gamma rays. <br>
- $v$ (velocity) follows a Maxwell-Boltzmann distribution because, according to the Standard Halo Model, the DM particles form an isotropic gas in gravitational equilibrium. Its scale is set by the Sun's velocity around the Galactic Center (GC), $v_{\odot} = 220\ \mathrm{km/s}$. <br>
- $\theta$ is the angle at which the particles are received. It is not isotropic: the Sun's orbit around the GC produces a "dark matter wind", so there is a preferred arrival angle. <br>
- $\phi$ is the azimuthal angle; it is uniformly distributed, reflecting the halo's rotational symmetry. <br>
- $r$ follows an exponentially decaying DM profile. This is not one of the profiles in current use (NFW, Einasto, Moore, etc.), but it keeps the analysis simple and computationally cheap, and serves as a convenient approximation here. <br>
<br>
In contrast, most <b>background</b> features are drawn from uniform distributions, reflecting the background's homogeneity and isotropy (the energy is the exception: it is drawn from a falling exponential spectrum). There are no peaks because the background corresponds to no specific event; it is a mix of different processes (radioactivity, cosmic rays, instrument noise and more), which adds randomness to all parameters.
</p>
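
<p style="font-family:cm; font-size:1em;">
The Maxwell-Boltzmann choice for $v$ can be illustrated directly: if each Cartesian velocity component of an isotropic gas is an independent Gaussian, the resulting speed is Maxwell-Boltzmann distributed. The sketch below is a minimal check under that assumption, not part of the original analysis. The seed, sample size and variable names are illustrative, and 220 km/s is used as the per-component dispersion only to mirror the scale value of the data-generation cell.
</p>

```python
import numpy as np

# Assumption: 220 km/s is used directly as the per-component velocity
# dispersion, mirroring the Maxwell `scale` used in this notebook.
SIGMA = 220.0  # km/s

rng = np.random.default_rng(0)  # fixed seed, chosen only for repeatability

# Isotropic gas: each Cartesian velocity component is an independent Gaussian.
components = rng.normal(loc=0.0, scale=SIGMA, size=(200_000, 3))

# The speed is the Euclidean norm of the velocity vector.
speeds = np.linalg.norm(components, axis=1)

# Analytic mean of a Maxwell-Boltzmann distribution with scale a: 2a * sqrt(2/pi).
expected_mean = 2.0 * SIGMA * np.sqrt(2.0 / np.pi)

print(f"sample mean speed:   {speeds.mean():.1f} km/s")
print(f"analytic mean speed: {expected_mean:.1f} km/s")
```

<p style="font-family:cm; font-size:1em;">
The two printed values should agree to within sampling error, which is why a single scale parameter is enough to describe the speed distribution.
</p>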

In [21]:
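# Creating the dataset
def generate_dm_events(N):
    """Simulate N Dark Matter signal events (label = 1)."""
    E = np.random.normal(loc=50, scale=10, size=N)            # energy [GeV], peaked at 50
    v = maxwell.rvs(scale=220, size=N)                        # speed [km/s], Maxwell-Boltzmann
    theta = np.random.normal(loc=np.pi/2, scale=0.4, size=N)  # arrival angle [rad], preferred direction
    phi = np.random.uniform(0, 2*np.pi, size=N)               # azimuthal angle [rad], uniform
    r = np.random.exponential(scale=8, size=N)                # radial coordinate, exponential profile

    return pd.DataFrame({"E": E,
                         "v": v,
                         "theta": theta,
                         "phi": phi,
                         "r": r,
                         "label": 1})

def generate_noise(N):
    """Simulate N background (noise) events (label = 0)."""
    E = np.random.exponential(scale=40, size=N)   # energy [GeV], falling spectrum with no peak
    v = np.random.uniform(0, 600, size=N)         # speed [km/s], uniform
    theta = np.random.uniform(0, np.pi, size=N)   # arrival angle [rad], uniform
    phi = np.random.uniform(0, 2*np.pi, size=N)   # azimuthal angle [rad], uniform
    r = np.random.uniform(0, 20, size=N)          # radial coordinate, uniform

    return pd.DataFrame({"E": E,
                         "v": v,
                         "theta": theta,
                         "phi": phi,
                         "r": r,
                         "label": 0})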
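
<p style="font-family:cm; font-size:1em;">
A quick way to confirm that the generators behave as described is to compare sample means with their analytic values. The sketch below is a hedged, self-contained check: it redraws a few of the distributions with the same parameters rather than calling the functions above, and the seed, sample size and dictionary names are illustrative assumptions.
</p>

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed, for repeatability only
N = 100_000

# Same distributions and parameters as in the generators above.
samples = {
    "E (DM, normal)": rng.normal(loc=50, scale=10, size=N),
    "r (DM, exponential)": rng.exponential(scale=8, size=N),
    "E (noise, exponential)": rng.exponential(scale=40, size=N),
    "v (noise, uniform)": rng.uniform(0, 600, size=N),
    "theta (noise, uniform)": rng.uniform(0, np.pi, size=N),
}

# Analytic means: loc for the normal, scale for the exponential,
# and (low + high) / 2 for the uniform distribution.
expected = {
    "E (DM, normal)": 50.0,
    "r (DM, exponential)": 8.0,
    "E (noise, exponential)": 40.0,
    "v (noise, uniform)": 300.0,
    "theta (noise, uniform)": np.pi / 2,
}

for name, values in samples.items():
    print(f"{name:24s} sample mean = {values.mean():8.3f}   expected = {expected[name]:8.3f}")
```

<p style="font-family:cm; font-size:1em;">
Note that the background and signal energies have similar means (40 vs. 50 GeV) but very different shapes, so the shape of each distribution carries the discriminating information, not the mean alone.
</p>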

In [22]:
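N = 5000  # events per class
dm = generate_dm_events(N)
bg = generate_noise(N)

# Combine both classes, shuffle the rows, and reset the index
data = pd.concat([dm, bg]).sample(frac=1).reset_index(drop=True)
data.to_csv("events.csv", index=False)  # save the dataset to disk

print(data)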

               E           v     theta       phi          r  label
0      54.166063  465.564682  2.056886  0.162574  46.544763      1
1     105.983932  295.214998  1.479003  1.165806   4.721523      0
2      55.466821  310.029380  0.782318  5.225734   4.604419      1
3      34.050060  153.211239  1.307326  5.048844   8.400338      1
4      54.710806  452.396770  0.822376  3.225239   1.287275      1
...          ...         ...       ...       ...        ...    ...
9995   76.494516  289.338491  0.634798  2.982687  14.055532      0
9996   29.701707  228.181610  0.948288  5.979220   4.057441      1
9997   41.718264  207.594039  2.761381  1.369940   8.272060      0
9998   42.230783  285.207607  2.972705  1.717316  18.936708      0
9999   51.192400  347.563707  1.090117  5.193784   7.262237      1

[10000 rows x 6 columns]
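
<p style="font-family:cm; font-size:1em;">
Because neither the random draws nor the shuffle are seeded, the table above changes on every run. If repeatable results are wanted, one option is to seed both steps. The sketch below is an assumption-laden illustration, not the code used in this notebook: <code>build_dataset</code> is a hypothetical helper, it keeps only two features to stay short, and the seed value is arbitrary.
</p>

```python
import numpy as np
import pandas as pd

def build_dataset(n_per_class, seed):
    """Hypothetical helper: seeded, reduced version of the dataset built above."""
    rng = np.random.default_rng(seed)  # seeded generator for all random draws
    dm = pd.DataFrame({
        "E": rng.normal(loc=50, scale=10, size=n_per_class),
        "r": rng.exponential(scale=8, size=n_per_class),
        "label": 1,
    })
    bg = pd.DataFrame({
        "E": rng.exponential(scale=40, size=n_per_class),
        "r": rng.uniform(0, 20, size=n_per_class),
        "label": 0,
    })
    # random_state makes the shuffle itself repeatable.
    return pd.concat([dm, bg]).sample(frac=1, random_state=seed).reset_index(drop=True)

first = build_dataset(1000, seed=42)
second = build_dataset(1000, seed=42)

# With the same seed, both calls should give identical rows in identical order.
print(first.equals(second))
# Each class should contribute the same number of events.
print(first["label"].value_counts().to_dict())
```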


#### <p style="text-align: center; font-family:cm; font-size:1.8em;">Machine Learning</p>
