#### <p style="text-align: center; font-family:cm; font-size:1.8em;">Machine Learning for Dark Matter Signal Classification</p>

<p style="font-family:cm; font-size:1.3em; text-align: center"><b>Owner:</b> Lucas Rocha Castro</p>

<p style="font-family:cm; font-size:1em;"> 
In the search for a possible Dark Matter (DM) particle, astrophysical observations are one of the pillars of indirect detection. When a target is observed, however, noise can contaminate the measured signal, so it is important to develop strategies that separate a potential DM signal from noisy, unusable data. In this project, I established a simplified yet physically motivated separation scheme based on characteristics of the received signal, such as the energy received, the angle at which it was intercepted, the particle's velocity, and others.
Using this scheme, I generated simulated data points with basic Python functions and trained basic Machine Learning (ML) models to evaluate whether they could tell the two types of events apart.
It is a simple yet useful project. As an undergraduate in Physics, my main objective is to study ML and other computer science topics while integrating them with my current research topics: Dark Matter, Cosmology and Astrophysics.
</p>

---

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import maxwell

#### <p style="text-align: center; font-family:cm; font-size:1.8em;">Data simulation</p>

<p style="font-family:cm; font-size:1em;"> 
I will follow a simplified model to distinguish what should be Dark Matter signals from what is background noise. <br>
    <br>
Regarding the <b>Dark Matter</b> signal: <br>
- $E$ (energy) follows a normal distribution peaked at ~50 GeV, a reasonable value for indirect detection channels such as gamma rays. <br>
- $v$ (velocity) follows a Maxwell-Boltzmann distribution because, according to the Standard Halo Model, the DM particles form an isotropic gas in gravitational equilibrium. Its scale is set by the Sun's velocity around the Galactic Center, $v_{\odot} = 220\ \mathrm{km/s}$. <br>
- $\theta$ is the angle at which the particles are received. It is not isotropic: the Sun's orbit around the GC produces a kind of "dark matter wind", so there is a preferred arrival angle. Here it is modelled as a normal distribution centered at $\pi/2$. <br>
- $\phi$ (azimuthal angle) is uniform in $[0, 2\pi)$, reflecting the rotational symmetry of the halo. <br>
- $r$ follows an exponentially decaying DM profile. This is not one of the profiles in current use (NFW, Einasto, Moore, etc.), but it keeps the analysis simple, avoids major computational cost, and is a reasonable approximation for this purpose. <br>
<br>
In contrast, most <b>background</b> variables are drawn from uniform distributions, reflecting homogeneity and isotropy: there are no peaks because the background has no specific events and is a mix of different processes (radioactivity, cosmic rays, equipment noise and more), which brings a lot of randomness to all parameters. The exception is the background energy, which is drawn from a falling exponential.
</p>
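
As a quick check of the velocity model, the sketch below evaluates the most probable and mean speeds implied by a Maxwell-Boltzmann distribution in SciPy's parameterization, $f(v) \propto v^2 e^{-v^2/(2a^2)}$, where $a$ is the `scale` argument. Assuming `scale=220`, the most probable speed is $\sqrt{2}\,a$, which is larger than 220 km/s. If 220 km/s is instead meant to be the most probable speed, the scale would be $220/\sqrt{2}$. Which convention is appropriate depends on how the 220 km/s is interpreted, so treat this as a sanity check rather than a correction.

```python
import numpy as np
from scipy.stats import maxwell

a = 220.0  # scale parameter in km/s, as passed to maxwell.rvs

# Closed-form values for f(v) proportional to v^2 exp(-v^2 / (2 a^2))
v_mode = np.sqrt(2.0) * a                # most probable speed
v_mean = 2.0 * a * np.sqrt(2.0 / np.pi)  # mean speed

# Cross-check the mean against SciPy
v_mean_scipy = maxwell.mean(scale=a)

print(f"most probable speed: {v_mode:.1f} km/s")
print(f"mean speed:          {v_mean:.1f} km/s (SciPy: {v_mean_scipy:.1f})")

# Scale that would put the most probable speed at 220 km/s instead
a_alt = 220.0 / np.sqrt(2.0)
print(f"scale for a 220 km/s mode: {a_alt:.1f} km/s")
```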

In [2]:
#Creating the dataset 
def generate_dm_events(N): 
    E = np.random.normal(loc=50, scale=10, size=N) 
    v = maxwell.rvs(scale=220, size=N) 
    theta = np.random.normal(loc=np.pi/2, scale=0.4, size=N) 
    phi = np.random.uniform(0, 2*np.pi, size=N) 
    r = np.random.exponential(scale=8, size=N) 
    
    return pd.DataFrame({ "E": E, 
                         "v": v, 
                         "theta": theta, 
                         "phi": phi, 
                         "r": r, 
                         "label": 1 })

def generate_noise(N):
    E = np.random.exponential(scale=40, size=N) 
    v = np.random.uniform(0, 600, size=N) 
    theta = np.random.uniform(0, np.pi, size=N) 
    phi = np.random.uniform(0, 2*np.pi, size=N) 
    r = np.random.uniform(0, 20, size=N) 
    
    return pd.DataFrame({ "E": E, 
                         "v": v, 
                         "theta": theta, 
                         "phi": phi, 
                         "r": r, 
                         "label": 0 })

In [3]:
N = 5000
dm = generate_dm_events(N)
bg = generate_noise(N)

data = pd.concat([dm, bg]).sample(frac=1).reset_index(drop=True) 
data.to_csv("events.csv", index=False)

print(data)

               E           v     theta       phi          r  label
0      61.025803  310.879390  0.817846  5.422012   0.287951      1
1      40.934842  479.906652  2.215813  3.034172  18.788725      0
2      28.563479  117.470710  1.151123  2.493208   1.247649      1
3      51.732675  310.756166  1.078034  4.802839   3.076940      1
4       6.636092  356.295642  0.366266  0.994590  15.430433      0
...          ...         ...       ...       ...        ...    ...
9995   44.422658  336.451925  1.243652  2.851176   0.645467      1
9996   42.221867  584.632380  1.348136  5.360222   5.895090      1
9997  100.119918  356.295004  2.314896  4.209037  10.551552      0
9998   46.495388   72.528357  2.584907  1.152684  12.228210      0
9999   47.412793  499.894128  1.462030  0.709926   1.186067      1

[10000 rows x 6 columns]
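
The generators above draw from NumPy's global random state and the shuffle has no fixed seed, so the printed table changes on every run. A minimal sketch of a reproducible variant follows. The function name `generate_dm_events_seeded` and the seed value are hypothetical, and the distributions are copied from `generate_dm_events`. It also counts how many events fall outside the physical ranges $E > 0$ and $0 \le \theta \le \pi$, which the unbounded normal distributions allow in principle.

```python
import numpy as np
import pandas as pd
from scipy.stats import maxwell

def generate_dm_events_seeded(n, seed):
    """Hypothetical seeded variant of generate_dm_events (same distributions)."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "E": rng.normal(loc=50, scale=10, size=n),
        "v": maxwell.rvs(scale=220, size=n, random_state=rng),
        "theta": rng.normal(loc=np.pi / 2, scale=0.4, size=n),
        "phi": rng.uniform(0, 2 * np.pi, size=n),
        "r": rng.exponential(scale=8, size=n),
        "label": 1,
    })

dm_a = generate_dm_events_seeded(5000, seed=13)
dm_b = generate_dm_events_seeded(5000, seed=13)
print("identical across runs:", dm_a.equals(dm_b))

# Events outside the physical ranges (possible because the normals are unbounded)
unphysical = (dm_a["E"] <= 0) | (dm_a["theta"] < 0) | (dm_a["theta"] > np.pi)
print("unphysical events:", int(unphysical.sum()))
```

Passing `random_state=...` to `DataFrame.sample` would make the shuffle reproducible in the same way.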


#### <p style="text-align: center; font-family:cm; font-size:1.8em;">Machine Learning</p>


In [4]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

In [5]:
#Creating dataset
blind_data = data.drop('label', axis = 1) #Separating labels from the data
answer = data["label"] #Answers: 0 for background, 1 for DM

#Separating train and test data
X_train, X_test, y_train, y_test = train_test_split(blind_data, 
                                                    answer, 
                                                    test_size=0.2, 
                                                    random_state=13)

#### <p style="font-family:cm; font-size:1.4em;">Logistic Regression</p>
<p style="font-family:cm; font-size:1.1em;"> 
Logistic Regression is a linear ML classifier: each feature is associated with a coefficient, and the fitted model defines a hyperplane in the 5-dimensional space formed by the 5 features ($E$, $v$, $\theta$, $\phi$, $r$). If a data point falls on one side of the hyperplane, it is classified as Dark Matter; otherwise, it is classified as background noise. <br>
<br>
At the end, the results are compared to the original labels and the accuracy is computed.
</p>
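
The decision rule described above can be sketched as follows. The toy data, seed and variable names are illustrative assumptions, not part of the analysis: the point is only that scikit-learn's prediction is equivalent to checking on which side of the hyperplane $w \cdot x + b = 0$ a point falls.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy data: 5 features, two classes shifted along every axis (illustrative only)
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 5)),
               rng.normal(1.0, 1.0, size=(200, 5))])
y = np.repeat([0, 1], 200)

clf = LogisticRegression().fit(X, y)

# Signed score relative to the hyperplane: w . x + b
scores = X @ clf.coef_.ravel() + clf.intercept_[0]

# The sigmoid turns the score into a class-1 probability
proba = 1.0 / (1.0 + np.exp(-scores))

# Class 1 if the point lies on the positive side of the hyperplane
manual_pred = (scores > 0).astype(int)

print("matches predict():      ", np.array_equal(manual_pred, clf.predict(X)))
print("matches predict_proba():", np.allclose(proba, clf.predict_proba(X)[:, 1]))
```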

In [6]:
#Putting all features on the same scale to avoid absolute value bias
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
#Reuse the training mean/std on the test set instead of refitting the scaler on it
X_test_scaled = scaler.transform(X_test)

#Applying Logistic Regression to the standardized dataset
model_LR = LogisticRegression()
model_LR.fit(X_train_scaled, y_train)

#Computing the final accuracy of DM-signal separation
accuracy_LR = model_LR.score(X_test_scaled, y_test)
print(f"Logistic Regression test accuracy: {accuracy_LR * 100:.2f}%")

Logistic Regression test accuracy: 63.90%


#### <p style="font-family:cm; font-size:1.4em;">Random Forest</p>

<p style="font-family:cm; font-size:1.1em;">
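Random Forest is a supervised ML method, used here for classification. It trains an ensemble of decision trees (here <code>n_estimators=100</code>), each on a bootstrap sample of the training data, and combines their predictions. The same 80/20 split is used for training and testing; the unscaled features are passed in, since tree splits do not depend on the scale of each feature.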
</p>
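
To make the ensemble idea concrete, the sketch below fits a small forest on toy data (an illustrative assumption, not the dataset used here) and checks that, in scikit-learn, the forest's class probabilities are the average of the probabilities predicted by its individual trees, which are exposed through the `estimators_` attribute.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Toy data: the class depends non-linearly on the first two features
X = rng.uniform(-1.0, 1.0, size=(500, 5))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 0.5).astype(int)

forest = RandomForestClassifier(n_estimators=25, random_state=13).fit(X, y)

# Average the per-tree class probabilities by hand
tree_probas = np.stack([tree.predict_proba(X) for tree in forest.estimators_])
manual_proba = tree_probas.mean(axis=0)

print("number of trees:", len(forest.estimators_))
print("matches forest.predict_proba():",
      np.allclose(manual_proba, forest.predict_proba(X)))
```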


In [7]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

#Fitting the Random Forest model to the training set
model_RF = RandomForestClassifier(n_estimators=100, random_state=13)
model_RF.fit(X_train, y_train)

#Test and check for the accuracy
y_pred_RF = model_RF.predict(X_test)
accuracy_RF = accuracy_score(y_test, y_pred_RF)

print(f"Random Forest test accuracy: {accuracy_RF * 100:.2f}%")

Random Forest test accuracy: 88.80%


#### <p style="font-family:cm; font-size:1.4em;">SVM (Support Vector Machines)</p>

<p style="font-family:cm; font-size:1.1em;">
SVM is a supervised ML method that separates the two classes by finding the boundary with the largest margin between them. With a non-linear kernel, such as the RBF kernel used here, that boundary can be curved in the original feature space, which is why SVMs can be applied to both linearly and non-linearly separable data (unlike Logistic Regression). Its predictions are then compared with the original data labels.
</p>
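
The non-linear boundary comes from the kernel. As a sketch, assuming the RBF kernel $K(x, x') = \exp(-\gamma \lVert x - x' \rVert^2)$ and scikit-learn's documented `gamma='scale'` rule, $\gamma = 1 / (n_\mathrm{features} \cdot \mathrm{Var}(X))$, the snippet below computes the kernel matrix by hand on toy data (illustrative only) and compares it with `sklearn.metrics.pairwise.rbf_kernel`.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 5))  # toy standardized features (illustrative only)

# gamma='scale' as documented for SVC: 1 / (n_features * X.var())
gamma = 1.0 / (X.shape[1] * X.var())

# Squared Euclidean distance between every pair of rows
diff = X[:, None, :] - X[None, :, :]
sq_dist = (diff ** 2).sum(axis=-1)

# RBF kernel: similarity decays with squared distance
K_manual = np.exp(-gamma * sq_dist)

print("matches rbf_kernel:", np.allclose(K_manual, rbf_kernel(X, gamma=gamma)))
print("self-similarity is 1:", np.allclose(np.diag(K_manual), 1.0))
```

Two events that are close in feature space get a kernel value near 1 and distant ones a value near 0, which is what lets the SVM draw a curved boundary around the peaked signal region.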


In [8]:
from sklearn.svm import SVC

model_SVM = SVC(kernel='rbf', C=1.0, gamma='scale')

#Again, training ML model with the same dataset
model_SVM.fit(X_train_scaled, y_train)
y_predict_SVM = model_SVM.predict(X_test_scaled)

#Testing and accuracy check
accuracy_SVM = accuracy_score(y_test, y_predict_SVM)
print(f"SVM test accuracy: {accuracy_SVM * 100:.2f}%")

SVM test accuracy: 88.90%


#### <p style="text-align: center; font-family:cm; font-size:1.8em;">Conclusion</p>


<p style="font-family:cm; font-size:1em;">
The three tested models showed clearly distinct behaviors in the task of classifying dark matter events versus background. Logistic Regression achieved significantly lower performance (63.90%), indicating that the separation between signal and background cannot be described by a simple linear decision boundary in the space of physical observables. Since every feature in this synthetic dataset is drawn independently, the non-linearity most likely comes from the shapes of the individual distributions, such as the signal peaks in $E$ and $\theta$, rather than from correlations among variables. Both the Support Vector Machine (88.90%) and the Random Forest (88.80%) reached very similar performance, with accuracies close to 90%, indicating a well-defined non-linear separation in the synthetic dataset. Their difference of 0.1 percentage points corresponds to about two events in the 2,000-event test set, so it does not establish an advantage for either model. The Random Forest's interpretability through feature importance analysis makes it particularly suitable for physical interpretation, while the SVM stands out as an effective classification tool with limited direct physical insight.
</p>
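
To judge whether the gap between the SVM and the Random Forest is meaningful, a rough estimate of the statistical uncertainty on each accuracy helps. The sketch below assumes the test events are independent, so that the number of correct predictions is binomial, and uses the 2,000-event test set implied by the 80/20 split of 10,000 events. The helper name `accuracy_std_error` is hypothetical.

```python
import math

def accuracy_std_error(accuracy, n_test):
    """Binomial standard error of an accuracy measured on n_test events."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_test)

n_test = 2000  # 20% of the 10,000 simulated events
reported = {"Logistic Regression": 0.6390,
            "Random Forest": 0.8880,
            "SVM": 0.8890}

for name, acc in reported.items():
    err = accuracy_std_error(acc, n_test)
    print(f"{name}: {acc * 100:.2f}% +/- {err * 100:.2f}%")

# Gap between the two non-linear models, in number of test events
gap_events = round((reported["SVM"] - reported["Random Forest"]) * n_test)
print("SVM vs Random Forest gap:", gap_events, "events")
```

Under these assumptions the uncertainty on each of the two non-linear models is about 0.7 percentage points, well above their 0.1-point gap.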