# MAGIC data analysis
---

# Project
---

### Description

The data are Monte Carlo generated to simulate the registration of high-energy gamma particles in a ground-based atmospheric Cherenkov gamma telescope using the imaging technique. A Cherenkov gamma telescope observes high-energy gamma rays by taking advantage of the radiation emitted by charged particles produced inside the electromagnetic showers that the gammas initiate and that develop in the atmosphere. This Cherenkov radiation (of visible to UV wavelengths) leaks through the atmosphere and is recorded in the detector, allowing reconstruction of the shower parameters. The available information consists of pulses left by the incoming Cherenkov photons on the photomultiplier tubes, which are arranged in a plane, the camera. Depending on the energy of the primary gamma, a total of a few hundred to some 10000 Cherenkov photons are collected in patterns (called the shower image), which make it possible to discriminate statistically the images caused by primary gammas (signal) from the images of hadronic showers initiated by cosmic rays in the upper atmosphere (background).

Typically, the image of a shower after some pre-processing is an elongated cluster. Its long axis is oriented towards the camera center if the shower axis is parallel to the telescope's optical axis, i.e. if the telescope axis is directed towards a point source. A principal component analysis is performed in the camera plane, which results in a correlation axis and defines an ellipse. If the depositions were distributed as a bivariate Gaussian, this would be an equidensity ellipse. The characteristic parameters of this ellipse (often called Hillas parameters) are among the image parameters that can be used for discrimination. The energy depositions are typically asymmetric along the major axis, and this asymmetry can also be used in discrimination. There are, in addition, further discriminating characteristics, like the extent of the cluster in the image plane, or the total sum of depositions. 

The Monte Carlo program was run with parameters that allow the observation of events with energies down to below 50 GeV.

### Dataset

The dataset is available at this [link](https://archive.ics.uci.edu/ml/machine-learning-databases/magic/magic04.data)

Attribute Information:

1. fLength: continuous # major axis of ellipse [mm] 
2. fWidth: continuous # minor axis of ellipse [mm] 
3. fSize: continuous # 10-log of sum of content of all pixels [in #phot] 
4. fConc: continuous # ratio of sum of two highest pixels over fSize [ratio] 
5. fConc1: continuous # ratio of highest pixel over fSize [ratio] 
6. fAsym: continuous # distance from highest pixel to center, projected onto major axis [mm] 
7. fM3Long: continuous # 3rd root of third moment along major axis [mm] 
8. fM3Trans: continuous # 3rd root of third moment along minor axis [mm] 
9. fAlpha: continuous # angle of major axis with vector to origin [deg] 
10. fDist: continuous # distance from origin to center of ellipse [mm] 
11. class: g,h # gamma (signal), hadron (background) 

g = gamma (signal): 12332 
h = hadron (background): 6688 

For technical reasons, the number of h events is underestimated. In the real data, the h class represents the majority of the events. 

Simple classification accuracy is not meaningful for this data, since classifying a background event as signal is worse than classifying a signal event as background. To compare different classifiers, an ROC curve has to be used. The relevant points on this curve are those where the probability of accepting a background event as signal is below one of the following thresholds: 0.01, 0.02, 0.05, 0.1, 0.2, depending on the required quality of the sample of accepted events for different experiments.
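The working points described above can be read off a classifier score as sketched below. This is a minimal sketch, not part of the original analysis: the function name `tpr_at_fpr`, its arguments, and the toy scores are assumptions made for illustration, and the score is assumed to be larger for more gamma-like events.

```python
import numpy as np

def tpr_at_fpr(scores, labels, max_fprs=(0.01, 0.02, 0.05, 0.1, 0.2)):
    """Gamma efficiency (TPR) at each maximum hadron acceptance (FPR).

    scores: classifier output, higher = more gamma-like (hypothetical input)
    labels: 1 for gamma (signal), 0 for hadron (background)
    Each value in max_fprs must be smaller than 1.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    bkg = np.sort(scores[labels == 0])
    sig = scores[labels == 1]
    result = {}
    for max_fpr in max_fprs:
        # Largest number of background events that may pass the cut
        n_allowed = int(np.floor(max_fpr * bkg.size))
        # Events must score strictly above this background score to be accepted
        cut = bkg[bkg.size - n_allowed - 1]
        result[max_fpr] = float(np.mean(sig > cut))
    return result

# Toy example with synthetic scores (not the MAGIC data)
rng = np.random.default_rng(0)
toy_scores = np.concatenate([rng.normal(1.0, 1.0, 1000), rng.normal(0.0, 1.0, 1000)])
toy_labels = np.concatenate([np.ones(1000, dtype=int), np.zeros(1000, dtype=int)])
for fpr, tpr in tpr_at_fpr(toy_scores, toy_labels).items():
    print(f"FPR <= {fpr:.2f}: TPR = {tpr:.3f}")
```

Comparing classifiers then amounts to comparing the gamma efficiency at the same hadron acceptance, rather than a single accuracy number.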

### References

Bock, R.K., Chilingarian, A., Gaug, M., Hakl, F., Hengstebeck, T., Jirina, M., Klaschka, J., Kotrc, E., Savicky, P., Towers, S., Vaicilius, A., Wittek, W. (2004). 
Methods for multidimensional event classification: a case study using images from a Cherenkov gamma-ray telescope. 
Nucl.Instr.Meth. A, 516, pp. 511-528. 

P. Savicky, E. Kotrc. 
Experimental Study of Leaf Confidences for Random Forest. 
Proceedings of COMPSTAT 2004, In: Computational Statistics. (Ed.: Antoch J.) - Heidelberg, Physica Verlag 2004, pp. 1767-1774. 

J. Dvorak, P. Savicky. 
Softening Splits in Decision Trees Using Simulated Annealing. 
Proceedings of ICANNGA 2007, Warsaw, (Ed.: Beliczynski et al.), Part I, LNCS 4431, pp. 721-729.

Aharonian, F. et al.
The Energy Spectrum of TeV Gamma-Rays from the Crab Nebula as measured by the HEGRA system of imaging air Cherenkov telescopes
Astrophys. J. 539 (2000) 317-324

Aleksic, J. et al.
Measurement of the Crab Nebula spectrum over three decades in energy with the MAGIC telescopes
Journal of High Energy Astrophysics, 5–6 (2015) 30-38.

### Assignments

The main goal is to distinguish signal and background events. Two approaches can be followed: 1) exploiting the physics of the detection principle, or 2) using a physics-agnostic multivariate technique, e.g. a neural network.

1. Study the features of the dataset and compare them for signal and background events
2. Study the correlations among the features of the dataset for signal and background events
3. Compute the "mean-scaled-width" and the "mean-scaled-length", i.e. rescale the "Width" and "Length" distributions by means of their mean and standard deviation. Compare them for signal and background events in the cases of little or a lot of light ("fSize")
4. Perform a Principal Component Analysis on the dataset for the signal and the background events
5. Perform a multivariate analysis, without using the parameter `fAlpha` for the classification, with the technique you prefer and evaluate its performance (e.g. in terms of Area Under the (ROC) Curve).
6. If we call "gammaness" the score that you assign when you classify an event as gamma or hadron (with a classifier that has been trained without using `fAlpha`), find the gammaness and alpha cuts that give the highest quality factor.

Since you have a dataset that does not correspond to reality (in which hadrons are much more numerous than gammas), we define the quality factor **Q** as:
    
   **Q** = epsilon_gamma / sqrt(epsilon_hadron); where
   
   epsilon_gamma = selected_gammas / total_number_of_gammas
   
   epsilon_hadron = selected_hadrons / total_number_of_hadrons
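The quality factor defined above, and a brute-force scan over gammaness and alpha cuts, could be sketched as follows. The function names, the cut convention (keep events with gammaness above the cut and alpha below the cut) and the `min_hadrons` guard are assumptions introduced for this sketch. The guard is there because **Q** diverges when no hadron survives the selection.

```python
import numpy as np

def quality_factor(is_gamma, selected):
    """Q = epsilon_gamma / sqrt(epsilon_hadron) for a boolean selection."""
    is_gamma = np.asarray(is_gamma, dtype=bool)
    selected = np.asarray(selected, dtype=bool)
    eps_gamma = selected[is_gamma].mean()
    eps_hadron = selected[~is_gamma].mean()
    if eps_hadron == 0:
        # No surviving hadron: Q is not finite
        return np.inf if eps_gamma > 0 else np.nan
    return eps_gamma / np.sqrt(eps_hadron)

def scan_cuts(gammaness, alpha, is_gamma, g_cuts, a_cuts, min_hadrons=1):
    """Return (best Q, gammaness cut, alpha cut) over a grid of cuts."""
    gammaness = np.asarray(gammaness, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    is_gamma = np.asarray(is_gamma, dtype=bool)
    best = (-np.inf, None, None)
    for g in g_cuts:
        for a in a_cuts:
            selected = (gammaness >= g) & (alpha <= a)
            # Skip working points with too few surviving hadrons
            if np.count_nonzero(selected & ~is_gamma) < min_hadrons:
                continue
            q = quality_factor(is_gamma, selected)
            if q > best[0]:
                best = (q, g, a)
    return best

# Toy example (synthetic values, not the MAGIC data)
toy_gammaness = [0.9, 0.8, 0.7, 0.2, 0.6, 0.3, 0.2, 0.1]
toy_alpha = [5, 10, 3, 40, 8, 50, 60, 70]
toy_is_gamma = [1, 1, 1, 1, 0, 0, 0, 0]
print(scan_cuts(toy_gammaness, toy_alpha, toy_is_gamma, g_cuts=[0.5], a_cuts=[20]))
```

On a small sample the best Q tends to sit where very few hadrons survive, so it is worth checking the statistical uncertainty of epsilon_hadron at the chosen cuts.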

7. Assuming that the telescope has a collection area of 10^9 cm^2 and that we are observing gamma rays between 50 GeV and 50 TeV, to what observation time does this measurement correspond? (Assume here the Crab spectrum measured by HEGRA [Aharonian, F. et al. 2000].)
8. Do the same using the MAGIC measured spectrum of the Crab Nebula [Aleksic, J. et al. 2015].
9. Plot both spectra and state whether the difference in the observation times that you obtain goes in the same direction as the difference between the spectra that you plot.
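For the observation-time questions, one possible approach is to integrate the differential spectrum over the energy range and divide the number of gammas by the collection area times the integral flux. The sketch below assumes a constant collection area and that every gamma in the energy range is recorded. The function names are hypothetical, and the power-law normalisation and index are round placeholder numbers, not the published values, which should be taken from the cited papers.

```python
import numpy as np

def power_law(E, f0, gamma, E0=1.0):
    """dN/dE = f0 * (E / E0)**(-gamma), with E and E0 in TeV."""
    return f0 * (E / E0) ** (-gamma)

def integral_flux(spectrum, e_min, e_max, n=2000):
    """Integrate a differential spectrum between e_min and e_max.

    Uses the trapezoid rule on a logarithmic energy grid, so it also works
    for spectral shapes without a closed-form integral.
    """
    E = np.logspace(np.log10(e_min), np.log10(e_max), n)
    y = spectrum(E)
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(E)))

def observation_time(n_gammas, area_cm2, spectrum, e_min, e_max):
    """T = N / (A * integral flux), in seconds if the flux is per cm^2 per s."""
    return n_gammas / (area_cm2 * integral_flux(spectrum, e_min, e_max))

# Placeholder spectrum: f0 in cm^-2 s^-1 TeV^-1 and index are illustrative only
placeholder = lambda E: power_law(E, f0=3e-11, gamma=2.6)
T = observation_time(n_gammas=12332, area_cm2=1e9, spectrum=placeholder,
                     e_min=0.05, e_max=50.0)
print(f"observation time: {T:.0f} s")
```

Passing a different callable as `spectrum` (for example a curved shape) repeats the same calculation for the second measurement.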

### Contacts

* Ruben Lopez <ruben.lopezcoto@pd.infn.it> (unfortunately no longer in Padova; his group should be contacted instead, e.g. Prof. Michele Doro)


# Implementation
---

In [None]:
#!wget https://archive.ics.uci.edu/ml/machine-learning-databases/magic/magic04.data

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import random as rnd

rnd.seed(42)    # set a seed for reproducibility

In [2]:
file_name="magic04.data"
columns=['fLength','fWidth','fSize',
        'fConc','fConc1','fAsym',
        'fM3Long','fM3Trans','fAlpha','fDist','category']
# The file has no header row: header=None keeps the first event as data
data = pd.read_csv(file_name, header=None, names=columns)
data.head(10)

Unnamed: 0,fLength,fWidth,fSize,fConc,fConc1,fAsym,fM3Long,fM3Trans,fAlpha,fDist,category
0,28.7967,16.0021,2.6449,0.3918,0.1982,27.7004,22.011,-8.2027,40.092,81.8828,g
1,31.6036,11.7235,2.5185,0.5303,0.3773,26.2722,23.8238,-9.9574,6.3609,205.261,g
2,162.052,136.031,4.0612,0.0374,0.0187,116.741,-64.858,-45.216,76.96,256.788,g
3,23.8172,9.5728,2.3385,0.6147,0.3922,27.2107,-6.4633,-7.1513,10.449,116.737,g
4,75.1362,30.9205,3.1611,0.3168,0.1832,-5.5277,28.5525,21.8393,4.648,356.462,g
5,51.624,21.1502,2.9085,0.242,0.134,50.8761,43.1887,9.8145,3.613,238.098,g
6,48.2468,17.3565,3.0332,0.2529,0.1515,8.573,38.0957,10.5868,4.792,219.087,g
7,26.7897,13.7595,2.5521,0.4236,0.2174,29.6339,20.456,-2.9292,0.812,237.134,g
8,96.2327,46.5165,4.154,0.0779,0.039,110.355,85.0486,43.1844,4.854,248.226,g
9,46.7619,15.1993,2.5786,0.3377,0.1913,24.7548,43.8771,-6.6812,7.875,102.251,g


## 1\. Features of the datasets
Study the features of the datasets and compare them for signal and background events - Luca

class: g #gamma (signal), h #hadron (background)

In [None]:
# Import the os module for operating system functionality, such as directory management
import os

# Define the name of the directory where plots will be saved
dir_name = 'punto_1_plots'

# Check if the directory already exists; if not, create it
if not os.path.exists(dir_name):
    os.makedirs(dir_name)  # Create the directory

# X contains feature columns, Y contains the category labels
X = data.loc[:, 'fLength':'fDist']  # Select feature columns from 'fLength' to 'fDist'
Y = data.loc[:, 'category']          # Select the category column

# Iterate through each feature in X
for i, col in enumerate(X.columns):
    # Split the current feature by class using boolean masks
    x_g = X.loc[Y == 'g', col]  # values for category 'g' (gamma)
    x_h = X.loc[Y == 'h', col]  # values for category 'h' (hadron)

    # Create a title for the histogram based on the current feature
    titolo = f'hist of feature {i} : {col}'
    # Create a filename for saving the histogram
    img_name = f'hist_{i}_{col}.png'

    # Overlay the histograms of the two classes (TODO: try a different number of bins)
    sns.histplot(x_h, label='hadron', alpha=0.5, color='blue', edgecolor='none')
    sns.histplot(x_g, label='gamma', alpha=0.5, color='red', edgecolor='none')

    # Set the title for the plot
    plt.title(titolo)
    plt.legend()  # Show the legend to differentiate the two categories (TODO: consider using subplots)
    # Save the plot in the specified directory with the generated filename
    plt.savefig(os.path.join(dir_name, img_name))
    plt.show()  # Display the plot

# All plots are saved in a folder named 'punto_1_plots'

## 2\. Correlations among the features
Study the correlations among the features of the datasets for signal and background events - Luca

Rescaling features to average = 0 and std = 1

In [None]:
# Standardise every feature column to mean 0 and standard deviation 1.
# The result is a new DataFrame, so the original X is left untouched
# (a plain `X_rescaled = X` would only create an alias and overwrite X).
# ddof=0 reproduces the behaviour of np.std.
X_rescaled = (X - X.mean()) / X.std(ddof=0)

In [None]:
Xg_rescaled = X_rescaled[Y == 'g']
Xh_rescaled = X_rescaled[Y == 'h']

In [None]:
# Covariance matrix of the rescaled features for the gamma events.
# The features were standardised on the full dataset, so within a single class
# these values only approximate the Pearson correlations;
# Xg_rescaled.corr() would give the exact per-class correlations.
cov = np.cov(Xg_rescaled.to_numpy(), rowvar=False)

# Set up the figure and axes for plotting
ticks = np.arange(0, 10, 1)  # Create tick positions for the axes
fig, ax = plt.subplots(figsize=(6, 6))  # Create a square figure of size 6x6 inches

# Display the covariance matrix as an image with a color map
cax = ax.imshow(cov, cmap='plasma', interpolation='nearest')

# Add a color bar to indicate the scale of the covariance values
fig.colorbar(cax)

# Configure the ticks on both axes for better visualization
ax.set_xticks(np.arange(-0.5, cov.shape[1], 1), minor=True)  # Minor ticks for x-axis
ax.set_yticks(np.arange(-0.5, cov.shape[0], 1), minor=True)  # Minor ticks for y-axis

# Add a grid for minor ticks to enhance readability
ax.grid(which='minor', color='black', linewidth=1)

# Draw a diagonal line across the matrix for visual reference
ax.plot([-0.5, cov.shape[1] - 0.5], [-0.5, cov.shape[0] - 0.5], color='black', linewidth=2, linestyle='-')

# Set the title of the plot
plt.title('GAMMA CORRELATIONS')

# Uncomment the following block to display the covariance values in each cell of the matrix
# for i in range(cov.shape[0]):
#     for j in range(cov.shape[1]):
#         plt.text(j, i, f'{cov[i, j]:.2f}', ha='center', va='center', color='white', fontsize=8)

# Display the plot
plt.show()

In [None]:
# Covariance matrix of the rescaled features for the hadron events.
# The features were standardised on the full dataset, so within a single class
# these values only approximate the Pearson correlations;
# Xh_rescaled.corr() would give the exact per-class correlations.
cov = np.cov(Xh_rescaled.to_numpy(), rowvar=False)

# Set up the figure and axes for plotting
ticks = np.arange(0, 10, 1)  # Create tick positions for the axes
fig, ax = plt.subplots(figsize=(6, 6))  # Create a square figure of size 6x6 inches

# Display the covariance matrix as an image with a color map
cax = ax.imshow(cov, cmap='plasma', interpolation='nearest')

# Add a color bar to indicate the scale of the covariance values
fig.colorbar(cax)

# Configure the ticks on both axes for better visualization
ax.set_xticks(np.arange(-0.5, cov.shape[1], 1), minor=True)  # Minor ticks for x-axis
ax.set_yticks(np.arange(-0.5, cov.shape[0], 1), minor=True)  # Minor ticks for y-axis

# Add a grid for minor ticks to enhance readability
ax.grid(which='minor', color='black', linewidth=1)

# Draw a diagonal line across the matrix for visual reference
ax.plot([-0.5, cov.shape[1] - 0.5], [-0.5, cov.shape[0] - 0.5], color='black', linewidth=2, linestyle='-')

# Set the title of the plot
plt.title('HADRON CORRELATIONS')

# Uncomment the following block to display the covariance values in each cell of the matrix
# for i in range(cov.shape[0]):
#     for j in range(cov.shape[1]):
#         plt.text(j, i, f'{cov[i, j]:.2f}', ha='center', va='center', color='white', fontsize=8)

# Display the plot
plt.show()

In [None]:
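# `cov` holds the matrix computed in the previous cell (hadron events)
print('correlation of fLength (f0) and fWidth (f1):',cov[0,1])
print('correlation of fLength (f0) and fSize (f2):',cov[0,2])
print('correlation of fConc (f3) and fConc1 (f4):',cov[3,4])
print('correlation of fSize (f2) and fConc (f3):',cov[2,3])
print('correlation of fSize (f2) and fConc1 (f4):',cov[2,4])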

Looks like we have good correlation between
- fLength and fWidth (features 0-1)
- fLength and fSize (features 0-2)
- fConc and fConc1 (features 3-4)

And good anti-correlation between
- fSize and fConc (features 2-3)
- fSize and fConc1 (features 2-4)
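The feature pairs listed above can also be extracted automatically from a per-class Pearson correlation matrix. The sketch below is illustrative only: `per_class_correlations`, `strong_pairs`, the 0.7 threshold and the toy table are assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def per_class_correlations(features, labels):
    """Pearson correlation matrix of the features, computed separately per class."""
    return {cls: features[labels == cls].corr() for cls in labels.unique()}

def strong_pairs(corr, threshold=0.7):
    """List the feature pairs whose |correlation| is at least `threshold`."""
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if abs(corr.iloc[i, j]) >= threshold:
                pairs.append((cols[i], cols[j], float(corr.iloc[i, j])))
    return pairs

# Toy table (synthetic, not the MAGIC data): b follows a, c mirrors a
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({'a': a,
                    'b': a + 0.1 * rng.normal(size=200),
                    'c': -a + 0.1 * rng.normal(size=200)})
toy_labels = pd.Series(np.where(np.arange(200) % 2 == 0, 'g', 'h'))

corrs = per_class_correlations(toy, toy_labels)
for name_1, name_2, value in strong_pairs(corrs['g']):
    print(f"{name_1} - {name_2}: {value:+.2f}")
```

Because `DataFrame.corr` normalises by the standard deviations of the selected rows, the result does not depend on how the features were rescaled beforehand.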


## 3\. Rescaling by means of the mean and standard deviation  - TO BE REVIEWED
Compute the "mean-scaled-width" and the "mean-scaled-length", i.e. rescale the "Width" and "Length" distributions by means of their mean and standard deviation. Compare them for signal and background events in the cases of little or a lot of light ("fSize") - Samu
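Besides plotting, the comparison can be summarised numerically. The sketch below is one possible reading of the task, under stated assumptions: the column is rescaled with its global mean and standard deviation, and "little" versus "a lot of" light is defined by the median of `fSize`. The function name `scaled_summary` and the toy table are hypothetical.

```python
import numpy as np
import pandas as pd

def scaled_summary(df, column, size_col='fSize', class_col='category'):
    """Mean, std and count of the rescaled column per class and light regime."""
    scaled = (df[column] - df[column].mean()) / df[column].std()
    # 'high' light: fSize above the median; 'low' light: at or below it
    light = pd.Series(np.where(df[size_col] > df[size_col].median(), 'high', 'low'),
                      index=df.index, name='light')
    return scaled.groupby([df[class_col], light]).agg(['mean', 'std', 'count'])

# Toy table (synthetic values, not the MAGIC data)
toy = pd.DataFrame({
    'fWidth': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    'fSize': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    'category': ['g', 'h'] * 4,
})
summary = scaled_summary(toy, 'fWidth')
print(summary)
```

On the real data the call would be `scaled_summary(data, 'fWidth')` and `scaled_summary(data, 'fLength')`, which gives a compact table to set next to the histograms.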

In [None]:
## Mean and standard deviation of the two columns of interest, computed directly with pandas
## (data.describe() reports the same quantities and is a useful cross-check)
mu_l=data['fLength'].mean()
std_l=data['fLength'].std()
mu_w=data['fWidth'].mean()
std_w=data['fWidth'].std()
print('Length attribute:   mean = ', mu_l, '    std = ', std_l)
print('Width attribute:    mean = ', mu_w, '   std = ', std_w)


In [None]:
#### WARNING: run all the cells above first and make sure that the DataFrame, the means and the stds are defined correctly, otherwise everything below breaks

#print(data['fLength'].head(30))
data['scaled_fLength']= (data['fLength']-mu_l)/std_l   ##add a scaled column to the DataFrame
data['scaled_fWidth']= (data['fWidth']-mu_w)/std_w  
print(data[['scaled_fLength','scaled_fWidth','fLength','fWidth']].head(30))
#print(data['scaled_fLength'].mean())
#print(data['scaled_fLength'].std())


## Plot the distributions of scaled fLength and scaled fWidth (all gamma events)
plt.figure(figsize=(8, 6))
sns.histplot(data[data['category'] == 'g']['scaled_fLength'], kde=True, color='blue', label='Scaled fLength', alpha=0.7)
sns.histplot(data[data['category'] == 'g']['scaled_fWidth'], kde=True, color='green', label='Scaled fWidth', alpha=0.7)


## Add titles and labels
plt.title("Distributions of Scaled fLength and fWidth for All Gamma Events", fontsize=14)
plt.xlabel("Scaled value", fontsize=12)
plt.ylabel("Count", fontsize=12)  # histplot shows counts by default
plt.legend(title="Feature", fontsize=10)
plt.grid(True, alpha=0.3)
plt.tight_layout()

## Show the plot
plt.show()



In [None]:
fig, axes = plt.subplots(2, 2, figsize=(16, 11))

# Rows: scaled feature (with its colour); columns: event class
features = [('scaled_fLength', 'Scaled fLength', 'blue'), ('scaled_fWidth', 'Scaled fWidth', 'green')]
classes = [('g', 'Gamma'), ('h', 'Hadron')]

for row, (col, label, color) in enumerate(features):
    for column, (cat, name) in enumerate(classes):
        ax = axes[row][column]
        sns.histplot(data.loc[data['category'] == cat, col], kde=True, color=color, ax=ax)
        ax.set_title(f"Distribution of {label} for {name} Events", fontsize=14)
        ax.set_xlabel(label, fontsize=12)
        ax.set_ylabel("Count", fontsize=12)  # histplot shows counts by default
        ax.grid(True, alpha=0.3)

In [None]:
#fig, axes = plt.subplots(1, 2, figsize=(15, 6))
#sns.histplot(data['fSize'], kde=True, color='blue', label='Low Light', alpha=0.7, ax=axes[0])
#sns.histplot(10**data['fSize'], kde=True, color='blue', label='Low Light', alpha=0.7,ax=axes[1])
#plt.show()

median_fSize = data['fSize'].median()
min_fSize = data['fSize'].min()
max_fSize = data['fSize'].max()
print("data for fSize: median= ",median_fSize, 'min= ', min_fSize, 'max= ', max_fSize)
#### Open question: should the median be computed separately for h and g? Probably not, because the amount of light
#### should be independent of the nature of the event
## It is also unclear whether the threshold should be the median or simply the midpoint of the fSize range


In [None]:
## Use the median of fSize as the threshold between low and high light

### HIGH LIGHT (fSize >= median) ############
high_light = data[data['fSize'] >= median_fSize]

fig, axes = plt.subplots(2, 2, figsize=(16, 11))

# Rows: scaled feature; columns: event class; colours indexed as [row][column]
features = [('scaled_fLength', 'Scaled fLength'), ('scaled_fWidth', 'Scaled fWidth')]
classes = [('g', 'Gamma'), ('h', 'Hadron')]
colors = [['red', 'blue'], ['green', 'green']]

for row, (col, label) in enumerate(features):
    for column, (cat, name) in enumerate(classes):
        ax = axes[row][column]
        sns.histplot(high_light.loc[high_light['category'] == cat, col], kde=True, color=colors[row][column], ax=ax)
        ax.set_title(f"Distribution of {label} for {name} Events (high light)", fontsize=14)
        ax.set_xlabel(label, fontsize=12)
        ax.set_ylabel("Count", fontsize=12)  # histplot shows counts by default
        ax.grid(True, alpha=0.3)

In [None]:
### LOW LIGHT (fSize <= median) ############
low_light = data[data['fSize'] <= median_fSize]

fig, axes = plt.subplots(2, 2, figsize=(16, 11))

# Rows: scaled feature; columns: event class; colours indexed as [row][column]
features = [('scaled_fLength', 'Scaled fLength'), ('scaled_fWidth', 'Scaled fWidth')]
classes = [('g', 'Gamma'), ('h', 'Hadron')]
colors = [['red', 'blue'], ['green', 'green']]

for row, (col, label) in enumerate(features):
    for column, (cat, name) in enumerate(classes):
        ax = axes[row][column]
        sns.histplot(low_light.loc[low_light['category'] == cat, col], kde=True, color=colors[row][column], ax=ax)
        ax.set_title(f"Distribution of {label} for {name} Events (low light)", fontsize=14)
        ax.set_xlabel(label, fontsize=12)
        ax.set_ylabel("Count", fontsize=12)  # histplot shows counts by default
        ax.grid(True, alpha=0.3)

In [None]:
### LOW LIGHT  ############

fig, axes = plt.subplots(2, 2, figsize=(14, 8))

# Plot for scaled fLength, cat=g
sns.histplot(data[data['category'] == 'g']['scaled_fLength'][data['fSize'] <= median_fSize], kde=True, color='red', ax=axes[0][0], label='low light')

# Plot for scaled fWidth, cat=g
sns.histplot(data[data['category'] == 'g']['scaled_fWidth'][data['fSize'] <= median_fSize], kde=True, color='red', ax=axes[1][0],label='low light')

# Plot for scaled fLength, cat=h
sns.histplot(data[data['category'] == 'h']['scaled_fLength'][data['fSize'] <= median_fSize], kde=True, color='red', ax=axes[0][1],label='low light')

# Plot for scaled fWidth, cat=h
sns.histplot(data[data['category'] == 'h']['scaled_fWidth'][data['fSize'] <= median_fSize], kde=True, color='red', ax=axes[1][1],label='low light')

### HIGH LIGHT  ############

# Plot for scaled fLength, cat=g
sns.histplot(data[data['category'] == 'g']['scaled_fLength'][data['fSize'] >= median_fSize], kde=True, color='blue', ax=axes[0][0],label='high light')
axes[0][0].set_title("Distribution of Scaled fLength for Gamma Events", fontsize=14)
axes[0][0].set_xlabel("Scaled fLength", fontsize=12)
axes[0][0].set_ylabel("Density", fontsize=12)
axes[0][0].grid(True, alpha=0.3)
axes[0][0].legend()

# Plot for scaled fWidth, cat=g
sns.histplot(data[data['category'] == 'g']['scaled_fWidth'][data['fSize'] >= median_fSize], kde=True, color='blue', ax=axes[1][0],label='high light')
axes[1][0].set_title("Distribution of Scaled fWidth for Gamma Events", fontsize=14)
axes[1][0].set_xlabel("Scaled fWidth", fontsize=12)
axes[1][0].set_ylabel("Density", fontsize=12)
axes[1][0].grid(True, alpha=0.3)
axes[1][0].legend()

# Plot for scaled fLength, cat=h
sns.histplot(data[data['category'] == 'h']['scaled_fLength'][data['fSize'] >= median_fSize], kde=True, color='blue', ax=axes[0][1],label='high light')
axes[0][1].set_title("Distribution of Scaled fLength for Hadron Events", fontsize=14)
axes[0][1].set_xlabel("Scaled fLength", fontsize=12)
axes[0][1].set_ylabel("Density", fontsize=12)
axes[0][1].grid(True, alpha=0.3)
axes[0][1].legend()

# Plot for scaled fWidth, cat=h
sns.histplot(data[data['category'] == 'h']['scaled_fWidth'][data['fSize'] >= median_fSize], kde=True, color='blue', ax=axes[1][1],label='high light')
axes[1][1].set_title("Distribution of Scaled fWidth for Hadron Events", fontsize=14)
axes[1][1].set_xlabel("Scaled fWidth", fontsize=12)
axes[1][1].set_ylabel("Density", fontsize=12)
axes[1][1].grid(True, alpha=0.3)
axes[1][1].legend()

plt.show()

## 4\. PCA
Perform a Principal Component Analysis on that dataset for the signal and the background events - Johnny
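This section has no implementation yet. A minimal sketch of a PCA based on the eigen-decomposition of the covariance matrix of the standardised features is given below. The function name `pca` and the toy data are assumptions made for illustration; on the real data it would be called once on the gamma rows and once on the hadron rows of the feature table.

```python
import numpy as np

def pca(X):
    """PCA of the standardised columns of X.

    Returns the eigenvalues (descending), the matching eigenvectors (as columns),
    the explained-variance fractions and the data projected on the components.
    """
    X = np.asarray(X, dtype=float)
    # Standardise each feature so that features with large units do not dominate
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    cov = np.cov(Z, rowvar=False)
    # eigh is appropriate for symmetric matrices; it returns ascending eigenvalues
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = eigvals / eigvals.sum()
    scores = Z @ eigvecs
    return eigvals, eigvecs, explained, scores

# Toy data (synthetic, not the MAGIC data): two strongly correlated features
rng = np.random.default_rng(0)
x = rng.normal(size=500)
toy = np.column_stack([x, x + 0.1 * rng.normal(size=500)])

eigvals, eigvecs, explained, scores = pca(toy)
print("explained variance fractions:", np.round(explained, 3))
```

Comparing the explained-variance fractions and the leading eigenvectors of the two classes shows whether signal and background occupy different directions in feature space.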

## 5\. Multivariate analysis (without using the parameter fAlpha)
Perform a multivariate analysis, without using the parameter fAlpha for the classification, with the technique you prefer and evaluate its performance (e.g. in terms of Area Under the (ROC) Curve)
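Whatever classifier is chosen, the area under the ROC curve can be computed directly from the scores with the rank-sum (Mann-Whitney) formula, without building the curve first. The sketch below assumes a score that is larger for more gamma-like events; the function name `roc_auc` is hypothetical.

```python
import numpy as np

def roc_auc(scores, labels):
    """Area under the ROC curve from the rank-sum statistic.

    scores: classifier output, higher = more gamma-like
    labels: True/1 for gamma (signal), False/0 for hadron (background)
    Tied scores receive their average rank.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    # Average 1-based rank of each distinct score value
    _, inverse, counts = np.unique(scores, return_inverse=True, return_counts=True)
    avg_rank = np.cumsum(counts) - (counts - 1) / 2.0
    ranks = avg_rank[inverse]
    n_pos = np.count_nonzero(labels)
    n_neg = labels.size - n_pos
    rank_sum = ranks[labels].sum()
    return float((rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg))

# Toy example: one of the four signal/background pairs is ordered wrongly
auc = roc_auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
print(auc)
```

The value equals the probability that a randomly chosen gamma receives a higher score than a randomly chosen hadron, so 0.5 corresponds to no separation and 1 to perfect separation.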

### 5.1\. Perceptron - Luca?

In [None]:
'''
new_Y = []

# gammas are +1    hadrons are -1

for i in range(len(Y)):
    if (Y.iloc[i] == 'g'):
        new_Y.append(1)
    if(Y.iloc[i] == 'h'):
        new_Y.append(-1)
        
Y_df = pd.DataFrame({'category' : new_Y})
data_rescaled_and_indexed = pd.merge(X_rescaled, Y_df,left_index=True, right_index=True)

# data_rescaled_and_indexed contains all features rescaled with avg = 0 and std = 1, and gammas = 1 and hadrons = -1
'''

In [None]:
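Y_df = pd.DataFrame({'category': Y.map({'g': 1, 'h': -1})})

# All features rescaled to avg = 0 and std = 1, plus the label column (gammas = 1, hadrons = -1)
data_rescaled_and_indexed = pd.merge(X_rescaled, Y_df, left_index=True, right_index=True)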

### Helper functions for the ML part

In [None]:
# This function shuffles the rows of the input dataset and divides them into parts_to_be_split equal parts.
# One part is returned as test_set, the remaining parts form the training_set.
# (This corresponds to a single fold of a K-fold scheme.)

def split_train_test_set(data,parts_to_be_split):
    indexes = np.arange(0,data.shape[0],1)
    rnd.shuffle(indexes)
    len_test_set = int(len(indexes) / parts_to_be_split)

    test_indexes = indexes[:len_test_set]
    train_indexes = indexes[len_test_set:]

    train_set = data.iloc[train_indexes]
    test_set = data.iloc[test_indexes]
    
    return train_set, test_set

# this function returns the column `category` as Y (target) and the other columns as X (features)
def split_X_Y(data):
    Y = data['category']
    X = data.drop('category',axis=1)

    return X,Y

Getting the right sets

In [None]:
# divide dataset in 5 parts, taking 1 as test set ( 20 % )
# remove alpha



parts_to_be_split = 5 
train_set, test_set = split_train_test_set( data_rescaled_and_indexed , parts_to_be_split )
train_set = train_set.drop('fAlpha',axis=1)
test_set = test_set.drop('fAlpha',axis=1)
X_train, Y_train = split_X_Y(train_set)
X_test, Y_test = split_X_Y(test_set)

print(np.shape(X_train),np.shape(Y_train))
print(np.shape(X_test),np.shape(Y_test))


### coding the perceptron

In [None]:
def count_errors(X_train,Y_train,current_w):
    """Return the number of misclassified samples and their positional indexes."""
    prediction = np.sign(np.dot(current_w,X_train.T))
    # np.sign gives 0 on the decision boundary, which counts as an error for labels in {-1, +1}
    indexes = np.flatnonzero(prediction != Y_train.to_numpy()).tolist()
    return len(indexes), indexes

def random_perceptron_update(X_train,Y_train,current_w):
    """Update the weights using one misclassified sample chosen at random."""
    errors, indexes = count_errors(X_train,Y_train,current_w)

    if (errors == 0):
        return current_w    # nothing is misclassified: keep the current weights

    idx = rnd.choice(indexes)
    new_w = current_w + Y_train.iloc[idx] * X_train.iloc[idx].to_numpy()

    return new_w

def evaluate_misclassification(X_train,Y_train,current_w):             # also used to evaluate the test set
    prediction = np.sign(np.dot(current_w,X_train.T))
    g_miss = 0
    h_miss = 0
    
    for i in range(len(prediction)):
        if(prediction[i] == 1 and Y_train.iloc[i] == -1):
            h_miss += 1
        if(prediction[i] == -1 and Y_train.iloc[i] == 1):
            g_miss += 1
            
    return g_miss,h_miss

def randomized_perceptron(X_train, Y_train, X_test, Y_test, max_iter):
    current_w = np.zeros(len(X_train.columns))
    new_w = current_w.copy()
    best_w = current_w.copy()
    best_errors, _ = count_errors(X_train,Y_train,current_w)
    iteration = 0
    
    g_miss_tot = []
    h_miss_tot = []
    gt_miss_tot = []
    ht_miss_tot = []
    all_w = []
    
    while (best_errors > 0 and iteration < max_iter):
        iteration += 1
        new_w = random_perceptron_update(X_train,Y_train,new_w)
        errors, _ = count_errors(X_train,Y_train,new_w)
        
        if (errors < best_errors):
            best_errors = errors
            best_w = new_w
        
        g_miss, h_miss = evaluate_misclassification(X_train,Y_train,new_w)
        tg_miss, th_miss = evaluate_misclassification(X_test,Y_test,new_w)
        g_miss_tot.append(g_miss)
        h_miss_tot.append(h_miss)
        gt_miss_tot.append(tg_miss)
        ht_miss_tot.append(th_miss)
        all_w.append(new_w)
        
    return best_w, all_w, best_errors, g_miss_tot, h_miss_tot, gt_miss_tot, ht_miss_tot

In [None]:
max_iter = 1000

best_w, all_w, best_errors, g_misses, h_misses, test_g_misses, test_h_misses = randomized_perceptron(X_train, Y_train,X_test,Y_test,max_iter)


In [None]:
tot_g = len(Y_train[Y_train == 1])
tot_h = len(Y_train[Y_train == -1])
tot_g_test = len(Y_test[Y_test == 1])
tot_h_test = len(Y_test[Y_test == -1])
ROC_X = np.zeros(len(g_misses))
ROC_Y = np.zeros(len(g_misses))
ROC_X_test = np.zeros(len(test_g_misses))
ROC_Y_test = np.zeros(len(test_g_misses))

for i in range(len(g_misses)):
    ROC_Y[i] = ((tot_g - g_misses[i]) / tot_g)
    ROC_X[i] = (h_misses[i] / tot_h)
    ROC_Y_test[i] = ((tot_g_test - test_g_misses[i]) / tot_g_test)
    ROC_X_test[i] = (test_h_misses[i] / tot_h_test)

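a = np.linspace(0,1,100)
plt.scatter(ROC_X,ROC_Y,color='blue',label='perceptron points (train)',s=5)
plt.scatter(ROC_X_test,ROC_Y_test,color='green',label='perceptron points (test)',s=5)
plt.plot(a,a,color='red',label='no skill')
plt.xlabel('hadron acceptance (false positive rate)')
plt.ylabel('gamma efficiency (true positive rate)')
plt.legend()
plt.show()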

In [None]:
'''
distances = np.zeros(len(ROC_X_test))

for i in range(len(distances)):
    distances[i] = np.sqrt(ROC_X_test[i]**2 + (1-ROC_Y_test[i])**2)
'''

distances = np.sqrt(ROC_X_test**2 + (1 - ROC_Y_test)**2)


best_index = np.argmin(distances)
best_w = all_w[best_index]

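file_name="magic04.data"
# magic04.data has no header row: pass the column names explicitly,
# otherwise the first event is consumed as a header
data = pd.read_csv(file_name, header=None,
        names=['fLength','fWidth','fSize',
        'fConc','fConc1','fAsym',
        'fM3Long','fM3Trans','fAlpha','fDist','category'])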

averages = []
stds = []

for i in range(len(best_w)):
    averages.append(np.average(data[data.columns[i]]))
    stds.append(np.std(data[data.columns[i]]))
    
best_w_rescaled = best_w * np.array(stds) + np.array(averages)
print('best W for perceptron and rescaled is:')
print(best_w_rescaled)
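
The selection of the best weights above picks the ROC point closest to the ideal corner (0, 1). As a small illustration, the sketch below applies the same criterion to hypothetical ROC coordinates; `closest_to_ideal` is an illustrative name, not a function of the notebook.

```python
import numpy as np

def closest_to_ideal(fpr, tpr):
    """Index and distance of the ROC point nearest to (FPR, TPR) = (0, 1)."""
    fpr = np.asarray(fpr, dtype=float)
    tpr = np.asarray(tpr, dtype=float)
    d = np.hypot(fpr, 1.0 - tpr)   # Euclidean distance from the corner
    i = int(np.argmin(d))
    return i, float(d[i])

# Hypothetical ROC points
idx, dist = closest_to_ideal([0.5, 0.2, 0.1], [0.9, 0.8, 0.4])
print(idx, dist)
```

This criterion weights false positives and false negatives equally; a different trade-off would call for a different distance.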

In [None]:
# Use a separate name so that the `data` DataFrame is not overwritten
roc_points = list(zip(ROC_X, ROC_Y))

N_points = 200
point_range = 1 / N_points

# Sort the points by ROC_X, then keep the maximum ROC_Y within each window of width point_range
sorted_data = sorted(roc_points, key=lambda x: x[0])  # Sort by ROC_X
unique_x = []
unique_y = []

# Keep track of the maximum ROC_Y for each unique ROC_X
current_x = sorted_data[0][0]
max_y = sorted_data[0][1]

for x, y in sorted_data:
    if x > current_x - point_range and x < current_x + point_range:
        max_y = max(max_y, y)  # Update max_y if y is higher
    else:
        unique_x.append(current_x)
        unique_y.append(max_y)
        current_x = x
        max_y = y

# Append the last pair
unique_x.append(current_x)
unique_y.append(max_y)

# Plot the results
a = np.linspace(0, 1, 100)
plt.scatter(unique_x, unique_y, color='blue', label='perceptron points', s=5)
plt.plot(a, a, color='red')  # Plot the "no skill" line
plt.legend()
plt.show()

print(len(unique_x))
print(len(ROC_X))
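
A variant of the thinning above uses fixed-width bins instead of a running window. This is only a sketch under that assumption (it does not reproduce the cell above point by point); the name `binned_envelope` and the toy points are hypothetical.

```python
import numpy as np

def binned_envelope(fpr, tpr, n_bins=200):
    """Keep the highest TPR inside each of n_bins equal-width FPR bins."""
    fpr = np.asarray(fpr, dtype=float)
    tpr = np.asarray(tpr, dtype=float)
    # Bin index of each point; clip so that fpr == 1.0 falls in the last bin
    bins = np.minimum((fpr * n_bins).astype(int), n_bins - 1)
    best = np.full(n_bins, -np.inf)
    np.maximum.at(best, bins, tpr)       # running maximum per bin
    filled = np.isfinite(best)           # drop bins that received no point
    centers = (np.arange(n_bins) + 0.5) / n_bins
    return centers[filled], best[filled]

# Toy points: two fall in the first bin, two in the second
x_env, y_env = binned_envelope([0.01, 0.02, 0.26, 0.27], [0.3, 0.5, 0.8, 0.7], n_bins=4)
print(x_env, y_env)
```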

### 5.2\. Random Forest - Luca

This section implements a random forest for classifying the data and checks its performance on the test set. The fAlpha parameter is not used.

The dataset is imported again, fAlpha is removed, and category is set to 1 for gammas and -1 for hadrons.

In [None]:
import random

file_name="magic04.data"
# magic04.data has no header row, so the column names are passed explicitly
data = pd.read_csv(file_name, header=None,
        names=['fLength','fWidth','fSize',
        'fConc','fConc1','fAsym',
        'fM3Long','fM3Trans','fAlpha','fDist','category'])

data = data.drop('fAlpha',axis=1)
# Encode the labels: gamma -> 1, hadron -> -1
data['category'] = data['category'].map({'g': 1, 'h': -1})
print(np.shape(data))
parts_to_be_split = 5

train_set, test_set = split_train_test_set(data,parts_to_be_split)

print(np.shape(train_set))
print(np.shape(test_set))

X_train, Y_train = split_X_Y(train_set)
X_test, Y_test = split_X_Y(test_set)

## Making the decision tree

Definition of the class Tree

In [None]:
class Tree:

    def __init__(self):
        self.idx = -1    # The index of the feature over which you split (no split: -1)
        self.thresh = 0  # The threshold value over which you split (<=: left, >: right)
        self.leaf = 0    # 1 if it is a leaf of class 1, -1 if it is a leaf of class -1, 0 if it is an internal node
        self.left = []   # Left subtree (empty if it is a leaf)
        self.right = []  # Right subtree (empty if it is a leaf)


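    @staticmethod
    def entropy(left, right):
        # Weighted binary entropy of a split; left and right hold the labels (+1/-1) of the two sides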
        H = 0
        tot_length = len(left) + len(right)
        left_prob = len(np.where(left > 0)[0]) / len(left)
        if (left_prob > 0):
            H -= len(left) * left_prob * np.log2(left_prob) / tot_length
        if (left_prob < 1):
            H -= len(left) * (1 - left_prob) * np.log2(1 - left_prob) / tot_length
        right_prob = len(np.where(right > 0)[0]) / len(right)
        if (right_prob > 0):
            H -= len(right) * right_prob * np.log2(right_prob) / tot_length
        if (right_prob < 1):
            H -= len(right) * (1 - right_prob) * np.log2(1 - right_prob) / tot_length
        return H

    def classify(self, x):
        # Leaves return their label; internal nodes descend (<= thresh: left, > thresh: right)
        if self.leaf != 0:
            return self.leaf
        if x[self.idx] > self.thresh:
            return self.right.classify(x)
        return self.left.classify(x)

    def id3_training(self, X, Y, max_depth, printing):
        # Check if the node is a leaf (all nodes have the same label)
        if (np.max(Y) - np.min(Y) < 1e-3):
            self.leaf = np.max(Y)
            if (printing):
                print('Remaining depth: ' + str(max_depth) + ', leaf node (all labels are the same over ' + str(len(Y)) + ' points)')
            return
        # If the maximum depth is 0, the node must be a leaf!
        if (max_depth < 1):
            if (printing):
                print('Remaining depth: ' + str(max_depth) + ', leaf node (maximum depth reached, ' + str(len(Y)) + ' points)')
            # Majority vote: count the positive labels (np.where returns a tuple, so its len is always 1)
            if (np.sum(Y > 0) > len(Y) / 2):
                self.leaf = 1
            else:
                self.leaf = -1
            return
        # Find the best split: iterate over features
        best_idx = -1
        best_thresh = -1
        best_entropy = 1e9
        # Iterate over the features and over the candidate thresholds
        for idx in range(X.shape[1]):
            # Candidate thresholds are the midpoints between consecutive (sorted) unique values
            values = np.unique(X[:,idx])
            for j in range(len(values)-1):
                thresh = (values[j]+values[j+1])/2
                # Same convention as classify: <= goes left, > goes right
                left = np.where(X[:,idx] <= thresh)[0]
                right = np.where(X[:,idx] > thresh)[0]
                if len(left) == 0 or len(right) == 0:
                    continue  # degenerate split, skip it
                H = Tree.entropy(Y[left],Y[right])
                if H < best_entropy:
                    best_entropy = H
                    best_idx = idx
                    best_thresh = thresh
        
        
        if (best_idx == -1):
            # No valid features! The points are all identical
            self.leaf = np.sign(np.sum(Y))
            if (self.leaf == 0):
                self.leaf = 1
            if (printing):
                print('Remaining depth: ' + str(max_depth) + ', leaf node (all inputs are the same over ' + str(len(Y)) + ' points)')
            return
        left_samples = np.where(X[:, best_idx] <= best_thresh)[0]
        right_samples = np.where(X[:, best_idx] > best_thresh)[0]
        if (printing):
            print('Remaining depth: ' + str(max_depth) + ', splitting ' + str(len(Y)) + ' elements into ' + str(len(left_samples)) + ' and ' + str(len(right_samples)) + ' over feature ' + str(best_idx))
        # Store the split and run the next recursive step of ID3 over the left and right subtrees
        
        self.idx = best_idx
        self.thresh = best_thresh
        self.left = Tree()
        self.right = Tree()
        self.left.id3_training(X[left_samples,:],Y[left_samples],max_depth - 1,printing)
        self.right.id3_training(X[right_samples,:],Y[right_samples],max_depth-1,printing)


    def extra_training(self, X, Y, max_depth, printing):
        # Check if the node is a leaf (all nodes have the same label)
        if (np.max(Y) - np.min(Y) < 1e-3):
            self.leaf = np.max(Y)
            if (printing):
                print('Remaining depth: ' + str(max_depth) + ', leaf node (all labels are the same over ' + str(len(Y)) + ' points)')
            return
        # If the maximum depth is 0, the node must be a leaf!
        if (max_depth < 1):
            if (printing):
                print('Remaining depth: ' + str(max_depth) + ', leaf node (maximum depth reached, ' + str(len(Y)) + ' points)')
            # Majority vote: count the positive labels (np.where returns a tuple, so its len is always 1)
            if (np.sum(Y > 0) > len(Y) / 2):
                self.leaf = 1
            else:
                self.leaf = -1
            return
        # Find the best split: iterate over features
        best_idx = -1
        best_thresh = -1
        best_entropy = 1e9
        # Iterate over the features, drawing random thresholds within each feature's range
        for idx in range(X.shape[1]):
            values = np.unique(X[:,idx])  # sorted unique values
            if len(values) < 2:
                continue  # constant feature: no split is possible
            for j in range(len(values)):
                # Random threshold in [min, max); rnd is assumed to be numpy.random, as elsewhere in the notebook
                thresh = rnd.uniform(values[0], values[-1])
                # Same convention as classify: <= goes left, > goes right
                left = np.where(X[:,idx] <= thresh)[0]
                right = np.where(X[:,idx] > thresh)[0]
                if (len(left) == 0 or len(right) == 0):
                    continue  # degenerate split, skip it
                H = Tree.entropy(Y[left],Y[right])
                
                if H < best_entropy:
                    best_entropy = H
                    best_idx = idx
                    best_thresh = thresh
        
        if (best_idx == -1):
            # No valid features! The points are all identical
            self.leaf = np.sign(np.sum(Y))
            if (self.leaf == 0):
                self.leaf = 1
            if (printing):
                print('Remaining depth: ' + str(max_depth) + ', leaf node (all inputs are the same over ' + str(len(Y)) + ' points)')
            return
        left_samples = np.where(X[:, best_idx] <= best_thresh)[0]
        right_samples = np.where(X[:, best_idx] > best_thresh)[0]
        if (printing):
            print('Remaining depth: ' + str(max_depth) + ', splitting ' + str(len(Y)) + ' elements into ' + str(len(left_samples)) + ' and ' + str(len(right_samples)) + ' over feature ' + str(best_idx))
        # Store the split and run the next recursive step over the left and right subtrees
        
        self.idx = best_idx
        self.thresh = best_thresh
        self.left = Tree()
        self.right = Tree()
        self.left.extra_training(X[left_samples,:],Y[left_samples],max_depth-1,printing)
        self.right.extra_training(X[right_samples,:],Y[right_samples],max_depth-1,printing)
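
Both training methods rank candidate splits by the weighted entropy of the two sides. The standalone sketch below reproduces that quantity on toy labels to show its two extremes; `binary_entropy` and `split_entropy` are illustrative names, and, unlike `Tree.entropy`, an empty side is tolerated here.

```python
import numpy as np

def binary_entropy(p):
    """Entropy in bits of a two-class distribution with positive fraction p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0   # a pure node carries no uncertainty
    return float(-p * np.log2(p) - (1 - p) * np.log2(1 - p))

def split_entropy(y_left, y_right):
    """Size-weighted average of the entropies of the two sides of a split."""
    n = len(y_left) + len(y_right)
    h = 0.0
    for y in (y_left, y_right):
        if len(y) > 0:
            p = float(np.mean(np.asarray(y) > 0))  # fraction of class +1
            h += len(y) / n * binary_entropy(p)
    return h

pure = split_entropy([1, 1], [-1, -1])    # perfectly separating split
mixed = split_entropy([1, -1], [1, -1])   # uninformative split
print(pure, mixed)
```

A split that separates the classes perfectly gives 0 bits, while a split leaving both sides evenly mixed gives 1 bit, which is why the training loop keeps the minimum.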

Now we use the tree class to learn a model

In [None]:
X_train_np = np.array(X_train)
Y_train_np = np.array(Y_train)
X_test_np = np.array(X_test)
Y_test_np = np.array(Y_test)

In [None]:
single_tree = Tree()
single_tree.id3_training(X_train_np, Y_train_np, 20, True)

train_loss = 0
for i in range(len(Y_train_np)):
    predicted = single_tree.classify(X_train_np[i, :])
    if (Y_train_np[i] != predicted):
        train_loss += 1 / len(Y_train_np)
print('Training loss: ' + str(train_loss))

test_loss = 0
for i in range(len(Y_test_np)):
    predicted = single_tree.classify(X_test_np[i, :])
    if (Y_test_np[i] != predicted):
        test_loss += 1 / len(Y_test_np)
print('Test loss: ' + str(test_loss))

Output (excerpt; the verbose training log is truncated in the notebook):

Remaining depth: 17, splitting 7832 elements into 7452 and 380 over feature 1
Remaining depth: 16, splitting 7452 elements into 6298 and 1154 over feature 2
Remaining depth: 15, splitting 6298 elements into 5959 and 339 over feature 1
Remaining depth: 14, splitting 5959 elements into 4890 and 1069 over feature 4
Remaining depth: 13, splitting 4890 elements into 4587 and 303 over feature 0
Remaining depth: 12, splitting 4587 elements into 3811 and 776 over feature 2
Remaining depth: 11, splitting 3811 elements into 3371 and 440 over feature 1
Remaining depth: 10, splitting 3371 elements into 3165 and 206 over feature 0
Remaining depth: 9, splitting 3165 elements into 2155 and 1010 over feature 4
Remaining depth: 8, splitting 2155 elements into 231 and 1924 over feature 1
Remaining depth: 7, splitting 231 elements into 64 and 167 over feature 4
Remaining depth: 6, splitting 64 elements into 5 and 59 over feature 4
Remaining depth: 5, leaf node (all labels are the same over 5 points)
...

A single tree overfits the training set heavily, so we implement a random forest to reduce the overfitting.

In [None]:
class Forest:

    def __init__(self, trees, n_features, n_samples, max_depth):
        self.forest = []
        self.features = n_features
        self.max_depth = max_depth
        self.samples = n_samples
        for i in range(trees):
            self.forest.append(Tree())

    def classify(self, x):
        # Sum the +1/-1 votes of all trees: the sign of the sum is the majority class (0 means a tie)
        vote = 0
        for tree in self.forest:
            vote += tree.classify(x)
            
        return vote
                

    def train(self, X, Y):
        for tree in self.forest:
            X_train, Y_train = self.bag(X, Y)
            tree.id3_training(X_train, Y_train, self.max_depth, False)
        
    def bag(self, X, Y):
        # Bagging: draw self.samples points with replacement and keep self.features random features
        features = X.shape[1]
        points = X.shape[0]
        # Bagging: sample with replacement
        bagged = random.choices(range(points), k = self.samples)
        X_bagged = X[bagged,:]
        # remove features that are not part of the tree
        selected = random.sample(range(features),self.features)
        for i in range(features):
            if i not in selected:
                X_bagged[:,i] = 0
        return X_bagged, Y[bagged]
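
The `bag` method combines two sources of randomness: a bootstrap sample of the points and a random subset of the features. The sketch below illustrates both with NumPy's `Generator` (a swap-in for the `random` module used in the class); the function names are hypothetical. It also shows that a bootstrap sample of the same size as the dataset leaves roughly a fraction 1/e of the points out.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_indices(n_points, n_samples, rng):
    """Row indices drawn uniformly with replacement."""
    return rng.integers(0, n_points, size=n_samples)

def feature_subset(n_features, k, rng):
    """k distinct feature indices drawn without replacement."""
    return rng.choice(n_features, size=k, replace=False)

n = 10_000
idx = bootstrap_indices(n, n, rng)
# Fraction of points never drawn ("out of bag"); expected to be near 1/e
oob_fraction = 1 - len(np.unique(idx)) / n
feats = feature_subset(9, 3, rng)
print(oob_fraction, feats)
```

The left-out points could serve as a validation set for each tree, although the class above does not use them.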
        

In [None]:
forest = Forest(400, 3, X_train_np.shape[0], 12)
forest.train(X_train_np, Y_train_np)

test_loss = 0
for i in range(len(Y_test_np)):
    predicted = forest.classify(X_test_np[i, :])
    if (Y_test_np[i] * predicted <= 0):
        test_loss += 1 / len(Y_test_np)
print('Test loss: ' + str(test_loss))
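
Since `Forest.classify` returns the sum of the votes rather than a label, the test above counts a point as wrong when the product of label and vote is not positive, so a tie is always an error. The sketch below shows an alternative that maps the vote to a label first; `vote_to_label`, `error_rate` and the tie-break choice are assumptions for illustration only.

```python
import numpy as np

def vote_to_label(vote, tie_label=1):
    """Map a sum of +1/-1 votes to a class label; tie_label is a hypothetical tie-break."""
    if vote == 0:
        return tie_label
    return 1 if vote > 0 else -1

def error_rate(votes, y_true, tie_label=1):
    """Fraction of points whose majority label differs from the true label."""
    labels = np.array([vote_to_label(v, tie_label) for v in votes])
    return float(np.mean(labels != np.asarray(y_true)))

# Toy vote sums and true labels (hypothetical)
votes = [12, -4, 0, -30]
y_true = [1, 1, -1, -1]
rate = error_rate(votes, y_true)
print(rate)
```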

## 6\. CUT ATTEMPT - Luca


The cut is performed for a given minimum number of hadrons that must be kept as hadrons (that is, not misclassified as gammas). For each such number, every parameter is assigned a minimum cut value, which is then optimized by maximizing Q. The dataset is split into a training and a test set: Q is computed for each parameter on the training set, and applying the resulting cuts to the test set gives one ROC point for each value of true_hadrons.
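
As a reference for what is being maximized, the sketch below assumes that Q is the quality factor commonly used in gamma/hadron separation, the gamma acceptance divided by the square root of the hadron acceptance. This definition is an assumption and should be checked against `compute_Q`; the function name and the numbers are hypothetical.

```python
import numpy as np

def quality_factor(n_accepted_g, n_accepted_h, tot_g, tot_h):
    """Q = eff_gamma / sqrt(eff_hadron), under the assumed definition."""
    eff_g = n_accepted_g / tot_g
    eff_h = n_accepted_h / tot_h
    if eff_h == 0:
        # No hadron accepted: Q diverges unless no gamma is accepted either
        return np.inf if eff_g > 0 else 0.0
    return float(eff_g / np.sqrt(eff_h))

# Toy numbers: 80% of gammas and 25% of hadrons pass the cut
q = quality_factor(80, 25, 100, 100)
print(q)
```

With this definition a cut that accepts everything has Q = 1, so values above 1 indicate an improvement over no cut.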

Functions for splitting the dataset

In [None]:
# This function splits the input dataset into parts_to_be_split parts of (almost) equal size.
# One part is returned as test_set, the remaining ones as train_set.
# The shuffled indexes (test_indexes, train_indexes) are local to the function;
# return them as well if they are needed, for example in a K-fold.

def split_train_test_set(data,parts_to_be_split):
    indexes = np.arange(0,data.shape[0],1)
    rnd.shuffle(indexes)
    len_test_set = int(len(indexes) / parts_to_be_split)

    test_indexes = indexes[:len_test_set]
    train_indexes = indexes[len_test_set:]

    train_set = data.iloc[train_indexes]
    test_set = data.iloc[test_indexes]
    
    return train_set, test_set
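
The comment above mentions K-fold as a possible use of the indexes. A minimal sketch of how the same shuffling could be extended to K folds follows; `k_fold_indices` is a hypothetical helper, not part of the notebook, and it uses NumPy's `Generator` for reproducibility.

```python
import numpy as np

def k_fold_indices(n_rows, k, seed=0):
    """Yield (train_idx, test_idx) pairs, each fold serving once as test set."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)
    folds = np.array_split(shuffled, k)   # k parts of (almost) equal size
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

splits = list(k_fold_indices(10, 5))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

Each pair can be passed to `data.iloc[...]` in the same way as `train_indexes` and `test_indexes` in the function above.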

One fifth of the dataset is picked here as the test set. The fraction can easily be changed through the second argument.

In [None]:
file_name="magic04.data"
# magic04.data has no header row, so the column names are passed explicitly
data = pd.read_csv(file_name, header=None,
        names=['fLength','fWidth','fSize',
        'fConc','fConc1','fAsym',
        'fM3Long','fM3Trans','fAlpha','fDist','category'])


train_set, test_set = split_train_test_set(data,5)

print('shape of train_set:',np.shape(train_set))
print('shape of test set:',np.shape(test_set))

Functions for finding cut values

In [None]:
# this function returns the column category as Y (target) and the other columns as X (features)
def split_X_Y(data):
    Y = data['category']
    X = data.drop('category',axis=1)

    return X,Y

In [None]:
X_train, Y_train = split_X_Y(train_set)
X_test, Y_test = split_X_Y(test_set)

print(np.shape(X_train))
print(np.shape(X_test))
print(np.shape(Y_train))
print(np.shape(Y_test))



Study each feature through the cumulative fraction of hadrons found when the events are sorted by that feature

In [None]:
def hadron_CDF(X_train,Y_train,sel_label):
    new_X = X_train[X_train.columns[sel_label]]
    new_data = np.vstack((new_X,Y_train))
    new_data_sorted = new_data[:, new_data[0].argsort()]
    # setup vars: cumulative hadron fraction and position of each event in the sorted order
    hadrons_fraction = np.zeros(new_data_sorted.shape[1])
    index = np.zeros(new_data_sorted.shape[1])
    true_hadrons = 0
    tot_h = len(Y_train[Y_train == 'h'])
    
    for i in range(new_data_sorted.shape[1]):
    
        if new_data_sorted[1,i] == 'h':
            true_hadrons += 1
        index[i] = i
        hadrons_fraction[i] = true_hadrons / tot_h
            
    return index,hadrons_fraction
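
The cumulative count in `hadron_CDF` can also be obtained with a cumulative sum. The sketch below is a standalone variant that returns the sorted feature values instead of the positions; the name `hadron_cdf_vectorized` and the toy data are hypothetical.

```python
import numpy as np

def hadron_cdf_vectorized(values, labels, hadron_label='h'):
    """Sorted feature values and cumulative fraction of hadrons up to each one."""
    values = np.asarray(values, dtype=float)
    is_h = np.asarray(labels) == hadron_label
    order = np.argsort(values, kind='stable')   # sort events by feature value
    cum_h = np.cumsum(is_h[order])              # hadrons seen so far
    return values[order], cum_h / is_h.sum()

# Toy feature column with two hadrons and two gammas
x_sorted, frac = hadron_cdf_vectorized([3.0, 1.0, 2.0, 4.0], ['h', 'g', 'h', 'g'])
print(x_sorted, frac)
```

Plotting `frac` against `x_sorted` rather than against the position would show the cut value directly on the horizontal axis.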

In [None]:
for i in range(X_train.shape[1]):
    index, hadrons_fraction = hadron_CDF(X_train,Y_train,i)
    plt.scatter(index,hadrons_fraction,s=0.3)
    title = 'feature n '+str(i)
    plt.title(title)
    plt.xlabel('event index (sorted by feature value)')
    plt.ylabel('hadron fraction')
    plt.show()

For features 0, 1 and 8 it may be better to cut starting from the high values.

In [None]:
# scan the events in ascending order of the selected feature and return the largest
# value at or below which exactly min_hadrons hadrons are found
def find_min_cut_value(X_train,Y_train,sel_label,min_hadrons):
    new_X = X_train[X_train.columns[sel_label]]
    new_data = np.vstack((new_X,Y_train))
    new_data_sorted = new_data[:, new_data[0].argsort()]
    # setup vars: the default cut is the maximum of the feature, used when the hadron count never equals min_hadrons
    true_hadrons = 0
    min_cut_value = new_data_sorted[0,new_data_sorted.shape[1]-1]
    
    for i in range(new_data_sorted.shape[1]):
        if new_data_sorted[1,i] == 'h':
            true_hadrons += 1
        if true_hadrons == min_hadrons:
            min_cut_value = new_data_sorted[0,i]
            
    return min_cut_value

def find_max_cut_value(X_train,Y_train,sel_label,min_hadrons):
    new_X = X_train[X_train.columns[sel_label]]
    new_data = np.vstack((new_X,Y_train))
    new_data_sorted = new_data[:, new_data[0].argsort()]
    
    new_data_sorted_reversed = new_data_sorted[:, ::-1]  # reversed: scan from the maximum downwards
    
    # setup vars: the default cut is the last element of the reversed array, i.e. the minimum of the feature
    true_hadrons = 0
    max_cut_value = new_data_sorted_reversed[0,new_data_sorted.shape[1]-1]
    
    for i in range(new_data_sorted_reversed.shape[1]):
        if new_data_sorted_reversed[1,i] == 'h':
            true_hadrons += 1
        if true_hadrons == min_hadrons:
            max_cut_value = new_data_sorted_reversed[0,i]
            
    return max_cut_value


def compute_Q(new_X,Y_train,cut_value):
    tot_g = len(Y_train[Y_train == 'g'])
    tot_h = len(Y_train[Y_train == 'h'])
    n_accepted_g = 0
    n_accepted_h = 0
    
    for j in range(len(Y_train)):                
        if (new_X.iloc[j] < cut_value and Y_train.iloc[j] == 'g'):
            n_accepted_g += 1
        if (new_X.iloc[j] < cut_value and Y_train.iloc[j] == 'h'):
            n_accepted_h += 1
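    # note: if no hadron passes the cut, epsilon_h is 0 and Q diverges (inf or nan)
    
    epsilon_g = n_accepted_g / tot_g
    epsilon_h = n_accepted_h / tot_h
    
    # alternative figure of merit, computed but not returned
    sigma = n_accepted_g / np.sqrt( 2 * n_accepted_h + n_accepted_g )
    
    # Q factor: gamma efficiency over the square root of the hadron efficiency
    return epsilon_g / np.sqrt(epsilon_h)
    #return sigma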
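
`compute_Q` returns the quality factor Q = epsilon_g / sqrt(epsilon_h), where each efficiency is the fraction of events of that class with a feature value below the cut. A loop-free version might look like the sketch below; `q_factor` is a hypothetical name, the toy arrays are invented, and returning `nan` when no hadron is accepted is a choice made here, not something the notebook prescribes.

```python
import numpy as np

def q_factor(values, labels, cut_value):
    """Q = eps_g / sqrt(eps_h) for the selection `values < cut_value`."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    accepted = values < cut_value
    # efficiencies: fraction of each class that passes the cut
    # (assumes both classes are present in `labels`)
    eps_g = accepted[labels == 'g'].mean()
    eps_h = accepted[labels == 'h'].mean()
    if eps_h == 0:
        # assumption of this sketch: flag the undefined case explicitly
        return float('nan')
    return float(eps_g / np.sqrt(eps_h))

# toy example: both gammas and one of the two hadrons pass the cut
q = q_factor([1.0, 2.0, 3.0, 4.0], ['g', 'g', 'h', 'h'], 3.5)
print(q)
```

Handling the zero-hadron case explicitly avoids comparing `inf` or `nan` values when scanning over cuts.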


def find_better_cut_value(X_train,Y_train,sel_label,min_cut_value):
    new_X = X_train[X_train.columns[sel_label]]
    max_cut_value = max(new_X)
    
    # number of cut values at which Q is evaluated; raise it for a finer scan (at the cost of speed)
    N_iter = 10
    step = abs(max_cut_value - min_cut_value) / N_iter
    best_Q = compute_Q(new_X,Y_train,min_cut_value)
    best_cut = min_cut_value
    
    for i in range(N_iter):
        cut = min_cut_value + i * step
        Q = compute_Q(new_X,Y_train,cut)
        
        # keep the cut with the highest Q seen so far (no minimum Q threshold is applied here)
        if( Q > best_Q ):
            best_Q = Q.copy()
            best_cut = cut
            
    return best_Q, best_cut


def find_min_ROC_coordinates(X_train,Y_train,X_test,Y_test,min_hadrons):
    X_ROC = 0
    Y_ROC = 0
    X_ROC_BETTER = 0
    Y_ROC_BETTER = 0
    
    min_cut_values = []
    better_cut_values = []
    
    for i in range(len(X_train.columns)):
        min_cut_value = find_max_cut_value(X_train,Y_train,i,min_hadrons)
        #min_cut_value = find_min_cut_value(X_train,Y_train,i,min_hadrons)
        #Q, better_cut_value = find_better_cut_value(X_train,Y_train,i,min_cut_value)
        #this line uses the find better cut value function
        
        min_cut_values.append(min_cut_value)
        #better_cut_values.append(better_cut_value)
        #print(i,'min_cut:',min_cut_value)
        #print(i,'better_cut:',better_cut_value)
        
    n_accepted_g = 0
    n_accepted_h = 0
    #n_accepted_gb = 0
    #n_accepted_hb = 0
    #print(min_cut_values)
    #print(min_cut_values)
    #print(better_cut_values)
    for j in range(len(Y_test)):
        acc = 1
        accb = 1
        for i in range(len(min_cut_values)):
            #if (X_test[X_test.columns[i]].iloc[j] > min_cut_values[i] and i == 0 or i == 1 or i == 8):
            #    acc = 0
            if (X_test[X_test.columns[i]].iloc[j] > min_cut_values[i]):
                acc = 0
            #if (X_test[X_test.columns[i]].iloc[j] < better_cut_values[i]):
            #    accb = 0
        #print(acc)  
        if (acc == 1 and Y_test.iloc[j] == 'g'):
            n_accepted_g += 1
        if (acc == 1 and Y_test.iloc[j] == 'h'):
            n_accepted_h += 1
            
        #if (accb == 1 and Y_test.iloc[j] == 'g'):
        #    n_accepted_gb += 1
        #if (accb == 1 and Y_test.iloc[j] == 'h'):
        #    n_accepted_hb += 1
            
    
    tot_g = len(Y_test[Y_test == 'g'])
    tot_h = len(Y_test[Y_test == 'h'])
    '''print(n_accepted_g,tot_g)
    print(n_accepted_h,tot_h)'''
    Y_ROC = n_accepted_g / tot_g
    X_ROC = n_accepted_h / tot_h
    #Y_ROC_BETTER = n_accepted_gb / tot_g
    #X_ROC_BETTER = n_accepted_hb / tot_h
    
    return X_ROC, Y_ROC, min_cut_values                  #, X_ROC_BETTER, Y_ROC_BETTER
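
The nested loop over test events and features accepts an event only if none of its features exceeds the corresponding cut. Assuming that reading, the accepted fractions can be computed with a boolean mask, as in this sketch; `roc_point_from_upper_cuts` and the toy data are hypothetical and only illustrate the selection logic.

```python
import numpy as np
import pandas as pd

def roc_point_from_upper_cuts(X, y, cut_values):
    """Accept an event only if every feature is <= its cut value."""
    cuts = np.asarray(cut_values, dtype=float)
    # broadcast the per-feature cuts over all rows, then require all features to pass
    accepted = (X.to_numpy(dtype=float) <= cuts).all(axis=1)
    labels = y.to_numpy()
    # fractions of accepted gammas (ROC y) and accepted hadrons (ROC x)
    tpr = accepted[labels == 'g'].mean()
    fpr = accepted[labels == 'h'].mean()
    return float(fpr), float(tpr)

# toy data: two features, two gammas followed by two hadrons
X_toy = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [10, 20, 30, 40]})
y_toy = pd.Series(['g', 'g', 'h', 'h'])
fpr, tpr = roc_point_from_upper_cuts(X_toy, y_toy, [3, 35])
print(fpr, tpr)
```

With the cuts `[3, 35]` both gammas and the first hadron are accepted, while the last event fails on both features.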




Worst possible selection results

In [None]:
tot_h = len(Y_test[Y_test == 'h'])

print(np.shape(X_train))
print(np.shape(Y_train))
print(np.shape(X_test))
print(np.shape(Y_test))
ROC_CURVE_X = []
ROC_CURVE_Y = []
BETTER_ROC_CURVE_X = []
BETTER_ROC_CURVE_Y = []
WS = []

for i in range(tot_h):
    
    # one ROC point is computed per value of i; looping over a subset of i reduces the points and improves performance
    
    X_ROC, Y_ROC, min_cut_values = find_min_ROC_coordinates(X_train,Y_train,X_test,Y_test,i)
    ROC_CURVE_X.append(X_ROC)
    ROC_CURVE_Y.append(Y_ROC)
    WS.append(min_cut_values)
        #BETTER_ROC_CURVE_X.append(XB_ROC)
        #BETTER_ROC_CURVE_Y.append(YB_ROC)

    #print(i,X_ROC,Y_ROC)
        #print(i,XB_ROC,YB_ROC,'\n')

In [None]:
x=np.linspace(0,1,1000)
print(len(ROC_CURVE_X))

distances = np.zeros(len(WS))
for i in range(len(WS)):
    distances[i] = np.sqrt((ROC_CURVE_X[i])**2 + (1-ROC_CURVE_Y[i])**2)
    
best_index = np.argmin(distances)
best_w = WS[best_index]
print('best parameters are:')
print(best_w)

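plt.scatter(ROC_CURVE_X,ROC_CURVE_Y,s=3,label='hard cuts on the features')
plt.scatter(x,x,s=0.3,color='red',alpha=0.75,label='diagonal (random selection)')
plt.xlabel('accepted hadron fraction')
plt.ylabel('accepted gamma fraction')
plt.legend()
plt.show()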
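
The best working point above is the ROC point closest to the ideal corner (0, 1), i.e. no accepted hadrons and all gammas accepted. The same selection can be written in vectorized form; the sketch below uses a hypothetical helper `closest_to_ideal` and invented ROC values purely for illustration.

```python
import numpy as np

def closest_to_ideal(fpr, tpr):
    """Index of the ROC point closest to the ideal corner (0, 1)."""
    fpr = np.asarray(fpr, dtype=float)
    tpr = np.asarray(tpr, dtype=float)
    # Euclidean distance of every ROC point from (fpr, tpr) = (0, 1)
    distances = np.hypot(fpr, 1.0 - tpr)
    best = int(np.argmin(distances))
    return best, float(distances[best])

# invented ROC points, for illustration only
best, dist = closest_to_ideal([0.0, 0.1, 0.5, 1.0], [0.0, 0.8, 0.9, 1.0])
print(best, dist)
```

The returned index can then be used to look up the corresponding set of cut values, as done with `WS[best_index]` above.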