# Introduction
The dataset contains Monte Carlo–simulated events from an atmospheric Cherenkov telescope, representing light patterns produced by high-energy particles interacting in the Earth’s atmosphere. Each event is described by 10 numerical features (Hillas parameters) that characterize the shape and intensity of the recorded shower image.

The goal is binary classification: distinguish gamma-ray events (signal) from hadronic cosmic-ray events (background) using these geometric and brightness features.

Dataset link: https://archive.ics.uci.edu/dataset/159/magic+gamma+telescope

# Data Exploration

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import RandomOverSampler


In [None]:
df = pd.read_csv('/Users/venkatchandan/Desktop/ML_Projects/CosmicClassifier/magic+gamma+telescope/magic04.data')
df.head(5)

In [None]:
df.columns

The file has no header row, so pandas used the first data row as column names. We have to supply the column names manually, taken from the dataset link. There are 11 columns (10 features and the target):

| Variable Name | Role    | Type       | Description                                                      | Units | Missing Values |
| ------------- | ------- | ---------- | ---------------------------------------------------------------- | ----- | -------------- |
| fLength       | Feature | Continuous | Major axis of ellipse                                            | mm    | No             |
| fWidth        | Feature | Continuous | Minor axis of ellipse                                            | mm    | No             |
| fSize         | Feature | Continuous | 10-log of sum of content of all pixels                           | #phot | No             |
| fConc         | Feature | Continuous | Ratio of sum of two highest pixels over fSize                    | —     | No             |
| fConc1        | Feature | Continuous | Ratio of highest pixel over fSize                                | —     | No             |
| fAsym         | Feature | Continuous | Distance from highest pixel to center, projected onto major axis | —     | No             |
| fM3Long       | Feature | Continuous | 3rd root of third moment along major axis                        | mm    | No             |
| fM3Trans      | Feature | Continuous | 3rd root of third moment along minor axis                        | mm    | No             |
| fAlpha        | Feature | Continuous | Angle of major axis with vector to origin                        | deg   | No             |
| fDist         | Feature | Continuous | Distance from origin to center of ellipse                        | mm    | No             |
| class         | Target  | Binary     | gamma (signal), hadron (background)                              | -     | No             | 


In [None]:
cols = ['fLength','fWidth','fSize','fConc','fConc1','fAsym','fM3Long','fM3Trans','fAlpha','fDist','class']
df1 = pd.read_csv('/Users/venkatchandan/Desktop/ML_Projects/CosmicClassifier/magic+gamma+telescope/magic04.data',names = cols)
df1.head(5)

In [None]:
df1['class'].value_counts()

In [None]:
df1.shape

The class column has two values, g and h. Most machine learning models expect numeric targets, so we convert them to 1s and 0s.
There are several ways to do this (apply only one of them):
```python
# Option 1: explicit mapping
data['class'] = data['class'].map({'g': 1, 'h': 0})

# Option 2: value replacement
data['class'] = data['class'].replace({'g': 1, 'h': 0})

# Option 3: LabelEncoder (assigns codes alphabetically, so g -> 0 and h -> 1)
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
data['class'] = le.fit_transform(data['class'])

# Option 4: conditional assignment
data['class'] = np.where(data['class'] == 'g', 1, 0)
```

We use a one-line alternative: compare the column with 'g' and cast the boolean result to an integer.



In [None]:
df1['class']= (df1['class'] == 'g').astype(int)
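
As a sanity check, here is a minimal sketch on a toy Series (the values are made up, not taken from the dataset) comparing the encodings. It suggests that the mapping, `np.where`, and boolean-cast approaches agree, while `LabelEncoder` assigns codes in sorted order and therefore gives the opposite labels.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the 'class' column (hypothetical values)
toy = pd.Series(['g', 'h', 'g', 'h', 'h'], name='class')

via_map = toy.map({'g': 1, 'h': 0})
via_where = pd.Series(np.where(toy == 'g', 1, 0))
via_bool = (toy == 'g').astype(int)

# LabelEncoder sorts the labels first, so 'g' becomes 0 and 'h' becomes 1
via_le = pd.Series(LabelEncoder().fit_transform(toy))

print(via_map.tolist())   # [1, 0, 1, 0, 0]
print(via_bool.tolist())  # [1, 0, 1, 0, 0]
print(via_le.tolist())    # [0, 1, 0, 1, 1]
```

If gamma should be the positive class (1), the explicit approaches are the safer choice.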

In [None]:
df1.head(5)

In [None]:
df1['class'].value_counts()

In [None]:

for i in cols[:-1]:
    # Overlay the per-class distributions; alpha keeps both histograms visible
    plt.hist(df1[df1['class']==1][i],color='blue',label = 'gamma',alpha=0.7,density=True)
    plt.hist(df1[df1['class']==0][i],color='red',label = 'hadron',alpha=0.7,density=True)
    plt.title(i)
    plt.ylabel('Probability density')
    plt.xlabel(i)
    plt.legend()
    plt.show()

# Pre-Processing
#### Observations:

1. The classes are imbalanced in favour of the target 'gamma'. We will probably need to rebalance them.
2. The features are on very different scales, which can cause problems for some models, so we need to scale them.
3. The histograms above allow a per-feature comparison of the gamma and hadron distributions.


#### Next Steps:

1. Divide the dataset into training, validation and test sets (0-60%, 60-80%, 80-100%).
2. Separate the input variables from the output variable.
3. Scale all the feature columns using StandardScaler().


##### 1. Train, Validation, Test Set

In [None]:
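# Shuffle the rows, then cut at 60% and 80% to get the train/validation/test splits
train, valid, test = np.split(df1.sample(frac = 1),[int(0.6*len(df1)),int(0.8*len(df1))])

# The same split could also be produced with sklearn.model_selection.train_test_split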
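
The 60/20/20 split can also be sketched with `train_test_split`, called twice. This is an illustration on a toy DataFrame, not the notebook's actual split; the `stratify` and `random_state` arguments are additions assumed here to keep class proportions and make the result reproducible.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 65 of class 1 and 35 of class 0 (hypothetical values)
toy = pd.DataFrame({
    'x': np.arange(100, dtype=float),
    'class': np.repeat([1, 0], [65, 35]),
})

# First cut: 60% train, 40% held out
train_part, rest = train_test_split(
    toy, test_size=0.4, stratify=toy['class'], random_state=0)

# Second cut: split the held-out 40% evenly into validation and test
valid_part, test_part = train_test_split(
    rest, test_size=0.5, stratify=rest['class'], random_state=0)

print(len(train_part), len(valid_part), len(test_part))  # 60 20 20
```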


##### 2. Separating Input and Output Features and 3. Scaling

In [None]:
def scale_dataset(dataframe):
    X = dataframe[dataframe.columns[:-1]].values
    y = dataframe[dataframe.columns[-1]].values

    scaler = StandardScaler()
    X = scaler.fit_transform(X)
    data = np.hstack((X,np.reshape(y,(-1,1))))

    return data, X,y


In [None]:
print('Training Dataset')
print(len(train[train['class']==1]))
print(len(train[train['class']==0]))

print('Test Dataset')
print(len(test[test['class']==1]))
print(len(test[test['class']==0]))

print('Validation Dataset')
print(len(valid[valid['class']==1]))
print(len(valid[valid['class']==0]))

#### Further Observations:
1. As expected, the training set itself is heavily imbalanced. This could bias the model towards the majority class.
2. We only fix the imbalance in the training set, not in the test and validation sets. Those sets should show how the model performs on new data, which may itself be imbalanced.
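
To make the idea concrete, here is a NumPy-only sketch of what random oversampling does conceptually: minority-class rows are resampled with replacement until both classes have the same count. It is an illustration on toy arrays, not imblearn's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 7 samples of class 1 and 3 of class 0 (hypothetical values)
X = np.arange(20).reshape(10, 2)
y = np.array([1] * 7 + [0] * 3)

# Draw extra minority rows (with replacement) to match the majority count
minority_idx = np.flatnonzero(y == 0)
n_extra = (y == 1).sum() - minority_idx.size
extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)

X_bal = np.vstack([X, X[extra_idx]])
y_bal = np.concatenate([y, y[extra_idx]])

print(np.bincount(y_bal))  # [7 7]
```

The added rows are exact duplicates of existing minority samples, so no new information is created; the classes are only reweighted.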

In [None]:
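def scale_dataset_oversample(dataframe,oversample = False):
    # All columns except the last are features; the last column is the target
    X = dataframe[dataframe.columns[:-1]].values
    y = dataframe[dataframe.columns[-1]].values

    if oversample:
        # Resample minority-class rows until both classes are the same size
        ros = RandomOverSampler()
        X,y = ros.fit_resample(X,y)

    # Standardize each feature to zero mean and unit variance
    scaler = StandardScaler()
    X = scaler.fit_transform(X)
    # Recombine scaled features and target into a single 2-D array
    data = np.hstack((X,np.reshape(y,(-1,1))))

    return data, X,y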

In [None]:
train, X_train, y_train = scale_dataset_oversample(train,oversample=True)
train.shape


In [None]:
print(sum(y_train == 1))
print(sum(y_train == 0))

In [None]:
# As mentioned above, we do not oversample the test and validation sets.
test, X_test, y_test = scale_dataset_oversample(test,oversample=False)
valid, X_valid, y_valid = scale_dataset_oversample(valid,oversample=False)
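
Note that the cells above fit a separate StandardScaler on each split. A common alternative is to fit the scaler once on the training features and reuse it for the validation and test sets, so that all three share the same transformation. A minimal sketch of that pattern on toy arrays (the names and values are hypothetical, not the notebook's data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Toy feature matrices standing in for the train and test splits
X_train_toy = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
X_test_toy = rng.normal(loc=50.0, scale=10.0, size=(50, 3))

# Learn the mean and standard deviation from the training data only
scaler = StandardScaler().fit(X_train_toy)

# Apply the same training statistics to both splits
X_train_scaled = scaler.transform(X_train_toy)
X_test_scaled = scaler.transform(X_test_toy)

print(X_train_scaled.mean(axis=0).round(6))  # close to 0 for every feature
print(X_test_scaled.mean(axis=0).round(3))   # near 0, but not exactly 0
```

With this pattern the test set never influences the scaling parameters, which keeps the evaluation closer to how the model would see unseen data.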

In [None]:
import joblib

joblib.dump(X_train, "X_train.pkl")
joblib.dump(X_test, "X_test.pkl")
joblib.dump(X_valid, "X_valid.pkl")
joblib.dump(y_train, "y_train.pkl")
joblib.dump(y_test, "y_test.pkl")
joblib.dump(y_valid, "y_valid.pkl")
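
The saved arrays can be read back later with `joblib.load`. Here is a small round-trip sketch using a temporary directory and a made-up array and file name, so it does not touch the files written above:

```python
import os
import tempfile

import joblib
import numpy as np

arr = np.arange(6).reshape(3, 2)  # hypothetical stand-in for X_train

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "X_example.pkl")  # illustrative file name
    joblib.dump(arr, path)
    restored = joblib.load(path)

print(np.array_equal(arr, restored))  # True
```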


