# ACT CW2 Q2

__Q2 Objective:__

Process dataset using a Neural Network.

__Plan__

We have a binary classification problem, so we are sorting data into one of two classes based on the input values.


__Workflow__ (from Source 1)

1. Get data ready (transform into tensors)
2. Build or pick a pretrained model to suit your problem
3. Pick a loss function and optimiser
4. Build a training loop
(Loop through steps 2-4)
5. Fit the model to the data and make a prediction
6. Evaluate the model
7. Improve through experimentation
8. Save and reload the trained model


__Links__ 

(move to the bottom in a bit)

* https://www.learnpytorch.io/02_pytorch_classification/
* https://docs.pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html
* 


More general links:

* https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.values.html
* https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dtypes.html
* https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.select_dtypes.html
* https://www.geeksforgeeks.org/deep-learning/converting-a-pandas-dataframe-to-a-pytorch-tensor/
* https://saturncloud.io/blog/how-do-i-convert-a-pandas-dataframe-to-a-pytorch-tensor/
* https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html
* https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html
* https://www.geeksforgeeks.org/pandas/adding-new-column-to-existing-dataframe-in-pandas/
* 

### Import Libraries

In [129]:
# import necessary libraries

import numpy as np # numpy
import pandas as pd # for dataframes
import matplotlib.pyplot as plt # for plotting
# import seaborn as sns # for data visualisation

# machine learning libraries and models

import torch
from sklearn.model_selection import train_test_split

## 
> ## Preparing the Dataset

Load in the classification data and prepare it as a PyTorch tensor.

To start with, we want our data set (all relevant features) to be contained in a pandas dataframe and ready to solve.

We will then move onto importing it into a PyTorch tensor.

Therefore, the first part of this notebook will be very similar to Q1.

After this, convert to a PyTorch tensor and use sklearn to perform a reproducible train/test split.

### Load in the data

Load in the data file (currently in csv file), add the data to a pandas dataframe, and inspect the dataframe to check that it has loaded in correctly.

In [110]:
# import the data file
# data is in the file "psion_upsilon.csv"

# read in the file and store it in a pandas dataframe
rawdata_df = pd.read_csv('psion_upsilon.csv')

In [112]:
# check the dataframe has loaded in correctly

# print the shape of the dataframe
print("Size of dataframe:")
print(rawdata_df.shape) # 40,000 rows x 22 columns
print(f"\n")

# print the column headings
print("Column headings:")
print(rawdata_df.columns)
print(f"\n")

# check top few rows of data
print("Top few rows of dataframe:")
print(rawdata_df.head)

Size of dataframe:
(40000, 22)


Column headings:
Index(['Unnamed: 0', 'Run', 'Event', 'type1', 'E1', 'px1', 'py1', 'pz1', 'pt1',
       'eta1', 'phi1', 'Q1', 'type2', 'E2', 'px2', 'py2', 'pz2', 'pt2', 'eta2',
       'phi2', 'Q2', 'class'],
      dtype='object')


Top few rows of dataframe:
<bound method NDFrame.head of        Unnamed: 0     Run       Event type1       E1      px1      py1  \
0               0  167807  1101779335     G   5.8830   3.6101   2.3476   
1               1  167102   286049970     G  13.7492  -1.9921  11.8723   
2               2  160957   190693726     G   8.5523   1.4623   4.5666   
3               3  166033   518823971     G   7.5224   0.1682  -3.5854   
4               4  163589    49913789     G  12.4683   8.1310  -1.6633   
...           ...     ...         ...   ...      ...      ...      ...   
39995       39995  166033   460063858     G  21.1411  -9.3928  10.8857   
39996       39996  173692   573648364     G  29.4819  16.1461  21.9823   
39997       

### Remove Unnecessary Data Columns

The first 3 columns contain index, run number, and event number. These are parameters used when recording and storing the data points, but they are not physical properties and do not have any effect on the type of particle created. Therefore, they are irrelevant to determining output class.

In [113]:
# remove the first 3 columns
# by defining a new dataframe
# that only contains the relevant variables

# drop columns 0, 1, and 2
# so keep all rows, and columns 3-21
# (df.iloc indices are start inclusive and end exclusive)
reduced_df = rawdata_df.iloc[:, 3:]

# check the properites of the new dataframe are what we want
# print out the new shape and column headings

print("Size of reduced dataframe:")
print(reduced_df.shape)
# 40,000 rows x 19 columns

print(f"\n")
print("Column headings:")
print(reduced_df.columns)

# this is what we expect
# we have removed 'Unnamed (index)', 'Run', and 'Event'
# and kept all 40,000 samples

Size of reduced dataframe:
(40000, 19)


Column headings:
Index(['type1', 'E1', 'px1', 'py1', 'pz1', 'pt1', 'eta1', 'phi1', 'Q1',
       'type2', 'E2', 'px2', 'py2', 'pz2', 'pt2', 'eta2', 'phi2', 'Q2',
       'class'],
      dtype='object')


Now that we have removed the first 3 columns, we can look at the other variables.

To build a neural network in PyTorch, the data should be stored in a tensor, which can only contain numerical values. We can check the type of all of the columns in the reduced dataframe.

In [114]:
# get the datatypes of each column in the dataframe
get_types = reduced_df.dtypes

# print the data types of each column
print(get_types)
# and how many of each type there are
print("\nSummary of column types:")
print(get_types.value_counts())

type1     object
E1       float64
px1      float64
py1      float64
pz1      float64
pt1      float64
eta1     float64
phi1     float64
Q1         int64
type2     object
E2       float64
px2      float64
py2      float64
pz2      float64
pt2      float64
eta2     float64
phi2     float64
Q2         int64
class     object
dtype: object

Summary of column types:
float64    14
object      3
int64       2
Name: count, dtype: int64


Of the 19 columns in the reduced dataframe, 16 of them are numeric (14 float and 2 int). These values can all be converted to a single numeric type (e.g. float64) upon transformation into a tensor. However, there are also 3 'object'-type variables that are unable to be processed by the neural network.

The first 2 non-numeric columns are 'type1' and 'type2', which tell us the types of the first and second muon respectively, whether they are a global muon (G), or a tracker muon (T). The other non-numeric variable is 'class', which tells us the type of the meson created by the collision, either J/psi or Upsilon.

There are a few options for dealing with these variables:

1) Label encoding or assigning category codes. These methods assign an integer value to every possible string value. This is a very efficient method of converting non-numeric variables but can easily be misinterpreted by the neural network. Some machine learning models, including neural networks, will treat integer-encoded data as numeric, and make assumptions that are not true for categorical data, leading to incorrect assumptions and correlations. This can be avoid by using embedding layers in the neural network.
2) One-hot encoding. This avoids the pitfalls of integer-encoding, but increases the dimensionality of the problem. Certain methods of one-hot encoding also produce Boolean values instead of numeric ones, which also cannot be interpreted by the neural network.
3) Deleting the non-numeric values. This avoids the problem of encoding the data, but can lead to important variables being neglected and the neural network not learning the correct connections. If we decide not to include a certain variable, we must have a good reason for doing so.

In deciding which of these method to use, we should consider the following points:

* 'type1' contains only 1 unique value. As we found in Q1, 'type1' contained only 'G' values, so 100% of these particles were global muons. This variable can be immediately ignored as there can be no correlation between 'type1' and 'class'.
* Proportion of each value in 'type2'. This column contains both 'G' and 'T' values, although around 90% of the samples have a 'G' value. This is a relatively small portion of the samples, especially when the division of class among the samples is a 50/50 split.
* Feature importance. In Q1, after training a decision tree, the relative importance of each feature was plotted. The importance of 'type2' was one of the lowest, at only 0.2%. We should weigh up the possibility of introducing extra parameters with how useful the result may or may not be to the resulting neural network.
* The type of each ingoing muon ('G' or 'T') is not a physical property of the muon, or a reflection of the physics governing it. It is a parameter denoting how the muon was detected, either locally or globally. As this only refers to the equipment used to detect the particle, and not information about the particle itself, it should not have any effect.
* 'class' is our target variable, and cannot be ignored or deleted, so it must be encoded using one of the above methods.
* The issues with integer-encoding don't apply to target data. The neural network treats labels differently to input data, because it only uses them to compare prediction with reality, not to learn connections. This means the 'class' variable can be label encoded with no loss of accuracy.

Based on the factors above, the most efficient choice for this dataset is to remove the 'type1' and 'type2' columns entirely, and to encode the 'class' data using integer labels.

In [115]:
# remove 'type1' and 'type2' from the dataset
features_df = reduced_df.drop(['type1', 'type2'], axis=1)

# the resultant dataframe contains only the relevant physical features

# check that these 2 columns have been removed
print("New shape:")
print(features_df.shape) # 40000 x 17
print("\nNew column headings:")
print(features_df.columns) # type columns no longer present

New shape:
(40000, 17)

New column headings:
Index(['E1', 'px1', 'py1', 'pz1', 'pt1', 'eta1', 'phi1', 'Q1', 'E2', 'px2',
       'py2', 'pz2', 'pt2', 'eta2', 'phi2', 'Q2', 'class'],
      dtype='object')


### Label encoding the target data

The target data needs to be converted to numeric data before being transformed into a tensor. Because this is just the label array, we can take the simple approach of mapping each category to an integer without the risk of training the neural network incorrectly.

In [122]:
# assign an integer value to each class type

# add these values to a new column, 'class_int'
# so we don't lose the original data

# convert strings to numerical values
# use category codes to assign integers
features_df['class_int'] = features_df['class'].astype('category').cat.codes

# print out the first few rows of both columns
# first 8 rows, last 2 columns
print(features_df.iloc[:8, -2:])

# save the mapping so we can access it later
class_mapping = dict(enumerate(features_df['class'].astype('category').cat.categories))

# print the mapping key
print("\nInteger mapping:")
print(class_mapping)


     class  class_int
0  upsilon          1
1    J/psi          0
2  upsilon          1
3    J/psi          0
4  upsilon          1
5  upsilon          1
6    J/psi          0
7  upsilon          1

Integer mapping:
{0: 'J/psi', 1: 'upsilon'}


### Create Features matrix and Target array

We now have a dataframe with 18 columns, including 16 features and 1 target variable (across 2 columns). All of these variables are numeric in type and relate to physical properties of the particles.

Before transforming into tensors, we should group the dataset into 2 separate objects - a feature matrix (X) and a target array (y).

In [132]:
# split into feature matrix and target array
# X and y

# the feature matrix contains the input data
# drop the 'class' and 'class_int' columns from the dataframe
X_total = features_df.drop(['class', 'class_int'], axis=1)

# the target array is the information with which we want the data to be classified (the "label")
# this data is the end column of the features dataframe
y_total = features_df['class_int']

# check type and size of both X and y

print("X:")
print(type(X_total)) # pd dataframe
print(X_total.shape) # (40000, 16)

print("\ny")
print(type(y_total)) # pd series
print(y_total.shape) # (40000,)

# the features matrix has been split into the correct X and y arrays

X:
<class 'pandas.core.frame.DataFrame'>
(40000, 16)

y
<class 'pandas.core.series.Series'>
(40000,)


### Transform dataframes into PyTorch tensors

(This bit needs editing)

In [125]:
# pip installed and imported torch
# from X_total (dataframe) and y_total (series) to X_ten and y_ten (tensors)

# we already have the types and sizes of X and y
# we don't need to print them out again

# dataframe.values
# gets just the data and no column names
# may also increase the storage of each datatype
# by setting all columns to same datatype (probably float64)

# X_total is a dataframe
# X_total.values extracts only the data and neglects the headings
# df.values is of type numpy array
# but the shape of the array is unchanged

print("\nX.values:")
print(type(X_total.values)) # numpy array
print(X_total.values.shape) # (40000, 17)

print("\ny.values")
print(type(y_total.values)) # numpy array
print(y_total.values.shape) # (40000,)


X.values:
<class 'numpy.ndarray'>
(40000, 16)

y.values
<class 'numpy.ndarray'>
(40000,)


In [126]:
# transform X and y to tensors
# X_ten and y_ten

# first check that they only contain numeric dtypes
print("X:")
print(X_total.dtypes.value_counts())
print("y:")
print(y_total.dtypes)

# all types int64 and float64, which are fine for tensor transformations

X:
float64    14
int64       2
Name: count, dtype: int64
y:
int8


In [None]:
# transform X_total to a tensor
X_ten = torch.from_numpy(X_total.values).float()

# now transform y_total to a tensor
y_ten = torch.from_numpy(y_total.values).float()

# ints will be automatically changed to floats
# or can we specify this, just to be clearer?

# check properties of these tensors
# type and shape

# features
print("X tensor:")
print(type(X_ten)) # tensor
print(X_ten.shape) # 40000 x 16
# target
print("\ny tensor:")
print(type(y_ten)) # tensor
print(y_ten.shape) # 40000 (1D)

# these are both torch.tensor objects
# and have the correct shape

# we may need to use unsqueeze() to add a dimension to y_ten

X tensor:
<class 'torch.Tensor'>
torch.Size([40000, 16])

y tensor:
<class 'torch.Tensor'>
torch.Size([40000])


### Split into Test and Training data

We don't want to use all of our data points to train the model. We can split the full dataset into data used to train the model (training data) and data that we can use to analyse the effectiveness of the model once it has been trained (test data).

As there are 40,000 samples, we have enough data to do a 50/50 split. I will therefore use 20,000 samples to train the model and the other 20,000 to test the model.

In [134]:
# use train_test_split to split the data
# allocate some of the data for training the neural network
# and the rest for checking its accuracy

# use 20,000 samples for training
# so train_size = 0.5

# use a random state integer for reproducible random shuffling

# inputs are the X and y torch tensors
# split the data into 4 separate objects
X_train, X_test, y_train, y_test = train_test_split(X_ten, y_ten, train_size=0.5, random_state=71)

# check the types and sizes of the outputs

# training data
# should be 20,000 randomly selected samples
print("X_train:")
print(X_train.shape) # 20000 x 16
print(type(X_train)) # tensor
print("y_train:")
print(y_train.shape) # 20000 (1D)
print(type(y_train)) # tensor

print(f"\n") # space between outputs

# test data
# should be the 20,000 remaining samples
print("X_test:")
print(X_test.shape) # 20000 x 16
print(type(X_test)) # tensor
print("y_test:")
print(y_test.shape) # 20000
print(type(y_test)) # tensor

# the arrays have been split randomly as specified in the function


X_train:
torch.Size([20000, 16])
<class 'torch.Tensor'>
y_train:
torch.Size([20000])
<class 'torch.Tensor'>


X_test:
torch.Size([20000, 16])
<class 'torch.Tensor'>
y_test:
torch.Size([20000])
<class 'torch.Tensor'>


### Plot the Training Data (?)

## 
> ## Building the Neural Network