# ACT CW2 Q2

__Q2 Objective:__

Classify the dataset using a neural network.

__Plan__

We have a binary classification problem, so we are sorting data into one of two classes based on the input values.


__Workflow__ (from Source 1)

1. Get data ready (turn into tensors)
2. Build or pick a pretrained model to suit your problem
3. Pick a loss function and optimiser
4. Build a training loop
(Loop through steps 2-4)
5. Fit the model to the data and make a prediction
6. Evaluate the model
7. Improve through experimentation
8. Save and reload the trained model
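
As a rough illustration of steps 1-6, the sketch below trains a small network on synthetic data. The data, layer sizes, learning rate, and number of epochs are all placeholder assumptions chosen for the example; they are not the values used for this coursework.

```python
import torch
from torch import nn

torch.manual_seed(0)  # make the sketch reproducible

# 1. Data as tensors: 200 synthetic samples, 4 features, binary labels (assumed toy data)
X = torch.randn(200, 4)
y = (X[:, 0] + X[:, 1] > 0).float().unsqueeze(1)

# 2. A small fully connected model (layer sizes are arbitrary here)
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))

# 3. Loss function and optimiser for binary classification
loss_fn = nn.BCEWithLogitsLoss()  # expects raw logits, applies the sigmoid internally
optimiser = torch.optim.SGD(model.parameters(), lr=0.1)

# 4. Training loop (full-batch gradient descent)
losses = []
for epoch in range(200):
    model.train()
    logits = model(X)
    loss = loss_fn(logits, y)
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    losses.append(loss.item())

# 5-6. Predict and evaluate (on the training data, for brevity only)
model.eval()
with torch.inference_mode():
    preds = (torch.sigmoid(model(X)) > 0.5).float()
accuracy = (preds == y).float().mean().item()

print(f"first loss: {losses[0]:.3f}, last loss: {losses[-1]:.3f}, accuracy: {accuracy:.2f}")
```

In a real run the evaluation in steps 5-6 would use a held-out test set rather than the training data.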


__Links__ 

* https://www.learnpytorch.io/02_pytorch_classification/
* https://docs.pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html


### Import Libraries

In [1]:
# import necessary libraries

import numpy as np # numpy
import pandas as pd # for dataframes
import matplotlib.pyplot as plt # for plotting
# import seaborn as sns # for data visualisation

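# machine learning libraries and models
import torch # PyTorch tensors
from torch import nn # neural network building blocks
from sklearn.model_selection import train_test_split # reproducible train/test split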

## Preparing the Dataset

Load in the classification data and prepare it as a PyTorch tensor.

To start with, we want the data set (all relevant features) contained in a pandas dataframe and ready for modelling, so the first part of this notebook is identical to Q1.

After this, we convert the data to PyTorch tensors and use sklearn to perform a reproducible train/test split.
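
A minimal sketch of that last step is shown here on a toy dataframe. The column values, the `test_size`, and the `random_state` are assumptions for illustration only, and the sketch splits the NumPy arrays before converting them to tensors, which is one possible ordering rather than the only one.

```python
import numpy as np
import pandas as pd
import torch
from sklearn.model_selection import train_test_split

# Toy stand-in for the prepared dataframe (hypothetical values)
toy_df = pd.DataFrame({
    "E1": [5.9, 13.7, 8.6, 7.5, 12.5, 21.1, 29.5, 9.8],
    "pt1": [4.3, 12.0, 4.8, 3.6, 8.3, 14.4, 27.3, 6.1],
    "class": [0, 1, 0, 0, 1, 1, 1, 0],
})

# Separate the features matrix X from the target array y
X = toy_df.drop(columns="class").to_numpy(dtype=np.float32)
y = toy_df["class"].to_numpy(dtype=np.float32)

# Fixing random_state makes the split identical on every run
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Convert the NumPy arrays to PyTorch tensors
X_train_t = torch.from_numpy(X_train)
X_test_t = torch.from_numpy(X_test)
y_train_t = torch.from_numpy(y_train)
y_test_t = torch.from_numpy(y_test)

print(X_train_t.shape, X_test_t.shape)
print(X_train_t.dtype)
```

Casting to `float32` up front matches the default dtype of PyTorch layers, which avoids dtype mismatch errors when the tensors are passed to a model.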

### Load in the data

Load in the data file (a CSV file), add the data to a pandas dataframe, and inspect the dataframe to check that it has loaded correctly.

In [9]:
# import the data file
# data is in the file "psion_upsilon.csv"

# read in the file and store it in a pandas dataframe
rawdata_df = pd.read_csv('psion_upsilon.csv')

In [10]:
# check the dataframe has loaded in correctly

# print the shape of the dataframe
print("Size of dataframe:")
print(rawdata_df.shape) # 40,000 rows x 22 columns
print(f"\n")

# print the column headings
print("Column headings:")
print(rawdata_df.columns)
print(f"\n")

# check top few rows of data
print("Top few rows of dataframe:")
print(rawdata_df.head())

Size of dataframe:
(40000, 22)


Column headings:
Index(['Unnamed: 0', 'Run', 'Event', 'type1', 'E1', 'px1', 'py1', 'pz1', 'pt1',
       'eta1', 'phi1', 'Q1', 'type2', 'E2', 'px2', 'py2', 'pz2', 'pt2', 'eta2',
       'phi2', 'Q2', 'class'],
      dtype='object')


Top few rows of dataframe:
   Unnamed: 0     Run       Event type1       E1      px1      py1  \
0           0  167807  1101779335     G   5.8830   3.6101   2.3476   
1           1  167102   286049970     G  13.7492  -1.9921  11.8723   
2           2  160957   190693726     G   8.5523   1.4623   4.5666   
3           3  166033   518823971     G   7.5224   0.1682  -3.5854   
4           4  163589    49913789     G  12.4683   8.1310  -1.6633   

### Remove Unnecessary Data Columns

The first 3 columns contain index, run number, and event number. These are used to track the data points, but they are not scientific parameters and should not be used as variables to determine output class.

In [11]:
# remove the first 3 columns
# by defining a new dataframe
# that only contains the relevant variables

# drop columns 0, 1, and 2
# so keep all rows, and columns 3-21
# (df.iloc indices are start inclusive and end exclusive)
reduced_df = rawdata_df.iloc[:, 3:]

# check the properties of the new dataframe are what we want
# print out the new shape and column headings

print("Size of reduced dataframe:")
print(reduced_df.shape)
# 40,000 rows x 19 columns

print(f"\n")
print("Column headings:")
print(reduced_df.columns)

# this is what we expect
# we have removed 'Unnamed: 0' (the index), 'Run', and 'Event'
# and kept all 40,000 samples

Size of reduced dataframe:
(40000, 19)


Column headings:
Index(['type1', 'E1', 'px1', 'py1', 'pz1', 'pt1', 'eta1', 'phi1', 'Q1',
       'type2', 'E2', 'px2', 'py2', 'pz2', 'pt2', 'eta2', 'phi2', 'Q2',
       'class'],
      dtype='object')
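
As an alternative to slicing by position, the same columns could be dropped by name, which keeps working if the column order ever changes. This is a sketch on a toy dataframe; the `class` labels in it are made up for illustration.

```python
import pandas as pd

# Toy dataframe with the same three bookkeeping columns (class labels are made up)
toy_df = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "Run": [167807, 167102],
    "Event": [1101779335, 286049970],
    "E1": [5.8830, 13.7492],
    "class": [0, 1],
})

# Positional slicing: keep everything from column 3 onwards
by_position = toy_df.iloc[:, 3:]

# Dropping by name: explicit about which columns are removed
by_name = toy_df.drop(columns=["Unnamed: 0", "Run", "Event"])

print(by_name.columns.tolist())     # ['E1', 'class']
print(by_position.equals(by_name))  # True
```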


A neural network cannot process the variables 'type1' and 'type2' directly, as these are non-numerical values and PyTorch tensors hold only numerical data. These columns tell us the type of the first and second muon: a global muon (G) or a tracker muon (T).

Before deciding what to do with these values, look at the proportion of each type that we have in the dataset.

In [13]:
# how many occurrences of G and T in each 'type' column?
# look at 'type1' and 'type2'

# type1
print(reduced_df['type1'].value_counts()) # total freq of each value
print(reduced_df['type1'].value_counts(normalize=True)) # as a fraction of the total

print("\n") # space out the outputs

# type2
print(reduced_df['type2'].value_counts()) # total freq of each value
print(reduced_df['type2'].value_counts(normalize=True)) # as a fraction of the total

type1
G    40000
Name: count, dtype: int64
type1
G    1.0
Name: proportion, dtype: float64


type2
G    36121
T     3879
Name: count, dtype: int64
type2
G    0.903025
T    0.096975
Name: proportion, dtype: float64


The values in 'type1' are all global muons - all of the samples have a type1 value of 'G'. This variable will add nothing to our calculation, so it would be more efficient to simply remove this column.

In [14]:
# as G is 100% of the values in 'type1'
# rewrite the dataframe without this column
reduced_df = reduced_df.drop('type1', axis=1)

# check that this column is gone
print("New shape:")
print(reduced_df.shape) # 40000 x 18
print("\nNew column headings:")
print(reduced_df.columns)

# type1 column has been removed
# (only run this code cell once as we are overwriting a variable)

New shape:
(40000, 18)

New column headings:
Index(['E1', 'px1', 'py1', 'pz1', 'pt1', 'eta1', 'phi1', 'Q1', 'type2', 'E2',
       'px2', 'py2', 'pz2', 'pt2', 'eta2', 'phi2', 'Q2', 'class'],
      dtype='object')
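
The check for a single-valued column can also be done for every column at once. The sketch below uses a toy dataframe with made-up values and `DataFrame.nunique` to find and drop any constant columns.

```python
import pandas as pd

# Toy dataframe: 'type1' is constant, the other columns are not (values are made up)
toy_df = pd.DataFrame({
    "type1": ["G", "G", "G", "G"],
    "type2": ["G", "T", "G", "G"],
    "Q1": [-1, 1, 1, -1],
})

# Count the distinct values in each column
n_unique = toy_df.nunique()

# Columns with exactly one distinct value carry no information
constant_cols = n_unique[n_unique == 1].index.tolist()
print(constant_cols)  # ['type1']

# Return a new dataframe without the constant columns
cleaned_df = toy_df.drop(columns=constant_cols)
print(cleaned_df.columns.tolist())  # ['type2', 'Q1']
```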


### One-Hot Encoding for Non-numerical Values

About 10% of the type2 values are 'T', so we cannot just remove this column. This variable may turn out to be significant in determining the output class. However, a neural network cannot handle non-numerical features, so the column must first be converted to numerical values.

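There are multiple ways to do this:
- label (ordinal) encoding
- one-hot encoding
- dictionary vectorising (e.g. scikit-learn's DictVectorizer)

I will be using one-hot encoding, as 'type2' is a category with no natural ordering. Scikit-learn has a one-hot encoder (OneHotEncoder), but pandas get_dummies is used here because it works directly on a dataframe column and returns a labelled dataframe.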

In [15]:
# need to change the 'type2' variables into numerical values
# using get_dummies - essentially a one-hot encoder

# use one-hot encoding on just the type2 column
# with drop_first=False (the default)
# to get a 2-column table with G and T headings
type2_onehot = pd.get_dummies(reduced_df['type2'])

# look at the dummy values table (dataframe)
print("Dummy Values (first few rows):")
print(type2_onehot.head())
print("\nShape:")
print(type2_onehot.shape) # 40000 x 2
print("\nColumns:")
print(type2_onehot.columns)

# if drop_first is set to True, we only get the second column, T

# now replace the 'type2' column in the features dataset
# with the G column from the dummy table

# reassign the column values directly
reduced_df['type2'] = type2_onehot['G']

# if you use get_dummies on the whole dataframe directly,
# every non-numerical column is replaced by prefixed dummy columns
# (here 'type2_G' and 'type2_T') in a new dataframe

# rename column as type2_G to make the meaning clearer
reduced_df.rename(columns={'type2' : 'type2_G'}, inplace=True)

# check that the dataframe has updated correctly
print("\nUpdated Dataframe:")
print("Shape:") # shape should not have changed
print(reduced_df.shape) # 40000 x 18
print("Updated headings:") # column name should have changed
print(reduced_df.columns)
print("First few rows:") # check the type2_G values are Boolean
print(reduced_df.head())

Dummy Values (first few rows):
      G      T
0  True  False
1  True  False
2  True  False
3  True  False
4  True  False

Shape:
(40000, 2)

Columns:
Index(['G', 'T'], dtype='object')

Updated Dataframe:
Shape:
(40000, 18)
Updated headings:
Index(['E1', 'px1', 'py1', 'pz1', 'pt1', 'eta1', 'phi1', 'Q1', 'type2_G', 'E2',
       'px2', 'py2', 'pz2', 'pt2', 'eta2', 'phi2', 'Q2', 'class'],
      dtype='object')
First few rows:
        E1      px1      py1      pz1      pt1    eta1    phi1  Q1  \
0   5.8830   3.6101   2.3476   4.0069   4.3062  0.8314  0.5766  -1   
1  13.7492  -1.9921  11.8723  -6.6416  12.0382 -0.5270  1.7370   1   
2   8.5523   1.4623   4.5666   7.0809   4.7950  1.1818  1.2609   1   
3   7.5224   0.1682  -3.5854   6.6100
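
A possible variation, sketched on a toy dataframe with made-up values: `get_dummies` can encode the column inside the dataframe in one call, and `dtype=float` returns 0.0/1.0 values instead of Booleans, which is convenient when the dataframe is later converted to a float tensor. This is an alternative to the approach used in the notebook, not a replacement for it.

```python
import pandas as pd

# Toy dataframe with one categorical and one numerical column (values are made up)
toy_df = pd.DataFrame({
    "type2": ["G", "G", "T", "G"],
    "E2": [3.1, 7.4, 2.2, 9.0],
})

# Encode only the 'type2' column; dtype=float gives 0.0/1.0 instead of True/False
encoded = pd.get_dummies(toy_df, columns=["type2"], dtype=float)
print(encoded.columns.tolist())  # ['E2', 'type2_G', 'type2_T']

# With two categories the dummy columns are redundant, so keep just one of them
encoded = encoded.drop(columns="type2_T")
print(encoded["type2_G"].tolist())  # [1.0, 1.0, 0.0, 1.0]
```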

### Create Features matrix and Target array

**Up to here the process of loading in and editing the data has been identical to Q1. From here, the process will diverge as we create PyTorch tensors and build a neural network.**