<a href="https://colab.research.google.com/github/marctruter/Golden-Gate-Circuit-Calculator/blob/main/exercises.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#  Computational Algebraic Geometry Research Network
### Machine Learning and Algebraic Geometry, Practical tutorial
Sara Veneziale, Imperial College London

---

This is a practical tutorial on machine learning and algebraic geometry. It consists of small exercises on predicting the dimension of geometric objects from sequential data. Whenever you see ```...```, there is something for you to fill in.

The tutorial is structured as follows.

- I: Jupyter notebooks
- II: A recap of basic Python syntax
- III: Checking installation
  
The first part aims at predicting the dimension of a weighted projective space from some sequential data (the quantum period).
- 1: First dataset download and data exploration
- 2: Applying PCA
- 3: Trying to predict the dimension
- 4: Second dataset download and data exploration
- 5: Applying PCA
- 6: Trying to predict the dimension

The second part aims at predicting the dimension of a polytope from some sequential data (its Ehrhart series). This part is less guided.
- 7: Third dataset download and data exploration
- 8: Applying PCA
- 9: Predicting the dimension from the Log of the Ehrhart series
- 10: Predicting the dimension from the Ehrhart series

In [None]:
from google.colab import drive
drive.mount('/content/drive')

-------------

## I: Jupyter notebooks

Jupyter notebooks integrate text and code. This is a markdown cell.

In [None]:
# This is a code cell: execute it by pressing the triangle or pressing shift+enter
m = 'This is a code cell!'
print(m)

In [None]:
# Once a cell has been executed, the variables defined in it stay in memory (regardless of the order in which you executed the cells)
print(m)

## II: Introduction and Python syntax
In this tutorial we will be using Python inside a Jupyter notebook. If you have never used Python before, the following cells recap some of the basic syntax.

### Data types

The basic data types in Python that we will encounter are integers, floats, strings, lists and dictionaries. Here are some examples:

In [None]:
# Integers
an_integer = 1
another_integer = 2

# If you divide two integers, you do not get an integer, but a float (a 'real' number or a number with a '.')

print('These are floats')
print(1/2)
print(2/1) # even if the division is exact

# Anything enclosed in ' ' or " " is a string
a_string = 'hello'
another_string = "hello_again"

# You can package objects in a list
an_empty_list = []
a_list_of_integers = [1, 4, 5, 6]
a_list_of_many_things = [1.0, 'hi!', 35]

# Lists in Python are indexed from 0
print('The first element of my list is')
print(a_list_of_integers[0])

# We can add elements to lists by appending
print('My list has length')
print(len(a_list_of_integers))

a_list_of_integers.append(100)

print('and now it has length')
print(len(a_list_of_integers))

# Dictionaries are like lists, where the indices (called keys) can be (almost) anything
an_empty_dictionary = {}
oscar = {'Name': 'Oscar', 'Breed': 'Chihuahua', 'Age':5}

print('How old is Oscar?')
print(oscar['Age'])

# We can add a new key to the dictionary
oscar['house'] = 'London'
print('Where does Oscar live?')
print(oscar['house'])

### Loops

In [None]:
# This is the structure of an if statement
if 100 == a_list_of_integers[-1]: # == is equality, while = is assignment
    print('I have appended 100 to the list')
else:
    print('I have not appended 100 to the list')

# The following is a for loop
print('Let us print the numbers from 1 to 4')
for i in range(1,5): # range(1,5) = [1,2,3,4]
    print(i)

# The following is a list comprehension, it builds the list of all even numbers greater than 0 and less than 20
l = [2*i for i in range(1,10)]
print('Let us print the even numbers greater than 0 and less than 20')
print(l)

# We can add conditions
print('Let us print the even numbers greater than 0 and less than 20, not divisible by 5')
print([x for x in l if x % 5 != 0])


## III: Checking installation
Let us check that all the packages import correctly. Execute the next cell; if it throws any errors, please let me know!

In [None]:
import ast
import matplotlib
import sklearn
import numpy

In [None]:
# IMPORTANT: if you are running on Google Colab uncomment the following two lines and run this cell
# from google.colab import drive
# drive.mount('/content/drive')

--------------

# 1: Data

**Download the data of *regularised quantum periods* of weighted projective spaces** from [HERE](https://www.dropbox.com/scl/fi/sc98nnt0xpbbrxt2mj29m/periods1.txt?rlkey=wyd93zr0h001m82fbtxahdxmf&dl=0).

The data looks like
| Keys                   | Values                                                              |
| ---------------------  | ------------------------------------------------------------------- |
| $\texttt{Weights}$     | A list of integers, e.g. $[1,1]$                                    |
| $\texttt{Dimension}$   | An integer                                                          |
| $\texttt{Periods}$     | A list of floats, e.g. $[0.0, 0.69, 1.79, 2.99, ...]$               |

It is not important for this tutorial to understand what these objects are! Here is what you need to know:
- Weights are the weights of a weighted projective space (for example $[1,1]$ is $\mathbb{P}^1$).
- Dimension: an integer, the dimension of the weighted projective space.
- Periods: a list of 2000 numbers. These are conjectured to be invariants of the weighted projective spaces.

The following cell imports the data from 'periods1.txt' and saves it in a list 'data'. Each data sample is a dictionary with keys 'Weights', 'Dimension', 'Periods'.
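To see what the loading cell does without having the file at hand, here is a minimal sketch that runs the same kind of parsing on a small in-memory sample. The sample text and its values are made up, and the `Key: value` layout is an assumption inferred from the way the loader splits each line; the real file may differ in detail.

```python
import ast
import io

# Hypothetical sample imitating the assumed 'Key: value' layout of periods1.txt
sample = io.StringIO(
    "Weights: [1, 1]\n"
    "Dimension: 1\n"
    "Periods: [0.0, 0.69, 1.79]\n"
    "Weights: [1, 1, 2]\n"
    "Dimension: 2\n"
    "Periods: [0.0, 0.0, 1.1]\n"
)

records = []
current = {}
for line in sample:
    key, _, value = line.partition(': ')
    # literal_eval safely turns the text '[1, 1]' into the Python list [1, 1]
    current[key] = ast.literal_eval(value.strip())
    if key == 'Periods':  # 'Periods' is assumed to be the last field of each sample
        records.append(current)
        current = {}

print(records[0])
```

Using `ast.literal_eval` rather than `eval` means that only Python literals (numbers, lists, strings and so on) are accepted.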

In [None]:
# Uncomment the correct line and run this cell to load the data

import ast

data = []

with open('periods1.txt', 'r') as f: # Leave this line if you are running locally and the file is in the same folder as the notebook
# with open('/content/drive/MyDrive/periods1.txt', 'r') as f: # If you are on Colab uncomment this line and comment the line above
    d = {}
    for x in f:
        if 'Weights' in x:
            d['Weights'] = ast.literal_eval(x.split(': ')[1])
        elif 'Dimension' in x:
            d['Dimension'] = ast.literal_eval(x.split(': ')[1])
        elif 'Periods' in x:
            d['Periods'] = ast.literal_eval(x.split(': ')[1])
            data.append(d)
            d = {}

We want to understand if the 'Periods' know the 'Dimension' of each weighted projective space.

Before we approach the question, let us look at some basic properties of our data. Data is a list of dictionaries; each dictionary is a data sample and has keys ```{'Weights', 'Dimension', 'Periods'} ```.

In [None]:
# data is a list containing dictionaries
print(data[0])

In [None]:
# How many data samples do we have? (Print the length of data)
print(...)

In [None]:
# Maximum dimension that appears
dims_max = max([x['Dimension'] for x in data])

# Minimum dimension that appears
dims_min = min(...)

# How many samples do we have for each dimension?
for i in range(dims_min,dims_max+1): # Loop over all possible dimensions
    print(f'Dimension {i}:')
    print(len([x for x in data if ... ])) # print the length of the list of those samples that have dimension equal to i

In [None]:
# Do all lists of periods have the same length?
len_of_list_of_periods = [len(x['Periods']) for x in data] # List containing the lengths of each period

# Print the maximum and minimum length of periods
print(max(...))
print(min(...))

---

## 2: Dimensionality Reduction

Each list of periods has 2000 terms, which is a lot! We want to reduce the dimension of our data by applying PCA (Principal Component Analysis). [Here is the sklearn link](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) and [Wikipedia](https://en.wikipedia.org/wiki/Principal_component_analysis).

In particular, let us apply PCA to the periods and look at the first two principal components.

Here is a worked out example from the sklearn documentation:
```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# Toy data
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])

# Define Scaler
scaler = StandardScaler()

# Fit the scaler and transform the features
scaler.fit(X)
X = scaler.transform(X)

# Define PCA with two components
pca = PCA(n_components=2)

# Fit and transform the data
pca.fit(X)
X = pca.transform(X)

# Print the explained variance
print(pca.explained_variance_ratio_)
```

In [None]:
# Isolate the periods data
periods = [ ... for x in data]

In [None]:
from sklearn.utils import shuffle

# Get the data and shuffle
X = periods
y = [ ... for x in data ] # the dimensions

# Shuffle data
X, y = shuffle(X, y, random_state = 0)

We will fit the PCA on the training data only and then use it to transform both the training and the testing data, so first we have to split the data into training and testing sets.

In [None]:
from sklearn.model_selection import train_test_split

# Split training and testing (70% training and 30% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.3 , shuffle = True, stratify = y)

# This cell should fail

You should get an error saying that the least populated class has only one member, which is too few. Let us exclude the dimension one and dimension two examples.
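Stratified splitting needs at least two members in every class, so that each class can appear in both the training and the testing set. A quick way to see which classes are too small is to count the labels first. The sketch below uses made-up labels, not the tutorial data.

```python
from collections import Counter

# Hypothetical labels: dimension 1 appears only once, so it cannot be stratified
labels = [1, 3, 3, 3, 4, 4, 4, 4, 5, 5]

counts = Counter(labels)
too_small = sorted(d for d, n in counts.items() if n < 2)
print(counts)
print('Classes with fewer than two samples:', too_small)

# Keep only the samples whose class has at least two members
kept = [d for d in labels if counts[d] >= 2]
print(len(kept))
```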

In [None]:
from sklearn.utils import shuffle

# Get the data and shuffle
X = [periods[i] for i in range(len(periods)) if data[i]['Dimension']>=3]
y = [ ... for x in data if ... ] # the dimensions

# Shuffle data
X, y = shuffle(X, y, random_state = 0)

In [None]:
from sklearn.model_selection import train_test_split

# Split training and testing (70% training and 30% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= ... , shuffle = True, stratify = y)

# This should now work

Before applying PCA we need to standardise the data, so that all features have equal importance in the analysis.
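As a rough illustration of what standardisation does, the sketch below rescales a small made-up array by hand. The statistics are computed on the training rows only and then reused for the testing rows, which mirrors fitting a scaler on the training data and transforming both sets. The arrays and variable names are hypothetical.

```python
import numpy as np

# Hypothetical features: the second column lives on a much larger scale
train = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
test = np.array([[2.5, 200.0]])

# The statistics come from the training data only
mean = train.mean(axis=0)
std = train.std(axis=0)

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.mean(axis=0))  # approximately 0 in each column
print(train_scaled.std(axis=0))   # 1 in each column
print(test_scaled)
```

`StandardScaler` applies the same per-column formula, using the population standard deviation.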

In [None]:
from sklearn.preprocessing import StandardScaler

# Define the scaler
scaler = ...

# Fit the scaler to the training data and transform both the training and the testing data
scaler.fit( ... )
X_train = scaler.transform( ... )
X_test = scaler.transform( ... )


In [None]:
from sklearn.decomposition import PCA

# Define the PCA with two components
pca = ...

In [None]:
# Fit and transform the data
pca.fit(...)
X_train = pca.transform(...)
X_test = pca.transform(...)

# Print the explained variance
print(pca.explained_variance_ratio_)

The explained variance ratio is the proportion of the total variance in the original dataset that is explained by each principal component. You can think of it as 'how much information is carried in each component?' (with 0 being no information and 1 being all the information).
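For intuition, the explained variance ratio can be computed by hand: the variance along each principal component is an eigenvalue of the covariance matrix of the data, and the ratio divides each eigenvalue by their sum. The sketch below does this with NumPy on made-up data; it illustrates the definition and is not meant to reproduce the routine sklearn runs internally.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 200 samples, 3 features with very different spreads
X = rng.normal(size=(200, 3)) * np.array([5.0, 1.0, 0.1])

# Eigenvalues of the covariance matrix are the variances along the principal axes
cov = np.cov(X, rowvar=False)
eigenvalues = np.linalg.eigvalsh(cov)[::-1]  # reorder from largest to smallest

ratio = eigenvalues / eigenvalues.sum()
print(ratio)        # most of the variance sits in the first component
print(ratio.sum())  # the ratios of all components add up to 1
```

With `n_components=2`, sklearn reports only the first two of these ratios, so their sum can be well below 1.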

These components do not seem to carry a lot of information. Let us plot the two components against each other and color by dimension.

In [None]:
# Nothing to do here, just execute the cell to plot the two PCA components against each other

import matplotlib.pyplot as plt

# Apply the scaler to periods
periods = scaler.transform(periods)

# Apply the PCA to all periods
periods = pca.transform(periods)

# Plot the PCA results coloured by dimension
x = [i[0] for i in periods]
y = [i[1] for i in periods]
c = [x['Dimension'] for x in data]

# Classes for the legends
classes = ['1','2','3','4','5','6','7','8','9','10']

# Scatter plot
scatter = plt.scatter(x, y, c=c, alpha=1, s=2)

# Legend
plt.legend(handles=scatter.legend_elements()[0], labels=classes)

# Labels
plt.xlabel(r'PCA$_1$')
plt.ylabel(r'PCA$_2$')

---

## 3: Predicting the dimension?

Let us try to predict the dimension from the PCA components. We do not expect to do very well. Let us try to classify the data using a Support Vector Machine; [here is the sklearn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) and [here is Wikipedia](https://en.wikipedia.org/wiki/Support_vector_machine).

Here is a worked out example from the documentation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Toy data
X = np.array([[-1, -1], [1, 1], [-2, -1], [2, 1]]) # features
y = np.array([1, 2, 1, 2]) # labels

# Divide training and testing
X_train = X[:2]
X_test = X[2:]

y_train = y[:2]
y_test = y[2:]

from sklearn.svm import SVC
scaler = StandardScaler() # define standard scaler
svm = SVC(kernel = 'linear') # define svm (the kernel can be chosen from ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’)

# Fit the scaler and transform the features
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Fit the SVM
svm.fit(X_train, y_train)

# Compute prediction on the testing set
predictions = svm.predict(X_test)

# Compute the accuracy score
print('Accuracy: ')
print(accuracy_score(y_test, predictions))
```

In [None]:
from sklearn.svm import SVC

# Define the svm with a kernel (you can try different ones!)
svm = SVC(kernel = ...)

In [None]:
# Fit the SVM to the training data
svm.fit( ... , ... )

In [None]:
from sklearn.metrics import accuracy_score

# Compute prediction on the testing set
predictions = svm.predict( ... )

# Compute the accuracy score between y_test and predictions
print('Accuracy: ')
print(accuracy_score( ... , ... ))

Why is this doing so badly? Let us look at the data more closely. Count the number of zeros that appear in each list of periods and plot a histogram.

In [None]:
# Find the number of zeros in each list of periods
zeros = [len([y for y in x['Periods'] if ... ]) for x in data]

_ = plt.hist(zeros, bins = 100)

Most of the data is zero! This might be the reason why our classification did not work very well. Let us download more data, where we have computed more coefficients of the periods.

------

# 4: Data

**Download the data of *regularised quantum periods* of weighted projective spaces** from [HERE](https://www.dropbox.com/scl/fi/qr3hs25h5mfea9krmjbyk/periods_more.txt?rlkey=v81hueis97kn3nkrg775xfjw5&dl=0). In this case we compute a lot more terms, and record only those that are non-zero.

The data looks like
| Keys                   | Values                                                              |
| ---------------------  | ------------------------------------------------------------------- |
| $\texttt{Weights}$     | A list of integers, e.g. $[1,1]$                                    |
| $\texttt{Dimension}$   | An integer                                                          |
| $\texttt{Indices}$     | A list of integers, e.g. $[0, 2, 4, 6, 8, 10, 12, 14, ... ]$        |
| $\texttt{Periods}$     | A list of floats, e.g. $[0.0, 0.69, 1.79, 2.99, ...]$               |

It is not important for this tutorial to understand what these objects are! Here is what you need to know:
- Weights are the weights of a weighted projective space (for example $[1,1]$ is $\mathbb{P}^1$).
- Dimension: an integer, the dimension of the weighted projective space.
- Indices: a list of integers, the indices of the non-zero terms of the periods.
- Periods: a list of floats, only the non-zero terms of the periods.

Note that in this case we only record the non-zero elements of the periods; their positions are recorded in Indices.
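
To make the sparse format concrete, here is a minimal sketch that rebuilds the full sequence from `Indices` and `Periods`. The sample values are made up, and the helper name `to_dense` is hypothetical.

```python
def to_dense(indices, values, length):
    '''Place each value at its index and fill the remaining positions with 0.0.'''
    dense = [0.0] * length
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

# Hypothetical sample: non-zero terms only at some even positions
sample = {'Indices': [2, 4, 6], 'Periods': [0.69, 1.79, 2.99]}

print(to_dense(sample['Indices'], sample['Periods'], 8))
# [0.0, 0.0, 0.69, 0.0, 1.79, 0.0, 2.99, 0.0]
```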

In [None]:
# Uncomment the correct line and run this cell to load the data

import ast

data = []

with open('periods_more.txt', 'r') as f: # Leave this line if you are running locally and the file is in the same folder as the notebook
# with open('/content/drive/MyDrive/periods_more.txt', 'r') as f: # If you are on Colab uncomment this line and comment the line above
    d = {}
    for x in f:
        if 'Weights' in x:
            d['Weights'] = ast.literal_eval(x.split(': ')[1])
        elif 'Dimension' in x:
            d['Dimension'] = ast.literal_eval(x.split(': ')[1])
        elif 'Indices' in x:
            d['Indices'] = ast.literal_eval(x.split(': ')[1])
        elif 'Periods' in x:
            d['Periods'] = ast.literal_eval(x.split(': ')[1])
            data.append(d)
            d = {}

In [None]:
# How many data samples do we have?
print( ... )

# What is the max dimension and the min dimension?
dims_max = ...
dims_min = ...

# How many samples do we have for each dimension?
for i in range( ... , ... ):
    print(f'Dimension {i}:')
    print(len( ... ))


---

## 5: Dimensionality reduction using PCA

Let us try again to predict the dimension from the PCA components. First, apply PCA with two components to the periods data.

In [None]:
from sklearn.decomposition import PCA
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split

# Isolate the periods data
periods = [ ... for x in data]

# Ignore dim 1 and 2 because there are too few examples
X = [periods[i] for i in range(len(periods)) if  ... ]
y = [x['Dimension'] for x in data if ... ]

# Shuffle data
X, y = shuffle( ... )

# Train test split: 70% for training and 30% testing
X_train, X_test, y_train, y_test = train_test_split(... , ... , test_size= ... , shuffle = True, stratify = y)

from sklearn.preprocessing import StandardScaler

# Define scaler
scaler = ...

# Scale the features
scaler.fit(...)
X_train = scaler.transform(...)
X_test = scaler.transform(...)

# Define the PCA with two components
pca = ...

# Fit and transform the data
pca.fit(...)
X_train = pca.transform(...)
X_test = pca.transform(...)

# Print the explained variance
print( ... )

# This cell should fail!

You should have got a ValueError! This is because our lists of periods no longer all have the same length, while the scaler and PCA only work if every input has the same number of features. Compute the maximum and minimum length of the lists of periods that appear in our data.

In [None]:
# Do all lists of periods have the same length?
len_of_list_of_periods = [len( ... ) for x in data]

# Print the maximum length and the minimum length that appear
print(max( ... ))
print(min( ... ))

Let us truncate all the lists of periods so that they have the same length as the shortest list of periods in the data.

In [None]:
# Define minimum
min_len_periods = min( ... )

# Truncate all the periods so that they have length min_len_periods
for x in data:
    x['Periods'] = x['Periods'][:min_len_periods]

In [None]:
# Now check that the max and the min of the lengths of the periods are the same
len_of_list_of_periods = [len( ... ) for x in data]

# Print the maximum length and the minimum length that appear
print(max( ... ))
print(min( ... ))

Now that all periods have the same length we can apply PCA with two components.

In [None]:
from sklearn.decomposition import PCA
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split

# Isolate the periods data
periods = ...

# Ignore dim 1 and 2 because there are too few examples
X = [periods[i] for i in range(len(periods)) if  ... ]
y = [x['Dimension'] for x in data if ... ]

# Shuffle data
X, y = ...

# Train test split: 70% for training and 30% testing
X_train, X_test, y_train, y_test = train_test_split(... , ... , test_size= ... , shuffle = True, stratify = y)

from sklearn.preprocessing import StandardScaler

# Define scaler
scaler = ...

# Scale the features
scaler.fit(...)
X_train = scaler.transform(...)
X_test = scaler.transform(...)

# Define the PCA with two components
pca = ...

# Fit and transform the data
pca.fit(...)
X_train = pca.transform(...)
X_test = pca.transform(...)

# Print the explained variance
print( ... )

# This should now work!

Let us plot the two components against each other and color by dimension.

In [None]:
# Nothing to do here, just execute the cell to plot the two PCA components against each other

import matplotlib.pyplot as plt

periods = scaler.transform(periods)
periods = pca.transform(periods)

# Plot the PCA results coloured by dimension
x = [i[0] for i in periods]
y = [i[1] for i in periods]
c = [d['Dimension'] for d in data]

# Classes for the legends
classes = ['1','2','3','4','5','6','7','8','9','10']

# Scatter plot
scatter = plt.scatter(x, y, c=c, alpha=1, s=2)

# Legend
plt.legend(handles=scatter.legend_elements()[0], labels=classes)

# Labels
plt.xlabel(r'PCA$_1$')
plt.ylabel(r'PCA$_2$')

---

## 6: Predicting the dimension?

The PCA components are clearly separable by lines into clusters of different dimensions. Let us fit an SVM with a linear kernel.

In [None]:
from sklearn import svm
from sklearn.metrics import accuracy_score

# Define the svm
clf = svm.SVC(...)

# Train the svm
clf.fit(..., ...)

# Compute prediction on the testing set
predictions = clf.predict(...)

# Compute the accuracy score
print('Accuracy: ')
print(...)

This accuracy is not as good as we would have expected. Try standardising the data again and see if it helps.
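One possible reason a second standardisation helps is that the two PCA components can have very different spreads, and the regularised fit of an SVM depends on the scale of its inputs. The sketch below uses made-up scores, not the tutorial data, to show what the second rescaling does.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical PCA scores: the first component is spread far wider than the second
scores = rng.normal(size=(500, 2)) * np.array([10.0, 0.5])
print(scores.std(axis=0))

# Standardise each component so that both have mean 0 and standard deviation 1
rescaled = (scores - scores.mean(axis=0)) / scores.std(axis=0)
print(rescaled.std(axis=0))
```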

In [None]:
from sklearn import svm
from sklearn.metrics import accuracy_score

# Define scaler
scaler2 = ...

# Scale the features
scaler2.fit(...)
X_train = scaler2.transform(...)
X_test = scaler2.transform(...)

# Define the svm
clf = svm.SVC(...)

# Train the svm
clf.fit(..., ...)

# Compute prediction on the testing set
predictions = clf.predict(...)

# Compute the accuracy score
print('Accuracy: ')
print(...)

Plot the decision boundaries computed by the SVM on top of a scatter plot of the scaled PCA components. Note that the SVM should compute 28 decision boundaries, one for each pair of classes. Plot only those with indices [0, 7, 13, 18, 22, 25, 27], which correspond to the decision boundaries between neighbouring dimensions.
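The indices above can be recovered from the one-vs-one ordering that sklearn documents for multi-class `SVC`: the pairwise classifiers are listed as class 0 vs 1, 0 vs 2, and so on, followed by 1 vs 2, etc. The sketch below assumes that the eight classes are the dimensions 3 to 10, which is what 28 = 8·7/2 boundaries suggests.

```python
from itertools import combinations

# Assumed class labels: dimensions 3 to 10 (eight classes)
dimensions = list(range(3, 11))

# One-vs-one pairs in the order (3,4), (3,5), ..., (3,10), (4,5), ...
pairs = list(combinations(dimensions, 2))
print(len(pairs))  # 28

# Keep the positions of the pairs made of neighbouring dimensions
neighbours = [i for i, (a, b) in enumerate(pairs) if b == a + 1]
print(neighbours)  # [0, 7, 13, 18, 22, 25, 27]
```

Each boundary is the line $w_0 x + w_1 y + b = 0$, which the plotting cell solves for $y$.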

In [None]:
# Nothing to do here, just execute the cell to see the plot
import numpy as np
plt.clf()

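# Isolate the periods data (dimensions 1 and 2 were excluded from training, so we skip them here too)
periods = [x['Periods'] for x in data if x['Dimension'] >= 3]

# Scale all features
periods = scaler.transform(periods)
periods = pca.transform(periods)
periods = scaler2.transform(periods)

y = [x['Dimension'] for x in data if x['Dimension'] >= 3]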

# Coefficients and y-intercepts of the decision boundaries from the trained svm
coeffs = clf.coef_
intercepts = clf.intercept_

# Extract PCA components
scaled_PCA1 = [x[0] for x in periods]
scaled_PCA2 = [x[1] for x in periods]

# Plot the decision boundaries between neighbouring classes
for i in [0, 7, 13, 18, 22, 25, 27]:
    x = np.linspace(-2, 4, 1000)
    f = (-coeffs[i][0]*x-intercepts[i])/coeffs[i][1]
    plt.plot(x, f, color = 'black', linewidth = 0.5)

# Scatter plot
ax = plt.scatter(scaled_PCA1, scaled_PCA2, c = y, alpha = 1, s = 2)

# x-label
plt.xlabel(r'standardised PCA$_1$')

# y-label
plt.ylabel(r'standardised PCA$_2$')

# Legend
plt.legend(handles = ax.legend_elements()[0], labels=classes[2:])

--------------------------

## 7: Data, the dimension of a polytope

This example is taken from 'Machine learning the dimension of a polytope' by T. Coates, J. Hofscheier and A. M. Kasprzyk.

The aim of this part is to predict the dimension of a polytope from its Ehrhart series.

**Download the data of Ehrhart series and dimensions** from [HERE](https://zenodo.org/records/6614821) and save it to the same folder as this notebook.

The data looks like
| Keys                   | Values                                                              |
| ---------------------  | ------------------------------------------------------------------- |
| $\texttt{ULID}$        | A string                                                            |
| $\texttt{Dimension}$   | An integer                                                          |
| $\texttt{Volume}$      | An integer                                                          |
| $\texttt{EhrhartDelta}$| A list of integers, e.g. $[1,70,223,48]$                            |
| $\texttt{Ehrhart}$     | A list of integers, e.g. $[1,74,513,...]$                           |
| $\texttt{LogEhrhart}$  | A list of floats, e.g. $[0.0,4.3,6.2,...]$                          |

It is not important for this tutorial to understand what these objects are! Here is what you need to know:
- ULID is just to keep track of which example we are looking at.
- Dimension: an integer, the dimension $d$ of the polytope.
- Volume: an integer, the volume of the polytope.
- EhrhartDelta: a sequence of integers of length $d+1$.
- Ehrhart: a sequence of integers.
- LogEhrhart: a sequence of floats (the logarithm of the Ehrhart sequence).
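
As a small sanity check of the last two bullets, the sketch below takes the first few example values from the table and applies the natural logarithm. That `LogEhrhart` uses the natural logarithm is an assumption, although it agrees with the rounded example values shown in the table.

```python
import math

# First terms of the example Ehrhart sequence from the table
ehrhart_example = [1, 74, 513]

log_ehrhart_example = [math.log(n) for n in ehrhart_example]
print([round(v, 1) for v in log_ehrhart_example])  # [0.0, 4.3, 6.2]

# EhrhartDelta has length d + 1, so the example from the table corresponds to d = 3
ehrhart_delta_example = [1, 70, 223, 48]
print(len(ehrhart_delta_example) - 1)  # 3
```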

In [None]:
# Uncomment the correct line and run this cell to load the data

import ast

data = []

with open('dimension.txt', 'r') as f: # Leave this line if you are running locally and the file is in the same folder as the notebook
# with open('/content/drive/MyDrive/dimension.txt', 'r') as f: # If you are on Colab uncomment this line and comment the line above
    d = {}
    for x in f:
        if 'Dimension' in x:
            d['Dimension'] = ast.literal_eval(x.split(': ')[1])
        elif 'Volume' in x:
            d['Volume'] = ast.literal_eval(x.split(': ')[1])
        elif 'EhrhartDelta' in x:
            d['EhrhartDelta'] = ast.literal_eval(x.split(': ')[1])
        elif 'Ehrhart' in x and 'Log' not in x and 'Delta' not in x:
            d['Ehrhart'] = ast.literal_eval(x.split(': ')[1])
        elif 'LogEhrhart' in x:
            d['LogEhrhart'] = ast.literal_eval(x.split(': ')[1])
            data.append(d)
            d = {}


As before, print how many samples we have and how many samples for each dimension.

In [None]:
# How many examples do we have?
print(...)

In [None]:
# How many samples do we have for each dimension?
dims_max = ...
dims_min = ...


for i in ...:
    print(f'Dimension {i}:')
    print(...)


In [None]:
# Do all LogEhrhart lists have the same length?
len_of_log_ehrhart = ...

# Print the maximum length and the minimum length that appear
print(max(...))
print(min(...))

In [None]:
# Do all Ehrhart lists have the same length?
len_of_ehrhart = ...

# Print the maximum length and the minimum length that appear
print(max(...))
print(min(...))

---

## 8: Dimensionality reduction of the logarithm of the Ehrhart vector

Apply PCA with two components to the logarithmic Ehrhart vector.

In [None]:
from sklearn.decomposition import PCA

# LogEhrhart data
logehrhart = ...

# Isolate features and labels
X = ...
y = ...

# Shuffle data
X, y = ...

# Train test split: 70% for training and 30% testing
X_train, X_test, y_train, y_test = ...

# Define scaler
...

# Fit the scaler
...

# Scale both the training and the testing data
...
...

# Define the PCA with two components
...

# Fit the data and transform
...
...
...

# Print the explained variance
...

Plot the two principal components against each other and colour by dimension.

In [None]:
# Nothing to do here, just execute the cell for plotting

import matplotlib.pyplot as plt


logehrhart = scaler.transform(logehrhart)
logehrhart = pca.transform(logehrhart)

# Plot the PCA results coloured by dimension
x = [i[0] for i in logehrhart]
y = [i[1] for i in logehrhart]
c = [x['Dimension'] for x in data]

# Classes for the legends
classes = ['2','3','4','5','6','7','8']

# Scatter plot
scatter = plt.scatter(x, y, c=c, alpha=1, s=2)

# Legend
plt.legend(handles=scatter.legend_elements()[0], labels=classes)

# Labels
plt.xlabel(r'PCA$_1$')
plt.ylabel(r'PCA$_2$')

---

## 9: Predicting the dimension from the logarithm of the Ehrhart vector

These are clearly linearly separable. Train a linear Support Vector Machine to predict the dimension from the two components of the PCA and plot the linear boundaries.

In [None]:
from sklearn.utils import shuffle
from sklearn import svm
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


# Define another scaler
...

# Fit the scaler
...

# Scale both the training and the testing data
...
...


# Define the svm
...

# Train the svm
...

# Compute prediction on the testing set
...

# Compute the accuracy score
print('Accuracy: ')
...

---

## 10: Predicting the dimension from the Ehrhart vector

Try to do the same using the Ehrhart vector, instead of LogEhrhart.

In [None]:
from sklearn.decomposition import PCA

# Ehrhart data
ehrhart = ...

# Isolate features and labels
X = ehrhart
y = [x['Dimension'] for x in data]

# Shuffle data
...

# Train test split: 70% for training and 30% for testing
...

# Define the PCA with two components
...

# Define scaler and scale the features
...
...
...
...

# Fit the PCA and transform the data
...
...
...

# Print the explained variance
...

In [None]:
# Nothing to do here, just execute the cell to plot the picture

import matplotlib.pyplot as plt


ehrhart = scaler.transform(ehrhart)
ehrhart = pca.transform(ehrhart)

# Plot the PCA results coloured by dims
x = [i[0] for i in ehrhart]
y = [i[1] for i in ehrhart]
c = [x['Dimension'] for x in data]

# Classes for the legends
classes = ['2','3','4','5','6','7','8']

# Scatter plot
scatter = plt.scatter(x, y, c=c, alpha=1, s=2)

# Legend
plt.legend(handles=scatter.legend_elements()[0], labels=classes)

# Labels
plt.xlabel(r'PCA$_1$')
plt.ylabel(r'PCA$_2$')

Try to predict the dimension from the first two PCA components.

In [None]:
# Define scaler and scale the features
...
...
...
...

# Define the svm, fit it, and compute the predictions for the test set
...

# Compute the accuracy score
print('Accuracy: ')
...

Maybe in this case two components are not enough. Write a function that takes as arguments
- the features (the Ehrhart vector)
- the labels (the dimensions)
- the number of components of the PCA
- the fraction of the data used for testing

and returns the accuracy obtained by a linear SVM trained on the data after it has gone through PCA with the specified number of components.

In [None]:
def Accuracy(features, labels, n_components, testing):
    ''' Returns the accuracy of a linear SVM.'''
...


In [None]:
# Ehrhart data
ehrhart = [x['Ehrhart'] for x in data]
y = [x['Dimension'] for x in data]

n_components = 2 # Play around with the number of components and see how the accuracy changes
print(Accuracy(ehrhart, y, n_components, 0.3))