# Machine Learning - Task A

*M Kundegorski*, Fjelltopp Ltd, University of Glasgow, 2019-2020


In this task you will read in some toy data from an experiment and, using linear regression, try to predict the outcome for an 'unseen' or unknown measurement.

In [None]:
# Tell matplotlib to draw plots inline in the notebook. This is a Jupyter 'magic' command, not Python.
%matplotlib inline 

#Import useful libraries
import matplotlib.pyplot as plt  # We will use matplotlib for plotting
import pandas as pd  # We will use pandas DataFrames for storing features
import numpy as np  # We will use NumPy arrays to store numerical data

# utils is a custom module written to simplify these tutorials
# You do not need to understand its code for this practical
from utils.practice_data import generateNiceData

# Generate a pandas DataFrame `problem` with some toy results
# Each row is a data point with four columns: input features A, B, C and output Y
# These could, for example, be length, intensity, etc.
number_of_samples=100
number_of_features=3
problem = generateNiceData(number_of_features,number_of_samples,noNoise=True) 
problem.describe()

In [None]:
#You can show the top of the pandas dataframe using .head(N) method
problem.head(10)

In [None]:
#Or you can display some sample data points using .sample(N) method
problem.sample(20)

In [None]:
#You can modify the plotting function below to see the relationship between each variable A, B and C, and output Y. 
#Does the relationship look like it is linear?
plt.plot(problem['A'],problem['Y'], marker='o', linewidth=0)

## The Task

In one experiment we measured three features, A, B and C, which we suspect can be used to predict another feature, Y. With enough *training data* containing both the input features A, B, C and the measured output Y, we can learn a model that predicts Y.

Given these three features (A, B and C) we want to predict the value of feature Y.

1. Let's start with a multivariate linear regression (assuming the relationship between our variables is simple).
2. Advanced: Next, see how the performance changes with and without noise in the data.
3. Advanced: See how the performance changes with the size of the problem.
4. Advanced: Then try providing non-linear features (i.e. assuming the relationship between our variables is complex).

In [None]:
#Data from pandas series needs to be converted to familiar numpy arrays
x = problem.loc[:,['A','B','C']].values
y = problem.loc[:,'Y'].values

print(x.shape, y.shape)  # check our array shapes make sense

## Task A.1 

Read the documentation for function [sklearn.model_selection.train_test_split()](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html), start with the second example provided.

In the cell below, complete the function call (by replacing all the `____`s) to split our dataset into 80% `x_train`/`y_train` (training) and 20% `x_test`/`y_test` (testing) subsets.

In [None]:
from sklearn import model_selection  # for train_test_split function

____, ____, ____, ____  = model_selection.train_test_split(____, ____, test_size=____, random_state=0)
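
As a separate illustration of how `train_test_split` divides a dataset, here is a minimal sketch on a small made-up array. The names `features` and `targets`, the array sizes and the 25% test fraction are assumptions chosen for this example only; they are not the values asked for in Task A.1.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 8 samples with 2 features each, and one target per sample
features = np.arange(16).reshape(8, 2)
targets = np.arange(8)

# Hold out 25% of the samples for testing; random_state makes the shuffle repeatable
f_train, f_test, t_train, t_test = train_test_split(
    features, targets, test_size=0.25, random_state=0
)

print(f_train.shape, f_test.shape)  # feature rows in the training and testing subsets
print(t_train.shape, t_test.shape)  # matching targets for each subset
```

The function shuffles the rows before splitting, but each row of features stays paired with its own target.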

## Task A.2

 Look at the [Linear Regression class](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) documentation and examples. In the code below, we have initialised a Linear Regression model. Add the following lines of code:
1. A line that 'fit's a model to our training data,
2. A line that 'predict's the value of Y for our test data.

What do the intercept and slope represent?

In [None]:
from sklearn import linear_model  # this submodule contains the 'LinearRegression' class

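# Note: the 'normalize' argument was removed in scikit-learn 1.2, so it is not passed here.
# How would the model perform if the features were scaled first (e.g. with preprocessing.StandardScaler)?
mv_regression = linear_model.LinearRegression()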

# Fit regression model to feature data x_train and target y_train
# ADD CODE HERE:

# Fill vector y_predict with estimates of target y from data x_test
# ADD CODE HERE:

print("Intercept {}".format(mv_regression.intercept_))
print("Slope {}".format(mv_regression.coef_))
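
To make the meaning of the intercept and slope concrete, the sketch below fits a model to made-up data whose true relationship is known in advance. The names `inputs`, `target` and `toy_model`, and the coefficients 0.5, 2.0 and -1.0, are assumptions for illustration; they have nothing to do with the values hidden inside `generateNiceData`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Two hypothetical input features, 50 samples
inputs = rng.uniform(-1.0, 1.0, size=(50, 2))

# Noise-free target built from a known linear rule:
# target = 0.5 + 2.0 * (first feature) - 1.0 * (second feature)
target = 0.5 + 2.0 * inputs[:, 0] - 1.0 * inputs[:, 1]

toy_model = LinearRegression().fit(inputs, target)

print("Intercept {}".format(toy_model.intercept_))  # estimate of the constant term
print("Slope {}".format(toy_model.coef_))  # one estimated coefficient per feature
```

Because the toy data contain no noise, the fitted values should match the rule used to build the target up to floating-point precision: the intercept is the prediction when every feature is zero, and each slope is the change in the prediction per unit change in one feature.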

In machine learning we want to get an idea of how well our model fits our data, which we do by comparing our predictions to the known values in the testing data. A variety of error metrics can be used for this. Run the cell below to compare the 'True' (known) values and the predicted values, and to compute three common error metrics. How do you interpret these numbers?

In [None]:
from sklearn import metrics

results = pd.DataFrame({'True value': y_test.flatten(), 'Predicted': y_predict.flatten()})
display(results)  # show true and predicted values side by side
print('Mean Absolute Error: {}'.format(metrics.mean_absolute_error(y_test, y_predict))  )
print('Mean Squared Error: {}'.format(metrics.mean_squared_error(y_test, y_predict)) )
print('Root Mean Squared Error: {}'.format(np.sqrt(metrics.mean_squared_error(y_test, y_predict))))
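
To help interpret these three metrics, the sketch below computes them by hand with NumPy on four made-up values and checks them against scikit-learn. The arrays `true_values` and `predicted_values` are invented for this example.

```python
import numpy as np
from sklearn import metrics

# Hypothetical true and predicted values
true_values = np.array([1.0, 2.0, 3.0, 4.0])
predicted_values = np.array([1.5, 2.0, 2.0, 5.0])

# Residuals: how far each prediction is from the truth
residuals = true_values - predicted_values

mae = np.mean(np.abs(residuals))  # average size of the errors
mse = np.mean(residuals ** 2)  # average squared error, penalises large errors more
rmse = np.sqrt(mse)  # square root of MSE, back in the units of the target

print('Mean Absolute Error: {}'.format(mae))
print('Mean Squared Error: {}'.format(mse))
print('Root Mean Squared Error: {}'.format(rmse))

# The hand-computed values agree with scikit-learn's implementations
print(np.isclose(mae, metrics.mean_absolute_error(true_values, predicted_values)))
print(np.isclose(mse, metrics.mean_squared_error(true_values, predicted_values)))
```

MAE and RMSE are in the same units as Y, so they can be compared directly with the spread of the true values; MSE is in squared units.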

Numbers are all well and good, but results are often easier to understand when plotted. Run the cell below to compare the true and predicted values for our test data.

In [None]:
f, axis = plt.subplots(1,1)  # create a figure with a single axis (subplot)

axis.plot(y_test, y_test, '--')  # plot true vs true, i.e. the ideal case
axis.scatter(y_test,y_predict)  # plot a scatter of the true value against the prediction value
axis.set_ylabel('Predicted Value')
axis.set_xlabel('True Value')

plt.show()

## Advanced Task A.3

In the cell below we have put all this code into a loop. Using `np.arange` we have defined a range of sample sizes. Run the loop to see how sample size affects the results.

So far, these experiments have been magically noise-free. Set `noNoise=False` in the call to `generateNiceData()` in the first cell and re-run it; in the loop below, make the same change on the line marked `# CHANGE THIS LINE`. How does noise affect the results?

In [None]:
# We will store the error metrics for each sample size
errors = pd.DataFrame(index=np.arange(20, 200, 20), columns=['MAE', 'MSE', 'RMS'], dtype=float)  # float dtype so the columns can be plotted

for number_of_samples in np.arange(20, 200, 20):
    print('Running for {0} samples...'.format(number_of_samples))
    problem = generateNiceData(3,number_of_samples,noNoise=True) # CHANGE THIS LINE
    
    #Data from pandas series needs to be converted to familiar numpy arrays
    x = problem.loc[:,['A','B','C']].values
    y = problem.loc[:,'Y'].values
    
    # Split test/train data
    x_train, x_test, y_train, y_test = model_selection.train_test_split(x, y, test_size=0.2, random_state=0)
    
    # Initialise Linear Regression Model
    # Note: the 'normalize' argument was removed in scikit-learn 1.2, so it is not passed here.
    mv_regression = linear_model.LinearRegression()
    
    # Fit regression model to feature data x_train and target y_train
    mv_regression.fit(x_train,y_train)
    
    # Fill vector y_predict with estimates of target y from data x_test
    y_predict = mv_regression.predict(x_test)
    
    # Print Fit
    print("Intercept {}".format(mv_regression.intercept_))
    print("Slope {}".format(mv_regression.coef_))
    
    # Print Error Metrics
    results = pd.DataFrame({'True value': y_test.flatten(), 'Predicted': y_predict.flatten()})
    print('Mean Absolute Error: {}'.format(metrics.mean_absolute_error(y_test, y_predict)))
    errors.loc[number_of_samples,'MAE'] = metrics.mean_absolute_error(y_test, y_predict)
    print('Mean Squared Error: {}'.format(metrics.mean_squared_error(y_test, y_predict)))
    errors.loc[number_of_samples,'MSE'] = metrics.mean_squared_error(y_test, y_predict)
    print('Root Mean Squared Error: {}'.format(np.sqrt(metrics.mean_squared_error(y_test, y_predict))))
    errors.loc[number_of_samples,'RMS'] = np.sqrt(metrics.mean_squared_error(y_test, y_predict))
    
    # Plot
    f, axis = plt.subplots(1,1)  # create a figure with a single axis (subplot)
    axis.plot(y_test, y_test, '-')  # plot true vs true, i.e. the ideal case
    axis.scatter(y_test,y_predict)  # plot a scatter of the true value against the prediction value
    axis.set_ylabel('Predicted Value')
    axis.set_xlabel('True Value')
    
    plt.show()
    
# Show errors and create a quick plot
display(errors)

fErr, axErr = plt.subplots(1,1)  # create a figure with a single axis (subplot)

errors.plot(ax=axErr)
axErr.set_ylabel('Error Value')
axErr.set_xlabel('Sample Size')

plt.show()
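
The effect of noise can also be seen in isolation on a small synthetic problem. The sketch below is only an illustration: the coefficients, the noise level (standard deviation 1.0) and the helper `held_out_rmse` are assumptions made for this example, and the data do not come from `generateNiceData`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical data: 200 samples, 3 features, and a known linear rule
inputs = rng.uniform(0.0, 10.0, size=(200, 3))
clean_target = 1.0 + inputs @ np.array([2.0, -3.0, 0.5])

# The same target with Gaussian measurement noise added
noisy_target = clean_target + rng.normal(0.0, 1.0, size=200)


def held_out_rmse(features, target):
    """Fit on 80% of the data and return the RMSE on the remaining 20%."""
    f_train, f_test, t_train, t_test = train_test_split(
        features, target, test_size=0.2, random_state=0
    )
    model = LinearRegression().fit(f_train, t_train)
    return np.sqrt(mean_squared_error(t_test, model.predict(f_test)))


rmse_clean = held_out_rmse(inputs, clean_target)
rmse_noisy = held_out_rmse(inputs, noisy_target)

print('RMSE without noise: {}'.format(rmse_clean))
print('RMSE with noise: {}'.format(rmse_noisy))
```

Without noise the linear model can reproduce the target almost exactly. With noise, the test error cannot fall much below the noise level, however many samples are used for training.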

## Advanced Task A.4

Given your extensive knowledge of the property Y, you suspect that measurements B and C have a non-linear relation to Y.

Modify the following cell to use the functions [np.power()](https://docs.scipy.org/doc/numpy/reference/generated/numpy.power.html) and [np.sin()](https://docs.scipy.org/doc/numpy/reference/generated/numpy.sin.html?highlight=sin#numpy.sin) to create additional non-linear features `z_b` and `z_c`.

*Hint:* try different powers of `x_b`, from 2 to 8 to see when the fit is the closest. 

*Hint:* Apply the sine function directly to `x_c`: `z_c = np.sin(x_c)`

In [None]:
x_a = problem.loc[:,['A']].values
x_b = problem.loc[:,['B']].values
x_c = problem.loc[:,['C']].values

#Here we create *new features*. Change them to be a power and sin of existing features
z_b = x_b
z_c = x_c 

#You can either add z-features to your existing measurements, or replace x_b and x_c with the non-linear features.
x=np.concatenate((x_a,x_b,x_c,z_b,z_c),axis=1)
x=np.concatenate((x_a,z_b,z_c),axis=1)
#comment out one of the above lines to see different results
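
To see why a non-linear feature can help a linear model, the sketch below uses a made-up one-dimensional problem, unrelated to the data in `problem`. The variable `u`, the target `t` and its quadratic rule are assumptions for illustration. For brevity the error is measured on the same data used for fitting, so no train/test split is made here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Hypothetical feature u and a target that depends on u squared
u = np.linspace(-3.0, 3.0, 60).reshape(-1, 1)
t = 4.0 * np.power(u, 2).ravel() + 1.0

# Linear regression on the raw feature cannot follow the curve
raw_model = LinearRegression().fit(u, t)
rmse_raw = np.sqrt(mean_squared_error(t, raw_model.predict(u)))

# Linear regression on the engineered feature u**2 can
u_squared = np.power(u, 2)
squared_model = LinearRegression().fit(u_squared, t)
rmse_squared = np.sqrt(mean_squared_error(t, squared_model.predict(u_squared)))

print('RMSE with raw feature: {}'.format(rmse_raw))
print('RMSE with squared feature: {}'.format(rmse_squared))
```

The model is still linear in its coefficients; only the inputs have been transformed. That is why the same `LinearRegression` class can be reused once the new features are in place.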

## Advanced Task A.5

Copy the code from the cells above (Tasks A.1 and A.2) to split the new `x` and try linear regression with your non-linear features. Do you need to change anything?

## Advanced Task A.6

Display the results on the test dataset by executing the following cell.

In [None]:
results = pd.DataFrame({'True value': y_test.flatten(), 'Predicted': y_predict.flatten()})
print('Mean Absolute Error: {}'.format(metrics.mean_absolute_error(y_test, y_predict))  )
print('Mean Squared Error: {}'.format(metrics.mean_squared_error(y_test, y_predict)) )
print('Root Mean Squared Error: {}'.format(np.sqrt(metrics.mean_squared_error(y_test, y_predict))))

f, axis = plt.subplots(1,1)  # create a figure with a single axis (subplot)

axis.plot(y_test, y_test, 'x')  # plot true vs true, i.e. the ideal case
axis.scatter(y_test,y_predict)  # plot a scatter of the true value against the prediction value
axis.set_ylabel('Predicted Value')
axis.set_xlabel('True Value')

plt.show()