#### **Individual Reflection - Assessment 1 - Lucy Anthony**

For this project, I used a KNN model to perform regression analysis on our COVID-19 data. For the regression problem, we considered the features government stringency, CH index, government response index, economic support index and time, with the reproduction rate of the COVID-19 virus as the target variable. We divided our training and test data by country, so our aim was to predict the reproduction rate for a specific country using the features for that country. This could be applicable in cases where the features for a country are known (e.g. we know what measures its government put in place and how much economic support it received), but we do not know the true values of its COVID-19 reproduction rate.
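As an illustration of what dividing the data by country means in practice, here is a minimal sketch using scikit-learn's `GroupShuffleSplit`. The data are synthetic stand-ins, and the country labels, array shapes and the 30% test fraction are assumptions for the example, not the values used in our project.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)

# Synthetic stand-in data: 10 hypothetical countries, 30 daily rows each.
n_countries, n_days = 10, 30
countries = np.repeat(np.arange(n_countries), n_days)  # group label per row
X = rng.normal(size=(n_countries * n_days, 5))         # 5 placeholder features
y = rng.normal(size=n_countries * n_days)              # placeholder target

# Hold out 30% of the countries (not 30% of the rows) as the test set.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=countries))

train_countries = set(countries[train_idx])
test_countries = set(countries[test_idx])
print("train countries:", sorted(train_countries))
print("test countries: ", sorted(test_countries))
```

Because every row of a country lands on one side of the split, the test error reflects performance on countries the model has never seen, which matches the use case described above.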

KNN is a non-parametric model: it does not assume a fixed functional form, and its main assumption is that points that are close in feature space have similar values of the target variable. This is useful for our dataset with multiple features, since it is unlikely that the relationship between the features and the target variable is linear (as is assumed in parametric models such as linear regression). Another benefit of the KNN model is its interpretability. Because each prediction is an average over a small set of neighboring observations described by our 5 features, we can trace any prediction back to the data points that produced it.

In order to optimise the model's performance, I also adjusted the hyperparameters. By changing the number of neighbors, I saw that the MSE decreased as the number of neighbors increased. Mathematically, increasing the number of neighbors averages over more points, which smooths the model and decreases its variance, and this can improve the MSE by reducing the influence of outliers. This relates to the bias-variance trade-off: increasing the number of neighbors increases the bias (the error incurred by the model due to oversimplification) but reduces the variance (how much the model would change under new training data). Ideally, we would find the number of neighbors that balances the two. Traditionally, the way to do this is to look for the point where a small increase in bias buys a large decrease in variance. I also explored how different metrics, weights and scaling methods affected model performance. The results showed that the Manhattan metric performed better than the Euclidean metric, potentially because Euclidean distance squares the differences between feature values and so amplifies the effect of outliers. Finally, I considered cross-validation, variable selection and dimensionality reduction in order to further optimise the model, and found that PCA (principal component analysis) worked significantly better than RFE (recursive feature elimination), and that cross-validation further improved the MSE.
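The effect of the number of neighbors on the MSE can be sketched as a simple sweep. The example below is illustrative only: it uses synthetic data with a made-up non-linear target, and the list of k values, the 5-fold cross-validation and the Manhattan metric are assumptions chosen for the demonstration.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: 5 features, non-linear signal plus noise.
X = rng.uniform(-3, 3, size=(400, 5))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=400)

def cv_mse(k):
    """Mean 5-fold cross-validated MSE of a scaled KNN regressor with k neighbors."""
    model = make_pipeline(
        StandardScaler(),
        KNeighborsRegressor(n_neighbors=k, metric="manhattan"),
    )
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    return -scores.mean()  # scikit-learn reports negated MSE

k_values = [1, 3, 5, 10, 20, 50, 100]
mse_by_k = {k: cv_mse(k) for k in k_values}
best_k = min(mse_by_k, key=mse_by_k.get)

for k, value in mse_by_k.items():
    print(f"k={k:>3}  CV MSE={value:.3f}")
print("lowest CV MSE at k =", best_k)
```

On noisy data, very small k tends to give a high-variance fit, while very large k eventually over-smooths, so the cross-validated MSE is typically lowest somewhere in between.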

The choice of MSE (Mean Squared Error) as our performance metric was motivated by several factors. MSE averages the squared differences between the actual and predicted values, so it measures how far the predictions fall from the truth on average. The squaring of errors means that the MSE heavily penalises large errors, which is relevant in this context, where large mistakes about COVID-19 data can have serious consequences for public policy, for example. Squaring also makes the MSE more sensitive to outliers. While this sensitivity can be an advantage, it is important to note that it can over-penalise outliers that are due to noise.
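To make the outlier sensitivity concrete, the following small sketch compares the MSE with the mean absolute error (MAE) on made-up numbers. The values are purely illustrative and are not taken from our dataset.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: average of squared residuals."""
    residuals = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.mean(residuals ** 2))

def mae(y_true, y_pred):
    """Mean absolute error: average of absolute residuals."""
    residuals = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.mean(np.abs(residuals)))

# Made-up reproduction rates and two sets of predictions.
y_true = np.array([1.0, 1.2, 0.9, 1.1])
small_errors = y_true + np.array([0.1, -0.1, 0.1, -0.1])  # every error is 0.1
one_outlier = y_true + np.array([0.1, -0.1, 0.1, 2.0])    # one error is 2.0

for name, pred in [("small errors", small_errors), ("one outlier", one_outlier)]:
    print(f"{name:>12}: MSE={mse(y_true, pred):.4f}  MAE={mae(y_true, pred):.4f}")
```

A single error of 2.0 raises the MSE from 0.01 to about 1.01 (roughly a hundredfold increase), whereas the MAE rises from 0.1 to 0.575, which is why the MSE is described as penalising large errors heavily.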

Mathematically, the KNN model predicts the value for each test point by taking the (optionally weighted) average of the values of the target variable (reproduction rate) from its nearest neighbors in the training set. Here, "distance" is calculated between the independent variables using a metric such as Euclidean distance or Manhattan distance. In the context of this problem, this means that the KNN model takes each point in the (5-dimensional) test feature set, finds its k nearest neighbors in the training feature set, and then takes the average of the corresponding values in the training target variable set. This average is its prediction. The distance between neighbors is calculated in 5-dimensional space using our metric of choice, and the features are scaled before being used for training and testing so that no feature has a disproportionate effect. For example, since KNN is distance-based, a feature with a much larger range would otherwise contribute disproportionately to the distance calculation. The weighting of each feature is also a hyperparameter, as it is possible to choose an unequal weighting if some variables are more significant than others. Thus, the choice of distance metric, scaler and weighting is highly important in a KNN model.
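The prediction step described above can be written out by hand in a few lines. This is a minimal sketch in plain NumPy rather than the implementation we used; the function name `knn_predict`, the 2-dimensional toy data and the inverse-distance weighting option are assumptions for illustration.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3, metric="euclidean", weights="uniform"):
    """Predict one target value as the (weighted) mean of the k nearest training targets."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    diffs = X_train - np.asarray(x_query, dtype=float)

    if metric == "euclidean":
        dists = np.sqrt((diffs ** 2).sum(axis=1))
    elif metric == "manhattan":
        dists = np.abs(diffs).sum(axis=1)
    else:
        raise ValueError(f"unsupported metric: {metric}")

    nearest = np.argsort(dists)[:k]  # indices of the k closest training points
    if weights == "uniform":
        return float(y_train[nearest].mean())

    # Inverse-distance weighting: closer neighbors count for more.
    d = dists[nearest]
    if np.any(d == 0):  # an exact match dominates the prediction
        return float(y_train[nearest][d == 0].mean())
    w = 1.0 / d
    return float(np.sum(w * y_train[nearest]) / np.sum(w))

# Toy 2-dimensional example (the real problem has 5 scaled features).
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]])
y_train = np.array([1.0, 2.0, 3.0, 10.0])
x_query = np.array([0.25, 0.0])

uniform_pred = knn_predict(X_train, y_train, x_query, k=2)
distance_pred = knn_predict(X_train, y_train, x_query, k=2, weights="distance")
print(uniform_pred, distance_pred)
```

With k = 2 the two nearest points have targets 1.0 and 2.0, so the uniform prediction is 1.5; with inverse-distance weights the closer point counts three times as much and the prediction moves to 1.25.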

If my intention were to "win", I would further adjust my hyperparameters by finding the optimal number of neighbors, k. In my analysis I considered a few values of k, but it is computationally expensive to train the model on many large values of k, especially when using additional techniques such as PCA and cross-validation. However, I saw that the performance of the KNN model improved as k increased, so in a different project I would continue to increase the value of k until the MSE stopped improving. Additionally, I dealt with NaN (missing) values by simply removing rows which contained any NaN values. While this is somewhat effective, it risks losing important information or nuances in the data. Equally, if the missing values are not random (which they almost certainly are not in our case), then removing them will introduce a bias into our model. In our case, missing values more often come from less economically developed countries, or those which received less economic support through the pandemic. Thus, removing missing values biases the model towards more economically advanced countries. If my intention were to "win", I could address this problem using data imputation, where missing values are replaced with the mean or median of the corresponding feature.
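The difference between dropping rows and imputing can be sketched with scikit-learn's imputers. The small array below is made up for illustration, and the use of `SimpleImputer` with the median and `KNNImputer` with 2 neighbors are assumptions about how imputation might be done, not a record of what we ran.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

# Made-up feature matrix with missing entries (rows = observations).
X = np.array([
    [50.0, 60.0, 25.0],
    [55.0, np.nan, 30.0],
    [80.0, 75.0, np.nan],
    [20.0, 30.0, 10.0],
    [np.nan, 45.0, 15.0],
])

# Approach used in the project: drop every row containing a NaN.
complete_rows = X[~np.isnan(X).any(axis=1)]

# Alternative 1: replace each NaN with the median of its column.
X_median = SimpleImputer(strategy="median").fit_transform(X)

# Alternative 2: replace each NaN using the 2 most similar rows.
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)

print("rows kept after dropping NaNs:", complete_rows.shape[0], "of", X.shape[0])
print("median-imputed:\n", X_median)
print("KNN-imputed:\n", X_knn)
```

Here dropping rows keeps only 2 of 5 observations, while both imputers keep all 5. In a real pipeline the imputer would be fitted on the training countries only and then applied to the test countries, to avoid leaking test information.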

In a competition setting, I would need to consider the accuracy, efficiency and interpretability of my model. Due to my use of cross-validation, hyperparameter tuning and variable selection, my model performs well in terms of my chosen metric. However, I would also experiment with different metrics such as the Sum of Absolute Errors (SAE), which penalises outliers less harshly, to see whether my model performed better under a different metric. Also, in a competitive setting, KNN imputation of missing data might give me an edge over simply removing missing values. Tuning the hyperparameters using GridSearchCV finds the best combination within the grid I specified for my chosen performance metric. However, I did not exhaustively test all possible combinations of hyperparameters, so this is something that could also be improved in a competition setting. I improved the efficiency of my model using the dimensionality reduction technique PCA, which reduces the size of the feature space and thus makes my model more computationally efficient. Additionally, KNN is known to be an interpretable model (though this is somewhat reduced by the use of PCA). The underlying assumption that distance in the features correlates with distance in the target variable is also intuitive and improves the model's transparency.
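A wider hyperparameter search of the kind described above could be sketched as a single pipeline passed to `GridSearchCV`. The data are synthetic, and the step names, grid values and 5-fold cross-validation are assumptions for the example rather than the settings from our notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the 5 features and the reproduction rate.
X = rng.normal(size=(300, 5))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=300)

# Scaling, PCA and KNN are tuned together so each fold is preprocessed
# using only its own training portion.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("knn", KNeighborsRegressor()),
])

param_grid = {
    "pca__n_components": [2, 3, 5],
    "knn__n_neighbors": [3, 5, 10, 20],
    "knn__metric": ["euclidean", "manhattan"],
    "knn__weights": ["uniform", "distance"],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)

best_params = search.best_params_
best_cv_mse = -search.best_score_  # scikit-learn reports negated MSE
print("best parameters:", best_params)
print(f"best CV MSE: {best_cv_mse:.3f}")
```

The search is exhaustive only over the listed values, so the result is the best combination in this grid rather than a global optimum. For data split by country, a grouped splitter such as `GroupKFold` could be passed as `cv`, with the country labels supplied through the `groups` argument of `fit`.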

The link to our GitHub repository is [here](https://github.com/markmilner21/Assessment-1-Supervised-Prediction.git)