Data Analysis – Advanced Statistics with Python  
Dr. Julia Jerke | HS 2021


## Training sheet - Part A
**December 21, 2021**

# _Clustering, principal component analysis and machine learning_





**Notes:**  
- **This sheet is meant as a summarizing exercise in which you can repeat and practice the topics that we covered in the course. You do NOT have to submit your solution!**
- **However, if you wish to receive feedback, you can send me your script to: jerke@soziologie.uzh.ch** 
- **An example solution of this exercise will be published on OLAT at the end of January.**


---

**For this exercise we will work with a data set that contains clinical measurements from breast cancer patients. The overarching goal is to train a machine learning model that can distinguish between malignant (M) and benign (B) cancer diseases. But first we will apply cluster analysis and principal component analysis to inspect and prepare our data set.**

**Load the data set _breast_cancer_wisconsin.csv_ into Python.**

# 1. Data inspection and preparation

1. In a first step we will inspect the data set and the variables that it contains. 
    - How many patients does the data set contain?
    - Are there missing values?
    - Of what data type are the clinical measurements?
2. There are 30 different clinical measurements. Inspect them by printing summary statistics and by plotting a heatmap to inspect their correlations. Describe what you can infer from the heatmap.
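
The inspection steps above could be approached along the following lines. This is only a minimal sketch: the small synthetic data frame and its column names (`id`, `radius_mean`, `area_mean`, `texture_mean`) are assumptions standing in for the real file, which you would load with `pd.read_csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for pd.read_csv("breast_cancer_wisconsin.csv"):
# a small synthetic frame with an id column and three numeric measurements.
rng = np.random.default_rng(0)
radius = rng.normal(14, 3, size=50)
df = pd.DataFrame({
    "id": np.arange(50),
    "radius_mean": radius,
    "area_mean": np.pi * radius**2 + rng.normal(0, 5, size=50),
    "texture_mean": rng.normal(19, 4, size=50),
})

n_patients = df.shape[0]                # number of rows = number of patients
missing_per_column = df.isna().sum()    # count of missing values per column
print(n_patients)
print(missing_per_column)
print(df.dtypes)                        # data type of every column

measurements = df.drop(columns="id")    # keep only the measurement columns
print(measurements.describe().T)        # summary statistics, one row per variable

corr = measurements.corr()              # pairwise Pearson correlations
print(corr.round(2))
# The matrix can be drawn as a heatmap, e.g. with seaborn.heatmap(corr).
```

In the heatmap, blocks of strongly correlated variables (here `radius_mean` and `area_mean` by construction) hint at redundant measurements.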


# 2. Cluster analysis

Unfortunately the data set does not contain the diagnosis of the cancer patients. We, therefore, do not know which cancer cases are benign and which are malignant.

1. Run a cluster analysis with all thirty clinical measurements to identify clusters within the cancer cases that show similarity. 
    - Make sure to properly select your data for the analysis and to standardize the variables before the analysis.
    - Write a loop to run the cluster analysis for different numbers of clusters (e.g. 1 to 10 clusters). Calculate the SSE each time and store it in a list.
    - How many clusters does the elbow plot suggest?
    - What is the SSE of your preferred solution?
2. Independently of the previous task, we will continue with a two-cluster solution since we expect two different types of cancer diseases in the data (malignant vs. benign). Append the cluster labels to your original data set.
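
One possible way to structure the loop and the two-cluster solution is sketched below. It assumes k-means from scikit-learn as the clustering method and uses synthetic data from `make_blobs` with hypothetical column names (`m0` … `m4`) in place of the thirty clinical measurements.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the clinical measurements: two synthetic groups.
X_raw, _ = make_blobs(n_samples=200, centers=2, n_features=5, random_state=0)
df = pd.DataFrame(X_raw, columns=[f"m{i}" for i in range(5)])

# Standardize so that every variable has mean 0 and unit variance.
X_std = StandardScaler().fit_transform(df)

# Fit k-means for k = 1..10 and collect the SSE (scikit-learn calls it inertia_).
sse = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_std)
    sse.append(km.inertia_)

# Two-cluster solution; the labels are appended to the original (unscaled) frame.
km2 = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_std)
df["cluster"] = km2.labels_
print(sse)
print(df["cluster"].value_counts())
```

Plotting `sse` against `range(1, 11)` gives the elbow plot. Note that the scaler is fitted before the `cluster` column is added, so the labels never enter the standardized data.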

# 3. Principal component analysis

The ratio between the number of observations and the number of variables in our data set is not ideal (569:33). We will therefore try to reduce the number of variables with a principal component analysis.

1. Conduct a principal component analysis. Make sure to use the standardized variables from before.
    1. Print the eigenvalues of the components.
    2. What is the explained variance of each component?
    3. Plot the explained variance as well as the cumulative explained variance.
2. How many components would you choose? Consider the following criteria for that:
    1. Eigenvalue criterion.
    2. Inspect the scree plot and use the KneeLocator to identify the knee.
    3. The explained variance should be at least 85%.
    4. Since we are interested in reducing the number of variables, which of these criteria suggests the smallest number of components?
3. Continue with your preferred component solution and append the components to your data set.
4. Repeat the cluster analysis from before, but now with the components instead of the original measurements.
5. Compare the results from the two different cluster analyses. Create a crosstab to see whether the cluster solutions overlap. What do you notice?
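
The PCA steps and the comparison of the two cluster solutions might look roughly like this. The sketch again relies on synthetic data as a stand-in, picks the 85% criterion for the final component count purely for illustration, and assumes that the KneeLocator comes from the `kneed` package (shown only as a comment).

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the clinical measurements.
X_raw, _ = make_blobs(n_samples=200, centers=2, n_features=8, random_state=1)
X_std = StandardScaler().fit_transform(X_raw)

pca = PCA().fit(X_std)                    # keep all components at first
eigenvalues = pca.explained_variance_     # variance along each component
ratio = pca.explained_variance_ratio_     # share of the total variance
cumulative = np.cumsum(ratio)

n_kaiser = int((eigenvalues > 1).sum())          # eigenvalue criterion
n_85 = int(np.argmax(cumulative >= 0.85) + 1)    # first count reaching 85 %
# Assumption: the knee of the scree plot could be located with kneed, e.g.
# KneeLocator(list(range(1, len(ratio) + 1)), ratio,
#             curve="convex", direction="decreasing").knee

# Re-run the two-cluster solution on the retained components and compare.
scores = PCA(n_components=n_85).fit_transform(X_std)
labels_all = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_std)
labels_pca = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scores)
table = pd.crosstab(labels_all, labels_pca,
                    rownames=["all variables"], colnames=["components"])
print(eigenvalues.round(2), n_kaiser, n_85)
print(table)
```

Cluster numbers are arbitrary, so two solutions agree when the crosstab concentrates its counts in one cell per row, regardless of whether that cell lies on the diagonal.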

# 4. Building a machine learning model using K-Nearest Neighbors (KNN)

Meanwhile, the cancer cases from the data set have been manually reviewed by medical experts and assigned a diagnosis. We can use that information to train a model that will be able to detect future malignant cancer cases. But first, we want to review our cluster solutions with the new information.

1. The diagnoses are stored in the data set _breast_cancer_wisconsin_experts_. 
    1. Load the data into Python. 
    2. How many cases are benign and how many are malignant?
    3. Append the diagnosis data to your main data frame.
2. To evaluate the performance of our cluster analysis, we want to compare the cluster assignments with the actual diagnosis. Therefore, calculate a crosstab and discuss the result.
3. We now want to train a model that might predict the nature of future cancer cases. 
    1. Using the KNN algorithm, start with a 1-Neighbor model.
    2. Plot the confusion matrix.
    3. Print the classification report. Interpret _precision_, _recall_ and _accuracy_. What is your opinion on the size of these values?
4. We should further check whether it is better to use more than one neighbor. 
    1. Write a loop that trains models with a range of neighbors from 1 to 100.
    2. For each model, calculate the accuracy score, the precision score and the recall score and append them each to a respective list. You can load the precision with `from sklearn.metrics import precision_score` and the recall with `from sklearn.metrics import recall_score`.
    3. Create a graph that shows how accuracy, precision and recall change with the number of neighbors.
    4. To be able to make predictions, we have to decide on a model. Which number of neighbors do you choose? Also consider the following question: when detecting malignant cancer, which value would you rather maximize, precision or recall?
5. Continue with your preferred neighbor solution and train your final model.
6. We are now able to try and predict the type of cancer for new cases. Load the data set _breast_cancer_wisconsin_newdata.csv_ that contains 10 new cases with unknown diagnosis. Using the trained model, predict the type of cancer. Do not forget to standardize the new data accordingly.
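
The neighbor loop and the final prediction could be organized as in the following sketch. Several parts are assumptions that the sheet does not prescribe: the synthetic data from `make_classification` (with `y = 1` playing the role of "malignant"), the 75/25 train/test split, and the rule of picking the number of neighbors with the highest recall.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data: y = 1 plays the role of "malignant".
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Fit the scaler on the training data only and reuse it everywhere else.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

neighbors = range(1, 101)
acc, prec, rec = [], [], []
for k in neighbors:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train_s, y_train)
    pred = model.predict(X_test_s)
    acc.append(accuracy_score(y_test, pred))
    prec.append(precision_score(y_test, pred, zero_division=0))
    rec.append(recall_score(y_test, pred, zero_division=0))

# One possible rule (an assumption): choose the k with the highest recall.
best_k = neighbors[int(np.argmax(rec))]
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train_s, y_train)

# New cases must be scaled with the already fitted scaler before predicting.
X_new = X[:10] + 0.1    # hypothetical "new" cases
new_pred = final_model.predict(scaler.transform(X_new))
print(best_k, new_pred)
```

With string labels such as `M` and `B`, `precision_score` and `recall_score` need `pos_label="M"`. For a single model, `confusion_matrix` and `classification_report` from `sklearn.metrics` provide the confusion matrix and the report asked for above.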

### WELL DONE!!! 
### You finished the first part of the extensive training exercise. You are able to apply cluster analysis and principal component analysis to large and complex data and to train a first simple machine learning model to make predictions.