<div class="alert alert-block alert-warning">
    <h1><center> DATA UNDERSTANDING  </center></h1>

### <font color = red> Packages needed for this exercise: </font>
- The exercise can be done without importing any extra packages, but you can import new ones but bear in mind that if you are importing many new packages, you may be complicating your answer.

In [None]:
# --- Libraries with a short description ---
import pandas as pd # for data manipulation
import matplotlib.pyplot as plt # for plotting
import numpy as np #for numeric calculations and making simulated data.
import seaborn as sns # for plotting, an extension on matplotlib

# - sklearn has many data analysis utility functions like scaling as well as a large variety of modeling tools.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import minmax_scale
from sklearn.preprocessing import scale
from sklearn.manifold import TSNE

# This forces plots to be shown inline in the notebook
%matplotlib inline

This exercise relates to the _data understanding_ and  _data preparation_ stages of the Crisp Data Mining (CRISP-DM) model presented on the course. The questions at this stage of a data-analysis project are for example:

- Is the data quality sufficient?
- How can we check the data for problems?
- How can we clean the data?
- How is the data best transformed for modeling?

It may be tempting to just run a model on data without checking it. However, not doing basic checks can ruin your whole analysis and make your results invalid as well as mislead you in further analyses. There is no excuse for not plotting and checking that the data is as we expect and clean. In this exercise we do just that, check the validity of data and familiarize ourselves with a dataset, also discussing preprocessing and multi-dimensional plotting.

------------

##  <font color = dimgrey> 1. Introduction to the dataset </font>

The dataset in this exercice contains comprehensive health information from  hospital patients with and without cardiovascular disease. The target variable "cardio," reflects the presence or absence of the disease, which is characterized by a buildup of fatty deposits inside the arteries (blood vessels) of the heart.

 -------
As is often the case with data analysis projects, the features/variables have been retrieved from different sources:
- doctors notes (texts)
- examination variables that have come from a database containing lab results or taken during a doctors examination
- self reported variables

--------------
The exercise data has the following columns/attributes:

| Feature | Type | Explanation |
| :- | :- | :-
| age | numeric | The age of the patient in days
| gender | binary | Male/Female
| body_mass | numeric | Patient's measured weight, in kilograms (kg).
| height | numeric | Patient's measured height, in centimeters (cm).
| blood_pressure_high | numeric | Measured Systolic blood pressure
| blood_pressure_low | numeric | Measured Diastolic blood pressure
| smoke | binary | A subjective feature based on asking the patient whether or not he/she smokes
| active | binary |  A subjective feature based on asking the patient whether or not he/she exercises regularly
| serum_lipid_level | categorical | Serum lipid / Cholesterol associated risk information evaluated by a doctor
|family_history| binary | Indicator for the presence of family history of cardiovascular disease based on medical records of patients
| cardio | binary | Whether or not the patient has been diagnosed with cardiac disease.

-----------
#### ***Reading data***

It is good practice to read the features in using their correct types instead of fixing them later. Below, there is ready-made code for you to read in the data, using the data types and column names listed in the above table. Don't change the name of the variable, _data_. It is important in later exercises (for example in ex. 5e) that this is the name of the variable. <font color = red> If you have the dataset in the same folder as this notebook, the path already given to you should work. </font>

---------------

In [20]:
 # --- READ IN DATA (no need to change) --------
data_path = "CardioCare_ex1.csv" #if you just give the name of the file it will look for the data in the same folder as your script
data = pd.read_csv(data_path, dtype = {'age': 'int', 'height': 'int', 'body_mass':'int', 'blood_pressure_low':'int', 'blood_pressure_high':'int', 'gender': 'boolean', 'smoke': 'boolean',
       'active':'boolean', 'cardio':'boolean', 'serum_lipid_level':'category', 'family_history':'boolean'}) #the main data you use in this exercise should have this variable name, so that code given for you further on will run.

NameError: name 'pd' is not defined

---------
***Exercise 1 a)***
1. First, print out the first five rows of the data.

2. Then, save the feature names to lists by their types:
   - Create three lists named **numeric_features**, **binary_features**, and **categorical_features**. 
   - These lists should contain the **names** of the features based on their types:
     - Numeric features (e.g., `age`, `body_mass`, etc.)
     - Binary features (also known as boolean, e.g., `gender`, `smoke`, `cardio`, etc.)
     - Categorical features (e.g., `serum_lipid_level`)

---

#### Important Notes:

When working with DataFrames, it is often useful to organize column names into lists. This practice simplifies data manipulation and analysis. Once the feature names are organized, you can easily select, filter, or apply operations to specific groups of features. This also helps to avoid typing errors and reduces repetition.

For example, once you create your list of numeric features, you can select all columns containing numeric data with the following command:

```python
data[numeric_features]


In [None]:
# --- Your code here for 1 a) ---

#printing first five rows of original dataset
data.head()

#List for Numeric features
numeric_features = ['age','height','body_mass','blood_pressure_high','blood_pressure_low']

#List for Boolean features
binary_features = ['gender','smoke','active','cardio','family_history']

#List for Categorical features
categorical_features = ['serum_lipid_level']

-----
In many data analysis projects, the data is often not collected specifically for analysis purposes. Instead, it may come from various sources or be collected for entirely different reasons. As a result, the data might not be well-formatted and could contain errors or inconsistencies. 

It might be tempting to immediately apply a model to the data "as is," but it is crucial to first **check the data for quality issues**. Ignoring potential data issues can lead to misleading conclusions, undermining the entire analysis. 

### Why Data Quality Checks Matter:

One standard routine to ensure data quality is:
1. **Calculate descriptive statistics** for each feature. This gives an overview of the distribution, range, and possible anomalies.
2. **Visualize the features** to check whether the values are realistic and within expected ranges.

This step helps identify outliers, incorrect data entries, or formatting issues, ensuring that your analysis is based on clean and reliable data.

---

### Descriptive Statistics and Data Types

It's important to note that certain descriptive statistics might not be meaningful for specific types of features. For instance, calculating the "mean" for binary or categorical features may not offer valuable insight. In **pandas** (as in many other data analysis packages), some functions behave differently depending on the data type of the column.

In the following exercises, we will explore:
- **Descriptive statistics** for the dataset.
- How the results and behavior of descriptive functions can vary based on the data type (e.g., numeric vs. categorical features).


----------
***Exercise 2 a)***  Print out the data types of your dataset below.

_Perhaps the most common data types in pandas (see https://pandas.pydata.org/docs/user_guide/basics.html#basics-dtypes) are **float**, **int**, **bool** and **category**._

In [None]:
# --- 2 a) Print the feature types of your dataset --- #

#Printing feature types from original dataset
data.dtypes

--------
***Exercise 2 b)*** Use the **DataFrame.describe() method** in the cell below on your data.   


In [None]:
# --- Your code for 2 b) --- #

#Calling describe method on original dataset
data.describe()

--------
***Exercise 2 c)*** Did you get all of the features statistics or not? What do you think happened?


<font color="green">Your answer for 2 c)</font>

No we didn't get statistics of all the features. What happend here is that the method only described statistics of columns that are of numeric nature and ignore all those columns that are of boolean or category type.

----------
***Exercise 2 d)*** Calculate descriptives for the binary (boolean) features and the categorical feature <br>

_tip: in python, same type data structures can in many cases be concatenated using the + operator. If youre using the lists of names you created to subset, you can concatenate the two lists of feature names and use the resulting list to help you subset the dataframe_

In [None]:
# 2 d) Your code here #

#Slicing boolean features from original dataset and calling describe method on it
new_df = data.iloc[:, 6:11] 
new_df.describe()

Now, we will explore **what happens if the data is read using the default settings** (i.e., without specifying the data types for the features). In this case, we are **not providing information about the data types (dtypes)** to `pd.read_csv`, meaning no additional arguments are passed when loading the data.

Run the cell below (you don't need to modify the code) and observe the output of the data that has been incorrectly read due to missing dtype information. Then, compare this output with the data you loaded earlier using the correct dtypes, and check the descriptive statistics.


In [None]:
# read in the dataset with no arguments
wrongly_read_data = pd.read_csv(data_path)

# calculate descriptives for the data that was wrongly read in.
wrongly_read_data.describe()



***Exercise 2 e)*** 
Based on the output above, can you identify what went wrong with the data presentation? Why was it important to correctly define the data types when loading the dataset?


<font color="green">Your answer for 2 e)</font>

I think for the boolean variables it interpreted those variables as numeric too since it has a value of 0 and 1. Since we didn't defined any data types so instead of keeping its original data type, the method treated it as a numeric value.

-----------------------
## 3. Plotting numeric features
Descriptives don't really give a full or intuitive picture of the distribution of features. Next, we will make use of different plots to check the data quality.


----------
***Exercise 3 a)*** Plot histograms for the **numeric features** to visually inspect their distributions. (Refer to the tutorial if you need assistance with plotting.)


_tip: When using `plt.subplots()`, if you provide only one argument for the grid size (e.g., `plt.subplots(3)`), it will create a **one-dimensional grid**. You can then index this grid with a single index, making it easier to loop through and assign plots to each subplot.

---

In [None]:
# --- Your code for 3 a) here --- #

#plotting histogram of numeric features
data[numeric_features].hist(bins=10, figsize=(8, 8), grid=False)
plt.tight_layout()
plt.show()

_______
## 4. Plotting binary and categorical features

***Exercise 4 a)*** Plot **barplots** for each of the **non-numeric features** in the dataset. Make sure to **use fractions** instead of the actual frequencies of the categories.

 Tips:
- To create the barplots, refer to the documentation for `axes.bar`.
- To obtain the fractions of each category, use the `value_counts()` function with the `normalize` argument set to `True`. This will return the relative frequencies of each category (proportion of each category relative to the total).

**Note:** 

If you imported boolean features as `pandas` dtype `boolean`, you may find it easier to work with plotting libraries like `matplotlib` when these values are represented as numbers (`0` and `1`) instead of `True` and `False`.

If you encounter any errors while plotting, you can temporarily convert these boolean values to integers or floats using the `.astype()` method:

```python
# Example of converting boolean to int:
data['..'] = data['..'].astype(int)

In [None]:
### Your code for 4 a) here ###

#casting binary features data as int type
binary_df = data[binary_features].astype(int)

#creating sub plot for binary features
fig, axes = plt.subplots(nrows=1, ncols=len(binary_df.columns), figsize=(10, 5))

for i, feature in enumerate(binary_df.columns):
    value_counts = binary_df[feature].value_counts(normalize=False)
    
    value_counts.plot(kind='bar', ax=axes[i], color='skyblue', edgecolor='black')
    
    axes[i].set_title(f'Bar Plot of {feature}')

plt.tight_layout()
plt.show()

#casting categorical features data as int type
categorical_df = data[categorical_features]

#creating sub plot for categorical features
fig, axes = plt.subplots(nrows=1, ncols=len(categorical_df.columns), figsize=(5, 5))

if len(categorical_df.columns) == 1:
    axes = [axes]

for i, feature in enumerate(categorical_df.columns):
    value_counts = categorical_df[feature].value_counts(normalize=False) 
    
    value_counts.plot(kind='bar', ax=axes[i], color='skyblue', edgecolor='black')
    
    axes[i].set_title(f'Bar Plot of {feature}')
    axes[i].set_xlabel(feature)
    axes[i].set_ylabel('Count')

plt.tight_layout()
plt.show()

**Exercise 4 b)** After reviewing the barplots above, Do you notice anything (unusual/irrelevant) with one of the features? If so, Let's try fix it.

If you have read the dtype of a categorical feature as `pandas` dtype `categorical`, you must also use the `remove_categories()` function to remove any unnecessary category levels.

To remove a specific category level, you can use the following example syntax:

```python
data['feature_name'] = data['feature_name'].cat.remove_categories("category name to delete")


<font color="green">Your answer for 4 b)</font>

Yes, the bar plot of **serum_lipid_level** shows a category named as **elev ated** which I think is a typo mistake and is the one which should be removed from the data.

In [None]:
### Your code for 4 b) here ###

#removing category named elev ated from serum_lipid_level feature 
data['serum_lipid_level'] = data['serum_lipid_level'].cat.remove_categories("elev ated")

-------------

## 5. Feature generation and exploration

Feature Engineering is a crucial step in the process of preparing data for most data analysis projects. It involves creating new features or modifying existing ones to improve the performance of predictive models. Feature engineering is a combination of domain knowledge, creativity, and data analysis, and it can have a significant impact on the success of a data analysis project.

--------------

**BMI**, or **Body Mass Index**, is a simple numerical measure that is commonly used to assess an individual's body weight in relation to their height. In our use case, BMI can be a useful indicator in the prediction of cardiovascular problems, as it could provide a well-established link between obesity and an increased risk of developing the disease.

\begin{align*}
\text{BMI} & = \frac{\text{Body mass (kg)}}{(\text{height (m)})^2} \\
\end{align*}

---------------------------------------
***Exercise 5 a)*** Generate a new feature called **BMI** using the provided formula that incorporates the **height** and **body_mass** features.


_tip: In this dataset, the **height** is recorded in centimeters. Before applying the formula, ensure that you convert the height from centimeters to meters by dividing by 100.

In [None]:
### Your code for 5 a) here ###

#first we will convert height into meters
converted_height = data['height']/100
data['height'] = converted_height
data.head()

#now we calculate BMI and add the column to the dataset
BMI = data['body_mass']/(data['height'] * 2)
data['BMI'] = BMI
data.head()

***Exercise 5 b)*** Using the previously calculated feature **BMI** generate a new feature named **BMI_category** that categorizes the values into groups, according to the standard BMI categories :

- Underweight: BMI less than 18.5
- Normal Weight: BMI between 18.5 and 24.9
- Overweight: BMI between 25 and 29.9
- Obese: BMI of 30 or greater

In [None]:
### Your code for 5 b) here ###

#Reference Source: https://www.geeksforgeeks.org/python-pandas-apply/

#Function to assign BMI category based on BMI value
def categorize_bmi(bmi):
    if bmi < 18.5:
        return 'Underweight'
    elif 18.5 <= bmi < 25:
        return 'Normal Weight'
    elif 25 <= bmi < 30:
        return 'Overweight'
    else:
        return 'Obese'

#applying BMI category based on BMI value and storing it into new column named BMI_category
data['BMI_category'] = data['BMI'].apply(categorize_bmi)

data.head()

Now that we have our BMI values, it's a good practice to see if we can spot a hidden trend in our data.

***Exercise 5 c)*** Create a countplot to visualize the distribution of cardio (target variable) across different BMI categories.
Here, countplot refers to a type of bar plot that displays the frequency (count) of observations in each category of a categorical variable, visualizing the distribution of data by showing how many instances fall into each category.








In [None]:
### Your code for 5 c) here ###

# References: 
## https://www.geeksforgeeks.org/countplot-using-seaborn-in-python/
## https://seaborn.pydata.org/generated/seaborn.countplot.html

#Plotting count plot of cardio against BMI_category
sns.countplot(x ='BMI_category', hue = "cardio", data = data, width=0.5)
plt.show()

***5 d)*** Can you notice any relationship or visible trend?

<font color="green">Your answer for 5 d)</font>

The plot shows that most people in the dataset have a normal weight, and within this group, most do not have cardio diseases. This suggests that a normal BMI might be linked to a lower risk of cardio diseases in this data.

Below, there is ready-made code for you to appropriatly add the newly created features to the right column type list. You don't need to change anything about the code, just make sure that the names of the added features are as specified earlier (**BMI** and **BMI_category**)

In [None]:
# ---- Add features to column type list (no need to change) --------#
numeric_features.append("BMI")
data['BMI_category'] = data['BMI_category'].astype('category')
categorical_features.append("BMI_category")

-------------

## 6. Preprocessing numeric features

Scaling the data is a crucial step in the preprocessing phase of machine learning, as it can significantly improve algorithm performance. In many cases, if scaling is not applied, it may lead to poor performance. This is particularly true for distance-based algorithms covered in the course, such as PCA, t-SNE, KNN and Kmeans where features with larger values can dominate the distance calculations.

---

### Common Scaling Techniques:

In this exercise, we will explore two commonly used methods for scaling data:

1. **Min-Max Scaling to [0, 1]:** 
   - This technique rescales the feature values to a range between 0 and 1. It is particularly useful when you want to maintain the relationships between the values while fitting the data into a specific range. This method is often used in training neural networks, where matching the input range to the range of activation functions is important.

2. **Standardization :**
   - standardizing the features to 0 mean and unit variance. Standardizing values is very common in statistics.

### Available Functions:

To assist you in applying these scaling techniques, the following functions from the `sklearn` library have been imported for your use:

- `sklearn.preprocessing.minmax_scale`: For Min-Max Scaling.
- `sklearn.preprocessing.scale`: For Standardization.


**6 a)** Min-max numeric attributes to [0,1] and **store the results in a new dataframe called data_min_maxed**. You might have to wrap the data to a dataframe again using pd.DataFrame()

In [None]:
# --- Your code for 6 a) here --- #

#applying min max scaling on numeric_features from original data
data_min_maxed = pd.DataFrame(minmax_scale(data[numeric_features]), columns=numeric_features)

**Exercise 6 b)** Standardize the numeric attributes of the dataset to have a mean of 0 and a standard deviation of 1. Store the standardized results in a new DataFrame called `data_standardized`.

In [None]:
# Your code for 6 b here --- #

#applying standardization on numeric_features from original data
data_standardized = pd.DataFrame(scale(data[numeric_features], with_mean= True), columns=numeric_features)

**Exercise 6 c)** Create two boxplots for the 'age' feature: one using the `data_min_maxed` DataFrame and the other using the `data_standardized` DataFrame. Display the plots side-by-side and provide titles for each plot. See the tutorial in the beginning for help.

In [None]:
# Your code for 6 c) here --- #

#Plotting min maxed and standardized data using box plot
plt.figure(figsize=(12, 6))

plt.subplot(1, 2, 1)
sns.boxplot(data=data_min_maxed, x='age')
plt.title('Boxplot for Min-Max Scaled Age')

plt.subplot(1, 2, 2)
sns.boxplot(data=data_standardized, x='age')
plt.title('Boxplot for Standardized Age')

plt.tight_layout()
plt.show()

**Execise 6 d)** Describe what you would expect to see in these two boxplots. How would the characteristics of the boxplots differ for min-max scaled data and standardized data?

_tip: Consider factors like the location of the mean, and the range of values presented._

<font color="green">Your answer for 6 d)</font>

For Min-Max boxplot, it shows that the plot has a range from **0 to 1**, and the mean is usually around **0.5**. On the other hand, for standardized data, the range is like **-2.5 to 1.5**, depending on the original spread of the data. Here the mean is closer to **0** because standardization centers the data by subtracting the mean and dividing by the standard deviation. This shows that Min-Max scaling compresses data into a fixed range, while standardization keeps the data's variability intact.

---------

Let's compare the effects of these preprocessing methods on a dataset with an outlier. We'll replace the last data point with an outlier (a value significantly different from the rest) and then apply min-max scaling and standardization. Finally, we'll visualize the results to observe how each method handles the outlier. The code to add the value is given for you and you shouldn't change it.

--------------------

***Exercise 6 e) Do the following:***
1. **Use the Provided Data:**
   - Start with the given data for the 'age' feature, which includes an outlier. This variable is referred to as `age_w_outlier`. The value of `age_w_outlier` is already set for you, so you don't need to modify it.

2. **Create Min-Max Scaled Variable:**
   - Use the `sklearn.preprocessing.minmax_scale` function to apply Min-Max scaling to `age_w_outlier`. Store the scaled values in a new variable named `age_w_outlier_minmaxed`.

3. **Create Standardized Variable:**
   - Use the `sklearn.preprocessing.scale` function to standardize the values of `age_w_outlier`. Store the standardized values in a new variable named `age_w_outlier_standardized`.


In [None]:
### Add an outlier, DONT CHANGE THIS CELL CODE, JUST RUN IT ###
data_w_outlier = data.copy() #data should be the name of the variable where you have stored your data!
data_w_outlier.loc[data.shape[0] -1 , 'age'] = 150 #change the last value of age to be 150
age_w_outlier = data_w_outlier.age

In [None]:
# --- Your code for 6 e) ---

#applying min max scale on age_w_outlier
age_w_outlier_minmaxed = minmax_scale(age_w_outlier)

#applying standardized scale on age_w_outlier
age_w_outlier_standardized = scale(age_w_outlier)

***Below there is pre-written code for you to plot the different cases. Run it. The code should run if you have named your features appropriately. Run the code.***

In [None]:
# Wrap in a dataframe that will have two features - the age feature without the outlier, and the age feature with it, min-maxed.
minmaxed_datas = pd.DataFrame({"minmaxed_age_no_outlier" : data_min_maxed.age,
              "minmaxed_age_with_outlier": age_w_outlier_minmaxed })

# Wrap in a dataframe that will have two features - the age feature without the outlier, and the age feature with it, standardized.
standardized_datas = pd.DataFrame({"standardized_data_no_outlier" : data_standardized.age,
              "standardized_data_w_outlier": age_w_outlier_standardized })

axes_minmaxed = minmaxed_datas[['minmaxed_age_no_outlier', 'minmaxed_age_with_outlier']].plot(kind='box', title='Minmax with and without outlier')
axes_std = standardized_datas[['standardized_data_no_outlier', 'standardized_data_w_outlier']].plot(kind='box', title='Standardized with and without outlier')

----------
**Exercise 6 f) Look at the output of the above cell and answer the following**:

1. Can you notice a difference between the two cases (min-maxed and standardized)?
2. Can you say something about the difference of the effect of min-maxing and standardization?


<font color="green">Your answer for 6 f)</font> 
1. In min-max scaling, data is compressed between [0,1], but outliers stretch the range, pushing most values closer to 1. While in case of standardization, data centers around 0, and while outliers increase the range, the main distribution remains stable near 0.
2. Min-max scaling compresses data into a fixed range [0,1], making it highly sensitive to outliers, which can distort the scale. While in case of standardization, the data is centered around 0 and scales by standard deviation, reducing the influence of outliers on the overall distribution.

---------------
## 7. Preprocessing categorical features



We can roughly divide categorical variables/features to two types:  ***nominal categorical***  and  ***ordinal categorical*** variables/features. Some cases are clear in terms of which of the two a feature falls into. For example nationality is not an ordered feature, but which grade in school someone is has a natural ordering. **One-hot encoding** was presented in the lectures and will be used in the following exercises with different learning methods.


-----
***Nominal categorical features need to be encoded***, because not encoding them implies that they have an order. For example, consider a dataset where you would have rows by different countries, encoded randomly with numbers, for ex. Finland = 1, Norway = 2 and so on. For some analyses and methods this would imply that Norway is somehow "greater" in value than Finland. For some algorithms, the implication would also be, that some of the countries would be "closer" to each other.

------
***Ordinal categorical features do not necessarily need to be encoded***, but there are cases where it can be wise. One case is that the categories are not even distance from each other, which is the case with the 'serum_lipid_level' feature with the levels 'normal', 'elevated' and 'at risk'. Its not clear that these are equal in distance from each other. When unsure, it may also be better to one-hot encode, and a lot of packages do it for you behind the scenes. Here we decide to one-hot encode.  

---------------------


***Exercise 7 a)*** Apply One-hot-encode to the `serum_lipid_level` feature and add the resulting one-hot encoded features back to the DataFrame. Give the new features meaningful names. Print the first rows of the resulting dataframe.

_tip: pandas has a function for this, google!_

In [None]:
# --- Your code for 7 a) here ---

#applying one-hot-encode on serum_lipid_level feature
encoded_serum_lipid = pd.get_dummies(data['serum_lipid_level'], prefix='serum_level')

#adding back to the dataframe
data = pd.concat([data, encoded_serum_lipid], axis=1)
print(data.head())

----------

<div class="alert alert-block alert-warning">
    <h1><center> BONUS EXERCISES </center></h1>

- Below are the bonus exercises. You can stop here, and get the "pass" grade.
- By doing both of the bonus exercises below, you can get a "pass with honors", which means you will get one point bonus for the exam.

The following exercises are more challenging and not as straight-forward and may require some research of your own. However, perfect written answers are not required, but answers that show that you have tried to understand the problems and explain them with your own words.

____________
##  <font color = dollargreen > 8. BONUS: Dimensionality reduction and plotting with PCA </font>
In the lectures, PCA was introduced as a dimensionality reduction technique. Here we will use it to reduce the dimensionality of the numeric features of this dataset and use the resulting compressed view of the dataset to plot it. This means you have to, run PCA  and then project the data you used to fit the PCA to the new space, where the principal components are the axes.
____________

-------------
**Exercise 8 a)** Do PCA with two components with and without z-score standardization **for the numeric features in the data**.

In [None]:
# --- Your for 8 a) code here --- #
numeric_data = data[numeric_features]

# applying two component PCA on standardized data
pca_standardized = PCA(n_components=2)
pca_standardized_result = pca_standardized.fit_transform(data_standardized)

# applying two component PCA on original data
pca_non_standardized = PCA(n_components=2)
pca_non_standardized_result = pca_non_standardized.fit_transform(numeric_data)


-------------


**Exercise 8 b) Plot the data, projected on to the PCA space as a scatterplot, the x-axis being one component and y the other. **Add the total explained variance to your plot as an annotation**. See the documentation of the pca method on how to get the explained variance.

- _Tip: It may be easier to try the seaborn scatterplot for this one. For help see documentation on how to do annotation (see tutorial). The total explained variance is the sum of both the components explained variance_.

- _Tip2_: Depending on how you approach annotating the plot, you might have to cast the feature name to be a string. One nice way to format values in python is the f - formatting string, which allows you to insert expressions inside strings (see example below):



------
name = Valtteri<br>
print(f"hello_{name}")

---------
You can also set the number of wanted decimals for floats<br>
For example f'{float_variable:.2f}' would result in 2 decimals making it to the string created

----------

In [None]:
# --- Your code for 8 b) --- you can make more cells if you like ---

explained_variance_standardized = pca_standardized.explained_variance_ratio_
total_explained_variance_standardized = sum(explained_variance_standardized)

explained_variance_non_standardized = pca_non_standardized.explained_variance_ratio_
total_explained_variance_non_standardized = sum(explained_variance_non_standardized)

# Plotting both results side by side
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Plot for standardized data
sns.scatterplot(
    x=pca_standardized_result[:, 0], 
    y=pca_standardized_result[:, 1], 
    ax=axes[0]
)
axes[0].set_title('PCA with 2 Components (Standardized Data)')
axes[0].set_xlabel('PCA Component 1')
axes[0].set_ylabel('PCA Component 2')
axes[0].text(
    0.05, 0.05,  # Move text to bottom left
    f"Total Explained Variance: {total_explained_variance_standardized:.2f}", 
    transform=axes[0].transAxes,
    fontsize=12,
    verticalalignment='bottom'
)

# Plot for non-standardized data
sns.scatterplot(
    x=pca_non_standardized_result[:, 0], 
    y=pca_non_standardized_result[:, 1], 
    ax=axes[1]
)
axes[1].set_title('PCA with 2 Components (Non-standardized Data)')
axes[1].set_xlabel('PCA Component 1')
axes[1].set_ylabel('PCA Component 2')
axes[1].text(
    0.05, 0.05,  # Move text to bottom left
    f"Total Explained Variance: {total_explained_variance_non_standardized:.2f}", 
    transform=axes[1].transAxes,
    fontsize=12,
    verticalalignment='bottom'
)

plt.show()



**Exercise 8 c) Gather information for the next part of the exercise and print out the following things:**
- First, the standard deviation of the original data features (not standardized, and with the numeric features only).
- Second, the standard deviation of the standardized numeric features

In [None]:
# --- Your code for 8 c) here --- #
original_std_dev = data[numeric_features].std()
print("Standard Deviation of Non-Standardized Numeric Features:\n", original_std_dev)

# Convert standardized data back to DataFrame to calculate standard deviation
standardized_data_df = pd.DataFrame(data_standardized, columns=numeric_features)

# Calculate the standard deviation of the standardized numeric features
standardized_std_dev = standardized_data_df.std()
print("\nStandard Deviation of Standardized Numeric Features:\n", standardized_std_dev)

----------
**Exercise 8 d) Look at the output above and the explained variance information you added as annotations to the plots. Try to think about the following questions and give a short answer of what you think has happened:**

1. Where do you think the difference between the amounts of explained variance might come from?

2. Can you say something about why it is important to scale the features for PCA by looking at the evidence youve gathered?

__Answer in your own words, here it is not important to get the perfect answer but to try to think and figure out what has happened__

------------

<font color="green">Your answer for 8 d)</font>

1. The difference in explained variance happens because the features weren't scaled. In the non-standardized data, features like age have a much bigger range, so they dominate the variance calculation. This makes PCA focus mostly on these features, giving a high explained variance of 1.00. After scaling the features, they all have the same range of standard deviation i.e. 1. So each feature contributes equally to the variance. This makes the explained variance more balanced and lowers it to 0.64.

2. Scaling makes sure that all features have the same impact on the PCA calculation, so features with larger values don't dominate those with smaller values. Without scaling, PCA would focus more on features with big numbers, which could lead to biased results.

------------------

## <font color = dollargreen > 9. Bonus: t-SNE and high dimensional data </font>

Another method that can be used to plot high-dimensional data introduced in the lectures was t-distributed Stochastic Neighbor Embedding (t-SNE).

***Exercise 9 a)*** Run t-SNE for both standardized and non standardized data (as you did with PCA).

In [None]:
# --- Your code for 9 a) here --- #
# Reference: https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html, https://www.geeksforgeeks.org/ml-t-distributed-stochastic-neighbor-embedding-t-sne-algorithm/

tsne_non_standardized = TSNE(n_components=2, random_state=0)
tsne_non_standardized_result = tsne_non_standardized.fit_transform(numeric_data)

tsne_standardized = TSNE(n_components=2, random_state=0)
tsne_standardized_result = tsne_standardized.fit_transform(data_standardized)

tsne_non_standardized_df = pd.DataFrame(tsne_non_standardized_result, columns=['TSNE1', 'TSNE2'])
tsne_standardized_df = pd.DataFrame(tsne_standardized_result, columns=['TSNE1', 'TSNE2'])

plt.figure(figsize=(14, 6))

plt.subplot(1, 2, 1)
sns.scatterplot(x='TSNE1', y='TSNE2', data=tsne_non_standardized_df, c='red', alpha=0.7, edgecolor='k')
plt.title("t-SNE without Standardization")
plt.xlabel("t-SNE Component 1")
plt.ylabel("t-SNE Component 2")
plt.grid(True)

plt.subplot(1, 2, 2)
sns.scatterplot(x='TSNE1', y='TSNE2', data=tsne_standardized_df, c='red', alpha=0.7, edgecolor='k')
plt.title("t-SNE with Standardization")
plt.xlabel("t-SNE Component 1")
plt.ylabel("t-SNE Component 2")
plt.grid(True)

plt.tight_layout()
plt.show()

***Exercise 9 b)*** Plot t-sne, similarly to PCA making the color of the points correspond to the levels of the cardio feature, but having only numerical features as a basis of the T-SNE.  

In [None]:
# --- Code for 9 b) --- #
corresponding_feature = data['cardio']

plt.figure(figsize=(8, 6))
scatter = plt.scatter(tsne_standardized_result[:, 0], tsne_standardized_result[:, 1], c=corresponding_feature, edgecolor='k', s=100)
plt.title('t-SNE with Color Corresponding to Cardio Levels')
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')

plt.colorbar(scatter, label='Cardio Level (0 or 1)')
plt.show()

***Exercise 9 c)***

- What do you think might have happened between the two runs of t-SNE on unstandardized and standardized data? Why is it important to standardize before using the algorithm?

_Here the aim is to think about this and learn, not come up with a perfect explanation. Googling is encouraged. Think about whether t-sne is a distance based algorithm or not?_

<font color="green">Your answer for 9 c)</font>

When using t-SNE, it’s important to standardize the data because t-SNE relies on distances between points to group similar ones together. If the data isn’t standardized, features with larger values will dominate the distance calculation, making it harder for t-SNE to show meaningful patterns because it is a distance based algorithm. Standardizing the data ensures that each feature has the same influence, allowing t-SNE to better capture the true relationships between data points. Without standardization, the results might be misleading or hard to interpret, as t-SNE might focus too much on just a few features instead of the whole dataset.