In [None]:
import pandas as pd
import numpy as np

# Scaling modules
from sklearn.preprocessing import MinMaxScaler

# Plotting modules
import seaborn as sns
import matplotlib.pyplot as plt

# Ensures the same random data is used each time you execute the code.
np.random.seed(0)

# This code will suppress all warnings in your notebook.
import warnings
warnings.filterwarnings('ignore')


#### 1. For the following examples, decide whether standardisation or normalisation makes more sense.
  
  
  
  a. You want to build a linear regression model to predict someone's grades, given how much time they spend on various activities during a typical school week.  You notice that your measurements for how much time students spend studying aren't normally distributed: some students spend almost no time studying, while others study for four or more hours daily. Should you standardise or normalise this variable?  

  

  b. You're still working with your students' grades, but you also want to include information on how students perform on several fitness tests. You have information on how many jumping jacks and push-ups each student can complete in a minute. However, you notice that students perform far more jumping jacks than push-ups: the average for the former is 40, and for the latter only 10. Should you standardise or normalise these variables?



##### Answer for 1a:

Normalisation is the more appropriate choice here. The time students spend studying is not normally distributed: some students spend almost no time studying, while others study for four or more hours daily. Normalisation (min-max scaling) rescales the values to the range 0 to 1 while preserving the relative differences between students, and it makes no assumption about the shape of the distribution. This puts study time on the same scale as the other activity variables, so that the variables are directly comparable in the regression model and none stands out simply because it is measured on a larger scale.
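
As a rough illustration of min-max normalisation, the sketch below applies the formula (x - min) / (max - min) to a handful of made-up study times. The `study_hours` values are hypothetical and do not come from any dataset; the formula is the same calculation that `MinMaxScaler` performs with its default range of 0 to 1.

```python
import numpy as np

# Hypothetical daily study hours for six students (illustrative values only).
study_hours = np.array([0.0, 0.5, 1.0, 2.0, 4.0, 5.0])

# Min-max normalisation: subtract the minimum, then divide by the range.
value_range = study_hours.max() - study_hours.min()
normalised = (study_hours - study_hours.min()) / value_range

print(normalised)
```

The smallest value maps to 0, the largest to 1, and the gaps between students keep their relative size.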




##### Answer for 1b:

Standardisation is the more suitable choice here. Jumping jacks and push-ups are both counted per minute, but they sit on very different scales: students average 40 jumping jacks and only 10 push-ups. Standardisation transforms each variable to have a mean of 0 and a standard deviation of 1, which makes variables on different scales easier to compare. After standardising, each value expresses how far a student is from the average for that test, so both tests contribute on an equal footing and the model is not dominated by the variable with the larger values.
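
A minimal sketch of standardisation, assuming made-up repetition counts whose averages match the 40 and 10 mentioned in the question. The arrays and the `standardise` helper are hypothetical and written only for illustration; the helper uses the population standard deviation (NumPy's default), which is also what scikit-learn's `StandardScaler` uses.

```python
import numpy as np

# Hypothetical repetitions per minute for five students (illustrative values only).
jumping_jacks = np.array([30.0, 35.0, 40.0, 45.0, 50.0])  # mean is 40
push_ups = np.array([4.0, 7.0, 10.0, 13.0, 16.0])         # mean is 10


def standardise(x):
    """Return z-scores: subtract the mean, divide by the standard deviation."""
    return (x - x.mean()) / x.std()


jumping_jacks_z = standardise(jumping_jacks)
push_ups_z = standardise(push_ups)

print(jumping_jacks_z)
print(push_ups_z)
```

Both standardised variables are centred on 0 with a standard deviation of 1, so a student's score on each test is expressed relative to the class average for that test.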



#### 2. Visualise the "EG.ELC.ACCS.ZS" column from the countries dataset using a histogram. Then, scale the column using the appropriate scaling method (normalisation or standardisation). Finally, visualise the original and scaled data alongside each other. Note: EG.ELC.ACCS.ZS is the percentage of the population with access to electricity.

In [None]:
# Load countries data.
countries = pd.read_csv("countries.csv")

# Display the first five observations.
countries.head()

In [None]:
# Check the shape of your dataframe
countries.shape

In [None]:
# Extract the column and drop missing values
data = countries['EG.ELC.ACCS.ZS'].dropna().values.reshape(-1, 1)

In [None]:
# Initialise a min-max scaler (normalisation) and scale the data to the range 0 to 1.
scaler = MinMaxScaler()
scaled_data_countries = scaler.fit_transform(data)

In [None]:
# Visualise the original and scaled data alongside each other.
fig, ax = plt.subplots(1, 2, figsize=(12, 4))

# Create a histogram of the original data.
sns.histplot(countries['EG.ELC.ACCS.ZS'], ax=ax[0], kde=True)
ax[0].set_title("Original data")

# Create a histogram of the scaled data.
# The scaler returns a 2-D array, so flatten it to 1-D to avoid a spurious legend.
sns.histplot(scaled_data_countries.ravel(), ax=ax[1], kde=True)
ax[1].set_title("Scaled data")

plt.tight_layout()
plt.show()