[The disappearing computer](https://www.ted.com/talks/imran_chaudhri_the_disappearing_computer_and_a_world_where_you_can_take_ai_everywhere/c?user_email_address=02b4db6ec1c28f003d0443330ce209ef&lctg=62d19e111c794c328c90ca0e)   

#### Good AI is complex:  
It takes high-quality, clean data;  
fine-tuning of foundation models;  
thoughtful and responsible roll-out.  
“Many companies aren’t in a position to use AI in this way yet.”   hai. 

In [None]:
import pandas as pd
from scipy import stats
import os
from random import randint
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

Range, domain, and domain constraint  
Variance and standard deviation   
Covariance and correlation  



---
#### Dimensionality   

#### Data Normalisation   
Normalization scales the data to a standard range.  
This prevents a specific feature from having a strong influence on the model’s output.  
It ensures that the model is more robust to variations in the data.   

Normalization gives equal weight to each variable   
so that no single variable steers model performance in one direction  
just because its values are larger.  
E.g., clustering algorithms use distance measures to decide whether an observation belongs to a cluster, so a feature with a large numeric range would otherwise dominate the distance.

##### MinMaxScaler   
y = (x – min) / (max – min)  

##### StandardScaler   
y = (x – mean) / standard_deviation  

Also called the **z-score**: $z = \frac{x - \mu}{\sigma}$
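
As a quick check of the two formulas, here is a minimal sketch on a handful of made-up heights (the values are hypothetical and not taken from the class data). It assumes the population standard deviation, which is what `np.std` computes by default.

```python
import numpy as np

# Hypothetical heights in cm, chosen only for illustration
x = np.array([150.0, 160.0, 170.0, 180.0, 190.0])

# Min-max scaling: y = (x - min) / (max - min); the result lies in [0, 1]
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardisation (z-score): z = (x - mean) / std; the result has mean 0 and std 1
x_z = (x - x.mean()) / x.std()  # np.std uses the population formula (ddof=0)

print(x_minmax)
print(x_z.round(3))
```

The smallest value maps to 0 and the largest to 1 under min-max scaling, while the z-scores are centred on 0.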

##### Task   
Using Excel, find   
- normalised values of height and weight of your class.      
- standard normalised values of height and weight of your class.         

In [None]:
df = pd.read_excel('../Data/some_girls.xlsx')
df1 = df[['Name','Height_cm', 'Weight_Kg', 'BMI', 'Age']].copy()  # copy so new columns are not added to a view of df
df1.head()

In [None]:
# y = (x – min) / (max – min)
df1['ht_minMax'] = (df1.Height_cm - df1.Height_cm.min()) / (df1.Height_cm.max() - df1.Height_cm.min())
df1['wt_minMax'] = (df1.Weight_Kg - df1.Weight_Kg.min()) / (df1.Weight_Kg.max() - df1.Weight_Kg.min())
df1['bmi_minMax'] = (df1.BMI - df1.BMI.min()) / (df1.BMI.max() - df1.BMI.min())
# df1.sort_values('Weight_Kg')
# df1.sort_values('Height_cm')
df1.sort_values('BMI')


In [None]:
# y = (x – mean) / standard_deviation
df2 = df[['Name','Height_cm', 'Weight_Kg', 'BMI', 'Age']].copy()  # copy so new columns are not added to a view of df
df2['ht_scaled'] = (df2.Height_cm - df2.Height_cm.mean()) / df2.Height_cm.std()
df2['wt_scaled'] = (df2.Weight_Kg - df2.Weight_Kg.mean()) / df2.Weight_Kg.std()
df2['bmi_scaled'] = (df2.BMI - df2.BMI.mean()) / df2.BMI.std()
df2.sort_values('Weight_Kg')
# df2.sort_values('Height_cm')
# df2.sort_values('BMI')

##### Normalise using Scikit-Learn function   
`scaler.fit_transform(column)`

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

df3 = df[['Name','Height_cm', 'Weight_Kg', 'BMI', 'Age']].copy()  # copy so new columns are not added to a view of df

df3['Weight_Kg_skl'] = scaler.fit_transform(df3[['Weight_Kg']])
df3['Height_cm_skl'] = scaler.fit_transform(df3[['Height_cm']])
df3['BMI_skl'] = scaler.fit_transform(df3[['BMI']])
df3.sort_values('Weight_Kg')

In [None]:
  
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

df4 = df[['Name','Height_cm', 'Weight_Kg', 'BMI', 'Age']].copy()  # copy so new columns are not added to a view of df

df4['Weight_Kg_skl'] = scaler.fit_transform(df4[['Weight_Kg']])
df4['Height_cm_skl'] = scaler.fit_transform(df4[['Height_cm']])
df4['BMI_skl'] = scaler.fit_transform(df4[['BMI']])
df4.sort_values('Weight_Kg')
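
The hand-computed z-scores (using pandas `.std()`) and the `StandardScaler` values will not match exactly. pandas defaults to the sample standard deviation (`ddof=1`), whereas `StandardScaler` and `stats.zscore` use the population standard deviation (`ddof=0`). The sketch below illustrates this on a few hypothetical weights; the numbers are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.preprocessing import StandardScaler

# Hypothetical weights in kg, for illustration only
s = pd.Series([48.0, 52.0, 55.0, 60.0, 65.0], name="Weight_Kg")

z_pandas = (s - s.mean()) / s.std()        # sample std (ddof=1), the pandas default
z_pop = (s - s.mean()) / s.std(ddof=0)     # population std (ddof=0)
z_sklearn = StandardScaler().fit_transform(s.to_frame()).ravel()
z_scipy = stats.zscore(s.to_numpy())       # ddof=0 by default

print("ddof=0 matches sklearn:", np.allclose(z_pop, z_sklearn))
print("ddof=0 matches scipy  :", np.allclose(z_pop, z_scipy))
print("ddof=1 matches sklearn:", np.allclose(z_pandas, z_sklearn))
```

For small classes the gap is visible in the second decimal place; it shrinks as the number of rows grows.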

---  
##### Example Data Reduction  
Any normal distribution can be converted to the standard normal distribution (mean 0, standard deviation 1).  
The z-score measures how far a value lies from the mean, in units of standard deviation.

$$z = \frac{x - \mu}{\sigma}$$  
```
z is the "z-score" (Standard Score)
x is the value to be standardized
μ ("mu") is the mean
σ ("sigma") is the standard deviation
```

##### Task   
Using your data, in a spreadsheet create z scores for numerical values.   
Compare them with the z-scores using **`stats.zscore()`**  

In [None]:
from scipy import stats

In [None]:
df = pd.read_excel('../Data/Six_Schools.xlsx', sheet_name='KSP21')
df.describe()                

In [None]:
# Calculate the Z-scores
df['age_Z'] = stats.zscore(df.Age)
df[['Age','age_Z']].sort_values('Age')

In [None]:
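# Define a threshold for outlier detection: |z| >= 3 is treated as an outlier
threshold = 3
# Keep only the rows within the threshold (abs() so that low outliers are caught too)
df[df.age_Z.abs() < threshold].sort_values('Age')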

#### When to standardise   
If the distribution of the quantity is **normal**, then **standardise**,   
**else normalise**. 

[Sonar Project](https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.names)  
[Sonar Data](https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv)   

shape = (208, 60)  
Curated and much used sample data.  

KNN Accuracy:  
```
with raw data: 0.797   
min-max scale: 0.813   
standardised : 0.810   
```  

Scaling may or may not result in significant improvement in accuracy.  
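
A comparison of this kind can be sketched as follows. To keep the example self-contained it uses scikit-learn's built-in wine dataset rather than the sonar data, and the choices of 5-fold cross-validation and the default `KNeighborsClassifier` settings are assumptions, so the numbers will differ from those listed above.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Stand-in dataset (assumption): wine, not sonar
X, y = load_wine(return_X_y=True)

# Putting the scaler inside a pipeline means it is fitted on the training folds only
models = {
    "raw data": KNeighborsClassifier(),
    "min-max scale": make_pipeline(MinMaxScaler(), KNeighborsClassifier()),
    "standardised": make_pipeline(StandardScaler(), KNeighborsClassifier()),
}

# Mean accuracy over 5 cross-validation folds for each variant
scores = {name: cross_val_score(m, X, y, cv=5).mean() for name, m in models.items()}
for name, acc in scores.items():
    print(f"{name:14s}: {acc:.3f}")
```

On the wine data the features have very different ranges, so scaling tends to help KNN more than it does on sonar, where all features already share a similar range.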

---  

#### Experiment  
1. Create a dataset   
2. Cluster the data as is using K-means   
3. Cluster the standardized data the same way; the result can be totally different

In [None]:
import numpy as np

def random_2D_data(x,y,size):
    x = (np.random.randn(size)/3.5)+x
    y = (np.random.randn(size)*3.5)+y
    return x,y
x1,y1 = random_2D_data(2,20,50)
x2,y2 = random_2D_data(2,-20,50)
x3,y3 = random_2D_data(-2,20,50)
x4,y4 = random_2D_data(-2,-20,50)
x = np.concatenate((x1,x2,x3,x4))
y = np.concatenate((y1,y2,y3,y4))
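
Steps 2 and 3 of the experiment could be sketched as below. The data is regenerated with the same shape as `random_2D_data` produces (narrow in x, wide in y). The fixed seed, the choice of 4 clusters, and the use of the adjusted Rand index to compare against the generating groups are assumptions made for this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)  # fixed seed (assumption) for reproducibility

def blob(cx, cy, size):
    # Narrow spread in x, wide spread in y, mirroring random_2D_data
    xs = rng.standard_normal(size) / 3.5 + cx
    ys = rng.standard_normal(size) * 3.5 + cy
    return np.column_stack((xs, ys))

centres = [(2, 20), (2, -20), (-2, 20), (-2, -20)]
X = np.vstack([blob(cx, cy, 50) for cx, cy in centres])
true_labels = np.repeat(np.arange(4), 50)  # which blob each point came from

def cluster(data):
    # 4 clusters is an assumption matching the 4 generated blobs
    return KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(data)

labels_raw = cluster(X)                                  # step 2: data as is
labels_std = cluster(StandardScaler().fit_transform(X))  # step 3: standardized

# Adjusted Rand index: 1.0 means the clustering matches the generating blobs
ari_raw = adjusted_rand_score(true_labels, labels_raw)
ari_std = adjusted_rand_score(true_labels, labels_std)
print(f"ARI raw: {ari_raw:.2f}   ARI standardized: {ari_std:.2f}")
```

On the raw data the y axis dominates the Euclidean distance, so K-means may split points along y and ignore the separation in x; after standardizing, both axes contribute comparably.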

#### Scaled values    
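Scaling does not change the correlation of height or weight with BMI: min-max scaling and standardisation are positive linear transformations, and Pearson correlation is invariant under them. Any difference is floating-point noise.  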

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
plt.rcParams["figure.figsize"] = [7.00, 3.50]
plt.rcParams["figure.autolayout"] = True
f, axes = plt.subplots(1, 2)
df = pd.read_excel('../Data/some_girls.xlsx')
sns.regplot(x='Height_cm', y='BMI',data=df4, ax=axes[0])
sns.regplot(x='Height_cm_skl', y='BMI_skl',data=df4, ax=axes[1]);


In [None]:
print(df4.head())

In [None]:
f, axes = plt.subplots(1, 2)
df = pd.read_excel('../Data/some_girls.xlsx')
sns.regplot(x='Weight_Kg', y='BMI',data=df4, ax=axes[0])
sns.regplot(x='Weight_Kg_skl', y='BMI_skl',data=df4, ax=axes[1]);

In [None]:
df4.columns

In [None]:
print(df4.Height_cm.corr(df4.BMI))
print(df4.Height_cm_skl.corr(df4.BMI_skl))
print()
print(df4.Weight_Kg.corr(df4.BMI))
print(df4.Weight_Kg_skl.corr(df4.BMI_skl))



#### Feature Selection
Filter irrelevant or redundant features from your dataset.  
Feature selection keeps a subset of the original features while feature extraction creates new ones.  
Some supervised algorithms have built-in feature selection, such as Regularized Regression and Random Forests.  
Feature selection can be unsupervised (e.g. Variance Thresholds) or supervised (e.g. Genetic Algorithms).  
You can also combine multiple methods if needed.  

##### Feature Scaling Typically Needed  
1. Linear Regression   
2. Logistic Regression  
3. Support Vector Machines
4. Neural Networks  
5. k-Nearest Neighbors  
6. KMeans Clustering  
7. Principal Component Analysis

##### Feature Scaling Typically Not Needed   
1. Decision Trees  
2. Random Forests  
3. Naive Bayes  
4. Gradient Boosting  


#### Variance Thresholds  
Variance thresholds remove features whose values don't change much from observation to observation (i.e. their variance falls below a threshold). These features provide little value.  
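E.g., in the class dataset, the 'Age' and 'Gender' features vary little, so they can be dropped with little loss of information.  

Because variance depends on scale, bring the features to a comparable range first (e.g. with min-max scaling). Note that standardising gives every feature unit variance, so a variance threshold cannot distinguish between standardised features.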

In [None]:
from sklearn.feature_selection import VarianceThreshold
df_temp = df4[['Name']]
df5 = df4[['Age','Weight_Kg_skl','Height_cm_skl','BMI_skl']]
print(f"\nVariances of columns: \n\n{df5.agg(['var'])}\n")

In [None]:
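# Age is skipped via iloc[:, 1:]; the remaining columns were standardised (variance 1),
# so all three pass the 0.5 threshold
sel = VarianceThreshold(threshold = 0.5)
df5 = sel.fit_transform(df5.iloc[:,1:])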
# pd.DataFrame(list(df5))

In [None]:
pd.DataFrame(df5, columns=['Weight', 'Height', 'BMI_skl'])
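
To see a variance threshold actually drop a column, here is a minimal sketch on hypothetical class data in which everyone has the same age. The values, the min-max scaling step, and the 0.01 threshold are all assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import MinMaxScaler

# Hypothetical class data: Age is identical for every student
toy = pd.DataFrame({
    "Age":       [20, 20, 20, 20, 20, 20],
    "Height_cm": [150, 158, 162, 165, 171, 180],
    "Weight_Kg": [45, 50, 55, 61, 68, 80],
})

# Min-max scale so that variances are comparable across features
scaled = pd.DataFrame(MinMaxScaler().fit_transform(toy), columns=toy.columns)
print(scaled.var(ddof=0).round(3))

# Drop features whose variance is at or below the (assumed) threshold
sel = VarianceThreshold(threshold=0.01)
sel.fit(scaled)

# get_support() returns a boolean mask of the columns that were kept
kept = list(scaled.columns[sel.get_support()])
print("kept:", kept)
```

Using `get_support()` recovers the original column names, which avoids typing them by hand after `fit_transform`.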


#### Task   
Check variances for age in different datasets in Seven Schools   
In each, which columns could be excluded from the dataset when building an ML model?   

### Model Selection  
- For any prediction problem, there are many algorithms and methods available - decision trees, random forests, neural networks, and more   
- Model evaluation and selection is done by evaluating model performance on a validation dataset  
- Holdout validation: Partition available data into a training dataset and a holdout; evaluate model performance on holdout  
- Cross-validation: Create a number of partitions (validation datasets) from the
training dataset; fit model to the training dataset (sans the validation data);
evaluate model against each validation dataset; repeat with each validation set
and average results to obtain the cross-validation error  
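
Holdout validation and cross-validation as described above can be sketched with scikit-learn. The iris dataset, the decision tree model, the 25% holdout and the 5 folds are all assumptions chosen to keep the example small.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # stand-in dataset (assumption)
model = DecisionTreeClassifier(random_state=0)

# Holdout validation: set aside 25% of the data and evaluate on it once
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
holdout_acc = model.fit(X_train, y_train).score(X_hold, y_hold)

# Cross-validation: 5 partitions of the training data; each is used once as
# the validation set while the model is fitted on the other four
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
cv_error = 1 - cv_scores.mean()

print(f"holdout accuracy      : {holdout_acc:.3f}")
print(f"cross-validation error: {cv_error:.3f}")
```

`cross_val_score` clones the model for each fold, so the fit on the full training set above is not reused inside the cross-validation loop.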

##### Data vs. Model   
- Often Data > Methods
     - Microsoft researchers (Banko and Brill) evaluated performance of multiple models for a language understanding task  
     - Varied size of training dataset (up to 1B words)
     - Among modern methods, performance differences between algorithms are relatively small when compared to differences between same algorithms with more/less data

##### Batch   
If the training dataset is large, it is split into several batches.  
E.g., a dataset with 2000 examples is divided into 4 batches of 500 each.   
Batch size = 500.  
This results in 4 **iterations** in one training epoch.  

##### Epoch   
One epoch is one complete pass of the entire training dataset through the model.  
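
The arithmetic relating batches, iterations and epochs can be sketched in plain Python. The number of epochs (3) is a hypothetical value; the loop only counts iterations and does not train anything.

```python
import math

n_examples = 2000
batch_size = 500
n_epochs = 3  # hypothetical number of passes over the data

# Iterations per epoch = number of batches needed to cover the dataset once
iterations_per_epoch = math.ceil(n_examples / batch_size)

total_iterations = 0
for epoch in range(n_epochs):
    # Walk through the dataset one batch at a time
    for start in range(0, n_examples, batch_size):
        batch_indices = range(start, min(start + batch_size, n_examples))
        total_iterations += 1  # a real training loop would update the model here

print(iterations_per_epoch)  # 4
print(total_iterations)      # 12
```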


----
#### Equations in Data Science   
1. Gradient Descent: An optimization algorithm used to minimize the cost function. It helps us find the optimal parameters for ML models.  
2. Normal Distribution: A probability distribution that forms a bell curve and is often used to model and analyze data in statistics.
3. Sigmoid: A function that maps input values to a range between 0 and 1. It is commonly used in logistic regression to make predictions.
4. Linear Regression: A statistical model used to model a linear relationship between independent and dependent variables.
5. Cosine Similarity: A measure that calculates the cosine of the angle between two vectors. It is typically used to determine the similarity between data points.
6. Naive Bayes: A probabilistic classifier based on the Bayes theorem. It assumes independence between features and is often used in classification tasks.
7. KMeans: A popular clustering algorithm that partitions data points into k distinct groups.
8. Log Loss: A loss function used to evaluate the performance of classification models using output probabilities.
9. MSE (Mean Squared Error): A metric that measures the average squared difference between predicted and actual values. It is commonly used to assess regression models.
10. MSE + L2 Regularization: An extension of MSE that includes L2 regularization. It is used to prevent overfitting.
11. Entropy: A measure of the uncertainty or randomness of a random variable. It is often utilized in decision trees.
12. Softmax: A function that normalizes a set of values into probabilities. It is commonly used in multiclass classification problems.
13. Ordinary Least Squares: A method for estimating the parameters in linear regression models by minimizing the sum of squared residuals.
14. Correlation: A statistical measure that quantifies the strength and direction of the linear relationship between two variables.
15. Z-score: A standardized value that indicates how many standard deviations a data point is from the mean.
16. MLE (Maximum Likelihood Estimation): A method for estimating the parameters of a statistical model by maximizing the likelihood of the observed data.
17. Eigenvectors: Non-zero vectors whose direction is unchanged (they are only scaled) when a linear transformation is applied. They are widely used in dimensionality reduction techniques.
18. R2 (R-squared): A statistical measure that represents the proportion of variance explained by a regression model, indicating its predictive power.
19. F1 Score: The harmonic mean of precision and recall, used to evaluate the performance of binary classification models.
20. Expected Value: The weighted average value of a random variable, calculated by multiplying each possible outcome by its probability and summing the results.
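
A few of the entries above are short enough to write out directly. The following is a minimal NumPy sketch of five of them; the function names are illustrative, and the entropy is computed in bits (log base 2), which is an assumption.

```python
import numpy as np

def sigmoid(z):
    # Maps any real input to the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Subtract the maximum for numerical stability; the outputs sum to 1
    e = np.exp(np.asarray(z, dtype=float) - np.max(z))
    return e / e.sum()

def cosine_similarity(a, b):
    # Cosine of the angle between vectors a and b
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def mse(y_true, y_pred):
    # Average squared difference between actual and predicted values
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

def entropy(p):
    # Shannon entropy in bits; zero-probability outcomes contribute nothing
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(sigmoid(0.0))                        # 0.5
print(softmax([1.0, 2.0, 3.0]).sum())
print(cosine_similarity([1, 0], [0, 1]))   # 0.0
print(mse([1, 2, 3], [1, 2, 5]))
print(entropy([0.5, 0.5]))                 # 1.0
```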


In [None]:
from IPython.display import Image
Image('../Figures/equations_DS.png', width=700)