In [6]:
import numpy as np
import pandas as pd

In [7]:
np.median([1,2,3,4])

2.5

In [4]:
np.mean([1,2,3])

2.0

In [2]:
np.random.randint(1,10)  # random integer from 1 (inclusive) to 10 (exclusive)

8

* https://numpy.org/doc/stable/reference/random/generated/numpy.random.rand.html

In [3]:
np.random.rand()  # one sample from the uniform distribution over [0, 1)

0.14921428090051736


In [14]:
mu, sigma = 0, 0.1 # mean and standard deviation
s = np.random.normal(mu, sigma, 1000)
len(s)

1000

* https://numpy.org/doc/stable/reference/random/generated/numpy.random.binomial.html

In [6]:
n, p = 10, .5  # number of trials, probability of success on each trial
s = np.random.binomial(n, p, 1000)
# each of the 1000 values is the number of heads in 10 flips of a fair coin
len(s)

1000
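
As a quick sanity check on the binomial draw, the sample mean and variance should be close to n*p and n*p*(1-p). The sketch below uses NumPy's newer `default_rng` generator instead of the legacy `np.random.binomial` call used in this notebook; the seed and sample size are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed, for reproducibility only
n, p = 10, 0.5                  # 10 flips of a fair coin per experiment

# Each entry is the number of heads observed in one experiment of n flips
heads = rng.binomial(n, p, 100_000)

print(heads.mean())  # should be close to n * p = 5
print(heads.var())   # should be close to n * p * (1 - p) = 2.5
```

Every value lies between 0 and n, since it counts successes out of n trials.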

* https://numpy.org/doc/stable/reference/random/generated/numpy.random.randn.html
* Return a sample (or samples) from the “standard normal” distribution.

In [7]:
np.random.randn()

1.033710741178057


In [9]:
# 2x4 array of samples from a normal distribution with mean 3 and standard deviation 2.5
3 + 2.5 * np.random.randn(2, 4)

array([[ 0.49537765, -0.41338517,  3.39878321,  7.88908517],
       [ 3.27685574,  5.02303655,  2.12189718,  7.69390165]])
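
The shift-and-scale pattern above can be checked empirically: multiplying standard-normal draws by sigma and adding mu should give samples whose mean is near mu and whose standard deviation is near sigma. This is a minimal sketch using the `default_rng` generator API rather than `np.random.randn`; the seed and sample size are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, for reproducibility only
mu, sigma = 3, 2.5

# Scale standard-normal draws by sigma, then shift by mu
samples = mu + sigma * rng.standard_normal(100_000)

print(samples.mean())  # should be close to 3
print(samples.std())   # should be close to 2.5
```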


In [18]:
data={
 'a':np.random.binomial(1,0.2,1000),
 'b':np.random.normal(0,1,1000)
}
# Or use np.random.choice(3, 10, p=[0.1, 0.2, 0.7]); p must be passed by keyword
df=pd.DataFrame(data=data)
df=df[df.b>=1].reset_index(drop=True)
df.shape

(169, 2)
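
For a categorical column with more than two levels, a weighted choice can be drawn as sketched below. The probabilities must be passed through the `p` keyword, because the third positional argument of `choice` is `replace`. The sketch uses the `default_rng` generator API; the seed, sample size, and the name `labels` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)  # arbitrary seed, for reproducibility only

# Draw labels 0, 1, 2 with probabilities 0.1, 0.2, 0.7
labels = rng.choice(3, size=10_000, p=[0.1, 0.2, 0.7])

# Observed frequency of each label, which should be close to p
freq = np.bincount(labels, minlength=3) / labels.size
print(freq)
```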

In [20]:
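# Normalization: rescale each group's values to the range [0, 1]
# Note: transform returns a new DataFrame; the result is not assigned here, so df is unchanged
df.groupby('a').transform(lambda x: (x-x.min())/(x.max()-x.min()))
df.head(3)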

   a         b
0  0  1.824137
1  0  1.583663
2  0  1.341293


In [21]:
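# Standardization: rescale each group's values to have mean=0 and std=1
# Note: transform returns a new DataFrame; the result is not assigned here, so df is unchanged
df.groupby('a').transform(lambda x: (x-x.mean())/x.std())
df.head()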

   a         b
0  0  1.824137
1  0  1.583663
2  0  1.341293
3  1  1.425073
4  0  1.357180


* Normalization is a good choice when you know that your data do not follow a Gaussian distribution. It is useful for algorithms that make no assumption about the distribution of the data, such as K-Nearest Neighbors and neural networks.

* Standardization, on the other hand, can be helpful when the data follow a Gaussian distribution, although this is not a strict requirement. Unlike normalization, standardization has no bounding range, so outliers are not squeezed into a fixed interval together with the rest of the data. Note that the mean and standard deviation used for the rescaling are themselves sensitive to extreme values.

* Ultimately, the choice between normalization and standardization depends on your problem and the machine learning algorithm you are using. There is no hard and fast rule for when to normalize or standardize. A practical approach is to fit your model to raw, normalized, and standardized data and compare the performance.
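
The two rescalings discussed above can be sketched side by side as follows. This is a minimal illustration, not a definitive recipe: the frame `demo`, the helper names `min_max` and `z_score`, the new column names, and the seed are all assumptions introduced here. The result of `transform` is assigned back to the frame so that the rescaled values are kept.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
demo = pd.DataFrame({
    "a": rng.binomial(1, 0.2, 1000),  # group label: 0 or 1
    "b": rng.normal(0, 1, 1000),      # values to rescale
})

def min_max(x):
    """Normalization: rescale a Series to the range [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

def z_score(x):
    """Standardization: rescale a Series to mean 0 and sample std 1."""
    return (x - x.mean()) / x.std()

# transform returns a new object aligned to the original index,
# so assign it to a column to keep the rescaled values
demo["b_norm"] = demo.groupby("a")["b"].transform(min_max)
demo["b_std"] = demo.groupby("a")["b"].transform(z_score)

# Per-group check: b_norm spans [0, 1]; b_std has mean ~0 and std ~1
summary = demo.groupby("a")[["b_norm", "b_std"]].agg(["min", "max", "mean", "std"])
print(summary)
```

Within each group, `b_norm` is bounded by 0 and 1, while `b_std` has no fixed bounds and is only centered and scaled.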
