# Standard scaler

In [3]:
%matplotlib widget
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA


In [4]:
iris=pd.read_csv("/home/cryzal/ml/dataset/iris.csv")
x=iris.drop("Species",axis=1)
y=iris["Species"]

# Standard scaler: z-score

## Z-score https://www.youtube.com/watch?v=4Fta6KQ1QHQ

### What does the z-score tell you?

A z-score describes the position of a raw score in terms of its distance from the mean, when measured in standard deviation units. The z-score is positive if the value lies above the mean, and negative if it lies below the mean.

It is also known as a standard score, because it allows comparison of scores on different kinds of variables by standardizing the distribution. A standard normal distribution (SND) is a normally shaped distribution with a mean of 0 and a standard deviation (SD) of 1.

It is useful to standardize the values (raw scores) of a normal distribution by converting them into z-scores because:

    (a) it allows researchers to calculate the probability of a score occurring within a standard normal distribution;

    (b) it enables us to compare two scores that come from different samples (which may have different means and standard deviations).
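
As a small illustration of point (b), the sketch below compares two scores from different samples. The exam names, means, and standard deviations are invented for the example; `z_score` is a hypothetical helper, not a library function.

```python
def z_score(x, mean, sd):
    """Return how many standard deviations x lies from the mean."""
    return (x - mean) / sd

# Hypothetical exam results, invented for illustration only.
z_math = z_score(80, mean=70, sd=10)     # 1 SD above the maths mean
z_physics = z_score(65, mean=50, sd=5)   # 3 SDs above the physics mean

# The raw maths score is higher, but the physics score is further
# above its own group's mean.
print(z_math, z_physics)
```

The raw scores cannot be compared directly because the two samples have different means and spreads; the z-scores can.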

### How do you calculate the z-score?

The formula for calculating a z-score is z = (x-μ)/σ, where x is the raw score, μ is the population mean, and σ is the population standard deviation.

    As the formula shows, the z-score is simply the raw score minus the population mean, divided by the population standard deviation.


When the population mean and the population standard deviation are unknown, the standard score may be calculated using the sample mean (x̄) and sample standard deviation (s) as estimates of the population values.
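
The formula can be sketched with NumPy as follows. The data values are made up; `ddof=0` gives the population standard deviation and `ddof=1` the sample standard deviation used as an estimate.

```python
import numpy as np

# Made-up raw scores, for illustration only.
scores = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu = scores.mean()
sigma = scores.std(ddof=0)  # population standard deviation
s = scores.std(ddof=1)      # sample standard deviation (estimate)

z_pop = (scores - mu) / sigma   # z = (x - mu) / sigma
z_sample = (scores - mu) / s    # same formula with the sample estimate

print(z_pop)
print(z_sample)
```

Because s is slightly larger than σ for the same data, the sample-based z-scores are slightly closer to zero.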

### How do you interpret a z-score?

The value of the z-score tells you how many standard deviations you are away from the mean. If a z-score is equal to 0, it is on the mean.

A positive z-score indicates the raw score is higher than the mean. For example, if a z-score is equal to +1, it is 1 standard deviation above the mean.

A negative z-score indicates the raw score is below the mean. For example, if a z-score is equal to -2, it is 2 standard deviations below the mean.

Another way to interpret z-scores is by creating a standard normal distribution (also known as the z-score distribution or probability distribution).
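
If the data are assumed to be normally distributed, a z-score can be turned into a probability with the standard normal distribution. A minimal sketch using Python's standard library (`statistics.NormalDist`):

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
snd = NormalDist(mu=0.0, sigma=1.0)

p_below_plus1 = snd.cdf(1.0)                  # share of scores below z = +1
p_below_minus2 = snd.cdf(-2.0)                # share of scores below z = -2
p_within_1sd = snd.cdf(1.0) - snd.cdf(-1.0)   # share within 1 SD of the mean

print(round(p_below_plus1, 4))   # about 0.8413
print(round(p_below_minus2, 4))  # about 0.0228
print(round(p_within_1sd, 4))    # about 0.6827
```

These probabilities only apply when the normality assumption is reasonable for the data.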



## sklearn.preprocessing.StandardScaler

### class sklearn.preprocessing.StandardScaler(*, copy=True, with_mean=True, with_std=True)

#### Standardize features by removing the mean and scaling to unit variance

The standard score of a sample x is calculated as:

    z = (x - u) / s

    where u is the mean of the training samples or zero if with_mean=False, and s is the standard deviation of the training samples or one if with_std=False.

    Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. Mean and standard deviation are then stored to be used on later data using transform.
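
A minimal sketch of this fit-then-transform pattern, using small made-up arrays (`X_train` and `X_new` are illustrative names, not part of the notebook):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training data with two features on very different scales.
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
# Later data that was not seen during fitting.
X_new = np.array([[2.0, 400.0]])

scaler = StandardScaler().fit(X_train)
print(scaler.mean_)   # per-feature mean of the training data
print(scaler.scale_)  # per-feature standard deviation (population, ddof=0)

# transform reuses the stored training statistics; nothing is re-estimated.
Z_new = scaler.transform(X_new)
print(Z_new)
```

The later data is scaled with the training mean and standard deviation, so the same raw value always maps to the same z-score.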

    Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with 0 mean and unit variance).

    For instance many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the L1 and L2 regularizers of linear models) assume that all features are centered around 0 and have variance in the same order. If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.
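
To see how one large-scale feature can dominate, here is a rough sketch with Euclidean distance (the quantity inside an RBF kernel). The three points and their two features, height in metres and income in dollars, are invented for illustration.

```python
import numpy as np

# Hypothetical points: [height in metres, income in dollars].
X = np.array([
    [1.60, 30000.0],   # a
    [1.90, 30100.0],   # b: very different height, similar income
    [1.61, 31000.0],   # c: similar height, different income
])

# Raw distances are driven almost entirely by the income column.
raw_ab = np.linalg.norm(X[0] - X[1])
raw_ac = np.linalg.norm(X[0] - X[2])

# Standardize each column by hand: z = (x - mean) / std.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
std_ab = np.linalg.norm(Z[0] - Z[1])
std_ac = np.linalg.norm(Z[0] - Z[2])

print(raw_ab, raw_ac)
print(std_ab, std_ac)
```

Before scaling, the 30 cm height difference between a and b is practically invisible next to the income values; after scaling, both features contribute on a comparable scale.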

    This scaler can also be applied to sparse CSR or CSC matrices by passing with_mean=False to avoid breaking the sparsity structure of the data.
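
A small sketch of scaling a sparse matrix with `with_mean=False`; the matrix values are made up. Subtracting the mean would turn the zeros into non-zero values, which is why centering is skipped here.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import StandardScaler

# Made-up sparse data: most entries are zero.
X = sparse.csr_matrix(np.array([
    [0.0, 2.0],
    [0.0, 0.0],
    [4.0, 0.0],
    [0.0, 6.0],
]))

# Scale to unit variance only; zeros stay zeros.
scaler = StandardScaler(with_mean=False)
X_scaled = scaler.fit_transform(X)

print(sparse.issparse(X_scaled))  # the result is still a sparse matrix
print(X_scaled.nnz, X.nnz)        # the number of stored non-zeros is unchanged
print(X_scaled.toarray())
```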


## Before standard scaling
### The features are on different scales, so the raw x values are hard to compare in one plot

In [3]:
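# Plot the feature values (sepal and petal measurements)
fig=plt.figure()
an=fig.add_subplot(1,1,1)
an.boxplot(x)
an.set_title("iris dataset",fontsize=15,fontweight='bold')
an.set_xlabel("Flower")

# The features are measured on different ranges, so we use StandardScaler
# to put them on a common scale via the z-score.
# Note: boxplot needs numeric input; the TypeError shown below usually means
# x still contained a non-numeric (string) column when this cell ran.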


TypeError: cannot perform reduce with flexible type

In [46]:
# Plot Sepal.Width before standard scaling, to compare with the scaled data later
fig=plt.figure()
an=fig.add_subplot(1,1,1)
an.boxplot(iris["Sepal.Width"])
an.set_title("iris dataset",fontsize=15,fontweight='bold')
an.set_xlabel("Flower")


Text(0.5, 0, 'Flower')

# Applying the standard scaler with StandardScaler.fit_transform

## fit_transform(self, X[, y])
### Fit to data, then transform it.

In [5]:
# Rescale the features: each column is replaced by its z-scores
x=StandardScaler().fit_transform(x)
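
A quick way to check the result is to confirm that every column now has mean 0 and standard deviation 1. The sketch below uses scikit-learn's bundled `load_iris` data as a stand-in for the notebook's CSV file, which is an assumption made so the example is self-contained.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Stand-in for the notebook's CSV features (assumption: same four columns).
X_raw = load_iris().data
X_std = StandardScaler().fit_transform(X_raw)

print(X_std.mean(axis=0).round(6))  # approximately 0 for every feature
print(X_std.std(axis=0).round(6))   # approximately 1 for every feature

# Standardization is a linear map per column, so the ordering of the
# samples within each feature is unchanged.
order_raw = np.argsort(X_raw[:, 0], kind="stable")
order_std = np.argsort(X_std[:, 0], kind="stable")
print(np.array_equal(order_raw, order_std))
```

The values themselves change, but each feature keeps the same shape and ordering; only its location and scale are different.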


## After standard scaling: the shape of each feature's distribution is unchanged, only its scale changes
### Now x can be plotted easily

In [5]:
# Each feature keeps the same distribution shape as before; only the scale has changed
# All features of x now fit on one plot
fig=plt.figure()

an=fig.add_subplot(1,1,1)

an.boxplot(x)

an.set_title("iris dataset",fontsize=15,fontweight='bold')
an.set_xlabel("Flower")
an.set_ylabel("mean ")


Text(0, 0.5, 'mean ')

### Implement PCA

In [6]:
pca=PCA(n_components=2,random_state=0) # n_components: number of principal components to keep

pcs=pca.fit_transform(x)

pcs

array([[-2.26470281,  0.4800266 ],
       [-2.08096115, -0.67413356],
       [-2.36422905, -0.34190802],
       [-2.29938422, -0.59739451],
       [-2.38984217,  0.64683538],
       [-2.07563095,  1.48917752],
       [-2.44402884,  0.0476442 ],
       [-2.23284716,  0.22314807],
       [-2.33464048, -1.11532768],
       [-2.18432817, -0.46901356],
       [-2.1663101 ,  1.04369065],
       [-2.32613087,  0.13307834],
       [-2.2184509 , -0.72867617],
       [-2.6331007 , -0.96150673],
       [-2.1987406 ,  1.86005711],
       [-2.26221453,  2.68628449],
       [-2.2075877 ,  1.48360936],
       [-2.19034951,  0.48883832],
       [-1.898572  ,  1.40501879],
       [-2.34336905,  1.12784938],
       [-1.914323  ,  0.40885571],
       [-2.20701284,  0.92412143],
       [-2.7743447 ,  0.45834367],
       [-1.81866953,  0.08555853],
       [-2.22716331,  0.13725446],
       [-1.95184633, -0.62561859],
       [-2.05115137,  0.24216355],
       [-2.16857717,  0.52714953],
       [-2.13956345,
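
To judge how much of the original variance the two components keep, the fitted PCA object exposes `explained_variance_ratio_`. The sketch below again uses `load_iris` as a stand-in for the notebook's CSV file (an assumption), so the numbers may differ slightly from the notebook's own data.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): standardized iris features.
X_std = StandardScaler().fit_transform(load_iris().data)

pca = PCA(n_components=2, random_state=0)
pcs = pca.fit_transform(X_std)

print(pcs.shape)                            # one row per sample, two components
print(pca.explained_variance_ratio_)        # share of variance per component
print(pca.explained_variance_ratio_.sum())  # share kept by both components together
```

A total close to 1 suggests that the 2-D projection loses little of the information in the four original features.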

## Put the PCA values into DataFrame columns and join the species

In [7]:
df = pd.DataFrame(data=pcs,columns=['pc1','pc2'])

df =pd.concat([df,y],axis=1)

df

|     | pc1       | pc2       | Species   |
|-----|-----------|-----------|-----------|
| 0   | -2.264703 | 0.480027  | setosa    |
| 1   | -2.080961 | -0.674134 | setosa    |
| 2   | -2.364229 | -0.341908 | setosa    |
| 3   | -2.299384 | -0.597395 | setosa    |
| 4   | -2.389842 | 0.646835  | setosa    |
| ... | ...       | ...       | ...       |
| 145 | 1.870503  | 0.386966  | virginica |
| 146 | 1.564580  | -0.896687 | virginica |
| 147 | 1.521170  | 0.269069  | virginica |
| 148 | 1.372788  | 1.011254  | virginica |
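
One thing to watch with `pd.concat(..., axis=1)` is that rows are matched by index label, not by position. It works here because both objects have the default 0..n-1 index. The sketch below uses made-up data to show what can happen when the labels differ, and one possible fix with `reset_index(drop=True)`.

```python
import pandas as pd

# Made-up component values with the default index 0, 1, 2.
pcs_df = pd.DataFrame({"pc1": [0.1, 0.2, 0.3], "pc2": [1.0, 2.0, 3.0]})
# Made-up labels whose index does not match (e.g. after filtering rows).
labels = pd.Series(["a", "b", "c"], index=[10, 11, 12], name="Species")

# Indexes differ, so concat produces extra rows filled with NaN.
misaligned = pd.concat([pcs_df, labels], axis=1)

# Resetting the index makes the rows line up by position.
aligned = pd.concat([pcs_df, labels.reset_index(drop=True)], axis=1)

print(misaligned.shape)
print(aligned.shape)
```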


In [8]:
fig = plt.figure(figsize=(8,8))

ax =fig.add_subplot(1,1,1)
ax.set_xlabel('Principal Component 1',fontsize = 15)
ax.set_ylabel('Principal Component 2',fontsize =15)
ax.set_title('2-component PCA', fontsize = 20)




species = ['setosa','versicolor','virginica']
colors =['r','g','b']


for target,color in zip(species,colors):
    indices = df['Species'] == target
    ax.scatter(df.loc[indices,'pc1'],df.loc[indices,'pc2'],c = color,s = 50)
ax.legend(species)
ax.grid()



