## Why learn distributions?

<p>
Let us take the example of house prices in Hyderabad:

<ol>
<li> The random variable is <i>h</i>, the price of a house, where <i>h</i> belongs to the set of real numbers.
Given data on house prices, we can compute the mean (the average price of a house in Hyderabad) and the variance or standard deviation (a measure of how much house prices vary around the mean).</li>

<li> Let <i>h</i> ~ N, which says that house prices are normally distributed. Suppose we know both:
<ul>
<li>The mean price of a house = 25 lakhs</li>
<li>The standard deviation = 3.5 lakhs</li>
</ul>
</li>
</ol>
We can then say that a house price will fall in the range 21.5 to 28.5 lakhs (one standard deviation either side of the mean), and this will be correct about 68% of the time.
So we can make a statement about the price of a house given just these <i>two parameters</i>, which fully describe our distribution. </p>

<p>
Since we already know the data is distributed normally:
<ol>

<li>68% of the house prices will lie in the range 21.5 to 28.5 lakhs</li>
<li>95% of the house prices will lie in the range 18 to 32 lakhs</li>
<li>99.7% of the house prices will lie in the range 14.5 to 35.5 lakhs</li>
</ol>

By learning distributions, we can model our data with just a few parameters. With the Gaussian distribution, for example, knowing the mean and variance is enough to draw conclusions about the distribution of prices.

Example: Andrew Ng's house price prediction example, which introduces supervised learning (https://www.youtube.com/watch?v=hfO6iRj-GZo - video 1.1.2)
</p>
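
The 68%, 95% and 99.7% ranges above can be checked numerically. The following is a minimal sketch using `statistics.NormalDist` from the Python standard library, assuming the mean of 25 lakhs and standard deviation of 3.5 lakhs from the example. The helper name `coverage` is made up for illustration.

```python
from statistics import NormalDist

# Assumed parameters from the example: mean 25 lakhs, standard deviation 3.5 lakhs.
prices = NormalDist(mu=25, sigma=3.5)


def coverage(k):
    """Return (low, high, probability) for the interval mean +/- k standard deviations."""
    low = prices.mean - k * prices.stdev
    high = prices.mean + k * prices.stdev
    return low, high, prices.cdf(high) - prices.cdf(low)


for k in (1, 2, 3):
    low, high, prob = coverage(k)
    print(f"within {k} sd: {low:.1f} to {high:.1f} lakhs covers {prob:.1%} of prices")
```

The three probabilities come out close to 68%, 95% and 99.7%, which is where the ranges in the list come from.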



## What if a given distribution doesn't belong to any standard distribution?
<p>
It might be due to the following reasons:
<ol>
<li> We might have screwed up the analysis :p </li>
<li> Occurrence of a new phenomenon which has not been observed before </li>
<li> Outliers in the data: the mean, variance and standard deviation are parametric statistics that get skewed by outliers, so we may need to use non-parametric statistics instead </li>

</ol>
</p>

<p>
There are two schools of thought:
<ol>
<li> Parametric (mean, variance, standard deviation) </li>
<li> Non-parametric (median, IQR (interquartile range), MAD (median absolute deviation), percentiles (25th, 50th and 75th)) </li>
</ol>
</ol>
</p>
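
To make the two families of summaries concrete, here is a small sketch that computes both on the same data with the standard-library `statistics` module. The price list is hypothetical, made up purely for illustration, and the `"inclusive"` quantile method is just one of several common conventions.

```python
from statistics import mean, median, pstdev, pvariance, quantiles

# Hypothetical house prices in lakhs (illustrative values, not real data).
prices = [18, 20, 22, 23, 24, 25, 26, 27, 28, 30, 32]

# Parametric summaries: mean, variance, standard deviation.
mu = mean(prices)
variance = pvariance(prices)
sigma = pstdev(prices)

# Non-parametric summaries: median, quartiles (25th/50th/75th percentiles), IQR.
q1, q2, q3 = quantiles(prices, n=4, method="inclusive")
iqr = q3 - q1

print(f"mean={mu}, variance={variance:.2f}, sd={sigma:.2f}")
print(f"median={median(prices)}, Q1={q1}, Q3={q3}, IQR={iqr}")
```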
<p>
Question: Can poker be modeled?
Poker Deep learning algorithms(https://www.quora.com/How-can-one-apply-machine-learning-to-poker)
Texas hold'em AI Bot (http://spectrum.ieee.org/automaton/robotics/artificial-intelligence/texas-holdem-ai-bot-taps-deep-learning-to-demolish-humans) 
</p>

## Given the house prices, what are the best ways to calculate the mean and variance?

1. By using the median and MAD:

a. How do you account for outliers?

b. Do we use the CLT (central limit theorem)?

Even the CLT cannot escape outliers. If more than 25% of the values in a dataset are outliers, the data is corrupt; in that case a good share of the values used to calculate the mean would themselves be outliers.

Example: Consider house prices in Mumbai, h = {h1, h2, ..., hn}, and suppose they are normally distributed (h ~ N).

If the prices were h = {10k, 10k, 10k, ..., 5l, 5l, 6l, ..., 40l, 50l, ..., 500cr}, the 500 crore house is clearly an outlier. In fact, the price of Antilia (Ambani-owned) is about 13000 crores, and if we took the mean here, such outliers would completely skew our mean and variance.

What would we do, given that our distribution is normal (Gaussian)?

For a symmetric distribution, mean = median. Take the median instead of the mean to limit the effect of outliers. The advantage of the median is that its breakdown point (the fraction of house prices that must be corrupted before the median itself is corrupted) is 50%, which means the median gets corrupted only if 50% of the houses are 500 crore villas and 13000 crore Antilias.

Use the MAD (median absolute deviation) instead of the variance for the same reason: it won't get corrupted by outliers.
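
The effect of a single extreme house on the mean versus the median and MAD can be seen in a short sketch. The prices below are hypothetical values in lakhs, and the helper `mad` is written here for illustration (it is not a standard-library function). One crore is 100 lakhs, so a 500 crore house is 50,000 lakhs.

```python
from statistics import mean, median, pstdev


def mad(values):
    """Median absolute deviation: the median of |x - median(values)|."""
    values = list(values)
    centre = median(values)
    return median(abs(x - centre) for x in values)


# Hypothetical prices in lakhs (illustrative only).
typical = [5, 5, 6, 20, 25, 30, 40, 50]
with_outlier = typical + [50_000]  # add one 500 crore house

for label, data in (("typical", typical), ("with outlier", with_outlier)):
    print(
        f"{label}: mean={mean(data):.1f}, sd={pstdev(data):.1f}, "
        f"median={median(data)}, MAD={mad(data)}"
    )
```

The mean and standard deviation jump by orders of magnitude when the outlier is added, while the median and MAD barely move.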

2. Another way is:

Keep widening a window to the left and right of the mean until it covers 68% of the data. The half-width of that window is our standard deviation.
If the mean price of houses = 25 lakhs, then move left and right of 25 lakhs until we cover 68% of the houses; the distance moved on each side is our standard deviation.
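
The window idea can be sketched as follows. Growing a symmetric window around the centre until it holds a given share of the data is the same as sorting the distances from the centre and reading off the distance at that share. The function name `window_sd` and the simulated sample are assumptions for illustration, and the sketch uses 68% coverage, the share of a normal distribution that lies within one standard deviation of the mean.

```python
import math
import random
from statistics import mean


def window_sd(values, coverage=0.68):
    """Half-width of the smallest window centred on the mean that
    contains at least `coverage` of the values."""
    centre = mean(values)
    distances = sorted(abs(x - centre) for x in values)
    needed = math.ceil(coverage * len(distances))  # points the window must hold
    return distances[needed - 1]


# Simulated prices in lakhs, drawn from N(25, 3.5) for illustration only.
random.seed(0)
sample = [random.gauss(25, 3.5) for _ in range(10_000)]

# For normally distributed data the result should land near 3.5.
print(round(window_sd(sample), 2))
```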

3. Outliers in fraud detection: find the outlier transactions (for example, if I swiped my card at PVR and 10 minutes later a transaction with the same card is detected somewhere in Russia).

### Expectations:

E(X) = sum of (x * f(x)), where x is a value the random variable can take and f(x) is the probability of that value

##### Assignment : Prove E((X-mean)^2) = sigma^2

The expected value, also called the expectation or mathematical expectation of a real random variable X is denoted E(X). It's also called the mean of X denoted μ. It's the average of the values that X takes on weighted by the probabilities that X takes on those values.

1. Discrete case:

It's easy to compute when the random variable is a discrete random variable. It can be stated as the sum

                               E(X) = ∑ x⋅P(X=x)

where P(X=x) denotes the probability that X takes on the value x.

Example: Fair die, uniform discrete probability

Take the case of rolling a fair die, and let X denote the number that appears. It will be one of the six numbers 1, 2, 3, 4, 5, or 6, each with probability 1/6.  Then

       E(X) = 1⋅(1/6) + 2⋅(1/6) + ⋯ + 6⋅(1/6) = 3.5

2. Function of a random variable: Y = f(X). Let's say Y = X^2 in this case.

If Y is the square of the number showing (Y = X^2), then Y will take on one of the six numbers 1, 4, 9, 16, 25, or 36, each with probability 1/6. So

       E(Y) = 1⋅(1/6) + 4⋅(1/6) + ⋯ + 36⋅(1/6) ≈ 15.17

3. Example: Non-uniform discrete probability

Suppose you have a loaded die so that the numbers 4, 5, and 6 each come up with probability 2/9 while 1, 2, and 3 each come up with probability 1/9. Then the expectation of the number X that appears will be

       E(X) = 1⋅(1/9) + 2⋅(1/9) + 3⋅(1/9) + 4⋅(2/9) + 5⋅(2/9) + 6⋅(2/9) = 36/9 = 4
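
The three expectation examples above can be reproduced exactly with a few lines of Python. This is a minimal sketch using `fractions.Fraction` so the arithmetic stays exact; the helper `expectation` and the dictionary representation of a distribution are choices made here for illustration. The last lines also compute E((X - mean)^2) for the fair die, the quantity in the assignment, and compare it with E(X^2) - mean^2.

```python
from fractions import Fraction


def expectation(pmf, g=lambda x: x):
    """E[g(X)] for a discrete random variable given as {value: probability}."""
    return sum(g(x) * p for x, p in pmf.items())


fair_die = {x: Fraction(1, 6) for x in range(1, 7)}
loaded_die = {x: Fraction(1, 9) if x <= 3 else Fraction(2, 9) for x in range(1, 7)}

mu = expectation(fair_die)                             # E(X)
second_moment = expectation(fair_die, lambda x: x**2)  # E(X^2)
loaded_mu = expectation(loaded_die)                    # E(X) for the loaded die

print(mu, float(mu))                        # 7/2 3.5
print(second_moment, float(second_moment))  # 91/6, about 15.17
print(loaded_mu)                            # 4

# Variance of the fair die computed two ways.
variance = expectation(fair_die, lambda x: (x - mu) ** 2)
print(variance, variance == second_moment - mu**2)  # 35/12 True
```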
       
### Data normalization and standardization:

Normalization and standardization are closely related, and both deal with the issue of feature scaling.

Normalization:
The simplest method is to rescale each feature so that its values lie in the range [0, 1] or [−1, 1]. The general formula, with an example, is given below.

Suppose we want to graph height against weight. Height normally ranges from about 5 ft to 6 ft 5 in, and weight from about 50 to 100 kg.
Because the two are measured on different scales, plotting them together can be confusing.
A simple pre-processing step is normalization, where we rescale the values to [0, 1] using the formula below (a further linear rescaling maps them to [−1, 1]).

H(norm) = (H - Hmin)/(Hmax - Hmin)

where H is a value from the height group, and Hmin and Hmax are the smallest and largest values in that group.

Example:
Given the weight of a person in a class (H = 78 kg)
Maximum weight in the class (Hmax = 98 kg)
Minimum weight in the class (Hmin = 50 kg)

Then H(normalized) = (78 - 50)/(98 - 50) = 28/48 ≈ 0.58
So the normalized weight of the person weighing 78 kg is about 0.58.
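
The worked example can be written as a small function. This is a sketch only; the names `min_max_normalize` and `to_symmetric_range` are made up for illustration, and the second helper shows one way to map a [0, 1] value onto [-1, 1].

```python
def min_max_normalize(value, minimum, maximum):
    """Rescale `value` to [0, 1] given the minimum and maximum of its feature."""
    if maximum == minimum:
        raise ValueError("maximum and minimum must differ")
    return (value - minimum) / (maximum - minimum)


def to_symmetric_range(normalized):
    """Map a value in [0, 1] onto [-1, 1]."""
    return 2 * normalized - 1


h_norm = min_max_normalize(78, minimum=50, maximum=98)
print(round(h_norm, 2))                      # 0.58
print(round(to_symmetric_range(h_norm), 2))  # 0.17
```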
 
Standardization:
In machine learning we handle various types of data, e.g. audio signals and pixel values for image data, and this data can have multiple dimensions. Feature standardization makes the values of each feature in the data have zero mean and unit variance.

Standardization is similar to normalization, but instead of the max and min values we use the mean and standard deviation.

Formula: H(standardized) = (H - H(mean))/(standard deviation)
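
A minimal sketch of the standardization formula, using the standard-library `statistics` module. The weights are hypothetical values made up for illustration, and the population standard deviation (`pstdev`) is used here; using the sample standard deviation instead is an equally common choice.

```python
from statistics import mean, pstdev


def standardize(values):
    """Return z-scores: (x - mean) / standard deviation for each x."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("standard deviation is zero; all values are identical")
    return [(x - mu) / sigma for x in values]


# Hypothetical weights in kg (illustrative only).
weights = [50, 62, 70, 78, 85, 98]
z_scores = standardize(weights)

print([round(z, 2) for z in z_scores])
# After standardization the feature has mean 0 and standard deviation 1
# (up to floating-point error).
print(round(mean(z_scores), 6), round(pstdev(z_scores), 6))
```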

### Self-Assessment assignment:
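1. Use a grid in plots as much as possible

2. Plot the CDF (sn.kdeplot with cumulative=True)

3. Distribution plots (plot feature distributions separately).

4. Q-Q plot:

        import matplotlib.pyplot as plt
        import statsmodels.api as sm

        # data is assumed to be a pandas DataFrame with an 'age' column
        sm.qqplot(data['age'], line='q')  # 'q' fits the reference line through the quartiles
        plt.show()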
