# Think Stats
> Exploratory Data Analysis in Python

## Ch-1 Exploratory Data Analysis

**Problem**:- Do first babies tend to arrive late?

**Anecdotal evidence**:- Evidence based on unpublished, usually personal, data, such as examples from our friends. This kind of evidence has several flaws, namely
    
* Small number of observations 
* Selection bias (people who join the conversation may be interested precisely because their first baby was late)
* Confirmation bias (people who believe the claim are more likely to come up with examples that confirm it, and vice versa for those who do not believe it)
    
To solve this problem using **statistics**, we will do:

1. Data collection
2. Descriptive statistics: generate statistics that summarize the data concisely
3. Exploratory data analysis: look for patterns, differences, and other features that address the question
4. Estimation: use data from a sample to estimate characteristics of the general population
5. Hypothesis testing: when we see a difference between groups, evaluate whether the effect might have happened by chance

Important terms:
* cross-sectional study: study that collects data about a population at a particular point in time
* cycle: in a repeated cross-sectional study, each repetition is called a cycle
* longitudinal study: study that follows a population over time, collecting data from the same group repeatedly
* population: group we are interested in studying
* sample: subset of population used to collect data
* representative: a sample is representative if every member of the population has the same chance of being in the sample
* oversampling: the technique of increasing the representation of a sub-population in order to avoid errors due to small sample size.
* recode: a value that is generated by calculation and other logic applied to raw data.

Create some validation statistics: compute counts, means, or similar summaries and store the values. When you move the data from one place to another, or when someone else works with it, they can recompute the statistics and compare them with the stored ones to confirm that the data they are working with is correct.
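
As a rough sketch of the idea, the snippet below stores a small set of validation statistics and compares them after the data has been copied. It uses only the standard library; the function name `validation_stats` and the values are made up for illustration.

```python
import statistics

def validation_stats(values):
    """Return a small fingerprint of a numeric column (hypothetical helper)."""
    return {
        "count": len(values),
        "mean": round(statistics.fmean(values), 6),
        "min": min(values),
        "max": max(values),
    }

# Made-up column of pregnancy lengths in weeks.
original = [39, 40, 38, 41, 39, 37]
expected = validation_stats(original)

# Later, after copying or reloading the data, recompute and compare.
reloaded = list(original)
assert validation_stats(reloaded) == expected
```

If a row is lost or a value is changed along the way, at least one of the stored statistics will usually differ, which is the signal to investigate.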

## Ch-2 Distributions

A histogram is one of the best ways to describe a variable: it shows each value and how many times it appears. A histogram also makes it easy to identify **outliers**. After identifying the outliers, check whether they are due to errors or are genuine rare cases.

These are some of the characteristics that we want to report:
* central tendency (mean): do the values tend to cluster around a particular point
* modes: is there more than one cluster
* spread (variance): how much variability is there in the values
* tails: how quickly the probabilities drop off as we move away from the modes
* outliers: are there extreme values, and are they natural or not

These are called **summary statistics**. 

- mean: the sum of the values divided by the number of values, $\bar{x} = \frac{1}{n} \sum_i x_i$
- average: a looser term that can refer to any measure of central tendency
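
A minimal sketch of computing a few of these summary statistics with Python's `statistics` module. The data is made up, and the population forms of variance and standard deviation (dividing by n) are an assumption made here for simplicity.

```python
import statistics

values = [1, 2, 2, 3, 3, 3, 4, 10]  # made-up data; 10 is a possible outlier

mean = statistics.fmean(values)          # central tendency
modes = statistics.multimode(values)     # most frequent value(s)
variance = statistics.pvariance(values)  # spread, dividing by n
std = statistics.pstdev(values)          # square root of the variance

print(mean, modes, variance, std)
```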

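The simplest way to measure an effect size is to take the **difference of the means** of the two groups.

**Cohen's d** can also be used to measure the effect size: it expresses the difference in means relative to the variability within the groups (the pooled standard deviation).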

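```python
import math

def CohenEffectSize(group1, group2):
    """Return Cohen's d for two groups (e.g. pandas Series)."""
    diff = group1.mean() - group2.mean()
    var1, var2 = group1.var(), group2.var()
    n1, n2 = len(group1), len(group2)

    # Pool the variances, weighting each group by its size.
    pooled_var = (n1 * var1 + n2 * var2) / (n1 + n2)
    d = diff / math.sqrt(pooled_var)
    return d
```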

For the first-baby data, this function gives a difference in means of about 0.028 standard deviations, which is small.

## Ch-3 Probability Mass Functions

Another way to represent a distribution is a **probability mass function** (PMF), which maps each value to its probability. To get a PMF, compute the histogram and then divide each count by the total number of values n (i.e. normalize).
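
The normalization step can be sketched in a few lines. `make_pmf` is a hypothetical helper written for these notes, not a function from any library.

```python
from collections import Counter

def make_pmf(values):
    """Map each distinct value to its probability (relative frequency)."""
    hist = Counter(values)  # histogram: value -> count
    n = len(values)
    return {value: count / n for value, count in hist.items()}

pmf = make_pmf([1, 2, 2, 3, 5])
print(pmf)  # the value 2 appears twice out of five, so its probability is 0.4
```

After normalization the probabilities add up to 1, which is what distinguishes a PMF from a histogram.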

To plot a PMF:
* bar graph: if the number of values is small
* step function: if the number of values is large and the PMF is smooth (like a bar plot with only the outline drawn; it is not cumulative)

Histograms and PMFs are useful while you are exploring data and trying to identify patterns and relationships. Once you have an idea of what is going on, the next step is to design a visualization that makes the patterns you have identified as clear as possible.

**Biased PMFs**

If you want to know the average number of students in each class, you can find out in two ways:
* ask the college, which can use its records to give you the actual average
* ask the students, in which case you will probably get a larger value: a large class contains more students, so you are more likely to sample students from large classes, and each of them reports the large class size

This is one way we can end up with a biased PMF.
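
A minimal sketch of this effect, assuming a made-up college where half the classes have 10 students and half have 50. A class of size x is reported by x students, so the chance of hearing size x is proportional to x times its actual probability. The helper names are hypothetical.

```python
def pmf_mean(pmf):
    """Mean of a PMF stored as {value: probability}."""
    return sum(x * p for x, p in pmf.items())

def bias_pmf(pmf):
    """Weight each class size by itself, then renormalize."""
    weighted = {x: p * x for x, p in pmf.items()}
    total = sum(weighted.values())
    return {x: w / total for x, w in weighted.items()}

# Made-up college: half the classes have 10 students, half have 50.
actual = {10: 0.5, 50: 0.5}
observed = bias_pmf(actual)

print(pmf_mean(actual))    # what the college would report: 30
print(pmf_mean(observed))  # what a survey of students would suggest: about 43.3
```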

## Ch-4 Cumulative Distribution Functions

PMFs work well if the number of values is small. But as the number of values increases, the probability associated with each value gets smaller and the effect of random noise increases. This can be mitigated by binning the data, but choosing the bin size is hard: bins that are large enough to smooth out noise might also smooth out useful information.

An **alternative that avoids these problems is the cumulative distribution function (CDF)**.

Tests often report scores as a **percentile rank**: the fraction of people, expressed as a percentage, who scored lower than you (or the same). So if you are in the 90th percentile, you did as well as or better than 90% of the people who took the exam.

**Percentile** is the reverse operation: given a percentile rank, it returns the corresponding value. For example, the 90th percentile is the score whose percentile rank is 90.

A **CDF** is the function that maps from a value to its percentile rank, expressed as a fraction between 0 and 1. Say you plot pregnancy length in weeks on the x-axis and the CDF on the y-axis. To read the graph, pick a value on the x-axis and look up the CDF: if the curve is at 0.1 at 36 weeks, then about 10% of pregnancies are 36 weeks or shorter. **CDFs are very useful for comparing distributions**.

**Interquartile range** (IQR) is a measure of the spread of a distribution: the difference between the 75th and 25th percentiles.
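
The definitions above can be sketched directly in code. This is a simple, unoptimized version meant only to show how percentile rank, percentile, CDF and IQR relate; the function names and the scores are made up for illustration.

```python
def percentile_rank(scores, your_score):
    """Percentage of scores less than or equal to your_score."""
    count = sum(1 for s in scores if s <= your_score)
    return 100.0 * count / len(scores)

def percentile(scores, rank):
    """Smallest score whose percentile rank is at least `rank`."""
    for s in sorted(scores):
        if percentile_rank(scores, s) >= rank:
            return s
    return max(scores)

def cdf(sample, x):
    """Fraction of the sample less than or equal to x."""
    return percentile_rank(sample, x) / 100.0

scores = [55, 66, 77, 88, 99]
rank_of_88 = percentile_rank(scores, 88)             # 80.0
median_score = percentile(scores, 50)                # 77
iqr = percentile(scores, 75) - percentile(scores, 25)
```

Here the score 88 has a percentile rank of 80 because four of the five scores are less than or equal to it, and going the other way, the 50th percentile is 77.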

## Ch-5 Modeling distributions

The distributions we have discussed so far are called **empirical distributions** because they are based on empirical observations, which are necessarily finite samples. The alternative is an **analytic distribution**, which is characterized by a CDF that is a mathematical function.

### Exponential Distribution
\begin{equation*}
CDF(x) = 1 - e^{-\lambda x}
\end{equation*}

This distribution comes up when we look at a series of events and measure the times between events. If the events are equally likely to occur at any time, the times between them tend to follow an exponential distribution. The mean of this distribution is $1/\lambda$, so you can estimate $\lambda$ for your data as $1/\bar{x}$. The mean is susceptible to outliers, so when outliers are a concern you can use the median instead: $\lambda = \ln(2)/\text{median}$.
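
A small simulation can illustrate both estimates. It draws samples with `random.expovariate` using an assumed rate of 2.0 (chosen arbitrarily for the example) and recovers $\lambda$ from the sample mean and from the sample median.

```python
import math
import random
import statistics

random.seed(17)
true_lam = 2.0  # assumed rate for the simulation
sample = [random.expovariate(true_lam) for _ in range(10_000)]

# Mean of an exponential distribution is 1/lambda.
lam_from_mean = 1 / statistics.fmean(sample)

# Median of an exponential distribution is ln(2)/lambda.
lam_from_median = math.log(2) / statistics.median(sample)

print(lam_from_mean, lam_from_median)  # both should land near true_lam
```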

### Normal Distribution
The CDF of a normal distribution is an S-shaped (sigmoid) curve.

### Lognormal Distribution
If $\log(x)$ follows a normal distribution, then $x$ follows a lognormal distribution.

## Ch-6 Probability Density Functions

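The derivative of a CDF is called a **probability density function** (PDF). Evaluating a PDF at a particular value is usually not useful on its own, because the result is a probability density, not a probability. To get a probability, integrate the density over a range of values.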
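
To make the distinction concrete, here is a sketch using the exponential distribution with an assumed rate of 2.0. It checks numerically that the derivative of the CDF matches the PDF, and that integrating the PDF over a range gives a probability. The helper `integrate` is a plain midpoint rule written for this example.

```python
import math

lam = 2.0  # assumed rate, chosen only for illustration

def exp_cdf(x):
    return 1 - math.exp(-lam * x)

def exp_pdf(x):
    return lam * math.exp(-lam * x)

def integrate(f, a, b, steps=1000):
    """Midpoint-rule approximation of the integral of f from a to b."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width

# The density at 0 is 2.0, which is greater than 1, so it cannot be a probability.
density_at_zero = exp_pdf(0.0)

# A numerical derivative of the CDF approximates the PDF.
h = 1e-6
slope = (exp_cdf(0.5 + h) - exp_cdf(0.5 - h)) / (2 * h)

# Integrating the density over [0, 1] gives P(0 < X <= 1).
prob = integrate(exp_pdf, 0.0, 1.0)
```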

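**Skewness** is a property that describes the shape of a distribution. If the distribution is symmetric around its central tendency, it is unskewed. If the values extend farther to the right, it is "right skewed", and if the values extend farther to the left, it is "left skewed".

Comparing the mean and the median gives a quick check for skewness: extreme values pull the mean more than the median, so in a right-skewed distribution the mean tends to be greater than the median, and in a left-skewed one it tends to be less. **Pearson's median skewness coefficient** turns this comparison into a single number.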

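```python
def PearsonMedianSkewness(x):
    """Return Pearson's median skewness coefficient for a pandas Series."""
    median = x.median()
    mean = x.mean()
    std = x.std()
    # Positive when the mean exceeds the median (right skew).
    return 3 * (mean - median) / std
```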

This statistic is **robust**, meaning it is less vulnerable to the effect of outliers.

## Ch-7 Relationships Between Variables

**Scatter plots** are the simplest way to check for a relationship between two variables.

**Correlation** is a statistic intended to quantify the strength of the relationship between two variables. The difficulty is that the variables may not be in the same units, or may come from different distributions. Two common solutions:
* **Pearson product-moment correlation coefficient**: transform each value to a standard score, the number of standard deviations from the mean.
* **Spearman rank correlation coefficient**: transform each value to its rank (or percentile rank) in the sorted list of values.

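**Covariance** is a measure of the tendency of two variables to vary together. Take the product of each pair of deviations from the means, then average the products: $Cov(X, Y) = \frac{1}{n} \sum_i (x_i - \bar{x})(y_i - \bar{y})$.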
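
The three quantities can be sketched with the standard library alone. The helper names are hypothetical, the rank function assumes there are no ties, and the data is made up to show a relationship that is monotonic but not linear.

```python
import statistics

def cov(xs, ys):
    """Average product of deviations from the means (divides by n)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def pearson(xs, ys):
    """Covariance divided by the product of the standard deviations."""
    return cov(xs, ys) / (statistics.pstdev(xs) * statistics.pstdev(ys))

def ranks(values):
    """Rank each value from 1 to n (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman(xs, ys):
    """Pearson correlation computed on the ranks."""
    return pearson(ranks(xs), ranks(ys))

xs = [1, 2, 3, 4, 5]
ys = [1, 4, 9, 16, 25]  # monotonic but nonlinear

print(cov(xs, ys), pearson(xs, ys), spearman(xs, ys))
```

Because the relationship is perfectly monotonic, the Spearman coefficient is 1, while the Pearson coefficient is slightly below 1 since the relationship is not a straight line.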

## Ch-8 Estimation

The obvious estimator of the population variance, $S^2 = \frac{1}{N} \sum_i (x_i - \bar{x})^2$, works well for large samples, but for small samples it tends to be too low. Because of this it is called a **biased** estimator. Dividing by $N-1$ instead of $N$ gives an unbiased estimator.

Variation in the estimate caused by random selection is called **sampling error**.
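
A simulation is one way to see both ideas. The sketch below assumes samples of size 5 drawn from a normal distribution with mean 0 and variance 1 (arbitrary choices), and compares the two variance estimators over many repetitions. The spread of the sample means across repetitions illustrates sampling error.

```python
import random
import statistics

random.seed(1)
n, trials = 5, 10_000  # small samples, many repetitions
true_var = 1.0         # variance of the assumed population

biased, unbiased, means = [], [], []
for _ in range(trials):
    sample = [random.gauss(0, 1) for _ in range(n)]
    m = statistics.fmean(sample)
    ss = sum((x - m) ** 2 for x in sample)  # sum of squared deviations
    biased.append(ss / n)          # divides by N
    unbiased.append(ss / (n - 1))  # divides by N - 1
    means.append(m)

avg_biased = statistics.fmean(biased)      # tends toward (n-1)/n * true_var
avg_unbiased = statistics.fmean(unbiased)  # tends toward true_var
standard_error = statistics.pstdev(means)  # spread of the estimated mean

print(avg_biased, avg_unbiased, standard_error)
```

On average the divide-by-N estimator comes out too low by a factor of about (N-1)/N, while the divide-by-(N-1) version averages close to the true variance.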

## Ch-9 Hypothesis Testing

The fundamental question we want to address is whether the effects we see in a sample are likely to appear in the larger population, or whether they are just an artifact of the particular sample we happened to get.

**Classical hypothesis testing**: Given a sample and an apparent effect, what is the probability of seeing such an effect by chance?
1. Choose a **test statistic** that quantifies the size of the apparent effect, such as the difference in means between two groups.
2. Define a **null hypothesis**, a model of the system based on the assumption that the apparent effect is not real.
3. Compute a **p-value**, the probability of seeing an effect at least as large as the apparent one if the null hypothesis is true.
4. Interpret the result. If the p-value is small, the effect is unlikely to have occurred by chance, so it is more likely to appear in the larger population. Such an effect is called **statistically significant**.
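
One common way to carry out these steps is a permutation test, sketched below. The test statistic is the absolute difference in means, and the null hypothesis is modeled by shuffling the pooled data so that group labels carry no information. The function name, the number of iterations and the data are assumptions made for illustration.

```python
import random
import statistics

def permutation_p_value(group1, group2, iters=1000, seed=0):
    """Estimate the p-value for the difference in means by permutation."""
    rng = random.Random(seed)
    # Step 1: test statistic for the observed data.
    observed = abs(statistics.fmean(group1) - statistics.fmean(group2))

    # Step 2: under the null hypothesis the groups are interchangeable.
    pooled = list(group1) + list(group2)
    n1 = len(group1)

    # Step 3: count how often a shuffled split is at least as extreme.
    count = 0
    for _ in range(iters):
        rng.shuffle(pooled)
        diff = abs(statistics.fmean(pooled[:n1]) - statistics.fmean(pooled[n1:]))
        if diff >= observed:
            count += 1
    return count / iters

# Made-up groups that are clearly different, and two that are identical.
p_diff = permutation_p_value([1, 2, 3, 4, 5], [101, 102, 103, 104, 105])
p_same = permutation_p_value([1, 2, 3], [1, 2, 3])
```

For the clearly separated groups the estimated p-value is small, so the difference is unlikely to be due to chance. For the identical groups the observed difference is zero, every shuffle is at least as extreme, and the p-value is 1.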