# Chapter 5. Introduction to Statistics in Python

Statistics: the practice and study of collecting and analyzing data.

Do's
- How likely is someone to purchase a product?
- Are people more likely to purchase it if they can use a different payment system?
- How many occupants will your hotel have?
- A/B testing: Which ad is more effective in getting people to purchase a product?

Don'ts
- Subjective questions (like all reasons behind a person's preference)

Types of statistics
- Descriptive. Focuses on *describing and summarizing* the data at hand.
- Inferential. Uses the data at hand, which is called sample data, to make *inferences* about a larger population.

Types of data
- Numeric (quantitative) data is made up of numeric values.
   - Continuous numeric data: quantities that can be *measured*, like speed or time.
   - Discrete numeric data: usually *count* data, like number of pets or number of packages shipped.
- Categorical (qualitative) data is made up of values that belong to distinct groups.
   - Nominal categorical data: made up of categories with no inherent ordering, like marriage status or country of residence.
   - Ordinal categorical data: has an inherent order, like a survey question where you need to indicate the degree to which you agree with a statement.

Being able to identify data types is important, since the type of data you're working with dictates which summary statistics and visualizations make sense. For numeric data, we can use summary statistics like the mean and plots like scatter plots, but these don't make sense for categorical data. Conversely, counts and bar plots suit categorical data and are much less informative for continuous numeric data.
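As a small illustration of matching the summary to the data type, the sketch below uses a made-up survey table (the column names and values are assumptions, not data from these notes): a mean for the numeric column and counts for the categorical one.

```python
import pandas as pd

# Hypothetical survey data, invented for illustration
survey = pd.DataFrame({
    "age": [23, 35, 31, 52, 46],                # numeric (quantitative)
    "country": ["ES", "MX", "ES", "AR", "MX"],  # nominal categorical
})

# A mean is meaningful for numeric data
mean_age = survey["age"].mean()

# Counts are the natural summary for categorical data
country_counts = survey["country"].value_counts()

print(mean_age)
print(country_counts)
```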

# 5.1 Summary statistics

Summary statistic: a fact about or summary of some data, like an average or a count.

## Measures of center

### Histograms

- A histogram takes a bunch of data points and separates them into bins, or ranges of values.
- The heights of the bars represent the number of data points that fall into that bin
- Great way to visually summarize the data, but we can use numerical summary statistics to summarize even further.

### Mean

- Often called the average, is one of the most common ways of summarizing data.
- Is more sensitive to extreme values (outliers)
- To calculate mean, we add up all the numbers of interest and divide by the total number of data points.
- In Python, we can use numpy's function ``np.mean(variable)``, passing it the variable of interest.

### Median

- Is the value where 50% of the data is lower than it, and 50% of the data is higher.
- We can calculate this by sorting all the data points and taking the middle one
- In Python, we can use ``np.median(variable)`` to do the calculations for us.


### Mode

- Is the most frequent value in the data.
- Is often used for categorical variables, since categorical variables can be unordered and often don't have an inherent numerical representation.
- NumPy has no ``mode`` function; we can find the mode using ``statistics.mode(variable)`` from Python's built-in ``statistics`` module.

### Which measure should I use?

- Since the mean is more sensitive to extreme values, it works better for symmetrical data
- If the data is skewed, meaning it's not symmetrical, median is usually better to use.
   - Left-skewed data has a tail on the left and data is piled up on the right. Mean < Median
   - Right-skewed data has a tail on the right and data is piled up on the left. Mean > Median
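The three measures of center can be compared on a small right-skewed data set. The values below are invented for illustration; the point is only that a single large value pulls the mean above the median.

```python
import statistics

import numpy as np

# Hypothetical right-skewed data: mostly small values plus one extreme value
values = [1, 2, 2, 3, 4, 5, 40]

mean_value = np.mean(values)          # pulled upward by the extreme value
median_value = np.median(values)      # middle value of the sorted data
mode_value = statistics.mode(values)  # most frequent value

print(mean_value, median_value, mode_value)
```

Here the mean is larger than the median, which is the pattern described above for right-skewed data.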

## Measures of spread

Describes how spread apart or close together the data points are.

### Variance

- Measures the average *squared* distance from each data point to the data's mean. Because the distances are squared, it has squared units.
- To calculate the variance:
   1. We start by calculating the distance between each point and the mean, so we get one number for every data point.
   2. We then square each distance and then add them all together.
   3. Finally, we divide the sum of squared distances by the number of data points minus 1, giving us the variance.
 
- The higher the variance, the more spread out the data is.
- We can calculate the variance in one step using ``np.var(variable)``, setting the ``ddof`` argument to 1.
   - If we don't specify ``ddof=1``, a slightly different formula is used to calculate variance that should only be used on a full population, not a sample.
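The three steps above can be sketched directly and checked against ``np.var``. The data values are made up for illustration.

```python
import numpy as np

# Hypothetical sample data
data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# 1. Distance between each point and the mean
distances = data - np.mean(data)

# 2. Square each distance and add them together
sum_sq_distances = np.sum(distances ** 2)

# 3. Divide by the number of data points minus 1 (sample variance)
variance_manual = sum_sq_distances / (len(data) - 1)

# Same result in one step; ddof=1 selects the sample formula
variance_sample = np.var(data, ddof=1)

# Without ddof=1, NumPy divides by n (population formula)
variance_population = np.var(data)

print(variance_manual, variance_sample, variance_population)
```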

### Standard deviation

- Calculated by taking the square root of the variance. It has the same units as the data, which makes it easier to interpret than the variance.
- It can be calculated using ``np.std(variable)``, and we need to set ``ddof=1``.

### Mean absolute deviation

- Takes the absolute value of the distances to the mean, and then takes the mean of those differences.
- Standard deviation squares distances, so longer distances are penalized more than shorter ones, while mean absolute deviation penalizes each distance equally.
- It can be calculated using ``np.mean(np.abs(distances))``, where ``distances = data - np.mean(data)``
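A minimal sketch comparing the two measures on the same made-up data: because the standard deviation squares distances before averaging, it comes out larger than the mean absolute deviation here.

```python
import numpy as np

# Hypothetical sample data
data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Distances from each point to the mean
distances = data - np.mean(data)

# Mean absolute deviation: every distance counts equally
mad = np.mean(np.abs(distances))

# Sample standard deviation: longer distances are weighted more heavily
std = np.std(data, ddof=1)

print(mad, std)
```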

### Quantiles

- Also called percentiles, quantiles split up the data into some number of equal parts.
- We call ``np.quantile(df['column_name'], q)``, passing in the column of interest followed by the quantile we want, such as ``0.5``.
   - ``q``: ``0.5`` gives the median; it can also be a list of quantiles like ``[0, 0.25, 0.5, 0.75, 1]``.

- We can compute evenly spaced quantiles using ``np.quantile(df['column_name'], np.linspace(start, stop, num))``, where ``np.linspace`` takes the starting number, the stopping number, and the number of evenly spaced values to generate.
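A short sketch of both calls on made-up data (the integers 1 to 101, chosen so the quantiles are easy to read). Note that ``num`` in ``np.linspace`` counts points, so 6 points split the data into 5 equal parts.

```python
import numpy as np

# Hypothetical data: the integers 1 through 101
data = np.arange(1, 102)

# Quartiles from an explicit list of quantiles
quartiles = np.quantile(data, [0, 0.25, 0.5, 0.75, 1])

# Quintiles: 6 evenly spaced points between 0 and 1 give 5 equal parts
quintiles = np.quantile(data, np.linspace(0, 1, 6))

print(quartiles)
print(quintiles)
```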

### Boxplots and quartiles

- The boxes in box plots represent quartiles.
   - The bottom of the box is the first quartile
   - The middle line is the second quartile, or the median.
   - The top of the box is the third quartile.

- How to plot a boxplot

```
import matplotlib.pyplot as plt
plt.boxplot(df['column'])
plt.show()
```

### Interquartile range (IQR)

- It's the distance between the 25th and 75th percentile, which is also the height of the box in a boxplot.
- We can calculate it using
   - The function from NumPy ``np.quantile(df['column_name'], 0.75) - np.quantile(df['column_name'], 0.25)``
   - The imported function from SciPy:
   ```
   from scipy.stats import iqr
   iqr(df['column_name'])
   ```

### Outlier detection

Outliers are data points that are substantially different from the others.

How do we know what a substantial difference is? Here is an often used rule:
- Any data point less than the first quartile minus 1.5 times the IQR is an outlier.
- Any data point greater than the third quartile plus 1.5 times the IQR is an outlier.

Using Python:
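```
import numpy as np
from scipy.stats import iqr

# Store the IQR in its own variable so the iqr function is not overwritten
iqr_value = iqr(df['column_name'])
lower_threshold = np.quantile(df['column_name'], 0.25) - (1.5 * iqr_value)
upper_threshold = np.quantile(df['column_name'], 0.75) + (1.5 * iqr_value)
```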
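Once the thresholds are known, the outliers themselves can be selected with a boolean filter. This is a sketch on a made-up DataFrame; the column name ``value`` and the numbers are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import iqr

# Hypothetical data with one obviously extreme value
df = pd.DataFrame({"value": [10, 12, 11, 13, 12, 11, 14, 95]})

q1 = np.quantile(df["value"], 0.25)
q3 = np.quantile(df["value"], 0.75)
iqr_value = iqr(df["value"])

lower_threshold = q1 - 1.5 * iqr_value
upper_threshold = q3 + 1.5 * iqr_value

# Keep only the rows that fall outside the thresholds
outliers = df[(df["value"] < lower_threshold) | (df["value"] > upper_threshold)]
print(outliers)
```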

### ALL IN ONE GO: ``df['column_name'].describe()``

It returns:
- count of non-missing values
- mean
- std
- minimum value
- 25% (first quartile)
- 50% (second quartile, or median)
- 75% (third quartile)
- maximum value

## ```dataframe.melt()```

About the method...

Syntax
- Calling the method on a DataFrame: ```dataframe.melt(arguments)```

Arguments
- Selects columns to be used as identifier variables: ```id_vars=[list of columns]```
- Will allow us to control which columns are unpivoted: ```value_vars=[list of columns]```
- Will allow us to set the name of the variable column in the output: ```var_name='column'```
- Will allow us to set the name of the value column in the output: ```value_name='column'```

What does it do?
- Will unpivot a table from wide to long format. This is often a much more computer-friendly format, therefore making this a valuable method to know.
- Wide format: data where every row relates to one subject, and each column has different information about an attribute of that subject.
- Long (or tall) format: information about one subject is found over many rows, and each row has one attribute about that subject.
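A minimal sketch of unpivoting a wide table into long format. The table, its column names, and the scores are invented for illustration.

```python
import pandas as pd

# Hypothetical wide table: one row per student, one column per subject
wide = pd.DataFrame({
    "student": ["Ana", "Luis"],
    "math": [90, 75],
    "physics": [85, 80],
})

# Unpivot: one row per (student, subject) pair
long = wide.melt(
    id_vars=["student"],             # identifier column(s) kept as-is
    value_vars=["math", "physics"],  # columns to unpivot
    var_name="subject",              # name of the new variable column
    value_name="score",              # name of the new value column
)
print(long)
```

The two students with two subjects each become four rows, one per student-subject combination.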

# 5.2 Random numbers and Probability

- We can measure the chances of an event using probability.
- Probability is always between 0% and 100%
   - If probability is 0%, it's impossible
   - If probability is 100%, it will certainly happen

## Setting a random seed and sampling from a DataFrame

- Setting a random seed ensures we get the same results each time we run the script.
- The seed is a number that NumPy's random number generator uses as a starting point, so if we set the same seed, it will generate the same random values each time.
- The specific number doesn't matter; the only thing that matters is that we use the same seed the next time we run the script.

Parameters of ``df.sample()``
1. ``n``: integer representing the number of items from axis to return.
2. ``replace``: ``False`` by default, ``True`` allows sampling of the same row more than once

Code
```
import numpy as np

np.random.seed(42)  # any fixed integer works
df.sample(5, replace=True)
```
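To see the effect of the seed, we can sample twice from the same DataFrame after resetting the seed to the same value. The DataFrame below is made up for illustration, and the seed value 42 is arbitrary.

```python
import numpy as np
import pandas as pd

# Hypothetical data
df = pd.DataFrame({"name": ["A", "B", "C", "D", "E"], "value": [1, 2, 3, 4, 5]})

# Same seed before each call, so both samples pick the same rows
np.random.seed(42)
first_sample = df.sample(3, replace=True)

np.random.seed(42)
second_sample = df.sample(3, replace=True)

same_rows = first_sample.equals(second_sample)
print(same_rows)
```

An alternative that avoids the global seed is passing ``random_state`` directly to ``df.sample()``.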

## Independent events

- Two events are independent if the probability of the second event isn't affected by the outcome of the first event.
- Generally when sampling with replacement, each pick is independent.

## Dependent events

- Two events are dependent if the probability of the second event is affected by the outcome of the first event
- Generally when sampling without replacement, each pick is dependent.
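A worked example of the difference, using four made-up names: suppose the first pick was not Claire, and we ask for the probability that the second pick is Claire.

```python
from fractions import Fraction

# Hypothetical group of people to pick from
names = ["Amir", "Brian", "Claire", "Damian"]

# With replacement: the first name goes back, so all 4 names are available
# again and the second pick is independent of the first
p_claire_second_with = Fraction(1, len(names))

# Without replacement: the first name (not Claire) is removed, so only 3 names
# remain and the second pick depends on the first
p_claire_second_without = Fraction(1, len(names) - 1)

print(p_claire_second_with, p_claire_second_without)
```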

## Probability distribution

- Describes the probability of each possible outcome in a scenario.
- The expected value of a distribution is its mean: each outcome multiplied by its probability, summed over all outcomes.

### Discrete distributions

- They represent situations with discrete outcomes, which can be thought of as counted variables.
- We can visualize this using a barplot, where each bar represents an outcome, and each bar's height represents the probability of that outcome.
- We can calculate probabilities of different outcomes by taking areas of the probability distribution.

Sampling from discrete distributions
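```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Assumed setup: a fair six-sided die, one row per face
die = pd.DataFrame({'number': [1, 2, 3, 4, 5, 6]})
np.mean(die['number'])  # expected value of one roll

rolls_10 = die.sample(10, replace=True)  # 10 rolls, sampling with replacement
rolls_10
rolls_10['number'].hist(bins=np.linspace(1, 7, 7))  # one bin per face
plt.show()
```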

- Law of large numbers: "As the size of your sample increases, the sample mean will approach the expected value"
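The law of large numbers can be sketched by rolling a fair die more and more times and watching the sample mean. The ``die`` DataFrame and the seed are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Assumed setup: a fair six-sided die
die = pd.DataFrame({"number": [1, 2, 3, 4, 5, 6]})
expected_value = np.mean(die["number"])

np.random.seed(0)  # arbitrary seed, only for reproducibility

# Sample mean for increasingly large samples
sample_means = {}
for size in [10, 100, 10000]:
    rolls = die.sample(size, replace=True)
    sample_means[size] = rolls["number"].mean()

print(expected_value)
print(sample_means)
```

The mean of the largest sample should land very close to the expected value, while the small samples can be noticeably off.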

### Continuous distributions

- Discrete distributions model situations that involve discrete or countable variables; continuous distributions model variables that can take any value in a range, such as a waiting time.
- We'll use a continuous line to represent probability.
- Just like with discrete distributions, we can take the area to calculate probability.

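Calculating the probability using the SciPy ``uniform`` distribution with its ``cdf`` method. SciPy parameterizes ``uniform`` by ``loc`` (the lower limit) and ``scale`` (the width of the interval), not by the upper limit. Example code:
```
from scipy.stats import uniform

# P(value <= x) for a uniform distribution on [lower_limit, upper_limit]
uniform.cdf(x, loc=lower_limit, scale=upper_limit - lower_limit)
```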

#### Random number generation according to a continuous distribution (e.g. uniform)

```
from scipy.stats import uniform

# n random values drawn uniformly from [lower_limit, upper_limit]
uniform.rvs(loc=lower_limit, scale=upper_limit - lower_limit, size=n)
```
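A concrete sketch with made-up limits: a waiting time that is uniformly distributed between 2 and 10 minutes. Since SciPy's ``uniform`` covers the interval from ``loc`` to ``loc + scale``, the width (upper minus lower) is passed as ``scale``.

```python
from scipy.stats import uniform

# Hypothetical waiting time, uniform between 2 and 10 minutes
lower_limit, upper_limit = 2, 10
width = upper_limit - lower_limit

# P(wait <= 4): area of the distribution from 2 to 4
prob_up_to_4 = uniform.cdf(4, loc=lower_limit, scale=width)

# P(4 <= wait <= 8): difference of two cumulative probabilities
prob_4_to_8 = (uniform.cdf(8, loc=lower_limit, scale=width)
               - uniform.cdf(4, loc=lower_limit, scale=width))

# Five random waiting times from the same distribution
waits = uniform.rvs(loc=lower_limit, scale=width, size=5, random_state=1)

print(prob_up_to_4, prob_4_to_8)
print(waits)
```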

### Binomial distribution

The binomial distribution is a discrete distribution that describes the probability of the number of successes in a sequence of independent trials.

- The binomial distribution can be described using two parameters, n and p.
   - n represents the total number of trials being performed
   - p is the probability of success.
- Each trial has two possible outcomes, which we can represent as a 1 or a 0, a success or a failure, or a win or a loss.
- In order for the binomial distribution to apply, each trial must be independent, so the outcome of one trial shouldn't have an effect on the next.
- We can simulate this in Python like this:
```
from scipy.stats import binom

# GETTING RANDOM VARIATES
binom.rvs(n, p, size=size)

# Single flip
binom.rvs(1, 0.5, size=1)

# single flip, many times
binom.rvs(1, 0.5, size=8)

# Many flips, one time
binom.rvs(8, 0.5, size=1)

# Many flips, many times
binom.rvs(8, 0.5, size=10)

# Other probabilities
binom.rvs(3, 0.25, size=10)

# GETTING THE PROBABILITY OF SUCCESS
binom.pmf(r, n, p) # of exactly 'r' successes
binom.cdf(r, n, p) # of 'r' successes or fewer
1 - binom.cdf(r, n, p) # of more than 'r' successes

# EXPECTED VALUE
n*p
```

Parameters
- n: number of trials in each experiment
- r: number of successes
- p: probability of success on each trial
- size: number of experiments to simulate
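A worked example with made-up numbers: flipping a fair coin 10 times and asking about 7 heads. In these SciPy calls, the second argument is the number of trials and the third is the probability of success.

```python
from scipy.stats import binom

# Hypothetical experiment: 10 flips of a fair coin
n_flips = 10
p_heads = 0.5

# Probability of exactly 7 heads
prob_exactly_7 = binom.pmf(7, n_flips, p_heads)

# Probability of 7 heads or fewer
prob_7_or_fewer = binom.cdf(7, n_flips, p_heads)

# Probability of more than 7 heads
prob_more_than_7 = 1 - prob_7_or_fewer

# Expected number of heads: n * p
expected_heads = n_flips * p_heads

print(prob_exactly_7, prob_7_or_fewer, prob_more_than_7, expected_heads)
```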

# 5.3 More distributions and the Central Limit Theorem

#### Normal distribution

- Its shape is commonly referred to as a "bell curve"
- Countless statistical methods rely on it, and it applies to more real-world situations than the distributions we've covered so far.
- Has important properties
   1. Symmetrical: the left side is a mirror image of the right.
   2. Area=1 : like any continuous distribution, the area beneath the curve is 1.
   3. Curve never hits 0: the probability never hits 0, even if it looks like it does at the tail ends. For example, only about 0.006% of the area lies more than 4 standard deviations from the mean.
   4. Described by mean and standard deviation.
       - When a normal distribution has mean 0 and a standard deviation of 1, it's a special distribution called the standard normal distribution.
       - Other distributions with different values for mean and std will have the same shape, but their axes have different scales
   5. Areas under the normal distribution ("68-95-99.7" rule)
       - 68% of the area is within 1 standard deviation of the mean.
       - 95% of the area falls within 2 standard deviations of the mean
       - 99.7% of the area falls within three standard deviations.
- Approximating data with the normal distribution: since histograms can closely resemble the normal distribution, we can take the area under a normal distribution given certain mean and std values

```
from scipy.stats import norm

# CUMULATIVE DISTRIBUTION FUNCTION

# Probability of a value being equal or less than "x"
norm.cdf(x, mean, std)

# Probability of a value being greater than "x"
1 - norm.cdf(x, mean, std)

# Probability of a value being between "x1" and "x2"
norm.cdf(x2, mean, std) - norm.cdf(x1, mean, std)

# PERCENT POINT FUNCTION

# Value below which a proportion "q" of observations fall
norm.ppf(q, mean, std)

# Value above which a proportion "q" of observations fall
norm.ppf((1-q), mean, std)

# Sampling from a Normal distribution
norm.rvs(mean, std, size=n)
```

Parameters
- x: value
- q: proportion between 0 and 1
- mean: float
- std: float
- n: number of samples
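The "68-95-99.7" rule can be checked with ``norm.cdf`` on the standard normal distribution. The second half of the sketch uses made-up values (a mean of 161 and a standard deviation of 7) purely for illustration.

```python
from scipy.stats import norm

# Area within k standard deviations of the mean (standard normal)
areas = {k: norm.cdf(k) - norm.cdf(-k) for k in (1, 2, 3)}
print(areas)

# Hypothetical measurements with mean 161 and standard deviation 7
mean, std = 161, 7

# Probability of a value at or below 154 (one standard deviation below the mean)
prob_below_154 = norm.cdf(154, mean, std)

# Value below which 90% of observations fall
value_90 = norm.ppf(0.9, mean, std)

print(prob_below_154, value_90)
```

``ppf`` is the inverse of ``cdf``, so feeding ``value_90`` back into ``norm.cdf`` returns 0.9.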

#### Central limit theorem (CLT)

- States that the sampling distribution of a statistic (such as the sample mean) will approach a normal distribution as the number of trials increases.
- Samples should be random and independent.
- Also applies to other summary statistics, not only the mean.
- Useful for estimating characteristics of an unknown underlying distribution.
- Makes it easier to estimate characteristics of large populations.
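A simulation sketch of the CLT for the sample mean: many samples of 30 die rolls each, with the mean of every sample collected. This uses NumPy's ``default_rng`` generator rather than the ``np.random.seed`` and ``df.sample`` approach used elsewhere in these notes; the sample sizes and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility

# 10,000 samples, each with 30 rolls of a fair die (values 1 to 6)
rolls = rng.integers(1, 7, size=(10000, 30))

# One mean per sample: this collection is the sampling distribution of the mean
sample_means = rolls.mean(axis=1)

mean_of_means = sample_means.mean()
std_of_means = sample_means.std(ddof=1)

# For comparison: standard deviation of a single roll divided by sqrt(30)
theoretical_std = np.sqrt(35 / 12) / np.sqrt(30)

print(mean_of_means, std_of_means, theoretical_std)
```

Although a single die roll is uniformly distributed, a histogram of ``sample_means`` would look roughly bell-shaped and centered near 3.5.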

#### Poisson distribution

A Poisson process is a process where events appear to happen at a certain rate, but completely at random. The time unit (hours, weeks, or years, for example) is irrelevant as long as it's consistent.

The Poisson distribution describes the probability of some number of events happening over a fixed period of time.

- The Poisson distribution is described by a value called lambda, which represents the average number of events per time period. 
   - This value is also the expected value of the distribution
   - Lambda changes the shape of the distribution: a Poisson distribution with lambda=1 looks quite different from one with lambda=8, but no matter what, the distribution's peak is at or right next to its lambda value.
- Python code:

```
from scipy.stats import poisson

# Probability of a single value
poisson.pmf(k, mu)

# Probability of less than or equal to...
poisson.cdf(k, mu)

# probability of greater than...
1 - poisson.cdf(k, mu)

# Sampling from a Poisson distribution
poisson.rvs(mu, size=n)
```

Parameters
- k: number of occurrences
- mu: average number of events per time period (lambda)
- n: sample size

Side note: CLT still applies!
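A worked example with a made-up rate of 8 events per time period. The ``pmf`` result is checked against the Poisson formula $e^{-\lambda}\lambda^k / k!$ to show what SciPy computes.

```python
import math

from scipy.stats import poisson

# Hypothetical rate: on average 8 events per time period
lam = 8

# Probability of exactly 5 events
prob_exactly_5 = poisson.pmf(5, lam)

# Same probability from the formula e^(-lam) * lam^k / k!
prob_formula = math.exp(-lam) * lam ** 5 / math.factorial(5)

# Probability of 5 events or fewer, and of more than 5
prob_5_or_fewer = poisson.cdf(5, lam)
prob_more_than_5 = 1 - prob_5_or_fewer

print(prob_exactly_5, prob_5_or_fewer, prob_more_than_5)
```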

#### Exponential distribution

The exponential distribution measures frequency in terms of time between events.

- Represents the probability of a certain time passing between Poisson events.
- The exponential distribution uses the same lambda value as the Poisson.
- The expected value of the exponential distribution can be calculated by taking 1 divided by lambda.
- It's continuous, since it represents time

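```
from scipy.stats import expon

# lam is the Poisson rate (events per time period);
# scale is the expected time between events, 1/lam

# Probability of being less than or equal to "x"
expon.cdf(x, scale=1/lam)

# Probability of being greater than "x"
1 - expon.cdf(x, scale=1/lam)

# Probability of a value being between "x1" and "x2"
expon.cdf(x2, scale=1/lam) - expon.cdf(x1, scale=1/lam)
```

Parameters
- x: time value
- lam: the Poisson lambda (average number of events per time period)
- scale: expected value of the distribution = 1/lam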
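A worked example with a made-up rate: if events arrive at 0.5 per minute on average, the expected time between events is 1 / 0.5 = 2 minutes, and that expected time is what SciPy's ``scale`` argument takes.

```python
import math

from scipy.stats import expon

# Hypothetical Poisson rate: 0.5 events per minute
lam = 0.5
mean_wait = 1 / lam  # expected time between events, in minutes

# Probability of waiting 1 minute or less for the next event
prob_within_1 = expon.cdf(1, scale=mean_wait)

# Same probability from the formula 1 - e^(-lam * x)
prob_formula = 1 - math.exp(-lam * 1)

# Probability of waiting more than 4 minutes
prob_more_than_4 = 1 - expon.cdf(4, scale=mean_wait)

print(mean_wait, prob_within_1, prob_more_than_4)
```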

#### Student's t-distribution

- Its shape is similar to the normal distribution
- The t-distribution's tails are thicker. This means that in a t-distribution, observations are more likely to fall further from the mean.
- Has a parameter called degrees of freedom, which affects the thickness of the distribution's tails.
   - Lower degrees of freedom result in thicker tails and a higher standard deviation.
   - As the number of degrees of freedom increases, the distribution looks more and more like the normal distribution.
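The thicker tails can be made concrete by comparing the probability of a value above 3 under the standard normal distribution and under t-distributions with different degrees of freedom. The cutoff of 3 and the degrees of freedom (5 and 50) are arbitrary choices for illustration.

```python
from scipy.stats import norm, t

# P(value > 3) under each distribution (sf is 1 - cdf)
tail_normal = norm.sf(3)
tail_t_5 = t.sf(3, df=5)    # few degrees of freedom: thick tails
tail_t_50 = t.sf(3, df=50)  # many degrees of freedom: close to normal

# Standard deviations: larger for fewer degrees of freedom
std_t_5 = t.std(df=5)
std_t_50 = t.std(df=50)

print(tail_normal, tail_t_5, tail_t_50)
print(std_t_5, std_t_50)
```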

#### Log-normal distribution

- Variables that follow a log-normal distribution have a logarithm that is normally distributed.
- This results in distributions that are skewed, unlike the normal distribution.
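A simulation sketch of both bullets: samples drawn from a log-normal distribution are right-skewed, while their logarithms look normal. The parameters (underlying normal with mean 0 and standard deviation 1), the sample size, and the seed are arbitrary choices.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility

# Log-normal samples whose logarithm is normal with mean 0 and std 1
samples = rng.lognormal(mean=0.0, sigma=1.0, size=10000)
logs = np.log(samples)

# Skewness: clearly positive for the raw samples, near 0 for their logs
skew_raw = skew(samples)
skew_log = skew(logs)

print(np.mean(samples), np.median(samples))
print(skew_raw, skew_log)
```

The mean of the raw samples sits above their median, which matches the earlier rule of thumb for right-skewed data.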

# 5.4 Correlation and Experimental designs

## Relationships between 2 variables

- Relationships between numeric variables can be visualized with scatter plots.
- A scatter plot places one variable on the x-axis and the other on the y-axis, so we can see the relationship between the "x" and "y" variables.
- The variable on the x-axis is called the explanatory/independent variable, and the variable on the y-axis is called the response/dependent variable.

## Correlation coefficient

- We can also examine relationships between two numeric variables using a number called the correlation coefficient.
- This is a number between -1 and 1, where the magnitude corresponds to the strength of the relationship between the variables, and the sign, positive or negative, corresponds to the direction of the relationship.
   - The closer to 1 or -1, the stronger the relationship
   - The closer to 0, the weaker the relationship (0 means no linear relationship)

## Visualizing relationships

- We can use a scatterplot.
- Also, we'll use seaborn, which is a plotting package built on top of matplotlib.
- Python code:
```
import matplotlib.pyplot as plt
import seaborn as sns

sns.scatterplot(x="column_x", y="column_y", data=df)
plt.show()
```

## Adding a trend

- We can add a linear trendline to the scatterplot using seaborn's ``sns.lmplot()`` function.
- It takes the same ``x``, ``y`` and ``data`` arguments as ``sns.scatterplot()``, but we'll set ``ci`` to ``None`` so that there aren't any confidence interval margins around the line.
   - Trendlines like this can be helpful to more easily see a relationship between two variables.
- Python code:
```
import matplotlib.pyplot as plt
import seaborn as sns

sns.lmplot(x="column_x", y="column_y", data=df, ci=None)
plt.show()
```

## Computing correlation

- To calculate the correlation coefficient (Pearson product-moment correlation) between two Series, we can use the ``.corr()`` method.
- Note that it doesn't matter which Series the method is invoked on and which is passed in since the correlation between x and y is the same thing as the correlation between y and x.
- Python code:
```
# Both calls return the same correlation coefficient
df["column_x"].corr(df["column_y"])
df["column_y"].corr(df["column_x"])
```
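A small worked example on made-up data (the column names and values are assumptions for illustration), showing that the order of the two Series does not matter.

```python
import pandas as pd

# Hypothetical data: hours studied and exam score for five students
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5],
    "exam_score": [52, 58, 61, 70, 74],
})

# Pearson correlation, computed in both directions
r_xy = df["hours_studied"].corr(df["exam_score"])
r_yx = df["exam_score"].corr(df["hours_studied"])

print(r_xy, r_yx)
```

The coefficient is close to 1 here, reflecting a strong positive linear relationship in this toy data.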

## Correlation caveats

1. **Non-linear relationships**: Correlation only accounts for linear relationships.
   - The correlation coefficient measures the strength of linear relationships only.
   - Correlation shouldn't be used blindly, and you should always visualize your data when possible.
2. If data is highly skewed, we can use the **log transformation**
```
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Log-transform the skewed explanatory variable
df["log_x_column"] = np.log(df["skewed_x_column"])

sns.lmplot(x="log_x_column", y="column_y", data=df, ci=None)
plt.show()

df["log_x_column"].corr(df["column_y"])
```
3. Other transformations
   - Square root transformation ``sqrt(x)``
   - Reciprocal transformation ``(1/x)``
   - Combinations, e.g. ``log(x)`` and ``(1/y)``
   
Why use a transformation?
- Certain statistical methods rely on variables having a linear relationship, like calculating a correlation coefficient.
- Linear regression is another statistical technique that requires variables to be related in a linear manner
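Two toy data sets, both invented for illustration, make these caveats concrete. In the first, y depends perfectly on x but not linearly, and the correlation is 0. In the second, x is highly skewed and a log transformation turns a curved relationship into a linear one.

```python
import numpy as np
import pandas as pd

# 1. A perfect but non-linear relationship: y = x squared
quadratic = pd.DataFrame({"x": [-3, -2, -1, 0, 1, 2, 3]})
quadratic["y"] = quadratic["x"] ** 2
r_quadratic = quadratic["x"].corr(quadratic["y"])

# 2. A highly skewed x whose logarithm is linearly related to y
skewed = pd.DataFrame({"x": [1, 10, 100, 1000, 10000], "y": [1, 2, 3, 4, 5]})
r_raw = skewed["x"].corr(skewed["y"])

skewed["log_x"] = np.log(skewed["x"])
r_log = skewed["log_x"].corr(skewed["y"])

print(r_quadratic)
print(r_raw, r_log)
```

The correlation after the log transformation is higher than on the raw values, because the transformed relationship is linear.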

4. **CORRELATION DOES NOT IMPLY CAUSATION**
- "x" being correlated with "y" **does not mean** "x" causes "y"
- A correlation that doesn't reflect a causal relationship is often called a **spurious correlation**.
- **Confounding** can lead to spurious correlation
   - If two variables are correlated, it may lead us to think one causes the other.
   - In reality, there might be a third variable at play (a confounder) that is the real cause of one variable and is associated with the other.
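Confounding can be sketched with simulated data. Here a hypothetical third variable drives both "x" and "y", and neither one affects the other; the noise levels, sample size, and seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
n = 5000

# A third variable that influences both x and y
confounder = rng.normal(size=n)

# x and y each depend on the confounder plus their own independent noise
x = confounder + rng.normal(scale=0.5, size=n)
y = confounder + rng.normal(scale=0.5, size=n)

# x and y are strongly correlated even though neither causes the other
r_xy = np.corrcoef(x, y)[0, 1]

# Once the confounder's contribution is removed, the correlation vanishes
r_after_removal = np.corrcoef(x - confounder, y - confounder)[0, 1]

print(r_xy, r_after_removal)
```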

## Design of experiments

Data is created as a result of a study that aims to answer a specific question. However, data needs to be analyzed and interpreted differently depending on how the data was generated and how the study was designed.

- Experiments generally aim to answer a question in the form, "What is the effect of the treatment on the response?"
- Treatment refers to the explanatory or independent variable
- Response refers to the response or dependent variable.

### Controlled experiments

- Participants are randomly assigned to either the treatment group or the control group, where the treatment group receives the treatment and the control group does not.
   - Example: A/B test
- Other than this difference, the groups should be comparable so that we can determine whether the treatment (for example, seeing an advertisement) causes the response (for example, buying more).
   - If the groups aren't comparable, this could lead to confounding, or bias.

### The gold standard

"If there are fewer opportunities for bias to creep into your experiment, the more reliably you can conclude whether the treatment affects the response"

- The ideal experiment will eliminate as much bias as possible by using certain tools. 
   1. Use a randomized controlled trial.
      - In it, participants are randomly assigned to the treatment or control group and their assignment isn't based on anything other than chance.
      - Random assignment like this helps ensure that the groups are comparable.
   2. Use a placebo
      - Is something that resembles the treatment, but has no effect.
      - This way, participants don't know if they're in the treatment or control group.
      - Ensures that the effect of the treatment is due to the treatment itself, not the idea of getting the treatment.
   3. Double-blind trial
      - The person administering the treatment or running the experiment also doesn't know whether they're administering the actual treatment or the placebo.
      - This protects against bias in the response as well as the analysis of the results.
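Random assignment itself is easy to sketch in code. The example below assumes 20 hypothetical participants and splits them evenly by chance alone; the column names, group sizes, and seed are illustrative assumptions, not a prescribed procedure.

```python
import numpy as np
import pandas as pd

# Hypothetical participants, identified only by a number
participants = pd.DataFrame({"participant_id": range(1, 21)})

rng = np.random.default_rng(42)  # arbitrary seed for reproducibility

# Shuffle the ids; the first half of the shuffled order gets the treatment
shuffled_ids = rng.permutation(participants["participant_id"].to_numpy())
treatment_ids = shuffled_ids[:10]

participants["group"] = np.where(
    participants["participant_id"].isin(treatment_ids), "treatment", "control"
)

group_sizes = participants["group"].value_counts()
print(group_sizes)
```

Because assignment depends only on the shuffle, no participant characteristic can influence which group someone ends up in.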

### Observational studies

- Participants are not randomly assigned to groups.
   - Instead, participants assign themselves, usually based on pre-existing characteristics.
- Useful for answering questions that aren't conducive to a controlled experiment.
- Observational studies can't establish causation, only association.

### Longitudinal vs. cross-sectional studies

1. Longitudinal study
   - The same participants are followed over a period of time to examine the effect of treatment on the response.
   - More expensive, and takes longer to perform.
2. Cross-sectional study
   - Data is collected from a single snapshot in time.
   - Cheaper, faster, and more convenient.