
#### Mean

The mean is the sum of all values divided by the number of values.

```python
# Example: Mean calculation
data = [3, 5, 1, 2]
mean = sum(data) / len(data)
print(mean)  # Output: 2.75
```

#### Trimmed Mean

The trimmed mean drops a fixed number of the smallest and largest sorted values and averages the rest, which reduces the influence of extreme values.

```python
# Example: Trimmed Mean calculation
sorted_data = sorted([3, 5, 1, 2])
p = 1  # Number of values to trim from each end
trimmed_mean = sum(sorted_data[p:-p]) / (len(sorted_data) - 2 * p)
print(trimmed_mean)  # Output: 2.5
```

#### Weighted Mean

The weighted mean multiplies each data value by a weight and divides by the sum of the weights.

```python
# Example: Weighted Mean calculation
data = [3, 5, 1, 2]
weights = [1, 2, 1, 1]
weighted_mean = sum(w * d for w, d in zip(weights, data)) / sum(weights)
print(weighted_mean)  # Output: 3.2
```

#### Median and Robust Estimates

The median is the middle value in a sorted list of the data. With an even number of values, it is the average of the two middle values.

```python
# Example: Median calculation
data = [3, 5, 1, 2]
sorted_data = sorted(data)
median = (sorted_data[1] + sorted_data[2]) / 2
print(median)  # Output: 2.5
```

# Chapter 1: Exploratory Data Analysis - Summary

## 2. Elements of Structured Data

Data comes in various forms (sensor readings, events, text, images, videos).  Much is unstructured, requiring processing into structured forms like tables for statistical analysis.  Structured data is categorized into:

* **Numeric:**
    * **Continuous:**  Values within an interval (e.g., wind speed).
    * **Discrete:** Integer values (e.g., event counts).
* **Categorical:** A fixed set of values (e.g., TV screen types or country names).
    * **Binary:** Two categories (e.g., 0/1, true/false).
    * **Ordinal:** Categories with an order (e.g., rating scales).

Explicitly identifying data types helps software optimize processing and statistical procedures.  In Python, `pandas` allows explicit categorical data type specification.

In Python, scikit-learn supports ordinal data with `sklearn.preprocessing.OrdinalEncoder`.
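As a minimal sketch of declaring a data type explicitly, the example below builds an ordered categorical column in `pandas`. The rating labels and their order are assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical survey ratings; the labels and their order are assumptions.
ratings = pd.Series(['low', 'high', 'medium', 'low', 'high'])

# An ordered categorical dtype records both the allowed values and their order.
rating_type = pd.CategoricalDtype(categories=['low', 'medium', 'high'], ordered=True)
ratings = ratings.astype(rating_type)

print(ratings.dtype)                  # category
print(ratings.cat.codes.tolist())     # [0, 2, 1, 0, 2]
print(ratings.min(), ratings.max())   # low high
```

Because the order is part of the type, comparisons such as `min` and `max` follow the declared order rather than alphabetical order.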

## 3. Rectangular Data

Rectangular data, represented as a two-dimensional matrix (data frame in Python), is the standard format for data analysis and modeling.  Rows represent records (cases, observations), and columns represent features (attributes, variables). Unstructured data often needs transformation into rectangular form.

**Key Terms:**

* **Data frame:** The basic rectangular data structure with rows and columns.
* **Feature:** A column.
* **Outcome:** The variable to be predicted (dependent variable).
* **Record:** A row.



## 4. Data Frames and Indexes

In Python, `pandas` DataFrames have a default integer index based on row order. It is also possible to set multilevel (hierarchical) indexes to improve the efficiency of certain operations.
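The sketch below shows the default integer index and a two-level index built with `set_index`. The `sales` table, its column names, and its values are hypothetical.

```python
import pandas as pd

# Hypothetical sales records; column names and values are illustrative only.
sales = pd.DataFrame({
    'region': ['East', 'East', 'West', 'West'],
    'year': [2022, 2023, 2022, 2023],
    'revenue': [100, 120, 90, 95],
})

print(sales.index.tolist())  # default integer index: [0, 1, 2, 3]

# Build a two-level (hierarchical) index from two columns.
indexed = sales.set_index(['region', 'year']).sort_index()

# Look up a single value by its (region, year) key.
print(indexed.loc[('West', 2023), 'revenue'])  # 95

# Select every row for one value of the outer level.
print(indexed.loc['East'])
```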

## 5. Terminology Differences

Different fields use varying terminology for the same concepts (e.g., "predictor variables" vs. "features").


## 6. Nonrectangular Data Structures

Other data structures include:

* **Time series data** records successive measurements of the same variable. It is the raw material for statistical forecasting methods, and it is also a key component of the data produced by devices (the Internet of Things).
* **Spatial data structures**, which are used in mapping and location analytics, are more complex and varied than rectangular data structures. In the object representation, the focus of the data is an object (e.g., a house) and its spatial coordinates. The field view, by contrast, focuses on small units of space and the value of a relevant metric (pixel brightness, for example).
* **Graph (or network) data structures** are used to represent physical, social, and abstract relationships. For example, a graph of a social network, such as Facebook or LinkedIn, may represent connections between people on the network. Distribution hubs connected by roads are an example of a physical network. Graph structures are useful for certain types of problems, such as network optimization and recommender systems.


## 7. Estimates of Location

An estimate of location describes where most of the data is located (i.e., its central tendency).

* **Mean:** The average value.
* **Weighted mean:** The sum of all values times a weight, divided by the sum of the weights.
* **Median:** The middle value in a sorted dataset, i.e., the value such that one-half of the data lies above it and one-half below.
* **Weighted median:** The value such that one-half of the sum of the weights lies above and below it in the sorted data.
* **Trimmed mean:** The average after removing a fixed number of extreme values.
* **Outlier:** A value significantly different from most others.
* **Percentile:** The value such that P percent of the data lies below it.
* **Robust:** Not sensitive to extreme values.

The symbol x̄ (pronounced "x-bar") represents the mean of a sample from a population.

The mean is easily calculated but sensitive to outliers.  The median and trimmed mean are more robust.

N (or n) refers to the total number of records or observations. In statistics it is capitalized if it is referring to a population, and lowercase if it refers to a sample from a population. In data science, that distinction is not vital, so you may see it both ways.

The weighted mean is calculated by multiplying each data value \( x_i \) by a user-specified weight \( w_i \) and dividing their sum by the sum of the weights.

There are two main motivations for using a weighted mean:
* Some values are intrinsically more variable than others, and highly variable observations are given a lower weight. For example, if we are taking the average from multiple sensors and one of the sensors is less accurate, then we might downweight the data from that sensor.
* The data collected does not equally represent the different groups that we are interested in measuring. For example, because of the way an online experiment was conducted, we may not have a set of data that accurately reflects all groups in the user base. To correct that, we can give a higher weight to the values from the groups that were underrepresented.

The weighted median is computed much like the median: we first sort the data, although each data value has an associated weight. Instead of the middle number, the weighted median is a value such that the sum of the weights is equal for the lower and upper halves of the sorted list. Like the median, the weighted median is robust to outliers.
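The procedure can be sketched in plain Python. The helper `weighted_median` below is hypothetical and uses one simple convention: it returns the first sorted value at which the cumulative weight reaches half of the total. Library implementations may interpolate between values instead, so their results can differ slightly.

```python
def weighted_median(values, weights):
    """Return the first sorted value at which the cumulative weight
    reaches half of the total weight."""
    pairs = sorted(zip(values, weights))   # sort by value, keep weights attached
    half = sum(weights) / 2
    cumulative = 0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= half:
            return value

# Made-up data: the value 10 carries most of the weight.
values = [1, 2, 3, 10]
weights = [1, 1, 1, 5]
print(weighted_median(values, weights))       # 10
print(weighted_median(values, [1, 1, 1, 1]))  # 2 (lower middle value when weights are equal)
```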

**Python Code Examples:**

```python
import pandas as pd
from scipy.stats import trim_mean
import numpy as np
import wquantiles

# Load data
state = pd.read_csv('state.csv')

# Mean
state['Population'].mean()

# Trimmed mean
trim_mean(state['Population'], 0.1)    # drop the lowest and highest 10% of the sorted values, then average the rest

# Median
state['Population'].median()

# Weighted mean (murder rate, weighted by population)
np.average(state['Murder.Rate'], weights=state['Population'])

# Weighted median (murder rate, weighted by population)
wquantiles.median(state['Murder.Rate'], weights=state['Population'])

```

**Note:**  The code requires the `pandas`, `scipy`, `numpy`, and `wquantiles` libraries.  The `state.csv` file should contain the data shown in Table 1-2.

Outliers are often the result of data errors such as mixing data of different units (kilometers versus meters) or bad readings from a sensor. When outliers are the result of bad data, the mean will result in a poor estimate of location, while the median will still be valid. In any case, outliers should be identified and are usually worthy of further investigation.
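A small illustration of the unit mix-up, using made-up distance readings in which one value was recorded in meters instead of kilometers:

```python
from statistics import mean, median

# Hypothetical distances in kilometers; the last reading in the second list
# was recorded in meters by mistake.
clean = [1.2, 1.5, 1.1, 1.4, 1.3]
with_error = [1.2, 1.5, 1.1, 1.4, 1300.0]

print(mean(clean), median(clean))            # both are about 1.3
print(mean(with_error), median(with_error))  # mean jumps to about 261, median is 1.4
```

The single bad reading moves the mean by two orders of magnitude, while the median barely changes.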

===========================================================================

summary 2

## Estimates of Variability

Location is just one dimension in summarizing a feature. A second dimension, variability, also referred to as dispersion, measures whether the data values are tightly clustered or spread out.

### Key Terms for Variability Metrics

- **Deviations**: Differences between observed values and the estimate of location. Synonyms: errors, residuals.
- **Variance**: Sum of squared deviations from the mean divided by \( n - 1 \), where \( n \) is the number of data values. Synonym: mean-squared-error.
- **Standard Deviation**: Square root of the variance.
- **Mean Absolute Deviation**: Mean of the absolute values of the deviations from the mean. Synonyms: l1-norm, Manhattan norm.
- **Median Absolute Deviation from the Median**: Median of the absolute values of the deviations from the median.
- **Range**: Difference between the largest and smallest value in a data set.
- **Order Statistics**: Metrics based on sorted data values. Synonym: ranks.
- **Percentile**: Value such that \( P \) percent of the values take on this value or less and \( (100 - P) \) percent take on this value or more. Synonym: quantile.
- **Interquartile Range (IQR)**: Difference between the 75th percentile and the 25th percentile.


## Standard Deviation and Related Estimates

Variability is measured by examining deviations from a central tendency (mean or median).  Averaging deviations directly is uninformative (positive and negative deviations cancel).  Therefore, alternative approaches are used:

* **Mean Absolute Deviation:** Averages the absolute values of the deviations from the mean. 

* **Variance and Standard Deviation:** Based on squared deviations.  Variance is the average of squared deviations (divided by n-1 for unbiased sample estimate). The standard deviation is the square root of the variance and is easier to interpret because it uses the original data's units.
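These definitions can be checked by hand on the toy data used in the earlier mean examples. This is only a sketch; in practice a library routine would normally be used.

```python
from math import sqrt

data = [3, 5, 1, 2]                      # same toy data as the earlier examples
n = len(data)
mean = sum(data) / n                     # 2.75

deviations = [x - mean for x in data]
print(sum(deviations))                   # 0.0: raw deviations cancel out

mean_abs_dev = sum(abs(d) for d in deviations) / n
variance = sum(d ** 2 for d in deviations) / (n - 1)
std_dev = sqrt(variance)

print(mean_abs_dev)   # 1.25
print(variance)       # about 2.917
print(std_dev)        # about 1.708
```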


## Degrees of Freedom, and n or n – 1?

One way to picture degrees of freedom: suppose n people must each pick one of n balls. The first person has n choices, but the last person has no choice at all, since only one ball remains. Only n - 1 people make a free choice, so there are n - 1 degrees of freedom.

Using (n-1) in the variance denominator instead of 'n' provides an *unbiased* estimate of the population variance.  The (n-1) accounts for the *degrees of freedom*, reflecting the constraint of using the sample mean in the calculation.  For large datasets, the difference is negligible.
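The two denominators can be compared with the standard library, where `statistics.pvariance` divides by n and `statistics.variance` divides by n - 1. The data is the same toy list used earlier, and the large list is just a range of integers chosen for illustration.

```python
from statistics import pvariance, variance

data = [3, 5, 1, 2]

print(pvariance(data))  # divides by n:     2.1875
print(variance(data))   # divides by n - 1: about 2.917

# With many values the two denominators give nearly the same answer.
big = list(range(10_000))
ratio = pvariance(big) / variance(big)
print(ratio)            # about 0.9999, i.e. (n - 1) / n
```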

## Robust Estimates of Variability

The variance and standard deviation are sensitive to outliers.  A more robust alternative is the **Median Absolute Deviation (MAD)**:

* **MAD Formula:** `Median(|x1 - m|, |x2 - m|, ..., |xn - m|)`, where `m` is the median. In other words, it is the median of the absolute deviations from the median.

The variance, the standard deviation, the mean absolute deviation, and the median absolute deviation from the median are not equivalent estimates, even in the case where the data comes from a normal distribution. In fact, the standard deviation is always greater than the mean absolute deviation, which itself is greater than the median absolute deviation. Sometimes, the median absolute deviation is multiplied by a constant scaling factor to put the MAD on the same scale as the standard deviation in the case of a normal distribution. The commonly used factor of 1.4826 means that 50% of the normal distribution falls within the range ±MAD.

## Estimates Based on Percentiles
Statistics based on sorted (ranked) data are referred to as order statistics. The following variability estimates are based on them:

* **Range:** The difference between the maximum and minimum values.  Extremely sensitive to outliers.
* **Percentiles (Quantiles):**  The Pth percentile is a value where at least P% of the data is less than or equal to it, and at least (100-P)% is greater than or equal to it.

* **Interquartile Range (IQR):** The difference between the 75th and 25th percentiles. A robust measure of spread.

**Example in Python (using pandas):**

```python
import pandas as pd

data = pd.Series([3, 1, 5, 3, 6, 7, 2, 9])
# quantile() sorts internally and interpolates linearly by default
percentile_25 = data.quantile(0.25)  # 2.75
percentile_75 = data.quantile(0.75)  # 6.25
iqr = percentile_75 - percentile_25  # 3.5
print(f"IQR: {iqr}")
```

## Example: Variability Estimates of State Population


```python
import pandas as pd
from statsmodels.robust.scale import mad  # requires the statsmodels package

# Sample data (replace with your actual data)
data = {'State': ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California'],
        'Population': [4779736, 710231, 6392017, 2915918, 37253956]}
state = pd.DataFrame(data)

std_dev = state['Population'].std()
iqr = state['Population'].quantile(0.75) - state['Population'].quantile(0.25)
mad_value = mad(state['Population'])  # scaled by default to be comparable with the standard deviation

print(f"Standard Deviation: {std_dev}")  # sensitive to outliers
print(f"IQR: {iqr}")
print(f"MAD: {mad_value}")  # not sensitive to outliers
```

## Key Ideas (Variability)

* Variance and standard deviation are commonly used but sensitive to outliers.
* Mean absolute deviation and median absolute deviation from the median are more robust.
* Percentiles offer insights into data spread at different percentages.

## Exploring the Data Distribution

This section covers methods for visualizing data distributions.

## Key Terms for Exploring the Distribution

* **Boxplot:** A visual representation of data distribution using percentiles (quartiles, IQR, outliers).
* **Frequency table:** Data values categorized into intervals (bins) with their counts.
* **Histogram:** A bar chart visualizing a frequency table.
* **Density plot:** A smoothed representation of a histogram, often using kernel density estimation.


## Percentiles and Boxplots

Percentiles (quartiles, deciles) summarize the data distribution. Boxplots visualize this using percentiles: the box spans the interquartile range (25th to 75th percentile), the line inside the box is the median, and the whiskers extend to the most extreme data points that lie within a set distance of the box (typically 1.5 times the IQR). Points beyond the whiskers are plotted individually and are considered outliers.

In pandas, selected percentiles of the murder rate can be computed with `state['Murder.Rate'].quantile([0.05, 0.25, 0.5, 0.75, 0.95])`.
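The boxplot elements can also be computed by hand. The sketch below assumes the common convention of whiskers reaching the most extreme points within 1.5 times the IQR of the box; the data values are made up, and `method='inclusive'` is chosen so that the quartiles use linear interpolation between data points.

```python
from statistics import quantiles

data = [3, 1, 5, 3, 6, 7, 2, 30]   # made-up values; 30 is unusually large

q1, q2, q3 = quantiles(data, n=4, method='inclusive')
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

inside = [x for x in data if lower_fence <= x <= upper_fence]
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, q2, q3)                 # 2.75 4.0 6.25
print(min(inside), max(inside))   # whisker ends: 1 7
print(outliers)                   # [30]
```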

## Frequency Tables and Histograms

Frequency tables and histograms summarize data by grouping values into bins.  Histograms visualize frequency tables, showing counts or proportions for each bin.

It is important to include the empty bins; the fact that there are no values in those bins is useful information. It can also be useful to experiment with different bin sizes. If they are too large, important features of the distribution can be obscured. If they are too small, the result is too granular, and the ability to see the bigger picture is lost.

**Example in Python (using pandas):**

```python
import pandas as pd
import matplotlib.pyplot as plt

# Sample data (replace with your actual data)
data = {'Population': [4779736, 710231, 6392017, 2915918, 37253956, 5029196, 3574097, 897934]}
state = pd.DataFrame(data)

# Create a histogram
plt.hist(state['Population'], bins=5) #Adjust number of bins as needed
plt.xlabel('Population')
plt.ylabel('Frequency')
plt.title('Histogram of State Populations')
plt.show()

# Create a frequency table using cut and value_counts
binned = pd.cut(state['Population'], bins=5)
frequency_table = binned.value_counts().sort_index()  # keep bins in order, including empty ones
print(frequency_table)

# Create a boxplot
state['Population'].plot.box()
plt.ylabel('Population')
plt.title('Boxplot of State Populations')
plt.show()

```


## Statistical Moments

In statistical theory, location and variability are referred to as the first and second moments of a distribution. The third and fourth moments are called skewness and kurtosis. Skewness refers to whether the data is skewed to larger or smaller values (asymmetry), and kurtosis indicates the propensity of the data to have extreme values (tail weight). These are usually assessed visually (histograms, boxplots) rather than with numeric metrics.
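Although skewness and kurtosis are usually judged from plots, they can be computed as standardized moments. The helper below is a hypothetical plain-Python sketch using the simple (population) formulas, and the two samples are made up. A normal distribution has kurtosis 3 under this definition.

```python
def standardized_moment(values, k):
    """k-th moment of the values after centering and scaling by the
    population standard deviation (k=3: skewness, k=4: kurtosis)."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((x - mean) ** 2 for x in values) / n) ** 0.5
    return sum(((x - mean) / std) ** k for x in values) / n

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 20]

print(standardized_moment(symmetric, 3))     # approximately 0 for symmetric data
print(standardized_moment(right_skewed, 3))  # positive: long tail toward larger values
print(standardized_moment(right_skewed, 4))  # above 3: an extreme value is present
```

In pandas, `Series.skew()` and `Series.kurt()` provide bias-corrected versions of these quantities, with `kurt()` reporting excess kurtosis (normal = 0), so their numbers will not match this sketch exactly.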


## Density Plots and Estimates

Density plots provide a smoothed view of data distributions.  They are often created using kernel density estimation.

**Example in Python (using pandas):**


```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns # you will need seaborn installed

# Sample data (replace with your actual data)
data = {'Murder.Rate': [5.7, 5.6, 4.7, 5.6, 4.4, 2.8, 2.4, 5.8]}
state = pd.DataFrame(data)

# Create a density plot using seaborn
sns.kdeplot(state['Murder.Rate'])
plt.xlabel('Murder Rate')
plt.ylabel('Density')
plt.title('Density Plot of Murder Rates')
plt.show()

# Create histogram with density plot overlay
plt.hist(state['Murder.Rate'], density=True, alpha=0.5) #alpha adds some transparency
sns.kdeplot(state['Murder.Rate'])
plt.xlabel('Murder Rate')
plt.ylabel('Density')
plt.title('Histogram with Density Overlay')
plt.show()
```


The total area under the density curve equals 1. Instead of counts in bins, you calculate areas under the curve between any two points on the x-axis, which correspond to the proportion of the distribution lying between those two points.
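The area property can be checked numerically. The sketch below builds a simple Gaussian kernel density estimate by hand; `make_kde` and `area` are hypothetical helpers, and the bandwidth of 0.8 is an arbitrary assumption rather than a tuned value.

```python
from math import exp, pi, sqrt

def make_kde(data, bandwidth):
    """Return a density function that averages one Gaussian bump per data point."""
    n = len(data)
    norm = n * bandwidth * sqrt(2 * pi)
    def density(x):
        return sum(exp(-0.5 * ((x - d) / bandwidth) ** 2) for d in data) / norm
    return density

def area(f, a, b, steps=2000):
    """Trapezoidal approximation of the area under f between a and b."""
    h = (b - a) / steps
    return h * (0.5 * f(a) + sum(f(a + i * h) for i in range(1, steps)) + 0.5 * f(b))

murder_rate = [5.7, 5.6, 4.7, 5.6, 4.4, 2.8, 2.4, 5.8]  # sample values from the density plot example
density = make_kde(murder_rate, bandwidth=0.8)           # bandwidth chosen arbitrarily

print(round(area(density, -10, 20), 3))  # 1.0: total area under the curve
print(area(density, 4, 6))               # estimated proportion of values between 4 and 6
```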

## Key Ideas (Data Distribution)

* Histograms and frequency tables show data distribution by grouping into bins.
* Boxplots provide a concise visual summary using percentiles.
* Density plots offer a smoothed view of the distribution.

=========================================================================


summary 2

## Exploring Binary and Categorical Data

This section explores methods for summarizing and visualizing categorical data, including binary data (e.g., yes/no).  It also introduces the concept of expected value and touches upon probability.

### Key Terms for Exploring Categorical Data

* **Mode:** The most frequent category.
* **Expected Value:**  When the categories can be associated with a numeric value, this gives an average value based on a category’s probability of occurrence.
* **Bar Charts:** Visual representation of category frequencies or proportions using bars.
* **Pie Charts:** Visual representation of category proportions using pie slices.


### Summarizing Categorical Data

Summarizing categorical data often involves calculating proportions or percentages for each category.  For example, Table 1-6 (shown below) displays the percentage of flight delays at Dallas/Fort Worth Airport categorized by cause.

**Table 1-6. Percentage of delays by cause at Dallas/Fort Worth Airport**

| Cause          | Percentage |
|-----------------|------------|
| Carrier         | 23.02      |
| ATC             | 30.40      |
| Weather         | 4.03       |
| Security        | 0.12       |
| Inbound Aircraft| 42.43      |


### Visualizing Categorical Data: Bar Charts and Pie Charts

Bar charts are commonly used to visualize categorical data.  Categories are on the x-axis, and frequencies or proportions are on the y-axis.


Note that a bar chart resembles a histogram; in a bar chart the x-axis represents different categories of a factor variable, while in a histogram the x-axis represents values of a single variable on a numeric scale. In a histogram, the bars are typically shown touching each other, with gaps indicating values that did not occur in the data. In a bar chart, the bars are shown separate from one another.

**Python Code for Bar Charts (using pandas):**

```python
# Assuming 'dfw' is a pandas DataFrame with delay causes and counts
ax = dfw.transpose().plot.bar(figsize=(4, 4), legend=False) # Creates the bar chart
ax.set_xlabel('Cause of delay')
ax.set_ylabel('Count')
```

Pie charts are an alternative, but less informative than bar charts according to data visualization experts.


### Numerical Data as Categorical Data

Numeric data can be converted to categorical data through binning (grouping into ranges).  This simplifies the data and can aid in identifying relationships between variables.  Histograms and bar charts are similar in this regard, but bar charts use unordered categories on the x-axis.
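A short sketch of binning with `pandas.cut`. The ages, bin edges, and labels are arbitrary choices for illustration.

```python
import pandas as pd

# Hypothetical ages; the bin edges and labels are arbitrary choices.
ages = pd.Series([5, 17, 18, 34, 52, 70])

# right=False makes each bin include its left edge: [0, 18), [18, 65), [65, 120)
groups = pd.cut(ages, bins=[0, 18, 65, 120],
                labels=['minor', 'adult', 'senior'], right=False)

print(groups.tolist())   # ['minor', 'minor', 'adult', 'adult', 'adult', 'senior']
print(groups.value_counts().sort_index())
```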


### Mode

The mode is the most frequent value in a dataset. If several values tie for most frequent, the data has more than one mode.
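For example, `statistics.multimode` from the standard library returns every value that ties for most frequent. The short lists of delay causes below are made up.

```python
from statistics import multimode

single = ['ATC', 'Weather', 'ATC', 'Carrier']
tied = ['ATC', 'Weather', 'ATC', 'Weather', 'Carrier']

print(multimode(single))  # ['ATC']
print(multimode(tied))    # ['ATC', 'Weather']
```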


### Expected Value

The expected value is a weighted average of possible outcomes, where the weights are their probabilities.

A marketer for a new cloud technology, for example, offers two levels of service, one priced at $300/month and another at $50/month. The marketer offers free webinars to generate leads, and the firm figures that 5% of the attendees will sign up for the $300 service, 15% will sign up for the $50 service, and 80% will not sign up for anything. This data can be summed up, for financial purposes, in a single “expected value,” which is a form of weighted mean, in which the weights are probabilities.

The expected value is calculated as follows:
1. Multiply each outcome by its probability of occurrence.
2. Sum these values.

**Example:** For the webinar scenario above, the outcomes are $300/month (5% probability), $50/month (15% probability), and $0 (80% probability).  The expected value per attendee is:

```
EV = (0.05 * 300) + (0.15 * 50) + (0.80 * 0) = $22.50
```
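The same two steps in Python, using the outcomes and probabilities from the webinar example:

```python
# Outcomes (monthly revenue in dollars) and their probabilities from the webinar example.
outcomes = [300, 50, 0]
probabilities = [0.05, 0.15, 0.80]

# The probabilities must cover every possible outcome.
assert abs(sum(probabilities) - 1.0) < 1e-9

expected_value = sum(p * x for p, x in zip(probabilities, outcomes))
print(round(expected_value, 2))  # 22.5
```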


### Probability

Probability is the likelihood of an event occurring, often expressed as a proportion or percentage.


### Correlation

This section discusses methods for measuring and visualizing the relationship between numerical variables.

### Key Terms for Correlation

* **Correlation Coefficient:**  A measure of the linear association between two numerical variables, ranging from -1 (perfect negative correlation) to +1 (perfect positive correlation). 0 indicates no linear correlation.
* **Correlation Matrix:** A table showing the correlation coefficients between all pairs of variables.
* **Scatterplot:** A plot showing the relationship between two numerical variables; each point represents a data record.

The correlation coefficient gives an estimate of the correlation between two variables that always lies on the same scale. To compute Pearson's correlation coefficient, we multiply deviations from the mean for variable 1 times those for variable 2, and divide by the product of the standard deviations. The denominator uses \( n - 1 \) rather than \( n \) because of degrees of freedom.


**Pearson's Correlation Coefficient Formula:**

```
r = Σᵢ₌₁ⁿ (xᵢ - x̄)(yᵢ - ȳ) / ((n - 1) · s_x · s_y)
```

where:

*  `xᵢ` and `yᵢ` are individual data points.
*  `x̄` and `ȳ` are the means of x and y.
*  `s_x` and `s_y` are the sample standard deviations of x and y.
*  `n` is the number of data points.
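The formula can be translated term by term into plain Python. The function name `pearson_r` and the data are hypothetical and only serve to check the arithmetic.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation: summed cross-deviations over (n - 1) * s_x * s_y."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_x = sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
    s_y = sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
    cross = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return cross / ((n - 1) * s_x * s_y)

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]          # made-up values that tend to rise with x

print(round(pearson_r(x, y), 4))                 # 0.7746
print(round(pearson_r(x, [-v for v in x]), 4))   # -1.0: perfect negative linear relationship
```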


**Python code for Correlation Matrix and Heatmap (using seaborn and pandas):**

```python
import seaborn as sns
import pandas as pd

# Assuming 'etfs' is a pandas DataFrame of ETF returns

# Correlation Matrix
correlation_matrix = etfs.corr()

# Heatmap
sns.heatmap(correlation_matrix, vmin=-1, vmax=1, cmap=sns.diverging_palette(20, 220, as_cmap=True))
```


**Scatterplot (using pandas):**

```python
# Assuming 'telecom' is a pandas DataFrame with ATT and Verizon returns
ax = telecom.plot.scatter(x='T', y='VZ', figsize=(4, 4), marker='$◯$')
ax.set_xlabel('ATT (T)')
ax.set_ylabel('Verizon (VZ)')
ax.axhline(0, color='grey', lw=1)
ax.axvline(0, color='grey', lw=1)

```

Like the mean and standard deviation, the correlation coefficient is sensitive to outliers in the data.

Other correlation estimates exist, such as Spearman's rho and Kendall's tau. These are correlation coefficients based on the rank of the data. Since they work with ranks rather than values, these estimates are robust to outliers and can handle certain types of nonlinearities. However, Pearson's correlation coefficient and its robust alternatives are generally sufficient for exploratory analysis.
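To illustrate the effect of working with ranks, the sketch below computes Spearman's rho as the Pearson correlation of the ranks. The helpers are hypothetical, the data is made up, and the ranking function assumes there are no tied values.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    cross = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    ss_y = sum((yi - y_bar) ** 2 for yi in y)
    return cross / sqrt(ss_x * ss_y)

def ranks(values):
    """Rank values from 1 (smallest) upward; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

x = [1, 2, 3, 4, 5, 6]
y = [1, 2, 3, 4, 5, 100]     # monotonic, but the last value is an extreme outlier

pearson = pearson_r(x, y)
spearman = pearson_r(ranks(x), ranks(y))

print(round(pearson, 3))     # well below 1: the outlier distorts the linear fit
print(round(spearman, 3))    # 1.0: the ranks agree perfectly
```

In pandas, `DataFrame.corr(method='spearman')` or `method='kendall'` computes rank-based correlations directly.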


### Exploring Two or More Variables

This section covers visualizing relationships between multiple variables.
Familiar estimators like mean and variance look at variables one at a time (univariate
analysis). Correlation analysis (see the Correlation section above) is an important method
that compares two variables (bivariate analysis). In this section we look at additional
estimates and plots, and at more than two variables (multivariate analysis).

### Key Terms for Exploring Two or More Variables

* **Contingency Table:** A table summarizing the counts of different combinations of categories across two or more categorical variables.
* **Hexagonal Binning:**  A method for visualizing the density of points in a scatterplot by binning the data into hexagons.
* **Contour Plot:** A topographical-like representation of the density of two numeric variables.
* **Violin Plot:** Similar to a box plot but shows the probability density of the data.

### Two Categorical Variables

A useful way to summarize two categorical variables is a contingency table—a table of counts by category.

#### Table 1-8: Contingency Table of Loan Grade and Status

| Grade | Charged off | Current | Fully paid | Late | Total |
|-------|-------------|---------|------------|------|-------|
| **A** | 1562        | 50051   | 20408      | 469  | 72490 |
| **B** | 5302        | 93852   | 31160      | 2056 | 132370|
| **C** | 6023        | 88928   | 23147      | 2777 | 120875|
| **D** | 5007        | 53281   | 13681      | 2308 | 74277 |
| **E** | 2842        | 24639   | 5949       | 1374 | 34804 |
| **F** | 1526        | 8444    | 2328       | 606  | 12904 |
| **G** | 409         | 1990    | 643        | 199  | 3241  |
| **Total**| 22671   | 321185  | 97316      | 9789 | 450961|

#### Python Code for Contingency Table
```python
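# Count loans for each grade/status combination; margins=True adds 'All' totals
crosstab = lc_loans.pivot_table(index='grade', columns='status',
                                aggfunc=lambda x: len(x), margins=True)

# Convert the counts for grades A-G into row proportions
df = crosstab.loc['A':'G', :].copy()
df.loc[:, 'Charged Off':'Late'] = df.loc[:, 'Charged Off':'Late'].div(df['All'],
                                                                    axis=0)
# Share of all loans that fall in each grade
df['All'] = df['All'] / sum(df['All'])
perc_crosstab = df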
```
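As an alternative sketch, `pandas.crosstab` builds the same kind of table directly from two columns. The tiny `loans` DataFrame below is made up so the example runs on its own; it is not the `lc_loans` data.

```python
import pandas as pd

# Tiny made-up loan sample; the real lc_loans data is much larger.
loans = pd.DataFrame({
    'grade':  ['A', 'A', 'B', 'B', 'B', 'C'],
    'status': ['Current', 'Fully paid', 'Current', 'Late', 'Current', 'Charged off'],
})

# Counts with row and column totals
counts = pd.crosstab(loans['grade'], loans['status'], margins=True)

# Proportions within each grade (each row sums to 1)
row_share = pd.crosstab(loans['grade'], loans['status'], normalize='index')

print(counts)
print(row_share.round(2))
```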

### Key Ideas

- Contingency tables can look only at counts, or they can also include column and total percentages.
- Hexagonal binning and contour plots give a visual representation of a two-dimensional density.

**Python code for Hexagonal Binning (using pandas):**

```python
# Assuming 'kc_tax0' is a pandas DataFrame with square footage and tax assessed value.

ax = kc_tax0.plot.hexbin(x='SqFtTotLiving', y='TaxAssessedValue', gridsize=30, sharex=False, figsize=(5, 4))
ax.set_xlabel('Finished Square Feet')
ax.set_ylabel('Tax-Assessed Value')
```

**Python code for Contour Plot (using seaborn):**


```python
import seaborn as sns
import matplotlib.pyplot as plt

# Assuming 'kc_tax0' is a pandas DataFrame with square footage and tax assessed value.

fig, ax = plt.subplots(figsize=(4, 4))  # create the axes that kdeplot draws on
sns.kdeplot(data=kc_tax0, x='SqFtTotLiving', y='TaxAssessedValue', ax=ax)
ax.set_xlabel('Finished Square Feet')
ax.set_ylabel('Tax-Assessed Value')
plt.show()
```

These visualization techniques help to understand the relationships between variables in larger datasets where simple scatterplots are too dense.

Heat maps, hexagonal binning, and contour plots all give
a visual representation of a two-dimensional density. In this way, they are natural
analogs to histograms and density plots.

===============================================================================


summary 2

## Categorical and Numeric Data Visualization in Python

This section explores visualizing the relationship between categorical and numeric data using boxplots and violin plots in Python.

### Boxplots

Boxplots offer a simple way to compare the distribution of a numeric variable across different categories. 

**Code:**

```python
import matplotlib.pyplot as plt
import pandas as pd

# Assuming 'airline_stats' is a pandas DataFrame with 'airline' and 'pct_carrier_delay' columns.

# Using pandas' built-in boxplot function (matplotlib also provides plt.boxplot)
airline_stats.boxplot(by='airline', column='pct_carrier_delay')
plt.xlabel('')
plt.ylabel('Daily % of Delayed Flights')
plt.suptitle('')
plt.show()
```

The example highlights that Alaska has fewer delays than American: American's lower quartile is higher than Alaska's upper quartile.

### Violin Plots

Violin plots enhance boxplots by displaying the probability density of the data at different values, providing a richer understanding of the distribution's shape.

**Code:**

```python
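import seaborn as sns
import matplotlib.pyplot as plt

# Assuming 'airline_stats' is the same DataFrame as in the boxplot example.
ax = sns.violinplot(data=airline_stats, x='airline', y='pct_carrier_delay',
                    inner='quartile', color='white')
ax.set_xlabel('')
ax.set_ylabel('Daily % of Delayed Flights')
plt.show()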
```

**Explanation:** This code uses the `seaborn` library to create a violin plot. The `inner='quartile'` argument draws lines inside each violin at the quartiles. The resulting plot reveals the concentration of data points near zero for Alaska and Delta, a detail less apparent in the boxplot.


### Visualizing Multiple Variables

This section shows how to extend visualization techniques to handle more than two variables by using conditioning (faceting). The example analyzes the relationship between house square footage, tax-assessed value, and zip code.

**Code:**

```python
import seaborn as sns
import matplotlib.pyplot as plt

zip_codes = [98188, 98105, 98108, 98126]
kc_tax_zip = kc_tax0.loc[kc_tax0.ZipCode.isin(zip_codes),:]

def hexbin(x, y, color, **kwargs):
    cmap = sns.light_palette(color, as_cmap=True)
    plt.hexbin(x, y, gridsize=25, cmap=cmap, **kwargs)

g = sns.FacetGrid(kc_tax_zip, col='ZipCode', col_wrap=2)
g.map(hexbin, 'SqFtTotLiving', 'TaxAssessedValue', extent=[0, 3500, 0, 700000])
g.set_axis_labels('Finished Square Feet', 'Tax-Assessed Value')
g.set_titles('Zip code {col_name:.0f}')
plt.show()
```

**Explanation:** This code uses `seaborn.FacetGrid` to create a set of hexagonal binning plots, each conditioned on a specific zip code.  This reveals that tax-assessed values vary significantly across zip codes, explaining clusters seen in a simpler two-variable plot (not shown here, but referenced in the original text).  The `hexbin` function creates the hexagonal bin plots, efficiently visualizing the density of data points.  `FacetGrid` arranges these plots based on the `ZipCode` column, providing a clear comparison across different zip codes.


**Key Ideas Summarized:**

*   Boxplots and violin plots effectively visualize the relationship between one numeric and one categorical variable.
*   Conditioning variables (faceting) extend visualizations to handle multiple variables by creating separate plots for each category of the conditioning variable.  This allows for deeper insight into data relationships.
*   Hexagonal binning is a useful technique to visualize the density of points in a scatterplot, especially for large datasets.


==================================================================================
## Chapter 2. Data and Sampling Distributions

### Introduction

The chapter discusses the importance of sampling in data science, even in the era of big data. Sampling is essential for working efficiently with a variety of data and minimizing bias. Predictive models are often developed and piloted with samples, and samples are used in various tests, such as comparing the effect of web page designs on clicks.

### Key Concepts

- **Data and Sampling Distributions**:
  - **Population**: A large, defined set of data.
  - **Sample**: A subset of data from a larger data set.
  - **Sampling Procedure**: The process of drawing elements into a sample at random.
  - **Empirical Distribution**: The distribution of sample data.

- **Random Sampling and Sample Bias**:
  - **Random Sampling**: Each member of the population has an equal chance of being chosen.
  - **Sample Bias**: A sample that misrepresents the population.
  - **Stratified Sampling**: Dividing the population into strata and randomly sampling from each stratum.
  - **Bias**: Systematic error.

- **Self-Selection Sampling Bias**:
  - Bias in reviews due to self-selection of reviewers.

- **Bias**:
  - **Statistical Bias**: Systematic errors in measurement or sampling.
  - **Random Error**: Errors due to random chance.

- **Random Selection**:
  - Methods to achieve representativeness, such as stratified sampling.

- **Size Versus Quality**:
  - Smaller samples can be better for reducing bias and allowing greater attention to data quality.
  - Massive amounts of data are needed for sparse data problems, such as search queries.

- **Sample Mean Versus Population Mean**:
  - **x̄**: Mean of a sample.
  - **μ**: Mean of a population.

- **Selection Bias**:
  - Bias resulting from the way observations are selected.
  - **Data Snooping**: Extensive hunting through data.
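  - **Vast Search Effect**: Bias or nonreproducibility that results from repeated data modeling, or from modeling data with large numbers of predictor variables.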

- **Regression to the Mean**:
  - Extreme observations tend to be followed by more central ones.

- **Sampling Distribution of a Statistic**:
  - The distribution of a sample statistic over many samples.
  - **Central Limit Theorem**: The tendency of the sampling distribution to take on a normal shape as sample size rises.
  - **Standard Error**: The variability of a sample statistic over many samples.

### Code Snippets

#### Python Code for Sampling Distribution

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Assumes loans_income is a predefined pandas Series of income values
sample_data = pd.DataFrame({
    'income': loans_income.sample(1000),
    'type': 'Data',
})
sample_mean_05 = pd.DataFrame({
    'income': [loans_income.sample(5).mean() for _ in range(1000)],
    'type': 'Mean of 5',
})
sample_mean_20 = pd.DataFrame({
    'income': [loans_income.sample(20).mean() for _ in range(1000)],
    'type': 'Mean of 20',
})
results = pd.concat([sample_data, sample_mean_05, sample_mean_20])

g = sns.FacetGrid(results, col='type', col_wrap=1, height=2, aspect=2)
g.map(plt.hist, 'income', range=[0, 200000], bins=40)
g.set_axis_labels('Income', 'Count')
g.set_titles('{col_name}')
```

### Examples and Key Points

- **Literary Digest Poll of 1936**:
  - Predicted a victory of Alf Landon over Franklin Roosevelt.
  - The sample was biased towards high socioeconomic status.

- **George Gallup**:
  - Conducted biweekly polls of just 2,000 people and accurately predicted a Roosevelt victory.

- **Regression to the Mean**:
  - Example: "Rookie of the year, sophomore slump" phenomenon in sports.
  - Identified by Francis Galton in 1886.

### Tips

- **Data Quality**:
  - Data quality involves completeness, consistency of format, cleanliness, and accuracy of individual data points.
  - Statistics adds the notion of representativeness.

- **Random Sampling**:
  - Proper definition of an accessible population is key.
  - Consider timing and stratification for representative samples.

- **Data Snooping**:
  - Extensive hunting through data can lead to misleading conclusions.
  - Use holdout sets and target shuffling to validate performance.

### Further Reading

- **Sampling Procedures**:
  - Ronald Fricker’s chapter “Sampling Methods for Online Surveys” in The SAGE Handbook of Online Research Methods, 2nd ed.

- **Literary Digest Poll Failure**:
  - Story on the Capital Century website.

- **Selection Bias**:
  - Christopher J. Pannucci and Edwin G. Wilkins’ article “Identifying and Avoiding Bias in Research” in Plastic and Reconstructive Surgery (August 2010).
  - Michael Harris’s article “Fooled by Randomness Through Selection Bias” from the perspective of traders.

### Conclusion

- **Key Ideas**:
  - Random sampling remains important in data science.
  - Bias occurs when measurements or observations are systematically in error.
  - Data quality is often more important than data quantity.
  - The sampling distribution of a statistic tells us how a metric would turn out differently from sample to sample.
  - The central limit theorem and standard error are key concepts in understanding sampling distributions.




summary 2

# Chapter 2: Data and Sampling Distributions - Summary

This chapter emphasizes the importance of sampling, even in the "big data" era, to ensure efficient data handling and minimize bias.  It covers random sampling, bias types, sampling distributions, and the central limit theorem.

## 1. Random Sampling and Sample Bias

**Key Concept:**  A sample is a smaller subset of a larger dataset (the population). Random sampling ensures every population member has an equal chance of selection.  This avoids **sample bias**, where the sample misrepresents the population in a systematic way (e.g., the Literary Digest poll).

**Example:** The Literary Digest's 1936 poll failed because it sampled from a biased population (subscribers and car/telephone owners), leading to an inaccurate prediction. George Gallup's smaller, more representative sample gave a correct prediction.

**Types of Sampling:**

*   **Simple Random Sampling:** Each member has an equal chance of being selected.  Can be done with or without replacement.
*   **Stratified Sampling:** Population is divided into subgroups (strata), and random samples are taken from each. Useful for ensuring representation of minority groups.
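
A minimal sketch of the two sampling schemes above, using pandas. The DataFrame, its `group` column, and the 10% sampling fraction are made-up assumptions for illustration only.

```python
import pandas as pd

# Hypothetical population: 900 members of group "A" and 100 of group "B"
population = pd.DataFrame({
    "group": ["A"] * 900 + ["B"] * 100,
    "value": range(1000),
})

# Simple random sample: every row has the same chance of selection
simple = population.sample(n=100, random_state=1)

# Stratified sample: draw 10% at random from each stratum separately
stratified = population.groupby("group").sample(frac=0.1, random_state=1)

print(simple["group"].value_counts())
print(stratified["group"].value_counts())
```

In the simple random sample the share of group "B" varies from draw to draw, while the stratified sample always contains 90 rows from "A" and 10 from "B".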

**Data Quality:**  Data quality (completeness, consistency, cleanliness, accuracy, and representativeness) is crucial, often outweighing quantity.


## 2. Self-Selection Sampling Bias

**Key Concept:**  Self-selection bias occurs when individuals select themselves into the sample (e.g., Yelp reviews). Those motivated to participate might not represent the entire population. While unreliable for generalizing to the whole population, self-selection samples can be useful for *comparisons* between similar entities.


## 3. Bias

**Key Concept:** Bias refers to systematic errors in measurement or sampling, distinct from random error.  It often indicates a misspecified model or missing variables.


## 4. Random Selection

Proper definition of the accessible population and the sampling procedure are vital for avoiding bias.  Consider the challenges in defining “customer” for a customer survey (past customers, refunds, test purchases, etc.). Timing of sampling also matters (e.g., website traffic varies by time of day and week).


## 5. Size Versus Quality

Smaller, high-quality samples are often preferable to massive, low-quality datasets.  Random sampling reduces bias and allows for better data exploration and quality improvement.

**Exception:** "Big data" is valuable when data is sparse (e.g., Google search queries).  The sheer volume allows for effective predictions even for infrequent search terms.


## 6. Sample Mean Versus Population Mean

x̄ represents the sample mean, while μ represents the population mean. The distinction is important because sample statistics are observed, while population parameters are often inferred from samples.


## 7. Selection Bias

**Key Concept:** Selection bias involves choosing data in a way that leads to misleading conclusions.  This includes data snooping (searching for patterns until something interesting is found) and the vast search effect (repeated modeling leading to spurious findings).

**Mitigation:** Use a holdout set (or multiple sets) to validate model performance and protect against the vast search effect.  Target shuffling (permutation tests) can also help assess the validity of findings.
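
As a rough illustration of target shuffling, the sketch below searches 50 noise predictors for the one most correlated with a target, then repeats the same search on shuffled targets. The data, the helper `best_abs_corr`, and the number of shuffles are all assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 observations, 50 pure-noise predictors, unrelated target
X = rng.normal(size=(200, 50))
y = rng.normal(size=200)

def best_abs_corr(X, y):
    """Largest absolute correlation between any predictor column and the target."""
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)
    yc = (y - y.mean()) / y.std()
    return np.abs(Xc.T @ yc / len(y)).max()

observed = best_abs_corr(X, y)

# Target shuffling: permute y to break any real link, then repeat the same search
shuffled = np.array([best_abs_corr(X, rng.permutation(y)) for _ in range(500)])

# Share of shuffled searches that did at least as well as the real search
p_value = (shuffled >= observed).mean()
print(f"best |r| found: {observed:.3f}, permutation p-value: {p_value:.3f}")
```

Because the search itself is repeated on every shuffle, the comparison accounts for the vast search effect: a "best" correlation that shuffled data matches easily is not evidence of a real relationship.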


## 8. Regression to the Mean

**Key Concept:** Extreme observations tend to be followed by more central ones. This is a type of selection bias, where focusing on extreme values can lead to misinterpretations (e.g., "rookie of the year, sophomore slump").  It's caused by the combination of skill and luck influencing initial extreme performance; luck typically regresses towards the average in subsequent measurements.

**Caution:** Regression to the mean is *not* the same as linear regression in statistical modeling.
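
The skill-plus-luck explanation can be checked with a small simulation. This is only a sketch: the model (performance equals a stable skill term plus independent luck) and all the numbers are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n_players = 10_000

# Hypothetical model: observed performance = stable skill + independent luck
skill = rng.normal(0, 1, n_players)
season1 = skill + rng.normal(0, 1, n_players)
season2 = skill + rng.normal(0, 1, n_players)

# Select the top 1% of performers in season 1 (the "rookies of the year")
top = season1 >= np.quantile(season1, 0.99)

print(f"season 1 mean of top group: {season1[top].mean():.2f}")
print(f"season 2 mean of top group: {season2[top].mean():.2f}")
print(f"overall season 2 mean:      {season2.mean():.2f}")
```

The selected group still beats the overall average in the second season (their skill persists), but by a smaller margin, because the luck that helped put them in the top group does not repeat.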


## 9. Sampling Distribution of a Statistic

**Key Concept:** The sampling distribution of a statistic is the distribution of that statistic (e.g., mean) across many samples from the same population. It shows how much the statistic might vary from sample to sample.

**Key Terms:**

*   **Sample statistic:** A metric calculated from a sample (e.g., sample mean).
*   **Data distribution:** The distribution of individual data values.
*   **Sampling distribution:** The distribution of a sample statistic over many samples.
*   **Central limit theorem:** The tendency of sampling distributions to become normal as sample size increases.
*   **Standard error:** The variability (standard deviation) of a sample statistic over many samples.

**Example (Python Code):**  The following Python code demonstrates the central limit theorem by showing how the distribution of sample means becomes more normal and narrower as sample size increases.  It uses the `seaborn` library for visualization.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# 'loans_income' should be a pandas Series of income values (replace with your data).
# Placeholder data for demonstration purposes:
loans_income = pd.Series(range(10000))

sample_data = pd.DataFrame({
    'income': loans_income.sample(1000),
    'type': 'Data',
})
sample_mean_05 = pd.DataFrame({
    'income': [loans_income.sample(5).mean() for _ in range(1000)],
    'type': 'Mean of 5',
})
sample_mean_20 = pd.DataFrame({
    'income': [loans_income.sample(20).mean() for _ in range(1000)],
    'type': 'Mean of 20',
})
results = pd.concat([sample_data, sample_mean_05, sample_mean_20])

g = sns.FacetGrid(results, col='type', col_wrap=1, height=2, aspect=2)
# The histogram range matches the placeholder data (0 to 9999); widen it for real incomes
g.map(plt.hist, 'income', range=[0, 10000], bins=40)
g.set_axis_labels('Income', 'Count')
g.set_titles('{col_name}')
plt.show()
```


## 10. Central Limit Theorem

The central limit theorem states that, as the sample size increases, the distribution of sample means approaches a normal distribution even when the original population is not normally distributed. This underpins classical statistical inference (confidence intervals, hypothesis tests).  However, the bootstrap offers a modern alternative for estimating sampling distributions that does not rely on normality assumptions.


## 11. Standard Error

Standard error measures the variability of a sample statistic (e.g., the mean) from sample to sample. A smaller standard error indicates a more precise estimate. For the sample mean it is estimated as \( SE = s / \sqrt{n} \), where \( s \) is the sample standard deviation, so it is inversely proportional to the square root of the sample size (the square root of n rule: quadrupling the sample size halves the standard error).  In modern statistics, the bootstrap is often preferred over methods relying on the central limit theorem for estimating standard error.
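
The square root of n rule can be checked numerically. The sketch below compares the formula s / sqrt(n) with the observed spread of many sample means; the skewed population and the sample sizes are assumptions picked for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical skewed population (exponential with mean 50,000)
population = rng.exponential(scale=50_000, size=100_000)

se = {}
for n in (25, 100):
    # Empirical standard error: spread of 2,000 sample means of size n
    means = [rng.choice(population, size=n).mean() for _ in range(2000)]
    empirical = np.std(means, ddof=1)
    # Formula-based standard error: s / sqrt(n)
    formula = population.std(ddof=1) / np.sqrt(n)
    se[n] = (empirical, formula)
    print(f"n={n:>3}: empirical SE = {empirical:,.0f}, s/sqrt(n) = {formula:,.0f}")
```

Going from n = 25 to n = 100 quadruples the sample size and roughly halves both numbers.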


## 12. Standard Deviation Versus Standard Error

Standard deviation measures the variability of *individual data points*, while standard error measures the variability of a *sample statistic*.  Do not confuse the two.

========================================================================


# The Bootstrap

## Introduction

The bootstrap is a powerful statistical technique that estimates the sampling distribution of a statistic or model parameters by repeatedly drawing samples with replacement from the original data set. This method does not require any assumptions about the data or the sample statistic being normally distributed.

## Key Terms

- **Bootstrap Sample**: A sample taken with replacement from an observed data set.
- **Resampling**: The process of taking repeated samples from observed data, including both bootstrap and permutation procedures.

## Conceptual Explanation

Imagine replicating the original sample thousands or millions of times to create a hypothetical population that embodies all the knowledge from the original sample. From this population, you can draw samples to estimate the sampling distribution.

## Practical Algorithm

1. Draw a sample value, record it, and then replace it.
2. Repeat step 1 \( n \) times.
3. Record the mean of the \( n \) resampled values.
4. Repeat steps 1–3 \( R \) times.
5. Use the \( R \) results to:
   - Calculate their standard deviation (estimates sample mean standard error).
   - Produce a histogram or boxplot.
   - Find a confidence interval.

The number of iterations \( R \) is set somewhat arbitrarily. The more iterations you do, the more accurate the estimate of the standard error or the confidence interval.
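
The steps above can be sketched directly in NumPy for the sample mean. The ten income values are placeholder data, and R = 1000 and the 90% level are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical sample of incomes (stand-in for the original data)
data = np.array([60000, 65000, 70000, 55000, 62000, 75000, 68000, 58000, 63000, 61000])
n, R = len(data), 1000

# Steps 1-4: draw n values with replacement, record the mean, repeat R times
boot_means = np.array([rng.choice(data, size=n, replace=True).mean() for _ in range(R)])

# Step 5: summarize the R results
std_error = boot_means.std(ddof=1)                    # estimated standard error of the mean
ci_low, ci_high = np.percentile(boot_means, [5, 95])  # 90% percentile interval
print(f"std. error: {std_error:.0f}, 90% CI: ({ci_low:.0f}, {ci_high:.0f})")
```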

## Example in R

```R
library(boot)
stat_fun <- function(x, idx) median(x[idx])
boot_obj <- boot(loans_income, R=1000, statistic=stat_fun)
```

The function `stat_fun` computes the median for a given sample specified by the index `idx`. The result is:

```
Bootstrap Statistics :
    original   bias    std. error
t1*    62000 -70.5595    209.1515
```

## Example in Python

```python
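import pandas as pd
from sklearn.utils import resample

# Assumes loans_income is a pandas Series of income values
results = []
for nrepeat in range(1000):
    sample = resample(loans_income)  # bootstrap sample (drawn with replacement)
    results.append(sample.median())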
results = pd.Series(results)
print('Bootstrap Statistics:')
print(f'original: {loans_income.median()}')
print(f'bias: {results.mean() - loans_income.median()}')
print(f'std. error: {results.std()}')
```

## Multivariate Bootstrap

The bootstrap can also be applied to multivariate data, where the rows are sampled as units. A model might then be run on the bootstrapped data to estimate the stability (variability) of model parameters or to improve predictive power.
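
A minimal sketch of a row-wise bootstrap, assuming a made-up two-column data set and a straight-line fit with `np.polyfit` as the model. Neither the data nor the choice of model comes from the source.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
# Hypothetical two-column data set with a true slope of 3
x = rng.uniform(0, 10, 200)
df = pd.DataFrame({"x": x, "y": 3 * x + rng.normal(0, 2, 200)})

slopes = []
for i in range(500):
    # Resample whole rows with replacement so each (x, y) pair stays together
    boot = df.sample(n=len(df), replace=True, random_state=i)
    slope, intercept = np.polyfit(boot["x"], boot["y"], deg=1)
    slopes.append(slope)

slopes = np.array(slopes)
print(f"slope: mean = {slopes.mean():.3f}, bootstrap std. error = {slopes.std(ddof=1):.3f}")
```

The spread of the 500 fitted slopes gives an estimate of how stable the slope parameter is.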

## Historical Context

The bootstrap was introduced by Bradley Efron in the late 1970s and early 1980s. It gained popularity among researchers who use statistics but are not statisticians, especially for metrics or models where mathematical approximations are not readily available.

## Warning

The bootstrap does not compensate for a small sample size; it does not create new data, nor does it fill in holes in an existing data set. It merely informs us about how lots of additional samples would behave when drawn from a population like our original sample.

## Resampling Versus Bootstrapping

- **Resampling**: Includes both bootstrap and permutation procedures.
- **Bootstrap**: Always implies sampling with replacement from an observed data set.

## Key Ideas

- The bootstrap is a powerful tool for assessing the variability of a sample statistic.
- It can be applied in a wide variety of circumstances without extensive study of mathematical approximations to sampling distributions.
- When applied to predictive models, aggregating multiple bootstrap sample predictions (bagging) often outperforms the use of a single model.

## Further Reading

- "An Introduction to the Bootstrap" by Bradley Efron and Robert Tibshirani (Chapman & Hall, 1993)
- The retrospective on the bootstrap in the May 2003 issue of Statistical Science (vol. 18, no. 2)
- "An Introduction to Statistical Learning" by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani (Springer, 2013)

# Confidence Intervals

## Introduction

Confidence intervals express a sample estimate as a range rather than a single number. The range is built so that a stated percentage of intervals constructed the same way would contain the quantity being estimated. They are a way to understand the potential error in a sample estimate.

## Key Terms

- **Confidence Level**: The percentage of confidence intervals, constructed in the same way from the same population, that are expected to contain the statistic of interest.
- **Interval Endpoints**: The top and bottom of the confidence interval.

## Conceptual Explanation

Confidence intervals present an estimate as a range, grounded in statistical sampling principles. A 90% confidence interval encloses the central 90% of the bootstrap sampling distribution of a sample statistic.

## Practical Algorithm

1. Draw a random sample of size \( n \) with replacement from the data (a resample).
2. Record the statistic of interest for the resample.
3. Repeat steps 1–2 many (R) times.
4. For an x% confidence interval, trim [(100-x) / 2]% of the R resample results from either end of the distribution.
5. The trim points are the endpoints of an x% bootstrap confidence interval.

## Example in Python

```python
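import pandas as pd
from sklearn.utils import resample

# Assumes loans_income is a pandas Series of income values
results = []
for nrepeat in range(1000):
    sample = resample(loans_income)  # bootstrap sample (drawn with replacement)
    results.append(sample.median())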
results = pd.Series(results)
confidence_interval = results.quantile([0.025, 0.975])
print(f'95% Confidence Interval: {confidence_interval}')
```

## Key Ideas

- Confidence intervals are the typical way to present estimates as an interval range.
- The more data you have, the less variable a sample estimate will be.
- The lower the level of confidence you can tolerate, the narrower the confidence interval will be.
- The bootstrap is an effective way to construct confidence intervals.

## Further Reading

- "Introductory Statistics and Analytics: A Resampling Perspective" by Peter Bruce (Wiley, 2014)
- "Statistics: Unlocking the Power of Data, 2nd ed." by Robin Lock and four other Lock family members (Wiley, 2016)
- "Modern Engineering Statistics" by Thomas Ryan (Wiley, 2007)

# Normal Distribution

## Introduction

The normal distribution is a bell-shaped curve that is iconic in traditional statistics. It is essential for the development of mathematical formulas that approximate the distributions of sample statistics.

## Key Terms

- **Error**: The difference between a data point and a predicted or average value.
- **Standardize**: Subtract the mean and divide by the standard deviation.
- **z-score**: The result of standardizing an individual data point.
- **Standard Normal**: A normal distribution with mean = 0 and standard deviation = 1.
- **QQ-Plot**: A plot to visualize how close a sample distribution is to a specified distribution, e.g., the normal distribution.

## Conceptual Explanation

In a normal distribution, about 68% of the data lies within one standard deviation of the mean, and about 95% lies within two standard deviations.
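
These percentages can be recovered from the normal cumulative distribution function. A short check with `scipy.stats.norm`:

```python
from scipy import stats

# Probability mass within k standard deviations of the mean of a normal distribution
within = {k: stats.norm.cdf(k) - stats.norm.cdf(-k) for k in (1, 2, 3)}

for k, share in within.items():
    print(f"within {k} sd: {share:.4f}")
# within 1 sd: 0.6827
# within 2 sd: 0.9545
# within 3 sd: 0.9973
```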

## Warning

Most raw data is not normally distributed. The utility of the normal distribution derives from the fact that many statistics are normally distributed in their sampling distribution.

## Standard Normal and QQ-Plots

To compare data to a standard normal distribution, you subtract the mean and then divide by the standard deviation. This is also called normalization or standardization. The transformed value is termed a z-score.
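
A small sketch of the standardization step, using four arbitrary values. Dividing by the sample standard deviation (`ddof=1`) is a choice made here; the population version (`ddof=0`) is also common.

```python
import numpy as np

data = np.array([3, 5, 1, 2], dtype=float)

# Standardize: subtract the mean, divide by the (sample) standard deviation
z = (data - data.mean()) / data.std(ddof=1)

print(z.round(3))
```

After the transformation the values have mean 0 and standard deviation 1, but the shape of their distribution is unchanged.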

## Example in Python

```python
import matplotlib.pyplot as plt
import scipy.stats as stats

fig, ax = plt.subplots(figsize=(4, 4))
norm_sample = stats.norm.rvs(size=100)
stats.probplot(norm_sample, plot=ax)
plt.show()
```

## Key Ideas

- The normal distribution was essential to the historical development of statistics.
- While raw data is typically not normally distributed, errors often are, as are averages and totals in large samples.
- To convert data to z-scores, you subtract the mean of the data and divide by the standard deviation; you can then compare the data to a normal distribution.

## Further Reading

- "An Introduction to the Bootstrap" by Bradley Efron and Robert Tibshirani (Chapman & Hall, 1993)
- "An Introduction to Statistical Learning" by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani (Springer, 2013)
- "Modern Engineering Statistics" by Thomas Ryan (Wiley, 2007)



summary 2


## The Bootstrap

**Concept:** The bootstrap is a resampling technique used to estimate the sampling distribution of a statistic or model parameters.  It involves drawing multiple samples *with replacement* from the original dataset and recalculating the statistic for each resample. This creates a simulated population reflecting the characteristics of the original sample, allowing estimation of variability.  It's particularly useful when the data doesn't follow a normal distribution or when mathematical approximations for the sampling distribution are unavailable.


**Python Implementation (Illustrative):**

```python
import pandas as pd
from sklearn.utils import resample

# Sample data (replace with your actual data)
loans_income = pd.Series([60000, 65000, 70000, 55000, 62000, 75000, 68000, 58000, 63000, 61000])

results = []
for nrepeat in range(1000):  # Number of bootstrap resamples
    sample = resample(loans_income)  # Resample the data with replacement
    results.append(sample.median())  # Calculate the median for each resample

results = pd.Series(results)

print('Bootstrap Statistics:')
print(f'original: {loans_income.median()}')
print(f'bias: {results.mean() - loans_income.median()}')
print(f'std. error: {results.std()}')

```

This code snippet demonstrates a bootstrap to estimate the median income's standard error and bias.  It generates 1000 bootstrap samples, calculates the median of each, and then computes the mean and standard deviation of those medians to estimate bias and standard error.


**Key Points:**

*   The bootstrap doesn't require assumptions about data distribution (unlike many traditional statistical methods).
*   It's computationally intensive, especially with a large number of resamples (R).
*   A larger R value leads to more accurate estimates.
*   The bootstrap doesn't create new data; it uses existing data to understand its inherent variability.


**Warning:**  The bootstrap doesn't compensate for small sample sizes.  It only reflects the variability inherent within the sample.


## Confidence Intervals

**Concept:** Confidence intervals provide a range of values likely containing the true population parameter (e.g., mean, median) with a specified level of confidence (e.g., 95%).


**Bootstrap Confidence Interval Algorithm (Illustrative):**

The algorithm is conceptually similar to the bootstrap itself but is focused on generating the confidence interval:

1.  Draw numerous bootstrap samples with replacement.
2.  Calculate the statistic (e.g., mean) for each sample.
3.  Order the resulting statistics.
4.  Trim a percentage of values from the lower and upper ends (the percentage depends on the desired confidence level).  For example, for a 90% confidence interval, trim 5% from each end.
5.  The trim points (the smallest and largest of the remaining values) are the endpoints of the confidence interval.
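
One way to sketch this algorithm is as a small helper. The function name `bootstrap_ci`, its parameters, and the income values are hypothetical; trimming from each end is implemented as a quantile lookup.

```python
import numpy as np

def bootstrap_ci(data, stat=np.mean, level=0.90, R=1000, seed=0):
    """Percentile bootstrap interval for `stat` at the given confidence level."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    # Steps 1-2: resample with replacement and compute the statistic each time
    estimates = np.array([stat(rng.choice(data, size=len(data))) for _ in range(R)])
    # Steps 3-5: trimming (1 - level) / 2 from each end is a quantile lookup
    alpha = (1 - level) / 2
    return np.quantile(estimates, [alpha, 1 - alpha])

incomes = [60000, 65000, 70000, 55000, 62000, 75000, 68000, 58000, 63000, 61000]
low90, high90 = bootstrap_ci(incomes, level=0.90)
low99, high99 = bootstrap_ci(incomes, level=0.99)
print(f"90% CI: ({low90:.0f}, {high90:.0f})")
print(f"99% CI: ({low99:.0f}, {high99:.0f})")
```

Both calls use the same seed, so they trim the same set of resampled means, and the 99% interval is at least as wide as the 90% interval.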


**Key Points:**

*   Higher confidence levels lead to wider intervals (more certainty).
*   Smaller sample sizes lead to wider intervals (more uncertainty).
*   Data scientists often use confidence intervals to communicate uncertainty around estimates.


## Normal Distribution

**Concept:** The normal distribution is a bell-shaped probability distribution. Many statistical methods rely on the assumption of normality (though this is increasingly less common with the rise of resampling methods).


**Key Terms:**

*   **Standardization (z-score):** Transforming data by subtracting the mean and dividing by the standard deviation.  This puts the data on a standard normal distribution scale (mean=0, standard deviation=1).
*   **QQ-Plot:** A graphical method for assessing whether sample data follows a specific distribution (often the normal distribution).  Points that fall close to the diagonal line indicate a good fit.


**Python for QQ-Plot:**

```python
import matplotlib.pyplot as plt
from scipy import stats

# Generate a sample from a normal distribution
norm_sample = stats.norm.rvs(size=100) 

# Create QQ-Plot
fig, ax = plt.subplots(figsize=(4, 4))
stats.probplot(norm_sample, plot=ax)
plt.show()

```

**Key Points:**

*   The assumption of normality is less critical thanks to methods like the bootstrap.
*   Standardization (z-scores) doesn't make data normally distributed; it transforms it to a standard scale for comparison.



========================================================================

# Structured Summary of Technical Book Excerpt

## Long-Tailed Distributions

### Key Terms

- **Tail**: The long narrow portion of a frequency distribution, where relatively extreme values occur at low frequency.
- **Skew**: Where one tail of a distribution is longer than the other.

### Concepts

Despite the importance of the normal distribution historically in statistics, data is generally not normally distributed. Data can be highly skewed or discrete, and both symmetric and asymmetric distributions may have long tails. The tails of a distribution correspond to the extreme values (small and large). Long tails are widely recognized in practical work, and Nassim Taleb's black swan theory predicts that anomalous events, such as a stock market crash, are much more likely to occur than would be predicted by the normal distribution.

### Example: Stock Returns

Stock returns often exhibit long-tailed distributions. Figure 2-12 shows the QQ-Plot for the daily stock returns for Netflix (NFLX). The corresponding Python code is:

```python
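import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

nflx = sp500_px.NFLX  # sp500_px: DataFrame of daily stock prices, assumed already loaded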
nflx = np.diff(np.log(nflx[nflx>0]))
fig, ax = plt.subplots(figsize=(4, 4))
stats.probplot(nflx, plot=ax)
```

### Key Ideas

- Most data is not normally distributed.
- Assuming a normal distribution can lead to underestimation of extreme events ("black swans").
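
To get a feel for how much a normal model can understate extreme events, the sketch below compares the chance of a value more than four standard deviations from the center under a normal distribution and under a long-tailed alternative. Using a t-distribution with 3 degrees of freedom, rescaled to unit standard deviation, as the long-tailed model is an assumption made only for this example.

```python
import numpy as np
from scipy import stats

df = 3
# Rescale so the t-distribution has standard deviation 1, like the standard normal
long_tailed = stats.t(df, scale=np.sqrt((df - 2) / df))

# Two-sided probability of landing more than 4 standard deviations from the center
p_normal = 2 * stats.norm.sf(4)
p_long_tail = 2 * long_tailed.sf(4)

print(f"normal:      {p_normal:.2e}")
print(f"long-tailed: {p_long_tail:.2e}")
print(f"ratio:       {p_long_tail / p_normal:.0f}x")
```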

### Further Reading

- **The Black Swan, 2nd ed.**, by Nassim Nicholas Taleb (Random House, 2010)
- **Handbook of Statistical Distributions with Applications, 2nd ed.**, by K. Krishnamoorthy (Chapman & Hall/CRC Press, 2016)

## Student’s t-Distribution

### Key Terms

- **n**: Sample size.
- **Degrees of freedom**: A parameter that allows the t-distribution to adjust to different sample sizes, statistics, and numbers of groups.

### Concepts

The t-distribution is a normally shaped distribution with thicker tails. It is used extensively in depicting distributions of sample statistics. The t-distribution is often called Student’s t because it was published in 1908 in Biometrika by W. S. Gosset under the name "Student."

### Example: Gosset's Experiment

Gosset wanted to answer the question “What is the sampling distribution of the mean of a sample, drawn from a larger population?” He started with a resampling experiment and derived a function now known as Student’s t.

### Key Ideas

- The t-distribution is actually a family of distributions resembling the normal distribution but with thicker tails.
- The t-distribution is widely used as a reference basis for the distribution of sample means, differences between two sample means, regression parameters, and more.
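
The thicker tails show up in the critical values. This sketch compares the 97.5th percentile of several t-distributions with that of the standard normal; the degrees of freedom shown are arbitrary.

```python
from scipy import stats

# 97.5th percentile (two-sided 95% critical value) for several degrees of freedom
z_crit = stats.norm.ppf(0.975)
t_crit = {df: stats.t.ppf(0.975, df) for df in (2, 5, 30, 1000)}

for df, value in t_crit.items():
    print(f"df={df:>4}: t = {value:.3f}")
print(f"normal : z = {z_crit:.3f}")
```

The t critical values shrink toward the normal value as the degrees of freedom grow.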

### Further Reading

- The original W.S. Gosset paper as published in Biometrika in 1908 is available as a PDF.
- A standard treatment of the t-distribution can be found in David Lane’s online resource.

## Binomial Distribution

### Key Terms

- **Trial**: An event with a discrete outcome (e.g., a coin flip).
- **Success**: The outcome of interest for a trial.
- **Binomial**: Having two outcomes.
- **Binomial trial**: A trial with two outcomes.
- **Binomial distribution**: Distribution of the number of successes in n trials.

### Concepts

The binomial distribution is the frequency distribution of the number of successes (x) in a given number of trials (n) with specified probability (p) of success in each trial.

### Example: Binomial Probabilities

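The scipy functions `stats.binom.pmf` and `stats.binom.cdf` calculate binomial probabilities (they correspond to the R functions `dbinom` and `pbinom`). For example:

```python
from scipy import stats

stats.binom.pmf(2, n=5, p=0.1)  # probability of exactly 2 successes in 5 trials
stats.binom.cdf(2, n=5, p=0.1)  # probability of 2 or fewer successes in 5 trials
```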

### Key Ideas

- Binomial outcomes are important to model, since they represent, among other things, fundamental decisions (buy or don’t buy, click or don’t click, survive or die, etc.).
- A binomial trial is an experiment with two possible outcomes: one with probability p and the other with probability 1 – p.
- With large n, and provided p is not too close to 0 or 1, the binomial distribution can be approximated by the normal distribution.
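
The normal approximation in the last point can be checked for one case. The values n = 200, p = 0.3 and the cutoff of 70 successes are arbitrary, and the 0.5 continuity correction is a standard refinement that the text above does not mention.

```python
import numpy as np
from scipy import stats

n, p = 200, 0.3
mean, sd = n * p, np.sqrt(n * p * (1 - p))

# Exact binomial probability of 70 or fewer successes
exact = stats.binom.cdf(70, n=n, p=p)
# Normal approximation with a continuity correction of 0.5
approx = stats.norm.cdf(70.5, loc=mean, scale=sd)

print(f"exact: {exact:.4f}, normal approximation: {approx:.4f}")
```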

### Further Reading

- Read about the “quincunx”, a pinball-like simulation device for illustrating the binomial distribution.
- The binomial distribution is a staple of introductory statistics, and all introductory statistics texts will have a chapter or two on it.

## Chi-Square Distribution

### Key Terms

- **Chi-Square Statistic**: A measure of the extent to which a set of observed values "fits" a specified distribution.

### Concepts

The chi-square distribution is typically concerned with counts of subjects or items falling into categories. The chi-square statistic measures the extent of departure from what you would expect in a null model.
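
As a sketch of how the statistic is computed, the example below uses made-up counts from 120 rolls of a die and a null model in which all six faces are equally likely. It computes the statistic by hand and with `scipy.stats.chisquare`.

```python
import numpy as np
from scipy import stats

# Hypothetical counts: a six-sided die rolled 120 times
observed = np.array([25, 17, 15, 23, 24, 16])
expected = np.full(6, observed.sum() / 6)  # null model: all faces equally likely

# Chi-square statistic: sum of (observed - expected)^2 / expected
chi2_manual = ((observed - expected) ** 2 / expected).sum()
chi2, p_value = stats.chisquare(observed, f_exp=expected)

print(f"chi-square = {chi2:.2f} (manual: {chi2_manual:.2f}), p-value = {p_value:.3f}")
```

Here the statistic is 5.0 on 5 degrees of freedom, a departure from the null model that is well within what chance alone would produce.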

### Key Ideas

- The chi-square distribution is typically concerned with counts of subjects or items falling into categories.
- The chi-square statistic measures the extent of departure from what you would expect in a null model.

### Further Reading

- The chi-square distribution owes its place in modern statistics to the great statistician Karl Pearson and the birth of hypothesis testing.
- For a more detailed exposition, see the section in this book on the chi-square test.

## F-Distribution

### Key Terms

- **F-Statistic**: The ratio of the variability among the group means to the variability within each group.

### Concepts

The F-distribution is used with experiments and linear models involving measured data. The F-statistic compares variation due to factors of interest to overall variation.
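
A minimal sketch of the F-statistic for three groups, computed by hand as between-group variability over within-group variability and checked against `scipy.stats.f_oneway`. The measurements are invented for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements for three groups
groups = [
    np.array([164, 172, 177, 156, 195], dtype=float),
    np.array([178, 191, 182, 185, 177], dtype=float),
    np.array([175, 193, 171, 163, 176], dtype=float),
]
k = len(groups)
n_total = sum(len(g) for g in groups)
grand_mean = np.concatenate(groups).mean()

# Variability among the group means (between) and inside each group (within)
ms_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups) / (k - 1)
ms_within = sum(((g - g.mean()) ** 2).sum() for g in groups) / (n_total - k)
f_manual = ms_between / ms_within

f_stat, p_value = stats.f_oneway(*groups)
print(f"F = {f_stat:.3f} (manual: {f_manual:.3f}), p-value = {p_value:.3f}")
```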

### Key Ideas

- The F-distribution is used with experiments and linear models involving measured data.
- The F-statistic compares variation due to factors of interest to overall variation.

### Further Reading

- George Cobb’s Introduction to Design and Analysis of Experiments (Wiley, 2008) contains an excellent exposition of the decomposition of variance components, which helps in understanding ANOVA and the F-statistic.

## Poisson and Related Distributions

### Key Terms

- **Lambda**: The rate (per unit of time or space) at which events occur.
- **Poisson Distribution**: The frequency distribution of the number of events in sampled units of time or space.
- **Exponential Distribution**: The frequency distribution of the time or distance from one event to the next event.
- **Weibull Distribution**: A generalized version of the exponential distribution in which the event rate is allowed to shift over time.

### Concepts

#### Poisson Distributions

The Poisson distribution tells us the distribution of events per unit of time or space when we sample many such units.

#### Example: Generating Random Numbers

The scipy function `stats.poisson.rvs` generates random numbers from a Poisson distribution:

```python
from scipy import stats
stats.poisson.rvs(2, size=100)  # 100 draws with lambda = 2 events per unit
```

#### Exponential Distribution

The exponential distribution models the distribution of the time between events.

#### Example: Generating Random Numbers

The scipy function `stats.expon.rvs` generates random numbers from an exponential distribution:

```python
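from scipy import stats
stats.expon.rvs(scale=1/0.2, size=100)  # rate 0.2 per period; scale = 1 / rate (the first positional argument is loc)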
```

#### Weibull Distribution

The Weibull distribution is used when the event rate changes over time.

#### Example: Generating Random Numbers

The scipy function `stats.weibull_min.rvs` generates random numbers from a Weibull distribution:

```python
from scipy import stats
stats.weibull_min.rvs(1.5, scale=5000, size=100)  # shape 1.5, characteristic life 5000
```

### Key Ideas

- For events that occur at a constant rate, the number of events per unit of time or space can be modeled as a Poisson distribution.
- You can also model the time or distance between one event and the next as an exponential distribution.
- A changing event rate over time (e.g., an increasing probability of device failure) can be modeled with the Weibull distribution.
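
The link between the first two points can be sketched by simulation: generate exponential gaps between events at a constant rate, then count how many events land in each unit-length period. The rate of 2 events per period and the number of periods are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(3)
lam = 2.0            # assumed event rate: 2 events per unit of time
n_periods = 10_000

# Exponential gaps between events have mean 1 / lambda
gaps = rng.exponential(scale=1 / lam, size=int(lam * n_periods * 1.2))
event_times = np.cumsum(gaps)
event_times = event_times[event_times < n_periods]

# Count events in each unit-length period; the counts should look Poisson(lambda)
counts = np.bincount(event_times.astype(int), minlength=n_periods)

print(f"mean gap: {gaps.mean():.3f} (expected {1 / lam})")
print(f"mean count: {counts.mean():.3f}, variance: {counts.var():.3f} (both near {lam})")
```

A mean and variance that are both close to lambda are the signature of a Poisson distribution.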

### Further Reading

- Modern Engineering Statistics by Thomas Ryan (Wiley, 2007) has a chapter devoted to the probability distributions used in engineering applications.
- Read an engineering-based perspective on the use of the Weibull distribution.

## Summary

In the era of big data, the principles of random sampling remain important when accurate estimates are needed. Random selection of data can reduce bias and yield a higher quality data set than would result from just using the conveniently available data. Knowledge of various sampling and data-generating distributions allows us to quantify potential errors in an estimate that might be due to random variation. At the same time, the bootstrap (sampling with replacement from an observed data set) is an attractive “one size fits all” method to determine possible error in sample estimates.

### Tip

The bell curve is iconic but perhaps overrated. George W. Cobb, the Mount Holyoke statistician, argued in a November 2015 editorial in the American Statistician that the “standard introductory course, which puts the normal distribution at its center, had outlived the usefulness of its centrality.”



summary 2
## Long-Tailed Distributions

**Key Concepts:**

*   **Long-tailed distributions:**  Data distributions where extreme values occur more frequently than predicted by a normal distribution. This means there's a higher chance of observing unusual events (e.g., a stock market crash).
*   **Skew:**  One tail of the distribution is longer than the other, indicating asymmetry.
*   **Black Swan Theory:** Predicts the likelihood of unexpected, high-impact events is much higher than the normal distribution suggests.


**Example: Netflix Stock Returns**

The following Python code generates a QQ-plot to visualize whether Netflix stock returns follow a normal distribution.  A QQ-plot compares the quantiles of your data to the quantiles of a theoretical normal distribution. Points significantly deviating from the diagonal line indicate non-normality.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
# Assuming 'sp500_px' is a pandas DataFrame containing stock prices
nflx = sp500_px.NFLX  # Extract Netflix stock prices
nflx = np.diff(np.log(nflx[nflx>0])) # Calculate log returns (adjusting for positive values)

fig, ax = plt.subplots(figsize=(4, 4))
stats.probplot(nflx, plot=ax)
plt.show()
```

This code first extracts Netflix stock prices and then calculates the log returns (a common way to represent stock price changes) before plotting the data on a QQ-plot.  If the points deviate substantially from the straight reference line (which `stats.probplot` draws automatically when `plot=ax` is passed), the returns are not normally distributed.


**Note:** Fitting distributions to data can be subjective, requiring both statistical and domain knowledge.


## Student's t-Distribution

**Key Concepts:**

*   **t-distribution:**  Similar to the normal distribution but with thicker tails, making it more robust to outliers.  It's used for analyzing sample statistics, especially when the sample size is small.
*   **Degrees of freedom:**  A parameter that adjusts the t-distribution based on sample size.  Larger samples lead to t-distributions closer to the normal distribution.
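
To see how the degrees of freedom shape the distribution, the sketch below compares the value that leaves 5% in the upper tail for several t-distributions against the same value for the standard normal. The specific degrees of freedom are arbitrary illustrative choices.

```python
from scipy import stats

# Value that leaves 5% of the standard normal in the upper tail
z_crit = stats.norm.ppf(0.95)

# Same cut-off for t-distributions with increasing degrees of freedom
t_crits = {df: stats.t.ppf(0.95, df=df) for df in (2, 5, 30, 1000)}

for df, t_crit in t_crits.items():
    print(f"df={df:>4}: t critical value = {t_crit:.4f}")
print(f"normal : z critical value = {z_crit:.4f}")
```

The critical value shrinks toward the normal one as the degrees of freedom grow, which is the "thicker tails for small samples" behavior described above.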


**Confidence Intervals:**

A 90% confidence interval around a sample mean (x̄) is calculated as:

`x̄ ± t_(n-1)(0.05) * s/√n`

where:

*   `t_(n-1)(0.05)` is the critical t-value with (n-1) degrees of freedom that cuts off 5% of the distribution in each tail, leaving 90% in the middle (hence the 90% confidence level).
*   `s` is the sample standard deviation.
*   `n` is the sample size.
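
The formula can be sketched in Python as follows. The sample values are hypothetical; `stats.t.ppf(0.95, ...)` supplies the critical value that leaves 5% in the upper tail, and `stats.t.interval` is used only as a cross-check.

```python
import numpy as np
from scipy import stats

sample = np.array([12.1, 9.8, 11.4, 10.9, 13.2, 10.1, 11.7, 12.5])  # hypothetical data
n = len(sample)
x_bar = sample.mean()
s = sample.std(ddof=1)  # sample standard deviation

# Critical t-value with n - 1 degrees of freedom, 5% in each tail
t_crit = stats.t.ppf(0.95, df=n - 1)
margin = t_crit * s / np.sqrt(n)
ci = (x_bar - margin, x_bar + margin)

# Cross-check against scipy's built-in interval
ci_scipy = stats.t.interval(0.90, df=n - 1, loc=x_bar, scale=stats.sem(sample))

print(f"90% CI (by formula): ({ci[0]:.3f}, {ci[1]:.3f})")
print(f"90% CI (scipy):      ({ci_scipy[0]:.3f}, {ci_scipy[1]:.3f})")
```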


**Note:** While used in classical statistics, the t-distribution's importance is reduced in data science due to the prevalence of bootstrapping for uncertainty quantification.


## Binomial Distribution

**Key Concepts:**

*   **Binomial Distribution:** Models the probability of getting a certain number of "successes" in a fixed number of independent trials, each with two possible outcomes (success/failure).  Examples include coin flips, website clicks converting into sales, etc.
*   **Bernoulli trial:**  A single trial with two possible outcomes.


**Python Code:**

The `scipy.stats` module provides functions for working with binomial distributions:

```python
from scipy import stats

# Probability of exactly 2 successes in 5 trials with probability of success = 0.1
prob_exact = stats.binom.pmf(2, n=5, p=0.1) 

# Probability of 2 or fewer successes in 5 trials with probability of success = 0.1
prob_cumulative = stats.binom.cdf(2, n=5, p=0.1)

print(f"Probability of exactly 2 successes: {prob_exact}")
print(f"Probability of 2 or fewer successes: {prob_cumulative}")
```

`pmf` calculates the probability mass function (probability of exactly k successes), and `cdf` calculates the cumulative distribution function (probability of k or fewer successes).

**Approximation:**  For large `n` and `p` not too close to 0 or 1, the binomial distribution can be approximated by a normal distribution.
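
A quick numerical check of this approximation, under assumed values `n = 200`, `p = 0.3`, and `k = 65` (all hypothetical), compares the exact binomial CDF with a normal CDF that has the same mean and standard deviation. The `+ 0.5` is the usual continuity correction.

```python
import numpy as np
from scipy import stats

n, p = 200, 0.3  # hypothetical number of trials and success probability
k = 65           # hypothetical threshold: P(at most k successes)

exact = stats.binom.cdf(k, n=n, p=p)

# Normal distribution matching the binomial mean and standard deviation
mu = n * p
sigma = np.sqrt(n * p * (1 - p))
approx = stats.norm.cdf(k + 0.5, loc=mu, scale=sigma)  # continuity correction

print(f"exact binomial: {exact:.4f}")
print(f"normal approx.: {approx:.4f}")
```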


## Chi-Square Distribution

**Key Concepts:**

*   **Chi-square distribution:** The reference distribution for the chi-square statistic, which measures how far observed counts in categorical data depart from the counts expected under the null hypothesis.  It's used for hypothesis testing (e.g., testing independence between variables).
*   **Null hypothesis:** The assumption that there's no significant difference or relationship between variables.


A high chi-square value indicates a significant departure from the expected values, suggesting the null hypothesis might be false.  Degrees of freedom are important in determining the appropriate chi-square distribution to compare results against.
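
As an illustrative sketch, the chi-square statistic can be computed by hand and compared against the chi-square distribution. The counts used here are the ecommerce conversion counts that appear in the A/B testing notes (Price A versus Price B); the variable names are assumptions for this example.

```python
import numpy as np
from scipy import stats

# Rows: conversion / no conversion; columns: Price A / Price B
observed = np.array([[200, 182],
                     [23539, 22406]])

# Expected counts if outcome and price were independent
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

chi2_stat = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_value = stats.chi2.sf(chi2_stat, df=dof)  # upper-tail probability

print(f"chi-square = {chi2_stat:.4f}, d.f. = {dof}, p-value = {p_value:.4f}")
```

`stats.chi2_contingency(observed, correction=False)` should return the same statistic and p-value; by default it applies a continuity correction to 2x2 tables, which changes the numbers slightly.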


## F-Distribution

**Key Concepts:**

*   **F-distribution:** Used to compare the variances of two or more groups.  Commonly used in ANOVA (Analysis of Variance).
*   **ANOVA:**  A statistical test that determines if there are significant differences between the means of three or more groups.


The F-statistic is the ratio of the variance between groups to the variance within groups. A large F-statistic suggests significant differences between group means. Python's statistical packages (like `statsmodels`) automatically calculate F-statistics as part of ANOVA and regression analysis.
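
The ratio described above can be sketched directly. The three groups below are hypothetical data; the hand-computed statistic is cross-checked against `scipy.stats.f_oneway`, which performs a one-way ANOVA.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements for three groups
groups = [
    np.array([164, 172, 177, 156, 195]),
    np.array([178, 191, 182, 185, 177]),
    np.array([175, 193, 171, 163, 176]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)            # number of groups
n_total = len(all_values)  # total number of observations

# Variance between groups and variance within groups
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
f_stat = (ss_between / (k - 1)) / (ss_within / (n_total - k))
p_value = stats.f.sf(f_stat, k - 1, n_total - k)

f_scipy, p_scipy = stats.f_oneway(*groups)
print(f"F (by hand) = {f_stat:.4f}, p = {p_value:.4f}")
print(f"F (scipy)   = {f_scipy:.4f}, p = {p_scipy:.4f}")
```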



## Poisson and Related Distributions

**Key Concepts:**

*   **Poisson Distribution:** Models the probability of a given number of events occurring in a fixed interval of time or space if these events occur with a known average rate and independently of the time since the last event.
*   **Exponential Distribution:** Models the time between events in a Poisson process.
*   **Weibull Distribution:**  A generalization of the exponential distribution, allowing for a changing event rate over time (useful for modeling things like equipment failures).


**Python Code (Poisson):**

```python
from scipy import stats

# Generate 100 random numbers from a Poisson distribution with lambda = 2
poisson_numbers = stats.poisson.rvs(2, size=100)
print(poisson_numbers)
```

**Python Code (Exponential):**

```python
from scipy import stats

# Generate 100 random numbers from an exponential distribution with rate = 0.2
exponential_numbers = stats.expon.rvs(scale=1/0.2, size=100) #Scale parameter is 1/rate.
print(exponential_numbers)

```

**Python Code (Weibull):**

```python
from scipy import stats

# Generate 100 random numbers from a Weibull distribution with shape=1.5 and scale=5000
weibull_numbers = stats.weibull_min.rvs(1.5, scale=5000, size=100)
print(weibull_numbers)

```

Estimating the failure rate for rare events often involves simulations or goodness-of-fit tests.  The Weibull distribution handles situations where the event rate isn't constant.

## Summary

This summary covers the key concepts and Python code snippets for various statistical distributions frequently encountered in data analysis.  Remember that appropriate distribution selection depends on the nature of your data and the research question.  Bootstrapping offers a robust, data-driven alternative to many classical statistical inference methods.



=====================================================================================================

# Chapter 3. Statistical Experiments and Significance Testing

## Design of Experiments

### Overview
Design of experiments is crucial in statistics, helping to confirm or reject hypotheses. It's particularly important in data science for continual experiments in user interface and product marketing. This chapter covers traditional experimental design, common challenges, and key concepts in statistical inference.

### Statistical Inference Pipeline
The classical statistical inference pipeline involves:
1. **Hypothesis**: Start with a hypothesis (e.g., "drug A is better than the existing standard drug").
2. **Experiment Design**: Design an experiment to test the hypothesis.
3. **Data Collection and Analysis**: Collect and analyze data.
4. **Conclusion**: Draw a conclusion.

The goal is to apply experiment results to a larger process or population.

## A/B Testing

### Definition
A/B testing compares two groups to determine which treatment, product, or procedure is superior. One group is the control (standard or no treatment), and the other is the treatment group.

### Key Terms
- **Treatment**: Something exposed to subjects (e.g., drug, price, web headline).
- **Treatment Group**: Group exposed to a specific treatment.
- **Control Group**: Group exposed to no or standard treatment.
- **Randomization**: Randomly assigning subjects to treatments.
- **Subjects**: Items exposed to treatments (e.g., web visitors, patients).
- **Test Statistic**: Metric used to measure the effect of the treatment.

### Examples
- Testing soil treatments for seed germination.
- Testing therapies for cancer suppression.
- Testing prices for net profit.
- Testing web headlines for clicks.
- Testing web ads for conversions.

### Randomization
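Randomization ensures that any difference between treatment groups is due either to the treatment itself or to the luck of the draw in assigning subjects, and not to some systematic difference between the groups.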
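
A minimal sketch of random assignment, assuming 100 hypothetical subject IDs and an even split between treatment and control:

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # seed only for reproducibility
subject_ids = np.arange(100)          # hypothetical subject identifiers

# Shuffle once, then split: every subject is equally likely to land in either group
shuffled = rng.permutation(subject_ids)
treatment_ids = shuffled[:50]
control_ids = shuffled[50:]

print("first treatment subjects:", treatment_ids[:5])
print("first control subjects:  ", control_ids[:5])
```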

### Test Statistic
The test statistic is crucial for comparing groups. For binary variables (e.g., click/no-click), results are summed up in a 2x2 table.

#### Example: 2x2 Table
| Outcome       | Price A | Price B |
|---------------|---------|---------|
| Conversion    | 200     | 182     |
| No conversion | 23,539  | 22,406  |

For continuous variables (e.g., purchase amount), results might be displayed differently:
- **Revenue/page view with price A**: mean = 3.87, SD = 51.10
- **Revenue/page view with price B**: mean = 4.11, SD = 62.98
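
For the binary case, a natural test statistic is the difference in conversion rates. The sketch below computes it from the 2x2 table above; the DataFrame layout is just one convenient way to hold the counts.

```python
import pandas as pd

table = pd.DataFrame(
    {'Price A': [200, 23539], 'Price B': [182, 22406]},
    index=['Conversion', 'No conversion'],
)

# Conversion rate per price, in percent
rates = 100 * table.loc['Conversion'] / table.sum()
diff = rates['Price A'] - rates['Price B']

print(rates.round(4))
print(f"Difference (A - B): {diff:.4f} percentage points")
```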

### Warning
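Default statistical software output is not always useful. Here, the standard deviations are more than ten times the means, which on its face suggests many negative values, yet negative revenue is not feasible. The mean absolute deviation from the mean is a more reasonable summary of variability for data like this.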

### Control Group Importance
A control group ensures that all other conditions are equal, isolating the effect of the treatment.

### Blinding in Studies
- **Blind Study**: Subjects are unaware of the treatment.
- **Double-Blind Study**: Both subjects and investigators are unaware of the treatment.

### A/B Testing in Data Science
A/B testing in data science typically involves web contexts, such as web page design, product price, or headline wording. Randomization and a single predetermined metric are crucial.

### Multi-Arm Bandit Algorithm
For questions like "Which, out of multiple possible prices, is best?", the multi-arm bandit algorithm is used instead of traditional A/B tests.
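
One simple bandit strategy is epsilon-greedy: most of the time show the option with the best observed rate, and occasionally pick one at random to keep learning. The sketch below is an illustration of that strategy, not a reference implementation; the conversion rates, `epsilon`, and the number of rounds are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=7)
true_rates = np.array([0.02, 0.05, 0.10])  # hypothetical rates, unknown to the algorithm
epsilon = 0.1                              # fraction of rounds spent exploring
n_rounds = 5000

pulls = np.zeros(len(true_rates), dtype=int)  # times each option was shown
wins = np.zeros(len(true_rates), dtype=int)   # conversions per option

for _ in range(n_rounds):
    if rng.random() < epsilon or pulls.min() == 0:
        arm = int(rng.integers(len(true_rates)))  # explore: pick an option at random
    else:
        arm = int(np.argmax(wins / pulls))        # exploit: best observed rate so far
    pulls[arm] += 1
    wins[arm] += int(rng.random() < true_rates[arm])

print("times shown:   ", pulls)
print("observed rates:", np.round(wins / pulls, 4))
```

Unlike a fixed A/B split, the allocation shifts during the experiment toward whichever option appears to be performing best.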

### Getting Permission
In scientific and medical research, permission from subjects and institutional review boards is necessary. In business, this is less common but can be controversial, as seen in Facebook's 2014 experiment.

### Key Ideas
- Subjects are assigned to groups treated exactly alike, except for the treatment.
- Ideally, subjects are assigned randomly to the groups.

### Further Reading
- **Introductory Statistics and Analytics: A Resampling Perspective** by Peter Bruce.
- **Google Analytics help section on experiments**.

## Hypothesis Tests

### Definition
Hypothesis tests help determine if random chance might be responsible for an observed effect.

### Key Terms
- **Null Hypothesis**: Chance is to blame.
- **Alternative Hypothesis**: Counterpoint to the null.
- **One-Way Test**: Counts chance results in one direction.
- **Two-Way Test**: Counts chance results in two directions.

### Purpose
Hypothesis tests protect researchers from being fooled by random chance. They involve a null hypothesis (chance is to blame) and an alternative hypothesis (what you hope to prove).

### Misinterpreting Randomness
Humans tend to underestimate randomness. For example, when people are asked to invent a series of 50 coin flips, their sequences contain shorter runs of heads or tails than a series of 50 real flips does.
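
A small simulation makes the point about runs concrete. The helper `longest_run` is a hypothetical function written for this sketch; it measures the longest streak of identical outcomes, and the simulation repeats a 50-flip experiment many times.

```python
import numpy as np

def longest_run(flips):
    """Length of the longest streak of identical consecutive values."""
    best = current = 1
    for prev, curr in zip(flips[:-1], flips[1:]):
        current = current + 1 if curr == prev else 1
        best = max(best, current)
    return best

rng = np.random.default_rng(seed=3)
runs = [longest_run(rng.integers(0, 2, size=50)) for _ in range(10_000)]

print(f"average longest run in 50 flips: {np.mean(runs):.2f}")
print(f"share of series with a run of 5 or more: {np.mean(np.array(runs) >= 5):.2f}")
```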

### Null Hypothesis
The null hypothesis assumes treatments are equivalent, and any difference is due to chance. A resampling permutation procedure can test this hypothesis.

### Alternative Hypothesis
Examples:
- Null = "no difference between the means of group A and group B"; alternative = "A is different from B".
- Null = "A ≤ B"; alternative = "A > B".
- Null = "B is not X% greater than A"; alternative = "B is X% greater than A".

### One-Way vs. Two-Way Hypothesis Tests
- **One-Way Test**: Protects from being fooled by chance in one direction.
- **Two-Way Test**: Protects from being fooled by chance in either direction.
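
In resampling terms, the two kinds of test differ only in which resampled results count as "at least as extreme". The sketch below uses normally distributed numbers as a stand-in for a permutation distribution and a made-up observed difference; both are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=5)
perm_diffs = rng.normal(loc=0, scale=10, size=10_000)  # stand-in for permuted differences
observed = 15.0                                        # hypothetical observed difference

# One-way: only results at least as large in the hypothesized direction count
p_one_way = np.mean(perm_diffs >= observed)

# Two-way: results at least as large in either direction count
p_two_way = np.mean(np.abs(perm_diffs) >= abs(observed))

print(f"one-way p-value: {p_one_way:.4f}")
print(f"two-way p-value: {p_two_way:.4f}")
```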

### Key Ideas
- A null hypothesis assumes nothing special has happened, and any effect is due to random chance.
- The hypothesis test assumes the null hypothesis is true and tests whether the observed effect is a reasonable outcome of that model.

### Further Reading
- **The Drunkard’s Walk** by Leonard Mlodinow.
- **Statistics** by David Freedman, Robert Pisani, and Roger Purves.
- **Introductory Statistics and Analytics: A Resampling Perspective** by Peter Bruce.

## Resampling

### Definition
Resampling involves repeatedly sampling values from observed data to assess random variability in a statistic.

### Key Terms
- **Permutation Test**: Combining samples and randomly reallocating observations.
- **Resampling**: Drawing additional samples from an observed data set.
- **With or Without Replacement**: Whether an item is returned to the sample before the next draw.
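
The difference between the two sampling modes can be seen with NumPy's `Generator.choice`; the five data values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=11)
data = np.array([10, 20, 30, 40, 50])

# Without replacement: a reshuffle, each value appears exactly once
without = rng.choice(data, size=len(data), replace=False)

# With replacement: values may repeat or be missing (as in the bootstrap)
with_repl = rng.choice(data, size=len(data), replace=True)

print("without replacement:", without)
print("with replacement:   ", with_repl)
```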

### Permutation Test
1. Combine results from different groups into a single data set.
2. Shuffle the combined data and randomly draw resamples.
3. Calculate the test statistic for the resamples.
4. Repeat to yield a permutation distribution of the test statistic.
5. Compare the observed difference to the permuted differences.

### Example: Web Stickiness
A company tests which web presentation leads to better sales using a proxy variable (session time).

#### Tip
A proxy variable stands in for the true variable of interest. It's useful to have data on the true variable to assess the strength of its association with the proxy.

#### Code Snippets
```python
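import random
import numpy as np
import matplotlib.pyplot as plt

# Assumes session_times is a DataFrame with 'Page' and 'Time' columns
# and a default 0..n-1 index

# Boxplot using pandas
ax = session_times.boxplot(by='Page', column='Time')
ax.set_xlabel('')
ax.set_ylabel('Time (in seconds)')
plt.suptitle('')

# Mean calculation
mean_a = session_times[session_times.Page == 'Page A'].Time.mean()
mean_b = session_times[session_times.Page == 'Page B'].Time.mean()
print(mean_b - mean_a)

# Group sizes
nA = int((session_times.Page == 'Page A').sum())
nB = int((session_times.Page == 'Page B').sum())

# Permutation function: randomly split the pooled values into groups of
# size nB and nA and return the difference in means
def perm_fun(x, nA, nB):
    n = nA + nB
    idx_B = random.sample(range(n), nB)
    idx_A = list(set(range(n)) - set(idx_B))  # lists: recent pandas rejects set indexers
    return x.loc[idx_B].mean() - x.loc[idx_A].mean()

# Permutation test
perm_diffs = [perm_fun(session_times.Time, nA, nB) for _ in range(1000)]

# Histogram
fig, ax = plt.subplots(figsize=(5, 5))
ax.hist(perm_diffs, bins=11, rwidth=0.9)
ax.axvline(x = mean_b - mean_a, color='black', lw=2)
ax.text(50, 190, 'Observed\ndifference', bbox={'facecolor':'white'})
ax.set_xlabel('Session time differences (in seconds)')
ax.set_ylabel('Frequency')

# Proportion of permuted differences that exceed the observed difference
np.mean(np.array(perm_diffs) > mean_b - mean_a)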
```

### Exhaustive and Bootstrap Permutation Tests
- **Exhaustive Permutation Test**: Figures out all possible ways data could be divided.
- **Bootstrap Permutation Test**: Draws are made with replacement.
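
For very small samples, the exhaustive version is easy to sketch with `itertools.combinations`. The two groups below are hypothetical; with 7 values and 3 assigned to group B there are only 35 possible splits, so all of them can be enumerated.

```python
from itertools import combinations
import numpy as np

group_a = [12, 15, 11, 14]  # hypothetical observations
group_b = [18, 17, 20]
combined = np.array(group_a + group_b)
n, n_b = len(combined), len(group_b)
observed = np.mean(group_b) - np.mean(group_a)

# Enumerate every way of choosing which n_b values are labeled "B"
diffs = []
for idx_b in combinations(range(n), n_b):
    mask = np.zeros(n, dtype=bool)
    mask[list(idx_b)] = True
    diffs.append(combined[mask].mean() - combined[~mask].mean())

# Share of splits with a difference at least as large as observed
# (small tolerance guards against floating-point rounding)
p_value = np.mean(np.array(diffs) >= observed - 1e-12)

print(f"{len(diffs)} possible splits, exact one-way p-value = {p_value:.4f}")
```

Because every split is examined, the resulting p-value is exact instead of a simulation estimate; the cost is that the number of splits grows very quickly with sample size.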

### Key Ideas
- In a permutation test, multiple samples are combined and shuffled.
- The shuffled values are divided into resamples, and the statistic of interest is calculated.
- Comparing the observed value to the resampled distribution helps judge whether an observed difference might occur by chance.

### Further Reading
- **Randomization Tests** by Eugene Edgington and Patrick Onghena.
- **Introductory Statistics and Analytics: A Resampling Perspective** by Peter Bruce.

## Conclusion
This chapter covers the design of experiments, A/B testing, hypothesis tests, and resampling techniques, providing a comprehensive overview of statistical experiments and significance testing in data science.


summary 2

# Chapter 3: Statistical Experiments and Significance Testing

This chapter explores statistical experiments, focusing on A/B testing and hypothesis testing within the context of data science.  It emphasizes practical application over strict adherence to classical statistical inference, highlighting the importance of understanding the underlying concepts rather than rote application of formulas.


## A/B Testing

A/B testing compares two versions (A and B) of a treatment (e.g., website design, pricing strategy) to determine which performs better.  One version typically serves as a control. The goal is to identify the superior treatment based on a chosen metric.

**Key Terms:**

* **Treatment:** The thing being tested (e.g., a new website design).
* **Treatment group:** The group exposed to a specific treatment.
* **Control group:** The group exposed to the standard or no treatment.
* **Randomization:** Randomly assigning subjects to treatment groups.
* **Subjects:** The items being tested (e.g., website visitors).
* **Test statistic:** The metric used to compare treatments (e.g., click-through rate, conversion rate).


**Example:** Testing two web headlines to see which generates more clicks.


**Why a Control Group?** A control group ensures that any observed difference is due to the treatment, not other factors.  Comparing only to past data ignores potential confounding variables.


**Blinding:** In some experiments (not usually in data science A/B tests), blinding participants to the treatment they receive prevents bias from influencing their responses.


**Beyond A/B (Multi-Arm Bandits):** While A/B testing is common, data scientists often need to compare more than two options.  Multi-arm bandit algorithms are better suited for this.


**Ethical Considerations:**  Obtaining informed consent is crucial in research involving human subjects, but less strictly enforced in business contexts. However, ethical considerations remain paramount (e.g., Facebook's 2014 emotional tone experiment).


## Hypothesis Tests

Hypothesis tests determine whether an observed effect is likely due to chance or a real difference between treatments.


**Key Terms:**

* **Null hypothesis:**  Assumes there's no real difference between treatments; any observed difference is due to chance.
* **Alternative hypothesis:** The opposite of the null hypothesis; it's what you hope to prove.
* **One-way test:** Considers chance results only in one direction (e.g., treatment B is better than A).
* **Two-way test:** Considers chance results in both directions (e.g., treatment B is different from A).


**Misinterpreting Randomness:** People tend to underestimate randomness, seeing patterns where none exist. Hypothesis testing helps mitigate this bias.


**The Null Hypothesis:**  The starting assumption that there's no significant difference; the goal is to disprove this.


**Alternative Hypothesis:** The statement you are trying to support, for example "Treatment B is better than Treatment A".


**One-Way vs. Two-Way Tests:** One-way tests are directional (e.g., A > B), while two-way tests are non-directional (A ≠ B).  The choice depends on the research question.  In data science, the distinction is less crucial.


## Resampling

Resampling involves repeatedly sampling from observed data to assess variability.  It's a powerful tool for hypothesis testing, offering a more flexible approach compared to traditional methods.

**Key Terms:**

* **Permutation test (Randomization test):**  Combines samples, shuffles the data, and randomly reassigns it to groups.  This simulates the null hypothesis.
* **Resampling:** Drawing multiple samples from the observed data.
* **With or without replacement:** Whether data points are returned to the pool after sampling.


**Permutation Test (Python Example):**

```python
import pandas as pd
import numpy as np
import random
import matplotlib.pyplot as plt

# Placeholder data: the '...' below must be replaced with real session times
# (21 values for Page A followed by 15 for Page B) before this script will run
data = {'Page': ['Page A'] * 21 + ['Page B'] * 15,
        'Time': [100, 120, 110, ..., 150, 160, 140]}
session_times = pd.DataFrame(data)

nA = 21
nB = 15

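def perm_fun(x, nA, nB):
    # Randomly split the pooled values into groups of size nB and nA
    n = nA + nB
    idx_B = random.sample(range(n), nB)
    idx_A = list(set(range(n)) - set(idx_B))  # lists: recent pandas rejects set indexers
    return x.loc[idx_B].mean() - x.loc[idx_A].mean()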

perm_diffs = [perm_fun(session_times.Time, nA, nB) for _ in range(1000)]

mean_a = session_times[session_times.Page == 'Page A'].Time.mean()
mean_b = session_times[session_times.Page == 'Page B'].Time.mean()

fig, ax = plt.subplots(figsize=(5, 5))
ax.hist(perm_diffs, bins=11, rwidth=0.9)
ax.axvline(x = mean_b - mean_a, color='black', lw=2)
ax.text(50, 190, 'Observed\ndifference', bbox={'facecolor':'white'})
ax.set_xlabel('Session time differences (in seconds)')
ax.set_ylabel('Frequency')
plt.show()

p_value = np.mean(np.array(perm_diffs) > mean_b - mean_a)
print(f"P-value: {p_value}")

```

This code performs a permutation test to compare the means of two groups.  It shuffles the data, recalculates the difference in means many times, and then compares the observed difference to this distribution to assess statistical significance.  A low p-value (typically below 0.05) suggests the observed difference is unlikely due to chance.

**Exhaustive and Bootstrap Permutation Tests:** These are variations on the basic permutation test.  Exhaustive tests consider all possible permutations (computationally expensive for large datasets). Bootstrap tests sample *with* replacement, adding another layer of randomness.


**Permutation Tests in Data Science:** Permutation tests are valuable for their simplicity, flexibility, and avoidance of stringent assumptions (like normally distributed data). They provide an intuitive understanding of statistical significance.


**Further Reading:** The Google Analytics help section on experiments is a good resource for practical A/B testing.

=========================================================================================================================

## Statistical Significance and p-Values

### Introduction

Statistical significance measures whether an experiment's results are more extreme than what chance might produce. If the results are beyond chance variation, they are statistically significant.

### Key Terms

- **p-value**: The probability of obtaining results as unusual or extreme as observed, given a null hypothesis.
- **Alpha**: The probability threshold for "unusualness" that chance results must surpass for actual outcomes to be deemed statistically significant.
- **Type 1 error**: Mistakenly concluding an effect is real (when it is due to chance).
- **Type 2 error**: Mistakenly concluding an effect is due to chance (when it is real).

### Example: Ecommerce Experiment

#### Table 3-2: 2×2 Table for Ecommerce Experiment Results

| Outcome       | Price A | Price B  |
|---------------|---------|----------|
| Conversion    | 200     | 182      |
| No conversion | 23,539  | 22,406   |

Price A converts almost 5% better than Price B (0.8425% versus 0.8057%, a difference of 0.0368 percentage points). Although the experiment has over 45,000 data points, the conversion rates are below 1%, so the values that actually matter (the conversions) number only in the hundreds.

#### Permutation Procedure

1. **Put cards labeled 1 and 0 in a box**: Represents the shared conversion rate of 382 ones and 45,945 zeros = 0.8246%.
2. **Shuffle and draw out a resample of size 23,739 (same n as Price A)**, and record how many 1s.
3. **Record the number of 1s in the remaining 22,588 (same n as Price B)**.
4. **Record the difference in proportion of 1s**.
5. **Repeat steps 2–4**.
6. **How often was the difference >= 0.0368 percentage points?**

#### Python Code for Permutation Test

```python
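import pandas as pd
import matplotlib.pyplot as plt

# Reuses perm_fun as defined in the Resampling section
obs_pct_diff = 100 * (200 / 23739 - 182 / 22588)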
print(f'Observed difference: {obs_pct_diff:.4f}%')
conversion = [0] * 45945
conversion.extend([1] * 382)
conversion = pd.Series(conversion)

perm_diffs = [100 * perm_fun(conversion, 23739, 22588) for _ in range(1000)]

fig, ax = plt.subplots(figsize=(5, 5))
ax.hist(perm_diffs, bins=11, rwidth=0.9)
ax.axvline(x=obs_pct_diff, color='black', lw=2)
ax.text(0.06, 200, 'Observed\ndifference', bbox={'facecolor':'white'})
ax.set_xlabel('Conversion rate (percent)')
ax.set_ylabel('Frequency')
```

#### Histogram of Differences in Conversion Rates

The observed difference of 0.0368% is within the range of chance variation.

### p-Value

The p-value is the frequency with which the chance model produces a result more extreme than the observed result.

#### Python Code for p-Value Calculation

```python
np.mean([diff > obs_pct_diff for diff in perm_diffs])
```

The p-value is 0.308, meaning that we would expect a result as extreme as this, or more extreme, by random chance over 30% of the time.

#### Approximating the p-Value with a Normal Approximation

```python
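import numpy as np
from scipy import stats

# Rows: Price A, Price B; columns: conversions, non-conversions
survivors = np.array([[200, 23739 - 200], [182, 22588 - 182]])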
chi2, p_value, df, _ = stats.chi2_contingency(survivors)

print(f'p-value for single sided test: {p_value / 2:.4f}')
```

The normal approximation yields a p-value of 0.3498, close to the permutation test p-value.

### Alpha

Alpha is the threshold for "unusualness" in a null hypothesis chance model. Typical alpha levels are 5% and 1%.

### p-Value Controversy

The p-value has been a subject of controversy. The American Statistical Association issued a cautionary statement regarding its use:

1. P-values can indicate how incompatible the data are with a specified statistical model.
2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
4. Proper inference requires full reporting and transparency.
5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.

### Practical Significance

Even if a result is statistically significant, it does not mean it has practical significance. Large samples can make small, non-meaningful effects statistically significant.
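
The effect of sample size can be sketched with a pooled two-proportion z-test, a standard normal-approximation test used here purely for illustration. The two conversion rates (1.00% versus 1.01%) and the group sizes are hypothetical, and `two_proportion_p_value` is a helper defined only for this example.

```python
import numpy as np
from scipy import stats

def two_proportion_p_value(p_a, p_b, n_per_group):
    """Two-sided p-value from a pooled two-proportion z-test (equal group sizes)."""
    pooled = (p_a + p_b) / 2
    se = np.sqrt(2 * pooled * (1 - pooled) / n_per_group)
    z = (p_a - p_b) / se
    return 2 * stats.norm.sf(abs(z))

p_a, p_b = 0.0100, 0.0101  # hypothetical rates: a tiny difference

p_small_n = two_proportion_p_value(p_a, p_b, n_per_group=10_000)
p_large_n = two_proportion_p_value(p_a, p_b, n_per_group=100_000_000)

print(f"n = 10,000 per group:      p-value = {p_small_n:.4f}")
print(f"n = 100,000,000 per group: p-value = {p_large_n:.2e}")
```

The difference between the rates is identical in both cases; only the sample size changes whether it registers as statistically significant.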

### Type 1 and Type 2 Errors

- **Type 1 error**: Mistakenly concluding an effect is real, when it is due to chance.
- **Type 2 error**: Mistakenly concluding that an effect is not real (i.e., due to chance), when it actually is real.

### Data Science and p-Values

For data scientists, a p-value is a useful metric to determine if a model result is within the range of normal chance variability. It should not be considered controlling but merely another point of information.

### Key Ideas

- Significance tests determine whether an observed effect is within the range of chance variation for a null hypothesis model.
- The p-value is the probability that results as extreme as the observed results might occur, given a null hypothesis model.
- The alpha value is the threshold of "unusualness" in a null hypothesis chance model.
- Significance testing has been more relevant for formal reporting of research than for data science.

### Further Reading

- Stephen Stigler, “Fisher and the 5% Level,” Chance 21, no. 4 (2008): 12.
- See also “Hypothesis Tests” and the further reading mentioned there.

## t-Tests

### Introduction

t-Tests are common significance tests used for comparing means. They are based on Student’s t-distribution.

### Key Terms

- **Test statistic**: A metric for the difference or effect of interest.
- **t-statistic**: A standardized version of common test statistics such as means.
- **t-distribution**: A reference distribution to which the observed t-statistic can be compared.
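
To make "standardized" concrete, the sketch below computes the unequal-variance (Welch) t-statistic by hand: the difference in means divided by its estimated standard error. The two samples are hypothetical, and the result is cross-checked against `stats.ttest_ind` with `equal_var=False`.

```python
import numpy as np
from scipy import stats

# Hypothetical session times for two pages
a = np.array([21, 67, 171, 89, 45, 120, 98, 133])
b = np.array([253, 71, 147, 174, 368, 93, 225])

var_a, var_b = a.var(ddof=1), b.var(ddof=1)
se = np.sqrt(var_a / len(a) + var_b / len(b))  # standard error of the difference
t_stat = (a.mean() - b.mean()) / se

# Welch-Satterthwaite approximation to the degrees of freedom
df = se ** 4 / ((var_a / len(a)) ** 2 / (len(a) - 1)
                + (var_b / len(b)) ** 2 / (len(b) - 1))
p_two_sided = 2 * stats.t.sf(abs(t_stat), df=df)

res = stats.ttest_ind(a, b, equal_var=False)
print(f"by hand: t = {t_stat:.4f}, p = {p_two_sided:.4f}")
print(f"scipy:   t = {res.statistic:.4f}, p = {res.pvalue:.4f}")
```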

### Example: t-Test in Python

```python
res = stats.ttest_ind(session_times[session_times.Page == 'Page A'].Time,
                      session_times[session_times.Page == 'Page B'].Time,
                      equal_var=False)
print(f'p-value for single sided test: {res.pvalue / 2:.4f}')
```

The p-value of 0.1408 is fairly close to the permutation test p-values.

### Key Ideas

- Before the advent of computers, resampling tests were not practical, and statisticians used standard reference distributions.
- A test statistic could then be standardized and compared to the reference distribution.
- One such widely used standardized statistic is the t-statistic.

### Further Reading

- Any introductory statistics text will have illustrations of the t-statistic and its uses.
- For a treatment of both the t-test and resampling procedures in parallel, see Introductory Statistics and Analytics: A Resampling Perspective by Peter Bruce (Wiley, 2014) or Statistics: Unlocking the Power of Data, 2nd ed., by Robin Lock and four other Lock family members (Wiley, 2016).

## Multiple Testing

### Introduction

Multiple testing increases the risk of concluding that something is significant just by chance.

### Key Terms

- **Type 1 error**: Mistakenly concluding that an effect is statistically significant.
- **False discovery rate**: The rate of making a Type 1 error across multiple tests.
- **Alpha inflation**: The multiple testing phenomenon, in which alpha increases as you conduct more tests.
- **Adjustment of p-values**: Accounting for doing multiple tests on the same data.
- **Overfitting**: Fitting the noise.

### Example: Alpha Inflation

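Suppose 20 predictors with no real relationship to the outcome are each tested at alpha = 0.05. The probability that all 20 correctly test nonsignificant is 0.95^20 ≈ 0.36, so the probability that at least one predictor will (falsely) test significant is 1 − 0.36 = 0.64.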
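
The 0.64 figure corresponds to 20 independent tests, each run at alpha = 0.05. The short calculation below reproduces it and, as one illustrative adjustment, shows the effect of the Bonferroni correction (dividing alpha by the number of tests).

```python
alpha = 0.05
n_tests = 20

# Chance that at least one of 20 independent null tests comes out "significant"
p_at_least_one = 1 - (1 - alpha) ** n_tests

# Bonferroni adjustment: test each hypothesis at alpha / n_tests instead
bonferroni_alpha = alpha / n_tests
p_with_bonferroni = 1 - (1 - bonferroni_alpha) ** n_tests

print(f"unadjusted:      {p_at_least_one:.4f}")
print(f"with Bonferroni: {p_with_bonferroni:.4f}")
```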

### Key Ideas

- Multiplicity in a research study or data mining project increases the risk of concluding that something is significant just by chance.
- For situations involving multiple statistical comparisons, there are statistical adjustment procedures.
- In a data mining situation, use of a holdout sample with labeled outcome variables can help avoid misleading results.

### Further Reading

- For a short exposition of one procedure (Dunnett’s test) to adjust for multiple comparisons, see David Lane’s online statistics text.
- Megan Goldman offers a slightly longer treatment of the Bonferroni adjustment procedure.
- For an in-depth treatment of more flexible statistical procedures for adjusting p-values, see Resampling-Based Multiple Testing by Peter Westfall and Stanley Young (Wiley, 1993).
- For a discussion of data partitioning and the use of holdout samples in predictive modeling, see Chapter 2 of Data Mining for Business Analytics, by Galit Shmueli, Peter Bruce, Nitin Patel, Peter Gedeck, Inbal Yahav, and Kenneth Lichtendahl (Wiley, 2007–2020, with editions for R, Python, Excel, and JMP).

## Degrees of Freedom

### Introduction

Degrees of freedom refer to the number of values free to vary in a sample.

### Key Terms

- **n or sample size**: The number of observations in the data.
- **d.f.**: Degrees of freedom.

### Key Ideas

- The number of degrees of freedom forms part of the calculation to standardize test statistics so they can be compared to reference distributions.
- The concept of degrees of freedom lies behind the factoring of categorical variables into n – 1 indicator or dummy variables when doing a regression (to avoid multicollinearity).
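
Both ideas show up in everyday Python calls. The sketch below uses a tiny made-up sample: `ddof=1` divides the sum of squared deviations by n − 1 when estimating variance, and `drop_first=True` in `pandas.get_dummies` encodes a three-level categorical variable with two indicator columns.

```python
import numpy as np
import pandas as pd

sample = np.array([4.0, 7.0, 6.0, 3.0])  # hypothetical data

var_n = sample.var(ddof=0)          # divides by n
var_n_minus_1 = sample.var(ddof=1)  # divides by n - 1 degrees of freedom

# A categorical variable with 3 levels needs only 3 - 1 = 2 indicator columns
colors = pd.Series(['red', 'green', 'blue', 'green'], name='color')
dummies = pd.get_dummies(colors, drop_first=True)

print(f"variance with n:     {var_n:.4f}")
print(f"variance with n - 1: {var_n_minus_1:.4f}")
print(dummies)
```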

### Further Reading

There are several web tutorials on degrees of freedom.



summary 2


# Statistical Significance and p-Values

## Key Terms

* **p-value:** The probability of getting results as extreme as the observed results, assuming the null hypothesis (no effect) is true.  A low p-value suggests the null hypothesis is unlikely.
* **Alpha (α):** The threshold for statistical significance.  If the p-value is less than alpha, the result is considered statistically significant. Common alpha levels are 0.05 (5%) and 0.01 (1%).
* **Type 1 error:**  Rejecting the null hypothesis when it's actually true (false positive).
* **Type 2 error:** Failing to reject the null hypothesis when it's actually false (false negative).


## Example: A/B Test on Ecommerce Prices

An A/B test compared two prices (A and B) resulting in the following conversion rates:

| Outcome       | Price A | Price B |
|---------------|---------|---------|
| Conversion    | 200     | 182     |
| No Conversion | 23539   | 22406   |


Price A showed a seemingly better conversion rate (0.8425% vs 0.8057%), but is this difference statistically significant?  A permutation test helps determine if this difference could be due to random chance.

### Python Code for Permutation Test

```python
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

# Permutation helper: a minimal implementation of the resampling step
def perm_fun(conversion, n1, n2):
  # Shuffle all outcomes, split them into groups of size n1 and n2,
  # and return the difference in conversion proportions between the groups
  shuffled = np.random.permutation(conversion.values)
  return shuffled[:n1].mean() - shuffled[n1:n1 + n2].mean()

obs_pct_diff = 100 * (200 / 23739 - 182 / 22588)
print(f'Observed difference: {obs_pct_diff:.4f}%')
conversion = [0] * 45945  # 0 represents no conversion
conversion.extend([1] * 382)  # 1 represents conversion
conversion = pd.Series(conversion)

perm_diffs = [100 * perm_fun(conversion, 23739, 22588)
              for _ in range(1000)] # Perform 1000 permutations

fig, ax = plt.subplots(figsize=(5, 5))
ax.hist(perm_diffs, bins=11, rwidth=0.9)
ax.axvline(x=obs_pct_diff, color='black', lw=2)
ax.text(0.06, 200, 'Observed\ndifference', bbox={'facecolor':'white'})
ax.set_xlabel('Conversion rate (percent)')
ax.set_ylabel('Frequency')
plt.show()
```

This code simulates many possible outcomes under the null hypothesis (no price difference).  The histogram visualizes these simulated differences, showing where the observed difference falls.

### Calculating the p-value

The p-value is estimated by calculating the proportion of simulated differences that are as extreme or more extreme than the observed difference:

```python
p_value = np.mean([diff > obs_pct_diff for diff in perm_diffs])
print(f"Estimated p-value: {p_value}")
```

A high p-value (e.g., >0.05) means the observed difference is not statistically significant: a difference this large could plausibly arise by chance alone.

### Alternative: Using `scipy.stats.chi2_contingency`

The `chi2_contingency` function provides a more direct way to calculate a p-value for this 2x2 contingency table:

```python
survivors = np.array([[200, 23739 - 200], [182, 22588 - 182]])
chi2, p_value, df, _ = stats.chi2_contingency(survivors)
print(f'p-value for single sided test: {p_value / 2:.4f}')
```

Note: The p-value is halved because it's a one-sided test (we're only interested if Price A is better).

## Alpha and the p-value Controversy

Alpha sets the threshold for significance.  However, the p-value is often misinterpreted as the probability that the result is due to chance.  The American Statistical Association highlights the importance of not relying solely on p-values for decision-making.

## Practical Significance

Statistical significance doesn't always equate to practical significance. A statistically significant result might represent a small, unimportant effect.

## Type 1 and Type 2 Errors

* **Type 1 error:**  Falsely concluding an effect is real.
* **Type 2 error:** Falsely concluding an effect is not real.

Significance tests aim to minimize Type 1 errors.

## Data Science and p-Values

In data science, p-values are helpful to understand if an interesting result could be due to chance.  They shouldn't be the sole basis for decisions.


## t-Tests

The t-test is a common significance test for comparing means of two groups of numerical data.

### Python Code for t-test

```python
from scipy import stats
# Assuming 'session_times' is a Pandas DataFrame with columns 'Time' and 'Page'
res = stats.ttest_ind(session_times[session_times.Page == 'Page A'].Time,
                      session_times[session_times.Page == 'Page B'].Time,
                      equal_var=False) # Welch's t-test (does not assume equal variances)
print(f'p-value for single sided test: {res.pvalue / 2:.4f}')
```

This code performs an independent samples t-test.  `equal_var=False` uses Welch's t-test, which is more robust if the variances of the two groups are not equal.  The p-value is divided by 2 for a one-sided test.


## Multiple Testing

Performing many tests increases the chance of finding a statistically significant result by chance (Type 1 error).  This is called alpha inflation.
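
To see how quickly alpha inflation grows, here is a minimal sketch. It assumes the tests are independent and that every null hypothesis is true; the function name `prob_at_least_one_false_positive` is hypothetical.

```python
def prob_at_least_one_false_positive(alpha, n_tests):
    # Probability that no test is falsely significant is (1 - alpha) ** n_tests,
    # so the complement is the chance of at least one Type 1 error.
    return 1 - (1 - alpha) ** n_tests

for n_tests in (1, 5, 20, 100):
    p_any = prob_at_least_one_false_positive(0.05, n_tests)
    print(f'{n_tests:3d} tests: {p_any:.3f}')
```

With 20 tests at alpha = 0.05, the chance of at least one spurious "significant" result is already about 64%.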

## False Discovery Rate

The false discovery rate (FDR) is the rate of Type 1 errors across multiple tests.  It's particularly relevant in situations with many comparisons, like genomic studies.
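
One common way to control the FDR is the Benjamini-Hochberg procedure. It is not described in the text above, so the following is only an illustrative sketch; the p-values are hypothetical and the function name `benjamini_hochberg` is an assumption.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans, True where the hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose sorted p-value satisfies p_(k) <= (k / m) * q
    max_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            max_rank = rank
    # Reject every hypothesis ranked at or below that k
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[i] = True
    return rejected

# Hypothetical p-values from ten tests
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print('Naive p < 0.05:    ', sum(p < 0.05 for p in p_values))
print('Benjamini-Hochberg:', sum(benjamini_hochberg(p_values)))
```

In this example five p-values fall below 0.05, but only the two smallest survive the procedure.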

## Degrees of Freedom

Degrees of freedom (df) represent the number of values in a calculation that are free to vary.  It's important for standardizing test statistics and in regression when dealing with categorical predictors (to avoid multicollinearity).  For large datasets, the impact of degrees of freedom on calculations is often negligible.


========================================================================================================================

## Summary of ANOVA and Chi-Square Test

### ANOVA

#### Introduction
ANOVA (Analysis of Variance) is a statistical procedure used to test for significant differences among multiple groups. It extends the A/B test to multiple groups, assessing whether the overall variation among groups is within the range of chance variation.

#### Key Terms
- **Pairwise comparison**: A hypothesis test between two groups among multiple groups.
- **Omnibus test**: A single hypothesis test of the overall variance among multiple group means.
- **Decomposition of variance**: Separation of components contributing to an individual value.
- **F-statistic**: Measures the extent to which differences among group means exceed what might be expected in a chance model.
- **SS (Sum of Squares)**: Refers to deviations from some average value.

#### Example: Web Page Stickiness
Table 3-3 shows the stickiness (in seconds) of four web pages.

|                   | Page 1 | Page 2 | Page 3 | Page 4 |
|-------------------|--------|--------|--------|--------|
|                   | 164    | 178    | 175    | 155    |
|                   | 172    | 191    | 193    | 166    |
|                   | 177    | 182    | 171    | 164    |
|                   | 156    | 185    | 163    | 170    |
|                   | 195    | 177    | 176    | 168    |
| **Average**       | 172.8  | 182.6  | 175.6  | 164.6  |
| **Grand average** | 173.9  |        |        |        |

#### Pairwise Comparisons
With four means, there are six possible comparisons between groups:
1. Page 1 compared to Page 2
2. Page 1 compared to Page 3
3. Page 1 compared to Page 4
4. Page 2 compared to Page 3
5. Page 2 compared to Page 4
6. Page 3 compared to Page 4
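
The count of six follows from choosing 2 groups out of 4. A small sketch that enumerates the pairs:

```python
from itertools import combinations
from math import comb

pages = ['Page 1', 'Page 2', 'Page 3', 'Page 4']

# Every unordered pair of pages is one possible pairwise comparison
pairs = list(combinations(pages, 2))
for first, second in pairs:
    print(f'{first} compared to {second}')

print(len(pairs), comb(len(pages), 2))  # both give 6
```

The number of pairs grows as k(k - 1)/2 for k groups, which is one reason a single omnibus test is preferred over many pairwise tests.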

#### Permutation Test in R
```r
> library(lmPerm)
> summary(aovp(Time ~ Page, data=four_sessions))
[1] "Settings:  unique SS "
Component 1 :
            Df R Sum Sq R Mean Sq Iter Pr(Prob)
Page         3    831.4    277.13 3104  0.09278 .
Residuals   16   1618.4    101.15
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

#### Permutation Test in Python
```python
import numpy as np

# four_sessions: DataFrame with columns 'Page' and 'Time' (the Table 3-3 data)
observed_variance = four_sessions.groupby('Page')['Time'].mean().var()
print('Observed means:', four_sessions.groupby('Page')['Time'].mean().values)
print('Variance:', observed_variance)

def perm_test(df):
    # Shuffle the times across pages, then recompute the variance of the group means
    df = df.copy()
    df['Time'] = np.random.permutation(df['Time'].values)
    return df.groupby('Page')['Time'].mean().var()

perm_variance = [perm_test(four_sessions) for _ in range(3000)]
print('Pr(Prob)', np.mean([var > observed_variance for var in perm_variance]))
```

#### F-Statistic
The F-statistic is based on the ratio of the variance across group means to the variance due to residual error.
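
As a sketch of how that ratio is formed, the F-statistic for the stickiness data in Table 3-3 can be computed by hand. The variable names are illustrative; the result should agree with the `aov` output for the same data (F of about 2.74).

```python
groups = {
    'Page 1': [164, 172, 177, 156, 195],
    'Page 2': [178, 191, 182, 185, 177],
    'Page 3': [175, 193, 171, 163, 176],
    'Page 4': [155, 166, 164, 170, 168],
}

all_values = [v for values in groups.values() for v in values]
grand_mean = sum(all_values) / len(all_values)

# Sum of squares between groups: spread of the group means around the grand mean
ss_between = sum(len(values) * (sum(values) / len(values) - grand_mean) ** 2
                 for values in groups.values())
# Sum of squares within groups: spread of each value around its own group mean
ss_within = sum((v - sum(values) / len(values)) ** 2
                for values in groups.values() for v in values)

df_between = len(groups) - 1               # 3
df_within = len(all_values) - len(groups)  # 16

f_stat = (ss_between / df_between) / (ss_within / df_within)
print(f'SS between: {ss_between:.1f}, SS within: {ss_within:.1f}, F: {f_stat:.2f}')
```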

#### ANOVA Table in R
```r
> summary(aov(Time ~ Page, data=four_sessions))
            Df Sum Sq Mean Sq F value Pr(>F)
Page         3  831.4   277.1    2.74 0.0776 .
Residuals   16 1618.4   101.2
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
```

#### ANOVA Table in Python
```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

model = smf.ols('Time ~ Page', data=four_sessions).fit()
aov_table = sm.stats.anova_lm(model)
print(aov_table)
```

#### Decomposition of Variance
1. Start with the grand average (173.9 for the web page stickiness data, computed from the 20 values in Table 3-3).
2. Add treatment effect, which might be negative (independent variable = web page).
3. Add residual error, which might be negative.

#### Two-Way ANOVA
Involves a second factor (e.g., weekend vs. weekday) and identifies the interaction effect.

#### Key Ideas
- ANOVA is used to analyze the results of an experiment with multiple groups.
- It identifies variance components associated with group treatments, interaction effects, and errors.

#### Further Reading
- *Introductory Statistics and Analytics: A Resampling Perspective* by Peter Bruce (Wiley, 2014)
- *Introduction to Design and Analysis of Experiments* by George Cobb (Wiley, 2008)

### Chi-Square Test

#### Introduction
The chi-square test is used with count data to test how well it fits some expected distribution. It is commonly used with r × c contingency tables to assess independence among variables.

#### Key Terms
- **Chi-square statistic**: Measures the extent to which observed data departs from expectation.
- **Expectation or expected**: How data is expected to turn out under some assumption, typically the null hypothesis.

#### Example: Web Testing Results
Table 3-4 shows the results for three different headlines.

|          | Headline A | Headline B | Headline C |
|----------|------------|------------|------------|
| Click    | 14         | 8          | 12         |
| No-click | 986        | 992        | 988        |

#### Resampling Approach
1. Constitute a box with 34 ones (clicks) and 2,966 zeros (no clicks).
2. Shuffle, take three separate samples of 1,000, and count the clicks in each.
3. Find the squared differences between the shuffled counts and the expected counts and sum them.
4. Repeat steps 2 and 3, say, 1,000 times.
5. How often does the resampled sum of squared deviations exceed the observed? That’s the p-value.

#### Chi-Square Test in Python
```python
import random
import numpy as np

# clicks: DataFrame with rows Click / No-click and one column per headline (Table 3-4)
box = [1] * 34
box.extend([0] * 2966)
random.shuffle(box)

def chi2(observed, expected):
    pearson_residuals = []
    for row, expect in zip(observed, expected):
        pearson_residuals.append([(observe - expect) ** 2 / expect
                                  for observe in row])
    # return the sum of the squared Pearson residuals
    return np.sum(pearson_residuals)

expected_clicks = 34 / 3
expected_noclicks = 1000 - expected_clicks
expected = [expected_clicks, expected_noclicks]
chi2observed = chi2(clicks.values, expected)

def perm_fun(box):
    sample_clicks = [sum(random.sample(box, 1000)),
                     sum(random.sample(box, 1000)),
                     sum(random.sample(box, 1000))]
    sample_noclicks = [1000 - n for n in sample_clicks]
    return chi2([sample_clicks, sample_noclicks], expected)

perm_chi2 = [perm_fun(box) for _ in range(2000)]

# Compare element by element; a plain list cannot be compared with a float
resampled_p_value = np.mean([value > chi2observed for value in perm_chi2])
print(f'Observed chi2: {chi2observed:.4f}')
print(f'Resampled p-value: {resampled_p_value:.4f}')
```

#### Statistical Theory
The chi-square distribution is typically skewed, with a long tail to the right. The degrees of freedom for a contingency table are related to the number of rows (r) and columns (c) as follows:
\[ \text{degrees of freedom} = (r - 1) \times (c - 1) \]

#### Chi-Square Test in Python
```python
from scipy import stats

chisq, pvalue, df, expected = stats.chi2_contingency(clicks)
print(f'Observed chi2: {chisq:.4f}')
print(f'p-value: {pvalue:.4f}')
```

#### Fisher’s Exact Test
Used when counts are extremely low. R code for Fisher’s exact test:
```r
> fisher.test(clicks)
	Fisher's Exact Test for Count Data

data:  clicks
p-value = 0.4824
alternative hypothesis: two.sided
```

#### Detecting Scientific Fraud
An example involving the distribution of digits in laboratory data, where the chi-square test was used to detect fabrication.

#### Relevance for Data Science
- Used to determine appropriate sample sizes for web experiments.
- Used in spatial statistics and mapping.
- Used in automated feature selection in machine learning.

#### Key Ideas
- Tests whether observed data counts are consistent with an assumption of independence.
- The chi-square distribution is the reference distribution to which the observed chi-square statistic must be compared.

#### Further Reading
- R. A. Fisher’s “Lady Tasting Tea” example.
- Stat Trek offers a good tutorial on the chi-square test.



summary 2

# ANOVA (Analysis of Variance)

## Introduction

ANOVA is a statistical test used to compare the means of three or more groups.  It's an extension of the t-test, which only compares two groups.  Instead of multiple pairwise comparisons (which increases the chance of false positives), ANOVA tests if there's a significant overall difference between the group means.

## Example: Web Page Stickiness

This example analyzes the "stickiness" (time spent) of four web pages (A, B, C, D). Each page was shown to five visitors, and the time spent was recorded. The goal is to determine if there's a significant difference in stickiness between the pages.

**Table 3-3 (Stickiness in seconds):**  (This table was provided in the text but is omitted here for brevity.  It contains the stickiness data for each page and visitor)


**Python Code (Permutation Test):**

```python
import pandas as pd
import numpy as np

# Data from Table 3-3 (pages labeled A-D)
data = {'Page': ['A']*5 + ['B']*5 + ['C']*5 + ['D']*5,
        'Time': [164, 172, 177, 156, 195, 178, 191, 182, 185, 177, 175, 193, 171, 163, 176, 155, 166, 164, 170, 168]}
four_sessions = pd.DataFrame(data)

observed_variance = four_sessions.groupby('Page')['Time'].mean().var()
print('Observed means:', four_sessions.groupby('Page')['Time'].mean().values)
print('Variance:', observed_variance)

def perm_test(df):
    # Shuffle the times across pages, then recompute the variance of the group means
    df = df.copy()
    df['Time'] = np.random.permutation(df['Time'].values)
    return df.groupby('Page')['Time'].mean().var()

perm_variance = [perm_test(four_sessions) for _ in range(3000)]
print('Pr(Prob)', np.mean([var > observed_variance for var in perm_variance]))

```

This code performs a permutation test to determine the p-value.  It shuffles the time data randomly and recalculates the variance between group means many times. The proportion of times the shuffled variance exceeds the observed variance gives the p-value. A low p-value (typically below 0.05) suggests a significant difference between the groups.


**Python Code (ANOVA using statsmodels):**

```python
import statsmodels.formula.api as smf
import statsmodels.api as sm

model = smf.ols('Time ~ Page', data=four_sessions).fit()
aov_table = sm.stats.anova_lm(model)
print(aov_table)
```

This code uses the `statsmodels` library to perform a traditional ANOVA. The output provides an ANOVA table with Sum of Squares, Mean Squares, F-statistic, and p-value.


## F-Statistic

The F-statistic is the ratio of the variance *between* groups to the variance *within* groups. A high F-statistic indicates that the variance between groups is much larger than the variance within groups, suggesting a significant difference between the groups.


## Decomposition of Variance

Each data point can be broken down into:

1. **Grand Average:** The overall average of all data points.
2. **Treatment Effect:** The difference between the group mean and the grand average.
3. **Residual Error:** The difference between the individual data point and its group mean.
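
The three components can be checked numerically. The sketch below uses the Table 3-3 stickiness data (pages labeled A to D, as in the example above) and verifies that every value equals grand average + treatment effect + residual; the variable names are illustrative.

```python
# Table 3-3 stickiness data (seconds)
sessions = {
    'A': [164, 172, 177, 156, 195],
    'B': [178, 191, 182, 185, 177],
    'C': [175, 193, 171, 163, 176],
    'D': [155, 166, 164, 170, 168],
}

all_times = [t for times in sessions.values() for t in times]
grand_average = sum(all_times) / len(all_times)

decomposition = []
for page, times in sessions.items():
    group_mean = sum(times) / len(times)
    treatment_effect = group_mean - grand_average   # may be negative
    for t in times:
        residual = t - group_mean                   # may be negative
        decomposition.append((page, t, grand_average, treatment_effect, residual))

# Show the decomposition of the first observation on page A
page, value, grand, effect, resid = decomposition[0]
print(f'{value} = {grand:.1f} + ({effect:.1f}) + ({resid:.1f})')
# prints: 164 = 173.9 + (-1.1) + (-8.8)
```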


## Two-Way ANOVA

Two-way ANOVA extends the analysis to a second factor (e.g., web page and day of the week).  It also allows for the assessment of the interaction effect between the two factors.


# Chi-Square Test

## Introduction

The chi-square test is used to analyze categorical data (counts) to see if there's a significant association between two or more categorical variables.  It compares observed counts to expected counts under the assumption of independence between the variables.


## Example: Headline Click-Through Rates

This example examines click-through rates for three different headlines (A, B, C).  The goal is to determine if there's a significant difference in click-through rates between the headlines.

**Table 3-4 (Headline Clicks):** (This table was provided in the text but is omitted here for brevity. It contains the click and no-click counts for each headline).

**Python Code (Permutation Test):**

```python
import random
import numpy as np

# Data from Table 3-4: one row per outcome (click, no-click), one column per headline
clicks = np.array([[14, 8, 12], [986, 992, 988]])

box = [1] * 34
box.extend([0] * 2966)
random.shuffle(box)

def chi2(observed, expected):
    pearson_residuals = []
    for row, expect in zip(observed, expected):
        pearson_residuals.append([(observe - expect) ** 2 / expect
                                  for observe in row])
    # return sum of squares
    return np.sum(pearson_residuals)

expected_clicks = 34 / 3
expected_noclicks = 1000 - expected_clicks
expected = [expected_clicks, expected_noclicks]
chi2observed = chi2(clicks, expected)

def perm_fun(box):
    sample_clicks = [sum(random.sample(box, 1000)),
                     sum(random.sample(box, 1000)),
                     sum(random.sample(box, 1000))]
    sample_noclicks = [1000 - n for n in sample_clicks]
    return chi2([sample_clicks, sample_noclicks], expected)

perm_chi2 = [perm_fun(box) for _ in range(2000)]

resampled_p_value = np.mean([value > chi2observed for value in perm_chi2])
print(f'Observed chi2: {chi2observed:.4f}')
print(f'Resampled p-value: {resampled_p_value:.4f}')

```

This code performs a permutation test for the chi-square statistic.  It shuffles the click/no-click data and calculates the chi-square statistic many times.  The p-value indicates the probability of observing the actual chi-square statistic or a more extreme value under the assumption of independence.



**Python Code (Chi-Square Test using scipy):**

```python
from scipy import stats

chisq, pvalue, df, expected = stats.chi2_contingency(clicks)
print(f'Observed chi2: {chisq:.4f}')
print(f'p-value: {pvalue:.4f}')

```

This code uses the `scipy.stats` library to perform a traditional chi-square test.  The p-value is obtained using the chi-square distribution as an approximation.


## Fisher's Exact Test

Fisher's exact test is an alternative to the chi-square test, particularly useful when the expected counts are low.  It calculates the exact p-value by considering all possible arrangements of the data.  Note that an easy-to-use Python implementation for Fisher's exact test wasn't explicitly provided in the original text.


##  Detecting Scientific Fraud (Example)

The example of Thereza Imanishi-Kari's case illustrates how the chi-square test can be used to detect anomalies in data, potentially indicating data fabrication or manipulation.  The analysis involved examining the distribution of digits in the data and comparing it to the expected uniform distribution.  Deviations from the expected distribution could suggest irregularities.
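
A digit-uniformity check of this kind can be sketched as a chi-square goodness-of-fit calculation. The digit counts below are hypothetical (they are not the data from the case), and the 5% critical value for 9 degrees of freedom is taken from a standard chi-square table.

```python
# Hypothetical counts of the digits 0-9 observed in a data set
observed_counts = [20, 45, 18, 44, 30, 27, 19, 38, 33, 26]

total = sum(observed_counts)
expected_count = total / len(observed_counts)   # uniform: every digit equally likely

# Chi-square statistic: squared deviations from expectation, scaled by expectation
chi2_stat = sum((obs - expected_count) ** 2 / expected_count
                for obs in observed_counts)

degrees_of_freedom = len(observed_counts) - 1   # 10 categories - 1 = 9
critical_value_5pct = 16.919                    # chi-square table, df = 9, alpha = 0.05

suspicious = chi2_stat > critical_value_5pct
print(f'chi2 = {chi2_stat:.2f}, df = {degrees_of_freedom}, '
      f'exceeds 5% critical value: {suspicious}')
```

A statistic above the critical value only flags the digit distribution as unusual under the uniform assumption; it is a prompt for closer inspection, not proof of fabrication.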


## Relevance for Data Science

In data science, the chi-square test (or Fisher's exact test) isn't primarily used to establish statistical significance for publication but rather as a tool for feature selection, sample size determination, and identifying potential data anomalies.  Multi-armed bandit algorithms are often preferred for optimizing treatments in online experiments.

==================================================================================================================


## Multi-Arm Bandit Algorithm

### Introduction

The Multi-Arm Bandit (MAB) algorithm is a method for optimizing decision-making in experiments, particularly in web testing. It allows for more rapid decision-making compared to traditional statistical approaches.

### Key Terms

- **Multi-arm bandit**: An analogy for a multitreatment experiment, where each arm represents a different treatment with varying payoffs.
- **Arm**: A treatment in an experiment (e.g., "headline A in a web test").
- **Win**: The desired outcome in an experiment (e.g., "customer clicks on the link").

### Traditional A/B Testing vs. Multi-Arm Bandits

Traditional A/B testing involves collecting data to answer a specific question, such as which treatment is better. However, this approach has limitations:
- **Inconclusive Results**: The results may not be statistically significant.
- **Delayed Action**: You may want to act on results before the experiment concludes.
- **Inflexibility**: Traditional methods do not allow for changes based on new data.

Bandit algorithms address these issues by allowing multiple treatments to be tested simultaneously and adjusting the sampling process based on ongoing results.

### Bandit Algorithms

#### Epsilon-Greedy Algorithm

This is a simple algorithm for an A/B test:

1. Generate a uniformly distributed random number between 0 and 1.
2. If the number lies between 0 and epsilon (a small number between 0 and 1):
   - Flip a fair coin (50/50 probability).
   - If heads, show offer A.
   - If tails, show offer B.
3. If the number is ≥ epsilon, show the offer with the highest response rate to date.

```python
import random

def epsilon_greedy(epsilon, offer_A_rate, offer_B_rate):
    if random.random() < epsilon:
        return random.choice(['A', 'B'])
    else:
        return 'A' if offer_A_rate > offer_B_rate else 'B'
```

- **Epsilon**: A parameter that governs the algorithm. If epsilon is 1, it results in a standard A/B experiment. If epsilon is 0, it results in a purely greedy algorithm.

#### Thompson Sampling

Thompson Sampling is a more sophisticated algorithm that uses a Bayesian approach to maximize the probability of choosing the best arm. It assumes a prior distribution of rewards and updates this distribution with each draw.

### Key Ideas

- **Traditional A/B Tests**: Random sampling process, leading to excessive exposure to inferior treatments.
- **Multi-Arm Bandits**: Alter the sampling process to incorporate information learned during the experiment, reducing the frequency of inferior treatments.
- **Efficient Treatment**: Facilitates efficient treatment of more than two treatments.
- **Algorithms**: Different algorithms for shifting sampling probability away from inferior treatments to the presumed superior one.

### Further Reading

- **Bandit Algorithms for Website Optimization** by John Myles White (O’Reilly, 2012).
- **"Analysis of Thompson Sampling for the Multi-armed Bandit Problem"** by Shipra Agrawal and Navin Goyal.

## Power and Sample Size

### Key Terms

- **Effect Size**: The minimum size of the effect you hope to detect.
- **Power**: The probability of detecting a given effect size with a given sample size.
- **Significance Level**: The statistical significance level at which the test will be conducted.

### Determining Sample Size

To decide how long a web test should run, consider the frequency of the desired goal. There is no general guidance; it depends on the goal's attainment frequency.

### Steps for Statistical Calculations

1. **Hypothetical Data**: Start with data representing your best guess about the results.
2. **Second Sample**: Create a second sample by adding the desired effect size to the first sample.
3. **Bootstrap Sample**: Draw a bootstrap sample of size n from each of the two samples.
4. **Hypothesis Test**: Conduct a hypothesis test on the two bootstrap samples and record whether the difference is statistically significant.
5. **Estimated Power**: Repeat steps 3 and 4 many times; the proportion of repetitions with a significant difference is the estimated power.
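
The steps above can be sketched as a small simulation. Everything specific here is an assumption made for illustration: the two hypothetical "boxes" (a 20% baseline rate and a 30% rate after the effect), the one-sided two-proportion z-test used as the hypothesis test, and the function names.

```python
import random
from statistics import NormalDist

def one_sided_p_value(wins_a, wins_b, n):
    """Two-proportion z-test with equal group sizes; alternative: rate B > rate A."""
    pooled = (wins_a + wins_b) / (2 * n)
    if pooled in (0, 1):
        return 1.0                      # no variation, so no evidence of a difference
    se = (2 * pooled * (1 - pooled) / n) ** 0.5
    z = (wins_b / n - wins_a / n) / se
    return 1 - NormalDist().cdf(z)

def estimate_power(box_a, box_b, n, alpha=0.05, n_sims=1000, seed=1):
    rng = random.Random(seed)
    significant = 0
    for _ in range(n_sims):
        # Step 3: bootstrap sample (with replacement) of size n from each box
        sample_a = rng.choices(box_a, k=n)
        sample_b = rng.choices(box_b, k=n)
        # Step 4: test the two samples and record whether the result is significant
        if one_sided_p_value(sum(sample_a), sum(sample_b), n) < alpha:
            significant += 1
    # Step 5: the share of significant results is the estimated power
    return significant / n_sims

box_a = [1] * 20 + [0] * 80   # Step 1: hypothetical baseline, 20% success rate
box_b = [1] * 30 + [0] * 70   # Step 2: baseline plus the desired effect (30%)

for n in (50, 200, 500):
    print(n, estimate_power(box_a, box_b, n))
```

Running the loop for several values of n shows how the estimated power rises with sample size, which is the basis for choosing how long a test needs to run.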

### Example

Suppose the existing ad has a click-through rate of 1.1% and you are testing a new ad, seeking a 10% boost to 1.21%.

```python
import numpy as np
from statsmodels.stats.power import TTestIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Effect size calculation
effect_size = proportion_effectsize(0.0121, 0.011)

# Power analysis
analysis = TTestIndPower()
result = analysis.solve_power(effect_size=effect_size, alpha=0.05, power=0.8, alternative='larger')
print('Sample Size: %.3f' % result)
```

### Key Ideas

- **Sample Size**: Depends on effect size, power, and significance level.
- **Power Calculation**: Involves specifying the effect size, power, and significance level.

### Further Reading

- **Sample Size Determination and Power** by Thomas Ryan (Wiley, 2013).
- Steve Simon's narrative-style post on the subject.

## Summary

Experimental design principles, such as randomization, allow for valid conclusions about treatment effectiveness. Formal statistical inference is less critical from a data science perspective, but understanding random variation's role is essential. Intuitive resampling procedures like permutation and bootstrap help gauge the extent of chance variation in data analysis.

### Technical Concepts

- **Multiplication Rule**: The probability of n independent events all happening is the product of the individual probabilities. For example, the probability of two coins both landing heads is 0.5 × 0.5 = 0.25.





summary 2

# Multi-Arm Bandit Algorithm

## Key Terms

* **Multi-arm bandit:** A metaphor for a multi-treatment experiment, like a slot machine with multiple arms (treatments), each with different payoff probabilities.
* **Arm:** A single treatment in the experiment (e.g., a headline variant on a website).
* **Win:** A successful outcome of a treatment (e.g., a user clicking a link).


## Traditional A/B Testing Limitations

Traditional A/B testing has limitations:

* **Inconclusive results:**  Experiments may not have enough data to definitively show a difference between treatments.
* **Delayed action:** Results are only acted upon after the experiment concludes.
* **Inflexibility:**  The approach doesn't easily adapt to changing data during the experiment.


## Multi-Arm Bandits: A Superior Approach

Bandit algorithms offer a more flexible and efficient way to run experiments, especially in web testing.  They allow testing multiple treatments concurrently and making quicker, data-driven decisions.

The goal is to identify the best-performing arm (treatment) as quickly as possible, maximizing overall reward (e.g., clicks, conversions).  Unlike traditional A/B tests, bandit algorithms dynamically adjust the allocation of "pulls" (exposures) to different arms based on observed performance. Initially, all arms may be tested equally. Over time, better-performing arms receive a larger proportion of the pulls.

## Epsilon-Greedy Algorithm

This is a simple bandit algorithm:

1. A random number between 0 and 1 is generated.
2. If the number is less than `epsilon` (a small, pre-defined value), a coin flip decides which arm to choose (exploration).
3. Otherwise, the arm with the highest observed success rate is chosen (exploitation).

`epsilon` controls the balance between exploration (trying all arms) and exploitation (focusing on the seemingly best arm).  A higher `epsilon` means more exploration, while a lower `epsilon` leads to more exploitation.


**No Python code was provided in the original text related to the epsilon-greedy algorithm.**
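
As an illustrative sketch only (not taken from the original text), the steps above could be simulated as follows. The true win rates, the number of rounds, and the function names `choose_arm` and `simulate_epsilon_greedy` are hypothetical; with more than two arms, the coin flip becomes a uniform random choice among the arms.

```python
import random

def choose_arm(epsilon, wins, pulls, rng):
    # Exploration: with probability epsilon, pick an arm uniformly at random
    if rng.random() < epsilon:
        return rng.randrange(len(pulls))
    # Exploitation: otherwise pick the arm with the highest observed win rate
    rates = [w / p if p else 0.0 for w, p in zip(wins, pulls)]
    return rates.index(max(rates))

def simulate_epsilon_greedy(true_rates, epsilon=0.1, n_rounds=5000, seed=1):
    rng = random.Random(seed)
    wins = [0] * len(true_rates)
    pulls = [0] * len(true_rates)
    for _ in range(n_rounds):
        arm = choose_arm(epsilon, wins, pulls, rng)
        pulls[arm] += 1
        # A pull "wins" with the arm's (unknown to the algorithm) true rate
        if rng.random() < true_rates[arm]:
            wins[arm] += 1
    return wins, pulls

# Hypothetical true win rates for two arms
wins, pulls = simulate_epsilon_greedy([0.05, 0.10])
print('Pulls per arm:', pulls)
print('Wins per arm: ', wins)
```

Setting `epsilon=1.0` makes every choice random, as in a standard A/B test, while `epsilon=0.0` never explores.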


## Thompson Sampling

This is a more sophisticated algorithm employing a Bayesian approach. It starts with an initial belief about the probability of success for each arm (a prior distribution, often a Beta distribution).  After each pull, this belief is updated based on the outcome. To choose an arm, the algorithm draws one value from each arm's current distribution and pulls the arm with the highest draw, so an arm is selected with the probability that it is currently believed to be the best.  This method balances exploration and exploitation effectively.

**No Python code was provided in the original text related to Thompson sampling.**
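
As an illustrative sketch only (not taken from the original text), Thompson sampling for win/no-win outcomes might look like the following. It assumes a uniform Beta(1, 1) prior for every arm; each round draws one value from every arm's Beta posterior and pulls the arm with the largest draw. The true win rates and the function names are hypothetical.

```python
import random

def thompson_choose(wins, losses, rng):
    # Posterior for each arm is Beta(1 + wins, 1 + losses) under a Beta(1, 1) prior
    draws = [rng.betavariate(1 + w, 1 + l) for w, l in zip(wins, losses)]
    return draws.index(max(draws))

def simulate_thompson(true_rates, n_rounds=5000, seed=1):
    rng = random.Random(seed)
    wins = [0] * len(true_rates)
    losses = [0] * len(true_rates)
    for _ in range(n_rounds):
        arm = thompson_choose(wins, losses, rng)
        # Update the chosen arm's counts, which updates its posterior
        if rng.random() < true_rates[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
    return wins, losses

# Hypothetical true win rates for three arms
wins, losses = simulate_thompson([0.05, 0.10, 0.08])
pulls = [w + l for w, l in zip(wins, losses)]
print('Pulls per arm:', pulls)
```

Arms with little data have wide posteriors and still get pulled occasionally, while arms with strong evidence of a low win rate are pulled less and less.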


## Key Ideas of Multi-Arm Bandits

* Reduce exposure to inferior treatments.
* Adapt sampling based on learned information.
* Efficiently handle multiple (more than two) treatments.


## Power and Sample Size

Determining the necessary sample size for an A/B test is crucial to avoid inconclusive results. This involves considering:

* **Effect size:** The minimum difference between treatments you want to detect.
* **Power:** The probability of detecting a real effect of a given size.
* **Significance level (alpha):** The probability of falsely concluding there's a difference when there isn't.


## Calculating Power and Sample Size: An Intuitive Approach

This approach avoids complex statistical formulas:

1. Create hypothetical data representing your best guess of the results for each treatment.
2. Create a second dataset, modified to include the desired effect size.
3. Draw bootstrap samples from each dataset.
4. Perform a hypothesis test on the bootstrap samples and record the significance.
5. Repeat this many times to estimate power.


## Python Code for Power Calculation (Using `statsmodels`)

The original text provides R code based on the `pwr` package.  Because `pwr` is an R package, the Python code below performs the equivalent calculation with the `statsmodels` package.


```python
import statsmodels.api as sm

# Example: Calculating sample size for a 10% improvement in click-through rate
# from 1.1% to 1.21% with 80% power and 5% significance level.

effect_size = sm.stats.proportion_effectsize(0.0121, 0.011) # Calculate effect size
analysis = sm.stats.TTestIndPower() # Initialize power analysis object

# Solve for sample size
result = analysis.solve_power(effect_size=effect_size,
                              alpha=0.05, power=0.8, alternative='larger')

print('Sample Size: %.3f' % result) 
```

This code calculates the sample size needed in each group to detect a 10% relative increase in click-through rate (from 1.1% to 1.21%) with 80% power at a 5% significance level, using a one-sided test.


## Key Ideas on Power and Sample Size

* Sample size determination requires considering the planned statistical test.
* Specify the minimum detectable effect size, desired power, and significance level.


## Summary

The document explains Multi-Arm Bandit algorithms as a superior alternative to traditional A/B testing for web optimization, highlighting their flexibility and efficiency in identifying the best-performing treatment.  It also discusses the importance of power and sample size calculations in experimental design, providing an intuitive method alongside Python code for sample size estimation.


