# Best Statistics Libraries for Machine Learning

In this notebook, we'll explore some of the best Python libraries for statistics and their specific use cases in machine learning. We'll provide examples for each library.

## SciPy

SciPy is a library for scientific computing that includes modules for optimization, integration, interpolation, eigenvalue problems, linear algebra, and statistics. It is widely used for statistical tests and numerical computations that go beyond basic descriptive statistics.

One of the key features of SciPy is its `scipy.stats` module, which provides various statistical tests and probability distributions.

### Example: Statistical tests and random sampling using `scipy.stats`

In [None]:
from scipy import stats as sps

# Perform a two-sample t-test
stat, p = sps.ttest_ind([1, 2, 3], [4, 5, 6])
print(f"t-statistic: {stat}, p-value: {p}")

# Calculate the Spearman rank-order correlation coefficient
corr_spearman, p_spearman = sps.spearmanr([1, 2, 3], [3, 2, 1])
print(f"Spearman correlation coefficient: {corr_spearman}, p-value: {p_spearman}")

# Perform a chi-square test of independence
data_contingency = [[10, 20], [15, 25]]
chi2, p_chi2, dof, expected = sps.chi2_contingency(data_contingency)
print(f"Chi-square statistic: {chi2}, p-value: {p_chi2}, degrees of freedom: {dof}")
print(f"Expected frequencies: \n{expected}")

# Generate random samples from a normal distribution
random_samples = sps.norm.rvs(loc=0, scale=1, size=10)
print(f"Random samples from a normal distribution: {random_samples}")
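
The probability distributions mentioned above can also be queried directly. The following is a minimal sketch, assuming a standard normal distribution is the one of interest; the variable names are illustrative only.

```python
from scipy import stats as sps

# Frozen standard normal distribution: mean 0, standard deviation 1
dist = sps.norm(loc=0, scale=1)

# Probability density at the mean
pdf_at_zero = dist.pdf(0)

# Probability that a draw falls below 1.96
cdf_value = dist.cdf(1.96)

# Quantile (inverse CDF) at the 97.5th percentile
quantile = dist.ppf(0.975)

print(f"pdf(0) = {pdf_at_zero:.4f}")
print(f"cdf(1.96) = {cdf_value:.4f}")
print(f"ppf(0.975) = {quantile:.4f}")
```

Freezing the distribution with `sps.norm(loc=..., scale=...)` keeps the parameters in one place, so the same object can be reused for densities, cumulative probabilities, and quantiles.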



## Statsmodels

Statsmodels is a Python library designed for statistical modeling and analysis. It offers tools for statistical tests, data exploration, and fitting models such as linear regression, ANOVA, and time-series models.

A particularly useful feature is its support for linear regression models coupled with comprehensive summaries of results.

### Example: Linear regression using Statsmodels

In [None]:
import statsmodels.api as sm

# Linear regression
X = [1, 2, 3, 4, 5]
y = [5, 9, 11, 15, 18]
X = sm.add_constant(X)  # Add constant for intercept
model = sm.OLS(y, X).fit()
print(model.summary())

## Scikit-learn

Scikit-learn is a powerful library for machine learning, but it also provides robust tools for data preprocessing and feature engineering. It includes functions for statistics-based preprocessing such as feature scaling, normalization, and encoding.

### Example: Feature scaling using `StandardScaler`

In [None]:
from sklearn.preprocessing import StandardScaler
import numpy as np

data = np.array([[1, 2], [3, 4], [5, 6]])
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
print(scaled_data)
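
In a typical workflow the scaler is fitted on training data only and then reused on unseen data, so that test-set statistics do not leak into preprocessing. The following is a minimal sketch of that pattern; `X_test` is a hypothetical held-out sample.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
X_test = np.array([[7.0, 8.0]])  # hypothetical unseen data

scaler = StandardScaler()

# Learn per-column mean and standard deviation from the training data only
X_train_scaled = scaler.fit_transform(X_train)

# Apply the training statistics to the unseen data
X_test_scaled = scaler.transform(X_test)

print("learned means:", scaler.mean_)
print("learned standard deviations:", scaler.scale_)
print("scaled test data:", X_test_scaled)
```

After fitting, each training column has mean 0 and unit standard deviation, while the test row is expressed in the same units.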

## Pandas

Pandas is a highly popular library for data manipulation and exploration. It is particularly well-suited for exploratory data analysis (EDA) and provides tools for descriptive statistics, grouping, merging, and reshaping datasets.

### Example: Descriptive statistics using Pandas

In [None]:
import pandas as pd

data = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
print(data.describe())
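
Grouping is handled in a similar way. The sketch below assumes a small made-up table with a categorical `group` column and a numeric `value` column, and computes per-group statistics.

```python
import pandas as pd

# Hypothetical data: two groups with two measurements each
df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "value": [1.0, 3.0, 10.0, 20.0],
})

# Mean, sample standard deviation, and count of `value` within each group
group_stats = df.groupby("group")["value"].agg(["mean", "std", "count"])
print(group_stats)
```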

## PyMC and Bayesian Statistics

PyMC is a library for Bayesian inference that supports probabilistic programming and building complex statistical models. It is commonly used when a problem calls for explicit modeling of uncertainty in parameters and predictions.

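### Example: Bayesian inference using PyMC

In [None]:
import arviz as az
import pymc as pm  # PyMC 4+ is imported as `pymc`; the legacy `pymc3` package requires an older NumPy than TensorFlow allows

with pm.Model() as model:
    mu = pm.Normal("mu", mu=0, sigma=1)  # prior on the unknown mean
    obs = pm.Normal("obs", mu=mu, sigma=1, observed=[-1, 0, 1])  # likelihood of the data
    trace = pm.sample(1000)  # 1000 posterior draws per chain

# Posterior mean, standard deviation, and credible interval for mu
print(az.summary(trace, var_names=["mu"]))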
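
Because the model above pairs a normal prior with a normal likelihood of known standard deviation, its posterior has a closed form that can serve as a sanity check on the sampler. The following sketch computes it with NumPy alone; the variable names are illustrative and not part of any PyMC API.

```python
import numpy as np

observations = np.array([-1.0, 0.0, 1.0])
prior_mean, prior_sd = 0.0, 1.0
likelihood_sd = 1.0

# Precision is the inverse of variance; precisions add in the conjugate update
prior_precision = 1.0 / prior_sd**2
data_precision = len(observations) / likelihood_sd**2
posterior_precision = prior_precision + data_precision

# Posterior mean is the precision-weighted combination of prior and data
posterior_mean = (
    prior_precision * prior_mean + observations.sum() / likelihood_sd**2
) / posterior_precision
posterior_sd = np.sqrt(1.0 / posterior_precision)

print(f"posterior mean: {posterior_mean:.3f}, posterior sd: {posterior_sd:.3f}")
```

The sampled posterior for `mu` should land close to these values, with small differences due to Monte Carlo error.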

## Recommendation Summary

| Library          | Use Case                                                                |
|------------------|-------------------------------------------------------------------------|
| **SciPy**        | Advanced statistical tests and probability distributions                |
| **Statsmodels**  | Statistical modeling, regression, and hypothesis testing                |
| **Scikit-learn** | Data preprocessing and feature engineering for machine learning workflows |
| **Pandas**       | Exploratory data analysis (EDA) and basic descriptive statistics        |
| **PyMC**         | Bayesian inference and probabilistic programming                        |

### Practical Recommendations
- Use **SciPy** for stand-alone statistical tests.
- Use **Statsmodels** for in-depth modeling and examining relationships in data.
- Use **Scikit-learn** when working within a machine learning pipeline.
- Use **Pandas** for exploration and quick statistical insights.
- Use **PyMC** for advanced Bayesian analysis or probabilistic problem-solving.