### Basic Statistical Tests in Python
This notebook provides a cursory overview of which statistical test to choose for which kind of data, and how to run each one in Python.

#### Decision Trees for Selecting Statistical Tests
1. Type of Data
    - **Categorical Data** (e.g., rating scale, gender)
        - Two variables: Chi-Square Test
        - More than two variables: Logistic Regression
    - **Numerical Data**
        - Compare means of two groups: t-test
            - Paired: Paired t-test
            - Independent: Independent t-test
        - Compare means of more than two groups: ANOVA
        - Compare medians: Mann-Whitney U Test (two groups) or Kruskal-Wallis Test (more than two groups)
-----
2. Sample Size and Distribution
    - **Small sample size or non-normal distribution**: Non-parametric tests (e.g., Mann-Whitney U Test, Wilcoxon Signed-Rank Test)
    - **Large sample size and normal distribution**: Parametric tests (e.g., t-tests, ANOVA)
------    
3. Relationship Between Variables
    - **Correlation between two variables**: Pearson or Spearman Correlation
    - **Predicting a variable from other variables**: Regression Analysis
        - Linear Regression: For continuous outcome variables
        - Logistic Regression: For categorical outcome variables
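
The "Sample Size and Distribution" branch above can be sketched in code. This is a minimal illustration, not a definitive rule: the helper `suggest_two_group_test`, the 0.05 cutoff, and the synthetic samples are all assumptions made for this example. It uses the Shapiro-Wilk test (`scipy.stats.shapiro`) as one common way to check normality.

```python
import numpy as np
from scipy.stats import shapiro


def suggest_two_group_test(sample_a, sample_b, alpha=0.05):
    """Hypothetical helper: suggest a two-group test from Shapiro-Wilk checks."""
    _, p_a = shapiro(sample_a)
    _, p_b = shapiro(sample_b)
    # A Shapiro-Wilk p-value above alpha means normality is not rejected
    both_look_normal = p_a > alpha and p_b > alpha
    return "independent t-test" if both_look_normal else "Mann-Whitney U test"


# Synthetic data (assumption): two roughly normal samples and one skewed sample
rng = np.random.default_rng(0)
normal_a = rng.normal(loc=50, scale=5, size=200)
normal_b = rng.normal(loc=52, scale=5, size=200)
skewed = rng.exponential(scale=5, size=200)

print(suggest_two_group_test(normal_a, normal_b))
print(suggest_two_group_test(normal_a, skewed))
```

A normality test is only one input to the decision. With large samples it can flag tiny, harmless deviations, so a histogram or Q-Q plot is worth checking alongside it.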

In [None]:
import pandas as pd
import seaborn as sns

# Load Titanic dataset
titanic = sns.load_dataset('titanic')

# Load Iris dataset
iris = sns.load_dataset('iris')

### Chi-Square Test
**Use when**: comparing two categorical variables
<br>**Objective**: Test if there is a significant association between two categorical variables, e.g., 'Sex' and 'Survived'.
<br>**Interpretation**: A low p-value (typically < 0.05) indicates a significant association between the variables.

In [20]:
from scipy.stats import chi2_contingency

# Create a contingency table
contingency_table = pd.crosstab(titanic['sex'], titanic['survived'])

# Perform Chi-Square Test
# chi2 = Chi-square value
# p = p-value
# dof = degrees of freedom
# expected = expected frequencies of each cell in the table if the null hypothesis is true
chi2, p, dof, expected = chi2_contingency(contingency_table)

# Format the p-value to avoid scientific notation
if p < 0.001:
    formatted_p = "< 0.001"
else:
    formatted_p = format(p, '.5f')

print(f"Chi-Square Statistic: {chi2}, P-value: {formatted_p}")

Chi-Square Statistic: 260.71702016732104, P-value: < 0.001
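
To make the `expected` output more concrete, here is a small hand computation of expected frequencies and the Pearson chi-square statistic in plain Python. The 2x2 counts are hypothetical and chosen for easy arithmetic; they are not the Titanic counts. Note that `chi2_contingency` applies Yates' continuity correction by default on 2x2 tables, so its statistic for the same table would differ from this uncorrected value unless `correction=False` is passed.

```python
# Hypothetical 2x2 contingency table (rows: group A / group B, columns: no / yes)
observed = [[10, 20],
            [30, 40]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: row total * column total / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Pearson chi-square: sum of (observed - expected)^2 / expected over all cells
chi2_manual = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Degrees of freedom: (rows - 1) * (columns - 1)
dof_manual = (len(observed) - 1) * (len(observed[0]) - 1)

print(f"Expected: {expected}")
print(f"Chi-square (uncorrected): {chi2_manual:.4f}, dof: {dof_manual}")
```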


### t-Test
**Use when**: comparing the means of two groups of numeric data, with a reasonably large sample and an approximately normal distribution in each group
<br>**Objective**: Test whether the mean of a numeric variable differs between two groups, e.g., 'age' for survivors vs. non-survivors.
<br>**Interpretation**: A p-value less than 0.05 typically indicates a statistically significant difference in the means.

In [18]:
from scipy.stats import ttest_ind

# Drop rows with missing 'age' into a separate frame so the original data stays intact
titanic_age = titanic.dropna(subset=['age'])

# Split data into two groups
group1 = titanic_age[titanic_age['survived'] == 1]['age']
group2 = titanic_age[titanic_age['survived'] == 0]['age']

# Perform Independent t-test
t_stat, p = ttest_ind(group1, group2)


# Format the p-value to avoid scientific notation
if p < 0.001:
    formatted_p = "< 0.001"
else:
    formatted_p = format(p, '.5f')


print(f"T-statistic: {t_stat}, P-value: {formatted_p}")

T-statistic: -2.06668694625381, P-value: 0.03912
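
The decision tree also lists the paired t-test, which the cell above does not cover. The sketch below shows `ttest_rel` for paired measurements, plus Welch's variant of the independent test (`equal_var=False`), which may be a safer choice when the two groups have clearly different variances. All data here is synthetic, and the variable names are made up for illustration.

```python
import numpy as np
from scipy.stats import ttest_ind, ttest_rel

rng = np.random.default_rng(42)

# Paired design (assumption): the same 30 subjects measured before and after
before = rng.normal(loc=70, scale=10, size=30)
after = before + rng.normal(loc=2, scale=3, size=30)
t_paired, p_paired = ttest_rel(before, after)

# Independent groups with unequal spread and size: Welch's t-test
group_a = rng.normal(loc=70, scale=5, size=40)
group_b = rng.normal(loc=72, scale=15, size=25)
t_welch, p_welch = ttest_ind(group_a, group_b, equal_var=False)

print(f"Paired t-test: t = {t_paired:.3f}, p = {p_paired:.5f}")
print(f"Welch t-test:  t = {t_welch:.3f}, p = {p_welch:.5f}")
```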


### ANOVA
**Use when**: comparing the means of two or more groups of numeric data (typically three or more, since a t-test already covers two), with large sample size and normal distribution
<br>**Objective**: Test whether the mean of a numeric variable differs across groups, e.g., 'sepal_length' across the three iris species.
<br>**Interpretation**: A p-value less than 0.05 typically indicates that at least one group mean differs significantly from the others; it does not say which one.

In [17]:
from scipy.stats import f_oneway

# Split the data into groups
group1 = iris[iris['species'] == 'setosa']['sepal_length']
group2 = iris[iris['species'] == 'versicolor']['sepal_length']
group3 = iris[iris['species'] == 'virginica']['sepal_length']

# Perform ANOVA
f_stat, p = f_oneway(group1, group2, group3)

# Format the p-value to avoid scientific notation
if p < 0.001:
    formatted_p = "< 0.001"
else:
    formatted_p = format(p, '.5f')

print(f"F-statistic: {f_stat}, P-value: {formatted_p}")

F-statistic: 119.26450218450468, P-value: < 0.001
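
For intuition about what the F-statistic measures, the following plain-Python sketch computes it by hand as the ratio of between-group to within-group variance. The three tiny groups are hypothetical and unrelated to the iris data.

```python
from statistics import mean

# Hypothetical groups, chosen so the arithmetic is easy to follow
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]

all_values = [v for g in groups for v in g]
grand_mean = mean(all_values)

# Between-group sum of squares: how far each group mean is from the grand mean
ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)

# Within-group sum of squares: spread of the values around their own group mean
ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)

df_between = len(groups) - 1
df_within = len(all_values) - len(groups)

# F is the ratio of the two mean squares
f_manual = (ss_between / df_between) / (ss_within / df_within)
print(f"F-statistic: {f_manual:.4f}")
```

A large F means the group means are far apart relative to the scatter inside each group, which is what drives the small p-value.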


### Kruskal-Wallis Test
**Use when**: comparing the medians of two or more groups of numeric data, especially when the data are not normally distributed
<br>**Objective**: Test whether a numeric variable tends to be higher or lower in some groups than in others, e.g., 'petal_length' across the three iris species.
<br>**Interpretation**: A p-value less than 0.05 typically indicates that at least one group differs significantly in its median.

In [13]:
from scipy.stats import kruskal

# Split the data into groups
group1 = iris[iris['species'] == 'setosa']['petal_length']
group2 = iris[iris['species'] == 'versicolor']['petal_length']
group3 = iris[iris['species'] == 'virginica']['petal_length']

# Perform Kruskal-Wallis Test
stat, p = kruskal(group1, group2, group3)

# Format the p-value to avoid scientific notation
if p < 0.001:
    formatted_p = "< 0.001"
else:
    formatted_p = format(p, '.5f')

print(f"Statistic: {stat}, P-value: {formatted_p}")


Statistic: 130.41104857977163, P-value: < 0.001
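
The decision tree also names the Mann-Whitney U test (two independent groups) and the Wilcoxon signed-rank test (paired data), neither of which is shown above. Here is a minimal sketch of both using `scipy.stats`. The small samples are hypothetical and deliberately well separated.

```python
from scipy.stats import mannwhitneyu, wilcoxon

# Two independent groups (hypothetical values)
sample_x = [1, 2, 3, 4, 5]
sample_y = [6, 7, 8, 9, 10]
u_stat, p_mw = mannwhitneyu(sample_x, sample_y, alternative='two-sided')

# Paired measurements on the same six subjects (hypothetical values)
paired_before = [10, 12, 14, 16, 18, 20]
paired_after = [11, 14, 17, 20, 23, 26]
w_stat, p_w = wilcoxon(paired_before, paired_after)

print(f"Mann-Whitney U: statistic = {u_stat}, p = {p_mw:.5f}")
print(f"Wilcoxon signed-rank: statistic = {w_stat}, p = {p_w:.5f}")
```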


### Pearson Correlation
**Use when**: exploring the linear relationship between two numeric variables
<br>**Objective**: Measure the strength and direction of the correlation between two variables, e.g., 'sepal_length' and 'sepal_width'.
<br>**Interpretation**: The correlation coefficient ranges from -1 to 1 and indicates the strength and direction of the relationship, with 0 meaning no linear relationship; the p-value tests whether the correlation differs significantly from 0.

In [21]:
from scipy.stats import pearsonr

# Perform Pearson Correlation
corr, p = pearsonr(iris['sepal_length'], iris['sepal_width'])

# Format the p-value to avoid scientific notation
if p < 0.001:
    formatted_p = "< 0.001"
else:
    formatted_p = format(p, '.5f')

print(f"Pearson correlation: {corr}, P-value: {formatted_p}")


Pearson correlation: -0.11756978413300204, P-value: 0.15190
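
The decision tree mentions Spearman correlation as an alternative to Pearson. The sketch below compares the two on hypothetical data where the relationship is perfectly monotonic but not linear. Spearman works on ranks, so it reports a perfect association, while Pearson comes out slightly lower.

```python
from scipy.stats import pearsonr, spearmanr

# Hypothetical data: y grows with x, but along a curve (y = x squared)
x_vals = [1, 2, 3, 4, 5]
y_vals = [1, 4, 9, 16, 25]

pearson_r, pearson_p = pearsonr(x_vals, y_vals)
spearman_rho, spearman_p = spearmanr(x_vals, y_vals)

print(f"Pearson r: {pearson_r:.4f}")
print(f"Spearman rho: {spearman_rho:.4f}")
```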


### Linear Regression
**Use when**: predicting a continuous variable from one or more other variables
<br>**Objective**: Predict a continuous outcome, e.g., 'sepal_width' from 'sepal_length'.
<br>**Interpretation**: The coefficient shows the expected change in the outcome for each one-unit increase in the predictor.

In [22]:
from sklearn.linear_model import LinearRegression

# Prepare data
X = iris['sepal_length'].values.reshape(-1, 1)
y = iris['sepal_width']

# Fit linear regression model
model = LinearRegression()
model.fit(X, y)

# Display coefficients
print(f"Coefficient: {model.coef_}, Intercept: {model.intercept_}")


Coefficient: [-0.0618848], Intercept: 3.418946836103816
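
scikit-learn's `LinearRegression` reports coefficients but no p-values. If a significance test for the slope is wanted, one option is `scipy.stats.linregress`, which fits a single-predictor line and also returns the correlation and a p-value. This is a sketch on hypothetical data, not the iris fit above.

```python
from scipy.stats import linregress

# Hypothetical data lying close to the line y = 2x
x_obs = [1, 2, 3, 4, 5]
y_obs = [2.1, 3.9, 6.2, 7.8, 10.1]

fit = linregress(x_obs, y_obs)

print(f"Slope: {fit.slope:.4f}, Intercept: {fit.intercept:.4f}")
print(f"R-squared: {fit.rvalue ** 2:.4f}, P-value for slope: {fit.pvalue:.5f}")
```

The p-value here tests the null hypothesis that the slope is zero, which for one predictor is equivalent to testing the Pearson correlation.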


### Logistic Regression
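**Use when**: predicting a categorical (here binary) outcome from one or more other variables
<br>**Objective**: Predict a categorical outcome, e.g., 'survived' from 'age' and 'pclass'.
<br>**Interpretation**: Each coefficient shows the change in the log-odds of the outcome for each one-unit increase in that predictor, holding the others fixed; a negative coefficient lowers the predicted probability.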

In [23]:
from sklearn.linear_model import LogisticRegression

# Drop rows with missing values in 'age' and 'pclass'
titanic = titanic.dropna(subset=['age', 'pclass'])

# Select features and target
X = titanic[['age', 'pclass']]  # Features: Age and Passenger Class
y = titanic['survived']         # Target: Survival

# Create and train the logistic regression model
model = LogisticRegression()
model.fit(X, y)

# Output the model's coefficients and intercept
print(f"Coefficients: {model.coef_}")
print(f"Intercept: {model.intercept_}")

# To make a prediction, use model.predict() with appropriate input
# For example, predicting for a 30-year-old passenger in 1st class
sample_data = pd.DataFrame({'age': [30], 'pclass': [1]})
prediction = model.predict(sample_data)
print(f"Predicted Survival: {prediction[0]}")  # 0 for not survived, 1 for survived



Coefficients: [[-0.04149665 -1.22653571]]
Intercept: [3.532956]
Predicted Survival: 1
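
To show how the coefficients turn into a prediction, here is a worked calculation in plain Python using the coefficient and intercept values printed above. It reproduces the log-odds and probability for the 30-year-old first-class passenger, and converts each coefficient into an odds ratio. The variable names are chosen for this illustration; on a fitted model, `model.predict_proba` would provide the probability directly.

```python
import math

# Values copied from the printed model output above
coef_age = -0.04149665
coef_pclass = -1.22653571
intercept = 3.532956

# Log-odds for a 30-year-old passenger in 1st class
log_odds = intercept + coef_age * 30 + coef_pclass * 1

# The logistic (sigmoid) function maps log-odds to a probability
prob_survived = 1 / (1 + math.exp(-log_odds))

# exp(coefficient) is the odds ratio for a one-unit increase in that feature
odds_ratio_age = math.exp(coef_age)
odds_ratio_pclass = math.exp(coef_pclass)

print(f"Log-odds: {log_odds:.4f}, P(survived): {prob_survived:.3f}")
print(f"Odds ratio per year of age: {odds_ratio_age:.3f}")
print(f"Odds ratio per class step: {odds_ratio_pclass:.3f}")
```

The probability is above 0.5, which matches the predicted class of 1. An odds ratio below 1 means the odds of survival shrink as that feature increases.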
