<a href="https://colab.research.google.com/github/Ash100/Python_for_Lifescience/blob/main/Chapter_8_Statistical_Analysis_for_Biology.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Learn Python for Biological Science**

This course is designed and taught by **Dr. Ashfaq Ahmad**. All examples used in teaching are drawn from the biological and life sciences.

## 📅 Course Outline

---

## 🏗️ Foundation (Weeks 1–2)

### 📘 Chapter 1: Getting Started with Python and Colab [Watch Lecture](https://youtu.be/BKe2CmiG_TU)
- Introduction to Google Colab interface
- Basic Python syntax and data types
- Variables, strings, and basic operations
- Print statements and comments

### 📘 Chapter 2: Control Structures [Watch Lecture](https://youtu.be/uPHeqVb4Mo0)
- Conditional statements (`if`/`else`)
- Loops (`for` and `while`)
- Basic functions and scope

---

## 🧬 Data Handling (Weeks 3–4)

### 📘 Chapter 3: Data Structures for Biology [Watch Lecture](https://youtu.be/x1IJwSYhNZg)
- Lists and tuples (storing sequences, experimental data)
- Dictionaries (gene annotations, species data)
- Sets (unique identifiers, sample collections)

### 📘 Chapter 4: Working with Files [Watch Lecture](https://youtu.be/D27MyLpSdks)
- Reading and writing text files
- Handling CSV files (experimental data)
- Basic file operations for biological datasets

---

## 📊 Scientific Computing (Weeks 5–7)

### 📘 Chapter 5: NumPy for Numerical Data [Watch Lecture](https://youtu.be/DPaZN3NQtWw)
- Arrays for storing experimental measurements
- Mathematical operations on datasets
- Statistical calculations (mean, median, standard deviation)

### 📘 Chapter 6: Pandas for Data Analysis [Watch Lecture](https://youtu.be/MPE6qibUyTE)
- DataFrames for structured biological data
- Data cleaning and manipulation
- Filtering and grouping experimental results
- Handling missing data

### 📘 Chapter 7: Data Visualization [Watch Lecture](https://youtu.be/gWhXywbMyfM)
- Matplotlib basics for scientific plots
- Creating publication-quality figures
- Specialized plots for biological data (histograms, scatter plots, box plots)

---

## 🔬 Biological Applications (Weeks 8–10)

### 📘 Chapter 8: Statistical Analysis for Biology
- Hypothesis testing basics
- t-tests and chi-square tests
- Correlation analysis
- Introduction to `scipy.stats`

### 📘 Chapter 9: Practical Projects
- Analyzing gene expression data
- Population genetics calculations
- Ecological data analysis
- Creating reproducible research workflows

---

## 🚀 Advanced Topics *(Optional – Weeks 11–12)*

### 📘 Chapter 10: Bioinformatics Libraries
- Introduction to Biopython
- Working with biological databases
- Phylogenetic analysis basics

### 📘 Chapter 11: Best Practices
- Code organization and documentation
- Error handling
- Reproducible research practices
- Sharing code and results

---

✅ We will move from basic programming concepts to practical biological applications, ensuring students can immediately apply what they learn to their research and coursework.


### **Topics Covered**

Hypothesis Testing Basics<br>
t-tests and Chi-square Tests<br>
Correlation Analysis<br>
Introduction to scipy.stats

# 🧪 Hypothesis Testing in Biological Research

# Hypothesis Testing

Hypothesis testing is a **statistical method** used to make decisions about a population based on a sample of data.  
The process always begins with two competing hypotheses: **the null hypothesis** and **the alternative hypothesis**.

---

## Null Hypothesis (H₀)

- The **null hypothesis (H₀)** represents the **status quo** or a statement of **no effect, no difference, or no relationship**.  
- It's the hypothesis that a researcher attempts to disprove.  
- Think of it as the **"innocent until proven guilty"** statement in a statistical trial.

**Example:**

> H₀: The fertilizer has **no effect** on plant height.

- This is the starting assumption that your experiment is designed to challenge.

---

## Alternative Hypothesis (H₁ or Hₐ)

- The **alternative hypothesis (H₁ or Hₐ)** is the statement that the researcher **seeks evidence for**.  
- It **contradicts H₀** and suggests there **is an effect, a difference, or a relationship**.

**Example:**

> H₁: The fertilizer **does affect** plant height.

- This is the "guilty verdict": a conclusion you reach only when the evidence against H₀ is strong enough.

---

## The Role of Evidence and Decision Making ⚖️

1. Formulate **H₀** and **H₁**.  
2. Collect data and calculate a **test statistic**.  
3. Use this statistic to determine the **p-value**.  
4. Compare the p-value with the **significance level (α)** to make a decision.

---

## p-value

- The **p-value** is the probability of observing a result as extreme as (or more extreme than) the one you actually got, **assuming H₀ is true**.  

**Interpretation:**
- **Small p-value (≤ α):** Strong evidence against H₀ → reject H₀.  
- **Large p-value (> α):** Results are not surprising under H₀ → fail to reject H₀.  
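
The definition above can be made concrete with a permutation test, sketched here as an illustration rather than as the method used later in this chapter. The plant-height numbers, the seed, and the number of shuffles are all assumptions made for the example. If H₀ is true, the group labels carry no information, so shuffling them shows how large a difference chance alone tends to produce.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the sketch is reproducible

# Hypothetical plant heights (cm): control vs. fertilized
control = rng.normal(loc=10, scale=2, size=30)
fertilized = rng.normal(loc=13, scale=2, size=30)

observed_diff = fertilized.mean() - control.mean()

# Under H0 the labels are interchangeable, so shuffle the pooled data repeatedly
pooled = np.concatenate([control, fertilized])
n_perm = 5000
perm_diffs = np.empty(n_perm)
for i in range(n_perm):
    shuffled = rng.permutation(pooled)
    perm_diffs[i] = shuffled[:30].mean() - shuffled[30:].mean()

# Two-sided p-value: share of shuffles at least as extreme as the observed difference
n_extreme = np.sum(np.abs(perm_diffs) >= abs(observed_diff))
p_perm = (n_extreme + 1) / (n_perm + 1)

print("Observed difference in means:", observed_diff)
print("Permutation p-value:", p_perm)
```

The `+ 1` in the numerator and denominator counts the observed data as one of the possible arrangements, which keeps the estimated p-value from being exactly zero.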

---

## Significance Level (α)

- The **significance level (α)** is your **predetermined threshold** for making a decision.  
- It represents the **maximum p-value** you are willing to accept to call a result **statistically significant**.  

- Common choice: **α = 0.05**  
  - This means you accept a **5% chance of making a Type I error** (rejecting a true H₀).
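
One way to see what the 5% means is a small simulation, sketched here under the assumption that both groups are drawn from the same normal distribution, so H₀ is true by construction. The seed, group size, and number of simulated experiments are arbitrary choices for illustration. Every rejection in this setup is a Type I error, and the share of rejections should land close to α.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha = 0.05
n_experiments = 2000

# Both groups come from the same distribution, so H0 is true in every experiment
group_a = rng.normal(loc=10, scale=2, size=(n_experiments, 30))
group_b = rng.normal(loc=10, scale=2, size=(n_experiments, 30))

# One t-test per row: axis=1 compares the 30 replicates of each experiment
_, p_values = stats.ttest_ind(group_a, group_b, axis=1)

# Fraction of experiments where we would (wrongly) reject H0
false_positive_rate = np.mean(p_values <= alpha)
print("Share of experiments rejecting a true H0:", false_positive_rate)
```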

---

## Decision Rule ✅❌

- **If p-value ≤ α:** Reject H₀ → results are **statistically significant**.  
- **If p-value > α:** Fail to reject H₀ → not enough evidence to support H₁.  

---

### Why This Matters

This structured process ensures that conclusions drawn from data are:
- **Objective**
- **Evidence-based**
- Not driven by **mere speculation**




In [9]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

In [3]:
np.random.seed(42)  # for reproducibility

# Simulate untreated sample: 30 measurements from a normal distribution
untreated = np.random.normal(loc=10, scale=2, size=30)

# Simulate treated sample: same size, but shifted mean to reflect treatment effect
treated = np.random.normal(loc=13, scale=2, size=30)

**loc=10** means the untreated group has a mean of 10 (e.g. baseline gene expression)<br>
**loc=13** means the treated group has a higher mean, simulating an effect (e.g. upregulation)<br>
**scale=2** sets the standard deviation to 2, simulating biological variability<br>
**size=30** simulates 30 replicates per group

**Now we are going to visualize the data**

In [None]:
sns.set(style="whitegrid")

plt.figure(figsize=(8, 5))
sns.histplot(untreated, color="skyblue", label="Untreated", kde=True)
sns.histplot(treated, color="salmon", label="Treated", kde=True)
plt.title("Simulated Biological Measurements")
plt.xlabel("Measurement Value")
plt.ylabel("Frequency")
plt.legend()
plt.show()

**kde=True** adds a smooth curve to show the density

### Statistical Summary of the data

In [None]:
print("Untreated Mean:", np.mean(untreated))
# ddof=1 gives the sample standard deviation (divides by n - 1)
print("Untreated Std Dev:", np.std(untreated, ddof=1))
print("Treated Mean:", np.mean(treated))
print("Treated Std Dev:", np.std(treated, ddof=1))

### Hypothesis Testing (t-test)

In [None]:
# Independent two-sample t-test (assumes equal variances by default)
t_stat, p_value = stats.ttest_ind(treated, untreated)

print("T-statistic:", t_stat)
print("P-value:", p_value)

**Explanation:**<br>A low p-value (≤ 0.05) suggests a statistically significant difference between the means of the treated and untreated groups.
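
A p-value says whether a difference is detectable, not how large it is. As a hedged sketch, the cell below reruns the same simulation and adds two things that are not part of the original analysis: Welch's t-test (`equal_var=False`, which drops the equal-variance assumption) and Cohen's d as a simple effect size. The pooled-standard-deviation formula used here assumes equal group sizes.

```python
import numpy as np
from scipy import stats

np.random.seed(42)  # same seed and parameters as the simulation above
untreated = np.random.normal(loc=10, scale=2, size=30)
treated = np.random.normal(loc=13, scale=2, size=30)

# Welch's t-test: does not assume the two groups share the same variance
t_welch, p_welch = stats.ttest_ind(treated, untreated, equal_var=False)

# Cohen's d: difference in means divided by the pooled sample standard deviation
# (this simple pooled formula assumes equal group sizes)
pooled_sd = np.sqrt((np.var(treated, ddof=1) + np.var(untreated, ddof=1)) / 2)
cohens_d = (np.mean(treated) - np.mean(untreated)) / pooled_sd

print("Welch t-statistic:", t_welch)
print("Welch p-value:", p_welch)
print("Cohen's d:", cohens_d)
```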

### **Reference Sheet**
#### Interpreting the p-value

In hypothesis testing, we compare the **p-value** to a predefined **significance level (α)**, commonly set at **0.05**.  

| **p-value**     | **Interpretation**              |
|------------------|---------------------------------|
| > 0.05           | Not statistically significant   |
| ≤ 0.05           | Statistically significant       |
| ≤ 0.01           | Strongly significant            |
| ≤ 0.001          | Very strongly significant       |
| ≤ 0.000001       | Extremely significant           |

---

✅ Use this table as a quick reference when interpreting results of hypothesis tests.
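
The table can also be written as a small helper function. This is only a convenience sketch: the name `interpret_p_value` is hypothetical, and the labels and cutoffs are copied from the table above.

```python
def interpret_p_value(p_value):
    """Return the label from the reference table for a given p-value."""
    # Check the strictest threshold first so the strongest label wins
    thresholds = [
        (0.000001, "Extremely significant"),
        (0.001, "Very strongly significant"),
        (0.01, "Strongly significant"),
        (0.05, "Statistically significant"),
    ]
    for cutoff, label in thresholds:
        if p_value <= cutoff:
            return label
    return "Not statistically significant"


for p in [0.2, 0.03, 0.004, 1.23e-8]:
    print(p, "->", interpret_p_value(p))
```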

Python prints very small p-values in scientific notation. For example, `1.23e-08` means **1.23 × 10⁻⁸ = 0.0000000123**.

## **What Is the Chi-Square Test (χ²)?**

The **chi-square (χ²) test** is a **statistical hypothesis test** used to examine the relationship between two **categorical variables**.  

- It works by comparing:  
  - **Observed frequencies** → the data you actually collect  
  - **Expected frequencies** → what you would expect to see if there were *no relationship* between the variables  

👉 The goal is to determine whether any difference between observed and expected results is due to a **real relationship** or simply due to **chance**.  

---

## Types of Chi-Square Tests

1. **Test of Independence**  
   - Checks whether two categorical variables are significantly associated.  
   - **Example:** Is there an association between a person’s **hair color** and **eye color**?

2. **Goodness-of-Fit Test**  
   - Checks whether the observed frequency distribution for a single categorical variable differs from a theoretical or expected distribution.  
   - **Example:** Does the ratio of **heads to tails** in coin flips match the expected **50/50 distribution**?

---

## How is it Different from a t-test? 🧐

| **Feature**         | **Chi-Square Test**                                | **t-test**                                        |
|----------------------|----------------------------------------------------|--------------------------------------------------|
| **Variable Type**    | Categorical (e.g., gender, species, presence/absence) | Continuous/Quantitative (e.g., height, weight)   |
| **Primary Use**      | To assess relationships or compare observed vs. expected frequencies | To compare the means of two groups               |
| **Underlying Data**  | Frequency counts in categories                     | Numerical values with a defined scale             |

🔑 **Rule of thumb:**  
- **t-test** → for *“how much”* questions (e.g., "Is the average height of Group A different from Group B?").  
- **Chi-square test** → for *“how many / how often”* questions (e.g., "Is the number of people choosing Option A different from Option B?").  

---

## Applications of Chi-Square Test in Biological Data 🧬🌱🩺

1. **Genetics**  
   - Test whether observed **phenotypic ratios** match expected Mendelian ratios.  
   - Example: Does a cross between two heterozygotes (Aa × Aa) yield the expected **3:1 dominant:recessive** ratio?

2. **Ecology**  
   - Test associations between species in a habitat.  
   - Example: Are two plant species **co-occurring** in quadrats significantly more often than expected by chance?

3. **Medical Research**  
   - Analyze relationships between categorical variables such as **disease presence** and **risk factors**.  
   - Example: Is there an association between **smoking status** and **lung disease**?
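
The medical example above is a test of independence, which `scipy.stats` provides as `chi2_contingency`. The sketch below uses invented counts purely for illustration; they are not real data. For a 2×2 table, SciPy applies Yates' continuity correction by default.

```python
import numpy as np
from scipy import stats

# Hypothetical counts, invented for illustration only
#                 disease  no disease
table = np.array([[30, 70],    # smokers
                  [10, 90]])   # non-smokers

chi2, p, dof, expected = stats.chi2_contingency(table)

print("Chi-square statistic:", chi2)
print("P-value:", p)
print("Degrees of freedom:", dof)
print("Expected counts if the two variables were independent:")
print(expected)
```

The `expected` array is computed from the row and column totals, so unlike `stats.chisquare` you do not supply expected counts yourself.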

---


In [None]:
# Observed genotype counts (simulated for illustration)
observed = [20, 30, 50]  # AA, Aa, aa
expected = [25, 25, 50]  # illustrative expected counts; must sum to the same total as observed

chi2_stat, chi2_p = stats.chisquare(f_obs=observed, f_exp=expected)

print("Chi-square statistic:", chi2_stat)
print("P-value:", chi2_p)

- We're testing whether the observed genotype counts (AA, Aa, aa) match a set of expected counts. The expected counts used here are illustrative; a formal Hardy-Weinberg test derives them from the allele frequencies estimated from the sample.
- The chi-square test compares observed and expected counts to see whether the difference is statistically significant.

**Interpretation:**<br> The p-value (about 0.37) is greater than 0.05, so the difference between the observed and expected counts is not statistically significant and could plausibly arise by chance. In statistical terms, **we fail to reject the null hypothesis**.
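
In a textbook Hardy-Weinberg test, the expected counts are not chosen by hand but computed from the allele frequencies in the sample (p², 2pq, q²). The sketch below applies that calculation to the same observed counts as above. It is an illustration under that assumption, and its conclusion can differ from the hand-picked example because the expected counts differ.

```python
import numpy as np
from scipy import stats

observed = np.array([20, 30, 50])  # AA, Aa, aa (same counts as above)
n = observed.sum()

# Allele frequencies estimated from the sample
p = (2 * observed[0] + observed[1]) / (2 * n)  # frequency of allele A
q = 1 - p                                      # frequency of allele a

# Hardy-Weinberg expected genotype counts: p^2, 2pq, q^2 times sample size
expected_hwe = np.array([p**2, 2 * p * q, q**2]) * n

# ddof=1 because one parameter (p) was estimated from the data,
# leaving 3 - 1 - 1 = 1 degree of freedom
chi2_hwe, p_hwe = stats.chisquare(f_obs=observed, f_exp=expected_hwe, ddof=1)

print("Allele frequencies (A, a):", p, q)
print("Expected counts under HWE:", expected_hwe)
print("Chi-square statistic:", chi2_hwe)
print("P-value:", p_hwe)
```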

## **Correlation Analysis**

In [None]:
# Simulate gene expression and protein abundance
gene_expr = np.random.normal(50, 10, 50)
protein_abund = gene_expr * 0.8 + np.random.normal(0, 5, 50)

# Calculate Pearson correlation
corr_coef, corr_p = stats.pearsonr(gene_expr, protein_abund)

print("Correlation coefficient:", corr_coef)
print("P-value:", corr_p)

We're testing whether gene expression levels are linearly related to protein abundance.

The Pearson correlation coefficient (corr_coef) ranges from:<br>
**+1:** perfect positive correlation<br>
**0:** no correlation<br>
**–1:** perfect negative correlation<br>
A significant p-value means the correlation is unlikely to be due to chance alone.<br>
This is useful in systems biology to explore gene-protein relationships, co-expression networks, or biomarker discovery.
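
To see where the coefficient comes from, the sketch below computes Pearson's r by hand as the covariance divided by the product of the two standard deviations, then compares it with `stats.pearsonr`. The data are regenerated with a separate seeded generator (an assumption for this example), so the numbers will not match the cell above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)  # separate seed, chosen arbitrarily
gene_expr = rng.normal(50, 10, 50)
protein_abund = gene_expr * 0.8 + rng.normal(0, 5, 50)

# Pearson r = cov(x, y) / (std(x) * std(y)), using sample (n - 1) estimates
cov_xy = np.cov(gene_expr, protein_abund, ddof=1)[0, 1]
r_manual = cov_xy / (np.std(gene_expr, ddof=1) * np.std(protein_abund, ddof=1))

r_scipy, _ = stats.pearsonr(gene_expr, protein_abund)

print("Manual r:", r_manual)
print("SciPy r: ", r_scipy)
```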

### Interpretation
- **Correlation coefficient (r):** 0.7912  
- **P-value:** 8.06 × 10⁻¹²  

---

## 🔍 What Does This Mean?

### ✅ Strength of Relationship
- The correlation coefficient **r = 0.7912** indicates a **strong positive linear relationship**.  
- As one variable increases, the other tends to **increase as well**.

### ✅ Statistical Significance
- The **p-value = 8.06 × 10⁻¹²** is extremely small — far below the common threshold of **0.05**.  
- This means the observed correlation is **highly unlikely to be due to random chance**.

---

## 🧬 Biological Interpretation
There is a **strong and statistically significant association** between the two variables.  
Depending on the dataset, this could represent:

- A **gene expression level** correlating with **phenotype severity**  
- A **metabolite concentration** tracking with **disease progression**  
- An **environmental exposure** linked to a **biological response**  

---

In [None]:
sns.scatterplot(x=gene_expr, y=protein_abund)
plt.title("Gene Expression vs Protein Abundance")
plt.xlabel("Gene Expression")
plt.ylabel("Protein Abundance")
plt.show()


### **Summary Table of Tests**

| **Test Type**   | **Use Case**                          | **Function**              |
|------------------|---------------------------------------|----------------------------|
| **t-test**       | Compare means of two groups           | `stats.ttest_ind()`        |
| **Chi-square**   | Compare observed vs expected freq.    | `stats.chisquare()`        |
| **Correlation**  | Measure linear relationship           | `stats.pearsonr()`         |

---

📌 These functions are available in **`scipy.stats`** and are commonly used in hypothesis testing.