<a href="https://colab.research.google.com/github/Ash100/Python_for_Lifescience/blob/main/Chapter_6%3APandas_for_Data_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Learn Python for Biological Data Analysis
## **Chapter 6:** Pandas for Data Analysis

This course is designed and taught by **Dr. Ashfaq Ahmad**. All examples used in teaching are drawn from the biological and life sciences.

## 📅 Course Outline

---

## 🏗️ Foundation (Weeks 1–2)

### 📘 Chapter 1: Getting Started with Python and Colab [Watch Lecture](https://youtu.be/BKe2CmiG_TU)
- Introduction to Google Colab interface
- Basic Python syntax and data types
- Variables, strings, and basic operations
- Print statements and comments

### 📘 Chapter 2: Control Structures [Watch Lecture](https://youtu.be/uPHeqVb4Mo0)
- Conditional statements (`if`/`else`)
- Loops (`for` and `while`)
- Basic functions and scope

---

## 🧬 Data Handling (Weeks 3–4)

### 📘 Chapter 3: Data Structures for Biology [Watch Lecture](https://youtu.be/x1IJwSYhNZg)
- Lists and tuples (storing sequences, experimental data)
- Dictionaries (gene annotations, species data)
- Sets (unique identifiers, sample collections)

### 📘 Chapter 4: Working with Files [Watch Lecture](https://youtu.be/D27MyLpSdks)
- Reading and writing text files
- Handling CSV files (experimental data)
- Basic file operations for biological datasets

---

## 📊 Scientific Computing (Weeks 5–7)

### 📘 Chapter 5: NumPy for Numerical Data [Watch Lecture](https://youtu.be/DPaZN3NQtWw)
- Arrays for storing experimental measurements
- Mathematical operations on datasets
- Statistical calculations (mean, median, standard deviation)

### 📘 Chapter 6: Pandas for Data Analysis [Watch Lecture](https://youtu.be/MPE6qibUyTE)
- DataFrames for structured biological data
- Data cleaning and manipulation
- Filtering and grouping experimental results
- Handling missing data

### 📘 Chapter 7: Data Visualization
- Matplotlib basics for scientific plots
- Creating publication-quality figures
- Specialized plots for biological data (histograms, scatter plots, box plots)

---

## 🔬 Biological Applications (Weeks 8–10)

### 📘 Chapter 8: Sequence Analysis
- String manipulation for DNA/RNA sequences
- Basic sequence operations (reverse complement, transcription)
- Reading FASTA files
- Simple sequence statistics

### 📘 Chapter 9: Statistical Analysis for Biology
- Hypothesis testing basics
- t-tests and chi-square tests
- Correlation analysis
- Introduction to `scipy.stats`

### 📘 Chapter 10: Practical Projects
- Analyzing gene expression data
- Population genetics calculations
- Ecological data analysis
- Creating reproducible research workflows

---

## 🚀 Advanced Topics *(Optional – Weeks 11–12)*

### 📘 Chapter 11: Bioinformatics Libraries
- Introduction to Biopython
- Working with biological databases
- Phylogenetic analysis basics

### 📘 Chapter 12: Best Practices
- Code organization and documentation
- Error handling
- Reproducible research practices
- Sharing code and results

---

✅ We will move from basic programming concepts to practical biological applications, ensuring students can immediately apply what they learn to their research and coursework.


### Introduction
Welcome to this chapter on using **Pandas**, a powerful Python library for working with structured data. In biology, we often deal with tabular data—gene expression matrices, protein interaction tables, clinical trial results, etc. Pandas helps us clean, filter, and analyze such data efficiently.

In this chapter, you'll learn how to:
- Create and manipulate DataFrames
- Clean biological datasets
- Filter and group experimental results
- Handle missing data

Let's dive in!

### 🔧 Step 1: Setting Up

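Before we begin, we need to install and import the libraries we'll use.  
- `pandas`: for data manipulation  
- `seaborn`: for visualization (optional; not used in the examples below)  
- `numpy`: for numerical operations (installed with pandas as a dependency; not imported directly below)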

In [None]:
# Install necessary libraries
# !pip install pandas seaborn

# Import libraries
import pandas as pd



## DataFrames (df) for structured biological data

### 🧬 What is a DataFrame?

A **DataFrame** is like an Excel spreadsheet in Python. Each row is an observation (e.g., a gene), and each column is a variable (e.g., expression level, tissue type).

Here, we simulate a gene expression dataset:
- `Gene`: gene name
- `Expression_Level`: measured expression
- `Tissue`: tissue where it was measured
- `Condition`: whether the sample is healthy or cancerous

This structure is common in bioinformatics and experimental biology.


In [3]:
# Sample gene expression dataset
data = {
    "Gene": ["TP53", "BRCA1", "EGFR", "MYC", "CDK2"],
    "Expression_Level": [7.2, 5.5, 8.1, 6.3, 4.9],
    "Tissue": ["Liver", "Breast", "Lung", "Brain", "Liver"],
    "Condition": ["Healthy", "Cancer", "Cancer", "Healthy", "Cancer"],
}

df = pd.DataFrame(data)
df

    Gene  Expression_Level  Tissue Condition
0   TP53               7.2   Liver   Healthy
1  BRCA1               5.5  Breast    Cancer
2   EGFR               8.1    Lung    Cancer
3    MYC               6.3   Brain   Healthy
4   CDK2               4.9   Liver    Cancer
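
Before cleaning or analyzing a table, it usually helps to check its size and column types. The following is a minimal sketch that rebuilds the same illustrative gene expression table and inspects it with `shape`, `dtypes`, `describe()`, and `head()`. The variable names `n_rows`, `n_cols`, and `first_two` are only for illustration.

```python
import pandas as pd

# Same illustrative gene expression table as in the cell above
df = pd.DataFrame({
    "Gene": ["TP53", "BRCA1", "EGFR", "MYC", "CDK2"],
    "Expression_Level": [7.2, 5.5, 8.1, 6.3, 4.9],
    "Tissue": ["Liver", "Breast", "Lung", "Brain", "Liver"],
    "Condition": ["Healthy", "Cancer", "Cancer", "Healthy", "Cancer"],
})

# shape is a (rows, columns) tuple
n_rows, n_cols = df.shape
print(f"{n_rows} genes, {n_cols} variables")

# Data type of each column (numeric vs. text)
print(df.dtypes)

# Summary statistics for a numeric column: count, mean, std, min, quartiles, max
print(df["Expression_Level"].describe())

# First two rows only; useful when the real table has thousands of genes
first_two = df.head(2)
print(first_two)
```

On a real dataset, these checks are a quick way to confirm that numeric columns were not accidentally read as text.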


## 🧹 Cleaning/Handling Biological Data

Real-world biological data is messy. We often need to:
- Rename columns for clarity
- Normalize values (e.g., divide by max expression)
- Standardize labels (e.g., "Liver" → "Hepatic")

These steps make downstream analysis easier and more reproducible.


In [4]:
# Rename columns for clarity
df.rename(columns={"Expression_Level": "Expr"}, inplace=True)

# Add a normalized expression column
df["Expr_Norm"] = df["Expr"] / df["Expr"].max()

# Replace tissue names for consistency
df["Tissue"] = df["Tissue"].replace({"Liver": "Hepatic"})

df

    Gene  Expr   Tissue Condition  Expr_Norm
0   TP53   7.2  Hepatic   Healthy   0.888889
1  BRCA1   5.5   Breast    Cancer   0.679012
2   EGFR   8.1     Lung    Cancer   1.000000
3    MYC   6.3    Brain   Healthy   0.777778
4   CDK2   4.9  Hepatic    Cancer   0.604938
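
Label standardization often also involves stray whitespace and inconsistent capitalization. The sketch below uses a small hypothetical table (`messy`, invented for illustration and not part of the dataset above) and cleans its `Tissue` column with the pandas string methods `.str.strip()` and `.str.title()` before applying `.replace()`.

```python
import pandas as pd

# Hypothetical messy labels, invented for illustration only
messy = pd.DataFrame({
    "Gene": ["TP53", "BRCA1", "EGFR"],
    "Tissue": [" liver", "BREAST ", "Lung"],
})

# Work on a copy so the raw table stays untouched
clean = messy.copy()

# Remove surrounding whitespace, then capitalize the first letter of each word
clean["Tissue"] = clean["Tissue"].str.strip().str.title()

# Map labels to one agreed vocabulary
clean["Tissue"] = clean["Tissue"].replace({"Liver": "Hepatic"})

print(clean)
```

Keeping the raw table unchanged and cleaning a copy makes it easier to check later exactly what was modified.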


## 🔬 Filtering and Grouping

We often want to:
- Focus on specific conditions (e.g., cancer samples)
- Compare groups (e.g., tissues)

Here, we:
- Filter rows where `Condition == 'Cancer'`
- Group by `Tissue` and compute average and maximum expression

This mimics comparing experimental replicates or conditions.


In [5]:
# Filter for cancer samples
cancer_df = df.query("Condition == 'Cancer'")
cancer_df

    Gene  Expr   Tissue Condition  Expr_Norm
1  BRCA1   5.5   Breast    Cancer   0.679012
2   EGFR   8.1     Lung    Cancer   1.000000
4   CDK2   4.9  Hepatic    Cancer   0.604938
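
The same filter can also be written with a boolean mask, which makes it straightforward to combine several conditions. This is a minimal sketch on the cleaned table; the expression threshold of `5.0` is an arbitrary value chosen for illustration.

```python
import pandas as pd

# Cleaned gene expression table (same values as above)
df = pd.DataFrame({
    "Gene": ["TP53", "BRCA1", "EGFR", "MYC", "CDK2"],
    "Expr": [7.2, 5.5, 8.1, 6.3, 4.9],
    "Tissue": ["Hepatic", "Breast", "Lung", "Brain", "Hepatic"],
    "Condition": ["Healthy", "Cancer", "Cancer", "Healthy", "Cancer"],
})

# Boolean mask: True for rows that are cancer samples
is_cancer = df["Condition"] == "Cancer"
cancer_df = df[is_cancer]

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
# The 5.0 cutoff is an arbitrary illustrative threshold
high_cancer = df[(df["Condition"] == "Cancer") & (df["Expr"] > 5.0)]
print(high_cancer)
```

`query()` and boolean masks select the same rows here; which one to use is mostly a matter of readability.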


In [6]:
# Group by tissue and compute mean/max expression
grouped = df.groupby("Tissue").agg({"Expr": ["mean", "max"]})
grouped

         Expr
         mean  max
Tissue
Brain    6.30  6.3
Breast   5.50  5.5
Hepatic  6.05  7.2
Lung     8.10  8.1
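
Passing a dictionary of lists to `.agg()` produces two-level column names (`Expr` over `mean` and `max`). If flat column names are easier to work with, named aggregation is one alternative. The sketch below assumes the cleaned table from above; the output names `mean_expr`, `max_expr`, and `n_genes` are chosen here for illustration.

```python
import pandas as pd

# Cleaned gene expression table (same values as above)
df = pd.DataFrame({
    "Gene": ["TP53", "BRCA1", "EGFR", "MYC", "CDK2"],
    "Expr": [7.2, 5.5, 8.1, 6.3, 4.9],
    "Tissue": ["Hepatic", "Breast", "Lung", "Brain", "Hepatic"],
    "Condition": ["Healthy", "Cancer", "Cancer", "Healthy", "Cancer"],
})

# Named aggregation: new_column=(source_column, function)
summary = df.groupby("Tissue").agg(
    mean_expr=("Expr", "mean"),
    max_expr=("Expr", "max"),
    n_genes=("Gene", "count"),   # how many rows contributed to each group
)
print(summary)
```

Including a count next to each mean shows how many measurements the summary is based on; in this small table most tissues have only one gene.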


## ❗ Missing Data in Biology

Missing data is common—due to failed experiments, unreadable files, or low signal.

We can:
- Detect missing values using `.isna()` (or its alias `.isnull()`)
- Fill them with statistical estimates (e.g., mean)
- Drop them if necessary

Always document how you handle missing data, because it affects biological interpretation.<br>
**Let's create a DataFrame with some missing data first.**

In [7]:
# Create a DataFrame with missing data
missing_data = {
    "SampleID": ["P1", "P2", "P3", "P4"],
    "Age": [35, 42, None, 58],
    "BiomarkerA": [1.2, 1.5, 1.1, None],
    "BiomarkerB": [15.2, 16.5, 14.8, 17.1],
}
missing_df = pd.DataFrame(missing_data)
print("Original DataFrame with Missing Data:")
print(missing_df)

# Check for missing values
print("\nMissing values per column:")
print(missing_df.isnull().sum())

Original DataFrame with Missing Data:
  SampleID   Age  BiomarkerA  BiomarkerB
0       P1  35.0         1.2        15.2
1       P2  42.0         1.5        16.5
2       P3   NaN         1.1        14.8
3       P4  58.0         NaN        17.1

Missing values per column:
SampleID      0
Age           1
BiomarkerA    1
BiomarkerB    0
dtype: int64
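
Column totals show how much is missing, but not which samples are affected. As a small sketch on the same table, `.isna().any(axis=1)` picks out the rows with at least one gap, and `.isna().mean()` gives the fraction missing per column. The names `rows_with_gaps` and `fraction_missing` are illustrative.

```python
import pandas as pd

# Same table with missing values as above
missing_df = pd.DataFrame({
    "SampleID": ["P1", "P2", "P3", "P4"],
    "Age": [35, 42, None, 58],
    "BiomarkerA": [1.2, 1.5, 1.1, None],
    "BiomarkerB": [15.2, 16.5, 14.8, 17.1],
})

# Rows (samples) that have at least one missing value
rows_with_gaps = missing_df[missing_df.isna().any(axis=1)]
print(rows_with_gaps)

# Fraction of missing values in each column (mean of True/False flags)
fraction_missing = missing_df.isna().mean()
print(fraction_missing)
```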


**Drop the missing data**<br>
The `.dropna()` method removes rows (the default) or columns that contain missing values.

In [8]:
# Create a new DataFrame by dropping all rows with any missing values
df_dropped = missing_df.dropna()
print("DataFrame after dropping rows with missing values:")
print(df_dropped)

DataFrame after dropping rows with missing values:
  SampleID   Age  BiomarkerA  BiomarkerB
0       P1  35.0         1.2        15.2
1       P2  42.0         1.5        16.5
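
Dropping every incomplete row can discard more samples than necessary. The sketch below shows two other uses of `.dropna()` on the same table: the `subset` parameter restricts the check to chosen columns, and `axis=1` drops columns instead of rows. Treating `BiomarkerA` as the essential measurement is an assumption made only for this example.

```python
import pandas as pd

# Same table with missing values as above
missing_df = pd.DataFrame({
    "SampleID": ["P1", "P2", "P3", "P4"],
    "Age": [35, 42, None, 58],
    "BiomarkerA": [1.2, 1.5, 1.1, None],
    "BiomarkerB": [15.2, 16.5, 14.8, 17.1],
})

# Drop a row only if BiomarkerA is missing (a missing Age is tolerated)
rows_kept = missing_df.dropna(subset=["BiomarkerA"])
print(rows_kept)

# Drop every column that contains at least one missing value
cols_kept = missing_df.dropna(axis=1)
print(cols_kept)
```

With `subset`, sample P3 is kept even though its age is unknown, so three of the four samples remain instead of two.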


**Fill the Missing Data**<br>
The `.fillna()` method replaces missing values with a specified value. A common approach is to fill missing values with the mean or median of the column.

In [9]:
# Calculate the mean age and median of BiomarkerA
mean_age = missing_df["Age"].mean()
median_biomarker_a = missing_df["BiomarkerA"].median()

# Fill missing 'Age' with the mean and missing 'BiomarkerA' with the median
df_filled = missing_df.fillna(value={"Age": mean_age, "BiomarkerA": median_biomarker_a})
print("DataFrame after filling missing values:")
print(df_filled)

DataFrame after filling missing values:
  SampleID   Age  BiomarkerA  BiomarkerB
0       P1  35.0         1.2        15.2
1       P2  42.0         1.5        16.5
2       P3  45.0         1.1        14.8
3       P4  58.0         1.2        17.1
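
Once values are filled, they look like real measurements. One possible way to keep a record of what was imputed is to add a True/False flag column before filling. This is a sketch under that assumption, not the only way to document imputation; the `_imputed` column names are invented here.

```python
import pandas as pd

# Same table with missing values as above
missing_df = pd.DataFrame({
    "SampleID": ["P1", "P2", "P3", "P4"],
    "Age": [35, 42, None, 58],
    "BiomarkerA": [1.2, 1.5, 1.1, None],
    "BiomarkerB": [15.2, 16.5, 14.8, 17.1],
})

flagged = missing_df.copy()

# Record which values were missing BEFORE filling them
for col in ["Age", "BiomarkerA"]:
    flagged[f"{col}_imputed"] = flagged[col].isna()

# Fill Age with the column mean and BiomarkerA with the column median
flagged = flagged.fillna({
    "Age": missing_df["Age"].mean(),
    "BiomarkerA": missing_df["BiomarkerA"].median(),
})
print(flagged)
```

The flag columns make it possible to rerun an analysis later with the imputed samples excluded and compare the results.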


I hope you are learning something new. Keep practicing!