# **Programming for Biologists: Advanced Data Handling with NumPy, Pandas, and Matplotlib**


So far, you've learned the fundamental building blocks of Python: variables, lists, dictionaries, loops, and functions.

But what happens when your data grows? When you have thousands of gene expression values, millions of sequencing reads, or complex patient metadata spread across large tables? Using only basic Python lists can become slow, memory-intensive, and cumbersome.

This is where **Python's powerful scientific computing libraries** come into play. Think of them as specialized, high-throughput instruments in your lab – designed to handle massive amounts of data efficiently and give you sophisticated insights.

In this notebook, we'll dive deep into three essential libraries:

1.  **NumPy (Numerical Python):** The backbone for numerical operations, especially with large arrays (like matrices).
2.  **Pandas (Panel Data):** Your go-to tool for structured, tabular data (like spreadsheets or databases).
3.  **Matplotlib:** The versatile plotting library to visualize your findings.

---

## **1. NumPy: The Foundation for Numerical Operations**

NumPy is the fundamental package for scientific computing with Python. It provides support for large, multi-dimensional arrays and matrices, along with a collection of high-level mathematical functions to operate on these arrays.

### **1.1 Why We Need NumPy**

Imagine you're dealing with:
* **Gene expression matrices:** Thousands of genes by hundreds of samples, all numerical values.
* **Image data:** Microscopy images represented as grids of pixel intensities.
* **Simulation results:** Large arrays of numerical outputs from biological models.
* **Sequence alignments:** Representing scores for matches/mismatches across long sequences.

Standard Python lists are not optimized for numerical computations on large datasets. NumPy arrays, however, are specifically designed for this, offering significant **speed** and **memory efficiency** improvements.
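
To make the speed claim concrete, here is a small sketch that doubles 100,000 values twice: once with a Python list comprehension and once with a single NumPy operation. The data and variable names are made up for illustration, and the timings will differ from machine to machine, so treat the printed numbers as a rough comparison rather than a benchmark.

```python
import timeit
import numpy as np

# Illustrative data: 100,000 made-up expression values.
values_list = [float(i) for i in range(100_000)]
values_array = np.array(values_list)

# Doubling every value: a Python loop versus one vectorized NumPy operation.
doubled_list = [v * 2 for v in values_list]
doubled_array = values_array * 2

# Both approaches produce the same numbers.
print("Same result:", np.array_equal(doubled_array, np.array(doubled_list)))

# Time 20 repetitions of each approach (exact timings vary by machine).
list_time = timeit.timeit(lambda: [v * 2 for v in values_list], number=20)
array_time = timeit.timeit(lambda: values_array * 2, number=20)
print(f"List comprehension: {list_time:.4f} s")
print(f"NumPy array:        {array_time:.4f} s")
```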

### **1.2 The NumPy Array: The Core Concept**

The primary object in NumPy is the `ndarray` (N-dimensional array). It's a grid of values (of the same type) that can be indexed by tuples of non-negative integers.

To use NumPy, you first need to `import` it, usually with the alias `np`.

In [None]:
import numpy as np

# 1. Creating NumPy Arrays
print("--- Creating NumPy Arrays ---")

# From a Python list
gene_expression = np.array([10.5, 22.1, 5.8, 15.0])
print("Gene Expression Array:", gene_expression)
print("Type:", type(gene_expression))

# 2D Array (Matrix) - like a small table
protein_data = np.array([
    [100, 25, 0.5],   # Protein 1: MW, pI, Concentration
    [120, 30, 0.7],   # Protein 2
    [90,  20, 0.4]    # Protein 3
])
print("\nProtein Data Matrix:\n", protein_data)


### **1.3 Array Attributes: Understanding Your Data's Structure**

Arrays have useful attributes that tell you about their dimensions, size, and data type.

In [None]:
my_array = np.array([[1, 2, 3], [4, 5, 6]])

print("Original Array:\n", my_array)

print("\nShape (rows, columns):", my_array.shape) # (2, 3) means 2 rows, 3 columns
print("Number of dimensions:", my_array.ndim) # 2 dimensions
print("Total number of elements:", my_array.size) # 2 * 3 = 6 elements
print("Data type of elements:", my_array.dtype) # Typically int64 (64-bit integer); the default integer type can differ by platform
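
Because every element of an array shares one type, NumPy picks a type that can hold all the values you pass in. The short sketch below (with illustrative variable names) shows that mixing integers and decimals produces a floating-point array, which is what happens in the `protein_data` matrix above, and that `astype()` returns a converted copy.

```python
import numpy as np

int_array = np.array([100, 120, 90])      # integers only
mixed_array = np.array([100, 25, 0.5])    # integers and a decimal value

print("Integer array dtype:", int_array.dtype)   # an integer type
print("Mixed array dtype:", mixed_array.dtype)   # float64

# astype() returns a new array with the requested type; the original is unchanged.
as_float = int_array.astype(float)
print("Converted dtype:", as_float.dtype)        # float64
```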

### **1.4 Array Operations: Fast Math!**

NumPy allows for incredibly fast element-wise operations, meaning an operation is applied to each corresponding element in arrays. This is much faster than using loops with Python lists.

In [None]:
temp_celsius = np.array([20, 25, 30, 35]) # Temperature in Celsius

# Convert Celsius to Fahrenheit: F = C * (9/5) + 32
temp_fahrenheit = temp_celsius * (9/5) + 32
print("Celsius:", temp_celsius)
print("Fahrenheit:", temp_fahrenheit)
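
One point that often surprises newcomers: the same operator can behave differently on a list and on an array. The sketch below uses made-up readings to show that `* 2` repeats a list but doubles each element of an array.

```python
import numpy as np

readings_list = [20, 25, 30]
readings_array = np.array(readings_list)

# On a list, * repeats the sequence.
print(readings_list * 2)    # [20, 25, 30, 20, 25, 30]

# On a NumPy array, * multiplies each element.
print(readings_array * 2)   # [40 50 60]
```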



In [None]:
# Element-wise addition of two arrays (e.g., summing two sets of measurements for the same genes)
expression_a = np.array([100, 150, 200])
expression_b = np.array([10, 20, 30])
total_expression = expression_a + expression_b
print("\nExpression A:", expression_a)
print("Expression B:", expression_b)
print("Total Expression (A+B):", total_expression)



In [None]:
# Mathematical functions (e.g., log-transforming expression data)
log_expression = np.log2(expression_a)
print("\nLog2 Expression A:", log_expression)
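
Real count data often contains zeros, and `np.log2(0)` evaluates to negative infinity. A common workaround is to add a small pseudocount before taking the log. The sketch below assumes a pseudocount of 1 and uses made-up counts; the right pseudocount depends on your data.

```python
import numpy as np

# Hypothetical raw counts, including a gene with zero reads.
raw_counts = np.array([0, 1, 3, 7, 1023])

# Adding a pseudocount of 1 keeps zero counts finite, because log2(0 + 1) = 0.
log_counts = np.log2(raw_counts + 1)
print("Log2(count + 1):", log_counts)   # values 0, 1, 2, 3 and 10
```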

### **1.5 Indexing and Slicing Arrays**

Just like lists, you can access specific elements or sub-sections of arrays. NumPy offers powerful indexing techniques.

In [None]:
data_matrix = np.array([
    [10, 20, 30, 40],
    [50, 60, 70, 80],
    [90, 100, 110, 120]
])

print("Original Data Matrix:\n", data_matrix)

# Accessing a single element (row, column)
element = data_matrix[1, 2] # Row 1 (2nd row), Column 2 (3rd column) - 70
print("\nElement at (1, 2):", element)



In [None]:
# Slicing rows and columns
first_row = data_matrix[0, :]
print("\nFirst Row (all columns):", first_row)

all_rows_first_two_cols = data_matrix[:, 0:2]
print("\nAll Rows, first two Columns:\n", all_rows_first_two_cols)



In [None]:
# Conditional Indexing (very powerful for filtering data!)
high_values = data_matrix[data_matrix > 75] # Get all values greater than 75
print("\nValues greater than 75:", high_values)
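
Conditional indexing works because the comparison itself produces an array of `True`/`False` values (a Boolean mask). The sketch below rebuilds the same `data_matrix` and shows three common uses of a mask: counting matches, combining two conditions with `&`, and selecting whole rows. Note the parentheses around each condition; they are required when combining conditions.

```python
import numpy as np

data_matrix = np.array([
    [10, 20, 30, 40],
    [50, 60, 70, 80],
    [90, 100, 110, 120],
])

# The comparison returns a Boolean array with the same shape as the data.
mask = data_matrix > 75
print("Number of values > 75:", mask.sum())   # 5

# Combine conditions with & (and) or | (or).
in_range = data_matrix[(data_matrix >= 30) & (data_matrix <= 70)]
print("Values between 30 and 70:", in_range)  # 30, 40, 50, 60, 70

# Keep entire rows that contain at least one value above 75.
rows_with_high = data_matrix[mask.any(axis=1)]
print("Rows with a value > 75:\n", rows_with_high)
```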

**Your Turn! (NumPy Exercise)**

Imagine you have experimental results for 5 replicates of 3 different assays.

1.  Create a NumPy array called `assay_results` with the following data:
    `[[15, 22, 18], [20, 25, 23], [12, 19, 17], [18, 24, 20], [16, 21, 19]]`
    * The rows represent replicates, columns represent assays.
2.  Print the `shape` and `ndim` of your `assay_results` array.
3.  Calculate the **average** result for `Assay 2` (the second column) and print it.
    * *Hint:* You can select a column using slicing (e.g., `my_array[:, 1]`) and then use `np.mean()`.
4.  Identify and print all results in `assay_results` that are **greater than 20**.

In [None]:
# Write your code for NumPy Exercise here!


---

## **2. Pandas: The Data Scientist's Spreadsheet**

While NumPy is great for numerical arrays, it lacks built-in features for handling mixed data types (numbers, text, dates), or for easily labeling rows and columns like a spreadsheet. That's where **Pandas** comes in.

Pandas is a powerful library for data manipulation and analysis. It introduces two primary data structures: **Series** (1-dimensional) and **DataFrame** (2-dimensional), which are built on top of NumPy arrays.

### **2.1 Why We Need Pandas**

Much of the biological data you will work with, apart from raw sequence data, comes in a tabular format:
* **Gene expression tables:** Rows are genes, columns are samples, with expression values.
* **Patient clinical data:** Rows are patients, columns are age, diagnosis, treatment, response.
* **Proteomics data:** Protein IDs, spectral counts, quantification values.
* **Experimental metadata:** Sample names, conditions, dates, operators.

Pandas makes it incredibly easy to load, clean, transform, and analyze these types of datasets.

### **2.2 Pandas Series: A Labeled 1D Array**

A Series is like a single column in a spreadsheet or a labeled NumPy array. It can hold any data type.

In [None]:
import pandas as pd

print("--- Pandas Series ---")

# Create a Series from a list
gene_expression_series = pd.Series([120, 500, 80, 210], index=['GeneA', 'GeneB', 'GeneC', 'GeneD'])
print("Gene Expression Series:\n", gene_expression_series)
print("Type:", type(gene_expression_series))



In [None]:
# Accessing elements by label (index)
gene_b_expr = gene_expression_series['GeneB']
print("\nExpression of GeneB:", gene_b_expr)
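
Labels are more than a convenience: when you combine two Series, Pandas matches values by label rather than by position. The sketch below uses two made-up Series (`control` and `treated`) whose genes are listed in a different order, and the division still pairs each gene with itself.

```python
import pandas as pd

# Hypothetical expression values; note the different gene order in each Series.
control = pd.Series([120, 500, 80], index=['GeneA', 'GeneB', 'GeneC'])
treated = pd.Series([160, 240, 250], index=['GeneC', 'GeneA', 'GeneB'])

# Division is aligned on the index labels, not on position.
fold_change = treated / control
print("Fold change (treated / control):\n", fold_change)
```

Here `GeneA` is computed as 240 / 120, `GeneB` as 250 / 500, and `GeneC` as 160 / 80, regardless of the order in which the values were entered.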

### **2.3 Pandas DataFrame: The Heart of Data Analysis**

A DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or a SQL table.

#### **2.3.1 Creating DataFrames**

A common way to create a DataFrame is from a dictionary, where keys become column names and values are lists/arrays of data.

In [None]:
print("--- Pandas DataFrame ---")

# Clinical data for patients
data = {
    'Patient_ID': ['P001', 'P002', 'P003', 'P004', 'P005'],
    'Age': [35, 42, 58, 29, 65],
    'Gender': ['M', 'F', 'F', 'M', 'F'],
    'Diagnosis': ['Flu', 'Cancer', 'Flu', 'Healthy', 'Cancer'],
    'Treatment_Response': [8.5, 2.1, 7.9, np.nan, 3.5] # np.nan for missing data
}

patient_df = pd.DataFrame(data)
print("Patient DataFrame:\n", patient_df)
print("Type:", type(patient_df))

#### **2.3.2 Inspecting DataFrames**

Once you have a DataFrame, you'll want to get a quick overview of its contents.

In [None]:
# Display the first few rows
print("\nFirst 3 rows (head()):\n", patient_df.head(3))

# Get a concise summary of the DataFrame (data types, non-null values)
print("\nDataFrame Info (info()):")
patient_df.info()

In [None]:
# Get descriptive statistics for numerical columns
print("\nDescriptive Statistics (describe()):\n", patient_df.describe())

#### **2.3.3 Selecting Data (Columns and Rows)**

Accessing specific parts of your data is fundamental.

In [None]:
# Select a single column (returns a Series)
ages = patient_df['Age']
print("\nAges Series:\n", ages)
print("Type of 'ages':", type(ages))


In [None]:

# Select multiple columns (returns a DataFrame)
id_age_gender = patient_df[['Patient_ID', 'Age', 'Gender']]
print("\nPatient ID, Age, Gender DataFrame:\n", id_age_gender)

In [None]:
# Select rows by index (using .iloc for integer-location based indexing)
first_patient = patient_df.iloc[0]
print("\nFirst Patient (iloc[0]):\n", first_patient)



In [None]:
# Select rows by label with .loc (label-based indexing).
# With the default 0, 1, 2... index, labels and positions coincide, so .loc[0] and .iloc[0] return the same row.
# After setting 'Patient_ID' as the index, .loc can select rows directly by patient ID.
patient_df_indexed = patient_df.set_index('Patient_ID')
p003_data = patient_df_indexed.loc['P003']
print("\nData for P003 (after setting ID as index):\n", p003_data)
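
One difference between the two indexers is easy to miss when slicing: `.loc` includes the end label, while `.iloc` excludes the end position (like ordinary Python slicing). A minimal sketch with a small made-up table:

```python
import pandas as pd

patients = pd.DataFrame(
    {'Age': [35, 42, 58, 29]},
    index=['P001', 'P002', 'P003', 'P004'],
)

# Label-based slice: the end label 'P003' is included (3 rows).
by_label = patients.loc['P001':'P003']
print("loc['P001':'P003']:\n", by_label)

# Position-based slice: position 2 is excluded (2 rows).
by_position = patients.iloc[0:2]
print("\niloc[0:2]:\n", by_position)
```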

#### **2.3.4 Filtering Data (Conditional Selection)**

This is extremely powerful for biological data. You can filter rows based on conditions, similar to `if` statements, but applied across an entire column efficiently.

In [None]:
print("--- Filtering Data ---")

# Patients older than 50
old_patients = patient_df[patient_df['Age'] > 50]
print("\nPatients older than 50:\n", old_patients)



In [None]:
# Female patients
female_patients = patient_df[patient_df['Gender'] == 'F']
print("\nFemale Patients:\n", female_patients)



In [None]:
# Patients with Cancer diagnosis AND good response (>3.0)
cancer_and_good_response = patient_df[
    (patient_df['Diagnosis'] == 'Cancer') & (patient_df['Treatment_Response'] > 3.0)
]
print("\nCancer patients with good response:\n", cancer_and_good_response)
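
Beyond `&` (and), a few other tools come up constantly when filtering: `|` (or), `~` (not), and `.isin()` for matching against a list of values. The sketch below rebuilds a reduced version of the patient table so that it runs on its own.

```python
import pandas as pd

patient_df = pd.DataFrame({
    'Patient_ID': ['P001', 'P002', 'P003', 'P004', 'P005'],
    'Age': [35, 42, 58, 29, 65],
    'Diagnosis': ['Flu', 'Cancer', 'Flu', 'Healthy', 'Cancer'],
})

# OR: patients younger than 30 or older than 60
young_or_old = patient_df[(patient_df['Age'] < 30) | (patient_df['Age'] > 60)]
print("Younger than 30 or older than 60:\n", young_or_old)

# NOT: everyone without a Cancer diagnosis
not_cancer = patient_df[~(patient_df['Diagnosis'] == 'Cancer')]
print("\nNot Cancer:\n", not_cancer)

# isin: match any value from a list
flu_or_healthy = patient_df[patient_df['Diagnosis'].isin(['Flu', 'Healthy'])]
print("\nFlu or Healthy:\n", flu_or_healthy)
```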

#### **2.3.5 Handling Missing Data (`NaN`)**

Real-world biological data often has missing values. Pandas uses `NaN` (Not a Number) to represent these. You can detect and handle them.

In [None]:
print("--- Handling Missing Data ---")

# Check for missing values
print("\nMissing values per column:\n", patient_df.isnull().sum())
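
It helps to know how `NaN` behaves before deciding what to do with it. The sketch below uses the same response values as the patient table and shows three behaviors: `NaN` is never equal to anything (including itself), comparisons against `NaN` evaluate to `False`, and Pandas summary functions such as `mean()` skip missing values by default.

```python
import numpy as np
import pandas as pd

response = pd.Series([8.5, 2.1, 7.9, np.nan, 3.5])

# NaN is not equal to itself, so use isna() (or isnull()) to detect it.
print("np.nan == np.nan:", np.nan == np.nan)            # False
print("Missing values:", response.isna().sum())         # 1

# A comparison with NaN is False, so the missing row is silently excluded.
print("Responses > 3.0:", (response > 3.0).sum())       # 3

# mean() ignores NaN by default: the average of the four known values.
mean_response = response.mean()
print("Mean response:", mean_response)
```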


In [None]:
# Drop rows with any missing values
df_cleaned_rows = patient_df.dropna()
print("\nDataFrame after dropping rows with NaN:\n", df_cleaned_rows)



In [None]:
# Fill missing values (e.g., with the mean of the column, or a specific value)
mean_response = patient_df['Treatment_Response'].mean()
df_filled = patient_df.fillna(value={'Treatment_Response': mean_response})
print("\nDataFrame after filling NaN in Treatment_Response with mean:\n", df_filled)

#### **2.3.6 Grouping and Aggregation (e.g., Averages by Group)**

This is powerful for summarizing data, like finding the average expression level for genes in different pathways, or average patient age by diagnosis.

The `groupby()` method is one of the most frequently used Pandas operations.

In [None]:
print("--- Grouping Data ---")

# Average age by gender
avg_age_by_gender = patient_df.groupby('Gender')['Age'].mean()
print("\nAverage Age by Gender:\n", avg_age_by_gender)



In [None]:
# Average treatment response by diagnosis
avg_response_by_diagnosis = patient_df.groupby('Diagnosis')['Treatment_Response'].mean()
print("\nAverage Treatment Response by Diagnosis:\n", avg_response_by_diagnosis)
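
You are not limited to one statistic per group. As a sketch, `.agg()` accepts a list of function names and returns one column per statistic. The example rebuilds a reduced patient table so it runs on its own. It also shows a detail worth knowing: a group whose only value is missing (here `Healthy`) gets `NaN` as its mean.

```python
import numpy as np
import pandas as pd

patient_df = pd.DataFrame({
    'Age': [35, 42, 58, 29, 65],
    'Diagnosis': ['Flu', 'Cancer', 'Flu', 'Healthy', 'Cancer'],
    'Treatment_Response': [8.5, 2.1, 7.9, np.nan, 3.5],
})

# Several statistics at once for one column
age_summary = patient_df.groupby('Diagnosis')['Age'].agg(['mean', 'min', 'max', 'count'])
print("Age summary by diagnosis:\n", age_summary)

# The Healthy group has no recorded response, so its mean is NaN.
response_by_diagnosis = patient_df.groupby('Diagnosis')['Treatment_Response'].mean()
print("\nMean response by diagnosis:\n", response_by_diagnosis)
```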

**Your Turn! (Pandas Exercise)**

You have gene expression data from an experiment. `GeneA` and `GeneB` are controls, `GeneC` and `GeneD` are experimental.

1.  Create a Pandas DataFrame called `gene_expression_df` from the following data:
    ```python
    data = {
        'Gene_ID': ['GeneA', 'GeneB', 'GeneC', 'GeneD', 'GeneE', 'GeneF'],
        'Sample1_Expr': [100, 250, 50, 300, 120, 80],
        'Sample2_Expr': [110, 240, 60, 310, 130, 90],
        'Sample3_Expr': [95, 260, 45, 290, 115, 75],
        'Category': ['Control', 'Control', 'Experimental', 'Experimental', 'Control', 'Experimental']
    }
    ```
2.  Print the `info()` and `describe()` of your DataFrame.
3.  Filter the DataFrame to show only `Experimental` genes and print this filtered DataFrame.
4.  Calculate the **average expression** for each gene across all three samples (`Sample1_Expr`, `Sample2_Expr`, `Sample3_Expr`). Store this in a new column called `Average_Expr` in your DataFrame and print the updated DataFrame.
    *Hint: You can sum the three expression columns and divide by 3, or use `df[['col1', 'col2', 'col3']].mean(axis=1)` to get the mean row-wise.*
5.  Group the DataFrame by `Category` and calculate the average `Average_Expr` for each category. Print this result.

In [None]:
# Write your code for Pandas Exercise here!



---

## **3. Matplotlib: Visualizing Your Data**

After you've cleaned and analyzed your data, you need to communicate your findings effectively. **Matplotlib** is one of the most widely used plotting libraries in Python, capable of creating a wide range of static, animated, and interactive visualizations.

### **3.1 Why Biologists Need Matplotlib**

Visualizations are critical for:
* **Exploring data:** Quickly identify patterns, outliers, or distributions.
* **Presenting results:** Create publication-quality figures for papers, posters, and presentations.
* **Quality control:** Spot potential issues in your raw data (e.g., uneven sequencing depth, batch effects).

We usually import `matplotlib.pyplot` (the plotting interface) as `plt`.

In [None]:
import matplotlib.pyplot as plt
import numpy as np # Already imported above; repeated here as a reminder
import pandas as pd # Already imported above; repeated here as a reminder

# Ensure plots appear directly in the notebook
%matplotlib inline

print("--- Basic Matplotlib Plots ---")

### **3.2 Common Plot Types for Biology**

#### **3.2.1 Line Plots (e.g., Time Series, Dose-Response)**

Useful for showing trends over time or continuous variables.

In [None]:
# Simulated bacterial growth curve
time_points = np.array([0, 2, 4, 6, 8, 10, 12]) # hours
bacterial_growth = np.array([10, 30, 90, 270, 810, 2430, 7290]) # colony forming units (CFU)

plt.figure(figsize=(8, 5)) # Create a new figure, 8 inches wide by 5 inches tall
plt.plot(time_points, bacterial_growth, marker='o', linestyle='-', color='green')
plt.title('Bacterial Growth Over Time')
plt.xlabel('Time (hours)')
plt.ylabel('Bacterial Count (CFU)')
plt.grid(True) # Add a grid for readability
plt.show()
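
Exponential growth is hard to read on a linear axis because the early time points are squashed near zero. One option, sketched below with the same simulated data, is to switch the y-axis to a log scale with `plt.yscale('log')`, which turns exponential growth into a straight line. The doubling-time estimate at the end is an added illustration: it fits a straight line to the log2-transformed counts with `np.polyfit`, and it assumes growth is exponential over the whole time range.

```python
import numpy as np
import matplotlib.pyplot as plt

time_points = np.array([0, 2, 4, 6, 8, 10, 12])                    # hours
bacterial_growth = np.array([10, 30, 90, 270, 810, 2430, 7290])    # CFU

plt.figure(figsize=(8, 5))
plt.plot(time_points, bacterial_growth, marker='o', linestyle='-', color='green')
plt.yscale('log')   # exponential growth appears as a straight line
plt.title('Bacterial Growth Over Time (log scale)')
plt.xlabel('Time (hours)')
plt.ylabel('Bacterial Count (CFU)')
plt.grid(True)
plt.show()

# Fit a line to log2(count) versus time; the slope is doublings per hour.
slope, intercept = np.polyfit(time_points, np.log2(bacterial_growth), 1)
doubling_time = 1 / slope
print(f"Estimated doubling time: {doubling_time:.2f} hours")
```

Because the simulated counts triple every 2 hours, the estimate comes out at about 1.26 hours.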

#### **3.2.2 Scatter Plots (e.g., Correlation between Gene Expression)**

Show the relationship between two numerical variables.

In [None]:
# Simulated expression of two genes in different samples
gene_x_expr = np.array([10, 15, 20, 25, 30, 35, 40, 45, 50])
gene_y_expr = np.array([12, 18, 21, 28, 32, 38, 43, 48, 55]) + np.random.normal(0, 2, 9) # Add some noise

plt.figure(figsize=(8, 6))
plt.scatter(gene_x_expr, gene_y_expr, color='blue', alpha=0.7)
plt.title('Correlation of Gene X vs Gene Y Expression')
plt.xlabel('Gene X Expression Level')
plt.ylabel('Gene Y Expression Level')
plt.grid(True)
plt.show()
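
A scatter plot shows a relationship; a correlation coefficient puts a number on it. The sketch below computes the Pearson correlation with `np.corrcoef`, which returns a 2x2 matrix whose off-diagonal entry is the correlation between the two inputs. It uses a seeded random generator (`np.random.default_rng`, with an arbitrary seed of 0) so the noise, and therefore the result, is the same on every run. That is a choice made here for reproducibility, not something the plot above requires.

```python
import numpy as np

# Seeded generator so the simulated noise is reproducible.
rng = np.random.default_rng(seed=0)

gene_x_expr = np.array([10, 15, 20, 25, 30, 35, 40, 45, 50])
gene_y_expr = np.array([12, 18, 21, 28, 32, 38, 43, 48, 55]) + rng.normal(0, 2, size=9)

# corrcoef returns a 2x2 correlation matrix; [0, 1] is the x-versus-y entry.
correlation_matrix = np.corrcoef(gene_x_expr, gene_y_expr)
pearson_r = correlation_matrix[0, 1]
print(f"Pearson correlation between Gene X and Gene Y: {pearson_r:.3f}")
```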

#### **3.2.3 Histograms (e.g., Distribution of Read Lengths, Expression Values)**

Display the distribution of a single numerical variable.

In [None]:
# Simulated distribution of sequencing read lengths
read_lengths = np.random.normal(loc=150, scale=20, size=1000) # Mean 150, Std Dev 20, 1000 reads
read_lengths = read_lengths[read_lengths > 0] # Remove any negative lengths just in case

plt.figure(figsize=(9, 6))
plt.hist(read_lengths, bins=20, color='purple', edgecolor='black', alpha=0.7)
plt.title('Distribution of Sequencing Read Lengths')
plt.xlabel('Read Length (bp)')
plt.ylabel('Frequency')
plt.grid(axis='y', alpha=0.75)
plt.show()

#### **3.2.4 Bar Charts (e.g., Counts of Cell Types, Average Values per Group)**

Compare categorical data or aggregated numerical data.

In [None]:
# Counts of different cell types in a sample
cell_types = ['T-cell', 'B-cell', 'Macrophage', 'Neutrophil']
cell_counts = [2500, 1800, 1200, 3000]

plt.figure(figsize=(9, 6))
plt.bar(cell_types, cell_counts, color=['skyblue', 'lightcoral', 'lightgreen', 'gold'])
plt.title('Cell Type Abundance in Sample')
plt.xlabel('Cell Type')
plt.ylabel('Count')
plt.grid(axis='y', alpha=0.75)
plt.show()

### **3.3 Customizing Plots: Making Them Publication-Ready**

Matplotlib offers extensive customization. Here are some basics:
* `plt.figure(figsize=(width, height))`: Set plot size.
* `plt.title()`, `plt.xlabel()`, `plt.ylabel()`: Add labels.
* `plt.legend()`: Add a legend if you have multiple lines/elements.
* `color`, `marker`, `linestyle`, `alpha` (transparency) arguments in plot functions.
* `plt.xlim()`, `plt.ylim()`: Set axis limits.
* `plt.xticks()`, `plt.yticks()`: Customize tick marks.
* `plt.grid()`: Add a grid.
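
The options above can be combined in a single figure. The sketch below plots two made-up growth curves (the values and the `control`/`treated` names are purely illustrative), labels each line so `plt.legend()` can describe it, sets axis limits and ticks, and saves a high-resolution copy with `plt.savefig()`. The file name and the 300 dpi setting are arbitrary choices; check your target journal's requirements.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up growth measurements for two hypothetical conditions.
time_points = np.array([0, 2, 4, 6, 8])
control = np.array([10, 18, 35, 60, 110])
treated = np.array([10, 14, 20, 27, 35])

plt.figure(figsize=(8, 5))
plt.plot(time_points, control, marker='o', linestyle='-', color='green', label='Control')
plt.plot(time_points, treated, marker='s', linestyle='--', color='red', alpha=0.8, label='Treated')
plt.title('Growth Under Two Conditions')
plt.xlabel('Time (hours)')
plt.ylabel('Bacterial Count (CFU)')
plt.xlim(0, 8)
plt.ylim(0, 120)
plt.xticks(time_points)   # one tick per measured time point
plt.legend()              # uses the label= text given to each plot call
plt.grid(True)

# Save before calling plt.show(); the file name here is arbitrary.
plt.savefig('growth_comparison.png', dpi=300, bbox_inches='tight')
plt.show()
```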

**Your Turn! (Matplotlib Exercise)**

Using the `gene_expression_df` from your Pandas exercise, create a bar chart.

1.  Use the `Gene_ID` column for your x-axis categories.
2.  Use the `Average_Expr` column for your y-axis values.
3.  Add a title: "Average Gene Expression Levels".
4.  Label the x-axis: "Gene ID".
5.  Label the y-axis: "Average Expression".
6.  Make the bars a different color (e.g., `'skyblue'` or `'teal'`).

In [None]:
# Write your code for Matplotlib Exercise here!

# Ensure gene_expression_df is available from the previous exercise
if 'gene_expression_df' not in locals():
    print("Please run the Pandas Exercise cell first to create 'gene_expression_df'.")
else:
    pass  # Replace this line with your plotting code (an else block cannot be left empty)

---

## **4. Putting It All Together: A Workflow Example**

Let's combine these libraries to simulate a common biological data analysis task: analyzing patient sample data.

Scenario: We have patient data including age, blood pressure, and a new biomarker level. We want to:
1.  Load the data.
2.  Calculate some summary statistics.
3.  Filter based on conditions.
4.  Visualize the relationship between blood pressure and the biomarker.

In [None]:
# 0. Setup: create a small example CSV file (stands in for a real data file)
csv_content = """
Patient_ID,Age,Blood_Pressure,Biomarker_Level,Treatment_Group
P001,45,120,5.2,A
P002,62,145,8.1,B
P003,30,110,4.5,A
P004,58,135,7.0,B
P005,39,125,6.1,A
P006,70,150,9.2,B
P007,28,105,3.9,A
P008,50,130,6.5,B
"""
with open("patient_data.csv", "w") as f:
    f.write(csv_content)

print("--- Integrated Workflow Example ---")

# 1. Load Data with Pandas
patient_data_df = pd.read_csv("patient_data.csv")
print("\nOriginal Patient Data:\n", patient_data_df)

# 2. Calculate Summary Statistics (using Pandas, which often uses NumPy internally)
print("\nDescriptive statistics for numerical columns:\n", patient_data_df[['Age', 'Blood_Pressure', 'Biomarker_Level']].describe())

# 3. Filter Data: Patients in Treatment Group 'B' with high blood pressure (>130)
high_bp_group_b = patient_data_df[
    (patient_data_df['Treatment_Group'] == 'B') & 
    (patient_data_df['Blood_Pressure'] > 130)
]
print("\nPatients in Group B with high BP:\n", high_bp_group_b)

# 4. Visualize Data: Scatter plot of Blood Pressure vs. Biomarker Level, colored by Treatment Group
plt.figure(figsize=(10, 7))

# Plot points for Group A
group_a = patient_data_df[patient_data_df['Treatment_Group'] == 'A']
plt.scatter(group_a['Blood_Pressure'], group_a['Biomarker_Level'], 
            color='blue', label='Treatment Group A', alpha=0.7, s=100) # s for size

# Plot points for Group B
group_b = patient_data_df[patient_data_df['Treatment_Group'] == 'B']
plt.scatter(group_b['Blood_Pressure'], group_b['Biomarker_Level'], 
            color='red', label='Treatment Group B', alpha=0.7, s=100)

plt.title('Blood Pressure vs. Biomarker Level by Treatment Group')
plt.xlabel('Blood Pressure (mmHg)')
plt.ylabel('Biomarker Level')
plt.legend()
plt.grid(True)
plt.show()
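
As a possible follow-up to the workflow, the sketch below summarizes the same patient table numerically: the mean biomarker level per treatment group (Pandas `groupby`) and the Pearson correlation between blood pressure and biomarker level (NumPy `corrcoef`). To stay self-contained it reads the CSV text from memory with `io.StringIO` instead of from `patient_data.csv`; that substitution is made only for this example.

```python
import io
import numpy as np
import pandas as pd

# Same rows as patient_data.csv, read from an in-memory string.
csv_text = """Patient_ID,Age,Blood_Pressure,Biomarker_Level,Treatment_Group
P001,45,120,5.2,A
P002,62,145,8.1,B
P003,30,110,4.5,A
P004,58,135,7.0,B
P005,39,125,6.1,A
P006,70,150,9.2,B
P007,28,105,3.9,A
P008,50,130,6.5,B
"""
patient_data_df = pd.read_csv(io.StringIO(csv_text))

# Mean biomarker level per treatment group
mean_biomarker = patient_data_df.groupby('Treatment_Group')['Biomarker_Level'].mean()
print("Mean biomarker level by group:\n", mean_biomarker)

# Pearson correlation between blood pressure and biomarker level
bp_biomarker_r = np.corrcoef(
    patient_data_df['Blood_Pressure'],
    patient_data_df['Biomarker_Level'],
)[0, 1]
print(f"\nCorrelation (blood pressure vs. biomarker): {bp_biomarker_r:.3f}")
```

Keep in mind that the two groups also differ in age and blood pressure in this toy dataset, so a difference in group means is not, on its own, evidence of a treatment effect.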

---