# Parkinson’s Disease Progression and Gene Expression Analysis
**Dataset:** GSE49036 (NCBI GEO)

This notebook investigates gene expression changes in the substantia nigra across stages of Parkinson’s disease progression, as defined by Braak staging: controls (Control), Braak stages 1-2 (BR12), stages 3-4 (BR34), and stages 5-6 (BR56). Using microarray data from GEO dataset GSE49036, we perform differential expression analysis and trend-based modeling to identify genes whose expression varies with disease severity.

## Step 1: Load and Inspect the Data

In [2]:
import pandas as pd
import numpy as np

# Load the GEO series matrix; metadata lines start with '!' and are skipped
df = pd.read_csv("data/GSE49036_series_matrix.txt", sep='\t', comment='!', index_col=0)

# Transpose so that rows are samples (GSM IDs) and columns are probe IDs
df = df.T
df.head()

ID_REF,1007_s_at,1053_at,117_at,121_at,1255_g_at,1294_at,1316_at,1320_at,1405_i_at,1431_at,...,AFFX-r2-Ec-bioD-3_at,AFFX-r2-Ec-bioD-5_at,AFFX-r2-P1-cre-3_at,AFFX-r2-P1-cre-5_at,AFFX-ThrX-3_at,AFFX-ThrX-5_at,AFFX-ThrX-M_at,AFFX-TrpnX-3_at,AFFX-TrpnX-5_at,AFFX-TrpnX-M_at
GSM1192691,10.5641,6.60376,5.68355,7.91604,5.16548,6.85336,5.53094,4.72669,5.27655,5.00814,...,12.4445,11.8873,13.9853,13.7274,11.048,7.35552,9.09508,2.8761,3.451,3.10766
GSM1192692,11.1759,6.12394,5.82604,8.10854,4.53927,7.20813,5.75089,4.61105,4.85065,4.89116,...,12.7364,12.2142,13.9713,13.7999,4.35998,3.41204,3.0972,2.7705,3.33565,3.16542
GSM1192693,10.4751,6.77599,5.92443,7.86884,5.90801,6.41755,5.48822,4.4511,4.81965,4.97457,...,12.2786,11.7375,13.854,13.6569,4.24409,3.6679,3.03463,2.63667,3.29663,3.3041
GSM1192694,10.9885,6.43294,5.86894,7.92703,4.67369,6.94416,5.86956,4.52485,4.80454,4.78318,...,12.8319,12.3234,14.0776,13.9672,4.42594,3.6976,3.30814,2.91511,3.42633,3.15808
GSM1192695,10.1362,6.87867,5.68709,8.02081,5.37847,6.73202,5.3767,4.90069,4.47104,5.26074,...,11.9782,11.4839,13.7716,13.4899,4.20578,3.87063,3.14483,2.60828,3.31189,3.24647
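
To make the loading step easier to follow, here is a minimal sketch of what `comment='!'` and the transpose do, run on a made-up miniature file in series-matrix layout. The sample IDs `GSM1` and `GSM2` and all values are hypothetical; only the `read_csv` arguments mirror the cell above.

```python
import io
import pandas as pd

# Hypothetical three-probe, two-sample excerpt in series-matrix layout.
toy_matrix = (
    '!Series_title\t"toy example"\n'
    '!series_matrix_table_begin\n'
    '"ID_REF"\t"GSM1"\t"GSM2"\n'
    '"1007_s_at"\t10.5\t11.1\n'
    '"1053_at"\t6.6\t6.1\n'
    '"117_at"\t5.7\t5.8\n'
    '!series_matrix_table_end\n'
)

# Lines starting with '!' are dropped, so the first remaining line is the header.
toy = pd.read_csv(io.StringIO(toy_matrix), sep='\t', comment='!', index_col=0).T

# After the transpose, rows are samples and columns are probes.
assert toy.index.str.startswith('GSM').all()
assert toy.notna().all().all()
print(toy.shape)
```

The two assertions are cheap sanity checks that could also be run on the real `df` before any labels are attached.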


## Step 2: Assign Braak Stage Labels

In [None]:
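# Assign Braak stage labels by position.
# Assumes the samples in the series matrix are ordered
# Control (8), BR12 (5), BR34 (7), BR56 (8).
group_labels = (
    ['Control'] * 8 +
    ['BR12'] * 5 +
    ['BR34'] * 7 +
    ['BR56'] * 8
)
assert len(group_labels) == len(df), "label count must match sample count"
df['Group'] = group_labels
print(df['Group'].value_counts())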
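
Positional labeling is easy to get wrong without noticing, so a small guard can help. The sketch below wraps the same group sizes in a hypothetical helper, `make_stage_labels`, that checks the sample count and returns an ordered categorical so the stages keep their biological order in later tables and plots. It still assumes the samples are ordered by stage, as in the cell above.

```python
import pandas as pd

# Stage order and group sizes copied from the cell above; the helper name is hypothetical.
STAGE_COUNTS = {'Control': 8, 'BR12': 5, 'BR34': 7, 'BR56': 8}

def make_stage_labels(n_samples, stage_counts=STAGE_COUNTS):
    """Return an ordered Categorical of stage labels, one per sample row."""
    labels = [stage for stage, n in stage_counts.items() for _ in range(n)]
    if len(labels) != n_samples:
        raise ValueError(f"expected {len(labels)} samples, got {n_samples}")
    return pd.Categorical(labels, categories=list(stage_counts), ordered=True)

labels = make_stage_labels(28)
# sort=False keeps the category order instead of sorting by count.
print(pd.Series(labels).value_counts(sort=False))
```

In the notebook this could be used as `df['Group'] = make_stage_labels(len(df))`; comparisons such as `df['Group'] == 'Control'` work the same way on a categorical column.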

## Step 3: Differential Expression Analysis using ANOVA

In [None]:
from scipy.stats import f_oneway

# Split the DataFrame into groups
grp_ctrl = df[df['Group'] == 'Control']
grp_br12 = df[df['Group'] == 'BR12']
grp_br34 = df[df['Group'] == 'BR34']
grp_br56 = df[df['Group'] == 'BR56']

# Run a one-way ANOVA per probe across the four Braak groups.
# The p-values here are unadjusted for multiple testing.
results = []
for gene in df.columns.drop('Group'):
    stat, pval = f_oneway(
        grp_ctrl[gene],
        grp_br12[gene],
        grp_br34[gene],
        grp_br56[gene]
    )
    results.append({'gene': gene, 'F_stat': stat, 'p_value': pval})

results_df = pd.DataFrame(results)
results_df['-log10(p_value)'] = -np.log10(results_df['p_value'])
results_df = results_df.sort_values('p_value')
results_df.head()
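
Because the ANOVA is run once per probe, the raw p-values are usually adjusted for multiple testing before probes are called significant. The notebook does not include this step; the following is a minimal sketch of the Benjamini-Hochberg procedure in plain NumPy. The function name `benjamini_hochberg` is hypothetical, and the sketch assumes the input contains no NaN values.

```python
import numpy as np

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)
    # Scale each sorted p-value by n / rank.
    ranked = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity from the largest p-value downward, then cap at 1.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    q = np.empty(n)
    q[order] = np.clip(ranked, 0, 1)
    return q

# Toy p-values, not taken from the dataset.
toy_p = [0.001, 0.02, 0.03, 0.5]
print(benjamini_hochberg(toy_p))
```

Applied to the notebook, this might look like `results_df['q_value'] = benjamini_hochberg(results_df['p_value'])`, after dropping any probes whose p-value is NaN (for example, probes with constant expression).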

## Step 4: Load Annotation and Merge Gene Symbols

In [None]:
# Load annotation file
anno = pd.read_csv("data/GPL570-annot.txt", sep="\t", skiprows=16)
anno = anno[['ID', 'Gene Symbol']]
anno.columns = ['ID_REF', 'Gene_Symbol']

# Merge gene symbols into results_df, then drop the duplicate probe ID column
results_df = results_df.merge(anno, left_on='gene', right_on='ID_REF', how='left').drop(columns='ID_REF')
results_df[['gene', 'Gene_Symbol', 'p_value']].head()
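
A left merge silently leaves `NaN` for probes with no annotation row, so it can be worth checking how many probes went unmatched. The sketch below uses pandas' `indicator=True` option on toy stand-ins for `results_df` and `anno`. The toy values, the `Primary_Symbol` column, and the assumption that multiple symbols are separated by ` /// ` are illustrative and should be checked against the actual annotation file.

```python
import pandas as pd

# Toy stand-ins for results_df and anno; the values are illustrative only.
toy_results = pd.DataFrame({'gene': ['1007_s_at', '1053_at', 'AFFX-ThrX-3_at'],
                            'p_value': [0.01, 0.20, 0.90]})
toy_anno = pd.DataFrame({'ID_REF': ['1007_s_at', '1053_at'],
                         'Gene_Symbol': ['DDR1 /// MIR4640', 'RFC2']})

# indicator=True adds a '_merge' column recording where each row came from.
merged = toy_results.merge(toy_anno, left_on='gene', right_on='ID_REF',
                           how='left', indicator=True)

# Probes missing from the annotation table.
unmatched = merged.loc[merged['_merge'] == 'left_only', 'gene'].tolist()

# Keep the first symbol when several are listed for one probe.
merged['Primary_Symbol'] = merged['Gene_Symbol'].str.split(' /// ').str[0]
print(unmatched)
```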

## Step 5: Visualize Expression for a Selected Gene

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

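# Select a probe to visualize
gene = '1007_s_at'  # Replace with any probe ID
stage_order = ['Control', 'BR12', 'BR34', 'BR56']  # Plot stages in order of severity
sns.boxplot(x='Group', y=gene, data=df, order=stage_order)
plt.title(f"Expression of {gene} Across Braak Stages")
plt.show()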
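
The introduction refers to trend-based modeling, which the ANOVA alone does not provide, since it ignores the order of the stages. One simple option, shown here only as a sketch, is a Spearman correlation between expression and an ordinal stage rank. The rank coding in `STAGE_RANK`, the helper `stage_trend`, and the synthetic values are assumptions for illustration, not the notebook's own method.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

# Ordinal coding of the stages; this coding is an assumption, not from the notebook.
STAGE_RANK = {'Control': 0, 'BR12': 1, 'BR34': 2, 'BR56': 3}

def stage_trend(expr, groups):
    """Spearman correlation between expression values and stage rank."""
    ranks = pd.Series(groups).map(STAGE_RANK).to_numpy()
    rho, pval = spearmanr(ranks, np.asarray(expr, dtype=float))
    return rho, pval

# Synthetic example: expression falls steadily from Control to BR56.
toy_groups = ['Control'] * 3 + ['BR12'] * 3 + ['BR34'] * 3 + ['BR56'] * 3
toy_expr = [9.0, 9.1, 8.9, 8.0, 8.2, 7.9, 7.1, 7.0, 6.9, 6.0, 6.1, 5.8]
rho, pval = stage_trend(toy_expr, toy_groups)
print(f"rho={rho:.2f}")
```

On the real data this could be called per probe as `stage_trend(df[gene], df['Group'])`; a negative `rho` suggests expression decreasing with stage and a positive one suggests an increase.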