<a href="https://colab.research.google.com/github/mrandrivan/ML-DL-AI-practice/blob/main/Genetics_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [5]:
import pandas as pd
import allel  # or use PyVCF

# Load clinical/phenotype data
phenotype_data = pd.read_csv('clinical_trial_data.csv')

# Load genetic data from a VCF file
vcf_data = allel.read_vcf('genetic_data.vcf')

# Preview the clinical data
print(phenotype_data.head())

# Preview the genetic data (this will give you genotype information)
genotypes = vcf_data['calldata/GT']
print(genotypes.shape)


TypeError: read_csv() got an unexpected keyword argument 'error_bad_lines'

**Step 2: Perform Quality Control on Genetic Data**

Ensure the data is clean before analysis. This can include filtering for minor allele frequency (MAF), removing individuals with low genotyping rates, or filtering based on Hardy-Weinberg Equilibrium (HWE).

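Key Concepts:
Minor Allele Frequency (MAF): The frequency of the less common allele at a SNP. SNPs below a chosen threshold (e.g., 0.05) are usually dropped because rare variants give association tests little power.
Hardy-Weinberg Equilibrium (HWE): Tests whether the observed genotype frequencies match those expected from the allele frequencies. Strong deviations often point to genotyping errors.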
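
For a biallelic SNP, both checks can be written out explicitly. The notation is introduced here for illustration only: $n_A$ and $n_a$ are allele counts, $n_{AA}$, $n_{Aa}$, $n_{aa}$ are genotype counts, and $N$ is the number of genotyped individuals.

$$
\mathrm{MAF} = \frac{\min(n_A,\, n_a)}{n_A + n_a}
$$

$$
p = \frac{2 n_{AA} + n_{Aa}}{2N}, \qquad q = 1 - p, \qquad
E_{AA} = p^2 N, \quad E_{Aa} = 2pqN, \quad E_{aa} = q^2 N
$$

$$
\chi^2 = \sum_{g \in \{AA,\, Aa,\, aa\}} \frac{(n_g - E_g)^2}{E_g}
$$

Under HWE the statistic is approximately chi-square distributed with 1 degree of freedom (three genotype classes, minus one for the total, minus one for the estimated allele frequency).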

In [None]:
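import numpy as np
from scipy.stats import chi2

# Wrap the raw calls; shape is (n_variants, n_samples, ploidy)
gt = allel.GenotypeArray(genotypes)

# Calculate allele counts
allele_counts = gt.count_alleles()

# Filter SNPs based on MAF (e.g., MAF > 0.05); the minor allele is the rarer of ref and first alt
freqs = allele_counts.to_frequencies()
maf = np.minimum(freqs[:, 0], freqs[:, 1])
filtered_snps = gt.compress(maf > 0.05, axis=0)

# HWE chi-square test for one SNP (here the first; assumes it is biallelic and polymorphic)
obs = np.array([gt.count_hom_ref(axis=1)[0], gt.count_het(axis=1)[0], gt.count_hom_alt(axis=1)[0]])
n = obs.sum()
p_ref = (2 * obs[0] + obs[1]) / (2 * n)  # reference allele frequency
expected = n * np.array([p_ref**2, 2 * p_ref * (1 - p_ref), (1 - p_ref)**2])
chi2_stat = ((obs - expected) ** 2 / expected).sum()
p_value = chi2.sf(chi2_stat, df=1)  # 1 degree of freedom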


**Step 3: Perform Association Tests (GWAS-like Analysis)**
Now, you want to find out if any SNPs are associated with the treatment response. This is a simple genotype-phenotype association analysis.

Statistical Test:
Chi-square test or logistic regression to test for an association between genetic variants (SNPs) and the treatment response.

In [None]:
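from sklearn.linear_model import LogisticRegression
from scipy.stats import chi2_contingency

# Extract SNP data and corresponding patient phenotypes (treatment response)
# Encode each genotype as its number of alternate alleles (0, 1, or 2; missing calls become 0),
# then transpose so that rows are patients and columns are SNPs
X = allel.GenotypeArray(genotypes).to_n_alt().T  # SNP data as features
# Assumes the rows of phenotype_data are in the same order as vcf_data['samples']
y = phenotype_data['treatment_response'].to_numpy()  # Binary outcome (e.g., 1 for response, 0 for no response)

# Perform Logistic Regression (all SNPs fitted jointly)
model = LogisticRegression()
model.fit(X, y)

# For each SNP, you could also perform a Chi-square test
p_values = []  # collected for multiple-testing correction
for snp in X.T:
    contingency_table = pd.crosstab(snp, y)
    chi2, p, _, _ = chi2_contingency(contingency_table)
    p_values.append(p)
    print(f'Chi-square: {chi2}, P-value: {p}')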
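
The step from raw genotype calls to a patients-by-SNPs matrix is easy to get wrong, so here is a small self-contained sketch of it in plain NumPy. It assumes biallelic SNPs with no missing calls. `toy_genotypes`, `toy_response`, and `snp_chi2` are made-up names and data for illustration, not part of the clinical dataset.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Toy genotype calls with shape (n_variants, n_samples, ploidy); 0 = ref, 1 = alt.
# All values are invented for illustration.
toy_genotypes = np.array([
    [[0, 0], [0, 1], [1, 1], [0, 1], [0, 0], [1, 1]],  # SNP 0
    [[0, 0], [0, 0], [0, 1], [0, 0], [0, 1], [0, 0]],  # SNP 1
])
toy_response = np.array([0, 1, 1, 1, 0, 1])  # hypothetical binary outcome per sample

# Count alt alleles per call, then transpose to (n_samples, n_variants)
dosage = toy_genotypes.sum(axis=2).T


def snp_chi2(dosage_column, response):
    """Chi-square test of independence between genotype (0/1/2) and a binary response."""
    table = np.zeros((3, 2), dtype=int)
    for g, r in zip(dosage_column, response):
        table[g, r] += 1
    # Drop genotype classes that never occur; all-zero rows would make the test fail
    table = table[table.sum(axis=1) > 0]
    stat, p, dof, _ = chi2_contingency(table)
    return stat, p, dof


stat, p, dof = snp_chi2(dosage[:, 0], toy_response)
print(f'SNP 0: chi2={stat:.3f}, dof={dof}, p={p:.4f}')
```

With only six samples the chi-square approximation is unreliable; the toy data is there to show the shapes and the table layout, not to produce a meaningful p-value.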


**Step 4: Correct for Multiple Testing**

Since many SNPs are being tested, it's essential to apply a multiple testing correction (like Bonferroni or FDR) to avoid false positives.
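
To make the difference between the two corrections concrete, here is a minimal sketch in plain NumPy. The helper `benjamini_hochberg` and the p-values are illustrative assumptions; the function is a hand-written version of the Benjamini-Hochberg step-up adjustment, included only to show what an FDR correction computes.

```python
import numpy as np


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Scale the i-th smallest p-value by m / i
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity from the largest rank downwards, then cap at 1
    adjusted_sorted = np.minimum(np.minimum.accumulate(scaled[::-1])[::-1], 1.0)
    adjusted = np.empty(m)
    adjusted[order] = adjusted_sorted
    return adjusted


toy_p = np.array([0.001, 0.008, 0.03, 0.029, 0.60])  # made-up p-values

# Bonferroni: multiply every p-value by the number of tests, cap at 1
bonferroni = np.minimum(toy_p * toy_p.size, 1.0)
bh = benjamini_hochberg(toy_p)

print('Bonferroni significant at 0.05:', int((bonferroni < 0.05).sum()))
print('BH (FDR) significant at 0.05:  ', int((bh < 0.05).sum()))
```

Bonferroni controls the chance of even one false positive and is therefore stricter; Benjamini-Hochberg controls the expected share of false positives among the reported hits, which usually retains more SNPs.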

In [None]:
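from statsmodels.stats.multitest import multipletests

# Assume p-values from the Chi-square tests are stored in p_values list
# multipletests returns a tuple: (reject flags, corrected p-values, Sidak alpha, Bonferroni alpha)
reject, corrected_p_values, _, _ = multipletests(p_values, alpha=0.05, method='fdr_bh')  # FDR correction
print(corrected_p_values)
print(f'{reject.sum()} SNPs significant at FDR 0.05')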


**Step 6: Pathway or Functional Analysis**

After identifying significant SNPs or genes, you can perform pathway enrichment analysis to understand if these genes are involved in specific biological processes.
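
One common way to test for enrichment is a one-sided hypergeometric test (over-representation analysis). The sketch below assumes you already have gene counts in hand; the function name `enrichment_p_value` and every number are hypothetical, and mapping SNPs to genes and genes to pathways is not shown.

```python
from scipy.stats import hypergeom


def enrichment_p_value(n_background, n_in_pathway, n_hits, n_hits_in_pathway):
    """One-sided hypergeometric p-value for over-representation of a pathway among hits."""
    # P(X >= n_hits_in_pathway) when n_hits genes are drawn without replacement
    # from n_background genes, of which n_in_pathway belong to the pathway
    return hypergeom.sf(n_hits_in_pathway - 1, n_background, n_in_pathway, n_hits)


# Hypothetical counts: 20,000 background genes, 100 in the pathway,
# 50 significant genes, 5 of which fall in the pathway
p_enrich = enrichment_p_value(
    n_background=20000, n_in_pathway=100, n_hits=50, n_hits_in_pathway=5
)
print(f'Enrichment p-value: {p_enrich:.2e}')
```

When many pathways are tested, these p-values need the same multiple testing correction as the SNP-level tests.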