# Data wrangling Illumina microarray data and Allen BrainSpan data
In this notebook I process annotations for the *Illumina HumanHT-12 V4.0* microarray, along with data from the 
Allen Brain Institute's BrainSpan developmental transcriptome. The goal of this notebook is to produce a tab-separated value file containing the Illumina probe # for
the *Illumina HumanHT-12 V4.0* microarray, the gene symbol associated with each probe, and other potentially useful information about each gene, detailed later. 

Author: Michael Moore

In [1]:
import pandas as pd
import csv
experimental_df = pd.read_excel("GSE132903_Matrix_Normalized.xlsx")
experimental_df.columns

Index(['ID_REF', 'Illumina probe name', 'ND_1_08-81', 'ND_2_08-85',
       'ND_4_08-72', 'ND_5_08-83', 'ND_6_97-53', 'ND_7_98-19', 'ND_8_97-46',
       'ND_9_97-17',
       ...
       'ND_14_97-09', 'ND_20_99-29', 'ND_15_97-10', 'ND_21_99-22',
       'ND_16_97-02', 'ND_22_99-02', 'ND_17_97-14', 'ND_23_98-22',
       'ND_18_97-37', 'ND_24_98-32'],
      dtype='object', length=197)

# Simplifying the Illumina HumanHT-12 V4.0 expression beadchip annotation file
The Illumina HumanHT-12 microarray annotation file was obtained from [GEO](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL10558). I'm not sure what the first 
811 lines are, but line 812 of the full annotation TSV is the first line that maps Illumina probe #'s to gene names. The file is in tab-separated value 
format with 30 values per line. Each line has many redundant values. I could not determine the meaning of some values, and many are left blank, with single-character 
flag values occasionally occurring. I pared down the list of values I wanted per line to 6: 
1. Illumina probe # (col 0)
2. Accession # (col 3)
3. Gene Symbol (col 5)
4. Gene Symbol 2 (col 12, this may be redundant, but I wanted to capture lines which may not have the 1st gene name)
5. Chromosomal locus (col 22)
6. Brief Description (col 23)

I loop over each line and write these six values into a new tab-separated value file whose first row holds the aforementioned column names. 
This should be quite a bit nicer to work with via pandas than the original tab-separated value file. 

In [2]:
with open("GPL10558-50081.txt", "r") as f:
    # Keep only the six columns listed above (0, 3, 5, 12, 22, 23)
    with open("cleaned_illumina_annotations.tsv", "w") as out:
        out.write("probe#\taccession#\tgeneSymbol\tgeneSymbol2\tcsomeLocus\tdescription\n")
        for line in f:
            vals = line.split("\t")
            out.write(f"{vals[0]}\t{vals[3]}\t{vals[5]}\t{vals[12]}\t{vals[22]}\t{vals[23]}\n")
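
The loop above assumes every line has at least 24 tab-separated fields. As a hedged sketch (not what was run for this notebook), the same column selection could be wrapped in a function that skips comment lines and short lines instead of raising an `IndexError`. The function name `extract_columns`, the `#` comment convention, and the toy input are assumptions made for illustration.

```python
import io

# Column positions from the list above, mapped to the header names written above.
COLUMNS = {0: "probe#", 3: "accession#", 5: "geneSymbol",
           12: "geneSymbol2", 22: "csomeLocus", 23: "description"}

def extract_columns(src, dst, columns=COLUMNS):
    """Copy the selected tab-separated columns from src to dst; return rows written."""
    dst.write("\t".join(columns.values()) + "\n")
    needed = max(columns) + 1  # a usable row must reach the highest column index
    written = 0
    for line in src:
        vals = line.rstrip("\n").split("\t")
        # Assumption: metadata lines start with "#"; short lines are not probe rows.
        if line.startswith("#") or len(vals) < needed:
            continue
        dst.write("\t".join(vals[i] for i in columns) + "\n")
        written += 1
    return written

# Toy input: one comment line, one short line, one full 30-field row.
toy_row = [f"v{i}" for i in range(30)]
toy_input = io.StringIO("#ID = probe id\nshort\tline\n" + "\t".join(toy_row) + "\n")
toy_output = io.StringIO()
rows_written = extract_columns(toy_input, toy_output)
```

Note that a column-header row inside the source file has the full number of fields, so it would still pass through as a data row; whether that matters depends on how the annotation file was prepared.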


In [3]:
# For now let's read each value as a string (pandas otherwise treats some csomeLocus values as floats; empty fields still show up as NaN)
dtypes = {header: "object" for header in "probe#\taccession#\tgeneSymbol\tgeneSymbol2\tcsomeLocus\tdescription".split("\t")}

clean_annotations = pd.read_csv("cleaned_illumina_annotations.tsv", sep="\t", header=0, dtype=dtypes)
clean_annotations.set_index("geneSymbol", inplace=True)
clean_annotations

                  probe#   accession# geneSymbol2 csomeLocus  \
geneSymbol                                                     
LOC643334   ILMN_1651199  XM_931492.1   LOC643334    2q37.3b   
SLC35E2     ILMN_1651209  NM_182838.1     SLC35E2   1p36.33a   
DUSP22      ILMN_1651210  XM_941691.1      DUSP22    6p25.3b   
LOC642820   ILMN_1651221  XM_926225.1   LOC642820        NaN   
RPS28       ILMN_1651228  NM_001031.4       RPS28   19p13.2d   
...                  ...          ...         ...        ...   
SKCG-1      ILMN_3311170  XR_040141.2      SKCG-1   11q22.3d   
ESP33       ILMN_3311175  XR_079076.1       ESP33        NaN   
SKCG-1      ILMN_3311180  XR_040140.2      SKCG-1   11q22.3d   
ESP33       ILMN_3311185  XR_078679.1       ESP33        NaN   
NCRNA00173  ILMN_3311190  NR_027346.1  NCRNA00173        NaN   

                                                  description  
geneSymbol                                                     
LOC643334
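
On the NaN values in `csomeLocus`: a likely cause is that empty fields are converted to NaN by `read_csv` by default, whatever `dtype` is requested. The sketch below uses made-up rows (the toy table and variable names are assumptions, not the notebook's data) to show the difference that `keep_default_na=False` makes.

```python
import io
import pandas as pd

# Two toy rows; the second has an empty locus field.
toy_tsv = "probe#\tcsomeLocus\nILMN_A\t2q37.3b\nILMN_B\t\n"

# Default behaviour: the empty field is read as NaN even with dtype=str.
default_df = pd.read_csv(io.StringIO(toy_tsv), sep="\t", dtype=str)

# keep_default_na=False keeps empty fields as empty strings instead.
string_df = pd.read_csv(io.StringIO(toy_tsv), sep="\t", dtype=str, keep_default_na=False)

print(default_df["csomeLocus"].isna().tolist())
print(string_df["csomeLocus"].tolist())
```

Whether empty strings or NaN are preferable depends on the later steps; NaN works well with `dropna` and `isna`, while empty strings keep the column purely textual.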

# Getting a list of AD genes to use
Three lists of AD-associated genes were created from data downloaded from the Allen Brain Institute's 
[BrainSpan developmental transcriptome](http://www.brainspan.org/rnaseq/search/index.html). These genes were selected via the gene categories:
- *Alzheimer Disease*
- *Alzheimer disease amyloid secretase pathway*
- *Alzheimer disease presenilin pathway*

The data in these files will be of use later, when I attempt to vectorize the individual gene expression values from the GEO study that 
provided the transcriptome data. 

In [4]:
# use a set to avoid duplicates
gene_symbols = set()
for name in ["", "amyloid_secretase_", "Presenilin_"]:
    filepath = f"Brain_Span_Data/AD_{name}0log2/Rows.csv"
    with open(filepath, "r") as f:
        reader = csv.DictReader(f)
        for row in reader:
            gene_symbols.add(row.get("gene-symbol"))
        

In [5]:
print(len(gene_symbols), "genes")
# If the microtubule associated tau protein gene and APOE gene aren't in here, they really should be.
gene_symbols.add("APOE")
# and Tau
gene_symbols.add("MAPT")
print(len(gene_symbols), "genes after adding Tau and APOE")
# Now filter out those genes which are not included in the Illumina probe annotations
illumina_genes = set(clean_annotations.index)
gene_symbols = set(filter(lambda g: g in illumina_genes, gene_symbols))
print(f"{len(gene_symbols)} genes remain")

178 genes
180 genes after adding Tau and APOE
178 genes remain
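
The counts above drop from 180 to 178, so two entries have no match in the annotation index. A small sketch, using made-up symbols rather than the real sets, of how set operations could do the same filtering and also report what was dropped (the function name `split_by_coverage` is hypothetical):

```python
def split_by_coverage(candidates, available):
    """Return (kept, dropped) gene symbols relative to the available set."""
    candidates = set(candidates)
    available = set(available)
    kept = candidates & available     # same result as the filter/lambda version
    dropped = candidates - available  # symbols with no matching annotation entry
    return kept, dropped

# Toy symbols only; "NOT_ON_ARRAY" is a placeholder, not a real gene.
kept, dropped = split_by_coverage(
    {"APOE", "MAPT", "NOT_ON_ARRAY"},
    {"APOE", "MAPT", "RPS28"},
)
print(sorted(kept), sorted(dropped))
```

Printing the dropped set would make it easy to check whether a missing symbol is simply listed under an alias in the annotation file.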


## I'm writing the results into a simple text file so I can use them to get relative transcriptomic expression data from [The Allen Brain Map](https://celltypes.brain-map.org/rnaseq/human_m1_10x)

In [6]:
gene_symbol_list = list(gene_symbols)
gene_symbol_list.sort()
genes = "\n".join(gene_symbol_list)  # One gene symbol per line, separated by newline characters
with open("AD_genes.txt", "w") as f:
    f.write(genes)

# Restricting transcriptome data to 178 genes
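Restrict the experimental transcriptome data to just the 178 genes of interest for all subjects. The expression matrix has 197 columns, but two of them (`ID_REF` and `Illumina probe name`) are identifiers, which leaves 195 subject columns.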

TODO: This
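
One way this restriction might be done, sketched on toy data rather than the real frames: look up the probes annotated with a gene of interest, then keep only the expression rows for those probes. The function name `restrict_to_genes`, the choice of `ID_REF` as the column holding probe IDs, and the toy values are all assumptions; the real matrix may need the `Illumina probe name` column instead.

```python
import pandas as pd

def restrict_to_genes(expression_df, annotations, genes, probe_col="ID_REF"):
    """Keep only expression rows whose probe maps to a gene of interest."""
    # annotations is assumed to be indexed by geneSymbol with a "probe#" column.
    wanted = annotations.loc[annotations.index.isin(list(genes)), "probe#"]
    # Map probe ID -> gene symbol so each kept row can be labelled.
    probe_to_gene = dict(zip(wanted, wanted.index))
    mask = expression_df[probe_col].isin(list(probe_to_gene))
    subset = expression_df[mask].copy()
    subset.insert(0, "geneSymbol", subset[probe_col].map(probe_to_gene))
    return subset.reset_index(drop=True)

# Toy stand-ins for clean_annotations and experimental_df (values are made up).
toy_annotations = pd.DataFrame(
    {"probe#": ["ILMN_1", "ILMN_2", "ILMN_3"]},
    index=pd.Index(["APOE", "MAPT", "RPS28"], name="geneSymbol"),
)
toy_expression = pd.DataFrame(
    {"ID_REF": ["ILMN_1", "ILMN_2", "ILMN_3"], "sample_a": [7.1, 8.2, 9.3]}
)
ad_expression_df = restrict_to_genes(toy_expression, toy_annotations, {"APOE", "MAPT"})
```

A gene with several probes would yield several rows here; collapsing them to one value per gene (for example by averaging) is a separate decision.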

NameError: name 'ad_expression_df' is not defined