# GEOparse

GEOparse is a handy Python library for connecting to the GEO (Gene Expression Omnibus) database and
retrieving transcriptomic datasets. In this example we will focus on microarray experiments.

### Install GEOparse

In [1]:
# use pip to install GEOparse
# !pip install GEOparse

Remember that once a package is installed, it still needs to be imported before it can be used.

In [2]:
import GEOparse
import math
import pandas as pd
from scipy import stats
import statsmodels.stats.multitest as smm

The GEOparse.get_GEO() function retrieves a record from the GEO website, including not only its metadata
but also the experimental data points. Below are some useful commands in the package.

In [3]:
gds = GEOparse.get_GEO(geo="GDS2084", destdir="./")

24-Feb-2021 10:06:27 DEBUG utils - Directory ./ already exists. Skipping.
24-Feb-2021 10:06:27 INFO GEOparse - File already exist: using local version.
24-Feb-2021 10:06:27 INFO GEOparse - Parsing ./GDS2084.soft.gz: 
24-Feb-2021 10:06:27 DEBUG GEOparse - DATABASE: Geo
24-Feb-2021 10:06:27 DEBUG GEOparse - DATASET: GDS2084
24-Feb-2021 10:06:27 DEBUG GEOparse - SUBSET: GDS2084_1
24-Feb-2021 10:06:27 DEBUG GEOparse - SUBSET: GDS2084_2
24-Feb-2021 10:06:27 DEBUG GEOparse - DATASET: GDS2084


### Retrieve data

In [4]:
# metadata
gds.metadata

{'title': ['Polycystic ovary syndrome: adipose tissue'],
 'description': ['Analysis of omental adipose tissues of morbidly obese patients with polycystic ovary syndrome (PCOS). PCOS is a common hormonal disorder among women of reproductive age, and is characterized by hyperandrogenism and chronic anovulation. PCOS is associated with obesity.'],
 'type': ['Expression profiling by array'],
 'pubmed_id': ['17062763'],
 'platform': ['GPL96'],
 'platform_organism': ['Homo sapiens'],
 'platform_technology_type': ['in situ oligonucleotide'],
 'feature_count': ['22283'],
 'sample_organism': ['Homo sapiens'],
 'sample_type': ['RNA'],
 'channel_count': ['1'],
 'sample_count': ['15'],
 'value_type': ['count'],
 'reference_series': ['GSE5090'],
 'order': ['none'],
 'update_date': ['Mar 21 2007']}

In [5]:
# normalized expression values (microarray signal, one column per sample)
gds.table.head()

Unnamed: 0,ID_REF,IDENTIFIER,GSM114841,GSM114844,GSM114845,GSM114849,GSM114851,GSM114854,GSM114855,GSM114834,GSM114842,GSM114843,GSM114847,GSM114848,GSM114850,GSM114852,GSM114853
0,1007_s_at,MIR4640,222.6,252.7,219.3,258.9,239.0,286.0,230.1,197.1,254.4,296.5,171.1,268.9,251.2,301.9,234.3
1,1053_at,RFC2,35.5,24.5,23.4,31.4,20.6,26.1,24.3,26.9,31.4,27.1,25.9,40.5,22.2,24.6,31.3
2,117_at,HSPA6,41.5,53.3,31.3,43.0,65.5,39.6,68.5,46.9,61.7,93.7,68.5,79.6,40.0,43.2,53.4
3,121_at,PAX8,229.8,419.6,274.5,227.1,271.6,428.7,333.4,221.1,291.5,399.8,307.1,364.8,326.1,387.2,400.9
4,1255_g_at,GUCA1A,14.3,13.0,29.6,16.3,4.6,10.7,7.8,2.4,13.9,24.7,3.8,14.3,1.9,12.0,11.5


In [6]:
# columns
gds.columns

Unnamed: 0,description,disease state
GSM114841,Value for GSM114841: EP3_adipose_control; src:...,control
GSM114844,Value for GSM114844: EP23_adipose_control; src...,control
GSM114845,Value for GSM114845: EP31_adipose_control_rep1...,control
GSM114849,Value for GSM114849: EP37_adipose_control; src...,control
GSM114851,Value for GSM114851: EP49_adipose_control; src...,control
GSM114854,Value for GSM114854: EP69_adipose_control; src...,control
GSM114855,Value for GSM114855: EP71_adipose_control; src...,control
GSM114834,Value for GSM114834: EP1_adipose_pcos_rep1; sr...,polycystic ovary syndrome
GSM114842,Value for GSM114842: EP10_adipose_pcos; src: O...,polycystic ovary syndrome
GSM114843,Value for GSM114843: EP18_adipose_pcos; src: O...,polycystic ovary syndrome


Notice that the gds.columns command lists the experimental factors and the levels of each factor
in the dataset, while the gds.table command retrieves all the normalized values. When normalizing
the raw data files yourself is too cumbersome, this is an easy alternative.
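
As a small illustration of reading factors and levels from such a table, the sketch below rebuilds three rows of the gds.columns output by hand (so it runs without a download) and counts the samples per level. The variable names are illustrative and not part of GEOparse.

```python
import pandas as pd

# Hand-built stand-in for gds.columns, using three rows from the output above.
columns = pd.DataFrame(
    {"disease state": ["control", "control", "polycystic ovary syndrome"]},
    index=["GSM114841", "GSM114844", "GSM114834"],
)

# Levels of the "disease state" factor and the number of samples in each.
level_counts = columns["disease state"].value_counts()
print(level_counts)

# Sample IDs per level, handy for selecting columns of the expression table.
samples_by_level = {
    level: list(ids)
    for level, ids in columns.groupby("disease state").groups.items()
}
print(samples_by_level)
```

On the real object the same two calls would be applied to gds.columns directly.
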
Below we download the record for a single sample and, using the same commands, retrieve the same
kind of information.

In [7]:
gsm = GEOparse.get_GEO(geo="GSM114841", destdir="./")

24-Feb-2021 10:06:28 DEBUG utils - Directory ./ already exists. Skipping.
24-Feb-2021 10:06:28 INFO GEOparse - File already exist: using local version.
24-Feb-2021 10:06:28 INFO GEOparse - Parsing ./GSM114841.txt: 
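
The same get_GEO call was used for a dataset accession (GDS2084) earlier and for a sample accession (GSM114841) here. GEO encodes the record type in the accession prefix. The helper below is a hypothetical convenience function, not part of GEOparse, that only illustrates this naming convention using accessions that appear in this notebook.

```python
# GEO record types by accession prefix.
GEO_PREFIXES = {
    "GDS": "DataSet (curated, with a value table and subsets)",
    "GSE": "Series (a submitted experiment grouping samples)",
    "GSM": "Sample (a single sample record)",
    "GPL": "Platform (the array description)",
}


def geo_record_type(accession):
    """Hypothetical helper: describe a GEO accession by its prefix."""
    prefix = accession.strip().upper()[:3]
    if prefix not in GEO_PREFIXES:
        raise ValueError(f"Unrecognized GEO accession: {accession!r}")
    return GEO_PREFIXES[prefix]


for acc in ("GDS2084", "GSM114841", "GSE5090", "GPL96"):
    print(acc, "->", geo_record_type(acc))
```
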


In [8]:
gsm.metadata

{'title': ['EP3_adipose_control'],
 'geo_accession': ['GSM114841'],
 'status': ['Public on Jun 17 2006'],
 'submission_date': ['Jun 16 2006'],
 'last_update_date': ['Jun 16 2006'],
 'type': ['RNA'],
 'channel_count': ['1'],
 'source_name_ch1': ['Omental adipose tissue'],
 'organism_ch1': ['Homo sapiens'],
 'taxid_ch1': ['9606'],
 'characteristics_ch1': ['Morbidly obese control subject'],
 'biomaterial_provider_ch1': ['Ramón y Cajal Hospital, Madrid, Spain'],
 'molecule_ch1': ['total RNA'],
 'label_ch1': ['Biotin'],
 'description': ['Total RNA was extracted from omental  adipose tissue from a control subject'],
 'data_processing': ['MAS 5.0, scaled to 100 and RMA'],
 'platform_id': ['GPL96'],
 'contact_name': ['BELEN,,PERAL'],
 'contact_email': ['bperal@iib.uam.es'],
 'contact_phone': ['34 91 5854478'],
 'contact_fax': ['34 91 5854401'],
 'contact_institute': ['INSTITUTO DE INVESTIGACIONES BIOMEDICAS, CSIC-UAM'],
 'contact_address': ['ARTURO DUPERIER'],
 'contact_city': ['MADRID'],
 'co

In [9]:
gsm.columns

Unnamed: 0,description
ID_REF,
VALUE,"Signal intensity - MAS 5.0, scaled to 100 and RMA"
ABS_CALL,Presence/absence of gene transcript in sample;...
Detection p-value,p-value that indicates the significance level ...


In [10]:
gsm.table.head()

Unnamed: 0,ID_REF,VALUE,ABS_CALL,Detection p-value
0,AFFX-TrpnX-M_at,1.3,A,0.963431
1,AFFX-TrpnX-5_at,2.6,A,0.672921
2,AFFX-TrpnX-3_at,0.5,A,0.910522
3,AFFX-ThrX-M_at,4.3,A,0.631562
4,AFFX-ThrX-5_at,1.9,A,0.897835


# Differential Expression

Now that we can retrieve all the necessary information, we can use the functions we have been writing to
perform a t-test and calculate fold change for all genes.
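
Before running this over every probe, it can help to see both statistics for a single gene. The sketch below uses the 117_at (HSPA6) values printed by gds.table.head() above. It assumes that the last eight sample columns are all PCOS samples, since the gds.columns output shown above lists only the first ten samples.

```python
import math
from scipy import stats

# 117_at (HSPA6) values copied from the gds.table.head() output above.
control = [41.5, 53.3, 31.3, 43.0, 65.5, 39.6, 68.5]
# Assumption: the last eight sample columns are all PCOS samples.
pcos = [46.9, 61.7, 93.7, 68.5, 79.6, 40.0, 43.2, 53.4]

# log2 ratio of the group means (PCOS over control).
mean_control = sum(control) / len(control)
mean_pcos = sum(pcos) / len(pcos)
log_ratio = math.log2(mean_pcos / mean_control)

# Welch t-test: equal_var=False drops the equal-variance assumption.
t_stat, p_value = stats.ttest_ind(control, pcos, equal_var=False)

print(f"log2 ratio: {log_ratio:.2f}")  # about 0.31
print(f"Welch p-value: {p_value:.3f}")
```

A log2 ratio of about 0.31 corresponds to a fold change of roughly 1.24, so this probe shows only a modest difference between the two group means.
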

In [11]:
# save all values to a dataframe
alldata = gds.table
allsamples = gds.columns

# retrieve only columns with sample names
expdata = alldata.loc[:,allsamples.index]


# put the probe IDs (the ID_REF column) back as the row index
expdata = expdata.set_index(alldata.iloc[:,0])

# transpose df to use groupby function
t_expdata = expdata.transpose()
t_expdata["group"] = list(allsamples.iloc[:,1])
genemean = t_expdata.groupby("group").mean()

# Use pd.Series to store data
generatio = [math.log2(i) for i in (genemean.loc["polycystic ovary syndrome",]/genemean.loc["control",])]
generatio = pd.Series(generatio, index=genemean.columns)

# get control sample and disease sample
ctrl = allsamples.index[allsamples["disease state"] == "control"]
disease = allsamples.index[allsamples["disease state"] == "polycystic ovary syndrome"]

# write a function to calculate p value. use stats.ttest_ind() function, set equal_var = False (Welch t test)
def dottest(index, ctrl, disease):
    return(stats.ttest_ind(expdata.loc[index,ctrl], expdata.loc[index,disease], equal_var=False)[1])

# write a loop to calculate p value for each row
ttestpvalues = []
for i in expdata.index:
    ttestpvalues.append(dottest(i, ctrl, disease))
    
# make the result into a pd.Series
ttestpvalues = pd.Series(ttestpvalues, index=genemean.columns)
ttestpvaluesfdr = smm.multipletests(list(ttestpvalues),alpha=0.05, method="fdr_bh")
print("The number of genes that pass the fdr cutoff (0.05):",sum(ttestpvaluesfdr[0]))


The number of genes that pass the fdr cutoff (0.05): 0
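
The row-by-row loop in the cell above can also be written as a single call, because stats.ttest_ind accepts an axis argument. The sketch below checks this on a small made-up table. The probe names, sample names, and values are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical toy expression table: 3 probes x 6 samples.
toy = pd.DataFrame(
    [[10.0, 12.0, 11.0, 20.0, 22.0, 19.0],
     [5.0, 6.0, 5.5, 5.2, 6.1, 5.4],
     [100.0, 90.0, 95.0, 70.0, 75.0, 65.0]],
    index=["probe_a", "probe_b", "probe_c"],
    columns=["c1", "c2", "c3", "d1", "d2", "d3"],
)
toy_ctrl = ["c1", "c2", "c3"]
toy_disease = ["d1", "d2", "d3"]

# One call: a Welch t-test along each row (axis=1).
_, pvals = stats.ttest_ind(
    toy[toy_ctrl].to_numpy(), toy[toy_disease].to_numpy(),
    axis=1, equal_var=False,
)
vectorized = pd.Series(pvals, index=toy.index)

# Reference: the same test done one row at a time.
looped = pd.Series(
    [stats.ttest_ind(toy.loc[i, toy_ctrl], toy.loc[i, toy_disease],
                     equal_var=False)[1] for i in toy.index],
    index=toy.index,
)
print(np.allclose(vectorized, looped))  # True
```

On a table with tens of thousands of probes, the vectorized form avoids the Python-level loop.
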



None of the FDR-corrected p-values pass the cutoff, so let's instead filter on the uncorrected p-values
together with a log-ratio cutoff.
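
To see why a correction can wipe out every hit, here is a minimal NumPy sketch of the Benjamini-Hochberg adjustment, the procedure selected by method="fdr_bh". It is an illustration on made-up p-values, not a replacement for smm.multipletests, and the function name bh_adjust is hypothetical.

```python
import numpy as np


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values (plain NumPy sketch, no NaN handling)."""
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    order = np.argsort(p)
    # Scale each sorted p-value by n / rank.
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest p-value.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)
    return adjusted


# Made-up example: 3 small p-values among 20 tests.
raw = np.array([0.001, 0.01, 0.04] + [0.5] * 17)
adj = bh_adjust(raw)

print((raw < 0.05).sum())  # 3
print((adj < 0.05).sum())  # 1
```

Each sorted p-value is multiplied by n / rank, so the penalty grows with the number of tests. With the 22283 features reported in the metadata above, the smallest p-value is compared against 0.05 / 22283, roughly 2.2e-6, which is hard to reach with 15 samples.
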

In [12]:
# keep genes that pass both the 1.5-fold change cutoff and the uncorrected p-value cutoff
temp = (abs(generatio) > math.log2(1.5)) & (ttestpvalues < 0.05)
DiffGenes = list(temp[temp].index)
print("The number of genes that pass the p cutoff (0.05):",len(DiffGenes))

The number of genes that pass the p cutoff (0.05): 286


Compare the results to running the analysis with the SimpleAffy package. What is the major difference between
the methods? Why should this cause such a difference in the number of differentially expressed genes?