# Extracting GEO IDs of RNA-seq samples of differentiated cell types

This notebook parses the GEO sample IDs (GSM accessions) from the R scripts that were downloaded manually from ARCHS4, one script per cell type category.
The aim is to use these IDs to extract the expression vectors of the selected samples from the HDF5 file of the ARCHS4 dataset, version V11.

In [1]:
# Importing libraries
import pandas as pd
import os

### Parsing R scripts from ARCHS4 to extract sample GEO IDs and cell types

In [2]:
# This generates a dataframe mapping each GEO ID to its cell type category
labels=[]
for subdir, dirs, files in os.walk("../inputs/"):
    for file in files:
        samples=[]
        filepath = subdir + os.sep + file
        if filepath.endswith(".r"):
            with open(filepath) as inf:
                temp_df=pd.DataFrame(columns=["geo_id","type"])
                content=inf.read().splitlines()
                content=str(content[16:-18]).strip()
                for sample in content[content.find("(")+1:content.find(")")].split(","):
                    clean_sample=sample.strip(" \\'\"")
                    if clean_sample!="":
                        samples.append(clean_sample)
            temp_df["geo_id"]=samples
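            # str.strip(".r") removes any leading/trailing '.' or 'r' characters, so drop the extension explicitly
            temp_df["type"]=os.path.splitext(file)[0]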
            labels.append(temp_df)
labels = pd.concat(labels, ignore_index=True)
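
The parsing above relies on fixed line offsets (`content[16:-18]`) and on the first pair of parentheses in the script. As a hedged alternative, the sketch below pulls GSM accessions out of the script text with a regular expression. The helper name `extract_geo_ids` and the example fragment are made up for illustration, and the sketch assumes that every `GSM` accession in the script belongs to the sample vector, which should be checked against the actual ARCHS4 scripts.

```python
import re

# Hypothetical helper, not part of the original notebook.
# Assumption: every GSM accession in the script is a selected sample.
GSM_PATTERN = re.compile(r"GSM\d+")

def extract_geo_ids(script_text):
    """Return all GSM accessions found in the text, in order of appearance."""
    return GSM_PATTERN.findall(script_text)

# Made-up fragment that imitates a sample vector in an R script
example_script = '''
samp = c("GSM2132128","GSM2132132",
         "GSM2132130")
'''

print(extract_geo_ids(example_script))
```

Unlike the slice-based approach, this does not depend on how many header or footer lines the script has, but it would also pick up accessions mentioned in comments.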

In [3]:
labels.head()

Unnamed: 0,geo_id,type
0,GSM2132128,Podocyte
1,GSM2132132,Podocyte
2,GSM2132130,Podocyte
3,GSM2132129,Podocyte
4,GSM2132133,Podocyte


In [4]:
labels.shape

(74985, 2)

In [5]:
neuron=labels[labels["type"]=="Neuron"].index
len(neuron)

5326

### Are there any duplicated samples...?

In [6]:
len(labels["geo_id"])

74985

In [7]:
len(set(labels["geo_id"]))

71961

In [8]:
# GEO IDs that appear in more than one row
counts=labels.geo_id.value_counts()
duplicated=counts[counts>1].index

In [9]:
len(duplicated)

3023

In [10]:
duplicated[0:10]

Index(['GSM4150378', 'GSM4100699', 'GSM4432569', 'GSM5068004', 'GSM1649162',
       'GSM2287090', 'GSM2494634', 'GSM2431689', 'GSM2574521', 'GSM2574623'],
      dtype='object')

### What are these duplicated samples?

In [12]:
labels[labels["geo_id"]=="GSM4100699"]

Unnamed: 0,geo_id,type
3912,GSM4100699,Motor Neuron
12154,GSM4100699,Neuron


In [13]:
labels[labels["geo_id"]=="GSM4432569"]

Unnamed: 0,geo_id,type
3927,GSM4432569,Motor Neuron
12222,GSM4432569,Neuron
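
Rather than inspecting duplicated IDs one at a time, all of them can be summarised at once. The following is a minimal sketch on a toy dataframe whose rows mirror the values shown above; the names `toy`, `dup_rows` and `categories_per_id` are illustrative and do not appear in the original notebook.

```python
import pandas as pd

# Toy stand-in for the `labels` dataframe
toy = pd.DataFrame({
    "geo_id": ["GSM4100699", "GSM2132128", "GSM4100699", "GSM4432569", "GSM4432569"],
    "type": ["Motor Neuron", "Podocyte", "Neuron", "Motor Neuron", "Neuron"],
})

# keep=False flags every row whose geo_id occurs more than once
dup_rows = toy[toy["geo_id"].duplicated(keep=False)]

# One entry per duplicated ID, listing the categories it was assigned to
categories_per_id = dup_rows.groupby("geo_id")["type"].agg(lambda s: " | ".join(sorted(s)))
print(categories_per_id)
```

Applying `value_counts()` to the resulting series would show which category combinations account for most of the duplicates.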


### These samples were simply assigned to more than one category (e.g. both Motor Neuron and Neuron). Removing redundancy...

In [14]:
labels.set_index("geo_id",drop=False,inplace=True)

In [15]:
# Dropping duplicated samples, keeping only the last occurrence of each GEO ID
labels.drop_duplicates(subset="geo_id",keep="last",inplace=True)

In [16]:
labels.shape

(71961, 2)
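
The `keep` argument does not affect the resulting shape, but it does decide which category a multiply-labelled sample retains. The sketch below illustrates this on a two-row toy dataframe (the variable names are illustrative). In the real data the row order comes from `os.walk`, so which label counts as "first" or "last" depends on the order in which the files were read.

```python
import pandas as pd

# Toy example: one sample assigned to two categories
toy = pd.DataFrame({
    "geo_id": ["GSM4100699", "GSM4100699"],
    "type": ["Motor Neuron", "Neuron"],
})

kept_first = toy.drop_duplicates(subset="geo_id", keep="first")
kept_last = toy.drop_duplicates(subset="geo_id", keep="last")

print(kept_first["type"].tolist())  # ['Motor Neuron']
print(kept_last["type"].tolist())   # ['Neuron']
```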

In [17]:
labels.to_csv("../outputs/samples_types.csv",index=False)