# Prepare relevant gene sets for project

This project brought up many questions, including:
- Which types of genes best discriminate male from female cells in the developing brain (in particular, are they all sex-chromosome-linked genes)?
- Are the genes that distinguish male from female cells entirely sex-linked, or might these genes regulate the expression of other targets in a sex-dependent manner? 
- How early can sexually dimorphic differentiation be detected in the brain?

To answer these questions, I needed gene lists with which to annotate my own dataset.

I tried several ways of acquiring gene lists (e.g. [gget](https://github.com/pachterlab/gget)), but the method that ultimately worked best for me was the classic UCSC Genome Browser, followed by my own reformatting:

[UCSC Genome Browser](https://genome.ucsc.edu/) > mm10 > Tools > Table Browser > add a filter for chrom = 'chrX' OR chrom = 'chrY'
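For reference, the same filter logic can be sketched in pandas, in case an unfiltered export is already on disk. This is only an illustration: the column name `chrom` and the four example rows are assumptions, not the actual Table Browser output used in this project.

```python
import pandas as pd

# Hypothetical rows standing in for a whole-genome export;
# the column name "chrom" and these rows are illustrative assumptions.
genes = pd.DataFrame(
    {
        "geneSymbol": ["Xist", "Actb", "Ddx3y", "Gapdh"],
        "chrom": ["chrX", "chr5", "chrY", "chr6"],
    }
)

# Same logic as the Table Browser filter: chrom = 'chrX' OR chrom = 'chrY'
sex_linked = genes[genes["chrom"].isin(["chrX", "chrY"])]
print(sex_linked)
```

Filtering in the Table Browser itself keeps the download small, so this sketch is mainly useful as a sanity check.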

I also downloaded gene sets of interest from the MSigDB website.
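MSigDB offers its gene sets in GMT format, a tab-separated text file in which each line holds a set name, a description field, and then the member genes. Assuming the sets were downloaded as GMT, a minimal sketch for loading them might look like the following. The helper name `read_gmt` and the demo file contents are hypothetical and exist only for illustration.

```python
import os
import tempfile


def read_gmt(path):
    """Parse a GMT file into {gene_set_name: [gene symbols]} (hypothetical helper)."""
    gene_sets = {}
    with open(path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3:
                continue  # skip blank or malformed lines
            name, _description, *genes = fields
            gene_sets[name] = [g for g in genes if g]
    return gene_sets


# Tiny made-up GMT file, written to a temporary directory for demonstration only
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.gmt")
    with open(demo_path, "w") as handle:
        handle.write("DEMO_SET_A\thttp://example.org/a\tXist\tKdm5d\n")
        handle.write("DEMO_SET_B\thttp://example.org/b\tDdx3y\n")
    demo_sets = read_gmt(demo_path)

print(demo_sets)
```

A plain dictionary keeps the sets easy to intersect with other gene lists later on.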

In [9]:

# Import needed libs
import pandas as pd
import os

In [10]:
# Define input and output paths
path_ucsc_complete = "../../../data/raw/ucsc/complete_X_Y_geneIDs.csv"

path_output = "../data/processed/ucsc"
os.makedirs(path_output, exist_ok=True)

In [12]:
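# Load the UCSC Table Browser export (tab-separated despite the .csv extension);
# header=1 takes the column names from the second line of the file
data_full = pd.read_csv(path_ucsc_complete, sep="\t", header=1)
# Keep only the gene symbol and description columns, with shorter names
data_streamlined = data_full[['mm10.kgXref.geneSymbol', 'mm10.kgXref.description']]
data_streamlined = data_streamlined.rename(columns={'mm10.kgXref.geneSymbol': 'geneSymbol', 'mm10.kgXref.description': 'description'})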
data_streamlined.head(2)


|   | geneSymbol | description |
|---|---|---|
| 0 | Btbd35f23 | Mus musculus BTB domain containing 35, family ... |
| 1 | Btbd35f24 | Contains 1 BTB (POZ) domain. (from UniProt J3Q... |


In [13]:
data_streamlined.to_csv(os.path.join(path_output, 'ucsc_sexgene_info.csv'))
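As a rough sketch of how the saved table could later be used for annotation: because `to_csv` is called without `index=False`, the file includes the row index as an unnamed first column, so reading it back with `index_col=0` restores the original layout. The file contents and the `my_genes` table below are illustrative stand-ins, not real data from this project.

```python
import io

import pandas as pd

# Stand-in for ucsc_sexgene_info.csv (index column included);
# these rows are placeholders, not real output.
saved_csv = io.StringIO(
    ",geneSymbol,description\n"
    "0,Xist,placeholder description\n"
    "1,Ddx3y,placeholder description\n"
)
sex_genes = pd.read_csv(saved_csv, index_col=0)

# Hypothetical list of genes detected in my own dataset
my_genes = pd.DataFrame({"gene": ["Xist", "Actb", "Ddx3y"]})

# Flag genes that appear in the X/Y gene table
my_genes["sex_linked"] = my_genes["gene"].isin(sex_genes["geneSymbol"])
print(my_genes)
```

Matching on gene symbol is the simplest option here; it assumes both tables use the same symbol nomenclature.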