# Notes

## Data access 

Link = https://583643567512.signin.aws.amazon.com/console

- Username = geissmaf
- Password = Pv5e1I31

Go to the folder **Seqc_results/**, which contains the results of the pipeline. For each sequenced sample, download the following files:
1. `*_dense.txt` -> Cell x Gene UMI matrix
2. `*_sparse_molecule_counts.mtx` -> raw counts, no filtering
3. `*_summary.tar.gz` -> QC results

For each sample, create a folder named after the sample and put all 3 files in it.
- Create an **input/** and an **output/** folder
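
The sorting step can be scripted. The following is a minimal sketch, not part of the pipeline: the function name `organize_downloads`, the layout `input/<sample>/`, and the idea that everything was first downloaded into one flat folder are all assumptions made for illustration.

```python
import re
import shutil
from pathlib import Path

# Matches the three per-sample files; "spar[sc]e" tolerates either spelling
# of the sparse-matrix file name.
FILE_PATTERN = re.compile(
    r"^(?P<sample>.+)_(dense\.txt|spar[sc]e_molecule_counts\.mtx|summary\.tar\.gz)$"
)


def organize_downloads(download_dir, project_dir):
    """Move each sample's files into project_dir/input/<sample>/ (assumed layout)."""
    download_dir, project_dir = Path(download_dir), Path(project_dir)
    input_dir = project_dir / "input"
    input_dir.mkdir(parents=True, exist_ok=True)
    (project_dir / "output").mkdir(parents=True, exist_ok=True)

    moved = {}
    # sorted() materializes the listing before any file is moved
    for path in sorted(download_dir.iterdir()):
        match = FILE_PATTERN.match(path.name)
        if not path.is_file() or match is None:
            continue  # ignore anything that is not one of the three expected files
        sample = match.group("sample")
        sample_dir = input_dir / sample
        sample_dir.mkdir(exist_ok=True)
        shutil.move(str(path), str(sample_dir / path.name))
        moved.setdefault(sample, []).append(path.name)
    return moved
```

Calling `organize_downloads("downloads", ".")` returns a dictionary mapping each sample name to the files that were moved, which makes it easy to check that every sample has all 3 files.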


## scRNA-seq theory and comments

The library size of a cell is the total number of unique molecules (UMIs) detected in that cell.
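
As a small illustration, the library size is the row sum of a cell x gene UMI matrix. The counts below are made up, and the sketch assumes cells are rows and genes are columns.

```python
import numpy as np

# Toy cell x gene UMI count matrix (3 cells, 4 genes); the values are made up.
counts = np.array([
    [5, 0, 2, 1],
    [0, 0, 1, 0],
    [10, 3, 0, 7],
])

# Library size = total UMI count per cell (sum across genes, i.e. across columns).
library_size = counts.sum(axis=1)
print(library_size.tolist())  # [8, 1, 20]
```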

### Phenograph

PhenoGraph models the high-dimensional space using a nearest-neighbor (NN) graph. Each cell is a node, connected by edges to its most similar cells. The edges are weighted with the Jaccard similarity coefficient, and the graph is then partitioned into communities of cells.


__How it works__

PhenoGraph takes as input a matrix of N single-cell measurements and partitions them into subpopulations by clustering a graph that represents their phenotypic similarity.
1. It finds the k nearest neighbors of each cell (using Euclidean distance), resulting in N sets of k-neighborhoods.
2. It refines the k-neighborhoods defined in the first step using the Jaccard similarity coefficient: the weight of the edge between two cells is the number of neighbors they share divided by the total number of distinct neighbors in their two neighborhoods. With this metric, the similarity between two cells reaches a maximum when their k-neighborhoods are identical and decreases as the number of neighbors they share decreases. Thus, the metric incorporates the structure of the data distribution into the weights, reinforcing edges in dense regions and penalizing edges that span sparse regions.
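
The two steps above can be sketched with NumPy. This is a brute-force illustration on toy data, not the PhenoGraph package's implementation; the function names `knn_indices` and `jaccard_graph` are hypothetical, and the all-pairs distance matrix only scales to small N.

```python
import numpy as np


def knn_indices(data, k):
    """Return, for each row (cell), the indices of its k nearest neighbors (Euclidean)."""
    diff = data[:, None, :] - data[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a cell is not its own neighbor
    return np.argsort(dist, axis=1)[:, :k]


def jaccard_graph(neighbors):
    """Weight each kNN edge (i, j) by |N(i) & N(j)| / |N(i) | N(j)|."""
    n = neighbors.shape[0]
    sets = [set(row.tolist()) for row in neighbors]
    weights = np.zeros((n, n))
    for i in range(n):
        for j in neighbors[i]:
            shared = len(sets[i] & sets[j])
            total = len(sets[i] | sets[j])
            weights[i, j] = weights[j, i] = shared / total
    return weights


# Toy data: two well-separated groups of three cells each (made-up coordinates).
cells = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
neighbors = knn_indices(cells, k=2)
weights = jaccard_graph(neighbors)
print(weights[0, 1], weights[0, 3])  # within-group edge vs. no edge between groups
```

Cells in the same group share neighbors and get a positive weight, while cells in different groups are never neighbors and keep a weight of zero.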

The weighted graph is then partitioned into communities by maximizing its modularity (𝑄), which compares the edge weight inside communities with what would be expected if edges were placed at random. Exact modularity maximization is NP-hard, so approximations are used in practice. One such approximation, called the **Louvain Method** (Blondel et al., 2008), has become popular due to its efficiency on large graphs containing hundreds of millions of nodes. This method is hierarchical and agglomerative. At the beginning of the first iteration, every node (cell) is placed into its own cluster. At each iteration, neighboring nodes are merged into clusters for all pairs whose mergers yield the largest increase in overall modularity (𝑄) of the graph. This process is repeated hierarchically (representing bottom-level clusters as nodes in the next iteration, etc.) until no further increase in 𝑄 is obtained.

PhenoGraph uses the Louvain Method to maximize the modularity of its partitions. Specifically, PhenoGraph runs multiple random restarts of the Louvain Method, choosing a final partition in which the modularity reaches a maximum among all solutions.
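
To make the quantity being maximized concrete, here is a minimal sketch that computes the standard (Newman) modularity of a given partition and picks the best of several candidate partitions. It does not implement the Louvain Method itself: the candidate partitions are written by hand as a stand-in for the results of random restarts, and the toy graph and the function name `modularity` are assumptions for illustration.

```python
import numpy as np


def modularity(adjacency, labels):
    """Modularity Q of a partition of an undirected, optionally weighted graph.

    Q = (1 / 2m) * sum_ij (A_ij - k_i * k_j / 2m) * [c_i == c_j]
    """
    adjacency = np.asarray(adjacency, dtype=float)
    labels = np.asarray(labels)
    degree = adjacency.sum(axis=1)           # k_i
    two_m = degree.sum()                     # 2m = total edge weight counted twice
    same = labels[:, None] == labels[None, :]
    expected = np.outer(degree, degree) / two_m
    return float(((adjacency - expected) * same).sum() / two_m)


# Toy graph: two triangles (0-1-2 and 3-4-5) joined by the single edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

# Hand-written candidate partitions standing in for the output of several restarts.
candidates = {
    "two_triangles": [0, 0, 0, 1, 1, 1],
    "one_cluster": [0, 0, 0, 0, 0, 0],
    "singletons": [0, 1, 2, 3, 4, 5],
}
scores = {name: modularity(A, labels) for name, labels in candidates.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 4))  # two_triangles 0.3571
```

The partition that follows the two triangles scores highest, the single-cluster partition scores exactly zero, and the all-singletons partition scores below zero.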

**PhenoGraph is robust to random resampling of the data and a wide range of values for the single parameter k**

### Visualization 

**t-SNE**

The most common dimensionality reduction method for scRNA-seq visualization is the t-distributed stochastic neighbour embedding (t-SNE; van der Maaten & Hinton, 2008). t-SNE dimensions focus on capturing local similarity at the expense of global structure. Thus, these visualizations may exaggerate differences between cell populations and overlook potential connections between these populations. A further difficulty is the choice of its perplexity parameter, as t-SNE graphs may show strongly different numbers of clusters depending on its value (Wattenberg et al, 2016).

**UMAP**

Common alternatives to t-SNE are the Uniform Manifold Approximation and Projection method (UMAP; preprint: McInnes & Healy, 2018) or graph-based tools such as SPRING (Weinreb et al, 2018). UMAP and SPRING’s force-directed layout algorithm ForceAtlas2 arguably represent the best approximation of the underlying topology (Wolf et al, 2019, Supplemental Note 4). What sets UMAP apart in this comparison is its speed and ability to scale to large numbers of cells (Becht et al, 2018). Thus, in the absence of particular biological questions, we regard UMAP as best practice for exploratory data visualization. Moreover, UMAP can also summarize data in more than two dimensions. While we are not aware of any applications of UMAP for data summarization, it may prove a suitable alternative to PCA.

# Questions

- In the immune cell cluster I have cells that have genes that are highly expressed in neurons or glia; what do I do with those?

# Plan for the analysis

## Analyze CONTROL samples first

1. Load only the control samples 
2. Quality control 
3. Normalization
4. Clusters
5. Markers

## Analyze VE samples next


1. Load only the VE samples
2. Quality control 
3. Normalization
4. Clusters
5. Markers

## CTRL + Normal

# My list

marker_genes = dict()

# immune cells 
marker_genes['Immune_cells'] = ['CX3CR1', 'P2RY12', 'CCR6','CASP4','CXCL12','CD86','CD14', # microglia
                             'TMEM119', 'CASP8', # Microglia activated
                             'PF4', 'CD74', # perivascular macrophages
                             'PF4', # perivascular macrophages activated
                            'HEXB','P2RY12','C1QA','C1QC','CTSS','C1QB','SIGLECH','GPR34','FCRLS'] #macrophages MCA  

# Neurons
marker_genes['Neurons'] = ['CBLN3','GABRA6', # Cerebellum neurons
                               'PVALB','CNPY1', 
                               'SLC6A5', 'GRM2', 'LGI2', 
                                'SLC1A6', 'CAR8', 
                               'PVALB', 'KIT', 
                               'ST8SIA6','ADGRG2','MDGA1', 'OLFM1', # Hindbrain neurons
                               'KLHL1','MREG','SPP1','NEFH',
                               'KCNG4',"GLRA1", 'TFAP2B',"CBLN4",
                              'EBF3','SEMA3A','NXPH4','TMEM72','GPR139',
                              'SLC6A5','STAC2','VAMP1','SALL3',
                              'LMX1A','RERG','TIAM2',# Immature neural 
                               'EBF2','EBF1','GLRA1', #spinal cord neurons
                               'PAX2', 'NYAP2' ,'KCNMB2',
                              'RSPO3', 'B3GAT2',
                              'PAX6','CCSAP', # Telencephalon interneurons
                              'NTF3','CPNE4','VIPR1','PDE1A','WFS1','VWC2L','CBLN1', #Telencephalon projecting neurons
                               'CDKL4', 'LAMP5','KRT12',
                           'MOXD1','TMEM255B','ASS1','ANXA11','NEFM','PAQR5', # PNS neurons
                           'STAC','MREG','CLEC18A','TIAM2','GAD1','MEIS2','RBM24'] # From the developing atlas

                             
marker_genes['Glia'] = ['TIMP4','GFAP', 'SLC7A10','MFSD2A','AGT','SLC6A11','CBS','FAM107A', #Astrocytes
                       'TTR','FOLR1','CLIC6', #Choroid epithelial cells
                       'HOPX', # Dentate gyrus radial glia-like cells
                        'NEU4','RINL', 'CCP110','SNX33','OPALIN','NINJ2', # oligodendrocytes (OPC)
                        'HAPLN2','DOCK5','KLK6','CNKSR3','TMEM2',
                       'LUM','PARP14','COL12A1', # Neural crest-like glia
                       'PDGFRA', 'C1QL1','TEX40','FBLN2','NCMAP',
                       'LRRN1','ROR2', # radial glia development paper
                        'MATN4','SCRG1','PLLP','GPR37L1','OLIG1','OLIG2','SOX10', # OPC dev paper
                       "SFRP5", # Schwann cells dev paper
                        'ALX3','ALX4','CPED1',  #neural crest dev atlas
                        'PLA2G7','SEMA4B','PPP1R3C', 'CYP2J9','ATP1B2', # astrocytes from dev paper
                        'ALDOC','ATP1A2','GPR37L1','MT3','SLC1A3','PLA2G7', # Astrocytes from MC
                       'MBP','MOBP','PTGDS']# Oligo MCA
                       

marker_genes['Vascular cells'] = ['HIGD1B','DEGS2','RGS5','FLT1','TBXA2R','CD93',# Pericytes
                                 'MGP','SLC47A1','DAPL1','DCN','FAM180A','SLC6A13', #Vascular and leptomeningeal cells
                                 'ECSCR','LY6C1',# Vascular endothelial cells
                                 'KCNJ8','ANPEP','IGF2', # Vascular smooth muscle cells
                                  'EMCN','CDH5','MYH11','SLC38A11'] # from the developing atlas


marker_genes['Epithelial cells'] = ['SLCO1A4', 'CLDN5','TTR', 'FN1','RAMP2','LY6C1', 'LY6A','CCDC153', 'GM973','PRLR','FOLR1']
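
Some genes appear more than once in a list (for example `PF4`) or under several cell types (for example `TTR` and `FOLR1`). A small helper can flag these before the lists are used for annotation. This is a sketch only: the function names `shared_markers` and `deduplicate` are hypothetical, and `example` is a short excerpt of the lists above rather than the full dictionary.

```python
from collections import defaultdict


def shared_markers(marker_genes):
    """Map each gene listed under more than one cell type to those cell types."""
    owners = defaultdict(set)
    for cell_type, genes in marker_genes.items():
        for gene in genes:
            owners[gene].add(cell_type)
    return {gene: sorted(types) for gene, types in owners.items() if len(types) > 1}


def deduplicate(marker_genes):
    """Drop repeated genes within each list, keeping first-seen order."""
    return {
        cell_type: list(dict.fromkeys(genes))
        for cell_type, genes in marker_genes.items()
    }


# Short excerpt of the lists above, for illustration only.
example = {
    "Immune_cells": ["PF4", "CD74", "PF4"],
    "Glia": ["TTR", "FOLR1", "CLIC6"],
    "Epithelial cells": ["CLDN5", "TTR", "FOLR1"],
}
print(shared_markers(example))
print(deduplicate(example)["Immune_cells"])  # ['PF4', 'CD74']
```

Genes reported by `shared_markers` are not specific to one cell type, so they deserve a second look when they drive a cluster annotation.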
                                 