# Dataset integration 

We will integrate the Neurons 5k dataset with a subset of 5k cells from the larger 10x Genomics dataset [1.3 million brain cells from E18 mouse](https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.3.0/1M_neurons).

In [1]:
suppressPackageStartupMessages({
library(dplyr)
library(patchwork)
library(Seurat)
library(SummarizedExperiment)
library(TENxBrainData)})

## Load data

In [2]:
load(file='../data/objects/a3.refseurat.RData',verbose = TRUE)
ref.sobj[["Dataset"]]<-'nr1M'

Loading objects:
  ref.sobj


In [3]:
load(file="../data/objects/a2.neur5k.RData",verbose = TRUE)
nr5k[['Dataset']]<-'nr5k'

Loading objects:
  nr5k
  ct


## Prepare integration

In [5]:
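# Raise the size limit for globals exported to future workers to 4000 MiB (about 4 GB).
# Note: `future.seed` and `warnings` set here are not options that R or future act on,
# so the 'future.seed' warning printed after the integration step is still emitted.
# That message names the option 'future.rng.onMisuse' ("ignore") as the way to disable the check.
options(future.globals.maxSize = 4000 * 1024^2, future.seed=NULL, warnings=FALSE)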

objects <- list(nr5k,ref.sobj)
features <- SelectIntegrationFeatures(object.list = objects, nfeatures = 1000)
objects <- PrepSCTIntegration(object.list = objects, anchor.features = features, verbose = FALSE)

## Integrate datasets

In [6]:
anchors <- FindIntegrationAnchors(object.list = objects, normalization.method = "SCT", anchor.features = features, verbose = FALSE)
nr.int <- IntegrateData(anchorset = anchors, normalization.method = "SCT", verbose = FALSE)

“UNRELIABLE VALUE: Future (‘future_lapply-1’) unexpectedly generated random numbers without specifying argument 'future.seed'. There is a risk that those random numbers are not statistically sound and the overall results might be invalid. To fix this, specify 'future.seed=TRUE'. This ensures that proper, parallel-safe random numbers are produced via the L'Ecuyer-CMRG method. To disable this check, use 'future.seed=NULL', or set option 'future.rng.onMisuse' to "ignore".”


In [7]:
save(nr.int,file='../data/objects/a3.integrated.RData')

## Downstream analysis

### Hands-on activity 2

---

Repeat the downstream analysis steps from the previous section, where we applied them to the Neurons 5K dataset:
1. Dimensionality reduction 
2. Clustering
3. Cell type annotation 

Could you find more cell types?

If you do not have the output from the previous sections, just load the following RData object:

In [None]:
load(file='../data/objects/a3.integrated.RData',verbose=TRUE)

#### Dimensionality reduction and clustering

In [8]:
nr.int <- RunPCA(nr.int, verbose = FALSE)
nr.int <- RunUMAP(nr.int, dims = 1:30, verbose = FALSE,spread = 1,min.dist = 1)

“The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
This message will be shown once per session”


In [None]:
markers <- FindAllMarkers(nr5k, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 1) %>% 
                group_by(cluster) %>% 
                top_n(n = 10, wt = avg_log2FC) 

In [None]:
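# For each cluster, look up its top marker genes in the cell-type table `ct`
# and label the cluster with the subclass that is matched most often.
map<-tapply(markers$gene,markers$cluster,function(mlist,...){
    gcts<-lapply(mlist,function(m,...){ct[grep(m,ct$MarkerGenes),'Subclass']}) %>% 
          unlist() %>% 
          table() %>% 
          sort(decreasing = TRUE)  # most frequent subclass first
    # Clusters whose markers match no subclass are labelled 'Undefined'.
    return(ifelse(length(gcts)==0,'Undefined',names(gcts)[1]))
}) %>% 
unlist()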
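
The marker-to-subclass lookup can be easier to follow on a toy example. The sketch below uses base R only and is not the notebook's actual data: `ct_toy`, `markers_toy` and `label_cluster` are hypothetical names, and the comma-separated layout of `MarkerGenes` is an assumption about how the reference table stores its genes. Each cluster votes with its marker genes, the counts are sorted in decreasing order, and the first name is taken as the label. Unlike the cell above, it matches gene names as literal substrings (`fixed = TRUE`) rather than as regular expressions.

```r
# Hypothetical reference table: one row per subclass,
# marker genes stored as a comma-separated string (assumed layout).
ct_toy <- data.frame(
  Subclass    = c("Excitatory", "Inhibitory", "Astrocyte"),
  MarkerGenes = c("Slc17a7,Neurod6,Satb2", "Gad1,Gad2,Dlx1", "Aqp4,Gfap,Slc1a3"),
  stringsAsFactors = FALSE
)

# Hypothetical top markers for three clusters.
markers_toy <- data.frame(
  cluster = factor(c("0", "0", "0", "1", "1", "2")),
  gene    = c("Gad1", "Gad2", "Aqp4", "Slc17a7", "Satb2", "Xyz1"),
  stringsAsFactors = FALSE
)

# Return the subclass matched by the largest number of marker genes,
# or "Undefined" when no gene matches any subclass.
label_cluster <- function(genes, ref) {
  hits <- unlist(lapply(genes, function(g) {
    ref$Subclass[grep(g, ref$MarkerGenes, fixed = TRUE)]
  }))
  if (length(hits) == 0) return("Undefined")
  counts <- sort(table(hits), decreasing = TRUE)
  names(counts)[1]
}

# One label per cluster, named by cluster id.
map_toy <- sapply(split(markers_toy$gene, markers_toy$cluster),
                  label_cluster, ref = ct_toy)
print(map_toy)
```

Because the match is a substring search, a short gene symbol can also hit longer symbols that contain it, so it is worth inspecting the labels before renaming identities.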

In [None]:
nr5k <- RenameIdents(nr5k, map)

In [None]:
options(repr.plot.width=15, repr.plot.height=7)
DimPlot(nr.int, group.by = c("Dataset"), combine = FALSE)