In [1]:
source("source/RaceID3_StemID2_class.R")

Loading required package: tsne
Loading required package: pheatmap
Loading required package: MASS
Loading required package: cluster
Loading required package: mclust
Package 'mclust' version 5.4
Type 'citation("mclust")' for citing this R package in publications.
Loading required package: flexmix
Loading required package: lattice
Loading required package: fpc
Loading required package: amap
Loading required package: RColorBrewer
Loading required package: locfit
locfit 1.5-9.1 	 2013-03-22
Loading required package: vegan
Loading required package: permute
This is vegan 2.4-6
Loading required package: Rtsne
Loading required package: scran
Loading required package: BiocParallel
Loading required package: SingleCellExperiment
Loading required package: SummarizedExperiment
Loading required package: GenomicRanges
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘pack

In [8]:
x <- read.csv("source/E725.matrix.Seb_NewData_E725.3.quantif", sep="\t", header=TRUE, row.names=1)

In [21]:
prdata <- x[grep("ERCC",rownames(x),invert=TRUE),]
dim(x)
dim(prdata)

sc <- SCseq(prdata)

# Normalisation

There is no real way to judge whether a normalisation is "good" or "bad" without running the subsequent analysis and clustering stages and seeing how noisy the resulting clusters are.

For this reason, we will iterate over the various possible normalisation strategies in RaceID3 and perform analyses in parallel.

### RaceID3 Normalisation strategies:

 1. **Minimum rescaling** - Divide every count in a cell column by that cell's total transcript count, repeat for each column, and then multiply the entire matrix by the smallest total transcript count observed across all cells.
 
 2. **Downsampling** - All cell columns are downsampled to the total transcript count of the smallest cell. This is repeated `dsn` times and the results are averaged. With `dsn = 1`, sampling noise is comparable across cells; for higher values of `dsn`, the averaged data increasingly resembles minimum rescaling. Downsampling is only recommended if batch effects due to different library complexities are observed in the data.
 
 3. **(Experimental) Size-factor** - This method pools transcript counts across cells in order to sidestep the problems of zero-inflated data. The following steps are performed:
   a. Pools of similarly sized cells are defined:
      i. Cells are split into two groups, odd and even, according to their total transcript numbers (as a form of random splitting?) and then placed in a line so that cells with a similar number of transcripts are next to each other. (This is to reduce transcript abundance bias within a pool.)
      ii. This line is then looped to form a ring. Starting from the 12:00 position of the ring, a sliding window moves clockwise, and each group of N adjacent cells forms a 'pool'. Each cell therefore belongs to N pools as the window passes over it.
   b. Expression values are summed across all cells in a pool.
   c. Cells in a given pool are normalised to an average pseudo-cell generated from the average expression profiles of these cells.
   d. Each cell is part of N pools, assuming that each pool has N cells, and so each cell can be described by a linear system of N average pseudo-cells.
  
 4. **(Experimental) Housekeeping Normalisation** - Rescaling is applied using a group of housekeeping genes, which here means any genes that show minimal variability across all cells. Only genes with at least 1 transcript in 50% of all cells are considered. A background variance-mean dependence model is fitted, genes within the 25%-90% range of the expression profile are extracted, and of these only the genes below a minimum baseline of variability (i.e. below the regression line) are kept.
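
The first two strategies can be illustrated on a toy matrix. The following is a minimal base-R sketch, not the RaceID3 implementation: the matrix values and the helper names `min_rescale` and `downsample_counts` are made up for illustration, and drawing transcripts without replacement is an assumption about how the downsampling is done.

```r
# Toy count matrix: genes in rows, cells in columns (made-up numbers).
# Cell totals are 100, 10 and 50, so the smallest cell has 10 transcripts.
counts <- matrix(
    c(10, 0,  5,
      20, 4, 15,
      70, 6, 30),
    nrow = 3, byrow = TRUE,
    dimnames = list(paste0("gene", 1:3), paste0("cell", 1:3))
)

# Minimum rescaling: per-cell fractions multiplied by the smallest cell total
min_rescale <- function(m) {
    t(t(m) / colSums(m)) * min(colSums(m))
}

# Downsampling (assumed: sampling without replacement): draw as many
# transcripts per cell as the smallest cell has, repeat dsn times, average
downsample_counts <- function(m, dsn = 1, seed = 17000) {
    set.seed(seed)
    target <- min(colSums(m))
    one_pass <- function() {
        apply(m, 2, function(cell) {
            pool <- rep(seq_along(cell), times = cell)  # one entry per transcript
            tabulate(pool[sample.int(length(pool), target)], nbins = length(cell))
        })
    }
    out <- Reduce(`+`, replicate(dsn, one_pass(), simplify = FALSE)) / dsn
    dimnames(out) <- dimnames(m)
    out
}

rescaled <- min_rescale(counts)
ds1      <- downsample_counts(counts, dsn = 1)
ds50     <- downsample_counts(counts, dsn = 50)
```

Every column of `rescaled`, `ds1` and `ds50` sums to the smallest cell total (10 here). `ds1` holds integer counts that carry sampling noise, whereas averaging over 50 draws moves each entry towards its expected value, which is the corresponding entry of `rescaled`.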
 
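
The pooling idea behind the size-factor strategy can also be sketched in base R. This is a simplified illustration under stated assumptions, not the method used by RaceID3 or scran: the data are simulated, all variable names are hypothetical, a single pool size is used, the pseudo-cell is taken as the average over all cells, and the linear system is solved directly rather than by weighted least squares. The pool size is chosen to share no common factor with the number of cells so that the system has a unique solution.

```r
set.seed(42)
n_cells   <- 12
n_genes   <- 2000
pool_size <- 5   # coprime with n_cells, so the pool matrix is invertible

# Simulated data: each cell has a true size factor scaling a shared profile
true_sf <- runif(n_cells, 0.5, 2)
gene_mu <- rgamma(n_genes, shape = 2, rate = 0.02)
counts  <- sapply(true_sf, function(s) rpois(n_genes, s * gene_mu))

# (a) Order cells by total count; odd ranks go up one side of the ring and
#     even ranks come back down the other, so neighbours have similar totals
ord  <- order(colSums(counts))
ring <- c(ord[seq(1, n_cells, by = 2)], rev(ord[seq(2, n_cells, by = 2)]))

# Sliding window around the ring: one pool per starting position
pools <- lapply(seq_len(n_cells), function(start) {
    ring[((start - 1 + 0:(pool_size - 1)) %% n_cells) + 1]
})

# (b, c) Sum each pool and compare it to an average pseudo-cell
pseudo <- rowMeans(counts)
keep   <- pseudo > 0            # avoid dividing by zero
A <- matrix(0, n_cells, n_cells)
b <- numeric(n_cells)
for (k in seq_along(pools)) {
    A[k, pools[[k]]] <- 1
    pooled <- rowSums(counts[, pools[[k]]])
    b[k]   <- median(pooled[keep] / pseudo[keep])
}

# (d) Each pool gives one equation: the sum of its cells' size factors
#     equals the pool-level estimate. Solve for the per-cell factors.
sf <- qr.solve(A, b)
```

In this simulation the recovered `sf` values are relative size factors, so they should track `true_sf` up to a common scale.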
### Filtering parameters

 - `mintotal` - A cell's total transcript count must exceed this value, otherwise the cell is discarded
 - `minexpr` - Minimum expression a gene must reach in at least `minnumber` cells to be flagged as detected
 - `maxexpr` - Max expression a gene must have "..." detection.
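
As a rough illustration of the first two parameters, here is a minimal base-R sketch on a toy matrix. It is not the `filterdata` implementation: the helper `filter_counts` and the data are hypothetical, the filter is applied to raw counts for simplicity, and `maxexpr` is left out because its exact rule is not spelled out above.

```r
# Toy count matrix: genes in rows, cells in columns (made-up numbers).
# Cell totals are 12, 9, 2 and 12.
counts <- matrix(
    c(0, 1, 0, 2,
      6, 7, 1, 8,
      5, 0, 0, 1,
      1, 1, 1, 1),
    nrow = 4, byrow = TRUE,
    dimnames = list(paste0("gene", 1:4), paste0("cell", 1:4))
)

# Hypothetical helper mirroring the parameter descriptions above
filter_counts <- function(m, mintotal, minexpr, minnumber) {
    # Keep cells whose total transcript count exceeds mintotal
    m <- m[, colSums(m) > mintotal, drop = FALSE]
    # Keep genes reaching minexpr in at least minnumber of the remaining cells
    m[rowSums(m >= minexpr) >= minnumber, , drop = FALSE]
}

kept        <- filter_counts(counts, mintotal = 5, minexpr = 5, minnumber = 1)
kept_strict <- filter_counts(counts, mintotal = 5, minexpr = 5, minnumber = 2)
```

With these toy values, `cell3` is dropped for having too few transcripts, `kept` retains `gene2` and `gene3`, and raising `minnumber` to 2 leaves only `gene2`.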
     
  
### Questions:

 - Why is the sampling noise more comparable across all cells for lower `dsn` values?
 - CelSeq2 data are read off 192 plates. Do separate plates constitute different batches, or is there more to it than just the plates? (e.g. is it possible to run half a plate on one day and the other half on another day?)
 

In [35]:
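# A quick profile of our data via scater
library(SingleCellExperiment)

# Assays should be matrix-like, so convert the data.frame first
# (the unconverted call produced the error recorded below)
sce <- SingleCellExperiment(assays = list(counts = as.matrix(x)))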



“first element used of 'length.out' argument”

ERROR: Error in seq_len(ncol(assay)): argument must be coercible to non-negative integer


Let us determine how many genes are lost by varying:
 1. minexpr ( 1, 2, 5, 10 )
 2. maxexpr ( 100, 500, 1000 )
 3.
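
One way to tabulate such a sweep is sketched below on simulated counts. Everything here is illustrative: the data are random, `genes_kept` is a hypothetical stand-in for the gene filter, and it assumes that `maxexpr` discards any gene whose maximum count reaches that value, which is a guess at the rule rather than something confirmed by the notes above.

```r
set.seed(1)
# Simulated, overdispersed counts: 200 genes x 30 cells (made up)
counts <- matrix(rnbinom(200 * 30, mu = 20, size = 0.5), nrow = 200)

# Hypothetical helper: number of genes surviving both expression filters.
# ASSUMPTION: a gene is dropped if its maximum count is >= maxexpr.
genes_kept <- function(m, minexpr, maxexpr, minnumber = 1) {
    detected <- rowSums(m >= minexpr) >= minnumber
    capped   <- apply(m, 1, max) < maxexpr
    sum(detected & capped)
}

# All combinations of the candidate thresholds listed above
grid <- expand.grid(minexpr = c(1, 2, 5, 10), maxexpr = c(100, 500, 1000))
grid$kept <- mapply(
    function(lo, hi) genes_kept(counts, lo, hi),
    grid$minexpr, grid$maxexpr
)
grid$lost <- nrow(counts) - grid$kept
print(grid)
```

On the real object, the same table could presumably be filled by calling `filterdata` for each row of `grid` and reading the number of rows of the resulting `@fdata` slot.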

Hello Wendy, Amy and Kerstin,


**It** has been some time since I asked if there were job vacancies at your language school. I started working as **a** German teacher in January and the sudden workload kept me from applying **to** other language schools. However, I'm still very much interested in working at your language school since I like working with creative teaching strategies and not being bound **to** a book.  I studied English and **History** to become a teacher and did a DaF course at the TTI in Freiburg last **October**. I'm currently working at Euer Sprachzentrum, teaching **classes 20hrs/week** and enjoy everything about the job. Attached to this **e-mail** you can find my CV.

In case you are interested I would be happy to introduce myself in person or send any additional information you might need!

Best wishes,

Katharina

Hello Wendy, Amy and Kerstin,

It has been some time since I last asked if there were job vacancies at your language school. I began working as a German teacher in January, but the sudden, demanding workload kept me from applying to other language schools. However, I'm still very much interested in working at your school, since I enjoy creative teaching strategies that do not bind me so rigidly to a book. 

I studied English and History to become a teacher and completed a DaF course at the TTI in Freiburg last October. Currently, I am working at Euer Sprachzentrum, where I teach classes 20hrs/week and enjoy every aspect of it.

If I have roused your interest, I would be more than happy to further introduce myself in person or to send along any additional information that you might need! Attached below is my CV.

Best,

Katharina

(some notes: 
 - it's not good to start two consecutive sentences with 'I'
 )

In [25]:

res <- filterdata(sc, mintotal=3000, minexpr=1, maxexpr=500, 
           downsample = FALSE, sfn = FALSE, hkn = FALSE,
           dsn = 1, rseed = 17000, CGenes = NULL, FGenes = NULL
)



In [26]:
dim(res@fdata)

In [20]:
test_minexpr <- function(obj, min_total, min_nummer, min_expr, max_expr){
    return(
        filterdata(
            obj, mintotal= min_total, 
            minexpr= min_expr, minnumber= min_nummer, maxexpr=max_expr, 
            downsample=FALSE, sfn=FALSE, hkn=FALSE, 
            dsn=1, rseed=17000, CGenes=NULL, FGenes=NULL, ccor=.4
        )
    )
}

## min_expr = 5, minnumber = 1, maxexpr = 500
test_downsample_defaults <- function(obj, dsn_val){
    return(
        filterdata(
            obj, downsample=TRUE, sfn=FALSE, hkn=FALSE, 
            dsn= dsn_val, rseed=17000, CGenes=NULL, FGenes=NULL, ccor=.4
        )
    )
}


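# Planned test_minexpr runs; arguments are (min_total, min_nummer, min_expr, max_expr)

# Vary min number of cells
# test_minexpr(sc, 3000, 1, 5, 500)
# test_minexpr(sc, 3000, 2, 5, 500)
# test_minexpr(sc, 3000, 5, 5, 500)

# Vary min expression
# test_minexpr(sc, 3000, 1, 1, 500)
# test_minexpr(sc, 3000, 1, 2, 500)
# test_minexpr(sc, 3000, 1, 5, 500)

# Vary min total
# test_minexpr(sc, 4000, 1, 5, 500)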



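# Housekeeping normalisation, mirroring test_downsample_defaults but with hkn switched on
test_hkn <- function(obj) filterdata(
    obj, downsample=FALSE, sfn=FALSE, hkn=TRUE,
    dsn=1, rseed=17000, CGenes=NULL, FGenes=NULL, ccor=.4
)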



sc_minrescale <- filterdata(
    sc, mintotal=3000, 
    minexpr=5, minnumber=1, maxexpr=500, 
    downsample=FALSE, sfn=FALSE, hkn=FALSE, 
    dsn=1, rseed=17000, CGenes=NULL, FGenes=NULL, ccor=.4
)

sc_downsample <- filterdata(
    sc, mintotal=3000, 
    minexpr=5, minnumber=1, maxexpr=500, 
    downsample=TRUE, sfn=FALSE, hkn=FALSE, 
    dsn=1, rseed=17000, CGenes=NULL, FGenes=NULL, ccor=.4
)

sc_sfn <- filterdata(
    sc, mintotal=3000, 
    minexpr=5, minnumber=1, maxexpr=500, 
    downsample=FALSE, sfn=TRUE, hkn=FALSE, 
    dsn=1, rseed=17000, CGenes=NULL, FGenes=NULL, ccor=.4
)

sc_hkn <- filterdata(
    sc, mintotal=3000,
    minexpr=5, minnumber=1, maxexpr=500,
    downsample=FALSE, sfn=FALSE, hkn=TRUE,
    dsn=1, rseed=17000, CGenes=NULL, FGenes=NULL, ccor=.4
)

sc_sfn_hkn <- filterdata(
    sc, mintotal=3000,
    minexpr=5, minnumber=1, maxexpr=500,
    downsample=FALSE, sfn=TRUE, hkn=TRUE,
    dsn=1, rseed=17000, CGenes=NULL, FGenes=NULL, ccor=.4
)


ERROR: Error in filterdata(sc, mintotal = 3000, minexpr = 5, minnumber = 1, maxexpr = Inf, : object 'sc' not found
