# KASP assays genotyping

### Load the libraries

In [None]:
library(ape)
library(adegenet)
library(ggplot2)
library(RWeka)
library(caret)

### Load the QTL results

Top 20 markers associated with phenotypic sex

In [None]:
data <- read.delim('GBS.markers.tsv', header=TRUE, row.names="Sample.ID")
head(data)

### Load the KASP results

503 individuals tested for the 20 markers

In [None]:
kasp <- read.delim('KASP.results.tsv', header=TRUE, row.names="Sample.ID")
kasp <- kasp[-1] # Drop the first column (Sex), keep only the marker genotypes
head(kasp)

### Convert the QTL result to a genind object (again)

In [None]:
haplo <- data[-1] # Remove actual Sex
row.names(haplo) <- row.names(data) # Name the row
head(haplo)

In [None]:
obj_pop <- df2genind(haplo, ploidy=2, sep='/', ncode=1)

## Predict sex

### Clustering the result by PCA

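`find.clusters` implements the clustering procedure used in Discriminant Analysis of Principal Components (DAPC): the data are first transformed by a Principal Components Analysis (PCA), then clustered with k-means. Here we ask for 2 clusters (`n.clust=2`), one per sex.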
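The two steps behind this kind of clustering (a PCA transformation followed by k-means) can be sketched with base R. This is only an illustration on simulated data, not the code `find.clusters` runs internally; the matrix `toy` and its dimensions are assumptions made up for the example.

```r
# Toy illustration: PCA followed by k-means, on simulated (hypothetical) data
set.seed(42)

# 40 individuals x 5 numeric features; the second half is shifted to form a second group
toy <- rbind(
  matrix(rnorm(20 * 5, mean = 0), nrow = 20),
  matrix(rnorm(20 * 5, mean = 4), nrow = 20)
)

# Step 1: transform the data with a principal component analysis
pca <- prcomp(toy, center = TRUE, scale. = FALSE)
scores <- pca$x[, 1:2]  # keep the first two principal components

# Step 2: run k-means with 2 clusters on the retained components
km <- kmeans(scores, centers = 2, nstart = 10)

# Number of individuals assigned to each cluster
table(km$cluster)
```

The cluster numbers returned by k-means carry no meaning by themselves; they only say which individuals were grouped together.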

In [None]:
cluster <- find.clusters(obj_pop, n.pca=2000, n.clust=2, n.iter=1e6)

In [None]:
cluster

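The variable `cluster$grp` holds the cluster assigned to each individual. In this run: 2 ~ female, 1 ~ male. Cluster numbers are arbitrary, so check the mapping against the known sex whenever the clustering is re-run.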
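One way to check which cluster corresponds to which sex is to cross-tabulate the cluster labels against the known phenotype. The sketch below uses small hypothetical vectors (`actual` and `grp`, with sex coded as "F"/"M") in place of `data$Sex` and `cluster$grp`; the coding in the real table may differ.

```r
# Hypothetical stand-ins for data$Sex and cluster$grp
actual <- c("F", "F", "M", "M", "F", "M")
grp    <- factor(c(2, 2, 1, 1, 2, 2))

# Cross-tabulate cluster labels (rows) against the known sex (columns)
tab <- table(cluster = grp, sex = actual)
tab

# Label each cluster with the sex that is most frequent in it
label_map <- setNames(colnames(tab)[apply(tab, 1, which.max)], rownames(tab))
label_map
```

With the real objects, the same table would be built with `table(cluster = cluster$grp, sex = data$Sex)`.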

In [None]:
data2 <- data
data2$predicted <- cluster$grp
head(data2)

Plot the actual phenotypic sex _versus_ the predictions.

In [None]:
ggplot(data2, aes(Sex, predicted)) + geom_jitter(width = 0.1, height=0.1, aes(colour=factor(Sex)), alpha=0.6, size=2) + theme_bw()

### Use a Machine Learning approach: C4.5

[C4.5](https://en.wikipedia.org/wiki/C4.5_algorithm) is an algorithm used to generate a "decision tree". [Read more](https://www.quora.com/What-is-the-C4-5-algorithm-and-how-does-it-work).
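At each node, C4.5 picks the attribute with the highest gain ratio, i.e. the information gain divided by the split information. The sketch below computes it for a single categorical marker. It is a simplified illustration, not the Weka implementation: the helper names `entropy` and `gain_ratio` and the toy genotypes are assumptions, and pruning and numeric attributes are left out.

```r
# Shannon entropy (in bits) of a vector of labels
entropy <- function(labels) {
  p <- as.numeric(table(labels)) / length(labels)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Gain ratio of a categorical attribute with respect to the class labels
gain_ratio <- function(attribute, labels) {
  # Expected entropy of the labels after splitting on the attribute
  cond_entropy <- sum(sapply(split(labels, attribute), function(g) {
    length(g) / length(labels) * entropy(g)
  }))
  info_gain  <- entropy(labels) - cond_entropy
  split_info <- entropy(attribute)
  if (split_info == 0) return(0)  # attribute with a single value: no split possible
  info_gain / split_info
}

# Hypothetical toy data: one marker genotype and the sex of eight individuals
genotype <- c("A/A", "A/A", "A/A", "A/A", "A/G", "A/G", "A/G", "A/G")
sex      <- c("F",   "F",   "F",   "F",   "M",   "M",   "M",   "F")

gain_ratio(genotype, sex)
```

A marker that separates the sexes perfectly would score 1 here, and a marker unrelated to sex would score close to 0.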

The recommended approach (if you have enough samples) is to split your learning set (true data, i.e. QTL results) into a training set and a test set to evaluate the quality of your prediction...

In [None]:
train_rows <- createDataPartition(data$Sex, list=FALSE)
train_set <- data[train_rows, ]
test_set <- data[-train_rows, ]

In [None]:
head(train_set)

In [None]:
head(test_set)

Then the C4.5 algorithm is trained (J48 is Weka's implementation of C4.5, which caret calls through RWeka).

In [None]:
fit.c45 <- train(Sex ~ ., data=train_set, method='J48')
fit.c45

The model was evaluated on 25 bootstrap resamples (caret's default resampling scheme) to identify the tuning parameters that maximise **Accuracy**.
It's time to use a real test set to evaluate the "real" accuracy...

In [None]:
pred <- predict(fit.c45, newdata=test_set)
pred
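To turn these predictions into an accuracy figure, they can be compared with the known sex of the test individuals. The sketch below uses hypothetical vectors (`actual` and `predicted`, with sex coded as "F"/"M") as stand-ins for `test_set$Sex` and `pred`.

```r
# Hypothetical stand-ins for test_set$Sex and the output of predict()
actual    <- factor(c("F", "F", "M", "M", "F", "M", "M", "F"), levels = c("F", "M"))
predicted <- factor(c("F", "M", "M", "M", "F", "M", "F", "F"), levels = c("F", "M"))

# Confusion matrix: rows are predictions, columns are the known sex
conf_mat <- table(predicted = predicted, actual = actual)
conf_mat

# Accuracy: proportion of individuals on the diagonal (correct predictions)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
accuracy
```

With the real objects this would be `table(predicted = pred, actual = test_set$Sex)`. caret also provides `confusionMatrix()`, which reports accuracy along with further statistics.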

Unfortunately, we don't have enough data to run a proper test. However, because we have a "real" training set (the QTL data) and a real set to apply the model to (KASP), we can skip this step for now...

In [None]:
fit.c45 <- train(Sex ~ ., data=data, method='J48')
pred <- predict(fit.c45, newdata=data)
pred

In [None]:
plot(pred)

The model is actually using this decision tree to make a decision:

In [None]:
fit.c45$finalModel

So how accurate are those results?

We can add those predictions to the clustering table.

In [None]:
data2$predicted2 <- pred
head(data2)

Then plot the results: Actual phenotypic sex _versus_ the C4.5 predictions.

In [None]:
ggplot(data2, aes(Sex, predicted2)) + geom_jitter(width = 0.1, height=0.1, aes(colour=factor(Sex)), alpha=0.6, size=2) + theme_bw()

Clustering predictions _versus_ the C4.5 predictions.

In [None]:
ggplot(data2, aes(predicted2, predicted)) + geom_jitter(width = 0.1, height=0.1, aes(colour=factor(Sex)), alpha=0.6, size=2) + theme_bw()

## Application

Now that we have a "preferred" approach, can you predict the sex of our 503 new samples?

In [None]:
head(kasp)

In [None]:
pred <- predict(fit.c45, newdata=kasp)
pred

In [None]:
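# Reload the KASP table, this time keeping the Sex column for comparison
kasp <- read.delim('KASP.results.tsv', header=TRUE, row.names="Sample.ID")

# Attach the C4.5 predictions and plot them against the recorded sex
kasp$predicted <- pred
ggplot(kasp, aes(Sex, predicted)) + geom_jitter(width = 0.1, height=0.1, aes(colour=factor(Sex)), alpha=0.6, size=2) + theme_bw()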

In [None]:
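# List the samples whose predicted sex differs from the recorded sex
# (as.character avoids errors when comparing factors with different level sets)
kasp[which(as.character(kasp$Sex) != as.character(kasp$predicted)),]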

Conclusion?