# Mice data exercise

Use the mice data you filtered in the exercise of GWAS3, or if you do not have it, use the initial mice data.

**1 - Calculate the kinship matrix and plot the values on the histogram.**



<img src="../Images/bash.png" alt="Bash" width="40"> Switch to the bash kernel

In [3]:
ln -sf ../../Data
mkdir -p Results/GWAS4

We will use the mice data that went through QC in the previous exercise (you can also use the initial mice data if you prefer). We apply the `--make-king-table` option in PLINK2 to calculate the KING kinship estimate for every pair of samples.

In [None]:
plink2 --bfile Results/GWAS3/mice_QCA \
       --make-king-table \
       --out Results/GWAS4/KING

Remember that the table from PLINK2 contains: 

- `FID1`: Family ID of the first individual.
- `IID1`: Individual ID of the first individual.
- `FID2`: Family ID of the second individual.
- `IID2`: Individual ID of the second individual.
- `NSNP`: The number of SNPs used to calculate the kinship between the two individuals.
- `HETHET`: The proportion of the SNPs used at which both individuals are heterozygous (often useful for checking genotyping quality). PLINK2 reports a count instead if the `counts` modifier is given.
- `IBS0`: The proportion of the SNPs used at which the two individuals share no allele identical by state (IBS0), i.e. they are opposite homozygotes.
- `KINSHIP`: The KING-robust kinship estimate for the pair: roughly 0.5 for duplicates or identical twins, 0.25 for first-degree relatives and 0.125 for second-degree relatives.

Let's print the first few rows of `KING.kin0`, and then count how many distinct individuals appear (as the first member) in at least one pair with kinship above 0.2:

In [None]:
cat Results/GWAS4/KING.kin0 | head -5

It looks like there are mostly related pairs!

In [None]:
# Skip the header (NR > 1): its "KINSHIP" label would be compared as a string and counted
awk 'NR > 1 && $8 > 0.2 { print $2 }' Results/GWAS4/KING.kin0 | sort -u | wc -l
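To see relationship categories rather than a single count, the pairs can be binned by kinship. The sketch below is a minimal example on a made-up table (the temporary file and all its values are hypothetical). The cut-offs are the conventional KING inference thresholds (above 0.354 for duplicates or identical twins, 0.177 to 0.354 for first degree, 0.0884 to 0.177 for second degree, 0.0442 to 0.0884 for third degree). They assume an outbred population, so for these mice the labels should be read with caution.

```bash
# Hypothetical toy table with the same 8 columns as KING.kin0
toy=$(mktemp)
cat > "$toy" <<'EOF'
#FID1 IID1 FID2 IID2 NSNP HETHET IBS0 KINSHIP
F1 A F1 B 1000 0.20 0.00 0.49
F1 A F2 C 1000 0.15 0.01 0.26
F1 B F2 C 1000 0.14 0.01 0.24
F2 C F3 D 1000 0.10 0.03 0.12
F1 A F3 D 1000 0.05 0.09 -0.08
EOF

# Skip the header (NR > 1) and label each pair by its kinship band
categories=$(awk 'NR > 1 {
    k = $8
    if      (k > 0.354)  c = "duplicate_or_MZ_twin"
    else if (k > 0.177)  c = "first_degree"
    else if (k > 0.0884) c = "second_degree"
    else if (k > 0.0442) c = "third_degree"
    else                 c = "unrelated"
    n[c]++
} END { for (c in n) print c, n[c] }' "$toy" | sort)

echo "$categories"
rm -f "$toy"
```

To try it on the real output, replace `"$toy"` with `Results/GWAS4/KING.kin0` and drop the here-document.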


---------------

Generation of the plot of KING values

<img src="../Images/R.png" alt="R" width="40"> Switch to the R kernel.


In [None]:
suppressMessages(suppressWarnings(library(ggplot2)))

options(repr.plot.width = 9, repr.plot.height = 4)

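# Read data into R. The header line of the table starts with '#', so we set
# the comment character to one that does not occur in the file.
relatedness <- read.table("Results/GWAS4/KING.kin0", header=TRUE, comment.char = '|')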

head(relatedness)

**That is a crazy histogram!**

In [None]:
hist.king <- ggplot(relatedness, aes(x=KINSHIP)) +
  geom_histogram(binwidth = 0.02, col = "black", fill="tomato") + 
  labs(title = "Histogram of relatedness (KING)") + 
  xlab("KING kinship") + 
  ylab("Frequency (log scale)") + 
  theme_bw() +
  scale_y_log10() +
  theme(axis.title=element_text(size=14), 
        axis.text=element_text(size=13),
        plot.title=element_text(size=15)) 

# Extract the bin coordinates and counts of the plot
bin_data <- ggplot_build(hist.king)$data[[1]]

# Add a count label above each non-empty bar.
# scale_y_log10() already log-transforms y, so we pass the raw count.
hist.king + 
  geom_text(data = subset(bin_data, count > 0), 
            aes(x = xmin + (xmax - xmin) / 2, 
                y = count,
                label = count), 
            vjust = -0.5, # Place the text just above the bar
            size = 4, 
            color = "black")

The quality nevertheless seems good: about the same number of SNPs was used for every pair of samples (this plot takes some time to draw, as there are hundreds of thousands of points).

In [None]:
# Relatedness plot
plot.relatedness <- ggplot(relatedness) +
  geom_point(aes(x=NSNP, y=KINSHIP), size=5, alpha=.25) + 
  ylim(-.1,.4) +
  labs(x = "Number of SNPs used", y = "KING kinship", title = "Check for genotyping quality") + 
  theme_bw() +
  theme(axis.title=element_text(size=14), 
        axis.text=element_text(size=13), 
        legend.text= element_text(size=13), 
        legend.title=element_text(size=14), 
        plot.title=element_text(size=15))

show(plot.relatedness)
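Because the scatter plot is slow, a quick numeric check of the `NSNP` column gives the same information. This is a minimal sketch to run from the bash kernel. The toy table and its values are hypothetical stand-ins for `Results/GWAS4/KING.kin0`.

```bash
# Hypothetical toy table with the same 8 columns as KING.kin0
toy=$(mktemp)
cat > "$toy" <<'EOF'
#FID1 IID1 FID2 IID2 NSNP HETHET IBS0 KINSHIP
F1 A F1 B 2950 0.20 0.00 0.45
F1 A F2 C 2875 0.04 0.12 -0.15
F1 B F2 C 2990 0.05 0.11 -0.12
EOF

# Track the smallest and largest NSNP (column 5) over all pairs, skipping the header
nsnp_range=$(awk 'NR == 2 { min = max = $5 }
    NR > 1 { if ($5 < min) min = $5; if ($5 > max) max = $5 }
    END { print "min=" min, "max=" max }' "$toy")

echo "$nsnp_range"
rm -f "$toy"
```

A narrow range between the minimum and the maximum means that no pair was compared on far fewer SNPs than the others.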

**2 - Why do you get those values in the plot? What could be happening? Remember those are mice!**

One likely reason for such a crazy histogram is heavy inbreeding. This is indeed our case: all the mice descend from only eight founders! The KING estimator assumes an outbred population, which is usually a fair assumption for humans. Something else can happen in this scenario: a single founder may be responsible for a large share of the descendants. Genotype frequencies then differ so much between groups of animals that KING returns negative values for pairs drawn from different groups. Those *pairs with very negative relatedness* behave almost like an outgroup within the dataset.
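To see how widespread the negative estimates are, the pairs with `KINSHIP` below zero can be counted directly. The following is a minimal sketch to run from the bash kernel. The toy table and its values are hypothetical stand-ins for `Results/GWAS4/KING.kin0`.

```bash
# Hypothetical toy values standing in for Results/GWAS4/KING.kin0
toy=$(mktemp)
cat > "$toy" <<'EOF'
#FID1 IID1 FID2 IID2 NSNP HETHET IBS0 KINSHIP
F1 A F1 B 1000 0.20 0.00 0.45
F1 A F2 C 1000 0.04 0.12 -0.15
F1 B F2 C 1000 0.05 0.11 -0.12
F2 C F2 D 1000 0.18 0.01 0.30
EOF

# Count pairs with a negative estimate and report their share of all pairs
neg_summary=$(awk 'NR > 1 { total++; if ($8 < 0) neg++ }
    END { printf "%d of %d pairs negative (%.1f%%)", neg, total, 100 * neg / total }' "$toy")

echo "$neg_summary"
rm -f "$toy"
```

In the toy table the negative pairs are also the ones with the highest `IBS0`, i.e. many opposite homozygotes, which is the pattern expected between animals from different groups.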

**3 - Now try instead to use `plink` with the option `--genome`. This calculates the IBD estimator called pi_hat. You need `--bfile` for the input data and `--out` for the name of the output table.**

<img src="../Images/bash.png" alt="Bash" width="40"> Switch to the bash kernel.

In [None]:
plink --bfile Results/GWAS3/mice_QCA \
      --genome \
      --out Results/GWAS4/pihat

**4 - Plot the histogram again using the new table (column `PI_HAT`). The values on the x axis should now be bounded between 0 and 1.**

<img src="../Images/R.png" alt="R" width="40"> Switch to the R kernel.


In [None]:
# Generate a plot to assess the type of relationship.
suppressMessages(suppressWarnings(library(ggplot2)))

options(repr.plot.width = 9, repr.plot.height = 4)

# Read data into R 
relatedness <- read.table("Results/GWAS4/pihat.genome", header=TRUE)

head(relatedness)

In [None]:
hist.pihat <- ggplot(relatedness, aes(x=PI_HAT)) +
  geom_histogram(binwidth = 0.02, col = "black", fill="tomato") + 
  labs(title = "Histogram of relatedness (PI hat)") + 
  xlab("PI hat (proportion IBD)") + 
  ylab("Frequency (log scale)") + 
  theme_bw() +
  scale_y_log10() +
  theme(axis.title=element_text(size=14), 
        axis.text=element_text(size=13),
        plot.title=element_text(size=15)) 

# Extract the bin coordinates and counts of the plot
bin_data <- ggplot_build(hist.pihat)$data[[1]]

# Add a count label above each non-empty bar.
# scale_y_log10() already log-transforms y, so we pass the raw count.
hist.pihat + 
  geom_text(data = subset(bin_data, count > 0), 
            aes(x = xmin + (xmax - xmin) / 2, 
                y = count,
                label = count), 
            vjust = -0.5, # Place the text just above the bar
            size = 4, 
            color = "black")

**5 - Values around 0.25 are second-degree relationships (e.g. half siblings), and values around 0.5 are parent-child pairs and full siblings. Usually, a pi_hat well above 0.5 is a sign of duplicates or of inbreeding (the latter is likely our case; otherwise we would have really bad data with too many duplicates). We also have some sample pairs with pi_hat = 1. What could those samples be?**

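Those could be pairs of identical (monozygotic) twins! Fraternal twins would instead show a pi_hat around 0.5, like any other full siblings.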
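The bands discussed above can be tallied straight from the `.genome` table. This is a minimal sketch to run from the bash kernel. The toy rows are hypothetical, and the band edges (0.1875, 0.375, 0.625, 0.99) are illustrative assumptions rather than official cut-offs. The `PI_HAT` column is located by its header name, so the same command should work on `Results/GWAS4/pihat.genome`.

```bash
# Hypothetical toy rows laid out like a PLINK .genome file
toy=$(mktemp)
cat > "$toy" <<'EOF'
FID1 IID1 FID2 IID2 RT EZ Z0 Z1 Z2 PI_HAT PHE DST PPC RATIO
F1 A F1 B OT 0 0.00 0.00 1.00 1.0000 -1 1.000 1.000 NA
F1 A F2 C OT 0 0.25 0.50 0.25 0.5000 -1 0.850 1.000 8.1
F1 B F2 C OT 0 0.20 0.00 0.80 0.8000 -1 0.930 1.000 NA
F2 C F3 D OT 0 0.50 0.50 0.00 0.2500 -1 0.800 1.000 3.2
F1 A F3 D OT 0 1.00 0.00 0.00 0.0000 -1 0.700 0.500 2.0
EOF

# Find the PI_HAT column from the header, then bin every pair (illustrative band edges)
pihat_bands=$(awk 'NR == 1 { for (i = 1; i <= NF; i++) if ($i == "PI_HAT") col = i; next }
{
    p = $col
    if      (p >= 0.99)  b = "identical"
    else if (p > 0.625)  b = "above_first_degree"
    else if (p > 0.375)  b = "first_degree"
    else if (p > 0.1875) b = "second_degree"
    else                 b = "more_distant"
    n[b]++
} END { for (b in n) print b, n[b] }' "$toy" | sort)

echo "$pihat_bands"
rm -f "$toy"
```

Pairs landing in the `identical` band are the candidates for identical twins or duplicated samples.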