# Phenotype Exploration
### Daniele Filiault
## This notebook helps us understand the phenotype we are using for GWAS

A GWAS is only as good as the phenotype that you use.  We will make two simple plots that can help us understand our phenotype better before we run GWAS.

The trait we will work on together is the flowering time of 200 *Arabidopsis thaliana* accessions (genotyped individuals) grown at 16 °C in a growth chamber. (This is a subset of a much larger dataset: https://doi.org/10.1016/j.cell.2016.05.063.)  The phenotype is the number of days between seed planting and the opening of the first flower.  The values we are using are the means of multiple replicates, which is why we observe fractional days in the data.

# 1.  Initial setup steps

## 1a. Prepare environment
Load the packages we need into R

In [None]:
library(ggplot2) #both packages are used for making maps
library(maps)

## 1b. Define input variables


In [None]:
# the phenotypes
pheno.file <- ("./data/subset_flowering_time_16.csv") # two columns giving ecotypeid and phenotype

# the collection location for each accession (latitude and longitude)
# (an accession is a plant whose genotype we know.  For A. thaliana, these were originally collected from the field.)
accession.pos.file <- ("./data/accession_geo_locations.csv") # 5 columns: ecotypeid, collection location name, country, latitude, and longitude for all accessions in the 1001 genomes dataset

# 2.  Distribution of phenotypic values

Is it sensible to use this data in a mixed linear model GWAS?  General questions to ask about a trait include: Is the trait quantitative?  Is the distribution likely to result in normally-distributed residuals in the linear model we use for GWAS?  A simple histogram is a good first check.

In [None]:
# load phenotype data
pheno <- read.csv(pheno.file)

# check format of data
dim(pheno)
head(pheno)

# make a histogram
pheno.name <- colnames(pheno)[2]
hist(pheno[, 2], xlab = pheno.name, main = paste("Histogram of", pheno.name), col = "blue")


### Although the flowering time data doesn't have a beautiful normal distribution, we would consider it "close enough" to use in GWAS.
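
A histogram can hide what happens in the tails, so a normal Q-Q plot and a Shapiro-Wilk test are a common complement. The sketch below is not part of the original analysis. It uses simulated stand-in values (the `trait` vector, its mean, and its standard deviation are arbitrary assumptions) so that it runs on its own; in this notebook you would set `trait <- pheno[, 2]` instead. Keep in mind that the model assumption concerns the residuals, so this is only a rough first look.

```r
# Stand-in phenotype values so the sketch runs by itself (simulated, not real data).
# In this notebook, replace the next two lines with: trait <- pheno[, 2]
set.seed(42)
trait <- rnorm(200, mean = 60, sd = 10)

# Normal Q-Q plot: points close to the line suggest an approximately normal distribution
qqnorm(trait, main = "Normal Q-Q plot of phenotype")
qqline(trait, col = "red")

# Shapiro-Wilk test: a small p-value indicates a departure from normality
sw <- shapiro.test(trait)
print(sw)
```

With only a few hundred accessions, moderate departures from the line are usually tolerable; strongly skewed or bimodal traits are the ones that may call for a transformation before GWAS.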

# 3.  Geographic distribution of trait values
How much of a problem will population structure likely be?

In *Arabidopsis thaliana*, accessions that are geographically close tend to be more closely related than those that are further apart.  Another way of saying this is that population structure in the species largely reflects geography.  Strong geographic patterns in phenotypic values can therefore be a warning sign that population structure confounding will be high in an *A. thaliana* GWAS, so let's plot phenotypic values on a map to look for concerning patterns in our data.

In [None]:
# read in accession origin data
pos <- read.csv(accession.pos.file, stringsAsFactors=TRUE, header=TRUE)

# merge the location data with the phenotypes
# (merge() joins on the column names the two data frames share and keeps only matching rows)
pheno <- merge(pheno, pos)
head(pheno)

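Because `merge()` by default keeps only rows whose key appears in both data frames, accessions without a location entry would silently disappear from the map. A minimal sketch of a sanity check follows. It uses toy data frames so that it runs on its own: the values and the names `toy.pheno`, `toy.pos`, and `FT16` are made up for illustration, and the key column name `ecotypeid` is assumed from the file descriptions in section 1b.

```r
# Toy stand-ins for the phenotype and location tables (values are made up)
toy.pheno <- data.frame(ecotypeid = c(1, 2, 3, 4),
                        FT16      = c(55.5, 61.0, 70.25, 48.0))
toy.pos   <- data.frame(ecotypeid = c(1, 2, 4, 5),
                        latitude  = c(48.2, 59.3, 40.4, 52.5),
                        longitude = c(16.4, 18.1, -3.7, 13.4))

# merge() joins on the shared key and drops rows without a match
toy.merged <- merge(toy.pheno, toy.pos, by = "ecotypeid")

# Which phenotyped accessions have no location and were therefore dropped?
missing.ids <- setdiff(toy.pheno$ecotypeid, toy.pos$ecotypeid)
cat(nrow(toy.pheno) - nrow(toy.merged), "accession(s) dropped; ids:", missing.ids, "\n")
```

In the notebook, the same idea amounts to comparing the number of rows in `pheno` before and after the merge.
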

In [None]:
# load world map polygons (map_data() reads them from the maps package; nothing is downloaded)
world_map <- map_data("world")
# plot map using ggplot2 package
europe_map <- ggplot(world_map, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill="lightgray", colour = "white") +
  xlim(-13,25) +
  ylim(35, 66) +
  geom_point(data=pheno, aes(x = longitude, y=latitude, colour=get(pheno.name)),inherit.aes = FALSE) +
  labs(colour=pheno.name) +
  scale_color_gradient(low = "blue", high = "red") +
  theme_bw() +
  theme(text = element_text(size = 18)) # set after theme_bw(), which would otherwise reset the text size
 
options(repr.plot.width=9, repr.plot.height=8)
europe_map

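The map gives a visual impression; a rank correlation between the trait and latitude or longitude is one rough way to put a number on it. This is only a sketch and not part of the original analysis. It uses a simulated stand-in data frame (`toy`, with an invented latitude trend) so that it runs on its own; in the notebook you would use the merged `pheno` data frame and `pheno[[pheno.name]]` instead. A strong correlation is a warning sign, not a measurement of confounding.

```r
# Simulated stand-in for the merged data (values are invented for illustration)
set.seed(1)
toy <- data.frame(latitude  = runif(200, 35, 66),
                  longitude = runif(200, -13, 25))
toy$trait <- 40 + 0.8 * toy$latitude + rnorm(200, sd = 5) # built-in latitude trend

# Spearman rank correlation makes no normality assumption
lat.test  <- cor.test(toy$trait, toy$latitude,  method = "spearman")
long.test <- cor.test(toy$trait, toy$longitude, method = "spearman")

cat("rho (latitude): ", round(lat.test$estimate, 2), "\n")
cat("rho (longitude):", round(long.test$estimate, 2), "\n")
```
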
### Given the strong geographic pattern in flowering time, I would predict that we will observe very strong population structure confounding for this trait!  We will look at this prediction more closely in notebook 3.
## But first we need to run the GWAS, so let's move on to 2_GWAS.ipynb where we will do just that.