# Parse phenotype data for stem regeneration GWAS

In this workbook, we parse out machine vision phenotypes and the diameter covariate (measured manually).

In [None]:
# install.packages("readxl")

In [None]:
library(readxl)
library(data.table)

## Read and inspect phenotype data

### Read data into a master data frame

Data is in many files, one for each phase x timepoint combination. We will combine them all into a master dataframe.

In [None]:
files <- list.files("/mnt/data/NSF_GWAS/phenodata/final_training/", full.names=TRUE, pattern="\\.xlsx$") # pattern is a regex: escape the dot and anchor to the end of the name

In [None]:
files_combined <- read_excel(files[1])

In [None]:
for(file in files[-1]){ # files[-1] is safe even if only one file was found
    file_in <- read_excel(file)
    files_combined <- rbind(files_combined, file_in)
}
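
The loop above grows the table one file at a time. As a sketch of an equivalent pattern, every file can be read first and all tables bound in a single call. The `read_one()` function and the path names below are hypothetical stand-ins (for `read_excel()` and the real file list) so that the example runs without any files on disk.

```r
# Hypothetical stand-in for read_excel(): returns a small table per "file"
read_one <- function(path) {
  data.frame(Folder_name = path, callus = c(10, 20), stringsAsFactors = FALSE)
}

paths <- c("GWAS_Phase1", "GWAS_Phase2.3", "GWAS_Phase4.2")  # made-up names

# Read every file, then bind all of the tables in one step
tables <- lapply(paths, read_one)
combined <- do.call(rbind, tables)

nrow(combined)
```

Binding once at the end avoids repeatedly copying a growing table, which may matter when there are many files.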

<div class="alert alert-block alert-info"> Now observe the data for stem regeneration. There should be six columns – one for row ID, one for Folder_name, one for file_name, one for each of 3 tissues </div> 

### Inspect the phenotype data frame

In [None]:
head(files_combined)

In [None]:
nrow(files_combined)

In [None]:
# levels(factor(files_combined$Folder_name))

Notes from observing data: 
<div class="alert alert-block alert-info"> 1. Phase 4 timepoint 2 has two underscores between genotypes and TDZ concentration<br>2. Later timepoints have a '_' delimited number that is not relevant<br>3. Genotypes sometimes have _ instead of -<br>4. Genotype numbers sometimes have .0 at end </div>


## Parse and clean phenotype data

### Load functions and libraries needed for parsing

`stringr` is needed throughout for string parsing.

In [None]:
library(stringr)

Define the inverse of the %in% operator, to enable writing of clean and easy parsing code.

In [None]:
'%ni%' <- Negate('%in%')
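
As a quick illustration of how the negated operator reads in practice, here is a minimal sketch. The genotype IDs are made up for the example.

```r
`%ni%` <- Negate(`%in%`)

studied   <- c("BESC-1", "BESC-2", "GW-9")   # made-up genotype IDs
inventory <- c("BESC-1", "GW-9")

# Genotypes that were studied but are missing from the inventory
missing_from_inventory <- studied[studied %ni% inventory]
missing_from_inventory
```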

<div class="alert alert-block alert-info"> Parse out TDZ concentration into a column, and then split out the filenames to prepare for parsing the genotype IDs from them. </div> 

### Parse [TDZ] from file path

In [None]:
files_combined$TDZ <- str_split_fixed(files_combined$file_name, "_", 4)[,3]
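
To make the field positions concrete, the sketch below splits two made-up file names into four underscore-delimited fields and takes the third as the TDZ concentration. The file names are hypothetical (the real naming scheme is not reproduced here), and `split_fixed()` is a small base-R approximation of `str_split_fixed()` written only so the example has no package dependencies.

```r
# Base-R approximation of stringr::str_split_fixed(): split on a literal
# separator into exactly n fields, folding any surplus into the last field.
split_fixed <- function(x, sep, n) {
  t(vapply(strsplit(x, sep, fixed = TRUE), function(parts) {
    if (length(parts) > n) {
      parts <- c(parts[seq_len(n - 1)],
                 paste(parts[n:length(parts)], collapse = sep))
    }
    length(parts) <- n          # pad short results with NA
    parts[is.na(parts)] <- ""   # ...then turn the padding into ""
    parts
  }, character(n)))
}

# Made-up file names: the second has a doubled underscore before the TDZ field
file_name <- c("BESC_123_0.5_wk3.jpg", "BESC-123__0.5_wk2.jpg")
fields <- split_fixed(file_name, "_", 4)
tdz <- fields[, 3]
tdz
```

In both made-up cases the third field holds the concentration, because the doubled underscore produces an empty second field.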

In [None]:
# genotype_split <- str_split_fixed(files_combined$file_name, "_", 4)[,1:2] # We don't do this here because we need to parse names differently for different phases

### Parse genotype from file path

#### Split according to subsets that are formatted differently

This is a first step to making formatting consistent.

Split data into three parts, for pre-4.2, 4.2 and post-4.2 (because of the inconsistency described in section 1.1)

##### Clean up the Phase ID variable so that we can split by it

In [None]:
files_combined$Folder_name <- gsub("GWAS_Phase", "", files_combined$Folder_name)
colnames(files_combined)[2] <- "Phase"
files_combined$Phase <- as.numeric(as.character(files_combined$Phase))

Observe the resulting data file with cleaned Phase IDs, and look at a list of unique Phase IDs that exist.

In [None]:
head(files_combined)
levels(factor(files_combined$Phase))

Note on Phase ID: As stored currently, the two numbers delimited by '.' represent the Phase ID and the week timepoint. This will later be split to keep Phase ID in the desired column while creating a new column for timepoint.
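
A minimal sketch of that later split, using made-up Phase IDs and base R only. It also shows what happens to an ID that has no `.` in it: the timepoint comes back as an empty string.

```r
phase_id <- c(1, 3.2, 4.2, 7.5)   # made-up Phase IDs in Phase.timepoint form

parts <- strsplit(as.character(phase_id), ".", fixed = TRUE)
phase     <- vapply(parts, function(p) p[1], character(1))
timepoint <- vapply(parts, function(p) if (length(p) > 1) p[2] else "",
                    character(1))

data.frame(phase_id, phase, timepoint)
```

One caveat of storing the combined ID as a number: a value such as 4.10 would be stored as 4.1, so this encoding only works while timepoints stay below 10.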

##### Split the data into: before phase 4.2, phase 4.2 and after phase 4.2

In [None]:
data_pre4.2 <- files_combined[which(files_combined$Phase < 4.2),]
data_4.2 <- files_combined[which(files_combined$Phase == 4.2),]
data_post4.2 <- files_combined[which(files_combined$Phase > 4.2),]

##### Observe the result. Does it make sense?

How many samples (each corresponding to an image) are there in each subset?

In [None]:
nrow(data_pre4.2)
nrow(data_4.2)
nrow(data_post4.2)

Look at each phase independently

In [None]:
for(phase in levels(factor(files_combined$Phase))){
    print(head(files_combined[which(files_combined$Phase == phase),], n=1))
}

#### Parse out genotype IDs

Since filenames are formatted slightly differently in the three phase subsets (described above), the genotype IDs contained in filenames must be parsed out independently for each group. We will recombine them afterward.

##### First for data before Phase 4 timepoint 2

In [None]:
genotype_split <- str_split_fixed(data_pre4.2$file_name, '_', 3)[,1:2]

In [None]:
data_pre4.2$Genotype <- paste0(genotype_split[,1], "-", genotype_split[,2])

Skim the resulting genotype IDs and make sure nothing looks out of place.

In [None]:
# levels(factor(data_pre4.2$Genotype))

Take a look at the genotype_split table. This should have two columns, one for each part of the genotype ID previously separated by a delimiter.

In [None]:
head(genotype_split)

##### Now for data in Phase 4 timepoint 2

In [None]:
data_4.2$Genotype <- str_split_fixed(data_4.2$file_name, '_', 3)[,1]

Again, skim the resulting genotype IDs and make sure nothing looks out of place.

In [None]:
# levels(factor(data_4.2$Genotype))

##### ...and data after Phase 4 timepoint 2

In [None]:
data_post4.2$Genotype <- str_split_fixed(data_post4.2$file_name, '_', 3)[,1]

Once again, skim the resulting genotype IDs and make sure nothing looks out of place.

In [None]:
# levels(factor(data_post4.2$Genotype))

### Recombine and fix inconsistencies

#### Recombine and examine result

In [None]:
data_combined <- rbind(data_pre4.2, data_4.2, data_post4.2)

In [None]:
head(data_combined)

#### Deal with the .0 added to the end of some genotype IDs

This is presumably an artifact of data entry in Excel with the genotype ID as a numeric column.

In [None]:
data_combined$Genotype <- gsub(" ", "", data_combined$Genotype)
data_combined$Genotype <- gsub("\\.0$", "", data_combined$Genotype) # anchor with $ so only a trailing .0 is removed
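
The sketch below, on made-up IDs, compares an unanchored `\\.0` pattern with one anchored to the end of the string. The two only differ when `.0` occurs somewhere other than the end of an ID.

```r
ids <- c("BESC-28.0", "GW-9.05", "SLMD-28-3")   # made-up IDs

unanchored <- gsub("\\.0", "", ids)    # removes ".0" wherever it occurs
anchored   <- sub("\\.0$", "", ids)    # removes ".0" only at the end

data.frame(ids, unanchored, anchored)
```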

#### Cross-reference genotype IDs with the master inventory
Here, we aim to resolve cases in which a genotype is listed as studied but we have no image data for it (or vice versa). In some cases, this is due to a naming inconsistency, typing error, etc. We will use this information to correct any inconsistencies in filenames.

In [None]:
master_inventory <- fread("master_inventory.csv")

<div class="alert alert-block alert-warning"> Need to check these two lists, look at them alongside one another alphabetically and identify any cases where a genotype's ID is formatted in two ways (e.g. 'SLMD-28-03' vs 'SLMD-28-3'). </div>
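
One way to speed up this side-by-side check is to reduce every ID to a normalized key and list the keys that have more than one spelling. This is only a sketch: the IDs are made up, and `normalise_id()` is a hypothetical helper whose rules (unify delimiters, drop leading zeros) are assumptions that may be too aggressive for some real IDs.

```r
ids <- c("SLMD-28-03", "SLMD-28-3", "ABCD_7", "ABCD-7", "GW-9")  # made-up IDs

# Hypothetical normalisation: unify delimiters to "-", then drop leading
# zeros from each numeric field.
normalise_id <- function(x) {
  x <- gsub("[_ ]+", "-", x)
  gsub("(^|-)0+([0-9])", "\\1\\2", x)
}

key <- normalise_id(ids)
spellings <- split(ids, key)                      # group spellings by key
ambiguous <- spellings[lengths(spellings) > 1]    # keys with 2+ spellings
ambiguous
```

The output is a candidate list for manual review, not an automatic correction.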

##### What genotypes are in the master inventory but were not studied?

In [None]:
length(setdiff(master_inventory$`ALL GENOTYPES`,
       levels(factor(data_combined$Genotype))))

In [None]:
setdiff(master_inventory$`ALL GENOTYPES`,
       levels(factor(data_combined$Genotype)))

##### What genotypes are in the phenotype data but not the master inventory?

In [None]:
mystery_genotypes <- setdiff(levels(factor(data_combined$Genotype)),
                             master_inventory$`ALL GENOTYPES`)

In [None]:
mystery_genotypes # First just looking at IDs of mystery genotypes themselves

In [None]:
data_combined[which(data_combined$Genotype %in% mystery_genotypes),] # Now looking at all the data for all the mystery genotypes

In [None]:
if (!dir.exists("stem_regen_parsing_midway")) dir.create("stem_regen_parsing_midway")

In [None]:
fwrite(data_combined[which(data_combined$Genotype %in% mystery_genotypes),],
       "stem_regen_parsing_midway/mystery_genotypes_with_data.csv")

<div class="alert alert-block alert-danger">
Note: There is an extra "0" at end of genotypes in Phase 3, timepoint wk1. Timepoint wk1 data is not used in our analysis. This error will not be corrected since the data is not being analyzed.

The other errors are corrected below.
</div>

##### Correct some inconsistencies

In [None]:
data_combined <- data_combined[!(data_combined$Phase %in% 3.1), ] # logical indexing keeps all rows when nothing matches
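
A note on dropping rows by condition: negating the result of `which()` behaves unexpectedly when nothing matches, because an empty index selects zero rows. The sketch below shows this on made-up data, next to a logical-index alternative.

```r
phase <- c(1, 2.3, 4.2)          # made-up Phase IDs; none equals 3.1
dat <- data.frame(Phase = phase)

dropped_all <- dat[-which(dat$Phase == 3.1), , drop = FALSE]   # zero rows
kept_all    <- dat[!(dat$Phase %in% 3.1), , drop = FALSE]      # all rows

c(nrow(dropped_all), nrow(kept_all))
```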

In [None]:
data_combined$Genotype <- gsub("BLGC", "BLCG", data_combined$Genotype)
data_combined$Genotype <- gsub("BLCG-20", "BLCG-28", data_combined$Genotype)
data_combined$Genotype <- gsub("SKWF-24-2", "SKWE-24-2", data_combined$Genotype)
data_combined$Genotype <- gsub("SLMD-28-03", "SLMD-28-3", data_combined$Genotype)
data_combined$Genotype <- gsub("BESC _331", "BESC_331", data_combined$Genotype)
data_combined$Genotype <- gsub("GW _", "GW_", data_combined$Genotype)

#### Cross-referencing genotype IDs with ORNL-provided ID list.
The ORNL-provided genotype list contains a list of genotypes in the SNP set.
We will cross-reference with this as another way of verifying genotype IDs.

First let's read in that list of IDs for genotypes in the SNP data.

<div class="alert alert-block alert-info">Use choice of ID lists for desired SNP set</div>

In [None]:
# id_order <- fread("/scratch2/NSF_GWAS/Scripts/format_pheno_EMMAX/id_list_NewpubSNPs.txt") # 882-genotype SNP set

id_order <- fread("id_list_1323geno.txt") # 1323-genotype SNP set

A little formatting of the id list file

In [None]:
id_order <- colnames(id_order)[-(1:9)]

##### Match names to dictionary

<div class="alert alert-block alert-info">This step applies only to the 882-genotype SNP set, in which genotypes are coded as FID_IID rather than by genotype ID (as they are in the 1323-genotype SNP set).
Read in the dictionary, which translates the FID_IID codes in the SNP metadata to the genotype IDs found in our phenotype data, so that the two can be matched.
For which genotypes do we have phenotype data but no dictionary entry, and vice versa?</div> 

In [None]:
#dictionary <- fread("/scratch2/NSF_GWAS/Scripts/format_pheno_EMMAX/uc_id_to_names.txt")
#colnames(dictionary) <- c("IID", "Name", "ID")

#length(setdiff(data_combined$Genotype, dictionary$Name)) # Genotypes in phenodata but not dict
#length(setdiff(dictionary$Name, data_combined$Genotype)) # Genotypes in dictionary but not phenotype data

#setdiff(dictionary$Name, data_combined$Genotype) # Genotypes in dictionary but not phenotype data

#dictionary_gotSNPs <- dictionary[which(dictionary$ID %in% id_order),] # Subset dictionary to genotypes which we have SNP data for

#setdiff(data_combined$Genotype, dictionary_gotSNPs$Name)
#length(setdiff(data_combined$Genotype, dictionary_gotSNPs$Name)) # Genotypes for which we have phenotype data but no genotype data (in the selected SNP set)

### Where are these 7 genotypes? Need to export and take a closer look in Excel.

#data_combined$found_in <- rep(NA, nrow(data_combined))

#data_combined$found_in[which(data_combined$Genotype %in% dictionary$Name)] <- "in_dictionary"
#data_combined$found_in[which(data_combined$Genotype %in% dictionary_882$Name)] <- "in_dictionary_and_population"

#fwrite(data_combined, "/scratch2/NSF_GWAS/phenodata/final_training/partially_parsed_phenodata.csv")

##### After checking data in Excel, by ctrl-F for specific genotype prefixes among those above 8, see there is NOTHING mislabeled that should be one of these. Good.

<div class="alert alert-block alert-info">Only if using 882-genotype SNP set....</div> 

<div class="alert alert-block alert-info">If using the 882-genotype SNP set, use this version of the code, which merges by dictionary because genotypes are listed by FID_IID. Otherwise, we simply merge all genotype IDs using the got_geno variable further down.</div> 

In [None]:
# phenos_w_names_IDs <- merge(data_combined, dictionary, by.x = "Genotype", by.y = "Name", all.x = FALSE, all.y = TRUE)
# phenos_w_names_IDs <- merge(as.data.table(id_order), phenos_w_names_IDs, by.x = "id_order", by.y = "ID", all.x = TRUE, all.y = FALSE)

Otherwise, we will sort `phenos_w_names_IDs` by the vector of ordered IDs.

##### Perform cross-referencing

In [None]:
length(levels(factor(data_combined$Genotype))) # total N genotypes in our phenotype dataset

In [None]:
got_pheno <- levels(factor(data_combined$Genotype))

In [None]:
length(setdiff(data_combined$Genotype, id_order)) # Genotypes in phenotype data but not in the SNP ID list

In [None]:
got_pheno_no_geno <- setdiff(data_combined$Genotype, id_order)

In [None]:
got_geno_no_pheno <- setdiff(id_order, data_combined$Genotype)

For which genotypes do we have both phenotype and genotype data? Using this knowledge, we will make a subset of the dictionary that contains all these genotypes in the correct order, listed by both FID_IID and genotype name identifiers. This will next be merged with the partially parsed phenotype files to produce the final sorted and parsed phenotype files for GEMMA/EMMAX.

In [None]:
got_geno_and_pheno <- intersect(id_order, data_combined$Genotype)
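
As a sanity check on these three sets, every unique ID should fall into exactly one of them. A minimal sketch with made-up IDs:

```r
pheno_ids <- c("A-1", "A-2", "B-7", "A-1")   # made-up; repeats are expected
snp_ids   <- c("B-7", "A-1", "C-3")

only_pheno <- setdiff(pheno_ids, snp_ids)     # phenotype data, no SNP data
only_geno  <- setdiff(snp_ids, pheno_ids)     # SNP data, no phenotype data
both       <- intersect(snp_ids, pheno_ids)   # keeps the order of snp_ids

# Every unique ID lands in exactly one of the three groups
stopifnot(length(only_pheno) + length(only_geno) + length(both) ==
          length(union(pheno_ids, snp_ids)))
```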

In [None]:
fwrite(as.data.table(got_pheno), "stem_regen_parsing_midway/got_pheno.txt")

In [None]:
fwrite(as.data.table(got_pheno_no_geno), "stem_regen_parsing_midway/got_pheno_no_geno_in_1323.txt")

In [None]:
fwrite(as.data.table(got_geno_no_pheno), "stem_regen_parsing_midway/got_geno_no_pheno_in_1323.txt")

In [None]:
fwrite(as.data.table(got_geno_and_pheno), "stem_regen_parsing_midway/got_geno_and_pheno.txt")

In [None]:
fwrite(data_combined, "stem_regen_parsing_midway/stem_regen_phenos_mid_parsing.csv")

### Formatting steps

We will convert phenotypes into EMMAX and PLINK formats with an adaptation of old code

Working with the 1323-genotype SNP set, which has genotypes indexed by genotype ID and NOT FID_IID...

#### Subset to phenotype data for which we have genotype data

In [None]:
got_geno <- as.data.table(id_order)
colnames(got_geno)[1] <- "Genotype"

In [None]:
phenos_w_names_IDs <- merge(data_combined, got_geno, by="Genotype")

In [None]:
phenos_w_names_IDs <- phenos_w_names_IDs[which(phenos_w_names_IDs$TDZ==0.5),]

#### Scale callus/shoot data and remove stem data

In [None]:
phenos_w_names_IDs$callus <- phenos_w_names_IDs$callus/100
phenos_w_names_IDs$shoot <- phenos_w_names_IDs$shoot/100
phenos_w_names_IDs$stem <- NULL

#### Add a column for timepoint

In [None]:
split_phase_timepoint <- str_split_fixed(phenos_w_names_IDs$Phase, "\\.", 2) # n must be numeric, not the string "2"

#### Reorganize/inspect phase/timepoint data

We will do a bit of reorganizing since previously, Phase.timepoint info was labeled as "Phase". Let's set the record straight while retaining all information (with redundancy)

In [None]:
phenos_w_names_IDs$Phase.timepoint <- phenos_w_names_IDs$Phase
phenos_w_names_IDs$timepoint <- split_phase_timepoint[,2]
phenos_w_names_IDs$Phase <- split_phase_timepoint[,1]

In [None]:
levels(factor(phenos_w_names_IDs$timepoint))

<div class="alert alert-block alert-warning"> What is the factor ''? </div>


In [None]:
# phenos_w_names_IDs[which(phenos_w_names_IDs$timepoint == ''),]

Are they all in phases 1 and 2? It looks like the timepoint of 1 was left off for those phases...

In [None]:
levels(factor(phenos_w_names_IDs[which(phenos_w_names_IDs$timepoint == ''),]$Phase))

Ok. Let's make sure using table.

In [None]:
table(phenos_w_names_IDs$timepoint, phenos_w_names_IDs$Phase)

These are week 1 data, which were only collected during phases 1 and 2.
<div class="alert alert-block alert-success"> We will not run GWAS on week 1 data because of extremely low rates of regeneration, as previously noted in red box above. We previously dumped only what was needed to get genotypes to match up (due to the issue of extra 0 on Phase 3 wk 1) but will now get rid of EVERYTHING for wk1, in all phases. </div>

In [None]:
phenos_w_names_IDs <- phenos_w_names_IDs[which(phenos_w_names_IDs$timepoint != ''),]

#### Now to pre-format data and run our old code to parse into files for EMMAX and PLINK formats at each timepoint

#### Keep last observation(s) when genotypes are studied in multiple phases

We decided that when a genotype was studied twice (in two phases), we would keep the data from the later phase rather than taking the average.

##### What genotypes appear in what phases?

Let's subset data to timepoint 3 because this is a timepoint for which we have data in every phase (unlike timepoints 1, 2 and 5)

In [None]:
head(phenos_w_names_IDs)

In [None]:
dim(phenos_w_names_IDs)

In [None]:
levels(factor(phenos_w_names_IDs$timepoint))

##### Why do some genotypes appear twice in Phase 7 and 8?

Because they are replicates. In later phases (5 onward?) we began to do replicates instead of one with and one without TDZ.

A quick way to know if genotypes were studied in an earlier phase AND a later phase...

Look at the contingency table.

In [None]:
genotypes_phases <- table(phenos_w_names_IDs$Phase, phenos_w_names_IDs$timepoint)

In [None]:
genotypes_phases

Week 3 is a timepoint that was included in EVERY phase. This provides us a convenient means of counting how many times each genotype was studied, and seeing in which phases they were studied. Note, there are a few exceptions in which we do not have wk3 data for a genotype due to contamination or another factor; these cases will be handled individually afterward.

In [None]:
phenos_w_names_IDs_week3_only <- phenos_w_names_IDs[which(phenos_w_names_IDs$timepoint == '3'),]

Now convert the contingency table to a data frame and calculate 1) the total number of times a genotype was studied (with TDZ=0.5) and 2) the number of phases a genotype was studied in.

In [None]:
genotypes_phases <- table(phenos_w_names_IDs_week3_only$Genotype,
                          phenos_w_names_IDs_week3_only$Phase)

In [None]:
genotypes_phases_w_sums <- as.data.frame.matrix(genotypes_phases)

In [None]:
genotypes_phases_w_sums$sum_total <- unlist(rowSums(genotypes_phases_w_sums))

genotypes_phases_binarized <- as.data.frame.matrix(genotypes_phases)
genotypes_phases_binarized[genotypes_phases_binarized > 1] <- 1
genotypes_phases_w_sums$phases_studied_in <- unlist(rowSums(genotypes_phases_binarized))
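
The same two counts can be seen on a toy example. The records below are made up; `counts > 0` plays the role of the binarized table.

```r
obs <- data.frame(
  Genotype = c("A-1", "A-1", "A-1", "B-7", "B-7"),   # made-up records
  Phase    = c("1",   "5",   "5",   "2",   "2")
)

counts <- as.data.frame.matrix(table(obs$Genotype, obs$Phase))
sum_total <- rowSums(counts)                # images per genotype
phases_studied_in <- rowSums(counts > 0)    # distinct phases per genotype

cbind(counts, sum_total, phases_studied_in)
```

Here `A-1` has three images across two phases, while `B-7` has two images (replicates) within a single phase.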

In [None]:
head(genotypes_phases_w_sums)

##### Of genotypes that appear in multiple phases, what are the phases they appear in?

In [None]:
genotypes_studied_in_multiple_phases <- genotypes_phases_w_sums[which(genotypes_phases_w_sums$phases_studied_in>1),]

In [None]:
nrow(genotypes_studied_in_multiple_phases) # Number of genotypes appearing in multiple phases

In [None]:
head(genotypes_studied_in_multiple_phases)

In [None]:
max(genotypes_studied_in_multiple_phases$phases_studied_in) # What is the maximum number of phases any genotype was studied in?

##### What is the last phase every genotype appears in?

In [None]:
genotypes_phases_2 <- as.data.frame.matrix(genotypes_phases)

In [None]:
nrow(genotypes_phases_2)

In [None]:
genotypes_phases_2 <- as.data.frame.matrix(genotypes_phases)
genotypes_phases_2$final_phase_studied_in <- rep(0, nrow(genotypes_phases_2))

In [None]:
genotypes_studied_in_multiple_phases_nosums <- genotypes_studied_in_multiple_phases
genotypes_studied_in_multiple_phases_nosums$sum_total <- NULL
genotypes_studied_in_multiple_phases_nosums$phases_studied_in <- NULL

Make sure what we're about to do inside a loop works

In [None]:
genotypes_studied_in_multiple_phases_nosums[1,]

In [None]:
which(genotypes_studied_in_multiple_phases_nosums[1,] != 0) # What phases are a genotype studied in?

##### Prepare `data.frame` of which genotypes are in which phases

In [None]:
head(rownames(genotypes_phases))

In [None]:
rownames(genotypes_phases)[1]

In [None]:
df <- data.frame(matrix(NA, ncol=4, nrow=nrow(genotypes_studied_in_multiple_phases_nosums)))
colnames(df) <- c("Genotype", "Final_phase", "Earlier_phase1", "Earlier_phase2")

In [None]:
for(i in seq_len(nrow(genotypes_studied_in_multiple_phases_nosums))){
    df$Genotype[i] <- rownames(genotypes_studied_in_multiple_phases_nosums)[i]
    # We need the column names (phase IDs) for which the ith row is nonzero.
    # which() alone returns column positions, so look the names up explicitly.
    nonzero_cols <- which(genotypes_studied_in_multiple_phases_nosums[i,] != 0)
    Phases_appeared_in <- as.numeric(colnames(genotypes_studied_in_multiple_phases_nosums)[nonzero_cols])
    # A middle phase only exists when a genotype was studied in exactly three phases
    med <- if(length(Phases_appeared_in) == 3) median(Phases_appeared_in) else NA
    df$Final_phase[i] <- max(Phases_appeared_in)
    df$Earlier_phase1[i] <- min(Phases_appeared_in)
    df$Earlier_phase2[i] <- med
}
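
For comparison, the last phase per genotype can also be computed directly from long-format records, without going through a contingency table. This is a sketch on made-up data, not a replacement for the loop above.

```r
obs <- data.frame(
  Genotype = c("A-1", "A-1", "B-7", "C-3", "C-3", "C-3"),  # made-up records
  Phase    = c(1, 5, 2, 3, 4, 8)
)

# Sorted, de-duplicated phases for each genotype
phases_by_genotype <- lapply(split(obs$Phase, obs$Genotype),
                             function(p) sort(unique(p)))
final_phase <- vapply(phases_by_genotype, max, numeric(1))
n_phases    <- lengths(phases_by_genotype)

# Final phase for genotypes studied in more than one phase
multi <- final_phase[n_phases > 1]
multi
```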

In addition to removing the earlier-phase observations of these genotypes from the phenotype data, let's remove them from the diameter (covariate) data.

## Parse and clean covariate data

Note: The diameter data is stored in a strange way. The first file contains diameter data for the first two phases. In the second file, there is diameter data for all phases except the first two, for which diameter is left blank.

### Load

Data for phase 1 is in a different file than the rest.

In [None]:
diameter.1 <- read_excel("/mnt/data/NSF_GWAS/phenodata/raw_manual_score_covariates/GWAS_Data_Phase_1_2.xlsx")

In [None]:
diameter.2 <- data.table::fread("/mnt/data/NSF_GWAS/phenodata/Master GWAS results RESCORE_10.10.19_stem_and callus scoring_AG - Copy.csv")

Evaluate these datasets and the differences between them.

In [None]:
colnames(diameter.1)
colnames(diameter.2)

head(diameter.1$T_quant...4)
head(diameter.2$Treatment)

### Parse and clean

#### Add phase column to df for phase 1

Add a phase column for the first datasheet, treating everything in it as phase 1. Note that the sheet is titled "GWAS Data All Phases" and "GWAS_Data_1_2"; neither name is accurate, since it contains only phase 1 data.

In [None]:
diameter.1$Phase <- 1

In [None]:
# head(diameter.1)

#### Clean up the columns
We will standardize the column names for TDZ concentration and diameter, drop all columns except index, genotype, treatment, phase and diameter, and then combine both datasets into a single table.

##### [TDZ]

In [None]:
colnames(diameter.1)[4] <- "TDZ_conc"
colnames(diameter.2)[3] <- "TDZ_conc"

##### Diameter

In [None]:
colnames(diameter.1)[which(colnames(diameter.1) == 'Final Diameter  (mm)')] <- "diameter_mm"
colnames(diameter.2)[which(colnames(diameter.2) ==  'Final Stem Diameter (mm)')] <- "diameter_mm"

Did it work?

In [None]:
colnames(diameter.1)

##### Drop extra columns

In [None]:
diameter.1 <- as.data.table(cbind(diameter.1$Index,
                                  diameter.1$Genotype,
                                  diameter.1$TDZ_conc,
                                  diameter.1$Phase,
                                  diameter.1$diameter_mm))

diameter.2 <- as.data.table(cbind(diameter.2$Index,
                                  diameter.2$Genotype,
                                  diameter.2$TDZ_conc,
                                  diameter.2$Phase,
                                  diameter.2$diameter_mm))

diameter <- rbind(diameter.1,
                  diameter.2[which(as.numeric(as.character(diameter.2$V4)) != 1), ])

In [None]:
colnames(diameter) <- c("Index",
                        "Genotype",
                        "TDZ_conc",
                        "Phase",
                        "diameter_mm")

#### Clean genotype names

##### First replace spaces in genotype names with hyphens

In [None]:
diameter$Genotype <- gsub(" ", "-", diameter$Genotype)

##### Correct naming inconsistencies

These are the same as corrections for phenotype files.

In [None]:
diameter$Genotype <- gsub("BLGC", "BLCG", diameter$Genotype)
diameter$Genotype <- gsub("BLCG-20", "BLCG-28", diameter$Genotype)
diameter$Genotype <- gsub("SKWF-24-2", "SKWE-24-2", diameter$Genotype)
diameter$Genotype <- gsub("SLMD-28-03", "SLMD-28-3", diameter$Genotype)
diameter$Genotype <- gsub("BESC _331",  "BESC_331", 
                                                  diameter$Genotype)

diameter$Genotype <- gsub("GW _",  "GW_", 
                                                  diameter$Genotype)

In [None]:
fwrite(diameter, "stem_regen_parsing_midway/diameter_midway.csv")

## Remove the first observation when a genotype was studied in multiple phases

We will do this for both diameter data and phenotype data

In [None]:
for(i in 1:nrow(genotypes_studied_in_multiple_phases_nosums)){
    # Delete data from earlier phases – first by replacing Phase # with NA for those we wish to remove
    # for certain genotypes
    diameter$Phase[which(diameter$Genotype==df$Genotype[i] & diameter$Phase != df$Final_phase[i])] <- NA
    phenos_w_names_IDs$Phase[which(phenos_w_names_IDs$Genotype==df$Genotype[i] & phenos_w_names_IDs$Phase != df$Final_phase[i])] <- NA
}
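
A compact way to express "keep only the latest phase for each genotype" is to compare each row's phase with the per-genotype maximum. The sketch below uses made-up records. Note one difference from the loop above: here the final phase is taken from whatever rows are present, whereas the notebook derives it from week 3 data.

```r
obs <- data.frame(
  Genotype = c("A-1", "A-1", "A-1", "B-7"),   # made-up records
  Phase    = c(1, 5, 5, 2),
  callus   = c(0.10, 0.30, 0.50, 0.20)
)

# Latest phase per genotype, repeated on every row of that genotype
final_phase <- ave(obs$Phase, obs$Genotype, FUN = max)
final_only  <- obs[obs$Phase == final_phase, ]

final_only
```

Replicates within the final phase (the two phase 5 rows for `A-1`) are retained.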

Now remove NA to drop the entries for earlier phases when genotypes were studied in multiple phases

In [None]:
nrow(phenos_w_names_IDs) #for 882-genotype SNP set, 6296; for 1323 set, 9253
phenos_w_names_IDs <- na.omit(phenos_w_names_IDs)
nrow(phenos_w_names_IDs) #for 882-genotype SNP set, 4895; for 1323 set, 7647

In [None]:
nrow(diameter) #for 1323 set, 3148
diameter <- na.omit(diameter)
nrow(diameter) #for 1323 set, 2523
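
Keep in mind that `na.omit()` drops a row if any column is `NA`, not only the `Phase` column used as a removal flag. Whether that matters here depends on whether other columns contain missing values. The sketch below, on made-up data, contrasts it with filtering on the flag column alone.

```r
dat <- data.frame(
  Genotype    = c("A-1", "A-1", "B-7"),   # made-up records
  Phase       = c(NA, 5, 2),              # NA marks a row flagged for removal
  diameter_mm = c(3.1, 2.9, NA)           # NA here is a missing measurement
)

dropped_any_na  <- na.omit(dat)               # drops rows 1 and 3
dropped_flagged <- dat[!is.na(dat$Phase), ]   # drops row 1 only

c(nrow(dropped_any_na), nrow(dropped_flagged))
```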

Did it work? For which genotypes do we still have data at multiple phases?

In [None]:
for(genotype in levels(factor(phenos_w_names_IDs$Genotype))){
    phases_studied_in <- unique(
        phenos_w_names_IDs[which(phenos_w_names_IDs$Genotype == genotype),]$Phase)
    n_phases_studied_in <- length(phases_studied_in)
    if(n_phases_studied_in>1){
        print(paste0(n_phases_studied_in,
                    " phases contain phenotype data for ",
                    genotype))
    }
    
    phases_studied_in <- unique(
        diameter[which(diameter$Genotype == genotype),]$Phase)
    n_phases_studied_in <- length(phases_studied_in)
    if(n_phases_studied_in>1){
        print(paste0(n_phases_studied_in,
                    " phases contain diameter data for ",
                    genotype))
    }
}

Look at data of genotypes for which we have data from multiple phases

In [None]:
phenos_w_names_IDs[which(phenos_w_names_IDs$Genotype == 'BESC-159'),]
phenos_w_names_IDs[which(phenos_w_names_IDs$Genotype == 'BESC-16'),]
phenos_w_names_IDs[which(phenos_w_names_IDs$Genotype == 'BESC-354'),]

In [None]:
phenos_w_names_IDs$Phase[which(phenos_w_names_IDs$Genotype == 'BESC-159' & phenos_w_names_IDs$Phase==1)] <- NA
phenos_w_names_IDs$Phase[which(phenos_w_names_IDs$Genotype == 'BESC-16' & phenos_w_names_IDs$Phase==4)] <- NA
phenos_w_names_IDs$Phase[which(phenos_w_names_IDs$Genotype == 'BESC-354' & phenos_w_names_IDs$Phase==2)] <- NA

In [None]:
diameter[which(diameter$Genotype == 'BESC-159'),]
diameter[which(diameter$Genotype == 'BESC-354'),]

In [None]:
diameter$Phase[which(diameter$Genotype == 'BESC-159' & diameter$Phase==1)] <- NA
diameter$Phase[which(diameter$Genotype == 'BESC-354' & diameter$Phase==2)] <- NA

Now remove rows with NA to get rid of this data.

In [None]:
nrow(phenos_w_names_IDs) #7647 for 1323 set (second filtering round)
phenos_w_names_IDs <- na.omit(phenos_w_names_IDs)
nrow(phenos_w_names_IDs) #7641 for 1323 set (second filtering round)

In [None]:
nrow(diameter) #2523 for 1323 set (second filtering round)
diameter <- na.omit(diameter)
nrow(diameter) #2519 for 1323 set (second filtering round)

We will revisit phase data after printing phenotypes, once it is time to write out phase data.

## Write intermediate results with extra early replicates removed BEFORE aggregating

This is being done for the annotation GUI paper, to make sure the replication structure is clear.

In [None]:
fwrite(phenos_w_names_IDs, "stem_regen_parsing_midway/stem_regen_phenos_mid_parsing_finalObservations.csv")

## Write duplicate-aggregated phenotype data into GWAS formats

In [None]:
if(!dir.exists("pheno_files/stem_regen")) dir.create("pheno_files/stem_regen", recursive = TRUE) # recursive: also create pheno_files/ if missing

In [None]:
setwd("pheno_files/stem_regen")

In [None]:
levels(factor(phenos_w_names_IDs$timepoint))

In [None]:
phenos_w_names_IDs$timepoint <- as.numeric(as.character(phenos_w_names_IDs$timepoint))

I believe I am finally ready to write the phenotype data....

In [None]:
getwd()

In [None]:
id_order_table <- as.data.table(id_order)
colnames(id_order_table) <- c("Genotype")

In [None]:
for (time in levels(factor(phenos_w_names_IDs$timepoint))){
    print(paste0("Timepoint: week ", time))
    ### Subset for this timepoint
    phenos_w_names_IDs_subset <- phenos_w_names_IDs[which(phenos_w_names_IDs$timepoint==time),]
    #print(head(phenos_w_names_IDs_subset))
    print(paste0("nrow after subsetting phenotype data to this timepoint is ", nrow(phenos_w_names_IDs_subset)))
    
    
    ### Callus
    aggregate_callus <- aggregate(callus~Genotype, data=phenos_w_names_IDs_subset, FUN=mean)
    callus_w_names_IDs.2 <- merge(id_order_table,
                                  aggregate_callus,
                                  by = "Genotype",
                                  all.x = TRUE,
                                  all.y = FALSE)
    print(head(callus_w_names_IDs.2))
    # If coded as FID_IID:
    # callus_w_names_IDs <- merge(aggregate_callus, dictionary, by.x = "Genotype", by.y = "Name", all.x = FALSE, all.y = TRUE)
    # callus_w_names_IDs.2 <- merge(as.data.table(id_order), callus_w_names_IDs, by.x = "id_order", by.y = "ID", all.x = TRUE, all.y = FALSE)
    phenotype_name <- paste0("callus_", time, "w")
    print(paste0("Writing out phenotype data for phenotype ",
                 phenotype_name,
                 " with # observations: ", nrow(na.omit(callus_w_names_IDs.2))))
    
    print(mean(na.omit(callus_w_names_IDs.2$callus)))
    callus_w_names_IDs.2$callus <- format(callus_w_names_IDs.2$callus, digits = 5)
    # If coded as FID_IID:
    #callus_out <- as.data.table(cbind(str_split_fixed(callus_w_names_IDs.2$id_order, "_", 2),
    #                           callus_w_names_IDs.2$callus))
    callus_out <- as.data.table(cbind(callus_w_names_IDs.2$Genotype,
                                      callus_w_names_IDs.2$Genotype,
                                      callus_w_names_IDs.2$callus))


    colnames(callus_out) <- c("FID", "IID",
                             phenotype_name)

    print(head(callus_out))
    fwrite(callus_out,
           paste0(phenotype_name, ".noheader.pheno"),
           sep = "\t", col.names = FALSE, row.names = FALSE, quote = FALSE, na = "NA")
    fwrite(callus_out,
           paste0(phenotype_name, ".header.pheno"),
           sep = "\t", col.names = TRUE, row.names = FALSE, quote = FALSE, na = "NA")
    cat("\n")
    ### Shoot
    aggregate_shoot <- aggregate(shoot~Genotype, data=phenos_w_names_IDs_subset, FUN=mean)
    shoot_w_names_IDs.2 <- merge(id_order_table,
                                 aggregate_shoot,
                                 by = "Genotype",
                                 all.x = TRUE,
                                 all.y = FALSE)
    # If coded as FID_IID:
    # shoot_w_names_IDs <- merge(aggregate_shoot, dictionary, by.x = "Genotype", by.y = "Name", all.x = FALSE, all.y = TRUE)
    # shoot_w_names_IDs.2 <- merge(as.data.table(id_order), shoot_w_names_IDs, by.x = "id_order", by.y = "ID", all.x = TRUE, all.y = FALSE)
    phenotype_name <- paste0("shoot_", time, "w")
    print(paste0("Writing out phenotype data for phenotype ",
                 phenotype_name,
                 " with # observations: ", nrow(na.omit(shoot_w_names_IDs.2))))
    
    shoot_w_names_IDs.2$shoot <- format(shoot_w_names_IDs.2$shoot, digits = 5)
    # If coded as FID_IID:
    # shoot_out <- as.data.table(cbind(str_split_fixed(shoot_w_names_IDs.2$id_order, "_", 2),
    #                            shoot_w_names_IDs.2$shoot))
    shoot_out <- as.data.table(cbind(shoot_w_names_IDs.2$Genotype,
                                     shoot_w_names_IDs.2$Genotype,
                                     shoot_w_names_IDs.2$shoot))

    colnames(shoot_out) <- c("FID", "IID",
                             phenotype_name)

    print(head(shoot_out))
    cat("\n\n")
    fwrite(shoot_out,
           paste0(phenotype_name, ".noheader.pheno"),
           sep = "\t", col.names = FALSE, row.names = FALSE, quote = FALSE, na = "NA")
    fwrite(shoot_out,
           paste0(phenotype_name, ".header.pheno"),
           sep = "\t", col.names = TRUE, row.names = FALSE, quote = FALSE, na = "NA")
}
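
One thing worth checking in the written files is row order. By default `merge()` sorts its result by the key column, so the output follows alphabetical genotype order rather than the order of `id_order`. Whether that matters depends on whether `id_order` is already sorted, which is not shown here. As a sketch on made-up data, `match()` aligns values to a reference order explicitly:

```r
id_order <- c("C-3", "A-1", "B-7")               # made-up SNP-set order
means <- data.frame(Genotype = c("A-1", "C-3"),  # B-7 has no phenotype
                    callus   = c(0.25, 0.40))

# Left join: the result is sorted by Genotype
merged <- merge(data.frame(Genotype = id_order), means,
                by = "Genotype", all.x = TRUE)

# Explicit alignment: the result follows id_order, with NA where data is missing
aligned <- data.frame(Genotype = id_order,
                      callus   = means$callus[match(id_order, means$Genotype)])

merged$Genotype    # alphabetical
aligned$Genotype   # same order as id_order
```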

## Check why there are fewer observations for week 4

<div class="alert alert-block alert-danger">
Note: We have fewer observations for week 4 because...
</div>

Let's investigate

In [None]:
callus_3w <- fread("callus_3w.header.pheno")
callus_4w <- fread("callus_4w.header.pheno")
callus_5w <- fread("callus_5w.header.pheno")

In [None]:
geno_got_wk3_not_wk4 <- setdiff(na.omit(callus_3w)$FID,
                                na.omit(callus_4w)$FID) 

In [None]:
geno_got_wk3_not_wk5 <- setdiff(na.omit(callus_3w)$FID,
                                na.omit(callus_5w)$FID) 

In [None]:
geno_got_wk3_not_wk5 # These should all be cases that were studied in Phase 1.

In [None]:
phenos_w_names_IDs[which(phenos_w_names_IDs$Genotype == "BESC-124"),]

In [None]:
for(genotype in geno_got_wk3_not_wk5){
    pheno_subset <- phenos_w_names_IDs[which(phenos_w_names_IDs$Genotype == genotype), ]
    phases_studied_in <- unique(pheno_subset$Phase)
    # Use || and any() so the condition is always a single TRUE/FALSE
    if((length(phases_studied_in) > 1) || any(phases_studied_in != 1)){
        print(pheno_subset)
        #stop()
    }
}

All are from Phase 1 except BESC-143 and BESC-26, which were presumably contaminated or otherwise damaged and therefore not imaged at later timepoints.
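
The same exception check can be written without an explicit loop. This is only a sketch on toy data: `toy_phenos`, `genotypes_to_check`, and `expected_phase` are hypothetical stand-ins for `phenos_w_names_IDs`, the `setdiff()` result, and the phase we expect.

```r
# Toy stand-in for phenos_w_names_IDs; only Genotype and Phase are used.
toy_phenos <- data.frame(
    Genotype = c("BESC-1", "BESC-1", "BESC-2", "BESC-3", "BESC-3"),
    Phase    = c(1, 1, 2, 1, 3),
    stringsAsFactors = FALSE
)
genotypes_to_check <- c("BESC-1", "BESC-2", "BESC-3")
expected_phase <- 1

# Unique phases per genotype, as a named list.
phases_by_genotype <- lapply(split(toy_phenos$Phase, toy_phenos$Genotype), unique)

# Flag genotypes seen in more than one phase, or in a phase other than the expected one.
is_exception <- vapply(phases_by_genotype[genotypes_to_check],
                       function(p) length(p) > 1 || any(p != expected_phase),
                       logical(1))
exceptions <- names(is_exception)[is_exception]
print(exceptions)
```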

In [None]:
geno_got_wk5_not_wk4 <- setdiff(na.omit(callus_5w)$FID,
                                na.omit(callus_4w)$FID) 

In [None]:
geno_got_wk5_not_wk4 # Truly mysterious

In [None]:
phenos_w_names_IDs[which(phenos_w_names_IDs$Genotype == "BESC-113"),]

In [None]:
phenos_w_names_IDs[which(phenos_w_names_IDs$Genotype == "BESC-117"),]

It seems the answer is that in Phase 3 we collected no wk4 data.
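
A cross-tabulation of phase by timepoint would show a gap like this directly. The sketch below uses toy data, and `Week` is a hypothetical name for the timepoint column, which may be called something else in `phenos_w_names_IDs`.

```r
# Toy stand-in; `Week` is a hypothetical name for the timepoint column.
toy_phenos <- data.frame(
    Phase = c(1, 1, 2, 2, 2, 3, 3),
    Week  = c(3, 3, 3, 4, 5, 3, 5)
)

# Rows are phases, columns are weeks; a zero marks a combination with no data.
obs_per_phase_week <- table(Phase = toy_phenos$Phase, Week = toy_phenos$Week)
print(obs_per_phase_week)

# Phases with no week 4 observations.
phases_missing_wk4 <- rownames(obs_per_phase_week)[obs_per_phase_week[, "4"] == 0]
print(phases_missing_wk4)
```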

In [None]:
for(genotype in geno_got_wk5_not_wk4){
    #print(genotype)
    pheno_subset <- phenos_w_names_IDs[which(phenos_w_names_IDs$Genotype == genotype), ]
    phases_studied_in <- unique(pheno_subset$Phase)
    # Use || and any() so the condition is always a single TRUE/FALSE
    if((length(phases_studied_in) > 1) || any(phases_studied_in != 3)){
        print(pheno_subset)
        #stop()
    }
}

<div class="alert alert-block alert-success"> Success with writing phenotypes. Now to write phase and diameter data. </div>


## Finish parsing covariate data, then write

### Diameter

Format diameter data and evaluate

In [None]:
colnames(diameter)
colnames(diameter)[3] <- "Treatment"

In [None]:
levels(factor(diameter$Treatment))

In [None]:
diameter.2 <- diameter[which(diameter$Treatment == '0.5'),]

In [None]:
diameter.2$diameter_mm <- as.numeric(as.character(diameter.2$diameter_mm))

In [None]:
head(diameter.2)

#### Fix inconsistencies (as before... redundant)

In [None]:
diameter.2$Genotype <- gsub("\\ -", "-", diameter.2$Genotype)
diameter.2$Genotype <- gsub("--", "-", diameter.2$Genotype)

# Fixing more names 3.25.19
# The specific BLGC-20 fix must run before the general BLGC -> BLCG fix;
# otherwise its pattern can no longer match.
diameter.2$Genotype <- gsub("BLGC-20", "BLCG-28", diameter.2$Genotype)
diameter.2$Genotype <- gsub("BLGC", "BLCG", diameter.2$Genotype)
diameter.2$Genotype <- gsub("SKWF-24-2", "SKWE-24-2", diameter.2$Genotype)
diameter.2$Genotype <- gsub("SLMD-28-03", "SLMD-28-3", diameter.2$Genotype)
diameter.2$Genotype <- gsub("BESC _331", "BESC_331", diameter.2$Genotype)
diameter.2$Genotype <- gsub("GW _", "GW_", diameter.2$Genotype)
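
Sequential `gsub()` calls each see the output of the previous call, so the order of a broad pattern and a narrow pattern matters. Here is a toy illustration with made-up genotype names:

```r
toy_names <- c("BLGC-20", "BLGC-7")

# Broad fix first: the later narrow pattern no longer matches anything.
broad_first <- gsub("BLGC", "BLCG", toy_names)
broad_first <- gsub("BLGC-20", "BLCG-28", broad_first)
print(broad_first)

# Narrow fix first: both replacements take effect.
narrow_first <- gsub("BLGC-20", "BLCG-28", toy_names)
narrow_first <- gsub("BLGC", "BLCG", narrow_first)
print(narrow_first)
```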

#### Aggregate duplicates

In [None]:
aggregate_diameters <- aggregate(diameter_mm~Genotype, data=diameter.2, FUN=mean)
#aggregate_diameters <- aggregate(`Final Stem Diameter (mm)`~Genotype, data=diameter.2, FUN=mean)

In [None]:
head(aggregate_diameters)

#### Order according to ID list

In [None]:
diameter_w_names_IDs <- merge(id_order_table,
                              aggregate_diameters,
                              by = "Genotype",
                              all.x = TRUE,
                              all.y = FALSE)
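
One thing to keep in mind: base `merge()` sorts its result on the `by` column by default, so the merged rows follow the sorted genotype names rather than the original order of the ID list. Whether that matters depends on how the downstream tool matches rows, which this notebook does not state. If the ID-list order is needed, it can be restored with `match()`. A minimal sketch, assuming plain data frames and using made-up tables:

```r
# Toy stand-ins for id_order_table and aggregate_diameters.
toy_id_order <- data.frame(Genotype = c("GW-2", "BESC-1", "BESC-3"),
                           stringsAsFactors = FALSE)
toy_diameters <- data.frame(Genotype = c("BESC-1", "GW-2"),
                            diameter_mm = c(4.2, 3.1),
                            stringsAsFactors = FALSE)

merged <- merge(toy_id_order, toy_diameters,
                by = "Genotype", all.x = TRUE, all.y = FALSE)
print(merged$Genotype)  # sorted by Genotype, not in ID-list order

# Restore the ID-list order explicitly.
merged_in_id_order <- merged[match(toy_id_order$Genotype, merged$Genotype), ]
rownames(merged_in_id_order) <- NULL
print(merged_in_id_order)
```

Genotypes with no diameter measurement (here `BESC-3`) are kept with `NA` because of `all.x = TRUE`.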

### Phase

#### Final inspection

First, let's make sure every genotype is only found in a single phase now.

In [None]:
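# Avoid naming the result `table`, which would mask the base function name
phase_by_genotype <- table(phenos_w_names_IDs$Phase, phenos_w_names_IDs$Genotype)
phase_by_genotype[phase_by_genotype >= 1] <- 1 # Convert counts to presence/absence
n_phases_per_genotype <- colSums(phase_by_genotype)
print(max(n_phases_per_genotype)) # Max should be 1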
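
The check above reports only the maximum. If it ever comes back greater than 1, it helps to know which genotypes are responsible. A minimal sketch on toy data, where `toy_phenos` is a hypothetical stand-in for `phenos_w_names_IDs`:

```r
toy_phenos <- data.frame(
    Genotype = c("BESC-1", "BESC-1", "BESC-2", "BESC-2", "GW-3"),
    Phase    = c(1, 1, 2, 3, 4),
    stringsAsFactors = FALSE
)

# Count distinct phases per genotype.
n_phases <- tapply(toy_phenos$Phase, toy_phenos$Genotype,
                   function(p) length(unique(p)))

# Names of genotypes found in more than one phase (empty if the data are clean).
multi_phase_genotypes <- names(n_phases)[n_phases > 1]
print(multi_phase_genotypes)
```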

Now return to the phenotype data to find the phase of each genotype and format it as one indicator column per phase. (The uniqueness check repeats the one above.)

In [None]:
# First for phenotype data

genotypes_phases <- table(phenos_w_names_IDs$Genotype,
                          phenos_w_names_IDs$Phase)
df <- as.data.frame.matrix(genotypes_phases)
df[df>1] <- 1
max(rowSums(df)) #Double check that nothing appears in more than one phase

In [None]:
df$Genotype <- rownames(df)

In [None]:
# If coded as FID_IID:
#phase_w_names_IDs <- merge(df, dictionary, by.x = "Genotype", by.y = "Name", all.x = FALSE, all.y = TRUE)
#phase_w_names_IDs <- merge(as.data.table(id_order), phase_w_names_IDs, by.x = "id_order", by.y = "ID", all.x = TRUE, all.y = FALSE)

#### Order according to ID list

In [None]:
phase_w_names_IDs <- merge(id_order_table,
                           df,
                           by = "Genotype",
                           all.x = TRUE,
                           all.y = FALSE)

In [None]:
colnames(phase_w_names_IDs)

In [None]:
colnames(diameter_w_names_IDs)

### Combine into a single covariate `data.frame`

In [None]:
covariates_out <- cbind(diameter_w_names_IDs$diameter_mm,
                        phase_w_names_IDs[,2:9])
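
`cbind()` pairs rows purely by position, so the two tables must list the same genotypes in the same order. A quick guard before combining can catch a mismatch. This is a sketch on toy tables with hypothetical names and values, not the real covariates:

```r
# Toy stand-ins for diameter_w_names_IDs and phase_w_names_IDs.
toy_diameter <- data.frame(Genotype = c("BESC-1", "BESC-3", "GW-2"),
                           diameter_mm = c(4.2, NA, 3.1),
                           stringsAsFactors = FALSE)
toy_phase <- data.frame(Genotype = c("BESC-1", "BESC-3", "GW-2"),
                        Ph1 = c(1, 0, 0),
                        Ph2 = c(0, 1, 1),
                        stringsAsFactors = FALSE)

# Stop early if the genotype keys do not agree row for row.
stopifnot(identical(toy_diameter$Genotype, toy_phase$Genotype))

toy_covariates <- cbind(Stem_diam_mm = toy_diameter$diameter_mm,
                        toy_phase[, c("Ph1", "Ph2")])
print(toy_covariates)
```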

In [None]:
colnames(covariates_out) <- c("Stem_diam_mm", "Ph1", "Ph2", "Ph3", "Ph4", "Ph5", "Ph6", "Ph7", "Ph8")

In [None]:
head(covariates_out)

### Write

In [None]:
fwrite(covariates_out, "../../covariate_files/stem_regen_covariates/Covariates_Stem_AllPhases.txt",
      row.names = FALSE, sep = "\t")