(c) 2023 Manuel Razo. This work is licensed under a [Creative Commons
Attribution License CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/).
All code contained herein is licensed under an [MIT
license](https://opensource.org/licenses/MIT).

In [1]:
# Load project package
@load_pkg BayesFitUtils

# Import project package
import BayesFitUtils

# Import Bayesian fitness inference library
import BarBay

# Import package to handle DataFrames
import DataFrames as DF
import CSV

# Import basic statistical functions
import StatsBase
import Distributions

# Load CairoMakie for plotting
using CairoMakie
import ColorSchemes
import Makie
# Activate backend
CairoMakie.activate!()

# Set PBoC Plotting style
BayesFitUtils.viz.pboc_makie!()

# Aguilar 2023 exploratory data analysis

To process the Aguilar 2023 data, let's first explore the structure of the dataset.

First, we need to load the data into memory.

In [4]:
# Load data into memory
data = CSV.read(
    "$(git_root())/data/aguilar_2023/raw_counts.tsv", DF.DataFrame; delim="\t"
)

first(data, 5)

Row,oligo,barcode,edit,condition,timepoint,replicate,counts,neutral
,String,String31,String31,String15,String3,String7,Int64,Int64
1,0#LADDER,TCCATCTAACCGCACAACTG,SNP_1_LADDER,M3.SL.2D,T1,Rep1,2,0
2,0#LADDER,TCCATCTAACCGCACAACTG,SNP_1_LADDER,M3.SL.2D,T1,Rep2,3,0
3,0#LADDER,TCCATCTAACCGCACAACTG,SNP_1_LADDER,M3.SL.2D,T1,Rep3,3,0
4,0#LADDER,TCCATCTAACCGCACAACTG,SNP_1_LADDER,M3.SL.2D,T2,Rep1,0,0
5,0#LADDER,TCCATCTAACCGCACAACTG,SNP_1_LADDER,M3.SL.2D,T2,Rep2,2,0


There are a few things that I will change from the original dataset to match my
standard format.
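
As a small illustration of the label conversions applied in the next cell, the sketch below runs the same string operations on a few example labels. The `"T10"` label is hypothetical and is included only to show why an integer time column is convenient: as strings, `"T10"` sorts before `"T2"`, while the parsed integers sort in the expected order.

```julia
# Example labels in the raw format; "T10" is a hypothetical label
timepoints = ["T1", "T2", "T10"]
replicates = ["Rep1", "Rep2", "Rep3"]

# Strip the "T" prefix and parse the remainder as an integer
times = parse.(Int64, replace.(timepoints, "T" => ""))

# Shorten "Rep1" to "R1" by dropping the "ep" substring
reps = replace.(replicates, "ep" => "")

println(times)  # [1, 2, 10]
println(reps)   # ["R1", "R2", "R3"]
```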

In [5]:
# Rename columns to my standards
DF.rename!(data, "replicate" => "rep", "counts" => "count")

# Convert neutral column from integer to boolean
neutral = data.neutral .== 1

# Remove old neutral column
data = data[:, DF.Not(:neutral)]

# Add neutral column
data[!, :neutral] = neutral

# Add time column
data[!, :time] = parse.(Int64, replace.(data.timepoint, "T" => ""))

# Replace replicate column notation
data.rep .= replace.(data.rep, "ep" => "")

# Sum counts by timepoint and replicate
data_sum = DF.combine(
    DF.groupby(data, [:timepoint, :rep]), :count => sum
)

# Add sum to dataframe
DF.leftjoin!(data, data_sum; on=[:timepoint, :rep])
# Add frequency column
data[!, :freq] = data.count ./ data.count_sum

first(data, 5)

Row,oligo,barcode,edit,condition,timepoint,rep,count,neutral,time,count_sum,freq
,String,String31,String31,String15,String3,String,Int64,Bool,Int64,Int64?,Float64
1,0#LADDER,TCCATCTAACCGCACAACTG,SNP_1_LADDER,M3.SL.2D,T1,R1,2,False,1,36819957,5.43184e-08
2,0#LADDER,TCCATCTAACCGCACAACTG,SNP_1_LADDER,M3.SL.2D,T1,R2,3,False,1,32853111,9.13156e-08
3,0#LADDER,TCCATCTAACCGCACAACTG,SNP_1_LADDER,M3.SL.2D,T1,R3,3,False,1,37121864,8.08149e-08
4,0#LADDER,TCCATCTAACCGCACAACTG,SNP_1_LADDER,M3.SL.2D,T2,R1,0,False,2,36892406,0.0
5,0#LADDER,TCCATCTAACCGCACAACTG,SNP_1_LADDER,M3.SL.2D,T2,R2,2,False,2,32572176,6.14021e-08
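
Because each count is divided by the total for its time point and replicate, the frequencies should add up to one within every such group. The following is a minimal sketch of that sanity check on a toy table with made-up counts; the names `toy`, `toy_sum`, and `freq_check` are illustrative and are not part of the analysis above.

```julia
import DataFrames as DF

# Toy counts (made-up values): two barcodes, two time points, one replicate
toy = DF.DataFrame(
    barcode=["A", "B", "A", "B"],
    timepoint=["T1", "T1", "T2", "T2"],
    rep=["R1", "R1", "R1", "R1"],
    count=[2, 6, 3, 1],
)

# Total counts per time point and replicate (creates the column :count_sum)
toy_sum = DF.combine(DF.groupby(toy, [:timepoint, :rep]), :count => sum)

# Attach the totals to each row and normalize the counts
DF.leftjoin!(toy, toy_sum; on=[:timepoint, :rep])
toy[!, :freq] = toy.count ./ toy.count_sum

# Sum the frequencies within each group; every entry should be one
freq_check = DF.combine(DF.groupby(toy, [:timepoint, :rep]), :freq => sum)
```

Running the same final `combine` on the full `data` table would flag any group whose frequencies do not sum to one, for example because of a mismatch in the join keys.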


Let's print some of the main features of the dataset.

In [6]:
println("# of unique barcodes: $(length(unique(data.barcode)))")

println("# of unique neutral lineages: $(length(unique(data[data.neutral, :barcode])))")


# of unique barcodes: 5201151
# of unique neutral lineages: 72142


In [7]:
println("# of unique oligos: $(length(unique(data.oligo)))")

println("# of unique edits: $(length(unique(data.edit)))")

# of unique oligos: 21740
# of unique edits: 10821


That is an extremely large number of barcodes (more than five million), too many
to work with at the individual-barcode level.
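
One way to see how barcodes are distributed across the next level of the hierarchy is to count the unique barcodes attached to each oligo. The sketch below does this on a toy table with made-up values; the column name `n_barcodes` is an illustrative choice.

```julia
import DataFrames as DF

# Toy table (made-up values): oligo "o1" has two barcodes, "o2" has one
toy = DF.DataFrame(
    oligo=["o1", "o1", "o1", "o2"],
    barcode=["AAA", "AAA", "CCC", "GGG"],
    count=[2, 1, 3, 5],
)

# Count the number of unique barcodes within each oligo
bc_per_oligo = DF.combine(
    DF.groupby(toy, :oligo),
    :barcode => (x -> length(unique(x))) => :n_barcodes,
)
```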

## Grouping data by oligo

Let's group the data by oligo to work with the next level in the data
organization hierarchy.

In [9]:
# Group data by oligos
data_oligo = DF.combine(
    DF.groupby(
        data[:, DF.Not([:barcode, :freq, :count_sum])],
        [:oligo, :edit, :condition, :rep, :time, :neutral]
    ),
    :count => sum
)

# Rename column
DF.rename!(data_oligo, :count_sum => :count)

# Compute total counts per condition, replicate, and time point
data_sum = DF.combine(
    DF.groupby(data_oligo, [:condition, :rep, :time]), :count => sum
)

# Add count total
DF.leftjoin!(data_oligo, data_sum; on=[:condition, :rep, :time])

# Add frequency column
data_oligo[!, :freq] = data_oligo.count ./ data_oligo.count_sum

first(data_oligo, 5)

Row,oligo,edit,condition,rep,time,count,count_sum,freq
,String,String31,String15,String,Int64,Int64,Int64?,Float64
1,0#LADDER,SNP_1_LADDER,M3.SL.2D,R1,1,2218,36819957,6.02391e-05
2,0#LADDER,SNP_1_LADDER,M3.SL.2D,R2,1,2022,32853111,6.15467e-05
3,0#LADDER,SNP_1_LADDER,M3.SL.2D,R3,1,2235,37121864,6.02071e-05
4,0#LADDER,SNP_1_LADDER,M3.SL.2D,R1,2,2373,36892406,6.43222e-05
5,0#LADDER,SNP_1_LADDER,M3.SL.2D,R2,2,2089,32572176,6.41345e-05
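
Collapsing barcodes into oligos only regroups the counts, so the total number of reads should be the same before and after. Here is a minimal sketch of that check on a toy table with made-up values. It also uses the `:count => sum => :count` form, which names the output column directly and so avoids a separate rename step.

```julia
import DataFrames as DF

# Toy barcode-level counts (made-up values): "o1" has two barcodes
toy = DF.DataFrame(
    oligo=["o1", "o1", "o2"],
    barcode=["AAA", "CCC", "GGG"],
    rep=["R1", "R1", "R1"],
    time=[1, 1, 1],
    count=[2, 3, 5],
)

# Sum counts over barcodes within each oligo, replicate, and time point
toy_oligo = DF.combine(
    DF.groupby(toy, [:oligo, :rep, :time]), :count => sum => :count
)

# The total number of counts should be unchanged by the grouping
println(sum(toy.count) == sum(toy_oligo.count))
```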


Let's write these results to disk.

In [18]:
CSV.write(
    "$(git_root())/data/aguilar_2023/tidy_counts_oligo.csv", data_oligo
)

"/Users/mrazo/git/bayesian_fitness/data/aguilar_2023/tidy_counts_oligo.csv"

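A quick way to confirm that a table was written as intended is to read the file back and compare it with the original. The sketch below does this round trip on a toy table with made-up values. It writes to a temporary directory instead of the repository path, and the file name `toy_counts.csv` is hypothetical.

```julia
import CSV
import DataFrames as DF

# Toy oligo-level table (made-up values)
toy = DF.DataFrame(oligo=["o1", "o2"], time=[1, 1], count=[5, 5])

# Write to a temporary directory (not the repository path used above)
path = joinpath(mktempdir(), "toy_counts.csv")
CSV.write(path, toy)

# Read the file back and compare with the original table
toy_back = CSV.read(path, DF.DataFrame)
println(toy_back == toy)
```
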
In [None]:
# Initialize figure
fig = Figure(resolution=(400, 300))

# Add axis
ax = Axis(
    fig[1, 1],
    xlabel="time point",
    ylabel="barcode frequency",
)

# Plot frequency trajectory
BarBay.viz.bc_time_series!(
    ax,
    data[data.rep.=="R1", :];
    id_col=:barcode,
    time_col=:time,
    quant_col=:freq
)

fig