# Putative Genetic Load
Because the tara iti reference genome could not be annotated with RNA sequencing data, and to ensure a robust and conservative assessment of genetic load that is comparable across species, we limited load estimates to highly conserved BUSCO genes in the fairy tern species complex (*Sterna nereis* spp.), with kakī (*Himantopus novaezelandiae*) included for comparison.  

To start, we ran BUSCO v5.4.7 on the tara iti and kakī reference genomes.  

In [None]:
busco --in Katie_racon_ragtag_autosomes.fa --out busco/ --mode genome --lineage_dataset aves_odb10 --cpu 32
busco --in himNova-hic-scaff_autosomes.fa --out busco/ --mode genome --lineage_dataset aves_odb10 --cpu 32

We then concatenated the single-copy BUSCO sequences into species-specific `GFF` files.  

In [None]:
cat busco/run_aves_odb10/busco_sequences/single_copy_busco_sequences/*.gff > SIFT_DB/TI_annotation/merged_scBUSCOs.gff
cat busco/run_aves_odb10/busco_sequences/single_copy_busco_sequences/*.gff > SIFT_DB/KI_annotation/merged_scBUSCOs.gff

cat busco/run_aves_odb10/busco_sequences/single_copy_busco_sequences/*.faa > SIFT_DB/TI_annotation/merged_scBUSCOs.fa
cat busco/run_aves_odb10/busco_sequences/single_copy_busco_sequences/*.faa > SIFT_DB/KI_annotation/merged_scBUSCOs.fa

### Sites Polarised with ANGSD
To estimate masked and realised load in each fairy tern population and in kakī, we used ANGSD to output the major and minor alleles, the called genotype, the posterior probability of the called genotype, and the posterior probabilities of all possible genotypes (`-doGeno 31`) to a BCF file (`-doBcf 1`).  

In [None]:
angsd -P 24 -b ${ANGSD}GLOBAL.list -ref ${TREF} -anc ${TANC} -out ${ANGSD}samtools/genotypes/GLOBAL_polarized \
        -uniqueOnly 1 -remove_bads 1 -only_proper_pairs 1 -trim 0 -C 50 -baq 1 -skipTriallelic 1 \
        -minMapQ 20 -minQ 20 -minInd 57 -setMinDepth 555 -setMaxDepth 1056 -doCounts 1 \
        -doPost 1 -postCutoff 0.95 -doBcf 1 -GL 1 -doMajorMinor 5 -doMaf 1 -SNP_pval 1e-6 -doGeno 31 --ignore-RG 0

angsd -P 32 -b ${DIR}KI.list -ref ${KREF} -anc ${KANC} -out ${DIR}samtools/genotypes/KI_polarized \
        -uniqueOnly 1 -remove_bads 1 -only_proper_pairs 1 -trim 0 -C 50 -baq 1 -skipTriallelic 1 \
        -minMapQ 20 -minQ 20 -minInd 24 -setMinDepth 700 -setMaxDepth 1200 -doCounts 1 \
        -doPost 1 -postCutoff 0.95 -doBcf 1 -GL 1 -doMajorMinor 5 -doMaf 1 -SNP_pval 1e-6 -doGeno 31 --ignore-RG 0


### SIFT
We then used [SIFT4G](https://github.com/rvaser/sift4g) to identify deleterious sites within these BUSCO genes. To do so, we leveraged the outputs from BUSCO to create a SIFT database for each of our tara iti and kakī reference genomes.  

Using the same method as when we concatenated the `GFF` files used to create `BED` files for variant discovery, we also obtained the reference protein sequences for each complete single copy BUSCO gene.

In [None]:
cat SP01_genome/busco/run_aves_odb10/busco_sequences/single_copy_busco_sequences/*.faa > SP01_genome/busco/merged_single_copy_busco_prot.fa
cat kaki_genome/busco/run_aves_odb10/busco_sequences/single_copy_busco_sequences/*.faa > kaki_genome/busco/merged_single_copy_busco_prot.fa

The SIFT documentation suggests using `gffread`; however, this is deprecated and unmaintained. We instead converted the `GFF` files output by BUSCO to `GTF` with [AGAT](https://github.com/NBISweden/AGAT) v1.0.0. First, we fixed the concatenated `GFF` from BUSCO to make it more compatible with SIFT and VEP.  

In [None]:
agat_sp_manage_IDs.pl --gff SIFT_DB/KI_annotations/merged_scBUSCO.gff -o SIFT_DB/KI_annotations/merged_scBUSCO_checked.gff
agat_sp_manage_IDs.pl --gff SIFT_DB/TI_annotations/merged_scBUSCO.gff -o SIFT_DB/TI_annotations/merged_scBUSCO_checked.gff

This `GFF` was then converted to `GTF` with AGAT and compressed for SIFT.  

In [None]:
agat_convert_sp_gff2gtf.pl --gff SIFT_DB/KI_annotations/merged_scBUSCO_checked.gff -o SIFT_DB/KI_annotations/merged_scBUSCO_checked.gtf
agat_convert_sp_gff2gtf.pl --gff SIFT_DB/TI_annotations/merged_scBUSCO_checked.gff -o SIFT_DB/TI_annotations/merged_scBUSCO_checked.gtf

bgzip -c SIFT_DB/KI_annotations/merged_scBUSCO_checked.gtf > SIFT_DB/kaki_database/gene-annotation-src/merged_scBUSCO_checked.gtf.gz
bgzip -c SIFT_DB/TI_annotations/merged_scBUSCO_checked.gtf > SIFT_DB/fairy_database/gene-annotation-src/merged_scBUSCO_checked.gtf.gz

We then used [SIFT4G_Create_Genomic_DB](https://github.com/pauline-ng/SIFT4G_Create_Genomic_DB) to construct a protein database from BUSCO genes identified in our tara iti and kakī reference genomes.  

First, we downloaded and configured the docker image.  

In [None]:
git clone https://github.com/pauline-ng/SIFT4G_Create_Genomic_DB.git
cd SIFT4G_Create_Genomic_DB
docker build -t sift4g_db .

To be able to run this tool on our HPC environment, we then exported this docker container to a `.tar` file for conversion to a `.sif` container with Apptainer.  

Then, a config file was formatted as below, adjusting the relevant settings as necessary.  

In [None]:
GENETIC_CODE_TABLE=1
GENETIC_CODE_TABLENAME=Standard
MITO_GENETIC_CODE_TABLE=2
MITO_GENETIC_CODE_TABLENAME=Vertebrate Mitochondrial

PARENT_DIR=/nesi/nobackup/uc03718/SIFT_DB/fairy_database
ORG=tara_iti
ORG_VERSION=tara_iti_v1
DBSNP_VCF_FILE=

# Running SIFT4G, this path works for the Dockerfile
SIFT4G_PATH=/sift4g/bin/sift4g

# PROTEIN_DB needs to be uncompressed
PROTEIN_DB=/nesi/nobackup/uc03718/SIFT_DB/uniref100.fasta

# Subdirectories, don't need to change
GENE_DOWNLOAD_DEST=gene-annotation-src
CHR_DOWNLOAD_DEST=chr-src
LOGFILE=log.txt
ZLOGFILE=log2.txt
FASTA_DIR=fasta
SUBST_DIR=subst
ALIGN_DIR=SIFT_alignments
SIFT_SCORE_DIR=SIFT_predictions
SINGLE_REC_BY_CHR_DIR=singleRecords
SINGLE_REC_WITH_SIFTSCORE_DIR=singleRecords_with_scores
DBSNP_DIR=dbSNP

# Doesn't need to change
FASTA_LOG=fasta.log
INVALID_LOG=invalid.log
PEPTIDE_LOG=peptide.log
ENS_PATTERN=ENS
SINGLE_RECORD_PATTERN=:change:_aa1valid_dbsnp.singleRecord

SIFT databases were then constructed.  

In [None]:
ml purge
ml load Apptainer/1.3.1

printf "STARTED CONSTRUCTING DB FOR KI 10X AT "
date

apptainer exec --unsquash --bind /nesi/nobackup/uc03718/SIFT_DB/:/home/ /nesi/nobackup/uc03718/containers/sift_4g_v2.sif \
        /bin/bash -c "
                echo 'CHANGING DIRECTORY AT '; date;
                cd /SIFT4G_Create_Genomic_DB/;
                perl make-SIFT-db-all.pl -c /home/fairy.txt;
                echo 'EXITING CONTAINER AT '; date;
        "

And finally we annotated the filtered VCF file using our SIFT database.  

In [None]:
java -jar /home/jana/SIFT_Annotator.jar -c -i ${DIR}GLOBAL_polarized_filtered.vcf \
    -d fairy_SIFT_databases/SP01_v1/ \
    -r SIFT/fairy_output -t

### Variant Effect Predictor
VEP was run below using the GFF file constructed above.

In [None]:
perl vep -i GLOBAL_polarized.vcf \
    --custom references/SP01_BUSCO_check.gff.gz,FAIRY_GFF,gff \
    --fasta references/SP01_5kb_ragtag_fold60.fa.gz \
    --everything \
    -o vep_data/angsd_high_confidence_BUSCO_SNPs/GLOBAL_whole-genome_polarized

perl vep -i KI_polarized.vcf \
    --custom references/kaki_BUSCO_check.gff.gz,KAKI_GFF,gff \
    --fasta references/him_Nova-hic-scaff.fa \
    --everything \
    -o vep_data/angsd_high_confidence_BUSCO_SNPs/KI_whole-genome_polarized

### Intersecting calls between VEP & SIFT
To find sites that both had an impact (VEP) and were deleterious (SIFT), we first filtered the SIFT output for those sites with a SIFT score <= 0.05.

In [None]:
awk '{print $1"\t"$2"\t"$(NF-4)"\t"$(NF-3)"\t"$(NF-2)"\t"$(NF-1)"\t"$NF}' GLOBAL_whole-genome_SIFTannotations.txt | awk '{ if (($5 != "NA" && $6 <= 0.05 && $7 >= 2.75) || ($6 != "NA" && $6 <=0.05 && $7 >= 3.5)) print $0}' > deleterious_SIFT_fairy.tsv
awk '{print $1"\t"$2"\t"$(NF-4)"\t"$(NF-3)"\t"$(NF-2)"\t"$(NF-1)"\t"$NF}' GLOBAL_whole-genome_SIFTannotations.txt | awk '{ if (($5 != "NA" && $6 > 0.05 && $7 >= 2.75) || ($6 !="NA" && $6 > 0.05 && $7 <= 3.5)) print $0}' > tolerant_SIFT_fairy.tsv

awk '{print $1"\t"$2"\t"$(NF-4)"\t"$(NF-3)"\t"$(NF-2)"\t"$(NF-1)"\t"$NF}' kaki_whole-genome_SIFTannotations.txt | awk '{ if (($5 != "NA" && $6 <= 0.05 && $7 >= 2.75) || ($5 != "NA" && $6 <=0.05 && $7 >= 3.5)) print $0}' > deleterious_SIFT_kaki.tsv
awk '{print $1"\t"$2"\t"$(NF-4)"\t"$(NF-3)"\t"$(NF-2)"\t"$(NF-1)"\t"$NF}' kaki_whole-genome_SIFTannotations.txt | awk '{ if (($5 != "NA" && $6 > 0.05 && $7 >= 2.75) || ($5 !="NA" && $6 > 0.05 && $7 <= 3.5)) print $0}' > tolerant_SIFT_kaki.tsv

awk '{print $1"\t"$2"\t"$(NF-4)"\t"$(NF-3)"\t"$(NF-2)"\t"$(NF-1)"\t"$NF}' KI_10x_whole-genome_polarized_filtered_SIFTannotations.txt | awk '{ if (($5 != "NA" && $6 <= 0.05 && $7 >= 2.75) || ($5 != "NA" && $6 <=0.05 && $7 >= 3.5)) print $0}' > deleterious_SIFT_KI_10x.tsv
awk '{print $1"\t"$2"\t"$(NF-4)"\t"$(NF-3)"\t"$(NF-2)"\t"$(NF-1)"\t"$NF}' KI_10x_whole-genome_polarized_filtered_SIFTannotations.txt | awk '{ if (($5 != "NA" && $6 > 0.05 && $7 >= 2.75) || ($5 !="NA" && $6 > 0.05 && $7 <= 3.5)) print $0}' > tolerant_SIFT_KI_10x.tsv
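
The score cutoff described above can be sketched in Python. This is a minimal illustration that assumes only the 0.05 threshold stated in this section; the function name `classify_sift` and the example scores are hypothetical, and the additional column filters in the `awk` commands are not reproduced here.

```python
import math


def classify_sift(score, threshold=0.05):
    """Label a SIFT score as 'Intolerant' (<= threshold) or 'Tolerant' (> threshold).

    Returns None when the score is missing or not numeric (e.g. 'NA').
    """
    try:
        value = float(score)
    except (TypeError, ValueError):
        return None
    if math.isnan(value):
        return None
    return "Intolerant" if value <= threshold else "Tolerant"


# Hypothetical scores, for illustration only
for s in ["0.00", "0.05", "0.32", "NA"]:
    print(s, classify_sift(s))
```

Treating missing scores as `None`, rather than as tolerant, keeps unscored sites out of both classes.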

SIFT sites were then extracted, and used to determine impact from VEP.

In [None]:
grep -v "#" GLOBAL_whole-genome.tsv | grep -v IMPACT=MODIFIER > VEP_impacts_fairy.tsv
grep -v "#" KI_whole-genome.tsv | grep -v IMPACT=MODIFIER > VEP_impacts_kaki.tsv
grep -v "#" KI_10x_whole-genome.tsv | grep -v IMPACT=MODIFIER > VEP_impacts_KI_10x.tsv

for POP in fairy kaki
    do
    awk '{print $1":"$2}' deleterious_SIFT_${POP}.tsv > SIFT_sites_${POP}.txt
    while read -r line
        do
        grep "$line" VEP_impacts_${POP}.tsv >> SIFT_VEP_intersect_${POP}.txt
    done < SIFT_sites_${POP}.txt
done
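
As an alternative sketch of the same intersection, the snippet below matches sites on an exact `CHROM:POS` key instead of a pattern search. It assumes that the second tab-separated column of the VEP output holds the location as `CHROM:POS`; the function names and toy records are hypothetical. Exact matching avoids a substring hit such as `chr1:100` also matching `chr1:1000`, which a plain `grep "$line"` would return.

```python
def site_key(line):
    """Build a 'CHROM:POS' key from the first two whitespace-separated fields."""
    chrom, pos = line.split()[:2]
    return f"{chrom}:{pos}"


def intersect_sites(sift_lines, vep_lines):
    """Return the VEP lines whose location exactly matches a SIFT site."""
    sift_sites = {site_key(line) for line in sift_lines if line.strip()}
    hits = []
    for line in vep_lines:
        fields = line.rstrip("\n").split("\t")
        # Assumption: column 2 of the VEP tab output is the location (CHROM:POS)
        if len(fields) > 1 and fields[1] in sift_sites:
            hits.append(line)
    return hits


# Toy records, for illustration only
sift = ["chr1\t100\t0.01", "chr2\t50\t0.00"]
vep = ["var1\tchr1:100\tA\tIMPACT=MODERATE", "var2\tchr1:1000\tT\tIMPACT=HIGH"]
print(intersect_sites(sift, vep))
```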

A brief check showed that all sites marked deleterious in SIFT had some impact in VEP.  
These intersecting sites were then extracted from the filtered SNPs, and allele frequency estimated for each fairy tern population.

In [None]:
while read -r line
    do
    CHROM=$(echo $line | awk '{print $1}')
    POSIT=$(echo $line | awk '{print $2}')
    AALLE=$(bcftools query -t ${CHROM}:${POSIT} -f '[%GT\n]' GLOBAL_whole-genome_polarised_filtered_AU.vcf | sed 's%0/0%0%g' | sed 's%0/1%1%g'| sed 's%1/1%2%g' | awk '{sum += $1}; END {print sum}')
    TALLE=$(bcftools query -t ${CHROM}:${POSIT} -f '[%GT\n]' GLOBAL_whole-genome_polarised_filtered_TI.vcf | sed 's%0/0%0%g' | sed 's%0/1%1%g'| sed 's%1/1%2%g' | awk '{sum += $1}; END {print sum}')
    printf "$line\t$AALLE\tAU\n" >> pop_harmful_allele_frequency.tsv
    printf "$line\t$TALLE\tTI\n" >> pop_harmful_allele_frequency.tsv
done < deleterious_SIFT_fairy.tsv
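
The genotype recoding done with `sed` and `awk` above can be sketched in Python as follows. This assumes biallelic, diploid genotypes in which allele `1` is the derived allele after polarisation; the function name and example genotypes are hypothetical.

```python
def derived_allele_count(genotypes):
    """Sum copies of the derived (ALT) allele across diploid genotype strings.

    Missing calls ('./.') contribute nothing; '/' and '|' separators are both accepted.
    """
    total = 0
    for gt in genotypes:
        alleles = gt.replace("|", "/").split("/")
        total += sum(1 for allele in alleles if allele == "1")
    return total


# Toy genotypes, for illustration only
print(derived_allele_count(["0/0", "0/1", "1/1", "./."]))
```

Unlike the `sed` pipeline, this version does not pass unconverted strings such as `./.` on to the summation step.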

Then we counted the number of alleles per site, per individual, for all nonsynonymous mutations classed as either intolerant or tolerant.  

In [None]:
while read -r line
    do
    CHROM=$(echo $line | awk '{print $1}')
    POSIT=$(echo $line | awk '{print $2}')
    bcftools query -t ${CHROM}:${POSIT} -f '[%CHROM\t%POS\t%SAMPLE\t%GT\tAU\tTolerant\n]' GLOBAL_whole-genome_filtered_AU.vcf >> indiv_harmful_allele_frequency.tsv
    bcftools query -t ${CHROM}:${POSIT} -f '[%CHROM\t%POS\t%SAMPLE\t%GT\tTI\tTolerant\n]' GLOBAL_whole-genome_filtered_TI.vcf >> indiv_harmful_allele_frequency.tsv
done < tolerant_SIFT_fairy.tsv

while read -r line
    do
    CHROM=$(echo $line | awk '{print $1}')
    POSIT=$(echo $line | awk '{print $2}')
    bcftools query -t ${CHROM}:${POSIT} -f '[%CHROM\t%POS\t%SAMPLE\t%GT\tAU\tIntolerant\n]' GLOBAL_whole-genome_filtered_AU.vcf >> indiv_harmful_allele_frequency.tsv
    bcftools query -t ${CHROM}:${POSIT} -f '[%CHROM\t%POS\t%SAMPLE\t%GT\tTI\tIntolerant\n]' GLOBAL_whole-genome_filtered_TI.vcf >> indiv_harmful_allele_frequency.tsv
done < deleterious_SIFT_fairy.tsv

## Load Estimates
### Harmful allele frequency
The sc-BUSCO gene regions sum to 289,743,567 bp in kakī and 227,780,539 bp in tara iti, corresponding to roughly 26% and 21% of the kakī and fairy tern genomes, respectively.

First, we examined the frequency of the derived alleles determined to be 'Deleterious' by SIFT. Here, we can see that the presence of derived harmful alleles is markedly lower in tara iti than in the other two populations.  

For each individual, we counted the number of intolerant (SIFT score <=0.05) and tolerated (SIFT score >0.05 & <=0.10) SNPs, and the sum of all nonsynonymous SNPs called within BUSCO regions.  

Here we plot the count of derived alleles per individual for nonsynonymous mutations classed as intolerant (fairy terns n = 1,800, KI n = 1,172) and tolerant (fairy terns n = 4,396, KI n = 5,426) that fall within BUSCO regions.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load per-population allele counts, then sum 'Allele Count' by 'Population', 'Mutation Class' and 'Consequence'
del_allele = pd.read_csv('load/pop_harmful_allele_frequency.tsv', delimiter='\t')
del_allele_count = del_allele.groupby(['Population', 'Mutation Class', 'Consequence'])['Allele Count'].sum().reset_index()

print(del_allele_count)

We then examined allele frequency of putatively harmful alleles in AFT, tara iti, low and high coverage kakī data sets.  

In [None]:
pop_harm = del_allele[del_allele['Consequence']=='Intolerant']
pop_harm = pop_harm[pop_harm['Allele Count']>0]

au_harm_counts = pop_harm[pop_harm['Population']=='AU'].groupby(["Allele Count", "Population"]).size().reset_index(name="Frequency")
ti_harm_counts = pop_harm[pop_harm['Population']=='TI'].groupby(["Allele Count", "Population"]).size().reset_index(name="Frequency")
ki_harm_counts = pop_harm[pop_harm['Population']=='KI'].groupby(["Allele Count", "Population"]).size().reset_index(name="Frequency")

au_harm_pivot = au_harm_counts.pivot(index="Allele Count", columns="Population", values="Frequency").fillna(0)
ti_harm_pivot = ti_harm_counts.pivot(index="Allele Count", columns="Population", values="Frequency").fillna(0)
ki_harm_pivot = ki_harm_counts.pivot(index="Allele Count", columns="Population", values="Frequency").fillna(0)

# Plot a stacked bar chart
au_harm_pivot.plot(kind="bar", stacked=True, figsize=(12, 6), color="gold")
ti_harm_pivot.plot(kind="bar", stacked=True, figsize=(12, 6), color="steelblue")
ki_harm_pivot.plot(kind="bar", stacked=True, figsize=(12, 6), color="black")
plt.title("Frequency of Allele Counts by Population")
plt.ylabel("Frequency")
plt.xlabel("Allele Count")
plt.legend(title="Population", bbox_to_anchor=(1.05, 1), loc="upper left")
plt.tight_layout()
plt.show()

#pop_harm.head()

In [None]:
del_allele[del_allele['Population']=='KI_10x'].head()

In [None]:
# .copy() makes each subset an independent DataFrame, so adding a column does not raise SettingWithCopyWarning
au_harm = del_allele[(del_allele['Consequence']=='Intolerant') & (del_allele['Population']=='AU')].copy()
au_harm['Allele Frequency'] = au_harm['Allele Count'] / 38
ti_harm = del_allele[(del_allele['Consequence']=='Intolerant') & (del_allele['Population']=='TI')].copy()
ti_harm['Allele Frequency'] = ti_harm['Allele Count'] / 30
kiLC_harm = del_allele[(del_allele['Consequence']=='Intolerant') & (del_allele['Population']=='KI_10x')].copy()
kiLC_harm['Allele Frequency'] = kiLC_harm['Allele Count'] / 48
ki_harm = del_allele[(del_allele['Consequence']=='Intolerant') & (del_allele['Population']=='KI')].copy()
ki_harm['Allele Frequency'] = ki_harm['Allele Count'] / 48

print('The mean allele frequency of putatively harmful alleles in AFT: ', au_harm['Allele Frequency'].mean())
print('The mean allele frequency of putatively harmful alleles in TI:  ', ti_harm['Allele Frequency'].mean())
print('The mean allele frequency of putatively harmful alleles in KI 10x: ', kiLC_harm['Allele Frequency'].mean())
print('The mean allele frequency of putatively harmful alleles in KI:  ', ki_harm['Allele Frequency'].mean())

In [None]:
au_harm.head()

In [None]:
total_harm = pd.concat([au_harm, ti_harm, kiLC_harm, ki_harm], axis=0, ignore_index=True)
sns.violinplot(data=total_harm, x='Population', y='Allele Frequency')
sns.stripplot(data=total_harm, x='Population', y='Allele Frequency', jitter=True, size=2)

We then estimated R<sub>xy</sub> between each of the three populations for putatively intolerant, tolerant, and all synonymous mutations. The [code](https://github.com/samarth8392/MQU_EvoGenomics/blob/main/RScripts/Rxy.R) below was adapted from [Mathur et al (2023)](http://doi.org/10.1093/evolut/qpac061).  
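
For reference, the statistic computed by the `estimate_rxy` function in this section can be written as below. The notation is our own shorthand read off the code, not a restatement of the cited paper: $f_i^{X}$ is the derived allele frequency at site $i$ in population $X$, $D$ is the set of sites in the focal category (intolerant or tolerant) and $S$ is the set of synonymous sites used for normalisation.

```latex
L_{XY} = \frac{\sum_{i \in D} f_i^{X}\,\bigl(1 - f_i^{Y}\bigr)}
              {\sum_{j \in S} f_j^{X}\,\bigl(1 - f_j^{Y}\bigr)},
\qquad
R_{XY} = \frac{L_{XY}}{L_{YX}}
```

Under this definition, $R_{XY} > 1$ suggests a relative excess of derived alleles of the focal category in population $X$ compared with $Y$, and $R_{XY} < 1$ a relative deficit.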

In [None]:
def estimate_rxy(df, pop1, pop2, consequence, chroms_df):
    # Separate deleterious and synonymous data based on consequence type
    del_df = df[(df['Consequence'] == consequence)]
    syn_df = df[df['Mutation Class'] == 'SYNONYMOUS']

    # Merge data for both populations
    del_merged = pd.merge(del_df[del_df['Population'] == pop1], del_df[del_df['Population'] == pop2],
                          on=['Chromosome', 'Position'], suffixes=('_pop1', '_pop2'))
    
    syn_merged = pd.merge(syn_df[syn_df['Population'] == pop1], syn_df[syn_df['Population'] == pop2],
                          on=['Chromosome', 'Position'], suffixes=('_pop1', '_pop2'))

    # Calculate l_xy and l_yx
    l_xy = np.sum(del_merged["Allele Frequency_pop1"] * (1 - del_merged["Allele Frequency_pop2"])) / np.sum(syn_merged["Allele Frequency_pop1"] * (1 - syn_merged["Allele Frequency_pop2"]))
    l_yx = np.sum(del_merged["Allele Frequency_pop2"] * (1 - del_merged["Allele Frequency_pop1"])) / np.sum(syn_merged["Allele Frequency_pop2"] * (1 - syn_merged["Allele Frequency_pop1"]))
    
    # Initialize results list
    Rxy_results = []

    # Loop over chromosomes
    for chr in chroms_df['Chromosome']:
        # Filter data excluding one chromosome at a time
        del_df_excluded = del_df[(del_df['Chromosome'] != chr)]
        syn_df_excluded = syn_df[(syn_df['Chromosome'] != chr)]

        # Merge excluded data for both populations
        del_excluded_merged = pd.merge(del_df_excluded[del_df_excluded['Population'] == pop1],
                                      del_df_excluded[del_df_excluded['Population'] == pop2],
                                      on=['Chromosome', 'Position'], suffixes=('_pop1', '_pop2'))
        
        syn_excluded_merged = pd.merge(syn_df_excluded[syn_df_excluded['Population'] == pop1],
                                      syn_df_excluded[syn_df_excluded['Population'] == pop2],
                                      on=['Chromosome', 'Position'], suffixes=('_pop1', '_pop2'))

        # Calculate l_xy_excluded and l_yx_excluded
        l_xy_excluded = np.sum(del_excluded_merged["Allele Frequency_pop1"] * (1 - del_excluded_merged["Allele Frequency_pop2"])) / np.sum(syn_excluded_merged["Allele Frequency_pop1"] * (1 - syn_excluded_merged["Allele Frequency_pop2"]))
        l_yx_excluded = np.sum(del_excluded_merged["Allele Frequency_pop2"] * (1 - del_excluded_merged["Allele Frequency_pop1"])) / np.sum(syn_excluded_merged["Allele Frequency_pop2"] * (1 - syn_excluded_merged["Allele Frequency_pop1"]))

        # Calculate rxy_excluded
        rxy_excluded = l_xy_excluded / l_yx_excluded

        # Append result to list
        Rxy_results.append({'Chromosome Excluded': chr, 'Rxy': rxy_excluded, 'Consequence': consequence})

    # Convert list of dictionaries to DataFrame
    Rxy_results_df = pd.DataFrame(Rxy_results)

    return Rxy_results_df
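
The leave-one-chromosome-out estimates returned by `estimate_rxy` could be summarised with a delete-one jackknife, sketched below. This is an assumption about how the estimates might be used, not a step described in this section. It weights every chromosome equally, whereas a weighted block jackknife may suit chromosomes of very different sizes better. The function name `jackknife_summary` and the example values are hypothetical.

```python
import math


def jackknife_summary(estimates):
    """Return the mean and delete-one jackknife standard error of leave-one-out estimates."""
    n = len(estimates)
    if n < 2:
        raise ValueError("need at least two leave-one-out estimates")
    mean = sum(estimates) / n
    # Jackknife variance: (n - 1) / n times the sum of squared deviations from the mean
    variance = (n - 1) / n * sum((e - mean) ** 2 for e in estimates)
    return mean, math.sqrt(variance)


# Toy leave-one-out Rxy values, for illustration only
mean_rxy, se_rxy = jackknife_summary([1.0, 1.2, 0.8])
print(mean_rxy, se_rxy)
```

In practice the input would be the `Rxy` column of the returned DataFrame converted to a list.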

In [None]:
def calculate_allele_frequency(row):
    if row['Population'] == 'TI':
        return row['Allele Count'] / 30
    elif row['Population'] == 'AU':
        return row['Allele Count'] / 38
    else:
        return None  # Handle other cases if needed

# Load and process data so we have Allele frequency column
del_pop_allele = pd.read_csv('load/pop_harmful_allele_frequency.tsv', delimiter='\t')
del_pop_allele = del_pop_allele[del_pop_allele['Population']!='KI']
del_pop_allele['Mutation Class'] = del_pop_allele['Mutation Class'].replace('START-LOST', 'NONSYNONYMOUS')
del_pop_allele = del_pop_allele[del_pop_allele['Allele Count']>0]

intergenic = del_pop_allele[(del_pop_allele['Mutation Class'] == 'INTERGENIC') | ( del_pop_allele['Mutation Class'] == 'SYNONYMOUS')]
intergenic = intergenic.drop_duplicates(subset=['Chromosome', 'Position', 'Population'])
intergenic = intergenic.groupby(['Chromosome', 'Position']).filter(lambda x: len(x) == 2)

# Apply the function to create a new column 'Allele Frequency'
del_pop_allele['Allele Frequency'] = del_pop_allele.apply(lambda row: calculate_allele_frequency(row), axis=1)
intergenic['Allele Frequency'] = intergenic.apply(lambda row: calculate_allele_frequency(row), axis=1)

chromNames = del_pop_allele['Chromosome'].sort_values().drop_duplicates()
chromNames = pd.DataFrame(chromNames, columns=['Chromosome'])

In [None]:
intergenic.head()

In [None]:
del_Rxy = estimate_rxy(del_pop_allele, 'TI', 'AU', 'Intolerant', chromNames)
tol_Rxy = estimate_rxy(del_pop_allele, 'TI', 'AU', 'Tolerant', chromNames)
neutral_Rxy = estimate_rxy(intergenic, 'TI', 'AU', 'Intergenic', chromNames)
Rxy = pd.concat([del_Rxy, tol_Rxy, neutral_Rxy], ignore_index=True)
Rxy_noNeutral = pd.concat([del_Rxy, tol_Rxy], ignore_index=True)

print("Mean Rxy: ", tol_Rxy['Rxy'].mean(), del_Rxy['Rxy'].mean(), neutral_Rxy['Rxy'].mean())

## Putatively harmful allele diversity

In [None]:
indiv_del = pd.read_csv('load/indiv_harmful_allele_frequency.tsv', delimiter='\t')

In [None]:
replace_map = {
    '0/0': 0,
    '0/1': 1,
    '1/1': 2
}

indiv_del['Genotype'] = indiv_del['Genotype'].replace(replace_map)
indiv_del['Pop_Consequence'] = indiv_del['Population'] + '-' + indiv_del['Consequence']

indiv_del.head()
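
One way the recoded genotypes could then be summarised per individual is sketched below. It assumes heterozygous sites (coded 1) approximate masked load and homozygous-derived sites (coded 2) approximate realised load. That is a common convention but is not specified in this section, and the function name, record layout and sample names are hypothetical.

```python
from collections import defaultdict


def load_per_individual(records):
    """Tally heterozygous and homozygous-derived sites per sample.

    records: iterable of (sample, genotype) pairs with genotype coded 0, 1 or 2.
    """
    tally = defaultdict(lambda: {"heterozygous": 0, "homozygous_derived": 0})
    for sample, genotype in records:
        entry = tally[sample]  # ensures samples with only 0 genotypes still appear
        if genotype == 1:
            entry["heterozygous"] += 1
        elif genotype == 2:
            entry["homozygous_derived"] += 1
    return dict(tally)


# Toy records, for illustration only
records = [("A", 1), ("A", 2), ("A", 0), ("B", 2), ("C", 0)]
print(load_per_individual(records))
```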