## Diversity Indices
### Runs of Homozygosity 
[ROHan](https://github.com/grenaud/rohan) was used to call runs of homozygosity (ROHs). Following the recommendations, we removed all PCR duplicates from the BAM alignment files, estimated a transition/transversion (Ts/Tv) ratio with [VCFtools](https://vcftools.github.io/index.html) v0.1.15, and defined an allowable level of heterozygosity within a 50kb window.  

In [None]:
# Update with how I estimate base-pair frequency and use that instead of TSTV ratio to estimate ROH.

The Ts/Tv ratio, used as a prior for ROHan, was estimated with VCFtools from the BCF outputs of ANGSD.

In [None]:
for POP in AU TI
    do
    if [[ "$POP" == "AU" ]]
        then
        angsd -P 24 -b ${ANGSD}${POP}.list -ref $TREF -out ${ANGSD}samtools/genotypes/${POP}_genotypes \
            -uniqueOnly 1 -remove_bads 1 -only_proper_pairs 1 -trim 0 -C 50 -baq 1 \
            -minMapQ 20 -minQ 20 -minInd 19 -setMinDepth 200 -setMaxDepth 420 -doCounts 1 \
            -doPost 1 -doBcf 1 -GL 1 -doMajorMinor 1 -doMaf 1 -skipTriallelic 1 -SNP_pval 1e-6 -doGeno 10 --ignore-RG 0
        vcftools --bcf ${ANGSD}samtools/genotypes/${POP}_genotypes.bcf \
            --TsTv-summary --out ${ANGSD}samtools/genotypes/${POP}_genotypes
        elif [[ "$POP" == "TI" ]]
        then
        angsd -P 24 -b ${ANGSD}${POP}.list -ref $TREF -out ${ANGSD}samtools/genotypes/${POP}_genotypes \
            -uniqueOnly 1 -remove_bads 1 -only_proper_pairs 1 -trim 0 -C 50 -baq 1 \
            -minMapQ 20 -minQ 20 -minInd 15 -setMinDepth 120 -setMaxDepth 280 -doCounts 1 \
            -doPost 1 -doBcf 1 -GL 1 -doMajorMinor 1 -doMaf 1 -skipTriallelic 1 -SNP_pval 1e-6 -doGeno 10 --ignore-RG 0
        vcftools --bcf ${ANGSD}samtools/genotypes/${POP}_genotypes.bcf \
            --TsTv-summary --out ${ANGSD}samtools/genotypes/${POP}_genotypes
        else
        angsd -P 24 -b ${ANGSD}${POP}.list -ref $KREF -out ${ANGSD}samtools/genotypes/${POP}_genotypes \
            -uniqueOnly 1 -remove_bads 1 -only_proper_pairs 1 -trim 0 -C 50 -baq 1 \
            -minMapQ 20 -minQ 20 -minInd 24 -setMinDepth 700 -setMaxDepth 1200 -doCounts 1 \
            -doPost 1 -doBcf 1 -GL 1 -doMajorMinor 1 -doMaf 1 -skipTriallelic 1 -SNP_pval 1e-3 -doGeno 10 --ignore-RG 0
        vcftools --bcf ${ANGSD}samtools/genotypes/${POP}_genotypes.bcf \
            --TsTv-summary --out ${ANGSD}samtools/genotypes/${POP}_genotypes
    fi
done

This resulted in a TsTv ratio of 1.871 for tara iti, 2.507 for Australian fairy tern, and 2.725 for kakī.  
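For context, the Ts/Tv ratio is the number of transitions (A<->G, C<->T) divided by the number of transversions (all other single-base substitutions). The sketch below illustrates that calculation on a handful of made-up reference/alternate allele pairs. It is not how VCFtools is implemented, and the `tstv_ratio` helper and the example SNPs are hypothetical.

```python
# Transitions swap purine<->purine (A/G) or pyrimidine<->pyrimidine (C/T).
TRANSITIONS = {frozenset("AG"), frozenset("CT")}

def tstv_ratio(snps):
    """Return Ts/Tv for an iterable of (ref, alt) single-base allele pairs."""
    snps = list(snps)  # allow generators, since we iterate twice
    ts = sum(1 for ref, alt in snps if frozenset((ref, alt)) in TRANSITIONS)
    tv = sum(1 for ref, alt in snps
             if ref != alt and frozenset((ref, alt)) not in TRANSITIONS)
    if tv == 0:
        raise ValueError("no transversions observed; Ts/Tv is undefined")
    return ts / tv

# Hypothetical SNPs, for illustration only (4 transitions, 2 transversions)
example_snps = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("T", "C"), ("G", "T")]
print(tstv_ratio(example_snps))
```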

Finally, ROHan was run with the `--rohmu` flag varied to ensure regions containing ROHs were detected, as per [this discussion](https://github.com/grenaud/ROHan/issues/12#issuecomment-1935539239). For the fairy terns, a value of 5 x 10<sup>-5</sup> with a window size of 50kb equates to 2.5 heterozygous genotypes within a window. For kakī, this threshold was left at the default setting of 1 x 10<sup>-5</sup>, which equates to 0.5 heterozygous sites within a 50kb window. This much lower ROH threshold for kakī is attributable to the much higher sequence depth of the kakī data (i.e., a target of 50x vs 10x sequence depth).  
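The arithmetic behind these thresholds is simply the `--rohmu` rate multiplied by the window size. A minimal sketch follows; the `expected_het_sites` helper is a hypothetical name used only for illustration.

```python
def expected_het_sites(rohmu, window_size):
    """Number of heterozygous sites tolerated in one window at a given rohmu."""
    return rohmu * window_size

window = 50_000  # matches --size 50000
settings = [("fairy tern (--rohmu 5e-5)", 5e-5), ("kakī (default 1e-5)", 1e-5)]
for label, rohmu in settings:
    print(f"{label}: {expected_het_sites(rohmu, window):.1f} heterozygous sites per window")
```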

In [None]:
./faSomeRecords reference/SP01_5kb_ragtag.fa SP01_autosomes2.bed reference/SP01_ragtag_autosomes2.fa
./faSomeRecords reference/himNova-hic-scaff.fa himNova_autsosomes.bed reference/himNova_autosomes.fa

samtools faidx reference/SP01_ragtag_autosomes2.fa
samtools faidx reference/himNova_autosomes.fa

# Variable names match those passed to rohan in the loop below
TREF=reference/SP01_ragtag_autosomes2.fa
KREF=reference/himNova_autosomes.fa

for BAM in *_rohan.bam
    do
    BASE=$(basename $BAM _rohan.bam)
    printf "STARTED RUNNING ROHAN FOR ${BASE} AT "
    date
    if [[ "$BASE" == "AU"* ]]
        then
        rohan -t 16 --tstv 2.507 --size 50000 --rohmu 5e-5 -o output/${BASE} $TREF $BAM
    elif [[ "$BASE" == "H0"* ]]
        then
        rohan -t 16 --tstv 2.725 --size 50000 -o output/${BASE} $KREF $BAM
    else
        rohan -t 16 --tstv 1.871 --size 50000 --rohmu 5e-5 -o output/${BASE} $TREF $BAM
    fi
    printf "FINISHED RUNNING ROHAN FOR ${BASE} AT "
    date
done

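We then collated ROHs for summary plots into a single file (`ROHs.tsv`) containing all ROH sizes for Australian fairy tern, tara iti and kakī, calculated using the mid. estimates of heterozygosity. Each ROH was labelled as short (<=200kb), medium (>200kb & <=700kb) or long (>700kb).  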

In [None]:
for FILE in fairy_tern/rohmu_5e5/AU*.mid.hmmrohl.gz
    do
    BASE=$(basename $FILE .mid.hmmrohl.gz)
    zcat $FILE | tail -n+2 | cut -f 2-5 | awk -v SAMP="$BASE" '{ if ( $4 <= 200000 ) len="Short ROH";
        else if ( $4 > 200000 && $4 <= 700000 ) len ="Medium ROH";
        else len ="Long ROH";
        print SAMP"\t"$0"\t"len"\tAU"; }' >> ROHs.tsv
done

for FILE in fairy_tern/rohmu_5e5/{SND,SP,TI}*_subset.mid.hmmrohl.gz
    do
    BASE=$(basename $FILE _subset.mid.hmmrohl.gz)
    zcat $FILE | tail -n+2 | cut -f 2-5 | awk -v SAMP="$BASE" '{ if ( $4 <= 200000 ) len="Short ROH";
        else if ( $4 > 200000 && $4 <= 700000 ) len ="Medium ROH";
        else len ="Long ROH";
        print SAMP"\t"$0"\t"len"\tNZ"; }' >> ROHs.tsv
done

for FILE in kaki/rohmu_default/H0*.mid.hmmrohl.gz
    do
    BASE=$(basename $FILE .mid.hmmrohl.gz)
    zcat $FILE | tail -n+2 | cut -f 2-5 | awk -v SAMP="$BASE" '{ if ( $4 <= 200000 ) len="Short ROH";
        else if ( $4 > 200000 && $4 <= 700000 ) len ="Medium ROH";
        else len ="Long ROH";
        print SAMP"\t"$0"\t"len"\tKI"; }' >> ROHs.tsv
done

We also built a per-individual summary by extracting the percentage of the genome in ROHs and the mean ROH length from each individual's ROHan summary file.  

In [None]:
printf "Sample\tPerc_in_ROH\tMean_ROH Length\tPopulation\n" > ROH_summary.txt

for FILE in rohan/rohmu_5e5/*.summary.txt
    do
    BASE=$(basename $FILE .summary.txt)
    PERC=$(tail -n4 $FILE | head -n1 | awk '{print $5}')
    SIZE=$(tail -n1 $FILE | awk '{print $6}')
    if [[ "$BASE" == "AU"* ]]
        then
        printf "${BASE}\t${PERC}\t${SIZE}\tAU\n" >> ROH_summary.txt
    elif [[ "$BASE" == "H0"* ]]
        then
        printf "${BASE}\t${PERC}\t${SIZE}\tKI\n" >> ROH_summary.txt
    else
        printf "${BASE}\t${PERC}\t${SIZE}\tNZ\n" >> ROH_summary.txt
    fi
done

## Visualising Inbreeding represented as *F<sub>ROH</sub>*
Here we leverage ROH position and size estimates from ROHan to represent individual inbreeding in the tara iti, Australian fairy tern (AFT) and kakī populations. 

In [None]:
import os
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib.legend_handler import HandlerTuple
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
from scipy.stats import linregress
from sklearn.metrics import r2_score
from scipy.stats import mannwhitneyu
from scipy import stats

path = '/nesi/nobackup/uc03718/'
os.chdir(path)
print(os.getcwd())

c:\Users\jwold\GitHub\2024_MolEcol_ConsGen_Special_Issue


We also examined the proportion of the genome impacted by ROHs for each individual of Australian fairy tern and tara iti. The higher count of ROHs in the tara iti population also translated to a higher proportion of the genome impacted overall. However, the vast majority of the genome in ROHs is attributable to those in the 'long' (>700kb in length) category. This indicates that the tara iti population has remained chronically small for several generations.  

In [52]:
roh = pd.read_csv('ROHs.tsv', sep='\t')
roh_segments = pd.read_csv('ROH_segment_summary.txt', sep='\t')

roh_summary = roh[['Sample', 'Size Class', 'Population']]
roh_summary = roh_summary.groupby(['Sample', 'Size Class', 'Population']).size().reset_index(name='ROH Counts')
roh_summary = roh_summary.sort_values('Size Class', ascending=False)
roh_summary['ROH_POP'] = roh_summary['Population'] + '-' + roh_summary['Size Class']

roh_segments['Total Segments Included'] = roh_segments['Segments in ROH'] + roh_segments['Segments in non-ROH']

roh_len = roh[['Sample', 'ROH Length', 'Size Class', 'Population']]
roh_len = roh_len.groupby(['Sample', 'Size Class', 'Population'])['ROH Length'].sum().reset_index(name='ROH Total')


roh_len = roh_len.merge(roh_segments[['Sample', 'Total Segments Included']], on='Sample', how='left')

roh_len['Proportion'] = roh_len['ROH Total'] / roh_len['Total Segments Included']

roh_len = roh_len.sort_values('Size Class', ascending=False)
roh_len['ROH_POP'] = roh_len['Population'] + '-' + roh_len['Size Class']

Here we examined the ROH size distribution. However, because medium and long ROHs occur at a much lower frequency than short ROHs, it is challenging to visualise the distribution. To help alleviate this, we log10-transformed ROH length prior to plotting.  
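As a small illustration of the transform, the sketch below applies `np.log10` to a few made-up ROH lengths and shows where the 200kb and 700kb size-class boundaries fall on the log scale. The lengths are hypothetical and not drawn from the data.

```python
import numpy as np

# Hypothetical ROH lengths in bp, for illustration only
roh_lengths = np.array([50_000, 150_000, 200_000, 700_000, 5_000_000])
log_lengths = np.log10(roh_lengths)

# Size-class boundaries on the log10 scale
short_max = np.log10(200_000)
medium_max = np.log10(700_000)
print(log_lengths.round(2), round(short_max, 2), round(medium_max, 2))
```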

In [None]:
roh_len[roh_len['Population']=='NZ'].head()

Unnamed: 0,Sample,Size Class,Population,ROH Total,Total Segments Included,Proportion,ROH_POP
209,TI96,Short ROH,NZ,54900000,1133100000,0.048451,NZ-Short ROH
176,TI85,Short ROH,NZ,61000000,1112150000,0.054849,NZ-Short ROH
98,TI40,Short ROH,NZ,57250000,1069550000,0.053527,NZ-Short ROH
95,TI38,Short ROH,NZ,68150000,1100200000,0.061943,NZ-Short ROH
92,TI37,Short ROH,NZ,53850000,1111600000,0.048444,NZ-Short ROH


## Population Diversity
### Runs of Homozygosity (ROHs)
We first examined the number of short (>50kb & <=200kb in length), medium (>200kb & <=700kb in length) and long (>700kb in length) ROHs per individual. This is to provide some indication of long-, medium- and short-term inbreeding levels respectively for each of the populations. Here, we can see that tara iti consistently have more ROHs across all three size classes.  
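For reference, the same size classes can be assigned in pandas with `pd.cut`. This is a sketch on made-up lengths that assumes the boundaries used when building `ROHs.tsv` (<=200kb short, <=700kb medium, otherwise long); it is not part of the original pipeline.

```python
import numpy as np
import pandas as pd

# Hypothetical ROH lengths in bp, for illustration only
demo = pd.DataFrame({"ROH Length": [50_000, 200_000, 200_001, 700_000, 700_001]})

# Right-closed bins reproduce the <=200kb and <=700kb boundaries
demo["Size Class"] = pd.cut(
    demo["ROH Length"],
    bins=[0, 200_000, 700_000, np.inf],
    labels=["Short ROH", "Medium ROH", "Long ROH"],
    right=True,
)
print(demo)
```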

In [60]:
def mann_whitney_test(df, group_col, value_col, group1, group2):
    group1_values = df[df[group_col] == group1][value_col]
    group2_values = df[df[group_col] == group2][value_col]
    stat, p_value = mannwhitneyu(group1_values, group2_values, alternative='two-sided')
    return stat, p_value

# Perform the test for indiv_total
groups = ['AU-Short ROH', 'AU-Medium ROH', 'AU-Long ROH', 'NZ-Short ROH', 'NZ-Medium ROH', 'NZ-Long ROH', 'KI_10x-Short ROH', 'KI_10x-Medium ROH', 'KI_10x-Long ROH', 'KI-Short ROH', 'KI-Medium ROH', 'KI-Long ROH']
results_total = {}

for i in range(len(groups)):
    for j in range(i + 1, len(groups)):
        group1 = groups[i]
        group2 = groups[j]
        stat, p_value = mann_whitney_test(roh_summary, 'ROH_POP', 'ROH Counts', group1, group2)
        results_total[f'{group1} vs {group2}'] = {'stat': stat, 'p_value': p_value}

# Print the results
print("Mann-Whitney U Test results for ROH counts:")
for comparison, result in results_total.items():
    print(f"{comparison}: U-statistic = {result['stat']}, p-value = {result['p_value']}")


Mann-Whitney U Test results for ROH counts:
AU-Short ROH vs AU-Medium ROH: U-statistic = 361.0, p-value = 1.4728853070303523e-07
AU-Short ROH vs AU-Long ROH: U-statistic = 361.0, p-value = 1.4207436261002553e-07
AU-Short ROH vs NZ-Short ROH: U-statistic = 24.0, p-value = 1.236627308196758e-09
AU-Short ROH vs NZ-Medium ROH: U-statistic = 608.5, p-value = 0.10283822116259339
AU-Short ROH vs NZ-Long ROH: U-statistic = 969.0, p-value = 1.6140058387966898e-10
AU-Short ROH vs KI_10x-Short ROH: U-statistic = nan, p-value = nan
AU-Short ROH vs KI_10x-Medium ROH: U-statistic = nan, p-value = nan
AU-Short ROH vs KI_10x-Long ROH: U-statistic = nan, p-value = nan
AU-Short ROH vs KI-Short ROH: U-statistic = nan, p-value = nan
AU-Short ROH vs KI-Medium ROH: U-statistic = nan, p-value = nan
AU-Short ROH vs KI-Long ROH: U-statistic = nan, p-value = nan
AU-Medium ROH vs AU-Long ROH: U-statistic = 361.0, p-value = 1.418513185426635e-07
AU-Medium ROH vs NZ-Short ROH: U-statistic = 0.0, p-value = 1.632426
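The `nan` results above arise when a group named in `groups` has no rows in the table being tested. One possible way to avoid them, sketched here on made-up counts, is to restrict the comparisons to groups that are actually present. The `pairwise_mannwhitney` helper is hypothetical and only illustrates the idea.

```python
from itertools import combinations

import pandas as pd
from scipy.stats import mannwhitneyu

def pairwise_mannwhitney(df, group_col, value_col, groups):
    """Two-sided Mann-Whitney U tests for every pair of non-empty groups."""
    present = [g for g in groups if (df[group_col] == g).any()]
    results = {}
    for g1, g2 in combinations(present, 2):
        x = df.loc[df[group_col] == g1, value_col]
        y = df.loc[df[group_col] == g2, value_col]
        stat, p_value = mannwhitneyu(x, y, alternative="two-sided")
        results[f"{g1} vs {g2}"] = {"stat": stat, "p_value": p_value}
    return results

# Hypothetical counts, for illustration only
demo = pd.DataFrame({
    "ROH_POP": ["AU-Short ROH"] * 4 + ["NZ-Short ROH"] * 4,
    "ROH Counts": [10, 12, 11, 13, 30, 32, 31, 35],
})
res = pairwise_mannwhitney(
    demo, "ROH_POP", "ROH Counts",
    ["AU-Short ROH", "NZ-Short ROH", "KI-Short ROH"],  # the KI group is absent
)
print(res)
```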



In [62]:
roh_segments.head()

Unnamed: 0,Sample,Segments Unclassified,Segments in ROH,Segments in non-ROH,Population
0,AU01,83900000,99700000,1046000000,AU
1,AU03,83900000,119650000,1026050000,AU
2,AU04,79200000,77050000,1073350000,AU
3,AU06,86700000,83000000,1059900000,AU
4,AU08,72350000,82850000,1074400000,AU
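To make the *F<sub>ROH</sub>* denominator explicit, the sketch below uses the first two rows of the segment table above together with made-up ROH totals. Unclassified segments are excluded, so the denominator is the sum of segments in ROH and in non-ROH. The `roh_totals` values are hypothetical.

```python
import pandas as pd

# First two rows of the segment summary shown above
segments = pd.DataFrame({
    "Sample": ["AU01", "AU03"],
    "Segments in ROH": [99_700_000, 119_650_000],
    "Segments in non-ROH": [1_046_000_000, 1_026_050_000],
})
# Hypothetical summed ROH lengths per sample (bp), for illustration only
roh_totals = pd.DataFrame({"Sample": ["AU01", "AU03"],
                           "ROH Total": [90_000_000, 110_000_000]})

segments["Total Segments Included"] = (
    segments["Segments in ROH"] + segments["Segments in non-ROH"]
)
froh = roh_totals.merge(
    segments[["Sample", "Total Segments Included"]], on="Sample", how="left"
)
froh["Froh"] = froh["ROH Total"] / froh["Total Segments Included"]
print(froh)
```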


In [64]:
roh_proportion = roh.groupby(['Sample', 'Population'])['ROH Length'].sum().reset_index(name='ROH Total')
indiv_roh_counts = roh.groupby(['Sample', 'Population'])['ROH Length'].count().reset_index(name='Count').sort_values('Count')
roh_segments = pd.read_csv('ROH_segment_summary.txt', sep='\t')
roh_segments['Total Segments Included'] = roh_segments['Segments in ROH'] + roh_segments['Segments in non-ROH']

roh_proportion = roh_proportion.merge(roh_segments[['Sample', 'Total Segments Included']], on='Sample', how='left')
roh_proportion['Froh'] = roh_proportion['ROH Total'] / roh_proportion['Total Segments Included']

roh_proportion.sort_values('Froh')

groups = ['AU', 'NZ', 'KI_10x', 'KI']

for i in range(len(groups)):
    standard_dev = roh_proportion[roh_proportion['Population']==groups[i]]['Froh'].std()
    count = roh_proportion[roh_proportion['Population']==groups[i]]['Froh'].count()
    standard_err = standard_dev / np.sqrt(count)
    print(f"Mean individual FROH for {groups[i]}: {roh_proportion[roh_proportion['Population']==groups[i]]['Froh'].mean()} +/- {standard_err}")


Mean individual FROH for AU: 0.07738820920883634 +/- 0.0025104285507874502
Mean individual FROH for NZ: 0.6610220793959886 +/- 0.006279098298593244
Mean individual FROH for KI_10x: nan +/- nan
Mean individual FROH for KI: nan +/- nan


In [65]:
groups = ['AU-Short ROH', 'AU-Medium ROH', 'AU-Long ROH', 'NZ-Short ROH', 'NZ-Medium ROH', 'NZ-Long ROH', 'KI_10x-Short ROH', 'KI_10x-Medium ROH', 'KI_10x-Long ROH', 'KI-Short ROH', 'KI-Medium ROH', 'KI-Long ROH']
results_total = {}

for i in range(len(groups)):
    for j in range(i + 1, len(groups)):
        group1 = groups[i]
        group2 = groups[j]
        stat, p_value = mann_whitney_test(roh_len, 'ROH_POP', 'Proportion', group1, group2)
        results_total[f'{group1} vs {group2}'] = {'stat': stat, 'p_value': p_value}

# Print the results
print("Mann-Whitney U Test results for FROH:")
for comparison, result in results_total.items():
    print(f"{comparison}: U-statistic = {result['stat']}, p-value = {result['p_value']}")

Mann-Whitney U Test results for FROH:
AU-Short ROH vs AU-Medium ROH: U-statistic = 322.0, p-value = 3.8473714744244065e-05
AU-Short ROH vs AU-Long ROH: U-statistic = 341.0, p-value = 2.995041856121807e-06
AU-Short ROH vs NZ-Short ROH: U-statistic = 4.0, p-value = 2.3082251058670538e-10
AU-Short ROH vs NZ-Medium ROH: U-statistic = 0.0, p-value = 1.6360138495636074e-10
AU-Short ROH vs NZ-Long ROH: U-statistic = 0.0, p-value = 1.6360138495636074e-10
AU-Short ROH vs KI_10x-Short ROH: U-statistic = nan, p-value = nan
AU-Short ROH vs KI_10x-Medium ROH: U-statistic = nan, p-value = nan
AU-Short ROH vs KI_10x-Long ROH: U-statistic = nan, p-value = nan
AU-Short ROH vs KI-Short ROH: U-statistic = nan, p-value = nan
AU-Short ROH vs KI-Medium ROH: U-statistic = nan, p-value = nan
AU-Short ROH vs KI-Long ROH: U-statistic = nan, p-value = nan
AU-Medium ROH vs AU-Long ROH: U-statistic = 327.0, p-value = 2.0221305879843327e-05
AU-Medium ROH vs NZ-Short ROH: U-statistic = 0.0, p-value = 1.6360138495636



In [66]:
print(roh_proportion[roh_proportion['Population']=='NZ'].sort_values(by='Froh', ascending=True))

   Sample Population  ROH Total  Total Segments Included      Froh
27   TI21         NZ  513850000               1024550000  0.501537
21  SND11         NZ  589450000               1047750000  0.562586
20  SND06         NZ  610650000               1059850000  0.576166
19  SND04         NZ  617750000               1059100000  0.583278
43   TI65         NZ  641150000               1071550000  0.598339
32   TI40         NZ  648400000               1069550000  0.606236
28   TI35         NZ  657300000               1081850000  0.607570
55   TI82         NZ  678450000               1086850000  0.624235
56   TI83         NZ  671350000               1074750000  0.624657
23   SP03         NZ  674900000               1078350000  0.625864
57   TI84         NZ  683850000               1076950000  0.634988
53   TI78         NZ  692950000               1089200000  0.636201
48   TI70         NZ  697850000               1093300000  0.638297
40   TI62         NZ  698050000               1089250000  0.64

The individuals with the smallest, median, and largest *F<sub>ROH</sub>* in each group were:  
- Australian fairy tern
    - Smallest *F<sub>ROH</sub>*: AU13 (3.036%)
    - Median *F<sub>ROH</sub>*: AU08 (3.890%)
    - Largest *F<sub>ROH</sub>*: AU03 (6.801%)
- Tara iti
    - Smallest *F<sub>ROH</sub>*: TI21 (27.149%)
    - Median *F<sub>ROH</sub>*: SP07 (42.051%)
    - Largest *F<sub>ROH</sub>*: SND05 (59.947%)
- Kakī low-coverage
    - Smallest *F<sub>ROH</sub>*: H01402 (24.625%)
    - Median *F<sub>ROH</sub>*: H01406 (36.51%)
    - Largest *F<sub>ROH</sub>*: H01387 (49.57%)
- Kakī high-coverage
    - Smallest *F<sub>ROH</sub>*: H01396 (17.816%)
    - Median *F<sub>ROH</sub>*: H01406 (31.092%)
    - Largest *F<sub>ROH</sub>*: H01390 (38.175%)

In [None]:
roh_plot = pd.read_csv('ROHs.tsv', sep='\t')
roh_plot['IDs'] = roh_plot['Sample'] + '-' + roh_plot['Population']

chrom = 'Chromosome 1'
individuals = [
    'AU13-AU', 'AU08-AU', 'AU03-AU',
    'TI21-TI', 'TI40-TI', 'SND05-TI',
    'H01401-KI_10x', 'H01386-KI_10x', 'H01394-KI_10x',
    'H01396-KI', 'H01406-KI', 'H01390-KI'
    ]
populations = [
    'AFT1', 'AFT2', 'AFT3',
    'TI1', 'TI2', 'TI3',
    'KĪ low1', 'KĪ low2', 'KĪ low3',
    'KĪ1', 'KĪ2', 'KĪ3'
    ]

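# Match on the combined 'IDs' column, since `individuals` holds Sample-Population labels
subset_roh = roh_plot[roh_plot['IDs'].isin(individuals)].copy()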
subset_roh = subset_roh[(subset_roh['Chromosome'] == 'CM020437.1_RagTag') | (subset_roh['Chromosome'] == 'scaffold_1')]
subset_roh['Chromosome'] = chrom
subset_roh = subset_roh.sort_values(by=['Sample', 'BEGIN'])

plot_rohs(subset_roh, chrom, individuals, populations, chr1_size, color_mapping=color_mapping, figsize=(20, 10))
plt.savefig('plots/ROH_positions.png', dpi=300, bbox_inches='tight')

We then examined the density of runs of homozygosity (ROHs) by size class. Here, we plot ROH presence at each location along chromosome 1 for selected individuals as a means to visualise relative *F<sub>ROH</sub>* among individuals.  

To start, we load and subset the original `ROH_density.tsv` file by population, ROH size class, and position (`BEGIN` and `END`).  

We then define a plot function that marks the presence/absence of ROHs by their `BEGIN` and `END` position along chromosome 1. The color `seagreen` denotes ROHs smaller than or equal to 200kb in length, `wheat` indicates ROHs larger than 200kb and smaller than or equal to 700kb, and `coral` represents ROHs larger than 700kb.  

In [None]:
# Replace with code from kakakpo

def plot_rohs(df, chromosome, individuals, populations, continuous_range, alpha_multiplier=1, color_mapping=None, figsize=(20, 15)):
    """
    Plot ROH presence along the chromosome for specific individuals.

    Parameters:
        df (DataFrame): Input DataFrame containing 'Chromosome', 'BEGIN', 'END', 'Size Class', 'Sample', and 'Population' columns.
        chromosome (str): Chromosome to plot.
        individuals (list): List of individuals to include in the plot.
        populations (list): List of populations corresponding to the individuals.
        continuous_range (range or list): Continuous range of positions to consider.
        alpha_multiplier (float): Multiplier to adjust alpha based on the number of overlapping ranges.
        color_mapping (dict): Mapping of ROH classes to colors.
        figsize (tuple, optional): Size of the figure in inches (width, height).

    Returns:
        None
    """
    num_individuals = len(individuals)
    fig, axs = plt.subplots(num_individuals, 1, figsize=figsize, sharex=True, sharey=True)

    for i, (individual, population) in enumerate(zip(individuals, populations)):
        ax = axs[i]
        # Filter the DataFrame for the specified chromosome and individual;
        # copy so that adding the 'Color' column does not modify a view of df
        chrom_ind_df = df[(df['Chromosome'] == chromosome) & (df['IDs'] == individual)].copy()

        # Get color for ROH class
        if color_mapping:
            chrom_ind_df['Color'] = chrom_ind_df['Size Class'].map(color_mapping)
        else:
            chrom_ind_df['Color'] = 'steelblue'  # Default color

        # Plot ROH presence as a full-height span from BEGIN to END
        for _, row in chrom_ind_df.iterrows():
            start, end = row['BEGIN'], row['END']
            ax.axvspan(start, end, facecolor=row['Color'], alpha=alpha_multiplier, edgecolor='none')
        
        # Set y-axis label to population
        ax.set_ylabel(population, labelpad=5, fontsize=18)

        # Remove x- and y-axis ticks
        ax.set_xticks([])
        ax.set_yticks([])

        # Set x-axis limits
        ax.set_xlim(min(continuous_range), max(continuous_range))
        ax.set_ylim(0, 1)

    axs[-1].set_xlabel('Position on Chromosome 1', fontsize=25)
    axs[0].set_title('E)', loc='left', fontsize=35)
    plt.tight_layout()

# Define Color mapping
color_mapping = {
    'Short ROH': 'seagreen',
    'Medium ROH': 'wheat',
    'Long ROH': 'coral'
}
# Define chromosome size
chr1_size = range(0, 219000000, 1000)

### Individual Heterozygosity (H<sub>o</sub>)
#### SNPs
Here, we implemented the global (genome-wide) heterozygosity method from ANGSD. Essentially, this estimate is the number of heterozygous genotypes divided by genome size (excluding regions of the genome with low confidence). Unlike other runs of ANGSD, individual BAMs are used to estimate heterozygosity, which is simply the second value in the SFS/AFS divided by the total number of sites.
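As a sketch of that calculation, a single-sample SFS has three categories, and heterozygosity is the second category divided by the sum of all three. The `heterozygosity_from_sfs` helper and the example values are hypothetical.

```python
def heterozygosity_from_sfs(sfs_line):
    """Heterozygosity from a single-sample SFS line with three categories."""
    hom_ref, het, hom_alt = (float(x) for x in sfs_line.split())
    return het / (hom_ref + het + hom_alt)

# Hypothetical SFS values, for illustration only
example = "999000.0 1000.0 0.0"
print(heterozygosity_from_sfs(example))
```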

In [None]:
SAMP=$(sed -n "$((SLURM_ARRAY_TASK_ID + 1))p" /nesi/nobackup/uc03718/angsd/GLOBAL.list)
NAME=$(basename $SAMP _autosomes_nodup.bam)

printf "\nSTARTED RUNNING ANGSD TO ESTIMATE HETEROZYGOSITY FOR $NAME AT "
date

angsd -P 32 -i ${SAMP} -ref $TREF -out ${DIR}samtools/heterozygosity/${NAME} \
        -uniqueOnly 1 -remove_bads 1 -only_proper_pairs 1 -trim 0 -C 50 -angbaq 1 \
        -minMapQ 20 -minQ 20 -doCounts 1 -dosaf 1 -GL 1

realSFS ${DIR}samtools/heterozygosity/${NAME}.saf.ids > ${DIR}samtools/heterozygosity/${NAME}_est.ml

Once the SFS was estimated for each individual, the number of sites was estimated from the sum of all scaffold sizes included in the bam file and output to a file.  

In [None]:
printf "Sample\tHeterozygosity\tTool\tPopulation\n" > ${ANGSD}individual_het.tsv 

for SAMP in ${ANGSD}${TOOL}/heterozygosity/*_est.ml
    do
    BASE=$(basename $SAMP _est.ml)
    TOT=$(awk '{print $1 + $2 + $3}' $SAMP)
    HET=$(awk -v var=$TOT '{print $2/var}' $SAMP)
    if [[ "$BASE" == "AU"* ]]
        then
        printf "$BASE\t$HET\t$TOOL\tAU\n" >> ${ANGSD}individual_het.tsv
    elif [[ "$BASE" == "H0"* ]]
        then
        printf "$BASE\t$HET\t$TOOL\tKI\n" >> ${ANGSD}individual_het.tsv
    else
        printf "$BASE\t$HET\t$TOOL\tNZ\n" >> ${ANGSD}individual_het.tsv
    fi
done

#### SVs
Genotypes for each individual were estimated and then plotted in 06_visualisations.ipynb.  

In [None]:
while read -r line
    do
    REF=$(bcftools query -s ${line} -i 'GT!="mis" & GQ>=25 & SVMODEL=="AGGREGATED" & FORMAT/FT=="PASS" & GT=="RR"' -f '[%GT\n]' graphtyper/02_fairy_genotypes_filtered.vcf.gz | wc -l)
    NHET=$(bcftools query -s ${line} -i 'GT!="mis" & GQ>=25 & SVMODEL=="AGGREGATED" & FORMAT/FT=="PASS" & GT=="het"' -f '[%GT\n]' graphtyper/02_fairy_genotypes_filtered.vcf.gz | wc -l)
    ALT=$(bcftools query -s ${line} -i 'GT!="mis" & GQ>=25 & SVMODEL=="AGGREGATED" & FORMAT/FT=="PASS" & GT=="AA"' -f '[%GT\n]' graphtyper/02_fairy_genotypes_filtered.vcf.gz | wc -l)
    # Assumption: the denominator is the number of genotyped sites passing filters for this individual
    TOT=$((REF + NHET + ALT))
    HET=$(awk -v het=$NHET -v tot=$TOT 'BEGIN { if (tot > 0) print het/tot; else print "NA" }')
    if [[ "$line" == "AU"* ]]
    then
        printf "$line\t$REF\t$HET\t$ALT\tAU\n"
        printf "$line\t$REF\t$HET\t$ALT\tAU\n" >> graphtyper/individual_svHet.tsv
    else
        printf "$line\t$REF\t$HET\t$ALT\tTI\n"
        printf "$line\t$REF\t$HET\t$ALT\tTI\n" >> graphtyper/individual_svHet.tsv
    fi
done < samples.txt

### SNP- and SV-based *H<sub>O</sub>* Estimates
The next diversity metric we examined was individual observed heterozygosity. SNP heterozygosity was estimated using `ANGSD` and `realSFS` for all sites and for putatively neutral sites, where we excluded regions predicted to be coding by `AUGUSTUS`.  

We estimated structural variant heterozygosity as:
$$
H_{O} = \frac{\sum \text{Heterozygous sites}}{\sum \text{Genotyped sites}}
$$
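A minimal sketch of this ratio for one individual is shown below, assuming genotypes are available as VCF-style strings and that missing calls are excluded from the denominator. The `sv_heterozygosity` helper and the example calls are hypothetical.

```python
from collections import Counter

def sv_heterozygosity(genotypes):
    """Proportion of heterozygous calls among non-missing genotypes."""
    counts = Counter()
    for gt in genotypes:
        alleles = gt.replace("|", "/").split("/")
        if "." in alleles:
            continue  # missing genotypes are not counted as genotyped sites
        counts["het" if len(set(alleles)) > 1 else "hom"] += 1
    total = counts["het"] + counts["hom"]
    return counts["het"] / total if total else float("nan")

# Hypothetical genotype calls for one individual, for illustration only
example_gts = ["0/0", "0/1", "1/1", "./.", "0/1", "0/0"]
print(sv_heterozygosity(example_gts))
```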

To examine the impact of read depth on SNPs and SVs, we leveraged the heterozygosity estimated using the Australian fairy tern, tara iti, and subsampled kakī alignments, as well as the high coverage kakī alignments. For estimates of SV heterozygosity, we filtered out all invariant sites in the Australian fairy tern and tara iti populations. This included fixed SVs in Australian fairy terns.   

In [None]:
sv_het = pd.read_csv('graphtyper/individual_svHet.tsv', delimiter='\t')
sv_het = sv_het[
                (sv_het['Population']!='50xSVs_10xGTs') & 
                (sv_het['Population']!='50xSVs_50xGTs') & 
                (sv_het['Population']!='AU_PlotCritic') &
                (sv_het['Population']!='AU_sites') & 
                (sv_het['Population']!='AU') & 
                (sv_het['Population']!='TI_PlotCritic') & 
                (sv_het['Population']!='TI')
                ]

print(sv_het[sv_het['Population']=='KI'])

### Kakī SV Mendelian Inheritance
Here we examine the number of sites adhering to Mendelian inheritance expectations in X trios for three data sets filtered for genotyping quality:
1) SVs discovered and genotyped using high coverage data
2) SVs discovered using high coverage data, and genotyped using subset data
3) SVs discovered and genotyped using subset data

Here, high coverage data ranges from ~22x to ~67x alignment depth, while subset data represents aligned depths ranging from ~6x to ~13x.  
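For readers unfamiliar with the check, a site adheres to Mendelian expectations when the offspring genotype can be formed from one allele of each parent. The sketch below illustrates this on made-up trio genotypes; the `is_mendelian` and `percent_failing` helpers are hypothetical and do not reproduce the tool used to generate `KI_mendelian_counts.tsv`.

```python
from itertools import product

def _alleles(gt):
    """Split a VCF-style genotype string into its alleles."""
    return gt.replace("|", "/").split("/")

def is_mendelian(child, sire, dam):
    """True if the child genotype can be formed from one allele per parent.

    Returns None when any genotype in the trio is missing.
    """
    c, s, d = _alleles(child), _alleles(sire), _alleles(dam)
    if "." in c + s + d:
        return None
    return any(sorted(pair) == sorted(c) for pair in product(s, d))

def percent_failing(trios):
    """Percent of evaluable sites that violate Mendelian expectations."""
    calls = [is_mendelian(*trio) for trio in trios]
    evaluable = [call for call in calls if call is not None]
    if not evaluable:
        return float("nan")
    return 100 * sum(1 for call in evaluable if not call) / len(evaluable)

# Hypothetical (child, sire, dam) genotypes, for illustration only
example_trios = [
    ("0/1", "0/0", "1/1"),  # consistent
    ("1/1", "0/0", "0/1"),  # inconsistent: sire cannot pass on allele 1
    ("0/0", "0/1", "0/1"),  # consistent
    ("0/1", "./.", "0/1"),  # not evaluable
]
print(percent_failing(example_trios))
```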

In [None]:
kiSVmendel = pd.read_csv('graphtyper/graphtyper_kaki_subset/KI_mendelian_counts.tsv', sep='\t')

kiSVmendel.head()

In [None]:
kiSVmendel.groupby(['Data set'])['Percent Failing'].mean().reset_index()

Here, we plot the number of SVs in a homozygous reference, heterozygous, or homozygous alternate state for each individual in the Australian fairy tern (A), tara iti (B), subsampled kakī (C), and high coverage kakī (D) data sets.  

In [None]:
sv_het = pd.read_csv('graphtyper/individual_svHet.tsv', delimiter='\t')
kiSVmendel = kiSVmendel[kiSVmendel['Data set']!='50xSVs_10xGenotypes']

order = ['KI_10x', '50xSVs_50xGenotypes']

# Grayscale colors for the three categories
colors = ['#d9d9d9', '#969696', '#525252']  # Light, medium, and dark gray

# Create subplots
fig, axes = plt.subplots(3, 2, figsize=(20, 20), sharex=False, sharey=False)  # 3 rows, 2 columns, independent axes

au_svHet = sv_het[sv_het['Population'] == 'AU_noFixed']  # Filter for the current population
ti_svHet = sv_het[sv_het['Population'] == 'TI_sites']
kiLC_svHet = sv_het[sv_het['Population'] == 'KI_10x']
kiHC_svHet = sv_het[sv_het['Population'] == 'KI']
aux = np.arange(len(au_svHet))
tix = np.arange(len(ti_svHet))
kix = np.arange(len(kiLC_svHet))

# Plot stacked bars
axes[0,0].bar(aux, au_svHet['Homozygous Ref'], label='Homozygous Ref', color=colors[0])
axes[0,0].bar(aux, au_svHet['Heterozygous'], bottom=au_svHet['Homozygous Ref'], label='Heterozygous', color=colors[1])
axes[0,0].bar(aux, au_svHet['Homozygous Alt'], bottom=au_svHet['Homozygous Ref'] + au_svHet['Heterozygous'], label='Homozygous Alt', color=colors[2])
axes[0,0].set_xticks([])
axes[0,0].set_title('A)', fontsize = 20, loc='left')
axes[0,0].set_ylim(0,1800)

axes[0,1].bar(tix, ti_svHet['Homozygous Ref'], label='Homozygous Ref', color=colors[0])
axes[0,1].bar(tix, ti_svHet['Heterozygous'], bottom=ti_svHet['Homozygous Ref'], label='Heterozygous', color=colors[1])
axes[0,1].bar(tix, ti_svHet['Homozygous Alt'], bottom=ti_svHet['Homozygous Ref'] + ti_svHet['Heterozygous'], label='Homozygous Alt', color=colors[2])
axes[0,1].set_xticks([])
axes[0,1].set_title('B)', fontsize = 20, loc='left')
axes[0,1].set_ylim(0,1800)

axes[1,0].bar(kix, kiLC_svHet['Homozygous Ref'], label='Homozygous Ref', color=colors[0])
axes[1,0].bar(kix, kiLC_svHet['Heterozygous'], bottom=kiLC_svHet['Homozygous Ref'], label='Heterozygous', color=colors[1])
axes[1,0].bar(kix, kiLC_svHet['Homozygous Alt'], bottom=kiLC_svHet['Homozygous Ref'] + kiLC_svHet['Heterozygous'], label='Homozygous Alt', color=colors[2])
axes[1,0].set_xticks([])
axes[1,0].set_title('C)', fontsize = 20, loc='left')
axes[1,0].set_ylim(0,1800)

axes[1,1].bar(kix, kiHC_svHet['Homozygous Ref'], label='Homozygous Ref', color=colors[0])
axes[1,1].bar(kix, kiHC_svHet['Heterozygous'], bottom=kiHC_svHet['Homozygous Ref'], label='Heterozygous', color=colors[1])
axes[1,1].bar(kix, kiHC_svHet['Homozygous Alt'], bottom=kiHC_svHet['Homozygous Ref'] + kiHC_svHet['Heterozygous'], label='Homozygous Alt', color=colors[2])
axes[1,1].set_xticks([])
axes[1,1].set_title('D)', fontsize = 20, loc='left')
axes[1,1].set_ylim(0,1800)

plt.savefig('plots/Figure_2a.pdf', dpi=300, bbox_inches='tight')

In [None]:
kiSVmendel = kiSVmendel[kiSVmendel['Data set']!='50xSVs_10xGenotypes']

order = ['10xSVs_10xGenotypes', '50xSVs_50xGenotypes']

plt.figure(figsize=(10, 5))

sns.violinplot(data=kiSVmendel, x='Data set', y='Percent Failing', color='0.8', order=order)
sns.stripplot(data=kiSVmendel, x='Data set', y='Percent Failing', jitter=True, color='black', order=order)
# Categories are drawn at positions 0 and 1 in the order given by `order`
plt.xticks([0, 1], ['KĪ low', 'KĪ high'], fontsize=14)
plt.yticks(fontsize=14)
plt.ylim(0, 18)
plt.ylabel('Percent Failing Sites', fontsize = 18)
plt.xlabel('')
plt.title('E)', fontsize = 20, loc='left')

plt.savefig('plots/Figure_2b.pdf', dpi=300, bbox_inches='tight')

Here, we are briefly examining the average proportion of SVs in either a homozygous reference or alternate state in Australian fairy tern, tara iti, and subsampled kakī data sets.  

In [None]:
sv_het = pd.read_csv('graphtyper/individual_svHet.tsv', delimiter='\t')

print("Mean proportion of homozygous alternate sites in AFT: ", sv_het[sv_het['Population']=='AU']['Proportion Alt'].mean())
print("Mean proportion of homozygous alternate sites in TI: ", sv_het[sv_het['Population']=='TI']['Proportion Alt'].mean())
print("Mean proportion of homozygous alternate sites in KI low: ", sv_het[sv_het['Population']=='KI_10x']['Proportion Alt'].mean())
print("Mean proportion of homozygous alternate sites in KI high: ", sv_het[sv_het['Population']=='KI']['Proportion Alt'].mean())

print("Mean proportion of homozygous reference sites in AFT: ", sv_het[sv_het['Population']=='AU']['Proportion Ref'].mean())
print("Mean proportion of homozygous reference sites in TI: ", sv_het[sv_het['Population']=='TI']['Proportion Ref'].mean())
print("Mean proportion of homozygous reference sites in KI low: ", sv_het[sv_het['Population']=='KI_10x']['Proportion Ref'].mean())
print("Mean proportion of homozygous reference sites in KI high: ", sv_het[sv_het['Population']=='KI']['Proportion Ref'].mean())

In [None]:
indiv_het = pd.read_csv('angsd/individual_het.tsv', delimiter='\t')
indiv_het = indiv_het.sort_values(by=['Sample', 'Population'], ascending=[True, True])
neutral_het = indiv_het[indiv_het['Tool'] == 'neutral']

neutral_het.head()

For relative comparisons of diversity, we examined individual heterozygosity for SNPs (A) in the Australian fairy tern, tara iti and high coverage kakī data sets, and for SVs (B) jointly called and genotyped in fairy terns (i.e., including sites fixed in either population), as well as individual ROH counts (C) and ROH proportions (D) for all three groups.  

In [None]:
indiv_het = pd.read_csv('angsd/individual_het.tsv', delimiter='\t')
indiv_het = indiv_het.sort_values(by=['Sample', 'Population'], ascending=[True, True])
neutral_het = indiv_het[indiv_het['Tool'] == 'neutral']

sv_het = pd.read_csv('graphtyper/individual_svHet.tsv', delimiter='\t')
# Drop the population labels that are not shown in the fairy tern SV panel
fairy_exclude = ['KI_10x', '50xSVs_10xGTs', '50xSVs_50xGTs', 'AU_PlotCritic',
                 'AU_noFixed', 'AU_sites', 'TI_PlotCritic', 'TI_sites', 'KI']
sv_fairyhet = sv_het[~sv_het['Population'].isin(fairy_exclude)]

# Drop the population labels that are not shown in the variable-site SV panel
variable_exclude = ['50xSVs_10xGTs', '50xSVs_50xGTs', 'AU_PlotCritic', 'AU',
                    'AU_sites', 'TI_PlotCritic', 'TI']
sv_het = sv_het[~sv_het['Population'].isin(variable_exclude)]

roh = pd.read_csv('ROHAN_out/ROHs_v2.tsv', sep='\t')
# Parenthesise both comparisons: `&` binds more tightly than `!=`
roh = roh[(roh['Sample']!='SP01') & (roh['Sample']!='TI22')]

roh_summary = roh[['Sample', 'Size Class', 'Population']]
roh_summary = roh_summary.groupby(['Sample', 'Size Class', 'Population']).size().reset_index(name='ROH Counts')
roh_summary = roh_summary.sort_values('Size Class', ascending=False)

roh_len = roh[['Sample', 'ROH Length', 'Size Class', 'Population']]
roh_len = roh_len.groupby(['Sample', 'Size Class', 'Population'])['ROH Length'].sum().reset_index(name='ROH Total')

roh_conditions = [
    roh_len['Population']=='AU',
    roh_len['Population']=='TI',
    roh_len['Population']=='KI_10x',
    roh_len['Population']=='KI'
]

genome_vals = [
    roh_len['ROH Total'] / 1088797119,
    roh_len['ROH Total'] / 1088797119,
    roh_len['ROH Total'] / 1095624494,
    roh_len['ROH Total'] / 1095624494
]

roh_len['Proportion'] = np.select(roh_conditions, genome_vals, default=np.nan)

roh_len = roh_len.sort_values('Size Class', ascending=False)

xvalues = ["Short ROHs", "Medium ROHs", "Long ROHs"]
palette = ['gold', 'steelblue', 'grey', 'black']
order = ['AU', 'TI', 'KI_10x', 'KI']

fig, axes = plt.subplots(1, 5, figsize=(42, 10), sharex=False, sharey=False)

# SNP heterozygosity
sns.violinplot(data = neutral_het, x = "Population", y = "Heterozygosity", order=order, color="0.8", ax=axes[0])
sns.stripplot(data = neutral_het, x = "Population", y = "Heterozygosity", hue = "Population", hue_order=['AU', 'TI', 'KI_10x', 'KI'], palette=['gold', 'steelblue', 'grey', 'black'], jitter = True, size=5, ax=axes[0])
axes[0].set_title('A)', loc='left', fontsize=42)
axes[0].set_xticklabels(['AFT', 'TI', 'KĪ low', 'KĪ high'], fontsize=22, rotation=45)
axes[0].set_xlabel('Population', fontsize=22)
axes[0].set_ylabel("Heterozygosity", fontsize=20)
axes[0].tick_params(axis='y', which='major', labelsize=18)

# Fairy SV heterozygosity
sns.violinplot(data = sv_fairyhet, x = 'Population', y = 'Heterozygosity', color='0.8', ax=axes[1])
sns.stripplot(data = sv_fairyhet, x = 'Population', y = 'Heterozygosity', hue = 'Population', palette=palette, jitter = True, size=5, ax=axes[1])
axes[1].set_title('B)', loc='left', fontsize=42)
axes[1].set_xticklabels(['AFT', 'TI'], fontsize=22)
axes[1].set_xlabel('Population', fontsize=22)
axes[1].set_ylabel('Heterozygosity', fontsize=20)
axes[1].tick_params(axis='y', which='major', labelsize=18)

# Variable SV heterozygosity
sns.violinplot(data = sv_het, x = 'Population', y = 'Heterozygosity', color='0.8', ax=axes[2])
sns.stripplot(data = sv_het, x = 'Population', y = 'Heterozygosity', hue = 'Population', palette=palette, jitter = True, size=5, ax=axes[2])
axes[2].set_title('C)', loc='left', fontsize=42)
axes[2].set_xticklabels(['AFT', 'TI', 'KĪ low', 'KĪ high'], fontsize=20, rotation=45)
axes[2].set_xlabel('Population', fontsize=22)
axes[2].set_ylabel('Heterozygosity', fontsize=20)
axes[2].tick_params(axis='y', which='major', labelsize=18)

# ROH Count
sns.boxplot(data=roh_summary, x="Size Class", y="ROH Counts", hue="Population", palette=palette, hue_order=order, boxprops={'alpha': 0.4},ax=axes[3], legend=False)
sns.stripplot(data=roh_summary, x="Size Class", y="ROH Counts", hue="Population", palette=palette, hue_order=order, dodge=True, ax=axes[3], legend=False)
axes[3].set_title('D)', loc='left', fontsize=42)
axes[3].set_xticklabels(['Short', 'Medium', 'Large'], fontsize=20)
axes[3].set_xlabel('ROH Size Class', fontsize=22)
axes[3].set_ylabel("Count per Individual", fontsize=20)
axes[3].tick_params(axis='y', which='major', labelsize=18)

# ROH Proportion
sns.boxplot(data=roh_len, x="Size Class", y="Proportion", hue="Population", palette=palette, hue_order=order, boxprops={'alpha': 0.4}, ax=axes[4], legend=False)
sns.stripplot(data=roh_len, x="Size Class", y="Proportion", hue="Population", palette=palette, hue_order=order, dodge=True, ax=axes[4], legend=False)
axes[4].set_title('E)', loc='left', fontsize=42)
axes[4].set_xticklabels(['Short', 'Medium', 'Large'], fontsize=22)
axes[4].set_xlabel('ROH Size Class', fontsize=22)
axes[4].set_ylabel("$F_{ROH}$", fontsize=22)
axes[4].tick_params(axis='y', which='major', labelsize=18)

plt.savefig('plots/Figure_3.png', dpi=300, bbox_inches='tight')

We tested the significance of differences in individual heterozygosity between groups with pairwise two-sided Mann-Whitney U tests.  

In [None]:
from scipy.stats import mannwhitneyu

def mann_whitney_test(df, group_col, value_col, group1, group2):
    group1_values = df[df[group_col] == group1][value_col]
    group2_values = df[df[group_col] == group2][value_col]
    stat, p_value = mannwhitneyu(group1_values, group2_values, alternative='two-sided')
    return stat, p_value

groups = ['AU', 'TI', 'KI_10x', 'KI']
results_total = {}

for i in range(len(groups)):
    for j in range(i + 1, len(groups)):
        group1 = groups[i]
        group2 = groups[j]
        stat, p_value = mann_whitney_test(neutral_het, 'Population', 'Heterozygosity', group1, group2)
        results_total[f'{group1} vs {group2}'] = {'stat': stat, 'p_value': p_value}

# Print the results
print("Mann-Whitney U Test results for SNP heterozygosity:")
for comparison, result in results_total.items():
    print(f"{comparison}: U-statistic = {result['stat']}, p-value = {result['p_value']}")

for i in range(len(groups)):
    standard_dev = neutral_het[neutral_het['Population']==groups[i]]['Heterozygosity'].std()
    count = neutral_het[neutral_het['Population']==groups[i]]['Heterozygosity'].count()
    standard_err = standard_dev / np.sqrt(count)
    print(f"Mean individual SNP heterozygosity for {groups[i]}: {neutral_het[neutral_het['Population']==groups[i]]['Heterozygosity'].mean()} +/- {standard_err}")

In [None]:
sv_het = pd.read_csv('graphtyper/individual_svHet.tsv', delimiter='\t')

def mann_whitney_test(df, group_col, value_col, group1, group2):
    group1_values = df[df[group_col] == group1][value_col]
    group2_values = df[df[group_col] == group2][value_col]
    stat, p_value = mannwhitneyu(group1_values, group2_values, alternative='two-sided')
    return stat, p_value

groups = ['AU', 'TI', 'KI_10x', 'KI', 'AU_sites', 'TI_sites']
results_total = {}

for i in range(len(groups)):
    for j in range(i + 1, len(groups)):
        group1 = groups[i]
        group2 = groups[j]
        stat, p_value = mann_whitney_test(sv_het, 'Population', 'Heterozygosity', group1, group2)
        results_total[f'{group1} vs {group2}'] = {'stat': stat, 'p_value': p_value}

# Print the results
print("Mann-Whitney U Test results for SV heterozygosity:")
for comparison, result in results_total.items():
    print(f"{comparison}: U-statistic = {result['stat']}, p-value = {result['p_value']}")

for i in range(len(groups)):
    standard_dev = sv_het[sv_het['Population']==groups[i]]['Heterozygosity'].std()
    count = sv_het[sv_het['Population']==groups[i]]['Heterozygosity'].count()
    standard_err = standard_dev / np.sqrt(count)
    print(f"Mean individual SV heterozygosity for {groups[i]}: {sv_het[sv_het['Population']==groups[i]]['Heterozygosity'].mean()} +/- {standard_err}")
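
The per-group means and standard errors printed by the loops above can also be collected into a single table. The following is a minimal sketch on invented heterozygosity values, not the summary used for the reported results; `toy_het` and its values are hypothetical, and the column names simply mirror the tables read in above.

```python
import pandas as pd

# Invented heterozygosity values for two groups.
toy_het = pd.DataFrame({
    "Population": ["AU", "AU", "AU", "TI", "TI", "TI"],
    "Heterozygosity": [0.0030, 0.0032, 0.0034, 0.0010, 0.0012, 0.0014],
})

# "sem" uses ddof=1 by default, i.e. the sample standard deviation
# divided by sqrt(n), matching std() / np.sqrt(count) in the loops above.
summary = (toy_het.groupby("Population")["Heterozygosity"]
           .agg(n="count", mean="mean", sem="sem")
           .reset_index())
print(summary)
```

A table like this is easier to export alongside the test results than a series of formatted print statements.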

### *F<sub>ROH</sub>* Relationship with H<sub>o</sub>


In [None]:
def add_trendline_and_r2_by_hue(ax, df, x_col, y_col, hue_col):
    trendline_labels = []
    for hue_value in df[hue_col].unique():
        subset = df[df[hue_col] == hue_value]
        # Fit the linear regression model
        slope, intercept, r_value, p_value, std_err = linregress(subset[x_col], subset[y_col])
        trendline = slope * subset[x_col] + intercept
        ax.plot(subset[x_col], trendline, label=f'{hue_value} trendline', linewidth=2)
        
        # Collect trendline labels for a separate legend
        trendline_labels.append(f'{hue_value} trendline')

        # Calculate R^2
        r_squared = r_value ** 2
        
        # Display R^2 on the plot
        ax.text(0.05, 0.95 - 0.05 * list(df[hue_col].unique()).index(hue_value), f'{hue_value} $R^2 = {r_squared:.2f}$', transform=ax.transAxes, fontsize=14, verticalalignment='bottom')

In [None]:
from scipy.stats import linregress

def add_trendline_and_r2_by_hue(ax, df, x_col, y_col, hue_col):
    # Initialize vertical placement for R^2 labels
    vertical_offset = 0.05
    hue_order = sorted(df[hue_col].unique())  # Ensure consistent order for hues
    
    for hue_index, hue_value in enumerate(hue_order):
        subset = df[df[hue_col] == hue_value]
        
        # Fit the linear regression model
        slope, intercept, r_value, p_value, std_err = linregress(subset[x_col], subset[y_col])
        trendline = slope * subset[x_col] + intercept
        ax.plot(subset[x_col], trendline, label=f'{hue_value} trendline', linewidth=2)
        
        # Calculate R^2
        r_squared = r_value ** 2
        
        # Display R^2 on the plot in the bottom-left corner
        ax.text(
            0.05, 
            0.05 + vertical_offset * hue_index,  # Offset each label vertically
            f'{hue_value}: $R^2 = {r_squared:.2f}$',
            transform=ax.transAxes,
            fontsize=10,
            verticalalignment='bottom',
            horizontalalignment='left'
        )

    # Add a legend for the trendlines
    ax.legend()
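
The function above writes each group's $R^2$ onto the plot. If the fitted values are also needed as numbers, the same per-group regression can be tabulated. This is a sketch under stated assumptions: the data are invented, the group labels `A` and `B` are hypothetical, and `fit_table` is an illustrative name rather than an object used elsewhere in this notebook.

```python
import pandas as pd
from scipy.stats import linregress

# Invented F_ROH and heterozygosity values for two hypothetical groups.
toy = pd.DataFrame({
    "Population": ["A"] * 4 + ["B"] * 4,
    "Froh": [0.10, 0.20, 0.30, 0.40, 0.05, 0.10, 0.15, 0.20],
    "Heterozygosity": [0.0040, 0.0034, 0.0031, 0.0025,
                       0.0020, 0.0021, 0.0017, 0.0018],
})

# Fit one ordinary least-squares line per group and keep the key statistics.
rows = []
for pop, subset in toy.groupby("Population"):
    fit = linregress(subset["Froh"], subset["Heterozygosity"])
    rows.append({"Population": pop, "n": len(subset), "slope": fit.slope,
                 "r_squared": fit.rvalue ** 2, "p_value": fit.pvalue})

fit_table = pd.DataFrame(rows)
print(fit_table)
```

With only a handful of individuals per group, the slope and its p-value are worth reporting next to $R^2$, since a high $R^2$ on very few points can be misleading.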

In [None]:
palette = ['gold', 'steelblue', 'grey', 'black']

roh = pd.read_csv('ROHAN_out/ROHs_v2.tsv', sep='\t')

roh_len = roh.groupby(['Sample', 'Size Class', 'Population'])['ROH Length'].sum().reset_index(name='ROH Total')
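# Scale by the genome length matching each population (same values as in the Figure 3 cell)
genome_size = {'AU': 1088797119, 'TI': 1088797119, 'KI_10x': 1095624494, 'KI': 1095624494}
roh_len['Froh'] = roh_len['ROH Total'] / roh_len['Population'].map(genome_size)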

het = pd.read_csv('angsd/individual_het.tsv', delimiter='\t')
het = het[het['Tool'] == 'neutral']

sv_het = pd.read_csv('graphtyper/individual_svHet.tsv', delimiter='\t')

# Merge data
diversity = pd.merge(het, roh_len, on=['Sample', 'Population'])
svdiversity = pd.merge(sv_het, roh_len, on=['Sample', 'Population'])
totalhet = pd.merge(het, sv_het, on=['Sample', 'Population'])

# Plot
order= ['AU', 'TI', 'KI_10x', 'KI']

fig, ax = plt.subplots(1, 3, figsize=(20, 5), sharex=False, sharey=False)

sns.scatterplot(diversity.groupby(['Sample', 'Population', 'Heterozygosity'])['Froh'].sum().reset_index(), x="Froh", y="Heterozygosity", hue="Population", palette=palette, hue_order=order, s=100, edgecolor='none', ax=ax[0])
ax[0].legend(title='Population', loc='upper right')
ax[0].set_xlabel("$F_{ROH}$", fontsize=25)
ax[0].set_ylabel('SNP Heterozygosity', fontsize=25)
ax[0].set_title('A)', fontsize=25, loc='left')
add_trendline_and_r2_by_hue(ax[0], diversity.groupby(['Sample', 'Population', 'Heterozygosity'])['Froh'].sum().reset_index(), 'Froh', 'Heterozygosity', 'Population')

sns.scatterplot(svdiversity.groupby(['Sample', 'Population', 'Heterozygosity'])['Froh'].sum().reset_index(), x="Froh", y="Heterozygosity", hue="Population", palette=palette, hue_order=order, s=100, edgecolor='none', ax=ax[1])
ax[1].legend(title='Population', loc='upper right')
ax[1].set_xlabel("$F_{ROH}$", fontsize=25)
ax[1].set_ylabel('SV Heterozygosity', fontsize=25)
ax[1].set_title('B)', fontsize=25, loc='left')
add_trendline_and_r2_by_hue(ax[1], svdiversity.groupby(['Sample', 'Population', 'Heterozygosity'])['Froh'].sum().reset_index(), 'Froh', 'Heterozygosity', 'Population')

sns.scatterplot(data=totalhet, x='Heterozygosity_x', y='Heterozygosity_y', hue='Population', palette=palette, hue_order=order, s=100, edgecolor='none', ax=ax[2])
ax[2].legend(loc='upper right')
ax[2].set_xlabel('SNP Heterozygosity', fontsize=25)
ax[2].set_ylabel('SV Heterozygosity', fontsize=25)
ax[2].set_title('C)', fontsize=25, loc='left')

# Remove individual legends
for a in ax:
    a.legend_.remove()

# Add a single, shared legend
handles, labels = ax[0].get_legend_handles_labels()
fig.legend(
    handles, labels, title='Population', loc='upper center', bbox_to_anchor=(0.5, -0.1), 
    ncol=len(order), fontsize=14
)

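# Set the tick label size on every panel (plt.xticks/plt.yticks only affect the last axes)
for a in ax:
    a.tick_params(axis='both', which='major', labelsize=18)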

# Save the plot
plt.savefig('plots/Supp_Fig_roh_het_relationship.png', dpi=300, bbox_inches='tight')

In [None]:
sv_het = pd.read_csv('graphtyper/individual_svHet.tsv', delimiter='\t')
het = pd.read_csv('angsd/individual_het.tsv', delimiter='\t')
het = het[het['Tool'] == 'neutral']

totalhet = pd.merge(het, sv_het, on=['Sample', 'Population'])

totalhet[totalhet['Population']=='KI_10x'].head()

### Local Heterozygosity 


In [None]:
for POP in AU TI KI
    do
    if [[ "$POP" == "AU" ]]
        then
        angsd -bam ${ANGSD}${POP}.list -ref $TREF -anc $TANC -out ${ANGSD}samtools/heterozygosity/${POP}_perSite \
            -uniqueOnly 1 -remove_bads 1 -only_proper_pairs 1 -trim 0 -C 50 -baq 1 \
            -minMapQ 20 -minQ 20 -minInd 19 -setMinDepth 200 -setMaxDepth 420 -doCounts 1 \
            -GL 1 -doHWE 1 -domajorminor 1 -doMaf 1 -SNP_pval 1e-6
        angsd -bam ${ANGSD}${POP}.list -ref $TREF -anc $TANC -sites $TSITES -out ${ANGSD}neutral/heterozygosity/${POP}_perSite \
            -uniqueOnly 1 -remove_bads 1 -only_proper_pairs 1 -trim 0 -C 50 -baq 1 \
            -minMapQ 20 -minQ 20 -minInd 19 -setMinDepth 200 -setMaxDepth 420 -doCounts 1 \
            -GL 1 -doHWE 1 -domajorminor 1 -doMaf 1 -SNP_pval 1e-6
    elif [[ "$POP" == "TI" ]]
        then
        angsd -bam ${ANGSD}${POP}.list -ref $TREF -anc $TANC -out ${ANGSD}samtools/heterozygosity/${POP}_perSite \
            -uniqueOnly 1 -remove_bads 1 -only_proper_pairs 1 -trim 0 -C 50 -baq 1 \
            -minMapQ 20 -minQ 20 -minInd 15 -setMinDepth 120 -setMaxDepth 280 -doCounts 1 \
            -GL 1 -doHWE 1 -domajorminor 1 -doMaf 1 -SNP_pval 1e-6
        angsd -bam ${ANGSD}${POP}.list -ref $TREF -anc $TANC -sites $TSITES -out ${ANGSD}neutral/heterozygosity/${POP}_perSite \
            -uniqueOnly 1 -remove_bads 1 -only_proper_pairs 1 -trim 0 -C 50 -baq 1 \
            -minMapQ 20 -minQ 20 -minInd 15 -setMinDepth 120 -setMaxDepth 280 -doCounts 1 \
            -GL 1 -doHWE 1 -domajorminor 1 -doMaf 1 -SNP_pval 1e-6
    else
        angsd -bam ${ANGSD}${POP}.list -ref $KREF -anc $KANC -out ${ANGSD}samtools/heterozygosity/${POP}_perSite \
            -uniqueOnly 1 -remove_bads 1 -only_proper_pairs 1 -trim 0 -C 50 -baq 1 \
            -minMapQ 20 -minQ 20 -minInd 24 -setMinDepth 700 -setMaxDepth 1200 -doCounts 1 \
            -GL 1 -doHWE 1 -domajorminor 1 -doMaf 1 -SNP_pval 1e-6
        angsd -bam ${ANGSD}${POP}.list -ref $KREF -anc $KANC -sites $KSITES -out ${ANGSD}neutral/heterozygosity/${POP}_perSite \
            -uniqueOnly 1 -remove_bads 1 -only_proper_pairs 1 -trim 0 -C 50 -baq 1 \
            -minMapQ 20 -minQ 20 -minInd 24 -setMinDepth 700 -setMaxDepth 1200 -doCounts 1 \
            -GL 1 -doHWE 1 -domajorminor 1 -doMaf 1 -SNP_pval 1e-6
    fi
done