# Archery Gender Analysis

This notebook performs the analysis required for a study into the differences between genders in indoor archery competition.

In [None]:
import os
import pandas as pd

from archery_gender_analysis import ianseo_scrape as ianseo_fn
from archery_gender_analysis import plotting as plot
from archery_gender_analysis import general_routines as gr

## Set up files to store data and results

Define the variables:

- `mydata: str` - directory where any raw data you download from ianseo will be saved
- `myresults: str` - directory where any results (numbers in `.txt` files and figures) will be saved

In [None]:
mydata = "newdata/"
myresults = "newresults/"

Check these directories exist, and if not create them.

In [None]:
datapath = os.path.join(os.getcwd(), mydata)
resultspath = os.path.join(os.getcwd(), myresults)
# Create the directories if they do not already exist
os.makedirs(datapath, exist_ok=True)
os.makedirs(resultspath, exist_ok=True)

## Construct datasets for each set of competitions we wish to analyse

Each dataset is a dictionary whose keys are human-readable identifiers for the individual competitions/events and whose values are the URLs of the base ianseo results page for each event.

Here we define two datasets as used in the study:

- `AGB_NI` - Results from the Archery GB National Indoor Championships 2011, 2014 - 2019, and 2021
- `Nimes` - Results from the Nimes Sud de France 2015 - 2022

In [None]:
# Construct datasets for each competition
# dictionaries of ianseo urls corresponding to each event

# AGB NI
AGB_NI = {
    # "AGB NI 2022": "https://www.ianseo.net/TourData/2022/12327",
    "AGB NI 2021": "https://www.ianseo.net/TourData/2021/9399",
    "AGB NI 2019": "https://www.ianseo.net/TourData/2019/6526",
    "AGB NI 2018": "https://www.ianseo.net/TourData/2018/4739",
    "AGB NI 2017": "https://www.ianseo.net/TourData/2017/3214",
    "AGB NI 2016": "https://www.ianseo.net/TourData/2016/2192",
    "AGB NI 2015": "https://www.ianseo.net/TourData/2015/1322",
    "AGB NI 2014": "https://www.ianseo.net/TourData/2014/861",
    "AGB NI 2011": "https://www.ianseo.net/TourData/2011/237",
}

# Nimes
Nimes = {
    # "Nimes 2023": "https://www.ianseo.net/TourData/2023/12859",
    "Nimes 2022": "https://www.ianseo.net/TourData/2022/9959",
    "Nimes 2021": "https://www.ianseo.net/TourData/2021/8006",
    "Nimes 2020": "https://www.ianseo.net/TourData/2020/6255",
    "Nimes 2019": "https://www.ianseo.net/TourData/2019/4785",
    "Nimes 2018": "https://www.ianseo.net/TourData/2018/3113",
    "Nimes 2017": "https://www.ianseo.net/TourData/2017/2013",
    "Nimes 2016": "https://www.ianseo.net/TourData/2016/1276",
    "Nimes 2015": "https://www.ianseo.net/TourData/2015/797",
}

## Select which dataset to use

Select which of the datasets defined above we will be using for subsequent analysis and provide a name.

The analysis can be re-run for different datasets by changing the following cell and re-running the notebook.

In [None]:
# Set the dataset to use
# Here we choose the Archery GB National Indoors

dataset = AGB_NI
dataset_name = "AGB NI"

## Retrieve the data from ianseo

The following cell will retrieve the data for the requested events from ianseo and save combined results from each event as a `.csv` file in the directory specified above in the `mydata` variable.

In [None]:
for id, url in dataset.items():
    print(f"processing {id}...")

    # Assume recurve and compound always exist
    # Fetch results and combine into one table
    RM = ianseo_fn.get_cat(url, "IQRM.php", div="R", gen="M")
    RW = ianseo_fn.get_cat(url, "IQRW.php", div="R", gen="W")
    CM = ianseo_fn.get_cat(url, "IQCM.php", div="C", gen="M")
    CW = ianseo_fn.get_cat(url, "IQCW.php", div="C", gen="W")
    full_results = pd.concat([RM, RW, CM, CW], ignore_index=True)

    # Try fetching longbow results
    try:
        LM = ianseo_fn.get_cat(url, "IQLM.php", div="L", gen="M")
        LW = ianseo_fn.get_cat(url, "IQLW.php", div="L", gen="W")
        full_results = pd.concat([full_results, LM, LW], ignore_index=True)
    except (ValueError, AttributeError):
        # Longbow results could not be fetched for this event; skip them
        pass

    # Try fetching barebow results
    try:
        BM = ianseo_fn.get_cat(url, "IQBM.php", div="B", gen="M")
        BW = ianseo_fn.get_cat(url, "IQBW.php", div="B", gen="W")
        full_results = pd.concat([full_results, BM, BW], ignore_index=True)
    except (ValueError, AttributeError):
        # Barebow results could not be fetched for this event; skip them
        pass

    # Save to file
    full_results.to_csv(f'{datapath}/{id.replace(" ","_")}_scores.csv')

print("Done")

## Process the data

Read in all of the data from each event in this dataset into a single dataframe.

Then process it to:

- Calculate the percentile $p$ for separate-gender events
- Calculate the percentile $p$ for a mixed-gender event
- Calculate the position changes between separate and mixed events
- Sort the dataframe by event, division, class, score, and number of 10s
- Save to files

The percentile $p$ is defined as:
$$
p = 100 \frac{r-1}{N-1}
$$
where $r$ is an individual's rank, and $N$ is the total number of competitors in the category.
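As a rough illustration of the definition above, the following sketch computes $p$ for separate-gender and mixed-gender rankings in plain Python. It is not the implementation used by `gr.calc_mixed_rank_percentiles`: the `percentiles` helper, the made-up scores, and the tie handling (tied scores share the best rank) are all assumptions made for this example.

```python
def percentiles(scores):
    """Return p = 100 * (r - 1) / (N - 1) for each score (rank 1 = highest score)."""
    n = len(scores)
    if n < 2:
        # Assumption: a lone competitor is placed at the 0th percentile
        return [0.0] * n
    # Assumption: tied scores share the best rank (competition ranking)
    ranks = [1 + sum(other > s for other in scores) for s in scores]
    return [100 * (r - 1) / (n - 1) for r in ranks]


# Made-up scores, for illustration only
men = [580, 570, 560]
women = [575, 565]

sep_men = percentiles(men)        # separate-gender percentiles for the men
sep_women = percentiles(women)    # separate-gender percentiles for the women
mixed = percentiles(men + women)  # percentiles in a hypothetical mixed event

print(sep_men, sep_women, mixed)
```

With this definition the top-ranked archer in a category is always at $p = 0$ and the last-ranked archer at $p = 100$, regardless of the category size.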

In [None]:
df_all = gr.read_from_files(
    dataset.keys(),
    datapath=datapath,
    fname_fmt=".csv",
    f_pref="",
    f_suff="_scores",
)

# Calculate split and mixed gender rankings and percentiles
df_all = gr.calc_mixed_rank_percentiles(df_all)

# Calculate each individual's change in position when the separate-gender events are combined into a mixed event
df_all = gr.calc_delta_sep_mixed(df_all)

# Sort and save the raw dataset
df_out = df_all.sort_values(
    by=["Event", "Division", "Class", "Score", "10"],
    ascending=[False, True, True, False, False],
    ignore_index=True,
)

with open(
    f'{resultspath}/{dataset_name.replace(" ","_")}_raw_data.txt',
    "w",
) as f:
    f.write(
        df_out.to_string(
            index=False,
            columns=[
                "Event",
                "Division",
                "Class",
                "Score",
                "10",
                "9",
                "Sep rank",
                "Mixed rank",
            ],
        )
    )

# Print the dataset to check
print(
    df_out.to_string(
        index=False,
        columns=[
            "Event",
            "Division",
            "Class",
            "Score",
            "10",
            "9",
            "Sep rank",
            "Mixed rank",
        ],
    )
)

## Produce plots

### 1) Scatter male and female raw score by percentile.

One plot per competition in the dataset.  
Plot for each bowstyle.  
Allows a comparison of male and female scores across both sets of participants.

In [None]:
# Scatter all scores against separate percentile position
plot.scatter_scores(df_all, fsave=True)

### 2) Scatter raw score by percentile for mixed competition, indicating male and female scores.

One plot per competition in the dataset.  
Plot for each bowstyle.  
Shows distribution of placings in hypothetical mixed-gender event.

In [None]:
# Scatter mixed scores against mixed percentile position (but showing gender)
plot.scatter_mixed_scores(df_all, fsave=True)

### 3) Scatter mixed percentile against separate-gender percentile

One plot per competition in the dataset.  
Plot for each bowstyle.  
Shows how each archer is affected by combining scores into single event.  
Adds inset axes for compound and recurve to examine the top 15%.

In [None]:
# Scatter change in percentile when going from separate to mixed gender competition
plot.mixed_split_percentile(df_all, fsave=True)

### 4) Plot position changes resulting from mixed-gender competition

For male and female scores plot average change in absolute position (not percentile) resulting from a mixed gender competition.  
Point shows average.  
Tails show extrema.  
Dashed lines show cutoff for podium position.
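The quantities plotted here (average and extrema of the position change) can be sketched as below. This is only an illustration and not the code in `gr.get_pos_changes`: the `summarise` helper, the numbers, and the sign convention (mixed rank minus separate rank, so a positive value means a lower placing in the mixed event) are assumptions.

```python
import statistics


def summarise(changes):
    """Return (mean, min, max) of a list of position changes."""
    return statistics.mean(changes), min(changes), max(changes)


# Made-up position changes (mixed rank minus separate rank), for illustration only
delta = {"M": [0, 1, 1, 2], "W": [1, 3, 4, 6]}

for gender, changes in delta.items():
    mean_change, smallest, largest = summarise(changes)
    print(f"{gender}: mean={mean_change}, min={smallest}, max={largest}")
```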

In [None]:
# Examine the effect on medalists across all events and plot
# Get the change in absolute position for each archer
delta_pos = gr.get_pos_changes(df_all, fpref=dataset_name)
# Plot
plot.plot_pos_changes(delta_pos, fid=dataset_name)

## Perform statistical tests

Use scipy to perform Welch's t-test for each event.  
Test entire population and also broken down into smaller ranges.  
Write out results of t-test (p-values) and also score summary statistics.
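For reference, Welch's t-test compares two means without assuming equal variances. The sketch below computes the test statistic and the Welch-Satterthwaite degrees of freedom in plain Python; the `welch_t` helper and the scores are made up for illustration and are not taken from `gr.conduct_t_test`. In scipy, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the same test and also returns the p-value.

```python
import math
import statistics


def welch_t(a, b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    na, nb = len(a), len(b)
    # Sample variances (n - 1 in the denominator)
    va, vb = statistics.variance(a), statistics.variance(b)
    se2 = va / na + vb / nb  # squared standard error of the difference in means
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se2)
    df = se2**2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df


# Made-up scores, for illustration only
men = [580, 570, 560, 550]
women = [570, 560, 550]

t_stat, dof = welch_t(men, women)
print(f"t = {t_stat:.3f}, degrees of freedom = {dof:.2f}")
```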

In [None]:
# Conduct a t-test on the data by event and write to file along with key stats
ttest = gr.conduct_t_test(
    df_all, fpref=dataset_name, display_summary=True, display_all=True
)

print(ttest)