# Project Seminar DSAI
## Topic X: Estimating digital gender gaps among young people around the globe
## *Team 1: Nick Nagel 7015019, Hannah Schwed 7047756*

---



## 1. Literature Review
As part of the United Nations Sustainable Development Goals (SDGs) for 2030, achieving gender equality and women's empowerment is relevant for all countries worldwide. It spans a wide range of areas, and access to technology such as the internet is one of them. The SDGs specify it as subtarget 5b: "Enhance the use of enabling technology, in particular information and communications technology, to promote the empowerment of women" (United Nations, n.d. https://sdgs.un.org/goals/goal5#targets_and_indicators). Hence, proposing a way to estimate and analyse the gender gap in internet adoption is of current global importance. According to the ITU's latest data, the digital gender gap stands at 8% (ITU, 2023 https://www.itu.int/en/mediacentre/backgrounders/Pages/bridging-the-gender-divide.aspx). GSMA's latest report from 2025 shows a gender gap in internet adoption of 11% (GSMA, 2025 https://www.gsma.com/r/wp-content/uploads/2025/06/The-Mobile-Gender-Gap-Report-2025.pdf).<br>

There are two main types of data sources on digital gender gaps. The first are offline sources such as the International Telecommunication Union (ITU) Connectivity Dashboard https://datahub.itu.int/data/?c=701&i=11624, OECD https://goingdigital.oecd.org/datakitchen/#/explorer/5/ict/indicator/explore/en?mainCubeId=OECD.STI.DEP%2FDSD_ICT_HH_IND%40DF_IND&pairCubeId=&sizeCubeId=&mainIndId=C5B_I&pairIndId=&sizeIndId=&mainBreakdowns=GEO_AREA%3A_Z.AGE%3AY16T74.SEX%3A_T.EDUCATION_LEVEL%3A_T.INCOME_GROUP%3A_T.EMP_STATUS%3A_T&pairBreakdowns=&sizeBreakdowns=&lollipop=&lollipopOpts=&countries=&countryFilter=&time=1104534000360.1704063600360&chart=barchart&fontSize=12&palette=normal&lastDates=true&timeScale=A&mainUnit=PT_POP&pairUnit=&sizeUnit=&flip=false&fullLabel=, National Statistics Offices, the Global System for Mobile Communications Association (GSMA), Demographic and Health Surveys (DHS), Multiple Indicator Cluster Surveys (MICS) https://mics.unicef.org/surveys?display=card&f[0]=status:241 and the Pew Research Center https://www.pewresearch.org/chart/internet-use-by-gender/. The second are non-traditional online sources, mainly advertising platforms such as Facebook, Google Ads or Snapchat. <br>

Previous studies of digital gender gaps have been based on a mix of sources and have focused on a variety of locations, demographic groups and areas of divide. Tyers-Chowdhury et al. (2021) analysed the gender gap among adolescents aged 14-17 based on Facebook and Snapchat data alongside official sources. In Facebook data they found gaps similar to those of the adult population, whereas on Snapchat the picture was more scattered, with more female than male users in countries such as Indonesia, Russia and Japan. Breen et al. (2023) used Facebook data to obtain subnational gender gap estimates worldwide and showed large within-country differences.
Fatehkia et al. (2018) used Facebook data to track gender gaps worldwide and found that the data align with offline indicators including ITU and GSMA. Kashyap et al. (2020) used advertising data from Facebook and Google and found that Facebook is a stronger predictor than Google, while both perform best when combined with the Human Development Index (HDI). They found that gender gaps are especially pronounced in South Asia and sub-Saharan Africa.

The current approach is motivated on the one hand by the lack of official data, especially for low- and middle-income countries (LMICs), and on the other hand by the potentially decreasing popularity of Facebook. A decline in popularity can make the data less representative of the true population. Having an alternative source can further enhance the reliability of estimating digital gender gaps from social media and advertising data. If TikTok data yield results similar to those from Facebook data, we gain confidence in the validity of both sources. Should one of these sources undergo significant changes in the future, the other can serve as a fallback. Furthermore, TikTok is an especially relevant source for the younger population, with 66% of users being between 18 and 34 years old (https://www.statista.com/statistics/1299771/tiktok-global-user-age-distribution/). The accessibility of the data also permits repeated collection and the creation of a longitudinal dataset, which in the future makes it possible to analyse and detect changes over time. Besides country-level data, TikTok also provides province- or city-level estimates. However, the availability of each of these levels of granularity differs across countries (see Data Collection). <br>
<br>


should we write about gender gap implications etc? https://plan-international.org/quality-education/bridging-the-digital-divide/

- cannot have data for adolescents

<font color='red'> include why looking at gender gaps is important, relevance for emerging countries, can also include gender gap in education, why looking at technology is important </font>

## Goal <br>
The goal of the project is to gather data about digital adoption specifically for young people using TikTok. With this, we estimate the digital gender gap among young people worldwide. We measure internet adoption by TikTok usage, obtained from the marketing API. We use TikTok data because previous work on this topic has not focused on the younger population. Given that TikTok's audience consists mostly of younger people, the data should shed light on gender gaps within this demographic group. <br>

#### Key Figure


$Internet \ Penetration_f = \frac{I_f}{I_{total}}$ <br>
$Internet \ Penetration_m = \frac{I_m}{I_{total}}$ <br>

where $I_f$ is the internet usage of females estimated by TikTok audience estimates, $I_m$ is the internet usage of males estimated by TikTok audience estimates, and $I_{total}$ is the total number of TikTok users per country. <br>

$Gender \ Gap = \frac{I_f/I_m}{P_f/P_m} $ <br>
The Gender Gap reflects the female-to-male ratio of digital adoption, normalized by the female-to-male population ratio. A value of 1 indicates no difference between female and male internet adoption. Values below 1 indicate that female users are less present than male users relative to their population share. Values above 1 indicate that male users are less present than female users relative to their population share.
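
The Gender Gap formula can be illustrated with a small sketch. The function name `gender_gap` and all numbers below are hypothetical and only serve to show the arithmetic; they are not results from our data.

```python
def gender_gap(i_f: float, i_m: float, p_f: float, p_m: float) -> float:
    """Return (I_f / I_m) / (P_f / P_m) for one country."""
    if min(i_m, p_f, p_m) <= 0:
        raise ValueError("male audience and both population counts must be positive")
    return (i_f / i_m) / (p_f / p_m)


# Hypothetical country: 400,000 female and 500,000 male TikTok users,
# with an equal female and male population of 1,000,000 each.
gap = gender_gap(i_f=400_000, i_m=500_000, p_f=1_000_000, p_m=1_000_000)
print(round(gap, 2))  # 0.8, i.e. women are under-represented relative to men
```

Guarding against zero denominators matters in practice because audience estimates can be reported as 0 for small countries.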


## 2. Methodology <br>
We retrieve data from the TikTok Marketing API. For documentation refer to: <br>
https://business-api.tiktok.com/portal/docs?id=1781891416235009&rid=l74jb3dn5ib <br>
In a first step after obtaining the authorization and authentication credentials from an app in a TikTok developer account, we start building the structure for the API calls.
....
 <br>
.... <br>
### Statistical Tests
Since the variables of interest, the female-to-male ratios based on TikTok, Facebook or population data, are continuous, we can use two-sample t-tests. For this, we first check whether all assumptions are fulfilled. First, we test for normality of the data using the Shapiro-Wilk test. Its null hypothesis is that the data are normally distributed, so with p < 0.05 we reject normality, and with p > 0.05 we have no evidence against it.<br>
$W = \frac{\left( \sum_{i=1}^{n} a_i x_{(i)} \right)^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$ <br>
The assumption of independence is most likely violated in our data because the different sources ultimately describe the same population of females and males in a country. Therefore we use paired t-tests.<br>
In a last step we test for homogeneity of variance using Levene's test. 
<font color='red'> -->we only need this if we don't do paired t-tests. what is right??? </font> <br> With p > 0.05, we fail to reject the null hypothesis of equal variances (homogeneity). This means we have no evidence that the variances of the three groups differ, so the equal-variance assumption of the t-test can be regarded as met. <br>
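$L = \frac{\frac{1}{k - 1} \sum_{j=1}^{k} n_j (\overline{Z}_j - \overline{Z})^2}{\frac{1}{n - k} \sum_{j=1}^{k} \sum_{i=1}^{n_j} (Z_{ji} - \overline{Z}_j)^2}$ <br>
where $Z_{ji} = |Y_{ji} - \overline{Y}_j|$ is the absolute deviation of observation $i$ in group $j$ from its group mean, $\overline{Z}_j$ is the mean of the $Z_{ji}$ in group $j$, $\overline{Z}$ is the overall mean of the $Z_{ji}$, $k$ is the number of groups and $n$ is the total number of observations. <br>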

Next, we can perform paired two-tailed t-tests to see if there are significant differences between the sample means. <br>
$t = \frac{\bar{d}}{s_d / \sqrt{n}}$ <br>
where $\bar{d}$ is the mean of the pairwise differences between two data sources, $s_d$ is the sample standard deviation of these differences and $n$ is the number of pairs. With p > 0.05 we cannot reject the null hypothesis of equal means. In that case we find no significant difference between the data sources.<br>
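
As a minimal sketch of the paired test, the statistic can be computed as the mean of the pairwise differences divided by their standard error. The function name `paired_t_statistic` and the ratios below are made up for illustration and are not our data. To obtain a p-value, a library routine such as `scipy.stats.ttest_rel` could be used instead; this sketch only uses the standard library.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """Return (t, degrees of freedom) for two paired samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_d == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean_d / (sd_d / math.sqrt(len(diffs))), len(diffs) - 1


# Hypothetical female-to-male ratios for the same five countries.
tiktok_ratio = [0.80, 0.95, 1.10, 0.70, 0.90]
facebook_ratio = [0.75, 0.97, 1.05, 0.72, 0.85]

t_stat, dof = paired_t_statistic(tiktok_ratio, facebook_ratio)
print(f"t = {t_stat:.3f}, df = {dof}")
```

Pairing by country removes the between-country variation from the comparison, which is why the test is run on the differences rather than on the two samples separately.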

In a final step we do a Case Study.<br>

A similar approach has been taken using the Meta API. For reference, see: <br>
https://docs.google.com/document/d/1wdMl1RFCK02lpwfj8AK-vJy-snMgVBrF1d_i0ZgiKSA/edit?tab=t.0#heading=h.f8vm8a1d53rx


## 3. Data Collection and Limitations <br>
Our main data of interest are the audience estimates.<br>
https://business-api.tiktok.com/portal/docs?id=1740302379236353 <br>
We query these at a set point in time by
* location_id: not all location_ids are available for the specific placement and objective. To get available locations, refer to https://business-api.tiktok.com/portal/docs?id=1737189539571713 . Max size of location_ids: 3,000 per ad group.<br>

* gender: we split by GENDER_UNLIMITED, GENDER_MALE, GENDER_FEMALE as values. https://business-api.tiktok.com/portal/docs?id=1739381236849665

* age_groups: we split by AGE_13_17, AGE_18_24, AGE_25_34, AGE_35_44,
AGE_45_54, AGE_55_100. If there are restrictions, we can limit to only query specific age groups. https://business-api.tiktok.com/portal/docs?id=1737174886619138#item-link-Age%20Group

In order to query the data, we set up an automated query system which queries and stores the data for a given set of locations.

In order to calculate the Gender Gap we also need population data for each country. We retrieve the data from the World Bank as a CSV file: <br>
https://data.worldbank.org/indicator/SP.POP.TOTL <br>

### Locations
* countries (57 in total): 'KR' 'GR' 'DO' 'LB' 'AR' 'KZ' 'DK' 'PR' 'BD' 'RO' 'DE' 'DZ' 'CR' 'QA' 'KE' 'CZ' 'OM' 'PK' 'CH' 'NO' 'GT' 'TR' 'LK' 'UY' 'SE' 'NG' 'GB' 'SA' 'BY' 'ZA' 'BH' 'AT' 'HU' 'IE' 'PY' 'IQ' 'FR' 'US' 'UA' 'KW' 'PT' 'AZ'
 'IT' 'BR' 'CA' 'BO' 'MA' 'JP' 'AE' 'PA' 'NL' 'MX' 'EG' 'PL' 'BE' 'ES' 'FI'
* city: available for 33 out of 57 countries.
* district: only available for FR, ES, GB, IT and US. <font color='red'> what exactly does district refer to? </font>
* province: available for 33 out of 57 countries. Represents federal states or provinces (e.g. for FR, DE, US). Sometimes has additional aggregations like designated market areas (DMA) in the US.
* <font color='red'> check if geolocation queries with radius are possible </font>

### Rate Limits
The marketing API's rate limits for an app of type "Basic" are:
- QPS Limit (Queries-Per-Second Rate Limiting): 10
- QPM Limit (Queries-Per-Minute Rate Limiting): 600 with reset after 5 minutes
- QPD Limit (Queries-Per-Day Rate Limiting): 864,000 with reset each day at midnight

https://business-api.tiktok.com/portal/docs?id=1740029171730433
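
One simple way to stay below the per-second limit is to enforce a minimum interval between consecutive calls. The class `MinIntervalLimiter` below is a hypothetical helper, not part of the TikTok API or of our pipeline, and it is a single-threaded sketch: with several threads it would need a lock.

```python
import time


class MinIntervalLimiter:
    """Block so that consecutive calls are at least 1 / max_per_second apart."""

    def __init__(self, max_per_second: float):
        self.min_interval = 1.0 / max_per_second
        self._last_call = None  # time of the previous call, if any

    def wait(self) -> None:
        now = time.monotonic()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        self._last_call = time.monotonic()


limiter = MinIntervalLimiter(max_per_second=10)  # QPS limit of a "Basic" app
for _ in range(3):
    limiter.wait()
    # a request to the API would be sent here
```

At a sustained 10 queries per second, the per-minute limit of 600 and the per-day limit of 864,000 are reached exactly, so spacing calls slightly further apart leaves some headroom.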
<br>
### Age Group Restrictions
When it comes to ad creation, TikTok limits the possibilities to target users below the age of 18. For the United States, Latin America, the European Economic Area, the UK, Switzerland and Canada, Lead Generation objective campaigns and Product Sales campaigns cannot be targeted at users below the age of 18. If the campaign includes TikTok or app bundles for these locations, personalization-based targeting (i.e. gender, spending power, household income, audience, interests & behaviors) is not allowed for the age group 13-17 (AGE_13_17). Targeting based on location, device, or language is still possible for that age group. <br>
https://business-api.tiktok.com/portal/docs?id=1788755983247362

### Data Availability
For small countries, there may be limited data available and TikTok could set the audience estimates to 0. Also, for security reasons no audience estimates are available for the age group 13-17 (AGE_13_17). <br>
https://business-api.tiktok.com/portal/docs?id=1740302379236353

### Data Reliability
We should keep in mind that the data obtained are estimates and we are given a lower and upper end of the estimated audience size range. No further insights are given how or based on which parameters these estimates are calculated. https://business-api.tiktok.com/portal/docs?id=1740302379236353 <br>
However, there are categories 1-4 given for each estimate which can be viewed as an evaluation for the estimate. <br>
- "1: Too Narrow. The estimated audience size is less than 10,000, and accounts for less then 20% of possible audience size in the selected locations on the selected placements.
- 2: Narrow. The estimated audience size is equal to or larger than 10,000, but still accounts for less than 20% of possible audience size in the selected locations on the selected placements.
- 3: Balanced. The estimated audience size accounts for 20% or more but less than 80% of possible audience size in the selected locations on the selected placements.
- 4: Fairly Broad: The estimated audience size accounts for 80% or more of the possible audience size in the selected locations on the selected placements
" <br>
https://business-api.tiktok.com/portal/docs?id=1740302379236353 <br>
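
Because each estimate is a range with a quality category, it has to be reduced to a single number before ratios can be computed. The sketch below is one possible convention, not the method prescribed by TikTok: it assumes that category 1 ("Too Narrow") and zeroed-out estimates are discarded and that the midpoint of the range is used otherwise. The function name `summarize_estimate` is hypothetical; the argument names mirror the `lower_end`, `upper_end` and `user_count_stage` fields of the response.

```python
def summarize_estimate(lower_end: int, upper_end: int, stage: int):
    """Return the midpoint of the range, or None if the estimate is unusable."""
    if stage == 1 or upper_end <= 0:  # "Too Narrow" or zeroed-out estimate
        return None
    return (lower_end + upper_end) / 2


print(summarize_estimate(120_000, 150_000, 3))  # 135000.0
print(summarize_estimate(0, 0, 1))              # None
```

Keeping the lower and upper ends in the stored output, as the code in Section 5 does, leaves the choice of point estimate open for later sensitivity checks.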

The data are also inherently biased towards the group of people using TikTok. They can additionally be biased by wrongly entered gender and/or age. Misreporting age in particular could be common, for example younger people pretending to be older or vice versa. <br>

## 4. Technical Set Up
### Input
The API expects JSON-formatted inputs. We automatically generate these JSON inputs for all the combinations of locations, age ranges and genders that we want to query. In a first step, the values for each field are retrieved from the TikTok API or taken as fixed values from the documentation; they are then combined.

### Docker
The Python code developed will be packaged into a Docker container to ensure scalability and usability.

### Output
The results are first saved as csv files. These are then imported into the PostgreSQL database via a REST-API. The format of the csv files follows pySocialWatcher (https://github.com/maraujo/pySocialWatcher/blob/master/pysocialwatcher/output_examples/quick_example_dataframe_collected.csv), i.e. has columns: name, ages_ranges, geo_location, genders, interests, behavior, scholarities, languages, family_statuses, all_fields, targeting, response, audience. <br>
<font color='red'>Google Cloud ???</font>

### Parallelization
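Since the input has 58,482 rows and at least 0.7 seconds of sleep are needed between calls to avoid being blocked, looping over the rows and sending one request at a time would take too long. Therefore we use FastAPI with 3 workers to parallelize the requests. The row count corresponds to 57 country codes × 57 location IDs × 3 genders × 6 age groups; if each country is paired only with its own location ID, the input shrinks to 57 × 3 × 6 = 1,026 rows.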
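
The running time can be estimated with a short back-of-the-envelope calculation. The sketch below only counts the sleep time between calls and assumes an ideal speed-up across workers, so it is a lower bound rather than a measured value.

```python
rows = 58_482   # number of input rows
sleep_s = 0.7   # minimum sleep between calls, in seconds
workers = 3     # number of parallel workers

sequential_h = rows * sleep_s / 3600
parallel_h = sequential_h / workers  # ideal speed-up, ignores shared rate limits

print(f"sequential: {sequential_h:.1f} h, {workers} workers: {parallel_h:.1f} h")
# sequential: 11.4 h, 3 workers: 3.8 h
```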



## 5. Code Implementation

In [None]:
#pip install uvicorn fastapi pydantic
#!pip install requests pandas
!pip install tqdm

In [14]:
import uvicorn
import nest_asyncio
from fastapi import FastAPI
from pydantic import BaseModel
import requests
import pandas as pd
from datetime import datetime
import time
import urllib.parse
import itertools
import json
nest_asyncio.apply()
from concurrent.futures import ThreadPoolExecutor, as_completed
from fastapi.responses import FileResponse
from tqdm import tqdm

# Credentials
advertiser_id = '7381489555305775105'
secret = '01bebffff7dd4469b207a1622ad3892dfdf862a0'
app_id = '7384250931329630225'
auth_code = 'f57f9e8b140d03312509d43a9e70a96e65fde888' ## need to open link from App once reviewed; only valid for 1 hour, can only be used once
access_token = '7e4105012622ac077282d8a3e4bd6f937cbdec70'

In [16]:
################# PREPARING INPUT ######################
#get values for locations, age_ranges and gender and build json formatted input

#location_id or region_code. Research API lets you use region_code = country_codes
url = 'https://business-api.tiktok.com/open_api/v1.3/search/region/'

headers = {
    'Access-Token': access_token,
    'Content-Type': 'application/json'
}
params = {
    'advertiser_id': advertiser_id,
}

response = requests.get(url, headers=headers, params=params)

results = response.json()['data']['region_list'] #return json object of result of get request
#geolist.to_csv('./config/specs/specs_explore/tiktok_regions_list.csv')

countries_df = pd.json_normalize(results)
countries_df = countries_df.drop(['area_type','parent_id'],axis=1)
#print(countries_df)

country = countries_df[countries_df['region_level']=='COUNTRY'][['country_code','region_id']]
#.drop_duplicates().reset_index(drop=True)
#countries_df['country_code'].unique()
#print(country)

province = countries_df[countries_df['region_level']=='PROVINCE'][['country_code','region_name','region_id']]
#print(province)


district = countries_df[countries_df['region_level']=='DISTRICT'][['country_code','region_name','region_id']]
#print(district)

city = countries_df[countries_df['region_level']=='CITY'][['country_code','region_name','region_id']]
#print(city)

countries_df.to_csv('countries.csv',encoding='utf-8-sig')
#files.download('countries.csv')

#need ad group id for this, to get this id we would need to create a campaign and ad --> refer to fixed values for now
#if targeting_info and "age" in targeting_info:
#    age_buckets = list(targeting_info["age"].keys())
#else:

# reduce to one group for test, full: ["AGE_13_17", "AGE_18_24", "AGE_25_34", "AGE_35_44", "AGE_45_54", "AGE_55_100"]
age = ["AGE_18_24"]

gender = ["GENDER_FEMALE", "GENDER_MALE", "GENDER_UNLIMITED"]

#create input as json with all country, gender, age combinations
#each country code is paired with its own region_id via zip; a full product of
#countries and location_ids would combine every country with every other country's id
countries = country['country_code'].to_list()
location_ids = country['region_id'].to_list()

combine = [
    (c, loc, g, a)
    for (c, loc), g, a in itertools.product(zip(countries, location_ids), gender, age)
]

#save input as csv for reference
df = pd.DataFrame(combine, columns=['country', 'location_id', 'gender', 'age'])
df.to_csv('input.csv',encoding='utf-8-sig')
#files.download('input.csv')

inputs = [{"country": c, "location_id": loc, "gender": g, "age": a} for c, loc, g, a in combine]
#input = json.dumps(input,indent=1)

print(inputs)
print(f"Input has {len(inputs)} rows")

[{'country': 'AZ', 'location_id': '587116', 'gender': 'GENDER_FEMALE', 'age': 'AGE_18_24'}, {'country': 'AZ', 'location_id': '587116', 'gender': 'GENDER_MALE', 'age': 'AGE_18_24'}, {'country': 'AZ', 'location_id': '587116', 'gender': 'GENDER_UNLIMITED', 'age': 'AGE_18_24'}, ...

In [None]:
################ EXTRACT DATA VIA API REQUESTS####################
# Define input schema for FastAPI
class InputItem(BaseModel):
    location_id: str
    age: str
    gender: str
    country: str

class InputList(BaseModel):
    inputs: list[InputItem]

app = FastAPI()

#get audience estimate
url = 'https://business-api.tiktok.com/open_api/v1.3/ad/audience_size/estimate/'
headers = {
    'Access-Token': access_token,
    'Content-Type': 'application/json'
}

results = []
output_columns = ["name", "ages_ranges", "geo_location", "genders", "interests", "behavior", "scholarities", "languages", "family_statuses", "all_fields", "targeting", "response", "lower_end","upper_end","user_count_stage"]
retries = 3
sleep = 0.2

def get_audience_estimate(data):
    for attempt in range(retries):
        response = requests.post(url, headers=headers, json=data).json()
        if response.get("code") == 0:
            return response
        elif response.get("code") == 51052:
            print(f"Error on attempt {attempt+1}, retrying {data}")
            time.sleep(sleep)
        else:
            print(f"API error {response.get('code')}: {response.get('message')} for input {data}")
            time.sleep(sleep)
            return None
    print("Max retries reached, skipping.")
    return None

def process_input(input):
    data = {
        "advertiser_id": advertiser_id,  # reuse the value defined with the credentials
        "objective_type": "REACH",
        "optimization_goal": "REACH",
        "placements": ["PLACEMENT_TIKTOK", "PLACEMENT_PANGLE", "PLACEMENT_GLOBAL_APP_BUNDLE"],
        "location_ids": [input.location_id],
        "gender": input.gender,
        "age_groups": [input.age]
    }
    try:
        response = get_audience_estimate(data)
        if not response:
            return None
        entry = {
            "name": input.country,
            "ages_ranges": input.age,
            "geo_location": input.location_id,
            "genders": input.gender,
            "interests": None,
            "behavior": None,
            "scholarities": None,
            "languages": None,
            "family_statuses": None,
            "all_fields": data,
            "targeting": None,
            "response": response,
            "lower_end": response["data"]["user_count"]["lower_end"],
            "upper_end": response["data"]["user_count"]["upper_end"],
            "user_count_stage": response["data"]["user_count_stage"]
        }
        return entry
    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")
        return None

MAX_WORKERS = 10

@app.post("/audience_estimate/")
def audience_estimate(input_list: InputList):
    results = []
    futures = []

    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
        for inp in input_list.inputs:
            futures.append(executor.submit(process_input, inp)) 
        for future in tqdm(as_completed(futures),total=len(futures),desc="Processing inputs"):
            res = future.result()
            if res:
                results.append(res)
    if not results:
        # without this branch the endpoint would silently return null
        return {"message": "No inputs could be processed", "output_file": None}
    results_df = pd.DataFrame(results)
    results_df.reset_index(inplace=True)
    results_df['timestamp'] = datetime.now()
    results_df.to_csv('output.csv', encoding='utf-8-sig')
    print("Results to csv done")
    return {"message":f"Processed {len(results)} inputs", "output_file": "output.csv"}

@app.get("/download_csv/")
def download_csv():
    return FileResponse("output.csv", media_type='text/csv', filename="output.csv")

uvicorn.run(app, host="0.0.0.0", port=8000)



In [None]:
payload = {"inputs": inputs}

response = requests.post(
    "http://localhost:8000/audience_estimate/",
    json=payload 
)