# Advertising on iffy news sites

The [Iffy Index of Unreliable Sources](https://iffy.news/index/#methodology) is a list of news sites that regularly fail fact-checks, compiled from Media Bias/Fact Check (MBFC) ratings. It is a widely used research tool in the field of mis/disinformation. The data contains domain names and a compilation of credibility ratings from different sources. For details, see the methodology on the iffy.news site.

Websites that earn revenue from advertising often publish an ads.txt (Authorized Digital Sellers) file. It is a plain text file that publishers host at the root of their web server, for example `https://example.com/ads.txt`. The file lists the advertising systems and accounts that are authorized to sell the site's ad inventory. By parsing this information for the news sites in the dataset, we might be able to find interesting relationships between the sites' credibility and their advertising partners.
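
To make the file format concrete, here is a minimal sketch that splits a few ads.txt records into their fields. The sample lines are made up for illustration: the domains `adnetwork.example` and `exchange.example` and all IDs are placeholders, not real data. The helper name `split_record` and the dictionary keys are also assumptions introduced here, not part of the notebook.

```python
# Hypothetical ads.txt content; every domain and ID below is a placeholder.
sample_adstxt = """\
# ads.txt for news.example
adnetwork.example, pub-1234567890, DIRECT, abc123def456
exchange.example, 98765, RESELLER
"""

def split_record(line):
    """Split one ads.txt line into its fields, or return None for non-records."""
    # Drop a trailing comment and surrounding whitespace
    line = line.split("#", 1)[0].strip()
    fields = [field.strip() for field in line.split(",")]
    # A data record has three required fields and an optional fourth
    if len(fields) not in (3, 4) or not all(fields[:3]):
        return None
    return {
        "ad_system": fields[0].lower(),
        "account_id": fields[1],
        "relationship": fields[2].upper(),
        "cert_authority_id": fields[3] if len(fields) == 4 else None,
    }

records = [r for r in map(split_record, sample_adstxt.splitlines()) if r]
for record in records:
    print(record)
```

The comment line is skipped, and the second record has no certification authority ID because that field is optional.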

In [2]:
import pandas as pd

newssites = pd.read_excel("iffy_index.xlsx")
newssites.head()

| Unnamed: 0 | Domain name | MBFC Factual | MBFC Cat | MBFC Cred | Global Site Rank | Wikipedia article | Misinfo.me | Fact-checks |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 100percentfedup.com | L | FN | L | 72.284 | | -0.91 | 100 Percent Fed Up |
| 1 | 10news.one | L | FN | L | 30000000.0 | | 0.0 | 10News.one |
| 2 | 12minutos.com | L | FN | L | 276.039 | | -1.0 | 12minutos.com |
| 3 | 163.com | M | FN | L | 309.0 | W | 0.23 | NetEase |
| 4 | 1tv.ru | M | FN | L | 2.598 | W | 0.28 | Channel One Russia |


## Data preparation

Misinfo.me rates the sites' credibility from low (-1.0) to high (1.0). The filter below keeps only the domain names with a rating of 0 or higher; rows with a negative or missing rating are dropped. Next, we add the prefix 'https://' to every entry in the `Domain name` column so that the values can be used with the `requests` library. This library is needed to send HTTP requests to the ads.txt URLs.

In [3]:
# .copy() avoids pandas' SettingWithCopyWarning when the column is modified below
newssites = newssites[newssites["Misinfo.me"] >= 0].copy()
newssites["Domain name"] = "https://" + newssites["Domain name"]
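
As a quick check of these two steps, the sketch below applies them to a small hand-made frame that reuses the five domains and Misinfo.me scores from the preview of the index shown earlier. The names `sample` and `filtered` are introduced here only for illustration. Calling `.copy()` after filtering is one way to get an independent frame before a column is modified.

```python
import pandas as pd

# Five rows copied from the preview of the index shown earlier
sample = pd.DataFrame({
    "Domain name": ["100percentfedup.com", "10news.one", "12minutos.com", "163.com", "1tv.ru"],
    "Misinfo.me": [-0.91, 0.0, -1.0, 0.23, 0.28],
})

# Same boolean mask as in the cell above; .copy() gives an independent frame
filtered = sample[sample["Misinfo.me"] >= 0].copy()
filtered["Domain name"] = "https://" + filtered["Domain name"]

print(filtered["Domain name"].tolist())
```

Only the three domains with a score of 0 or higher remain, each with the `https://` prefix.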

Next, we request the ads.txt file of every site and keep the responses that look like plain text rather than an HTML error page.

In [22]:
import requests
import warnings

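# verify=False below makes urllib3 emit an InsecureRequestWarning for every request;
# silence warnings to keep the output readable
warnings.filterwarnings("ignore")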

def request_adstxt(urls, timeout=10):
    adstxt_contents = []

    headers = {
        "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
        "Accept":
        "text/plain"
    }
    
    for url in urls:
        ads_url = f"{url}/ads.txt"
        try:
            response = requests.get(ads_url, headers=headers, timeout=timeout, verify=False)
            if response.status_code == 200:
                text = response.text.lower()
                # Filter out cases where the server returns an HTML error page
                if not any(tag in text for tag in ("<html", "<body", "<div", "<span")):
                    adstxt_contents.append({"domain": url, "text": text})
        except Exception as e:
            # If an error occurs, skip the URL
            print(f"Error occurred for {url}: {str(e)}")
    
    return adstxt_contents
        

newssite_urls = newssites["Domain name"].tolist()
adstxt_contents = request_adstxt(newssite_urls[0:10])

Error occurred for https://10news.one: HTTPSConnectionPool(host='10news.one', port=443): Max retries exceeded with url: /ads.txt (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x00000185F04EE1D0>: Failed to resolve '10news.one' ([Errno 11002] getaddrinfo failed)"))
Error occurred for https://163.com: HTTPSConnectionPool(host='163.com', port=443): Max retries exceeded with url: /ads.txt (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x00000185F0150E90>, 'Connection to 163.com timed out. (connect timeout=10)'))
Error occurred for https://1tv.ru: HTTPSConnectionPool(host='www.1tv.ru', port=443): Max retries exceeded with url: /ads.txt (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x00000185F04BA0D0>, 'Connection to www.1tv.ru timed out. (connect timeout=10)'))
Error occurred for https://2020electioncenter.com: HTTPSConnectionPool(host='2020electioncenter.com', port=443): Max retries exceeded with u
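
The check for HTML inside `request_adstxt` is a heuristic. As an alternative sketch, the helper below combines the `Content-Type` response header with a search for opening HTML tags, so that ordinary words containing `div` or `span` do not cause a valid file to be discarded. The name `looks_like_adstxt` is hypothetical and not part of the notebook, and the sketch assumes that servers label their error pages as `text/html`, which is common but not guaranteed.

```python
import re

# Opening tags that typically appear in an HTML error page
HTML_TAG = re.compile(r"<\s*(!doctype|html|head|body|div|span)\b", re.IGNORECASE)

def looks_like_adstxt(text, content_type=""):
    """Return True if a response body plausibly is an ads.txt file, not HTML."""
    # An explicit HTML content type is a strong signal for an error page
    if "html" in content_type.lower():
        return False
    # Otherwise fall back to scanning the body for HTML tags
    return HTML_TAG.search(text) is None

# Placeholder inputs, not real ads.txt data
print(looks_like_adstxt("adnetwork.example, pub-1, DIRECT", "text/plain"))
print(looks_like_adstxt("<!DOCTYPE html><html><body>Not found</body></html>"))
print(looks_like_adstxt("spandex-ads.example, 42, RESELLER"))
```

Inside `request_adstxt`, such a helper could be called as `looks_like_adstxt(response.text, response.headers.get("Content-Type", ""))`.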

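Ads.txt files need to follow a specific format so that advertising systems can crawl them. Google AdSense documents the formatting guidelines: each record is one line of comma-separated fields, namely the domain of the advertising system, the publisher's account ID, the type of relationship (DIRECT or RESELLER) and, optionally, a certification authority ID (the "Tag ID" column below). The function below extracts these fields with a regular expression.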

In [21]:
import re

def parse_adstxt(adstxt_content):
    ad_networks = []
    seller_account_ids = []
    account_types = []
    tag_ids = []
    comments = []

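    # Fields: ad network domain, numeric account ID, account type, tag ID,
    # and an optional one-word comment introduced by '#'
    pattern = r'^([^,]+),\s*(\d+),\s*(\w+),\s*([\w\d]+)(\s*#\w+)?'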

    for line in adstxt_content["text"].split("\n"):
        match = re.match(pattern, line)
        if match:
            ad_network, seller_account_id, account_type, tag_id, comment = match.groups()
            ad_networks.append(ad_network)
            seller_account_ids.append(seller_account_id)
            account_types.append(account_type)
            tag_ids.append(tag_id)
            comments.append(comment if comment else '')
    
    df = pd.DataFrame({
        "Domain name": adstxt_content["domain"],
        'Ad network': ad_networks,
        'Seller account ID': seller_account_ids,
        'Account type': account_types,
        'Tag ID': tag_ids,
        'Comment': comments
    })

    return df

ads_df = parse_adstxt(adstxt_contents[3])
ads_df.head()

| Unnamed: 0 | Domain name | Ad network | Seller account ID | Account type | Tag ID | Comment |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | https://abovetopsecret.com | pubmatic.com | 156077 | reseller | 5d62403b186f2ace | |
| 1 | https://abovetopsecret.com | contextweb.com | 560489 | reseller | 89ff185a4c4e857c | |
| 2 | https://abovetopsecret.com | rubiconproject.com | 21642 | reseller | 0bfd66d529a55807 | |
| 3 | https://abovetopsecret.com | rubiconproject.com | 21434 | reseller | 0bfd66d529a55807 | |
| 4 | https://abovetopsecret.com | pubmatic.com | 158136 | reseller | 5d62403b186f2ace | |
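
The pattern in `parse_adstxt` only matches records whose account ID consists of digits and that include a fourth field. Account IDs such as `pub-1234567890` contain other characters, and the fourth field is optional, so such records are skipped. The following is a sketch of a more lenient parser that splits on commas instead of using a regular expression. The name `parse_adstxt_lenient` is hypothetical, and the sample input is made up for illustration.

```python
import pandas as pd

def parse_adstxt_lenient(adstxt_content):
    """Parse ads.txt text into a DataFrame, one row per data record."""
    rows = []
    for line in adstxt_content["text"].splitlines():
        # Separate an optional trailing comment from the record itself
        record, _, comment = line.partition("#")
        fields = [field.strip() for field in record.split(",")]
        # Skip blank lines, comment-only lines and variables such as contact=...
        if len(fields) < 3 or not all(fields[:3]):
            continue
        rows.append({
            "Domain name": adstxt_content["domain"],
            "Ad network": fields[0],
            "Seller account ID": fields[1],
            "Account type": fields[2],
            # The fourth field is optional
            "Tag ID": fields[3] if len(fields) > 3 else "",
            "Comment": comment.strip(),
        })
    columns = ["Domain name", "Ad network", "Seller account ID",
               "Account type", "Tag ID", "Comment"]
    return pd.DataFrame(rows, columns=columns)

# Made-up input in the same shape that request_adstxt returns
sample = {
    "domain": "https://news.example",
    "text": "adnetwork.example, pub-1234567890, direct, abc123 # main account\n"
            "exchange.example, 98765, reseller\n"
            "# a comment line\n",
}
lenient_df = parse_adstxt_lenient(sample)
print(lenient_df)
```

Splitting on commas trades strictness for coverage: malformed lines with at least three non-empty fields are accepted, so a validation step on the account type may still be useful.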


## Sources

https://iffy.news/index/#methodology

https://support.google.com/adsense/answer/7679060?hl=en