# Advertising on iffy news sites

The [Iffy Index of Unreliable Sources](https://iffy.news/index/#methodology) lists news sites that regularly fail fact-checks and is a widely used research tool in the field of mis/disinformation. It is based on ratings from Media Bias/Fact Check (MBFC). The dataset contains domain names and a compilation of credibility ratings from several sources. For details, see the methodology on the iffy.news site.

Websites that earn revenue from advertising commonly publish an ads.txt (Authorized Digital Sellers) file: a plain-text file hosted at the root of the web server, for example `https://example.com/ads.txt`. The file lists the ad systems and accounts that are authorized to sell the site's advertising inventory. By parsing this information for the news sites in the dataset, we might be able to find interesting relationships between sites' credibility and the ad sellers they work with.
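
To make the file format concrete, here is a minimal parsing sketch. It assumes the common record layout of comma-separated fields (ad system domain, publisher account ID, relationship `DIRECT` or `RESELLER`, and an optional certification authority ID), with `#` starting a comment. The names `AdsTxtRecord` and `parse_adstxt` and the sample content are hypothetical and not part of the original notebook.

```python
from typing import List, NamedTuple, Optional


class AdsTxtRecord(NamedTuple):
    ad_system: str                            # domain of the advertising system
    account_id: str                           # publisher's account ID in that system
    relationship: str                         # "DIRECT" or "RESELLER"
    cert_authority_id: Optional[str] = None   # optional fourth field


def parse_adstxt(text: str) -> List[AdsTxtRecord]:
    """Parse ads.txt content into records, ignoring anything that is not a record."""
    records = []
    for raw_line in text.splitlines():
        # Drop comments (everything after '#') and surrounding whitespace
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = [field.strip() for field in line.split(",")]
        # Variable declarations such as "contact=..." are not seller records
        if "=" in fields[0] or len(fields) < 3:
            continue
        relationship = fields[2].upper()
        if relationship not in {"DIRECT", "RESELLER"}:
            continue
        cert = fields[3] if len(fields) > 3 and fields[3] else None
        records.append(AdsTxtRecord(fields[0].lower(), fields[1], relationship, cert))
    return records


# Hypothetical sample content, for illustration only
sample = """\
# ads.txt for an example site
google.com, pub-0000000000000000, DIRECT, f08c47fec0942fa0
example-exchange.com, 12345, reseller  # trailing comment
contact=ads@example.com
"""

records = parse_adstxt(sample)
for record in records:
    print(record)
```

The sketch deliberately skips malformed lines instead of raising, since ads.txt files found in the wild are often hand-edited.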

In [10]:
import pandas as pd

newssites = pd.read_excel("iffy_index.xlsx")
newssites.head()

Unnamed: 0,Domain name,MBFC Factual,MBFC Cat,MBFC Cred,Global Site Rank,Wikipedia article,Misinfo.me,Fact-checks
0,100percentfedup.com,L,FN,L,72.284,,-0.91,100 Percent Fed Up
1,10news.one,L,FN,L,30000000.0,,0.0,10News.one
2,12minutos.com,L,FN,L,276.039,,-1.0,12minutos.com
3,163.com,M,FN,L,309.0,W,0.23,NetEase
4,1tv.ru,M,FN,L,2.598,W,0.28,Channel One Russia


## Data preparation

Misinfo.me rates the sites' credibility from low (-1.0) to high (1.0). To focus on the most interesting domain names, we keep only the sites with a non-negative rating (0 or higher), that is, sites that appear on the Iffy Index even though Misinfo.me does not rate them as low-credibility. Next, we overwrite the `Domain name` column, adding the prefix 'https://' to every domain name so that the values can be used with the `requests` library. This library is needed to send HTTP requests to the ads.txt URLs.

In [11]:
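# Keep non-negative ratings; .copy() avoids pandas' SettingWithCopyWarning below
newssites = newssites[newssites["Misinfo.me"] >= 0].copy()
newssites["Domain name"] = "https://" + newssites["Domain name"]
newssites.head()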

Unnamed: 0,Domain name,MBFC Factual,MBFC Cat,MBFC Cred,Global Site Rank,Wikipedia article,Misinfo.me,Fact-checks
1,https://10news.one,L,FN,L,30000000.0,,0.0,10News.one
3,https://163.com,M,FN,L,309.0,W,0.23,NetEase
4,https://1tv.ru,M,FN,L,2.598,W,0.28,Channel One Russia
6,https://2020electioncenter.com,VL,CP,L,4494303.0,,0.6,Banned.Video
8,https://24jours.com,L,FN,L,1162979.0,,0.25,24jours.com


Not every domain name will have an ads.txt file. We drop a record if the request to its ads.txt URL raises an error, returns a status code other than 200, or returns an HTML page instead of plain text, meaning the site or the ads.txt file isn't accessible for some reason.

In [12]:
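import requests
import warnings
from urllib3.exceptions import InsecureRequestWarning

# verify=False below triggers this warning on every request; silence only that one
warnings.filterwarnings("ignore", category=InsecureRequestWarning)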

def filter_adstxt(urls, timeout=10):
    """Return the URLs from `urls` that serve a plain-text ads.txt file."""
    filtered_urls = []

    headers = {
        "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
        "Accept":
        "text/plain"
    }
    
    for url in urls:
        ads_url = f"{url}/ads.txt"
        try:
            response = requests.get(ads_url, headers=headers, timeout=timeout, verify=False)
            if response.status_code == 200:
                text = response.text.lower()
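                # Filter out cases where the server returns an HTML page (e.g. a soft 404).
                # Match opening tags, not bare words, so domains like "somebody.com" pass.
                html_markers = ("<html", "<body", "<div", "<span")
                if not any(marker in text for marker in html_markers):
                    filtered_urls.append(url)
        except requests.RequestException:
            # Connection, TLS, or timeout error: skip the URL
            continue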
    
    return filtered_urls
        

adstxt_newssites = filter_adstxt(newssites["Domain name"].tolist())
filtered_newssites = newssites[newssites["Domain name"].isin(adstxt_newssites)]
filtered_newssites.head()
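
The HTML check in `filter_adstxt` is a heuristic. As a complementary sketch, a response could instead be accepted only if at least one line has the shape of an ads.txt record (three or more comma-separated fields, with `DIRECT` or `RESELLER` in the third). The helper name `looks_like_adstxt` and the sample strings are hypothetical and not part of the original notebook.

```python
def looks_like_adstxt(text: str) -> bool:
    """Return True if at least one line has the shape of an ads.txt record."""
    for raw_line in text.splitlines():
        # Ignore comments, which start with '#'
        line = raw_line.split("#", 1)[0].strip()
        fields = [field.strip() for field in line.split(",")]
        if len(fields) >= 3 and fields[2].upper() in {"DIRECT", "RESELLER"}:
            return True
    return False


# Hypothetical inputs, for illustration only
html_error_page = "<html><body><div>404 Not Found</div></body></html>"
valid_adstxt = "google.com, pub-0000000000000000, DIRECT, f08c47fec0942fa0"

print(looks_like_adstxt(html_error_page))
print(looks_like_adstxt(valid_adstxt))
```

Note the trade-off: this stricter check would also reject an empty ads.txt file, which a site may publish on purpose to declare that no sellers are authorized.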

## Sources

https://iffy.news/index/#methodology

https://support.google.com/adsense/answer/7679060?hl=en