# Exploring the Economics of Russian State Media Content Laundering

Recent research shows that networks of websites serving as conduits for Russian state and pro-Kremlin media continue to operate in EU member states despite the restrictions in place.

This is mostly achieved through networks of mirror websites that, by posing as ‘independent’ or ‘alternative’ news outlets, distribute Russian state media content by stealth. They operate either under new domain names or under domain names related to the banned outlets. One example is sputnikglobe.com, a mirror site of Sputnik News that is up and running in countries like the UK and the Netherlands.


This dataset includes mirror websites and reposted op-eds for the banned Russian news site RT.com, obtained by running the domain through disinfo.id.

In [None]:
from google.colab import files
uploaded = files.upload()

Saving Output - Op-eds Russia Today (1).xlsx to Output - Op-eds Russia Today (1) (2).xlsx


## Preparation

Load the Excel file into a pandas DataFrame. Some records are tagged as social media posts or other irrelevant sites (such as RT.com itself). We will filter those out for now. We'll also prefix each value in the Domain column with 'https://' so it can be passed directly to the `requests` library.

In [None]:
import pandas as pd

content = pd.read_excel("Output - Op-eds Russia Today (1).xlsx")
content = content[content["Irrelevant sites"].isnull()].copy()  # copy so later column assignments do not hit a filtered view
content["Domain"] = "https://" + content["Domain"]
content.head()
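
The prefixing step above assumes every value in the Domain column is a bare host name. As a minimal sketch, a hypothetical helper `ensure_https` (not part of the original notebook) could guard against values that already carry a scheme or contain stray whitespace:

```python
def ensure_https(domain):
    """Return the domain with an 'https://' prefix, without doubling an existing scheme."""
    domain = domain.strip()
    # Drop any existing scheme so that 'http://example.com' is not double-prefixed
    for scheme in ("https://", "http://"):
        if domain.lower().startswith(scheme):
            domain = domain[len(scheme):]
            break
    return "https://" + domain.rstrip("/")

print(ensure_https("swentr.site"))
print(ensure_https(" http://example.com/ "))
```

If such a helper were adopted, it could be applied with `content["Domain"].map(ensure_https)` in place of the plain string concatenation.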

### Global Variables

We want to compare the list of resulting sites that could be possible mirrors for RT. We also need to provide a User-Agent for the request to be able to scrape the data from certain sites.

In [None]:
SEARCHED_URL = "https://www.rt.com"
RESULT_DOMAINS = content["Domain"].unique().tolist()
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36"

## Metadata analysis

According to disinfo.id, certain metadata indicators make RT.com unique. A verification ID is a unique identifier used to verify ownership or authenticity of a website or online account. Such IDs can establish the legitimacy of a site or account, potentially linking it to a specific owner or organization.

We will use a function to scrape meta tags from the source website (RT.com) and extract any whose name attribute contains the word 'verification'. The value of the content attribute for each of those tags is saved in a dictionary.

In [None]:
import requests
from bs4 import BeautifulSoup, SoupStrainer

def fetch_verification_tags(url, timeout=10):
  """
  Fetch meta tags whose names contain 'verification' from the given URL.
  """
  try:
    # Set a User-Agent and timeout for the HTTP request
    headers = {"User-Agent": USER_AGENT}
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()

    # Parse only the meta tags to improve performance
    parse_only = SoupStrainer("meta")
    soup = BeautifulSoup(response.text, "html.parser", parse_only=parse_only)

    # Find all meta tags with 'name' attribute containing 'verification'
    verification_tags = soup.find_all("meta", attrs={"name": lambda x: x and "verification" in x.lower()})
    # Extract 'content' attribute
    if verification_tags:
      return {tag['name']: tag.get('content', '') for tag in verification_tags if tag.has_attr('content')}
    else:
      return {}
  except Exception as e:
    print(f"Error fetching verification tags from {url}: {e}")
    return {}
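
To check the extraction logic without making a network request, the same idea can be tried on an inline HTML string. The sketch below swaps in the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra dependencies; the class name `VerificationMetaParser`, the sample HTML, and the IDs in it are made up for illustration.

```python
from html.parser import HTMLParser

class VerificationMetaParser(HTMLParser):
    """Collect <meta> tags whose name attribute contains 'verification'."""

    def __init__(self):
        super().__init__()
        self.tags = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        # Keep the tag only if it has a content attribute, as in the function above
        if "verification" in name.lower() and attributes.get("content") is not None:
            self.tags[name] = attributes["content"]

# Made-up HTML for illustration; the IDs are placeholders, not real values
SAMPLE_HTML = """
<html><head>
  <meta name="description" content="Example page">
  <meta name="google-site-verification" content="abc123">
  <meta name="yandex-verification" content="def456">
</head><body></body></html>
"""

parser = VerificationMetaParser()
parser.feed(SAMPLE_HTML)
print(parser.tags)
```

Only the two tags whose names contain 'verification' end up in the dictionary; the description tag is ignored.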

Next we'll use a function that gathers the unique result URLs from the DataFrame, and compares the tags to the ones for RT.com. If one or more of the tags have matching Verification IDs, it's probably a mirror site. If this is the case, adjust the column 'Mirror / Reposter?' for that URL.

In [None]:
def compare_tags(source_url, target_urls, df):
    """
    Compare verification meta tags between a source URL and a list of target URLs.
    Mark matching targets as 'Mirror' in df and return the list of matching URLs.
    """
    source_tags = fetch_verification_tags(source_url)
    matching_urls = []

    for target_url in target_urls:
        target_tags = fetch_verification_tags(target_url)

        # Keep tags present on both sites with an identical value;
        # empty values are skipped so two blank IDs do not count as a match
        matches = {name: value for name, value in target_tags.items()
                   if value and source_tags.get(name) == value}

        if matches:
            print(f"Matching metatags: {target_url} ({', '.join(matches)})")
            # Update the 'Mirror / Reposter?' column
            df.loc[df["Domain"] == target_url, "Mirror / Reposter?"] = "Mirror"
            matching_urls.append(target_url)
        else:
            df.loc[df["Domain"] == target_url, "Mirror / Reposter?"] = ""

    return matching_urls
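
The matching rule itself can be illustrated on plain dictionaries, without any scraping. This is a minimal sketch: the helper name `shared_tags` is hypothetical and the IDs are placeholders rather than real verification values.

```python
def shared_tags(source_tags, target_tags):
    """Return the tags present in both dictionaries with identical values."""
    return {
        name: value
        for name, value in target_tags.items()
        if source_tags.get(name) == value
    }

# Placeholder values, not real verification IDs
source = {"google-site-verification": "abc123", "yandex-verification": "def456"}
mirror = {"google-site-verification": "abc123"}
unrelated = {"google-site-verification": "zzz999"}

print(shared_tags(source, mirror))
print(shared_tags(source, unrelated))
```

A single shared ID is enough to produce a non-empty result, which mirrors the "one or more matching tags" rule described above. A tag with the same name but a different value does not count.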

## CSS class analysis

Other identifying features are the CSS classes a website uses. Similarly designed websites will share CSS classes. We will scrape all the unique CSS classes from RT.com.

In [None]:
def scrape_css(url, timeout=10):
  try:
    headers = {"User-Agent": USER_AGENT}
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()

    # Parse the full document: class attributes can appear on any element,
    # so restricting the parse to <meta> tags would miss nearly all of them
    soup = BeautifulSoup(response.text, "html.parser")

    used_classes = set()

    for elem in soup.select("[class]"):
      classes = elem["class"]
      used_classes.update(classes)

    return used_classes
  except Exception as e:
    print(f"Error scraping CSS classes from {url}: {e}")
    return set()

Compare the unique CSS classes from the source URL to the ones from each site in the dataset. If at least 90% of the source's classes also appear on a target site, return that URL as a probable mirror.

In [None]:
def compare_css(source_url, target_urls):
    """
    Compare CSS classes from a base URL to CSS classes from a list of other URLs.
    Return URLs with 90% or more common classes.
    """
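    source_css_classes = scrape_css(source_url)
    if not source_css_classes:
        # Nothing to compare against (e.g. the source request failed); avoids dividing by zero
        return []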

    matching_urls = []
    for url in target_urls:
        css_classes = scrape_css(url)
        common_classes = source_css_classes.intersection(css_classes)

        # Calculate the percentage of common classes
        percentage_common = len(common_classes) / len(source_css_classes)

        if percentage_common >= 0.9:
          matching_urls.append(url)

    return matching_urls
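
The threshold calculation can be worked through on small sets. In the sketch below the class names are made up, and `overlap_ratio` and `jaccard` are hypothetical helper names. Note that the ratio used above is measured against the source's classes only, so extra classes on the target do not lower the score; a symmetric measure such as the Jaccard index would, and is shown for comparison.

```python
def overlap_ratio(source_classes, target_classes):
    """Share of the source's classes that also appear on the target (0.0 if source is empty)."""
    if not source_classes:
        return 0.0
    return len(source_classes & target_classes) / len(source_classes)

def jaccard(a, b):
    """Symmetric alternative: shared classes divided by all distinct classes."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Made-up class names for illustration
source = {"header", "nav", "article", "footer", "card", "media", "btn", "logo", "row", "col"}
target = (source - {"col"}) | {"promo", "banner"}

print(overlap_ratio(source, target))  # 9 of the 10 source classes are shared
print(jaccard(source, target))        # 9 shared out of 12 distinct classes
```

Here the target would just reach the 90% threshold under the source-relative ratio, while the symmetric measure gives a lower score because of the two extra classes.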

## Conclusion

First we will compare the Verification IDs. Print the matching tags and update the DataFrame when a site is classified as a mirror to RT.com. Some sites may still result in errors, so check those manually.

In [None]:
rt_tags = fetch_verification_tags(SEARCHED_URL)
matching_tags = compare_tags(SEARCHED_URL, RESULT_DOMAINS, content)
print(matching_tags)
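
Because failed requests have to be checked manually, it may help to collect them in one place rather than only printing them. The following is a rough sketch under stated assumptions: `fetch_all` and `fake_fetch` are hypothetical names, and the sketch assumes a fetcher that raises on failure, unlike the notebook's functions, which catch exceptions and return an empty result.

```python
def fetch_all(urls, fetch):
    """Call fetch(url) for each URL; return (results, failures) for later review."""
    results, failures = {}, {}
    for url in urls:
        try:
            results[url] = fetch(url)
        except Exception as exc:  # broad on purpose: any failure goes to manual review
            failures[url] = str(exc)
    return results, failures

# Stand-in fetcher for illustration; a real one would issue an HTTP request
def fake_fetch(url):
    if "down" in url:
        raise TimeoutError("connection timed out")
    return {"google-site-verification": "abc123"}

results, failures = fetch_all(["https://ok.example", "https://down.example"], fake_fetch)
print(failures)
```

The `failures` dictionary then gives a list of URLs to revisit by hand, together with the reason each request failed.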

Compare CSS classes from all the unique domains to RT.com. Are the results the same as the matches that came up when comparing the Verification IDs?

In [53]:
rt_classes = scrape_css(SEARCHED_URL)
matching_css = compare_css(SEARCHED_URL, RESULT_DOMAINS)
print(matching_css)

Error scraping CSS classes from https://www.taghribnews.com: HTTPSConnectionPool(host='www.taghribnews.com', port=443): Max retries exceeded with url: / (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x79d7f66777f0>, 'Connection to www.taghribnews.com timed out. (connect timeout=10)'))
Error scraping CSS classes from https://pk.shafaqna.com: 500 Server Error: Internal Server Error for url: https://pk.shafaqna.com/
Error scraping CSS classes from https://www.pk.shafaqna.com: 500 Server Error: Internal Server Error for url: https://www.pk.shafaqna.com/
Error scraping CSS classes from https://archive.ph: HTTPSConnectionPool(host='archive.ph', port=443): Read timed out. (read timeout=10)
Error scraping CSS classes from https://www.lebanonnewsapp.com: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
Error scraping CSS classes from https://news.nestia.com: 404 Client Error: Not Found for url: https://news.nestia.com/
Error scraping 

Which sites are exact mirrors to RT.com, and what location are they based in?

In [54]:
filtered_content = content[content["Mirror / Reposter?"] == "Mirror"] # Keep only rows classified as mirrors
unique_mirrors = filtered_content["Domain"].unique().tolist() # Unique domains among those rows

for domain in unique_mirrors:
  rows = filtered_content[filtered_content["Domain"] == domain]
  locations = rows["Location"].tolist()
  print(f"Domain: {domain}")
  print(f"Locations: {locations}")

Domain: https://swentr.site
Locations: ['USA', 'USA', 'USA', 'USA']
Domain: https://admin.sonsuz1.com
Locations: ['USA']
Domain: https://doc.swentr.site
Locations: ['USA']


Save the resulting DataFrame to an Excel file.

In [None]:
content.to_excel("Output - Op-eds Russia Today (1).xlsx")