<a href="https://colab.research.google.com/github/kiaonfire/Link_Checker/blob/main/Link_Checker.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This script checks the URLs in one column of a CSV file. It assigns each link a status based on the HTTP status code returned when the URL is requested; for example, a 200 response is labelled "Definitely Working".
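As a rough illustration of the status mapping, the sketch below mirrors the labels assigned later in this notebook. The function name `classify_status` is hypothetical and only used here; the notebook itself does this inline inside `check_with_aiohttp`.

```python
def classify_status(status: int) -> str:
    """Map an HTTP status code to the link-status label used in this notebook."""
    if status == 200:
        return "Definitely Working"
    if status in (201, 202, 203):
        return "Likely Working"
    if status in (301, 302):
        return "Likely Working (Redirect)"
    if status in (404, 410):
        return "Definitely Broken"
    if status in (400, 403):
        return "Requires Manual Check (Request Blocked)"
    # Anything not listed above is treated as probably broken
    return "Likely Broken"


for code in (200, 302, 404, 403, 500):
    print(code, "->", classify_status(code))
```

Any code that is not explicitly listed (for example 500 or 503) falls through to "Likely Broken".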

It requires that your CSV file has a column named "URL" or "Primary Web Address". If you would like it to check columns with different labels, let Kia know and they can update this.
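The column rule can be sketched as a small helper. This is only an illustration: `pick_url_column` and its `candidates` parameter are hypothetical names, and raising an error when neither heading is present is an assumption about how a missing column should be handled.

```python
def pick_url_column(columns, candidates=("URL", "Primary Web Address")):
    """Return the first candidate heading found in `columns`, preferring "URL"."""
    for name in candidates:
        if name in columns:
            return name
    # Assumed behaviour: fail early with a readable message
    raise ValueError(
        f"None of the expected URL columns {list(candidates)} were found in the CSV."
    )


print(pick_url_column(["Title", "Primary Web Address", "Type"]))
```

With a pandas DataFrame, the call would be `pick_url_column(df.columns)`.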

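For links that do not respond, the script retries the request a small number of times, waiting between attempts. The exact retry count and wait times are set in the `fetch_with_retries` function further down.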

Some URLs, such as government URLs, will not respond at all when queried. The script notes these URLs; it is likely that they have extra security features enabled.

Some URLs return inconsistent responses, in which case they are noted as requiring manual checking.

It also notes URLs that prompt an automatic download, as well as URLs that lead directly to PDFs.
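The download and PDF checks are heuristics based on the `Content-Type` header and the final URL. The sketch below reproduces the rules used later in this notebook as a standalone function; the name `detect_pdf_and_download` is hypothetical, and the example URLs are made up.

```python
def detect_pdf_and_download(content_type: str, final_url: str):
    """Return (is_pdf, is_download) using the notebook's header and URL heuristics."""
    ct = content_type.lower()
    url = final_url.lower()
    is_pdf = (
        "application/pdf" in ct
        or url.endswith(".pdf")
        or "pdf/" in url
        or "/pdfdirect/" in url
    )
    # Generic binary responses are also treated as downloads
    is_download = "application/pdf" in ct or "application/octet-stream" in ct
    return is_pdf, is_download


print(detect_pdf_and_download("text/html; charset=utf-8", "https://example.org/report.pdf"))
print(detect_pdf_and_download("application/octet-stream", "https://example.org/data.zip"))
```

Because the URL checks are substring matches, a page whose address merely contains `pdf/` is also flagged as a PDF, so flagged rows may still deserve a quick look.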

To update:


*   Multiple Links in same column - Done
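Multiple links in one cell are handled by splitting on a semicolon followed by whitespace. A minimal sketch of that rule on a single string follows; `split_links` is a hypothetical helper, and dropping empty pieces is an assumption that the notebook code itself does not make.

```python
import re


def split_links(cell: str):
    """Split a cell such as "a; b" into individual links."""
    # A semicolon inside a URL (not followed by whitespace) is left alone
    parts = re.split(r";\s+", cell)
    return [p.strip() for p in parts if p.strip()]


print(split_links("https://example.org/a; https://example.org/b?x=1;y=2"))
```

Requiring whitespace after the semicolon keeps URLs that contain a bare `;` in their query string in one piece.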

In [None]:
# This installs the packages needed
!pip install aiohttp nest_asyncio pandas



In [None]:
# This imports the packages into the environment
import pandas as pd
import aiohttp
import asyncio
import nest_asyncio
import random
from google.colab import files

nest_asyncio.apply()

In [None]:
# Upload your CSV file. After running this cell you should be able to choose a file to upload. The CSV must have a column headed "URL" or "Primary Web Address"; the script checks the URLs in that column.
uploaded = files.upload()
df = pd.read_csv(next(iter(uploaded)))
df.head()

Saving all_list_items_2025_05_23.csv to all_list_items_2025_05_23.csv


Unnamed: 0,Title,Chapter/Article Title,Item Link,Author(s),Editor(s),List Appearance,List Link,Type,Importance,ISBN10,...,TADC Request ID,TADC Request Status,TADC Bundle ID,Has Container,Primary Web Address,Secondary Web Address,Online Resource Web Address,Online Resource Source,Has File Uploaded,File Type
0,MIT Sloan Management Review,Improve Key Performance Indicators With AI,http://lr.library.uq.edu.au/items/6d4eba67-8e1...,Candelon; François; Chu; Michael; Khodabandeh;...,,BISM7807 St Lucia,http://lr.library.uq.edu.au/lists/CC6BA6D2-6F5...,Article,Recommended,,...,,,,tenantSections:BF31DD5B-BC74-0153-4684-1C23814...,https://sloanreview.mit.edu/article/improve-ke...,,https://sloanreview.mit.edu/article/improve-ke...,Web Address,No,
1,Harvard Business Review,,http://lr.library.uq.edu.au/items/71cbcd77-acc...,Harvard Business School,,BISM7807 St Lucia,http://lr.library.uq.edu.au/lists/CC6BA6D2-6F5...,Webpage,Recommended,,...,,,,tenantSections:07E8E420-FFF4-35C8-EF99-B209309...,https://hbr.org/2023/11/how-ai-fits-into-lean-...,,https://search.ebscohost.com/login.aspx?authty...,Web Address,No,
2,Research Methods in Applied Linguistics,Research methods in L2 writing: Interdisciplin...,http://lr.library.uq.edu.au/items/fb9c0567-543...,Crosthwaite; Peter,,SLAT7827 St Lucia,http://lr.library.uq.edu.au/lists/93D696ED-969...,Article,Required,,...,,,,tenantSections:2120f0c0-ec11-44e4-a592-bb7f8a4...,,,,DOI,No,
3,The SAGE Handbook of Human Trafficking and Mod...,The International Legal Framework on Human Tra...,http://lr.library.uq.edu.au/items/3c29262a-19f...,McAdam; Marika,,POLS7208 St Lucia,http://lr.library.uq.edu.au/lists/0C4D7525-431...,Chapter,Required,,...,,,,tenantSections:55198f15-4bdc-4ddf-8273-13946d0...,https://sk.sagepub.com/reference/the-sage-hand...,https://sk.sagepub.com/reference/the-sage-hand...,https://sk.sagepub.com/reference/the-sage-hand...,DOI,No,
4,Trafficking in Persons Report: Burma (Tier 3),,http://lr.library.uq.edu.au/items/e36b7c3e-14f...,US Department of State,,POLS7208 St Lucia,http://lr.library.uq.edu.au/lists/0C4D7525-431...,Report,Recommended,,...,,,,tenantSections:55198f15-4bdc-4ddf-8273-13946d0...,https://www.state.gov/reports/2024-trafficking...,,https://www.state.gov/reports/2024-trafficking...,Web Address,No,


In [None]:
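# Determine which column to use for URLs
url_column = 'URL' if 'URL' in df.columns else 'Primary Web Address'
if url_column not in df.columns:
    raise KeyError("The CSV needs a column named 'URL' or 'Primary Web Address'.")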

In [None]:
# Drop rows where the URL column is missing or NaN
# (.copy() gives an independent DataFrame, which avoids pandas' SettingWithCopyWarning in the next cell)
df = df.dropna(subset=[url_column]).copy()

In [None]:
import re

# Split cells holding several links on "; " (a semicolon followed by whitespace), giving one row per link
df[url_column] = df[url_column].astype(str)
df = df.assign(**{url_column: df[url_column].apply(lambda x: re.split(r';\s+', x))})
df = df.explode(url_column)
df[url_column] = df[url_column].str.strip()


In [None]:
# Retry helper: makes up to max_retries attempts per URL (2 by default). The wait between attempts is 6**attempt seconds (1s, then 6s, ...), capped at 30 seconds.
# Keep max_retries at 2 or 3 for speedier results; more retries than that are unlikely to help much.
async def fetch_with_retries(session, url, headers, proxy=None, max_retries=2):
    timeout = aiohttp.ClientTimeout(total=30)

    for attempt in range(max_retries):
        try:
            # Reuse the session passed in by the caller; opening a new session per URL would leak connections
            async with session.get(url, headers=headers, proxy=proxy, allow_redirects=True, timeout=timeout) as response:
                return response
        except Exception as e:
            if attempt < max_retries - 1:
                wait_time = min(6 ** attempt, 30)
                print(f"Attempt {attempt+1} failed for {url}. Retrying in {wait_time}s...")
                await asyncio.sleep(wait_time)
            else:
                print(f"All retries failed for {url}: {type(e).__name__} - {e}")
                return None
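
The wait schedule can be previewed without making any requests. The sketch below assumes a base of 6 and the 30-second cap mentioned in the comment above; `backoff_delays` and its parameters are hypothetical names used only for illustration.

```python
def backoff_delays(max_retries: int = 2, base: int = 6, cap: int = 30):
    """Waits (in seconds) between consecutive attempts; no wait follows the final attempt."""
    return [min(base ** attempt, cap) for attempt in range(max_retries - 1)]


print(backoff_delays(2))  # one wait between two attempts
print(backoff_delays(4))  # three waits between four attempts
```

With the default of two attempts there is only a single one-second wait, so raising `max_retries` is what makes the exponential growth noticeable.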

In [None]:
# This is the main function for producing the results in the file. It maps each HTTP status code to a result label and skips entries without a URL. The headers defined at the top of the function mimic a real browser.
async def check_with_aiohttp(session, url, proxy=None):
    if not isinstance(url, str) or url.strip() == "":
        return ("No URL", None, None, None, None, None)

    headers = {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/120.0.0.0 Safari/537.36"
        ),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        "Referer": "https://www.google.com",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1"
    }

    response = await fetch_with_retries(session, url, headers, proxy)
    if response:
        status = response.status
        redirects = len(response.history)
        final_url = str(response.url)
        content_type = response.headers.get("Content-Type", "").lower()

        # Detect if it's a PDF
        is_pdf = (
            "application/pdf" in content_type or
            final_url.lower().endswith(".pdf") or
            "pdf/" in final_url.lower() or
            "/pdfdirect/" in final_url.lower()
        )
        file_download = "Yes" if "application/pdf" in content_type or "application/octet-stream" in content_type else "No"
        pdf_flag = "Direct PDF" if is_pdf else "No"
        status_label = ""

        if status == 200:
            status_label = "Definitely Working"
        elif status in [201, 202, 203]:
            status_label = "Likely Working"
        elif status in [301, 302]:
            status_label = "Likely Working (Redirect)"
        elif status in [404, 410]:
            status_label = "Definitely Broken"
        elif status in [403, 400]:
            status_label = "Requires Manual Check (Request Blocked)"
        else:
            status_label = "Likely Broken"


        if is_pdf:
            status_label += " (Direct PDF)"

        return (status_label, status, redirects, final_url, file_download, pdf_flag)

    return ("No Response (Likely Government or PQ URL)", None, None, None, "Unknown", "Unknown")

In [None]:
# This is the main function that begins the checking. It can take a while, depending on the file size.
async def main(proxy_url=None):
    async with aiohttp.ClientSession() as session:
        results = []
        for url in df[url_column]:
            if isinstance(url, str) and any(domain in url for domain in ["proquest.com", "resolver.library.uq", "ebookcentral.proquest"]):
                delay = random.randint(3, 5)
                print(f"Delaying {delay}s before checking: {url}")
                await asyncio.sleep(delay)

            result = await check_with_aiohttp(session, url, proxy=proxy_url)
            results.append(result)

        df['Link Status'], df['HTTP Status Code'], df['Redirect Count'], df['Final URL'], df['File Download'], df['PDF Detected'] = zip(*results)

await main()

Delaying 3s before checking: https://ebookcentral.proquest.com/lib/uql/detail.action?docID=1584059
Attempt 1 failed for https://ebookcentral.proquest.com/lib/uql/detail.action?docID=1584059. Retrying in 1s...
Attempt 2 failed for https://ebookcentral.proquest.com/lib/uql/detail.action?docID=1584059. Retrying in 8s...
All retries failed for https://ebookcentral.proquest.com/lib/uql/detail.action?docID=1584059: TooManyRedirects - 0, message='', url='https://ebookcentral.proquest.com/lib/uql/detail.action?docID=1584059'
Delaying 3s before checking: https://ebookcentral.proquest.com/lib/uql/detail.action?docID=310866
Attempt 1 failed for https://ebookcentral.proquest.com/lib/uql/detail.action?docID=310866. Retrying in 1s...
Attempt 2 failed for https://ebookcentral.proquest.com/lib/uql/detail.action?docID=310866. Retrying in 8s...
All retries failed for https://ebookcentral.proquest.com/lib/uql/detail.action?docID=310866: TooManyRedirects - 0, message='', url='https://ebookcentral.proquest

In [None]:
# Use this to download the final CSV file
df.head()
df.to_csv("output.csv", index=False)
files.download("output.csv")
