# All of the URLs in a sitemap should contain canonical tags and return a 200 status code. With this script, you can pinpoint the URLs that break these guidelines.

Credit: @RankSense ranksense.com

Run this in Colab:  https://colab.research.google.com/github/anirudh-tatavarthi/Twittorials/blob/master/XML_Sitemap_Audit.ipynb 

[1] This first block installs the necessary libraries for the code to run

In [None]:
%%capture
!pip install --upgrade -q gspread
!pip install requests-html

[2] The block below imports everything needed

In [4]:
import gspread
import requests
from bs4 import BeautifulSoup
from google.colab import auth
from oauth2client.client import GoogleCredentials
from requests_html import HTMLSession

#Authenticate this Colab session, then authorize gspread with the default credentials
auth.authenticate_user()
gc = gspread.authorize(GoogleCredentials.get_application_default())

ModuleNotFoundError: No module named 'google.colab'

[3] When you run the block above, you will be asked to provide an **authorization code** from the link shown.

[4] Next, you will have to **enter the name of the spreadsheet** that the results should be written to. Feel free to make a copy of this template: https://docs.google.com/spreadsheets/d/1KH1Jxsx77xwv-1wbrq3K5_Ga6-jUGibYWSIPBQw46IA/edit?usp=sharing


**Note:** You can also specify which columns you would like each piece of information to be displayed in. (You can keep the values already listed)

In [6]:
#Replace with your spreadsheet name
spreadsheetName = "Copy of Sitemap Canonical Information [Template]" #@param {type:"string"}

sheet = gc.open(spreadsheetName).sheet1

#Enter the column you'd like the URLs to be displayed:
URLcol =   1#@param {type:"number"}

#Enter the column you'd like the canonical URLs to be displayed:
Canonicalcol =   2#@param {type:"number"}

#Enter the column you'd like the self-referential canonical information to be displayed:
ContainsSelfRefCanonicalCol =   3#@param {type:"number"}

#Enter the column you'd like the status code to be displayed:
StatusCodecol =   4#@param {type:"number"}

[5] Next, you will be prompted to **enter the URL of the sitemap** that you would like to audit.

[6] The code will go through the specified sitemap and output the URLs contained in the sitemap, the canonical URL (if applicable), whether or not the canonical is self-referential, and the status code.

In [7]:
session = HTMLSession()

#Prompts user to input a sitemap URL to analyze
sitemap_url = input("Please enter the sitemap URL: ")

#Finds any rel=canonical and grabs the href
canonical_xpath = "//link[@rel='canonical']/@href"

#Stores the webpage from the URL in a response object (r)
r = requests.get(sitemap_url)
xml=r.text

#We use Beautiful Soup to pull the data we need out of the sitemap XML
soup = BeautifulSoup(xml, "html.parser")
#Find the "loc" tags, which contain the URLs
URLs = soup.find_all('loc')

print("The number of URLs found in the sitemap are {0}".format(len(URLs)))
num_of_URLs = len(URLs)

#If you would like your output to start on a different row, modify this variable
start_row =   2#@param {type:"number"}

Please enter the sitemap URL: https://www.p-tech.org.uk/page-sitemap.xml
The number of URLs found in the sitemap are 41
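
The parsing step above can also be tried offline. The sketch below is not the notebook's method: it swaps in the standard library's `xml.etree.ElementTree` instead of Beautiful Soup so that it runs without extra installs. `SAMPLE_SITEMAP` and `extract_locs` are hypothetical names introduced only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-URL sitemap, used only for illustration
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc> https://example.com/about/ </loc></url>
</urlset>"""

def extract_locs(xml_text):
    """Return the text of every <loc> element, ignoring XML namespaces."""
    root = ET.fromstring(xml_text)
    locs = []
    for element in root.iter():
        # Namespaced tags look like "{namespace}loc"; keep only the local name
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "loc" and element.text:
            # Strip stray whitespace so later URL comparisons are not thrown off
            locs.append(element.text.strip())
    return locs

print(extract_locs(SAMPLE_SITEMAP))
```

Stripping whitespace matters here because the crawl later compares each URL to its canonical with an exact string match.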


[7] This next block of code gathers all of the information and updates the Google Sheet.

In [8]:
for each in URLs:
  #Grab the text within the loc tag
  url = each.get_text().strip()

  #Fetch the page with requests-html so the canonical tag can be read via XPath
  with session.get(url) as r:

    #canonical information stored in the variable (None if the page has no canonical tag)
    canonical = r.html.xpath(canonical_xpath, first=True)
    #update URL column and canonical column
    sheet.update_cell(start_row, URLcol, url)
    sheet.update_cell(start_row, Canonicalcol, canonical)

    #Checks for self-referential canonical or no canonical
    if canonical == url:
      sheet.update_cell(start_row, ContainsSelfRefCanonicalCol, "True")
    elif canonical is None:
      sheet.update_cell(start_row, ContainsSelfRefCanonicalCol, "N/A")
    else:
      sheet.update_cell(start_row, ContainsSelfRefCanonicalCol, "False")

    #grabs the status code from the response fetched above (no second request needed)
    status_code = r.status_code
    #update the status code column
    sheet.update_cell(start_row, StatusCodecol, status_code)

    start_row += 1

print("Sitemap crawl has finished!")

ConnectionError: ignored
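
The three-way canonical check in the loop can be isolated into a small function, which makes the rule easy to test without touching the sheet or the network. This is a minimal sketch; `classify_canonical` is a hypothetical helper name, not part of the original notebook.

```python
def classify_canonical(url, canonical):
    """Return the value written to the self-referential canonical column.

    "N/A"   -> the page has no canonical tag
    "True"  -> the canonical matches the sitemap URL exactly
    "False" -> the canonical points somewhere else
    """
    if canonical is None:
        return "N/A"
    return "True" if canonical == url else "False"

# Example values are made up for illustration
print(classify_canonical("https://example.com/a/", "https://example.com/a/"))
print(classify_canonical("https://example.com/a/", None))
print(classify_canonical("https://example.com/a/", "https://example.com/a"))
```

Because the comparison is an exact string match, a canonical that differs only by a trailing slash is reported as "False", as the last call shows.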

**Note:** Be sure your sheet has at least as many rows as there are URLs in your sitemap, plus any rows above the starting row.
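
To work out how many rows the sheet needs, the arithmetic can be sketched as follows. `rows_required` is a hypothetical helper; it assumes the crawl writes one row per URL, beginning at `start_row`.

```python
def rows_required(start_row, num_urls):
    """Return the last row number the crawl will write to (one row per URL)."""
    return start_row + num_urls - 1

# Values from the run shown earlier: 41 URLs, output starting on row 2
last_row = rows_required(2, 41)
print(last_row)

# In the notebook, last_row could be compared with sheet.row_count and, if the
# sheet is too short, passed to sheet.resize(rows=last_row). Both are assumed
# to be available in the installed gspread version.
```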

[8] That's it! View your Google Sheet to see the results.