In [7]:
#TODO: Need to create example file for them to check against

# What it is

A script that takes an Excel xlsx file containing pairs of URLs to check and the canonical URL expected for each. For every pair, the tool reports whether the canonical is missing, set incorrectly, or set correctly.

The outputted file will contain the results of the check. Each tag checked will be on a separate sheet labelled by the tag.

## Before Running All Cells

Check that the xlsx file containing the canonicals you want to check follows the correct format. To see the expected format, view the xlsx file "Example Structure" under the Inputs folder. Then, place the xlsx file in the Check-Canonicals > Inputs folder.

Finally, enter the file name and sheet name below under [User Input](#user_input).
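As a rough guide, the check loop later in this notebook skips the first row as a header and reads the URL to check from the first column and the expected canonical from the second. A minimal sketch of a pre-flight check on rows laid out that way follows. `find_bad_rows` and the sample rows are hypothetical, and the "Example Structure" file remains the reference for the actual format.

```python
def find_bad_rows(rows):
    """Return 1-based row numbers whose URL or expected canonical is blank.

    Assumes (hypothetically) that rows[0] is a header row and that each data
    row is (url_to_check, expected_canonical).
    """
    bad = []
    for row_number, row in enumerate(rows[1:], start=2):
        cells = [str(cell).strip() for cell in row[:2]]
        if len(cells) < 2 or not all(cells):
            bad.append(row_number)
    return bad


sample_rows = [
    ("url", "expected canonical"),                        # header row
    ("www.example.com/a", "https://www.example.com/a"),  # complete row
    ("www.example.com/b", ""),                            # missing canonical
    ("", "https://www.example.com/c"),                    # missing URL
]
print(find_bad_rows(sample_rows))  # [3, 4]
```

The row numbers match what a spreadsheet shows, so a flagged row can be found and fixed directly in the input file.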


## How to Run
From the top menu, select Cell -> Run All.

<a id='user_input'></a>

# User Input

Enter information below before running the cells.

In [1]:

CANONICALS_WORKBOOK = 'Example.xlsx' # Name of xlsx containing canonicals you want to test
CANONICALS_WORKBOOK_SHEET = 'CanonicalsSheet' # Name of xlsx's sheet containing canonicals you want to test


# Imports and Constants

Cells in this section import libraries, define where the outputted file will go, and load the file the user wants to use to check canonicals.

In [2]:
# Imports and constants

import urllib.request
from datetime import datetime

import pandas as pd
import requests
import xlrd  # Reading .xlsx requires xlrd < 2.0; later versions only read .xls
from bs4 import BeautifulSoup
from xlutils.copy import copy

CANONICALS_INPUT_FOLDER = 'Inputs/'
CANONICALS_OUTPUT_FOLDER = 'Results/'

CANONICALS_INPUT_WORKBOOK_PATH = CANONICALS_INPUT_FOLDER + CANONICALS_WORKBOOK

to_check = xlrd.open_workbook(CANONICALS_INPUT_WORKBOOK_PATH)
to_check_sheet = to_check.sheet_by_name(CANONICALS_WORKBOOK_SHEET)

check_wb = copy(to_check) 
check_sheet = check_wb.get_sheet(CANONICALS_WORKBOOK_SHEET)

NO_CANONICAL_SET_STRING = "No canonical set"

# Functions

In this section, functions are defined to make the code easier to read and write tests for.

In [5]:
def return_canonical(url):
    ''' This function parses the html of the parameter url and returns the url of a canonical if set.'''
    f = urllib.request.urlopen(url)
    soup = BeautifulSoup(f.read(), 'html.parser')

    # Originally set canonical to be none. Need to override this if it does exist.
    canonical = NO_CANONICAL_SET_STRING
    
    # Find all html elements of type link. The first one whose rel attribute includes "canonical" is taken
    # as the canonical, and the loop stops there. A page should declare only one canonical, so there is
    # no need to check the rest of the links.
    for link in soup.find_all('link'):
        # A link tag may have no rel attribute, in which case get() returns None.
        if "canonical" in (link.get("rel") or []):
            canonical = link.get('href')
            break

    return canonical

def add_https_if_none(url):
    '''Adds the full url path if none was defined on the input file.
    This assumes that the url should start with https. If it is still an http site, this will likely be the source
    of any issues. In that case, the user should explicitly define their urls to be http:// in the inputted file.'''
    if url.startswith("www"):
        return "https://" + url
    return url

# Testing

The cells below check that the tool is working correctly. If one of them fails, the results produced in the later sections may be incorrect. Reach out or troubleshoot based on the outputted error.
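The test in this section depends on a live page, so it can fail for reasons unrelated to the parser. As a complement, here is a minimal offline sketch that checks canonical extraction against fixed HTML strings. It uses the standard library's `html.parser` rather than BeautifulSoup so that it runs without extra installs. `CanonicalFinder`, `canonical_from_html`, and the example.com markup are hypothetical names for illustration and are not part of the tool.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Record the href of the first <link> whose rel includes "canonical"."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        # rel may be absent or valueless, so fall back to an empty string.
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attributes.get("href")


def canonical_from_html(html):
    """Return the canonical URL declared in html, or None if there is none."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


with_canonical = (
    '<html><head><link rel="stylesheet" href="s.css">'
    '<link rel="canonical" href="https://www.example.com/page"></head></html>'
)
without_canonical = '<html><head><link href="s.css"></head></html>'

assert canonical_from_html(with_canonical) == "https://www.example.com/page"
assert canonical_from_html(without_canonical) is None
```

The second string deliberately includes a `link` tag with no `rel` attribute, which is the case most likely to trip up a parser that indexes into `rel` directly.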

In [6]:
def test_return_canonical(url, canonical):
    '''Test for checking that the canonical parser is working correctly.
    If this prints an error, first check that the page at the passed-in url actually declares the expected canonical.'''
    result = return_canonical(url)
    if result == canonical:
        return True
    else:
        print("Error when parsing")
        return result

test_return_canonical("https://www.masterlock.com/business-use/product/A1266NBLK",
                      "http://www.masterlock.com/business-use/product/A1266NBLK")   

True

# Actual Check

Now on to applying the logic.
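The decision rules applied in the cell below can also be read as a pure function of four values, which makes them easy to test without any network access. The following is a sketch under that assumption. `classify` is a hypothetical helper that is not part of the notebook, and the constant is redefined here only so the block runs on its own.

```python
NO_CANONICAL_SET_STRING = "No canonical set"  # same text as the notebook constant


def classify(url_status, canonical_status, actual, expected):
    """Map one row's status codes and canonicals to a result message."""
    if url_status not in (200, 301):
        return "Error when accessing url containing canonical. See status code."
    if actual == NO_CANONICAL_SET_STRING:
        return NO_CANONICAL_SET_STRING
    if actual != expected:
        return "Expected and actual canonicals do not match"
    # From here on the page declares exactly the canonical that was expected.
    if canonical_status == 200:
        return "OK"
    if canonical_status == 301:
        return "OK, but warning, canonical is a redirect."
    return "Canonical is bad link, but is equal to expected. Consider changing canonical."


print(classify(200, 200, "https://www.example.com/a", "https://www.example.com/a"))
print(classify(200, 404, "https://www.example.com/a", "https://www.example.com/a"))
print(classify(404, 200, "", "https://www.example.com/a"))
```

Each early return corresponds to one branch of the nested `if` in the cell, so the messages stay identical while every path can be exercised with made-up inputs.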

In [68]:
# These are what the headers of the outputted xlsx will be, along with the output printed after running this cell.
cols = ["result", "status code", "canonical status code", "url", "expected canonical", "actual canonical"]

# This will be the outputted table that holds all of the results. It starts out empty and serves as a
# container to put results in.
list_of_results = pd.DataFrame(columns=cols)

for i in range(1, to_check_sheet.nrows):
    # For every row in the input data (row 0 is the header), check that the canonical 1) exists 2) is what was desired
    
    # Get data from the inputted file and add https to the front if not defined
    url_containing_canonical = add_https_if_none(to_check_sheet.cell(i, 0).value.strip())
    expected_canonical = add_https_if_none(to_check_sheet.cell(i, 1).value.strip())
    
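    # Get status codes (200, 301, 404, etc) of the row's urls. requests follows redirects by default and
    # would report the final page's status, so redirects are disabled to let a 301 show up as a 301.
    url_status_code = requests.get(url_containing_canonical, allow_redirects=False).status_code
    canonical_status_code = requests.get(expected_canonical, allow_redirects=False).status_code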
    
    # Default for rows where the page could not be fetched, so the appended row is always complete.
    actual_canonical = ""
    if url_status_code in [200, 301]:
        actual_canonical = return_canonical(url_containing_canonical)

        if actual_canonical == NO_CANONICAL_SET_STRING:
            # If canonical has not been set, set response to say that.
            result = actual_canonical
    
        elif actual_canonical == expected_canonical:
            if canonical_status_code == 200:
                result = "OK"
            elif canonical_status_code == 301:
                result = "OK, but warning, canonical is a redirect."
            else:
                result = "Canonical is bad link, but is equal to expected. Consider changing canonical."
        else:
            result = 'Expected and actual canonicals do not match'
    else: 
        result = "Error when accessing url containing canonical. See status code."

    
    # Append the result to a dataframe for output later
    list_of_results.loc[i] = [result, url_status_code, canonical_status_code, 
                              url_containing_canonical, expected_canonical, actual_canonical]

print(list_of_results)

                                         result status code  \
1                              No canonical set         200   
2                              No canonical set         200   
3   Expected and actual canonicals do not match         200   
4   Expected and actual canonicals do not match         200   
5   Expected and actual canonicals do not match         200   
6   Expected and actual canonicals do not match         200   
7   Expected and actual canonicals do not match         200   
8   Expected and actual canonicals do not match         200   
9   Expected and actual canonicals do not match         200   
10  Expected and actual canonicals do not match         200   
11  Expected and actual canonicals do not match         200   
12  Expected and actual canonicals do not match         200   
13  Expected and actual canonicals do not match         200   
14  Expected and actual canonicals do not match         200   
15  Expected and actual canonicals do not match        

# Create Result Output File

After running the cell below, the results of the canonical check are written to an xlsx file with the current timestamp in its name and saved in the __Results__ folder.
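To illustrate the naming scheme, here is a small sketch that builds the output path from a folder and a timestamp. `build_output_path` is a hypothetical helper, and the date passed in is made up so that the result is predictable.

```python
from datetime import datetime
from pathlib import Path


def build_output_path(folder, now):
    """Return the results path for the given folder and timestamp."""
    stamp = now.strftime("%Y-%m-%d_%H-%M")
    return Path(folder) / f"canonical-results_{stamp}.xlsx"


path = build_output_path("Results", datetime(2024, 1, 5, 9, 30))
print(path.as_posix())  # Results/canonical-results_2024-01-05_09-30.xlsx
```

Because the timestamp has minute resolution, two runs within the same minute would write to the same file name. Calling `path.parent.mkdir(parents=True, exist_ok=True)` before writing would also avoid an error when the folder does not exist yet.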

In [69]:
# Run to output the dataframe as an xlsx file in the 'Results' folder

OUTPUT_FILE = CANONICALS_OUTPUT_FOLDER + 'canonical-results_'+ datetime.now().strftime("%Y-%m-%d_%H-%M") + '.xlsx'

# Using the writer as a context manager saves and closes the file on exit.
with pd.ExcelWriter(OUTPUT_FILE, engine='xlsxwriter') as writer:
    list_of_results.to_excel(writer, sheet_name='Canonicals', index=False)