# What it is

A script that reads an Excel (.xlsx) file of original URL and expected redirect URL pairs, checks whether each redirect resolves as expected, and runs an SEO check on the number of redirect hops.

# How to Use it

1. Create an input file containing the URLs to check, modelled on the example sheet, and place it in the 'Inputs' folder. If you just want a feel for how to run the notebook, you can use the default example. In the User Input cell, set REDIRECTS_WORKBOOK to the file name, enclosed in quotes and including the .xlsx extension.
2. If you are checking a Master Lock or SentrySafe site, select whether the site to check is review or production by entering either True or False next to the variable __IS_ML_REVIEW__. If your site is neither, leave this value as False.
3. Select whether the site you are checking uses SSL. This is not needed if the input file contains full URL paths.
4. Run the check by going to 'Cell' in the top navigation and selecting 'Run All'.
5. See which URLs passed or failed the test by reading the output below or by opening the file in the 'Results' folder with the timestamp of your last run.
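Before running the notebook it can help to sanity-check the rows of the input sheet. The sketch below assumes the layout the checking loop reads: a header row, the original URL in the first column and the expected redirect in the second. The function name `validate_redirect_rows` and the header labels are hypothetical and are not part of the notebook.

```python
def validate_redirect_rows(rows):
    """Return a list of problems found in (original, expected) URL rows.

    `rows` is assumed to include the header row at index 0, mirroring the
    sheet layout the checker reads: first column = original URL,
    second column = expected redirect target.
    """
    problems = []
    # Start at 2 so the reported numbers match spreadsheet row numbers
    for number, row in enumerate(rows[1:], start=2):
        if len(row) < 2:
            problems.append(f"row {number}: expected two columns")
            continue
        original, expected = str(row[0]).strip(), str(row[1]).strip()
        if not original or not expected:
            problems.append(f"row {number}: empty URL cell")
        elif original.rstrip("/") == expected.rstrip("/"):
            problems.append(f"row {number}: original and expected URL are identical")
    return problems


rows = [
    ("Original URL", "Expected Redirect"),  # hypothetical header labels
    ("www.example.com/old-page", "www.example.com/new-page"),
    ("www.example.com/missing", ""),
]
print(validate_redirect_rows(rows))
```

An empty list means every data row has both URLs filled in and the two differ.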

## User Input

Enter information below before running the cells.

In [137]:
# User input data

REDIRECTS_WORKBOOK = 'Example.xlsx'
REDIRECTS_WORKBOOK_SHEET = 'Redirects'

# Whether looking to test the review site or production site
# Used for Master Lock and SentrySafe
# May not be supported for other sites, leave False
IS_ML_REVIEW = False

# Scheme to prepend when a url in the input file has none: True for https, False for http
IS_SSL = True


## Imports and Constants

Cells in this section import libraries, define where the output file will go, and load the file the user wants to use to check redirects.

In [138]:
# Imports and constants

import sys  # needed for sys.exit() in the Testing section
from datetime import datetime

import pandas as pd
import requests
# Note: xlrd 2.0+ only reads .xls files; opening an .xlsx input requires xlrd < 2.0
import xlrd
from xlutils.copy import copy

REDIRECTS_INPUT_FOLDER = 'Inputs/'
REDIRECTS_OUTPUT_FOLDER = 'Results/'

REDIRECTS_INPUT_WORKBOOK_PATH = REDIRECTS_INPUT_FOLDER + REDIRECTS_WORKBOOK

to_check = xlrd.open_workbook(REDIRECTS_INPUT_WORKBOOK_PATH)
to_check_sheet = to_check.sheet_by_name(REDIRECTS_WORKBOOK_SHEET)

check_wb = copy(to_check) 
check_sheet = check_wb.get_sheet(REDIRECTS_WORKBOOK_SHEET)

SSL = "https"
NON_SSL = "http"

## Functions

In this section, functions are defined to make the code easier to read and write tests for.

In [142]:
# Methods to parse data in file

def change_env(url, is_ml_review, base):
    url = return_full_clean_path(url, base)
    if is_ml_review:
        url = change_to_review(url) 
    else:
        url = change_to_prod(url)
    return url

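def change_to_prod(url):
    '''Swaps a Master Lock or SentrySafe review host for the production (www) host.'''
    if 'review.masterlock' in url or 'review.sentrysafe' in url:
        # Split only on the first 'review' so paths such as '/reviews' stay intact
        split_url = url.split('review', 1)
        url = "".join([split_url[0], "www", split_url[1]])
    return url

def change_to_review(url):
    '''Swaps the production (www) host for the review host; other urls are returned unchanged.'''
    basic_path = url.split('www', 1)
    if len(basic_path) < 2:
        return url
    return "".join([basic_path[0], 'review', basic_path[1]])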

def return_full_clean_path(url, base):
    '''Adds the full url path if none was defined on the input file.'''
    url = url.strip().rstrip('/')
    if "//" not in url:
        url = "".join([base,"://",url])
    return url

def get_base(ssl):
    '''Returns the url scheme to prepend: https when ssl is True, otherwise http.'''
    return SSL if ssl else NON_SSL

def check_seo_hops(hops):
    '''Flags redirect chains of three or more hops, based on the SEO-accepted number of hops.'''
    if hops >= 3:
        seo_check = "Correct redirect, but hopped three times or more."
    else:
        seo_check = "OK"
    return seo_check

def check_matching(expected, actual):
    if actual == expected:
        match_result = "OK"
    else:
        match_result = "Expected and actual do not match"
    return match_result
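
The environment helpers above work on the URL as a plain string. As a hedged alternative, the sketch below swaps only the leading host label using `urllib.parse` from the standard library, so text in the path is never touched. The name `swap_subdomain` is hypothetical, and the sketch assumes the URL already carries a scheme (for example after `return_full_clean_path`); a URL without a scheme is returned unchanged.

```python
from urllib.parse import urlsplit, urlunsplit


def swap_subdomain(url, old, new):
    """Replace a leading `old.` host label with `new.`, leaving the path alone."""
    parts = urlsplit(url)
    host = parts.netloc
    if host.startswith(old + "."):
        # Keep everything after the old label, including the dot
        host = new + host[len(old):]
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))


# Review host to production host; the '/reviews' path is preserved
print(swap_subdomain("https://review.masterlock.com/reviews", "review", "www"))
# A host without the old label is left as it is
print(swap_subdomain("http://nm.org", "www", "review"))
```

Restricting the swap to the host avoids surprises when 'www' or 'review' also appears later in the URL.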

## Testing

The cells below check that the tool is working correctly. If one of these tests fails and the redirect check still runs, the output file may be incorrect. Reach out or troubleshoot based on the error shown.

When selecting 'Run All', if one of these tests fails, the code will stop running at this cell. You can continue by selecting the 'Actual Check' cell and running it, but this is strongly discouraged.

In [147]:
def test_change_to_env(url, env_url, env, base):
    test_url = change_env(url, env, base)
    if test_url == env_url:
        print("Pass")
    else:
        print("An error occurred. Test url: " + test_url)
        print("Expected url: " + env_url)
        print("Env: "+ str(env))
        sys.exit()

test_change_to_env("https://www.masterlock.com/service-and-support/faqs/lost-combinations",
                   "https://www.masterlock.com/service-and-support/faqs/lost-combinations", False, 'https')
test_change_to_env("www.sentrysafe.com", "http://review.sentrysafe.com", True, 'http')
test_change_to_env("review.sentrysafe.com", "https://www.sentrysafe.com", False, 'https')
test_change_to_env("nm.org", "http://nm.org", False, 'http')

Pass
Pass
Pass
Pass


## Actual Check

Now on to applying the logic.

In [148]:
# Checking the redirects

cols = ["Redirect Result", "Status Code", "URL", "Expected Redirect", "Actual Redirect", "Hops", "SEO Results"]
list_of_results = pd.DataFrame(columns=cols)

base = get_base(IS_SSL)

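# For every row in the input data, check to see that the actual redirect is the same as the desired
for i in range(1, to_check_sheet.nrows):
    # Reset per-row values so a row without a redirect never reuses the previous row's data
    seo_check = "n/a"
    matched_result = "n/a"
    actual_redirect = "n/a"
    hops = 0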
    
    url_to_redirect = change_env(to_check_sheet.cell(i, 0).value, IS_ML_REVIEW, base)
    expected_redirect = change_env(to_check_sheet.cell(i, 1).value, IS_ML_REVIEW, base)
    
    req = requests.get(url_to_redirect)
    
    # This is the status code of the final response, so on its own it does not reveal redirects
    status_code = req.status_code
    
    # To check redirects, the history of the response must be parsed
    # If there is no history, then a redirect did not occur
    if req.history:
        status_code = req.history[-1].status_code
        hops = len(req.history)
        seo_check = check_seo_hops(hops)
        if status_code == 301:
            actual_redirect = return_full_clean_path(req.url, base)
            matched_result = check_matching(expected_redirect, actual_redirect)
        else:
            matched_result = "Wrong redirect response"
    else:
        matched_result = "View status code"
    
    # Append the result to a dataframe for output later
    list_of_results.loc[i] = [matched_result, status_code, url_to_redirect,
                              expected_redirect, actual_redirect, hops, seo_check]

print(list_of_results)

                     Redirect Result Status Code  \
1                                 OK         301   
2   Expected and actual do not match         301   
3                                 OK         301   
4                                 OK         301   
5                                 OK         301   
6                                 OK         301   
7                                 OK         301   
8                                 OK         301   
9                                 OK         301   
10                                OK         301   
11                                OK         301   
12                                OK         301   
13                                OK         301   
14                                OK         301   
15                                OK         301   
16                                OK         301   
17                  View status code         404   
18                                OK         301   

           

## Create Result Output File

After running the cell below, the results of the redirect check are written to an .xlsx file with the current timestamp in its name and saved to the __Results__ folder.

In [120]:
# Run to output the dataframe as an xlsx file in the 'Results' folder

OUTPUT_FILE = REDIRECTS_OUTPUT_FOLDER + 'redirect-results_'+ datetime.now().strftime("%Y-%m-%d_%H-%M") + '.xlsx'

writer = pd.ExcelWriter(OUTPUT_FILE, engine='xlsxwriter')
list_of_results.to_excel(writer, sheet_name='Redirects', index=False)
# close() writes the file to disk; ExcelWriter.save() was removed in pandas 2.0
writer.close()
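
The timestamped file name can also be built with `pathlib`, which keeps the folder and file name handling in one place. This is a sketch under assumptions: `build_output_path` is a hypothetical helper, and the optional `now` argument exists only so the result can be checked against a fixed time.

```python
from datetime import datetime
from pathlib import Path


def build_output_path(folder, now=None):
    """Return the results path for a run, named with a minute-resolution timestamp."""
    now = now or datetime.now()
    stamp = now.strftime("%Y-%m-%d_%H-%M")
    return Path(folder) / f"redirect-results_{stamp}.xlsx"


path = build_output_path("Results", datetime(2020, 1, 2, 3, 4))
print(path.name)
```

Calling `path.parent.mkdir(parents=True, exist_ok=True)` before writing would create the folder if it does not exist yet.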