<a href="https://colab.research.google.com/github/hjtb/Data-Validation/blob/main/Product_Validation_Script.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PRODUCT VALIDATION SCRIPT
# Authors - David Leon (@Dleon) and William Holton (@Wholton)
## Intro
### The following manual validation needs to be added to the products publication process to avoid failing tests:

1.  Product URL needs to start with https:// or http://
2.  Company URL starts with https://www.linkedin.com/company/
3.  Company URL needs to be in the format of https://www.linkedin.com/company/<vanity_name> instead of [..]<company_id>
4.  Showcase page URL needs to start with https://www.linkedin.com/showcase/<vanity_name>
5.  Product Category ID corresponds to category/categories assigned to product
6.  Ensure Product URL is NOT a PDF (i.e. does not end with ‘.pdf’)
7.  Add validation to ensure product categories assigned to products are NOT Group Representatives **(multi-coded)**

**New validations/validations to be updated following on from the 3rd publication**

8.  Identify required fields and add a check to ensure all of those fields are filled  **(Needs adjusting for product changes)**
9. Validation that isActive = T, and isDeprecated = F for New Products, and the opposite deprecations on Product Changes. Also, check that both values are not T/T or F/F. **(Needs adjusting for product changes)**
10. Ensure all characters are unicode. (i.e. No special characters like Äô)
11. Ensure Product Skill ID is valid and not 0
12. Make sure there are no line breaks in product names or descriptions
13. Check that showcase IDs are within the valid range **[To Be Completed]**
14. Check that product name and product ID match for product changes
15. Dupe checks
16. Add comments for different error types
17. Check that Description Locale is valid, e.g. not "id_ID"
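
Several of the rules above reduce to small per-value predicates. The following is a minimal sketch of rules 6, 9 and 12 only. The function names are hypothetical, the flags are assumed to be the strings `'T'` and `'F'` as they appear in the Active and Deprecated columns, and "the opposite" for deprecations is read as Active = F, Deprecated = T.

```python
def is_pdf_url(url: str) -> bool:
    """Rule 6: flag product URLs whose path ends with '.pdf' (case-insensitive)."""
    # Drop any query string or fragment before checking the extension.
    path = url.split('?', 1)[0].split('#', 1)[0]
    return path.strip().lower().endswith('.pdf')


def has_valid_status_flags(active: str, deprecated: str, new_product: bool = True) -> bool:
    """Rule 9: new products must be T/F; deprecations (assumed) must be F/T."""
    expected = ('T', 'F') if new_product else ('F', 'T')
    return (active, deprecated) == expected


def has_line_break(text: str) -> bool:
    """Rule 12: flag product names or descriptions containing a line break."""
    return '\n' in text or '\r' in text
```

Because T/T and F/F never equal the expected pair, the "both values are not T/T or F/F" part of rule 9 is covered by the same comparison.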


## Install and import relevant packages (restart runtime after installation)

In [None]:
# If you encounter the AttributeError: 'NotebookFormatter' object has no attribute 'get_result' run this and then reinstall the packages
# !pip uninstall numpy
# !pip uninstall pandas

In [None]:
# Install pygsheets to be able to connect to the spreadsheet:
!pip install pygsheets &> /dev/null 
!pip install validators
!pip install pandas==1.3  # We will need this version to use explode function on multiple columns
!pip install xlsxwriter

Collecting validators
  Downloading validators-0.18.2-py3-none-any.whl (19 kB)
Installing collected packages: validators
Successfully installed validators-0.18.2
Collecting pandas==1.3
  Downloading pandas-1.3.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (10.8 MB)
Installing collected packages: pandas
  Attempting uninstall: pandas
    Found existing installation: pandas 1.3.5
    Uninstalling pandas-1.3.5:
      Successfully uninstalled pandas-1.3.5
Successfully installed pandas-1.3.0


Collecting xlsxwriter
  Downloading XlsxWriter-3.0.2-py3-none-any.whl (149 kB)

In [None]:
import pygsheets
import numpy as np
import pandas as pd
from pygsheets.datarange import DataRange
import pprint
import validators
import string
import xlsxwriter 

## Get Credentials and set the Script Mode

In [None]:
# Get connection through Service Account credentials (Google APIs):

try:
  # (EL 1) David Credentials
  google_credentials = pygsheets.authorize(service_account_file=
                         './product-data-validation-5a7794651578.json')
except Exception:  # fall back to the second set of credentials if the first file can't be used
  # (EL 2) Will Credentials
  google_credentials = pygsheets.authorize(client_secret='./client_secret_will.json')

In [None]:

# Ask for input about the 1st mode the script is going to run in:
script_mode_1 = input("Are you validating NEW PRODUCTS or PRODUCT CHANGES? Answer with: 'new products'/'product changes'")

# Keep asking for input until a valid input is provided
while script_mode_1 not in ('new products', 'product changes'):
    script_mode_1 = input("That answer was not recognised. Please answer with: 'new products'/'product changes'.")

# Ask for input about the second mode the script is going to run in:
script_mode_2 = input("Are you validating the CATALOG + PIPELINE/Product changes sheet or the PUBLICATION sheets? \nAnswer with: 'pipeline'/'publication'")

# Keep asking for input until a valid input is provided
while script_mode_2 not in ('pipeline', 'publication'):
    script_mode_2 = input("That answer was not recognised. \nPlease answer with: 'pipeline'/'publication'.")

# SOL N1: -------------------------------------------------------------------------------------------------------------------


# Handle product changes if the input points to that script mode:
if script_mode_1 == 'product changes' and script_mode_2 == 'pipeline':
    spreadsheet_to_be_validated = google_credentials.open('Product Changes')
    specific_to_be_validated_tab = spreadsheet_to_be_validated.worksheet_by_title('Changes')

elif script_mode_1 == 'product changes' and script_mode_2 == 'publication':
    spreadsheet_to_be_validated = google_credentials.open('[Template] Changes to Products')
    specific_to_be_validated_tab = spreadsheet_to_be_validated.worksheet_by_title('Future Product Changes')

# Handle new products option:
elif script_mode_1 == 'new products':

    # If the chosen mode is for the publication sheet (script_mode_2 = 'publication'):
    if script_mode_2 == 'publication':
        spreadsheet_to_be_validated = google_credentials.open('[Template] Changes to Products')
        specific_to_be_validated_tab = spreadsheet_to_be_validated.worksheet_by_title('Future New Products')
    
    # If the chosen mode is not for the publication sheet (script_mode_2 = 'pipeline'):
    elif script_mode_2 == 'pipeline':
        # Ask for the period tab we'll be using for the Catalog + Pipeline sheet:  
        period_tab = input("Enter the Catalog + Pipeline period tab you'd like to test, e.g. 'dev' or 'FY22Q2 Review Period 2 (08/11-07/12)':")

        while True:

            # Go into development mode:
            if period_tab == 'dev':
              spreadsheet_to_be_validated = google_credentials.open('Catalog + Product Pipeline')
              specific_to_be_validated_tab = spreadsheet_to_be_validated.worksheet_by_title('Validation Script Dev')
              print(f"Period tab chosen is: {period_tab}")
              break

            # Otherwise look for a specific tab:
            else:
              # try clause to check if the name of the tab provided exists:
              try:
                spreadsheet_to_be_validated = google_credentials.open('Catalog + Product Pipeline')
                specific_to_be_validated_tab = spreadsheet_to_be_validated.worksheet_by_title(period_tab)
                print(f"Period tab chosen is: '{period_tab}'")
                break
              except Exception:
                period_tab = input("That tab could not be opened. \nPlease enter the period tab name again:")


script_mode_1, script_mode_2

Are you validating NEW PRODUCTS or PRODUCT CHANGES? Answer with: 'new products'/'product changes'new products
Are you validating the CATALOG + PIPELINE/Product changes sheet or the PUBLICATION sheets? 
Answer with: 'pipeline'/'publication'publication


('new products', 'publication')
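
Both prompts in the cell above follow the same ask-and-retry pattern, which could be factored into one helper. This is only a sketch: `ask_until_valid` and its `input_fn` parameter are hypothetical names, and trimming and lower-casing the answer is an added assumption that the original prompts do not make.

```python
from typing import Callable, Sequence


def ask_until_valid(prompt: str, valid_answers: Sequence[str],
                    input_fn: Callable[[str], str] = input) -> str:
    """Keep prompting until the answer is one of valid_answers."""
    answer = input_fn(prompt).strip().lower()
    while answer not in valid_answers:
        retry_prompt = f"Please answer with one of: {', '.join(valid_answers)}. "
        answer = input_fn(retry_prompt).strip().lower()
    return answer
```

It would be called as `script_mode_1 = ask_until_valid("Are you validating NEW PRODUCTS or PRODUCT CHANGES? ", ('new products', 'product changes'))`. Passing a different `input_fn` makes the loop testable without a keyboard.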


## Highlight the cells with errors in red (Function) **(Find a way to speed up highlighting the full row as it's very slow. Possibly not needed)**


In [None]:
def highlight_errors(error_indexes_array, column_index):
  # Pick the column that holds the EL's name for the current script mode
  if script_mode_1 == 'product changes' and script_mode_2 == 'pipeline':
    el_column = 'B'
  elif script_mode_2 == 'publication':
    el_column = 'Y'
  else:  # 'new products' and 'pipeline'
    el_column = 'A'

  for row in error_indexes_array:
    # Select the cell to recolor using the column letter (column_index) and the row number
    error_cell = specific_to_be_validated_tab.cell(f'{column_index}{row}')
    # Change the error cells to red
    error_cell.color = (1, 0.1, 0.1, 0.5)
    # Highlight the EL's name in green so they can locate the rows with errors
    el_cell = specific_to_be_validated_tab.cell(f'{el_column}{row}')
    el_cell.color = (0.1, 0.98, 0.4, 0.0001)

  # Get a list of the column letters to loop through using inbuilt string.ascii_uppercase
  #letters_a_z = string.ascii_uppercase

  #for row in error_row_indexes:
    # Highlight the full row
    #for letter in letters_a_z:
      #cell = specific_to_be_validated_tab.cell(f'{letter}{row}')
      #cell.color = (0., 0.4, 0.4, 0.1)
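
`highlight_errors` recolors one cell at a time, which is where the slowness noted in the heading comes from. One possible way to cut the number of updates is to group neighbouring error rows into ranges first. The helper below is a hypothetical sketch of that grouping step only. Whether a whole range can then be recolored in a single pygsheets call is an assumption that is not verified here.

```python
from typing import Iterable, List


def rows_to_a1_ranges(rows: Iterable[int], column_letter: str) -> List[str]:
    """Collapse row numbers into contiguous A1 ranges for a single column."""
    spans = []
    start = previous = None
    for row in sorted(set(rows)):
        if start is None:
            start = previous = row
        elif row == previous + 1:
            previous = row  # extend the current run of consecutive rows
        else:
            spans.append((start, previous))
            start = previous = row
    if start is not None:
        spans.append((start, previous))
    return [f'{column_letter}{first}' if first == last
            else f'{column_letter}{first}:{column_letter}{last}'
            for first, last in spans]
```

For example, error rows 2, 3, 4 and 7 in column O collapse into two ranges instead of four single cells.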

## Create main Data Objects

In [None]:
# Convert Pipeline sheet into pandas dataframe:
dataframe_to_be_validated = specific_to_be_validated_tab.get_as_df()
dataframe_to_be_validated.index += 2  # Shift indexes to match the original doc


# Get non-empty rows
product_names = dataframe_to_be_validated.loc[:, 'Product name']
dataframe_to_be_validated = dataframe_to_be_validated.loc[product_names.str.len().gt(0)]

dataframe_to_be_validated.head(2)

Unnamed: 0,Category ID,Product Category,Product ID,Product name,Product Aliases,Product Description,Active,Deprecated,Product Skill ID,Product URLS,Description Locale,Company ID - OWNER,Comp URL,Company Name,Showcase ID,Showcase URL,Company/product page for display,Customer Organizations IDs,Customer Organizations Company URLs,Source of Product,In V1.1,In MVP,Has ingested IMAGES - 2020Dec09,Has ingested VIDEOS - 2020Dec09,Product notes,New Product?,Notify Pages?
2,1014,Content Delivery Network (CDN) Software,24110,jsDelivr CDN,,,T,F,,https://www.bootstrapcdn.com/,en_US,,https://www.linkedin.com/company/jsdelivr/,jsDelivr,,,,,,Builtwith Data,YES,YES,,,,YES,
3,1257,Data Privacy Management Software,24105,Tarte Au Citron,,,T,F,,https://tarteaucitron.io/en/,fr_FR,,https://www.linkedin.com/company/tarteaucitron...,tarteaucitron.js,,,,,,Builtwith Data,YES,YES,,,,YES,
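
The index shift in the cell above is what lets later cells use dataframe indexes directly as spreadsheet row numbers. A small sketch with a toy dataframe (the contents are illustrative, not read from the real sheet):

```python
import pandas as pd

# Toy stand-in for the worksheet contents (row 1 of the sheet holds the headers).
toy_df = pd.DataFrame({'Product name': ['jsDelivr CDN', '', 'Haylix']})
toy_df.index += 2  # the first data row sits on spreadsheet row 2

# Keep only rows with a non-empty product name; the index still points at sheet rows.
non_empty = toy_df.loc[toy_df['Product name'].str.len().gt(0)]
sheet_rows = list(non_empty.index)
```

Here `sheet_rows` holds rows 2 and 4: the empty row 3 is dropped, but the remaining rows keep their original sheet positions.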


In [None]:
# Open Golden Category Status spreadsheet:
golden_spreadsheet = google_credentials.open('Golden Category Status ')
specific_golden_tab = golden_spreadsheet.worksheet_by_title('Golden Product Categories')

In [None]:
# Convert Golden sheet into pandas dataframe:
golden_sheet_dataframe = specific_golden_tab.get_as_df()
golden_sheet_dataframe.head(2)



Unnamed: 0,Unnamed: 1,URN,Category Name,Category Description,Aliases,See Also,UPDATED Proposed: Parent Problem Category,parent_product_category_ids,Included in MVP,active,deprecated,is_product_category,is_service_category,is_group_representative,is_solution_category_or_else_problem_category,artifact_id,artifact_name,activity_id,associated_skills_ids,Verified,Published status,adstargeting category,Parent Problem Category Suggested,Is group representative,# MVP Products in Category,# products in category,Example Product,Category Created by,Has Metadata,Evidence,Category in G2?,Only in G2,Notes,Unnamed: 34,new count,Unnamed: 36
0,,,,,,,,,524,,,,,,,,,,,,,,,,,,,,,,71%,14%,,,,
1,1.0,1000.0,Enterprise Accounting Software,Software used to record and process financial ...,,,Financial Management > Accounting,10129.0,YES,T,F,T,F,F,T,1000.0,Software,,,Verified,Published,False,,,0.0,117.0,,Erin,,https://docs.google.com/spreadsheets/d/1HIpG4b...,TRUE,FALSE,"Previously called ""Accounting Software"" but re...",,Unique words,Frequency


In [None]:
# Open Catalog spreadsheet:
prod_catalog_spreadsheet = google_credentials.open('STZ Dedupe Check')
catalog_tab = prod_catalog_spreadsheet.worksheet_by_title('Product Catalog & Admin Products')

In [None]:
# Convert Catalog sheet into pandas dataframe:
product_catalog_dataframe = catalog_tab.get_as_df()
product_catalog_dataframe.head()

       productId  ...                                         productUrl
0           1000  ...  https://quickbooks.intuit.com/desktop/enterprise/
1           1001  ...                           https://switchitapp.com/
2           1002  ...                              https://mbizcard.com/
3           1003  ...                                    sageintacct.com
4           1004  ...                               https://www.xero.com

## 1.  Product URL needs to start with https:// or http://



In [None]:
# First get the column of product URLs
product_urls = dataframe_to_be_validated['Product URLS']
product_urls_dataframe = product_urls.to_frame()

#### Use Validators method to validate URLs

In [None]:
# Use validators package to validate urls and assign true and false values in new column called isURLValid
def isUrlValid(url):
    return True if validators.url(url) else False
product_urls_dataframe['isURLValid'] = product_urls_dataframe['Product URLS'].apply(isUrlValid)

In [None]:
# Get rows where url is not deemed valid
product_url_errors_validator_method = product_urls_dataframe.loc[product_urls_dataframe['isURLValid'] == False]
product_url_errors_validator_method

Unnamed: 0,Product URLS,isURLValid


#### Use pandas method to validate URLs

In [None]:
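# Get the rows that don't pass the 'http://'/'https://' validation criteria.
# str.startswith takes a single pattern (its second positional argument is the NaN fill value),
# so a regex anchored at the start of the string is used instead:
prod_url_errors_string_method = dataframe_to_be_validated.loc[~product_urls.str.match(r'https?://', na=False), :]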
prod_url_errors_string_method

Unnamed: 0,Category ID,Product Category,Product ID,Product name,Product Aliases,Product Description,Active,Deprecated,Product Skill ID,Product URLS,Description Locale,Company ID - OWNER,Comp URL,Company Name,Showcase ID,Showcase URL,Company/product page for display,Customer Organizations IDs,Customer Organizations Company URLs,Source of Product,In V1.1,In MVP,Has ingested IMAGES - 2020Dec09,Has ingested VIDEOS - 2020Dec09,Product notes,New Product?,Notify Pages?


#### Group errors from both methods in a set of unique index values and highlight the corresponding errors

In [None]:
all_prod_url_errors_index = product_url_errors_validator_method.index.append(prod_url_errors_string_method.index)
all_prod_url_errors_index = set(all_prod_url_errors_index)
all_prod_url_errors_index

set()

In [None]:
# Highlight all errors in red and highlight the first cell of the row in green for reference
column_index = dataframe_to_be_validated.columns.get_loc("Product URLS") 
column_letter = xlsxwriter.utility.xl_col_to_name(column_index)
highlight_errors(all_prod_url_errors_index, column_letter)
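
The prefix rule in this section can also be checked on a single string with the standard library. This is a sketch under stated assumptions rather than the notebook's method: `has_http_scheme` is a hypothetical helper, it additionally requires a host name, and it compares the scheme case-insensitively.

```python
from urllib.parse import urlparse


def has_http_scheme(url: str) -> bool:
    """Return True when the URL uses http:// or https:// and names a host."""
    parsed = urlparse(url.strip())
    return parsed.scheme in ('http', 'https') and bool(parsed.netloc)
```

Catalog entries such as `sageintacct.com` or `www.examportal.co.za`, which carry no scheme, would be flagged by this check.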

## 2. Company URL starts with "https://www.linkedin.com/company/" (and the company name)


In [None]:
# Get rows with non-empty company URLs by checking that their string length is greater than 0
company_urls_not_empty = dataframe_to_be_validated[dataframe_to_be_validated['Comp URL'].str.len().gt(0)]
company_urls_not_empty

Unnamed: 0,Category ID,Product Category,Product ID,Product name,Product Aliases,Product Description,Active,Deprecated,Product Skill ID,Product URLS,Description Locale,Company ID - OWNER,Comp URL,Company Name,Showcase ID,Showcase URL,Company/product page for display,Customer Organizations IDs,Customer Organizations Company URLs,Source of Product,In V1.1,In MVP,Has ingested IMAGES - 2020Dec09,Has ingested VIDEOS - 2020Dec09,Product notes,New Product?,Notify Pages?
2,1014,Content Delivery Network (CDN) Software,24110,jsDelivr CDN,,,T,F,,https://www.bootstrapcdn.com/,en_US,,https://www.linkedin.com/company/jsdelivr/,jsDelivr,,,,,,Builtwith Data,YES,YES,,,,YES,
3,1257,Data Privacy Management Software,24105,Tarte Au Citron,,,T,F,,https://tarteaucitron.io/en/,fr_FR,,https://www.linkedin.com/company/tarteaucitron...,tarteaucitron.js,,,,,,Builtwith Data,YES,YES,,,,YES,
4,11761310,Photo Editing Software AND Image Recognition S...,24104,Air recon DL,,,T,F,,https://apps.gehealthcare.com/app-products/air...,en_US,,https://www.linkedin.com/company/gehealthcare/,GE Healthcare,,,,,,LSS Top Companies,YES,YES,,,,YES,
5,1615,Web Hosting,24086,Haylix,,,T,F,,https://www.haylix.com/,en_US,,https://www.linkedin.com/company/haylix/,Haylix,,,,,,Builtwith Data,YES,YES,,,,YES,
6,1136,Virtual Private Cloud (VPC) Software,24085,Rackco VPS Hosting,,,T,F,,https://www.rackco.com/vps-hosting/,en_US,,https://www.linkedin.com/company/rackco/,Rackco,,,,,,Builtwith Data,YES,YES,,,,YES,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
925,178316001219,Loan Origination Software AND Retail POS Syste...,25055,Linedata Consumer Finance,,,T,F,,https://www.linedata.com/lenders-and-lessors/c...,en_US,,https://www.linkedin.com/company/linedata/,Linedata,,,,,,LSS Top Companies,YES,YES,,,,YES,
926,1789,Equipment Rental Software,25056,Linedata Ekip360,,,T,F,,https://www.linedata.com/lenders-and-lessors/e...,en_US,,https://www.linkedin.com/company/linedata/,Linedata,,,,,,LSS Top Companies,YES,YES,,,,YES,
927,11651783,Loan Servicing Software AND Loan Origination S...,25057,Linedata Syndicated Lending,,,T,F,,https://www.linedata.com/lenders-and-lessors/s...,en_US,,https://www.linkedin.com/company/linedata/,Linedata,,,,,,LSS Top Companies,YES,YES,,,,YES,
928,10511026,Website Builder Software AND E-Commerce Platforms,25058,Wavoto,,,T,F,,https://www.wavoto.com/,en_US,,https://www.linkedin.com/company/wavoto/,Wavoto,,,,,,Builtwith Data,YES,YES,,,,YES,


In [None]:
# Get the company urls column
company_urls = company_urls_not_empty.loc[:, 'Comp URL']
company_urls.head()

2           https://www.linkedin.com/company/jsdelivr/
3    https://www.linkedin.com/company/tarteaucitron...
4       https://www.linkedin.com/company/gehealthcare/
5             https://www.linkedin.com/company/haylix/
6             https://www.linkedin.com/company/rackco/
Name: Comp URL, dtype: object

In [None]:
# Get all company URLs that don't start with https://www.linkedin.com/company/,
# or that have nothing after that prefix once surrounding whitespace is stripped:
company_url_prefix = 'https://www.linkedin.com/company/'
comp_url_errs_string_method = company_urls_not_empty.loc[
    ~company_urls.str.startswith(company_url_prefix)
    | (company_urls.str.strip().str.len() == len(company_url_prefix))]
# Keep the row indexes of the incorrect URLs
comp_url_errs_string_method_index = comp_url_errs_string_method.index
comp_url_errs_string_method_index

Int64Index([], dtype='int64')

#### Use Validators method to validate URLs

In [None]:
# Use validators package to validate urls and assign true and false values in new column called isURLValid
company_urls_dataframe = company_urls.to_frame()
def isUrlValid(url):
    return True if validators.url(url) else False
company_urls_dataframe['isURLValid'] = company_urls_dataframe['Comp URL'].apply(isUrlValid)

In [None]:
# Get rows where url is not deemed valid
comp_url_errs_validators_method = company_urls_dataframe.loc[company_urls_dataframe['isURLValid'] == False]
comp_url_errs_validators_method_index = comp_url_errs_validators_method.index
comp_url_errs_validators_method_index

Int64Index([], dtype='int64')

#### Group errors from both methods in a set of unique values

In [None]:
all_comp_url_errors_index = comp_url_errs_validators_method_index.append(comp_url_errs_string_method_index)
all_comp_url_errors_index_unique = set(all_comp_url_errors_index)
all_comp_url_errors_index

Int64Index([], dtype='int64')

#### Highlight the cells with errors in red

In [None]:
# Highlight all errors in red and highlight the first cell of the row in green for reference
column_index = dataframe_to_be_validated.columns.get_loc("Comp URL") 
column_letter = xlsxwriter.utility.xl_col_to_name(column_index)
highlight_errors(all_comp_url_errors_index_unique, column_letter)  # use the de-duplicated set so each row is highlighted once
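
For a single value, the company URL rule in this section amounts to "starts with the prefix and has something after it". A minimal sketch, where `COMPANY_PREFIX` and `is_company_url` are hypothetical names:

```python
COMPANY_PREFIX = 'https://www.linkedin.com/company/'


def is_company_url(url: str) -> bool:
    """True when the URL starts with the company prefix and has a name after it."""
    cleaned = url.strip()
    return cleaned.startswith(COMPANY_PREFIX) and len(cleaned) > len(COMPANY_PREFIX)
```

Stripping whitespace first means a bare prefix followed by spaces is still treated as missing its company name.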

## 3. Company URL needs to be in the format of `https://www.linkedin.com/company/'vanity_name'` instead of `[..]'company_id'`


In [None]:
# Get the part of the url that comes after the 'company/' address:
vanity_name_bit = company_urls.str.split('https://www.linkedin.com/company/', expand=True)
vanity_name_bit.head()

Unnamed: 0,0,1
2,,jsdelivr/
3,,tarteaucitron.js/
4,,gehealthcare/
5,,haylix/
6,,rackco/


In [None]:
# Keep just the vanity name part and turn to series object:
vanity_name_bit = vanity_name_bit.pop(1).squeeze()  
# In case someone has pasted the URL from the admin view ('/admin') or the member view ('/mycompany'), drop that suffix:
vanity_name_bit = vanity_name_bit.str.split('/admin', expand=True).pop(0).squeeze()
vanity_name_bit = vanity_name_bit.str.split('/mycompany', expand=True).pop(0).squeeze()
vanity_name_bit.head()

2            jsdelivr/
3    tarteaucitron.js/
4        gehealthcare/
5              haylix/
6              rackco/
Name: 0, dtype: object

In [None]:
# Get rid of forward slash or potential white spaces at the end so that it doesn't 
# interfere with isdigit() in checking if the string is only numbers:
vanity_name_bit = vanity_name_bit.str.rstrip('/ ')
vanity_name_bit.head()

2            jsdelivr
3    tarteaucitron.js
4        gehealthcare
5              haylix
6              rackco
Name: 0, dtype: object

In [None]:
# Check if the url corresponding with the vanity name is only numbers (i.e. company ID instead of vanity name):
vanity_name_errs = company_urls_not_empty.loc[vanity_name_bit.str.isdigit() == True, :]
vanity_name_errs_index = vanity_name_errs.index
vanity_name_errs_index

Int64Index([], dtype='int64')

#### Highlight the cells with errors in red

In [None]:
# Highlight all errors in red and highlight the first cell of the row in green for reference
column_index = dataframe_to_be_validated.columns.get_loc("Comp URL") 
column_letter = xlsxwriter.utility.xl_col_to_name(column_index)
highlight_errors(vanity_name_errs_index, column_letter)
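
The steps in this section (strip the prefix, drop any admin or member suffix, trim trailing slashes, test for digits) can be sketched for a single URL as follows. The function names are hypothetical, and the numeric URL in the usage note is an illustrative value rather than one taken from the sheet.

```python
COMPANY_PREFIX = 'https://www.linkedin.com/company/'


def extract_vanity_name(url: str) -> str:
    """Return the part of a company URL after the prefix, without view suffixes."""
    remainder = url.strip()
    if remainder.startswith(COMPANY_PREFIX):
        remainder = remainder[len(COMPANY_PREFIX):]
    # Drop admin-view and member-view suffixes, then trailing slashes and spaces.
    for suffix in ('/admin', '/mycompany'):
        remainder = remainder.split(suffix, 1)[0]
    return remainder.rstrip('/ ')


def uses_company_id(url: str) -> bool:
    """True when the vanity-name slot holds only digits, i.e. a company ID."""
    return extract_vanity_name(url).isdigit()
```

With these definitions, `uses_company_id('https://www.linkedin.com/company/123456/admin/')` is true, while a URL ending in `haylix/` is not flagged.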

## 4. Showcase page URL needs to start with https://www.linkedin.com/showcase/<vanity_name>  and is deemed a valid url by the Validators package

In [None]:
# get the showcase urls
showcase_urls = dataframe_to_be_validated.loc[:, 'Showcase URL']
showcase_urls.head()

2    
3    
4    
5    
6    
Name: Showcase URL, dtype: object

In [None]:
# get non-empty showcase urls by checking if their string length is greater than 0
showcase_col_not_empty = dataframe_to_be_validated.loc[showcase_urls.str.len().gt(0)]
non_empty_showcase_urls = showcase_col_not_empty['Showcase URL']
non_empty_showcase_urls

Series([], Name: Showcase URL, dtype: object)

In [None]:
# find showcase url errors where they don't begin with 'https://www.linkedin.com/showcase/'
showcase_url_errs = showcase_col_not_empty.loc[non_empty_showcase_urls.str.startswith('https://www.linkedin.com/showcase/') == False]
showcase_url_errs = dataframe_to_be_validated.loc[showcase_url_errs.index]
showcase_url_errs_index = showcase_url_errs.index

#### Use Validators method to validate URLs

In [None]:
# Use validators package to validate urls and assign true and false values in new column called isURLValid
non_empty_showcase_urls_dataframe = non_empty_showcase_urls.to_frame()
def isUrlValid(url):
    return True if validators.url(url) else False
non_empty_showcase_urls_dataframe['isURLValid'] = non_empty_showcase_urls_dataframe['Showcase URL'].apply(isUrlValid)

In [None]:
# Get rows where url is not deemed valid
validator_showcase_url_errs = non_empty_showcase_urls_dataframe.loc[non_empty_showcase_urls_dataframe['isURLValid'] == False]
validator_showcase_url_errs_index = validator_showcase_url_errs.index
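# Index.append returns a new index, so assign the result to include the validator errors when highlighting
showcase_url_errs_index = showcase_url_errs_index.append(validator_showcase_url_errs_index)
showcase_url_errs_index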

Int64Index([], dtype='int64')

#### Highlight the cells with errors in red

In [None]:
# Highlight all errors in red and highlight the first cell of the row in green for reference
column_index = dataframe_to_be_validated.columns.get_loc("Showcase URL") 
column_letter = xlsxwriter.utility.xl_col_to_name(column_index)
highlight_errors(showcase_url_errs_index, column_letter)
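
When combining the error indexes from the two methods, keep in mind that pandas `Index` objects are immutable: `append` returns a new index and leaves the original unchanged. A short sketch with made-up row numbers:

```python
import pandas as pd

prefix_errors = pd.Index([5, 9])
validator_errors = pd.Index([9, 12])

# append() leaves prefix_errors untouched and returns a new Index.
combined = prefix_errors.append(validator_errors)

# De-duplicate and sort before highlighting so each row is processed once.
rows_to_highlight = sorted(set(combined))
```

Row 9 appears in both inputs but only once in `rows_to_highlight`.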

## 5. Product Category ID corresponds to category/categories assigned to product

### Single Coded Products:

In [None]:
# Get a Dataframe of just the Category Ids and Category names of the products in the pipeline sheet
pipeline_sheet_category_IDs_and_names = dataframe_to_be_validated.loc[:, "Category ID":"Product Category"]

pipeline_sheet_category_IDs_and_names.head()

Unnamed: 0,Category ID,Product Category
2,1014,Content Delivery Network (CDN) Software
3,1257,Data Privacy Management Software
4,11761310,Photo Editing Software AND Image Recognition S...
5,1615,Web Hosting
6,1136,Virtual Private Cloud (VPC) Software


In [None]:
# Rename the Cat ID column of pipeline sheet to 'URN' to use as a common value for the merge of pipeline and golden dataframes
pipeline_sheet_category_IDs_and_names_renamed = pipeline_sheet_category_IDs_and_names.rename(columns={'Category ID': 'URN'})
pipeline_sheet_category_IDs_and_names_renamed.head()

Unnamed: 0,URN,Product Category
2,1014,Content Delivery Network (CDN) Software
3,1257,Data Privacy Management Software
4,11761310,Photo Editing Software AND Image Recognition S...
5,1615,Web Hosting
6,1136,Virtual Private Cloud (VPC) Software


In [None]:
# Get a Dataframe of just the Category Ids and Category names from the Golden Sheet
golden_sheet_category_IDs_and_names = golden_sheet_dataframe.loc[:, "URN":"Category Name"]
golden_sheet_category_IDs_and_names.head()

Unnamed: 0,URN,Category Name
0,,
1,1000.0,Enterprise Accounting Software
2,1001.0,Campaign Management Software
3,1002.0,Graphic Design Software
4,1003.0,Desktop Publishing Software


In [None]:
# Cast merging columns to the same dtype so that the merge is effective.
# astype() returns new dataframes, which avoids assigning into a slice of the originals (SettingWithCopyWarning):
golden_sheet_category_IDs_and_names = golden_sheet_category_IDs_and_names.astype({'URN': str})
pipeline_sheet_category_IDs_and_names_renamed = pipeline_sheet_category_IDs_and_names_renamed.astype({'URN': str})

# Move index to the dataframe to preserve it after merging dataframes (otherwise would lose index):
pipeline_sheet_category_IDs_and_names_renamed.reset_index(inplace=True)
pipeline_sheet_category_IDs_and_names_renamed.head()


Unnamed: 0,index,URN,Product Category
0,2,1014,Content Delivery Network (CDN) Software
1,3,1257,Data Privacy Management Software
2,4,11761310,Photo Editing Software AND Image Recognition S...
3,5,1615,Web Hosting
4,6,1136,Virtual Private Cloud (VPC) Software


In [None]:
# Merge both dataframes using inner join on the common URN Column. The index corresponds to the row on the pipeline sheet. NOTE - Not returning doublecoded URNs
golden_and_pipeline_sheets_merged = pd.merge(golden_sheet_category_IDs_and_names, pipeline_sheet_category_IDs_and_names_renamed, how="inner", on=["URN"])

# Get rid of surrounding whitespaces that could alter the comparison:
golden_and_pipeline_sheets_merged['Category Name'] = golden_and_pipeline_sheets_merged['Category Name'].str.strip()
golden_and_pipeline_sheets_merged['Product Category'] = golden_and_pipeline_sheets_merged['Product Category'].str.strip()

golden_and_pipeline_sheets_merged

Unnamed: 0,URN,Category Name,index,Product Category
0,1000,Enterprise Accounting Software,333,Enterprise Accounting Software
1,1001,Campaign Management Software,391,Campaign Management Software
2,1001,Campaign Management Software,742,Campaign Management Software
3,1001,Campaign Management Software,825,Campaign Management Software
4,1003,Desktop Publishing Software,17,Desktop Publishing Software
...,...,...,...,...
825,1886,Release Notes Tools,93,Release Notes Tools
826,1886,Release Notes Tools,94,Release Notes Tools
827,1886,Release Notes Tools,95,Release Notes Tools
828,1886,Release Notes Tools,96,Release Notes Tools


In [None]:
# Using the merged dataframe locate the rows where the Product Category doesn't match the value of the Category name in the golden sheet 
id_category_mismatch_errs = golden_and_pipeline_sheets_merged.loc[
          ~golden_and_pipeline_sheets_merged.apply(
                          lambda x: x['Category Name'] in x['Product Category'], axis=1)]

id_category_mismatch_errs = id_category_mismatch_errs.set_index('index').sort_index()

id_category_mismatch_errs

Unnamed: 0_level_0,URN,Category Name,Product Category
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1


In [None]:
# Get all row numbers that contain a single coded category name - ID error and look them up on the original dataframe:
id_category_mismatch_errs = dataframe_to_be_validated.loc[id_category_mismatch_errs.index]
id_category_mismatch_errs_index = id_category_mismatch_errs.index

#### Highlight the cells with errors in red

In [None]:
# Highlight all errors in red and highlight the first cell of the row in green for reference
column_index = dataframe_to_be_validated.columns.get_loc("Product Category") 
column_letter = xlsxwriter.utility.xl_col_to_name(column_index)
highlight_errors(id_category_mismatch_errs_index, column_letter)
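
The single-coded check above hinges on two ideas: moving the sheet row numbers into a column before the merge so they survive it, and then testing whether the golden category name appears in the pipeline category name. A toy sketch of both, with illustrative data in which row 5 is deliberately given the wrong category name:

```python
import pandas as pd

golden = pd.DataFrame({
    'URN': ['1014', '1615'],
    'Category Name': ['Content Delivery Network (CDN) Software', 'Web Hosting'],
})
pipeline = pd.DataFrame({
    'URN': ['1014', '1615'],
    'Product Category': ['Content Delivery Network (CDN) Software',
                         'Data Privacy Management Software'],
}, index=[2, 5])

# reset_index() turns the sheet row numbers into an 'index' column that the merge keeps.
merged = pd.merge(golden, pipeline.reset_index(), how='inner', on='URN')

# A row is an error when the golden name does not appear in the pipeline name.
is_match = merged.apply(lambda row: row['Category Name'] in row['Product Category'], axis=1)
error_rows = sorted(merged.loc[~is_match, 'index'])
```

Only sheet row 5 ends up in `error_rows`, which is the value that would be passed on for highlighting.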

### Dealing with multi-coded products

##### Preprocessing of the multi-coded columns:
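
The preprocessing in this section splits the two multi-coded columns into lists so that their lengths can be compared. For a single row the idea can be sketched as below. The function names are hypothetical, the IDs are illustrative, and splitting on `' AND '` with surrounding spaces is an assumption of this sketch (it avoids splitting inside words that happen to contain "AND").

```python
from typing import List, Tuple


def split_multi_coded(category_ids: str, category_names: str) -> Tuple[List[str], List[str]]:
    """Split a multi-coded row into its category IDs and category names."""
    ids = [part.strip() for part in category_ids.split(',')]
    names = [part.strip() for part in category_names.split(' AND ')]
    return ids, names


def counts_match(category_ids: str, category_names: str) -> bool:
    """True when both columns list the same number of categories."""
    ids, names = split_multi_coded(category_ids, category_names)
    return len(ids) == len(names)
```

A row with two IDs but only one category name fails `counts_match` and would be reported as an element-count error.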

In [None]:

def preprocess_multi_coded_df(df):
  
  """Preprocess the original Catalog + Products pipeline sheet dataframe in
  preparation for error checking. This is the step prior to the validation
  logic for the different types of errors
  (see function below "extract_multi_coded_errors").

  Arguments:
    - df: original Catalog + Products pipeline sheet dataframe
  Returns:
    - original_df_IDs_column: pandas Series holding only the column "Category ID";
      used to process the different types of formatting errors
    - multi_coded_categories: dataframe with only products that have been
      encoded into multiple categories (multiple IDs/category names); used
      to check for a differing number of elements between the Category ID and
      Product Category columns, and to get a dataframe free of formatting errors
      that is used further down the line to check if IDs and Category Names match.
  """

  # Get the column with the category IDs from the original Catalog + Products pipeline sheet:
  pipeline_sheet_prod_cat_id_column = df.loc[:, "Category ID"]

  multi_coded_categories = df.loc[
    # Pick up double- and triple-coded cases:
    pipeline_sheet_prod_cat_id_column.astype(str).str.contains(',')
    ].copy()  # copy so the assignments below don't write into a slice of df

  # Encode validly formatted multi-coded columns into lists to check that the number of elements matches on both columns:
  multi_coded_categories['Category ID'] = multi_coded_categories['Category ID'].str.split(',', expand=False)
  multi_coded_categories['Product Category'] = multi_coded_categories['Product Category'].str.split('AND', expand=False)

  return [pipeline_sheet_prod_cat_id_column, multi_coded_categories]



def extract_multi_coded_errors(df):
  
  """Main function including the validations of errors for products with 
  multiple encoded categories.
  
  Arguments:
    - df:  original Catalog + Products pipeline sheet dataframe
  Returns:
    - format_and_num_elemen_errs: formatting errors for category IDs and errors 
    for non matching number of elements between Category IDs and Category Names
    """

  # Get column with cat IDs from original df and the preprocessed multi-coded df:
  original_df_IDs_column, multi_coded_df = preprocess_multi_coded_df(df)
  
  # Pick up cases in which num elements of the column Category ID and Product Category are not the same:
  num_elem_each_column = multi_coded_df.applymap(len)
  nonmatching_num_elements = num_elem_each_column.loc[
    ~(num_elem_each_column['Category ID'] ==
      num_elem_each_column['Product Category'])]

  # Get the actual original rows with a mismatch of num elements:
  nonmatching_num_elements_original_rows = multi_coded_df.loc[nonmatching_num_elements.index]

  # [WIP] Find potential formatting errors made when entering multi-coded
  # categories and category IDs:
  ids_as_str = original_df_IDs_column.astype(str)
  long_enough_for_two_ids = ids_as_str.str.len() >= 8
  format_errs = df.loc[
      # Pick up cases such as '12341348' (no comma): [SEE CELL IMMEDIATELY BELOW]
      (long_enough_for_two_ids & ~ids_as_str.str.contains(',')) |
      # Pick up cases such as '1234 1348' or '1643 1642,1042' (a missing comma and a whitespace):
      (long_enough_for_two_ids & ids_as_str.str.contains(' ')) |
      # Pick up cases with a whitespace between IDs, such as '1642, 1042':
      (long_enough_for_two_ids & ids_as_str.str.contains(', ')) |
      # Pick up cases in which something other than a number was entered by mistake, such as 'dfs' (checks whether all characters are alphabetic):
      ids_as_str.str.isalpha()

      # Pick up cases incorrectly formatted by Google Sheets, e.g. '1002,1234' turned into '10,021,234':
      # ---- TODO ----
      ]
  
  # Concat format_errs and errs from non matching num elements in 'Category ID' and 'Product Category':
  format_and_num_elemen_errs = pd.concat([format_errs, nonmatching_num_elements_original_rows])
  # Get rid of duplicate rows for errors:
  format_and_num_elemen_errs = format_and_num_elemen_errs[~format_and_num_elemen_errs.index.duplicated(keep="first")]
  
  return format_and_num_elemen_errs
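
The format checks above can also be expressed as a single pattern test. The following is a minimal sketch, not the notebook's method: it assumes every category ID has exactly four digits (as in the examples in this notebook, such as 1176 and 1310), and the helper name `is_well_formed_category_ids` is hypothetical.

```python
import re

# Assumption: every category ID has exactly four digits (e.g. 1176, 1310).
# Adjust the pattern if the golden sheet contains IDs of another length.
CATEGORY_IDS_PATTERN = re.compile(r"\d{4}(,\d{4})*")


def is_well_formed_category_ids(value):
    """Return True for values like '1176' or '1176,1310', False otherwise."""
    return CATEGORY_IDS_PATTERN.fullmatch(str(value)) is not None


samples = ["1176", "1176,1310", "12341348", "1234 1348", "1642, 1042", "dfs", "10,021,234"]
for sample in samples:
    print(repr(sample), is_well_formed_category_ids(sample))
```

Under the four-digit assumption the pattern also rejects values reformatted with thousands separators, such as '10,021,234', which is the case still marked as TODO above.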

In [None]:

def extract_multi_coded_correct_and_errs(df):
  """ 
  Umbrella function that preprocesses, detects formatting errors, and separates 
  them from rows that don't have formatting errors, returning both in different 
  dataframes for further validation processing (checking that ID and category name match).

  Arguments:
    - df:  original Catalog + Products pipeline sheet dataframe.
  Returns: 
    - multi_coded_categories: dataframe with only products that have been 
    encoded into multiple categories (multiple IDs/category names) and that 
    don't contain formatting errors. 
    - format_and_num_elemen_errs: rows that contain formatting or 
    element-count errors, with indexes from the original dataframe."""
  
  # Get column with cat IDs from original df and the preprocessed multi-coded df:
  _, multi_coded_df = preprocess_multi_coded_df(df)

  format_and_num_elemen_errs = extract_multi_coded_errors(df)

  # Keep only the correctly encoded rows for multiple categories by dropping all the error rows
  # (errors="ignore" skips error rows that are not multi-coded and so are not in multi_coded_df):
  multi_coded_df = multi_coded_df.drop(index=format_and_num_elemen_errs.index, errors="ignore")


  return [multi_coded_df, format_and_num_elemen_errs]


In [None]:
# Call the function to get the multi-coded errors
correctly_multi_coded_categories, format_multi_coded_errs = extract_multi_coded_correct_and_errs(
    pipeline_sheet_category_IDs_and_names)

format_multi_coded_errs

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._set_item(key, value)


Unnamed: 0,Category ID,Product Category


In [None]:
# Move index to the dataframe to preserve it after merging dataframes (otherwise would lose index):
correctly_multi_coded_categories.reset_index(inplace=True)
correctly_multi_coded_categories.head()

Unnamed: 0,index,Category ID,Product Category
0,4,"[1176, 1310]","[Photo Editing Software , Image Recognition S..."
1,24,"[1341, 1037]","[Course Authoring Software , Marketing Automa..."
2,40,"[1070, 1284]","[Enterprise Messaging Software , Live Chat So..."
3,44,"[1051, 1026]","[Website Builder Software , E-Commerce Platfo..."
4,52,"[1026, 1477]","[E-Commerce Platforms , Automotive Marketing ..."


In [None]:
# Split rows that hold several coded IDs/categories into one row per ID/category pair,
# all sharing the original index (explode only works on list-valued cells):
correctly_multi_coded_categories = correctly_multi_coded_categories.explode(['Category ID', 'Product Category'])
correctly_multi_coded_categories.head()

Unnamed: 0,index,Category ID,Product Category
0,4,1176,Photo Editing Software
0,4,1310,Image Recognition Software
1,24,1341,Course Authoring Software
1,24,1037,Marketing Automation Software
2,40,1070,Enterprise Messaging Software
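
To see what the multi-column `explode` does in isolation, here is a small sketch on made-up rows. It assumes pandas 1.3 or later, where `explode` accepts a list of columns. The lists in both columns of a row must have the same length, which is why rows with mismatched element counts are removed beforehand.

```python
import pandas as pd

# Toy data standing in for the multi-coded rows (values are illustrative only).
toy = pd.DataFrame(
    {
        "Category ID": [["1176", "1310"], ["1341", "1037"]],
        "Product Category": [
            ["Photo Editing Software", "Image Recognition Software"],
            ["Course Authoring Software", "Marketing Automation Software"],
        ],
    },
    index=[4, 24],
)

# Each list element becomes its own row; the original index label is repeated.
exploded = toy.explode(["Category ID", "Product Category"])
print(exploded)
```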


In [None]:
# Rename the Cat ID column to URN to use as a common value for the merge of both dataframes
correctly_multi_coded_categories_renamed = correctly_multi_coded_categories.rename(columns={'Category ID': 'URN'})
correctly_multi_coded_categories_renamed.head()

Unnamed: 0,index,URN,Product Category
0,4,1176,Photo Editing Software
0,4,1310,Image Recognition Software
1,24,1341,Course Authoring Software
1,24,1037,Marketing Automation Software
2,40,1070,Enterprise Messaging Software


##### Final merging and checking of multi-coded:

In [None]:
# Cast merging columns to same dtype so that the merge is effective:
golden_sheet_category_IDs_and_names['URN'] = golden_sheet_category_IDs_and_names['URN'].astype(str)
correctly_multi_coded_categories_renamed['URN'] = correctly_multi_coded_categories_renamed['URN'].astype(str)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._set_item(key, value)


In [None]:
# Merge both dataframes using inner join on the common URN Column. The index corresponds to the row on the pipeline sheet. 
golden_and_pipeline_multi_coded_merged = pd.merge(golden_sheet_category_IDs_and_names, 
                                      correctly_multi_coded_categories_renamed, 
                                      how="inner", on="URN")

golden_and_pipeline_multi_coded_merged.head()

Unnamed: 0,URN,Category Name,index,Product Category
0,1000,Enterprise Accounting Software,650,Enterprise Accounting Software
1,1001,Campaign Management Software,200,Campaign Management Software
2,1008,Business Intelligence (BI) Software,193,Business Intelligence (BI) Software
3,1008,Business Intelligence (BI) Software,413,Business Intelligence (BI) Software
4,1013,Construction Management Software,193,Construction Management Software
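
One property of the inner join is worth keeping in mind: a Category ID that does not exist in the golden sheet drops out of the merged result and therefore never reaches the mismatch check. A possible way to surface such IDs, sketched here on toy data with hypothetical frame names, is a left join with `indicator=True`.

```python
import pandas as pd

# Toy stand-ins for the golden sheet and the exploded pipeline rows.
golden = pd.DataFrame({
    "URN": ["1176", "1310"],
    "Category Name": ["Photo Editing Software", "Image Recognition Software"],
})
pipeline = pd.DataFrame({
    "index": [4, 4, 9],
    "URN": ["1176", "1310", "9999"],
    "Product Category": ["Photo Editing Software", "Image Recognition Software", "Unknown Software"],
})

# A left join keeps every pipeline row; "_merge" records whether a match was found.
merged = pd.merge(pipeline, golden, how="left", on="URN", indicator=True)
unknown_id_rows = merged.loc[merged["_merge"] == "left_only", "index"].tolist()
print(unknown_id_rows)
```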


In [None]:
# Using the multi-coded dataframe to locate the rows where one or more of the Product Category doesn't match the value of the Category name in the golden sheet 
multi_id_category_mismatch_errs = golden_and_pipeline_multi_coded_merged.loc[
                      ~golden_and_pipeline_multi_coded_merged.apply(
                          lambda x: x['Category Name'] in x['Product Category'], axis=1)]

# Set the index to the original values and sort them in ascending order:
multi_id_category_mismatch_errs = multi_id_category_mismatch_errs.set_index('index').sort_index()

multi_id_category_mismatch_errs

Unnamed: 0_level_0,URN,Category Name,Product Category
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
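
The check above uses `in`, which is a substring test. It tolerates the trailing whitespace left by splitting on 'AND', but it would also accept a shorter category name contained in a longer one. A stricter comparison, sketched here as an alternative rather than the notebook's behaviour, strips whitespace and compares for equality. The function name and the name pair in the second example are made up.

```python
def category_names_match(golden_name, pipeline_name):
    """Compare two category names exactly, ignoring surrounding whitespace."""
    return golden_name.strip() == pipeline_name.strip()


# The trailing space left by splitting on 'AND' is ignored:
print(category_names_match("Photo Editing Software", "Photo Editing Software "))

# A substring test accepts this illustrative pair; the strict comparison does not:
print("Hosting" in "Web Hosting", category_names_match("Hosting", "Web Hosting"))
```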


##### Gather all errors for Category Names and Category IDs:

In [None]:
# Get all row numbers with a multi-coded format error or category mismatch error and look them up in the original dataframe (for consistency):
multi_category_format_errs = dataframe_to_be_validated.loc[format_multi_coded_errs.index]
multi_id_category_mismatch_errs = dataframe_to_be_validated.loc[multi_id_category_mismatch_errs.index]

# Concatenate both types of errors for multi-coded products:
multi_coded_errs = pd.concat([multi_category_format_errs, multi_id_category_mismatch_errs])
multi_coded_errs.head()

Unnamed: 0,Category ID,Product Category,Product ID,Product name,Product Aliases,Product Description,Active,Deprecated,Product Skill ID,Product URLS,Description Locale,Company ID - OWNER,Comp URL,Company Name,Showcase ID,Showcase URL,Company/product page for display,Customer Organizations IDs,Customer Organizations Company URLs,Source of Product,In V1.1,In MVP,Has ingested IMAGES - 2020Dec09,Has ingested VIDEOS - 2020Dec09,Product notes,New Product?,Notify Pages?


In [None]:
# Concatenate all errors for category names/category IDs (single and multi-coded):
category_IDs_and_names_errs = pd.concat([id_category_mismatch_errs, multi_coded_errs])
category_IDs_and_names_errs_index = category_IDs_and_names_errs.index

#### Highlight the cells with errors in red

In [None]:
# Highlight all errors in red and highlight the first cell of the row in green for reference
column_index = dataframe_to_be_validated.columns.get_loc("Product Category") 
column_letter = xlsxwriter.utility.xl_col_to_name(column_index)
highlight_errors(category_IDs_and_names_errs_index, column_letter)

## 6. Ensure Product URL is NOT a PDF (i.e. does not end with ‘.pdf’)


In [None]:
# Get the product urls column
product_urls = dataframe_to_be_validated.loc[:, 'Product URLS']

In [None]:
# get all product urls that end with .pdf:
product_url_pdf_errs = dataframe_to_be_validated.loc[product_urls.str.endswith('.pdf') == True, :]
product_url_pdf_errs_index = product_url_pdf_errs.index
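
`str.endswith('.pdf')` is case sensitive and looks at the whole string, so URLs such as `.../file.PDF` or `.../file.pdf?download=1` pass the check. If those should be flagged too, one option is to inspect only the URL path. This is a sketch using the standard library; the helper name and the example URLs are made up.

```python
from urllib.parse import urlparse


def points_to_pdf(url):
    """Return True when the path component of the URL ends with '.pdf' (any case)."""
    return urlparse(str(url)).path.lower().endswith(".pdf")


for url in [
    "https://example.com/brochure.pdf",
    "https://example.com/brochure.PDF?download=1",
    "https://example.com/product",
]:
    print(url, points_to_pdf(url))
```

It could be applied to the column with `product_urls.apply(points_to_pdf)` in place of the `endswith` mask.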

#### Highlight the cells with errors in red

In [None]:
# Highlight all errors in red and highlight the first cell of the row in green for reference
column_index = dataframe_to_be_validated.columns.get_loc("Product URLS") 
column_letter = xlsxwriter.utility.xl_col_to_name(column_index)
highlight_errors(product_url_pdf_errs_index, column_letter)

## 7. Ensure product categories assigned to products are NOT Group Representatives

#### Single-coded

In [None]:
# Get Is group representative column
group_rep_column = golden_sheet_dataframe.loc[:, 'Is group representative']

In [None]:
# Get all rows where 'Is group representative' is set to 'YES' (take a copy so later changes to URN don't write to a slice):
group_representatives = golden_sheet_dataframe.loc[group_rep_column.str.contains('YES') == True, :].copy()
# Sanity check that we've got the correct amount of group representatives
group_representatives.head()

Unnamed: 0,Unnamed: 1,URN,Category Name,Category Description,Aliases,See Also,UPDATED Proposed: Parent Problem Category,parent_product_category_ids,Included in MVP,active,deprecated,is_product_category,is_service_category,is_group_representative,is_solution_category_or_else_problem_category,artifact_id,artifact_name,activity_id,associated_skills_ids,Verified,Published status,adstargeting category,Parent Problem Category Suggested,Is group representative,# MVP Products in Category,# products in category,Example Product,Category Created by,Has Metadata,Evidence,Category in G2?,Only in G2,Notes,Unnamed: 34,new count,Unnamed: 36
3,3,1002,Graphic Design Software,Software used to create and edit digital images.,,Vector Graphics Software,Content Management > Content Creation > Graphi...,10009,NO,T,F,T,F,T,T,1000,Software,,,Verified,Published,False,,YES,0,,,Tea,,https://docs.google.com/spreadsheets/d/1HIpG4b...,,,Previously Graphics software.,,,
28,31,1031,Customer Support Software,Software used to assist customers with the use...,,,Customer Support,10094,NO,T,F,T,F,T,T,1000,Software,,32157.0,Verified,Published,True,,YES,0,,,Tea,,https://docs.google.com/spreadsheets/d/1HIpG4b...,,,description due to be rewritten (became group ...,,,
45,50,1053,Cloud Security Software,"Software used to protect data, applications, s...",,,Computing > IT System Security > Cloud Security,10040,NO,T,F,T,F,T,T,1000,Software,,,Verified,Published,True,,YES,#VALUE!,,,Lia,,https://docs.google.com/spreadsheets/d/1HIpG4b...,,,,,,
112,119,1127,Cybersecurity Software,Software used to protect computer systems and ...,,,Computing > IT System Security,10027,NO,T,F,T,F,T,T,1000,Software,,,Verified,Published,True,,YES,#VALUE!,,,Lia,,https://docs.google.com/spreadsheets/d/1HIpG4b...,,,,,,
113,120,1128,Network Security Software,Software used to monitor network settings and ...,,,Computing > IT System Security > Network Security,10039,NO,T,F,T,F,T,T,1000,Software,,,Verified,Published,True,,YES,0,,,Lia,,https://docs.google.com/spreadsheets/d/1HIpG4b...,,,,,,


In [None]:
# Get the urns from the rows of group representatives
group_representative_urns = group_representatives.loc[:, 'URN']
group_rep_frame = group_representative_urns.to_frame().head()

In [None]:
# Cast merging columns to same dtype so that the merge is effective:
group_representatives['URN'] = group_representatives['URN'].astype(str)
pipeline_sheet_category_IDs_and_names_renamed['URN'] = pipeline_sheet_category_IDs_and_names_renamed['URN'].astype(str)

# Move index to the dataframe to preserve it after merging dataframes (otherwise would lose index):
pipeline_sheet_category_IDs_and_names_renamed.reset_index(inplace=True)
pipeline_sheet_category_IDs_and_names_renamed.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._set_item(key, value)


Unnamed: 0,level_0,index,URN,Product Category
0,0,2,1014,Content Delivery Network (CDN) Software
1,1,3,1257,Data Privacy Management Software
2,2,4,11761310,Photo Editing Software AND Image Recognition S...
3,3,5,1615,Web Hosting
4,4,6,1136,Virtual Private Cloud (VPC) Software


In [None]:
# Merge both dataframes using inner join on the common URN Column. The index corresponds to the row on the pipeline sheet. NOTE - Not returning doublecoded URNs
group_rep_and_pipeline_sheets_merged = pd.merge(group_representatives, pipeline_sheet_category_IDs_and_names_renamed, how="inner", on=["URN"])
group_rep_err_indexs = group_rep_and_pipeline_sheets_merged['index'].to_numpy()
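
Since only the matching row labels are needed here, the same information can be obtained without a merge by testing membership with `isin`, which keeps the original index intact. A sketch on toy data (frame contents are illustrative):

```python
import pandas as pd

# URNs flagged as group representatives in the golden sheet (toy values).
group_rep_urns = pd.Series(["1002", "1031"])
pipeline = pd.DataFrame(
    {"Category ID": ["1014", "1002", "1615", "1031"]},
    index=[2, 3, 5, 6],
)

# Rows whose Category ID is a group representative; index labels are preserved.
group_rep_rows = pipeline.loc[pipeline["Category ID"].astype(str).isin(group_rep_urns)]
print(group_rep_rows.index.tolist())
```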

#### Highlight the cells with errors in red

In [None]:
# Highlight all errors in red and highlight the first cell of the row in green for reference
column_index = dataframe_to_be_validated.columns.get_loc("Category ID") 
column_letter = xlsxwriter.utility.xl_col_to_name(column_index)
highlight_errors(group_rep_err_indexs, column_letter)

#### Multi-Coded

In [None]:
# Call the function to get the multi-coded errors
correctly_multi_coded_categories, format_multi_coded_errs = extract_multi_coded_correct_and_errs(
    pipeline_sheet_category_IDs_and_names)

correctly_multi_coded_categories

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._set_item(key, value)


Unnamed: 0,Category ID,Product Category
4,"[1176, 1310]","[Photo Editing Software , Image Recognition S..."
24,"[1341, 1037]","[Course Authoring Software , Marketing Automa..."
40,"[1070, 1284]","[Enterprise Messaging Software , Live Chat So..."
44,"[1051, 1026]","[Website Builder Software , E-Commerce Platfo..."
52,"[1026, 1477]","[E-Commerce Platforms , Automotive Marketing ..."
...,...,...
915,"[1074, 1352]","[Governance, Risk Management, and Compliance (..."
924,"[1783, 1165]","[Loan Origination Software , Loan Servicing S..."
925,"[1783, 1600, 1219]","[Loan Origination Software , Retail POS Syste..."
927,"[1165, 1783]","[Loan Servicing Software , Loan Origination S..."


In [None]:
# Split rows that hold several coded IDs/categories into one row per ID/category pair,
# all sharing the original index (explode only works on list-valued cells):
correctly_multi_coded_categories_separated = correctly_multi_coded_categories.explode(['Category ID', 'Product Category'])
correctly_multi_coded_categories_separated.head()

Unnamed: 0,Category ID,Product Category
4,1176,Photo Editing Software
4,1310,Image Recognition Software
24,1341,Course Authoring Software
24,1037,Marketing Automation Software
40,1070,Enterprise Messaging Software


In [None]:
# Cast merging columns to same dtype so that the merge is effective:
group_representatives['URN'] = group_representatives['URN'].astype(str)
correctly_multi_coded_categories_separated['URN'] = correctly_multi_coded_categories_separated['Category ID'].astype(str)

# Move index to the dataframe to preserve it after merging dataframes (otherwise would lose index):
correctly_multi_coded_categories_separated.reset_index(inplace=True)
correctly_multi_coded_categories_separated.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._set_item(key, value)


Unnamed: 0,index,Category ID,Product Category,URN
0,4,1176,Photo Editing Software,1176
1,4,1310,Image Recognition Software,1310
2,24,1341,Course Authoring Software,1341
3,24,1037,Marketing Automation Software,1037
4,40,1070,Enterprise Messaging Software,1070


In [None]:
# Merge both dataframes using inner join on the common URN column. The 'index' column holds the row on the pipeline sheet.
group_rep_and_correctly_multi_coded_merged = pd.merge(group_representatives, correctly_multi_coded_categories_separated, how="inner", on=["URN"])
group_rep_multi_coded_err_indexs = group_rep_and_correctly_multi_coded_merged['index'].to_numpy()
group_rep_multi_coded_err_indexs

array([], dtype=int64)

#### Highlight the cells with errors in red

In [None]:
# Highlight all errors in red and highlight the first cell of the row in green for reference
column_index = dataframe_to_be_validated.columns.get_loc("Category ID") 
column_letter = xlsxwriter.utility.xl_col_to_name(column_index)
highlight_errors(group_rep_multi_coded_err_indexs, column_letter)

## 8. Identify required fields and add a check to ensure all of those fields are filled (except for Product ID which should be left empty) **(Needs to be adjusted for publication sheets / CHECK WEIRD PRINTING)**

<ins>Required fields for new products</ins>: Category ID, Product ID (None), Product Name, isActive, isDeprecated, Product URL, LinkedIn Company URL, Company Name, Source of Product, Locale.

<ins>Required fields for product changes</ins>: Category ID, Product ID, Product Name, isActive, isDeprecated, Product URL, LinkedIn Company URL (for STZ), Company Name, Source of Product, Locale.

The solution is to create a dataframe of errors for each field. We can then index into each field and mark its errors independently.
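
As a compact illustration of this idea, the sketch below builds one error dataframe per required field with a dictionary comprehension. It uses a toy dataframe and assumes, like the cells that follow, that empty cells are stored as empty strings.

```python
import pandas as pd

# Toy data: row 3 has no product name, row 4 has no company name.
toy = pd.DataFrame(
    {
        "Product name": ["Acme CRM", "", "Acme Mail"],
        "Company Name": ["Acme", "Acme", ""],
    },
    index=[2, 3, 4],
)
required = ["Product name", "Company Name"]

# One dataframe of offending rows per field, keyed by field name.
errors_by_field = {field: toy.loc[toy[field].eq("")] for field in required}
error_index_by_field = {field: rows.index.tolist() for field, rows in errors_by_field.items()}
print(error_index_by_field)
```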

In [None]:
# NEW PRODUCTS - Create variables with the list of the names of the columns that 
# will be accessed according to the need to have the field filled-in or left empty:
to_be_filled = ['Category ID', 'Product name', 'Active', 
 'Deprecated', 'Product URLS', 'Comp URL', 'Company Name', 
 'Source of Product', 'Description Locale']
to_leave_empty = 'Product ID'

# Create empty dictionary to store the results:
mandatory_fields = {}

In [None]:
# Execute this instead of the above only if we're applying the script to product changes or to the publication sheets:
if script_mode_1 == 'product changes' or script_mode_2 == 'publication':

  # PRODUCT CHANGES & PUBLICATION SHEET - (Comp URL filled only for STZ & Product ID filled)
  # Create variables with the list of the names of the columns that will need to be filled:
  to_be_filled = ['Category ID', 'Product name','Product ID', 'Active', 
  'Deprecated', 'Product URLS', 'Comp URL', 'Company Name', 
  'Source of Product', 'Description Locale']

  # Create empty dictionary to store the results:
  mandatory_fields = {}

In [None]:
import pprint

# initiate indexes object:
mandatory_fields_index = {}

for field in to_be_filled:

  # Get the column data for that column field:
  accessed_column = dataframe_to_be_validated.loc[:, field]

  # For Company URLs, only check for blanks when the product is STZ collected (Product ID <= 30K):
  if (field == 'Comp URL') and (script_mode_1 == 'product changes' or script_mode_2 == 'publication'):
    # Compare Product IDs as numbers; comparing strings would be lexicographic ('4000' > '30000'):
    product_ids = pd.to_numeric(dataframe_to_be_validated['Product ID'], errors='coerce')
    mandatory_fields[field] = dataframe_to_be_validated.loc[(accessed_column.eq('') == True)
     & (product_ids <= 30000), :]

    # Get indexes of errors for coloring purposes:
    mandatory_fields_index[field] = mandatory_fields[field].index

  else:
    # Else, just check in general for blank cells in the provided required fields:
    mandatory_fields[field] = dataframe_to_be_validated.loc[accessed_column.eq('') == True, :]

    # Get indexes of errors for coloring purposes:
    mandatory_fields_index[field] = mandatory_fields[field].index

  # Execute the following after last element in to_be_filled has been processed and only for new products:
  if (field == to_be_filled[-1]) and (script_mode_1 == 'new products' and script_mode_2 == 'pipeline'):  
    # Add the field to be left empty to the dictionary of mandatory fields:
    accessed_column = dataframe_to_be_validated.loc[:, to_leave_empty]
    mandatory_fields[to_leave_empty] = dataframe_to_be_validated.loc[accessed_column.eq('') == False, :]

    # Get indexes of errors for coloring purposes:
    mandatory_fields_index[to_leave_empty] = mandatory_fields[to_leave_empty].index


# Print in a 'pretty', legible way the resulting dict:
pp = pprint.PrettyPrinter(indent=4)
# pp.pprint(mandatory_fields)

mandatory_fields_index

{'Active': Int64Index([], dtype='int64'),
 'Category ID': Int64Index([], dtype='int64'),
 'Comp URL': Int64Index([], dtype='int64'),
 'Company Name': Int64Index([], dtype='int64'),
 'Deprecated': Int64Index([], dtype='int64'),
 'Description Locale': Int64Index([], dtype='int64'),
 'Product ID': Int64Index([], dtype='int64'),
 'Product URLS': Int64Index([], dtype='int64'),
 'Product name': Int64Index([], dtype='int64'),
 'Source of Product': Int64Index([], dtype='int64')}
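
The STZ condition depends on comparing Product IDs with the 30000 threshold. When IDs are read from a sheet as strings, comparison is lexicographic rather than numeric, which gives surprising results. A short standard-library illustration (the ID values are made up):

```python
ids = ["4000", "30000", "100000"]

# String comparison goes character by character.
as_strings = [pid <= "30000" for pid in ids]

# Numeric comparison after converting to int.
as_numbers = [int(pid) <= 30000 for pid in ids]

print(as_strings)
print(as_numbers)
```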

Check the errors for each given field

In [None]:
if script_mode_1 == 'new products':
  # Execute this only if we're applying the script to new products:
  print("Current script mode:", script_mode_1, "AND", script_mode_2, "\n")
  print(mandatory_fields['Product ID'].index)
else:
  # Execute this for product changes or for publication sheet validations:
  print("Current script mode:", script_mode_1, "AND", script_mode_2, "\n")
  print(mandatory_fields['Product ID'].index)

Current script mode: new products AND publication 

Int64Index([], dtype='int64')


In [None]:
mandatory_fields['Category ID']  

Unnamed: 0,Category ID,Product Category,Product ID,Product name,Product Aliases,Product Description,Active,Deprecated,Product Skill ID,Product URLS,Description Locale,Company ID - OWNER,Comp URL,Company Name,Showcase ID,Showcase URL,Company/product page for display,Customer Organizations IDs,Customer Organizations Company URLs,Source of Product,In V1.1,In MVP,Has ingested IMAGES - 2020Dec09,Has ingested VIDEOS - 2020Dec09,Product notes,New Product?,Notify Pages?


In [None]:
mandatory_fields['Product name'] 

Unnamed: 0,Category ID,Product Category,Product ID,Product name,Product Aliases,Product Description,Active,Deprecated,Product Skill ID,Product URLS,Description Locale,Company ID - OWNER,Comp URL,Company Name,Showcase ID,Showcase URL,Company/product page for display,Customer Organizations IDs,Customer Organizations Company URLs,Source of Product,In V1.1,In MVP,Has ingested IMAGES - 2020Dec09,Has ingested VIDEOS - 2020Dec09,Product notes,New Product?,Notify Pages?


In [None]:
mandatory_fields['Active'] 

Unnamed: 0,Category ID,Product Category,Product ID,Product name,Product Aliases,Product Description,Active,Deprecated,Product Skill ID,Product URLS,Description Locale,Company ID - OWNER,Comp URL,Company Name,Showcase ID,Showcase URL,Company/product page for display,Customer Organizations IDs,Customer Organizations Company URLs,Source of Product,In V1.1,In MVP,Has ingested IMAGES - 2020Dec09,Has ingested VIDEOS - 2020Dec09,Product notes,New Product?,Notify Pages?


In [None]:
mandatory_fields['Deprecated']

Unnamed: 0,Category ID,Product Category,Product ID,Product name,Product Aliases,Product Description,Active,Deprecated,Product Skill ID,Product URLS,Description Locale,Company ID - OWNER,Comp URL,Company Name,Showcase ID,Showcase URL,Company/product page for display,Customer Organizations IDs,Customer Organizations Company URLs,Source of Product,In V1.1,In MVP,Has ingested IMAGES - 2020Dec09,Has ingested VIDEOS - 2020Dec09,Product notes,New Product?,Notify Pages?


In [None]:
mandatory_fields['Product URLS']

Unnamed: 0,Category ID,Product Category,Product ID,Product name,Product Aliases,Product Description,Active,Deprecated,Product Skill ID,Product URLS,Description Locale,Company ID - OWNER,Comp URL,Company Name,Showcase ID,Showcase URL,Company/product page for display,Customer Organizations IDs,Customer Organizations Company URLs,Source of Product,In V1.1,In MVP,Has ingested IMAGES - 2020Dec09,Has ingested VIDEOS - 2020Dec09,Product notes,New Product?,Notify Pages?


In [None]:
# This one will only return the blanks for STZ products when we are validating 
# either for product changes or data in the publication sheet (based on logic set earlier):
mandatory_fields['Comp URL']

Unnamed: 0,Category ID,Product Category,Product ID,Product name,Product Aliases,Product Description,Active,Deprecated,Product Skill ID,Product URLS,Description Locale,Company ID - OWNER,Comp URL,Company Name,Showcase ID,Showcase URL,Company/product page for display,Customer Organizations IDs,Customer Organizations Company URLs,Source of Product,In V1.1,In MVP,Has ingested IMAGES - 2020Dec09,Has ingested VIDEOS - 2020Dec09,Product notes,New Product?,Notify Pages?


In [None]:
mandatory_fields['Company Name']

Unnamed: 0,Category ID,Product Category,Product ID,Product name,Product Aliases,Product Description,Active,Deprecated,Product Skill ID,Product URLS,Description Locale,Company ID - OWNER,Comp URL,Company Name,Showcase ID,Showcase URL,Company/product page for display,Customer Organizations IDs,Customer Organizations Company URLs,Source of Product,In V1.1,In MVP,Has ingested IMAGES - 2020Dec09,Has ingested VIDEOS - 2020Dec09,Product notes,New Product?,Notify Pages?


In [None]:
mandatory_fields['Source of Product']

Unnamed: 0,Category ID,Product Category,Product ID,Product name,Product Aliases,Product Description,Active,Deprecated,Product Skill ID,Product URLS,Description Locale,Company ID - OWNER,Comp URL,Company Name,Showcase ID,Showcase URL,Company/product page for display,Customer Organizations IDs,Customer Organizations Company URLs,Source of Product,In V1.1,In MVP,Has ingested IMAGES - 2020Dec09,Has ingested VIDEOS - 2020Dec09,Product notes,New Product?,Notify Pages?


In [None]:
mandatory_fields['Description Locale']

Unnamed: 0,Category ID,Product Category,Product ID,Product name,Product Aliases,Product Description,Active,Deprecated,Product Skill ID,Product URLS,Description Locale,Company ID - OWNER,Comp URL,Company Name,Showcase ID,Showcase URL,Company/product page for display,Customer Organizations IDs,Customer Organizations Company URLs,Source of Product,In V1.1,In MVP,Has ingested IMAGES - 2020Dec09,Has ingested VIDEOS - 2020Dec09,Product notes,New Product?,Notify Pages?


In [None]:
# Highlight all errors in red and highlight the first cell of the row in green for reference
for field in mandatory_fields_index.keys():
    column_index = dataframe_to_be_validated.columns.get_loc(field) 
    column_letter = xlsxwriter.utility.xl_col_to_name(column_index)
    highlight_errors(mandatory_fields_index[field], column_letter)

## 9. Validation that Active = T, and Deprecated = F for New Products, and the opposite deprecations on Product Changes. Also, check that both values are not T/T or F/F.


In [None]:
# Create a set to collect the errors
active_deprecated_errors_index = set()

In [None]:
# Get the isActive values of the collected products in the pipeline sheet
is_active_values = dataframe_to_be_validated.loc[:, "Active"]
is_active_values.head()

2    T
3    T
4    T
5    T
6    T
Name: Active, dtype: object

In [None]:
# Get the rows that don't have T marked in Active column
is_active_errors = dataframe_to_be_validated.loc[is_active_values.str.contains('T') == False, :]
is_active_errors_index = is_active_errors.index

# (Product Changes) Also get the rows that don't have F marked in the Active column for products that are to be deprecated:

if script_mode_1 == 'product changes' and script_mode_2 == 'pipeline':
  field_changed_column = dataframe_to_be_validated.loc[:, "Field Changed"]

  is_active_deprecation_errors = dataframe_to_be_validated.loc[(is_active_values.str.contains('F') == False) & (field_changed_column == 'Product Deprecation'), :]
  is_active_deprecation_errors_index = is_active_deprecation_errors.index

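  # Index.append returns a new Index rather than modifying in place, so keep the result:
  is_active_errors_index = is_active_errors_index.append(is_active_deprecation_errors_index)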

else:
  print(is_active_errors_index)

Int64Index([], dtype='int64')
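
Note that a pandas `Index` is immutable: `append` returns a new `Index` and leaves the original unchanged, so its result has to be assigned. A small sketch with made-up row labels:

```python
import pandas as pd

first = pd.Index([3, 7])
second = pd.Index([12])

first.append(second)             # result discarded, `first` is unchanged
combined = first.append(second)  # keep the returned Index

print(list(first), list(combined))
```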


In [None]:
# Get the isDeprecated values of the collected products in the pipeline sheet
is_deprecated_values = dataframe_to_be_validated.loc[:, "Deprecated"]
is_deprecated_values.head()

2    F
3    F
4    F
5    F
6    F
Name: Deprecated, dtype: object

In [None]:
# Get the rows that don't have F marked in Deprecated column
is_deprecated_errors = dataframe_to_be_validated.loc[is_deprecated_values.str.contains('F') == False, :]
is_deprecated_errors_index = is_deprecated_errors.index

# (Product Changes) Also get the rows that don't have T marked in the Deprecated column for products that are to be deprecated:

if script_mode_1 == 'product changes' and script_mode_2 == 'pipeline':
  is_deprecated_deprecation_errors = dataframe_to_be_validated.loc[(is_deprecated_values.str.contains('T') == False) & (field_changed_column == 'Product Deprecation'), :]
  is_deprecated_deprecation_errors_index = is_deprecated_deprecation_errors.index
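  # Index.append returns a new Index rather than modifying in place, so keep the result:
  is_deprecated_errors_index = is_deprecated_errors_index.append(is_deprecated_deprecation_errors_index)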

else:
  print(is_deprecated_errors_index)

Int64Index([], dtype='int64')


In [None]:
# (Both New Products and Product Changes) 
# Get rows where isActive and isDeprecated hold the same value (i.e. both 'T' or both 'F'):
conditions = (is_active_values.str.contains('T') & is_deprecated_values.str.contains('T')) | (is_active_values.str.contains('F') & is_deprecated_values.str.contains('F'))
same_value_errors_index = dataframe_to_be_validated.loc[conditions].index
same_value_errors_index

Int64Index([], dtype='int64')
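
The rules in this section can be summarised per row as a small pure-Python function. This is a sketch of the intended logic as described in the section heading, not the notebook's implementation; the function name and the `is_deprecation` flag are hypothetical.

```python
def active_deprecated_problems(active, deprecated, is_deprecation=False):
    """Return a list of problems for one row's Active/Deprecated values."""
    problems = []
    # Only the single letters 'T' and 'F' are accepted (not True/TRUE/False/FALSE).
    if active not in ("T", "F") or deprecated not in ("T", "F"):
        problems.append("values must be exactly 'T' or 'F'")
        return problems
    # T/T and F/F are never valid.
    if active == deprecated:
        problems.append("Active and Deprecated must differ")
        return problems
    # New products are active; rows that deprecate a product are not.
    expected_active = "F" if is_deprecation else "T"
    if active != expected_active:
        problems.append(f"Active should be {expected_active}")
    return problems


print(active_deprecated_problems("T", "F"))
print(active_deprecated_problems("T", "T"))
print(active_deprecated_problems("True", "F"))
print(active_deprecated_problems("T", "F", is_deprecation=True))
```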

In [None]:
# Check that the values chosen are consistent for T and F (i.e. not True/TRUE/False/FALSE instead of T and F)
nomenclature_consistency_errors = dataframe_to_be_validated.loc[(is_active_values.str.len() > 1) | (is_deprecated_values.str.len() > 1)]
nomenclature_consistency_errors_index = nomenclature_consistency_errors.index

nomenclature_consistency_errors

Unnamed: 0,Category ID,Product Category,Product ID,Product name,Product Aliases,Product Description,Active,Deprecated,Product Skill ID,Product URLS,Description Locale,Company ID - OWNER,Comp URL,Company Name,Showcase ID,Showcase URL,Company/product page for display,Customer Organizations IDs,Customer Organizations Company URLs,Source of Product,In V1.1,In MVP,Has ingested IMAGES - 2020Dec09,Has ingested VIDEOS - 2020Dec09,Product notes,New Product?,Notify Pages?


In [None]:
# Update our error set with our errors from each instance
active_deprecated_errors_index.update(is_active_errors_index, is_deprecated_errors_index, same_value_errors_index, nomenclature_consistency_errors_index)
active_deprecated_errors_index

set()
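
The combined error set works because `set.update` accepts several iterables at once and removes duplicates along the way. A minimal sketch with invented row labels (the names below are illustrative only):

```python
import pandas as pd

# Each Index stands in for the rows flagged by one of the checks
error_rows = set()
error_rows.update(pd.Index([2, 4]), pd.Index([4, 9]), pd.Index([], dtype="int64"))

# Row 4 was flagged twice but appears only once in the set
sorted_rows = sorted(error_rows)
```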

#### Highlight the cells with errors in red

In [None]:
# Highlight all errors in red and highlight the first cell of the row in green for reference
column_index = dataframe_to_be_validated.columns.get_loc("Active") 
column_letter = xlsxwriter.utility.xl_col_to_name(column_index)
highlight_errors(active_deprecated_errors_index, column_letter)

## 10.  Ensure all characters are unicode. (i.e. No special characters like Äô)
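
This check has no implementation in the notebook yet. One possible starting point is sketched below, under the assumption that any character outside the ASCII range deserves a manual look. That assumption is deliberately coarse: legitimate accented names would be flagged too. The sample rows and variable names are invented for illustration.

```python
import pandas as pd

# Invented sample values; "Product name" mirrors the sheet's column name
sample = pd.DataFrame(
    {
        "Product name": [
            "Haylix",
            "Acme\u00c4\u00f4s Cloud",  # renders as "AcmeÄôs Cloud"
            "Rackco VPS Hosting",
        ]
    },
    index=[2, 3, 4],
)

# Flag any cell containing a character outside the 7-bit ASCII range
non_ascii_mask = sample["Product name"].str.contains(r"[^\x00-\x7F]", regex=True, na=False)
non_ascii_rows = list(sample.index[non_ascii_mask])
```

Only row 3 is flagged here. A stricter check would need a list of known mis-encoded sequences, which the notebook does not define.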


## 11.  Ensure Product Skill ID is valid, i.e. between 1 and 65000




In [None]:
# First get the column of product Skill IDs:
product_skills = dataframe_to_be_validated['Product Skill ID'].astype(str)
# Get non-empty values in product skills column
non_empty_product_skills = dataframe_to_be_validated[product_skills.str.len() > 0]
#df = dataframe_to_be_validated[~product_skills.between(1,65000)]
non_empty_product_skills

Unnamed: 0,Category ID,Product Category,Product ID,Product name,Product Aliases,Product Description,Active,Deprecated,Product Skill ID,Product URLS,Description Locale,Company ID - OWNER,Comp URL,Company Name,Showcase ID,Showcase URL,Company/product page for display,Customer Organizations IDs,Customer Organizations Company URLs,Source of Product,In V1.1,In MVP,Has ingested IMAGES - 2020Dec09,Has ingested VIDEOS - 2020Dec09,Product notes,New Product?,Notify Pages?


In [None]:
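# Flag values that contain anything other than digits (letters, punctuation, spaces).
# The mask is aligned to the filtered rows explicitly to avoid a reindexing warning:
skills_containing_invalid_chars = non_empty_product_skills[~product_skills.loc[non_empty_product_skills.index].str.isdigit()]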
skills_containing_invalid_chars_index = skills_containing_invalid_chars.index

  


In [None]:
# Keep only the rows whose Skill ID consists of digits, then cast to int for the range check:
skills_containing_valid_chars = non_empty_product_skills[product_skills.loc[non_empty_product_skills.index].str.isdigit()]
skill_ids_containing_digits_only = skills_containing_valid_chars['Product Skill ID'].astype(int)
# Catch cases in which Skill ID is not between 1 and 65000:
product_skills_errs = skills_containing_valid_chars.loc[~skill_ids_containing_digits_only.between(1,65000)]
product_skills_errs_index = product_skills_errs.index
# Combine errors from both sections
all_skill_errs_index = product_skills_errs_index.append(skills_containing_invalid_chars_index)
all_skill_errs_index

  


Int64Index([], dtype='int64')
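
As a compact illustration of the range rule, the sketch below runs it on invented Skill ID cells. It swaps the two-step `str.isdigit` filter for `pd.to_numeric(..., errors="coerce")`, which is an alternative technique and not the notebook's approach. The sample values and variable names are assumptions.

```python
import pandas as pd

# Invented Skill ID cells, held as strings the way the sheet data appears to be
skill_ids = pd.Series(["120", "", "0", "70000", "12a", "65000"], index=[2, 3, 4, 5, 6, 7])

# Empty cells are allowed, so only non-empty values are checked
non_empty = skill_ids[skill_ids.str.len() > 0]

# Non-numeric text becomes NaN, and NaN never falls between the bounds
numeric_ids = pd.to_numeric(non_empty, errors="coerce")
skill_id_error_rows = list(non_empty.index[~numeric_ids.between(1, 65000)])
```

Rows 4 (`0`), 5 (`70000`) and 6 (`12a`) are flagged. Note that `pd.to_numeric` also accepts decimals such as `"12.5"`, so the `str.isdigit` filter used in the notebook is the stricter of the two.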

#### Highlight the cells with errors in red

In [None]:
# Highlight all errors in red and highlight the first cell of the row in green for reference
column_index = dataframe_to_be_validated.columns.get_loc("Product Skill ID") 
column_letter = xlsxwriter.utility.xl_col_to_name(column_index)
highlight_errors(all_skill_errs_index, column_letter)

## 12.  Make sure there are no line breaks in product names or descriptions



In [None]:
dataframe_to_be_validated.head()

Unnamed: 0,Category ID,Product Category,Product ID,Product name,Product Aliases,Product Description,Active,Deprecated,Product Skill ID,Product URLS,Description Locale,Company ID - OWNER,Comp URL,Company Name,Showcase ID,Showcase URL,Company/product page for display,Customer Organizations IDs,Customer Organizations Company URLs,Source of Product,In V1.1,In MVP,Has ingested IMAGES - 2020Dec09,Has ingested VIDEOS - 2020Dec09,Product notes,New Product?,Notify Pages?
2,1014,Content Delivery Network (CDN) Software,24110,jsDelivr CDN,,,T,F,,https://www.bootstrapcdn.com/,en_US,,https://www.linkedin.com/company/jsdelivr/,jsDelivr,,,,,,Builtwith Data,YES,YES,,,,YES,
3,1257,Data Privacy Management Software,24105,Tarte Au Citron,,,T,F,,https://tarteaucitron.io/en/,fr_FR,,https://www.linkedin.com/company/tarteaucitron...,tarteaucitron.js,,,,,,Builtwith Data,YES,YES,,,,YES,
4,11761310,Photo Editing Software AND Image Recognition S...,24104,Air recon DL,,,T,F,,https://apps.gehealthcare.com/app-products/air...,en_US,,https://www.linkedin.com/company/gehealthcare/,GE Healthcare,,,,,,LSS Top Companies,YES,YES,,,,YES,
5,1615,Web Hosting,24086,Haylix,,,T,F,,https://www.haylix.com/,en_US,,https://www.linkedin.com/company/haylix/,Haylix,,,,,,Builtwith Data,YES,YES,,,,YES,
6,1136,Virtual Private Cloud (VPC) Software,24085,Rackco VPS Hosting,,,T,F,,https://www.rackco.com/vps-hosting/,en_US,,https://www.linkedin.com/company/rackco/,Rackco,,,,,,Builtwith Data,YES,YES,,,,YES,


In [None]:
all_columns = dataframe_to_be_validated.columns

# Initialize a dict that maps each column to the rows containing line breaks:
linebreak_errs_index = {}

# Check that there is no line break ("\n") in any of them:
for column in all_columns:

  # Skip Product ID for now: it is parsed as int64 rather than str, so the .str check below does not apply to it:
  if column == 'Product ID':
    # dataframe_to_be_validated[column] = dataframe_to_be_validated[column].to_string()
    pass
  
  else:
    # **Duplicated column names (e.g. having two columns called 'error type') could throw errors, try to name them differently**
    linebreak_errs_index[column] = dataframe_to_be_validated.loc[dataframe_to_be_validated[column].str.contains('\n', na=False, regex=False)]
    # Get the indexes of the errors (for cell-coloring purposes):
    linebreak_errs_index[column] = linebreak_errs_index[column].index

# ** str.contains returns NaN for double-coded values in the Category ID field, which is why we force na to be False so that we can still create
# the mask to access the rows with the errors (otherwise it would throw a "cannot mask with NaN values" error).

In [None]:
linebreak_errs_index

{'Active': Int64Index([], dtype='int64'),
 'Category ID': Int64Index([], dtype='int64'),
 'Comp URL': Int64Index([], dtype='int64'),
 'Company ID - OWNER': Int64Index([], dtype='int64'),
 'Company Name': Int64Index([], dtype='int64'),
 'Company/product page for display': Int64Index([], dtype='int64'),
 'Customer Organizations Company URLs': Int64Index([], dtype='int64'),
 'Customer Organizations IDs': Int64Index([], dtype='int64'),
 'Deprecated': Int64Index([], dtype='int64'),
 'Description Locale': Int64Index([], dtype='int64'),
 'Has ingested IMAGES - 2020Dec09': Int64Index([], dtype='int64'),
 'Has ingested VIDEOS - 2020Dec09': Int64Index([], dtype='int64'),
 'In MVP': Int64Index([], dtype='int64'),
 'In V1.1': Int64Index([], dtype='int64'),
 'New Product?': Int64Index([], dtype='int64'),
 'Notify Pages?': Int64Index([], dtype='int64'),
 'Product Aliases': Int64Index([], dtype='int64'),
 'Product Category': Int64Index([], dtype='int64'),
 'Product Description': Int64Index([], dtype=

In [None]:
# Highlight all errors in red and highlight the first cell of the row in green for reference
for column in all_columns:
    if column == 'Product ID':  # Until we solve a small bug with that column
      pass
    else:
      column_index = dataframe_to_be_validated.columns.get_loc(column) 
      column_letter = xlsxwriter.utility.xl_col_to_name(column_index)
      highlight_errors(linebreak_errs_index[column], column_letter)

## 13.  Check that 'Showcase URL' and 'Showcase ID' are not placed in each other's fields

In [None]:
showcase_url_column, showcase_id_column = dataframe_to_be_validated['Showcase URL'], dataframe_to_be_validated['Showcase ID']

# Get non url data (errors) on the Showcase URL column:
showcase_url_errs = dataframe_to_be_validated[(showcase_url_column.str.startswith('https://www.linkedin.com/showcase') == False) & (showcase_url_column.str.len() > 0)]

# Get url data (errors) on the Showcase ID column:
showcase_id_errs = dataframe_to_be_validated[(showcase_id_column.str.startswith('https://www.linkedin.com/showcase') == True) & (showcase_id_column.str.len() > 0)]

showcase_url_errs_index = showcase_url_errs.index
showcase_id_errs_index = showcase_id_errs.index
showcase_errors = showcase_url_errs_index.append(showcase_id_errs_index)
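
A small worked example may make the swap check easier to follow. The rows below are invented (including the `acme` vanity name), and the helper names are hypothetical. The logic mirrors the idea of the cell above: a non-empty Showcase URL must start with the showcase prefix, and a Showcase ID must not.

```python
import pandas as pd

SHOWCASE_PREFIX = "https://www.linkedin.com/showcase"

toy = pd.DataFrame(
    {
        "Showcase ID": ["123456", "https://www.linkedin.com/showcase/acme/", ""],
        "Showcase URL": ["https://www.linkedin.com/showcase/acme/", "123456", ""],
    },
    index=[2, 3, 4],
)

url_col, id_col = toy["Showcase URL"], toy["Showcase ID"]

# Non-empty Showcase URL cells that are not showcase URLs
url_errors = toy.index[(url_col.str.len() > 0) & ~url_col.str.startswith(SHOWCASE_PREFIX)]
# Showcase ID cells that hold a URL instead of an ID
id_errors = toy.index[id_col.str.startswith(SHOWCASE_PREFIX)]

swapped_rows = sorted(set(url_errors) | set(id_errors))
```

Row 3, where the two values are swapped, is the only one flagged. Row 4 is left alone because both cells are empty.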

In [None]:
# Highlight all errors in red and highlight the first cell of the row in green for reference
column_index = dataframe_to_be_validated.columns.get_loc("Showcase ID") 
column_letter = xlsxwriter.utility.xl_col_to_name(column_index)
highlight_errors(showcase_errors, column_letter)

## 14.  Check that product name and product ID match for product changes

In [None]:
# Get the Product ID and Product name columns from the product changes sheet.
# reset_index() moves the index into a column so that it is preserved after merging (otherwise it would be lost):
changes_id_name_columns = dataframe_to_be_validated[['Product ID', 'Product name']].reset_index()
changes_id_name_columns.head()

Unnamed: 0,index,Product ID,Product name
0,2,24110,jsDelivr CDN
1,3,24105,Tarte Au Citron
2,4,24104,Air recon DL
3,5,24086,Haylix
4,6,24085,Rackco VPS Hosting


In [None]:
# Get the product ID and name columns from the product catalog sheet and rename the ID column to match the changes df so we can merge.
# rename() without inplace returns a new DataFrame, which avoids pandas' SettingWithCopyWarning on the column slice:
catalog_id_name_columns = product_catalog_dataframe[['productId', 'productName']]
catalog_id_name_columns = catalog_id_name_columns.rename(columns={"productId": "Product ID"})


In [None]:
# merge dataframes on Product ID column
product_changes_product_catalog_merged = pd.merge(catalog_id_name_columns, changes_id_name_columns, how="inner", on=["Product ID"])
product_changes_product_catalog_merged

Unnamed: 0,Product ID,productName,index,Product name


In [None]:
# Get rid of surrounding whitespaces that could alter the comparison:
product_changes_product_catalog_merged['productName'] = product_changes_product_catalog_merged['productName'].str.strip()
product_changes_product_catalog_merged['Product name'] = product_changes_product_catalog_merged['Product name'].str.strip()

In [None]:
if script_mode_1 == 'product changes':

  # Using the merged dataframe, locate the rows where the catalog name is not contained in the changes-sheet name
  product_name_mismatch_errs = product_changes_product_catalog_merged.loc[
            ~product_changes_product_catalog_merged.apply(
                            lambda x: x['productName'] in x['Product name'], axis=1)]

  product_name_mismatch_errs = product_name_mismatch_errs.set_index('index').sort_index()

  # Shift the index to match the sheet
  product_name_mismatch_errs.index += 2
  product_name_mismatch_errs_index = product_name_mismatch_errs.index
  product_name_mismatch_errs_index

In [None]:
if script_mode_1 == 'product changes':

  # Highlight all errors in red and highlight the first cell of the row in green for reference
  column_index = dataframe_to_be_validated.columns.get_loc("Product name") 
  column_letter = xlsxwriter.utility.xl_col_to_name(column_index)
  highlight_errors(product_name_mismatch_errs_index, column_letter)
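
The merge-and-compare flow of this section can be sketched end to end on invented data. The catalog and change-sheet rows below are made up. The comparison uses exact equality after stripping whitespace, which is stricter than the substring test (`in`) used in the notebook.

```python
import pandas as pd

# Invented catalog and change-sheet rows; column names mirror the notebook
catalog = pd.DataFrame({"Product ID": [1, 2, 3], "productName": ["Alpha", "Beta ", "Gamma"]})
changes = pd.DataFrame(
    {"Product ID": [1, 2, 3], "Product name": ["Alpha", "Beta", "Gama"]},
    index=[2, 3, 4],
)

# reset_index() keeps the sheet row labels as an ordinary "index" column through the merge
merged = pd.merge(catalog, changes.reset_index(), how="inner", on="Product ID")

# Strip surrounding whitespace, then compare the two names for exact equality
names_match = merged["productName"].str.strip() == merged["Product name"].str.strip()
mismatch_rows = sorted(merged.loc[~names_match, "index"])
```

Only the row labelled 4 is reported: `Gama` differs from `Gamma`, while the trailing space in `Beta ` is removed before the comparison.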

## 15.  Dupe Checks

