In [1]:
import numpy as np
import pandas as pd

Products URL is considered the "gold-standard". We need to look through the log file to 1) find corrupted URL's, and 2) write and algorithm to correct them. 

In [2]:
# Products URL should be considered "gold standard"

products_df = pd.read_csv('../data/products.csv')
log_df = pd.read_csv('../data/log2.csv',
                     names=['sentiment', 'publication_URL', 'product_URL',
                     'clickORnot', 'gender', 'age_group'])
product_categories = pd.read_csv('../data/product_categories.csv')

### Question 1
*Some of the Product_URLs in the log file might have been corrupted. Write a Python (or PySpark) procedure to determine which Product_URLs are corrupted. Let us assume that if a Product_url in the log file doesn’t occur in the products table, it is regarded as corrupted. Using this procedure identify and list the corrupted URLs. (10)*

The code below uses the simple list matching function `np.in1d()` to quickly identify which URLs in the log file are not present in hte products file.

In [3]:
# Pull out URLs and strip leading/lagging whitespace
product_urls = products_df['product_URL'].str.strip().values
log_urls = log_df['product_URL'].str.strip().values

products_df['product_URL']  = product_urls
log_df['product_URL'] = log_urls

# Check which elements of log_urls are NOT an exact match in product_urls
url_mask = ~np.in1d(log_urls, product_urls)
corrupt_df = log_df[url_mask]
corrupt_df

Unnamed: 0,sentiment,publication_URL,product_URL,clickORnot,gender,age_group
83,negative,https://www.cbsnews.com/,https://haier.com/refrigermtors,0,female,young
109,neutral,https://mashable.com/,https://sony.comftelevisions,1,female,juvenile
123,negative,https://www.thedailybeast.com/,https://lg.com/gashers,0,female,middle-age
171,neutral,https://www.cnn.com/,https://leks.com/jeans,0,female,senior
203,neutral,https://www.nytimes.com/,https://InstantPot.con/cookers,1,female,young
...,...,...,...,...,...,...
9729,neutral,https://www.usnews.com/,http://nejoK.co/blenders,0,male,senior
9773,negative,https://www.nbcnews.com/,https://maytag.cpm/washers,1,male,middle-age
9845,positive,https://www.foxnews.com/,https://guessmcom/perfumes,1,female,middle-age
9881,neutral,https://www.usnews.com/,https://samsuag.com/televisions,0,female,young


### Question 2
*For each corrupted URL what will you do with it? Don’t assume that for each corrupted URL the correct approach is to delete that log entry. What if the URL contained ‘.cam’ instead of ‘.com’ but otherwise corresponded with a URL in the ‘products’ table? In that case the proper approach would be to correct the URL. In other cases, the URL might be so corrupted that the best approach would be to delete that log entry (the entire row). Describe your approach to dealing with corrupted URLs. That is, describe your approach to determining that a URL is too corrupted to be rescued. It must describe a) a procedure for determining the degree to which the URL is corrupted, b) a threshold for determining in terms of this degree of corruption whether it can be corrected, and c) for those which can be corrected, identifying its corrected form. For extra credit implement this in a Python (or PySpark) program. (25 + 20 points for extra-credit)*

For the corrupted URLs, we are going to assume that there is one intended URL in 'products.csv'. To pair corrupted URLs to real URLs, we will utilize the "edit distance" to identify how many insertions, deletions, or character changes are necessary to transform a corrupted URL into one of the true URLs. Formally this is referred to a Levenshtein distance  

Assuming that it is unlikely that a single character is corrupted, I will set a edit distance threshold of 3 such that any URLs that are 3 edits or greater away from any of the ground truth URLs will be considered "too corrupted". Additionally, if any corrupted URLs have the same edit distance from 2 or more ground truth URLs they will not be included in the database.

In [31]:
# Utilizing natural language toolkit implementation: https://www.nltk.org/api/nltk.metrics.distance.html
from nltk.metrics.distance import edit_distance
from functools import partial

dist_dict = dict()
for true_url in product_urls:
    # Create function to calculate edit distance to true URL
    dist_func = partial(edit_distance, s2=true_url)
    dist_dict[true_url] = corrupt_df['product_URL'].map(dist_func)

dist_df = pd.DataFrame(dist_dict).set_index(corrupt_df['product_URL']).transpose()
dist_df
    

product_URL,https://haier.com/refrigermtors,https://sony.comftelevisions,https://lg.com/gashers,https://leks.com/jeans,https://InstantPot.con/cookers,https://lenova.comslaptops,https://broyhill.cvm/recliners,https://apple.cfm/ipads,https://soundwavemai/speakers,https://haieq.com/refrigerators,...,https://basilbasel.io/perfunes,https://cougar.co/jeaas,https://levia.com/jeans,https://Ikea.lom/sofas,https://apple.com/iqads,http://nejoK.co/blenders,https://maytag.cpm/washers,https://guessmcom/perfumes,https://samsuag.com/televisions,https://InstantPotycom/cookers
https://vitamix.com/blenders,16,16,12,13,14,14,13,14,16,16,...,17,14,12,14,13,9,12,15,15,14
https://lenova.com/laptops,16,14,10,10,16,1,17,12,17,16,...,18,13,9,12,12,12,12,14,14,16
https://InstantPot.com/cookers,19,19,14,16,1,16,16,16,16,19,...,18,16,15,14,15,15,13,18,18,1
http://nemoK.co/blenders,17,16,12,12,15,12,16,14,16,17,...,19,13,12,15,13,1,14,15,16,16
https://HamiltonBeach/blenders,19,19,16,18,18,17,17,16,16,19,...,17,18,18,19,16,12,16,20,19,18
https://Lavazza.com/coffee,17,19,13,13,14,14,16,13,18,17,...,16,14,11,11,12,16,13,14,17,14
https://Starbucks.com/coffee,19,20,15,13,14,17,17,15,20,19,...,17,15,15,14,14,18,15,14,18,14
https://centrum.com/vitamins,16,15,14,11,16,13,16,13,17,17,...,19,13,11,14,12,15,14,14,15,16
https://NordicTrack.com/treadmills,19,18,19,18,21,19,18,19,20,21,...,20,17,16,19,18,20,19,20,17,21
https://NordicTrack.com/rowers,18,18,15,16,15,17,16,17,18,19,...,18,15,14,15,16,18,15,18,19,15


In [68]:
dist_threshold = 3

# Make list of corrupted URLs that can be corrected
fixed_mask = list()
fixed_urls = list()
for col_idx in range(len(dist_df.columns)):
    url_distances = dist_df.iloc[:, col_idx]
    edit_counts = url_distances.value_counts()
    min_dist = edit_counts.min()

    if (min_dist > dist_threshold) or edit_counts[min_dist] > 1:
        fixed_mask.append(False)
    else:
        min_idx = np.where(url_distances.values == min_dist)
        assert len(min_idx) == 1
        
        fixed_mask.append(True)
        fixed_urls.append(product_urls[min_idx[0][0]])
        print(f'edit dist={min_dist}; {dist_df.columns[col_idx]}, {fixed_urls[-1]}')



edit dist=1; https://haier.com/refrigermtors, https://haier.com/refrigerators
edit dist=1; https://sony.comftelevisions, https://sony.com/televisions
edit dist=1; https://lg.com/gashers, https://lg.com/washers
edit dist=1; https://leks.com/jeans, https://lees.com/jeans
edit dist=1; https://InstantPot.con/cookers, https://InstantPot.com/cookers
edit dist=1; https://lenova.comslaptops, https://lenova.com/laptops
edit dist=1; https://broyhill.cvm/recliners, https://broyhill.com/recliners
edit dist=1; https://apple.cfm/ipads, https://apple.com/ipads
edit dist=1; https://soundwavemai/speakers, https://soundwave.ai/speakers
edit dist=1; https://haieq.com/refrigerators, https://haier.com/refrigerators
edit dist=1; https://reminxton.com/shavers, https://guess.com/perfumes
edit dist=1; https://kaxi.com/handbags, https://kaai.com/handbags
edit dist=1; https://delk.com/computers, https://dell.com/computers
edit dist=1; https://apple.com/tomputers, https://apple.com/computers
edit dist=1; https://

In [69]:
print(f'num corrupted: {len(corrupt_df)}; num recovered: {len(fixed_urls)}')

num corrupted: 216 num recovered: 216


As we can see above, our algorithm was able to identify a matching URL for every corrupted URL. Therefore we can simply substitute the "fixed" URLs into the 'product_URL' field of `log_df`

In [77]:
log_urls_fixed = log_df['product_URL'].values
log_urls_fixed[url_mask] = fixed_urls

# Store fixed URLs in new columns alongside boolean indicating which rows were fixed
log_df['product_URL_fixed'] = log_urls_fixed
log_df['url_fixed'] = url_mask

### Question 3
*For each product, compute all the Publication_URLs containing an ad for that product. (Don’t just give the results. Show all the work by which you got those results. This applies to all the questions below.) (10)*

To gather all the publication URLs, we need to combine information from `log2.csv` and `products.csv`. Since 'publication_URL' is present in both tables.