#**Investigation of *items* database and correlations between items**

**EDA, NLP, Feature Generation**

Andreas Theodoulou and Michael Gaidis (May, 2020)

##**summary of Version 3 additions, May 25, 2020:**

1. Try to reduce the number of clusters to something closer to 250 instead of 2000+... (doing this after reading that decision trees don't like to have to many categories within a single feature)

##**summary of Version 2, May 9, 2020:**

I refined the "delimiter" characters, and did a bit more cleaning on some stuff I noticed with "blu-ray" vs. "bluray" vs. "bd"...
--> generated a new set of unique n-grams for n=1 to highest n, and filtered to include only where there are at least 2 item names containing that n-gram

--> created "word vector" representations of the item names in items dataset, including roughly 4000 elements (all of the delimited n-grams mentioned above)

--> for each of the 21700 items, they were encoded as word vectors, and then I used dot-product to identify which items were similar to others.  In the encoding of the word vectors, I did some "weighting" such that I didn't just have a word vector that was all 0's except for 1's in locations representing the particular n-grams that are found in the item name string (English translation).  I used 2 types of weighting: 
1. like TF-IDF, I counted the number of occurrences of each of the roughly 4000 n-grams in the set of all item names, and I binned the n-grams to more heavily weight n-grams that have less representation in the item names. (For example, "klompferstietnitz" is more valuable than "dvd", because so many item names contain "dvd".  So, the former 1-gram gets a larger integer inserted into its location in the word vector.)  I used binning rather than strict TF-IDF "continuous" weighting because I believe there is a cutoff at which a term no longer holds much weight at all.
2. longer n-grams get heavier weight as well... if two item names have the same 10-word string (10-gram), it is much more relevant than if they have the same 1-word string (1-gram).

--> after running the dot products between items in the 21700 x 4000 size matrix containing word vectors for each item, I end up with a 21700 x 21700 matrix containing integers = dot products of the word vectors i,j for cell at i,j coming from item i and item j.

--> then, I found a nifty jit-accelerated function (reference below) to pick out the top K largest dot-product values for a given item.  The function gives me the top K items and their dot-product values with the item of interest.  I run this function on all 21700 items, and pick (at first the top 3, but now...) the top 10 highest dot-product values for a given item.

**To date** I have only then taken the 5100 items that we know are in the test set (I was afraid of overwhelming the computing power or memory allocation in Colab).  
--> so, now I have a dataset with 5100 rows (each test item), and columns for the top-10 matching item ids as per the dot product, and for the actual top-10 dot products also.

--> I "explode" (unravel) the list of 10 matching items and dot product values so now I have 51000 rows and columns indicating item id of interest (one of the 5100 in the test set), the top-10 matching item ids, and a column for the dot product values between the two.

**Now we need to create features from this**

**First, look for clusters of tightly-matched item-item pairs**

I decided to use the networkX package to map these item pairs into an undirected graph with nodes = item ids, and edges weighted by the value of the dot product.  Then, I can utilize some of the pre-made algorithms that can automatically identify strongly-clustered groups.

Before feeding the 51000 row matrix into the graph, I applied a threshold so only dot products above a certain value would be allowed as nodes/edges in the graph.  (This is to filter out "matches" where both items have some common term like "dvd" but nothing else.  But, as some of the item names are short an nondescriptive, even this "dvd" match can place the item-item pair in the top 10.  Of course, you could have 1000 items like this that match the subject item with the same dot-product value, but the aforementioned algorithm just picks the first 10.)

Ok, so it goes into the graph, and I apply a "community" algorithm that takes into account the weighted edge values (preferentially grouping together items that have higher dot-product values).
I didn't find a way to get much control over how many clusters are identified by the algorithm.  It seems to go up roughly linearly in the number of nodes(items) in the graph.  Anyhow, I get about 1400 clusters for my 5100 input items.  These clusters contain item_ids both in and not in the test set, as the graph was made with 51000 edges (minus about 10000 from thresholding) that used top-10 matches with the 5100 test items.  These top-10 matches may or may not be in the test set.

These 1400 clusters have anywhere from 2 items (one edge) to perhaps 100 items.  I computed an average dot product value between all elements in a cluster, and used that to estimate the overall "strength" of clustering.  This provides a natural way to do category encoding.  I simply use the integer average of cluster dot-products as the "cluster category code".  (One minor complication is that some clusters have identical averages... I gave a small boost to the clusters with greater number of elements, so n=2, avg=300 might get a category value of 300, whereas n=5 avg=300 might get category value = 320)

I assign all items in a given cluster the same "cluster category code".

Any items that do not belong to a cluster (either because they didn't make the top-10 list for any of the test item matches, or because the thresholding eliminated them from inclusion in the graph) were assigned a "cluster category code" equal to their original item_category_code.  The cluster category code is a minimum of 2x or 3x larger than the largest original item category code (83), and the cluster category code can be quite a bit larger ... 100x or more, for the strongest-matching clusters.

</br>

**This was all done with the v1.1 items EDA ipynb on GitHub, and the dataset containing the "cluster category codes" is saved as csv.gz in the data_output directory.  You can just load in that dataset and use the cluster category code column (alongside the item_id column) as a feature in the model.  It shouldn't need further category encoding.**

I'm now working on v2.0 of this items EDA (this file), to remove unnecessary code stragglers, and I hope to try a graph/clustering with all 21700 items rather than just the 5100 test items.



#0. Configure Environment
**NOT OPTIONAL**

In [1]:
# General python libraries/modules used throughout the notebook
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator, FormatStrFormatter, AutoMinorLocator
import numpy as np
from scipy import sparse
import seaborn as sns
from numba import jit, prange
import networkx as nx
from networkx.algorithms import community, cluster

import os
import feather   # this is 3x to 8x faster than pd.read_csv and pd.to_hdf, but file size is 2x hdf and 10x csv.gz
import pickle
import string
from itertools import product
import re
from collections import OrderedDict
import json
import time
import datetime
from time import sleep, localtime, strftime, tzset, strptime
os.environ['TZ'] = 'EST+05EDT,M4.1.0,M10.5.0'
tzset()

# Magics
%matplotlib inline


# NLP packages
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer 
lemmatizer = WordNetLemmatizer() 

# # ML packages
# from sklearn.linear_model import LinearRegression

# !pip install catboost
# from catboost import CatBoostRegressor 

# %tensorflow_version 2.x
# import tensorflow as tf
# import keras as K

# # List of the modules we need to version-track for reference
modules = ['pandas','matplotlib','numpy','scipy','numba','seaborn','sklearn','tensorflow','keras','catboost','pip','nltk','networkx']
print(f'done: {strftime("%a %X %x")}')

  import pandas.util.testing as tm


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
done: Thu 09:45:55 05/28/20


In [2]:
# Notebook formatting
# Adjust as per your preferences.  I'm using a FHD monitor with a full-screen browser window containing my IPynb notebook

# format pandas output so we can see all the columns we care about (instead of "col1  col2  ........ col8 col9", we will see "col1 col2 col3 col4 col5 col6 col7 col8 col9" if it fits inside display.width parameter)
pd.set_option("display.max_columns",30)  
pd.set_option("display.max_rows",100)     # Override pandas choice of how many rows to show, so, for example, we can see the full 84-row item_category dataframe instead of the first few rows, then ...., then the last few rows
pd.set_option("display.width", 250)       # Similar to the above for showing more rows than pandas defaults to, we can show more columns than default, if we tune this to our monitor window size
pd.set_option("max_colwidth", None)

#pd.set_option("display.precision", 3)  # Nah, this is helpful, but below is even better
#Try to convince pandas to print without decimal places if a number is actually an integer (helps keep column width down, and highlights data types)
pd.options.display.float_format = lambda x : '{:.0f}'.format(x) if round(x,0) == x else '{:,.3f}'.format(x)
print(f'done: {strftime("%a %X %x")}')

done: Thu 09:45:55 05/28/20


#0.99999) Mount Google Drive (Local File Storage/Repo)

In [3]:
# click on the URL link presented to you by this command, get your authorization code from Google, then paste it into the input box and hit 'enter' to complete mounting of the drive
from google.colab import drive  
drive.mount('/content/drive')
print(f'done: {strftime("%a %X %x")}')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
done: Thu 09:45:55 05/28/20


#1. Load Data Files



##1.1) Enter Data File Names and Paths

**NOT Optional**

In [4]:
#  FYI, data is coming from a public repo on GitHub at github.com/migai/Kag
# List of the data files (path relative to GitHub master), to be loaded into pandas DataFrames
data_files = [  #"readonly/final_project_data/shops.csv",
                #"readonly/final_project_data/sample_submission.csv.gz",
                #"data_output/shops_transl.csv",
                "data_output/items_transl.csv",
                #"readonly/final_project_data/item_categories.csv",
                #"data_output/items_clustered_22170.csv.gz",
                #"readonly/en_50k.csv",
                #"data_output/item_categories_transl.csv",
                "data_output/shops_augmented.csv",
                #"readonly/final_project_data/items.csv",
                "data_output/sales_train_cleaned.csv.gz",
                "data_output/item_categories_augmented.csv",
                "data_output/items_new.csv",
                "data_output/shops_new.csv",
                "readonly/final_project_data/sales_train.csv.gz",
                "readonly/final_project_data/test.csv.gz"
              ]


# Dict of helper code files, to be loaded and imported {filepath : import_as}
code_files = {}  # not used at this time; example dict = {"helper_code/kaggle_utils_at_mg.py" : "kag_utils"}


# GitHub file location info
git_hub_url = "https://raw.githubusercontent.com/migai"
repo_name = 'Kag'
branch_name = 'master'
base_url = os.path.join(git_hub_url, repo_name, branch_name)

print(f'done: {strftime("%a %X %x")}')

done: Thu 09:45:55 05/28/20


##1.2) Load Data Files

In [5]:
%%time
# 4.5sec with csv and csv.gz files (including conversion to datetime)
# 3.6sec with csv and csv.gz files (no datetime conversion)

'''
############################################################
############################################################
'''
# Replace this path with the path on *your* Google Drive where the repo master branch is stored
#   (on GitHub, the remote repo is located at github.com/migai/Kag --> below is my cloned repo location)
GDRIVE_REPO_PATH = "/content/drive/My Drive/Colab Notebooks/NRUHSE_2_Kaggle_Coursera/final/Kag"
OUT_OF_REPO_PATH = "/content/drive/My Drive/Colab Notebooks"   # place > 100MB files here, because they won't sync with GitHub
'''
############################################################
############################################################
'''

# do feather files manually for now
%cd "{OUT_OF_REPO_PATH}"
testtrain_mrg_ftr = True
testtrain_mrg = pd.read_feather('testtrain_mrg.ftr', columns=None, use_threads=True);


%cd "{GDRIVE_REPO_PATH}"

print("Loading Files from Google Drive repo into Colab...\n")

# Loop to load the data files into appropriately-named pandas DataFrames
for path_name in data_files:
    filename = path_name.rsplit("/")[-1]
    data_frame_name = filename.split(".")[0]
    exec(data_frame_name + " = pd.read_csv(path_name)")
    # if data_frame_name == 'sales_train':
    #     sales_train['date'] = pd.to_datetime(sales_train['date'], format = '%d.%m.%Y')
    print("Data Frame: " + data_frame_name)
    print(eval(data_frame_name).head(2))
    print("\n")
print(f'done: {strftime("%a %X %x")}')

/content/drive/My Drive/Colab Notebooks
/content/drive/My Drive/Colab Notebooks/NRUHSE_2_Kaggle_Coursera/final/Kag
Loading Files from Google Drive repo into Colab...

Data Frame: items_transl
                                                              item_name  item_id  item_category_id                                                           en_item_name
0                             ! ВО ВЛАСТИ НАВАЖДЕНИЯ (ПЛАСТ.)         D        0                40                                           ! POWER IN glamor (PLAST.) D
1  !ABBYY FineReader 12 Professional Edition Full [PC, Цифровая версия]        1                76  ! ABBYY FineReader 12 Professional Edition Full [PC, Digital Version]


Data Frame: shops_augmented
                       shop_name  shop_id                       en_shop_name shop_city shop_category shop_federal_district  shop_city_population  shop_tested
0  !Якутск Орджоникидзе, 56 фран        0  ! Yakutsk Ordzhonikidze, 56 Franc   Yakutsk          Shop          

#2. Explore Data (EDA), Clean Data, and Generate Features

#2.x) ***items*** and ***item_categories*** Datasets: 
EDA, Cleaning, Correlations, and Feature Generation

---



---



In [6]:
if not testtrain_mrg_ftr:  # we already have this loaded from feather file
    # merge dataframes so we can do closer analysis of item dependence on shop and categories
    test_prep = test.copy(deep=True)
    test_prep['date_block_num'] = 34
    test_prep['date'] = '2015-11-30' #pd.Timestamp(year=2015, month=11, day=30)
    sales_traintest_clean_mrg = sales_train_cleaned.append(test_prep).fillna(0)
    testtrain_mrg = sales_traintest_clean_mrg.merge(items_new[['item_id','item_category_id','item_tested']],on='item_id',how='left').reset_index(drop=True)
    testtrain_mrg = testtrain_mrg.merge(items_transl[['item_id','en_item_name']],on='item_id',how='left').reset_index(drop=True)
    testtrain_mrg = testtrain_mrg.merge(item_categories_augmented[['item_category_id','en_cat_name','item_cat_tested','item_category3','item_category4']],on='item_category_id',how='left').reset_index(drop=True)
    testtrain_mrg = testtrain_mrg.merge(shops_augmented[['shop_id', 'en_shop_name', 'shop_city', 'shop_federal_district',  'shop_city_population',  'shop_tested']], on='shop_id',how='left').reset_index(drop=True)
    testtrain_mrg = testtrain_mrg.merge(shops_new[['shop_id', 'shop_type', 'fd_popdens',  'fd_gdp']], on='shop_id',how='left').reset_index(drop=True)
    testtrain_mrg = testtrain_mrg[['date', 'date_block_num', 'item_price', 'item_cnt_day', 'shop_id', 'item_id', 'en_item_name', 'item_tested', 'item_category_id', 'en_cat_name', 'item_cat_tested',
                                'item_category3', 'item_category4', 'en_shop_name', 'shop_type','shop_tested', 'shop_federal_district', 'fd_popdens', 'fd_gdp', 'shop_city', 'shop_city_population']]
    testtrain_mrg.columns = ['date', 'month', 'price', 'sales', 'shop_id', 'item_id', 'item_name', 'it_test', 'item_category_id', 'item_category_name', 'it_cat_test', 'item_cat3', 'item_cat4', 
                            'shop_name', 'sh_cat', 'sh_test', 'district', 'fd_popdens', 'fd_gdp', 'city', 'population']
    testtrain_mrg.date = pd.to_datetime(testtrain_mrg.date, format='%Y-%m-%d')

    # optional save file as feather type (big file; don't store inside repo) and/or csv.gz type (inside repo)
    # %cd "{OUT_OF_REPO_PATH}"
    # testtrain_mrg.to_feather('testtrain_mrg.ftr')
    # %cd "{GDRIVE_REPO_PATH}"
    # # alternative, or, in addition, can save as csv.gz for < 100 MB storage and sync with GitHub
    # compression_opts = dict(method='gzip',
    #                         archive_name='testtrain_mrg.csv')  
    # testtrain_mrg.to_csv('data_output/testtrain_mrg.csv.gz', index=False, compression=compression_opts)

print(f'done: {strftime("%a %X %x")}\n')
testtrain_mrg.tail()

done: Thu 09:46:01 05/28/20



Unnamed: 0,date,month,price,sales,shop_id,item_id,item_name,it_test,item_category_id,item_category_name,it_cat_test,item_cat3,item_cat4,shop_name,sh_cat,sh_test,district,fd_popdens,fd_gdp,city,population
3128463,2015-11-30,34,0,0,45,18454,Sat. Union 55,True,55,Music - CD of local production,True,Music,Music,"Samara mall of ""Parkhouse""",Mall,True,Volga,Intermediate,Low,Samara,1134730
3128464,2015-11-30,34,0,0,45,16188,Board Game Nano Curling,True,64,Gifts - Board Games,True,Gifts,Gifts,"Samara mall of ""Parkhouse""",Mall,True,Volga,Intermediate,Low,Samara,1134730
3128465,2015-11-30,34,0,0,45,15757,Novikov Aleksandr New Collection,True,55,Music - CD of local production,True,Music,Music,"Samara mall of ""Parkhouse""",Mall,True,Volga,Intermediate,Low,Samara,1134730
3128466,2015-11-30,34,0,0,45,19648,Terem - TEREMOK sb.m / f (Region),True,40,Movie - DVD,True,Movies,Movies,"Samara mall of ""Parkhouse""",Mall,True,Volga,Intermediate,Low,Samara,1134730
3128467,2015-11-30,34,0,0,45,969,3 DAYS TO KILL (BD),True,37,Movie - Blu-Ray,True,Movies,Movies,"Samara mall of ""Parkhouse""",Mall,True,Volga,Intermediate,Low,Samara,1134730


In [7]:
# look at sales by week or quarter as well as sales by month

# # make a new dataframe that only includes Sept 2014
# trans2014sept = transactions.loc[:][(transactions['date'] >= pd.Timestamp(year=2014, month=9, day=1)) 
#            & (transactions['date'] < pd.Timestamp(year=2014, month=10, day=1))]
tt = testtrain_mrg.copy(deep=True)
tt.insert(1,'day',0)
tt.day = tt.date.apply(lambda x: (datetime.datetime(x.year,x.month,x.day) - datetime.datetime(2013,1,1)).days )
tt.insert(2,'week',0)
tt.week = tt.day // 7
tt.insert(4,'quarter',0)  # 3 month chunks, with final one being months 31,32,33
tt.quarter = (tt.month + 2) // 3

print(f'done: {strftime("%a %X %x")}\n')
tt.tail()

done: Thu 09:46:18 05/28/20



Unnamed: 0,date,day,week,month,quarter,price,sales,shop_id,item_id,item_name,it_test,item_category_id,item_category_name,it_cat_test,item_cat3,item_cat4,shop_name,sh_cat,sh_test,district,fd_popdens,fd_gdp,city,population
3128463,2015-11-30,1063,151,34,12,0,0,45,18454,Sat. Union 55,True,55,Music - CD of local production,True,Music,Music,"Samara mall of ""Parkhouse""",Mall,True,Volga,Intermediate,Low,Samara,1134730
3128464,2015-11-30,1063,151,34,12,0,0,45,16188,Board Game Nano Curling,True,64,Gifts - Board Games,True,Gifts,Gifts,"Samara mall of ""Parkhouse""",Mall,True,Volga,Intermediate,Low,Samara,1134730
3128465,2015-11-30,1063,151,34,12,0,0,45,15757,Novikov Aleksandr New Collection,True,55,Music - CD of local production,True,Music,Music,"Samara mall of ""Parkhouse""",Mall,True,Volga,Intermediate,Low,Samara,1134730
3128466,2015-11-30,1063,151,34,12,0,0,45,19648,Terem - TEREMOK sb.m / f (Region),True,40,Movie - DVD,True,Movies,Movies,"Samara mall of ""Parkhouse""",Mall,True,Volga,Intermediate,Low,Samara,1134730
3128467,2015-11-30,1063,151,34,12,0,0,45,969,3 DAYS TO KILL (BD),True,37,Movie - Blu-Ray,True,Movies,Movies,"Samara mall of ""Parkhouse""",Mall,True,Volga,Intermediate,Low,Samara,1134730


In [8]:
tt.head(25)

Unnamed: 0,date,day,week,month,quarter,price,sales,shop_id,item_id,item_name,it_test,item_category_id,item_category_name,it_cat_test,item_cat3,item_cat4,shop_name,sh_cat,sh_test,district,fd_popdens,fd_gdp,city,population
0,2013-01-02,1,0,0,0,999.0,1,59,22154,Scene 2012 (BD),True,37,Movie - Blu-Ray,True,Movies,Movies,"Yaroslavl shopping center ""Altair""",SEC,True,Central,Populous,Intermediate,Yaroslavl,606730
1,2013-01-03,2,0,0,0,899.0,1,25,2552,DEEP PURPLE The House Of Blue Light LP,False,58,Music - Vinyl,True,Music,Music,"Moscow SEC ""Atrium""",SEC,True,Central,Populous,Intermediate,Moscow,10381222
2,2013-01-05,4,0,0,0,899.0,-1,25,2552,DEEP PURPLE The House Of Blue Light LP,False,58,Music - Vinyl,True,Music,Music,"Moscow SEC ""Atrium""",SEC,True,Central,Populous,Intermediate,Moscow,10381222
3,2013-01-06,5,0,0,0,1709.05,1,25,2554,DEEP PURPLE Who Do You Think We Are LP,False,58,Music - Vinyl,True,Music,Music,"Moscow SEC ""Atrium""",SEC,True,Central,Populous,Intermediate,Moscow,10381222
4,2013-01-15,14,2,0,0,1099.0,1,25,2555,DEEP PURPLE 30 Very Best Of 2CD (Businesses).,False,56,Music - CD production firm,True,Music,Music,"Moscow SEC ""Atrium""",SEC,True,Central,Populous,Intermediate,Moscow,10381222
5,2013-01-10,9,1,0,0,349.0,1,25,2564,DEEP PURPLE Perihelion: Live In Concert DVD (Cyrus).,False,59,Music - Music video,False,Music,Music,"Moscow SEC ""Atrium""",SEC,True,Central,Populous,Intermediate,Moscow,10381222
6,2013-01-02,1,0,0,0,549.0,1,25,2565,DEEP PURPLE Stormbringer (firms).,False,56,Music - CD production firm,True,Music,Music,"Moscow SEC ""Atrium""",SEC,True,Central,Populous,Intermediate,Moscow,10381222
7,2013-01-04,3,0,0,0,239.0,1,25,2572,DEFTONES Koi No Yokan,False,55,Music - CD of local production,True,Music,Music,"Moscow SEC ""Atrium""",SEC,True,Central,Populous,Intermediate,Moscow,10381222
8,2013-01-11,10,1,0,0,299.0,1,25,2572,DEFTONES Koi No Yokan,False,55,Music - CD of local production,True,Music,Music,"Moscow SEC ""Atrium""",SEC,True,Central,Populous,Intermediate,Moscow,10381222
9,2013-01-03,2,0,0,0,299.0,3,25,2573,DEL REY LANA Born To Die,False,55,Music - CD of local production,True,Music,Music,"Moscow SEC ""Atrium""",SEC,True,Central,Populous,Intermediate,Moscow,10381222


In [0]:
# To Do:
'''
re.findall isn't consistent... sometimes gives a list of an array of tuples (many of which are empty string, but a matching string will be in any one of the tuple positions... and, possibly more than one??), so x=[("",""," dvd","")] and x[0] gives the tuple, and x[0][0]=''
sometimes gives a list of a single string, so x[0] = 'abbyy' and x[0][0]='a'
--> maybe add an extra df column for ngrams to append, and merge it with the delim_item_strings after all 'cleaning' is done

explicitly concatenate the item category name as a n-gram string, with n the same for all categories (pad as needed); don't add it to string before separating

focus on any of the 84 categories that have the most items or the most spread in behavior... split apart the ones with odd behavior, or the ones that are sold by  a subset of shops consistently
(original train set: item_cat_id.counts()
(original train set: item_cat_id - item_cnt_day.value_counts()
(group by item_cat_id (agg: counts); then do shop_id.counts()
(group by month and item_cat_id; look at sum of sales by item_cat_id for last xx months, and characterize the various groups with min/max/std over past n months, for several n)
--> look after stripping the version numbers, etc. off the games and software, and create new groups based on only that??
--> can also try combining similar of the 84 categories (e.g., all playstations, or all xboxes, or all tickets/cards/...) and see if we have more consistent performance within a category
--> can also look at top 50 clusters created by NLP, and see how correlated their sales are, within a cluster, vs. uncorrelated outside a cluster

###
maybe create a combination category column like grouping certain shop-item pairs, or shop-item_category pairs, or shop_cat-item_cat pairing

####
look at most common n-grams for the cleaned/non-delimited item name
(start with n=15 and work backwards to n=1)
clean/replace as much as possible without overly distorting the item name
then:
for each n:
--> split clean string into consecutive n-grams and put them in df columns, where the # columns in df = largest number of n-grams for any of the item names (search to find longest name, and calculate how many columns are needed/used)
--> in each column, do count_values... perhaps do a combined unique() to get all the n-grams, then put into an array or series containing the n-gram string and the sum of value_counts over each column
--> sort by frequency, and choose the ? top 100 and ? bottom 100 (for sum>1)
--> convert some of the desired n-grams into n+x grams to reflect the relative importance?
'''

print(f'done: {strftime("%a %X %x")}')

done: Wed 13:19:47 05/27/20


#2.5) ***items*** Dataset: EDA, Cleaning, Correlations, and Feature Generation

---



---



###2.5.1) Initial data exploration and Russian -> English translation

####Thoughts regarding items dataframe
Let's first look at how many training examples we have to work with...

Many of the items have similar names, but slightly different punctuation, or only very slightly different version numbers or types.  (e.g., 'Call of Duty III' vs. 'Call of Duty III DVD')

One can expect that these two items would have similar sales in general, and by grouping them into a single feature category, we can eliminate some of the overfitting that might come as a result of the relatively small ratio of (training set shop-item-date combinations = 2935849)/(total number of unique items = 22170).  (This is an average of about 132 rows in the sales_train data for each shop-item-date combination that we are using to train our model.  Our task is to produce a monthly estimate of sales (for November 2015), so it is relevant to consider training our model based on how many sales in a month vs. how many sales in the entire training set.  Given that the sales_train dataset covers the time period from January 2013 to October 2015 (34 months), we have on average fewer than 4 shop-item combinations in our training set for a given item in any given month.  Furthermore, as we are trying to predict for a particular month (*November* 2015), it is relevant to consider how many rows in our training set occur in the month of November.  The sales_train dataset contains data for two 'November' months out of the total 34 months of data.  Another simple calculation gives us an estimate that our training set contains on average 0.23 shop-item combinations per item for November months.

To summarize:

*  *sales_train* contains 34 months of data, including 2935849 shop-item-date combinations
*  *items* contains 22170 "unique" item_id values

In the *sales_train* data, we therefore have:
*  on average, 132 rows with a given shop-item pair for a given item_id
*  on average, 4 rows with a given shop-item pair for a given item_id in a given month
*  on average, 0.23 rows with a given shop-item pair for a given item_id in all months named 'November'

If we wish to improve our model predictions for the following month of November, it behooves us to use monthly grouping of sales, or, even better, November grouping of sales.  This smooths out day-to-day variations in sales for a better monthly prediction.  However, the sparse number of available rows in the *sales_train* data will contribute to inaccuracy in our model training and predictions.

Imagine if we could reduce the number of item_id values from 22170 to perhaps half that or even less.  Given that the number of rows for training (per item, on a monthly or a November basis) is so small, then such a reduction in the number of item_id values would have a big impact.  (The same is true for creating features to supplement "shop_id" so as to group and reduce the individuality of each shop - and thus effectively create, on average, more rows of training data for each shop-item pair.

####Translate and Ruminate
We will start by translating the Russian text in the dataframe, and add our ruminations on possible new features we can generate.

The dataframe *items_transl* (equivalent to *items* plus a column for English translation) is saved as a .csv file so we do not have to repeat the translation process the next time we open a Google Colab runtime.

In [0]:
print(items_transl.info())
print("\n")
print(items_transl.tail(10))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22170 entries, 0 to 22169
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   item_name         22170 non-null  object
 1   item_id           22170 non-null  int64 
 2   item_category_id  22170 non-null  int64 
 3   en_item_name      22170 non-null  object
dtypes: int64(2), object(2)
memory usage: 692.9+ KB
None


                                                   item_name  item_id  item_category_id                                           en_item_name
22160                             ЯРМАРКА ТЩЕСЛАВИЯ (Регион)    22160                40                                   Vanity Fair (Region)
22161                       ЯРОСЛАВ. ТЫСЯЧУ ЛЕТ НАЗАД э (BD)    22161                37                YAROSLAV. Thousands of years ago e (BD)
22162                                                 ЯРОСТЬ    22162                40                                         

###2.5.2) **NLP for feature generation from items data**
Automate the search for commonality among items, and create new categorical feature to prevent overfitting from close similarity between many item names

####**Delimited Groups of Words**

Investigating "special" delimited word groups (like this) or [here] or /hobbitville/ that are present in item names, and may be particularly important in creating n>1 n-grams for uniquely identifying items so that we can tell if two items are the same or nearly the same

#####Some details on the approach, and code for helper functions to clean and separate the text:

In [0]:
# explanation of regex string I'm using to parse the item_name
'''

^\s+|\s*[,\"\/\(\)\[\]]+\s*|\s+$

gm
1st Alternative ^\s+
^ asserts position at start of a line
\s+ matches any whitespace character (equal to [\r\n\t\f\v ])
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)

2nd Alternative \s*[,\"\/\(\)\[\]]+\s*
\s* matches any whitespace character (equal to [\r\n\t\f\v ])
* Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
Match a single character present in the list below [,\"\/\(\)\[\]]+
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
, matches the character , literally (case sensitive)
\" matches the character " literally (case sensitive)
\/ matches the character / literally (case sensitive)
\( matches the character ( literally (case sensitive)
\) matches the character ) literally (case sensitive)
\[ matches the character [ literally (case sensitive)
\] matches the character ] literally (case sensitive)
\s* matches any whitespace character (equal to [\r\n\t\f\v ])
* Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)

3rd Alternative \s+$
\s+ matches any whitespace character (equal to [\r\n\t\f\v ])
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
$ asserts position at the end of a line

Global pattern flags
g modifier: global. All matches (don't return after first match)
m modifier: multi line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)
'''
print(f'done: {strftime("%a %X %x")}')  # prevent Jupyter from printing triple-quoted comments

In [0]:
# This cell contains no code to run; it is simply a record of some inspections that were done on the items database

# before removing undesirable characters / punctuation from the item name,
#   let's see if we can find n-grams or useful describers or common abbreviations by looking between the nasty characters
# first, let's see what characters are present in the en_item_name column
'''
nasty_symbols = re.compile('[^0-9a-zA-Z ]')
nasties = set()
for i in range(len(items_transl)):
  n = nasty_symbols.findall(items_transl.at[i,'en_item_name'])
  nasties = nasties.union(set(n))
print(nasties)
{'[', '\u200b', 'ñ', '(', ')', '.', 'à', '`', 'ó', '®', 'Á', 
'\\', 'è', '&', '-', ':', 'ë', '_', 'û', '»', '=', '+', ']', ',', 
'«', 'ú', "'", 'ö', '#', 'ä', ';', 'ü', '"', 'ô', '/', '№', 'é', 
'í', '!', '°', 'å', '*', 'ĭ', 'ð', '?', 'â'}
'''
# From the above set of nasty characters, it looks like slashes, single quotes, double quotes, parentheses, and square brackets might enclose relevant n-grams
# Let's pull everything from en_item_name that is inside ' ', " ", (), or [] and see how many unique values we get, and if they are n-grams or abbreviations, for example
# It also seems that many of the item names end in a single character "D" for example, which should be converted to DVD

# ignore the :&+' stuff for now...
# Let's set up columns for ()[]-grams, for last string in the name, and for first string in name, and for text that precedes ":", and for text that surrounds "&" or "+"
#   but first, we will strip out every nasty character except ()[]:&+'"/ and replace the nasties with spaces, then eliminating double spaces

'''
# sanity check:
really_nasty_symbols = re.compile('[^0-9a-zA-Z \(\)\[\]:&+\'"/]')
really_nasties = set()
for i in range(len(items_transl)):
  rn = really_nasty_symbols.findall(items_transl.at[i,'en_item_name'])
  really_nasties = really_nasties.union(set(rn))
print(really_nasties)
{'\u200b', 'ñ', '.', 'à', '`', 'ó', '®', 'Á', '\\', 'è', '-', 'ë', '_', 'û', '»', '=', ',', '«', 'ú', 'ö', '#', 'ä', ';', 'ü', 'ô', '№', 'é', 'í', '!', '°', 'å', '*', 'ĭ', 'ð', '?', 'â'}
OK, looks good
'''
print(f'done: {strftime("%a %X %x")}')  # prevent Jupyter from printing triple-quoted comments

In [0]:
#  Start by defining stopwords and delimiters and punctuation that we wish to remove
#  Then, create a couple of functions to use for text cleaning, and for extracting delimited text n-grams

# stopwords to remove from item names (these are only a bit better than arbitrary selections from large stopwords lists -- may be worth adjusting them)
stop_words = "a,the,an,only,more,are,any,on,your,just,it,its,has,with,for,by,from".split(",")

# pre-compile regex strings to use for fast symbol removal or delimiting
nasty_symbols_re = re.compile(r'[^0-9a-zA-Z ]')  # remove all punctuation
really_nasty_symbols_re = re.compile(r'[^0-9a-zA-Z ,;\"\/\(\)\[\]\:\-\@]')  # remove nasties, but leave behind the delimiters
delimiters_re = re.compile(r'[,;\"\/\(\)\[\]\:\-\@\u00AB\u00BB~<>]')  # unicodes are << and >> thingies
# special symbols indicating a delimiter --> a space at start or end of item name will be removed at split time, along with ,;/()[]:"-@~<<>><>
delim_pattern_re = re.compile(r'^\s+|\s*[,;\"\/\(\)\[\]\:\-\@\u00AB\u00BB~<>]+\s*|\s+$') 
multiple_whitespace_re = re.compile(r'[ ]{2,}')

# pre-compile some specific regex strings to deal with inconsistencies in item names (more of this will be done later, after delimiting)
cleanup_text = {}
cleanup_text['preorder'] = re.compile(r'pre.?order')
cleanup_text[' dvd'] = re.compile(r'\s+d$')  #several item names end in "d" -- which actually seems to indicate dvd (because the items I see are in category 40: Movies-DVD)... standardize so d --> dvd
cleanup_text['digital version'] = re.compile(r'digital in$') # several items seem to end in "digital in"... maybe in = internet?, but looking at nearby items/categories, 'digital version' looks standard
cleanup_text['bluray dvd'] = re.compile(r'\bbd\b|\bblu\s+ray\b|\bblu\-ray\b|\bblueray\b|\bblue\s+ray\b|\bblue\-ray\b')
cleanup_text['007 : james bond : skyfall'] = re.compile(r'\bskyfall\b|\bskayfoll\b')
cleanup_text[' and '] = re.compile(r'[\&\+]')
cleanup_text[' xbox'] = re.compile(r'\bx[^0-9a-zA-Z ]box')  # anything like "x box" or "x-box" or "x%box" gets converted to a standard "xbox"
cleanup_text[' ps'] = re.compile(r'\bp[^0-9a-zA-Z ]s')      # attempt to do the same with "p-s4" --> "ps4"

def maid_service(text):
    """
    Compact routine to implement multiple regex substitutions using the above 'cleanup_text' dictionary
    """
    text = text.lower()
    for repl_text, pattern in cleanup_text.items():
        text = pattern.sub(repl_text, text)
    #r = re.compile(r'\bskayfoll\b')   # can add 'quickie' items here if you don't want to add to above dictionary, or if you want to perform something other than re.sub
    #text = r.sub('skyfall',text)  
    return text

def text_total_clean(text):
    """
    Gives a punctuation-free, cleaned, lemmatized version of the original English translation
    inputs: (text): the original en_item_name single-string, uncleaned, translated version of the Russian item name
    returns: single-string text, made lowercase, stripped of "really_nasties" and multiple spaces, and every word lemmatized
    """
    text = maid_service(text)
    text = delimiters_re.sub(" ", text)  # replace all delimiters with a space; other nasties get simply deleted
    text = nasty_symbols_re.sub("", text)  # delete anything other than letters, numbers, and spaces
    text = multiple_whitespace_re.sub(" ", text)  # replace multiple spaces with a single space
    text = text.strip() # remove whitespace around string
    # lemmatize each word
    text = " ".join([lemmatizer.lemmatize(w) for w in text.split(" ") if w not in stop_words])
    return text

def text_clean_delimited(text):
    """
    Gives a punctuation-free, cleaned version of the original English translation, 
        but the function returns a list of strings instead of a single string,
        with each element in the list corresponding to text that was separated from neighboring
        text with one of the above-defined 'delimiter' characters
        (so, rather than analyzing the full item name for n-grams, we define an item's important
        n-grams as being separated by such delimiters.  It greatly reduces the number of n-grams we need to analyze)
    inputs: (text): the original en_item_name single-string, uncleaned, translated version of the Russian item name
    returns: en_item_name made lowercase, stripped of "really_nasties" and multiple spaces, 
        in a list of strings that had been separated by one of the above 'delimiters',
        and, with every word in every string lemmatized 
    """
    text = maid_service(text)
    text = really_nasty_symbols_re.sub("", text)  # just delete the nasty symbols
    text = multiple_whitespace_re.sub(" ", text)  # replace multiple spaces with a single space
    text = delim_pattern_re.split(text)           # split item_name at all delimiters, irrespective of number of spaces before or after the string or delimiter
    text = [x.strip() for x in text if x != ""]           # remove empty strings "" from the list of split items in text, and remove whitespace outside text n-gram
    # lemmatize each word
    lemtext = []
    for ngram in text:
        lemtext.append(" ".join([lemmatizer.lemmatize(w) for w in ngram.split(" ") if w not in stop_words]))
    return lemtext

print(f'done: {strftime("%a %X %x")}')

#####Add 'delimited' and 'cleaned' data columns; shorten the titles of other columns so dataframe fits better on the screen

In [0]:
items_delimited = items_transl.copy(deep=True)
# delete the wide "item_name" column so we can read more of the data table width-wise
items_delimited = items_delimited.drop("item_name", axis=1).rename(columns = {'en_item_name':'item_name','item_category_id':'i_cat_id'})
items_in_test_set = test.item_id.unique()
items_delimited["i_tested"] = False
for i in items_in_test_set:
  items_delimited.at[i,"i_tested"] = True


# add item_category name with delimiter to the item_name, as this will be useful info for grouping similar items (remove delimiting punctuation from cat names first, so it stays as one chunk of text)
items_delimited['item_name'] = items_delimited.apply(lambda x: text_total_clean(item_categories_augmented.at[x.i_cat_id,'en_cat_name']) + " : " + x.item_name, axis=1)

# add a column of simply cleaned text without any undesired punctuation or delimiters
items_delimited['clean_item_name'] = items_delimited['item_name'].apply(text_total_clean)

# now add a column of lists of delimited (cleaned) text
items_delimited['delim_name_list'] = items_delimited['item_name'].apply(text_clean_delimited)

# remove duplicate entries and single-character 1-grams to assist with operations to come later in this notebook
alphnum = list(string.ascii_lowercase) + list('1234567890')  # get rid of all length=1 1-grams
def remove_dupes_singles(gramlist):
    unwanted = set(alphnum)
    dupe_gramset = unwanted
    return [x for x in gramlist if x not in dupe_gramset and not dupe_gramset.add(x)]
items_delimited.delim_name_list = items_delimited.delim_name_list.apply(lambda x: remove_dupes_singles(x) )


# have a look at what we got with our delimited text globs
def maxgram(gramlist):
    maxg = 0
    for g in gramlist:
        maxg = max(maxg,len(g.split()))
    return maxg
items_delimited['d_len'] = items_delimited.delim_name_list.apply(lambda x: len(x))
items_delimited['d_maxgram'] = items_delimited.delim_name_list.apply(maxgram)

#items_delimited.to_csv("data_output/items_delimited.csv", index=False)

print(f'done: {strftime("%a %X %x")}')
print("\n")
print(items_delimited.describe())
print("\n")
print(items_delimited.iloc[31][:])
print("\n")
items_delimited.head()

done: Wed 10:11:08 05/27/20


         item_id  i_cat_id  d_len  d_maxgram
count      22170     22170  22170      22170
mean  11,084.500    46.291  3.347      4.496
std    6,400.072    15.941  1.336      1.984
min            0         0      1          2
25%    5,542.250        37      2          3
50%   11,084.500        40      3          4
75%   16,626.750        58      4          5
max        22169        83     13         17


item_id                                                                              31
i_cat_id                                                                             37
item_name                           movie bluray dvd : 007: COORDINATES "SKAYFOLL» (BD)
i_tested                                                                           True
clean_item_name       movie bluray dvd 007 coordinate 007 james bond skyfall bluray dvd
delim_name_list    [movie bluray dvd, 007, coordinate, james bond, skyfall, bluray dvd]
d_len                              

Unnamed: 0,item_id,i_cat_id,item_name,i_tested,clean_item_name,delim_name_list,d_len,d_maxgram
0,0,40,movie dvd : ! POWER IN glamor (PLAST.) D,False,movie dvd power in glamor plast dvd,"[movie dvd, power in glamor, plast, dvd]",4,3
1,1,76,"program home and office digital : ! ABBYY FineReader 12 Professional Edition Full [PC, Digital Version]",False,program home and office digital abbyy finereader 12 professional edition full pc digital version,"[program home and office digital, abbyy finereader 12 professional edition full, pc, digital version]",4,6
2,2,40,movie dvd : *** In the glory (UNV) D,False,movie dvd in glory unv dvd,"[movie dvd, in glory, unv, dvd]",4,2
3,3,40,movie dvd : *** BLUE WAVE (Univ) D,False,movie dvd blue wave univ dvd,"[movie dvd, blue wave, univ, dvd]",4,2
4,4,40,movie dvd : *** BOX (GLASS) D,False,movie dvd box glass dvd,"[movie dvd, box, glass, dvd]",4,2


In [0]:
# do some more text manipulation to help ensure items are properly grouped
#   also, expand the breadth of n-gram matches to ignore things like version number, in an effort to reduce the final number of clusters that are generated
#   (looking for perhaps 200 clusters instead of 2000+ that we get without this extra treatment (see 'items_nlp_clusters_v3...ipynb' ))

highlight_roots = OrderedDict()
cleanup_sub = OrderedDict()
cleanup_final = OrderedDict()
#cleanup_complete_replace = OrderedDict()


# for some matches, I want to only make a new entry in the list to standardize games to root values (e.g., "assasin creed special ops" = "assasin creed part 2")
#     The new list element will be a 5-gram, for example, to give it substantial weight when grouping items
#     The original list of delimited text will remain the same, so as to catch matches like "assasin creed special ops dvd bluray english version"

# use replacement text = '' if you want the create operation to use the match as a base string (possibly adding to it with fill_strings until n_gram size is reached)
games1 = "adventure of tintin|advanced warfare|army of two|assasin creed|angry bird|batman|battlefield|behind enemy line|black ops|borderland|call of duty|chaggington funny train"
games2 = "child of light|dark soul|dead space|disney infinity|dragon age|elder scroll|far cry|final fantasy|game of throne|god of war|grand theft auto|harry potter|james bond|(lord of ring|hobbit)"
games3 = "mario|masha and bear|max payne|medal of honor|men in black|metal gear solid|mickey mouse|might and magic|modern warfare|mortal kombat|nba|need speed|nhl|ninja storm|pirate of car\w*\b"
games4 = "plant v zombie|pro evolution|resident evil|secret of unicorn|shadow of mordor|sherlock holmes|sid meiers civilization|skylander|sniper elite|star war|stick of truth|street fighter"
games5 = "tiger wood|tom clancy(s)?|tomb raider|transformer|walking dead|warhammer|watch dog|witcher|world of warcraft"
popular_games = re.compile(rf'\b({games1}|{games2}|{games3}|{games4}|{games5})\b')
highlight_roots['compress game names to root values'] =     {'optype':['create'], 'reg_pattern':popular_games, 
                                                                'replacement_text':'', 'final_gram_n':5, 
                                                                'fill_strings':['game','computer','electronic','multirelease']}

lego = re.compile(r'\blego\b')
#lego = re.compile(rf'\b(lego.*({games1}|{games2}|{games3}|{games4}|{games5})?|({games1}|{games2}|{games3}|{games4}|{games5}).*lego)\b')
highlight_roots['lego products'] =                          {'optype':['create'], 'reg_pattern':lego,
                                                                'replacement_text':'lego brand lego style game'}

popular_companies = re.compile(rf'\b(1c|abbyy)\b')
highlight_roots['highlight product origins'] =              {'optype':['create'], 'reg_pattern':popular_companies, 
                                                                'replacement_text':'', 'final_gram_n':4, 
                                                                'fill_strings':['educational','software','learning']}

# for some of the matches, I just want to do an inplace substitution
fix_accessory_game = re.compile(r'\baccessory game\b')
cleanup_sub['game accessory'] =         {'optype':['sub'], 'replacement_text':'game accessory', 'reg_pattern':fix_accessory_game}

biz = re.compile(r'\b(firm|enterprise|company|corporation|shop|store|outlet)\b')
cleanup_sub['standardize biz'] =        {'optype':['sub'], 'replacement_text':'business', 'reg_pattern':biz}

digit = re.compile(r'\b(digital|download|online)(\s?(version|edition|set|box set))?\b')
cleanup_sub['special edition'] =        {'optype':['sub'], 'replacement_text':'online digital version', 'reg_pattern':digit}

special = re.compile(r'\b(collector|premier|platinum|special|suite)(.*(version|edition|set|box set|suite))?\b')
cleanup_sub['special edition'] =        {'optype':['sub'], 'replacement_text':'special version', 'reg_pattern':special}

std = re.compile(r'\b(standard|std)(\s?(edition|version|set|box set))?\b')
cleanup_sub['standard edition'] =       {'optype':['sub'], 'replacement_text':'standard version', 'reg_pattern':std}

russia = re.compile(r'\b(russian|ru)(\s?(edition|version|set|box set|documentation|instruction|language|format|subtitle|feature))?\b')
cleanup_sub['russian version'] =        {'optype':['sub'], 'replacement_text':'russian language version', 'reg_pattern':russia}

engl = re.compile(r'\b(english|en|eng|engl)(\s?(edition|version|set|box set|documentation|instruction|language|format|subtitle|feature))?\b')
cleanup_sub['english version'] =        {'optype':['sub'], 'replacement_text':'english language version', 'reg_pattern':engl}


# for other matches, I want to create a new n-gram and insert it into the list, and also do an inplace substitution
#   substitution text is made longer or shorter, depending on rough importance to matching (longer matching n-grams get more weight)
yo = re.compile(r'\b(yo|yoyo|yo yo)\b')
cleanup_final['yo yo yo'] =         {'optype':['sub','create'], 'replacement_text':'yo yo toy game fun', 'reg_pattern':yo}

music = re.compile(r'\b(cd mirex|mirex cd|cd mirex cd|vinyl|cd.*production firm|cd.*local production|mp3)(\s?(cd mirex|mirex cd|cd mirex cd|vinyl|cd.*production firm|cd.*local production|mp3))?\b')
cleanup_final['music media'] =      {'optype':['sub','create'], 'replacement_text':'music media', 'reg_pattern': music}

dvdclean = re.compile(r'\b(\d\s?)?(disc\s?)?(\d\s?)?dvd\b')
cleanup_final['dvd'] =             {'optype':['sub','create'], 'replacement_text':'dvd', 'reg_pattern':dvdclean}

brdvd = re.compile(r'\b(4k\s?)?(\d\s?)?(bluray\s?)?(\d\s?)?(dvd\s?)?(and\s?)?(\d\s?)?(disc\s?)?(4k\s?)?(\d\s?)?bluray(\s?and)?(\s?4k)?(\s?(\d\s?)?dvd)?(\s?4k)?(\s?and)?(\s?(\d\s?)?dvd)?\b|\b2bd\b')
cleanup_final['bluray dvd'] =      {'optype':['sub','create'], 'replacement_text':'bluray dvd', 'reg_pattern':brdvd}

br3d=re.compile(r'\b(\d\s?)?(disc)?\s?(\d\s?)?(dvd)?\s?(and)?\s?(3d\s?(\d\s?)?(dvd)?\s?(\d\s?)?bluray\s?(\d\s?)?(dvd)?|(\d\s?)?bluray\s?(\d\s?)?(dvd)?\s?3d)\s?(\d\s?)?(bluray dvd)?\s?(3d)?\s?(and)?\s?(\d\s?)?(dvd)?\s?(3d)?\b')
cleanup_final['3d bluray dvd'] =   {'optype':['sub','create'], 'replacement_text':'3d bluray dvd', 'reg_pattern':br3d}

macregx = re.compile(r'\b(support\s?)?(mac|ipad|macbook|powerbook|imac|apple)(\s?support)?\b')
cleanup_final['pc'] =              {'optype':['create'], 'replacement_text':'mac computing platform product', 'reg_pattern':macregx}

pcregx = re.compile(r'\b(support\s?)?(pc|windows|microsoft windows)(\s?support)?\b')
cleanup_final['pc'] =              {'optype':['create'], 'replacement_text':'pc computing platform product', 'reg_pattern':pcregx}

playsta = re.compile(r'\b(support\s?)?p(sp|\s?s|\s?s?\s?(move|2|3|4|pro|vita|vita 1000))\b')
cleanup_final['sony playstn'] =    {'optype':['create'], 'replacement_text':'sony playstation gaming platform', 'reg_pattern':playsta}

xbox = re.compile(r'\bx?\s?box\s?(one|360|live)(.*(kinect|knect))?\b')
cleanup_final['microsoft xbox'] =  {'optype':['create'], 'replacement_text':'microsoft xbox gaming platform','reg_pattern':xbox}

kinect = re.compile(r'\b(support)?\s?m?\s?s?\s?(kinect|knect)\b')
cleanup_final['microsoft knect'] = {'optype':['create'], 'replacement_text':'microsoft xbox gaming platform', 'reg_pattern':kinect}

msoffice = re.compile(r'\b(microsoft office|ms office|m office|office mac|office home|office professional|home and office|office student|office enterprise)\b')
cleanup_final['ms office'] =       {'optype':['create'], 'replacement_text':'microsoft office productivity software', 'reg_pattern':msoffice}

educate = re.compile(r'\b(education|educational|development|course|school|history|lesson|accounting|b8)\b')
cleanup_final['educational dev'] = {'optype':['create'], 'replacement_text':'educational development training lessons', 'reg_pattern':educate}

paycard = re.compile(r'\b(payment|card|ticket|debit)(\s?(card|ticket|debit))?\b')
cleanup_final['payment card'] =    {'optype':['create'], 'replacement_text':'payment card ticket', 'reg_pattern':paycard}

licenses = re.compile(r'\b(subscription|renewal|1 year|extension|license)(.*(subscription|renewal|1 year|extension|license))?(.*(subscription|renewal|1 year|extension|license))?\b')
cleanup_final['licenses'] =        {'optype':['create'], 'replacement_text':'license renewal subscription extension', 'reg_pattern':licenses}

download = re.compile(r'\b(online|digital|download|access|without disc|without disk|epay)(.*(online|digital|download|access|without disc|without disk|epay|version|edition))?\b')
cleanup_final['downloads'] =       {'optype':['create'], 'replacement_text':'online download version', 'reg_pattern':download}

ship = re.compile(r'\b(delivery|deliver|postage|mail|send|ship|shipment)(.*(delivery|deliver|postage|mail|send|ship|shipment))?\b')
cleanup_final['shipping'] =        {'optype':['create'], 'replacement_text':'shipping delivery postage', 'reg_pattern':ship}

virus = re.compile(r'\b(kaspersky|panda|drweb|eset nod32|security|antivirus|virus)(.*(kaspersky|panda|drweb|eset nod32|security|antivirus|virus|software))?\b')
cleanup_final['antivirus'] =       {'optype':['create'], 'replacement_text':'antivirus defender internet security software', 'reg_pattern':virus}

print(f'done: {strftime("%a %X %x")}')

done: Wed 10:11:12 05/27/20


In [0]:
# here are the routines to implement the above pattern-matching instructions, and modify the delim_items_list column of the dataframe

def expand_gram(gram,final_gram_n,fill_strings):
    for f in range(final_gram_n - len(gram.split())):
        gram = gram + " " + fill_strings[f]
    return gram

def cleanup_service(gramlist=["word1 this is a 6 gram", "word1", "two gram", "three gram string"], 
                    pattern_dict=OrderedDict({'replace 0007 with 007':{'optype':['sub'],'reg_pattern':re.compile(r'\b0007\b'), 'replacement_text':'007', 
                                                                       'final_gram_n':4, 'fill_strings':['game','computer','electronic']},
                                             'replace skayfall with skyfall':{'optype':['sub','create'],'reg_pattern':re.compile(r'\bskayfall\b'), 'replacement_text':'skyfall'}})):
    """
    for text modification in the items_delimited dataframe, in an effort to help standardize terms to better highlight similarities between items,
    and to help group items a bit more broadly in some cases, so we create fewer clusters with the following graph/network analysis.

    gramlist = list of delimited n-grams provided typically from a single cell from 'items_delimited' DF, at a single row, in column = 'delim_name_list'
    pattern_dict = ordered dictionary of lists where operations are done in the order created by user (e.g., clean up "dvd" variants before cleaning up "bluray dvd" variants, so the latter becomes simpler in regex)
        keys = representative text, describing what is being done (somewhat irrelevant to this function)
        values = dict{  
                    optype = list of strings indicating if one wants to do one or more of the following 4 types of operation on the gramlist
                            'sub': (sub)stitute regex matches, searching each element in the gramlist,  (len(gramlist) remains the same, but each string in gramlist may shrink or grow or remain unchanged)
                            'create': (create) new "standardized" list elements if a match is found within the gramlist (so len(gramlist) grows by 1 for each match); original gramlist strings remain unchanged
                            'complete_replace': wherever you have matching elements in the gramlist, replace the entire gramlist element with the pattern_dict key (len(gramlist) remains the same, but n in each n-gram may change)
                    reg_pattern = regex patterns to find/substitute/create/replace, 
                    replacement_text = the text to put in place of the reg_pattern match, or to use when creating a new gramlist list element
                    final_gram_n = integer; desired final n-gram length (if desired) 
                    fill_strings= list(padding strings used to append on to the shorter regex matches to make final string = n grams in length, in order from most important to least)... must be long enough!
                    }
    """
    print_counter = 0
    previous_gramlist = gramlist.copy()
    for key_text, op_details in pattern_dict.items():
        optype = op_details['optype']
        do_sub = 'sub' in optype
        do_create = 'create' in optype
        do_replace = 'complete_replace' in optype
        do_regfind = do_create or do_replace
        reg_pattern = op_details['reg_pattern']
        replacement_text = op_details['replacement_text']
        gram_set_n = False  # don't try to expand the n-gram to an (n+x)-gram unless the information is provided
        if 'final_gram_n' in op_details.keys():
            gram_set_n = True
            final_gram_n = op_details['final_gram_n']
        if 'fill_strings' in op_details.keys():
            fill_strings = op_details['fill_strings']
        else:
            gram_set_n = False

        updated_gramlist = previous_gramlist.copy()

        if do_sub:   # do substitutions first (cleanup), then do create, then do full replace
            for idx, gram in enumerate(previous_gramlist):
                updated_gramlist[idx] = reg_pattern.sub(replacement_text, gram)
            previous_gramlist = updated_gramlist.copy()

        if do_regfind:
            for idx, gram in enumerate(previous_gramlist):
                # if (key_text == 'highlight product origins'):
                #     pfind = reg_pattern.findall(previous_gramlist[idx])
                #     #ffind = reg_pattern.find(previous_gramlist[idx])
                #     sfind = reg_pattern.search(previous_gramlist[idx])
                #     if 'abbyy' in pfind:
                #         print(f'abby gram: {previous_gramlist[idx]}')
                #         print(f'abbyy 1 findall: {pfind}')
                #         #print(f'abbyy 1 find: {ffind}')
                #         print(f'abbyy 1 search: {sfind}')
                    # for adx,aptn in enumerate(pfind):
                    #     if 'abbyy' in aptn:
                    #         print(f'abbyy 2: position {adx}, element {aptn}, full find = {pfind}')
                    #     if 'abbyy' in aptn[0]:
                    #         print(f'abbyy 3: position {adx}, element {aptn}[0], full find = {pfind}')
                find_list = [x[0] for x in reg_pattern.findall(previous_gramlist[idx])]
                if find_list:  # proceed only if we have found some matches
                    # if print_counter < 25:
                    #     pfind = reg_pattern.findall(previous_gramlist[idx])
                    #     mfind = reg_pattern.match(previous_gramlist[idx])
                    #     sfind = reg_pattern.search(previous_gramlist[idx])

                    #     print(f'{key_text} search gram_list: {previous_gramlist}')

                    #     print(f'  findall in "{previous_gramlist[idx]}": {pfind}')
                    #     print(f'  match   in "{previous_gramlist[idx]}": {mfind}')
                    #     print(f'  search  in "{previous_gramlist[idx]}": {sfind}\n')
                    #     print_counter += 1
                    if do_create:  # do creations before full replacements
                        new_grams = []
                        for nmatch, match_str in enumerate(find_list):
                            if match_str:  # make sure it's not an empty list that was found as one of the matching groups
                                if replacement_text:
                                    new_grams.append(replacement_text)
                                elif gram_set_n:
                                    new_grams.append(expand_gram(match_str, final_gram_n, fill_strings))
                                else:
                                    new_grams.append(match_str)
                        updated_gramlist += new_grams

                
                    # if do_replace:
                    #     print('You should not be in replace; not employed at this time')
                    #     modlist[idx] = replacement_text

        previous_gramlist = updated_gramlist.copy()
    return updated_gramlist

print(f'done: {strftime("%a %X %x")}')


done: Wed 10:11:14 05/27/20


In [0]:
# Test it on a few rows  highlight_roots, cleanup_sub, cleanup_sub_create
# dlist = items_delimited.at[36,'delim_name_list'].copy()
#
# for i in range(16000,16030):
#     dlist = items_delimited.at[i,'delim_name_list'].copy()
#     print(dlist)
#     for clean_dict in [highlight_roots,cleanup_sub,cleanup_final]:
#         dlist = cleanup_service(dlist,clean_dict)
#     print(dlist)


for clean_dict in [highlight_roots,cleanup_sub,cleanup_final]: #[cleanup_games,cleanup_sub,cleanup_sub_create]:
    items_delimited.delim_name_list = items_delimited.delim_name_list.apply(lambda x: cleanup_service(x,clean_dict))

print(f'done: {strftime("%a %X %x")}\n')
items_delimited.head()

done: Wed 10:11:21 05/27/20



Unnamed: 0,item_id,i_cat_id,item_name,i_tested,clean_item_name,delim_name_list,d_len,d_maxgram
0,0,40,movie dvd : ! POWER IN glamor (PLAST.) D,False,movie dvd power in glamor plast dvd,"[movie dvd, power in glamor, plast, dvd]",4,3
1,1,76,"program home and office digital : ! ABBYY FineReader 12 Professional Edition Full [PC, Digital Version]",False,program home and office digital abbyy finereader 12 professional edition full pc digital version,"[program home and office digital, abbyy finereader 12 professional edition full, pc, digital version, a educational software learning, microsoft office productivity software, educational development training lessons, online download version, online download version]",4,6
2,2,40,movie dvd : *** In the glory (UNV) D,False,movie dvd in glory unv dvd,"[movie dvd, in glory, unv, dvd]",4,2
3,3,40,movie dvd : *** BLUE WAVE (Univ) D,False,movie dvd blue wave univ dvd,"[movie dvd, blue wave, univ, dvd]",4,2
4,4,40,movie dvd : *** BOX (GLASS) D,False,movie dvd box glass dvd,"[movie dvd, box, glass, dvd]",4,2


In [0]:
# Let's remove duplicate entries and unwanted stuff
alphnum = list(string.ascii_lowercase) + list('1234567890')  # get rid of all length=1 1-grams
def remove_dupes(gramlist):
    unwanted = set(['and','weighed in','given y'] + alphnum)
    gramset = unwanted
    return [x for x in gramlist if x not in gramset and not gramset.add(x)]

items_delimited.delim_name_list = items_delimited.delim_name_list.apply(lambda x: remove_dupes(x) )

items_delimited['d_len'] = items_delimited.delim_name_list.apply(lambda x: len(x))
items_delimited['d_maxgram'] = items_delimited.delim_name_list.apply(maxgram)

print(f'done: {strftime("%a %X %x")}')

done: Wed 10:11:36 05/27/20


In [0]:
# make item df easier to read for the following stuff
items_clean_delimited = items_delimited.copy(deep=True).drop("item_name", axis=1).rename(columns = {'clean_item_name':'item_name'})

print(f'done: {strftime("%a %X %x")}')
print("\n")
print(items_clean_delimited.describe())
print("\n")
items_clean_delimited.head()

done: Wed 10:11:37 05/27/20


         item_id  i_cat_id  d_len  d_maxgram
count      22170     22170  22170      22170
mean  11,084.500    46.291  3.778      4.577
std    6,400.072    15.941  1.836      2.018
min            0         0      1          2
25%    5,542.250        37      2          3
50%   11,084.500        40      3          4
75%   16,626.750        58      5          5
max        22169        83     14         18




Unnamed: 0,item_id,i_cat_id,i_tested,item_name,delim_name_list,d_len,d_maxgram
0,0,40,False,movie dvd power in glamor plast dvd,"[movie dvd, power in glamor, plast, dvd]",4,3
1,1,76,False,program home and office digital abbyy finereader 12 professional edition full pc digital version,"[program home and office digital, abbyy finereader 12 professional edition full, pc, digital version, a educational software learning, microsoft office productivity software, educational development training lessons, online download version]",8,6
2,2,40,False,movie dvd in glory unv dvd,"[movie dvd, in glory, unv, dvd]",4,2
3,3,40,False,movie dvd blue wave univ dvd,"[movie dvd, blue wave, univ, dvd]",4,2
4,4,40,False,movie dvd box glass dvd,"[movie dvd, box, glass, dvd]",4,2


In [0]:
#%%time
# Inspect a single n, gathered from all possible delimited n-grams (4.64sec to run this cell without GPU, 4.01sec with GPU)
n_in_ngram = 4    # look at, e.g. length-4 (4-grams) strings of words
print_top_f = 10  # printout the top xx ngram strings, sorted by frequency of occurrence in the data

total_dupe_grams = 0
item_ngram = items_clean_delimited.copy(deep=True)
item_ngram['delim_ngrams'] = item_ngram.delim_name_list.apply(lambda x: [a for a in x if len(a.split()) == n_in_ngram])

item_ngram = item_ngram.explode('delim_ngrams').reset_index(drop=True) # < 0.2sec this method (CPU)

freq_grams = item_ngram.delim_ngrams.value_counts()
grams_dupe = len(freq_grams[freq_grams > 1])
print(f'done: {strftime("%a %X %x")}')
print('\n')
print(f'Number of unique delimited {n_in_ngram}-grams: {len(freq_grams)}')
print(f'Number of unique delimited {n_in_ngram}-grams that are duplicated at least once: {grams_dupe}\n')
print(f'Top {print_top_f:d} {n_in_ngram:d}-grams by frequency of appearance in item names:')
print(freq_grams[:print_top_f])
print('\n')
item_ngram.head()

Number of unique delimited 4-grams: 3463
Number of unique delimited 4-grams that are duplicated at least once: 376

Top 10 4-grams by frequency of appearance in item names:
educational development training lessons    1736
1 educational software learning             1042
microsoft xbox gaming platform               772
game pc standard version                     756
microsoft office productivity software       619
music cd production business                 397
gift gadget robot sport                      295
program home and office                      277
game pc additional publication               240
license renewal subscription extension       237
Name: delim_ngrams, dtype: int64


done: Wed 10:24:33 05/27/20




Unnamed: 0,item_id,i_cat_id,i_tested,item_name,delim_name_list,d_len,d_maxgram,delim_ngrams
0,0,40,False,movie dvd power in glamor plast dvd,"[movie dvd, power in glamor, plast, dvd]",4,3,
1,1,76,False,program home and office digital abbyy finereader 12 professional edition full pc digital version,"[program home and office digital, abbyy finereader 12 professional edition full, pc, digital version, a educational software learning, microsoft office productivity software, educational development training lessons, online download version]",8,6,a educational software learning
2,1,76,False,program home and office digital abbyy finereader 12 professional edition full pc digital version,"[program home and office digital, abbyy finereader 12 professional edition full, pc, digital version, a educational software learning, microsoft office productivity software, educational development training lessons, online download version]",8,6,microsoft office productivity software
3,1,76,False,program home and office digital abbyy finereader 12 professional edition full pc digital version,"[program home and office digital, abbyy finereader 12 professional edition full, pc, digital version, a educational software learning, microsoft office productivity software, educational development training lessons, online download version]",8,6,educational development training lessons
4,2,40,False,movie dvd in glory unv dvd,"[movie dvd, in glory, unv, dvd]",4,2,


#####Gather all info for duplicated n-grams in our delimited set

In [0]:
%%time 
# Should take < 4sec on CPU

# Get all of the delimited n-grams that are duplicated at least once in item names
#  range of sizes of delimited phrases (number of 'words'):

min_gram = 1
max_gram = items_delimited.d_maxgram.max()

total_dupe_grams = 0
gram_freqs = {}   # dict will hold elements that are pd.Series with index = phrase, value = number of repeats in items database item names
for n in range(min_gram,max_gram+1):
    item_ngram = items_clean_delimited.copy(deep=True)
    item_ngram['delim_ngrams'] = item_ngram.delim_name_list.apply(lambda x: [a for a in x if len(a.split()) == n])

    item_ngram = item_ngram.explode('delim_ngrams').reset_index(drop=True)  

    # grams = item_ngram.delim_ngrams.apply(pd.Series,1).stack()  #1min 23sec cpu
    # grams.index = grams.index.droplevel(-1)
    # grams.name = 'delim_ngrams'
    # del item_ngram['delim_ngrams']
    # item_ngram = item_ngram.join(grams)

    freq_grams = item_ngram.delim_ngrams.value_counts()
    print(f'Number of unique delimited {n}-grams: {len(freq_grams)}')
    grams_dupe = len(freq_grams[freq_grams > 1])
    print(f'Number of unique delimited {n}-grams that are duplicated at least once: {grams_dupe}\n')
    if grams_dupe > 0:
        gram_freqs[n] = freq_grams[freq_grams > 1].copy(deep=True)
        total_dupe_grams += grams_dupe
print(f'\nTotal number of unique, delimited, duplicated n-grams for all n: {total_dupe_grams}')

print(f'done: {strftime("%a %X %x")}')

Number of unique delimited 1-grams: 2819
Number of unique delimited 1-grams that are duplicated at least once: 1171

Number of unique delimited 2-grams: 4236
Number of unique delimited 2-grams that are duplicated at least once: 1190

Number of unique delimited 3-grams: 3870
Number of unique delimited 3-grams that are duplicated at least once: 752

Number of unique delimited 4-grams: 3463
Number of unique delimited 4-grams that are duplicated at least once: 376

Number of unique delimited 5-grams: 2843
Number of unique delimited 5-grams that are duplicated at least once: 288

Number of unique delimited 6-grams: 1934
Number of unique delimited 6-grams that are duplicated at least once: 143

Number of unique delimited 7-grams: 1258
Number of unique delimited 7-grams that are duplicated at least once: 68

Number of unique delimited 8-grams: 829
Number of unique delimited 8-grams that are duplicated at least once: 31

Number of unique delimited 9-grams: 522
Number of unique delimited 9-gram

In [0]:
'''
May 25: try adjusting code to incude ngrams in range 1 and up, but reduce weight for n-grams that contain many common words
'''
start_n = 0
finish_n = 10
# first, inspect data to see what are the common n-grams of little value in determining cluster coupling
df_busy_grams=pd.DataFrame({'n3_names':gram_freqs[3].index[start_n:finish_n], 'n3_counts':gram_freqs[3].values[start_n:finish_n],
                 'n4_names':gram_freqs[4].index[start_n:finish_n], 'n4_counts':gram_freqs[4].values[start_n:finish_n],
                 'n5_names':gram_freqs[5].index[start_n:finish_n], 'n5_counts':gram_freqs[5].values[start_n:finish_n]
                 })
print(df_busy_grams)

                   n3_names  n3_counts                                  n4_names  n4_counts                                         n5_names  n5_counts
0   online download version       2097  educational development training lessons       1736            music music media production business        397
1          movie bluray dvd       1787           1 educational software learning       1042                  program home and office digital        333
2  russian language version       1688            microsoft xbox gaming platform        772    antivirus defender internet security software        243
3         music music media       1217                  game pc standard version        756                xbox 360 russian language version        151
4           game pc digital       1125    microsoft office productivity software        619     batman game computer electronic multirelease        136
5             game xbox 360        501                   gift gadget robot sport        

In [0]:
# format data for feeding into word vector creator

count_bins = [0, 2, 4, 8, 16, 32, 128, 1024, 32768]
idf_weights = [8,7,6,5,4,3,2,1]  # more weight for ngrams with lower counts

notfirst = False
for n,s in gram_freqs.items():
    a=len(s)
    n_array = np.ones(a,dtype=np.int32)*n
    gram_count = s.values.astype(np.int32)
    gram_string0 = s.index.to_numpy(dtype='str')
    gram_string = [re.compile(r'\b' + gs + r'\b') for gs in gram_string0]  # I'm not looking for partial words; n-grams must match at word boundaries
    weight_bin = pd.cut(s,count_bins,labels=idf_weights,retbins=False).astype(np.int32)

    if notfirst:
        n_arrays = np.concatenate((n_arrays,n_array))
        gram_counts = np.concatenate((gram_counts,gram_count))
        gram_strings = np.concatenate((gram_strings,gram_string))
        weight_bins = np.concatenate((weight_bins,weight_bin))
    else:
        n_arrays = n_array
        gram_counts = gram_count
        gram_strings = gram_string
        weight_bins = weight_bin
        notfirst = True

print(n_arrays[:5],gram_counts[:5],gram_strings[:5],weight_bins[:5])
print(len(n_arrays),len(gram_counts),len(gram_strings),len(weight_bins))

[1 1 1 1 1] [2742 1835 1360  893  755] [re.compile('\\bpc\\b') re.compile('\\bregion\\b')
 re.compile('\\bjewel\\b') re.compile('\\b1c\\b')
 re.compile('\\bbusiness\\b')] [1 1 1 2 2]
4056 4056 4056 4056


In [0]:
# use np matrix storage to speed this up... the following code cell takes about 3 min using np, vs. 8 min with pandas dataframe calculations
#   also, reducing np matrix to hold only ngrams of size 3 or greater (5/25/20) takes 48 sec on CPU
def make_word_vecs(item_names, ngram_re_patterns, ngram_ns, ngram_weights):
    """Output is word vectors for input containing item names (english transl)"""

    # create np zeros array of size (number of items, word vector length)
    n_items = len(item_names)
    wv_len = len(ngram_ns)
    item_vec_array = np.zeros((n_items, wv_len), dtype = np.int32)

    for g in range(wv_len):
        gram_pattern = ngram_re_patterns[g] 
        gram_len = ngram_ns[g]
        gram_weight = ngram_weights[g]
        for i in range(n_items):
            if gram_pattern.search(item_names[i]):
                item_vec_array[i,g] = 2 * gram_len * gram_weight  # use weighting function 2 * (n= length of ngram) * (idf weight from binning above)
    return item_vec_array


In [0]:
%%time
item_word_vectors = make_word_vecs(items_clean_delimited.loc[:,'item_name'].to_numpy(dtype='str'), gram_strings,n_arrays,weight_bins)

CPU times: user 2min 40s, sys: 153 ms, total: 2min 40s
Wall time: 2min 40s


In [0]:
# # intermediate point: can save word vectors here for the 22170 items
#np.savez_compressed('data_output/item_word_vectorsCompressed.npz', arrayname = item_word_vectors)
# # ...
# iwv = np.load("data_output/item_word_vectors.npz")
# item_word_vectors = iwv['arrayname']
# print(item_word_vectors.shape)

#####Use scipy sparse matrices instead of pandas... faster, and less memory use

In [0]:
item_vec_matrix = sparse.csr_matrix(item_word_vectors)

In [0]:
%%time
# <2sec for 21,700 items x 4000+ ngrams; output is a csr matrix of type int64
dots = item_vec_matrix.dot(item_vec_matrix.transpose()) 

CPU times: user 1.26 s, sys: 431 ms, total: 1.69 s
Wall time: 1.69 s


In [0]:
# wicked fast way to get top K # of items by dot product value (i.e., closest K items to the item of interest)
# https://stackoverflow.com/questions/31790819/scipy-sparse-csr-matrix-how-to-get-top-ten-values-and-indices
# also, great reference for speeding up python here: https://colab.research.google.com/drive/1nMDtWcVZCT9q1VWen5rXL8ZHVlxn2KnL

@jit(cache=True)
def row_topk_csr(data, indices, indptr, K):
    """Take a sparse scipy csr matrix, and for each column, find the K largest 
    values in that column (like argmax or argsort[:K]).  Return the row indices 
    and associated values for each column as two separate np arrays of 
    length = number of columns in sparse matrix.  Inputs are data/indices/indptr
    of csr matrix, and integer K.  Call function like this:
    rows, vals = row_topk_csr(csr_name.data, csr_name.indices, csr_name.indptr, K)
    Use jit by importing jit and prange from numba, and decorating with
    @jit(cache=True) immediately before this function definition
    (adopted from https://stackoverflow.com/users/3924566/deepak-saini ) """

    m = indptr.shape[0] - 1
    max_indices = np.zeros((m, K), dtype=indices.dtype)
    max_values = np.zeros((m, K), dtype=data.dtype)
    # for i in prange(m):
    #     top_inds = np.argsort(data[indptr[i] : indptr[i + 1]])[::-1][:K]
    #     max_indices[i] = indices[indptr[i] : indptr[i + 1]][top_inds]
    #     max_values[i] = data[indptr[i] : indptr[i + 1]][top_inds]
    for i in prange(m):
        top_inds = np.arange(22190-K,22190)
        tops = np.argsort(data[indptr[i] : indptr[i + 1]])[::-1][:K]
        top_inds[:len(tops)] = tops
        #print(i,top_inds)
        max_indices[i] = indices[indptr[i] : indptr[i + 1]][top_inds]
        max_values[i] = data[indptr[i] : indptr[i + 1]][top_inds]

    return max_indices, max_values


In [0]:
%%time
dots.setdiag(0)
print(dots.indptr.shape)
kval = 20
closest_indices, highest_values = row_topk_csr(dots.data, dots.indices, dots.indptr, K=kval)  # Changed K from 10 to 2 on 5/25/20 to 3 on 5/26 to 15 with fakedots

  self._set_arrayXarray(i, j, x)


(22171,)
CPU times: user 6.15 s, sys: 458 ms, total: 6.61 s
Wall time: 6.66 s


In [0]:
#print(closest_indices.shape)
print(closest_indices[:5,:7]) #[:10,:])
print(highest_values[8000:8008,:7])

[[ 9920  9922 16973 21346 21661  9932 21667]
 [ 1155  1154  1156  1152  1153  1157  1182]
 [17212 16518 16519 16521 16616 16691 16692]
 [19630  9633 20027 10463 10462  9029 12427]
 [ 9172  2716  9449  9450  9451  7732  7845]]
[[ 656  480  480  480  480  480  480]
 [ 864  864  864  864  864  864  656]
 [ 224  224  224  224  224  224  224]
 [1024 1024 1024 1024 1024  224  224]
 [ 152  152  152  152  152  152  152]
 [2448  928  864  864  864  864  864]
 [2448  736  736  528  480  480  480]
 [ 772  720  708  708  708  672  672]]


In [0]:
similar_items = pd.DataFrame({'item_id':range(22170)}) #,'close_item_idx':closest_indices,'close_item_dot':highest_values})
similar_items['close_item_idx'] = [closest_indices[x][:kval] for x in range(22170)]
similar_items['close_item_dot'] = [highest_values[x][:kval] for x in range(22170)]
similar_items = similar_items.merge(items_clean_delimited[['item_id','i_tested','i_cat_id']], on='item_id')
similar_items['close_item_cat'] = similar_items.close_item_idx.apply(lambda x: [items.at[i,'item_category_id'] for i in x])
print(similar_items.head())


   item_id                                                                                                                                close_item_idx                                                                                               close_item_dot  i_tested  i_cat_id  \
0        0        [9920, 9922, 16973, 21346, 21661, 9932, 21667, 12449, 20043, 21420, 15819, 16616, 13950, 10811, 8125, 10290, 14864, 17831, 8635, 8631]         [288, 288, 288, 288, 288, 288, 288, 288, 288, 288, 288, 288, 288, 288, 288, 288, 288, 288, 288, 288]     False        40   
1        1                      [1155, 1154, 1156, 1152, 1153, 1157, 1182, 1177, 1174, 1172, 1184, 1170, 1181, 5730, 3873, 3876, 3878, 3877, 3875, 3874]  [10564, 1832, 1832, 1768, 1284, 1048, 984, 984, 984, 984, 984, 984, 964, 948, 932, 932, 928, 928, 928, 928]     False        76   
2        2  [17212, 16518, 16519, 16521, 16616, 16691, 16692, 16973, 17125, 17251, 18732, 17265, 17352, 17518, 17831, 17910, 18530, 18531, 18532,

In [0]:
# create a graph with nodes = item ids in test set, and edge weights = dot product values

# we will use the "community" algorithms to determine useful groupings of other items around/including the test items
# ##### start with a graph containing the 5100 items in the test set as starter nodes, and add in the 10 highest-match wordvector items if dot product > threshold
# TAKING A LEAP... gonna try with 21700 full items dataset / top 10 matches

edge_threshold = 100  # dot product (edge weight) must be greater than this for two item_ids to be connected in the graph

graph_items = similar_items[['item_id','close_item_idx']].copy(deep=True).explode('close_item_idx').reset_index(drop=True)
graph_weights = similar_items[['item_id','close_item_dot']].copy(deep=True).explode('close_item_dot').reset_index(drop=True)
graph_items['weight'] = graph_weights.loc[:]['close_item_dot']
graph_items.columns = ['item1_id','item2_id','weight']

print(len(graph_items))
graph_items = graph_items[graph_items.weight > edge_threshold]
print(len(graph_items))
graph_items.head()
# depending on threshold, we may end up dropping some of the test items (for example, we lose item 22154 if threshold = 150, but not if threshold = 100)
# K=15
# 332550
# x
# 284132 (thresh 100)
# 265912 (200)
# 143099 (500)

# K=10
# 221700
# x
# 192094 (100)

# K = 20
# 443400 -> 374990 (100)

443400
374990


Unnamed: 0,item1_id,item2_id,weight
0,0,9920,288
1,0,9922,288
2,0,16973,288
3,0,21346,288
4,0,21661,288


In [0]:
%%time
# import pandas df into weighted-edge graph:
G = nx.from_pandas_edgelist(graph_items, 'item1_id', 'item2_id', ['weight'])

CPU times: user 4.39 s, sys: 74.1 ms, total: 4.46 s
Wall time: 4.46 s


In [0]:
%%time
# employ a clustering method that utilizes the edge weights
communities2 = community.asyn_lpa_communities(G, weight='weight', seed=42)

CPU times: user 1min 25s, sys: 18.5 ms, total: 1min 25s
Wall time: 1min 25s


In [0]:
num_communities = 0
community_items = set()
cluster_nodes = []
n_nodes = []
weight_avgs = []
weight_sums = []
weight_maxs = []
weight_mins = []
weight_stds = []
for i,c in enumerate(communities2):
    num_communities += 1
    community_items = community_items | set(c)
    nodelist = list(c)
    n_nodes.append(len(nodelist))
    edgeweights = []
    for m in range(n_nodes[-1]-1):
        for n in range(m+1,n_nodes[-1]):
            try:
                edgeweights.append(G.edges[nodelist[m], nodelist[n]]['weight'])
            except:
                pass
    cluster_nodes.append(nodelist)
    weight_avgs.append(np.mean(edgeweights))
    weight_sums.append(np.sum(edgeweights))
    weight_maxs.append(np.max(edgeweights))
    weight_mins.append(np.min(edgeweights))
    weight_stds.append(np.std(edgeweights))

print(num_communities)

1386


In [0]:
weight_avgs = [round(x) for x in weight_avgs]
community_df = pd.DataFrame({'n_nodes':n_nodes,'w_avg':weight_avgs,'w_sum':weight_sums,'w_max':weight_maxs,'w_min':weight_mins,'w_std':weight_stds,'cluster_members':cluster_nodes})
print(community_df.head())
print("\n")
print(community_df.describe())

   n_nodes  w_avg    w_sum  w_max  w_min     w_std  \
0       60    333   216948  21220    288   830.564   
1       47    572   181232  10708    132   753.633   
2       45   2114  1002168   5456    256 1,674.615   
3       36    382   140196   6688    256   528.828   
4       10    717    31528   3424    288 1,075.847   

                                                                                                                                                                                                                                                                                                                                                                                                    cluster_members  
0  [0, 16513, 2, 10113, 9604, 14597, 16518, 16519, 17156, 16521, 9489, 9617, 15634, 20, 9492, 18843, 19611, 21661, 14366, 11039, 12449, 14243, 11940, 14885, 17831, 9643, 21420, 14381, 9519, 16691, 16692, 17212, 19520, 12354, 8136, 10442, 20043, 15819, 16973, 11723, 145

In [0]:
cluster_items = community_df[['n_nodes','cluster_members']].copy(deep=True).explode('cluster_members').reset_index(drop=True)
print(f'community_df length: {len(community_df)}')
print(f'cluster_items df length: {len(cluster_items)}')
print(f'number of unique item ids contained in clusters: {cluster_items.cluster_members.nunique()}')
for nn in [9,24,49]:
    nn_community = community_df.query("n_nodes > @nn").copy(deep=True)
    print(f'number of clusters with at least {nn+1} items as members: {len(nn_community)}')
    print(nn_community.describe())
    print('\n')

community_df length: 1386
cluster_items df length: 20406
number of unique item ids contained in clusters: 20406
number of clusters with at least 10 items as members: 421
       n_nodes     w_avg       w_sum     w_max   w_min     w_std
count      421       421         421       421     421       421
mean    40.542   990.131 333,508.561 5,282.546 375.696   690.150
std     79.047 1,079.365 658,943.227 5,751.013 334.278   930.635
min         10       139        7288       168     104         0
25%         15       453       64200      1456     196   174.118
50%         23       682      132068      3504     256   425.795
75%         40      1098      369912      6904     404   887.429
max       1300     11251     7564692     41760    2940 9,397.001


number of clusters with at least 25 items as members: 193
       n_nodes   w_avg       w_sum     w_max   w_min     w_std
count      193     193         193       193     193       193
mean    69.917 831.031 592,362.363 6,135.855 288.725   554.

In [0]:
%quickref

In [0]:
def tstfun(a,b):
    print(a+b)

,tstfun one two

onetwo


In [0]:
# with K=15 and threshold = 100, we get 1624 clusters, quantiles of n_nodes = 2 min, 2, 4 med, 11, 1161 max; 479 clusters with at least 10 items 479/10... 182/25...73/50;  20405 items actually clustered (out of 21700)
# with K=15 and threshold = 200, we get 1664 clusters, quantiles of n_nodes = 2 min, 2, 4 med, 10, 1164 max; 431 clusters with at least 10 items, 19787 items actually clustered (out of 21700)
# with K=15 and threshold = 500, we get 1843 clusters, quantiles of n_nodes = 2 min, 2, 3 med,  6,  236 max; 319 clusters with at least 10 items 319/10... 103/25...31/50;  13422 items actually clustered (out of 21700)
# with K=10 and threshold = 100, we get 1962 clusters, quantiles of n_nodes = 2 min, 2, 4 med, 10, 1096 max; 529 clusters with at least 10 items 529/10... 152/25...56/50;  20404 items actually clustered (out of 21700)
# community_df length: 1962
# cluster_items df length: 20404
# number of unique item ids contained in clusters: 20404
# number of clusters with at least 10 items as members: 529
#        n_nodes     w_avg       w_sum     w_max   w_min     w_std
# count      529       529         529       529     529       529
# mean    28.371 1,118.951 151,510.904 4,402.389 423.274   719.163
# std     56.233 1,174.116 292,205.958 5,415.598 359.267   967.964
# min         10       132        2904       132     104         0
# 25%         12       509       36548      1108     228   149.643
# 50%         17       791       75604      2604     320   385.788
# 75%         27      1293      165472      5668     504   987.902
# max       1096     11827     3816308     41760    2940 9,873.388


# number of clusters with at least 25 items as members: 152
#        n_nodes     w_avg       w_sum     w_max   w_min     w_std
# count      152       152         152       152     152       152
# mean    61.704   917.283 309,274.921 4,552.974 339.474   537.762
# std     97.197 1,102.951 480,567.401 5,888.855 251.088   947.367
# min         25       167       21352       244     104         0
# 25%         30   436.250      108438      1088     196   106.725
# 50%         39       639      191836      2586     272   255.362
# 75%         58 1,001.750      317200      5629     400   671.011
# max       1096     11495     3816308     41760    2448 9,873.388


# number of clusters with at least 50 items as members: 56
#        n_nodes   w_avg       w_sum     w_max   w_min     w_std
# count       56      56          56        56      56        56
# mean   110.143 670.875 458,976.500 4,891.786 268.071   423.299
# std    148.554 403.079 562,950.003 5,501.423 108.325   516.756
# min         50     167       84276       372     104     5.652
# 25%     56.500 403.500      198484      1288     196    95.334
# 50%         72 558.500      314638      2720     256   255.362
# 75%    106.250 823.500      534489      6053     320   575.131
# max       1096    1989     3392540     22140     608 2,978.267
#


#####################################################
## use clusters n>49, k=20, thresh = 100
###################################

# # with K=20 and threshold = 100, we get 1387 clusters, quantiles of n_nodes = 2 min, 2, 4 med, 13, 1319 max; 422 clusters with at least 10 items 422/10... 179/25...78/50;  20406 items actually clustered (out of 21700)
# community_df length: 1387
# cluster_items df length: 20406
# number of unique item ids contained in clusters: 20406
# number of clusters with at least 10 items as members: 422
#        n_nodes     w_avg       w_sum     w_max   w_min     w_std
# count      422       422         422       422     422       422
# mean    40.405   987.488 334,640.844 5,371.365 382.237   706.833
# std     79.321 1,002.201 663,848.830 6,042.819 366.626   930.412
# min         10       142        9836       168     104         0
# 25%         15   445.500       61230      1387     196   168.636
# 50%         22   688.500      130628      3394     256   424.621
# 75%         39 1,149.750      363573      6901     416   910.528
# max       1319     11251     7558860     41760    3192 9,339.161


# number of clusters with at least 25 items as members: 179
#        n_nodes     w_avg       w_sum     w_max   w_min     w_std
# count      179       179         179       179     179       179
# mean    73.682   878.994 633,793.497 6,350.324 295.330   606.438
# std    113.674   860.866 924,124.723 6,362.418 230.057   891.521
# min         25       165       36576       484     104     9.963
# 25%         33   435.500      198778      2168     164   180.056
# 50%         44       623      388704      4276     256   377.799
# 75%         75 1,072.500      667122      7824     342   767.243
# max       1319      8846     7558860     41760    2448 9,339.161


# number of clusters with at least 50 items as members: 78
#        n_nodes     w_avg         w_sum     w_max   w_min     w_std
# count       78        78            78        78      78        78
# mean   124.269   822.628 1,054,526.103 7,313.077 248.205   559.201
# std    158.805 1,056.614 1,253,915.801 7,405.189 106.790 1,093.614
# min         50       179         68600       528     104     9.963
# 25%     63.250   405.250        453204      2467     167   154.897
# 50%         83   562.500        666060      4136     244   300.057
# 75%    138.750   891.250       1162252      9802     272   628.492
# max       1319      8846       7558860     41760     592 9,339.161

UsageError: Cell magic `%%` not found.


In [0]:
community_df.w_avg.nunique()
# can't use this as a category code because not unique among clusters,
# but I want to use the average cluster weights property to encode the cluster category
# (higher numbers for category code --> stronger clustering; may be useful to have this correlation instead of random generation of category codes)

1234

In [0]:
# so, I will sort on w_avg, then on number of nodes as perhaps the next most important defining characteristic of a given cluster
#  and, to make the categorization unique, I will take the w_avg value and sum it with the index (row number)...
#     (with the sorting, this favors even more the clusters with high average item-to-item similarity)
community_df = community_df.sort_values(['w_avg','n_nodes']).reset_index(drop=True)
community_df['item_cluster_id'] = community_df.index + community_df['w_avg']
community_df.head()

Unnamed: 0,n_nodes,w_avg,w_sum,w_max,w_min,w_std,item_id,item_cluster_id
0,2,104,104,104,104,0,"[8491, 21478]",104
1,2,104,104,104,104,0,"[16040, 16178]",105
2,3,104,312,104,104,0,"[14938, 17619, 8188]",106
3,5,104,1040,104,104,0,"[5841, 5842, 5843, 5844, 5845]",107
4,22,104,9776,104,104,0,"[388, 401, 405, 409, 281, 282, 414, 418, 293, 294, 295, 298, 427, 303, 310, 311, 442, 451, 328, 335, 375, 253]",108


In [0]:
# unravel / explode the cluster node lists... we know this will not duplicate item ids, from the counting we did above
item_clusters = community_df.copy(deep=True).explode('item_id').reset_index().rename(columns = {'index':'cluster_number'})
item_clusters.head()

Unnamed: 0,cluster_number,n_nodes,w_avg,w_sum,w_max,w_min,w_std,item_id,item_cluster_id
0,0,2,104,104,104,104,0,8491,104
1,0,2,104,104,104,104,0,21478,104
2,1,2,104,104,104,104,0,16040,105
3,1,2,104,104,104,104,0,16178,105
4,2,3,104,312,104,104,0,14938,106


In [0]:
items_clustered = items_clean_delimited[['item_id','i_cat_id','i_tested','item_name']].merge(item_clusters,on='item_id',how='left')
items_clustered = items_clustered[['item_id','i_cat_id','item_cluster_id','i_tested','cluster_number','n_nodes','w_avg','w_sum','w_max','w_min','w_std','item_name']]
items_clustered.columns = ['item_id','item_category_id','item_cluster_id','item_tested','cluster_number','n_items_in_cluster','w_avg','w_sum','w_max','w_min','w_std','item_name']
print(items_clustered.head())

  item_id  item_category_id  item_cluster_id  item_tested  cluster_number  n_items_in_cluster  w_avg  w_sum  w_max  w_min     w_std                                                                                         item_name
0       0                40              920        False             556                  10    364   8724   2572    264   460.506                                                               movie dvd power in glamor plast dvd
1       1                76             2600        False            1322                  12   1278  61348  10520    404 1,497.166  program home and office digital abbyy finereader 12 professional edition full pc digital version
2       2                40              802        False             489                  56    313  83032   4984    144   396.113                                                                        movie dvd in glory unv dvd
3       3                40              330        False             129       

In [0]:
# how many test items are represented by clusters?
tested_clustered = items_clustered[items_clustered.item_tested==True][['item_id','item_category_id','item_cluster_id','item_name']]
tested_clustered['unclustered'] = tested_clustered.apply(lambda x: np.NaN if x.item_cluster_id > 0  else x.item_id, axis = 1)
print(tested_clustered.head(10))
print('\n')
print(tested_clustered.item_id.nunique())
unclustered = tested_clustered.unclustered.unique()
unclustered = [x for x in unclustered if x > 0]
print(len(unclustered))
train_items = sales_train.item_id.unique()
print(len(train_items))
print(len(items))
untrained = [x for x in unclustered if x not in train_items]
print(len(untrained))
print(len(items) - len(train_items) - len(untrained))

   item_id  item_category_id  item_cluster_id                                                            item_name  unclustered
30      30                40             1109                      movie dvd 007 coordinate 007 james bond skyfall          nan
31      31                37             1109    movie bluray dvd 007 coordinate 007 james bond skyfall bluray dvd          nan
32      32                40             4055                                                    movie dvd 1 and 1          nan
33      33                37             4055                                  movie bluray dvd 1 and 1 bluray dvd          nan
38      38                41              nan  cinema collector 10 most popular comedy twentieth century 10dvd rem           38
42      42                57              231                   music mp3 100 best romantic melody mp3 cd digipack          nan
45      45                57              231                   music mp3 100 of best folk song mp3 cd c

In [0]:
# revert to original item_category_id if item is not in clustered items
items_clustered['cluster_code'] = items_clustered.apply(lambda x: x.item_cluster_id if x.item_cluster_id > 0 else x.item_category_id, axis = 1)
items_clustered.head()

Unnamed: 0,item_id,item_category_id,item_cluster_id,item_tested,cluster_number,n_items_in_cluster,w_avg,w_sum,w_max,w_min,w_std,item_name,cluster_code
0,0,40,920,False,556,10,364,8724,2572,264,460.506,movie dvd power in glamor plast dvd,920
1,1,76,2600,False,1322,12,1278,61348,10520,404,1497.166,program home and office digital abbyy finereader 12 professional edition full pc digital version,2600
2,2,40,802,False,489,56,313,83032,4984,144,396.113,movie dvd in glory unv dvd,802
3,3,40,330,False,129,6,201,2212,1132,108,294.379,movie dvd blue wave univ dvd,330
4,4,40,1686,False,886,56,800,264120,2664,104,240.272,movie dvd box glass dvd,1686


In [0]:
# # save what we have; maybe refine later

# compression_opts = dict(method='gzip',
#                         archive_name='items_clustered_21700.csv')  
# items_clustered.to_csv('data_output/items_clustered_21700.csv.gz', index=False, compression=compression_opts)

In [0]:
# def join_friends(gramlist,friend1,friend2,reverse=False):
#     """
#     Combine things that were inadvertantly separated, like "x" and "box"
#     gramlist = list of text strings, each one is a 'delimited' string found by the above code
#     friend1 = first string to search for
#     friend2 = second string to search for
#     reverse = bool, if we want to check and standardize both orders of friends (like 'x box' as well as 'box x')
#     """
#     f1 = (friend1 in gramlist) 
#     f2 = (friend2 in gramlist)
#     f3 = (friend1 + " " + friend2) in gramlist
#     if reverse:
#         f4 = (friend2 + " " + friend1) in gramlist
#     else:
#         f4 = False

#     if (f1 and f2) or f3 or f4:
#         if f1:
#             gramlist.remove(friend1)
#         if f2: 
#             gramlist.remove(friend2)
#         if f4:
#             gramlist.remove(friend2 + " " + friend1)
#         if not f3:
#             gramlist.append(friend1 + " " + friend2)
#     return gramlist

# def friends(gramlist):
#     friends = []
#     friends.append("x","box",False) 
#     friends.append("p","s")
#     friends.append("bluray","dvd",True)
#     friends.append("4k","bluray dvd",True)
#     friends.append("4k bluray","dvd",True)
#     friends.append("4k","bluray",True)
#     friends.append("4k","dvd",True)
#     friends.append("3d","bluray dvd",True)
#     friends.append("3d bluray","dvd",True)
#     friends.append("3d dvd","bluray",True)
#     friends.append("3d","dvd",True)
#     friends.append("3d","bluray",True)
# tbd... maybe join delimited text and use a regex?