# DOM model training
We will look over the extracted tags and see whether an ML model can correctly predict the label of the data. We will first try this on the first type of dataset described in the main notes, and test it both on the same dataset and on the dataset of the entire website.

But first, we will do some inspection of the records to see if we can make some inferences about the predictions or use them as diagnostics if the model fails.

We will also approach the task both as a binary classification problem for each label and as a multiclass problem over all of them.
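
To make the two framings concrete, here is a minimal sketch on made-up data. The feature names `depth`, `children` and `noise`, and the three labels, are illustrative assumptions rather than the notebook's real columns.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# toy data: made-up features and three made-up labels, 20 rows each
rng = np.random.RandomState(0)
toy = pd.DataFrame({
    'depth': np.repeat([2, 5, 9], 20),
    'children': np.repeat([0, 3, 1], 20),
})
toy['noise'] = rng.rand(len(toy))
toy['label'] = np.repeat(['title', 'price', 'image'], 20)
features = ['depth', 'children', 'noise']

# multiclass: a single classifier predicts the label directly
multi_clf = DecisionTreeClassifier(random_state=0).fit(toy[features], toy['label'])

# binary: one classifier per label, each answering "is this node label X?"
binary_clfs = {
    label: DecisionTreeClassifier(random_state=0).fit(toy[features], toy['label'] == label)
    for label in toy['label'].unique()
}

multi_pred = multi_clf.predict(toy[features])
binary_pred = {label: clf.predict(toy[features]) for label, clf in binary_clfs.items()}
```

The binary framing yields one yes/no prediction per label for each node, while the multiclass framing forces exactly one label per node.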

In [None]:
%matplotlib inline
# standard library
import itertools
import sys, os
import re

from urllib.parse import urlparse

# pandas
import pandas as pd
import dask.dataframe as dd

# numpy, matplotlib, seaborn
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


# scikit-learn
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import train_test_split

# local imports
sys.path.append(os.path.join(os.getcwd(), "src"))
from utils import get_domain_from_url

# this styling is purely my preference
# less chartjunk
# sns.set resets the plotting context, so it has to run before set_context
sns.set(style='ticks', palette='Set2')
sns.set_context('notebook', font_scale=1.5, rc={'lines.linewidth': 2.5})

We will start with the data from OLX and explore it first.

In [None]:
# load the data
df = pd.read_csv('../data/ecommerce-new/final/split-url/www.olx.ro.csv').drop('Unnamed: 0', axis='columns')
df.head()

Let's take a look at how the features for some of the labeled data look to see if we can spot any patterns.

As we can see, there is almost no difference between the records with this label. To check this exhaustively, we will compute the average absolute z-score of every column for a given label.

In [None]:
label_cols = ['detail_description_label', 'detail_image_label', 'detail_price_label',
              'detail_title_label', 'list_image_label', 'list_price_label', 'list_title_label']
one_hot_cols = [col for col in df.columns if 'tag' in col]
df['label'] = df[label_cols].idxmax(axis=1)  # reverse the one-hot encoding
# keep only rows with at least one label set; copy so later assignments don't touch df
labeled_df = df.loc[df[label_cols].any(axis=1), :].copy()

# use the categorical column for groupby
grouped_df = labeled_df.drop(['url', 'path'] + one_hot_cols + label_cols, axis='columns').groupby('label', as_index=True)
z_score_df = grouped_df.agg(lambda x: ((x - x.mean()) / x.std()).abs().mean()).fillna(0)
z_score_df

As we can see, the z-scores take varied values, but the table is very sparse. To get a better measure of how much the values vary, we will average the z-scores over every column in each group.

In [None]:
z_score_df.mean(axis=1)

From the table above, we can see that for each label group most feature values lie less than 10% of a standard deviation away from their mean. This means that the sought values form a fairly tight cluster with little variation.
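
When reading this average, it helps to see how constant columns enter it. The sketch below repeats the same computation on a small made-up frame (the column names are illustrative): a constant column has zero standard deviation, so its z-scores are NaN and `fillna(0)` turns them into 0.

```python
import pandas as pd

# toy data: one feature that never changes within a label, one that does
toy = pd.DataFrame({
    'label': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'constant_feature': [3, 3, 3, 3, 7, 7, 7, 7],
    'varying_feature': [1, 2, 3, 4, 10, 20, 30, 40],
})

def mean_abs_z(col):
    # std() is 0 for a constant column, so the division yields NaN
    return ((col - col.mean()) / col.std()).abs().mean()

toy_z = toy.groupby('label').agg(mean_abs_z).fillna(0)
per_label = toy_z.mean(axis=1)
```

Here `constant_feature` contributes 0 and `varying_feature` about 0.77, so the per-label average comes out near 0.39. A low average can therefore reflect many constant columns rather than every value lying close to its mean.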

To observe this visually, we will draw boxplots of each variable's z-score within its own class. This way we can see how far the values stray from their mean.

In [None]:
# z-normalize all the columns but label and create a dataset we can plot later
z_score_cols = df.drop(['url', 'path', 'label'] + one_hot_cols + label_cols, axis='columns').columns.tolist()  # columns to use for labeled df

# we can only use the labeled records, as grouping by NaNs fails
# restrict the transform to the numeric feature columns; url and path are strings
labeled_df[z_score_cols] = (labeled_df.groupby('label')[z_score_cols]
                            .transform(lambda x: (x - x.mean()) / x.std())
                            .fillna(0))
labeled_z_score_df = labeled_df.drop(['url', 'path'] + one_hot_cols + label_cols, axis='columns')

In [None]:
# melt into long format so we can plot with seaborn
g = sns.FacetGrid(data=pd.melt(labeled_z_score_df, id_vars=['label']), row='label', height=10)  # `size` was renamed to `height` in seaborn 0.9
g.map(sns.boxplot, 'value', 'variable', orient='h')
sns.despine(trim=True)

Having plotted boxplots of the z-scores of each variable for every class of labels, we can see that most of them have z-scores of 0, meaning there is no variation at all between the values. Some of the variables have the IQR of their z-scores in roughly the (-1, 1) interval, which is normal. Overall, this means that elements of the same class on a site might represent a single point in the high-dimensional feature space.

**Note** This means testing would revolve around the model just learning to differentiate one sample from the others. Could it be considered overfitting if the training data has all the possible samples in the domain?
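
One way to check how close each class comes to a single point is to count the distinct feature vectors per label. This is a minimal sketch on made-up data; the column names `depth` and `children` are illustrative, not the notebook's actual features.

```python
import pandas as pd

# toy data: 'title' rows are all identical, 'price' rows come in two variants
toy = pd.DataFrame({
    'label': ['title', 'title', 'title', 'price', 'price', 'price'],
    'depth': [4, 4, 4, 6, 6, 7],
    'children': [0, 0, 0, 1, 1, 1],
})

feature_cols = ['depth', 'children']

# drop repeated (label, features) rows, then count what is left per label
distinct_points = (
    toy.drop_duplicates(subset=['label'] + feature_cols)
       .groupby('label')
       .size()
)
```

A count of 1 for a label means every record of that class has exactly the same feature vector.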

There are, however, a few very high outliers for variables where the IQR collapses to the point 0. These might be due to differences between particular page templates, or to bad labeling, but compared with the other variables, which have zero variance, it might just be ordinary noise.


The problem that remains to be explored is how this data varies across different websites. To do this, we will plot the z-scores of all the labeled data, grouped by domain.

In [None]:
# get all the labeled data from all the sites
ddf = dd.read_csv('../data/ecommerce-new/final/split-url/*.csv')
ddf = ddf.loc[ddf[label_cols].any(axis=1), :].drop('Unnamed: 0', axis=1)  # drop unlabeled records so they don't clutter memory
all_site_df = ddf.compute()  # transform to pandas
all_site_df['label'] = all_site_df[label_cols].idxmax(axis=1)  # reverse the one-hot encoding
all_site_df['domain'] = all_site_df['url'].apply(get_domain_from_url)  # add the domain

z_score_cols = all_site_df.drop(['url', 'path', 'label', 'domain'] + one_hot_cols + label_cols, axis='columns').columns.tolist()  # columns to use for z-scores
# get z-scores among all sites, restricted to the numeric feature columns
all_site_df[z_score_cols] = (all_site_df.groupby('label')[z_score_cols]
                             .transform(lambda x: (x - x.mean()) / x.std())
                             .fillna(0))
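
The helper `get_domain_from_url` lives in the local `utils` module, which is not shown here. As a rough idea of what such a helper could do, the hypothetical stand-in below returns the network location of a URL using the standard library. The name `domain_from_url_sketch` and its behaviour are assumptions, not the actual implementation.

```python
from urllib.parse import urlparse

def domain_from_url_sketch(url):
    """Hypothetical stand-in for utils.get_domain_from_url (an assumption)."""
    # netloc is the host part of the URL, e.g. 'www.olx.ro'
    return urlparse(url).netloc.lower()

example_domain = domain_from_url_sketch('https://www.olx.ro/oferta/some-listing')
```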

In [None]:
# plot the boxplot
all_site_df = all_site_df.drop(['url', 'path'] + one_hot_cols + label_cols, axis='columns') # drop unplottable cols

# melt into long format so we can plot with seaborn
g = sns.FacetGrid(data=pd.melt(all_site_df, id_vars=['label', 'domain']), row='label', height=50, aspect=1)  # `size` was renamed to `height` in seaborn 0.9
g.map(sns.boxplot, 'value', 'variable', 'domain', orient='h')  # group hue by domain
g.set(xlim=(-10, 10))  # set the xlims
sns.despine(trim=True)

# add grid lines to every facet, however many label rows there are
for axes in g.axes.flat:
    axes.xaxis.grid(True, linestyle='-')
    axes.yaxis.grid(True, linestyle='--')

Looking over the plot of the z-scores grouped on a per-site basis, we can see some particular patterns emerging. The most obvious one is the lack of variation in most descendant features, especially for descendants two or more levels down. This is, of course, because all the data we have labeled on the sites is either a leaf node in the DOM or the parent of a leaf node, so these features barely vary for lack of such descendants. Although this might seem redundant at first glance, those features actually convey information about the height of each node in the tree, and if our classes are mainly leaves in the DOM, that is crucial information.

As for the other features, we can see for each site the pattern we noticed in the previous plot, where we explored data from only one site: each class of tags on a particular website fits more or less a single point in the feature space. Moreover, when the values do diverge, they seem to diverge by at most one standard deviation from the mean (z-scores belonging to the (-1, 1) interval).

Overall, the lack of variance within each website indicates that we don't need more data from a single website, as it would be redundant. Rather, to train a model that finds potential patterns, we would need a small number of pages from many different websites.
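
If the goal is to generalize across websites, evaluation should probably hold out whole domains rather than random rows. The sketch below uses scikit-learn's `LeaveOneGroupOut` on random toy data with made-up domain names. It is not part of the original analysis.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.tree import DecisionTreeClassifier

# toy data: 3 made-up domains with 10 rows each, random features and labels
rng = np.random.RandomState(0)
domains = np.repeat(['site-a.example', 'site-b.example', 'site-c.example'], 10)
X = rng.rand(30, 4)
y = rng.randint(0, 2, size=30)

held_out = []
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=domains):
    # no domain ever appears on both sides of the split
    assert set(domains[train_idx]).isdisjoint(domains[test_idx])
    clf = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    held_out.append((domains[test_idx][0], clf.score(X[test_idx], y[test_idx])))
```

Each entry of `held_out` pairs a domain with the accuracy obtained when that domain was never seen during training.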

### Dataset quality
To assess the quality of the variables, we should first see if there is any correlation between them.

In [None]:
corr = df.drop(['url', 'path', 'label'] + one_hot_cols + label_cols, axis='columns').corr()

plt.figure(figsize=(20, 20))
cmap = sns.color_palette("BrBG", 11)
sns.heatmap(corr, square=True, center=0, vmin=-1, vmax=1, cmap=cmap, linewidths=0.1, cbar=False)

Looking at the correlation plot, we can see some correlations between ancestor and descendant features at different levels of depth. These features might be redundant, but there is no harm in keeping them.

In [None]:
# for better visualization, we will see which pairs have the highest correlation
melt_corr = corr.unstack().reset_index().fillna(0)
# sorted, order-independent key so that (a, b) and (b, a) count as the same pair
melt_corr['pair'] = melt_corr.apply(lambda x: ' | '.join(sorted([x['level_0'], x['level_1']])), axis=1)
melt_corr = melt_corr[(melt_corr[0].abs() > 0.5) & (melt_corr['level_0'] != melt_corr['level_1'])]
melt_corr.drop_duplicates(subset='pair').drop('pair', axis=1)

Most of the correlations here are no surprise, as a feature at one level of descendants/ancestors should be correlated with the same feature at directly adjacent levels, but some are not so easily explained. For example, there is a certain degree of correlation between the position among siblings and the number of children of a node.
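
An alternative way to list each strongly correlated pair exactly once is to keep only the upper triangle of the correlation matrix. This is a sketch on made-up columns `a`, `b` and `c`, not the notebook's features.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
base = rng.rand(100)
toy = pd.DataFrame({
    'a': base,
    'b': base * 2 + 0.01 * rng.rand(100),  # almost a linear copy of 'a'
    'c': rng.rand(100),                    # drawn independently of 'a' and 'b'
})

corr = toy.corr()
# True strictly above the diagonal: drops self-pairs and mirrored duplicates
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack()
# masked (NaN) entries never pass the threshold
strong_pairs = pairs[pairs.abs() > 0.5]
```

The result is a Series indexed by the pair of column names, which avoids building a separate deduplication key.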

## Leave-one-out prediction
To see how good the features are for training, we will try to predict each of them from the remaining ones using a simple machine learning algorithm such as a decision tree, and assess the performance.

In [None]:
# split the dataset
train_df, test_df = train_test_split(df.drop(['url', 'path', 'label'] + label_cols, axis='columns'), test_size=0.2)

In [None]:
scores = pd.Series(dtype=float)  # the series to add the results to

# get the regression scores when leaving out each column
for leave_out_col in df.drop(['url', 'path', 'label'] + label_cols, axis='columns').columns:
    # get the train and test data
    X_train, Y_train = train_df.drop(leave_out_col, axis=1), train_df[leave_out_col]
    X_test, Y_test = test_df.drop(leave_out_col, axis=1), test_df[leave_out_col]
    # train the decision tree
    reg = DecisionTreeRegressor()
    reg.fit(X_train, Y_train)

    # add the R^2 score on the test split to the results
    reg_score = reg.score(X_test, Y_test)
    scores[leave_out_col] = reg_score
    print('{0}: {1}'.format(leave_out_col, reg_score))

In [None]:
scores.sort_values().head()