# DOM features experiment. Continuation
We pick up where we left off last time. We will redo the experiments and look at the results, but this time we will also try to classify the entire dataset, both after training on a single website and after training on all of them.

## Constants

In [1]:
URL_SPLIT_PATH = '../data/final/first/split-url/'
LABEL_SPLIT_PATH = '../data/final/first/split-label/'

EXPERIMENT_RESULTS_CSV = '../data/experimental-results/first-experiments.csv'
URL_SPLIT_CSVS = URL_SPLIT_PATH + '*.csv'
LABEL_SPLIT_CSVS = LABEL_SPLIT_PATH + '*.csv'

In [2]:
%matplotlib inline

# standard library
import itertools
import sys, os
import re
import glob
import logging

from urllib.parse import urlparse

# pandas
import pandas as pd
import dask.dataframe as dd

# numpy, matplotlib, seaborn
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


# sklearn
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, precision_recall_fscore_support
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier

# local imports
sys.path.append(os.path.join(os.getcwd(), "../src"))
from utils import get_domain_from_url
from experiments import simple_model_experiment, get_dataset_descr_from_filename, rf_eval

# this styling is purely my preference
# less chartjunk
# set the style first: sns.set() also resets the plotting context
sns.set(style='ticks', palette='Set2')
sns.set_context('notebook', font_scale=1.5, rc={'lines.linewidth': 2.5})



Now that we have the scaffolding for the experiments, we can use it to run our five experiments with a given model. All the experiments are defined in terms of the `simple_model_experiment` function; for each one we build a description of its train and test datasets.

In [3]:
label_cols = ['detail_description_label', 'detail_image_label', 'detail_price_label',
              'detail_title_label', 'list_image_label', 'list_price_label', 'list_title_label']

For the first experiment, we will only use the CSVs with the pages of a website that contain a given label, for both training and testing.

In [4]:
# describe the experiments
train_dataset_files = glob.glob(LABEL_SPLIT_CSVS)
train_datasets = [(file, ) + get_dataset_descr_from_filename(file) for file in train_dataset_files]
train_file_df = pd.DataFrame(data=train_datasets, columns=('file', 'website', 'label'))

# merge it with itself on the index, so each file is paired with itself
experiments_df = train_file_df.merge(train_file_df, left_index=True, right_index=True, suffixes=('_train', '_test'))
experiments_df.head()  # inspect the experiments

Unnamed: 0,file_train,website_train,label_train,file_test,website_test,label_test
0,../data/final/first/split-label/www.emag.ro-li...,www.emag.ro,list_price_label,../data/final/first/split-label/www.emag.ro-li...,www.emag.ro,list_price_label
1,../data/final/first/split-label/lajumate.ro-li...,lajumate.ro,list_image_label,../data/final/first/split-label/lajumate.ro-li...,lajumate.ro,list_image_label
2,../data/final/first/split-label/lajumate.ro-li...,lajumate.ro,list_title_label,../data/final/first/split-label/lajumate.ro-li...,lajumate.ro,list_title_label
3,../data/final/first/split-label/www.emag.ro-li...,www.emag.ro,list_title_label,../data/final/first/split-label/www.emag.ro-li...,www.emag.ro,list_title_label
4,../data/final/first/split-label/www.amazon.com...,www.amazon.com,detail_description_label,../data/final/first/split-label/www.amazon.com...,www.amazon.com,detail_description_label
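
The `website` and `label` columns above come from `get_dataset_descr_from_filename`, whose source is not shown in this notebook. Below is a minimal sketch of what such a helper might look like, assuming label files are named `<website>-<label>-<suffix>.csv` and whole-site files `<website>.csv`. The name `describe_dataset_file`, the naming scheme and the example paths are assumptions for illustration, not the actual implementation in `experiments`.

```python
import os

# Label names as listed in `label_cols` above.
LABELS = ['detail_description_label', 'detail_image_label', 'detail_price_label',
          'detail_title_label', 'list_image_label', 'list_price_label', 'list_title_label']


def describe_dataset_file(path):
    """Return (website, label); label is 'all' for whole-site files.

    Hypothetical stand-in for `get_dataset_descr_from_filename`.
    """
    stem = os.path.splitext(os.path.basename(path))[0]
    for label in LABELS:
        pos = stem.find('-' + label)
        if pos != -1:
            # Everything before '-<label>' is the website, which may itself contain '-'.
            return stem[:pos], label
    return stem, 'all'


# Hypothetical file names, used only to exercise the sketch.
print(describe_dataset_file('split-label/www.piata-az.ro-list_price_label-0.csv'))
print(describe_dataset_file('split-url/www.emag.ro.csv'))
```

Matching against the known label names, rather than splitting on `-`, keeps hyphenated domains such as `www.piata-az.ro` intact.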


In [5]:
first_experiment_df = experiments_df.copy()  # persist it
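
The self-merge used for this experiment can look odd at first sight. The sketch below, with placeholder values rather than real dataset files, shows that merging a frame with itself on the index simply duplicates every column under the `_train` and `_test` suffixes, so each row describes training and testing on the same file.

```python
import pandas as pd

files = pd.DataFrame({
    'file': ['a.csv', 'b.csv'],          # placeholder file names
    'website': ['site-a', 'site-b'],     # placeholder websites
    'label': ['list_price_label', 'list_title_label'],
})

# Merging on the index pairs every row with itself; the suffixes
# distinguish the "train" copy of each column from the "test" copy.
paired = files.merge(files, left_index=True, right_index=True,
                     suffixes=('_train', '_test'))

print(list(paired.columns))
print((paired['file_train'] == paired['file_test']).all())
```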

For the second one, the testing set will be the entire website.

In [6]:
train_dataset_files = glob.glob(LABEL_SPLIT_CSVS)
test_dataset_files = glob.glob(URL_SPLIT_CSVS)

train_datasets = [(file, ) + get_dataset_descr_from_filename(file) for file in train_dataset_files]
test_datasets = [(file, ) + get_dataset_descr_from_filename(file) for file in test_dataset_files]

# we need to pair the label/website files with their website equivalent
train_file_df = pd.DataFrame(data=train_datasets, columns=('file', 'website', 'label'))
test_file_df = pd.DataFrame(data=test_datasets, columns=('file', 'website', 'label'))

# join them on the same website, with the proper suffixes
experiments_df = train_file_df.merge(test_file_df, on='website',  suffixes=('_train', '_test'))
experiments_df['website_train'] = experiments_df['website_test'] = experiments_df['website']
experiments_df.head()  # inspect the experiments

Unnamed: 0,file_train,website,label_train,file_test,label_test,website_train,website_test
0,../data/final/first/split-label/www.emag.ro-li...,www.emag.ro,list_price_label,../data/final/first/split-url/www.emag.ro.csv,all,www.emag.ro,www.emag.ro
1,../data/final/first/split-label/www.emag.ro-li...,www.emag.ro,list_title_label,../data/final/first/split-url/www.emag.ro.csv,all,www.emag.ro,www.emag.ro
2,../data/final/first/split-label/www.emag.ro-de...,www.emag.ro,detail_image_label,../data/final/first/split-url/www.emag.ro.csv,all,www.emag.ro,www.emag.ro
3,../data/final/first/split-label/www.emag.ro-de...,www.emag.ro,detail_description_label,../data/final/first/split-url/www.emag.ro.csv,all,www.emag.ro,www.emag.ro
4,../data/final/first/split-label/www.emag.ro-de...,www.emag.ro,detail_title_label,../data/final/first/split-url/www.emag.ro.csv,all,www.emag.ro,www.emag.ro


In [7]:
second_experiment_df = experiments_df.copy()
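
The join on `website` above pairs each label file with the whole-site file of the same website. A small sketch with made-up file names (all values are placeholders, not files from the dataset) shows the resulting shape and column names:

```python
import pandas as pd

train = pd.DataFrame({
    'file': ['site-a-list_price.csv', 'site-a-list_title.csv'],  # placeholders
    'website': ['site-a', 'site-a'],
    'label': ['list_price_label', 'list_title_label'],
})
test = pd.DataFrame({'file': ['site-a.csv'], 'website': ['site-a'], 'label': ['all']})

# One whole-site test file is matched to every label file of that website.
# Only the overlapping non-key columns (file, label) receive the suffixes.
pairs = train.merge(test, on='website', suffixes=('_train', '_test'))

print(list(pairs.columns))
print(len(pairs))
```

Because `website` is the join key it is not suffixed, which is why the cell above adds `website_train` and `website_test` explicitly.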

For the third one, both training and testing are done on the entire website.

In [8]:
# describe the experiments
train_dataset_files = glob.glob(URL_SPLIT_CSVS)
train_datasets = [(file, ) + get_dataset_descr_from_filename(file) for file in train_dataset_files]
train_file_df = pd.DataFrame(data=train_datasets, columns=('file', 'website', 'label'))

# merge it with itself on the index, so each file is paired with itself
experiments_df = train_file_df.merge(train_file_df, left_index=True, right_index=True, suffixes=('_train', '_test'))
experiments_df.head()  # inspect the experiments

Unnamed: 0,file_train,website_train,label_train,file_test,website_test,label_test
0,../data/final/first/split-url/www.amazon.com.csv,www.amazon.com,all,../data/final/first/split-url/www.amazon.com.csv,www.amazon.com,all
1,../data/final/first/split-url/www.aliexpress.c...,www.aliexpress.com,all,../data/final/first/split-url/www.aliexpress.c...,www.aliexpress.com,all
2,../data/final/first/split-url/www.olx.ro.csv,www.olx.ro,all,../data/final/first/split-url/www.olx.ro.csv,www.olx.ro,all
3,../data/final/first/split-url/www.emag.ro.csv,www.emag.ro,all,../data/final/first/split-url/www.emag.ro.csv,www.emag.ro,all
4,../data/final/first/split-url/www.piata-az.ro.csv,www.piata-az.ro,all,../data/final/first/split-url/www.piata-az.ro.csv,www.piata-az.ro,all


In [9]:
third_experiment_df = experiments_df.copy()

The fourth one is trained on an entire website and tested on all of the websites.

In [10]:
train_dataset_files = glob.glob(URL_SPLIT_CSVS)
train_datasets = [(file, ) + get_dataset_descr_from_filename(file) for file in train_dataset_files]
train_file_df = pd.DataFrame(data=train_datasets, columns=('file_train', 'website_train', 'label_train'))

experiments_df = train_file_df
experiments_df['file_test'] = URL_SPLIT_CSVS
experiments_df['website_test'] = experiments_df['label_test'] = 'all'
               
experiments_df.head()

Unnamed: 0,file_train,website_train,label_train,file_test,website_test,label_test
0,../data/final/first/split-url/www.amazon.com.csv,www.amazon.com,all,../data/final/first/split-url/*.csv,all,all
1,../data/final/first/split-url/www.aliexpress.c...,www.aliexpress.com,all,../data/final/first/split-url/*.csv,all,all
2,../data/final/first/split-url/www.olx.ro.csv,www.olx.ro,all,../data/final/first/split-url/*.csv,all,all
3,../data/final/first/split-url/www.emag.ro.csv,www.emag.ro,all,../data/final/first/split-url/*.csv,all,all
4,../data/final/first/split-url/www.piata-az.ro.csv,www.piata-az.ro,all,../data/final/first/split-url/*.csv,all,all


In [11]:
fourth_experiment_df = experiments_df.copy()
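
Note that in this experiment `file_test` holds a glob pattern rather than a single path, so the loader is expected to expand it into every per-site file. As a quick sanity check, the sketch below uses the standard library's `fnmatch` to confirm that the pattern covers the per-site files and excludes a label-split file. The label-split file name is a made-up example.

```python
from fnmatch import fnmatch

URL_SPLIT_CSVS = '../data/final/first/split-url/*.csv'

site_files = [
    '../data/final/first/split-url/www.amazon.com.csv',
    '../data/final/first/split-url/www.olx.ro.csv',
    '../data/final/first/split-url/www.emag.ro.csv',
]
# Hypothetical label-split file, which the pattern should not cover.
label_file = '../data/final/first/split-label/www.emag.ro-list_price_label-0.csv'

matches = [fnmatch(path, URL_SPLIT_CSVS) for path in site_files]
print(matches)
print(fnmatch(label_file, URL_SPLIT_CSVS))
```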

Finally, the last one is trained on all the websites and tested on them as well.

In [12]:
fifth_experiment_df = pd.DataFrame(data={'file_train': URL_SPLIT_CSVS, 
                                         'file_test': URL_SPLIT_CSVS,
                                         'website_train': 'all', 'website_test': 'all', 
                                         'label_train': 'all', 'label_test': 'all'}, index=[0])
fifth_experiment_df

Unnamed: 0,file_test,file_train,label_test,label_train,website_test,website_train
0,../data/final/first/split-url/*.csv,../data/final/first/split-url/*.csv,all,all,all,all
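
The `index=[0]` argument in the cell above is needed because every value in the dictionary is a scalar, and pandas cannot infer how many rows to create. A short sketch with placeholder values:

```python
import pandas as pd

row = {'file_train': '*.csv', 'file_test': '*.csv'}  # placeholder values

try:
    pd.DataFrame(data=row)        # all scalars and no index: pandas refuses
    raised = False
except ValueError:
    raised = True

# Passing an explicit index tells pandas to build exactly one row.
single = pd.DataFrame(data=row, index=[0])

print(raised)
print(single.shape)
```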


## Experiment running
Now that we have the datasets to run the experiments on, we can finally run each experiment. We will save the results in a couple of dataframes, which we will persist to a CSV.
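
The overall pattern can be sketched as follows: iterate over the rows of an experiment description, evaluate each one, collect the results in a dataframe and write it out as CSV. `describe_run` is a hypothetical stand-in that only echoes its input (it is not `rf_eval` or `simple_model_experiment`), and an in-memory buffer stands in for `EXPERIMENT_RESULTS_CSV`.

```python
import io
import pandas as pd

experiments = pd.DataFrame({
    'file_train': ['a.csv', 'b.csv'],   # placeholder paths
    'file_test': ['a.csv', 'b.csv'],
})


def describe_run(row):
    """Hypothetical stand-in for a model evaluation: it only echoes the row."""
    return {'train': row['file_train'], 'test': row['file_test']}


# iterrows() yields (index, row) pairs; we keep only the row,
# which is what map(lambda x: x[1], df.iterrows()) does as well.
rows = (row for _, row in experiments.iterrows())
results = pd.DataFrame([describe_run(row) for row in rows])

# Persist the results as CSV (to a buffer here instead of a file on disk).
buffer = io.StringIO()
results.to_csv(buffer, index=False)
print(buffer.getvalue())
```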

### Train/test on website subset

In [13]:
first_results_df = simple_model_experiment(map(lambda x: x[1], first_experiment_df.iterrows()), model_func=rf_eval, 
                                           experiment_name='first-random-forest', label_cols=label_cols)

../data/final/first/split-label/www.emag.ro-list_price_label-*.csv - ../data/final/first/split-label/www.emag.ro-list_price_label-*.csv

  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)

../data/final/first/split-label/lajumate.ro-list_title_label-*.csv - ../data/final/first/split-label/lajumate.ro-list_title_label-*.csv
../data/final/first/split-label/www.emag.ro-list_title_label-*.csv - ../data/final/first/split-label/www.emag.ro-list_title_label-*.csv
../data/final/first/split-label/www.emag.ro-detail_image_label-*.csv - ../data/final/first/split-label/www.emag.ro-detail_image_label-*.csv
../data/final/first/split-label/www.aliexpress.com-list_title_label-*.csv - ../data/final/first/split-label/www.aliexpress.com-list_title_label-*.csv
../data/final/first/split-label/www.olx.ro-list_title_label-*.csv - ../data/final/first/split-label/www.olx.ro-list_title_label-*.csv
../data/final/first/split-label/www.olx.ro-list_price_label-*.csv - ../data/final/first/split-label/www.olx.ro-list_price_label-*.csv
../data/final/first/split-label/www.aliexpress.com-detail_title_label-*.csv - ../data/final/first/split-label/www.aliexpress.com-detail_title_label-*.csv
../data/final/first/split-label/www.okazii.ro-detail_title_label-*.csv - ../data/final/first/split-label/www.okazii.ro-detail_title_label-*.csv
../data/final/first/split-label/www.aliexpress.com-detail_description_label-*.csv - ../data/final/first/split-label/www.aliexpress.com-detail_description_label-*.csv
../data/final/first/split-label/www.olx.ro-detail_description_label-*.csv - ../data/final/first/split-label/www.olx.ro-detail_description_label-*.csv
../data/final/first/split-label/lajumate.ro-list_price_label-*.csv - ../data/final/first/split-label/lajumate.ro-list_price_label-*.csv
../data/final/first/split-label/lajumate.ro-detail_title_label-*.csv - ../data/final/first/split-label/lajumate.ro-detail_title_label-*.csv
../data/final/first/split-label/www.amazon.com-detail_title_label-*.csv - ../data/final/first/split-label/www.amazon.com-detail_title_label-*.csv
../data/final/first/split-label/www.olx.ro-detail_image_label-*.csv - ../data/final/first/split-label/www.olx.ro-detail_image_label-*.csv
../data/final/first/split-label/www.aliexpress.com-list_image_label-*.csv - ../data/final/first/split-label/www.aliexpress.com-list_image_label-*.csv
../data/final/first/split-label/www.emag.ro-detail_description_label-*.csv - ../data/final/first/split-label/www.emag.ro-detail_description_label-*.csv
../data/final/first/split-label/www.amazon.com-detail_price_label-*.csv - ../data/final/first/split-label/www.amazon.com-detail_price_label-*.csv
../data/final/first/split-label/www.emag.ro-detail_title_label-*.csv - ../data/final/first/split-label/www.emag.ro-detail_title_label-*.csv
../data/final/first/split-label/www.aliexpress.com-detail_image_label-*.csv - ../data/final/first/split-label/www.aliexpress.com-detail_image_label-*.csv
../data/final/first/split-label/lajumate.ro-detail_image_label-*.csv - ../data/final/first/split-label/lajumate.ro-detail_image_label-*.csv
../data/final/first/split-label/www.okazii.ro-list_image_label-*.csv - ../data/final/first/split-label/www.okazii.ro-list_image_label-*.csv
../data/final/first/split-label/www.okazii.ro-detail_image_label-*.csv - ../data/final/first/split-label/www.okazii.ro-detail_image_label-*.csv
../data/final/first/split-label/www.amazon.com-list_title_label-*.csv - ../data/final/first/split-label/www.amazon.com-list_title_label-*.csv
../data/final/first/split-label/www.amazon.com-detail_image_label-*.csv - ../data/final/first/split-label/www.amazon.com-detail_image_label-*.csv
../data/final/first/split-label/www.okazii.ro-list_price_label-*.csv - ../data/final/first/split-label/www.okazii.ro-list_price_label-*.csv
../data/final/first/split-label/www.okazii.ro-detail_description_label-*.csv - ../data/final/first/split-label/www.okazii.ro-detail_description_label-*.csv
../data/final/first/split-label/www.amazon.com-list_price_label-*.csv - ../data/final/first/split-label/www.amazon.com-list_price_label-*.csv
../data/final/first/split-label/www.okazii.ro-detail_price_label-*.csv - ../data/final/first/split-label/www.okazii.ro-detail_price_label-*.csv
../data/final/first/split-label/www.olx.ro-detail_price_label-*.csv - ../data/final/first/split-label/www.olx.ro-detail_price_label-*.csv
../data/final/first/split-label/lajumate.ro-detail_price_label-*.csv - ../data/final/first/split-label/lajumate.ro-detail_price_label-*.csv
../data/final/first/split-label/www.olx.ro-list_image_label-*.csv - ../data/final/first/split-label/www.olx.ro-list_image_label-*.csv
../data/final/first/split-label/lajumate.ro-detail_description_label-*.csv - ../data/final/first/split-label/lajumate.ro-detail_description_label-*.csv
../data/final/first/split-label/www.emag.ro-list_image_label-*.csv - ../data/final/first/split-label/www.emag.ro-list_image_label-*.csv
../data/final/first/split-label/www.olx.ro-detail_title_label-*.csv - ../data/final/first/split-label/www.olx.ro-detail_title_label-*.csv
../data/final/first/split-label/www.amazon.com-list_image_label-*.csv - ../data/final/first/split-label/www.amazon.com-list_image_label-*.csv
../data/final/first/split-label/www.emag.ro-detail_price_label-*.csv - ../data/final/first/split-label/www.emag.ro-detail_price_label-*.csv
../data/final/first/split-label/www.aliexpress.com-list_price_label-*.csv - ../data/final/first/split-label/www.aliexpress.com-list_price_label-*.csv


  result = _execute_task(task, data)


ValueError: Mismatched dtypes found in `pd.read_csv`/`pd.read_table`.

+------------------------------------+---------+----------+
| Column                             | Found   | Expected |
+------------------------------------+---------+----------+
| ancestor1_depth                    | float64 | int64    |
| ancestor1_no_children              | float64 | int64    |
| ancestor1_no_classes               | float64 | int64    |
| ancestor1_sibling_pos              | float64 | int64    |
| ancestor1_tag_                     | float64 | int64    |
| ancestor1_tag_a                    | float64 | int64    |
| ancestor1_tag_address              | float64 | int64    |
| ancestor1_tag_article              | float64 | int64    |
| ancestor1_tag_aside                | float64 | int64    |
| ancestor1_tag_b                    | float64 | int64    |
| ancestor1_tag_body                 | float64 | int64    |
| ancestor1_tag_button               | float64 | int64    |
| ancestor1_tag_caption              | float64 | int64    |
| ancestor1_tag_center               | float64 | int64    |
| ancestor1_tag_colgroup             | float64 | int64    |
| ancestor1_tag_dd                   | float64 | int64    |
| ancestor1_tag_defs                 | float64 | int64    |
| ancestor1_tag_del                  | float64 | int64    |
| ancestor1_tag_div                  | float64 | int64    |
| ancestor1_tag_dl                   | float64 | int64    |
| ancestor1_tag_dt                   | float64 | int64    |
| ancestor1_tag_em                   | float64 | int64    |
| ancestor1_tag_fieldset             | float64 | int64    |
| ancestor1_tag_figcaption           | float64 | int64    |
| ancestor1_tag_figure               | float64 | int64    |
| ancestor1_tag_font                 | float64 | int64    |
| ancestor1_tag_footer               | float64 | int64    |
| ancestor1_tag_form                 | float64 | int64    |
| ancestor1_tag_g                    | float64 | int64    |
| ancestor1_tag_h1                   | float64 | int64    |
| ancestor1_tag_h2                   | float64 | int64    |
| ancestor1_tag_h3                   | float64 | int64    |
| ancestor1_tag_h4                   | float64 | int64    |
| ancestor1_tag_h5                   | float64 | int64    |
| ancestor1_tag_h6                   | float64 | int64    |
| ancestor1_tag_head                 | float64 | int64    |
| ancestor1_tag_header               | float64 | int64    |
| ancestor1_tag_html                 | float64 | int64    |
| ancestor1_tag_i                    | float64 | int64    |
| ancestor1_tag_ins                  | float64 | int64    |
| ancestor1_tag_label                | float64 | int64    |
| ancestor1_tag_legend               | float64 | int64    |
| ancestor1_tag_li                   | float64 | int64    |
| ancestor1_tag_main                 | float64 | int64    |
| ancestor1_tag_map                  | float64 | int64    |
| ancestor1_tag_mktz-div             | float64 | int64    |
| ancestor1_tag_nav                  | float64 | int64    |
| ancestor1_tag_noindex              | float64 | int64    |
| ancestor1_tag_object               | float64 | int64    |
| ancestor1_tag_ol                   | float64 | int64    |
| ancestor1_tag_omni-container       | float64 | int64    |
| ancestor1_tag_omni-interact        | float64 | int64    |
| ancestor1_tag_omni-ribb-cent       | float64 | int64    |
| ancestor1_tag_optgroup             | float64 | int64    |
| ancestor1_tag_p                    | float64 | int64    |
| ancestor1_tag_path                 | float64 | int64    |
| ancestor1_tag_querylogoperation    | float64 | int64    |
| ancestor1_tag_s                    | float64 | int64    |
| ancestor1_tag_section              | float64 | int64    |
| ancestor1_tag_select               | float64 | int64    |
| ancestor1_tag_small                | float64 | int64    |
| ancestor1_tag_span                 | float64 | int64    |
| ancestor1_tag_string               | float64 | int64    |
| ancestor1_tag_strong               | float64 | int64    |
| ancestor1_tag_svg                  | float64 | int64    |
| ancestor1_tag_table                | float64 | int64    |
| ancestor1_tag_tbody                | float64 | int64    |
| ancestor1_tag_td                   | float64 | int64    |
| ancestor1_tag_text                 | float64 | int64    |
| ancestor1_tag_th                   | float64 | int64    |
| ancestor1_tag_thead                | float64 | int64    |
| ancestor1_tag_tr                   | float64 | int64    |
| ancestor1_tag_ul                   | float64 | int64    |
| ancestor2_depth                    | float64 | int64    |
| ancestor2_no_children              | float64 | int64    |
| ancestor2_no_classes               | float64 | int64    |
| ancestor2_sibling_pos              | float64 | int64    |
| ancestor2_tag_                     | float64 | int64    |
| ancestor2_tag_a                    | float64 | int64    |
| ancestor2_tag_address              | float64 | int64    |
| ancestor2_tag_article              | float64 | int64    |
| ancestor2_tag_aside                | float64 | int64    |
| ancestor2_tag_body                 | float64 | int64    |
| ancestor2_tag_button               | float64 | int64    |
| ancestor2_tag_center               | float64 | int64    |
| ancestor2_tag_dd                   | float64 | int64    |
| ancestor2_tag_div                  | float64 | int64    |
| ancestor2_tag_dl                   | float64 | int64    |
| ancestor2_tag_dt                   | float64 | int64    |
| ancestor2_tag_fieldset             | float64 | int64    |
| ancestor2_tag_figcaption           | float64 | int64    |
| ancestor2_tag_figure               | float64 | int64    |
| ancestor2_tag_footer               | float64 | int64    |
| ancestor2_tag_form                 | float64 | int64    |
| ancestor2_tag_g                    | float64 | int64    |
| ancestor2_tag_h1                   | float64 | int64    |
| ancestor2_tag_h2                   | float64 | int64    |
| ancestor2_tag_h3                   | float64 | int64    |
| ancestor2_tag_h4                   | float64 | int64    |
| ancestor2_tag_h5                   | float64 | int64    |
| ancestor2_tag_header               | float64 | int64    |
| ancestor2_tag_html                 | float64 | int64    |
| ancestor2_tag_i                    | float64 | int64    |
| ancestor2_tag_ins                  | float64 | int64    |
| ancestor2_tag_label                | float64 | int64    |
| ancestor2_tag_li                   | float64 | int64    |
| ancestor2_tag_main                 | float64 | int64    |
| ancestor2_tag_mktz-div             | float64 | int64    |
| ancestor2_tag_nav                  | float64 | int64    |
| ancestor2_tag_noindex              | float64 | int64    |
| ancestor2_tag_ol                   | float64 | int64    |
| ancestor2_tag_omni-container       | float64 | int64    |
| ancestor2_tag_omni-interact        | float64 | int64    |
| ancestor2_tag_omni-ribb-cent       | float64 | int64    |
| ancestor2_tag_p                    | float64 | int64    |
| ancestor2_tag_querylogoperation    | float64 | int64    |
| ancestor2_tag_section              | float64 | int64    |
| ancestor2_tag_select               | float64 | int64    |
| ancestor2_tag_small                | float64 | int64    |
| ancestor2_tag_span                 | float64 | int64    |
| ancestor2_tag_string               | float64 | int64    |
| ancestor2_tag_strong               | float64 | int64    |
| ancestor2_tag_svg                  | float64 | int64    |
| ancestor2_tag_table                | float64 | int64    |
| ancestor2_tag_tbody                | float64 | int64    |
| ancestor2_tag_td                   | float64 | int64    |
| ancestor2_tag_thead                | float64 | int64    |
| ancestor2_tag_tr                   | float64 | int64    |
| ancestor2_tag_ul                   | float64 | int64    |
| ancestor3_depth                    | float64 | int64    |
| ancestor3_no_children              | float64 | int64    |
| ancestor3_no_classes               | float64 | int64    |
| ancestor3_sibling_pos              | float64 | int64    |
| ancestor3_tag_                     | float64 | int64    |
| ancestor3_tag_a                    | float64 | int64    |
| ancestor3_tag_address              | float64 | int64    |
| ancestor3_tag_article              | float64 | int64    |
| ancestor3_tag_aside                | float64 | int64    |
| ancestor3_tag_body                 | float64 | int64    |
| ancestor3_tag_button               | float64 | int64    |
| ancestor3_tag_center               | float64 | int64    |
| ancestor3_tag_dd                   | float64 | int64    |
| ancestor3_tag_div                  | float64 | int64    |
| ancestor3_tag_dl                   | float64 | int64    |
| ancestor3_tag_fieldset             | float64 | int64    |
| ancestor3_tag_figcaption           | float64 | int64    |
| ancestor3_tag_figure               | float64 | int64    |
| ancestor3_tag_footer               | float64 | int64    |
| ancestor3_tag_form                 | float64 | int64    |
| ancestor3_tag_h2                   | float64 | int64    |
| ancestor3_tag_h3                   | float64 | int64    |
| ancestor3_tag_header               | float64 | int64    |
| ancestor3_tag_html                 | float64 | int64    |
| ancestor3_tag_ins                  | float64 | int64    |
| ancestor3_tag_label                | float64 | int64    |
| ancestor3_tag_li                   | float64 | int64    |
| ancestor3_tag_main                 | float64 | int64    |
| ancestor3_tag_mktz-div             | float64 | int64    |
| ancestor3_tag_nav                  | float64 | int64    |
| ancestor3_tag_noindex              | float64 | int64    |
| ancestor3_tag_ol                   | float64 | int64    |
| ancestor3_tag_omni-container       | float64 | int64    |
| ancestor3_tag_omni-ribb-cent       | float64 | int64    |
| ancestor3_tag_p                    | float64 | int64    |
| ancestor3_tag_querylogoperation    | float64 | int64    |
| ancestor3_tag_section              | float64 | int64    |
| ancestor3_tag_span                 | float64 | int64    |
| ancestor3_tag_string               | float64 | int64    |
| ancestor3_tag_svg                  | float64 | int64    |
| ancestor3_tag_table                | float64 | int64    |
| ancestor3_tag_tbody                | float64 | int64    |
| ancestor3_tag_td                   | float64 | int64    |
| ancestor3_tag_thead                | float64 | int64    |
| ancestor3_tag_tr                   | float64 | int64    |
| ancestor3_tag_ul                   | float64 | int64    |
| ancestor4_depth                    | float64 | int64    |
| ancestor4_no_children              | float64 | int64    |
| ancestor4_no_classes               | float64 | int64    |
| ancestor4_sibling_pos              | float64 | int64    |
| ancestor4_tag_                     | float64 | int64    |
| ancestor4_tag_a                    | float64 | int64    |
| ancestor4_tag_article              | float64 | int64    |
| ancestor4_tag_aside                | float64 | int64    |
| ancestor4_tag_body                 | float64 | int64    |
| ancestor4_tag_button               | float64 | int64    |
| ancestor4_tag_center               | float64 | int64    |
| ancestor4_tag_dd                   | float64 | int64    |
| ancestor4_tag_div                  | float64 | int64    |
| ancestor4_tag_dl                   | float64 | int64    |
| ancestor4_tag_fieldset             | float64 | int64    |
| ancestor4_tag_figcaption           | float64 | int64    |
| ancestor4_tag_figure               | float64 | int64    |
| ancestor4_tag_footer               | float64 | int64    |
| ancestor4_tag_form                 | float64 | int64    |
| ancestor4_tag_h3                   | float64 | int64    |
| ancestor4_tag_header               | float64 | int64    |
| ancestor4_tag_html                 | float64 | int64    |
| ancestor4_tag_label                | float64 | int64    |
| ancestor4_tag_li                   | float64 | int64    |
| ancestor4_tag_main                 | float64 | int64    |
| ancestor4_tag_mktz-div             | float64 | int64    |
| ancestor4_tag_nav                  | float64 | int64    |
| ancestor4_tag_noindex              | float64 | int64    |
| ancestor4_tag_ol                   | float64 | int64    |
| ancestor4_tag_omni-container       | float64 | int64    |
| ancestor4_tag_p                    | float64 | int64    |
| ancestor4_tag_querylogoperation    | float64 | int64    |
| ancestor4_tag_section              | float64 | int64    |
| ancestor4_tag_span                 | float64 | int64    |
| ancestor4_tag_table                | float64 | int64    |
| ancestor4_tag_tbody                | float64 | int64    |
| ancestor4_tag_td                   | float64 | int64    |
| ancestor4_tag_tr                   | float64 | int64    |
| ancestor4_tag_ul                   | float64 | int64    |
| ancestor5_depth                    | float64 | int64    |
| ancestor5_no_children              | float64 | int64    |
| ancestor5_no_classes               | float64 | int64    |
| ancestor5_sibling_pos              | float64 | int64    |
| ancestor5_tag_                     | float64 | int64    |
| ancestor5_tag_a                    | float64 | int64    |
| ancestor5_tag_article              | float64 | int64    |
| ancestor5_tag_aside                | float64 | int64    |
| ancestor5_tag_body                 | float64 | int64    |
| ancestor5_tag_button               | float64 | int64    |
| ancestor5_tag_center               | float64 | int64    |
| ancestor5_tag_dd                   | float64 | int64    |
| ancestor5_tag_div                  | float64 | int64    |
| ancestor5_tag_dl                   | float64 | int64    |
| ancestor5_tag_fieldset             | float64 | int64    |
| ancestor5_tag_figcaption           | float64 | int64    |
| ancestor5_tag_figure               | float64 | int64    |
| ancestor5_tag_footer               | float64 | int64    |
| ancestor5_tag_form                 | float64 | int64    |
| ancestor5_tag_header               | float64 | int64    |
| ancestor5_tag_html                 | float64 | int64    |
| ancestor5_tag_label                | float64 | int64    |
| ancestor5_tag_li                   | float64 | int64    |
| ancestor5_tag_main                 | float64 | int64    |
| ancestor5_tag_mktz-div             | float64 | int64    |
| ancestor5_tag_nav                  | float64 | int64    |
| ancestor5_tag_noindex              | float64 | int64    |
| ancestor5_tag_ol                   | float64 | int64    |
| ancestor5_tag_p                    | float64 | int64    |
| ancestor5_tag_querylogoperation    | float64 | int64    |
| ancestor5_tag_section              | float64 | int64    |
| ancestor5_tag_span                 | float64 | int64    |
| ancestor5_tag_table                | float64 | int64    |
| ancestor5_tag_tbody                | float64 | int64    |
| ancestor5_tag_td                   | float64 | int64    |
| ancestor5_tag_tr                   | float64 | int64    |
| ancestor5_tag_ul                   | float64 | int64    |
| descendant1_no_nodes               | float64 | int64    |
| descendant1_tags_a                 | float64 | int64    |
| descendant1_tags_address           | float64 | int64    |
| descendant1_tags_animatetransform  | float64 | int64    |
| descendant1_tags_area              | float64 | int64    |
| descendant1_tags_article           | float64 | int64    |
| descendant1_tags_aside             | float64 | int64    |
| descendant1_tags_b                 | float64 | int64    |
| descendant1_tags_body              | float64 | int64    |
| descendant1_tags_br                | float64 | int64    |
| descendant1_tags_button            | float64 | int64    |
| descendant1_tags_canvas            | float64 | int64    |
| descendant1_tags_caption           | float64 | int64    |
| descendant1_tags_cat-game-intro    | float64 | int64    |
| descendant1_tags_center            | float64 | int64    |
| descendant1_tags_circle            | float64 | int64    |
| descendant1_tags_col               | float64 | int64    |
| descendant1_tags_colgroup          | float64 | int64    |
| descendant1_tags_dd                | float64 | int64    |
| descendant1_tags_defs              | float64 | int64    |
| descendant1_tags_del               | float64 | int64    |
| descendant1_tags_desc              | float64 | int64    |
| descendant1_tags_div               | float64 | int64    |
| descendant1_tags_dl                | float64 | int64    |
| descendant1_tags_dt                | float64 | int64    |
| descendant1_tags_em                | float64 | int64    |
| descendant1_tags_fieldset          | float64 | int64    |
| descendant1_tags_figcaption        | float64 | int64    |
| descendant1_tags_figure            | float64 | int64    |
| descendant1_tags_font              | float64 | int64    |
| descendant1_tags_footer            | float64 | int64    |
| descendant1_tags_form              | float64 | int64    |
| descendant1_tags_g                 | float64 | int64    |
| descendant1_tags_h1                | float64 | int64    |
| descendant1_tags_h2                | float64 | int64    |
| descendant1_tags_h3                | float64 | int64    |
| descendant1_tags_h4                | float64 | int64    |
| descendant1_tags_h5                | float64 | int64    |
| descendant1_tags_h6                | float64 | int64    |
| descendant1_tags_head              | float64 | int64    |
| descendant1_tags_header            | float64 | int64    |
| descendant1_tags_hr                | float64 | int64    |
| descendant1_tags_html_comment      | float64 | int64    |
| descendant1_tags_i                 | float64 | int64    |
| descendant1_tags_iframe            | float64 | int64    |
| descendant1_tags_img               | float64 | int64    |
| descendant1_tags_input             | float64 | int64    |
| descendant1_tags_ins               | float64 | int64    |
| descendant1_tags_label             | float64 | int64    |
| descendant1_tags_legend            | float64 | int64    |
| descendant1_tags_li                | float64 | int64    |
| descendant1_tags_like              | float64 | int64    |
| descendant1_tags_line              | float64 | int64    |
| descendant1_tags_link              | float64 | int64    |
| descendant1_tags_main              | float64 | int64    |
| descendant1_tags_map               | float64 | int64    |
| descendant1_tags_meta              | float64 | int64    |
| descendant1_tags_mktz-div          | float64 | int64    |
| descendant1_tags_nav               | float64 | int64    |
| descendant1_tags_noindex           | float64 | int64    |
| descendant1_tags_noscript          | float64 | int64    |
| descendant1_tags_object            | float64 | int64    |
| descendant1_tags_ol                | float64 | int64    |
| descendant1_tags_omni-container    | float64 | int64    |
| descendant1_tags_omni-interact     | float64 | int64    |
| descendant1_tags_omni-ribb-cent    | float64 | int64    |
| descendant1_tags_optgroup          | float64 | int64    |
| descendant1_tags_option            | float64 | int64    |
| descendant1_tags_p                 | float64 | int64    |
| descendant1_tags_param             | float64 | int64    |
| descendant1_tags_path              | float64 | int64    |
| descendant1_tags_pre               | float64 | int64    |
| descendant1_tags_querylogoperation | float64 | int64    |
| descendant1_tags_rect              | float64 | int64    |
| descendant1_tags_s                 | float64 | int64    |
| descendant1_tags_script            | float64 | int64    |
| descendant1_tags_section           | float64 | int64    |
| descendant1_tags_select            | float64 | int64    |
| descendant1_tags_small             | float64 | int64    |
| descendant1_tags_span              | float64 | int64    |
| descendant1_tags_string            | float64 | int64    |
| descendant1_tags_strong            | float64 | int64    |
| descendant1_tags_style             | float64 | int64    |
| descendant1_tags_sup               | float64 | int64    |
| descendant1_tags_svg               | float64 | int64    |
| descendant1_tags_table             | float64 | int64    |
| descendant1_tags_tbody             | float64 | int64    |
| descendant1_tags_td                | float64 | int64    |
| descendant1_tags_text              | float64 | int64    |
| descendant1_tags_textarea          | float64 | int64    |
| descendant1_tags_th                | float64 | int64    |
| descendant1_tags_thead             | float64 | int64    |
| descendant1_tags_time              | float64 | int64    |
| descendant1_tags_title             | float64 | int64    |
| descendant1_tags_tr                | float64 | int64    |
| descendant1_tags_tspan             | float64 | int64    |
| descendant1_tags_u                 | float64 | int64    |
| descendant1_tags_ul                | float64 | int64    |
| descendant1_tags_video             | float64 | int64    |
| descendant2_no_nodes               | float64 | int64    |
| descendant3_no_nodes               | float64 | int64    |
| descendant4_no_nodes               | float64 | int64    |
| descendant5_no_nodes               | float64 | int64    |
| detail_description_label           | float64 | int64    |
| detail_image_label                 | float64 | int64    |
| detail_price_label                 | float64 | int64    |
| detail_title_label                 | float64 | int64    |
| has_id                             | float64 | int64    |
| has_text                           | float64 | int64    |
| list_image_label                   | float64 | int64    |
| list_price_label                   | float64 | int64    |
| list_title_label                   | float64 | int64    |
| no_children                        | float64 | int64    |
| no_classes                         | float64 | int64    |
| tag_a                              | float64 | int64    |
| tag_address                        | float64 | int64    |
| tag_animatetransform               | float64 | int64    |
| tag_area                           | float64 | int64    |
| tag_article                        | float64 | int64    |
| tag_aside                          | float64 | int64    |
| tag_b                              | float64 | int64    |
| tag_body                           | float64 | int64    |
| tag_br                             | float64 | int64    |
| tag_button                         | float64 | int64    |
| tag_canvas                         | float64 | int64    |
| tag_caption                        | float64 | int64    |
| tag_cat-game-intro                 | float64 | int64    |
| tag_center                         | float64 | int64    |
| tag_circle                         | float64 | int64    |
| tag_col                            | float64 | int64    |
| tag_colgroup                       | float64 | int64    |
| tag_dd                             | float64 | int64    |
| tag_defs                           | float64 | int64    |
| tag_del                            | float64 | int64    |
| tag_desc                           | float64 | int64    |
| tag_div                            | float64 | int64    |
| tag_dl                             | float64 | int64    |
| tag_dt                             | float64 | int64    |
| tag_em                             | float64 | int64    |
| tag_fieldset                       | float64 | int64    |
| tag_figcaption                     | float64 | int64    |
| tag_figure                         | float64 | int64    |
| tag_font                           | float64 | int64    |
| tag_footer                         | float64 | int64    |
| tag_form                           | float64 | int64    |
| tag_g                              | float64 | int64    |
| tag_h1                             | float64 | int64    |
| tag_h2                             | float64 | int64    |
| tag_h3                             | float64 | int64    |
| tag_h4                             | float64 | int64    |
| tag_h5                             | float64 | int64    |
| tag_h6                             | float64 | int64    |
| tag_head                           | float64 | int64    |
| tag_header                         | float64 | int64    |
| tag_hr                             | float64 | int64    |
| tag_html                           | float64 | int64    |
| tag_html_comment                   | float64 | int64    |
| tag_i                              | float64 | int64    |
| tag_iframe                         | float64 | int64    |
| tag_img                            | float64 | int64    |
| tag_input                          | float64 | int64    |
| tag_ins                            | float64 | int64    |
| tag_label                          | float64 | int64    |
| tag_legend                         | float64 | int64    |
| tag_li                             | float64 | int64    |
| tag_like                           | float64 | int64    |
| tag_line                           | float64 | int64    |
| tag_link                           | float64 | int64    |
| tag_main                           | float64 | int64    |
| tag_map                            | float64 | int64    |
| tag_meta                           | float64 | int64    |
| tag_mktz-div                       | float64 | int64    |
| tag_nav                            | float64 | int64    |
| tag_noindex                        | float64 | int64    |
| tag_noscript                       | float64 | int64    |
| tag_object                         | float64 | int64    |
| tag_ol                             | float64 | int64    |
| tag_omni-container                 | float64 | int64    |
| tag_omni-interact                  | float64 | int64    |
| tag_omni-ribb-cent                 | float64 | int64    |
| tag_optgroup                       | float64 | int64    |
| tag_option                         | float64 | int64    |
| tag_p                              | float64 | int64    |
| tag_param                          | float64 | int64    |
| tag_path                           | float64 | int64    |
| tag_pre                            | float64 | int64    |
| tag_querylogoperation              | float64 | int64    |
| tag_rect                           | float64 | int64    |
| tag_s                              | float64 | int64    |
| tag_script                         | float64 | int64    |
| tag_section                        | float64 | int64    |
| tag_select                         | float64 | int64    |
| tag_small                          | float64 | int64    |
| tag_span                           | float64 | int64    |
| tag_string                         | float64 | int64    |
| tag_strong                         | float64 | int64    |
| tag_style                          | float64 | int64    |
| tag_sup                            | float64 | int64    |
| tag_svg                            | float64 | int64    |
| tag_table                          | float64 | int64    |
| tag_tbody                          | float64 | int64    |
| tag_td                             | float64 | int64    |
| tag_text                           | float64 | int64    |
| tag_textarea                       | float64 | int64    |
| tag_th                             | float64 | int64    |
| tag_thead                          | float64 | int64    |
| tag_time                           | float64 | int64    |
| tag_title                          | float64 | int64    |
| tag_tr                             | float64 | int64    |
| tag_tspan                          | float64 | int64    |
| tag_u                              | float64 | int64    |
| tag_ul                             | float64 | int64    |
| tag_video                          | float64 | int64    |
+------------------------------------+---------+----------+

Usually this is due to dask's dtype inference failing, and
*may* be fixed by specifying dtypes manually by adding:

dtype={'ancestor1_depth': 'float64',
       'ancestor1_no_children': 'float64',
       'ancestor1_no_classes': 'float64',
       'ancestor1_sibling_pos': 'float64',
       'ancestor1_tag_': 'float64',
       'ancestor1_tag_a': 'float64',
       'ancestor1_tag_address': 'float64',
       'ancestor1_tag_article': 'float64',
       'ancestor1_tag_aside': 'float64',
       'ancestor1_tag_b': 'float64',
       'ancestor1_tag_body': 'float64',
       'ancestor1_tag_button': 'float64',
       'ancestor1_tag_caption': 'float64',
       'ancestor1_tag_center': 'float64',
       'ancestor1_tag_colgroup': 'float64',
       'ancestor1_tag_dd': 'float64',
       'ancestor1_tag_defs': 'float64',
       'ancestor1_tag_del': 'float64',
       'ancestor1_tag_div': 'float64',
       'ancestor1_tag_dl': 'float64',
       'ancestor1_tag_dt': 'float64',
       'ancestor1_tag_em': 'float64',
       'ancestor1_tag_fieldset': 'float64',
       'ancestor1_tag_figcaption': 'float64',
       'ancestor1_tag_figure': 'float64',
       'ancestor1_tag_font': 'float64',
       'ancestor1_tag_footer': 'float64',
       'ancestor1_tag_form': 'float64',
       'ancestor1_tag_g': 'float64',
       'ancestor1_tag_h1': 'float64',
       'ancestor1_tag_h2': 'float64',
       'ancestor1_tag_h3': 'float64',
       'ancestor1_tag_h4': 'float64',
       'ancestor1_tag_h5': 'float64',
       'ancestor1_tag_h6': 'float64',
       'ancestor1_tag_head': 'float64',
       'ancestor1_tag_header': 'float64',
       'ancestor1_tag_html': 'float64',
       'ancestor1_tag_i': 'float64',
       'ancestor1_tag_ins': 'float64',
       'ancestor1_tag_label': 'float64',
       'ancestor1_tag_legend': 'float64',
       'ancestor1_tag_li': 'float64',
       'ancestor1_tag_main': 'float64',
       'ancestor1_tag_map': 'float64',
       'ancestor1_tag_mktz-div': 'float64',
       'ancestor1_tag_nav': 'float64',
       'ancestor1_tag_noindex': 'float64',
       'ancestor1_tag_object': 'float64',
       'ancestor1_tag_ol': 'float64',
       'ancestor1_tag_omni-container': 'float64',
       'ancestor1_tag_omni-interact': 'float64',
       'ancestor1_tag_omni-ribb-cent': 'float64',
       'ancestor1_tag_optgroup': 'float64',
       'ancestor1_tag_p': 'float64',
       'ancestor1_tag_path': 'float64',
       'ancestor1_tag_querylogoperation': 'float64',
       'ancestor1_tag_s': 'float64',
       'ancestor1_tag_section': 'float64',
       'ancestor1_tag_select': 'float64',
       'ancestor1_tag_small': 'float64',
       'ancestor1_tag_span': 'float64',
       'ancestor1_tag_string': 'float64',
       'ancestor1_tag_strong': 'float64',
       'ancestor1_tag_svg': 'float64',
       'ancestor1_tag_table': 'float64',
       'ancestor1_tag_tbody': 'float64',
       'ancestor1_tag_td': 'float64',
       'ancestor1_tag_text': 'float64',
       'ancestor1_tag_th': 'float64',
       'ancestor1_tag_thead': 'float64',
       'ancestor1_tag_tr': 'float64',
       'ancestor1_tag_ul': 'float64',
       'ancestor2_depth': 'float64',
       'ancestor2_no_children': 'float64',
       'ancestor2_no_classes': 'float64',
       'ancestor2_sibling_pos': 'float64',
       'ancestor2_tag_': 'float64',
       'ancestor2_tag_a': 'float64',
       'ancestor2_tag_address': 'float64',
       'ancestor2_tag_article': 'float64',
       'ancestor2_tag_aside': 'float64',
       'ancestor2_tag_body': 'float64',
       'ancestor2_tag_button': 'float64',
       'ancestor2_tag_center': 'float64',
       'ancestor2_tag_dd': 'float64',
       'ancestor2_tag_div': 'float64',
       'ancestor2_tag_dl': 'float64',
       'ancestor2_tag_dt': 'float64',
       'ancestor2_tag_fieldset': 'float64',
       'ancestor2_tag_figcaption': 'float64',
       'ancestor2_tag_figure': 'float64',
       'ancestor2_tag_footer': 'float64',
       'ancestor2_tag_form': 'float64',
       'ancestor2_tag_g': 'float64',
       'ancestor2_tag_h1': 'float64',
       'ancestor2_tag_h2': 'float64',
       'ancestor2_tag_h3': 'float64',
       'ancestor2_tag_h4': 'float64',
       'ancestor2_tag_h5': 'float64',
       'ancestor2_tag_header': 'float64',
       'ancestor2_tag_html': 'float64',
       'ancestor2_tag_i': 'float64',
       'ancestor2_tag_ins': 'float64',
       'ancestor2_tag_label': 'float64',
       'ancestor2_tag_li': 'float64',
       'ancestor2_tag_main': 'float64',
       'ancestor2_tag_mktz-div': 'float64',
       'ancestor2_tag_nav': 'float64',
       'ancestor2_tag_noindex': 'float64',
       'ancestor2_tag_ol': 'float64',
       'ancestor2_tag_omni-container': 'float64',
       'ancestor2_tag_omni-interact': 'float64',
       'ancestor2_tag_omni-ribb-cent': 'float64',
       'ancestor2_tag_p': 'float64',
       'ancestor2_tag_querylogoperation': 'float64',
       'ancestor2_tag_section': 'float64',
       'ancestor2_tag_select': 'float64',
       'ancestor2_tag_small': 'float64',
       'ancestor2_tag_span': 'float64',
       'ancestor2_tag_string': 'float64',
       'ancestor2_tag_strong': 'float64',
       'ancestor2_tag_svg': 'float64',
       'ancestor2_tag_table': 'float64',
       'ancestor2_tag_tbody': 'float64',
       'ancestor2_tag_td': 'float64',
       'ancestor2_tag_thead': 'float64',
       'ancestor2_tag_tr': 'float64',
       'ancestor2_tag_ul': 'float64',
       'ancestor3_depth': 'float64',
       'ancestor3_no_children': 'float64',
       'ancestor3_no_classes': 'float64',
       'ancestor3_sibling_pos': 'float64',
       'ancestor3_tag_': 'float64',
       'ancestor3_tag_a': 'float64',
       'ancestor3_tag_address': 'float64',
       'ancestor3_tag_article': 'float64',
       'ancestor3_tag_aside': 'float64',
       'ancestor3_tag_body': 'float64',
       'ancestor3_tag_button': 'float64',
       'ancestor3_tag_center': 'float64',
       'ancestor3_tag_dd': 'float64',
       'ancestor3_tag_div': 'float64',
       'ancestor3_tag_dl': 'float64',
       'ancestor3_tag_fieldset': 'float64',
       'ancestor3_tag_figcaption': 'float64',
       'ancestor3_tag_figure': 'float64',
       'ancestor3_tag_footer': 'float64',
       'ancestor3_tag_form': 'float64',
       'ancestor3_tag_h2': 'float64',
       'ancestor3_tag_h3': 'float64',
       'ancestor3_tag_header': 'float64',
       'ancestor3_tag_html': 'float64',
       'ancestor3_tag_ins': 'float64',
       'ancestor3_tag_label': 'float64',
       'ancestor3_tag_li': 'float64',
       'ancestor3_tag_main': 'float64',
       'ancestor3_tag_mktz-div': 'float64',
       'ancestor3_tag_nav': 'float64',
       'ancestor3_tag_noindex': 'float64',
       'ancestor3_tag_ol': 'float64',
       'ancestor3_tag_omni-container': 'float64',
       'ancestor3_tag_omni-ribb-cent': 'float64',
       'ancestor3_tag_p': 'float64',
       'ancestor3_tag_querylogoperation': 'float64',
       'ancestor3_tag_section': 'float64',
       'ancestor3_tag_span': 'float64',
       'ancestor3_tag_string': 'float64',
       'ancestor3_tag_svg': 'float64',
       'ancestor3_tag_table': 'float64',
       'ancestor3_tag_tbody': 'float64',
       'ancestor3_tag_td': 'float64',
       'ancestor3_tag_thead': 'float64',
       'ancestor3_tag_tr': 'float64',
       'ancestor3_tag_ul': 'float64',
       'ancestor4_depth': 'float64',
       'ancestor4_no_children': 'float64',
       'ancestor4_no_classes': 'float64',
       'ancestor4_sibling_pos': 'float64',
       'ancestor4_tag_': 'float64',
       'ancestor4_tag_a': 'float64',
       'ancestor4_tag_article': 'float64',
       'ancestor4_tag_aside': 'float64',
       'ancestor4_tag_body': 'float64',
       'ancestor4_tag_button': 'float64',
       'ancestor4_tag_center': 'float64',
       'ancestor4_tag_dd': 'float64',
       'ancestor4_tag_div': 'float64',
       'ancestor4_tag_dl': 'float64',
       'ancestor4_tag_fieldset': 'float64',
       'ancestor4_tag_figcaption': 'float64',
       'ancestor4_tag_figure': 'float64',
       'ancestor4_tag_footer': 'float64',
       'ancestor4_tag_form': 'float64',
       'ancestor4_tag_h3': 'float64',
       'ancestor4_tag_header': 'float64',
       'ancestor4_tag_html': 'float64',
       'ancestor4_tag_label': 'float64',
       'ancestor4_tag_li': 'float64',
       'ancestor4_tag_main': 'float64',
       'ancestor4_tag_mktz-div': 'float64',
       'ancestor4_tag_nav': 'float64',
       'ancestor4_tag_noindex': 'float64',
       'ancestor4_tag_ol': 'float64',
       'ancestor4_tag_omni-container': 'float64',
       'ancestor4_tag_p': 'float64',
       'ancestor4_tag_querylogoperation': 'float64',
       'ancestor4_tag_section': 'float64',
       'ancestor4_tag_span': 'float64',
       'ancestor4_tag_table': 'float64',
       'ancestor4_tag_tbody': 'float64',
       'ancestor4_tag_td': 'float64',
       'ancestor4_tag_tr': 'float64',
       'ancestor4_tag_ul': 'float64',
       'ancestor5_depth': 'float64',
       'ancestor5_no_children': 'float64',
       'ancestor5_no_classes': 'float64',
       'ancestor5_sibling_pos': 'float64',
       'ancestor5_tag_': 'float64',
       'ancestor5_tag_a': 'float64',
       'ancestor5_tag_article': 'float64',
       'ancestor5_tag_aside': 'float64',
       'ancestor5_tag_body': 'float64',
       'ancestor5_tag_button': 'float64',
       'ancestor5_tag_center': 'float64',
       'ancestor5_tag_dd': 'float64',
       'ancestor5_tag_div': 'float64',
       'ancestor5_tag_dl': 'float64',
       'ancestor5_tag_fieldset': 'float64',
       'ancestor5_tag_figcaption': 'float64',
       'ancestor5_tag_figure': 'float64',
       'ancestor5_tag_footer': 'float64',
       'ancestor5_tag_form': 'float64',
       'ancestor5_tag_header': 'float64',
       'ancestor5_tag_html': 'float64',
       'ancestor5_tag_label': 'float64',
       'ancestor5_tag_li': 'float64',
       'ancestor5_tag_main': 'float64',
       'ancestor5_tag_mktz-div': 'float64',
       'ancestor5_tag_nav': 'float64',
       'ancestor5_tag_noindex': 'float64',
       'ancestor5_tag_ol': 'float64',
       'ancestor5_tag_p': 'float64',
       'ancestor5_tag_querylogoperation': 'float64',
       'ancestor5_tag_section': 'float64',
       'ancestor5_tag_span': 'float64',
       'ancestor5_tag_table': 'float64',
       'ancestor5_tag_tbody': 'float64',
       'ancestor5_tag_td': 'float64',
       'ancestor5_tag_tr': 'float64',
       'ancestor5_tag_ul': 'float64',
       'descendant1_no_nodes': 'float64',
       'descendant1_tags_a': 'float64',
       'descendant1_tags_address': 'float64',
       'descendant1_tags_animatetransform': 'float64',
       'descendant1_tags_area': 'float64',
       'descendant1_tags_article': 'float64',
       'descendant1_tags_aside': 'float64',
       'descendant1_tags_b': 'float64',
       'descendant1_tags_body': 'float64',
       'descendant1_tags_br': 'float64',
       'descendant1_tags_button': 'float64',
       'descendant1_tags_canvas': 'float64',
       'descendant1_tags_caption': 'float64',
       'descendant1_tags_cat-game-intro': 'float64',
       'descendant1_tags_center': 'float64',
       'descendant1_tags_circle': 'float64',
       'descendant1_tags_col': 'float64',
       'descendant1_tags_colgroup': 'float64',
       'descendant1_tags_dd': 'float64',
       'descendant1_tags_defs': 'float64',
       'descendant1_tags_del': 'float64',
       'descendant1_tags_desc': 'float64',
       'descendant1_tags_div': 'float64',
       'descendant1_tags_dl': 'float64',
       'descendant1_tags_dt': 'float64',
       'descendant1_tags_em': 'float64',
       'descendant1_tags_fieldset': 'float64',
       'descendant1_tags_figcaption': 'float64',
       'descendant1_tags_figure': 'float64',
       'descendant1_tags_font': 'float64',
       'descendant1_tags_footer': 'float64',
       'descendant1_tags_form': 'float64',
       'descendant1_tags_g': 'float64',
       'descendant1_tags_h1': 'float64',
       'descendant1_tags_h2': 'float64',
       'descendant1_tags_h3': 'float64',
       'descendant1_tags_h4': 'float64',
       'descendant1_tags_h5': 'float64',
       'descendant1_tags_h6': 'float64',
       'descendant1_tags_head': 'float64',
       'descendant1_tags_header': 'float64',
       'descendant1_tags_hr': 'float64',
       'descendant1_tags_html_comment': 'float64',
       'descendant1_tags_i': 'float64',
       'descendant1_tags_iframe': 'float64',
       'descendant1_tags_img': 'float64',
       'descendant1_tags_input': 'float64',
       'descendant1_tags_ins': 'float64',
       'descendant1_tags_label': 'float64',
       'descendant1_tags_legend': 'float64',
       'descendant1_tags_li': 'float64',
       'descendant1_tags_like': 'float64',
       'descendant1_tags_line': 'float64',
       'descendant1_tags_link': 'float64',
       'descendant1_tags_main': 'float64',
       'descendant1_tags_map': 'float64',
       'descendant1_tags_meta': 'float64',
       'descendant1_tags_mktz-div': 'float64',
       'descendant1_tags_nav': 'float64',
       'descendant1_tags_noindex': 'float64',
       'descendant1_tags_noscript': 'float64',
       'descendant1_tags_object': 'float64',
       'descendant1_tags_ol': 'float64',
       'descendant1_tags_omni-container': 'float64',
       'descendant1_tags_omni-interact': 'float64',
       'descendant1_tags_omni-ribb-cent': 'float64',
       'descendant1_tags_optgroup': 'float64',
       'descendant1_tags_option': 'float64',
       'descendant1_tags_p': 'float64',
       'descendant1_tags_param': 'float64',
       'descendant1_tags_path': 'float64',
       'descendant1_tags_pre': 'float64',
       'descendant1_tags_querylogoperation': 'float64',
       'descendant1_tags_rect': 'float64',
       'descendant1_tags_s': 'float64',
       'descendant1_tags_script': 'float64',
       'descendant1_tags_section': 'float64',
       'descendant1_tags_select': 'float64',
       'descendant1_tags_small': 'float64',
       'descendant1_tags_span': 'float64',
       'descendant1_tags_string': 'float64',
       'descendant1_tags_strong': 'float64',
       'descendant1_tags_style': 'float64',
       'descendant1_tags_sup': 'float64',
       'descendant1_tags_svg': 'float64',
       'descendant1_tags_table': 'float64',
       'descendant1_tags_tbody': 'float64',
       'descendant1_tags_td': 'float64',
       'descendant1_tags_text': 'float64',
       'descendant1_tags_textarea': 'float64',
       'descendant1_tags_th': 'float64',
       'descendant1_tags_thead': 'float64',
       'descendant1_tags_time': 'float64',
       'descendant1_tags_title': 'float64',
       'descendant1_tags_tr': 'float64',
       'descendant1_tags_tspan': 'float64',
       'descendant1_tags_u': 'float64',
       'descendant1_tags_ul': 'float64',
       'descendant1_tags_video': 'float64',
       'descendant2_no_nodes': 'float64',
       'descendant3_no_nodes': 'float64',
       'descendant4_no_nodes': 'float64',
       'descendant5_no_nodes': 'float64',
       'detail_description_label': 'float64',
       'detail_image_label': 'float64',
       'detail_price_label': 'float64',
       'detail_title_label': 'float64',
       'has_id': 'float64',
       'has_text': 'float64',
       'list_image_label': 'float64',
       'list_price_label': 'float64',
       'list_title_label': 'float64',
       'no_children': 'float64',
       'no_classes': 'float64',
       'tag_a': 'float64',
       'tag_address': 'float64',
       'tag_animatetransform': 'float64',
       'tag_area': 'float64',
       'tag_article': 'float64',
       'tag_aside': 'float64',
       'tag_b': 'float64',
       'tag_body': 'float64',
       'tag_br': 'float64',
       'tag_button': 'float64',
       'tag_canvas': 'float64',
       'tag_caption': 'float64',
       'tag_cat-game-intro': 'float64',
       'tag_center': 'float64',
       'tag_circle': 'float64',
       'tag_col': 'float64',
       'tag_colgroup': 'float64',
       'tag_dd': 'float64',
       'tag_defs': 'float64',
       'tag_del': 'float64',
       'tag_desc': 'float64',
       'tag_div': 'float64',
       'tag_dl': 'float64',
       'tag_dt': 'float64',
       'tag_em': 'float64',
       'tag_fieldset': 'float64',
       'tag_figcaption': 'float64',
       'tag_figure': 'float64',
       'tag_font': 'float64',
       'tag_footer': 'float64',
       'tag_form': 'float64',
       'tag_g': 'float64',
       'tag_h1': 'float64',
       'tag_h2': 'float64',
       'tag_h3': 'float64',
       'tag_h4': 'float64',
       'tag_h5': 'float64',
       'tag_h6': 'float64',
       'tag_head': 'float64',
       'tag_header': 'float64',
       'tag_hr': 'float64',
       'tag_html': 'float64',
       'tag_html_comment': 'float64',
       'tag_i': 'float64',
       'tag_iframe': 'float64',
       'tag_img': 'float64',
       'tag_input': 'float64',
       'tag_ins': 'float64',
       'tag_label': 'float64',
       'tag_legend': 'float64',
       'tag_li': 'float64',
       'tag_like': 'float64',
       'tag_line': 'float64',
       'tag_link': 'float64',
       'tag_main': 'float64',
       'tag_map': 'float64',
       'tag_meta': 'float64',
       'tag_mktz-div': 'float64',
       'tag_nav': 'float64',
       'tag_noindex': 'float64',
       'tag_noscript': 'float64',
       'tag_object': 'float64',
       'tag_ol': 'float64',
       'tag_omni-container': 'float64',
       'tag_omni-interact': 'float64',
       'tag_omni-ribb-cent': 'float64',
       'tag_optgroup': 'float64',
       'tag_option': 'float64',
       'tag_p': 'float64',
       'tag_param': 'float64',
       'tag_path': 'float64',
       'tag_pre': 'float64',
       'tag_querylogoperation': 'float64',
       'tag_rect': 'float64',
       'tag_s': 'float64',
       'tag_script': 'float64',
       'tag_section': 'float64',
       'tag_select': 'float64',
       'tag_small': 'float64',
       'tag_span': 'float64',
       'tag_string': 'float64',
       'tag_strong': 'float64',
       'tag_style': 'float64',
       'tag_sup': 'float64',
       'tag_svg': 'float64',
       'tag_table': 'float64',
       'tag_tbody': 'float64',
       'tag_td': 'float64',
       'tag_text': 'float64',
       'tag_textarea': 'float64',
       'tag_th': 'float64',
       'tag_thead': 'float64',
       'tag_time': 'float64',
       'tag_title': 'float64',
       'tag_tr': 'float64',
       'tag_tspan': 'float64',
       'tag_u': 'float64',
       'tag_ul': 'float64',
       'tag_video': 'float64'}

to the call to `read_csv`/`read_table`.

Alternatively, provide `assume_missing=True` to interpret
all unspecified integer columns as floats.

### Train on website subset. Test on whole website

In [None]:
second_results_df = simple_model_experiment(map(lambda x: x[1], second_experiment_df.iterrows()), model_func=rf_eval, 
                                            experiment_name='second-random-forest', label_cols=label_cols)

### Train/test on whole website

In [None]:
third_results_df = simple_model_experiment(map(lambda x: x[1], third_experiment_df.iterrows()), model_func=rf_eval, 
                                           experiment_name='third-random-forest', label_cols=label_cols)

### Train on single website. Test on all

In [None]:
fourth_results_df = simple_model_experiment(map(lambda x: x[1], fourth_experiment_df.iterrows()), model_func=rf_eval, 
                                            experiment_name='fourth-random-forest', label_cols=label_cols)

### Train/test on all

In [None]:
fifth_results_df = simple_model_experiment(map(lambda x: x[1], fifth_experiment_df.iterrows()), model_func=rf_eval,
                                           experiment_name='fifth-random-forest', label_cols=label_cols)

## Saving the results
This notebook is too crowded for any proper analysis, and memory is running low because the experiments are fairly expensive. To work around both problems, we will save the experiment results here and analyze them in a separate notebook.

In [None]:
# collect the results of all five experiments into one frame and write it to disk
all_results = [first_results_df, second_results_df, third_results_df, fourth_results_df, fifth_results_df]
pd.concat(all_results, ignore_index=True).to_csv(EXPERIMENT_RESULTS_CSV)
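
As a rough sketch of how the follow-up notebook could read the results back, the example below round-trips two toy result frames through a temporary CSV. The column names `experiment` and `score`, the values, and the temporary path are illustrative assumptions, not the actual output of `simple_model_experiment`. Because `to_csv` is called without `index=False`, the row index is written as an unnamed first column, so the reader passes `index_col=0` to restore it.

```python
import os
import tempfile

import pandas as pd

# Toy result frames (hypothetical columns and values, for illustration only).
first = pd.DataFrame({'experiment': ['first-random-forest'], 'score': [0.5]})
second = pd.DataFrame({'experiment': ['second-random-forest'], 'score': [0.75]})

# Write to a temporary file rather than the real EXPERIMENT_RESULTS_CSV.
results_csv = os.path.join(tempfile.mkdtemp(), 'results.csv')
pd.concat([first, second], ignore_index=True).to_csv(results_csv)

# In the analysis notebook: index_col=0 picks up the index column that
# to_csv wrote, instead of exposing it as an 'Unnamed: 0' column.
loaded = pd.read_csv(results_csv, index_col=0)
print(loaded)
```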