# Using Sherlock out-of-the-box
This notebook shows how to predict the semantic type of a given table column.
The steps are:
- Extract features from a column.
- Initialize Sherlock.
- Make a prediction for the feature representation of the column.

In [1]:
import numpy as np
import pandas as pd
import pyarrow as pa

from sherlock import helpers
from sherlock.deploy.model import SherlockModel
from sherlock.functional import extract_features_to_csv
from sherlock.features.paragraph_vectors import initialise_pretrained_model, initialise_nltk
from sherlock.features.preprocessing import (
    extract_features,
    convert_string_lists_to_lists,
    prepare_feature_extraction,
    load_parquet_values,
)
from sherlock.features.word_embeddings import initialise_word_embeddings

In [3]:
%env PYTHONHASHSEED

## Extract features

In [3]:
prepare_feature_extraction()
initialise_word_embeddings()
initialise_pretrained_model(400)
initialise_nltk()

/Users/madelon/Documents/PhD/Sherlock/sherlock-project/notebooks
Preparing feature extraction by downloading 3 files:
        
 ../sherlock/features/glove.6B.50d.txt, 
 ../sherlock/features/par_vec_trained_400.pkl.docvecs.vectors_docs.npy and 
 ../sherlock/features/par_vec_trained_400.pkl.trainables.syn1neg.npy.
        
All files for extracting word and paragraph embeddings are present.
Initialising word embeddings
Initialise Word Embeddings process took 0:00:04.517863 seconds.
Initialise Doc2Vec Model, 400 dim, process took 0:00:04.135534 seconds. (filename = ../sherlock/features/par_vec_trained_400.pkl)
Initialised NLTK, process took 0:00:00.322240 seconds.


[nltk_data] Downloading package punkt to /Users/madelon/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/madelon/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [4]:
data = pd.Series([["madelon hulsebos", "lute something", "anna anotherhing"], ["Binnenkant 2", "Binnenkant 3 1011BH"]], name="values")

In [5]:
data

0    [madelon hulsebos, lute something, anna anothe...
1                  [Binnenkant 2, Binnenkant 3 1011BH]
Name: values, dtype: object

In [7]:
pd.DataFrame(data).astype(str).to_parquet("../data/data/raw/temporary.parquet")

### Issue 1: `extract_features`
The code below yields incorrect features because the `,` character, which is itself one of the character features, is read as a separator by `read_csv`.

In [8]:
extract_features(
    "../temporary.csv",
    data
)
feature_vector = pd.read_csv("../temporary.csv", dtype=np.float32)

Extracting Features: 100%|██████████| 2/2 [00:00<00:00, 96.47it/s]

Exporting 1588 column features





In [9]:
feature_vector

Unnamed: 0,n_[0]-agg-any,n_[0]-agg-all,n_[0]-agg-mean,n_[0]-agg-var,n_[0]-agg-min,n_[0]-agg-max,n_[0]-agg-median,n_[0]-agg-sum,n_[0]-agg-kurtosis,n_[0]-agg-skewness,...,par_vec_390,par_vec_391,par_vec_392,par_vec_393,par_vec_394,par_vec_395,par_vec_396,par_vec_397,par_vec_398,par_vec_399
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-3.0,0.0,...,-0.04133,0.003258,-0.026008,-0.01572,-0.053032,0.023043,0.018317,0.006836,0.009824,-0.001796
1,1.0,0.0,0.5,0.25,0.0,1.0,0.5,1.0,-2.0,0.0,...,5.5e-05,0.000415,0.001177,0.000527,-0.000168,7.3e-05,-0.000643,4.6e-05,-0.000196,-0.00031


In [10]:
feature_vector.shape

(2, 1588)

In [11]:
feature_vector.columns.str.startswith('n_[,').sum()

10
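
The separator problem described under Issue 1 can be illustrated without Sherlock. The sketch below uses only the standard-library `csv` module. The three feature names are hypothetical stand-ins modelled on the `n_[0]-agg-any` pattern shown above, and the "naive" writer is an assumption made for illustration, not a claim about how `extract_features` writes its file.

```python
import csv
import io

# Hypothetical feature names; the middle one contains a comma,
# as the per-character features for "," would.
names = ["n_[0]-agg-any", "n_[,]-agg-any", "n_[a]-agg-any"]
values = [0.0, 1.0, 0.5]

# Naive writer: joins fields with commas and applies no quoting.
naive = ",".join(names) + "\n" + ",".join(str(v) for v in values) + "\n"
naive_header = next(csv.reader(io.StringIO(naive)))

# Quoting writer: csv.writer wraps any field containing the delimiter in quotes.
buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL)
writer.writerow(names)
writer.writerow(values)
quoted_header = next(csv.reader(io.StringIO(buffer.getvalue())))

print(len(naive_header), naive_header)    # the comma name is split in two
print(len(quoted_header), quoted_header)  # the comma name survives intact
```

With the unquoted header the reader sees one column more than there are values, so names and values no longer line up. Quoting the header on write avoids this.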

### Issue 2: `extract_features_to_csv`

In [12]:
values = load_parquet_values("../data/data/raw/temporary.parquet")

In [13]:
values[0]

<pyarrow.StringScalar: "['madelon hulsebos', 'lute something', 'anna anotherhing']">
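
Note that the value above is a single string holding the printed form of a list, because the column was cast with `astype(str)` before being written to Parquet. A minimal sketch of that round trip, using only the standard library, follows. `ast.literal_eval` is used here for illustration and is not necessarily how Sherlock's own `convert_string_lists_to_lists` works.

```python
import ast

column = ["madelon hulsebos", "lute something", "anna anotherhing"]

# Casting a list to str stores its repr, which is what ends up in the Parquet file.
as_text = str(column)

# literal_eval parses Python literals (and nothing else) back into objects.
restored = ast.literal_eval(as_text)

print(as_text)
print(restored == column)
```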

In [14]:
extract_features_to_csv("../data/data/processed/temporary.csv", values)

Starting ../data/data/processed/temporary.csv at 2022-02-20 14:36:40.942443. Rows=2, using 8 CPU cores
[
  [
    "['madelon hulsebos', 'lute something', 'anna anotherhing']",
    "['Binnenkant 2', 'Binnenkant 3 1011BH']"
  ]
]
['n_[0]-agg-any', 'n_[0]-agg-all', 'n_[0]-agg-mean', 'n_[0]-agg-var', 'n_[0]-agg-min', 'n_[0]-agg-max', 'n_[0]-agg-median', 'n_[0]-agg-sum', 'n_[0]-agg-kurtosis', 'n_[0]-agg-skewness', 'n_[1]-agg-any', 'n_[1]-agg-all', 'n_[1]-agg-mean', 'n_[1]-agg-var', 'n_[1]-agg-min', 'n_[1]-agg-max', 'n_[1]-agg-median', 'n_[1]-agg-sum', 'n_[1]-agg-kurtosis', 'n_[1]-agg-skewness', 'n_[2]-agg-any', 'n_[2]-agg-all', 'n_[2]-agg-mean', 'n_[2]-agg-var', 'n_[2]-agg-min', 'n_[2]-agg-max', 'n_[2]-agg-median', 'n_[2]-agg-sum', 'n_[2]-agg-kurtosis', 'n_[2]-agg-skewness', 'n_[3]-agg-any', 'n_[3]-agg-all', 'n_[3]-agg-mean', 'n_[3]-agg-var', 'n_[3]-agg-min', 'n_[3]-agg-max', 'n_[3]-agg-median', 'n_[3]-agg-sum', 'n_[3]-agg-kurtosis', 'n_[3]-agg-skewness', 'n_[4]-agg-any', 'n_[4]-agg-all', 'n

### Reproducing the problem with the original file (works in notebook `01-data-processing`)

In [15]:
values = load_parquet_values("../data/data/raw/test_values.parquet")

extract_features_to_csv("../data/data/processed/temporary.csv", values)

values = None

Starting ../data/data/processed/temporary.csv at 2022-02-11 08:56:21.898853. Rows=137353, using 8 CPU cores
[
  [
    "['Central Missouri', 'unattached', 'unattached', 'Kansas State University', 'unattached', 'North Dakota State', 'Nike']",
    "[95, 100, 95, 89, 84, 91, 88, 94, 75, 78, 90, 84, 90, 76, 93, 70, 80, 80, 82]",
    "['Katie Crews', 'Christian Hiraldo', 'Alex Estrada', 'Fredy Peltroche', 'Xavier Perez', 'Gustavo Larrosa', 'Jose Montano', 'Angel Cruz (7)', 'J Acosta']",
    "['Christian', 'Non-Christian', 'Unreported', 'Jewish', 'Atheists']",
    "['AAF-McQuay Canada Inc.', 'AAF-McQuay Canada Inc.', 'Ability Janitorial Services Ltd.', 'Acart Communications Inc.', 'Accu-lift Flooring Systems', 'Accurate Point Construction Ltd.  ', 'Acklands-Grainger Inc.', 'Acme Future Security Controls Inc.', 'ACMG Management Inc.', 'Advanced Chippers Technologies Inc.', 'Advanced Chippewa Technologies Inc.', 'Advanced Chippewa Technologies Inc.', 'Advanced Chippewa Technologies Inc.', 'Adva

None


Error: iterable expected, not NoneType
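
The message above has the same wording that Python's standard `csv` writer produces when `writerow` receives `None` instead of a sequence. Whether that is what happens inside `extract_features_to_csv` is an assumption. The sketch below only reproduces the message in isolation.

```python
import csv
import io

writer = csv.writer(io.StringIO())

try:
    # A row that is None rather than a list of values.
    writer.writerow(None)
    message = ""
except csv.Error as exc:
    message = str(exc)

print(message)
```

If this is indeed the cause, the `None` printed just before the error would be the row that failed to produce features.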

In [15]:
feature_vector = pd.read_csv("../data/data/processed/temporary.csv")

## Initialize Sherlock

In [16]:
model = SherlockModel()
model.initialize_model_from_json(with_weights=True);

W0220 14:36:53.255482 4506066432 deprecation.py:506] From /Users/madelon/miniconda3/envs/sherlock-project/lib/python3.6/site-packages/tensorflow_core/python/ops/init_ops.py:97: calling Zeros.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
W0220 14:36:53.257615 4506066432 deprecation.py:506] From /Users/madelon/miniconda3/envs/sherlock-project/lib/python3.6/site-packages/tensorflow_core/python/ops/init_ops.py:97: calling Ones.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
W0220 14:36:53.263447 4506066432 deprecation.py:506] From /Users/madelon/miniconda3/envs/sherlock-project/lib/python3.6/site-packages/tensorflow_core/python/

## Predict semantic type for column


TODO: featurization with `extract_features` currently has a problem (see Issue 1 above).

This section will be completed soon.

In [17]:
predicted_labels = model.predict(feature_vector, "sherlock")

In [18]:
predicted_labels

array(['album', 'publisher'], dtype=object)
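
To read the result, it can help to put each predicted label next to the column it belongs to. The sketch below copies the two input columns and the two labels from the cells above. It assumes that predictions are returned in the same order as the rows of `feature_vector`, which this notebook does not confirm.

```python
# Column values and labels copied from the cells above.
columns = [
    ["madelon hulsebos", "lute something", "anna anotherhing"],
    ["Binnenkant 2", "Binnenkant 3 1011BH"],
]
predicted = ["album", "publisher"]

# Assumption: predictions follow the order of the input rows.
paired = list(zip(predicted, columns))

for label, values in paired:
    print(f"{label}: {values}")
```

Neither label is an obvious match for a column of names or a column of addresses, which is arguably consistent with the featurization problem noted in the TODO.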