# Using Sherlock out-of-the-box
This notebook shows how to predict the semantic type of a table column.
The steps are:
- Extract features from a column.
- Initialize Sherlock.
- Make a prediction for the feature representation of the column.

In [1]:
import numpy as np
import pandas as pd
import pyarrow as pa

from sherlock import helpers
from sherlock.deploy.model import SherlockModel
from sherlock.functional import extract_features_to_csv
from sherlock.features.paragraph_vectors import initialise_pretrained_model, initialise_nltk
from sherlock.features.preprocessing import (
    extract_features,
    convert_string_lists_to_lists,
    prepare_feature_extraction,
    load_parquet_values
)
from sherlock.features.word_embeddings import initialise_word_embeddings

## Extract features

In [2]:
prepare_feature_extraction()
initialise_word_embeddings()
initialise_pretrained_model(400)
initialise_nltk()

Preparing feature extraction by downloading 2 files:
        
 ../sherlock/features/glove.6B.50d.txt and 
 ../sherlock/features/par_vec_trained_400.pkl.docvecs.vectors_docs.npy.
        
All files for extracting word and paragraph embeddings are present.
Initialising word embeddings
Initialise Word Embeddings process took 0:00:13.118153 seconds.
Initialise Doc2Vec Model, 400 dim, process took 0:00:05.229649 seconds. (filename = ../sherlock/features/par_vec_trained_400.pkl)
Initialised NLTK, process took 0:00:00.277687 seconds.


[nltk_data] Downloading package punkt to /Users/madelon/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/madelon/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [3]:
data = pd.Series([["madelon hulsebos", "lute something", "anna anotherhing"], ["Binnenkant 2", "Binnenkant 3 1011BH"]], name="values")

In [4]:
data

0    [madelon hulsebos, lute something, anna anothe...
1                  [Binnenkant 2, Binnenkant 3 1011BH]
Name: values, dtype: object

### Issue 1:  `extract_features`
The code below yields incorrect features: the feature names for the `,` character (for example `n_[,]-agg-any`) contain a comma, which `read_csv` interprets as a separator.

In [5]:
extract_features(
    "../temporary.csv",
    data
)
feature_vector = pd.read_csv("../temporary.csv", dtype=np.float32)

Extracting Features: 100%|██████████| 2/2 [00:00<00:00, 58.33it/s]


Exporting 1588 column features


In [6]:
feature_vector

Unnamed: 0,n_[0]-agg-any,n_[0]-agg-all,n_[0]-agg-mean,n_[0]-agg-var,n_[0]-agg-min,n_[0]-agg-max,n_[0]-agg-median,n_[0]-agg-sum,n_[0]-agg-kurtosis,n_[0]-agg-skewness,...,par_vec_390,par_vec_391,par_vec_392,par_vec_393,par_vec_394,par_vec_395,par_vec_396,par_vec_397,par_vec_398,par_vec_399
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-3.0,0.0,...,,,,,,,,,,
1,1.0,0.0,0.5,0.25,0.0,1.0,0.5,1.0,-2.0,0.0,...,,,,,,,,,,


In [7]:
feature_vector.shape

(2, 1598)

In [8]:
feature_vector.columns.str.startswith('n_[,').sum()

0

Problem: the DataFrame has 1598 columns instead of the 1588 exported features, because each of the ten `n_[,]-agg-*` column names is split in two at the comma when the CSV dump is read back.
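
The split can be reproduced without Sherlock. The sketch below uses only the standard-library `csv` module and a three-column header built from names that appear in this notebook. It assumes the header was written by joining the names with commas and no quoting; that is consistent with the ten extra columns, but it has not been checked against the Sherlock source.

```python
import csv
import io

columns = ["n_[0]-agg-any", "n_[,]-agg-any", "par_vec_399"]

# Naive header: joining with "," leaves the comma inside the name unquoted.
naive_header = ",".join(columns)
naive_parsed = next(csv.reader(io.StringIO(naive_header)))

# Quoted header: csv.writer wraps any field containing the delimiter in quotes.
buffer = io.StringIO()
csv.writer(buffer).writerow(columns)
quoted_parsed = next(csv.reader(io.StringIO(buffer.getvalue())))

print(len(naive_parsed), naive_parsed)
print(len(quoted_parsed), quoted_parsed)
```

The naive header parses into four names (`n_[` and `]-agg-any` replace the original), while the quoted header round-trips to the original three. Ten such splits account for 1588 + 10 = 1598 columns.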

### Issue 2: `extract_features_to_csv`

In [39]:
values = load_parquet_values("../temporary.parquet")

In [40]:
values

<pyarrow.lib.ChunkedArray object at 0x7fababa9db48>
[
  [
    "['madelon hulsebos', 'lute something', 'anna anotherhing']",
    "['Binnenkant 2', 'Binnenkant 3 1011BH']"
  ]
]
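
Each element of `values` is a string that looks like a Python list, not an actual list. The notebook imports `convert_string_lists_to_lists`, whose name suggests it handles this kind of input, but never calls it; whether `extract_features_to_csv` parses these strings itself is not shown here. As a library-independent sketch, `ast.literal_eval` from the standard library turns such a string into a real list.

```python
import ast

raw_values = [
    "['madelon hulsebos', 'lute something', 'anna anotherhing']",
    "['Binnenkant 2', 'Binnenkant 3 1011BH']",
]

# literal_eval parses Python literals only; unlike eval, it does not run code.
parsed_values = [ast.literal_eval(raw) for raw in raw_values]

for cells in parsed_values:
    print(type(cells).__name__, len(cells), cells)
```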

In [41]:
extract_features_to_csv("../temporary.csv", values)

Starting ../temporary.csv at 2022-02-10 21:32:15.638798. Rows=2, using 8 CPU cores


Error: iterable expected, not NoneType
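
The message `iterable expected, not NoneType` matches what the standard-library `csv` writer reports when `writerow` receives `None` instead of a sequence. Assuming Sherlock writes rows through `csv.writer` (not verified here), one possible cause is that feature extraction returned `None` for a row. The sketch below only reproduces the message; `write_row` is a hypothetical helper, not part of Sherlock.

```python
import csv
import io


def write_row(row):
    """Try to write one row; return the error message, or None on success."""
    writer = csv.writer(io.StringIO())
    try:
        writer.writerow(row)
    except csv.Error as exc:
        return str(exc)
    return None


print(write_row([0.0, 1.0]))  # a normal row is accepted
print(write_row(None))        # a missing row triggers the csv error
```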

In [44]:
feature_vector = pd.read_csv("../temporary.csv")

## Initialize Sherlock

In [27]:
model = SherlockModel()
model.initialize_model_from_json(with_weights=True);

## Predict semantic type for column


TODO: featurization with `extract_features` is currently broken (see Issue 1), so the prediction below fails.

This section will be completed once featurization is fixed.

In [30]:
predicted_labels = model.predict(feature_vector, "sherlock")

KeyError: "['n_[,]-agg-all', 'n_[,]-agg-median', 'n_[,]-agg-max', 'n_[,]-agg-sum', 'n_[,]-agg-any', 'n_[,]-agg-mean', 'n_[,]-agg-kurtosis', 'n_[,]-agg-skewness', 'n_[,]-agg-min', 'n_[,]-agg-var'] not in index"
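
The `KeyError` names the ten `n_[,]-agg-*` columns that the model selects but the DataFrame lacks. A check like the one below, run on the column names before calling `model.predict`, would surface the mismatch earlier. The function `find_missing_columns` and both example lists are hypothetical; how Sherlock stores its expected feature names is not shown in this notebook.

```python
def find_missing_columns(expected, actual):
    """Return the expected column names absent from actual, in expected order."""
    present = set(actual)
    return [name for name in expected if name not in present]


expected_columns = ["n_[0]-agg-any", "n_[,]-agg-any", "n_[,]-agg-all"]

# Illustrative names after an unquoted CSV round trip; pandas renames the
# repeated "n_[" fragment by appending a suffix such as ".1".
actual_columns = ["n_[0]-agg-any", "n_[", "]-agg-any", "n_[.1", "]-agg-all"]

missing = find_missing_columns(expected_columns, actual_columns)
print(missing)
```

With a DataFrame, `actual` would be `feature_vector.columns`.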