# Using Sherlock out-of-the-box
This notebook shows how to predict the semantic type of a given table column.
The workflow has three steps:
- Extract features from the column.
- Initialize Sherlock.
- Predict the semantic type from the column's feature representation.

In [1]:
import numpy as np
import pandas as pd
import pyarrow as pa

from sherlock import helpers
from sherlock.deploy.model import SherlockModel
from sherlock.functional import extract_features_to_csv
from sherlock.features.paragraph_vectors import initialise_pretrained_model, initialise_nltk
from sherlock.features.preprocessing import (
    extract_features,
    convert_string_lists_to_lists,
    prepare_feature_extraction,
    load_parquet_values,
)
from sherlock.features.word_embeddings import initialise_word_embeddings

In [2]:
%env PYTHONHASHSEED

UsageError: Environment does not have key: PYTHONHASHSEED
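
The error above only means that `PYTHONHASHSEED` is not set in this environment. Python randomizes string hashing per process unless this variable is set before the interpreter starts, so the check is presumably there for reproducibility; that motive is an assumption, since the notebook does not state it. A minimal sketch for inspecting the variable from plain Python (the helper name `hash_seed_status` is hypothetical and not part of Sherlock):

```python
import os


def hash_seed_status(environ=os.environ):
    """Describe how PYTHONHASHSEED is configured in the given mapping."""
    value = environ.get("PYTHONHASHSEED")
    if value is None or value == "random":
        # Default behaviour: a fresh random hash seed for every process.
        return "unset (string hashing is randomized per process)"
    if value == "0":
        return "hash randomization disabled"
    return f"fixed to {value}"


print(hash_seed_status())
```

Setting the variable from inside a running interpreter does not change the hashing of that process; it has to be exported before Python (or the notebook kernel) starts.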


## Extract features

In [8]:
# helpers.download_data()

In [3]:
prepare_feature_extraction()
initialise_word_embeddings()
initialise_pretrained_model(400)
initialise_nltk()

Preparing feature extraction by downloading 4 files:
        
 ../sherlock/features/glove.6B.50d.txt, 
 ../sherlock/features/par_vec_trained_400.pkl.docvecs.vectors_docs.npy,
        
 ../sherlock/features/par_vec_trained_400.pkl.trainables.syn1neg.npy, and 
 ../sherlock/features/par_vec_trained_400.pkl.wv.vectors.npy.
        
All files for extracting word and paragraph embeddings are present.
Initialising word embeddings
Initialise Word Embeddings process took 0:00:05.607905 seconds.
Initialise Doc2Vec Model, 400 dim, process took 0:00:02.443327 seconds. (filename = ../sherlock/features/par_vec_trained_400.pkl)
Initialised NLTK, process took 0:00:00.181374 seconds.


[nltk_data] Downloading package punkt to /Users/madelon/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/madelon/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [35]:
data = pd.Series(
    [
        ["Jane Smith", "Lute Ahorn", "Anna James"],
        ["Amsterdam", "Haarlem", "Zwolle"],
    ],
    name="values"
)

In [36]:
data

0    [Jane Smith, Lute Ahorn, Anna James]
1            [Amsterdam, Haarlem, Zwolle]
Name: values, dtype: object
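
Each element of the series is the list of values of one table column. If the data starts out as a regular `DataFrame`, it can be reshaped into this layout. The sketch below assumes string-like cell values as in the example above; the helper name `columns_to_series` and the column names `name` and `city` are hypothetical.

```python
import pandas as pd


def columns_to_series(frame: pd.DataFrame) -> pd.Series:
    """Return a Series holding one list of string values per DataFrame column."""
    return pd.Series(
        [frame[col].astype(str).tolist() for col in frame.columns],
        name="values",
    )


# Hypothetical table holding the same values as the example above.
table = pd.DataFrame(
    {
        "name": ["Jane Smith", "Lute Ahorn", "Anna James"],
        "city": ["Amsterdam", "Haarlem", "Zwolle"],
    }
)

columns_as_lists = columns_to_series(table)
print(columns_as_lists)
```

The order of the resulting series follows `table.columns`, which makes it possible to match results back to column names later.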

In [37]:
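# Write the extracted features to a temporary CSV file, one row per input column.
extract_features(
    "../temporary.csv",
    data
)
# Load the features back as a float32 DataFrame.
feature_vector = pd.read_csv("../temporary.csv", dtype=np.float32)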

Extracting Features: 100%|██████████| 2/2 [00:00<00:00, 62.37it/s]


Exporting 1588 column features


In [38]:
feature_vector

| | n_[0]-agg-any | n_[0]-agg-all | n_[0]-agg-mean | n_[0]-agg-var | n_[0]-agg-min | n_[0]-agg-max | n_[0]-agg-median | n_[0]-agg-sum | n_[0]-agg-kurtosis | n_[0]-agg-skewness | ... | par_vec_390 | par_vec_391 | par_vec_392 | par_vec_393 | par_vec_394 | par_vec_395 | par_vec_396 | par_vec_397 | par_vec_398 | par_vec_399 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -3.0 | 0.0 | ... | -0.115819 | 0.023961 | -0.130739 | 0.006393 | -0.135118 | -0.071956 | -0.051051 | -0.068307 | 0.087342 | -0.145716 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -3.0 | 0.0 | ... | -0.054351 | 0.02365 | -0.165681 | -0.016137 | -0.059402 | 0.008454 | -0.044624 | 0.02516 | 0.037831 | -0.086235 |
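
Before predicting, it can help to confirm that the feature frame has one row per input column and the 1588 features reported in the export log. The sketch below is a sanity check under those assumptions; the helper `summarise_features` is hypothetical, and a zero-filled stand-in frame replaces the real `feature_vector` so the snippet runs on its own.

```python
import numpy as np
import pandas as pd


def summarise_features(features: pd.DataFrame) -> dict:
    """Return the row count, column count, and number of missing cells."""
    return {
        "rows": int(features.shape[0]),
        "columns": int(features.shape[1]),
        "missing": int(features.isna().to_numpy().sum()),
    }


# Stand-in for `feature_vector`: 2 table columns x 1588 features, all zeros.
demo_features = pd.DataFrame(np.zeros((2, 1588), dtype=np.float32))

summary = summarise_features(demo_features)
print(summary)
```

A nonzero `missing` count would be worth investigating before passing the frame to the model.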


## Initialize Sherlock

In [39]:
model = SherlockModel()
model.initialize_model_from_json(with_weights=True)

## Predict semantic type for column

In [40]:
predicted_labels = model.predict(feature_vector, "sherlock")

In [41]:
predicted_labels

array(['creator', 'city'], dtype=object)
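
The predictions come back in the same order as the input columns, so they can be paired with column names. A minimal sketch, assuming the hypothetical column names `name` and `city` and using the labels copied from the output above as a plain list:

```python
def label_columns(column_names, predicted_labels):
    """Map each column name to its predicted semantic type."""
    if len(column_names) != len(predicted_labels):
        raise ValueError("expected exactly one predicted label per column")
    return dict(zip(column_names, predicted_labels))


# Hypothetical column names; the labels are the ones shown in the output above.
column_names = ["name", "city"]
predicted = ["creator", "city"]

labelled = label_columns(column_names, predicted)
print(labelled)
```

The explicit length check guards against silently dropping columns, which `zip` would otherwise do when the two sequences differ in length.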