# Using Sherlock out-of-the-box
This notebook shows how to predict the semantic type of a table column with Sherlock.
The steps are:
- Download the files for word-embedding and paragraph-vector feature extraction (needed only once; this notebook copies them from Google Drive instead) and initialize the feature extraction models.
- Extract features from table columns.
- Initialize Sherlock.
- Make a prediction for the feature representation of the column.

In [1]:
!git clone https://github.com/penfever/sherlock-project

Cloning into 'sherlock-project'...
remote: Enumerating objects: 897, done.
remote: Counting objects: 100% (177/177), done.
remote: Compressing objects: 100% (59/59), done.
remote: Total 897 (delta 140), reused 133 (delta 118), pack-reused 720
Receiving objects: 100% (897/897), 74.85 MiB | 19.07 MiB/s, done.
Resolving deltas: 100% (539/539), done.


In [2]:
import os
os.chdir("sherlock-project")

In [3]:
#!pip install -r requirements38.txt
!pip install pyfunctional

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyfunctional
  Downloading PyFunctional-1.4.3-py3-none-any.whl (49 kB)
Collecting dill>=0.2.5
  Downloading dill-0.3.6-py3-none-any.whl (110 kB)
Installing collected packages: dill, pyfunctional
Successfully installed dill-0.3.6 pyfunctional-1.4.3


In [4]:
import numpy as np
import pandas as pd
import pyarrow as pa

from sherlock import helpers
from sherlock.deploy.model import SherlockModel
from sherlock.features.paragraph_vectors import initialise_pretrained_model, initialise_nltk
from sherlock.features.preprocessing import (
    extract_features,
    convert_string_lists_to_lists,
    prepare_feature_extraction,
    load_parquet_values,
)
from sherlock.features.word_embeddings import initialise_word_embeddings

In [5]:
from sherlock.functional import extract_features_to_csv

In [6]:
os.environ["PYTHONHASHSEED"] = "13"
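
Python reads `PYTHONHASHSEED` once, at interpreter startup, so assigning it through `os.environ` in a running kernel does not change string hashing in that kernel; it only affects processes started afterwards. The sketch below illustrates this with child interpreters. It assumes `sys.executable` points at a working Python, and the helper name `child_hash` is hypothetical.

```python
import os
import subprocess
import sys


def child_hash(seed: str, text: str = "sherlock") -> int:
    """Return hash(text) as computed by a fresh interpreter started with the given seed."""
    # Copy the current environment and set the seed only for the child process.
    env = dict(os.environ, PYTHONHASHSEED=seed)
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({text!r}))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


# Two fresh interpreters started with the same seed agree on the hash value.
first = child_hash("13")
second = child_hash("13")
print(first == second)
```

If reproducible hashing matters for the current kernel as well, the variable has to be set before the kernel process starts.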

In [7]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [8]:
import shutil

shutil.copyfile(r"/content/drive/MyDrive/School/NYU/Dataset Search/proj/glove.6B.50d.txt", r"/content/sherlock-project/sherlock/features/glove.6B.50d.txt")
shutil.copyfile(r"/content/drive/MyDrive/School/NYU/Dataset Search/proj/par_vec_trained_400.pkl.docvecs.vectors_docs.npy", r"/content/sherlock-project/sherlock/features/par_vec_trained_400.pkl.docvecs.vectors_docs.npy")
shutil.copyfile(r"/content/drive/MyDrive/School/NYU/Dataset Search/proj/par_vec_trained_400.pkl.trainables.syn1neg.npy", r"/content/sherlock-project/sherlock/features/par_vec_trained_400.pkl.trainables.syn1neg.npy")
shutil.copyfile(r"/content/drive/MyDrive/School/NYU/Dataset Search/proj/par_vec_trained_400.pkl.wv.vectors.npy", r"/content/sherlock-project/sherlock/features/par_vec_trained_400.pkl.wv.vectors.npy")


'/content/sherlock-project/sherlock/features/par_vec_trained_400.pkl.wv.vectors.npy'
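
The four `shutil.copyfile` calls above differ only in the file name, so they could be folded into a loop that also fails early when a file is missing. This is a minimal sketch and not part of Sherlock: `copy_required_files` is a hypothetical helper, and the directories in the usage comment are the ones used in this notebook.

```python
import shutil
from pathlib import Path


def copy_required_files(src_dir, dst_dir, names):
    """Copy each named file from src_dir to dst_dir and return the destination paths."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    missing = [name for name in names if not (src_dir / name).is_file()]
    if missing:
        # Report every missing file at once instead of stopping at the first one.
        raise FileNotFoundError(f"Missing in {src_dir}: {missing}")
    dst_dir.mkdir(parents=True, exist_ok=True)
    return [Path(shutil.copyfile(src_dir / name, dst_dir / name)) for name in names]


# File names taken from the copy cell above.
EMBEDDING_FILES = [
    "glove.6B.50d.txt",
    "par_vec_trained_400.pkl.docvecs.vectors_docs.npy",
    "par_vec_trained_400.pkl.trainables.syn1neg.npy",
    "par_vec_trained_400.pkl.wv.vectors.npy",
]

# Usage in this notebook (paths as in the cell above):
# copy_required_files(
#     "/content/drive/MyDrive/School/NYU/Dataset Search/proj",
#     "/content/sherlock-project/sherlock/features",
#     EMBEDDING_FILES,
# )
```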

## Initialize feature extraction models

In [9]:
# prepare_feature_extraction()
initialise_word_embeddings(path=r"/content/drive/MyDrive/School/NYU/Dataset Search/proj/")
initialise_pretrained_model(400, path=r"/content/sherlock-project/sherlock/features/")
initialise_nltk()

Initialising word embeddings
Initialise Word Embeddings process took 0:00:09.696109 seconds.
Initialise Doc2Vec Model, 400 dim, process took 0:00:04.354690 seconds. (filename = /content/sherlock-project/sherlock/features/par_vec_trained_400.pkl)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Initialised NLTK, process took 0:00:00.648900 seconds.


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


## Extract features

In [43]:
data = pd.Series(
    [
        ["Jane Smith", "Lute Ahorn", "Anna James"],
        ["Amsterdam", "Haarlem", "Zwolle"],
        ["6400 LA VISTA DR", "4500 ABRAMS RD", "6900 MEADOW LAKE AV"]
    ],
    name="values"
)

In [44]:
data

0                 [Jane Smith, Lute Ahorn, Anna James]
1                         [Amsterdam, Haarlem, Zwolle]
2    [6400 LA VISTA DR, 4500 ABRAMS RD, 6900 MEADOW...
Name: values, dtype: object

In [50]:
data = pd.read_csv(r"/content/drive/MyDrive/School/NYU/Dataset Search/proj/csv/Street_Cuts_Prior_to_2013.csv")
# Transpose the selected columns so that each Series element is the list of values of one column.
data_m = pd.Series(data[["EventID", "EventType", "FacilityName", "CreateTime", "FromPointLocation"]].T.values.tolist())
data_m

0    [cc001-10/13/2006-006, cc001-10/13/2006-011, c...
1    [Construction, Construction, Construction, Con...
2    [ LA VISTA DR    ,  ABRAMS RD    ,  MEADOW LAK...
3    [10/13/2006 12:00:00 AM, 10/13/2006 12:00:00 A...
4    [6400  LA VISTA DR    , 4500  ABRAMS RD    , 6...
dtype: object
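
The `.T.values.tolist()` chain above turns each selected column into one list of cell values. A more explicit way to build the same structure is sketched below on a tiny made-up frame; it also strips the padding whitespace visible in the street-name values. The helper name `columns_to_series` and the decision to cast to `str` and strip are assumptions for illustration, not something Sherlock is known to require.

```python
import pandas as pd


def columns_to_series(frame: pd.DataFrame, columns: list) -> pd.Series:
    """Return a Series whose i-th element is the list of values of columns[i]."""
    return pd.Series(
        # Cast to str and strip surrounding whitespace before collecting each column.
        [frame[col].astype(str).str.strip().tolist() for col in columns],
        name="values",
    )


# Toy frame with values borrowed from the output above.
toy = pd.DataFrame(
    {
        "EventType": ["Construction", "Construction"],
        "FacilityName": [" LA VISTA DR    ", " ABRAMS RD    "],
    }
)
columns = columns_to_series(toy, ["EventType", "FacilityName"])
print(columns[1])
```

Note that `astype(str)` renders missing values as the string `nan`, so columns with many gaps may need separate handling.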

In [48]:
data

Unnamed: 0,EventID,EventType,FacilityName,CreateTime,LastUpdate,Direction,County,FromPointLocation,ToPointLocation,EstimatedDurationDays,...,job_street_prefix,job_street_name,job_street_type,City,Latitude,Longitude,Latitude2,Longitude2,City2,job_address
0,cc001-10/13/2006-006,Construction,LA VISTA DR,10/13/2006 12:00:00 AM,,,Dallas,6400 LA VISTA DR,,,...,,LA VISTA,DR,Dallas,,,0,0,Dallas,01/01/6400 12:00:00 AM
1,cc001-10/13/2006-011,Construction,ABRAMS RD,10/13/2006 12:00:00 AM,,,Dallas,4500 ABRAMS RD,,,...,,ABRAMS,RD,Dallas,,,0,0,Dallas,01/01/4500 12:00:00 AM
2,cc001-10/13/2006-015,Construction,MEADOW LAKE AV,10/13/2006 12:00:00 AM,,,Dallas,6900 MEADOW LAKE AV,,,...,,MEADOW LAKE,AV,Dallas,,,0,0,Dallas,01/01/6900 12:00:00 AM
3,cc001-10/13/2006-026,Construction,MERRILEE LA,10/13/2006 12:00:00 AM,,,Dallas,6900 MERRILEE LA,,,...,,MERRILEE,LA,Dallas,,,0,0,Dallas,01/01/6900 12:00:00 AM
4,cc001-10/13/2006-035,Construction,WINTON ST,10/13/2006 12:00:00 AM,,,Dallas,6500 WINTON ST,,,...,,WINTON,ST,Dallas,,,0,0,Dallas,01/01/6500 12:00:00 AM
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2027,pwd01-10/04/2013-028,Construction,S CLINTON AV,10/04/2013 12:00:00 AM,,,Dallas,423 S CLINTON AV,,,...,S,CLINTON,AV,Dallas,,,0,0,Dallas,12/31/0422 12:00:00 AM
2028,pwd01-10/04/2013-029,Construction,S WILLOMET AV,10/04/2013 12:00:00 AM,,,Dallas,314 S WILLOMET AV,,,...,S,WILLOMET,AV,Dallas,,,0,0,Dallas,12/31/0313 12:00:00 AM
2029,pwd01-10/04/2013-030,Construction,MORNINGSIDE AV,10/04/2013 12:00:00 AM,,,Dallas,5902 MORNINGSIDE AV,,,...,,MORNINGSIDE,AV,Dallas,,,0,0,Dallas,01/01/5902 12:00:00 AM
2030,pwd01-10/04/2013-031,Construction,DELMAR AV,10/04/2013 12:00:00 AM,,,Dallas,3400 DELMAR AV,,,...,,DELMAR,AV,Dallas,,,0,0,Dallas,01/01/3400 12:00:00 AM


In [54]:
extract_features(
    "../temporary.csv",
    data_m
)
feature_vectors = pd.read_csv("../temporary.csv", dtype=np.float32)

Extracting Features:  20%|██        | 1/5 [00:00<00:00,  5.64it/s]

Exporting 1588 column features


Extracting Features: 100%|██████████| 5/5 [00:01<00:00,  2.70it/s]


In [13]:
#feature_vectors

## Initialize Sherlock

In [23]:
# Load the model architecture from JSON together with its saved weights.
model = SherlockModel(path="/content/sherlock-project/")
model.initialize_model_from_json(with_weights=True, model_id="sherlock")



## Predict semantic type for column

In [55]:
predicted_labels = model.predict(feature_vectors, "sherlock")



In [58]:
# print("CSV fields are: \n")
# print(str(list(data.columns)))
print("\n Sherlock predicted types are: \n")
print(list(predicted_labels))
print("\n")
data_m


 Sherlock predicted types are: 

['address', 'type', 'address', 'sex', 'address']




0    [cc001-10/13/2006-006, cc001-10/13/2006-011, c...
1    [Construction, Construction, Construction, Con...
2    [ LA VISTA DR    ,  ABRAMS RD    ,  MEADOW LAK...
3    [10/13/2006 12:00:00 AM, 10/13/2006 12:00:00 A...
4    [6400  LA VISTA DR    , 4500  ABRAMS RD    , 6...
dtype: object
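
Assuming the predictions come back in the same order as the rows of `data_m`, the labels can be paired with the column names selected earlier. The sketch below hard-codes those names and the predictions printed above so that it runs on its own; in the notebook, `predicted_labels` would be used directly. The variable names are illustrative.

```python
# Column names selected from the CSV and the labels printed above.
selected_columns = ["EventID", "EventType", "FacilityName", "CreateTime", "FromPointLocation"]
predicted = ["address", "type", "address", "sex", "address"]

# Pair each column with its predicted semantic type (assumes matching order).
if len(selected_columns) != len(predicted):
    raise ValueError("Expected one prediction per column")
type_by_column = dict(zip(selected_columns, predicted))

for column, label in type_by_column.items():
    print(f"{column:<20}{label}")
```

Laid out this way, it is easier to spot predictions that look wrong for this dataset, such as `CreateTime` being labeled `sex` and `EventID` being labeled `address`.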