# Using Sherlock out-of-the-box
This notebook is adapted from Sherlock's guide on how to predict a semantic type for a given table column.

Key tasks performed in this notebook (Task 1):
- read the data files and keep only those in CSV format
- use Sherlock to predict the semantic type of each column at varying confidence thresholds
- export the predicted semantics to the output folder as CSV files

Prerequisites:
- Run in a Python 3.8.0 environment with Sherlock installed
- Data files (CSV) are ready to be imported


![Workflow for the experimental setup in leveraging column semantics for data discovery.](image/workflowv2.png "Workflow")

## 1. Init

In [None]:
# Path for folder containing data files, log files and output files
DIR_DATASET = '/ivi/inde/mmargaret/data_search_e_data_csv/' # replace with folder containing NTCIR's CSV datafiles
DIR_LOG = '/ivi/inde/mmargaret/sherlock-project/log_2/' # replace with folder to store log files
DIR_OUTPUT = '/ivi/inde/mmargaret/sherlock-project/output_2/' # replace with folder to output enriched datafiles
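
The later cells build paths by plain string concatenation, so each folder must already exist and each path must end with a separator. A minimal sketch of a guard, assuming a hypothetical helper `ensure_dir` that is not part of the original notebook:

```python
import os

def ensure_dir(path):
    """Create `path` if it is missing and return it with a trailing separator.

    Hypothetical helper, not part of the original notebook.
    """
    os.makedirs(path, exist_ok=True)
    # Joining with an empty string appends a separator only when one is missing
    return os.path.join(path, '')
```

It could be applied as `DIR_LOG = ensure_dir(DIR_LOG)` and `DIR_OUTPUT = ensure_dir(DIR_OUTPUT)` before the logging cell runs.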

In [1]:
# TEMPORARY LOCAL DIR
# DIR_DATASET = '/Users/mmargaret/Documents/[UVA] Thesis/sherlock-project/data/data_search_e_data_csv/'
# DIR_LOG = '/Users/mmargaret/Documents/[UVA] Thesis/sherlock-project/log_2/'
# DIR_OUTPUT = '/Users/mmargaret/Documents/[UVA] Thesis/sherlock-project/output_2/'

### Setup Logging

In [2]:
import logging
from datetime import datetime

logger = logging.getLogger()
fhandler = logging.FileHandler(filename='{}{}'.format(DIR_LOG,datetime.now().strftime('%Y%m%d_%H%M_sherlock.log')), mode='a')
formatter = logging.Formatter('%(asctime)s - %(levelname)s : %(message)s', datefmt='%m/%d/%Y %I:%M')
fhandler.setFormatter(formatter)
logger.addHandler(fhandler)
logger.setLevel(logging.INFO)
logging.info('- LOGGING STARTS -')
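
Re-running the logging cell attaches one more `FileHandler` to the root logger each time, so later messages are written once per attached handler. A sketch of one way to guard against that, assuming a hypothetical helper `add_file_handler_once` (not part of the original notebook):

```python
import logging
import os

def add_file_handler_once(logger, log_path, formatter=None):
    """Attach a FileHandler for `log_path` unless the logger already has one.

    Hypothetical helper; returns the handler that writes to `log_path`.
    """
    target = os.path.abspath(log_path)
    for handler in logger.handlers:
        # FileHandler stores the absolute path of its file in baseFilename
        if isinstance(handler, logging.FileHandler) and handler.baseFilename == target:
            return handler
    handler = logging.FileHandler(filename=log_path, mode='a')
    if formatter is not None:
        handler.setFormatter(formatter)
    logger.addHandler(handler)
    return handler
```

Because the log filename above includes the minute, this guard only deduplicates handlers created within the same minute; a fixed filename per run would make it fully idempotent.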


### Import Libraries

In [None]:
import numpy as np
import pandas as pd
import pyarrow as pa
import os
import sys
    
from sherlock import helpers
from sherlock.deploy.model import SherlockModel
from sherlock.functional import extract_features_to_csv
from sherlock.features.paragraph_vectors import initialise_pretrained_model, initialise_nltk
from sherlock.features.preprocessing import (
    extract_features,
    convert_string_lists_to_lists,
    prepare_feature_extraction,
    load_parquet_values,
)
from sherlock.features.word_embeddings import initialise_word_embeddings
from sklearn.preprocessing import LabelEncoder

### Initialize Sherlock's feature extraction models

In [4]:
prepare_feature_extraction()
initialise_word_embeddings()
initialise_pretrained_model(400)
initialise_nltk()

Preparing feature extraction by downloading 4 files:
 ../sherlock/features/glove.6B.50d.txt,
 ../sherlock/features/par_vec_trained_400.pkl.docvecs.vectors_docs.npy,
 ../sherlock/features/par_vec_trained_400.pkl.trainables.syn1neg.npy, and
 ../sherlock/features/par_vec_trained_400.pkl.wv.vectors.npy.
All files for extracting word and paragraph embeddings are present.
Initialising word embeddings
Initialise Word Embeddings process took 0:00:04.636650 seconds.
Initialise Doc2Vec Model, 400 dim, process took 0:00:02.604305 seconds. (filename = ../sherlock/features/par_vec_trained_400.pkl)
Initialised NLTK, process took 0:00:00.149001 seconds.


[nltk_data] Downloading package punkt to /Users/mmargaret/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/mmargaret/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Get the list of all data files for processing
Retrieve all CSV filenames in the specified folder.

In [5]:
_ = os.listdir(DIR_DATASET)

In [6]:
file_list = [id for id in _ if '.csv' in id]
logging.info('Number of Files: {}'.format(len(file_list)))
print('Number of Files: {}'.format(len(file_list)))

Number of Files: 2919
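
A substring test such as `'.csv' in name` also matches names like `data.csv.bak`, and `os.listdir` returns entries in arbitrary order, which matters later when a run is resumed or when `file_list[1]` is used as the test file. A stricter sketch, assuming a hypothetical helper `list_csv_files`:

```python
import os

def list_csv_files(folder):
    """Return the names in `folder` that end with .csv, in sorted order.

    Hypothetical helper, not part of the original notebook.
    """
    names = [name for name in os.listdir(folder) if name.lower().endswith('.csv')]
    # Sorting gives the same processing order on every run
    return sorted(names)
```

For the dataset used here, whose files all end in `.text.csv`, this should select the same files as the cell above, only in a stable order.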


## 2. Define Utility Functions

In [8]:
def getPredictedLabels(y_pred_proba, classes, threshold=0.0):
    """
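        Retrieve the predicted semantics by assigning each column the class with the highest probability,
        keeping only the columns whose highest probability is at least "threshold".
        Input:
            y_pred_proba: array of shape (n_columns, n_classes) holding the predicted class probabilities
            classes: array of class labels, in the same order as the columns of y_pred_proba
            threshold: minimum probability a prediction must reach to be kept (the default 0.0 keeps all)
        Output: returns an array of predicted semantics, one per column that meets the threshold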
    """
    pred_scores = np.max(y_pred_proba, axis=1)
    index_threshold = np.where(pred_scores >= threshold)[0]
    y_pred_int = np.argmax(y_pred_proba, axis=1)[index_threshold]
    
    encoder = LabelEncoder()
    encoder.classes_ = classes

    return encoder.inverse_transform(y_pred_int)
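
Because columns below the threshold are dropped rather than masked, the result can be shorter than the number of columns, so its positions no longer line up with `colNames`. The sketch below shows this on made-up probabilities. `labels_dropped` is intended to mirror the filtering in `getPredictedLabels` using plain NumPy indexing in place of `LabelEncoder`, and `labels_aligned` is a hypothetical position-preserving variant; neither is part of the original notebook.

```python
import numpy as np

# Made-up probabilities for 3 columns over 3 classes (illustrative values only)
classes = np.array(['area', 'name', 'state'])
proba = np.array([
    [0.70, 0.20, 0.10],   # top class 'area'  with score 0.70
    [0.30, 0.30, 0.40],   # top class 'state' with score 0.40
    [0.05, 0.90, 0.05],   # top class 'name'  with score 0.90
])

def labels_dropped(y_pred_proba, classes, threshold=0.0):
    """Keep only the top labels whose score reaches the threshold."""
    scores = np.max(y_pred_proba, axis=1)
    keep = np.where(scores >= threshold)[0]
    return [str(c) for c in classes[np.argmax(y_pred_proba, axis=1)[keep]]]

def labels_aligned(y_pred_proba, classes, threshold=0.0):
    """Return one entry per column, with None where the score is too low."""
    scores = np.max(y_pred_proba, axis=1)
    top = classes[np.argmax(y_pred_proba, axis=1)]
    return [str(label) if score >= threshold else None
            for label, score in zip(top, scores)]

print(labels_dropped(proba, classes, 0.5))   # ['area', 'name']
print(labels_aligned(proba, classes, 0.5))   # ['area', None, 'name']
```

With the aligned form, the entry at position `k` always refers to the `k`-th text column, at the cost of storing `None` placeholders.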


In [9]:
def extractIDSemanticsWithColumnNames(filename):
    """
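        This function:
        (1) reads the dataset given by "filename",
        (2) uses Sherlock to extract features from its text columns, initialise the pre-trained model and predict the column semantics,
        (3) collects other column attributes, such as column names, data types, completeness and average length.
        It relies on the globals "model_id", "classes" and "error_list" being defined before it is called.
        Input:
            filename: name of a CSV file inside DIR_DATASET
        Output: returns a dictionary containing the semantics and attributes of one data file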
    """
    
    IDSemanticsColumns = {'data_filename':filename, 'colSemantics': [], 'colNames':[]}
    try:
        # read files
        with open(DIR_DATASET + filename, errors='ignore') as f:
            a_doc = pd.read_csv(f)
        
        # column stats
        col_types = a_doc.dtypes.tolist()
        col_complete = pct_completeness(a_doc)
        
        a_doc = a_doc.select_dtypes(include=[object]).astype(str)
        col_len = avgColumnLength(a_doc)
        data = pd.Series(a_doc.transpose().values.tolist(), name="values") #format it to list of values by columns

        # sherlock extract features
        extract_features("../temporary.csv",data)
        feature_vectors = pd.read_csv("../temporary.csv", dtype=np.float32)

        # sherlock init and predict with pre-trained model
        model = SherlockModel()
        model.initialize_model_from_json(with_weights=True, model_id=model_id)
        
        # PREDICT
        predicted_proba = model.predict_proba(feature_vectors, model_id)
        predicted_scores = np.max(predicted_proba, axis=1).round(4)

        # return dictionary with id: id of the doc, list of the columns' semantics, list of the columns' names
        IDSemanticsColumns = {'data_filename':filename
                              , 'colSemantics': list(getPredictedLabels(predicted_proba, classes))
                              , 'colSemantics_s10': list(getPredictedLabels(predicted_proba, classes, 0.1))
                              , 'colSemantics_s20': list(getPredictedLabels(predicted_proba, classes, 0.2))
                              , 'colSemantics_s30': list(getPredictedLabels(predicted_proba, classes, 0.3))
                              , 'colSemantics_s40': list(getPredictedLabels(predicted_proba, classes, 0.4))
                              , 'colSemantics_s50': list(getPredictedLabels(predicted_proba, classes, 0.5))
                              , 'colSemantics_s60': list(getPredictedLabels(predicted_proba, classes, 0.6))
                              , 'colSemantics_s70': list(getPredictedLabels(predicted_proba, classes, 0.7))
                              , 'colSemantics_s80': list(getPredictedLabels(predicted_proba, classes, 0.8))
                              , 'colSemantics_s90': list(getPredictedLabels(predicted_proba, classes, 0.9))
                              , 'colSemantics_s95': list(getPredictedLabels(predicted_proba, classes, 0.95))
                              , 'colSemantics_s98': list(getPredictedLabels(predicted_proba, classes, 0.98))
                              , 'colSemantics_s99': list(getPredictedLabels(predicted_proba, classes, 0.99))
                              , 'colNames':list(a_doc.columns)
                              , 'colTypes': col_types
                              , 'colLen': col_len
                              , 'colComplete': col_complete
                              , 'colScores': list(predicted_scores)}
    
    except Exception as e:
        logging.error('Unable to extract: {}'.format(filename))
        print('Unable to extract: {}'.format(filename))
        
        print(e)
        logging.error(e, exc_info=True)
        
        global error_list
        error_list += [filename]
        
    return IDSemanticsColumns
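
`extractIDSemanticsWithColumnNames` calls `pct_completeness` and `avgColumnLength`, whose definitions do not appear in this notebook as shown. The sketch below is only an assumption about what they compute, inferred from their names and from the keys under which the results are stored (`colComplete`, `colLen`); the original implementations may differ.

```python
import pandas as pd

def pct_completeness(df):
    """Assumed behaviour: fraction of non-missing values in each column."""
    return [round(float(v), 4) for v in df.notna().mean()]

def avgColumnLength(df):
    """Assumed behaviour: mean string length of each column.

    Expects a frame whose columns hold strings, as produced by
    select_dtypes(include=[object]).astype(str) in the function above.
    """
    return [round(float(col.str.len().mean()), 4) for _, col in df.items()]
```

Note that after `astype(str)` a missing value becomes a literal string, so under this assumption it would count towards the average length.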


### Test Extraction of Semantics
Test the extraction on a single sample data file.

In [None]:
# INIT
error_list=[] # initialise the list to store filenames with prediction errors
model_id = "sherlock"
classes = np.load(f"../model_files/classes_{model_id}.npy", allow_pickle=True)

In [10]:
# TEST function
logging.info('- TEST START -')

test_file = file_list[1]
print(test_file)
logging.info('filename: {}'.format(test_file))

test_extract = extractIDSemanticsWithColumnNames(test_file)
print (test_extract)
logging.info('extraction: {}'.format(test_extract))

print (error_list)
logging.info('error list: {}'.format(error_list))

logging.info('- TEST END -')


30582267f36c39a6ff33b0e38787eaa72f9ad84192498830816d3d2bf5b2e73b.text.csv
Exporting 1588 column features

{'data_filename': '30582267f36c39a6ff33b0e38787eaa72f9ad84192498830816d3d2bf5b2e73b.text.csv', 'colSemantics': ['area', 'state', 'state', 'state', 'state', 'state', 'state', 'state', 'state', 'state', 'state', 'state'], 'colSemantics_s10': ['area', 'state', 'state', 'state', 'state', 'state', 'state', 'state', 'state', 'state', 'state', 'state'], 'colSemantics_s20': ['area', 'state', 'state', 'state', 'state', 'state', 'state', 'state', 'state', 'state', 'state', 'state'], 'colSemantics_s30': ['area', 'state', 'state', 'state', 'state', 'state', 'state', 'state', 'state', 'state', 'state', 'state'], 'colSemantics_s40': ['area', 'state', 'state', 'state', 'state', 'state', 'state', 'state', 'state', 'state', 'state', 'state'], 'colSemantics_s50': ['area', 'state', 'state', 'state', 'state', 'state', 'state', 'state', 'state', 'state', 'state'], 'colSemantics_s60': ['area', 'state', 'state', 'state', 'state', 'state', 'state', 'state', 'state'], 'colSemantics_s70': ['state', 'state', 'st

## 3. Semantics Extraction
This section predicts semantics for all data files.

In [None]:
_ = os.listdir(DIR_OUTPUT) # retrieve the filenames in the output folder
enrich_list = [] # initialise the list to store the extraction output of each data file
output_filenames = [] # initialise the list to store filenames that have been processed

### Get the list of latest extracted semantics
If you are processing a large number of files and need to resume from the latest run, run this cell.
For a fresh run each time, skip it.

In [12]:
try:
    output_list = [DIR_OUTPUT + str(id) for id in _ if 'enriched_part_' in id]
    logging.info('Number of Output Files: {}'.format(len(output_list)))
    print('Number of Output Files: {}'.format(len(output_list)))

    latest_output = max(output_list, key=os.path.getctime)
    logging.info('Latest Output Filename: {}'.format(latest_output))
    print('Latest Output Filename: {}'.format(latest_output))

    output_df = pd.read_csv(latest_output)
    output_filenames = output_df['data_filename'].tolist()
    enrich_list = output_df.to_dict('records')
    
    logging.info('Number of Extracted Dataset: {}'.format(len(output_filenames)))
    print('Number of Extracted Dataset: {}'.format(len(output_filenames)))

except Exception as e:
    logging.error('Unable to retrieve latest output')
    print('Unable to retrieve latest output')
    
    logging.error(e, exc_info=True)
    print(e)
    pass

output_filenames[:5]

Number of Output Files: 0
Unable to retrieve latest output
max() arg is an empty sequence


[]
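
When a checkpoint is read back with `pd.read_csv`, cells that held Python lists (for example `colSemantics` or `colNames`) come back as their string representation, so resumed records differ in type from freshly extracted ones. A minimal sketch of converting such cells back, assuming a hypothetical helper `parse_list_cell`; it only handles lists of plain literals such as strings and numbers, so it would not apply to `colTypes`, which holds dtype objects.

```python
import ast

def parse_list_cell(cell):
    """Parse the string form of a list, e.g. "['area', 'state']", back into a list.

    Hypothetical helper, not part of the original notebook.
    """
    # literal_eval only evaluates Python literals, never arbitrary code
    value = ast.literal_eval(cell)
    if not isinstance(value, list):
        raise ValueError('expected a list literal, got {!r}'.format(cell))
    return value
```

It could be applied per column after loading, for instance `output_df['colNames'].apply(parse_list_cell)`.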

### Start Extraction 
Predict semantics for all data files in the specified folder

In [None]:
logging.info('- EXTRACT START -')

error_list=[] # reset the list of filenames whose prediction failed
model_id = "sherlock"
classes = np.load(f"../model_files/classes_{model_id}.npy", allow_pickle=True)
col_csv = ['data_filename', 'colSemantics'
           , 'colSemantics_s10', 'colSemantics_s20', 'colSemantics_s30', 'colSemantics_s40', 'colSemantics_s50'
           , 'colSemantics_s60', 'colSemantics_s70', 'colSemantics_s80', 'colSemantics_s90', 'colSemantics_s95'
           , 'colSemantics_s98', 'colSemantics_s99', 'colNames', 'colScores', 'colComplete', 'colTypes', 'colLen']
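
The threshold-specific column names follow a fixed pattern, so the same list can be generated from the thresholds themselves, which keeps it in step with the keys built in `extractIDSemanticsWithColumnNames`. A sketch, assuming the hypothetical names `THRESHOLD_PCTS` and `build_col_csv`:

```python
# Confidence thresholds in percent, as used in the colSemantics_sNN keys
THRESHOLD_PCTS = [10, 20, 30, 40, 50, 60, 70, 80, 90, 95, 98, 99]

def build_col_csv(threshold_pcts):
    """Return the output column names in the same order as col_csv above."""
    semantic_cols = ['colSemantics_s{}'.format(p) for p in threshold_pcts]
    return (['data_filename', 'colSemantics']
            + semantic_cols
            + ['colNames', 'colScores', 'colComplete', 'colTypes', 'colLen'])
```

The matching probability threshold for each key is `p / 100`, for example `0.95` for `colSemantics_s95`.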


In [13]:
"""
For each data file:
- if the data file has already been processed, skip it.
- otherwise, extract the predicted semantics, column names, etc. and store them in a list.
- at every 10th file index, write a checkpoint file containing the predictions and errors collected so far.
"""

for i in range(0, len(file_list)):
    
    # so that it does not need to rerun existing output
    if (file_list[i] in output_filenames):
        logging.info('Existed: {} skipped'.format(file_list[i]))
        print('Existed: {} skipped'.format(file_list[i]))
        continue
        
    enrich_list += [extractIDSemanticsWithColumnNames(file_list[i])]
    if i%10==0:
        
        logging.info('i: {}'.format(i))
        sys.stdout.write('- i: {} -'.format(i))
        sys.stdout.write('\n')
        
        pd.DataFrame(enrich_list
                     , columns = col_csv).to_csv(DIR_OUTPUT +'enriched_part_' + str(i) +'.csv'
                     , index=False)
        
        pd.DataFrame(error_list
             , columns=['data_filename']).to_csv(DIR_OUTPUT + 'error_part_' + str(i) +'.csv'
             , index=False)
        
logging.info('- EXTRACT END -')
        

- i: 0 -
- i: 10 -
- i: 20 -
- i: 30 -
Unable to extract: 046c3561b610ab29de866f59abb248ecf1a32fc190647c5d456784bd06d36018.text.csv
No columns to parse from file
- i: 40 -
- i: 50 -
Unable to extract: 131996a5027d1809789f1b20749c40d158883a4d4f242e568fbc1aa9ae2deffe.text.csv
Error tokenizing data. C error: Expected 1 fields in line 9, saw 8

KeyboardInterrupt: 
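
In the loop above, the checkpoint condition is evaluated on the file index `i`, and only when file `i` is actually processed; if that file is skipped as already done, no checkpoint is written at that index. A sketch of an alternative that counts processed files instead, assuming hypothetical callables `process` and `save` that stand in for `extractIDSemanticsWithColumnNames` and the `to_csv` calls:

```python
def run_with_checkpoints(items, already_done, process, save, every=10):
    """Process the items not yet done, calling `save` after every `every` of them.

    Hypothetical helper: `process(item)` returns one result record and
    `save(results, n_processed)` persists the results collected so far.
    """
    results = []
    processed = 0
    done = set(already_done)   # set lookup avoids scanning a list for each file
    for item in items:
        if item in done:
            continue
        results.append(process(item))
        processed += 1
        if processed % every == 0:
            save(results, processed)
    if processed % every != 0:
        save(results, processed)   # persist the final partial batch
    return results
```

Counting processed files keeps the checkpoint interval constant on a resumed run, and the final call covers the files handled after the last full batch.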

## 4. Export 
Export extracted semantics and list of files with errors

In [None]:
pd.DataFrame(enrich_list
             , columns = col_csv).to_csv(DIR_OUTPUT +'enriched_all.csv'
             , index=False)


In [None]:
pd.DataFrame(error_list
             , columns=['data_filename']).to_csv(DIR_OUTPUT + 'error_all.csv'
             , index=False)


In [None]:
logging.info('- EOF -')