# Using Sherlock out-of-the-box
This notebook is adapted from Sherlock's guide on how to predict a semantic type for a given table column.

Key tasks performed in this notebook (Task 1):
- read the data files and filter for CSV format
- use Sherlock to predict the semantic type of each column at varying confidence thresholds
- export the predicted semantics to the output folder as CSV files

Prerequisites:
- The data files (CSV) are ready to be imported
- The notebook is executed from Sherlock's "notebooks" folder, with Sherlock installed in a Python 3.7.0 environment



![Workflow for the experimental setup in leveraging column semantics for data discovery.](image/workflowv2.png "Workflow")

## 1. Init

In [1]:
# # Path for folder containing data files, log files and output files
# DIR_DATASET = '/ivi/inde/mmargaret/data_search_e_data_csv/' # replace with folder containing NTCIR's CSV datafiles
# DIR_LOG = '/ivi/inde/mmargaret/sherlock-project/log_2/' # replace with folder to store log files
# DIR_OUTPUT = '/ivi/inde/mmargaret/sherlock-project/output_2/' # replace with folder to output enriched datafiles

In [2]:
# TEMPORARY LOCAL DIR
DIR_DATASET = '/Users/mmargaret/Documents/[UVA] Thesis/sherlock-project/data/data_search_e_data_csv/'
DIR_LOG = '/Users/mmargaret/Documents/[UVA] Thesis/sherlock-project/log_2/'
DIR_OUTPUT = '/Users/mmargaret/Documents/[UVA] Thesis/sherlock-project/output_2/'

### Setup Logging

In [3]:
import logging
from datetime import datetime

logger = logging.getLogger()
fhandler = logging.FileHandler(filename='{}{}'.format(DIR_LOG,datetime.now().strftime('%Y%m%d_%H%M_sherlock.log')), mode='a')
formatter = logging.Formatter('%(asctime)s - %(levelname)s : %(message)s', datefmt='%m/%d/%Y %I:%M')
fhandler.setFormatter(formatter)
logger.addHandler(fhandler)
logger.setLevel(logging.INFO)
logging.info('- LOGGING STARTS -')
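
`logging.FileHandler` opens its file as soon as it is created, so the log folder has to exist beforehand. A minimal sketch of the same setup that creates the folder first; the temporary directory stands in for `DIR_LOG`, and the names `demo_dir_log`, `demo_logger` and `demo_handler` are illustrative assumptions, not part of the notebook:

```python
import logging
import os
import tempfile
from datetime import datetime

# Stand-in for DIR_LOG (assumption); replace with your own log folder.
demo_dir_log = os.path.join(tempfile.mkdtemp(), "log_2")
os.makedirs(demo_dir_log, exist_ok=True)  # create the folder if it is missing

log_path = os.path.join(demo_dir_log, datetime.now().strftime("%Y%m%d_%H%M_sherlock.log"))

demo_logger = logging.getLogger("sherlock_demo")  # hypothetical named logger
demo_logger.setLevel(logging.INFO)
demo_handler = logging.FileHandler(filename=log_path, mode="a")
demo_handler.setFormatter(
    logging.Formatter("%(asctime)s - %(levelname)s : %(message)s", datefmt="%m/%d/%Y %I:%M")
)
demo_logger.addHandler(demo_handler)

demo_logger.info("- LOGGING STARTS -")
demo_handler.flush()
```

Building the path with `os.path.join` also removes the need for a trailing slash on the folder name.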


### Import Libraries

In [4]:
import numpy as np
import pandas as pd
import pyarrow as pa
import os
import sys
    
from sherlock import helpers
from sherlock.deploy.model import SherlockModel
from sherlock.functional import extract_features_to_csv
from sherlock.features.paragraph_vectors import initialise_pretrained_model, initialise_nltk
from sherlock.features.preprocessing import (
    extract_features,
    convert_string_lists_to_lists,
    prepare_feature_extraction,
    load_parquet_values,
)
from sherlock.features.word_embeddings import initialise_word_embeddings
from sklearn.preprocessing import LabelEncoder

### Initialize Sherlock's feature extraction models

In [5]:
prepare_feature_extraction()
initialise_word_embeddings()
initialise_pretrained_model(400)
initialise_nltk()

Preparing feature extraction by downloading 4 files:
        
 ../sherlock/features/glove.6B.50d.txt, 
 ../sherlock/features/par_vec_trained_400.pkl.docvecs.vectors_docs.npy,
        
 ../sherlock/features/par_vec_trained_400.pkl.trainables.syn1neg.npy, and 
 ../sherlock/features/par_vec_trained_400.pkl.wv.vectors.npy.
        
All files for extracting word and paragraph embeddings are present.
Initialising word embeddings
Initialise Word Embeddings process took 0:00:04.550186 seconds.
Initialise Doc2Vec Model, 400 dim, process took 0:00:02.432000 seconds. (filename = ../sherlock/features/par_vec_trained_400.pkl)
Initialised NLTK, process took 0:00:00.188585 seconds.


[nltk_data] Downloading package punkt to /Users/mmargaret/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/mmargaret/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Get the list of all data files for processing
Retrieve all CSV filenames in the specified folder 

In [6]:
_ = os.listdir(DIR_DATASET)

In [7]:
file_list = [name for name in _ if name.endswith('.csv')]  # keep only names ending in .csv
logging.info('Number of Files: {}'.format(len(file_list)))
print('Number of Files: {}'.format(len(file_list)))

Number of Files: 10
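
`os.listdir` returns names in arbitrary order, while later cells rely on positions in `file_list` (for example `file_list[1]` and the checkpoint index). A minimal sketch of a suffix-based, case-insensitive filter with a stable order; the `listing` below is a hypothetical stand-in for `os.listdir(DIR_DATASET)`:

```python
# Hypothetical directory listing standing in for os.listdir(DIR_DATASET).
listing = ["b.text.csv", "a.text.csv", "notes.csv.bak", "README.md", "C.TEXT.CSV"]

# Keep only names whose suffix is .csv (case-insensitive), in a stable sorted order.
csv_files = sorted(name for name in listing if name.lower().endswith(".csv"))
print(csv_files)
```

Matching on the suffix rather than on a substring leaves out names such as `notes.csv.bak`.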


## 2. Define Utilities Function

In [8]:
def getPredictedLabels(y_pred_proba, classes, threshold=0.0):
    """
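        This function retrieves the predicted semantics by assigning each column the class with the highest
        probability, keeping only predictions whose probability is at least "threshold".
        Input: 
            y_pred_proba: array of predicted class probabilities, one row per column
            classes: array of class labels, in the same order as the probabilities in each row
            threshold: minimum probability for a prediction to be kept (the default 0.0 keeps all)
        Output: returns a list of predicted semantics; columns below the threshold are omitted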
    """
    pred_scores = np.max(y_pred_proba, axis=1)
    index_threshold = np.where(pred_scores >= threshold)[0]
    y_pred_int = np.argmax(y_pred_proba, axis=1)[index_threshold]
    
    encoder = LabelEncoder()
    encoder.classes_ = classes

    return encoder.inverse_transform(y_pred_int)
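
To make the effect of `threshold` concrete, here is a plain-Python restatement of the same selection rule on made-up probabilities. The function name `predicted_labels`, the class names and the values are illustrative assumptions. Rows whose top probability falls below the threshold are dropped, so the result can be shorter than the number of columns.

```python
def predicted_labels(proba_rows, classes, threshold=0.0):
    """Return the top class of every row whose top probability is >= threshold."""
    labels = []
    for row in proba_rows:
        best = max(range(len(row)), key=row.__getitem__)  # index of the largest probability
        if row[best] >= threshold:
            labels.append(classes[best])
    return labels

classes_demo = ["address", "city", "company"]
proba_demo = [
    [0.10, 0.05, 0.85],  # top class: company (0.85)
    [0.40, 0.35, 0.25],  # top class: address (0.40)
    [0.05, 0.92, 0.03],  # top class: city (0.92)
]

print(predicted_labels(proba_demo, classes_demo))       # no threshold: all three rows kept
print(predicted_labels(proba_demo, classes_demo, 0.5))  # the 0.40 row is dropped
print(predicted_labels(proba_demo, classes_demo, 0.9))  # only the 0.92 row remains
```

Because dropped rows leave no placeholder, a thresholded list is not position-aligned with the list of column names.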


In [9]:
def avgColumnLength(df):
    """
        This function calculates the average string length of each column in the dataframe.
    """
    len_list = []
    for col in df:
        len_list += [round(df[col].apply(len).mean(),4)]
    return len_list

def pct_completeness(df):
    """
        This function calculates the fraction of non-missing values (completeness) of each column in the dataframe.
    """
    pct_list = []
    df_len = len(df)
    for col in df:
        pct_list += [round(1 - df[col].isna().sum() / df_len,4)]
    return pct_list
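
The two statistics can be checked by hand on a single column. The sketch below is a plain-Python illustration, not the notebook's pandas code; the column values are made up. It assumes, as the extraction function does by casting to `str` before measuring length, that a missing cell is rendered as the string `'nan'` and therefore counts as length 3.

```python
import math

# Hypothetical column values; float('nan') stands in for a missing cell.
column = ["Amsterdam", "Utrecht", float("nan"), "Delft"]

def is_missing(value):
    """True for a NaN float, the stand-in for a missing cell."""
    return isinstance(value, float) and math.isnan(value)

# Share of non-missing cells, mirroring pct_completeness: 1 - 1/4.
completeness = round(1 - sum(is_missing(v) for v in column) / len(column), 4)

# Mean string length after casting every cell to str, mirroring avgColumnLength.
# Lengths are 9, 7, 3 ('nan') and 5.
avg_len = round(sum(len(str(v)) for v in column) / len(column), 4)

print(completeness, avg_len)
```

If missing cells should not contribute to the average length, they would need to be excluded before the cast.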

In [10]:
def extractIDSemanticsWithColumnNames(filename):
    """
        This function:
        (1) reads the dataset given by "filename",
        (2) uses Sherlock to extract its features, initialise the Sherlock model and predict the column semantics,
        (3) extracts other features, such as column names and column types.
        Input: 
            filename: name of a CSV file inside DIR_DATASET
        Output: returns a dictionary containing semantics and features of one data file
    """
    
    IDSemanticsColumns = {'data_filename':filename, 'colSemantics': [], 'colNames':[]}
    try:
        # read files
        with open(DIR_DATASET + filename, errors='ignore') as f:
            a_doc = pd.read_csv(f)
        
        # column stats
        col_types = a_doc.dtypes.tolist()
        col_complete = pct_completeness(a_doc)
        
        a_doc = a_doc.select_dtypes(include=[object]).astype(str)
        col_len = avgColumnLength(a_doc)
        data = pd.Series(a_doc.transpose().values.tolist(), name="values") #format it to list of values by columns

        # sherlock extract features
        extract_features("../temporary.csv",data)
        feature_vectors = pd.read_csv("../temporary.csv", dtype=np.float32)

        # sherlock init and predict with pre-trained model
        model = SherlockModel()
        model.initialize_model_from_json(with_weights=True, model_id=model_id)
        
        # PREDICT
        predicted_proba = model.predict_proba(feature_vectors, model_id)
        predicted_scores = np.max(predicted_proba, axis=1).round(4)

        # return dictionary with id: id of the doc, list of the columns' semantics, list of the columns' names
        IDSemanticsColumns = {'data_filename':filename
                              , 'colSemantics': list(getPredictedLabels(predicted_proba, classes))
                              , 'colSemantics_s10': list(getPredictedLabels(predicted_proba, classes, 0.1))
                              , 'colSemantics_s20': list(getPredictedLabels(predicted_proba, classes, 0.2))
                              , 'colSemantics_s30': list(getPredictedLabels(predicted_proba, classes, 0.3))
                              , 'colSemantics_s40': list(getPredictedLabels(predicted_proba, classes, 0.4))
                              , 'colSemantics_s50': list(getPredictedLabels(predicted_proba, classes, 0.5))
                              , 'colSemantics_s60': list(getPredictedLabels(predicted_proba, classes, 0.6))
                              , 'colSemantics_s70': list(getPredictedLabels(predicted_proba, classes, 0.7))
                              , 'colSemantics_s80': list(getPredictedLabels(predicted_proba, classes, 0.8))
                              , 'colSemantics_s90': list(getPredictedLabels(predicted_proba, classes, 0.9))
                              , 'colSemantics_s95': list(getPredictedLabels(predicted_proba, classes, 0.95))
                              , 'colSemantics_s98': list(getPredictedLabels(predicted_proba, classes, 0.98))
                              , 'colSemantics_s99': list(getPredictedLabels(predicted_proba, classes, 0.99))
                              , 'colNames':list(a_doc.columns)
                              , 'colTypes': col_types
                              , 'colLen': col_len
                              , 'colComplete': col_complete
                              , 'colScores': list(predicted_scores)}
    
    except Exception as e:
        logging.error('Unable to extract: {}'.format(filename))
        print('Unable to extract: {}'.format(filename))
        
        print(e)
        logging.error(e, exc_info=True)
        
        global error_list
        error_list += [filename]
        
    return IDSemanticsColumns
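
The thresholded `colSemantics_s*` entries differ only in their threshold, so they could also be generated in a loop. A minimal sketch under stated assumptions: `labels_at` is a hypothetical stand-in for `getPredictedLabels(predicted_proba, classes, threshold)` with made-up scores, and `THRESHOLDS_PCT` is an illustrative name. Integer percentages are used so that the key names are exact.

```python
THRESHOLDS_PCT = [10, 20, 30, 40, 50, 60, 70, 80, 90, 95, 98, 99]

def labels_at(threshold):
    """Hypothetical stand-in for getPredictedLabels(predicted_proba, classes, threshold)."""
    top = [("company", 0.97), ("location", 0.25), ("status", 0.55)]  # made-up scores
    return [label for label, score in top if score >= threshold]

record = {"data_filename": "example.text.csv", "colSemantics": labels_at(0.0)}
for pct in THRESHOLDS_PCT:
    # One key per threshold, e.g. colSemantics_s95 for a threshold of 0.95.
    record[f"colSemantics_s{pct}"] = labels_at(pct / 100)

print(record["colSemantics_s30"])
print(record["colSemantics_s98"])
```

The same list of percentages could then be reused to build the column order for the exported CSV.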


### Test Extraction of Semantics
Test the extraction on a single sample data file.

In [11]:
# INIT
error_list=[] # initialise the list to store filenames with prediction errors
model_id = "sherlock"
classes = np.load(f"../model_files/classes_{model_id}.npy", allow_pickle=True)

In [12]:
# TEST function
logging.info('- TEST START -')

test_file = file_list[1]
print(test_file)
logging.info('filename: {}'.format(test_file))

test_extract = extractIDSemanticsWithColumnNames(test_file)
print(test_extract)
logging.info('extraction: {}'.format(test_extract))

print(error_list)
logging.info('error list: {}'.format(error_list))

logging.info('- TEST END -')


Extracting Features:   0%|                                | 0/8 [00:00<?, ?it/s]

8238e6a8bbb8896f3d7e346013e6356cd6101aa03236beb1f8bbbc1326dd51a5.text.csv


Extracting Features:  12%|███                     | 1/8 [00:00<00:01,  3.72it/s]

Exporting 1588 column features


Extracting Features: 100%|████████████████████████| 8/8 [00:00<00:00,  8.01it/s]
2022-07-14 23:35:01.167279: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2022-07-14 23:35:01.183568: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7f97cf8f83f0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2022-07-14 23:35:01.183582: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version


{'data_filename': '8238e6a8bbb8896f3d7e346013e6356cd6101aa03236beb1f8bbbc1326dd51a5.text.csv', 'colSemantics': ['company', 'location', 'address', 'city', 'state', 'category', 'duration', 'status'], 'colSemantics_s10': ['company', 'location', 'address', 'city', 'state', 'category', 'duration', 'status'], 'colSemantics_s20': ['company', 'location', 'address', 'city', 'state', 'category', 'duration', 'status'], 'colSemantics_s30': ['company', 'address', 'city', 'state', 'category', 'duration', 'status'], 'colSemantics_s40': ['company', 'address', 'city', 'state', 'category', 'duration', 'status'], 'colSemantics_s50': ['company', 'address', 'city', 'state', 'category', 'duration'], 'colSemantics_s60': ['company', 'address', 'city', 'state', 'duration'], 'colSemantics_s70': ['company', 'address', 'city', 'state', 'duration'], 'colSemantics_s80': ['company', 'address', 'city', 'state'], 'colSemantics_s90': ['company', 'address', 'city', 'state'], 'colSemantics_s95': ['company', 'address', 'c

## 3. Semantics Extraction
Begin predicting semantics for all data files.

In [13]:
_ = os.listdir(DIR_OUTPUT) # retrieve the filenames already in the output folder
enrich_list = [] # initialise the list to store the extraction output of each data file
output_filenames = [] # initialise the list to store filenames that have been processed

### Get the list of latest extracted semantics
If you are processing a large volume of files and need to resume from the latest run, use this step. 
If you want a fresh run each time, skip it. 

In [14]:
try:
    output_list = [DIR_OUTPUT + str(id) for id in _ if 'enriched_part_' in id]
    logging.info('Number of Output Files: {}'.format(len(output_list)))
    print('Number of Output Files: {}'.format(len(output_list)))

    latest_output = max(output_list, key=os.path.getctime)
    logging.info('Latest Output Filename: {}'.format(latest_output))
    print('Latest Output Filename: {}'.format(latest_output))

    output_df = pd.read_csv(latest_output)
    output_filenames = output_df['data_filename'].tolist()
    enrich_list = output_df.to_dict('records')
    
    logging.info('Number of Extracted Dataset: {}'.format(len(output_filenames)))
    print('Number of Extracted Dataset: {}'.format(len(output_filenames)))

except Exception as e:
    logging.error('Unable to retrieve latest output')
    print('Unable to retrieve latest output')
    
    logging.error(e, exc_info=True)
    print(e)
    pass

output_filenames[:5]

Number of Output Files: 292
Latest Output Filename: /Users/mmargaret/Documents/[UVA] Thesis/sherlock-project/output_2/enriched_part_990.csv
Number of Extracted Dataset: 991


['7cdef5be079976d589563498ba1801d9317588a120c4816961c08dbe696b6af1.text.csv',
 '2dfdae56b7e139b560c7aae93f0845bb1baf19c5a2aaa9fd56b111838808365d.text.csv',
 'f005d30ac771e9601534e78e88d1787f6021f2392979a5a6834d8e58e1ebcab2.text.csv',
 'd97dca84965ad6da866ce3c590afe9723e8d885b9f5c7ce2e2a7d640d667eb4e.text.csv',
 '74a0b0f764269f666108703102cd2556cf9bcb9399b788ab96a49e6aa234f6c5.text.csv']
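
Two details matter when resuming from a checkpoint: the "latest" file is chosen by creation time, and list-valued fields such as `colSemantics` are stored in the CSV as text, so they come back from `pd.read_csv` as strings. A minimal sketch of two optional refinements on hypothetical file names: selecting the checkpoint by its numeric suffix, and parsing a stringified list with `ast.literal_eval`. The helper `part_index` is an illustrative name.

```python
import ast
import re

# Hypothetical checkpoint names as they might appear in DIR_OUTPUT.
checkpoints = ["enriched_part_90.csv", "enriched_part_990.csv",
               "enriched_part_100.csv", "error_part_990.csv"]

def part_index(name):
    """Return the integer N from 'enriched_part_N.csv', or -1 if the name does not match."""
    match = re.fullmatch(r"enriched_part_(\d+)\.csv", name)
    return int(match.group(1)) if match else -1

latest = max(checkpoints, key=part_index)  # largest numeric suffix wins
print(latest)

# A list that went through to_csv/read_csv is a plain string; parse it back into a list.
stored = "['company', 'address', 'city']"
labels = ast.literal_eval(stored)
print(labels[0], len(labels))
```

`ast.literal_eval` only accepts Python literals, so it suits the label lists; fields whose text is not a literal would need separate handling.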

### Start Extraction 
Predict semantics for all data files in the specified folder

In [15]:
logging.info('- EXTRACT START -')

error_list=[] # reset the list that tracks filenames whose prediction failed
model_id = "sherlock"
classes = np.load(f"../model_files/classes_{model_id}.npy", allow_pickle=True)
col_csv = ['data_filename', 'colSemantics'
           , 'colSemantics_s10', 'colSemantics_s20', 'colSemantics_s30', 'colSemantics_s40', 'colSemantics_s50'
           , 'colSemantics_s60', 'colSemantics_s70', 'colSemantics_s80', 'colSemantics_s90', 'colSemantics_s95'
           , 'colSemantics_s98', 'colSemantics_s99', 'colNames', 'colScores', 'colComplete', 'colTypes', 'colLen']


In [16]:
"""
For each data file: 
- if the data file has been previously processed, skip it.
- otherwise, extract the semantics predictions, column names, etc., and store them in a list.
- whenever the index of a newly processed file is a multiple of 10, write a checkpoint file containing the predictions and errors collected so far.
"""

for i in range(0, len(file_list)):
    
    # so that it does not need to rerun existing output
    if (file_list[i] in output_filenames):
        logging.info('Existed: {} skipped'.format(file_list[i]))
        print('Existed: {} skipped'.format(file_list[i]))
        continue
        
    enrich_list += [extractIDSemanticsWithColumnNames(file_list[i])]
    if i%10==0:
        
        logging.info('i: {}'.format(i))
        sys.stdout.write('- i: {} -'.format(i))
        sys.stdout.write('\n')
        
        pd.DataFrame(enrich_list
                     , columns = col_csv).to_csv(DIR_OUTPUT +'enriched_part_' + str(i) +'.csv'
                     , index=False)
        
        pd.DataFrame(error_list
             , columns=['data_filename']).to_csv(DIR_OUTPUT + 'error_part_' + str(i) +'.csv'
             , index=False)
        
logging.info('- EXTRACT END -')
        

Extracting Features:   9%|██                     | 1/11 [00:00<00:01,  8.87it/s]

Exporting 1588 column features


Extracting Features: 100%|██████████████████████| 11/11 [00:02<00:00,  5.28it/s]
Extracting Features:  50%|████████████            | 1/2 [00:00<00:00,  7.99it/s]

- i: 0 -
Existed: 8238e6a8bbb8896f3d7e346013e6356cd6101aa03236beb1f8bbbc1326dd51a5.text.csv skipped
Exporting 1588 column features


Extracting Features: 100%|████████████████████████| 2/2 [00:00<00:00,  6.42it/s]
Extracting Features:  45%|██████████▍            | 5/11 [00:00<00:00, 48.36it/s]

Existed: 7c9bbf21459a35f1026578ed83c0133c542f86ad6ca7b3b68976fc446e83e20e.text.csv skipped
Exporting 1588 column features


Extracting Features: 100%|██████████████████████| 11/11 [00:00<00:00, 11.75it/s]
Extracting Features:   0%|                               | 0/23 [00:00<?, ?it/s]

Exporting 1588 column features


Extracting Features: 100%|██████████████████████| 23/23 [00:03<00:00,  7.34it/s]
Extracting Features:   0%|                                | 0/8 [00:00<?, ?it/s]

Existed: a0a9481da41ddd2fd7ff50e960f30b3e2683e66ee1e15861de5ef76f0322179a.text.csv skipped


Extracting Features:  12%|███                     | 1/8 [00:00<00:01,  3.94it/s]

Exporting 1588 column features


Extracting Features: 100%|████████████████████████| 8/8 [00:00<00:00,  8.12it/s]
Extracting Features:  50%|████████████            | 1/2 [00:00<00:00,  7.47it/s]

Existed: b0bbd0acf11ea708872f5275d1e73fca95cd25fe53e3457ea972d7ed49357a7c.text.csv skipped
Exporting 1588 column features


Extracting Features: 100%|████████████████████████| 2/2 [00:00<00:00,  6.27it/s]
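
In the extraction loop, the checkpoint condition `i%10==0` is only evaluated for files that are not skipped, so a skipped file at such an index also skips that checkpoint. A minimal sketch of an alternative that counts newly processed files instead; `run`, `process` and `save_checkpoint` are hypothetical names, with the last two standing in for `extractIDSemanticsWithColumnNames` and the `to_csv` calls:

```python
def run(files, already_done, process, save_checkpoint, every=10):
    """Process files not in already_done, checkpointing after every `every` new results."""
    results = []
    for name in files:
        if name in already_done:
            continue  # previously processed: skip without affecting the checkpoint count
        results.append(process(name))
        if len(results) % every == 0:
            save_checkpoint(len(results), list(results))
    return results

saved = []
files = [f"file_{n}.csv" for n in range(25)]
done = {"file_0.csv", "file_10.csv", "file_20.csv"}  # the files at indices 0, 10 and 20

results = run(files, done,
              process=lambda name: {"data_filename": name},
              save_checkpoint=lambda count, rows: saved.append(count))
print(len(results), saved)
```

With this input an index-based condition would never fire, because every index that is a multiple of 10 belongs to a skipped file; the count-based version still checkpoints after the 10th and 20th new result.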


## 4. Export 
Export the extracted semantics and the list of files with errors.

In [17]:
pd.DataFrame(enrich_list
             , columns = col_csv).to_csv(DIR_OUTPUT +'enriched_all.csv'
             , index=False)


In [18]:
pd.DataFrame(error_list
             , columns=['data_filename']).to_csv(DIR_OUTPUT + 'error_all.csv'
             , index=False)


In [19]:
logging.info('- EOF -')