# Data Discovery Demonstration

The **goal** of *data discovery* is to find and suggest datasets that are relevant to answering a question. To that end, we propose an interface through which users of the *discovery system* explore the datasets available in their organization. Users choose among different **primitives**, each designed to capture a particular intuition. Examples include "give me any column that contains this term" (keyword search) and "show me all columns that are *similar* to this one".

In this demo we use some of these primitives to analyze a small dataset (50MB) consisting of 10 files extracted from data.gov. The files cover three main topics: (i) taxes, (ii) taxes on liquor and tobacco, and (iii) contracts, grants, and similar awards made to institutions. *Note that, in general, the data discovery system does not need to own the data; it only needs read access to it.*


One important goal is to *understand the minimum set of primitives needed to approach data discovery*.

##### Pre-Processing: Loading data and creating signatures

In [24]:
import api as API
API.load_precomputed_model()

Loading signature collections...
Done deserialization of signature collections!
Loading signature collections...DONE!
Loading graph...
Done deserialization of signature collection!
Loading graph...DONE!
Loading simrank matrix...
Done deserialization of simrank matrix!
Loading simrank matrix...DONE!
Loading dataset columns...
Done deserialization of dataset columns!
Loading dataset columns...DONE!


## Data Discovery Primitives

We introduce four primitives for data discovery:
1. **Keyword search**: Traditional keyword search over all available datasets.
2. **Column similarity**: Return all columns similar to column X (for both textual and numerical values).
3. **Structural similarity**: Return all columns similar to column X, where similarity captures structural knowledge.
4. **Relation similarity**: Given X and Y, find the relation f(X->Y). Then find Z such that the relation s(X->Z) is similar to f.

We demonstrate two of them: **column similarity** and **structural similarity**.
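Keyword search is not demonstrated in this notebook, so the following is only a minimal sketch of how such a primitive could work, assuming an inverted index from tokens to columns. The names `build_keyword_index` and `keyword_search`, the in-memory `tables` layout, and the toy data are all hypothetical and are not part of the `API` module used here.

```python
from collections import defaultdict


def build_keyword_index(tables):
    """Map each lowercase token to the (file, column) pairs that contain it.

    `tables` is a hypothetical layout: {filename: {column_name: [values]}}.
    """
    index = defaultdict(set)
    for filename, columns in tables.items():
        for column, values in columns.items():
            for value in values:
                for token in str(value).lower().split():
                    index[token].add((filename, column))
    return index


def keyword_search(index, term):
    """Return the sorted (file, column) pairs whose values contain `term`."""
    return sorted(index.get(term.lower(), set()))


# Toy data, invented for illustration only
tables = {
    "taxes.csv": {"State": ["New York", "Ohio"], "Amount": [10, 20]},
    "licenses.csv": {"City": ["New York", "Albany"]},
}
index = build_keyword_index(tables)
hits = keyword_search(index, "york")
print(hits)
```

The lookup is a single dictionary access, which is why an index of this kind is a common choice for "give me any column that contains this term" queries.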

### Column Similarity

The column similarity primitive returns all columns in the dataset that are similar to the one provided. For textual columns, a similar column is one with a similar term distribution and content. For numerical columns, a similar column is one with a similar data distribution.
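To make the two notions concrete, here is a minimal sketch under stated assumptions: cosine similarity over term counts stands in for "similar term distribution", and the two-sample Kolmogorov-Smirnov statistic stands in for "similar data distribution". The `"ks"` argument passed to `API.columns_similar_to` may refer to such a test, but the notebook does not say so, and both helper functions below are hypothetical rather than the system's actual implementation.

```python
import math
from bisect import bisect_right
from collections import Counter


def term_cosine_similarity(values_a, values_b):
    """Cosine similarity between the term-count vectors of two text columns."""
    counts_a = Counter(t for v in values_a for t in str(v).lower().split())
    counts_b = Counter(t for v in values_b for t in str(v).lower().split())
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[t] * counts_b[t] for t in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    gap = 0.0
    # The largest gap between two step functions occurs at an observed value
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Toy columns, invented for illustration only
text_score = term_cosine_similarity(["New York", "Ohio"], ["Ohio", "New York"])
numeric_gap = ks_statistic([1, 2, 3], [10, 11, 12])
```

Under this sketch, a higher cosine score means more similar textual columns, while a smaller KS statistic means more similar numerical distributions, so the two scores would need different threshold directions.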

---

***Example A***: Which columns refer to US states? Find columns similar to "Geography", which is a **textual** column.

In [25]:
filename = "STC_2014_STC005_with_ann.csv"
columnname = "Geography"
res = API.columns_similar_to(filename, columnname, "ks")
API.print_result(res)

**STC_2014_STC006.US01_with_ann.csv**

   Government


**The_Tax_Burden_on_Tobacco_Volume_49__1970-2014.csv**

   LocationDesc


**CDC_STATE_System_Tobacco_Legislation_-_Tax.csv**

   LocationDesc


**Income_Tax_Components_by_Size_of_Income_by_Place_of_Residence__Beginning_Tax_Year_1999.csv**

   State
   Place of Residence


**STC_2014_STC005_with_ann.csv**

   Geography


**datafeeds\2000_NASA_Contracts_Full_20151015.csv**

   statecode


**SAMHSA_Synar_Reports__Youth_Tobacco_Sales.csv**

   LocationDesc


**NYS_Liquor_Authority_New_Applications_Received.csv**

   City


---

***Example B***: Which columns in the dataset are taxes? Find columns similar to "License Taxes - Alcoholic Beverages License", which is a **numerical** column.

In [26]:
filename = "STC_2014_STC005_with_ann.csv"
columnname = "License Taxes - Alcoholic Beverages License"
res = API.columns_similar_to(filename, columnname, "ks")
API.print_result(res)

**STC_2014_STC005_with_ann.csv**

   Sales and Gross Receipts Taxes - Selective Sales and Gross Receipts Taxes - Amusements Sales Tax
   License Taxes - Alcoholic Beverages License
   License Taxes - Other License Taxes


### Structural Similarity

The structural similarity primitive returns any column that is *structurally similar* to the one provided. Structural similarity relies on the structural context in which relations between objects occur, and captures relationships among those objects. Our primitive relies on a graph in which nodes are columns and an edge connects two nodes whenever their similarity is higher than a given threshold. On this graph, we run SimRank to compute structural similarity and answer queries that require this primitive.
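The construction described above can be sketched as follows. This is an illustrative, unoptimized version of the standard SimRank iteration on an undirected threshold graph, not the code behind `API.give_structural_sim_of`. The function names, the `decay` and `iterations` defaults, and the toy similarity scores are assumptions made for the example.

```python
def build_similarity_graph(pairwise_similarity, threshold):
    """Connect two columns when their pairwise similarity exceeds `threshold`."""
    neighbors = {}
    for (u, v), score in pairwise_similarity.items():
        neighbors.setdefault(u, set())
        neighbors.setdefault(v, set())
        if score > threshold:
            neighbors[u].add(v)
            neighbors[v].add(u)
    return neighbors


def simrank(neighbors, decay=0.8, iterations=10):
    """Iterative SimRank: two nodes are similar if their neighbors are similar."""
    nodes = sorted(neighbors)
    sim = {(u, v): 1.0 if u == v else 0.0 for u in nodes for v in nodes}
    for _ in range(iterations):
        updated = {}
        for u in nodes:
            for v in nodes:
                if u == v:
                    updated[(u, v)] = 1.0
                elif not neighbors[u] or not neighbors[v]:
                    # A node without neighbors has no structural context
                    updated[(u, v)] = 0.0
                else:
                    total = sum(
                        sim[(a, b)] for a in neighbors[u] for b in neighbors[v]
                    )
                    scale = len(neighbors[u]) * len(neighbors[v])
                    updated[(u, v)] = decay * total / scale
        sim = updated
    return sim


# Toy pairwise scores, invented for illustration only
pairwise = {
    ("A", "hub"): 0.9,
    ("B", "hub"): 0.8,
    ("A", "B"): 0.1,
    ("C", "hub"): 0.2,
}
graph = build_similarity_graph(pairwise, threshold=0.5)
scores = simrank(graph)
```

In this toy graph, columns `A` and `B` are not directly similar (0.1 is below the threshold), yet they receive a SimRank score of 0.8 because both are linked to `hub`. Column `C` has no edges and therefore scores 0 against every other column.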

In [None]:
concept = ("datafeeds\\2000_NASA_Contracts_Full_20151015.csv", "vendorname")
res = API.give_structural_sim_of(concept)
API.print_result(res)

In [None]:
concept = ("Contracts_matchUSASpending.csv", "Agency Name")
res = API.give_structural_sim_of(concept)
API.print_result(res)

### The advantage of context

Column similarity returns any column whose similarity to the given column is higher than the threshold when columns are compared pairwise. Structural similarity additionally captures information about the context. For example ...

First we run **structural similarity**

In [27]:
concept = ("NYS_Liquor_Authority_New_Applications_Received.csv", "License Received Date")
res = API.give_structural_sim_of(concept)
API.print_result(res)

**NYS_Liquor_Authority_New_Applications_Received.csv**

   License Received Date


---
And **column similarity**

In [28]:
res = API.columns_similar_to("NYS_Liquor_Authority_New_Applications_Received.csv", "License Received Date", "ks")
API.print_result(res)

**Contracts.csv**

   Timestamp (Contract)
   Contract End Date (USAspending)
   Contract Start Date (USAspending)
   Timestamp (Base Contract)


**NYS_Liquor_Authority_New_Applications_Received.csv**

   License Received Date

