# Data Discovery Demonstration

In this demo we analyze a small dataset extracted from data.gov.
The dataset consists of 6 CSV files and is 10 MB in total.
The files are related to X and Y and Z.

### Pre-Processing: Loading data and creating signatures

In [1]:
import api as API
API.load_precomputed_model()

Loading signature collections...
Done deserialization of signature collections!
Loading signature collections...DONE!
Loading graph...
Done deserialization of signature collection!
Loading graph...DONE!
Loading dataset columns...
Done deserialization of dataset columns!
Loading dataset columns...DONE!


## Data Discovery Primitives

We introduce the following primitives for data discovery:
* Columns similar to X: Return all columns similar to the one provided (for both textual and numerical values)
* Structural similarity queries: Return all columns that are structurally similar to the column or value provided
* Columns similar to X->Y: Given a relation X->Y, return the columns that can replace Y while preserving the relation

##### Column Similarity

The column similarity primitive returns all columns in the dataset that are similar to the one provided. For textual columns, a similar column is one with a similar term distribution and content. For numerical columns, a similar column is one with similar descriptive statistics and a similar data distribution.
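To make the idea of a "similar term distribution" concrete, here is a minimal sketch that compares two textual columns by the cosine similarity of their term counts. This is only an illustration: the helper names `term_counts` and `cosine_similarity` and the sample values are hypothetical, and the demo does not state which measure `API.columns_similar_to` actually uses for text.

```python
import math
import re
from collections import Counter


def term_counts(values):
    """Count lowercase word tokens across all cell values of a column."""
    counts = Counter()
    for value in values:
        counts.update(re.findall(r"\w+", str(value).lower()))
    return counts


def cosine_similarity(a, b):
    """Cosine similarity between two term-count mappings (0.0 to 1.0)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty column is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical column contents, not taken from the demo dataset.
geography = ["Alabama", "Alaska", "New York"]
state = ["New York", "Alabama", "Texas"]
amounts = ["1200", "530", "88"]

print(cosine_similarity(term_counts(geography), term_counts(state)))
print(cosine_similarity(term_counts(geography), term_counts(amounts)))
```

Two columns that share most of their terms score close to 1, while columns with no terms in common score 0, which is the kind of ranking a textual similarity query needs.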

Example A: Find columns similar to "Geography", which is a **textual** column

In [2]:
filename = "STC_2014_STC005_with_ann.csv"
columnname = "Geography"
API.columns_similar_to(filename, columnname, "ks")

[('Income_Tax_Components_by_Size_of_Income_by_Place_of_Residence__Beginning_Tax_Year_1999.csv',
  'Place of Residence'),
 ('STC_2014_STC005_with_ann.csv', 'Geography'),
 ('STC_2014_STC006.US01_with_ann.csv', 'Government'),
 ('SAMHSA_Synar_Reports__Youth_Tobacco_Sales.csv', 'LocationDesc'),
 ('The_Tax_Burden_on_Tobacco_Volume_49__1970-2014.csv', 'LocationDesc'),
 ('NYS_Liquor_Authority_New_Applications_Received.csv', 'City'),
 ('Income_Tax_Components_by_Size_of_Income_by_Place_of_Residence__Beginning_Tax_Year_1999.csv',
  'State')]

Example B: Find columns similar to "License Taxes - Alcoholic Beverages License", which is a **numerical** column

In [7]:
columnname = "License Taxes - Alcoholic Beverages License"
API.columns_similar_to(filename, columnname, "ks")

[('STC_2014_STC005_with_ann.csv', 'License Taxes - Other License Taxes'),
 ('STC_2014_STC005_with_ann.csv',
  'Sales and Gross Receipts Taxes - Selective Sales and Gross Receipts Taxes - Amusements Sales Tax'),
 ('STC_2014_STC005_with_ann.csv',
  'License Taxes - Alcoholic Beverages License')]
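The `"ks"` argument passed in the examples above is not explained in this demo. Assuming it selects a two-sample Kolmogorov-Smirnov comparison of value distributions, the following sketch shows how such a statistic could be computed for two numerical columns. The function name `ks_statistic` and the sample values are hypothetical and not part of the demo's `api` module.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical cumulative distribution functions of the two samples."""
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(sample_a), sorted(sample_b)
    max_gap = 0.0
    # The empirical CDFs only change at observed values, so checking
    # every observed value is enough to find the maximum gap.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap


# Hypothetical column values, not taken from the demo dataset.
license_tax_a = [10, 20, 30, 40]
license_tax_b = [12, 22, 28, 41]
population = [1000, 2000, 3000, 4000]

print(ks_statistic(license_tax_a, license_tax_b))  # small gap: similar
print(ks_statistic(license_tax_a, population))     # large gap: dissimilar
```

The statistic ranges from 0 (identical empirical distributions) to 1 (no overlap in value ranges), so under this assumption a similarity query would keep the columns whose statistic falls below some threshold.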

##### Structural Similarity

In [3]:
concept = ("STC_2014_STC005_with_ann.csv", "Geography")

In [6]:
API.give_related_concepts(concept)

[('STC_2014_STC005_with_ann.csv', 'Geography'),
 ('Income_Tax_Components_by_Size_of_Income_by_Place_of_Residence__Beginning_Tax_Year_1999.csv',
  'State'),
 ('The_Tax_Burden_on_Tobacco_Volume_49__1970-2014.csv', 'LocationDesc'),
 ('STC_2014_STC006.US01_with_ann.csv', 'Government'),
 ('SAMHSA_Synar_Reports__Youth_Tobacco_Sales.csv', 'LocationDesc'),
 ('NYS_Liquor_Authority_New_Applications_Received.csv', 'City'),
 ('Income_Tax_Components_by_Size_of_Income_by_Place_of_Residence__Beginning_Tax_Year_1999.csv',
  'Place of Residence'),
 ('STC_2014_STC005_with_ann.csv', 'Geography')]