# Jaccard containment

We use Jaccard containment to find 10 other datasets in NYC
Open Data whose columns overlap with those of the Vehicle Collision -
Crashes dataset.

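The Jaccard containment of a query set $Q$ in a candidate set $X$ is $\frac{|Q \cap X|}{|Q|}$, that is, the fraction of the elements of $Q$ that also appear in $X$.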
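As a small illustration of the formula, the sketch below computes containment for two toy sets. The function name `containment` and the example sets are invented for this illustration and are not part of the notebook.

```python
def containment(q, x):
    """Return |Q ∩ X| / |Q|: the fraction of Q's elements that also occur in X."""
    if not q:
        return 0.0  # assumption: an empty query is given zero containment
    return len(q & x) / len(q)

# Toy sets, invented for this example
Q = {"BOROUGH", "ZIP CODE", "LATITUDE", "LONGITUDE"}
X = {"LATITUDE", "LONGITUDE", "SCHOOL NAME"}

print(containment(Q, X))  # 2 of Q's 4 elements are in X -> 0.5
print(containment(X, Q))  # 2 of X's 3 elements are in Q -> about 0.667
```

Containment is asymmetric: it normalizes by the size of the query only, so a candidate is not penalized for having many columns the query lacks.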

### Step 1: Read the column names of every NYC Open Data dataset.

We compare column names rather than column contents for two reasons:

1. There are over 2,000 datasets and we only need to find 10 of them, so scanning full column contents would cost far more than the task requires.

2. These datasets are all about New York City, so the geographical information in them should be similar.

In [5]:
# Read findcol.txt, which contains the path and column names of each dataset on Peel HDFS
with open("findcol.txt", "r", encoding="UTF-8") as f:
    file = f.read()
print(file)



In [6]:
# Split the text into one string per dataset entry: "'path', [columns]"
l = file.split('), (')
len(l)

2627

In [7]:
print(l[0])

[('/user/CS-GY-6513/project_data/data-cityofnewyork-us.2232-dj5q.csv', ['category', 'single men', 'single women', 'total single adults', 'families with children', 'total families', 'total adults in families', 'total children', 'data period']


In [8]:
# Remove the leading "[(" left over on the first entry
l[0] = l[0].strip('[(')
print(l[0])

'/user/CS-GY-6513/project_data/data-cityofnewyork-us.2232-dj5q.csv', ['category', 'single men', 'single women', 'total single adults', 'families with children', 'total families', 'total adults in families', 'total children', 'data period']


In [9]:
print(l[2626])

'/user/CS-GY-6513/project_data/data-cityofnewyork-us.zt9s-n5aj.csv', ['DBN', 'School Name', 'Number of Test Takers', 'Critical Reading Mean', 'Mathematics Mean', 'Writing Mean'])]


In [10]:
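# Remove the trailing ")]" from the last entry. strip() treats its argument as a
# set of characters, so the column list's own closing "]" is removed as well.
l[2626] = l[2626].strip(")]")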

In [11]:
print(l[2626])

'/user/CS-GY-6513/project_data/data-cityofnewyork-us.zt9s-n5aj.csv', ['DBN', 'School Name', 'Number of Test Takers', 'Critical Reading Mean', 'Mathematics Mean', 'Writing Mean'
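The printed first and last entries suggest that findcol.txt holds the `repr` of a Python list of `(path, column_list)` tuples. If that assumption holds for the whole file, `ast.literal_eval` could parse it directly instead of splitting and stripping by hand. The sketch below uses a shortened sample string in that shape rather than the real file.

```python
import ast

# Two entries copied from the notebook output, with the column lists shortened;
# the real findcol.txt is assumed to have the same shape throughout.
sample = (
    "[('/user/CS-GY-6513/project_data/data-cityofnewyork-us.2232-dj5q.csv', "
    "['category', 'single men', 'single women']), "
    "('/user/CS-GY-6513/project_data/data-cityofnewyork-us.zt9s-n5aj.csv', "
    "['DBN', 'School Name'])]"
)

# literal_eval only evaluates Python literals (lists, tuples, strings, ...)
entries = ast.literal_eval(sample)

for path, columns in entries:
    print(path, columns)
```

Each entry then arrives as a real tuple, so column names that themselves contain commas or quotes would not need special handling.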


### Step 2: Implement Jaccard containment.

Treat all the column names of one dataset as a single string, and convert that string into a set of two-word shingles (pairs of adjacent words).

In [12]:
# e.g. "I am Sam" -> {"I am", "am Sam"}
def twoWordGram(strr):
    tokens = strr.split(' ')
    kGrams = set()
    # Add each pair of adjacent words; a set ignores duplicates on its own
    for i in range(len(tokens)-1):
        kGrams.add(tokens[i] + ' ' + tokens[i+1])
    return kGrams

In [13]:
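def jaccard_2Word(strr1,strr2):
    """Containment of strr1's two-word shingles in strr2's, as a percentage."""
    x = twoWordGram(strr1)
    y = twoWordGram(strr2)
    if not x:  # strr1 has fewer than two words, so there is nothing to match
        return 0.
    return 100.* len(x.intersection(y))/ len(x)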
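To make the behaviour of shingle-based containment concrete, here is a minimal self-contained sketch on toy strings. The helper names `shingles` and `containment_pct` and the example strings are hypothetical; they mirror the logic of the two functions in this step without replacing them.

```python
def shingles(text):
    """Set of two-word shingles: every pair of adjacent space-separated tokens."""
    tokens = text.split(' ')
    return {tokens[i] + ' ' + tokens[i + 1] for i in range(len(tokens) - 1)}

def containment_pct(query, candidate):
    """Percentage of the query's shingles that also appear in the candidate."""
    q, x = shingles(query), shingles(candidate)
    return 100.0 * len(q & x) / len(q)

query = "ZIP CODE LATITUDE LONGITUDE"  # toy query, invented for this example

print(sorted(shingles(query)))
# ['CODE LATITUDE', 'LATITUDE LONGITUDE', 'ZIP CODE']

# Shares only 'LATITUDE LONGITUDE': one of three query shingles, about 33.3
print(containment_pct(query, "DATE LATITUDE LONGITUDE LOCATION"))

# Same words in a different order share no shingle -> 0.0
print(containment_pct(query, "DATE LONGITUDE LATITUDE LOCATION"))
```

The second comparison shows that the score depends on word order: two datasets with the same columns listed in a different order can score zero.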

### Step 3: Apply Jaccard containment and find the 10 most similar datasets

In [14]:
import numpy as np
# One score per dataset entry
score = np.zeros(len(l))

In [15]:
# Query string: the column names we look for in the other datasets
column = 'BOROUGH ZIP CODE LATITUDE LONGITUDE STREET'

In [16]:
for i in range(len(l)):
    # Each entry looks like "'path', [columns]"; keep only the column part
    a = l[i].split(',',1)
    x = a[1]
    # Drop quotes, commas, and brackets so the names form one space-separated string
    for ch in "',[]":
        x = x.replace(ch, "")
    upper_str = x.upper()
    score[i] = jaccard_2Word(column,upper_str)
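The effect of this cleaning is easiest to see on one entry. The sketch below applies the same split, replace, and upper-case steps to the last entry printed in Step 1; the variable names are chosen for this illustration only.

```python
# The last entry as printed in Step 1 (after the trailing ")]" was stripped)
entry = (
    "'/user/CS-GY-6513/project_data/data-cityofnewyork-us.zt9s-n5aj.csv', "
    "['DBN', 'School Name', 'Number of Test Takers', 'Critical Reading Mean', "
    "'Mathematics Mean', 'Writing Mean'"
)

# Split once at the first comma: path on the left, column list on the right
path, columns = entry.split(',', 1)

# Remove quotes, commas, and brackets, then upper-case
for ch in "',[]":
    columns = columns.replace(ch, "")
normalized = columns.upper()

print(repr(normalized))
# ' DBN SCHOOL NAME NUMBER OF TEST TAKERS CRITICAL READING MEAN MATHEMATICS MEAN WRITING MEAN'
```

Column boundaries disappear in the normalized string, so some shingles (for example `NAME NUMBER`) span two neighbouring columns. The leading space also yields one shingle that starts with an empty token, which cannot match any shingle of the query.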

In [17]:
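# Repeatedly take the highest remaining score to collect the top 10 indices
index_array=[]
npscore = np.array(score)  # a copy, so score itself is left unchanged
for i in range(10):
    maxindex  = np.argmax(npscore)
    index_array.append(maxindex)
    npscore[maxindex]=-1  # below any real score, so this index is not picked again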

In [18]:
print(index_array)

[1284, 1221, 1251, 59, 102, 146, 290, 333, 412, 599]
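The same top-k selection can also be written with a single sort. This is an alternative sketch, not the notebook's method; the toy scores are invented, and a stable sort is requested so that tied scores keep their index order, as they do in the repeated-`argmax` loop.

```python
import numpy as np

# Toy scores, invented for this example (the notebook's array has 2627 entries)
scores = np.array([0.0, 40.0, 20.0, 40.0, 60.0, 20.0])
k = 3

# Negate so an ascending sort yields descending scores; a stable sort keeps
# tied scores in index order.
top_k = np.argsort(-scores, kind="stable")[:k]
print(top_k.tolist())  # [4, 1, 3]
```

Because many datasets can tie on the same score, the order among ties is decided by position in the file rather than by relevance.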


In [19]:
for i in index_array:
    x = l[i].split(',',1)
    print(x[0])

'/user/CS-GY-6513/project_data/data-cityofnewyork-us.h9gi-nx95.csv'
'/user/CS-GY-6513/project_data/data-cityofnewyork-us.gfej-by6h.csv'
'/user/CS-GY-6513/project_data/data-cityofnewyork-us.gsr2-xq9e.csv'
'/user/CS-GY-6513/project_data/data-cityofnewyork-us.2pg3-gcaa.csv'
'/user/CS-GY-6513/project_data/data-cityofnewyork-us.37fm-7uaa.csv'
'/user/CS-GY-6513/project_data/data-cityofnewyork-us.3ub5-4ph8.csv'
'/user/CS-GY-6513/project_data/data-cityofnewyork-us.5hsa-dfq5.csv'
'/user/CS-GY-6513/project_data/data-cityofnewyork-us.5ziv-wcy4.csv'
'/user/CS-GY-6513/project_data/data-cityofnewyork-us.72ss-25qh.csv'
'/user/CS-GY-6513/project_data/data-cityofnewyork-us.98b7-th5j.csv'


We found that some of the top 10 datasets are very small (for example, 2 KB or 16 KB), so we manually chose 10 larger datasets from the top 20 instead.
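The manual size check could in principle be expressed as a filter over the ranked list. The sketch below is only an illustration under stated assumptions: the records, the file sizes, and the `MIN_SIZE` cutoff are all hypothetical, since the notebook does not record sizes or state a threshold.

```python
# Hypothetical (path, score, size_in_bytes) records, already ranked by score;
# every value here is invented for illustration.
ranked = [
    ("a.csv", 80.0, 2_000),
    ("b.csv", 80.0, 5_000_000),
    ("c.csv", 60.0, 16_000),
    ("d.csv", 60.0, 120_000_000),
    ("e.csv", 40.0, 900_000),
]

MIN_SIZE = 100_000  # assumed cutoff; the notebook does not state one
WANTED = 2          # the notebook keeps 10; 2 keeps the example short

# Keep the ranking order, drop small files, then take the first WANTED
selected = [path for path, _, size in ranked if size >= MIN_SIZE][:WANTED]
print(selected)  # ['b.csv', 'd.csv']
```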