In [1]:
import numpy as np
import pandas as pd
import csv
from nltk import ngrams
from sklearn.metrics import jaccard_score
import glob
from itertools import combinations

In [2]:
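def jaccard_similarity(list1, list2):
    # Compare as sets so that repeated shingles do not inflate the union.
    set1, set2 = set(list1), set(list2)
    union = len(set1 | set2)
    # Two empty inputs have no defined overlap; return 0.0 instead of dividing by zero.
    return len(set1 & set2) / union if union else 0.0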
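
As a quick sanity check of the set-based Jaccard definition, here is a minimal sketch on two toy sentences. The helpers `word_shingles` and `jaccard_sets` are hypothetical names introduced only for this illustration; they are not part of the notebook.

```python
def word_shingles(text, k=3):
    """Return the set of k-word shingles of a text, each shingle as a tuple."""
    tokens = text.split()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def jaccard_sets(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

doc_a = word_shingles("the quick brown fox jumps")
doc_b = word_shingles("the quick brown fox sleeps")
# Each sentence has three shingles; two of them are shared, four are distinct overall.
toy_score = jaccard_sets(doc_a, doc_b)
print(toy_score)
```

The two sentences differ only in their last word, so exactly one shingle per document is unique.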

In [3]:
corpus = []

file_list = glob.glob("data/corpus-20090418/*.txt")
for file_path in file_list:
    with open(file_path, encoding="utf8", errors='ignore') as file_input:
        doc = file_input.read()
        doc = doc.replace('\n', ' ')
        doc = doc.replace('  ', ' ')
        corpus.append(doc)

In [4]:
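nrow = len(corpus)
k = 3  # shingle length in words
list_kShingles = []
list_hashed = []
for i in range(nrow):
    tokens = corpus[i].split()
    shingles = list(ngrams(tokens, k))  # build the k-word shingles once per document
    list_kShingles.append(shingles)
    # Built-in hash() is salted per process for strings, so these values
    # differ between runs unless PYTHONHASHSEED is fixed.
    list_hashed.append([hash(shingle) for shingle in shingles])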
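
Because the built-in `hash()` of strings changes from one Python process to the next, shingle hashes computed this way cannot be compared across sessions. One possible alternative, sketched here under the assumption that a 32-bit checksum is wide enough for a corpus of this size, is `zlib.crc32`. The name `stable_shingle_hash` is hypothetical.

```python
import zlib

def stable_shingle_hash(shingle):
    """Map a tuple of words to a 32-bit integer that is identical across runs."""
    return zlib.crc32(" ".join(shingle).encode("utf8"))

tokens = "inheritance is a way to form new classes".split()
k = 3
# Zipping k shifted copies of the token list yields the same tuples as nltk.ngrams(tokens, k).
shingles = list(zip(*(tokens[i:] for i in range(k))))
hashed = [stable_shingle_hash(s) for s in shingles]
```

A wider digest (for example from `hashlib`) would lower the collision risk further at the cost of larger integers.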

In [5]:
# Collect rows in a list and build the frame once (DataFrame.append was removed in pandas 2.0).
rows = []
for i in range(nrow):
    for j in range(nrow):
        if i != j:
            rows.append({'Doc1': 'Doc'+str(i), 'Doc2': 'Doc'+str(j),
                         'Jaccard_Score': jaccard_similarity(list_hashed[i], list_hashed[j])})
jac_sim = pd.DataFrame(rows, columns=['Doc1', 'Doc2', 'Jaccard_Score'])

In [6]:
jac_sim

Unnamed: 0,Doc1,Doc2,Jaccard_Score
0,Doc0,Doc1,0.000000
1,Doc0,Doc2,0.000000
2,Doc0,Doc3,0.000000
3,Doc0,Doc4,0.002688
4,Doc0,Doc5,0.003135
...,...,...,...
9895,Doc99,Doc94,0.000000
9896,Doc99,Doc95,0.000000
9897,Doc99,Doc96,0.000000
9898,Doc99,Doc97,0.000000


In [7]:
corpus[17]

'In object-oriented programming, inheritance is a way to form new classes (instances of which are called objects) using classes that have already been defined. The inheritance concept was invented in 1967 for Simula. The new classes, known as derived classes, take over (or inherit) attribute and behaviour of the pre-existing classes, which are referred to as base classes (or ancestor classes). It is intended to help reuse existing code with little or no modification. Inheritance provides the support for representation by categorization in computer languages. Categorization is a powerful mechanism number of information processing, crucial to human learning by means of generalization (what is known about specific entities is applied to a wider group given a belongs relation can be established) and cognitive economy (less information needs to be stored about each specific entity, only its particularities). Inheritance is also sometimes called generalization, because the is-a relationships

In [8]:
corpus[35]

'In object-oriented programming, inheritance is a way to form new classes (instances of which are called objects) using classes that have already been defined. The inheritance concept was invented in 1967 for Simula. The new classes, known as derived classes, take over (or inherit) attributes and behavior of the pre-existing classes, which are referred to as base classes (or ancestor classes). It is intended to help reuse existing code with little or no modification. Inheritance provides the support for representation by categorization in computer languages. Categorization is a powerful mechanism number of information processing, crucial to human learning by means of generalization (what is known about specific entities is applied to a wider group given a belongs relation can be established) and cognitive economy (less information needs to be stored about each specific entity, only its particularities). Inheritance is also sometimes called generalization, because the is-a relationships

In [9]:
global_hash = list(set(per_hash for per_doc in list_hashed for per_hash in per_doc))

In [10]:
matrix_input = {}
for i in range(len(list_hashed)):
    doc_hashes = set(list_hashed[i])  # set membership is O(1); list membership is O(n)
    matrix_input['Doc'+str(i)] = [ 1 if single_hash in doc_hashes else 0 for single_hash in global_hash ]

In [11]:
chr_matrix = pd.DataFrame(matrix_input, index=global_hash).reset_index(drop=True)

In [12]:
chr_matrix

Unnamed: 0,Doc0,Doc1,Doc2,Doc3,Doc4,Doc5,Doc6,Doc7,Doc8,Doc9,...,Doc90,Doc91,Doc92,Doc93,Doc94,Doc95,Doc96,Doc97,Doc98,Doc99
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13881,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
13882,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
13883,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
13884,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [13]:
signature_num = 100
nrow_matrix = chr_matrix.shape[0]
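# Modulus for the hash functions. Note that 13887 = 3**2 * 1543 is composite, so
# (a*i + b) % 13887 is not guaranteed to permute the rows; the smallest prime
# that is >= the 13885 rows of chr_matrix is 13901.
prime = 13887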
np.random.seed(12345)
coeff_a = np.random.choice(nrow_matrix, size=signature_num, replace=False)
coeff_b = np.random.choice(nrow_matrix, size=signature_num, replace=False)

In [14]:
coeff_a

array([ 6176,  2748, 13072, 12584,  1743,  8400,  5940,  7680, 12358,
        2416,  6231,  8309,  8381,  7751,  3031,  5988, 11400,  4427,
       11975,  9887,  1432, 10742, 12388,  5090,  3892, 11369,  5232,
       11916,  2899, 10046, 13565,   659,  1612,  3997,  4473, 13848,
        4529,  2473, 13075,  1975, 10706,  2224,   378,  1846,  5530,
        8418,  6309,  4980,  7756,  2239,  9512,   101,  3630, 11963,
        1226, 12465, 13863,  5279,  9061,  3336,  1546,   467,  7203,
        6328, 11701,  3924,  3164, 11926,  2536,   658,  7008,  3060,
        2660, 10680,  2799,  3964,  8025, 10063,  8601, 13246, 11954,
        3707,  8836,  2410,  8998,  5417,  1023,  8222,  1417,  6859,
        8983,   294, 11942,  9002,  2875,  2983,  6152,  9434,  7745,
        2451])

In [15]:
coeff_b

array([ 2831, 11313,   600,   279,  3024,  8522,  3750,  6643,  1818,
        6222, 12436,  3241,  6878, 11984,  6393,  9401,  7603, 12897,
       13515,  7506, 13297,  4309, 11655,  6306,  9324,  3156,  3936,
        7037, 12501,  3667,  7730,  2130, 11570,  1486,  5787, 10204,
        5577,  9608,  3597,  8955, 11303, 10027,  1891, 10150, 10361,
        5077,  6854, 13761,  4277, 13278,  7741,  5006, 11544, 12289,
       13351,  3669,  4402,  5841,  1333,  1124, 12766, 12067,  2402,
       11787,  4273, 11260,  8887, 11992, 13082, 10387,  2285,  9789,
        3920,   762,  2472,  6478,  7852, 11537,  2695,  3578,  1096,
        5397,  9224, 12828, 10255, 12420, 10867,  2181,   416, 11116,
          87, 12189,   192,  4281,  9258, 10219, 11968,   395, 13266,
       10840])

In [16]:
# Row i, column j holds h_j(i) = (coeff_a[j] * i + coeff_b[j]) % prime.
# Broadcasting computes the whole table at once and avoids DataFrame.append (removed in pandas 2.0).
row_ids = np.arange(nrow_matrix).reshape(-1, 1)
matrix_permutation = pd.DataFrame(
    (coeff_a * row_ids + coeff_b) % prime,
    columns=['Hash'+str(j) for j in range(signature_num)],
)

In [17]:
matrix_permutation

Unnamed: 0,Hash0,Hash1,Hash2,Hash3,Hash4,Hash5,Hash6,Hash7,Hash8,Hash9,...,Hash90,Hash91,Hash92,Hash93,Hash94,Hash95,Hash96,Hash97,Hash98,Hash99
0,2831,11313,600,279,3024,8522,3750,6643,1818,6222,...,87,12189,192,4281,9258,10219,11968,395,13266,10840
1,9007,174,13672,12863,4767,3035,9690,436,289,8638,...,9070,12483,12134,13283,12133,13202,4233,9829,7124,13291
2,1296,2922,12857,11560,6510,11435,1743,8116,12647,11054,...,4166,12777,10189,8398,1121,2298,10385,5376,982,1855
3,7472,5670,12042,10257,8253,5948,7683,1909,11118,13470,...,13149,13071,8244,3513,3996,5281,2650,923,8727,4306
4,13648,8418,11227,8954,9996,461,13623,9589,9589,1999,...,8245,13365,6299,12515,6871,8264,8802,10357,2585,6757
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13881,7436,8712,5490,8097,6453,13670,9771,2224,10992,5613,...,1737,10425,11862,5817,5895,6208,2830,13226,8457,10021
13882,13612,11460,4675,6794,8196,8183,1824,9904,9463,8029,...,10720,10719,9917,932,8770,9191,8982,8773,2315,12472
13883,5901,321,3860,5491,9939,2696,7764,3697,7934,10445,...,5816,11013,7972,9934,11645,12174,1247,4320,10060,1036
13884,12077,3069,3045,4188,11682,11096,13704,11377,6405,12861,...,912,11307,6027,5049,633,1270,7399,13754,3918,3487


In [18]:
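signature_rows = []
for i in range(signature_num):
    idx = matrix_permutation['Hash'+str(i)].to_numpy()
    # Reorder the characteristic matrix once per hash function
    # (labels that are not in the index become all-NaN rows).
    permuted = chr_matrix.reindex(idx)
    # For each document, record the first position in the permuted order that holds a 1.
    signature_rows.append({
        'Doc'+str(j): np.where(permuted['Doc'+str(j)].to_numpy() == 1)[0].min()
        for j in range(nrow)
    })
matrix_signature = pd.DataFrame(signature_rows, columns=['Doc'+str(j) for j in range(nrow)])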
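
The cell above materialises every permutation of the characteristic matrix. A common alternative, sketched here as an assumption rather than as the notebook's method, keeps only the row indices where each document has a 1 and takes the minimum hash value over those rows. Note the difference: this variant stores the minimum hash value itself, not the position in the permuted order. The function name `minhash_signatures` and the toy inputs are hypothetical, and every document is assumed to contain at least one shingle.

```python
import numpy as np

def minhash_signatures(doc_rows, coeff_a, coeff_b, prime):
    """doc_rows: list of 1-D integer arrays with the row indices where each document has a 1.
    Returns an array of shape (num_hashes, num_docs)."""
    sig = np.empty((len(coeff_a), len(doc_rows)), dtype=np.int64)
    for d, rows in enumerate(doc_rows):
        # hashes[j, t] = (coeff_a[j] * rows[t] + coeff_b[j]) % prime
        hashes = (np.outer(coeff_a, rows) + coeff_b[:, None]) % prime
        sig[:, d] = hashes.min(axis=1)
    return sig

toy_a = np.array([1, 3])
toy_b = np.array([0, 2])
# Documents 0 and 1 contain the same rows, document 2 contains different ones.
toy_docs = [np.array([0, 2]), np.array([0, 2]), np.array([1, 4])]
toy_sig = minhash_signatures(toy_docs, toy_a, toy_b, prime=5)
print(toy_sig)
```

Identical documents receive identical signature columns, which is the property the later similarity estimate relies on.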

In [19]:
matrix_signature

Unnamed: 0,Doc0,Doc1,Doc2,Doc3,Doc4,Doc5,Doc6,Doc7,Doc8,Doc9,...,Doc90,Doc91,Doc92,Doc93,Doc94,Doc95,Doc96,Doc97,Doc98,Doc99
0,13,620,43,154,14,63,43,32,202,171,...,29,17,16,56,53,35,34,0,46,19
1,52,129,426,371,26,9,138,40,9,264,...,233,93,48,66,92,267,18,76,4,3
2,88,26,515,1,23,36,20,21,7,16,...,154,39,2,24,50,45,57,107,48,272
3,17,95,9,4,0,75,9,154,68,315,...,98,32,34,63,113,22,13,97,143,16
4,22,171,39,164,19,125,39,147,56,10,...,334,0,87,371,28,181,48,27,28,54
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,11,226,22,29,147,206,22,140,34,0,...,104,76,14,27,3,14,188,32,30,83
96,49,51,134,166,48,5,37,3,5,165,...,189,61,145,6,86,76,283,48,82,215
97,36,25,146,47,19,1,70,40,72,16,...,57,98,75,76,32,102,181,34,32,50
98,32,49,181,146,117,172,11,36,97,81,...,184,35,90,4,42,77,246,88,27,17


In [20]:
rows = []
for i in range(nrow):
    for j in range(nrow):
        if i != j:
            # Fraction of signature rows on which the two documents agree.
            score = matrix_signature['Doc'+str(i)].eq(matrix_signature['Doc'+str(j)]).sum() / signature_num
            rows.append({'Doc1': 'Doc'+str(i), 'Doc2': 'Doc'+str(j), 'Signature_Score': score})
sign_sim = pd.DataFrame(rows, columns=['Doc1', 'Doc2', 'Signature_Score'])

In [21]:
sign_sim

Unnamed: 0,Doc1,Doc2,Signature_Score
0,Doc0,Doc1,0.0
1,Doc0,Doc2,0.0
2,Doc0,Doc3,0.0
3,Doc0,Doc4,0.0
4,Doc0,Doc5,0.0
...,...,...,...
9895,Doc99,Doc94,0.0
9896,Doc99,Doc95,0.0
9897,Doc99,Doc96,0.0
9898,Doc99,Doc97,0.0


In [75]:
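num_cols = matrix_signature.shape[1]
candidates = set()
b = 25   # number of bands
r = 4    # signature rows per band (b * r = signature_num)
s = 0.8  # minimum fraction of matching rows within a band; with r = 4 this requires all four rows to match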
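
With `r = 4` and `s = 0.8`, a pair becomes a candidate only if some band matches on all four rows. Under that reading, and assuming each signature row agrees independently with probability equal to the Jaccard similarity, the chance of a pair being selected follows the usual banding S-curve. The sketch below evaluates it; `candidate_probability` is a hypothetical helper name.

```python
def candidate_probability(sim, rows_per_band, bands):
    """Probability that two documents with Jaccard similarity `sim`
    agree on every row of at least one band."""
    return 1.0 - (1.0 - sim ** rows_per_band) ** bands

bands, rows_per_band = 25, 4
# Approximate similarity at which the curve rises most steeply: (1/b) ** (1/r).
threshold = (1.0 / bands) ** (1.0 / rows_per_band)
print("approximate threshold:", round(threshold, 3))
for sim in (0.2, 0.4, 0.6, 0.8):
    print(sim, round(candidate_probability(sim, rows_per_band, bands), 3))
```

Raising `r` pushes the threshold up (fewer false positives), while raising `b` pushes it down (fewer false negatives).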

In [76]:
for i in range(b):
    band = matrix_signature.iloc[i * r : (i + 1) * r]  # signature rows belonging to band i
    for c1, c2 in combinations(range(num_cols), 2):
        # Fraction of rows in this band on which the two documents agree.
        sims = band.iloc[:, c1].eq(band.iloc[:, c2]).sum()/r
        if sims >= s:
            candidates.add((c1, c2))
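
The loop above still compares every pair of documents in every band. A bucket-based variant, sketched here as an alternative technique and not as the notebook's method, groups documents by their exact band slice so that only documents sharing a bucket are paired. It corresponds to the case where all rows of a band must match. The function name `lsh_candidates` and the toy matrix are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
import numpy as np

def lsh_candidates(signatures, bands, rows_per_band):
    """signatures: array of shape (bands * rows_per_band, num_docs).
    Returns the set of (doc_i, doc_j) pairs, i < j, that share a bucket in any band."""
    found = set()
    for band in range(bands):
        block = signatures[band * rows_per_band:(band + 1) * rows_per_band]
        buckets = defaultdict(list)
        for doc in range(block.shape[1]):
            # Documents with an identical band slice get the same bucket key.
            buckets[tuple(block[:, doc])].append(doc)
        for docs in buckets.values():
            found.update(combinations(docs, 2))
    return found

# Toy signature matrix: 4 rows (2 bands of 2 rows), 3 documents.
toy = np.array([[1, 1, 7],
                [2, 2, 8],
                [3, 4, 9],
                [5, 6, 9]])
toy_candidates = lsh_candidates(toy, bands=2, rows_per_band=2)
print(toy_candidates)
```

Applied to this notebook, the input would presumably be `matrix_signature.to_numpy()` with `bands=b` and `rows_per_band=r`.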

In [65]:
corpus[next(iter(candidates))[0]]

' In probability theory, Bayes\' theorem also called Bayes\' law after Rev Thomas Bayes compares the conditional and marginal probabilities of two random events. It is often used to calculate posterior probabilities given observations. For example, a patient may be observed to have certain symptoms. Bayes\' theorem can be used to calculate the likelihood that a proposed analysis is accurate, given that observation. As an official theorem, Bayes\' theorem is valid in all universal interpretations of probability. However, it plays a fundamental role in the debate around the foundations of statistics: frequentist and Bayesian interpretations disagree about the ways in which probabilities should be assigned in applications. Frequentists assign probabilities to random events according to their frequencies of happening or to subsets of populations as proportions of the whole. Whilst Bayesians describe probabilities in terms of beliefs and degrees of uncertainty. The articles on Bayesian prob

In [70]:
corpus[next(iter(candidates))[1]]

"In probability theory, Bayes' theorem (or Bayes' law after Rev Thomas Bayes) provides relation between the conditional and marginal probabilities of two random events. It is usually used to calculate posterior probabilities given observations. For example: a patient might be observed to show certain symptoms. Bayes' theorem could be used to compute the probability that a certain diagnosis is right, given that observation. Since it is a formal theorem, Bayes' theorem holds in all popular interpretations of probability. Bayes' theorem relates the conditional and marginal probabilities of events a and b, where b has a non-vanishing probability:   P(a|b) = P(a|b)P(a)/P(b) Terms in Bayes' theorem are named by a convention: P(A) is the prior probability or marginal probability of A. It does not take into account any information about B and therefore is considered prior. P(A|B) is the conditional probability of A, given B. It it is derived from or depends upon the specified value of B. Usual

In [67]:
candidates

{(6, 56),
 (6, 62),
 (6, 88),
 (9, 62),
 (13, 63),
 (17, 35),
 (17, 41),
 (17, 53),
 (17, 80),
 (22, 43),
 (22, 50),
 (28, 57),
 (28, 79),
 (30, 88),
 (35, 41),
 (35, 53),
 (35, 80),
 (41, 53),
 (41, 80),
 (42, 61),
 (43, 50),
 (53, 80),
 (56, 62),
 (57, 63),
 (57, 79),
 (57, 86),
 (63, 86),
 (78, 91)}