# Cluster Articles

## Introduction
In this notebook I will try to cluster the articles in Articles.csv. The articles come from Kaggle; the dataset contains 30279 articles tagged as "good articles". Only the first 800 rows are loaded below.
https://www.kaggle.com/urbanbricks/wikipedia-promotional-articles

In [1]:
import pandas as pd
import numpy as np

In [2]:
df_articles = pd.read_csv("./Articles.csv", nrows=800)
df_articles.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800 entries, 0 to 799
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    800 non-null    object
 1   url     800 non-null    object
dtypes: object(2)
memory usage: 12.6+ KB


In [3]:
df_articles.describe()

Unnamed: 0,text,url
count,800,800
unique,800,800
top,Adetokumboh M'Cormack as YemiKolawole Obileye ...,https://en.wikipedia.org/wiki/1948%20Winter%20...
freq,1,1


In [4]:
df_articles.head()

Unnamed: 0,text,url
0,Nycticebus linglom is a fossil strepsirrhine p...,https://en.wikipedia.org/wiki/%3F%20Nycticebus...
1,Oryzomys pliocaenicus is a fossil rodent from ...,https://en.wikipedia.org/wiki/%3F%20Oryzomys%2...
2,.hack dt hk is a series of single player actio...,https://en.wikipedia.org/wiki/.hack%20%28video...
3,The You Drive Me Crazy Tour was the second con...,https://en.wikipedia.org/wiki/%28You%20Drive%2...
4,0 8 4 is the second episode of the first seaso...,https://en.wikipedia.org/wiki/0-8-4


## Method 1: Methods from the Data Mining course at university
Using:
- Shingles
- Min-Hashing
- Locality-sensitive hashing (LSH)
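
The LSH step is listed here but not implemented in the cells that follow. As a rough sketch of the usual banding idea, assuming a signature matrix with one column per document (the shape `minHashing` returns further down): the rows are split into bands, and two documents become a candidate pair when they agree on every row of at least one band. The function name `lsh_candidate_pairs` and the example matrix are made up for illustration.

```python
from collections import defaultdict
from itertools import combinations

import numpy as np


def lsh_candidate_pairs(sig_matrix, bands):
    """Return column pairs whose signatures agree on at least one band."""
    n_rows, n_cols = sig_matrix.shape
    if n_rows % bands != 0:
        raise ValueError("number of signature rows must be divisible by bands")
    rows_per_band = n_rows // bands

    candidates = set()
    for b in range(bands):
        band = sig_matrix[b * rows_per_band:(b + 1) * rows_per_band, :]
        buckets = defaultdict(list)
        for col in range(n_cols):
            # The tuple of values in this band acts as the bucket key.
            buckets[tuple(band[:, col])].append(col)
        for cols in buckets.values():
            # Every pair of columns sharing a bucket is a candidate pair.
            candidates.update(combinations(cols, 2))
    return candidates


# Hypothetical 4-row signature matrix for three documents.
example_sig = np.array([[1, 1, 5],
                        [2, 2, 6],
                        [3, 9, 7],
                        [4, 8, 8]])
print(lsh_candidate_pairs(example_sig, bands=2))
```

More bands with fewer rows each make it easier for a pair to become a candidate; fewer, longer bands make the filter stricter.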

In [5]:
#df_articles = df_articles.head(800)

### Create Shingles

In [6]:
shingleSize = 3

In [7]:
# s = text, k = shingle size
def findShingles(s, k):
    # keep only alphanumeric characters (this also drops whitespace), lowercased
    s = ''.join(e for e in s if e.isalnum()).lower()
    shingles = set()
    # slide a window of length k over the cleaned text
    for i in range(0, len(s)-k+1):
        shingles.add(s[i:i+k])
    return shingles

In [8]:
shingles = [findShingles(x, shingleSize) for x in df_articles["text"]]
df_articles.insert(1,"Shingles", shingles)

In [9]:
df_articles.head()

Unnamed: 0,text,Shingles,url
0,Nycticebus linglom is a fossil strepsirrhine p...,"{des, lyr, lsi, san, erp, ers, vir, ign, ish, ...",https://en.wikipedia.org/wiki/%3F%20Nycticebus...
1,Oryzomys pliocaenicus is a fossil rodent from ...,"{des, lyr, ddu, lsi, san, ers, 6ph, erp, ame, ...",https://en.wikipedia.org/wiki/%3F%20Oryzomys%2...
2,.hack dt hk is a series of single player actio...,"{des, lsi, san, rho, fdi, ers, erp, pea, ctc, ...",https://en.wikipedia.org/wiki/.hack%20%28video...
3,The You Drive Me Crazy Tour was the second con...,"{des, san, ers, erp, pea, inf, ket, atu, dto, ...",https://en.wikipedia.org/wiki/%28You%20Drive%2...
4,0 8 4 is the second episode of the first seaso...,"{des, gov, san, tzo, ers, erp, pea, lix, inf, ...",https://en.wikipedia.org/wiki/0-8-4
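
Shingle sets like the ones in the table above can be compared directly with the Jaccard similarity, which is the quantity that min-hashing approximates. A minimal sketch follows; `jaccard` and `shingle_set` are hypothetical helper names, and `shingle_set` only mirrors the cleaning done in `findShingles` so that the block runs on its own.

```python
def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    if not a and not b:
        return 1.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)


def shingle_set(text, k):
    """Character k-shingles of the lowercased alphanumeric text."""
    cleaned = "".join(ch for ch in text if ch.isalnum()).lower()
    return {cleaned[i:i + k] for i in range(len(cleaned) - k + 1)}


doc_a = shingle_set("fossil rodent", 3)
doc_b = shingle_set("fossil primate", 3)
print(jaccard(doc_a, doc_b))
```

Computing this for every pair of 800 articles is feasible, but the number of pairs grows quadratically, which is the motivation for the min-hashing and LSH steps.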


### Min-Hashing
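The implementation below is currently very slow and is only practical for shingle sizes of 3 or less, because the universe (and with it the characteristic matrix) grows exponentially with the shingle size.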

In [10]:
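# s = sequence of sets (one per document)
# u = universe (list of all possible shingles)
def charMatrix(s, u):
    # rows = universe elements, columns = sets; True where the element is in the set
    matrix = np.full((len(u), len(s)), False)
    for i in range(0, len(u)):
        for ii in range(0, len(s)):
            if u[i] in s[ii]:
                matrix[i][ii] = True
    return matrix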
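
The nested loop above tests every universe element against every set. A possible speed-up, sketched here under the assumption that the universe contains no duplicates, is to look up the row of each shingle that actually occurs instead. `char_matrix_indexed` is a hypothetical name, not part of the notebook.

```python
import numpy as np


def char_matrix_indexed(sets, universe):
    """Build the boolean characteristic matrix via a row-index lookup."""
    row_of = {element: i for i, element in enumerate(universe)}
    matrix = np.zeros((len(universe), len(sets)), dtype=bool)
    for col, subset in enumerate(sets):
        for element in subset:
            row = row_of.get(element)
            if row is not None:  # elements outside the universe are skipped
                matrix[row, col] = True
    return matrix


example_sets = [{"a", "d"}, {"c"}, {"b", "d", "e"}, {"a", "c", "d"}]
print(char_matrix_indexed(example_sets, ["a", "b", "c", "d", "e"]))
```

The work is then proportional to the total number of shingles in all sets rather than to the universe size times the number of sets.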

In [11]:
# c = characteristic matrix
# h = list of hash functions
def minHashing(c, h):
    sigMatrix = np.full((len(h),np.size(c,1)),np.inf)
    #iterate over rows
    for i in range(0, np.size(c,0)):
        hashResults = [h[x](i) for x in range(0,len(h))]
        #iterate over columns/subsets
        for ii in range(0, np.size(c,1)):
            if(c[i,ii]):
                for j in range(0,np.size(sigMatrix,0)):
                    sigMatrix[j,ii] = min(sigMatrix[j,ii], hashResults[j])
    return sigMatrix
        
        

In [12]:
#a = alphabet
#l = shingle len
def createUniverse(a, l):
    if(l == 1):
        return a
    else:
        result = []
        for char in a:
            for comb in createUniverse(a, l-1):
                result.append(char+comb)
        return result
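
The recursion above enumerates every string of length `l` over the alphabet. As an alternative sketch, `itertools.product` from the standard library yields the same strings in the same order; `create_universe_product` is a hypothetical name.

```python
from itertools import product


def create_universe_product(alphabet, length):
    """All strings of the given length over the alphabet, first character varying slowest."""
    return ["".join(chars) for chars in product(alphabet, repeat=length)]


print(create_universe_product(["a", "b"], 2))  # ['aa', 'ab', 'ba', 'bb']
```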
            

### Test functions
Simple test

In [13]:
testTexts = ["ad", "c", "bde", "acd"]
testOneShingles = [findShingles(x, 1) for x in testTexts]
testUniverse = createUniverse(["a","b","c","d","e"], 1)
testCharMatrix = charMatrix(testOneShingles, testUniverse)
testSigMatrix = minHashing(testCharMatrix, [lambda x: (x+1)%np.size(testCharMatrix,0), lambda x: (3*x+1)%np.size(testCharMatrix,0)])

In [14]:
assert(testOneShingles == [{'a', 'd'}, {'c'}, {'b', 'd', 'e'}, {'a', 'c', 'd'}])
assert(testUniverse == ['a', 'b', 'c', 'd', 'e'])
expectedResult = np.array([[ True, False, False,  True],
                           [False, False,  True, False],
                           [False,  True, False,  True],
                           [ True, False,  True,  True],
                           [False, False,  True, False]])
assert(np.array_equal(testCharMatrix,expectedResult))
expectedResult =np.array([[1., 3., 0., 1.],
                          [0., 2., 0., 0.]])
assert(np.array_equal(testSigMatrix,expectedResult))
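
The signature matrix checked above can be used to estimate the Jaccard similarity of two sets as the fraction of hash functions on which their signatures agree. A minimal sketch, with `signature_similarity` as a hypothetical helper and the matrix copied from the test:

```python
import numpy as np


def signature_similarity(sig_matrix, i, j):
    """Fraction of signature rows on which columns i and j agree."""
    return float(np.mean(sig_matrix[:, i] == sig_matrix[:, j]))


# Signature matrix from the test above (two hash functions, four sets).
sig = np.array([[1., 3., 0., 1.],
                [0., 2., 0., 0.]])
print(signature_similarity(sig, 0, 3))  # sets {a, d} and {a, c, d}
print(signature_similarity(sig, 0, 1))  # sets {a, d} and {c}
```

With only two hash functions the estimate is coarse: columns 0 and 3 agree on both rows, giving 1.0, although the Jaccard similarity of {a, d} and {a, c, d} is 2/3. More hash functions tighten the estimate.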

### Calculate Dataset

In [15]:
import string
import sys
universe = createUniverse([i for i in string.printable if i.islower() or i.isnumeric()], shingleSize)

In [16]:
charMatrixRes = charMatrix(df_articles["Shingles"], universe)

In [17]:
minHash = minHashing(charMatrixRes, [lambda x: (1.34*x)%np.size(charMatrixRes,0)])
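
The call above uses a single hash function with a floating-point multiplier. For more than one signature row, one common option is a family of integer hash functions of the form `((a*x + b) % p) % n`. The sketch below is an assumption, not what the notebook does: `make_hash_functions` is a hypothetical helper, `p` is the prime 2^31 - 1, and `n_rows=46656` assumes the 36-character alphabet and shingle size 3 used above.

```python
import random


def make_hash_functions(count, n_rows, prime=2_147_483_647, seed=0):
    """Return `count` functions h(x) = ((a*x + b) % prime) % n_rows."""
    rng = random.Random(seed)
    functions = []
    for _ in range(count):
        a = rng.randrange(1, prime)
        b = rng.randrange(0, prime)
        # Default arguments bind a and b per function (avoids late binding).
        functions.append(lambda x, a=a, b=b: ((a * x + b) % prime) % n_rows)
    return functions


hash_functions = make_hash_functions(count=100, n_rows=46656)
print(len(hash_functions), hash_functions[0](0), hash_functions[0](1))
```

A list like `hash_functions` could be passed as the second argument of `minHashing`, at the cost of a proportionally longer running time.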

In [18]:
np.size(minHash)

800

In [19]:
minHash

array([[731.76,   0.  ,   0.  ,   0.  ,  52.26,   0.  ,   9.38,   0.  ,
          0.  ,   0.  ,   0.  ,   0.  ,   0.  ,   0.  ,   0.  ,   0.  ,
          0.  ,   0.  ,   0.  ,   0.  ,   0.  ,   0.  ,   0.  , 274.7 ,
          2.68,   0.  ,   0.  ,   0.  , 293.58,   0.  ,   0.  ,   0.  ,
        293.58,   0.  , 730.42,   0.  ,   0.  ,   0.  ,   0.  ,  13.4 ,
          0.  ,  13.4 ,   0.  ,  33.5 ,  24.12,  33.5 ,   0.  ,   0.  ,
         25.46,  29.48,  29.48,   0.  , 290.9 ,  24.12,   0.  ,   0.  ,
          0.  ,   0.  ,   6.7 ,   8.04,   1.34, 274.7 ,   0.  ,   0.  ,
          0.  ,   0.  ,   0.  ,   0.  ,   0.  ,   0.  ,   0.  ,   0.  ,
         17.42,   0.  ,   0.  ,  81.74,   0.  ,   1.34,   0.  , 411.5 ,
          0.  ,   0.  ,   0.  ,   0.  ,   0.  ,   0.  ,   0.  ,   0.  ,
          0.  ,  33.5 ,   0.  ,   2.68,   0.  ,   0.  ,   5.36,   0.  ,
          0.  ,   0.  ,   0.  ,   0.  ,   1.34,   1.34,   0.  , 293.58,
          0.  ,   0.  ,   0.  ,   0.  ,   0.  ,   0.  ,   0.  , 