# Aga Patro - lab 3

## Zadanie 1. Zaimplementuj przynajmniej 3 "metryki"

In [12]:
import numpy as np

### 1.1 cosinusowa

$$
\begin{equation}
    \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \|\mathbf{B}\|}= \frac{\sum\limits_{i=1}^{n} A_i B_i}{\sqrt{\sum\limits_{i=1}^{n} A_i^2} \sqrt{\sum\limits_{i=1}^{n} B_i^2}}
    \qquad\begin{aligned}
    &\text{where:} \\
    &\mathbf{A}\text{ and }\mathbf{B} \text{ are the two vectors being compared}\\
    &n \text{ is the dimensionality of the vectors}\\
    &\theta \text{ represents the angle between two vectors } \mathbf{A} \text{ and } \mathbf{B} \text{ in a high-dimensional space}
    \end{aligned}
\end{equation}
$$


In [13]:
def cos_metric(A, B):
    """
    Counts cosius metric between two vectors A and B
    """
    return np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))

### 1.2 Dice coefficient

$$
\begin{equation}
    \text{Dice}(A, B) = \frac{2 |A \cap B|}{|A| + |B|} 
    \qquad\begin{aligned}
    &\text{where:} \\
    &A \text{ and } B \text{ represent the two sets being compared} \\
    &|A| \text{ and } |B| \text{ represent the cardinality (number of elements) of the sets} \\
    &\text{and } |A \cap B| \text{ represents the size of the intersection of the two sets}
    \end{aligned}
\end{equation}
$$

In [14]:
def dice_metric(A, B):
    """
    Counts dice metric between two vectors A and B
    """
    intersection = np.intersect1d(A, B)
    intersection_size = len(intersection)
    return (2.0 * intersection_size) / (len(A) + len(B))


### 1.4 eukalidusowa

$$
\begin{equation}
    d(x,y) = \sqrt{\sum_{i=1}^{n}(x_i-y_i)^2}
    \qquad\begin{aligned}
    &\text{where:} \\
    &d(x,y) \text{ is the Euclidean distance} \\
    &x_i, y_i \text{ are the values of the i-th dimension of vectors } x \text{ and } y \\
    &n \text{ is the number of dimensions in the vectors}
    \end{aligned}
\end{equation}
$$

In [21]:
def euclidean_metric(A, B):
    """
    Counts eucalides metric between two vectors A and B
    """
    result = 0
    for i in range(len(A)):
        result += (A[i]-B[i])**2

    return np.sqrt(result)

## Zadanie 2. Zaimplementuj sposób oceny jakości klasteryzacji 

### Davies-Bouldin

Centroid klastra -> średnia pozycja wszystkich punktów należących do klastra.
Dla klastra o n punktach i d wymiarach, gdzie $ \mathbf{x}_i $ oznacza i-ty punkt w klastrze:

$  c = \frac{1}{n}\sum_{i=1}^{n} \mathbf{x}_i $

Odległość między centroidami -> odległość euklidesowa pomiędzy centroidami dwóch różnych klastrów.
Dla klastrów $ C_i $ i $ C_j $ :

$  \Delta_{ij} =|| C_{i} - C_{j}||_{2} $


Odległość wewnętrzna -> średnia odległość punktów w klastrze od jego centroidu. Dla klastra $ C_i $ :

$ s_{i} = \frac{1}{n_{i}}\sum_{x \epsilon C_{i}}^{} ||x - c_{i}||_{2} $


Współczynnik Daviesa-Bouldina -> $ R_{i} = \frac{s_{i} + s_{j}}{ \Delta_{ij}} $


gdzie $ j $ jest klastrem różny od $ i $ i dla którego wartość $ \frac{s_i+s_j}{\Delta_{ij}} $ jest maksymalna.
Ostateczna wartość współczynnika Daviesa-Bouldina to średnia wartość $ R_i $ dla wszystkich klastrów.


In [75]:
def count_distance(cluster, metric):
    distances = [0 for _ in range(len(cluster))]
    
    for i in range(len(cluster)):
        for j in range(i):
            A, B = create_vec(cluster[i], cluster[j])
            dist = metric(A, B)
            distances[i] += dist
            distances[j] += dist
            
    return distances

def create_vec(pre_A, pre_B):
    return text_to_vec((pre_A, pre_B))
    
def count_norm(cluster):
    vec = [len(line) for line in cluster]
    return np.sqrt(sum(x**2 for x in vec))
    
def centroid(cluster, metric):
    distances = count_distance(cluster, metric)
    min_dist = -np.inf
    index = -1
    
    for i in range(len(distances)):
        if distances[i] < min_dist:
            min_dist = distances[i]
            index = i
            
    return cluster[index]

def count_average_dist(cluster, metric, centroid):
    res = 0
    for i in cluster:
        A, B = create_vec(i, centroid)
        res += metric(A, B)
        
    return res/count_norm(cluster)

def count_R(metric, centroids, N, S):
    R=[[0 for i in range (N)] for j in range (N)]
    for i in range (N):
        for j in range (N):
            if i!=j:
                A, B = create_vec(centroids[i], centroids[j])
                R[i][j]=(S[i]+S[j])/metric(A, B)
    
    return R
   
    
def davies_bouldin(clusters, metric):
    N=len(clusters)
    
    centroids=[centroid(clusters[i], metric) for i in range (N)]
    S=[count_average_dist(clusters[i], metric, centroids[i]) for i in range (N)]
    
    R = count_R(metric, centroids, N, S)
    sum_D=0
    for i in range(N):
        a=0
        if i>0: 
            a=max(R[i][:i])
        b=0
        if i<N-1: 
            b=max(R[i][i+1:])
        sum_D+=max(a,b)
        
    return sum_D/N

## Zadanie 3. Stwórz stoplistę najczęściej występujących słów i zastosuj ją jako pre-processing dla nazw. Algorytmy klasteryzacji powinny działać na dwóch wariantach: z pre-processingiem i bez pre-processingu.

In [29]:
def make_text():
    with open('lines.txt', 'r') as filee:
        text = filee.readlines()

    return text

def stoplist(frequency=200, range_a=100, range_b=200):
    text = make_text()[range_a:range_b]
    words = []
    for line in text:
        words += line.split()

    counted = {}
    for word in words:
        if word not in counted:
            counted[word] = 0
        counted[word] += 1

    stop_words = set()
    for word, count in counted.items():
        if count >= frequency:
            stop_words.add(word)

    return stop_words

## Zadanie 4. Wykonaj klasteryzację zawartości załączonego pliku (lines.txt) przy użyciu  metryk zaimplementowanych w pkt. 1. Każda linia to adres pocztowy firmy, różne sposoby zapisu tego samego adresu powinny się znaleźć w jednym klastrze.

### Preprocessing

In [30]:
import re
from collections import Counter

def preprocess(text):
    """
    Converts text to lowercase and removes all punctation marks
    """
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)
    
    return text

def text_to_vec(docs):
    """
    Swithes text to vectors
    """
    vocab = set()
    for doc in docs:
        doc = preprocess(doc)
        words = doc.split()
        vocab.update(words)

    freq_vecs = []
    for doc in docs:
        doc = preprocess(doc)
        words = doc.split()
        word_counts = Counter(words)
        freq_vec = [word_counts[word] for word in vocab]
        freq_vecs.append(freq_vec)
    
    return freq_vecs

### Klasteryzacja

In [63]:
from sklearn.cluster import KMeans

def cut_the_text(range_a=100, range_b=200):
    with open('lines.txt', 'r') as file:
        lines = [line.strip() for line in file.readlines()]

    return lines[range_a:range_b+1]

def clusterization(metric, pre_mode, range_a=100, range_b=200, frequency=200, is_print=True):
    lines = cut_the_text(range_a, range_b)

    if is_print:
        if pre_mode:
            print(f"'\n-------------- WITH PRE-PROCESSING --------------")
        else:
            print(f"'\n------------ WITHOUT PRE-PROCESSING --------------")

    if pre_mode:
        stop_words = stoplist(frequency, range_a, range_b)
        cut_lines = [' '.join([word for word in line.split() if word not in stop_words]) for line in lines]

    n = len(lines)
    dist_matrix = np.zeros((n, n))
    for i in range(n):
        for j in range(i+1, n):
            A, B = text_to_vec([lines[i], lines[j]])
            dist_matrix[i, j] = metric(A, B)
            dist_matrix[j, i] = dist_matrix[i, j]

    k = 10
    kmeans = KMeans(n_clusters=k, init='k-means++', random_state=42)
    kmeans.fit(dist_matrix)

    if is_print:
        for i in range(k):
            cluster_indices = np.where(kmeans.labels_ == i)[0]
            cluster_lines = [lines[idx] for idx in cluster_indices]
            print(f'\n\n---------------- CLUSTER {i+1} ----------------\n')
            cnt = 0
            for line in cluster_lines:
                if cnt < 5:
                    print(line[:100])
                    cnt += 1
            print()
        return 0
    else:
        clusters = []
        for i in range(k):
            cluster_indices = np.where(kmeans.labels_ == i)[0]
            cluster_lines = [lines[idx] for idx in cluster_indices]
            clusters.append(cluster_lines)
            
        return clusters

        

In [56]:
metrics = [cos_metric, dice_metric, euclidean_metric]
pre_mode = [False, True]
metrics_names = ["COS", "DICE", "EUCLIDEAN"]
for i in range(len(metrics)):
    for mode in pre_mode:
        print(f"'\n\n---------------- METRIC {metrics_names[i]} ----------------")
        clusterization(metric=metrics[i], pre_mode=mode, frequency=300)

'

---------------- METRIC COS ----------------
'
------------ WITHOUT PRE-PROCESSING --------------


---------------- CLUSTER 1 ----------------

"TRADE HOUSE "PIROFF"" LTD 14, OBRAZTSOVA STR., MOSCOW, RUSSIAN FEDERATION,   127055 TEL/FAX:   +7  
"TRADE HOUSE "PIROFF"" LTD  14, OBRAZTSOVA  STR.,  MOSCOW, RUSSIAN  FEDERATION,   127055  TEL/FAX:  
"TRADE HOUSE "PIROFF"" LTD   14, OBRAZTSOVA  STR.,  MOSCOW, RUSSIANFEDERATION,   127055  TEL/FAX:  +



---------------- CLUSTER 2 ----------------

1/FMG SHIPPING AND FORWARDING LTD 190020, SAINT PETERSBURG, LIFLYANDSKAYA STR., 6, LITERA "A", BUILD
1/FMG Shipping and Forwarding Ltd190020, Saint Petersburg,  Liflyandskaya str., 6, litera A,building
1/FMG SHIPPING AND FORWARDING  LTD190020, SAINT PETERSBURG, LIFLYANDSKAYA STR., 6, LITERA A 2/<TORGO



---------------- CLUSTER 3 ----------------

1,FORWARD EXPEDITION LIMITED HKG MOSCOW REPRESENTATION, 121069,MOSCOW,NOVINSKIY BULVAR, 20 A, BLD.8 
1.FORWARD EXPEDITION LIMITED HKG,MOSCOW REPRESENT

## Zadanie 5. Porównaj jakość wyników sposobami zaimplementowanymi w pkt. 2.

In [65]:
cluster_c_ns = clusterization(metric=cos_metric, pre_mode=False, range_a=100, range_b=400, frequency=200, is_print=False)
cluster_c_s = clusterization(metric=cos_metric, pre_mode=True, range_a=100, range_b=400, frequency=200, is_print=False)
cluster_d_ns = clusterization(metric=dice_metric, pre_mode=False, range_a=100, range_b=400, frequency=200, is_print=False)
cluster_d_s = clusterization(metric=dice_metric, pre_mode=True, range_a=100, range_b=400, frequency=200, is_print=False)
cluster_e_ns = clusterization(metric=euclidean_metric, pre_mode=False, range_a=100, range_b=400, frequency=200, is_print=False)
cluster_e_s = clusterization(metric=euclidean_metric, pre_mode=True, range_a=100, range_b=400, frequency=200, is_print=False)

In [76]:
dbi_c_ns = davies_bouldin(cluster_c_ns, cos_metric)
dbi_c_s = davies_bouldin(cluster_c_s, cos_metric)
dbi_d_ns = davies_bouldin(cluster_d_ns, dice_metric)
dbi_d_s = davies_bouldin(cluster_d_s, dice_metric)
dbi_e_ns = davies_bouldin(cluster_e_ns, euclidean_metric)
dbi_e_s = davies_bouldin(cluster_e_s, euclidean_metric)

  R[i][j]=(S[i]+S[j])/metric(A, B)


In [78]:
dbi_c = {"without stoplist": dbi_c_ns,
         "with stoplist":dbi_c_s}
dbi_d = {"without stoplist": dbi_d_ns,
         "with stoplist":dbi_d_s}
dbi_e = {"without stoplist": dbi_e_ns,
         "with stoplist":dbi_e_s}

dbi = {"Euclidean distance": dbi_e,
       "Sørensen–Dice coefficient": dbi_d,
       "Cosine similarity": dbi_c}

In [81]:
import pandas as pd

data_frame=pd.DataFrame(dbi)
print("Davies–Bouldin index")
display(data_frame)

Davies–Bouldin index


Unnamed: 0,Euclidean distance,Sørensen–Dice coefficient,Cosine similarity
without stoplist,0.161993,1.395741,inf
with stoplist,0.161993,1.395741,inf


### Obserwacje:
1. Zastosowanie stoplisty nie wpływa ma jakość klasteryzacji (współczynnik dbi jest taki sam)

## Zadanie 6. Czy masz jakiś pomysł na poprawę jakości klasteryzacji w tym zadaniu?

By klasteryzacja była jak najbardziej efektywna, należy dobrać odpowiednie parametry dla wykorzystywanych funkcji/bibliotek. Ponadto należy pamiętać o dobraniu odpowiedniej metryki, dopasowanej do danych wejściowych.