# <center>Project 3 on Fuzzy system</center>

Subject: Fuzzy C-means clustering

Name: Hesam Mousavi

Student number: 9931155

Master student

<p align="center">
 <img src="report/cmean.png">
</p>

<script type="text/x-mathjax-config">
MathJax.Hub.Config({
tex2jax: {
inlineMath: [['$','$'], ['\\(','\\)']],
processEscapes: true},
jax: ["input/TeX","input/MathML","input/AsciiMath","output/CommonHTML"],
extensions: ["tex2jax.js","mml2jax.js","asciimath2jax.js","MathMenu.js","MathZoom.js","AssistiveMML.js", "[Contrib]/a11y/accessibility-menu.js"],
TeX: {
extensions: ["AMSmath.js","AMSsymbols.js","noErrors.js","noUndefined.js"],
equationNumbers: {
autoNumber: "AMS"
}
}
});
</script>

In [1]:
import numpy as np
from fcmeans import FCM
from my_io import read_dataset_to_X_and_y
from copy import deepcopy
import matplotlib.pyplot as plt


## Build a class to keep all the variables together

I like to keep my variables together, so I built a class named
UniSet (short for universal set).

The dataset is read with a function from my my_io module that can shuffle the
samples, handle missing values, and normalize the features.

Here I shuffle the data, replace missing values with the class mean, and
normalize the features. Note that the call below passes
`normalization='scaling'` with `min_value=0.1` and `max_value=1`, which reads
like min-max scaling rather than the z-score (zero-mean, unit-variance) method
originally described here.

I use all 12 features and encode sex (m, f) as numbers (0, 1); more precisely,
the my_io module maps each distinct string to a specific number.


In [2]:
class UniSet():
    def __init__(self, file, range_feature, range_label,
                 normalization='scaling', min_value=0.1, max_value=1,
                 shuffle=False, about_nan='class_mean'):
        np.random.seed(1)
        sample, label = read_dataset_to_X_and_y(
            file, range_feature, range_label, normalization, min_value, max_value,
            shuffle=shuffle, about_nan=about_nan)
        self.universal = sample.astype(float)
        self.label = label
        self.number_of_feature = sample.shape[1]
        self.size_of_universal = sample.shape[0]
        self.diffrent_label = np.unique(label)
        self.number_of_diffrent_label = self.diffrent_label.shape[0]


uni_total = UniSet(
    'dataset/hcvdat0.csv', (2, 14), (1, 2),
    normalization='scaling', shuffle=True, about_nan='class_mean')


print(f'The whole dataset is {uni_total.universal.shape} matrix')


The whole dataset is (615, 12) matrix


> #### Details 
>
> The my_io module has a function named read_dataset_to_X_and_y. It takes the
>  dataset file, the range of columns that hold the features, the range of
>  columns that hold the labels, `normalization` (the normalization method),
>  `shuffle` (shuffle the samples when True), and `about_nan`, which is either
>  "delete" (drop samples with NA values) or "class_mean" (replace each NA
>  value with the mean of that feature within the sample's class).
>
> As mentioned above, this function also accepts string attributes by mapping
>  each string to a specific value, so our labels are now integers in $[0, 4]$.
>
> I replace NA values with the class mean because the filled-in value then
>  matches the class average for that feature, so it does not distort the
>  similarity (or distance) between samples of the same class.
>
> The class holds everything I will need: the universal set (sample data),
>  the labels, the number of features, the size of the universal set
>  (dataset), the different (unique) labels, and the number of different labels.
>
> In this dataset the label is column range [1, 2) and the features are
>  column range [2, 14) (12 features).
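
The "class_mean" option described above can be sketched as follows. This is only a minimal illustration under my own assumptions, not the actual my_io code: the helper name `fill_nan_with_class_mean` and the toy data are hypothetical, and missing values are assumed to be encoded as NaN.

```python
import numpy as np


def fill_nan_with_class_mean(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Replace each NaN with the mean of that feature within the sample's class.

    Hypothetical helper for illustration; not taken from my_io.
    """
    X = X.astype(float).copy()
    y = np.asarray(y).ravel()
    for label in np.unique(y):
        rows = (y == label)
        block = X[rows]
        # Per-feature mean over this class, ignoring missing entries
        class_mean = np.nanmean(block, axis=0)
        nan_mask = np.isnan(block)
        # Column index of every NaN selects the matching feature mean
        block[nan_mask] = np.take(class_mean, np.nonzero(nan_mask)[1])
        X[rows] = block
    return X


# Toy data: two classes, two features, one missing value per class
X = np.array([[1.0, 10.0],
              [np.nan, 14.0],
              [5.0, np.nan],
              [7.0, 30.0]])
y = np.array([0, 0, 1, 1])
filled = fill_nan_with_class_mean(X, y)
print(filled)
```

If a feature is missing for every sample of a class, `np.nanmean` returns NaN for it, so such a case would need a fallback (for example the global mean).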


## Split the whole dataset into Train and Test

Since the dataset was already shuffled, I simply take the first 80%
of the data for training and the rest for testing.



In [3]:
def split_train_test(universe: UniSet,
                     train_size: float) -> tuple[UniSet, UniSet]:
    train = deepcopy(universe)
    test = deepcopy(universe)
    train.size_of_universal = \
        int(universe.size_of_universal*train_size)
    train.universal = \
        universe.universal[0:train.size_of_universal]
    train.label = \
        universe.label[0:train.size_of_universal]
    test.size_of_universal = (
        universe.size_of_universal - train.size_of_universal)
    test.universal = \
        universe.universal[train.size_of_universal:]
    test.label = \
        universe.label[train.size_of_universal:]

    return train, test


uni_train, uni_test = split_train_test(uni_total, 0.8)
print(f'The train dataset is {uni_train.universal.shape} matrix')
print(f'The test dataset is {uni_test.universal.shape} matrix')


The train dataset is (492, 12) matrix
The test dataset is (123, 12) matrix


> #### Details 
>
> I create two objects for train and test by copying the total set and
>  changing only universal, label, and size of universal in each copy.

## Our parameters

$\mu_{i, j}$: the membership degree of the jth data point in the ith cluster,
which can be read as the probability that it belongs to that cluster. For every
data point j, the sum of $\mu_{i, j}$ over the C clusters is 1

$c_{i}$: the center of the ith cluster


<p align="center">
 <img src="report/parameter.png">
</p>

## Objective function
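$$J=\sum_{i=1}^{C} \sum_{j=1}^{N} \mu_{i j}^{m}\left\|x_{j}-c_{i}\right\|^{2}$$

Here $N$ is the number of data points, $C$ is the number of clusters, $x_{j}$
is the jth data point, and $m > 1$ is the fuzzifier that controls how soft the
memberships are.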
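
To make the objective concrete, here is a minimal NumPy sketch of the standard alternating fuzzy C-means updates that decrease $J$. It is an illustration under my own assumptions, not the code used by the libraries in the cells below: the function name `fuzzy_c_means`, the default fuzzifier `m=2.0`, the stopping rule, and the toy data are all hypothetical choices.

```python
import numpy as np


def fuzzy_c_means(X, n_clusters, m=2.0, max_iter=150, tol=1e-5, seed=0):
    """Minimal fuzzy C-means sketch: returns (centers, memberships, J)."""
    rng = np.random.default_rng(seed)
    n_samples = X.shape[0]
    # Random memberships, normalized so each data point (column) sums to 1
    u = rng.random((n_clusters, n_samples))
    u /= u.sum(axis=0, keepdims=True)
    for _ in range(max_iter):
        um = u ** m
        # c_i = sum_j mu_ij^m x_j / sum_j mu_ij^m
        centers = (um @ X) / um.sum(axis=1, keepdims=True)
        # Squared distances ||x_j - c_i||^2, shape (C, N)
        d2 = ((X[None, :, :] - centers[:, None, :]) ** 2).sum(axis=2)
        d2 = np.fmax(d2, 1e-12)  # guard against division by zero
        # mu_ij = 1 / sum_k (d_ij / d_kj)^(2 / (m - 1))
        inv = d2 ** (-1.0 / (m - 1.0))
        new_u = inv / inv.sum(axis=0, keepdims=True)
        converged = np.abs(new_u - u).max() < tol
        u = new_u
        if converged:
            break
    # Objective J for the returned centers and memberships
    objective = float(((u ** m) * d2).sum())
    return centers, u, objective


# Toy data: two well-separated blobs around (0, 0) and (3, 3)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, (50, 2)), rng.normal(3.0, 0.3, (50, 2))])
centers, u, J = fuzzy_c_means(X, n_clusters=2)
print(centers.round(2))
print(u.sum(axis=0)[:5])  # memberships of each point sum to 1
```

The membership matrix here has shape (C, N), matching the indexing $\mu_{i, j}$ used above: rows are clusters and columns are data points.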

## Accuracy

I use a **confusion matrix** to find the label of each cluster (argmax in each
row) and then choose the F1-score as the accuracy metric because, as $\alpha$
increases, precision increases and recall decreases, and I want an $\alpha$
that balances both.

In [4]:
def evaluate(gold_label: np.ndarray, predict_label: np.ndarray,
             method: str = 'f1-score') -> float:
    # Flatten so that both (n,) and (n, 1) label arrays are accepted
    gold_label = np.asarray(gold_label).ravel()
    predict_label = np.asarray(predict_label).ravel()
    diffrent_label_in_gold_label = np.unique(gold_label)
    diffrent_label_in_predict_label = np.unique(predict_label)
    # confusion_matrix[k, s]: samples in cluster k that belong to class s
    confusion_matrix = np.array([
        [np.sum((predict_label == k) & (gold_label == s))
         for s in diffrent_label_in_gold_label]
        for k in diffrent_label_in_predict_label])
    total = np.sum(confusion_matrix)
    precision = np.sum(np.max(confusion_matrix, axis=1)) / total
    recall = np.sum(np.max(confusion_matrix, axis=0)) / total
    if method == 'precision':
        return precision
    if method == 'recall':
        return recall
    if method == 'f1-score':
        return 2 * ((precision*recall)/(precision+recall))
    raise ValueError(f'Unknown method: {method}')

> #### Details 
> 
> ##### Confusion matrix
> 
> <p align="center">
>  <img src="report/confusion-matrix.png">
> </p>
> 
> It is a $K \times S$ matrix in which $a_{k, s}$ is the number of samples
>  assigned to the kᵗʰ cluster that belong to the sᵗʰ class.
>  $$\text{Precision}=\frac{\sum_{k} \max_{s}\left\{a_{k s}\right\}}{\sum_{k} \sum_{s} a_{k s}}$$
>  $$\text{Recall}=\frac{\sum_{s} \max_{k}\left\{a_{k s}\right\}}{\sum_{k} \sum_{s} a_{k s}+U}$$
>  $$\text{F1-score}=2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision}+\text{Recall}}$$
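
As a small worked example of the formulas above, the sketch below computes the three scores for a made-up confusion matrix. The numbers are hypothetical, and the example takes $U = 0$, which matches the `evaluate` function above since it has no $U$ term.

```python
import numpy as np

# Hypothetical confusion matrix: rows are clusters (k), columns are classes (s)
a = np.array([[40, 2, 1],
              [3, 30, 5],
              [0, 4, 10],
              [1, 0, 4]])

total = a.sum()                      # 100 samples in total
cluster_to_class = a.argmax(axis=1)  # label of each cluster: argmax per row

# Precision: sum of row maxima (40 + 30 + 10 + 4) over the total
precision = a.max(axis=1).sum() / total
# Recall: sum of column maxima (40 + 30 + 10) over the total, with U = 0
recall = a.max(axis=0).sum() / total
f1_score = 2 * (precision * recall) / (precision + recall)

print(cluster_to_class)    # → [0 1 2 2]
print(precision, recall)   # → 0.84 0.8
print(round(f1_score, 4))  # → 0.8195
```

With more clusters than classes, several clusters can map to the same class (clusters 2 and 3 here), which raises precision but leaves recall unchanged, since each class only counts its single best cluster.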


In [5]:
fcm = FCM(n_clusters=5)
fcm.fit(uni_train.universal)
# outputs
fcm_centers = fcm.centers
fcm_labels = fcm.predict(uni_test.universal)
fcm_labels = fcm_labels.reshape((-1, 1))
# Raw agreement between cluster indices and class labels; cluster indices are
# arbitrary, so this count is not an accuracy
sum(fcm_labels == uni_test.label)
# evaluate(uni_test.label, fcm_labels)
# np.unique(fcm_labels, return_counts=True)

array([23])

In [6]:
import skfuzzy as fuzz
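cntr, u, u0, d, jm, p, fpc = fuzz.cmeans(
    uni_train.universal.T, c=5, m=2, error=0.005, maxiter=1000)
# cmeans_predict returns (u, u0, d, jm, p, fpc); fuzz_pre[0] holds memberships
fuzz_pre = fuzz.cmeans_predict(
    uni_test.universal.T, cntr, m=2, error=0.005, maxiter=1000)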
prob_fuzz = fuzz_pre[0].T
label_fuzz = np.argmax(prob_fuzz, axis=1).reshape((-1, 1))
sum(label_fuzz == uni_test.label)
# evaluate(uni_test.label, label_fuzz)
# np.unique(label_fuzz, return_counts=True)

array([23])
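
The two counts above compare raw cluster indices with class labels. Because cluster indices are arbitrary, one option is to first map each cluster to its majority class, following the argmax step described under Accuracy. The sketch below is only an illustration of that idea and not something the cells above do; the helper name `relabel_by_majority` and the toy labels are hypothetical.

```python
import numpy as np


def relabel_by_majority(gold, pred):
    """Map every cluster index in pred to the most frequent gold label in it.

    Hypothetical helper for illustration only.
    """
    gold = np.asarray(gold).ravel()
    pred = np.asarray(pred).ravel()
    mapped = np.empty_like(gold)
    for k in np.unique(pred):
        members = (pred == k)
        labels, counts = np.unique(gold[members], return_counts=True)
        mapped[members] = labels[counts.argmax()]
    return mapped


# Toy labels: the clustering is nearly right, but its indices are permuted
gold = np.array([0, 0, 0, 1, 1, 2, 2, 2])
pred = np.array([3, 3, 3, 0, 0, 1, 1, 0])

raw_matches = int((pred == gold).sum())
mapped = relabel_by_majority(gold, pred)
mapped_matches = int((mapped == gold).sum())

print(raw_matches)     # → 0
print(mapped_matches)  # → 7
```

The same grouping scores 0 matches before relabeling and 7 of 8 after it, which is why a raw equality count on cluster indices says little about clustering quality.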



# <center>Thanks for your time</center>