<h1>
<center>Multi-Instance Learning </center>
</h1>

<font size="3"> 
This notebook implements multi-instance learning (MIL) on the DeliciousMIL dataset, loaded through the DeliciousMILDataset class. The dataset contains text documents, split into sentences, and their associated labels; the task is reduced to binary classification by predicting only the most frequent label. Two experiments are conducted with several classifiers: one at the document level, where each document is a single word-count vector, and one at the sentence level, where each document is a bag of sentence vectors. The sentence-level approach fits k-means on the training sentences and represents each document by the clusters its sentences fall into (or by its summed distances to the cluster centroids), for several values of k. This representation is compact (k features per document instead of one per vocabulary word), which may make it easier to inspect and cheaper to train on. Finally, the results are reported in terms of accuracy, precision, and recall.
</font>

## General Setup

<font size="3"> 
Package imports.
</font>

In [1]:
import numpy as np
import scipy.sparse
import xgboost as xgb
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.preprocessing import normalize
from sklearn.svm import LinearSVC

## Load & Preprocess Dataset

<font size="3">
The classes below read and process the document data of the DeliciousMIL dataset and build vectors for documents and sentences. With them, the dataset can be represented in both of the required forms: at the document level and at the sentence level.
<br>
<br>
The code performs the following steps:
<ol>
<li>Define Document class with empty sentence, label, and vector lists.</li>
<li>Define DeliciousMILDataset class with vocab, train/test documents, labels, and sentence labels.</li>
<li>Define two methods, get_data_on_document_level and get_data_on_sentence_level, which build vectors for the documents or sentences.</li>
<li>Read data, labels, vocabulary, and sentence labels from corresponding files.</li>
<li>Build vectors for each document or sentence by iterating over sentences and their values.</li>
<li>If sparse=True, use csr_matrix to represent the vectors in a compressed form.</li>
</ol>
<br>
In more detail, the class contains the following methods:
<br>
<br>
<ul>
<li><b>_read_data():</b> Reads a data file and creates a list of Document objects, one per document. Each of them holds a list of nested Document objects that represent its sentences.</li>
<li><b>_read_labels():</b> Reads the label file and creates a list of labels for each document.</li>
<li><b>_read_vocabulary():</b> Reads the vocabulary file and creates a dictionary of words and their corresponding indices.</li>
<li><b>_read_test_sentence_labels():</b> Reads the labeled test sentences file and creates a nested dictionary that maps each document identifier and sentence identifier to the labels of that sentence.</li>
<li><b>_build_vectors():</b> Builds vectors for each document and its sentences by calling _buld_vector for each document.</li>
<li><b>_buld_vector():</b> Builds a count vector for each sentence of a document and sums these into the document vector, which is stored in document.vector. If keep_sentence_vector=True, the vector of each sentence is also stored in sent.vector.</li>
<li><b>get_data_on_document_level():</b> Builds document-level vectors for both the training and test datasets and returns the resulting Document objects. Each vector has the length of the global vocabulary, and each entry counts how often the corresponding word occurs.</li>
<li><b>get_data_on_sentence_level():</b> Builds sentence-level vectors (in addition to the document vectors) for both the training and test datasets and returns the resulting Document objects. Each vector has the length of the global vocabulary, and each entry counts how often the corresponding word occurs.</li>
</ul>
</font>

In [2]:
class Document(object):

    def __init__(self):
        self.sentences = [] # Document objects
        self.labels = []
        self.vector = []

class DeliciousMILDataset(object):

    def __init__(self,
                 train_data="./data/train-data.dat", test_data="./data/test-data.dat",
                 train_labels="./data/train-label.dat", test_labels="./data/test-label.dat",
                 labeled_test_sentences="./data/labeled_test_sentences.dat", vocab_file="./data/vocabs.txt"):

        self._vocab = self._read_vocabulary(vocab_file)
        self._train_documents = self._read_data(train_data)
        self._test_documents = self._read_data(test_data)
        self._train_labels = self._read_labels(train_labels)
        self._test_labels = self._read_labels(test_labels)
        self._test_sentence_labels = self._read_test_sentence_labels(labeled_test_sentences)

        
    def get_data_on_document_level(self, sparse):
        self._build_vectors\
            (self._train_documents, self._train_labels, len(self._vocab), keep_sentence_vector=False, sparse=sparse)
        self._build_vectors\
            (self._test_documents, self._test_labels, len(self._vocab), keep_sentence_vector=False, sparse=sparse)

        return self._train_documents, \
               self._test_documents

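    def get_vocab(self):
        # The vocabulary is already loaded in __init__; calling _read_vocabulary()
        # without a file argument would raise a TypeError, so just return it.
        return self._vocab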
    
    def get_data_on_sentence_level(self, sparse):
        self._build_vectors\
            (self._train_documents, self._train_labels, len(self._vocab), keep_sentence_vector=True, sparse=sparse)
        self._build_vectors\
            (self._test_documents, self._test_labels, len(self._vocab), keep_sentence_vector=True, sparse=sparse)

        return self._train_documents, \
               self._test_documents

    
    def _read_data(self, data_file):
        documents = []

        with open(data_file, "r") as f:
            for line in f:
                tokens = line.split()

                document = Document()

                # First token is number of sentences, ignore
                tokens.pop(0)

                while len(tokens) > 0:
                    word_count_token = tokens.pop(0)

                    # Token looks like "<N>": N word indices follow for this sentence
                    num_words = int(word_count_token[1:-1])

                    sentence = Document()

                    for _ in range(num_words):
                        sentence.vector.append(int(tokens.pop(0)))

                    document.sentences.append(sentence)

                documents.append(document)

        return documents

    
    def _read_labels(self, label_file):
        labels = []

        with open(label_file, "r") as f:
            for line in f:
                line = line.strip()

                labels.append([int(v) for v in line.split()])

        return labels

    
    def _read_vocabulary(self, vocab_file):
        word_dict = {}

        with open(vocab_file, "r") as f:
            for line in f:
                tokens = line.split(",")
                
                word_dict[tokens[0].strip()] = int(tokens[1].strip())
                
        return word_dict

    
    def _read_test_sentence_labels(self, label_file):
        doc_sent_labels_dict = {}

        with open(label_file, "r") as f:
            for line in f:
                tokens = line.split()

                doc = tokens.pop(0)
                sent = tokens.pop(0)

                if doc not in doc_sent_labels_dict:
                    doc_sent_labels_dict[doc] = {}

                doc_sent_labels_dict[doc][sent] = [int(v) for v in tokens]

        return doc_sent_labels_dict

    
    def _build_vectors(self, documents, labels, vocabulary_length, keep_sentence_vector=True, sparse=True):
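        # NOTE: `sparse` is not forwarded, so _buld_vector always runs with its default
        # (sparse=True). The rest of the notebook relies on the resulting csr_matrix
        # vectors and labels.
        for doc, doc_labels in zip(documents, labels):
            self._buld_vector(doc, doc_labels, vocabulary_length, keep_sentence_vector)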

            
    def _buld_vector(self, document, doc_labels, vocabulary_length, keep_sentence_vector=True, sparse=True):
        vec = np.zeros(vocabulary_length)

        for sent in document.sentences:
            sent_vec = np.zeros(vocabulary_length)

            for val in sent.vector:
                sent_vec[val] += 1

            vec += sent_vec
            if keep_sentence_vector:
                if sparse:
                    sent.vector = scipy.sparse.csr_matrix(sent_vec)
                else:
                    sent.vector = sent_vec

        if sparse:
            document.vector = scipy.sparse.csr_matrix(vec)
            document.labels = scipy.sparse.csr_matrix(doc_labels)
        else:
            document.vector = vec
            document.labels = doc_labels
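
<font size="3">
As a rough illustration of the parsing and vector-building steps, the sketch below processes a single line in the layout that _read_data appears to expect: a leading sentence count, then one &lt;N&gt; token per sentence followed by N word indices. The sample line, the vocabulary size of 6, and the helper name parse_line are made up for this example and are not taken from the dataset.
</font>

```python
import numpy as np

def parse_line(line):
    """Split one data line into a list of sentences (lists of word indices)."""
    tokens = line.split()
    tokens.pop(0)  # leading sentence count, not needed for parsing
    sentences = []
    while tokens:
        num_words = int(tokens.pop(0)[1:-1])  # "<3>" -> 3
        sentences.append([int(tokens.pop(0)) for _ in range(num_words)])
    return sentences

# Hypothetical line: 2 sentences, with 3 and 2 word indices respectively.
sample_line = "<2> <3> 0 4 4 <2> 1 4"
vocabulary_length = 6  # assumed toy vocabulary size

sentences = parse_line(sample_line)

# One count vector per sentence; the document vector is their sum.
sentence_vectors = []
for word_ids in sentences:
    vec = np.zeros(vocabulary_length)
    for word_id in word_ids:
        vec[word_id] += 1
    sentence_vectors.append(vec)
document_vector = np.sum(sentence_vectors, axis=0)

print(sentences)
print(document_vector)
```

<font size="3">
Word index 4 occurs twice in the first sentence and once in the second, so it ends up with a count of 3 in the document vector.
</font>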

<font size="3"> 
The function below loads and prepares the DeliciousMILDataset for classification.
<br>
<br>
The function performs the following steps:
<ol>
<li>Initialize a DeliciousMILDataset object.</li>
<li>If sentence_level is True, extract the dataset on the sentence level, else extract on document level.</li>
<li>Sum the label vectors of the training documents to find the most frequent label.</li>
<li>Build a binary target for each document: 1 if the document carries that label, 0 otherwise. Return the train and test documents and their binary labels.</li>
</ol>
</font>

In [3]:
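def get_dataset(sentence_level):
    d = DeliciousMILDataset()

    # Ask for sparse output: the label indexing below and the scipy.sparse.vstack
    # calls in the experiments expect csr_matrix vectors and labels.
    if sentence_level:
        train_docs, test_docs = d.get_data_on_sentence_level(sparse=True)
    else:
        train_docs, test_docs = d.get_data_on_document_level(sparse=True)

    # Count how many training documents carry each label.
    label_sum = np.zeros(train_docs[0].labels.shape[1])
    for doc in train_docs:
        label_sum += doc.labels.toarray().ravel()

    # Binary target: does the document carry the most frequent label?
    most_freq_class = np.argmax(label_sum)
    train_labels = np.asarray([doc.labels[0, most_freq_class] for doc in train_docs])
    test_labels = np.asarray([doc.labels[0, most_freq_class] for doc in test_docs])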
    return train_docs, test_docs, train_labels, test_labels
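
<font size="3">
The label reduction can be tried in isolation. The sketch below is only an illustration on made-up data: four hypothetical documents with three possible labels, stored as 1 x 3 csr_matrix rows in the same way document.labels is stored in the sparse case.
</font>

```python
import numpy as np
import scipy.sparse

# Hypothetical multi-label rows for four documents and three labels.
doc_labels = [
    scipy.sparse.csr_matrix([[1, 0, 1]]),
    scipy.sparse.csr_matrix([[0, 0, 1]]),
    scipy.sparse.csr_matrix([[0, 1, 0]]),
    scipy.sparse.csr_matrix([[1, 0, 1]]),
]

# Count how many documents carry each label.
label_sum = np.zeros(doc_labels[0].shape[1])
for labels in doc_labels:
    label_sum += labels.toarray().ravel()

most_freq_class = int(np.argmax(label_sum))

# Binary target: 1 if the document has the most frequent label, else 0.
binary_targets = np.asarray([labels[0, most_freq_class] for labels in doc_labels])

print(label_sum, most_freq_class, binary_targets)
```

<font size="3">
In this toy case the third label (index 2) is the most frequent one, so only the third document receives a target of 0.
</font>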
    

## Vectorize Text

<font size="3"> 
The function below trains a KMeans clustering model on the pooled instances (sentences) of the multi-instance data and returns the fitted model.
</font>

In [4]:
def fit_multi_instance_kmeans(train_x, paramset):
    base_estimator = KMeans(n_clusters=paramset['k'])
    base_estimator.fit(train_x)
    return base_estimator

<font size="3"> 
The function below transforms each bag (document) into a fixed-length vector of size k using the previously fitted KMeans model. If use_distances is False, the vector is a binary indicator: entry c is 1 when at least one instance of the bag is assigned to cluster c. If use_distances is True, the distances from every instance of the bag to each cluster centroid are summed per cluster, and the resulting vector is L2-normalized. The function returns a list with one vector per bag.
</font>

In [5]:
def transform_multi_instance_kmeans(x, base_estimator, paramset):
    vectors = []

    for test_bag in x:
        vec = np.zeros(paramset['k'])

        if paramset['use_distances']:
            for sent_vec in base_estimator.transform(test_bag):
                vec += sent_vec
            vec = normalize(vec.reshape(1, -1))[0]
        else:
            for c in base_estimator.predict(test_bag):
                vec[c] = 1

        vectors.append(vec)

    return vectors
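
<font size="3">
To see what the two representations look like, here is a small self-contained sketch on made-up 2-D points. The data, the value k=2, and the fixed random_state are assumptions for the illustration only; the notebook itself fits KMeans on sparse sentence vectors.
</font>

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

# Toy instances forming two well-separated groups (made-up data).
instances = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
k = 2
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(instances)

# Two bags: the first has instances from both groups, the second from one only.
bags = [instances[[0, 2]], instances[[2, 3]]]

memberships, distance_vecs = [], []
for bag in bags:
    # Membership variant: mark every cluster that at least one instance falls into.
    membership = np.zeros(k)
    membership[kmeans.predict(bag)] = 1
    memberships.append(membership)

    # Distance variant: sum instance-to-centroid distances, then L2-normalize.
    distance_sum = kmeans.transform(bag).sum(axis=0)
    distance_vecs.append(normalize(distance_sum.reshape(1, -1))[0])

for membership, distance_vec in zip(memberships, distance_vecs):
    print(membership, distance_vec)
```

<font size="3">
The first bag touches both clusters, so both entries of its membership vector are set. The second bag touches a single cluster. The membership variant discards how many instances fall into each cluster, while the distance variant keeps a graded signal for every cluster.
</font>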

## Sentence Level Classification 

<font size="3"> 
The function below runs the experiments on the sentence-level dataset (cluster-based input) with different classifiers.
<br>
<br>
The function performs the following steps:
<ol>
<li>Get the dataset for sentence-level classification.</li>
<li>Extract the sentence-level vectors from the training and test documents by iterating over the sentences in each document, and stacking the sentence vectors vertically.</li>
<li>For each set of parameters in params, train a multi-instance k-means vectorizer.</li>
<li>Transform the training and test sentence vectors into bag-level representations using the trained vectorizer.</li>
<li>Train each classifier on the transformed training data and labels using the fit function.</li>
<li>Make predictions on the transformed test data using the predict function.</li>
<li>Print the results, including the model name, parameter settings, and evaluation metrics (accuracy, precision, and recall) using the print function.</li>
</ol>
</font>

In [6]:
def experiments_sentence_level(classifiers, params):
    train_docs, test_docs, train_labels, test_labels = get_dataset(sentence_level=True)

    train_sentences = []
    for doc in train_docs:
        train_sentences.extend([sent.vector for sent in doc.sentences])
    train_sentences = scipy.sparse.vstack(train_sentences)
    
    train_sentence_vectors = []
    for doc in train_docs:
        train_sentence_vectors.append(scipy.sparse.vstack([sent.vector for sent in doc.sentences]))
    test_sentence_vectors = []
    for doc in test_docs:
        test_sentence_vectors.append(scipy.sparse.vstack([sent.vector for sent in doc.sentences]))

    for i, param_set in enumerate(params, start=1):
        print('\nExperiment :', i)
        vectorizer = fit_multi_instance_kmeans(train_sentences, param_set)
        
        transformed_train_vec = np.asarray(transform_multi_instance_kmeans(train_sentence_vectors, vectorizer, param_set))
        transformed_test_vec = np.asarray(transform_multi_instance_kmeans(test_sentence_vectors, vectorizer, param_set))
        
        for classifier_name in classifiers:
            classifier = classifiers[classifier_name]()
            classifier.fit(transformed_train_vec, train_labels)
            predictions = classifier.predict(transformed_test_vec)

            print("Model:{}, K:{}, Use distances:{}, Accuracy:{}, Precision:{}, Recall:{}"
                  .format(classifier_name,
                          param_set['k'], param_set['use_distances'],
                          round(accuracy_score(test_labels, predictions),5),
                          round(precision_score(test_labels, predictions),5),
                          round(recall_score(test_labels, predictions),5)))
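
<font size="3">
The function builds two different views of the sentence vectors: one pooled matrix of all training sentences (used to fit k-means) and one matrix per document (the bags that are transformed afterwards). The sketch below shows both views on made-up sparse vectors over a hypothetical vocabulary of 4 words.
</font>

```python
import scipy.sparse

# Hypothetical sentence count vectors (vocabulary of 4 words) for two documents.
doc_a = [scipy.sparse.csr_matrix([[1, 0, 0, 2]]), scipy.sparse.csr_matrix([[0, 1, 0, 0]])]
doc_b = [scipy.sparse.csr_matrix([[0, 0, 3, 0]])]
docs = [doc_a, doc_b]

# Pool of all sentences: this is what the k-means model would be fitted on.
all_sentences = scipy.sparse.vstack([sent for doc in docs for sent in doc])

# One matrix per document (bag): this is what gets transformed afterwards.
bags = [scipy.sparse.vstack(doc) for doc in docs]

print(all_sentences.shape)          # (3, 4)
print([bag.shape for bag in bags])  # [(2, 4), (1, 4)]
```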

## Document Level Classification 

<font size="3"> 
The function below runs the experiments on the document-level dataset with different classifiers, using one word-count vector over the vocabulary per document as input.
<br>
<br>
The function performs the following steps:
<ol>
<li>Get the dataset on document level.</li>
<li>Train each classifier and evaluate it on the test set with accuracy, precision, and recall.</li>
<li>Print the results for each classifier.</li>
</ol>
</font>

In [11]:
def experiments_document_level(classifiers, params):
    # `params` is accepted for symmetry with experiments_sentence_level but is not used here.
    train_docs, test_docs, train_labels, test_labels = get_dataset(sentence_level=False)

    train_vec = scipy.sparse.vstack([doc.vector for doc in train_docs])
    test_vec = scipy.sparse.vstack([doc.vector for doc in test_docs])
    for classifier_name in classifiers:
        classifier = classifiers[classifier_name]()
        classifier.fit(train_vec, train_labels)
        predictions = classifier.predict(test_vec)

        print("Model:{}, Accuracy:{}, Precision:{}, Recall:{}"
              .format(classifier_name,
                      round(accuracy_score(test_labels, predictions),5),
                      round(precision_score(test_labels, predictions),5),
                      round(recall_score(test_labels, predictions),5)))

## Pipeline Execution

<font size="3"> 
Global variables: the classifiers to evaluate and the k-means parameter grid.
</font>

In [12]:
classifiers = {
        "RandomForests": RandomForestClassifier,
        "LinearSVC": LinearSVC,
        "XGBoost": xgb.XGBClassifier
}

params = [{"k": 5, "use_distances": False},
        {"k": 5, "use_distances": True},
        {"k": 10, "use_distances": False},
        {"k": 10, "use_distances": True},
        {"k": 25, "use_distances": False},
        {"k": 25, "use_distances": True},
        {"k": 50, "use_distances": False},
        {"k": 50, "use_distances": True},]
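
<font size="3">
The parameter list is the full grid over four values of k and the two settings of use_distances. As a possible alternative to writing it out by hand, the grid could be generated; the names k_values, distance_options, and generated_params are introduced only for this sketch.
</font>

```python
from itertools import product

k_values = [5, 10, 25, 50]
distance_options = [False, True]

# Every combination of k and use_distances, with k varying slowest.
generated_params = [
    {"k": k, "use_distances": use_distances}
    for k, use_distances in product(k_values, distance_options)
]

print(len(generated_params))  # 8
print(generated_params[0])    # {'k': 5, 'use_distances': False}
```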

### Approach 1: Sentence Level

<font size="3"> 
Cluster the sentences of the training set and represent each document by the clusters to which its sentences belong.
</font>

In [13]:
experiments_sentence_level(classifiers, params)


Experiment : 1
Model:RandomForests, K:5, Use distances:False, Accuracy:0.60984, Precision:0.54762, Recall:0.01476
Model:LinearSVC, K:5, Use distances:False, Accuracy:0.60909, Precision:0.66667, Recall:0.00128
Model:XGBoost, K:5, Use distances:False, Accuracy:0.60984, Precision:0.54762, Recall:0.01476

Experiment : 2
Model:RandomForests, K:5, Use distances:True, Accuracy:0.57771, Precision:0.43501, Recall:0.26637
Model:LinearSVC, K:5, Use distances:True, Accuracy:0.60884, Precision:0.5, Recall:0.00064
Model:XGBoost, K:5, Use distances:True, Accuracy:0.57971, Precision:0.43224, Recall:0.23748

Experiment : 3
Model:RandomForests, K:10, Use distances:False, Accuracy:0.6003, Precision:0.46816, Recall:0.16046
Model:LinearSVC, K:10, Use distances:False, Accuracy:0.6116, Precision:0.51792, Recall:0.10205
Model:XGBoost, K:10, Use distances:False, Accuracy:0.60658, Precision:0.49072, Recall:0.15276

Experiment : 4
Model:RandomForests, K:10, Use distances:True, Accuracy:0.60331, Precision:0.4863
[... output truncated ...]

Model:XGBoost, K:25, Use distances:True, Accuracy:0.60532, Precision:0.49354, Recall:0.34339

Experiment : 7
Model:RandomForests, K:50, Use distances:False, Accuracy:0.60457, Precision:0.49079, Recall:0.29076
Model:LinearSVC, K:50, Use distances:False, Accuracy:0.61562, Precision:0.52139, Recall:0.21117
Model:XGBoost, K:50, Use distances:False, Accuracy:0.59729, Precision:0.47767, Recall:0.31579

Experiment : 8
Model:RandomForests, K:50, Use distances:True, Accuracy:0.62315, Precision:0.53696, Recall:0.26573
Model:LinearSVC, K:50, Use distances:True, Accuracy:0.60884, Precision:0.0, Recall:0.0
Model:XGBoost, K:50, Use distances:True, Accuracy:0.61813, Precision:0.51683, Recall:0.36457
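
<font size="3">
A row with both precision and recall at 0.0 is what the metrics return when a classifier never predicts the positive class. Precision is then undefined (there are no predicted positives), and scikit-learn reports it as 0 together with an UndefinedMetricWarning. The sketch below reproduces this on made-up labels; it assumes scikit-learn 0.22 or later, where precision_score accepts the zero_division parameter.
</font>

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Made-up labels: 6 negatives and 4 positives.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
# A degenerate classifier that never predicts the positive class.
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)                     # share of negatives
precision = precision_score(y_true, y_pred, zero_division=0)  # undefined, reported as 0
recall = recall_score(y_true, y_pred)                         # no positive is found

print(accuracy, precision, recall)
```

<font size="3">
In such a case the accuracy is simply the share of negative examples, so it works as a majority-class baseline for the other runs.
</font>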


### Approach 2: Document Level

<font size="3"> 
Represent each document by all of its sentences, without clustering, as a single word-count vector over the shared vocabulary.
</font>

In [14]:
experiments_document_level(classifiers, params)

Model:RandomForests, Accuracy:0.6578, Precision:0.64931, Recall:0.27214
Model:LinearSVC, Accuracy:0.61788, Precision:0.51135, Recall:0.52054
Model:XGBoost, Accuracy:0.66206, Precision:0.60794, Recall:0.38318


## Conclusion

<font size="3"> 
<b>Comparison:</b>
<br>
Among the reported runs, the document-level approach (word-count vectors) outperforms the sentence-level approach (cluster-based input). Approach 2 reaches a higher accuracy for all three models: 0.658, 0.618, and 0.662 for Random Forests, LinearSVC, and XGBoost, against at most 0.623, 0.616, and 0.618 for Approach 1. Precision and recall are also higher in most comparisons, although not in all: LinearSVC with k=5 without distances, for example, reaches a precision of 0.667 at the sentence level, but only with a recall close to zero.

<b>Approach 1 (K-Means Hyper-Parameters):</b>
<br>
In the reported runs, the choice of k and of whether to use distances mainly affects recall, while accuracy stays within a narrow band of roughly 0.58 to 0.62. With k=5 and no distances, the classifiers predict almost no positives (recall below 0.02). Larger values of k raise recall, up to 0.36 for XGBoost with k=50 and distances. The effect of distances depends on the model and on k: with k=5 they lower the accuracy of Random Forests and XGBoost but raise their recall, and with k=50 they give the best reported accuracy for both (0.623 and 0.618). LinearSVC does not benefit from distances: with k=5 its recall is almost zero, and with k=50 it predicts no positives at all (precision and recall of 0.0). Its best reported accuracy (0.616) is obtained with k=50 without distances.
    
Overall, the performance of the sentence-level models is low: accuracy is around 60%, precision lies mostly between 43% and 55%, and recall never exceeds about 36%. A classifier that predicts no positives already reaches an accuracy of 0.609 (the LinearSVC run with k=50 and distances), so most configurations barely improve on the majority-class baseline. This suggests that factors other than the choice of k and the use of distances are limiting the performance of the models.
    

</font>