# Supervised Classifier 
# Categorize text with pre-defined labels

In case of short texts, as metadata records, the best approach is to build up a hierarchy of pre-defined words
related to the topic and assign each text to those categories. 

The approach in this case is the following:
1. run an unsupervised classifier for short texts to obtain several topics
2. build a similarity matrix between each set of expert labels and the obtained topics
3. add the similarities between each text and its topic to the matrix
4. for each topic, arrange the results in descending order, based on the similarity


## Unsupervised classification

Regarding unsupervised classification, one of the most common techniques is Latent semantic analysis, which creates vector representations of documents. It takes the list of documents as the input corpus and it computes similarities as the distance between vectors. The first step in LSA is to build a term frequency-inverse document frequency (tf-idf) where each position in the vector corresponds to a different word and a document is represented by the number of times each word appears. So, the most important words will be the ones that appear the most often in the documents. In order to make the process better, the LSA algorithms improve the process by also considering synonymity between words.

In this case, LSA is not enough for short texts, where the words related to the topic can occur only once or twice in the text. Generally, the technical words are not used often in the same paragraph and they are usually ignored by the LSA algorithm. Even if the stop words are removed. there are still English words in text that occur more often. Even if the unsupervised classifier doesn't bring the best results, it is used as an intermediate step to get the final similarity. Beside the topics, it also returns a matrix of similarity between each document and each topic.

Considering:
- N = total number of documents in corpus
- T1 = total number of automatic topics

The results to be kept are the top words for each topic and the matrix of similarity between the documents and the topics of size N x T1.

The number of topics is set to a pre-defined number, but the algorithm may find a lower number and return the last topics empty.

T = the number of topics obtained as a result, it may be T or less

 
## Build similarity matrix between topics and pre-defined labels

Notation:
- tw = number of words per topic (set to 30 in this case)
- lw = number of words per pre-defined label

The next step is to build a classification matrix between the labels and the topics that we obtained at the above step.
For each topic, we considered a list of tw words. For each word in label and for each word in topic, we compute the similarity, using the cosine distance of the lanugage model obtained as prerequisite.

So, for each topic, we obtain a matrix of size tw x lw, containing the similarities. We save, from each line, the maximum value and we will obtain a vector of tw entries. The final similarity will be computes as the magnitude of the array:
w = math.sqrt((tw1)^2 + (tw2)^2 + ... + (twn)^2) / tw


## Add the similarities between each text and its topic to the matrix

For each document, we have a list of similarity to each automatic topic, meaning an array of length T2 
sim_D_T = [sdt1, sdt2, .. sdtT2], where the sum of elements is 1
For each label and topic we have a similarity array:
sim_T_L = [slt1, slt2, .. sltL]

In order to compute the similarity between document and pre-defined label, we apply the following formula:
sim_D_L = sim_D_T * sim_T_L

Then, the results are analysed per topic. The maximum value is selected and all the values in the corresponding column are divided by it. The entries are then saved in files, in order of relevance, together with the index.

In [2]:
%run "Common Defines Biologic Process.ipynb"
%run "NLP_clustering.ipynb"
%run "Predefined Labels - Biologic Process.ipynb"


from operator import itemgetter

ERROR:root:File `'__imports__.ipynb.py'` not found.


NameError: name 'LANGUAGE_MODEL' is not defined

In [5]:
# The files where the similarities can be saved for further testing based on the keywords
# These files contain the similarity matrix between each entry in the database and each pre-defined label
# and can be then used to get similarities between keywords and documents

SIM_MATRIX_FILE = "biologic_process_similarity_matrix.csv"
ID_LIST_FILE = "biologic_process_id_list.txt"


In [6]:
# The language model that will be used
# It can be initialized only once and then will be stored in memory for further uses

model, index2word_set = init_language_model("biogenesis.bin")

model.init_sims(replace=True)

# The langugage model can be tested on several words to check if it runs correctly
print(model.wv.most_similar("cell"))
print(model.wv.most_similar("adhesion"))

biogenesis.bin
[('tissue', 0.9093469381332397), ('human', 0.8972060680389404), ('vessel', 0.8961011171340942), ('cellular', 0.8946257829666138), ('protein', 0.8868533372879028), ('unraveled', 0.8854694366455078), ('expression', 0.8852810859680176), ('gene', 0.8829666376113892), ('molecular', 0.877745509147644), ('signal', 0.8768489360809326)]
[('proliferation', 0.9623371362686157), ('differentiation', 0.9575070142745972), ('unraveled', 0.932964563369751), ('expression', 0.9228825569152832), ('cellular', 0.9145495295524597), ('embryonic', 0.9032720923423767), ('migration', 0.9003199338912964), ('receptive', 0.8990265727043152), ('endothelial', 0.8990079164505005), ('response', 0.8969208598136902)]


  if __name__ == '__main__':
  # Remove the CWD from sys.path while we load stuff.


In [20]:
def get_topic_similarity(topic_words_avg, sent_words):
    
    sent_words_avg = avg_feature_vector((' '.join(sent_words)), model, num_features=NUM_FEATURES, index2word_set=index2word_set)
    return 1 - spatial.distance.cosine(sent_words_avg, topic_words_avg)


def cosine_similarity_compare (label_words, topic_words):
    A = [[0 for x in range(len(label_words))] for y in range(len(topic_words))]
    for i in range(len(topic_words)):
        tw = topic_words[i]
        if tw in model.wv:
            for j in range(0, len(label_words)):
                lw = label_words[j]
                if lw in model.wv:
                    A[i][j] = 1 - spatial.distance.cosine(model.wv[tw], model.wv[lw])

    maxa = np.max(A, axis = -1, initial = 0)
    weight = 0.0

    if isinstance(maxa, float):
        weight = maxa
    else:
        for i in range(len(maxa)):
            weight += math.pow(maxa[i], 2)

        weight = math.sqrt(weight) / len(maxa)

    return weight


In [34]:
 if __name__ == "__main__":

    titles = []
    urls = []
    abstracts = []
    ids = []
    abbr_lower = [abbr.lower() for abbr in list(abbreviations.keys())]
    original_titles = []
    
    # get the abstracts from the files 
    for file in os.listdir("biologic_process_abstracts"):
         with open("biologic_process_abstracts\\" + file, "r") as f:
                data = json.load(f)
                if not "abstract" in data:
                    continue
                ids.append(file)
                title = data["title"]
                original_titles.append(title)
                title = prepareDescription(title, keepwords, abbr_lower)
                title = replace_abbrevations(title, abbreviations)
                titles.append(title)
                
                urls.append(data["url"])
                
                abstract = data["abstract"]
                abstract = prepareDescription(abstract, keepwords, abbr_lower)
                abstract = replace_abbrevations(abstract, abbreviations)
                abstracts.append(abstract)

    title_sim = [ [0 for i in range(NUM_LABELS_BIO)] for j in range(len(titles))]
    title_id = 0

    label_words = [cellular_process_info, development_info, physiological_process_info]

    # compare each title to the theme and domain lists
    # and obtain the similarity
    
    # in this case, we have 3 subdomains, so we will compute similarities between titles and 
    # each of these
    
    for title in titles:
        domain_sim = [0] * NUM_LABELS_BIO
        
        theme_sim = cosine_similarity_compare(themes, title.split())
        
        i = 0
        for domain in domains:
            domain_sim[i] = cosine_similarity_compare(domain.split(), title.split())
            i +=1
            
        for i in range(NUM_LABELS_BIO):
            title_sim[title_id][i] += cosine_similarity_compare(label_words[i], title)

        for i in range(NUM_LABELS_BIO):
            title_sim[title_id][i] = (title_sim[title_id][i] + domain_sim[i] + theme_sim) / 3

        title_id += 1

    abs_content = [''] * len(titles)

    for i in range(len(titles)):
        # concat title + abstract, then
        # remove words duplicates, this will show some better results
        abs_content[i] = (list(dict.fromkeys((titles[i] + ' ' + abstracts[i]).split())))
        
    T=10
    mgp = MovieGroupProcess(K=T, alpha=0.1, beta=0.1, n_iters=30)
    vocab = set(x for doc in abs_content for x in doc)
    n_terms = len(vocab)
    y = mgp.fit(abs_content, n_terms)
    

    # Save model
    with open("sttm_v1_bio.model", "wb") as f:
        pickle.dump(mgp, f)
        f.close()
    
    doc_count = np.array(mgp.cluster_doc_count)
    print('Number of documents per topic :', doc_count)
    print('*'*20)# Topics sorted by the number of document they are allocated to
    top_index = doc_count.argsort()[(-1*T):][::-1]
    
    print('*'*20)# Show the top 5 words in term frequency for each cluster
    count = 0
    
    sims_T_L = [ [0 for i in range(NUM_LABELS_BIO)] for j in range(len(mgp.cluster_word_distribution))]
    topics = []
    
    for cluster_dict_per_topic in mgp.cluster_word_distribution:
        counter = Counter(cluster_dict_per_topic)

        high = counter.most_common(30)

        if high == []:
            continue

        topic_words = [x[0] for x in high]
        # the most common words, how are they connected to each predefined topic?
        for i in range(len(label_words)):
            sims_T_L[count][i] = cosine_similarity_compare(label_words[i], topic_words)

        topics.extend(topic_words)
        count += 1
    
    T2 = count
    
    count = 0

    sims_D_T = [ [0 for i in range(T2)] for j in range(len(titles))]
    
    for doc in abs_content:
        sims_D_T[count] = mgp.score(doc)
        count += 1

    # multiply matrices
    sims_D_T = np.array(sims_D_T)
    sims_T_L = np.array(sims_T_L)
    sims_D_L = np.zeros([np.size(sims_D_T, 0), np.size(sims_T_L, 1)])
    
    for i in range(np.size(sims_D_T, 0)):
        for j in range(np.size(sims_T_L, 1)):
            if np.count_nonzero(sims_D_T[i, :]) > 0 and np.count_nonzero(sims_T_L[:, j]) > 0:
                sims_D_L[i][j] = np.matmul(sims_D_T[i, :], sims_T_L[:, j])
                #spatial.distance.cosine(sims_D_T[i, :], sims_T_L[:, j])
            else:
                sims_D_L[i][j] = 0

    no_lines = len(sims_D_L)
    no_cols = len(sims_D_L[0])

    #which are the words from the concepts which are not considered by the automatic features
    topics = list(dict.fromkeys(topics))
    
    sim_labels = []
    i = 0
    for label in label_words:
        label_list = [w for w in label if w not in topics]
        sim_labels.append(label_list)
        i += 1

    i = 0
    sims_D_L_2 = [ [0 for i in range(NUM_LABELS_BIO)] for j in range(len(abs_content))]

    # compare these to the corpus and gather similarities
    for count in range(len(abs_content)):
        sentence  = abs_content[count]
        if sentence != []:
            for i in range(len(sim_labels)):
                label = sim_labels[i]
                sims_D_L_2[count][i] = cosine_similarity_compare(label, sentence)

    i = 0
    
    # how many lines and columns for sim_D_L_2
    sims_D_L_2_lines = len(sims_D_L_2)
    sims_D_L_2_cols = len(sims_D_L_2[0])

    for i in range(len(sims_D_L)):
        maxv = max(sims_D_L[i])
        sims_D_L[i] = [x / maxv for x in np.array(sims_D_L[i])]
    
    with open(SIM_MATRIX_FILE, "w") as f:
            writer = csv.writer(f)
            [writer.writerow(r) for r in sims_D_L]
    
    with open(ID_LIST_FILE, "w") as f:
        for idname in ids:
            f.write(idname + "\n")
    

    sim_values_trans = np.array(sims_D_L).transpose()

    # print best resources for each topic
    cnt = 0
    for line in sim_values_trans:
        arr = np.array(line)
        idx = arr.argsort()[-len(ids):][::-1]
        count = 0
        cnt += 1
        with open("results_file" + str(cnt) + ".txt", "w") as f:
            for i in idx:
                count += 1
                if (count > 10):
                    break
                if (arr[i] > 0):
                    f.write(original_titles[i] + ' ' + urls[i] + '\n')


  # This is added back by InteractiveShellApp.init_path()
  
  from ipykernel import kernelapp as app


In stage 0: transferred 752 clusters with 10 clusters populated
In stage 1: transferred 423 clusters with 10 clusters populated
In stage 2: transferred 185 clusters with 10 clusters populated
In stage 3: transferred 92 clusters with 10 clusters populated
In stage 4: transferred 71 clusters with 10 clusters populated
In stage 5: transferred 55 clusters with 10 clusters populated
In stage 6: transferred 41 clusters with 10 clusters populated
In stage 7: transferred 38 clusters with 10 clusters populated
In stage 8: transferred 40 clusters with 10 clusters populated
In stage 9: transferred 46 clusters with 10 clusters populated
In stage 10: transferred 49 clusters with 10 clusters populated
In stage 11: transferred 44 clusters with 10 clusters populated
In stage 12: transferred 45 clusters with 10 clusters populated
In stage 13: transferred 52 clusters with 10 clusters populated
In stage 14: transferred 34 clusters with 10 clusters populated
In stage 15: transferred 37 clusters with 10 cl

UnicodeEncodeError: 'charmap' codec can't encode character '\u2010' in position 2: character maps to <undefined>