# Supervised Classifier 
# Categorize text with pre-defined labels

For short texts such as metadata records, the approach that works best is to build a hierarchy of pre-defined words
related to the topic and to assign each text to those categories.

The approach in this case is the following:
1. run an unsupervised classifier for short texts to obtain several topics
2. build a similarity matrix between each set of expert labels and the obtained topics
3. add the similarities between each text and its topic to the matrix
4. for each topic, arrange the results in descending order, based on the similarity


## Unsupervised classification

For unsupervised classification, one of the most common techniques is latent semantic analysis (LSA), which creates vector representations of documents. It takes the list of documents as the input corpus and computes similarities as the distance between vectors. The first step in LSA is to build a term frequency-inverse document frequency (tf-idf) matrix, where each position in a document's vector corresponds to a different word and holds a weight based on the number of times that word appears in the document, scaled down for words that are common across the whole corpus. The most important words are therefore the ones that appear often in a document without appearing in most of the others. LSA then reduces the dimensionality of this matrix, which also lets it capture synonymy between words.
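
The tf-idf weighting described above can be sketched in a few lines. This is only an illustration: it assumes the plain `tf * log(N / df)` variant (libraries use several others), the function name `tf_idf` and the toy corpus are invented for the example, and it leaves out the dimensionality reduction that LSA applies afterwards.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each word occurs
    df = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts inside this document
        weights.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return weights


# Toy corpus (assumption, not taken from the catalogue)
docs = [
    "solar irradiance time series".split(),
    "solar radiation map".split(),
    "wind speed time series".split(),
]
for doc_weights in tf_idf(docs):
    print(doc_weights)
```

In the first toy document, "irradiance" (present in one document only) receives a higher weight than "solar" (present in two), even though each occurs once. With texts this short the term counts carry very little signal, which is the limitation discussed next.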

In this case, LSA is not enough for short texts, where the words related to the topic may occur only once or twice. Technical words are generally not repeated within the same paragraph, so the LSA algorithm tends to ignore them. Even if the stop words are removed, other common English words still occur more often in the text. Although the unsupervised classifier does not give the best results on its own, it is used as an intermediate step to get the final similarity. Besides the topics, it also returns a similarity matrix between each document and each topic.

Considering:
- N = total number of documents in corpus
- T1 = total number of automatic topics

The results to be kept are the top words for each topic and the matrix of similarity between the documents and the topics of size N x T1.

The number of topics is set to a pre-defined number (T1), but the algorithm may find fewer and return the remaining topics empty.

- T2 = the number of non-empty topics actually obtained; it may be T1 or less
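
As a rough sketch of this step, the snippet below keeps the top words of every non-empty topic and counts how many topics were actually filled. The input format (one `{word: count}` dict per topic) and the name `top_words_per_topic` are assumptions made for the illustration, and the counts are invented.

```python
from collections import Counter


def top_words_per_topic(cluster_word_counts, n_words=30):
    """Map topic index -> its n_words most frequent words, skipping empty topics."""
    top = {}
    for topic_id, word_counts in enumerate(cluster_word_counts):
        if not word_counts:  # the algorithm left this topic empty
            continue
        top[topic_id] = [w for w, _ in Counter(word_counts).most_common(n_words)]
    return top


# Hypothetical result of asking for 4 topics when only 2 were filled
cluster_word_counts = [
    {"solar": 5, "irradiance": 3, "map": 1},
    {},
    {"wind": 4, "speed": 2},
    {},
]
top = top_words_per_topic(cluster_word_counts, n_words=2)
print(top)       # {0: ['solar', 'irradiance'], 2: ['wind', 'speed']}
print(len(top))  # 2 non-empty topics
```

Keeping the original topic index as the key makes it possible to line the top words up with the columns of the document-topic matrix later on.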

 
## Build similarity matrix between topics and pre-defined labels

Notation:
- tw = number of words per topic (set to 30 in this case)
- lw = number of words per pre-defined label

The next step is to build a similarity matrix between the labels and the topics obtained in the previous step.
For each topic, we consider a list of tw words. For each word in a label and each word in the topic, we compute the cosine similarity between their vectors in the language model loaded as a prerequisite.

So, for each topic and label, we obtain a matrix of size tw x lw containing the similarities. From each row we keep the maximum value, which gives a vector of tw entries, $m_1, \dots, m_{tw}$. The final similarity is computed as the magnitude of this vector divided by tw:

$$w = \frac{\sqrt{m_1^2 + m_2^2 + \dots + m_{tw}^2}}{tw}$$
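
A minimal sketch of this weight, assuming the word vectors are already available as plain lists of floats. The helper names and the 2-dimensional toy vectors are invented for the illustration; the row maxima are floored at 0, as the notebook's own implementation does by starting from a zero matrix.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def label_topic_weight(topic_vectors, label_vectors):
    """Magnitude of the row-wise maxima of the tw x lw similarity matrix, over tw."""
    if not topic_vectors or not label_vectors:
        return 0.0
    # One entry per topic word: its best match among the label words, floored at 0
    row_max = [
        max([0.0] + [cosine_similarity(t, l) for l in label_vectors])
        for t in topic_vectors
    ]
    return math.sqrt(sum(m * m for m in row_max)) / len(row_max)


# Toy 2-d "embeddings": the first topic word matches a label word exactly,
# the second is orthogonal to both label words
topic_vectors = [[1.0, 0.0], [0.0, 1.0]]
label_vectors = [[1.0, 0.0], [-1.0, 0.0]]
print(label_topic_weight(topic_vectors, label_vectors))  # 0.5
```

Here the row maxima are 1 and 0, so the weight is sqrt(1) / 2 = 0.5.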


## Add the similarities between each text and its topic to the matrix

For each document, we have a list of similarities to each automatic topic, that is, an array of length T2:

sim_D_T = [sdt1, sdt2, .., sdtT2], where the sum of the elements is 1

For each topic, we have an array of similarities to each of the L pre-defined labels:

sim_T_L = [slt1, slt2, .., sltL]

To compute the similarity between the documents and the pre-defined labels, we stack these arrays into two matrices (of size N x T2 and T2 x L) and multiply them:

sim_D_L = sim_D_T * sim_T_L
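
The product can be illustrated with a small worked example. The numbers below are invented (2 documents, 2 topics, 3 labels) and `mat_mul` is a hypothetical pure-Python helper; with NumPy the same result would come from `np.matmul`.

```python
def mat_mul(a, b):
    """Multiply an n x k matrix by a k x m matrix, both given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# 2 documents x 2 topics; each row sums to 1
sim_D_T = [[0.75, 0.25],
           [0.0, 1.0]]
# 2 topics x 3 labels
sim_T_L = [[0.5, 0.0, 0.25],
           [0.0, 1.0, 0.25]]

sim_D_L = mat_mul(sim_D_T, sim_T_L)
print(sim_D_L)  # [[0.375, 0.25, 0.25], [0.0, 1.0, 0.25]]
```

Each entry is a weighted average of the topic-label similarities, with the document's topic probabilities as weights: the first document is 75% topic 1, so it inherits 0.75 * 0.5 = 0.375 for the first label.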

Then, the results are analysed per topic. The maximum value is selected and all the values in the corresponding column are divided by it. The entries are then saved in files, in order of relevance, together with the index.
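
The normalization and ranking can be sketched as follows, assuming the similarities are held in a list of rows (one row per document, one column per label). The function names and the values are made up for the illustration.

```python
def normalize_columns(matrix):
    """Divide every column by its maximum; all-zero columns are left at 0."""
    col_max = [max(col) for col in zip(*matrix)]
    return [
        [value / m if m > 0 else 0.0 for value, m in zip(row, col_max)]
        for row in matrix
    ]


def rank_documents(matrix, label_index):
    """Return (document index, score) pairs for one label, best first."""
    scores = [(doc, row[label_index]) for doc, row in enumerate(matrix)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# 3 documents x 2 labels; no document matches the second label
sim_D_L = [[0.25, 0.0],
           [0.5, 0.0],
           [0.125, 0.0]]
normalized = normalize_columns(sim_D_L)
print(rank_documents(normalized, 0))  # [(1, 1.0), (0, 0.5), (2, 0.25)]
```

Guarding against a zero maximum matters in practice: a label that matches no document would otherwise cause a division by zero.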

In [None]:
%run "Common Defines.ipynb"
%run "NLP_clustering.ipynb"
%run "Predefined Labels.ipynb"


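# Imports used in the cells below (harmless if the notebooks run above already provide them)
import csv, math, os, pickle, re, time, urllib.request
from collections import Counter
from operator import itemgetter

import numpy as np
from scipy import spatial
from owslib import fes
from owslib.csw import CatalogueServiceWeb
from gsdmm import MovieGroupProcess  # assumes the `gsdmm` short-text topic modelling package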

In [None]:
# The files where the similarities can be saved for further testing based on the keywords
# These files contain the similarity matrix between each entry in the database and each pre-defined label
# and can be then used to get similarities between keywords and documents

SIM_MATRIX_FILE = "geocatalogue_similarity_matrix.csv"
ID_LIST_FILE = "geocatalogue_id_list.txt"


In [None]:
# The language model that will be used
# It can be initialized only once and then will be stored in memory for further uses

model, index2word_set = init_language_model()

model.init_sims(replace=True)

# The language model can be tested on several words to check if it runs correctly
print(model.wv.most_similar("sky"))
print(model.wv.most_similar("downwelling"))

In [None]:
def get_topic_similarity(topic_words_avg, sent_words):
    
    sent_words_avg = avg_feature_vector((' '.join(sent_words)), model, num_features=NUM_FEATURES, index2word_set=index2word_set)
    return 1 - spatial.distance.cosine(sent_words_avg, topic_words_avg)


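def cosine_similarity_compare(label_words, topic_words):
    """Score how well a list of topic words matches a list of label words.

    Builds the len(topic_words) x len(label_words) matrix of cosine
    similarities, keeps the best match for each topic word, and returns
    the magnitude of that vector divided by its length.
    """
    # Words that are missing from the language model keep a similarity of 0
    A = [[0 for x in range(len(label_words))] for y in range(len(topic_words))]
    for i in range(len(topic_words)):
        tw = topic_words[i]
        if tw in model.wv:
            for j in range(0, len(label_words)):
                lw = label_words[j]
                if lw in model.wv:
                    A[i][j] = 1 - spatial.distance.cosine(model.wv[tw], model.wv[lw])

    # Best label match for each topic word, floored at 0
    maxa = np.max(A, axis = -1, initial = 0)
    weight = 0.0

    if isinstance(maxa, float):
        # No topic words: np.max returned a scalar
        weight = maxa
    else:
        for i in range(len(maxa)):
            weight += math.pow(maxa[i], 2)

        weight = math.sqrt(weight) / len(maxa)

    return weight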


In [None]:
if __name__ == "__main__":
    csw = CatalogueServiceWeb('http://geocatalog.webservice-energy.org/geonetwork/srv/eng/csw')
    set_title = fes.PropertyIsLike('any', '')#'solar observations')
    filter_list = [set_title]

    csw.getrecords2(constraints=filter_list, maxrecords=2000)

    fmt = '{:*^64}'.format
    print(fmt(' Catalog information '))
    print("CSW version: {}".format(csw.version))
    print("Number of datasets available: {}".format(len(csw.records.keys())))
    print('\n')

    original_list_of_titles = []
    identifiers = []
    preprocessed_list_of_titles = []

    abbr_lower = [abbr.lower() for abbr in list(abbreviations.keys())]

    for rec in csw.records:
        original_list_of_titles.append(csw.records[rec].title)
        identifiers.append(csw.records[rec].identifier)
        title = csw.records[rec].title
        
        title = prepareDescription(title, keepwords, abbr_lower)
        title = replace_abbrevations(title, abbreviations)

        #remove words duplicates, maybe this will show some better results
        title =  ' '.join(list(dict.fromkeys(title.split())))
        preprocessed_list_of_titles.append(title.split())


    title_sim = [ [0 for i in range(NUM_LABELS)] for j in range(len(preprocessed_list_of_titles))]
    title_id = 0
    label_words = [time_series_solar_resources, atmosphere_meteorology,
                   ground_topography, meteorological_year, solar_potential]
    
    #compare all the titles to themes, domains and sub_domains
    for title in preprocessed_list_of_titles:
        domain_sim = 0
        
        theme_sim = cosine_similarity_compare(themes_title, title)
        domain_sim += cosine_similarity_compare(domains_title, title)
            
        # which of these is the most suitable sub_domain
        i = 0
        for item in sub_domains:
            title_sim[title_id][i] = cosine_similarity_compare(item.split(), title)
            i += 1

        for i in range(NUM_LABELS):
            title_sim[title_id][i] += cosine_similarity_compare(label_words[i], title)

        for i in range(NUM_LABELS):
            title_sim[title_id][i] += cosine_similarity_compare(info[i].split(), title)
            
        for i in range(NUM_LABELS):
            title_sim[title_id][i] = (title_sim[title_id][i]) / 3# + domain_sim + theme_sim) / 5

        title_id += 1

    print("-------------------------------------------")
    print(title_sim)
    print("-------------------------------------------")

    # add abstract and keywords for comparisons
    preprocessed_list_of_titles = []

    for rec in csw.records:
        title = csw.records[rec].title
        
        if csw.records[rec].abstract is not None:
            title = title + " " + csw.records[rec].abstract

        if len(csw.records[rec].subjects) > 0 and csw.records[rec].subjects != [None]:
            keywords_set = build_keywords_set(csw.records[rec].subjects, [])
            if len(keywords_set) > 0:
                # keep a space so the first keyword is not glued to the preceding word
                title = title + " " + ' '.join(keywords_set)

        title = prepareDescription(title, keepwords, abbr_lower)
        title = replace_abbrevations(title, abbreviations)

        #remove words duplicates, maybe this will show some better results
        title =  ' '.join(list(dict.fromkeys(title.split())))
        preprocessed_list_of_titles.append(title.split())


    T=10
    mgp = MovieGroupProcess(K=T, alpha=0.1, beta=0.1, n_iters=30)
    vocab = set(x for doc in preprocessed_list_of_titles for x in doc)
    n_terms = len(vocab)
    y = mgp.fit(preprocessed_list_of_titles, n_terms)
    
    
    # Save model
    with open("sttm_v1.model", "wb") as f:
        pickle.dump(mgp, f)
    
    doc_count = np.array(mgp.cluster_doc_count)
    print('Number of documents per topic :', doc_count)
    print('*'*20)# Topics sorted by the number of document they are allocated to
    top_index = doc_count.argsort()[(-1*T):][::-1]
    
    print('*'*20)# Show the top 5 words in term frequency for each cluster
    
    
    label_words_measures = [time_series_solar_resources_measures, atmosphere_meteorology_measures,
                   ground_topography_measures, meteorological_year_measures, solar_potential_measures]
    
    sims_T_L = [ [0 for i in range(NUM_LABELS)] for j in range(len(mgp.cluster_word_distribution))]
    

    count = 0
    
    topics = []
    
    for topic_id, cluster_dict_per_topic in enumerate(mgp.cluster_word_distribution):
        counter = Counter(cluster_dict_per_topic)

        high = counter.most_common(30)

        if high == []:
            # empty topic: its row in sims_T_L stays at 0
            continue

        topic_words = [x[0] for x in high]
        # the most common words, how are they connected to each predefined topic?
        # rows are indexed by topic_id so they line up with the columns of mgp.score(doc)
        for i in range(len(label_words)):
            sims_T_L[topic_id][i] = cosine_similarity_compare(label_words[i], topic_words)

        topics.extend(topic_words)
        count += 1
    
    T2 = count
    
    count = 0
    sims_D_T = [ [0 for i in range(T2)] for j in range(len(preprocessed_list_of_titles))]
    
    for doc in preprocessed_list_of_titles:
        sims_D_T[count] = mgp.score(doc)
        count += 1

    print("----------------------------------multiplied by:")
    print(sims_T_L)
    print("----------------------------------")

    """
    print("----------------------------------multiplied by:")
    print(sims_D_T)
    print("----------------------------------")
    """

    # multiply matrices: (documents x topics) @ (topics x labels)
    sims_D_T = np.array(sims_D_T)
    sims_T_L = np.array(sims_T_L)
    sims_D_L = np.matmul(sims_D_T, sims_T_L)

    no_lines = len(sims_D_L)
    no_cols = len(sims_D_L[0])

    """
    print("----------------------------------the result is:")
    print(sims_D_L)
    print("----------------------------------")
    """

    #which are the words from the concepts which are not considered by the automatic features
    topics = list(dict.fromkeys(topics))
    print("set of topics: ", topics)
    
    sim_labels = []
    i = 0
    for label in label_words:
        label_list = [w for w in label if w not in topics]
        sim_labels.append(label_list)
        i += 1

    print("sim_labels:")
    print(sim_labels)
    
    i = 0
    sims_D_L_2 = [ [0 for i in range(NUM_LABELS)] for j in range(len(preprocessed_list_of_titles))]

    # compare these to the corpus and gather similarities
    for count in range(len(preprocessed_list_of_titles)):
        sentence  = preprocessed_list_of_titles[count]
        if sentence != []:
            for i in range(len(sim_labels)):
                label = sim_labels[i]
                sims_D_L_2[count][i] = cosine_similarity_compare(label, sentence)

    i = 0
    sims_D_L_3 = [ [0 for i in range(NUM_LABELS)] for j in range(len(preprocessed_list_of_titles))]

    # compare the measures labels to the corpus and gather similarities
    for count in range(len(preprocessed_list_of_titles)):
        sentence  = preprocessed_list_of_titles[count]
        if sentence != []:
            for i in range(len(label_words_measures)):
                label = label_words_measures[i]
                sims_D_L_3[count][i] = cosine_similarity_compare(label, sentence)
    
    # how many lines and columns for sim_D_L_2
    sims_D_L_2_lines = len(sims_D_L_2)
    sims_D_L_2_cols = len(sims_D_L_2[0])

    """
    A = np.matrix.flatten(np.matrix(sims_D_L_2))
    A = np.mat(A)
    A = preprocessing.normalize(A)
    sims_D_L_2 = A.reshape(sims_D_L_2_lines, sims_D_L_2_cols)
    """
    
    print("first matrix: ")
    print(sims_D_L)
    
    print("second matrix: ")
    print(sims_D_L_2)


    for i in range(sims_D_L_2_lines):
        for j in range(sims_D_L_2_cols):
            sims_D_L[i][j] = (title_sim[i][j] * 3 + sims_D_L[i][j] * 3 + sims_D_L_2[i][j] * 2 + sims_D_L_3[i][j] * 2) / 10
            if sims_D_L[i][j] > 1:
                print("found a value greater than 1:", sims_D_L[i][j])

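    # scale each row so that its best label has similarity 1
    for i in range(len(sims_D_L)):
        maxv = max(sims_D_L[i])
        if maxv > 0:  # leave all-zero rows untouched instead of dividing by zero
            sims_D_L[i] = [x / maxv for x in np.array(sims_D_L[i])]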
    
    with open(SIM_MATRIX_FILE, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerows(sims_D_L)
    
    with open(ID_LIST_FILE, "w") as f:
        for idname in identifiers:
            f.write(idname + "\n")
    

    print("----------------------- final matrix")
    print(sims_D_L)
    print("-----------------------------------------")
    sim_values_trans = np.array(sims_D_L).transpose()

    # print best resources for each topic
    cnt = 0
    for line in sim_values_trans:
        arr = np.array(line)
        idx = arr.argsort()[-len(identifiers):][::-1]
        count = 0
        with open("results_file" + str(cnt) + ".txt", "w") as f:
            for i in idx:
                count += 1
                if (count > 10):
                    break

                if arr[i] > 0:
                    # no trailing newline here: control characters are not valid in a URL
                    linktext = "http://geocatalog.webservice-energy.org/geonetwork/srv/eng/main.search.embedded?any=" + str(identifiers[i]) + "&dummyfield=&northBL=&westBL=&eastBL=&southBL=&relation=overlaps&region_simple=&sortBy=relevance&sortOrder=&hitsPerPage=10&output=full"
                    resource = urllib.request.urlopen(linktext)
                    # fall back to UTF-8 when the server does not declare a charset
                    content = resource.read().decode(resource.headers.get_content_charset() or "utf-8")

                    if "metadata.show" in content:
                        m = re.search(r'metadata\.show\?id=([0-9]+)', content)
                        if m:
                            found = m.group(1)
                            linktext = "http://geocatalog.webservice-energy.org/geonetwork/srv/eng/metadata.show?id=" + found + "&currTab=simple"
                            f.write(str(arr[i]) + " " + linktext + "\n")
                            #print(str(arr[i]) + " " + linktext)
                        else:
                            linktext = "http://geocatalog.webservice-energy.org/geonetwork/srv/eng/csw?REQUEST=GetRecordById&id=" + str(identifiers[i]) + "&SERVICE=CSW&VERSION=2.0.2"
                            f.write(str(arr[i]) + " " + linktext + "\n")
                            #print(str(arr[i]) + " " + linktext)

        cnt += 1
        print('*'*44)

In [None]:
# Open the CSV file
# and compute similarity <label> - <query>

if __name__ == "__main__":
    with open(SIM_MATRIX_FILE, 'r') as csvfile:
        reader = csv.reader(csvfile)
        similarity_matrix =  [[float(e) for e in r] for r in reader]
        
    search_query = "solar pond"
    
    csw = CatalogueServiceWeb('http://geocatalog.webservice-energy.org/geonetwork/srv/eng/csw')
    set_title = fes.PropertyIsLike('any', search_query)
    filter_list = [set_title]

    csw.getrecords2(constraints=filter_list, maxrecords=2000)

    fmt = '{:*^64}'.format
    print(fmt(' Catalog information '))
    print("CSW version: {}".format(csw.version))
    print("Number of datasets available: {}".format(len(csw.records.keys())))
    print('\n')

    original_list_of_titles = []
    identifiers = []
    preprocessed_list_of_titles = []

    abbr_lower = [abbr.lower() for abbr in list(abbreviations.keys())]

    for rec in csw.records:
        original_list_of_titles.append(csw.records[rec].title)
        identifiers.append(csw.records[rec].identifier)
        title = csw.records[rec].title
        
        if csw.records[rec].abstract is not None:
            title = title + " " + csw.records[rec].abstract

        if len(csw.records[rec].subjects) > 0 and csw.records[rec].subjects != [None]:
            keywords_set = build_keywords_set(csw.records[rec].subjects, [])
            if len(keywords_set) > 0:
                # keep a space so the first keyword is not glued to the preceding word
                title = title + " " + ' '.join(keywords_set)

        title = prepareDescription(title, keepwords, abbr_lower)
        title = replace_abbrevations(title, abbreviations)
        
        #remove words duplicates, maybe this will show some better results
        title =  ' '.join(list(dict.fromkeys(title.split())))
        preprocessed_list_of_titles.append(title.split())

    # read the matrix from the file and get the similarities
    with open(SIM_MATRIX_FILE, 'r') as csvfile:
        reader = csv.reader(csvfile)
        similarity_matrix = [[float(e) for e in r] for r in reader]

    with open(ID_LIST_FILE, 'r') as f:
        all_identifiers = [line.strip() for line in f]
    
    # build similarity matrix <query> <label>
    
    sims = [ [0 for i in range(NUM_LABELS)] for j in range(len(mgp.cluster_word_distribution))]

    label_words = [time_series_solar_resources, atmosphere_meteorology,
                   ground_topography, meteorological_year, solar_potential_measures]

    query_topic_sims = []
    for label in label_words:
        query_topic_sims.append(cosine_similarity_compare(label, search_query.split()))
    
    print("???????????")
    print(query_topic_sims)
    print("???????????")
        
    print("topic sims: ", query_topic_sims)
    
    sim_tuples = []
    
    for index, id_res in enumerate(all_identifiers):
        if id_res not in identifiers:
            # if it's not in the result list, check the similarity between the article and each topic
            sims = similarity_matrix[index]
            
            sim_tuples.append((id_res, float(np.dot(query_topic_sims, sims))))
    
    
    sim_tuples.sort(key=itemgetter(1), reverse = True)
    
    count = 0
    
    for idx, value in sim_tuples:
        # no trailing newline here: control characters are not valid in a URL
        linktext = "http://geocatalog.webservice-energy.org/geonetwork/srv/eng/main.search.embedded?any=" + idx + "&dummyfield=&northBL=&westBL=&eastBL=&southBL=&relation=overlaps&region_simple=&sortBy=relevance&sortOrder=&hitsPerPage=10&output=full"
        resource = urllib.request.urlopen(linktext)
        # fall back to UTF-8 when the server does not declare a charset
        content = resource.read().decode(resource.headers.get_content_charset() or "utf-8")
        if "metadata.show" in content:
            m = re.search(r'metadata\.show\?id=([0-9]+)', content)
            if m:
                found = m.group(1)
                linktext = "http://geocatalog.webservice-energy.org/geonetwork/srv/eng/metadata.show?id=" + found + "&currTab=simple"
                
            else:
                # fall back to the CSW record for this identifier
                linktext = "http://geocatalog.webservice-energy.org/geonetwork/srv/eng/csw?REQUEST=GetRecordById&id=" + str(idx) + "&SERVICE=CSW&VERSION=2.0.2"
                #f.write(str(arr[i]) + " " + linktext + "\n")
    
            print(value, ' ', linktext)
        if count == 10:
            break
        count += 1

In [None]:
%run "Utils_Zenodo.ipynb"


if __name__ == "__main__":
    
    documents = []
    search_query = "solar irradiance"
    
    results = get_zenodo_entries(search_query)
    
    # the whole list of documents - consider this our database
    #documents.extend(get_zenodo_entries("photovoltaic"))
    time.sleep(2.4)
    documents.extend(get_zenodo_entries("Renewable Energy"))
    time.sleep(2.4)
    documents.extend(get_zenodo_entries("solar pond"))
    time.sleep(2.4)
    documents.extend(get_zenodo_entries("solar observations"))
     
    # for each document in the database, check whether it is in the list of results;
    # if it is not, apply the algorithm
    search_doi = [article['doi'] for article in results]
    database_doi = [article['doi'] for article in documents]
    database_doi = [doi for doi in database_doi if doi not in search_doi]

    # we may already have the files downloaded
    # so look for them in the /tmp folder
    # otherwise, download it
    for doc in documents:
        if doc['doi'] in database_doi:
            mod_doi = doc['doi'].replace('/', '-')

            #look in tmp folder if there is a file containing the doi
            # if there is, just read the file and move to the next entry
            if os.path.isfile('/tmp/' + mod_doi + ".txt"):
                continue

            doc_list = save_pdf_and_get_text(doc['files'][0]['links']['self'])
            # overwrite the doi temporary file
            with open('/tmp/' + mod_doi + ".txt", 'w') as f:
                f.write(' '.join(doc_list))

    #get each file in the folder and build a corpus based on those articles
    corpus = []
    doi_list = []
    for doi in database_doi:
        mod_doi = doi.replace('/', '-')
        with open('/tmp/' + mod_doi + ".txt", 'r') as f:
            # split into tokens: the topic model expects a list of words per document
            corpus.append(f.read().split())
            doi_list.append(doi)
    print(corpus)

    NUM_LABELS = 5
    K=10

    mgp = MovieGroupProcess(K=K, alpha=0.1, beta=0.1, n_iters=30)
    vocab = set(x for doc in corpus for x in doc)
    n_terms = len(vocab)
    y = mgp.fit(corpus, n_terms)
        
    # Save model
    with open("sttm_v1.model", "wb") as f:
        pickle.dump(mgp, f)
    

    sims = [ [0 for i in range(NUM_LABELS)] for j in range(len(mgp.cluster_word_distribution))]
    label_words = [time_series_solar_resources, atmosphere_meteorology,
                   ground_topography, meteorological_year, solar_potential_measures]

    count = 0

    for cluster_dict_per_topic in mgp.cluster_word_distribution:
        counter = Counter(cluster_dict_per_topic)

        high = counter.most_common(30)

        print('!'*20)
        print(high)
        print('!'*20)

        # the most common words, how are they connected to each predefined topic?
        for i in range(len(label_words)):
            sims[count][i] = cosine_similarity_compare(label_words[i], [x[0] for x in high])

        print('*'*20)
        print(sims[count])
        print('*'*20)
        count += 1

    count = 0
    similarity_values = [ [0 for i in range(NUM_LABELS)] for j in range(len(corpus))]
    for doc in corpus:
        # score the document once: one probability per topic
        scores = mgp.score(doc)
        for scoreid in range(len(scores)):
            score = scores[scoreid]
            if score > 0.1:
                for i in range(NUM_LABELS):
                    similarity_values[count][i] += score * sims[scoreid][i]
        count += 1


    similarity_values_cols = [ [0.0 for i in range(NUM_LABELS)] for j in range(len(corpus))]
    #compute maximum per entry and divide everything by it
    for col in range(0, len(similarity_values[0])):
        maxv = similarity_values[0][col]
        for line in range(1, len(similarity_values)):
            if similarity_values[line][col] > maxv:
                maxv = similarity_values[line][col]
        if maxv > 0.0:
            for line in range(len(similarity_values)):
                similarity_values_cols[line][col] = similarity_values[line][col] / maxv


    sim_values_trans = np.array(similarity_values_cols).transpose()
    
    # print best resources for each topic
    for line in sim_values_trans:
        cnt = 0
        arr = np.array(line)
        # document indices sorted by decreasing similarity
        idx = arr.argsort()[::-1]
        for i in idx:
            if arr[i] > 0:
                linktext = "doi.org/" + doi_list[i]
                print(arr[i], " ", linktext)

            cnt += 1
            if cnt == 20:
                break
        print('*'*20)

