# Build a language model for a specific subject


One state-of-the-art approach to building NLP applications is to use word embeddings to
compute similarities between texts. Word embeddings are vector representations of words that
capture the context of a word in a document and its relations to other words.

Word2vec is a two-layer neural network that processes text. Its input is a text corpus and its output is a set of vectors: feature vectors for the words in that corpus. The purpose of Word2vec is to group the vectors of similar words together in vector space, so that similarities between words can be detected mathematically [1].
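
To make "detecting similarities mathematically" concrete, here is a minimal sketch of cosine similarity, the measure commonly used to compare word vectors. The three-dimensional vectors and the `toy_vectors` name are invented for illustration only; real Word2vec vectors are learned from a corpus and have many more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors, made up for this example (not the output of a trained model).
toy_vectors = {
    "solar": [0.9, 0.8, 0.1],
    "photovoltaic": [0.8, 0.9, 0.2],
    "banana": [0.1, 0.0, 0.9],
}

sim_related = cosine_similarity(toy_vectors["solar"], toy_vectors["photovoltaic"])
sim_unrelated = cosine_similarity(toy_vectors["solar"], toy_vectors["banana"])

# Vectors pointing in similar directions score closer to 1.
print(sim_related > sim_unrelated)
```

Words whose vectors point in nearly the same direction get a score close to 1, while unrelated words score much lower.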

A pre-trained set of vectors, trained on part of the Google News dataset, can be downloaded and used [2].
For technical fields, however, better accuracy requires training a new language model,
so that it covers the specific vocabulary that may be missing from the news dataset.

## Build up an input corpus

The first step is to build a corpus of specialized texts from which the vocabulary can be learned.
We decided to use papers stored in Zenodo [3], since it offers a free endpoint for
querying. We send multiple queries to the server and store the results locally.
The approach we selected was to download the PDF files and convert them to text using a command-line tool
for Linux. To speed up the process and to avoid duplicates, we save the files in the '/tmp' folder and check, each time, whether the file is already there. If it is, we load the text, which is already pre-processed. When building a language model, it is better not to have duplicates, since they can change the values attached to words.

The next step is to clean up the text. First, all punctuation is removed and non-alphabetic characters are replaced with spaces. Then all the words are converted to lowercase and lemmatized to their base form. Next,
all the stop words are removed, together with the non-English and short words. Here it is important to keep the abbreviations, the technical words that are not in the English dictionary, and the short words related to the field concerned.
Finally, all the abbreviations are replaced by the corresponding word sequences, so that similarities are computed correctly. The cleaned texts are then saved in the corresponding text files in the '/tmp' folder.
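
The clean-up steps above can be sketched roughly as follows. This is a simplified illustration, not the notebook's actual pre-processing: `STOP_WORDS`, `ABBREVIATIONS`, `MIN_LENGTH` and `clean_text` are hypothetical names with deliberately tiny contents, and lemmatization and the non-English filter are left out because they need an external library. The sketch expands abbreviations before filtering so that short abbreviations are not discarded as short words.

```python
import re

# Hypothetical, deliberately small resources; a real pipeline needs fuller lists.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "on", "to", "for"}
ABBREVIATIONS = {
    "ghi": ["global", "horizontal", "irradiance"],
    "pv": ["photovoltaic"],
}
MIN_LENGTH = 3  # assumed threshold below which a word counts as "short"


def clean_text(text):
    """Return a list of cleaned, lowercase tokens."""
    # Replace every run of non-alphabetic characters with a single space.
    text = re.sub(r"[^A-Za-z]+", " ", text)
    cleaned = []
    for token in text.lower().split():
        if token in ABBREVIATIONS:
            # Expand known abbreviations into their word sequence.
            cleaned.extend(ABBREVIATIONS[token])
        elif token not in STOP_WORDS and len(token) >= MIN_LENGTH:
            cleaned.append(token)
    return cleaned


print(clean_text("The GHI of a PV panel is measured in W/m2."))
# ['global', 'horizontal', 'irradiance', 'photovoltaic', 'panel', 'measured']
```

Each cleaned document is a list of words, which matches the list-of-lists input that the training step expects.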

## Build a language model

The corpus of documents is passed directly to the word2vec implementation in gensim [4], and the resulting model is saved in a binary file for further
use.


[1] https://skymind.ai/wiki/word2vec
[2] https://code.google.com/archive/p/word2vec/
[3] https://zenodo.org/
[4] https://radimrehurek.com/gensim/models/word2vec.html

In [None]:
%run "NLP_clustering.ipynb"
%run "Utils_Zenodo.ipynb"

# explicit imports for the modules used directly in this cell
import os

import gensim


def get_entries_from_zenodo(query):
    
    total_articles = get_zenodo_entries(query)
    documents = []
    
    for article in total_articles:
        doi = article['doi']
        mod_doi = doi.replace('/', '-')

        # cache file named after the DOI, with '/' replaced so it is a valid file name
        cache_path = '/tmp/' + mod_doi + ".txt"

        # look in the tmp folder for the cache file;
        # if it exists, read the already pre-processed text from it
        if os.path.isfile(cache_path):
            with open(cache_path) as f:
                doc_list = f.read().split()
        else:
            print(article['files'][0]['links']['self'])
            # download the file and pre-process the text
            doc_list = save_pdf_and_get_text(article['files'][0]['links']['self'])
            # write the pre-processed text to the cache file
            with open(cache_path, 'w') as f:
                f.write(' '.join(doc_list))

        # add the text to the corpus, whether it was cached or freshly downloaded
        if doc_list:
            documents.append(doc_list)
 
    return documents


if __name__ == "__main__":
    # Zenodo search queries used to build the corpus
    queries = [
        SEARCH_QUERY,
        "solar irradiance",
        "photovoltaic",
        "Renewable Energy",
        "solar pond",
        "solar observations",
        "global horizontal irradiance",
        "irradiance",
        "solar collector",
        "downwelling longwave",
        "clear sky",
        "diffuse radiation",
        "water vapor",
        "snowdepth",
        "bi-directional reflectance",
        "snowcover",
        "airmass",
        "downwelling",
    ]

    # documents is a list of lists: one list of words per paper
    documents = []
    for query in queries:
        documents.extend(get_entries_from_zenodo(query))

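    # Note: `size` is the gensim 3.x parameter name; gensim 4.x renamed it to `vector_size`.
    model = gensim.models.Word2Vec(documents, size=150, window=10, min_count=2, workers=10)
    # The constructor already trains on `documents`; this call runs 10 further epochs.
    model.train(documents, total_examples=len(documents), epochs=10)
    # Save only the word vectors, in the binary word2vec format.
    model.wv.save_word2vec_format(MODEL_TO_TRAIN, binary=True)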
    

In [None]:
print(len(model.wv.vocab))
#print(documents)

#print(model.wv.most_similar('solar'))
#print(model.wv.most_similar('photovoltaic'))

print(model.wv.most_similar('irradiance'))
print(model.wv.most_similar('solar'))
print(model.wv.most_similar('photovoltaic'))

model.wv.most_similar("sky")



In [None]:
print(len(model.wv.vocab))
print(documents)