# Basic journal entries processing

This notebook shows a set of functions for processing the journal entries:
- Loading from and writing to disk.
- Splitting entries into sentences.
- Sentence processing: autocorrection, translation, lemmatization.
- Word processing: n-grams, bag of words (BoW), ...

In [None]:
%load_ext autoreload
%autoreload 2

from TexSoup import TexSoup
import glob
import pandas as pd
import nlu
import datetime as dt

from obsidianizer.latex_tools.utils import load_drafts_entries, save_cleaned_sentences_to_latex, print_differences_in_journals
from obsidianizer.latex_tools.journal_processing import get_sentences
from obsidianizer.nlp.bow import generate_word_cloud_image
from obsidianizer.latex_tools.plots import get_statistics_email_draft
from obsidianizer.nlp.translation import get_translator, get_journal_translator

from obsidianizer.nlp.text_cleanup import n_grams_function
from obsidianizer.nlp.text_cleanup import get_most_used_words, remove_stop_words

from obsidianizer.nlp.auto_correction import get_misspelled_words, correct_sentence, get_candidates
from obsidianizer.nlp.contextual_auto_correction import create_contextual_spell_check
from obsidianizer import EXAMPLE_JOURNAL_PATH, EXAMPLE_CLEANED_JOURNAL_PATH, JOURNALS_PATH
from obsidianizer.journal.cleaning import load_journals_splitted_by_language, get_journal_splits_by_language_filepaths, split_journal_by_language, write_journal_splits_by_language_to_latex



# 1. Load journal entries from a LaTeX file

This section shows how to load the items generated by the email function.

In [None]:
filepath = EXAMPLE_JOURNAL_PATH

## 1.1 Load draft entries

The draft entries have the following columns:
- datetime_str: The datetime string created by parse_dated_comment_to_latex_item.
- entry_text: The text written in the email draft (or, more generally, in the comment).
- datetime: The datetime_str converted to a datetime object.
- date: The date of the entry, used as a groupby key.
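
As a rough illustration of how the four columns relate to each other, here is a minimal sketch that builds one row from a datetime string. The timestamp format and the helper name build_entry are assumptions made for this example; the real format is whatever parse_dated_comment_to_latex_item writes.

```python
import datetime as dt

# Hypothetical timestamp format, chosen only for this example.
DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def build_entry(datetime_str: str, entry_text: str) -> dict:
    """Build one row with the four columns described above."""
    parsed = dt.datetime.strptime(datetime_str, DATETIME_FORMAT)
    return {
        "datetime_str": datetime_str,
        "entry_text": entry_text,
        "datetime": parsed,
        "date": parsed.date(),  # day-level key, convenient for a groupby
    }


entry = build_entry("2021-03-14 22:05:00", "Short note written before sleeping.")
print(entry["date"])
```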


In [None]:
data_entries = load_drafts_entries(filepath)

In [None]:
data_entries

## 1.2 Get the sentences in the entries

The entries need to be preprocessed before any further analysis. get_sentences does the following:
- It splits the entries into sentences.
- It detects the language of each sentence.
- It counts the number of words and sentences.
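
To make the splitting and counting steps concrete, here is a minimal sketch using only the standard library. It is not how get_sentences is implemented: the punctuation-based split rule and the helper names are assumptions, and language detection is left out because it needs an external model.

```python
import re


def split_into_sentences(text: str) -> list[str]:
    """Naive split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def count_words(sentence: str) -> int:
    """Count whitespace-separated tokens."""
    return len(sentence.split())


entry_text = "I woke up late. Why does it always rain? Time to write!"
sentences = split_into_sentences(entry_text)
word_counts = [count_words(sentence) for sentence in sentences]
print(sentences)
print(word_counts)
```

A rule this simple breaks on abbreviations such as "e.g.", which is one reason to prefer a proper sentence tokenizer on real entries.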

In [None]:
journal_df = get_sentences(data_entries)

In [None]:
journal_df

## 1.3 Save cleaned journal to disk

The comments will certainly need a fair amount of cleaning, such as:
- Deleting meaningless entries (numbers and other junk I might have put there just to remember something).
- Trimming sentences: removing unnecessary newlines and spaces.
- Correcting words: there are usually a lot of misspelled words that we should fix.
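
The first two cleaning steps can be sketched as follows. The rule "a sentence without any letter is meaningless" and both helper names are assumptions made for this example, not the library's behavior.

```python
import re


def trim_sentence(sentence: str) -> str:
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return re.sub(r"\s+", " ", sentence).strip()


def is_meaningless(sentence: str) -> bool:
    """Hypothetical rule: a sentence without any letter is treated as noise."""
    return not any(character.isalpha() for character in sentence)


raw_sentences = ["  Today was\n a good   day. ", "12345", "  ", "Call 555 later"]
cleaned = [trim_sentence(s) for s in raw_sentences if not is_meaningless(s)]
print(cleaned)
```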

In [None]:
output_cleaned_journal = EXAMPLE_CLEANED_JOURNAL_PATH
text_output_journal = save_cleaned_sentences_to_latex(journal_df, output_cleaned_journal)

In [None]:
data_entries_2 = load_drafts_entries(output_cleaned_journal)
data_entries_2

In [None]:
journal_df_2 = get_sentences(data_entries_2)
journal_df_2

### Check that the sentences are the same

There seem to be slight differences between the original drafts and what we save to disk. They appear to be caused by special characters and comments, in particular lines containing "%".
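
One plausible explanation, stated here as an assumption rather than a confirmed diagnosis, is that an unescaped "%" starts a LaTeX comment, so the rest of the line is lost when the file is read back. The sketch below simulates that effect and reports the indices that differ; both helper names are hypothetical.

```python
import re


def strip_latex_comment(line: str) -> str:
    """Drop everything from an unescaped % onwards (LaTeX comment start)."""
    return re.sub(r"(?<!\\)%.*", "", line).rstrip()


def find_differences(original: list[str], reloaded: list[str]) -> list[int]:
    """Return the indices at which the two sentence lists disagree."""
    return [i for i, (a, b) in enumerate(zip(original, reloaded)) if a != b]


original = ["I slept well.", "I am 100% sure about it.", r"Only 5\% left."]
# Assumption: reloading behaves like stripping LaTeX comments.
reloaded = [strip_latex_comment(sentence) for sentence in original]
differing_indices = find_differences(original, reloaded)
print(differing_indices)
```

If this is indeed the cause, escaping "%" as "\%" before writing would keep the sentences intact.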

In [None]:
weird_indices, weird_sentence_within_index = print_differences_in_journals(journal_df, journal_df_2)

In [None]:
weird_indices

# 2. Sentences processing

The following subsections apply different transformations to the sentences in the entries.

## 2.1 Autocorrect words

Since the OCR output of the PDFs and our journal entries usually contain typos, we have implemented some automatic word correction. Its precision is not great, so treat the results with skepticism.

In [None]:
sentence = "To die having experience the sweetness of dying, without dead, that is where conciousness and reality rejoyce, rejoyce in the destruction of the self, the ego, the eye, the judgement, the look for meaning. If meaning were to exist, that would be universe in-it-self. The thing in it-self. Not that far from Kand, buddism and stoicism. But that is what it is, the pleasure of the dissolution of the ego, and maybe that is for many, the best way to live. I am not ashamed to admit that maybe that would be a good picture for whoever can actually believe in it 100%, but I cannot, and consciousness reveals againt it, maybe driven by fear? Maybe consicousness was born out of fear, maybe black creates blue, chaos creates order. And in the end, the highest pleasure is the dissapearance of one-self. Which in my opinion one can only allow unti lthe skeptic wihin us is satisfied, it feels it is not being deceived, by others and us. The skiptic of trust, of suspension of jusgement, the lack of agreeblemenss. But that also is necessary, it is necesaary because of deceive, danger, betrail, because of the harshness of nature, the will to power of others, the cosmic dance between trust to become nothing, to give up the self, and the will to power. The conquer of knowledge, the facing of the dark to get tools. Welcome to my fucked up mind :)"
print(sentence)

### Classic single-word autocorrection

#### Get the misspelled words

In [None]:
misspelled_words = get_misspelled_words(sentence)
misspelled_words

#### Get the most likely candidates to replace the words with

In [None]:
candidates_misspelled = get_candidates(misspelled_words)
candidates_misspelled
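
For intuition, single-word correction is often done by generating every string within one edit of the misspelled word and keeping those found in a vocabulary. The sketch below shows that idea. It is not necessarily what get_candidates does internally, and the tiny vocabulary is made up for the example.

```python
import string

# Tiny hypothetical vocabulary; a real checker would use a large word list.
VOCABULARY = {"consciousness", "rejoice", "skeptic", "judgement", "until"}


def edits1(word: str) -> set[str]:
    """All strings one delete, transpose, replace or insert away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    ]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def candidates(word: str) -> set[str]:
    """Known words at edit distance 0 or 1, falling back to the word itself."""
    if word in VOCABULARY:
        return {word}
    return (edits1(word) & VOCABULARY) or {word}


print(candidates("conciousness"))
print(candidates("rejoyce"))
```

Because each word is corrected in isolation, this approach cannot tell "dead" from "death" when both are valid words, which is what the contextual method tries to address.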

#### Correct the misspelled words in a sentence

In [None]:
sentence_corrected = correct_sentence(sentence)
sentence_corrected

### Contextual autocorrection

Based on spaCy pre-trained models.

In [None]:
auto_corrector = create_contextual_spell_check("en")

In [None]:
doc = auto_corrector(sentence)
doc._.outcome_spellCheck

In [None]:
doc._.suggestions_spellCheck

## 2.2 Translation

We can translate sentences from one language into another.

In [None]:
journal_translator = get_journal_translator("es")

In [None]:
journal_df["sentences_translated"] = journal_df[["sentences", "languages"]].apply(journal_translator, axis=1)

In [None]:
journal_df
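
The translator is applied row by row to a (sentence, language) pair. A minimal sketch of what such a row-wise function could look like is given below. Everything in it is an assumption for illustration: the factory name, the rule of leaving sentences already in the target language untouched, and the toy dictionary that stands in for a real translation model.

```python
from typing import Callable

# Hypothetical word-by-word dictionary standing in for a real translation model.
TOY_EN_TO_ES = {"hello": "hola", "my": "mi", "friend": "amigo"}


def toy_translate(sentence: str) -> str:
    """Translate word by word, keeping unknown words unchanged."""
    return " ".join(TOY_EN_TO_ES.get(word.lower(), word) for word in sentence.split())


def make_row_translator(target_language: str, translate: Callable[[str], str]):
    """Return a function that translates one (sentence, language) pair."""

    def translate_row(row: tuple[str, str]) -> str:
        sentence, language = row
        if language == target_language:
            return sentence  # already in the target language
        return translate(sentence)

    return translate_row


translator = make_row_translator("es", toy_translate)
rows = [("Hello my friend", "en"), ("Hola mi amigo", "es")]
translated = [translator(row) for row in rows]
print(translated)
```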

## 2.3 Lemmatization

In [None]:
bert = nlu.load("en.embed")

In [None]:
bert.predict("Hello my friend")

## 2.4 n-grams

Get the most common n-grams in the text.

In [None]:
n_ngrams = 3
n_grams_df = n_grams_function(journal_df.iloc[:1000], column="sentences", n=n_ngrams)

In [None]:
n_grams_df
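
For reference, counting n-grams boils down to sliding a window of n tokens over each sentence and tallying the resulting tuples. The sketch below uses whitespace tokenization and hypothetical helper names; it is not the implementation of n_grams_function.

```python
from collections import Counter


def n_grams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """All windows of n consecutive tokens (empty if there are fewer than n)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def most_common_n_grams(sentences: list[str], n: int, top: int = 3):
    """Count n-grams per sentence so that windows never span two sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(n_grams(sentence.lower().split(), n))
    return counts.most_common(top)


example_sentences = ["the will to power", "the will to live", "a will to power"]
top_bigrams = most_common_n_grams(example_sentences, n=2, top=1)
top_trigrams = most_common_n_grams(example_sentences, n=3, top=2)
print(top_bigrams)
print(top_trigrams)
```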

## 2.5 Get the most common words

In [None]:
most_used_words = get_most_used_words(journal_df) 
most_used_words

In [None]:
#most_used_words_cleaned = remove_stop_words(most_used_words)
# most_used_words_cleaned
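
Raw word counts tend to be dominated by function words, which is why stop words are usually removed before ranking. The sketch below shows the effect with a very small, made-up stop-word list; the function name and its parameters are assumptions and do not mirror get_most_used_words or remove_stop_words.

```python
from collections import Counter

# Small hypothetical stop-word list; real lists are longer and language-specific.
STOP_WORDS = frozenset({"the", "a", "to", "of", "and", "is", "in"})


def count_most_used_words(sentences, top=3, stop_words=frozenset()):
    """Rank lowercased, whitespace-separated words, skipping stop words."""
    counts = Counter(
        word
        for sentence in sentences
        for word in sentence.lower().split()
        if word not in stop_words
    )
    return counts.most_common(top)


example_sentences = ["the will to power", "the pleasure of the ego", "the ego and the self"]
with_stop_words = count_most_used_words(example_sentences, top=1)
without_stop_words = count_most_used_words(example_sentences, top=1, stop_words=STOP_WORDS)
print(with_stop_words)
print(without_stop_words)
```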

# 3. Split by languages

We can split the journal entries by language, using an offset in minutes to keep the resulting entries apart. The split is done sentence by sentence. If an entry has to be broken into several single-language entries, the timestamp of each new entry is shifted by a number of minutes equal to the index of the sentence within the original entry.
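
The minute-offset scheme can be sketched as follows. This is one reading of the description above, not the code of split_journal_by_language: grouping consecutive sentences of the same language into one chunk, and using the index of the first sentence of each chunk as the offset, are assumptions, as is the function name.

```python
import datetime as dt
from itertools import groupby


def split_entry_by_language(entry_datetime, sentences_with_language):
    """Split one entry into consecutive same-language chunks.

    Each chunk is stamped with entry_datetime plus as many minutes as the
    index of its first sentence, so the chunks stay distinct and ordered.
    """
    chunks = []
    indexed = list(enumerate(sentences_with_language))
    for language, group in groupby(indexed, key=lambda item: item[1][1]):
        members = list(group)
        first_index = members[0][0]
        text = " ".join(sentence for _, (sentence, _) in members)
        stamp = entry_datetime + dt.timedelta(minutes=first_index)
        chunks.append((stamp, language, text))
    return chunks


entry_datetime = dt.datetime(2021, 3, 14, 22, 0)
entry_sentences = [
    ("I am tired.", "en"),
    ("Estoy cansado.", "es"),
    ("Hasta luego.", "es"),
    ("Good night.", "en"),
]
chunks = split_entry_by_language(entry_datetime, entry_sentences)
for stamp, language, text in chunks:
    print(stamp, language, text)
```

Shifting by the sentence index keeps the original order recoverable after the per-language files are reloaded and merged by timestamp.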

In [None]:
filedir = JOURNALS_PATH

### Split the journal by languages

In [None]:
journal_language_groupby = split_journal_by_language(journal_df)

### Write each language to its own file

In [None]:
write_journal_splits_by_language_to_latex(journal_df, filedir)

### List the files the splits were saved to

In [None]:
get_journal_splits_by_language_filepaths(filedir)

### Reload and join the divided journals

In [None]:
reloaded_journals = load_journals_splitted_by_language(filedir)

In [None]:
reloaded_journals

### Clean and save the reloaded sentences

In [None]:
reloaded_journals = get_sentences(reloaded_journals)

In [None]:
_ = save_cleaned_sentences_to_latex(reloaded_journals, EXAMPLE_CLEANED_JOURNAL_PATH)