### Working with CHRONOBERG 

In this notebook, we walk through the Chronoberg dataset explaining how to use it to perform lexical analysis on prefered temporal slice and lexical analysis. The dataset is publicly available at [ChronoBerg](https://huggingface.co/datasets/sdp56/ChronoBerg/tree/main).

The dataset is made available in two variants: 
- The non-annotated raw ChronoBerg
- The annotated ChronoBerg

In each version, the text is grouped by years from 1750-2000s. The annotated version employs a further splitting of texts into sentences. 


In [None]:
import json, re
import os
import nltk
import pandas as pd 
import torch
from nltk.tokenize import sent_tokenize
from nltk.tokenize import sent_tokenize
import itertools 
from tqdm import tqdm
from gensim.models.word2vec import Word2Vec
from sklearn.model_selection import train_test_split
import gensim 
from IPython.display import display, Markdown
from gensim.parsing.preprocessing import remove_stopwords, strip_punctuation, strip_multiple_whitespaces, strip_numeric
from gensim.parsing.preprocessing import preprocess_string, preprocess_documents

#### Loading the dataset

We will first load the raw Chronoberg dataset and walk the steps of preprocessing and loading. 
One can download the raw Chornoberg dataset at [Chrono_raw](https://huggingface.co/datasets/chb19/ChronoBerg/blob/main/dataset/ChronoBerg_raw.jsonl)

In [2]:
data_dict = {'year': [], 'text': []}
path_to_your_data_folder = ''
with open(os.path.join(path_to_your_data_folder, 'ChronoBerg_raw.jsonl'), 'r', encoding='utf-8') as dataset_in:
    for line in tqdm(dataset_in):
        file = json.loads(line)
        text = file['text']
        text = text.replace('\n', ' ')
        # Remove all kinds of quotation marks
        file['text'] = text
        data_dict['text'].append(file['text'])
        data_dict['year'].append(file['year'])

print('data loaded')

print("------------------------------------------")
print("The number of years in the dataset is: ", len(set(data_dict['year'])) )

249it [00:47,  5.26it/s]

data loaded
------------------------------------------
The number of years in the dataset is:  249





### Visualize sample textual data from a particular year

In [20]:
#### Show a sample text from a particular year
year = 1900
for i in range(len(data_dict['year'])):
    if data_dict['year'][i] == year:
        display(Markdown(f"** Time_Interval- {data_dict['year'][i]}**: {data_dict['text'][i][:1500]}"))
        break

** Time_Interval- 1900**:  [Illustration]     The Wonderful Wizard of Oz  by L. Frank Baum   This book is dedicated to my good friend & comrade My Wife L.F.B.   Contents   Introduction  Chapter I. The Cyclone  Chapter II. The Council with the Munchkins  Chapter III. How Dorothy Saved the Scarecrow  Chapter IV. The Road Through the Forest  Chapter V. The Rescue of the Tin Woodman  Chapter VI.  The Cowardly Lion  Chapter VII. The Journey to the Great Oz  Chapter VIII. The Deadly Poppy Field  Chapter IX. The Queen of the Field Mice  Chapter X. The Guardian of the Gates  Chapter XI. The Emerald City of Oz  Chapter XII. The Search for the Wicked Witch  Chapter XIII. The Rescue  Chapter XIV. The Winged Monkeys  Chapter XV. The Discovery of Oz, the Terrible  Chapter XVI. The Magic Art of the Great Humbug  Chapter XVII. How the Balloon Was Launched  Chapter XVIII. Away to the South  Chapter XIX. Attacked by the Fighting Trees  Chapter XX. The Dainty China Country  Chapter XXI. The Lion Becomes the King of Beasts  Chapter XXII. The Country of the Quadlings  Chapter XXIII. Glinda The Good Witch Grants Dorothyâs Wish  Chapter XXIV. Home Again     Introduction   Folklore, legends, myths and fairy tales have followed childhood through the ages, for every healthy youngster has a wholesome and instinctive love for stories fantastic, marvelous and manifestly unreal. The winged fairies of Grimm and Andersen have brought more happiness to childish hearts than all other human creations.  Yet the old time fairy tale, h

In [3]:
### Short the text by year
df = pd.DataFrame(data_dict)
df.head()
df = df.sort_values(by=['year'], ascending=True)

### Recreate the data_dict based on the sorted dataframe
data_dict = {'year': [], 'text': []}
for i in range(len(df)):
    data_dict['year'].append(df['year'].iloc[i])
    data_dict['text'].append(df['text'].iloc[i])

#### Transform the block of text into a set of sentences

In [8]:
text_total = []
for text in tqdm(data_dict['text']):
    sentence = sent_tokenize(text)
    text_total.append(sentence)

text_total = list(itertools.chain(*text_total))
print(f'Total sentences: {len(text_total)}')

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 249/249 [35:31<00:00,  8.56s/it]

Total sentences: 88957134





### Extract Textual data from a particular year 

In [4]:
def extract_text_by_year(year, data_dict):
    sents = []
    texts = data_dict['text'][year]
    sents.extend(sent_tokenize(texts))
    return sents, texts

#### Transform the block of text into a set of sentences
year = 20
sents, texts = extract_text_by_year(year, data_dict)
print(f'Number of sentences in year {data_dict["year"][year]}: {len(sents)}')
print("------------------------------ Sample sentences ------------------------------")
for i in range(len(sents[:5])):
    print(f'Sentence {i+1}: {sents[i]}')

Number of sentences in year 1771: 10323
------------------------------ Sample sentences ------------------------------
Sentence 1:       THE EXPEDITION OF HUMPHRY CLINKER  by TOBIAS SMOLLETT     To Mr HENRY DAVIS, Bookseller, in London.
Sentence 2: ABERGAVENNY, Aug. 4.
Sentence 3: RESPECTED SIR,  I have received your esteemed favour of the 13th ultimo, whereby it appeareth, that you have perused those same Letters, the which were delivered unto you by my friend, the reverend Mr Hugo Behn; and I am pleased to find you think they may be printed with a good prospect of success; in as much as the objections you mention, I humbly conceive, are such as may be redargued, if not entirely removed--And, first, in the first place, as touching what prosecutions may arise from printing the private correspondence of persons still living, give me leave, with all due submission, to observe, that the Letters in question were not written and sent under the seal of secrecy; that they have no tendency to 

#### Polish the text if needed

In [5]:
def preprocess_text(texts):
    for i in range(len(texts)):
        text = texts[i]
        text = re.sub(r'[\'\"]', '', text)
        text = re.sub('\.', ' ', text)
        text = re.sub(r'[\x80-\xFF]', '', text)
        text = re.sub(r'\d+','', text) 
        # Reduce all consecutive whitespace to a single whitespace
        text = re.sub(r'\s+', ' ', text)
        texts[i] = text
    return texts

In [7]:
texts_ = preprocess_text(sents)

#### Load preprocessed Chronoberg

We have also made available a preprocessed version of Chronoberg.  One can download the dataset at [Chronoberg_preprocessed](https://huggingface.co/datasets/chb19/ChronoBerg/blob/main/dataset/Chronoberg_preprocessed.jsonl)

In [None]:
path_to_data_folder = ''
yearly_sentence_dict = {}
with open(os.path.join(path_to_data_folder, 'ChronoBerg_preprocessed.jsonl'), 'r', encoding='utf-8') as f:
    for line in f:
        year, sentence = json.loads(line)['year'], json.loads(line)['sentence']
        if year not in yearly_sentence_dict:
            yearly_sentence_dict[year] = {}
        yearly_sentence_dict[year] = sentence

print(f'Total years in the preprocessed dataset: {len(yearly_sentence_dict.keys())}')
print(f'Total sentences in the preprocessed dataset: {sum([len(yearly_sentence_dict[year]) for year in yearly_sentence_dict.keys()])}')

#### Statistics

- Number of sentences and number of words 



In [None]:
### Calculate the number of words in the preprocessed dataset

### load the preprocessed dataset

text_total = []
for text in tqdm(data_dict['text']):
    sentence = sent_tokenize(text)
    text_total.append(sentence)

text_total = list(itertools.chain(*text_total))
print(f'Total sentences: {len(text_total)}')

for year, sentences in yearly_sentence_dict.items():
    for text in sentences:
        word_list = nltk.tokenize.word_tokenize(text)
        count += len(word_list)

print(f'Total words in the preprocessed dataset: {count}')

### Create train and test splits

We will now look into ways of creating train and test splits for the different time-intervals.
In our paper, we have reported experimental results evaluating LLMS in a sequential setup based on valence-stable or valence-shifting words. Consequently, a strategic way to create train and test splits for each time interval is to split the dataset through the occurence of words in specific intervals. 


Load the pre-processed sentence splitted dataset

In [None]:
path_to_data_folder = ""

file = open(os.path.join(path_to_data_folder, 'ChronoBerg_preprocessed.jsonl'), 'w', encoding='utf-8')

  chrono_1750 = torch.load('/app/src/ChronoBerg/cade/new_lexicons/cl_score_min_1750.pth')


#### Get block of sentences per time-interval

In [None]:
def get_block_sentence_per_era(chrono_dict):
    block_sent = []
    for year, sent_value in chrono_dict.items():
        block_sent.append(list(sent_value))
    
    block_sent = list(itertools.chain(*block_sent))
    #print(f'Number of sentences in year {data_dict["year"][year]} with at least one chrono-berg word: {len(block_sent)}')
    return block_sent

In [None]:
sent_b = get_block_sentence_per_era(file)

- Specify some words and create train and test splits based on that word

In [None]:
### Specify your words here
word_list = ['act', 'action', 'active', 'actor', 'activity', 'actual', 'actually', 'actress', 'acts']
sent_app = []
sent_non_app = []
for word in word_list:
    print("Preprocessing the word: ", word)
    for snt in sent_b:
        if word in snt:
            sent_app.append(snt)
        else:
            sent_non_app.append(snt)
print(f'Number of sentences with at least one chrono-berg word: {len(sent_app)}')

Number of sentences with at least one chrono-berg word: 68626


##### Create train and test splits

In [None]:
def create_splits(sent_app, sent_non_app, train_ratio=0.8):

    train_app, test_app = train_test_split(sent_app, train_size=train_ratio, random_state=42)
    train_non_app, test_non_app = train_test_split(sent_non_app, train_size=train_ratio, random_state=42)

    train_data = train_app + train_non_app
    test_data = test_app + test_non_app

    print(f'Number of training samples: {len(train_data)}')
    print(f'Number of testing samples: {len(test_data)}')

    return train_data, test_data