## Tasks

### 1. Preliminary analysis

| Task    | Assigned to | Status |
|:----------------|:------------|:------:|
| What type of documents does it contain? | Alberto | ✅ |
| How many documents are there? | Alberto | ✅ |
| Calculate and visualise some simple statistics for the collection, e.g. the average document length, the average vocabulary size, etc. | Everyone | ❌ |
| BIO tagging for each file. | Paolo | ✅ |
| Create datasets with sentences from the tagged dataset. | Paolo | ✅ |
| Create merged datasets | Paolo | ✅ |
| Cluster the documents and visualise the clusters to see what types of groups are present | Paolo | ❌ |
| Index the documents so that you can perform keyword search over them | Leonardo | ❌ |
| Train a Word2Vec embedding on the data and investigate the properties of the resulting embedding | Alberto | ❌ |


> **_KEY:_** [✅]() Completed [❌]() Not Completed

### 2. Training models

| Task    | Assigned to | Status |
|:----------------|:------------|:------:|
| Train a model to perform that task (by fine-tuning models on the training data)  | ??? | ❌ |
| Test pre-trained models on the task (if they already exist)                      | ??? | ❌ |
| Evaluate different models and compare their performance                          | ??? | ❌ |

> **_KEY:_** [✅]() Completed [❌]() Not Completed

> **_HINT_**: as a minimum, we would expect to see a linear classifier trained on the data (if
appropriate for the task) and compared with a deep learning model, such as BERT.

### 3. Possible extensions:
Depending on the dataset chosen there will be many additional investigations you can perform.
For instance, oftentimes we can improve performance of a model on a particular task by simply
including additional data that is related to the task in its training set. So see if you can find other
data that helps with the task that you chose. Moreover, there are many NLP challenges out
there, so if you can’t find more data for the task you’re working on, look for another interesting
challenge to work on.

## Libraries

In [None]:
#!{sys.executable} -m spacy download it_core_news_sm;

In [None]:
import string
import pandas as pd
import matplotlib.pyplot as plt
import nltk
import re
import random
import numpy as np
import plotly.express as px
import sys
import spacy
import it_core_news_sm
import csv
import os

from nltk.corpus import stopwords
from collections import Counter
from sklearn.manifold import TSNE
from sklearn.feature_extraction.text import CountVectorizer
from gensim.models.word2vec import Word2Vec
from tqdm import tqdm
from IPython.display import display

In [None]:
RunningInCOLAB = 'google.colab' in str(get_ipython()) if hasattr(__builtins__,'__IPYTHON__') else False

if RunningInCOLAB:
    from google.colab import drive
    drive.mount('/content/drive')
    # Move to the project folder on Drive (os is already imported above)
    os.chdir('/content/drive/MyDrive/Colab Notebooks/NLP/KIND-main')

    print("Colab environment detected. Paths have been set.")
else:
    print("You are not running in Google Colab. Skipping this step...")

## Data Import

In the following section we manipulate the dataset provided to get the full sentences that have been annotated by the original authors.

First we create the directories if they do not exist.

In [None]:
# Create the output directories (and the parent 'dataset' directory);
# exist_ok=True makes the cell safe to re-run
os.makedirs(os.path.join(os.getcwd(), 'dataset', 'txt-version'), exist_ok=True)
os.makedirs(os.path.join(os.getcwd(), 'dataset', 'BIO-tagged-version'), exist_ok=True)

### Merged dataset generation

We also want to provide an additional dataset in which the content of all the files is merged. This will be useful for the subsequent tasks.

First we merge the initial `.tsv` train files:

In [None]:
# Get the list of files in the dataset directory
dataset_dir = os.path.join(os.getcwd(), 'dataset')
dataset_files = os.listdir(dataset_dir)

# Merging all train.tsv files together
content = ""
for file in tqdm(dataset_files):
    # We are dealing with the train.tsv files only; a merged file left over from a
    # previous run is skipped so that re-running the cell does not duplicate the data
    if file.endswith('test.tsv') or not file.endswith('.tsv') or file.startswith('merged_dataset'):
        continue

    file_path = os.path.join(dataset_dir, file)

    # Read the content of the file and append it to the content variable
    with open(file_path, 'r') as f:
        content += f.read()

# Write the merged content to a new file
with open(f'{dataset_dir}/merged_dataset_train.tsv', 'w') as f:
    f.write(content)

Now we do the same for the test files:

In [None]:
# Get the list of files in the dataset directory
dataset_dir = os.path.join(os.getcwd(), 'dataset')
dataset_files = os.listdir(dataset_dir)

# Merging all test.tsv files together
content = ""
for file in tqdm(dataset_files):
    # We are dealing with the test.tsv files only; a merged file left over from a
    # previous run is skipped so that re-running the cell does not duplicate the data
    if file.endswith('train.tsv') or not file.endswith('.tsv') or file.startswith('merged_dataset'):
        continue

    file_path = os.path.join(dataset_dir, file)

    # Read the content of the file and append it to the content variable
    with open(file_path, 'r') as f:
        content += f.read()

# Write the merged content to a new file
with open(f'{dataset_dir}/merged_dataset_test.tsv', 'w') as f:
    f.write(content)
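
The two merge cells differ only in the split name, so the logic could be factored into a single helper. The following is a minimal sketch, assuming the `<name>_train.tsv` / `<name>_test.tsv` naming used above; the function name `merge_split` is hypothetical, and the toy files are made up for the demonstration.

```python
import os
import tempfile

def merge_split(dataset_dir, split):
    """Concatenate every '<name>_<split>.tsv' file in dataset_dir into one file."""
    suffix = f"_{split}.tsv"
    out_name = f"merged_dataset_{split}.tsv"
    parts = []
    for name in sorted(os.listdir(dataset_dir)):
        # Skip other splits, non-tsv files and a previously written merged file
        if not name.endswith(suffix) or name == out_name:
            continue
        with open(os.path.join(dataset_dir, name), "r", encoding="utf-8") as f:
            text = f.read()
        # End each file with a blank line so that sentences do not run together
        parts.append(text.rstrip("\n") + "\n\n")
    out_path = os.path.join(dataset_dir, out_name)
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("".join(parts))
    return out_path

# Toy demonstration in a temporary directory (made-up content)
with tempfile.TemporaryDirectory() as tmp:
    toy_files = [
        ("a_train.tsv", "Roma\tLOC\n"),
        ("b_train.tsv", "Moro\tPER\n"),
        ("a_test.tsv", "x\tO\n"),
    ]
    for name, body in toy_files:
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as f:
            f.write(body)
    with open(merge_split(tmp, "train"), encoding="utf-8") as f:
        merged = f.read()

print(repr(merged))
```

Sorting the file names makes the merged output independent of the order returned by `os.listdir`, which is not guaranteed.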

### Complete sentences generation

Now, for each train and test `.tsv` file, we want to rebuild the plain-text form of the sentences that were annotated by the original authors.

For each file, the sentences are reconstructed and written one per line to the corresponding `.txt` file.

In [None]:
# Get the list of files in the dataset directory
dataset_dir = os.path.join(os.getcwd(), 'dataset')
dataset_files = os.listdir(dataset_dir)

# For each file in the dataset directory split the content when finding empty lines
for file in tqdm(dataset_files):
    # Do not read txt files
    if not file.endswith('.tsv'):
        continue

    # Get the .tsv file path
    file_path = os.path.join(dataset_dir, file)

    # Inserting the content of the tsv file into a dataframe keeping the blank lines (i.e. the end of a sentence)
    data_df = pd.read_csv(file_path, sep='\t', names=['Word', 'Entity'], skip_blank_lines=False, quoting=csv.QUOTE_NONE)

    # Replace NaN values (the blank lines) with a newline (\n) to mark the end of a sentence
    data_df.fillna('\n', inplace=True)
    
    # Reconstructing the sentences by joining the words together
    sentences = " ".join(data_df['Word']).replace('\n ', '\n') #.replace(' .', '.').replace(' ,', ',').replace(' !', '!').replace(' ?', '?').replace(' :', ':').replace(' ;', ';').replace(' %', '%').replace(' )', ')').replace('( ', '(').replace(' ]', ']').replace('[ ', '[').replace(' }', '}').replace('{ ', '{')
    
    # Write the content back to a text file
    output_file_path = f"{os.path.join(os.getcwd(), 'dataset/txt-version')}/{file[:-3] + 'txt'}"
    with open(output_file_path , 'w') as f:
        f.write(sentences)
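
The same reconstruction can be illustrated without pandas. This is a minimal sketch, assuming one `token<TAB>tag` pair per line and a blank line between sentences, as in the files above; `rows_to_sentences` is a hypothetical helper and the rows are toy data.

```python
def rows_to_sentences(lines):
    """Group 'token<TAB>tag' lines into sentences; a blank line ends a sentence."""
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # Blank line: close the current sentence, if any
            if current:
                sentences.append(" ".join(current))
                current = []
            continue
        # Keep only the token (first column)
        current.append(line.split("\t")[0])
    # Flush the last sentence when the input does not end with a blank line
    if current:
        sentences.append(" ".join(current))
    return sentences

toy_rows = ["Aldo\tPER\n", "Moro\tPER\n", "parla\tO\n", ".\tO\n", "\n", "Roma\tLOC\n"]
result = rows_to_sentences(toy_rows)
print(result)
```

As in the notebook cell, tokens are joined with single spaces, so punctuation stays separated from the preceding word.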

### BIO tagging conversion

Finally, we convert the entity tags in the datasets to the BIO format.

In [None]:
o_tag = "O"
types = set()
count = {}

# Dictionary of input and output files names
files = {
	"wikinews_train.tsv": "WN_train.tsv",
	"wikinews_test.tsv": "WN_test.tsv",
	"fiction_train.tsv": "FIC_train.tsv",
	"fiction_test.tsv": "FIC_test.tsv",
	"degasperi_train.tsv": "ADG_train.tsv",
	"degasperi_test.tsv": "ADG_test.tsv",
	"moro_train.tsv": "AM_train.tsv",
	"moro_test.tsv": "AM_test.tsv",
	"merged_dataset_train.tsv": "MERGED_train.tsv",
	"merged_dataset_test.tsv": "MERGED_test.tsv",
}

for file in tqdm(files):
	with open(f"{os.path.join(os.getcwd(), 'dataset')}/{file}", "r") as f:
		# Getting the output file name related to the current file
		out_file = files[file]
		count[out_file] = {"sentences": 0, "tags": {}, "tokens": 0}

		sentences = []
		current_sentence = []

		for line in f:
			line = line.strip()
			if len(line) == 0:
				if len(current_sentence) > 0:
					sentences.append(current_sentence)
					current_sentence = []
				continue
			parts = line.split("\t")
			current_sentence.append(parts)
			count[out_file]["tokens"] += 1

		if len(current_sentence) > 0:
			sentences.append(current_sentence)

		count[out_file]["sentences"] = len(sentences)

		# BIO tagging conversion. A non-O tag becomes a B-tag when it differs from the tag
		# of the previous token (i.e. it follows an O-tag or an entity of another type);
		# otherwise it becomes an I-tag. Two adjacent entities of the same type are
		# therefore tagged as a single span.
		for sentence in sentences:
			previous_ner = o_tag
			for token in sentence:
				ner = token[1]
				new_ner = ner
				if ner != o_tag:
					if previous_ner != ner:
						if ner not in count[out_file]["tags"]:
							count[out_file]["tags"][ner] = 0
						new_ner = "B-" + ner
						count[out_file]["tags"][ner] += 1
						types.add(ner)
					else:
						new_ner = "I-" + ner
				token[1] = new_ner
				previous_ner = ner

		# Writing the converted file into the appropriate directory
		with open(f"{os.path.join(os.getcwd(), 'dataset/BIO-tagged-version')}/{out_file}", "w") as fw:
			for sentence in sentences:
				for token in sentence:
					fw.write(token[0])
					fw.write("\t")
					fw.write(token[1])
					fw.write("\n")
				fw.write("\n")
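
The conversion rule can be checked on a toy tag sequence. This is a minimal sketch of the same logic as a standalone function; `to_bio` is a hypothetical name and the tags are made up.

```python
def to_bio(tags, outside="O"):
    """Convert plain entity tags (e.g. PER, LOC) of one sentence to BIO tags."""
    bio, previous = [], outside
    for tag in tags:
        if tag == outside:
            bio.append(outside)
        elif tag != previous:
            # First token of an entity, or the entity type has changed
            bio.append("B-" + tag)
        else:
            # Same type as the previous token: continue the entity
            bio.append("I-" + tag)
        previous = tag
    return bio

toy_tags = ["PER", "PER", "O", "LOC", "ORG", "ORG"]
bio_tags = to_bio(toy_tags)
print(bio_tags)
```

Note that `LOC` followed directly by `ORG` yields two `B-` tags, while two consecutive entities of the same type cannot be told apart with this rule.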

## Data inspection

KIND (Kessler Italian Named-entities Dataset) is an Italian dataset for Named-Entity Recognition (NER).

The goal of the NER task is to tag all the named entities in a text, i.e. to identify the mentions of real-world objects such as people, organizations and places.

In this case there are three categories to annotate:
- person (PER): a single individual, an animal or a group of humans with a proper name;
- organization (ORG): every formally established association defined by an organizational structure;
- location (LOC): geographical entities defined by political and/or social groups which possess a physical location and a proper name.

The dataset is composed of four collections, with texts taken from:
- Wikinews (WN), as a source of news texts, with articles from the last two decades;
- Italian fiction books (FIC) in the public domain;
- writings and speeches by the Italian politician Aldo Moro (AM);
- public documents written by Alcide De Gasperi (ADG).

The texts belong to three different domains: news, literature, and political discourse.

The dataset contains more than one million tokens, of which around 600K are manually annotated, while the rest are semi-automatically annotated.

| Dataset   | Documents |
| --------- | --------- |
| Wikinews  | 1,000 |
| Fiction | 86 |
| Aldo Moro | 250 |
| Alcide De Gasperi | 158 |

From the given files it is not possible to tell the individual documents apart: only sentence boundaries are marked.


### Analysis of a single dataset: `degasperi_train.tsv`

In [None]:
# Path of file degasperi_train.tsv
file_path = os.path.join(os.getcwd(), 'dataset/degasperi_train.tsv')

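# Each row of the dataframe is a word with its associated entity type.
# quoting=csv.QUOTE_NONE (as in the import step) keeps quote characters as ordinary tokens
data_df = pd.read_csv(file_path, sep='\t', names=['Word', 'Entity'], quoting=csv.QUOTE_NONE)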

In [None]:
# See first elements
data_df.head()

In [None]:
data_df.info()

No element is null.

In [None]:
data_df.describe()

In [None]:
data_df[:25]

In [None]:
counts_entity = Counter(data_df['Entity'])

plt.bar(counts_entity.keys(), counts_entity.values(), color="#3F5D7D", width=0.8)

# Add the values to the plot
for i, value in enumerate(counts_entity.values()):
    plt.text(i, value, str(value), ha='center', va='bottom')

plt.show()

There is a strong imbalance between the 'O' class and the other classes.
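
The imbalance is easier to compare across datasets when expressed as proportions rather than raw counts. This is a minimal sketch on made-up labels; `label_proportions` is a hypothetical helper, and in the notebook it could be applied to `data_df['Entity']`.

```python
from collections import Counter

def label_proportions(labels):
    """Return the fraction of tokens carrying each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Toy label sequence (not taken from the dataset)
toy_labels = ["O"] * 8 + ["PER", "LOC"]
proportions = label_proportions(toy_labels)
print(proportions)
```

The proportion of 'O' is also the token-level accuracy of a trivial model that never predicts an entity, which is why accuracy alone is a weak metric for this task.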

In [None]:
counts_word = Counter(data_df['Word'])
print(counts_word)

The most common words are punctuation marks and Italian stop words.

There are three classes: person (PER), location (LOC) and organization (ORG).

The tag 'O' is used when a word is not a named entity.

In [None]:
# Retrieve all words with label 'PER'
persons = data_df.loc[data_df['Entity'] == 'PER', 'Word']
persons_set = set(persons)
print(f'There are {len(persons_set)} different words labelled as PER')
sorted(persons_set)[:10]

In [None]:
# Retrieve all words with label 'LOC'
locations = data_df.loc[data_df['Entity'] == 'LOC', 'Word']
locations_set = set(locations)
print(f'There are {len(locations_set)} different words labelled as LOC')
sorted(locations_set)[:10]

In [None]:
# Retrieve all words with label 'ORG'
organizations = data_df.loc[data_df['Entity'] == 'ORG', 'Word']
organizations_set = set(organizations)
print(f'There are {len(organizations_set)} different words labelled as ORG')
sorted(organizations_set)[:10]

In [None]:
# Open the text file degasperi_train.txt
file_path = os.path.join(os.getcwd(), 'dataset/txt-version/degasperi_train.txt')

with open(file_path, 'r') as f:
    text = f.read()

print(f'The length of the text is {len(text)} characters')

In [None]:
# Obtain all the sentences
sentences = re.split('\n', text)

print(f'There are {len(sentences)} sentences')
sentences[:5]

### Retrieve some statistics for the four training datasets and the merged one

In [None]:
# Get the list of files in the dataset directory
dataset_dir = os.path.join(os.getcwd(), 'dataset')
dataset_files = os.listdir(dataset_dir)
datasets = dict()

# Create a dictionary with all train datasets
for file in tqdm(dataset_files):
    # We are dealing with the train.tsv files only
    if file.endswith('test.tsv') or not file.endswith('.tsv'):
        continue

    file_path = os.path.join(dataset_dir, file)  

    # Insert the content of the tsv file into a dataframe and add it to the dictionary
    # (quoting=csv.QUOTE_NONE keeps quote characters as ordinary tokens)
    datasets[str(file)[:-4]] = pd.read_csv(file_path, sep='\t', names=['Word', 'Entity'], quoting=csv.QUOTE_NONE)

In [None]:
for dataset in datasets.keys():
  print(f'\nDataset {dataset}')
  datasets[dataset].info()

In [None]:
for dataset in datasets.keys():
  print(f'\nDataset {dataset}')
  display(datasets[dataset].describe())

In [None]:
for dataset in datasets.keys():
  counts_entity = Counter(datasets[dataset]['Entity'])

  plt.bar(counts_entity.keys(), counts_entity.values(), color="#3F5D7D", width=0.8)

  # Add the values to the plot
  for i, value in enumerate(counts_entity.values()):
      plt.text(i, value, str(value), ha='center', va='bottom')
  plt.title(dataset)    

  plt.show()

In [None]:
for dataset in datasets.keys():
  counts_word = Counter(datasets[dataset]['Word'])
  print(f'\nDataset {dataset} {counts_word.most_common(10)}')

In [None]:
# Get the list of files in the dataset directory
dataset_dir = os.path.join(os.getcwd(), 'dataset/txt-version')
dataset_files = os.listdir(dataset_dir)
datasets = dict()

# Create a dictionary with all train datasets
for file in tqdm(dataset_files):
    # We are dealing with the train.txt files only
    if file.endswith('test.txt') or not file.endswith('.txt'):
        continue

    file_path = os.path.join(dataset_dir, file)  

    # Add each txt file to the dictionary
    with open(file_path, 'r') as f:
      datasets[str(file)[:-4]] = f.read()

In [None]:
for dataset in datasets.keys():
  print(f'The length of the text of {dataset} is {len(datasets[dataset])} characters')

In [None]:
for dataset in datasets.keys():
  # Obtain all the sentences
  sentences = re.split('\n', datasets[dataset])

  print(f'The dataset {dataset} has {len(sentences)} sentences') 
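
Since documents cannot be told apart, the sentence is the closest unit available for length statistics. This is a minimal sketch, assuming one whitespace-tokenized sentence per line as in the `.txt` files; `sentence_length_stats` is a hypothetical helper and the text is toy data.

```python
from statistics import mean, median

def sentence_length_stats(text):
    """Compute simple token-count statistics over the non-empty lines of text."""
    # Drop empty lines so that a trailing newline does not count as a sentence
    sentences = [s for s in text.split("\n") if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    return {
        "sentences": len(lengths),
        "mean_tokens": mean(lengths),
        "median_tokens": median(lengths),
        "max_tokens": max(lengths),
    }

toy_text = "Aldo Moro parla .\nRoma\n\nIl governo italiano si riunisce oggi .\n"
stats = sentence_length_stats(toy_text)
print(stats)
```

Filtering empty lines matters: splitting on `'\n'` alone counts an extra empty "sentence" whenever the text ends with a newline.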

## Vocabulary


### Vocabulary for a single dataset: `degasperi_train.txt`

In [None]:
# Open the text file degasperi_train.txt
file_path = os.path.join(os.getcwd(), 'dataset/txt-version/degasperi_train.txt')

with open(file_path, 'r') as f:
    text = f.read()

print(text[:150])

In [None]:
regex = '[' + string.punctuation + ']'

# Remove all the punctuation (maybe not necessary)
text_no_punctuation = re.sub(regex,'',text)

print(text_no_punctuation[:150])

In [None]:
# Obtain all the sentences
sentences = re.split('\n', text_no_punctuation)

In [None]:
print(f'The length of the text with punctuation is {len(text)} characters')
print(f'The length of the text without punctuation is {len(text_no_punctuation)} characters')
print(f'{len(text) - len(text_no_punctuation)} characters are removed')

In [None]:
# Build vocabulary
# Convert to lowercase, split on whitespace, select only distinct words, sort the words alphabetically
words = text_no_punctuation.lower().split()
vocabulary = sorted(set(words))
print(f'The vocabulary contains {len(vocabulary)} words')
#print(vocabulary)

counts_word = Counter(words)
print(f'Most common words: {counts_word.most_common(10)}')

In [None]:
# Build vocabulary using CountVectorizer
vectorizer = CountVectorizer()
vectorizer.fit(sentences)

print(f'The vocabulary contains {len(vectorizer.get_feature_names_out())} words')
vectorizer.get_feature_names_out()[:150]

In [None]:
nltk.download('stopwords')
#print('Italian stopwords:')
#print(stopwords.words('italian'))

In [None]:
# Build the vocabulary, removing Italian stop words and keeping only words that appear in at least 3 sentences (min_df is a document-frequency threshold)
vectorizer = CountVectorizer(min_df=3, stop_words=stopwords.words('italian'))
vectorizer.fit(sentences)
print(f"vocabulary size: {len(vectorizer.get_feature_names_out())}")
# vectorizer.get_feature_names_out()[:50]
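
The `min_df` parameter of `CountVectorizer` filters on document frequency (here, the number of sentences containing a word), not on the total number of occurrences. The difference can be illustrated with a minimal pure-Python sketch; it uses plain whitespace splitting, which is an assumption and differs from the tokenization `CountVectorizer` applies by default, and the sentences are toy data.

```python
from collections import Counter

def frequencies(sentences):
    """Return (term frequency, document frequency) counters over the sentences."""
    term_freq, doc_freq = Counter(), Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        term_freq.update(tokens)      # every occurrence counts
        doc_freq.update(set(tokens))  # each sentence counts at most once per word
    return term_freq, doc_freq

toy_sentences = ["roma roma roma", "roma capitale", "capitale italia", "capitale"]
tf, df = frequencies(toy_sentences)

# Equivalent of min_df=3: keep the words found in at least 3 sentences
kept = sorted(word for word, n in df.items() if n >= 3)
print(tf["roma"], df["roma"], kept)
```

In this toy example "roma" occurs four times but only in two sentences, so a threshold of 3 on document frequency removes it.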

### Build vocabulary for each training set


In [None]:
# Get the list of files in the dataset directory
dataset_dir = os.path.join(os.getcwd(), 'dataset/txt-version')
dataset_files = os.listdir(dataset_dir)
datasets = dict()

# Create a dictionary with all train datasets
for file in tqdm(dataset_files):
    # We are dealing with the train.txt files only
    if file.endswith('test.txt') or not file.endswith('.txt'):
        continue

    file_path = os.path.join(dataset_dir, file)  

    # Add each txt file to the dictionary
    with open(file_path, 'r') as f:
      datasets[str(file)[:-4]] = f.read()

In [None]:
# Dictionary that contains the vocabulary of each dataset
vocabularies = dict()

regex = '[' + string.punctuation + ']'

for dataset in datasets.keys():
  # Get the text
  text = datasets[dataset]

  # Remove all the punctuation (maybe not necessary)
  text_no_punctuation = re.sub(regex,'',text)

  # Obtain all the sentences
  sentences = re.split('\n', text_no_punctuation)

  # Build the vocabulary, removing Italian stop words and keeping only words that appear in at least 3 sentences
  vectorizer = CountVectorizer(min_df=3, stop_words=stopwords.words('italian'))
  vectorizer.fit(sentences)

  # Save vocabulary
  vocabularies[dataset] = vectorizer
  print(f"{dataset} vocabulary size: {len(vectorizer.get_feature_names_out())}")

In [None]:
vocabularies['merged_dataset_train'].get_feature_names_out()[-50:]

## Word embeddings

Clean the data: 
- remove non-letter characters from each sentence 
- lowercase 
- tokenize the sentences based on whitespace
- remove any sentence with length less than 2 since it won't be useful for training Word2Vec. 

In [None]:
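# TODO: consider keeping the punctuation
# Note: `sentences` holds the punctuation-free sentences of the last dataset processed in the loop above
tokenized_sentences = [re.sub(r'\W', ' ', sentence).lower().split() for sentence in sentences]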
# remove sentences that are only 1 word long
tokenized_sentences = [sentence for sentence in tokenized_sentences if len(sentence) > 1]

for sentence in tokenized_sentences[:10]:
    print(sentence)
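
To see what these cleaning steps do to Italian text, they can be wrapped in a small function and run on a few toy sentences. This is a minimal sketch; `clean_sentences` and its `min_tokens` parameter are hypothetical names.

```python
import re

def clean_sentences(sentences, min_tokens=2):
    """Replace non-word characters, lowercase, split on whitespace, drop short sentences."""
    cleaned = []
    for sentence in sentences:
        # \W matches anything that is not a letter, digit or underscore
        tokens = re.sub(r"\W", " ", sentence).lower().split()
        if len(tokens) >= min_tokens:
            cleaned.append(tokens)
    return cleaned

toy_sentences = ["L'Italia è una Repubblica .", "Roma !", "Sì"]
cleaned = clean_sentences(toy_sentences)
print(cleaned)
```

Two side effects are worth noting: the apostrophe in "L'Italia" splits the elided article into a separate token "l", and `\W` keeps digits and underscores, so the filter is slightly broader than "letters only". Accented letters such as "è" are preserved.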

Train Word2Vec.

Parameters:
- embedding size = 30;
- minimum count for any vocabulary term = 1;
- size of the context window = 10.

In [None]:

model = Word2Vec(tokenized_sentences, vector_size=30, min_count=1, window=10)

In [None]:
print(f'There are {len(model.wv)} word embeddings')

In [None]:
term = 'italia'
model.wv[term]

In [None]:
term ='italia'

model.wv.most_similar(term)

In [None]:
# The expected analogy (roma : italia = ? : francia) does not hold, probably because the dataset is too small
vec = model.wv['roma'] + (model.wv['francia'] - model.wv['italia'])  

model.wv.similar_by_vector(vec)
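
The analogy test can be made explicit with a minimal pure-Python sketch based on cosine similarity. The vectors below are hand-made toy values chosen so that the analogy holds, not embeddings learned from the data, and `analogy` is a hypothetical helper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(vectors, a, b, c, topn=1):
    """Return the words closest to vec(a) - vec(b) + vec(c), excluding a, b and c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    ranked = sorted(candidates, key=lambda w: cosine(vectors[w], target), reverse=True)
    return ranked[:topn]

# Hand-made toy vectors (hypothetical, for illustration only)
toy_vectors = {
    "italia": [1.0, 0.0, 0.0],
    "roma": [1.0, 1.0, 0.0],
    "francia": [0.0, 0.0, 1.0],
    "parigi": [0.0, 1.0, 1.0],
    "governo": [1.0, 0.0, 1.0],
}
best = analogy(toy_vectors, "roma", "italia", "francia")
print(best)
```

Excluding the query words matters: `similar_by_vector` ranks every word, so the inputs themselves may fill the top positions. In gensim, `model.wv.most_similar(positive=['roma', 'francia'], negative=['italia'])` performs a comparable computation and leaves the input words out of the results.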

In [None]:
# sample 300 random word embeddings
sample = random.sample(list(model.wv.key_to_index), 300)
word_vectors = model.wv[sample]

# visualize word embeddings using TSNE
# (recent scikit-learn releases rename n_iter to max_iter)
tsne = TSNE(n_components=3, n_iter=2000)
tsne_embedding = tsne.fit_transform(word_vectors)
x, y, z = np.transpose(tsne_embedding)

In [None]:
!pip install plotly

In [None]:

fig = px.scatter_3d(x=x[:200],y=y[:200],z=z[:200],text=sample[:200])
fig.update_traces(marker=dict(size=3,line=dict(width=2)),textfont_size=10)
fig.show()

## spaCy

In [None]:
!pip install -U spacy

Larger versions of the Italian model are also available: https://spacy.io/models/it

In [None]:
!{sys.executable} -m spacy download it_core_news_sm;

In [None]:
nlp_model = it_core_news_sm.load()

In [None]:
text_spacy = text[:151]
parsed_text = nlp_model(text_spacy)
print(parsed_text)

In [None]:
print(f'The length of the original text is {len(text_spacy)} characters')
print(f'The length of the parsed text is {len(parsed_text)} tokens')

In [None]:
parsed_text.ents

In [None]:
print([(ent.text, ent.label_) for ent in parsed_text.ents])

In [None]:
[(X, X.ent_iob_, X.ent_type_) for X in parsed_text]
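
Token-level triples like these can be grouped back into entity spans, which is the unit usually compared when evaluating NER output. This is a minimal sketch, assuming the IOB codes are `'B'`, `'I'` and `'O'` as returned by `ent_iob_`; `group_entities` is a hypothetical helper and the triples are toy data rather than spaCy output.

```python
def group_entities(tokens):
    """Group (text, iob, ent_type) triples into (entity text, ent_type) spans."""
    spans, current, current_type = [], [], None
    for text, iob, ent_type in tokens:
        # A span starts on 'B', or on an 'I' whose type differs from the open span
        starts_new = iob == "B" or (iob == "I" and ent_type != current_type)
        if iob == "O" or starts_new:
            # Close the span that was open, if any
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [], None
        if iob != "O":
            current.append(text)
            current_type = ent_type
    # Flush a span that reaches the end of the sequence
    if current:
        spans.append((" ".join(current), current_type))
    return spans

toy_tokens = [
    ("Alcide", "B", "PER"),
    ("De", "I", "PER"),
    ("Gasperi", "I", "PER"),
    ("visita", "O", ""),
    ("Trento", "B", "LOC"),
]
spans = group_entities(toy_tokens)
print(spans)
```

The same grouping could be applied to the BIO-tagged files after splitting each tag such as `B-PER` into its prefix and type, so that the model's spans and the gold spans are compared in the same form.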

In [None]:
from spacy import displacy
displacy.render(parsed_text, jupyter=True, style='ent')