# Reference removal classifier

After some initial semantic chunking experiments, it became apparent that the reference list is included with the text of each article. Unfortunately, there is no section header or other obvious formatting that would make the reference list easy to remove. For some articles, as much as a third of the text is references. We need a good way to remove them, either in the Wikipedia extraction phase or in the semantic chunking phase; I could imagine wanting to do this with other data sources too. Either way, let's see what we can come up with. Here is the plan:

1. Split some articles into sentences and manually label them as 'text' or 'reference' to create training data.
2. Use scikit-learn's TF-IDF or count vectorizer to prepare the data.
3. Train a classifier of some sort on it. My thought was XGBoost, but I have seen a few tutorials where people use a Multinomial Naive Bayes model (`from sklearn.naive_bayes import MultinomialNB`).
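
As a rough illustration of steps 2 and 3, here is a minimal sketch on a handful of made-up sentences. The sentences, the labels and the choice of `make_pipeline` are assumptions for illustration only; the real training data comes from the manual labeling in step 1.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy, hand-written examples (not taken from the real training data)
sentences = [
    'The river flows north through the valley before reaching the sea.',
    'She was elected to the city council in 1998.',
    'The band released their second album the following year.',
    'The museum was founded by a group of local historians.',
    'Retrieved 7 July 2020.',
    'Archived from the original on 22 April 2022.',
    'Retrieved 3 March 2019.',
    'Archived from the original on 1 May 2015.',
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]  # 0 = text, 1 = reference

# Step 2 (TF-IDF vectorization) and step 3 (classifier) chained together
pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
pipeline.fit(sentences, labels)

# Classify two unseen sentences
toy_predictions = pipeline.predict([
    'Retrieved 12 June 2021.',
    'The town was founded near the river.',
])
print(toy_predictions)
```

Wrapping the vectorizer and the model in one pipeline keeps the vocabulary fitted on the training sentences only, which is the same separation the train-test split below relies on.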

Famous last words: sounds like it shouldn't be too hard....


## 1. Run setup

In [1]:
# Change working directory to parent so we can import as we would from __main__.py
print('Working directory: ', end='')
%cd ..

# PyPI imports
import h5py
import nltk
import pandas as pd
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score, classification_report
from xgboost import XGBClassifier

# Internal imports
import configuration as config

nltk.download('punkt_tab')

Working directory: /mnt/arkk/opensearch/semantic_search


Note: You have installed the 'manylinux2014' variant of XGBoost. Certain features such as GPU algorithms or federated learning are not available. To use these features, please upgrade to a recent Linux distro with glibc 2.28+, and install the 'manylinux_2_28' variant.
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /home/siderealyear/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [2]:
input_file=f'{config.DATA_PATH}/wikipedia/{config.BATCHED_TEXT}'
training_data_path=f'{config.DATA_PATH}/ref_removal_classifier'

## 2. Training data preparation

This is going to require some manual curation, but hopefully we will not need too many examples. The plan is to sentence split a few records, maybe start with 10, and save them to txt files. Then we can look at each file and copy/paste the 'text' sentences into one file and the 'reference' sentences into another. This should be pretty easy and quick: we just need to visually identify where the reference section starts and copy/paste what is above and below it.
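
Once the first reference line has been found by eye, the copy/paste step amounts to cutting the sentence list at that position. A minimal sketch, assuming a hypothetical helper `split_at_reference_start` and made-up placeholder sentences (neither is part of this notebook):

```python
def split_at_reference_start(sentences, first_reference_index):
    """Split sentences into (text, references) at a manually chosen index."""
    if not 0 <= first_reference_index <= len(sentences):
        raise ValueError('first_reference_index is out of range')
    text = sentences[:first_reference_index]
    references = sentences[first_reference_index:]
    return text, references


# Placeholder sentences; here the reference list starts at index 2
example_sentences = [
    'First body sentence.',
    'Second body sentence.',
    'Some Newspaper.',
    'Retrieved 7 July 2020.',
]

text_part, reference_part = split_at_reference_start(example_sentences, 2)
print(len(text_part), len(reference_part))
```

This only works because the reference list sits at the end of each article, which is what the manual inspection relies on as well.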

In [3]:
# Open the hdf5 file and grab the first batch
input_data=h5py.File(input_file, 'r')
batch=input_data['batches/0']

# Loop through the first 100 records, sentence split and write each to a file
for i, record in enumerate(batch[:100]):

    # Get the text string from bytes object
    record_text=record.decode('utf-8')

    # Apply sentence splitter
    sentences=nltk.tokenize.sent_tokenize(record_text)

    # Join with newline for output to file
    sentences='\n'.join(sentences)

    output_file=f'{training_data_path}/{i}.txt'

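    # Write explicitly as UTF-8, since article text contains non-ASCII characters
    with open(output_file, 'w', encoding='utf-8') as output:
        print(sentences, file=output)

# Close the HDF5 file now that the records have been written
input_data.close()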

OK, went through up to record 30. Let's try with that. We have about 500 lines each of text and reference. Also noticed along the way that there were several 'disambiguation' pages and a few that were just a sentence or two about a movie: '_____ is a film by ____' and then a cast list. Maybe we should be filtering those out too?
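
If we do decide to filter those pages, a simple heuristic might be enough as a first pass. The sketch below is an assumption, not something tested on this data: `looks_like_stub` is a hypothetical helper, the `min_sentences` threshold is a guess, and the 'may refer to' check relies on the usual wording of Wikipedia disambiguation pages.

```python
def looks_like_stub(sentences, min_sentences=5):
    """Heuristic: flag disambiguation pages and very short articles.

    min_sentences is an untuned guess, not a value derived from the data.
    """
    # Very short records, e.g. one sentence about a film plus a cast list
    if len(sentences) < min_sentences:
        return True

    # Disambiguation pages usually open with '... may refer to:'
    return 'may refer to' in sentences[0].lower()


# Made-up examples
print(looks_like_stub(['Mercury may refer to:', 'A.', 'B.', 'C.', 'D.', 'E.']))
print(looks_like_stub(['_____ is a film by ____.', 'Cast list.']))
print(looks_like_stub(['A long article starts here.'] + ['More text.'] * 10))
```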

Next, we need to load the two training data files, add the labels, combine and shuffle them, train test split them and then vectorize.

In [4]:
# Load the reference and text sentences
with open(f'{training_data_path}/ref_sentences.txt', encoding='utf-8') as file:
    ref_sentences=[line.rstrip() for line in file]

with open(f'{training_data_path}/text_sentences.txt', encoding='utf-8') as file:
    text_sentences=[line.rstrip() for line in file]

# Create lists of labels
ref_labels=[1]*len(ref_sentences)
text_labels=[0]*len(text_sentences)

# Make dataframes
ref_df=pd.DataFrame({
    'sentence': ref_sentences,
    'label': ref_labels
})

text_df=pd.DataFrame({
    'sentence': text_sentences,
    'label': text_labels
})

# Concatenate
data_df=pd.concat([text_df, ref_df])

# Shuffle
data_df=data_df.sample(frac=1).reset_index(drop=True)

data_df.head(20)

|   | sentence | label |
|---|----------|-------|
| 0 | Raj Hans Kumar stated that in political affair... | 0 |
| 1 | Since the mid-80s, Breydon Water has been a na... | 0 |
| 2 | The Khalsa Diwan Society was founded on July 2... | 0 |
| 3 | TVFPlay. | 1 |
| 4 | Beaton has also occasionally appeared on Fly T... | 0 |
| 5 | "Editor's Review", Harvard Educational Review,... | 1 |
| 6 | CITED: p. 377-378. | 1 |
| 7 | Eastern Daily Press. | 1 |
| 8 | The Diaries of A. Christie. | 1 |
| 9 | The A.B.C. | 1 |


In [5]:
# Train-test split
train_df, test_df = train_test_split(data_df, test_size=0.20, stratify=data_df.label)

# Vectorize with bag-of-words
# vec=CountVectorizer(
#     ngram_range=(1, 3),
#     stop_words='english',
# )

# Vectorize with TF-IDF
vec=TfidfVectorizer()

train_features = vec.fit_transform(train_df.sentence)
test_features = vec.transform(test_df.sentence)

train_labels = train_df.label
test_labels = test_df.label

## 3. Gaussian Naive Bayes

In [6]:
model=GaussianNB()
model.fit(train_features.toarray(), train_labels)

predictions=model.predict(test_features.toarray())
print(classification_report(test_labels, predictions))

              precision    recall  f1-score   support

           0       0.76      0.94      0.84       168
           1       0.94      0.75      0.84       204

    accuracy                           0.84       372
   macro avg       0.85      0.85      0.84       372
weighted avg       0.86      0.84      0.84       372



## 4. Gradient boosting decision tree

In [9]:
model=XGBClassifier()
model.fit(train_features, train_labels)

predictions=model.predict(test_features)
print(classification_report(test_labels, predictions))

              precision    recall  f1-score   support

           0       0.94      0.89      0.91       168
           1       0.92      0.95      0.93       204

    accuracy                           0.92       372
   macro avg       0.93      0.92      0.92       372
weighted avg       0.93      0.92      0.92       372



## 5. Sanity check

XGBoost appears to be winning. Let's inspect some predictions and see how we are doing.

In [13]:
for prediction, sentence in zip(predictions[:20], test_df['sentence'][:20]):
    print(f'{prediction}: {sentence}')

0: "After a careful study of the goods displayed in the window, Poirot entered and represented himself as desirous of purchasing a rucksack for a hypothetical nephew."
0: A listing of graduates from 1926, lists the first class as 1886.
1: "Career - Theatre 1956-1959".
0: Their style was considered "modern instrumental".
1: Christie 1939, Chapter 7.
0: [citation needed] The initial plans asked for a library and community centre, but these aspects were eliminated from the plans.
1: Atlas of the Moon.
1: Play.cine.ar.
1: Stockholm: Prisma.
0: Crazy Women (Spanish: Las locas) is a 1977 Argentine drama film written by José P. Dominiani and directed by Enrique Carreras.
1: Retrieved 7 July 2020.
0: So much had he become the rage that every rich woman who had mislaid a bracelet or lost a pet kitten rushed to secure the services of the great Hercule Poirot.
1: "Jockey Julien Leparoux Prepares for Biggest Tests".
1: Archived from the original on 22 April 2022.
0: He also was member of the Ameri

We can see some obvious misclassifications, for example:

```text
At low tide there are vast areas of mudflats and saltings, all teeming with birds.
```

This is definitely not a reference-derived fragment, but on the whole we are doing pretty well. I would rather lose a sentence here or there than include text that is more than 50% reference list content.
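
To use a classifier like this for the actual cleanup, one option is to drop every sentence predicted as a reference and rejoin the rest. The sketch below is an assumption about how that could look, not the pipeline's implementation: `remove_references` is a hypothetical helper, the training sentences are made up, and a Multinomial Naive Bayes model stands in for XGBoost to keep the example small.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB


def remove_references(sentences, vectorizer, classifier):
    """Keep only the sentences the classifier labels as text (0)."""
    features = vectorizer.transform(sentences)
    predicted = classifier.predict(features)
    return [s for s, label in zip(sentences, predicted) if label == 0]


# Toy training data (made up); 0 = text, 1 = reference
train_sentences = [
    'The river flows north through the valley before reaching the sea.',
    'She was elected to the city council in 1998.',
    'The band released their second album the following year.',
    'The museum was founded by a group of local historians.',
    'Retrieved 7 July 2020.',
    'Archived from the original on 22 April 2022.',
    'Retrieved 3 March 2019.',
    'Archived from the original on 1 May 2015.',
]
train_targets = [0, 0, 0, 0, 1, 1, 1, 1]

toy_vectorizer = TfidfVectorizer()
toy_model = MultinomialNB()
toy_model.fit(toy_vectorizer.fit_transform(train_sentences), train_targets)

# A made-up, already sentence-split article with references at the end
article_sentences = [
    'The museum was founded near the river.',
    'Retrieved 5 May 2018.',
    'Archived from the original on 9 June 2020.',
]

cleaned = remove_references(article_sentences, toy_vectorizer, toy_model)
print(' '.join(cleaned))
```

The same function should work unchanged with the fitted `vec` and the XGBoost `model` from above, since both expose `transform` and `predict` in the same way.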