# My BA level computational linguistics project
### Start date: 26 May 2025

Hi, this is a notebook where I practice my NLP skills. I use the Harry Potter books in Russian as a data set. This is my BA-level computational linguistics project, and it has two parts:

- Linguistics: automatically annotate wh-questions and polar questions
- Computational: train a model on this annotation and test it on a different set 

## Step 1: Loading and preprocessing text 

In [2]:
import os
os.chdir("/Users/maria.onoeva/Desktop/new_folder/GitHub/nlp-repo")
path = 'questions/txts/'
all_HP = '/Users/maria.onoeva/Desktop/HP_all_ru_Spivak.txt'

with open(all_HP, encoding='utf8') as file:
    text_ru = file.read()


In [3]:
import re
text_ru_cleaned = re.sub(r'\.(?!\s)', '. ', text_ru)

# Replace CRLF and other common line breaks with a space
text_ru_cleaned = text_ru_cleaned.replace('\r\n', ' ').replace('\n', ' ').replace('\r', ' ')

# Replace actual non-breaking space (Unicode \u00A0), not the string 'NBSP'
text_ru_cleaned = text_ru_cleaned.replace('\u00A0', ' ')

text_ru_cleaned = re.sub(r'\s{2,}', ' ', text_ru_cleaned)
text_ru_cleaned = text_ru_cleaned.strip()
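To see what the cleaning steps do, here is a minimal sketch that wraps the same substitutions in a helper and runs them on two toy strings. The helper name `clean_text` and the toy strings are assumptions for illustration; they are not part of the pipeline. One thing the sketch makes visible is that the first regex also splits an ellipsis typed as three separate periods.

```python
import re

def clean_text(raw):
    """Apply the same cleaning steps as the cell above to one string."""
    # Add a space after every period that is not followed by whitespace
    text = re.sub(r'\.(?!\s)', '. ', raw)
    # Replace line breaks and non-breaking spaces with ordinary spaces
    text = text.replace('\r\n', ' ').replace('\n', ' ').replace('\r', ' ')
    text = text.replace('\u00A0', ' ')
    # Collapse runs of whitespace and trim the ends
    text = re.sub(r'\s{2,}', ' ', text)
    return text.strip()

# Toy input: a missing space after a period, a CRLF, and a non-breaking space
sample = "Он ушел.Она осталась.\r\nЧто\u00A0дальше?"
print(clean_text(sample))

# Caveat: an ellipsis typed as three periods is pulled apart
print(clean_text("Ну..."))
```

If the source text uses the single character '…' for ellipses, the caveat does not apply, because the regex only targets the plain period.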

I load the Russian pipeline `ru_core_news_sm` from spaCy and store it in `nlp`. Calling `nlp` on a text creates a so-called `doc` object. Docs contain tokens, and since indexing starts at 0, `doc[34]` returns the 35th token in the doc. There is also a `span` object, which is a slice of a doc.

TODO: figure out spaCy!!

In [4]:
import spacy

questions_spacy = []
nlp = spacy.load("ru_core_news_sm", disable=["ner", "lemmatizer", "tok2vec", "attribute_ruler"])
print(nlp.pipe_names)

nlp.max_length = 5606530

['morphologizer', 'parser']


Since I investigate questions, I want to print some of them out. The first step is to extract all questions, that is, all sentences with a question mark at the end. Update: I now check whether a sentence contains '?' anywhere, because a question can end in '??' (the previous method handled this fine) or in '?!' (it did not).
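The difference between the two checks can be sketched on a few toy sentences. The helper names and the sentences are assumptions made up for this illustration.

```python
def ends_with_qmark(sentence):
    """Earlier check: the sentence must end in a question mark."""
    return sentence.endswith('?')

def contains_qmark(sentence):
    """Updated check: a question mark anywhere in the sentence."""
    return '?' in sentence

toy_sentences = [
    "Что это?",
    "Что??",
    "Что это такое?!",
    "– Что это? – спросил он.",
    "Это сова.",
]

only_end = [s for s in toy_sentences if ends_with_qmark(s)]
anywhere = [s for s in toy_sentences if contains_qmark(s)]

# Sentences that only the updated check catches
print([s for s in anywhere if s not in only_end])
```

The containment check also keeps sentences where the question is followed by a reporting clause, which is why some cleaning is needed afterwards.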

I am not completely happy with the automatic sentence segmentation by `spaCy`, so I'll try `nltk`.

In [5]:
import ssl

import nltk
from nltk.tokenize import sent_tokenize

# Workaround for the error I had when loading NLTK parts: switch off HTTPS
# certificate verification. This has to happen before nltk.download(),
# otherwise it has no effect on the download.
try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/maria.onoeva/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [6]:
sentences_nltk = sent_tokenize(text_ru_cleaned, language='russian')
question_pattern = '?'
questions_nltk = [sent for sent in sentences_nltk if question_pattern in sent]

Not the best but seems better. I am going to try another tool, [Natasha](https://natasha.github.io/).

In [7]:
from natasha import Segmenter, Doc

segmenter = Segmenter()
natasha_doc = Doc(text_ru_cleaned)

# segment() fills natasha_doc.sents and natasha_doc.tokens in place
natasha_doc.segment(segmenter)
questions_natasha = [sent.text for sent in natasha_doc.sents if question_pattern in sent.text]

In [8]:
#print(f'spaCy: {len(questions_spacy)}' )
print(f'NLTK: {len(questions_nltk)}')
print(f'Natasha: {len(questions_natasha)}')

NLTK: 11263
Natasha: 11252


I want to compare the questions from the tools side by side. spaCy is commented out above, so the file only has NLTK and Natasha columns.

In [9]:
from itertools import zip_longest
import csv

questions_3_tools = 'questions/csvs/questions_3_tools.csv'

with open(questions_3_tools, mode='w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file, delimiter=';')

    # Write header row
    writer.writerow(['ID', 'NLTK', 'Natasha'])

    # Pair the two lists by position and number the rows from 1.
    # The shorter list is padded with empty strings.
    # The loop variables are not called `nltk`, so the imported module is not shadowed.
    for id_num, (nltk_q, natasha_q) in enumerate(zip_longest(questions_nltk, questions_natasha, fillvalue=''), start=1):
        writer.writerow([id_num, nltk_q, natasha_q])
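Pairing by position means that the two columns drift apart after the first sentence that only one tool returns. As an alternative way to look at the differences, here is a minimal sketch with `difflib.SequenceMatcher` from the standard library, which aligns the two lists and reports only the stretches that differ. The function name `diff_question_lists` and the toy lists are assumptions for illustration.

```python
from difflib import SequenceMatcher

def diff_question_lists(a, b):
    """Return the non-matching stretches of two lists of sentences."""
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # tag is 'equal', 'replace', 'delete' or 'insert'
        if tag != 'equal':
            diffs.append((tag, a[i1:i2], b[j1:j2]))
    return diffs

# Toy lists standing in for questions_nltk and questions_natasha
toy_nltk = ["Что это?", "– И что? – перебила она.", "Кто там?"]
toy_natasha = ["Что это?", "– И что?", "Кто там?"]

diffs = diff_question_lists(toy_nltk, toy_natasha)
for tag, left, right in diffs:
    print(tag, left, right)
```

On the full lists this may be slow, so it is probably best tried on a slice first.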


I picked Natasha as the sentence segmenter for Russian because it seems more accurate than the others. Its output still needs cleaning, so:
1) I remove a leading dash ('– ') and then match everything that begins with a capital letter and ends with the first question mark; if there is no match, the function just returns the input text.
2) Then I remove whatever precedes a question; such a lead-in usually begins with a capital letter and ends with ': – '.

In [10]:
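def hyphen_start(text):
    # Remove the dialogue dash ('– ', an en dash plus a space) at the start
    pattern = r"^– "
    sub_str = re.sub(pattern, "", text)
    return sub_str

def extract_question_or_full(text):
    # Keep everything from an initial capital letter up to the first '?';
    # return the text unchanged if it does not start that way
    pattern = r"^[А-ЯA-Z][^?]*\?"
    match = re.match(pattern, text)
    return match.group() if match else text

def extract_question_or_full2(text):
    # Remove a lead-in that starts with a capital letter and ends with ': – '
    pattern = r"^[А-ЯA-Z][^?]*: – "
    sub_str = re.sub(pattern, '', text)
    return sub_str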
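The capital-letter pattern leaves a sentence untouched when it opens with a quotation mark, and the range `А-Я` does not include `Ё`. Below is a sketch of a possible variant that allows an optional opening quote and adds `Ё`. The names `ORIGINAL`, `VARIANT` and `extract` are hypothetical, and the variant is only one way to widen the pattern; it is not the pattern used in this notebook.

```python
import re

# Pattern used above: initial capital letter, then everything up to the first '?'
ORIGINAL = re.compile(r'^[А-ЯA-Z][^?]*\?')

# Hypothetical variant: optional opening quote, Ё added to the class,
# repeated question marks and an optional closing quote kept
VARIANT = re.compile(r'^[«"„]?[А-ЯЁA-Z][^?]*\?+[»"“]?')

def extract(pattern, text):
    """Return the matched question, or the full text if nothing matches."""
    match = pattern.match(text)
    return match.group() if match else text

quoted = "«Это что, нормально для кошки?» – нервно подумал он."
plain = "Что это? – спросил он."

print(extract(ORIGINAL, quoted))  # unchanged, because it starts with «
print(extract(VARIANT, quoted))
print(extract(VARIANT, plain))
```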


In [11]:
import pandas as pd

questions_pd = pd.read_csv(questions_3_tools, sep=';', usecols=['Natasha'])

In [12]:
questions_pd['Natasha_no_hyphen'] = questions_pd['Natasha'].apply(lambda x: hyphen_start(str(x)))
questions_pd['Natasha_no_hyphen'] = questions_pd['Natasha_no_hyphen'].apply(lambda x: extract_question_or_full(str(x)))
questions_pd['Natasha_no_hyphen1'] = questions_pd['Natasha_no_hyphen'].apply(lambda x: extract_question_or_full2(str(x)))

questions_pd[60:70]

Unnamed: 0,Natasha,Natasha_no_hyphen,Natasha_no_hyphen1
60,"Скажешь, нет, Гарри?","Скажешь, нет, Гарри?","Скажешь, нет, Гарри?"
61,И все же иногда ему казалось (или он это приду...,И все же иногда ему казалось (или он это приду...,И все же иногда ему казалось (или он это приду...
62,"– Хочешь пойдем наверх, потренируемся?","Хочешь пойдем наверх, потренируемся?","Хочешь пойдем наверх, потренируемся?"
63,– Что это? – спросил он тетю Петунию.,Что это?,Что это?
64,Да и кому бы?,Да и кому бы?,Да и кому бы?
65,– Чего застрял? – раздался голос дяди Вернона.,Чего застрял?,Чего застрял?
66,"– Проверяешь, нет ли бомб?","Проверяешь, нет ли бомб?","Проверяешь, нет ли бомб?"
67,"– Пап, смотри, что это у Гарри?","Пап, смотри, что это у Гарри?","Пап, смотри, что это у Гарри?"
68,– Кто станет тебе писать? – осклабился дядя Ве...,Кто станет тебе писать?,Кто станет тебе писать?
69,"Откуда они знают, где он спит?","Откуда они знают, где он спит?","Откуда они знают, где он спит?"


## Step 2: Annotating questions 

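The table now has 11 263 rows, but only 11 252 of them hold a Natasha question: the other 11 are empty padding rows from `zip_longest`, which `str(x)` turns into the string 'nan' (see the counts below). Although I am super interested in one-word questions, for this model I need to remove them (:sadface:). I will apply the tokenizer to each row, and if the token count is 2 (a word and a question mark), the row is unsuitable.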
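The length filter can be sketched without Natasha by using a rough regex tokenizer that counts words and punctuation marks separately. This is an approximation for illustration only: `rough_token_count` and `keep_question` are hypothetical names, and the regex is not Natasha's tokenizer, so the counts may differ on harder cases.

```python
import re

# A word is a run of letters/digits; every other non-space character is its own token
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def rough_token_count(text):
    """Approximate token count: words plus punctuation marks."""
    return len(TOKEN_RE.findall(text))

def keep_question(text, low=2, high=30):
    """Keep questions with more than `low` and fewer than `high` tokens."""
    return low < rough_token_count(text) < high

toy_questions = ["Что?", "А что?", "Совы средь бела дня?"]
for q in toy_questions:
    print(q, rough_token_count(q), keep_question(q))
```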

Task: some words are written without a space between them, so they are not recognized correctly. I want to detect these words and fix them. 
How: bag of words? Such glued words shouldn't be very frequent.
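One way the bag-of-words idea could look is sketched below: count all word forms, then flag words that occur only once and can be split into two parts that are both frequent on their own. This is only a heuristic under stated assumptions; the function name `find_glued_candidates`, the threshold `min_part_count` and the toy sentences are made up for illustration, and real data would produce false positives that need a manual check.

```python
import re
from collections import Counter

def find_glued_candidates(texts, min_part_count=2):
    """Flag rare words that split into two words seen at least min_part_count times."""
    words = [w.lower() for t in texts for w in re.findall(r"[^\W\d_]+", t)]
    counts = Counter(words)
    candidates = {}
    for word, n in counts.items():
        # Only look at words seen once and long enough to hold two parts
        if n > 1 or len(word) < 4:
            continue
        # Try every split that leaves at least two letters on each side
        for i in range(2, len(word) - 1):
            left, right = word[:i], word[i:]
            if counts[left] >= min_part_count and counts[right] >= min_part_count:
                candidates[word] = (left, right)
                break
    return candidates

toy_texts = [
    "Что это такое?",
    "Что это было?",
    "А чтоэто значит?",  # glued on purpose
    "Что значит это?",
]
candidates = find_glued_candidates(toy_texts)
print(candidates)
```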

In [13]:
questions_pd.count()

Natasha               11252
Natasha_no_hyphen     11263
Natasha_no_hyphen1    11263
dtype: int64

In [14]:
def counting_tokens(text):
    # Count tokens with Natasha's segmenter (punctuation marks count as tokens)
    doc = Doc(text)
    doc.segment(segmenter)
    return len(doc.tokens)

In [15]:
questions_pd['Tokens'] = questions_pd['Natasha_no_hyphen1'].apply(lambda x: counting_tokens(str(x)))
questions_pd[:10]

Unnamed: 0,Natasha,Natasha_no_hyphen,Natasha_no_hyphen1,Tokens
0,"«Это что, нормально для кошки?» – нервно подум...","«Это что, нормально для кошки?» – нервно подум...","«Это что, нормально для кошки?» – нервно подум...",26
1,"Что, будут у нас вечером совопады, Джим?","Что, будут у нас вечером совопады, Джим?","Что, будут у нас вечером совопады, Джим?",10
2,Метеоритные дожди по всей Британии?,Метеоритные дожди по всей Британии?,Метеоритные дожди по всей Британии?,6
3,Совы средь бела дня?,Совы средь бела дня?,Совы средь бела дня?,5
4,Странные люди в мантиях?,Странные люди в мантиях?,Странные люди в мантиях?,5
5,"Петуния, дорогая… к слову… про сестру твою нич...","Петуния, дорогая… к слову… про сестру твою нич...","Петуния, дорогая… к слову… про сестру твою нич...",14
6,– А что?,А что?,А что?,3
7,– И что? – перебила миссис Дурслей.,И что?,И что?,3
8,"Мистер Дурслей колебался: говорить или нет, чт...","Мистер Дурслей колебался: говорить или нет, чт...","Мистер Дурслей колебался: говорить или нет, чт...",17
9,Он ведь по возрасту примерно как наш Дудли?,Он ведь по возрасту примерно как наш Дудли?,Он ведь по возрасту примерно как наш Дудли?,9


In [16]:
questions_pd = questions_pd.loc[questions_pd['Tokens'] > 2].copy()
questions_pd_ready = questions_pd.loc[questions_pd['Tokens'] < 30].copy()

questions_pd_ready.count()

Natasha               9816
Natasha_no_hyphen     9816
Natasha_no_hyphen1    9816
Tokens                9816
dtype: int64

In [17]:
from natasha import NewsEmbedding, NewsSyntaxParser, MorphVocab, NewsMorphTagger

morph_vocab = MorphVocab()

emb = NewsEmbedding()
morph_tagger = NewsMorphTagger(emb)  
syntax_parser = NewsSyntaxParser(emb)

Now I have 9 816 questions that have more than 2 and fewer than 30 tokens. The next step is to automatically annotate wh-questions and polar questions. One can say 'oh, it's easy, just find all wh-words,' but it is not. Take *что* 'what': it can be a particle in a polar question, it can be a conjunction, and only as a pronoun does it make a wh-question. I need to know how to distinguish these uses.

The first and easiest step is to find all questions without any wh-word. I need a list of wh-words, and then I check whether any item from this list occurs in a question. If none does, the question is probably a polar question (PQ). I asked ChatGPT to create this list.

I make a column with lemmas for each question. 

In [18]:
def question_lemmas(text):
    doc = Doc(text)
    doc.segment(segmenter)
    doc.tag_morph(morph_tagger)
    
    lemmas = []
    for token in doc.tokens:
        token.lemmatize(morph_vocab)
        lemmas.append({token.pos:token.lemma})
    return lemmas

questions_pd_ready['Lemmas'] = questions_pd_ready['Natasha_no_hyphen1'].apply(lambda x: question_lemmas(str(x)))

So now I am going to check if a wh-word is in lemmas. The first step is to create a list of all Russian wh-words. 

In [19]:
wh_words_ru = [
    # Nominative case (interrogative pronouns & adjectives)
    "кто",
    "что",
    "чей",    # masc. nom.
    "чья",    # fem. nom.
    "чье",    # neut. nom.
    "чьи",    # plural nom.
    "какой",  # masc. nom.
    "какая",  # fem. nom.
    "какое",  # neut. nom.
    "какие",  # plural nom.
    "сколько",

    # Interrogative adverbs (indeclinable)
    "где",
    "куда",
    "откуда",
    "когда",
    "почему",
    "зачем",
    "как",
    "насколько",
    "почему бы",
    "отчего",
    "почем"
]

def is_wh_word_str(lemmas):
    found = []
    for item in lemmas:
        value = list(item.values())[0]
        if value in wh_words_ru:
            found.append(value)
    if not found:
        return 0
    return '; '.join(found)

def is_wh_word_dict(lemmas):
    found = []
    for item in lemmas:
        key, value = list(item.items())[0]  # get both POS tag and the word
        if value in wh_words_ru:
            found.append(item)
    if not found:
        return 0
    return found
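Because the lookup compares one lemma at a time, an entry that consists of two words, such as "почему бы", cannot match a single token. A small sketch of the lookup on a hand-written lemma list shows this. The function name `find_wh_lemmas`, the shortened word list and the toy lemmas (including their POS tags) are assumptions for illustration, not tagger output.

```python
def find_wh_lemmas(lemmas, wh_words):
    """Collect the lemmas that appear in the wh-word list."""
    wh_set = set(wh_words)  # set lookup instead of scanning a list
    return [lemma for item in lemmas for lemma in item.values() if lemma in wh_set]

# Shortened stand-in for wh_words_ru
toy_wh_words = ["кто", "что", "почему", "почему бы"]

# Hand-written lemmas in the same [{POS: lemma}, ...] shape as the 'Lemmas' column
toy_lemmas = [
    {"ADV": "почему"},
    {"AUX": "бы"},
    {"PART": "не"},
    {"VERB": "спросить"},
    {"PUNCT": "?"},
]

found = find_wh_lemmas(toy_lemmas, toy_wh_words)
# Entries with a space can never equal a single-token lemma
unreachable = [w for w in toy_wh_words if " " in w]
print(found, unreachable)
```

The same reasoning suggests that inflected entries only match if the lemmatizer returns them as lemmas, which is worth checking against the value counts.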


In [20]:
questions_pd_ready['wh_words'] = questions_pd_ready['Lemmas'].apply(lambda x: is_wh_word_str(x)) 
questions_pd_ready['wh_dicts'] = questions_pd_ready['Lemmas'].apply(lambda x: is_wh_word_dict(x)) 

In [21]:
questions_pd_all_wh = questions_pd_ready.loc[questions_pd_ready['wh_words'] != 0].copy()
questions_pd_all_wh.count()

Natasha               5377
Natasha_no_hyphen     5377
Natasha_no_hyphen1    5377
Tokens                5377
Lemmas                5377
wh_words              5377
wh_dicts              5377
dtype: int64

In [22]:
questions_pd_no_wh = questions_pd_ready.loc[questions_pd_ready['wh_words'] == 0].copy()
questions_pd_no_wh.count()

Natasha               4439
Natasha_no_hyphen     4439
Natasha_no_hyphen1    4439
Tokens                4439
Lemmas                4439
wh_words              4439
wh_dicts              4439
dtype: int64

### Annotating questions with 'что'

I want to take a look at the unique values in `questions_pd_all_wh`. Out of 5 377 questions containing any wh-word, 2 448 contain only 'что.' If it is a pronoun, the question is a wh-question. It can also be a conjunction, and then the question type is not clear.
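The pronoun-versus-conjunction idea can be written down as a first-pass rule over the `wh_dicts` values. This is a heuristic sketch, not a finished annotation scheme: the function name `label_chto_question` and the three label strings are hypothetical, and the rule simply trusts the POS tag, which the tagger sometimes gets wrong.

```python
def label_chto_question(wh_dicts):
    """First-pass label for a question, based on the POS tags of 'что'."""
    # The wh_dicts column holds 0 when no wh-word was found
    if not wh_dicts:
        return "no_chto"
    pos_tags = {pos for item in wh_dicts
                for pos, lemma in item.items() if lemma == "что"}
    if not pos_tags:
        return "no_chto"
    if pos_tags == {"PRON"}:
        return "wh_candidate"   # 'что' tagged as a pronoun only
    return "unclear"            # conjunction, particle, or mixed tags

print(label_chto_question([{"PRON": "что"}]))
print(label_chto_question([{"SCONJ": "что"}]))
print(label_chto_question(0))
```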

In [23]:
questions_pd_all_wh['wh_words'].value_counts()

wh_words
что                  2448
как                   799
кто                   390
почему                302
где                   265
                     ... 
откуда; куда            1
зачем; зачем            1
насколько; как          1
что; откуда             1
какой; кто; какой       1
Name: count, Length: 122, dtype: int64

In [24]:
questions_pd_what = questions_pd_ready.loc[questions_pd_ready['wh_words'] == 'что'].copy()
questions_pd_what['wh_dicts'].value_counts()

wh_dicts
[{'PRON': 'что'}]     1782
[{'SCONJ': 'что'}]     648
[{'PROPN': 'что'}]      13
[{'PART': 'что'}]        3
[{'NOUN': 'что'}]        2
Name: count, dtype: int64

In [25]:
questions_pd_who = questions_pd_ready.loc[questions_pd_ready['wh_words'] == 'кто'].copy()
questions_pd_who['wh_dicts'].value_counts()

wh_dicts
[{'PRON': 'кто'}]     381
[{'VERB': 'кто'}]       3
[{'PROPN': 'кто'}]      3
[{'NOUN': 'кто'}]       2
[{'PART': 'кто'}]       1
Name: count, dtype: int64

In [72]:
questions_pd_zacem = questions_pd_ready.loc[questions_pd_ready['wh_words'] == 'зачем'].copy()
questions_pd_zacem.to_csv('questions/csvs/questions_pd_zacem.csv', 
                          columns=['Natasha_no_hyphen1', 'Tokens', 'Lemmas', 
                                   'wh_words', 'wh_dicts'], 
                          sep=';') 

In [None]:
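def manual_anno(csv_path):
    # Ask for a label for every question in the file and save the answers
    questions = pd.read_csv(csv_path, sep=';')
    questions = questions.dropna(how='all')
    annotation = []
    # Column 0 is the saved index, column 1 holds the question text
    for question in questions.iloc[:, 1]:
        answer = input(f"Is this a PQ? ---- {question} ")
        annotation.append(answer)
    questions['annotation'] = annotation
    # Keep ';' as the separator so that the file can be read again with sep=';'
    questions.to_csv(csv_path, sep=';', index=False)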
    

In [73]:
manual_anno('questions/csvs/questions_pd_zacem.csv')   
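Free-text answers are easy to mistype during a long annotation session. A possible refinement is sketched below: a small prompt helper that repeats the question until the answer is one of the allowed labels. The name `ask_label`, the `ask` parameter and the labels 'y'/'n' are assumptions for illustration. Passing the input function as a parameter lets the helper be tried out without typing.

```python
def ask_label(question, ask=input, allowed=("y", "n")):
    """Prompt until the answer is one of the allowed labels."""
    prompt = f"Is this a PQ? [{'/'.join(allowed)}] ---- {question} "
    while True:
        answer = ask(prompt).strip().lower()
        if answer in allowed:
            return answer

# Dry run with canned answers instead of keyboard input
fake_answers = iter(["maybe", " Y "])
label = ask_label("Совы средь бела дня?", ask=lambda prompt: next(fake_answers))
print(label)
```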

In [26]:
questions_pd_who.to_csv('questions/csvs/questions_pd_who.csv', 
                          columns=['Natasha_no_hyphen1', 'Tokens', 'Lemmas', 
                                   'wh_words', 'wh_dicts'], 
                          sep=';') 

In [27]:
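# questions_pd_ready only keeps rows with fewer than 30 tokens, so the long
# questions have to come from questions_pd, which has no lemma or wh columns yet.
# '>= 30' is the exact complement of the '< 30' filter above.
questions_pd_more30 = questions_pd.loc[questions_pd['Tokens'] >= 30]
questions_pd_more30.count()
questions_pd_more30.to_csv('questions/csvs/questions_pd_more30.csv', 
                          columns=['Natasha_no_hyphen1', 'Tokens'], 
                          sep=';') 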

In [28]:
questions_pd_what.to_csv('questions/csvs/questions_pd_what.csv', 
                         columns=['Natasha_no_hyphen1', 'Tokens', 'Lemmas', 
                                   'wh_words', 'wh_dicts'], 
                          sep=';') 

In [29]:
questions_pd_no_wh.to_csv('questions/csvs/questions_pd_no_wh.csv', 
                          columns=['Natasha_no_hyphen1', 'Tokens', 'Lemmas', 'wh_words'], 
                          sep=';') 

In [30]:
questions_pd_all_wh.to_csv('questions/csvs/questions_pd_all_wh.csv', 
                           columns=['Natasha_no_hyphen1', 'Tokens', 'Lemmas', 'wh_words'], 
                           sep=';') 

In [31]:
questions_pd_ready.to_csv('questions/csvs/questions_pd_ready.csv', 
                          columns=['Natasha_no_hyphen1', 'Tokens', 'Lemmas', 'wh_words', 'wh_dicts'], 
                          sep=';') 

In [32]:
questions_pd.to_csv('questions/csvs/questions_pd.csv', sep=';') 