Marcin Wardyński  
Thursday, 8:00

## Lab 2

I use the dockerized Elasticsearch from the repository and assume that its container is running before any queries are executed.

#### 1. Create an analyzer for Polish texts

In [22]:
es_url = "http://localhost:9200"
index_name = "mw_nlp_lab2"

index_url = f"{es_url}/{index_name}"

In [48]:
import requests

# Delete the index
delete_response = requests.delete(f"{index_url}")

# Check if the deletion was successful
if delete_response.status_code == 200:
    print(f"Index '{index_name}' deleted successfully.")
else:
    print(f"Failed to delete index '{index_name}': {delete_response.text}")

Index 'mw_nlp_lab2' deleted successfully.


In [49]:
import requests

index_list_response = requests.get(f"{es_url}/_cat/indices?format=json")
index_list_response.content

b'[]'

In [50]:
import requests
import json

fiqa_index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                "polish_months_synonym": {
                    "type": "synonym",
                    "synonyms": [
                        "kwiecień, kwi, IV",
                    ]
                },
                "polish_morfologik": {
                    "type": "morfologik_stem"
                }
            },
            "analyzer": {
                "polish_analyzer_1": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "polish_months_synonym",
                        "polish_morfologik",
                        "lowercase"
                    ]
                },
                "polish_analyzer_2": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "polish_morfologik",
                        "lowercase"
                    ]
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "text": {
                "type": "text",
                "fields": {
                    "analyzed_1": {
                        "type": "text",
                        "analyzer": "polish_analyzer_1"
                    },
                    "analyzed_2": {
                        "type": "text",
                        "analyzer": "polish_analyzer_2"
                    }
                }
            }
        }
    }
}

response = requests.put(index_url, headers={"Content-Type": "application/json"}, data=json.dumps(fiqa_index_settings))

# Check if the index was created successfully
if response.status_code == 200:
    print("Index created.")
else:
    print(f"Index creation failed: {response.text}")


Index created.
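
To see what each analyzer actually emits, the index's `_analyze` endpoint can be queried with a sample phrase. The sketch below only prepares the request body and reads the token strings out of a response. The helper names `build_analyze_body` and `extract_tokens` are made up for illustration, and `sample_response` is a hand-written stand-in that mimics the shape of an `_analyze` response. It is not real output from this index.

```python
def build_analyze_body(analyzer, text):
    """Build the JSON body for POST {index_url}/_analyze."""
    return {"analyzer": analyzer, "text": text}


def extract_tokens(analyze_response):
    """Return the token strings from an _analyze response, in order."""
    return [entry["token"] for entry in analyze_response.get("tokens", [])]


# Hand-written stand-in response (illustrative shape only, not real output).
sample_response = {
    "tokens": [
        {"token": "foo", "start_offset": 0, "end_offset": 3, "position": 0},
        {"token": "bar", "start_offset": 4, "end_offset": 7, "position": 1},
    ]
}

body = build_analyze_body("polish_analyzer_1", "foo bar")
tokens = extract_tokens(sample_response)
print(body)
print(tokens)
```

With the notebook's variables, the body could be sent as `requests.post(f"{index_url}/_analyze", json=build_analyze_body("polish_analyzer_1", "kwi"))`. Repeating the call with `polish_analyzer_2` should show what the synonym filter changes.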


In [51]:
with open("corpus.jsonl", "r", encoding="utf-8") as file:
    bulk_data = ""
    for line in file:
        doc = json.loads(line.strip())
        doc_id = doc.pop("_id")
        bulk_data += json.dumps({"index": {"_index": index_name, "_id": doc_id}}) + "\n"
        bulk_data += json.dumps(doc) + "\n"
        

bulk_response = requests.post(f"{es_url}/_bulk", headers={"Content-Type": "application/x-ndjson"}, data=bulk_data)

if bulk_response.status_code == 200:
    response_data = bulk_response.json()
    if any(item.get("index", {}).get("error") for item in response_data["items"]):
        print("Some documents failed to index:")
        for item in response_data["items"]:
            if "error" in item["index"]:
                print(item["index"]["error"])
    else:
        print("All documents indexed successfully.")
else:
    print(f"Failed to index data: {bulk_response.text}")

All documents indexed successfully.
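
The cell above sends the whole corpus in a single `_bulk` request. For larger corpora it may be safer to send it in batches, so that no single request body grows too large. The following is a minimal sketch of such batching. The helper name `iter_bulk_payloads` and the default batch size of 1000 are assumptions for illustration, not part of the lab.

```python
import json


def iter_bulk_payloads(docs, index, batch_size=1000):
    """Yield NDJSON bodies for the _bulk API, at most batch_size documents each."""
    lines = []
    count = 0
    for doc in docs:
        source = dict(doc)  # copy, so the caller's dict is not mutated
        doc_id = source.pop("_id")
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps(source))
        count += 1
        if count == batch_size:
            # The bulk body must end with a newline
            yield "\n".join(lines) + "\n"
            lines, count = [], 0
    if lines:
        yield "\n".join(lines) + "\n"


# Tiny made-up documents, only to show the batching behaviour
docs = [{"_id": str(i), "text": f"doc {i}"} for i in range(5)]
payloads = list(iter_bulk_payloads(docs, "mw_nlp_lab2", batch_size=2))
print(len(payloads))
```

Each yielded string can be posted to `f"{es_url}/_bulk"` with the same `application/x-ndjson` header as above, checking the `items` of every response for errors.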


In [92]:
search_word = "kwiecień"  # Replace with the word you're looking for
search_field = "text.analyzed_2"  # use text.analyzed_1 to search with synonyms

# Define the search query
search_query = {
    "size": 500,
    "query": {
        "match": {
            search_field: search_word
        }
    }
}

# Perform the search request
response = requests.get(f"{index_url}/_search", headers={"Content-Type": "application/json"}, data=json.dumps(search_query))

docs_found = set()

# Check if the request was successful
if response.status_code == 200:
    search_results = response.json()
    print(f"Found {search_results['hits']['total']['value']} documents matching '{search_word}' in {search_field}.")

    # Collect the IDs of the matching documents
    for hit in search_results["hits"]["hits"]:
        docs_found.add(hit['_id'])
else:
    print(f"Search failed: {response.text}")


Found 257 documents matching 'kwiecień' in text.analyzed_2.


With synonyms (`text.analyzed_1`), the code above returns 306 documents.
Without synonyms (`text.analyzed_2`), it returns 257 documents.
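
To check which documents are matched only through the synonyms, the ID sets from the two searches can be subtracted. Below is a minimal sketch under stated assumptions: the helpers `build_match_query` and `hit_ids` are hypothetical, and the two responses are tiny hand-written stand-ins with the shape of a `_search` response, not results from this index.

```python
def build_match_query(field, word, size=500):
    """Build a match query; size must be at least the number of hits to get all IDs."""
    return {"size": size, "query": {"match": {field: word}}}


def hit_ids(search_response):
    """Collect the document IDs returned in a _search response."""
    return {hit["_id"] for hit in search_response["hits"]["hits"]}


# Hand-written stand-in responses (illustrative only, not real results)
with_synonyms = {"hits": {"total": {"value": 3}, "hits": [{"_id": "a"}, {"_id": "b"}, {"_id": "c"}]}}
without_synonyms = {"hits": {"total": {"value": 2}, "hits": [{"_id": "a"}, {"_id": "b"}]}}

query = build_match_query("text.analyzed_1", "kwiecień")
only_via_synonyms = hit_ids(with_synonyms) - hit_ids(without_synonyms)
print(only_via_synonyms)
```

Running both queries against the real index and taking the same difference would list the documents that contain only an abbreviated or Roman-numeral form.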

In the previous lab we were asked to write a regular expression that finds "kwiecień" in its full declension, across all cases in both numbers. Below I reuse that code to check how its results compare with those of the analyzer without synonyms. (A comparison with synonyms makes little sense, since the regular expression was not meant to cover them.)

In [82]:
import regex

from datasets import load_dataset

dataset = load_dataset("clarin-knext/fiqa-pl", name="corpus")
corpus = dataset['corpus']

april_p = r"kwie(cień|tni)"
april_pattern = regex.compile(april_p, flags=regex.IGNORECASE | regex.MULTILINE)

def count_april_occurrences(what, pattern):
    """Count pattern matches per document in the global `corpus`.

    Returns a dict mapping document ID to its number of matches.
    """
    occurrences = {}
    counter = 0

    for entry in corpus:
        found = regex.findall(pattern, entry['text'])

        counter += len(found)
        if found:
            occurrences[entry["_id"]] = len(found)

    print(f"{what} found in {len(occurrences.keys())} documents in total {counter} times.")
    return occurrences


occ_april = count_april_occurrences("April (directly)", april_pattern)
regex_docs_found = occ_april.keys()

April (directly) found in 265 documents in total 362 times.


In [93]:
docs_found-regex_docs_found

set()

There are no documents containing a word based on "kwiecień" that Elasticsearch found but the regular expression did not.

In [89]:
regex_docs_found-docs_found

{'109292', '159500', '166563', '208216', '265866', '441143', '469888', '82284'}

There are, however, eight documents found by the regular expression but not by FTS. This is because the regular expression was formulated rather loosely, so it also matched the adjective derived from "kwiecień", i.e. "kwietniowy", in all its inflected forms.
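
A stricter pattern could list the noun endings explicitly and anchor the match on word boundaries. The sketch below uses the standard `re` module rather than `regex`, and the list of endings is written by hand, so treat it as an assumption rather than the pattern from the previous lab. One ambiguity remains: "kwietniowi" is both the dative singular of the noun and a plural form of the adjective, so it is still matched.

```python
import re

# Noun forms of "kwiecień": the base form, or the stem "kwietni" plus an optional
# case ending. The ending list is hand-written (an assumption for illustration).
APRIL_NOUN = re.compile(
    r"\bkwie(?:cień|tni(?:a|owi|em|u|e|om|ami|ach)?)\b",
    flags=re.IGNORECASE,
)

samples = [
    "Kwiecień",
    "w kwietniu",
    "do kwietnia",
    "kwietniowy deszcz",
    "kwietniowa pogoda",
]
matched = [s for s in samples if APRIL_NOUN.search(s)]
print(matched)
```

The trailing `\b` rejects "kwietniowy" and "kwietniowa", because the stem is followed by further word characters that do not form one of the listed endings.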

In [2]:
import json

from datasets import load_dataset

subset = "test"

# Load the FiQA-PL qrels dataset
dataset = load_dataset("clarin-knext/fiqa-pl-qrels")

# Write the selected split to a JSONL file, one relevance judgement per line
with open("fiqa_pl_qrels.jsonl", "w") as f:
    for item in dataset[subset]:
        f.write(f"{json.dumps(item)}\n")