## Deep learning – text processing – labs
# 1. TF–IDF

In [3]:
!pip install numpy
import numpy as np
import re



## Document collection

In [7]:
documents = ['Ala lubi zwierzęta i ma kota oraz psa!',
             'Ola lubi zwierzęta oraz ma kota a także chomika!',
             'I Jan jeździ na rowerze.',
             '2 wojna światowa była wielkim konfliktem zbrojnym',
             'Tomek lubi psy, ma psa  i jeździ na motorze i rowerze.',
            ]

What do we need?

- We want to turn each text into a collection of words.

### ❔ Questions

- Can we use `document.split(' ')` to tokenize the text?
- What difficulties might we run into?
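
To make the questions above concrete, here is a minimal sketch of what `split(' ')` does with raw text. The example string is the last entry of `documents`; the regular expression is an illustrative assumption, not the tokenizer used later in this notebook.

```python
import re

text = 'Tomek lubi psy, ma psa  i jeździ na motorze i rowerze.'

# Splitting on a single space keeps punctuation glued to words
# and turns the double space into an empty token.
naive_tokens = text.split(' ')
print(naive_tokens)

# Assumption for illustration: treat every run of word characters as a token.
regex_tokens = re.findall(r'\w+', text.lower())
print(regex_tokens)
```

The naive result contains tokens such as `'psy,'` and an empty string, which is why the notebook cleans the text before splitting it.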

## Preprocessing

In [8]:
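def get_str_cleaned(str_dirty):
    punctuation = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
    new_str = str_dirty.lower()
    # Remove punctuation first, so a mark standing between two spaces
    # cannot leave a double space behind
    for char in punctuation:
        new_str = new_str.replace(char, '')
    # Collapse runs of spaces into a single space
    new_str = re.sub(' +', ' ', new_str)
    return new_str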

In [9]:
sample_document = get_str_cleaned(documents[0])

In [10]:
sample_document

'ala lubi zwierzęta i ma kota oraz psa'

## Tokenization

In [11]:
def tokenize_str(document):
    return document.split(' ')

In [12]:
tokenize_str(sample_document)

['ala', 'lubi', 'zwierzęta', 'i', 'ma', 'kota', 'oraz', 'psa']

In [13]:
documents_cleaned = [get_str_cleaned(document) for document in documents]

In [14]:
documents_cleaned

['ala lubi zwierzęta i ma kota oraz psa',
 'ola lubi zwierzęta oraz ma kota a także chomika',
 'i jan jeździ na rowerze',
 '2 wojna światowa była wielkim konfliktem zbrojnym',
 'tomek lubi psy ma psa i jeździ na motorze i rowerze']

In [15]:
documents_tokenized = [tokenize_str(d) for d in documents_cleaned]

In [16]:
documents_tokenized

[['ala', 'lubi', 'zwierzęta', 'i', 'ma', 'kota', 'oraz', 'psa'],
 ['ola', 'lubi', 'zwierzęta', 'oraz', 'ma', 'kota', 'a', 'także', 'chomika'],
 ['i', 'jan', 'jeździ', 'na', 'rowerze'],
 ['2', 'wojna', 'światowa', 'była', 'wielkim', 'konfliktem', 'zbrojnym'],
 ['tomek',
  'lubi',
  'psy',
  'ma',
  'psa',
  'i',
  'jeździ',
  'na',
  'motorze',
  'i',
  'rowerze']]

### ❔ Questions

- What is the next step towards building TF or TF–IDF vectors?
- What size will a TF or TF–IDF vector have?

## Building the vocabulary

In [17]:
vocabulary = []
for document in documents_tokenized:
    for word in document:
        vocabulary.append(word)
vocabulary = sorted(set(vocabulary))

In [18]:
vocabulary

['2',
 'a',
 'ala',
 'była',
 'chomika',
 'i',
 'jan',
 'jeździ',
 'konfliktem',
 'kota',
 'lubi',
 'ma',
 'motorze',
 'na',
 'ola',
 'oraz',
 'psa',
 'psy',
 'rowerze',
 'także',
 'tomek',
 'wielkim',
 'wojna',
 'zbrojnym',
 'zwierzęta',
 'światowa']

## 📝 Task **1.1** *(1 pt)*

Write a function `word_to_index(word: str)` that returns the one-hot vector for a given word as a `numpy.array`.

Assume the vocabulary is given by the global variable `vocabulary`.

In [19]:
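def word_to_index(word: str) -> np.ndarray:
    # One-hot vector: all zeros except a 1 at the word's position in `vocabulary`
    vector = np.zeros(len(vocabulary))
    if word in vocabulary:
        index = vocabulary.index(word)
        vector[index] = 1
    return vector  # stays all-zero for out-of-vocabulary words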

In [20]:
word_to_index('psa')

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
       0., 0., 0., 0., 0., 0., 0., 0., 0.])

## 📝 Task **1.2** *(1 pt)*

Write a function that turns a list of words into a TF vector.

In [22]:
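def tf(document: list) -> np.ndarray:
    # Term-frequency vector: entry i counts occurrences of vocabulary[i] in the document
    vector = np.zeros(len(vocabulary))
    for word in document:
        if word in vocabulary:  # out-of-vocabulary words are ignored
            index = vocabulary.index(word)
            vector[index] += 1
    return vector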

In [None]:
tf(documents_tokenized[0])

array([0., 0., 1., 0., 0., 1., 0., 0., 0., 1., 1., 1., 0., 0., 0., 1., 1.,
       0., 0., 0., 0., 0., 0., 0., 1., 0.])
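
`vocabulary.index(word)` scans the list on every call, which gets slow for large vocabularies. A possible alternative, sketched below, precomputes a word-to-position dictionary. The names `toy_vocabulary`, `word_position` and `tf_fast` are hypothetical and introduced only for this illustration.

```python
import numpy as np

# Toy vocabulary for illustration; in the notebook this role is played by `vocabulary`.
toy_vocabulary = ['kota', 'lubi', 'psa']

# Map each word to its position once, so later lookups take constant time.
word_position = {word: i for i, word in enumerate(toy_vocabulary)}

def tf_fast(document: list) -> np.ndarray:
    vector = np.zeros(len(toy_vocabulary))
    for word in document:
        i = word_position.get(word)  # None for out-of-vocabulary words
        if i is not None:
            vector[i] += 1
    return vector

print(tf_fast(['lubi', 'psa', 'psa', 'rower']))
```

The result counts `lubi` once and `psa` twice, and silently skips `rower`, which is not in the toy vocabulary.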

In [24]:
documents_vectorized = list()
for document in documents_tokenized:
    document_vector = tf(document)
    documents_vectorized.append(document_vector)

In [25]:
documents_vectorized

[array([0., 0., 1., 0., 0., 1., 0., 0., 0., 1., 1., 1., 0., 0., 0., 1., 1.,
        0., 0., 0., 0., 0., 0., 0., 1., 0.]),
 array([0., 1., 0., 0., 1., 0., 0., 0., 0., 1., 1., 1., 0., 0., 1., 1., 0.,
        0., 0., 1., 0., 0., 0., 0., 1., 0.]),
 array([0., 0., 0., 0., 0., 1., 1., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0.,
        0., 1., 0., 0., 0., 0., 0., 0., 0.]),
 array([1., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 1., 1., 1., 0., 1.]),
 array([0., 0., 0., 0., 0., 2., 0., 1., 0., 0., 1., 1., 1., 1., 0., 0., 1.,
        1., 1., 0., 1., 0., 0., 0., 0., 0.])]

## IDF

In [26]:
# IDF as the raw ratio N / df: the number of documents divided by
# the number of documents that contain the word
idf = len(documents_vectorized) / np.sum(np.array(documents_vectorized) != 0, axis=0)
display(idf)

array([5.        , 5.        , 5.        , 5.        , 5.        ,
       1.66666667, 5.        , 2.5       , 5.        , 2.5       ,
       1.66666667, 1.66666667, 5.        , 2.5       , 5.        ,
       2.5       , 2.5       , 5.        , 2.5       , 5.        ,
       5.        , 5.        , 5.        , 5.        , 2.5       ,
       5.        ])
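
The cell above uses the raw ratio N / df as the IDF. Many implementations take a logarithm so that rare words do not dominate quite so strongly; scikit-learn's `TfidfVectorizer`, for instance, defaults to a smoothed variant, ln((1 + N) / (1 + df)) + 1. Below is a minimal sketch comparing the two. The `df` values are made up for illustration, chosen to match the document frequencies 1, 2 and 3 out of 5 seen above.

```python
import numpy as np

n_documents = 5
df = np.array([1, 2, 3])  # number of documents containing each word

idf_ratio = n_documents / df  # variant used in this notebook
idf_smooth = np.log((1 + n_documents) / (1 + df)) + 1  # smoothed log variant

print(idf_ratio)
print(idf_smooth)
```

Both variants give rarer words a larger weight, but the spread between rare and common words is much smaller in the logarithmic one.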

## 📝 Task **1.3** *(1 pt)*

Write a function that returns the cosine similarity between two documents given in vectorized form.

In [27]:
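def similarity(query: np.ndarray, document: np.ndarray) -> float:
    # Cosine similarity: dot product divided by the product of the vector lengths
    numerator = np.dot(query, document)
    denominator = np.linalg.norm(query) * np.linalg.norm(document)
    if denominator == 0:  # at least one vector is all zeros
        return 0.0
    return numerator / denominator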

In [28]:
documents[0]

'Ala lubi zwierzęta i ma kota oraz psa!'

In [29]:
documents_vectorized[0]

array([0., 0., 1., 0., 0., 1., 0., 0., 0., 1., 1., 1., 0., 0., 0., 1., 1.,
       0., 0., 0., 0., 0., 0., 0., 1., 0.])

In [30]:
documents[1]

'Ola lubi zwierzęta oraz ma kota a także chomika!'

In [31]:
documents_vectorized[1]

array([0., 1., 0., 0., 1., 0., 0., 0., 0., 1., 1., 1., 0., 0., 1., 1., 0.,
       0., 0., 1., 0., 0., 0., 0., 1., 0.])

In [32]:
similarity(documents_vectorized[0], documents_vectorized[1])

np.float64(0.5892556509887895)
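
The value above can be checked by hand: the two documents share five words (`lubi`, `zwierzęta`, `ma`, `kota`, `oraz`), and they contain 8 and 9 distinct words respectively, each occurring exactly once. A short sketch of that arithmetic:

```python
import math

shared_words = 5  # dot product of two 0/1 vectors = number of shared words
norm_doc0 = math.sqrt(8)  # 8 words, each counted once
norm_doc1 = math.sqrt(9)  # 9 words, each counted once

cosine = shared_words / (norm_doc0 * norm_doc1)
print(cosine)  # approximately 0.589
```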

## A simple search engine

In [33]:
def transform_query(query):
    """Clean and tokenize the query, then turn it into a TF vector."""
    query_vector = tf(tokenize_str(get_str_cleaned(query)))
    return query_vector

In [34]:
similarity(transform_query('psa kota'), documents_vectorized[0])

np.float64(0.4999999999999999)

In [35]:
query = 'psa kota'
for i in range(len(documents)):
    display(documents[i])
    display(similarity(transform_query(query), documents_vectorized[i]))

'Ala lubi zwierzęta i ma kota oraz psa!'

np.float64(0.4999999999999999)

'Ola lubi zwierzęta oraz ma kota a także chomika!'

np.float64(0.2357022603955158)

'I Jan jeździ na rowerze.'

np.float64(0.0)

'2 wojna światowa była wielkim konfliktem zbrojnym'

np.float64(0.0)

'Tomek lubi psy, ma psa  i jeździ na motorze i rowerze.'

np.float64(0.19611613513818402)

In [36]:
# this is why we need the denominator in cosine similarity
query = 'rowerze'
for i in range(len(documents)):
    display(documents[i])
    display(similarity(transform_query(query), documents_vectorized[i]))

'Ala lubi zwierzęta i ma kota oraz psa!'

np.float64(0.0)

'Ola lubi zwierzęta oraz ma kota a także chomika!'

np.float64(0.0)

'I Jan jeździ na rowerze.'

np.float64(0.4472135954999579)

'2 wojna światowa była wielkim konfliktem zbrojnym'

np.float64(0.0)

'Tomek lubi psy, ma psa  i jeździ na motorze i rowerze.'

np.float64(0.2773500981126146)

In [37]:
# this is why we need term frequency: more occurrences mean a better-matching document
query = 'i'
for i in range(len(documents)):
    display(documents[i])
    display(similarity(transform_query(query), documents_vectorized[i]))

'Ala lubi zwierzęta i ma kota oraz psa!'

np.float64(0.35355339059327373)

'Ola lubi zwierzęta oraz ma kota a także chomika!'

np.float64(0.0)

'I Jan jeździ na rowerze.'

np.float64(0.4472135954999579)

'2 wojna światowa była wielkim konfliktem zbrojnym'

np.float64(0.0)

'Tomek lubi psy, ma psa  i jeździ na motorze i rowerze.'

np.float64(0.5547001962252291)

In [38]:
# this is why we need IDF: so that more important words get a larger weight
query = 'i chomika'
for i in range(len(documents)):
    display(documents[i])
    display(similarity(transform_query(query), documents_vectorized[i]))

'Ala lubi zwierzęta i ma kota oraz psa!'

np.float64(0.24999999999999994)

'Ola lubi zwierzęta oraz ma kota a także chomika!'

np.float64(0.2357022603955158)

'I Jan jeździ na rowerze.'

np.float64(0.31622776601683794)

'2 wojna światowa była wielkim konfliktem zbrojnym'

np.float64(0.0)

'Tomek lubi psy, ma psa  i jeździ na motorze i rowerze.'

np.float64(0.39223227027636803)
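
With plain TF vectors the query `'i chomika'` ranks the only document that mentions `chomika` just third, because the common word `i` counts as much as the rare one. The sketch below shows one way to apply the N / df weights to both the query and the documents before comparing them. It is self-contained and rebuilds the toy corpus from the cleaned documents; the names `word_position`, `tf_matrix` and `cosine` are illustrative and not part of the notebook.

```python
import numpy as np

documents_tokenized = [
    'ala lubi zwierzęta i ma kota oraz psa'.split(),
    'ola lubi zwierzęta oraz ma kota a także chomika'.split(),
    'i jan jeździ na rowerze'.split(),
    '2 wojna światowa była wielkim konfliktem zbrojnym'.split(),
    'tomek lubi psy ma psa i jeździ na motorze i rowerze'.split(),
]
vocabulary = sorted({w for doc in documents_tokenized for w in doc})
word_position = {w: i for i, w in enumerate(vocabulary)}

def tf(tokens):
    vector = np.zeros(len(vocabulary))
    for token in tokens:
        if token in word_position:
            vector[word_position[token]] += 1
    return vector

tf_matrix = np.array([tf(doc) for doc in documents_tokenized])
# Same IDF as in the notebook: N / df
idf = len(documents_tokenized) / np.sum(tf_matrix != 0, axis=0)

def cosine(a, b):
    denominator = np.linalg.norm(a) * np.linalg.norm(b)
    return 0.0 if denominator == 0 else float(a @ b / denominator)

# Weight the query and every document by IDF before comparing them
query = tf('i chomika'.split()) * idf
scores = [cosine(query, row * idf) for row in tf_matrix]
for doc, score in zip(documents_tokenized, scores):
    print(f'{score:.3f}', ' '.join(doc))
```

Under these assumptions the document containing `chomika` receives the highest score, while the documents that only share the word `i` with the query drop well below it.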

## Libraries

In [41]:
!pip install scikit-learn
import numpy as np
import sklearn.metrics

from sklearn.datasets import fetch_20newsgroups

from sklearn.feature_extraction.text import TfidfVectorizer


In [43]:
newsgroups = fetch_20newsgroups()['data']

In [44]:
len(newsgroups)

11314

In [45]:
print(newsgroups[0])

From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----







### Naive search

In [46]:
all_documents = list() 
for document in newsgroups:
    if 'car' in document:
        all_documents.append(document)

In [47]:
print(all_documents[0])

From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----







In [48]:
print(all_documents[1])

From: guykuo@carson.u.washington.edu (Guy Kuo)
Subject: SI Clock Poll - Final Call
Summary: Final call for SI clock reports
Keywords: SI,acceleration,clock,upgrade
Article-I.D.: shelley.1qvfo9INNc3s
Organization: University of Washington
Lines: 11
NNTP-Posting-Host: carson.u.washington.edu

A fair number of brave souls who upgraded their SI clock oscillator have
shared their experiences for this poll. Please send a brief message detailing
your experiences with the procedure. Top speed attained, CPU rated speed,
add on cards and adapters, heat sinks, hour of usage per day, floppy disk
functionality with 800 and 1.4 m floppies are especially requested.

I will be summarizing in the next two days, so please add to the network
knowledge base if you have done the clock upgrade and haven't answered this
poll. Thanks.

Guy Kuo <guykuo@u.washington.edu>



#### ❔ Question

What are the problems with this approach?
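
One way to see the problem: a substring test also matches words that merely contain the query, and it is case-sensitive. The second hit above, for example, contains `carson` and `cards` but never the word `car` itself. Here is a minimal sketch with made-up strings (the `texts` list and the word-boundary regular expression are assumptions for illustration):

```python
import re

texts = [
    'WHAT car is this!?',
    'add on cards and adapters',
    'Car for sale',
]

# Substring test: matches "cards" and misses the capitalised "Car".
substring_hits = [t for t in texts if 'car' in t]

# Whole-word, case-insensitive match using word boundaries.
pattern = re.compile(r'\bcar\b', re.IGNORECASE)
word_hits = [t for t in texts if pattern.search(t)]

print(substring_hits)
print(word_hits)
```

Even the whole-word version only filters documents; it does not say which of the matching documents is the most relevant, which is what the TF–IDF ranking in the next section provides.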

### TF–IDF and cosine similarity

In [49]:
vectorizer = TfidfVectorizer()

In [50]:
document_vectors = vectorizer.fit_transform(newsgroups)

In [51]:
document_vectors

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 1787565 stored elements and shape (11314, 130107)>

In [52]:
document_vectors[0]

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 89 stored elements and shape (1, 130107)>

In [53]:
document_vectors[0].todense()

matrix([[0., 0., 0., ..., 0., 0., 0.]], shape=(1, 130107))

In [54]:
document_vectors[0:4].todense()

matrix([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]], shape=(4, 130107))

In [55]:
query_str = 'speed'
#query_str = 'speed car'
#query_str = 'spider man'

In [56]:
query_vector = vectorizer.transform([query_str])
similarities = sklearn.metrics.pairwise.cosine_similarity(query_vector, document_vectors)
print(np.sort(similarities)[0][-4:])
print(similarities.argsort()[0][-4:])

# Sort once; in ascending order the best match sits at the end
order = similarities.argsort()[0]
for i in range(1, 5):
    print(newsgroups[order[-i]])
    print(similarities[0, order[-i]])
    print('-'*100)
    print('-'*100)
    print('-'*100)

[0.26949927 0.3491801  0.44292083 0.47784165]
[4517 5509 2116 9921]
From: ray@netcom.com (Ray Fischer)
Subject: Re: x86 ~= 680x0 ??  (How do they compare?)
Organization: Netcom. San Jose, California
Distribution: usa
Lines: 36

dhk@ubbpc.uucp (Dave Kitabjian) writes ...
>I'm sure Intel and Motorola are competing neck-and-neck for 
>crunch-power, but for a given clock speed, how do we rank the
>following (from 1st to 6th):
>  486		68040
>  386		68030
>  286		68020

040 486 030 386 020 286

>While you're at it, where will the following fit into the list:
>  68060
>  Pentium
>  PowerPC

060 fastest, then Pentium, with the first versions of the PowerPC
somewhere in the vicinity.

>And about clock speed:  Does doubling the clock speed double the
>overall processor speed?  And fill in the __'s below:
>  68030 @ __ MHz = 68040 @ __ MHz

No.  Computer speed is only partly dependent of processor/clock speed.
Memory system speed play a large role as does video system speed and
I/O speed.  As pro

## 📝 Task **1.4** *(4 pts)*

Choose a text collection with at least 10,000 documents (different from the one used in this example).
Based on it, build a search engine that uses TF–IDF and cosine similarity to score how similar documents are. The search engine should return several of the best-matching documents, sorted, together with their scores.

In [None]:
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

dataset = load_dataset("imdb", split="train")
texts = [item["text"] for item in dataset][:10000]  # first 10k documents

vectorizer = TfidfVectorizer(stop_words='english', max_features=10000)
document_vectors = vectorizer.fit_transform(texts)

def search(query, top_n=5):
    query_vec = vectorizer.transform([query])
    sims = cosine_similarity(query_vec, document_vectors).flatten()
    top_indices = sims.argsort()[-top_n:][::-1]
    return [(i, texts[i], sims[i]) for i in top_indices]

for idx, doc, score in search("great movie about friendship"):
    print(f"[{score:.4f}] {doc[:200]}...\n")



[0.3471] Wow this movie sucked big time. I heard this movie expresses the meaning of friendship very well. And with all the internet hype on this movie I figured what could go wrong? However the movie was just...

[0.2252] I ended up watching The Tenants with my close friends who rented the movie solely based on Snoop Dogg's appearance (a passionate fetish of theirs) on the cover. Understandably, I did not expect much. ...

[0.2061] A poorly written script with no likeable characters. As for it being a comedy, I forgot to laugh. It's about 2 conceited friends who scam to get women in too bed with them (no sex scenes) and another ...

[0.1950] I like movies about morally corrupt characters, but this was too much. The acting wasn't great, but that wasn't the real problem. The issue was the sinking feeling I got in the pit of my stomach about...

[0.1850] The film is about a young man, Michael, who cares for the elderly. One day he decides to kill some of the relatives of his clients. Aro