# LV1 – Text Processing and Part-of-Speech (POS) Tagging
### Laboratory Exercise 1
**Topic:** Fundamentals of natural language processing with the spaCy and NLTK libraries

This notebook contains a theoretical introduction, the basic text-processing steps, and tasks for independent work. Students may choose either *spaCy* or *NLTK* when solving the tasks.

## Exercise objectives
- Become familiar with the basic steps of natural language processing (NLP).
- Apply the **spaCy** and **NLTK** libraries to text processing.
- Understand and implement tokenization, stop-word removal, lemmatization, and POS tagging.
- Develop the ability to analyze and interpret the results of text processing.

## 1. Installing the required libraries

In [1]:
!pip install spacy nltk matplotlib pandas

!python -m spacy download en_core_web_sm

import spacy
from spacy.lang.en.stop_words import STOP_WORDS

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag

from collections import Counter
import pandas as pd
import matplotlib.pyplot as plt


## 2. Tokenization
**Description:** Tokenization is the process of splitting text into smaller units called tokens (words, punctuation marks, and so on).

Two approaches are shown below: one with *spaCy* and one with *NLTK*.

In [2]:
nlp = spacy.load('en_core_web_sm')
text = 'Natural Language Processing enables computers to understand human language.'
doc = nlp(text)
for token in doc:
    print(token.text)

Natural
Language
Processing
enables
computers
to
understand
human
language
.


In [3]:
nltk.download('punkt')
nltk.download('punkt_tab')
text = 'Natural Language Processing enables computers to understand human language.'
tokens = word_tokenize(text)
print(tokens)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.

['Natural', 'Language', 'Processing', 'enables', 'computers', 'to', 'understand', 'human', 'language', '.']
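As a rough illustration of what both tokenizers produce for this simple sentence, the sketch below splits text with a single regular expression. The pattern and the `simple_tokenize` name are assumptions made for illustration; neither spaCy nor NLTK works this way internally, and the regex will diverge from them on contractions, abbreviations, and URLs.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+ matches a word; [^\w\s] matches one non-word, non-space character
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "Natural Language Processing enables computers to understand human language."
print(simple_tokenize(sentence))
```

On harder input the naive pattern breaks down quickly: `simple_tokenize("don't")` yields three pieces (`don`, `'`, `t`), which is one reason to prefer the library tokenizers.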


### Task 1
Enter your own text and tokenize it with both libraries.

In [33]:
my_text = (
    """
The team played an intense match last night, delivering one of their strongest performances this season.
Throughout the match, the team demonstrated exceptional teamwork, discipline, and determination.
The coach repeatedly emphasized how important teamwork was for maintaining control during the most difficult moments of the match.
Several players mentioned that the team had trained specifically to improve their teamwork and communication, which clearly paid off.
Fans celebrated loudly after the match, recognizing that the team’s victory was crucial for improving their position in the championship.
During the press conference, the coach praised the players for their strategy, dedication, and ability to adapt as the match progressed.
He highlighted that every victory strengthens the team’s confidence and prepares them for future challenges.
The upcoming match is even more important, as the team is competing for a spot in the finals.
Analysts agree that if the team continues to show this level of teamwork and discipline, they have a strong chance of winning the entire championship.
In the end, the team proved that success is not just about individual talent but about unity, effort, and the shared goal of winning the championship.
"""
)

print("Original text:")
print(my_text)
print("\n" + "-"*60 + "\n")

print("Tokenization with spaCy:")
doc_custom = nlp(my_text)
spacy_tokens = [token.text for token in doc_custom]
print(spacy_tokens)

print("\n" + "-"*60 + "\n")

print("Tokenization with NLTK:")
nltk_tokens = word_tokenize(my_text)
print(nltk_tokens)

print("\nObserved differences:")
print("spaCy keeps the newline character (\\n) as a separate whitespace token.")
print("NLTK's word_tokenize function is simpler: it drops whitespace and focuses on separating words and punctuation.")


Original text:

The team played an intense match last night, delivering one of their strongest performances this season.
Throughout the match, the team demonstrated exceptional teamwork, discipline, and determination.
The coach repeatedly emphasized how important teamwork was for maintaining control during the most difficult moments of the match.
Several players mentioned that the team had trained specifically to improve their teamwork and communication, which clearly paid off.
Fans celebrated loudly after the match, recognizing that the team’s victory was crucial for improving their position in the championship.
During the press conference, the coach praised the players for their strategy, dedication, and ability to adapt as the match progressed.
He highlighted that every victory strengthens the team’s confidence and prepares them for future challenges.
The upcoming match is even more important, as the team is competing for a spot in the finals.
Analysts agree that if the team continu

## 3. Stop-word removal
Stop words are very frequent words that carry little meaning on their own (e.g. the, is, in).

In [4]:
doc = nlp(text)
filtered_spacy = [token.text for token in doc if not token.is_stop]
print(filtered_spacy)

['Natural', 'Language', 'Processing', 'enables', 'computers', 'understand', 'human', 'language', '.']


In [5]:
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
filtered_nltk = [word for word in tokens if word.lower() not in stop_words]
print(filtered_nltk)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.

['Natural', 'Language', 'Processing', 'enables', 'computers', 'understand', 'human', 'language', '.']
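The NLTK cell lowercases each token before the lookup because the NLTK stop list is stored in lowercase. The sketch below shows why that matters, using a tiny hypothetical stop list (`STOP`) and a helper name (`remove_stop_words`) that are assumptions for this example only.

```python
# Hypothetical miniature stop list; real lists contain hundreds of entries.
STOP = {"the", "is", "in", "to"}

def remove_stop_words(tokens, lowercase=True):
    """Drop tokens found in STOP, optionally comparing case-insensitively."""
    kept = []
    for tok in tokens:
        key = tok.lower() if lowercase else tok
        if key not in STOP:
            kept.append(tok)
    return kept

tokens = ["The", "team", "is", "in", "the", "finals"]
print(remove_stop_words(tokens))                   # case-insensitive comparison
print(remove_stop_words(tokens, lowercase=False))  # "The" slips through
```

Without lowercasing, a sentence-initial "The" survives the filter, which would distort any frequency count computed afterwards.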


### Task 2
Remove the stop words from your own text with both libraries.

In [34]:
print("Original text:")
print(my_text)
print("\n" + "-"*60 + "\n")

print("Stop-word removal with spaCy:")
doc_custom = nlp(my_text)
spacy_filtered = [token.text for token in doc_custom if not token.is_stop]
print(spacy_filtered)

print("\n" + "-"*60 + "\n")

print("Stop-word removal with NLTK:")
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

nltk_filtered = [w for w in nltk_tokens if w.lower() not in stop_words]
print(nltk_filtered)

print("\nNote:")
print("- spaCy marks stop words via token.is_stop, based on the stop list bundled with its English language data.")
print("- The two lists differ: spaCy's default English list is the larger one, so it typically removes more words than NLTK.")

Original text:

The team played an intense match last night, delivering one of their strongest performances this season.
Throughout the match, the team demonstrated exceptional teamwork, discipline, and determination.
The coach repeatedly emphasized how important teamwork was for maintaining control during the most difficult moments of the match.
Several players mentioned that the team had trained specifically to improve their teamwork and communication, which clearly paid off.
Fans celebrated loudly after the match, recognizing that the team’s victory was crucial for improving their position in the championship.
During the press conference, the coach praised the players for their strategy, dedication, and ability to adapt as the match progressed.
He highlighted that every victory strengthens the team’s confidence and prepares them for future challenges.
The upcoming match is even more important, as the team is competing for a spot in the finals.
Analysts agree that if the team continu

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## 4. Lemmatization
Lemmatization reduces words to their base (dictionary) form, the lemma.

In [35]:
# Example: lemmatization with spaCy
for token in doc:
    print(f'{token.text:15} → {token.lemma_}')

Natural         → Natural
Language        → Language
Processing      → processing
enables         → enable
computers       → computer
to              → to
understand      → understand
human           → human
language        → language
.               → .


In [7]:
# Example: lemmatization with NLTK
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
pos_tags = pos_tag(tokens)
def get_wordnet_pos(tag):
    if tag.startswith('J'): return wordnet.ADJ
    elif tag.startswith('V'): return wordnet.VERB
    elif tag.startswith('N'): return wordnet.NOUN
    elif tag.startswith('R'): return wordnet.ADV
    else: return wordnet.NOUN
lemmas = [lemmatizer.lemmatize(word, get_wordnet_pos(tag)) for word, tag in pos_tags]
print(lemmas)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


['Natural', 'Language', 'Processing', 'enable', 'computer', 'to', 'understand', 'human', 'language', '.']
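The NLTK lemmatizer needs a WordNet part of speech for every word, and the POS argument determines which inflection rules are tried. The mapping in `get_wordnet_pos` can be sketched without NLTK, since the WordNet constants are plain one-letter strings. The dictionary and the `penn_to_wordnet` name below are illustrative assumptions, not part of NLTK.

```python
# WordNet POS constants are one-letter strings in NLTK:
# wordnet.ADJ == 'a', wordnet.VERB == 'v', wordnet.NOUN == 'n', wordnet.ADV == 'r'
PENN_PREFIX_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}

def penn_to_wordnet(tag, default="n"):
    """Map a Penn Treebank tag to a WordNet POS letter via its first character."""
    return PENN_PREFIX_TO_WORDNET.get(tag[:1], default)

for tag in ["JJ", "VBZ", "NNS", "RB", "TO", "."]:
    print(f"{tag:4} -> {penn_to_wordnet(tag)}")
```

Tags with no WordNet counterpart (such as `TO` or punctuation) fall back to the noun category, which mirrors the default branch of the function in the cell above.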


### Task 3
Apply lemmatization to your own text and compare the spaCy and NLTK results.

In [36]:

print("Original text:")
print(my_text)
print("\n" + "-"*60 + "\n")

print("Lemmatization with spaCy (token -> lemma):")
doc_custom = nlp(my_text)
spacy_lemmas = [(token.text, token.lemma_) for token in doc_custom]
for tok, lem in spacy_lemmas:
    print(f"{tok:15s} -> {lem}")

print("\n" + "-"*60 + "\n")

print("Lemmatization with NLTK (token -> lemma):")

import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger')

from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk import pos_tag

lemmatizer = WordNetLemmatizer()

def get_wordnet_pos(tag):
    """Map an NLTK (Penn Treebank) POS tag to a WordNet POS constant."""
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN  # default

# POS tags for the NLTK tokens (from Task 1)
nltk_pos = pos_tag(nltk_tokens)

nltk_lemmas = []
for word, pos in nltk_pos:
    wn_pos = get_wordnet_pos(pos)
    lemma = lemmatizer.lemmatize(word, wn_pos)
    nltk_lemmas.append((word, lemma))
    print(f"{word:15s} -> {lemma}")

print("\n" + "-"*60 + "\n")
print("Short comparison:")
print("- spaCy uses the built-in lemmatizer that ships with its language model.")
print("- NLTK relies on WordNet, and accuracy depends on the POS tag we pass for each word.")


Original text:

The team played an intense match last night, delivering one of their strongest performances this season.
Throughout the match, the team demonstrated exceptional teamwork, discipline, and determination.
The coach repeatedly emphasized how important teamwork was for maintaining control during the most difficult moments of the match.
Several players mentioned that the team had trained specifically to improve their teamwork and communication, which clearly paid off.
Fans celebrated loudly after the match, recognizing that the team’s victory was crucial for improving their position in the championship.
During the press conference, the coach praised the players for their strategy, dedication, and ability to adapt as the match progressed.
He highlighted that every victory strengthens the team’s confidence and prepares them for future challenges.
The upcoming match is even more important, as the team is competing for a spot in the finals.
Analysts agree that if the team continu

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


## 5. Part-of-Speech (POS) tagging
POS tagging assigns a grammatical category to every word (noun, verb, adjective, adverb, and so on).

In [8]:
for token in doc:
    print(f'{token.text:15} → {token.pos_:6} ({token.tag_})')

Natural         → PROPN  (NNP)
Language        → PROPN  (NNP)
Processing      → NOUN   (NN)
enables         → VERB   (VBZ)
computers       → NOUN   (NNS)
to              → PART   (TO)
understand      → VERB   (VB)
human           → ADJ    (JJ)
language        → NOUN   (NN)
.               → PUNCT  (.)


In [9]:
pos_tags = pos_tag(tokens)
for word, tag in pos_tags:
    print(f'{word:15} → {tag}')

Natural         → JJ
Language        → NNP
Processing      → NNP
enables         → VBZ
computers       → NNS
to              → TO
understand      → VB
human           → JJ
language        → NN
.               → .
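The two taggers do not always agree. The sketch below compares the fine-grained tags from the two outputs above (spaCy's `tag_` column and NLTK's `pos_tag` result) and lists the tokens where they differ. The tag lists are copied by hand from those outputs; the variable names are chosen for this example.

```python
# Fine-grained tags copied from the two outputs above.
tokens     = ["Natural", "Language", "Processing", "enables", "computers",
              "to", "understand", "human", "language", "."]
spacy_tags = ["NNP", "NNP", "NN", "VBZ", "NNS", "TO", "VB", "JJ", "NN", "."]
nltk_tags  = ["JJ", "NNP", "NNP", "VBZ", "NNS", "TO", "VB", "JJ", "NN", "."]

# Keep only the positions where the two taggers disagree.
disagreements = [
    (tok, s, n)
    for tok, s, n in zip(tokens, spacy_tags, nltk_tags)
    if s != n
]

for tok, s, n in disagreements:
    print(f"{tok:12} spaCy={s:4} NLTK={n}")

agreement = 1 - len(disagreements) / len(tokens)
print(f"Agreement: {agreement:.0%}")
```

Both disagreements fall on the capitalized phrase "Natural Language Processing", where it is genuinely ambiguous whether the words form a proper name.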


### Task 4
Extract all nouns and verbs from your text using one of the libraries.

In [37]:

print("Original text:")
print(my_text)
print("\n" + "-"*60 + "\n")

doc_custom = nlp(my_text)

nouns = [token.text for token in doc_custom if token.pos_ == "NOUN"]
verbs = [token.text for token in doc_custom if token.pos_ == "VERB"]

print("Nouns in the text:")
print(nouns)

print("\nVerbs in the text:")
print(verbs)

print("\nNote:")
print("- spaCy assigns POS tags using its language model.")
print("- The result depends on the sentence context and on the model itself (en_core_web_sm).")


Original text:

The team played an intense match last night, delivering one of their strongest performances this season.
Throughout the match, the team demonstrated exceptional teamwork, discipline, and determination.
The coach repeatedly emphasized how important teamwork was for maintaining control during the most difficult moments of the match.
Several players mentioned that the team had trained specifically to improve their teamwork and communication, which clearly paid off.
Fans celebrated loudly after the match, recognizing that the team’s victory was crucial for improving their position in the championship.
During the press conference, the coach praised the players for their strategy, dedication, and ability to adapt as the match progressed.
He highlighted that every victory strengthens the team’s confidence and prepares them for future challenges.
The upcoming match is even more important, as the team is competing for a spot in the finals.
Analysts agree that if the team continu

## 6. Tasks

## Task 1: Compare two texts by word frequency

**Description:**  
Analyze two different texts (e.g. one about sport, the other about technology).  
After tokenization, stop-word removal, and lemmatization:  
- find the 5 most frequent nouns in each text,  
- compare the two noun lists,  
- conclude what each text is about.

**Goal:**  
Understand how word-frequency analysis can be used to recognize the topic of a text.

**Instructions:**  
1. Load two different texts (two sentences, two paragraphs, or two files).  
2. Process each text (tokenization → cleaning → lemmatization → POS tagging).  
3. Keep only the words tagged as NOUN.  
4. Count the occurrences and show the 5 most frequent.  
5. Conclude what the topic of each text is.

In [38]:
text_sport = """
The team played an intense match last night, delivering one of their strongest performances this season.
Throughout the match, the team demonstrated exceptional teamwork, discipline, and determination.
The coach repeatedly emphasized how important teamwork was for maintaining control during the most difficult moments of the match.
Several players mentioned that the team had trained specifically to improve their teamwork and communication, which clearly paid off.
Fans celebrated loudly after the match, recognizing that the team’s victory was crucial for improving their position in the championship.
During the press conference, the coach praised the players for their strategy, dedication, and ability to adapt as the match progressed.
He highlighted that every victory strengthens the team’s confidence and prepares them for future challenges.
The upcoming match is even more important, as the team is competing for a spot in the finals.
Analysts agree that if the team continues to show this level of teamwork and discipline, they have a strong chance of winning the entire championship.
In the end, the team proved that success is not just about individual talent but about unity, effort, and the shared goal of winning the championship.
"""

text_tech = """
Modern technology is evolving rapidly, shaping the way people work, communicate, and solve complex problems.
New devices and software are developed every year, pushing the boundaries of what modern technology can achieve.
Researchers are focusing heavily on artificial intelligence, automation, and advanced data processing to create smarter and more powerful systems.
These innovations enable companies to build faster devices, more secure software, and highly efficient solutions for everyday use.
Experts believe that artificial intelligence will continue to transform technology by improving decision-making, optimizing workflows, and predicting user needs.
Many companies are investing in automation technologies to reduce costs, increase productivity, and eliminate repetitive tasks.
At the same time, advancements in data processing make it possible to analyze enormous datasets and identify patterns that were previously impossible to detect.
This combination of artificial intelligence, automation, and data processing is driving a new era of modern technology.
If current trends continue, technology will become even more integrated into daily life, offering smarter devices, adaptive software, and personalized solutions.
Researchers conclude that the future of modern technology depends on continuous innovation, reliable data processing, and the responsible development of artificial intelligence.
"""

In [39]:

from collections import Counter
import re

def preprocess(t):
    t = t.lower()
    t = re.sub(r"[^a-zA-Z]+", " ", t)
    tokens = t.split()
    return tokens

tokens1 = preprocess(text_sport)
tokens2 = preprocess(text_tech)

# 3. Count frequencies
freq1 = Counter(tokens1)
freq2 = Counter(tokens2)

print("Most frequent words in text 1:")
print(freq1.most_common(10))

print("\nMost frequent words in text 2:")
print(freq2.most_common(10))

# 4. Compare the words shared by both texts
common = set(freq1.keys()) & set(freq2.keys())

print("\nShared words and their frequencies:")
for w in common:
    print(f"{w:12s}  T1: {freq1[w]}   T2: {freq2[w]}")

print("\nNote:")
print("- Frequency helps identify the thematic core of each text.")
print("- The words with the highest frequencies suggest the main topic and focus.")


Most frequent words in text 1:
[('the', 24), ('team', 8), ('match', 6), ('and', 6), ('of', 5), ('for', 5), ('that', 5), ('their', 4), ('teamwork', 4), ('to', 3)]

Most frequent words in text 2:
[('and', 11), ('technology', 6), ('to', 6), ('the', 5), ('of', 5), ('modern', 4), ('artificial', 4), ('intelligence', 4), ('data', 4), ('processing', 4)]

Shared words and their frequencies:
to            T1: 3   T2: 6
of            T1: 5   T2: 5
even          T1: 1   T2: 1
every         T1: 1   T2: 1
for           T1: 5   T2: 1
is            T1: 3   T2: 2
this          T1: 2   T2: 1
in            T1: 3   T2: 2
that          T1: 5   T2: 3
the           T1: 24   T2: 5
improving     T1: 1   T2: 1
more          T1: 1   T2: 3
and           T1: 6   T2: 11
if            T1: 1   T2: 1
a             T1: 2   T2: 1
future        T1: 1   T2: 1

Note:
- Frequency helps identify the thematic core of each text.
- The words with the highest frequencies suggest the main topic and focus.
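The raw counts above are dominated by function words such as "the" and "and", and almost every shared word is a stop word, because this solution skips the stop-word and POS steps listed in the instructions. A minimal sketch of the missing filtering step, applied to the top-10 counts of text 1, might look as follows. The `STOP` set is a hypothetical miniature list that covers only this example; a full solution would use spaCy or NLTK for stop words and POS tags.

```python
from collections import Counter

# Top-10 counts for text 1, copied from the output above.
freq1 = Counter({"the": 24, "team": 8, "match": 6, "and": 6, "of": 5,
                 "for": 5, "that": 5, "their": 4, "teamwork": 4, "to": 3})

# Hypothetical miniature stop list, just large enough for this example.
STOP = {"the", "and", "of", "for", "that", "their", "to"}

# Keep only the content words and re-rank them.
content_freq = Counter({w: c for w, c in freq1.items() if w not in STOP})
print(content_freq.most_common(3))
```

After filtering, only topical nouns remain at the top, which makes the sport theme visible at a glance.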


## Task 2: Tone analysis (positive vs. negative)

**Description:**  
The task is to carry out a basic sentiment analysis.  
Process several short reviews (e.g. of films, products, or restaurants) and determine whether each one is positive or negative.

**Goal:**  
Show how basic NLP tools can be used to analyze sentiment in text.

**Instructions:**  
1. Prepare the word lists:  
   - positive: `["good", "great", "excellent", "amazing", "nice", "wonderful"]`  
   - negative: `["bad", "poor", "terrible", "boring", "awful", "disappointing"]`  
2. For each review:  
   - clean the text (remove stop words, lemmatize),  
   - count how many positive and negative words it contains.  
3. Based on the result, decide the tone of each review.  
4. (Optional) Show the results in a table or chart.

In [40]:

import re
from collections import Counter

text = """
I really love how this project turned out. The results are amazing and wonderful,
but the process was sometimes bad and terrible, with a few horrible moments.
"""

positive_words = ["good", "great", "excellent", "happy", "love", "amazing", "nice", "wonderful"]
negative_words = ["bad", "terrible", "awful", "sad", "hate", "horrible", "disgusting", "poor"]

def preprocess(t):
    t = t.lower()
    t = re.sub(r"[^a-zA-Z]+", " ", t)
    tokens = t.split()
    return tokens

tokens = preprocess(text)
freq = Counter(tokens)

pos_count = sum(freq[w] for w in positive_words if w in freq)
neg_count = sum(freq[w] for w in negative_words if w in freq)

print("Total words in the text:", len(tokens))
print("Positive words:", pos_count)
print("Negative words:", neg_count)
print()

print("Positive words that occurred:")
for w in positive_words:
    if w in freq:
        print(f"{w:12s} -> {freq[w]}")

print("\nNegative words that occurred:")
for w in negative_words:
    if w in freq:
        print(f"{w:12s} -> {freq[w]}")

print("\nEstimated tone of the text:")
if pos_count > neg_count:
    print("The text is predominantly POSITIVE.")
elif neg_count > pos_count:
    print("The text is predominantly NEGATIVE.")
else:
    print("The text is NEUTRAL or balanced between positive and negative.")


Total words in the text: 27
Positive words: 3
Negative words: 3

Positive words that occurred:
love         -> 1
amazing      -> 1
wonderful    -> 1

Negative words that occurred:
bad          -> 1
terrible     -> 1
horrible     -> 1

Estimated tone of the text:
The text is NEUTRAL or balanced between positive and negative.
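The cell above scores a single text, while the task asks for several short reviews. One way to extend the same counting idea is sketched below, using the word lists from the task instructions. The three reviews and the `classify` helper are invented for illustration.

```python
import re

POSITIVE = {"good", "great", "excellent", "amazing", "nice", "wonderful"}
NEGATIVE = {"bad", "poor", "terrible", "boring", "awful", "disappointing"}

def classify(review):
    """Return (label, positive_count, negative_count) for one review."""
    words = re.findall(r"[a-z]+", review.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos > neg:
        label = "positive"
    elif neg > pos:
        label = "negative"
    else:
        label = "neutral"
    return label, pos, neg

# Hypothetical reviews, invented for this sketch.
reviews = [
    "Great film with an excellent cast and a wonderful ending.",
    "The food was awful and the service was poor.",
    "Nice location, but the room was boring.",
]

for r in reviews:
    label, pos, neg = classify(r)
    print(f"{label:8} (+{pos} / -{neg})  {r}")
```

A pure word count ignores negation, so a phrase like "not good" would still count as positive; that is a known limit of lexicon-based scoring.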


## Task 3: Clean up the mess / find the fake words

**Description:**  
You are given a text that contains made-up words or "noise".  
The task is to find the words that the language model does not recognize (*out-of-vocabulary words*).

**Goal:**  
Understand how a model distinguishes known from unknown words, and how this can help detect errors in text.

**Instructions:**  
1. Enter a text that contains nonsense words (e.g. "The data blorp is analyzed using great accuracy flom.").  
2. Tokenize the text with a spaCy model.  
3. Check each word with `token.is_oov`; if it returns `True`, the word is not recognized.  
4. Print the list of "unknown" words.  
5. (Optional) Clean the text by removing those words.

**Text:**

> In the future, artificel intellgence will revolutionize the way we interract with technolodgy.  
> Peaple might use smart assistents not only for work but also for personal healtcare and educattion.  
> Yet, as systems become more compicated, ensuring data privasy and securrity will be crucial.  
> The recent blonix project already shows how mashine learning can adapt to dynamic enviroments.

In [44]:
text33 = """
In the future, artificel intellgence will revolutionize the way we interract with technolodgy.
Peaple might use smart assistents not only for work but also for personal healtcare and educattion.
Yet, as systems become more compicated, ensuring data privasy and securrity will be crucial.
The recent blonix project already shows how mashine learning can adapt to dynamic enviroments.
"""

In [46]:


import re
import nltk
from nltk.corpus import words

nltk.download('words')

english_words = set(w.lower() for w in words.words())

def preprocess(t):
    t = t.lower()
    t = re.sub(r"[^a-z\s]+", " ", t)
    tokens = t.split()
    return tokens

tokens = preprocess(text33)

real_words = []
fake_words = []

for tok in tokens:
    if tok in english_words:
        real_words.append(tok)
    else:
        fake_words.append(tok)

print("All tokens:")
print(tokens)

print("\nReal words (found in the NLTK word list):")
print(sorted(set(real_words)))

print("\nSuspicious / fake words (not in the word list):")
print(sorted(set(fake_words)))

print("\nNote:")
print("- 'Fake' words may also be typos or rare / inflected words.")
print("- This is a rough heuristic, but it illustrates the idea of text cleaning well.")


[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!


All tokens:
['in', 'the', 'future', 'artificel', 'intellgence', 'will', 'revolutionize', 'the', 'way', 'we', 'interract', 'with', 'technolodgy', 'peaple', 'might', 'use', 'smart', 'assistents', 'not', 'only', 'for', 'work', 'but', 'also', 'for', 'personal', 'healtcare', 'and', 'educattion', 'yet', 'as', 'systems', 'become', 'more', 'compicated', 'ensuring', 'data', 'privasy', 'and', 'securrity', 'will', 'be', 'crucial', 'the', 'recent', 'blonix', 'project', 'already', 'shows', 'how', 'mashine', 'learning', 'can', 'adapt', 'to', 'dynamic', 'enviroments']

Real words (found in the NLTK word list):
['adapt', 'already', 'also', 'and', 'as', 'be', 'become', 'but', 'can', 'crucial', 'data', 'dynamic', 'for', 'future', 'how', 'in', 'learning', 'might', 'more', 'not', 'only', 'personal', 'project', 'recent', 'revolutionize', 'smart', 'the', 'to', 'use', 'way', 'we', 'will', 'with', 'work', 'yet']

Suspicious / fake words (not in the word list):
['artificel', 'assistents', 'blonix', 'compicated', 
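The solution above uses a dictionary lookup instead of `token.is_oov`. Note that `is_oov` reflects whether a token has a word vector, so with a small model that ships no vectors it may flag every token. Once suspicious words are found, a natural next step is to separate likely typos from invented words. The sketch below does this with the standard-library `difflib` module; the `VOCAB` list and the `suggest` helper are assumptions for illustration, and a real check would use a full word list.

```python
import difflib

# Hypothetical mini-vocabulary; in practice this would be a full word list.
VOCAB = ["artificial", "intelligence", "technology", "people",
         "privacy", "security", "machine"]

def suggest(word, vocab=VOCAB, cutoff=0.6):
    """Return the closest vocabulary word, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(word, vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else None

for w in ["artificel", "privasy", "mashine", "blonix"]:
    print(f"{w:10} -> {suggest(w)}")
```

Words that receive a suggestion are probably misspellings, while a word with no close match (here "blonix") is more likely invented or a proper name.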

## Task 4: Who is talking about what?

**Description:**  
You have three texts from different domains (e.g. politics, sport, science).  
After processing, identify which topic each text belongs to, using its most frequent words.

**Goal:**  
Connect statistical word analysis with topic recognition, which is the basis for automatic document classification.

**Instructions:**  
1. Prepare three texts on different topics.  
2. Run each text through the full NLP pipeline.  
3. Extract the 5 most frequent nouns and verbs.  
4. Based on those words, try to conclude what the text is about.  
5. (Optional) Create a simple chart that shows the differences between the texts.

In [47]:
text_1 = """
The government has introduced a series of new reforms designed to improve economic stability and strengthen national policy.
According to officials, the government believes these reforms will help reduce inflation and increase trust in public institutions.
During a press conference, government representatives explained that the policy focuses on long-term economic growth, responsible budgeting, and transparent decision-making.
Opposition leaders criticized the government, arguing that the reforms do not address the root causes of inflation and may place additional pressure on the middle-class population.
Despite the criticism, the prime minister emphasized that the government must take decisive action to protect the economy.
He stated that the policy is essential for maintaining stability, supporting national programs, and ensuring that citizens benefit from a more resilient economic system.
The government also announced consultations with economic experts to refine the policy and monitor inflation trends.
Overall, the government insists that the reforms represent a necessary step toward financial responsibility and sustainable development.
"""

text_2 = """
The team delivered an outstanding performance last night, playing one of the most intense matches of the season.
Throughout the match, the team showed great determination, teamwork, and discipline.
The coach praised the team for maintaining focus and adapting their strategy as the match progressed.
Fans celebrated loudly, recognizing that the team’s victory was crucial for securing their position in the championship rankings.
During the post-match interview, the coach highlighted how preparation and teamwork were essential for winning such a competitive match.
Several players said that the team felt more united than ever, and that their teamwork was the key factor in overcoming the toughest opponents.
The next match will be even more important, as the team aims to qualify for the finals.
If the team continues to play with this level of teamwork and discipline, they have a strong chance of winning the entire championship.
"""

text_3 = """
Researchers at the university have developed a new material that significantly improves energy storage efficiency.
The material was tested under various laboratory conditions, and researchers observed that the material maintained its structure even when exposed to high temperatures.
According to the study, the material could transform the future of renewable energy by enabling more stable and long-lasting storage systems.
Scientists believe that energy demand will continue to rise, making the development of advanced material technologies essential for sustainable production.
The research team plans to publish additional data as they continue studying the material and its impact on battery performance.
Several researchers have already suggested that this material could replace current lithium-based components used in many energy systems.
If the material continues to show positive results, it may revolutionize energy production and create new opportunities for scientific innovation.
Overall, the study highlights the importance of energy research and the potential of this new material to reshape modern technology.
"""

In [49]:
from collections import Counter


texts = {
    "Govornik A": text_1,
    "Govornik B": text_2,
    "Govornik C": text_3,
}

def analyze_text(label, text):
    print("\n" + "="*70)
    print(f"{label}")
    print("="*70 + "\n")

    doc = nlp(text)

    nouns = []
    verbs = []

    for token in doc:
        # preskoči stop-riječi, interpunkciju, razmake
        if token.is_stop or token.is_punct or token.is_space:
            continue

        # koristimo lemu da bismo grupirali oblike riječi
        if token.pos_ == "NOUN":
            nouns.append(token.lemma_.lower())
        elif token.pos_ == "VERB":
            verbs.append(token.lemma_.lower())

    noun_counts = Counter(nouns)
    verb_counts = Counter(verbs)

    print("Top 5 imenica:")
    for word, cnt in noun_counts.most_common(5):
        print(f"{word:20s} -> {cnt}")

    print("\nTop 5 glagola:")
    for word, cnt in verb_counts.most_common(5):
        print(f"{word:20s} -> {cnt}")

    print("\nKratki zaključak:")
    print("- Imenice otkrivaju TEME o kojima govornik priča.")
    print("- Glagoli otkrivaju RADNJE, što se događa ili što se želi postići.")
    print("- Usporedbom govornika možeš vidjeti tko priča o podacima, tko o klimi, tko o razvoju softvera.\n")


# 2. Analiza svih govornika
for label, txt in texts.items():
    analyze_text(label, txt)



Govornik A

Top 5 imenica:
government           -> 7
reform               -> 4
policy               -> 4
inflation            -> 3
stability            -> 2

Top 5 glagola:
introduce            -> 1
design               -> 1
improve              -> 1
strengthen           -> 1
accord               -> 1

Kratki zaključak:
- Imenice otkrivaju TEME o kojima govornik priča.
- Glagoli otkrivaju RADNJE, što se događa ili što se želi postići.
- Usporedbom govornika možeš vidjeti tko priča o podacima, tko o klimi, tko o razvoju softvera.


Govornik B

Top 5 imenica:
team                 -> 7
match                -> 5
teamwork             -> 4
discipline           -> 2
coach                -> 2

Top 5 glagola:
play                 -> 2
win                  -> 2
deliver              -> 1
show                 -> 1
praise               -> 1

Kratki zaključak:
- Imenice otkrivaju TEME o kojima govornik priča.
- Glagoli otkrivaju RADNJE, što se događa ili što se želi postići.
- Usporedbom govornika možeš vidjeti tko priča o podacima, tko o klimi, tko o razvoju softvera.
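
The comparison between speakers that the summary lines mention can be made explicit. Below is a minimal sketch, assuming the noun counts printed above for Govornik A and Govornik B are copied by hand into `Counter` objects. The helper name `distinctive_words` is hypothetical and not part of the notebook.

```python
from collections import Counter

# Noun counts copied from the recorded output above (Govornik A and B only).
noun_counts = {
    "Govornik A": Counter({"government": 7, "reform": 4, "policy": 4,
                           "inflation": 3, "stability": 2}),
    "Govornik B": Counter({"team": 7, "match": 5, "teamwork": 4,
                           "discipline": 2, "coach": 2}),
}


def distinctive_words(counts_by_speaker, speaker):
    """Return the words only `speaker` uses, most frequent first."""
    others = set()
    for name, counts in counts_by_speaker.items():
        if name != speaker:
            others.update(counts)  # adds the other speaker's words
    own = counts_by_speaker[speaker]
    return [(w, c) for w, c in own.most_common() if w not in others]


for speaker in noun_counts:
    print(speaker, "->", distinctive_words(noun_counts, speaker))
```

With only the top five nouns per speaker there is no overlap, so every word counts as distinctive. The helper would be more informative if `analyze_text` were changed to return its full `Counter` objects instead of only printing them.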

## Task 5: Analysis of political speeches (advanced task)

> **Description:**  
> In this task you analyze the text of political speeches and identify which words the speakers use most often to emphasize their message.  
> The goal is to find out which topics and which parts of speech dominate each speech.

---

**Instructions:**
1. Find or copy two short speeches (or excerpts) by well-known politicians.  
   If you do not have real speeches, you can use the two examples below.  
2. For each speech, run the complete text-processing pipeline:
   - tokenization  
   - stop-word removal  
   - lemmatization  
   - POS tagging  
3. Extract the most frequent:
   - 10 **nouns**,  
   - 10 **verbs**,  
   - 10 **adjectives**.  
4. Present the results in **three separate tables** or as **plots** (use `pandas` and `matplotlib`).  
5. Compare the speeches and try to determine:
   - which speech is more “positive” (uses more words such as *hope*, *future*, *together*),  
   - which is more “defensive” or “confrontational” (uses more words such as *fight*, *challenge*, *threat*).  
6. Finally, write a **short conclusion (2–3 sentences)**: how the topics differ and what dominates each speech.
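
Steps 2 to 4 can be sketched as follows. To stay runnable without a model download, the example works on a hand-written list of `(lemma, POS)` pairs. With spaCy, the same list could be built as `[(t.lemma_.lower(), t.pos_) for t in doc if not (t.is_stop or t.is_punct or t.is_space)]`. The function name `top_by_pos` and the sample pairs are illustrative assumptions, not part of the notebook.

```python
from collections import Counter

# Hypothetical, hand-written (lemma, POS) pairs standing in for spaCy output.
tagged = [
    ("build", "VERB"), ("nation", "NOUN"), ("new", "ADJ"),
    ("build", "VERB"), ("future", "NOUN"), ("bright", "ADJ"),
    ("nation", "NOUN"), ("choose", "VERB"), ("new", "ADJ"),
]


def top_by_pos(pairs, n=10):
    """Group lemmas by POS tag and return the n most frequent per tag."""
    counters = {"NOUN": Counter(), "VERB": Counter(), "ADJ": Counter()}
    for lemma, pos in pairs:
        if pos in counters:  # ignore every other part of speech
            counters[pos][lemma] += 1
    return {pos: c.most_common(n) for pos, c in counters.items()}


tables = top_by_pos(tagged, n=10)
for pos, rows in tables.items():
    print(pos)
    for lemma, count in rows:
        print(f"  {lemma:20s} -> {count}")
```

Each list of `(lemma, count)` rows can be passed to `pd.DataFrame(rows, columns=["lemma", "count"])` to get one of the tables asked for in step 4, and `DataFrame.plot.bar(x="lemma", y="count")` draws the matching bar chart.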

---

**Goal:**  
In this task students bring together everything they have learned (processing, analysis, and interpretation of text) into a single workflow that simulates a basic NLP analysis of real data.
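
For the comparison in step 5, raw counts of positive and negative words favor the longer speech. One possible refinement, sketched here under the assumption that tone is measured purely by lexicon hits, is to normalize by the number of content lemmas. The word sets and lemma lists below are tiny illustrative samples, and `tone_score` is a hypothetical helper.

```python
from collections import Counter

# Tiny illustrative lexicons; real lists would be longer.
POSITIVE = {"hope", "future", "unity", "progress"}
NEGATIVE = {"fight", "challenge", "threat", "fear"}


def tone_score(lemmas):
    """Return (positive, negative, net) lexicon hits per 100 content lemmas."""
    counts = Counter(lemmas)
    total = sum(counts.values())
    if total == 0:  # empty input: avoid division by zero
        return 0.0, 0.0, 0.0
    pos = sum(counts[w] for w in POSITIVE)
    neg = sum(counts[w] for w in NEGATIVE)
    scale = 100 / total
    return pos * scale, neg * scale, (pos - neg) * scale


# Hypothetical lemma lists, as they might look after stop-word removal.
speech_1 = ["hope", "future", "build", "nation", "unity"]
speech_2 = ["threat", "border", "fear", "defend", "threat",
            "future", "nation", "security", "enemy", "act"]

for name, lemmas in [("speech_1", speech_1), ("speech_2", speech_2)]:
    pos, neg, net = tone_score(lemmas)
    print(f"{name}: positive={pos:.1f} negative={neg:.1f} net={net:+.1f}")
```

A per-100-lemma rate makes two speeches of different length comparable, but it remains a coarse word-list heuristic: it ignores negation and context.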

---

In [13]:
nlp = spacy.load("en_core_web_sm")

In [50]:
from collections import Counter


text_A = """My fellow citizens, today we gather not as strangers, but as a community united by our shared hopes and dreams.
We stand at the dawn of a new era—one built on innovation, cooperation, and the unshakable belief in the potential of our people.
The challenges before us are great, but so too is our courage and creativity.
We will invest in education, protect our planet, and empower every individual to shape their own destiny.

Let us build bridges, not walls; extend hands, not fists.
Together, we can create a nation where opportunity is not limited to the few, but shared by all.
Our strength lies not in fear, but in faith—in each other, in our values, and in the bright future we will create together.

Let this be the generation that chooses unity over division, progress over stagnation, and hope over despair.
Let this be the generation that dares to dream boldly, that embraces change, and that lifts one another up rather than tearing each other down.

We will work to expand access to healthcare, to ensure that no family has to choose between medicine and food.
We will support our teachers, invest in our children, and build schools that prepare every young mind for the world of tomorrow.
We will encourage clean energy, sustainable development, and responsible stewardship of the natural resources entrusted to us.

And above all, we will choose compassion—compassion for our neighbors, for our communities, and for those whose voices too often go unheard.
Our nation is strongest when every citizen feels seen, valued, and empowered.

Together, we will write a new chapter in our nation’s story—one defined not by fear or division, but by courage, unity, and purpose.
A future filled with opportunity is within our reach, and it is a future we will build hand in hand.
Let us move forward with confidence, with optimism, and with unwavering hope in all that we can achieve—together.
"""

text_B = """My fellow citizens, the world we face today is uncertain and full of danger.
Across the globe, our values are challenged, our security is tested, and our freedom is under threat.
We cannot afford complacency or hesitation.
We must strengthen our defenses, protect our borders, and ensure the safety of our families and our future.

Our enemies seek to divide us, to weaken our resolve, and to spread fear and chaos.
But we will not yield.
We will act with determination, discipline, and strength.
Every citizen has a role to play in defending our nation and preserving our way of life.

We must increase our vigilance, enhance our intelligence capabilities, and give our armed forces the tools they need to counter every threat.
We must stand firm against those who wish to undermine our democracy, whether they act from within or from beyond our borders.

Let us face the challenges before us with courage, and together ensure that the next generation inherits not fear, but freedom—not weakness, but resilience.
We will confront extremism wherever it appears.
We will push back against hostile powers seeking to disrupt our alliances.
We will safeguard our economy from manipulation and ensure that our industries cannot be exploited by those who do not share our values.

The dangers we confront are real, and they are growing.
Cyberattacks, disinformation campaigns, and coordinated acts of aggression threaten our stability.
We cannot ignore these warnings; we must respond with unity and unwavering resolve.

We will reinforce our border security, support law enforcement, and empower our military to defend every inch of our homeland.
We will stand shoulder to shoulder, refusing to be intimidated, refusing to be divided, and refusing to surrender to forces that thrive on fear.

Together, we will ensure that our nation remains safe, strong, and unbroken.
We will rise to meet every threat, overcome every obstacle, and protect the sacred freedoms that define us.
This is our duty, our responsibility, and our promise to all who come after us.
"""

speeches = {
    "Govor A – inkluzija, nada, progres": text_A,
    "Govor B – prijetnje, sigurnost, obrana": text_B,
}

positive_words = [
    "growth", "prosper", "innovation", "clean", "future",
    "trust", "fair", "justice", "peace", "support", "courage",
    "solidarity", "strong", "recover", "hope", "unity", "progress",
    "opportunity", "compassion", "confidence", "optimism", "resilience"
]

negative_words = [
    "crisis", "corruption", "injustice", "struggle",
    "poverty", "fear", "violence", "unemployment",
    "broken", "division", "conflict", "danger", "threat",
    "chaos", "hostile", "aggression"
]

def analyze_speech(label, text):
    print("\n" + "="*80)
    print(label)
    print("="*80 + "\n")

    doc = nlp(text)

    content_tokens = [
        token for token in doc
        if not token.is_stop and not token.is_punct and not token.is_space
    ]

    lemmas = [token.lemma_.lower() for token in content_tokens]
    nouns = [token.lemma_.lower() for token in content_tokens if token.pos_ == "NOUN"]
    verbs = [token.lemma_.lower() for token in content_tokens if token.pos_ == "VERB"]

    lemma_counts = Counter(lemmas)
    noun_counts = Counter(nouns)
    verb_counts = Counter(verbs)

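    # Rough tone sketch: count how often lemmas from the two hand-picked
    # word lists occur. Counter returns 0 for missing keys, so no
    # membership check is needed.
    pos_count = sum(lemma_counts[w] for w in positive_words)
    neg_count = sum(lemma_counts[w] for w in negative_words)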

    print("Top 10 sadržajnih riječi (leme):")
    for w, c in lemma_counts.most_common(10):
        print(f"{w:20s} -> {c}")

    print("\nTop 5 imenica:")
    for w, c in noun_counts.most_common(5):
        print(f"{w:20s} -> {c}")

    print("\nTop 5 glagola:")
    for w, c in verb_counts.most_common(5):
        print(f"{w:20s} -> {c}")

    print("\nProcjena tona govora (vrlo pojednostavljena):")
    print(f"Pozitivne riječi: {pos_count}")
    print(f"Negativne riječi: {neg_count}")

    if pos_count > neg_count:
        print("→ Govor ima pretežno POZITIVAN naglasak.")
    elif neg_count > pos_count:
        print("→ Govor ima pretežno NEGATIVAN / sigurnosno-krizni naglasak.")
    else:
        print("→ Govor je tonalno balansiran ili neodređen.")

    print("\nKratki zaključak:")
    print("- Dominantne imenice otkrivaju glavne TEME (npr. unity, opportunity vs. threat, security).")
    print("- Dominantni glagoli otkrivaju tipične AKCIJE (invest, build, support vs. defend, strengthen, confront).")
    print("- Omjer pozitivnih i negativnih riječi daje grubu, ali jasnu sliku retoričkog smjera.\n")


# Analyze both speeches
for label, speech in speeches.items():
    analyze_speech(label, speech)



Govor A – inkluzija, nada, progres

Top 10 sadržajnih riječi (leme):
build                -> 4
let                  -> 4
hope                 -> 3
hand                 -> 3
nation               -> 3
future               -> 3
choose               -> 3
citizen              -> 2
community            -> 2
share                -> 2

Top 5 imenica:
hand                 -> 3
nation               -> 3
future               -> 3
citizen              -> 2
community            -> 2

Top 5 glagola:
build                -> 4
let                  -> 4
choose               -> 3
share                -> 2
invest               -> 2

Procjena tona govora (vrlo pojednostavljena):
Pozitivne riječi: 21
Negativne riječi: 4
→ Govor ima pretežno POZITIVAN naglasak.

Kratki zaključak:
- Dominantne imenice otkrivaju glavne TEME (npr. unity, opportunity vs. threat, security).
- Dominantni glagoli otkrivaju tipične AKCIJE (invest, build, support vs. defend, strengthen, confront).
- Omjer pozitivnih i negativnih ri