## TF-IDF text evaluation
The output is a list of tuples, each containing a word and its TF-IDF weight. The list is sorted by weight in descending order, so the first tuple holds the word with the highest weight.

To interpret the output, consider the following:

- The higher the TF-IDF weight of a word, the more important or relevant it is to the text.
- Words with high TF-IDF weights are often key terms that help summarize or understand the main ideas of the text.
- Comparing the TF-IDF weights of different words gives an idea of their relative importance in the text.
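
As a small illustration of the ranking described above, the sketch below sorts `(word, weight)` pairs in descending order and keeps the top entries. The weights are copied from the lemmatized `budzety.txt` table in this notebook; the helper name `top_terms` is hypothetical.

```python
def top_terms(pairs, k=3):
    """Return the k (word, weight) pairs with the highest weight."""
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:k]

# Weights copied from the lemmatized budzety.txt table in this notebook
pairs = [
    ("dla", 0.256905),
    ("park", 0.414496),
    ("na", 0.445577),
    ("ulica", 0.423723),
    ("przy", 0.314453),
]

for word, weight in top_terms(pairs):
    print(f"{word}: {weight:.6f}")
```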

In [20]:
# Import the TfidfVectorizer class from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

# Define the text
text = ["jestem zieloną żabą"]

# Create a TfidfVectorizer object
tfidf_vectorizer = TfidfVectorizer()

# Calculate the TF-IDF weights for each word in the text
tfidf_weights = tfidf_vectorizer.fit_transform(text)

# Print the TF-IDF weights for each word in the text
print(tfidf_weights)


  (0, 2)	0.5773502691896258
  (0, 1)	0.5773502691896258
  (0, 0)	0.5773502691896258
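
The three identical values can be reproduced by hand. The sketch below assumes scikit-learn's documented defaults (`smooth_idf=True`, `norm="l2"`, raw term counts): with a single document every term has a document frequency of 1, so idf = ln((1 + 1) / (1 + 1)) + 1 = 1, and L2 normalization of three counts of 1 gives 1/√3 ≈ 0.577. The function `tfidf_single_doc` is a hypothetical helper, not part of scikit-learn, and its tokenizer only approximates the default token pattern (lowercased tokens of two or more word characters).

```python
import math
import re
from collections import Counter

def tfidf_single_doc(text):
    """TF-IDF weights for a corpus that consists of exactly one document."""
    # Approximation of the default tokenizer: lowercase, tokens of 2+ word characters
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    counts = Counter(tokens)

    n_docs = 1
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1, with df = 1 for every term here
    idf = math.log((1 + n_docs) / (1 + 1)) + 1

    raw = {word: count * idf for word, count in counts.items()}
    # L2 normalization so that the document vector has unit length
    norm = math.sqrt(sum(value ** 2 for value in raw.values()))
    return {word: value / norm for word, value in raw.items()}

weights = tfidf_single_doc("jestem zieloną żabą")
for word, weight in sorted(weights.items()):
    print(word, round(weight, 6))
```

Because the idf factor is constant in a one-document corpus, the weights are just normalized term counts. This is why `jeszcze`, which occurs twice in the next cell's text, scores twice as high as the words that occur once.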


In [48]:
# Import the TfidfVectorizer class from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

# Import the pandas library
import pandas as pd

# Define the text
text = ["Faworyci podań wymieniali bez liku, ale uprawiali jałową sztukę dla sztuki, nie umieli zamienić teoretycznej przewagi na żaden konkret. W pierwszej połowie wezbrali nadzieją tylko raz, gdy Gavi przywalił w słupek. Tego epizodu nie odnotowali jednak nawet statystycy – sędzia zasygnalizował pozycję spaloną.Przed przerwą Marokańczycy byli bliżej celu, Amallah chybił minimalnie. Ich fani jeszcze dołożyli decybeli. Amok, jakiego jeszcze na katarskim mundialu nie zaznałem."]

# Create the TfidfVectorizer instance
vectorizer = TfidfVectorizer()

# Calculate the TF-IDF weights for the text
tfidf_weights = vectorizer.fit_transform(text)

# Get the list of words
words = vectorizer.get_feature_names_out()

# Get the list of TF-IDF weights
weights = tfidf_weights.toarray()

# Create a list of tuples containing the word and weight
tuples = zip(words, weights[0])

# Create a DataFrame from the list of tuples
df = pd.DataFrame(tuples, columns=["Word", "TF-IDF Weight"])

# Build a styled view of the table (this does not change df or the Markdown output below)
df.style.set_table_styles(
    [
        {"selector": "th", "props": [("text-align", "left")]},
        {"selector": "td", "props": [("text-align", "right")]},
    ]
)

# Convert the DataFrame to a Markdown table and print it to the console
print(df.to_markdown())


|    | Word           |   TF-IDF Weight |
|---:|:---------------|----------------:|
|  0 | ale            |        0.117851 |
|  1 | amallah        |        0.117851 |
|  2 | amok           |        0.117851 |
|  3 | bez            |        0.117851 |
|  4 | bliżej         |        0.117851 |
|  5 | byli           |        0.117851 |
|  6 | celu           |        0.117851 |
|  7 | chybił         |        0.117851 |
|  8 | decybeli       |        0.117851 |
|  9 | dla            |        0.117851 |
| 10 | dołożyli       |        0.117851 |
| 11 | epizodu        |        0.117851 |
| 12 | fani           |        0.117851 |
| 13 | faworyci       |        0.117851 |
| 14 | gavi           |        0.117851 |
| 15 | gdy            |        0.117851 |
| 16 | ich            |        0.117851 |
| 17 | jakiego        |        0.117851 |
| 18 | jałową         |        0.117851 |
| 19 | jednak         |        0.117851 |
| 20 | jeszcze        |        0.235702 |
| 21 | katarskim      |        0.1

In [49]:
# Import the TfidfVectorizer class from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

# Import the pandas library
import pandas as pd

# Read the whole file as a single document (assumes the file is UTF-8 encoded)
with open("budzety.txt", encoding="utf-8") as file:
    text = [file.read()]

# Create the TfidfVectorizer instance
vectorizer = TfidfVectorizer()

# Calculate the TF-IDF weights for the text
tfidf_weights = vectorizer.fit_transform(text)

# Get the list of words
words = vectorizer.get_feature_names_out()

# Get the list of TF-IDF weights
weights = tfidf_weights.toarray()

# Create a list of tuples containing the word and weight
tuples = zip(words, weights[0])

# Create a DataFrame from the list of tuples
df = pd.DataFrame(tuples, columns=["Word", "TF-IDF Weight"])

# Sort the DataFrame by the TF-IDF weights
df = df.sort_values(by="TF-IDF Weight", ascending=False)

# Build a styled view of the table (this does not change df or the Markdown output below)
df.style.set_table_styles(
    [
        {"selector": "th", "props": [("text-align", "left")]},
        {"selector": "td", "props": [("text-align", "right")]},
    ]
)

# Convert the DataFrame to a Markdown table and print it to the console
print(df.to_markdown())


|       | Word                                      |   TF-IDF Weight |
|------:|:------------------------------------------|----------------:|
|  5249 | na                                        |     0.513277    |
|  9998 | ul                                        |     0.386653    |
|  7585 | przy                                      |     0.366021    |
|  1723 | dla                                       |     0.299035    |
|  6440 | parku                                     |     0.293947    |
|  6394 | park                                      |     0.193327    |
|  1729 | do                                        |     0.165911    |
|  1099 | budowa                                    |     0.145843    |
| 11156 | zabaw                                     |     0.126341    |
|  6103 | oraz                                      |     0.121253    |
|  5639 | nr                                        |     0.0757479   |
|  6724 | placu                                     |     0.0751
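
The top of this ranking is dominated by function words and abbreviations (`na`, `ul`, `przy`, `dla`, `do`, `oraz`). One possible way to focus on content words is to filter the ranked list against a stop-word list. The sketch below uses a tiny hand-written set, which is an assumption made for illustration only; a real list would be much longer. The helper name `drop_stop_words` is hypothetical.

```python
# Illustrative set of stop words and abbreviations (an assumption, not a complete Polish list)
STOP_WORDS = {"na", "przy", "dla", "do", "oraz", "ul", "nr"}

def drop_stop_words(ranked, stop_words=STOP_WORDS):
    """Keep only the (word, weight) pairs whose word is not in stop_words."""
    return [(word, weight) for word, weight in ranked if word not in stop_words]

# Rows copied from the table above
ranked = [
    ("na", 0.513277), ("ul", 0.386653), ("przy", 0.366021),
    ("dla", 0.299035), ("parku", 0.293947), ("park", 0.193327),
    ("do", 0.165911), ("budowa", 0.145843), ("zabaw", 0.126341),
    ("oraz", 0.121253),
]

for word, weight in drop_stop_words(ranked):
    print(f"{word}: {weight}")
```

`TfidfVectorizer` also accepts a list through its `stop_words` parameter, which removes such words before the weights are computed.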

In [51]:
# Import the spacy library
import spacy

# Load the spacy model
nlp = spacy.load("pl_core_news_lg")

# Read the whole file into one string (assumes the file is UTF-8 encoded)
with open("budzety.txt", encoding="utf-8") as file:
    text = file.read()

# Create a spacy document from the text
doc = nlp(text)

# Lemmatize the text
lemmatized_text = [token.lemma_ for token in doc]

# Import the TfidfVectorizer class from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

# Create the TfidfVectorizer instance
vectorizer = TfidfVectorizer()

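# Calculate the TF-IDF weights for the lemmatized text.
# Note: lemmatized_text is a list of lemmas, so every lemma is treated as a
# separate document and the result has one row per token.
tfidf_weights = vectorizer.fit_transform(lemmatized_text)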

# Get the list of words
words = vectorizer.get_feature_names_out()

# Get the list of TF-IDF weights
weights = tfidf_weights.toarray()

# Print one matrix row next to each vocabulary word.
# Note: row i belongs to token i, not to vocabulary word i, so this pairing
# is not meaningful.
for word, weight in zip(words, weights):
    print(f"{word}: {weight}")


00: [0. 0. 0. ... 0. 0. 0.]
000: [0. 0. 0. ... 0. 0. 0.]
000zt: [0. 0. 0. ... 0. 0. 0.]
000zł: [0. 0. 0. ... 0. 0. 0.]
0013: [0. 0. 0. ... 0. 0. 0.]
0015: [0. 0. 0. ... 0. 0. 0.]
0015miasta: [0. 0. 0. ... 0. 0. 0.]
0055: [0. 0. 0. ... 0. 0. 0.]
01: [0. 0. 0. ... 0. 0. 0.]
019: [0. 0. 0. ... 0. 0. 0.]
02: [0. 0. 0. ... 0. 0. 0.]
03: [0. 0. 0. ... 0. 0. 0.]
04: [0. 0. 0. ... 0. 0. 0.]
05: [0. 0. 0. ... 0. 0. 0.]
06: [0. 0. 0. ... 0. 0. 0.]
07: [0. 0. 0. ... 0. 0. 0.]
08: [0. 0. 0. ... 0. 0. 0.]
09: [0. 0. 0. ... 0. 0. 0.]
10: [0. 0. 0. ... 0. 0. 0.]
100: [0. 0. 0. ... 0. 0. 0.]
1000: [0. 0. 0. ... 0. 0. 0.]
100letnie: [0. 0. 0. ... 0. 0. 0.]
101: [0. 0. 0. ... 0. 0. 0.]
102: [0. 0. 0. ... 0. 0. 0.]
103: [0. 0. 0. ... 0. 0. 0.]
104: [0. 0. 0. ... 0. 0. 0.]
105: [0. 0. 0. ... 0. 0. 0.]
106: [0. 0. 0. ... 0. 0. 0.]
1068: [0. 0. 0. ... 0. 0. 0.]
107: [0. 0. 0. ... 0. 0. 0.]
108: [0. 0. 0. ... 0. 0. 0.]
109: [0. 0. 0. ... 0. 0. 0.]
11: [0. 0. 0. ... 0. 0. 0.]
110: [0. 0. 0. ... 0. 0. 0.]
110a

trawiastą: [0. 0. 0. ... 0. 0. 0.]
trawka: [0. 0. 0. ... 0. 0. 0.]
trawniiek: [0. 0. 0. ... 0. 0. 0.]
trawnik: [0. 0. 0. ... 0. 0. 0.]
trawnika: [0. 0. 0. ... 0. 0. 0.]
trawniki: [0. 0. 0. ... 0. 0. 0.]
trawnikowy: [0. 0. 0. ... 0. 0. 0.]
trawniku: [0. 0. 0. ... 0. 0. 0.]
trawników: [0. 0. 0. ... 0. 0. 0.]
tree: [0. 0. 0. ... 0. 0. 0.]
treegator: [0. 0. 0. ... 0. 0. 0.]
treegatory: [0. 0. 0. ... 0. 0. 0.]
trejaze: [0. 0. 0. ... 0. 0. 0.]
tren: [0. 0. 0. ... 0. 0. 0.]
trend: [0. 0. 0. ... 0. 0. 0.]
trener: [0. 0. 0. ... 0. 0. 0.]
trening: [0. 0. 0. ... 0. 0. 0.]
treningi: [0. 0. 0. ... 0. 0. 0.]
treningowy: [0. 0. 0. ... 0. 0. 0.]
tresci: [0. 0. 0. ... 0. 0. 0.]
tresura: [0. 0. 0. ... 0. 0. 0.]
trików: [0. 0. 0. ... 0. 0. 0.]
trochę: [0. 0. 0. ... 0. 0. 0.]
trojden: [0. 0. 0. ... 0. 0. 0.]
trolejbusowy: [0. 0. 0. ... 0. 0. 0.]
trop: [0. 0. 0. ... 0. 0. 0.]
troska: [0. 0. 0. ... 0. 0. 0.]
trubuna: [0. 0. 0. ... 0. 0. 0.]
trudność: [0. 0. 0. ... 0. 0. 0.]
trudny: [0. 0. 0. ... 0. 0. 0.]
t
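
The rows printed above look empty because a list of lemmas was passed to `fit_transform`, and each list element is treated as its own document. The sketch below illustrates the difference in matrix shape with a minimal, hypothetical `count_matrix` helper (plain term counts, not scikit-learn's implementation) and a few made-up lemmas.

```python
def count_matrix(docs):
    """Return (vocabulary, rows) with one row of term counts per document."""
    vocabulary = sorted({word for doc in docs for word in doc.split()})
    rows = [[doc.split().count(word) for word in vocabulary] for doc in docs]
    return vocabulary, rows

# Stand-in for [token.lemma_ for token in doc]
lemmas = ["park", "na", "ulica", "park"]

# A list of lemmas: every lemma becomes a separate one-word document
vocab, per_token = count_matrix(lemmas)
print(len(per_token), "rows x", len(vocab), "columns")

# One joined string: the whole text is a single document
vocab, whole_text = count_matrix([" ".join(lemmas)])
print(len(whole_text), "row x", len(vocab), "columns")
print(dict(zip(vocab, whole_text[0])))
```

In the first case, `zip(words, weights)` pairs vocabulary word *i* with the row of token *i*, which is unrelated to that word. In the second case, `weights[0]` holds exactly one value per vocabulary word.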

## Code that lemmatizes the dataset and shows TF-IDF

In [52]:
# Import the spacy library
import spacy

# Load the spacy model
nlp = spacy.load("pl_core_news_lg")

# Import the TfidfVectorizer class from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

# Import the pandas library
import pandas as pd

# Read the whole file into one string (assumes the file is UTF-8 encoded)
with open("budzety.txt", encoding="utf-8") as file:
    text = file.read()

# Create a spacy document from the text
doc = nlp(text)

# Define the lemmatized text
lemmatized_text = " ".join([token.lemma_ for token in doc])

# Create the TfidfVectorizer instance
vectorizer = TfidfVectorizer()

# Calculate the TF-IDF weights for the text
tfidf_weights = vectorizer.fit_transform([lemmatized_text])

# Get the list of words
words = vectorizer.get_feature_names_out()

# Get the list of TF-IDF weights
weights = tfidf_weights.toarray()

# Create a list of tuples containing the word and weight
tuples = zip(words, weights[0])

# Create a DataFrame from the list of tuples
df = pd.DataFrame(tuples, columns=["Word", "TF-IDF Weight"])

# Sort the DataFrame by the TF-IDF weights
df = df.sort_values(by="TF-IDF Weight", ascending=False)

# Build a styled view of the table (this does not change df or the Markdown output below)
df.style.set_table_styles(
    [
        {"selector": "th", "props": [("text-align", "left")]},
        {"selector": "td", "props": [("text-align", "right")]},
    ]
)

# Convert the DataFrame to a Markdown table and print it to the console
print(df.to_markdown())
df.to_csv("slowa_z_tf_idf.csv")


|       | Word                                      |   TF-IDF Weight |
|------:|:------------------------------------------|----------------:|
|  4428 | na                                        |     0.445577    |
|  8355 | ulica                                     |     0.423723    |
|  5386 | park                                      |     0.414496    |
|  6365 | przy                                      |     0.314453    |
|  1473 | dla                                       |     0.256905    |
|  9782 | zielony                                   |     0.196928    |
|  1479 | do                                        |     0.142536    |
|   967 | budowa                                    |     0.134766    |
|  5651 | plac                                      |     0.12481     |
|  4103 | miejsce                                   |     0.111698    |
|  8088 | teren                                     |     0.109269    |
|  5392 | parking                                   |     0.1090

## TF-IDF with a POS preview

In [45]:
# Import the TfidfVectorizer class from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

# Import the spacy library
import spacy

# Import the pandas library
import pandas as pd

# Load the spacy model
nlp = spacy.load("pl_core_news_lg")

# Read the whole file into one string (assumes the file is UTF-8 encoded)
with open("budzety.txt", encoding="utf-8") as file:
    text = file.read()

# Create a spacy document from the text
doc = nlp(text)

# Lemmatize the words in the text
lemmas = [token.lemma_ for token in doc]

# Create the TfidfVectorizer instance
vectorizer = TfidfVectorizer()

# Calculate the TF-IDF weights for the text.
# Note: lemmas is a list, so each lemma is treated as a separate document.
tfidf_weights = vectorizer.fit_transform(lemmas)

# Get the list of words
words = vectorizer.get_feature_names_out()

# Get the list of TF-IDF weights
weights = tfidf_weights.toarray()

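# Create a list of tuples containing the word, POS tag, and weight.
# Caution: words is the alphabetically sorted vocabulary, the POS tags follow
# token order in doc, and weights[0] is the row of the first token only,
# so the three columns are not aligned with each other.
tuples = zip(words, [token.pos_ for token in doc], weights[0])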

# Create a DataFrame from the list of tuples
df = pd.DataFrame(tuples, columns=["Word", "POS Tag", "TF-IDF Weight"])

# Sort the DataFrame by the TF-IDF weights
df = df.sort_values(by="TF-IDF Weight", ascending=False)

# Build a styled view of the table (this does not change df or the Markdown output below)
df.style.set_table_styles(
    [
        {"selector": "th", "props": [("text-align", "left")]},
        {"selector": "td", "props": [("text-align", "right")]},
    ]
)

# Convert the DataFrame to a Markdown table and print it to the console
print(df.to_markdown())
df.to_csv("lemmas_z_tf_idf.csv")



|       | Word                                      | POS Tag   |   TF-IDF Weight |
|------:|:------------------------------------------|:----------|----------------:|
|  4574 | nazwa                                     | CCONJ     |               1 |
|     0 | 00                                        | NOUN      |               0 |
|  6818 | romantyczny                               | NOUN      |               0 |
|  6820 | romaszewski                               | PROPN     |               0 |
|  6821 | ronald                                    | CCONJ     |               0 |
|  6822 | rond                                      | PUNCT     |               0 |
|  6823 | ronda                                     | SPACE     |               0 |
|  6824 | ronde                                     | ADJ       |               0 |
|  6825 | rondo                                     | ADJ       |               0 |
|  6826 | rondzie                                   | ADP       |           
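
The POS column in this table is not aligned with the Word column: `words` is the alphabetically sorted vocabulary, while the POS tags follow token order in the document. One possible way to attach a tag to each vocabulary entry is to record, for every lemma, its most frequent POS tag. The sketch below shows the idea on hand-written `(lemma, pos)` pairs that stand in for `(token.lemma_.lower(), token.pos_)`; the pairs and the helper name `most_common_pos` are illustrative assumptions.

```python
from collections import Counter, defaultdict

def most_common_pos(lemma_pos_pairs):
    """Map each lemma to the POS tag it was assigned most often."""
    tag_counts = defaultdict(Counter)
    for lemma, pos in lemma_pos_pairs:
        tag_counts[lemma][pos] += 1
    return {lemma: counts.most_common(1)[0][0] for lemma, counts in tag_counts.items()}

# Stand-in for [(token.lemma_.lower(), token.pos_) for token in doc]
pairs = [
    ("park", "NOUN"), ("na", "ADP"), ("ulica", "NOUN"),
    ("zielony", "ADJ"), ("park", "NOUN"), ("park", "PROPN"),
]

pos_by_lemma = most_common_pos(pairs)

# Vocabulary in alphabetical order, as a vectorizer would return it
words = sorted(pos_by_lemma)
for word in words:
    print(word, pos_by_lemma[word])
```

With real data, a lookup such as `pos_by_lemma.get(word, "")` can fill the POS column, since the vectorizer's tokenization may differ from spaCy's and some vocabulary entries may have no matching lemma.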