In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer

https://kavita-ganesan.com/tfidftransformer-tfidfvectorizer-usage-differences/#.Y5R1cOzMLVY




In natural language processing (NLP), the term frequency-inverse document frequency (TF-IDF) score is a measure of the importance of a word or phrase to the meaning of a document (or group of documents) in a collection. The TF-IDF score is calculated by combining the term frequency (TF) and the inverse document frequency (IDF) of the word or phrase.

The term frequency (TF) of a word or phrase is the number of times it appears in a document. This value reflects how important the word is to the meaning of the document.

The inverse document frequency (IDF) of a word or phrase measures how rare it is across the entire collection of documents. It is derived from the document frequency, the number of documents that contain the term: words that appear in many documents receive a lower IDF score, while words that appear in few documents receive a higher one.

The TF-IDF score is calculated by multiplying the TF and IDF values for a word or phrase. This score reflects both the importance of the word to the meaning of the document and its rarity across the collection of documents. Words and phrases with high TF-IDF scores are considered to be more important and relevant to the meaning of the document than those with low TF-IDF scores.

In summary, the difference between the IDF and the TF-IDF score is that the IDF is a measure of the rarity of a word or phrase across a collection of documents, while the TF-IDF score is a measure of the importance of the word or phrase to the meaning of a specific document in the collection. The TF-IDF score combines both of these factors to give a more complete picture of the significance of a word or phrase to the meaning of a document.
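
To make the definitions above concrete, here is a minimal sketch that computes TF, IDF, and TF-IDF by hand. The toy corpus and the helper names (`tf`, `idf`, `tfidf`) are illustrative assumptions, and the IDF used is the textbook form log(N / df); scikit-learn applies a smoothed variant and normalizes each document vector, so its numbers will differ.

```python
import math

# Toy corpus (hypothetical, not the notebook's data): three tokenized documents.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran", "home"],
]

def tf(term, doc):
    """Raw term frequency: how often `term` occurs in `doc`."""
    return doc.count(term)

def idf(term, docs):
    """Textbook IDF: log(N / df), where df counts the documents containing `term`."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

def tfidf(term, doc, docs):
    """TF-IDF is the product of the two values."""
    return tf(term, doc) * idf(term, docs)

# Score every distinct term of the third document.
scores = {term: tfidf(term, docs[2], docs) for term in set(docs[2])}
for term, score in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{term:5s} {score:.3f}")
```

In this sketch "the" scores zero because it occurs in every document, while "ran" and "home", which occur only in the third document, score highest.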

In [2]:
# Load the text data into a DataFrame
data = pd.read_csv('data/p_content.csv')


# Any settings you would pass to CountVectorizer go here
tfidf_vectorizer = TfidfVectorizer(use_idf=True)
# Fit on all documents and compute their TF-IDF vectors
tfidf_vectorizer_vectors = tfidf_vectorizer.fit_transform(data['content'])



# Get the vector for the first document
first_vector_tfidfvectorizer = tfidf_vectorizer_vectors[0]
# Place its TF-IDF values in a pandas DataFrame, one row per vocabulary term
df = pd.DataFrame(first_vector_tfidfvectorizer.T.todense(), index=tfidf_vectorizer.get_feature_names_out(), columns=["tfidf"])
df.sort_values(by=["tfidf"], ascending=False)



term,tfidf
stoff,0.251966
medizinisches,0.231332
kategorie,0.231332
umgang,0.154003
und,0.152668
...,...
filefoto,0.000000
filiale,0.000000
filialen,0.000000
filmvorführungen,0.000000


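The lower the IDF value of a word, the less specific it is to any particular document. With smooth_idf=True, scikit-learn's minimum IDF is 1.0, which a word reaches when it occurs in every document.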

In [4]:
# Instantiate CountVectorizer
cv = CountVectorizer()
# This step generates word counts for the words in the documents
word_count_vector = cv.fit_transform(data['content'])
# Learn the IDF weights from the word counts
tfidf_transformer = TfidfTransformer(smooth_idf=True, use_idf=True)
tfidf_transformer.fit(word_count_vector)

# Collect the IDF values, one row per vocabulary term
df_idf = pd.DataFrame(tfidf_transformer.idf_, index=cv.get_feature_names_out(), columns=["idf_weights"])
# Sort ascending so the most common words come first
df_idf.sort_values(by=['idf_weights'])


term,idf_weights
in,1.000000
der,1.000000
die,1.000000
im,1.000000
und,1.000000
...,...
impfungendie,4.951244
alpha,4.951244
alsterfoto,4.951244
impfziel,4.951244


Build the feature frame: one row per document, one TF-IDF column per vocabulary term.

In [31]:
# Convert the matrix of TF-IDF values to a DataFrame
tfidf_df = pd.DataFrame(tfidf_vectorizer_vectors.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

newcol = [('tfidf',n) for n in tfidf_df.columns]
multicol1 = pd.MultiIndex.from_tuples(newcol)

tfidf_df.columns = multicol1

tfidf_df.head()

,tfidf,tfidf,tfidf,tfidf,tfidf,tfidf,tfidf,tfidf,tfidf,tfidf,tfidf,tfidf,tfidf,tfidf,tfidf,tfidf,tfidf,tfidf,tfidf,tfidf,tfidf
document,00,000,000er,001,0061,0067,015,016,03,04,...,überwunden,überzeugen,überzeugt,üblichen,übrigen,übrigens,übriges,übt,übte,übung
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.057313,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.039842,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.032553,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.018379,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.053137


In [33]:
df_merged = data[['ID_GodotObject']].merge(tfidf_df, left_index=True, right_index=True)
df_merged.head()


document,ID_GodotObject,"(tfidf, 00)","(tfidf, 000)","(tfidf, 000er)","(tfidf, 001)","(tfidf, 0061)","(tfidf, 0067)","(tfidf, 015)","(tfidf, 016)","(tfidf, 03)",...,"(tfidf, überwunden)","(tfidf, überzeugen)","(tfidf, überzeugt)","(tfidf, üblichen)","(tfidf, übrigen)","(tfidf, übrigens)","(tfidf, übriges)","(tfidf, übt)","(tfidf, übte)","(tfidf, übung)"
0,2000115059032,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2000116305030,0.0,0.057313,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2000116325081,0.0,0.039842,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,2000116346340,0.0,0.032553,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,2000116371728,0.0,0.018379,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.053137
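
The column labels in the output above, such as "(tfidf, 00)", suggest that the merge flattened the two-level columns into tuples, which happens when a frame with single-level columns is merged with a MultiIndex frame. Depending on the pandas version, such a merge may also warn or be rejected. One possible way to keep both levels, sketched on hypothetical miniature data, is to give the ID column an outer level as well and align on the row index with pd.concat. The outer label "meta" is an arbitrary name chosen for illustration.

```python
import pandas as pd

# Hypothetical miniature versions of `data` and the flat TF-IDF frame.
data = pd.DataFrame({"ID_GodotObject": [101, 102, 103]})
tfidf_flat = pd.DataFrame(
    {"stoff": [0.25, 0.0, 0.1], "und": [0.15, 0.2, 0.0]}
)

# Passing a dict to pd.concat adds the dict key as an outer column level.
tfidf_df = pd.concat({"tfidf": tfidf_flat}, axis=1)
ids = pd.concat({"meta": data[["ID_GodotObject"]]}, axis=1)

# Both frames now have two column levels, so aligning them on the row index
# keeps the MultiIndex instead of flattening it into tuples.
df_merged = pd.concat([ids, tfidf_df], axis=1)

print(df_merged.columns.nlevels)
print(df_merged["tfidf"].head())
```

With the outer level intact, df_merged["tfidf"] selects all TF-IDF columns at once, which is convenient if further feature groups are added later.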


https://www.kaggle.com/code/pierremegret/gensim-word2vec-tutorial/notebook

In [34]:
# Save the merged feature frame (document IDs plus TF-IDF columns)
df_merged.to_csv('data/tfidf_content.csv', encoding='utf-8', index=False)