# TF-IDF Representation

**TF-IDF** stands for:
- **Term Frequency (TF)**: Measures how frequently a term appears in a document. The more a term appears in a document, the higher its TF score.
- **Inverse Document Frequency (IDF)**: Measures how rare a term is across the entire corpus. Words that appear in many documents (e.g., common words like "the" or "is") have a low IDF score, while words that appear in only a few documents have a higher IDF score.

The TF-IDF value of a term is calculated as:

$$
\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)
$$

Where:
- **TF(t, d)**: Term frequency of term `t` in document `d`.
- **IDF(t)**: Inverse document frequency of term `t`.

##### How TF-IDF Works

1. **Term Frequency (TF)**: TF measures the frequency of a term within a document. It is calculated as the ratio of the number of times a term occurs in a document to the total number of terms in that document. The goal is to emphasize words that are frequent within a document.

![image.png](https://miro.medium.com/v2/resize:fit:1100/format:webp/1*lflj0Cz-X04bM2CKD9-ZTg.png)


2. **Inverse Document Frequency (IDF)**: IDF measures the rarity of a term across a collection of documents. It is calculated as the logarithm of the ratio of the total number of documents to the number of documents containing the term. The goal is to penalize words that are common across all documents.
   
![image.png](https://miro.medium.com/v2/resize:fit:1186/format:webp/1*d13YVVFq7YBXvbDkHaihQA.png)


3. **Calculate TF-IDF**: Multiply the TF by the IDF for each word to get the final TF-IDF value, which represents how important a word is to a specific document.

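The three steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions: the toy corpus, the helper names `tf`, `idf`, and `tf_idf`, whitespace tokenization, and the natural logarithm are all choices made for this example (the definition above does not fix the logarithm base).

```python
import math

# Hypothetical toy corpus, tokenized by splitting on whitespace
docs = [
    "machine learning is fun",
    "deep learning is a subfield of machine learning",
    "data science uses statistics",
]
tokenized = [doc.split() for doc in docs]


def tf(term, tokens):
    """Occurrences of the term divided by the total number of terms in the document."""
    return tokens.count(term) / len(tokens)


def idf(term, corpus):
    """Log of (number of documents / number of documents containing the term)."""
    containing = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / containing)


def tf_idf(term, tokens, corpus):
    """Product of term frequency and inverse document frequency."""
    return tf(term, tokens) * idf(term, corpus)


# "learning" occurs 2 times among the 8 terms of the second document
print("TF:", tf("learning", tokenized[1]))
# "learning" appears in 2 of 3 documents, "statistics" in only 1
print("IDF(learning):", idf("learning", tokenized))
print("IDF(statistics):", idf("statistics", tokenized))
print("TF-IDF:", tf_idf("learning", tokenized[1], tokenized))
```

Note that this `idf` would divide by zero for a term that occurs in no document; library implementations typically smooth the ratio to avoid that case.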
##### Advantages and Limitations of TF-IDF

**Advantages**:
- **Importance Weighting**: TF-IDF provides a measure of word importance, which helps reduce the influence of commonly occurring words that add little meaning.
- **Simplicity**: Easy to implement and widely used in many text mining and information retrieval applications.

**Limitations**:
- **Lack of Semantic Understanding**: TF-IDF considers individual word frequencies without considering the semantic relationships between words.
- **Data Sparsity**: Similar to the Bag of Words model, the TF-IDF representation can result in very sparse matrices for large vocabularies.

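Both limitations can be seen on a tiny example. The sketch below uses three made-up sentences (an assumption for illustration) and removes English stop words so that only content words remain. Two sentences with nearly the same meaning but different words get a cosine similarity of zero, and even at this scale a large share of the matrix entries are zero.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy sentences: the first two mean nearly the same thing
demo_docs = [
    "the car is fast",
    "the automobile is quick",
    "the car is red",
]

# Drop English stop words ("the", "is") so only content words are compared
demo_vectorizer = TfidfVectorizer(stop_words="english")
demo_matrix = demo_vectorizer.fit_transform(demo_docs)

# Pairwise cosine similarity between the document vectors
similarity = cosine_similarity(demo_matrix)
print("Synonymous sentences (no shared words):", similarity[0, 1])
print("Sentences sharing the word 'car':", similarity[0, 2])

# Fraction of non-zero entries in the TF-IDF matrix
density = demo_matrix.nnz / (demo_matrix.shape[0] * demo_matrix.shape[1])
print("Matrix density:", density)
```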

### 1. Example of TF-IDF

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample documents
documents = [
    "Machine learning is a fascinating field",
    "Data science and machine learning are closely related",
    "Deep learning is a subfield of machine learning",
    "Supervised learning involves labeled data",
    "Unsupervised learning deals with unlabeled data",
    "Feature engineering is crucial for model performance",
    "Data preprocessing is an important step in machine learning",
    "Natural language processing is a key area in AI",
    "Hyperparameter tuning helps to optimize models",
    "Model evaluation is necessary for understanding model accuracy"
]

# Initialize the TfidfVectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the documents to create the TF-IDF representation
tfidf_matrix = vectorizer.fit_transform(documents)

# Print the vocabulary, then the TF-IDF matrix converted to a dense array
print("Vocabulary:", vectorizer.get_feature_names_out())
print("TF-IDF Representation:\n", tfidf_matrix.toarray())

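The values printed by `TfidfVectorizer` will not match the textbook formula exactly. With its default settings (`smooth_idf=True`, `norm='l2'`), scikit-learn computes IDF as $\ln\frac{1 + n}{1 + \text{df}(t)} + 1$, where $n$ is the number of documents and $\text{df}(t)$ is the number of documents containing term $t$, and then scales each document vector to unit Euclidean length. The sketch below checks both facts on a hypothetical three-sentence corpus; the variable names are illustrative and chosen so they do not overwrite `vectorizer` or `tfidf_matrix`.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus used only for this check
sample_docs = [
    "machine learning is fun",
    "deep learning is a subfield of machine learning",
    "data science uses statistics",
]

sample_vectorizer = TfidfVectorizer()
sample_matrix = sample_vectorizer.fit_transform(sample_docs)

# Document frequency: in how many documents each vocabulary term occurs
sample_counts = CountVectorizer().fit_transform(sample_docs).toarray()
doc_freq = (sample_counts > 0).sum(axis=0)
n_docs = len(sample_docs)

# Smoothed IDF as used by scikit-learn's default settings
manual_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1
print("IDF matches:", np.allclose(manual_idf, sample_vectorizer.idf_))

# Each row (document) has Euclidean length 1 after L2 normalization
row_norms = np.linalg.norm(sample_matrix.toarray(), axis=1)
print("Row norms:", row_norms)
```

Because of the added 1 and the smoothing, a term that occurs in every document still receives a positive weight instead of zero.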
### 2. Visualizing TF-IDF

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Get the vocabulary and each term's TF-IDF score summed over all documents
vocabulary = vectorizer.get_feature_names_out()
scores = np.array(tfidf_matrix.sum(axis=0)).flatten()

# Sort the vocabulary and scores by TF-IDF score in descending order
sorted_indices = np.argsort(-scores)
sorted_vocabulary = vocabulary[sorted_indices]
sorted_scores = scores[sorted_indices]

# Select the top 10 words
top_n = 10
sorted_vocabulary = sorted_vocabulary[:top_n]
sorted_scores = sorted_scores[:top_n]

# Plot the TF-IDF scores
plt.figure(figsize=(10, 5))
plt.bar(sorted_vocabulary, sorted_scores)
plt.xlabel('Words')
plt.ylabel('TF-IDF Score')
plt.title('Top 10 Words by TF-IDF Score')
plt.xticks(rotation=45)
plt.show()

### 3. Basic K-means clustering

In [None]:
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import numpy as np

# Define the number of clusters
num_clusters = 2

# Initialize and fit the KMeans model (a fixed random_state makes the labels reproducible)
kmeans = KMeans(n_clusters=num_clusters, random_state=42)
kmeans.fit(tfidf_matrix)

# Get the cluster labels for each document
labels = kmeans.labels_

# Visualize the clustering using a scatter plot
from sklearn.decomposition import PCA

# Reduce dimensions with PCA for visualization
pca = PCA(n_components=2)
reduced_data = pca.fit_transform(tfidf_matrix.toarray())

# Plot the clusters
plt.figure(figsize=(10, 6))
plt.scatter(reduced_data[:, 0], reduced_data[:, 1], c=labels, cmap='viridis', marker='o', edgecolor='k', s=50)
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.title('K-Means Clustering of Documents (TF-IDF)')
plt.show()

# Print the cluster labels
for i, label in enumerate(labels):
    print(f"Document {i}: Cluster {label}")



### 4. Working with a Dataframe

##### 4.1 Load the data

In [None]:
import pandas as pd

# Load the data
df = pd.read_csv('movies_cleaned.csv')

##### 4.2 TF-IDF

In [None]:
from sklearn.preprocessing import normalize

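# Initialize the TfidfVectorizer, ignoring terms that appear in fewer than
# 3 documents (min_df=3) or in more than 40% of the documents (max_df=0.4)
vectorizer = TfidfVectorizer(min_df=3, max_df=0.4)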

# Fit and transform the 'Cleaned_text' column to create the TF-IDF representation
tfidf_matrix = vectorizer.fit_transform(df['Cleaned_text'])

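# L2-normalize each row of the TF-IDF matrix. TfidfVectorizer already does this
# by default (norm='l2'), so this step leaves the values unchanged and acts as a safeguard.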
tfidf_matrix_norm = normalize(tfidf_matrix)
terms = vectorizer.get_feature_names_out()

tfidf_df = pd.DataFrame(tfidf_matrix_norm.toarray(), columns=terms)

tfidf_df

##### 4.3 Determine the Optimal Number of Clusters
The `KneeLocator` class from the `kneed` Python library finds the "knee" or "elbow" point of a curve, that is, the point beyond which further increases bring clearly diminishing returns. In the "elbow method" for K-means clustering, the curve is the distortion (inertia) plotted against the number of clusters k, and the knee suggests a reasonable value for k. The same idea can be used to choose other hyperparameters.

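To build intuition for what a knee detector looks for, here is a simplified heuristic (not necessarily the exact algorithm that `kneed` implements): rescale both axes to [0, 1], draw a straight line from the first to the last point of the curve, and pick the k whose point lies farthest below that line. The function name `elbow_by_chord_gap` and the inertia values are made up for illustration.

```python
import numpy as np


def elbow_by_chord_gap(ks, inertias):
    """Return the k whose point lies farthest below the line joining the
    first and last points of a decreasing, convex inertia curve."""
    ks = np.asarray(ks, dtype=float)
    inertias = np.asarray(inertias, dtype=float)

    # Rescale both axes to [0, 1] so neither dominates the comparison
    x = (ks - ks.min()) / (ks.max() - ks.min())
    y = (inertias - inertias.min()) / (inertias.max() - inertias.min())

    # Assuming the curve decreases from its maximum to its minimum, the chord
    # runs from (0, 1) to (1, 0), i.e. y = 1 - x; measure the vertical gap
    gap = (1 - x) - y
    return int(ks[np.argmax(gap)])


# Hypothetical inertia values that drop sharply until k = 3, then flatten
example_ks = [1, 2, 3, 4, 5, 6]
example_inertias = [100, 40, 15, 12, 10, 9]
example_elbow = elbow_by_chord_gap(example_ks, example_inertias)
print("Estimated elbow:", example_elbow)
```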
In [None]:
%pip install -U kneed

In [None]:
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from kneed import KneeLocator  # Finds the elbow point programmatically

# Calculate distortions for different values of k
distortions = []
K = range(1, 11)  # Test k from 1 to 10
for k in K:
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(tfidf_matrix_norm)
    distortions.append(kmeans.inertia_)

# Plot the elbow graph
plt.figure(figsize=(8, 5))
plt.plot(K, distortions, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Distortion')
plt.title('Elbow Method for Optimal k')
plt.grid(True)
plt.show()

# Determine the optimal k using the KneeLocator class from kneed
knee_locator = KneeLocator(K, distortions, curve="convex", direction="decreasing")
optimal_k = knee_locator.knee  # None if no knee is detected
print("Optimal number of clusters:", optimal_k)



##### 4.4 K-means Clustering

In [None]:
# Choose a number of clusters for K-means
num_clusters = 7  # This can be adjusted based on specific needs

# Applying K-Means Clustering (this time we use some more parameters)
kmeans = KMeans(n_clusters=num_clusters, init='k-means++', max_iter=3000, n_init='auto')

# Fit the model
kmeans.fit(tfidf_matrix_norm)

# Get the cluster label assigned to each document
cluster_labels = kmeans.labels_

# Show the labels of the first 10 documents
print(cluster_labels[:10])

# For each cluster, sort the term indices by centroid weight in descending order
order_centroids = kmeans.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names_out()

# Print the 10 highest-weighted terms of each cluster
for i in range(num_clusters):
    top_terms = [terms[ind] for ind in order_centroids[i, :10]]
    print(f"{i}," + ",".join(top_terms))

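Once the labels are available, it is often useful to attach them to the dataframe and check how many documents fall into each cluster. The sketch below does this on a hypothetical four-row dataframe; only the column name `Cleaned_text` is taken from the notebook, while the `toy_` names, the texts, and the `cluster` column are assumptions for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data with two obvious topics
toy_df = pd.DataFrame({
    "Cleaned_text": [
        "cat kitten purr",
        "cat kitten meow",
        "dog puppy bark",
        "dog puppy fetch",
    ]
})

# Vectorize and cluster (n_init=10 restarts, fixed seed for reproducibility)
toy_matrix = TfidfVectorizer().fit_transform(toy_df["Cleaned_text"])
toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy_matrix)

# Store each document's cluster label next to its text
toy_df["cluster"] = toy_kmeans.labels_

# Number of documents per cluster
cluster_sizes = toy_df["cluster"].value_counts().sort_index()
print(cluster_sizes)

# Read the documents cluster by cluster to judge whether the groups make sense
print(toy_df.sort_values("cluster"))
```

Very uneven cluster sizes, or clusters whose documents share no obvious theme, are a hint to revisit the number of clusters or the vectorizer settings.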
### 5. Translate to the Case
Go to the case and perform k-means clustering on the news articles.