# Part 3


1) Explain the concept of TF-IDF in your own words and how it can help you understand the genres and communities.

Term Frequency–Inverse Document Frequency (TF-IDF) measures how relevant a word is to a specific text within a collection of texts. It is calculated as the product of the Term Frequency (TF) and the Inverse Document Frequency (IDF).

The TF measures how often a word appears in a text, while the IDF measures how specific that word is to the text in question. The IDF is obtained by comparing the total number of texts in the collection with the number of texts that contain the word: the fewer texts contain it, the higher its IDF.

Thus, by combining these two metrics, the TF-IDF score highlights words that are frequent in one text but rare in the rest of the collection. A high TF-IDF means that a word is highly descriptive of a certain text, because it appears many times in that text (high TF) but in few of the other texts (high IDF).
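To make the two factors concrete, here is a minimal sketch on a toy corpus. The genre names and word lists are made up for illustration; the TF and smoothed IDF follow the same form as the `compute_tfidf` function in question 2. Note that with this smoothing, a word that appears in every text gets an IDF of exactly zero.

```python
import math
from collections import Counter

# Toy corpus (made-up data): each "text" is the word list of one genre.
docs = {
    "rock": ["guitar", "band", "guitar", "album"],
    "jazz": ["saxophone", "band", "album", "saxophone"],
    "pop": ["singer", "album", "band", "singer"],
}

n_docs = len(docs)
counts = {name: Counter(words) for name, words in docs.items()}

# Document frequency: the number of texts in which each word appears.
# Iterating over a Counter yields each distinct word once per text.
df = Counter(word for c in counts.values() for word in c)

# Smoothed IDF: log((N + 1) / (df + 1)).
idf = {word: math.log((n_docs + 1) / (d + 1)) for word, d in df.items()}

# TF-IDF: relative frequency within the text times the IDF of the word.
tfidf = {
    name: {word: (k / sum(c.values())) * idf[word] for word, k in c.items()}
    for name, c in counts.items()
}

for name, scores in tfidf.items():
    best = max(scores, key=scores.get)
    print(f"{name}: {best} ({scores[best]:.3f})")
```

Here "band" and "album" occur in all three texts, so they score zero everywhere, while "guitar", "saxophone" and "singer" each stand out in a single text.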

Therefore, we can use TF-IDF to identify the words that are truly characteristic of a certain genre or community. Artists whose texts share the same high-TF-IDF words probably have some kind of relationship (the same genre or another type of connection) and are thus more likely to belong to the same community.

In this way, TF-IDF helps us understand which words are most relevant to each genre and, therefore, to identify it. It also helps us identify relationships between artists, which are crucial when we analyse the network structure and look for the communities it contains.
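One possible way to check whether two artists share high-TF-IDF words is to compare their top-k word sets. The sketch below uses the Jaccard overlap for this, which is an illustrative choice and not something prescribed by the exercise. The artist names, the scores and the helpers `top_words` and `jaccard` are all hypothetical.

```python
def top_words(scores, k):
    """Return the set of the k highest-scoring words in a {word: score} dict."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


def jaccard(a, b):
    """Jaccard overlap of two sets: |a & b| / |a | b| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Made-up TF-IDF scores for three hypothetical artists.
artist_tfidf = {
    "artist_a": {"riff": 0.30, "guitar": 0.25, "tour": 0.05},
    "artist_b": {"guitar": 0.28, "riff": 0.20, "label": 0.04},
    "artist_c": {"saxophone": 0.35, "improvisation": 0.22, "tour": 0.06},
}

k = 2
tops = {name: top_words(scores, k) for name, scores in artist_tfidf.items()}

sim_ab = jaccard(tops["artist_a"], tops["artist_b"])  # same top words
sim_ac = jaccard(tops["artist_a"], tops["artist_c"])  # no top words in common
print(sim_ab, sim_ac)
```

A high overlap suggests that two artists are described with the same distinctive vocabulary, which is the kind of signal we would expect within one genre or community.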


2) Calculate and visualize TF-IDF for the genres and communities.

In [None]:
#imports

import math
from collections import Counter
from pathlib import Path

In [None]:
#Function to compute tfidf

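def compute_tfidf(group_folder, label_type="genre", top_n=10, show_top=True):
    """Compute TF-IDF scores for every group in a folder of `<group>_tf.txt` files.

    Each file is expected to hold one whitespace-separated `word count` pair per line.
    Returns (group_tfidf, idf, df).
    """
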
    group_folder = Path(group_folder)
    group_tf = {}
    all_words = set()

    # Load TF lists 
    for file in group_folder.glob("*_tf.txt"):
        group_name = file.stem.replace("_tf", "")
        counts = Counter()
        with open(file, "r", encoding="utf-8") as f:
            for line in f:
                if not line.strip():
                    continue  # skip blank lines, which would break the unpacking below
                word, freq = line.strip().split()
                freq = int(freq)
                counts[word] = freq
                all_words.add(word)
        group_tf[group_name] = counts

    # Compute document frequency (df) 
    df = Counter()
    for word in all_words:
        df[word] = sum(1 for tf_counts in group_tf.values() if word in tf_counts)

    # Compute IDF 
    N = len(group_tf)
    idf = {word: math.log((N + 1) / (df[word] + 1)) for word in all_words}

    # Compute TF-IDF per group 
    group_tfidf = {}
    for group_name, tf_counts in group_tf.items():
        total_words = sum(tf_counts.values())
        tfidf = {}
        for word, count in tf_counts.items():
            tf = count / total_words
            tfidf[word] = tf * idf[word]
        group_tfidf[group_name] = tfidf

    # Show top words (only when requested via show_top)
    if show_top:
        for group_name, tfidf_dict in group_tfidf.items():
            print(f"\n=== Top {top_n} TF-IDF words for {label_type.upper()} '{group_name}' ===")
            for word, score in sorted(tfidf_dict.items(), key=lambda x: x[1], reverse=True)[:top_n]:
                print(f"{word:20s} {score:.6f}")

    return group_tfidf, idf, df
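
The function above reads every `*_tf.txt` file in a folder and expects one `word count` pair per line. As a minimal sketch of that input format, the block below writes a small folder of such files and reads them back with the same parsing logic. The genre names and counts are made up, and the temporary folder is only for illustration.

```python
import tempfile
from collections import Counter
from pathlib import Path

# Made-up term counts for two hypothetical genres.
sample = {
    "rock": {"guitar": 12, "band": 7},
    "jazz": {"saxophone": 9, "band": 4},
}

# Write one "<group>_tf.txt" file per genre, one "word count" pair per line.
folder = Path(tempfile.mkdtemp())
for genre, word_counts in sample.items():
    lines = [f"{word} {freq}" for word, freq in word_counts.items()]
    (folder / f"{genre}_tf.txt").write_text("\n".join(lines) + "\n", encoding="utf-8")

# Read the files back with the same parsing steps as the loader above.
loaded = {}
for file in sorted(folder.glob("*_tf.txt")):
    name = file.stem.replace("_tf", "")
    loaded[name] = Counter()
    for line in file.read_text(encoding="utf-8").splitlines():
        word, freq = line.split()
        loaded[name][word] = int(freq)

print(loaded)
```

With a folder like this in place, `compute_tfidf(folder, label_type="genre", top_n=5)` can be tried on a small, fully known input before running it on the real data.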
