# Part 1: Genres and communities and plotting 

<div style="border: 1px solid white; padding: 10px;">

The questions below are based on Lecture 7, part 2.

* Write about genres and modularity.
* Detect the communities, discuss the value of modularity in comparison to the genres.
* Calculate the matrix $D$ and discuss your findings.
* Plot the communities and comment on your results.

</div>

<font color='skyblue'>Answer 1 Part 1: **Write about genres and modularity**</font>

Modularity is a measure that quantifies the quality of a particular partition of a network into communities, aiming to capture the strenght of connections within each community compared to what would be expected by random chance. It assesses whether nodes within the same community are more dansely connected to each other than to nodes in other comminities.

In a network with $N$ nodes and $L$ links, modularity $M$ for a given partition into $n_c$ communities is calculated by comparing the actual number of links within each community $C_c$ to the number we would expect if the network were randomly structured (while perserving each node's total degree or connectivity). For each community $C_c$ mosularity is defined as:

$$M_c = \frac{L_c}{L} - (\frac{K_c}{2L})^2$$

A higher modularity means a stronger community structure, where nodes within each community are more densely connected to each other than to nodes outside the community.

A zero modularity means that the martition has no community structure and a negative modularity would mean that the partitioned structure actually has fewer connection within each community than expected by chance.


<font color='skyblue'>Answer 2 Part 1: **Detect the communities, discuss the value of modularity in comparison to the genres**</font>

*Needed code to load the network*

In [4]:
import os
import re
import networkx as nx
import matplotlib.pyplot as plt

# Directory containing Wikipedia pages
SAVE_DIR = "C:/Users/nerea/Documents/MasterDTU/SocialGraphs_fall24/Projects/socialGraphs_fall24/country_artists_wiki"

# Sample list of performers (This should be replaced with the complete list)
performers = [file.replace('_', ' ').replace('.txt', '') for file in os.listdir(SAVE_DIR) if file.endswith('.txt')]

# Initialize the directed graph
G = nx.DiGraph()

# Helper function to count words in the page content
def count_words(content):
    words = re.findall(r'\b\w+\b', content)
    return len(words)

# Function to extract valid links pointing to other performers
def extract_links(wikitext, performers):
    # Regular expression to find all Wikipedia links
    links = re.findall(r'\[\[(.*?)(?:\|(.*?))?\]\]', wikitext)
    valid_links = []
    
    for link, display_text in links:
        # Clean the link, replace underscores with spaces, and match against performers
        clean_link = link.replace('_', ' ').strip()
        if clean_link in performers:
            valid_links.append(clean_link)
    
    return valid_links

# Process all performer pages in the directory
for performer_file in os.listdir(SAVE_DIR):
    if performer_file.endswith('.txt'):
        performer_name = performer_file.replace('_', ' ').replace('.txt', '')
        
        # Open and read the wikitext for this performer
        with open(os.path.join(SAVE_DIR, performer_file), 'r', encoding='utf-8') as f:
            wikitext = f.read()
        
        # Extract links to other performers
        linked_performers = extract_links(wikitext, performers)
        
        # Count the number of words in the page content
        word_count = count_words(wikitext)
        
        # Add the node to the graph with the word count as an attribute
        G.add_node(performer_name, word_count=word_count)
        
        # Add directed edges from this performer to the performers they link to
        for linked_performer in linked_performers:
            G.add_edge(performer_name, linked_performer)

# Remove disconnected nodes from the graph
G.remove_nodes_from(list(nx.isolates(G)))

# Remove self-loops from the graph
G.remove_edges_from(nx.selfloop_edges(G))

# Remove multiple edges between the same nodes
G = nx.DiGraph(G)

# Output the updated number of nodes and edges
print(f"Total performers (nodes): {G.number_of_nodes()}")
print(f"Total links (edges): {G.number_of_edges()}")

Total performers (nodes): 1952
Total links (edges): 17678


In [8]:
import json
import networkx as nx

G_undirected = G.to_undirected()

# Load artist data from JSON file (assuming it's in the format {"artist_name": ["genre1", "genre2", ...]})
with open("C:/Users/nerea/Documents/MasterDTU/SocialGraphs_fall24/Projects/socialGraphs_fall24/genre_data.txt", "r") as file:
    genre_data = json.load(file)

# Extract artist names as a set for fast lookup
artist_names = set(genre_data.keys())

# Assuming 'G_undirected' is your existing NetworkX graph
# Create a subgraph that only includes nodes present in artist_names
filtered_G = G_undirected.subgraph([node for node in G_undirected.nodes if node in artist_names])

# Add genre information as a node attribute
for artist in filtered_G.nodes:
    # Assign the genres from genre_data to each artist node
    filtered_G.nodes[artist]['genres'] = genre_data[artist]

# Verify the genre mapping
print("Filtered graph attributes with genres:", filtered_G.nodes.data())

# Print summary of the filtered graph
print("Filtered graph has", filtered_G.number_of_nodes(), "nodes and", filtered_G.number_of_edges(), "edges.")

Filtered graph attributes with genres: [('Tanya Tucker', {'word_count': 6289, 'genres': ['country', 'outlaw country', 'country rock']}), ('Diamond Rio', {'word_count': 10259, 'genres': ['country', 'ccm']}), ('Reba McEntire', {'word_count': 21253, 'genres': ['country', 'gospel']}), ('Amie Comeaux', {'word_count': 889, 'genres': ['country']}), ('Michael Johnson (singer)', {'word_count': 2445, 'genres': ['folk', 'folk rock', 'country', 'soft rock']}), ('Poco (band)', {'word_count': 5094, 'genres': ['country rock', 'folk rock', 'soft rock']}), ('Great Plains (Tennessee band)', {'word_count': 1791, 'genres': ['country']}), ('Chely Wright', {'word_count': 12657, 'genres': ['country', 'americana', 'folk']}), ('Keith Urban', {'word_count': 10905, 'genres': ['country', 'country pop', 'country rock']}), ('Chad Brock', {'word_count': 2020, 'genres': ['country']}), ('The Jenkins', {'word_count': 554, 'genres': ['country']}), ('Craig Morgan', {'word_count': 5602, 'genres': ['country']}), ('Loretta 

In [None]:
# First partition: Each node is simply characterized by the first genre in its list of genres
partition = {node: data['genres'][0] for node, data in filtered_G.nodes.data()}

# Print the each communities created by the first partition and how many nodes are in each community in ordrer
from collections import Counter
community_sizes = Counter(partition.values())
print("Communities created by the first partition and the number of nodes in each community:")
print(community_sizes)


Communities created by the first partition and the number of nodes in each community:
Counter({'country': 1218, 'bluegrass': 51, 'americana': 50, 'rock': 44, 'pop': 44, 'folk': 36, 'country rock': 34, 'country pop': 30, 'country music': 25, 'alternative country': 24, 'rock and roll': 14, 'neotraditional country': 12, 'rockabilly': 12, 'southern rock': 11, 'progressive bluegrass': 8, 'country rap': 8, 'pop rock': 8, 'western swing': 7, 'progressive country': 7, 'hip hop': 7, 'blues': 6, 'outlaw country': 6, 'jazz': 5, 'texas country': 5, 'traditional pop': 5, 'folk rock': 5, 'roots rock': 5, 'country folk': 5, 'red dirt': 5, 'bluegrass music': 5, 'soul': 4, 'hard rock': 4, 'blues rock': 4, 'ameripolitan': 4, 'alternative rock': 4, 'soft rock': 3, 'ccm': 3, 'tejano': 3, 'contemporary christian': 3, 'heartland rock': 3, 'western': 3, 'old-time music': 3, 'rock music': 3, 'old-time': 2, 'new mexico music': 2, 'indie pop': 2, 'blue-eyed soul': 2, 'edm': 2, 'country and irish': 2, 'bubblegum

In [15]:
# Calculate L (total links) in the entire network
L = filtered_G.number_of_edges()

# Calculate modularity using the formula from chapter 9
modularity_value_genre = 0
for community in partition.values():
    subgraph = filtered_G.subgraph(community)
    
    # L_c: Count of edges within the community
    L_c = subgraph.number_of_edges()
    
    # k_c: Total degree of the nodes in this community
    k_c = sum(deg for node, deg in subgraph.degree())

    # Calculate modularity contribution for this community 
    # (L_C/L is the fraction of edges within the community, and k_c/(2L) is the expected fraction of edges in a random network)
    modularity_value_genre += ((L_c / L) - (k_c / (2 * L)) ** 2)

print("Modularity of the genre-based partition (formula book):", modularity_value_genre)

Modularity of the genre-based partition (formula book): 0.0


<font color='skyblue'>Answer 3 Part 1: **Calculate the matrix $D$ and discuss your findings**</font>

<font color='skyblue'>Answer 4 Part 1: **Plot the communities and comment on your results**</font>

# Part 2: TF-IDF to understand genres and communities 

<div style="border: 1px solid white; padding: 10px;">

The questions below  are based on Lecture 7, part 2, 4, 5, 6 (and a little bit on part 3).

* Explain the concept of TF-IDF in your own words and how it can help you understand the genres and communities.
* Calculate and visualize TF-IDF for the genres and communities.
* Use the matrix $D$ (Lecture 7, part 2) to dicusss the difference between the word-clouds between genres and communities.

</div>

# Part 3: Sentiment of the artists and communities

<div style="border: 1px solid white; padding: 10px;">

The questions below are based on Lecture 8

* Calculate the sentiment of the Artists pages (OK to work with the sub-network of artists-with-genre) and describe your findings using stats and visualization, inspired by the first exercise of week 8.
* Discuss the sentiment of the largest communities. Do the findings using TF-IDF during Lecture 7 help you understand your results?

</div>