# Project Assignment B
Link to git repository: https://github.com/ongiboy/computational_social_science

Group members:
* Christian Ong Hansen (s204109)
* Kavus Latifi Yaghin (s214601)
* Daniel Damkjær Ries (s214641)

Group members' contributions:
* Every task was made in collaboration by all members.

## 1. Motivation

* What is your dataset

Our dataset originates from The Movie Database (TMDB) and was collected through its API. The data comes from the "Popular People" tab on the webpage and covers the actors on the first 300 pages, with each page containing data for 20 actors. TMDB does not publicly disclose how it computes the popularity score it assigns to actors, but the score is generally understood to reflect factors such as page views, favorites, watchlists, and recent activity.

The data is structured into two main dataframes: one for actors and their attributes (name, ID, gender, age, birthplace, and filmography), and another for the movies featuring these actors (rating, popularity, genres, release date, and abstract for text analysis). This setup forms the basis for our actor collaboration network and analysis, which is essential for answering our research question: Based on collaborations among the most popular actors, do distinct communities form, and if so, what characterizes the most successful communities?

* Why did you choose this/these particular dataset(s)?

We chose The Movie Database (TMDB) as the data source because of our interest in exploring and analyzing the film industry through an actor collaboration network. TMDB has broad coverage of the film industry and is a well-known data source within the field, with details on actors, movies, genres, ratings, popularity, etc. The rate limit for the API (50 calls per second) was also an attractive factor.

We therefore expected that TMDB's API would give us access to up-to-date and reliable data on popular actors and their participation in movies, enabling us to explore actor interactions and movie trends effectively. In this way, the TMDB data source can contribute valuable insights to the broader understanding of the film industry landscape.

* What was your goal for the end user’s experience?

The goal for the end user's experience was to provide insights into the characteristics and dynamics of successful actor communities. By analyzing data on popular actors and their collaborations in a network, we sought to offer a user-friendly interface for exploring trends, identifying influential factors, and gaining a deeper understanding of what drives success in the film industry from an actor's perspective.


## 2. Basic stats
* Write about your choices in data cleaning and preprocessing

Using the TMDB API, information about popular actors can be retrieved with only an API key and a chosen number of pages, each containing 20 actors. 300 pages were chosen as a reasonable amount, which in theory would yield a raw dataset of 6000 actors but in practice returned 4389. These form the first dataframe, "df_actors". Since filters could not be applied directly in the API call, all data processing was performed after collection. The attributes collected for each actor were name, id, gender, birthday, place_of_birth and popularity.

Initially, all rows with missing values on the attributes of interest were removed and duplicate rows were removed. Since all the collected information about the actors is crucial for the analysis, this seemed a reasonable part of the cleaning. After this short processing, the actor dataframe would consist of 4048 actors. 

Using the retrieved actor IDs, it was now possible to get a complete list of the movies that each actor has played a role in as well as the movie ID - again using the TMDB API. This also laid the foundation for the second dataframe "df_movies" with the purpose of being our own little movie-database containing information about all the movies that the actors in the "df_actors" dataframe have been part of. Before any processing, the dataframe included 67587 movies.

With yet another API call using the movie IDs, the desired attributes were obtained: rating, popularity, genres, release date and movie abstract. After retrieval, all rows with missing values in any of these attributes were removed. Furthermore, we decided to focus only on "recent" movies, defined here as movies released from 2010 up to the date of collection (8 May 2024). This was also done to avoid too dense a network later on. After this processing, the movie dataset consisted of 26494 movies.
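The "recent movies" window described above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's own cell: the helper name `is_recent` and the sample rows are hypothetical, and missing or malformed dates are assumed to count as "not recent".

```python
from datetime import date

def is_recent(release_date, start=date(2010, 1, 1), end=date(2024, 5, 8)):
    """Return True if an ISO 'YYYY-MM-DD' release date lies inside [start, end]."""
    try:
        released = date.fromisoformat(release_date)
    except (TypeError, ValueError):
        # Assumption: missing or malformed dates are treated as not recent
        return False
    return start <= released <= end

# Hypothetical sample rows, only for illustration
sample_movies = [
    {"movie": "A", "release_date": "2009-12-31"},
    {"movie": "B", "release_date": "2010-01-01"},
    {"movie": "C", "release_date": "2024-05-08"},
    {"movie": "D", "release_date": "2025-01-01"},
    {"movie": "E", "release_date": ""},
]
recent_titles = [m["movie"] for m in sample_movies if is_recent(m["release_date"])]
print(recent_titles)
```

Both boundary dates are included in the window, which mirrors the `>=` and `<=` comparisons used in the filtering cell further down.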

The movies that were filtered out of the movies dataframe were also removed from the actor dataframe, so that the two dataframes stay consistent.

Further investigation of the actors dataframe showed that the "birthplace" column needed processing. First, the birthplace was reduced to the country of birth only. We then found that the same country could be spelled in several ways, for example in different languages or with stray symbols, which could lead to misleading findings. The birthplace column was therefore normalized so that each country of birth appears under a single name.
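The two steps in the paragraph above (keep the text after the last comma, then map spelling variants to one name) can be sketched as follows. This is a simplified illustration under stated assumptions: the function name `country_from_birthplace` and the small alias table are hypothetical, whereas the notebook's own cell further down relies on `pycountry` and a longer list of special cases.

```python
def country_from_birthplace(place, aliases):
    """Reduce a free-text birthplace to a single normalized country name."""
    # Keep only the text after the last comma, e.g. "London, England, UK" -> "UK"
    last_part = place.split(",")[-1].strip()
    # Drop dots so that "U.S.A." and "USA" are treated alike
    last_part = last_part.replace(".", "")
    # Map known variants to one canonical name, otherwise keep the text as-is
    return aliases.get(last_part, last_part)

# Hypothetical alias table, for illustration only
ALIASES = {
    "USA": "United States",
    "US": "United States",
    "UK": "United Kingdom",
    "Frankrike": "France",
}

examples = ["Washington, D.C., U.S.A.", "London, England, UK", "Paris, Frankrike", "Seoul, South Korea"]
normalized = [country_from_birthplace(p, ALIASES) for p in examples]
print(normalized)
```

A lookup table like this only covers variants that have been seen, so unknown spellings pass through unchanged and still need manual inspection.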

For the movies dataframe, the movie popularity had clear outliers. Around 99% of the movies had a popularity below 100, while a few values went all the way up to over 2000. We therefore capped the popularity at 100.
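As a small illustration of the capping step, the sketch below clamps a list of popularity values at 100 and reports the share of values below the cap. The helper names and the sample values are hypothetical; on a pandas column the same effect could be obtained with `Series.clip(upper=100)`.

```python
def cap_values(values, cap=100.0):
    """Replace every value above `cap` with `cap` and leave the rest unchanged."""
    return [min(v, cap) for v in values]

def share_below(values, cap=100.0):
    """Fraction of values that lie strictly below `cap`."""
    return sum(v < cap for v in values) / len(values)

# Hypothetical popularity scores, for illustration only
popularity = [3.2, 57.9, 12.0, 2150.4]
capped = cap_values(popularity)
print(capped, share_below(popularity))
```

Capping keeps the outlying movies in the dataset while limiting their influence on averages and plots.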

We wanted to assign a main genre to each actor. Each movie has a list of genres, and an actor's main genre is the genre that appears most often across the movies the actor has been part of. Since about 50% of all movies had the "Drama" genre, "Drama" was only counted if the movie had two or fewer genres in its genre list, to avoid all actors becoming "Drama" actors. We also found that the "Drama" genre was generally not very descriptive of a movie with more than two genres.
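The rule as worded in the paragraph above can be sketched per movie like this. It is an illustration of the described rule rather than a copy of the notebook's function; the name `main_genre`, the parameter `max_genres_for_drama` and the sample filmography are hypothetical.

```python
from collections import Counter

def main_genre(movies_genres, max_genres_for_drama=2):
    """Most frequent genre over an actor's movies.

    'Drama' is only counted for movies that list at most
    `max_genres_for_drama` genres.
    """
    counts = Counter()
    for genres in movies_genres:
        for genre in genres:
            if genre == "Drama" and len(genres) > max_genres_for_drama:
                continue  # skip 'Drama' when it is one of many genres
            counts[genre] += 1
    return counts.most_common(1)[0][0] if counts else None

# Hypothetical filmography: one genre list per movie
filmography = [
    ["Drama", "Action", "Thriller"],
    ["Action", "Adventure"],
    ["Drama", "Crime", "Mystery"],
    ["Drama"],
]
print(main_genre(filmography))
```

Without the rule, "Drama" would appear three times here and win; with it, "Drama" is counted once and "Action" becomes the main genre.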

Finally, the TMDB API provides no direct rating for actors, so we created our own measure: the mean rating of all the recent movies (2010-2024) that the actor has acted in.
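A minimal sketch of this actor rating, assuming a dictionary that maps movie titles to ratings (the name `actor_rating` and the sample data are hypothetical). Movies missing from the dictionary, for example because they were filtered out as too old, are simply skipped.

```python
from statistics import mean

def actor_rating(movie_titles, movie_ratings):
    """Mean rating over the actor's movies that are present in `movie_ratings`."""
    ratings = [movie_ratings[title] for title in movie_titles if title in movie_ratings]
    # Assumption: an actor without any rated movie gets no rating at all
    return mean(ratings) if ratings else None

# Hypothetical ratings of recent movies
ratings_by_title = {"Movie A": 6.0, "Movie B": 8.0}
print(actor_rating(["Movie A", "Movie B", "Old Movie"], ratings_by_title))
```

Because it is a plain mean, an actor with a single highly rated movie scores the same as one with many, which is worth keeping in mind when comparing actors.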

With the data cleaned and preprocessed, the actual network could be created. The edges of the network were weighted by the number of times two actors had collaborated. This led to a very dense network, so a threshold was set such that an edge only exists if the two actors had collaborated at least twice. The final network consisted of 4254 actors (nodes) and 23399 edges.
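The edge construction can be illustrated on a toy example: count, for every pair of actors, the number of shared movies and keep only pairs that meet the threshold. The function name `collaboration_edges` and the three made-up actors are assumptions for illustration; the notebook's own cell builds the same kind of counts and then loads them into a `networkx` graph.

```python
from collections import defaultdict
from itertools import combinations

def collaboration_edges(filmographies, min_weight=2):
    """Map each actor pair to its number of shared movies, keeping pairs >= min_weight."""
    # Invert the data: movie -> set of actors appearing in it
    movie_to_actors = defaultdict(set)
    for actor, movies in filmographies.items():
        for movie in movies:
            movie_to_actors[movie].add(actor)

    # Every pair of actors in the same movie gains one unit of weight
    weights = defaultdict(int)
    for cast in movie_to_actors.values():
        for pair in combinations(sorted(cast), 2):
            weights[pair] += 1

    return {pair: w for pair, w in weights.items() if w >= min_weight}

# Hypothetical filmographies
toy = {"Ann": ["M1", "M2", "M3"], "Bob": ["M1", "M2"], "Cy": ["M3"]}
print(collaboration_edges(toy))
```

Sorting each cast before pairing ensures that (Ann, Bob) and (Bob, Ann) are counted as the same undirected edge.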

* Write a short section that discusses the dataset stats

As mentioned, our main dataset consists of two dataframes, one with actor information and one with movie information.
These dataframes have the following stats:

Actor dataframe: 4048 rows of size 4.306 KB

Movie Dataframe: 23399 rows of size 9.569 KB

From these, a network of actor collaborations was built, with the following stats:
4048 nodes and 15116 edges, of size 3.177 KB


### Data retrieval and cleaning/preprocessing

Importing libraries

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import concurrent.futures
import itertools
import networkx as nx
from collections import defaultdict
from functools import partial
from tqdm import tqdm
from requests_futures.sessions import FuturesSession
import matplotlib.pyplot as plt
import ast
import numpy as np
from statistics import mode
import json
from threading import Lock
import matplotlib.cm as cm
import pycountry
import re
import netwulf as nw
import matplotlib.patches as mpatches
import community.community_louvain as community_louvain
from IPython.display import Image
from nltk.corpus import stopwords
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
from nltk.tokenize import word_tokenize, MWETokenizer
from nltk.stem import PorterStemmer
from collections import Counter
import math
from wordcloud import WordCloud


headers = {"accept": "application/json"}
api_key = "f5813332cb558d374cbcb057ea2fc48b"

Functions used to make API calls

In [None]:
counter = 0
lock = Lock()

def movie_title_and_IDs_from_actor_ID(actor_id, session):
    global counter
    url = f"https://api.themoviedb.org/3/person/{actor_id}/movie_credits?api_key={api_key}"
    response = session.get(url)
    data = response.json()

    with lock:
        counter += 1
        if counter % 1000 == 0:
            print(f"Processed {counter} actors")
    # Return two parallel lists: the titles and the IDs of the movies the actor is credited in
    return [movie['title'] for movie in data['cast']], [movie['id'] for movie in data['cast']]

def actor_info_from_page(page, session):
    url = f"https://api.themoviedb.org/3/person/popular?api_key={api_key}&page={page}"
    response = session.get(url)
    data = response.json()
    people = []
    for person in data['results']:
        person_url = f"https://api.themoviedb.org/3/person/{person['id']}?api_key={api_key}"
        person_response = session.get(person_url)
        person_data = person_response.json()
        people.append((person['name'], person['id'], person_data['gender'], person_data['birthday'], person_data['place_of_birth'], person_data['popularity']))
    return people

In [None]:
counter = 0
lock = Lock()

def fetch(session, url):
    global counter
    future = session.get(url, headers=headers)
    with lock:
        counter += 1
        if counter % 10000 == 0:
            print(f"Processed {counter} movies")
    
    return future

def movie_info_from_movie_ID(urls):
    with FuturesSession() as session:
        futures = [fetch(session, url) for url in urls]
        responses = [future.result().json() for future in tqdm(futures, total=len(futures))]
    return responses

Retrieve actor information (name, id, gender, birthday, birthplace, popularity)

In [None]:
with requests.Session() as session:
    session.headers.update(headers)
    with concurrent.futures.ThreadPoolExecutor() as executor:
        pages = list(range(1, 301))
        fetch_page_with_session = partial(actor_info_from_page, session=session)
        people = list(executor.map(fetch_page_with_session, pages))

actor_names, actor_ids, actor_genders, actor_birthdays, actor_birthplaces, actor_popularities = zip(*itertools.chain(*people))

Creating the Actor Dataframe and initial preprocessing

In [None]:
# Create a DataFrame with 'actors' column
df_actors = pd.DataFrame(actor_names, columns=['actor'])

# Add 'ids', 'genders', and 'birthplaces' columns to the DataFrame
df_actors['actor_id'] = actor_ids
df_actors['gender'] = actor_genders
df_actors['age'] = actor_birthdays # This is not the age, but the birthday
df_actors['birthplace'] = actor_birthplaces
df_actors['popularity'] = actor_popularities

# Change birthplaces so that it only contains the country (text after the last comma)
df_actors['birthplace'] = df_actors['birthplace'].str.split(',').str[-1]

# Change birthday to age
df_actors['age'] = pd.to_datetime(df_actors['age'], errors='coerce')
df_actors['age'] = (pd.to_datetime('today') - df_actors['age']).dt.days // 365

# Drop rows with missing values
df_actors.dropna(inplace=True)

# Drop duplicates
df_actors.drop_duplicates(subset='actor_id', inplace=True)
#reset index
df_actors.reset_index(drop=True, inplace=True)

Retrieve movie titles and IDs

In [None]:
# Fetch movies for each actor and add them to 'movies' and 'movie_IDs' columns
with requests.Session() as session:
    session.headers.update(headers)
    movie_titles_and_ids_from_actor_ID_with_session = partial(movie_title_and_IDs_from_actor_ID, session=session)
    with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
        movies_and_ids = list(executor.map(movie_titles_and_ids_from_actor_ID_with_session, df_actors['actor_id']))

# Add 'movies' column to df_actors
df_actors['movies'] = [x[0] for x in movies_and_ids]
df_actors['movie_IDs'] = [x[1] for x in movies_and_ids]

# Flatten movies_and_ids into two separate lists
movies = [item for sublist in [x[0] for x in movies_and_ids] for item in sublist]
ids = [item for sublist in [x[1] for x in movies_and_ids] for item in sublist]

# Convert the lists into a list of dictionaries
movies_and_ids_dict = [{'movie': movie, 'movie_ID': id} for movie, id in zip(movies, ids)]

# Convert the list of dictionaries into a DataFrame
df_movies = pd.DataFrame(movies_and_ids_dict)
df_movies = df_movies.drop_duplicates(subset='movie_ID')
df_movies.reset_index(drop=True, inplace=True)

Collect movie information through the API (rating, popularity, genre, release date, abstract)

In [None]:
# Prepare the URLs
urls = [f"https://api.themoviedb.org/3/movie/{id}?api_key={api_key}" for id in df_movies["movie_ID"]]
print('urls are prepared')

# Fetch all responses
responses = movie_info_from_movie_ID(urls)
print('all responses are fetched')

# Process the responses
for i, response in enumerate(responses):
    # if i % 100 == 0:
    #     print(f"Processing response {i+1}/{len(responses)}")
    if isinstance(response, Exception):
        print(f"Error: {response}")
        continue  # Skip this response
    # Process the response here

# Initialize empty lists to store the data
ratings = []
popularities = []
genres = []
release_dates = []
abstracts = []

# Process the responses one by one
for data in tqdm(responses):
    ratings.append(data.get('vote_average'))
    popularities.append(data.get('popularity'))
    genres.append([genre['name'] for genre in data.get('genres', [])])
    release_dates.append(data.get('release_date'))
    abstracts.append(data.get('overview'))

# Assign the lists to the DataFrame columns
df_movies['rating'] = ratings
df_movies['popularity'] = popularities
df_movies['genres'] = genres
df_movies['release_date'] = release_dates
df_movies['abstract'] = abstracts

Initial preprocessing of Movies Dataframe

In [None]:
# Drop duplicates
df_movies.drop_duplicates(subset='movie', inplace=True)

# Drop rows with missing values
df_movies.dropna(inplace=True)

# Remove rows with empty lists in 'genres' column
df_movies = df_movies[df_movies['genres'].apply(lambda x: len(x) > 0)]

# Remove rows with empty release dates
df_movies = df_movies[df_movies['release_date'].apply(lambda x: len(x) > 0)]

# Drop row if abstract is missing
df_movies = df_movies[df_movies['abstract'].apply(lambda x: len(x) > 0)]

# Drop row if rating is missing
df_movies = df_movies[df_movies['rating'].apply(lambda x: x > 0)]

# Drop row if popularity is missing
df_movies = df_movies[df_movies['popularity'].apply(lambda x: x > 0)]

# Reset index
df_movies.reset_index(drop=True, inplace=True)

Remove old and future movies

In [None]:
df_movies

In [None]:
# Combine the two dataframes so that each actor is associated with the movies they have acted in
df_actors_filtered = df_actors.copy()
df_movies_filtered = df_movies.copy()
df_movies_filtered.rename(columns={'movie_ID': 'movie_IDs'}, inplace=True)
df_movies_filtered.rename(columns={'popularity': 'movie_popularity'}, inplace=True)
df_actors_filtered.drop(columns=['movies'], inplace=True)

df_actors_filtered = df_actors_filtered.explode('movie_IDs').reset_index(drop=True)

df_actors_movies = df_actors_filtered.merge(df_movies_filtered, on='movie_IDs', how='inner')

# Remove all rows where the release date is before 2010 or after 2024 (to avoid movies that have not been released yet)
df_actors_movies = df_actors_movies[(df_actors_movies['release_date'] >= '2010-01-01') & (df_actors_movies['release_date'] <= '2024-05-08')]

# Collapse the actor column so there is only one row per actor, with the movies and movie_IDs stored in lists
df_actors_filtered = df_actors_movies.groupby('actor').agg({'actor_id': 'first',
                                                             'gender': 'first',
                                                             'birthplace': 'first',
                                                             'age': 'first',
                                                             'popularity': 'first',
                                                             'movie': list, 
                                                             'movie_IDs': list}).reset_index()

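# After the merge, 'popularity' is the actor's popularity and 'movie_popularity' is the movie's,
# so the movie dataframe must be rebuilt from 'movie_popularity'
df_movies_filtered = df_actors_movies.drop_duplicates(subset=['movie']).groupby('movie').agg({'movie_IDs': 'first',
                                                                                               'rating': 'first',
                                                                                               'movie_popularity': 'first',
                                                                                               'genres': 'first',
                                                                                               'release_date': 'first',
                                                                                               'abstract': 'first'}).reset_index()

df_actors_filtered.rename(columns={'movie': 'movies'}, inplace=True)
df_movies_filtered.rename(columns={'movie_IDs': 'movie_ID', 'movie_popularity': 'popularity'}, inplace=True)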

Cleaning the birthplace column, to avoid having the same country spelled differently and removing symbols

In [None]:
# Clean birthplace column in actor dataframe
len_bef_clean = len(df_actors_filtered['birthplace'].unique())

country_names = [country.name for country in pycountry.countries]

def normalize_country_name(name):
    # strip name
    name = name.strip()
    # remove '.' and ']' characters
    name = name.replace('.', '').replace(']', '')

    if "UK" in name or "İngiltere" in name:
        return "United Kingdom"
    try:
        # Try to get the country object
        country = pycountry.countries.get(name=name)
        if country is not None:
            # If the country object is found, return the official name
            return country.name
        else:
            # If the country object is not found, try to find it by its common name
            country = pycountry.countries.search_fuzzy(name)
            return country[0].name
    except LookupError:
        # Standardizing names
        for country_name in country_names:
            if country_name in name:
                name = country_name

        # Fixing abbreviations and weird instances
        if "Russia" in name:
            return "Russian Federation"
        elif "Türkiye" in name or "Turkey" in name:
            return "Türkiye"
        elif "USA" in name or " US" in name or "United States" in name:
            return "United States"
        elif "Korea" in name:
            return "Korea, Republic of"
        elif "Czech" in name:
            return "Czechia"
        # Hardcoded; a translation package could be used, but there are only a few occurrences
        elif "TX" in name:
            return "United States"
        elif "Frankrike" in name:
            return "France"
        elif "Afrique du Sud" in name:
            return "South Africa"
        elif "Irlanda" in name:
            return "Ireland"
        elif "中国" in name or "中华民国" in name or "重庆" in name or "南京" in name:
            return "China"
        
        
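        # Updating from old names, e.g. "... [now Ukraine" (the closing ']' was stripped above)
        if "now" in name:
            # A lazy group at the end of a pattern matches the empty string, so use a greedy one
            match = re.search(r'\[now (.+)', name)
            if match:
                return match.group(1).strip()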

        # If the country is not found, return the original name
        return name

# Normalize the country names in the DataFrame
df_actors_filtered['birthplace'] = df_actors_filtered['birthplace'].apply(normalize_country_name)

len_aft_clean = len(df_actors_filtered['birthplace'].unique())
print("length before cleaning", len_bef_clean)
print("length after cleaning", len_aft_clean)

Plot of popularity

In [None]:
# Calculate the histogram
counts, bin_edges = np.histogram(df_actors_filtered['popularity'], bins=100)

# Add a small constant to counts to avoid log(0)
counts = counts + 1e-10

# Plot the histogram
plt.plot(bin_edges[:-1], counts)

# Set the scale of both axes to logarithmic
plt.xscale('log')
plt.yscale('log')

plt.show()

In [None]:
plt.hist(df_actors_filtered['popularity'], bins=100)
plt.show()

Plot of age distribution

In [None]:
# Plot the age distribution
df_actors_filtered['age'].hist(bins=30, edgecolor='white')
plt.title('Age Distribution')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

Plot of top 10 actor birthplaces

In [None]:
# Get the counts of each unique value in the 'birthplace' column
birthplace_counts = df_actors_filtered['birthplace'].value_counts()

# Calculate the percentage of each birthplace out of the total number of actors
birthplace_percentages = (birthplace_counts / df_actors_filtered['birthplace'].count()) * 100

# Get the top 10 birthplaces
top_10_birthplaces = birthplace_percentages[:10]

# Create a color map
colors = cm.rainbow(np.linspace(0, 1, len(top_10_birthplaces)))

# Create a bar plot of the top 10 birthplaces with different colors for each bar
top_10_birthplaces.plot(kind='bar', color=colors, edgecolor='black')

plt.title('Top 10 Birthplaces of Actors')
plt.xlabel('Birthplace')
plt.ylabel('Percentage of Actors (%)')

plt.show()

print("The number of unique birthplaces:",len(df_actors_filtered['birthplace'].unique()))
# Get the birthplaces with a count of 1
birthplaces_with_count_1 = birthplace_counts[birthplace_counts == 1]

print("Number of countries associated with only one actor in our dataset:", len(birthplaces_with_count_1))

* Note: The popular-actor data we retrieved from TMDB is heavily biased towards the USA (over 50% of all actors). Therefore most of the movies are American, and the actors from other countries are biased towards the American movie industry.

* It is also worth noting that 44 of the 106 unique birthplaces in our dataset are associated with only one actor, which is also a factor behind the unexpectedly low assortativity for the birthplace attribute.

* To fix this or make the data more representative, we could scrape other websites to build a less biased dataset.

Finding the main genre for the popular actors

In [None]:
# Function to find the genre counts and main genre of an actor
def find_genres_and_main_genre(movies_list, remove_drama=False): # nearly half of all movies have drama as genre, so not descriptive
    # Get the genres of the movies
    genres = [genre for movie in movies_list for genre in movie_genres.get(movie, [])]
    # Count the frequency of each genre
    genre_counts = pd.Series(genres).value_counts()
    # Remove the 'Drama' genre if specified
    if remove_drama and len(genre_counts) > 2:
        genre_counts.drop('Drama', errors='ignore', inplace=True)
        #genre_counts.drop('Thriller', errors='ignore', inplace=True)
    
    # Return the genre counts as a dictionary and the main genre
    return genre_counts.to_dict(), genre_counts.idxmax() if not genre_counts.empty else None

movie_genres = df_movies_filtered.set_index('movie')['genres'].to_dict()
# Apply the function to the 'movies' column of the actors DataFrame and create two new columns
df_actors_filtered['genres'], df_actors_filtered['main_genre'] = zip(*df_actors_filtered['movies'].apply(find_genres_and_main_genre, remove_drama=True))

Actor popularity and rating

In [None]:
# Function to find the movie ratings and the average rating of an actor
def find_actor_rating(movies_list):
    # Get the ratings for each movie
    ratings = [movie_ratings[movie] for movie in movies_list if movie in movie_ratings]
    avg_rating = np.mean(ratings)

    return ratings, avg_rating

movie_ratings = df_movies_filtered.set_index('movie')['rating'].to_dict()

# Apply the function to the 'movies' column of the actors DataFrame and create two new columns
df_actors_filtered['ratings'], df_actors_filtered['avg_rating'] = \
        zip(*df_actors_filtered['movies'].apply(find_actor_rating))

In [None]:
# save the dataframes to csv files
df_actors_filtered.to_csv('data/actors.csv', index=False)
df_movies_filtered.to_csv('data/movies.csv', index=False)

### Creating the network

In [None]:
# load the dataframes from csv files
df_actors_filtered = pd.read_csv('data/actors.csv', converters={'movies': ast.literal_eval, 'movie_IDs': ast.literal_eval, 'ratings': ast.literal_eval, 'genres': ast.literal_eval})
df_movies_filtered = pd.read_csv('data/movies.csv', converters={'genres': ast.literal_eval})

In [None]:
# Step 1: Prepare the Data
movie_to_actors = defaultdict(list)

for _ , row in df_actors_filtered.iterrows():
    actor = row['actor']
    movies = row['movies']
    for movie in movies:
        movie_to_actors[movie].append(actor)

edges = defaultdict(int)
for actors in movie_to_actors.values():
    for pair in itertools.combinations(sorted(actors), 2):
        edges[pair] += 1

# Step 2: Create the Graph
G = nx.Graph()

# Add nodes with attributes
for _ , row in df_actors_filtered.iterrows():
    G.add_node(row['actor'], gender=row['gender'], age=row['age'], birthplace=row['birthplace'], main_genre=row['main_genre'],
               avg_rating=row['avg_rating'], popularity=row['popularity'])

# Add edges with weights
for edge, weight in edges.items():
    # Only add edges with weight >= 2 to get a less dense graph
    if weight >= 2:
        G.add_edge(*edge, weight=weight)

In [None]:
# Find number of nodes and edges
N = G.number_of_nodes()
L = G.number_of_edges()

print("Number of nodes: ", N)
print("Number of edges: ", L)

## 3. Tools, theory and analysis

* Describe which network science tools and data analysis strategies you’ve used, how those network science measures work, and why the tools you’ve chosen are right for the problem you’re solving.

Degree analysis: 

In the first part of the network analysis, the focus is on node degrees. The degree of a node is the number of edges attached to it. Because the network is undirected and its edges are weighted by the number of collaborations, a weighted degree can also be computed, in which repeated collaborations between the same two actors are counted multiple times. The degree distribution is computed and plotted to see whether it is heavy-tailed, meaning that most nodes have few connections while a small number of hubs have a degree significantly higher than the average and median of the network.
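The difference between degree and weighted degree can be made concrete with a small sketch that works directly on a dictionary of weighted edges. The function name `degree_and_strength` and the toy edges are hypothetical; in the notebook itself the same quantities come from `G.degree()` and `G.degree(weight='weight')`.

```python
from collections import defaultdict

def degree_and_strength(weighted_edges):
    """Return (degree, weighted degree) dictionaries for an undirected weighted edge list."""
    degree = defaultdict(int)
    strength = defaultdict(int)
    for (u, v), weight in weighted_edges.items():
        for node in (u, v):
            degree[node] += 1         # one more distinct collaborator
            strength[node] += weight  # all collaborations, repeated ones included
    return dict(degree), dict(strength)

# Hypothetical weighted edges: (actor, actor) -> number of shared movies
toy_edges = {("A", "B"): 5, ("A", "C"): 2, ("A", "D"): 2, ("B", "C"): 3}
deg, strg = degree_and_strength(toy_edges)
print(deg)
print(strg)
```

Note that nodes without any edge never appear in an edge list, so isolates have to be added separately (with degree 0) before computing averages or medians.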

Regime:

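Determining which regime the network belongs to is important for understanding and characterizing it. Following random network theory, the average degree <k> is compared with 1 and with ln(N), and the link probability p = 2L/(N(N-1)) is compared with ln(N)/N, where N is the number of nodes and L the number of edges. These simple calculations indicate whether the network is expected to be subcritical, supercritical or connected, and which characteristics follow from that.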
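As a sketch of the regime check, the function below classifies a network by comparing its average degree with 1 and ln(N), using the regime names from random network theory. The function name and the node and edge counts in the example are hypothetical and not taken from our network.

```python
import math

def random_network_regime(n_nodes, n_edges):
    """Return (average degree, link probability, regime) for N nodes and L edges."""
    avg_degree = 2 * n_edges / n_nodes
    p = 2 * n_edges / (n_nodes * (n_nodes - 1))
    if avg_degree < 1:
        regime = "subcritical"      # only small, isolated clusters
    elif avg_degree == 1:
        regime = "critical"         # a giant component is about to emerge
    elif avg_degree <= math.log(n_nodes):
        regime = "supercritical"    # giant component plus many small clusters
    else:
        regime = "connected"        # giant component absorbs nearly all nodes
    return avg_degree, p, regime

# Hypothetical sizes, for illustration only
for n_edges in (400, 500, 2000, 5000):
    print(n_edges, random_network_regime(1000, n_edges))
```

The classification describes what a random network with the same N and L would look like; a real network, especially a thresholded one, can deviate from it.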

Visualization of the network:

Further insights can be obtained by simply visualizing the network. First, the naturally formed communities and the specific hubs in the network can be identified by sizing the nodes by degree. Furthermore, the nodes can be colored according to a specific node attribute, e.g. country of birth, to see whether connections in the network follow this attribute.

- Assortativity (Numeric: degree, age, rating, popularity - Categorical: genre, birthplace, gender)
The assortativity coefficient measures the tendency of nodes to link to other nodes with the same (or, for numeric values, similar) attribute values, such as degree, age or gender. It ranges from -1 (perfectly disassortative) to 1 (perfectly assortative). A positive coefficient means that nodes with similar attribute values tend to connect, and a negative coefficient means that nodes with dissimilar values tend to connect. For example, with gender as the attribute, in a perfectly assortative network males would only connect to other males, and in a perfectly disassortative network males would only connect to females (assuming a binary gender attribute).

- Closeness/eigenvector centrality
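
The gender example given for assortativity can be checked with a small hand-written version of the categorical assortativity coefficient, r = (sum_i e_ii - sum_i a_i^2) / (1 - sum_i a_i^2), where e_ij is the fraction of edge ends joining category i to category j and a_i is the fraction of edge ends attached to category i. This is a sketch for illustration with hypothetical node names; in practice `networkx` provides `attribute_assortativity_coefficient` for the same purpose.

```python
from collections import defaultdict

def categorical_assortativity(edges, attribute):
    """Assortativity coefficient of an undirected graph for a categorical node attribute."""
    # Count edge ends between categories; every undirected edge is counted in both directions
    e = defaultdict(float)
    for u, v in edges:
        a, b = attribute[u], attribute[v]
        e[(a, b)] += 1
        e[(b, a)] += 1
    total = sum(e.values())
    categories = {category for pair in e for category in pair}

    # Fraction of edge ends that connect equal categories
    same = sum(e.get((c, c), 0.0) for c in categories) / total
    # Fraction of edge ends attached to each category
    ends = {c: sum(e.get((c, d), 0.0) for d in categories) / total for c in categories}
    expected = sum(x * x for x in ends.values())
    # Undefined (division by zero) if all edge ends belong to a single category
    return (same - expected) / (1 - expected)

gender = {"m1": "M", "m2": "M", "f1": "F", "f2": "F"}
print(categorical_assortativity([("m1", "m2"), ("f1", "f2")], gender))  # males link only to males
print(categorical_assortativity([("m1", "f1"), ("m2", "f2")], gender))  # males link only to females
```

The two toy graphs correspond to the perfectly assortative and perfectly disassortative cases described in the text.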


* How did you use the tools to understand your dataset?
##### Degree analysis
After the network was created, the degrees of the nodes were investigated. The average, median, mode, minimum and maximum degree were computed for both the weighted and the unweighted network. With weighted edges, the average, median and maximum degree increase significantly, which is expected, as popular actors may tend to collaborate multiple times with other popular actors. The mode degree remains at 0, which is caused by the high number of isolates in the network. Based on the average and median degree alone, both the unweighted and the weighted degree distribution appear heavy-tailed, as the median is significantly smaller than the average degree. This is further illustrated by two plots, where the distribution is shown on a normal scale and on a log-log scale. On the normal scale the heavy tail is clearly visible, and it is further supported by the roughly linear trend on the log-log scale.

Next, the highest-degree nodes were found. The top 5 actors by unweighted degree are the popular actors with the most unique collaborators, whereas the top 5 by weighted degree are the popular actors with the most collaborations in the network overall, counting both unique and repeated collaborations. Samuel L. Jackson was at the top of both lists and is thus the most collaborative popular actor in the network.
By looking at the most repeated edges (edges with the highest weight), the "best friends" of the network were found. The top two collaborators in the network are Adam Copeland and Michael Hickenbottom, two wrestlers from the American WWE. It makes sense that these two have many collaborations, as they have competed against each other for many years.


* Talk about how you’ve worked with text, including regular expressions, unicode, etc.

##### Degrees and weights

In [None]:
# Compute node degrees
degrees = dict(G.degree())
weighted_degrees = dict(G.degree(weight='weight'))

# Compute the specified values
average_degree = np.mean(list(degrees.values()))
median_degree = np.median(list(degrees.values()))
try:
    mode_degree = mode(list(degrees.values())) # Degree value that occurs with highest frequency among the nodes
except Exception:  # older Python versions raise StatisticsError when there is no unique mode
    mode_degree = "No unique mode"
min_degree = min(degrees.values())
max_degree = max(degrees.values())

# Same calculations but for STRENGTH (WEIGHTED DEGREE)
average_weighted_degree = np.mean(list(weighted_degrees.values()))
median_weighted_degree = np.median(list(weighted_degrees.values()))
try:
    mode_weighted_degree = mode(list(weighted_degrees.values()))
except Exception:  # older Python versions raise StatisticsError when there is no unique mode
    mode_weighted_degree = "No unique mode"
min_weighted_degree = min(weighted_degrees.values())
max_weighted_degree = max(weighted_degrees.values())

print("Degree Analysis:")
print(f"Average Degree: {average_degree}")
print(f"Median Degree: {median_degree}")
print(f"Mode Degree: {mode_degree}")
print(f"Minimum Degree: {min_degree}")
print(f"Maximum Degree: {max_degree}")

print("\nWeighted Degree Analysis:")
print(f"Average Weighted Degree: {average_weighted_degree}")
print(f"Median Weighted Degree: {median_weighted_degree}")
print(f"Mode Weighted Degree: {mode_weighted_degree}")
print(f"Minimum Weighted Degree: {min_weighted_degree}")
print(f"Maximum Weighted Degree: {max_weighted_degree}")

In [None]:
# Sort the edges by weight in descending order
sorted_edges = sorted(G.edges.data(), key=lambda x: x[2]['weight'], reverse=True)
most_connected_actors = sorted(degrees.items(), key=lambda x: x[1], reverse=True)
most_connected_actors_weighted = sorted(weighted_degrees.items(), key=lambda x: x[1], reverse=True)

print("Top 5 actors with the highest degree:")
for i, actor in enumerate(most_connected_actors[:5], start=1):
    print(f"{i}: Degree: {actor[1]},\t ({G.nodes[actor[0]]['main_genre']})\t {actor[0]}")

print("\nTop 5 actors with the highest weighted degree:")
for i, actor in enumerate(most_connected_actors_weighted[:5], start=1):
    print(f"{i}: Weighted degree: {actor[1]},\t ({G.nodes[actor[0]]['main_genre']})\t {actor[0]}")

print("\nTop 5 edges with the highest weight:")
for i, edge in enumerate(sorted_edges[:5], start=1):
    print(f"{i}: Weight: {edge[2]['weight']}, ({G.nodes[edge[0]]['main_genre']} - {G.nodes[edge[1]]['main_genre']}) \t {edge[0]} - {edge[1]}")

The last list shows the five pairs of actors who have appeared together most often, as the weight of an edge represents the number of movies in which two actors (nodes) have co-acted. 
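
To make the meaning of edge weight and weighted degree (strength) concrete, the sketch below builds a tiny co-acting network from scratch. The movie titles and actor names are hypothetical placeholders, not TMDB data, and the threshold of 2 mirrors the one we applied to our network.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: movie title -> cast list (not taken from TMDB)
casts = {
    "Movie A": ["Actor 1", "Actor 2", "Actor 3"],
    "Movie B": ["Actor 1", "Actor 2"],
    "Movie C": ["Actor 2", "Actor 3"],
    "Movie D": ["Actor 1", "Actor 2"],
}

# Edge weight = number of movies in which a pair of actors appears together
edge_weights = Counter()
for cast in casts.values():
    for pair in combinations(sorted(set(cast)), 2):
        edge_weights[pair] += 1

# Keep only repeated collaborations (weight >= 2)
strong_edges = {pair: w for pair, w in edge_weights.items() if w >= 2}

# Strength (weighted degree) = sum of the weights of a node's remaining edges
strength = Counter()
for (a, b), w in strong_edges.items():
    strength[a] += w
    strength[b] += w

top_pair, top_weight = edge_weights.most_common(1)[0]
print("Most repeated edge:", top_pair, "with weight", top_weight)
print("Strength per actor:", dict(strength))
```

In this toy example the pair that shares three movies becomes the heaviest edge, while the pair that shares only one movie is dropped by the threshold.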

In [None]:
# Calculate degree distribution
degree_sequence = sorted([d for n, d in G.degree()], reverse=False)  # sort in ascending order
degree_count = nx.degree_histogram(G)

fig, axs = plt.subplots(1, 2, figsize=(16, 6))

# degree distribution
axs[0].bar(range(len(degree_count)), degree_count, align='center')
axs[0].axvline(average_degree, color='r', linestyle='-', label='Average Degree')
axs[0].set_xlabel('Degree')
axs[0].set_ylabel('Count')
axs[0].set_title('Degree Distribution')
axs[0].legend()

# loglog degree distribution
axs[1].bar(range(len(degree_count)), degree_count, align='center')
axs[1].set_yscale('log')
axs[1].set_xscale('log')
axs[1].set_xlabel('Degree')
axs[1].set_ylabel('Count')
axs[1].set_title('Degree Distribution (log scale)')
axs[1].axvline(average_degree, color='r', linestyle='-', label='Average Degree')
axs[1].legend()

plt.show()

In [None]:
# Compute the probability that makes the expected number of edges equal to the actual number of edges in the graph.
p = 2*L/(N*(N-1))
print("Prob:",p)

# Compute the natural logarithm of N
ln_N = np.log(N)

print(f"ln(N): {ln_N:.2f} < k: {average_degree:.2f}\nAnd p: {p:.5f} > ln(N)/N: {ln_N/N:.5f}")

The average degree is 10.22, so <k> > 1. Since <k> is also larger than ln(N), and correspondingly p > ln(N)/N, a random network with the same N and p falls into the connected regime, above the critical threshold. However, as we have thresholded the network by removing edges with a weight less than 2, our network is not fully connected.
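
The regime check above can be written as a small helper. This is only a sketch: the function name `random_network_regime` and the example sizes are our own illustrative choices, and the regime boundaries are the standard ones for Erdős-Rényi random networks.

```python
import math

def random_network_regime(n_nodes, n_edges):
    """Classify the random-network regime implied by N nodes and L edges."""
    avg_degree = 2 * n_edges / n_nodes              # <k> = 2L / N
    p = 2 * n_edges / (n_nodes * (n_nodes - 1))     # p = 2L / (N(N-1))
    if avg_degree < 1:
        regime = "subcritical"
    elif avg_degree == 1:
        regime = "critical"
    elif avg_degree <= math.log(n_nodes):
        regime = "supercritical"
    else:
        regime = "connected"
    return avg_degree, p, regime

# Hypothetical sizes, chosen only for illustration (not the values of our network)
avg_degree, p, regime = random_network_regime(n_nodes=1000, n_edges=5000)
print(f"<k> = {avg_degree:.2f}, p = {p:.5f}, ln(N)/N = {math.log(1000) / 1000:.5f} -> {regime}")
```

Note that the regime describes what is expected of a random network with the same N and p; a real network with the same average degree can still contain isolated nodes or several components.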

In [None]:
# Plot with netwulf
# Style configuration
config = {
    'zoom': 1,
    'node_charge': -40,
    'node_gravity': 0.8,
    'link_distance': 50,
    'link_distance_variation': 1,
    'node_collision': True,
    'wiggle_nodes': False,
    'freeze_nodes': False,

    'node_size': 15,
    'node_stroke_width': 0.3,
    'node_size_variation': 0.9,
    'display_node_labels': False,
    'scale_node_size_by_strength': True,

    'link_width': 2,
    'link_width_variation': 2,
    'link_alpha': 0.05,

}

G_plot_nation = G.copy()


for k, v in G_plot_nation.nodes(data=True):
    v['group'] = v['birthplace']; del v['birthplace']


# Keep the real edge weights: with 'scale_node_size_by_strength' enabled,
# node sizes and link widths then reflect the actual number of collaborations
# (no random 'size' or 'weight' values are assigned)


network, config = nw.visualize(G_plot_nation, config=config)

In [None]:
# Find the color of all nodes in the network and find the top 10 most used colors
node_colors = [node['color'] for node in network['nodes']]

# Count the frequency of each color
color_counts = pd.Series(node_colors).value_counts()

# Get the top 10 most used colors
top_10_colors = color_counts.head(10)

# Create color dictionary
color_dict = {}

# Go through the nodes and assign the color to the birthplace
for color in top_10_colors.index:
    for node in network['nodes']:
        if node['color'] == color:
            color_dict[color] = df_actors_filtered[df_actors_filtered['actor']==node['id']]['birthplace'].values[0]
            break

# Print the color dictionary
for color, birthplace in color_dict.items():
    print(f"Color: {color}, Birthplace: {birthplace}")

In [None]:
patches = [mpatches.Patch(color=color, label=f"{color_dict[color]}") for color in color_dict]

plt.figure(figsize=(5,2))
plt.legend(handles=patches, loc='center', frameon=False)
plt.axis('off')
plt.show()

In [None]:
fig,ax = nw.draw_netwulf(network)
plt.legend(handles=patches, loc='upper left', frameon=False)
fig.set_size_inches(12, 8)

In [None]:
# Load the saved png-file "country_graph3.png" and show it in the notebook
Image(filename='country_graph3.png')

In [None]:
# Calculate the degree assortativity coefficient
degree_assortativity = nx.degree_assortativity_coefficient(G)

# Calculate the attribute assortativity coefficient
country_assortativity = nx.attribute_assortativity_coefficient(G, 'birthplace')

gender_assortativity = nx.attribute_assortativity_coefficient(G, 'gender')

# Calculate the numeric assortativity coefficient
age_assortativity = nx.numeric_assortativity_coefficient(G, 'age')

# Calculate the assortativity for main_genre
main_genre_assortativity = nx.attribute_assortativity_coefficient(G, 'main_genre')

rating_assortativity = nx.numeric_assortativity_coefficient(G, 'avg_rating')
popularity_assortativity = nx.numeric_assortativity_coefficient(G, 'popularity')

print(f"Categorical:\nGenre assortativity: {main_genre_assortativity}, \nBirth place assortativity: {country_assortativity}, \nGender assortativity: {gender_assortativity}")

print(f"\nNumeric:\nDegree assortativity: {degree_assortativity}, \nAge assortativity: {age_assortativity}, \nRating assortativity: {rating_assortativity}, \nPopularity assortativity: {popularity_assortativity}")

Birthplace (country) assortativity: we expected a higher value. The somewhat lower value can have several explanations, for example that productions deliberately cast actors of different nationalities, or that actors move away from their birthplace and work in another country.
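
To illustrate what a categorical assortativity coefficient measures, the sketch below implements Newman's formula by hand on a toy graph. The node names and countries are hypothetical, and the helper `attribute_assortativity` is our own illustration rather than the networkx implementation used above.

```python
from collections import defaultdict

def attribute_assortativity(edges, attribute):
    """Newman's assortativity coefficient for a categorical node attribute."""
    mixing = defaultdict(float)
    for u, v in edges:
        # Count each undirected edge in both directions
        mixing[(attribute[u], attribute[v])] += 1
        mixing[(attribute[v], attribute[u])] += 1
    total = sum(mixing.values())
    categories = {c for pair in mixing for c in pair}
    # Fraction of edge ends that connect two nodes of the same category
    trace = sum(mixing[(c, c)] for c in categories) / total
    # a[c]: fraction of edge ends attached to nodes of category c
    a = {c: sum(mixing[(c, d)] for d in categories) / total for c in categories}
    sum_ab = sum(a[c] ** 2 for c in categories)  # a == b for undirected graphs
    return (trace - sum_ab) / (1 - sum_ab)

# Hypothetical toy graph: two same-country pairs joined by one cross-country edge
birthplace = {"A": "USA", "B": "USA", "C": "UK", "D": "UK"}
edges = [("A", "B"), ("C", "D"), ("B", "C")]
r = attribute_assortativity(edges, birthplace)
print(f"Birthplace assortativity: {r:.3f}")
```

A value near 1 means that edges almost only connect actors with the same birthplace, a value near 0 means mixing is close to random, and negative values mean that actors preferentially connect across birthplaces. The sketch assumes at least two categories are present, since the denominator is zero otherwise.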

In [None]:
# Get the isolates
isolates = list(nx.isolates(G))

# Print the number of isolates
print(len(isolates))

In [None]:
# x = df_movies_filtered['rating']
# x2 = df_movies_filtered['popularity']
# plt.scatter(x,x2) #ad

In [None]:
# Calculate the closeness centrality of the network
closeness_centrality = nx.closeness_centrality(G)

# Sort the actors according to the closeness centrality
sorted_closeness_centrality = sorted(closeness_centrality.items(), key=lambda x: x[1], reverse=True)

In [None]:
# Find the 10 most central actors (closeness centrality)
most_central_actors = sorted_closeness_centrality[:10]
most_central_actors

In [None]:
# Calculate the eigenvector centrality of the network
eigenvector_centrality = nx.eigenvector_centrality(G)

# Sort the actors according to the eigenvector centrality
sorted_eigenvector_centrality = sorted(eigenvector_centrality.items(), key=lambda x: x[1], reverse=True)

In [None]:
# Find the 10 most central actors (eigenvector centrality)
most_central_actors2 = sorted_eigenvector_centrality[:10]
most_central_actors2

In [None]:
# Plot the closeness centrality vs node degree to see if there is a correlation
closeness_centrality_values = list(closeness_centrality.values())
eigenvector_centrality_values = list(eigenvector_centrality.values())
degree_values = list(dict(G.degree()).values())

fig, axs = plt.subplots(1, 2, figsize=(16, 6))

# plot the closeness centrality vs node degree to see if there is a correlation
axs[0].scatter(degree_values, closeness_centrality_values, alpha=0.5)
axs[0].set_xlabel('Node Degree')
axs[0].set_ylabel('Closeness Centrality')
axs[0].set_title('Closeness Centrality vs Node Degree')

# plot the eigenvector centrality vs node degree to see if there is a correlation
axs[1].scatter(degree_values, eigenvector_centrality_values, alpha=0.5)
axs[1].set_xlabel('Node Degree')
axs[1].set_ylabel('Eigenvector Centrality')
axs[1].set_title('Eigenvector Centrality vs Node Degree')

plt.show()

### NLP TIME

Function used for tokenization of movie abstracts, including cleaning, stemming and removing stopwords.

In [None]:
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))  # build once; set lookup is fast

# Define a function to tokenize and clean text
def tokenize_and_clean_text3(text, collocations=None, with_collocations=False): #takes text and dictionary of collocations
    # Exclude URLs
    text = re.sub(r'http\S+|www\S+', '', text)
    # Exclude standalone numbers
    text = re.sub(r'\b\d+\b', '', text)
    # Replace punctuation with spaces and convert to lowercase
    text = re.sub(r'\W', ' ', text).lower()

    # Tokenize, remove stopwords and stem
    tokens = [stemmer.stem(token) for token in word_tokenize(text) if token not in stop_words]

    # Merge collocations last: the collocation keys are bigrams of stemmed,
    # stopword-filtered tokens, so they only match after these steps
    if with_collocations and collocations:
        tokens = MWETokenizer(list(collocations.keys())).tokenize(tokens)
    return tokens
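
The effect of the regular expressions is easiest to see on a single sentence. The sketch below applies the same three cleaning steps to a made-up sample string and, as a simplification, splits on whitespace instead of using `word_tokenize`; stemming and stopword removal are left out.

```python
import re

def clean_text(text):
    """Apply the three regex cleaning steps and split on whitespace."""
    text = re.sub(r'http\S+|www\S+', '', text)   # drop URLs
    text = re.sub(r'\b\d+\b', '', text)          # drop standalone numbers
    text = re.sub(r'\W', ' ', text).lower()      # punctuation -> space, lowercase
    return text.split()

# Made-up sample sentence, not taken from the TMDB abstracts
sample = "In 1997, Amélie's café (see https://example.com) re-opens!"
print(clean_text(sample))
# → ['in', 'amélie', 's', 'café', 'see', 're', 'opens']
```

Because Python 3 string patterns are Unicode-aware by default, `\W` treats accented letters such as "é" as word characters, so names like "Amélie" stay intact. Apostrophes and hyphens are replaced by spaces, which is why "Amélie's" and "re-opens" are each split in two.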

In [None]:
# First clean text
df_movies_filtered['tokens'] = df_movies_filtered['abstract'].apply(lambda x: tokenize_and_clean_text3(x))

In [None]:
from nltk import bigrams
# Initialize an empty list to store all bigrams
all_bigrams = []


# For each list of tokens in the 'tokens' column
for tokens in df_movies_filtered['tokens']:
    # Generate the list of bigrams
    bigrams_list = list(bigrams(tokens))
    # Add the bigrams to the list of all bigrams
    all_bigrams.extend(bigrams_list)

In [None]:
import collections
from scipy import stats

all_unique_bigrams = list(set(all_bigrams))
tokens_list = list(df_movies_filtered['tokens'].copy().explode())

# Create a Counter for bigrams and tokens
bigrams_counter = collections.Counter(all_bigrams)
tokens_counter = collections.Counter(tokens_list)

# Create a dictionary to store the data
data = {}

# For each unique pair of words
for w1, w2 in tqdm(all_unique_bigrams):
    # number of times w1 and w2 appear together in all_bigrams
    nii = bigrams_counter[(w1, w2)]
    # number of times w1 appears without w2
    noi = tokens_counter[w1] - nii
    # number of times w2 appears without w1
    nio = tokens_counter[w2] - nii
    # number of times w1 and w2 do not appear together
    noo = len(all_bigrams) - nii - noi - nio

    O = [[nii, nio], [noi, noo]]

    R1 = nii + nio
    C1 = nii + noi
    R2 = noi + noo
    C2 = nio + noo
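    # Total number of observations (equals C1 + C2); named n_obs so the
    # network size N is not overwritten
    n_obs = R1 + R2

    E = [[R1*C1/n_obs, R1*C2/n_obs],[R2*C1/n_obs, R2*C2/n_obs]] 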

    X_sq = sum((O[i][j] - E[i][j]) ** 2 / E[i][j] for i in range(len(O)) for j in range(len(O[i])))
    p_val = stats.chi2.sf(X_sq, df=1)

    data[(w1, w2)] = O, E, p_val

In [None]:
collocations = {}

for w1, w2 in tqdm(all_unique_bigrams):
    if bigrams_counter[(w1, w2)] > 50 and data[(w1, w2)][2] < 0.001:
        collocations.update({(w1, w2): bigrams_counter[(w1, w2)]})

print(f"\nNumber of collocations: {len(collocations)}")
# Print out the top 20 of them by number of occurrences
sorted(collocations.items(), key=lambda x: x[1], reverse=True)[:20]
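
The chi-squared test behind the collocation filter can be checked on one bigram at a time. The sketch below is a standalone illustration with hypothetical counts; the helper name `chi2_bigram_test` is ours, and it uses the identity that for one degree of freedom the chi-squared survival function equals `erfc(sqrt(x/2))`, so no SciPy is needed.

```python
import math

def chi2_bigram_test(n_ii, n_w1, n_w2, n_bigrams):
    """Pearson chi-squared test (df=1) for a single bigram (w1, w2)."""
    n_oi = n_w1 - n_ii                       # w1 without w2
    n_io = n_w2 - n_ii                       # w2 without w1
    n_oo = n_bigrams - n_ii - n_oi - n_io    # neither w1 nor w2
    observed = [[n_ii, n_io], [n_oi, n_oo]]
    rows = [sum(row) for row in observed]
    cols = [sum(col) for col in zip(*observed)]
    total = sum(rows)
    # Expected counts if w1 and w2 occurred independently of each other
    expected = [[rows[i] * cols[j] / total for j in range(2)] for i in range(2)]
    x_sq = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
               for i in range(2) for j in range(2))
    p_value = math.erfc(math.sqrt(x_sq / 2))  # chi-squared survival function, df=1
    return x_sq, p_value

# Hypothetical counts: the pair occurs 60 times among 10,000 bigrams
x_sq, p_value = chi2_bigram_test(n_ii=60, n_w1=100, n_w2=80, n_bigrams=10_000)
print(f"X^2 = {x_sq:.1f}, p = {p_value:.3g}")

# Hypothetical counts that match independence exactly
x_sq_ind, p_ind = chi2_bigram_test(n_ii=1, n_w1=100, n_w2=100, n_bigrams=10_000)
print(f"X^2 = {x_sq_ind:.1f}, p = {p_ind:.3g}")
```

In the first case the pair occurs far more often than independence predicts, so it would pass a p < 0.001 filter. In the second case the observed counts equal the expected counts, giving a statistic of 0 and a p-value of 1.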

In [None]:
# recomputing the tokens with collocations
df_movies_filtered['tokens'] = df_movies_filtered['abstract'].apply(lambda x: tokenize_and_clean_text3(x, collocations, 
                                                                                                       with_collocations=True))

In [None]:
# # save the dataframe as a csv file called df_movies_tokens
# df_movies_filtered.to_csv('data/df_movies_tokens.csv', index=False)

df_movies_tokens = pd.read_csv('data/df_movies_tokens.csv', converters={'tokens': ast.literal_eval})

In [None]:
# Compute the best partition using the Louvain method
partition = community_louvain.best_partition(G, resolution=0.5)

# Compute the modularity of this partition
modularity = community_louvain.modularity(partition, G)

print("Modularity found by Louvain algorithm:", modularity)
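
For intuition about the modularity value, it can be computed by hand on a toy graph. The sketch below is a simplified, unweighted version written for illustration only (the Louvain library call above handles the weighted case); the function name and the toy graph are hypothetical.

```python
from collections import defaultdict

def toy_modularity(edges, node_community):
    """Q = sum_c [ L_c / L - (k_c / 2L)^2 ] for an unweighted, undirected graph."""
    n_edges = len(edges)
    internal = defaultdict(int)    # L_c: edges with both ends in community c
    degree_sum = defaultdict(int)  # k_c: total degree of the nodes in community c
    for u, v in edges:
        degree_sum[node_community[u]] += 1
        degree_sum[node_community[v]] += 1
        if node_community[u] == node_community[v]:
            internal[node_community[u]] += 1
    return sum(internal[c] / n_edges - (degree_sum[c] / (2 * n_edges)) ** 2
               for c in degree_sum)

# Toy graph: two triangles connected by a single bridge edge
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
one_group = {node: 0 for node in range(6)}

q_two = toy_modularity(toy_edges, two_groups)
q_one = toy_modularity(toy_edges, one_group)
print(f"Two communities: Q = {q_two:.3f}")
print(f"One community:   Q = {q_one:.3f}")
```

Splitting the toy graph into its two triangles gives a clearly positive modularity, whereas putting every node in a single community gives exactly 0, which is the baseline the Louvain result should be compared against.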

In [None]:
# Calculate the number of communities and their sizes
communities = {}
for node, community in partition.items():
    if community not in communities:
        communities[community] = []
    communities[community].append(node)

num_communities = len(communities)
community_sizes = sorted([len(nodes) for nodes in communities.values()], reverse=True)

print("Number of communities:", num_communities)
print("Sizes of communities:", community_sizes)
# Heuristic check of whether the modularity is clearly different from 0
# (a fixed threshold, not a statistical significance test)
if abs(modularity) > 0.01:  # or any other threshold you consider meaningful
    print("The modularity is clearly different from 0.")
else:
    print("The modularity is not clearly different from 0.")

In [None]:
# Create a DataFrame from the partition and degree information
df_communities = pd.DataFrame({
    'actor': list(G.nodes),
    'community': [partition[node] for node in G.nodes],
    'degree': [G.degree(node) for node in G.nodes]
})

# Save the dataframe to csv
df_communities.to_csv('data/communities.csv', index=False)

In [None]:
# Load the df_communities csv-file
df_communities = pd.read_csv('data/communities.csv')

In [None]:
# Merge df_actors_filtered with df_communities
df = pd.merge(df_actors_filtered, df_communities, on='actor')
df = df.drop(columns=["movies"]).explode('movie_IDs')


# Merge with df_movies_tokens to attach the abstract tokens of each movie
df = pd.merge(df, df_movies_tokens, left_on='movie_IDs', right_on='movie_ID')

# Get abstract tokens for all communities (summing lists concatenates them)
all_communities_abstracts = df.groupby('community')['tokens'].agg("sum").reset_index()

all_communities_abstracts.columns = ['Community', 'Abstract Tokens']

In [None]:
# Get the top 9 communities and filter the abstracts for these communities
top9_communities = df_communities['community'].value_counts().nlargest(9).index
top9_communities_abstracts = all_communities_abstracts[all_communities_abstracts['Community'].isin(top9_communities)]

top_terms = {}
top_tfidf_terms = {}
all_tfidf_terms = {}

# (for less computational cost) get IDF for each term once and store the results
# Each community is treated as one document; sets make the membership test fast
community_token_sets = top9_communities_abstracts['Abstract Tokens'].apply(set)
idf_dict = {}
for term in tqdm(set.union(*community_token_sets)):
    idf_dict[term] = math.log(len(community_token_sets) / 
                              sum(term in token_set for token_set in community_token_sets))


for community in tqdm(top9_communities):
    # Get abstract tokens for the community
    abstract_tokens = top9_communities_abstracts[top9_communities_abstracts['Community'] == community]['Abstract Tokens'].values[0]
    
    # The top 10 TF terms
    term_counts = Counter(abstract_tokens)
    top_terms[community] = term_counts.most_common(10)


    tfidf = {}
    for term, count in term_counts.items():
        # TF for all terms in top 9 communities
        tf = count / len(abstract_tokens)

        # Get IDF from the precalculated dictionary
        idf = idf_dict[term]
        
        # TF-IDF for all terms in top 9 communities
        tfidf[term] = tf * idf

    all_tfidf_terms[community] = sorted(tfidf.items(), key=lambda x: x[1], reverse=True)
    
    # Get the top 10 TF-IDF words
    top_tfidf_terms[community] = sorted(tfidf.items(), key=lambda x: x[1], reverse=True)[:10]

In [None]:
for item in top_tfidf_terms.items():
    print(f"Community {item[0]}:")
    print(f"Top 10 TF-IDF terms: {item[1]}\n")
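
The TF-IDF computation can be followed on a miniature example where each community is one "document". The token lists below are hypothetical stemmed tokens made up for illustration, not output from our data.

```python
import math
from collections import Counter

# Hypothetical token lists, one "document" per community
community_tokens = {
    0: ["wrestl", "champion", "match", "wrestl", "ring"],
    1: ["love", "famili", "match", "love"],
    2: ["war", "soldier", "famili", "war", "war"],
}

# IDF: log(number of communities / number of communities containing the term)
n_docs = len(community_tokens)
doc_freq = Counter(term for tokens in community_tokens.values() for term in set(tokens))
idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}

# TF-IDF: relative term frequency within the community times the IDF
toy_tfidf = {}
for community, tokens in community_tokens.items():
    counts = Counter(tokens)
    toy_tfidf[community] = {t: (c / len(tokens)) * idf[t] for t, c in counts.items()}

for community, scores in toy_tfidf.items():
    best_term = max(scores, key=scores.get)
    print(f"Community {community}: top term '{best_term}' ({scores[best_term]:.3f})")
```

Terms that appear in several communities, such as "match" here, get a low IDF and therefore rank below terms that are specific to one community, which is the behaviour we rely on to characterise the communities.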

In [None]:
top9_communities_df = df_communities[df_communities['community'].isin(top9_communities)]

# Group by Community and Actor, and sum the Degree to get the total Degree for each Actor in each Community
grouped_df = top9_communities_df.groupby(['community', 'actor'])['degree'].sum().reset_index()

# Make new dataframe storing the top 5 actors by degree for each community
top_actors = grouped_df.groupby('community').apply(lambda x: x.nlargest(5, 'degree')).reset_index(drop=True)
top_actors

In [None]:
# Save all_tfidf_terms as a JSON file in the data folder
with open('data/all_tfidf_terms.json', 'w') as f:
    json.dump(all_tfidf_terms, f)

# Save top9_communities as a CSV file in the data folder
pd.DataFrame(top9_communities, columns=['community']).to_csv('data/top9_communities.csv', index=False)

# Save top_actors as a CSV file in the data folder
top_actors.to_csv('data/top_actors.csv', index=False)

In [None]:
# load files for wordcloud
top_actors = pd.read_csv('data/top_actors.csv')

top9_communities = pd.read_csv('data/top9_communities.csv')

with open('data/all_tfidf_terms.json', 'r') as file:
    all_tfidf_terms = json.load(file)

In [None]:
fig, axs = plt.subplots(3, 3, figsize=(24, 24))

for i, community in enumerate(top9_communities["community"]):
    # Get the already calculated top TF-IDF terms for the community
    tfidf_terms = dict(all_tfidf_terms[str(community)])
    
    # Create the word cloud
    wordcloud = WordCloud(width = 800, height = 800, 
                background_color ='white', 
                stopwords = None, 
                min_font_size = 10).generate_from_frequencies(tfidf_terms)
    
    # Subplot indices
    row = i // 3
    col = i % 3

    # Plot the word cloud
    axs[row, col].imshow(wordcloud)
    axs[row, col].axis("off")

    # Get the names of the top actors (by degree) in the community
    top_community_actors = top_actors[top_actors['community'] == community]['actor'].values
    # Create the title string with newlines
    title_string = "\n".join(top_community_actors)

    # Set title of subplot to the names of the top actors, each on a new line
    axs[row, col].set_title(f"Top actors in community {community}:\n{title_string}", fontsize=15)

# Remove any extra subplots
for j in range(i+1, 9):
    fig.delaxes(axs.flatten()[j])

plt.tight_layout(pad = 1) 
plt.subplots_adjust(hspace = 0.25)  # Increase the hspace value
plt.show()

In [None]:
# Merge df_actors_filtered with df_communities (both share the 'actor' column)
df_actors_communities = pd.merge(df_actors_filtered, df_communities, on='actor')

# For each of the largest communities, calculate community attributes
# Extract the 10 largest communities (the variables keep the 'top20' prefix)
top20_communities_idx = df_communities['community'].value_counts().nlargest(10).index
top20_community_actors = df_actors_communities[df_actors_communities['community'].isin(top20_communities_idx)]

# Calculate in each community (group_by)
def average_of_list(x):
    # Flatten the per-actor lists and take the mean of all values
    return np.mean([val for sublist in x for val in sublist])
def sum_dicts(dicts):
    # Add up the per-actor count dictionaries key by key
    result = {}
    for d in dicts:
        for k, v in d.items():
            if k in result:
                result[k] += v
            else:
                result[k] = v
    return result
top20_community_attributes = top20_community_actors.groupby('community').agg({
    'actor': 'count',
    'ages': 'mean',
    'ratings': [('mean', average_of_list)],
    'degree': [('mean', 'mean')],
    'birthplace': lambda x: x.mode().iloc[0] if not x.mode().empty else None,
})

# Clean columns
top20_community_attributes.columns = ['_'.join(col).strip() for col in top20_community_attributes.columns.values]
top20_community_attributes.reset_index(inplace=True)

# Calculate main genre in each community
df_grouped = df_actors_communities.groupby('community')['genres'].apply(sum_dicts).reset_index()
df_grouped = df_grouped.dropna()
df_grouped = df_grouped.loc[df_grouped.groupby('community')['genres'].idxmax()]
top20_community_attributes = pd.merge(top20_community_attributes, df_grouped, on='community', how='inner')
top20_community_attributes = top20_community_attributes.drop(columns=['genres'])

# Rename columns
top20_community_attributes = top20_community_attributes.rename(columns={'birthplace_<lambda>': 'country'})
top20_community_attributes = top20_community_attributes.rename(columns={'level_1': 'main_genre'})

top20_community_attributes.sort_values(by='actor_count')

In [None]:
# Plot success vs (genre, country, age)
fig, ax = plt.subplots(1,3, figsize=(16, 4))

ax[0].set_title('Age vs Success')
ax[0].scatter(top20_community_attributes['ages_mean'], top20_community_attributes['ratings_mean'])
ax[0].set_xlabel('average age')
ax[0].set_ylabel('average success')

ax[1].set_title('Genre vs Success')
ax[1].boxplot([top20_community_attributes.loc[top20_community_attributes['main_genre'] == genre, 'ratings_mean'] for genre in top20_community_attributes['main_genre'].unique()])
ax[1].set_xlabel('Genre')
ax[1].set_ylabel('average success')
ax[1].set_xticklabels(top20_community_attributes['main_genre'].unique(), rotation=80)  # Rotate x-axis labels for better visibility

ax[2].set_title('Country vs Success')
ax[2].boxplot([top20_community_attributes.loc[top20_community_attributes['country'] == country, 'ratings_mean'] for country in top20_community_attributes['country'].unique()])
ax[2].set_xlabel('Country')
ax[2].set_ylabel('average success')
ax[2].set_xticklabels(top20_community_attributes['country'].unique(), rotation=80)  # Rotate x-axis labels for better visibility

plt.show()

## 4. Discussion
* What went well?
* What is still missing? What could be improved? Why?