# Part 1: Social Media Behaviour Data Analysis


---

### Install Python packages (pip only)

In [33]:
pip install networkx


Note: you may need to restart the kernel to use updated packages.


### Import Python packages

In [2]:
import networkx as nx
import numpy as np

---

### Task 1 of 1

Examine the Graph Modelling Language (GML) files "socialmedia_cmt224_reply_network.gml" (reply network) and "socialmedia_cmt224_social_network.gml" (social network), which represent Twitter data between a sample of users over several days at the time of the Higgs boson particle discovery. Both networks are directed and share the same ids for nodes (anonymised Twitter users). However, the shared user ids are stored in each node's "label" attribute, not in the node "id" attribute, which is specific to each individual .gml file.

In the reply network, an edge from a node, 𝑢, to some other node, 𝑣, indicates that 𝑢 replied to a Tweet made by 𝑣 during the time period. Replies are also Tweets. Edges are weighted with the weight representing the number of times this happened over the time period.

In the social network, an edge from node 𝑢 to 𝑣 indicates that 𝑢 follows 𝑣 on the social media platform.
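
As a minimal sketch of how the "label" attribute becomes the node name on loading, the hypothetical GML string below (not taken from the real files) is parsed with `nx.parse_gml`, which by default keys nodes by `label`, as `nx.read_gml` does. The user ids `1001` and `1002` and the weight of 3 are made up for illustration.

```python
import networkx as nx

# Hypothetical two-node GML snippet; the ids and weight are illustrative only
toy_gml = """graph [
  directed 1
  node [
    id 0
    label "1001"
  ]
  node [
    id 1
    label "1002"
  ]
  edge [
    source 0
    target 1
    weight 3
  ]
]"""

# label="label" (the default) names each node by its "label" attribute
toy = nx.parse_gml(toy_gml, label="label")

print(toy.is_directed())               # the "directed 1" flag yields a DiGraph
print(list(toy.nodes()))               # nodes are keyed by label, not by id
print(toy["1001"]["1002"]["weight"])   # weight of the edge 1001 -> 1002
```

Because both files use the shared ids as labels, nodes loaded this way can be compared directly across the two networks.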

Using these networks, answer the following questions:

##### Q1. What fraction of users do not reply to or follow any other user, but have had others reply to their Tweets?

In [13]:
reply_net_path = 'socialmedia_cmt224_reply_network.gml'
social_net_path = 'socialmedia_cmt224_social_network.gml'
reply_net = nx.read_gml(reply_net_path)
social_net = nx.read_gml(social_net_path)

# Identify users who have not initiated any replies and have not followed anyone
non_reply_users = set(node for node in reply_net.nodes() if reply_net.out_degree(node) == 0)
non_follow_users = set(node for node in social_net.nodes() if social_net.out_degree(node) == 0)

# Identify users who have had others reply to their tweets
received_replies_users = set(node for node in reply_net.nodes() if reply_net.in_degree(node) > 0)

# Find users who meet all three criteria: made no replies, follow no one, but had others reply to their tweets
non_active_users = non_reply_users.intersection(non_follow_users)
target_users = non_active_users.intersection(received_replies_users)

# Calculate the fraction of target users out of all users across both networks
all_users = set(reply_net.nodes()).union(set(social_net.nodes()))
fraction = len(target_users) / len(all_users)

def format_value(value):
    return "{:,}".format(value)

fraction_formatted = format_value(fraction)
target_users_count_formatted = format_value(len(target_users))
all_users_count_formatted = format_value(len(all_users))

print(f"The fraction of users who have not initiated any replies, have not followed anyone, but had others reply to their tweets is {fraction_formatted}.\n"
      f"The number of target users meeting these criteria is {target_users_count_formatted}.\n"
      f"The total number of users across both networks is {all_users_count_formatted}.")


The fraction of users who have not initiated any replies, have not followed anyone, but had others reply to their tweets is 0.009081675330106948.
The number of target users meeting these criteria is 152.
The total number of users across both networks is 16,737.
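
The calculation above only treats a user as following no one if that user appears as a node in the social network. Whether the two files contain exactly the same node set is not stated, so the following is a hedged sketch of a variant that treats a user missing from a network as having no outgoing edges there. The helper name `passive_but_replied_to` and the toy graphs are hypothetical.

```python
import networkx as nx

def passive_but_replied_to(reply_g, social_g):
    """Return (target users, all users) across both networks."""
    def out_degree_or_zero(graph, node):
        # Assumption: a user absent from a network has no outgoing edges there
        return graph.out_degree(node) if node in graph else 0

    all_users = set(reply_g) | set(social_g)
    targets = {
        user for user in all_users
        if out_degree_or_zero(reply_g, user) == 0      # made no replies
        and out_degree_or_zero(social_g, user) == 0    # follows no one
        and user in reply_g
        and reply_g.in_degree(user) > 0                # received a reply
    }
    return targets, all_users

# Toy data: "b" receives replies but never appears in the social network
toy_reply = nx.DiGraph([("a", "b"), ("c", "b")])
toy_social = nx.DiGraph([("c", "a")])

targets, users = passive_but_replied_to(toy_reply, toy_social)
print(sorted(targets), len(targets) / len(users))
```

If the node sets of the two real networks are identical, this variant and the set-intersection approach give the same answer.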


##### Q2. How does the topological structure of the reply network differ from the social network in terms of overall sparsity of edges between users and the number of connected groups of users?

In [21]:
# Edge density of the reply network: edges present / possible directed edges, n * (n - 1).
# A lower density means a sparser network.
reply_density = nx.density(reply_net)

# Calculate the number of weakly connected components in the reply network
reply_connected_components = nx.number_weakly_connected_components(reply_net)

# Edge density of the social network
social_density = nx.density(social_net)

# Calculate the number of weakly connected components in the social network
social_connected_components = nx.number_weakly_connected_components(social_net)

print(f'Number of weakly connected components in reply network: {reply_connected_components}')
print(f'Number of weakly connected components in social network: {social_connected_components}')




Number of weakly connected components in reply network: 5920
Number of weakly connected components in social network: 436
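
For large, sparse networks the edge density is typically far below 0.01, so a fixed two-decimal format would display it as 0.00. The sketch below, on a made-up four-node graph, shows one way to report density in scientific notation alongside the component count. The toy graph and the value `1e-4` are illustrative assumptions, not results from the real data.

```python
import networkx as nx

# Toy directed graph: a chain 1 -> 2 -> 3 plus an isolated node
toy = nx.DiGraph([(1, 2), (2, 3)])
toy.add_node(4)  # the isolated node forms its own weakly connected component

density = nx.density(toy)  # m / (n * (n - 1)) for a directed graph
components = nx.number_weakly_connected_components(toy)

print(f"Density: {density:.2e}")
print(f"Weakly connected components: {components}")

# A small density is lost with fixed-point formatting but kept in scientific notation
tiny = 1e-4
fixed = f"{tiny:.2f}"
scientific = f"{tiny:.2e}"
print(fixed, scientific)
```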


##### Q3. Does the number of users a user follows in the social network correlate with the number of replies that they make?

In [4]:
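# Load both networks so that this cell runs on its own
social_network = nx.read_gml('socialmedia_cmt224_social_network.gml')
reply_network = nx.read_gml('socialmedia_cmt224_reply_network.gml')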

# Function to calculate out-degrees for a given network
def calculate_out_degrees(network):
    return {node: network.out_degree(node) for node in network.nodes()}

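# Calculate out-degrees: users followed (social network) and distinct users replied to
# (reply network; edge weights, i.e. the number of replies, are not used here)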
social_out_degrees = calculate_out_degrees(social_network)
reply_out_degrees = calculate_out_degrees(reply_network)

# Collect follow counts and reply counts for nodes present in both networks
common_nodes = set(social_out_degrees.keys()).intersection(reply_out_degrees.keys())

follow_counts = [social_out_degrees[node] for node in common_nodes]
reply_counts = [reply_out_degrees[node] for node in common_nodes]

# Calculate correlation coefficient between follow counts and reply counts
correlation_coefficient = np.corrcoef(follow_counts, reply_counts)[0, 1]

# Round the correlation coefficient to 2 decimal places
rounded_coefficient = round(correlation_coefficient, 2)

print(f"Correlation coefficient: {rounded_coefficient}")

Correlation coefficient: 0.06
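
The unweighted out-degree in the reply network counts how many distinct users someone replied to, not how many replies they made. A hedged sketch of the weighted alternative follows, assuming the edge attribute is named `weight` as described in the task. The toy graphs are made up.

```python
import networkx as nx
import numpy as np

# Toy reply network: edge weight = number of replies from u to v
toy_reply = nx.DiGraph()
toy_reply.add_weighted_edges_from([("a", "b", 2), ("a", "c", 3), ("b", "a", 1)])

# Toy social network: edge u -> v means u follows v
toy_social = nx.DiGraph([("a", "b"), ("a", "c"), ("a", "d"), ("b", "a")])

# Compare only users present in both networks, in a fixed order
common = sorted(set(toy_reply) & set(toy_social))

follow_counts = [toy_social.out_degree(n) for n in common]
# weight="weight" sums the edge weights instead of counting edges
reply_counts = [toy_reply.out_degree(n, weight="weight") for n in common]

r = np.corrcoef(follow_counts, reply_counts)[0, 1]
print(common, follow_counts, reply_counts, round(r, 2))
```

Here user "a" has an unweighted reply out-degree of 2 but a weighted out-degree of 5, which is the number of replies actually made.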


##### Q4. Is a user that replies to another user's Tweet multiple times more likely to follow that user in comparison to if they only replied once?

In [25]:
reply_network_path = 'socialmedia_cmt224_reply_network.gml'
social_network_path = 'socialmedia_cmt224_social_network.gml'

reply_network = nx.read_gml(reply_network_path)
social_network = nx.read_gml(social_network_path)

# Step 1: Count each user's outgoing reply edges (one per distinct user replied to;
# edge weights are not used) and keep users with more than one
num_replies_made = {}
for u, v in reply_network.edges():
    num_replies_made[u] = num_replies_made.get(u, 0) + 1
replied_multiple_times = {user: count for user, count in num_replies_made.items() if count > 1}

# Step 2: Keep users with exactly one outgoing reply edge
replied_once = {user: count for user, count in num_replies_made.items() if count == 1}

# Step 3: Of the users from Step 1, keep those who follow at least one user they replied to
replied_and_followed_multiple_times = {}
replied_to_users = {}
for u, v in reply_network.edges():
    replied_to_users[u] = replied_to_users.get(u, set())
    replied_to_users[u].add(v)
for user, count in replied_multiple_times.items():
    if user in social_network:
        successors = social_network.successors(user)
        if any(successor in replied_to_users.get(user, set()) for successor in successors):
            replied_and_followed_multiple_times[user] = count

# Step 4: Of the users from Step 2, keep those who follow the user they replied to
replied_and_followed_once = {}
for user, _ in replied_once.items():
    if user in social_network:
        successors = social_network.successors(user)
        if any(successor in replied_to_users.get(user, set()) for successor in successors):
            replied_and_followed_once[user] = 1

# Calculate and print the fractions
fraction_replied_once_follow = len(replied_and_followed_once) / len(replied_once) if len(replied_once) > 0 else 0
fraction_replied_multiple_times_follow = len(replied_and_followed_multiple_times) / len(replied_multiple_times) if len(replied_multiple_times) > 0 else 0

print(f"Fraction of single-reply users who follow a user they replied to: {fraction_replied_once_follow:.2f}")
print(f"Fraction of multiple-reply users who follow a user they replied to: {fraction_replied_multiple_times_follow:.2f}")


Fraction of single-reply users who follow a user they replied to: 0.86
Fraction of multiple-reply users who follow a user they replied to: 0.93
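
The question can also be read per pair of users: for each reply edge from 𝑢 to 𝑣, does 𝑢 follow 𝑣, and does that depend on whether the edge weight is 1 or greater than 1? The following is a sketch of that edge-level reading, not the method used above. The function name `follow_rate_by_reply_count` and the toy graphs are hypothetical, and the edge attribute is assumed to be named `weight`.

```python
import networkx as nx

def follow_rate_by_reply_count(reply_g, social_g):
    """Fraction of reply pairs (u, v) where u follows v, split by reply count."""
    counts = {"once": [0, 0], "multiple": [0, 0]}  # [pairs, pairs where u follows v]
    for u, v, weight in reply_g.edges(data="weight", default=1):
        key = "once" if weight == 1 else "multiple"
        counts[key][0] += 1
        if social_g.has_edge(u, v):  # False if either user is missing
            counts[key][1] += 1
    return {
        key: (followed / pairs if pairs else float("nan"))
        for key, (pairs, followed) in counts.items()
    }

# Toy data: weights are the number of replies from u to v
toy_reply = nx.DiGraph()
toy_reply.add_weighted_edges_from(
    [("a", "b", 1), ("a", "c", 3), ("b", "c", 2), ("c", "a", 1)]
)
toy_social = nx.DiGraph([("a", "c"), ("b", "c"), ("c", "a")])

rates = follow_rate_by_reply_count(toy_reply, toy_social)
print(rates)
```

Working at the edge level ties each follow check to the specific user who was replied to, rather than to any user the replier has ever replied to.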


##### Q5. How many users have only mutual following connections (i.e., every user they follow also follows them) and only mutual reply connections with these same users?

In [3]:
reply_network_path = 'socialmedia_cmt224_reply_network.gml'
social_network_path = 'socialmedia_cmt224_social_network.gml'

reply_network = nx.read_gml(reply_network_path)
social_network = nx.read_gml(social_network_path)

# Identify users whose set of followers is exactly the set of users they follow
only_mutual_follow_users = set()
for user in social_network.nodes():
    followers = set(social_network.predecessors(user))
    following = set(social_network.successors(user))
    if followers == following:  # Check if followers are the same as following
        only_mutual_follow_users.add(user)

# Identify users whose reply connections are all mutual
# (not checked against the users they follow)
only_mutual_reply_users = set()
for user in reply_network.nodes():
    replied_by = set(reply_network.predecessors(user))  # users who replied to this user
    replied_to = set(reply_network.successors(user))    # users this user replied to
    if replied_by == replied_to:  # every reply connection is reciprocated
        only_mutual_reply_users.add(user)

# Find users with both types of only mutual connections
users_with_only_mutual_connections = only_mutual_follow_users.intersection(only_mutual_reply_users)

num_users_with_only_mutual_connections = len(users_with_only_mutual_connections)
print("Number of users with only mutual following and reply connections:", num_users_with_only_mutual_connections)


Number of users with only mutual following and reply connections: 261
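
The set-equality test above also excludes users who have extra followers they do not follow back, which is stricter than "every user they follow also follows them". It also does not tie the reply partners to the followed users. One possible alternative reading is sketched below. The interpretation, the helper name `has_only_mutual_links` and the toy graphs are all assumptions.

```python
import networkx as nx

def has_only_mutual_links(user, social_g, reply_g):
    """One reading of Q5: follows are reciprocated, and reply links are
    mutual and limited to users that this user follows."""
    following = set(social_g.successors(user)) if user in social_g else set()
    followers = set(social_g.predecessors(user)) if user in social_g else set()
    if not following <= followers:  # someone they follow does not follow back
        return False

    replied_to = set(reply_g.successors(user)) if user in reply_g else set()
    replied_by = set(reply_g.predecessors(user)) if user in reply_g else set()
    # Reply links must be reciprocated and stay within the followed users
    return replied_to == replied_by and replied_to <= following

# Toy data: a and b follow and reply to each other; c follows a one-way
toy_social = nx.DiGraph([("a", "b"), ("b", "a"), ("c", "a")])
toy_reply = nx.DiGraph([("a", "b"), ("b", "a")])

all_users = set(toy_social) | set(toy_reply)
qualifying = {u for u in all_users if has_only_mutual_links(u, toy_social, toy_reply)}
print(sorted(qualifying))
```

Under this reading user "a" qualifies despite the extra follower "c", whereas a strict equality of follower and following sets would exclude "a". Users with no connections at all pass either test, so it may be worth deciding explicitly whether they should be counted.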
