# Part 1: Social Media Behaviour Data Analysis


---

### Install Python packages (pip only)

In [1]:
# e.g., %pip install networkx

### Import Python packages

In [2]:
import networkx as nx
import numpy as np

---

### Task 1 of 1

Examine the Graph Modelling Language (GML) files "socialmedia_cmt224_reply_network.gml" (reply network) and "socialmedia_cmt224_social_network.gml" (social network), which represent Twitter data exchanged between a sample of users over several days at the time of the Higgs boson particle discovery. Both networks are directed and share the same ids for nodes (anonymised Twitter users). However, the shared user ids are stored in the "label" attribute in the .gml files, not in the node "id" attribute of each individual .gml file.
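
The label/id distinction matters when loading the files. The following is a minimal sketch on a toy graph (the file name `toy.gml` and the names `by_label` and `by_id` are illustrative, not part of the task), showing how the `label` argument of `nx.read_gml` decides which attribute names the nodes:

```python
import os
import tempfile

import networkx as nx

# Toy directed graph whose node names stand in for the shared user ids.
toy = nx.DiGraph()
toy.add_edge("88", "317")

# write_gml stores each node name in "label" and assigns its own integer "id".
path = os.path.join(tempfile.mkdtemp(), "toy.gml")
nx.write_gml(toy, path)

by_label = nx.read_gml(path, label="label")  # default: nodes named by "label"
by_id = nx.read_gml(path, label="id")        # nodes named by the file-local "id"

print(sorted(by_label.nodes()))  # ['317', '88']
print(sorted(by_id.nodes()))     # [0, 1]
```

Because `label="label"` is the default, calling `nx.read_gml(path)` with no extra arguments already yields nodes keyed by the shared user ids, so the two networks can be compared node by node.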

In the reply network, an edge from a node, 𝑢, to some other node, 𝑣, indicates that 𝑢 replied to a Tweet made by 𝑣 during the time period. Replies are also Tweets. Each edge carries a weight giving the number of times this happened over the time period.

In the social network, an edge from node 𝑢 to 𝑣 indicates that 𝑢 follows 𝑣 on the social media platform.
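
Edge weights change what a degree count means. As a small sketch on toy graphs (the users `a`, `b` and `c` are made up for illustration), the unweighted in-degree counts distinct repliers, while the weighted in-degree counts replies received:

```python
import networkx as nx

# Toy reply network: edge u -> v means u replied to v, weight = number of replies.
reply = nx.DiGraph()
reply.add_edge("a", "c", weight=3)  # a replied to c three times
reply.add_edge("b", "c", weight=1)  # b replied to c once

print(reply.in_degree("c"))                   # 2 distinct repliers
print(reply.in_degree("c", weight="weight"))  # 4 replies received

# Toy social network: edge u -> v means u follows v; no weights stored.
social = nx.DiGraph()
social.add_edges_from([("a", "c"), ("b", "c")])

# Edges without a "weight" attribute count as 1, so both calls agree.
print(social.in_degree("c"))                   # 2 followers
print(social.in_degree("c", weight="weight"))  # 2 followers
```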

Using these networks, answer the following questions:

##### Q1. How does the topological structure of the reply network differ from the social network in terms of overall sparsity of edges between users and the number of connected groups of users?

In [3]:
# Load the reply network
reply_network = nx.read_gml("socialmedia_cmt224_reply_network.gml")

# Load the social network
social_network = nx.read_gml("socialmedia_cmt224_social_network.gml")

# Calculate the density of the reply network
reply_density = nx.density(reply_network)

# Print the density to 2 decimal places, or to 5 when it is below 0.01
if reply_density >= 0.01:
    print("Reply network density: {:.2f}".format(reply_density))
else:
    print("Reply network density: {:.5f}".format(reply_density))

# Calculate the density of the social network
social_density = nx.density(social_network)

# Print the density to 2 decimal places, or to 5 when it is below 0.01
if social_density >= 0.01:
    print("Social network density: {:.2f}".format(social_density))
else:
    print("Social network density: {:.5f}".format(social_density))

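# Count the connected groups in the reply network, taken here as strongly connected
# components (groups in which every user can reach every other along directed edges)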
reply_groups = nx.number_strongly_connected_components(reply_network)
print("Reply network groups:", reply_groups)

# Count the connected groups in the social network, again as strongly connected components
social_groups = nx.number_strongly_connected_components(social_network)
print("Social network groups:", social_groups)

Reply network density: 0.00002
Social network density: 0.00060
Reply network groups: 36132
Social network groups: 9146
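
To make the two measures concrete, here is a minimal sketch on a toy directed graph (the nodes `a` to `d` are made up). It checks `nx.density` against the directed formula m / (n(n - 1)) and contrasts strongly connected components with weakly connected ones, which are an alternative reading of "connected groups" that ignores edge direction. The weak count is not computed for the real networks above.

```python
import networkx as nx

g = nx.DiGraph()
g.add_edges_from([("a", "b"), ("b", "a"), ("b", "c")])
g.add_node("d")  # a user with no edges at all

n, m = g.number_of_nodes(), g.number_of_edges()
manual_density = m / (n * (n - 1))  # a directed graph has n*(n-1) possible edges
print(manual_density, nx.density(g))  # 0.25 0.25

# Strong: every member reaches every other along directed edges -> {a, b}, {c}, {d}
strong = nx.number_strongly_connected_components(g)
# Weak: members are linked once edge direction is ignored -> {a, b, c}, {d}
weak = nx.number_weakly_connected_components(g)
print(strong, weak)  # 3 2
```

A one-way edge such as `b -> c` is not enough to merge two users into one strongly connected component, which is one reason the strong count can be large in a sparse network.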


##### Q2. Do the 25 users with highest number of followers also have the highest number of repliers to their Tweets?

In [4]:
# Get the top 25 followed users by in-degree (edges with no 'weight' attribute count as 1)
top_followed = sorted(social_network.in_degree(weight='weight'), reverse=True, key=lambda x: x[1])[:25]
top_followed_users = {user for user, _ in top_followed}
print("Top 25 followed users:", top_followed_users)

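# Get the top 25 replied-to users by weighted in-degree; this counts replies received,
# so one user replying several times is counted several times
top_replied = sorted(reply_network.in_degree(weight='weight'), reverse=True, key=lambda x: x[1])[:25]
top_replied_users = {user for user, _ in top_replied}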
print("Top 25 replied users:", top_replied_users)

# Get the intersection of the top followed and top replied users
common_users = top_followed_users.intersection(top_replied_users)
print("Users on both top followed and top replied lists:", common_users)

Top 25 followed users: {'383', '396', '88', '1274', '15', '8', '407', '3549', '979', '965', '465', '206', '677', '3419', '220', '1062', '960', '2417', '1503', '352', '205', '1988', '138', '301', '317'}
Top 25 replied users: {'349', '9964', '7962', '88', '7690', '216', '1880', '3549', '3998', '2014', '327', '677', '2177', '220', '4259', '13808', '2280', '16460', '5245', '6940', '1988', '3369', '12281', '4368', '317'}
Users on both top followed and top replied lists: {'677', '220', '1988', '3549', '88', '317'}
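
The size of the overlap can be read off the printed sets. A minimal sketch, copying the two sets from the output above (the name `overlap` is introduced here for illustration):

```python
# Sets copied from the output above.
top_followed_users = {
    '383', '396', '88', '1274', '15', '8', '407', '3549', '979', '965',
    '465', '206', '677', '3419', '220', '1062', '960', '2417', '1503',
    '352', '205', '1988', '138', '301', '317',
}
top_replied_users = {
    '349', '9964', '7962', '88', '7690', '216', '1880', '3549', '3998',
    '2014', '327', '677', '2177', '220', '4259', '13808', '2280', '16460',
    '5245', '6940', '1988', '3369', '12281', '4368', '317',
}

common_users = top_followed_users & top_replied_users
overlap = len(common_users) / len(top_followed_users)
print("{} of {} top-followed users are also top-replied ({:.0%})".format(
    len(common_users), len(top_followed_users), overlap))
# 6 of 25 top-followed users are also top-replied (24%)
```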


##### Q3. To what extent does the number of followers a user has correlate with the number of users that they have replied to?

In [5]:
# Get all users in the social network
social_all_users = social_network.nodes()

# Initialize empty lists for followers and replies
followers = []
replies = []

# Loop through all users in the social network
for user in social_all_users:
    # Get the number of followers for the user
    follower_count = social_network.in_degree(user, weight='weight')
    followers.append(follower_count)
    
    # Get the number of replies the user has sent (weighted out-degree);
    # out_degree(user) without a weight would count distinct users replied to
    reply_count = reply_network.out_degree(user, weight='weight')
    replies.append(reply_count)

# Calculate the correlation coefficient
correlation = np.corrcoef(followers, replies)[0, 1]

# Print the coefficient to 2 decimal places, or to 5 when its magnitude is below 0.01
if abs(correlation) >= 0.01:
    print("Correlation coefficient between followers and replies: {:.2f}".format(correlation))
else:
    print("Correlation coefficient between followers and replies: {:.5f}".format(correlation))

Correlation coefficient between followers and replies: -0.04
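
As a variant, the sketch below correlates follower counts with the number of distinct users replied to (unweighted out-degree) and treats users absent from the reply network as having replied to nobody. It runs on toy graphs with made-up users, so it says nothing about the coefficient for the real data; the names `social`, `reply` and `targets` are illustrative.

```python
import networkx as nx
import numpy as np

# Toy social network: edge u -> v means u follows v.
social = nx.DiGraph([("b", "a"), ("c", "a"), ("d", "a"), ("a", "b"), ("c", "b")])

# Toy reply network: edge u -> v means u replied to v; user "d" never appears.
reply = nx.DiGraph()
reply.add_weighted_edges_from([("a", "b", 5), ("c", "a", 1), ("c", "b", 2)])

# Followers per user, in the social network's node order.
followers = [social.in_degree(user) for user in social]

# Distinct users replied to; default to 0 for users missing from the reply network.
targets = [reply.out_degree(user) if user in reply else 0 for user in social]

r = np.corrcoef(followers, targets)[0, 1]
print("Pearson r on the toy data: {:.2f}".format(r))
```

The membership check avoids relying on both graphs containing exactly the same node set.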


##### Q4. Do users typically ONLY reply to Tweets, are ONLY replied to, or BOTH?

In [6]:
# Get the nodes that only reply
only_reply = set(node for node, degree in reply_network.in_degree() if degree == 0 and reply_network.out_degree(node) > 0)
only_reply_count = len(only_reply)
only_reply_proportion = only_reply_count / len(reply_network.nodes())
print("Only reply: {} nodes ({:.2%} of all nodes)".format(only_reply_count, only_reply_proportion))

# Get the nodes that only get replied to
only_replied_to = set(node for node, degree in reply_network.out_degree() if degree == 0 and reply_network.in_degree(node) > 0)
only_replied_to_count = len(only_replied_to)
only_replied_to_proportion = only_replied_to_count / len(reply_network.nodes())
print("Only get replied to: {} nodes ({:.2%} of all nodes)".format(only_replied_to_count, only_replied_to_proportion))

# Treat every remaining node as both replying and being replied to
# (this remainder would also include any node with no reply edges at all)
both = set(reply_network.nodes()) - only_reply - only_replied_to
both_count = len(both)
both_proportion = both_count / len(reply_network.nodes())
print("Both reply and get replied to: {} nodes ({:.2%} of all nodes)".format(both_count, both_proportion))

Only reply: 20332 nodes (52.24% of all nodes)
Only get replied to: 11663 nodes (29.97% of all nodes)
Both reply and get replied to: 6923 nodes (17.79% of all nodes)
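
The three categories can also be assigned in a single pass that keeps nodes with no reply edges in a category of their own, if any exist. A minimal sketch on a toy graph (the nodes and the `counts` dictionary are illustrative):

```python
import networkx as nx

reply = nx.DiGraph()
reply.add_edges_from([("a", "b"), ("b", "c"), ("d", "c")])
reply.add_node("e")  # present in the graph but with no reply edges

counts = {"only reply": 0, "only replied to": 0, "both": 0, "neither": 0}
for node in reply:
    sends = reply.out_degree(node) > 0     # has replied to someone
    receives = reply.in_degree(node) > 0   # has been replied to
    if sends and receives:
        counts["both"] += 1
    elif sends:
        counts["only reply"] += 1
    elif receives:
        counts["only replied to"] += 1
    else:
        counts["neither"] += 1

print(counts)
# {'only reply': 2, 'only replied to': 1, 'both': 1, 'neither': 1}
```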


##### Q5. How many users have ONLY mutual following connections (i.e., every user they follow also follows them) AND ONLY mutual reply connections with these SAME users?

In [7]:
# Get the set of nodes in the social and reply networks
social_network_nodes = set(social_network.nodes())
reply_network_nodes = set(reply_network.nodes())

# For each node in the social network, check if it has only mutual following connections with its followers
mutual_following_users = set()
for user in social_network_nodes:
    followers = set(social_network.predecessors(user))
    following = set(social_network.successors(user))
    # Require the follower and following sets to match exactly and to be non-empty
    if followers == following and len(followers) > 0:
        mutual_following_users.add(user)

# For each such user, check that reply edges exist in both directions with every one of
# their followers (replies exchanged with other users are not checked here)
count = 0
for user in mutual_following_users:
    followers = set(social_network.predecessors(user))
    is_mutual = True
    for follower in followers:
        if not reply_network.has_edge(follower, user) or not reply_network.has_edge(user, follower):
            is_mutual = False
            break
    if is_mutual:
        count += 1

# Print the number of users found
print("Number of users with only mutual following and reply connections:", count)

Number of users with only mutual following and reply connections: 189
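
The question admits more than one reading. The sketch below implements one possible interpretation on toy graphs: everyone the user follows must follow them back, and the user's reply partners must be the same in both directions and drawn only from the users they follow. The helper `only_mutual` and the toy users are hypothetical, and this reading need not reproduce the count above.

```python
import networkx as nx

def only_mutual(user, social, reply):
    """Return True if `user` passes the mutual-follow and mutual-reply checks."""
    following = set(social.successors(user))
    followers = set(social.predecessors(user))
    # Every followed user must follow back, and at least one must exist.
    if not following or not following <= followers:
        return False
    if user not in reply:
        return False
    replied_to = set(reply.successors(user))
    replied_by = set(reply.predecessors(user))
    # Reply partners must match in both directions and all be followed users.
    return bool(replied_to) and replied_to == replied_by and replied_to <= following

# Toy data: edge u -> v means "u follows v" (social) or "u replied to v" (reply).
social = nx.DiGraph([("a", "b"), ("b", "a"), ("c", "a")])
reply = nx.DiGraph([("a", "b"), ("b", "a"), ("c", "b")])

result = sorted(u for u in social if only_mutual(u, social, reply))
print(result)  # ['a']
```

Here `b` fails because `c` replied to `b` without a reply in return, and `c` fails because nobody follows `c` back.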
