# **Analysis of Twitch Social Network**

## **Introduction**

This project focuses on analyzing the social network of Twitch using the dataset provided by SNAP (Stanford Network Analysis Project). Twitch is a live-streaming platform, primarily for gaming, hosting a vast community of streamers and viewers. The Twitch social network is represented as an undirected graph, where nodes are users, and edges represent mutual friendships. This project aims to study the interaction dynamics, community structures, and user behaviors within the network.

---

## **Project Objectives**

### 1. **User Behavior Analysis**
I will examine the interactions between users on the platform by analyzing metrics such as:
- Node degree (number of connections).
- Centrality (relative importance of a node in the network).
- Distribution of connections and identification of influential users.
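
As a rough illustration of the first two metrics, here is a minimal sketch on a hypothetical toy graph (not the Twitch data). It uses NetworkX's `degree_centrality`, which divides each node's degree by the number of other nodes; other centrality measures could be used instead.

```python
import networkx as nx

# Hypothetical toy star graph: node 0 is connected to nodes 1..4
toy = nx.star_graph(4)

degree = dict(toy.degree())              # number of connections per node
centrality = nx.degree_centrality(toy)   # degree divided by (n - 1)

# Rank nodes by centrality to surface the most connected one
top_nodes = sorted(centrality, key=centrality.get, reverse=True)[:1]
print(degree[0], centrality[0], top_nodes)
```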

### 2. **Community Detection**
I will apply community detection algorithms to identify groups of users that are strongly connected. This will help to understand how communities form based on shared characteristics, such as language, preferred games, or geographical location.

### 3. **Explicit Content Analysis**
I will predict the likelihood of a streamer using explicit language based on their connections and attributes. This will be achieved using supervised machine learning techniques, leveraging node attributes and graph connections.

---

## **Methodology**

The project will be developed following these steps:
1. **Data Loading and Preprocessing**: Importing the edge list and node features, building the graph, and cleaning the data.
2. **Exploratory Network Analysis**: Studying the main characteristics of the network, such as the number of nodes, edges, and basic metrics.
3. **Experiments**:
   - User behavior analysis.
   - Community detection using clustering algorithms.
   - Explicit content prediction through classification.
4. **Evaluation**: Evaluating results and visualizing the network and analyses.

---

## **Tools Used**
- **Python**: The primary programming language.
- **NetworkX**: For graph manipulation and analysis.
- **Scikit-learn**: For implementing machine learning models.
- **Matplotlib/Seaborn**: For data and result visualization.
- **Gephi**: For interactive network visualization.


---



## **1. Data Loading and Preprocessing**


In [73]:
import networkx as nx
import pandas as pd
import matplotlib.pyplot as plt
import random
from collections import defaultdict

### 1. Load the dataset, including:
   - Edge list: representing user connections on Twitch.
   - Node features: describing user attributes (e.g., language, streaming behavior).

In [74]:
# Load the edge list
edgelist = pd.read_csv('large_twitch_edges.csv')
# Display the first few rows to understand the structure
edgelist.head()

Unnamed: 0,numeric_id_1,numeric_id_2
0,98343,141493
1,98343,58736
2,98343,140703
3,98343,151401
4,98343,157118


In [75]:
# Load CSV file with node features
node_features = pd.read_csv("large_twitch_features.csv", index_col="numeric_id")
# Display the first few rows to understand the structure
node_features.head()

numeric_id,views,mature,life_time,created_at,updated_at,dead_account,language,affiliate
0,7879,1,969,2016-02-16,2018-10-12,0,EN,1
1,500,0,2699,2011-05-19,2018-10-08,0,EN,0
2,382502,1,3149,2010-02-27,2018-10-12,0,EN,1
3,386,0,1344,2015-01-26,2018-10-01,0,EN,0
4,2486,0,1784,2013-11-22,2018-10-11,0,EN,0


At first sight, I note that:
- **Edge List**: Represents the connections between Twitch users, where each row is a link between two users (`numeric_id_1` and `numeric_id_2`).
- **Node Features**: Contains attributes for each user (node) such as:
  - `views`: Total number of views a streamer has.
  - `mature`: Indicates if the streamer produces mature content (1 for yes, 0 for no).
  - `life_time`: Lifetime of the user's account (in days).
  - `created_at` and `updated_at`: Account creation and last update dates.
  - `dead_account`: Indicates if the account is inactive.
  - `language`: The main language used by the user.
  - `affiliate`: Indicates if the user is a Twitch affiliate (1 for yes, 0 for no).
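
For the five rows printed above, `life_time` appears to equal the number of days between `created_at` and `updated_at`. The sketch below checks this on those five rows only; whether it holds for the whole table is an assumption that would need the same check on the full `node_features` frame.

```python
import pandas as pd

# The five rows shown by node_features.head() above
sample = pd.DataFrame(
    {
        "life_time": [969, 2699, 3149, 1344, 1784],
        "created_at": ["2016-02-16", "2011-05-19", "2010-02-27",
                       "2015-01-26", "2013-11-22"],
        "updated_at": ["2018-10-12", "2018-10-08", "2018-10-12",
                       "2018-10-01", "2018-10-11"],
    }
)

# Parse the date strings, then measure the span between them in days
created = pd.to_datetime(sample["created_at"])
updated = pd.to_datetime(sample["updated_at"])
span_days = (updated - created).dt.days

matches = (span_days == sample["life_time"]).all()
print(span_days.tolist(), matches)
```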

### 2. Clean and preprocess the data:
   - Handle missing values and duplicates.

In [76]:
# Check for null values and duplicates in the edge list
print("### Edge List ###")
print(f"Total rows in edge list: {edgelist.shape[0]}")

# Check for null values
null_values_edges = edgelist.isnull().sum()
print("Null values in edge list:")
print(null_values_edges)

# Check for duplicates
duplicates_edges = edgelist.duplicated().sum()
print(f"Duplicate rows in edge list: {duplicates_edges}")

print("\n")

# Check for null values in the node attributes
print("### Node Attributes ###")
print(f"Total rows in node attributes: {node_features.shape[0]}")

# Check for null values
null_values_nodes = node_features.isnull().sum()
print("Null values in node attributes:")
print(null_values_nodes)

### Edge List ###
Total rows in edge list: 6797557
Null values in edge list:
numeric_id_1    0
numeric_id_2    0
dtype: int64
Duplicate rows in edge list: 0


### Node Attributes ###
Total rows in node attributes: 168114
Null values in node attributes:
views           0
mature          0
life_time       0
created_at      0
updated_at      0
dead_account    0
language        0
affiliate       0
dtype: int64


- **Edge List**:
  - No null values found in `numeric_id_1` or `numeric_id_2`.
  - No duplicate rows detected.

- **Node Attributes**:
  - No null values in any columns (`views`, `mature`, `life_time`, etc.).
  - Duplicate rows were not checked explicitly for this table in the cell above.

Both datasets are clean and ready for the next step: building the graph using the edge list and enriching it with node attributes.
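
Two further checks can complement the ones above. `DataFrame.duplicated()` compares rows as written, so it would not flag `(a, b)` and `(b, a)` as the same undirected edge, and the node table can be checked for repeated ids. The sketch below uses hypothetical toy frames (`toy_edges`, `toy_features`) rather than the real files.

```python
import numpy as np
import pandas as pd

# Hypothetical toy edge list: (1, 2) and (2, 1) describe the same undirected edge
toy_edges = pd.DataFrame({"numeric_id_1": [1, 2, 3], "numeric_id_2": [2, 1, 4]})

# A row-wise duplicate check misses the reversed pair
plain_duplicates = int(toy_edges.duplicated().sum())

# Order each pair as (smaller id, larger id) before checking
lo = np.minimum(toy_edges["numeric_id_1"], toy_edges["numeric_id_2"])
hi = np.maximum(toy_edges["numeric_id_1"], toy_edges["numeric_id_2"])
canonical = pd.DataFrame({"lo": lo, "hi": hi})
reversed_duplicates = int(canonical.duplicated().sum())

# Hypothetical toy feature table indexed by node id, with id 7 repeated
toy_features = pd.DataFrame({"views": [10, 20, 30]}, index=[7, 8, 7])
duplicate_ids = int(toy_features.index.duplicated().sum())

print(plain_duplicates, reversed_duplicates, duplicate_ids)
```

On the real data, a quick indirect check is to compare the number of edges in the built graph with the number of rows in the edge list: an undirected `nx.Graph` collapses repeated and reversed pairs, so equal counts imply there were none.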

### 3. Build the graph:
   - Use NetworkX to create the graph from the edge list.
   - Add node features to the graph.

In [77]:
OG = nx.from_pandas_edgelist(edgelist, source='numeric_id_1', target='numeric_id_2', create_using=nx.Graph())
print(OG)
print('Is the graph directed ?', OG.is_directed())

Graph with 168114 nodes and 6797557 edges
Is the graph directed ? False


I built an **undirected graph** where:
* Nodes are Twitch streamers.
* Edges are mutual friendships between Twitch streamers.


In [78]:
# Add node features to the graph
nx.set_node_attributes(OG, node_features.to_dict(orient='index'))

# Check correctness of this operation
example_node = list(OG.nodes)[0]
print(f"Attributes for node {example_node}: {OG.nodes[example_node]}")

Attributes for node 98343: {'views': 282, 'mature': 0, 'life_time': 2086, 'created_at': '2012-12-27', 'updated_at': '2018-09-13', 'dead_account': 0, 'language': 'EN', 'affiliate': 0}
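
Inspecting a single node confirms the mechanism but not the coverage. A minimal sketch of a full coverage check follows, using a hypothetical toy graph in which one node has no row in the feature table.

```python
import networkx as nx
import pandas as pd

# Hypothetical toy graph and feature table; node 3 has no feature row
toy = nx.Graph([(1, 2), (2, 3)])
toy_features = pd.DataFrame(
    {"language": ["FR", "EN"], "mature": [0, 1]}, index=[1, 2]
)

nx.set_node_attributes(toy, toy_features.to_dict(orient="index"))

# Nodes whose attribute dictionary is still empty were not covered by the table
missing = [n for n, attr in toy.nodes(data=True) if not attr]
print(missing)
```

Applied to `OG`, an empty `missing` list would indicate that every node received its attributes.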


Another important step is to check for **self-loops** and remove them:

In [79]:
OG.remove_edges_from(nx.selfloop_edges(OG))
print(OG)

Graph with 168114 nodes and 6797557 edges


The edge count is unchanged (6,797,557), so the graph contained no self-loops and no edges were removed.
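
Self-loops can also be counted directly instead of comparing edge totals before and after. A minimal sketch on a hypothetical toy graph:

```python
import networkx as nx

# Hypothetical toy graph; (3, 3) is a self-loop
toy = nx.Graph([(1, 2), (2, 3), (3, 3)])

loops_before = nx.number_of_selfloops(toy)
edges_before = toy.number_of_edges()

# Materialize the self-loop edges first so the graph is not modified while iterating
toy.remove_edges_from(list(nx.selfloop_edges(toy)))

loops_after = nx.number_of_selfloops(toy)
edges_after = toy.number_of_edges()
print(loops_before, edges_before, loops_after, edges_after)
```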


To keep the analysis manageable, I will work with only a part of the Twitch network.
First, I group the nodes by their `language` attribute and count them:


In [80]:
language_counts = defaultdict(int)

# Iteration through nodes
for node, attr in OG.nodes(data=True):
    language = attr.get('language', 'Unknown')
    language_counts[language] += 1

# Print the count for each language
for language, count in language_counts.items():
    print(f"Language: {language}, Count: {count}")

Language: EN, Count: 124411
Language: OTHER, Count: 1429
Language: ZH, Count: 2828
Language: ES, Count: 5699
Language: SV, Count: 854
Language: DE, Count: 9428
Language: RU, Count: 4821
Language: CS, Count: 576
Language: DA, Count: 503
Language: KO, Count: 1215
Language: IT, Count: 1230
Language: NL, Count: 701
Language: PT, Count: 2536
Language: NO, Count: 330
Language: FI, Count: 652
Language: FR, Count: 6799
Language: TR, Count: 772
Language: JA, Count: 1327
Language: HU, Count: 427
Language: TH, Count: 632
Language: PL, Count: 944


With 6,799 nodes, the **French** network is of medium size, so I will use it as a starting point and create an independent subgraph (`G`) from the original graph (`OG`).

In [82]:
# Filter nodes with attribute 'language' = 'FR'
nodes_with_fr = [n for n, attr in OG.nodes(data=True) if attr.get('language') == 'FR']

# Create independent subgraph
G = OG.subgraph(nodes_with_fr).copy()

print(G)
print('Is the graph directed?', G.is_directed())

Graph with 6799 nodes and 123644 edges
Is the graph directed? False


In [83]:
example_node = list(G.nodes)[0]
print(f"Attributes for node {example_node}: {G.nodes[example_node]}")

Attributes for node 32768: {'views': 1183, 'mature': 1, 'life_time': 280, 'created_at': '2018-01-04', 'updated_at': '2018-10-11', 'dead_account': 0, 'language': 'FR', 'affiliate': 1}
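
Note that `subgraph` keeps only the edges whose two endpoints are both in the selected node set, so a node whose neighbours all use other languages becomes isolated. The sketch below illustrates this on a hypothetical toy graph; it is consistent with the minimum degree of 0 reported further below for `G`.

```python
import networkx as nx

# Hypothetical toy path 1-2-3-4 where node 2 is the only non-FR node
toy = nx.Graph([(1, 2), (2, 3), (3, 4)])
nx.set_node_attributes(toy, {1: "FR", 2: "EN", 3: "FR", 4: "FR"}, name="language")

fr_nodes = [n for n, lang in toy.nodes(data="language") if lang == "FR"]
sub = toy.subgraph(fr_nodes).copy()

# Node 1 loses its only edge because its neighbour (node 2) was filtered out
isolated = list(nx.isolates(sub))
components = nx.number_connected_components(sub)
print(sub.number_of_edges(), isolated, components)
```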


### 4. Save the cleaned data for later use.

In this section, I save the graph as GraphML, a format that Gephi can open, for visualisation.

In [84]:
nx.write_graphml(G, "twitch_networkFR.graphml")
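
To reload the file later in Python, `nx.read_graphml` can be used. By default it returns node ids as strings, so `node_type=int` is needed to recover integer ids. A minimal round-trip sketch on a hypothetical toy graph, written to a temporary directory:

```python
import os
import tempfile

import networkx as nx

# Hypothetical toy graph with two attributed nodes
toy = nx.Graph([(1, 2)])
nx.set_node_attributes(
    toy,
    {1: {"language": "FR", "views": 10}, 2: {"language": "FR", "views": 5}},
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.graphml")
    nx.write_graphml(toy, path)
    # node_type=int converts node ids back from strings
    restored = nx.read_graphml(path, node_type=int)

same_nodes = set(restored.nodes) == set(toy.nodes)
same_attrs = dict(restored.nodes[1]) == dict(toy.nodes[1])
print(same_nodes, same_attrs)
```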

### 5. Network Visualisation


The visualisation below was produced with Gephi.

I used the ForceAtlas2 layout and coloured affiliated streamers green and non-affiliated streamers red.


![vis1](vis1.png)

This ForceAtlas2 visualization of the network reveals some key features of the graph:

1. **Clustering**:
    
   There is a **central cluster** of highly connected nodes, likely representing influential or well-integrated streamers within the network. The **peripheral nodes** have few connections and likely represent either new streamers or those with minimal interaction in the network. Most peripheral nodes appear to be non-affiliated streamers, which suggests that less connected streamers are less likely to be Twitch affiliates.

2. **Community Detection**:

   An important point to explore is whether there are distinct communities within the network, and how affiliation and other attributes correlate with these groups. At first sight there is one large central cluster, but it may split into several communities, possibly based on type of content or account seniority.

3. **Role of Peripheral Nodes**:

   What characterizes the streamers in the periphery? Are they newcomers, or do they belong to smaller, isolated clusters?


In [85]:
# Compute the degree of every node (for a directed graph this would be the sum of in- and out-degree)
degrees = [degree for _, degree in G.degree()]

# Compute the minimum, maximum, and average degree
min_degree = min(degrees)
max_degree = max(degrees)
avg_degree = sum(degrees) / len(degrees)

# Print the results
print(f"Minimum Degree: {min_degree}")
print(f"Maximum Degree: {max_degree}")
print(f"Average Degree: {avg_degree:.2f}")

Minimum Degree: 0
Maximum Degree: 2081
Average Degree: 36.37
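
The average degree can be cross-checked from the node and edge counts printed when the subgraph was built, because every edge adds one to the degree of each of its two endpoints. The sketch below does that arithmetic and, on a hypothetical toy graph, shows `nx.degree_histogram` as one way to look at the full degree distribution rather than only its extremes.

```python
import networkx as nx

# Node and edge counts reported above for the French subgraph
n_nodes, n_edges = 6799, 123644
avg_degree_check = 2 * n_edges / n_nodes  # each edge contributes to two degrees

# Hypothetical toy graph: a small star plus one isolated node
toy = nx.Graph([(1, 2), (1, 3), (1, 4)])
toy.add_node(5)

degrees = sorted((d for _, d in toy.degree()), reverse=True)
# histogram[d] is the number of nodes with degree d
histogram = nx.degree_histogram(toy)
print(round(avg_degree_check, 2), degrees, histogram)
```

A gap as wide as the one between the maximum (2081) and the average (36.37) suggests a skewed distribution, so a histogram or a log-scaled plot is usually more informative than the mean alone.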
