---
author: Julian Dehne
bibliography: references.bib
execute:
  freeze: auto 
---

# Analyzing Social Media Conversation Trees

## Learning Objectives

By the end of this tutorial, you will be able to:

- Analyze the integrity of a social media conversation.
- Use network analysis to extract longer reply paths that might represent actual deliberation.
- Use network analysis to show which author is the most central in the discussion.

## Description
- This notebook introduces the Python library delab_trees and uses several examples to show how it can help when working with social media data.

## Target Audience

- This library is intended for advanced computational social science (CSS) researchers who have a solid background in network analysis and Python.
- Motivated intermediate learners may use some of the tools as a black box to obtain the conversation pathways used later in their research.

## Prerequisites

Before you begin, you should be familiar with the following technologies:

- Python
- NetworkX
- pandas

## Environment Setup

- To run this tutorial, you need Python 3.9 or later.
- The library installs all of its dependencies; just run:

```bash
pip install delab_trees
```

## Social Science Use Cases

This learning resource is useful if you have encountered one of the following use cases:

- deleted posts in your social media data
- interest in author interactions on social media
- huge numbers of conversation trees (scalability)
- discussion mining (finding actual argumentation sequences in social media)


## Sample Input and Output Data 

Example data for Reddit and Twitter are available at https://github.com/juliandehne/delab-trees/raw/main/delab_trees/data/dataset_[reddit|twitter]_no_text.pkl, where `[reddit|twitter]` stands for either `reddit` or `twitter`.
The data contains the conversation structure only. IDs, text, links, and other information that would breach the confidentiality of the academic
access have been omitted.

The trees are loaded from tables like this:

|    |   tree_id |   post_id |   parent_id | author_id   | text        | created_at          |
|---:|----------:|----------:|------------:|:------------|:------------|:--------------------|
|  0 |         1 |         1 |         nan | james       | I am James  | 2017-01-01 01:00:00 |
|  1 |         1 |         2 |           1 | mark        | I am Mark   | 2017-01-01 02:00:00 |
|  2 |         1 |         3 |           2 | steven      | I am Steven | 2017-01-01 03:00:00 |
|  3 |         1 |         4 |           1 | john        | I am John   | 2017-01-01 04:00:00 |
|  4 |         2 |         1 |         nan | james       | I am James  | 2017-01-01 01:00:00 |
|  5 |         2 |         2 |           1 | mark        | I am Mark   | 2017-01-01 02:00:00 |
|  6 |         2 |         3 |           2 | steven      | I am Steven | 2017-01-01 03:00:00 |
|  7 |         2 |         4 |           3 | john        | I am John   | 2017-01-01 04:00:00 |

This dataset contains two conversational trees with four posts each.

Currently, you need to import conversational tables as a pandas dataframe like this:


```python
import pandas as pd
from delab_trees import TreeManager

d = {'tree_id': [1] * 4,
     'post_id': [1, 2, 3, 4],
     'parent_id': [None, 1, 2, 1],
     'author_id': ["james", "mark", "steven", "john"],
     'text': ["I am James", "I am Mark", "I am Steven", "I am John"],
     "created_at": [pd.Timestamp('2017-01-01T01'),
                    pd.Timestamp('2017-01-01T02'),
                    pd.Timestamp('2017-01-01T03'),
                    pd.Timestamp('2017-01-01T04')]}
df = pd.DataFrame(data=d)
# The manager converts the table into trees; this table describes one tree
manager = TreeManager(df)
test_tree = manager.random()
test_tree
```

```
loading data into manager and converting table into trees...
100%|██████████| 1/1 [00:04<00:00,  4.28s/it]
<delab_trees.delab_tree.DelabTree at 0x19379104ac0>
```



Note that the tree structure is based on each row's parent_id matching another row's post_id.
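To make this parent/child matching concrete, the following minimal sketch rebuilds the example tree as a directed graph using only pandas and NetworkX. It illustrates the idea and is not the internal implementation of delab_trees; the helper name `table_to_digraph` is hypothetical.

```python
import networkx as nx
import pandas as pd


def table_to_digraph(df: pd.DataFrame) -> nx.DiGraph:
    """Build a reply graph with one edge parent -> reply (hypothetical helper)."""
    graph = nx.DiGraph()
    for row in df.itertuples(index=False):
        post_id = int(row.post_id)
        graph.add_node(post_id, author_id=row.author_id)
        # The root post has no parent (None/NaN), so it gets no incoming edge.
        if pd.notna(row.parent_id):
            graph.add_edge(int(row.parent_id), post_id)
    return graph


# Same structure as the first tree in the table above (text and dates omitted).
df = pd.DataFrame({
    "tree_id": [1] * 4,
    "post_id": [1, 2, 3, 4],
    "parent_id": [None, 1, 2, 1],
    "author_id": ["james", "mark", "steven", "john"],
})
graph = table_to_digraph(df)
print(sorted(graph.edges()))  # [(1, 2), (1, 4), (2, 3)]
```

Each edge points from the parent post to its reply, so the root is the only node without an incoming edge.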

You can now analyze the basic metrics of the reply tree:


```python
from delab_trees.test_data_manager import get_test_tree
from delab_trees.delab_tree import DelabTree
import warnings
import numpy as np

# Suppress only VisibleDeprecationWarning
# (in NumPy >= 2.0 this class is available as np.exceptions.VisibleDeprecationWarning)
warnings.filterwarnings("ignore", category=np.VisibleDeprecationWarning)

test_tree: DelabTree = get_test_tree()
assert test_tree.average_branching_factor() > 0

print("number of posts in the conversation: ", test_tree.total_number_of_posts())
```

```
loading data into manager and converting table into trees...
100%|██████████| 1/1 [00:04<00:00,  4.22s/it]
number of posts in the conversation:  4
```






## Use Cases

### Use Case 1: Analyze the integrity of the social media conversation

For this, we use the provided sample data, which is anonymized but still stems from real conversations:


```python
from delab_trees.test_data_manager import get_test_manager

manager = get_test_manager()
manager.describe()
```

```
loading data into manager and converting table into trees...
100%|██████████| 6/6 [00:06<00:00,  1.16s/it]
'The dataset contains 6 conversations and 24 posts in total.\nThe average depth of the longest flow per conversation is (2, 4, 3.1666666666666665).\nThe conversations contain 6 authors and the min and max number of authors per conversation is min:2, max: 4, avg: 3.3333333333333335.\nThe average length of the posts is 10.0 characters.\n'
```

To check whether all conversations are valid trees, which is often not the case for social media data, simply call:

```python
manager.validate(break_on_invalid=False, verbose=False)
```

```
 67%|██████▋   | 4/6 [00:00<00:00, 1979.61it/s]
False
```
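As a rough illustration of what "valid tree" can mean here, the sketch below checks a single conversation with plain NetworkX. It assumes that a valid conversation has exactly one root, that every other post has exactly one parent present in the data, and that there are no cycles. The function name `is_valid_conversation_tree` is hypothetical, and the criteria used by `manager.validate` may differ.

```python
import networkx as nx


def is_valid_conversation_tree(post_ids, reply_edges) -> bool:
    """Check that posts and reply edges (parent -> reply) form a rooted tree."""
    graph = nx.DiGraph()
    graph.add_nodes_from(post_ids)
    graph.add_edges_from(reply_edges)
    # An edge that introduces an unknown node means a referenced parent
    # is missing from the data (for example, a deleted post).
    if set(graph.nodes) != set(post_ids):
        return False
    # An arborescence is a directed tree with a single root in which
    # every other node has exactly one parent.
    return len(graph) > 0 and nx.is_arborescence(graph)


intact = is_valid_conversation_tree([1, 2, 3, 4], [(1, 2), (2, 3), (1, 4)])
# Post 2 was deleted, but post 3 still points to it as its parent.
broken = is_valid_conversation_tree([1, 3, 4], [(2, 3), (1, 4)])
print(intact, broken)  # True False
```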


### Use Case 2: Extract Pathways


::: {.columns}
::: {.column width="50%"}
![Photo of marked Pathways](tutorial/img/conversation02.png){#fig-conversationpath width="25%"}
:::
::: {.column width="50%"}
By analogy with offline conversations, we are interested in longer reply chains such as the one depicted in @fig-conversationpath. Here, the nodes are the posts, and the edges, read from top to bottom, indicate that one post answers another. The root of the tree is the original post in the online conversation. Every online forum and social media thread can be modeled this way because every post except the root post has exactly one parent, which matches the mathematical definition of a rooted tree.
:::
:::

The marked path is one of many pathways that can be written down like a transcript of a group discussion. Pathways can be defined as all the paths in a tree that start at the root and end in a leaf (a node without children). This approach extracts linear reply chains from social media (see @Wang2008; @Nishi2016), which can be considered an online equivalent of real-life discussions.
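The definition of a pathway can be sketched with plain NetworkX as follows. This is an illustrative sketch under the assumption that the conversation is stored as a directed graph with edges pointing from parent to reply; it is not the library's own implementation, and the function name `root_to_leaf_paths` is hypothetical.

```python
import networkx as nx


def root_to_leaf_paths(tree: nx.DiGraph) -> list:
    """Return every path that starts at a root and ends in a leaf."""
    roots = [node for node, degree in tree.in_degree() if degree == 0]
    leaves = [node for node, degree in tree.out_degree() if degree == 0]
    paths = []
    for root in roots:
        for leaf in leaves:
            if root == leaf:
                # A conversation consisting of a single post.
                paths.append([root])
                continue
            paths.extend(nx.all_simple_paths(tree, root, leaf))
    return paths


# The first example tree: post 1 is the root, posts 2 and 4 reply to it,
# and post 3 replies to post 2.
tree = nx.DiGraph([(1, 2), (2, 3), (1, 4)])
print(root_to_leaf_paths(tree))  # [[1, 2, 3], [1, 4]]
```

Each returned list is one pathway and can be read like a transcript: post 1, answered by post 2, answered by post 3.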



