# PyConversations: A 4chan-based Example

The following is a tutorial notebook that demonstrates how to use `pyconversations` with 4chan data.

Interfacing with 4chan data _does not_ require secret keys, tokens, or anything of that nature. Instead, we'll use a package called `BASC-py4chan`, which provides [an object-oriented interface](https://basc-py4chan.readthedocs.io/en/latest/index.html) to the 4chan API.

To begin, let's obtain some data using the package and then proceed by integrating the data with PyConversations!

In [1]:
from pprint import pprint

from pyconversations.convo import Conversation
from pyconversations.message import ChanPost

## Data Sample

In [2]:
import basc_py4chan

In [3]:
# get the board we want
board_name = 'news'

board = basc_py4chan.Board(board_name)
board

<Board /news/>

In [4]:
# select a thread on the board

ix = 1
all_thread_ids = board.get_all_thread_ids()
thread_id = all_thread_ids[ix] if ix < len(all_thread_ids) else all_thread_ids[-1]
thread = board.get_thread(thread_id)

thread_id, thread

(897109, <Thread /news/897109, 65 replies>)

In [5]:
# print thread information
print(thread)
print('Sticky?', thread.sticky)
print('Closed?', thread.closed)
print('Replies:', len(thread.replies))

<Thread /news/897109, 65 replies>
Sticky? False
Closed? False
Replies: 65


In [6]:
# print topic post information
topic = thread.topic
print('Topic Repr', topic)
print('Postnumber', topic.post_number)
print('Timestamp', topic.timestamp)
print('Datetime', repr(topic.datetime))
print('Subject', topic.subject)
print('Comment', topic.comment)

Topic Repr <Post /news/897109#897109, has_file: True>
Postnumber 897109
Timestamp 1627428379
Datetime datetime.datetime(2021, 7, 27, 19, 26, 19)
Subject Failed Trump Coup Updats: Trump Supporters Labeled as Terrorists, Police Rebuke of Trump&#039;s Big Lie
Comment https://www.reuters.com/world/us/police-who-defended-us-capitol-testify-riot-probes-first-hearing-2021-07-27/<br><br>WASHINGTON, July 27 (Reuters) - Four police officers on Tuesday told lawmakers they were beaten, taunted with racial insults, heard threats including &quot;kill him with his own gun&quot; and thought they might die as they struggled to defend the U.S. Capitol on Jan. 6 against a mob of then-President Donald Trump&#039;s supporters.<br><br>Often tearful, sometimes profane, the officers called the rioters &quot;terrorists&quot; engaged in an &quot;attempted coup&quot; during a 3-1/2 hour congressional hearing in which they also criticized Republican lawmakers who have sought to downplay the attack.<br><br>&quot;I
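The raw `comment` field above is HTML: line breaks arrive as `<br>` and quotes as entities such as `&quot;` and `&#039;`. That is why the integration loop further down reads `post.text_comment` instead. As a rough illustration of what such a conversion involves, here is a standard-library sketch. The helper `comment_to_text` is hypothetical and is not how `BASC-py4chan` implements `text_comment`.

```python
import html
import re


def comment_to_text(comment_html):
    """Approximate plain text for a 4chan HTML comment (illustrative only)."""
    # turn explicit line breaks into newlines
    text = re.sub(r'<br\s*/?>', '\n', comment_html)
    # drop any remaining tags (e.g. quote links or spans)
    text = re.sub(r'<[^>]+>', '', text)
    # decode entities last, so decoded '>' characters are not mistaken for tags
    return html.unescape(text)


raw = 'Trump&#039;s &quot;Big Lie&quot;<br><br>next paragraph'
plain = comment_to_text(raw)
print(plain)
```

Decoding entities after stripping tags keeps reply markers such as `&gt;&gt;897109` intact as `>>897109`.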

In [7]:
thread.all_posts

[<Post /news/897109#897109, has_file: True>,
 <Post /news/897109#897110, has_file: False>,
 <Post /news/897109#897112, has_file: False>,
 <Post /news/897109#897113, has_file: False>,
 <Post /news/897109#897119, has_file: False>,
 <Post /news/897109#897153, has_file: False>,
 <Post /news/897109#897241, has_file: False>,
 <Post /news/897109#897244, has_file: False>,
 <Post /news/897109#897248, has_file: False>,
 <Post /news/897109#897249, has_file: False>,
 <Post /news/897109#897253, has_file: False>,
 <Post /news/897109#897259, has_file: False>,
 <Post /news/897109#897267, has_file: False>,
 <Post /news/897109#897276, has_file: False>,
 <Post /news/897109#897342, has_file: False>,
 <Post /news/897109#897344, has_file: False>,
 <Post /news/897109#897345, has_file: False>,
 <Post /news/897109#897348, has_file: False>,
 <Post /news/897109#897349, has_file: False>,
 <Post /news/897109#897378, has_file: False>,
 <Post /news/897109#897381, has_file: False>,
 <Post /news/897109#897385, has_fil

## Integration with `pyconversations`

All that's left to do is plug our data directly into `pyconversations`!

In [8]:
# create conversation
conv = Conversation()

In [9]:
for post in thread.all_posts:
    
    # gather up raw text
    raw_text = ((post.subject + ':\n' if post.subject else '') + post.text_comment).strip()
    
    # cleanse text and retrieve references to other posts
    text, rfs = ChanPost.clean_text(raw_text)
    # a reply that quotes no other post is treated as a reply to the topic post
    if not rfs and topic.post_number != post.post_id:
        rfs = [topic.post_number]
    rfs = list(map(int, rfs))
    
    # create data for the post constructor
    cons = {
        'uid':        post.post_id,  # unique identifier for the post (mandatory field)
        'created_at': post.datetime,  # datetime of post creation
        'text':       text,  # cleaned plaintext
        'author':     post.name,  # self-assigned name of the poster (likely 'Anonymous')
        'reply_to':   rfs,  # cleaned references to other posts
        'lang_detect': True  # whether or not to attempt language detection
    }
    conv.add_post(ChanPost(**cons))
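To make the reply-linking step above more concrete, here is a minimal sketch of how `>>12345`-style quote links could be pulled out of a post's text, with the same fallback to the topic post. The function `find_reply_targets` and its regular expression are assumptions for illustration. They are not the implementation of `ChanPost.clean_text`, which may behave differently.

```python
import re

# a 4chan quote link looks like '>>897109'; greentext ('>some quote') has no digits after '>'
QUOTE_LINK = re.compile(r'>>(\d+)')


def find_reply_targets(text, post_id, topic_id):
    """Return the post ids that `text` replies to (hypothetical helper)."""
    refs = [int(ref) for ref in QUOTE_LINK.findall(text)]
    # remove duplicates while keeping first-seen order
    refs = list(dict.fromkeys(refs))
    # a reply without explicit references is attached to the topic post
    if not refs and post_id != topic_id:
        refs = [topic_id]
    return refs


print(find_reply_targets('>>897109\n>the officers called the rioters', 897112, 897109))
print(find_reply_targets('no explicit reference here', 897110, 897109))
```

The topic post itself gets an empty list, which makes it the root of the reply tree.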

In [10]:
# print number of unique posts contained within the Conversation
len(conv.posts)

66

In [11]:
list(conv.posts.items())[:5]

[(897109,
  4chanPost(4chan::Anonymous::2021-07-27 19:26:19::Failed Trump Coup Updats: Trump Supporters Labeled::tags=)),
 (897110,
  4chanPost(4chan::Anonymous::2021-07-27 19:26:36::"There was an attack carried out on Jan. 6, and a ::tags=)),
 (897112,
  4chanPost(4chan::Anonymous::2021-07-27 19:27:33::>>897109
  >the officers called the rioters "terrori::tags=)),
 (897113,
  4chanPost(4chan::Anonymous::2021-07-27 19:28:13::Hodges said many rioters appeared to be white nati::tags=)),
 (897119,
  4chanPost(4chan::Anonymous::2021-07-27 19:29:49::Officers label Jan. 6 coup forces as 'terrorists' ::tags=))]

In [12]:
# Conversations can be segmented into disjoint reply trees, which is useful when a large
# collection may be missing posts or may mix several unrelated threads.
# Here we expect a single segment (a copy of what we built), since we queried one thread directly from the API.
# Segmentation matters more when ingesting a heterogeneous collection of posts.
segs = conv.segment()

len(segs)

1
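For intuition about what "disjoint trees" means here, the sketch below groups posts into connected components of the reply graph using only plain dictionaries. It is an assumption-laden illustration, not the algorithm behind `conv.segment()`. The helper `split_into_trees` is hypothetical, and ignoring references to posts missing from the collection is a choice made for this example.

```python
from collections import defaultdict


def split_into_trees(reply_to):
    """Group post ids into connected components.

    `reply_to` maps each post id to the list of post ids it replies to.
    """
    neighbors = defaultdict(set)
    for pid, parents in reply_to.items():
        neighbors[pid]  # make sure isolated posts get an entry
        for parent in parents:
            if parent in reply_to:  # skip references to posts we do not hold
                neighbors[pid].add(parent)
                neighbors[parent].add(pid)

    seen, components = set(), []
    for start in reply_to:
        if start in seen:
            continue
        # depth-first traversal from an unvisited post
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbors[node] - component)
        seen |= component
        components.append(component)
    return components


# two threads plus one orphan whose parent (99) is missing from the collection
toy = {1: [], 2: [1], 3: [2], 10: [], 11: [10], 20: [99]}
parts = split_into_trees(toy)
print(len(parts))
```

With a single thread fetched directly from the API, every post links back to the topic, so one component is the expected outcome.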

### (Detected) Language Distribution

In [13]:
from collections import Counter

lang_dist = Counter([post.lang for post in conv.posts.values()])
lang_dist

Counter({'en': 63, 'und': 3})

### Saving to and Loading from the Universal Format

In [14]:
import json

In [15]:
# saving a conversation to disk
# alternatively: save as a JSON Lines file, where each line is a conversation!
j = conv.to_json()
# pprint(j)
with open('4chan_conv.json', 'w') as fh:
    json.dump(j, fh)

In [16]:
# reloading directly from the JSON
with open('4chan_conv.json') as fh:
    conv_reloaded = Conversation.from_json(json.load(fh))
len(conv_reloaded.posts)

66
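The JSON Lines alternative mentioned in the saving cell can be sketched with the standard library alone: one JSON document per line, one line per conversation. The records below are hypothetical stand-ins for the output of `conv.to_json()`, whose real structure is not shown here. The helpers `write_jsonl` and `read_jsonl` are likewise illustrative names.

```python
import json
import os
import tempfile


def write_jsonl(records, path):
    """Write one JSON document per line."""
    with open(path, 'w', encoding='utf-8') as fh:
        for record in records:
            fh.write(json.dumps(record) + '\n')


def read_jsonl(path):
    """Read a JSON Lines file back into a list, skipping blank lines."""
    with open(path, encoding='utf-8') as fh:
        return [json.loads(line) for line in fh if line.strip()]


# stand-ins for [conv.to_json() for conv in conversations]
records = [{'posts': [{'uid': 1}]}, {'posts': [{'uid': 2}, {'uid': 3}]}]

path = os.path.join(tempfile.mkdtemp(), 'convs.jsonl')
write_jsonl(records, path)
restored = read_jsonl(path)
print(len(restored))
```

In the notebook's setting, each restored record would then be passed to `Conversation.from_json`, as in the reload cell above.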

### Feature Extraction

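PyConversations can extract many features commonly used in social media analysis, at four levels: a post on its own, a post within its conversation, a whole conversation, and a user within a conversation.
Let's call the `get_all` function from each feature-extraction module: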

In [17]:
from pyconversations.feature_extraction.post import get_all as post_all
from pyconversations.feature_extraction.post_in_conv import get_all as pic_all
from pyconversations.feature_extraction.conv import get_all as conv_all
from pyconversations.feature_extraction.user_in_conv import get_all as uic_all

In [18]:
ix = 0
pid = conv.time_order()[ix]
pid

897109

In [19]:
post = conv.posts[pid]
# post, post.text
print(post.text)

Failed Trump Coup Updats: Trump Supporters Labeled as Terrorists, Police Rebuke of Trump's Big Lie:
https://www.reuters.com/world/us/police-who-defended-us-capitol-testify-riot-probes-first-hearing-2021-07-27/

WASHINGTON, July 27 (Reuters) - Four police officers on Tuesday told lawmakers they were beaten, taunted with racial insults, heard threats including "kill him with his own gun" and thought they might die as they struggled to defend the U.S. Capitol on Jan. 6 against a mob of then-President Donald Trump's supporters.

Often tearful, sometimes profane, the officers called the rioters "terrorists" engaged in an "attempted coup" during a 3-1/2 hour congressional hearing in which they also criticized Republican lawmakers who have sought to downplay the attack.

"I feel like I went to hell and back to protect the people in this room," said District of Columbia police officer Michael Fanone, referring to lawmakers. "The indifference shown to my colleagues is disgraceful," Fanone added

In [20]:
post_all(post, ignore_keys={'tokens', 'type_frequency'})

{'is_source': True,
 'author': 'Anonymous',
 'language': 'en',
 'platform': '4chan',
 'mixing_k1': 0.99001,
 'mixing_theta': 0.8701154588269338,
 'mixing_entropy': 0.7091912711970605,
 'mixing_N_avg': 85.76752540270044,
 'mixing_M_avg': 660.3366699999999,
 '?_count': 0,
 '!_count': 0,
 'char_count': 2071,
 'emoji_count': 0,
 'hashtag_count': 0,
 'mention_count': 0,
 'degree_out': 0,
 'punct_count': 54,
 'token_count': 667,
 'type_count': 229,
 'uppercase_count': 72,
 'url_count': 1,
 'emojis': [],
 'hashtags': [],
 'mentions': [],
 'urls': ['https://www.reuters.com/world/us/police-who-defended-us-capitol-testify-riot-probes-first-hearing-2021-07-27/']}

In [21]:
pic_all(post, conv, ignore_keys={'tokens', 'type_frequency'})

{'is_leaf': False,
 'is_internal': False,
 'is_author_source_author': True,
 'is_source': True,
 'relative_age': 0.0,
 'response_time': 0.0,
 'avg_token_entropy_after-ancestors': 0.21656746540860072,
 'avg_token_entropy_after-before': 0.21656746540860072,
 'avg_token_entropy_after-children': 0.21656746540860072,
 'avg_token_entropy_after-descendants': 0.21656746540860072,
 'avg_token_entropy_after-full': 0.21656746540860072,
 'avg_token_entropy_after-parents': 0.21656746540860072,
 'avg_token_entropy_after-siblings': 0.21656746540860072,
 'avg_token_entropy_ancestors-after': 0.36204062585428004,
 'avg_token_entropy_ancestors-before': 0.3946187483311423,
 'avg_token_entropy_ancestors-children': 0.36802046702918634,
 'avg_token_entropy_ancestors-descendants': 0.36204062585428004,
 'avg_token_entropy_ancestors-full': 0.36204062585428004,
 'avg_token_entropy_ancestors-parents': 0.3946187483311423,
 'avg_token_entropy_ancestors-siblings': 0.3946187483311423,
 'avg_token_entropy_before-after

In [22]:
uic_all('Anonymous', conv)

{'is_source_author': True,
 'mixing_k1': 0.99001,
 'mixing_theta': 0.9034635625937951,
 'mixing_entropy': 0.5857216250628748,
 'mixing_N_avg': 917.9694287985446,
 'mixing_M_avg': 9509.046049999999,
 'message_count': 65,
 'types': 1781,
 'leaf_count': 31,
 'internal_count': 33,
 'author_source_author_count': 65,
 'source_count': 1,
 'degree': 144,
 'degree_in': 72,
 '?_count': 15,
 '!_count': 14,
 'char_count': 27580,
 'emoji_count': 0,
 'hashtag_count': 0,
 'mention_count': 54,
 'degree_out': 72,
 'punct_count': 625,
 'token_count': 9605,
 'uppercase_count': 878,
 'url_count': 21,
 'post_agg': {'relative_age': {'min': 0.0,
   'max': 66081.0,
   'std': 22236.51092824086,
   'mean': 36730.876923076925,
   'median': 45293.0},
  'response_time': {'min': 0.0,
   'max': 66081.0,
   'std': 20222.75916940289,
   'mean': 12289.76923076923,
   'median': 1379.0},
  'avg_token_entropy_after-ancestors': {'min': 0.21656746540860072,
   'max': 0.6071738196531219,
   'std': 0.07344171567151166,
   'me

In [23]:
conv_all(conv, ignore_keys={'type_frequency_distribution'})

{'degree_size_distribution': Counter({23: 1,
          1: 31,
          3: 10,
          2: 19,
          4: 3,
          5: 1,
          7: 1}),
 'degree_in_size_distribution': Counter({23: 1,
          0: 32,
          2: 10,
          1: 20,
          3: 2,
          4: 1}),
 'degree_out_size_distribution': Counter({0: 1, 1: 62, 2: 1, 3: 1, 6: 1}),
 'depth_distribution': Counter({0: 1,
          1: 23,
          2: 12,
          3: 9,
          4: 6,
          5: 5,
          6: 6,
          7: 3,
          8: 1}),
 'user_size_distribution': Counter({65: 1, 1: 1}),
 'density': 0.034032634032634033,
 'duration': 66081.0,
 'mixing_k1': 0.99001,
 'mixing_theta': 0.9035082254041784,
 'mixing_entropy': 0.5855232931190802,
 'mixing_N_avg': 924.040619958085,
 'mixing_M_avg': 9576.36673,
 'messages': 66,
 'tree_degree': 23,
 'tree_depth': 8,
 'tree_width': 23,
 'types': 1795,
 'users': 2,
 'leaf_count': 32,
 'internal_count': 33,
 'author_source_author_count': 65,
 'source_count': 1,
 'degr
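As a sanity check on the numbers above, the reported `density` can be reproduced from the out-degree distribution, assuming it is the standard undirected graph density $2m / (n(n-1))$ for $n$ posts and $m$ reply edges. That formula is an assumption about how the library defines the feature. It happens to match the printed value.

```python
from collections import Counter

# copied from 'degree_out_size_distribution' in the output above
degree_out = Counter({0: 1, 1: 62, 2: 1, 3: 1, 6: 1})

# number of posts: how many posts fall into any out-degree bucket
n_posts = sum(degree_out.values())

# number of reply edges: each post contributes its out-degree
n_edges = sum(degree * count for degree, count in degree_out.items())

# undirected graph density (assumed definition)
density = 2 * n_edges / (n_posts * (n_posts - 1))

print(n_posts, n_edges, round(density, 6))
```

This gives 66 posts and 73 reply edges, and 73 / 2145 is roughly 0.034, which agrees with both `messages` and `density` in the output.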

In [16]:
from pyconversations.feature_extraction import ConversationFeaturizer
from pyconversations.feature_extraction import PostFeaturizer

In [17]:
conv_ft = ConversationFeaturizer(include_post=False)
post_ft = PostFeaturizer()

In [18]:
features = conv_ft.transform(conv)

for k, v in features.items():
    if isinstance(v, dict):
        print(k, ': ', len(v))

pprint(features)

binary :  0
numeric :  420
categorical :  0
{'binary': {},
 'categorical': {},
 'convo_id': 'CONV_893871',
 'numeric': {'age_max': 141271.0,
             'age_mean': 54708.65,
             'age_median': 44449.5,
             'age_min': 0.0,
             'age_std': 41465.149149948804,
             'author_posts_max': 20,
             'author_posts_mean': 20.0,
             'author_posts_median': 20.0,
             'author_posts_min': 20,
             'author_posts_std': 0.0,
             'avg_token_entropy_after-ancestors_max': 0.7335981309118995,
             'avg_token_entropy_after-ancestors_mean': 0.4436810561915885,
             'avg_token_entropy_after-ancestors_median': 0.3664204072307681,
             'avg_token_entropy_after-ancestors_min': 0.2633786505412698,
             'avg_token_entropy_after-ancestors_std': 0.15476113993369517,
             'avg_token_entropy_after-before_max': 0.7020164765999354,
             'avg_token_entropy_after-before_mean': 0.4617702593405227,
   

In [25]:
pid = conv.time_order()[4]
post = conv.posts[pid]

pprint(post.text)

post

'>>894034\nThe man that will upset the California cartel'


4chanPost(4chan::Anonymous::2021-07-22 08:54:34::>>894034
The man that will upset the California ca::tags=)

In [26]:
features = post_ft.transform(post, conv)

for k, v in features.items():
    if isinstance(v, dict):
        print(k, ': ', len(v))

features

binary :  4
numeric :  80
categorical :  3


{'id': 894037,
 'convo_id': 'CONV_893871',
 'binary': {'is_source': 0,
  'is_leaf': 0,
  'is_internal': 1,
  'is_source_author': 1},
 'numeric': {'urls': 0,
  'mentions': 1,
  'chars': 54,
  'tokens': 19,
  'types': 12,
  'uppercase': 2,
  'lowercase': 36,
  'uppercase_ratio': 0.05263157894736842,
  'depth': 2,
  'width': 3,
  'degree': 2,
  'in_degree': 1,
  'out_degree': 1,
  'author_posts': 20,
  'age': 35787.0,
  'reply_time': 463.0,
  'avg_token_entropy_after-ancestors': 0.32762109295194225,
  'avg_token_entropy_after-before': 0.3447214251916658,
  'avg_token_entropy_after-children': 0.32904693516498795,
  'avg_token_entropy_after-descendants': 0.32508887660195473,
  'avg_token_entropy_after-full': 0.32762109295194225,
  'avg_token_entropy_after-parents': 0.3447214251916658,
  'avg_token_entropy_after-siblings': 0.3447214251916658,
  'avg_token_entropy_ancestors-after': 0.2633786505412698,
  'avg_token_entropy_ancestors-before': 0.2633786505412698,
  'avg_token_entropy_ancestors-c