# PyConversations: A 4chan-based Example

This tutorial notebook demonstrates how to use `pyconversations` with 4chan data.

Interfacing with 4chan data _does not_ require any secret keys, tokens, or anything of that nature. Instead, we'll use a package called `BASC-py4chan`, which provides [an object-oriented interface](https://basc-py4chan.readthedocs.io/en/latest/index.html) for interacting with the 4chan API.

To begin, let's obtain some data using the package and then proceed by integrating the data with PyConversations!
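To illustrate why no credentials are needed: 4chan publishes a read-only JSON API that is fetched with plain, unauthenticated HTTP GET requests. The sketch below only builds request URLs and never touches the network. The helper names `thread_list_url` and `thread_url` are hypothetical, and the endpoint layout is our reading of the public 4chan API documentation, not something `BASC-py4chan` exposes in this form.

```python
# Base host of 4chan's read-only JSON API (assumed from the public API docs).
API_ROOT = "https://a.4cdn.org"


def thread_list_url(board: str) -> str:
    """URL listing the thread IDs on a board (hypothetical helper)."""
    return f"{API_ROOT}/{board}/threads.json"


def thread_url(board: str, thread_id: int) -> str:
    """URL for the full JSON of a single thread (hypothetical helper)."""
    return f"{API_ROOT}/{board}/thread/{thread_id}.json"


print(thread_list_url("news"))
print(thread_url("news", 894563))
```

In practice `BASC-py4chan` hides these details behind its `Board` and `Thread` objects, which is what the rest of this notebook uses.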

In [1]:
from pprint import pprint

from pyconversations.convo import Conversation
from pyconversations.message import ChanPost

## Data Sample

In [2]:
import basc_py4chan

In [3]:
# get the board we want
board_name = 'news'

board = basc_py4chan.Board(board_name)
board

<Board /news/>

In [4]:
# select a thread on the board

ix = 13
all_thread_ids = board.get_all_thread_ids()
thread_id = all_thread_ids[ix] if ix < len(all_thread_ids) else all_thread_ids[-1]
thread = board.get_thread(thread_id)

thread_id, thread

(894563, <Thread /news/894563, 29 replies>)

In [5]:
# print thread information
print(thread)
print('Sticky?', thread.sticky)
print('Closed?', thread.closed)
print('Replies:', len(thread.replies))

<Thread /news/894563, 29 replies>
Sticky? False
Closed? False
Replies: 29


In [6]:
# print topic post information
topic = thread.topic
print('Topic Repr', topic)
print('Postnumber', topic.post_number)
print('Timestamp', topic.timestamp)
print('Datetime', repr(topic.datetime))
print('Subject', topic.subject)
print('Comment', topic.comment)

Topic Repr <Post /news/894563#894563, has_file: True>
Postnumber 894563
Timestamp 1627019044
Datetime datetime.datetime(2021, 7, 23, 1, 44, 4)
Subject Republicans go full cancel culture against an American corporation for the glory of Isreali
Comment https://www.thedailybeast.com/gop-lawmakers-want-to-cancel-ice-cream-after-ben-and-jerrys-fiasco<br><br>Sen. James Lankford (R-OK) is among a vexed group of lawmakers coming forward with threats to cancel ice cream after Ben &amp; Jerry’s announced this week it wouldn’t renew its current license agreement with its manufacturer in Israel which also distributes frozen treats to the West Bank.<br><br>“If Ben &amp; Jerry’s wants to have a meltdown &amp; boycott Israel, OK is ready to respond. Oklahoma has an anti-boycott of Israel law in place,” Lankford wrote on Twitter Wednesday. More than two dozen states have laws opposing boycotts of Israel.<br><br>The Oklahoma Republican urged his state to “immediately block the sale of all #Benandjerrys
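Note that the raw `comment` field printed above is HTML: line breaks appear as `<br>` and ampersands as `&amp;`. The integration step later in this notebook uses `post.text_comment`, the plaintext counterpart. As a rough sketch of what such a conversion involves, the standard-library snippet below handles the cases visible in this output. `html_comment_to_text` is a hypothetical helper, not how `BASC-py4chan` implements `text_comment`.

```python
import html
import re


def html_comment_to_text(comment: str) -> str:
    """Convert a 4chan HTML comment into plaintext (hypothetical helper)."""
    # Turn <br> variants into newlines before stripping the remaining tags.
    text = re.sub(r"<br\s*/?>", "\n", comment, flags=re.IGNORECASE)
    # Drop any other tags, e.g. <a> quote links or <span> wrappers.
    text = re.sub(r"<[^>]+>", "", text)
    # Decode entities such as &amp; last, so a decoded '<' is never stripped.
    return html.unescape(text)


sample = "Ben &amp; Jerry's announced<br><br>More than two dozen states"
print(html_comment_to_text(sample))
```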

In [7]:
thread.all_posts

[<Post /news/894563#894563, has_file: True>,
 <Post /news/894563#894564, has_file: False>,
 <Post /news/894563#894566, has_file: False>,
 <Post /news/894563#894570, has_file: False>,
 <Post /news/894563#894573, has_file: False>,
 <Post /news/894563#894576, has_file: False>,
 <Post /news/894563#894582, has_file: False>,
 <Post /news/894563#894583, has_file: False>,
 <Post /news/894563#894587, has_file: False>,
 <Post /news/894563#894588, has_file: False>,
 <Post /news/894563#894590, has_file: False>,
 <Post /news/894563#894592, has_file: False>,
 <Post /news/894563#894594, has_file: False>,
 <Post /news/894563#894635, has_file: False>,
 <Post /news/894563#894642, has_file: False>,
 <Post /news/894563#894643, has_file: False>,
 <Post /news/894563#894648, has_file: False>,
 <Post /news/894563#894651, has_file: False>,
 <Post /news/894563#894655, has_file: False>,
 <Post /news/894563#894657, has_file: False>,
 <Post /news/894563#894665, has_file: False>,
 <Post /news/894563#894694, has_fil

## Integration with `pyconversations`

All that's left to do is plug our data directly into `pyconversations`!

In [8]:
# create conversation
conv = Conversation()

In [9]:
for post in thread.all_posts:
    
    # gather up raw text
    raw_text = ((post.subject + ':\n' if post.subject else '') + post.text_comment).strip()
    
    # cleanse text and retrieve references to other posts
    text, rfs = ChanPost.clean_text(raw_text)
    if not rfs and topic.post_number != post.post_id:
        rfs = [topic.post_number]
    rfs = list(map(int, rfs))
    
    # create data for the post constructor
    cons = {
        'uid':        post.post_id,  # unique identifier for the post (mandatory field)
        'created_at': post.datetime,  # datetime of post creation
        'text':       text,  # cleaned plaintext
        'author':     post.name,  # self-assigned name of the poster (likely 'Anonymous')
        'reply_to':   rfs,  # cleaned references to other posts
        'lang_detect': True  # whether or not to attempt language detection
    }
    conv.add_post(ChanPost(**cons))
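The loop above relies on `ChanPost.clean_text` to pull reply references out of the text, and falls back to the topic post when a reply names no one. On 4chan, a reply conventionally quotes another post as `>>` followed by its number. The sketch below mimics that logic with a regular expression. It is a minimal illustration under that assumption, `extract_references` is a hypothetical helper, and the library's actual cleaning may differ.

```python
import re

# Matches quote links such as ">>894563" (assumed 4chan convention).
QUOTE_LINK = re.compile(r">>(\d+)")


def extract_references(raw_text, post_id, topic_id):
    """Return (cleaned_text, reply_ids) for one post (hypothetical helper)."""
    refs = [int(m) for m in QUOTE_LINK.findall(raw_text)]
    # Remove the quote links and trim the leftover whitespace.
    cleaned = QUOTE_LINK.sub("", raw_text).strip()
    # A reply with no explicit reference is treated as a reply to the topic.
    if not refs and post_id != topic_id:
        refs = [topic_id]
    return cleaned, refs


print(extract_references(">>894564\nI disagree", 894566, 894563))
print(extract_references("No quote here", 894570, 894563))
print(extract_references("Topic text", 894563, 894563))
```

The fallback keeps every reply attached to the thread, so the conversation forms a single tree rooted at the topic post.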

In [10]:
# print number of unique posts contained within the Conversation
len(conv.posts)

30

In [11]:
# Conversations can be sub-segmented, which helps when we have a large collection and are unsure whether posts are missing, or when we want to splice out disjoint trees.
# Here this will likely return a single conversation (a copy of what we built), since we queried one thread directly through the API.
# Segmentation is more relevant when ingesting a heterogeneous collection of posts.
segs = conv.segment()

len(segs)

1
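Conceptually, segmenting amounts to finding the connected components of the reply graph. The sketch below shows that idea on plain dictionaries. It is not the implementation behind `conv.segment()`, and the function name and input shape (a mapping from post ID to the IDs it replies to) are assumptions made for illustration.

```python
from collections import defaultdict


def segment(reply_to):
    """Split posts into disjoint groups connected by replies (hypothetical helper).

    `reply_to` maps each post ID to the list of post IDs it replies to.
    """
    # Build an undirected adjacency map, ignoring references to missing posts.
    neighbors = defaultdict(set)
    for uid, parents in reply_to.items():
        for parent in parents:
            if parent in reply_to:
                neighbors[uid].add(parent)
                neighbors[parent].add(uid)

    seen, segments = set(), []
    for start in reply_to:
        if start in seen:
            continue
        # Depth-first walk that collects everything reachable from `start`.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbors[node] - component)
        seen |= component
        segments.append(component)
    return segments


# Posts 1-3 form one tree; post 5 replies to post 4, which is not in the collection.
posts = {1: [], 2: [1], 3: [2], 5: [4]}
print(segment(posts))
```

With a single thread fetched in full, every post connects back to the topic, which is why only one segment is expected here.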

### (Detected) Language Distribution

In [12]:
from collections import Counter

lang_dist = Counter([post.lang for post in conv.posts.values()])
lang_dist

Counter({'en': 28, 'hr': 1, 'es': 1})

### Saving and Loading from the universal format

In [13]:
import json

In [14]:
# saving a conversation to disk
# alternatively: save as a JSON Lines file, where each line is a conversation!
j = conv.to_json()
# pprint(j)
with open('4chan_conv.json', 'w') as fh:
    json.dump(j, fh)

In [15]:
# reloading directly from the JSON
with open('4chan_conv.json') as fh:
    conv_reloaded = Conversation.from_json(json.load(fh))
len(conv_reloaded.posts)

30
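The saving cell above mentions storing one conversation per line as an alternative. A minimal sketch of that layout follows, using placeholder dictionaries in place of real `conv.to_json()` output (we assume only that this output is JSON-serializable, as the `json.dump` call implies). The helper names and the file name are hypothetical.

```python
import json
import os
import tempfile


def save_jsonl(conversations, path):
    """Write one JSON document per line (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as fh:
        for convo in conversations:
            fh.write(json.dumps(convo) + "\n")


def load_jsonl(path):
    """Read back a list of JSON documents, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Placeholder dicts standing in for the output of `conv.to_json()`.
convos = [{"id": "a", "posts": 2}, {"id": "b", "posts": 5}]
path = os.path.join(tempfile.mkdtemp(), "4chan_convs.jsonl")
save_jsonl(convos, path)
print(load_jsonl(path) == convos)
```

Each loaded dictionary could then be passed to `Conversation.from_json`, exactly as in the single-file case.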

### Feature Extraction

PyConversations can extract many features that are useful for social media analysis, at both the conversation level and the post level.
Let's use the main feature extraction engine:

In [16]:
from pyconversations.feature_extraction import ConversationFeaturizer
from pyconversations.feature_extraction import PostFeaturizer

In [17]:
conv_ft = ConversationFeaturizer(include_post=False)
post_ft = PostFeaturizer()

In [18]:
features = conv_ft.transform(conv)
pprint(features)

{'binary': {},
 'categorical': {},
 'convo_id': 'CONV_894563',
 'numeric': {'chars': 9651,
             'connections': 29,
             'density': 0.06666666666666667,
             'duration': 42541.0,
             'internal_nodes': 16,
             'internal_ratio': 0.5333333333333333,
             'leaf_ratio': 0.43333333333333335,
             'leaves': 13,
             'mentions': 0,
             'messages': 30,
             'source_ratio': 0.03333333333333333,
             'sources': 1,
             'tokens': 3445,
             'tree_degree': 10,
             'tree_depth': 7,
             'tree_width': 10,
             'types': 1312,
             'urls': 3,
             'user_post_ratio': 0.03333333333333333,
             'users': 1}}
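Several of the ratios in this output can be reproduced from the raw counts. The arithmetic below reflects our reading of how the numbers relate (for example, density as the number of connections divided by the number of possible undirected pairs). It is a consistency check under those assumed definitions, not a statement of how `ConversationFeaturizer` computes them.

```python
import math

# Raw counts copied from the output above.
messages, connections = 30, 29
sources, leaves, internal_nodes = 1, 13, 16

# Assumed definition: edges divided by the number of possible undirected pairs.
density = connections / (messages * (messages - 1) / 2)

# Assumed definition: each role count divided by the number of messages.
source_ratio = sources / messages
leaf_ratio = leaves / messages
internal_ratio = internal_nodes / messages

print(math.isclose(density, 0.06666666666666667))
print(math.isclose(leaf_ratio, 0.43333333333333335))
print(math.isclose(internal_ratio, 0.5333333333333333))
print(math.isclose(source_ratio, 0.03333333333333333))
```

A tree with 30 posts has 29 reply edges, so `connections` equalling `messages - 1` is consistent with the single segment found earlier.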


In [19]:
pid = conv.time_order()[0]
post = conv.posts[pid]

pprint(post.text)

post

('Republicans go full cancel culture against an American corporation for the '
 'glory of Isreali:\n'
 'https://www.thedailybeast.com/gop-lawmakers-want-to-cancel-ice-cream-after-ben-and-jerrys-fiasco\n'
 '\n'
 'Sen. James Lankford (R-OK) is among a vexed group of lawmakers coming '
 'forward with threats to cancel ice cream after Ben & Jerry’s announced this '
 'week it wouldn’t renew its current license agreement with its manufacturer '
 'in Israel which also distributes frozen treats to the West Bank.\n'
 '\n'
 '“If Ben & Jerry’s wants to have a meltdown & boycott Israel, OK is ready to '
 'respond. Oklahoma has an anti-boycott of Israel law in place,” Lankford '
 'wrote on Twitter Wednesday. More than two dozen states have laws opposing '
 'boycotts of Israel.\n'
 '\n'
 'The Oklahoma Republican urged his state to “immediately block the sale of '
 'all #Benandjerrys” to Oklahomans.\n'
 '\n'
 'The meltdown comes after the top-selling ice cream maker said a statement '
 'that it was “

4chanPost(4chan::Anonymous::2021-07-23 01:44:04::Republicans go full cancel culture against an Amer::tags=)

In [20]:
features = post_ft.transform(post, conv)

for k, v in features.items():
    if isinstance(v, dict):
        print(k, ': ', len(v))

features

binary :  3
numeric :  80
categorical :  3


{'id': 894563,
 'convo_id': 'CONV_894563',
 'binary': {'is_source': 1, 'is_leaf': 0, 'is_internal': 0},
 'numeric': {'urls': 1,
  'mentions': 0,
  'chars': 1732,
  'tokens': 599,
  'types': 188,
  'uppercase': 67,
  'lowercase': 1324,
  'uppercase_ratio': 0.04816678648454349,
  'depth': 0,
  'width': 1,
  'degree': 10,
  'in_degree': 10,
  'out_degree': 0,
  'author_posts': 30,
  'age': 0.0,
  'reply_time': 0,
  'avg_token_entropy_after-ancestors': 0.24444395748108105,
  'avg_token_entropy_after-before': 0.24444395748108105,
  'avg_token_entropy_after-children': 0.24444395748108105,
  'avg_token_entropy_after-descendants': 0.24444395748108105,
  'avg_token_entropy_after-full': 0.24444395748108105,
  'avg_token_entropy_after-parents': 0.24444395748108105,
  'avg_token_entropy_after-siblings': 0.24444395748108105,
  'avg_token_entropy_ancestors-after': 0.24444395748108105,
  'avg_token_entropy_ancestors-before': 0.24444395748108105,
  'avg_token_entropy_ancestors-children': 0.24444395748

In [21]:
conv_ft = ConversationFeaturizer()

In [None]:
features = conv_ft.transform(conv)
pprint(features)