# PyConversations: A 4chan-based Example

The following is a tutorial notebook that demonstrates how to use `pyconversations` with 4chan data.

Accessing 4chan data does _not_ require secret keys, tokens, or anything of that nature. Instead, we'll use a package called `BASC-py4chan`, which provides [an object-oriented interface](https://basc-py4chan.readthedocs.io/en/latest/index.html) to the 4chan API.

To begin, let's obtain some data using the package and then proceed by integrating the data with PyConversations!

In [1]:
from pprint import pprint

from pyconversations.convo import Conversation
from pyconversations.message import ChanPost

## Data Sample

In [2]:
import basc_py4chan

In [3]:
# get the board we want
board_name = 'news'

board = basc_py4chan.Board(board_name)
board

<Board /news/>

In [4]:
# select a thread on the board

ix = 13
all_thread_ids = board.get_all_thread_ids()
thread_id = all_thread_ids[ix] if ix < len(all_thread_ids) else all_thread_ids[-1]
thread = board.get_thread(thread_id)

thread_id, thread

(893871, <Thread /news/893871, 19 replies>)

In [5]:
# print thread information
print(thread)
print('Sticky?', thread.sticky)
print('Closed?', thread.closed)
print('Replies:', len(thread.replies))

<Thread /news/893871, 19 replies>
Sticky? False
Closed? False
Replies: 19


In [6]:
# print topic post information
topic = thread.topic
print('Topic Repr', topic)
print('Postnumber', topic.post_number)
print('Timestamp', topic.timestamp)
print('Datetime', repr(topic.datetime))
print('Subject', topic.subject)
print('Comment', topic.comment)

Topic Repr <Post /news/893871#893871, has_file: True>
Postnumber 893871
Timestamp 1626922687
Datetime datetime.datetime(2021, 7, 21, 22, 58, 7)
Subject Larry Elder wins fight to enter California recall
Comment LOS ANGELES – A California judge on Wednesday cleared the way for conservative talk radio host Larry Elder to join the field of candidates for an upcoming recall election aimed at removing Democratic Gov. Gavin Newsom from office.<br><br>Elder scored a swift court victory in Sacramento, where he challenged a decision by state election officials to block him from the September recall ballot.<br><br>In a tweet, Elder wrote, “Victory! My next one will be on Sept. 14 at the ballot box.”<br><br>He added: “This isn’t just a victory for me, but a victory for the people of California. And not just those who favor the recall and support me, but all voters, including many who will come to know me.”<br><br>Superior Court Judge Laurie M. Earl disagreed with a state decision that Elder failed

In [7]:
thread.all_posts

[<Post /news/893871#893871, has_file: True>,
 <Post /news/893871#893872, has_file: False>,
 <Post /news/893871#893874, has_file: False>,
 <Post /news/893871#894034, has_file: False>,
 <Post /news/893871#894037, has_file: False>,
 <Post /news/893871#894038, has_file: False>,
 <Post /news/893871#894055, has_file: False>,
 <Post /news/893871#894062, has_file: False>,
 <Post /news/893871#894063, has_file: False>,
 <Post /news/893871#894064, has_file: False>,
 <Post /news/893871#894067, has_file: False>,
 <Post /news/893871#894073, has_file: False>,
 <Post /news/893871#894136, has_file: False>,
 <Post /news/893871#894172, has_file: False>,
 <Post /news/893871#894177, has_file: False>,
 <Post /news/893871#894194, has_file: False>,
 <Post /news/893871#894522, has_file: False>,
 <Post /news/893871#894801, has_file: False>,
 <Post /news/893871#894802, has_file: False>,
 <Post /news/893871#894803, has_file: False>]

## Integration with `pyconversations`

All that's left to do is plug our data directly into `pyconversations`!

In [8]:
# create conversation
conv = Conversation()

In [9]:
for post in thread.all_posts:
    
    # gather up raw text
    raw_text = ((post.subject + ':\n' if post.subject else '') + post.text_comment).strip()
    
    # cleanse text and retrieve references to other posts
    text, rfs = ChanPost.clean_text(raw_text)
    if not rfs and topic.post_number != post.post_id:
        rfs = [topic.post_number]
    rfs = list(map(int, rfs))
    
    # create data for the post constructor
    cons = {
        'uid':        post.post_id,  # unique identifier for the post (mandatory field)
        'created_at': post.datetime,  # datetime of post creation
        'text':       text,  # cleaned plaintext
        'author':     post.name,  # self-assigned name of the poster (likely 'Anonymous')
        'reply_to':   rfs,  # cleaned references to other posts
        'lang_detect': True  # whether or not to attempt language detection
    }
    conv.add_post(ChanPost(**cons))
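
The reply-resolution step in the loop above can be illustrated in isolation. The following is a minimal sketch, not the `pyconversations` implementation: `REF_PATTERN` and `resolve_reply_targets` are hypothetical names introduced here, and the sketch assumes that references appear in the text as `>>` followed by a post number. A post with no explicit references is treated as a reply to the topic post, mirroring the fallback used above.

```python
import re

# Hypothetical pattern for illustration: '>>' followed by a post number.
REF_PATTERN = re.compile(r'>>(\d+)')


def resolve_reply_targets(raw_text, post_id, topic_id):
    """Return the list of post IDs that this post replies to.

    Explicit '>>123' references win; otherwise any post other than the
    topic post is assumed to reply to the topic post.
    """
    refs = [int(ref) for ref in REF_PATTERN.findall(raw_text)]
    if not refs and post_id != topic_id:
        refs = [topic_id]
    return refs


# explicit reference to another post
explicit = resolve_reply_targets(
    '>>894034\nThe man that will upset the California cartel', 894037, 893871
)
# no reference: fall back to the topic post
fallback = resolve_reply_targets('No quote marker here', 893872, 893871)
# the topic post itself replies to nothing
root = resolve_reply_targets('Opening post', 893871, 893871)

print(explicit, fallback, root)
```

In the notebook itself, `ChanPost.clean_text` handles the extraction; the sketch only shows how the resulting `reply_to` lists are shaped.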

In [10]:
# print number of unique posts contained within the Conversation
len(conv.posts)

20

In [11]:
# Conversations can be sub-segmented, which is useful when we have a large collection
# and are unsure whether posts are missing, or want to split out disjoint reply trees.
# Here it most likely returns a single conversation (a copy of what we built),
# since we queried a single thread directly through the API.
# Segmentation matters more when ingesting a heterogeneous collection of posts.
segs = conv.segment()

len(segs)

1
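
To make the idea of "disjoint trees" concrete, here is a minimal sketch that groups posts into connected components of the reply graph. It works on a plain dictionary rather than a `Conversation`, and `segment_posts` is a hypothetical helper introduced for illustration; whether `Conversation.segment()` works this way internally is an assumption.

```python
from collections import defaultdict


def segment_posts(reply_map):
    """Group post IDs into connected components of the undirected reply graph.

    reply_map maps each post ID to the list of post IDs it replies to.
    References to posts that are not in the collection are ignored.
    """
    adjacency = defaultdict(set)
    for uid, parents in reply_map.items():
        for parent in parents:
            if parent in reply_map:
                adjacency[uid].add(parent)
                adjacency[parent].add(uid)

    seen, segments = set(), []
    for start in reply_map:
        if start in seen:
            continue
        # depth-first traversal from an unvisited post
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        segments.append(component)
    return segments


# toy collection: two reply trees plus one post whose parent (99) is missing
reply_map = {1: [], 2: [1], 3: [2], 10: [], 11: [10], 20: [99]}
segments = segment_posts(reply_map)
print(len(segments), segments)
```

A single thread pulled from the API forms one component, which is consistent with the single segment returned above.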

### (Detected) Language Distribution

In [12]:
from collections import Counter

lang_dist = Counter([post.lang for post in conv.posts.values()])
lang_dist

Counter({'en': 18, 'und': 1, 'nl': 1})

### Saving and Loading from the universal format

In [13]:
import json

In [14]:
# saving a conversation to disk
# alternatively: save as a JSONLine file, where each line is a conversation!
j = conv.to_json()
# pprint(j)
with open('4chan_conv.json', 'w') as fh:
    json.dump(j, fh)

In [15]:
# reloading directly from the JSON
with open('4chan_conv.json') as fh:
    conv_reloaded = Conversation.from_json(json.load(fh))
len(conv_reloaded.posts)

20
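
The JSON Lines alternative mentioned in the save cell can be sketched with the standard library alone. The helpers `write_jsonl` and `read_jsonl` are hypothetical names, and the placeholder dictionaries below only stand in for the output of `conv.to_json()`; their keys are assumptions, not the actual schema.

```python
import json
import os
import tempfile


def write_jsonl(records, path):
    """Write one JSON document per line."""
    with open(path, 'w', encoding='utf-8') as fh:
        for record in records:
            fh.write(json.dumps(record) + '\n')


def read_jsonl(path):
    """Read a JSON Lines file back into a list, skipping blank lines."""
    with open(path, encoding='utf-8') as fh:
        return [json.loads(line) for line in fh if line.strip()]


# placeholder records standing in for conv.to_json() output (shape is assumed)
records = [
    {'convo_id': 'CONV_1', 'posts': []},
    {'convo_id': 'CONV_2', 'posts': []},
]

path = os.path.join(tempfile.mkdtemp(), 'convs.jsonl')
write_jsonl(records, path)
restored = read_jsonl(path)
print(len(restored))
```

With real data, each record would come from `conv.to_json()`, and each restored record could be passed to `Conversation.from_json` as in the cell above.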

### Feature Extraction

PyConversations can extract many features that are commonly used in social media analysis, at both the conversation level and the post level.
Let's use its two main featurizer classes:

In [16]:
from pyconversations.feature_extraction import ConversationFeaturizer
from pyconversations.feature_extraction import PostFeaturizer

In [17]:
conv_ft = ConversationFeaturizer(include_post=False)
post_ft = PostFeaturizer()

In [18]:
features = conv_ft.transform(conv)

for k, v in features.items():
    if isinstance(v, dict):
        print(k, ': ', len(v))

pprint(features)

binary :  0
numeric :  420
categorical :  0
{'binary': {},
 'categorical': {},
 'convo_id': 'CONV_893871',
 'numeric': {'age_max': 141271.0,
             'age_mean': 54708.65,
             'age_median': 44449.5,
             'age_min': 0.0,
             'age_std': 41465.149149948804,
             'author_posts_max': 20,
             'author_posts_mean': 20.0,
             'author_posts_median': 20.0,
             'author_posts_min': 20,
             'author_posts_std': 0.0,
             'avg_token_entropy_after-ancestors_max': 0.7335981309118995,
             'avg_token_entropy_after-ancestors_mean': 0.4436810561915885,
             'avg_token_entropy_after-ancestors_median': 0.3664204072307681,
             'avg_token_entropy_after-ancestors_min': 0.2633786505412698,
             'avg_token_entropy_after-ancestors_std': 0.15476113993369517,
             'avg_token_entropy_after-before_max': 0.7020164765999354,
             'avg_token_entropy_after-before_mean': 0.4617702593405227,
   

In [25]:
pid = conv.time_order()[4]
post = conv.posts[pid]

pprint(post.text)

post

'>>894034\nThe man that will upset the California cartel'


4chanPost(4chan::Anonymous::2021-07-22 08:54:34::>>894034
The man that will upset the California ca::tags=)

In [26]:
features = post_ft.transform(post, conv)

for k, v in features.items():
    if isinstance(v, dict):
        print(k, ': ', len(v))

features

binary :  4
numeric :  80
categorical :  3


{'id': 894037,
 'convo_id': 'CONV_893871',
 'binary': {'is_source': 0,
  'is_leaf': 0,
  'is_internal': 1,
  'is_source_author': 1},
 'numeric': {'urls': 0,
  'mentions': 1,
  'chars': 54,
  'tokens': 19,
  'types': 12,
  'uppercase': 2,
  'lowercase': 36,
  'uppercase_ratio': 0.05263157894736842,
  'depth': 2,
  'width': 3,
  'degree': 2,
  'in_degree': 1,
  'out_degree': 1,
  'author_posts': 20,
  'age': 35787.0,
  'reply_time': 463.0,
  'avg_token_entropy_after-ancestors': 0.32762109295194225,
  'avg_token_entropy_after-before': 0.3447214251916658,
  'avg_token_entropy_after-children': 0.32904693516498795,
  'avg_token_entropy_after-descendants': 0.32508887660195473,
  'avg_token_entropy_after-full': 0.32762109295194225,
  'avg_token_entropy_after-parents': 0.3447214251916658,
  'avg_token_entropy_after-siblings': 0.3447214251916658,
  'avg_token_entropy_ancestors-after': 0.2633786505412698,
  'avg_token_entropy_ancestors-before': 0.2633786505412698,
  'avg_token_entropy_ancestors-c