# PyConversations: A 4chan-based Example

The following is a tutorial notebook that demonstrates how to use `pyconversations` with 4chan data.

Interfacing with 4chan data _does not_ require secret keys, tokens, or anything of that nature. Instead, we'll use a package called `BASC-py4chan`, which provides [an object-oriented interface](https://basc-py4chan.readthedocs.io/en/latest/index.html) for interacting with the 4chan API.

To begin, let's obtain some data using the package and then integrate it with PyConversations!

In [1]:
import pprint

from pyconversations.convo import Conversation
from pyconversations.message import ChanPost

## Data Sample

In [2]:
import basc_py4chan

In [3]:
# get the board we want
board_name = 'news'
board = basc_py4chan.Board(board_name)
board

<Board /news/>

In [5]:
# select a thread on the board

ix = 13
all_thread_ids = board.get_all_thread_ids()
thread_id = all_thread_ids[ix] if ix < len(all_thread_ids) else all_thread_ids[-1]
thread = board.get_thread(thread_id)

thread_id, thread

(882179, <Thread /news/882179, 16 replies>)

In [6]:
# print thread information
print(thread)
print('Sticky?', thread.sticky)
print('Closed?', thread.closed)
print('Replies:', len(thread.replies))

<Thread /news/882179, 16 replies>
Sticky? False
Closed? False
Replies: 16


In [7]:
# print topic post information
topic = thread.topic
print('Topic Repr', topic)
print('Postnumber', topic.post_number)
print('Timestamp', topic.timestamp)
print('Datetime', repr(topic.datetime))
print('Subject', topic.subject)
print('Comment', topic.comment)

Topic Repr <Post /news/882179#882179, has_file: True>
Postnumber 882179
Timestamp 1625345123
Datetime datetime.datetime(2021, 7, 3, 16, 45, 23)
Subject India virus strain spread, what excuse this time?
Comment First excuse was we weren&#039;t warned in advance enough, then we &quot;closed borders&quot; but we still got European virus strain spreading all over, now we&#039;ve known over a YEAR about COVID but somehow these new strains from different countries is still spreading all over the world.<br>https://www.cnn.com/2021/07/03/health/us-coronavirus-saturday/index.html
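
Note that the raw `comment` above still contains HTML entities (`&#039;`, `&quot;`) and `<br>` tags, which is why the integration code later in this notebook reads `post.text_comment` instead. As a rough illustration of what such a conversion involves, here is a minimal sketch using only the standard library. The helper `to_plaintext` and the sample string are hypothetical and are not part of `BASC-py4chan`, which may convert comments differently.

```python
import html
import re


def to_plaintext(comment_html: str) -> str:
    """Convert a 4chan-style HTML comment to plain text (illustrative helper)."""
    # Turn <br> tags into newlines before stripping any other markup.
    text = re.sub(r'<br\s*/?>', '\n', comment_html)
    # Drop any remaining tags (e.g. quote links or spans).
    text = re.sub(r'<[^>]+>', '', text)
    # Decode entities last, so a decoded '>' is never mistaken for a tag.
    return html.unescape(text)


sample = 'we weren&#039;t warned, then we &quot;closed borders&quot;<br>https://example.com'
print(to_plaintext(sample))
```

Decoding entities after stripping tags keeps quote markers such as `&gt;&gt;882179` intact as `>>882179`.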


In [8]:
thread.all_posts

[<Post /news/882179#882179, has_file: True>,
 <Post /news/882179#882191, has_file: False>,
 <Post /news/882179#882194, has_file: False>,
 <Post /news/882179#882198, has_file: False>,
 <Post /news/882179#882869, has_file: False>,
 <Post /news/882179#882872, has_file: False>,
 <Post /news/882179#882911, has_file: False>,
 <Post /news/882179#882915, has_file: False>,
 <Post /news/882179#882919, has_file: False>,
 <Post /news/882179#882920, has_file: False>,
 <Post /news/882179#882935, has_file: False>,
 <Post /news/882179#882951, has_file: False>,
 <Post /news/882179#882955, has_file: False>,
 <Post /news/882179#882957, has_file: False>,
 <Post /news/882179#882958, has_file: False>,
 <Post /news/882179#882964, has_file: False>,
 <Post /news/882179#883010, has_file: False>]

## Integration with `pyconversations`

All that's left to do is plug our data directly into `pyconversations`!

In [9]:
# create conversation
conv = Conversation()

In [28]:
for post in thread.all_posts:
    
    # gather up raw text
    raw_text = ((post.subject + ':\n' if post.subject else '') + post.text_comment).strip()
    
    # cleanse text and retrieve references to other posts
    text, rfs = ChanPost.clean_text(raw_text)
    if not rfs and topic.post_number != post.post_id:
        rfs = [topic.post_number]
    rfs = list(map(int, rfs))
    
    # create data for the post constructor
    cons = {
        'uid':        post.post_id,  # unique identifier for the post (mandatory field)
        'created_at': post.datetime,  # datetime of post creation
        'text':       text,  # cleaned plaintext
        'author':     post.name,  # self-assigned name of the poster (likely 'Anonymous')
        'platform':   '4Chan',
        'reply_to':   rfs,  # cleaned references to other posts
        'lang_detect': True  # whether or not to attempt language detection
    }
    conv.add_post(ChanPost(**cons))
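
The loop above relies on `ChanPost.clean_text` to pull out references to other posts, and falls back to the topic post when a reply references nothing. The following is a minimal sketch of that idea, assuming references take the usual `>>12345` form. The names `REF_PATTERN`, `extract_refs`, and `resolve_reply_to` are hypothetical, and this is not necessarily how `ChanPost.clean_text` is implemented.

```python
import re

# Assumed reference syntax: two '>' characters followed by a post number.
REF_PATTERN = re.compile(r'>>(\d+)')


def extract_refs(text):
    """Return the post numbers referenced in the text, as integers."""
    return [int(ref) for ref in REF_PATTERN.findall(text)]


def resolve_reply_to(text, post_id, topic_id):
    """Mirror the fallback above: unreferenced replies point at the topic post."""
    refs = extract_refs(text)
    if not refs and post_id != topic_id:
        refs = [topic_id]
    return refs


print(resolve_reply_to('>>882179\n>close all public travel', 882191, 882179))
print(resolve_reply_to('Build a great wall around India', 882194, 882179))
print(resolve_reply_to('India virus strain spread', 882179, 882179))
```

Greentext lines such as `>close all public travel` are not matched, because the pattern requires two `>` characters followed by digits.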

In [29]:
# print number of unique posts contained within the Conversation
conv.messages

17

In [30]:
# Conversations can be sub-segmented, which is useful when we have a large collection
# and are unsure whether posts are missing, or when we want to splice out disjoint trees.
# Here it likely returns a single conversation (a copy of what we built), since we queried a single thread directly through the API.
# Segmentation is more relevant when ingesting a heterogeneous collection of posts.
segs = conv.segment()

len(segs)

1
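
To make the idea of disjoint trees concrete, here is a minimal sketch that groups posts into connected components of the reply graph. It assumes a hypothetical `reply_map` from post ID to the IDs it replies to, and the function name `segment_ids` is illustrative only. How `Conversation.segment` works internally is not shown in this notebook.

```python
from collections import defaultdict


def segment_ids(reply_map):
    """Group post IDs into connected components of the undirected reply graph."""
    neighbors = defaultdict(set)
    for uid, parents in reply_map.items():
        neighbors[uid]  # touch the key so isolated posts get an entry
        for parent in parents:
            if parent in reply_map:  # ignore references to posts we do not have
                neighbors[uid].add(parent)
                neighbors[parent].add(uid)

    seen, components = set(), []
    for start in reply_map:
        if start in seen:
            continue
        # Depth-first traversal collecting everything reachable from `start`.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbors[node] - component)
        seen |= component
        components.append(component)
    return components


# Hypothetical collection: two threads plus a post whose parent (99) is missing.
reply_map = {1: [], 2: [1], 3: [2], 10: [], 11: [10], 12: [99]}
print(segment_ids(reply_map))
```

With a single thread fetched directly from the API, every post is reachable from the topic post, which matches the single segment reported above.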

### (Detected) Language Distribution

In [31]:
from collections import Counter

lang_dist = Counter([post.lang for post in conv.posts.values()])
lang_dist

Counter({'en': 16, 'it': 1})

### Saving and Loading from the universal format

In [32]:
import json

In [37]:
# saving a conversation to disk
# alternatively: save as a JSON Lines file, where each line is a conversation!
j = conv.to_json()
pprint.pprint(j)
# use a context manager so the file is flushed and closed
with open('4chan_conv.json', 'w') as fp:
    json.dump(j, fp)

[{'author': 'Anonymous',
  'created_at': 1625345123.0,
  'lang': 'en',
  'platform': '4Chan',
  'reply_to': [],
  'tags': [],
  'text': 'India virus strain spread, what excuse this time?:\n'
          "First excuse was we weren't warned in advance enough, then we "
          '"closed borders" but we still got European virus strain spreading '
          "all over, now we've known over a YEAR about COVID but somehow these "
          'new strains from different countries is still spreading all over '
          'the world.\n'
          'https://www.cnn.com/2021/07/03/health/us-coronavirus-saturday/index.html',
  'uid': 882179},
 {'author': 'Anonymous',
  'created_at': 1625346530.0,
  'lang': 'en',
  'platform': '4Chan',
  'reply_to': [882179],
  'tags': [],
  'text': '>>882179\n'
          '>close all public travel\n'
          '>literally anybody with money can rent a flight to the country at '
          "any point since that isn't regulated\n"
          '>but muh "closed borders"\n'
   

In [38]:
# reloading directly from the JSON
with open('4chan_conv.json') as fp:
    conv_reloaded = Conversation.from_json(json.load(fp), ChanPost)
conv_reloaded.messages

17
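
The JSON Lines alternative mentioned in the saving cell can be sketched as follows, with one conversation per line. The records here are hypothetical stand-ins shaped like the list of post dictionaries printed above, and `write_jsonl` / `read_jsonl` are illustrative helpers, not part of `pyconversations`.

```python
import json
import os
import tempfile


def write_jsonl(conversations, path):
    """Write one JSON-encoded conversation per line."""
    with open(path, 'w', encoding='utf-8') as fh:
        for conv_json in conversations:
            fh.write(json.dumps(conv_json) + '\n')


def read_jsonl(path):
    """Read conversations back, skipping blank lines."""
    with open(path, encoding='utf-8') as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Hypothetical data: each conversation is a list of post dictionaries.
conversations = [
    [{'uid': 1, 'text': 'first thread', 'reply_to': []}],
    [{'uid': 2, 'text': 'second thread', 'reply_to': []},
     {'uid': 3, 'text': 'a reply', 'reply_to': [2]}],
]

path = os.path.join(tempfile.mkdtemp(), 'convs.jsonl')
write_jsonl(conversations, path)
restored = read_jsonl(path)
print(len(restored))
```

Each restored entry could then be passed to `Conversation.from_json` in the same way as the single-file example above.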

### Properties for Analysis

In [39]:
conv.messages  # number of messages in the conversation

17

In [40]:
conv.connections  # number of reply connections

16

In [41]:
conv.users  # number of unique users

1

In [42]:
conv.chars  # character size of the entire conversation

3454

In [43]:
conv.tokens  # length of conversation, in tokens

1270

In [44]:
conv.token_types  # number of unique tokens used in convo

360

In [45]:
conv.sources  # IDs of source messages (messages without a reply action, messages that originate dialog)

{882179}

In [46]:
conv.density  # density of the conversation, when represented as a graph

0.11764705882352941
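
The value above is consistent with the standard density of an undirected simple graph, 2m / (n(n - 1)), using the n = 17 messages and m = 16 connections reported earlier. The sketch below checks that arithmetic. Whether `pyconversations` computes density exactly this way is an assumption, and `undirected_density` is a hypothetical helper.

```python
def undirected_density(n_nodes, n_edges):
    """Density of an undirected simple graph: 2m / (n * (n - 1))."""
    if n_nodes < 2:
        return 0.0  # density is not meaningful for fewer than two nodes
    return 2 * n_edges / (n_nodes * (n_nodes - 1))


# 17 messages and 16 reply connections, as reported by conv.messages and conv.connections.
density = undirected_density(17, 16)
print(density)
```

A tree with n nodes always has n - 1 edges, so its density reduces to 2 / n and shrinks as the thread grows.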

In [47]:
conv.degree_hist  # degree histogram of the conversation graph: entry i is the number of posts with degree i

[0, 5, 10, 1, 1]
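
A quick sanity check on this histogram: its entries should sum to the number of messages, and the degree-weighted sum should equal twice the number of connections if each reply edge counts toward both of its endpoints. The sketch below performs that arithmetic on the output above. The interpretation that degree counts replies sent as well as received is an inference from these numbers, not something the library documentation is quoted on here.

```python
degree_hist = [0, 5, 10, 1, 1]  # copied from the output above

# Total number of posts covered by the histogram.
n_posts = sum(degree_hist)

# Sum of degrees: each histogram index is a degree, each value a post count.
degree_sum = sum(degree * count for degree, count in enumerate(degree_hist))

print(n_posts, degree_sum)
```

The totals come to 17 posts and a degree sum of 32, which is twice the 16 connections. Since no post has degree 0, leaf posts (which receive no replies) must be getting their degree from the reply they send.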

In [48]:
conv.tree_depth  # height of the conversational tree

6

In [49]:
conv.tree_width  # width of a depth level is the # of posts at that depth level (distance from source), tree width is the max width of any depth level

4
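
Depth and width can be illustrated with a breadth-first traversal from the source posts. The sketch below uses a small hypothetical `reply_map` (post ID to the IDs it replies to) and measures depth as the number of reply steps from a source. Whether `pyconversations` counts depth in edges or in levels, and how it treats posts with several parents, is not confirmed here.

```python
from collections import Counter, deque


def depth_levels(reply_map):
    """Map each post ID to its shortest distance (in replies) from a source post."""
    children = {uid: [] for uid in reply_map}
    for uid, parents in reply_map.items():
        for parent in parents:
            if parent in children:
                children[parent].append(uid)

    # Sources are posts that reply to nothing.
    sources = [uid for uid, parents in reply_map.items() if not parents]
    depth = {source: 0 for source in sources}
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        for child in children[node]:
            if child not in depth:  # first visit gives the shortest distance
                depth[child] = depth[node] + 1
                queue.append(child)
    return depth


# Hypothetical thread: post 1 is the source.
reply_map = {1: [], 2: [1], 3: [1], 4: [2], 5: [2], 6: [3], 7: [4]}
depth = depth_levels(reply_map)

tree_depth = max(depth.values())                    # deepest level reached
tree_width = max(Counter(depth.values()).values())  # most posts at any one level
print(tree_depth, tree_width)
```

In this toy thread the deepest post (7) is three replies from the source, and the widest level holds three posts (4, 5, and 6).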

In [50]:
conv.start_time, conv.end_time, conv.duration  # time-related properties of the conversation

(datetime.datetime(2021, 7, 3, 16, 45, 23),
 datetime.datetime(2021, 7, 5, 13, 15, 47),
 160224.0)
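
The duration is expressed in seconds and matches the difference between the two datetimes shown above. A short check with the standard library:

```python
from datetime import datetime

# Start and end times copied from the output above.
start = datetime(2021, 7, 3, 16, 45, 23)
end = datetime(2021, 7, 5, 13, 15, 47)

# Subtracting datetimes yields a timedelta; total_seconds() converts it to a float.
duration = (end - start).total_seconds()
print(duration)
```

That is 1 day, 20 hours, 30 minutes, and 24 seconds between the first and last post.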

In [51]:
conv.time_series  # times of posting, in order

[1625345123.0,
 1625346530.0,
 1625347259.0,
 1625348896.0,
 1625465108.0,
 1625466070.0,
 1625483214.0,
 1625484180.0,
 1625487057.0,
 1625488825.0,
 1625493579.0,
 1625497014.0,
 1625497910.0,
 1625498189.0,
 1625498343.0,
 1625499335.0,
 1625505347.0]

In [52]:
conv.text_stream  # text of posts, in temporal order

['India virus strain spread, what excuse this time?:\nFirst excuse was we weren\'t warned in advance enough, then we "closed borders" but we still got European virus strain spreading all over, now we\'ve known over a YEAR about COVID but somehow these new strains from different countries is still spreading all over the world.\nhttps://www.cnn.com/2021/07/03/health/us-coronavirus-saturday/index.html',
 '>>882179\n>close all public travel\n>literally anybody with money can rent a flight to the country at any point since that isn\'t regulated\n>but muh "closed borders"\nJust how retarded is the average /news/ poster for fucks sake.',
 'Build a great wall around India',
 '>>882191\nSo what excuse do we have this time?',
 '>>882194\nmaybe just call the hoax quits.',
 ">>882179\nyou want an excuse or a reason ? \nthere's no real excuse at all but the reason is capitalism.",
 ">>882872\nyou're pretty naive to think that capitalism is the end goal.",
 ">>882911\nyou're a babbling retard",
 ">>