# Restructuring and Filtering for Coding

Before the coding can start, the conversation filtering needs to be improved so that
only the kinds of conversations that interest us remain (Ethos attacks, Conflict, Scripts, Interventions).

## Cropping the Conversation Tree

The tree structure should be reduced as follows:

- Merge sequential nodes by the same author into one node (if posted on the same day)
- min-depth: 4
- max-branching: 4, indicating how many more siblings exist on that level
- (optional) thresholds on the minimum number of words and on the variance of the word count should exclude conversations that only have short responses
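The optional length criterion in the last bullet could be sketched roughly as follows. The function name `has_substantial_replies` and both thresholds are assumptions chosen for illustration; they are not values used in this project.

```python
from statistics import mean, pvariance


def has_substantial_replies(texts, min_mean_words=5.0, min_variance=1.0):
    """Return True if the tweets are long enough on average and vary in length.

    Both thresholds are illustrative assumptions and would need tuning.
    """
    word_counts = [len(text.split()) for text in texts]
    if not word_counts:
        return False
    return mean(word_counts) >= min_mean_words and pvariance(word_counts) >= min_variance


short_conversation = ["ok", "lol", "yes", "no"]
long_conversation = [
    "I do not think that argument holds for rural areas",
    "Why not? The data covers every district in the country, including small ones",
    "Fair point",
]
print(has_substantial_replies(short_conversation))  # False: every reply is one word
print(has_substantial_replies(long_conversation))   # True: longer and more varied replies
```

Splitting on whitespace is a crude word count for tweets (URLs, mentions and hashtags each count as one word), which may or may not be acceptable for the filtering.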

In [1]:

# pre_sorting the dataframe
import pandas as pd
import sqlite3
from numpy_ext import rolling_apply
from util.sql_switch import query_sql

fieldnames = ["id", "conversation_id", "author_id", "created_at", "in_reply_to_user_id", "text"]
# query_sql is a utility so that fetching the Twitter data does not have to be
# rewritten for both the Django and the Jupyter context
df = query_sql(fieldnames=fieldnames)

# reduce the size of the df for debugging
# df = df.loc[df["conversation_id"] == 1426273610289848324]

df.sort_values(by=['conversation_id', 'author_id', "created_at"], inplace=True)
df.reset_index(drop=True, inplace=True)
#pd.set_option('display.max_rows', 1000)
#df.loc[[2,1,3]]
data_list = df.to_dict('records')
print("before merging there are {} tweets".format(len(data_list)))

df.head(5)

ImproperlyConfigured: Requested setting INSTALLED_APPS, but settings are not configured. You must either define the environment variable DJANGO_SETTINGS_MODULE or call settings.configure() before accessing settings.

## Merge subsequent nodes

- pandas rolling does not work well on whole dataframes
- plain Python is used instead (assuming the list preserves the order achieved by the sort above)

In [None]:
result = []
duplicates_ids = []
for current in data_list:
    previous = result[-1] if result else None
    if (previous is not None
            and previous["author_id"] == current["author_id"]
            and previous["conversation_id"] == current["conversation_id"]):
        # fold the tweet into the preceding node of the same author and conversation,
        # so runs of three or more tweets are merged and duplicates stay out of result
        # note: the same-day condition is not checked here
        previous["text"] = previous["text"] + "<new_tweet><replyto:"
        previous["text"] = previous["text"] + str(current["in_reply_to_user_id"]) + ">" + current["text"]
        duplicates_ids.append(current["id"])
    else:
        result.append(current)

print("result contains {} tweets".format(len(result)))
print("duplicates contains {} tweets".format(len(duplicates_ids)))
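The merging cell above does not check the same-day condition from the list of criteria. A minimal sketch of how it could be added with pandas follows. It assumes that `created_at` can be parsed by `pd.to_datetime`; the helper name `merge_same_day_runs` and the toy data are hypothetical, and the `<replyto:...>` marker is left out for brevity.

```python
import pandas as pd


def merge_same_day_runs(df):
    """Merge tweets that share conversation, author and calendar day into one row."""
    df = df.sort_values(["conversation_id", "author_id", "created_at"])
    day = pd.to_datetime(df["created_at"]).dt.date.rename("day")
    # after sorting, rows with identical keys are adjacent, so grouping by the
    # three keys is the same as merging sequential tweets
    grouped = df.groupby(["conversation_id", "author_id", day], sort=False)
    merged = grouped.agg(
        id=("id", "first"),
        created_at=("created_at", "first"),
        text=("text", "<new_tweet>".join),
    )
    return merged.reset_index().drop(columns="day")


toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "conversation_id": [10, 10, 10, 10],
    "author_id": [7, 7, 7, 8],
    "created_at": [
        "2021-08-13T10:00:00",
        "2021-08-13T10:05:00",
        "2021-08-14T09:00:00",
        "2021-08-13T11:00:00",
    ],
    "text": ["a", "b", "c", "d"],
})
merged = merge_same_day_runs(toy)
print(merged[["id", "text"]])
```

In the toy data only tweets 1 and 2 are merged, because tweet 3 by the same author was posted on the following day.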

## Crop the branches of the conversations
- we want deeper conversations
- we don't like trees with many branches

We start by reconstructing the trees:

In [None]:
df_merged = pd.DataFrame(result, columns=fieldnames)
df_merged.shape

roots = df_merged[df_merged["in_reply_to_user_id"].isnull()]
not_roots = df_merged[df_merged["in_reply_to_user_id"].notnull()]
print(not_roots.shape)
roots.shape

In [None]:
from delab.TwConversationTree import TreeNode

roots_as_rec = roots.to_dict('records')
not_roots_as_rec = not_roots.to_dict('records')
#roots_as_rec[0:5]

trees_roots = {}
for root_data in roots_as_rec:
    root_data["in_reply_to_user_id"] = root_data["author_id"]
    trees_roots[root_data["conversation_id"]] = TreeNode(root_data)  # root is defined as answering to him/herself

for not_root in not_roots_as_rec:
    if not_root["conversation_id"] in trees_roots:
        trees_roots.get(not_root["conversation_id"]).find_parent_of(TreeNode(not_root))

example_trees = list(trees_roots.values())[0:1]
for example_tree in example_trees:
    pass
    # printing is omitted for the PDF; it shows a rather flat set of trees
    # print(example_tree.data)
    # example_tree.print_tree(0)
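The implementation of `TreeNode` is not shown in this notebook. To illustrate the idea of attaching replies by author, here is a hypothetical stand-in called `SketchNode`; it is an assumption about how such a lookup might work, not the actual `delab.TwConversationTree.TreeNode`.

```python
class SketchNode:
    """Hypothetical stand-in for a conversation tree node (not delab's TreeNode)."""

    def __init__(self, data):
        self.data = data
        self.children = []

    def find_parent_of(self, node):
        """Attach node below the first tweet whose author it replies to (depth-first)."""
        if node.data["in_reply_to_user_id"] == self.data["author_id"]:
            self.children.append(node)
            return True
        # any() stops at the first subtree that accepted the node
        return any(child.find_parent_of(node) for child in self.children)


# the root replies to itself, mirroring the convention used in the cell above
root = SketchNode({"id": 1, "author_id": 100, "in_reply_to_user_id": 100})
replies = [
    {"id": 2, "author_id": 200, "in_reply_to_user_id": 100},
    {"id": 3, "author_id": 300, "in_reply_to_user_id": 200},
    {"id": 4, "author_id": 400, "in_reply_to_user_id": 999},  # parent author not in the tree
]
attached = [root.find_parent_of(SketchNode(reply)) for reply in replies]
print(attached)
```

With this approach the order of insertion matters: a reply can only be attached once the tweet it answers is already in the tree, and a reply to an author who posted several times ends up below the first matching tweet.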

In [None]:
# filtering out the trees that are too short
useful_trees = []
trees = list(trees_roots.values())
for tree in trees:
    if tree.get_max_path_length() > 3:
        #  print(tree.get_max_path_length())
        useful_trees.append(tree)

print("we found {} useful trees".format(len(useful_trees)))

useful_number_of_tweets = 0
for useful_tree in useful_trees:
    useful_number_of_tweets += useful_tree.flat_size()
    # useful_tree.print_tree(0)

print("we found {} useful tweets".format(useful_number_of_tweets))
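The two measures used in this cell can be illustrated on plain nested dictionaries. This is only a sketch: it assumes that the path length counts nodes rather than edges, which may differ from what `get_max_path_length` does, and the helper `node` is hypothetical.

```python
def max_path_length(tree):
    """Number of nodes on the longest root-to-leaf path."""
    return 1 + max((max_path_length(child) for child in tree["children"]), default=0)


def flat_size(tree):
    """Total number of nodes in the tree."""
    return 1 + sum(flat_size(child) for child in tree["children"])


def node(*children):
    """Build a toy tree node from its children."""
    return {"children": list(children)}


deep = node(node(node(node())))      # one chain of four tweets
flat = node(node(), node(), node())  # a root with three direct replies
kept = [tree for tree in (deep, flat) if max_path_length(tree) > 3]
print(len(kept), flat_size(deep), flat_size(flat))
```

Both toy trees contain four tweets, but only the chain passes the depth filter, which is why a flat set of trees loses most of its conversations at this step.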

Next, we crop the children that branch too much:

In [None]:
useful_number_of_tweets = 0
for useful_tree in useful_trees:
    useful_tree.crop_orphans()
    useful_number_of_tweets += useful_tree.flat_size()
    # useful_tree.print_tree(0)

print("we found {} useful tweets".format(useful_number_of_tweets))
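How `crop_orphans` works is not shown here. As an illustration of the max-branching criterion from the list at the top (at most 4 children, plus a count of the remaining siblings), a sketch on nested dictionaries could look like this. Keeping the children with the largest subtrees and the key name `omitted_siblings` are assumptions.

```python
def subtree_size(tree):
    """Total number of nodes in the tree."""
    return 1 + sum(subtree_size(child) for child in tree["children"])


def crop_branches(tree, max_branching=4):
    """Keep at most max_branching children per node, preferring larger subtrees.

    The number of dropped siblings is stored under the key "omitted_siblings".
    """
    ranked = sorted(tree["children"], key=subtree_size, reverse=True)
    tree["omitted_siblings"] = max(len(ranked) - max_branching, 0)
    tree["children"] = ranked[:max_branching]
    for child in tree["children"]:
        crop_branches(child, max_branching)
    return tree


def leaf():
    """Build a toy node without replies."""
    return {"children": []}


# a root with six replies, one of which has a reply of its own
tree = {"children": [leaf(), leaf(), {"children": [leaf()]}, leaf(), leaf(), leaf()]}
crop_branches(tree)
print(len(tree["children"]), tree["omitted_siblings"], subtree_size(tree))
```

Ranking by subtree size keeps the reply that started a sub-discussion and drops two of the childless replies, which fits the goal of preferring deeper conversations.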

After cleaning the conversations so that they resemble a classical discussion, only about 1% of the tweets are kept.
This raises the question of whether Twitter is the best source for discussion analytics.