# Restructuring and Filtering for Coding

Before coding can start, the conversation filtering needs to improve so that only
the kinds of conversation we are interested in remain (Ethos attacks, Conflict, Scripts, Interventions).

## Cropping the Conversation Tree

The tree structure should be reduced as follows:

- Merge sequential nodes by the same author into one node (if posted on the same day)
- min-depth: 4
- max-branching: 4, and indicate how many more siblings exist on each level
- (optional) A minimum number of words per reply and a minimum variance in the number of words should exclude conversations that only have short responses.
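The optional word-count rule could be sketched as below. This is only an illustration under stated assumptions: the function names and the thresholds `min_mean_words` and `min_var_words` are hypothetical, and words are counted by splitting on whitespace.

```python
import pandas as pd


def conversation_word_stats(df: pd.DataFrame) -> pd.DataFrame:
    """Mean and sample variance of words per tweet, per conversation."""
    word_counts = df["text"].str.split().str.len()
    stats = word_counts.groupby(df["conversation_id"]).agg(["mean", "var"])
    # the variance is NaN for conversations with a single tweet
    return stats.fillna(0.0)


def filter_short_conversations(df: pd.DataFrame,
                               min_mean_words: float = 5.0,  # hypothetical threshold
                               min_var_words: float = 1.0    # hypothetical threshold
                               ) -> pd.DataFrame:
    """Keep only conversations whose replies are long and varied enough."""
    stats = conversation_word_stats(df)
    long_enough = stats["mean"] >= min_mean_words
    varied_enough = stats["var"] >= min_var_words
    keep = stats[long_enough & varied_enough].index
    return df[df["conversation_id"].isin(keep)]
```

Suitable thresholds would have to be tuned on the actual data, for example by looking at the distribution of `conversation_word_stats(df)` first.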

In [6]:
# pre_sorting the dataframe
import pandas as pd
import sqlite3
from numpy_ext import rolling_apply
from util.sql_switch import query_sql

fieldnames = ["id", "conversation_id", "author_id", "created_at", "in_reply_to_user_id", "text"]
# utility so the Twitter data query does not have to be rewritten for both the Django and the Jupyter context
df = query_sql(fieldnames=fieldnames)
# reducing the size of the df for debugging
# df = df.loc[df["conversation_id"] == 1426273610289848324]

df.sort_values(by=['conversation_id', 'author_id', "created_at"], inplace=True)
df.reset_index(drop=True, inplace=True)
#pd.set_option('display.max_rows', 1000)
#df.loc[[2,1,3]]
data_list = df.to_dict('records')
print("before merging there are {} tweets".format(len(data_list)))

df.head(5)

before merging there are 9319 tweets


| | id | conversation_id | author_id | created_at | in_reply_to_user_id | text |
|---|---|---|---|---|---|---|
| 0 | 134 | 1422780001183834117 | 44196397 | 2021-08-04 04:42:40 | | All 6 engines mounted to first orbital Starship https://t.co/l5QnQRSg3D |
| 1 | 140 | 1422780001183834117 | 354265431 | 2021-08-22 21:46:21 | 44196397.0 | @elonmusk https://t.co/1mEPBQPUPT #Bitcoin #investment #cryptotrading #cryptocurrency #Coinbase |
| 2 | 141 | 1422780001183834117 | 828880051949166592 | 2021-08-22 20:48:40 | 44196397.0 | @elonmusk Xvg |
| 3 | 154 | 1422780001183834117 | 892416621406547969 | 2021-08-21 11:15:49 | 44196397.0 | @elonmusk @SpaceX I should this produces as much emissions as the whole of the UK. |
| 4 | 138 | 1422780001183834117 | 957329002708029440 | 2021-08-23 00:43:19 | 44196397.0 | @elonmusk https://t.co/Dc7QENaC4g |


## Merge subsequent nodes

- pandas `rolling` does not work well on whole dataframes
- using plain Python instead (assuming the list preserves the sort order established above)

In [7]:
result = []
duplicates_ids = []
for tweet in data_list:
    previous = result[-1] if result else None
    # note: the same-day condition from the list above is not checked here yet
    if (previous is not None
            and previous["author_id"] == tweet["author_id"]
            and previous["conversation_id"] == tweet["conversation_id"]):
        # in_reply_to_user_id is a float (or NaN) in the dataframe, so format it instead of concatenating
        previous["text"] += "<new_tweet><replyto:{}>{}".format(tweet["in_reply_to_user_id"], tweet["text"])
        duplicates_ids.append(tweet["id"])
    else:
        # first tweet of a new run; later tweets by the same author are merged into it
        result.append(tweet)

print("result contains {} tweets".format(len(result)))
print("duplicates contains {} tweets".format(len(duplicates_ids)))
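For comparison, the merge can also be sketched with a pandas `groupby` instead of `rolling`. Because the dataframe is sorted by conversation, author and time, grouping by conversation, author and calendar day collects exactly the runs that should be merged, including the same-day condition. The function name `merge_same_day_tweets` and the helpers are hypothetical, and this sketch joins texts with `<new_tweet>` only, without the `<replyto:...>` marker.

```python
import pandas as pd

FIELDNAMES = ["id", "conversation_id", "author_id", "created_at",
              "in_reply_to_user_id", "text"]


def _first_value(series: pd.Series):
    """Value of the first row in the group, even if it is NaN."""
    return series.iloc[0]


def _join_texts(series: pd.Series) -> str:
    return "<new_tweet>".join(series)


def merge_same_day_tweets(df: pd.DataFrame) -> pd.DataFrame:
    """Merge tweets by the same author in the same conversation on the same day."""
    ordered = df.sort_values(["conversation_id", "author_id", "created_at"])
    ordered = ordered.assign(day=pd.to_datetime(ordered["created_at"]).dt.date)
    merged = (
        ordered.groupby(["conversation_id", "author_id", "day"], sort=False)
        .agg(
            id=("id", _first_value),
            created_at=("created_at", _first_value),
            in_reply_to_user_id=("in_reply_to_user_id", _first_value),
            text=("text", _join_texts),
        )
        .reset_index()
    )
    # drop the helper column and restore the original column order
    return merged[FIELDNAMES]
```

Grouping on the key columns directly avoids comparing shifted 64-bit IDs, which pandas would convert to floats and thereby lose precision.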

## Crop the branches of the conversations
- we want deeper conversations
- we don't like trees with many branches

In [None]:
from twitter import TwConversationTree

# rebuild a dataframe from the merged records (a list of dicts)
merged_df = pd.DataFrame(result, columns=fieldnames)
merged_df.head(3)
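The cropping rules (min-depth 4, max-branching 4, plus a count of the siblings that were cut) could look roughly like the sketch below. The `Node` class is a hypothetical stand-in, since the interface of `TwConversationTree` is not shown here, and keeping the deepest children when cropping is an assumption rather than a decided rule.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical tree node standing in for the project's conversation tree."""
    tweet_id: int
    children: List["Node"] = field(default_factory=list)
    hidden_siblings: int = 0  # number of children removed below this node by cropping


def depth(node: Node) -> int:
    """Number of nodes on the longest path from this node down to a leaf."""
    return 1 + max((depth(child) for child in node.children), default=0)


def crop(node: Node, max_branching: int = 4) -> Node:
    """Return a copy that keeps at most max_branching children per node.

    The deepest subtrees are kept (an assumption); the number of dropped
    children is stored in hidden_siblings of the parent.
    """
    ranked = sorted(node.children, key=depth, reverse=True)
    kept = ranked[:max_branching]
    return Node(
        tweet_id=node.tweet_id,
        children=[crop(child, max_branching) for child in kept],
        hidden_siblings=len(ranked) - len(kept),
    )


def is_deep_enough(root: Node, min_depth: int = 4) -> bool:
    """True if the conversation has at least min_depth levels."""
    return depth(root) >= min_depth
```

A conversation would then be kept only if `is_deep_enough(root)` holds, and displayed as `crop(root)`.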
