# 1.4 Prepare Network Analysis Dataset

This is the fourth part of our complete workflow. We use the subreddit dataset to generate a relationship table that pairs each "parent" thread with every "child" response posted in it.

When we analyze the processed dataset in KNIME, we use only the `creator_id` and `replier_id` columns.

At the end of this notebook, the dataset will contain the following columns:
- parent_thread_id
- thread_creation_date
- post_id
- post_creation_date
- creator_id
- replier_id
- pos_emo
- neg_emo

Please note that you should execute the `2-social-network-analysis-workflow.knwf` KNIME workflow after executing this notebook and before executing `1-5-social-network-analysis.ipynb`.

In [None]:
# !pip install pandas

In [8]:
import pandas as pd

## Import dataset

In [9]:
reddit_df = pd.read_csv(
    '../original-datasets/RedditCrypto-2017.csv',
    header=0,  # treat the first row as a header and replace it with the names below
    names=['parent_thread_id', 'post_id', 'user_id', 'timestamp', 'pos_emo', 'neg_emo'],
)

## Process data for social network analysis

### Remove leading and trailing white space

In [10]:
reddit_df['parent_thread_id'] = reddit_df['parent_thread_id'].str.strip()
reddit_df['post_id'] = reddit_df['post_id'].str.strip()
reddit_df['user_id'] = reddit_df['user_id'].str.strip()

### Extract thread id from file name

In [11]:
reddit_df['parent_thread_id'] = reddit_df['parent_thread_id'].str.slice(start=15, stop=21)
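
To illustrate what the slice does, here is a minimal sketch on made-up values. The path format below is an assumption, chosen so that the thread id occupies character positions 15 to 20; the real file names in the dataset may look different.

```python
import pandas as pd

# Hypothetical file names: a 15-character prefix followed by a 6-character thread id
sample = pd.Series(['data/threads/t_5x9k2a.txt', 'data/threads/t_7b3qzz.txt'])

# Keep the characters at positions 15..20 (stop is exclusive)
thread_ids = sample.str.slice(start=15, stop=21)
print(thread_ids.tolist())
```

Because the slice is purely positional, it only works if every file name shares the same prefix length and id length.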

### Create dataframe with only "parent" posts

In [12]:
creator_posts_df = reddit_df[reddit_df['post_id'] == reddit_df['parent_thread_id']]
creator_posts_df = creator_posts_df.rename(columns={'user_id': 'creator_id', 'timestamp': 'thread_creation_date'})
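
The filter above relies on a "parent" post having a `post_id` equal to its `parent_thread_id`. A small sketch on a toy dataframe (the ids and user names are invented for illustration) shows which rows survive:

```python
import pandas as pd

# Toy data: two threads (t1, t2), each with its opening post and some replies
toy_df = pd.DataFrame({
    'parent_thread_id': ['t1', 't1', 't1', 't2', 't2'],
    'post_id':          ['t1', 'c1', 'c2', 't2', 'c3'],
    'user_id':          ['alice', 'bob', 'alice', 'carol', 'dave'],
})

# A post is a "parent" post when its own id equals the thread id
toy_parents = toy_df[toy_df['post_id'] == toy_df['parent_thread_id']]
toy_parents = toy_parents.rename(columns={'user_id': 'creator_id'})
print(toy_parents)
```

Only the two opening posts remain, and their authors become the thread creators.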

### Join the "parent posts" dataframe with the original dataset

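Match each post in the complete dataset with its corresponding "parent post". Columns that exist in both dataframes receive the suffix `_x` (parent posts) or `_y` (complete dataset); the redundant ones are then dropped.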

In [13]:
joined_posts_df = creator_posts_df.merge(reddit_df, left_on='post_id', right_on='parent_thread_id')
joined_posts_df.drop(['post_id_x', 'pos_emo_x', 'neg_emo_x', 'parent_thread_id_y'], axis=1, inplace=True)
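
The `_x` and `_y` column names come from pandas' default merge suffixes. The following sketch reproduces the join on invented toy data to show which columns the merge creates and that each parent post is also matched with itself:

```python
import pandas as pd

# Toy data: one thread (t1) with its opening post and two replies
toy_df = pd.DataFrame({
    'parent_thread_id': ['t1', 't1', 't1'],
    'post_id':          ['t1', 'c1', 'c2'],
    'user_id':          ['alice', 'bob', 'alice'],
    'pos_emo':          [0.1, 0.2, 0.3],
})
toy_parents = (
    toy_df[toy_df['post_id'] == toy_df['parent_thread_id']]
    .rename(columns={'user_id': 'creator_id'})
)

# Left side: parent posts (suffix _x); right side: all posts (suffix _y)
toy_joined = toy_parents.merge(toy_df, left_on='post_id', right_on='parent_thread_id')
print(sorted(toy_joined.columns))
print(toy_joined[['creator_id', 'user_id', 'post_id_y']])
```

The result has three rows, one of which pairs the opening post with itself. Rows like this, and any other reply written by the thread creator, have identical `creator_id` and `user_id` values.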

### Remove rows with the same creator_id and user_id

In [14]:
joined_posts_df = joined_posts_df[joined_posts_df['creator_id'] != joined_posts_df['user_id']]

### Remove rows where at least one of the users involved is a bot or 'None'

In [15]:
joined_posts_df = joined_posts_df[joined_posts_df['creator_id'] != 'AutoModerator']
joined_posts_df = joined_posts_df[joined_posts_df['creator_id'] != 'None']
joined_posts_df = joined_posts_df[joined_posts_df['user_id'] != 'None']
joined_posts_df = joined_posts_df[joined_posts_df['user_id'] != 'AutoModerator']
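
The four filters can also be written as a single mask with `Series.isin`. This is only an alternative sketch on invented toy data, not the code used in this notebook; the `excluded_users` list is a hypothetical name introduced here:

```python
import pandas as pd

# Toy creator/replier pairs, some involving excluded accounts
toy_edges = pd.DataFrame({
    'creator_id': ['alice', 'AutoModerator', 'carol', 'None'],
    'user_id':    ['bob', 'dave', 'AutoModerator', 'erin'],
})

# Hypothetical list of account names to exclude
excluded_users = ['AutoModerator', 'None']

# Keep a row only if neither side of the pair is in the excluded list
keep = ~toy_edges['creator_id'].isin(excluded_users) & ~toy_edges['user_id'].isin(excluded_users)
toy_edges = toy_edges[keep]
print(toy_edges)
```

Keeping the excluded names in one list makes it easier to add further accounts later.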

### Rename columns for clarity

In [16]:
joined_posts_df = joined_posts_df.rename(
    columns={
        'parent_thread_id_x': 'parent_thread_id', 
        'post_id_y': 'post_id', 
        'timestamp': 'post_creation_date', 
        'pos_emo_y': 'pos_emo', 
        'neg_emo_y': 'neg_emo',
        'user_id': 'replier_id'
    })

### Reorder columns

In [17]:
joined_posts_df = joined_posts_df[['parent_thread_id', 'thread_creation_date', 'post_id', 'post_creation_date', 'creator_id', 'replier_id', 'pos_emo', 'neg_emo']]

## Export dataset to CSV

In [18]:
joined_posts_df.to_csv('../generated-datasets/network-analysis-dataset.csv')
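
Note that `to_csv` writes the dataframe index as an additional, unnamed first column unless `index=False` is passed. The sketch below shows the difference on a toy dataframe written to an in-memory buffer. Whether the downstream KNIME workflow expects that extra column is not stated here, so check before changing the export call.

```python
import io
import pandas as pd

# Toy dataframe with a non-default index, as left behind by row filtering
toy = pd.DataFrame({'creator_id': ['alice'], 'replier_id': ['bob']}, index=[7])

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)

# With index=False only the named columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)

header_with = with_index.getvalue().splitlines()[0]
header_without = without_index.getvalue().splitlines()[0]
print(repr(header_with))
print(repr(header_without))
```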