# 1.5 Generate the top users' emotion attribute

This is the final part of our complete workflow. Using the output from KNIME, we determine, for each day, the dominant emotion across all posts made by the three most influential subreddit users.

A day is assigned 1 (positive) if the users' summed positive emotion score exceeds their summed negative score, -1 (negative) if the negative score is higher, and 0 (neutral) if the two are equal.

If none of the top users posted on a particular day, we assume the emotion is neutral and label that date 0.
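The labelling rule above can be sketched on a toy table. This is only an illustration under stated assumptions: the column names `pos` and `neg` and all values are made up, and it uses NumPy's `sign` as a vectorized stand-in for the row-wise function defined later in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical daily totals for the top users; NaN marks a day without posts.
toy = pd.DataFrame({
    "pos": [5.0, 1.0, 2.0, np.nan],
    "neg": [2.0, 4.0, 2.0, np.nan],
})

# sign(pos - neg) yields 1, -1 or 0; a NaN difference (no posts) is set to 0 first.
diff = (toy["pos"] - toy["neg"]).fillna(0)
toy["label"] = np.sign(diff).astype(int)

print(toy["label"].tolist())  # [1, -1, 0, 0]
```

The last row shows the missing-day case: with no posts, the day is treated as neutral.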


At the end of the workflow, the dataset will contain the following columns:
- Date
- is_profitable
- close-low
- close-high
- SMA-close
- EMA-close
- EMA_diff
- SMA_diff
- pos_emo
- neg_emo
- top_influencer_emotion



In [1]:
# !pip install pandas

In [2]:
import pandas as pd

In [3]:
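def determine_emotion(row):
    """Label a day 1 (positive), -1 (negative) or 0 (neutral) for the top influencers."""
    # On days without influencer posts, both *_y scores are NaN after the left
    # merge; comparisons with NaN are False, so those days fall through to 0.
    if row['pos_emo_y'] > row['neg_emo_y']:
        row['top_influencer_emotion'] = 1
    elif row['pos_emo_y'] < row['neg_emo_y']:
        row['top_influencer_emotion'] = -1
    else:
        row['top_influencer_emotion'] = 0
    return row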

### Import and preprocess datasets

In [4]:
influencers_df = pd.read_csv('../generated-datasets/top-influencers.csv', usecols=['Object id']).head(3)

reddit_df = pd.read_csv('../original-datasets/RedditCrypto-2017.csv', header=0, usecols=['Source (C)', 'Source (E)', 'posemo', 'negemo'])
reddit_df = reddit_df.rename(columns={'Source (C)': 'poster', 'Source (E)':'timestamp', 'posemo': 'pos_emo', 'negemo': 'neg_emo'})
reddit_df['poster'] = reddit_df['poster'].str.strip()

processed_reddit_df = pd.read_csv('../generated-datasets/classification-dataset-with-emotion-scores.csv')

### Drop posts that are not by top influencers

In [5]:
reddit_df = reddit_df.merge(right=influencers_df, left_on='poster', right_on='Object id')
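
The merge above relies on the default `how='inner'`, which keeps only rows whose `poster` matches an `Object id`. A minimal sketch of that behaviour follows; the user names and scores are hypothetical, and the `isin` variant is shown only as an alternative, not as what this notebook uses.

```python
import pandas as pd

# Hypothetical data: user names and scores are made up for illustration.
top_users = pd.DataFrame({"Object id": ["alice", "bob"]})
posts = pd.DataFrame({
    "poster": ["alice", "carol", "bob", "alice"],
    "pos_emo": [1.0, 2.0, 3.0, 4.0],
})

# Default how='inner' keeps only posts whose poster appears in top_users.
via_merge = posts.merge(right=top_users, left_on="poster", right_on="Object id")

# Boolean-mask alternative: same rows, without the extra 'Object id' column.
via_isin = posts[posts["poster"].isin(top_users["Object id"])]

print(sorted(via_merge["poster"]))
print(list(via_isin.columns))
```

Either way, the post by `carol` is dropped. The merge also carries the `Object id` column into the result.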

### Take the aggregate sum for each day

In [6]:
reddit_df['timestamp'] = pd.to_datetime(reddit_df['timestamp'])
reddit_df['timestamp'] = reddit_df['timestamp'].dt.date

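# Sum only the emotion scores; the string columns ('poster', 'Object id') should not be aggregated.
reddit_df = reddit_df.groupby('timestamp', as_index=False)[['pos_emo', 'neg_emo']].sum()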
reddit_df['timestamp'] = pd.to_datetime(reddit_df['timestamp'])
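
To make the daily aggregation concrete, here is a small sketch on made-up posts. It assumes the same column names as above but uses `dt.normalize()` to truncate timestamps to midnight, which is a swapped-in alternative to the `.dt.date` round trip in the cell above.

```python
import pandas as pd

# Hypothetical posts: two on the first day, one on the second.
posts = pd.DataFrame({
    "timestamp": ["2017-01-01 09:30:00", "2017-01-01 18:00:00", "2017-01-02 12:00:00"],
    "pos_emo": [1.0, 2.0, 0.5],
    "neg_emo": [0.5, 0.0, 3.0],
})

# Truncate timestamps to midnight so posts from the same day share one key.
posts["timestamp"] = pd.to_datetime(posts["timestamp"]).dt.normalize()

# One row per day, holding the summed emotion scores.
daily = posts.groupby("timestamp", as_index=False)[["pos_emo", "neg_emo"]].sum()
print(daily)
```

Because `normalize()` keeps the `datetime64` dtype, no second `pd.to_datetime` call is needed before joining on the date.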

### Join top influencer posts with other daily statistics

In [7]:
processed_reddit_df['Date'] = pd.to_datetime(processed_reddit_df['Date'])

processed_reddit_df = pd.merge(processed_reddit_df, reddit_df, how='left', left_on=["Date"], right_on=["timestamp"], sort=False)
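
Both tables contain `pos_emo` and `neg_emo`, so pandas appends its default suffixes: `_x` for the left table (all posts) and `_y` for the right table (top influencers). The sketch below, on hypothetical values with only one of the two overlapping columns, shows where the `_y` names used later come from and why days without influencer posts end up as `NaN`.

```python
import pandas as pd

# Hypothetical daily statistics (left) and influencer totals (right).
daily_stats = pd.DataFrame({
    "Date": pd.to_datetime(["2017-01-01", "2017-01-02", "2017-01-03"]),
    "pos_emo": [10.0, 20.0, 30.0],
})
influencer_daily = pd.DataFrame({
    "timestamp": pd.to_datetime(["2017-01-01", "2017-01-03"]),
    "pos_emo": [1.0, 3.0],
})

# A left join keeps every date; overlapping column names get _x / _y suffixes.
merged = pd.merge(daily_stats, influencer_daily, how="left",
                  left_on="Date", right_on="timestamp")

print(merged.columns.tolist())
print(merged["pos_emo_y"].isna().tolist())
```

The middle date has no influencer row, so its `pos_emo_y` is missing.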

### Create a column indicating the dominant emotion of the top influencers' posts for each day

In [8]:
processed_reddit_df = processed_reddit_df.apply(determine_emotion, axis=1)

### Drop unused columns and rename column headers

In [9]:
processed_reddit_df = processed_reddit_df.drop(columns=['Unnamed: 0', 'timestamp', 'pos_emo_y', 'neg_emo_y'])
processed_reddit_df = processed_reddit_df.rename(columns={'pos_emo_x': 'pos_emo', 'neg_emo_x': 'neg_emo'})


### Export dataset

In [10]:
processed_reddit_df.to_csv('../generated-datasets/classification-dataset-with-emotion-scores.csv')