# 1.3 Process Reddit Dataset

This is the third part of our complete workflow. We sum the positive and negative emotion scores of the Reddit data by date, then merge the daily totals into the historical prices dataset.
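
As a minimal sketch of the per-date aggregation, the toy frame below stands in for the Reddit data. The values are made up for illustration, and the names `toy_df` and `daily_df` are hypothetical; the column names `timestamp`, `pos_emo`, and `neg_emo` follow the renaming used in this notebook.

```python
import pandas as pd

# Toy stand-in for the Reddit data (illustrative values, not the real dataset)
toy_df = pd.DataFrame({
    'timestamp': pd.to_datetime([
        '2017-01-01 08:00', '2017-01-01 21:30', '2017-01-02 10:15',
    ]),
    'pos_emo': [1.0, 2.5, 4.0],
    'neg_emo': [0.5, 0.5, 1.0],
})

# Reduce each timestamp to its calendar day
toy_df['date'] = toy_df['timestamp'].dt.date

# Sum only the two score columns within each day
daily_df = toy_df.groupby('date', as_index=False)[['pos_emo', 'neg_emo']].sum()
print(daily_df)
```

Selecting the score columns before calling `sum()` keeps the raw timestamps out of the aggregation, so the result has one row per day and exactly three columns.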

At the end of the workflow, the dataset will contain the following columns:
- Date
- is_profitable
- close-low
- close-high
- SMA-close
- EMA-close
- EMA_diff
- SMA_diff
- pos_emo
- neg_emo

In [None]:
# !pip install pandas

In [10]:
import pandas as pd

## Import datasets

In [11]:
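# Price features produced earlier in the workflow
prices_df = pd.read_csv('../generated-datasets/classification-dataset.csv', index_col=0)
# Load only the timestamp and the two emotion score columns from the Reddit data
reddit_df = pd.read_csv('../original-datasets/RedditCrypto-2017.csv', header=0, usecols=['Source (E)', 'posemo', 'negemo'])
# Rename to the column names used in the final dataset
reddit_df = reddit_df.rename(columns={'Source (E)':'timestamp', 'posemo': 'pos_emo', 'negemo': 'neg_emo'})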

## Process data for classification tasks

### Convert date column

In [12]:
reddit_df['timestamp'] = pd.to_datetime(reddit_df['timestamp'])
reddit_df['date'] = reddit_df['timestamp'].dt.date
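
The snippet below is a small sketch, on made-up timestamps, of what this conversion does. `.dt.date` yields plain Python `date` objects in an `object` column, which is why the column is converted back with `pd.to_datetime` before merging. `.dt.normalize()` is shown only as a possible alternative that keeps a datetime dtype; it is not what this notebook uses, and the names `ts`, `as_date`, and `as_midnight` are hypothetical.

```python
import pandas as pd

# Two made-up timestamps that fall on the same calendar day
ts = pd.Series(pd.to_datetime(['2017-03-05 14:20:00', '2017-03-05 23:59:59']))

# Python datetime.date objects; the column dtype becomes object
as_date = ts.dt.date

# Same calendar day with the time set to midnight; stays a datetime column
as_midnight = ts.dt.normalize()

print(as_date.dtype)
print(as_midnight.dtype)
```

Either form groups rows by day; the difference only matters for the dtype of the merge key later on.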

### Aggregate sum on date column

In [13]:
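# Sum only the score columns; summing the raw `timestamp` column is not meaningful
# and fails on recent pandas versions
reddit_df = reddit_df.groupby('date', as_index=False)[['pos_emo', 'neg_emo']].sum()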
reddit_df['date'] = pd.to_datetime(reddit_df['date'])

### Combine reddit dataframe and historical prices dataframe

In [14]:
# reddit_df['date'] was already converted to datetime after the aggregation;
# convert the price dates so both merge keys share the same dtype
prices_df['Date'] = pd.to_datetime(prices_df['Date'])

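# Inner join (the default): keeps only dates present in both dataframes
combined_df = prices_df.merge(reddit_df, left_on='Date', right_on='date')
# Drop the duplicate key column that came from reddit_df
combined_df.drop('date', axis=1, inplace=True)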
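
Because `merge` defaults to an inner join, any date missing from either dataframe is dropped from the result. The sketch below uses hypothetical miniature frames (`prices`, `emotions`) with made-up values to show one way of checking how many rows would be lost, using the `how='outer'` and `indicator=True` options of `DataFrame.merge`.

```python
import pandas as pd

# Hypothetical miniature frames; values are illustrative only
prices = pd.DataFrame({
    'Date': pd.to_datetime(['2017-01-01', '2017-01-02', '2017-01-03']),
    'is_profitable': [1, 0, 1],
})
emotions = pd.DataFrame({
    'date': pd.to_datetime(['2017-01-02', '2017-01-03', '2017-01-04']),
    'pos_emo': [3.5, 4.0, 1.0],
    'neg_emo': [1.0, 1.0, 0.2],
})

# An outer join with indicator=True adds a `_merge` column that records
# whether each row came from the left frame, the right frame, or both
check = prices.merge(emotions, left_on='Date', right_on='date',
                     how='outer', indicator=True)
print(check['_merge'].value_counts())

# The default inner join keeps only dates present in both frames
inner = prices.merge(emotions, left_on='Date', right_on='date').drop(columns='date')
print(len(inner))
```

Rows marked `left_only` are price dates with no Reddit scores, and rows marked `right_only` are Reddit dates with no price data; neither survives the inner join.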

## Export datasets to csv

In [16]:
combined_df.to_csv('../generated-datasets/classification-dataset-with-emotion-scores.csv')