In [None]:
import pandas as pd
import numpy as np

from google.colab import drive

# Config

Mount Google Drive (only works with Google Colab).

In [None]:
drive.mount('/content/drive')
file = '/content/drive/My Drive/ml_projects/reddit_tldr/'

Mounted at /content/drive


# Self-scraped dataset

Load dataset from mounted Google Drive.

In [None]:
df_scraped = pd.read_pickle(file + 'my_reddit_data.pkl')

Investigate shape.

In [None]:
print(f'Number of observations: {df_scraped.shape[0]}')
print(f'Number of features: {df_scraped.shape[1]}')

Number of observations: 1707
Number of features: 6


Take a look at individual stories and their TLDRs.

In [None]:
for i in range(2):
  print(f'Story number {i}:')
  print(df_scraped['story'].iloc[i])
  print('Matching TLDR:')
  print(df_scraped['tldr'].iloc[i])
  print()

Story number 0:
Hi Reddit!

So my GF of over 10 years told me about a week ago that she will spend the NYE with her friends instead of staying home with me and our son. She asked me if I would be ok with it and I told her that it would make me sad but ultimately it's her decision and I won't stop her if that makes her happy. I also suggested that she could take me with her but she said that the other girls are coming alone and it would be weird if I came along.

So I spent the NYE with my son and my GFs mother who is visiting us for the holidays. We had a short phonecall to wish each other happy new year and I tried to be happy and smiling but inside I'm feeling quite sad and I had a lot of trouble going to sleep because I just kept thinking about our situation and why she would choose her friends over me to spend this special moment.

I'm not sure if and how to talk about it with her when she finally decides to come home. I don't like to cause conflict but I think my happiness is also

The scraper works fine and collected around 2000 posts from the relationships subreddit in one week without major errors. However, it cannot scrape archived posts, which makes it impossible to generate a large amount of data in a short time.

However, more data is needed for this kind of task. 

Therefore, the self-scraped data is supplemented with the much larger open-source [reddit dataset](https://www.tensorflow.org/datasets/catalog/reddit), filtered to subreddit = 'r/relationships'.
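The filtering step itself is not part of this notebook. A minimal sketch of what such a filter could look like on a toy DataFrame follows; the toy rows are invented, and the exact label stored in the `subreddit` column (here `b'relationships'`) is an assumption, not something confirmed by the data shown here.

```python
import pandas as pd

# Toy stand-in for the full Reddit dump; all values are invented for illustration.
toy = pd.DataFrame({
    'subreddit': [b'relationships', b'AskReddit', b'relationships'],
    'content': [b'story a', b'story b', b'story c'],
    'summary': [b'tldr a', b'tldr b', b'tldr c'],
})

# Assumed label of the target subreddit (byte string, like the other text fields).
target = b'relationships'

# Keep only rows from the target subreddit and renumber the index.
toy_filtered = toy[toy['subreddit'] == target].reset_index(drop=True)
print(len(toy_filtered))
```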

# TensorFlow Reddit dataset

Load dataset from mounted Google Drive.

In [None]:
df_tflow = pd.read_pickle(file + 'tf_reddit_data_complete.pkl')

Investigate shape.

In [None]:
print(f'Number of observations: {df_tflow.shape[0]}')
print(f'Number of features: {df_tflow.shape[1]}')

Number of observations: 402448
Number of features: 8


Take a look at individual stories and their TLDRs.

In [None]:
for i in range(2):
  print(f'Story number {i}:')
  print(df_tflow['content'].iloc[i])
  print('Matching TLDR:')
  print(df_tflow['summary'].iloc[i])
  print()

Story number 0:
b'First, I just want to thank whoever is reading this. I\'m having a horrible day. Second, no, this isn\'t your typical jealousy as in "I don\'t want you hanging around them, you might cheat on me", it\'s actual envy as in "why can\'t I be more like you? Why are you so good at everything?" \n Before I explain, I guess I\'ll have to give you some context to make things more clear. I am 19 years old and female, and my girlfriend is 19, too; her name is Sarah. We\'ve been in a relationship for about a year and a half now. A few weeks to about a month ago (I don\'t know how long exactly), Sarah got her first job at Dunkin\' Donuts. Her sister took her to a job fair because she used to work at Dunkin\' too, and Sarah applied, met the manager, and left. She didn\'t get a call, but her sister talked to the girl and she got the job pretty much the next day. When she told me she had gotten a job, I tried so hard to be happy for her, but all I could think about was how hard I had

# Merging the datasets

A brief exploration of the data sets revealed the following differences:

* The column names differ
* The number of columns differs, and only two columns in each dataset contain relevant information
* The TensorFlow Reddit dataset stores its text as byte strings (note the `b'...'` prefix in the output above)

The following cells resolve these differences.

Only keep the relevant columns.

In [None]:
df_tflow = df_tflow[['content','summary']]
df_scraped = df_scraped[['story','tldr']]

Match column names.

In [None]:
df_tflow = df_tflow.rename(columns={'content': 'story', 'summary': 'tldr'})

Decode the byte strings to text using UTF-8.

In [None]:
# Decode every column from bytes to str
for c in df_tflow.columns:
  df_tflow[c] = df_tflow[c].str.decode("utf-8")
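To illustrate what the decoding step does, here is a small sketch on an invented toy Series. It uses the same `Series.str.decode` call as above; the sample byte strings are assumptions made up for the example.

```python
import pandas as pd

# Toy byte strings, including a non-ASCII character encoded as UTF-8.
raw = pd.Series([b"I'm fine", b'caf\xc3\xa9'])

# Decode each element from bytes to str.
decoded = raw.str.decode('utf-8')
print(decoded.tolist())
```

After decoding, the `b'...'` prefix and the escaped quotes (`\'`) no longer appear when the text is printed.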

As a last step, we merge the datasets.

In [None]:
df = pd.concat([df_scraped, df_tflow], ignore_index=True)

Investigate shape of the merged dataset.

In [None]:
print(f'Number of observations: {df.shape[0]}')
print(f'Number of features: {df.shape[1]}')

Number of observations: 404155
Number of features: 2
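The reported shapes are consistent with a plain row-wise concatenation: 1707 + 402448 = 404155. A minimal sketch of the same sanity checks on invented toy frames follows; the names `scraped_toy`, `tflow_toy` and `merged_toy` are hypothetical.

```python
import pandas as pd

# Two toy frames with identical column names, standing in for the real datasets.
scraped_toy = pd.DataFrame({'story': ['s1', 's2'], 'tldr': ['t1', 't2']})
tflow_toy = pd.DataFrame({'story': ['s3', 's4', 's5'], 'tldr': ['t3', 't4', 't5']})

merged_toy = pd.concat([scraped_toy, tflow_toy], ignore_index=True)

# The merged frame should contain every row of both inputs.
rows_match = len(merged_toy) == len(scraped_toy) + len(tflow_toy)

# ignore_index=True should produce a fresh 0..n-1 index.
index_is_fresh = merged_toy.index.tolist() == list(range(len(merged_toy)))

# Matching column names mean the concatenation introduces no missing values.
no_missing = not merged_toy.isna().any().any()

print(rows_match, index_is_fresh, no_missing)
```

If the column names had not been aligned beforehand, `pd.concat` would create the union of all columns and fill the gaps with missing values, which the last check would catch.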


Save the dataset to Google Drive.

In [None]:
df.to_pickle(file + 'combined_reddit_data.pkl')
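Optionally, the saved file can be read back to confirm that it round-trips. The sketch below does this for an invented toy frame in a temporary directory rather than on Google Drive; the file name `toy_combined.pkl` is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy frame with the same two columns as the merged dataset.
toy = pd.DataFrame({'story': ['a long story'], 'tldr': ['short']})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_combined.pkl')
    toy.to_pickle(path)              # write the pickle
    restored = pd.read_pickle(path)  # read it back

# True if values, dtypes and shape are identical after the round trip.
round_trip_ok = restored.equals(toy)
print(round_trip_ok)
```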

End of data merging.

Next notebook: data_prep_and_viz.ipynb