# Tips Prediction #

In this notebook, we will explore the data we retrieved from the subreddits *LifeProTips*, *ShittyLifeProTips*, *UnethicalLifeProTips* and *IllegalLifeProTips*.
Afterwards, we will make some predictions to determine which subreddit a given post belongs to.

---

Let's start by loading the posts into memory

In [15]:
import json

posts = {}

# Import LifeProTips
with open('tips/lifeprotips_dump_1.json', 'r') as lptf:
    posts['lpt'] = json.load(lptf)

# Import ShittyLifeProTips
with open('tips/shittylifeprotips_dump_1.json', 'r') as slptf:
    posts['slpt'] = json.load(slptf)
    
# Import UnethicalLifeProTips
with open('tips/unethicallifeprotips_dump_1.json', 'r') as ulptf:
    posts['ulpt'] = json.load(ulptf)

# Import IllegalLifeProTips
with open('tips/illegallifeprotips_dump_1.json', 'r') as ilptf:
    posts['ilpt'] = json.load(ilptf)
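
The four `with` blocks above follow the same pattern, so they could also be written as a loop. The following is only a sketch: the helper name `load_dumps` is hypothetical and not part of the original notebook, and it assumes every dump is a JSON file that `json.load` can read.

```python
import json


def load_dumps(paths):
    """Load several JSON dumps into a dict keyed by a short subreddit name.

    `paths` maps a short name (e.g. 'lpt') to the path of its JSON dump.
    NOTE: `load_dumps` is a hypothetical helper, not part of the notebook.
    """
    loaded = {}
    for name, path in paths.items():
        with open(path, 'r') as dump_file:
            loaded[name] = json.load(dump_file)
    return loaded
```

With the file names used above, the call would map `'lpt'` to `'tips/lifeprotips_dump_1.json'`, `'slpt'` to `'tips/shittylifeprotips_dump_1.json'`, and so on for the other two subreddits.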

Let's remove any duplicates from the posts

In [16]:
for subreddit in posts:
    obs_ids = set()  # ids observed so far (a set gives fast membership tests)
    unique_posts = []  # unique posts, in their original order
    for post in posts[subreddit]:  # every post is a dictionary
        if post['id'] not in obs_ids:
            obs_ids.add(post['id'])
            unique_posts.append(post)
    posts[subreddit] = unique_posts

Check the number of unique posts we retrieved per subreddit $\rightarrow$ No duplicates!

In [17]:
for (s, a) in posts.items():
    print('{} posts retrieved from {}'.format(len(a), s))

30000 posts retrieved from lpt
30000 posts retrieved from slpt
30000 posts retrieved from ulpt
10000 posts retrieved from ilpt


There are only 10k posts from *IllegalLifeProTips* because the community is relatively young compared to the other subreddits: it was founded a year ago, while the other three appeared 7 years ago.
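
The counts printed above only show how many posts remain per subreddit. A more explicit uniqueness check could look like the sketch below. It assumes, as the deduplication loop does, that every post is a dictionary with an `id` key; the function name `has_duplicate_ids` and the toy data are made up for illustration.

```python
def has_duplicate_ids(post_list):
    """Return True if at least two posts in `post_list` share the same 'id'."""
    ids = [post['id'] for post in post_list]
    return len(set(ids)) != len(ids)


# Toy data for illustration only (not real posts)
toy_posts = [{'id': 'a1'}, {'id': 'b2'}, {'id': 'a1'}]
print(has_duplicate_ids(toy_posts))      # True: 'a1' appears twice
print(has_duplicate_ids(toy_posts[:2]))  # False
```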

---

Now, let's shuffle the posts of each subreddit so that the train/validation/test division is random and even

In [18]:
import random

for (s, p) in posts.items():
    random.shuffle(p)
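
`random.shuffle` shuffles each list in place and gives a different order on every run, so the resulting split changes each time the notebook is executed. If reproducibility matters, one option is a dedicated `random.Random` instance with a fixed seed. This is an assumption on our side, since the cell above sets no seed; the seed value `42` and the toy data are arbitrary.

```python
import random

# Toy stand-in for the `posts` dictionary (illustration only)
toy_posts = {
    'lpt': [{'id': i} for i in range(5)],
    'slpt': [{'id': i} for i in range(5, 10)],
}

rng = random.Random(42)  # fixed seed; 42 is an arbitrary choice
for post_list in toy_posts.values():
    rng.shuffle(post_list)  # shuffles in place, like random.shuffle
```

Running the same code again with the same seed yields the same order, which makes later results comparable across runs.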

We will perform a train/validation/test split as follows:
* $4/9$ for training ($2/3$ of $2/3$)
* $2/9$ for validation ($1/3$ of $2/3$)
* $1/3$ for testing
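
Because the split indices are rounded up, these fractions are only approximate for a finite number of posts. The sketch below works out the sizes per subreddit. The function name `split_sizes` is hypothetical, and it uses integer ceiling division instead of `math.ceil` on floats, which sidesteps floating-point rounding.

```python
def split_sizes(n):
    """Return (train, validation, test) sizes for n posts of one subreddit."""
    n_trainval = (2 * n + 2) // 3        # ceil(2n/3) with integer arithmetic
    n_train = (2 * n_trainval + 2) // 3  # ceil(2/3 of train+validation)
    n_validation = n_trainval - n_train
    n_test = n - n_trainval
    return n_train, n_validation, n_test


print(split_sizes(30000))  # (13334, 6666, 10000)
print(split_sizes(10000))  # (4445, 2222, 3333)
```

Summed over three subreddits with 30,000 posts and one with 10,000, this gives 44,447 training, 22,220 validation and 33,333 test posts.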

In [19]:
import math

# Create a dictionary of empty lists to store train, validation and test posts
posts_struct = {'train':[], 'validation':[], 'test':[]}

# For each subreddit, split
for p in posts.values():
    # p is the list of posts (each post is a dictionary)
    # train + validation is 2/3; test is 1/3
    split_index = int(math.ceil((2/3)*len(p)))
    p_trainval = p[:split_index]
    p_test = p[split_index:]
    # Now, out of train+validation, 2/3 go to train and 1/3 to validation
    split_index = int(math.ceil((2/3)*len(p_trainval)))
    p_train = p_trainval[:split_index]
    p_validation = p_trainval[split_index:]
    
    # Finally, add the posts to their respective parts in the posts_struct dictionary
    posts_struct['train'].extend(p_train)
    posts_struct['validation'].extend(p_validation)
    posts_struct['test'].extend(p_test)

# Shuffle the lists
for (sep, p) in posts_struct.items():
    random.shuffle(p)

Check that the split has the expected proportions

In [20]:
ins_tot = len(posts_struct['train']) + len(posts_struct['validation']) + len(posts_struct['test'])
print('Total number of instances: {}'.format(ins_tot))
print('Percentage of train instances: {}'.format((len(posts_struct['train'])/ins_tot)*100))
print('Percentage of validation instances: {}'.format((len(posts_struct['validation'])/ins_tot)*100))
print('Percentage of test instances: {}'.format((len(posts_struct['test'])/ins_tot)*100))

Total number of instances: 100000
Percentage of train instances: 44.446999999999996
Percentage of validation instances: 22.220000000000002
Percentage of test instances: 33.333


There is just one more step left: creating `pandas` DataFrames for training, validation and testing, which will make data handling easier

In [None]:
import pandas as pd

# DataFrame for training

# DataFrame for validation

# DataFrame for testing
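
One possible way to fill in the cell above, as a sketch rather than the author's implementation: `pd.DataFrame` accepts a list of dictionaries and creates one row per dictionary and one column per key. The variable name `frames` and the `id` and `title` keys in the toy data are assumptions for illustration; the real posts may contain different fields.

```python
import pandas as pd

# Toy stand-in for `posts_struct` (illustration only; real posts may differ)
toy_struct = {
    'train': [{'id': 'a1', 'title': 'first tip'}, {'id': 'b2', 'title': 'second tip'}],
    'validation': [{'id': 'c3', 'title': 'third tip'}],
    'test': [{'id': 'd4', 'title': 'fourth tip'}],
}

# One DataFrame per split: each post (dictionary) becomes a row
frames = {split: pd.DataFrame(post_list) for split, post_list in toy_struct.items()}

print(frames['train'].shape)  # (2, 2): two posts, two columns
```

Keeping the three DataFrames in a dictionary keyed by split name mirrors the layout of `posts_struct`; separate variables per split would work just as well.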