# Tips Prediction #

In this notebook, we explore the posts we retrieved from the subreddits *LifeProTips*, *ShittyLifeProTips*, *UnethicalLifeProTips* and *IllegalLifeProTips*.
Afterwards, we train models to predict which subreddit a given post belongs to.

---

## 1. Load and split ##

Let's start by loading the posts into memory.

In [1]:
import json

posts = {}

# Import LifeProTips
with open('tips/lifeprotips_dump_1.json', 'r') as lptf:
    posts['lpt'] = json.load(lptf)

# Import ShittyLifeProTips
with open('tips/shittylifeprotips_dump_1.json', 'r') as slptf:
    posts['slpt'] = json.load(slptf)
    
# Import UnethicalLifeProTips
with open('tips/unethicallifeprotips_dump_1.json', 'r') as ulptf:
    posts['ulpt'] = json.load(ulptf)

# Import IllegalLifeProTips
with open('tips/illegallifeprotips_dump_1.json', 'r') as ilptf:
    posts['ilpt'] = json.load(ilptf)

Let's remove any duplicates from the posts.

In [2]:
for subreddit in posts:
    obs_ids = set()    # ids observed so far (a set gives O(1) membership tests)
    unique_posts = []  # unique posts, in their original order
    for post in posts[subreddit]:  # every post is a dictionary
        if post['id'] not in obs_ids:
            obs_ids.add(post['id'])
            unique_posts.append(post)
    posts[subreddit] = unique_posts

Check the number of unique posts we retrieved per subreddit $\rightarrow$ No duplicates!

In [3]:
for (s, a) in posts.items():
    print('{} posts retrieved from {}'.format(len(a), s))

30000 posts retrieved from lpt
30000 posts retrieved from slpt
30000 posts retrieved from ulpt
10000 posts retrieved from ilpt


There are only 10k posts from *IllegalLifeProTips* because the community is relatively young compared to the other subreddits: it was founded a year ago, while the other three appeared 7 years ago.

---

Now, let's shuffle the posts of each subreddit so that the train/validation/test division is random and even.

In [4]:
import random

for (s, p) in posts.items():
    random.shuffle(p)
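
Since `random.shuffle` draws from the global generator, every run of the notebook yields a different ordering and therefore a different split. If reproducibility matters, one option is to shuffle with a dedicated seeded generator. The sketch below runs on toy data; the names `toy_posts` and `shuffle_all` and the seed value `42` are assumptions made for illustration and are not part of the original notebook.

```python
import random

# Toy stand-in for the `posts` dictionary (hypothetical data)
toy_posts = {
    'lpt': [{'id': i} for i in range(5)],
    'slpt': [{'id': i} for i in range(5, 10)],
}

def shuffle_all(posts_by_sub, seed):
    """Shuffle every list in place using one seeded generator."""
    rng = random.Random(seed)  # independent of the global random state
    for sub_posts in posts_by_sub.values():
        rng.shuffle(sub_posts)

shuffle_all(toy_posts, seed=42)  # seed value chosen arbitrarily
```

Calling `shuffle_all` twice on identical inputs with the same seed gives the same ordering, so the downstream split stays the same across runs.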

We will perform a train/validation/test split as follows:
* $4/9$ for training ($2/3$ of $2/3$)
* $2/9$ for validation ($1/3$ of $2/3$)
* $1/3$ for testing
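
Because each cut is rounded up to a whole number of posts, the fractions above hold only approximately. As a quick arithmetic check, the sketch below computes the resulting sizes for the two subreddit sizes in this dataset; `split_sizes` is a hypothetical helper, and it writes the fraction as `2 * n / 3` rather than `(2/3) * n`, which gives the same result for these inputs.

```python
import math

def split_sizes(n):
    """Return (train, validation, test) sizes for n posts, rounding up at each cut."""
    n_trainval = math.ceil(2 * n / 3)        # first cut: 2/3 for train + validation
    n_train = math.ceil(2 * n_trainval / 3)  # second cut: 2/3 of that for training
    return n_train, n_trainval - n_train, n - n_trainval

print(split_sizes(30000))  # (13334, 6666, 10000)
print(split_sizes(10000))  # (4445, 2222, 3333)
```

With three subreddits of 30000 posts and one of 10000, this adds up to 44447 training, 22220 validation and 33333 test instances.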

In [5]:
import math

# Create an empty dictionary to store train, validation and test posts
posts_struct = {'train':[], 'validation':[], 'test':[]}

# For each subreddit, split
for p in posts.values():
    # p is the list of posts (each post is a dictionary)
    # train + validation is 2/3; test is 1/3
    split_index = int(math.ceil((2/3)*len(p)))
    p_trainval = p[:split_index]
    p_test = p[split_index:]
    # Now, out of train+validation, 2/3 go to train and 1/3 to validation
    split_index = int(math.ceil((2/3)*len(p_trainval)))
    p_train = p_trainval[:split_index]
    p_validation = p_trainval[split_index:]
    
    # Finally, add the posts to their respective parts in the posts_struct dictionary
    posts_struct['train'].extend(p_train)
    posts_struct['validation'].extend(p_validation)
    posts_struct['test'].extend(p_test)

# Shuffle the lists
for (sep, p) in posts_struct.items():
    random.shuffle(p)

Check that we performed a correct split

In [6]:
ins_tot = len(posts_struct['train']) + len(posts_struct['validation']) + len(posts_struct['test'])
print('Total number of instances: {}'.format(ins_tot))
print('Percentage of train instances: {}'.format((len(posts_struct['train'])/ins_tot)*100))
print('Percentage of validation instances: {}'.format((len(posts_struct['validation'])/ins_tot)*100))
print('Percentage of test instances: {}'.format((len(posts_struct['test'])/ins_tot)*100))

Total number of instances: 100000
Percentage of train instances: 44.446999999999996
Percentage of validation instances: 22.220000000000002
Percentage of test instances: 33.333


One more step remains: creating `pandas` DataFrames for training, validation and testing, which will make data handling easier.

In [7]:
import pandas as pd

# DataFrame for training
df_train = pd.DataFrame({'instance':[s['title'] for s in posts_struct['train']],
                         'label':[s['label'] for s in posts_struct['train']]})

# DataFrame for validation
df_validation = pd.DataFrame({'instance':[s['title'] for s in posts_struct['validation']],
                              'label':[s['label'] for s in posts_struct['validation']]})

# DataFrame for testing
df_test = pd.DataFrame({'instance':[s['title'] for s in posts_struct['test']],
                        'label':[s['label'] for s in posts_struct['test']]})

## 2. Instance pre-processing ##

Define a method to clean the instances<br>
**Disclaimer:** we are **not** copying this from Assignment 2

In [8]:
def clean(text, stem_words=True):
    import re    # for regular expressions
    from string import punctuation
    from nltk.stem import SnowballStemmer    #if you are brave enough to do stemming
    from nltk.corpus import stopwords      #if you want to remove stopwords
    
    # Empty or non-string input
    if not isinstance(text, str) or text == '':
        return ''
    
    # Make text lowercase
    text = text.lower()

    # Clean the text
    text = re.sub(r"'s", " ", text)  # "Sam's" may mean "Sam is" or possession; the two cases can't be told apart, so we simply drop "'s"
    text = re.sub(r"n't", " not ", text)  # e.g. "don't" -> "do not"
    text = re.sub(" whats ", " what is ", text, flags=re.IGNORECASE)
    text = re.sub(" thats ", " that is ", text, flags=re.IGNORECASE)
    text = re.sub("\'ve", " have ", text)
    text = re.sub("-", " ", text)
    #you might need more
    
    # remove comma between numbers, e.g. 15,000 -> 15000
    text = re.sub(r'(?<=[0-9]),(?=[0-9])', "", text)
    
    # remove punctuation
    for c in punctuation:
        text = re.sub("\\" + c, "", text)
        
    # split text
    n_text = text.split()
    
    # remove stopwords and subreddit keywords
    #stops = ['the', 'to']
    stops = set(stopwords.words('english'))
    sub_kwords = set(['lpt', 'ulpt', 'ilpt', 'slpt'])
    numbers = set(['one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten', 'eleven', 'twelve'])
    n_text = [word for word in n_text if word not in stops|sub_kwords|numbers if word.isalpha()]
    
    # Change numbers and other similar words
    #subs = [('one', '1'), ('ii', '2'), ('two', '2'), ('iii', '3'), ('three', '3'), ('iv', '4'), ('four', '4'),
    #        ('v', '5'), ('five', '5'), ('vi', '6'), ('six', '6'), ('vii', '7'), ('seven', '7'), ('viii', '8'),
    #        ('eight', '8'), ('ix', '9'), ('nine', '9'), ('ten', '10'), ('eleven', '11'), ('twelve', '12')]
    #for i, word in enumerate(n_text):
    #    for (bad, good) in subs:
    #        if word == bad:
    #            n_text[i] = good
    #            break
    
    # Return the remaining words joined into a single string
    return ' '.join(n_text)
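
To see what the regular-expression steps do to a title, here is a minimal standard-library sketch that reproduces only the lowercasing, contraction, hyphen, number and punctuation handling. It deliberately leaves out the stopword, subreddit-keyword and `isalpha` filtering so that it runs without NLTK; `normalize` and the sample title are hypothetical and not part of the notebook.

```python
import re
from string import punctuation

def normalize(text):
    """Regex-only part of the cleaning; no stopword or keyword removal."""
    text = text.lower()
    text = re.sub(r"'s", " ", text)                    # drop "'s" (possessive or "is")
    text = re.sub(r"'ve", " have ", text)              # "could've" -> "could have"
    text = re.sub(r"-", " ", text)                     # split hyphenated words
    text = re.sub(r"(?<=[0-9]),(?=[0-9])", "", text)   # 15,000 -> 15000
    text = text.translate(str.maketrans("", "", punctuation))  # strip remaining punctuation
    return " ".join(text.split())                      # collapse repeated whitespace

print(normalize("LPT: You could've saved $15,000 on Sam's well-known trick!"))
# lpt you could have saved 15000 on sam well known trick
```

In the full `clean` function, the later filtering step would additionally drop `lpt`, the English stopwords and the non-alphabetic token `15000`.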

Clean the DataFrames

In [9]:
df_train['instance'] = df_train['instance'].apply(clean)
df_validation['instance'] = df_validation['instance'].apply(clean)
df_test['instance'] = df_test['instance'].apply(clean)

# Drop empty posts
# (`and` cannot combine pandas Series; stripping whitespace covers both "" and " ")
df_train = df_train[df_train['instance'].str.strip() != ""]
df_validation = df_validation[df_validation['instance'].str.strip() != ""]
df_test = df_test[df_test['instance'].str.strip() != ""]

## 3. Train the ML Models ##

### BOW + Naive Bayes ###

First, build the BOW representation for training, validation and testing

In [14]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

# Generate the vocabulary using all instances
vectorizer.fit(np.concatenate((df_train['instance'].values, df_validation['instance'].values,
                               df_test['instance'].values), axis=0))

# Transform each batch separately and add labels
# Train
x_train = vectorizer.transform(df_train['instance'].values).toarray()
y_train = df_train['label'].values
# Validation
x_validation = vectorizer.transform(df_validation['instance'].values).toarray()
y_validation = df_validation['label'].values
# Test
x_test = vectorizer.transform(df_test['instance'].values).toarray()
y_test = df_test['label'].values
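
To make the representation concrete, the following standard-library sketch builds a bag-of-words matrix by hand for two toy titles: each column corresponds to one vocabulary word and each entry is the number of times that word occurs in the title. The names `docs`, `vocab` and `to_counts` are hypothetical, and the sketch does not reproduce `CountVectorizer`'s own tokenization (by default it ignores single-character tokens, for instance).

```python
from collections import Counter

docs = ["keep spare key car", "spare key spare tire"]  # toy cleaned titles

# Vocabulary: sorted unique tokens, each mapped to a column index
vocab = {word: i for i, word in enumerate(sorted({w for d in docs for w in d.split()}))}

def to_counts(doc):
    """Return the count vector of `doc` over `vocab`; unknown words are ignored."""
    counts = Counter(doc.split())
    return [counts.get(word, 0) for word in vocab]

print(vocab)                         # {'car': 0, 'keep': 1, 'key': 2, 'spare': 3, 'tire': 4}
print([to_counts(d) for d in docs])  # [[1, 1, 1, 1, 0], [0, 0, 1, 2, 1]]
```

A word that is not in the vocabulary simply contributes nothing to the vector, which is also how a fitted vectorizer treats unseen words at transform time.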

Perform validation using `MultinomialNB` from `sklearn`