# Ratings Predictor



<font size="3">Training was done on Google Colab because it provides more resources.</font>

In [None]:
!pip install transformers -q

In [None]:
import pandas as pd
import numpy as np
from transformers import TFAutoModel, AutoTokenizer
import torch
import tensorflow as tf
from textblob import TextBlob

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


### Data

<font size = 3>We use the <b>Reddit</b> jokes dataset and treat each post's score as its rating. The data has clear downsides: the jokes come from specific subreddits with specific audiences. However, it is one of the largest joke datasets available, and it also captures some recent trends, which makes it more useful.</font>

In [None]:
df = pd.read_json('/content/gdrive/MyDrive/reddit_jokes.json')

<font size = 3>I found some anomalies where the joke body was very short (not parsed correctly, or some other issue), so I set a lower limit on word_count.</font>

In [None]:
df['word_count'] = df['body'].apply(lambda x: len(x.split()))
df = df[df['word_count'] > 10]
df.head()

Unnamed: 0,body,id,score,title,word_count
0,"Now I have to say ""Leroy can you please paint ...",5tz52q,1,I hate how you cant even say black paint anymore,12
1,Pizza doesn't scream when you put it in the ov...,5tz4dd,0,What's the difference between a Jew in Nazi Ge...,14
2,...and being there really helped me learn abou...,5tz319,0,I recently went to America....,34
3,A Sunday school teacher is concerned that his ...,5tz2wj,1,"Brian raises his hand and says, “He’s in Heaven.”",92
4,He got caught trying to sell the two books to ...,5tz1pc,0,You hear about the University book store worke...,12


### Transforming Score

<font size = 3>I applied a <b>logarithmic transform</b>, log(score + 1), to the scores. This reduces the impact of outliers by compressing the range of the data. The transformed scores are then min-max normalized to lie between 0 and 10.</font>

In [None]:
import numpy as np

df['log_score'] = np.log(df['score'] + 1)

normalized_scores = (df['log_score'] - df['log_score'].min()) / (df['log_score'].max() - df['log_score'].min()) * 10

df['joke_ranking'] = normalized_scores
df = df[['body','joke_ranking']]
df.columns = ['joke','score']
print(df.shape)
df.head()

(82262, 2)


Unnamed: 0,joke,score
0,"Now I have to say ""Leroy can you please paint ...",0.654786
1,Pizza doesn't scream when you put it in the ov...,0.0
2,...and being there really helped me learn abou...,0.0
3,A Sunday school teacher is concerned that his ...,0.654786
4,He got caught trying to sell the two books to ...,0.0
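
<font size = 3>To see what the transform does, here is a minimal sketch on made-up upvote counts. The helper name `log_minmax` and the sample values are illustrative assumptions, not part of the notebook. `np.log1p(x)` computes log(1 + x), the same quantity used in the cell above.</font>

```python
import numpy as np

def log_minmax(scores, upper=10.0):
    """Compress raw scores with log(1 + x), then rescale them to [0, upper]."""
    logged = np.log1p(np.asarray(scores, dtype=float))
    span = logged.max() - logged.min()
    if span == 0:  # all scores identical: avoid dividing by zero
        return np.zeros_like(logged)
    return (logged - logged.min()) / span * upper

raw = [0, 1, 10, 100, 10000]  # made-up upvote counts
scaled = log_minmax(raw)
print(np.round(scaled, 2))
```

<font size = 3>In this toy input the largest score is 100 times the second largest, yet after the transform it is only about twice as large, so a single viral joke no longer dominates the scale.</font>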


### Getting Equal Samples

<font size = 3>To even out the distribution, I took up to 1,000 samples from each of the bins below. This <b>reduces the imbalance</b> between labels.</font>

In [None]:
# Bin the 0-10 scores into five equal-width intervals.
bins = [0, 2, 4, 6, 8, 10]
df['category'] = pd.cut(df['score'], bins, include_lowest=True)

# Draw at most 1000 jokes from each bin; sparse bins contribute all their rows.
samples = []
for category in df['category'].unique():
    category_df = df[df['category'] == category]
    samples.append(category_df.sample(min(len(category_df), 1000)))
sampled_df = pd.concat(samples)
print(sampled_df.shape)

# Shuffle the combined sample and cap it at 5000 rows.
sample_size = min(len(sampled_df), 5000)
final_sampled_df = sampled_df.sample(sample_size)
final_sampled_df = final_sampled_df[['joke','score']]

# Round the scores to get 11 classes (0 to 10).
final_sampled_df['score'] = final_sampled_df['score'].round()
print(final_sampled_df.shape)
final_sampled_df.head()

(4699, 3)
(4699, 2)


Unnamed: 0,joke,score
43335,"I know he's going to treat her well, I heard t...",9.0
188166,A little girl is riding her bicycle down the s...,5.0
118552,They pass a gay bar and one condom says to the...,8.0
14509,The doctor tells them theres been a mix up and...,9.0
26914,"A Priest, a Rabbi, and a Sheik walk into a bar...",2.0
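
<font size = 3>The capped per-bin sampling can also be sketched on toy data, which makes its effect on the bin counts easy to inspect. The helper `sample_per_bin`, the synthetic scores, and the cap of 100 are assumptions made for illustration only.</font>

```python
import numpy as np
import pandas as pd

def sample_per_bin(frame, column, cap, seed=0):
    """Take at most `cap` random rows from each group of `column`."""
    parts = [
        group.sample(min(len(group), cap), random_state=seed)
        for _, group in frame.groupby(column, observed=True)
    ]
    return pd.concat(parts)

rng = np.random.default_rng(0)
# Toy stand-in for the jokes frame: scores skewed heavily toward 0.
toy = pd.DataFrame({"score": np.clip(rng.exponential(1.5, size=5000), 0, 10)})
toy["category"] = pd.cut(toy["score"], [0, 2, 4, 6, 8, 10], include_lowest=True)

balanced = sample_per_bin(toy, "category", cap=100)
print(balanced["category"].value_counts().sort_index())
```

<font size = 3>Bins with fewer rows than the cap contribute everything they have, so the result is only approximately balanced. This is consistent with the combined sample above having 4,699 rows rather than 5,000.</font>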


## Model

<font size = 3>I treated the problem as a classification problem, using embeddings of the jokes as input to the classification model. For the embeddings I first tried larger models, but resource limits forced me to fall back to BERT tiny (RAM overshoots even when processing in batches). The BERT embeddings are combined with sentiment polarity and subjectivity to form the complete input feature set.</font>

### Getting Embeddings and Other Features

In [None]:
model = TFAutoModel.from_pretrained('prajjwal1/bert-tiny', from_pt=True)

tokenizer = AutoTokenizer.from_pretrained('prajjwal1/bert-tiny')

jokes = final_sampled_df['joke'].tolist()
chunks = [jokes[x:x+1000] for x in range(0, len(jokes), 1000)]

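embeddings = []
for i, chunk in enumerate(chunks, start=1):
    print(i)  # progress: number of the chunk being embedded
    inputs = tokenizer(chunk, return_tensors='tf', truncation=True, padding=True, max_length=512)
    outputs = model(inputs)
    # Use the final hidden state of the first ([CLS]) token as the joke embedding.
    chunk_embeddings = outputs.last_hidden_state[:, 0, :].numpy()
    embeddings.append(chunk_embeddings)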

embeddings = np.concatenate(embeddings, axis=0)

sentiments = [TextBlob(joke).sentiment for joke in jokes]
polarity = [sentiment.polarity for sentiment in sentiments]
subjectivity = [sentiment.subjectivity for sentiment in sentiments]

df_final = pd.DataFrame(embeddings)
df_final['polarity'] = polarity
df_final['subjectivity'] = subjectivity
df_final.head()
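
<font size = 3>The shapes involved in the cell above can be checked without loading the model. The sketch below uses random numbers in place of real BERT outputs. The batch size, sequence length, and sentiment values are made up; 128 is the hidden size of BERT tiny.</font>

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for outputs.last_hidden_state: (batch, sequence_length, hidden_size).
last_hidden_state = rng.normal(size=(4, 16, 128))

# Position 0 of every sequence is the [CLS] token; keep only its vector.
cls_embeddings = last_hidden_state[:, 0, :]

# Made-up sentiment values in TextBlob's ranges:
# polarity lies in [-1, 1], subjectivity in [0, 1].
polarity = np.array([0.5, -0.2, 0.0, 0.1])
subjectivity = np.array([0.6, 0.4, 0.0, 0.9])

# Append the two sentiment columns to the embedding columns.
features = np.column_stack([cls_embeddings, polarity, subjectivity])
print(cls_embeddings.shape, features.shape)
```

<font size = 3>Each joke ends up as one row of 130 features: 128 embedding dimensions plus polarity and subjectivity.</font>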

### Classification Model on Top of BERT

<font size = 3>Finally, <b>XGBoost</b> is trained to classify the input features into one of 11 classes. Treating the problem as classification gave far better results than treating it as regression. Classification also makes sense here because we want to detect the level of humour, and the rounded scores already provide 11 ordered classes from low to high humour.</font>

In [None]:
import xgboost as xgb

# One label per sampled joke, in the same row order as df_final.
labels = final_sampled_df['score'].tolist()
dtrain = xgb.DMatrix(df_final, label=labels)

param = {'max_depth': 7, 'eta': 0.3, 'objective': 'multi:softmax', 'num_class': 11}

num_round = 20
bst = xgb.train(param, dtrain, num_round)
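
<font size = 3>The cell above trains on every sampled joke, so it gives no estimate of accuracy on unseen data. A minimal sketch of a hold-out split and a majority-class baseline follows. The helper names and toy labels are assumptions for illustration, and this is not the evaluation used in the notebook.</font>

```python
import numpy as np

def holdout_indices(n_rows, test_fraction=0.2, seed=0):
    """Shuffle row positions and split them into train and test index arrays."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)
    n_test = int(round(n_rows * test_fraction))
    return order[n_test:], order[:n_test]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true class exactly."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def majority_baseline(y_train, y_test):
    """Accuracy of always predicting the most frequent training class."""
    values, counts = np.unique(y_train, return_counts=True)
    guess = values[np.argmax(counts)]
    return accuracy(y_test, np.full(len(y_test), guess))

# Toy labels standing in for the rounded 0-10 scores.
toy_labels = np.array([0, 0, 0, 2, 5, 5, 9, 10, 0, 5])
train_idx, test_idx = holdout_indices(len(toy_labels), test_fraction=0.3)
print(len(train_idx), len(test_idx))
print(majority_baseline(toy_labels[train_idx], toy_labels[test_idx]))
```

<font size = 3>With the notebook's objects, the same index arrays could select rows of `df_final` and entries of `labels` to build separate `xgb.DMatrix` objects for training and evaluation. With 11 classes, a classifier is only useful if its held-out accuracy clearly beats the majority-class baseline.</font>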

## Saving the Model

<font size=3>Finally, we save the trained model.</font>

In [None]:
bst.save_model('xgboost_model.json')