# Gift Recommender Engine: Topic Modelling Classifier

In this notebook, I attempt to improve the performance of my model. My current approach is to train a classifier on Reddit data extracted from different categories, then feed it a user's Twitter data that has been filtered by sentiment analysis (keeping only positive tweets). When tested on the Reddit dataset, the models perform extremely well (up to 98% accuracy). I also evaluated the model on a celebrity's profile, in this case Taylor Swift, to predict what she might like; overall, it recommended some fairly relevant gift categories for her: music, movies, and books.

However, the approach is quite messy: I have to feed each tweet into the classifier individually and then count the most frequent topics across tweets. One significant limitation is that not every tweet can be used to recommend a gift, even if its sentiment is positive. It would make more sense to identify the keywords that make up a topic and use those keywords as inputs to a classifier. However, running LDA (Latent Dirichlet Allocation) on every user's tweets is impractical for this process. Thus, I wonder whether the output of an LDA model can be used as input to a supervised classification problem.
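
For reference, the per-tweet voting step described above can be sketched roughly as follows. This is only an illustration and not the code from the earlier notebook: `recommend_categories` and the keyword-based `fake_classifier` are hypothetical stand-ins for the trained model.

```python
from collections import Counter

def recommend_categories(tweets, classify, top_n=3):
    """Classify each tweet on its own and return the most frequent categories."""
    votes = Counter(classify(tweet) for tweet in tweets)
    return [category for category, _ in votes.most_common(top_n)]

# Hypothetical stand-in for the trained classifier: a simple keyword lookup.
def fake_classifier(tweet):
    keywords = {"album": "music", "song": "music", "film": "movies", "novel": "books"}
    for word, category in keywords.items():
        if word in tweet.lower():
            return category
    return "other"

tweets = ["New album out now!", "Loved that film", "This song is on repeat"]
print(recommend_categories(tweets, fake_classifier, top_n=2))
```

Every tweet casts exactly one vote here, which is why an irrelevant but positive tweet can still skew the recommendation.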

Specifically, I will attempt to build a classifier that predicts gift categories from vectors representing each document's distribution over the topics identified by an LDA model, using the Reddit dataset I scraped earlier.

## Import and Clean Reddit Data

In [72]:
import numpy as np
import warnings
warnings.filterwarnings('ignore')

In [32]:
import pandas as pd

df = pd.read_csv('datasets/reddit-categories-clean2.csv')
df.drop('Unnamed: 0', axis=1, inplace=True)

data = df[['category', 'all-text', 'clean-text']].copy()  # copy so new columns are not added to a view of df
data.head()

In [24]:
import nltk
import string
import re
import spacy

punctuations = string.punctuation
stopwords = set(nltk.corpus.stopwords.words('english'))  # a set makes the membership test in preprocess fast
nlp = spacy.load('en_core_web_sm')

def spacy_lemmatize(text):
    """Lemmatize a string or a list of tokens with spaCy and return a list of lemmas."""
    if isinstance(text, list):
        text = ' '.join(text)
    doc = nlp(text)
    return [token.lemma_ for token in doc]

def preprocess(text):
    text = text.split()  # split into a list of tokens
    text = [re.sub(r'^https?:\/\/.*[\r\n]*', '', s, flags=re.MULTILINE) for s in text]  # remove links
    text = [s.lower() for s in text]  # lowercase every token
    # re.escape keeps characters such as \ and ] literal inside the character class
    text = [re.sub(rf"[{re.escape(string.punctuation)}]", " ", s) for s in text]  # replace punctuation with spaces
    text = [re.sub(r'[0-9]', ' ', s) for s in text]  # replace digits with spaces
    text = ' '.join(text)  # rejoin so tokens broken up by the substitutions are re-split below
    text = [s for s in text.split() if len(s) >= 2]  # drop single-character tokens
    text = [s for s in text if s not in stopwords]  # remove stopwords
    text = ' '.join(spacy_lemmatize(text))  # lemmatize with spaCy and join back into a string
    return text
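
To see what the regex steps in `preprocess` do on their own, here is a dependency-free sketch that skips stopword removal and lemmatization, so it needs neither NLTK nor spaCy. `basic_clean` is a hypothetical helper written for illustration, and the sample sentence is made up.

```python
import re
import string

def basic_clean(text):
    """Apply only the link, case, punctuation, digit and length filters."""
    tokens = text.split()
    # Blank out tokens that start with http:// or https://
    tokens = [re.sub(r'^https?://\S*', '', t) for t in tokens]
    tokens = [t.lower() for t in tokens]
    # re.escape keeps characters such as ] and \ literal inside the character class
    punct_class = f"[{re.escape(string.punctuation)}]"
    tokens = [re.sub(punct_class, ' ', t) for t in tokens]
    tokens = [re.sub(r'[0-9]', ' ', t) for t in tokens]
    # Re-split, because the substitutions above may have introduced spaces
    tokens = ' '.join(tokens).split()
    return [t for t in tokens if len(t) >= 2]

print(basic_clean("Check https://example.com: I bought 2 new vinyl-records!"))
```

Note that hyphenated words such as `vinyl-records` end up as two separate tokens, because punctuation is replaced with a space before the text is re-split.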

In [36]:
#data['clean-text'] = data['all-text'].map(preprocess)
#data.to_csv('datasets/reddit-categories-clean3.csv')

In [39]:
data['clean-text-list'] = data['clean-text'].apply(lambda x: x.split())
all_text = data['clean-text-list'].to_list()

## Constructing LDA Model

### Use HDP to Identify Number of Topics

In [58]:
import gensim
import gensim.corpora as corpora

id2word = gensim.corpora.Dictionary(all_text)
id2word.filter_extremes(no_below=10, no_above=0.35)
id2word.compactify()
corpus = [id2word.doc2bow(text) for text in all_text]
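
The `filter_extremes(no_below=10, no_above=0.35)` call above prunes the vocabulary by document frequency: tokens that appear in too few documents, or in too large a share of them, are dropped. The pure-Python sketch below mirrors that idea on a toy corpus. It is an approximation for illustration only, not gensim's implementation, and `filter_by_doc_frequency` is a hypothetical name.

```python
from collections import Counter

def filter_by_doc_frequency(docs, no_below, no_above):
    """Keep tokens found in at least `no_below` docs and in at most a `no_above` fraction of docs."""
    # Count each token once per document, regardless of how often it repeats inside it
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}

docs = [
    ["guitar", "amp", "the"],
    ["guitar", "pedal", "the"],
    ["novel", "author", "the"],
    ["novel", "guitar", "the"],
]
kept = filter_by_doc_frequency(docs, no_below=2, no_above=0.75)
print(sorted(kept))
```

In the toy corpus, `the` is removed for being in every document, while `amp`, `pedal` and `author` are removed for appearing only once.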

In [59]:
from gensim.models import HdpModel

hdp = HdpModel(corpus, id2word, chunksize=10000)

In [60]:
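# Caveat: print_topics() shows at most 20 topics by default, so this count may simply reflect that default
len(hdp.print_topics())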

20

### LDA Modelling

In [62]:
lda = gensim.models.ldamulticore.LdaMulticore(corpus=corpus, num_topics=20,
                                             id2word=id2word, chunksize=100,
                                             workers=6, passes=50,
                                             per_word_topics=True)

In [69]:
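num_topics = 20  # must match the num_topics passed to the LDA model
train_vecs = []
for bow in corpus:
    # (topic_id, probability) pairs; minimum_probability=0.0 asks gensim to report every topic
    doc_topics = lda.get_document_topics(bow, minimum_probability=0.0)
    topic_vec = [doc_topics[k][1] for k in range(num_topics)]
    train_vecs.append(topic_vec)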
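
The loop above relies on every document coming back with exactly 20 `(topic_id, probability)` pairs, in topic order. If a topic were ever left out of the sparse output, indexing by position would misalign or fail. A more defensive conversion, sketched here with a hypothetical helper `to_dense_topic_vector` and made-up numbers, places each probability by its topic id:

```python
def to_dense_topic_vector(doc_topics, num_topics):
    """Turn [(topic_id, probability), ...] into a fixed-length list, filling gaps with 0.0."""
    vec = [0.0] * num_topics
    for topic_id, prob in doc_topics:
        vec[topic_id] = prob
    return vec

sparse = [(0, 0.7), (3, 0.3)]
print(to_dense_topic_vector(sparse, num_topics=5))
```

With the real model, this would be called as `to_dense_topic_vector(doc_topics, 20)` inside the loop.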

## Naive Bayes Model

In [80]:
from sklearn.preprocessing import LabelEncoder

X = np.array(train_vecs)
y = np.array(data.category)

le = LabelEncoder()
y = le.fit_transform(y)

# Map each category name to its encoded integer; le.classes_ is already in encoded order
ref = {label: code for code, label in enumerate(le.classes_)}

In [83]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=8)

nb = MultinomialNB()
nb.fit(X_train, y_train)

MultinomialNB()

In [84]:
nb.score(X_test, y_test)

0.24117805998378816
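
An accuracy of about 24% is hard to judge without a reference point. One rough check is to compare it with the accuracy of always predicting the most common class. The sketch below uses made-up labels, and `majority_baseline_accuracy` is a hypothetical helper rather than part of scikit-learn; with the real data it would be called with `y_train` and `y_test`.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy obtained by always predicting the most common training label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == most_common_label)
    return hits / len(test_labels)

train_labels = [0, 0, 1, 2, 0, 1]
test_labels = [0, 1, 0, 2]
print(majority_baseline_accuracy(train_labels, test_labels))
```

If the classifier's score is close to this baseline, the topic vectors are carrying little information about the category.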

## Support Vector Classifier

In [86]:
svc = OneVsRestClassifier(LinearSVC(random_state=0))

In [87]:
svc.fit(X_train, y_train)

OneVsRestClassifier(estimator=LinearSVC(random_state=0))

In [88]:
svc.score(X_test, y_test)

0.2512023777357471