# Gift Recommender Engine: Topic Modelling Classifier

In this notebook, I will attempt to improve the performance of my model. My current approach is to train a classifier on Reddit data extracted from different categories, then feed it a user's Twitter data that has been filtered with sentiment analysis so that only positive tweets remain. When tested on the Reddit dataset, the models perform well (up to 85% accuracy). I also evaluated the model on a celebrity's profile, in this case Taylor Swift's, to predict what she might like. Overall, it recommended some fairly relevant gift categories for her: music, movies, and books.

However, the approach is quite messy: I have to feed each tweet into the classifier individually and then count the most frequent topics across tweets. One significant limitation is that not every tweet can be used to recommend a gift, even if its sentiment is positive. It would make more sense to identify the keywords that make up a topic and use those keywords as inputs to a classifier. However, performing LDA on every user's tweets is impractical for this process. Thus, I wonder whether the output of an LDA model can be used as input to a supervised classification problem.
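
The per-tweet approach described above amounts to a majority vote over the classifier's predictions. A minimal sketch of that counting step, assuming the predictions are already available as a list of category names (the helper `top_categories` and the sample labels are hypothetical, not taken from the earlier notebook):

```python
from collections import Counter


def top_categories(predicted_labels, n=3):
    """Return the n most frequent predicted categories, most common first."""
    return [label for label, _ in Counter(predicted_labels).most_common(n)]


# Hypothetical per-tweet predictions: one category per positive tweet
per_tweet = ["Music", "Movies", "Music", "Books", "Music", "Movies"]
print(top_categories(per_tweet))
```

Because every tweet gets exactly one vote, tweets that carry no gift-relevant signal still influence the ranking, which is the limitation noted above.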

In this notebook, I will attempt to create a classifier that predicts gift categories from vectors that correspond to the distribution of topics identified by the LDA model. I will use the Reddit dataset I scraped earlier.
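
To make the idea concrete before working with the real data, here is a minimal sketch of the whole chain on a toy corpus: word counts, then an LDA topic distribution per document, then a classifier trained on those distributions. The documents, labels, and the choice of two topics are illustrative assumptions only; the notebook itself uses the Reddit dataset and different settings.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy documents and labels (assumptions for illustration only)
docs = [
    "guitar concert album song",
    "album song band tour",
    "concert band guitar tour",
    "novel author chapter library",
    "library chapter reading novel",
    "author reading novel library",
]
labels = ["Music", "Music", "Music", "Books", "Books", "Books"]

pipe = make_pipeline(
    CountVectorizer(),                                          # text -> word counts
    LatentDirichletAllocation(n_components=2, random_state=0),  # counts -> topic distribution
    LinearSVC(random_state=0),                                  # topic distribution -> category
)
pipe.fit(docs, labels)

# The classifier only ever sees the topic vectors, one row per document
topic_vectors = pipe[:-1].transform(docs)
print(np.round(topic_vectors, 3))

prediction = pipe.predict(["guitar album tour"])
print(prediction)
```

The classifier's feature space has only as many dimensions as there are topics, so the number of topics directly limits how much information about each document reaches the classifier.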

## Import and Clean Reddit Data

In [90]:
import numpy as np
import warnings
warnings.filterwarnings('ignore')

In [91]:
import pandas as pd

df = pd.read_csv('datasets/reddit-categories-clean2.csv')
df.drop('Unnamed: 0', axis=1, inplace=True)

data = df[['category', 'all-text', 'clean-text']]
data.head()

| | category | all-text | clean-text |
|---|---|---|---|
| 0 | Electronics/Gadgets | New r/tech discord server Unfortunately due to... | new tech discord server unfortunately due extr... |
| 1 | Electronics/Gadgets | Intel chief warns of two-year chip shortage | intel chief warn two year chip shortage |
| 2 | Electronics/Gadgets | New York, other states to fight dismissal of a... | new york state fight dismissal antitrust lawsu... |
| 3 | Electronics/Gadgets | Microsoft's profits skyrocketed by 47 percent ... | microsoft profit skyrocket percent |
| 4 | Electronics/Gadgets | Hiding malware inside AI neural networks | hide malware inside ai neural network |


In [147]:
data['all-text'].iloc[104]



In [146]:
preprocess(data['all-text'].iloc[104])



In [124]:
import re
import string

import nltk
import spacy

# Requires the NLTK stopword list (nltk.download('stopwords')) and the
# spaCy model (python -m spacy download en_core_web_sm) to be installed.
stopwords = set(nltk.corpus.stopwords.words('english'))
nlp = spacy.load('en_core_web_sm')

# Escape the punctuation characters so that ']', '\' and '^' are treated
# literally inside the regex character class.
punct_re = re.compile(f"[{re.escape(string.punctuation)}]")

def spacy_lemmatize(text):
    """Lemmatize a string or a list of tokens and return a list of lemmas."""
    if isinstance(text, list):
        text = ' '.join(text)
    return [token.lemma_ for token in nlp(text)]

def preprocess(text):
    text = re.sub(r'http\S+', '', text)  # remove links
    text = re.sub(r'www\S+', '', text)
    text = text.lower()  # convert every character to lowercase
    text = punct_re.sub(' ', text)  # replace punctuation with spaces
    text = re.sub(r'[0-9]', ' ', text)  # replace digits with spaces
    tokens = [s for s in text.split() if len(s) >= 2]  # drop single-character tokens
    tokens = [s for s in tokens if s not in stopwords]  # drop stopwords
    return ' '.join(spacy_lemmatize(tokens))  # lemmatize and join into a string

In [36]:
#data['clean-text'] = data['all-text'].map(preprocess)
#data.to_csv('datasets/reddit-categories-clean3.csv')

In [148]:
data = pd.read_csv('datasets/reddit-categories-clean4.csv')
data.drop('Unnamed: 0', axis=1, inplace=True)

In [97]:
#data['clean-text-list'] = data['clean-text'].apply(lambda x: x.split())
#all_text = data['clean-text-list'].to_list()

In [99]:
data.head(5)

| | title | score | id | subreddit | url | num_comments | body | created | category | all-text | clean-text |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | New r/tech discord server | 10 | nwb9yv | tech | https://www.reddit.com/r/tech/comments/nwb9yv/... | 0 | Unfortunately due to extreme circumstances the... | 1623287000.0 | Electronics/Gadgets | New r/tech discord server Unfortunately due to... | new tech discord server unfortunately due extr... |
| 1 | Intel chief warns of two-year chip shortage | 1842 | otbino | tech | https://www.bbc.com/news/technology-57996908 | 199 | | 1627484000.0 | Electronics/Gadgets | Intel chief warns of two-year chip shortage | intel chief warn two year chip shortage |
| 2 | New York, other states to fight dismissal of a... | 429 | otbgzp | tech | https://www.reuters.com/technology/new-york-ot... | 6 | | 1627484000.0 | Electronics/Gadgets | New York, other states to fight dismissal of a... | new york state fight dismissal antitrust lawsu... |
| 3 | Microsoft's profits skyrocketed by 47 percent ... | 695 | ot3zze | tech | https://www.engadget.com/microsoft-q4-fy21-ear... | 32 | | 1627452000.0 | Electronics/Gadgets | Microsoft's profits skyrocketed by 47 percent ... | microsoft profit skyrocket percent |
| 4 | Hiding malware inside AI neural networks | 162 | ot70z7 | tech | https://techxplore.com/news/2021-07-malware-ai... | 8 | | 1627467000.0 | Electronics/Gadgets | Hiding malware inside AI neural networks | hide malware inside ai neural network |


In [155]:
data['all-text'].iloc[8000]

'Highly recommend last stop that’s just dropped on game pass Not gonna mention anything about this game. I’m not normally a fan of this genre but man I’m having a blast with this game.'

In [154]:
data['clean-text'].iloc[8000]

'highly recommend last stop that ’s drop game pass gon na mention anything game I ’m normally fan genre man I ’m blast game'

## Constructing LDA Model: Gensim

### Use HDP to Identify Number of Topics

In [58]:
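import gensim
import gensim.corpora as corpora

# `all_text` is the list of tokenized documents built in the commented-out
# cell In [97] above; that cell must be run before this one.
id2word = corpora.Dictionary(all_text)
# Drop tokens that appear in fewer than 10 documents or in more than 35% of them
id2word.filter_extremes(no_below=10, no_above=0.35)
id2word.compactify()
# Convert each document to a bag-of-words list of (token_id, count) pairs
corpus = [id2word.doc2bow(text) for text in all_text]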

In [59]:
from gensim.models import HdpModel

hdp = HdpModel(corpus, id2word, chunksize=10000)

In [60]:
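# Caution: print_topics() returns at most 20 topics by default (num_topics=20),
# so this length reflects that default rather than the number of topics HDP inferred.
len(hdp.print_topics())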

20

### LDA Modelling

In [62]:
lda = gensim.models.ldamulticore.LdaMulticore(corpus=corpus, num_topics=20,
                                             id2word=id2word, chunksize=100,
                                             workers=6, passes=50,
                                             per_word_topics=True)

In [69]:
train_vecs = []
for bow in corpus:
    # minimum_probability=0.0 asks Gensim for every topic, not only the dominant ones
    doc_topics = lda.get_document_topics(bow, minimum_probability=0.0)
    # Keep the probability from each (topic_id, probability) pair
    train_vecs.append([prob for _, prob in doc_topics])
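
Gensim returns each document's topics as a list of `(topic_id, probability)` pairs, and reading the probabilities by position assumes that every topic is present in that list. A slightly more defensive sketch, assuming only that pair format, fills a fixed-length vector by topic id so that any omitted topic simply stays at zero (the helper `to_dense` and the sample pairs are hypothetical):

```python
import numpy as np


def to_dense(sparse_topics, num_topics):
    """Convert [(topic_id, probability), ...] into a fixed-length vector."""
    vec = np.zeros(num_topics)
    for topic_id, prob in sparse_topics:
        vec[topic_id] = prob
    return vec


# Hypothetical output for one document in a 5-topic model
sparse = [(0, 0.7), (3, 0.3)]
dense = to_dense(sparse, num_topics=5)
print(dense)
```

With this helper, every row has the same length regardless of how many pairs were returned, which is what `np.array(train_vecs)` needs to build a 2-D feature matrix.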

## Naive Bayes Model

In [80]:
from sklearn.preprocessing import LabelEncoder

X = np.array(train_vecs)
y = np.array(data.category)

le = LabelEncoder()
y = le.fit_transform(y)

# Map each category name to its encoded label; le.classes_ is already sorted by label
ref = dict(zip(le.classes_, range(len(le.classes_))))

In [83]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=8)

nb = MultinomialNB()
nb.fit(X_train, y_train)

MultinomialNB()

In [84]:
nb.score(X_test, y_test)

0.24117805998378816

## Support Vector Classifier

In [86]:
svc = OneVsRestClassifier(LinearSVC(random_state=0))

In [87]:
svc.fit(X_train, y_train)

OneVsRestClassifier(estimator=LinearSVC(random_state=0))

In [88]:
svc.score(X_test, y_test)

0.2512023777357471
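
Accuracy figures like the two above are easier to interpret next to a trivial baseline. A minimal sketch, using synthetic data in place of the real topic vectors (the arrays `X_demo` and `y_demo`, the four classes, and the use of `DummyClassifier` are assumptions for illustration, not part of the original experiment):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-ins: rows sum to 1 like topic distributions, labels are random
X_demo = rng.dirichlet(np.ones(20), size=200)
y_demo = rng.integers(0, 4, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=8
)

# Always predicts the most frequent class seen during training
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
baseline_acc = baseline.score(X_te, y_te)
print(f"majority-class baseline accuracy: {baseline_acc:.3f}")
```

Running the same baseline on the real `X_train`, `X_test`, `y_train`, and `y_test` would show how much of the classifiers' accuracy comes from the topic vectors rather than from the class distribution.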

## Constructing Model: Sklearn

In [157]:
from sklearn.decomposition import LatentDirichletAllocation, NMF
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [158]:
data.dropna(inplace=True)

In [159]:
tfidf_v = TfidfVectorizer(stop_words='english')
tfidf = tfidf_v.fit_transform(data['clean-text'])
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
tfidf_feature_names = tfidf_v.get_feature_names_out()

In [160]:
count_v = CountVectorizer(stop_words='english')
count = count_v.fit_transform(data['clean-text'])
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
count_feature_names = count_v.get_feature_names_out()

In [164]:
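no_topics = 10
# Note: scikit-learn 1.2 removed NMF's `alpha` in favour of `alpha_W`/`alpha_H`;
# this call targets the older API.
nmf = NMF(n_components=no_topics, random_state=1, alpha=0.1, l1_ratio=0.5,
          init='nndsvd')
lda = LatentDirichletAllocation(n_components=no_topics, max_iter=5,
                                learning_method='online', learning_offset=50,
                                random_state=0)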

In [165]:
nmf_output = nmf.fit_transform(count)
lda_output = lda.fit_transform(count)

In [167]:
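def display_topics(model, feature_names, no_top_words):
    """Print the highest-weighted words for each topic of a fitted model."""
    for topic_idx, topic in enumerate(model.components_):
        print(f"Topic {topic_idx}:")
        # argsort is ascending, so the reversed slice yields the top words first
        top_word_ids = topic.argsort()[:-no_top_words - 1:-1]
        print(" ".join(feature_names[i] for i in top_word_ids))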

no_top_words = 20
display_topics(nmf, count_feature_names, no_top_words)

Topic 0:
like time day make want work feel know think book thing really say try year good people way look life
Topic 1:
official nintendo com press trailer release games twitter post summer game fall website edition end city exclusive tale ign ii
Topic 2:
icon sub big sprite yellow substitution bar txtblack ball red card goal lime replace match score thread yc assist note
Topic 3:
comment hockey link lw sign rw rd ld van canuck deal leafs sabre car year cane tor tampabaylightne tbl det
Topic 4:
exe corporation microsoft service process host window svchost brave software browser steam messenger valve broker client runtime logitech runtimebroker steamwebhelper
Topic 5:
pt vs fg minute highlight reb ast ft stl blk pg buck sf pf sg clipper sun jazz hawk net
Topic 6:
harry riddle bellatrix olivander snape say fleur mcgonagall lupin vote rita krum dumbledore episode look viktor luna make game know
Topic 7:
year team player season value playoff pss good play round win total raven pick run fin

In [170]:
data['lda-vector'] = data['clean-text'].apply(lambda x: lda.transform(count_v.transform([x])))

In [171]:
data['nmf-vector'] = data['clean-text'].apply(lambda x: nmf.transform(count_v.transform([x])))
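
Calling `transform` once per row, as in the two cells above, repeats the vectorizer and model overhead for every document. A minimal sketch on a toy corpus (the documents and the names `vec` and `lda_demo` are illustrative assumptions) compares that row-by-row approach with a single batch call:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (an assumption for illustration only)
docs = [
    "game console controller play",
    "book novel author read",
    "play game level boss",
    "read chapter book library",
]

vec = CountVectorizer()
counts = vec.fit_transform(docs)
lda_demo = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# One call for the whole corpus: shape (n_docs, n_topics)
batch = lda_demo.transform(counts)

# The row-by-row equivalent, stacked into the same shape
row_by_row = np.vstack([lda_demo.transform(vec.transform([d])) for d in docs])

same = np.allclose(batch, row_by_row)
print(batch.shape, same)
```

If the two agree on the real data as well, the batch result can be split into one row per document when it needs to be stored in a DataFrame column.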

### SVC Model from LDA-Vector

In [174]:
# Copy so that adding the label column below does not assign to a slice of `data`
lda_data = data[['category', 'lda-vector']].copy()

In [175]:
le = LabelEncoder()
lda_data['label'] = le.fit_transform(lda_data['category'])

In [179]:
lda_ref = le.classes_

In [181]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC


In [214]:
# Each stored vector has shape (1, n_topics); stack them into an (n_docs, n_topics) matrix
X = np.vstack(lda_data['lda-vector'].to_list())
y = np.array(lda_data['label'])

In [215]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=8)

In [216]:
svc = OneVsRestClassifier(LinearSVC(random_state=0))

In [217]:
svc.fit(X_train, y_train)

OneVsRestClassifier(estimator=LinearSVC(random_state=0))

In [218]:
svc.score(X_test, y_test)

0.3823293172690763

In [225]:
import pickle

# Load the pickled tweets; the context manager closes the file handle afterwards
with open('evan_sowards.sav', 'rb') as f:
    tweets = pickle.load(f)