# Topic Modeling from Clickbait Websites

For this project I wanted to look at the topics covered on clickbait websites. I was interested in these topics because the content on these sites is designed specifically to attract clicks, not necessarily for quality. To simplify the workflow, I chose to modularize my code into classes.

In [5]:
import pandas as pd

from nltk.corpus import stopwords 
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.decomposition import NMF
from gensim import corpora, models, similarities, matutils
from pymongo import MongoClient
from bs4 import BeautifulSoup

In [11]:
%load_ext autoreload
%autoreload 1
%aimport NLPProcessing
from NLPProcessing import GetArticles
from NLPProcessing import NLPPipeline

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


I created a class to handle all of the web scraping for the sites. The class creates a dictionary and adds each article on a site to it. I created one such dictionary for each of the three sites that I was interested in scraping.

In [3]:
url = 'https://www.nickiswift.com/'
nicki_swift_articles = GetArticles()
nicki_swift_articles = nicki_swift_articles.create_article_dict(url)

In [2]:
url = 'https://www.grunge.com/'
grunge_articles = GetArticles()
grunge_articles = grunge_articles.create_article_dict(url)

In [181]:
url = 'http://www.giveitlove.com/'
give_it_love_articles = GetArticles()
give_it_love_articles = give_it_love_articles.create_article_dict(url)

## Create MongoDB

I wanted to put all of the data into a Mongo database, so I created a database and inserted all of the articles from the three websites. The collection that I created was called "junk_website_data" because the sites were all clickbait sites.

In [3]:
client = MongoClient()

In [4]:
mydb = client["junk_website_data"]

Each document represents one article and has a website, title, and text field. I ran this code for each website's dictionary created above. I chose to do this one site at a time to ensure the data was in the proper format before inserting it.

In [None]:
# Swap in each site's dictionary in turn; `url` must hold that site's address
for key, value in give_it_love_articles.items():
    data_dict = {
        'website': url,
        'title': str(key),
        'text': str(value),
    }
    mydb.junk_website_data.insert_one(data_dict)
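
Since the same loop is run once per site, one option is to wrap it in a small helper that builds the documents first, so they can be inspected before anything is written. This is only a sketch: `build_documents` is a hypothetical name that is not part of the original code, and the sample dictionary is made up for illustration.

```python
def build_documents(article_dict, website):
    """Turn a {title: text} dictionary into a list of MongoDB-ready documents."""
    documents = []
    for title, text in article_dict.items():
        documents.append({
            'website': website,
            'title': str(title),
            'text': str(text),
        })
    return documents


# Hypothetical sample data standing in for one scraped site
sample_articles = {'<h1>Some title</h1>': '<p>Some article text</p>'}
docs = build_documents(sample_articles, 'http://www.giveitlove.com/')

# Check the format before inserting anything
assert all(set(d) == {'website', 'title', 'text'} for d in docs)

# With a live connection, the documents could then be inserted in one call:
# mydb.junk_website_data.insert_many(docs)
```

Passing the site address in as an argument avoids relying on whichever value `url` happened to hold last in the notebook.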

I ran the code below to ensure that the collection was created.

In [9]:
mydb.list_collection_names()

['junk_website_data']

## Clean MongoDB Data

When I inserted the data into MongoDB, I inserted it as raw HTML. To pull the text and titles out of the raw HTML, I used BeautifulSoup.

### Titles

In [42]:
title_cursor = mydb.junk_website_data.find({}, {'_id':0, 'title': 1}).limit(1)
title_list = list(title_cursor)
title = title_list[0]['title']

In [33]:
soup = BeautifulSoup(title, "lxml")

In [34]:
soup.get_text()

'The real reason these contestants were kicked off The Bachelor'

And now we can get clean titles!
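
The cells above clean a single title. To apply the same idea to a whole list of titles, something like the following could work. Note that this sketch deliberately swaps BeautifulSoup for the standard library's `html.parser` so that it runs without extra dependencies. `TextExtractor`, `strip_html`, and the sample titles are all hypothetical and not taken from the original notebook.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(raw_html):
    """Return the text content of raw_html with all tags removed."""
    extractor = TextExtractor()
    extractor.feed(raw_html)
    extractor.close()
    return ''.join(extractor.parts)


# Hypothetical raw titles standing in for the 'title' field of each document
raw_titles = [
    '<h1 class="title">First headline</h1>',
    '<h1>Second <em>headline</em></h1>',
]
clean_titles = [strip_html(t) for t in raw_titles]
print(clean_titles)
```

With BeautifulSoup installed, the body of `strip_html` could instead be the `BeautifulSoup(raw_html, "lxml").get_text()` call used in the cells above.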

### Text

In [3]:
client = MongoClient()
mydb = client["junk_website_data"]

In [4]:
text_cursor = mydb.junk_website_data.find({}, {'_id':0, 'title': 1, 'text': 1})
articles = list(text_cursor)

## Model the Data

Models to be created:
1. SVD
2. NMF
3. LDA

All models will be created with both a count vectorizer and a tf-idf vectorizer.
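
To make the difference between the two vectorizers concrete, here is a small sketch on a toy corpus. The three sentences are invented, and both vectorizers use scikit-learn's default settings rather than the parameters used later in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Invented toy corpus: 'the' occurs in every document, other words in only one
toy_corpus = [
    'the dog chased the cat',
    'the band played the song',
    'the team won the game',
]

count_vec = CountVectorizer()
counts = count_vec.fit_transform(toy_corpus)

tfidf_vec = TfidfVectorizer()
weights = tfidf_vec.fit_transform(toy_corpus)

# Raw counts for the first document: 'the' appears twice, 'dog' once
count_the = counts[0, count_vec.vocabulary_['the']]
count_dog = counts[0, count_vec.vocabulary_['dog']]
print(count_the, count_dog)

# tf-idf weights for the same document: 'the' is in every document, so its
# weight relative to 'dog' shrinks compared with the raw 2-to-1 count ratio
weight_the = weights[0, tfidf_vec.vocabulary_['the']]
weight_dog = weights[0, tfidf_vec.vocabulary_['dog']]
print(weight_the / weight_dog)
```

Both matrices have one row per document and one column per vocabulary term; only the values differ.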

### SVD - Count Vectorizer

In [13]:
cv = NLPPipeline()
cv.vectorize(articles)
cv.fit(topic_model=TruncatedSVD(5))
cv.display_topics(20)


Topic  0
game, dog, movie, band, team, cat, war, rock, character, song, video, white, school, season, photo, actor, john, album, human, fire

Topic  1
dog, cat, photo, rescue, animals, save, credit, owner, pet, animal, humans, water, cub, look like, pup, human, photo credit, food, adopt, boat

Topic  2
cat, myth, snapchat, food, human, milk, image, kitty, kitten, credit, researchers, feral, space, ancient, paw, study, hunt, litter, domesticate, color

Topic  3
game, team, sport, season, bowl, super, players, coach, nba, super bowl, player, nfl, league, football, ball, field, basketball, cup, olympics, championship

Topic  4
game, band, song, winner, album, team, rock, dog, cat, songs, sport, bowl, player, super bowl, players, tour, nba, coach, outstanding, super
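
`NLPPipeline.display_topics` lives in the `NLPProcessing` module, which is not shown here. A common way to produce a listing like the one above from a fitted scikit-learn decomposition is to sort each row of `components_` and keep the highest-weighted terms. The sketch below shows that idea on made-up numbers; `top_terms` is a hypothetical helper and not necessarily how the original class works.

```python
def top_terms(components, feature_names, n_top):
    """For each topic (a row of term weights), return the n_top highest-weighted terms."""
    topics = []
    for weights in components:
        # Indices of the terms, ordered from largest weight to smallest
        ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
        topics.append([feature_names[i] for i in ranked[:n_top]])
    return topics


# Made-up weights: 2 topics over a 4-term vocabulary
feature_names = ['cat', 'dog', 'game', 'team']
components = [
    [0.9, 0.7, 0.1, 0.0],
    [0.0, 0.1, 0.8, 0.6],
]

for idx, terms in enumerate(top_terms(components, feature_names, 2)):
    print('Topic ', idx)
    print(', '.join(terms))
```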


### SVD - TFIDF

In [21]:
nltk_stop_words = set(stopwords.words('english'))
tfidf = NLPPipeline(vectorizer=TfidfVectorizer(stop_words=nltk_stop_words, min_df=15, max_df=0.25, ngram_range=(1,3)))
tfidf.vectorize(articles)
tfidf.fit(topic_model=TruncatedSVD(5))
tfidf.display_topics(20)


Topic  0
game, movie, band, actor, wed, dog, character, song, war, season, team, video, tweet, award, photo, rock, royal, grande, school, police

Topic  1
markle, meghan, royal, harry, prince, wed, prince harry, duchess, meghan markle, royal family, thomas, swift previously, swift previously report, nicki swift previously, engagement, grande, palace, middleton, kate, davidson

Topic  2
markle, royal, meghan, harry, prince, prince harry, duchess, meghan markle, thomas, royal family, queen, palace, dog, middleton, game, william, war, george, princess, sussex

Topic  3
grande, davidson, ariana, miller, pete, ariana grande, pete davidson, band, album, song, mac, mac miller, night live, pop star, saturday night live, comedian, saturday night, grande davidson, snl, saturday

Topic  4
lovato, sexual, drug, arrest, sobriety, abuse, overdose, addiction, charge, police, demi, allege, assault, band, tmz, kelly, weinstein, allegedly, rehab, trump


In [15]:
nltk_stop_words = set(stopwords.words('english'))
tfidf = NLPPipeline(vectorizer=TfidfVectorizer(stop_words=nltk_stop_words, min_df=15, max_df=0.25, ngram_range=(1,3)))
tfidf.vectorize(articles)
tfidf.fit(topic_model=TruncatedSVD(5))
tfidf.display_topics(20)


Topic  0
game, movie, band, actor, wed, dog, character, song, war, season, team, video, tweet, award, photo, rock, royal, grande, school, police

Topic  1
markle, meghan, royal, harry, prince, wed, prince harry, duchess, meghan markle, royal family, thomas, swift previously, swift previously report, nicki swift previously, engagement, grande, palace, middleton, kate, davidson

Topic  2
markle, royal, meghan, harry, prince, prince harry, duchess, meghan markle, thomas, royal family, queen, palace, dog, middleton, game, william, war, george, princess, wed

Topic  3
grande, davidson, ariana, pete, miller, ariana grande, pete davidson, band, album, song, mac, mac miller, bieber, night live, pop star, saturday night live, comedian, baldwin, saturday night, grande davidson

Topic  4
lovato, drug, arrest, sexual, sobriety, abuse, addiction, charge, overdose, police, demi, allege, assault, tmz, band, allegedly, rehab, sentence, prison, cosby


### NMF - Count Vectorizer

In [45]:
cv.fit(topic_model=NMF(5))
cv.display_topics(20)


Topic  0
movie, war, character, president, white, actor, trump, school, men, build, job, fire, human, role, america, movies, party, stuff, murder, allegedly

Topic  1
dog, photo, rescue, save, look like, pup, cub, animals, adopt, credit, boat, animal, owner, water, photo credit, breed, pant, khan, pet, taco

Topic  2
cat, credit, food, myth, human, snapchat, image, pet, animals, owner, humans, paw, eat, milk, kitten, animal, water, kitty, sleep, fish

Topic  3
game, team, sport, season, super, bowl, players, player, coach, nba, super bowl, league, nfl, football, ball, field, score, basketball, title, cup

Topic  4
band, song, rock, album, winner, songs, roll, tour, group, stone, sing, video, john, mercury, roll stone, single, lyric, queen, albums, award


### NMF - TFIDF

In [71]:
tfidf = NLPPipeline(vectorizer=TfidfVectorizer(stop_words=nltk_stop_words, min_df=15, max_df=0.25, ngram_range=(1,3)))
tfidf.vectorize(articles)
tfidf.fit(topic_model=NMF(5))
tfidf.display_topics(20)


Topic  0
game, movie, band, dog, war, character, team, song, rock, season, school, president, sport, actor, trump, white, album, men, human, murder

Topic  1
markle, meghan, royal, harry, prince, prince harry, wed, duchess, meghan markle, thomas, royal family, palace, middleton, queen, sussex, windsor, kate, princess, duchess sussex, harry meghan

Topic  2
kardashian, jenner, welcome, first child, caption, swift previously, swift previously report, nicki swift previously, mom, divorce, child together, marriage, thompson, excite, june, us weekly, girl, wed, social media, weekly

Topic  3
grande, davidson, ariana, pete, miller, ariana grande, pete davidson, engagement, comedian, night live, pop star, saturday, saturday night live, bieber, mac miller, saturday night, mac, baldwin, grande davidson, snl

Topic  4
lovato, sobriety, overdose, demi, addiction, drug, rehab, abuse, sober, health, demi lovato, substance abuse, substance, relapse, struggle, help privacy, help privacy policy, reco

### LDA - Count Vectorizer

In [18]:
doc_word = cv.vectorized_corpus.transpose()

In [20]:
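# Rows are terms, columns are documents (get_feature_names_out requires scikit-learn >= 1.0)
pd.DataFrame(doc_word.toarray(), index=cv.vectorizer.get_feature_names_out()).head()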

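| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 3084 | 3085 | 3086 | 3087 | 3088 | 3089 | 3090 | 3091 | 3092 | 3093 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| aaron | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| aback | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| abandon | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| abbey | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| abbott | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |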


In [21]:
# Convert sparse matrix of counts to a gensim corpus
corpus = matutils.Sparse2Corpus(doc_word)

In [23]:
id2word = dict((v, k) for k, v in cv.vectorizer.vocabulary_.items())
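
The transpose and the inverted dictionary both exist to match gensim's expectations: `Sparse2Corpus` treats columns as documents by default, and `LdaModel` wants an id-to-word mapping, whereas `vocabulary_` maps each word to its column index. The following plain-Python sketch illustrates the resulting data layout on made-up values. It is not how gensim implements the conversion, and all names and numbers are hypothetical.

```python
# Made-up vocabulary in the shape of CountVectorizer.vocabulary_ (word -> column index)
vocabulary = {'band': 0, 'dog': 1, 'song': 2}

# Invert it to the id -> word mapping that LdaModel takes as id2word
id2word = {idx: word for word, idx in vocabulary.items()}

# Made-up document-term counts: 2 documents (rows) x 3 terms (columns)
doc_term_counts = [
    [2, 0, 1],
    [0, 3, 0],
]

# Bag-of-words form: one list of (term_id, count) pairs per document, zeros omitted
bow_corpus = [
    [(term_id, count) for term_id, count in enumerate(row) if count]
    for row in doc_term_counts
]

# Translate the ids back to words to check the mapping
for doc in bow_corpus:
    print([(id2word[term_id], count) for term_id, count in doc])
```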

In [24]:
# Create lda model (equivalent to "fit" in sklearn)
lda = models.LdaModel(corpus=corpus, num_topics=3, id2word=id2word, passes=5)

In [25]:
lda.print_topics()

[(0,
  '0.005*"movie" + 0.004*"character" + 0.003*"actor" + 0.002*"role" + 0.002*"movies" + 0.002*"cast" + 0.002*"host" + 0.002*"tweet" + 0.002*"season" + 0.001*"comedy"'),
 (1,
  '0.006*"band" + 0.005*"dog" + 0.005*"song" + 0.004*"rock" + 0.004*"album" + 0.003*"photo" + 0.002*"songs" + 0.002*"roll" + 0.002*"credit" + 0.002*"tour"'),
 (2,
  '0.003*"game" + 0.002*"war" + 0.002*"team" + 0.001*"human" + 0.001*"dog" + 0.001*"cat" + 0.001*"build" + 0.001*"water" + 0.001*"eat" + 0.001*"white"')]

In [225]:
# Transform the docs from the word space to the topic space (like "transform" in sklearn)
lda_corpus = lda[corpus]

<gensim.interfaces.TransformedCorpus at 0x1a38e6deb8>

In [226]:
# Store the documents' topic vectors in a list so we can take a peek
lda_docs = [doc for doc in lda_corpus]

In [227]:
lda_docs[0:5]

[[(0, 0.37225583), (2, 0.62675345)],
 [(0, 0.2294572), (1, 0.2791612), (2, 0.4913816)],
 [(0, 0.7855218), (2, 0.21371226)],
 [(2, 0.99563247)],
 [(0, 0.3221556), (2, 0.67626154)]]
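
Each document is a list of `(topic_id, probability)` pairs. gensim leaves out topics whose probability falls below a minimum threshold, which is why some documents list fewer than three topics. One simple way to summarize these vectors is to take the most probable topic per document. The sketch below does this for the five vectors printed above; `dominant_topic` is a hypothetical helper name.

```python
# The first five topic vectors, copied from the notebook output
lda_docs_sample = [
    [(0, 0.37225583), (2, 0.62675345)],
    [(0, 0.2294572), (1, 0.2791612), (2, 0.4913816)],
    [(0, 0.7855218), (2, 0.21371226)],
    [(2, 0.99563247)],
    [(0, 0.3221556), (2, 0.67626154)],
]


def dominant_topic(topic_vector):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(topic_vector, key=lambda pair: pair[1])


dominant = [dominant_topic(doc)[0] for doc in lda_docs_sample]
print(dominant)  # [2, 2, 0, 2, 2]
```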

### LDA - TFIDF

In [33]:
nltk_stop_words = set(stopwords.words('english'))
tfidf_lda = NLPPipeline(vectorizer=TfidfVectorizer(stop_words=nltk_stop_words, min_df=15, max_df=0.25, ngram_range=(1,3)))
tfidf_lda.vectorize(articles)
doc_word_tfidf = tfidf_lda.vectorized_corpus.transpose()

In [36]:
# Convert sparse matrix of tf-idf weights to a gensim corpus
corpus = matutils.Sparse2Corpus(doc_word_tfidf)

In [38]:
id2word = dict((v, k) for k, v in tfidf_lda.vectorizer.vocabulary_.items())

In [39]:
# Create lda model (equivalent to "fit" in sklearn)
lda = models.LdaModel(corpus=corpus, num_topics=5, id2word=id2word, passes=5)

In [40]:
lda.print_topics()

[(0,
  '0.001*"mcphee" + 0.000*"grande" + 0.000*"jolie" + 0.000*"davidson" + 0.000*"pitt" + 0.000*"lowell" + 0.000*"katharine mcphee" + 0.000*"david foster" + 0.000*"capri" + 0.000*"carpet debut"'),
 (1,
  '0.000*"thanos" + 0.000*"spider" + 0.000*"spider man" + 0.000*"batman" + 0.000*"comics" + 0.000*"spacey" + 0.000*"marvel" + 0.000*"joker" + 0.000*"movie" + 0.000*"character"'),
 (2,
  '0.002*"lovato" + 0.001*"sobriety" + 0.001*"hyland" + 0.000*"demi" + 0.000*"overdose" + 0.000*"help privacy" + 0.000*"help privacy policy" + 0.000*"substance abuse mental" + 0.000*"abuse mental health" + 0.000*"abuse mental"'),
 (3,
  '0.001*"game" + 0.001*"dog" + 0.001*"band" + 0.001*"movie" + 0.001*"war" + 0.001*"team" + 0.001*"character" + 0.001*"song" + 0.001*"photo" + 0.001*"rock"'),
 (4,
  '0.001*"chopra" + 0.001*"jonas" + 0.001*"child together baby" + 0.001*"together baby" + 0.001*"child together" + 0.000*"excite baby" + 0.000*"first child" + 0.000*"baby news" + 0.000*"share excite" + 0.000*"baby