# Twitter Hashtag Recommender – Mining

A dataset of hashtags and their respective post counts was downloaded from the following page:

https://www.kaggle.com/tastelesswine/hashtaglist

The first rows of the data look as follows:

In [1]:
import pandas as pd

hashtag_data = pd.read_csv("mined_data/Top_hashtag.csv")
hashtag_data.head()

Unnamed: 0,S.no,Hashtag,Posts,Comments,Likes
0,1,love,1212163650,115.444444,4967.666667
1,2,friend,47119175,44.111111,6833.222222
2,3,beach,168603835,11.555556,893.666667
3,4,family,242155953,18.777778,813.555556
4,5,yellow,29176748,59.111111,3473.444444


Tweets from the top hashtags with the most posts were collected using the following custom feature extractor that uses the tweepy module.
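
Picking the hashtags with the most posts can be sketched as below. This is only an illustration, not the notebook's actual selection code: the rows are copied from the preview above, the helper name `top_hashtags` is hypothetical, and the standard-library `csv` module is used instead of pandas so the snippet runs on its own.

```python
import csv
import io

# Rows copied from the preview of Top_hashtag.csv shown above
PREVIEW = """S.no,Hashtag,Posts
1,love,1212163650
2,friend,47119175
3,beach,168603835
4,family,242155953
5,yellow,29176748
"""

def top_hashtags(csv_text, n):
    """Return the n hashtags with the highest post counts (hypothetical helper)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rows.sort(key=lambda row: int(row["Posts"]), reverse=True)
    return ["#" + row["Hashtag"] for row in rows[:n]]

print(top_hashtags(PREVIEW, 3))
```

Each returned string can then be passed as the query to the extractor defined next.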

In [2]:
from sklearn.base import BaseEstimator, TransformerMixin
import tweepy
import numpy as np
import os

# Credentials are read from environment variables (the variable names are placeholders);
# real keys should never be committed in a notebook.
consumer_key = os.environ["TWITTER_CONSUMER_KEY"]
consumer_secret = os.environ["TWITTER_CONSUMER_SECRET"]
access_token = os.environ["TWITTER_ACCESS_TOKEN"]
access_token_secret = os.environ["TWITTER_ACCESS_TOKEN_SECRET"]
auth = tweepy.OAuthHandler(consumer_key,consumer_secret)
auth.set_access_token(access_token,access_token_secret)

api = tweepy.API(auth,wait_on_rate_limit=True)

class HashtagDetailsExtractor(BaseEstimator, TransformerMixin):

    def __init__(self,tweetnum=100):
        self.tweetnum=tweetnum
    
    def fit(self,X,y=None):
        return self
    
    def transform(self,tag):
        tweet_text_list = []
        for tweet in tweepy.Cursor(api.search,q=tag,tweet_mode="extended").items(self.tweetnum):
            tweet_text_list.append(tweet._json["full_text"].lower())
        
        return tweet_text_list

hashtag_details_extractor = HashtagDetailsExtractor(1000)
tweets = hashtag_details_extractor.fit_transform("#love")

In [5]:
perm = np.random.RandomState(seed=13).permutation(10)*10+200

[tweets[i] for i in perm]

['rt @suzialbracht: the ghost fixer\nhttps://t.co/ngfdrzdlqt\na freak car accident stole my life from me and left my spirit earthbound. sounds…',
 '"national forecast for january 5, 2020" via fox news https://t.co/xukxse0q6m https://t.co/conoqih2aa #mlb #baseball #dfs #love #ny #lineup #softball #dk #fd #usa #homerun #funny #haha #wtf # #twins #astros #rangers #redsox #whitesox #usa #nba #video #money #fantasy #night',
 "don't get stuck with a bad lender! loan with me and see just how easy it can be!\nhttps://t.co/oshnvwe54k\n#life #love #lockwithleslie #home #mortgages #realestate #realtor #mortgage #mortgagebroker #loanofficer #mortgagelender https://t.co/dfypjxzrlz",
 'rt @vclinebarton: 🎇inspiring words to ponder this first #saturdaymorning of the #newyear2020! 😁🌠 #happysaturday friends! 💖👑🕵️\u200d♂️🍸 #saturday…',
 'wisdom: if you find yourself with an empty fridge always remember burger king is only a bus-ride away #workout #love #botlife https://t.co/mpuc8j6sv8',
 "don't stop find

As you can see, the text is messy and full of repeated punctuation, emojis, hashtags, links, etc. To clean it up, let's start by tokenizing the text. A scikit-learn wrapper class was written around the NLTK tokenizer so it can be used in a scikit-learn pipeline. (A Treebank tokenizer was used, but a future version should use NLTK's `TweetTokenizer` instead.)

In [6]:
from nltk.tokenize import TreebankWordTokenizer

tokenizermodel = TreebankWordTokenizer()

class Tokenizer(BaseEstimator, TransformerMixin):
    
    def __init__(self,tokenizer_model,lowercase=True):
        self.tokenizer_model = tokenizer_model
        self.lowercase = lowercase
        
    def fit(self,X,y=None):
        return self
    
    def transform(self, X, y=None):
        tokenized_tweets = []
        for tweet in X:    
            tweet_text_tokenized = self.tokenizer_model.tokenize(tweet)
            tokenized_tweets.append(tweet_text_tokenized)
        return tokenized_tweets

tokenizer = Tokenizer(tokenizermodel)

tokenized_tweets = tokenizer.fit_transform(tweets)

print([tokenized_tweets[i] for i in perm])

[['rt', '@', 'suzialbracht', ':', 'the', 'ghost', 'fixer', 'https', ':', '//t.co/ngfdrzdlqt', 'a', 'freak', 'car', 'accident', 'stole', 'my', 'life', 'from', 'me', 'and', 'left', 'my', 'spirit', 'earthbound.', 'sounds…'], ['``', 'national', 'forecast', 'for', 'january', '5', ',', '2020', "''", 'via', 'fox', 'news', 'https', ':', '//t.co/xukxse0q6m', 'https', ':', '//t.co/conoqih2aa', '#', 'mlb', '#', 'baseball', '#', 'dfs', '#', 'love', '#', 'ny', '#', 'lineup', '#', 'softball', '#', 'dk', '#', 'fd', '#', 'usa', '#', 'homerun', '#', 'funny', '#', 'haha', '#', 'wtf', '#', '#', 'twins', '#', 'astros', '#', 'rangers', '#', 'redsox', '#', 'whitesox', '#', 'usa', '#', 'nba', '#', 'video', '#', 'money', '#', 'fantasy', '#', 'night'], ['do', "n't", 'get', 'stuck', 'with', 'a', 'bad', 'lender', '!', 'loan', 'with', 'me', 'and', 'see', 'just', 'how', 'easy', 'it', 'can', 'be', '!', 'https', ':', '//t.co/oshnvwe54k', '#', 'life', '#', 'love', '#', 'lockwithleslie', '#', 'home', '#', 'mortgages',

We could feed all these tweets straight into our word embedding model to turn the words into vectors, but this would force us to drop a lot of tokens because they do not exist in the model's vocabulary. This is a big problem because some of the words (such as "beyourself") and the emojis (ex: 💖) seem to share some meaning with "love", the hashtag we are querying.

To solve this, a series of preprocessors were created:

#### 1.) Remove trailing punctuation

Removes trailing periods which failed to be tokenized (ex: 😉.), such as the "earthbound." token in the first sample tweet above, along with a trailing "…". Although the word embedding's vocabulary includes periods, separating them from the meaningful tokens matters for the next 2 steps: leaving them attached would make it much harder for the algorithms to find meaningful elements inside compound tokens.

Another solution would have been to use a tokenizer that removes all punctuation, but I believed "!" and "?" hold some valuable context about the tweet.

In [7]:
class RemoveTrailingPeriods(BaseEstimator, TransformerMixin):
    
    def __init__(self):
        pass
    
    def fit(self, X ,y=None):
        return self
    
    def transform(self, X,y=None):
        for i,tweet in enumerate(X):
            indexes_with_period = []
            for ii, word_token in enumerate(tweet):
                try:
                    while(X[i][ii][-1]=="." and len(X[i][ii]) != 1):
                        X[i][ii] = X[i][ii][:-1]
                
                    if(X[i][ii][-1]=="…"):
                        X[i][ii] = X[i][ii][:-1]
                except:
                    pass
        return X

In [8]:
rmtrailingp = RemoveTrailingPeriods()

no_periods = rmtrailingp.fit_transform(tokenized_tweets)

print([no_periods[i] for i in perm])

[['rt', '@', 'suzialbracht', ':', 'the', 'ghost', 'fixer', 'https', ':', '//t.co/ngfdrzdlqt', 'a', 'freak', 'car', 'accident', 'stole', 'my', 'life', 'from', 'me', 'and', 'left', 'my', 'spirit', 'earthbound', 'sounds'], ['``', 'national', 'forecast', 'for', 'january', '5', ',', '2020', "''", 'via', 'fox', 'news', 'https', ':', '//t.co/xukxse0q6m', 'https', ':', '//t.co/conoqih2aa', '#', 'mlb', '#', 'baseball', '#', 'dfs', '#', 'love', '#', 'ny', '#', 'lineup', '#', 'softball', '#', 'dk', '#', 'fd', '#', 'usa', '#', 'homerun', '#', 'funny', '#', 'haha', '#', 'wtf', '#', '#', 'twins', '#', 'astros', '#', 'rangers', '#', 'redsox', '#', 'whitesox', '#', 'usa', '#', 'nba', '#', 'video', '#', 'money', '#', 'fantasy', '#', 'night'], ['do', "n't", 'get', 'stuck', 'with', 'a', 'bad', 'lender', '!', 'loan', 'with', 'me', 'and', 'see', 'just', 'how', 'easy', 'it', 'can', 'be', '!', 'https', ':', '//t.co/oshnvwe54k', '#', 'life', '#', 'love', '#', 'lockwithleslie', '#', 'home', '#', 'mortgages', '
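
The per-token rule in step 1 can also be written as a small pure function that does not modify its input. This is a minimal sketch rather than the class used above: the name `strip_trailing_periods` is hypothetical, and unlike the original loop it leaves a lone "…" token untouched instead of turning it into an empty string.

```python
def strip_trailing_periods(token):
    """Drop trailing '.' characters and one trailing ellipsis, keeping 1-character tokens."""
    stripped = token
    # Remove periods from the end, but never reduce the token below one character
    while len(stripped) > 1 and stripped.endswith("."):
        stripped = stripped[:-1]
    # Remove a single trailing ellipsis character left over from truncated tweets
    if len(stripped) > 1 and stripped.endswith("…"):
        stripped = stripped[:-1]
    return stripped

sample = ["spirit", "earthbound.", "sounds…", "."]
print([strip_trailing_periods(token) for token in sample])
```

Because it returns new strings, the tokenized tweets from the previous step stay available for comparison.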

#### 2.) Break down emojis into their definitions.

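Replaces the emojis inside the tweet with their Unicode names (ex: 🎇 becomes "firework", "sparkler"). Sometimes emojis were combined with words in a single token, so such tokens are first split into their emoji and non-emoji parts.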

In [9]:
from nltk.tokenize import TweetTokenizer
import copy
import emoji
from emoji import UNICODE_EMOJI
import functools
import operator
import re
import unicodedata as ud

class ConvertEmojis(BaseEstimator, TransformerMixin):

    def __init__(self, tokenizer_model):
        self.tokenizer_model = tokenizer_model
    
    def fit(self,X,y=None):
        return self
    
    def hasemoji(self,s):
        em_split_emoji = emoji.get_emoji_regexp().split(s)
        em_split_whitespace = [substr.split() for substr in em_split_emoji]
        em_split = functools.reduce(operator.concat, em_split_whitespace)
        emojiExists = False
        for emojiTest in em_split:
            if(emojiTest in UNICODE_EMOJI):
                emojiExists = True
    
        return emojiExists
    
    def transform(self, X, y=None):
        emoji_seperator = TweetTokenizer()
        X_copy = copy.deepcopy(X)
        for i,tweet in enumerate(X):
            shiftindex = 0
            for ii,word_token in enumerate(tweet):
                if(self.hasemoji(word_token)):
                    em_split_emoji = emoji.get_emoji_regexp().split(word_token)
                    em_split_whitespace = [substr.split() for substr in em_split_emoji]
                    em_split = functools.reduce(operator.concat, em_split_whitespace)
                    emoji_detail_tokenized = []
                    words = ""
                    for iii,each_emoji in enumerate(em_split):
                        try:
                            emoji_detail = ud.name(each_emoji)
                            emoji_tokenized = self.tokenizer_model.tokenize(emoji_detail.lower())
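                            # Drop the invisible "variation selector-16" character (index the token list, not the emoji module)
                            if(emoji_tokenized[-1]=="selector-16" and emoji_tokenized[-2]=="variation"):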
                                emoji_tokenized.clear()
                            emoji_detail_tokenized.extend(emoji_tokenized)
                        except:
                            emoji_detail_tokenized.extend([each_emoji])
                            pass
                    X_copy[i].pop(ii + shiftindex)
                    X_copy[i] = X_copy[i][:ii + shiftindex] + emoji_detail_tokenized + X_copy[i][ii + shiftindex:]
                    shiftindex += len(emoji_detail_tokenized) - 1
        return X_copy

In [10]:
emojiConv = ConvertEmojis(tokenizermodel)

no_emoji = emojiConv.fit_transform(no_periods)

print([no_emoji[i] for i in perm])

[['rt', '@', 'suzialbracht', ':', 'the', 'ghost', 'fixer', 'https', ':', '//t.co/ngfdrzdlqt', 'a', 'freak', 'car', 'accident', 'stole', 'my', 'life', 'from', 'me', 'and', 'left', 'my', 'spirit', 'earthbound', 'sounds'], ['``', 'national', 'forecast', 'for', 'january', '5', ',', '2020', "''", 'via', 'fox', 'news', 'https', ':', '//t.co/xukxse0q6m', 'https', ':', '//t.co/conoqih2aa', '#', 'mlb', '#', 'baseball', '#', 'dfs', '#', 'love', '#', 'ny', '#', 'lineup', '#', 'softball', '#', 'dk', '#', 'fd', '#', 'usa', '#', 'homerun', '#', 'funny', '#', 'haha', '#', 'wtf', '#', '#', 'twins', '#', 'astros', '#', 'rangers', '#', 'redsox', '#', 'whitesox', '#', 'usa', '#', 'nba', '#', 'video', '#', 'money', '#', 'fantasy', '#', 'night'], ['do', "n't", 'get', 'stuck', 'with', 'a', 'bad', 'lender', '!', 'loan', 'with', 'me', 'and', 'see', 'just', 'how', 'easy', 'it', 'can', 'be', '!', 'https', ':', '//t.co/oshnvwe54k', '#', 'life', '#', 'love', '#', 'lockwithleslie', '#', 'home', '#', 'mortgages', '
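
The lookup at the heart of step 2 is the standard library's `unicodedata.name`. The sketch below shows it in isolation; the helper name `emoji_to_words` is hypothetical, and it splits the name on whitespace instead of running it through the Treebank tokenizer as the class above does.

```python
import unicodedata as ud

VARIATION_SELECTOR_16 = "\ufe0f"  # invisible character that often follows an emoji

def emoji_to_words(char):
    """Return the lowercased words of a character's Unicode name, or [] if there is none."""
    if char == VARIATION_SELECTOR_16:
        return []  # carries no meaning of its own, so it is dropped
    name = ud.name(char, "")  # "" is returned when the character has no name
    return name.lower().split()

print(emoji_to_words("🎇"))
print(emoji_to_words("💖"))
```

The resulting words are ordinary English tokens, which gives them a much better chance of being found in the embedding vocabulary than the emoji character itself.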

#### 3.) Break down compound words

Replaces compound words such as "strongwoman", "makelove", and "followme" (which are likely not in the word embedding vocabulary yet probably contain valuable context information) with their simpler parts. When there are multiple possible ways to break up a word, the `wordfreq` library's `word_frequency` function is used to get the frequency of each part (between 0.0 and 1.0) in a large corpus. The two frequencies are multiplied together, giving `both_word_freq` for each combination, and the combination with the highest `both_word_freq` replaces the compound word inside the tweet. Multiplying the two frequencies in each combination also allows us to prioritize words that can stand on their own. For example, we would rather keep "mortgage" as it is instead of breaking it down into "mort" and "gage." This should also hold true for compound words separated by a "-":

"poetrycommunity" -> \["poetry","community"\]

"best-selling" -> \["best-selling"\] 

"deephouse" -> \["deep","house"\]

"mortgage" -> \["mortgage"\]

"fight-club" -> \["fight","club"\] (NOTE: NO EXAMPLE OF THIS FROM THE SAMPLE )

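The selection rule can be illustrated with a simplified sketch. It is not the class below: the frequencies in `TOY_FREQ` are made-up stand-ins for `word_frequency(word, "en")`, the same table doubles as the vocabulary, hyphens are not handled, and `best_split` is a hypothetical name.

```python
# Made-up frequencies standing in for wordfreq.word_frequency(word, "en")
TOY_FREQ = {
    "deep": 1e-4, "house": 3e-4,
    "de": 2e-5, "dee": 1e-7,
    "mortgage": 2e-5, "mort": 1e-7, "gage": 2e-7,
}

def best_split(word, freq):
    """Return [word] if it is known, else the two-part split with the highest frequency product."""
    if word in freq:
        return [word]  # a known word is kept whole, like "mortgage"
    best, best_score = [word], 0.0
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        # An unknown part contributes 0.0, which rules the split out
        score = freq.get(left, 0.0) * freq.get(right, 0.0)
        if score > best_score:
            best, best_score = [left, right], score
    return best

print(best_split("deephouse", TOY_FREQ))
print(best_split("mortgage", TOY_FREQ))
print(best_split("lockwithleslie", TOY_FREQ))
```

A token with no valid split, such as "lockwithleslie", is returned unchanged.
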
In [11]:
from wordfreq import word_frequency

class BreakupWords(BaseEstimator, TransformerMixin):
    
    def __init__(self,model):
        self.model = model
        
    def fit(self, X, y=None):
        return self
    
    def break_compound_word(self,compound_word):
        possible_words = []
        first_word=""
        for i,xchar in enumerate(compound_word):
            second_word = compound_word[i+1:]
            if(xchar=="-"):
                try:
                    self.model[first_word]
                    self.model[second_word]
                    possible_words.append([first_word,second_word])
                except:
                    pass
                    
            first_word+=xchar
            try:          
                self.model[first_word]
                self.model[second_word]
                possible_words.append([first_word,second_word])
            except:
                if(second_word==""):
                    try:
                        self.model[first_word]
                        possible_words.append([first_word])
                    except:
                        pass
        return possible_words
    
    def transform(self, X, y=None):
        X_copy = copy.deepcopy(X)
        for i,tweet in enumerate(X):
            shiftindex = 0
            for ii, token in enumerate(tweet):
                child_words = self.break_compound_word(token)
                highestfrq = 0
                most_likely_combo = [token]  # fallback so a stale or undefined combo is never used
                if(len(child_words)!= 0):
                    for child_word_set in child_words:
                        if(len(child_word_set)!=1):
                            both_word_freq = word_frequency(child_word_set[0],"en")*word_frequency(child_word_set[1],"en")
                            if (both_word_freq > highestfrq):
                                most_likely_combo = child_word_set
                                highestfrq = both_word_freq
                        else:
                            most_likely_combo = child_word_set
                    X_copy[i].pop(ii + shiftindex)
                    X_copy[i] = X_copy[i][:ii + shiftindex] + most_likely_combo + X_copy[i][ii + shiftindex:]
                    shiftindex += len(most_likely_combo)-1
        return X_copy

At this stage of the feature extraction, the word embedding model was introduced, since the `BreakupWords` class needs a vocabulary to look up the simple words inside compound words.

In [12]:
from gensim.test.utils import common_texts
from gensim.models import Word2Vec
import gensim.downloader as gensimapi

vectorizer_model = gensimapi.load("glove-wiki-gigaword-50")

In [19]:
breakupWords = BreakupWords(vectorizer_model)

simple_words = breakupWords.fit_transform(no_emoji)

print([simple_words[i] for i in perm])

[['rt', '@', 'suzialbracht', ':', 'the', 'ghost', 'fixer', 'https', ':', '//t.co/ngfdrzdlqt', 'a', 'freak', 'car', 'accident', 'stole', 'my', 'life', 'from', 'me', 'and', 'left', 'my', 'spirit', 'earthbound', 'sounds'], ['``', 'national', 'forecast', 'for', 'january', '5', ',', '2020', "''", 'via', 'fox', 'news', 'https', ':', '//t.co/xukxse0q6m', 'https', ':', '//t.co/conoqih2aa', '#', 'mlb', '#', 'baseball', '#', 'dfs', '#', 'love', '#', 'ny', '#', 'lineup', '#', 'softball', '#', 'dk', '#', 'fd', '#', 'usa', '#', 'homerun', '#', 'funny', '#', 'haha', '#', 'wtf', '#', '#', 'twins', '#', 'astros', '#', 'rangers', '#', 'redsox', '#', 'whitesox', '#', 'usa', '#', 'nba', '#', 'video', '#', 'money', '#', 'fantasy', '#', 'night'], ['do', "n't", 'get', 'stuck', 'with', 'a', 'bad', 'lender', '!', 'loan', 'with', 'me', 'and', 'see', 'just', 'how', 'easy', 'it', 'can', 'be', '!', 'https', ':', '//t.co/oshnvwe54k', '#', 'life', '#', 'love', '#', 'lockwithleslie', '#', 'home', '#', 'mortgages', '

The remaining text still contains a lot of unnecessary punctuation that contributes nothing to the meaning of the hashtag, which is why another feature extractor was created to remove all the given punctuation tokens.

In [20]:
class RemovePunctuation(BaseEstimator, TransformerMixin):
    
    def __init__(self,charsequence):
        self.charsequence = set(charsequence)
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        X_copy = X.copy()
        for i,tweet in enumerate(X):
            
            X_copy[i] = [text for text in tweet if not set([text]).issubset(self.charsequence)]
            
        return X_copy

In [23]:
noPunctuation = RemovePunctuation([".","#",":","•",",","@","\"",";","\'",")","(","&","``","\'\'"])

no_punc = noPunctuation.fit_transform(simple_words)

print([no_punc[i] for i in perm])

[['rt', 'suzialbracht', 'the', 'ghost', 'fixer', 'https', '//t.co/ngfdrzdlqt', 'a', 'freak', 'car', 'accident', 'stole', 'my', 'life', 'from', 'me', 'and', 'left', 'my', 'spirit', 'earthbound', 'sounds'], ['national', 'forecast', 'for', 'january', '5', '2020', 'via', 'fox', 'news', 'https', '//t.co/xukxse0q6m', 'https', '//t.co/conoqih2aa', 'mlb', 'baseball', 'dfs', 'love', 'ny', 'lineup', 'softball', 'dk', 'fd', 'usa', 'homerun', 'funny', 'haha', 'wtf', 'twins', 'astros', 'rangers', 'redsox', 'whitesox', 'usa', 'nba', 'video', 'money', 'fantasy', 'night'], ['do', "n't", 'get', 'stuck', 'with', 'a', 'bad', 'lender', '!', 'loan', 'with', 'me', 'and', 'see', 'just', 'how', 'easy', 'it', 'can', 'be', '!', 'https', '//t.co/oshnvwe54k', 'life', 'love', 'lockwithleslie', 'home', 'mortgages', 'realestate', 'realtor', 'mortgage', 'mortgage', 'broker', 'loan', 'officer', 'mortgage', 'lender', 'https', '//t.co/dfypjxzrlz'], ['rt', 'vclinebarton', 'firework', 'sparkler', 'inspiring', 'words', 'to

The text looks a lot cleaner now with more 'vectorizable' words. Now we can finally convert all the text into vectors.

Vectorizing gives us, for each tweet, a list of vectors, one per token that was successfully converted. The final step is to reshape this list into a form that can be easily used in scikit-learn classifiers. We have three options:


#### 1.) Flatten out all the vectors for a given tweet into a 1 dimensional list of numbers.

#####    Pro: 

        Preserves all the vectors
    
#####    Con: 
    
        Will have too many dimensions (the current maximum Twitter post length is 280 characters. If a tweet were to go something like "i i i i i..." with "i" repeating 140 times, there would be 50\*140=7000 dimensions!) and will significantly affect classifier performance
    
        Order of the words in a tweet should not matter. In other words, there should not be any inherent difference between the tweet "love is important" and "is important love." We are only looking at the relationship between the words in the tweet and the hashtags
    
#### 2.) Sum up all the vectors in a tweet.

#####    Pro:
        
        No "Curse of Dimensionality Problem"
        
        Does not take order of the words into account
        
#####    Con:
    
        A lot more variation within the data points
        
#### 3.) Take an average of all the vectors in a tweet

#####    Pro:
    
        No "Curse of Dimensionality Problem"
        
        Does not take order of the words into account
        
        Does not cause much variation within the data points
        
#####    Con:
    
        If the vectors of the words in a tweet are all far from each other (Ex: in a 3-dimensional space, if the data points lie on a sphere), the average will be some point that has close to no correlation with any of the points in the space
        

This project used the second choice, but in future versions, choices 1 and 3 should also be explored.
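
To make the three options concrete, here is a minimal sketch on one toy tweet. The two 3-dimensional vectors are made-up numbers (the real GloVe vectors have 50 dimensions), and plain Python lists are used so the snippet runs without NumPy.

```python
# Two toy 3-dimensional "word vectors" for one tweet (made-up numbers)
tweet_vectors = [[1.0, 2.0, 3.0], [3.0, 0.0, -1.0]]

# Option 1: flatten, so the length grows with the number of tokens
flattened = [value for vector in tweet_vectors for value in vector]

# Option 2: element-wise sum, fixed length regardless of token count
summed = [sum(dimension) for dimension in zip(*tweet_vectors)]

# Option 3: element-wise average, i.e. the sum divided by the token count
averaged = [total / len(tweet_vectors) for total in summed]

print(len(flattened), summed, averaged)
# prints: 6 [4.0, 2.0, 2.0] [2.0, 1.0, 1.0]
```

Options 2 and 3 always produce a vector of the embedding's size, while option 1 would need padding or truncation before tweets of different lengths could share one feature matrix.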

In [28]:
class VectorizeTweets(BaseEstimator, TransformerMixin):
    
    def __init__(self,model):
        self.model = model
        
    def fit(self,X,y=None):
        return self
    
    def transform(self, X, y=None):
        X_vectorized = []
        for tweet in X:
            tweet_vectorized = []
            for token in tweet:
                try:
                    vector = self.model[token]
                    tweet_vectorized.append(vector)
                except:
                    pass
            X_vectorized.append(tweet_vectorized)
        
        return X_vectorized
    
class SumUpTweetVector(BaseEstimator, TransformerMixin):

    def __init__(self):
        pass
    
    def fit(self,X,y=None):
        return self
    
    def transform(self, X, y=None):
        tweet_vector_sums = []
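        # Take the vector size from the first tweet that has at least one known token
        vector_shape = next(len(tweet[0]) for tweet in X if len(tweet) > 0)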
        for tweet in X:
            vector_sum = np.zeros(vector_shape)
            for vector in tweet:
                vector_sum += vector
                
            tweet_vector_sums.append(vector_sum)
            
        return np.array(tweet_vector_sums)

In [30]:
tweetVectorizer = VectorizeTweets(vectorizer_model)

tweets_v = tweetVectorizer.fit_transform(no_punc)

vectorSumUp = SumUpTweetVector()

tweets_v_sum = vectorSumUp.fit_transform(tweets_v)

print([tweets_v_sum[i] for i in perm])

[array([  4.95226901,   3.15686597,   0.25663711,  -3.78946606,
         5.72923988,   2.64818394,  -6.50811001,  -0.31704699,
        -1.11409814,   1.82877156,  -1.47907597,   4.12991197,
        -7.04476113,   1.51743902,   5.79381247,  -1.925828  ,
        -3.33800095,   5.70873202,  -6.01290099,  -5.46673896,
        -2.53265295,   9.43682811,   0.49462399,   2.48373532,
         8.48723485, -23.39799008, -11.50549011,   5.11734487,
        12.19873604,  -5.41126891,  43.92079987,   1.64401296,
        -4.91623806,   0.47500407,  -0.06193791,   4.33825792,
         2.72299994,  -4.22834764,   4.21625101,  -2.25497804,
         0.68742408,   0.51830799,  -4.381761  ,   3.12400904,
         0.77917704,  -0.71636501,   0.60973888,  -7.45060703,
         5.13054807,  -1.60230596]), array([-11.07284827,  14.35666701,   1.63315413,  18.54431606,
        -7.24858415,  -8.64160295, -15.94223228, -10.08173362,
        -4.63276496,   1.47072607,  -1.74792978,   3.55550306,
       -11.126034

### Future TODO:

 1. Create a dataset that uses option 1 (from combining tweet-word vectors)
 2. Create a dataset that uses option 3 (from combining tweet-word vectors)
 3. Recognize combined words made up of 3 or more simpler words (Ex: "photooftheday" -> "photo" "of" "the" "day")
 4. Create a dataset that calculates the TF-IDF of all the words and uses 1 key word from each tweet to represent the tweet
 5. Create a dataset that calculates the TF-IDF of all the words, multiplies the vector for each word by its respective TF-IDF value, and then takes an average of all the vectors in a tweet
 6. Implement Doc2Vec to represent each tweet instead of trying to combine word vectors
 7. UPDATE - NLTK has a Twitter Tokenizer that preserves text-emojis such as :) and :-). Create a dataset using this tokenizer instead.
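
One possible approach to item 3 is dynamic programming over split points, choosing the segmentation whose parts have the highest product of frequencies. The sketch below is an assumption about how this could be done, not part of the project: the frequencies in `TOY_FREQ` are made up, and `segment` is a hypothetical name.

```python
import math

# Made-up frequencies standing in for a real word-frequency lookup
TOY_FREQ = {
    "photo": 1e-4, "of": 2e-2, "the": 5e-2, "day": 8e-4,
    "phot": 1e-7, "oft": 1e-6, "he": 3e-3,
}

def segment(word, freq):
    """Split word into known pieces maximizing the product of their frequencies."""
    # best[i] holds (log-score, pieces) for the best segmentation of word[:i]
    best = [(0.0, [])] + [(-math.inf, None)] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            prev_score, prev_pieces = best[start]
            if prev_pieces is None or piece not in freq:
                continue  # the prefix is unreachable or the piece is unknown
            score = prev_score + math.log(freq[piece])  # logs avoid underflow
            if score > best[end][0]:
                best[end] = (score, prev_pieces + [piece])
    # Fall back to the unsplit word when no full segmentation exists
    return best[-1][1] or [word]

print(segment("photooftheday", TOY_FREQ))
print(segment("lockwithleslie", TOY_FREQ))
```

With real frequencies, a product of many parts shrinks quickly, so this scoring tends to favor segmentations with fewer, more common words.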