# Twitter Hashtag Recommender - Analysis


After a day of mining, tweepy collected 79195 tweets, each of which was converted to a 50-dimensional vector (details are in the `Presentation_mining.ipynb` file).

### Reading the mined data

In [2]:
import csv
import numpy as np

# csv.reader yields every field as a string
with open("mined_data/data_X.csv", "rt") as f:
    reader = csv.reader(f)
    X = list(reader)

with open("mined_data/data_y.csv", "rt") as f:
    reader = csv.reader(f)
    y = list(reader)

X = np.array(X)
y = np.array(y)

print(X.shape)
print(y.shape)

(79195, 50)
(79195, 1)
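Because `csv.reader` returns every field as text, the arrays loaded this way hold strings until something converts them. A minimal sketch of casting to float at load time, using a made-up in-memory stand-in for `data_X.csv` (the sample values are purely illustrative):

```python
import csv
import io

import numpy as np

# Hypothetical two-row stand-in for mined_data/data_X.csv
sample_csv = io.StringIO("1.5,-2.0,3.25\n0.0,4.5,-1.0\n")

rows = list(csv.reader(sample_csv))
X_str = np.array(rows)                # unicode string dtype
X_num = np.array(rows, dtype=float)   # explicit cast to float64

print(X_str.dtype.kind)   # 'U' means unicode strings
print(X_num.dtype, X_num.shape)
```

Casting early makes accidental string comparisons or concatenations easier to spot, although the notebook as written relies on later steps to do the conversion.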


The number of tweets per hashtag, however, was not consistent. During mining, a maximum of 1000 tweets per hashtag was set, but as the counts below show, some hashtags fell far short of it: "#girl", for example, has only 16 tweets. A stratified shuffle split was therefore used to separate the training and test data so that both keep the same proportions of hashtags.

In [3]:
labeloccurence = {}

# Count how many tweets were mined for each hashtag
for label in y.flatten():
    labeloccurence[label] = labeloccurence.get(label, 0) + 1

print(labeloccurence)

{'#beach': 1000, '#family': 1000, '#friend': 811, '#instagood': 1000, '#love': 1000, '#photooftheday': 1000, '#yellow': 1000, '#fashion': 1000, '#beautiful': 272, '#happy': 1000, '#cute': 233, '#tbt': 1000, '#picoftheday': 1000, '#follow': 1000, '#selfie': 1000, '#summer': 1000, '#art': 1000, '#instadaily': 1000, '#nature': 1000, '#girl': 16, '#fun': 1000, '#style': 1000, '#smile': 1000, '#food': 758, '#instalike': 1000, '#likeforlike': 1000, '#fitness': 529, '#igers': 1000, '#tagsforlikes': 1000, '#nofilter': 1000, '#life': 1000, '#beauty': 292, '#amazing': 1000, '#instagram': 1000, '#photography': 1000, '#vscocam': 1000, '#photo': 1000, '#sun': 1000, '#music': 1000, '#ootd': 1000, '#bestoftheday': 88, '#sunset': 1000, '#sky': 806, '#dog': 1000, '#vsco': 1000, '#makeup': 1000, '#foodporn': 683, '#hair': 1000, '#pretty': 808, '#cat': 1000, '#model': 874, '#swag': 996, '#motivation': 1000, '#baby': 1000, '#party': 548, '#cool': 1000, '#gym': 618, '#lol': 1000, '#design': 1000, '#instapi
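To see at a glance which hashtags fell short of the mining cap, the same count can be done with `collections.Counter`. This is only a sketch: the labels and the cap of 5 below are small stand-ins for `y.flatten()` and the 1000-tweet cap.

```python
from collections import Counter

# Hypothetical labels standing in for y.flatten()
labels = ["#beach"] * 5 + ["#girl"] * 1 + ["#love"] * 5 + ["#cute"] * 2

counts = Counter(labels)
cap = 5  # stand-in for the 1000-tweet cap used during mining

# Hashtags that did not reach the cap, smallest first
under_cap = sorted(
    ((tag, n) for tag, n in counts.items() if n < cap),
    key=lambda item: item[1],
)
print(under_cap)  # [('#girl', 1), ('#cute', 2)]
```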

In [4]:
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.preprocessing import StandardScaler

splitter = StratifiedShuffleSplit(n_splits=1, random_state=13, test_size=0.2)

# Name the 50 feature columns "0".."49" and the label column "tag"
column_labels = [str(i) for i in range(50)]
column_labels.append("tag")
data = np.hstack((X, y))
dataDF = pd.DataFrame(data, columns=column_labels)

# n_splits=1, so this loop runs once; the split is stratified on the hashtag
for train_index, test_index in splitter.split(dataDF, dataDF["tag"]):
    stratified_train = dataDF.loc[train_index]
    stratified_test = dataDF.loc[test_index]
    
X_train = stratified_train[column_labels[:-1]]
y_train = stratified_train["tag"]

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

stratified_train.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,41,42,43,44,45,46,47,48,49,tag
5557,1.8970260387286544,11.426142051815988,2.286433018743992,3.6580030396580696,-3.2866700291633606,-4.1619668528437614,-9.547938160132617,-6.340730000287294,1.053048822504934,-4.044937053695321,...,-1.1010588519275188,6.068599864840508,-6.114979052916169,5.338927154429257,-1.7078100219368937,3.688665965106338,-0.4430071637034416,3.8419770002365112,0.1854590028524398,#photooftheday
51867,6.770628258673241,5.390256021171808,7.479233972728252,1.7166554885916412,1.4513099640607834,-1.4184269718825817,-3.562019938603044,-7.850565972737968,-2.35472735442454,-1.7609331631101668,...,-3.59414797462523,0.832190650049597,1.4485529512166977,-2.9203838554385584,6.806337898597121,2.2226188513450325,4.469500884413719,6.190316407941282,1.2796433912590146,#design
29334,-0.8846498653292656,4.359336979687214,-0.5777599550783634,5.230512024834752,-1.4255489446222782,-6.98695595562458,-4.281100064516068,-5.169744022190571,-4.81668734527193,6.982820004224777,...,2.2451081599574536,-3.6353960260748863,-6.352880030870438,-3.078800052404404,-5.381459020078182,1.385561030358076,-3.0108290389180183,7.013682007789612,7.76913595572114,#instagram
44266,4.408114038407803,0.8948999792337418,4.255405901814811,-2.9145560916513205,-1.1366247844416648,-2.2644220776855946,-4.2542518600821495,-8.765686988830566,-1.8110286509618163,1.9321849904954436,...,1.478124056942761,0.1154539138078689,-0.4421770721673965,1.8928942400962117,0.5045299828052521,-0.0895709916949272,-5.362092062830925,2.5468911081552505,2.0169109888374805,#cat
10606,-1.6573240086436272,5.77875205129385,-1.9465200901031492,0.1602295590564608,-6.251160055398941,-4.399850085377693,6.2419000416994095,-5.133064024150372,-2.443369960412383,5.6650701850885525,...,-4.089522950351238,-0.7252710424363613,-3.160119991749525,0.5469664859992918,-0.5970050133764744,1.2338529713451862,0.9109479617327452,7.015899032354355,8.603660078719258,#picoftheday
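A quick way to check that a stratified split keeps the class proportions is to count labels on both sides. The sketch below uses a toy label array (80 of one tag, 20 of another) rather than the mined data, with the same `test_size` and `random_state` as the cell above.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical imbalanced labels: 80 of one tag, 20 of another
y_toy = np.array(["#love"] * 80 + ["#girl"] * 20)
X_toy = np.zeros((len(y_toy), 1))  # features are irrelevant to the split

toy_splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=13)
train_idx, test_idx = next(toy_splitter.split(X_toy, y_toy))

train_counts = Counter(y_toy[train_idx].tolist())
test_counts = Counter(y_toy[test_idx].tolist())
print(train_counts, test_counts)
```

Both sides keep the 4:1 ratio of the full label set, which is what lets a rare tag such as "#girl" appear in the test data at all.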


### Training

Now we will train on a scaled version of the data (`X_train_scaled`) using the various multi-class classifiers supported by scikit-learn. Training with OneVsOne classifiers was skipped because of the large dimensionality of the training data, and certain classifiers such as `GaussianProcessClassifier()` were skipped because the large sample size made training extremely inefficient given the training machine's RAM.

Here is a rundown of their scores, calculated with 3-fold cross-validation.


#### Inherently Multiclass

Table 1

|Classifier|Precision|Recall|F1-Score|
|:-|-|-|-|
|`BernoulliNB()`|0.17871032497557543|0.16760843487593913|0.15193282169569067|
|`DecisionTreeClassifier()`|0.32884656073391016|0.3284140412904855|0.32775179443622027|
|`ExtraTreeClassifier()`|0.3170519558192891|0.31613422564555843|0.31569708377813743|
|`ExtraTreesClassifier()`|0.45201342059650934|0.44396742218574403|0.4408797517858572|
|`GaussianNB()`|0.19586086737595293|0.14434307721447062|0.13305816692082342|
|`KNeighborsClassifier()`|0.4094528189643595|0.38569354125891786|0.38707707881014963|
|`LinearDiscriminantAnalysis()`|0.29787034359887066|0.3069006881747585|0.2926308049269754|
|`LinearSVC(multi_class="crammer_singer")`|0.09890950041654863|0.10818233474335501|0.0828675811060722|
|`LogisticRegression(multi_class="multinomial")`|0.31802802668700164|0.3395889892038639|0.3222431373823199|
|`NearestCentroid()`|0.19837361321709462|0.16797146284487657|0.15873458662002415|
|`QuadraticDiscriminantAnalysis()`|0.3423328288077747|0.32120083338594607|0.3188687539502235|
|`RandomForestClassifier()`|0.4456224929103627|0.4420891470421112|0.4375279167445373|
|`RidgeClassifier()`|0.2940304427404206|0.288370477934213|0.2404565802969223|

#### OneVsRest

Table 2

| Classifier | Recall| Accuracy |F1-Score|
|:---------- | --------| ---------|-----------|
|`SGDClassifier()`|0.21791900968005287|0.2065155628511901|0.2050632340660742|
|`Perceptron()`|0.17013810779919047|0.1521245028095208|0.15134008501248902|
|`PassiveAggressiveClassifier()`|0.1793218678089453|0.166140539175453|0.1617584027624504|
|`GradientBoostingClassifier()`|0.36644844360010204|0.3708093945324831|0.36606015391948754|
|`LinearSVC()`|0.27899933452956677|0.31411389607929796|0.26547331622756415|
|`LogisticRegression(multi_class="ovr")`|0.3068813252069861|0.33109729149567524|0.30878692718362605|

It should also be noted that both `LogisticRegression` and `LinearSVC` failed to converge, and `QuadraticDiscriminantAnalysis` raised a runtime warning that variables were collinear.
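The notebook does not show the code that produced the tables. One plausible way to obtain such scores, assuming weighted averaging as in the later cells, is to score out-of-fold predictions from `cross_val_predict`. The synthetic data from `make_classification` below is only a stand-in for `X_train_scaled` and `y_train`.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for (X_train_scaled, y_train)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=6,
    n_classes=3, random_state=0,
)

# Out-of-fold predictions from 3-fold cross-validation
pred = cross_val_predict(KNeighborsClassifier(), X_demo, y_demo, cv=3)

scores = {
    "precision": precision_score(y_demo, pred, average="weighted", zero_division=0),
    "recall": recall_score(y_demo, pred, average="weighted"),
    "f1": f1_score(y_demo, pred, average="weighted"),
}
print(scores)
```

Every prediction here comes from a model that never saw that sample during fitting, so the scores are a fairer estimate than scoring on the training data itself.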

At a glance, the ensemble, decision-tree, and k-nearest-neighbors classifiers are doing relatively well. We will go deeper into training with these algorithms.

#### ExtraTreesClassifier

The recalls and precisions of the classifiers, when trained on `X_train_scaled` and then evaluated on that same training data (predictions compared against `y_train`), were around 88%. That is roughly double the cross-validated scores in Table 1, which suggests the classifiers are overfitting. Changing the criterion from "gini" to "entropy" did not help either.

The overfitted classifier had tree depths between 35 and 49 and roughly 34000 leaves per tree (see the output below). To regularize it, a grid search was carried out to find the best hyperparameters for `ExtraTreesClassifier`, capping the maximum tree depth at 30 and the maximum number of leaf nodes at 30000.

In [97]:
from sklearn.ensemble import ExtraTreesClassifier

etreesclf = ExtraTreesClassifier()

etreesclf.fit(X_train_scaled,y_train)

print([estimator.get_depth() for estimator in etreesclf.estimators_])

print([estimator.get_n_leaves() for estimator in etreesclf.estimators_])

[42, 41, 38, 39, 41, 42, 39, 42, 39, 43, 39, 38, 45, 40, 39, 38, 40, 35, 39, 37, 41, 43, 38, 40, 40, 43, 44, 39, 39, 46, 42, 47, 41, 41, 38, 41, 41, 38, 40, 39, 39, 44, 43, 40, 40, 42, 36, 45, 42, 40, 38, 44, 39, 44, 39, 42, 46, 38, 41, 42, 43, 45, 45, 38, 40, 41, 40, 39, 41, 44, 44, 40, 44, 40, 39, 43, 40, 40, 39, 45, 43, 41, 47, 45, 42, 38, 41, 46, 39, 39, 41, 39, 41, 38, 48, 40, 44, 49, 47, 42]
[33964, 34059, 33952, 34114, 34115, 34113, 34217, 34075, 34049, 34057, 34137, 34118, 33950, 33977, 33957, 34078, 34106, 34020, 34100, 34166, 33965, 33989, 34045, 34102, 33960, 34038, 33981, 34092, 33946, 34112, 33971, 34087, 34066, 34027, 34077, 34020, 33999, 34020, 34029, 34004, 34018, 33992, 33895, 34034, 33988, 33882, 33961, 34107, 33963, 34045, 33931, 34069, 34051, 34004, 34076, 34000, 33991, 34010, 34113, 34104, 33949, 33974, 34024, 33997, 34150, 34024, 33944, 33963, 34025, 34016, 34082, 34093, 34034, 33977, 34089, 33950, 34076, 34069, 34084, 34162, 34037, 34058, 33971, 33931, 33923, 339

In [117]:
from sklearn.model_selection import GridSearchCV

param_grid = [{"max_depth":[None,10,20,30],
               "min_samples_split":[2,10,20,30,40],
               "min_samples_leaf":[1,20,40,60],
               "max_leaf_nodes":[None,100,10000,20000,30000]
              }]

grid_search = GridSearchCV(etreesclf, param_grid, cv=3, scoring='f1_weighted')

grid_search.fit(X_train_scaled, y_train)

grid_search.best_params_

{'max_depth': 30,
 'max_leaf_nodes': 20000,
 'min_samples_leaf': 1,
 'min_samples_split': 10}

In [119]:
grid_search.best_score_

0.4417280526270308

This looks like `ExtraTreesClassifier`'s best performance: the tuned score is essentially the same as the F1 score in Table 1 (0.4409). We will repeat the process for the next most successful classifier, `RandomForestClassifier`.

This classifier was also overfitting, with an F1 score of 0.8712136016947981 when its predictions on `X_train_scaled` were compared against `y_train`.
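The gap between a score computed on the training data and a cross-validated score is what signals overfitting here. A minimal sketch of that comparison on synthetic data (the dataset and the `n_estimators=50` setting are illustrative assumptions, not the notebook's configuration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_predict

X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=6,
    n_classes=3, random_state=0,
)

clf = RandomForestClassifier(n_estimators=50, random_state=0)

# Score on the same data the model was fitted on (optimistic)
clf.fit(X_demo, y_demo)
train_f1 = f1_score(y_demo, clf.predict(X_demo), average="weighted")

# Score on out-of-fold predictions (closer to generalization)
cv_pred = cross_val_predict(clf, X_demo, y_demo, cv=3)
cv_f1 = f1_score(y_demo, cv_pred, average="weighted")

print(f"train F1 = {train_f1:.3f}, cross-validated F1 = {cv_f1:.3f}")
```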

In [128]:
from sklearn.ensemble import RandomForestClassifier

randforclf = RandomForestClassifier(n_jobs=-1)

randforclf.fit(X_train_scaled,y_train)

print([estimator.get_depth() for estimator in randforclf.estimators_])

print([estimator.get_n_leaves() for estimator in randforclf.estimators_])

[74, 61, 57, 58, 61, 57, 61, 65, 61, 56, 72, 57, 67, 59, 54, 59, 66, 60, 63, 62, 57, 58, 57, 65, 63, 61, 53, 61, 59, 52, 61, 58, 55, 57, 59, 63, 59, 53, 56, 69, 61, 56, 58, 63, 63, 64, 74, 51, 60, 62, 57, 56, 57, 58, 59, 58, 62, 63, 65, 63, 66, 58, 66, 67, 59, 60, 70, 66, 68, 62, 63, 60, 53, 64, 58, 61, 61, 59, 62, 61, 59, 65, 65, 53, 61, 83, 67, 55, 53, 62, 64, 62, 56, 58, 58, 69, 61, 57, 60, 62]
[20594, 20658, 20534, 20599, 20627, 20483, 20733, 20601, 20646, 20605, 20704, 20625, 20391, 20614, 20636, 20650, 20701, 20647, 20536, 20649, 20565, 20553, 20666, 20579, 20593, 20694, 20561, 20633, 20571, 20561, 20537, 20668, 20658, 20495, 20558, 20438, 20545, 20616, 20671, 20656, 20457, 20588, 20588, 20583, 20610, 20604, 20595, 20596, 20526, 20518, 20460, 20734, 20452, 20623, 20548, 20608, 20433, 20536, 20516, 20613, 20484, 20602, 20465, 20571, 20569, 20769, 20511, 20648, 20671, 20588, 20619, 20469, 20577, 20531, 20509, 20552, 20426, 20656, 20545, 20452, 20591, 20597, 20642, 20562, 20472, 205

In [129]:
param_grid_rf = [{"max_depth":[None,20,30,40,50],
               "min_samples_split":[2,10,20,30,40],
               "min_samples_leaf":[1,20,40,60],
               "max_leaf_nodes":[None,100,5000,10000,15000,20000]
              }]

grid_search_rf = GridSearchCV(randforclf, param_grid_rf, cv=3, scoring='f1_weighted',verbose=3,n_jobs=-1)

grid_search_rf.fit(X_train_scaled, y_train)

print(grid_search_rf.best_params_)
print(grid_search_rf.best_score_)

Fitting 3 folds for each of 600 candidates, totalling 1800 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  24 tasks      | elapsed:  8.5min
[Parallel(n_jobs=-1)]: Done 120 tasks      | elapsed: 28.3min
[Parallel(n_jobs=-1)]: Done 280 tasks      | elapsed: 72.6min
[Parallel(n_jobs=-1)]: Done 504 tasks      | elapsed: 195.3min
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed: 383.2min
[Parallel(n_jobs=-1)]: Done 1144 tasks      | elapsed: 551.0min
[Parallel(n_jobs=-1)]: Done 1560 tasks      | elapsed: 677.0min
[Parallel(n_jobs=-1)]: Done 1800 out of 1800 | elapsed: 757.6min finished


{'max_depth': 50, 'max_leaf_nodes': 10000, 'min_samples_leaf': 1, 'min_samples_split': 2}
0.43954581494967454


It looks like tuning the hyperparameters does not stop the classifier from overfitting at all. However, the results in Tables 1 and 2 do show that ensemble methods have been the most effective classification algorithms. Here, we will try to reduce overfitting by combining more varied predictors with `VotingClassifier`.

In [7]:
from sklearn.ensemble import VotingClassifier
from sklearn.tree import ExtraTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.linear_model import RidgeClassifier

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import accuracy_score,recall_score,f1_score
from sklearn.model_selection import cross_val_predict

logregclf_mnml = LogisticRegression(multi_class="multinomial")
logregclf_ovr = LogisticRegression(multi_class="ovr")
extratree_clf = ExtraTreeClassifier()
randforrest_clf = RandomForestClassifier(
    max_depth=50,
    max_leaf_nodes=10000,
    min_samples_leaf=1, 
    min_samples_split= 2
)
extratrees_clf = ExtraTreesClassifier(
    max_depth=30,
    max_leaf_nodes=20000,
    min_samples_leaf=1,
    min_samples_split=10
)

kneighbors_clf = KNeighborsClassifier()
qddicranalysis_clf = QuadraticDiscriminantAnalysis()
ridge_clf = RidgeClassifier()


voting_classifer_hard = VotingClassifier(
    estimators=[
        ("lr_mnml",logregclf_mnml),
        ("lr_ovr",logregclf_ovr),
        ("tree",extratree_clf),
        ("trees",extratrees_clf),
        ("forrest",randforrest_clf),
        ("knn",kneighbors_clf),
        ("quad",qddicranalysis_clf),
        ("ridge",ridge_clf)
    ],
    voting="hard",
    n_jobs=-1
)
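# RidgeClassifier is left out here: soft voting needs predict_proba,
# which RidgeClassifier does not provide
voting_classifer_soft = VotingClassifier(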
    estimators=[
        ("lr_mnml",logregclf_mnml),
        ("lr_ovr",logregclf_ovr),
        ("tree",extratree_clf),
        ("trees",extratrees_clf),
        ("forrest",randforrest_clf),
        ("knn",kneighbors_clf),
        ("quad",qddicranalysis_clf),
    ],
    voting="soft",
    n_jobs=-1
)

voting_pred_hard = cross_val_predict(voting_classifer_hard,X_train_scaled,y_train,cv=3)
voting_pred_soft = cross_val_predict(voting_classifer_soft,X_train_scaled,y_train,cv=3)
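The two voting modes can disagree on the same inputs. The sketch below works through one made-up tweet and three hypothetical classifiers by hand with NumPy; it illustrates the idea and is not scikit-learn's internal code.

```python
import numpy as np

classes = np.array(["#cat", "#dog", "#sun"])

# Hypothetical class probabilities from three classifiers for ONE tweet
probas = np.array([
    [0.40, 0.35, 0.25],   # classifier A favours #cat
    [0.45, 0.40, 0.15],   # classifier B favours #cat
    [0.05, 0.90, 0.05],   # classifier C favours #dog, very confidently
])

# Hard voting: each classifier casts one vote for its top class
votes = probas.argmax(axis=1)
hard_winner = classes[np.bincount(votes, minlength=len(classes)).argmax()]

# Soft voting: average the probabilities, then take the top class
soft_winner = classes[probas.mean(axis=0).argmax()]

print(hard_winner, soft_winner)  # #cat #dog
```

Hard voting only counts top choices, so two lukewarm votes beat one confident vote. Soft voting lets the confident classifier dominate, which helps only when the probabilities are well calibrated.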

In [11]:
print("Hard Voting:\n")
print("    Accuracy: " + str(accuracy_score(y_train,voting_pred_hard)))
print("    Recall: " + str(recall_score(y_train,voting_pred_hard,average="weighted")))
print("    F1-score: " + str(f1_score(y_train,voting_pred_hard,average="weighted")))
print("\nSoft Voting:\n")
print("    Accuracy: " + str(accuracy_score(y_train,voting_pred_soft)))
print("    Recall: " + str(recall_score(y_train,voting_pred_soft,average="weighted")))
print("    F1-score: " + str(f1_score(y_train,voting_pred_soft,average="weighted")))

Hard Voting:

    Accuracy: 0.4642496369720311
    Recall: 0.4642496369720311
    F1-score: 0.45077807939330994

Soft Voting:

    Accuracy: 0.42748910916093186
    Recall: 0.42748910916093186
    F1-score: 0.42268612744329837


It looks like it is going to be extremely difficult to reach an F1 score above 0.5. Let's try one last time with boosting ensemble methods before moving on to testing our classifier on the held-out test set and creating a different dataset. We have already tested `GradientBoostingClassifier` (scores are displayed in Table 2), so let's try AdaBoost instead.

In [21]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=20000), n_estimators=200,
    algorithm="SAMME.R", learning_rate=0.5)

ada_boost_pred = cross_val_predict(ada_clf,X_train_scaled, y_train,cv=3)

print("Accuracy: " + str(accuracy_score(y_train,ada_boost_pred)))
print("Recall: " + str(recall_score(y_train,ada_boost_pred,average="weighted")))
print("F1-score: " + str(f1_score(y_train,ada_boost_pred,average="weighted")))

Accuracy: 0.44721889008144455
Recall: 0.44721889008144455
F1-score: 0.4494272009117301


Achieving an F1 score higher than 0.5 looks extremely difficult with this dataset. One possible reason for such a low score is that the hashtags seem to be correlated in some way. For example, we can imagine #happy, #cute, and #picoftheday in the same tweet.

Our miner should have either:
1. Removed all hashtags other than the tag we are querying with.
2. Created a multi-label dataset.
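Both options can be sketched in a few lines. The helper names, the regular expression, and the sample tweet below are hypothetical illustrations and are not taken from the project's miner.

```python
import re

HASHTAG = re.compile(r"#\w+")

def strip_other_hashtags(text, queried_tag):
    """Option 1: remove every hashtag except the one the tweet was mined for."""
    def keep_or_drop(match):
        return match.group(0) if match.group(0).lower() == queried_tag else ""
    cleaned = HASHTAG.sub(keep_or_drop, text)
    return " ".join(cleaned.split())  # collapse leftover whitespace

def multilabel_row(text, vocabulary):
    """Option 2: binary indicator vector, 1 where the tweet contains that hashtag."""
    present = {tag.lower() for tag in HASHTAG.findall(text)}
    return [1 if tag in present else 0 for tag in vocabulary]

tweet = "Best day ever #happy #cute #picoftheday"
vocabulary = ["#cute", "#happy", "#picoftheday", "#sunset"]

print(strip_other_hashtags(tweet, "#happy"))  # Best day ever #happy
print(multilabel_row(tweet, vocabulary))      # [1, 1, 1, 0]
```

The first option keeps the problem single-label but discards co-occurrence information. The second keeps that information and would need a classifier that supports multi-label targets.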

Another possible reason is that most of the hashtags seem to be correlated with one another in the sense that they are all "semantically positive." For the next dataset, the miner should also include hashtags that are "semantically negative."

Let's do a final test on the test set using the classifier with the highest scores: the hard voting classifier, comprised of:

`LogisticRegression(multi_class="multinomial")`

`LogisticRegression(multi_class="ovr")`

`ExtraTreeClassifier()`

`RandomForestClassifier(max_depth=50,max_leaf_nodes=10000,min_samples_leaf=1, min_samples_split= 2)`

`ExtraTreesClassifier(max_depth=30,max_leaf_nodes=20000,min_samples_leaf=1,min_samples_split=10)`

`KNeighborsClassifier()`

`QuadraticDiscriminantAnalysis()`

`RidgeClassifier()`


In [24]:
voting_classifer_hard.fit(X_train_scaled,y_train)

VotingClassifier(estimators=[('lr_mnml',
                              LogisticRegression(C=1.0, class_weight=None,
                                                 dual=False, fit_intercept=True,
                                                 intercept_scaling=1,
                                                 l1_ratio=None, max_iter=100,
                                                 multi_class='multinomial',
                                                 n_jobs=None, penalty='l2',
                                                 random_state=None,
                                                 solver='lbfgs', tol=0.0001,
                                                 verbose=0, warm_start=False)),
                             ('lr_ovr',
                              LogisticRegression(C=1.0, class_weight=None,
                                                 dual=False, fi...
                                                   n_jobs=None, n_neighbors=5,
                 

In [28]:
X_test = stratified_test[column_labels[:-1]]
y_test = stratified_test["tag"]

X_test_scaled = scaler.transform(X_test)

vc_pred = voting_classifer_hard.predict(X_test_scaled)

print("    Accuracy: " + str(accuracy_score(y_test,vc_pred)))
print("    Recall: " + str(recall_score(y_test,vc_pred,average="weighted")))
print("    F1-score: " + str(f1_score(y_test,vc_pred,average="weighted")))

    Accuracy: 0.4836163899236063
    Recall: 0.4836163899236063
    F1-score: 0.4711353166334938


In [39]:
import pickle

classifier_path = "classifiers/"
final_clf = voting_classifer_hard

# Persist the fitted hard-voting classifier to disk
with open(classifier_path + "v1_classifier.pickle", "wb") as output_file:
    pickle.dump(final_clf, output_file)
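Loading the classifier back is the mirror image of the dump. The sketch below round-trips a tiny stand-in model through a temporary file; the file name and the toy training data are hypothetical, and how `tweetrecommender.py` actually loads the model is not shown in this notebook.

```python
import os
import pickle
import tempfile

from sklearn.neighbors import KNeighborsClassifier

# Tiny stand-in model; in the notebook this would be final_clf
model = KNeighborsClassifier(n_neighbors=1)
model.fit([[0.0], [1.0], [10.0], [11.0]], ["#cat", "#cat", "#dog", "#dog"])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_classifier.pickle")  # hypothetical file name
    with open(path, "wb") as output_file:
        pickle.dump(model, output_file)
    with open(path, "rb") as input_file:
        restored = pickle.load(input_file)

prediction = restored.predict([[0.2], [10.5]])
print(prediction)
```

The classifier was trained on scaled features, and the cell above pickles only the classifier. Unless the fitted `scaler` is saved elsewhere, whatever loads the model needs an equivalent scaler before predicting.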

You can now predict hashtags for new tweets by running the following command at the project root.

`python tweetrecommender.py "Tweet text"`