Hashtag Popularity Prediction

Predicting the popularity of a given hashtag using standard regression techniques such as Linear Regression, Random Forest and Non-linear SVM.

Done as part of a remote summer project at Northwestern University. The project certificate can be found here.

Dependencies

numpy
tqdm
tweepy
pandas
scikit-learn

Data Scraping

NOTE: You need a Twitter Developer Account in order to scrape data. If you don't have access, please apply here.

The data for #DarkNetflix was scraped from 27/06/2020, up to a maximum of 20000 entries. The scraped data is in JSON format. We specifically extract the following entities from it:

Number of retweets
Number of mentions
Number of hashtags
Number of urls
Number of followers
Number of favourites
Time at which the post was created (which will be converted to a time stamp)
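The extraction step above can be sketched as follows. This is a minimal illustration, not the code in `utils.py`: the function name `extract_features` is hypothetical, and the field names assume the Twitter API v1.1 tweet JSON layout. It also assumes that "favourites" means the tweet's own `favorite_count`, which the description above does not confirm.

```python
from datetime import datetime

# Timestamp layout used by Twitter API v1.1, e.g. "Sat Jun 27 10:15:30 +0000 2020".
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def extract_features(tweet: dict) -> dict:
    """Pull the seven entities listed above out of one tweet dictionary.

    Hypothetical helper: field names follow the v1.1 tweet object and
    missing fields fall back to zero.
    """
    entities = tweet.get("entities", {})
    created = datetime.strptime(tweet["created_at"], TWITTER_TIME_FORMAT)
    return {
        "retweets": tweet.get("retweet_count", 0),
        "mentions": len(entities.get("user_mentions", [])),
        "hashtags": len(entities.get("hashtags", [])),
        "urls": len(entities.get("urls", [])),
        "followers": tweet.get("user", {}).get("followers_count", 0),
        # Assumption: favourites of the tweet itself, not of the author.
        "favourites": tweet.get("favorite_count", 0),
        # Creation time converted to a Unix timestamp (seconds).
        "timestamp": created.timestamp(),
    }


# Made-up tweet used only to exercise the helper.
sample = {
    "created_at": "Sat Jun 27 10:15:30 +0000 2020",
    "retweet_count": 4,
    "favorite_count": 9,
    "entities": {
        "user_mentions": [{"screen_name": "someone"}],
        "hashtags": [{"text": "DarkNetflix"}, {"text": "Dark"}],
        "urls": [],
    },
    "user": {"followers_count": 120},
}

features = extract_features(sample)
print(features)
```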

Once the data is scraped, we batch entities 1 to 6 into one-minute windows. Each entry in the final dataset therefore corresponds to one minute.

The popularity of the hashtag is measured by the number of tweets containing it in a given minute.
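One way to picture the one-minute batching and the popularity target is with pandas. This is a sketch under assumptions: the column names are made up, only three of the six features are shown for brevity, and summing within each window is an assumed aggregation, since the text does not say how the entities are combined.

```python
import pandas as pd

# Made-up tweets; column names are illustrative, not taken from the project.
tweets = pd.DataFrame(
    {
        "created_at": pd.to_datetime(
            [
                "2020-06-27 10:15:05",
                "2020-06-27 10:15:40",
                "2020-06-27 10:16:10",
            ],
            utc=True,
        ),
        "retweets": [3, 1, 0],
        "mentions": [0, 2, 1],
        "followers": [100, 250, 40],
    }
)

indexed = tweets.set_index("created_at")

# Assumption: features are summed within each one-minute window.
per_minute = indexed.resample("1min").sum()

# Popularity target: number of tweets that fall in each minute.
per_minute["tweet_count"] = indexed.resample("1min").size()

print(per_minute)
```

Here the two tweets posted during 10:15 collapse into one row with `tweet_count` equal to 2, and the single tweet at 10:16 forms the second row.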

Usage

    python main.py

Results

Random Forest and Non-linear SVM have the lowest cross-validation error.
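A comparison of this kind can be sketched with scikit-learn as below. The data is synthetic and the hyperparameters, fold count and error metric (mean squared error) are assumptions, so the printed numbers say nothing about the project's actual results.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

# Synthetic stand-in for the per-minute dataset: six features, one target.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 6))
y = 5 * np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(0, 0.5, size=200)

# Assumed default-style settings; the project's settings are not stated.
models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "Non-linear SVM": SVR(kernel="rbf"),
}

cv_errors = {}
for name, model in models.items():
    # scikit-learn reports negated MSE, so flip the sign to get an error.
    scores = cross_val_score(
        model, X, y, cv=5, scoring="neg_mean_squared_error"
    )
    cv_errors[name] = -scores.mean()
    print(f"{name}: mean CV MSE = {cv_errors[name]:.3f}")
```

Because the rows are ordered in time, a time-aware splitter such as scikit-learn's `TimeSeriesSplit` may be worth considering in place of plain 5-fold splitting. Whether the project does this is not stated.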

On the test set, the model's predictions track the true popularity closely. However, training and testing on the same hashtag may bias the model, so we also check whether the model captures the trend for other hashtags.

We train the models on #PostponeNEETandJEE, which trended in India during mid-June 2020. As the first plot shows, the model again predicts values reasonably close to the true ones. The more interesting result is the second plot, where the model trained on #PostponeNEETandJEE is tested on #DarkNetflix. As expected, performance drops significantly compared with the model trained on #DarkNetflix. Even though the predicted and actual values can differ widely, the model still predicts the peaks and valleys with a small delay, so it remains consistent with the expected trend of the hashtag.
