TESLA: TwittEr Spam LeArning
CSCE 670 Spring 2018, Course Project
Welcome to our code base, where the magic happens! We'd love to introduce our web application, which detects in real time whether a Twitter user is a spammer. For technical details, please keep on reading or refer to the about section. Our app is live on Heroku, check it out!
Installation guide (local server mode)
Our application is built entirely on Python 3. Sorry 2.7 - it's time for an upgrade! (TensorFlow doesn't work with 2.7 on Windows anyway. SAD.)
- For Windows users who are using Anaconda, please make sure that you have a Python 3.5 environment installed. You can check out this awesome post here for Python version control with Conda.
- For Mac users: as long as your Python version is >= 3.5, you should be fine.
To play around with our code, please fork the repository first, then clone it to your local machine. Install the required libraries by typing the commands below:
cd src
pip install -Ur requirements.txt
Again, for Windows users: if you have an issue with cache files, try:
pip install -Ur requirements.txt --no-cache-dir
Once everything is installed, stay in the src folder and type:
python manage.py runserver
Note: you might encounter an ImportError claiming that there's no module called 'Tesla.aws'. That is because our team uses AWS as part of the storage. Simply comment out the following line in the \src\Tesla\settings\base.py file:
# Comment this line please
from Tesla.aws.conf import *
You should now be able to access our application in local mode.
Switching between different text models
We have trained two models for the tweet feature: TF-IDF and CNN. The latter is too big to load on a Heroku free dyno, so our online app runs the TF-IDF model by default. When you download our repository and run it in local server mode, it defaults to CNN. If you'd like to switch to the other model, change the following lines in \src\predictions\models.py:
# use cnn text model:
#basic_info = get_cnn_predict(screen_name, basic_info)
# use tfidf text model:
basic_info = get_tfidf_predict(screen_name, basic_info)
Uncomment the one that you'd like to use and leave the other commented out. (Please do not leave both lines uncommented or both commented. We make no guarantees about how the application will behave. DANGER ZONE!)
As promised above - here is something technical.
Our website is built on Django 1.9 with Python 3.6. It accepts a screen name (aka handle, e.g. @realDonaldTrump) as input and communicates with the Twitter API to gather basic account and tweet information.
Our framework consists of two parts: account and tweet models.
- The account model focuses on User-based features, such as the count of favorite tweets, the account age, etc.
- The tweet model, on the other hand, uses text from the Tweet object exclusively. We've trained two text-based models, namely TF-IDF and CNN. Each text-based model accepts the latest 5 tweets as input. (If a user has fewer than 5 tweets, the text model predicts from however many tweets are actually available.)
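The "latest 5 tweets, or fewer if that is all there is" rule can be sketched as follows. This is a minimal illustration, not the project's code: the `latest_tweets` helper and the dictionary shape (a `created_at` datetime plus a `text` string) are assumptions made for the example.

```python
from datetime import datetime

MAX_TWEETS = 5  # the text models read at most the latest 5 tweets


def latest_tweets(tweets, limit=MAX_TWEETS):
    """Return the text of up to `limit` most recent tweets, newest first.

    `tweets` is assumed to be a list of dicts with a `created_at` datetime
    and a `text` string (a hypothetical shape chosen for this sketch).
    """
    ordered = sorted(tweets, key=lambda t: t["created_at"], reverse=True)
    # Slicing never raises when fewer than `limit` tweets exist,
    # so short timelines simply yield a shorter list.
    return [t["text"] for t in ordered[:limit]]


sample = [
    {"created_at": datetime(2018, 4, 1), "text": "first"},
    {"created_at": datetime(2018, 4, 3), "text": "third"},
    {"created_at": datetime(2018, 4, 2), "text": "second"},
]
print(latest_tweets(sample))
```

With only three tweets available, the helper returns all three, which is the input the text model would then predict from.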
Our models were pre-trained offline for online prediction. Once the results are ready for both models, we use a simple weighted function to aggregate them together and get a final "spam" score. If the score is greater than 50%, then we believe that this user is a spammer.
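To make the aggregation step concrete, here is a minimal sketch of a weighted combination followed by the 50% cut-off. The function names and the default weight of 0.5 are placeholders for illustration; the actual weights used by the app are not stated here.

```python
SPAM_THRESHOLD = 0.5  # from the text: a score greater than 50% means spammer


def spam_score(account_prob, tweet_prob, account_weight=0.5):
    """Blend two spam probabilities (each in [0, 1]) into one score.

    `account_weight` is a placeholder value; the tweet model receives
    the remaining weight so the two always sum to 1.
    """
    if not 0.0 <= account_weight <= 1.0:
        raise ValueError("account_weight must be between 0 and 1")
    return account_weight * account_prob + (1.0 - account_weight) * tweet_prob


def is_spammer(account_prob, tweet_prob, account_weight=0.5):
    """Apply the 50% threshold to the blended score."""
    return spam_score(account_prob, tweet_prob, account_weight) > SPAM_THRESHOLD


print(is_spammer(0.8, 0.4))  # blended score is about 0.6
print(is_spammer(0.2, 0.6))  # blended score is about 0.4
```

Because the comparison is strictly "greater than", a blended score of exactly 50% is not flagged in this sketch.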
If you've encountered any issues...
Please, please, please DO NOT HESITATE to let us know! Send us a pull request and we'll get back to you ASAP. Bug reports welcome!
- Thanks Yue for building the website and merging everything together.
- Thanks Rose for training the account model and providing guidance on how to run traditional classifiers in general.
- Thanks Weitong for his specialization in front-end design and editing work.
- Thanks Bowen for training the text model and fiddling with deep learning.
And finally, thanks Cav and Parisa. You are simply too nice to read this through, so we think you deserve some credit here.