TESLA: TwittEr Spam LeArning

CSCE 670 Spring 2018, Course Project

Intro

Welcome to our code base, where the magic happens! We'd love to introduce our web application, which can detect whether a user is a spammer in real time. For technical details, please keep reading or refer to the about section. Our app is live on Heroku; check it out!

Installation guide (local server mode)

Our application is built entirely on Python 3. Sorry, 2.7 - it's time for an upgrade! (TensorFlow doesn't work with 2.7 anyway. SAD.)

  • For Windows users who are using Anaconda, please make sure that you have Python 3.5 environment installed. You can check out this awesome post here for Python version control with Conda.
  • For Mac users: as long as your Python version >= 3.5, you should be fine.

To play around with our code, please fork the repository first, then clone it to your local machine. Install the required libraries with the commands below:

cd src 
pip install -Ur requirements.txt

(Again, for Windows users - if you have an issue with cache files, try

pip install -Ur requirements.txt --no-cache-dir

instead.)

Once everything is installed, stay in the src folder and run:

python manage.py runserver

Note: you might encounter an ImportError claiming that there is no module named 'Tesla.aws'. That is because our team uses AWS as part of the storage backend. Simply comment out the following line in the src/Tesla/settings/base.py file:

# Comment this line please
from Tesla.aws.conf import *

After that, you should be able to access our application in local server mode.

Switching between different text models

We have trained two models for the tweet feature: TF-IDF and CNN. The latter is too large to load on a Heroku free dyno, so our online app runs the TF-IDF model by default. When you download our repository and run it in local server mode, it defaults to the CNN model. If you'd like to switch models, change the following lines in src/predictions/models.py:

# use cnn text model:
#basic_info = get_cnn_predict(screen_name, basic_info)

# use tfidf text model:
basic_info = get_tfidf_predict(screen_name, basic_info)

Uncomment the one you'd like to use and comment out the other. (Please do not leave both lines uncommented, or both commented - we make no guarantees about how the application will behave. DANGER ZONE!)

Something technical

As promised above - here is something technical.

Our website was built on Django 1.9 with Python 3.6. It accepts screen names (a.k.a. handles, e.g. @realDonaldTrump) as input and communicates with the Twitter API to gather basic account and tweet information.

Our framework consists of two parts: account and tweet models.

  • The account model focuses on user-based features, such as the favorite count, account age, etc.
  • The tweet model, on the other hand, uses text from the Tweet object exclusively. We've trained two text-based models, namely TF-IDF and CNN. Each text-based model accepts the latest 5 tweets as input. (If a user has tweeted fewer than 5 times, the text model predicts based on however many tweets are available.)
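
The input rule for the tweet model above can be sketched as follows. The function name and list-based representation are hypothetical, for illustration only - the real feature extraction in src/predictions/models.py is more involved:

```python
def tweet_model_input(tweets, max_tweets=5):
    """Select the tweets fed to a text model (TF-IDF or CNN).

    `tweets` is assumed to be ordered newest-first. We take up to the
    5 most recent tweets; if the user has tweeted fewer than 5 times,
    the model simply predicts on whatever is available.
    """
    return tweets[:max_tweets]
```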

Our models were pre-trained offline for online prediction. Once the results are ready from both models, we use a simple weighted function to aggregate them into a final "spam" score. If the score is greater than 50%, we believe the user is a spammer.
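
A minimal sketch of that scoring step, assuming each model outputs a probability in [0, 1]. The function names and the 50/50 weighting are assumptions for illustration; the actual weights live in the prediction code:

```python
def aggregate_spam_score(account_score, text_score, account_weight=0.5):
    """Weighted combination of the account-model and tweet-model scores.

    Both inputs are probabilities in [0, 1]; the weight shown here is
    a hypothetical default, not the tuned value used in production.
    """
    return account_weight * account_score + (1 - account_weight) * text_score

def is_spammer(account_score, text_score, threshold=0.5):
    """Flag a user as a spammer when the combined score exceeds 50%."""
    return aggregate_spam_score(account_score, text_score) > threshold
```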

If you'd like to know more details on how we actually built our models, check out the classifiers that we've tried and data visualization plots.

If you've encountered any issues...

Please, please, please DO NOT HESITATE to let us know! Open an issue or send us a pull request and we'll get back to you ASAP. Bug reports are welcome!

Special shout-out!

  • Thanks Yue for building the website and merging everything together.
  • Thanks Rose for training the account model and providing guidance on how to run traditional classifiers in general.
  • Thanks Weitong for his specialization in front-end design and editing work.
  • Thanks Bowen for training the text model and fiddling with deep learning.

And finally, thanks Cav and Parisa. You are simply too nice to read this all the way through, so we think you deserve some credit here.
