Go to our GitHub project page to download the necessary files.
The competition task was to predict whether a tweet originally contained a positive :) or a negative :( smiley, using only the remaining text. Our team conducted comprehensive research on solutions proposed in the relevant literature, as well as past projects and articles tackling similar text sentiment analysis problems. The full specification of our experiments, together with the results and conclusions drawn, can be found in our report.
The complete project specification is available on the course's GitHub page.
The following dependencies are required to run the project:

- **Anaconda3** - Download and install Anaconda with Python 3.
- **Scikit-learn** - Install the scikit-learn library with conda:
  ```
  conda install scikit-learn
  ```
- **Gensim** - Install the Gensim library:
  ```
  conda install gensim
  ```
- **NLTK** - Download all the NLTK packages:
  ```
  $ python
  >>> import nltk
  >>> nltk.download()
  ```
- **TensorFlow** - Install the TensorFlow library (version used: 1.4.1):
  ```
  $ pip install tensorflow
  ```
- **Train tweets** - Download `twitter-datasets.zip`, which contains the positive and negative tweet files required during the model training phase. After unzipping, place the obtained files in the `./data/datasets` directory.
- **Test tweets** - Download `test_data.txt`, which contains the tweets required for testing the trained model and obtaining a score for submission to Kaggle. This file needs to be placed in the `./data/datasets` directory.
- **Stanford pretrained GloVe word embeddings** - Download the GloVe pretrained word embeddings, which are used for training the advanced sentiment analysis models. After unzipping, place the file `glove.twitter.27B.200d.txt` in the `./data/glove` directory.
- At least 16 GB of RAM
- A graphics card (optional, for faster training involving CNNs)

Tested on Ubuntu 16.04 with an Nvidia Tesla K80 GPU with 12 GB of GDDR5 memory.
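Once downloaded, the GloVe file is plain text with one token per line followed by its vector components. As a minimal sketch (the function name is ours, and the 3-dimensional demo vectors below are illustrative, not from the project), the file can be loaded into a Python dict like this:

```python
def load_glove(lines, dim=200):
    """Parse GloVe-format text lines into a {word: [float, ...]} dict.

    Each line is a token followed by `dim` whitespace-separated floats,
    e.g. "hello 0.12 -0.33 ...". Malformed lines are skipped.
    """
    embeddings = {}
    for line in lines:
        parts = line.rstrip("\n").split(" ")
        word, values = parts[0], parts[1:]
        if len(values) != dim:  # skip lines with the wrong dimensionality
            continue
        embeddings[word] = [float(v) for v in values]
    return embeddings

# Usage with the real file (path taken from this README):
# with open("./data/glove/glove.twitter.27B.200d.txt", encoding="utf-8") as f:
#     glove = load_glove(f, dim=200)

# Tiny inline demo with made-up 3-dimensional vectors:
sample = ["the 0.1 0.2 0.3", "<user> -0.5 0.0 0.25"]
vecs = load_glove(sample, dim=3)
```

For the real 200-dimensional file this builds a dictionary of roughly 1.2 million tokens, so expect it to take a few GB of RAM (hence the 16 GB requirement above).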
A public leaderboard is connected to this competition. Our team's name is Bill Trader.
Team members:
- Dino Mujkić (dinomujki)
- Hrvoje Bušić (hrvojebusic)
- Sebastijan Stevanović (sebastijan94)
You can find already preprocessed tweet files `test_full.csv`, `train_neg_full.csv.zip` and `train_pos_full.csv.zip` in the `./data/parsed` directory.
To run the preprocessing again, you must have the Train tweets and Test tweets files in the `./data/datasets` directory. Then go to the `/src` folder and run `run_preprocessing.py` with the argument `train` or `test` to generate the required files for running the CNN:

```
$ python run_preprocessing.py train
$ python run_preprocessing.py test
```
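The exact transformations are defined by `run_preprocessing.py` in the repository. Purely as an illustration, preprocessing pipelines paired with the GloVe Twitter embeddings commonly normalize tweets along the following lines (the function below is a hypothetical sketch, not the project's actual code):

```python
import re

def normalize_tweet(text):
    """Apply common tweet normalizations: lowercasing and replacing
    URLs, user mentions, and hashtags with placeholder tokens, matching
    the conventions of the GloVe Twitter vocabulary."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", "<url>", text)   # URLs
    text = re.sub(r"@\w+", "<user>", text)                    # mentions
    text = re.sub(r"#(\w+)", r"<hashtag> \1", text)           # hashtags
    text = re.sub(r"\s+", " ", text).strip()                  # whitespace
    return text

print(normalize_tweet("Check THIS out @bob http://t.co/xyz #ml"))
# → check this out <user> <url> <hashtag> ml
```

Mapping rare surface forms onto shared placeholder tokens like `<user>` and `<url>` matters here because the pretrained embeddings only cover tokens seen in their own training vocabulary.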
To reproduce our best score from Kaggle, go to the `/src` folder and run `run_cnn.py` with the argument `eval`:

```
$ python run_cnn.py eval
```

A checkpoint for reproducing our best score is stored in the `data/models/1513824111` directory, so the training part will be skipped. If you want to run the training process from scratch, just pass the argument `train` when running `run_cnn.py`.
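The evaluation run produces the predictions submitted to Kaggle. For orientation only, a submission for a binary task like this is a small CSV; the sketch below assumes an `Id`/`Prediction` layout with labels -1 and 1, which is our assumption and should be checked against the competition's actual required format:

```python
import csv
import io

def write_submission(predictions, fileobj):
    """Write predictions as a Kaggle-style submission CSV.

    Assumed format (not confirmed by this README): an `Id` column
    counting from 1, and a `Prediction` column holding -1 (negative
    smiley) or 1 (positive smiley) for each test tweet, in order.
    """
    writer = csv.writer(fileobj)
    writer.writerow(["Id", "Prediction"])
    for i, label in enumerate(predictions, start=1):
        writer.writerow([i, label])

# Demo with three dummy predictions written to an in-memory buffer:
buf = io.StringIO()
write_submission([1, -1, 1], buf)
```

In the real pipeline the predictions would come from the CNN's output on `test_data.txt` and be written to a file on disk instead of a buffer.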
To run the evaluation you must have the necessary files: `glove.twitter.27B.200d.txt` in the `./data/glove` directory, and the preprocessed tweet files `test_full.csv`, `train_neg_full.csv` and `train_pos_full.csv` in the `./data/parsed` directory.
This project is available under the MIT license.