Classification of Twitter users using multimodal embeddings
git clone https://github.com/medialab/graines.git
cd graines
- create a virtual environment with Python 3.8
- activate it
- run
pip install -r requirements.txt
- Move the non_graines_metadata.csv and graines_metadata.csv files inside the graines repo
- run
python create_ground_truth.py
- the ground truth is saved in a CSV file: "graines_et_non_graines.csv". The seeds get the label 1 and the non-seeds the label 0.
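The labelling rule above (1 for seeds, 0 for non-seeds) can be illustrated with a minimal sketch. This is not the content of create_ground_truth.py: the column names and the helper `build_ground_truth` are hypothetical, and small in-memory CSVs stand in for the two metadata files.

```python
import csv
import io


def build_ground_truth(seed_rows, non_seed_rows):
    """Return all rows with an added 'label' column: 1 for seeds, 0 for non-seeds."""
    labelled = []
    for row in seed_rows:
        labelled.append({**row, "label": 1})
    for row in non_seed_rows:
        labelled.append({**row, "label": 0})
    return labelled


# In-memory stand-ins for graines_metadata.csv and non_graines_metadata.csv.
# The column names (screen_name, description) are assumptions for illustration.
seeds_csv = io.StringIO("screen_name,description\nalice,researcher\n")
non_seeds_csv = io.StringIO("screen_name,description\nbob,football fan\ncarol,chef\n")

rows = build_ground_truth(csv.DictReader(seeds_csv), csv.DictReader(non_seeds_csv))
labels = [row["label"] for row in rows]
print(labels)  # prints [1, 0, 0]
```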
Have a look at the tfidf_on_descriptions.py file: the matrix should be saved as a name_of_your_embedding_model.npy matrix and have exactly 411 rows. Alternatively, topo_count.py measures the topological features of candidates.
The vectors corresponding to each user should be in the same order as the users in graines_et_non_graines.csv.
You can run python tfidf_on_descriptions.py to get an example of the embedding matrix.
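If your model produces vectors keyed by user, the ordering and row-count requirements can be checked before saving. The sketch below is only an illustration under assumptions: the helpers `align_rows` and `save_embedding` and the toy user names are hypothetical and are not part of the repo. Saving relies on NumPy's `np.save`.

```python
def align_rows(user_ids, vectors_by_user):
    """Return one vector per user, in the same order as user_ids."""
    missing = [u for u in user_ids if u not in vectors_by_user]
    if missing:
        raise KeyError(f"no vector for users: {missing}")
    return [vectors_by_user[u] for u in user_ids]


def save_embedding(rows, model_name, expected_rows=411):
    """Save rows as <model_name>.npy after checking the row count."""
    import numpy as np  # imported here so align_rows works without NumPy

    matrix = np.asarray(rows, dtype=float)
    if matrix.shape[0] != expected_rows:
        raise ValueError(f"expected {expected_rows} rows, got {matrix.shape[0]}")
    np.save(f"{model_name}.npy", matrix)


# Toy example with three hypothetical users instead of the 411 real ones.
ground_truth_order = ["alice", "bob", "carol"]
vectors = {"carol": [0.0, 1.0], "alice": [1.0, 0.0], "bob": [0.5, 0.5]}
aligned = align_rows(ground_truth_order, vectors)
```

With the real data, `ground_truth_order` would be read from graines_et_non_graines.csv and the aligned rows passed to `save_embedding(aligned, "name_of_your_embedding_model")`.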
python main.py --model name_of_your_embedding_model
(without .npy in the name of the model)
The evaluation is run 5 times, each time with a different train/test split.
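The idea of repeating the evaluation over several random splits can be sketched as follows. This is not the code in main.py: the 20% test fraction, the use of the run index as the random seed, and the helper `repeated_splits` are assumptions made for illustration.

```python
import random


def repeated_splits(n_samples, n_runs=5, test_fraction=0.2):
    """Yield (train_indices, test_indices) for n_runs different random splits."""
    n_test = max(1, round(n_samples * test_fraction))
    for run in range(n_runs):
        rng = random.Random(run)  # a different seed per run gives a different shuffle
        indices = list(range(n_samples))
        rng.shuffle(indices)
        yield sorted(indices[n_test:]), sorted(indices[:n_test])


splits = list(repeated_splits(411))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

Averaging a metric over the runs makes the reported score less dependent on one lucky or unlucky split.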
To save a complete report to results_binary_classif.csv, run
python main.py --model name_of_your_embedding_model --report
To try a different classifier, run
python main.py --model name_of_your_embedding_model --classifier SVM_RBF_kernel
To test the code only on difficult cases, run
python main.py --model name_of_your_embedding_model --objective difficult_cases
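To see how the options above fit together, here is a minimal `argparse` sketch that accepts the same four flags. The actual parser in main.py may differ: the default values, the help strings, and the helper `build_parser` are assumptions.

```python
import argparse


def build_parser():
    """Build a parser accepting the flags shown in the commands above."""
    parser = argparse.ArgumentParser(description="Evaluate an embedding matrix")
    parser.add_argument("--model", required=True, help="matrix name, without .npy")
    parser.add_argument("--report", action="store_true",
                        help="save a complete report to results_binary_classif.csv")
    # Defaults of None are an assumption; main.py may define its own defaults.
    parser.add_argument("--classifier", default=None, help="e.g. SVM_RBF_kernel")
    parser.add_argument("--objective", default=None, help="e.g. difficult_cases")
    return parser


args = build_parser().parse_args(
    ["--model", "name_of_your_embedding_model", "--classifier", "SVM_RBF_kernel"]
)
matrix_path = f"{args.model}.npy"  # the .npy extension is added to the model name
```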
The .gitignore file should prevent you from uploading the users' personal data or any Twitter data we collected.
git commit -am "name of your commit"
git push