Hercules: Tweet Retrieval and Machine Learning


This directory contains the software developed in the main machine learning part of the project Automated Analysis of Online Behaviour on Social Media, a collaboration between the University of Groningen and the Netherlands eScience Center. The project also has a separate software repository for finding journalists.

The software consists of a collection of Python scripts, which can be divided into three groups:

  1. tweet-fetching scripts
  2. scripts related to the IEEE paper (Auckland)
  3. scripts related to the Casablanca paper

There are also several unrelated scripts, which have been left undocumented.

Tweet-fetching scripts

getTweetsUser.py

The Python script getTweetsUser.py can be used to obtain tweets from specific users. Run it as follows:

./getTweetsUser.py barackobama realdonaldtrump > file

It retrieves all available tweets from the specified users and stores them in the specified file, in JSON format. The command may take several minutes to complete.

The script needs two things to run:

First, you will need to install the Twitter package from https://github.com/sixohsix/twitter. The commands for this on Linux and macOS are:

git clone https://github.com/sixohsix/twitter
cd twitter
python setup.py build
sudo python setup.py install

Second, you need to store your Twitter account data in a file named "definitions.py" in the same directory as getTweetsUser.py. The file should contain the following lines:

# twitter.com authentication keys
token = "???"
token_secret = "???"
consumer_key = "???"
consumer_secret = "???"

Replace the "???" strings with the key information from https://apps.twitter.com; see https://www.slickremix.com/docs/how-to-get-api-keys-and-tokens-for-twitter/ for instructions.
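
Before fetching tweets, it can be useful to check that definitions.py has actually been filled in. The helper below is a hypothetical sketch (credentials_complete is our name, not part of this repository):

```python
def credentials_complete(defs):
    """Return True if all four Twitter keys are set and no longer '???'."""
    keys = ("token", "token_secret", "consumer_key", "consumer_secret")
    return all(getattr(defs, key, "???") != "???" for key in keys)

# Usage sketch: import the definitions module and check it before fetching.
# import definitions
# if not credentials_complete(definitions):
#     raise SystemExit("please fill in definitions.py first")
```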

getTweetText.py

The Python script getTweetText.py can be used to extract the tweet texts from the JSON output of getTweetsUser.py:

./getTweetText.py < getTweetsUser.py.out > file
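
The internals of getTweetText.py are not documented here; a minimal sketch of this kind of extraction, assuming one JSON-encoded tweet per input line (the function name tweet_texts is ours), could look like:

```python
import json
import sys

def tweet_texts(lines):
    """Yield the text of each tweet, one JSON object per input line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        tweet = json.loads(line)
        # Extended tweets carry "full_text"; older payloads use "text".
        yield tweet.get("full_text", tweet.get("text", ""))

if __name__ == "__main__":
    for text in tweet_texts(sys.stdin):
        print(text)
```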

Scripts related to the IEEE paper (Auckland)

Erik Tjong Kim Sang, Herbert Kruitbosch, Marcel Broersma and Marc Esteve del Valle, Determining the function of political tweets. In: Proceedings of the 13th IEEE International Conference on eScience (eScience 2017), IEEE, Auckland, New Zealand, 2017, pages 438-439, ISBN 978-1-5386-2686-3, doi:10.1109/eScience.2017.60. (PDF, bibtex)

First, the data needs to be converted to the format required by the machine learner fasttext. We use tokenized text preceded by the class label, for example: __label__1 this is a tweet !

for FILE in test train
do
   ./expandReplies.py -t dutch-2012.$FILE.csv -r EMPTY |\
      cut -d' ' -f1,4- | sed 's/ RAWTEXT.*$//' > dutch-2012.$FILE.txt
done

Note that the data files with annotated tweets (dutch-2012.*) are unavailable.
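
The label format produced by the pipeline above can also be sketched in Python; the helper below (to_fasttext, our name) mirrors the __label__ convention fasttext expects:

```python
def to_fasttext(label, tokens):
    """Format one example as fasttext supervised input: '__label__X tok ...'."""
    return "__label__{} {}".format(label, " ".join(tokens))

# Example:
# to_fasttext(1, ["this", "is", "a", "tweet", "!"])
# -> "__label__1 this is a tweet !"
```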

Next, fasttext can be applied to the data:

fasttext supervised -input dutch-2012.train.txt -output MODEL \
   -dim 5 -minCount 300
fasttext predict MODEL.bin dutch-2012.test.txt |\
   paste -d ' ' - dutch-2012.test.txt | cut -d' ' -f1,2 |\
      ./eval.py | head -1 | rev | sed 's/^ *//' | cut -d' ' -f1 | rev
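
eval.py itself is not documented here; judging from the pipeline, it receives (predicted, gold) label pairs and reports accuracy, which could be computed along these lines (accuracy is a hypothetical stand-in, not the actual script):

```python
def accuracy(pairs):
    """Fraction of (predicted, gold) label pairs that agree."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    correct = sum(1 for predicted, gold in pairs if predicted == gold)
    return correct / len(pairs)
```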

For most of the experiments mentioned in Table II of the paper, these two commands can be reused with a different training file. Only the language modeling experiments require an extra step to create the language models:

fasttext skipgram -input EXTRADATA -output VECTORS -dim 5 \
   -minCount 300
fasttext supervised -input dutch-2012.train.txt -output MODEL \
   -dim 5 -minCount 300 -pretrainedVectors VECTORS.vec
fasttext predict MODEL.bin dutch-2012.test.txt |\
   paste -d ' ' - dutch-2012.test.txt | cut -d' ' -f1,2 |\
   ./eval.py | head -1 | rev | sed 's/^ *//' | cut -d' ' -f1 | rev

We always remove the labels from the EXTRADATA files.
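
Removing the labels amounts to stripping the leading __label__ prefix from each line; a small sketch, assuming the format shown earlier (strip_label is our name):

```python
import re

# Matches a leading '__label__X ' prefix, e.g. '__label__1 '.
LABEL_PREFIX = re.compile(r"^__label__\S+\s+")

def strip_label(line):
    """Remove a leading '__label__X ' prefix, leaving only the tokens."""
    return LABEL_PREFIX.sub("", line)
```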

Scripts related to the Casablanca paper

Erik Tjong Kim Sang, Marc Esteve del Valle, Herbert Kruitbosch, and Marcel Broersma, Active Learning for Classifying Political Tweets. In: Proceedings of the International Conference on Natural Language, Signal and Speech Processing (ICNLSSP), Casablanca, Morocco, 2017.

The experiments related to Figure 1 and Table 1 of the paper were performed with the bash script run.sh.

After annotating a file for active learning, the next data file was generated with the bash script run-make-batch.

Contact

Erik Tjong Kim Sang, e.tjongkimsang(at)esciencecenter.nl