Multilingual MigrationsKB

Multilingual MigrationsKB (MGKB) is a multilingual extension of the English MGKB. Geotagged tweets from 32 European countries (Austria, Belgium, Bulgaria, Croatia, Cyprus, Czech Republic, Denmark, Estonia, Finland, France, Germany, Greece, Hungary, Ireland, Italy, Latvia, Lithuania, Luxembourg, Malta, Netherlands, Poland, Portugal, Romania, Slovakia, Slovenia, Spain, Sweden, Iceland, Liechtenstein, Norway, Switzerland, the United Kingdom) are extracted and filtered by 11 languages (English, French, Finnish, German, Greek, Dutch, Hungarian, Italian, Polish, Spanish, Swedish). Tweet metadata, such as geo information (place name, coordinates, country code), is included. MGKB contains sentiments, offensive and hate speech, topics, hashtags, and user mentions in RDF format. The schema of MGKB extends TweetsKB with migration-related information. Moreover, to associate and represent the potential economic and social factors driving migration flows, data from Eurostat and the FIBO ontology are used. To represent multilinguality, the CIDOC Conceptual Reference Model (CIDOC-CRM) is used. The extracted economic indicators, i.e., GDP growth rate, total unemployment rate, youth unemployment rate, long-term unemployment rate, and income per household, are connected with each tweet in RDF along geographical and temporal dimensions.
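
The link between tweets and economic indicators can be pictured as a join on country and year. The following is a minimal sketch in plain Python, not the RDF population code from this repository; the field names (`country_code`, `created_at`), the `attach_indicators` helper, and the indicator values are hypothetical placeholders for illustration.

```python
from datetime import datetime

# Hypothetical indicator table keyed by (country code, year).
# The numbers are placeholders, not Eurostat figures.
INDICATORS = {
    ("DE", 2019): {"gdp_growth_rate": 1.0, "total_unemployment_rate": 3.0},
    ("SE", 2019): {"gdp_growth_rate": 2.0, "total_unemployment_rate": 7.0},
}


def attach_indicators(tweet, indicators):
    """Return a copy of the tweet with the indicators for its country and year."""
    year = datetime.fromisoformat(tweet["created_at"]).year
    key = (tweet["country_code"], year)
    enriched = dict(tweet)
    # An empty dict signals that no indicator is available for this key.
    enriched["indicators"] = indicators.get(key, {})
    return enriched


tweet = {"id": "1", "country_code": "DE", "created_at": "2019-06-01T12:00:00"}
print(attach_indicators(tweet, INDICATORS)["indicators"])
```

In the knowledge base itself this association is expressed in RDF; the dictionary lookup only illustrates the geographical and temporal keys involved.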

Please contact Yiyi Chen (yiyi.chen@partner.kit.edu) for pretrained models (Sentiment analysis/hate speech detection/ETM) if necessary.

Resources

MGKB TTL files and topic words in 11 languages: https://zenodo.org/record/5918508

Overall Framework

Collect tweets

Get Twitter API access and put credentials.yaml in the crawler/config folder:
```
migrationsKB:
   berear_token: XXXX
```
Specify COUNTRY_ISO2 and the idx of keywords_all, then run:
- python -m crawler.main_keywords
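
For orientation, a keyword search restricted to one country might be assembled as below. This is a sketch under assumptions, not the code in `crawler`: it assumes the Twitter API v2 full-archive search endpoint and the `place_country:` and `lang:` query operators, and `build_search_request` is a hypothetical helper. It only builds the request; nothing is sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed endpoint: Twitter API v2 full-archive search.
SEARCH_URL = "https://api.twitter.com/2/tweets/search/all"


def build_search_request(bearer_token, keyword, country_iso2, lang):
    """Build (but do not send) a search request for geotagged tweets."""
    # Assumed query operators: place_country restricts by country, lang by language.
    query = f"{keyword} place_country:{country_iso2} lang:{lang}"
    params = urlencode({"query": query, "tweet.fields": "created_at,geo,lang"})
    return Request(
        f"{SEARCH_URL}?{params}",
        headers={"Authorization": f"Bearer {bearer_token}"},
    )


request = build_search_request("XXXX", "refugee", "SE", "sv")
print(request.full_url)
```

The token passed in is the value stored in credentials.yaml; the keyword, country, and language here are illustrative values only.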

Preprocessing tweets

Restructure the data and compute statistics of the curated data by country:
- python -m preprocessor.restructure_data
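
The per-country statistics amount to counting curated tweets by country code. A minimal sketch, assuming each tweet is a dict with a `country_code` field (a hypothetical layout, not necessarily what `preprocessor.restructure_data` produces):

```python
from collections import Counter


def count_by_country(tweets):
    """Count tweets per country code, skipping tweets without one."""
    return Counter(t["country_code"] for t in tweets if t.get("country_code"))


tweets = [
    {"id": "1", "country_code": "DE"},
    {"id": "2", "country_code": "SE"},
    {"id": "3", "country_code": "DE"},
    {"id": "4"},  # no geo information, ignored
]
for country, n in count_by_country(tweets).most_common():
    print(country, n)
```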

Topic Modeling

python -m models.topicModeling.ETM.main --mode train --num_topics 50 --lang_code es

Steps:

1. data_build_tweets.py
2. skipgram.py
3. python -m models.topicModeling.ETM.main --mode train --num_topics 50 --lang_code es
    * train in batch:
        python -m run_etms --min_topics 5 --max_topics 50 --device 0 --lang_code en
4. python -m models.topicModeling.ETM.infer_topic_and_filter --lang_code fi \
   --model_path output/models/ETM/fi/best/etm_tweets_K_10_Htheta_800_Optim_adam_Clip_0.0_ThetaAct_relu_Lr_0.005_Bsz_1000_RhoSize_300_trainEmbeddings_0_val_loss_6.446055066569226e+18_epoch_188 \
   --num_topics 10
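
The checkpoint name in step 4 appears to encode the training configuration as `key_value` pairs (for instance, `K_10` matches `--num_topics 10`). The sketch below recovers those pairs from the file name so that the right `--num_topics` can be read off a saved model. `parse_checkpoint_name` is a hypothetical helper, not part of this repository, and the key list is taken only from the example name above.

```python
import re

# Hyperparameter keys as they appear in the example checkpoint name.
KEYS = ("K", "Htheta", "Optim", "Clip", "ThetaAct", "Lr", "Bsz",
        "RhoSize", "trainEmbeddings", "val_loss", "epoch")
# A key must start the name or follow an underscore; values contain no underscore.
PATTERN = re.compile(r"(?:^|_)(" + "|".join(KEYS) + r")_([^_]+)")


def parse_checkpoint_name(name):
    """Map each known key in an ETM checkpoint name to its raw string value."""
    return dict(PATTERN.findall(name))


name = ("etm_tweets_K_10_Htheta_800_Optim_adam_Clip_0.0_ThetaAct_relu_"
        "Lr_0.005_Bsz_1000_RhoSize_300_trainEmbeddings_0_"
        "val_loss_6.446055066569226e+18_epoch_188")
params = parse_checkpoint_name(name)
print(params["K"], params["Lr"], params["epoch"])
```

Values are kept as strings; convert them (for example `int(params["K"])`) where a number is needed.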

XLM-R

Fine-tune XLM-R for sentiment analysis (sa) or hate speech detection (hsd):

python -m models.scripts.xlm-r-adapter --lang_code swedish --task hsd

CUDA_VISIBLE_DEVICES=1 python -m models.scripts.xlm-r-adapter --lang_code swedish --task hsd

CUDA_VISIBLE_DEVICES=3 python -m models.scripts.xlm-r-adapter --lang_code sv --task sa
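
When several fine-tuning runs are scripted, the commands above can be assembled programmatically. The sketch below builds the argument list and environment for one run, using only the flags shown above; `build_finetune_command` is a hypothetical helper, and restricting `task` to `sa` and `hsd` is an assumption based on the examples in this section.

```python
import os
import sys


def build_finetune_command(lang_code, task, gpu=None):
    """Return (argv, env) for one fine-tuning run; nothing is executed here."""
    if task not in ("sa", "hsd"):
        raise ValueError("task must be 'sa' or 'hsd'")
    argv = [sys.executable, "-m", "models.scripts.xlm-r-adapter",
            "--lang_code", lang_code, "--task", task]
    env = dict(os.environ)
    if gpu is not None:
        # Pin the run to a single GPU, as in the commands above.
        env["CUDA_VISIBLE_DEVICES"] = str(gpu)
    return argv, env


argv, env = build_finetune_command("sv", "sa", gpu=3)
print(" ".join(argv[1:]))
# To launch the run: subprocess.run(argv, env=env, check=True)
```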

Troubleshooting

sentencepiece on macOS: see the comment in google/sentencepiece#378
