Czech Media Migration Discourse Research, 2015-2023

Repository of scripts, workflow and output

Authors:

Ondrej Pekacek, Charles University
Irene Elmerot, Stockholm University

NOTE: Some data could not be stored in this repository, due to their proprietary nature. This primarily affects media data obtained from Newton Media API.

Overview last updated: 14 July 2022

Structure:

1. Extraction of data

Media articles
- Newton Media database: Migration articles and counts of all media content (API)
- Disinformation Articles (EUvsDisinfo & newsplease scrape)
Facebook (media and journalists)
- Facebook Ads API
- Facebook public pages posts (Crowdtangle API) [work in progress]
Twitter (media and journalists)
- Twitter tweets (Twitter Academic API)
- Twitter account information & followers (Twitter Academic API)
- Twitter counts (Twitter Academic API)
Instagram (media and journalists) [work in progress]
- Public accounts posts (Crowdtangle API)
- Public accounts information (igramscraper)
Complementary datasets [work in progress]
- Relevant parliamentary speeches (Hlidac Statu API)
- Real media owners (Hlidac Statu API)
- GDELT (Big Query API)
- Corpora (SYN2020, representative corpus of Czech language)
- Google media political ads (Big Query API)
- Salience of migration over time (gtrendsR & Wikimedia API)
Visual data of posts [work in progress]
- Media, Twitter, Instagram (including stories), Facebook

2. Data pre-processing [work in progress]

Media articles
- Preprocessing: Removal of unneeded characters/symbols (Regex)
- Lemmatization, POS tagging & other text features (Lindat UDPIPE API) [work in progress]
- Labeling of media types (provided by Cvrcek & Henys, last update February 2022)
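
The regex cleaning step could look roughly like the sketch below. It is only an illustration: the patterns, the set of punctuation kept, and the function name `clean_article` are assumptions, not the patterns used in this repository.

```python
import re
import unicodedata

# Hypothetical patterns, chosen for illustration only.
TAG_RE = re.compile(r"<[^>]+>")          # leftover HTML tags
URL_RE = re.compile(r"https?://\S+")     # bare URLs
# Keep word characters (including Czech diacritics), whitespace and basic punctuation.
SYMBOL_RE = re.compile(r"""[^\w\s.,;:!?()"'%-]""")
WS_RE = re.compile(r"\s+")


def clean_article(text: str) -> str:
    """Normalize Unicode, drop tags, URLs and stray symbols, collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = TAG_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    text = SYMBOL_RE.sub(" ", text)
    return WS_RE.sub(" ", text).strip()


print(clean_article("<p>Migrace  do Česka ★ roste</p> https://example.com/a"))
```

Normalizing to NFC first keeps accented Czech letters as single code points, so the later patterns treat them as ordinary word characters.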

3. Exploratory data analysis [work in progress]

Media articles
- Counts of all migration content over time
- Proportion of migration content in all coverage over time
- How did the average article length differ across media types? [work in progress]
Twitter
- Filter media migration tweets and display in time
- Longitudinal comparison with overall migration Tweet counts
Extra
- Visualization of dependency trees of selected UDPIPE'd sentences (JS Treex)
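
As a minimal sketch of the proportion measure listed under Media articles, the share of migration content per month can be computed from article dates and total content counts. The function name `monthly_share` and the input shapes are assumptions for illustration; the actual scripts may aggregate differently.

```python
from collections import Counter
from datetime import date


def monthly_share(migration_dates, all_counts):
    """Share of migration articles in all coverage, per (year, month).

    migration_dates: iterable of datetime.date, one per migration article.
    all_counts: dict mapping (year, month) to the total number of articles.
    """
    migration = Counter((d.year, d.month) for d in migration_dates)
    return {
        period: migration.get(period, 0) / total
        for period, total in sorted(all_counts.items())
        if total > 0  # skip months with no recorded coverage
    }


dates = [date(2015, 9, 1), date(2015, 9, 15), date(2015, 10, 2)]
totals = {(2015, 9): 100, (2015, 10): 50}
print(monthly_share(dates, totals))
```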

4. Data Analysis [work in progress]

What are the most prominent entities in the migration coverage?
- Named Entity Recognition (Lindat NameTag 2 model)
What are the most similar words to migration-related terms and how did the meaning evolve?
- Word Embedding model (Wang2Vec/Word2Vec, GloVe, Fasttext)
- Word shift graphs (shifterator)
Has the overall sentiment evolved over time / differed across the media types?
- Sentiment Analysis (Czech SubLex, BERT Czert B model)
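
A lexicon-based score of the kind a subjectivity lexicon supports can be sketched as follows. The toy lexicon and the scoring rule (mean polarity over lemmas) are assumptions for illustration; they are not the Czech SubLex entries or the project's scoring scheme.

```python
# Hypothetical toy lexicon: lemma -> polarity (+1 positive, -1 negative).
TOY_LEXICON = {"dobrý": 1, "pomoc": 1, "krize": -1, "hrozba": -1}


def lexicon_sentiment(lemmas, lexicon):
    """Mean polarity of the lemmas; lemmas missing from the lexicon count as 0."""
    if not lemmas:
        return 0.0
    total = sum(lexicon.get(lemma.lower(), 0) for lemma in lemmas)
    return total / len(lemmas)


print(lexicon_sentiment(["migrační", "krize", "být", "hrozba"], TOY_LEXICON))
```

Scoring lemmas rather than surface forms matters for Czech, where one lexicon entry covers many inflected variants.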
How did the media types differ in semantic grouping?
- Semantic Analysis (UCREL Semantic Analysis System)
What were the articles' key words?
- Keyword extraction (Lindat KER API / local TF-IDF)
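
The local TF-IDF option can be illustrated with a small standard-library sketch. The function name `tfidf_keywords`, the unsmoothed weighting, and the example lemmas are assumptions; the repository may rely on a library implementation with a different weighting.

```python
import math
from collections import Counter


def tfidf_keywords(docs, top_n=3):
    """Return the top_n terms per document ranked by TF-IDF.

    docs: list of token lists (for example lemmas).
    TF is relative frequency; IDF is log(N / document frequency).
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    keywords = []
    for doc in docs:
        term_freq = Counter(doc)
        scores = {
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        }
        # Sort by descending score, then alphabetically to break ties.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        keywords.append([term for term, _ in ranked[:top_n]])
    return keywords


docs = [
    ["migrant", "hranice", "migrant", "vláda"],
    ["vláda", "rozpočet", "daně"],
    ["vláda", "hranice", "policie"],
]
print(tfidf_keywords(docs, top_n=1))
```

Terms that occur in every document (here `vláda`) get a score of zero, which is why they drop out of the keyword lists.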
Which media were most active in spreading disinformation about migration?
- Document similarity (Doc2Vec) with disinfo dataset
To what extent was the media discourse similar to political actors' speeches?
- Document similarity (Doc2Vec) with parliamentary speeches
How does the usage of linguistic features differ across media types?
- UDPIPE word features data
What difference exists in the discourse/framing of the refugees across media types?
- Word collocations and concordances (KWIC)
- LDA topic modeling
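
A keyword-in-context (KWIC) concordance, as named in the first bullet above, can be sketched in a few lines. This is a simplified illustration with a hypothetical `kwic` helper that matches exact tokens; the workflow diagram names quanteda for this step, which offers richer pattern matching.

```python
def kwic(tokens, keyword, window=3):
    """Return (left context, match, right context) for each keyword hit."""
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, token, right))
    return hits


tokens = "Vláda dnes jednala o migraci a o ochraně hranic".split()
for left, match, right in kwic(tokens, "migraci", window=2):
    print(f"{left:>20} | {match} | {right}")
```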
Did the usage (and pattern) of migration vs refugee terms differ across time and media types?
- Lexicon with regex
- Sequence analysis
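
The lexicon-with-regex approach can be illustrated as below. The two patterns are a hypothetical, deliberately small lexicon for illustration only, not the project's migration-terms lexicon; note that a stem such as `uprchl` would also match verb forms, so a real lexicon needs more careful patterns or lemmatized input.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon: label -> regex over Czech word stems.
LEXICON = {
    "migrant": re.compile(r"\b(?:i?migrant|přistěhoval)\w*", re.IGNORECASE),
    "refugee": re.compile(r"\b(?:uprchl|běžen)\w*", re.IGNORECASE),
}


def count_terms(text):
    """Count non-overlapping matches of each lexicon category in the text."""
    return Counter({label: len(pattern.findall(text)) for label, pattern in LEXICON.items()})


text = "Uprchlíci a migranti přicházejí. Každý uprchlík i imigrant žádá o azyl."
print(count_terms(text))
```

Running `count_terms` per article and grouping the results by date and media type would give the term-usage comparison the question asks about.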
What objects were most common in the migration coverage?
- Image object recognition (YOLOv4/v5)
What online networks/communities exist around key media and across types?
- Network analysis on Twitter data and users' behavior
What videos are being recommended by YouTube for key Czech media accounts?
- Network analysis of YT recommendations

5. Writing it up

Stockholm June 2022 conference abstract & presentation
CCL Vienna June 2022 presentation
Article 1 draft (NER & KWIC concordances) [work in progress]

Project workflow diagram:

graph TD;

  style newton fill:#2b9bf4,color:#fff,stroke:#b4263d,stroke-width:1.5px
  style disinfo fill:#2b9bf4,color:#fff,stroke:#b4263d,stroke-width:1.5px
  style twitter fill:#2b9bf4,color:#fff,stroke:#b4263d,stroke-width:1.5px
  style media fill:#2b9bf4,color:#fff,stroke:#b4263d,stroke-width:1.5px
  style fb_ads fill:#2b9bf4,color:#fff,stroke:#b4263d,stroke-width:1.5px
  style crowdtangle fill:#2b9bf4,color:#fff,stroke:#b4263d,stroke-width:1.5px
  style soc_media fill:#2b9bf4,color:#fff,stroke:#b4263d,stroke-width:1.5px
  style visual_data fill:#2b9bf4,color:#fff,stroke:#b4263d,stroke-width:1.5px
  style gdelt fill:#2b9bf4,color:#fff,stroke:#b4263d,stroke-width:1.5px
  style complementary fill:#2b9bf4,color:#fff,stroke:#b4263d,stroke-width:1.5px
  style semantic fill:#2b9bf4,color:#fff,stroke:#b4263d,stroke-width:1.5px
  style sentiment fill:#2b9bf4,color:#fff,stroke:#b4263d,stroke-width:1.5px
  style vision fill:#2b9bf4,color:#fff,stroke:#b4263d,stroke-width:1.5px
  style wiki_gtrends fill:#2b9bf4,color:#fff,stroke:#b4263d,stroke-width:1.5px
  style network fill:#2b9bf4,color:#fff,stroke:#b4263d,stroke-width:1.5px
  style complementary_analyses fill:#2b9bf4,color:#fff,stroke:#b4263d,stroke-width:1.5px
  style raw_data fill:#2b9bf4,color:#fff,stroke:#ed1c24,stroke-width:1.5px
  style regex fill:#2b9bf4,color:#fff,stroke:#ed1c24,stroke-width:1.5px
  style udpipe fill:#2b9bf4,color:#fff,stroke:#ed1c24,stroke-width:1.5px
  style clean_data fill:#2b9bf4,color:#fff,stroke:#ed1c24,stroke-width:1.5px
  style eda fill:#2b9bf4,color:#fff,stroke:#ed1c24,stroke-width:1.5px
  style nlp fill:#2b9bf4,color:#fff,stroke:#ed1c24,stroke-width:1.5px
  style counts fill:#2b9bf4,color:#fff,stroke:#ed1c24,stroke-width:1.5px
  style length fill:#2b9bf4,color:#fff,stroke:#ed1c24,stroke-width:1.5px
  style lda fill:#2b9bf4,color:#fff,stroke:#ed1c24,stroke-width:1.5px
  style ner fill:#03fc9d,color:#fff,stroke:#ed1c24,stroke-width:1.5px
  style analyzed_data fill:#2b9bf4,color:#fff,stroke:#ed1c24,stroke-width:1.5px
  style analysis fill:#2b9bf4,color:#fff,stroke:#ed1c24,stroke-width:1.5px
  style ccl fill:#2b9bf4,color:#fff,stroke:#ed1c24,stroke-width:1.5px
  style stockholm fill:#2b9bf4,color:#fff,stroke:#ed1c24,stroke-width:1.5px
  style 1st_article fill:#2b9bf4,color:#fff,stroke:#ed1c24,stroke-width:1.5px
  style finish fill:#2b9bf4,color:#fff,stroke:#ed1c24,stroke-width:1.5px
  style communication fill:#2b9bf4,color:#fff,stroke:#ed1c24,stroke-width:1.5px
  style labels fill:#2b9bf4,color:#fff,stroke:#ed1c24,stroke-width:1.5px
  style collocs fill:#03fc9d,color:#fff,stroke:#ed1c24,stroke-width:1.5px
  style concord fill:#2b9bf4,color:#fff,stroke:#ed1c24,stroke-width:1.5px
  style doc_sim fill:#2b9bf4,color:#fff,stroke:#ed1c24,stroke-width:1.5px
  style word_sim fill:#2b9bf4,color:#fff,stroke:#ed1c24,stroke-width:1.5px
  style freq fill:#03fc9d,color:#fff,stroke:#ed1c24,stroke-width:1.5px
  style trees fill:#2b9bf4,color:#fff,stroke:#ed1c24,stroke-width:1.5px
  style kw fill:#2b9bf4,color:#fff,stroke:#ed1c24,stroke-width:1.5px
  style fact_check fill:#2b9bf4,color:#fff,stroke:#ed1c24,stroke-width:1.5px
  style ws_graphs fill:#2b9bf4,color:#fff,stroke:#ed1c24,stroke-width:1.5px
  style med_owners fill:#2b9bf4,color:#fff,stroke:#ed1c24,stroke-width:1.5px
  style parl_speeches fill:#2b9bf4,color:#fff,stroke:#ed1c24,stroke-width:1.5px
  style ola fill:#2b9bf4,color:#fff,stroke:#ed1c24,stroke-width:1.5px
  style med_demograph fill:#2b9bf4,color:#fff,stroke:#ed1c24,stroke-width:1.5px
  style idioms fill:#2b9bf4,color:#fff,stroke:#ed1c24,stroke-width:1.5px
  style yt fill:#2b9bf4,color:#fff,stroke:#ed1c24,stroke-width:1.5px
  style google fill:#2b9bf4,color:#fff,stroke:#ed1c24,stroke-width:1.5px

  newton[Newton Media API: <br> full media articles and content count] --> media([News media output])
  labels[Media type labelled dataset: <br> Vaclav Cvrcek et al.] ---> media
  disinfo[EUvsDisinfo verified migration disinformation: <br> newsplease scrape] --> media
  twitter[Tweets & Account info: <br> Twitter API] --> soc_media([Migration communication of <br> journalistic actors on social media])
  yt[YouTube channels data: <br> YouTube API] --> soc_media
  crowdtangle[FB & Instagram public posts: <br> Crowdtangle API] --> soc_media
  media ----> raw_data[(RAW DATA)]
  soc_media ---> raw_data
  visual_data[Visual data: <br> web scraping & Selenium] --> complementary([Contextual data])
  gdelt[Global Database of Events, <br> Language and Tone: <br> GDELT API] --> complementary
  ola[Online media readership: <br> Netmonitor.cz OLA data] --> complementary
  fb_ads[Paid FB advertising <br> by Czech media: <br> Facebook Ads Library API] --> complementary
  google[Paid Google advertising <br> by Czech media: <br> Big Query API] --> complementary
  med_demograph[Online media demographics: <br> Similarweb.com data] --> complementary
  wiki_gtrends[Salience of migration: <br> pageviews - Wikipedia views <br> gtrends - Google Trends <br> Eurobarometer data] --> complementary
  med_owners[Czech media ownership: <br> real media owners database <br> Hlidac Statu API] --> complementary
  fact_check[Migration PolComm 1: <br> fact-checked claims of <br> political actors about migration: <br> demagog.cz API] --> complementary
  parl_speeches[Migration PolComm 2: <br> speeches about migration <br> in the parliament: <br> Hlidac Statu API] --> complementary
  complementary --> raw_data
  raw_data ===> regex([Text cleaning and preprocessing: <br> Regex pattern])
  regex ==> udpipe([Lemmatization, UPOS & linguistic features: <br> UDPIPE model])
  udpipe ===> clean_data[(PROCESSED DATA)]
  clean_data ====> analysis([Data Analysis])
  eda([Exploratory Data Analysis]) === analysis
  nlp([Natural Language Processing]) ===== analysis
  complementary_analyses([Complementary analyses]) === analysis
  counts[Counts over time <br> and media type] --> eda
  length[Content length over time <br> and media type] --> eda
  trees[Key sentences dependency trees: Treex] --> eda
  network[Network analysis of <br> online media <br> YouTube channels: <br> igraph] --> complementary_analyses
  vision[Object detection: <br> YOLO model] --> complementary_analyses
  sentiment[Sentiment analysis: <br> Czech Subjectivity Lexicon & <br> Czert BERT model] --> nlp
  ws_graphs[Word shift graphs: <br> shifterator] --> nlp
  semantic[Semantic analysis: <br> UCREL USAS] --> nlp
  idioms[Idiomatic expressions: <br> custom lexicon] --> nlp
  ner[Named Entity Recognition: <br> NameTag 2 model] --> nlp
  lda[Topic modeling: <br> gensim LDA unsupervised model <br> seededlda semisupervised model] --> nlp
  collocs[Collocations: <br> quanteda] --> nlp
  concord[KWIC Concordances: <br> quanteda] --> nlp
  doc_sim[Document similarity: <br> doc2vec] --> nlp
  word_sim[Word embeddings: <br> wang2vec] --> nlp
  freq[Key term frequencies: <br> custom migration-terms <br> lexicon] --> nlp
  kw[Key Word extraction: <br> TF-IDF] --> nlp
  analysis =====> analyzed_data[(ANALYZED DATA)]
  analyzed_data ====> communication{PUBLISHING & OUTREACH}
  ccl[Vienna University <br> Computational Communication Lab <br> presentation <br> June 2022] --- communication
  stockholm[Stockholm University <br> Powers of Language <br> conference presentation <br> June 2022] --- communication
  1st_article[First article draft <br> summer 2022] --- communication
  communication ====> finish((PROJECT COMPLETION))

