The TED2020 v1 dataset, available from https://opus.nlpl.eu/TED2020.php, was selected for the Word2Vec training. It contains transcripts of talks given at TED events, which are popular science communication events.
It was chosen for its simple, everyday, unpretentious language and for its continuous, flowing (provisional) discourse, which is marked by an absence of technical terms and a variety of vocabulary and expressions. It is a parallel corpus with sentence pairs in two languages.
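
To make the sentence-pair structure concrete, the following is a minimal sketch of reading such a corpus into tokenized sentences. It assumes the plain-text (Moses-style) layout, in which two files hold one sentence per line and line i of one file is aligned with line i of the other. The file names `sample.en` and `sample.es`, the helpers `tokenize` and `read_parallel`, and the simple regex tokenizer are illustrative assumptions, not the preprocessing actually used here.

```python
import re
import tempfile
from pathlib import Path
from typing import Iterator, List, Tuple

# Hypothetical tokenizer: lowercase, then keep runs of word characters.
TOKEN_RE = re.compile(r"\w+", re.UNICODE)


def tokenize(sentence: str) -> List[str]:
    """Split a sentence into lowercase word tokens."""
    return TOKEN_RE.findall(sentence.lower())


def read_parallel(
    src_path: Path, tgt_path: Path
) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield aligned (source_tokens, target_tokens) pairs.

    Assumes both files have one sentence per line and are line-aligned.
    Pairs in which either side has no tokens are skipped.
    """
    with open(src_path, encoding="utf-8") as src, \
         open(tgt_path, encoding="utf-8") as tgt:
        for src_line, tgt_line in zip(src, tgt):
            src_tokens = tokenize(src_line)
            tgt_tokens = tokenize(tgt_line)
            if src_tokens and tgt_tokens:
                yield src_tokens, tgt_tokens


def demo() -> List[Tuple[List[str], List[str]]]:
    """Build a tiny two-line stand-in corpus and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "sample.en"  # stand-in for the real source file
        tgt = Path(tmp) / "sample.es"  # stand-in for the real target file
        src.write_text("Thank you so much.\nIt's a great honor.\n",
                       encoding="utf-8")
        tgt.write_text("Muchas gracias.\nEs un gran honor.\n",
                       encoding="utf-8")
        return list(read_parallel(src, tgt))


if __name__ == "__main__":
    for src_tokens, tgt_tokens in demo():
        print(src_tokens, "|", tgt_tokens)
```

The token lists from one side of the corpus are in the shape that Word2Vec implementations typically expect as training input, namely an iterable of tokenized sentences.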