This is a repository for research into radical and extremist infospheres on YouTube. We have used this code for a series of stories at de Volkskrant (link to stories) and de Correspondent (link to stories).
The code consists of several modules, packages and collections of code.
DataCollection contains a library for, well, large-scale data collection. The code takes a list of channels and collects the following data types through the YouTube API:
- Channel information (basic statistics, relevant playlist ids and more)
- Videos (statistics and descriptions)
- Comments (all comments of the videos)
- Recommendations (all recommendations for the gathered videos)
- Transcripts (English transcripts of the videos, if available, gathered with the youtube-dl library)
You'll find additional documentation in the DataCollection folder.
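To give a rough idea of what the channel step involves, here is a minimal sketch that turns a list of channel ids into YouTube Data API v3 `channels.list` request URLs. It is not the code from the DataCollection folder: the function names, the batch size of 50 ids per request and the dummy API key are assumptions made for illustration. It only builds the URLs and does not call the API.

```python
from urllib.parse import urlencode

API_BASE = "https://www.googleapis.com/youtube/v3"
# Assumed batch size for this sketch: ids are sent comma-separated, 50 at a time.
MAX_IDS_PER_REQUEST = 50


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def channel_request_urls(channel_ids, api_key):
    """Build one channels.list URL per batch of channel ids (hypothetical helper)."""
    urls = []
    for batch in chunked(list(channel_ids), MAX_IDS_PER_REQUEST):
        params = {
            # snippet: title and description, statistics: counts,
            # contentDetails: related playlist ids such as the uploads playlist
            "part": "snippet,statistics,contentDetails",
            "id": ",".join(batch),
            "key": api_key,
        }
        urls.append(f"{API_BASE}/channels?{urlencode(params)}")
    return urls


if __name__ == "__main__":
    # Made-up channel ids and key, for illustration only.
    example_ids = [f"UC{i:03d}" for i in range(120)]
    for url in channel_request_urls(example_ids, "YOUR_API_KEY"):
        print(url[:100])
```

Each URL can then be fetched with any HTTP client. Keep in mind that the API enforces a daily quota, so large channel lists may need to be spread over several days or keys.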
Contains scripts and notebooks to gather and analyse the data we used for an experiment on YouTube's recommendation system. This code still needs a lot of work.
Contains some notebooks used for the analysis of the data on right- and left-wing 'infospheres.' They just scratch the surface of possible analyses, but they can help you along.
Contains a lot of scripts, data and ideas for natural language processing. The transcripts are a real treasure. During two hackathons we wrote code to get a grip on this data. There is still a lot that needs to be done, so please consider these scripts as suggestions.
If you are interested in the data (we have gathered around 100 GB, or 500,000 videos of far-right and far-left content), please drop me a line. We won't share our comment data without a clear agreement on how to process it safely, because it is highly sensitive.
All code is written in Python 3.
Please let me know what we can do better. And please share your findings with us.