# Videos Statistics

This notebook computes descriptive statistics of the videos encountered during the study.

In [1]:
import os
import pandas as pd
import numpy as np
import seaborn as sns

In [2]:
RAW_DATA_PATH = os.path.join('..', 'Data', 'raw_data')
NORMALIZED_DATA_PATH = os.path.join('..', 'Data', 'normalized_data')

## Encountered videos

In [3]:
home = pd.read_csv(os.path.join(RAW_DATA_PATH, 'home_page_results.csv'))
search = pd.read_csv(os.path.join(RAW_DATA_PATH, 'search_results.csv'))
recommend = pd.read_csv(os.path.join(RAW_DATA_PATH, 'recommendations.csv'))

In [4]:
encountered_home = home.groupby('youtube_id').size()
encountered_search = search.groupby('youtube_id').size()
encountered_recommend = recommend.groupby('youtube_id').size()

In [5]:
print(f'There are {home.shape[0]} videos encountered on home page, {encountered_home.shape[0]} are unique.')
print(f'There are {search.shape[0]} videos encountered in search results, {encountered_search.shape[0]} are unique.')
print(f'There are {recommend.shape[0]} videos encountered in recommendations, {encountered_recommend.shape[0]} are unique.')

There are 116479 videos encountered on home page, 9977 are unique.
There are 201404 videos encountered in search results, 942 are unique.
There are 78763 videos encountered in recommendations, 8526 are unique.


In [6]:
total = pd.concat([home, search, recommend])

In [7]:
print(f'In total, there are {total.shape[0]} encountered videos, {len(total["youtube_id"].unique())} are unique.')

In total, there are 396646 encountered videos, 17405 are unique.
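
The per-source unique counts sum to more than the overall unique count (9,977 + 942 + 8,526 = 19,445 versus 17,405), so some videos must appear in more than one source. Below is a minimal sketch of how such overlap could be inspected with set operations. The video IDs are made up for illustration; in the real data they would come from the `youtube_id` columns of `home`, `search` and `recommend`.

```python
# Hypothetical video IDs per source (not taken from the study data).
home_ids = {'a', 'b', 'c'}
search_ids = {'c', 'd'}
recommend_ids = {'b', 'c', 'e'}

# Union: every video seen in at least one source.
all_ids = home_ids | search_ids | recommend_ids
# Intersection: videos seen in all three sources.
in_all_three = home_ids & search_ids & recommend_ids

# Summing per-source uniques counts shared videos more than once.
sum_of_uniques = len(home_ids) + len(search_ids) + len(recommend_ids)
print(f'{sum_of_uniques} per-source uniques, {len(all_ids)} overall, '
      f'{len(in_all_three)} in all three sources.')
```

With real data, the sets could be built with, e.g., `set(home['youtube_id'])`.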


## Distribution of manual labels

The promoting and debunking categories include videos both related and unrelated to the respective topics. The debunking category also includes videos manually annotated as mocking. In the further analyses presented in the paper, as well as in the classification model, we do not distinguish between neutral videos and videos labeled as not about misinformation (both are treated as neutral). The other category covers videos that were manually labeled as unknown, non-English, or removed.

We do not publish predicted labels for individual videos due to ethical concerns.

In [8]:
encountered_normalized = pd.read_csv(os.path.join(NORMALIZED_DATA_PATH, 'encountered_videos.csv'))

In [9]:
print(f'There are {encountered_normalized.shape[0]} manually labeled videos.')

There are 2914 manually labeled videos.


In [10]:
encountered_normalized['original_label'].value_counts()

not about misinfo    1827
debunking             628
promoting             244
neutral               186
other                  29
Name: original_label, dtype: int64
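
As noted above, `neutral` and `not about misinfo` are treated as a single neutral class in further analyses. The following is a minimal sketch of that merge applied to the counts printed above. It is an illustration only, not the code used in the paper; the variable names and the `rename`/`groupby` approach are assumptions.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
original_counts = pd.Series({
    'not about misinfo': 1827,
    'debunking': 628,
    'promoting': 244,
    'neutral': 186,
    'other': 29,
})

# Relabel 'not about misinfo' as 'neutral', then sum counts that share a label.
merged_counts = (
    original_counts
    .rename(index={'not about misinfo': 'neutral'})
    .groupby(level=0)
    .sum()
)
print(merged_counts)
```

Under this mapping, the neutral class contains 1827 + 186 = 2013 of the 2914 manually labeled videos.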

## Descriptive statistics

This section reports descriptive statistics of the collected data with respect to video length (#minutes) and the number of times each video appeared during data collection (#encounters). The statistics are computed for the subset of 15,837 videos (out of all 17,405 encountered videos) for which we were able to obtain metadata. We collected the videos' metadata using the YouTube API some time after the data collection itself, so we were not able to get metadata for all encountered videos, e.g., when videos had been removed by their authors or by the platform.
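
To make the size of the metadata gap concrete, the short calculation below derives the number of videos without metadata and the resulting coverage from the two counts reported in this notebook. The variable names are illustrative.

```python
# Counts reported in this notebook.
total_unique = 17405   # unique encountered videos
with_metadata = 15837  # videos for which metadata could be retrieved

# Videos whose metadata could not be retrieved, and the share that could.
missing = total_unique - with_metadata
coverage = with_metadata / total_unique

print(f'{missing} videos have no metadata ({coverage:.1%} coverage).')
```

That is, metadata is missing for 1,568 videos, and roughly 91% of the encountered videos are covered by the statistics below.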

Most of the columns of `videos_metadata.csv` are empty. Use [`get-train-and-encountered-data.ipynb`](get-train-and-encountered-data.ipynb) to retrieve additional metadata.

In [11]:
metadata = pd.read_csv(os.path.join(NORMALIZED_DATA_PATH, 'videos_metadata.csv'))

In [12]:
metadata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15837 entries, 0 to 15836
Data columns (total 18 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   youtube_id             15837 non-null  object 
 1   published_at           0 non-null      float64
 2   title                  0 non-null      float64
 3   description            0 non-null      float64
 4   channel_id             0 non-null      float64
 5   language               0 non-null      float64
 6   duration_seconds       15837 non-null  int64  
 7   duration_minutes       15837 non-null  float64
 8   duration_hours         15837 non-null  float64
 9   view_count             0 non-null      float64
 10  like_count             0 non-null      float64
 11  dislike_count          0 non-null      float64
 12  favorite_count         0 non-null      float64
 13  comment_count          0 non-null      float64
 14  encountered_home       15837 non-null  int64  
 15  en

In [13]:
metadata['duration_minutes'].describe(percentiles=[.25, .5, .75, .9])

count    15837.000000
mean        45.418716
std        108.868920
min          0.000000
25%          7.733333
50%         14.733333
75%         39.716667
90%         92.693333
max       1439.883333
Name: duration_minutes, dtype: float64

In [14]:
metadata['encountered_all'].describe(percentiles=[.25, .5, .75, .9])

count    15837.000000
mean        20.408600
std        112.920578
min          1.000000
25%          1.000000
50%          2.000000
75%          5.000000
90%         21.000000
max       2695.000000
Name: encountered_all, dtype: float64