# Extract video information from YouTube channels

`yt-dlp` is a command-line program for downloading videos from YouTube and many other sites. It is a feature-rich fork of `youtube-dl` and can output video metadata in several formats, including JSON.

To use `yt-dlp` from within a Jupyter Notebook, you can either run the command-line tool through the `!` shell escape or import the `yt_dlp` Python package directly, as this notebook does.

To extract specific information such as video duration, title, URL, and view count, follow these steps:

- Use `yt-dlp` with `--dump-json` (or `extract_info(..., download=False)` from Python) to retrieve the information in JSON format.
- Parse the JSON output to filter and extract the desired fields.
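
The two steps above can be sketched in Python as follows. This is only an illustration: `pick_fields`, `WANTED_FIELDS`, and the sample record are hypothetical names and data, and a real record returned by `yt-dlp` may not contain every field listed here.

```python
import json

# Fields this notebook cares about; a real record may lack some of them.
WANTED_FIELDS = ("title", "duration", "url", "view_count")


def pick_fields(json_line, fields=WANTED_FIELDS):
    """Parse one JSON record and keep only the requested keys (None if absent)."""
    record = json.loads(json_line)
    return {key: record.get(key) for key in fields}


# Hypothetical record standing in for one JSON object describing a video.
sample_line = json.dumps({
    "id": "abc123",
    "title": "Example video",
    "duration": 1200,
    "url": "https://www.youtube.com/watch?v=abc123",
    "view_count": 42,
    "extra": "ignored",
})

video = pick_fields(sample_line)
print(video)
```

Using `dict.get` keeps the sketch tolerant of missing keys, which matters because flat playlist extraction does not guarantee every field for every video.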

In [1]:
import shutil
import yt_dlp
import json
from pathlib import Path
from tqdm import tqdm

# Extract information from known channels where the candidate is the main speaker

In [40]:
mapping = {
    "milei presidente": "milei",
    "el peluca milei": "milei",
    "javier milei": "milei",
    "sergio massa": "massa",
    "patricia bullrich": "bullrich",
}

In [37]:
target_channels = [
    "@ElPelucaMilei",
    "@MILEIPRESIDENTE",
    "@JavierMileiOK",
    "@PatriciaBullrich",
    "@SergioMassa",
]

In [38]:
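save_dir = Path("../data/youtube_data")
# start from a clean directory; ignore_errors avoids a failure when it does not exist yet
shutil.rmtree(save_dir, ignore_errors=True)
save_dir.mkdir(parents=True, exist_ok=True)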

In [56]:
def save_to_json(data, file_path):
    """
    Save data to a JSON file.

    Args:
        data (dict): The data to save.
        file_path (str): The path to the file where the data should be saved.
    """
    with open(file_path, 'w') as json_file:
        json.dump(data, json_file, indent=4)
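
Because `indent=4` spreads a single object over many lines, a file written by this helper has to be read back as a whole with `json.load`, not line by line. A minimal round-trip sketch, assuming a hypothetical payload and a temporary directory instead of the real data folder:

```python
import json
import tempfile
from pathlib import Path


def save_to_json(data, file_path):
    """Same helper as above: write `data` as indented JSON."""
    with open(file_path, "w") as json_file:
        json.dump(data, json_file, indent=4)


# Hypothetical payload shaped like a channel with two video entries.
payload = {"channel": "Example", "entries": [{"id": "a"}, {"id": "b"}]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "example.json"
    save_to_json(payload, path)
    # Count the lines to show that the output is not a single-line record.
    n_lines = len(path.read_text().splitlines())
    # Parse the file as one JSON document.
    with open(path) as json_file:
        loaded = json.load(json_file)

print(n_lines > 1, loaded == payload)
```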

In [57]:
# see help(yt_dlp.YoutubeDL) for a list of available options and public functions
for name in tqdm(target_channels):
    URL = "https://www.youtube.com/{}/videos"
    fname = f"{save_dir}/{name}.json"

    ydl_opts = {
        "extract_flat": True,
        "quiet": True
    }
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        info = ydl.extract_info(
            URL.format(name), download=False)
        # sanitize_info makes the result JSON-serializable before saving
        save_to_json(ydl.sanitize_info(info), fname)

 60%|██████    | 3/5 [00:18<00:12,  6.22s/it]

KeyboardInterrupt



## Read JSONs and concatenate files on a single `Pandas` dataframe

### YouTubeDataProcessor

The `YouTubeDataProcessor` class turns the JSON data collected from YouTube channels into a structured pandas DataFrame. Each file produced with `yt-dlp` holds channel-level fields plus a nested list of videos, so it has to be flattened before it can be analyzed.

#### Features:

1. **Directory-Based Processing**: 
   The class is initialized with a directory path containing the JSON files. It can process multiple JSON files from this directory and concatenate the results into a single dataframe.

2. **Structured Data Transformation**: 
   The class handles various preprocessing steps:
   - **Explosion of the `entries` Column**: Each JSON contains a key named `entries` that holds several entries. The class ensures that each entry gets its own row in the dataframe.
   - **Column Renaming & Deletion**: Some columns are renamed for clarity, and others are dropped to clean the data.
   - **Extraction of Nested Data**: Data nested within dictionaries is extracted into separate columns.
   - **Column Filtering**: Only necessary columns are retained in the final dataframe for a concise output.
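
The flattening described above can be sketched without pandas. This is a plain-Python stand-in for the explode-and-extract steps, not the class's actual implementation; `flatten_channel`, `CHANNEL_KEYS`, and the sample data are hypothetical.

```python
# Channel-level fields that are copied onto every video row.
CHANNEL_KEYS = ("channel_id", "channel", "uploader_url")


def flatten_channel(info):
    """Return one row (dict) per video in info['entries']."""
    rows = []
    for entry in info.get("entries") or []:
        row = dict(entry)  # video-level fields: id, url, title, duration, ...
        # Keep the channel's own title under a separate name.
        row["channel_title"] = info.get("title")
        for key in CHANNEL_KEYS:
            if key in info:
                row[key] = info[key]
        rows.append(row)
    return rows


# Hypothetical channel record with two nested videos.
info = {
    "id": "UC_example",
    "title": "Example Channel - Videos",
    "channel": "Example Channel",
    "channel_id": "UC_example",
    "uploader_url": "https://www.youtube.com/@example",
    "entries": [
        {"id": "vid1", "title": "First", "duration": 1000, "view_count": 10},
        {"id": "vid2", "title": "Second", "duration": 2000, "view_count": 20},
    ],
}

rows = flatten_channel(info)
for row in rows:
    print(row["channel"], row["id"], row["title"], row["duration"])
```

Each nested entry becomes its own row, and the channel-level values are repeated on every row, which is the shape the final dataframe needs.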

#### Usage:

Initialize the class with the directory containing your JSON files:

```python
processor = YouTubeDataProcessor("path_to_your_directory")
```

Then, call the `process_all_json_files` method to process all JSON files and get the resulting dataframe:

```python
df = processor.process_all_json_files()
```

In [48]:
import os  # needed by process_all_json_files

import pandas as pd

class YouTubeDataProcessor:
    def __init__(self, directory):
        self.directory = directory

    def _json_to_df(self, fname):
        with open(fname, 'r') as file:
            # save_to_json writes one indented JSON object per file, so parse the file as a whole
            data1 = [json.load(file)]

        # exploding the 'entries' column
        df_exploded = pd.DataFrame(data1).explode('entries')

        # rename & delete columns
        df_exploded.drop(["id", "view_count"], axis='columns', inplace=True)
        df_exploded.rename(columns={
            # "channel": "channel_name",
            # "uploader_url": "channel_uploader_url",
            "title": "channel_title",
        }, inplace=True)

        # extracting the dictionaries in the 'entries' column into separate columns
        entries_df = df_exploded['entries'].apply(pd.Series)

        # concatenating the original columns with the new columns from 'entries' & deleting duplicate cols
        df = pd.concat([df_exploded.drop('entries', axis=1), entries_df], axis=1)
        df = df.loc[:, ~df.columns.duplicated()].copy()

        # filter columns
        cols = [
            "channel_id",
            "channel",
            "uploader_url",
            "id",
            "url",
            "title",
            "duration",
            "view_count",
        ]

        return df[cols]

    def process_all_json_files(self):
        # list all JSON files in the directory
        all_files = [os.path.join(self.directory, fname) for fname in os.listdir(self.directory) if fname.endswith('.json')]

        # convert each JSON file to a dataframe and store in a list
        all_dfs = [self._json_to_df(fname) for fname in all_files]

        # concatenate all dataframes into a single dataframe
        final_df = pd.concat(all_dfs, ignore_index=True)

        # add names
        final_df['candidate_name'] = final_df['channel'].str.lower().map(mapping)

        return final_df

In [49]:
processor = YouTubeDataProcessor("../data/youtube_data")
df = processor.process_all_json_files()
df.head()

| | channel_id | channel | uploader_url | id | url | title | duration | view_count | candidate_name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | UCqz5tDLcGBJ5obqpSLSkiaQ | MILEI PRESIDENTE | https://www.youtube.com/@MILEIPRESIDENTE | 4mmsBbQMZ6o | https://www.youtube.com/watch?v=4mmsBbQMZ6o | "Este canal miente sobre mí" Milei desenmascar... | 3172.0 | 50636 | milei |
| 1 | UCqz5tDLcGBJ5obqpSLSkiaQ | MILEI PRESIDENTE | https://www.youtube.com/@MILEIPRESIDENTE | xd49Kl9XEw8 | https://www.youtube.com/watch?v=xd49Kl9XEw8 | "Mat4ron a uno de nuestros militantes" Javier ... | 592.0 | 108411 | milei |
| 2 | UCqz5tDLcGBJ5obqpSLSkiaQ | MILEI PRESIDENTE | https://www.youtube.com/@MILEIPRESIDENTE | 9e2oRKLbVUw | https://www.youtube.com/watch?v=9e2oRKLbVUw | El día que Milei debatió con Bullrich: "Ustede... | 1048.0 | 232416 | milei |
| 3 | UCqz5tDLcGBJ5obqpSLSkiaQ | MILEI PRESIDENTE | https://www.youtube.com/@MILEIPRESIDENTE | OaA5wO4ijE0 | https://www.youtube.com/watch?v=OaA5wO4ijE0 | Universitarios opinan sobre la propuesta educa... | 379.0 | 75323 | milei |
| 4 | UCqz5tDLcGBJ5obqpSLSkiaQ | MILEI PRESIDENTE | https://www.youtube.com/@MILEIPRESIDENTE | _KKjM_Y8b8c | https://www.youtube.com/watch?v=_KKjM_Y8b8c | "No me trates mal" Milei se cruza con periodis... | 2059.0 | 201391 | milei |


## Filter based on duration (15 min < x < 120 min) and view count (> 50)

In [50]:
channels = ["Javier Milei", "Sergio Massa", "Patricia Bullrich"]

In [51]:
subset_df = df[(df.duration > 60*15) & (df.duration < 60*120) & (df.view_count > 50)].drop_duplicates(subset=['duration'])
subset_df.channel.value_counts()

channel
MILEI PRESIDENTE     520
El Peluca Milei      220
Javier Milei         117
Sergio Massa          87
Patricia Bullrich     54
Name: count, dtype: int64

## Export CSV

In [52]:
subset_df[subset_df["channel"].isin(channels)].to_csv(save_dir/'data.csv', index=False)