# Extract video information from YouTube channels

`yt-dlp` is a command-line program for downloading videos from YouTube and many other sites. It is a feature-rich fork of `youtube-dl` and can output video metadata in several formats, including JSON.

To run `yt-dlp` from within a Jupyter Notebook, prefix the command with `!`, which passes the rest of the line to the system shell.
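Outside a notebook the `!` magic is not available, so the same command can be run through `subprocess`. The following is a minimal sketch, not part of the original notebook: the helper names `build_command` and `dump_channel` are hypothetical, while the flags `--flat-playlist` and `--dump-single-json` are the ones used in the download cell of this notebook. It assumes `yt-dlp` is installed and on the `PATH`.

```python
import subprocess
from pathlib import Path
from typing import List


def build_command(channel: str) -> List[str]:
    """Build the yt-dlp argument list for one channel handle, e.g. "@SergioMassa"."""
    url = f"https://www.youtube.com/{channel}/videos"
    return ["yt-dlp", "--flat-playlist", "--dump-single-json", url]


def dump_channel(channel: str, out_dir: Path) -> Path:
    """Run yt-dlp and write its JSON output to <out_dir>/<channel>.json."""
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{channel}.json"
    with open(out_path, "w", encoding="utf-8") as fh:
        # check=True raises CalledProcessError if yt-dlp exits with an error
        subprocess.run(build_command(channel), stdout=fh, check=True)
    return out_path
```

Calling `dump_channel("@SergioMassa", Path("../data/youtube_data"))` would then play the role of one iteration of the notebook's download loop. Passing the arguments as a list avoids shell quoting issues with the URL.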

To extract specific information such as video duration, title, URL, and views, you can follow these steps:

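- Use `yt-dlp` with `--dump-json` (one JSON object per video) or `--dump-single-json` (one JSON object for the whole channel or playlist, as used below) to retrieve the information in JSON format.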
- Parse the JSON output to filter and extract the desired fields.
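
The two steps above can be sketched as follows. This is an illustration under stated assumptions rather than the notebook's actual processing code: `sample_output` is a hypothetical, heavily trimmed stand-in for what `yt-dlp --dump-single-json` prints (per-video records under an `entries` key), and `extract_fields` is a hypothetical helper name.

```python
import json

# Hypothetical, heavily trimmed stand-in for the output of `yt-dlp --dump-single-json`
sample_output = json.dumps({
    "channel": "Example Channel",
    "entries": [
        {"id": "abc123", "title": "First video",
         "url": "https://www.youtube.com/watch?v=abc123",
         "duration": 1500.0, "view_count": 2000},
        {"id": "def456", "title": "Second video",
         "url": "https://www.youtube.com/watch?v=def456",
         "duration": 90.0, "view_count": 50},
    ],
})

FIELDS = ("title", "url", "duration", "view_count")


def extract_fields(raw_json, fields=FIELDS):
    """Return one dict per video, keeping only the requested fields."""
    data = json.loads(raw_json)
    # .get() yields None for a missing key instead of raising KeyError
    return [{f: entry.get(f) for f in fields} for entry in data.get("entries", [])]


videos = extract_fields(sample_output)
for video in videos:
    print(video["title"], video["duration"], video["view_count"])
```

Using `.get()` keeps the sketch tolerant of records that lack a field, which can happen when metadata is only partially available.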

In [1]:
import shutil
from pathlib import Path
from tqdm import tqdm

In [2]:
youtube_channels = [
    # "@lanacion",
    # "todonoticias",
    # "@Infobae",
    "@ElPelucaMilei",
    "@MILEIPRESIDENTE",
    "@JavierMileiOK",
    "@PatriciaBullrich",
    "@SergioMassa"
]

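save_dir = Path("../data/youtube_data")
# Start from a clean directory; ignore_errors avoids a failure on the first run,
# when the directory does not exist yet
shutil.rmtree(save_dir, ignore_errors=True)
save_dir.mkdir(parents=True, exist_ok=True)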

In [3]:
for c in tqdm(youtube_channels):
    url = f"https://www.youtube.com/{c}/videos"
    fname = f"{save_dir}/{c}.json"

    !yt-dlp --flat-playlist --dump-single-json $url > $fname

100%|██████████| 5/5 [00:28<00:00,  5.78s/it]


## Read the JSON files and concatenate them into a single pandas DataFrame

### YouTubeDataProcessor

The `YouTubeDataProcessor` class converts the JSON files that `yt-dlp` produces for each YouTube channel into a single structured pandas DataFrame. Each file nests its per-video records inside one channel-level object, so the data must be flattened before it can be analyzed.

#### Features:

1. **Directory-Based Processing**: 
   The class is initialized with a directory path containing the JSON files. It can process multiple JSON files from this directory and concatenate the results into a single dataframe.

2. **Structured Data Transformation**: 
   The class handles various preprocessing steps:
   - **Explosion of the `entries` Column**: Each JSON contains a key named `entries` that holds several entries. The class ensures that each entry gets its own row in the dataframe.
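   - **Column Renaming & Deletion**: Channel-level fields such as `title` and `description` are renamed with a `channel_` prefix, and the channel-level `id` and `view_count` are dropped, so they do not collide with the per-video fields of the same name.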
   - **Extraction of Nested Data**: Data nested within dictionaries is extracted into separate columns.
   - **Column Filtering**: Only necessary columns are retained in the final dataframe for a concise output.
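
The explosion and nested-data steps listed above can be illustrated on a toy record. This is a minimal sketch, assuming a hypothetical trimmed channel object; it uses `pd.json_normalize` to expand the entry dictionaries, which is a substitute for the `apply(pd.Series)` call used in the class itself.

```python
import pandas as pd

# Hypothetical, trimmed channel record with two videos under "entries"
record = {
    "channel": "Example Channel",
    "entries": [
        {"id": "abc123", "title": "First video", "duration": 1500.0},
        {"id": "def456", "title": "Second video", "duration": 90.0},
    ],
}

# One row per channel; the "entries" cell holds a list of dicts
channel_df = pd.DataFrame([record])

# explode() gives each list element its own row and repeats the channel columns
exploded = channel_df.explode("entries").reset_index(drop=True)

# Expand each entry dict into columns, then place them next to the channel columns
entries = pd.json_normalize(exploded["entries"].tolist())
flat = pd.concat([exploded.drop(columns="entries"), entries], axis=1)

print(flat)
```

Resetting the index after `explode` gives every row a unique label, so the column-wise `concat` aligns rows one to one.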

#### Usage:

Initialize the class with the directory containing your JSON files:

```python
processor = YouTubeDataProcessor("path_to_your_directory")
```

Then, call the `process_all_json_files` method to process all JSON files and get the resulting dataframe:

```python
df = processor.process_all_json_files()
```

In [55]:
import os
import json
import pandas as pd

class YouTubeDataProcessor:
    def __init__(self, directory):
        self.directory = directory

    def _json_to_df(self, fname):
        # yt-dlp writes each JSON document on a single line, so parse the file line by line
        with open(fname, 'r') as file:
            records = [json.loads(line) for line in file if line.strip()]  # skip blank lines

        # explode the 'entries' list so that each video gets its own row
        df_exploded = pd.DataFrame(records).explode('entries')

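        # drop the channel-level 'id' and 'view_count' (the per-video values come from 'entries'),
        # and prefix the remaining channel-level fields so they do not clash with per-video fields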
        df_exploded.drop(["id", "view_count"], axis='columns', inplace=True)
        df_exploded.rename(columns={
            "channel": "channel_name",
            "description": "channel_description",
            "uploader_url": "channel_uploader_url",
            "title": "channel_title",
        }, inplace=True)

        # extracting the dictionaries in the 'entries' column into separate columns
        entries_df = df_exploded['entries'].apply(pd.Series)

        # concatenating the original columns with the new columns from 'entries' & deleting duplicate cols
        df = pd.concat([df_exploded.drop('entries', axis=1), entries_df], axis=1)
        df = df.loc[:, ~df.columns.duplicated()].copy()

        # filter columns
        cols = [
            "channel_id",
            "channel_name",
            "channel_description",
            "channel_uploader_url",
            "channel_title",
            "id",
            "url",
            "title",
            "description",
            "duration",
            "view_count",
        ]

        return df[cols]

    def process_all_json_files(self):
        # List all JSON files in the directory
        all_files = [os.path.join(self.directory, fname) for fname in os.listdir(self.directory) if fname.endswith('.json')]

        # Convert each JSON file to a dataframe and store in a list
        all_dfs = [self._json_to_df(fname) for fname in all_files]

        # Concatenate all dataframes into a single dataframe
        final_df = pd.concat(all_dfs, ignore_index=True)

        return final_df

In [56]:
processor = YouTubeDataProcessor("../data/youtube_data")
df = processor.process_all_json_files()
df.head()

| | channel_id | channel_name | channel_description | channel_uploader_url | channel_title | id | url | title | description | duration | view_count |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | UCqz5tDLcGBJ5obqpSLSkiaQ | MILEI PRESIDENTE | Javier Milei es la persona idónea para sacarno... | https://www.youtube.com/@MILEIPRESIDENTE | MILEI PRESIDENTE - Videos | 9e2oRKLbVUw | https://www.youtube.com/watch?v=9e2oRKLbVUw | El día que Milei debatió con Bullrich: "Ustede... | Visita nuestro canal colega Daro Darito:\nhttp... | 1048.0 | 191598 |
| 1 | UCqz5tDLcGBJ5obqpSLSkiaQ | MILEI PRESIDENTE | Javier Milei es la persona idónea para sacarno... | https://www.youtube.com/@MILEIPRESIDENTE | MILEI PRESIDENTE - Videos | OaA5wO4ijE0 | https://www.youtube.com/watch?v=OaA5wO4ijE0 | Universitarios opinan sobre la propuesta educa... | Visita nuestro canal colega Daro Darito:\nhttp... | 379.0 | 70979 |
| 2 | UCqz5tDLcGBJ5obqpSLSkiaQ | MILEI PRESIDENTE | Javier Milei es la persona idónea para sacarno... | https://www.youtube.com/@MILEIPRESIDENTE | MILEI PRESIDENTE - Videos | _KKjM_Y8b8c | https://www.youtube.com/watch?v=_KKjM_Y8b8c | "No me trates mal" Milei se cruza con periodis... | Visita nuestro canal colega Daro Darito:\nhttp... | 2059.0 | 197448 |
| 3 | UCqz5tDLcGBJ5obqpSLSkiaQ | MILEI PRESIDENTE | Javier Milei es la persona idónea para sacarno... | https://www.youtube.com/@MILEIPRESIDENTE | MILEI PRESIDENTE - Videos | iCiCTD9hZsc | https://www.youtube.com/watch?v=iCiCTD9hZsc | "No descarto asumir antes de tiempo" Imperdibl... | Visita nuestro canal colega Daro Darito:\nhttp... | 3483.0 | 55979 |
| 4 | UCqz5tDLcGBJ5obqpSLSkiaQ | MILEI PRESIDENTE | Javier Milei es la persona idónea para sacarno... | https://www.youtube.com/@MILEIPRESIDENTE | MILEI PRESIDENTE - Videos | NC2JP0lXMuk | https://www.youtube.com/watch?v=NC2JP0lXMuk | Milei elimina ministerios en vivo- 15/08/23 | Visita nuestro canal colega Daro Darito:\nhttp... | 895.0 | 632157 |


## Select videos by duration and view count

In [74]:
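# Keep videos between 20 and 120 minutes long (duration is in seconds) with more than 1,000 views,
# then keep only the first video for each distinct duration value
is_long_form = (df.duration > 60 * 20) & (df.duration < 60 * 120)
is_popular = df.view_count > 1000
subset_df = df[is_long_form & is_popular].drop_duplicates(subset=['duration'])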
subset_df.channel_name.value_counts()

channel_name
MILEI PRESIDENTE     383
El Peluca Milei      194
Javier Milei          94
Sergio Massa          31
Patricia Bullrich     22
Name: count, dtype: int64

In [76]:
subset_df[subset_df["channel_name"].isin(["Javier Milei", "Sergio Massa", "Patricia Bullrich"])].to_csv(save_dir/'data.csv', index=False)