# Extract video information from YouTube channels

`yt-dlp` is a command-line program for downloading videos from YouTube and many other sites. It is a feature-rich fork of `youtube-dl`, and it can also extract video metadata without downloading anything and output it as JSON.

To use `yt-dlp` from within a Jupyter Notebook, you can run it as a shell command with the `!` prefix or, as this notebook does, import the `yt_dlp` Python package and call it directly.

To extract specific information such as video duration, title, URL, and views, you can follow these steps:

- Use `yt-dlp` with `--dump-json` to retrieve the information in JSON format.
- Parse the JSON output to filter and extract the desired fields.
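The second step can be sketched as follows. This is a minimal illustration, not the notebook's own code: the `raw` string is a hypothetical, heavily trimmed stand-in for one JSON record, and `pick_fields` is a helper name introduced here. The field names are the ones used later in this notebook.

```python
import json

# Hypothetical, trimmed stand-in for one JSON record; real records carry many more keys.
raw = (
    '{"id": "abc123", "url": "https://www.youtube.com/watch?v=abc123", '
    '"title": "Sample video", "duration": 1641.0, "view_count": 24925, '
    '"channel": "Some channel"}'
)

# Fields we want to keep (names taken from the columns used later in this notebook).
WANTED = ("id", "url", "title", "duration", "view_count")


def pick_fields(info, fields=WANTED):
    """Return only the requested keys; missing keys become None."""
    return {key: info.get(key) for key in fields}


video = pick_fields(json.loads(raw))
print(video)
```

Using `dict.get` rather than indexing keeps the helper from raising when a record lacks one of the fields.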

In [11]:
import shutil
import json
import yt_dlp
import os
import pandas as pd

from pathlib import Path
from tqdm import tqdm

# Extract info through YouTube search (i.e., the candidate might not be the main speaker)

In [2]:
mapping = {
    "milei presidente": "milei",
    "el peluca milei": "milei",
    "javier milei": "milei",
    "sergio massa": "massa",
    "patricia bullrich": "bullrich",
}

In [3]:
save_dir = Path("../data/youtube_search_data")
if save_dir.exists():
    shutil.rmtree(save_dir)  # start from an empty directory on every run
save_dir.mkdir(parents=True, exist_ok=True)

In [4]:
search_these_names = ["Javier Milei", "Sergio Massa", "Patricia Bullrich"]
URL = 'https://www.youtube.com/results?search_query=%22{}%22&sp=EgQQARgC'

In [5]:
def save_to_json(data, file_path):
    """
    Save data to a JSON file.

    Args:
        data (dict): The data to save.
        file_path (str): The path to the file where the data should be saved.
    """
    with open(file_path, 'w') as json_file:
        json.dump(data, json_file, indent=4)

In [6]:
# see help(yt_dlp.YoutubeDL) for a list of available options and public functions

for c in tqdm(search_these_names):
    name = c.replace(' ','+').lower()
    fname = f"{save_dir}/{name}.json"

    ydl_opts = {
        "extract_flat": True,
        "quiet": True
    }
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        info = ydl.extract_info(
            URL.format(name), download=False)
        save_to_json(info, fname)

100%|██████████| 3/3 [00:46<00:00, 15.40s/it]


## Read JSONs and concatenate files into a single pandas DataFrame

### YouTubeDataProcessor

The `YouTubeDataProcessor` class transforms JSON data obtained from YouTube with tools like `yt-dlp` into a structured pandas DataFrame. Each JSON file nests its results under an `entries` key, so the data has to be flattened and cleaned before it is suitable for analysis.

#### Features:

1. **Directory-Based Processing**: 
   The class is initialized with a directory path containing the JSON files. It can process multiple JSON files from this directory and concatenate the results into a single dataframe.

2. **Structured Data Transformation**: 
   The class handles various preprocessing steps:
   - **Explosion of the `entries` Column**: Each JSON contains a key named `entries` that holds several entries. The class ensures that each entry gets its own row in the dataframe.
   - **Column Renaming & Deletion**: Some columns are renamed for clarity, and others are dropped to clean the data.
   - **Extraction of Nested Data**: Data nested within dictionaries is extracted into separate columns.
   - **Column Filtering**: Only necessary columns are retained in the final dataframe for a concise output.
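The explode-and-extract steps above can be illustrated on a toy record. The sketch below uses `pd.json_normalize`, which is a swapped-in alternative and not the approach the class itself takes (the class uses `explode` followed by `apply(pd.Series)`). The `data` dictionary and its values are hypothetical placeholders that only mimic the `title`, `extractor_key`, and `entries` keys described here.

```python
import pandas as pd

# Hypothetical, heavily trimmed stand-in for one saved search result.
data = {
    "title": '"sergio massa"',             # top-level title holds the search term
    "extractor_key": "ExampleExtractor",   # placeholder value
    "entries": [
        {"id": "vid1", "title": "First video", "duration": 1641.0, "view_count": 24925},
        {"id": "vid2", "title": "Second video", "duration": 3024.0, "view_count": 49156},
    ],
}

flat = pd.json_normalize(
    data,
    record_path="entries",            # one row per entry
    meta=["title", "extractor_key"],  # repeat top-level fields on every row
    meta_prefix="search_",            # avoids a clash with each entry's own "title"
).rename(columns={"search_title": "search_term"})

print(flat[["search_term", "id", "title", "duration", "view_count"]])
```

The prefix is needed because both the top-level record and each entry have a `title` key; without it, pandas refuses to merge the conflicting names.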

#### Usage:

Initialize the class with the directory containing your JSON files:

```python
processor = YouTubeDataProcessor("path_to_your_directory")
```

Then, call the `process_all_json_files` method to process all JSON files and get the resulting dataframe:

```python
df = processor.process_all_json_files()
```

In [25]:
class YouTubeDataProcessor:
    def __init__(self, directory):
        self.directory = directory

    def _json_to_df(self, fname):
        with open(fname, 'r') as file:
            data1 = json.load(file)

        # exploding the 'entries' column
        df_exploded = pd.DataFrame([data1]).explode('entries')

        # keep only the desired columns and rename the 'title' column to 'search_term'
        df_filtered = df_exploded[['title', 'extractor_key', 'entries']].rename(columns={'title': 'search_term'})

        # extracting the dictionaries in the 'entries' column into separate columns
        entries_df = df_filtered['entries'].apply(pd.Series)

        # concatenating the original columns with the new columns from 'entries' & deleting duplicate cols
        df = pd.concat([df_filtered.drop('entries', axis=1), entries_df], axis=1)
        df = df.loc[:, ~df.columns.duplicated()].copy()

        cols = [
            "search_term",
            # "extractor_key",
            "channel_id",
            "channel",
            "uploader_url",
            "id",
            "url",
            "title",
            "duration",
            "view_count",
        ]

        return df[cols]

    def process_all_json_files(self):
        # list all JSON files in the directory
        all_files = [os.path.join(self.directory, fname) for fname in os.listdir(self.directory) if fname.endswith('.json')]

        # convert each JSON file to a dataframe and store in a list
        all_dfs = [self._json_to_df(fname) for fname in all_files]

        # concatenate all dataframes into a single dataframe
        final_df = pd.concat(all_dfs, ignore_index=True)

        # add names
        final_df['candidate_name'] = final_df['search_term'].str.replace('"',"").map(mapping)

        return final_df

In [27]:
processor = YouTubeDataProcessor(save_dir)
df = processor.process_all_json_files()
df.head()

| | search_term | channel_id | channel | uploader_url | id | url | title | duration | view_count | candidate_name |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | "sergio massa" | UCj6PcyLvpnIRT_2W_mwa9Aw | Todo Noticias | https://www.youtube.com/@todonoticias | r_LjH59QgAs | https://www.youtube.com/watch?v=r_LjH59QgAs | Patricia Bullrich, Javier Milei y Sergio Massa... | 5511.0 | 642300.0 | massa |
| 1 | "sergio massa" | UCFgk2Q2mVO1BklRQhSv6p0w | C5N | https://www.youtube.com/@c5n | 4K5UUfrOynU | https://www.youtube.com/watch?v=4K5UUfrOynU | SERGIO MASSA expuso en el CONSEJO de las AMÉRICAS | 1641.0 | 24925.0 | massa |
| 2 | "sergio massa" | UCFgk2Q2mVO1BklRQhSv6p0w | C5N | https://www.youtube.com/@c5n | WyBhiYbSBEs | https://www.youtube.com/watch?v=WyBhiYbSBEs | SERGIO MASSA en DURO DE DOMAR \| ENTREVISTA COM... | 3024.0 | 49156.0 | massa |
| 3 | "sergio massa" | UCFgk2Q2mVO1BklRQhSv6p0w | C5N | https://www.youtube.com/@c5n | ELI3qGzOTOI | https://www.youtube.com/watch?v=ELI3qGzOTOI | Los ANUNCIOS de SERGIO MASSA: CONOCÉ las MEDID... | 1205.0 | 49651.0 | massa |
| 4 | "sergio massa" | UCFgk2Q2mVO1BklRQhSv6p0w | C5N | https://www.youtube.com/@c5n | WAa-PpXCwms | https://www.youtube.com/watch?v=WAa-PpXCwms | PPT: SERGIO MASSA, mano a mano con DADY BRIEVA... | 2602.0 | 94227.0 | massa |


## Filter based on duration (15 min < x < 120 min) and view count (> 100)

In [28]:
subset_df = df[(df.duration > 60*15) & (df.duration < 60*120) & (df.view_count > 100)].drop_duplicates(subset=['duration'])
subset_df.search_term.value_counts()

search_term
"sergio massa"         382
"patricia bullrich"    307
"javier milei"         297
Name: count, dtype: int64
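Note that the filter cell deduplicates on `duration`, so two different videos that happen to have the same length collapse into one row. The toy example below, with made-up values, contrasts that with deduplicating on `id`. It is only a sketch of the difference, not a claim about which behavior was intended.

```python
import pandas as pd

# Made-up rows: "b" appears twice (a true duplicate), while "c" is a
# different video that happens to share b's duration.
toy = pd.DataFrame({
    "id": ["a", "b", "b", "c", "d"],
    "duration": [1200, 1800, 1800, 1800, 300],
    "view_count": [5000, 250, 250, 90000, 10],
})

MIN_SECONDS = 15 * 60   # 15 minutes
MAX_SECONDS = 120 * 60  # 120 minutes
MIN_VIEWS = 100

in_range = (
    (toy["duration"] > MIN_SECONDS)
    & (toy["duration"] < MAX_SECONDS)
    & (toy["view_count"] > MIN_VIEWS)
)

by_duration = toy[in_range].drop_duplicates(subset=["duration"])
by_id = toy[in_range].drop_duplicates(subset=["id"])

print(list(by_duration["id"]))  # video "c" is lost along with the duplicate "b"
print(list(by_id["id"]))        # only the repeated "b" row is dropped
```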

## Export CSV

In [29]:
subset_df.to_csv(save_dir/'data.csv', index=False)