### Data exploration

Exploration of the Spotify and YouTube data that we currently have, with the goal of
figuring out how to map the two datasets to each other.

In [18]:
# Add the parent /src directory to sys.path so this notebook can import project modules.
import sys
import os
from typing import Dict

parent_directory = os.path.abspath(os.path.join(os.getcwd(), os.pardir))
sys.path.append(parent_directory)

In [8]:
from transformations.enrichment.map_podcasts import get_map_tables_to_sqlite_data

In [9]:
map_tables_to_sqlite_data = get_map_tables_to_sqlite_data()

In [10]:
map_tables_to_sqlite_data.keys()

dict_keys(['channels', 'videos', 'spotify_show', 'spotify_episode'])

In [11]:
youtube_data = {
    "channels": map_tables_to_sqlite_data["channels"],
    "videos": map_tables_to_sqlite_data["videos"]
}

spotify_data = {
    "spotify_show": map_tables_to_sqlite_data["spotify_show"],
    "spotify_episode": map_tables_to_sqlite_data["spotify_episode"]
}

## How can we link a podcast episode from Spotify with one that is on YouTube?

First, let's look at the videos on YouTube and see if there are any discernible fields that we could join on.

In [13]:
videos_df = youtube_data["videos"]

In [14]:
videos_df.head()

Unnamed: 0,video_id,video_title,channel_id,channel_title,category_id,default_audio_language,default_language,description,live_broadcast_content,published_at,tags,view_count,like_count,favorite_count,comment_count,synctimestamp
0,qPKd99Pa2iU,Dr. Paul Conti: How to Improve Your Mental Hea...,UC2D2CMWXMOVWx7giW1n3LIg,Andrew Huberman,28,en,en,This is episode 2 of a 4-part special series o...,none,2023-09-13T12:00:49Z,"andrew huberman,huberman lab podcast,huberman ...",80702,2217,0,288,2023-09-14T22:45:02Z
1,z8c6EyMNd0A,Journal Club with Dr. Peter Attia | Metformin ...,UC2D2CMWXMOVWx7giW1n3LIg,Andrew Huberman,28,en,en,"In this journal club episode, my guest is Stan...",none,2023-09-11T12:04:21Z,"andrew huberman,huberman lab podcast,huberman ...",122801,3349,0,360,2023-09-14T22:47:18Z
2,tLRCS48Ens4,Dr. Paul Conti: How to Understand & Assess You...,UC2D2CMWXMOVWx7giW1n3LIg,Andrew Huberman,28,en,en,This is episode 1 of a 4-part special series o...,none,2023-09-06T12:00:50Z,"andrew huberman,huberman lab podcast,huberman ...",618882,10408,0,1246,2023-09-14T22:47:18Z
3,yixIc1Ai6jM,"Marc Andreessen: How Risk Taking, Innovation &...",UC2D2CMWXMOVWx7giW1n3LIg,Andrew Huberman,28,en,en,"In this episode, my guest is Marc Andreessen, ...",none,2023-09-04T12:00:51Z,"andrew huberman,huberman lab podcast,huberman ...",95299,2501,0,349,2023-09-14T22:47:18Z
4,eJU6Df_ffAE,"AMA #10: Benefits of Nature & “Grounding,"" Hea...",UC2D2CMWXMOVWx7giW1n3LIg,Andrew Huberman,28,en,en,Welcome to a preview of the 10th Ask Me Anythi...,none,2023-08-30T12:00:36Z,"andrew huberman,huberman lab podcast,huberman ...",96531,2588,0,291,2023-09-14T22:47:18Z


Now let's look at the Spotify episode data and see which fields are available there.

In [15]:
spotify_episodes_df = spotify_data["spotify_episode"]

In [16]:
spotify_episodes_df.head()

Unnamed: 0,id,show_id,audio_preview_url,description,html_description,duration_ms,explicit,href,is_externally_hosted,is_playable,languages,name,release_date,release_date_precision,type,uri,synctimestamp
0,5gpVKImVa70cWqSjZ5BCE8,79CkJF3UJTHFV8Dse3Oy0P,https://p.scdn.co/mp3-preview/190c568827ab8fbf...,This is episode 2 of a 4-part special series o...,<p>This is episode 2 of a 4-part special serie...,11718060,0,https://api.spotify.com/v1/episodes/5gpVKImVa7...,0,1,en,GUEST SERIES | Dr. Paul Conti: How to Improve ...,2023-09-13,day,episode,spotify:episode:5gpVKImVa70cWqSjZ5BCE8,2023-09-14T20:48:44Z
1,0W8MZ0rFL48XINmx9aCnak,79CkJF3UJTHFV8Dse3Oy0P,https://p.scdn.co/mp3-preview/8a8d899025dae4b2...,"In this journal club episode, my guest is Stan...","<p>In this journal club episode, my guest is S...",8478928,0,https://api.spotify.com/v1/episodes/0W8MZ0rFL4...,0,1,en,Journal Club with Dr. Peter Attia | Metformin ...,2023-09-11,day,episode,spotify:episode:0W8MZ0rFL48XINmx9aCnak,2023-09-14T20:48:44Z
2,346tNMjQVlGwhSAXkwEPUw,79CkJF3UJTHFV8Dse3Oy0P,https://p.scdn.co/mp3-preview/6de4f77dcac9dc7a...,This is episode 1 of a 4-part special series o...,<p>This is episode 1 of a 4-part special serie...,13369704,0,https://api.spotify.com/v1/episodes/346tNMjQVl...,0,1,en,GUEST SERIES | Dr. Paul Conti: How to Understa...,2023-09-06,day,episode,spotify:episode:346tNMjQVlGwhSAXkwEPUw,2023-09-14T20:48:44Z
3,0qkK5lNuRFJq0o11Js5Hvl,79CkJF3UJTHFV8Dse3Oy0P,https://p.scdn.co/mp3-preview/792f397f484948c7...,"In this episode, my guest is Marc Andreessen, ...","<p>In this episode, my guest is Marc Andreesse...",10621701,0,https://api.spotify.com/v1/episodes/0qkK5lNuRF...,0,1,en,"Marc Andreessen: How Risk Taking, Innovation &...",2023-09-04,day,episode,spotify:episode:0qkK5lNuRFJq0o11Js5Hvl,2023-09-14T20:48:44Z
4,3uHUrOpGXzO0Jo8hEMGliR,79CkJF3UJTHFV8Dse3Oy0P,https://p.scdn.co/mp3-preview/c92d13aaf313bb88...,Welcome to a preview of the 10th Ask Me Anythi...,<p>Welcome to a preview of the 10th Ask Me Any...,1293609,0,https://api.spotify.com/v1/episodes/3uHUrOpGXz...,0,1,en,"AMA #10: Benefits of Nature & “Grounding,"" Hea...",2023-08-30,day,episode,spotify:episode:3uHUrOpGXzO0Jo8hEMGliR,2023-09-14T20:48:44Z


What are some ways that we could link podcast episodes together?
1. Do they have similar titles?
2. Were they posted at similar times?
3. Were they posted by the same channel? (would need a way to map the channel names together across integrations)
4. Are the descriptions similar?

Based on the data that we have, these are the most promising signals for linking the two datasets. We can apply some form of fuzzy matching to most of them, to varying degrees, and produce a confidence score for each candidate mapping. This should work as a first pass, but the quality of the mappings depends on the quality and consistency of the underlying data, so it's unclear how well this will scale.
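
As a rough illustration of the fuzzy-matching idea, here is a minimal sketch that scores two titles with `difflib.SequenceMatcher` from the standard library. The function name `title_similarity` and the normalization step (lowercasing, collapsing whitespace) are assumptions made for this example, not something the project already defines, and a dedicated fuzzy-matching library might turn out to be a better fit.

```python
from difflib import SequenceMatcher


def title_similarity(youtube_title: str, spotify_title: str) -> float:
    """Hypothetical helper: similarity between two titles, from 0 to 1."""
    # Normalize case and whitespace so cosmetic differences don't count.
    a = " ".join(youtube_title.lower().split())
    b = " ".join(spotify_title.lower().split())
    # Treat a missing title as "no evidence" rather than a perfect match
    # (SequenceMatcher reports 1.0 for two empty strings).
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a, b).ratio()


# Shortened titles for illustration. The Spotify title carries a
# "GUEST SERIES | " prefix, so an exact comparison would fail where a
# fuzzy one still scores high.
score = title_similarity(
    "Dr. Paul Conti: How to Improve",
    "GUEST SERIES | Dr. Paul Conti: How to Improve",
)
```

The same approach could be reused for descriptions, although long descriptions may need a cheaper comparison than a full character-level diff.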

In [23]:
def fuzzy_match_titles(
    youtube_video_title: str, spotify_episode_title: str
) -> float:
    """Performs fuzzy matching of the titles.
    
    Returns a float between 0 and 1 to indicate degree of matching.
    """
    pass


def channel_names_match(
    youtube_channel_name: str, spotify_podcast_name: str
) -> bool:
    """Checks to see if the YouTube and Spotify channel/podcast names match.
    
    This doesn't have to be fuzzy match; we should be able to use a hardcoded
    map in order to see if the names are actually matching.
    """
    pass


def fuzzy_match_descriptions(
    youtube_video_description: str,
    spotify_episode_description: str
) -> float:
    """Performs fuzzy matching of the descriptions.
    
    Returns a float between 0 and 1 to indicate degree of matching.
    """
    pass


def youtube_video_and_spotify_episode_posted_same_time(
    youtube_video_post_date: str,
    spotify_episode_post_date: str
) -> bool:
    """Checks whether the video and the episode were posted within an
    acceptable range of each other. The range could vary depending on a
    variety of factors, but a constant works as a first pass."""
    pass
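
A minimal sketch of the posting-date check, assuming the date formats seen in the sample rows above: YouTube's `published_at` is a UTC timestamp such as `2023-09-13T12:00:49Z`, and Spotify's `release_date` is `YYYY-MM-DD` when `release_date_precision` is `day`. The name `posted_within_days` and the two-day default are placeholders chosen for illustration.

```python
from datetime import date, datetime


def posted_within_days(
    youtube_published_at: str, spotify_release_date: str, max_days: int = 2
) -> bool:
    """Hypothetical helper: True if both dates are within `max_days`."""
    # YouTube's `published_at` looks like "2023-09-13T12:00:49Z".
    youtube_date = datetime.strptime(
        youtube_published_at, "%Y-%m-%dT%H:%M:%SZ"
    ).date()
    # Assumes `release_date_precision` is "day", i.e. "YYYY-MM-DD".
    spotify_date = date.fromisoformat(spotify_release_date)
    return abs((youtube_date - spotify_date).days) <= max_days


same_day = posted_within_days("2023-09-13T12:00:49Z", "2023-09-13")
two_weeks_apart = posted_within_days("2023-09-13T12:00:49Z", "2023-08-30")
```

Rows with a coarser `release_date_precision` would need separate handling before this comparison is meaningful.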

In [21]:
def match_youtube_video_to_spotify_episode(
    youtube_video: Dict, spotify_episode: Dict
) -> float:
    """Performs a matching between a YouTube video and Spotify episode, to
    get the likelihood (a float between 0 and 1) that they should be
    mapped together.
    
    Algorithm details:
    1. It's likely that the YouTube and Spotify versions of the same podcast
    episode were posted at (around) the same time, so we first check whether
    the YouTube video and Spotify episode were posted close together. As a
    first pass, we can reasonably assume that if they were posted on vastly
    different dates, they likely aren't the same episode.
    2. If they were posted at around the same time, run the other matchers.
    """
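    if not youtube_video_and_spotify_episode_posted_same_time(
        youtube_video_post_date=youtube_video["published_at"],
        spotify_episode_post_date=spotify_episode["release_date"],
    ):
        return 0

    titles_match_score = fuzzy_match_titles(
        youtube_video_title=youtube_video["video_title"],
        spotify_episode_title=spotify_episode["name"],
    )

    # The episode table only carries `show_id`, so the podcast name still
    # needs to be looked up from the `spotify_show` table; left blank for now.
    channel_names_match_score = channel_names_match(
        youtube_channel_name=youtube_video["channel_title"],
        spotify_podcast_name="",
    )

    descriptions_match_score = fuzzy_match_descriptions(
        youtube_video_description=youtube_video["description"],
        spotify_episode_description=spotify_episode["description"],
    )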

    # Assuming the matchers work as intended, a true match should not
    # produce a score of 0 on any signal, so if any score is 0 we can
    # discard the pair.
    scores = [
        titles_match_score, channel_names_match_score,
        descriptions_match_score
    ]
    if any(score == 0 for score in scores):
        return 0

    # Average the scores so the result stays between 0 and 1, as the
    # docstring promises.
    return sum(scores) / len(scores)
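
One possible way to combine the individual signals into a single confidence value is a weighted average of the fuzzy scores, with the channel match acting as a gate. This is only a sketch: `combine_scores` and the weights below are hypothetical and would need tuning against real matches.

```python
def combine_scores(
    title_score: float,
    description_score: float,
    channels_match: bool,
    title_weight: float = 0.6,
    description_weight: float = 0.4,
) -> float:
    """Hypothetical helper: overall match confidence between 0 and 1."""
    # If the channel and the podcast are different shows, no amount of
    # title or description similarity should produce a match.
    if not channels_match:
        return 0.0
    total_weight = title_weight + description_weight
    weighted_sum = (
        title_weight * title_score + description_weight * description_score
    )
    # Dividing by the total weight keeps the result in [0, 1] as long as
    # both input scores are in [0, 1].
    return weighted_sum / total_weight


confidence = combine_scores(
    title_score=0.8, description_score=0.5, channels_match=True
)
```

Weighting titles above descriptions is a guess based on the sample rows, where titles differ mostly by a prefix while descriptions may differ in formatting.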
