# Chai Time Data Science Show
***Interviews with Practitioners, Kagglers & Researchers and all things Data Science.***
<img src='https://chaitimedatascience.com/assets/images/image01.png?v80947218520951' height=500 width=400/>

## <a id='toc'>Table of Contents</a>
1. [Overview](#1)
2. [Show Timeline](#2)
3. [Battle of the Sexes](#3)
4. [Heroes' Whereabouts](#4)
5. [The Flavorful Show ☕](#5)
6. [Contribution of the Kagglers](#6)<br>
    6.1 [The number of Kagglers](#6.1)<br>
    6.2 [Kaggler Performance Tier](#6.2)<br>
    6.3 [Know your Kagglers!](#6.3)<br>
7. [Youtube Analysis](#7)<br>
    7.1 [Impressions and Views](#7.1)<br>
    7.2 [What influences the CTR?](#7.2)<br>
    7.3 [Does the episode duration matter?](#7.3)<br>
    7.4 [Tracking the User Activity](#7.4)<br>
8. [Podcasts Analysis](#8)<br>
    8.1 [Anchor Analysis](#8.1)<br>
    8.2 [Spotify Analysis](#8.2)<br>
    8.3 [Apple Podcast Analysis](#8.3)<br>

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots
from IPython.display import display, HTML
import requests
from bs4 import BeautifulSoup
import re
import json

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)

In [2]:
subtitle_path = '/kaggle/input/chai-time-data-science/Cleaned Subtitles/'
df = pd.read_csv('/kaggle/input/chai-time-data-science/Episodes.csv')
PLOT_BGCOLOR='#DADEE3'
PAPER_BGCOLOR='rgb(255,255,255)'

# <a id='1'>1. Overview</a>
<a href='#toc'><span class="label label-info">Go back to the Table of Contents</span></a>

The Chai Time Data Science show is a Podcast + Video + Blog based show featuring interviews with Practitioners, Kagglers & Researchers and all things Data Science. It is also a “re-start” or continuation of the “Interview with Machine Learning Heroes Series” by [Sanyam Bhutani](https://www.linkedin.com/in/sanyambhutani/).
Let us take a look at some of the show statistics.

In [3]:
fig = go.Figure()

fig.add_trace(go.Indicator(
    title = 'Total Episodes Released',
    mode = "number",
    value = df.episode_id.nunique(),
    domain = {'row': 0, 'column': 0}))


fig.add_trace(go.Indicator(
    title = "Total Heroes Interviewed",
    mode = "number",
    value = df.heroes.nunique(),
    domain = {'row': 0, 'column': 1}))


fig.add_trace(go.Indicator(
    title = 'Average Episode Duration<br>(in minutes)',
    mode = "number",
    value = df.episode_duration.mean()/60,
    domain = {'row': 1, 'column': 0}))

fig.add_trace(go.Indicator(
    title = 'Total Youtube<br>Subscribers Earned',
    mode = "number",
    value = df.youtube_subscribers.sum(),
    domain = {'row': 1, 'column': 1}))

fig.update_layout(width=700,height=400,title='<b>Chai Time Data Science Stats</b>',
                  template='seaborn',margin=dict(t=60,b=10,l=10,r=10),
                  grid = {'rows': 2, 'columns': 2, 'pattern': "independent"},paper_bgcolor=PLOT_BGCOLOR)

The show is available on the following platforms:
- [anchor.fm](https://anchor.fm/chaitimedatascience)
- [Apple podcasts](https://podcasts.apple.com/us/podcast/chai-time-data-science/id1473685440?ign-mpt=uo%3D4)
- [Breaker](https://www.breaker.audio/chai-time-data-science)
- [Spotify](https://open.spotify.com/show/7IbEWJjeimwddhOZqWe0G1)
- [Google podcasts](https://podcasts.google.com/feed/aHR0cHM6Ly9hbmNob3IuZm0vcy9jMTk3NzJjL3BvZGNhc3QvcnNz)
- [Pocket Casts](https://pca.st/37LZ)
- [Overcast](https://overcast.fm/itunes1473685440/chai-time-data-science)
- [Radio Public](https://radiopublic.com/chai-time-data-science-6VypwX)
- [Youtube](https://www.youtube.com/playlist?list=PLLvvXm0q8zUbiNdoIazGzlENMXvZ9bd3x)
- [Stitcher](https://www.stitcher.com/podcast/sanyam-bhutani/chai-time-data-science)

# <a id='2'>2. Show Timeline</a>
<a href='#toc'><span class="label label-info">Go back to the Table of Contents</span></a>

<img src='https://media.giphy.com/media/10bL6SqRBRfMUU/giphy.gif'/><br>
The show was first released on 21st July 2019 and has since released 85 episodes in total. The Gantt chart below shows the show's entire timeline, including when each episode was recorded and when it was finally released.

Of the 85 released episodes, 76 are interviews (E0 - E75) and the remaining 9 are lessons (M0 - M8) by fast.ai's founder [Jeremy Howard](https://www.fast.ai/about/#jeremy). Interviews are shown in blue and lessons in yellow.

The y-axis lists all the episodes along with links to their Youtube and Anchor pages for ready reference. Just click on any of the links to open the episode on Youtube/Anchor.

In [4]:
df_gantt = df.copy()
df_gantt = df_gantt[['episode_id','episode_name','recording_date','release_date','youtube_url','anchor_url']]
df_gantt['Resource'] = df_gantt['episode_id'].apply(lambda x: 'Episode' if x[0]=='E' else 'Mini-Series')
df_gantt['link'] = df_gantt.apply(lambda x: '{}(<a href="{}">Youtube</a>/<a href="{}">Anchor</a>)'\
                                  .format(x['episode_id'],x['youtube_url'],x['anchor_url']),axis=1)
df_gantt.rename(columns={'link':'Task','recording_date':'Start','release_date':'Finish',
                         'episode_name':'Description'},inplace=True)
colors = {'Episode': '#0080B7',
          'Mini-Series': '#FDE803'}
fig = ff.create_gantt(df_gantt,colors=colors,index_col='Resource',bar_width=0.2,showgrid_x=True, showgrid_y=True)
fig.update_layout(width=700,height=1200,title='Episodes: Recording & Release Dates',template='seaborn',
                  xaxis=dict(title='Timeline',mirror=True,linewidth=2,linecolor='black',gridcolor='darkgray'),
                  yaxis=dict(title='Episodes',mirror=True,linewidth=2,linecolor='black',tickfont=dict(size=8),gridcolor='darkgray'),
                  plot_bgcolor=PLOT_BGCOLOR,paper_bgcolor=PAPER_BGCOLOR,hovermode='closest')
fig.show()

- As the timeline above shows, episodes are not necessarily released in the order in which they were recorded. For example, look at E8 in the chart above: it was recorded the earliest but was released after the first eight episodes.
- The lessons by Jeremy Howard were all recorded on 26 Feb 2020 and all released together on 7th March 2020.
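
The gap between recording and release can be quantified directly. Here is a minimal sketch that uses only the two dates quoted above for the Jeremy Howard lessons; the `release_lag_days` helper is hypothetical and not part of this notebook:

```python
from datetime import date

def release_lag_days(recorded: date, released: date) -> int:
    """Number of days between recording an episode and releasing it."""
    return (released - recorded).days

# Dates for the mini-series lessons, as stated in the text above.
lessons_lag = release_lag_days(date(2020, 2, 26), date(2020, 3, 7))
print(lessons_lag)  # 10 (2020 is a leap year, so February has 29 days)
```

The same idea should carry over to the whole dataset by subtracting the parsed `recording_date` column from the parsed `release_date` column, assuming both columns parse cleanly as dates.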

In [5]:
# youtube_url = 'https://havecamerawilltravel.com/photographer/files/2020/01/youtube-logo-new.jpg'
# anchor_url = 'https://d12xoj7p9moygp.cloudfront.net/images/anchor-logo-header.png'
# df_episodes = df[['episode_id','episode_name','release_date','youtube_url','anchor_url']]
# df_episodes['Youtube Link'] = df_episodes['youtube_url']\
#     .apply(lambda x: '<a href="{}" target="_blank" title="{}"><img src="{}" width="80" height="10"></a>'.format(
#             x, 'Chai Time Data Science', youtube_url))
# df_episodes['Anchor Link'] = df_episodes['anchor_url']\
#     .apply(lambda x: '<a href="{}" target="_blank" title="{}"><img src="{}" width="80" height="10" style="background-color:#3C3B6E;" ></a>'.format(
#             x, 'Chai Time Data Science', anchor_url))
# df_episodes.rename(columns={'episode_id':'Episode','episode_name':'Name','release_date':'Released On'},inplace=True)
# display_html(df_episodes, cols=['Episode','Name', 'Released On','Youtube Link','Anchor Link'])

# <a id='3'>3. Battle of the Sexes</a>
<a href='#toc'><span class="label label-info">Go back to the Table of Contents</span></a>

<img src='https://media.giphy.com/media/l4Epgt54FuBUIsdYA/giphy.gif'/>
Data Science is a growing field and has seen exponential growth over the past few years. Men and women have contributed equally to this field. Looking at the plot below, there seems to be a gender bias, as 88% of the heroes interviewed are male, but I believe it is just a matter of time before we see a rise in the blue bar.

**Eagerly looking forward to it!!**

In [6]:
d = dict(df.heroes_gender.value_counts())
total = sum(d.values())
fig = go.Figure()
colors = {'Male':'#FDE803','Female':'#0080B7'}
annotations=[]
space = 0
for key, value in d.items():
    fig.add_trace(go.Bar(name=key,x=[value],y=['Heroes<br>Gender Count'],orientation='h',
                        marker_line_color='black',marker_line_width=1.5,marker_color=colors[key]))
    annotations.append(dict(xref='x', yref='y',
                            x=space + (value/2), y=0,
                            # Scale to a percentage before rounding to avoid float truncation (e.g. 56.999... -> 56)
                            text=str(int(round(value/total*100))) + '%',
                            font=dict(family='Arial', size=14,
                                      color='rgb(0, 0, 0)'),
                                      showarrow=False))
    space+=value
fig.update_layout(barmode='stack',width=700,height=150,paper_bgcolor=PAPER_BGCOLOR,plot_bgcolor=PLOT_BGCOLOR,
                 hovermode='y',xaxis=dict(mirror=True,linewidth=2,linecolor='black',showgrid=False),
                 yaxis=dict(mirror=True,linewidth=2,linecolor='black',showgrid=False),margin=dict(t=0,b=0,l=0,r=0),
                 legend=dict(title='Gender',y=0.5),annotations=annotations)
fig.show()

# <a id='4'>4. Heroes' Whereabouts</a>
<a href='#toc'><span class="label label-info">Go back to the Table of Contents</span></a>

To clear up any confusion, the people interviewed on CTDS.show are called Data Science Heroes. I feel it's apt, given how much they have contributed to the field.

In this section, I'll be analysing the distribution of the Heroes' nationalities and the countries they are currently working in. In this age of globalization, it is pretty common to relocate. So, it's not necessary that a person from India will work in India. People move for higher education, work and many other reasons.
<br>
<img src='https://media.giphy.com/media/3ov9k06VQ0SU6f15rW/giphy.gif'/>
<br>
The horizontal bar chart below has two legends: Nationality & Location.
<br>
`Nationality`: The country in which the person is born.
<br>
`Location`: The country in which the person is currently working/studying.

> **Please Note:** I have used a logarithmic scale for the x-axis as there is a huge gap between the first country (USA: 22/37) and the second (France: 8/5).
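
To see why a logarithmic axis helps here, compare the raw counts with their base-10 logarithms. This is a small sketch using only the four counts quoted in the note above; it is an illustration, not part of the plotting code:

```python
import math

# Counts quoted in the note: (nationality, location) per country.
counts = {"USA": (22, 37), "France": (8, 5)}

for country, (nationality, location) in counts.items():
    # On a log axis, a value is positioned at log10(value).
    print(country, round(math.log10(nationality), 2), round(math.log10(location), 2))

# Ratio of the largest to the smallest value on each scale.
linear_ratio = 37 / 5
log_ratio = math.log10(37) / math.log10(5)
```

On the linear scale the largest value is 7.4 times the smallest, while on the log scale the corresponding positions differ by a factor of only about 2.2, so the smaller bars remain readable.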

In [7]:
d1 = dict(df.heroes_nationality.value_counts())
d2 = dict(df.heroes_location.value_counts())
fig = go.Figure()
fig.add_trace(go.Bar(name='Nationality',x=list(d1.values()),y=list(d1.keys()),orientation='h',marker_color='#FDE803',
                    text=list(d1.values()),textposition='outside',marker_line_color='black',marker_line_width=1.5))
fig.add_trace(go.Bar(name='Location',x=list(d2.values()),y=list(d2.keys()),orientation='h',marker_color='#0080B7',
                    text=list(d2.values()),textposition='outside',marker_line_color='black',marker_line_width=1.5))
fig.update_layout(width=700,height=700,template='seaborn',title='Heroes Nationality vs Location',hovermode='y unified',
                 xaxis=dict(title='Number of Heroes',type='log',mirror=True,linewidth=2,
                            linecolor='black',gridcolor='darkgray'),
                 yaxis=dict(mirror=True,linewidth=2,linecolor='black',showgrid=False),margin=dict(t=25,b=0,l=0,r=0),
                 paper_bgcolor=PAPER_BGCOLOR,plot_bgcolor='#DADEE3',legend=dict(x=0.8,bgcolor='#DADEE3'))
fig.show()

- `USA` is number 1 by a huge margin. Note that of the 37 heroes currently working in the US, 22 are US nationals while the remaining 15 are not originally from America. I wish Trump had seen this before revoking the visas 😜!!
- Barring `US`, `UK` & `Canada`, most of the other countries have a Nationality bar greater than or equal to the Location bar. So does this mean that the US, UK & Canada have more opportunities when it comes to data science? I believe we need more data before jumping to that conclusion.
- Heroes born in `Switzerland`, `Africa`, `Vietnam` & `Greece` are currently working in different countries.
- `Singapore`, `Norway` & `Czech Republic` have a total of 4 heroes currently working there, although none of the heroes is a native of these countries.

# <a id='5'>5. The Flavorful Show ☕</a>
<a href='#toc'><span class="label label-info">Go back to the Table of Contents</span></a>

You must all be thinking that the show is called the Chai Time Data Science show and yet there has been no talk about Chai (Tea)!!<br>
<img src='https://media.giphy.com/media/2z0GFIoCNUVY4/giphy.gif' height=300 width=400/>
<br>
Apparently, the host Sanyam chooses from his collection of 9 different varieties of chai, which he keeps beside him while interviewing. So, let's see which varieties have been used the most and which the least.

In [8]:
d = dict(df.flavour_of_tea.value_counts())
colors = ['#ED1C22','#FEC907','#FF3D09','#7CBB15', '#92D0FF', '#30ADE5','#1373C7','#ED309C','#3BD5C9']
fig = go.Figure(data=[go.Pie(labels=list(d.keys()),values=list(d.values()))])
fig.update_traces(hoverinfo='label+value', textinfo='percent', textfont_size=20,
                  marker=dict(colors=colors, line=dict(color='#000000', width=1.5)))
fig.update_layout(title='Different Flavours of Tea☕',width=700,height=400,barmode='stack',template='seaborn',
                 paper_bgcolor=PLOT_BGCOLOR,plot_bgcolor=PLOT_BGCOLOR,hovermode='x unified',
                 legend=dict(title='<b><i>Type of ☕</i></b>',x=0.835,bgcolor=PLOT_BGCOLOR),margin=dict(t=35,b=10,l=10,r=10),
                 xaxis=dict(title='Number of Heroes',mirror=True,linewidth=2,linecolor='black',showgrid=False),
                 yaxis=dict(mirror=True,linewidth=2,linecolor='black',gridcolor='darkgray'))
fig.show()

Clearly, `Masala Chai` & `Ginger Chai` are used the most while `Kashmiri Kahwa` is used the least. I believe the first two are the host's favourites while the last one is his least favourite!!

# <a id='6'>6. Contribution of the Kagglers</a>
<a href='#toc'><span class="label label-info">Go back to the Table of Contents</span></a>

Kaggle, a subsidiary of Google LLC, is an online community of data scientists and machine learning practitioners. Kaggle allows users to find and publish data sets, explore and build models in a web-based data-science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges.

In June 2017, Kaggle announced that it passed 1 million registered users, or Kagglers. The community spans 194 countries. It is the largest and most diverse data community in the world, ranging from those just starting out to many of the world's best known researchers.<br>
<img src='https://upload.wikimedia.org/wikipedia/commons/7/7c/Kaggle_logo.png'/><br>
Naturally, the world's largest Data Science community will make some contribution to a show that's based on Data Science. BTW, Kaggle is also my favourite platform. It's an ocean of knowledge where you can learn anything related to Data Science.

In this section I'll be analysing the relation between the heroes interviewed on CTDS.show and the Kagglers.

## <a id='6.1'>6.1 The number of Kagglers</a>
<a href='#toc'><span class="label label-info">Go back to the Table of Contents</span></a>

All the heroes are divided into 4 categories: Kaggle, Industry, Research and Others. Below I have drawn a bar chart showing each category's strength. I have also added the gender distribution.

In [9]:
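# Assign back instead of calling fillna(inplace=True) on a column accessor, which may not update df in newer pandas
df['heroes_gender'] = df['heroes_gender'].fillna('NA')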
group=df.groupby(['category','heroes_gender'],as_index=False)['episode_id']\
        .count().sort_values('episode_id',ascending=False)
fig = go.Figure()
colors={'Male':'#FDE803','Female':'#0080B7','NA':'#FF3D09'}
for sex in group.heroes_gender.unique().tolist():
    df_cat = group[group['heroes_gender']==sex]
    fig.add_trace(go.Bar(name=sex,x=df_cat['category'],y=df_cat['episode_id'],marker_line_color='black',
                        marker_line_width=1.5,marker_color=colors[sex],text=df_cat['episode_id'],textposition='auto'))
fig.update_layout(title='Heroes Category',width=700,height=400,barmode='stack',template='seaborn',
                 paper_bgcolor=PAPER_BGCOLOR,plot_bgcolor=PLOT_BGCOLOR,hovermode='x unified',
                 legend=dict(title='Gender',x=0.835,bgcolor=PLOT_BGCOLOR),margin=dict(t=25,b=0,l=0,r=0),
                 xaxis=dict(title='Category',mirror=True,linewidth=2,linecolor='black',showgrid=False),
                 yaxis=dict(title='Number of Heroes',mirror=True,linewidth=2,linecolor='black',gridcolor='darkgray'))
fig.show()

**Please Note:** The graph shows that, of all those interviewed, 31 are Kagglers and all of them are male. The actual number of Kagglers is 43, and they constitute 50% of those interviewed on CTDS.show. The difference arises because Kaggle is such a vast community that people working in research, industry or other areas can also be part of it, yet only one category can be assigned to each person.
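
The two ways of counting Kagglers can be sketched as follows. The rows below are made up for illustration, and the sketch assumes that the higher figure comes from counting heroes who have a Kaggle username regardless of their assigned category; only the column names `category` and `heroes_kaggle_username` come from the dataset:

```python
# Toy rows for illustration only (not the real dataset).
heroes = [
    {"hero": "A", "category": "Kaggle",   "heroes_kaggle_username": "a_user"},
    {"hero": "B", "category": "Industry", "heroes_kaggle_username": "b_user"},
    {"hero": "C", "category": "Research", "heroes_kaggle_username": None},
    {"hero": "D", "category": "Kaggle",   "heroes_kaggle_username": "d_user"},
]

# Heroes whose single assigned category is "Kaggle".
by_category = sum(1 for h in heroes if h["category"] == "Kaggle")

# Heroes who have a Kaggle account, whatever their category.
by_username = sum(1 for h in heroes if h["heroes_kaggle_username"] is not None)

print(by_category, by_username)  # 2 3
```

Hero B is the kind of case that creates the gap: an Industry hero who is also active on Kaggle.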

## <a id='6.2'>6.2 Kaggler Performance Tier</a>
<a href='#toc'><span class="label label-info">Go back to the Table of Contents</span></a>

Want to know your probability of featuring on CTDS.show? Maybe the below plot can answer this question. 
For those of you who are unaware, Kaggle follows a [Progression system](https://www.kaggle.com/progression) that uses performance tiers to track your growth as a data scientist on Kaggle.

The Progression System is designed around four Kaggle categories of data science expertise: Competitions, Notebooks, Datasets, and Discussion. Advancement through performance tiers is done independently within each category of expertise.

Within each category of expertise, there are five performance tiers that can be achieved in accordance with the quality and quantity of work you produce: Novice, Contributor, Expert, Master, and Grandmaster.

In [10]:
df_meta = pd.read_csv('/kaggle/input/meta-kaggle/Users.csv')
# Drop missing usernames explicitly instead of assuming NaN is the first unique value
df_users= df_meta[df_meta['UserName'].isin(df.heroes_kaggle_username.dropna().unique())].copy()

In [11]:
colors = {0:'#5ac995',1:'#0bf',2:'#95628f',3:'#f96517',4:'#dca917',5:'#008abc'}
labels = {0:'Novice',1:'Contributor',2:'Expert',3:'Master',4:'Grandmaster',5:'Kaggle team'}
group = df_users.groupby('PerformanceTier',as_index=False)['Id'].count()
group['labels'] = group['PerformanceTier'].map(labels)
group['colors'] = group['PerformanceTier'].map(colors)
fig = go.Figure(data=[go.Pie(labels=group.labels,values=group.Id)])
fig.update_traces(hoverinfo='label+value', textinfo='percent', textfont_size=20,
                  marker=dict(colors=group.colors, line=dict(color='#000000', width=1.5)))
fig.update_layout(title="Heroes Kaggle Tier Distribution",width=700,height=400,barmode='stack',template='seaborn',
                 paper_bgcolor=PLOT_BGCOLOR,plot_bgcolor=PLOT_BGCOLOR,hovermode='x unified',
                 legend=dict(title='<b><i>Performance Tier</i></b>',x=0.835,bgcolor=PLOT_BGCOLOR),margin=dict(t=35,b=10,l=10,r=10),
                 xaxis=dict(title='Number of Heroes',mirror=True,linewidth=2,linecolor='black',showgrid=False),
                 yaxis=dict(mirror=True,linewidth=2,linecolor='black',gridcolor='darkgray'))
fig.show()

- All hail the Grandmasters!! If you are a grandmaster, you deservedly have the highest chance of being featured on the show.
- As the tier drops, the chance of being interviewed on the show also decreases.
- But don't be disheartened. Your chance of being on the show is at least more than double that of the Kaggle team. This is something to cheer about :P

## <a id='6.3'>6.3 Know your Kagglers!</a>
<a href='#toc'><span class="label label-info">Go back to the Table of Contents</span></a>

I thought that it would be better if I could list all the Kagglers interviewed on the show. I have also added some information about the kagglers. You can click on the kaggler image to access their profile and look at their awesome projects.
Thanks to [Raenish](https://www.kaggle.com/raenish) for his awesome [kernel](https://www.kaggle.com/raenish/become-grandmaster). I have reused some of his code.

In [12]:
%%javascript
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}


In [13]:
def scrape_data(df):
    for index in df.index:
        #time.sleep(1)
        row = df.loc[index]  # index holds labels, so use label-based access
    
        username = row.UserName
        profile_url = '{}{}'.format(KAGGLE_BASE_URL, username)
        displayname = row.DisplayName

        result = requests.get(profile_url)
        src = result.content
        soup = BeautifulSoup(src, 'html.parser')
        soup = soup.find_all("div", id="site-body")[0].find("script")

        user_info = re.search(r'Kaggle\.State\.push\(({.*})', str(soup)).group(1)
        user_dict = json.loads(user_info)
    
        city = user_dict['city']
        region = user_dict['region']
        country = user_dict['country']
        avatar_url = user_dict['userAvatarUrl']
        occupation = user_dict['occupation']
        organization = user_dict['organization']
        github_user = user_dict['gitHubUserName']
        twitter_user = user_dict['twitterUserName']
        linkedin_url = user_dict['linkedInUrl']
        website_url = user_dict['websiteUrl']
        last_active = user_dict['userLastActive']
        num_followers = user_dict['followers']['count']
        num_following = user_dict['following']['count']

        num_posts = user_dict['discussionsSummary']['totalResults']
        num_datasets = user_dict['datasetsSummary']['totalResults']
        num_kernels = user_dict['scriptsSummary']['totalResults']
        num_comps = user_dict['competitionsSummary']['totalResults']


        df.loc[index, 'Image'] = '<a href="{}" target="_blank" title="{}"><img src="{}" width="100" height="100"></a>'.format(
            profile_url, displayname, avatar_url)
        df.loc[index, 'NumFollowers'] = num_followers
        df.loc[index, 'NumFollowing'] = num_following
        df.loc[index, 'NumPosts'] = num_posts
        df.loc[index, 'NumDatasets'] = num_datasets
        df.loc[index, 'NumKernels'] = num_kernels
        df.loc[index, 'NumCompetitions'] = num_comps
        df.loc[index, 'Country'] = country
    return df

In [14]:
def display_html(df, cols=None, index=False, na_rep='', num_rows=0):
    if num_rows == 0:
        df_table = df.to_html(columns=cols, index=index, na_rep=na_rep, escape=False, render_links=True)
        display(HTML(df_table))
    else:
        df_table = df.head(num_rows).to_html(columns=cols, index=index, na_rep=na_rep, escape=False, render_links=True)
        display(HTML(df_table))

In [15]:
KAGGLE_BASE_URL = "https://www.kaggle.com/"
df_users['RegisterDate'] = pd.to_datetime(df_users['RegisterDate'],format='%m/%d/%Y')
df_users['Joined'] = df_users.RegisterDate.apply(lambda x: str(2020-x.year)+' years ago')
df_users.reset_index(drop=True,inplace=True)
df_users = scrape_data(df_users)
df_users.NumFollowers = df_users.NumFollowers.astype(int)
display_html(df_users, cols=['Image','UserName', 'DisplayName','Joined','NumFollowers','Country'])

| UserName | DisplayName | Joined | NumFollowers | Country |
|---|---|---|---|---|
| antgoldbloom | Anthony Goldbloom | 10 years ago | 598 | United States |
| jhoward | Jeremy Howard | 10 years ago | 2971 | United States |
| abhishek | Abhishek Thakur | 9 years ago | 6861 | Norway |
| antorsae | Andres Torrubia | 9 years ago | 192 | Spain |
| dansbecker | DanB | 9 years ago | 5013 | United States |
| philippsinger | Psi | 8 years ago | 1135 | Austria |
| hamelhusain | Hamel Husain | 8 years ago | 15 | United States |
| mlandry | mlandry | 8 years ago | 384 | United States |
| amuellerml | Andreas Mueller | 8 years ago | 9 | United States |
| titericz | Giba | 8 years ago | 8377 | Brazil |


# <a id='7'>7. Youtube Analysis</a>
<a href='#toc'><span class="label label-info">Go back to the Table of Contents</span></a>

In this section, I'll be doing an extensive analysis of the Youtube data provided to us. First, let me clarify what the terms used in this section mean (in the context of online advertising):
- **Impression:** An impression is counted each time an ad is fetched from its source. Whether the ad is clicked is not taken into account. So, whenever you see a Youtube video link on Youtube or any other medium, it's counted as an impression.
- **View:** A view is counted when you click on the link. If you access the Youtube video from a link on the Youtube webpage/application, it's called an *Impression View*. If you access it from any other webpage/application, it's called a *Non-Impression View*.
- **CTR:** The click-through rate is the ratio of Views to Impressions, where Views means Impression Views. The higher the CTR, the better the ad campaign, so anybody would want a higher CTR.
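
The CTR definition above boils down to a single division. Here is a minimal sketch, assuming CTR is expressed as a percentage; the `click_through_rate` helper and the numbers are hypothetical and not taken from the dataset:

```python
def click_through_rate(impression_views: int, impressions: int) -> float:
    """CTR as a percentage: impression views divided by impressions."""
    if impressions == 0:
        return 0.0  # no impressions means nothing could be clicked
    return 100 * impression_views / impressions

# Hypothetical numbers for illustration only.
ctr = click_through_rate(impression_views=250, impressions=5000)
print(ctr)  # 5.0
```

Non-impression views are deliberately left out of the numerator, since they do not originate from a counted impression.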

## <a id='7.1'>7.1 Impressions and Views</a>
<a href='#toc'><span class="label label-info">Go back to the Table of Contents</span></a>

Now that you know the meanings of these terms, let's see how CTDS.show fared when it comes to campaigning.

In the plot below I have analysed the impressions per episode. I have also plotted the mean and median impressions across all episodes to see how they are distributed.

In [16]:
df_yt = pd.read_csv('/kaggle/input/chai-time-data-science/YouTube Thumbnail Types.csv')
df_yt = pd.merge(df,df_yt,on='youtube_thumbnail_type')
colors = {0:'#FDE803',1:'#0080B7',2:'#FF3D09',3:'#7CBB15'}
df_yt['color'] = df_yt.youtube_thumbnail_type.map(colors)
# Interviews run E0-E75, so offset the mini-series by 76 to keep M0 from colliding with E75
df_yt['ep_no'] = df_yt.episode_id.apply(lambda x: int(x[1:]) if x[0]=='E' else 76+int(x[1:]))
df_yt.sort_values('ep_no',inplace=True)
df_yt['heroes'] = df_yt['heroes'].fillna('NaN')
df_yt['episode'] = df_yt.apply(lambda x: x['episode_id'] + ' | ' + x['heroes'] 
                               if x['heroes']!='NaN' else x['episode_id'], axis=1)

In [17]:
y_avg = df_yt.youtube_impressions.mean()
y_med = df_yt.youtube_impressions.median()
fig = go.Figure()
fig.add_trace(go.Bar(name='Impressions',x=df_yt.episode_id,y=df_yt.youtube_impressions,
                     marker_line_width=1,marker_color='rgb(255,255,255)',marker_line_color='black',
                    text=df_yt['episode'],showlegend=False))
fig.add_trace(go.Scatter(name='Mean Impressions',x=df_yt.episode_id,
                         y=[y_avg]*len(df_yt),mode='lines',marker_color='black',
                        line = dict(dash='dash')))
fig.add_trace(go.Scatter(name='Median Impressions',x=df_yt.episode_id,
                         y=[y_med]*len(df_yt),mode='lines',marker_color='black',
                        line = dict(dash='dot')))
# Add image
fig.add_layout_image(
    dict(
        source='https://cdn.icon-icons.com/icons2/1584/PNG/512/3721679-youtube_108064.png',
        xref="paper", yref="paper",
        x=1, y=1,
        sizex=0.2, sizey=0.2,
        xanchor="right", yanchor="bottom"
    )
)
fig.update_layout(title='<b>Youtube Impressions</b> per Episode',
                width=700,height=300, barmode='stack',
                paper_bgcolor=PAPER_BGCOLOR,plot_bgcolor=PLOT_BGCOLOR,hovermode='x unified',
                margin=dict(t=40,b=0,l=0,r=0),legend=dict(x=0.5,y=1,orientation='h',bgcolor=PLOT_BGCOLOR),
                xaxis=dict(mirror=True,linewidth=2,linecolor='black',
                showgrid=False,tickfont=dict(size=8)),
                yaxis=dict(mirror=True,linewidth=2,linecolor='black',gridcolor='darkgray'))
fig.show()

- `Episode 27`, with fast.ai founder **Jeremy Howard** as the guest, recorded the highest number of impressions, followed by `Episode 1` featuring the world's first 4x Kaggle grandmaster **Abhishek Thakur**. All the remaining episodes recorded fewer than 20k impressions.
- The mean of the impressions is greater than the median, which indicates a right-skewed distribution.
- 50% of the episodes recorded fewer than 5096 impressions.
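
The link between skew and the mean/median gap is easy to reproduce. Below is a small sketch with made-up impression counts (not the real data), where a single hit episode pulls the mean well above the median:

```python
from statistics import mean, median

# Hypothetical impression counts: most episodes are modest, one is a hit.
impressions = [3000, 4000, 5000, 6000, 40000]

avg = mean(impressions)    # pulled upward by the single large value
mid = median(impressions)  # depends only on the middle value

# A mean well above the median is a typical sign of right skew.
print(avg > mid)  # True
```

This is the same pattern as in the plot above, where a couple of very popular episodes lift the mean line above the median line.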

Let's also study the views per episode. Since views are of 2 types, Impression and Non-Impression, I have drawn a stacked bar plot based on percentages instead of absolute numbers to get a better understanding of how the views are distributed.

In [18]:
# Vectorised column arithmetic instead of a row-wise apply
df_yt['view_perc'] = np.round(df_yt['youtube_impression_views']/df_yt['youtube_views']*100)
df_yt['non_view_perc'] = np.round(df_yt['youtube_nonimpression_views']/df_yt['youtube_views']*100)
fig = go.Figure()
fig.add_trace(go.Bar(name='Youtube Impression Views',x=df_yt.episode_id,y=df_yt.view_perc,
                    marker_color='#FF3D09',marker_line_width=1))
fig.add_trace(go.Bar(name='Youtube Non-Impression Views',x=df_yt.episode_id,y=df_yt.non_view_perc,
                    marker_color='rgb(255,255,255)',marker_line_width=1))
fig.add_trace(go.Scatter(name='',x=df_yt.episode_id,y=[50]*len(df_yt),mode='lines',marker_color='black',showlegend=False,
                        line = dict(dash='dash')))
# Add image
fig.add_layout_image(
    dict(
        source='https://cdn.icon-icons.com/icons2/1584/PNG/512/3721679-youtube_108064.png',
        xref="paper", yref="paper",
        x=1, y=1.05,
        sizex=0.2, sizey=0.2,
        xanchor="right", yanchor="bottom"
    )
)
annotations=[]
annotations.append(dict(xref='x', yref='y',
                        x='E30', y=55,
                        text=str(50) + '%',
                        font=dict(family='Arial', size=20,
                                color='rgb(0, 0, 0)'),
                                showarrow=False))
fig.update_layout(title='<b>Youtube Views</b><br>(Impression vs Non-Impression)',
                width=700,height=400, barmode='stack',
                paper_bgcolor=PAPER_BGCOLOR,plot_bgcolor=PLOT_BGCOLOR,hovermode='x unified',
                margin=dict(t=100,b=0,l=0,r=0),
                xaxis=dict(mirror=True,linewidth=2,linecolor='black',
                showgrid=False,tickfont=dict(size=8)),
                legend=dict(x=0.1,y=-0.1,orientation='h'),annotations=annotations,
                yaxis=dict(title='Percentage',mirror=True,linewidth=2,linecolor='black',gridcolor='darkgray'))
fig.show()

The bars in red represent views where the video was accessed from within Youtube, and the bars in white represent views where the video was accessed from anywhere other than Youtube.

Here are some observations:
- Early episodes were accessed more from Youtube directly, but later on the non-impression views became the majority.
- All Mini Series (M0-M8) episodes by Jeremy Howard had more impression views than non-impression views.
- Earlier, we saw that Episode 27 had the highest number of impressions of all the episodes released. In the above plot we can see that the views from Youtube impressions for this episode are just 29% of the total views. This shows the popularity and reach of fast.ai and its founder.

> **Please Note:** A non-impression view can come from anything such as a link shared through an email, article, application, etc. The link must be clicked for it to count as a view.

## <a id='7.2'>7.2 What influences the CTR?</a>
<a href='#toc'><span class="label label-info">Go back to the Table of Contents</span></a>

To answer this question, I'll first study the relation between impressions and CTR and then analyse the CTR per episode. I have also included the thumbnail type, as I feel that it has some influence on the CTR too.

Ideally, CTR should increase as impressions increase. Let's see if that's true for all the thumbnail types.

In [19]:
df_lr = df_yt.copy()
df_lr['youtube_thumbnail_type']=df_lr['youtube_thumbnail_type'].astype(str)
fig = px.scatter(df_lr, x="youtube_impressions", y="youtube_ctr", trendline="ols",hover_name="episode",
                 color='description',color_discrete_sequence=['#0080B7','#FDE803','#7CBB15','#FF3D09'])
fig.update_traces(marker_line_color='black',marker_line_width=1)
# Add image
fig.add_layout_image(
    dict(
        source='https://cdn.icon-icons.com/icons2/1584/PNG/512/3721679-youtube_108064.png',
        xref="paper", yref="paper",
        x=1, y=1.01,
        sizex=0.2, sizey=0.2,
        xanchor="right", yanchor="bottom"
    )
)
fig.update_layout(title='<b>Click-through rate VS Impressions</b>',
                width=700,height=400,
                paper_bgcolor=PAPER_BGCOLOR,plot_bgcolor=PLOT_BGCOLOR,hovermode='closest',
                margin=dict(t=60,b=0,l=0,r=0),
                xaxis=dict(title='Impressions',mirror=True,linewidth=2,linecolor='black',
                tickfont=dict(size=10),gridcolor='darkgray'),
                legend=dict(title='<b>Youtube Thumbnail type</b>',x=0.61,y=1,
                            bgcolor=PLOT_BGCOLOR,font=dict(size=8)),
                yaxis=dict(title='Click-through Rate',mirror=True,linewidth=2,
                           linecolor='black',gridcolor='darkgray'))
fig.show()

- As expected, thumbnail types `Youtube default image with custom annotation`, `youtube default image` & `Mini Series: custom image with annotations` show a positive correlation between impressions and CTR. On the contrary, the thumbnail type `Custom image with CTDS branding, title and tags` shows a high negative correlation.
- Thumbnail type `Mini Series: custom image with annotations` shows a high positive correlation, while `Youtube default image with custom annotation` and `youtube default image` show a moderate positive correlation.
- I believe spending on more impressions for episodes with the `Custom image with CTDS branding, title and tags` thumbnail is not a good idea.
- I would recommend using the YouTube default image, with or without annotations, as the thumbnail. The Mini Series thumbnail was specific to that series, so it would not be recommended for regular episodes.
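The trendlines above give a visual impression of the correlation; a number per thumbnail type makes "high" and "moderate" easier to compare. The following is a minimal sketch on made-up data, using the Pearson correlation from pandas. The group labels `A` and `B` and the values are hypothetical; only the column names mirror the ones used in the plot.

```python
import pandas as pd

# Toy data for illustration only: two hypothetical thumbnail types.
toy = pd.DataFrame({
    'description': ['A'] * 4 + ['B'] * 4,
    'youtube_impressions': [100, 200, 300, 400, 100, 200, 300, 400],
    'youtube_ctr': [1.0, 2.0, 2.5, 4.0, 5.0, 4.0, 3.5, 1.0],
})

# Pearson correlation between impressions and CTR, computed per thumbnail type.
corr_by_type = {
    name: group['youtube_impressions'].corr(group['youtube_ctr'])
    for name, group in toy.groupby('description')
}

for name, value in corr_by_type.items():
    print(f'{name}: {value:.2f}')
```

With only a handful of episodes per thumbnail type, such coefficients are sensitive to single outliers, so they are better read as a rough indication than as a firm result.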

To further strengthen these conclusions, I have drawn a bubble plot below which shows the CTR per episode, with the color representing the thumbnail type and the size of the bubble representing the YouTube impression views.

In [20]:
fig = go.Figure()
for thumb_type in df_yt.youtube_thumbnail_type.unique().tolist():
    group = df_yt[df_yt.youtube_thumbnail_type==thumb_type]
    fig.add_trace(go.Scatter(name=group.description.iloc[0],x=group.episode_id,y=group.youtube_ctr,
                        marker_color=group.color,mode='markers',marker_size=group.youtube_impression_views/20,
                        marker_line_color='black',
                        text='<b>Episode</b>: '+group.episode.astype(str)+'<br>'+
                             '<b>Impressions</b>: '+group.youtube_impressions.astype(str)+'<br>'+
                             '<b>Impression Views</b>: '+group.youtube_impression_views.astype(str)+'<br>'+
                             '<b>CTR</b>: '+group.youtube_ctr.astype(str),
                        hoverinfo='text'))
# Add image
fig.add_layout_image(
    dict(
        source='https://cdn.icon-icons.com/icons2/1584/PNG/512/3721679-youtube_108064.png',
        xref="paper", yref="paper",
        x=1, y=1.05,
        sizex=0.2, sizey=0.2,
        xanchor="right", yanchor="bottom"
    )
)
fig.update_layout(title='<b>Click-through rate per Episode</b><br>(Size of the bubble represents Youtube impression views)',
                width=700,height=400,
                paper_bgcolor=PAPER_BGCOLOR,plot_bgcolor=PLOT_BGCOLOR,hovermode='closest',
                margin=dict(t=100,b=0,l=0,r=0),
                xaxis=dict(title='Episodes',mirror=True,linewidth=2,linecolor='black',
                showgrid=False,tickfont=dict(size=10)),
                legend=dict(title='<b>Youtube Thumbnail type</b>',x=0.61,y=1,
                            bgcolor=PLOT_BGCOLOR,font=dict(size=8)),
                yaxis=dict(title='Click-through rate',mirror=True,linewidth=2,
                           linecolor='black',gridcolor='darkgray'))
fig.show()

- The YouTube default image thumbnail has been used the most number of times. In between, the default image with annotations was used as the thumbnail for some episodes. Now, the custom image with CTDS branding, title and tags is used.
- As can be seen, the yellow bubble recorded the highest CTR as well as the highest impression views. Otherwise the range of CTR does not change much, barring one outlier, i.e. episode 19 with Chip Huyen as the guest.
- So, the thumbnail type has some effect, but not much, as the CTR is more or less the same. A higher CTR is what anyone strives for.

## <a id='7.3'>7.3 Does the episode duration matter?</a>
<a href='#toc'><span class="label label-info">Go back to the Table of Contents</span></a>

I assume that the longer the episode, the fewer the people watching it. Even if someone very popular is featured on it, I believe it's very difficult to watch the entire episode, on YouTube at least. Let's see if my assumption is right or wrong.

In [21]:
y_avg = df_yt.youtube_watch_hours.mean()
y_med = df_yt.youtube_watch_hours.median()
fig = make_subplots(rows=2,cols=1,shared_xaxes=True,vertical_spacing=0.1)
fig.add_trace(go.Bar(name='Watch hours(h)',x=df_yt['episode_id'],y=df_yt['youtube_watch_hours'],
                    marker_line_color='black',marker_line_width=1,
                    text='<b>Episode</b>: '+df_yt.episode.astype(str)+'<br>'+
                        '<b>Total Watch Hours</b>: '+df_yt.youtube_watch_hours.astype(str),
                    hoverinfo='text',marker_color='rgb(255,255,255)'),1,1)
fig.add_trace(go.Scatter(name='Mean watch hours',x=df_yt.episode_id,
                         y=[y_avg]*len(df_yt),mode='lines',marker_color='black',
                        line = dict(dash='dash'),showlegend=False),1,1)
fig.add_trace(go.Scatter(name='Median watch hours',x=df_yt.episode_id,
                         y=[y_med]*len(df_yt),mode='lines',marker_color='black',
                        line = dict(dash='dot'),showlegend=False),1,1)
fig.add_trace(go.Bar(name='Total Duration(s)',x=df_yt['episode_id'],y=df_yt['episode_duration'],
                    marker_line_color='black',marker_line_width=1,
                    text='<b>Episode</b>: '+df_yt.episode.astype(str)+'<br>'+
                        '<b>Total Duration</b>: '+df_yt.episode_duration.astype(str)+' seconds',
                    hoverinfo='text',marker_color='#0080B7'),2,1)
fig.add_trace(go.Bar(name='Average Duration watched(s)',x=df_yt['episode_id'],
                     marker_line_color='black',marker_line_width=1,
                     y=df_yt['youtube_avg_watch_duration'],
                     text='<b>Episode</b>: '+df_yt.episode.astype(str)+'<br>'+
                        '<b>Avg Duration watched</b>: '+df_yt.youtube_avg_watch_duration.astype(str)+' seconds',
                     hoverinfo='text',marker_color='#FDE803'),2,1)
#updates axes
fig.update_xaxes(mirror=True,linecolor='black',linewidth=2,row=1,col=1,showgrid=False,)
fig.update_yaxes(title_text='Total hours watched',mirror=True,linecolor='black',linewidth=2,
                 row=1,col=1,gridcolor='darkgray')
fig.update_xaxes(mirror=True,linecolor='black',linewidth=2,row=2,col=1,showgrid=False)
fig.update_yaxes(title_text='Episode Duration<br>(in seconds)',mirror=True,linecolor='black',linewidth=2,
                 row=2,col=1,gridcolor='darkgray')
# Add image
fig.add_layout_image(
    dict(
        source='https://cdn.icon-icons.com/icons2/1584/PNG/512/3721679-youtube_108064.png',
        xref="paper", yref="paper",
        x=1, y=1,
        sizex=0.15, sizey=0.15,
        xanchor="right", yanchor="bottom"
    )
)
fig.update_layout(title='Youtube Watch Analysis',width=700,height=500,margin=dict(t=60,b=0,l=0,r=0),
                  legend=dict(x=0.1,y=0.54,orientation='h'),barmode='overlay',
                  plot_bgcolor=PLOT_BGCOLOR,paper_bgcolor=PAPER_BGCOLOR)
fig.show()

- The total watch hours per episode is less than 200 hours for all the episodes but E27. It has a staggering watch time of 704 hours. Jeremy Howard's episode turns out to be the outlier in every department :P.
- Although the episode duration varies a lot, the average duration watched per episode remains low and consistent. That suggests users are watching only the initial part of the video.
- Maybe shortening the videos would help: for the Mini Series, the ratio of average duration watched to total episode length is much better than for the other episodes.
- Episode 23 with Andres Torrubia was the longest of all episodes, with a duration of more than 2 hours.
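The ratio mentioned above can be computed directly. Here is a minimal sketch on made-up durations; the `watched_fraction` column is a hypothetical name and the values are not taken from the dataset.

```python
import pandas as pd

# Toy durations in seconds, for illustration only.
toy = pd.DataFrame({
    'episode_id': ['E1', 'E2', 'M1'],
    'episode_duration': [3600, 7200, 600],
    'youtube_avg_watch_duration': [300, 360, 240],
})

# Fraction of the episode that an average viewer watched.
toy['watched_fraction'] = toy['youtube_avg_watch_duration'] / toy['episode_duration']

# Episode with the best retention relative to its length.
best_episode = toy.loc[toy['watched_fraction'].idxmax(), 'episode_id']
print(toy)
print('Best retention:', best_episode)
```

In this toy example the short episode has the lowest absolute watch time but by far the highest watched fraction, which is the pattern described for the Mini Series.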

## <a id='7.4'>7.4 Tracking the User Activity</a>
<a href='#toc'><span class="label label-info">Go back to the Table of Contents</span></a>

I have always noticed that the views of a YouTube video don't do justice to the likes, subscribers and comments it has. It's a similar story with Kaggle as well :P. Lots of views, but upvotes and comments are nowhere to be seen.

Let's see if it's the same case with CTDS.show. Below I have drawn two plots on top of each other. The first plot represents the raw numbers and the second plot represents the percentages with respect to the total views the YouTube video got.

> **Please Note:** I have taken the total views as the denominator, as a user can't like, dislike, subscribe or comment on a video without opening it. And only when you open the video is it counted as a view.

In [22]:
df_yt['like_perc'] = df_yt.apply(lambda x: np.round(x['youtube_likes']/x['youtube_views']*100),axis=1)
df_yt['dislike_perc'] = df_yt.apply(lambda x: np.round(x['youtube_dislikes']/x['youtube_views']*100),axis=1)
df_yt['comment_perc'] = df_yt.apply(lambda x: np.round(x['youtube_comments']/x['youtube_views']*100),axis=1)
df_yt['subscribe_perc'] = df_yt.apply(lambda x: np.round(x['youtube_subscribers']/x['youtube_views']*100),axis=1)
fig = make_subplots(rows=2,cols=1,shared_xaxes=True,vertical_spacing=0.1)
fig.add_trace(go.Bar(name='Likes',x=df_yt.episode_id,y=df_yt.youtube_likes,
                    marker_color='#7CBB15',marker_line_width=1,marker_line_color='black',
                    text='<b>Episode</b>: '+df_yt.episode.astype(str)+'<br>'+
                        '<b>Likes</b>: '+df_yt.youtube_likes.astype(str),
                    hoverinfo='text',legendgroup='like'),1,1)

fig.add_trace(go.Bar(name='Dislikes',x=df_yt.episode_id,y=df_yt.youtube_dislikes,
                    marker_color='#FF3D09',marker_line_width=1,marker_line_color='black',
                    text='<b>Episode</b>: '+df_yt.episode.astype(str)+'<br>'+
                        '<b>Dislikes</b>: '+df_yt.youtube_dislikes.astype(str),
                    hoverinfo='text',legendgroup='dislike'),1,1)

fig.add_trace(go.Bar(name='Comments',x=df_yt.episode_id,y=df_yt.youtube_comments,
                    marker_color='gold',marker_line_width=1,marker_line_color='black',
                    text='<b>Episode</b>: '+df_yt.episode.astype(str)+'<br>'+
                        '<b>Comments</b>: '+df_yt.youtube_comments.astype(str),
                    hoverinfo='text',legendgroup='comment'),1,1)

fig.add_trace(go.Bar(name='Subscribers',x=df_yt.episode_id,y=df_yt.youtube_subscribers,
                    marker_color='#0080B7',marker_line_width=1,marker_line_color='black',
                    text='<b>Episode</b>: '+df_yt.episode.astype(str)+'<br>'+
                        '<b>Subscribers</b>: '+df_yt.youtube_subscribers.astype(str),
                    hoverinfo='text',legendgroup='sub'),1,1)

fig.add_trace(go.Bar(name='Likes%',x=df_yt.episode_id,y=df_yt.like_perc,
                    marker_color='#7CBB15',marker_line_width=1,marker_line_color='black',
                    text='<b>Episode</b>: '+df_yt.episode.astype(str)+'<br>'+
                        '<b>Likes</b>: '+df_yt.like_perc.astype(str)+'%',
                    hoverinfo='text',legendgroup='like',showlegend=False),2,1)

fig.add_trace(go.Bar(name='Dislikes%',x=df_yt.episode_id,y=df_yt.dislike_perc,
                    marker_color='#FF3D09',marker_line_width=1,marker_line_color='black',
                    text='<b>Episode</b>: '+df_yt.episode.astype(str)+'<br>'+
                        '<b>Dislikes</b>: '+df_yt.dislike_perc.astype(str)+'%',
                    hoverinfo='text',legendgroup='dislike',showlegend=False),2,1)

fig.add_trace(go.Bar(name='Comments%',x=df_yt.episode_id,y=df_yt.comment_perc,
                    marker_color='gold',marker_line_width=1,marker_line_color='black',
                    text='<b>Episode</b>: '+df_yt.episode.astype(str)+'<br>'+
                        '<b>Comments</b>: '+df_yt.comment_perc.astype(str)+'%',
                    hoverinfo='text',legendgroup='comment',showlegend=False),2,1)

fig.add_trace(go.Bar(name='Subscribers%',x=df_yt.episode_id,y=df_yt.subscribe_perc,
                    marker_color='#0080B7',marker_line_width=1,marker_line_color='black',
                    text='<b>Episode</b>: '+df_yt.episode.astype(str)+'<br>'+
                        '<b>Subscribers</b>: '+df_yt.subscribe_perc.astype(str)+'%',
                    hoverinfo='text',legendgroup='sub',showlegend=False),2,1)
#updates axes
fig.update_xaxes(mirror=True,linecolor='black',linewidth=2,row=1,col=1,showgrid=False,)
fig.update_yaxes(title_text='Numbers',mirror=True,linecolor='black',linewidth=2,
                 row=1,col=1,gridcolor='darkgray')
fig.update_xaxes(mirror=True,linecolor='black',linewidth=2,row=2,col=1,showgrid=False)
fig.update_yaxes(title_text='Percentages',mirror=True,linecolor='black',linewidth=2,
                 row=2,col=1,gridcolor='darkgray')
# Add image
fig.add_layout_image(
    dict(
        source='https://cdn.icon-icons.com/icons2/1584/PNG/512/3721679-youtube_108064.png',
        xref="paper", yref="paper",
        x=1, y=1,
        sizex=0.15, sizey=0.15,
        xanchor="right", yanchor="bottom"
    )
)
fig.update_layout(title='Youtube User Activity Analysis',width=700,height=500,margin=dict(t=60,b=0,l=0,r=0),
                  legend=dict(x=0.2,y=0.54,orientation='h'),barmode='stack',
                  plot_bgcolor=PLOT_BGCOLOR,paper_bgcolor=PAPER_BGCOLOR)
fig.show()

- Again Jeremy Howard's interview (E27) turns out to be the outlier. If you look at the numbers plot, it has the tallest tower, and by a huge margin. This can also be attributed to the fact that this episode recorded the highest number of views and impressions.
- To remove this bias from the number of views and impressions, I have plotted the percentages as well. Episodes E55 and E61 recorded the highest activity, i.e. 13% of the total views (counting likes, dislikes, subscribers and comments).
- You can play with the legend of the above plot to find the episode with the most comments, likes, dislikes, etc.
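The combined activity percentage can also be computed in one step. The sketch below uses made-up counts and sums the four interaction counts before dividing by the views, so rounding happens only once; this differs slightly from the cell above, which rounds each percentage separately. The `activity_perc` column is a hypothetical name.

```python
import pandas as pd

# Toy counts for illustration only; they are not taken from Episodes.csv.
toy = pd.DataFrame({
    'episode_id': ['E1', 'E2'],
    'youtube_views': [1000, 200],
    'youtube_likes': [50, 20],
    'youtube_dislikes': [1, 0],
    'youtube_comments': [4, 2],
    'youtube_subscribers': [10, 4],
})

activity_cols = ['youtube_likes', 'youtube_dislikes',
                 'youtube_comments', 'youtube_subscribers']

# Total interactions as a percentage of the views of each episode.
toy['activity_perc'] = (
    toy[activity_cols].sum(axis=1) / toy['youtube_views'] * 100
).round(1)

print(toy[['episode_id', 'activity_perc']])
```

Here the episode with fewer views shows the higher activity percentage, which is why the raw counts and the percentages can rank episodes differently.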

# <a id='8'>8. Podcast Analysis</a>
<a href='#toc'><span class="label label-info">Go back to the Table of Contents</span></a>

Having covered the YouTube analysis of the show above, why miss out on the podcast analysis? The show is available on a plethora of podcast platforms, as mentioned in the [overview section](#1). We have data available for Anchor, Spotify and Apple Podcasts. I'll cover the analysis for these three platforms in separate subsections.

In [23]:
df_anchor = pd.read_csv('/kaggle/input/chai-time-data-science/Anchor Thumbnail Types.csv')
df_anchor = pd.merge(df,df_anchor,on='anchor_thumbnail_type')
colors = {0:'#FDE803',1:'#0080B7',2:'#FF3D09',3:'#7CBB15'}
df_anchor['color'] = df_anchor.anchor_thumbnail_type.map(colors)
df_anchor['ep_no'] = df_anchor.episode_id.apply(lambda x: int(x[1:]) if x[0]=='E' else 75+int(x[1:]))
df_anchor.sort_values('ep_no',inplace=True)
df_anchor['heroes'] = df_anchor['heroes'].fillna('NaN')
df_anchor['episode'] = df_anchor.apply(lambda x: x['episode_id'] + ' | ' + x['heroes'] 
                               if x['heroes']!='NaN' else x['episode_id'], axis=1)

## <a id='8.1'>8.1 Anchor Analysis</a>
<a href='#toc'><span class="label label-info">Go back to the Table of Contents</span></a>

Anchor is an all-in-one platform where you can create, distribute, and monetize your podcast from any device, for free. I had not heard of this podcast service before, but let's see how many people the show has reached through this medium. Also, the show creators have used separate thumbnail types, just like on YouTube, so I'll be studying their effect as well.

In [24]:
fig = go.Figure()
for anchor_thumb in df_anchor.anchor_thumbnail_type.unique().tolist():
    group = df_anchor[df_anchor['anchor_thumbnail_type']==anchor_thumb]
    fig.add_trace(go.Bar(name=group['description'].iloc[0],x=group.episode_id,y=group.anchor_plays,
                         marker_color=group.color,marker_line_color='black',marker_line_width=1.5,
                         text='<b>Episode</b>: '+group.episode.astype(str)+'<br>'+
                        '<b>Plays</b>: '+group.anchor_plays.astype(str),
                         hoverinfo='text'
                        ))
# Reference lines: repeat the mean/median once per row of df_anchor so that x and y have the same length
fig.add_trace(go.Scatter(name='Mean Plays',x=df_anchor.episode_id,y=[df_anchor.anchor_plays.mean()]*len(df_anchor),mode='lines',
                         marker_color='black',showlegend=False,line = dict(dash='dot')))
fig.add_trace(go.Scatter(name='Median Plays',x=df_anchor.episode_id,y=[df_anchor.anchor_plays.median()]*len(df_anchor),mode='lines',
                         marker_color='black',showlegend=False,line = dict(dash='dash')))
# Add image
fig.add_layout_image(
    dict(
        source='https://cdn-images-1.medium.com/max/552/1*IR4XyosnuJme7tfZJQByuQ@2x.png',
        xref="paper", yref="paper",
        x=1, y=1.03,
        sizex=0.2, sizey=0.15,
        xanchor="right", yanchor="bottom"
    )
)
fig.update_layout(title='<b>Anchor Plays per Episode</b>',
                width=700,height=400,
                paper_bgcolor=PAPER_BGCOLOR,plot_bgcolor=PLOT_BGCOLOR,hovermode='closest',
                margin=dict(t=50,b=0,l=0,r=0),
                xaxis=dict(title='Episodes',mirror=True,linewidth=2,linecolor='black',
                showgrid=False,tickfont=dict(size=10)),
                legend=dict(title='<b>Anchor Thumbnail type</b>',x=0.61,y=1,
                            bgcolor=PLOT_BGCOLOR,font=dict(size=8)),
                yaxis=dict(title='Anchor Plays',mirror=True,linewidth=2,
                           linecolor='black',gridcolor='darkgray'))
fig.show()

- The dotted line represents the mean plays and the dashed line represents the median plays. The Anchor plays have a right-skewed distribution.
- Clearly, the thumbnails with `CTDS branding` & `Youtube default playlist image` have been the more successful ones. The most recent thumbnail being used is a custom image with CTDS branding, title and tags.
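A right-skewed distribution shows up as a mean that sits above the median, pulled up by a few very popular episodes. A minimal check on made-up play counts could look like this; the numbers are hypothetical and only the column name mirrors the dataset.

```python
import pandas as pd

# Toy play counts for illustration only: one episode is far more popular than the rest.
toy = pd.DataFrame({'anchor_plays': [100, 120, 130, 150, 900]})

mean_plays = toy['anchor_plays'].mean()
median_plays = toy['anchor_plays'].median()
skewness = toy['anchor_plays'].skew()  # sample skewness; positive means right-skewed

print(f'mean={mean_plays}, median={median_plays}, skewness={skewness:.2f}')
```

For skewed data like this, the median is usually the better description of a typical episode.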

## <a id='8.2'>8.2 Spotify Analysis</a>
<a href='#toc'><span class="label label-info">Go back to the Table of Contents</span></a>

Spotify is really famous and it was launched in India a few years back. Let's see how the show fares on Spotify.

When you start a Spotify podcast, it is counted as a **spotify start**, and if you listen to it for more than 60 seconds, it is counted as a **spotify stream**. The number of unique people listening to the podcast is called **listeners**. I have tried to study these using the bar plot below.

In [25]:
fig = go.Figure()
fig.add_trace(go.Bar(name='Starts',x=df_anchor.episode_id,y=df_anchor.spotify_starts,
                         marker_color='rgb(255,255,255)',
                         text='<b>Episode</b>: '+df_anchor.episode.astype(str)+'<br>'+
                        '<b>Starts</b>: '+df_anchor.spotify_starts.astype(str),
                         hoverinfo='text'
                        ))
fig.add_trace(go.Bar(name='Streams',x=df_anchor.episode_id,y=df_anchor.spotify_streams,
                         marker_color='#1DB954',
                         text='<b>Streams</b>: '+df_anchor.spotify_streams.astype(str),
                         hoverinfo='text'
                        ))
fig.add_trace(go.Scatter(name='Listeners',x=df_anchor.episode_id,y=df_anchor.spotify_listeners,mode='lines',
                         text='<b>Listeners</b>: '+df_anchor.spotify_listeners.astype(str),
                         hoverinfo='text',marker_color='black',line = dict(dash='dot')))
# Add image
fig.add_layout_image(
    dict(
        source='https://www.scdn.co/i/_global/open-graph-default.png',
        xref="paper", yref="paper",
        x=1, y=1.03,
        sizex=0.2, sizey=0.15,
        xanchor="right", yanchor="bottom"
    )
)
fig.update_layout(title='<b>Spotify Statistics per Episode</b>',
                width=700,height=400,barmode='overlay',
                paper_bgcolor=PAPER_BGCOLOR,plot_bgcolor=PLOT_BGCOLOR,hovermode='x unified',
                margin=dict(t=50,b=0,l=0,r=0),
                xaxis=dict(title='Episodes',mirror=True,linewidth=2,linecolor='black',
                showgrid=False,tickfont=dict(size=10)),
                legend=dict(x=0.82,y=1,bgcolor=PLOT_BGCOLOR,font=dict(size=12)),
                yaxis=dict(title='Starts/Streams/Listeners',mirror=True,linewidth=2,
                           linecolor='black',gridcolor='darkgray'))
fig.show()

- The number of listeners is in line with the number of starts and streams, which is to be expected.
- CTDS.show seems to be more popular on anchor.fm than on Spotify: the maximum number of starts recorded on Spotify is 826, while for Anchor the maximum is almost double, at 1527.
- This is the first plot where Jeremy Howard's episode (E27) is not the undisputed number 1. On Spotify, the episode with Abhishek Thakur (E1) tops the list, and convincingly so.
- Almost 20% of people close the podcast before 60 seconds. Still not a bad number. On YouTube it's much worse. So, it seems the more serious audience is on Spotify.
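The share of people who close the podcast before 60 seconds follows from the definitions of starts and streams given earlier. The sketch below shows the arithmetic on made-up counts; the `dropoff_perc` column is a hypothetical name.

```python
import pandas as pd

# Toy counts for illustration only; they are not taken from Episodes.csv.
toy = pd.DataFrame({
    'episode_id': ['E1', 'E2'],
    'spotify_starts': [100, 50],
    'spotify_streams': [80, 45],
})

# A start that never becomes a stream was closed within the first 60 seconds.
toy['dropoff_perc'] = (
    (1 - toy['spotify_streams'] / toy['spotify_starts']) * 100
).round(1)

# Overall drop-off across all episodes, weighted by the number of starts.
overall_dropoff = (1 - toy['spotify_streams'].sum() / toy['spotify_starts'].sum()) * 100

print(toy)
print(f'Overall drop-off: {overall_dropoff:.1f}%')
```

The overall figure weights each episode by its starts, so it can differ from the plain average of the per-episode percentages.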

## <a id='8.3'>8.3 Apple Podcast Analysis</a>
<a href='#toc'><span class="label label-info">Go back to the Table of Contents</span></a>

Now we come to the final podcast platform for which data is available. I have the total hours listened, the average duration of the episode listened in seconds and the number of listeners. To accommodate everything in the same plot, I have changed the y-axis scale to logarithmic. Total hours listened is converted to seconds.

In [26]:
fig = go.Figure()
fig.add_trace(go.Bar(name='Total seconds listened',x=df_anchor.episode_id,y=df_anchor.apple_listened_hours*60*60,
                         marker_color='rgb(255,255,255)',
                         text='<b>Episode</b>: '+df_anchor.episode.astype(str)+'<br>'+
                        '<b>Total listened</b>: '+(df_anchor.apple_listened_hours).astype(str)+' hours',
                         hoverinfo='text'
                        ))
fig.add_trace(go.Bar(name='Average listen duration in seconds',x=df_anchor.episode_id,y=df_anchor.apple_avg_listen_duration,
                         marker_color='#D56DFB',
                         text='<b>Average listened</b>: '+df_anchor.apple_avg_listen_duration.astype(str)+' seconds',
                         hoverinfo='text'
                        ))
fig.add_trace(go.Scatter(name='Listeners',x=df_anchor.episode_id,y=df_anchor.apple_listeners,mode='lines',
                         text='<b>Listeners</b>: '+df_anchor.apple_listeners.astype(str),
                         hoverinfo='text',marker_color='black',line = dict(dash='dot')))
# Add image
fig.add_layout_image(
    dict(
        source='https://is1-ssl.mzstatic.com/image/thumb/Purple113/v4/22/8d/49/228d49f8-0798-bfdb-7f4a-59039f7d102f/AppIcon-0-1x_U007emarketing-0-0-GLES2_U002c0-512MB-sRGB-0-0-0-85-220-0-0-0-10.png/246x0w.png',
        xref="paper", yref="paper",
        x=1, y=1.02,
        sizex=0.2, sizey=0.15,
        xanchor="right", yanchor="bottom"
    )
)
fig.update_layout(title='<b>Apple Podcasts Statistics per Episode</b>',
                width=700,height=400,barmode='overlay',
                paper_bgcolor=PAPER_BGCOLOR,plot_bgcolor=PLOT_BGCOLOR,hovermode='x unified',
                margin=dict(t=50,b=0,l=0,r=0),
                xaxis=dict(title='Episodes',mirror=True,linewidth=2,linecolor='black',
                showgrid=False,tickfont=dict(size=10)),
                legend=dict(x=0.35,y=1,bgcolor=PLOT_BGCOLOR,font=dict(size=8),orientation='h'),
                yaxis=dict(type='log',title='Total listened/Average Listened/Listeners',mirror=True,linewidth=2,
                           linecolor='black',gridcolor='darkgray',title_font=dict(size=9)))
fig.show()

- Apple Podcasts seems to be the least popular compared with Anchor and Spotify: the maximum number of listeners, recorded for Jeremy Howard's episode (E27), was just 96.
- One thing that is different from all the other mediums is that for Apple Podcasts the stats are more or less consistent throughout all the episodes. This can also be attributed to the lower usage.
- E53 and E75 had no listeners.
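Since the logarithm of zero is undefined, episodes with no listeners cannot be placed on a logarithmic axis, so it helps to list them explicitly. The following is a minimal sketch on made-up values that also shows the hours-to-seconds conversion used for the plot; the `apple_listened_seconds` column is a hypothetical name.

```python
import pandas as pd

# Toy values for illustration only; they are not taken from Episodes.csv.
toy = pd.DataFrame({
    'episode_id': ['E1', 'E2', 'E3', 'E4'],
    'apple_listeners': [12, 0, 96, 0],
    'apple_listened_hours': [1.5, 0.0, 20.0, 0.0],
})

# Convert hours to seconds so that all series share one unit on the y-axis.
toy['apple_listened_seconds'] = toy['apple_listened_hours'] * 60 * 60

# Episodes with zero listeners have no finite position on a log-scaled axis.
zero_listener_episodes = toy.loc[toy['apple_listeners'] == 0, 'episode_id'].tolist()

print(toy)
print('No listeners:', zero_listener_episodes)
```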

# Lot more visuals to follow. Do leave an upvote if you liked my work:)