##### Rough project parameters
- Statistics that YouTube creators care about include view count, subscriber count, watch time, average view duration, and click-through rate
- Some sort of automation component regarding data collection and processing
- Insights / recommendations based on data analysis
- Integrate data into web applications
- Be creative! Show that I know some statistics!

##### Instructions / Helpers
- Using the channels, playlistItems, and videos resources of the YouTube Data API
- Website to convert a handle (@) to a channel ID: https://www.streamweasels.com/tools/youtube-channel-id-and-user-id-convertor/

### Loading Python Libraries

In [378]:
# loading necessary libraries
import pandas as pd
from googleapiclient.discovery import build
import datetime

### Accessing the YouTube API

##### Accessing the Channel

In [379]:
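# key to access the YouTube API, read from an environment variable so that it is
# not stored in the notebook (the variable name YOUTUBE_API_KEY is an assumption)
import os
api_key = os.environ["YOUTUBE_API_KEY"]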

# interacting with the API
api_service_name = "youtube"
api_version = "v3"

youtube = build(
    api_service_name, api_version, developerKey = api_key)

request = youtube.channels().list(
    part="snippet,contentDetails,statistics",

    # unique channel id that corresponds to the channel I'm interested in
    id="UCIPPMRA040LQr5QPyJEbmXA"
)
channel_response = request.execute()


In [380]:
number_of_subscribers = int(channel_response['items'][0]['statistics']['subscriberCount'])
number_of_views = int(channel_response['items'][0]['statistics']['viewCount'])
number_of_videos = int(channel_response['items'][0]['statistics']['videoCount'])
uploads_id = channel_response['items'][0]['contentDetails']['relatedPlaylists']['uploads']

print('Here are some statistics about the channel, MrBeast Gaming:')
print("Number of subscribers:", number_of_subscribers)
print("Number of views:", number_of_views)
print("Number of videos:", number_of_videos)
print("Upload ID:", uploads_id)

Here are some statistics about the channel, MrBeast Gaming:
Number of subscribers: 30700000
Number of views: 5406449878
Number of videos: 138
Upload ID: UUIPPMRA040LQr5QPyJEbmXA
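
The API returns the counts as strings nested a few levels deep, which is why the cell above indexes into the response and casts with int(). A minimal sketch of the same extraction wrapped in a helper is shown below. The function name extract_channel_stats and the hand-built sample_response are my own illustrations (not part of the API), and the sketch assumes an unknown channel ID yields an empty or missing 'items' list.

```python
def extract_channel_stats(response):
    """Pull the headline numbers out of a channels().list() response dict."""
    items = response.get("items", [])
    if not items:
        raise ValueError("no channel found for the requested ID")
    channel = items[0]
    stats = channel["statistics"]
    return {
        # the API reports counts as strings, so cast them to int
        "subscribers": int(stats["subscriberCount"]),
        "views": int(stats["viewCount"]),
        "videos": int(stats["videoCount"]),
        "uploads_id": channel["contentDetails"]["relatedPlaylists"]["uploads"],
    }

# Hand-built stand-in for channel_response, using the values printed above.
sample_response = {
    "items": [{
        "statistics": {
            "subscriberCount": "30700000",
            "viewCount": "5406449878",
            "videoCount": "138",
        },
        "contentDetails": {"relatedPlaylists": {"uploads": "UUIPPMRA040LQr5QPyJEbmXA"}},
    }]
}
channel_stats = extract_channel_stats(sample_response)
print(channel_stats)
```

Raising an error on an empty result makes a mistyped channel ID fail immediately instead of surfacing later as an IndexError.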


##### Accessing the Uploaded Videos

In [381]:
request = youtube.playlistItems().list(
    part="contentDetails",
    # uploads playlist ID retrieved from the channel response above
    playlistId=uploads_id,
    # ask for the largest page size from the first request onward
    maxResults=50
)
videos_response = request.execute()

videos = []
for item in videos_response['items']:
    videos.append(item['contentDetails']['videoId'])

next_page_token = videos_response.get('nextPageToken')
while next_page_token is not None:
    request = youtube.playlistItems().list(
                part='contentDetails',
                playlistId = "UUIPPMRA040LQr5QPyJEbmXA",
                maxResults = 50,
                pageToken = next_page_token)
    videos_response = request.execute()

    for item in videos_response['items']:
        videos.append(item['contentDetails']['videoId'])

    next_page_token = videos_response.get('nextPageToken')
print('We have successfully accessed', len(videos), 'videos from the channel.')
print("There are actually", number_of_videos, "videos on the channel.")
print('This is a difference of', number_of_videos - len(videos), 'videos.')

We have successfully accessed 138 videos from the channel.
There are actually 138 videos on the channel.
This is a difference of 0 videos.
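
The loop above follows nextPageToken until the API stops returning one. A minimal sketch of that pattern, separated from the YouTube client so it can be checked offline, could look like this. The helper name collect_all_pages and the fake_pages dictionary are hypothetical and exist only to exercise the loop.

```python
def collect_all_pages(fetch_page):
    """Call fetch_page(page_token) until no nextPageToken remains; return every item."""
    items = []
    page_token = None  # the first request carries no token
    while True:
        response = fetch_page(page_token)
        items.extend(response.get("items", []))
        page_token = response.get("nextPageToken")
        if page_token is None:
            return items

# Fake two-page API used only to exercise the loop (not real YouTube data).
fake_pages = {
    None: {
        "items": [{"contentDetails": {"videoId": "a"}},
                  {"contentDetails": {"videoId": "b"}}],
        "nextPageToken": "p2",
    },
    "p2": {"items": [{"contentDetails": {"videoId": "c"}}]},
}
video_ids = [item["contentDetails"]["videoId"]
             for item in collect_all_pages(fake_pages.get)]
print(video_ids)
```

With the real client, fetch_page would be a small wrapper that builds and executes the playlistItems request for a given page token.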


##### Turning Video Information from a .JSON into a DataFrame

In [382]:
temp = []
for i in range(len(videos)):
    # getting the information about the ith video
    video_stats_request = youtube.videos().list(
            part="snippet,contentDetails,statistics",
            id = videos[i]
        )
    video_stats_request = video_stats_request.execute()
    # getting the video type
    video_type = video_stats_request['items'][0]['kind'].split('#')[1]
    # getting the title
    title = video_stats_request['items'][0]['snippet']['title']
    # getting the publish date
    publish_date = video_stats_request['items'][0]['snippet']['publishedAt']
    # getting the number of views
    views = int(video_stats_request['items'][0]['statistics']['viewCount'])
    # getting the number of likes
    likes = int(video_stats_request['items'][0]['statistics']['likeCount'])
    # getting the number of comments
    comments = int(video_stats_request['items'][0]['statistics']['commentCount'])
    # getting the duration
    duration = video_stats_request['items'][0]['contentDetails']['duration']

    temp.append([title, publish_date, views, likes, comments, duration, video_type])
video_statistics = pd.DataFrame(temp, columns = ['Title', 'Publish Date', 'Views', 'Likes', 'Comments', 'Duration', 'Video Type'])
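
The cell above makes one API call per video. The id parameter of videos().list() also accepts a comma-separated list of IDs (as far as I know, up to 50 per request), so the same data could be fetched in far fewer calls. Below is a minimal sketch of the batching step only. The helper name chunked and the placeholder IDs are my own and are not real video IDs.

```python
def chunked(ids, size=50):
    """Yield consecutive slices of ids, each holding at most `size` entries."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]

# 138 placeholder IDs, matching the number of videos on the channel.
example_ids = [f"video{i}" for i in range(138)]

# Each string below could be passed as the id argument of one videos().list() call.
batches = [",".join(batch) for batch in chunked(example_ids)]
batch_sizes = [len(batch.split(",")) for batch in batches]
print(len(batches), batch_sizes)
```

When batching, each response holds several entries in 'items', so the parsing code would loop over response['items'] rather than reading only index 0.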

In [383]:
video_statistics.head()


Unnamed: 0,Title,Publish Date,Views,Likes,Comments,Duration,Video Type
0,"If You Build It, I'll Pay For It!",2022-12-31T20:00:04Z,15687070,588399,19478,PT11M42S,video
1,World's Hardest Challenge!,2022-12-16T22:18:00Z,17389618,537037,22832,PT14M30S,video
2,100 Youtuber Minecraft Battle Royale!,2022-10-28T21:00:09Z,17501871,989063,45756,PT16M3S,video
3,"Extreme $1,000,000 Challenge!",2022-10-12T20:00:12Z,10040597,393672,11485,PT10M43S,video
4,Minecraft with Ultra Realistic Graphics!,2022-09-16T19:00:37Z,14697558,485063,12969,PT8M47S,video


### Data Cleaning

In [384]:
video_statistics.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 138 entries, 0 to 137
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Title         138 non-null    object
 1   Publish Date  138 non-null    object
 2   Views         138 non-null    int64 
 3   Likes         138 non-null    int64 
 4   Comments      138 non-null    int64 
 5   Duration      138 non-null    object
 6   Video Type    138 non-null    object
dtypes: int64(3), object(4)
memory usage: 7.7+ KB


The above code...
- Gets the data type of each variable
- Shows us that there are no missing values, which makes our lives much easier
- Shows that 'Publish Date' is not in a datetime format

In [385]:
video_statistics['Video Type'].value_counts()

video    138
Name: Video Type, dtype: int64

In [386]:
del video_statistics['Video Type']

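Every row is labeled 'video'; none is labeled as a Short. This is expected, because the value comes from the API's 'kind' field (youtube#video), which names the resource type rather than the video format. Since every value is the same, the column gives us no information, so I deleted it.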

In [387]:
# duration includes H
video_statistics[video_statistics['Duration'].str.contains('H')]

Unnamed: 0,Title,Publish Date,Views,Likes,Comments,Duration


The result is empty, so no video is an hour or longer.

The function below converts the duration from the ISO 8601 format returned by the YouTube API (for example, PT11M42S) into seconds.

In [388]:
# converting duration to seconds
def convert_to_seconds(duration):
    # sum of the total duration of the video in seconds
    duration_seconds = 0

    # remove the string 'PT' (which is present in every observation)
    duration = duration[2:]

    # If 'H' is present (the video is an hour or longer), add the hours, converted to seconds, to duration_seconds
    duration = duration.split('H')
    if len(duration) == 1:
        duration = duration[0]

    elif len(duration) == 2:
        duration_seconds += int(duration[0]) * 3600
        duration = duration[1]

    # If 'M' is present (the video is a minute or longer), add the minutes, converted to seconds, to duration_seconds
    duration = duration.split('M')
    if len(duration) == 1:
        duration = duration[0]

    elif len(duration) == 2:
        duration_seconds += int(duration[0]) * 60
        duration = duration[1]

    # add the number of seconds to the video (if present)
    if len(duration) > 0:
        duration_seconds += int(duration.split('S')[0])

    return duration_seconds

In [389]:
# store the duration of the first row as temp
temp1 = video_statistics['Duration'][0]
print((temp1))
temp2 = 'PT1H9M'
print((temp2))

PT11M42S
PT1H9M


In [390]:
print(convert_to_seconds(temp1))
print(convert_to_seconds(temp2))

702
4140
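
convert_to_seconds assumes every duration starts with 'PT', so a value with a day component such as P1DT2H would raise an error. As an alternative, here is a minimal regex-based sketch that also accepts days. The names ISO_DURATION and iso_duration_to_seconds are my own, and weeks, months, and years are deliberately left unsupported.

```python
import re

# Matches durations such as PT11M42S, PT1H9M, or P1DT2H.
ISO_DURATION = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)

def iso_duration_to_seconds(duration):
    """Convert an ISO 8601 duration (days, hours, minutes, seconds) to seconds."""
    match = ISO_DURATION.match(duration)
    if match is None:
        raise ValueError(f"unrecognized duration: {duration!r}")
    # components that are absent come back as None and count as zero
    parts = {name: int(value or 0) for name, value in match.groupdict().items()}
    return (parts["days"] * 86400 + parts["hours"] * 3600
            + parts["minutes"] * 60 + parts["seconds"])

print(iso_duration_to_seconds("PT11M42S"))  # 702
print(iso_duration_to_seconds("PT1H9M"))    # 4140
print(iso_duration_to_seconds("P1DT2H"))    # 93600
```

The first two results match the output of convert_to_seconds above, so either function could be passed to .apply() on the Duration column.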


In [391]:
# apply the function to the duration column
video_statistics['Duration in Seconds'] = video_statistics['Duration'].apply(convert_to_seconds)

In [392]:
# getting duration in minutes
video_statistics['Duration in Minutes'] = video_statistics['Duration in Seconds'] / 60

In [393]:
# deleting the original duration column
del video_statistics['Duration']

##### Converting the Publish Date to a Date Format and Determining the Days Since Published

Converting to a datetime format is necessary before we can do date arithmetic, such as computing the number of days since a video was published.

In [394]:
video_statistics['Publish Date'].head()

0    2022-12-31T20:00:04Z
1    2022-12-16T22:18:00Z
2    2022-10-28T21:00:09Z
3    2022-10-12T20:00:12Z
4    2022-09-16T19:00:37Z
Name: Publish Date, dtype: object

In [395]:
# converting the publish date to datetime
video_statistics['Publish Date'] = pd.to_datetime(video_statistics['Publish Date'])

In [396]:
video_statistics['Publish Date'].head()

0   2022-12-31 20:00:04+00:00
1   2022-12-16 22:18:00+00:00
2   2022-10-28 21:00:09+00:00
3   2022-10-12 20:00:12+00:00
4   2022-09-16 19:00:37+00:00
Name: Publish Date, dtype: datetime64[ns, UTC]

In [397]:
# converting the time zone from UTC to EST (a fixed UTC-05:00 offset that does not observe daylight saving time)
video_statistics['Publish Date'] = video_statistics['Publish Date'].dt.tz_convert('EST')

In [398]:
video_statistics['Publish Date'].head()

0   2022-12-31 15:00:04-05:00
1   2022-12-16 17:18:00-05:00
2   2022-10-28 16:00:09-05:00
3   2022-10-12 15:00:12-05:00
4   2022-09-16 14:00:37-05:00
Name: Publish Date, dtype: datetime64[ns, EST]
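
Note that every row above shows an offset of -05:00, including the September and October videos, because 'EST' is a fixed offset. If the goal is US Eastern wall-clock time, a region name such as 'America/New_York' (which could be passed to .dt.tz_convert in place of 'EST') accounts for daylight saving time. The sketch below uses only the standard library to show the one-hour difference for row 2. The variable names are my own.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

# Publish time of row 2 in the table above, in UTC.
published_utc = datetime(2022, 10, 28, 21, 0, 9, tzinfo=timezone.utc)

fixed_est = timezone(timedelta(hours=-5), "EST")  # always UTC-05:00
new_york = ZoneInfo("America/New_York")           # switches between EST and EDT

in_fixed_est = published_utc.astimezone(fixed_est)
in_new_york = published_utc.astimezone(new_york)
print(in_fixed_est)  # 2022-10-28 16:00:09-05:00
print(in_new_york)   # 2022-10-28 17:00:09-04:00
```

For the December videos the two zones agree, so the difference only affects videos published while daylight saving time was in effect.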

In [399]:
video_statistics['Publish Date'] = video_statistics['Publish Date'].dt.tz_localize(None)

In [400]:
video_statistics['Publish Date'].tail()

133   2020-05-22 15:01:29
134   2020-05-20 15:01:40
135   2020-05-16 14:35:31
136   2020-05-14 15:16:51
137   2020-05-12 15:00:11
Name: Publish Date, dtype: datetime64[ns]

In [401]:
# first row of datetime stored as temp
temp = video_statistics['Publish Date'][0]


In [402]:
# getting the difference (in days) between the current time and the publish date from the column Publish Date
video_statistics['Days Since Published'] = (datetime.datetime.now() - video_statistics['Publish Date']).dt.days

In [403]:
video_statistics['Days Since Published']

0       11
1       26
2       75
3       91
4      117
      ... 
133    964
134    966
135    970
136    972
137    974
Name: Days Since Published, Length: 138, dtype: int64
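
The calculation above subtracts timezone-naive EST timestamps from datetime.datetime.now(), which uses the local clock of whatever machine runs the notebook, so the result can shift by a day if that clock is not on EST. A minimal sketch that keeps both sides in UTC is shown below. The function name days_since and the fixed reference time are assumptions chosen for illustration, not the notebook's actual run time.

```python
from datetime import datetime, timezone

def days_since(published_at, now=None):
    """Whole days between an ISO 8601 UTC timestamp and `now` (default: current UTC time)."""
    # fromisoformat() only accepts a trailing 'Z' on Python 3.11+, so normalize it first
    published = datetime.fromisoformat(published_at.replace("Z", "+00:00"))
    if now is None:
        now = datetime.now(timezone.utc)
    return (now - published).days

# Fixed reference time chosen for illustration (an assumption).
reference = datetime(2023, 1, 12, 0, 0, 0, tzinfo=timezone.utc)
print(days_since("2022-12-31T20:00:04Z", now=reference))  # 11
print(days_since("2022-12-16T22:18:00Z", now=reference))  # 26
```

Passing now explicitly also makes the result reproducible, which helps when rerunning the notebook on a later date.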

In [404]:
video_statistics.head()

Unnamed: 0,Title,Publish Date,Views,Likes,Comments,Duration in Seconds,Duration in Minutes,Days Since Published
0,"If You Build It, I'll Pay For It!",2022-12-31 15:00:04,15687070,588399,19478,702,11.7,11
1,World's Hardest Challenge!,2022-12-16 17:18:00,17389618,537037,22832,870,14.5,26
2,100 Youtuber Minecraft Battle Royale!,2022-10-28 16:00:09,17501871,989063,45756,963,16.05,75
3,"Extreme $1,000,000 Challenge!",2022-10-12 15:00:12,10040597,393672,11485,643,10.716667,91
4,Minecraft with Ultra Realistic Graphics!,2022-09-16 14:00:37,14697558,485063,12969,527,8.783333,117


In [405]:
video_statistics['Publish Date']

0     2022-12-31 15:00:04
1     2022-12-16 17:18:00
2     2022-10-28 16:00:09
3     2022-10-12 15:00:12
4     2022-09-16 14:00:37
              ...        
133   2020-05-22 15:01:29
134   2020-05-20 15:01:40
135   2020-05-16 14:35:31
136   2020-05-14 15:16:51
137   2020-05-12 15:00:11
Name: Publish Date, Length: 138, dtype: datetime64[ns]

The code below computes the comment-to-view ratio and the like-to-view ratio for each video.

In [406]:
# comment to views ratio
video_statistics['Comment to View Ratio'] = video_statistics['Comments'] / video_statistics['Views']
# like to view ratio
video_statistics['Like to View Ratio'] = video_statistics['Likes'] / video_statistics['Views']

In [407]:
video_statistics.head()

Unnamed: 0,Title,Publish Date,Views,Likes,Comments,Duration in Seconds,Duration in Minutes,Days Since Published,Comment to View Ratio,Like to View Ratio
0,"If You Build It, I'll Pay For It!",2022-12-31 15:00:04,15687070,588399,19478,702,11.7,11,0.001242,0.037509
1,World's Hardest Challenge!,2022-12-16 17:18:00,17389618,537037,22832,870,14.5,26,0.001313,0.030883
2,100 Youtuber Minecraft Battle Royale!,2022-10-28 16:00:09,17501871,989063,45756,963,16.05,75,0.002614,0.056512
3,"Extreme $1,000,000 Challenge!",2022-10-12 15:00:12,10040597,393672,11485,643,10.716667,91,0.001144,0.039208
4,Minecraft with Ultra Realistic Graphics!,2022-09-16 14:00:37,14697558,485063,12969,527,8.783333,117,0.000882,0.033003


In [408]:
# new column which is the number of views per day
video_statistics['Views per Day'] = round(video_statistics['Views'] / video_statistics['Days Since Published'],1)
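
One edge case in the division above: a video published less than a day ago has 0 in 'Days Since Published', which produces an infinite value. A minimal sketch of one possible guard is shown below, treating such a video as one day old. The function name views_per_day and the one-day floor are my own choices; in pandas the same floor could be applied with .clip(lower=1) on the days column.

```python
def views_per_day(views, days_since_published):
    """Average daily views, counting a video published today as one day old."""
    return round(views / max(days_since_published, 1), 1)

print(views_per_day(15687070, 11))  # 1426097.3
print(views_per_day(500, 0))        # 500.0
```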

In [409]:
# renaming publish date to publish time (EST)
video_statistics.rename(columns = {'Publish Date':'Publish Time (EST)'}, inplace = True)


In [410]:
# taking Publish Time (EST), keeping only the date, and storing it in a new column, Publish Date
video_statistics['Publish Date'] = video_statistics['Publish Time (EST)'].dt.date

In [411]:
# adding a new column Title and Day Published which is the title and the publish date
video_statistics['Title and Day Published'] = video_statistics['Title'] + ' - (' + video_statistics['Publish Date'].astype(str) + ')'

In [415]:
# switching the order of the columns
video_statistics = video_statistics[['Title and Day Published', 'Title', 'Publish Date', 'Publish Time (EST)', 'Days Since Published', 'Views', 'Views per Day', 'Likes', 'Like to View Ratio', 'Comments', 'Comment to View Ratio', 'Duration in Seconds', 'Duration in Minutes']]

In [416]:
video_statistics.head()

Unnamed: 0,Title and Day Published,Title,Publish Date,Publish Time (EST),Days Since Published,Views,Views per Day,Likes,Like to View Ratio,Comments,Comment to View Ratio,Duration in Seconds,Duration in Minutes
0,"If You Build It, I'll Pay For It! - (2022-12-31)","If You Build It, I'll Pay For It!",2022-12-31,2022-12-31 15:00:04,11,15687070,1426097.3,588399,0.037509,19478,0.001242,702,11.7
1,World's Hardest Challenge! - (2022-12-16),World's Hardest Challenge!,2022-12-16,2022-12-16 17:18:00,26,17389618,668831.5,537037,0.030883,22832,0.001313,870,14.5
2,100 Youtuber Minecraft Battle Royale! - (2022-...,100 Youtuber Minecraft Battle Royale!,2022-10-28,2022-10-28 16:00:09,75,17501871,233358.3,989063,0.056512,45756,0.002614,963,16.05
3,"Extreme $1,000,000 Challenge! - (2022-10-12)","Extreme $1,000,000 Challenge!",2022-10-12,2022-10-12 15:00:12,91,10040597,110336.2,393672,0.039208,11485,0.001144,643,10.716667
4,Minecraft with Ultra Realistic Graphics! - (20...,Minecraft with Ultra Realistic Graphics!,2022-09-16,2022-09-16 14:00:37,117,14697558,125620.2,485063,0.033003,12969,0.000882,527,8.783333


### Exporting the Data to a CSV File

In [417]:
# write to csv
video_statistics.to_csv('video_statistics.csv', index = False)

### Basic Exploratory Data Analysis

In [414]:
# sorting by number of views
video_statistics.sort_values(by = 'Views', ascending = False).head(10)

Unnamed: 0,Title,Publish Time (EST),Views,Likes,Comments,Duration in Seconds,Duration in Minutes,Days Since Published,Comment to View Ratio,Like to View Ratio,Views per Day,Publish Date,Title and Day Published
36,World’s Largest Explosion!,2021-04-07 13:45:24,112736330,1585856,76483,512,8.533333,644,0.000678,0.014067,175056.4,2021-04-07,World’s Largest Explosion! - (2021-04-07)
102,"Whatever You Build, I'll Pay For!",2020-08-06 12:00:23,102415804,4901080,201501,668,11.133333,888,0.001967,0.047855,115333.1,2020-08-06,"Whatever You Build, I'll Pay For! - (2020-08-06)"
53,"Minecraft, But It's Only One Block!",2020-12-17 15:14:17,90890829,1054679,34053,607,10.116667,755,0.000375,0.011604,120385.2,2020-12-17,"Minecraft, But It's Only One Block! - (2020-12..."
50,"If You Build a House, I'll Pay For It!",2021-01-02 14:07:38,90081433,2218173,83515,610,10.166667,739,0.000927,0.024624,121896.4,2021-01-02,"If You Build a House, I'll Pay For It! - (2021..."
103,"Minecraft, But Everything is Random!",2020-08-02 12:17:36,84726686,1211744,47824,642,10.7,892,0.000564,0.014302,94985.1,2020-08-02,"Minecraft, But Everything is Random! - (2020-0..."
47,1000 Zombies vs Mutant Enderman!,2021-01-27 14:00:23,79373721,1453163,50040,616,10.266667,714,0.00063,0.018308,111167.7,2021-01-27,1000 Zombies vs Mutant Enderman! - (2021-01-27)
26,I Survived 100 Days Of Hardcore Minecraft!,2021-07-22 14:46:51,77853993,1315027,78901,937,15.616667,538,0.001013,0.016891,144710.0,2021-07-22,I Survived 100 Days Of Hardcore Minecraft! - (...
83,The Most Insane 900 IQ Among Us Outplay!,2020-09-22 12:45:34,75622411,1582911,50644,609,10.15,841,0.00067,0.020932,89919.6,2020-09-22,The Most Insane 900 IQ Among Us Outplay! - (20...
115,I Made a 100 Player Building Competition!,2020-07-03 11:30:15,73345759,2691120,43939,697,11.616667,922,0.000599,0.036691,79550.7,2020-07-03,I Made a 100 Player Building Competition! - (2...
18,"$45,600 Squid Game Challenge!",2021-10-14 13:00:17,70602968,1724321,55941,681,11.35,454,0.000792,0.024423,155513.1,2021-10-14,"$45,600 Squid Game Challenge! - (2021-10-14)"


### Future Analysis
- Analyzing how MrBeast's media appearances (podcasts, videos with other creators) impact channel views