## Phase I Project Proposal
### What Makes a Marketing Video Successful on YouTube?

#### Name: Owen Sweetman, DS 3000


### Introduction

What makes a marketing or brand video successful on YouTube? Many features might affect the popularity and reach of a marketing video. I'm interested in examining whether a video's length, posting frequency, engagement rate (likes, comments, shares), or metadata (title, description, tags) make a brand's video more or less likely to succeed. I would also like to see whether these features can help me predict which type of content (tutorials, ads, testimonials, entertainment) tends to generate the highest engagement.

Both questions have practical uses: answering the first may let me recommend best practices for video creation and optimization to brands, while answering the second may help businesses tailor their YouTube strategies to maximize audience growth and conversion. Other interesting questions could also be addressed given enough time and data.


### Data Collection

I plan to use YouTube's Data API to collect data on videos from top-performing brand and marketing channels. These channels represent popular recent marketing content, which will help me target the most up-to-date information relevant to my questions of interest. The YouTube API is fairly straightforward to use and provides metrics such as view count, like count, and comment count, along with other metadata. Below I demonstrate how I can access and read in the relevant data.

**Note:** The code below requires YouTube API credentials, including an API key that must not be shared. If you need to run the code yourself, you can create your own developer key by registering on the Google Cloud Console. For this proposal, I demonstrate the structure of the code needed to authenticate and collect data, and I will also save the dataset as a .csv file in case API access fails in the future.


# Get the API and Load the Credentials to Access it
from googleapiclient.discovery import build
import pandas as pd

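# Read the secret API key from an environment variable so it is never stored in the notebook
# (the variable name YT_API_KEY is a choice made here; set it before running)
import os
YT_API_KEY = os.environ["YT_API_KEY"]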

# Build the YouTube API client
youtube = build("youtube", "v3", developerKey=YT_API_KEY)

# Example: Get videos from a marketing/brand channel (replace with actual channel ID)
channel_id = "UC_x5XG1OV2P6uZZ5FSM9Ttw"  # Example: Google Developers channel

# Request the uploads playlist ID
channel_response = youtube.channels().list(
    part="contentDetails",
    id=channel_id
).execute()

uploads_playlist_id = channel_response["items"][0]["contentDetails"]["relatedPlaylists"]["uploads"]

# Collect video details
playlist_items = youtube.playlistItems().list(
    part="snippet",
    playlistId=uploads_playlist_id,
    maxResults=50
).execute()

video_ids = [item["snippet"]["resourceId"]["videoId"] for item in playlist_items["items"]]

# Fetch video statistics in one batched request
# (videos().list accepts up to 50 comma-separated IDs, matching maxResults above)
stats = youtube.videos().list(
    part="snippet,statistics,contentDetails",
    id=",".join(video_ids)
).execute()
video_data = stats["items"]

# Example structure for collected data
video_dict = {
    "video_id": [v["id"] for v in video_data],
    "title": [v["snippet"]["title"] for v in video_data],
    "published": [v["snippet"]["publishedAt"] for v in video_data],
    "views": [int(v["statistics"].get("viewCount", 0)) for v in video_data],
    "likes": [int(v["statistics"].get("likeCount", 0)) for v in video_data],
    "comments": [int(v["statistics"].get("commentCount", 0)) for v in video_data],
    "duration": [v["contentDetails"]["duration"] for v in video_data],
    "tags": [v["snippet"].get("tags", []) for v in video_data]
}

df = pd.DataFrame(video_dict)
df.head()


|   | video_id | title | published | views | likes | comments | duration | tags |
|---|----------|-------|-----------|-------|-------|----------|----------|------|
| 0 | e5onzptQghg | Just in from the news desk 📰: Google Play Game... | 2025-10-02T16:01:07Z | 2833 | 59 | 2 | PT55S | [Google, developers, pr_pr: Google for Develop... |
| 1 | qPSqowMfAaE | The infinite loop of estimating project timeli... | 2025-10-02T04:01:05Z | 35934 | 532 | 19 | PT26S | [Google, developers, pr_pr: Google for Develop... |
| 2 | HtOapu15a-4 | When testing, lean towards DAMP 💧 | 2025-10-01T23:00:31Z | 11797 | 560 | 12 | PT1M9S | [Google, developers, pr_pr: Google for Develop... |
| 3 | jRt7HKcZffA | 5 things you can do with Gemini CLI! | 2025-10-01T04:00:56Z | 7304 | 293 | 13 | PT1M9S | [Google, developers, pr_pr: Google for Develop... |
| 4 | BxOrnFtdKGM | What's your go-to tip to grow as a new dev? 🤔 | 2025-09-30T23:00:23Z | 4597 | 128 | 4 | PT43S | [Google, developers, pr_pr: Google for Develop... |
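
The collection code above requests a single page of at most 50 playlist items. To gather more videos per channel, the API's `nextPageToken` / `pageToken` mechanism could be followed page by page. The sketch below is one possible way to do that, not part of the original workflow: `collect_video_ids`, `fetch_page`, and `max_pages` are hypothetical names introduced here, and a small fake response dictionary stands in for the real API so the example runs without credentials.

```python
def collect_video_ids(fetch_page, max_pages=10):
    """Gather video IDs across result pages.

    `fetch_page(page_token)` is a hypothetical callable that returns one
    playlistItems response dict; `page_token` is None for the first page.
    """
    video_ids = []
    page_token = None
    for _ in range(max_pages):
        response = fetch_page(page_token)
        video_ids.extend(
            item["snippet"]["resourceId"]["videoId"]
            for item in response.get("items", [])
        )
        # The API includes nextPageToken only when more results remain.
        page_token = response.get("nextPageToken")
        if page_token is None:
            break
    return video_ids


# Stand-in for the real API so the sketch runs without credentials.
fake_pages = {
    None: {
        "items": [{"snippet": {"resourceId": {"videoId": "a"}}}],
        "nextPageToken": "p2",
    },
    "p2": {"items": [{"snippet": {"resourceId": {"videoId": "b"}}}]},
}
print(collect_video_ids(lambda token: fake_pages[token]))  # ['a', 'b']
```

With the real client, `fetch_page` could wrap `youtube.playlistItems().list(...)` and pass the token through its `pageToken` parameter, leaving that argument out on the first call. The `max_pages` cap is a safeguard for the daily API quota.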


### Data Usage and Remaining Issues

The dataset above is already mostly clean, but some issues remain. For example, YouTube video durations are returned in ISO 8601 format (e.g., "PT5M23S") and need to be converted into total seconds for analysis. Similarly, tags are returned as lists and may need to be one-hot encoded or otherwise processed to extract meaningful categorical features.
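
As a minimal sketch of these two cleaning steps, the code below converts an ISO 8601 duration string to seconds and builds 0/1 indicator columns for a chosen set of tags. The function names `iso8601_duration_to_seconds` and `tag_indicators` are hypothetical helpers introduced here. The regular expression assumes durations use only day, hour, minute, and second parts, as in the sample output.

```python
import re

# Matches durations such as "PT55S", "PT1M9S", or "PT1H2M3S" (optional day part).
_DURATION_RE = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def iso8601_duration_to_seconds(duration: str) -> int:
    """Convert an ISO 8601 duration string like 'PT5M23S' to total seconds."""
    match = _DURATION_RE.match(duration)
    if match is None:
        raise ValueError(f"Unrecognized duration: {duration!r}")
    # Missing components (e.g. no hours) come back as None and count as 0.
    parts = {name: int(value or 0) for name, value in match.groupdict().items()}
    return (
        parts["days"] * 86400
        + parts["hours"] * 3600
        + parts["minutes"] * 60
        + parts["seconds"]
    )


def tag_indicators(tag_lists, vocabulary):
    """Build one 0/1 column per tag in `vocabulary` (a simple multi-hot encoding)."""
    return {
        tag: [1 if tag in tags else 0 for tags in tag_lists]
        for tag in vocabulary
    }


print(iso8601_duration_to_seconds("PT5M23S"))  # 323
print(iso8601_duration_to_seconds("PT1M9S"))   # 69
```

Applied to the DataFrame, the first helper could be used as `df["duration"].map(iso8601_duration_to_seconds)`, and the dictionary returned by `tag_indicators` could be passed to `pd.DataFrame` to get one column per tag.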

However, I have plenty of numeric features (views, likes, comments, duration) and categorical/textual features (tags, titles, categories). These may be useful in answering my first question (what makes a video more successful) and then in answering my second (which type of content performs better). 

While we have not covered any ML models in class yet, I believe regression could help predict a numeric feature (like view count or engagement rate), and classification could help predict categorical features (like "high engagement" vs. "low engagement" videos). There may also be clustering methods that help me identify different types of brand video strategies that naturally group together in the dataset.
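
Before any model can be fit, the targets mentioned above have to be defined. The sketch below shows one possible definition and is an assumption rather than a settled choice: engagement rate is taken as (likes + comments) / views, and a video is labeled "high" when its rate is above the median rate. The helper names `engagement_rate` and `label_engagement` are hypothetical, and the numbers come from the five sample rows shown earlier.

```python
from statistics import median


def engagement_rate(views, likes, comments):
    """(likes + comments) / views, with 0.0 for videos that have no views."""
    return (likes + comments) / views if views else 0.0


def label_engagement(rates):
    """Label each rate 'high' if it is above the median rate, else 'low'."""
    cutoff = median(rates)
    return ["high" if rate > cutoff else "low" for rate in rates]


# (views, likes, comments) for the five sample videos shown earlier.
sample = [
    (2833, 59, 2),
    (35934, 532, 19),
    (11797, 560, 12),
    (7304, 293, 13),
    (4597, 128, 4),
]
rates = [engagement_rate(*row) for row in sample]
print(label_engagement(rates))  # ['low', 'low', 'high', 'high', 'low']
```

The numeric rate could serve as a regression target and the high/low label as a classification target. A different cutoff, such as the top quartile, would change the class balance.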
