# **Description of dataset:**

I am only focusing on the US dataset as I am strictly analyzing the US YouTube videos.

The US dataset is collected directly from YouTube using its API tool, which means the source is internal. As for the type, it is activity data, as it captures a record of viewer engagement online. The data is structured: it is organized in rows and columns with category headings and can easily be laid out in a spreadsheet. The form is tabular.

This dataset has up-to-date statistics on the top trending videos on the YouTube platform in the US from August 2020 to March 2022. It updates daily to include the trendiest videos, defined by how viral they are, or how quickly they generate a large audience online. Numerical variables that measure video popularity include the number of likes and comments the videos received from their viewers. The categorical variables include the video category, which identifies the type of content of the videos.

Please note these statistics reflect the dates when the videos were first trending, not today's values. This ensures fair comparisons between videos.

The dataset was obtained directly from the YouTube API (Application Programming Interface). The tool used a combination of factors, including measures of user interaction (e.g. number of comments), to gather the statistics.




To begin analyzing the dataset, I need to ask my first question: what are the top three and bottom three most popular video categories? By finding out the most popular categories, we can run ad campaigns for top content creators from those categories. As for the categories that are not as strong, we will focus less on advertising for those videos at the moment and find ways to increase their popularity.

The outcome variables will be most_popular, second_popular, and third_popular. The most_popular variable will define the number one most popular video category, the second_popular variable will define the second most popular category, and the third_popular variable will define the third most popular category. Popularity is defined by how quickly the videos could generate view count once they are published on YouTube.

Having these three variables will allow me to discover the view count of the three most popular categories. I plan to calculate the average views of videos in the categories and compare them. The top three highest view counts will be the most popular categories.

On the other hand, we will have least_popular, second_least_popular, and third_least_popular. They will help me define the least popular categories.

There will be two models. The first model is view count mean vs. median. The variables are video category, mean view count, and median view count. I will compare the mean view count of each category to its median view count.

The second model is the average view count. I will add the mean and median view count of each category and divide the sum by two. This will capture the whole story of the popularity of the categories.
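The two models can be sketched on a tiny table before touching the real data. This is a minimal illustration only: the `toy` frame, its values, and the column name `popularity_rating` are made up for the example and are not taken from the YouTube dataset.

```python
import pandas as pd

# Toy data (made up for illustration; not taken from the YouTube dataset).
toy = pd.DataFrame({
    "category": ["Music", "Music", "Music", "Gaming", "Gaming", "Gaming"],
    "view_count": [100, 200, 900, 300, 400, 500],
})

# Model 1: median and mean view count per category.
summary = toy.groupby("category")["view_count"].agg(["median", "mean"])

# Model 2: the rating is the midpoint of the median and the mean.
summary["popularity_rating"] = (summary["median"] + summary["mean"]) / 2

print(summary)
```

In this toy case Music has the same mean as Gaming but a lower median, so its midpoint rating ends up lower. That is the kind of difference the second model is meant to surface.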

I hypothesize entertainment, gaming, and music will be the top three categories, while travel & events, news & politics, and howto & style will be the least popular. This is based on my personal experience with YouTube. Their median and mean values should be the top three and bottom three as well.

Lastly, I will conduct analysis to calculate the category average view counts and plot a graph and table to visualize the data.

To prepare for the analysis that answers my first question, I will begin by cleaning the raw dataset that is imported below.



*Terminology Key:

Caveat: although these terms could be used interchangeably, please note they are not exact synonyms.

* Popularity = view count
* Likability = like count
* Engagement = comment count*

In [None]:
import numpy as np # import the numpy package with alias np for linear algebra
import pandas as pd # import the pandas package with alias pd for data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # import the matplotlib package with alias plt for data visualization

# Input data files are available in the read-only "../input/" directory
# List all files under the input directory. Each file has a dataset.

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# Import the raw US dataset into Kaggle and define it as youtube_raw. It will be used for further analysis and coding.
# Print the first five rows to get a glimpse of the dataset's layout and structure.
# Print the whole DataFrame (pandas shows the first and last five rows) to get an idea of how large it is. There are 115591 rows by 16 columns.
youtube_raw = pd.read_csv("/kaggle/input/youtube-trending-video-dataset/US_youtube_trending_data.csv")
print(youtube_raw.head())
print(youtube_raw)



In [None]:
# Check individual values for missing values
print(youtube_raw.isna())
# Check each column for missing values
print(youtube_raw.isna().any())
# Bar plot of missing values by variable
youtube_raw.isna().sum().plot(kind="bar")

# Show plot
plt.show()

Besides the variable *description*, which is the video description written by the content creator, there are no missing values. This is a good sign, as we have data for every column/variable except *description*. Since none of the questions we ask involve this variable, our analysis will not be affected by the missing values. We do not need to remove any rows.
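The same check can be condensed into a per-column count so the affected columns stand out at a glance. The sketch below runs on a small made-up frame; the names `toy`, `columns_with_gaps`, and `missing_share` are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with gaps only in "description".
toy = pd.DataFrame({
    "video_id": ["a", "b", "c", "d"],
    "view_count": [10, 20, 30, 40],
    "description": ["intro", np.nan, "vlog", np.nan],
})

# Count missing values per column and keep only the columns that have any.
missing_counts = toy.isna().sum()
columns_with_gaps = missing_counts[missing_counts > 0]

# Share of rows affected in those columns, as a percentage.
missing_share = (columns_with_gaps / len(toy) * 100).round(1)

print(columns_with_gaps)
print(missing_share)
```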

In [None]:
# Keep one row per video. video_id is a unique identifier, so rows that share a video_id are repeat entries for the same video.
# drop_duplicates keeps the first occurrence of each video_id (keep="first" is the default) and drops the later repeats.
youtube_no_duplicate = youtube_raw.drop_duplicates(subset="video_id", keep="first")

#Replace all 0 values with NaN ("Not a Number", the marker pandas uses for missing data). Numeric variables like comment_count may have values of 0 because the comment section is disabled. By replacing the 0s with NaN, we don't skew the data or make the average lower than it should be.
youtube_cleaned = youtube_no_duplicate.replace(0, np.nan)


#Replace all False and True values with "no" and "yes" values respectively. This makes the data more intuitive for non-coders, especially when it is being graphed and plotted.
youtube_cleaned_2 = youtube_cleaned.replace({False: "no", True: "yes",})


#Replace all categoryId ID values with the category's actual name. I used this source to do so: https://techpostplus.com/youtube-video-categories-list-faqs-and-solutions/ 
youtube_cleaned_almost_final = youtube_cleaned_2.replace({'categoryId' : { 1 : "Film & Animation", 2 : "Autos & Vehicles", 10 : "Music", 15 : "Pets & Animals", 17 : "Sports", 19 : "Travel & Events", 20 : "Gaming", 22 : "People & Blogs", 23 : "Comedy", 24 : "Entertainment", 25 : "News & Politics", 26 : "Howto & Style", 27 : "Education", 28 : "Science & Technology", 29 : "Nonprofits & Activism"  }})


#Rename variables like categoryId to category name so it becomes more intuitive. 
youtube_cleaned_final = youtube_cleaned_almost_final.rename(columns={"categoryId":"category name", "view_count":"view count", "likes": "like count", "dislikes": "dislike count", "comments_disabled": "comments disabled", "ratings_disabled": "ratings disabled", "comment_count": "comment count"  })
print(youtube_cleaned_final)

*Coder's note (not part of the project):* I originally believed the code above dropped ALL rows that share a video ID, when what I want is to keep only the first row for each video. In fact, `drop_duplicates` keeps the first occurrence by default (`keep="first"`), so the code above already does what I want. My earlier attempt was:

youtube_no_duplicate = youtube_no_duplicate_1.duplicated(keep="first")
print(youtube_no_duplicate)

It returned errors, most likely because `youtube_no_duplicate_1` was never defined. In addition, `duplicated()` returns a boolean Series that flags the repeats rather than a filtered DataFrame. To make sure the handling of duplicates does not affect the accuracy of my final results enough to make this analysis meaningless, I tested the entire project with the original raw (uncleaned) dataset and with the cleaned dataset with no duplicates dropped. The outcome variables were nearly identical. The values that quantify the outcome variables did change somewhat, but they either followed the same trend (e.g. the highest value that is an outlier increased, making it an even more obvious outlier) or changed so little that they barely affected the final results. This is why I chose the mean and median for all of the models below. They should offset any "unevenness" in the data. For example, certain categories have more rows of data than others, so using sum() would not capture the popularity of a category as accurately as the mean and median. Moreover, the enormous size of the dataset reduces the influence of outliers and human errors to the point that the final results should be the same, or at least show the same trend.
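As a reference for the note above, here is a small sketch of how the two pandas methods behave on a toy trending log. The data is made up, and the variable names `is_repeat`, `first_rows`, and `never_repeated` are hypothetical.

```python
import pandas as pd

# Toy trending log (made-up values): video "a" trends on two days.
toy = pd.DataFrame({
    "video_id": ["a", "b", "a", "c"],
    "view_count": [100, 50, 400, 70],
})

# duplicated() returns a boolean Series marking the repeats, not a DataFrame.
is_repeat = toy.duplicated(subset="video_id", keep="first")

# drop_duplicates() keeps the first row per video_id by default (keep="first").
first_rows = toy.drop_duplicates(subset="video_id")

# keep=False is the option that drops every row of a repeated video_id.
never_repeated = toy.drop_duplicates(subset="video_id", keep=False)

print(is_repeat.tolist())
print(first_rows)
print(never_repeated)
```

With the default setting, video "a" survives with the view count from its first trending day, which matches the goal of comparing videos as of the date they first trended.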

In [None]:
# After cleaning the data, it's time to dig into Question 1: What are the three most popular and least popular video categories?
# Create a pivot table of the median and mean view count by category name to compare the categories.

youtube_1st_question_1st_table = youtube_cleaned_final.pivot_table(values="view count", index="category name", aggfunc=["median", "mean"])
print(youtube_1st_question_1st_table)

# Coder's note: passing sort="descending" to pivot_table had no visible effect. That argument is a True/False flag for sorting the index labels; ranking by value needs .sort_values() on the result.

# Create a bar chart (1st figure below) of the table.
youtube_1st_question_1st_table.plot(kind="bar")

# Create a bar chart (2nd figure below) focusing specifically on the median view count.
youtube_1st_question_1st_table["median"].plot(kind="bar")

# Create a bar chart (3rd figure below) focusing specifically on the mean view count. Make bars orange so they relate to the mean view count color in figure 1 below.
youtube_1st_question_1st_table["mean"].plot(kind="bar", color="orange")


*Coder's note (not part of the project): I tested replacing the NaN values with np.mean and np.median, but the mean and median view_count came out the same as without replacing the NaN values. This is good news because it means the NaN values have very little negative effect on the results, if any. They are simply taken out of the calculation, which is what I want (and why I replaced all the 0s with NaNs at the beginning). I don't want to replace the NaNs with 0s for the table because the 0s in the raw dataset indicate missing data, not counts of 0. For example, a likes count of 0 in the raw dataset means the likes count was hidden from us, not that there were 0 likes.*
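The effect described in the note can be seen on three made-up like counts. This is a minimal sketch under the assumption that a 0 stands for a hidden count; the Series names are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up like counts; the 0 stands in for a hidden (unavailable) count.
likes_with_zero = pd.Series([0, 100, 200])
likes_with_nan = likes_with_zero.replace(0, np.nan)

# With the 0 kept, it is averaged in as a real value: (0 + 100 + 200) / 3.
mean_with_zero = likes_with_zero.mean()

# With NaN, pandas skips the entry (skipna=True is the default): (100 + 200) / 2.
mean_with_nan = likes_with_nan.mean()
median_with_nan = likes_with_nan.median()

print(mean_with_zero, mean_with_nan, median_with_nan)
```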


In order to find the popularity of the categories, we have to look at the view count, the single most important statistic, which tells you the number of times a video was watched with engagement (video retention time of at least a few seconds). To summarize the popularity of a category, we find the average of all videos in that category. I used both the mean and the median, as they capture the average in different and important ways. To do that, we created a pivot table that lists those statistics and a bar chart that reflects the table.

Based on the table, the medians and means of videos across all categories are worlds apart from each other. The mean is many times higher than the median. This tells us that a small percentage of videos are so popular that they skew the mean to the very high end. The median tells a better story of how popular most videos in a category are. To truly measure the popularity of categories based on view counts, we need to perform some calculations with both the mean and the median. Although the mean is skewed, it tells us how viral a video in a category could potentially go.
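A small made-up example shows how a single viral video can pull the mean far above the median. The numbers below are invented for illustration and do not come from the dataset.

```python
import pandas as pd

# Made-up view counts: nine ordinary videos and one viral hit.
views = pd.Series([10_000] * 9 + [10_000_000])

mean_views = views.mean()
median_views = views.median()

# Share of all views contributed by the single largest video.
top_share = views.max() / views.sum()

print(mean_views, median_views, round(top_share, 3))
```

Here the mean is roughly a hundred times the median, even though nine of the ten videos have identical view counts.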

In terms of median (please refer to the second bar chart above), the top three are sports, film & animation, and music respectively. All top three categories are very close in views, and none of the three is significantly higher or lower than the others. As for the bottom three, they are nonprofits & activism, autos & vehicles, and autos & vehicles.
 
Comparing the mean view count (please refer to the third bar chart above) of all categories, we can also tell that music has a drastically higher mean than the rest of the categories; music has a small percentage of videos that are so popular that they pull the mean up. I can also infer from this that many of the most viral videos on YouTube are music videos. Entertainment and science & technology are the next two categories with the closest mean to music, although still quite far behind it. On the other hand, the three categories with the lowest mean are pets & animals, autos & vehicles, and howto & style respectively.

These six categories should be ranked about the same in terms of popularity in the next model, where we are going to find out their overall popularity based on view count.




In [None]:
# Create a second table that adds a new numeric variable named popularity rating to the first table. It is the midpoint of the median and the mean: (median + mean) / 2. It summarizes the average views of a trendy video of a category.

# Use .copy() so that adding the new column leaves the first table unchanged.
youtube_1st_question_2nd_table = youtube_1st_question_1st_table.copy()
youtube_1st_question_2nd_table["popularity rating"] = youtube_1st_question_2nd_table["median"] / 2 + youtube_1st_question_2nd_table["mean"] / 2
print(youtube_1st_question_2nd_table)

# Create a bar graph of only the popularity rating from the second table. Make bars green so they relate to the popularity rating color in figure 2 below.
youtube_1st_question_2nd_table["popularity rating"].plot(kind="bar", color="green")

# Create a bar graph of the entire second table.
youtube_1st_question_2nd_table.plot(kind="bar")

Finally, to clearly determine the categories that are the most popular, I added the mean and median statistics together and divided the sum by 2. Doing so allows the mean and median to have an equal effect on the final result, known as the popularity rating. As mentioned before, we need both averages to capture the whole story.

From the figures above, the top three ratings (please refer to the first bar chart above) proceed in the following order respectively: **music, entertainment, science & technology**. As for the bottom three ratings, they proceed in the following order respectively: **pets & animals, autos & vehicles, howto & style**.

Now that we have the results, we could assign them to the outcome variables we defined earlier:
* most_popular = music: average 1,244,171 views per trendy video
* second_popular = entertainment: average 1,069,772 views per trendy video
* third_popular = science & technology: average 1,025,972 views per trendy video

* least_popular = pets & animals: average 496,925 views per trendy video
* second_least_popular = autos & vehicles: average 538,851 views per trendy video
* third_least_popular = howto & style: average 636,735 views per trendy video
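The assignment above was read off the charts by hand. As a sketch of how the same ranking could be pulled out programmatically, the example below uses `Series.nlargest` and `Series.nsmallest` on a hypothetical ratings Series; the category labels A to G and their values are invented and are not the notebook's results.

```python
import pandas as pd

# Hypothetical ratings for illustration only; not the notebook's results.
ratings = pd.Series(
    {"A": 50, "B": 90, "C": 10, "D": 70, "E": 30, "F": 60, "G": 20},
    name="popularity rating",
)

top_three = ratings.nlargest(3)       # highest rating first
bottom_three = ratings.nsmallest(3)   # lowest rating first

# Unpack the category labels into the outcome variables.
most_popular, second_popular, third_popular = top_three.index
least_popular, second_least_popular, third_least_popular = bottom_three.index

print(top_three)
print(bottom_three)
```

In the notebook itself, the popularity rating column would first need to be selected from the second table as a Series before applying the same two calls.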

**In conclusion, the top three most popular video categories are music, entertainment, and science & technology respectively. The least popular video categories are pets & animals, autos & vehicles, and howto & style respectively.**



After answering our first question, it is time to ask the second one. I have defined it as such: does turning off the comment section affect popularity and likability? This allows YouTube to evaluate whether allowing content creators to turn off their comment section could potentially impact their advertising revenue.

The models will be similar to the ones above. For the first model, we will compare the mean and median of the view count between videos with comments disabled and those with comments not disabled.

For the second model, we will add the mean and median and divide the sum by two. This will calculate the true popularity of the two types of videos. I hypothesize videos with comments not disabled will have a significantly higher mean and median, which means they will also have a significantly higher popularity rating.

As for the third model, we will compare the mean and median of the like count between the two types of videos. In the fourth model, we will add those two statistics together and divide by two to calculate the true likability rating. Finally, we will create a fifth model where we divide the likability rating by the popularity rating to get the likability rating to popularity rating percentage. This will measure the percentage of likes the two types of videos get per view they generated when they were trending. I hypothesize videos with comments not disabled will lead in every single statistic again.

Similarly to question 1, we will conduct analysis by creating pivot tables that calculate such statistics. Then we will create plots that reflect the pivot tables.

One of the two outcome variables will be more_popular, which defines the type of videos that are more popular. Popularity is defined by how quickly the videos could generate view count once they are published on YouTube. The other outcome variable is more_likable. Likability is defined by how quickly the videos could generate like count once they are published. By defining these outcome variables, we could adjust our advertising strategy accordingly.



In [None]:
# Create a pivot table of the median and mean view count by comment status to compare videos with comments disabled and videos with comments not disabled.

youtube_2nd_question_1st_table = youtube_cleaned_final.pivot_table(values="view count", index="comments disabled", aggfunc=["median", "mean"])
print(youtube_2nd_question_1st_table)

#Create a bar chart of the table. 
youtube_2nd_question_1st_table.plot(kind='bar')

#Create a bar chart of the median view count
youtube_2nd_question_1st_table["median"].plot(kind="bar")

#Create a bar chart of the mean view count. Make bars orange so they relate to the mean view count color in figure 1 below.
youtube_2nd_question_1st_table["mean"].plot(kind="bar", color="orange")

In order to compare the popularity of videos with comments disabled and videos with comments not disabled, we have to look at the view count of the average video in both statuses. As mentioned before, because a small percentage of videos are significantly more viral than the others, we have to take into account both the median and the mean. To do that, we created a pivot table that lists those statistics and a bar chart that reflects the table.

Based on the figures above, videos with comments disabled have a higher mean, which came as a surprise. This means that the small percentage of the most viral videos with comments disabled have, on average, more views than their counterparts' most viral videos.

On the other hand, videos with comments not disabled have a higher median. Most of the videos in that status have more views than most of the videos with comments disabled. 





In [None]:
# Create a second table that adds a new numeric variable named popularity rating to the first table. It is the midpoint of the median and the mean: (median + mean) / 2. It summarizes the average views of both types of videos.

# Use .copy() so that adding the new column leaves the first table unchanged.
youtube_2nd_question_2nd_table = youtube_2nd_question_1st_table.copy()
youtube_2nd_question_2nd_table["popularity rating"] = youtube_2nd_question_2nd_table["median"] / 2 + youtube_2nd_question_2nd_table["mean"] / 2

print(youtube_2nd_question_2nd_table)

#Create a bar chart focusing on the popularity rating of the second table. Make bars green so they relate to the popularity rating color in figure 2 below.
youtube_2nd_question_2nd_table["popularity rating"].plot(kind="bar", color="green")

#Create a bar chart of the entire table. 
youtube_2nd_question_2nd_table.plot(kind='bar')

To clearly determine which of the two video types is more popular, I added the mean and median statistics together and divided the sum by 2. Doing so allows the mean and median to have an equal effect on the final result, known as the popularity rating. As mentioned before, we need both averages to capture the whole story.

Looking at the figures above, the videos with comments disabled have a popularity rating (refer to the first bar chart above) of 1,017,088 views per trendy video on average. On the other hand, videos with comments not disabled have a popularity rating of 960,075 views.

Based on the results, videos on average have a slightly higher popularity when they have their comments turned off. To break it down more, the videos with comments disabled take up a bigger percentage of the most popular videos on the entire platform. These popular videos could be created by large companies, organizations, news channels, and other creators with an already highly influential brand. Most videos with comments disabled, however, are on average less popular than most videos with comments not disabled.

The outcome variable, more_popular, will be defined as videos with comments disabled. However, a caveat will be placed on this variable, as the difference in popularity is so small that disabling comments has practically no effect on popularity.

**In conclusion, due to such a small difference between the two video types' popularity ratings, their popularity should not be affected by disabling the comment section to an extent that could impact their ability to generate advertising revenue**. Nonetheless, that is the overall picture. To break it down more, average content creators who are not backed by a big company or making content on globally important topics such as politics should be more concerned about turning off their comment section. Large companies will still generate about the same advertising revenue, even with comments turned off, as their video topics naturally attract attention. Average content creators do not have that ability. They need to put in more effort to engage their audience, and turning on the comment section is a crucial way to do that. As reflected by the median view count, most videos, which are created mostly by average content creators, generate more views by turning on their comment section (not a substantial amount, but still noticeable enough to impact advertising revenue).

*That being said, for the average content creator, turning off the comment section will negatively impact the popularity of your video. For content creators that are behind an influential organization, it should not have an effect.* Please note this conclusion is based on my personal experience with YouTube. Until research or more data backs up my claims, please refer to the bolded conclusion at the top.

We will now move onto the third model, which involves the like count statistics.




In [None]:
# Create a pivot table of the median and mean like count by comment status to compare videos with comments disabled and videos with comments not disabled.

youtube_2nd_question_3rd_table = youtube_cleaned_final.pivot_table(values="like count", index="comments disabled", aggfunc=["median", "mean"])
print(youtube_2nd_question_3rd_table)

#Create a bar chart of the table. 
youtube_2nd_question_3rd_table.plot(kind='bar')

#Create a bar chart of the median like count
youtube_2nd_question_3rd_table["median"].plot(kind="bar")

#Create a bar chart of the mean like count. Make it orange so that it relates to the color assigned to mean in figure 1 below.
youtube_2nd_question_3rd_table["mean"].plot(kind="bar", color="orange")

In order to compare the likability of videos with comments disabled and videos with comments not disabled, we have to look at the like count of the average video in both statuses. As mentioned before, because a small percentage of videos are significantly more viral than the others, we have to take into account both the median and the mean. To do that, we created a pivot table that lists those statistics and a bar chart that reflects the table.

Based on the figures above, both the mean and the median for videos with comments not disabled are significantly higher than their counterparts'. This translates to two things:
* Most videos with comments not disabled have significantly more likes.
* Most of the videos with top like count are overwhelmingly videos with comments not disabled.

Please see next model for the overall like count comparison.


In [None]:
# Create a fourth table that adds a new numeric variable named likability rating to the third table. It is the midpoint of the median and the mean: (median + mean) / 2. It summarizes the average like count of both types of videos.

# Use .copy() so that adding the new column leaves the third table unchanged.
youtube_2nd_question_4th_table = youtube_2nd_question_3rd_table.copy()
youtube_2nd_question_4th_table["likability rating"] = youtube_2nd_question_4th_table["median"] / 2 + youtube_2nd_question_4th_table["mean"] / 2
print(youtube_2nd_question_4th_table)

# Create a bar chart focusing on the likability rating of the fourth table. Make bars green so they relate to the likability rating color in figure 2 below.
youtube_2nd_question_4th_table["likability rating"].plot(kind="bar", color="green")

#Create a bar chart of the entire table. 
youtube_2nd_question_4th_table.plot(kind='bar')

Based on the figures above, the videos with comments not disabled have a significantly higher likability rating. This means they have a higher like count on average.

Please see the next model for the final outcome measuring the likability.

In [None]:
# Create a like count to view count ratio.
# Create a fifth table that starts from the fourth table, brings in the popularity rating from the second table, and adds a new numeric variable named likability rating to popularity rating percentage. It divides the likability rating by the popularity rating, which gives the average share of likes a trendy video of each type receives out of its total views. This measures the true likability.

# Use .copy() so that adding the new columns leaves the fourth table unchanged.
youtube_2nd_question_5th_table = youtube_2nd_question_4th_table.copy()
youtube_2nd_question_5th_table["popularity rating"] = youtube_2nd_question_2nd_table["popularity rating"]
youtube_2nd_question_5th_table["likability rating to popularity rating percentage"] = youtube_2nd_question_5th_table["likability rating"] / youtube_2nd_question_5th_table["popularity rating"]


print(youtube_2nd_question_5th_table)

#Create a bar graph of the likability rating to popularity rating percentage. Assign it the color brown.
youtube_2nd_question_5th_table["likability rating to popularity rating percentage"].plot(kind="bar", color="brown")




Since the like count depends on the view count, it is necessary to eliminate the effect that view count, measured as the popularity rating, could have on our final outcome. For that reason, the outcome variable is calculated by taking the likability rating from model four and dividing it by the popularity rating from model two. This gives us the number of likes generated per view when the videos were first trending.
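The normalization can be sketched with two hypothetical groups that have the same popularity rating but different likability ratings. The numbers below are made up so that the ratios are easy to read; they are not the notebook's results, and the column names `likes per view` and `likes per 100 views` are introduced only for this example.

```python
import pandas as pd

# Hypothetical ratings per group (made-up numbers, not the notebook's results).
ratings = pd.DataFrame(
    {
        "likability rating": [30_000, 70_000],
        "popularity rating": [1_000_000, 1_000_000],
    },
    index=pd.Index(["yes", "no"], name="comments disabled"),
)

# Likes generated per view: likability rating divided by popularity rating.
ratings["likes per view"] = ratings["likability rating"] / ratings["popularity rating"]

# The same ratio expressed as likes per 100 views.
ratings["likes per 100 views"] = ratings["likes per view"] * 100

print(ratings)
```

Dividing by the popularity rating puts both groups on the same scale, so a group cannot look more likable simply because it has more views.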

Based on the figures above, videos with comments not disabled have, on average, a likability rating to popularity rating percentage of approximately 0.07. On the other hand, videos with comments disabled have, on average, a likability rating to popularity rating percentage of approximately 0.03. This means videos with comments not disabled generated, on average, about 7 likes per 100 views when they were first trending, more than twice the number of likes generated by videos with comments disabled.

After analyzing the data, the outcome variable more_likable is defined as videos with comments not disabled, with approximately 7 likes per 100 views generated when they were first trending.

**Finally, we could conclude that videos with comments not disabled are significantly more likable than those with comments disabled. And to recap, videos with comments disabled are slightly more popular than videos that do not have comments disabled.**


After answering our second question, it is time to ask our final one. I have defined it as such: does turning off the rating (blocking viewers from seeing the likes and dislikes on your videos) affect the popularity and the engagement from viewers? This allows YouTube to evaluate whether allowing content creators to turn off their rating section could potentially impact their advertising revenue.

The models will be similar to the ones above. For the first model, we will compare the mean and median of the view count between videos with ratings disabled and those with ratings not disabled.

For the second model, we will add the mean and median and divide the sum by two. This will calculate the true popularity of the two types of videos. I hypothesize videos with ratings not disabled will have a significantly higher mean and median, which means they will also have a significantly higher popularity rating.

As for the third model, we will compare the mean and median of the comment count between the two types of videos. In the fourth model, we will add those two statistics together and divide by two to calculate the engagement rating. Finally, we will create a fifth model where we divide the engagement rating by the popularity rating to get the engagement rating to popularity rating percentage. This will measure the percentage of comments the two types of videos get per view they generated when they were trending. I hypothesize videos with ratings not disabled will lead in every single statistic again.

Similarly to question 2, we will conduct analysis by creating pivot tables that calculate such statistics. Then we will create plots that reflect the pivot tables.
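Because the same median, mean, and midpoint calculation repeats across the questions, it could be wrapped in one small helper. The function `rating_table` below is a hypothetical sketch and not part of the original notebook; it uses `groupby(...).agg(...)` instead of `pivot_table`, and the toy data is made up.

```python
import pandas as pd


def rating_table(df, value, by):
    """Median, mean, and their midpoint of `value` for each group in `by`.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    table = df.groupby(by)[value].agg(["median", "mean"])
    table["rating"] = (table["median"] + table["mean"]) / 2
    return table


# Toy data with made-up values.
toy = pd.DataFrame({
    "ratings disabled": ["no", "no", "no", "yes", "yes"],
    "view count": [100, 200, 600, 400, 800],
    "comment count": [10, 20, 60, 4, 8],
})

popularity = rating_table(toy, "view count", "ratings disabled")
engagement = rating_table(toy, "comment count", "ratings disabled")

# Comments per view: engagement rating divided by popularity rating.
engagement_share = engagement["rating"] / popularity["rating"]

print(popularity)
print(engagement)
print(engagement_share)
```

A helper like this returns flat column names (`median`, `mean`, `rating`), which avoids the two-level column headers that `pivot_table` produces when given a list of aggregation functions.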

One of the two outcome variables will be rating_more_popular, which defines the type of videos that are more popular. Popularity is defined by how quickly the videos could generate view count once they are published on YouTube. The other outcome variable is more_engaging. Engagement is defined by how quickly the videos could generate comment count once they are published. By defining these outcome variables, we could adjust our advertising strategy accordingly.

In [None]:
# Create a pivot table of the median and mean view count by rating status to compare videos with ratings disabled and videos with ratings not disabled.

youtube_3rd_question_1st_table = youtube_cleaned_final.pivot_table(values="view count", index="ratings disabled", aggfunc=["median", "mean"])
print(youtube_3rd_question_1st_table)

#Create a bar chart of the table. 
youtube_3rd_question_1st_table.plot(kind='bar')

#Create a bar chart of the median view count
youtube_3rd_question_1st_table["median"].plot(kind="bar")

#Create a bar chart of the mean view count. Make the chart orange to relate to the color assigned to the mean view count in figure 1 below.
youtube_3rd_question_1st_table["mean"].plot(kind="bar", color="orange")


In order to compare the popularity of videos with ratings disabled and videos with ratings not disabled, we have to look at the view count of the average video in both statuses. As mentioned before, because a small percentage of videos are significantly more viral than the others, we have to take into account both the median and the mean. To do that, we created a pivot table that lists those statistics and a bar chart that reflects the table statistics.

Based on the figures above, videos with ratings disabled have a higher mean, which came as a surprise. This means that the small percentage of the most viral videos with ratings disabled have, on average, more views than their counterparts' most viral videos.

On the other hand, videos with ratings not disabled have a slightly higher median. Most of the videos in that status have more views than most of the videos with ratings disabled.
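To illustrate why the mean and the median can point in different directions, here is a minimal sketch using only the Python standard library. The view counts are made-up numbers chosen for illustration, not values from the dataset.

```python
from statistics import mean, median

# Hypothetical view counts: four modest videos and one extremely viral one.
view_counts = [100_000, 120_000, 150_000, 180_000, 10_000_000]

mean_views = mean(view_counts)      # pulled upward by the single viral video
median_views = median(view_counts)  # the middle value, unaffected by the outlier

print(f"mean:   {mean_views:,.0f}")
print(f"median: {median_views:,.0f}")
```

A single viral video is enough to lift the mean far above the median, which is why a group can have the higher mean while the other group has the higher median.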

In [None]:
# Add a column/variable that defines the popularity rating of the two types of videos.


# Create a second table that adds a new numeric variable named popularity rating to the first table. It adds the median to the mean and divides the sum by 2, giving the average views of both types of videos.

# Copy the first table so the new column does not also appear in it.
youtube_3rd_question_2nd_table = youtube_3rd_question_1st_table.copy()
youtube_3rd_question_2nd_table["popularity rating"] = youtube_3rd_question_2nd_table["median"] / 2 + youtube_3rd_question_2nd_table["mean"] / 2

print(youtube_3rd_question_2nd_table)

# Create a bar chart focusing on the popularity rating of the second table. Make the bars green to match the second figure.
youtube_3rd_question_2nd_table["popularity rating"].plot(kind="bar", color="green")

#Create a bar chart of the entire table. 
youtube_3rd_question_2nd_table.plot(kind='bar')

To clearly determine which type of video is more popular, I added the mean and the median together and divided the sum by 2. Doing so gives the mean and the median equal weight in the final result, known as the popularity rating. As mentioned before, we need both averages to capture the whole story.

Looking at the figures above, the videos with ratings disabled have a popularity rating (refer to the first bar chart above) of 122,925 views per trending video on average. On the other hand, videos with ratings not disabled have a popularity rating of 958,635 views.

Based on the results, videos on average have a slightly higher popularity when their ratings are turned off. To break it down, videos with ratings disabled make up a larger share of the most popular videos on the entire platform. These popular videos could be created by large companies, organizations, news channels, and other creators with an already strong influence or brand. Most videos with ratings disabled, however, are on average slightly less popular than most videos with ratings not disabled.

The outcome variable, rating_more_popular, will be defined as videos with ratings disabled. 
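The popularity rating and the resulting outcome variable could be computed along these lines. This is a sketch under assumptions: the data is hypothetical, and it uses `groupby(...).agg(...)` with string function names instead of the pivot table above so that the column labels stay flat.

```python
import pandas as pd

# Hypothetical data; the column names mirror the ones used in this notebook.
videos = pd.DataFrame({
    "ratings disabled": [True, True, True, False, False, False],
    "view count": [200_000, 400_000, 9_000_000, 500_000, 600_000, 1_000_000],
})

# Median and mean view count per group, with flat column labels.
summary = videos.groupby("ratings disabled")["view count"].agg(["median", "mean"])

# Popularity rating: the median and the mean weighted equally.
summary["popularity rating"] = (summary["median"] + summary["mean"]) / 2

# The group with the highest rating becomes the outcome variable.
rating_more_popular = summary["popularity rating"].idxmax()

print(summary)
print("'ratings disabled' value of the more popular group:", rating_more_popular)
```

In this toy data the group with ratings disabled wins on the rating even though its median is lower, because its mean is pulled up by one very viral video.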

In conclusion, although videos with ratings disabled are not substantially more popular than their counterparts, the margin is still large enough to affect advertising revenue. This is, of course, the overall picture. To break it down, average content creators who are not backed by a big company or making content on globally important topics such as politics should not be too concerned with turning their ratings on or off. Large companies will still generate about the same advertising revenue, even with ratings turned off, as their video topics naturally attract attention.

*That being said, for the average content creator, disabling or not disabling ratings should not impact the popularity of their videos. For content creators who are backed by an influential organization, it should not have an effect either.* Please note this conclusion is based on my personal experience with YouTube. Until research or more data backs up my claims, please refer to the bolded conclusion at the top.

We will now move on to the third model, which involves the comment count statistics.

In [None]:
# Create a pivot table of the median and mean comment count, indexed by "ratings disabled", to compare videos with ratings disabled against videos with ratings not disabled.

youtube_3rd_question_3rd_table = youtube_cleaned_final.pivot_table(values="comment count", index="ratings disabled", aggfunc=[np.median, np.mean])
print(youtube_3rd_question_3rd_table)

#Create a bar chart of the table. 
youtube_3rd_question_3rd_table.plot(kind='bar')

#Create a bar chart of the median comment count
youtube_3rd_question_3rd_table["median"].plot(kind="bar")

#Create a bar chart of the mean comment count. Make the chart orange to match the orange color shown in the first figure.
youtube_3rd_question_3rd_table["mean"].plot(kind="bar", color="orange")

In order to compare the engagement of videos with ratings disabled and videos with ratings not disabled, we have to look at the comment count of the average video in each group. As mentioned before, because a small percentage of videos are significantly more viral than the rest, we have to take into account both the median and the mean. To do that, we created a pivot table that lists those statistics and a bar chart that reflects the table.

Based on the figures above, the median for videos with ratings not disabled is slightly higher than for videos with ratings disabled, while their mean is significantly higher. This translates to two things:

* Most videos with ratings not disabled have slightly more comments than videos with ratings disabled.
* The videos with the highest comment counts are overwhelmingly videos with ratings not disabled.

Please see the next model for the overall comment count comparison.

In [None]:
# Add a column/variable that defines the engagement rating of the two types of videos.


# Create a second table that adds a new numeric variable named engagement rating to the previous table. It adds the median to the mean and divides the sum by 2, giving the average comment count of both types of videos.

# Copy the previous table so the new column does not also appear in it.
youtube_3rd_question_4th_table = youtube_3rd_question_3rd_table.copy()
youtube_3rd_question_4th_table["engagement rating"] = youtube_3rd_question_4th_table["median"] / 2 + youtube_3rd_question_4th_table["mean"] / 2
print(youtube_3rd_question_4th_table)

#Create a bar chart focusing on the engagement rating of the second table
youtube_3rd_question_4th_table["engagement rating"].plot(kind="bar")

#Create a bar chart of the entire table. 
youtube_3rd_question_4th_table.plot(kind='bar')

Based on the figures above, the videos with ratings not disabled have a higher engagement rating. Although it is not substantially higher, it is high enough to make a difference to the advertising strategy. This also means they have a higher comment count on average.

Please see the next model for the final outcome measuring the engagement.

In [None]:
#(for creating a comment count to view count proportion / ratio)
#Add a second column that 


# Create a third table that adds a new numeric variable named engagement rating to popularity rating percentage, combining the second table and the fourth table. It divides the engagement rating by the popularity rating, which gives the average number of comments a trending video of each type receives per view. This measures the true engagement.

youtube_3rd_question_5th_table = youtube_3rd_question_4th_table.copy()
youtube_3rd_question_5th_table["popularity rating"] = youtube_3rd_question_2nd_table["popularity rating"]
youtube_3rd_question_5th_table["engagement rating to popularity rating percentage"] = youtube_3rd_question_5th_table["engagement rating"] / youtube_3rd_question_5th_table["popularity rating"]


print(youtube_3rd_question_5th_table)

#Create a bar graph of the engagement rating to popularity rating percentage. Assign the graph the color brown.
youtube_3rd_question_5th_table["engagement rating to popularity rating percentage"].plot(kind="bar", color="brown")


Since the comment count depends on the view count, it is necessary to eliminate the effect that view count, measured as the popularity rating, could have on our final outcome. For that reason, the outcome variable is calculated by taking the engagement rating from model four and dividing it by the popularity rating from model two. This gives us the number of comments generated per view when the videos were first trending.

Based on the figures above, videos with ratings not disabled have, on average, an engagement rating to popularity rating percentage of approximately 0.0057. On the other hand, videos with ratings disabled have, on average, a percentage of approximately 0.027. This means that videos with ratings not disabled generated, on average, about 5.7 comments per 1,000 views when they were first trending, more than twice the number of comments generated by videos with ratings disabled.

After analyzing the data, the outcome variable more_engaging is defined as videos with ratings not disabled, with approximately 5.7 comments per 1,000 views generated when they were first trending.

**Finally, we can conclude that videos with ratings not disabled are significantly more engaging than videos with ratings disabled. And to recap, videos with ratings disabled are more popular than videos that do not have ratings disabled.**
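As a small illustration of the normalization step, the sketch below divides an engagement rating by a popularity rating and rescales the result to comments per 1,000 views. The ratings and the group labels are hypothetical placeholders, not the values from the tables in this notebook.

```python
import pandas as pd

# Hypothetical ratings per group, standing in for the tables built earlier.
ratings = pd.DataFrame(
    {
        "popularity rating": [1_000_000.0, 800_000.0],
        "engagement rating": [2_000.0, 4_000.0],
    },
    index=pd.Index(["ratings disabled", "ratings not disabled"], name="group"),
)

# Comments generated per view, then rescaled to comments per 1,000 views.
ratings["comments per view"] = ratings["engagement rating"] / ratings["popularity rating"]
ratings["comments per 1,000 views"] = ratings["comments per view"] * 1_000

# The group with the highest ratio is the more engaging one.
more_engaging = ratings["comments per view"].idxmax()

print(ratings)
print("more engaging group:", more_engaging)
```

Dividing by the popularity rating means a group cannot look more engaging simply because it has more views.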

Although this is our final question, I have included one more question below to dig deeper into this study. Please see the additional work below:

My additional question is: What are the top three and bottom three most engaging video categories? By finding out the most engaging categories, we can run ad campaigns for top content creators from those categories. As for the categories that are not as strong, we will focus less on advertising for those videos at the moment and find ways to increase their engagement.

The outcome variables will be most_engaging, second_engaging, and third_engaging. The most_engaging variable will define the number one most engaging video category, the second_engaging variable will define the second most engaging category, and the third_engaging variable will define the third most engaging category. Engagement is defined by how quickly videos generate comments once they are published on YouTube.

Having these three variables will allow me to discover the comment count of the three most engaging categories. I plan to calculate the average comment count of videos in each category and compare them. The categories with the three highest comment counts will be the most engaging categories.

On the other hand, we will have least_engaging, second_least_engaging, and third_least_engaging. They will help me define the least engaging categories.

There will be two models. The first model is comment count mean vs. median. The variables are video category, mean comment count, and median comment count. I will compare the mean comment count of the categories to the median comment count of the categories.

The second model is the average comment count. I will add the mean and median comment count of each category and divide the sum by two. This will capture the whole story of engagement across the categories.

I hypothesize that entertainment, gaming, and music will be the top three categories, while travel & events, news & politics, and howto & style will be the least engaging. This is based on my personal experience with YouTube. Their median and mean values should be the top three and bottom three as well.

Lastly, I will conduct analysis to calculate the category average comment count and plot graphs and tables to visualize the data.
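The two models described above could be sketched as follows. The comment counts are invented for illustration, only six categories are included, and `nlargest`/`nsmallest` are one possible way to pick the top and bottom three.

```python
import pandas as pd

# Hypothetical comment counts per category (three videos each).
comment_counts = {
    "Music": [1_000, 2_000, 30_000],
    "Entertainment": [800, 1_500, 9_700],
    "Gaming": [900, 1_200, 3_900],
    "Education": [300, 600, 900],
    "Sports": [100, 200, 600],
    "Travel & Events": [50, 100, 150],
}
videos = pd.DataFrame(
    [(category, count) for category, counts in comment_counts.items() for count in counts],
    columns=["category name", "comment count"],
)

# Model 1: median and mean comment count per category.
summary = videos.groupby("category name")["comment count"].agg(["median", "mean"])

# Model 2: engagement rating, weighting the median and the mean equally.
summary["engagement rating"] = (summary["median"] + summary["mean"]) / 2

# Top three (most engaging first) and bottom three (least engaging first).
top_three = summary["engagement rating"].nlargest(3).index.tolist()
bottom_three = summary["engagement rating"].nsmallest(3).index.tolist()

print(summary.sort_values("engagement rating", ascending=False))
print("top three:", top_three)
print("bottom three:", bottom_three)
```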

In [None]:
# Create a pivot table organized by median and mean comment count by category name to compare the median and mean comment count between categories

youtube_4th_question_1st_table = youtube_cleaned_final.pivot_table(values="comment count", index="category name", aggfunc=[np.median, np.mean])
print(youtube_4th_question_1st_table)

#Create a bar chart (1st figure below) of the table. 
youtube_4th_question_1st_table.plot(kind='bar')

#Create a bar chart (2nd figure below) focusing specifically on the median comment count
youtube_4th_question_1st_table["median"].plot(kind='bar')

#Create a bar chart (3rd figure below) focusing specifically on the mean comment count. Make bars orange so they relate to the mean comment count color in figure 1 below.
youtube_4th_question_1st_table["mean"].plot(kind='bar', color="orange")

In order to find the engagement of the categories, we have to look at the comment count, the statistic that tells us how many comments a video had generated when it was first trending. To summarize the engagement of a category, we find the average of all videos in that category. I used both the mean and the median, as they capture the average in different and important ways. To do that, we created a pivot table that lists those statistics and a bar chart that reflects the table.

Based on the table, the medians and means of videos across all categories are worlds apart from each other. The mean is many times higher than the median. This tells us that a small percentage of videos are so popular that they skew the mean toward the very high end. The median tells a better story of how engaging most videos in a category are. To truly measure the engagement of categories based on comment counts, we need to perform some calculations with both the mean and the median. Although the mean is skewed, it tells us how viral a video in a category could potentially go.

In terms of median (please refer to the second bar chart above), the top three are music, film & animation, and nonprofits & activism, respectively. Music is noticeably higher than all other categories, while the other two are closer to each other. As for the bottom three, they are travel & events, sports, and autos & vehicles.

Comparing the mean comment count (please refer to the third bar chart above) of all categories, we can also tell that music has a drastically higher mean than the rest of the categories; music has a small percentage of videos at the very top in terms of engagement, which pulls the mean up. I can also infer from this that many of the most engaging videos on YouTube are music videos. Entertainment and gaming are the next two categories with the closest mean to music, although they are still miles apart from it. On the other hand, the three categories with the lowest mean are pets & animals, travel & events, and sports, respectively.

These six categories should be ranked about the same in terms of engagement in the next model, where we are going to find out their overall engagement based on comment count.


In [None]:
# Create a second table that adds a new numeric variable named engagement rating to the first table. It adds the median to the mean and divides the sum by 2, giving the average comments of a trending video in a category.

youtube_4th_question_2nd_table = youtube_4th_question_1st_table.copy()
youtube_4th_question_2nd_table["engagement rating"] = youtube_4th_question_2nd_table["median"] / 2 + youtube_4th_question_2nd_table["mean"] / 2
print(youtube_4th_question_2nd_table)

#Create a bar graph of only the engagement rating from the second table. Make bars green so they relate to the engagement rating color in figure 2 below.
youtube_4th_question_2nd_table["engagement rating"].plot(kind="bar", color="green")

#Create a bar graph of the entire second table.
youtube_4th_question_2nd_table.plot(kind='bar')

Finally, to clearly determine the categories that are the most engaging, I added the mean and the median together and divided the sum by 2. Doing so gives the mean and the median equal weight in the final result, known as the engagement rating. As mentioned before, we need both averages to capture the whole story. 

From the figures above, the top three ratings (please refer to the first bar chart above) are, in order: **music, entertainment, gaming**. The bottom three ratings are, in order: **travel & events, sports, autos & vehicles**.

Now that we have the results, we can assign them to the outcome variables we defined earlier:
* most_engaging = music
* second_engaging = entertainment
* third_engaging = gaming

* least_engaging = travel & events
* second_least_engaging = sports
* third_least_engaging = autos & vehicles

**In conclusion, the top three most engaging video categories are music, entertainment, and gaming. The least engaging video categories are travel & events, sports, and autos & vehicles.**



# **Executive Summary:**

YouTube is a video platform that generates most of its revenue from the advertisements placed on its videos. In order to maximize that revenue, we need to target the types of videos with the potential to have the most advertisements watched. To discover such videos, we analyzed the view count, like count, and comment count of different types of videos. 

To begin, my first finding identifies the top three and bottom three video categories ranked in terms of popularity. To measure popularity, I averaged the view count of all the trendiest videos in every video category when they were first released. The results showed music, entertainment, and science & technology as the most popular, respectively. The latter two are close to each other, while a big gap separates music from the two as the most popular. On the other hand, the three least popular categories are pets & animals, autos & vehicles, and howto & style, in that order. 

Moving on to my next finding, I analyzed whether disabling the comment section affects the popularity and likability of videos. To measure popularity, I averaged the view count of all the trendiest videos when they were first released, split into two types: videos with comments disabled and videos with comments not disabled. To measure likability, I calculated the average like count per 100 views for these two types of videos. I discovered that disabling comments increases a video's popularity by such a small amount that it has virtually no effect on its ability to generate advertising revenue. On the other hand, videos with comments disabled are less likable than videos with comments not disabled. 

Next, I analyzed how disabling ratings affects the popularity and engagement of videos. Popularity is measured by averaging the view count of all the trendiest videos, split into two types: ones with ratings disabled and ones with ratings not disabled. Engagement is measured by calculating the average comment count per 1,000 views for the same two types of videos. My findings show that videos with ratings disabled are more popular than videos with ratings not disabled. On the other hand, videos with ratings disabled are significantly less engaging than videos with ratings not disabled. 

Finally, for my last finding, I identified the top three and bottom three video categories ranked in terms of engagement. To measure engagement, I averaged the comment count of all the trendiest videos in every video category. The results showed music, entertainment, and gaming as the three most engaging categories, respectively. Music, as a category, is miles ahead in terms of engagement, even compared to the other two categories in the top three. On the other hand, travel & events, sports, and autos & vehicles are ranked the least engaging. 

There are many recommendations that we could derive from these results. I think there are two key conclusions we can make. Firstly, we should definitely focus our advertising effort on music videos. Music is by far the most popular and most engaging category. Popularity brings the videos more views, multiplying the number of advertisements that will be watched. Engagement captures the viewers' interest, so that they stay longer and are more likely to watch the full length of the advertisements. With these two factors combined, we will generate a lot more advertising revenue if our advertising strategies focus on music videos. 

Secondly, we should continue granting content creators the option of disabling their comment section and ratings. Disabling comments barely affects the popularity of videos. Although disabling comments will affect likability, certain controversial topics are unavoidable for some content creators, and some creators would like to stay away from a potentially controversial group of people commenting on their videos. Moreover, likability is not affected to a degree that will seriously impact our financial health. We should also care about our creators' well-being. As for ratings, although disabling them makes videos more popular, not disabling them makes videos significantly more engaging. This should even out the field. For that reason, we should keep the option that allows content creators to disable their ratings. Again, they might have valid reasons to do so, and we care about our creators! Without them, we would not be able to generate advertising revenue. 


# **Project comparison:**

There are many comparisons and contrasts I could make between the 350 project and this project. Off the top of my head, I can tell you that both are fun, although this one is more fun thanks to how creative you can get with the code. It also feels a lot more satisfying when you solve a coding error that had you stuck for a good while. 

To begin, I would say the 350 project is more theoretical. You are thinking more from the business perspective and are more concerned with the big picture (at least I was). Analyzing the data was more subjective, as I had no coding experience and could not think about it from a more technical perspective. Analyzing in Excel is slightly easier for me, as you can generate bar charts and tables by just clicking a few icons on the Excel menu. In terms of cleaning the data, I find both ways to be equally simple. Anyway, going back to the big picture, it is not that I did not take an eagle's-eye view in this project; it is just that I felt so much more attracted to the details. Perhaps I had already visualized the big picture in 350, so that became automatic to me. For 350, I was concerned about little details too, but more about biases, human errors, and such. 

I am surprised that this project actually took me less time to complete than the 350 project. This seems like a much more complicated project. But then again, I carried over the knowledge from 350. Moreover, DataCamp has been my friend. It provided me with a template for coding all the models. I admit I did a lot of copying and pasting. However, what programmer does not do that? In this day and age, we rely on Googling so much, especially programmers. There are so many communities that can help them and provide them with code to paste. I initially felt guilty about pasting the code, but I learned that what needs to be done needs to be done. 

Anyway, I love this project, and the 350 one too! I honestly think these are the two most useful classes in the entire BTA program (although I haven't taken privacy yet). They teach me to think from both a big-picture and a detail-oriented perspective. More importantly, they are fun and rewarding. 
