<a href="https://colab.research.google.com/github/jiayuzhao05/jiayuzhao05/blob/main/WiDS_Next_Gen_Activity1_V1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**BEFORE YOU BEGIN,** please work from a copy of this notebook in your Google Drive *(File > Save a copy in Drive)*.

If you skip this step, any changes you make **will not be saved.**


---




# **Activity 1: Introduction to Data Science**


---



# **Introduction**

This activity, created by WiDS Next Gen, guides you through the steps of data analysis and visualization using Python. It starts with a quick run-through of how to use Google Colaboratory, then takes you through graphing and analyzing a dataset of top trending YouTube videos. You will get some practice working with simple Python statements, explore different data representations, and consider what conclusions you can draw from each graph.

This is an introductory activity, so all of the code has already been written for you. All you need to do is click a button to run it! We've also included comments about what the code is doing so you can start to get the hang of what Python functions look like. If this activity inspires further interest, there are more explorations and activities about data science and visualization included!

## What's Colab?

*Colaboratory*, or *Colab* for short, is a Google product that allows anybody to write and execute arbitrary Python code through the browser (for free!). Each .ipynb file is called a Colab "notebook," and can be stored in Google Drive just like Google Docs or Sheets, which you may already be familiar with.

If you haven't seen or interacted with this kind of document before, don't worry! Just keep following the tutorial in this section and you'll be good to go for the rest of the activity.

### Understanding Notebook Structure

The building blocks of every notebook are called *cells.* If you click once on this paragraph, you can see an outline of the cell it lives in. **Try clicking some of the paragraphs or headers from earlier in the notebook to see the separate cells.**

There are two types of cells: *text* and *code*, which are both editable. *Text cells* contain formatted text, as you've seen, and *code cells* contain executable Python code.

Let us demonstrate...

***This is a text cell.***

In [None]:
# And this is a code cell.

Next, you'll learn how to create, edit, and move around cells yourself!

### *Working with Cells*

**To CREATE a new cell, hover your mouse over the bottom edge of this cell.** You should see one button for a new code cell, and another for a new text cell.


**First, create a new text cell below and write a random sentence or two.** Maybe write about what you ate yesterday, or the video you last watched on YouTube.

You may notice something a little funny about the way the text you input looks vs. the text that's actually displayed on the cell. Each text cell is formatted using a syntax that's called Markdown. All it basically does is help mark formatted text (like **bolded** or *italicized* text) in a way that a computer can understand.

But don't worry about memorizing the syntax, because you can just use the icons that show up on the top of the cell when you start typing. **Feel free to play around with the icons to understand how they change your text.**

***To stop editing, simply click on a different cell.***

---

**To EDIT an existing cell, double click it.** Try it on the cell you just created.

**To MOVE an existing cell, first select it, then use the ↑ or ↓ arrows on the upper right of the cell to change its position.** Try moving your cell below this one.

**To DELETE an existing cell, first select it, then click the trash can icon at the upper right of the cell.** Try it on the cell you created.

You may be curious about the other symbols in the top right corner. Feel free to explore what they do.

*If you ever want to undo a cell deletion, you can use the Edit dropdown menu above.*

Hopefully this section helped you become familiar with working with a notebook document. The last thing you need to learn in order to complete this activity is how to run the code, which we'll dive into next.

### *Running Code Cells*

In this activity, we have already written out all the code for you. While you don't need to write any code, you will have to run each cell yourself.

**To RUN a code cell, click the [ ▶️ ] icon that appears on the left of the cell when you hover over it with your mouse.**

Give it a try below:

In [None]:
# <--- hover your mouse here!

# Run this block to print a message!
# (Side note: any lines in green like this one are just comments we can add within a code cell to explain what we're doing)

print("Hello world!")

Hello world!


**If the code has any output, it'll show up underneath the block, where** `Hello world!` **is now.**

You may also notice that a number appeared in between the brackets `[ ]`.

This number just confirms that a cell has finished running and also tells you the order in which code cells were run. `[1]` means this cell was the first one run, and `[2]` was the second cell run, etc.

For this activity, you also don't need to figure out exactly how the code works. We'll explain what each block is doing at each step, so all you need to do is run them!

Now that you're a bit more familiar with how a Colab notebook works, let's learn about the dataset we'll be exploring.

## About The Dataset

YouTube maintains a list of the [top trending videos](https://www.youtube.com/feed/trending) on the platform. To determine trending videos, "YouTube uses a combination of factors including measuring users interactions (number of views, shares, comments and likes)."

This dataset includes the daily top 100 trending videos on YouTube, for the week of June 3 - 9, 2018 in the United States.

*(Source: [Kaggle](https://www.kaggle.com/datasnaek/youtube-new?select=USvideos.csv))*

---



**You'll need to download the dataset to your computer first. You can access it [here](https://drive.google.com/file/d/1UvtifbVaJMb8x9AkHHMCK14sX9VVuRWK/view?usp=sharing).**

*(NOTE: If you have an issue with the download, make sure you have **third-party cookies** enabled on your browser. Click [here](https://akohubteam.medium.com/how-to-enable-third-party-cookies-on-your-browsers-f9a8143b8cc5) to learn how.)*

---

Now, let's run some code to take a closer look at how our information is structured.

First, we'll import a few libraries to let the notebook know what specific tools we'll need to work with our data.

**You *must* run the cell below before continuing the activity!** (Otherwise, the computer won't understand which specific functions we want to use from each library.)

In [None]:
# Import libraries
import io
import pandas as pd
import plotly.express as px

print("Libraries successfully imported!")

Libraries successfully imported!


Next, we'll upload our dataset to the notebook. **Run the cell below, click *Choose Files* when prompted and select the *WiDS_Activity1.csv* file you downloaded earlier.**

In [None]:
# Upload our dataset
from google.colab import files
uploaded = files.upload()

df = pd.read_csv(io.BytesIO(uploaded['WiDS_Activity1.csv']))

print("\nDataset successfully uploaded!")

Saving WiDS_Activity1.csv to WiDS_Activity1.csv

Dataset successfully uploaded!


Now that we've uploaded the dataset, let's look into how it's structured.

**Run the next cell to see how many rows and columns are in the dataset.**

The output will be formatted as `(number of rows, number of columns)`.

In [None]:
# See the shape/structure of the dataset
df.shape

(700, 9)

Ok, so it looks like we have 700 rows, which makes sense because we know that our dataset has 100 entries for each day of the week. But what about those 9 columns? **We can take a peek at the first few rows of the dataset by running the following:**

In [None]:
# Show top 5 rows in dataframe
df.head()

Unnamed: 0,trending_date,title,channel_title,category,tags,views,likes,dislikes,comment_count
0,18.03.06,BTS (방탄소년단) 'FAKE LOVE' Official MV (Extended ...,ibighit,Music,"BIGHIT|""빅히트""|""방탄소년단""|""BTS""|""BANGTAN""|""방탄""|""fak...",8120145,1624230,8402,145872
1,18.03.06,Official Call of Duty®: Black Ops 4 — Multipla...,Call of Duty,Gaming,"call of duty|""cod""|""activision""|""Black Ops 4""",9753197,348498,207752,141363
2,18.03.06,Maroon 5 - Girls Like You ft. Cardi B,Maroon5VEVO,Music,"Maroon|""Girls""|""Like""|""You""|""Interscope""|""Reco...",25497666,1547821,23176,98455
3,18.03.06,"Cardi B, Bad Bunny & J Balvin - I Like It [Off...",Cardi B,Music,"Cardi B|""I Like It""|""Invasion of Privacy""|""Bad...",38779629,1310596,67747,80520
4,18.03.06,[CHOREOGRAPHY] BTS (방탄소년단) 'FAKE LOVE' Dance P...,BANGTANTV,Music,"방탄소년단|""BTS""|""BANGTAN""|""HIPHOP""|""랩몬스터""|""RapMons...",8751939,1081236,4038,67537
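
If you'd like to double-check that every day contributes the same number of rows, one option is to count rows per date with pandas' `value_counts()`. The sketch below uses a tiny made-up table named `sample` (not the real dataset) so that it runs on its own; in the notebook you would call the same method on `df['trending_date']`.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset: two days with three rows each
sample = pd.DataFrame({
    'trending_date': ['18.03.06', '18.03.06', '18.03.06',
                      '18.04.06', '18.04.06', '18.04.06'],
    'title': ['Video A', 'Video B', 'Video C',
              'Video A', 'Video B', 'Video D'],
})

# Count how many rows share each trending_date value
rows_per_day = sample['trending_date'].value_counts()
print(rows_per_day)
```

With the real data, seeing the same count next to each of the seven dates would support the "100 entries per day" reading of the 700 rows.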


# **Kickstart Your Curiosity**

Since we now know all the available information in this dataset, we can start asking some questions.

**Maybe we want to see all of the top 100 trending videos specifically on June 3rd (dates in this dataset are written as year.day.month, so `18.03.06` is June 3, 2018):**

In [None]:
# Filters rows based on date
df.loc[df['trending_date']=='18.03.06']

Unnamed: 0,trending_date,title,channel_title,category,tags,views,likes,dislikes,comment_count
0,18.03.06,BTS (방탄소년단) 'FAKE LOVE' Official MV (Extended ...,ibighit,Music,"BIGHIT|""빅히트""|""방탄소년단""|""BTS""|""BANGTAN""|""방탄""|""fak...",8120145,1624230,8402,145872
1,18.03.06,Official Call of Duty®: Black Ops 4 — Multipla...,Call of Duty,Gaming,"call of duty|""cod""|""activision""|""Black Ops 4""",9753197,348498,207752,141363
2,18.03.06,Maroon 5 - Girls Like You ft. Cardi B,Maroon5VEVO,Music,"Maroon|""Girls""|""Like""|""You""|""Interscope""|""Reco...",25497666,1547821,23176,98455
3,18.03.06,"Cardi B, Bad Bunny & J Balvin - I Like It [Off...",Cardi B,Music,"Cardi B|""I Like It""|""Invasion of Privacy""|""Bad...",38779629,1310596,67747,80520
4,18.03.06,[CHOREOGRAPHY] BTS (방탄소년단) 'FAKE LOVE' Dance P...,BANGTANTV,Music,"방탄소년단|""BTS""|""BANGTAN""|""HIPHOP""|""랩몬스터""|""RapMons...",8751939,1081236,4038,67537
...,...,...,...,...,...,...,...,...,...
95,18.03.06,[King of masked singer] 복면가왕 - 'unicorn' speci...,MBCentertainment,Entertainment,"MBC|""예능""|""일요예능""|""일밤""|""복면""|""가왕""|""복면가왕""|""김성주""|""김...",3744187,53006,535,2140
96,18.03.06,James Veitch’s Elaborate Wrong Number Prank -...,Team Coco,Comedy,Conan O'Brien Conan Conan (TV Series) TBS (TV ...,2183540,63438,967,893
97,18.03.06,Dan + Shay - Speechless (Wedding Video),Dan And Shay,Music,"wedding video|""heartfelt wedding video""|""emoti...",3190447,34215,1090,685
98,18.03.06,Wildlife - Official Teaser I HD I IFC Films,IFC Films,Film & Animation,"IFC Films|""ifc""|""film""|""trailer""|""2018""|""Paul ...",2121136,3281,76,155
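
The `trending_date` values appear to follow a year.day.month pattern, so `18.03.06` would be June 3, 2018, the first day of the week this dataset covers. Here is a minimal sketch, assuming that pattern holds for every row, that turns such a string into a real date using Python's built-in `datetime` module. The helper name `parse_trending_date` is ours, not part of the notebook.

```python
from datetime import datetime

def parse_trending_date(text):
    """Convert a 'YY.DD.MM' string such as '18.03.06' into a date object."""
    # %y = two-digit year, %d = day of month, %m = month number
    return datetime.strptime(text, "%y.%d.%m").date()

first_day = parse_trending_date("18.03.06")
last_day = parse_trending_date("18.09.06")
print(first_day.isoformat(), "to", last_day.isoformat())
```

Converting the strings this way makes it harder to mix up the day and the month when choosing which date to filter on.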


**Or, maybe we want to know the number of days that Cardi B had a top trending video that week:**

In [None]:
# Count the rows for a channel (this equals the number of trending days
# as long as the channel has only one trending video per day)
df.loc[df['channel_title']=='Cardi B'].shape[0]

7
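
Counting rows only matches the number of days when a channel has exactly one trending video per day. A slightly more careful sketch counts distinct dates instead, using pandas' `nunique()`. The table `sample` and the channel name `Channel X` are made up for illustration.

```python
import pandas as pd

# Hypothetical data: Channel X has two videos trending on the same day
sample = pd.DataFrame({
    'trending_date': ['18.03.06', '18.03.06', '18.04.06'],
    'channel_title': ['Channel X', 'Channel X', 'Channel X'],
    'title': ['Video 1', 'Video 2', 'Video 1'],
})

channel_rows = sample.loc[sample['channel_title'] == 'Channel X']

n_rows = channel_rows.shape[0]                     # counts every row
n_days = channel_rows['trending_date'].nunique()   # counts distinct dates

print(n_rows, n_days)
```

In this made-up example the two numbers differ, which is the situation where the row count would overstate the number of days.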

There are so, so, *so* many other questions you could ask about this data.

**Take a moment to list some other questions you might have about these trending YouTube videos. What do you want to know?**

*Enter your response by editing the cell below.*

<< ***STUDENT RESPONSE*** >>

[type answer here]

---

Many of the best insights we can gather from data involve looking at trends and patterns, things that we humans are quite good at identifying. But it's a little challenging to spot interesting patterns just by looking at a table with words and numbers.

That's where *data visualization* comes in. In the next sections, we'll explore 3 different ways of using code to manipulate our dataset and create charts that can help reveal intriguing insights.

# **Exploration 1**

In this section, we'll look at how many days each video made it to the top 100 trending list that week.

1. Count and store the number of days each unique video title is trending

In [None]:
# Create a structure named "data" to store the information we want to graph
data = {}

print("Structure named 'data' created!")

Structure named 'data' created!


In [None]:
# Count the number of times each unique video title appears in our dataset,
# and save this information in "data"
for t in df['title']:
  if t in data:
    data[t]+=1
  else:
    data[t]=1

print("Number of trending days counted successfully!")

Number of trending days counted successfully!
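
For the curious: the counting loop above is a very common pattern, and Python's standard library includes a `Counter` class that does the same job. The sketch below uses a short made-up list of titles rather than the real `df['title']` column.

```python
from collections import Counter

# Hypothetical list of titles, one entry per row of the dataset
titles = ['Video A', 'Video B', 'Video A', 'Video C', 'Video A']

# Counter tallies how many times each title appears
data = Counter(titles)

print(data['Video A'])
print(data.most_common(1))
```

In pandas, `df['title'].value_counts()` would give a similar tally in one line.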


2. Reformat our data for graphing

In [None]:
# Create a structure named "trending_dict" to store information from "data" in
# the particular format understood by our graphing function
trending_dict = {
    'video titles': [],
    '# days trending': []
}

print("Structure named 'trending_dict' with columns 'video titles' and '# days trending' created!")

Structure named 'trending_dict' with columns 'video titles' and '# days trending' created!


In [None]:
# Add the info in "data" to "trending_dict" to prepare for graphing
for v in data.keys():
  trending_dict['video titles'].append(v)
  trending_dict['# days trending'].append(data[v])

df1 = pd.DataFrame.from_dict(trending_dict)

print("Data in 'trending_dict' successfully reformatted for graphing!")

Data in 'trending_dict' successfully reformatted for graphing!


3. Create bar chart

In [None]:
# Input our properly-formatted data into our graphing function and specify that
# the x-axis will contain each unique video, and
# the y-axis will be the number of days each video was trending
fig = px.bar(df1, x='video titles', y='# days trending', title="Total number of days each video was trending in the week of June 3-9, 2018",height=900)

# Clean up visual presentation of the graph
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.update_xaxes(tickangle=70)
fig.update_yaxes(dtick=1)

fig.show()

### Let's Get Analytical

**Given the graph above, answer the following questions in the cell below:**

(You can hover over each bar for additional information.)

1. What do you notice?
2. What do you wonder?
3. What might be going on in this graph? Write a catchy headline that captures the graph’s main idea.

<< ***STUDENT RESPONSE*** >>

1.
2.
3.

# **Exploration 2**

We'll now explore how the view counts for trending videos increased throughout the week.

1. Store a list of the view count for each day a video is trending

In [None]:
# Reset "data" (we will use it to store new information for this next graph)
data = {}

print("Structure named 'data' recreated!")

Structure named 'data' recreated!


In [None]:
# For each unique video, store the view count for each day it was trending in "data"
for _, row in df.iterrows():
  title = row['title']
  views = row['views']
  if title not in data:
    data[title] = []
  data[title].append(views)

print("View counts stored successfully!")

View counts stored successfully!


2. Calculate the increase in view count for each succeeding trending day

In [None]:
# Using the stored view counts, we can figure out how a trending video's
# views increased over time
for t in data.keys():
  counts = data[t]
  deltas = []
  nDaysNotTrending = 7 - len(counts)
  # Add zeros for days it wasn't trending & Day 1 of trending
  # (this assumes the non-trending days all come before the trending days)
  for i in range(nDaysNotTrending+1):
    deltas.append(0)
  for i in range(len(counts)-1):
    deltas.append(counts[i+1] - counts[i])
  data[t] = deltas

print("View count increases calculated!")

View count increases calculated!
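
To see what this step does for a single video, here is a small worked sketch of the same logic. The function name `daily_increases` and the view counts are made up for illustration.

```python
def daily_increases(counts, n_days=7):
    """Return one 'new views' value per day of the week.

    `counts` holds the view count on each day the video was trending.
    Like the notebook's loop, this assumes those trending days are the
    last days of the week, so the leading days are filled with zeros.
    """
    n_not_trending = n_days - len(counts)
    # Zeros for non-trending days, plus one for the first trending day
    deltas = [0] * (n_not_trending + 1)
    # Difference between each day's count and the count the day before
    deltas += [later - earlier for earlier, later in zip(counts, counts[1:])]
    return deltas

# Hypothetical video that trended on the last 3 days of the week
example = daily_increases([1000, 1500, 2100])
print(example)  # [0, 0, 0, 0, 0, 500, 600]
```

The first trending day gets a zero because there is no earlier count in the dataset to compare it with.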


3. Reformat our data for graphing

In [None]:
# Create a structure named "delta_dict" to store information from "data" (the
# differences in view count) in the particular format understood by our
# graphing function
delta_dict = {
    'video title': [],
    'day': [],
    'new views': []
}

print("Structure named 'delta_dict' with columns 'video title', 'day', and 'new views' created!")

Structure named 'delta_dict' with columns 'video title', 'day', and 'new views' created!


In [None]:
# Create a list of the dates we will use for the x-axis labels
days = [
    'June 3, 2018',
    'June 4, 2018',
    'June 5, 2018',
    'June 6, 2018',
    'June 7, 2018',
    'June 8, 2018',
    'June 9, 2018'
]

print("List of the week's dates created!")

List of the week's dates created!


In [None]:
# Add the info in "data" to "delta_dict" to prepare for graphing
for t in data.keys():
  dArray = data[t]
  for i in range(len(dArray)):
    delta_dict['video title'].append(t)
    delta_dict['day'].append(days[i])
    delta_dict['new views'].append(dArray[i])

df2 = pd.DataFrame.from_dict(delta_dict)

print("Data in 'delta_dict' successfully reformatted for graphing!")

Data in 'delta_dict' successfully reformatted for graphing!


4. Create line graph

In [None]:
# Input our properly-formatted data into our graphing function and specify that
# the x-axis will contain the days of the week, and
# the y-axis will be the number of new views that day (the increase in view count)
fig2 = px.line(df2, x='day', y='new views', title="Increases in view count of trending videos during the week of June 3-9, 2018", color='video title')
fig2.update_layout(showlegend=False)

fig2.show()

### Let's Get Analytical

**Given the graph above, answer the following questions in the cell below:**

(You can hover over each line for additional information.)

1. What do you notice?
2. What do you wonder?
3. What might be going on in this graph? Write a catchy headline that captures the graph’s main idea.

<< ***STUDENT RESPONSE*** >>

1.
2.
3.

# **Exploration 3**

Finally, we'll check out the relationship between view count and the number of likes/dislikes.

1. Store the number of likes, dislikes, and corresponding view count

In [None]:
# Create a structure named "likesDislikes_dict" to store raw information from our dataset
# in the particular format understood by our graphing function
likesDislikes_dict = {
    'views': [],
    '# responses': [],
    'type': [] # type refers to Likes or Dislikes
}

print("Structure named 'likesDislikes_dict' with columns 'views', '# responses', and 'type' created!")

In [None]:
# For each entry in our dataset, store every pair of views + likes, and
# every pair of views + dislikes
for _, row in df.iterrows():
  views = row['views']
  likes = row['likes']
  dislikes = row['dislikes']

  # Add likes
  likesDislikes_dict['views'].append(views)
  likesDislikes_dict['# responses'].append(likes)
  likesDislikes_dict['type'].append('Likes')

  # Add dislikes
  likesDislikes_dict['views'].append(views)
  likesDislikes_dict['# responses'].append(dislikes)
  likesDislikes_dict['type'].append('Dislikes')

df3 = pd.DataFrame.from_dict(likesDislikes_dict)

print("Likes, dislikes, and view counts stored and formatted properly for graphing!")
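
As an aside, pandas has a built-in method, `melt()`, that performs this kind of "stack two columns into one" reshaping. The sketch below applies it to a tiny made-up table named `sample`. It is an alternative to the loop above rather than what the notebook uses, and it labels the rows `likes`/`dislikes` (lowercase, taken from the column names) instead of `Likes`/`Dislikes`.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset
sample = pd.DataFrame({
    'views': [1000, 5000],
    'likes': [100, 400],
    'dislikes': [5, 20],
})

# Turn the 'likes' and 'dislikes' columns into rows, keeping 'views' alongside
long_form = sample.melt(
    id_vars=['views'],
    value_vars=['likes', 'dislikes'],
    var_name='type',
    value_name='# responses',
)
print(long_form)
```

Each original row becomes two rows, one for its like count and one for its dislike count, which is the shape the scatter plot needs.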

2. Create scatter plot

In [None]:
# Input our properly-formatted data into our graphing function and specify that
# the x-axis will be the number of views,
# the y-axis will be the number of responses (i.e., number of likes or dislikes), and
# the plotted points will be colored depending on whether it is showing a like or dislike count
fig3 = px.scatter(df3, x='views', y='# responses', color='type', title="Correlation between views and likes/dislikes for videos trending during the week of June 3-9, 2018")
fig3.show()

### Let's Get Analytical

**Given the graph above, answer the following questions:**

(You can hover over each point for additional information.)

1. What do you notice?
2. What do you wonder?
3. What might be going on in this graph? Write a catchy headline that captures the graph’s main idea.

<< ***STUDENT RESPONSE*** >>

1.
2.
3.

# **Final Comments**

## Our Thoughts on Each Graph

### *Graph #1*

In this graph, we can see the number of days during the week each video was trending. The bars on the left side of the graph show that the most common trending time was 7 days--the whole week. Looking at the titles of the 7-day trending videos, you can see that many of them are official music videos or movie trailers. If you watch videos on YouTube, does this trend match what you often watch?

Something else to consider: how do you think examining only one week affects our view of how popular each video was? This dataset only gives us data on the trending videos on June 3 - 9, 2018. Do you think that some of the videos that were only trending for one day in this week (the shortest bars to the right) might've begun trending the week before? Consider what else we can and can’t assume from this graph.

### *Graph #2*

Graph 2 illustrates the number of new views for the top trending videos on YouTube for the week of June 3-9, 2018. If you hover over any line, you can see the name of the video, the date, and the number of new views. For each day, we can find the difference in new views between two videos. We can also find the total number of new views each video got during the week by adding up its new views for each day.

The percent change in the number of new views between two dates can be found by subtracting the earlier date's value from the later date's value, dividing by the earlier date's value, and multiplying by 100. Take a moment to find the percent change between June 7th and June 8th for three different videos. What can you look at to predict if the percent change will be positive or negative?
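
The percent change calculation can be sketched in a few lines of Python. The function name `percent_change` and the numbers are made up for illustration; they are not taken from the dataset.

```python
def percent_change(earlier, later):
    """Percent change from an earlier value to a later value."""
    if earlier == 0:
        # Dividing by zero is undefined (for example, a day with no new views)
        return None
    return (later - earlier) / earlier * 100

# Hypothetical new-view counts on two consecutive days
print(percent_change(200000, 250000))  # 25.0
print(percent_change(400, 300))        # -25.0
```

A positive result means the later day had more new views than the earlier one, and a negative result means it had fewer.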

While the lines for the majority of the videos are clustered below 3 million views, there are a couple of videos that consistently get 5 million or more new views each day this week. What questions does this lead you to ask? Take a look at the largest number of new views a video gets this week. How many new views is this, and what is the name of the video? Would you consider this video an outlier? We also note that certain videos have large peaks. Which video demonstrates the largest increase in new views from June 6th to June 7th?

### *Graph #3*

This graph shows data on the number of views a YouTube video has (on the x-axis) and the number of times people interacted with the corresponding video (on the y-axis). Data counting likes are represented by blue dots, and data counting dislikes are shown in red dots.

Since the graph shows two different variables, it can be used to understand the relationship between the number of views and the number of likes/dislikes a video has. Notice how as the number of views increases, the number of likes also seems to increase noticeably. This is a sign of an upward trend and positive correlation. However, more complicated statistical tests are needed to understand the significance of these correlations and confirm the two variables are related.
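
One common first step toward putting a number on this relationship is the correlation coefficient, which ranges from -1 to 1. Below is a minimal sketch using pandas' `Series.corr()` on a tiny made-up table named `sample`; with the real data you would use `df['views']` and `df['likes']`. A value close to 1 suggests a strong upward linear trend, but on its own it does not prove that one variable causes the other.

```python
import pandas as pd

# Hypothetical view and like counts for four videos
sample = pd.DataFrame({
    'views': [1000, 2000, 3000, 4000],
    'likes': [110, 190, 320, 390],
})

# Pearson correlation coefficient between the two columns
r = sample['views'].corr(sample['likes'])
print(round(r, 3))
```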

After spending some time observing the graph, you may want to think critically about these additional questions:

* What do you notice about the scale of the dots representing likes vs. the scale of the dots representing dislikes? What does this tell you about the likes/dislikes on YouTube videos in general, and how does this change your understanding of the correlation between the two variables?
* Do you think this graph succeeds in telling a story? Why or why not?
* Is the graph understandable and clear? If so, what elements (ex: labels, axes) help you to understand the graph? Is there anything that can be added or removed to make it more clear?
* What are other variables in the dataset you might plot using this type of graph? What kinds of data can be used in making a scatterplot?



## Other Possible Explorations

There are many other ways we can manipulate this dataset to answer more questions. Here are some additional questions we can ask, and how we might approach answering them.

***What were the most popular video categories?***

Looking at the category column in our dataset, we can add up how many trending videos were about Music, Gaming, Entertainment, etc. With these totals, we can create a bar chart that compares the total number of videos in each category.
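
Here is one way the category totals might be computed, sketched on a tiny made-up table named `sample`. Because a video that trends for several days appears in several rows, the sketch shows two counts: one per row, and one per unique video title (using `drop_duplicates()`).

```python
import pandas as pd

# Hypothetical data: 'Song 1' trends on two days, so it has two rows
sample = pd.DataFrame({
    'title': ['Song 1', 'Song 1', 'Game Trailer', 'Song 2'],
    'category': ['Music', 'Music', 'Gaming', 'Music'],
})

# Count every row (a video trending for 2 days is counted twice)
rows_per_category = sample['category'].value_counts()

# Count each unique video title only once
videos_per_category = (
    sample.drop_duplicates(subset='title')['category'].value_counts()
)

print(rows_per_category)
print(videos_per_category)
```

Which count is more useful depends on whether the question is about trending slots or about distinct videos.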

***Were there any similarities in theme between the most popular songs?***

Looking at the tags column in our dataset, we can split up each cell into its individual tags and add up the number of each tag across all of the top videos. With these totals, we can create a bar chart that compares the frequencies of different tags.
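
In the rows shown earlier, the tags in each cell are separated by `|` and most are wrapped in double quotes. A minimal sketch of the splitting and counting, assuming that format holds throughout, could look like this. The tag strings below are made up for illustration.

```python
from collections import Counter

# Hypothetical tag cells in the same 'tag|"tag"|"tag"' style as the dataset
tag_cells = [
    'music|"pop"|"official video"',
    'gaming|"trailer"',
    'music|"trailer"|"pop"',
]

tag_counts = Counter()
for cell in tag_cells:
    for tag in cell.split('|'):
        # Remove the surrounding quotes and ignore capitalization
        tag_counts[tag.strip('"').lower()] += 1

print(tag_counts.most_common(3))
```

Lowercasing the tags is a design choice so that, for example, `Pop` and `pop` are counted together.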

***What were the most popular channels in this dataset of trending videos?***

Looking at the `channel_title` column in our dataset, we can add up how many trending videos come from each channel. With these totals, we can create a bar chart that compares the total number of videos from each channel.

***How did the comment counts for trending videos fluctuate throughout the week?***

We can repeat Exploration 2 with the `comment_count` column instead of the `views` column, creating a line graph of new comments over time. If we had a similar question about the fluctuation of likes and dislikes over time, we could use the `likes` or `dislikes` columns instead.

***Is the number of comments correlated with (related to) the number of likes and/or dislikes that a video receives?***

While plotting the likes and dislikes in our final scatter plot, we can switch the number of views each video received with the comments it received. With this new graph, we can visually analyze the correlation between comments and dislikes or comments and likes.

***Design your own exploration!***

Review the questions you asked under "Kickstart Your Curiosity". What types of visualizations (bar graphs, line graphs, scatter plots, etc.) could we use to answer your questions? What would be on the x-axes and y-axes of your charts (if applicable)?



---


## **We value your feedback!**

Please consider [filling out this quick form](https://forms.gle/2nxmHEfMvRp2UBQP9) to let us know what you thought about this activity.

Help us improve our curriculum by telling us about your experience!



---


# **Additional Resources**

*More on **bar, line, and scatter graphs**:*
*   A Complete Guide to [Bar Charts](https://chartio.com/learn/charts/bar-chart-complete-guide/)
*   A Complete Guide to [Line Charts](https://chartio.com/learn/charts/line-chart-complete-guide/)
*   A Complete Guide to [Scatter Plots](https://chartio.com/learn/charts/what-is-a-scatter-plot/)

*More on **creating graphs in Python using the Pandas library**:*
* Create a [simple bar chart](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.bar.html)
* Create a [simple line graph](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.line.html)
* Create a [simple scatter plot](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.scatter.html)


*More on **real world applications**:*
* Data scientists [built a tool](https://towardsdatascience.com/youtube-views-predictor-9ec573090acb) to help YouTube influencers predict the number of views for their next video.
* A data scientist analyzed 70k+ of YouTube's trending videos. [Here's what he learned.](https://thehustle.co/07102020-data-scientist-analyzed-youtube-videos/)

*More on **data visualization**:*
* Watch [Fanny Chevalier's talk](https://www.youtube.com/watch?v=8vkqeOiETQM&ab_channel=ICMEStudio) at Stanford's 2020 WiDS Conference
* Data visualization [beginner's guide](https://www.tableau.com/learn/articles/data-visualization)
* Coursera Guided Project: [Exploratory Data Analysis with Python and Pandas](https://www.coursera.org/projects/exploratory-data-analysis-python-pandas)
(similar to this activity, but you get to write the code!)






