# Introduction to Pandas
<sup>Created by Natawut Nupairoj, Department of Computer Engineering, Chulalongkorn University</sup>

Pandas is one of the most popular tools in Python for data analytics.  It contains data structures and data manipulation tools designed to make data cleaning and analysis fast and easy.

In this tutorial, we will explore a dataset from Kaggle to demonstrate basic pandas operations.  The dataset is [Trending YouTube Video Statistics](https://www.kaggle.com/datasnaek/youtube-new).  For simplicity, we will work with only the US dataset ([USvideos.csv](https://www.kaggle.com/datasnaek/youtube-new?select=USvideos.csv) and [US_category_id.json](https://www.kaggle.com/datasnaek/youtube-new?select=US_category_id.json)).

We start by importing pandas under the short name "pd".  We also import NumPy, which is commonly used alongside pandas.

In [None]:
import pandas as pd
import numpy as np

## YouTube Trending Data Exploration

### Downloading data files from shared drive (optional for Colab)

To simplify data retrieval on Colab, we check whether we are in the Colab environment; if so, we download the data files from a shared drive and save them in the folder "data".

If you are using Jupyter Notebook on a local computer, you can read the data directly, assuming you have saved the data files in the folder "data".

In [None]:
import sys
IN_COLAB = 'google.colab' in sys.modules
if IN_COLAB:
    !wget https://github.com/kaopanboonyuen/2110446_DataScience_2021s2/raw/main/datasets/data.tgz -O data.tgz
    !tar -xzvf data.tgz

--2022-01-10 12:33:33--  https://github.com/kaopanboonyuen/2110446_DataScience_2021s2/raw/main/datasets/data.tgz
Resolving github.com (github.com)... 192.30.255.113
Connecting to github.com (github.com)|192.30.255.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/kaopanboonyuen/2110446_DataScience_2021s2/main/datasets/data.tgz [following]
--2022-01-10 12:33:33--  https://raw.githubusercontent.com/kaopanboonyuen/2110446_DataScience_2021s2/main/datasets/data.tgz
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 45477462 (43M) [application/octet-stream]
Saving to: ‘data.tgz’


2022-01-10 12:33:34 (276 MB/s) - ‘data.tgz’ saved [45477462/45477462]

data/
data/._GB_category_id.json
data/GB_category_i

### Read input from a data file into a DataFrame

In [None]:
vdo_df = pd.read_csv('data/USvideos.csv')

In [None]:
type(vdo_df)

pandas.core.frame.DataFrame

### Show some data rows

In [None]:
vdo_df

Unnamed: 0,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description
0,2kyS6SvSYSE,17.14.11,WE WANT TO TALK ABOUT OUR MARRIAGE,CaseyNeistat,22,2017-11-13T17:13:01.000Z,SHANtell martin,748374,57527,2966,15954,https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg,False,False,False,SHANTELL'S CHANNEL - https://www.youtube.com/s...
1,1ZAPwfrtAFY,17.14.11,The Trump Presidency: Last Week Tonight with J...,LastWeekTonight,24,2017-11-13T07:30:00.000Z,"last week tonight trump presidency|""last week ...",2418783,97185,6146,12703,https://i.ytimg.com/vi/1ZAPwfrtAFY/default.jpg,False,False,False,"One year after the presidential election, John..."
2,5qpjK5DgCt4,17.14.11,"Racist Superman | Rudy Mancuso, King Bach & Le...",Rudy Mancuso,23,2017-11-12T19:05:24.000Z,"racist superman|""rudy""|""mancuso""|""king""|""bach""...",3191434,146033,5339,8181,https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg,False,False,False,WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► http...
3,puqaWrEC7tY,17.14.11,Nickelback Lyrics: Real or Fake?,Good Mythical Morning,24,2017-11-13T11:00:04.000Z,"rhett and link|""gmm""|""good mythical morning""|""...",343168,10172,666,2146,https://i.ytimg.com/vi/puqaWrEC7tY/default.jpg,False,False,False,Today we find out if Link is a Nickelback amat...
4,d380meD0W0M,17.14.11,I Dare You: GOING BALD!?,nigahiga,24,2017-11-12T18:01:41.000Z,"ryan|""higa""|""higatv""|""nigahiga""|""i dare you""|""...",2095731,132235,1989,17518,https://i.ytimg.com/vi/d380meD0W0M/default.jpg,False,False,False,I know it's been a while since we did this sho...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
40944,BZt0qjTWNhw,18.14.06,The Cat Who Caught the Laser,AaronsAnimals,15,2018-05-18T13:00:04.000Z,"aarons animals|""aarons""|""animals""|""cat""|""cats""...",1685609,38160,1385,2657,https://i.ytimg.com/vi/BZt0qjTWNhw/default.jpg,False,False,False,The Cat Who Caught the Laser - Aaron's Animals
40945,1h7KV2sjUWY,18.14.06,True Facts : Ant Mutualism,zefrank1,22,2018-05-18T01:00:06.000Z,[none],1064798,60008,382,3936,https://i.ytimg.com/vi/1h7KV2sjUWY/default.jpg,False,False,False,
40946,D6Oy4LfoqsU,18.14.06,I GAVE SAFIYA NYGAARD A PERFECT HAIR MAKEOVER ...,Brad Mondo,24,2018-05-18T17:34:22.000Z,I gave safiya nygaard a perfect hair makeover ...,1066451,48068,1032,3992,https://i.ytimg.com/vi/D6Oy4LfoqsU/default.jpg,False,False,False,I had so much fun transforming Safiyas hair in...
40947,oV0zkMe1K8s,18.14.06,How Black Panther Should Have Ended,How It Should Have Ended,1,2018-05-17T17:00:04.000Z,"Black Panther|""HISHE""|""Marvel""|""Infinity War""|...",5660813,192957,2846,13088,https://i.ytimg.com/vi/oV0zkMe1K8s/default.jpg,False,False,False,How Black Panther Should Have EndedWatch More ...


In [None]:
vdo_df.head()

Unnamed: 0,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description
0,2kyS6SvSYSE,17.14.11,WE WANT TO TALK ABOUT OUR MARRIAGE,CaseyNeistat,22,2017-11-13T17:13:01.000Z,SHANtell martin,748374,57527,2966,15954,https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg,False,False,False,SHANTELL'S CHANNEL - https://www.youtube.com/s...
1,1ZAPwfrtAFY,17.14.11,The Trump Presidency: Last Week Tonight with J...,LastWeekTonight,24,2017-11-13T07:30:00.000Z,"last week tonight trump presidency|""last week ...",2418783,97185,6146,12703,https://i.ytimg.com/vi/1ZAPwfrtAFY/default.jpg,False,False,False,"One year after the presidential election, John..."
2,5qpjK5DgCt4,17.14.11,"Racist Superman | Rudy Mancuso, King Bach & Le...",Rudy Mancuso,23,2017-11-12T19:05:24.000Z,"racist superman|""rudy""|""mancuso""|""king""|""bach""...",3191434,146033,5339,8181,https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg,False,False,False,WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► http...
3,puqaWrEC7tY,17.14.11,Nickelback Lyrics: Real or Fake?,Good Mythical Morning,24,2017-11-13T11:00:04.000Z,"rhett and link|""gmm""|""good mythical morning""|""...",343168,10172,666,2146,https://i.ytimg.com/vi/puqaWrEC7tY/default.jpg,False,False,False,Today we find out if Link is a Nickelback amat...
4,d380meD0W0M,17.14.11,I Dare You: GOING BALD!?,nigahiga,24,2017-11-12T18:01:41.000Z,"ryan|""higa""|""higatv""|""nigahiga""|""i dare you""|""...",2095731,132235,1989,17518,https://i.ytimg.com/vi/d380meD0W0M/default.jpg,False,False,False,I know it's been a while since we did this sho...


In [None]:
vdo_df.tail()

Unnamed: 0,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description
40944,BZt0qjTWNhw,18.14.06,The Cat Who Caught the Laser,AaronsAnimals,15,2018-05-18T13:00:04.000Z,"aarons animals|""aarons""|""animals""|""cat""|""cats""...",1685609,38160,1385,2657,https://i.ytimg.com/vi/BZt0qjTWNhw/default.jpg,False,False,False,The Cat Who Caught the Laser - Aaron's Animals
40945,1h7KV2sjUWY,18.14.06,True Facts : Ant Mutualism,zefrank1,22,2018-05-18T01:00:06.000Z,[none],1064798,60008,382,3936,https://i.ytimg.com/vi/1h7KV2sjUWY/default.jpg,False,False,False,
40946,D6Oy4LfoqsU,18.14.06,I GAVE SAFIYA NYGAARD A PERFECT HAIR MAKEOVER ...,Brad Mondo,24,2018-05-18T17:34:22.000Z,I gave safiya nygaard a perfect hair makeover ...,1066451,48068,1032,3992,https://i.ytimg.com/vi/D6Oy4LfoqsU/default.jpg,False,False,False,I had so much fun transforming Safiyas hair in...
40947,oV0zkMe1K8s,18.14.06,How Black Panther Should Have Ended,How It Should Have Ended,1,2018-05-17T17:00:04.000Z,"Black Panther|""HISHE""|""Marvel""|""Infinity War""|...",5660813,192957,2846,13088,https://i.ytimg.com/vi/oV0zkMe1K8s/default.jpg,False,False,False,How Black Panther Should Have EndedWatch More ...
40948,ooyjaVdt-jA,18.14.06,Official Call of Duty®: Black Ops 4 — Multipla...,Call of Duty,20,2018-05-17T17:09:38.000Z,"call of duty|""cod""|""activision""|""Black Ops 4""",10306119,357079,212976,144795,https://i.ytimg.com/vi/ooyjaVdt-jA/default.jpg,False,False,False,Call of Duty: Black Ops 4 Multiplayer raises t...


### Explore structure

In [None]:
vdo_df.shape

(40949, 16)

In [None]:
vdo_df.columns

Index(['video_id', 'trending_date', 'title', 'channel_title', 'category_id',
       'publish_time', 'tags', 'views', 'likes', 'dislikes', 'comment_count',
       'thumbnail_link', 'comments_disabled', 'ratings_disabled',
       'video_error_or_removed', 'description'],
      dtype='object')

In [None]:
vdo_df.index

RangeIndex(start=0, stop=40949, step=1)

In [None]:
vdo_df.dtypes

video_id                  object
trending_date             object
title                     object
channel_title             object
category_id                int64
publish_time              object
tags                      object
views                      int64
likes                      int64
dislikes                   int64
comment_count              int64
thumbnail_link            object
comments_disabled           bool
ratings_disabled            bool
video_error_or_removed      bool
description               object
dtype: object

### Show partial data
- show some rows

In [None]:
vdo_df[0:2]

Unnamed: 0,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description
0,2kyS6SvSYSE,17.14.11,WE WANT TO TALK ABOUT OUR MARRIAGE,CaseyNeistat,22,2017-11-13T17:13:01.000Z,SHANtell martin,748374,57527,2966,15954,https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg,False,False,False,SHANTELL'S CHANNEL - https://www.youtube.com/s...
1,1ZAPwfrtAFY,17.14.11,The Trump Presidency: Last Week Tonight with J...,LastWeekTonight,24,2017-11-13T07:30:00.000Z,"last week tonight trump presidency|""last week ...",2418783,97185,6146,12703,https://i.ytimg.com/vi/1ZAPwfrtAFY/default.jpg,False,False,False,"One year after the presidential election, John..."


In [None]:
vdo_df[-4:-1]

Unnamed: 0,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description
40945,1h7KV2sjUWY,18.14.06,True Facts : Ant Mutualism,zefrank1,22,2018-05-18T01:00:06.000Z,[none],1064798,60008,382,3936,https://i.ytimg.com/vi/1h7KV2sjUWY/default.jpg,False,False,False,
40946,D6Oy4LfoqsU,18.14.06,I GAVE SAFIYA NYGAARD A PERFECT HAIR MAKEOVER ...,Brad Mondo,24,2018-05-18T17:34:22.000Z,I gave safiya nygaard a perfect hair makeover ...,1066451,48068,1032,3992,https://i.ytimg.com/vi/D6Oy4LfoqsU/default.jpg,False,False,False,I had so much fun transforming Safiyas hair in...
40947,oV0zkMe1K8s,18.14.06,How Black Panther Should Have Ended,How It Should Have Ended,1,2018-05-17T17:00:04.000Z,"Black Panther|""HISHE""|""Marvel""|""Infinity War""|...",5660813,192957,2846,13088,https://i.ytimg.com/vi/oV0zkMe1K8s/default.jpg,False,False,False,How Black Panther Should Have EndedWatch More ...


- show some columns

Notice the difference between selecting a single column (which returns a Series) and selecting multiple columns (which returns a DataFrame).

In [None]:
vdo_df['title']

0                       WE WANT TO TALK ABOUT OUR MARRIAGE
1        The Trump Presidency: Last Week Tonight with J...
2        Racist Superman | Rudy Mancuso, King Bach & Le...
3                         Nickelback Lyrics: Real or Fake?
4                                 I Dare You: GOING BALD!?
                               ...                        
40944                         The Cat Who Caught the Laser
40945                           True Facts : Ant Mutualism
40946    I GAVE SAFIYA NYGAARD A PERFECT HAIR MAKEOVER ...
40947                  How Black Panther Should Have Ended
40948    Official Call of Duty®: Black Ops 4 — Multipla...
Name: title, Length: 40949, dtype: object

In [None]:
type(vdo_df['title'])

pandas.core.series.Series

In [None]:
vdo_df.title

0                       WE WANT TO TALK ABOUT OUR MARRIAGE
1        The Trump Presidency: Last Week Tonight with J...
2        Racist Superman | Rudy Mancuso, King Bach & Le...
3                         Nickelback Lyrics: Real or Fake?
4                                 I Dare You: GOING BALD!?
                               ...                        
40944                         The Cat Who Caught the Laser
40945                           True Facts : Ant Mutualism
40946    I GAVE SAFIYA NYGAARD A PERFECT HAIR MAKEOVER ...
40947                  How Black Panther Should Have Ended
40948    Official Call of Duty®: Black Ops 4 — Multipla...
Name: title, Length: 40949, dtype: object

In [None]:
vdo_df[['title', 'views', 'likes']]

Unnamed: 0,title,views,likes
0,WE WANT TO TALK ABOUT OUR MARRIAGE,748374,57527
1,The Trump Presidency: Last Week Tonight with J...,2418783,97185
2,"Racist Superman | Rudy Mancuso, King Bach & Le...",3191434,146033
3,Nickelback Lyrics: Real or Fake?,343168,10172
4,I Dare You: GOING BALD!?,2095731,132235
...,...,...,...
40944,The Cat Who Caught the Laser,1685609,38160
40945,True Facts : Ant Mutualism,1064798,60008
40946,I GAVE SAFIYA NYGAARD A PERFECT HAIR MAKEOVER ...,1066451,48068
40947,How Black Panther Should Have Ended,5660813,192957


In [None]:
type(vdo_df[['title', 'views', 'likes']])

pandas.core.frame.DataFrame
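
The Series-versus-DataFrame distinction can be reproduced on a tiny table.  The sketch below uses a hypothetical `toy_df` (not part of the YouTube dataset) and also shows that a list containing a single column name still returns a DataFrame.

```python
import pandas as pd

# Hypothetical toy data standing in for the YouTube dataset
toy_df = pd.DataFrame({
    'title': ['a', 'b', 'c'],
    'views': [10, 20, 30],
    'likes': [1, 2, 3],
})

one_col = toy_df['views']               # single label -> Series
one_col_df = toy_df[['views']]          # list with one label -> one-column DataFrame
two_cols = toy_df[['title', 'views']]   # list of labels -> DataFrame

print(type(one_col).__name__)
print(type(one_col_df).__name__)
print(one_col_df.shape)
```

Attribute access such as `toy_df.views` is a shortcut for `toy_df['views']`, but it only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method.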

- show a block

*loc* and *iloc* select a block, that is, a subset of rows and columns, from a DataFrame.  *loc* selects by index labels and column names; *iloc* selects by integer positions of rows and columns.  Note that a *loc* slice includes its end label, while an *iloc* slice excludes its end position.  Both can be used to read and to write values.

In [None]:
vdo_df.loc[10:15, ['title', 'channel_title', 'publish_time']]

Unnamed: 0,title,channel_title,publish_time
10,Dion Lewis' 103-Yd Kick Return TD vs. Denver! ...,NFL,2017-11-13T02:05:26.000Z
11,(SPOILERS) 'Shiva Saves the Day' Talked About ...,amc,2017-11-13T03:00:00.000Z
12,Marshmello - Blocks (Official Music Video),marshmello,2017-11-13T17:00:00.000Z
13,Which Countries Are About To Collapse?,NowThis World,2017-11-12T14:00:00.000Z
14,SHOPPING FOR NEW FISH!!!,The king of DIY,2017-11-12T18:30:01.000Z
15,The New SpotMini,BostonDynamics,2017-11-13T20:09:58.000Z


In [None]:
vdo_df.iloc[10:15, 3:5]

Unnamed: 0,channel_title,category_id
10,NFL,17
11,amc,24
12,marshmello,10
13,NowThis World,25
14,The king of DIY,15
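
The cells above only read data.  As a minimal sketch of the label/position difference and of writing through *loc* and *iloc*, the example below uses a hypothetical `toy_df` whose index labels deliberately differ from the row positions.

```python
import pandas as pd

# Hypothetical toy data; labels 100..103 differ from positions 0..3
toy_df = pd.DataFrame(
    {'title': ['a', 'b', 'c', 'd'], 'views': [10, 20, 30, 40]},
    index=[100, 101, 102, 103],
)

by_label = toy_df.loc[100:102, ['title']]   # label slice: end label 102 is included
by_position = toy_df.iloc[0:2, 0:1]         # position slice: end position 2 is excluded

# Write through loc: set 'views' for the row labelled 103
toy_df.loc[103, 'views'] = 0
# Write through iloc: set the first row, second column
toy_df.iloc[0, 1] = 99

print(len(by_label), len(by_position))
print(toy_df)
```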


### Remove Duplicates

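A DataFrame may contain duplicate rows.  *drop_duplicates* returns a new DataFrame by default; with *inplace=True*, it modifies the original DataFrame instead.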

In [None]:
vdo_df.shape

(40949, 16)

In [None]:
vdo_nondup = vdo_df.drop_duplicates()

In [None]:
vdo_nondup.shape

(40901, 16)

In [None]:
vdo_df.shape

(40949, 16)

In [None]:
vdo_df.drop_duplicates(inplace=True)

In [None]:
vdo_df.shape

(40901, 16)
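
To see which rows count as duplicates, here is a small sketch on a hypothetical `toy_df`.  It also illustrates the `subset` and `keep` parameters, which the notebook itself does not use, and shows that the original index labels are kept after dropping rows.

```python
import pandas as pd

# Hypothetical toy data with one exact duplicate row (index 1)
toy_df = pd.DataFrame({
    'video_id': ['x1', 'x1', 'x2', 'x1'],
    'views': [10, 10, 20, 15],
})

dup_mask = toy_df.duplicated()       # True for rows that repeat an earlier row exactly
deduped = toy_df.drop_duplicates()   # returns a new DataFrame; toy_df is unchanged

# Keep one row per video_id, preferring the last occurrence
by_id = toy_df.drop_duplicates(subset='video_id', keep='last')

print(dup_mask.tolist())
print(deduped.index.tolist())
print(by_id.index.tolist())
```

Because the index labels survive, the last label can be larger than the number of remaining rows, as in the `vdo_df` output above.  If consecutive labels are needed, `reset_index(drop=True)` renumbers them.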

# Basic Pandas Operations

We will learn several basic pandas operations, including value_counts, statistical calculations, describing numerical data, boolean indexing, etc.

## Statistical Calculation

### Which VDO has the most days in trending?
- count the number of times each title appears in the trending list

In [None]:
vdo_df.title

0                       WE WANT TO TALK ABOUT OUR MARRIAGE
1        The Trump Presidency: Last Week Tonight with J...
2        Racist Superman | Rudy Mancuso, King Bach & Le...
3                         Nickelback Lyrics: Real or Fake?
4                                 I Dare You: GOING BALD!?
                               ...                        
40944                         The Cat Who Caught the Laser
40945                           True Facts : Ant Mutualism
40946    I GAVE SAFIYA NYGAARD A PERFECT HAIR MAKEOVER ...
40947                  How Black Panther Should Have Ended
40948    Official Call of Duty®: Black Ops 4 — Multipla...
Name: title, Length: 40901, dtype: object

In [None]:
vdo_df.title.value_counts()

WE MADE OUR MOM CRY...HER DREAM CAME TRUE!                                      29
Sam Smith - Pray (Official Video) ft. Logic                                     29
Mission: Impossible - Fallout (2018) - Official Trailer - Paramount Pictures    29
The Deadliest Being on Planet Earth – The Bacteriophage                         28
Bohemian Rhapsody | Teaser Trailer [HD] | 20th Century FOX                      28
                                                                                ..
Meghan Trainor & Guillermo Del Toro: Rat Enthusiasts                             1
What Happens to Diesel in Liquid Nitrogen?                                       1
Mila Kunis & Kate McKinnon Play 'Speak Out'                                      1
Roger Federer's  20th Grand Slam Victory Tribute                                 1
How Spring Looks Like around the World                                           1
Name: title, Length: 6455, dtype: int64

- **value_counts** <br> Return a Series containing unique values as its index and frequencies as its values, ordered by count in descending order
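
As a minimal sketch of how the result can be used, the example below counts values in a hypothetical `titles` Series and reads off the most frequent entry.  The `normalize=True` option, which returns proportions instead of counts, is an addition not used elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical toy Series standing in for vdo_df.title
titles = pd.Series(['A', 'B', 'A', 'C', 'A', 'B'])

counts = titles.value_counts()    # index = unique values, values = frequencies
top_title = counts.index[0]       # most frequent value (counts are sorted descending)
top_count = counts.iloc[0]        # its frequency

shares = titles.value_counts(normalize=True)   # proportions instead of counts

print(top_title, top_count)
print(shares)
```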

### What is the minimum number of views to get trending?
- calculate the minimum value in the views column (statistical calculation)

In [None]:
vdo_df.views.min()

549

**How about other statistics?**

In [None]:
vdo_df.views.mean()

2360678.0387276593

In [None]:
vdo_df.describe()

Unnamed: 0,category_id,views,likes,dislikes,comment_count
count,40901.0,40901.0,40901.0,40901.0,40901.0
mean,19.970588,2360678.0,74271.73,3711.722,8448.567
std,7.569362,7397719.0,228999.9,29046.24,37451.39
min,1.0,549.0,0.0,0.0,0.0
25%,17.0,241972.0,5416.0,202.0,613.0
50%,24.0,681064.0,18069.0,630.0,1855.0
75%,25.0,1821926.0,55338.0,1936.0,5752.0
max,43.0,225211900.0,5613827.0,1674420.0,1361580.0


**Descriptive and summary statistics methods**
- **count** <br> Number of non-NA values
- **describe** <br> Compute set of summary statistics for Series or each DataFrame column
- **min, max** <br> Compute minimum and maximum values
- **argmin, argmax** <br> Compute index locations (integer positions) at which minimum or maximum value obtained, respectively
- **idxmin, idxmax** <br> Compute index labels at which minimum or maximum value obtained, respectively
- **quantile** <br> Compute sample quantile ranging from 0 to 1
- **sum** <br> Sum of values
- **mean** <br> Mean of values
- **median** <br> Arithmetic median (50% quantile) of values
- **mad** <br> Mean absolute deviation from mean value (deprecated in pandas 1.5 and removed in pandas 2.0)
- **prod** <br> Product of all values
- **var** <br> Sample variance of values
- **std** <br> Sample standard deviation of values
- **skew** <br> Sample skewness (third moment) of values
- **kurt** <br> Sample kurtosis (fourth moment) of values
- **cumsum** <br> Cumulative sum of values
- **cummin, cummax** <br> Cumulative minimum or maximum of values, respectively
- **cumprod** <br> Cumulative product of values
- **diff** <br> Compute first arithmetic difference (useful for time series)
- **pct_change** <br> Compute percent changes
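
A few of the methods listed above, applied to a hypothetical `views` Series with string labels so that positions and labels are easy to tell apart.  The mean absolute deviation is computed by hand here, since the `mad` method is not available in recent pandas versions.

```python
import pandas as pd

# Hypothetical toy Series; labels 'a'..'d' make idxmax and argmax easy to compare
views = pd.Series([100, 400, 200, 300], index=['a', 'b', 'c', 'd'])

label_of_max = views.idxmax()            # index label of the maximum
position_of_max = views.argmax()         # integer position of the maximum
median = views.median()                  # same as views.quantile(0.5)
running_total = views.cumsum().tolist()  # cumulative sum
step_changes = views.diff()              # difference from the previous element (first is NaN)

# Mean absolute deviation from the mean, computed manually
mean_abs_dev = (views - views.mean()).abs().mean()

print(label_of_max, position_of_max, median)
print(running_total)
print(mean_abs_dev)
```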

## Boolean Indexing

### What is the VDO with the most views?
We can use the same technique to find the highest view count.  To find out which video it belongs to, we will use filtering with boolean indexing.

In [None]:
vdo_df.views.max()

225211923

**Boolean Indexing**<br>
Applying a comparison operator to a Series creates a new *boolean* Series containing the result for each element.  This boolean Series can then be used to *select* the rows where it is True.

In [None]:
vdo_df.views

0          748374
1         2418783
2         3191434
3          343168
4         2095731
           ...   
40944     1685609
40945     1064798
40946     1066451
40947     5660813
40948    10306119
Name: views, Length: 40901, dtype: int64

In [None]:
vdo_df.views > 3000000

0        False
1        False
2         True
3        False
4        False
         ...  
40944    False
40945    False
40946    False
40947     True
40948     True
Name: views, Length: 40901, dtype: bool

In [None]:
vdo_df[vdo_df.views > 3000000]

Unnamed: 0,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description
2,5qpjK5DgCt4,17.14.11,"Racist Superman | Rudy Mancuso, King Bach & Le...",Rudy Mancuso,23,2017-11-12T19:05:24.000Z,"racist superman|""rudy""|""mancuso""|""king""|""bach""...",3191434,146033,5339,8181,https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg,False,False,False,WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► http...
32,n1WpP7iowLc,17.14.11,Eminem - Walk On Water (Audio) ft. Beyoncé,EminemVEVO,10,2017-11-10T17:00:03.000Z,"Eminem|""Walk""|""On""|""Water""|""Aftermath/Shady/In...",17158531,787419,43420,125882,https://i.ytimg.com/vi/n1WpP7iowLc/default.jpg,False,False,False,Eminem's new track Walk on Water ft. Beyoncé i...
53,9t9u_yPEidY,17.14.11,"Jennifer Lopez - Amor, Amor, Amor (Official Vi...",JenniferLopezVEVO,10,2017-11-10T15:00:00.000Z,"Jennifer Lopez ft. Wisin|""Jennifer Lopez ft. W...",9548677,190083,15015,11473,https://i.ytimg.com/vi/9t9u_yPEidY/default.jpg,False,False,False,"Jennifer Lopez ft. Wisin - Amor, Amor, Amor (O..."
67,t4YAyT4ihIQ,17.14.11,Getting My Driver's License | Lele Pons,Lele Pons,23,2017-11-10T18:30:01.000Z,"getting my drivers license|""lele""|""pons""|""gett...",3358068,120876,8279,6408,https://i.ytimg.com/vi/t4YAyT4ihIQ/default.jpg,False,False,False,WATCH MY PREVIOUS VIDEO ▶ https://youtu.be/T8j...
69,Jw1Y-zhQURU,17.14.11,John Lewis Christmas Ad 2017 - #MozTheMonster,John Lewis,26,2017-11-10T07:38:29.000Z,"christmas|""john lewis christmas""|""john lewis""|...",7224515,55681,10247,9479,https://i.ytimg.com/vi/Jw1Y-zhQURU/default.jpg,False,False,False,Click here to continue the story and make your...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
40941,7UoP9ABJXGE,18.14.06,Dan + Shay - Speechless (Wedding Video),Dan And Shay,10,2018-05-18T04:04:58.000Z,"wedding video|""heartfelt wedding video""|""emoti...",5534278,45128,1591,806,https://i.ytimg.com/vi/7UoP9ABJXGE/default.jpg,False,False,False,Stream + Download:https://wmna.sh/speechlessht...
40942,ju_inUnrLc4,18.14.06,Fifth Harmony - Don't Say You Love Me,FifthHarmonyVEVO,10,2018-05-18T07:00:08.000Z,"fifth hamony|""harmonizers""|""lauren""|""ally""|""no...",23502572,676467,15993,52432,https://i.ytimg.com/vi/ju_inUnrLc4/default.jpg,False,False,False,Fifth Harmony available at iTunes http://smart...
40943,1PhPYr_9zRY,18.14.06,BTS Plays With Puppies While Answering Fan Que...,BuzzFeed Celeb,22,2018-05-18T16:39:29.000Z,"BuzzFeed|""BuzzFeedVideo""|""Puppy Interview""|""pu...",8259128,645888,4052,62610,https://i.ytimg.com/vi/1PhPYr_9zRY/default.jpg,False,False,False,"BTS with the PPS, the puppies. These adorable ..."
40947,oV0zkMe1K8s,18.14.06,How Black Panther Should Have Ended,How It Should Have Ended,1,2018-05-17T17:00:04.000Z,"Black Panther|""HISHE""|""Marvel""|""Infinity War""|...",5660813,192957,2846,13088,https://i.ytimg.com/vi/oV0zkMe1K8s/default.jpg,False,False,False,How Black Panther Should Have EndedWatch More ...


In [None]:
vdo_df[vdo_df.views > 3000000][['title', 'views']]

Unnamed: 0,title,views
2,"Racist Superman | Rudy Mancuso, King Bach & Le...",3191434
32,Eminem - Walk On Water (Audio) ft. Beyoncé,17158531
53,"Jennifer Lopez - Amor, Amor, Amor (Official Vi...",9548677
67,Getting My Driver's License | Lele Pons,3358068
69,John Lewis Christmas Ad 2017 - #MozTheMonster,7224515
...,...,...
40941,Dan + Shay - Speechless (Wedding Video),5534278
40942,Fifth Harmony - Don't Say You Love Me,23502572
40943,BTS Plays With Puppies While Answering Fan Que...,8259128
40947,How Black Panther Should Have Ended,5660813


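Multiple conditions can be combined with `&` (and), `|` (or), and `~` (not).  The parentheses around each condition are essential, because `&` and `|` bind more tightly than comparison operators such as `>`.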

In [None]:
vdo_df[(vdo_df.views > 3000000) & (vdo_df.likes > 100000)]

Unnamed: 0,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description
2,5qpjK5DgCt4,17.14.11,"Racist Superman | Rudy Mancuso, King Bach & Le...",Rudy Mancuso,23,2017-11-12T19:05:24.000Z,"racist superman|""rudy""|""mancuso""|""king""|""bach""...",3191434,146033,5339,8181,https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg,False,False,False,WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► http...
32,n1WpP7iowLc,17.14.11,Eminem - Walk On Water (Audio) ft. Beyoncé,EminemVEVO,10,2017-11-10T17:00:03.000Z,"Eminem|""Walk""|""On""|""Water""|""Aftermath/Shady/In...",17158531,787419,43420,125882,https://i.ytimg.com/vi/n1WpP7iowLc/default.jpg,False,False,False,Eminem's new track Walk on Water ft. Beyoncé i...
53,9t9u_yPEidY,17.14.11,"Jennifer Lopez - Amor, Amor, Amor (Official Vi...",JenniferLopezVEVO,10,2017-11-10T15:00:00.000Z,"Jennifer Lopez ft. Wisin|""Jennifer Lopez ft. W...",9548677,190083,15015,11473,https://i.ytimg.com/vi/9t9u_yPEidY/default.jpg,False,False,False,"Jennifer Lopez ft. Wisin - Amor, Amor, Amor (O..."
67,t4YAyT4ihIQ,17.14.11,Getting My Driver's License | Lele Pons,Lele Pons,23,2017-11-10T18:30:01.000Z,"getting my drivers license|""lele""|""pons""|""gett...",3358068,120876,8279,6408,https://i.ytimg.com/vi/t4YAyT4ihIQ/default.jpg,False,False,False,WATCH MY PREVIOUS VIDEO ▶ https://youtu.be/T8j...
70,2Vv-BfVoq4g,17.14.11,Ed Sheeran - Perfect (Official Music Video),Ed Sheeran,10,2017-11-09T11:04:14.000Z,"edsheeran|""ed sheeran""|""acoustic""|""live""|""cove...",33523622,1634124,21082,85067,https://i.ytimg.com/vi/2Vv-BfVoq4g/default.jpg,False,False,False,🎧: https://ad.gt/yt-perfect\n💰: https://atlant...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
40939,pcJo0tIWybY,18.14.06,SZA - Garden (Say It Like Dat) (Official Video),SZAVEVO,10,2018-05-18T14:00:04.000Z,"Garden (Say It Like Dat)|""R&B""|""SZA""|""Top Dawg...",6004782,210802,4166,15169,https://i.ytimg.com/vi/pcJo0tIWybY/default.jpg,False,False,False,SZA's CTRL available on:Apple Music - http://s...
40942,ju_inUnrLc4,18.14.06,Fifth Harmony - Don't Say You Love Me,FifthHarmonyVEVO,10,2018-05-18T07:00:08.000Z,"fifth hamony|""harmonizers""|""lauren""|""ally""|""no...",23502572,676467,15993,52432,https://i.ytimg.com/vi/ju_inUnrLc4/default.jpg,False,False,False,Fifth Harmony available at iTunes http://smart...
40943,1PhPYr_9zRY,18.14.06,BTS Plays With Puppies While Answering Fan Que...,BuzzFeed Celeb,22,2018-05-18T16:39:29.000Z,"BuzzFeed|""BuzzFeedVideo""|""Puppy Interview""|""pu...",8259128,645888,4052,62610,https://i.ytimg.com/vi/1PhPYr_9zRY/default.jpg,False,False,False,"BTS with the PPS, the puppies. These adorable ..."
40947,oV0zkMe1K8s,18.14.06,How Black Panther Should Have Ended,How It Should Have Ended,1,2018-05-17T17:00:04.000Z,"Black Panther|""HISHE""|""Marvel""|""Infinity War""|...",5660813,192957,2846,13088,https://i.ytimg.com/vi/oV0zkMe1K8s/default.jpg,False,False,False,How Black Panther Should Have EndedWatch More ...


In [None]:
(vdo_df.views > 3000000)

0        False
1        False
2         True
3        False
4        False
         ...  
40944    False
40945    False
40946    False
40947     True
40948     True
Name: views, Length: 40901, dtype: bool

In [None]:
(vdo_df.likes > 100000)

0        False
1        False
2         True
3        False
4         True
         ...  
40944    False
40945    False
40946    False
40947     True
40948     True
Name: likes, Length: 40901, dtype: bool

Notice row #4: it has more than 100000 likes but fewer than 3000000 views.  It fails the first condition, so it will not be included in the results.

With complex conditions, think of each comparison as a vectorized operation that produces a boolean Series, and of the logical operators as combining those Series element by element.

In [None]:
(vdo_df.views > 3000000) & (vdo_df.likes > 100000)

0        False
1        False
2         True
3        False
4        False
         ...  
40944    False
40945    False
40946    False
40947     True
40948     True
Length: 40901, dtype: bool
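
The element-wise combination can be checked by hand on a small table.  The sketch below uses a hypothetical `toy_df` and adds `|` (or) and `~` (not), which the cells above do not show.

```python
import pandas as pd

# Hypothetical toy data
toy_df = pd.DataFrame({
    'views': [5, 50, 500, 5000],
    'likes': [1, 40, 10, 900],
})

many_views = toy_df.views > 100   # False, False, True, True
many_likes = toy_df.likes > 20    # False, True, False, True

both = toy_df[many_views & many_likes]          # rows where both masks are True
either = toy_df[many_views | many_likes]        # rows where at least one mask is True
neither = toy_df[~(many_views | many_likes)]    # rows where neither mask is True

print(both.index.tolist(), either.index.tolist(), neither.index.tolist())
```

Python's `and`, `or`, and `not` keywords do not work here: they try to reduce a whole Series to a single truth value, which raises a `ValueError`.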

Back to our original question: what is the VDO with the most views?

In [None]:
vdo_df[vdo_df.views == vdo_df.views.max()]

Unnamed: 0,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description
38547,VYOjWnS4cMY,18.02.06,Childish Gambino - This Is America (Official V...,ChildishGambinoVEVO,10,2018-05-06T04:00:07.000Z,"Childish Gambino|""Rap""|""This Is America""|""mcDJ...",225211923,5023450,343541,517232,https://i.ytimg.com/vi/VYOjWnS4cMY/default.jpg,False,False,False,“This is America” by Childish Gambino http://s...


We can create a boolean series from the comparison and use it for boolean indexing.

In [None]:
most_view_filter = (vdo_df.views == vdo_df.views.max())

In [None]:
type(most_view_filter)

pandas.core.series.Series

In [None]:
vdo_df[most_view_filter]

Unnamed: 0,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description
38547,VYOjWnS4cMY,18.02.06,Childish Gambino - This Is America (Official V...,ChildishGambinoVEVO,10,2018-05-06T04:00:07.000Z,"Childish Gambino|""Rap""|""This Is America""|""mcDJ...",225211923,5023450,343541,517232,https://i.ytimg.com/vi/VYOjWnS4cMY/default.jpg,False,False,False,“This is America” by Childish Gambino http://s...
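
Two alternatives to the equality filter are sketched below on a hypothetical `toy_df`: `idxmax` with *loc*, and `nlargest`.  They are not used elsewhere in this notebook, and they differ from the filter in how ties are handled.

```python
import pandas as pd

# Hypothetical toy data with a tie for the maximum (rows 1 and 2)
toy_df = pd.DataFrame({
    'title': ['a', 'b', 'c', 'd'],
    'views': [10, 40, 40, 20],
})

# Equality filter: keeps every row that reaches the maximum
all_top = toy_df[toy_df.views == toy_df.views.max()]

# idxmax returns the label of the first maximum only; loc then returns that row as a Series
first_top = toy_df.loc[toy_df.views.idxmax()]

# nlargest returns the n rows with the largest values in the given column
top2 = toy_df.nlargest(2, 'views')

print(all_top.index.tolist())
print(first_top['title'])
print(top2['views'].tolist())
```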


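Boolean indexing also works with string methods and other operations that return a boolean Series.  Note that `str.contains` is case-sensitive by default and matches substrings, so 'AI' also matches titles such as 'AFFAIRS'.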

In [None]:
vdo_df[vdo_df.title.str.contains('AI')]

Unnamed: 0,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description
44,STI2fI7sKMo,17.14.11,"AFFAIRS, EX BOYFRIENDS, $18MILLION NET WORTH -...",Shawn Johnson East,22,2017-11-11T15:00:03.000Z,"shawn johnson|""andrew east""|""shawn east""|""shaw...",321053,4451,1772,895,https://i.ytimg.com/vi/STI2fI7sKMo/default.jpg,False,False,False,Subscribe for weekly videos ▶ http://bit.ly/sj...
96,2XK4omx9uMU,17.14.11,Camila Cabello COMPLETELY NAILS 'Finish The Ly...,Capital FM,10,2017-11-10T14:40:32.000Z,"capitalfmofficial|""capital""|""capital fm""|""capi...",836544,40195,373,976,https://i.ytimg.com/vi/2XK4omx9uMU/default.jpg,False,False,False,It shouldn't be surprising that Camila Cabello...
189,o78x918zbFk,17.14.11,TOTAL FAIL! NATASHA DENONA HOLIDAY WTF,Tati,26,2017-11-08T18:00:05.000Z,"YouTube|""Beauty""|""Makeup""|""Tutorial""|""Review""|...",1277364,56867,2148,25326,https://i.ytimg.com/vi/o78x918zbFk/default.jpg,False,False,False,This was the most UNEXPECTED WTF I've done so ...
349,2XK4omx9uMU,17.15.11,Camila Cabello COMPLETELY NAILS 'Finish The Ly...,Capital FM,10,2017-11-10T14:40:32.000Z,"capitalfmofficial|""capital""|""capital fm""|""capi...",1126501,48219,444,1083,https://i.ytimg.com/vi/2XK4omx9uMU/default.jpg,False,False,False,It shouldn't be surprising that Camila Cabello...
590,2XK4omx9uMU,17.16.11,Camila Cabello COMPLETELY NAILS 'Finish The Ly...,Capital FM,10,2017-11-10T14:40:32.000Z,"capitalfmofficial|""capital""|""capital fm""|""capi...",1354030,57838,512,1257,https://i.ytimg.com/vi/2XK4omx9uMU/default.jpg,False,False,False,It shouldn't be surprising that Camila Cabello...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
40539,D6Oy4LfoqsU,18.12.06,I GAVE SAFIYA NYGAARD A PERFECT HAIR MAKEOVER ...,Brad Mondo,24,2018-05-18T17:34:22.000Z,I gave safiya nygaard a perfect hair makeover ...,1058805,47810,1029,3972,https://i.ytimg.com/vi/D6Oy4LfoqsU/default.jpg,False,False,False,I had so much fun transforming Safiyas hair in...
40591,SkcucKDrbOI,18.13.06,HOW TO TRAIN YOUR DRAGON: THE HIDDEN WORLD | O...,DreamWorksTV,24,2018-06-07T14:30:03.000Z,"DreamWorksTV|""DreamWorks Animation""|""YouTube K...",5776332,99769,2300,14452,https://i.ytimg.com/vi/SkcucKDrbOI/default.jpg,False,False,False,Website: https://www.howtotrainyourdragon.comF...
40740,D6Oy4LfoqsU,18.13.06,I GAVE SAFIYA NYGAARD A PERFECT HAIR MAKEOVER ...,Brad Mondo,24,2018-05-18T17:34:22.000Z,I gave safiya nygaard a perfect hair makeover ...,1062709,47952,1031,3981,https://i.ytimg.com/vi/D6Oy4LfoqsU/default.jpg,False,False,False,I had so much fun transforming Safiyas hair in...
40804,SkcucKDrbOI,18.14.06,HOW TO TRAIN YOUR DRAGON: THE HIDDEN WORLD | O...,DreamWorksTV,24,2018-06-07T14:30:03.000Z,"DreamWorksTV|""DreamWorks Animation""|""YouTube K...",5962519,100761,2331,14578,https://i.ytimg.com/vi/SkcucKDrbOI/default.jpg,False,False,False,Website: https://www.howtotrainyourdragon.comF...


## Vectorized Calculation
When we perform a mathematical operation on a Series, the operation is applied to every element at once, without writing an explicit loop.

In [None]:
vdo_df.likes

0         57527
1         97185
2        146033
3         10172
4        132235
          ...  
40944     38160
40945     60008
40946     48068
40947    192957
40948    357079
Name: likes, Length: 40901, dtype: int64

In [None]:
vdo_df.likes / 10

0         5752.7
1         9718.5
2        14603.3
3         1017.2
4        13223.5
          ...   
40944     3816.0
40945     6000.8
40946     4806.8
40947    19295.7
40948    35707.9
Name: likes, Length: 40901, dtype: float64
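
As a minimal sketch of the same idea on toy data (the numbers below are made up for illustration and are not taken from the dataset), a scalar operand is applied to every element, while two Series are combined element by element after aligning on their index labels:

```python
import pandas as pd

# Toy stand-ins for the likes and views columns (illustrative values only).
likes = pd.Series([10, 20, 30])
views = pd.Series([100, 400, 300])

# A scalar operand is applied to every element of the Series.
likes_scaled = likes / 10

# Two Series are combined element by element, aligned on their index labels.
ratio = likes / views

print(likes_scaled.tolist())  # [1.0, 2.0, 3.0]
print(ratio.tolist())         # [0.1, 0.05, 0.1]
```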

### What is the highest like-per-view ratio?

In [None]:
vdo_df.likes / vdo_df.views

0        0.076869
1        0.040179
2        0.045758
3        0.029641
4        0.063097
           ...   
40944    0.022639
40945    0.056356
40946    0.045073
40947    0.034086
40948    0.034647
Length: 40901, dtype: float64

In [None]:
vdo_df['lpv_ratio'] = vdo_df.likes / vdo_df.views

In [None]:
vdo_df[['likes', 'views', 'lpv_ratio']]

Unnamed: 0,likes,views,lpv_ratio
0,57527,748374,0.076869
1,97185,2418783,0.040179
2,146033,3191434,0.045758
3,10172,343168,0.029641
4,132235,2095731,0.063097
...,...,...,...
40944,38160,1685609,0.022639
40945,60008,1064798,0.056356
40946,48068,1066451,0.045073
40947,192957,5660813,0.034086


In [None]:
vdo_df[vdo_df.lpv_ratio == vdo_df.lpv_ratio.max()][['title', 'likes', 'views', 'lpv_ratio']]

Unnamed: 0,title,likes,views,lpv_ratio
10200,Bruno Mars - Finesse (Remix) [Feat. Cardi B] [...,159356,548621,0.290466


We can also sort by *lpv_ratio* in descending order; the video with the highest ratio then appears in the first row.

In [None]:
vdo_df.sort_values(by=['lpv_ratio'],ascending=False)[['title', 'likes', 'views', 'lpv_ratio']]

Unnamed: 0,title,likes,views,lpv_ratio
10200,Bruno Mars - Finesse (Remix) [Feat. Cardi B] [...,159356,548621,0.290466
608,"Luis Fonsi, Demi Lovato - Échame La Culpa",135292,499946,0.270613
22174,j-hope 'Airplane' MV,1401915,5275672,0.265732
14428,dodie - Secret For The Mad,32755,129130,0.253659
5025,Louis Tomlinson - Miss You (Official Video),241679,985998,0.245111
...,...,...,...,...
3401,The New Snapchat in 60 Seconds,0,1894443,0.000000
14497,Why Are Fat People a Joke?,0,272163,0.000000
19762,Paris Hilton - “I Need You” (Official Music Vi...,0,352319,0.000000
16854,T-Mobile | #LittleOnes | 2018 Big Game Ad,0,14949494,0.000000
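
When only the top rows are needed, `idxmax` and `nlargest` are possible alternatives to boolean filtering or a full sort. The sketch below uses a small made-up DataFrame (`toy` and its values are assumptions for illustration, not part of the dataset):

```python
import pandas as pd

# Hypothetical toy data with the same column names as the tutorial.
toy = pd.DataFrame({
    'title': ['a', 'b', 'c'],
    'likes': [5, 30, 10],
    'views': [100, 200, 400],
})
toy['lpv_ratio'] = toy.likes / toy.views

# idxmax returns the index label of the first maximum value.
best_label = toy.lpv_ratio.idxmax()
best_row = toy.loc[best_label]

# nlargest returns the n rows with the largest values, in descending order.
top2 = toy.nlargest(2, 'lpv_ratio')

print(best_row['title'])     # b
print(top2.title.tolist())   # ['b', 'a']
```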


### String Manipulation
Python has long been a popular raw data manipulation language in part due to its ease of use for string and text processing. Most text operations are made simple with the string object’s built-in methods. For more complex pattern matching and text manipulations, regular expressions may be needed. pandas adds to the mix by enabling you to apply string and regular expressions concisely on whole arrays of data, additionally handling the annoyance of missing data.

In [None]:
vdo_df.title

0                       WE WANT TO TALK ABOUT OUR MARRIAGE
1        The Trump Presidency: Last Week Tonight with J...
2        Racist Superman | Rudy Mancuso, King Bach & Le...
3                         Nickelback Lyrics: Real or Fake?
4                                 I Dare You: GOING BALD!?
                               ...                        
40944                         The Cat Who Caught the Laser
40945                           True Facts : Ant Mutualism
40946    I GAVE SAFIYA NYGAARD A PERFECT HAIR MAKEOVER ...
40947                  How Black Panther Should Have Ended
40948    Official Call of Duty®: Black Ops 4 — Multipla...
Name: title, Length: 40901, dtype: object

In [None]:
vdo_df.title.str.lower()

0                       we want to talk about our marriage
1        the trump presidency: last week tonight with j...
2        racist superman | rudy mancuso, king bach & le...
3                         nickelback lyrics: real or fake?
4                                 i dare you: going bald!?
                               ...                        
40944                         the cat who caught the laser
40945                           true facts : ant mutualism
40946    i gave safiya nygaard a perfect hair makeover ...
40947                  how black panther should have ended
40948    official call of duty®: black ops 4 — multipla...
Name: title, Length: 40901, dtype: object

**Partial listing of vectorized string methods**

- **cat** Concatenate strings element-wise with optional delimiter
- **contains** Return boolean array if each string contains pattern/regex
- **count** Count occurrences of pattern
- **extract** Use a regular expression with groups to extract one or more strings from a Series of strings; the result will be a DataFrame with one column per group
- **endswith** Equivalent to x.endswith(pattern) for each element
- **startswith** Equivalent to x.startswith(pattern) for each element
- **findall** Compute list of all occurrences of pattern/regex for each string
- **get** Index into each element (retrieve i-th element)
- **isalnum** Equivalent to built-in str.isalnum
- **isalpha** Equivalent to built-in str.isalpha
- **isdecimal** Equivalent to built-in str.isdecimal
- **isdigit** Equivalent to built-in str.isdigit
- **islower** Equivalent to built-in str.islower
- **isnumeric** Equivalent to built-in str.isnumeric
- **isupper** Equivalent to built-in str.isupper
- **join** Join strings in each element of the Series with passed separator
- **len** Compute length of each string
- **lower, upper** Convert cases; equivalent to x.lower() or x.upper() for each element
- **match** Use re.match with the passed regular expression on each element, returning a boolean that indicates whether the start of each string matches
- **pad** Add whitespace to left, right, or both sides of strings
- **center** Equivalent to pad(side='both')
- **repeat** Duplicate values (e.g., s.str.repeat(3) is equivalent to x * 3 for each string)
- **replace** Replace occurrences of pattern/regex with some other string
- **slice** Slice each string in the Series
- **split** Split strings on delimiter or regular expression
- **strip** Trim whitespace from both sides, including newlines
- **rstrip** Trim whitespace on right side
- **lstrip** Trim whitespace on left side
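
A few of these methods can be tried on a small Series. This is only a sketch on invented titles (the strings and variable names are assumptions), including one missing value to show that the `.str` methods skip over it instead of raising an error:

```python
import pandas as pd

# Invented titles; the last entry is missing on purpose.
titles = pd.Series(['Official Trailer (2018)', 'how to cook RICE', None])

# Case-insensitive substring test; na=False treats missing values as "no match".
has_official = titles.str.contains('official', case=False, na=False)

# Length of each string; the missing value stays missing.
lengths = titles.str.len()

# Extract a four-digit year in parentheses; expand=False returns a Series
# because the pattern has a single capture group.
years = titles.str.extract(r'\((\d{4})\)', expand=False)

print(has_official.tolist())  # [True, False, False]
print(years.iloc[0])          # 2018
```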

## Data Transformation
On many occasions, we have to transform the data before we can get the results we want.  Common transformations include:
- Mapping and functional transformation
- Discretization and Binning
- Datetime transformation

In [None]:
vdo_df.head()

Unnamed: 0,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description,lpv_ratio
0,2kyS6SvSYSE,17.14.11,WE WANT TO TALK ABOUT OUR MARRIAGE,CaseyNeistat,22,2017-11-13T17:13:01.000Z,SHANtell martin,748374,57527,2966,15954,https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg,False,False,False,SHANTELL'S CHANNEL - https://www.youtube.com/s...,0.076869
1,1ZAPwfrtAFY,17.14.11,The Trump Presidency: Last Week Tonight with J...,LastWeekTonight,24,2017-11-13T07:30:00.000Z,"last week tonight trump presidency|""last week ...",2418783,97185,6146,12703,https://i.ytimg.com/vi/1ZAPwfrtAFY/default.jpg,False,False,False,"One year after the presidential election, John...",0.040179
2,5qpjK5DgCt4,17.14.11,"Racist Superman | Rudy Mancuso, King Bach & Le...",Rudy Mancuso,23,2017-11-12T19:05:24.000Z,"racist superman|""rudy""|""mancuso""|""king""|""bach""...",3191434,146033,5339,8181,https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg,False,False,False,WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► http...,0.045758
3,puqaWrEC7tY,17.14.11,Nickelback Lyrics: Real or Fake?,Good Mythical Morning,24,2017-11-13T11:00:04.000Z,"rhett and link|""gmm""|""good mythical morning""|""...",343168,10172,666,2146,https://i.ytimg.com/vi/puqaWrEC7tY/default.jpg,False,False,False,Today we find out if Link is a Nickelback amat...,0.029641
4,d380meD0W0M,17.14.11,I Dare You: GOING BALD!?,nigahiga,24,2017-11-12T18:01:41.000Z,"ryan|""higa""|""higatv""|""nigahiga""|""i dare you""|""...",2095731,132235,1989,17518,https://i.ytimg.com/vi/d380meD0W0M/default.jpg,False,False,False,I know it's been a while since we did this sho...,0.063097


### Mapping
We can map values to more useful labels with the *map* function.  If we supply a dictionary, each value is looked up in it, and values without a matching key become NaN.  If we supply a function, it is called on each value.

In [None]:
category_mapping = {
    22: 'People & Blogs',
    24: 'Entertainment',
}

In [None]:
vdo_df.category_id.map(category_mapping)

0        People & Blogs
1         Entertainment
2                   NaN
3         Entertainment
4         Entertainment
              ...      
40944               NaN
40945    People & Blogs
40946     Entertainment
40947               NaN
40948               NaN
Name: category_id, Length: 40901, dtype: object

In [None]:
vdo_df.likes

0         57527
1         97185
2        146033
3         10172
4        132235
          ...  
40944     38160
40945     60008
40946     48068
40947    192957
40948    357079
Name: likes, Length: 40901, dtype: int64

In [None]:
vdo_df.likes.map(lambda x: 'love' if x > 100000 else 'hate')

0        hate
1        hate
2        love
3        hate
4        love
         ... 
40944    hate
40945    hate
40946    hate
40947    love
40948    love
Name: likes, Length: 40901, dtype: object
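
The sketch below repeats both forms of *map* on toy data and shows one possible way to handle values that are missing from the dictionary. The label `'Other'` and the variable names are assumptions for illustration only:

```python
import pandas as pd

category_id = pd.Series([22, 24, 10])

# Partial mapping; category 10 is deliberately left out.
labels = {22: 'People & Blogs', 24: 'Entertainment'}

# Dictionary form: ids without a matching key become NaN.
mapped = category_id.map(labels)

# One option is to replace the NaN values with a catch-all label.
mapped_filled = mapped.fillna('Other')

# Function form: the function is called once per value.
is_popular = pd.Series([50, 150000]).map(lambda x: x > 100000)

print(mapped_filled.tolist())  # ['People & Blogs', 'Entertainment', 'Other']
print(is_popular.tolist())     # [False, True]
```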

### What are the most frequently used tags?

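**map** is a one-to-one function: each input value produces exactly one output value, so the number of output rows always equals the number of input rows.  To get a one-to-many result, we can use **apply** to split each tag string into a list, and then **explode** to expand every list element into its own row.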

In [None]:
vdo_df.tags

0                                          SHANtell martin
1        last week tonight trump presidency|"last week ...
2        racist superman|"rudy"|"mancuso"|"king"|"bach"...
3        rhett and link|"gmm"|"good mythical morning"|"...
4        ryan|"higa"|"higatv"|"nigahiga"|"i dare you"|"...
                               ...                        
40944    aarons animals|"aarons"|"animals"|"cat"|"cats"...
40945                                               [none]
40946    I gave safiya nygaard a perfect hair makeover ...
40947    Black Panther|"HISHE"|"Marvel"|"Infinity War"|...
40948        call of duty|"cod"|"activision"|"Black Ops 4"
Name: tags, Length: 40901, dtype: object

In [None]:
tags_split = vdo_df.tags.apply(lambda x: x.split('|'))

In [None]:
tags_split

0                                        [SHANtell martin]
1        [last week tonight trump presidency, "last wee...
2        [racist superman, "rudy", "mancuso", "king", "...
3        [rhett and link, "gmm", "good mythical morning...
4        [ryan, "higa", "higatv", "nigahiga", "i dare y...
                               ...                        
40944    [aarons animals, "aarons", "animals", "cat", "...
40945                                             [[none]]
40946    [I gave safiya nygaard a perfect hair makeover...
40947    [Black Panther, "HISHE", "Marvel", "Infinity W...
40948    [call of duty, "cod", "activision", "Black Ops...
Name: tags, Length: 40901, dtype: object

In [None]:
tags = tags_split.explode()
tags

0                           SHANtell martin
1        last week tonight trump presidency
1          "last week tonight donald trump"
1                       "john oliver trump"
1                            "donald trump"
                        ...                
40947                    "ending explained"
40948                          call of duty
40948                                 "cod"
40948                          "activision"
40948                         "Black Ops 4"
Name: tags, Length: 807062, dtype: object

Let's clean some punctuation before counting values.

In [None]:
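# regex=True is required in recent pandas versions for the pattern to be treated as a regular expression
tags = tags.str.strip().str.replace(r'[\"\'\.]', ' ', regex=True)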
tags

0                           SHANtell martin
1        last week tonight trump presidency
1           last week tonight donald trump 
1                        john oliver trump 
1                             donald trump 
                        ...                
40947                     ending explained 
40948                          call of duty
40948                                  cod 
40948                           activision 
40948                          Black Ops 4 
Name: tags, Length: 807062, dtype: object

In [None]:
tags.unique()

array(['SHANtell martin', 'last week tonight trump presidency',
       ' last week tonight donald trump ', ..., ' best hamburger ',
       ' langford ', ' katherine langford '], dtype=object)

In [None]:
len(tags.unique())

58121

In [None]:
tags.value_counts()

 funny                         3578
 comedy                        2860
 how to                        1558
[none]                         1534
 Pop                           1271
                               ... 
 this is us 2x14 preview          1
 plead the fifth wwhl             1
blade                             1
 crosses up wesley johnson        1
 Scripts                          1
Name: tags, Length: 58121, dtype: int64
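
The counts above still contain the `[none]` placeholder and tags padded with spaces. The following sketch is a variation on the steps above, not the method used in this notebook: it drops the placeholder, strips the surrounding quotes instead of replacing them with spaces, and lowercases the tags before counting. The sample strings are invented.

```python
import pandas as pd

# Invented tag strings in the same pipe-separated layout as the dataset.
tags_raw = pd.Series(['cat|"funny"|"cute cat"', 'funny|"dog"', '[none]'])

tag_counts = (
    tags_raw[tags_raw != '[none]']   # drop the placeholder for videos without tags
    .str.split('|')                  # one list of tags per video
    .explode()                       # one row per tag; the index label repeats per video
    .str.strip('"')                  # remove the surrounding double quotes
    .str.lower()                     # count 'Funny' and 'funny' together
    .value_counts()
)

print(tag_counts['funny'])  # 2
```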

### Discretization and Binning
Continuous data is often discretized or separated into *bins* for analysis.

In [None]:
views_range = [0, 1000000, 5000000, 10000000]
bin_names = ['some views', 'more views', 'lots of views']
vdo_df['views_level'] = pd.cut(vdo_df.views, views_range, labels=bin_names)

In [None]:
vdo_df[['title', 'views', 'views_level']]

Unnamed: 0,title,views,views_level
0,WE WANT TO TALK ABOUT OUR MARRIAGE,748374,some views
1,The Trump Presidency: Last Week Tonight with J...,2418783,more views
2,"Racist Superman | Rudy Mancuso, King Bach & Le...",3191434,more views
3,Nickelback Lyrics: Real or Fake?,343168,some views
4,I Dare You: GOING BALD!?,2095731,more views
...,...,...,...
40944,The Cat Who Caught the Laser,1685609,more views
40945,True Facts : Ant Mutualism,1064798,more views
40946,I GAVE SAFIYA NYGAARD A PERFECT HAIR MAKEOVER ...,1066451,more views
40947,How Black Panther Should Have Ended,5660813,lots of views
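
Note that `pd.cut` assigns NaN to values that fall outside every bin, so with the edges above any video with more than 10,000,000 views gets no label. A possible adjustment, sketched here on invented view counts, is to use `np.inf` as the last edge so the top bin is open-ended:

```python
import numpy as np
import pandas as pd

# Invented view counts; the last one exceeds the 10,000,000 upper edge.
views = pd.Series([500_000, 2_000_000, 7_000_000, 50_000_000])
bin_names = ['some views', 'more views', 'lots of views']

# Finite upper edge: values above 10,000,000 fall outside every bin and become NaN.
capped = pd.cut(views, [0, 1_000_000, 5_000_000, 10_000_000], labels=bin_names)

# np.inf as the last edge keeps the top bin open-ended.
open_ended = pd.cut(views, [0, 1_000_000, 5_000_000, np.inf], labels=bin_names)

print(capped.isna().tolist())            # [False, False, False, True]
print(open_ended.astype(str).tolist())
```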


### Datetime Transformation

In [None]:
vdo_df.publish_time

0        2017-11-13T17:13:01.000Z
1        2017-11-13T07:30:00.000Z
2        2017-11-12T19:05:24.000Z
3        2017-11-13T11:00:04.000Z
4        2017-11-12T18:01:41.000Z
                   ...           
40944    2018-05-18T13:00:04.000Z
40945    2018-05-18T01:00:06.000Z
40946    2018-05-18T17:34:22.000Z
40947    2018-05-17T17:00:04.000Z
40948    2018-05-17T17:09:38.000Z
Name: publish_time, Length: 40901, dtype: object

In [None]:
pd.to_datetime(vdo_df.publish_time)

0       2017-11-13 17:13:01+00:00
1       2017-11-13 07:30:00+00:00
2       2017-11-12 19:05:24+00:00
3       2017-11-13 11:00:04+00:00
4       2017-11-12 18:01:41+00:00
                   ...           
40944   2018-05-18 13:00:04+00:00
40945   2018-05-18 01:00:06+00:00
40946   2018-05-18 17:34:22+00:00
40947   2018-05-17 17:00:04+00:00
40948   2018-05-17 17:09:38+00:00
Name: publish_time, Length: 40901, dtype: datetime64[ns, UTC]
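
Once a column holds datetimes, its components are available through the `.dt` accessor. A small sketch on two timestamps copied from the output above (the variable names are illustrative):

```python
import pandas as pd

publish_time = pd.Series(['2017-11-13T17:13:01.000Z', '2018-05-18T01:00:06.000Z'])

# The trailing 'Z' marks UTC, so the result is timezone-aware.
publish_dt = pd.to_datetime(publish_time)

years = publish_dt.dt.year
hours = publish_dt.dt.hour
weekdays = publish_dt.dt.day_name()

print(years.tolist())     # [2017, 2018]
print(hours.tolist())     # [17, 1]
print(weekdays.tolist())  # ['Monday', 'Friday']
```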

### Which video takes the longest time to start trending?
Let's try to answer this question.  To do so, we need the number of days between *publish_time* and *trending_date*.

In [None]:
vdo_df['publish_dt'] = pd.to_datetime(vdo_df.publish_time)

In [None]:
vdo_df.dtypes

video_id                               object
trending_date                          object
title                                  object
channel_title                          object
category_id                             int64
publish_time                           object
tags                                   object
views                                   int64
likes                                   int64
dislikes                                int64
comment_count                           int64
thumbnail_link                         object
comments_disabled                        bool
ratings_disabled                         bool
video_error_or_removed                   bool
description                            object
lpv_ratio                             float64
views_level                          category
publish_dt                datetime64[ns, UTC]
dtype: object

In [None]:
from datetime import datetime, timezone

In [None]:
may2008 = datetime(2008, 5, 1, tzinfo=timezone.utc)

In [None]:
vdo_df[vdo_df.publish_dt < may2008]

Unnamed: 0,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description,lpv_ratio,views_level,publish_dt
10710,UQtt9I6c-YM,18.06.01,Kramer vs Kramer-Clou Scene,Livia Giustiniani,1,2008-04-05T18:22:40.000Z,"Meryl|""Streep""|""kramer""|""vs""|""dustin""|""hoffman...",49942,46,6,26,https://i.ytimg.com/vi/UQtt9I6c-YM/default.jpg,False,False,False,Poor Meryl...she was really scaredxD,0.000921,some views,2008-04-05 18:22:40+00:00
10921,UQtt9I6c-YM,18.07.01,Kramer vs Kramer-Clou Scene,Livia Giustiniani,1,2008-04-05T18:22:40.000Z,"Meryl|""Streep""|""kramer""|""vs""|""dustin""|""hoffman...",50030,46,6,26,https://i.ytimg.com/vi/UQtt9I6c-YM/default.jpg,False,False,False,Poor Meryl...she was really scaredxD,0.000919,some views,2008-04-05 18:22:40+00:00
11150,UQtt9I6c-YM,18.08.01,Kramer vs Kramer-Clou Scene,Livia Giustiniani,1,2008-04-05T18:22:40.000Z,"Meryl|""Streep""|""kramer""|""vs""|""dustin""|""hoffman...",50117,46,6,26,https://i.ytimg.com/vi/UQtt9I6c-YM/default.jpg,False,False,False,Poor Meryl...she was really scaredxD,0.000918,some views,2008-04-05 18:22:40+00:00
11375,UQtt9I6c-YM,18.09.01,Kramer vs Kramer-Clou Scene,Livia Giustiniani,1,2008-04-05T18:22:40.000Z,"Meryl|""Streep""|""kramer""|""vs""|""dustin""|""hoffman...",50168,46,6,26,https://i.ytimg.com/vi/UQtt9I6c-YM/default.jpg,False,False,False,Poor Meryl...she was really scaredxD,0.000917,some views,2008-04-05 18:22:40+00:00
16294,MJO3FmmFuh4,18.05.02,Budweiser - Original Whazzup? ad,dannotv,24,2006-07-23T08:24:11.000Z,"Budweiser|""Bud""|""Whazzup""|""ad""",258506,459,152,82,https://i.ytimg.com/vi/MJO3FmmFuh4/default.jpg,False,False,False,"Original Whazzup ad - however, there is a litt...",0.001776,some views,2006-07-23 08:24:11+00:00


Now, we have to transform *trending_date* from a string to a datetime.  The format is yy.dd.mm, where yy is the last two digits of the year.

In [None]:
vdo_df.trending_date

0        17.14.11
1        17.14.11
2        17.14.11
3        17.14.11
4        17.14.11
           ...   
40944    18.14.06
40945    18.14.06
40946    18.14.06
40947    18.14.06
40948    18.14.06
Name: trending_date, Length: 40901, dtype: object

In [None]:
# errors='ignore' is deprecated in recent pandas versions; by default, invalid dates raise an error
vdo_df['trending_dt'] = pd.to_datetime(vdo_df.trending_date, format='%y.%d.%m', utc=True)

In [None]:
vdo_df.trending_dt

0       2017-11-14 00:00:00+00:00
1       2017-11-14 00:00:00+00:00
2       2017-11-14 00:00:00+00:00
3       2017-11-14 00:00:00+00:00
4       2017-11-14 00:00:00+00:00
                   ...           
40944   2018-06-14 00:00:00+00:00
40945   2018-06-14 00:00:00+00:00
40946   2018-06-14 00:00:00+00:00
40947   2018-06-14 00:00:00+00:00
40948   2018-06-14 00:00:00+00:00
Name: trending_dt, Length: 40901, dtype: datetime64[ns, UTC]
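
The format string can be checked on a few values. In this sketch, the `'bad value'` entry is invented to show how `errors='coerce'` behaves; it is an assumption that the real data may contain such entries:

```python
import pandas as pd

trending_date = pd.Series(['17.14.11', '18.05.02', 'bad value'])

# %y = two-digit year, %d = day, %m = month, matching the yy.dd.mm layout.
# errors='coerce' turns entries that cannot be parsed into NaT instead of raising.
trending_dt = pd.to_datetime(trending_date, format='%y.%d.%m', errors='coerce', utc=True)

print(trending_dt.iloc[0])  # 2017-11-14 00:00:00+00:00
print(trending_dt.iloc[1])  # 2018-02-05 00:00:00+00:00
```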

In [None]:
# normalize() sets the time to midnight, so the difference is a whole number of days
vdo_df['days_to_trending'] = vdo_df.trending_dt.dt.normalize() - vdo_df.publish_dt.dt.normalize()

In [None]:
vdo_df.days_to_trending

0        1 days
1        1 days
2        2 days
3        1 days
4        2 days
          ...  
40944   27 days
40945   27 days
40946   27 days
40947   28 days
40948   28 days
Name: days_to_trending, Length: 40901, dtype: timedelta64[ns]

In [None]:
vdo_df.days_to_trending.mean()

Timedelta('16 days 19:50:39.857216205')

In [None]:
vdo_df.days_to_trending.max()

Timedelta('4215 days 00:00:00')

In [None]:
vdo_df[vdo_df.days_to_trending == vdo_df.days_to_trending.max()]

Unnamed: 0,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description,lpv_ratio,views_level,publish_dt,trending_dt,days_to_trending
16294,MJO3FmmFuh4,18.05.02,Budweiser - Original Whazzup? ad,dannotv,24,2006-07-23T08:24:11.000Z,"Budweiser|""Bud""|""Whazzup""|""ad""",258506,459,152,82,https://i.ytimg.com/vi/MJO3FmmFuh4/default.jpg,False,False,False,"Original Whazzup ad - however, there is a litt...",0.001776,some views,2006-07-23 08:24:11+00:00,2018-02-05 00:00:00+00:00,4215 days


In [None]:
vdo_df.days_to_trending.describe()

count                          40901
mean      16 days 19:50:39.857216205
std      146 days 02:22:47.782206980
min                  0 days 00:00:00
25%                  3 days 00:00:00
50%                  5 days 00:00:00
75%                  9 days 00:00:00
max               4215 days 00:00:00
Name: days_to_trending, dtype: object
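
If plain integers are easier to work with than Timedelta values, the `.dt.days` accessor converts them. The sketch below recomputes the difference for two rows taken from the outputs above; the variable names are illustrative:

```python
import pandas as pd

publish_dt = pd.to_datetime(pd.Series(['2017-11-13T17:13:01Z', '2006-07-23T08:24:11Z']))
trending_dt = pd.to_datetime(pd.Series(['17.14.11', '18.05.02']), format='%y.%d.%m', utc=True)

# normalize() resets the time to midnight so the difference is in whole days.
delta = trending_dt - publish_dt.dt.normalize()

# .dt.days converts each Timedelta to an integer number of days.
days = delta.dt.days

# Index label of the video that took the longest to trend.
slowest = days.idxmax()

print(days.tolist())  # [1, 4215]
print(slowest)        # 1
```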