# **YouTube Video Recommender**
This project scrapes YouTube videos matching certain keywords ('machine learning', 'data science' and 'kaggle'), processes and labels the data, builds features, and trains a machine learning model to recommend new, relevant YouTube videos on these topics.
###### Observation: in this context, "relevant videos" are defined by the labelling phase, which is explained in the Labelling Phase section.

## Imports and configurations

In [1]:
import pandas as pd
import numpy as np
import re
import time

import requests as rq
import bs4 as bs4

import json
import glob
import tqdm

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from lightgbm import LGBMClassifier

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.preprocessing import MaxAbsScaler, StandardScaler
from sklearn.pipeline import make_pipeline

from scipy.sparse import hstack, vstack
from scipy.sparse import csr_matrix

from skopt import forest_minimize

import joblib as jb

pd.set_option("display.max_columns", 131)

## Collect data from YouTube
First, we search YouTube for the most recently published videos for each of the three keywords and store the result pages.
Then, we open the page of each video found in the previous step and store those pages as well.
### 1. Collecting search pages

In [3]:
keywords = ['machine+learning', 'kaggle', 'data+science']
url = 'https://www.youtube.com/results?search_query={query}&sp=CAI%253D&p={page}' # url for the search page. The parameter "sp=CAI%253D" is to sort by published date

for keyword in keywords:
    for page in range(1,101): # get 100 search pages for each keyword
        formatted_url = url.format(query=keyword, page=page) 
        print(formatted_url)
        response = rq.get(formatted_url) # request the page using the formatted url
        
        with open('./raw_search_page_data/{}_page{}.html'.format(keyword, page), 'w+', encoding='utf-8') as html_file:
            html_file.write(response.text) # write the html content of the page on a file named with the keyword and page number inside the folder "raw_search_page_data"

https://www.youtube.com/results?search_query=machine+learning&sp=CAI%253D&p=1
https://www.youtube.com/results?search_query=machine+learning&sp=CAI%253D&p=2
https://www.youtube.com/results?search_query=machine+learning&sp=CAI%253D&p=3
https://www.youtube.com/results?search_query=machine+learning&sp=CAI%253D&p=4
https://www.youtube.com/results?search_query=machine+learning&sp=CAI%253D&p=5
https://www.youtube.com/results?search_query=machine+learning&sp=CAI%253D&p=6
https://www.youtube.com/results?search_query=machine+learning&sp=CAI%253D&p=7
https://www.youtube.com/results?search_query=machine+learning&sp=CAI%253D&p=8
https://www.youtube.com/results?search_query=machine+learning&sp=CAI%253D&p=9
https://www.youtube.com/results?search_query=machine+learning&sp=CAI%253D&p=10
https://www.youtube.com/results?search_query=machine+learning&sp=CAI%253D&p=11
https://www.youtube.com/results?search_query=machine+learning&sp=CAI%253D&p=12
https://www.youtube.com/results?search_query=machine+learning
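
The `sp=CAI%253D` value in the search URL above is percent-encoded twice: `%25` is the encoding of `%`, so one decoding pass yields `CAI%3D` and a second yields `CAI=`. The sketch below only illustrates that encoding and how the URL template expands. The helper name `build_search_url` is an assumption made for illustration, not part of the notebook, and no request is sent.

```python
from urllib.parse import unquote

# Same template as in the cell above; "sp" carries the sort-by-date filter.
SEARCH_URL = 'https://www.youtube.com/results?search_query={query}&sp=CAI%253D&p={page}'


def build_search_url(keyword, page):
    """Hypothetical helper: fill the template for one keyword and one page."""
    return SEARCH_URL.format(query=keyword, page=page)


first_url = build_search_url('machine+learning', 1)

# Decode the sort parameter one layer at a time.
once = unquote('CAI%253D')   # first pass removes the outer encoding
twice = unquote(once)        # second pass recovers the raw value

print(first_url)
print(once, twice)
```

Keeping the template in one place makes it easier to change the keyword list or the page range without touching the request loop.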

### 1.1. Processing raw search pages

In [3]:
for keyword in keywords:
    for page in range(1,101):
        # open each search page stored in the previous step
        with open('./raw_search_page_data/{}_page{}.html'.format(keyword, page), 'r+', encoding='utf-8') as html_file:
            file_content = html_file.read()
            parsed = bs4.BeautifulSoup(file_content) # parse the file content using BeautifulSoup4
            tags = parsed.findAll('a') # get all the "a" html tags, which refer to links
            
            for tag in tags:
                if tag.has_attr('aria-describedby'): # if the "a" tag has the attribute "aria-describedby", it means it is a link to a video page
                    link = tag['href']
                    title = tag['title']
                    with open('parsed_videos_search_page.json', 'a+') as search_page_videos_json:
                        data = {'link': link, 'title': title, 'query': keyword}
                        search_page_videos_json.write('{}\n'.format(json.dumps(data))) # store in a json file all the videos found in the retrieved search pages
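
The cell above writes one JSON object per line, which is the layout that `pd.read_json(..., lines=True)` expects in the next step. A minimal sketch of that layout, using an in-memory buffer instead of the real file; the first record comes from the preview in section 1.2, the second is a made-up example.

```python
import io
import json

records = [
    {'link': '/watch?v=CMkmF3DjJuc', 'title': 'Learn Machine Learning With Python', 'query': 'machine+learning'},
    {'link': '/watch?v=EXAMPLE0000', 'title': 'Hypothetical title', 'query': 'kaggle'},  # illustrative only
]

buffer = io.StringIO()  # stands in for parsed_videos_search_page.json
for record in records:
    buffer.write('{}\n'.format(json.dumps(record)))

# Reading back: every non-empty line is an independent JSON document.
buffer.seek(0)
loaded = [json.loads(line) for line in buffer if line.strip()]
print(loaded)
```

Because the notebook opens the file in append mode (`'a+'`), running the cell twice writes every video twice, so it may be worth deleting the file or dropping duplicates before a re-run.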

### 1.2. Previewing processed data

In [9]:
df = pd.read_json('parsed_videos_search_page.json', lines=True)
df.head()

Unnamed: 0,link,title,query
0,/watch?v=SAbBbthqOCU,Lecture 6: Linear Regression and Gradient Desc...,machine+learning
1,/watch?v=cwL22NeiVxw,Machine Learning- Building a simple NN using P...,machine+learning
2,/watch?v=ArDBdzCRGv4,Lecture #9: Derivatives [Part 2] | Deep Learni...,machine+learning
3,/watch?v=CMkmF3DjJuc,Learn Machine Learning With Python,machine+learning
4,/watch?v=id0YYvwRUgg,Machine Learning & its Applications by Tarun D...,machine+learning


### 2. Collecting video pages

In [11]:
links = df['link'].unique()
url = 'https://www.youtube.com{link}'

for link in links:
    formatted_url = url.format(link=link)
    response = rq.get(formatted_url) # request the page using the formatted url
    
    video_code = re.search(r'v=(.*)', link).group(1) # get only the video code, i.e. the text after "v=" in the video link
    with open('./raw_video_page_data/video_{}.html'.format(video_code), 'w+', encoding='utf-8') as html_file:
        html_file.write(response.text) # write the html content of the page on a file named with the video code in a folder named "raw_video_page_data"
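
When extracting the video code with a regex, note that `group()` with no argument returns the whole match, including the `v=` prefix, while `group(1)` returns only the captured code. The sketch below shows both, plus an alternative based on query-string parsing; `video_code_from_link` is a hypothetical helper, not something the notebook defines.

```python
import re
from urllib.parse import urlparse, parse_qs

link = '/watch?v=SAbBbthqOCU'  # link taken from the preview in section 1.2

match = re.search(r'v=(.*)', link)
whole_match = match.group()    # whole match, prefix included
video_code = match.group(1)    # first capture group only


def video_code_from_link(link):
    """Hypothetical helper: read the "v" query parameter of a watch link."""
    return parse_qs(urlparse(link).query)['v'][0]


print(whole_match, video_code, video_code_from_link(link + '&t=10s'))
```

The query-string version ignores extra parameters such as `&t=10s`, which the greedy `(.*)` pattern would keep as part of the code.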

### 2.1 Processing raw video pages

In [46]:
with open('parsed_videos_info.json', 'w+') as videos_info_json:
    for video_file in tqdm.tqdm_notebook(sorted(glob.glob('./raw_video_page_data/video*'))): # read all video pages from previous step
        with open(video_file, 'r+', encoding='utf-8') as html_video_file:
            file_content = html_video_file.read()
            parsed = bs4.BeautifulSoup(file_content) # parse video page using BeautifulSoup4
            
            class_watch = parsed.find_all(attrs={'class':re.compile(r'watch')})
            id_watch = parsed.find_all(attrs={'id':re.compile(r'watch')})
            channel_watch = parsed.find_all(attrs={'href':re.compile(r'watch')})
            meta = parsed.find_all('meta')
            
            data = dict()
            
            for occurrence in class_watch:
                colname = '_'.join(occurrence['class'])
                if 'clearfix' in colname:
                    continue
                data[colname] = occurrence.text.strip()
                
            for occurrence in id_watch:
                colname = occurrence['id']
                data[colname] = occurrence.text.strip()
                
            for occurrence in meta:
                colname = occurrence.get('property')
                if colname is not None:
                    data[colname] = occurrence['content']
                    
            for link_num, occurrence in enumerate(channel_watch):
                data['channel_link_{}'.format(link_num)] = occurrence['href']
            
            videos_info_json.write('{}\n'.format(json.dumps(data)))


### 2.2 Previewing processed data and filtering columns

In [47]:
df = pd.read_json('parsed_videos_info.json', lines=True)

In [48]:
df.head(1)

Unnamed: 0,content-alignment_watch-small,watch-playlist_player-height,watch-queue-header,watch-queue-info,watch-queue-info-icon,watch-queue-title,watch-queue-control-bar_control-bar-button,watch-queue-mole-info,watch-queue-control-bar-icon,watch-queue-icon_yt-sprite,watch-queue-title-container,watch-queue-count,watch-queue-menu_yt-uix-button-menu_yt-uix-button-menu-dark-overflow-action-menu_hid,watch-queue-menu-choice_overflow-menu-choice_yt-uix-button-menu-item,watch-queue-controls,yt-uix-button_yt-uix-button-size-default_yt-uix-button-empty_yt-uix-button-has-icon_control-bar-button_prev-watch-queue-button_yt-uix-button-opacity_yt-uix-tooltip_yt-uix-tooltip,yt-uix-button-icon_yt-uix-button-icon-watch-queue-prev_yt-sprite,yt-uix-button_yt-uix-button-size-default_yt-uix-button-empty_yt-uix-button-has-icon_control-bar-button_play-watch-queue-button_yt-uix-button-opacity_yt-uix-tooltip_yt-uix-tooltip,yt-uix-button-icon_yt-uix-button-icon-watch-queue-play_yt-sprite,yt-uix-button_yt-uix-button-size-default_yt-uix-button-empty_yt-uix-button-has-icon_control-bar-button_pause-watch-queue-button_yt-uix-button-opacity_yt-uix-tooltip_hid_yt-uix-tooltip,yt-uix-button-icon_yt-uix-button-icon-watch-queue-pause_yt-sprite,yt-uix-button_yt-uix-button-size-default_yt-uix-button-empty_yt-uix-button-has-icon_control-bar-button_next-watch-queue-button_yt-uix-button-opacity_yt-uix-tooltip_yt-uix-tooltip,yt-uix-button-icon_yt-uix-button-icon-watch-queue-next_yt-sprite,watch-queue-items-container_yt-scrollbar-dark_yt-scrollbar,watch-queue-items-list,content-alignment_watch-player-playlist,watch-main-col,watch-title-container,watch-title,watch-secondary-actions_yt-uix-button-group,watch-view-count,watch-action-panels_yt-uix-button-panel_hid_yt-card_yt-card-has-padding,watch-time-text,watch-extras-section,watch-meta-item_yt-uix-expander-body,content_watch-info-tag-list,watch-sidebar,watch-playlist_player-height_hid,watch-sidebar-gutter_yt-card_yt-card-has-padding_yt-uix-expander_yt-uix-expan
der-collapsed,watch-sidebar-section,watch-sidebar-head,watch-sidebar-body,watch-sidebar-separation-line,yt-pl-watch-queue-overlay,watch-queue-mole,watch-queue,watch-queue-title-msg,watch-queue-count-msg,watch-queue-loading-template,watch7-container,watch7-main-container,watch7-main,watch7-preview,watch7-content,watch-header,watch7-headline,watch-headline-title,watch7-user-header,watch7-subscription-container,watch8-action-buttons,watch8-secondary-actions,watch8-sentiment-actions,watch7-views-info,watch-action-panels,watch-actions-share-loading,...,channel_link_23,channel_link_24,channel_link_25,channel_link_26,channel_link_27,channel_link_28,channel_link_29,channel_link_30,channel_link_31,channel_link_32,channel_link_33,channel_link_34,channel_link_35,channel_link_36,channel_link_37,channel_link_38,channel_link_39,channel_link_40,channel_link_41,channel_link_42,channel_link_43,channel_link_44,channel_link_45,channel_link_46,channel_link_47,channel_link_48,channel_link_49,channel_link_50,channel_link_51,channel_link_52,channel_link_53,watch-actions-transcript-loading,watch-actions-transcript,watch-transcript-container,watch-transcript-not-found,channel_link_54,channel_link_55,watch-sidebar-live-chat,channel_link_56,channel_link_57,channel_link_58,watch-sidebar-discussion,channel_link_59,channel_link_60,channel_link_61,watch-meta-item,channel_link_62,channel_link_63,channel_link_64,watch-skeleton,watch-page-skeleton,channel_link_65,channel_link_66,channel_link_67,channel_link_68,channel_link_69,channel_link_70,channel_link_71,watch-meta-item_has-image,channel_link_72,channel_link_73,channel_link_74,channel_link_75,channel_link_76,channel_link_77
0,This video is unavailable.\n\n \n\n\n\n\n\n...,Watch QueueQueueWatch QueueQueue \nRemove allD...,Watch QueueQueueWatch QueueQueue \nRemove allD...,Watch QueueQueue,,Watch Queue,Watch QueueQueue \nRemove allDisconnect,Watch QueueQueue,,,Watch QueueQueue,,Remove allDisconnect,Disconnect,,,,,,,,,,Loading...,Loading...,,"{\n ""@context"": ""http://schema.org"",\n ""...",Machine Learning Course A To Z || Beginner to ...,Machine Learning Course A To Z || Beginner to ...,Add to\n\nWant to watch this again later?\n\n ...,"175,135 views",Loading...\n \n\n\n\n\n\n\n\n\n\n\nLoading....,"Published on Aug 10, 2018",Category\n \n\nEducation,Category\n \n\nEducation,Education,Advertisement\n \n\n\n\n\n\n\n\n\nAutopla...,,Advertisement\n \n\n\n\n\n\n\n\n\nAutopla...,Mix\n \n\n\n\n\n\n\n\n\nPlay all\n \...,Up next,Mix\n \n\n\n\n\n\n\n\n\nPlay all\n \...,,,Watch QueueQueueWatch QueueQueue \nRemove allD...,Watch QueueQueueWatch QueueQueue \nRemove allD...,Watch Queue,__count__/__total__,,YouTube Premium\n \n\n\n\n\n\n\nLoading...\n ...,"{\n ""@context"": ""http://schema.org"",\n ""...","{\n ""@context"": ""http://schema.org"",\n ""...",,"{\n ""@context"": ""http://schema.org"",\n ""...",Machine Learning Course A To Z || Beginner to ...,Machine Learning Course A To Z || Beginner to ...,Machine Learning Course A To Z || Beginner to ...,Geek's Lesson\n\n\n\n\n\n\n\n\n\n\n\n\n\nLoadi...,Loading...\n \n\n\n\n\n\n\n\n Unsubs...,Add to\n\nWant to watch this again later?\n\n ...,Add to\n\nWant to watch this again later?\n\n ...,"175,135 views\n\n\n\n\n\n\n\n5,126\n\nLike thi...","175,135 views",Loading...\n 
\n\n\n\n\n\n\n\n\n\n\nLoading....,Loading...,...,/watch?v=8onB7rPG4Pk,/watch?v=8onB7rPG4Pk,/watch?v=h0e2HAPTGF4,/watch?v=h0e2HAPTGF4,/watch?v=DZ7xuZ1-uh8,/watch?v=DZ7xuZ1-uh8,/watch?v=ZX2Hyu5WoFg,/watch?v=ZX2Hyu5WoFg,/watch?v=mrRfpiAwad0,/watch?v=mrRfpiAwad0,/watch?v=JgvyzIkgxF0,/watch?v=JgvyzIkgxF0,/watch?v=6MYF6Zo6i6A,/watch?v=6MYF6Zo6i6A,/watch?v=cKxRvEZd3Mw,/watch?v=cKxRvEZd3Mw,/watch?v=meRc5MSrOO0,/watch?v=meRc5MSrOO0,/watch?v=63NTeLmDANo,/watch?v=63NTeLmDANo,/watch?v=ukzFI9rgwfU,/watch?v=ukzFI9rgwfU,/watch?v=O5xeyoRL95U,/watch?v=O5xeyoRL95U,/watch?v=DWsJc1xnOZo,/watch?v=DWsJc1xnOZo,/watch?v=ora5jY7yIEw,/watch?v=ora5jY7yIEw,/watch?v=Aut32pR5PQA,/watch?v=Aut32pR5PQA,https://accounts.google.com/ServiceLogin?passi...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [50]:
selected_columns = ['watch-title', 'watch-view-count', 'watch-time-text', 'content_watch-info-tag-list', 
                        'watch7-headline', 'watch7-user-header', 'watch8-sentiment-actions', 'og:image',
                       'og:image:width', 'og:image:height', 'og:description', 'og:video:width', 'og:video:height',
                       'og:video:tag', 'channel_link_0']

df[selected_columns].head(1)

Unnamed: 0,watch-title,watch-view-count,watch-time-text,content_watch-info-tag-list,watch7-headline,watch7-user-header,watch8-sentiment-actions,og:image,og:image:width,og:image:height,og:description,og:video:width,og:video:height,og:video:tag,channel_link_0
0,Machine Learning Course A To Z || Beginner to ...,"175,135 views","Published on Aug 10, 2018",Education,Machine Learning Course A To Z || Beginner to ...,Geek's Lesson\n\n\n\n\n\n\n\n\n\n\n\n\n\nLoadi...,"175,135 views\n\n\n\n\n\n\n\n5,126\n\nLike thi...",https://i.ytimg.com/vi/-58kO_zYUGE/maxresdefau...,1280,720,Welcome to this free online class on machine l...,640.0,360.0,Ai and machine learning course,https://www.youtube.com/watch?v=-58kO_zYUGE


In [51]:
df[selected_columns].to_csv('raw_video_data_without_labels.csv') # store the DataFrame in a csv so we can do labelling

## Labelling Phase
In this phase, we manually label around 500 entries of the "raw_video_data_without_labels.csv" file, marking which videos we would like to receive as recommendations and which we would not. Labelling is based only on the video title, the number of views relative to the publish date, and the channel name.

The main criterion for labelling was personal taste, but a few guidelines are worth listing:
* Label 0 - do not recommend
  * Videos in a language other than English or Portuguese
  * Videos like "Machine learning in 10 minutes" or "How to become a machine learning engineer"
  * Videos that seem to be very introductory
  * Videos that seem to be from outdated talks
  * Videos about salary
* Label 1 - recommend
  * Videos related to finance, economics, stocks
  * Andrew Ng's videos
  * Videos about data science/machine learning at big tech companies (Airbnb, Google, Amazon, Uber, etc.)

In [18]:
df = pd.read_csv('raw_video_data_with_500_labels.csv', index_col=0) # open the csv file with around 500 labels
df.head() # the column "y" contains the labels

Unnamed: 0,watch-title,y,watch-view-count,watch-time-text,content_watch-info-tag-list,watch7-headline,watch7-user-header,watch8-sentiment-actions,og:image,og:image:width,og:image:height,og:description,og:video:width,og:video:height,og:video:tag,channel_link_0
0,Machine Learning Course A To Z || Beginner to ...,0.0,"175,135 views","Published on Aug 10, 2018",Education,Machine Learning Course A To Z || Beginner to ...,Geek's Lesson\n\n\n\n\n\n\n\n\n\n\n\n\n\nLoadi...,"175,135 views\n\n\n\n\n\n\n\n5,126\n\nLike thi...",https://i.ytimg.com/vi/-58kO_zYUGE/maxresdefau...,1280,720,Welcome to this free online class on machine l...,640.0,360.0,Ai and machine learning course,https://www.youtube.com/watch?v=-58kO_zYUGE
1,How to Become A Machine Learning Engineer | Ho...,0.0,"34,518 views","Published on Sep 3, 2018",Education,#MachineLearningAlgorithms #Datasciencecourse ...,Simplilearn\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nLoadi...,"34,518 views\n\n\n\n\n\n\n\n779\n\nLike this v...",https://i.ytimg.com/vi/-5hEYRt8JE0/maxresdefau...,1280,720,"This video on ""How to become a Machine Learnin...",1280.0,720.0,simplilearn,https://www.youtube.com/watch?v=-5hEYRt8JE0
2,Python For Data Science Full Course - 9 Hours ...,0.0,"16,860 views","Published on Mar 15, 2020",Education,#edureka #PythonEdureka #pythonfordatasciencef...,edureka!\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nLoading....,"16,860 views\n\n\n\n\n\n\n\n665\n\nLike this v...",https://i.ytimg.com/vi/-6RqxhNO2yY/maxresdefau...,1280,720,🔥Edureka Python Certification Training: https:...,1280.0,720.0,edureka,https://www.youtube.com/watch?v=-6RqxhNO2yY
3,Types of machine learning models used in healt...,0.0,384 views,"Published on Feb 9, 2020",Science & Technology,Types of machine learning models used in healt...,Mr Artificial Intelligence\n\n\n\n\n\n\n\n\n\n...,384 views\n\n\n\n\n\n\n\n14\n\nLike this video...,https://i.ytimg.com/vi/-7AwJx7F0vs/hqdefault.jpg,480,360,this video is the first in a series of videos ...,1280.0,720.0,secure and robust machine learning for health ...,https://www.youtube.com/watch?v=-7AwJx7F0vs
4,Michael I. Jordan: Machine Learning: Dynamical...,1.0,"4,152 views","Published on May 2, 2019",Creative Commons Attribution license (reuse al...,#purdue #michaelijordan #engineering\n\n\n\n ...,Purdue Engineering\n\n\n\n\n\n\n\n\n\n\n\n\n\n...,"4,152 views\n\n\n\n\n\n\n\n94\n\nLike this vid...",https://i.ytimg.com/vi/-8yYFdV5SOc/maxresdefau...,1280,720,2019 Purdue Engineering Distinguished Lecture ...,1280.0,720.0,electrical engineer,https://www.youtube.com/watch?v=-8yYFdV5SOc


In [19]:
df = df[df['y'].notnull()] # excludes the lines without a label

In [20]:
df.shape # number of lines and columns, respectively, of the DataFrame only with labeled data

(517, 16)

## Data cleaning/processing
In this phase, we clean and process the data into the format we need to create features for our machine learning models.

In [21]:
df_cleaned = pd.DataFrame(index=df.index) # create a DataFrame for the cleaned data with the same index as the original DataFrame

### 1. Cleaning the published date

In [22]:
clean_date = df['watch-time-text'].str.extract(r'([A-Za-z]{3}) (\d+), (\d+)') # month abbreviation, day, year ([A-z] would also match characters such as "[" and "_")
clean_date['full_date'] = df['watch-time-text']
clean_date[clean_date[0].isnull()]

Unnamed: 0,0,1,2,full_date
64,,,,Streamed live 19 hours ago
68,,,,Streamed live 3 hours ago
124,,,,Premiered 11 hours ago
320,,,,Premiered 3 hours ago


In [23]:
clean_date = clean_date.drop('full_date', axis=1)
# remove the entries the regex could not parse: they have no date, only how many hours ago the video was published/streamed/premiered
df = df[clean_date[0].notnull()]
df_cleaned = df_cleaned[clean_date[0].notnull()]
clean_date = clean_date[clean_date[0].notnull()]

In [26]:
clean_date[0] = clean_date[0].map(lambda x: str(x))
clean_date[1] = clean_date[1].map(lambda x: '0'+str(x) if len(str(x)) == 1 else str(x)) # include a 0 in days of one digit to fit the pandas pattern
clean_date[2] = clean_date[2].map(lambda x: str(x))
clean_date = clean_date.apply(lambda x: ' '.join(x), axis=1)
df_cleaned['date'] = pd.to_datetime(clean_date, format='%b %d %Y')
df_cleaned.head()

Unnamed: 0,date
0,2018-08-10
1,2018-09-03
2,2020-03-15
3,2020-02-09
4,2019-05-02
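
The date cleaning above can be sketched with the standard library alone: extract the month abbreviation, day and year with a regex, then parse them. This is an illustration rather than the notebook's code; `parse_publish_date` is a hypothetical helper, and the `%b` directive assumes an English locale.

```python
import re
from datetime import datetime

DATE_PATTERN = re.compile(r'([A-Za-z]{3}) (\d+), (\d+)')


def parse_publish_date(text):
    """Hypothetical helper: return a datetime, or None when no 'Mon D, YYYY' date is present."""
    match = DATE_PATTERN.search(text)
    if match is None:
        return None
    month, day, year = match.groups()
    return datetime.strptime('{} {} {}'.format(month, day, year), '%b %d %Y')


print(parse_publish_date('Published on Aug 10, 2018'))   # a regular publish date
print(parse_publish_date('Published on Sep 3, 2018'))    # single-digit day
print(parse_publish_date('Streamed live 19 hours ago'))  # relative time, no date to extract
```

Entries that only carry a relative time come back as `None`, which corresponds to the rows the notebook drops.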


### 2. Cleaning number of views

In [27]:
views = df['watch-view-count'].str.extract(r'(\d+,?\d*,?\d*)', expand=False).str.replace(',', '').fillna(0).astype(int)
df_cleaned['views'] = views
df_cleaned.head()

Unnamed: 0,date,views
0,2018-08-10,175135
1,2018-09-03,34518
2,2020-03-15,16860
3,2020-02-09,384
4,2019-05-02,4152
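
The view-count regex above captures digits with up to two optional comma groups and falls back to 0 when nothing matches. A small sketch of the same logic in plain Python; `parse_views` is a hypothetical helper that mirrors the cell, not a replacement for it.

```python
import re

VIEWS_PATTERN = re.compile(r'(\d+,?\d*,?\d*)')


def parse_views(text):
    """Hypothetical helper: digits with optional commas -> int, 0 when no number is found."""
    match = VIEWS_PATTERN.search(text or '')
    if match is None:
        return 0
    return int(match.group(1).replace(',', ''))


print(parse_views('175,135 views'))
print(parse_views('384 views'))
print(parse_views('No views'))
# Limitation of the pattern: only two comma groups are captured,
# so a count in the billions would be truncated.
print(parse_views('1,234,567,890 views'))
```

If very large counts matter, a pattern such as `\d[\d,]*` would capture any number of comma groups; that change is a suggestion, not something the notebook does.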


## Creating features - Part 1
In this phase, we begin to create and organize the features used in our machine learning models. In this first part, we create two simple features and use them in a first model, a Decision Tree.

In [42]:
features = pd.DataFrame(index=df_cleaned.index)
y = df['y'].copy()

### 1. Features views and views_per_day

In [43]:
features['time_since_published'] = (pd.to_datetime('2020-04-22') - df_cleaned['date']) / np.timedelta64(1, 'D') # time, in days, from the published date until April 22nd, 2020 (date this was written)
features['views'] = df_cleaned['views']
features['views_per_day'] = features['views'] / features['time_since_published']
features = features.drop(['time_since_published'], axis=1)

In [44]:
features.head()

Unnamed: 0,views,views_per_day
0,175135,282.020934
1,34518,57.819095
2,16860,443.684211
3,384,5.260274
4,4152,11.662921
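
As a worked check of `views_per_day`, the first two rows can be recomputed by hand: the number of views divided by the number of days between the publish date and the reference date. The function name below is an assumption for illustration.

```python
from datetime import date

REFERENCE_DATE = date(2020, 4, 22)  # same reference date as in the cell above


def views_per_day(views, published):
    """Views divided by the number of days between publication and the reference date."""
    days_since_published = (REFERENCE_DATE - published).days
    return views / days_since_published


first_row = views_per_day(175135, date(2018, 8, 10))   # 621 days since publication
second_row = views_per_day(34518, date(2018, 9, 3))    # 597 days since publication
print(round(first_row, 6), round(second_row, 6))
```

A video published on the reference date itself would give a zero divisor; none of the rows previewed here fall on that date, but the case is worth guarding if the reference date changes.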


### 2. Testing these two features in a simple Decision Tree model

In [45]:
# split the data into training and validation sets by publish date
Xtrain, Xval = features[df_cleaned['date'] < '2019-06-01'], features[df_cleaned['date'] >= '2019-06-01']
Ytrain, Yval = y[df_cleaned['date'] < '2019-06-01'], y[df_cleaned['date'] >= '2019-06-01'] 
Xtrain.shape, Xval.shape, Ytrain.shape, Yval.shape

((273, 2), (240, 2), (273,), (240,))
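
The split above is by publish date rather than random, so the model is validated on videos newer than the ones it was trained on. A minimal sketch of the same idea with plain Python lists, using the dates and labels of the five rows previewed earlier:

```python
from datetime import date

CUTOFF = date(2019, 6, 1)  # same cutoff as in the cell above

# (publish date, label) pairs for the five previewed rows
rows = [
    (date(2018, 8, 10), 0),
    (date(2018, 9, 3), 0),
    (date(2020, 3, 15), 0),
    (date(2020, 2, 9), 0),
    (date(2019, 5, 2), 1),
]

train = [row for row in rows if row[0] < CUTOFF]
val = [row for row in rows if row[0] >= CUTOFF]

print(len(train), len(val))
```

Every training date is earlier than every validation date, which is closer to how the recommender would be used: trained on past videos and applied to new ones.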

In [46]:
mdl = DecisionTreeClassifier(random_state=0, max_depth=2, class_weight='balanced')
mdl.fit(Xtrain, Ytrain)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight='balanced', criterion='gini',
                       max_depth=2, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=0, splitter='best')

In [47]:
p = mdl.predict_proba(Xval)[:,1] # predict_proba returns a matrix[Xval_lines, 2], 
                                 # in which the first column ([0]) has the prediction probability for the class 0 (not recommend) for the respective line of Xval,
                                 # and the second column ([1]) has the prediction probability for the class 1 (recommend) for the respective line of Xval
                                 # we use the second column, prediction probability for the class 1 (recommend), to measure the model

In [48]:
average_precision_score(Yval, p), roc_auc_score(Yval, p)

(0.07137776429341963, 0.5354352678571429)

The results obtained are:

* Average precision score: 0.07137776429341963

* ROC AUC score: 0.5354352678571429

These results serve as a minimum baseline. We will make adjustments and improvements to obtain better results.
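
Both metrics depend only on how the model ranks the validation videos. The sketch below computes them by hand on a small made-up example, assuming no tied scores; the function names and the toy data are illustrative, and the notebook itself relies on the scikit-learn implementations.

```python
def average_precision(labels, scores):
    """Mean of precision@k over the ranks k where a positive appears (assumes no tied scores)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions)


def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs in which the positive scores higher (assumes no ties)."""
    positives = [s for s, label in zip(scores, labels) if label == 1]
    negatives = [s for s, label in zip(scores, labels) if label == 0]
    wins = sum(1 for p in positives for n in negatives if p > n)
    return wins / (len(positives) * len(negatives))


toy_labels = [0, 1, 0, 0, 1]              # hypothetical labels
toy_scores = [0.9, 0.8, 0.3, 0.2, 0.6]    # hypothetical predicted probabilities

ap = average_precision(toy_labels, toy_scores)
auc = roc_auc(toy_labels, toy_scores)
print(ap, auc)
```

As a reading aid, a ROC AUC of 0.5 corresponds to a random ranking, and average precision is usually compared against the share of positive labels in the validation set.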

## Creating features - Part 2
Now, with a minimum baseline and two features already created, we will create more features from the video title and test other kinds of models, aiming for better results.

In [50]:
#creating masks to facilitate the division of data
mask_train = df_cleaned['date'] < '2019-06-01'
mask_val = df_cleaned['date'] >= '2019-06-01'

In [52]:
df_cleaned['title'] = df['watch-title']

title_train = df_cleaned[mask_train]['title']
title_val = df_cleaned[mask_val]['title']

title_vec = TfidfVectorizer(min_df=2) # builds a matrix with one row per title and one column per word found in the titles; each cell holds the TF-IDF weight of that word in that title (0 if the word is absent).
                                      # the parameter "min_df" sets the minimum number of titles a word must appear in to become a feature in the vectorizer.
                                      # the parameter "ngram_range", which we will use later, sets the sizes of the word groups used as columns. With ngram_range=(1,2), besides one column per word, the matrix has one column for each sequence of 2 consecutive words
title_bow_train = title_vec.fit_transform(title_train)
title_bow_val = title_vec.transform(title_val)

In [53]:
title_bow_train.shape # the number of rows is the number of training titles; the number of columns is the number of distinct words kept by the vectorizer (no 2-word groups yet, since ngram_range is left at its default of (1, 1))

(273, 224)
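
To see how `min_df` and `ngram_range` shape the columns, the vocabulary selection can be approximated in plain Python: lowercase the title, split it into tokens of two or more word characters (the vectorizer's documented defaults), and keep the terms that occur in at least `min_df` titles. This sketch only reproduces the vocabulary filtering, not the TF-IDF weights; the function names and toy titles are made up.

```python
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r'(?u)\b\w\w+\b')  # tokens of 2+ word characters


def ngrams(text, max_n):
    """All word n-grams of the text for n = 1..max_n."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    grams = []
    for n in range(1, max_n + 1):
        grams.extend(' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return grams


def vocabulary(titles, min_df, max_n=1):
    """Keep the terms that occur in at least `min_df` titles."""
    doc_freq = Counter()
    for title in titles:
        doc_freq.update(set(ngrams(title, max_n)))  # count each title once per term
    return sorted(term for term, df in doc_freq.items() if df >= min_df)


toy_titles = [
    'Machine Learning Course',
    'Machine Learning Tutorial',
    'Deep Learning Talk',
]
words_only = vocabulary(toy_titles, min_df=2)
with_bigrams = vocabulary(toy_titles, min_df=2, max_n=2)
print(words_only)
print(with_bigrams)
```

With `min_df=2`, words that appear in a single title are dropped, which keeps the matrix small when there are only a few hundred training titles.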

In [58]:
# concatenates the features from Xtrain and Xval with the TF-IDF title columns from the previous step
Xtrain_wtitle = hstack([Xtrain, title_bow_train])
Xval_wtitle = hstack([Xval, title_bow_val])

In [61]:
# creates a Random Forest classifier with 1000 trees
mdl = RandomForestClassifier(n_estimators=1000, random_state=0, class_weight='balanced', n_jobs=4)
mdl.fit(Xtrain_wtitle, Ytrain)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight='balanced',
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=1000,
                       n_jobs=4, oob_score=False, random_state=0, verbose=0,
                       warm_start=False)

In [62]:
p = mdl.predict_proba(Xval_wtitle)[:,1]
average_precision_score(Yval, p), roc_auc_score(Yval, p)

(0.21355398271185985, 0.6470424107142857)

The results obtained using Random Forest are:

* Average precision score: 0.21355398271185985

* ROC AUC score: 0.6470424107142857

## Active Learning
To help the model obtain better results, we will use active learning: labelling a batch of entries that the model has trouble classifying. In our case, we take around 70 examples the model is unsure about plus around 30 random examples, adding roughly 100 new labeled entries for the models to use.

In [63]:
# from the csv file with around 500 labeled entries, keep only the rows that have not been labeled yet and have a title
df_unlabeled = pd.read_csv('raw_video_data_with_500_labels.csv', index_col=0)
df_unlabeled = df_unlabeled[df_unlabeled['y'].isnull()].dropna(how='all').dropna(subset=['watch-title'])
df_unlabeled.shape

(1145, 16)

In [64]:
df_unlabeled.head()

Unnamed: 0,watch-title,y,watch-view-count,watch-time-text,content_watch-info-tag-list,watch7-headline,watch7-user-header,watch8-sentiment-actions,og:image,og:image:width,og:image:height,og:description,og:video:width,og:video:height,og:video:tag,channel_link_0
532,Kaggle Bike Share Demand Prediction [DEMO #13]...,,"2,214 views","Published on Oct 23, 2014",Creative Commons Attribution license (reuse al...,Kaggle Bike Share Demand Prediction [DEMO #13]...,Numenta\n\n\n\n\n\n\n\n\n\n\n\n\n\nLoading...\...,"2,214 views\n\n\n\n\n\n\n\n4\n\nLike this vide...",https://i.ytimg.com/vi/J4owvvBPPS8/maxresdefau...,1280,720,By Chandan Maruthi.,1280.0,720.0,bike share,https://www.youtube.com/watch?v=J4owvvBPPS8
533,Jeremy Howard: fast.ai Deep Learning Courses a...,,"61,718 views","Published on Aug 27, 2019",Science & Technology,Jeremy Howard: fast.ai Deep Learning Courses a...,Lex Fridman\n\n\n\n\n\n\n\n\n\n\n\n\n\nLoading...,"61,718 views\n\n\n\n\n\n\n\n2,014\n\nLike this...",https://i.ytimg.com/vi/J6XcP4JOHmk/maxresdefau...,1280,720,"Jeremy Howard is the founder of fast.ai, a res...",,,world economic forum,https://www.youtube.com/watch?v=J6XcP4JOHmk
534,Learn Python Basics for Data Science from IBM,,"9,136 views","Published on Apr 4, 2019",Education,Learn Python Basics for Data Science from IBM,edX\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nLoading...\n ...,"9,136 views\n\n\n\n\n\n\n\n72\n\nLike this vid...",https://i.ytimg.com/vi/JC3urnvKanI/maxresdefau...,1280,720,Welcome to the course Python Basics for Data S...,1280.0,720.0,learn data science with python,https://www.youtube.com/watch?v=JC3urnvKanI
535,"TPUs, systolic arrays, and bfloat16: accelerat...",,"9,994 views","Published on Apr 7, 2020",Science & Technology,"TPUs, systolic arrays, and bfloat16: accelerat...",Kaggle\n\n\n\n\n\n\n\n\n\n\n\n\n\nLoading...\n...,"9,994 views\n\n\n\n\n\n\n\n249\n\nLike this vi...",https://i.ytimg.com/vi/JC84GCU7zqA/maxresdefau...,1280,720,Today we’re going to talk about systolic array...,1280.0,720.0,bfloat16,https://www.youtube.com/watch?v=JC84GCU7zqA
536,Future of AI/ML | Rise Of Artificial Intellige...,,"10,164 views","Published on Mar 17, 2020",Education,#edureka #AIedureka #MLedureka\n\n\n\n Futu...,edureka!\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nLoading....,"10,164 views\n\n\n\n\n\n\n\n263\n\nLike this v...",https://i.ytimg.com/vi/JFB_751d2uc/maxresdefau...,1280,720,🔥Edureka NIT Warangal Post Graduate Program on...,1280.0,720.0,edureka,https://www.youtube.com/watch?v=JFB_751d2uc


We now repeat the same cleaning and processing steps on this unlabeled data so the model can produce predictions for these entries.

In [65]:
df_cleaned_unlabeled = pd.DataFrame(index=df_unlabeled.index)
df_cleaned_unlabeled['title'] = df_unlabeled['watch-title']

### 1. Cleaning the published date - Active Learning

In [66]:
clean_date = df_unlabeled['watch-time-text'].str.extract(r'([A-Za-z]{3}) (\d+), (\d+)') # month abbreviation, day, year

# remove the entries the regex could not parse: they have no date, only how many hours ago the video was published/streamed/premiered
df_unlabeled = df_unlabeled[clean_date[0].notnull()]
df_cleaned_unlabeled = df_cleaned_unlabeled[clean_date[0].notnull()]
clean_date = clean_date[clean_date[0].notnull()]

clean_date[0] = clean_date[0].map(lambda x: str(x))
clean_date[1] = clean_date[1].map(lambda x: '0'+str(x) if len(str(x)) == 1 else str(x)) # include a 0 in days of one digit to fit the pandas pattern
clean_date[2] = clean_date[2].map(lambda x: str(x))
clean_date = clean_date.apply(lambda x: ' '.join(x), axis=1)
df_cleaned_unlabeled['date'] = pd.to_datetime(clean_date, format='%b %d %Y')
df_cleaned_unlabeled.head()

Unnamed: 0,title,date
532,Kaggle Bike Share Demand Prediction [DEMO #13]...,2014-10-23
533,Jeremy Howard: fast.ai Deep Learning Courses a...,2019-08-27
534,Learn Python Basics for Data Science from IBM,2019-04-04
535,"TPUs, systolic arrays, and bfloat16: accelerat...",2020-04-07
536,Future of AI/ML | Rise Of Artificial Intellige...,2020-03-17


### 2. Cleaning number of views - Active Learning

In [67]:
views = df_unlabeled['watch-view-count'].str.extract(r'(\d+,?\d*,?\d*)', expand=False).str.replace(',', '').fillna(0).astype(int)
df_cleaned_unlabeled['views'] = views
df_cleaned_unlabeled.head()

Unnamed: 0,title,date,views
532,Kaggle Bike Share Demand Prediction [DEMO #13]...,2014-10-23,2214
533,Jeremy Howard: fast.ai Deep Learning Courses a...,2019-08-27,61718
534,Learn Python Basics for Data Science from IBM,2019-04-04,9136
535,"TPUs, systolic arrays, and bfloat16: accelerat...",2020-04-07,9994
536,Future of AI/ML | Rise Of Artificial Intellige...,2020-03-17,10164


### 3. Creating features - Active Learning

In [68]:
features_unlabeled = pd.DataFrame(index=df_cleaned_unlabeled.index)
features_unlabeled['time_since_published'] = (pd.to_datetime('2020-04-22') - df_cleaned_unlabeled['date']) / np.timedelta64(1, 'D') # time, in days, from the published date until April 22nd, 2020 (date this was written)
features_unlabeled['views'] = df_cleaned_unlabeled['views']
features_unlabeled['views_per_day'] = features_unlabeled['views'] / features_unlabeled['time_since_published']
features_unlabeled = features_unlabeled.drop(['time_since_published'], axis=1)

title_unlabeled = df_cleaned_unlabeled['title']
title_bow_unlabeled = title_vec.transform(title_unlabeled)
Xunlabeled_wtitle = hstack([features_unlabeled, title_bow_unlabeled])

In [69]:
p_unlabeled = mdl.predict_proba(Xunlabeled_wtitle)[:, 1]
df_unlabeled['p'] = p_unlabeled
df_unlabeled.head(1)

Unnamed: 0,watch-title,y,watch-view-count,watch-time-text,content_watch-info-tag-list,watch7-headline,watch7-user-header,watch8-sentiment-actions,og:image,og:image:width,og:image:height,og:description,og:video:width,og:video:height,og:video:tag,channel_link_0,p
532,Kaggle Bike Share Demand Prediction [DEMO #13]...,,"2,214 views","Published on Oct 23, 2014",Creative Commons Attribution license (reuse al...,Kaggle Bike Share Demand Prediction [DEMO #13]...,Numenta\n\n\n\n\n\n\n\n\n\n\n\n\n\nLoading...\...,"2,214 views\n\n\n\n\n\n\n\n4\n\nLike this vide...",https://i.ytimg.com/vi/J4owvvBPPS8/maxresdefau...,1280,720,By Chandan Maruthi.,1280.0,720.0,bike share,https://www.youtube.com/watch?v=J4owvvBPPS8,0.324


In [77]:
# select the entries the model has trouble classifying (predicted probability between 0.34 and 0.7). In this case, 72 entries
mask_unlabeled = (df_unlabeled['p'] >= 0.34) & (df_unlabeled['p'] <= 0.7)
mask_unlabeled.sum()

72

In [79]:
df_unlabeled[mask_unlabeled].shape

(72, 17)

In [80]:
hard_entries = df_unlabeled[mask_unlabeled]
random_entries = df_unlabeled[~mask_unlabeled].sample(28)
pd.concat([hard_entries, random_entries]).to_csv('active_learning.csv') # store the hard and random entries in a csv for manual labelling
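The selection step above (uncertain entries plus a random sample of the rest) could be expressed as one function. The sketch below is an illustration under assumptions: the name `select_for_labeling` and the `seed` argument are hypothetical, and the demo probabilities are synthetic rather than the notebook's predictions.

```python
import numpy as np
import pandas as pd

def select_for_labeling(df, low=0.34, high=0.7, n_random=28, seed=0):
    """Return the uncertain rows plus a random sample of the remaining rows."""
    uncertain = df['p'].between(low, high)  # inclusive on both ends, like >= and <=
    hard = df[uncertain]
    rest = df[~uncertain]
    # a fixed seed (an addition, not in the notebook) makes the random part reproducible
    random_part = rest.sample(min(n_random, len(rest)), random_state=seed)
    return pd.concat([hard, random_part])

# synthetic probabilities, only to make the sketch runnable
rng = np.random.default_rng(0)
demo = pd.DataFrame({'p': rng.uniform(0, 1, size=200)})
batch = select_for_labeling(demo, n_random=10)
print(len(batch))
```

Mixing in random entries keeps the newly labeled batch from consisting only of borderline cases, which would otherwise skew the label distribution of the added data.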

After manually labeling these 100 new examples, we concatenate them with the entries that were already labeled (517 rows) and check how the model results change.

In [82]:
df1 = pd.read_csv('raw_video_data_with_500_labels.csv', index_col=0)
df1 = df1[df1['y'].notnull()]
df1.shape

(517, 16)

In [83]:
df2 = pd.read_csv('active_learning_labeled.csv', index_col=0)
df2 = df2[df2['y'].notnull()]
df2['active_learning'] = 1
df2.shape

(100, 18)

In [84]:
# compares the predicted probabilities (p) with the manually assigned labels (y)
average_precision_score(df2['y'], df2['p']), roc_auc_score(df2['y'], df2['p'])

(0.36759868712367233, 0.6148018648018648)
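Both metrics used here depend only on how the scores rank the examples, not on their absolute values. The toy example below is a sketch with made-up labels and scores (not data from this project) showing how the two numbers are obtained and that rescaling the scores leaves them unchanged.

```python
from sklearn.metrics import average_precision_score, roc_auc_score

# made-up labels and scores, only for illustration
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

ap = average_precision_score(y_true, scores)
auc = roc_auc_score(y_true, scores)
print(round(ap, 4), round(auc, 4))  # 0.8333 0.75

# multiplying every score by a positive constant keeps the ranking, so the metrics do not move
scaled = [s * 1000 for s in scores]
ap_scaled = average_precision_score(y_true, scaled)
auc_scaled = roc_auc_score(y_true, scaled)
```

Average precision focuses on how well the positive class is ranked near the top, which is why it tends to be the more demanding of the two when relevant videos are rare.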

### 4. Recalculating with active learning entries
Now, we will concatenate the previously labeled DataFrame with the new one from active learning, clean and process the data, create the features, and train a machine learning model to see the difference in results.

In [85]:
df = pd.concat([df1, df2.drop('p', axis=1)])

In [87]:
df_cleaned = pd.DataFrame(index=df.index)
df_cleaned['title'] = df['watch-title']
df_cleaned['active_learning'] = df['active_learning'].fillna(0)

#### 4.1. Cleaning the published date

In [88]:
clean_date = df['watch-time-text'].str.extract(r'([A-Za-z]{3}) (\d+), (\d+)') # [A-Za-z] matches letters only; [A-z] would also match characters such as '[' and '_'

# remove the entries the regex could not parse: they have no date, only how many hours ago the video was published/streamed/premiered
df = df[clean_date[0].notnull()]
df_cleaned = df_cleaned[clean_date[0].notnull()]
clean_date = clean_date[clean_date[0].notnull()]

clean_date[0] = clean_date[0].map(lambda x: str(x))
clean_date[1] = clean_date[1].map(lambda x: '0'+str(x) if len(str(x)) == 1 else str(x)) # include a 0 in days of one digit to fit the pandas pattern
clean_date[2] = clean_date[2].map(lambda x: str(x))
clean_date = clean_date.apply(lambda x: ' '.join(x), axis=1)
df_cleaned['date'] = pd.to_datetime(clean_date, format='%b %d %Y')
df_cleaned.head()

Unnamed: 0,title,active_learning,date
0,Machine Learning Course A To Z || Beginner to ...,0.0,2018-08-10
1,How to Become A Machine Learning Engineer | Ho...,0.0,2018-09-03
2,Python For Data Science Full Course - 9 Hours ...,0.0,2020-03-15
3,Types of machine learning models used in healt...,0.0,2020-02-09
4,Michael I. Jordan: Machine Learning: Dynamical...,0.0,2019-05-02
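The date cleanup above can be condensed, since the "%d" directive also accepts single-digit days and the zero-padding step is therefore optional. The following is a minimal sketch under assumptions: the function name `parse_published_date` is hypothetical, the input strings are made up in the style of the "watch-time-text" column, and English month abbreviations are assumed.

```python
import pandas as pd

def parse_published_date(text: pd.Series) -> pd.Series:
    """Extract 'Mon D, YYYY' from the text and parse it; entries without a date become NaT."""
    date_part = text.str.extract(r'([A-Za-z]{3} \d{1,2}, \d{4})', expand=False)
    return pd.to_datetime(date_part, format='%b %d, %Y', errors='coerce')

# made-up examples in the style of the scraped column
raw_dates = pd.Series(['Published on Sep 3, 2018',
                       'Streamed live 5 hours ago',
                       'Published on Oct 23, 2014'])
parsed_dates = parse_published_date(raw_dates)
print(parsed_dates)
```

Rows that come back as NaT correspond to the "hours ago" entries, so they can be dropped with a single `notnull()` mask instead of filtering three objects separately.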


#### 4.2. Cleaning number of views

In [89]:
views = df['watch-view-count'].str.extract(r'(\d+,?\d*,?\d*)', expand=False).str.replace(',', '').fillna(0).astype(int)
df_cleaned['views'] = views
df_cleaned.head()

Unnamed: 0,title,active_learning,date,views
0,Machine Learning Course A To Z || Beginner to ...,0.0,2018-08-10,175135
1,How to Become A Machine Learning Engineer | Ho...,0.0,2018-09-03,34518
2,Python For Data Science Full Course - 9 Hours ...,0.0,2020-03-15,16860
3,Types of machine learning models used in healt...,0.0,2020-02-09,384
4,Michael I. Jordan: Machine Learning: Dynamical...,0.0,2019-05-02,4152


#### 4.3. Creating features

In [90]:
features = pd.DataFrame(index=df_cleaned.index)
y = df['y'].copy()

features['time_since_published'] = (pd.to_datetime('2020-04-22') - df_cleaned['date']) / np.timedelta64(1, 'D') # time, in days, from the published date until April 22nd, 2020 (date this was written)
features['views'] = df_cleaned['views']
features['views_per_day'] = features['views'] / features['time_since_published']
features = features.drop(['time_since_published'], axis=1)

In [91]:
features.head()

Unnamed: 0,views,views_per_day
0,175135,282.020934
1,34518,57.819095
2,16860,443.684211
3,384,5.260274
4,4152,11.662921
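As a check on the numbers above, the feature step can be written as a function that takes the reference date as an argument. This is a sketch only: `build_numeric_features` is a hypothetical name, and the two demo rows reuse dates and view counts shown in the tables above.

```python
import pandas as pd

def build_numeric_features(cleaned: pd.DataFrame, reference_date: str) -> pd.DataFrame:
    """Compute 'views' and 'views_per_day' relative to a reference date."""
    # age of each video in whole days at the reference date
    age_days = (pd.Timestamp(reference_date) - cleaned['date']).dt.days
    features = pd.DataFrame(index=cleaned.index)
    features['views'] = cleaned['views']
    features['views_per_day'] = cleaned['views'] / age_days
    return features

# two rows taken from the cleaned table above
demo = pd.DataFrame({'date': pd.to_datetime(['2018-08-10', '2020-02-09']),
                     'views': [175135, 384]})
demo_features = build_numeric_features(demo, '2020-04-22')
print(demo_features)
```

A video published on the reference date itself would have an age of 0 days and produce an infinite "views_per_day"; in this notebook such rows were already removed because their text contains "hours ago" instead of a date.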


In [92]:
mask_train = df_cleaned['date'] < '2019-06-01'
mask_val = df_cleaned['date'] >= '2019-06-01'

Xtrain, Xval = features[mask_train], features[mask_val]
Ytrain, Yval = y[mask_train], y[mask_val] 
Xtrain.shape, Xval.shape, Ytrain.shape, Yval.shape

((318, 2), (295, 2), (318,), (295,))
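The split above is time-based: older videos train the model and newer ones validate it, which mimics recommending videos published after the model was built. A small sketch of the same idea follows; `time_split_masks` is a hypothetical helper and the dates are made up.

```python
import pandas as pd

def time_split_masks(dates: pd.Series, cutoff: str):
    """Return boolean masks for rows before the cutoff (train) and on/after it (validation)."""
    cutoff_ts = pd.Timestamp(cutoff)
    mask_train = dates < cutoff_ts
    mask_val = dates >= cutoff_ts
    return mask_train, mask_val

# made-up publication dates
dates = pd.Series(pd.to_datetime(['2018-08-10', '2019-06-01', '2020-03-15']))
mask_train, mask_val = time_split_masks(dates, '2019-06-01')
print(mask_train.tolist(), mask_val.tolist())
```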

In [93]:
title_train = df_cleaned[mask_train]['title']
title_val = df_cleaned[mask_val]['title']

title_vec = TfidfVectorizer(min_df=2)
title_bow_train = title_vec.fit_transform(title_train)
title_bow_val = title_vec.transform(title_val)

Xtrain_wtitle = hstack([Xtrain, title_bow_train])
Xval_wtitle = hstack([Xval, title_bow_val])

mdl = RandomForestClassifier(n_estimators=1000, random_state=0, class_weight='balanced', n_jobs=4)
mdl.fit(Xtrain_wtitle, Ytrain)

p = mdl.predict_proba(Xval_wtitle)[:,1]

average_precision_score(Yval, p), roc_auc_score(Yval, p)

(0.3700295113833973, 0.7497037037037038)

The results after executing Active Learning, using the same kind of model as before (Random Forest), are:

* Average precision score: 0.3700295113833973

* ROC AUC score: 0.7497037037037038

Compared with the model trained before Active Learning, the results improved.

## Modelling
In this phase, we will create and run different types of machine learning models to check their performance on our dataset. To start, we will manually label the rest of the data so that we have more training and validation data, aiming for better results. We will repeat the whole process of cleaning and processing the data and creating features, this time for the unified DataFrame, in which all entries are labeled.

### 1. Loading and concatenating data
We will be loading the three different datasets (first 500 entries labeled, active learning entries and remaining entries) and concatenating them into a single DataFrame for modeling

In [74]:
# loads data from the first labelling
df1 = pd.read_csv('raw_video_data_with_500_labels.csv', index_col=0)
df1 = df1[df1['y'].notnull()].dropna(how='all').dropna(subset=['watch-title'])
df1.head()

Unnamed: 0,watch-title,y,watch-view-count,watch-time-text,content_watch-info-tag-list,watch7-headline,watch7-user-header,watch8-sentiment-actions,og:image,og:image:width,og:image:height,og:description,og:video:width,og:video:height,og:video:tag,channel_link_0
0,Machine Learning Course A To Z || Beginner to ...,0.0,"175,135 views","Published on Aug 10, 2018",Education,Machine Learning Course A To Z || Beginner to ...,Geek's Lesson\n\n\n\n\n\n\n\n\n\n\n\n\n\nLoadi...,"175,135 views\n\n\n\n\n\n\n\n5,126\n\nLike thi...",https://i.ytimg.com/vi/-58kO_zYUGE/maxresdefau...,1280,720,Welcome to this free online class on machine l...,640.0,360.0,Ai and machine learning course,https://www.youtube.com/watch?v=-58kO_zYUGE
1,How to Become A Machine Learning Engineer | Ho...,0.0,"34,518 views","Published on Sep 3, 2018",Education,#MachineLearningAlgorithms #Datasciencecourse ...,Simplilearn\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nLoadi...,"34,518 views\n\n\n\n\n\n\n\n779\n\nLike this v...",https://i.ytimg.com/vi/-5hEYRt8JE0/maxresdefau...,1280,720,"This video on ""How to become a Machine Learnin...",1280.0,720.0,simplilearn,https://www.youtube.com/watch?v=-5hEYRt8JE0
2,Python For Data Science Full Course - 9 Hours ...,0.0,"16,860 views","Published on Mar 15, 2020",Education,#edureka #PythonEdureka #pythonfordatasciencef...,edureka!\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nLoading....,"16,860 views\n\n\n\n\n\n\n\n665\n\nLike this v...",https://i.ytimg.com/vi/-6RqxhNO2yY/maxresdefau...,1280,720,🔥Edureka Python Certification Training: https:...,1280.0,720.0,edureka,https://www.youtube.com/watch?v=-6RqxhNO2yY
3,Types of machine learning models used in healt...,0.0,384 views,"Published on Feb 9, 2020",Science & Technology,Types of machine learning models used in healt...,Mr Artificial Intelligence\n\n\n\n\n\n\n\n\n\n...,384 views\n\n\n\n\n\n\n\n14\n\nLike this video...,https://i.ytimg.com/vi/-7AwJx7F0vs/hqdefault.jpg,480,360,this video is the first in a series of videos ...,1280.0,720.0,secure and robust machine learning for health ...,https://www.youtube.com/watch?v=-7AwJx7F0vs
4,Michael I. Jordan: Machine Learning: Dynamical...,1.0,"4,152 views","Published on May 2, 2019",Creative Commons Attribution license (reuse al...,#purdue #michaelijordan #engineering\n\n\n\n ...,Purdue Engineering\n\n\n\n\n\n\n\n\n\n\n\n\n\n...,"4,152 views\n\n\n\n\n\n\n\n94\n\nLike this vid...",https://i.ytimg.com/vi/-8yYFdV5SOc/maxresdefau...,1280,720,2019 Purdue Engineering Distinguished Lecture ...,1280.0,720.0,electrical engineer,https://www.youtube.com/watch?v=-8yYFdV5SOc


In [75]:
df1.shape

(517, 16)

In [76]:
# loads data from the active learning batch
df2 = pd.read_csv('active_learning_labeled.csv', index_col=0)
df2.head()

Unnamed: 0,watch-title,y,watch-view-count,watch-time-text,content_watch-info-tag-list,watch7-headline,watch7-user-header,watch8-sentiment-actions,og:image,og:image:width,og:image:height,og:description,og:video:width,og:video:height,og:video:tag,channel_link_0,p
566,"Women in Data Science (Uber, Facebook, Reddit)",1,"12,443 views","Published on Jul 1, 2019",Education,#uber #springboard #womenindata\n\n\n\n Wom...,Springboard\n\n\n\n\n\n\n\n\n\n\n\n\n\nLoading...,"12,443 views\n\n\n\n\n\n\n\n147\n\nLike this v...",https://i.ytimg.com/vi/KI0bTZ3FEQk/maxresdefau...,1280,720,Chatting data science with data scientists fro...,1280.0,720.0,women in tech,https://www.youtube.com/watch?v=KI0bTZ3FEQk,689.0
596,Data Science and Artificial Intelligence - Tur...,1,"1,018 views","Published on Feb 18, 2020",Science & Technology,Data Science and Artificial Intelligence - Tur...,astrazeneca\n\n\n\n\n\n\n\n\n\n\n\n\n\nLoading...,"1,018 views\n\n\n\n\nLike this video?\n\n ...",https://i.ytimg.com/vi/LSGk9pVfujM/hqdefault.jpg,480,360,Our collaboration with BenevolentAI uses machi...,1280.0,720.0,collaboration,https://www.youtube.com/watch?v=LSGk9pVfujM,395.0
615,Why I left my Data Science Job at FANG (Facebo...,0,"592,378 views","Published on Apr 11, 2019",People & Blogs,Why I left my Data Science Job at FANG (Facebo...,Joma Tech\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nLoading...,"592,378 views\n\n\n\n\n\n\n\n12,243\n\nLike th...",https://i.ytimg.com/vi/M5v1nXiUaOI/maxresdefau...,1280,720,► Check out CoderPro for 100+ Video Explanatio...,1280.0,720.0,data scientist,https://www.youtube.com/watch?v=M5v1nXiUaOI,449.0
634,Making data science useful. Cassie Kozyrkov (G...,1,"5,117 views","Published on May 2, 2019",Science & Technology,Making data science useful. Cassie Kozyrkov (G...,O'Reilly\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nLoading....,"5,117 views\n\n\n\n\n\n\n\n66\n\nLike this vid...",https://i.ytimg.com/vi/Mukpy1QjVn8/hqdefault.jpg,480,360,To view the full keynote and other videos from...,1280.0,720.0,useful,https://www.youtube.com/watch?v=Mukpy1QjVn8,413.0
651,Kaggle Competition- Dengue or Malaria Predicti...,0,"4,237 views","Published on Sep 17, 2019",Education,#MalariaDetection\n\n\n\n Kaggle Competitio...,Krish Naik\n\n\n\n\n\n\n\n\n\n\n\n\n\nLoading....,"4,237 views\n\n\n\n\n\n\n\n88\n\nLike this vid...",https://i.ytimg.com/vi/NjvX4BhOjOw/hqdefault.jpg,480,360,In this video we will implement transfer learn...,1280.0,720.0,VGG19,https://www.youtube.com/watch?v=NjvX4BhOjOw,569.0


In [77]:
df2.shape

(100, 17)

In [78]:
# loads data from the remaining entries, which were labeled last
df3 = pd.read_csv('remaining_videos_labeled.csv', index_col=0)
df3 = df3[df3['y'].notnull()].dropna(how='all').dropna(subset=['watch-title'])
df3.head()

Unnamed: 0,watch-title,y,watch-view-count,watch-time-text,content_watch-info-tag-list,watch7-headline,watch7-user-header,watch8-sentiment-actions,og:image,og:image:width,og:image:height,og:description,og:video:width,og:video:height,og:video:tag,channel_link_0
532,Kaggle Bike Share Demand Prediction [DEMO #13]...,0.0,"2,214 views","Published on Oct 23, 2014",Creative Commons Attribution license (reuse al...,Kaggle Bike Share Demand Prediction [DEMO #13]...,Numenta\n\n\n\n\n\n\n\n\n\n\n\n\n\nLoading...\...,"2,214 views\n\n\n\n\n\n\n\n4\n\nLike this vide...",https://i.ytimg.com/vi/J4owvvBPPS8/maxresdefau...,1280,720,By Chandan Maruthi.,1280.0,720.0,bike share,https://www.youtube.com/watch?v=J4owvvBPPS8
533,Jeremy Howard: fast.ai Deep Learning Courses a...,0.0,"61,718 views","Published on Aug 27, 2019",Science & Technology,Jeremy Howard: fast.ai Deep Learning Courses a...,Lex Fridman\n\n\n\n\n\n\n\n\n\n\n\n\n\nLoading...,"61,718 views\n\n\n\n\n\n\n\n2,014\n\nLike this...",https://i.ytimg.com/vi/J6XcP4JOHmk/maxresdefau...,1280,720,"Jeremy Howard is the founder of fast.ai, a res...",,,world economic forum,https://www.youtube.com/watch?v=J6XcP4JOHmk
534,Learn Python Basics for Data Science from IBM,0.0,"9,136 views","Published on Apr 4, 2019",Education,Learn Python Basics for Data Science from IBM,edX\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nLoading...\n ...,"9,136 views\n\n\n\n\n\n\n\n72\n\nLike this vid...",https://i.ytimg.com/vi/JC3urnvKanI/maxresdefau...,1280,720,Welcome to the course Python Basics for Data S...,1280.0,720.0,learn data science with python,https://www.youtube.com/watch?v=JC3urnvKanI
535,"TPUs, systolic arrays, and bfloat16: accelerat...",0.0,"9,994 views","Published on Apr 7, 2020",Science & Technology,"TPUs, systolic arrays, and bfloat16: accelerat...",Kaggle\n\n\n\n\n\n\n\n\n\n\n\n\n\nLoading...\n...,"9,994 views\n\n\n\n\n\n\n\n249\n\nLike this vi...",https://i.ytimg.com/vi/JC84GCU7zqA/maxresdefau...,1280,720,Today we’re going to talk about systolic array...,1280.0,720.0,bfloat16,https://www.youtube.com/watch?v=JC84GCU7zqA
536,Future of AI/ML | Rise Of Artificial Intellige...,0.0,"10,164 views","Published on Mar 17, 2020",Education,#edureka #AIedureka #MLedureka\n\n\n\n Futu...,edureka!\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nLoading....,"10,164 views\n\n\n\n\n\n\n\n263\n\nLike this v...",https://i.ytimg.com/vi/JFB_751d2uc/maxresdefau...,1280,720,🔥Edureka NIT Warangal Post Graduate Program on...,1280.0,720.0,edureka,https://www.youtube.com/watch?v=JFB_751d2uc


In [79]:
df3.shape

(1045, 16)

In [80]:
# concatenates the three DataFrames, so we have our definitive one for modeling
df = pd.concat([df1, df2.drop('p', axis=1), df3])
df.head()

Unnamed: 0,watch-title,y,watch-view-count,watch-time-text,content_watch-info-tag-list,watch7-headline,watch7-user-header,watch8-sentiment-actions,og:image,og:image:width,og:image:height,og:description,og:video:width,og:video:height,og:video:tag,channel_link_0
0,Machine Learning Course A To Z || Beginner to ...,0.0,"175,135 views","Published on Aug 10, 2018",Education,Machine Learning Course A To Z || Beginner to ...,Geek's Lesson\n\n\n\n\n\n\n\n\n\n\n\n\n\nLoadi...,"175,135 views\n\n\n\n\n\n\n\n5,126\n\nLike thi...",https://i.ytimg.com/vi/-58kO_zYUGE/maxresdefau...,1280,720,Welcome to this free online class on machine l...,640.0,360.0,Ai and machine learning course,https://www.youtube.com/watch?v=-58kO_zYUGE
1,How to Become A Machine Learning Engineer | Ho...,0.0,"34,518 views","Published on Sep 3, 2018",Education,#MachineLearningAlgorithms #Datasciencecourse ...,Simplilearn\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nLoadi...,"34,518 views\n\n\n\n\n\n\n\n779\n\nLike this v...",https://i.ytimg.com/vi/-5hEYRt8JE0/maxresdefau...,1280,720,"This video on ""How to become a Machine Learnin...",1280.0,720.0,simplilearn,https://www.youtube.com/watch?v=-5hEYRt8JE0
2,Python For Data Science Full Course - 9 Hours ...,0.0,"16,860 views","Published on Mar 15, 2020",Education,#edureka #PythonEdureka #pythonfordatasciencef...,edureka!\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nLoading....,"16,860 views\n\n\n\n\n\n\n\n665\n\nLike this v...",https://i.ytimg.com/vi/-6RqxhNO2yY/maxresdefau...,1280,720,🔥Edureka Python Certification Training: https:...,1280.0,720.0,edureka,https://www.youtube.com/watch?v=-6RqxhNO2yY
3,Types of machine learning models used in healt...,0.0,384 views,"Published on Feb 9, 2020",Science & Technology,Types of machine learning models used in healt...,Mr Artificial Intelligence\n\n\n\n\n\n\n\n\n\n...,384 views\n\n\n\n\n\n\n\n14\n\nLike this video...,https://i.ytimg.com/vi/-7AwJx7F0vs/hqdefault.jpg,480,360,this video is the first in a series of videos ...,1280.0,720.0,secure and robust machine learning for health ...,https://www.youtube.com/watch?v=-7AwJx7F0vs
4,Michael I. Jordan: Machine Learning: Dynamical...,1.0,"4,152 views","Published on May 2, 2019",Creative Commons Attribution license (reuse al...,#purdue #michaelijordan #engineering\n\n\n\n ...,Purdue Engineering\n\n\n\n\n\n\n\n\n\n\n\n\n\n...,"4,152 views\n\n\n\n\n\n\n\n94\n\nLike this vid...",https://i.ytimg.com/vi/-8yYFdV5SOc/maxresdefau...,1280,720,2019 Purdue Engineering Distinguished Lecture ...,1280.0,720.0,electrical engineer,https://www.youtube.com/watch?v=-8yYFdV5SOc


In [81]:
df.shape

(1662, 16)

In [82]:
df_cleaned = pd.DataFrame(index=df.index)
df_cleaned['title'] = df['watch-title']

### 2. Cleaning the published date

In [83]:
clean_date = df['watch-time-text'].str.extract(r'([A-Za-z]{3}) (\d+), (\d+)') # [A-Za-z] matches letters only; [A-z] would also match characters such as '[' and '_'

# remove the entries the regex could not parse: they have no date, only how many hours ago the video was published/streamed/premiered
df = df[clean_date[0].notnull()]
df_cleaned = df_cleaned[clean_date[0].notnull()]
clean_date = clean_date[clean_date[0].notnull()]

clean_date[0] = clean_date[0].map(lambda x: str(x))
clean_date[1] = clean_date[1].map(lambda x: '0'+str(x) if len(str(x)) == 1 else str(x)) # include a 0 in days of one digit to fit the pandas pattern
clean_date[2] = clean_date[2].map(lambda x: str(x))
clean_date = clean_date.apply(lambda x: ' '.join(x), axis=1)
df_cleaned['date'] = pd.to_datetime(clean_date, format='%b %d %Y')
df_cleaned.head()

Unnamed: 0,title,date
0,Machine Learning Course A To Z || Beginner to ...,2018-08-10
1,How to Become A Machine Learning Engineer | Ho...,2018-09-03
2,Python For Data Science Full Course - 9 Hours ...,2020-03-15
3,Types of machine learning models used in healt...,2020-02-09
4,Michael I. Jordan: Machine Learning: Dynamical...,2019-05-02


### 3. Cleaning the number of views

In [84]:
views = df['watch-view-count'].str.extract(r'(\d+,?\d*,?\d*)', expand=False).str.replace(',', '').fillna(0).astype(int)
df_cleaned['views'] = views
df_cleaned.head()

Unnamed: 0,title,date,views
0,Machine Learning Course A To Z || Beginner to ...,2018-08-10,175135
1,How to Become A Machine Learning Engineer | Ho...,2018-09-03,34518
2,Python For Data Science Full Course - 9 Hours ...,2020-03-15,16860
3,Types of machine learning models used in healt...,2020-02-09,384
4,Michael I. Jordan: Machine Learning: Dynamical...,2019-05-02,4152


### 4. Creating Features

In [85]:
features = pd.DataFrame(index=df_cleaned.index)
y = df['y'].copy()

features['time_since_published'] = (pd.to_datetime('2020-04-22') - df_cleaned['date']) / np.timedelta64(1, 'D') # time, in days, from the published date until April 22nd, 2020 (date this was written)
features['views'] = df_cleaned['views']
features['views_per_day'] = features['views'] / features['time_since_published']
features = features.drop(['time_since_published'], axis=1)
features.head()

Unnamed: 0,views,views_per_day
0,175135,282.020934
1,34518,57.819095
2,16860,443.684211
3,384,5.260274
4,4152,11.662921


In [86]:
mask_train = df_cleaned['date'] < '2019-06-01'
mask_val = df_cleaned['date'] >= '2019-06-01'

Xtrain, Xval = features[mask_train], features[mask_val]
Ytrain, Yval = y[mask_train], y[mask_val] 
Xtrain.shape, Xval.shape, Ytrain.shape, Yval.shape

((823, 2), (828, 2), (823,), (828,))

In [87]:
title_train = df_cleaned[mask_train]['title']
title_val = df_cleaned[mask_val]['title']

title_vec = TfidfVectorizer(min_df=4)
title_bow_train = title_vec.fit_transform(title_train)
title_bow_val = title_vec.transform(title_val)

Xtrain_wtitle = hstack([Xtrain, title_bow_train])
Xval_wtitle = hstack([Xval, title_bow_val])
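The feature matrix built above places the two numeric columns first and the TF-IDF title columns after them. The toy sketch below illustrates that layout and the effect of "min_df"; the titles and numbers are made up, and `min_df=2` is used only because the toy corpus is tiny.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# made-up titles and numeric features (views, views_per_day)
titles_train = ['machine learning course',
                'machine learning engineer',
                'python data science course']
numeric_train = np.array([[175135, 282.0], [34518, 57.8], [16860, 443.7]])

vec = TfidfVectorizer(min_df=2)           # keep only words that appear in at least 2 titles
bow_train = vec.fit_transform(titles_train)
X_demo = hstack([csr_matrix(numeric_train), bow_train]).tocsr()

# at validation time the vectorizer is only applied, so unseen words are ignored
bow_val = vec.transform(['kaggle course'])
print(sorted(vec.vocabulary_), X_demo.shape, bow_val.nnz)
```

Raising "min_df" shrinks the vocabulary, which reduces the number of sparse columns the tree models have to search through; that is the trade-off being tuned in the sections that follow.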

### 5. Modeling
#### 5.1 Random Forest

In [88]:
mdl_rf = RandomForestClassifier(n_estimators=1000, random_state=0, min_samples_leaf=1, class_weight="balanced", n_jobs=4)
mdl_rf.fit(Xtrain_wtitle, Ytrain)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight='balanced',
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=1000,
                       n_jobs=4, oob_score=False, random_state=0, verbose=0,
                       warm_start=False)

In [89]:
p_rf = mdl_rf.predict_proba(Xval_wtitle)[:, 1]
average_precision_score(Yval, p_rf), roc_auc_score(Yval, p_rf)

(0.35282168514773743, 0.7605229591836733)

The results with the whole dataset labeled, using the same kind of model as before (Random Forest), are:

* Average precision score: 0.35282168514773743

* ROC AUC score: 0.7605229591836733

Compared with the Active Learning run, average precision is slightly worse (0.353 vs. 0.370), while ROC AUC is slightly better (0.761 vs. 0.750). We attribute the drop in average precision to changing the "min_df" parameter of the TfidfVectorizer to 4, which we did in order to get better results with the LightGBM model in the next step.

#### 5.2 LightGBM

In [90]:
mdl_lgbm = LGBMClassifier(random_state=0, class_weight="balanced", n_jobs=4)
mdl_lgbm.fit(Xtrain_wtitle, Ytrain)

LGBMClassifier(boosting_type='gbdt', class_weight='balanced',
               colsample_bytree=1.0, importance_type='split', learning_rate=0.1,
               max_depth=-1, min_child_samples=20, min_child_weight=0.001,
               min_split_gain=0.0, n_estimators=100, n_jobs=4, num_leaves=31,
               objective=None, random_state=0, reg_alpha=0.0, reg_lambda=0.0,
               silent=True, subsample=1.0, subsample_for_bin=200000,
               subsample_freq=0)

In [91]:
p_lgbm = mdl_lgbm.predict_proba(Xval_wtitle)[:, 1]
average_precision_score(Yval, p_lgbm), roc_auc_score(Yval, p_lgbm)



(0.10393319481721194, 0.646248840445269)

We can see that the results from the LightGBM model with default parameters are worse than those obtained with the Random Forest model. We will tune the LightGBM parameters using Bayesian Optimization (the forest_minimize function from scikit-optimize).

We will be tuning 8 parameters:
* Model learning rate
* Model max depth
* Min child samples
* Subsample
* Colsample by tree
* Number of estimators
* Min df for title vectorizer
* Ngram range for title vectorizer

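We will run 50 iterations (the first 20 at random points), changing these parameters to find the combination that maximizes the average precision on the validation set. Since the optimizer minimizes, the function returns the negative of the average precision.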

In [67]:
def tune_lgbm(params):
    print(params)
    lr = params[0]
    max_depth = params[1]
    min_child_samples = params[2]
    subsample = params[3]
    colsample_bytree = params[4]
    n_estimators = params[5]
    
    min_df = params[6]
    ngram_range = (1, params[7])
    
    title_vec = TfidfVectorizer(min_df=min_df, ngram_range=ngram_range)
    title_bow_train = title_vec.fit_transform(title_train)
    title_bow_val = title_vec.transform(title_val)
    
    Xtrain_wtitle = hstack([Xtrain, title_bow_train])
    Xval_wtitle = hstack([Xval, title_bow_val])
    
    mdl_lgbm = LGBMClassifier(learning_rate=lr, num_leaves=2 ** max_depth, max_depth=max_depth, 
                         min_child_samples=min_child_samples, subsample=subsample,
                         colsample_bytree=colsample_bytree, bagging_freq=1,n_estimators=n_estimators, random_state=0, 
                         class_weight="balanced", n_jobs=6)
    mdl_lgbm.fit(Xtrain_wtitle, Ytrain)
    
    p_lgbm = mdl_lgbm.predict_proba(Xval_wtitle)[:, 1]
    
    print(roc_auc_score(Yval, p_lgbm))
    
    return -average_precision_score(Yval, p_lgbm)


space = [(1e-3, 1e-1, 'log-uniform'), # lr
          (1, 10), # max_depth
          (1, 20), # min_child_samples
          (0.05, 1.), # subsample
          (0.05, 1.), # colsample_bytree
          (100,1000), # n_estimators
          (1,5), # min_df
          (1,5)] # ngram_range

res = forest_minimize(tune_lgbm, space, random_state=160745, n_random_starts=20, n_calls=50, verbose=1)

Iteration No: 1 started. Evaluating function at random point.
[0.009944912110647982, 5, 1, 0.4677107511929402, 0.49263223036174764, 272, 3, 1]




0.7885262059369202
Iteration No: 1 ended. Evaluation done at random point.
Time taken: 0.4333
Function value obtained: -0.3670
Current minimum: -0.3670
Iteration No: 2 started. Evaluating function at random point.
[0.053887464791860025, 1, 15, 0.7437489153990157, 0.8675167974293533, 549, 3, 4]




0.6750637755102041
Iteration No: 2 ended. Evaluation done at random point.
Time taken: 0.3328
Function value obtained: -0.1070
Current minimum: -0.3670
Iteration No: 3 started. Evaluating function at random point.
[0.004151454520895999, 6, 20, 0.8682075103820793, 0.9491436163200662, 411, 4, 3]




0.6654394712430427
Iteration No: 3 ended. Evaluation done at random point.
Time taken: 0.5446
Function value obtained: -0.1065
Current minimum: -0.3670
Iteration No: 4 started. Evaluating function at random point.
[0.0014099928811969545, 9, 9, 0.6502182010234373, 0.6866210554187129, 828, 5, 2]




0.6839053803339518
Iteration No: 4 ended. Evaluation done at random point.
Time taken: 1.7090
Function value obtained: -0.1796
Current minimum: -0.3670
Iteration No: 5 started. Evaluating function at random point.
[0.08530558241838007, 8, 19, 0.2137736299768322, 0.1313765544201984, 961, 4, 1]




0.5246405380333952
Iteration No: 5 ended. Evaluation done at random point.
Time taken: 0.4441
Function value obtained: -0.0817
Current minimum: -0.3670
Iteration No: 6 started. Evaluating function at random point.
[0.003567949451535685, 10, 19, 0.7232951768944309, 0.7298538828427115, 939, 4, 3]


KeyboardInterrupt: 

In [None]:
res.x

In [92]:
params = [0.0024244874223047865, 5, 2, 0.7821475542656193, 0.12951276930788655, 965, 4, 1]
lr = params[0]
max_depth = params[1]
min_child_samples = params[2]
subsample = params[3]
colsample_bytree = params[4]
n_estimators = params[5]

min_df = params[6]
ngram_range = (1, params[7])

title_vec = TfidfVectorizer(min_df=min_df, ngram_range=ngram_range)
title_bow_train = title_vec.fit_transform(title_train)
title_bow_val = title_vec.transform(title_val)

Xtrain_wtitle = hstack([Xtrain, title_bow_train])
Xval_wtitle = hstack([Xval, title_bow_val])

mdl_lgbm = LGBMClassifier(learning_rate=lr, num_leaves=2 ** max_depth, max_depth=max_depth, 
                     min_child_samples=min_child_samples, subsample=subsample,
                     colsample_bytree=colsample_bytree, bagging_freq=1,n_estimators=n_estimators, random_state=0, 
                     class_weight="balanced", n_jobs=6)
mdl_lgbm.fit(Xtrain_wtitle, Ytrain)

p_lgbm = mdl_lgbm.predict_proba(Xval_wtitle)[:, 1]

average_precision_score(Yval, p_lgbm), roc_auc_score(Yval, p_lgbm)



(0.4208528794488573, 0.790903293135436)

These are the parameter values that gave the best model results:
* Model learning rate: 0.0024244874223047865
* Model max depth: 5
* Min child samples: 2
* Subsample: 0.7821475542656193
* Colsample by tree: 0.12951276930788655
* Number of estimators: 965
* Min df for title vectorizer: 4
* Ngram range for title vectorizer: 1 (upper bound, i.e. ngram_range=(1, 1))
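Because the optimizer passes the parameters as a flat, positional list, it can help to map them to named arguments in one place. The helper below is a hypothetical sketch (the names `PARAM_NAMES` and `unpack_params` are not part of the notebook); it only builds dictionaries and does not train anything.

```python
# order follows the search space defined for the tuning function
PARAM_NAMES = ['learning_rate', 'max_depth', 'min_child_samples', 'subsample',
               'colsample_bytree', 'n_estimators', 'min_df', 'max_ngram']

def unpack_params(params):
    """Split a flat parameter list into model kwargs and vectorizer kwargs."""
    named = dict(zip(PARAM_NAMES, params))
    vectorizer_kwargs = {'min_df': named.pop('min_df'),
                         'ngram_range': (1, named.pop('max_ngram'))}
    # num_leaves is derived from max_depth, as in the tuning function
    model_kwargs = dict(named, num_leaves=2 ** named['max_depth'])
    return model_kwargs, vectorizer_kwargs

best = [0.0024244874223047865, 5, 2, 0.7821475542656193,
        0.12951276930788655, 965, 4, 1]
model_kwargs, vectorizer_kwargs = unpack_params(best)
print(model_kwargs)
print(vectorizer_kwargs)
```

The resulting dictionaries could then be passed as `LGBMClassifier(**model_kwargs, ...)` and `TfidfVectorizer(**vectorizer_kwargs)`, which avoids repeating the index-by-index unpacking in both the tuning function and the final training cell.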

The results obtained were:
* Average precision score: 0.4208528794488573
* ROC AUC score: 0.790903293135436

#### 5.3 Logistic Regression

In [93]:
Xtrain_wtitle2 = csr_matrix(Xtrain_wtitle.copy())
Xval_wtitle2 = csr_matrix(Xval_wtitle.copy())

lr_pipeline = make_pipeline(MaxAbsScaler(), LogisticRegression(C=0.5, penalty='l2',n_jobs=4, random_state=0))
lr_pipeline.fit(Xtrain_wtitle2, Ytrain)

p_lr = lr_pipeline.predict_proba(Xval_wtitle2)[:, 1]

average_precision_score(Yval, p_lr), roc_auc_score(Yval, p_lr)

(0.33858537894011725, 0.7526379870129871)

The results obtained for a Logistic Regression model were:
* Average precision score: 0.33858537894011725
* ROC AUC score: 0.7526379870129871

Both scores are below those obtained with the Random Forest and the tuned LightGBM classifiers.
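Unlike the tree-based models, Logistic Regression is sensitive to feature scale, and raw view counts are many orders of magnitude larger than TF-IDF weights. The sketch below, with made-up numbers, illustrates why MaxAbsScaler suits a sparse matrix: it divides each column by its largest absolute value and does not shift the data, so zeros stay zeros.

```python
import numpy as np
from scipy.sparse import csr_matrix, issparse
from sklearn.preprocessing import MaxAbsScaler

# made-up matrix: one large 'views' column followed by two TF-IDF-like columns
X_sparse = csr_matrix(np.array([[175135.0, 0.0, 0.5],
                                [384.0,    0.3, 0.0],
                                [4152.0,   0.0, 0.25]]))

X_scaled = MaxAbsScaler().fit_transform(X_sparse)

# every column now has a maximum absolute value of 1, and no new non-zeros were created
col_max = abs(X_scaled).max(axis=0).toarray().ravel()
print(col_max, X_scaled.nnz, issparse(X_scaled))
```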

#### 5.4 Final Results per model
For easier comparison, here are all the results together:
* Random Forest
  * Average precision score: 0.35282168514773743
  * ROC AUC score: 0.7605229591836733
* LightGBM
  * Average precision score: 0.4208528794488573
  * ROC AUC score: 0.790903293135436
* Logistic Regression
  * Average precision score: 0.33858537894011725
  * ROC AUC score: 0.7526379870129871

### 6. Ensemble
In this phase, we will combine the models to improve the performance of our solution.

In [96]:
p = (p_lr + p_rf + p_lgbm)/3
average_precision_score(Yval, p), roc_auc_score(Yval, p)

(0.41809284106687244, 0.7803513450834879)

In [97]:
pd.DataFrame({"LR": p_lr, "RF": p_rf, "LGBM": p_lgbm}).corr()

Unnamed: 0,LR,RF,LGBM
LR,1.0,0.718617,0.685078
RF,0.718617,1.0,0.666275
LGBM,0.685078,0.666275,1.0


The matrix above shows that our two best models, Random Forest and LightGBM, have a prediction correlation of about 0.67, the lowest of the three pairs. Since their predictions are not highly correlated, combining them may produce a solution that generalizes better in production, so we will look for a good proportion to blend these two.

In [98]:
p_1 = 0.3*p_rf + 0.7*p_lgbm
print('0.3 Random Forest, 0.7 LightGBM: ', average_precision_score(Yval, p_1), roc_auc_score(Yval, p_1))
p_2 = 0.4*p_rf + 0.6*p_lgbm
print('0.4 Random Forest, 0.6 LightGBM: ', average_precision_score(Yval, p_2), roc_auc_score(Yval, p_2))
p_3 = 0.5*p_rf + 0.5*p_lgbm
print('0.5 Random Forest, 0.5 LightGBM: ', average_precision_score(Yval, p_3), roc_auc_score(Yval, p_3))
p_4 = 0.6*p_rf + 0.4*p_lgbm
print('0.6 Random Forest, 0.4 LightGBM: ', average_precision_score(Yval, p_4), roc_auc_score(Yval, p_4))
p_5 = 0.7*p_rf + 0.3*p_lgbm
print('0.7 Random Forest, 0.3 LightGBM: ', average_precision_score(Yval, p_5), roc_auc_score(Yval, p_5))

0.3 Random Forest, 0.7 LightGBM:  0.43007489476435995 0.7870477736549165
0.4 Random Forest, 0.6 LightGBM:  0.4119087106614182 0.784235853432282
0.5 Random Forest, 0.5 LightGBM:  0.4026134041804759 0.7818297773654916
0.6 Random Forest, 0.4 LightGBM:  0.3985699350905097 0.7795396567717997
0.7 Random Forest, 0.3 LightGBM:  0.3864433615813018 0.774930426716141


Based on the results above and on those of the individual models, we will use the proportion 0.3 Random Forest and 0.7 LightGBM. This blend has the highest average precision of the combinations tested (0.430, versus 0.421 for LightGBM alone), with a ROC AUC (0.787) close to that of LightGBM alone (0.791), which was the best single model at classifying our videos.

We will now save our models and vectorizer for use in deployment.

In [99]:
jb.dump(mdl_lgbm, "lgbm_20200422.pkl.z")
jb.dump(mdl_rf, "random_forest_20200422.pkl.z")
jb.dump(title_vec, "title_vectorizer_20200422.pkl.z")

['title_vectorizer_20200422.pkl.z']
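At deployment time, the saved objects would be loaded back with `jb.load` and a new video scored with the chosen 0.3/0.7 blend. The sketch below shows one possible shape for that step; `score_video` is a hypothetical function, and the demo uses tiny stand-in models (a Logistic Regression takes the place of LightGBM) so that it runs without the saved files. The column order (views, views_per_day, then title TF-IDF) must match the one used in training.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def score_video(title, views, views_per_day, title_vec, mdl_rf, mdl_lgbm,
                w_rf=0.3, w_lgbm=0.7):
    """Blend the positive-class probabilities of two fitted models for one video."""
    numeric = csr_matrix(np.array([[views, views_per_day]], dtype=float))
    X = hstack([numeric, title_vec.transform([title])]).tocsr()
    p_rf = mdl_rf.predict_proba(X)[0, 1]
    p_lgbm = mdl_lgbm.predict_proba(X)[0, 1]
    return w_rf * p_rf + w_lgbm * p_lgbm

# In deployment the three objects would come from the saved files, for example:
#   title_vec = jb.load("title_vectorizer_20200422.pkl.z")
# Below, made-up data and stand-in models only make the sketch runnable.
titles = ['kaggle tutorial', 'machine learning lecture', 'funny cats', 'cooking pasta']
labels = [1, 1, 0, 0]
numeric = np.array([[1000, 10.0], [5000, 25.0], [300, 1.0], [800, 4.0]])

demo_vec = TfidfVectorizer().fit(titles)
X_demo = hstack([csr_matrix(numeric), demo_vec.transform(titles)]).tocsr()
demo_rf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, labels)
demo_second = LogisticRegression().fit(X_demo, labels)  # stand-in for the LightGBM model

score = score_video('kaggle lecture', 2000, 12.0, demo_vec, demo_rf, demo_second)
print(score)
```

A single vectorizer serves both models here because the Random Forest and the tuned LightGBM were both trained on title features built with min_df=4 and unigrams.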