# Project Milestone
<p> Showcase the progress you've achieved and how you're integrating theoretical knowledge with practical experimentation.

In your milestone project, include a section on the theoretical foundation you have started to explore. This section should give a thorough overview of the key concepts, theories, and relevant literature that form the basis of your research. If you are using many tools from sklearn, I expect you to study at least one of them in depth, even a simple one. For example, if you use logistic regression as a classifier, write a few paragraphs explaining how it works. You can follow any notes or textbook you like.

Also include a section on the preliminary experiments you have carried out. Detail the experimental setup, the data used, and any initial findings or observations. Emphasize any challenges you faced and how you resolved them, and outline what you expect to accomplish by the end of the project.

## Import Packages

In [32]:
import pandas as pd
import numpy as np
import os

import nltk

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

## Data Preparation and Transformation
<p> Data will be cleaned and prepared for analysis with NumPy and Pandas by combining the two datasets so that they cover videos from 2007 to the present. Columns will be broken down into more specific features. For example, in the YouTube Trending Video Dataset (updated daily), the publish date column will be split into the day of the week and the time of day at which the video was published. More information will be extracted from the title and description by adding word count, character count, and capital letter count.
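
As a rough illustration of the combining step described above, the sketch below stacks two small made-up frames with `pd.concat` and drops duplicate videos. The column names and values are placeholders chosen for the example, not the schema of either real dataset, and the sketch assumes both sources have already been renamed to shared column names.

```python
import pandas as pd

# Two toy frames standing in for the two datasets (hypothetical values and schema)
older = pd.DataFrame({
    "video_id": ["a1", "b2"],
    "title": ["First video", "Second video"],
    "view_count": [100, 250],
})
newer = pd.DataFrame({
    "video_id": ["b2", "c3"],
    "title": ["Second video", "Third video"],
    "view_count": [300, 75],
})

# Stack the rows, then keep the last record seen for each video
combined = (
    pd.concat([older, newer], ignore_index=True)
      .drop_duplicates(subset="video_id", keep="last")
      .reset_index(drop=True)
)
print(combined)
```

Keeping the last record is only one possible policy; whether the earliest or latest snapshot of a video should win depends on the analysis.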

### Load file and import as DataFrame using Pandas

In [17]:
filename = os.path.join(os.getcwd(), "data", "US_youtube_trending_data.csv")
df = pd.read_csv(filename)

In [18]:
df.head()

Unnamed: 0,video_id,title,publishedAt,channelId,channelTitle,categoryId,trending_date,tags,view_count,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,description
0,3C66w5Z0ixs,I ASKED HER TO BE MY GIRLFRIEND...,2020-08-11T19:20:14Z,UCvtRTOMP2TqYqu51xNrqAzg,Brawadis,22,2020-08-12T00:00:00Z,brawadis|prank|basketball|skits|ghost|funny vi...,1514614,156908,5855,35313,https://i.ytimg.com/vi/3C66w5Z0ixs/default.jpg,False,False,SUBSCRIBE to BRAWADIS ▶ http://bit.ly/Subscrib...
1,M9Pmf9AB4Mo,Apex Legends | Stories from the Outlands – “Th...,2020-08-11T17:00:10Z,UC0ZV6M2THA81QT9hrVWJG3A,Apex Legends,20,2020-08-12T00:00:00Z,Apex Legends|Apex Legends characters|new Apex ...,2381688,146739,2794,16549,https://i.ytimg.com/vi/M9Pmf9AB4Mo/default.jpg,False,False,"While running her own modding shop, Ramya Pare..."
2,J78aPJ3VyNs,I left youtube for a month and THIS is what ha...,2020-08-11T16:34:06Z,UCYzPXprvl5Y-Sf0g4vX-m6g,jacksepticeye,24,2020-08-12T00:00:00Z,jacksepticeye|funny|funny meme|memes|jacksepti...,2038853,353787,2628,40221,https://i.ytimg.com/vi/J78aPJ3VyNs/default.jpg,False,False,I left youtube for a month and this is what ha...
3,kXLn3HkpjaA,XXL 2020 Freshman Class Revealed - Official An...,2020-08-11T16:38:55Z,UCbg_UMjlHJg_19SZckaKajg,XXL,10,2020-08-12T00:00:00Z,xxl freshman|xxl freshmen|2020 xxl freshman|20...,496771,23251,1856,7647,https://i.ytimg.com/vi/kXLn3HkpjaA/default.jpg,False,False,Subscribe to XXL → http://bit.ly/subscribe-xxl...
4,VIUo6yapDbc,Ultimate DIY Home Movie Theater for The LaBran...,2020-08-11T15:10:05Z,UCDVPcEbVLQgLZX0Rt6jo34A,Mr. Kate,26,2020-08-12T00:00:00Z,The LaBrant Family|DIY|Interior Design|Makeove...,1123889,45802,964,2196,https://i.ytimg.com/vi/VIUo6yapDbc/default.jpg,False,False,Transforming The LaBrant Family's empty white ...


In [19]:
df.columns

Index(['video_id', 'title', 'publishedAt', 'channelId', 'channelTitle',
       'categoryId', 'trending_date', 'tags', 'view_count', 'likes',
       'dislikes', 'comment_count', 'thumbnail_link', 'comments_disabled',
       'ratings_disabled', 'description'],
      dtype='object')

In [20]:
df.dtypes

video_id             object
title                object
publishedAt          object
channelId            object
channelTitle         object
categoryId            int64
trending_date        object
tags                 object
view_count            int64
likes                 int64
dislikes              int64
comment_count         int64
thumbnail_link       object
comments_disabled      bool
ratings_disabled       bool
description          object
dtype: object

In [21]:
dropped_columns = ['video_id', 'channelId', 'thumbnail_link', 'trending_date']
df = df.drop(dropped_columns, axis=1)

### Title and Description Count Columns

In [22]:
def count_upper(text):
    if not text:
        return 0
    return sum(1 for char in text if char.isupper())

def count_lower(text):
    if not text:
        return 0
    return sum(1 for char in text if char.islower())

In [23]:
# character and word count
df['title_cc'] = df['title'].str.len()
df['title_wc'] = df['title'].str.split().str.len()

# upper and lower case count
df['title_upper'] = df['title'].apply(count_upper)
df['title_lower'] = df['title'].apply(count_lower)

# upper and lower ratio
df['title_upper_ratio'] = df['title_upper'] / df['title_cc']
df['title_lower_ratio'] = df['title_lower'] / df['title_cc']
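
As a quick sanity check on the title features, the sketch below recomputes them for the first title shown in `df.head()` using vectorized string methods. It is an illustration rather than the notebook's own code: the regex counts only ASCII letters, whereas `count_upper` and `count_lower` rely on `str.isupper` and `str.islower`, which also cover non-ASCII letters.

```python
import pandas as pd

# One-row frame using the first title shown in df.head()
sample = pd.DataFrame({"title": ["I ASKED HER TO BE MY GIRLFRIEND..."]})

sample["title_cc"] = sample["title"].str.len()               # characters
sample["title_wc"] = sample["title"].str.split().str.len()   # whitespace-separated words
sample["title_upper"] = sample["title"].str.count(r"[A-Z]")  # ASCII capitals only
sample["title_lower"] = sample["title"].str.count(r"[a-z]")  # ASCII lowercase only
sample["title_upper_ratio"] = sample["title_upper"] / sample["title_cc"]

print(sample.iloc[0].to_dict())
```

The values agree with row 0 of the numeric preview further down: 34 characters, 7 words, 25 capitals, and 0 lowercase letters.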

In [24]:
# convert column to string type (note: missing descriptions become the literal string 'nan')
df['description'] = df['description'].astype(str)

# character and word count
df['description_cc'] = df['description'].str.len()
df['description_wc'] = df['description'].str.split().str.len()

# upper and lower case count
df['description_upper'] = df['description'].apply(count_upper)
df['description_lower'] = df['description'].apply(count_lower)

# upper and lower ratio
df['description_upper_ratio'] = df['description_upper'] / df['description_cc']
df['description_lower_ratio'] = df['description_lower'] / df['description_cc']

### Date and Time Columns

In [25]:
df['publishedAt'] = pd.to_datetime(df['publishedAt'])
df['day_of_week'] = df['publishedAt'].dt.day_name()
df['hour_of_day'] = df['publishedAt'].dt.hour

In [26]:
def categorize_time(hour):
    if 5 <= hour < 12:
        return 'Morning'
    elif 12 <= hour < 17:
        return 'Afternoon'
    elif 17 <= hour < 21:
        return 'Evening'
    return 'Night'


df['time_of_day'] = df['hour_of_day'].apply(categorize_time)
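
The same hour-to-label mapping can also be written without a row-wise `apply`. The sketch below is one possible alternative using `pd.cut`; the bin edges are derived from the rules in `categorize_time`, and `ordered=False` is needed because the label `Night` is used for two separate bins.

```python
import pandas as pd

def categorize_time(hour):
    # Same rules as the cell above
    if 5 <= hour < 12:
        return 'Morning'
    elif 12 <= hour < 17:
        return 'Afternoon'
    elif 17 <= hour < 21:
        return 'Evening'
    return 'Night'

hours = pd.Series(range(24))

# Left-closed bins: [0,5) Night, [5,12) Morning, [12,17) Afternoon, [17,21) Evening, [21,24) Night
bins = [0, 5, 12, 17, 21, 24]
labels = ['Night', 'Morning', 'Afternoon', 'Evening', 'Night']
binned = pd.cut(hours, bins=bins, labels=labels, right=False, ordered=False).astype(str)

# Check that both approaches agree for every hour of the day
expected = hours.apply(categorize_time)
print((binned == expected).all())
```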

In [27]:
df.head()

Unnamed: 0,title,publishedAt,channelTitle,categoryId,tags,view_count,likes,dislikes,comment_count,comments_disabled,...,title_lower_ratio,description_cc,description_wc,description_upper,description_lower,description_upper_ratio,description_lower_ratio,day_of_week,hour_of_day,time_of_day
0,I ASKED HER TO BE MY GIRLFRIEND...,2020-08-11 19:20:14+00:00,Brawadis,22,brawadis|prank|basketball|skits|ghost|funny vi...,1514614,156908,5855,35313,False,...,0.0,361,47,48,233,0.132964,0.645429,Tuesday,19,Evening
1,Apex Legends | Stories from the Outlands – “Th...,2020-08-11 17:00:10+00:00,Apex Legends,20,Apex Legends|Apex Legends characters|new Apex ...,2381688,146739,2794,16549,False,...,0.683333,715,100,38,512,0.053147,0.716084,Tuesday,17,Evening
2,I left youtube for a month and THIS is what ha...,2020-08-11 16:34:06+00:00,jacksepticeye,24,jacksepticeye|funny|funny meme|memes|jacksepti...,2038853,353787,2628,40221,False,...,0.698113,513,29,83,333,0.161793,0.649123,Tuesday,16,Afternoon
3,XXL 2020 Freshman Class Revealed - Official An...,2020-08-11 16:38:55+00:00,XXL,10,xxl freshman|xxl freshmen|2020 xxl freshman|20...,496771,23251,1856,7647,False,...,0.642857,762,94,89,502,0.116798,0.658793,Tuesday,16,Afternoon
4,Ultimate DIY Home Movie Theater for The LaBran...,2020-08-11 15:10:05+00:00,Mr. Kate,26,The LaBrant Family|DIY|Interior Design|Makeove...,1123889,45802,964,2196,False,...,0.636364,2493,209,355,1511,0.142399,0.606097,Tuesday,15,Afternoon


### Video Length

In [28]:
# Potentially include video length in data set by modifying scraper program

### NLTK Similarity and Sentiment Analysis

#### Title, Tag and Description Similarity

#### Sentiment Analysis

### Highest Correlation Columns

In [29]:
df_numeric = df.select_dtypes(include=[np.number])
df_numeric.head()

Unnamed: 0,categoryId,view_count,likes,dislikes,comment_count,title_cc,title_wc,title_upper,title_lower,title_upper_ratio,title_lower_ratio,description_cc,description_wc,description_upper,description_lower,description_upper_ratio,description_lower_ratio,hour_of_day
0,22,1514614,156908,5855,35313,34,7,25,0,0.735294,0.0,361,47,48,233,0.132964,0.645429,19
1,20,2381688,146739,2794,16549,60,10,6,41,0.1,0.683333,715,100,38,512,0.053147,0.716084,17
2,24,2038853,353787,2628,40221,53,11,5,37,0.09434,0.698113,513,29,83,333,0.161793,0.649123,16
3,10,496771,23251,1856,7647,56,8,8,36,0.142857,0.642857,762,94,89,502,0.116798,0.658793,16
4,26,1123889,45802,964,2196,55,9,11,35,0.2,0.636364,2493,209,355,1511,0.142399,0.606097,15


In [30]:
corrs_sorted = df_numeric.corr()['view_count'].sort_values(ascending=False)
corrs_sorted

view_count                 1.000000
likes                      0.877598
comment_count              0.522238
dislikes                   0.354859
description_wc             0.030790
description_cc             0.024974
description_upper          0.022564
description_lower          0.014490
description_upper_ratio   -0.001377
title_upper_ratio         -0.017117
categoryId                -0.023754
title_upper               -0.041277
title_lower_ratio         -0.044342
hour_of_day               -0.052580
title_cc                  -0.053285
title_lower               -0.054569
title_wc                  -0.054660
description_lower_ratio   -0.067634
Name: view_count, dtype: float64
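
The modelling cells further down use a frame called `df_most_corr`, which is not defined in the cells shown here. One plausible way to build such a frame is to keep only the columns whose absolute correlation with `view_count` exceeds a cutoff. The sketch below illustrates that idea on a few of the correlation values printed above; the threshold of 0.3 is a hypothetical choice, not something taken from the notebook.

```python
import pandas as pd

# A few correlations with view_count, copied from the output above
corrs = pd.Series({
    "view_count": 1.000000,
    "likes": 0.877598,
    "comment_count": 0.522238,
    "dislikes": 0.354859,
    "description_wc": 0.030790,
    "title_wc": -0.054660,
    "description_lower_ratio": -0.067634,
})

threshold = 0.3  # hypothetical cutoff, not taken from the notebook
top_columns = corrs[corrs.abs() >= threshold].index.tolist()
print(top_columns)

# In the notebook this could then become (assumption):
# df_most_corr = df_numeric[top_columns]
```

Using the absolute value keeps strongly negative correlations as well, although none of the negative values above come close to the cutoff.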

## Preliminary Prediction Models


### Training and Testing Data Sets

In [48]:
# Create labeled examples
# NOTE: df_most_corr is not defined in the cells shown above
X = df_most_corr.drop(columns='view_count')
y = df_most_corr['view_count']

In [58]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

### Linear Regression Model

In [59]:
# Create linear regression model and fit to training data
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)


# Make predictions on the test data and compute RMSE/R^2
y_lr_pred = lr_model.predict(X_test)
lr_rmse = np.sqrt(mean_squared_error(y_test, y_lr_pred))  # avoids the squared=False argument, deprecated in newer scikit-learn
lr_r2 = r2_score(y_test, y_lr_pred)

# Print RMSE and R^2
print('[LR] RMSE: {:.4f}'.format(lr_rmse))
print('[LR]  R^2: {:.4f}'.format(lr_r2))

[LR] RMSE: 3972503.5786
[LR]  R^2: 0.7787
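
To make the two reported metrics concrete, the sketch below computes RMSE and R^2 by hand with NumPy on a tiny made-up example. The numbers are illustrative only and are unrelated to the YouTube data.

```python
import numpy as np

# Toy targets and predictions (made up for illustration)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 10.0])

residuals = y_true - y_pred

# RMSE: square root of the mean squared residual, in the same units as the target
rmse = np.sqrt(np.mean(residuals ** 2))

# R^2: 1 minus (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"RMSE: {rmse:.4f}, R^2: {r2:.4f}")  # RMSE: 0.8660, R^2: 0.8500
```

Because RMSE is expressed in the units of the target, the value reported above is an error measured in views, while R^2 is unitless.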




### Decision Tree Model

In [60]:
# Parameters for Grid Search
param_grid = {
    'max_depth': [2**i for i in range(10)],
    'min_samples_leaf': [2**i for i in range(10)]
}

# Create a DecisionTreeRegressor model and run a Grid Search with 3-fold cross-validation
dt_regressor = DecisionTreeRegressor()
dt_grid = GridSearchCV(dt_regressor, param_grid, cv=3, scoring='neg_root_mean_squared_error')
dt_grid_search = dt_grid.fit(X_train, y_train)

# Save best parameters to dt_best_params
dt_best_params = dt_grid_search.best_params_

In [61]:
# Create final DecisionTreeRegressor model with the best parameters and fit to training data
dt_model = DecisionTreeRegressor(**dt_best_params)
dt_model.fit(X_train, y_train)

# Make predictions on the test data and compute RMSE/R^2
y_dt_pred = dt_model.predict(X_test)
dt_rmse = np.sqrt(mean_squared_error(y_test, y_dt_pred))  # avoids the squared=False argument, deprecated in newer scikit-learn
dt_r2 = r2_score(y_test, y_dt_pred)

# Print RMSE and R^2
print('[DT] RMSE: {:.4f}'.format(dt_rmse))
print('[DT]  R^2: {:.4f}'.format(dt_r2))

[DT] RMSE: 2618278.9577
[DT]  R^2: 0.9039
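
It can help to look at the whole grid rather than only the winning parameters. The sketch below shows one way to do that on synthetic data from `make_regression`, with a much smaller grid than the one used above. The data, grid values, and variable names are assumptions made for the example and say nothing about the YouTube results.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem standing in for the real features (illustrative only)
X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)

grid = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    {"max_depth": [2, 4, 8], "min_samples_leaf": [1, 4, 16]},
    cv=3,
    scoring="neg_root_mean_squared_error",
)
grid.fit(X_demo, y_demo)

# One row per parameter combination, best (rank 1) first
results = pd.DataFrame(grid.cv_results_)[
    ["param_max_depth", "param_min_samples_leaf", "mean_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(results.head())

# With the default refit=True, the best model is already refit on all the data passed to fit()
best_tree = grid.best_estimator_
print(grid.best_params_, -grid.best_score_)  # the score is negated RMSE, so flip the sign
```

Since `best_estimator_` is already refit, it could be used directly for prediction instead of constructing a second tree from `best_params_`; both routes train on the same data with the same parameters.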




### Random Forest Model

### Gradient Boosting