# Youtube Analysis 
Proof of concept project for analysing the sentiment of videos return by searching for a key word (i.e. product name). NLP is used to assess the polarity of comments for each video which is then combined with other metrics to return a overall sentiment score.

Future additions will integrate a company product list and security identifiers to link companies/ tickers to sentiment scores.


## Setup

1. Create Youtube Data API key as per the instructions [here](https://developers.google.com/youtube/v3/getting-started) 

2. Create `.env` file containing the key as follows: `YT_API_KEY=[key]`
   
3. Finally, after installing SpaCy in your environment, ensure the language library is installed by running the below in the terminal:

```shell
                                        spacy download en_core_web_sm
```

#### Dependencies

In [1]:
import database
from data_api import youtube
from pprint import pprint
import os
from helpers import dict_search, min_max_scaler
import pandas as pd

# Fix for async capability in Jupyter Notebooks
import nest_asyncio
nest_asyncio.apply()

##### Import Google Data API Key and Initiate Youtube API Class Instance


In [2]:
# TODO: Convert this to use the .env file before project completion
DEVELOPER_KEY = 'AIzaSyC42N8_Sa6fsoSvG2tFkJNl2XLNYeT0fHk'

# Create YouTube Data API object
yt = youtube(DEVELOPER_KEY)

## Search
Run a search using the key term, returning the IDs of relevant videos ordered by upload date.

In [3]:
keyword = 'macbook' # Using macbooks as an example

# Use search method to retrieve IDs
response = yt.search(keyword, order='date')
raw_ids = dict_search(response, ["videoId"], list_depth=2)
ids = [row['videoId'] for row in raw_ids]

In [4]:
# Retrieve general information for each video
raw_stats = yt.video_stats(ids)
clean_stats = dict_search(raw_stats, [
    "id", 
    "title",
    "decription", 
    "channelTitle",
    "channelId", 
    "categoryId", 
    "viewCount", 
    "likeCount", 
    "commentCount", 
    "publishedAt"], list_depth=2)
stats_df = pd.DataFrame(clean_stats)
stats_df.head(5)

Unnamed: 0,id,publishedAt,channelId,title,channelTitle,categoryId,viewCount,likeCount,commentCount
0,eoR_49F7JB8,2022-01-29T16:08:41Z,UCBVJujdfYisnCWjy_AUgMBQ,MACBOOK at CHEAPEST PRICES 💥 ||SECOND HAND MAC...,Rishu's Squad 2.0,22,796,105,32
1,CeCppNN-I9g,2022-01-29T14:33:19Z,UCCkXU_pKTN98ktKl0TnHjBg,Budget APPLE Products For STUDENTS - iPad & Ma...,Ansh Nakwal,28,7693,1335,111
2,06pFsuSxbTM,2022-01-29T13:31:18Z,UCHLmaPy_UYFvQcUp-FdpxRg,പെട്ടിപൊട്ടിക്കൽ ft. MacBook Air M1 🔥 #Shorts,Sarath'S Neon Tech,28,29306,5586,34
3,WoeQaUC2StA,2022-01-29T13:01:46Z,UCjKcp_NRBj9qJnmHS2mZQ0A,開箱｜Macbook Air M1 Unboxing｜醫學生最需要時間，剪片好快啊｜小廢片｜...,Choco tv,27,136,8,1
4,HJQIYmcRQU4,2022-01-29T13:00:07Z,UCyQobySFx_h9oFwsBV0KGdg,"Benchmark MacBook Pro 16: M1 Max, 64Gb",Tinh tế,28,2014,63,5


In [5]:
# Retrieve top level comment threads for each video to be used to guage polarity
raw_comments = yt.commentThread(ids)
comments = dict_search(raw_comments, [
    "videoId",
    "textDisplay",
    "publishedAt"
    ], list_depth=2)
comments_df = pd.DataFrame(comments)

In [6]:
# Rename id, comments and comment publishedAt columns and merge with stats dataframe
stats_df.rename(columns={'id':'videoId'}, inplace=True)
comments_df.rename(columns={'textDisplay':'comment', 'publishedAt':'commentDate'}, inplace=True)
merged_df = pd.merge(stats_df, comments_df, how='left', on='videoId')
merged_df['comment'].fillna('', inplace=True)

In [7]:
# Retrieve channel stats for each video and merge with other dataframe
raw_channelStats = yt.channel(stats_df['channelId'].to_list(), part="statistics")
channelStats = dict_search(raw_channelStats, [
    "id", 
    "subscriberCount", 
    "videoCount"
    ], list_depth=2)
channel_df = pd.DataFrame(channelStats)

# Rename ID column and merge
channel_df.rename(columns={'id':'channelId'}, inplace=True)
merged_df = pd.merge(merged_df, channel_df, how='left', on='channelId')

## Comment Sentiment

In [8]:
# Import sentiment object for analysis
from analysis import sentiment

# Analyse each comment and give polarity score
# 1: Positive, 0: Neutral, -1: Negative
comment_list = merged_df['comment'].astype(str).to_list()
s = sentiment(comment_list)
merged_df['comment_polarity'] = s.polarity()

## Sentiment Analysis

We assess that sentiment can be summarised by: <br>
<br>
$\text{Sentiment} = \dfrac{\sum\text{Comment Polarity}}{\text{Video comment Count}} \times \dfrac{\text{Video Views}}{\text{Channel Subscribers}}$

In [10]:
# amend data type in count columns from string to integers
merged_df[['likeCount', 'viewCount', 'commentCount', 'subscriberCount']] = merged_df[['likeCount', 'viewCount', 'commentCount', 'subscriberCount']].astype(int)
merged_df[['publishedAt', 'commentDate']] = merged_df[['publishedAt', 'commentDate']].astype('datetime64')

df = merged_df.copy()

# Polarity scaled by comment count
df['comment_polarity'] /= df['commentCount']

df['view_sub_ratio'] = df['viewCount'] / df['subscriberCount']
df['like_view_ratio'] = df['likeCount'] / df['viewCount']
df['comment_view_ratio'] = df['commentCount'] / df['viewCount']

In [11]:
import matplotlib.pyplot as plt

# Sentiment time series
time_series = df.copy()
time_series = time_series.groupby('commentDate').agg(polarity=('comment_polarity','mean'), count = ('commentDate','size')).reset_index()
time_series.fillna(0, inplace=True)

In [13]:
# Normalise comment polarity
time_series['comment_polarity'] = min_max_scaler(time_series['comment_polarity'])


In [None]:

# Groupby, summing polarity of comments for each video ID
df = df.groupby(['videoId','view_sub_ratio', 'like_view_ratio', 'comment_view_ratio', 'subscriberCount']).agg({'comment_polarity':['sum']}).reset_index()
df.columns = df.columns.droplevel(1)

# Create video sentiment score
df['sentiment'] = df['comment_polarity']*df['view_sub_ratio']

df.head(5)


## Stock Price Predictive Power

\*\*TODO\*\*