# Youtube Analysis 
Proof of concept project for analysing the sentiment of videos returned by searching for a key word (i.e. product name). 

Initial implimentation uses NLP to assess the polarity of comments for each video combining this with other metrics to return an overall sentiment score.

Future additions will integrate a company product list and security identifiers to link companies/ tickers to sentiment scores, allowing for statistical assessment of it's predictive power with regards to stock price.


## Setup

1. Create Youtube Data API key as per the instructions [here](https://developers.google.com/youtube/v3/getting-started) 

2. Create `.env` file containing the key as follows: `YT_API_KEY=[key]`
   
3. Finally, after installing SpaCy in your environment, ensure the language library is installed by running the below command in the terminal:

```shell
                                        spacy download en_core_web_sm
```

#### Dependencies

In [3]:
import os
import pandas as pd

import database
from data_api import youtube
from helpers import dict_search, min_max_scaler

##### Import Google Data API Key and Initiate Youtube API Class Instance


In [4]:
# Import env variables and set API key
from dotenv import load_dotenv # Needed to insure .env file imported into jupyter env
load_dotenv() 

DEVELOPER_KEY = os.environ.get('YT_API_KEY')


# Create YouTube Data API object
yt = youtube(DEVELOPER_KEY)

## Search
Run a search using the key term, returning the IDs of relevant videos ordered by upload date.

In [5]:
# Enter keyword below:
keyword = 'macbook' # Macbook as an example

In [6]:
# Use search method to retrieve IDs
response = yt.search(keyword, order='date')
raw_ids = dict_search(response, ["videoId"], list_depth=2)
ids = [row['videoId'] for row in raw_ids]

In [7]:
# Retrieve general information for each video
raw_stats = yt.video_stats(ids)
clean_stats = dict_search(raw_stats, [
    "id", 
    "title",
    "decription", 
    "channelTitle",
    "channelId", 
    "categoryId", 
    "viewCount", 
    "likeCount", 
    "commentCount", 
    "publishedAt"], list_depth=2)
stats_df = pd.DataFrame(clean_stats)
stats_df.head(5)

Unnamed: 0,id,publishedAt,channelId,title,channelTitle,categoryId,viewCount,likeCount,commentCount
0,kXrdDLwVDkc,2022-02-03T16:15:20Z,UCHFr4p4oy2SluK1bGD0A2OQ,ЗАМЕНА ТЕРМОПАСТЫ НА MACBOOK PRO,Osipov Music,23,380,59,5
1,sgeyzFLxvdg,2022-02-03T15:34:56Z,UC_EkhlajM42KLVI6OkCksFg,ลองใช้ Macbook Air M1 + OBS Studio Live ไป Fac...,Undervlog,26,46,9,1
2,WmahmosgcRc,2022-02-03T15:00:18Z,UCp5VieoKmu5AyNjCgpLIPqg,Ho vinto un MACBOOK PRO da €2000,RedDani,2,67,13,1
3,gOTC4dTvHPo,2022-02-03T13:13:35Z,UCTx3-1Fka35VriJSmqzpaIw,【M1搭載】70万円超えの新型MacBook Proがついに来ました！！！！【16インチ】,カンタの大冒険【人間】,22,4845,885,285
4,RHCn6X-bl0k,2022-02-03T11:04:59Z,UCuw6-7g3ydtDcPJNPwaGkhQ,💻 MacBook 😍😍💻unboxing,Dinu & Yashi,19,306,13,2


In [8]:
# Retrieve top level comment threads for each video to be used to guage polarity
raw_comments = yt.commentThread(ids)
comments = dict_search(raw_comments, [
    "videoId",
    "textDisplay",
    "publishedAt"
    ], list_depth=2)
comments_df = pd.DataFrame(comments)

In [9]:
# Rename id, comments and comment publishedAt columns and merge with stats dataframe
stats_df.rename(columns={'id':'videoId'}, inplace=True)
comments_df.rename(columns={'textDisplay':'comment', 'publishedAt':'commentDate'}, inplace=True)
merged_df = pd.merge(stats_df, comments_df, how='left', on='videoId')
merged_df['comment'].fillna('', inplace=True)

In [10]:
# Retrieve channel stats for each video and merge with other dataframe
raw_channelStats = yt.channel(stats_df['channelId'].to_list(), part="statistics")
channelStats = dict_search(raw_channelStats, [
    "id", 
    "subscriberCount", 
    "videoCount"
    ], list_depth=2)
channel_df = pd.DataFrame(channelStats)

# Rename ID column and merge
channel_df.rename(columns={'id':'channelId'}, inplace=True)
merged_df = pd.merge(merged_df, channel_df, how='left', on='channelId')

## Sentiment Analysis

We assess that sentiment can be summarised by: <br>
<br>
$\text{Sentiment} = \dfrac{\sum\text{Comment Polarity}}{\text{Video comment Count}} \times \dfrac{\text{Video Views}}{\text{Channel Subscribers}}$

In [11]:
# Import sentiment object for analysis
from analysis import sentiment

# Analyse each comment and give polarity score
# 1: Positive, 0: Neutral, -1: Negative
comment_list = merged_df['comment'].astype(str).to_list()
s = sentiment(comment_list)
merged_df['comment_polarity'] = s.polarity()

In [13]:
# amend data type in count columns from string to integers
merged_df[['likeCount', 'viewCount', 'commentCount', 'subscriberCount']] = merged_df[['likeCount', 'viewCount', 'commentCount', 'subscriberCount']].astype(int)
for column in ['publishedAt', 'commentDate']:
    merged_df[column] = merged_df[column].astype('datetime64').dt.normalize()

df = merged_df.copy()

# Polarity scaled by comment count
df['comment_polarity'] /= df['commentCount']

df['view_sub_ratio'] = df['viewCount'] / df['subscriberCount']
df['like_view_ratio'] = df['likeCount'] / df['viewCount']
df['comment_view_ratio'] = df['commentCount'] / df['viewCount']

In [14]:
import matplotlib.pyplot as plt

# Sentiment time series
time_series = df.copy()
time_series = time_series.groupby('commentDate').agg(polarity=('comment_polarity','mean'), count = ('commentDate','size')).reset_index()
time_series.fillna(0, inplace=True)

In [15]:
# Normalise comment polarity
time_series['comment_polarity'] = min_max_scaler(time_series['comment_polarity'])


KeyError: 'comment_polarity'

In [None]:

# Groupby, summing polarity of comments for each video ID
df = df.groupby(['videoId','view_sub_ratio', 'like_view_ratio', 'comment_view_ratio', 'subscriberCount']).agg({'comment_polarity':['sum']}).reset_index()
df.columns = df.columns.droplevel(1)

# Create video sentiment score
df['sentiment'] = df['comment_polarity']*df['view_sub_ratio']

df.head(5)


## Stock Price Predictive Power

\*\*TODO\*\*