# <center>YouTube Sentiment Analysis</center>

## <center> Project for the course <br> TECHNOLOGIES FOR ADVANCED PROGRAMMING</center>

### <center>Orazio Sciuto</center>
### <center>Università degli Studi di Catania <br> Master's Degree in Computer Science</center>


## The project

The main goal of this project is to give moderators of YouTube channels a simple and powerful tool for analyzing reactions to posted videos in real time using *Sentiment Analysis*.

## How?

The goal is accomplished by building a data pipeline with Docker and some of the leading big data management technologies.

## Why?

More and more videos are being flooded with negative comments, sometimes for reasons unrelated to the video itself.

![Alt text](mum.gif)

## Project workflow

- Ingestion: using the YouTube API we can retrieve comments from a video and send them to Logstash
- Streaming: using Kafka we can stream the comments to a Spark cluster
- Processing: using Spark we can process the comments and extract the sentiment
- Indexing: using Elasticsearch we can index the comments and their sentiment
- Visualization: using Kibana we can visualize the comments and their sentiment

![Alt text](image-1.png)

![Alt text](image-3.png)

## Data Source

### YouTube

YouTube is an American online video sharing and social media platform headquartered in San Bruno, California, United States. Accessible worldwide, it was launched on February 14, 2005. It is owned by Google and is the second most visited website, after Google Search. YouTube has more than 2.5 billion monthly users, who collectively watch more than one billion hours of videos each day. As of May 2019, videos were being uploaded at a rate of more than 500 hours of content per minute.

[Wikipedia](https://en.wikipedia.org/wiki/YouTube)

![Alt text](image-2.png)

## Data Ingestion

1. **Python Script**
    - Uses the YouTube Data API v3
    - Polls for comments and filters out the ones already seen
    - Sends new comments to Logstash over a socket

2. **Logstash**
    - Receives data through the TCP plugin and sends it to the Kafka topic *youtube*

### YouTube Data API v3 - A small demo

```python
from youtube_api import YouTubeDataAPI

api_key = input("Insert your API key: ")
yt = YouTubeDataAPI(api_key)
VIDEO_URL = input("Insert the video URL: ")
# Fetch the comments of the video, ordered by time
comments = yt.get_video_comments(VIDEO_URL, order_by_time=True)
print("Number of comments: ", len(comments))
print(comments[0])
```

```
Number of comments:  41
{'video_id': '3NwzryQ_MJ8', 'commenter_channel_url': 'http://www.youtube.com/channel/UCHcOKp9qkkC76SDfcKgrW8g', 'commenter_channel_id': 'UCHcOKp9qkkC76SDfcKgrW8g', 'commenter_channel_display_name': 'Silvana Cultrera', 'comment_id': 'UgyFcj-3Exg1XjrthaN4AaABAg', 'comment_like_count': 0, 'comment_publish_date': 1694985822.0, 'text': 'Video spettacolare! Dovresti fare più contenuti così', 'commenter_rating': 'none', 'comment_parent_id': None, 'collection_date': datetime.datetime(2023, 9, 18, 8, 52, 14, 896663), 'reply_count': 0}
```
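
The ingestion steps listed at the start of this section (poll, keep only the new comments, send them to Logstash over a socket) could look roughly like the sketch below. It is not the project's actual script: the function names, the `logstash` host, port `5000`, the 60-second interval, and the assumption that the Logstash TCP input expects newline-delimited JSON are all hypothetical.

```python
import json
import socket
import time


def filter_new_comments(comments, seen_ids):
    """Return only comments whose comment_id is not in seen_ids, and record them."""
    fresh = []
    for comment in comments:
        comment_id = comment["comment_id"]
        if comment_id not in seen_ids:
            seen_ids.add(comment_id)
            fresh.append(comment)
    return fresh


def to_json_line(comment):
    """Serialize one comment as a newline-terminated, UTF-8 encoded JSON document."""
    # default=str covers non-JSON values such as the datetime in 'collection_date'.
    return (json.dumps(comment, default=str, ensure_ascii=False) + "\n").encode("utf-8")


def send_to_logstash(comments, host="logstash", port=5000):
    """Send comments over TCP. Host and port are placeholders for this sketch."""
    with socket.create_connection((host, port)) as conn:
        for comment in comments:
            conn.sendall(to_json_line(comment))


def poll(fetch_comments, interval_seconds=60, send=send_to_logstash):
    """Repeatedly fetch comments and forward only the ones not seen before."""
    seen_ids = set()
    while True:
        new_comments = filter_new_comments(fetch_comments(), seen_ids)
        if new_comments:
            send(new_comments)
        time.sleep(interval_seconds)
```

With the demo objects above, the loop could be started as `poll(lambda: yt.get_video_comments(VIDEO_URL, order_by_time=True))`. Tracking `comment_id` values in a set keeps the filter simple, at the cost of re-downloading the full comment list on every poll.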


## Data Streaming

Streaming involves the continuous and real-time transmission of data from its source to a destination.

**Apache Kafka**

Apache Kafka is an open-source, distributed event streaming platform that is used for building real-time data pipelines and streaming applications. It provides a highly scalable and fault-tolerant way to publish and subscribe to data streams, making it ideal for processing and transmitting large volumes of data in real-time.


## Data Processing

Processing refers to the manipulation, transformation, and analysis of data to extract meaningful insights or perform specific tasks. This step can include data cleansing, aggregation, and computations like **Sentiment Analysis**.

**Apache Spark**

Apache Spark is an open-source distributed general-purpose cluster-computing framework. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
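
As a rough illustration of how Spark can consume the comments published to Kafka, here is a minimal Structured Streaming sketch. It is not the project's code: the broker address `kafka:9092`, the function name, the chosen fields, and the assumption that each Kafka message value is the comment as a flat JSON document are all hypothetical. It also assumes the Spark Kafka connector package is available to the Spark session.

```python
# Subset of the comment fields shown in the API demo; chosen for illustration only.
COMMENT_FIELDS = ("comment_id", "video_id", "text")


def read_comment_stream(spark, bootstrap_servers="kafka:9092", topic="youtube"):
    """Return a streaming DataFrame with one column per entry in COMMENT_FIELDS."""
    # Imported lazily so this module can be loaded without PySpark installed.
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import StringType, StructField, StructType

    schema = StructType(
        [StructField(name, StringType(), True) for name in COMMENT_FIELDS]
    )

    # The Kafka source exposes each message as binary 'key' and 'value' columns.
    raw = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)
        .option("subscribe", topic)
        .load()
    )

    # Cast the value to a string, parse it as JSON, and flatten the struct.
    parsed = raw.select(
        from_json(col("value").cast("string"), schema).alias("comment")
    )
    return parsed.select("comment.*")
```

A sentiment column could then be added to the returned DataFrame before writing it out, for example with a user-defined function wrapping whichever classifier is chosen.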


![image.png](kafkatime.jpeg)

### Sentiment Analysis

Sentiment analysis is the interpretation and classification of emotions (positive, negative and neutral) within text data using text analysis techniques. Sentiment analysis allows businesses to identify customer sentiment toward products, brands or services in online conversations and feedback.

**Spark MLlib or something else?**

- Using Spark MLlib is certainly the fastest option for Italian comments, but the model must be trained on a large amount of data to classify them correctly.
- An alternative is to use a pretrained model, but it is not easy to find a good one for Italian comments.
- Another option is a Python library such as VADER. However, VADER cannot classify Italian comments, so they must first be translated into English, with all the problems that translation can introduce.

![MlLib_bleah](mllib_not_work.jpg)

## Demo Time

Not fully satisfied with the MLlib results, I tried VADER in combination with the `translate` module.

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from translate import Translator

# VADER only handles English, so comments are translated from Italian first
translator = Translator(to_lang="en", from_lang="it")
analyzer = SentimentIntensityAnalyzer()

positive_comment = "Video veramente ottimo, mi è piaciuto molto!"\
                "Spero che continui su questa strada, sei bravo!"
negative_comment = "Video brutto, non mi è piaciuto per niente!"\
        "Spero che non continui su questa strada, non sei bravo!"

print(analyzer.polarity_scores(translator.translate(positive_comment)))
print(analyzer.polarity_scores(translator.translate(negative_comment)))
```

```
{'neg': 0.0, 'neu': 0.429, 'pos': 0.571, 'compound': 0.9396}
{'neg': 0.337, 'neu': 0.549, 'pos': 0.114, 'compound': -0.6908}
```
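
To turn VADER's `compound` score into a single label for indexing, one common choice is the ±0.05 cutoff suggested in the VADER documentation. The helper below is a sketch under that assumption; the function name and the exact label strings are hypothetical, not taken from the project.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Applied to the two compound scores printed in the demo above.
print(label_from_compound(0.9396))   # positive
print(label_from_compound(-0.6908))  # negative
```

The threshold is a parameter so that the neutral band can be widened or narrowed if too many borderline comments end up labeled as positive or negative.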


## New Approach for the future

**FEEL-IT: Emotion and Sentiment Classification for the Italian Language**

Official docs [here](https://github.com/MilaNLProc/feel-it)

```python
from feel_it import EmotionClassifier, SentimentClassifier

emotion_classifier = EmotionClassifier()
print(emotion_classifier.predict(["sono molto felice",
                                  "ma che cavolo vuoi",
                                  "sono molto triste"]))

sentiment_classifier = SentimentClassifier()
print(sentiment_classifier.predict(["sono molto felice",
                                    "ma che cavolo vuoi",
                                    "sono molto triste"]))
```

```
['joy', 'anger', 'sadness']
['positive', 'negative', 'negative']
```


```python
from feel_it import EmotionClassifier, SentimentClassifier

sentiment_classifier = SentimentClassifier()
emotion_classifier = EmotionClassifier()

# Same Italian comments used in the VADER demo, classified without translation
positive_comment = "Video veramente ottimo, mi è piaciuto molto!"\
                "Spero che continui su questa strada, sei bravo!"
negative_comment = "Video brutto, non mi è piaciuto per niente!"\
        "Spero che non continui su questa strada, non sei bravo!"

print(emotion_classifier.predict([positive_comment, negative_comment]))
print(sentiment_classifier.predict([positive_comment, negative_comment]))
```

```
['joy', 'sadness']
['positive', 'negative']
```


## Data Indexing

Indexing involves creating structured references to the data, making it faster and more efficient to search and retrieve information. It's commonly used in databases and search engines. 

This is done using

**Elasticsearch**

Elasticsearch is a search engine based on the Lucene library. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents. Elasticsearch is developed in Java and was originally released as open source under the terms of the Apache License.
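
To make the "schema-free JSON documents over HTTP" idea concrete, the sketch below builds a request body in the newline-delimited format accepted by the Elasticsearch `_bulk` endpoint. It only illustrates the document shape and is not necessarily how the project writes to Elasticsearch; the index name `youtube-comments`, the `sentiment` field, and the function name are hypothetical.

```python
import json


def build_bulk_body(comments, index_name="youtube-comments"):
    """Build an NDJSON body for the _bulk API: one action line per document line."""
    lines = []
    for comment in comments:
        # Reusing comment_id as the document _id means that indexing the same
        # comment twice overwrites it instead of creating a duplicate.
        action = {"index": {"_index": index_name, "_id": comment["comment_id"]}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(comment, ensure_ascii=False))
    if not lines:
        return ""
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"
```

The resulting string would be sent with an HTTP `POST` to the `_bulk` endpoint using the `application/x-ndjson` content type.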

![image.png](elastic.jpeg)

## Data Visualization

Visualization is the graphical representation of information and data. By using visual elements like charts, graphs, and maps, data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data. 

This is done using

**Kibana**

Kibana is an open source data visualization dashboard for Elasticsearch. It provides visualization capabilities on top of the content indexed on an Elasticsearch cluster. Users can create bar, line and scatter plots, or pie charts and maps on top of large volumes of data.


![image.png](vader_dash.png)

![mllib_dash.png](mllib_dash.jpg)

![final](final.jpeg)

Thanks for your attention!