### About this Notebook

This is a Jupyter notebook. You can either read the PDF or open the notebook with Jupyter to run the code yourself. If you have not installed Jupyter yet, the easiest way is to install [Anaconda](https://www.anaconda.com/).

Once installed, run `jupyter notebook` in your terminal, navigate to the appropriate folder, and open the notebook.

# Sentiment Analysis

"Sentiment analysis (also known as opinion mining or emotion AI) is the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information." [1]

We will discuss two different approaches.

## Keyword Based Analysis

A very simple approach is to compile a list of indicative words for each sentiment to be classified. While this method is robust and explainable, compiling the word lists by hand can be laborious.

Many libraries provide precompiled lists.

In [1]:
sentences = [
    "Computational Social Sciences is super exciting.",
    "Working on projects are very boring."
]

In [2]:
positive_words = [ "good", "awesome", "exciting" ]
negative_words = [ "boring", "bad", "no fun" ]

In [3]:
import numpy as np
NR_SENTIMENTS = 2

counts = np.zeros((len(sentences), NR_SENTIMENTS), dtype=int)

In [4]:
for idx, sentence in enumerate(sentences):
    for word in positive_words:
        if word in sentence:
            counts[idx, 0] += 1
        
    for word in negative_words:
        if word in sentence:
            counts[idx, 1] += 1

In [5]:
for count, sentence in zip(counts, sentences):
    print(f"\"{sentence}\" contains {count[0]} positive and {count[1]} negative words.")

"Computational Social Sciences is super exciting." contains 1 positive and 0 negative words.
"Working on projects are very boring." contains 0 positive and 1 negative words.
"Computational Social Sciences is super exciting." contains 1 positive and 0 negative words.
"Working on projects are very boring." contains 0 positive and 1 negative words.


### Lemmatization

Especially for German, but also for English texts, you will want to lemmatize the text before counting keywords, so that inflected forms are reduced to their base form (e.g., "projects" becomes "project"). [spaCy](https://spacy.io/models/en) is a good library to do this.
After installing spaCy, you will need to choose a [model](https://spacy.io/models/en) and download it with

`python -m spacy download en_core_web_sm`

In [6]:
import spacy
nlp = spacy.load('en_core_web_sm')



In [7]:
sentences_lemmas = []

for sentence in sentences:
    lemmas = nlp(sentence)
    lemmas = ' '.join([x.lemma_ for x in lemmas]) 
    sentences_lemmas.append(lemmas)
    
for sent, lemma in zip(sentences, sentences_lemmas):
    print(f"\"{sent}\" turns into \"{lemma}\"")

"Computational Social Sciences is super exciting." turns into "Computational Social Sciences be super exciting ."
"Working on projects are very boring." turns into "work on project be very boring ."
"Computational Social Sciences is super exciting." turns into "Computational Social Sciences be super exciting ."
"Working on projects are very boring." turns into "work on project be very boring ."


## Deep-Learning Based Approaches

### Setup

Make sure you have installed [Hugging Face Transformers](https://huggingface.co/docs/transformers/installation).

Browse the [Models section](https://huggingface.co/models) to find a pretrained model that fits your needs.
Make sure the model matches your language and target sentiments. You should also look at the data the model was trained on. The closer this data matches your own, the more likely the model will produce reasonable results.

### Downloading the Model

You can download your model and package it into a pipeline in a single command.
This might take a while the first time you execute this command.

GPUs vastly accelerate inference. If you have a recent NVIDIA GPU at hand, pass `device=0` to use it.
If you don't, pass `device=-1` to run on the CPU.

In [8]:
from transformers import pipeline
sentiment_analysis = pipeline("sentiment-analysis",model="siebert/sentiment-roberta-large-english", device=-1)
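
Instead of hard-coding the device, it can be chosen at runtime. A minimal sketch, assuming PyTorch is the backend; `pick_device` is a hypothetical helper, and it falls back to the CPU when PyTorch is not installed or no CUDA GPU is visible:

```python
def pick_device():
    """Return 0 (first CUDA GPU) if one is usable, otherwise -1 (CPU)."""
    try:
        import torch
    except ImportError:
        # Without PyTorch we cannot query CUDA, so assume CPU.
        return -1
    return 0 if torch.cuda.is_available() else -1


device = pick_device()
print("Using device:", device)
```

The result can then be passed as `device=device` when creating the pipeline.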

### Load your Data

You need to load your data into memory and clean it up as far as possible. If you get poor results, you can try to remove emojis, double punctuation, and so on. Also, check if your model is called `cased`, in which case it differentiates between `Trump` and `trump`.
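
A rough sketch of such a cleanup step, using only the standard library. The emoji ranges are an assumption covering a few common Unicode blocks rather than an exhaustive list, `clean_text` is an illustrative helper, and collapsing `...` to `.` may not be what you want for every corpus:

```python
import re

# Rough emoji coverage: a few common Unicode blocks, not an exhaustive list.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, supplemental symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F"                 # variation selector that often follows emojis
    "]+"
)


def clean_text(text):
    """Remove emojis, collapse repeated punctuation, and normalize whitespace."""
    text = EMOJI_PATTERN.sub(" ", text)
    # "!!!" -> "!", "??" -> "?", "..." -> "."
    text = re.sub(r"([!?.,])\1+", r"\1", text)
    # Collapse runs of whitespace left over from the removals.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


print(clean_text("This is great!!! 🎉🎉  Really??"))
```

It is worth comparing model results with and without cleaning on a small sample, since some models are trained on raw social media text and may benefit from the original punctuation.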

### Splitting your Texts into Sentences

Most models work best on individual sentences, so you need to split your texts into single sentences. After splitting and classifying, you will need to come up with a good way to recombine the sentence-level results, e.g., by averaging the scores of all sentences, or by looking at the data passage by passage.

An easy-to-use package for splitting is [Sentence Splitter](https://github.com/mediacloud/sentence-splitter).

In [9]:
from sentence_splitter import split_text_into_sentences
sentences = split_text_into_sentences(
    text='This is a paragraph. It contains several sentences. "But why," you ask?',
    language='en'
)

for sent in sentences:
    print(sent)

This is a paragraph.
It contains several sentences.
"But why," you ask?


### Classifying your Sentences

For this notebook, we will use some sample sentences.

In [10]:
sentences = [
    "Computational Social Sciences is super exciting.",
    "Working on projects can be really boring."
]

In [11]:
scores = sentiment_analysis(sentences)

for score, sent in zip(scores, sentences):
    print(f"\"{sent}\"", 'is classified as', score['label'], "with confidence", score['score'])

"Computational Social Sciences is super exciting." is classified as POSITIVE with confidence 0.9985401630401611
"Working on projects can be really boring." is classified as NEGATIVE with confidence 0.9992350339889526
"Computational Social Sciences is super exciting." is classified as POSITIVE with confidence 0.9985401630401611
"Working on projects can be really boring." is classified as NEGATIVE with confidence 0.9992350339889526


If you have more than 10 sentences, you should split the data into multiple batches. You can also iterate over each sentence individually in that case. `tqdm` is a very good package to monitor your progress.

In [12]:
from tqdm import tqdm

scores = []

for sent in tqdm(sentences, ncols=50):
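    # The pipeline returns a list with one result per input sentence;
    # extend keeps scores as a flat list of result dictionaries.
    scores.extend(sentiment_analysis([ sent ]))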

100%|███████████████| 2/2 [00:00<00:00,  4.23it/s]
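
As an alternative to classifying one sentence at a time, the sentences can be passed in fixed-size batches. A sketch, assuming a `classify` callable that accepts a list of strings and returns one result per string, as `sentiment_analysis` does above; `classify_in_batches`, the batch size, and the stand-in classifier are illustrative:

```python
def classify_in_batches(classify, sentences, batch_size=10):
    """Call classify on consecutive slices of sentences and flatten the results."""
    results = []
    for start in range(0, len(sentences), batch_size):
        batch = sentences[start:start + batch_size]
        # extend keeps results as a flat list, one entry per sentence.
        results.extend(classify(batch))
    return results


def fake_classify(batch):
    """Stand-in for the real pipeline so the sketch runs without a model."""
    return [{"label": "POSITIVE", "score": 1.0} for _ in batch]


demo_sentences = [f"Sentence number {i}." for i in range(25)]
demo_scores = classify_in_batches(fake_classify, demo_sentences, batch_size=10)
print(len(demo_scores))
```

With the real model, the call would be `classify_in_batches(sentiment_analysis, sentences)`.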



#### Footnotes
[1] [Sentiment analysis](https://en.wikipedia.org/wiki/Sentiment_analysis)

## Convert anything to text before working with it

For example, [striprtf](https://pypi.org/project/striprtf/) converts RTF files to plain text.

# Homework

Get your data. Even if you cannot collect all of it yet, show that you are able to obtain it.