### *Requirements*
*To use this notebook, you need an export of your search results from SolrWayback (see **[SolrWayback > Export](https://nlnwa.github.io/research-services/docs/solrwayback/solrwayback-5export.html)**).*

*The exported data:*
- *must be in the JSONL format,*
- *must contain the **'title'** field,*
- *should also contain the fields **'warc_key_id'** and **'domain'**.*

*It is highly recommended that the number of exported results is below 20,000. If your data is based on a search result with vastly more hits, you should reduce the scope, e.g. by applying facets for specific domains or crawl years.*
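
The requirements above can be checked before running the rest of the notebook. The following is a minimal sketch using only the standard library; the function name `check_export` and its parameters are hypothetical and not part of SolrWayback.

```python
import json

def check_export(path, required=("title",),
                 recommended=("warc_key_id", "domain"), max_rows=20000):
    """Count records in a JSONL file and how many lack each expected field."""
    n_rows = 0
    missing = {field: 0 for field in required + recommended}
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # ignore blank lines
            record = json.loads(line)
            n_rows += 1
            for field in missing:
                if field not in record:
                    missing[field] += 1
    return {
        "rows": n_rows,
        "missing": missing,
        "within_limit": n_rows < max_rows,  # recommended: below 20000 results
    }
```

Calling `check_export('../data/solrwayback_regjeringen-no.jsonl')` would return the number of records, a per-field count of records where the field is absent, and whether the export stays under the recommended size.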

# Sentiment Analysis of Webpage Titles

### What is sentiment analysis?
Sentiment analysis is a computational technique to interpret and classify the emotional tone or connotations of a text.

### Analysing sentiments of document titles
This notebook allows you to analyse the sentiments of document titles exported from SolrWayback.

It is based on a naive approach, using the [Norwegian Sentiment Lexicon](https://github.com/ltgoslo/norsentlex) from the Language Technology Group (LTG) at the Department of Informatics, University of Oslo.

A sentiment lexicon is simply a list of potentially sentiment-bearing words and their prior positive/negative polarity. This comes with several shortcomings, since the polarity values are context-independent. However, the simplicity makes it possible to run the analysis with limited computer resources, and without training models for specific genres or domains.
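
One consequence of context-independent polarity is that negation goes unnoticed. The sketch below illustrates this with a toy lexicon; the entries are made up for illustration and are not taken from the Norwegian Sentiment Lexicon.

```python
# Toy lexicon (hypothetical entries, for illustration only)
toy_positive = {"bra", "god"}
toy_negative = {"dårlig", "krise"}

def toy_score(text):
    """Sum +1 for each positive word and -1 for each negative word."""
    score = 0
    for word in text.lower().split():
        if word in toy_positive:
            score += 1
        elif word in toy_negative:
            score -= 1
    return score

print(toy_score("Dette er bra"))       # → 1
print(toy_score("Dette er ikke bra"))  # → 1, the negation "ikke" is ignored
```

Both sentences receive the same score, although the second one means the opposite of the first.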

If sentiment analysis is a pivotal part of your methodology, and you have sufficient computational resources, you can consider training your own models or using pre-trained models for more fine-grained analysis.

## Import packages
Before starting, we must import the necessary Python libraries.

To run a code cell, make sure it is selected and then press <kbd>Shift</kbd> + <kbd>Enter</kbd>.

In [None]:
import pandas as pd
import json

## Load data from SolrWayback into Pandas dataframe

First, you need to load the data you exported from SolrWayback. You can change the name and path of your exported JSONL file in the cell below. 

In [None]:
# Replace the path below with the path to your exported file
solrwb_corpus_titles = '../data/solrwayback_regjeringen-no.jsonl'

# Reading the .jsonl file line by line into a list of dictionaries
data_list = []

with open(solrwb_corpus_titles, 'r', encoding='utf-8') as f:
    for line in f:
        if line.strip():  # skip blank lines, which are not valid JSON
            data_list.append(json.loads(line))
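
As an alternative to the loop above, pandas can read line-delimited JSON directly with `pd.read_json(..., lines=True)`. The sketch below uses a small in-memory sample with made-up records instead of a real export, so it runs without any file.

```python
import io
import pandas as pd

# Two made-up records in JSONL form (one JSON object per line)
sample_jsonl = (
    '{"title": "Forside", "domain": "example.no"}\n'
    '{"title": "Om oss", "domain": "example.no"}\n'
)

# lines=True tells pandas to parse each line as a separate JSON object
df_sample = pd.read_json(io.StringIO(sample_jsonl), lines=True)
print(df_sample.shape)  # → (2, 2)
```

Note that `read_json` may try to convert the types of some columns, so the explicit loop can be the safer choice if you want the values exactly as exported.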

Then, we read the data into a DataFrame.

In [None]:
# Create DataFrame
df = pd.DataFrame(data_list)

Displaying the DataFrame allows us to see the names of the columns and the values of the first and last 5 rows.

In [None]:
display(df)

## Removing rows without a title

Before processing, we want to remove rows where the title is missing (NaN).

In [None]:
# Remove rows where the title is missing
df = df.dropna(subset=['title'])
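
`dropna` only removes missing values; titles that are empty or consist solely of whitespace remain. Whether your export contains such titles is an assumption you would need to check. A minimal sketch on made-up data:

```python
import pandas as pd

# Made-up titles: one real, one missing, one whitespace-only, one empty
df_demo = pd.DataFrame({"title": ["Forside", None, "   ", ""]})

# Step 1: drop missing values, as in the cell above
df_clean = df_demo.dropna(subset=["title"])

# Step 2: additionally drop titles that are empty after stripping whitespace
df_clean = df_clean[df_clean["title"].str.strip() != ""]

print(len(df_clean))  # → 1
```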

In [None]:
display(df)

## Prepare classification of document titles' sentiment

### Loading sentiment lexica

Now, you need to load the positive and negative lexica. (To speed up processing and reduce the look-up time, we load them into sets.)

In [None]:
# Load lexica from text files into sets for fast membership tests
positive_words = set()
negative_words = set()

with open('../resources/Fullform_Positive_lexicon.txt', 'r', encoding='utf-8') as f:
    for line in f:
        positive_words.add(line.strip())

with open('../resources/Fullform_Negative_lexicon.txt', 'r', encoding='utf-8') as f:
    for line in f:
        negative_words.add(line.strip())
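
The choice of container matters for look-up time: testing membership in a list scans the elements one by one, while a set uses hashing. The sketch below compares the two with `timeit` on a synthetic vocabulary (made up for illustration; the actual timings depend on your machine).

```python
import timeit

# Synthetic vocabulary standing in for a real lexicon
words_list = [f"word{i}" for i in range(10_000)]
words_set = set(words_list)

# The last element is the worst case for a linear scan of the list
probe = "word9999"

list_time = timeit.timeit(lambda: probe in words_list, number=1_000)
set_time = timeit.timeit(lambda: probe in words_set, number=1_000)

print(f"list: {list_time:.4f} s, set: {set_time:.4f} s")
```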

Then, you create the function that will perform the sentiment analysis.

The first step will split the title of each document into single words.

The function then looks up each word in the positive and negative lexica. Each word found in the positive lexicon adds 1 to the sentiment score, while each word found in the negative lexicon subtracts 1.

If the resulting sum for the title is above 0, the title will be classified as 'Positive'. If the sum is below 0, it will be classified as 'Negative'. If the sum is 0, the title will be classified as 'Neutral'.

In [None]:
# Sentiment analysis function
def sentiment_analysis(title):
    sentiment_score = 0
    words = title.split()
    
    for word in words:
        if word.lower() in positive_words:
            sentiment_score += 1
        elif word.lower() in negative_words:
            sentiment_score -= 1
    
    if sentiment_score > 0:
        return 'Positive'
    elif sentiment_score < 0:
        return 'Negative'
    else:
        return 'Neutral'
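
Because the function splits on whitespace only, a word followed by punctuation (such as `krise,`) will not match its lexicon entry. The following is a sketch of a possible variant, not the notebook's method: the name `score_title` and the toy lexica are hypothetical, and it also returns the numeric score so the classification can be inspected.

```python
import string

def score_title(title, positive, negative):
    """Return (score, label) for a title, ignoring surrounding ASCII punctuation."""
    score = 0
    for token in title.lower().split():
        word = token.strip(string.punctuation)  # "krise," becomes "krise"
        if word in positive:
            score += 1
        elif word in negative:
            score -= 1
    if score > 0:
        label = 'Positive'
    elif score < 0:
        label = 'Negative'
    else:
        label = 'Neutral'
    return score, label

# Toy lexica (made-up entries, for illustration only)
toy_positive = {"god", "glede"}
toy_negative = {"krise", "dårlig"}

print(score_title("God nyhet: krise avverget!", toy_positive, toy_negative))  # → (0, 'Neutral')
print(score_title("Krise, krise og mer krise.", toy_positive, toy_negative))  # → (-3, 'Negative')
```

The first title shows how one positive and one negative word cancel out, which is why 'Neutral' can mean either no sentiment-bearing words or mixed ones.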

## Applying sentiment analysis to the DataFrame

Now, we are ready to perform the sentiment analysis. The code below will also output ...

In [None]:
df['sentiment'] = df['title'].apply(sentiment_analysis)

# Show DataFrame
display(df)
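
If your export includes the recommended 'domain' field, the classified titles can also be summarised per domain. The sketch below uses a small made-up DataFrame in place of `df`; `pd.crosstab` counts the titles for each combination of domain and sentiment.

```python
import pandas as pd

# Made-up data standing in for the classified DataFrame
df_demo = pd.DataFrame({
    "domain": ["a.no", "a.no", "b.no", "b.no", "b.no"],
    "sentiment": ["Positive", "Neutral", "Negative", "Neutral", "Neutral"],
})

# Rows: domains, columns: sentiment labels, cells: number of titles
by_domain = pd.crosstab(df_demo["domain"], df_demo["sentiment"])
print(by_domain)

# Share of each sentiment label across the whole sample
shares = df_demo["sentiment"].value_counts(normalize=True)
print(shares)
```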

## Visualise sentiments in corpus
After classifying the titles' sentiments, we can visualise how they are distributed across the corpus.

In [None]:
import plotly.express as px
from collections import Counter

In [None]:
# Extract sentiment data from DataFrame
sentiments = df['sentiment'].tolist()

# Count the occurrences of each sentiment
sentiment_counts = Counter(sentiments)

# Visualize using Plotly
fig = px.bar(
    x=list(sentiment_counts.keys()),
    y=list(sentiment_counts.values()),
    labels={'x': 'Sentiment', 'y': 'Count'},
    title='Sentiment Distribution',
)
fig.show()