# Creating and Visualizing Polarity Independence Scores 

This notebook allows for the easy implementation of the sentiment analysis-based portion of the media capture measurement. This method requires a dataset of the following format:

|           | outlet   | date   | text         |
| --------- | -------- | ------ | ------------ |
| article 1 | outlet 1 | date 1 | article text |
| article 2 | ...      | ...    | ...          |

The outlet column should identify
- the names of the specific nominally independent outlets under investigation
- a group of outlets identified as opposition outlets outside of the reach of the regime
- a group of outlets identified as belonging to the regime through state ownership, family, or party ties
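
Before running the notebook, it can help to confirm that the dataset has the three required columns. The following is a minimal sketch using only the standard library. The helper `missing_columns` and the sample rows are hypothetical and not part of the original method.

```python
import csv
import io

# Columns the notebook expects in the dataset.
REQUIRED_COLUMNS = {"outlet", "date", "text"}

def missing_columns(csv_file):
    """Return the required columns that are absent from the CSV header."""
    header = next(csv.reader(csv_file), [])
    return REQUIRED_COLUMNS - set(header)

# Hypothetical two-row dataset standing in for a real file on disk.
sample = io.StringIO(
    "outlet,date,text\n"
    "outlet 1,2021-03-01,First article text.\n"
    "regime group,2021-03-02,Second article text.\n"
)
print(missing_columns(sample))
```

An empty set means all required columns are present. For a real file, pass an open file handle instead of the `io.StringIO` object.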

## Sentiment Classification

This first half of the notebook is concerned with classifying the sentiment of the sentences mentioning the regime leader.

To begin, we need to install and import the required libraries. If you are running this notebook for the first time, uncomment the first two lines to install the sentiment classifiers.

In [None]:
# !pip install pysentimiento
# !pip install transformers

from tqdm import tqdm
import pandas as pd
import re
import pysentimiento
import pickle
from transformers import pipeline

With this code chunk, specify the location of the dataset containing the article texts.

In [None]:
# load dataset with preprocessed text
data_location = 'Data/dataset_token_ready.csv'

df = pd.read_csv(data_location)

The following code chunk prepares the dataset for sentiment analysis. Most importantly, you need to specify how sentences containing mentions of the regime leader should be identified. For example, in our article, we wanted to extract mentions of the president of Nicaragua. For this, we extracted sentences containing the following terms (both with and without capital letters): Ortega, Nuestro Presidente, Presidente de Nicaragua, Comandante Daniel, and Daniel y Rosario.

In code, we used the following string:

```python
regex = r'([Oo]rtega)|([Nn]uestro [Pp]residente)|[Pp]residente de [Nn]icaragua|([Cc]omandante [Dd]aniel)|[Dd]aniel y [Rr]osario'
```
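
As a quick check of how such a pattern behaves, the sketch below applies the regex from our article to a few sentences. The example sentences are invented for illustration only.

```python
import re

# Regex from the article: each term is matched with or without a leading capital.
regex = r'([Oo]rtega)|([Nn]uestro [Pp]residente)|[Pp]residente de [Nn]icaragua|([Cc]omandante [Dd]aniel)|[Dd]aniel y [Rr]osario'

# Hypothetical sentences, invented for this illustration.
examples = [
    "El comandante Daniel habló ayer.",
    "Nuestro Presidente visitó la ciudad.",
    "El clima estuvo soleado.",
]

# True where the sentence mentions the regime leader.
matches = [bool(re.search(regex, s)) for s in examples]
print(matches)
```

Note that the character classes only cover the first letter of each word, so fully capitalised spellings such as ORTEGA would not be matched by this pattern.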

In [None]:

# define regex to search for mentions of regime leader
regex = r'([Ee]xample1)|([Tt]wo-word [Ee]xample)'

# subset articles that contain mentions of regime leader
df = df.loc[df['text'].str.contains(regex, na = False)].reset_index(drop = True)

# split string of texts into list of sentences
df["sentences"] = df.text.apply(lambda x: re.split("[.!?]", x))

# explode rows so that each row contains one sentence
df = df.explode("sentences", ignore_index = True)

# subset sentences that contain mentions of regime leader
df = df.loc[df["sentences"].str.contains(regex)].reset_index(drop = True)

# if sentence is longer than 200 words, keep only a window of about 180 words
# (90 before and 90 from the first mention of the regime leader onward)
def trim_sentence(sentence, regex):
    words = sentence.split()
    if len(words) > 200:
        match = re.search(regex, sentence)
        if match:
            start_index = match.start()
            start_word_index = len(sentence[:start_index].split())
            window_start = max(0, start_word_index - 90)
            window_end = min(len(words), start_word_index + 90)
            return ' '.join(words[window_start:window_end])
    return sentence

df["sentences"] = df["sentences"].apply(lambda x: trim_sentence(x, regex))

# drop text column
df.drop("text", axis = 1, inplace = True)

sentences = df["sentences"].tolist()
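
To make the splitting and filtering steps concrete, here is a small sketch that applies the same logic to a single string, without pandas. The article text is made up, and the regex is the placeholder pattern from the cell above.

```python
import re

# Placeholder pattern, same shape as the one defined in the cell above.
regex = r'([Ee]xample1)|([Tt]wo-word [Ee]xample)'

# Hypothetical article text, invented for illustration.
text = "Example1 opened the summit. The weather was mild! Critics questioned the two-word example?"

# Split on sentence-ending punctuation and drop empty fragments.
sentences = [s.strip() for s in re.split(r"[.!?]", text) if s.strip()]

# Keep only sentences that mention the regime leader.
mentions = [s for s in sentences if re.search(regex, s)]
print(mentions)
```

Splitting on `.`, `!`, and `?` is a simple heuristic. It also splits at abbreviations and decimal points, so some extracted sentences may be fragments.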

The following code chunks specify which sentiment classifier to use. The method works with any classifier, but we provide two examples here. The first code chunk sets up pysentimiento, the classifier we used in our article, which is well suited to Spanish. The second code chunk instead sets up a DistilBERT-based classifier trained by Lik Xun Yuan and described on Hugging Face: https://huggingface.co/lxyuan/distilbert-base-multilingual-cased-sentiments-student. This model is multilingual and works for

- English
- Arabic
- German
- Spanish
- French
- Japanese
- Chinese
- Indonesian
- Hindi
- Italian
- Malay
- Portuguese

In terms of classification results, pysentimiento tends to classify sentences as neutral more often than as negative, and even more often than as positive.

The DistilBERT-based classifier tends to classify very few sentences as neutral and instead finds more positive and negative sentiment.

In [None]:
#################### Execute this code chunk to set pysentimiento as the sentiment analysis tool ####################

analyzer = pysentimiento.create_analyzer(task="sentiment", lang="es")


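# map pysentimiento's short labels (POS/NEG/NEU) to the label names used in the second half of the notebook
label_map = {"POS": "positive", "NEG": "negative", "NEU": "neutral"}

# define function to analyze sentiment of sentence
def analyze_sentences(sentences):
    results = []
    for sentence in tqdm(sentences):
        label = analyzer.predict(sentence).output
        # fall back to the raw label if it is not in the map
        results.append(label_map.get(label, label))
    return results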

In [None]:
#################### Execute this code chunk to set distilbert as the sentiment analysis tool ####################

analyser_distil = pipeline(
    model="lxyuan/distilbert-base-multilingual-cased-sentiments-student", 
    return_all_scores=False
)



# define function to analyze sentiment of sentence
def analyze_sentences(sentences):
    results = []
    for sentence in tqdm(sentences):
        sentiment = analyser_distil(sentence)[0]["label"]
        results.append(sentiment)
    return results

This code chunk executes the sentiment analysis and saves the result in a new column called "sentiment". Be aware that this will take some time. On our machine, pysentimiento runs at about 20 sentences per second and the DistilBERT classifier at around 30, which translates to a computing time of roughly 55 to 83 minutes for 100,000 sentences.
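
To estimate the runtime for your own dataset, you can redo this arithmetic with your sentence count. The sketch below assumes the throughput figures quoted above, which will differ on other hardware. The helper name `estimated_minutes` is our own.

```python
def estimated_minutes(n_sentences, sentences_per_second):
    """Return the approximate classification time in minutes."""
    return n_sentences / sentences_per_second / 60

# Throughput figures from our machine; yours will differ.
print(estimated_minutes(100_000, 20))  # pysentimiento at about 20 sentences/s
print(estimated_minutes(100_000, 30))  # DistilBERT at about 30 sentences/s
```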

In [None]:
results = analyze_sentences(sentences)
df["sentiment"] = results

Finally, this code saves the dataset with the classified sentiment in the chosen location.

In [None]:
save_location = 'chosen/location/filename.pkl'

with open(save_location, "wb") as f:
    pickle.dump(df, f)

## Computing the Polarity Score and the Polarity Independence Score

The following code is used to compute the polarity score for each chosen outlet/group of outlets in the specified time intervals.

The polarity score shows the average sentiment of mentions of the regime leader per time bracket. A score of 1 means that all mentions in the chosen time bracket are positive, -1 would mean that they are exclusively negative, while 0 signifies a balance between positive and negative mentions.
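
As a worked example, the polarity score for one outlet in one time bracket is the share of positive mentions minus the share of negative mentions. The sketch below is a plain-Python illustration with made-up labels. The helper `polarity_score` is hypothetical and assumes a non-empty list of labels.

```python
from collections import Counter

def polarity_score(labels):
    """Share of positive mentions minus share of negative mentions."""
    counts = Counter(labels)
    total = len(labels)  # assumes at least one label
    return (counts["positive"] - counts["negative"]) / total

# Hypothetical labels for one outlet in one time bracket:
# two positive, one negative, and one neutral mention.
labels = ["positive", "positive", "negative", "neutral"]
print(polarity_score(labels))
```

Neutral mentions count towards the total but contribute zero, so they pull the score towards 0.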

The polarity independence score shows the relative difference in polarity between nominally independent outlets and the regime-owned outlets. If an outlet scores 1, it reports as negatively on the regime as the opposition outlets (which are assumed to be independent of the regime). If it scores 0, then it reports in line with the regime preference.
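
The sketch below illustrates the score for a single time bracket, using the same formula as the function later in this notebook but on plain numbers. The polarity values are made up for illustration.

```python
def independence_score(outlet, regime, opposition):
    """Polarity independence score for one outlet in one time bracket."""
    return ((abs(outlet - regime) - abs(outlet - opposition))
            / abs(opposition - regime) + 1) / 2

# Hypothetical polarity scores: regime outlets 0.5, opposition outlets -0.5.
regime, opposition = 0.5, -0.5

print(independence_score(-0.5, regime, opposition))  # outlet matches the opposition
print(independence_score(0.5, regime, opposition))   # outlet matches the regime
print(independence_score(0.0, regime, opposition))   # outlet halfway between the two
```

An outlet that is more positive than the regime outlets, or more negative than the opposition outlets, still scores 0 or 1 respectively, so the score always lies between 0 and 1. It is undefined when regime and opposition polarity are identical, because the denominator is then zero.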

The first code chunk loads the dataset with the classified sentiment. Edit the data_location object to point to the location of the dataset created with the first half of this notebook.

In [None]:
# load dataset with classified sentiment
data_location = 'location/dataset_with_sentiment.pkl'

with open(data_location, "rb") as f:
    df = pickle.load(f)

The following chunk creates a variety of time aggregations.

In [None]:
# Transform date to datetime format
df['date'] = pd.to_datetime(df['date'], format='mixed', errors='coerce')

# create different time periods
df['year'] = df['date'].dt.strftime('%Y')

df["quarter"] = df.date.dt.to_period('Q')
df['quarter'] = df['quarter'].dt.strftime('%Y-%m')

# Create semiannual periods, labelled by the first month of each half-year
def get_semiannual_period(date):
    if pd.isna(date):
        return None
    return f"{date.year}-01" if date.month <= 6 else f"{date.year}-07"

df['semiannual'] = df['date'].apply(get_semiannual_period)

df['year_month'] = df['date'].dt.strftime('%Y-%m')

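# label each week by its first day (Sunday) so that the label can be parsed back into a date later
df['week'] = df['date'].dt.to_period('W-SAT').dt.start_time.dt.strftime('%Y-%m-%d')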

df['day'] = df['date'].dt.strftime('%Y-%m-%d')

Now we choose the time periods for which we want to compute the scores. These can be
- year
- semiannual
- quarter
- year_month
- week
- day

In [None]:
# Change this variable to desired aggregation level
agg_level = "year_month"


df["date"] = df[agg_level]
df["date"] = pd.to_datetime(df["date"])

The following code chunk computes the polarity score and saves the resulting dataframe. You will need to edit the save location to the desired directory.

In [None]:
# Create aggregated overview of sentiment per outlet and time period
df_agg = (df.groupby(["outlet", "date"])["sentiment"]
          .value_counts(normalize=True)
          .rename("proportion")
          .reset_index())

# Create polarity variable
df_agg.loc[df_agg["sentiment"] == "neutral", "polarity"] = 0
df_agg.loc[df_agg["sentiment"] == "positive", "polarity"] = df_agg["proportion"]
df_agg.loc[df_agg["sentiment"] == "negative", "polarity"] = df_agg["proportion"] * -1

# Create polarity aggregated data
df_pol = df_agg.groupby(["outlet", "date"]).agg({"polarity": "sum"}).reset_index()

# save data
save_location = 'chosen/location/filename.csv'
df_pol.to_csv(save_location, index = False)

The following code chunk computes the independence score. Unlike the polarity score, the independence score is computed only for the nominally independent outlets operating inside the country, relative to the regime outlets and the opposition outlets. In the following code chunk, enter the names of the outlets/outlet groups as they appear in the outlet column of the dataframe.

In [None]:
outlet_list = ["nominally independent outlet1", "nominally independent outlet2"]
regime_column = "name_regime_group"
opposition_column = "name_opposition_group"

Finally, this code chunk computes the polarity independence score and saves the resulting dataframe as a csv file. The resulting file can be used for statistical analysis or to visualise trends in independence of the different independent outlets over time.

In [None]:
# Function to calculate independence score
def calculate_independence_score(df, outlets, regime_column, opposition_column):
    """
    Calculate the independence score of each media outlet from the difference between its polarity and the
    polarity of the regime and opposition outlets. The outlet columns of df are overwritten in place.
    The regime and opposition columns can identify single outlets, but are intended to be aggregated
    columns that represent the sentiment across all regime and opposition outlets.
    """
    for outlet in outlets:
        score = (((abs(df[outlet] - df[regime_column]) - abs(df[outlet] - df[opposition_column])) /
                abs(df[opposition_column] - df[regime_column])) + 1) / 2
        df[outlet] = score

# Pivot to wide format: one row per date, one polarity column per outlet
df_can = pd.pivot(df_pol, index="date", columns="outlet", values="polarity")

# calculate independence score
calculate_independence_score(df_can, outlet_list, regime_column, opposition_column)

# Melt back to long format
df_ind = pd.melt(df_can.reset_index(), id_vars=['date'],
                 value_vars=outlet_list,
                 value_name="independence score")

# save data
save_location = 'chosen/location/filename.csv'
df_ind.to_csv(save_location, index = False)