<a href="https://colab.research.google.com/github/gthivaios/Bank-Greek-Tweets-Sentiment-Analysis/blob/main/Detect_Polarization_in_Political_Texts.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# How to Interact with ChatGPT for Text Understanding


This notebook accompanies a how-to guide that gives a simple introduction to using Large Language Models for text analysis in the social sciences.

The notebook provides the code for an example that analyzes the level of polarization in a given political text, but it can easily be adapted to your own text analysis project.

## 1. Signing up for API access

The first step is to sign up for API access with OpenAI. This can be done at platform.openai.com.

You will receive an API key to be used below.
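
Because an API key works like a password, it is usually safer to keep it out of the notebook itself. A minimal sketch, assuming the key has been stored in an environment variable named `OPENAI_API_KEY` (the variable name and the helper `load_api_key` are illustrative assumptions, not part of the how-to guide):

```python
import os

def load_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key stored in an environment variable.

    Raises a clear error if the variable is missing or empty, so the
    problem surfaces before any API call is made.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key
```

In the notebook, the result could then be assigned with `openai.api_key = load_api_key()`.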

## 2. Installing and loading the relevant libraries

We first need to install and import the relevant libraries: the pandas package for general data processing, the openai package for interacting with the OpenAI API, tiktoken for counting tokens, and nltk for splitting texts into sentences.

In [None]:
#Install the libraries
!pip install pandas
!pip install openai==0.28
!pip install tiktoken

In [None]:
#Call the libraries
import os
import time
import openai
import tiktoken
import nltk
import glob
import pandas as pd
import numpy as np
import nltk.data
from nltk.tokenize import sent_tokenize, word_tokenize
nltk.download('punkt')

In [3]:
#Set the API key. See the how-to guide for further instructions
openai.api_key = "your-api-key" #Never share or publish your real key
#We define which model to use throughout
MODEL = 'gpt-3.5-turbo-16k'
max_tokens = 120 #Maximum length of the model's reply, in tokens
WAIT_TIME = 20 #Seconds to wait before retrying after a rate-limit error

We can now call the OpenAI API. For instance, we can ask the model a question:

In [15]:
#Test the API
response = openai.ChatCompletion.create(
  model = MODEL, #Which model to use
  temperature=0.2, #How random is the answer
  max_tokens=max_tokens, #How long can the reply be
  messages=[
    {"role": "user",
    "content": "What is polarization in political texts?"}]
)
result = ''
for choice in response.choices:
    result += choice.message.content
print(f"Model answer: '{result}'")

Model answer: 'Polarization in political texts refers to the division or separation of opinions, beliefs, or ideologies into extreme or opposing positions. It occurs when political texts, such as speeches, articles, or social media posts, emphasize and reinforce differences between groups or individuals, leading to a more polarized political climate. Polarization can be seen in the language, tone, and content of political texts, as they often aim to appeal to specific audiences or promote a particular agenda. This can result in the amplification of conflicts, the creation of echo chambers, and the erosion of common ground or compromise in political discourse.'


## 3. Loading and preparing your data

The next step is to load and prepare the data that we want to analyze. We will load the data into a Pandas dataframe to allow easy processing.

The details of how to open your particular data depend on the structure and format of the data. Pandas offers ways of opening a range of file formats, including CSV and Excel files. You may wish to refer to the Pandas documentation for more details.

In [5]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [6]:
#Loading data from textfiles
import glob
import pandas as pd
import os

# Define the folder path where the text files are located
folder_path = '/content/gdrive/MyDrive/test/'

# Use glob to get a list of all *.txt files in the folder
txt_files = glob.glob(os.path.join(folder_path, '*.txt'))
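
The cell above only collects the file paths. A minimal sketch of reading those files into a Pandas dataframe, one row per file, might look as follows; the helper name `load_texts` and the column names `filename` and `text` are illustrative assumptions:

```python
import glob
import os

import pandas as pd

def load_texts(folder_path):
    """Read every *.txt file in folder_path into a dataframe, one row per file."""
    records = []
    # sorted() makes the row order reproducible across runs
    for path in sorted(glob.glob(os.path.join(folder_path, "*.txt"))):
        with open(path, "r", encoding="utf-8") as f:
            records.append({"filename": os.path.basename(path), "text": f.read()})
    return pd.DataFrame(records, columns=["filename", "text"])
```

Calling `df = load_texts(folder_path)` would then give a dataframe whose `text` column holds the full content of each file.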


## Chunking the texts
Unlike other NLP methods, not much preprocessing is needed. However, LLMs are only able to process texts that are smaller than their "context window". If our texts are longer than the context window of our model, we have to either split the texts into several smaller chunks and analyze them part by part, or simply truncate the text (not recommended).

The details depend on the model you use and the amount of data. We will use the standard 16K GPT-3.5 model and chunk the text into smaller pieces. If your text is short, such as a tweet, the function below will return it unchanged.

In [8]:
def split_text_into_chunks(text, max_tokens):

    #Encode the text with the model's tokenizer and count the tokens
    encoding = tiktoken.encoding_for_model(MODEL)
    nrtokens = len(encoding.encode(text))

    if nrtokens < max_tokens:
        return [text]

    #How many chunks to split it into? (cast to int so the counts stay whole numbers)
    num_chunks = int(np.ceil(nrtokens / max_tokens))

    # Tokenize the text into sentences
    sentences = sent_tokenize(text)

    # Calculate the number of words per chunk
    words_per_chunk = len(text.split()) // num_chunks

    # Initialize variables
    chunks = []
    current_chunk = []

    word_counter = 0
    # Iterate through each sentence
    for sentence in sentences:
        # Add the sentence to the current chunk
        current_chunk.append(sentence)
        word_counter += len(sentence.split())

        # Check if the current chunk has reached the desired number of words
        if word_counter >= words_per_chunk:
            # Add the current chunk to the list of chunks
            chunks.append(" ".join(current_chunk))
            word_counter = 0
            # Reset the current chunk
            current_chunk = []

    # Add the remaining sentences as the last chunk
    if current_chunk:
        chunks.append(" ".join(current_chunk))

    return chunks
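
To make the arithmetic in the function concrete, here is a small sketch of how the number of chunks and the per-chunk word target follow from the token count. The figures (9,500 tokens, 7,000 words) are made up for illustration, and `chunk_plan` is a hypothetical helper that mirrors the two calculations above without calling the tokenizer:

```python
import math

def chunk_plan(n_tokens, n_words, max_tokens):
    """Mirror the two calculations in split_text_into_chunks."""
    if n_tokens < max_tokens:
        return 1, n_words  # the text fits, so it stays in one piece
    num_chunks = math.ceil(n_tokens / max_tokens)
    words_per_chunk = n_words // num_chunks
    return num_chunks, words_per_chunk

print(chunk_plan(9500, 7000, 4000))  # (3, 2333)
```

Because a chunk is only closed once a whole sentence pushes it past the word target, individual chunks can end up somewhat above `max_tokens`, so it may be wise to leave some headroom below the model's context window.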

In [9]:
def process_file(file_path, max_tokens):
  with open(file_path, 'r', encoding='utf-8') as file:
    text = file.read()
    return split_text_into_chunks(text, max_tokens)

## 4. Prompt engineering

The next step is to formulate a first set of instructions for analyzing the text. The prompt will be the result of an iterative process through which you develop a formulation of the concept that you wish to capture.

In [7]:
prompt_text = """Your task is to evaluate the level of polarization in a political text. Polarization is considered an important feature of political systems. Although usually seen as a negative trait, it is important to recognise that a certain degree of polarization is reasonable and perhaps necessary. Political polarization represents the intensity of binary, opposing political ideologies and their respective party identities. Below are some critical features of a polarizing discourse:
1) Polarization occurs when a discourse promotes strong partisan or ideological divisions. This discourse promotes a representation of politics in dichotomous and binary terms, where society is divided into two major camps. A multitude of differences and contradictions are reduced to a single division. The remaining differences are downplayed.
2) The two political and ideological positions that this discourse constructs are presented as incompatible, and the political views and attitudes of citizens tend to diverge and cluster around these two opposing ideological positions. It creates a powerful and irreconcilable opposition between two camps, each challenging or even denying the legitimacy of the other. The political opponent becomes an enemy.
3) This discourse limits pluralism and fosters fanaticism. It results in the marginalization of intermediate or alternative views from the public sphere and, correspondingly, the squeezing and even the exclusion of smaller parties.
4) A discourse that increases polarization perceives and describes politics through the “us” vs. “them” distinction. There is no midpoint, everyone is asked to choose sides.
5) A discourse of polarization has a strong emotional dimension.
6) Polarizing discourse, in order to gain depth, often invokes deeply rooted social identities or social divisions that last over time and emphasizes opposing pairs of concepts and values (for example, modernization-tradition, progress-conservatism, workers-capitalists, right-left)

You should give the text a numeric grade between 0 and 1.
1. A speech in this category includes strong expressions of all of the polarization elements, but either does not use them consistently or tempers them by including non-polarizing elements. The text may have a romanticized notion of the people and the idea of a unified popular will, but it avoids bellicose language or any particular enemy.
0. A speech in this category uses few if any polarizing elements.
[Answer with a number in the 0-1 range, followed by a semi-colon, and then a brief motivation. For instance: "0.7; The text shows many elements of a polarizing text." Do not use quotation marks.]
"""

# 5. Calling the LLM and analyzing the results
## 5.1 Call the LLM

We will now write simple functions for calling the API and carrying out our analysis request. We will also need to handle possible errors returned from the API.

In [None]:
#The libraries and the API key were already set up in Section 2
max_chunk_size = 4000 #Maximum number of tokens per chunk

def call_openai_api(input_text):
    #Send one chunk to the model; return the replies, or None on failure
    while True:
        try:
            response = openai.ChatCompletion.create(
                model=MODEL, #The 16K model leaves room for a 4000-token chunk plus the prompt
                max_tokens=150,
                temperature=0.2,
                messages=[{"role": "system", "content": prompt_text},
                          {"role": "user", "content": input_text}]
            )
            return [choice.message['content'] for choice in response.choices]
        except openai.error.RateLimitError:
            #Rate limits are temporary, so wait and try again
            print(f"Rate limit reached. Retrying in {WAIT_TIME} seconds...")
            time.sleep(WAIT_TIME)
        except Exception as e:
            print("Error:", e)
            return None

generated_responses = []

for file_path in txt_files:
    chunk_texts = process_file(file_path, max_chunk_size)
    for input_text in chunk_texts:
        generated_responses_chunk = call_openai_api(input_text)
        if generated_responses_chunk is None:
            #Skip this chunk instead of retrying forever on a persistent error
            print("Skipping a chunk of", file_path)
            continue
        generated_responses.extend(generated_responses_chunk)

print("Generated responses:", generated_responses)

# Calculate average response length
average_response_length = np.mean([len(response.split()) for response in generated_responses])
print("Average response length:", average_response_length)
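
The retry logic can also be written as a small generic helper with a bounded number of attempts, so that a persistent error cannot loop forever. The following is only a sketch: `call_with_retries`, its parameters, and the doubling of the wait time are illustrative choices, not part of the OpenAI library.

```python
import time

def call_with_retries(func, max_attempts=5, wait_seconds=20, retry_on=(Exception,)):
    """Call func() until it succeeds or max_attempts is reached.

    The wait doubles after each failed attempt (exponential backoff).
    The last exception is re-raised if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on as error:
            if attempt == max_attempts:
                raise
            print(f"Attempt {attempt} failed ({error}); retrying in {wait_seconds}s...")
            time.sleep(wait_seconds)
            wait_seconds *= 2
```

With the version of the `openai` package installed above, such a helper could wrap the raw `openai.ChatCompletion.create` call, passing `retry_on=(openai.error.RateLimitError,)` so that only rate-limit errors trigger a retry.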

## 5.2 Example result
We can now look at an example of a result from the analysis and the associated motivation.

Average response length: 0.3; The text does not strongly promote strong partisan or ideological divisions. It does mention the opposition and contrasts their actions with those of the New Democracy party, but it does not use bellicose language or present the opposition as an enemy. The text focuses more on the achievements and future plans of the New Democracy party rather than creating a strong opposition between two camps.

In [35]:
generated_responses[0]

'0.3; The text does not strongly promote strong partisan or ideological divisions. It does mention the opposition and contrasts their actions with those of the New Democracy party, but it does not use bellicose language or present the opposition as an enemy. The text focuses more on the achievements and future plans of the New Democracy party rather than creating a strong opposition between two camps.'
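
Since the prompt asks for a number, a semi-colon, and a motivation, each reply can be split into a score and a text. A minimal sketch, assuming the model follows that format; `parse_response` is a hypothetical helper, and replies that do not match the format are returned with a score of `None` rather than guessed at:

```python
def parse_response(response):
    """Split 'score; motivation' into (float score or None, motivation)."""
    score_part, separator, motivation = response.partition(";")
    if not separator:
        return None, response.strip()  # no semi-colon: format not followed
    try:
        score = float(score_part.strip())
    except ValueError:
        return None, response.strip()  # text before ';' is not a number
    return score, motivation.strip()

example = "0.3; The text does not strongly promote strong partisan or ideological divisions."
print(parse_response(example))
```

Applied to every entry of `generated_responses`, this yields one score per chunk. For a text that was split into several chunks, the chunk scores could then be averaged, although whether a simple average is appropriate depends on your research question.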