<center>
<img src="https://laelgelcpublic.s3.sa-east-1.amazonaws.com/lael_50_years_narrow_white.png.no_years.400px_96dpi.png" width="300" alt="LAEL 50 years logo">
<h3>APPLIED LINGUISTICS GRADUATE PROGRAMME (LAEL)</h3>
</center>
<hr>

# Revising Academic Writing with ChatGPT

## What is `python-docx`?

`python-docx` is a Python library for reading, creating, and updating Microsoft Word 2007+ (.docx) files.

For further reading, see:
- [5 Best Ways to Read Microsoft Word Documents with Python](https://blog.finxter.com/5-best-ways-to-read-microsoft-word-documents-with-python/)
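
As background, a `.docx` file is a ZIP archive whose main text lives in the XML part `word/document.xml`; libraries such as `python-docx` wrap that structure and handle many more details. The sketch below uses only the standard library and builds a tiny stand-in archive in memory (it is not a complete Word file, and the helper name `paragraph_texts` is hypothetical) to show where paragraph text comes from.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def paragraph_texts(docx_bytes):
    """Return the text of each <w:p> paragraph in the main document part."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    texts = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is spread over one or more <w:t> elements
        texts.append("".join(t.text or "" for t in para.iter(f"{{{W_NS}}}t")))
    return texts

# Stand-in archive with a single XML part (assumption: illustration only)
xml = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    "<w:p><w:r><w:t>First </w:t></w:r><w:r><w:t>paragraph</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>"
    "</w:body></w:document>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", xml)

demo_paragraphs = paragraph_texts(buffer.getvalue())
print(demo_paragraphs)
```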

## Required Python packages

- openai
- pandas
- python-dotenv
- tqdm
- python-docx

## Importing the required libraries

In [1]:
from docx import Document
from dotenv import load_dotenv
import openai
import pandas as pd
import os
import sys
import logging
from tqdm import tqdm
import time

## Defining input variables

In [2]:
input_directory = 'cl_st1_ph1_multiteclin_input'
output_directory = 'cl_st1_ph1_multiteclin_output'
log_filename = 'cl_st1_ph1_multiteclin.log'
docx_filename = 'Ensaio Alexandre Rodrigues Nunes inglês.docx'
txt_filename = 'Ensaio Alexandre Rodrigues Nunes inglês python-docx.txt'
df_filename1 = 'cl_st1_ph1_multiteclin_df1'
df_filename2 = 'cl_st1_ph1_multiteclin_df2'

## Creating output directory

In [3]:
# Creating the output directory if it does not already exist
if os.path.exists(output_directory):
    print('Output directory already exists.')
else:
    try:
        os.makedirs(output_directory)
        print('Output directory successfully created.')
    except OSError as e:
        print('Failed to create the directory:', e)
        sys.exit(1)

Output directory successfully created.
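
The same check-then-create logic can be written more compactly with `pathlib`, since `mkdir(parents=True, exist_ok=True)` succeeds whether or not the directory already exists. A minimal sketch, using a temporary directory and a hypothetical folder name so that it does not touch the project files:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as base:
    demo_output = Path(base) / "demo_output"  # Hypothetical directory name
    demo_output.mkdir(parents=True, exist_ok=True)
    demo_output.mkdir(parents=True, exist_ok=True)  # Second call is a no-op
    created = demo_output.is_dir()

print(created)
```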


## Setting up logging

In [4]:
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    filename=f"{output_directory}/{log_filename}"
)
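
One caveat worth knowing: `logging.basicConfig` does nothing if the root logger already has handlers, which can happen in a notebook session where the cell is re-run or another library has configured logging first. A hedged alternative is a dedicated named logger with its own `FileHandler`. In this sketch the function `make_file_logger`, the logger name `revision_demo`, and the temporary log path are assumptions for illustration:

```python
import logging
import os
import tempfile

def make_file_logger(name, path):
    """Create a logger that writes to `path`, independent of the root logger."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # Avoid duplicate records via the root logger
    if not logger.handlers:   # Avoid adding a second handler on re-run
        handler = logging.FileHandler(path, encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
        )
        logger.addHandler(handler)
    return logger

log_dir = tempfile.mkdtemp()
log_path = os.path.join(log_dir, "demo.log")
demo_logger = make_file_logger("revision_demo", log_path)
demo_logger.info("Logging configured")

for handler in demo_logger.handlers:
    handler.flush()

with open(log_path, encoding="utf-8") as log_file:
    log_contents = log_file.read()
print(log_contents)
```

On Python 3.8 or later, passing `force=True` to `logging.basicConfig` is another way to replace existing root handlers.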

## `python-docx` DOCX scraping

### Setting required variables

In [5]:
docx_path = f"{input_directory}/{docx_filename}"
output_txt = f"{output_directory}/{txt_filename}"
paragraph_num = 0

### Sampling paragraphs

In [6]:
doc = Document(docx_path)
paragraph = doc.paragraphs[paragraph_num]
text = paragraph.text
text

'Academic Essay: The Pedagogy of Multiliteracies Applied to English Teaching'

### Scraping the entire document

In [7]:
def scrape_docx(docx_path, output_txt):
    # Opening the DOCX file
    doc = Document(docx_path)

    # Initialising an empty list to store the paragraph texts
    text_list = []

    # Iterating through all the paragraphs and extracting their text
    for paragraph in doc.paragraphs:
        text_list.append(paragraph.text)

    text = '\n'.join(text_list)
    
    # Writing the extracted text to a text file in UTF-8 encoding
    with open(output_txt, 'w', encoding='utf-8') as txt_file:
        txt_file.write(text)

scrape_docx(docx_path, output_txt)
print('DOCX scraped successfully!')

DOCX scraped successfully!


## Manually editing the text file according to the following format

```
Text ID: t000000

Title: The Pedagogy of Multiliteracies Applied to English Teaching

Section: Introduction
The teaching of foreign languages ...
The concept of multiliteracies, ...
<other paragraphs>

Section: Development
The Pedagogy of Multiliteracies is ...
The Pedagogy of Multiliteracies ...
<other paragraphs>

Section: Conclusion
The Pedagogy of Multiliteracies emerges ...
The creation of fanfictions within ...
<other paragraphs>
```
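
Because the import step relies on these markers, it can help to check the edited file before loading it. The function below is a minimal sketch of such a check; `check_format` and its messages are hypothetical and not part of the original workflow.

```python
def check_format(text):
    """Return a list of problems found in a manually edited text."""
    problems = []
    found = set()   # Header labels seen so far
    counts = {}     # Section name -> number of paragraphs
    section = None
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue
        if line.startswith(("Text ID:", "Title:")):
            label, _, value = line.partition(":")
            found.add(label)
            if not value.strip():
                problems.append(f"Line {number}: '{label}:' has no value")
        elif line.startswith("Section:"):
            section = line.partition(":")[2].strip()
            counts.setdefault(section, 0)
        elif section is None:
            problems.append(f"Line {number}: paragraph before any 'Section:' line")
        else:
            counts[section] += 1
    for label in ("Text ID", "Title"):
        if label not in found:
            problems.append(f"Missing '{label}:' line")
    problems.extend(
        f"Section '{name}' has no paragraphs"
        for name, count in counts.items() if count == 0
    )
    return problems

# Made-up inputs for illustration
good = "Text ID: t000000\n\nTitle: A Title\n\nSection: Introduction\nFirst paragraph.\n"
bad = "Title: A Title\n\nStray paragraph.\n\nSection: Body\n"
print(check_format(good))
print(check_format(bad))
```

An empty list means no problems were detected; any other result lists what to fix before running the import cell.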

## Importing the text into a DataFrame

In [8]:
# Initialising variables
text_id = None
title = None
section = None
paragraph_count = 0
data = []

# Reading the file
logging.info('Starting to read the file')
with open(output_txt, 'r', encoding='utf-8') as file:
    for line in file:
        line = ' '.join(line.split())  # Removing duplicated spaces
        line = line.strip()
        
        # Capturing 'Text ID'
        if line.startswith('Text ID:'):
            text_id = line.split(':', 1)[1].strip()
            logging.info(f"Captured Text ID: {text_id}")
        
        # Capturing 'Title'
        elif line.startswith('Title:'):
            title = line.split(':', 1)[1].strip()
            logging.info(f"Captured Title: {title}")
        
        # Capturing 'Section'
        elif line.startswith('Section:'):
            section = line.split(':', 1)[1].strip()
            paragraph_count = 0  # Resetting paragraph count for new section
            logging.info(f"Starting new section: {section}")
        
        # Capturing 'Paragraph Text'
        elif line:
            paragraph_count += 1
            data.append({
                'Text ID': text_id,
                'Title': title,
                'Section': section,
                'Paragraph': f"Paragraph {paragraph_count}",
                'Text Paragraph': line
            })
            logging.info(f"Captured Paragraph {paragraph_count} in Section: {section}")

# Creating DataFrame
df_text = pd.DataFrame(data)
logging.info('DataFrame created successfully')

In [9]:
df_text

| | Text ID | Title | Section | Paragraph | Text Paragraph |
|---|---|---|---|---|---|
| 0 | t000000 | The Pedagogy of Multiliteracies Applied to Eng... | Introduction | Paragraph 1 | The teaching of foreign languages has evolved ... |
| 1 | t000000 | The Pedagogy of Multiliteracies Applied to Eng... | Introduction | Paragraph 2 | The concept of multiliteracies, introduced by ... |
| 2 | t000000 | The Pedagogy of Multiliteracies Applied to Eng... | Introduction | Paragraph 3 | This essay will first explore the theoretical ... |
| 3 | t000000 | The Pedagogy of Multiliteracies Applied to Eng... | Development | Paragraph 1 | The Pedagogy of Multiliteracies is based on th... |
| 4 | t000000 | The Pedagogy of Multiliteracies Applied to Eng... | Development | Paragraph 2 | The Pedagogy of Multiliteracies emerges as a c... |
| 5 | t000000 | The Pedagogy of Multiliteracies Applied to Eng... | Development | Paragraph 3 | The New London Group's (NLG) manifesto, drafte... |
| 6 | t000000 | The Pedagogy of Multiliteracies Applied to Eng... | Development | Paragraph 4 | Another crucial point in this discussion is th... |
| 7 | t000000 | The Pedagogy of Multiliteracies Applied to Eng... | Development | Paragraph 5 | In this context, the Pedagogy of Multiliteraci... |
| 8 | t000000 | The Pedagogy of Multiliteracies Applied to Eng... | Development | Paragraph 6 | In the English language classroom, the applica... |
| 9 | t000000 | The Pedagogy of Multiliteracies Applied to Eng... | Development | Paragraph 7 | The fanfiction genre has become an important p... |


### Exporting to a file

In [10]:
df_text.to_json(f"{output_directory}/{df_filename1}.jsonl", orient='records', lines=True)
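
With `orient='records'` and `lines=True`, pandas writes the JSON Lines layout: one JSON object per row, one row per line. The sketch below reproduces that layout with the standard library only, to show what the `.jsonl` file contains; the two records are shortened placeholders, not the real data.

```python
import json

# Placeholder records standing in for DataFrame rows
records = [
    {"Text ID": "t000000", "Section": "Introduction", "Paragraph": "Paragraph 1"},
    {"Text ID": "t000000", "Section": "Introduction", "Paragraph": "Paragraph 2"},
]

# Serialising: one JSON object per line
jsonl_text = "\n".join(json.dumps(record) for record in records) + "\n"

# Parsing: each non-empty line is decoded independently
restored = [json.loads(line) for line in jsonl_text.splitlines() if line]

print(jsonl_text)
print(restored == records)
```

Because each line is a complete JSON object, the file can be inspected or processed line by line without loading it whole.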

## Revising the paragraphs with ChatGPT

### Importing the data into a DataFrame

In [11]:
df_text = pd.read_json(f"{output_directory}/{df_filename1}.jsonl", lines=True)

### Defining input variables

In [12]:
#chatgpt_prompt = 'Dear ChatGPT, would it be possible for you to improve the writing of the following passage of a research article considering the generally accepted standards of English for Academic Purposes? Please keep each improved passage within a single paragraph - do not split it into multiple paragraphs. OK?'

In [13]:
chatgpt_prompt = 'Dear ChatGPT, please improve the writing of the following passage of a research article considering the generally accepted standards of English for Academic Purposes. It is very important that you are as objective, scientific and non-metaphorical as you can be. Please keep each improved passage within a single paragraph - do not split it into multiple paragraphs. Also, do not acknowledge this prompt - just provide the revised passage straightaway.'

### Revising the paragraphs

In [14]:
# Loading all environment variables from '.env' into 'os.environ'
load_dotenv()

# Importing the required programme variables from the environment
openai.api_key = os.environ.get('OPENAI_API_KEY', '')
assert openai.api_key

# Defining a function to query ChatGPT with exponential backoff
def get_completion(prompt, model='gpt-4o', max_retries=5):
    client = openai.OpenAI()
    messages = [{'role': 'user', 'content': prompt}]
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model=model,
                messages=messages,
                temperature=0
            )
            return response.choices[0].message.content
        except openai.RateLimitError as e:
            wait_time = 2 ** attempt  # Exponential backoff
            logging.warning(f"Rate limit exceeded. Retrying in {wait_time} seconds...")
            time.sleep(wait_time)
        except Exception as e:
            logging.error(f"Error querying ChatGPT: {e}")
            return None
    logging.error("Max retries exceeded.")
    return None

# Defining a function to process text using ChatGPT
def process_text(text, prompt_template):
    paragraphs = text.split('\n')  # Splitting text into paragraphs
    processed_paragraphs = []
    for paragraph in paragraphs:
        prompt = prompt_template + paragraph
        try:
            processed_paragraph = get_completion(prompt)
            if processed_paragraph:
                processed_paragraphs.append(processed_paragraph)
            else:
                processed_paragraphs.append(paragraph)  # Keeping original if there's an error
        except Exception as e:
            print(f"Error processing paragraph: {e}")
            processed_paragraphs.append(paragraph)  # Keeping original if there's an error
    return '\n'.join(processed_paragraphs)

# Applying the function to the 'Text Paragraph' column with progress indication
processed_texts = []
for index, row in tqdm(df_text.iterrows(), total=len(df_text), desc='Processing texts'):
    # Defining the ChatGPT prompt template
    prompt_template = chatgpt_prompt + '\n'
    
    # Processing text
    processed_texts.append(process_text(row['Text Paragraph'], prompt_template))

df_text['Text Paragraph ChatGPT'] = processed_texts

Processing texts: 100%|██████████| 30/30 [02:08<00:00,  4.28s/it]
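
The retry logic inside `get_completion` follows a general pattern: wait `2 ** attempt` seconds after each retriable failure, up to a fixed number of attempts. A minimal sketch of that pattern, separated from the OpenAI client, is shown below. The names `retry_with_backoff`, `TransientError`, and `flaky_call` are hypothetical, and the `sleep` parameter is injectable so the example runs without real delays.

```python
import time

class TransientError(Exception):
    """Stand-in for a retriable failure such as a rate limit (hypothetical)."""

def retry_with_backoff(func, max_retries=5, sleep=time.sleep):
    """Call func(); on TransientError wait 2 ** attempt seconds and retry."""
    for attempt in range(max_retries):
        try:
            return func()
        except TransientError:
            sleep(2 ** attempt)
    return None  # All attempts failed

# Demonstration: fail twice, then succeed, recording waits instead of sleeping
calls = {"count": 0}
waits = []

def flaky_call():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("try again")
    return "ok"

result = retry_with_backoff(flaky_call, sleep=waits.append)
print(result, waits)
```

With the default `max_retries=5`, the waits grow as 1, 2, 4, 8, and 16 seconds, so a call that never succeeds gives up after roughly half a minute.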


In [15]:
df_text

| | Text ID | Title | Section | Paragraph | Text Paragraph | Text Paragraph ChatGPT |
|---|---|---|---|---|---|---|
| 0 | t000000 | The Pedagogy of Multiliteracies Applied to Eng... | Introduction | Paragraph 1 | The teaching of foreign languages has evolved ... | The teaching of foreign languages has undergon... |
| 1 | t000000 | The Pedagogy of Multiliteracies Applied to Eng... | Introduction | Paragraph 2 | The concept of multiliteracies, introduced by ... | The concept of multiliteracies, introduced by ... |
| 2 | t000000 | The Pedagogy of Multiliteracies Applied to Eng... | Introduction | Paragraph 3 | This essay will first explore the theoretical ... | This essay initially examines the theoretical ... |
| 3 | t000000 | The Pedagogy of Multiliteracies Applied to Eng... | Development | Paragraph 1 | The Pedagogy of Multiliteracies is based on th... | The Pedagogy of Multiliteracies is founded on ... |
| 4 | t000000 | The Pedagogy of Multiliteracies Applied to Eng... | Development | Paragraph 2 | The Pedagogy of Multiliteracies emerges as a c... | The Pedagogy of Multiliteracies represents a c... |
| 5 | t000000 | The Pedagogy of Multiliteracies Applied to Eng... | Development | Paragraph 3 | The New London Group's (NLG) manifesto, drafte... | The New London Group's (NLG) manifesto, origin... |
| 6 | t000000 | The Pedagogy of Multiliteracies Applied to Eng... | Development | Paragraph 4 | Another crucial point in this discussion is th... | Another important aspect of this discussion is... |
| 7 | t000000 | The Pedagogy of Multiliteracies Applied to Eng... | Development | Paragraph 5 | In this context, the Pedagogy of Multiliteraci... | In this context, the Pedagogy of Multiliteraci... |
| 8 | t000000 | The Pedagogy of Multiliteracies Applied to Eng... | Development | Paragraph 6 | In the English language classroom, the applica... | In the English language classroom, this pedago... |
| 9 | t000000 | The Pedagogy of Multiliteracies Applied to Eng... | Development | Paragraph 7 | The fanfiction genre has become an important p... | The fanfiction genre has emerged as a signific... |


### Exporting to a file

In [16]:
df_text.to_json(f"{output_directory}/{df_filename2}.jsonl", orient='records', lines=True)

In [17]:
df_text.to_excel(f"{output_directory}/{df_filename2}.xlsx", index=False)