# Greening your Code Practices

Here are some practices for writing more energy-efficient code, often referred to as "greening your code":

1.  **Optimize Data Structures:**
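    *   When importing a dataset, give each column the most compact type that fits its values. For example, if a numeric column only contains whole numbers, store it as an integer rather than a float, and use the smallest integer width that fits; for repeated labels, a categorical type usually takes less memory than plain strings.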
    *   Select data structures appropriate for the task to minimize memory usage and access time.

2.  **Reduce Redundancy:**
    *   Avoid unnecessary computations. Cache results when possible.
    *   Eliminate duplicate code, which can lead to redundant processing.

3.  **Minimize I/O Operations:**
    *   Reading and writing data to disk or across a network consumes significant energy.
    *   Buffer data when possible to perform fewer, larger I/O operations instead of many small ones.
    *   Only read the data you actually need.

4.  **Use Efficient Libraries and Frameworks:**
    *   Leverage well-optimised libraries and frameworks that are designed for performance and efficiency.
    *   For computationally intensive tasks, consider libraries that utilise underlying hardware efficiently (e.g., NumPy, TensorFlow, PyTorch).

5.  **Manage Memory Effectively:**
    *   Release memory when it's no longer needed to reduce the memory footprint and the overhead of garbage collection.
    *   Once you have finished analysing an object, delete it (for example with `del`).
    *   Avoid creating large, unnecessary objects.

6.  **Parallelise and Distribute When Appropriate:**
    *   For suitable tasks, parallelising or distributing computations across multiple cores or machines can potentially reduce the overall execution time and, in some cases, energy consumption (though this can be complex and depends on the task and infrastructure).

7.  **Profile and Measure:**
    *   Use profiling tools to identify bottlenecks in your code where most of the execution time and resource consumption occur. Focus your optimization efforts on these areas.

8. **Prototyping:**
    *   When working with a large dataset, do not develop your code against the whole dataset. Create a very small subset to test your code on, then progressively run it on larger and larger sections.

9. **Be Mindful of Loops:**
    *   Optimise loops to reduce the number of iterations and the work performed within each iteration.

10. **Code Review:**
    *   Review code with efficiency in mind, identifying potential areas for optimization.
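
As a small illustration of the first point, the sketch below reads the same CSV twice with pandas, once with the default types and once with explicit compact types, and compares the memory footprint. The data, the column names (`year`, `kind`) and the chosen types are made up for this illustration; which types are safe depends on the values in your own dataset.

```python
import io

import pandas as pd

# Made-up data: 1,000 rows with a year and a label that repeats often
kinds = ["church", "chapel", "school"]
lines = ["year,kind"]
for i in range(1000):
    lines.append(f"{1500 + i % 400},{kinds[i % 3]}")
csv_text = "\n".join(lines)

# Default import: pandas picks the types itself
default_df = pd.read_csv(io.StringIO(csv_text))

# Explicit import: a small integer for the year, a categorical type for the label
compact_df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"year": "int16", "kind": "category"},
)

# deep=True also counts the memory used by the string values themselves
default_bytes = int(default_df.memory_usage(deep=True).sum())
compact_bytes = int(compact_df.memory_usage(deep=True).sum())

print(default_df.dtypes)
print(compact_df.dtypes)
print(f"default: {default_bytes} bytes, compact: {compact_bytes} bytes")
```

The values are identical in both data frames; only the way they are stored changes.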


   
These are just some examples, and it is not always possible to implement them, because a lot depends on the hardware you are working on and the tasks you are trying to perform. The general suggestion is to write a first working draft and then refine it so that it is more efficient and emits less CO2.
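
One way to decide what to refine in a first draft is to profile it. The sketch below uses Python's built-in `cProfile` and `pstats` modules on a made-up cleaning function (`slow_clean` and the sample texts are illustrative only) and prints the functions with the largest cumulative time.

```python
import cProfile
import io
import pstats


def slow_clean(texts):
    """Illustrative first draft: remove punctuation character by character."""
    cleaned = []
    for text in texts:
        cleaned.append("".join([c for c in text if c.isalnum() or c.isspace()]))
    return cleaned


texts = ["Hello, world!"] * 2000  # made-up sample data

# Record every function call made while the profiler is enabled
profiler = cProfile.Profile()
profiler.enable()
cleaned = slow_clean(texts)
profiler.disable()

# Write the ten entries with the largest cumulative time into a string
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(10)
report = stream.getvalue()
print(report)
```

The report lists where the time goes, which tells you which part of the draft is worth optimising first.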


In the code below we are going to see three quick examples of how code for text analysis can be optimised.

All the examples use the `EmissionsTracker` from the CodeCarbon package to check how changing the code affects the estimated carbon emissions.


One caveat is that these figures are always estimates; they will be more accurate if you run the code locally.

After the examples there are three challenges for you to pick from to test yourself.


## Installing the extra package we need  


In [None]:
!pip install codecarbon

# Example 1: Data Upload
Below is some code that imports a fairly large dataset.
For our research, though, we only need some of the columns, so we can improve the data upload step by importing only what we need.

**NB** If you are working with multiple files, it is a good idea to clone the whole repo with `!git clone` rather than accessing the files separately.


In [None]:
from codecarbon import EmissionsTracker  # estimates the CO2 emissions of the code it wraps
import pandas as pd  # imported before tracking starts, so the import itself is not measured

url = "https://raw.githubusercontent.com/kingsdigitallab/dh-rse-summer-school-2025/refs/heads/main/Day%204/Data/ParishTokenised.csv"

tracker = EmissionsTracker()
tracker.start()

# Import every column of the dataset
Parish = pd.read_csv(url)

emissions = tracker.stop()
print(f"Estimated CO2 emissions for importing the whole dataset: {emissions} kg")

This works, but do we see a difference when we import only the columns we are interested in?

In [None]:
tracker = EmissionsTracker()
tracker.start()

url = "https://raw.githubusercontent.com/kingsdigitallab/dh-rse-summer-school-2025/refs/heads/main/Day%204/Data/ParishTokenised.csv"

# Import only the columns we need
Parish = pd.read_csv(url, usecols=["title", "Type", "Area", "Parish"])

emissions = tracker.stop()
print(f"Estimated CO2 emissions for optimized data import: {emissions} kg")

## Why it matters:

- Reduces memory use and loading time.

- Less data = less processing = lower energy usage.
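
The same `read_csv` call can also support the prototyping advice from the start of this notebook: the `nrows` argument limits how many rows are read, so you can test code on a small sample first. The sketch below uses a small in-memory CSV with made-up columns (`id`, `title`, `text`) as a stand-in for a real file.

```python
import io

import pandas as pd

# Made-up stand-in for a large CSV file
lines = ["id,title,text"]
for i in range(10_000):
    lines.append(f"{i},Title {i},Some text for row {i}")
csv_text = "\n".join(lines)

# While prototyping, read only two columns and only the first 100 rows
sample = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["title", "text"],
    nrows=100,
)

print(sample.shape)
print(sample.head(3))
```

Once the code works on the sample, you can raise `nrows` step by step and finally drop it to process the full file.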



Let's follow our own advice: since we have finished with the `Parish` object, we remove it.

In [None]:
del Parish

# Example 2: Data Wrangling - Avoid Unnecessary Loops

Use case: you want to clean up a text column by removing punctuation. Note that whitespace, including line breaks, is kept by the code in this example.

First we import the dataset we want to use for this.

In [8]:
import pandas as pd

url = "https://raw.githubusercontent.com/kingsdigitallab/dh-rse-summer-school-2025/refs/heads/main/Day%204/Data/Parish.csv"

Parish2 = pd.read_csv(url)

# Replace missing texts with empty strings and make sure every value is a string
Parish2['text'] = Parish2['text'].fillna('').astype(str)

In [None]:
tracker = EmissionsTracker()
tracker.start()

cleaned = []
for item in Parish2["text"]:
    # Go through each character and keep only letters, digits and whitespace,
    # which removes all punctuation
    cleaned.append("".join([c for c in item if c.isalnum() or c.isspace()]))
Parish2["text_cleaned"] = cleaned

emissions = tracker.stop()
print(f"Estimated CO2 emissions for looping through characters to remove punctuation: {emissions} kg")

In [None]:
print(Parish2["text_cleaned"].iloc[1:3])

Let's see what happens when we use a regular expression through pandas' vectorised string methods instead.

In [None]:
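tracker = EmissionsTracker()
tracker.start()

# Remove every character that is not a word character (\w) or whitespace (\s).
# Note: \w also matches the underscore, so "_" is kept here but removed by the loop version.
Parish2["text_cleaned"] = Parish2["text"].str.replace(r"[^\w\s]", "", regex=True)

emissions = tracker.stop()
print(f"Estimated CO2 emissions for optimized cleaning function using regex: {emissions} kg")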

In [None]:
print(Parish2["text_cleaned"].iloc[1:3])

## Why it matters:

- Uses optimized internal functions instead of manual Python loops.

- Faster and uses less CPU power.
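
To check this kind of claim on your own machine without a full emissions run, you can time the two approaches with the standard library. The sketch below uses a made-up sample sentence and plain Python lists rather than the `Parish2` column; timings vary between machines, so no result is shown. It also shows the one case where the two approaches differ: `\w` treats the underscore as a word character, while `isalnum()` does not.

```python
import re
import timeit

texts = ["Hello, world! It's a test: 1, 2, 3."] * 5000  # made-up sample data


def clean_loop(items):
    """Keep only alphanumeric characters and whitespace, one character at a time."""
    return ["".join(c for c in t if c.isalnum() or c.isspace()) for t in items]


# Compile the pattern once so it is not rebuilt for every text
PUNCTUATION = re.compile(r"[^\w\s]")


def clean_regex(items):
    """Remove everything that is not a word character or whitespace."""
    return [PUNCTUATION.sub("", t) for t in items]


# Both approaches give the same result on this sample
same_output = clean_loop(texts) == clean_regex(texts)

# Time five runs of each approach
loop_seconds = timeit.timeit(lambda: clean_loop(texts), number=5)
regex_seconds = timeit.timeit(lambda: clean_regex(texts), number=5)
print(f"same output: {same_output}")
print(f"loop: {loop_seconds:.3f} s, regex: {regex_seconds:.3f} s")

# The underscore is the exception: removed by the loop, kept by the regex
underscore_loop = clean_loop(["snake_case"])[0]
underscore_regex = clean_regex(["snake_case"])[0]
print(underscore_loop, underscore_regex)
```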

# Example 3: Text Analysis

In our third example we look at how to speed up tokenisation by caching results (this helps if you are using a corpus in which the same texts or sentences repeat).


In [None]:
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt_tab')


In [None]:
tracker = EmissionsTracker()
tracker.start()

Parish2["tokens"] = Parish2["text"].apply(lambda x: word_tokenize(x.lower()))

emissions = tracker.stop()
print(f"Estimated CO2 emissions for tokenisation: {emissions} kg")

In [None]:
tracker = EmissionsTracker()
tracker.start()

from functools import lru_cache

# Use caching to avoid re-tokenising identical texts.
# Note: identical texts share one cached list, so do not modify the token lists in place.
@lru_cache(maxsize=None)
def tokenize_cached(text):
    return word_tokenize(text.lower())

# Apply the cached function to each text
Parish2["tokens"] = Parish2["text"].map(tokenize_cached)

emissions = tracker.stop()
print(f"Estimated CO2 emissions for cached tokenisation: {emissions} kg")

OK, there is not much difference in our example; caching should work better for things like social media data, where the same texts tend to repeat more often.

Before moving to the challenge, let's keep following our own advice and remove the `Parish2` object, which we do not need anymore.

In [None]:
del Parish2

## Why it matters:
- Avoids repeating the same work, which helps when the data contains exact duplicate records.
- Lowers CPU usage, which is not only faster but also more energy-efficient.
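
Whether caching pays off depends on how many texts are exact repeats. `functools.lru_cache` exposes `cache_info()`, which reports hits and misses, so you can check this directly. The sketch below uses `str.split` as a stand-in for `word_tokenize` and a few made-up texts, so it runs without NLTK; the function name `tokenize_cached_demo` is only for this illustration.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def tokenize_cached_demo(text):
    # Stand-in tokeniser: lower-case the text and split it on whitespace.
    # A tuple is returned so cached results cannot be modified by accident.
    return tuple(text.lower().split())


# Made-up texts: the first one appears three times
texts = ["Make it work", "Then make it fast", "Make it work", "Make it work"]
tokens = [tokenize_cached_demo(t) for t in texts]

# hits = calls answered from the cache, misses = calls that did the real work
info = tokenize_cached_demo.cache_info()
print(info)
print(f"{info.hits} of {len(texts)} calls were answered from the cache")
```

If the number of hits is close to zero on your real data, the cache only adds memory overhead and the plain version is the better choice.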

# Challenge

Below you can find some code that imports a dataset of old Trump tweets. Right now the estimated CO2 is **5.5201969058160406e-05 kg**; let's see if, by following the advice we just went through, we can make it better.

Let's say that we are only interested in the `created_at` date and the `text`, not the `id`.

In [None]:
# Packages needed. Run this in case the kernel restarted; otherwise they should already be loaded
import pandas as pd
import re
#!pip install codecarbon
from codecarbon import EmissionsTracker
from functools import lru_cache
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt_tab')

In [None]:
# Again we check the cost, so we wrap the tracker around the whole code
tracker = EmissionsTracker()
tracker.start()
# Define the URL
url = "https://raw.githubusercontent.com/kingsdigitallab/dh-rse-summer-school-2025/refs/heads/main/Day%204/Data/trump-tweet-archive.csv"

# Create the object
Tweets = pd.read_csv(url)
Tweets['text'] = Tweets['text'].fillna('').astype(str)  # Replace missing texts with empty strings

# Remove punctuation
cleaned = []
for item in Tweets["text"]:
    # Go through each character and keep only letters, digits and whitespace,
    # which removes all punctuation
    cleaned.append("".join([c for c in item if c.isalnum() or c.isspace()]))
Tweets["text_cleaned"] = cleaned

#Tokenise
Tweets["tokens"] = Tweets["text"].apply(lambda x: word_tokenize(x.lower()))

emissions = tracker.stop()
print(f"Estimated CO2 emissions for working with Trump Tweets: {emissions} kg")

# Solution


In [None]:


# Again we check the cost, so we wrap the tracker around the whole code
tracker = EmissionsTracker()
tracker.start()
# Define the URL
url = "https://raw.githubusercontent.com/kingsdigitallab/dh-rse-summer-school-2025/refs/heads/main/Day%204/Data/trump-tweet-archive.csv"
# Create the object, importing only the columns we need
Tweets = pd.read_csv(url, usecols=["created_at", "text"])
Tweets['text'] = Tweets['text'].fillna('').astype(str)  # Replace missing texts with empty strings

# Remove punctuation with a vectorised regular expression
Tweets["text_cleaned"] = Tweets["text"].str.replace(r"[^\w\s]", "", regex=True)

# Tokenise
# Use caching to avoid re-tokenising identical texts
@lru_cache(maxsize=None)
def tokenize_cached(text):
    return word_tokenize(text.lower())

# Apply the cached function to each text
Tweets["tokens"] = Tweets["text"].map(tokenize_cached)



emissions = tracker.stop()
print(f"Estimated CO2 emissions for working with Trump Tweets: {emissions} kg")

**NB** Remember the advice about prototyping: do not try to change everything in the cell above at once, but work through it bit by bit.