<a href="https://colab.research.google.com/github/msaharan/.github/blob/main/accelerated_data_processing_examples/cudf_pandas_large_string.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Accelerating large string data processing with cuDF pandas accelerator mode (cudf.pandas)
<a href="https://github.com/rapidsai/cudf">cuDF</a> is a Python GPU DataFrame library (built on the Apache Arrow columnar memory format) for loading, joining, aggregating, filtering, and otherwise manipulating tabular data with a pandas-style DataFrame API.

cuDF now provides a <a href="https://rapids.ai/cudf-pandas/">pandas accelerator mode</a> (`cudf.pandas`), allowing you to bring accelerated computing to your pandas workflows without requiring any code change.

This notebook demonstrates how cuDF pandas accelerator mode can speed up the processing of datasets with large string fields (4 GB+) by simply adding a `%load_ext` command. This feature was introduced in the RAPIDS 24.08 release.

# ⚠️ Verify your setup

First, we'll verify that you are running with an NVIDIA GPU.

In [1]:
!nvidia-smi  # this should display information about available GPUs

Sat Nov  1 18:24:52 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   44C    P8              9W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
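
The same check can be done from Python, which is handy when the notebook should fail early with a clear message. The sketch below is illustrative only: `gpu_names` is a hypothetical helper, and it assumes the `nvidia-smi` binary is on the `PATH` whenever a usable NVIDIA driver is installed.

```python
import shutil
import subprocess


def gpu_names():
    """Return the GPU names reported by nvidia-smi, or [] if none are visible."""
    if shutil.which("nvidia-smi") is None:
        return []  # driver tools not installed or not on PATH
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []  # nvidia-smi exists but could not talk to a GPU
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


names = gpu_names()
print(names if names else "No NVIDIA GPU detected")
```

On Colab, an empty result usually means the runtime type has not been switched to a GPU runtime.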
                                                

# Download the data

## Overview
The data we'll be working with is job postings data that a developer at a job listing firm might analyze to understand posting trends.

We'll download a curated copy of this [Kaggle dataset](https://www.kaggle.com/datasets/asaniczka/1-3m-linkedin-jobs-and-skills-2024/data?select=job_summary.csv) through the Kaggle API.

**Data License and Terms** <br>
As this dataset originates from Kaggle, it is governed by that dataset's license and terms of use, which is the Open Data Commons Attribution License. Review the license at https://opendatacommons.org/licenses/by/1-0/index.html before using the data.

**Are there restrictions on how I can use this data?** <br>
For each dataset a user elects to use, the user is responsible for checking whether the dataset license is fit for the intended purpose.

## Get the Data
First, [follow these instructions from Kaggle to download or update your Kaggle API token so that you can access the dataset](https://www.kaggle.com/discussions/general/74235).
- If you're using Colab, you can skip Step #1.
- If you're working on your local system, you can skip Step #2.

This should take about 1-2 minutes.

Next, run the code below, which should also take 1-2 minutes:

In [2]:
# Download the dataset through the Kaggle API
!kaggle datasets download -d asaniczka/1-3m-linkedin-jobs-and-skills-2024
# Unzip the archive to access its contents
!unzip 1-3m-linkedin-jobs-and-skills-2024.zip

Traceback (most recent call last):
  File "/usr/local/bin/kaggle", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/kaggle/cli.py", line 68, in main
    out = args.func(**command_args)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/kaggle/api/kaggle_api_extended.py", line 1741, in dataset_download_cli
    with self.build_kaggle_client() as kaggle:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/kaggle/api/kaggle_api_extended.py", line 688, in build_kaggle_client
    username=self.config_values['username'],
             ~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
KeyError: 'username'
unzip:  cannot find or open 1-3m-linkedin-jobs-and-skills-2024.zip, 1-3m-linkedin-jobs-and-skills-2024.zip.zip or 1-3m-linkedin-jobs-and-skills-2024.zip.ZIP.
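
A `KeyError: 'username'` like the one above typically means the Kaggle CLI could not find API credentials, so the download never ran and there was nothing to unzip. The following is a minimal pre-flight check, assuming the standard credential locations read by the Kaggle API: the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables, or a `kaggle.json` file in `~/.kaggle` (or in the directory named by `KAGGLE_CONFIG_DIR`). The helper name `kaggle_credentials_available` is hypothetical.

```python
import os
from pathlib import Path


def kaggle_credentials_available(home=None, env=None):
    """Return True if Kaggle API credentials appear to be configured.

    `home` and `env` are injectable only to make the check easy to test;
    by default the real home directory and environment are used.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)

    # Option 1: credentials supplied through environment variables.
    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return True

    # Option 2: a kaggle.json token file in the config directory.
    config_dir = Path(env.get("KAGGLE_CONFIG_DIR", home / ".kaggle"))
    return (config_dir / "kaggle.json").is_file()


if not kaggle_credentials_available():
    print("No Kaggle credentials found; set up your API token before downloading.")
```

Running this before the download cell turns the traceback into a short, actionable message.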


# Analysis using Standard Pandas

First, let's use pandas to read in the dataset files:

In [3]:
import pandas as pd
import numpy as np

In [4]:
pd

<module 'pandas' from '/usr/local/lib/python3.12/dist-packages/pandas/__init__.py'>

**Job Summary Dataset** (Dataset-1) : This dataset contains job summaries for each job link.

In [5]:
%%time
job_summary_df = pd.read_csv("job_summary.csv", dtype=('str'))
print("Dataset Size (in GB):",round(job_summary_df.memory_usage(
    deep=True).sum()/(1024**3),2))

FileNotFoundError: [Errno 2] No such file or directory: 'job_summary.csv'

This **8 GB (in-memory)** dataset took around 1 minute to load!

Let's examine the dataset entries along with their memory footprint and character length.

In [6]:
job_summary_df.head()

NameError: name 'job_summary_df' is not defined

The dataset contains job summaries from various job listings.

The `job_summary` column is particularly large, occupying **5 GB in size with a total of 5 billion characters**.

In [None]:
# Calculate memory usage of each column in GB
memory_usage_bytes = job_summary_df.memory_usage(deep=True)
memory_usage_gb = memory_usage_bytes / (1024 ** 3)

print("`job_summary` column size (in GB):", round(memory_usage_gb['job_summary'],1),
     "\n","`job_summary` column number of characters (in Bn):",
      round(job_summary_df['job_summary'].str.len().sum()/(10**9),2))
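
To see why `deep=True` matters for string columns, here is a small sketch on a toy frame with made-up values. It assumes object-dtype string columns, where the shallow figure counts only the array of pointers and the deep figure also counts the Python string objects they point to.

```python
import pandas as pd

# Toy stand-in for the real data (hypothetical values), kept as object dtype.
toy = pd.DataFrame(
    {
        "job_link": ["a", "b", "c"],
        "job_summary": ["short", "a much longer summary " * 50, "medium " * 10],
    },
    dtype=object,
)

shallow = toy.memory_usage(deep=False)  # pointer array only for object columns
deep = toy.memory_usage(deep=True)      # also includes the string payloads

total_chars = int(toy["job_summary"].str.len().sum())
print("shallow bytes:", shallow["job_summary"])
print("deep bytes:   ", deep["job_summary"])
print("characters:   ", total_chars)
```

For text-heavy columns the shallow number can understate the real footprint by orders of magnitude, which is why the cells above pass `deep=True`.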

**Job Skills Dataset** (Dataset-2): This dataset contains a mapping between job links and the skill tags associated with the link.

In [None]:
%%time
job_skills_df = pd.read_csv("job_skills.csv", dtype=('str'))
job_skills_df.info()

In [None]:
job_skills_df.head()

**Job Postings Dataset** (Dataset-3): This dataset contains demographic and other work-related details for each job posting.

In [None]:
%%time
job_postings_df = pd.read_csv("linkedin_job_postings.csv", dtype=('str'))
job_postings_df.info()

In [None]:
job_postings_df.head()

## Q. Which companies and roles have extremely long job summaries?

Long job summaries can be challenging to read, but they are essential for certain roles requiring specific subject matter expertise. It would be interesting to identify which job roles and companies have extremely long summaries.

Let's determine the length of each job summary using the `.str.len()` method in pandas:

In [None]:
# Calculate the length of each job summary

In [None]:
%%time
job_summary_df['summary_length'] = job_summary_df['job_summary'].str.len()
job_summary_df['summary_length'].head()
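
As a quick illustration of what `.str.len()` measures, the sketch below uses a tiny made-up Series. It assumes an object-dtype column: lengths are counted in characters rather than bytes, and missing values stay missing.

```python
import pandas as pd

s = pd.Series(["GPU", "données", None], dtype=object)

lengths = s.str.len()  # character counts; the missing entry becomes NaN
utf8_bytes = len("données".encode("utf-8"))  # "é" takes two bytes in UTF-8

print(lengths.tolist())
print("UTF-8 bytes in 'données':", utf8_bytes)
```

This is why the character count and the in-memory size of `job_summary` are related but not identical.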

To identify job roles and companies with the longest job summaries, we need to merge the two datasets using the `job_link` column.

In [None]:
%%time
df_merged=pd.merge(job_postings_df, job_summary_df, how="left", on="job_link")

Finally, let's look at the `job_title` and `company` combinations with the longest average job summary through data aggregation.

In [None]:
%%time
df_merged.groupby(['company',"job_title"]).agg({
    "summary_length":"mean"}).sort_values(by='summary_length', ascending = False).fillna(0)

We see that some specialized jobs, such as `Adoloscent Behavioural Therapist` and `Airside Project Manager`, have longer job summaries.
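
The merge-then-aggregate pattern used above can be traced on a toy example. The frames below contain made-up values and only the columns needed for the illustration; the sketch shows how a left merge keeps postings that have no summary and how the group mean skips those missing lengths.

```python
import pandas as pd

postings = pd.DataFrame(
    {
        "job_link": ["l1", "l2", "l3", "l4"],
        "company": ["Acme", "Acme", "Globex", "Globex"],
        "job_title": ["Analyst", "Analyst", "Engineer", "Engineer"],
    }
)
summaries = pd.DataFrame(
    {
        "job_link": ["l1", "l2", "l3"],  # l4 has no summary
        "summary_length": [100, 300, 50],
    }
)

# A left merge keeps every posting; l4 gets NaN for summary_length.
merged = pd.merge(postings, summaries, how="left", on="job_link")

mean_len = (
    merged.groupby(["company", "job_title"])
    .agg({"summary_length": "mean"})  # NaN values are skipped by the mean
    .sort_values(by="summary_length", ascending=False)
)
print(mean_len)
```

In the real cell, the trailing `.fillna(0)` only affects groups in which every posting lacks a summary.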

## Q. How does the length of job summaries vary by location?

Why stop here?

Another interesting trend would be to see whether job summary length changes with location. Ideally, the requirements for a role should not depend on where it is posted.

In [None]:
%%time
# Group by job_title and job_location, and calculate the mean of summary_length
grouped_df = df_merged.groupby(['job_title', 'job_location']).agg({'summary_length': 'mean'})

# Reset the index so job_title and job_location become regular columns
grouped_df = grouped_df.reset_index()

# Sort by job_title, job_location, and summary_length (all descending)
sorted_df = grouped_df.sort_values(by=['job_title', 'job_location','summary_length'],
                                   ascending=False).reset_index(drop=True).fillna(0)
sorted_df

Let's analyze the job role `LEAD SALES ASSOCIATE-FT` to see if the job summary changes with its postings across different locations.

In [None]:
# Isolate the records for this job role across locations
job_title_acc=sorted_df[sorted_df['job_title'] == 'LEAD SALES ASSOCIATE-FT'].reset_index(
    drop=True)[1:15]

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

plt.figure(figsize=(7, 3.5))
plt.barh(job_title_acc['job_location'],job_title_acc['summary_length'], color='skyblue')
plt.xlabel('Summary Length')
plt.ylabel('Job Location')  # the y-axis shows locations, not titles
plt.title('Job summary length for the Lead Sales Associate role across cities')
plt.gca().invert_yaxis()  # To display the highest values at the top
plt.tight_layout()
plt.show()

The `Lead Sales Associate` role has similar job summary lengths across various cities, indicating that job requirements for this role remain consistent regardless of location.

# Analysis with cuDF Pandas

Typically, you should load the `cudf.pandas` extension as the first step in your notebook, before importing any modules. Here, we explicitly restart the kernel to simulate that behavior.

In [None]:
get_ipython().kernel.do_shutdown(restart=True)

Note: We only added the `%load_ext cudf.pandas` line; the rest of the code remains the same.

In [None]:
%load_ext cudf.pandas
import pandas as pd
import numpy as np

In [None]:
pd
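
Outside a notebook, the `%load_ext` magic is not available. A minimal sketch of the equivalent for a plain Python script follows. `enable_cudf_pandas` is a hypothetical helper; `cudf.pandas.install()` is the programmatic entry point, and it needs to run before pandas is first imported. The sketch assumes that falling back to regular pandas is acceptable when cuDF cannot be loaded.

```python
def enable_cudf_pandas():
    """Try to turn on the cuDF pandas accelerator; return True on success."""
    try:
        import cudf.pandas
    except Exception:
        # cuDF may be missing, or may fail to load without a usable GPU.
        return False
    cudf.pandas.install()
    return True


accelerated = enable_cudf_pandas()

# Import pandas only after the accelerator has had a chance to install itself.
import pandas as pd

print("cudf.pandas active:", accelerated)
```

Alternatively, an unmodified script can be run with `python -m cudf.pandas script.py`.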

We'll run the same code as above to get a feel for what GPU acceleration brings to pandas workflows.

In [None]:
%time job_summary_df = pd.read_csv("job_summary.csv", dtype=('str'))
print("Dataset Size (in GB):",round(job_summary_df.memory_usage(
    deep=True).sum()/(1024**3),2))

The same dataset takes around 1.5 minutes to load with pandas. That's around a **5x speedup** with no changes to the code!
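
To compare runs outside of `%time`, wall-clock timings can be captured directly. The helpers below, `time_call` and `speedup`, are hypothetical names for a minimal sketch; the numbers in the usage line are placeholders, not measurements from this notebook.

```python
import time


def time_call(func, *args, **kwargs):
    """Run func and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def speedup(baseline_seconds, accelerated_seconds):
    """Return how many times faster the accelerated run was."""
    if accelerated_seconds <= 0:
        raise ValueError("accelerated_seconds must be positive")
    return baseline_seconds / accelerated_seconds


# Placeholder timings purely to show the arithmetic.
print(f"{speedup(10.0, 2.0):.1f}x")
```

For example, `time_call(pd.read_csv, "job_summary.csv", dtype="str")` would return the loaded frame together with its load time, which can then be compared across the pandas and `cudf.pandas` runs.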

Let's load the remaining two datasets as well:

In [None]:
%%time
job_skills_df = pd.read_csv("job_skills.csv", dtype=('str'))
job_postings_df = pd.read_csv("linkedin_job_postings.csv", dtype=('str'))

In [None]:
%%time
job_summary_df['summary_length'] = job_summary_df['job_summary'].str.len()
job_summary_df['summary_length'].head()

That was lightning fast! We went from around 10+ seconds (with pandas) to a few milliseconds.

In [None]:
%%time
df_merged=pd.merge(job_postings_df, job_summary_df, how="left", on="job_link")
df_merged.head()

In [None]:
%%time
df_merged.groupby(['company',"job_title"]).agg({
    "summary_length":"mean"}).sort_values(by='summary_length', ascending = False).fillna(0)

We went down from around 5 seconds to less than a second here. This is in line with our speedups on other operations!

In [None]:
%%time
# Group by job_title and job_location, and calculate the mean of summary_length
grouped_df = df_merged.groupby(['job_title', 'job_location']).agg({'summary_length': 'mean'})

# Reset the index so job_title and job_location become regular columns
grouped_df = grouped_df.reset_index()

# Sort by job_title, job_location, and summary_length (all descending)
sorted_df = grouped_df.sort_values(by=['job_title', 'job_location','summary_length'],
                                   ascending=False).reset_index(drop=True).fillna(0)
sorted_df

The acceleration is consistently 10x+ for complex aggregations and sorting that involve multiple columns.

# Summary

With cudf.pandas, you can keep using pandas as your primary dataframe library. When things start to get a little slow, just load the `cudf.pandas` extension and enjoy the incredible speedups.

If you like Google Colab and want to get peak `cudf.pandas` performance to process even larger datasets, Google Colab's paid tier includes both L4 and A100 GPUs (in addition to the T4 GPU this demo notebook is using).

To learn more about cudf.pandas, we encourage you to visit https://rapids.ai/cudf-pandas.

# Do you have any feedback for us?

Fill out this <a href="https://www.surveymonkey.com/r/TX3QQQR">quick survey</a>.

Raise an issue on our <a href="https://github.com/rapidsai/cudf/issues">GitHub repo</a>.