# How to use regex with Pandas DataFrame

- toc: true 
- badges: true
- categories: [Data Processing]
- permalink: /use-regex-with-pandas/


In this tutorial, we will go over some useful pandas functions that accept regular expressions for processing text. All of them are called through the `.str` accessor of a Series or Index.

 function  | description
  :----:   |:----:
contains() | Test whether a pattern or regex is contained within each string of a Series or Index.
count()    | Count occurrences of a pattern in each string of the Series/Index.
findall()  | Find all occurrences of a pattern or regular expression in each string of the Series/Index.
replace()  | Replace each occurrence of a pattern/regex in the Series/Index with a custom string.
split()    | Split strings around a given pattern.
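
Before turning to the real dataset, the five functions in the table can be tried on a tiny Series. The sketch below uses made-up sentences (they are not from AG News) and assumes a reasonably recent pandas in which `str.replace` accepts the `regex` argument.

```python
import pandas as pd

# Hypothetical toy data, only for illustration.
s = pd.Series(["the cat sat on the mat", "a dog", "the end"])

# contains: one boolean per row
has_the = s.str.contains(r"\bthe\b")

# count: number of matches per row
n_the = s.str.count(r"\bthe\b")

# findall: list of matches per row
all_the = s.str.findall(r"\bthe\b")

# replace: regex=True asks pandas to treat the pattern as a regex
upper_the = s.str.replace(r"\bthe\b", "THE", regex=True)

# split: list of pieces per row, here split on runs of whitespace
words = s.str.split(r"\s+")

print(has_the.tolist())
print(n_the.tolist())
print(all_the.tolist())
print(upper_the.tolist())
print(words.tolist())
```

Each call returns a new Series aligned with the original index, so the results can be assigned back as new columns of a DataFrame.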


**Create a DataFrame:**

In [1]:
#hide
!pip install datasets


In [2]:
#collapse-output
from datasets import load_dataset
agnews = load_dataset('ag_news')


In [3]:
# from Datasets to Pandas DataFrames
agnews.set_format(type="pandas")
df = agnews['train'][:]
df.head()

|   | text | label |
|:----:|:----|:----:|
| 0 | Wall St. Bears Claw Back Into the Black (Reute... | 2 |
| 1 | Carlyle Looks Toward Commercial Aerospace (Reu... | 2 |
| 2 | Oil and Economy Cloud Stocks' Outlook (Reuters... | 2 |
| 3 | Iraq Halts Oil Exports from Main Southern Pipe... | 2 |
| 4 | Oil prices soar to all-time record, posing new... | 2 |


## [contains](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.contains.html)
- find the rows whose text contains the whole word "business" (`\b` marks a word boundary, so "businesses" is not matched)

In [9]:
df[df['text'].str.contains(r'\bbusiness\b')].head()

|   | text | label |
|:----:|:----|:----:|
| 42 | Technology company sues five ex-employees A M... | 2 |
| 62 | Downhome Pinoy Blues, Intersecting Life Paths,... | 2 |
| 63 | The Real Time Modern Manila Blues: Bill Monroe... | 2 |
| 65 | What are the best cities for business in Asia?... | 2 |
| 74 | HP to Buy Synstar Hewlett-Packard will pay \$2... | 2 |
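
The pattern above is case-sensitive and assumes every row holds a string. A minimal sketch of two commonly useful options of `str.contains`, `case` and `na`, on hypothetical sample strings:

```python
import pandas as pd

# Hypothetical sample; the last entry is a missing value.
texts = pd.Series(
    ["Business is booming", "small business loans", "busy bee", None]
)

# case=False ignores upper/lower case;
# na=False makes missing values count as "no match" instead of NaN,
# which keeps the result usable as a boolean mask.
mask = texts.str.contains(r"\bbusiness\b", case=False, na=False)

matched = texts[mask]
print(matched.tolist())
```

Without `na=False`, a missing value would produce a missing entry in the mask, and boolean indexing with such a mask can raise an error.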


## [count](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.count.html)
- count the total number of times the word "business" occurs across all texts

In [11]:
df['text'].str.count(r'\bbusiness\b').sum()

2759
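
`str.count` returns one count per row; the `.sum()` above simply adds them up. The sketch below, on hypothetical strings, shows the per-row counts and how the `flags` argument can make the count case-insensitive.

```python
import re

import pandas as pd

# Hypothetical sample strings.
texts = pd.Series(
    ["Business news: business is good", "no match here", "BUSINESS"]
)

# Case-sensitive count per row (only lowercase "business" matches).
per_row = texts.str.count(r"\bbusiness\b")

# Same pattern, ignoring case via the standard re flag.
per_row_ci = texts.str.count(r"\bbusiness\b", flags=re.IGNORECASE)

total_ci = int(per_row_ci.sum())
print(per_row.tolist(), per_row_ci.tolist(), total_ci)
```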

## [findall](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.findall.html#)
- equivalent to re.findall()
- see another tutorial on [re.findall() and re.search()](https://www.intodeeplearning.com/use-regex-and-python-for-data-cleaning/)

- below is an example that finds every occurrence of the standalone word **a** in each text

In [23]:
df['text'].str.findall(r'\ba\b')

0                   []
1                  [a]
2                   []
3                  [a]
4                  [a]
              ...     
119995             [a]
119996    [a, a, a, a]
119997             [a]
119998              []
119999             [a]
Name: text, Length: 120000, dtype: object
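
As with `re.findall()`, a pattern with one capture group returns only the captured part of each match. A small sketch, assuming hypothetical headlines that carry a news source in parentheses:

```python
import pandas as pd

# Hypothetical headlines, only for illustration.
texts = pd.Series(
    ["Stocks rise (Reuters) and fall (AP)", "No source here", "Oil soars (AFP)"]
)

# The group (\w+) captures the word between literal parentheses.
sources = texts.str.findall(r"\((\w+)\)")

# Number of sources found in each row.
n_sources = sources.map(len)

print(sources.tolist())
print(n_sources.tolist())
```

Rows without a match get an empty list rather than a missing value, which makes the result easy to post-process.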

## [replace](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.replace.html)
- replace every occurrence of "today" or "Today" with "TODAYYYYYY"
- check the second-to-last row of the output!

In [26]:
df['text'].str.replace(r'\b[Tt]oday\b', 'TODAYYYYYY', regex=True)

0         Wall St. Bears Claw Back Into the Black (Reute...
1         Carlyle Looks Toward Commercial Aerospace (Reu...
2         Oil and Economy Cloud Stocks' Outlook (Reuters...
3         Iraq Halts Oil Exports from Main Southern Pipe...
4         Oil prices soar to all-time record, posing new...
                                ...                        
119995    Pakistan's Musharraf Says Won't Quit as Army C...
119996    Renteria signing a top-shelf deal Red Sox gene...
119997    Saban not going to Dolphins yet The Miami Dolp...
119998    TODAYYYYYY's NFL games PITTSBURGH at NY GIANTS...
119999    Nets get Carter from Raptors INDIANAPOLIS -- A...
Name: text, Length: 120000, dtype: object
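
Whether the pattern is interpreted as a regex depends on the `regex` argument, whose default has changed between pandas versions, so it is safer to pass it explicitly. The sketch below uses hypothetical strings to contrast `regex=True` with `regex=False`, and also shows a callable replacement.

```python
import pandas as pd

# Hypothetical sample strings.
texts = pd.Series(["Today's games", "Not today, maybe tomorrow", "todayish"])

pattern = r"\b[Tt]oday\b"

# Pattern treated as a regular expression.
as_regex = texts.str.replace(pattern, "TODAY", regex=True)

# Pattern treated as a literal string: nothing matches, nothing changes.
as_literal = texts.str.replace(pattern, "TODAY", regex=False)

# A callable receives each match object and returns the replacement.
upper = texts.str.replace(pattern, lambda m: m.group(0).upper(), regex=True)

print(as_regex.tolist())
print(as_literal.tolist())
print(upper.tolist())
```

The third string is left untouched in every case because `\b` requires a word boundary right after "today".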

## [split](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.split.html)
- split each text at the word "the"; the function returns a list of strings for each row
- check the first row of the output

In [33]:
df['text'].str.split(r"\bthe\b")

0         [Wall St. Bears Claw Back Into ,  Black (Reute...
1         [Carlyle Looks Toward Commercial Aerospace (Re...
2         [Oil and Economy Cloud Stocks' Outlook (Reuter...
3         [Iraq Halts Oil Exports from Main Southern Pip...
4         [Oil prices soar to all-time record, posing ne...
                                ...                        
119995    [Pakistan's Musharraf Says Won't Quit as Army ...
119996    [Renteria signing a top-shelf deal Red Sox gen...
119997    [Saban not going to Dolphins yet The Miami Dol...
119998    [Today's NFL games PITTSBURGH at NY GIANTS Tim...
119999    [Nets get Carter from Raptors INDIANAPOLIS -- ...
Name: text, Length: 120000, dtype: object
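
Note that the pieces above keep the spaces that surrounded "the". A minimal sketch on hypothetical strings that also consumes the surrounding whitespace, and that uses `n` and `expand` to get a fixed number of columns instead of lists:

```python
import pandas as pd

# Hypothetical sample strings.
texts = pd.Series(["into the black and the red", "no article here"])

# \s* on both sides swallows the spaces around the word "the".
pattern = r"\s*\bthe\b\s*"

# One list of pieces per row.
pieces = texts.str.split(pattern)

# n=1 splits at the first match only; expand=True returns a DataFrame
# with one column per piece (missing pieces are filled with a null value).
first_two = texts.str.split(pattern, n=1, expand=True)

print(pieces.tolist())
print(first_two)
```

A pattern longer than one character is treated as a regular expression by default here; a `regex` argument is available in newer pandas versions if you want to state this explicitly.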

<br><br><br>
**You may be interested**
- [how to load datasets from Hugging Face Datasets](https://www.intodeeplearning.com/how-to-load-datasets-from-hugging-face-datasets/)

**Reference**
- [How to use Regex in Pandas](https://kanoki.org/2019/11/12/how-to-use-regex-in-pandas/)