# Generator of (extended) (skippy) n-grams out of words or sentences

developed by Kow Kuroda (kow.kuroda@gmail.com)

This Jupyter notebook demonstrates how to use gen2_ngrams.py (or gen2_ngrams_cy.pyx), which was developed to improve on the usability of its predecessor, gen_ngrams.py.

There are two main differences from its predecessor. First, gen_skippy_ngrams(..) generates extended skippy n-grams when called with the "extended = True" option. Second, gen_skippy_ngrams(..) can generate inclusive n-grams (for a given n, the result also contains every k-gram with k < n), thereby dispensing with the incremental generation of n-grams from 1-grams.
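
To make the idea of inclusive n-grams concrete, here is a minimal sketch. It is not the implementation in gen2_ngrams.py: the function name `inclusive_ngrams` and its parameters are hypothetical, and edge cases (for example, n larger than the input length) may be handled differently by the real module.

```python
def inclusive_ngrams(segs, n, sep=""):
    """Illustrative only: collect all contiguous k-grams for k = 1..n.

    `segs` is a list of segments (characters or words) and `sep` is the
    string used to join them. This is an assumption-based sketch, not
    the code of gen2_ngrams.gen_ngrams.
    """
    grams = []
    # Cap k at the input length so that short inputs do not produce empty grams.
    for k in range(1, min(n, len(segs)) + 1):
        for start in range(len(segs) - k + 1):
            gram = sep.join(segs[start:start + k])
            if gram not in grams:  # keep first occurrence, preserve order
                grams.append(gram)
    return grams


bigrams_inclusive = inclusive_ngrams(list("阿羅漢"), 2)
print(bigrams_inclusive)
```

With n = 2, the result contains the 1-grams as well as the 2-grams, so a separate pass for 1-grams is unnecessary.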

Limitations
- Availability of the Cython enhancement is limited. It generally does not run on Apple Silicon chips such as M1 and M2 (M3 is not yet tested), although it does work under Python 3.10 on M1.

Creation
- 2025/08/19

Modifications
- 2025/08/21 minor changes;
- 2025/08/22 minor changes; Cython enhancement was implemented;
- 2025/08/28 fixed bugs in gen_skippy_ngrams in gen2_ngrams.py;
- 2025/08/29 adapted to release 1; adapted to gen2_ngrams_cy.py;

# Set up Cython

In [23]:
#conda update -n base -c defaults conda -y

In [24]:
## Install Cython (if needed)
#!conda uninstall cython -y # seems necessary in certain situations
#!conda install cython -y
## Try the following if the above fails
#!pip install cython --upgrade --force-reinstall
#!conda update -n base -c defaults conda -y

In [25]:
#!pip show cython

In [63]:
## whether to use Cython
use_Cython = False

In [64]:
## set to True if the Cython extension needs to be (re)built
build_Cython_extension = False
if build_Cython_extension:
    !python setup.py clean build_ext --inplace

In [65]:
## load the Cython version: it will not run on Apple Silicon chips such as M1 and M2
if use_Cython:
    try:
        %reload_ext Cython
    except ImportError:
        %load_ext Cython
    import gen2_ngrams_cy as gen_ngrams
else:
    import gen2_ngrams as gen_ngrams
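
Because the Cython build may be unavailable on some machines, one option is to fall back to the pure-Python module automatically instead of toggling `use_Cython` by hand. The following is only a sketch of that idea; the helper `import_first_available` is hypothetical and not part of gen2_ngrams.

```python
import importlib


def import_first_available(module_names):
    """Return the first module in `module_names` that imports successfully.

    Hypothetical helper for illustration; raises ImportError if none of
    the candidates can be imported.
    """
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            # This candidate is missing or failed to load; try the next one.
            continue
    raise ImportError(f"none of these modules could be imported: {module_names}")


# Demonstration with standard-library names so the snippet runs anywhere.
mod = import_first_available(["no_such_module_xyz_123", "json"])
print(mod.__name__)
```

In this notebook the call would presumably look like `gen_ngrams = import_first_available(["gen2_ngrams_cy", "gen2_ngrams"])`, assuming both files sit next to the notebook.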

# Set up data

In [66]:
analyze_words = True # if False, analyze sentential/phrasal objects

## parameters for analysis
if analyze_words:
    segmenter: str = r""
    sep_local: str = ""
else:
    segmenter: str = r" "
    sep_local: str = " "
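
The two `segmenter` settings above behave quite differently, so a small sketch may help. With Python 3.7 or later, `re.split` on an empty pattern splits between every character and also yields empty strings at both ends, which is why the later cells filter out zero-length pieces. The helper name `segment` is hypothetical.

```python
import re


def segment(doc, segmenter):
    """Split `doc` with the regex `segmenter` and drop empty pieces.

    Illustrative helper mirroring the list comprehensions used later
    in this notebook.
    """
    return [seg for seg in re.split(segmenter, doc) if len(seg) > 0]


# Word mode: an empty pattern splits the string into single characters.
print(re.split(r"", "阿羅漢"))   # raw result, including empty strings at both ends
print(segment("阿羅漢", r""))

# Sentence/phrase mode: a single space splits the string into words.
print(segment("it is a truth", r" "))
```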

In [67]:
import pathlib
if analyze_words:
    data_dir = 'data/words'
    files = list(pathlib.Path(data_dir).glob('buddhist-listed2.txt'))
else:
    data_dir = 'data/phrases'
    files = list(pathlib.Path(data_dir).glob('austen-j-sample100.txt'))
##
print(files)

##
file = files[0]
source_name = file.stem
print(f"source_name: {source_name}")

[PosixPath('data/words/buddhist-listed2.txt')]
source_name: buddhist-listed2


In [68]:
## get data
docs = file.read_text(encoding = 'utf-8').splitlines()

## lowercase
docs = [ doc.lower() for doc in docs if len(doc) > 0 ]
print(docs[:10])

['阿羅漢', '辟支仏', '転法輪', '十二因縁', '五蘊盛苦', '三法印', '四念処', '四神足', '五根五力', '七覚支']


# Generation of (extended) (skippy) n-grams

In [69]:
## flags
check: bool = False

## saving results
save_results: bool = False
save_dir: str = "saves"

In [70]:
## maximum value of n
max_n_for_ngram: int = 4

## max_gap_size
max_gap_size = 3

## n-gram
ngram_is_inclusive = True
#skippy_means_extended = True

## whether to generate n-grams as strings
generated_as_string: bool = True
generated_as_list: bool = not(generated_as_string)

In [71]:
## whether to recursively generate (n-1)-grams when the input has n_for_ngram or fewer elements
recursively = True

In [72]:
#!conda install pandas -y

In [73]:
import pandas as pd
columns0 = ['doc']
columns1 = [ f"{i}g" for i in range(1, max_n_for_ngram + 1)]
columns2 = [ f"sk{i}g" for i in range(1, max_n_for_ngram + 1)]
columns3 = [ f"xsk{i}g" for i in range(1, max_n_for_ngram + 1)]

used_columns = columns0 + columns1 + columns2 + columns3
df = pd.DataFrame(columns = used_columns)

## Normal 

In [74]:
## generate non-skippy n-grams
import re
for i, doc in enumerate(docs):
    ## update df for word
    df.loc[i,'doc'] = doc
    ##
    print(f"Processing word {i} [use_Cython: {use_Cython}]: {doc}")
    word_segs = [ x for x in re.split(segmenter, doc) if len(x) > 0 ]
    for j in range(1, max_n_for_ngram + 1):
        print(f"generating {j}-grams ...")
        ngrams = gen_ngrams.gen_ngrams(word_segs, j, inclusive = ngram_is_inclusive, sep = sep_local, as_list = generated_as_list, check = False)
        if check:
            print(ngrams)
        ## update df
        df.loc[i, f'{j}g'] = ngrams

Processing word 0 [use_Cython: False]: 阿羅漢
generating 1-grams ...
generating 2-grams ...
generating 3-grams ...
generating 4-grams ...
Processing word 1 [use_Cython: False]: 辟支仏
generating 1-grams ...
generating 2-grams ...
generating 3-grams ...
generating 4-grams ...
Processing word 2 [use_Cython: False]: 転法輪
generating 1-grams ...
generating 2-grams ...
generating 3-grams ...
generating 4-grams ...
Processing word 3 [use_Cython: False]: 十二因縁
generating 1-grams ...
generating 2-grams ...
generating 3-grams ...
generating 4-grams ...
Processing word 4 [use_Cython: False]: 五蘊盛苦
generating 1-grams ...
generating 2-grams ...
generating 3-grams ...
generating 4-grams ...
Processing word 5 [use_Cython: False]: 三法印
generating 1-grams ...
generating 2-grams ...
generating 3-grams ...
generating 4-grams ...
Processing word 6 [use_Cython: False]: 四念処
generating 1-grams ...
generating 2-grams ...
generating 3-grams ...
generating 4-grams ...
Processing word 7 [use_Cython: False]: 四神足
generating

In [75]:
df[columns0 + columns1]

Unnamed: 0,doc,1g,2g,3g,4g
0,阿羅漢,"[阿, 羅, 漢]","[阿, 羅, 漢, 阿羅, 羅漢]","[阿, 羅, 漢, 阿羅, 羅漢, 阿羅漢]",[阿羅漢]
1,辟支仏,"[辟, 支, 仏]","[辟, 支, 仏, 辟支, 支仏]","[辟, 支, 仏, 辟支, 支仏, 辟支仏]",[辟支仏]
2,転法輪,"[転, 法, 輪]","[転, 法, 輪, 転法, 法輪]","[転, 法, 輪, 転法, 法輪, 転法輪]",[転法輪]
3,十二因縁,"[十, 二, 因, 縁]","[十, 二, 因, 縁, 十二, 二因, 因縁]","[十, 二, 因, 縁, 十二, 二因, 因縁, 十二因, 二因縁]","[十, 二, 因, 縁, 十二, 二因, 因縁, 十二因, 二因縁, 十二因縁]"
4,五蘊盛苦,"[五, 蘊, 盛, 苦]","[五, 蘊, 盛, 苦, 五蘊, 蘊盛, 盛苦]","[五, 蘊, 盛, 苦, 五蘊, 蘊盛, 盛苦, 五蘊盛, 蘊盛苦]","[五, 蘊, 盛, 苦, 五蘊, 蘊盛, 盛苦, 五蘊盛, 蘊盛苦, 五蘊盛苦]"
...,...,...,...,...,...
195,両祖忌法要,"[両, 祖, 忌, 法, 要]","[両, 祖, 忌, 法, 要, 両祖, 祖忌, 忌法, 法要]","[両, 祖, 忌, 法, 要, 両祖, 祖忌, 忌法, 法要, 両祖忌, 祖忌法, 忌法要]","[両, 祖, 忌, 法, 要, 両祖, 祖忌, 忌法, 法要, 両祖忌, 祖忌法, 忌法要,..."
196,宗祖忌法要,"[宗, 祖, 忌, 法, 要]","[宗, 祖, 忌, 法, 要, 宗祖, 祖忌, 忌法, 法要]","[宗, 祖, 忌, 法, 要, 宗祖, 祖忌, 忌法, 法要, 宗祖忌, 祖忌法, 忌法要]","[宗, 祖, 忌, 法, 要, 宗祖, 祖忌, 忌法, 法要, 宗祖忌, 祖忌法, 忌法要,..."
197,御会式法要,"[御, 会, 式, 法, 要]","[御, 会, 式, 法, 要, 御会, 会式, 式法, 法要]","[御, 会, 式, 法, 要, 御会, 会式, 式法, 法要, 御会式, 会式法, 式法要]","[御, 会, 式, 法, 要, 御会, 会式, 式法, 法要, 御会式, 会式法, 式法要,..."
198,報恩講法要,"[報, 恩, 講, 法, 要]","[報, 恩, 講, 法, 要, 報恩, 恩講, 講法, 法要]","[報, 恩, 講, 法, 要, 報恩, 恩講, 講法, 法要, 報恩講, 恩講法, 講法要]","[報, 恩, 講, 法, 要, 報恩, 恩講, 講法, 法要, 報恩講, 恩講法, 講法要,..."


## Skippy

In [76]:
## generate regular skippy n-grams
import re
for i, doc in enumerate(docs):
    print(f"Processing word {i} [use_Cython: {use_Cython}]: {doc}")
    word_segs = [ seg for seg in re.split(segmenter, doc) if len(seg) > 0 ]
    for j in range(1, max_n_for_ngram + 1):
        print(f"generating skippy {j}-grams ...")
        ngrams = gen_ngrams.gen_skippy_ngrams(word_segs, j, extended = False, inclusive = ngram_is_inclusive, recursively = recursively, max_gap_size = max_gap_size, sep = sep_local, as_list = generated_as_list, check = False)
        if check:
            print(ngrams)
        ## update df
        df.loc[i, f'sk{j}g'] = ngrams

Processing word 0 [use_Cython: False]: 阿羅漢
generating skippy 1-grams ...
generating skippy 2-grams ...
generating skippy 3-grams ...
generating skippy 4-grams ...
Processing word 1 [use_Cython: False]: 辟支仏
generating skippy 1-grams ...
generating skippy 2-grams ...
generating skippy 3-grams ...
generating skippy 4-grams ...
Processing word 2 [use_Cython: False]: 転法輪
generating skippy 1-grams ...
generating skippy 2-grams ...
generating skippy 3-grams ...
generating skippy 4-grams ...
Processing word 3 [use_Cython: False]: 十二因縁
generating skippy 1-grams ...
generating skippy 2-grams ...
generating skippy 3-grams ...
generating skippy 4-grams ...
Processing word 4 [use_Cython: False]: 五蘊盛苦
generating skippy 1-grams ...
generating skippy 2-grams ...
generating skippy 3-grams ...
generating skippy 4-grams ...
Processing word 5 [use_Cython: False]: 三法印
generating skippy 1-grams ...
generating skippy 2-grams ...
generating skippy 3-grams ...
generating skippy 4-grams ...
Processing word 6 [u

In [77]:
df[columns0 + columns2]

Unnamed: 0,doc,sk1g,sk2g,sk3g,sk4g
0,阿羅漢,"[阿, 羅, 漢]","[阿, 羅, 漢, 阿羅, 羅漢, 阿…漢]","[阿, 羅, 漢, 阿羅, 羅漢, 阿羅漢, 阿…漢]","[阿, 羅, 漢, 阿羅, 羅漢, 阿羅漢, 阿…漢]"
1,辟支仏,"[辟, 支, 仏]","[辟, 支, 仏, 辟支, 支仏, 辟…仏]","[辟, 支, 仏, 辟支, 支仏, 辟支仏, 辟…仏]","[辟, 支, 仏, 辟支, 支仏, 辟支仏, 辟…仏]"
2,転法輪,"[転, 法, 輪]","[転, 法, 輪, 転法, 法輪, 転…輪]","[転, 法, 輪, 転法, 法輪, 転法輪, 転…輪]","[転, 法, 輪, 転法, 法輪, 転法輪, 転…輪]"
3,十二因縁,"[十, 二, 因, 縁]","[十, 二, 因, 縁, 十二, 二因, 因縁, 十…因, 二…縁, 十…縁]","[十, 二, 因, 縁, 十二, 二因, 因縁, 十二因, 十…因, 二因縁, 二…縁, 十...","[十, 二, 因, 縁, 十二, 二因, 因縁, 十二因, 十…因, 二因縁, 二…縁, 十..."
4,五蘊盛苦,"[五, 蘊, 盛, 苦]","[五, 蘊, 盛, 苦, 五蘊, 蘊盛, 盛苦, 五…盛, 蘊…苦, 五…苦]","[五, 蘊, 盛, 苦, 五蘊, 蘊盛, 盛苦, 五蘊盛, 五…盛, 蘊盛苦, 蘊…苦, 五...","[五, 蘊, 盛, 苦, 五蘊, 蘊盛, 盛苦, 五蘊盛, 五…盛, 蘊盛苦, 蘊…苦, 五..."
...,...,...,...,...,...
195,両祖忌法要,"[両, 祖, 忌, 法, 要]","[両, 祖, 忌, 法, 要, 両祖, 祖忌, 忌法, 法要, 両…忌, 祖…法, 忌…要,...","[両, 祖, 忌, 法, 要, 両祖, 祖忌, 忌法, 法要, 両祖忌, 両…忌, 祖忌法,...","[両, 祖, 忌, 法, 要, 両祖, 祖忌, 忌法, 法要, 両祖忌, 両…忌, 祖忌法,..."
196,宗祖忌法要,"[宗, 祖, 忌, 法, 要]","[宗, 祖, 忌, 法, 要, 宗祖, 祖忌, 忌法, 法要, 宗…忌, 祖…法, 忌…要,...","[宗, 祖, 忌, 法, 要, 宗祖, 祖忌, 忌法, 法要, 宗祖忌, 宗…忌, 祖忌法,...","[宗, 祖, 忌, 法, 要, 宗祖, 祖忌, 忌法, 法要, 宗祖忌, 宗…忌, 祖忌法,..."
197,御会式法要,"[御, 会, 式, 法, 要]","[御, 会, 式, 法, 要, 御会, 会式, 式法, 法要, 御…式, 会…法, 式…要,...","[御, 会, 式, 法, 要, 御会, 会式, 式法, 法要, 御会式, 御…式, 会式法,...","[御, 会, 式, 法, 要, 御会, 会式, 式法, 法要, 御会式, 御…式, 会式法,..."
198,報恩講法要,"[報, 恩, 講, 法, 要]","[報, 恩, 講, 法, 要, 報恩, 恩講, 講法, 法要, 報…講, 恩…法, 講…要,...","[報, 恩, 講, 法, 要, 報恩, 恩講, 講法, 法要, 報恩講, 報…講, 恩講法,...","[報, 恩, 講, 法, 要, 報恩, 恩講, 講法, 法要, 報恩講, 報…講, 恩講法,..."
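
The skippy entries in the rows above (for example 十…因 or 十…縁) suggest that a gap of any size between two selected segments is written as a single "…". The sketch below reproduces that pattern for a fixed n. It is not the code of gen_skippy_ngrams: the function name, the `gap_mark` parameter, and the reading of `max_gap` as "maximum number of skipped segments between two neighbouring selected segments" are all assumptions.

```python
from itertools import combinations


def skippy_ngrams(segs, n, max_gap=3, sep="", gap_mark="…"):
    """Illustrative sketch of (non-inclusive) skippy n-grams.

    Assumption: `max_gap` bounds how many segments may be skipped between
    two consecutive selected segments. Not the gen2_ngrams implementation.
    """
    results = []
    for idxs in combinations(range(len(segs)), n):
        # Number of skipped segments between each pair of neighbours.
        gaps = [b - a - 1 for a, b in zip(idxs, idxs[1:])]
        if any(g > max_gap for g in gaps):
            continue
        parts = [segs[idxs[0]]]
        for g, i in zip(gaps, idxs[1:]):
            if g > 0:
                parts.append(gap_mark)  # any gap size collapses to one mark
            parts.append(segs[i])
        gram = sep.join(parts)
        if gram not in results:
            results.append(gram)
    return results


print(skippy_ngrams(list("十二因縁"), 2))
```

Setting `max_gap=0` in this sketch yields only contiguous n-grams, which is one way to see regular n-grams as a special case of skippy n-grams.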


## Extended skippy

In [78]:
## generate extended skippy n-grams
import re, unicodedata
for i, doc in enumerate(docs):
    ## update df for word
    df.loc[i,'doc'] = doc
    
    print(f"Processing word {i} [use_Cython: {use_Cython}]: {doc}")
    ## Unicode normalization is necessary for proper handling of accents in languages like Irish and Welsh
    word_segs = [ seg for seg in re.split(segmenter, unicodedata.normalize('NFC', doc)) if len(seg) > 0 ]
    for j in range(1, max_n_for_ngram + 1):
        print(f"generating extended skippy {j}-grams ...")
        ngrams = gen_ngrams.gen_skippy_ngrams(word_segs, j, extended = True, inclusive = ngram_is_inclusive, recursively = recursively, max_gap_size = max_gap_size, sep = sep_local, as_list = generated_as_list, check = False)
        if check:
            print(ngrams)
        ## update df
        df.loc[i, f'xsk{j}g'] = ngrams

Processing word 0 [use_Cython: False]: 阿羅漢
generating extended skippy 1-grams ...
generating extended skippy 2-grams ...
generating extended skippy 3-grams ...
generating extended skippy 4-grams ...
Processing word 1 [use_Cython: False]: 辟支仏
generating extended skippy 1-grams ...
generating extended skippy 2-grams ...
generating extended skippy 3-grams ...
generating extended skippy 4-grams ...
Processing word 2 [use_Cython: False]: 転法輪
generating extended skippy 1-grams ...
generating extended skippy 2-grams ...
generating extended skippy 3-grams ...
generating extended skippy 4-grams ...
Processing word 3 [use_Cython: False]: 十二因縁
generating extended skippy 1-grams ...
generating extended skippy 2-grams ...
generating extended skippy 3-grams ...
generating extended skippy 4-grams ...
Processing word 4 [use_Cython: False]: 五蘊盛苦
generating extended skippy 1-grams ...
generating extended skippy 2-grams ...
generating extended skippy 3-grams ...
generating extended skippy 4-grams ...
Pro
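
The NFC normalization in the cell above matters because an accented letter can be stored either as one code point or as a base letter plus a combining mark. Without normalization, character-level segmentation would split the accent from its letter. A small sketch, using the Irish word "sláinte" as an arbitrary example:

```python
import unicodedata

# "sláinte" written with a decomposed á: 'a' followed by U+0301 (combining acute).
decomposed = "sla\u0301inte"
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), list(decomposed))  # the combining mark is its own element
print(len(composed), list(composed))      # á is a single element after NFC
```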

In [79]:
df[columns0 + columns3]

Unnamed: 0,doc,xsk1g,xsk2g,xsk3g,xsk4g
0,阿羅漢,"[阿…, …漢, …羅…]","[阿…, …漢, 阿羅…, 阿…漢, …羅漢, …羅…]","[阿…, …漢, 阿羅漢, 阿羅…, 阿…漢, …羅漢, …羅…]","[阿…, …漢, 阿羅漢, 阿羅…, 阿…漢, …羅漢, …羅…]"
1,辟支仏,"[辟…, …仏, …支…]","[辟…, …仏, 辟支…, 辟…仏, …支仏, …支…]","[辟…, …仏, 辟支仏, 辟支…, 辟…仏, …支仏, …支…]","[辟…, …仏, 辟支仏, 辟支…, 辟…仏, …支仏, …支…]"
2,転法輪,"[転…, …輪, …法…]","[転…, …輪, 転法…, 転…輪, …法輪, …法…]","[転…, …輪, 転法輪, 転法…, 転…輪, …法輪, …法…]","[転…, …輪, 転法輪, 転法…, 転…輪, …法輪, …法…]"
3,十二因縁,"[十…, …縁, …二…, …因…]","[十…, …縁, 十二…, …二…, …因縁, …因…, 十…因…, 十…縁, …二因…, ...","[十…, …縁, 十二…, …二…, …因縁, …因…, 十二因…, 十二…縁, 十…因縁,...","[十…, …縁, 十二…, …二…, …因縁, …因…, 十二因縁, 十二因…, 十二…縁,..."
4,五蘊盛苦,"[五…, …苦, …蘊…, …盛…]","[五…, …苦, 五蘊…, …蘊…, …盛苦, …盛…, 五…盛…, 五…苦, …蘊盛…, ...","[五…, …苦, 五蘊…, …蘊…, …盛苦, …盛…, 五蘊盛…, 五蘊…苦, 五…盛苦,...","[五…, …苦, 五蘊…, …蘊…, …盛苦, …盛…, 五蘊盛苦, 五蘊盛…, 五蘊…苦,..."
...,...,...,...,...,...
195,両祖忌法要,"[両…, …要, …祖…, …忌…, …法…]","[両…, …要, 両祖…, …祖…, …忌…, …法要, …法…, 両…忌…, …祖忌…, ...","[両…, …要, 両祖…, …祖…, …忌…, …法要, …法…, 両祖忌…, 両…忌…, ...","[両…, …要, 両祖…, …祖…, …忌…, …法要, …法…, 両祖忌…, 両…忌…, ..."
196,宗祖忌法要,"[宗…, …要, …祖…, …忌…, …法…]","[宗…, …要, 宗祖…, …祖…, …忌…, …法要, …法…, 宗…忌…, …祖忌…, ...","[宗…, …要, 宗祖…, …祖…, …忌…, …法要, …法…, 宗祖忌…, 宗…忌…, ...","[宗…, …要, 宗祖…, …祖…, …忌…, …法要, …法…, 宗祖忌…, 宗…忌…, ..."
197,御会式法要,"[御…, …要, …会…, …式…, …法…]","[御…, …要, 御会…, …会…, …式…, …法要, …法…, 御…式…, …会式…, ...","[御…, …要, 御会…, …会…, …式…, …法要, …法…, 御会式…, 御…式…, ...","[御…, …要, 御会…, …会…, …式…, …法要, …法…, 御会式…, 御…式…, ..."
198,報恩講法要,"[報…, …要, …恩…, …講…, …法…]","[報…, …要, 報恩…, …恩…, …講…, …法要, …法…, 報…講…, …恩講…, ...","[報…, …要, 報恩…, …恩…, …講…, …法要, …法…, 報恩講…, 報…講…, ...","[報…, …要, 報恩…, …恩…, …講…, …法要, …法…, 報恩講…, 報…講…, ..."
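
Judging from the rows above, extended skippy n-grams additionally mark material before the first and after the last selected segment, so that 阿…, …羅… and …漢 are distinct 1-grams of 阿羅漢. The sketch below reproduces that pattern for a fixed n. It is an assumption-based illustration rather than the code of gen_skippy_ngrams; in particular, the function name is hypothetical and `max_gap` is assumed to apply only to internal gaps, not to the leading or trailing ones.

```python
from itertools import combinations


def extended_skippy_ngrams(segs, n, max_gap=3, sep="", gap_mark="…"):
    """Illustrative sketch of (non-inclusive) extended skippy n-grams.

    Assumptions: internal gaps larger than `max_gap` are rejected, while
    leading and trailing gaps of any size are marked with one `gap_mark`.
    """
    results = []
    last = len(segs) - 1
    for idxs in combinations(range(len(segs)), n):
        if any(b - a - 1 > max_gap for a, b in zip(idxs, idxs[1:])):
            continue
        parts = []
        if idxs[0] > 0:
            parts.append(gap_mark)      # something precedes the first segment
        prev = None
        for i in idxs:
            if prev is not None and i - prev > 1:
                parts.append(gap_mark)  # internal gap
            parts.append(segs[i])
            prev = i
        if idxs[-1] < last:
            parts.append(gap_mark)      # something follows the last segment
        gram = sep.join(parts)
        if gram not in results:
            results.append(gram)
    return results


for n in (1, 2, 3):
    print(n, extended_skippy_ngrams(list("阿羅漢"), n))
```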


## Check differences

In [80]:
for i, row in df.iterrows():
    doc = row['doc']
    print("--------------")
    print(f"doc: {doc}")
    
    ## compare xsk, sk and norm
    for j in range(1, max_n_for_ngram + 1):
        print(" -------------")
        print(f"{j}g")
        norm_var = f"{j}g"
        sk_var = f"sk{j}g"
        xsk_var = f"xsk{j}g"
        norm  = list(df.loc[i,norm_var])
        print(f"norm: {norm}")
        sk  = list(df.loc[i,sk_var])
        print(f"sk: {sk}")
        xsk = list(df.loc[i,xsk_var])
        print(f"xsk: {xsk}")
        
        ## Differences
        D1 = [ x for x in norm if x not in sk and x not in xsk ]
        print(f"D1: x in norm, not in sk and xsk: {D1}")
        ##
        D2 = [ x for x in sk if x not in norm and x not in xsk ]
        print(f"D2: x in sk, not in norm and xsk: {D2}")
        ##
        D3 = [ x for x in xsk if x not in norm and x not in sk ]
        print(f"D3: x in xsk, not in norm and sk: {D3}")

        ## Commonalities
        C1 = [ x for x in xsk if x in sk and x in norm ]
        print(f"C1: x in xsk, sk, and norm: {C1}")
        ##
        C2 = [ x for x in xsk if x in sk ]
        print(f"C2: x in xsk and sk {C2}")
        ##
        C3 = [ x for x in sk if x in norm ]
        print(f"C3: x in sk and norm: {C3}")

--------------
doc: 阿羅漢
 -------------
1g
norm: ['阿', '羅', '漢']
sk: ['阿', '羅', '漢']
xsk: ['阿…', '…漢', '…羅…']
D1: x in norm, not in sk and xsk: []
D2: x in sk, not in norm and xsk: []
D3: x in xsk, not in norm and sk: ['阿…', '…漢', '…羅…']
C1: x in xsk, sk, and norm: []
C2: x in xsk and sk []
C3: x in sk and norm: ['阿', '羅', '漢']
 -------------
2g
norm: ['阿', '羅', '漢', '阿羅', '羅漢']
sk: ['阿', '羅', '漢', '阿羅', '羅漢', '阿…漢']
xsk: ['阿…', '…漢', '阿羅…', '阿…漢', '…羅漢', '…羅…']
D1: x in norm, not in sk and xsk: []
D2: x in sk, not in norm and xsk: []
D3: x in xsk, not in norm and sk: ['阿…', '…漢', '阿羅…', '…羅漢', '…羅…']
C1: x in xsk, sk, and norm: []
C2: x in xsk and sk ['阿…漢']
C3: x in sk and norm: ['阿', '羅', '漢', '阿羅', '羅漢']
 -------------
3g
norm: ['阿', '羅', '漢', '阿羅', '羅漢', '阿羅漢']
sk: ['阿', '羅', '漢', '阿羅', '羅漢', '阿羅漢', '阿…漢']
xsk: ['阿…', '…漢', '阿羅漢', '阿羅…', '阿…漢', '…羅漢', '…羅…']
D1: x in norm, not in sk and xsk: []
D2: x in sk, not in norm and xsk: []
D3: x in xsk, not in norm and sk: ['阿…', '…漢', '阿羅…
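
The comparisons above test membership against lists, which is fine for short words but scans each list repeatedly. For longer inputs, a set-based variant keeps the same order-preserving output with cheaper lookups. This is only a sketch: the helper names `only_in` and `common_to` are hypothetical, and the data is copied from the 2g output for 阿羅漢 shown above.

```python
def only_in(target, *others):
    """Items of `target` that appear in none of `others`, in original order."""
    seen = set().union(*others)
    return [x for x in target if x not in seen]


def common_to(target, *others):
    """Items of `target` that appear in every one of `others`, in original order."""
    other_sets = [set(o) for o in others]
    return [x for x in target if all(x in s for s in other_sets)]


norm = ['阿', '羅', '漢', '阿羅', '羅漢']
sk = ['阿', '羅', '漢', '阿羅', '羅漢', '阿…漢']
xsk = ['阿…', '…漢', '阿羅…', '阿…漢', '…羅漢', '…羅…']

print("D3:", only_in(xsk, norm, sk))
print("C2:", common_to(xsk, sk))
print("C3:", common_to(sk, norm))
```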

## Save results

In [81]:
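if save_results:
    ## create the output directory if it does not exist yet
    pathlib.Path(save_dir).mkdir(parents = True, exist_ok = True)
    file_name = f"{save_dir}/gen2_{source_name}-reg-sk-xsk-df.csv"
    df.to_csv(file_name, header = True)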

# end of file