# Generator of (extended) (skippy) n-grams out of words or sentences

developed by Kow Kuroda (kow.kuroda@gmail.com)

This Jupyter notebook demonstrates how to use gen2_ngrams.py (or gen2_ngrams_cy.pyx) developed to enhance the usability of its predecessor "gen_ngrams.py".

There are two main differences from its predecessor. First, gen_skippy_ngrams(..) generates extended skippy n-grams with "extended = True" option. Second, gen_skippy_ngrams(..) cann generate inclusive n-grams, thereby dispensing with incremental generation of n-grams from 1-grams.

Limitations
- Availablity of Cython-enhancement is limited. Apple Silicons like M1 and M2 (M3 is not tested yet) do not accept it, though it is available under Python 3.10 on M1.

Creation
- 2025/08/19

Modifications
- 2025/08/21 minor changes;
- 2025/08/22 i) minor changes; Cython-enhancement was implemented;

# Set up Cython

In [26]:
#conda update -n base -c defaults conda -y

In [27]:
## Cython の導入 (必要に応じて)
#!conda uninstall cython -y # seems necessary in certain situations
#!conda install cython -y
## Try the following if the above fails
#!pip install cython --upgrade --force-reinstall
#!conda update -n base -c defaults conda -y

In [28]:
#!pip show cython

In [29]:
## Cython を使うかどうか
use_Cython = False

In [30]:
## Cython extension の(再)構築が必要な場合は True に
build_Cython_extension = False
if use_Cython and build_Cython_extension:
    !python setup.py clean build_ext --inplace

In [31]:
## Cython 版の読込み
## will not run on Apple Silicons like M1, M2
if use_Cython:
    try:
        %reload_ext Cython
    except ImportError:
        %load_ext Cython
    import gen2_ngrams_cy as gen_ngrams
else:
    import gen2_ngrams as gen_ngrams

# Set up data

In [32]:
analyze_words = True # if False, analyze sentential/phrasal objects

## parameters for analysis
if analyze_words:
    segmenter: str = r""
    sep_local: str = ""
else:
    segmenter: str = r" "
    sep_local: str = " "

In [33]:
import pathlib
if analyze_words:
    data_dir = 'data/words'
    files = list(pathlib.Path(data_dir).glob('buddhist-listed2.txt'))
else:
    data_dir = 'data/phrases'
    files = list(pathlib.Path(data_dir).glob('austen-j-sample100.txt'))
##
print(files)

##
file = files[0]
source_name = file.stem
print(f"source_name: {source_name}")

[PosixPath('data/words/buddhist-listed2.txt')]
source_name: buddhist-listed2


In [34]:
## get data
docs = file.read_text(encoding = 'utf-8').splitlines()

## lowercase
docs = [ doc.lower() for doc in docs if len(doc) > 0 ]
print(docs[:10])

['阿羅漢', '辟支仏', '転法輪', '十二因縁', '五蘊盛苦', '三法印', '四念処', '四神足', '五根五力', '七覚支']


# Generation of (extended) (skippy) n-grams

In [35]:
## flags
check: bool = False

## saving results
save_results: bool = False
save_dir: str = "saves"

In [36]:
### n-gram
## n の最大値
max_n_for_ngram: int = 5

## n-gram
ngram_is_inclusive = True
#skippy_means_extended = True

## n-gram を文字列として生成するか否か
generated_as_string: bool = True
generated_as_list: bool = not(generated_as_string)

In [37]:
#!conda install pandas -y

In [38]:
import pandas as pd
columns0 = ['doc']
columns1 = [ f"xsk{i}g" for i in range(1, max_n_for_ngram + 1)]
columns2 = [ f"sk{i}g" for i in range(1, max_n_for_ngram + 1)]
columns3 = [ f"{i}g" for i in range(1, max_n_for_ngram + 1)]

used_columns = columns0 + columns1 + columns2 + columns3
df = pd.DataFrame(columns = used_columns)

In [39]:
## generate extended skippy n-grams
import re, unicodedata
for i, doc in enumerate(docs):
    print(f"Processing word {i} [use_Cython: {use_Cython}]: {doc}")
    ## Unicode normalization is necessay to proper handling of accents in languages like Irish and Welsh
    word_segs = [ seg for seg in re.split(segmenter, unicodedata.normalize('NFC', doc)) if len(seg) > 0 ]
    for j in range(1, max_n_for_ngram + 1):
        print(f"generating extended skippy {j}-grams ...")
        ngrams = gen_ngrams.gen_skippy_ngrams(word_segs, j, extended = True, inclusive = ngram_is_inclusive, sep = sep_local, as_list = generated_as_list, check = False)
        if check:
            print(ngrams)
        ## update df
        df.loc[i, f'xsk{j}g'] = ngrams

Processing word 0 [use_Cython: False]: 阿羅漢
generating extended skippy 1-grams ...
generating extended skippy 2-grams ...
generating extended skippy 3-grams ...
generating extended skippy 4-grams ...
generating extended skippy 5-grams ...
Processing word 1 [use_Cython: False]: 辟支仏
generating extended skippy 1-grams ...
generating extended skippy 2-grams ...
generating extended skippy 3-grams ...
generating extended skippy 4-grams ...
generating extended skippy 5-grams ...
Processing word 2 [use_Cython: False]: 転法輪
generating extended skippy 1-grams ...
generating extended skippy 2-grams ...
generating extended skippy 3-grams ...
generating extended skippy 4-grams ...
generating extended skippy 5-grams ...
Processing word 3 [use_Cython: False]: 十二因縁
generating extended skippy 1-grams ...
generating extended skippy 2-grams ...
generating extended skippy 3-grams ...
generating extended skippy 4-grams ...
generating extended skippy 5-grams ...
Processing word 4 [use_Cython: False]: 五蘊盛苦
gen

In [40]:
df[columns1]

Unnamed: 0,xsk1g,xsk2g,xsk3g,xsk4g,xsk5g
0,"[阿…, …羅…, …漢]","[阿羅…, 阿…漢, 阿…, …羅漢, …羅…, …漢]","[阿羅…, 阿…漢, 阿…, …羅漢, …羅…, …漢, 阿羅漢]","[阿羅…, 阿…漢, 阿…, …羅漢, …羅…, …漢, 阿羅漢]","[阿羅…, 阿…漢, 阿…, …羅漢, …羅…, …漢, 阿羅漢]"
1,"[辟…, …支…, …仏]","[辟支…, 辟…仏, 辟…, …支仏, …支…, …仏]","[辟支…, 辟…仏, 辟…, …支仏, …支…, …仏, 辟支仏]","[辟支…, 辟…仏, 辟…, …支仏, …支…, …仏, 辟支仏]","[辟支…, 辟…仏, 辟…, …支仏, …支…, …仏, 辟支仏]"
2,"[転…, …法…, …輪]","[転法…, 転…輪, 転…, …法輪, …法…, …輪]","[転法…, 転…輪, 転…, …法輪, …法…, …輪, 転法輪]","[転法…, 転…輪, 転…, …法輪, …法…, …輪, 転法輪]","[転法…, 転…輪, 転…, …法輪, …法…, …輪, 転法輪]"
3,"[十…, …二…, …因…, …縁]","[十二…, 十…因…, 十…縁, 十…, …二因…, …二…縁, …二…, …因縁, …因…...","[十二因…, 十二…縁, 十二…, 十…因縁, 十…因…, 十…縁, 十…, …二因縁, …...","[十二因…, 十二…縁, 十二…, 十…因縁, 十…因…, 十…縁, 十…, …二因縁, …...","[十二因…, 十二…縁, 十二…, 十…因縁, 十…因…, 十…縁, 十…, …二因縁, …..."
4,"[五…, …蘊…, …盛…, …苦]","[五蘊…, 五…盛…, 五…苦, 五…, …蘊盛…, …蘊…苦, …蘊…, …盛苦, …盛…...","[五蘊盛…, 五蘊…苦, 五蘊…, 五…盛苦, 五…盛…, 五…苦, 五…, …蘊盛苦, …...","[五蘊盛…, 五蘊…苦, 五蘊…, 五…盛苦, 五…盛…, 五…苦, 五…, …蘊盛苦, …...","[五蘊盛…, 五蘊…苦, 五蘊…, 五…盛苦, 五…盛…, 五…苦, 五…, …蘊盛苦, …..."
...,...,...,...,...,...
195,"[両…, …祖…, …忌…, …法…, …要]","[両祖…, 両…忌…, 両…法…, 両…要, 両…, …祖忌…, …祖…法…, …祖…要, ...","[両祖忌…, 両祖…法…, 両祖…要, 両祖…, 両…忌法…, 両…忌…要, 両…忌…, 両...","[両祖忌法…, 両祖忌…要, 両祖忌…, 両祖…法要, 両祖…法…, 両祖…要, 両祖…, ...","[両祖忌法…, 両祖忌…要, 両祖忌…, 両祖…法要, 両祖…法…, 両祖…要, 両祖…, ..."
196,"[宗…, …祖…, …忌…, …法…, …要]","[宗祖…, 宗…忌…, 宗…法…, 宗…要, 宗…, …祖忌…, …祖…法…, …祖…要, ...","[宗祖忌…, 宗祖…法…, 宗祖…要, 宗祖…, 宗…忌法…, 宗…忌…要, 宗…忌…, 宗...","[宗祖忌法…, 宗祖忌…要, 宗祖忌…, 宗祖…法要, 宗祖…法…, 宗祖…要, 宗祖…, ...","[宗祖忌法…, 宗祖忌…要, 宗祖忌…, 宗祖…法要, 宗祖…法…, 宗祖…要, 宗祖…, ..."
197,"[御…, …会…, …式…, …法…, …要]","[御会…, 御…式…, 御…法…, 御…要, 御…, …会式…, …会…法…, …会…要, ...","[御会式…, 御会…法…, 御会…要, 御会…, 御…式法…, 御…式…要, 御…式…, 御...","[御会式法…, 御会式…要, 御会式…, 御会…法要, 御会…法…, 御会…要, 御会…, ...","[御会式法…, 御会式…要, 御会式…, 御会…法要, 御会…法…, 御会…要, 御会…, ..."
198,"[報…, …恩…, …講…, …法…, …要]","[報恩…, 報…講…, 報…法…, 報…要, 報…, …恩講…, …恩…法…, …恩…要, ...","[報恩講…, 報恩…法…, 報恩…要, 報恩…, 報…講法…, 報…講…要, 報…講…, 報...","[報恩講法…, 報恩講…要, 報恩講…, 報恩…法要, 報恩…法…, 報恩…要, 報恩…, ...","[報恩講法…, 報恩講…要, 報恩講…, 報恩…法要, 報恩…法…, 報恩…要, 報恩…, ..."


In [41]:
## generate regular skippy n-grams
import re
for i, doc in enumerate(docs):
    print(f"Processing word {i} [use_Cython: {use_Cython}]: {doc}")
    word_segs = [ seg for seg in re.split(segmenter, doc) if len(seg) > 0 ]
    for j in range(1, max_n_for_ngram + 1):
        print(f"generating skippy {j}-grams ...")
        ngrams = gen_ngrams.gen_skippy_ngrams(word_segs, j, extended = False, inclusive = ngram_is_inclusive, sep = sep_local, as_list = generated_as_list, check = False)
        if check:
            print(ngrams)
        ## update df
        df.loc[i, f'sk{j}g'] = ngrams

Processing word 0 [use_Cython: False]: 阿羅漢
generating skippy 1-grams ...
generating skippy 2-grams ...
generating skippy 3-grams ...
generating skippy 4-grams ...
generating skippy 5-grams ...
Processing word 1 [use_Cython: False]: 辟支仏
generating skippy 1-grams ...
generating skippy 2-grams ...
generating skippy 3-grams ...
generating skippy 4-grams ...
generating skippy 5-grams ...
Processing word 2 [use_Cython: False]: 転法輪
generating skippy 1-grams ...
generating skippy 2-grams ...
generating skippy 3-grams ...
generating skippy 4-grams ...
generating skippy 5-grams ...
Processing word 3 [use_Cython: False]: 十二因縁
generating skippy 1-grams ...
generating skippy 2-grams ...
generating skippy 3-grams ...
generating skippy 4-grams ...
generating skippy 5-grams ...
Processing word 4 [use_Cython: False]: 五蘊盛苦
generating skippy 1-grams ...
generating skippy 2-grams ...
generating skippy 3-grams ...
generating skippy 4-grams ...
generating skippy 5-grams ...
Processing word 5 [use_Cython: Fa

In [42]:
df[columns2]

Unnamed: 0,sk1g,sk2g,sk3g,sk4g,sk5g
0,"[阿, 羅, 漢]","[阿羅…, 阿…漢, 阿, …羅漢, 羅, 漢]","[阿羅…, 阿…漢, 阿, …羅漢, 羅, 漢, 阿羅漢]","[阿羅…, 阿…漢, 阿, …羅漢, 羅, 漢, 阿羅漢]","[阿羅…, 阿…漢, 阿, …羅漢, 羅, 漢, 阿羅漢]"
1,"[辟, 支, 仏]","[辟支…, 辟…仏, 辟, …支仏, 支, 仏]","[辟支…, 辟…仏, 辟, …支仏, 支, 仏, 辟支仏]","[辟支…, 辟…仏, 辟, …支仏, 支, 仏, 辟支仏]","[辟支…, 辟…仏, 辟, …支仏, 支, 仏, 辟支仏]"
2,"[転, 法, 輪]","[転法…, 転…輪, 転, …法輪, 法, 輪]","[転法…, 転…輪, 転, …法輪, 法, 輪, 転法輪]","[転法…, 転…輪, 転, …法輪, 法, 輪, 転法輪]","[転法…, 転…輪, 転, …法輪, 法, 輪, 転法輪]"
3,"[十, 二, 因, 縁]","[十二…, 十…因…, 十…縁, 十, …二因…, …二…縁, 二, …因縁, 因, 縁]","[十二因…, 十二…縁, 十二…, 十…因縁, 十…因…, 十…縁, 十, …二因縁, …二...","[十二因…, 十二…縁, 十二…, 十…因縁, 十…因…, 十…縁, 十, …二因縁, …二...","[十二因…, 十二…縁, 十二…, 十…因縁, 十…因…, 十…縁, 十, …二因縁, …二..."
4,"[五, 蘊, 盛, 苦]","[五蘊…, 五…盛…, 五…苦, 五, …蘊盛…, …蘊…苦, 蘊, …盛苦, 盛, 苦]","[五蘊盛…, 五蘊…苦, 五蘊…, 五…盛苦, 五…盛…, 五…苦, 五, …蘊盛苦, …蘊...","[五蘊盛…, 五蘊…苦, 五蘊…, 五…盛苦, 五…盛…, 五…苦, 五, …蘊盛苦, …蘊...","[五蘊盛…, 五蘊…苦, 五蘊…, 五…盛苦, 五…盛…, 五…苦, 五, …蘊盛苦, …蘊..."
...,...,...,...,...,...
195,"[両, 祖, 忌, 法, 要]","[両祖…, 両…忌…, 両…法…, 両…要, 両, …祖忌…, …祖…法…, …祖…要, 祖...","[両祖忌…, 両祖…法…, 両祖…要, 両祖…, 両…忌法…, 両…忌…要, 両…忌…, 両...","[両祖忌法…, 両祖忌…要, 両祖忌…, 両祖…法要, 両祖…法…, 両祖…要, 両祖…, ...","[両祖忌法…, 両祖忌…要, 両祖忌…, 両祖…法要, 両祖…法…, 両祖…要, 両祖…, ..."
196,"[宗, 祖, 忌, 法, 要]","[宗祖…, 宗…忌…, 宗…法…, 宗…要, 宗, …祖忌…, …祖…法…, …祖…要, 祖...","[宗祖忌…, 宗祖…法…, 宗祖…要, 宗祖…, 宗…忌法…, 宗…忌…要, 宗…忌…, 宗...","[宗祖忌法…, 宗祖忌…要, 宗祖忌…, 宗祖…法要, 宗祖…法…, 宗祖…要, 宗祖…, ...","[宗祖忌法…, 宗祖忌…要, 宗祖忌…, 宗祖…法要, 宗祖…法…, 宗祖…要, 宗祖…, ..."
197,"[御, 会, 式, 法, 要]","[御会…, 御…式…, 御…法…, 御…要, 御, …会式…, …会…法…, …会…要, 会...","[御会式…, 御会…法…, 御会…要, 御会…, 御…式法…, 御…式…要, 御…式…, 御...","[御会式法…, 御会式…要, 御会式…, 御会…法要, 御会…法…, 御会…要, 御会…, ...","[御会式法…, 御会式…要, 御会式…, 御会…法要, 御会…法…, 御会…要, 御会…, ..."
198,"[報, 恩, 講, 法, 要]","[報恩…, 報…講…, 報…法…, 報…要, 報, …恩講…, …恩…法…, …恩…要, 恩...","[報恩講…, 報恩…法…, 報恩…要, 報恩…, 報…講法…, 報…講…要, 報…講…, 報...","[報恩講法…, 報恩講…要, 報恩講…, 報恩…法要, 報恩…法…, 報恩…要, 報恩…, ...","[報恩講法…, 報恩講…要, 報恩講…, 報恩…法要, 報恩…法…, 報恩…要, 報恩…, ..."


In [None]:
## get differences between sk and xsk
for i, row in df.iterrows():
    print(row)
    for i in range(1, max_n_for_ngram + 1):
        print(f"checking {i}g")
        xsk = list(df.loc[i:,f"xsk{j}g"])
        sk  = list(df.loc[i:,f"sk{j}g"])
        C = [ x for x in xsk if x in sk ]
        D1 = [ x for x in xsk if not x in sk ]
        D2 = [ x for x in sk if not x in xsk ]

doc                                    阿羅漢
xsk1g                        [阿…, …羅…, …漢]
xsk2g         [阿羅…, 阿…漢, 阿…, …羅漢, …羅…, …漢]
xsk3g    [阿羅…, 阿…漢, 阿…, …羅漢, …羅…, …漢, 阿羅漢]
xsk4g    [阿羅…, 阿…漢, 阿…, …羅漢, …羅…, …漢, 阿羅漢]
xsk5g    [阿羅…, 阿…漢, 阿…, …羅漢, …羅…, …漢, 阿羅漢]
sk1g                             [阿, 羅, 漢]
sk2g              [阿羅…, 阿…漢, 阿, …羅漢, 羅, 漢]
sk3g         [阿羅…, 阿…漢, 阿, …羅漢, 羅, 漢, 阿羅漢]
sk4g         [阿羅…, 阿…漢, 阿, …羅漢, 羅, 漢, 阿羅漢]
sk5g         [阿羅…, 阿…漢, 阿, …羅漢, 羅, 漢, 阿羅漢]
1g                               [阿, 羅, 漢]
2g                       [阿, 羅, 漢, 阿羅, 羅漢]
3g                  [阿, 羅, 漢, 阿羅, 羅漢, 阿羅漢]
4g                  [阿, 羅, 漢, 阿羅, 羅漢, 阿羅漢]
5g                  [阿, 羅, 漢, 阿羅, 羅漢, 阿羅漢]
Name: 0, dtype: object
checking 1g
checking 2g
checking 3g
checking 4g
checking 5g
doc                                    辟支仏
xsk1g                        [辟…, …支…, …仏]
xsk2g         [辟支…, 辟…仏, 辟…, …支仏, …支…, …仏]
xsk3g    [辟支…, 辟…仏, 辟…, …支仏, …支…, …仏, 辟支仏]
xsk4g    [辟支…, 辟…仏, 辟…, …支仏, …支…, …仏, 辟支仏]
xsk5g    [辟支…,

In [57]:
for c in C:
    print(f"c: {c}")

In [58]:
for d1 in D1:
    print(f"d1: {d1}")

d1: ['三法…', '三…印', '三', '…法印', '法', '印', '三法印']
d1: ['四念…', '四…処', '四', '…念処', '念', '処', '四念処']
d1: ['四神…', '四…足', '四', '…神足', '神', '足', '四神足']
d1: ['五根五…', '五根…力', '五根…', '五…五力', '五…五…', '五…力', '五', '…根五力', '…根五…', '…根…力', '根', '…五力', '五', '力', '五根五力']
d1: ['七覚…', '七…支', '七', '…覚支', '覚', '支', '七覚支']
d1: ['三十七道…', '三十七…品', '三十七…', '三十…道品', '三十…道…', '三十…品', '三十…', '三…七道品', '三…七道…', '三…七…品', '三…七…', '三…道品', '三…道…', '三…品', '三', '…十七道品', '…十七道…', '…十七…品', '…十七…', '…十…道品', '…十…道…', '…十…品', '十', '…七道品', '…七道…', '…七…品', '七', '…道品', '道', '品', '三十七道品']
d1: ['十善…', '十…戒', '十', '…善戒', '善', '戒', '十善戒']
d1: ['六神…', '六…通', '六', '…神通', '神', '通', '六神通']
d1: ['四無量…', '四無…心', '四無…', '四…量心', '四…量…', '四…心', '四', '…無量心', '…無量…', '…無…心', '無', '…量心', '量', '心', '四無量心']
d1: ['慈悲喜…', '慈悲…捨', '慈悲…', '慈…喜捨', '慈…喜…', '慈…捨', '慈', '…悲喜捨', '…悲喜…', '…悲…捨', '悲', '…喜捨', '喜', '捨', '慈悲喜捨']
d1: ['八風不…', '八風…動', '八風…', '八…不動', '八…不…', '八…動', '八', '…風不動', '…風不…', '…風…動', '風', '…不動', '不', '動', '八風不動']
d1: ['不退…', '不…転', '不', 

In [59]:
for d2 in D2:
    print(f"d2: {d2}")

d2: ['三法…', '三…印', '三…', '…法印', '…法…', '…印', '三法印']
d2: ['四念…', '四…処', '四…', '…念処', '…念…', '…処', '四念処']
d2: ['四神…', '四…足', '四…', '…神足', '…神…', '…足', '四神足']
d2: ['五根五…', '五根…力', '五根…', '五…五力', '五…五…', '五…力', '五…', '…根五力', '…根五…', '…根…力', '…根…', '…五力', '…五…', '…力', '五根五力']
d2: ['七覚…', '七…支', '七…', '…覚支', '…覚…', '…支', '七覚支']
d2: ['三十七道…', '三十七…品', '三十七…', '三十…道品', '三十…道…', '三十…品', '三十…', '三…七道品', '三…七道…', '三…七…品', '三…七…', '三…道品', '三…道…', '三…品', '三…', '…十七道品', '…十七道…', '…十七…品', '…十七…', '…十…道品', '…十…道…', '…十…品', '…十…', '…七道品', '…七道…', '…七…品', '…七…', '…道品', '…道…', '…品', '三十七道品']
d2: ['十善…', '十…戒', '十…', '…善戒', '…善…', '…戒', '十善戒']
d2: ['六神…', '六…通', '六…', '…神通', '…神…', '…通', '六神通']
d2: ['四無量…', '四無…心', '四無…', '四…量心', '四…量…', '四…心', '四…', '…無量心', '…無量…', '…無…心', '…無…', '…量心', '…量…', '…心', '四無量心']
d2: ['慈悲喜…', '慈悲…捨', '慈悲…', '慈…喜捨', '慈…喜…', '慈…捨', '慈…', '…悲喜捨', '…悲喜…', '…悲…捨', '…悲…', '…喜捨', '…喜…', '…捨', '慈悲喜捨']
d2: ['八風不…', '八風…動', '八風…', '八…不動', '八…不…', '八…動', '八…', '…風不動', '…風不…', '…風…動', '…風

In [47]:
## generate non-skippy n-grams
import re
for i, doc in enumerate(docs):
    ## update df for word
    df.loc[i,'doc'] = doc
    ##
    print(f"Processing word {i} [use_Cython: {use_Cython}]: {doc}")
    word_segs = [ x for x in re.split(segmenter, doc) if len(x) > 0 ]
    for j in range(1, max_n_for_ngram + 1):
        print(f"generating {j}-grams ...")
        ngrams = gen_ngrams.gen_ngrams(word_segs, j, inclusive = ngram_is_inclusive, sep = sep_local, as_list = generated_as_list, check = False)
        if check:
            print(ngrams)
        ## update df
        df.loc[i, f'{j}g'] = ngrams

Processing word 0 [use_Cython: False]: 阿羅漢
generating 1-grams ...
generating 2-grams ...
generating 3-grams ...
generating 4-grams ...
generating 5-grams ...
Processing word 1 [use_Cython: False]: 辟支仏
generating 1-grams ...
generating 2-grams ...
generating 3-grams ...
generating 4-grams ...
generating 5-grams ...
Processing word 2 [use_Cython: False]: 転法輪
generating 1-grams ...
generating 2-grams ...
generating 3-grams ...
generating 4-grams ...
generating 5-grams ...
Processing word 3 [use_Cython: False]: 十二因縁
generating 1-grams ...
generating 2-grams ...
generating 3-grams ...
generating 4-grams ...
generating 5-grams ...
Processing word 4 [use_Cython: False]: 五蘊盛苦
generating 1-grams ...
generating 2-grams ...
generating 3-grams ...
generating 4-grams ...
generating 5-grams ...
Processing word 5 [use_Cython: False]: 三法印
generating 1-grams ...
generating 2-grams ...
generating 3-grams ...
generating 4-grams ...
generating 5-grams ...
Processing word 6 [use_Cython: False]: 四念処
generat

In [48]:
df[columns3]

Unnamed: 0,1g,2g,3g,4g,5g
0,"[阿, 羅, 漢]","[阿, 羅, 漢, 阿羅, 羅漢]","[阿, 羅, 漢, 阿羅, 羅漢, 阿羅漢]","[阿, 羅, 漢, 阿羅, 羅漢, 阿羅漢]","[阿, 羅, 漢, 阿羅, 羅漢, 阿羅漢]"
1,"[辟, 支, 仏]","[辟, 支, 仏, 辟支, 支仏]","[辟, 支, 仏, 辟支, 支仏, 辟支仏]","[辟, 支, 仏, 辟支, 支仏, 辟支仏]","[辟, 支, 仏, 辟支, 支仏, 辟支仏]"
2,"[転, 法, 輪]","[転, 法, 輪, 転法, 法輪]","[転, 法, 輪, 転法, 法輪, 転法輪]","[転, 法, 輪, 転法, 法輪, 転法輪]","[転, 法, 輪, 転法, 法輪, 転法輪]"
3,"[十, 二, 因, 縁]","[十, 二, 因, 縁, 十二, 二因, 因縁]","[十, 二, 因, 縁, 十二, 二因, 因縁, 十二因, 二因縁]","[十, 二, 因, 縁, 十二, 二因, 因縁, 十二因, 二因縁, 十二因縁]","[十, 二, 因, 縁, 十二, 二因, 因縁, 十二因, 二因縁, 十二因縁]"
4,"[五, 蘊, 盛, 苦]","[五, 蘊, 盛, 苦, 五蘊, 蘊盛, 盛苦]","[五, 蘊, 盛, 苦, 五蘊, 蘊盛, 盛苦, 五蘊盛, 蘊盛苦]","[五, 蘊, 盛, 苦, 五蘊, 蘊盛, 盛苦, 五蘊盛, 蘊盛苦, 五蘊盛苦]","[五, 蘊, 盛, 苦, 五蘊, 蘊盛, 盛苦, 五蘊盛, 蘊盛苦, 五蘊盛苦]"
...,...,...,...,...,...
195,"[両, 祖, 忌, 法, 要]","[両, 祖, 忌, 法, 要, 両祖, 祖忌, 忌法, 法要]","[両, 祖, 忌, 法, 要, 両祖, 祖忌, 忌法, 法要, 両祖忌, 祖忌法, 忌法要]","[両, 祖, 忌, 法, 要, 両祖, 祖忌, 忌法, 法要, 両祖忌, 祖忌法, 忌法要,...","[両, 祖, 忌, 法, 要, 両祖, 祖忌, 忌法, 法要, 両祖忌, 祖忌法, 忌法要,..."
196,"[宗, 祖, 忌, 法, 要]","[宗, 祖, 忌, 法, 要, 宗祖, 祖忌, 忌法, 法要]","[宗, 祖, 忌, 法, 要, 宗祖, 祖忌, 忌法, 法要, 宗祖忌, 祖忌法, 忌法要]","[宗, 祖, 忌, 法, 要, 宗祖, 祖忌, 忌法, 法要, 宗祖忌, 祖忌法, 忌法要,...","[宗, 祖, 忌, 法, 要, 宗祖, 祖忌, 忌法, 法要, 宗祖忌, 祖忌法, 忌法要,..."
197,"[御, 会, 式, 法, 要]","[御, 会, 式, 法, 要, 御会, 会式, 式法, 法要]","[御, 会, 式, 法, 要, 御会, 会式, 式法, 法要, 御会式, 会式法, 式法要]","[御, 会, 式, 法, 要, 御会, 会式, 式法, 法要, 御会式, 会式法, 式法要,...","[御, 会, 式, 法, 要, 御会, 会式, 式法, 法要, 御会式, 会式法, 式法要,..."
198,"[報, 恩, 講, 法, 要]","[報, 恩, 講, 法, 要, 報恩, 恩講, 講法, 法要]","[報, 恩, 講, 法, 要, 報恩, 恩講, 講法, 法要, 報恩講, 恩講法, 講法要]","[報, 恩, 講, 法, 要, 報恩, 恩講, 講法, 法要, 報恩講, 恩講法, 講法要,...","[報, 恩, 講, 法, 要, 報恩, 恩講, 講法, 法要, 報恩講, 恩講法, 講法要,..."


## 結果の保存

In [49]:
if save_results:
    file_name = f"{save_dir}/{source_name}-reg-sk-xsk-df.csv"
    df.to_csv(file_name, header = True)

# end of file