## Pre- and Post-Processing Code for Chapter 5

<p style='text-align: justify;'>This notebook contains code for the pre- and post-processing steps applied to the corpora used in Chapter 5: lemmatization, word counts, unification of racial identifiers, building a concordance, using word-embedding models to identify semantically similar words, and extracting semantic clusters associated with specific racial identifiers. Owing to copyright restrictions, the base texts on which these steps rely cannot be made public, so the code is provided for reference only. I have also made it generic for both the fiction and Kindai magazine collections. To apply it to either corpus, change the directory names to point to the appropriate folders.</p>

<p style='text-align: justify;'>The bulk of the computational analysis for this chapter is done with Python files described in more detail in the <b>Read Me</b> file. I have provided a post-processed copy of the fiction corpus so that the word-embedding and cluster analysis can be reproduced.</p>

### Import Libraries

In [1]:
import pandas as pd
from pandas import Series, DataFrame
import numpy as np
from scipy import stats
import os
import sys
import re
from shutil import copyfile
import MeCab  #CHECK "MECABRC" FILE TO SEE WHICH DICTIONARY YOU ARE USING
mecab = MeCab.Tagger("")  #using unidic
import collections
import operator
import nltk
import math
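try:
    from sklearn.metrics import jaccard_similarity_score
except ImportError:
    #removed in scikit-learn 0.23; for the token lists compared below it was equivalent to accuracy_score
    from sklearn.metrics import accuracy_score as jaccard_similarity_score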
import xlsxwriter
import openpyxl

### Load Functions for Cleaning and Pre-Processing

In [2]:
#some cleaning and pre-processing functions for Japanese corpus

#this first one should only be used for non-tokenized texts; basically cleans them for tokenization step
def strip_chap_titles(raw):
    #get rid of chapter titles that use Chinese numbers with or without surrounding parentheses
    raw = re.sub(r'（*([一二三四五六七八九十])+(）)*\n', '', raw)
    #get rid of chapter titles that use full-width digits (０ included so that titles such as １０ match)
    raw = re.sub(r'[０-９]+\n', '', raw)
    raw = re.sub(r'[第弐拾章参壱一二三四五六七八九十]+\n', '', raw)
    raw = re.sub(r'『', r'「', raw)   #replace all 『 with 「
    raw = re.sub(r'』', r'」', raw)   #replace all 』 with 」
    raw = re.sub(r'\n', '', raw)  #strips all newlines
    raw = re.sub(r'\r', '', raw)  #strips all returns
    #raw = re.sub(r'\s', '', raw)
    return raw

#might need to run this separately for already tokenized files
def bracket_cleaner(raw):
    raw = re.sub(r'［[^］]+］', '', raw)   #replace annotations in brackets ([#...])
    raw = re.sub(r'\s+', ' ', raw)                         #get rid of double spaces
    return raw

puncs = ['、','。','「','」','…','！','――','？','ゝ','『','』','（','）','／','＼','々','ーーー','］','・','ゞ','［','<','〔','〕',
         '＃','△','※','＊']

def cleaner(text):
    for punc in puncs:
        text = re.sub(punc, '', text)
    text = re.sub(r'\s+', ' ', text)                         #get rid of double spaces
    return text

def remove_stopwords(tokens, stopwords):
    new_list = [token for token in tokens if token not in stopwords]
    return new_list

#this function computes percentage of text that is dialogue
def percent_dialogue(text):
    no_quotes = re.sub(r'「[^」]*」', '', text)   #eliminate all dialogue passages for single bracket quotes
    per_dialogue = (len(text)-len(no_quotes))/len(text)
    return per_dialogue

def punct(text):
    punctuation = ['、','。','…','！','？']
    count = 0
    
    #strip dialogue
    no_dialogue = re.sub(r'「[^」]*」', '', text)
    
    #count punctuation marks in the non-dialogue text
    for word in punctuation:
        instances = re.findall(word, no_dialogue)
        count += len(instances)

    return count/len(no_dialogue)
    
def geo_words(text):
    path = r'WordLists\Empire_Place_Names.txt'
    with open(path, encoding='utf-8') as f:
        #skip blank lines so that empty tokens are never counted as place names
        place_names = [w for w in f.read().split('\n') if w]
    
    #grab instances of "colonies" first (tokenized as two words; \s matches either kind of space)
    count = len(re.findall(r'植民\s地', text))
    
    #tokenize text and search for geographic names
    tokens = re.split(r'\s', text)
    for word in place_names:
        count += tokens.count(word)
    
    return count
    
    
def race_words(text, race_terms):
    count = 0
    
    #search and tabulate
    for word in race_terms:
        instances = re.findall(word, text)
        count += len(instances)
    
    return count

def get_stopwords(path):
    #read one stopword per line; the with-block closes the file
    with open(path, encoding='utf-8') as f:
        words = f.read()
    return re.split(r'\n', words)
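
<p style='text-align: justify;'>The order of the cleaning steps matters: bracketed annotations have to be removed before punctuation, because the punctuation list itself contains the full-width brackets. The sketch below illustrates this on a made-up sample line. The names <code>strip_annotations</code> and <code>strip_punctuation</code> are hypothetical stand-ins for <code>bracket_cleaner</code> and <code>cleaner</code>, with a shortened punctuation list and a final <code>strip()</code> that the notebook functions do not apply.</p>

```python
import re

# Made-up sample line in the style of a tokenized Aozora text (not from the corpus).
sample = '彼 は 「 行く ［＃傍点］ 」 と 言う 。'

def strip_annotations(text):
    # drop editorial notes in full-width brackets, then collapse whitespace
    text = re.sub(r'［[^］]+］', '', text)
    return re.sub(r'\s+', ' ', text)

def strip_punctuation(text, marks=('、', '。', '「', '」', '［', '］')):
    # plain string replacement, so no mark is ever read as a regex pattern
    for mark in marks:
        text = text.replace(mark, '')
    return re.sub(r'\s+', ' ', text).strip()

# intended order: annotations first, punctuation second
cleaned = strip_punctuation(strip_annotations(sample))

# wrong order: the brackets vanish first, so the annotation text is left behind
wrong_order = strip_annotations(strip_punctuation(sample))

print(cleaned)
print(wrong_order)
```

<p style='text-align: justify;'>In the wrong order the annotation body survives as an ordinary token, which is why the later cells call <code>bracket_cleaner</code> before <code>cleaner</code>.</p>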

### Load Metadata Corpora (choose either Fiction or Kindai Magazine Corpus)

In [3]:
#LOAD in metadata for the fiction corpus
df = pd.read_excel(r"Data\Fiction_Meta.xlsx", sheet_name='Sheet1')
df = df.reset_index(drop=True)
df.shape

#LOAD in metadata for kindai magazine corpus
#df = pd.read_excel(r'Data\Kindai_Meta.xlsx', sheet_name='Sheet1')
#df = df[df['FILTER'] != 'YES']
#df.shape

(1905, 38)

### Lemmatize Corpus 

In [72]:
#Set the source path where pre-tokenized or non-tokenized texts are stored
CORPUS_PATH = r"Corpora\AozoraFictionTokenized\\"
#Set the output path where lemmatized texts will be stored
OUTPUT = r"Corpora\UnidicLemma\\"

for k in df.index:
    #get the tokenized text
    source_text = CORPUS_PATH + str(df.WORK_ID[k]) + ".txt"
    with open(source_text, encoding="utf-8") as raw_text:
        raw = raw_text.read()
    
    #eliminate white space
    raw = re.sub(r'\s', '', raw)
    
    # create a list of all tokens
    node = mecab.parseToNode(raw)
    node = node.next   #skip the beginning-of-sentence node
    tokens = []
    while node:
        #extract the lemma form (field 7 of the unidic feature string)
        features = node.feature.split(',')
        if len(features) > 7:  #some words don't have a lemma form
            tokens.append(features[7])
        else:   #if not, just add the plain token
            tokens.append(node.surface)
        node = node.next
    
    #now print back out; the with-block closes the file
    with open(OUTPUT + str(df.WORK_ID[k]) + ".txt", "w", encoding="utf-8") as f:
        for token in tokens:
            f.write(token + ' ')
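
<p style='text-align: justify;'>The lemma lookup in the loop above can be isolated as a small function that works on a feature string alone, which makes it testable without MeCab installed. This is a minimal sketch: <code>lemma_from_feature</code> is a hypothetical helper, the two feature strings are shortened illustrations rather than real parser output, and treating <code>*</code> as "no lemma" is an assumption that the notebook code does not make.</p>

```python
def lemma_from_feature(surface, feature, lemma_index=7):
    """Return the lemma field of a comma-separated feature string, or the surface form."""
    fields = feature.split(',')
    # fall back to the surface form when the field is missing or empty
    if len(fields) > lemma_index and fields[lemma_index] not in ('', '*'):
        return fields[lemma_index]
    return surface

# Illustrative, shortened feature strings (assumed layout: lemma in field 7).
known = '動詞,一般,*,*,下一段-バ行,連用形-一般,タベル,食べる,食べ,タベ'
unknown = '名詞,普通名詞,一般,*,*,*'

print(lemma_from_feature('食べ', known))
print(lemma_from_feature('ほげ', unknown))
```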

### Calculate Number of Tokens and Word Types

In [None]:
#create column in data frame to store word token counts
df['TOKENS'] = Series('', index=df.index)

#set path where lemmatized texts are stored
CORPUS_PATH = r'Corpora\UnidicLemma\\'  #raw string; otherwise \U is read as a unicode escape
all_tokens = []

for k in df.index:
    #get the tokenized text
    source_text = CORPUS_PATH + str(int(df.WORK_ID[k])) + '.txt'
    raw_text = open(source_text, encoding="utf-8")       #grab text
    text = raw_text.read()
    
    #remove punctuation
    text = cleaner(text)
    
    #split the text into a list of individual tokens
    tokens = re.split(r' ', text)
    
    #keep track of tokens per document
    df.at[k, 'TOKENS'] = len(tokens)
    
    #add to global list
    all_tokens += tokens 

print(len(all_tokens))
types = set(all_tokens)
print(len(types))
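
<p style='text-align: justify;'>Because the lemmatized files end with a trailing space, splitting on a single space yields an empty string as the last element, which would be counted as a token. The toy example below, with invented documents and a hypothetical <code>tokenize</code> helper, shows token and type counts with empty strings filtered out. Whether to filter them in the cell above is a judgment call that affects the reported totals.</p>

```python
# Invented two-document corpus; each text ends with a trailing space,
# as the lemmatized files written earlier in this notebook do.
docs = {'101': '猫 が 鳴く ', '102': '猫 が 走る 猫 '}

def tokenize(text):
    # split on single spaces and drop the empty strings this leaves behind
    return [tok for tok in text.split(' ') if tok]

tokens_per_doc = {doc_id: len(tokenize(text)) for doc_id, text in docs.items()}
all_tokens = [tok for text in docs.values() for tok in tokenize(text)]

n_tokens = len(all_tokens)      # running words
n_types = len(set(all_tokens))  # distinct words

# for comparison: the unfiltered split counts one extra (empty) element
raw_split_len = len(docs['101'].split(' '))

print(tokens_per_doc, n_tokens, n_types, raw_split_len)
```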

### Calculate Total and Document Frequency of Specific Keywords

In [None]:
#Set path where corpus is stored
CORPUS_PATH = r"Corpora\UnidicLemma\\"

keywords_dict = {'朝鮮人':[0,0], '中国人':[0,0], '西洋人':[0,0], '日本人':[0,0], '黒人':[0,0], '部落民':[0,0],
                 '土人':[0,0], '東洋人':[0,0], '外国人':[0,0]}
                 
#iterate through all texts and count keywords; also keep track of number of texts in which keyword appears
for k in df.index:
    #get the tokenized text
    source_text = CORPUS_PATH + str(int(df.WORK_ID[k])) + ".txt"
    raw_text = open(source_text, encoding="utf-8")       #grab text
    raw = raw_text.read()
    
    for key in keywords_dict:
        count = re.findall(key, raw)
        keywords_dict[key][0] += len(count)  #first element is total count
        #keep track of number of docs
        if len(count) > 0:
            keywords_dict[key][1] += 1      #second element is doc count
        
print(keywords_dict)
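
<p style='text-align: justify;'>The two numbers stored per keyword are the total number of occurrences and the number of documents containing at least one occurrence. A minimal sketch of the same bookkeeping on in-memory strings follows. The function name <code>keyword_frequencies</code> and the sample documents are invented for illustration.</p>

```python
def keyword_frequencies(texts, keywords):
    """Map each keyword to [total occurrences, number of texts containing it]."""
    stats = {kw: [0, 0] for kw in keywords}
    for text in texts:
        for kw in keywords:
            n = text.count(kw)       # non-overlapping literal matches
            stats[kw][0] += n        # total count
            if n > 0:
                stats[kw][1] += 1    # document count
    return stats

# Invented sample documents.
texts = ['日本人 と 外国人 と 日本人', '外国人 だ', '猫 が 鳴く']
stats = keyword_frequencies(texts, ['日本人', '外国人', '土人'])
print(stats)
```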

### Count Categories of Words (e.g., racial terms; geographical terms) and add to DataFrame

In [14]:
#Set path where corpus to be analyzed is stored
CORPUS_PATH = r'Corpora\UnidicLemma\\'

#add columns to your data frame to store words counts
df['RACEWORDS'] = Series('',index=df.index)
df['GEO_WORDS'] = Series('',index=df.index)

#specify list of racial identifiers to count; geography terms come from a pre-specified list 
race_terms = ['朝鮮人', '中国人', '西洋人','部落民','土人']

for k in df.index:
    #get the tokenized text; record zero counts if the file cannot be read
    try:
        source_text = CORPUS_PATH + str(int(df.WORK_ID[k])) + ".txt"
        with open(source_text, encoding="utf-8") as raw_text:
            raw = raw_text.read()
        #use df.at for both columns; chained assignment (df.GEO_WORDS.loc[k] = ...) may not write through
        df.at[k, 'GEO_WORDS'] = geo_words(raw)
        df.at[k, 'RACEWORDS'] = race_words(raw, race_terms)
    except (OSError, ValueError):
        df.at[k, 'GEO_WORDS'] = 0
        df.at[k, 'RACEWORDS'] = 0

### Save Results to Excel File

In [16]:
import xlsxwriter
import openpyxl
#the context manager saves and closes the file (ExcelWriter.save() was removed in pandas 2.0)
with pd.ExcelWriter(r'Results\Fiction_Meta_Temp.xlsx', engine='xlsxwriter') as writer:
    df.to_excel(writer, sheet_name='Sheet1')

### Eliminate Spaces for Key Race Terms

In [None]:
# Owing to the way Unidic tokenizes, key race terms need to be condensed to single word units
# This ensures that they are treated as a single word in the ensuing analysis

# Point to folder where lemmatized corpus is stored
CORPUS_PATH = r"Corpora\UnidicLemma\\"

# order of search is important in some cases (e.g., tuples 3 and 4) to avoid writing over one term with another
race_terms = [('朝鮮 人', '朝鮮人'), ('朝鮮 民族', '朝鮮民族'), ('朝鮮 の 人 々','朝鮮の人々'), ('朝鮮 の 人','朝鮮の人'),
             ('朝鮮 の 人達','朝鮮の人達'),('半島 の 人達','半島の人達'),('鮮人 達','鮮人達'),('セン 女','鮮女'),('半島 人','半島人'),
             ('中国 人', '中国人'), ('中国 民族','中国民族'),('中国 民衆', '中国民衆'),('中国 の 民衆','中国の民衆'),
             ('中国 の 人 々','中国の人々'),('中国 の 人','中国の人'),('支那 人','支那人'),('支那 の 人','支那の人'),
             ('支那 の 民族','支那の民族'), ('支那 の 民衆','支那の民衆'), ('西洋 人','西洋人'), ('西洋 の 人 々','西洋の人々'),
             ('西洋 の 人','西洋の人'), ('オウシュウ-外国 人','欧州人'), ('外国 人','外国人'), ('日本 人','日本人'), 
             ('日本 の 人 々','日本の人々'), ('日本 の 人','日本の人'), ('ヤマト 民族','ヤマト民族'), ('日本 民族','日本民族'),
             ('日本 の 民族','日本の民族'), ('日本 民衆','日本民衆'), ('日本 の 民衆','日本の民衆'),('日本 国民','日本国民'),
             ('日本 の 国民','日本の国民'), ('ジャップ-Jap', 'ジャップ'), ('内地 人','内地人'),('新 平民','新平民'),('部落 民','部落民'),
             ('アフリカ-Africa 人', 'アフリカ-Africa人'), ('東洋 人','東洋人'),('東洋 の 人','東洋の人'),('黄 人','黄人'),
             ('黄 人種','黄人種'),('有色 人種','有色人種'), ('アジア-Asia 人','アジア-Asia人'),('野蛮 人','野蛮人'),
             ('本島 人','本島人'),('タカサゴ 族','タカサゴ族'),('リュウキュウ 人','リュウキュウ人'), ('オキナワ 人','オキナワ人')]

altered_docs = 0

for k in df.index:
    #get the text
    source_text = CORPUS_PATH + str(df.WORK_ID[k]) + ".txt"
    with open(source_text, encoding="utf-8") as raw_text:
        raw = raw_text.read()
    
    changed = False
    for pair in race_terms:
        if re.search(pair[0], raw):
            changed = True
            raw = re.sub(pair[0], pair[1], raw)  #replace all instances of the race term with condensed version
    
    if changed:
        altered_docs += 1
    
    #now print back out (this overwrites the lemmatized file in place)
    with open(CORPUS_PATH + str(df.WORK_ID[k]) + ".txt", "w", encoding="utf-8") as f:
        f.write(raw)

print(altered_docs)
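
<p style='text-align: justify;'>Why the order of the search pairs matters can be seen in a two-pair example: if the shorter pattern runs first, it consumes part of the longer one, which then never matches. One way to make the order irrelevant, sketched below, is to sort the pairs by pattern length before replacing. The <code>condense</code> helper is hypothetical and is not how the cell above is written.</p>

```python
# Two pairs taken from the list above, deliberately in the "wrong" order.
pairs = [('朝鮮 の 人', '朝鮮の人'), ('朝鮮 の 人 々', '朝鮮の人々')]
text = '朝鮮 の 人 々 が 来る'

# naive pass in list order: the short pattern fires first
naive = text
for spaced, joined in pairs:
    naive = naive.replace(spaced, joined)

def condense(text, pairs):
    # longest pattern first, so a short pattern cannot pre-empt a longer one
    for spaced, joined in sorted(pairs, key=lambda p: len(p[0]), reverse=True):
        text = text.replace(spaced, joined)
    return text

ordered = condense(text, pairs)
print(naive)
print(ordered)
```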

### Merge Racial Identifiers into Unified Terms

In [None]:
# Merge the race terms into single terms in the corpus; Must run previous cell first

CORPUS_PATH = r"Corpora\UnidicLemma\\"
OUTPUT_PATH = r"Corpora\UnidicLemmaMerged\\"

#list of unified terms
unified_terms = ['朝鮮人','中国人','西洋人','外国人','日本人','部落民','黒人','東洋人','土人']
#set group number based on element in above list; 0 indicates first item
race_group = 0

#sub-lists of terms that will be replaced by unified term; order of sub-lists and unified terms must correspond 
all_merge_terms = [['朝鮮民族','朝鮮の人々','朝鮮の人',' 鮮人','半島人','朝鮮の人達','半島の人達','鮮人達','鮮女','在日','ヨボ',
                    '韓人'],
                   ['中国民族','中国民衆','中国の民衆','中国の人','中国の人々','シナ 人','支那人','支那の人','支那の民族',
                    '支那の民衆','華僑'],
                   ['西洋の人','白人','毛唐','欧州人','西洋の人々'],
                   ['外人','異人'],
                   ['日本の人々','日本の人','大和民族','日本民族','日本の民族','日本民衆','日本の民衆','日本国民',
                    '日本の国民','ジャップ','和人','内地人','邦人'],
                   ['穢多','非人','新平民'],
                   ['黒ん坊','アフリカ-Africa人','ニグロ'],
                   ['東洋の人','黄人','黄人種','有色人種','アジア-Asia人'],
                   ['蕃人','蛮族','蕃族','生蕃','熟蕃','タカサゴ族','本島人',' 蛮人','野蛮人']]

#select the set of merge terms based on race group
merge_terms = all_merge_terms[race_group]

#initialize counter
altered_docs = 0

for k in df.index:
    #get the text
    source_text = CORPUS_PATH + df.FILE_ID[k]
    with open(source_text, encoding="utf-8") as raw_text:
        raw = raw_text.read()
    
    changed = False
    #replace longer variants first so that a short variant cannot overwrite part of a longer one
    for word in sorted(merge_terms, key=len, reverse=True):
        if re.search(word, raw):
            changed = True
            #' 鮮人' and ' 蛮人' are matched with a leading space, so that space has to be put back
            prefix = ' ' if word.startswith(' ') else ''
            raw = re.sub(word, prefix + unified_terms[race_group], raw)
    
    if changed:
        altered_docs += 1
    
    #now print back out
    with open(OUTPUT_PATH + df.FILE_ID[k], "w", encoding="utf-8") as f:
        f.write(raw)

print(altered_docs)
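
<p style='text-align: justify;'>An alternative to substring replacement is to merge at the token level, which handles all groups in one pass and cannot match inside a longer word. The sketch below assumes every variant is a single space-delimited token, which does not hold for entries such as <code>'シナ 人'</code> in the lists above. The mapping is a small excerpt, and <code>merge_tokens</code> and <code>merge_substrings</code> are hypothetical helpers.</p>

```python
# Excerpt of the variant -> unified-term mapping (illustrative subset only).
variant_to_unified = {'鮮人': '朝鮮人', '半島人': '朝鮮人', '支那人': '中国人'}

def merge_tokens(text, mapping):
    # replace whole tokens only; tokens without a mapping are kept as they are
    return ' '.join(mapping.get(tok, tok) for tok in text.split(' '))

def merge_substrings(text, mapping):
    # substring replacement, shown for contrast
    for variant, unified in mapping.items():
        text = text.replace(variant, unified)
    return text

text = '朝鮮人 と 鮮人 と 支那人'
by_token = merge_tokens(text, variant_to_unified)
by_substring = merge_substrings(text, variant_to_unified)
print(by_token)
print(by_substring)
```

<p style='text-align: justify;'>The substring version rewrites the <code>鮮人</code> inside <code>朝鮮人</code>, which is the problem the leading-space entries in the merge lists work around.</p>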

### Generate a Frequency Table for Corpus

In [105]:
#Set path where corpus is stored
CORPUS_PATH = r"Corpora\UnidicLemma\\"

all_tokens = []

#set this flag before you run
rmv_stopwords = True

#specify stopword list to use (e.g., "stopwords.txt" for fiction and "stopwordsNonFic.txt" for Kindai magazine corpus)
if rmv_stopwords == True:
    stopwords = get_stopwords(r'WordLists\stopwords.txt')

for k in df.index:
    #get the tokenized text
    source_text = CORPUS_PATH + str(df.WORK_ID[k]) + ".txt"
    raw_text = open(source_text, encoding="utf-8")       #grab text
    text = raw_text.read()
    
    text = bracket_cleaner(text)  #clean brackets (need to do this first)
    #remove punctuation
    text = cleaner(text)
    
    #split the text into a list of individual tokens
    tokens = re.split(r' ', text)
    #while '' in tokens: tokens.remove('')  #remove blank spaces
    
    if rmv_stopwords == True:
        tokens = remove_stopwords(tokens, stopwords)

    #add to global list
    all_tokens += tokens 

#produce the frequency list
fdist = nltk.FreqDist(all_tokens)
freq_pairs = fdist.items()
sort_freq_pairs = sorted(freq_pairs, key=lambda x:x[1], reverse=True)  #sort by decreasing frequency

token_count = len(all_tokens)

#create a dictionary to store word-frequency pairs
word_freqs = {}

#fill dictionary with pairs
for item in sort_freq_pairs:
    word_freqs[item[0]] = item[1]
    
freqs_df = DataFrame.from_dict(word_freqs, orient='index')  #convert dict to dataframe
freqs_df = freqs_df.rename(columns={0:'frequency'})    #rename column
freqs_df = freqs_df.sort_values(by='frequency', ascending=False)   #sort by frequency

#compute relative frequencies
freqs_df['rel_freq'] = freqs_df.frequency / token_count
freqs_df = freqs_df.reset_index()

# OPTIONAL: print out the results to an excel file
#import xlsxwriter
#import openpyxl
#writer = pd.ExcelWriter(r'Results\LemmaFreqTable.xlsx', engine='xlsxwriter')
#freqs_df.to_excel(writer, sheet_name='Sheet1')
#writer.close()   #ExcelWriter.save() was removed in pandas 2.0
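
<p style='text-align: justify;'>The core of the frequency table, counting tokens, sorting by decreasing count, and dividing by the total, can be sketched with the standard library alone. The token list below is invented. The cell above uses <code>nltk.FreqDist</code> and a pandas DataFrame for the same purpose.</p>

```python
from collections import Counter

# Invented token list standing in for all_tokens.
tokens = ['猫', '犬', '猫', '鳥', '猫', '犬']

counts = Counter(tokens)
total = sum(counts.values())

# rows of (word, frequency, relative frequency), most frequent first
table = [(word, n, n / total) for word, n in counts.most_common()]

for row in table:
    print(row)
```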

### Generate a Concordance for Close Analysis of Word Contexts

In [None]:
#Set location where lemmatized and term-unified corpus is stored
CORPUS_PATH = r"Corpora\UnidicLemmaMerged\\"

#list of terms to include in concordance
seed_terms = ['朝鮮人','中国人','西洋人','日本人','黒人','部落民','土人','東洋人','外国人']

#set window size (# of word tokens) on each side of seed term
window_size = 15

#set this optional flag
rmv_stopwords = True
#specify stopword list to use (e.g., "stopwords.txt" for fiction and "stopwordsNonFic.txt" for Kindai magazine corpus)
if rmv_stopwords == True:
    stopwords = get_stopwords(r'WordLists\stopwords.txt')

#OPTIONAL: prepare file to print out results
#f = open(r"C:\Users\Hoyt\Dropbox\SemanticsRace\LemmaConcordance.txt", "w", encoding="utf-8")

#set up data structure to hold info
metadata = {'text_id': [], 'seed_term': [], 'index_pos': [], 'left_tokens': [], 'right_tokens': []}

hits = 0

#iterate through all texts and record the context window around every occurrence of a seed term
for k in df.index:
    current_position = 0
    
    #get the text
    text_id = str(df.WORK_ID[k]) + ".txt"
    source_text = CORPUS_PATH + str(df.WORK_ID[k]) + ".txt"
    raw_text = open(source_text, encoding="utf-8")       #grab text
    text = raw_text.read()
    
    text = bracket_cleaner(text)  #clean brackets for non-unicode kanji; need to do this first
    #remove punctuation
    text = cleaner(text)
    
    #tokenize the text
    tokens = re.split(r' ', text)
    
    #remove stopwords
    if rmv_stopwords == True:
        tokens = remove_stopwords(tokens, stopwords)
    
    #iterate through tokens in search of seed_terms
    for current_position, token in enumerate(tokens):
        #check for seed term
        if token in seed_terms:
            hits += 1
            #slices are clamped at the text boundaries, so no start/end-of-text checks are needed;
            #note that the right context holds window_size - 1 tokens
            left = ' '.join(tokens[max(0, current_position-window_size):current_position])
            right = ' '.join(tokens[current_position+1:current_position+window_size])
            
            metadata['text_id'].append(text_id)
            metadata['seed_term'].append(token)
            metadata['index_pos'].append(current_position)
            metadata['left_tokens'].append(left)
            metadata['right_tokens'].append(right)
            #f.write(text_id + ' ' + token + ' ' + str(current_position) + '\t' + left + ' SEED_TERM ' + right + '\n')

#f.close()

#turn results into a Data Frame and sort
concord_df = DataFrame(metadata, columns=['text_id', 'seed_term', 'index_pos', 'left_tokens', 'right_tokens'])
concord_df = concord_df.sort_values(by = ["seed_term", "text_id", "index_pos"])
concord_df = concord_df.reset_index(drop=True)
#concord_df[0:5]  #inspect first 5 rows

# OPTIONAL: print out the results to an excel file
#import xlsxwriter
#import openpyxl
#writer = pd.ExcelWriter(r'Results\LemmaConcordance.xlsx', engine='xlsxwriter')
#concord_df.to_excel(writer, sheet_name='Sheet1')
#writer.close()   #ExcelWriter.save() was removed in pandas 2.0
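
<p style='text-align: justify;'>The concordance logic reduces to a keyword-in-context function over a token list. The sketch below is a simplified, hypothetical version (<code>concordance</code> is not a function from this notebook) that takes a full <code>window_size</code> tokens on both sides, whereas the cell above keeps one token fewer on the right. The sample tokens are invented.</p>

```python
def concordance(tokens, seed_terms, window_size):
    """Return (position, seed term, left context, right context) for every hit."""
    rows = []
    for pos, tok in enumerate(tokens):
        if tok in seed_terms:
            # slicing clamps at the ends of the list, so no boundary checks are needed
            left = tokens[max(0, pos - window_size):pos]
            right = tokens[pos + 1:pos + 1 + window_size]
            rows.append((pos, tok, ' '.join(left), ' '.join(right)))
    return rows

# Invented sample tokens.
tokens = ['昨日', '町', 'で', '外国人', 'を', '見る', '今日', '外国人', 'だ']
rows = concordance(tokens, {'外国人'}, window_size=2)
for row in rows:
    print(row)
```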

# Post-Processing Code

<p style='text-align: justify;'>The following cells contain code for working with word-embedding models and for further analysis once significant semantic clusters have been identified.</p>

### Inspect Single Word Embedding Model

In [None]:
import gensim #For word2vec, etc

# load a pre-saved word embedding model
model = gensim.models.KeyedVectors.load_word2vec_format(r'./results_fic_bootstrap/word2vec.txt') #for fiction
#model = gensim.models.KeyedVectors.load_word2vec_format(r'./results_kindai_bootstrap/word2vec.txt') #for non-fiction

#get 20 words closest to specified word
model.most_similar('日本人', topn=20) 
#get 10 words most similar to "Japan" vector minus combined vector of other words
model.most_similar(positive=['日本'], negative=['下等','頭数','病人','ごろつき','坑夫','醜い'], topn=10)
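
<p style='text-align: justify;'>To make the idea behind the <code>positive</code>/<code>negative</code> query concrete, the toy sketch below adds the unit vectors of the positive words, subtracts those of the negative words, and ranks the remaining words by cosine similarity to the result. It is an illustration in plain Python with invented two-dimensional vectors. The function defined here is a hypothetical re-implementation of the general idea, and gensim's internal computation may differ in its details.</p>

```python
import math

def unit(vec):
    # scale a vector to length 1
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def most_similar(vectors, positive, negative=(), topn=3):
    """Rank words by cosine similarity to (sum of positives - sum of negatives)."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)
    skip = set(positive) | set(negative)   # query words are not candidates
    scored = [(w, sum(q * x for q, x in zip(query, unit(v))))
              for w, v in vectors.items() if w not in skip]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Invented 2-D vectors; a trained model has far more dimensions.
toy = {'海': [1.0, 0.0], '川': [0.9, 0.1], '山': [0.0, 1.0], '空': [-1.0, 0.0]}
ranked = most_similar(toy, positive=['海'], negative=['山'])
print(ranked)
```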

### Extract Passages where Semantic Cluster Terms appear in Context Window of Racial Identifier

In [None]:
# Set path where lemmatized corpus is stored (and where race terms have been unified)
CORPUS_PATH = r"Corpora\UnidicLemmaMerged\\"

# specify racial identifier of interest
race_term = '土人'

#specify list of terms in semantic cluster of interest
atom_terms = ['音','聞こえる','叫び声','わっ','声','悲鳴','喚く','わあ','響く','鳴る','どっと','叫び','泣き声','響き']

#set window size on each side of race term
window_size = 20

rmv_stopwords = False
#specify stopword list to use (e.g., "stopwords.txt" for fiction and "stopwordsNonFic.txt" for Kindai magazine corpus)
if rmv_stopwords == True:
    stopwords = get_stopwords(r'WordLists\stopwords.txt')

#prepare file to print out results; Name should indicate racial identifier and semantic cluster label
f = open(r"ClusterContextsFiction\NativeVoice.txt", "w", encoding="utf-8")

#set up data structure to hold info
metadata = {'text_id': [], 'race_term': [], 'left_tokens': [], 'right_tokens': []}

hits = 0

#iterate through all texts and find instances where any of the atom_terms appears within window_size tokens of race term
for k in df.index:
    
    #get the text
    text_id = str(df.WORK_ID[k]) + ".txt"
    title = df.WORK_TITLE[k]
    year = df.PUBL_START[k]
    source_text = CORPUS_PATH + str(df.WORK_ID[k]) + ".txt"
    raw_text = open(source_text, encoding="utf-8")       #grab text
    text = raw_text.read()
    
    text = bracket_cleaner(text)  #clean brackets for non-unicode kanji; need to do this first
    text = cleaner(text)          #remove punctuation
    tokens = re.split(r' ', text) #tokenize the text
    
    #remove stopwords
    if rmv_stopwords == True:
        tokens = remove_stopwords(tokens, stopwords)
    
    #get index positions of all race terms in text
    index_positions = [i for i, x in enumerate(tokens) if x == race_term] 
    
    #check to see if race term is in the text; skip otherwise
    if index_positions:
        for cur_pos in index_positions:    #iterate through all race terms
            
            #get tokens to the left of race term
            if cur_pos - window_size > 0:
                left_tokens = ' '.join(tokens[cur_pos-window_size:cur_pos])
            else:
                left_tokens = ' '.join(tokens[0:cur_pos])
        
            #get tokens to the right of race term
            if cur_pos + window_size < len(tokens):
                right_tokens = ' '.join(tokens[cur_pos+1:cur_pos+window_size])
            else:
                right_tokens = ' '.join(tokens[cur_pos+1:])
                
            #check to see if any atom words are in left and right tokens
            for word in atom_terms:
                if word in left_tokens or word in right_tokens:
                    metadata['text_id'].append(text_id)
                    metadata['race_term'].append(race_term)
                    metadata['left_tokens'].append(left_tokens)
                    metadata['right_tokens'].append(right_tokens)
                    
                    hits+=1  #record as a hit
                    
                    #print the results to a file
                    f.write(text_id + ' ' + str(title) + ' ' + str(year) + ' ' + word + '\n')
                    f.write(left_tokens + ' ' + race_term + ' ' + right_tokens + '\n\n')
                    
                    #try to grab the corresponding passage in original base version of text
                    CORPUS_DIR = r"Corpora\AozoraFictionTokenized\\"
                    #merge target passage into a single string
                    target_passage = left_tokens + ' ' + race_term + ' ' + right_tokens
                    target_passage = re.split(r' ', target_passage)
                    
                    #open the corresponding tokenized text
                    source = CORPUS_DIR + str(df.WORK_ID[k]) + ".txt"
                    raw = open(source, encoding="utf-8")       #grab text
                    new_text = raw.read()
    
                    sim_scores = {}
                    
                    new_text = bracket_cleaner(new_text)  #clean brackets for non-unicode kanji; need to do this first
                    #remove punctuation
                    #new_text = cleaner(new_text)
    
                    #tokenize the text
                    new_tokens = re.split(r' ', new_text)
                   
                    sliding_window_size = 2 * window_size
                   
                    #calculate similarity across all passages of a given window size
                    for i in range(len(new_tokens) - (sliding_window_size - 1)):
                        source_passage = new_tokens[i:i+sliding_window_size]
                        
                        #make sure passages are equal length for similarity comparison
                        diff = abs(len(target_passage)-len(source_passage))
                        if len(target_passage) > len(source_passage):
                            source_passage.extend(diff*['*'])
                        elif len(source_passage) > len(target_passage):
                            target_passage.extend(diff*['*'])
                        
                        j_score = jaccard_similarity_score(target_passage, source_passage)
                        sim_scores[i] = j_score
    
                    #now rank by similarity score
                    sim_df = DataFrame.from_dict(sim_scores, orient='index')  #convert dict to dataframe
                    sim_df = sim_df.rename(columns={0:'j_score'})    #rename column
                    sim_df = sim_df.sort_values(by='j_score', ascending=False)   #sort by similarity
                    #get top most similar passage
                    top_sim_index = sim_df.index[0:1].tolist()[0]           
                    #grab the corresponding window (20 tokens before through 60 after the start of the match);
                    #clamp the start at 0 so a match near the beginning does not produce a negative index
                    matching_passage = new_tokens[max(0, top_sim_index - 20):top_sim_index + 60]
                    f.write('---' + ''.join(matching_passage) + '\n\n')
                   
                    break   #exit loop; no need to check the other atom_terms
            
f.close()
print(hits)
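
<p style='text-align: justify;'>The last step of the cell above locates the passage in the non-lemmatized text that best lines up with the lemmatized context window. A minimal sketch of that sliding-window search follows. The score used is the fraction of positions holding the same token, and <code>positional_agreement</code> and <code>best_window</code> are hypothetical names. The sample tokens are invented, with the lemma 聞こえる standing against the surface form 聞こえ.</p>

```python
def positional_agreement(a, b):
    # fraction of positions at which two equal-length token lists agree
    return sum(x == y for x, y in zip(a, b)) / len(a)

def best_window(target, source_tokens):
    """Return (start index, score) of the source window most similar to target."""
    size = len(target)
    best_start, best_score = 0, -1.0
    for start in range(len(source_tokens) - size + 1):
        score = positional_agreement(target, source_tokens[start:start + size])
        if score > best_score:
            best_start, best_score = start, score
    return best_start, best_score

# Invented example: lemmatized target vs. surface-form source.
target = ['声', 'が', '聞こえる']
source = ['遠く', 'で', '声', 'が', '聞こえ', 'た']
start, score = best_window(target, source)
print(start, score)
```

<p style='text-align: justify;'>Because the comparison is positional, a single inserted or dropped token between the two versions shifts everything after it, so the best score can be low even for the correct passage.</p>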

### Extract Passages where Semantic Cluster Terms appear in Context Window of Character Name

<p style='text-align: justify;'>This cell draws on texts that have previously been annotated for character names. That annotation is handled in a separate IPython notebook, as described in the <b>Read Me</b> file.</p>

In [None]:
# Specify folder where character annotated texts are stored
CORPUS_PATH = r"Corpora\CharAnnot\\"

#reduce data frame (only for Fiction) to texts with character annotation and where occurrences of "native" >= 10
df = df[df['FILTER'] == 'char_annot']
df = df[df['NATIVE'] >= 10]

#set list of semantic cluster terms to search for
atom_terms = ['音','聞こえる','叫び声','わっ','声','悲鳴','喚く','わあ','響く','鳴る','どっと']

#specify window size on each side of character reference
window_size = 15

#optional flag for removing stop words from results
rmv_stopwords = False
if rmv_stopwords == True:
    stopwords = get_stopwords(r'WordLists\stopwords.txt')

#prepare file to print out results
f = open(r"ClusterContextsFiction\CharNativeVoice.txt", "w", encoding="utf-8")

#set up data structure to hold info
metadata = {'text_id': [], 'char_name': [], 'left_tokens': [], 'right_tokens': []}

total_hits = 0

#iterate through all texts and find instances where any of the atom_terms appears within window_size tokens of identified character name
for k in df.index:
    
    #get the text
    text_id = str(df.WORK_ID[k]) + ".txt"
    title = df.WORK_TITLE[k]
    year = df.PUBL_START[k]
    source_text = CORPUS_PATH + str(df.WORK_ID[k]) + ".txt"
    raw_text = open(source_text, encoding="utf-8")       #grab text
    text = raw_text.read()
    
    work_hits = 0
    
    text = bracket_cleaner(text)  #clean brackets for non-unicode kanji; need to do this first
    #text = cleaner(text)          #remove punctuation
    tokens = re.split(r' ', text) #tokenize the text
    
    #remove stopwords
    if rmv_stopwords == True:
        tokens = remove_stopwords(tokens, stopwords)
    
    #get index positions of all character names in text
    all_positions = [i for i, x in enumerate(tokens) if re.search(r'[^\s]*_[^\s]*', x)]
    
    #keep only the first of any run of names within window_size positions of one another to avoid double counting
    #(the first name is always kept; the original compared it with the last name in the text)
    index_positions = [x for i, x in enumerate(all_positions) if i == 0 or x - all_positions[i-1] > window_size]
    
    #check to see if character names are in the text; skip otherwise
    if index_positions:
        
        for cur_pos in index_positions:    #iterate through all character names
            
            #get tokens to the left of character name
            if cur_pos - window_size > 0:
                left_tokens = ' '.join(tokens[cur_pos-window_size:cur_pos])
            else:
                left_tokens = ' '.join(tokens[0:cur_pos])
        
            #get tokens to the right of character name
            if cur_pos + window_size < len(tokens):
                right_tokens = ' '.join(tokens[cur_pos+1:cur_pos+window_size])
            else:
                right_tokens = ' '.join(tokens[cur_pos+1:])
            
            #check to see if any atom words are in left and right tokens
            for word in atom_terms:
                if word in left_tokens or word in right_tokens:
                    metadata['text_id'].append(text_id)
                    metadata['char_name'].append(tokens[cur_pos])
                    metadata['left_tokens'].append(left_tokens)
                    metadata['right_tokens'].append(right_tokens)
                    
                    total_hits += 1  #record as a hit
                    work_hits += 1
                    
                    #print the results to a file
                    f.write(text_id + ' ' + str(title) + ' ' + str(year) + ' ' + str(cur_pos) + ' ' + word + '\n')
                    f.write(left_tokens + ' ' + re.split('_',tokens[cur_pos])[0] + ' ' + right_tokens + '\n\n')
            
                    #try to grab the corresponding passage in non-lemmatized version of text
                    CORPUS_DIR = r"Corpora\AozoraFictionTokenized\\"
                
                    #open the corresponding tokenized text
                    source = CORPUS_DIR + str(df.WORK_ID[k]) + ".txt"
                    raw = open(source, encoding="utf-8")       #grab text
                    new_text = raw.read()
                    new_text = bracket_cleaner(new_text)  #clean brackets for non-unicode kanji; need to do this first
                    new_text = strip_chap_titles(new_text)
                    #remove punctuation
                    #new_text = cleaner(new_text)
    
                    #tokenize the text
                    new_tokens = re.split(r' ', new_text)
            
                    #merge target passage into a single string
                    target_passage = left_tokens + ' ' + re.split('_',tokens[cur_pos])[0] + ' ' + right_tokens
                    target_passage = re.split(r' ', target_passage)  #tokenize
                    
                    #set window of text to scan based on cur_pos in lemmatized text
                    if cur_pos < 2000:  #we're at start of text
                        search_area = new_tokens[0:cur_pos+2000]
                    elif (len(new_tokens) - cur_pos) < 2000:  #we're at end of text 
                        search_area = new_tokens[-2000:]
                    else:
                        search_area = new_tokens[cur_pos - 2000:cur_pos + 2000]
                    
                    sim_scores = {}
                    sliding_window_size = 2 * window_size
                    
                    #calculate similarity across all passages of a given window size
                    for i in range(len(search_area) - (sliding_window_size - 1)):
                        source_passage = search_area[i:i+sliding_window_size]
                        j_score = jaccard_similarity_score(target_passage, source_passage)
                        sim_scores[i] = j_score
    
                    #now rank by similarity score
                    sim_df = DataFrame.from_dict(sim_scores, orient='index')  #convert dict to dataframe
                    sim_df = sim_df.rename(columns={0:'j_score'})    #rename column
                    sim_df = sim_df.sort_values(by='j_score', ascending=False)   #sort by similarity
                    #get top most similar passage
                    top_sim_index = sim_df.index[0:1].tolist()[0]           
                    #grab the corresponding window (30 tokens before through 70 after the start of the match);
                    #clamp the start at 0 so a match near the beginning does not produce a negative index
                    matching_passage = search_area[max(0, top_sim_index - 30):top_sim_index + 70]
                    f.write('---' + ''.join(matching_passage) + '\n\n')
                
                    break   #exit loop; no need to check the other atom_terms
    
    #output results
    if index_positions:
        print(title, len(index_positions), work_hits, work_hits/len(index_positions))
    else:
        print(title, "No Characters Annotated")
            
f.close()
print(total_hits)
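
<p style='text-align: justify;'>The thinning of character-name positions can be checked in isolation. The sketch below keeps a position only when it lies more than <code>min_gap</code> tokens after the preceding name, so each run of closely spaced mentions is represented by its first member. <code>first_of_each_run</code> is a hypothetical helper and the positions are invented.</p>

```python
def first_of_each_run(positions, min_gap):
    """Keep a position only if it is more than min_gap after the previous position."""
    return [pos for i, pos in enumerate(positions)
            if i == 0 or pos - positions[i - 1] > min_gap]

# Invented token positions of character names, in increasing order.
positions = [3, 10, 40, 45, 100]
kept = first_of_each_run(positions, min_gap=15)
print(kept)

# a text with a single mention keeps that mention
single = first_of_each_run([7], min_gap=15)
print(single)
```

<p style='text-align: justify;'>Note that each position is compared with the preceding raw position, not with the last kept one, so a long chain of mentions spaced ten tokens apart collapses to its first member only.</p>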

### Process Cluster Detection Output File for Analysis in Excel

<p style='text-align: justify;'>This cell reads the output from the "Cluster_Detection_Fic.py" or "Cluster_Detection_Kindai.py" files and converts it into a more readable spreadsheet format.</p>

In [None]:
#specify the output file to process
path = r"results_fic_bootstrap\kfree_thres0.69_numtest5000_window20_alpha0.05_overlap_percent0.50.txt"

#create dictionary to store data
results_dict = {}

with open(path, encoding='utf-8') as f:
    raw = f.read()
results = re.split(r'\n\n', raw)[0:-1]  #exclude the last entry, which is just a newline

#process each line of output, excluding instances where semantic cluster term does not appear around race term
for result in results:
    temp = re.split(r':\n', result)  #split title from atom results
    pair = re.findall(r'[^\s]*\sthan\s[^\s]*', temp[0]) #extract the comparison terms
    pair_string = '_'.join(re.split(r'\s', pair[0])) #create identifier string for dictionary
    results_dict[pair_string] = []  #create an empty list for this identifier
    
    #make sure there are atoms for this pair
    if len(temp) > 1:
        atoms = re.split(r'-\s', temp[1])[1:] #retrieve all atoms and split
    
        for atom in atoms:
            atom_freq = 0  #variable to keep track of strength of atom terms
        
            terms = re.split(r'[^,]\s', atom)[0:-1]
            keywords = []
            for term in terms:
                temp2 = re.split(r'[\((,\s)]', term)
                if float(temp2[1]) > 0:  # or float(temp2[3]) > 0:  #get only words that appear around race term
                    keywords.append(temp2[0])  #add the keyword to list
                    atom_freq += float(temp2[1])  #add first freq to total
        
            #add keywords of this atom to list in dictionary
            results_dict[pair_string].append((keywords, atom_freq))
            
#turn dictionary into a dataframe
results_df = pd.DataFrame.from_dict(results_dict, orient='index')
#the context manager saves and closes the file (ExcelWriter.save() was removed in pandas 2.0)
with pd.ExcelWriter(r'results_fic_bootstrap\Fiction_Clusters.xlsx', engine='xlsxwriter') as writer:
    results_df.to_excel(writer, sheet_name='Sheet1')