# Homework 1
### Jonathan Lee - 李小慧

In [101]:
import stanza
import logging
import pandas as pd
import re

## Motivations and Preliminary Findings
When asked to give my thoughts on the characteristics of both Modern and Old Chinese with respect to word order and distribution, it occurred to me that it would be an interesting experiment to see what examples of these phenomena I could find in various Old Chinese texts, especially since this kind of search is difficult in a traditional linguistic corpus. We were given examples from the Analects of how Old Chinese inflects for case in interrogative and negation sentences (e.g., 吾誰欺？ and 不我與, respectively), so I decided to look first at the Analects and the Zuozhuan, which date from a similar period. Using the Natural Language Processing (NLP) toolkit Stanza, I was able to identify several similar cases by scanning the texts for the same part-of-speech (POS) patterns.

#### Stanza
I used Stanza (https://stanfordnlp.github.io/stanza/index.html), a collection of NLP tools. According to its website, Stanza is "a collection of accurate and efficient tools for the linguistic analysis of many human languages." I chose it because it includes models built specifically for Literary Chinese. In particular, it can tokenize a Literary Chinese text (segment the character string into words) and attach a part-of-speech (POS) tag to each token.

In [102]:
#Set the logging level to ERROR to suppress informational messages
logging.getLogger('stanza').setLevel(logging.ERROR)

#Download the language model for Literary Chinese
stanza.download('lzh')
#Load only the tokenizer and POS tagger (the keyword is `processors`, comma-separated)
nlp = stanza.Pipeline(lang='lzh', processors='tokenize,pos', logging_level='ERROR')

### File Processing
* First, I started by finding a version of the Analects and Zuozhuan online and copying them to files with UTF-8 encoding (in order to work with Chinese texts). Below, I read the files from my computer.

In [103]:
#Read file
file_path_1 = "/Users/jlee/Desktop/UH Spring 2024 Courses/CHN 631C/CHN 631C Project/Analects.txt"
with open(file_path_1, 'r', encoding='utf-8') as file:
    analects = file.read()

#Preview of the text
print("Preview of Analects with Punctuation")
for i in range(0,100,10):
    print(analects[i:i+10])

Preview of Analects with Punctuation
學而第一
1.子曰：
「學而時習之，不亦說
乎？有朋自遠方來，不
亦樂乎？人不知而不慍
，不亦君子乎？」（1
.1）
2.有子曰：
「其為人也孝弟，而好
犯上者，鮮矣；不好犯
上，而好作亂者，未之
有也。君子務本，本立


In [104]:
file_path_2 = "/Users/jlee/Desktop/UH Spring 2024 Courses/CHN 631C/CHN 631C Project/春秋左传.txt"
with open(file_path_2, 'r', encoding='utf-8') as file:
    zuozhuan = file.read()

#Preview of the text
print("Preview of Zuozhuan with Punctuation")
for i in range(0,100,10):
    print(zuozhuan[i:i+10])

Preview of Zuozhuan with Punctuation


春秋左传
左丘明
 著



目录


隐公（元年～十一年）
⋯⋯⋯⋯⋯⋯⋯⋯⋯⋯
⋯⋯⋯⋯⋯001桓公
（元年～十八年）⋯⋯
⋯⋯⋯⋯⋯⋯⋯⋯⋯⋯
⋯⋯⋯019庄公（元
年～三十二年）⋯⋯⋯
⋯⋯⋯⋯⋯⋯⋯⋯⋯⋯


* Next, I removed everything but the characters to get a simpler and more straightforward text to work with.

In [105]:
#Function to clean up Chinese text with utf-8 encoding
def clean_text(text):
    #Remove special punctuation with regex
    cleaned_text = re.sub(r"[。「」、﹑.()（）？！：，；『』⋯～“”]",'', text)
    #Remove numbers with regex
    cleaned_text = re.sub(r'\d+','', cleaned_text)
    #Remove spaces (ASCII, ideographic U+3000, and no-break U+00A0), keeping line breaks
    cleaned_text = re.sub(r'[ \u3000\u00a0]+','', cleaned_text)
    return cleaned_text
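
The character class above has to list every punctuation mark that occurs in the source files. As a hedged alternative (not the approach used in this notebook), punctuation, digits, and horizontal whitespace could be dropped by Unicode category instead. The function name `clean_text_by_category` is hypothetical, and the sample string is the opening of the Analects preview shown earlier.

```python
import unicodedata

def clean_text_by_category(text):
    """Drop punctuation (P*), decimal digits (Nd), and space separators (Zs).

    Line breaks (category Cc) and Chinese numerals such as 一 (category Lo)
    are kept.
    """
    kept = []
    for ch in text:
        category = unicodedata.category(ch)
        if category.startswith('P') or category in ('Nd', 'Zs'):
            continue
        kept.append(ch)
    return ''.join(kept)

sample = "1.子曰：「學而時習之，不亦說乎？」（1.1）"
print(clean_text_by_category(sample))
```

One caveat with this sketch: symbols such as ～ and ⋯ are classified as math symbols (Sm) rather than punctuation, so they would still need to be removed explicitly.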



In [106]:
cleaned_analects = clean_text(analects)
#A look at the first 100 characters
print("Preview of Analects without Punctuation")
for i in range(0,100,10):
    print(cleaned_analects[i:i+10])

Preview of Analects without Punctuation
學而第一
子曰學而時
習之不亦說乎有朋自遠
方來不亦樂乎人不知而
不慍不亦君子乎
有子
曰其為人也孝弟而好犯
上者鮮矣不好犯上而好
作亂者未之有也君子務
本本立而道生孝弟也者
其為仁之本與
子曰巧
言令色鮮矣仁
曾子曰


In [107]:
cleaned_zuozhuan = clean_text(zuozhuan)
#A look at the first 100 characters
print("Preview of Zuozhuan without Punctuation")
for i in range(0,100,10):
    print(cleaned_zuozhuan[i:i+10])

Preview of Zuozhuan without Punctuation


春秋左传
左丘明
著



目录

隐
公元年十一年桓公元年
十八年庄公元年三十二
年闵公元年二年僖公元
年三十三年文公元年十
八年宣公元年十八年成
公元年十八年襄公元年
三十一年昭公元年三十
二年定公元年十五年哀


In [108]:
#Save the cleaned text to a file on my computer
cleaned_analects_name = "/Users/jlee/Desktop/UH Spring 2024 Courses/CHN 631C/CHN 631C Project/Analects_No_Punct"
with open(cleaned_analects_name, 'w', encoding='utf-8') as file:
    file.write(cleaned_analects)
print(f"File saved as {cleaned_analects_name}")

cleaned_zuozhuan_name = "/Users/jlee/Desktop/UH Spring 2024 Courses/CHN 631C/CHN 631C Project/Zuozhuan_No_Punct"
with open(cleaned_zuozhuan_name, 'w', encoding='utf-8') as file:
    file.write(cleaned_zuozhuan)
print(f"File saved as {cleaned_zuozhuan_name}")

File saved as /Users/jlee/Desktop/UH Spring 2024 Courses/CHN 631C/CHN 631C Project/Analects_No_Punct
File saved as /Users/jlee/Desktop/UH Spring 2024 Courses/CHN 631C/CHN 631C Project/Zuozhuan_No_Punct


### Tokenization and POS tagging
* The next main task was to tokenize and attach POS tags to the cleaned text as was done below.

In [109]:
#Function to tokenize and attach POS tags
def pos_and_tokenize(cleaned_text, output_filename):
    doc = nlp(cleaned_text)
    tokens = []
    pos_tags = []
    #Tokenization and POS tagging
    for sentence in doc.sentences:
        for word in sentence.words:
            tokens.append(word.text)
            pos_tags.append(word.pos)

    df = pd.DataFrame({
        'TOKEN': tokens,
        'POS': pos_tags
    })
    df.to_csv(f'{output_filename}_pos_tok.csv')
    return df

In [110]:
pos_tok_analects = pos_and_tokenize(cleaned_analects, "cleaned_analects")
pos_tok_zuozhuan = pos_and_tokenize(cleaned_zuozhuan, "cleaned_zuozhuan")

* Below, we can see the results of Stanza's tokenization and POS tagging. For Literary Chinese, Stanza treats most characters as single-character words, although there are exceptions.

In [111]:
pos_tok_analects.head(10)

Unnamed: 0,TOKEN,POS
0,學,VERB
1,而,CCONJ
2,第,NOUN
3,一,NUM
4,子,NOUN
5,曰,VERB
6,學,VERB
7,而,CCONJ
8,時,NOUN
9,習,VERB


In [112]:
pos_tok_zuozhuan.head(10)

Unnamed: 0,TOKEN,POS
0,春,NOUN
1,秋,NOUN
2,左,VERB
3,传,NOUN
4,左,NOUN
5,丘,NOUN
6,明,VERB
7,著,VERB
8,目,NOUN
9,录,NOUN


* The code snippets below show cases where Stanza identified multisyllabic words. In general, the disyllabic words are names or specific nouns like 君子 whose meaning would be lost if the characters were separated. In the Analects, Stanza identified 601 disyllabic words. The longer tokens previewed below are all numbers; note that the frames holding them are truncated with `.head()`, so the count of 5 printed for each text is a cap rather than a full total.

In [113]:
disyllabic_words_analects = pos_tok_analects[pos_tok_analects['TOKEN'].apply(lambda word: len(word) == 2)]
disyllabic_words_analects.head(5)

Unnamed: 0,TOKEN,POS
33,君子,NOUN
35,有子,PROPN
63,君子,NOUN
90,曾子,PROPN
140,弟子,NOUN


In [114]:
disyllabic_words_zuozhuan = pos_tok_zuozhuan[pos_tok_zuozhuan['TOKEN'].apply(lambda word: len(word) == 2)]
disyllabic_words_zuozhuan.head(5)

Unnamed: 0,TOKEN,POS
10,隐公,NOUN
13,十一,NUM
15,桓公,NOUN
18,十八,NUM
20,庄公,NOUN


In [115]:
print(f"There are {disyllabic_words_analects.shape[0]} disyllabic words in the Analects and {disyllabic_words_zuozhuan.shape[0]} in Zuozhuan")

There are 601 disyllabic words in the Analects and 6172 in Zuozhuan


In [116]:
other_multi_syllabic_words_analects = pos_tok_analects[pos_tok_analects['TOKEN'].apply(lambda word: len(word) > 2)].head()
other_multi_syllabic_words_analects.head()

Unnamed: 0,TOKEN,POS
503,三百一,NUM
5880,四十五十,NUM
7527,六七十,NUM
7529,五六十,NUM
7700,六七十,NUM


In [117]:
other_multi_syllabic_words_zuozhuan = pos_tok_zuozhuan[pos_tok_zuozhuan['TOKEN'].apply(lambda word: len(word) > 2)].head()
other_multi_syllabic_words_zuozhuan.head(5)

Unnamed: 0,TOKEN,POS
23,三十二,NUM
35,三十三,NUM
55,三十一,NUM
61,三十二,NUM
73,二十七,NUM


In [118]:
print(f"Showing {other_multi_syllabic_words_analects.shape[0]} longer multisyllabic words from the Analects and {other_multi_syllabic_words_zuozhuan.shape[0]} from Zuozhuan (each capped at 5 by .head())")

Showing 5 longer multisyllabic words from the Analects and 5 from Zuozhuan (each capped at 5 by .head())


* The next step was to check the POS tags that Stanza assigned to each word in 吾誰欺 and to 不, in order to determine what pattern to search for in the text. Stanza identified 吾 as a pronoun, 誰 as a pronoun, and 欺 as a verb. It also tagged 不 as an adverb, which means that for cases like 不我與 I would need to look for adverb-pronoun-verb occurrences. While this pattern may be too general for some cases, I decided to use it for the time being.

In [119]:
#Function for finding a sequence of tokens in the dataframe. It's essentially just a ctrl-f.

def get_sequence_df(df, sequence):
    for i in range(len(df) - len(sequence) + 1):
        if all(df.iloc[i + j]['TOKEN'] == sequence[j] for j in range(len(sequence))):
            return df.iloc[i:i+len(sequence)]
    return pd.DataFrame()  # Return an empty DataFrame if sequence is not found

sequence = ['吾', '誰', '欺']
sequence_df = get_sequence_df(pos_tok_analects, sequence)

print(sequence_df)

     TOKEN   POS
5631     吾  PRON
5632     誰  PRON
5633     欺  VERB


* Armed with the knowledge that I am looking for PRON-PRON-VERB and ADV-PRON-VERB sequences, I created a function that finds those patterns and returns the results as a list of strings. In the Analects, it found 21 cases of the first pattern and 33 of the second.

In [120]:
# Example: Find sequences of PRON-PRON-VERB
pattern1 = ['PRON','PRON','VERB']
pattern2 = ['ADV','PRON','VERB']
def find_pattern_in_df(df, pattern):
    #Return the joined tokens of every window whose POS tags match the pattern
    results = []
    for i in range(len(df) - len(pattern) + 1):
        window = df.iloc[i:i+len(pattern)]
        if window['POS'].tolist() == pattern:
            results.append(''.join(window['TOKEN'].tolist()))
    return results

# Example usage
pattern_str1_analects = find_pattern_in_df(pos_tok_analects, pattern1)
print(pattern_str1_analects)

pattern_str2_analects = find_pattern_in_df(pos_tok_analects, pattern2)
print(pattern_str2_analects)

pattern_str1_zuozhuan = find_pattern_in_df(pos_tok_zuozhuan, pattern1)
print(pattern_str1_zuozhuan)

pattern_str2_zuozhuan = find_pattern_in_df(pos_tok_zuozhuan, pattern2)
print(pattern_str2_zuozhuan)

['斯之謂', '其何以', '其或繼', '吾何以', '之何如', '是吾憂', '予何大', '吾誰欺', '何其聞', '爾何如', '爾何如', '爾何如', '何其徹', '之或問', '之何如', '諸己小', '斯某在', '斯之謂', '何其廢', '何其拒', '是之甚']
['未之有', '曾是以', '奚其為', '未之見', '莫己知', '猶吾大', '猶吾大', '莫吾猶', '未之有', '則之蕩', '不其然', '何其多', '未之思', '毋吾以', '不吾知', '則何以', '毋自辱', '奚其正', '亦奚以', '不吾以', '莫予違', '莫之違', '莫之違', '豈其然', '莫之知', '不己知', '莫我知', '莫己知', '未之學', '不己知', '不我與', '又誰怨', '猶之與']
['为之请', '是之谓', '之虽及', '为其少', '之彼则', '是其生', '此其以', '之是行', '我吾求', '此其在', '此其昌', '子其死', '子其行', '之何稽', '吾其定', '之何对', '之我怠', '其何以', '吾其奔', '其何以', '之是求', '之何庸', '之其若', '我吾使', '子是寡', '吾何以', '之吾舍', '之我辞', '吾自惧', '我何以', '之彼骄', '之吾以', '吾其死', '无自立', '此之谓', '之何杀', '之其先', '之何以', '余是以', '此之谓', '之是齐', '何其以', '其自为', '是之谓', '之之明', '之何曰', '之是弃', '之是以', '其何以', '其孰以', '将何以', '是我有', '我是以', '我是以', '我是以', '其何如', '之其御', '之我毙', '之其可', '之是以', '是之谓', '我是欲', '余余恐', '之何对', '之何’', '之或推', '为自逸', '之其若', '将何以', '其或难', '之是以', '何其以', '之何以', '是之谓', '此之谓', '是之谓', '此之谓', '之是以', '我我克', '我是以', '其何以', '之语问', '我其收', '我其拱', '之何以', '子余祭', '为之歌', '为之歌', '此之

In [121]:
print(f'There are {len(pattern_str1_analects)} instances of PRON-PRON-VERB in the Analects.')
print(f'There are {len(pattern_str2_analects)} instances of ADV-PRON-VERB in the Analects.')
print(f'There are {len(pattern_str1_zuozhuan)} instances of PRON-PRON-VERB in the Zuozhuan.')
print(f'There are {len(pattern_str2_zuozhuan)} instances of ADV-PRON-VERB in the Zuozhuan.')

There are 21 instances of PRON-PRON-VERB in the Analects.
There are 33 instances of ADV-PRON-VERB in the Analects.
There are 178 instances of PRON-PRON-VERB in the Zuozhuan.
There are 152 instances of ADV-PRON-VERB in the Zuozhuan.
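
The search above slices the DataFrame once per position. As a minimal sketch of the same sliding-window idea on plain Python lists, the following may be easier to test in isolation. The function name `find_pos_ngrams` is hypothetical, and the five-token sample is a toy sequence assembled from tags shown in the outputs above, not a contiguous passage.

```python
def find_pos_ngrams(tokens, pos_tags, pattern):
    """Return the joined tokens of every window whose POS tags equal `pattern`.

    `tokens`, `pos_tags`, and `pattern` are plain lists of strings;
    `tokens` and `pos_tags` must be the same length.
    """
    n = len(pattern)
    hits = []
    for i in range(len(pos_tags) - n + 1):
        if pos_tags[i:i + n] == pattern:
            hits.append(''.join(tokens[i:i + n]))
    return hits

# Toy input: tags copied from the notebook outputs, order is illustrative only
tokens = ['子', '曰', '吾', '誰', '欺']
pos_tags = ['NOUN', 'VERB', 'PRON', 'PRON', 'VERB']
print(find_pos_ngrams(tokens, pos_tags, ['PRON', 'PRON', 'VERB']))
```

With the DataFrames used in this notebook, the two lists could be obtained with `df['TOKEN'].tolist()` and `df['POS'].tolist()`.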


* The last task (for now) was to return the sentences in which these sequences occur. I start by defining a function that splits the text into sentences and highlights the matched sequences with underscores for readability. I added an option to keep only sentences that contain a question mark, since the PRON-PRON-VERB pattern should, in theory, occur only in interrogative sentences. I also added an option to keep only patterns that begin with specific characters, since the set of negators used in the Analects and Zuozhuan is relatively small.

In [147]:
def find_and_highlight_sentences(text, search_strings, is_question=False, start_chars=None):
    # Split the text into sentences based on periods (or appropriate punctuation)
    sentences = text.split('。')
    
    # Compile a pattern to match and remove the trailing Chinese open parenthesis followed by a number
    remove_pattern = re.compile(r'（\d+$')
    
    # Apply the removal pattern to each sentence
    sentences = [remove_pattern.sub('', sentence) for sentence in sentences]
    
    highlighted_sentences = []
    
    # Filter search_strings if start_chars is provided
    if start_chars:
        filtered_search_strings = [s for s in search_strings if s.startswith(start_chars)]
    else:
        filtered_search_strings = search_strings
    
    # Nothing to search for: an empty alternation would match at every position
    if not filtered_search_strings:
        return []
    
    # Compile a regular expression pattern for efficient matching
    pattern = re.compile('|'.join([re.escape(s) for s in filtered_search_strings]))
    
    # Function to add highlights only if not already highlighted
    def highlight(match):
        matched_text = match.group(0)
        # Check if the matched text is already highlighted
        if matched_text.startswith('___') and matched_text.endswith('___'):
            return matched_text  # Return as is if already highlighted
        else:
            return f"___{matched_text}___"  # Highlight if not already highlighted
    
    # Iterate through each sentence
    for sentence in sentences:
        # Strip leading and trailing whitespace from the sentence
        sentence = sentence.strip()
        
        # Replace all occurrences of search strings with highlighted version
        highlighted_sentence = pattern.sub(highlight, sentence)
        
        # Add condition based on 'question' parameter to filter sentences with Chinese question marks
        if is_question:
            # Only add the sentence if it contains a Chinese question mark
            if '？' in highlighted_sentence and highlighted_sentence != sentence:
                highlighted_sentences.append(highlighted_sentence)
        else:
            # Add the sentence if any replacements were made
            if highlighted_sentence != sentence:
                highlighted_sentences.append(highlighted_sentence)
    
    return highlighted_sentences

* The code below returns the pronoun-pronoun-verb example sentences. Note that certain instances, such as 予何大, are not returned with a sentence for context. This is because 予何大 straddles two passages: 予何 ends 9.5 and 大 begins 9.6. The pattern was matched only because punctuation had been removed before tagging:

5.  子畏於匡。曰：「文王既沒，文不在茲乎。天之章喪斯文也。後死者不得與於斯文也。天之未喪斯文也。匡人其如**予何**。」（9.5）
6.  **大**宰問於子貢曰：「夫子聖者與！何其多能也？」子貢曰：「固天縱之將聖，又多能也。」子聞之曰：「大宰知我乎？吾少也賤，故多能鄙事。君子多乎哉？不多也！」（9.6）
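
Hits like 予何大 could be flagged automatically by checking whether each hit occurs verbatim in the punctuated text. The sketch below is one possible way to do that, not part of the original pipeline; `split_hits_by_contiguity` is a hypothetical name, and `raw` is an abbreviated excerpt assembled from passages 9.5, 9.6, and 9.12 for illustration.

```python
def split_hits_by_contiguity(raw_text, hits):
    """Separate hits found verbatim in the punctuated text from hits that
    only arise once punctuation and line breaks are removed."""
    contiguous, boundary_artifacts = [], []
    for hit in hits:
        if hit in raw_text:
            contiguous.append(hit)
        else:
            boundary_artifacts.append(hit)
    return contiguous, boundary_artifacts

# Abbreviated excerpt (assumption: shortened for the example)
raw = (
    "匡人其如予何。」（9.5）\n"
    "6.  大宰問於子貢曰：「夫子聖者與！」\n"
    "無臣而為有臣，吾誰欺？欺天乎？"
)
contiguous, artifacts = split_hits_by_contiguity(raw, ['予何大', '吾誰欺'])
print(contiguous, artifacts)
```

A hit that lands in the second list was produced by joining text across punctuation, so it is a candidate for exclusion before any manual review.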

In [154]:
highlighted_sentences = find_and_highlight_sentences(analects, pattern_str1_analects, is_question = True)
print("Example sentences for PRON-PRON-VERB in the Analects\n")
for i in range(len(highlighted_sentences)):
    print(f'_[{i+1}]_    {highlighted_sentences[i]}\n')

Example sentences for PRON-PRON-VERB in the Analects

_[1]_    」子貢曰：「詩云：『如切如磋，如琢如磨』，其___斯之謂___與？」子曰：「賜也，始可與言詩已矣，告諸往而知來者

_[2]_    大車無輗，小車無軏，___其何以___行之哉？」（2.22）
23.子張問：「十世可知也？」子曰：「殷因於夏禮，所損益，可知也；周因於殷禮，所損益，可知也

_[3]_    」（3.25）
26.子曰：「居上不寬，為禮不敬，臨喪不哀，___吾何以___觀之哉？」（3.26）
里仁第四
1.  子曰：「里仁為美

_[4]_    」（7.1）
2.  子曰：「默而識之，學而不厭，誨人不倦，何有於我哉？」（7.2）
3.  子曰：「德之不修，學之不講，聞義不能徒，不善不能改，___是吾憂___也

_[5]_    欲罷不能，既竭吾才，如有所立，卓爾；雖欲從之，末由也已！」（9.11）
12.子疾病，子路使門人為臣，病聞，曰：「久矣哉，由之行詐也！無臣而為有臣，___吾誰欺___？欺天乎？且予與其死於臣之手也，無甯死於二三子之手乎！且予縱不得大葬，予死於道路乎？」（9.12）
13.子貢曰：「有美玉於斯，韞(櫝)而藏諸？求善賈而沽諸？」子曰：「沽之哉！沽之哉！我待賈者也！」（9.13）
14.子欲居九夷

_[6]_    」子曰：「論篤是與，君子者乎？色莊者乎？」（11.19）
20.子路問：「聞斯行諸？」子曰：「有父兄在，如之___何其聞___斯行之！」冉有問：「聞斯行諸？」子曰：「聞斯行之！」公西華曰：「由也問『聞斯行諸？』，子曰：『有父兄在』；求也問，『聞斯行諸？』子曰：『聞斯行之』

_[7]_    「求，___爾何如___？」對曰：「方六七十，如五六十，求也為之，比及三年，可使足民；如其禮樂，以俟君子

_[8]_    」「赤，___爾何如___？」對曰：「非曰能之，願學焉！宗廟之事，如會同，端章甫，願為小相焉

_[9]_    」「點，___爾何如___？」鼓瑟希，鏗爾，舍瑟而作

_[10]_    」（12.8）
9.  哀公問於有若曰：「年饑，用不足，如之何？」有若對曰：「盍徹乎！」曰：「二，吾猶不足；如之___何其徹___也？」對曰：「百姓足，君孰不足？百姓不足，君孰與足？」（12.

* The code below returns the adverb-pronoun-verb example sentences. As above, any pattern that was matched only across a sentence boundary is not returned with a sentence for context.

In [157]:
print("Example sentences for ADV-PRON-VERB in the Analects\n")
highlighted_sentences = find_and_highlight_sentences(analects, pattern_str2_analects, start_chars=('不','未','無'))
for i in range(len(highlighted_sentences)):
    print(f'_[{i+1}]_    {highlighted_sentences[i]}\n')

Example sentences for ADV-PRON-VERB in the Analects

_[1]_    學而第一
1.子曰：「學而時習之，不亦說乎？有朋自遠方來，不亦樂乎？人不知而不慍，不亦君子乎？」（1.1）
2.有子曰：「其為人也孝弟，而好犯上者，鮮矣；不好犯上，而好作亂者，___未之有___也

_[2]_    蓋有之矣，我___未之見___也

_[3]_    躬行君子，則吾___未之有___得

_[4]_    」孔子曰：「才難，___不其然___乎，唐虞之際，於斯為盛，有婦人焉，九人而已

_[5]_    」子曰：「___未之思___也，未何遠之有？」（9.30）
鄉黨第十
1.  孔子於鄉黨，恂恂如也，似不能言者

_[6]_    居則曰：『___不吾知___也！』如或知爾，則何以哉？」子路率爾而對，曰：「千乘之國，攝乎大國之間閒，加之以師旅，因之以饑饉，由也為之，比及三年，可使有勇，且知方也

_[7]_    」子曰：「其事也！如有政，雖___不吾以___，吾其與聞之！」（13.14）
15.定公問：「一言而可以興邦，有諸？」孔子對曰：「言不可以若是其幾也！人之言曰：『為君難，為臣不易

_[8]_    子曰：「賜也，賢乎哉？夫我則不暇！」（14.30）
31.子曰：「不患人之___不己知___，患其不能也

_[9]_    孔子對曰：「俎豆之事，則嘗聞之矣；軍旅之事，___未之學___也

_[10]_    」（15.14）
15.子曰：「躬自厚，而薄責於人，則遠怨矣！」（15.45）
16.子曰：「不曰：『如之何，如之何』者，吾末如之何也已矣？」（15.46）
17.子曰：「群居終日，言不及義，好行小慧；難矣哉！」（15.17）
18.子曰：「君子義以為質，禮以行之，孫以出之，信以成之；君子哉！」（15.18）
19.子曰：「君子病無能焉，不病人之___不己知___也

_[11]_    」「日月逝矣！歲___不我與___！」孔子曰：「諾，吾將仕矣！」（17.1）
2.  子曰：「性相近也，習相遠也



In [151]:
highlighted_sentences = find_and_highlight_sentences(zuozhuan, pattern_str1_zuozhuan, is_question = True)
print("Example sentences for PRON-PRON-VERB\n")
for i in range(len(highlighted_sentences)):
    print(f'_[{i+1}]_    {highlighted_sentences[i]}\n')

Example sentences for PRON-PRON-VERB

_[1]_    岂曰能贤？光昭先君之令德，可不务乎？吾___子其无___废先君之功

_[2]_    ”曰 ：“___子其行___乎 ！”大子曰
：“君实不察其罪，被此名也以出，人谁纳我？”
    十二月戊申，缢于新城

_[3]_    ”
公谓公孙枝曰 ：“夷___吾其定___乎？对曰 ：“臣闻之，唯则定国

_[4]_    子鱼曰 ：“祸其在此乎！君欲已甚，___其何以___堪之？”于是楚执宋公以伐宋

_[5]_    叔詹曰 ：“楚王其不没乎！为礼卒于无别，无别不可谓礼，___将何以___没？”诸侯是以知其不遂霸也

_[6]_    其波及 晋国者，君之余也，___其何以___报君？”曰 ：“虽然，何以报我？ “对曰 ：“若以君之灵，得反晋国，晋、楚治兵，遇于中原，其辟君三舍

_[7]_    ___吾是以___失楚，又何祀焉？”秋，楚成得臣、斗宜申帅师灭夔，以夔子归

_[8]_    救而弃之，谓诸侯何？楚有三施，我有三怨，怨仇已多，___将何以___战？不如私许复曹、卫以携之，执宛春以怒楚，既战而后图之

_[9]_    ___吾何以___堪之？”
东门襄仲将聘于周，遂初聘于晋

_[10]_    秦以胜归，___我何以___报？”乃皆出战，交绥

_[11]_    今天或

者大警晋也，而又杀林父以重楚胜，其无乃久不竞乎？林父之事君也，进思尽忠，退思补过，社稷之卫也，若___之何杀___之？夫其败也，如日月之食焉，何损于明？”晋侯使复其位

_[12]_    背盟，不祥；欺大国，不义；神人弗助，___将何以___胜？”不听，遂伐茅戎

_[13]_    此车一人殿之，可以集事，若之___何其以___病败君之大 事也？擐甲执兵，固即死也

_[14]_    ’其___此之谓___乎！有上不吊，其谁不受乱 ？吾亡无日矣 ！”君子曰 ：“如惧如是，斯不亡矣

_[15]_    ‘七年之中，一与一夺，二三孰甚焉！士之二三，犹丧妃耦，而况霸主？霸主将德是以 ，而二三之 ，___其何以___长有诸侯乎？
《诗》曰：‘犹之未远 ，是用大简

_[16]_    与渠丘公立于池上，曰
：“城已恶 ！”莒子曰 ：“辟陋在夷，___其孰以___我为虞？”对曰
 ：“夫

In [158]:
highlighted_sentences = find_and_highlight_sentences(zuozhuan, pattern_str2_zuozhuan, start_chars=('不','未','无'))
print("Example sentences for ADV-PRON-VERB\n")
for i in range(len(highlighted_sentences)):
    print(f'_[{i+1}]_    {highlighted_sentences[i]}\n')

Example sentences for ADV-PRON-VERB

_[1]_    于天子，则诸卿皆行，公___不自送___

_[2]_    人无衅焉，妖___不自作___

_[3]_    纳而不定 ，废而不立，以德为怨，秦___不其然___

_[4]_    初，楚子玉自为琼弁玉缨，___未之服___也

_[5]_    先轸曰 ：“匹夫逞志于君而无讨 ，敢___不自讨___乎 ？”免胄入狄 师，死焉

_[6]_    谓上___不我知___，黜而宜，乃知我矣

_[7]_    以讨召诸侯，而以贪归之，无乃不可乎？王曰
 ：“善哉 ！”吾___未之闻___也

_[8]_    若___不我纳___，今将驰矣

_[9]_    ’今楚师至，晋___不我救___，则楚强矣

_[10]_    楚弱 于晋，晋___不吾疾___也

_[11]_    荀偃令曰 ：“鸡鸣而驾，塞井夷灶，唯余马首是瞻！ “栾黡曰 ：“晋国之命，___未是有___也

_[12]_    受楚之功而取货于郑，不可谓国，秦___不其然___

_[13]_    子不辟宗，何也？”曰 ：“宗___不余辟___，余独焉辟 之？赋诗断章，余取所求焉，恶识宗？”癸言王何而反之，二人皆嬖，使执寝戈，而先后之

_[14]_    子皮止之，众曰 ：“人___不我顺___，何止焉？”子皮曰 ：“夫人礼于死者，况生者乎？”遂自止之

_[15]_    公薨之月，子产相郑伯以如晋，晋侯以我丧故，___未之见___也

_[16]_    子产曰：“少，未知可否?”子皮曰： “愿，吾爱之，___不吾叛___也

_[17]_    ”叔向曰 ：“善哉！肸___未之闻___也

_[18]_    晋为盟主，其或者___未之祀___也乎？”韩子祀夏郊，晋侯有间，赐子产莒之二方鼎

_[19]_    取郠之役，莒人诉于晋，晋有平公之丧，___未之治___也，故辞公

_[20]_    今郑人贪赖其田，而___不我与___

_[21]_    降服而对，曰 ：“臣过失命，___未之致___也

_[22]_    ”不吉，投龟，诟天而 呼曰 ：“是区区者而___不余畀___，余必自取之

_[23]_    国人请为□焉，子产弗许，曰 ：“我斗，龙___不我觌___也

_[2

## Findings
#### SOV in Interrogative Sentences
Looking at the example sentences, I found a few interesting cases that exhibit a similar shift from subject-verb-object to subject-object-verb order in interrogative sentences. These examples were found by discarding every sentence without a question mark, which limits the search to cases that are more likely to exhibit grammatical case.
* Analects: 其何以, 吾何以, 斯之謂, 爾何如
* Zuozhuan: 吾其定, 其何以, 吾何以, 其何如, [某某某]将何以, 吾其入

Looking at the other cases in the Analects, the set phrase 如之何 is most likely throwing off the results. Because no punctuation follows 何, the POS search matches a sequence that starts at the end of the set phrase and runs into the next clause, which gives undesirable results such as:
* 何其聞, 何其徹, 何其廢, 何其拒

This suggests that matching on POS tags alone is too coarse and that there is a better method for finding SOV cases.
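
One small refinement, sketched here under the assumption that 如之何 is the only set phrase causing trouble, is to discard a hit when every one of its occurrences begins with the tail of the set phrase. The function names are hypothetical, and `raw` joins two short excerpts from 11.20 and 9.12 for illustration.

```python
import re

def straddles_set_phrase(raw_text, start, hit, set_phrase):
    """True if the hit at index `start` begins with the tail of `set_phrase`
    and the text just before it supplies the rest of the phrase."""
    for k in range(1, len(set_phrase)):
        head, tail = set_phrase[:-k], set_phrase[-k:]
        before = raw_text[max(0, start - len(head)):start]
        if hit.startswith(tail) and before == head:
            return True
    return False

def hits_outside_set_phrase(raw_text, hits, set_phrase='如之何'):
    """Keep hits that occur at least once without straddling `set_phrase`."""
    kept = []
    for hit in hits:
        starts = [m.start() for m in re.finditer(re.escape(hit), raw_text)]
        if any(not straddles_set_phrase(raw_text, s, hit, set_phrase) for s in starts):
            kept.append(hit)
    return kept

raw = "有父兄在，如之何其聞斯行之！無臣而為有臣，吾誰欺？"
print(hits_outside_set_phrase(raw, ['何其聞', '吾誰欺']))
```

This only removes hits created by the set phrase; an occurrence of 何其 that does not follow 如之 would still be kept and would need to be judged separately.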

#### SOV in Negation Sentences
There were also many more cases of the negative-pronoun-verb pattern. These instances were found by calling my _find_and_highlight_sentences_ function with `start_chars` set so that only patterns whose first character is 不, 未, or 無 (无 in the simplified Zuozhuan text) are looked up. This keeps the search simple and excludes patterns that are highly unlikely to occur in a negation sentence. 莫予違 appears in the unfiltered Analects pattern list rather than in the filtered output.
* Analects: 未之見, 未之有, 未之學, 不吾知, 不吾以, 莫予違, 不己知, 不我與
* Zuozhuan: 不我知, 不我纳, 不我救, 不吾疾, 不吾叛, 不我与, 不余畀, 不我觌, 不吾远, 不余欺, 不吾废
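
To decide which leading characters are worth passing as `start_chars`, it may help to tally the first character of every ADV-PRON-VERB hit. The sketch below does this for the 33 Analects hits printed earlier in the notebook; the variable names are hypothetical.

```python
from collections import Counter

# The 33 ADV-PRON-VERB hits from the Analects, copied from the output above
adv_pron_verb_hits = [
    '未之有', '曾是以', '奚其為', '未之見', '莫己知', '猶吾大', '猶吾大',
    '莫吾猶', '未之有', '則之蕩', '不其然', '何其多', '未之思', '毋吾以',
    '不吾知', '則何以', '毋自辱', '奚其正', '亦奚以', '不吾以', '莫予違',
    '莫之違', '莫之違', '豈其然', '莫之知', '不己知', '莫我知', '莫己知',
    '未之學', '不己知', '不我與', '又誰怨', '猶之與',
]

# Count how often each character opens a hit
first_char_counts = Counter(hit[0] for hit in adv_pron_verb_hits)
for char, count in first_char_counts.most_common():
    print(char, count)
```

In this list, 莫 opens 8 hits, 不 opens 6, 未 opens 5, and 毋 opens 2, so extending the filter to 莫 and 毋 might surface further candidates for manual review.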



## Future Research and Direction
I did not look at how other instances of grammatical case within the Analects or Zuozhuan were tagged. As a next step, I could compare the frequency of certain grammatical patterns across texts from different time periods to highlight trends. For example, when does SOV order in interrogative and negation sentences start to give way to SVO, and how quickly does that transition occur? More importantly, I did not consider methods for identifying the patterns other than POS tags. Supervised machine learning could possibly be used to identify patterns without the rigid framework of a fixed POS sequence. Neural networks, for instance, could weight features based on context that might be difficult for a human to see.

It would also be possible to turn this project into a more intuitive, web-based tool for others interested in researching Chinese grammar. In that case, users could either (A) upload their own documents or (B) choose from a list of works that have already been processed.