# Grambank Features
Use LLMs to determine language features.

- Setup: Identify language, LLM, whether to use RAG, resource directories 
- Read feature
    - if using RAG, query DB and retrieve content
- Create prompt and query LLM

This notebook allows exploration of the grambank data and techniques to fill out features.

**Note that the code necessary to fill out the features for a given language is available in main.py**

## Grambank Data
Look at features and languages info from Grambank.
- 'languages.csv': Has a row for each language. Significantly, the "ID" field provides a unique key that is used in other csv files
- 'codes.csv': Explains the codes used when answering the feature questions. In general, it appears that 1 means present and 0 means absent
- 'values.csv': Provides the answer/code for each feature for each language

In [4]:
import pandas as pd
from IPython.display import display

**LintTypology** \
Utilities to get ISO codes by Glottocodes \
https://oneadder.github.io/lingtypology/html/index.html

In [5]:
import lingtypology.glottolog as glotto

glotto.get_iso_by_glot_id('arap1274')

'arp'

In [6]:
language_iso = 'kac'
print(glotto.get_by_iso(language_iso))
print(glotto.get_glot_id_by_iso(language_iso))

Southern Jinghpaw
kach1280


**languages.csv** - Provides info about each language

In [7]:
language = 'Arapaho'

df_languages = pd.read_csv('resources/grambank/languages.csv')
row = df_languages.loc[df_languages['Name']==language]
display(row)


Unnamed: 0,ID,Name,Macroarea,Latitude,Longitude,Glottocode,ISO639P3code,provenance,Family_name,Family_level_ID,Language_level_ID,level,lineage
105,arap1274,Arapaho,North America,43.3879,-108.81,arap1274,,HPG_arap1274.tsv,Algic,algi1248,arap1274,language,algi1248/algo1256/algo1257/arap1273/arap1280


In [8]:
df_languages['ISO'] = df_languages['Glottocode'].apply(glotto.get_iso_by_glot_id)



Print out languages that don't have an ISO code

In [9]:
df_languages[df_languages['ISO'].isna()]

Unnamed: 0,ID,Name,Macroarea,Latitude,Longitude,Glottocode,ISO639P3code,provenance,Family_name,Family_level_ID,Language_level_ID,level,lineage,ISO
66,amam1246,Amam,Papunesia,-7.900000,146.920000,amam1246,,JLA_amam1246.tsv,Goilalan,goil1242,amam1246,language,goil1242/weri1254,
99,apol1242,Apolista,South America,-14.830000,-68.660000,apol1242,,SD_apol1242.tsv,Arawakan,araw1281,apol1242,language,araw1281/sout3131/boli1260/uncl1518,
180,bala1316,Balade,Papunesia,-20.250000,164.200000,bala1316,,HW_bala1316.tsv,Austronesian,aust1307,bala1316,language,aust1307/mala1545/cent2237/east2712/ocea1241/s...,
195,bang1369,Bangru,Eurasia,27.950000,93.150000,bang1369,,MY_bang1369.tsv,Sino-Tibetan,sino1245,bang1369,language,sino1245/miji1239,
219,bath1248,Batha,Africa,12.188285,12.360552,bath1248,,NB_bath1248.tsv,Afro-Asiatic,afro1255,chad1249,dialect,afro1255/semi1276/west2786/cent2236/arab1394/a...,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2366,xinc1246,Xinca-Guazacapan,North America,14.069981,-90.416651,xinc1246,,RHE_xinc1246.tsv,Xincan,xinc1237,xinc1246,language,xinc1237,
2367,xink1235,Xinka-Jumaytepeque,North America,14.341898,-90.256519,xink1235,,RHE_xink1235.tsv,Xincan,xinc1237,xink1235,language,xinc1237/xinc1244,
2439,yulp1239,Yulparija,Australia,-18.923519,122.370316,yulp1239,,CB-PE-AS_yulp1239.tsv,Pama-Nyungan,pama1250,yulp1239,language,pama1250/dese1234/wati1241/mart1257,
2444,yuwa1242,Yuwaalaraay,Australia,-29.758184,147.923684,yuwa1242,,CB-PE-AS_yuwa1242.tsv,Pama-Nyungan,pama1250,gami1243,dialect,pama1250/sout3135/wira1261/gami1243,


In [10]:
row = df_languages.loc[df_languages['ISO']=='wol']
display(row)

Unnamed: 0,ID,Name,Macroarea,Latitude,Longitude,Glottocode,ISO639P3code,provenance,Family_name,Family_level_ID,Language_level_ID,level,lineage,ISO
1589,nucl1347,Wolof,Africa,15.2534,-15.383,nucl1347,,HS_nucl1347.tsv,Atlantic-Congo,atla1278,nucl1347,language,atla1278/nort3146/wolo1248/wolo1247,wol


Get the language ID

In [11]:
lang_id = row['ID'].array[0]
print(lang_id)
print(glotto.get_iso_by_glot_id(lang_id))

nucl1347
wol


**codes.csv** - Maps codes for each feature answer to a description

In [12]:
df_codes = pd.read_csv('resources/grambank/codes.csv')
df_codes

Unnamed: 0,ID,Parameter_ID,Name,Description
0,GB020-0,GB020,0,absent
1,GB020-1,GB020,1,present
2,GB021-0,GB021,0,absent
3,GB021-1,GB021,1,present
4,GB022-0,GB022,0,absent
...,...,...,...,...
393,GB520-1,GB520,1,present
394,GB521-0,GB521,0,absent
395,GB521-1,GB521,1,present
396,GB522-0,GB522,0,absent


In [13]:
df_values = pd.read_csv('resources/grambank/values.csv')
df_values

Unnamed: 0,ID,Language_ID,Parameter_ID,Value,Code_ID,Comment,Source,Source_comment,Coders
0,GB020-abad1241,abad1241,GB020,?,,Author states there is a possible example of a...,s_OaPaul_Gabadi[17],Oa & Paul 2013:17,JLA
1,GB021-abad1241,abad1241,GB021,?,,Author states there is a possible example of a...,s_OaPaul_Gabadi[17],Oa & Paul 2013:17,JLA
2,GB022-abad1241,abad1241,GB022,?,,Author states there is a possible example of a...,s_OaPaul_Gabadi[17],Oa & Paul 2013:17,JLA
3,GB023-abad1241,abad1241,GB023,?,,Author states there is a possible example of a...,s_OaPaul_Gabadi[17],Oa & Paul 2013:17,JLA
4,GB024-abad1241,abad1241,GB024,2,GB024-2,,s_OaPaul_Gabadi[15],Oa & Paul 2013:15,JLA
...,...,...,...,...,...,...,...,...,...
441658,GB433-zuni1245,zuni1245,GB433,0,GB433-0,,g_Nichols_Zuni[139],Nichols (1997: 139),BHR
441659,GB519-zuni1245,zuni1245,GB519,0,GB519-0,,s_Newman_Zuni_1965[30-1],Newman (1965: 30-1),BHR
441660,GB520-zuni1245,zuni1245,GB520,1,GB520-1,,g_Nichols_Zuni[26-7],Nichols (1997: 26-7),BHR
441661,GB521-zuni1245,zuni1245,GB521,0,GB521-0,,s_Newman_Zuni_1965[30-1],Newman (1965: 30-1),BHR


**values.csv** - gives the code for each of the features in each language

In [14]:
row = df_values.loc[df_values['Language_ID']==lang_id]
display(row)

Unnamed: 0,ID,Language_ID,Parameter_ID,Value,Code_ID,Comment,Source,Source_comment,Coders
284677,GB020-nucl1347,nucl1347,GB020,1,GB020-1,,typ_Nouguier_Wolof[201],Voisin-Nouguier 2002:201,HS
284678,GB021-nucl1347,nucl1347,GB021,0,GB021-0,"It is not stated explicitly, but there are sev...",typ_Nouguier_Wolof,Voisin-Nouguier 2002:in passim,HS
284679,GB022-nucl1347,nucl1347,GB022,1,GB022-1,,typ_Nouguier_Wolof[19-21],Voisin-Nouguier 2002:19-21,HS
284680,GB023-nucl1347,nucl1347,GB023,1,GB023-1,,typ_Nouguier_Wolof[19-21],Voisin-Nouguier 2002:19-21,HS
284681,GB024-nucl1347,nucl1347,GB024,1,GB024-1,,g_Sauvageot_Dyolof[191],Sauvageot 1965:191,HS
...,...,...,...,...,...,...,...,...,...
284813,GB276-nucl1347,nucl1347,GB276,0,GB276-0,,typ_Nouguier_Wolof,Voisin-Nouguier 2002,HS
284814,GB285-nucl1347,nucl1347,GB285,0,GB285-0,,"g_Diouf_Wolof[44, 102, 109, 138]","Diouf 2009:44, 102, 109, 138",HS
284815,GB286-nucl1347,nucl1347,GB286,0,GB286-0,,"g_Diouf_Wolof[44, 102, 109, 138]","Diouf 2009:44, 102, 109, 138",HS
284816,GB291-nucl1347,nucl1347,GB291,0,GB291-0,,"g_Diouf_Wolof[44, 102, 109, 138]","Diouf 2009:44, 102, 109, 138",HS


## Utility Functions

**Define function to get language ID**

In [15]:
def get_language_id(lang: str):
    df_languages = pd.read_csv('resources/grambank/languages.csv')
    row = df_languages.loc[df_languages['Name']==lang.capitalize()]
    if row.shape[0] == 0:
        return None
    elif row.shape[0] == 1:
        return row['ID'].array[0]
    else:
        return "Found too many"

## Set Environment Parameters

In [21]:
import os
from langchain_community.document_loaders import PyPDFLoader, TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
from pathlib import Path

OPENAI_API_KEY = ""
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
LANGCHAIN_API_KEY = ""
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = LANGCHAIN_API_KEY

## Add Grammars to Vectorstore
Ideally this should only need to be done once for each grammar book, reference, etc.

In [22]:
def get_by_iso(iso_code):
    name = glotto.get_by_iso(iso_code)
    return name.replace(" ", "_")

In [35]:
language = 'min'
language_name = get_by_iso(language)
print(language_name)

CHROMA_PATH = "./chroma_db"
# TEXT_DOC_PATH = f"./resources/{language.lower()}/grammar_book_long.txt"

Minangkabau


In [36]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 2000,
    chunk_overlap = 200,
    length_function = len,
    is_separator_regex=False
)

resource_directory = Path(f"./resources/{language_name.lower()}")
resources = list(resource_directory.glob('*.txt'))

for resource in resources:
    print(f"Loading resource: {resource}")
    # loader = TextLoader(TEXT_DOC_PATH)
    loader = TextLoader(resource)
    pages = loader.load()

    chunks = text_splitter.split_documents(pages)

    embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)
    # db_chroma = Chroma.from_documents(chunks, embeddings, persist_directory=CHROMA_PATH)
    db_chroma = Chroma.from_documents(documents=chunks, collection_name=language_name.lower(), embedding=embeddings, persist_directory=CHROMA_PATH)

Loading resource: resources/minangkabau/adelaar_proto-malayic1992v2_o.txt
Loading resource: resources/minangkabau/zarbaliev_minangkabau1987_o.txt
Loading resource: resources/minangkabau/reibaud_minangkabau2004_o.txt
Loading resource: resources/minangkabau/crouch_minangkabau2009.txt


**Test connecting to database**

Here we're simply getting a client to interact with the database.

In [37]:
embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)
vectorstore = Chroma(embedding_function=embeddings, persist_directory=CHROMA_PATH)
chroma_client = vectorstore._client

# collections = db_chroma._client.list_collections()
collections = chroma_client.list_collections()

for collection in collections:
    print(collection.name)

# Delete a collection by name
# chroma_client.delete_collection('ilo')

mizo
kalamang
langchain
southern_jinghpaw
minangkabau
iloko


In [39]:
my_set = set()


collection = chroma_client.get_or_create_collection(name='minangkabau')
results = collection.get()
# print(results.keys())
results['metadatas']
# List document IDs
# document_ids = results['ids']
# for doc_id in document_ids:
#     print(f"Document ID: {doc_id}")
for metadata in results['metadatas']:
    my_set.add(metadata['source'])
print(my_set)

{'resources/minangkabau/zarbaliev_minangkabau1987_o.txt', 'resources/minangkabau/adelaar_proto-malayic1992v2_o.txt', 'resources/minangkabau/crouch_minangkabau2009.txt', 'resources/minangkabau/reibaud_minangkabau2004_o.txt'}


## Setup

Load the list of features

In [54]:
import json

feature_file = "resources/grambank/features/all_features.json"
with open(feature_file) as f:
    all_features = json.load(f)
all_features['GB020']

{'id': 'GB020',
 'source': 'GB020.md',
 'feature': 'Are there definite or specific articles?',
 'Summary': "An article is a marker that accompanies the noun and expresses notions such as (non-)specificity and (in)definiteness. Sometimes these notions of specificity and definiteness are summed up in the term 'identifiability'. The formal expression is irrelevant; articles can be free, bound, or marked by suprasegmental markers such as tone.\nArticles are different from demonstratives in that demonstratives occur in a paradigm of markers that have a clear spatial deictic function. As demonstratives can grammaticalize into definite or specific articles, they form a natural continuum, making it hard to define discrete categories, but to qualify as an article a marker should be used in some cases to express definiteness without also expressing a spatial deictic meaning.\n\n\n",
 'Procedure': '1. Code 1 if there is a morpheme that can mark definiteness or specificity without also conveying a

Languages represented in Grambank that are also in Back to School paper. Languages are looked up in languages.csv file
- min : Minangkabau
- lus : Mizo
- Wolof
- Dinka???
- Chuvash
- gug : Guarani
- kgv : Kalamang
- ilo : Iloko (aka Ilokano)
- kac : Kachin (aka Southern Jinghpaw)
- ntu : Natugu

First we set the language, API keys, etc

**Experiment Parameters**

Set languages and features

In [71]:
# list_of_languages = ['Kalamang']
# list_of_languages = ['ilo']
language = 'lus'
use_rag = True
list_of_features = list(all_features.keys())
list_of_features = ['GB020', 'GB051', 'GB522'] # smaller list for testing

In [56]:
# Verify we can get language name and glottocode by ISO code
language_name = glotto.get_by_iso(language).capitalize()
language_id = get_language_id(language_name)
glot_lang_id = glotto.get_glot_id_by_iso(language)

print(language_name)
print(language_id)
print(glot_lang_id)

Mizo
lush1249
lush1249


**Incorporate RAG**\
Do this section if you want to query the vector store for content to include in the prompt.

In [60]:
# Test loading the database
embedding_db = Chroma(collection_name=language_name.lower(),
                      embedding_function=embeddings,
                      persist_directory=CHROMA_PATH
)

retriever = embedding_db.as_retriever(search_kwargs={'k': 2})
docs = retriever.invoke("What is the basic sentence structure of the language? Does it use subject-verb-object or aubject-object-verb")
docs[1]

Document(metadata={'source': 'resources/mizo/chhangte_mizo1993_o.txt'}, page_content="poor!\n\nb. 6y_ a-vS-na-fc&n EXCL 3s-go-hurt-INT 'Ouch! That really hurts!'\n\nSentences will now be discussed in the following chapter.\n\n\x0c134 Notes 1. Reduplication has been discussed in chapter four.\n\n\x0c135\nCHAPTER V I\nTHE SENTENCE\nFor purposes of this dissertation a sentence w ill be defined as a syntactic unit which consists of at least one independent clause and optionally one or more dependent clauses. The clauses may be combined in a number o f ways using various morphological devices. Optional final particles mark the end o f a sentence.\nIn many Tibeto-Bunnan languages such as Tibetan it is quite common to have a string o f clauses where only the verb in the la st clause is finite and inflected (see DeLancey 1991). Mizo, however, does not allow for clause chains. The rest o f this chapter will describe the various clause combining strategies.\nBasic Sentence Structure\nAn independ

In [61]:
# Function for joining all the docs retrieved from vector store
# def format_docs(docs):
#     return "\n".join(doc.page_content for doc in docs)
def format_docs(docs):
    output = ""
    for doc in docs:
        output += str(doc.metadata) + "\n"
        output += doc.page_content + "\n"
    return output

In [62]:
# Define the rag_chain
rag_chain = retriever | format_docs

In [63]:
# Function for querying the database
def query_db(chain, search_string, use_rag):
    if use_rag:
        ref_data = chain.invoke(search_string)
        return "To help answer the question, here is relevant data retrieved from a grammar book\n" + ref_data
    else:
        return ""

**Set Up Model and Template**

Note that the model uses a feature called **JSON mode** to obtain a response in JSON format\
https://python.langchain.com/docs/concepts/#structured-output

In [65]:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain.output_parsers.json import SimpleJsonOutputParser

model = ChatOpenAI(
    model="gpt-4o-2024-08-06",
    model_kwargs={ "response_format": {"type": "json_object"}},
)

system_template = (
    "You are an expert linguist with extensive knowledge about many languages."
    "Answer the following question about the language {language_name}. You are also provided "
    "with additional information about the question and you are given a procedure "
    "that indicates allowable answers for the question. You MUST provide an answer "
    "following the procedure. If you do not know the answer, answer 'IDK'."
    "Output the answer in JSON format with the following key-value pairs: "
    "'code': code, 'comment': other_data"
    "Please format the response as valid JSON that I can parse."
)

user_template = (
    "{question}\n"
    "Here is a summary about the question:\n"
    "{summary}\n"
    "{context}\n"  # Reference material retrieved from the vector database can go here
    "Here is the procedure to follow and the allowable responses:\n"
    "{procedure}"
)

prompt_template = ChatPromptTemplate.from_messages(
    [("system", system_template), ("user", user_template)]
)


**Create the RAG Chain** which is necessary when we want to query the vectorstore

**Create the Chain** using the prompt template, model, and a JSON output parser

In [66]:
chain = (
    prompt_template |
    model |
    SimpleJsonOutputParser()
)

**Sample Prompt**

In [67]:
# Test the prompt template
example = {
    'language_name': 'English',
    'question': all_features['GB020']['feature'],
    'summary': all_features['GB020']['Summary'],
    'procedure': all_features['GB020']['Procedure'],
    'context': query_db(rag_chain, all_features['GB020']['feature'] + all_features['GB020']['Summary'], True)
}
print(prompt_template.invoke(example))
print((prompt_template.invoke(example).messages[1].content))

messages=[SystemMessage(content="You are an expert linguist with extensive knowledge about many languages.Answer the following question about the language English. You are also provided with additional information about the question and you are given a procedure that indicates allowable answers for the question. You MUST provide an answer following the procedure. If you do not know the answer, answer 'IDK'.Output the answer in JSON format with the following key-value pairs: 'code': code, 'comment': other_dataPlease format the response as valid JSON that I can parse.", additional_kwargs={}, response_metadata={}), HumanMessage(content="Are there definite or specific articles?\nHere is a summary about the question:\nAn article is a marker that accompanies the noun and expresses notions such as (non-)specificity and (in)definiteness. Sometimes these notions of specificity and definiteness are summed up in the term 'identifiability'. The formal expression is irrelevant; articles can be free

In [68]:
# Test invoking the chain on an example
hus_test = chain.invoke(example)
print(hus_test)


{'code': 0, 'comment': 'The source text does not mention a definite article, nor does it provide examples or analysis indicating the presence of a morpheme that marks definiteness or specificity without conveying spatial deictic meaning in English. The discussion focuses on ergative case marking and demonstrative pronouns, but there is no indication of definite or specific articles in the examples provided.'}


In [72]:
from tqdm import tqdm
from pathlib import Path
import sys

language_name = glotto.get_by_iso(language).capitalize()
language_id = get_language_id(language_name)

# Define output file
output_file = language.lower() + "_results"
if use_rag:
    output_file += "_RAG"
output_file += ".json"

output_dir = Path('outputs') / language_name.lower()
output_path = output_dir / output_file

if output_path.is_file():
    print("output already exists. Exiting now.")
    # sys.exit()
else:
    print("Output path does not exist")

output_dict = dict()
# for language in list_of_languages:
print(f"{language_name} -- {language} -- {language_id}")
for feature in tqdm(list_of_features):
    if use_rag:
        sources = retriever.invoke(all_features[feature]['feature'] + all_features[feature]['Summary'])
        context_string = "To help answer the question, here is relevant data retrieved from a grammar book\n" + format_docs(sources)
    else:
        context_string = ""
        
    example = {
        'language_name': language_name.capitalize(),
        'question': all_features[feature]['feature'],
        'summary': all_features[feature]['Summary'],
        'procedure': all_features[feature]['Procedure'],
        'context': context_string
        # 'context': query_db(rag_chain, all_features[feature]['feature'] + all_features[feature]['Summary'], use_rag)
    }
    value_id = f"{feature}-{language_id}"
    result = chain.invoke(example)
    # print(f"Result type: {type(result)}")
    # print(result)
    output_dict[value_id] = result
    output_dict[value_id]['source'] = example['context']


# Save results
output_path.parent.mkdir(exist_ok=True, parents=True)
with open(output_path, 'w') as f:
    json.dump(output_dict, f, indent=4, ensure_ascii=False)
print(f"Saved outputs to {output_path}")

output already exists. Exiting now.
Mizo -- lus -- lush1249


100%|██████████| 3/3 [00:05<00:00,  1.67s/it]

Saved outputs to outputs/mizo/lus_results_RAG.json





In [85]:
[{"content": doc.page_content, "metadata": doc.metadata} for doc in sources]

[{'content': "60\n\nCHAPTER III\n\nSYNTAX AND MORPHOLOGY OF THE NOUN PHRASE\n\nThe morphology of the noun phrases is fairly simple. Nouns are inflected for case and pronouns for number. Other important morphological categories include demonstrative pronouns, emphatic particles and numerical expressions. These categories are useful in helping distinguishing nouns from verbs as the distinction is not always clear when dealing with complex clauses. This chapter will discuss the morphological categories pertaining to noun phrases.\nCase marking will be dealt w ith first, followed by a discussion o f pronominal paradigms. Noun modifiers and quantifiers follow n ex t The last section illustrates lexical nominalization.\n\nCase Marking\n\nIn a simple declarative clause the case markers are: -in ergative, -m instrumental/adverbial and -a? locative. Absolutives are unmarked. Genitives are marked\n\nby high tone and vocative by a low tone. The following examples are illustrative.\n\n( l) a . kee

In [86]:
result

{'code': 1,
 'comment': 'In Mizo, the S or A argument may be omitted in a pragmatically unmarked main clause when its referent can be inferred from context, allowing for pro-drop characteristics.',
 'source': "To help answer the question, here is relevant data retrieved from a grammar book\n{'source': 'resources/mizo/chhangte_mizo1993_o.txt'}\n60\n\nCHAPTER III\n\nSYNTAX AND MORPHOLOGY OF THE NOUN PHRASE\n\nThe morphology of the noun phrases is fairly simple. Nouns are inflected for case and pronouns for number. Other important morphological categories include demonstrative pronouns, emphatic particles and numerical expressions. These categories are useful in helping distinguishing nouns from verbs as the distinction is not always clear when dealing with complex clauses. This chapter will discuss the morphological categories pertaining to noun phrases.\nCase marking will be dealt w ith first, followed by a discussion o f pronominal paradigms. Noun modifiers and quantifiers follow n ex 

In [87]:
example

{'language_name': 'Mizo',
 'question': 'Can the S or A argument be omitted from a pragmatically unmarked clause when the referent is inferable from context (‘pro-drop’)?',
 'summary': 'This feature identifies languages that do not require a phonologically overt [S or A argument](https://github.com/grambank/grambank/wiki/Core-arguments-(S,-A-and-P)) expressed by a noun phrase or [phonologically independent pronoun](https://github.com/grambank/grambank/wiki/Indexing-vs.-pronouns) when the referent can be inferred from context. These languages are often called ‘pro-drop’ languages, and the phenomenon is generally referred to as ‘null anaphora’ or ‘zero anaphora’. \n\nHere ‘omitted’ means phonologically unexpressed (we do not intend for this feature to make any distinctions regarding the theoretical status of a null argument). To count as 1, pro-drop should occur where the S or A argument might otherwise be expressed by an independent pronoun or full noun phrase. Specifically, pro-drop sho