# Colx 525 Lab 1: Morphological Analyzer for Swahili (Cheat sheet)

Swahili is a Bantu language widely spoken in East Africa. An estimated 100-150 million people speak Swahili, though not all of them are native speakers. Swahili is the lingua franca of the African Great Lakes region: like English in Europe and North America, it is used by non-native speakers as a common language of communication. Since Africa's economic impact and the African consumer market are growing continuously, Swahili is likely to become an important language for NLP in the future.

Below you can see a small dataset of Swahili verbs:  

| Swahili      | English           | Swahili     | English           |
|:-------------|:------------------|:------------|:------------------|
| anapenda     | 'he likes'        | alimona     | 'he saw him'      |
| atapenda     | 'he will like'    | alimsaidia  | 'he helped him'   |
| alipenda     | 'he liked'        | alimpiga    | 'he hit him'      |
| amependa     | 'he has liked'    | alimchukua  | 'he carried him'  |
| alinipenda   | 'he liked me'     | alimua      | 'he killed him'   |
| alikupenda   | 'he liked you'    | ananitazama | 'he looks at me'  |
| alimpenda    | 'he liked him'    | atakusikia  | 'he will hear you'|
| alitupenda   | 'he liked us'     | alitupanya  | 'he cured us'     |
| aliwapenda   | 'he liked them'   | ninakupenda | 'I like you'      |
| nitampenda   | 'I will like him' | nitawapenda | 'I will like them'|

As you can see, an inflected form of a verb in Swahili can correspond to an entire short English sentence. In this assignment, you will first perform a linguistic analysis of the Swahili dataset. After uncovering the morphemes in the dataset and their grammatical roles, you will create a finite-state analyzer for Swahili verb forms using `foma`.

The two comparisons below illustrate the method: line up forms whose translations share a word, and look for the substring that the Swahili forms share.

```
ana | penda      he      | likes
ata | penda      he will | like
```
$\Rightarrow$ penda = like(s)

```
ali  | ku | penda      he | liked     | you
ata  | ku | sikia      he | will hear | you
nina | ku | penda      I  | like      | you
```
$\Rightarrow$ ku = you (obj)

### Assignment 0: Install `foma`

`foma` is a popular finite-state toolkit which we'll use in this course because it is fairly easy to install on Windows, macOS and Linux.

#### Windows Users

Download the following [program binaries](https://bitbucket.org/mhulden/foma/downloads/foma-0.9.18_win32.zip) and unzip them. Place the program files `foma.exe`, `flookup.exe` and `cgflookup.exe` in a directory which is included in your `PATH` variable. To check the value of `PATH` and add directories to it, you can navigate to:

```My Computer > Properties > Advanced > Environment Variables > System Variables```

(Miikka suggests creating a subdirectory `Foma` in your `Program Files` directory and adding it to `PATH`).

#### Mac Users

Download the following [program binaries](https://bitbucket.org/mhulden/foma/downloads/foma-0.9.18_OSX.tar.gz). Then open a terminal and run:

```
cd ~/Downloads
tar -xzvf foma-0.9.18_OSX.tar.gz
sudo cp OSX/foma OSX/flookup /usr/local/bin/
```

You will be prompted for your password. Give it and press enter. Don't be alarmed if the cursor doesn't move when you're typing your password. This is a security feature.

Alternatively, you can install it with [`brew`](https://formulae.brew.sh/formula/foma):

```
brew install foma
```

#### Linux Users

Open a terminal and run:

```
sudo apt install foma-bin
```

You will be prompted for your password. Give it and press enter. Don't be alarmed if the cursor doesn't move when you're typing your password. This is a security feature.

### Assignment 1: Linguistic Analysis

Perform a linguistic analysis of the dataset above. You should uncover all stems and bound morphemes in the dataset and find their English translation. To identify the morphemes, it is helpful to compare the different examples and see how substrings in the Swahili word forms relate to the English translations. 

There are no morphophonological alternations in this data. This means that a morpheme like the verb stem for 'like' will have the exact same realization in each example where it occurs. This makes it easier to locate the morphemes.
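Because every morpheme keeps the same shape, the comparison step can be sketched mechanically. The following Python snippet is only an illustration of the idea, not part of the assignment: the helper names `common_prefix` and `common_suffix` are made up here, and the four forms are the 'he ... like' rows from the table above.

```python
import os

def common_prefix(forms):
    """Longest string that every form starts with."""
    return os.path.commonprefix(forms)

def common_suffix(forms):
    """Longest string that every form ends with."""
    reversed_forms = [form[::-1] for form in forms]
    return os.path.commonprefix(reversed_forms)[::-1]

# Four forms from the dataset whose translations share 'he' and 'like'.
forms = ["anapenda", "atapenda", "alipenda", "amependa"]

shared_start = common_prefix(forms)
shared_end = common_suffix(forms)
# Whatever sits between the shared start and the shared end differs per form.
middles = [form[len(shared_start):len(form) - len(shared_end)] for form in forms]

print(shared_start, middles, shared_end)
```

The shared end is a candidate for the stem, the shared start is a candidate for the subject marker, and the varying middles line up with the changes in tense in the translations. A purely string-based comparison like this can over-segment when forms happen to share extra letters, so the result should always be checked against the translations.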

In the following questions, if a given morpheme is realized as a zero morph, you can answer "0". 

#### Assignment 1.1: Verb stems

rubric={accuracy:9}

What is the Swahili verb stem corresponding to the English verb 'like'?

`Your answer:` 

What is the Swahili verb stem corresponding to the English verb 'see'?

`Your answer:` 

What is the Swahili verb stem corresponding to the English verb 'help'?

`Your answer:` 

What is the Swahili verb stem corresponding to the English verb 'hit'?

`Your answer:` 

What is the Swahili verb stem corresponding to the English verb 'carry'?

`Your answer:` 

What is the Swahili verb stem corresponding to the English verb 'kill'?

`Your answer:` 

What is the Swahili verb stem corresponding to the English verb 'look at'?

`Your answer:` 

What is the Swahili verb stem corresponding to the English verb 'hear'?

`Your answer:` 

What is the Swahili verb stem corresponding to the English verb 'cure'?

`Your answer:` 

#### Assignment 1.2: Tense

rubric={accuracy:4}

Which morpheme is used to express simple present tense (for example: 'I **like** him') in Swahili?

`Your answer:` 

Which morpheme is used to express simple past tense (for example: 'he **liked** you') in Swahili?

`Your answer:` 

Which morpheme is used to express past perfect tense (for example: 'he **has liked**') in Swahili?

`Your answer:` 

Which morpheme is used to express future tense (for example: 'I **will like** them') in Swahili?

`Your answer:` 

#### Assignment 1.3: Personal Pronouns

rubric={accuracy:7}

Which morpheme is used to express 1st person singular when it's the **subject** (as in '**I** will like') in Swahili?

`Your answer:` 

Which morpheme is used to express the 1st person singular when it is the **object** (as in 'they like **me**') in Swahili?

`Your answer:` 

Which morpheme is used to express the 2nd person singular when it's the **object** (as in 'I saw **you**') in Swahili?

`Your answer:` 

Which morpheme is used to express the masculine 3rd person singular when it's the **subject** (as in '**he** saw me') in Swahili?

`Your answer:` 

Which morpheme is used to express the 3rd person singular when it is the **object** (as in 'I liked **him**') in Swahili?

`Your answer:` 


Which morpheme is used to express the 1st person plural when it's the **object** (as in 'They saw **us**') in Swahili?

`Your answer:` 

Which morpheme is used to express the 3rd person plural when it's the **object** (as in 'I saw **them**') in Swahili?

`Your answer:` 

#### Assignment 1.4: General Questions

rubric={accuracy:3}

How would you say in Swahili 'I will cure you'?

`Your answer:` 

How would you say in Swahili 'he hit me'?

`Your answer:` 

How would you say in Swahili 'I heard you'?

`Your answer:` 

#### Assignment 1.5: General Questions

rubric={reasoning:2}

Based on this dataset, do you think person marking is affected by whether the person is the subject or the object?

`Your answer:` 

### Assignment 2: Building the Lexicon

rubric={accuracy:10}

Construct a lexc lexicon `Swahili.lexc` which maps Swahili verb forms to their analyses. You are free to structure the lexicon any way you like but it should be able to analyze any grammatical combination of stem and inflectional affixes in our verb dataset. If you need help with foma and lexc syntax, please consult the [foma tutorial](https://fomafst.github.io/morphtut.html) ("The lexc-script" may be the most relevant part).

Use the following grammatical markers:

|   *POS*         |
|-----------------|
| `VERB+`         |

|   *Tense*       |
|-----------------|
| `FUTURE+`       |
| `PRESENT+`      |
| `PAST+`         |
| `PAST_PERFECT+` |

|   *Person*      |                                       |
|-----------------|---------------------------------------|
| `1SG_SUBJ+`     | 1st person singular subject           |
| `1SG_OBJ+`      | 1st person singular object            |
| `2SG_OBJ+`      | 2nd person singular object            |
| `3MASC_SG_SUBJ+`| 3rd person masculine singular subject |
| `3MASC_SG_OBJ+` | 3rd person masculine singular object  |
| `1PL_OBJ+`      | 1st person plural object              |
| `3PL_OBJ+`      | 3rd person plural object              |


Read your lexc lexicon into foma. For the verb form *ananitazama*, your analyzer should behave in the following way:

```
apply up> ananitazama
3MASC_SG_SUBJ+PRESENT+1SG_OBJ+VERB+tazama
```

#### Assignment 2.1: Constructing the lexicon

rubric={accuracy:10}

Copy-paste your lexc lexicon below.

```
!!! Swahili.lexc !!!
```

**Worked example:** *a li ni penda* ('he liked me') = Subj ('he') + Tense ('past') + Obj ('me') + Verb ('like'), which should be analyzed as `3MASC_SG_SUBJ+PAST+1SG_OBJ+VERB+penda`.



- First, in the lexc formalism, we need to declare every tag that should be treated as a single multicharacter symbol rather than as a sequence of letters:

```
Multichar_Symbols 

VERB+ 
3MASC_SG_SUBJ+ PAST+ 1SG_OBJ+  ... 
```

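- Then, we must declare a Root lexicon. The Root lexicon is where we start building a word. Each entry has the form `analysis:surface Continuation ;`, where the continuation names the lexicon to visit next and `#` marks the end of the word: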

```
LEXICON Root

Subj ;

LEXICON Subj

3MASC_SG_SUBJ+:a Tense ; 

LEXICON Tense

PAST+:li Obj ;

LEXICON Obj

1SG_OBJ+:ni Verb ;

LEXICON Verb

VERB+:0 VerbStem ;

LEXICON VerbStem

penda # ;

```
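The continuation-class mechanism in the lexc fragment above can be mimicked in a few lines of Python. This is a rough sketch for intuition only: it is not how `foma` works internally, the names `LEXICONS`, `paths` and `apply_up` are invented for this illustration, and it contains only the entries already shown in the fragment.

```python
# Each lexicon maps to a list of (upper, lower, continuation) entries.
# "#" marks the end of a word, as in lexc. An empty string plays the role of 0.
LEXICONS = {
    "Root": [("", "", "Subj")],
    "Subj": [("3MASC_SG_SUBJ+", "a", "Tense")],
    "Tense": [("PAST+", "li", "Obj")],
    "Obj": [("1SG_OBJ+", "ni", "Verb")],
    "Verb": [("VERB+", "", "VerbStem")],
    "VerbStem": [("penda", "penda", "#")],
}

def paths(lexicon="Root"):
    """Yield every (analysis, surface form) pair reachable from `lexicon`."""
    if lexicon == "#":
        yield "", ""
        return
    for upper, lower, continuation in LEXICONS[lexicon]:
        for rest_upper, rest_lower in paths(continuation):
            yield upper + rest_upper, lower + rest_lower

def apply_up(surface):
    """Return all analyses whose surface side equals `surface`."""
    return [analysis for analysis, form in paths() if form == surface]

print(apply_up("alinipenda"))
```

Adding a second entry to any of the lists corresponds to adding a second line to that `LEXICON` in the lexc file: every path through the earlier lexicons can then continue through either entry.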

#### Assignment 2.2: Compiling the lexicon

rubric={accuracy:3}

Read your lexicon into `foma` and let `foma` compile it. Copy-paste your `foma` commands and the compilation message below.

Foma, version 0.9.18alpha (svn r241)
Copyright © 2008-2015 Mans Hulden
This is free software; see the source code for copying conditions.
There is ABSOLUTELY NO WARRANTY; for details, type "help license"

Type "help" to list all commands available.
Type "help <topic>" or help "<operator>" for further help.

foma[0]: ...

#### Assignment 2.3: Testing the lexicon

rubric={accuracy:3}

Test your lexicon using the following verb forms: *nitampenda*, *nitawapenda*, *ananitazama*, *nitakupanya* and *nilikusikia*. Copy-paste the output from the `foma` command `apply up` below.

foma[0]: up
apply up> nitampenda
...

### Assignment 3: Translating verb stems to English

Create a second lexc lexicon `translate.lexc`. It should map English translations to Swahili verb stems. This means that when you execute `apply up` for a Swahili verb stem, `foma` should output its English translation, for example *help*. 

Your lexicon should not contain any grammatical tags so you don't need a `Multichar_Symbols` section. You probably only need one `LEXICON` in your lexc file. 

**Note!** The verb 'look at' contains a space character, and spaces have a special meaning in lexc syntax (they separate an entry from its continuation class). Feel free to replace the space with an underscore `_`. You can also escape it with `%`, as in `look% at`.

#### Assignment 3.1: Building the translation lexicon

rubric={accuracy:5}

Copy-paste your `translate.lexc` lexicon below.

```
!!! translate.lexc !!!


```

```
LEXICON Root

like:penda # ;
```


#### Assignment 3.2: Combining lexicons

rubric={accuracy:5}

You should now combine `Swahili.lexc` and `translate.lexc` using finite-state calculus. For the verb form *ananitazama*, your analyzer should behave in the following way:

```
apply up> ananitazama
3MASC_SG_SUBJ+PRESENT+1SG_OBJ+look at
```

Read both `Swahili.lexc` and `translate.lexc` into `foma` and combine them using finite-state operations (**HINT:** composition will be useful here). Copy-paste all the output from `foma` below.
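To see what composition does conceptually, it can help to treat each transducer as a set of (upper, lower) string pairs. The Python sketch below is an illustration under that simplification and does not show `foma` syntax (in `foma` regular expressions the composition operator is `.o.`). The helper `lift` is a hypothetical device for letting a stem-level relation apply to whole analyses, and the data is limited to the single pair per lexicon shown earlier in this sheet.

```python
def compose(first, second):
    """Pair x with z whenever first contains (x, y) and second contains (y, z)."""
    return {(x, z) for (x, y) in first for (y2, z) in second if y == y2}

def lift(translate, analyses):
    """Apply a stem-level relation to whole analyses by rewriting only the final stem."""
    lifted = set()
    for analysis in analyses:
        for english, swahili in translate:
            if analysis.endswith(swahili):
                lifted.add((analysis[: -len(swahili)] + english, analysis))
    return lifted

# One pair from each lexicon, taken from the examples in this sheet.
swahili = {("3MASC_SG_SUBJ+PAST+1SG_OBJ+VERB+penda", "alinipenda")}
translate = {("like", "penda")}

analyses = {analysis for analysis, _ in swahili}
combined = compose(lift(translate, analyses), swahili)

def apply_up(relation, surface):
    """Return the upper-side strings paired with `surface`."""
    return sorted(upper for upper, lower in relation if lower == surface)

print(apply_up(combined, "alinipenda"))
```

The key point is that the lower side of the first relation must match the upper side of the second exactly, which is why the translation lexicon has to be extended so that it passes the grammatical tags through unchanged.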

Foma, version 0.9.18alpha (svn r241)
Copyright © 2008-2015 Mans Hulden
This is free software; see the source code for copying conditions.
There is ABSOLUTELY NO WARRANTY; for details, type "help license"

Type "help" to list all commands available.
Type "help <topic>" or help "<operator>" for further help.

foma[0]: ...

#### Assignment 3.3: Testing the lexicon

rubric={accuracy:3}

Test your lexicon using the following verb forms: *nitampenda*, *nitawapenda*, *ananitazama*, *nitakupanya* and *nilikusikia*. Copy-paste the output from the foma command `apply up` below.

foma[1]: up
apply up> nitampenda
...

#### Assignment 3.4
rubric={reasoning:3}

How many distinct pairs of word forms and analyses does your FST recognize? The foma compiler will display this number when reading in your lexicon. It will print something like:

```
2.1 kB. 42 states, 65 arcs, 432 paths
```

This indicates that the analyzer recognizes `432` pairs of input forms such as "ananitazama" and analyses such as "3MASC_SG_SUBJ+PRESENT+1SG_OBJ+VERB+tazama". Explain why the model recognizes exactly this number of paths. You can refer to the entry counts in your sub-lexicons to justify your reasoning.
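When the sub-lexicons form a simple chain, so that every entry of one lexicon continues to the same next lexicon, the number of paths is the product of the entry counts. The short Python sketch below shows that arithmetic. The function name `count_paths` and the counts in `counts` are assumptions chosen for illustration, not values read from any particular lexc file.

```python
from math import prod

def count_paths(entries_per_lexicon):
    """Number of paths through a linear chain of lexicons.

    Every entry in one lexicon can combine with every entry in the next,
    so the total is the product of the entry counts.
    """
    return prod(entries_per_lexicon.values())

# Illustrative entry counts (assumed for this example).
# A lexicon with a single entry, such as one holding only VERB+,
# contributes a factor of 1 and can be left out.
counts = {"Subj": 2, "Tense": 4, "Obj": 6, "VerbStem": 9}

print(count_paths(counts))
```

With these assumed counts the product is 2 × 4 × 6 × 9 = 432. If your own lexicon branches differently, for example because some entries skip a sub-lexicon, count the paths per branch and add the branch totals instead.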