# Convert IMSDB screenplays to structured JSON using ANTLR

Note that the attempt to [use OpenAI to parse the screenplay](./LLM_StructuredOutput_Screenplay.ipynb) failed with a `ContentFilterError`, either because the text was flagged as copyrighted work or for some other reason, so I am switching to a proper parser. One of the risks with using OpenAI, or any LLM for that matter, is that you need to catch the hallucinations _(gpt-4o-mini, for instance, was making up some dialogue; I caught it entirely by accident)_, so I might need a proper parser in any case. However, it took me almost 3 days to get the parser cleaned up, while the LLM could figure out the format instantly, so I am looking at a large time investment if I come across other screenplay formats. At that point, it would be worth installing a local LLM _(Ollama, TGI etc.)_ to see whether it also fails with a content-filter error. For now, moving on.

 - Taking ScreenJSON schema as a starting point
 - Using the Aladdin script for now
 - Copying bits and pieces from OpenAI access notebooks

## Completed files

> Links are relative to the file. They will not work when executed on Colab; please view the notebook on GitHub.

 - [README_DevelopingTheParser-2.md](../lib/python/imsdb/README_DevelopingTheParser-2.md) 
 - [Antlr Grammar - Screenplay.g4](../lib/python/imsdb/antlr/Screenplay.g4)
 - [Screenplay json code (dataclasses and pydantic bits)](../lib/python/imsdb/screenplay_json.py)
 - [Antlr Parser driver](../lib/python/imsdb/screenplay_parser.py)

In [1]:
# Install pre-reqs
!pip install nb-js-diagrammers --quiet
!pip install iplantuml --quiet
!pip install tiktoken --quiet

# For charts and such
%load_ext nb_js_diagrammers
import iplantuml

# For displaying HTML and Markdown responses from ChatGPT
from IPython.display import display, HTML, Markdown

def colorBox(txt):
    display(HTML(f"<div style='border-radius:15px;padding:15px;background-color:pink;color:black;'>{txt}</div>"))    

# ANTLR

The screenplay is nicely structured, so at first I thought regexes could work. However, the regex was getting somewhat hairy with named sub-patterns and such. Back in the day, my Perl regexes got complex enough that I had to make them whitespace-insensitive and break them into multiple, nicely indented lines. Those were complex. I am not yet that familiar with Python regexes, and getting refamiliarized with ANTLR is not a bad thing after all.

In [2]:
sample = """
PEDDLER:    Oh I come from a land
    From a faraway place
    Where the caravan camels roam
    Where they cut off your ear /Where it's flat and immense
    If they don't like your face /And the heat is intense
    It's barbaric, but hey--it's home!
    When the wind's at your back

(Camera tilts down to find JAFAR sitting on his horse and IAGO
    on his shoulder.  GAZEEM comes riding up to the pair.)

JAFAR:  You...are late.
GAZEEM:A thousand apologies, O patient one.
JAFAR:  You have it, then?
GAZEEM:I had to slit a few throats to get it.  (Pulls out
        half of the medallion.  JAFAR reaches out for it,
        but GAZEEM yanks it back.)  Ah, ah, ahhh!  The treasure!
        (IAGO squawks as he flies by and grabs the medallion.)  Ouch!
JAFAR:  Trust me, my pungent friend.  You'll get what's
        coming to you.
IAGO:   What's coming to you!  Awk!

(JAFAR pulls out the second half of the medallion.  He connects
    them, and the insect medallion begins to glow.  Finally, it
    flies out of JAFAR's hand, scaring the horses, and is off
    towards the dunes.)

JAFAR:  Quickly, follow the trail!    
"""

I did manage to get some copy/paste action going and came up with a parser/lexer fairly quickly. I am quite a bit rusty after all this time, but it is a lot like riding a bike. As long as you remember `zero-width-look-behind`, it'll all come back :-).

```python
# Use examples from
# - 
# - https://yetanotherprogrammingblog.medium.com/antlr-with-python-974c756bdb1b
# Except use the stream-from-text for temp writing

import sys
from antlr4 import CommonTokenStream, FileStream, InputStream
from antlrgen.ScreenplayLexer  import ScreenplayLexer
from antlrgen.ScreenplayParser import ScreenplayParser

sample = """SULTAN: That's right.  You've certainly proven your worth
        as far as I'm concerned. It's that law that's the
        problem.
JASMINE:    Father?
SULTAN: Well, am I sultan or am I sultan?  From this day
        forth, the princess shall marry whomever she deems
        worthy.
JASMINE:    (She smiles widely and runs into ALADDIN's arms.)
        Him!  I choose...I choose you, Aladdin.
ALADDIN:    Ha, ha.  Call me Al.

(They are about to kiss when giant blue hands pull everybody together.
    GENIE is decked out in a Hawaiian shirt with golf clubs and a Goofy
     hat.)

GENIE:  Oh, all of ya. Come over here.  Big group hug!
        Mind if I kiss the monkey?  (He kisses ABU.)  Ooh,
        hairball!  Well, I can't do any more damage around
        this popsicle stand.  I'm outta here!  Bye, bye,
        you two crazy lovebirds.  Hey, Rugman: ciao!  I'm
        history!   No, I'm mythology!  No, I don't care
        what I am--I'm free!
"""

def main(argv):
    print(f"Have {len(argv)} arguments")

    # Either FileStream if I have an arg or the local hardcoded sample.
    input_stream = FileStream(argv[1]) if len(argv) > 1 else InputStream(sample)    

    lexer  = ScreenplayLexer(input_stream)
    stream = CommonTokenStream(lexer)
    parser = ScreenplayParser(stream)

    tree = parser.screenplay()
    print(tree.toStringTree(recog=parser))

if __name__ == '__main__':
    main(sys.argv)
```

Note that the code is all under `repo: hillops/libs/python/hillops/imsdb`. The above allows me to quickly debug the parsing of the entire file:
 - `python screenplay_parser.py samples/aladdin.txt`
 - for each lexer error I encounter
   - copy the offending parts to `sample`
   - run `python screenplay_parser.py` so it'll pick the sample inside
   - fix that particular problem and then run the whole file.

Most were issues with unbalanced parens.

## Grammar TODOs

Am building up a list here as I encounter problems. Will refine all of them one by one.

 - ⬜ **Nested parens inside a `scene_section`**. Keep a push/pop count of parens so that a nested `(` section semantically belongs to the `scene_section` currently being parsed.
   - 👉 [this Stack Overflow post](https://stackoverflow.com/questions/63400627/antlr4-java-how-to-make-a-semantic-predicate-that-skips-a-token-lexer-accord) on implementing semantic predicates. Hopefully it works in Python.
 - ✔️ **Scene lines using trailing `:`**. Once the `NAME:` is passed, allow subsequent words that also end in `:`.

## Parse Problem - nested parens in scene lines

_I remember way back that I had to keep some code in the parser rule to count the parens_. Maybe I can simply start a new rule ?


```antlr
scene_section       : PARENS_OPEN section_line+ PARENS_CLOSE
                    ;
```

*to*

```antlr
scene_section       : PARENS_OPEN 
                        (
                            section_line
                            |
                            scene_section
                        )+ 
                        PARENS_CLOSE WS? CR?
                    ;
```

This simply adds nested `scene_section`s to the parse tree. After the parse, I will need to merge them into their parent section. That might be simpler than handling the nesting in the grammar.
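
The post-parse merge could look something like the sketch below. It assumes a hypothetical intermediate form in which a scene section is a Python list whose items are either text lines (`str`) or nested lists standing in for nested `scene_section`s. The real parse tree looks different, and `merge_scene_section` is an illustrative name, not something in the parser driver.

```python
from typing import List, Union

# Hypothetical intermediate form: a scene section is a list of text lines,
# and a nested scene_section shows up as a nested list.
SceneNode = Union[str, List["SceneNode"]]

def merge_scene_section(section: List[SceneNode]) -> str:
    """Flatten nested scene sections into one string, re-adding the parens
    around nested parts so the original wording is preserved."""
    parts = []
    for item in section:
        if isinstance(item, str):
            parts.append(item.strip())
        else:
            # Nested section: merge recursively and wrap it in parens again.
            parts.append(f"({merge_scene_section(item)})")
    return " ".join(parts)

# Made-up example input, not taken from the script.
nested = ["JAFAR pulls out the medallion", ["the second half"], "and connects them."]
merged = merge_scene_section(nested)
print(merged)
```

Re-wrapping the nested part in parens is a design choice: it keeps the merged text close to the source, at the cost of leaving parens inside the section content.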

## Parse Problem - distinguish NAME: and other colon-terminated words

Not sure yet how to distinguish `^ALADDIN:` and `...Turned into:`. The rule could be that a word at the start of a new line, with no leading space, can only be a name. All others can be words. How? A look-behind syntactic assertion?

> Temporarily changed `ABU into:` → `ABU into - `

---

**Problem**: not sure yet how to distinguish `^ALADDIN:` and `...Alladin:)`. The rule could be that a word at the start of a new line, with no leading space, can only be a name. All others can be words. How? A look-behind syntactic assertion?

> Temporarily changed `Whispering:` → `Whispering`
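
To make the "start of line, no leading space" rule concrete, here is a minimal sketch in plain Python `re` rather than in the ANTLR lexer. It assumes speaker names are written in capitals (optionally with spaces or apostrophes) and sit in column 0. `ACTOR_NAME` and the sample lines are made up for illustration.

```python
import re

# Assumption: a speaker name starts in column 0, is written in capitals,
# and is immediately followed by a colon.
ACTOR_NAME = re.compile(r"^(?P<name>[A-Z][A-Z' ]*):", re.MULTILINE)

# Made-up lines in the style of the script.
text = (
    "ALADDIN:    Ha, ha.  Call me Al.\n"
    "        (Whispering:) not a name, the line is indented\n"
    "ABU into: not a name either, lower case before the colon\n"
    "GENIE:  Oh, all of ya. Come over here.\n"
)

# With re.MULTILINE, ^ matches at every line start, so only
# column-0 candidates are considered.
names = [m.group("name") for m in ACTOR_NAME.finditer(text)]
print(names)
```

The `^` anchor does the work a look-behind would do: a colon-terminated word is only treated as a name when nothing precedes it on the line.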

## Parse Problem - Semantics of actor introduction with no lines ?

Have this strange text:

```screenplay
GENIE:
        GIRLS: (in couterpoint)
    Prince Ali, Handsome is he, Ali Ababwa
        There's no question this Ali's alluring
    That physique, how can I speak
        Never ordinary, never boring
    Weak at the knee
        Everything about the man just plain impresses
    Well, get on out in that square
        He's a wonder, he's a whiz, a wonder
    Adjust your veil and prepare
        He's about to pull my heart asunder
    To gawk and grovel and stare at Prince Ali!
        And I absolutely love the way he dresses!
```

Not sure if this is a typo or is allowed. Does this mean a scene with `GENIE and GIRLS` ? Ignore for now by simply removing `GENIE:` _(line 1557)_

## Parse Problem - Formalize end of actor block

While parsing the test passage below

```screenplay
ALADDIN:    Sultan?  They want me to be sultan?

(GENIE comes out of lamp)

GENIE:  Huzzah!  Hail the conquering hero! 
```

The parse-tree print looked like below:

```text
(screenplay (actor_section (actor_name ALADDIN :) (section_line      Sultan?   They  want  me  to  be  sultan?\n\n) (scene_section ( (....
```
 - Note that the `\n\n` sequence was included in the section_line. 
 - I want this to terminate the `actor_section` and start a new `scene_section`

I currently have the following relevant bits

```antlr

actor_section       : actor_name 
                        (
                            section_line
                            |
                            scene_section
                        ) +
                        (
                            CR 
                            | 
                            EMPTY_LINE
                            |
                            EOF
                        ) ?
                    ;

section_line        : WS? (WORD WS? PUNCT? WS?)+
                    ;

WORD               : ~[ \n\r\t:()]+ (WS | CR | EOF)?
                   ;

CR                 : [\r\n]+
                   ;

WS                 : [ \t]+
                   ;

EMPTY_LINE         : (CR WS* CR)+
                   ;                   

```

Obvious issues right away:
 - Why is `WORD` including `WS|CR|EOF`?
 - I would have liked to have a look-ahead assertion in the lexer to terminate a WORD.
 - If WS is not significant, can I just `-> skip` it and drop it from the parser rules?

Since a carriage return (CR) simply continues the actor's lines, we need more complex semantics for when an actor's segment ends. It ends on any of the following:
 - EMPTY_LINE 
 - CR `<nospace>` NAME COLON   _next actor segment starts_
 - CR `<nospace>` (...) _scene section starts_

So far, all continuations of an actor's lines have had leading space to visually group them with the lines above, so maybe the `<nospace>` can be made meaningful.
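
The three terminators above can be sketched in plain Python to check that the rule is at least self-consistent before encoding it in the grammar. This is not the ANTLR solution; `split_blocks` and `NAME_START` are hypothetical names, and the sketch assumes names are capitals in column 0. The passage combines lines from the two samples above.

```python
import re
from typing import List

# Assumption: names are capitals in column 0 followed by a colon.
NAME_START = re.compile(r"^[A-Z][A-Z' ]*:")

def split_blocks(text: str) -> List[List[str]]:
    """Group lines into blocks using the three terminators listed above."""
    blocks: List[List[str]] = []
    current: List[str] = []
    for line in text.splitlines():
        if not line.strip():
            # EMPTY_LINE ends the current block.
            if current:
                blocks.append(current)
                current = []
            continue
        # CR <nospace> NAME COLON, or CR <nospace> "(", starts a new block.
        starts_new = bool(NAME_START.match(line)) or line.startswith("(")
        if starts_new and current:
            blocks.append(current)
            current = []
        current.append(line)
    if current:
        blocks.append(current)
    return blocks

passage = (
    "ALADDIN:    Sultan?  They want me to be sultan?\n"
    "\n"
    "(GENIE comes out of lamp)\n"
    "\n"
    "JAFAR:  Trust me, my pungent friend.  You'll get what's\n"
    "        coming to you.\n"
    "IAGO:   What's coming to you!  Awk!\n"
)
blocks = split_blocks(passage)
for block in blocks:
    print(block)
```

The indented continuation line stays with `JAFAR`, while `IAGO:` in column 0 closes that block without needing an empty line.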



---

**Fix 1**

Doing this iteratively in the order listed.

 - `WORD : ~[ \n\r\t:()]+ ;` Remove the idiotic tacking on of WS.
 - `WS    : ... -> skip;` Skip the WS token.
 - `section_line        : (WORD PUNCT?)+;` since WS is skipped, might as well remove all `WS?` tokens




## Convert parse-tree to JSON

I started following an example to convert things to an internal structure that I could later convert to JSON. Initially, I went with a solution of multiple `@dataclass` objects. However, those would need custom conversion to JSON _(you can use their underlying `__dict__` to dump a JSON string, I think, but this struck me later)_, so, following some advice on the net, I switched to pydantic classes.

```python
from pydantic import BaseModel
from typing import Optional, List

# Using pydantic classes instead of dataclasses to get the .json() method
# (pydantic v2 deprecates it in favor of .model_dump_json()).
class SceneSection(BaseModel):
    content: str

class ActorSceneSection(SceneSection):
    pass

class ActorSection(BaseModel):
    name   : str
    content: List[str | ActorSceneSection]

class ScreenPlay(BaseModel):    
    sections: List[ActorSection | SceneSection]
```
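
For comparison, the dataclass route mentioned above does not strictly need hand-written conversion: the standard library's `dataclasses.asdict` recursively turns nested dataclasses into plain dicts that `json.dumps` accepts. A minimal sketch, with hypothetical `...DC` class names that only mirror the models above and are not the ones in `screenplay_json.py`:

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List

# Hypothetical dataclass mirrors of the pydantic models above.
@dataclass
class ActorSectionDC:
    name: str
    content: List[str] = field(default_factory=list)

@dataclass
class ScreenPlayDC:
    sections: List[ActorSectionDC] = field(default_factory=list)

play = ScreenPlayDC(
    sections=[ActorSectionDC(name="ALADDIN", content=["Ha, ha.  Call me Al."])]
)

# asdict() walks nested dataclasses, lists and dicts.
as_json = json.dumps(asdict(play))
print(as_json)
```

This only covers serialization. Pydantic additionally validates field types on construction, which is a separate reason to prefer it here.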

Following the example code in [sumeets medium article](), I came up with this.

```python
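# ScreenPlay and ActorSection are the pydantic models defined above.
# Assumption: the generated listener lives next to the lexer and parser
# in the antlrgen package.
from typing import Any, List

from antlrgen.ScreenplayListener import ScreenplayListener
from antlrgen.ScreenplayParser import ScreenplayParser

class ScreenplayASTToDataclass(ScreenplayListener):

    def __init__(self):
        # Keep a stack with whatever object is being
        # built at the current parse level
        self.stack : List[Any] = []
        self.parsed_screenplay = None

    def parsed_data(self):
        return self.parsed_screenplay
    
    #- Stack management ------------
    def _pop(self):
        # Return the popped object; the exit* handlers rely on it.
        return self.stack.pop()

    def _push(self, obj):
        self.stack.append(obj)

    def _peek(self):
        return self.stack[-1]
    #-------------- Stack management -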

    # Enter a parse tree produced by ScreenplayParser#screenplay.
    def enterScreenplay(self, ctx:ScreenplayParser.ScreenplayContext):
        self._push(
            ScreenPlay(sections=[])
            )

    # Exit a parse tree produced by ScreenplayParser#screenplay.
    def exitScreenplay(self, ctx:ScreenplayParser.ScreenplayContext):
        self.parsed_screenplay = self._pop()

    # Enter a parse tree produced by ScreenplayParser#actor_section.
    #    screenplay : (
    #                 actor_section 
    #                 | 
    #                 scene_section
    #               )+ EOF
    #               ;
    def enterActor_section(self, ctx:ScreenplayParser.Actor_sectionContext):
        self._push(
            ActorSection(name='not_set', content=[])
        )        

    # Exit a parse tree produced by ScreenplayParser#actor_section.
    def exitActor_section(self, ctx:ScreenplayParser.Actor_sectionContext):
        actor_section = self._pop()

        top = self._peek()
        assert isinstance(top, ScreenPlay)
        top.sections.append(actor_section)


    # Enter a parse tree produced by ScreenplayParser#actor_name.
    def enterActor_name(self, ctx:ScreenplayParser.Actor_nameContext):
        name = "unknown"
        token = ctx.NAME_WORD()
        print(f"actor_name. NAME_WORD = {token}")        
        print(f"token.getText() = {token.getText()}")
        print(f"token.getSymbol() = {token.getSymbol()}")
        print(f"token.getChildCount() = {token.getChildCount()}")

        top = self._peek()
        assert isinstance(top, ActorSection)
        top.name = name

    # Exit a parse tree produced by ScreenplayParser#actor_name.
    def exitActor_name(self, ctx:ScreenplayParser.Actor_nameContext):
        # Nothing to pop as enterActor_name directly modifies the 
        # actorSection on the stack.
        pass


    # Enter a parse tree produced by ScreenplayParser#section_line.
    def enterSection_line(self, ctx:ScreenplayParser.Section_lineContext):
        pass

    # Exit a parse tree produced by ScreenplayParser#section_line.
    def exitSection_line(self, ctx:ScreenplayParser.Section_lineContext):
        pass


    # Enter a parse tree produced by ScreenplayParser#scene_section.
    def enterScene_section(self, ctx:ScreenplayParser.Scene_sectionContext):
        pass

    # Exit a parse tree produced by ScreenplayParser#scene_section.
    def exitScene_section(self, ctx:ScreenplayParser.Scene_sectionContext):
        pass
```

While investigating the `actor_name` rule, I realized that `ctx.NAME_WORD()` was giving me `ALADDIN:`, so I would have to strip the colon later on. Seeing if I can change the rule so I get two tokens _(and then use only the `WORD` token in `enterActor_name`)_.

```diff
+actor_name          : WORD COLON
-actor_name          : NAME_WORD
                    ;                    
```                    

✔️ That worked. Now I have `WORD()` and `COLON()` in the context, and the listener code changes to

```python
def enterActor_name(self, ctx:ScreenplayParser.Actor_nameContext):        
        token = ctx.WORD()
        name  = token.getText()
        
        # debug
        print(f"actor_name. WORD token= {token}")
        print(f"token.getText() = {token.getText()}")
        print(f"token.getSymbol() = {token.getSymbol()}")
        print(f"token.getChildCount() = {token.getChildCount()}")
```

gives me the following output

```console
actor_name. WORD token= ALADDIN
token.getText() = ALADDIN
token.getSymbol() = [@0,0:6='ALADDIN',<1>,1:0]
token.getChildCount() = 0
```