# SpInPy - SPanish IN PYthon
TEAM MEMBERS: Swetha Berana, Morgan Sholeen, Lindia Tjuatja

Research has shown that code switching, the practice of switching between languages within a single utterance, is an effective tool in language learning. We have created a program that takes user input in English and translates certain parts of it into Spanish, with French and German available as alternate language modes. The result is a mixed-language sentence that lets the user learn new Spanish, French, or German vocabulary by embedding unknown words in a known English context.

In [None]:
#modules to be used
import nltk
from nltk.chunk import *
from nltk.chunk.regexp import *
from nltk.tokenize import word_tokenize
from googletrans import Translator
translator = Translator()
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
import pyttsx3
engine = pyttsx3.init()

We used a number of modules and APIs to create this program. First and foremost is Python's Natural Language Toolkit (NLTK). We also used googletrans, a Python client for Google Translate, and pyttsx3, an offline text-to-speech library from PyPI.

### main menu
The program has three modes, which are presented to the user at the main menu: (1) code switch, (2) glossary, and (3) change language.

(1) Code switch takes user input (an English sentence) and replaces its noun phrases with their translations in another language. This is discussed further in the codeswitch( ) section. (Note: the cell defining the codeswitch function must be run before the main menu loop.)

(2) Glossary presents the user with a list of all translated noun phrases. The list contains tuples of the format (original English noun phrase, translated noun phrase, language of translation).

(3) Change language allows the user to change the language of translation to Spanish (default), French, or German.

Finally, the user can exit the program by entering the (*) character.
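
As a rough illustration of the data behind modes (2) and (3), the sketch below shows how a menu key can map to a language code and what a glossary entry looks like. The names `LANG_CODES` and `pick_language` are hypothetical and are not part of the program; only the codes `es`, `fr`, and `de` and the tuple layout come from the description above.

```python
# Hypothetical helper names; the language codes match those used by the menu.
LANG_CODES = {'s': 'es', 'f': 'fr', 'g': 'de'}

def pick_language(key, current='es'):
    """Return the language code for a menu key, or keep the current one."""
    return LANG_CODES.get(key, current)

# A glossary entry is (English noun phrase, translation, language code).
gloss = []
gloss.append(('the big dog', 'el perro grande', pick_language('s')))
print(gloss)
```

Falling back to the current language for an unrecognized key means that leaving the language menu with (*) does not change the setting.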

In [None]:
#outer loop - main menu and exit
gloss = []        #glossary of (English phrase, translation, language) tuples
destlang = 'es'   #language of translation; Spanish by default

while True:
    mode = input("Welcome to SpInPy!\n Main menu-- (1) code switch    (2) glossary    (3) change language    (*) exit\n Mode select: ")

    if mode == '*':
        print('¡Adiós! Goodbye!')
        break

    elif mode == '1':   #codeswitch mode
        print('\nCode switch mode. Enter (*) to go back to the main menu.')
        codeswitch()

    elif mode == '2':   #view glossary
        print('\nGlossary\n', gloss)

    elif mode == '3':   #option to switch languages, which is then applied in codeswitch mode
        while True:
            lang = input("To return to the main menu, enter (*)\n Enter (f) for French, (g) for German or (s) for Spanish (default): ")
            if lang == '*':   #leave the current language unchanged
                break
            elif lang == 'f':
                destlang = 'fr'
                break
            elif lang == 'g':
                destlang = 'de'
                break
            elif lang == 's':
                destlang = 'es'
                break
            else:
                print("\nNot a valid input. ")

    else:
        print("\nNot a valid input. ")

The main menu isn't too complicated. The largest part of this program is the code switch function called under the first elif.

### codeswitch( )

The codeswitch function is composed of four parts. They are:
1. Obtain user input (an English sentence)
2. Parse the sentence for noun phrases via part of speech tagging and chunking*
3. Find, translate, and replace noun phrases
4. Print the result and let the user listen to the sentence, enter a new sentence, or return to the main menu

*Before delving into the code, here is a brief description of part of speech tagging, noun phrases, and chunking.

Simply put, the NLTK part of speech tagger takes a list of tokens (such as individual words) and tags each one with its part of speech (noun, verb, adjective, etc.). The output is a list of tuples of the form (word, POS tag).

From there, we can group words of certain parts of speech together (i.e. "chunking") using regular expressions over the tags to form phrases. For our purposes, a noun phrase consists of an optional determiner, zero or more adjectives, and a noun. We specify this rule in defining NounPhrase.
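
To make the rule concrete without needing NLTK or its data files, here is a minimal sketch that applies the same pattern with Python's `re` module to hand-tagged tokens. It is a stand-in for illustration, not how `RegexpParser` is implemented; `NP_RULE` and `find_noun_phrases` are hypothetical names, and the tags were written by hand rather than produced by the tagger.

```python
import re

# Hand-tagged tokens in the (word, POS tag) form that a tagger returns.
tagged = [('The', 'DT'), ('big', 'JJ'), ('dog', 'NN'),
          ('ate', 'VBD'), ('cold', 'JJ'), ('pizza', 'NN'), ('.', '.')]

# Optional determiner, any number of adjectives, then a singular or plural noun.
NP_RULE = re.compile(r'(<DT>)?(<JJ>)*(<NNS?>)')

def find_noun_phrases(tagged_tokens):
    """Return each noun phrase as a list of words."""
    # Encode the tag sequence as one string, e.g. '<DT><JJ><NN>...'.
    tag_string = ''.join(f'<{tag}>' for _, tag in tagged_tokens)
    phrases = []
    for match in NP_RULE.finditer(tag_string):
        # Convert character offsets in the tag string back to token indices.
        start = tag_string[:match.start()].count('<')
        end = tag_string[:match.end()].count('<')
        phrases.append([word for word, _ in tagged_tokens[start:end]])
    return phrases

print(find_noun_phrases(tagged))
# prints [['The', 'big', 'dog'], ['cold', 'pizza']]
```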

Why use noun phrases instead of just translating the nouns? The short answer is word alignment (although the longer one has more complicated roots in the rules of code-switching syntax). For example, the word order of adjectives and nouns is usually reversed between English and Spanish: "big dog" in English becomes "perro grande" (literally "dog big") in Spanish. By translating the entire noun phrase, which includes the adjective, we can avoid this issue.

In [None]:
def codeswitch():
    #defining what constitutes a noun phrase
    NounPhrase = RegexpParser('''
                              NP: {<DT>?<JJ>*<NN|NNS>}
                              ''')
    flag = 1

After defining the NounPhrase grammar rule and setting the flag variable, we enter the while loop. The first step is to prompt the user for an English sentence. If the input is (*), we exit back to the main menu. If the input does not end in a period, we append one for ease of parsing.

In [None]:
    while(flag): 
        sentence = input("Type a sentence in English: ")
        if sentence == '*':
            break
        if not sentence.endswith('.'):   #also safe when the input is empty
            sentence = sentence + '.'

The sentence string is then tokenized (separated into words and punctuation) and part of speech tagged. Afterwards, it is parsed with the noun phrase chunk rule we defined previously. Finally, we add beginning-inside-outside tags (which, as the name implies, mark the beginning, inside, and outside of chunks). This will aid us in finding and replacing chunks.

In [None]:
        words = word_tokenize(sentence)   #tokenize string - returns list of tokens
        tagged_words = nltk.pos_tag(words)   #tag with part of speech - returns list of tuples 
        parsed = NounPhrase.parse(tagged_words)   #parse noun phrase chunks
        parsed_bio = nltk.chunk.tree2conlltags(parsed)   #add b-i-o tags as third element to tuples in list
        np_chunk = ''
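
To show what the beginning-inside-outside tags look like, the sketch below attaches them by hand to a tagged sentence. It only illustrates the tagging scheme and the (word, POS tag, B-I-O tag) triples; it is not `tree2conlltags` itself, and `to_bio` and the span list are hypothetical.

```python
def to_bio(tagged_tokens, np_spans):
    """Attach B-NP / I-NP / O tags given (start, end) token spans of noun phrases."""
    bio = ['O'] * len(tagged_tokens)          # everything starts outside a chunk
    for start, end in np_spans:
        bio[start] = 'B-NP'                   # first word of the noun phrase
        for i in range(start + 1, end):
            bio[i] = 'I-NP'                   # remaining words of the noun phrase
    return [(word, pos, tag) for (word, pos), tag in zip(tagged_tokens, bio)]

tagged = [('The', 'DT'), ('big', 'JJ'), ('dog', 'NN'),
          ('ate', 'VBD'), ('pizza', 'NN'), ('.', '.')]

# Noun phrases cover tokens 0-2 ("The big dog") and token 4 ("pizza").
for triple in to_bio(tagged, [(0, 3), (4, 5)]):
    print(triple)
```

Note that the only chunk tags in this scheme are B-NP, I-NP, and O; a one-word noun phrase such as "pizza" is tagged B-NP.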

We then traverse the list of tagged and chunked words. Every time a part of a noun phrase is encountered, the word is concatenated to the running np_chunk string. Once the end of a noun phrase is encountered, the string is translated, added to the glossary, and replaces the English source text. The np_chunk string is then cleared and the process repeats until the end of the list is reached. 

In [None]:
        output = []   #tokens of the mixed-language sentence, built left to right
        for token, pos, bio in parsed_bio:
            if bio == 'I-NP':   #inside a noun phrase: keep collecting words
                np_chunk = np_chunk + ' ' + token
                continue
            if np_chunk:   #a B-NP or O tag ends the noun phrase collected so far
                trans_words = translator.translate(np_chunk, dest=destlang).text
                gloss_entry = (np_chunk, trans_words, destlang)
                gloss.append(gloss_entry)
                output.append(trans_words)   #translation replaces the English phrase
                np_chunk = ''
            if bio == 'B-NP':   #start a new noun phrase
                np_chunk = token
            else:   #tokens outside any noun phrase stay in English
                output.append(token)
        #the final token is always the period (tagged 'O'), so no chunk is left open here
        words = output
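
The find, translate, and replace step can also be tried offline. The following is a sketch under stated assumptions: `code_switch` is a hypothetical stand-alone function, the B-I-O triples are written by hand, and `FAKE_DICTIONARY` is a made-up lookup table standing in for the translation service.

```python
def code_switch(bio_tagged, translate):
    """Replace each noun phrase with translate(phrase); return (tokens, glossary)."""
    output, glossary, chunk = [], [], []

    def flush():
        # Translate the noun phrase collected so far, if there is one.
        if chunk:
            phrase = ' '.join(chunk)
            translated = translate(phrase)
            glossary.append((phrase, translated))
            output.append(translated)
            chunk.clear()

    for word, _pos, bio in bio_tagged:
        if bio == 'I-NP':
            chunk.append(word)       # still inside the current noun phrase
            continue
        flush()                      # a B-NP or O tag ends any open chunk
        if bio == 'B-NP':
            chunk.append(word)       # start a new noun phrase
        else:
            output.append(word)      # leave other words in English
    flush()                          # handle a noun phrase at the very end
    return output, glossary

# Hypothetical offline stand-in for the translation service.
FAKE_DICTIONARY = {'The big dog': 'El perro grande', 'the ball': 'la pelota'}

sentence = [('The', 'DT', 'B-NP'), ('big', 'JJ', 'I-NP'), ('dog', 'NN', 'I-NP'),
            ('chased', 'VBD', 'O'), ('the', 'DT', 'B-NP'), ('ball', 'NN', 'I-NP'),
            ('.', '.', 'O')]

tokens, glossary = code_switch(sentence, lambda p: FAKE_DICTIONARY.get(p, p))
print(tokens)     # prints ['El perro grande', 'chased', 'la pelota', '.']
print(glossary)
```

Building a new output list, rather than inserting into and deleting from the list being iterated, keeps the token positions from shifting partway through the loop.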

Once the chunks have been replaced by their translations, the list is put together as a space-separated string of words.

In [None]:
        if words[-1] == '.':
            del words[-1]
        s = ' '
        cs_sentence = s.join(words) 
        result = cs_sentence + '.'
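
Joining with plain spaces leaves a space before commas and other punctuation that the tokenizer split off. One possible refinement, sketched here with a hypothetical `join_tokens` helper and an assumed set of closing punctuation marks, is to attach those marks to the preceding token.

```python
# Assumed set of punctuation marks that should hug the preceding token.
CLOSING_PUNCTUATION = {'.', ',', '!', '?', ';', ':'}

def join_tokens(tokens):
    """Join tokens with spaces, attaching closing punctuation to the previous token."""
    text = ''
    for token in tokens:
        if text and token not in CLOSING_PUNCTUATION:
            text += ' '
        text += token
    return text

print(join_tokens(['El perro grande', ',', 'however', ',', 'chased', 'la pelota', '.']))
# prints El perro grande, however, chased la pelota.
```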

Finally, the user is given the option to hear the sentence read aloud (disclaimer: the text-to-speech feature only supports English phonetics, so it doesn't pronounce certain foreign words correctly). Otherwise, the user can either input another sentence or return to the main menu.

In [None]:
        while(1):
            listen = input(f'{result}\nEnter (@) to listen, (n) to enter a new sentence, (*) to go back to main menu: ')
            if (listen == '@'):
                engine.say(result)
                engine.runAndWait()
                break
            elif (listen == 'n'):
                break
            elif (listen == '*'):
                flag = 0
                break
            else: 
                print("\nNot a valid input. ")
    return

That's the bulk of what we have developed in SpInPy for this project. In the future, it would be interesting to add speech-to-text as well as language switching for the text-to-speech feature. We could also train our own POS tagger to make it more accurate (the built-in NLTK tagger sometimes mislabels words, which leads to incorrect chunks and translations).

We had a fun time learning Python these past few weeks! Thanks for teaching us!

#### sources
- https://www.nltk.org/
- https://pypi.org/project/googletrans/
- https://pypi.org/project/pyttsx3/
- https://pythonprogramming.net/chunking-nltk-tutorial/
- https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)