### Overview
This notebook reads pre-processed lists of the 500 most common bi-, tri-, and quadgrams from Swahili-language questions from Kenya and <br>translates them to English with the *deep-translator* package, which calls *GoogleTranslator*. <br>Findings: the translations were not very informative, so **no visualizations were produced**.

### Inputs (in working directory): *created by 'nlp_swa.ipynb' notebook*
* CSV file of bigrams: 'ken_swa_bigrams_top500.csv'
* CSV file of trigrams: 'ken_swa_trigrams_top500.csv'
* CSV file of quadgrams: 'ken_swa_quadgrams_top500.csv'

### Outputs (in working directory): 
* translated bigrams file:  'ken_500bigrams_swa2eng.txt'
* translated trigrams file:  'ken_500trigrams_swa2eng.txt'
* translated quadgrams file:  'ken_240quadgrams_swa2eng.txt'

### Steps:
1. Translate the 500 bigrams into English using Google Translate
2. Select and translate the top 300 trigrams into English using Google Translate (5,000-character limit)
3. Select and translate the top 240 quadgrams into English using Google Translate
4. Save translated n-grams as text files

In [4]:
#import packages:  pandas, numpy, but didn't need all of them...
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
from nltk.util import bigrams
from nltk.util import trigrams
from nltk.tokenize import word_tokenize
#from collections import Counter


import re

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\liulo\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\liulo\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [5]:
#Read the top-500 bigram, trigram, and quadgram CSV files
bigrams_df = pd.read_csv('ken_swa_bigrams_top500.csv')
trigrams_df = pd.read_csv('ken_swa_trigrams_top500.csv')
quadgrams_df = pd.read_csv('ken_swa_quadgrams_top500.csv')
bigrams_df.head()
trigrams_df.head()
quadgrams_df.head()

#drop the unnamed index column from all three files
#(drop() returns a new DataFrame, so the result must be reassigned)
trigrams_df = trigrams_df.drop(columns='Unnamed: 0')
bigrams_df = bigrams_df.drop(columns='Unnamed: 0')
quadgrams_df = quadgrams_df.drop(columns='Unnamed: 0')

#select top 300 trigrams by count
trigrams300=trigrams_df.nlargest(300, 'count')['trigram_str']
#select top 240 quadgrams by count
quadgrams240=quadgrams_df.nlargest(240, 'count')['quadgram_str']

In [6]:
quadgrams240[0:5]

0       s_fuatiwa_jibu_lako
1       jibu_s_fuatiwa_jibu
2       gani_jibu_s_fuatiwa
3    naweza_tumia_dawa_gani
4     naeza_tumia_dawa_gani
Name: quadgram_str, dtype: object

In [9]:
#check total character length is <5,000 for translation (first line: trigrams, second line: quadgrams)
print("# of characters: ",sum(len(item) for item in trigrams300))
print("# of characters: ",sum(len(item) for item in quadgrams240))

# of characters:  4387
# of characters:  4936
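
If the n-grams were ever sent to the translator as one joined string rather than one phrase at a time, the 5,000-character limit checked above would apply to that joined string. The following is a minimal sketch of splitting a list into batches that stay under a limit. `batch_phrases`, `max_chars`, and `sep` are hypothetical names introduced here for illustration; they are not part of deep-translator or of this notebook.

```python
def batch_phrases(phrases, max_chars=5000, sep="\n"):
    """Group phrases into batches whose joined length stays within max_chars.

    A single phrase longer than max_chars still gets a batch of its own.
    """
    batches, current, current_len = [], [], 0
    for phrase in phrases:
        # Characters this phrase adds, counting a separator if the batch is non-empty
        added = len(phrase) + (len(sep) if current else 0)
        if current and current_len + added > max_chars:
            # Adding the phrase would exceed the limit: close the current batch
            batches.append(current)
            current, current_len = [], 0
            added = len(phrase)
        current.append(phrase)
        current_len += added
    if current:
        batches.append(current)
    return batches


# Tiny limit so the split is visible
example = ["dawa_gani", "jibu_lako", "tumia_dawa_gani"]
print(batch_phrases(example, max_chars=20))
# -> [['dawa_gani', 'jibu_lako'], ['tumia_dawa_gani']]
```

With the 300 trigrams (4,387 characters before separators) this would usually yield one or two batches, depending on how much the separators add.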


In [11]:
# convert strings column to list
bigrams_list=bigrams_df['bigram_str'].tolist()
trigrams_list=trigrams300.tolist()
quadgrams_list=quadgrams240.tolist()

print(bigrams_list[0:5])
print(trigrams_list[0:5])
print(quadgrams_list[0:5])

['dawa_gani', 'jibu_lako', 'jibu_s', 'ku_mngu', 'fuatiwa_jibu']
['fuatiwa_jibu_lako', 's_fuatiwa_jibu', 'jibu_s_fuatiwa', 'tumia_dawa_gani', 'nitumie_dawa_gani']
['s_fuatiwa_jibu_lako', 'jibu_s_fuatiwa_jibu', 'gani_jibu_s_fuatiwa', 'naweza_tumia_dawa_gani', 'naeza_tumia_dawa_gani']


### Run translator function (ChatGPT prompt: 'write function with list of phrases as input to translator')

In [12]:
#install the deep-translator package (supports Swahili; 5,000 character limit)
!pip install deep-translator
from deep_translator import GoogleTranslator



In [13]:
#ChatGPT prompt: write function with list of phrases as input to translator
def translate_phrases(phrases, src="sw", dest="en"):
    """
    Translate a list of text strings (phrases) from Swahili → English
    using deep-translator's GoogleTranslator.

    Returns a new list of translated phrases.
    Includes error handling for each phrase.
    """
    # Create the translator once and reuse it for every phrase
    translator = GoogleTranslator(source=src, target=dest)
    translated = []

    for text in phrases:
        try:
            result = translator.translate(text)
            translated.append(result)
        except Exception as e:
            # Keep list positions aligned with the input on failure
            print(f"⚠️ Error translating '{text}': {e}")
            translated.append("[untranslated]")

    return translated
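
The n-grams in this notebook are joined with underscores, and most of the translations printed in the outputs come back with underscores as well. One variation that might be worth trying (an untested assumption, not something this notebook did) is replacing the underscores with spaces before translating, so the translator sees ordinary phrases. The sketch below takes the translation function as a parameter so it can be tried without a network call. `translate_ngrams` and `fake_translate` are hypothetical names used only for illustration.

```python
def translate_ngrams(ngrams, translate_fn):
    """Translate underscore-joined n-grams after turning underscores into spaces."""
    results = []
    for ngram in ngrams:
        phrase = ngram.replace("_", " ")  # 'dawa_gani' -> 'dawa gani'
        try:
            results.append(translate_fn(phrase))
        except Exception as e:
            # Keep list positions aligned with the input on failure
            print(f"Error translating '{phrase}': {e}")
            results.append("[untranslated]")
    return results


# Stand-in translator for illustration only. In the notebook, translate_fn
# could be GoogleTranslator(source="sw", target="en").translate
def fake_translate(text):
    return text.upper()


print(translate_ngrams(["dawa_gani", "jibu_lako"], fake_translate))
# -> ['DAWA GANI', 'JIBU LAKO']
```

Whether this yields more informative translations would need to be checked against the underscore-joined results.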

In [14]:
#translate bigrams
bigram_phrases = translate_phrases(bigrams_list)

In [15]:
print(bigram_phrases[0:10])

['what_drug', 'your_answer', 'answer_s', 'to_god', 'followed_response', 's_followed', 'cow_god', 'get_mpi', 'use_medicine', 'what_seed']


In [16]:
#write to bigram text file
df=pd.DataFrame(bigram_phrases)
df.to_csv('ken_500bigrams_swa2eng.txt', header=False) 

In [17]:
#translate trigrams
trigram_phrases = translate_phrases(trigrams_list)

In [18]:
print(trigram_phrases[0:10])

['followed_your_answer', 's_followed_answer', 'answer_s_followed', 'what_medicine_to_use', 'what medicine should i use', 'what_good_medicine', 'what_seed_india', 'can_get_me', 'what_good_medicine', 'what_drug_to_use']


In [19]:
#write to trigram text file
df=pd.DataFrame(trigram_phrases)
df.to_csv('ken_500trigrams_swa2eng.txt', header=False) 

In [9]:
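#translate quadgrams
#(run this after the cell that defines translate_phrases; the NameError below
# comes from running this cell in a session where the function was not yet defined)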
quadgram_phrases = translate_phrases(quadgrams_list)

NameError: name 'translate_phrases' is not defined

In [21]:
#write to quadgram text file (240 quadgrams, matching the file name listed under Outputs)
df=pd.DataFrame(quadgram_phrases)
df.to_csv('ken_240quadgrams_swa2eng.txt', header=False)
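
The files written above contain only the English translations with a row index. When judging translation quality it can help to have each Swahili n-gram next to its translation. The following is a minimal sketch using the standard `csv` module. `save_pairs`, the column names, and the example file name are hypothetical and not part of the notebook's outputs; the two example pairs are taken from the bigram output shown earlier.

```python
import csv
import os
import tempfile


def save_pairs(path, originals, translations):
    """Write one 'original, translation' row per n-gram, with a header row."""
    if len(originals) != len(translations):
        raise ValueError("originals and translations must have the same length")
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["swahili", "english"])
        writer.writerows(zip(originals, translations))


# Example: write two pairs to a temporary file and print the result
example_path = os.path.join(tempfile.gettempdir(), "example_pairs.csv")
save_pairs(example_path, ["dawa_gani", "jibu_lako"], ["what_drug", "your_answer"])
with open(example_path, encoding="utf-8") as f:
    print(f.read())
```

In the notebook this could be called as `save_pairs('some_file.csv', bigrams_list, bigram_phrases)`, since `translate_phrases` returns one entry per input phrase.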