# UNICEF: Finding Common Themes

This notebook converts the original dataset into clusters of translated questions with similar themes.

In [1]:
# First, the usual imports
import numpy as np
import pandas as pd
import random
from tqdm import tqdm
tqdm.pandas()
import re
from nltk.corpus import stopwords


In [3]:
#Now the aim is to translate all the questions and find the relevant ones
from googletrans import Translator
translator = Translator()

In [4]:
df = pd.read_csv('ureport_sample.csv')

In [5]:
non_eng_polls = df[df['org_language'] != 'en'][['poll_title', 'question_title', 
                                                'org_language', 'poll_category_name']]
#It would make sense to remove duplicates

In [8]:
all_poll_categs = non_eng_polls['poll_category_name'].unique()
print(len(all_poll_categs))
print(all_poll_categs)

299
['Comunicación' 'Inclusión' 'Protección' 'Participación Adolescente'
 'Educación' 'Salud' 'Fechas Importantes' 'Igualdad de Género'
 'Cambio Climático' 'Adolescencia' 'Participación' 'Niñez' 'embarazo'
 'General' 'Ureporteri' 'U-Report Brasil' 'Saúde' 'Educação'
 'ODS 2 - Fome Zero' '+Q' 'Objetivos do Desenvolvimento Sustentável' 'ODS'
 'Mete a colher' 'ODS Geral' 'HIV ' 'Política' 'Migrantes' 'Aprendiz'
 'Esportes' 'Violência Sexual' 'Evasão Escolar' 'Corpo e Gordofobia'
 'ODS 5 - Igualdade de Gênero' 'Corpo' 'Redução' 'Juv e Trabalho'
 'Direitos Reprodutivos e Prevenções' 'Proteção'
 "ODS 14 - Vida debaixo D'Água"
 'ODS 9 - Indústria, Inovação e Infraestrutura'
 'ODS 3 - Saúde e Bem-Estar' 'ODS 1 - Erradicação da Pobreza'
 'Abordagem Policial' 'Acesso à justiça' 'ARMAS' 'DROGAS' 'Rede LGBT'
 'Letalidade Violenta' 'Segurança' 'S4D' 'Opinions' 'Participation'
 'Education' 'Santé' 'Cybercrime' 'Nutrition' 'Hygiène'
 "Droits de l'enfant" 'Général' 'Eau, Assainissement et Hygiène'
 "P

Important Observations:
- Some of these themes are given in two languages separated by a slash.
- Some polls, especially in the Balkan region, simply use 'Polls' or 'U-Report' in the titles. We will have to dig deeper to find the questions within them.

There are only 299 unique categories. We will now create a function that stores a dictionary of these 299 mappings (and any changes resulting from the points above). This is more efficient than running Google Translate over thousands of rows, which would most likely push us past the daily API call limit.
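The dictionary idea above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual implementation: `build_translation_map` and `fake_translate` are hypothetical names, and the translator is passed in as a plain function so the sketch runs without network access.

```python
def build_translation_map(values, translate_fn):
    """Translate each distinct string once and return {original: translation}."""
    mapping = {}
    for value in values:
        # Skip missing values (e.g. NaN) and blank strings
        if not isinstance(value, str) or not value.strip():
            continue
        # Only call the translator the first time a value is seen
        if value not in mapping:
            mapping[value] = translate_fn(value.strip())
    return mapping


# Hypothetical stand-in for the real translator, so the sketch runs offline
def fake_translate(text):
    lookup = {"Educación": "Education", "Salud": "Health"}
    return lookup.get(text, text)


categories = ["Educación", "Salud", "Educación", "HIV ", float("nan")]
category_map = build_translation_map(categories, fake_translate)
print(category_map)
```

In the notebook, `values` could be `all_poll_categs` and `translate_fn` a thin wrapper around the translator; the resulting dictionary can then be applied to the full column with `non_eng_polls['poll_category_name'].map(category_map)`, so the API is called once per category and not once per row.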


## Preprocessing for Google Translate

In [10]:
#We are assuming that the language codes are the same in Google Translate and our data
non_eng_polls['org_language'].unique()

array(['es', 'bs', 'pt-br', 'bg', 'fr', 'id', 'ar', 'ro', 'pt', 'my', nan,
       'it', 'sr-rs@latin', 'uz', 'vi', 'uk', 'th'], dtype=object)

Cross-checking these codes against the Google Translate language list, we find a few mismatches:
https://cloud.google.com/translate/docs/languages


In [12]:
non_eng_polls.loc[non_eng_polls['org_language']=='my']

Unnamed: 0,poll_title,question_title,org_language,poll_category_name
14758,Physical Wellbeing during COVID-19 (English),Do you think you’re physically healthy during ...,my,Health
14759,Physical Wellbeing during COVID-19 (English),Do you think you’re physically healthy during ...,my,Health
14760,Physical Wellbeing during COVID-19 (English),Do you think your daily lifestyle is healthy?,my,Health
14761,Physical Wellbeing during COVID-19 (English),What do you usually do at home for your health?,my,Health
14762,Physical Wellbeing during COVID-19 (English),What do you usually do at home for your health?,my,Health
...,...,...,...,...
17455,Visit of the United Nations Secretary-General ...,သင့္အေနနဲ႕ ကုလသမဂၢအေထြေထြအတြင္းေရးမွဴးခ်ဳပ္မစၥ...,my,Youth
17456,Visit of the United Nations Secretary-General ...,သင့္အေနနဲ႕ ကုလသမဂၢအေထြေထြအတြင္းေရးမွဴးခ်ဳပ္မစၥ...,my,Youth
17457,Visit of the United Nations Secretary-General ...,ၿငိမ္းခ်မ္းေရးကိုေဖာ္ေဆာင္္ႏိုင္ဖို႕ သင့္အေနနဲ...,my,Youth
17458,Youth issues in Myanmar Week 2 - Employment,အျခား” လို႔သူငယ္ခ်င္းေရြးခ်ယ္လိုက္တာဆိုေတာ့့့ ...,my,Employment


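The code `my` is Burmese, the official language of Myanmar (Burma). Google's language list does include it, as "Myanmar (Burmese)", so the code itself is not a mismatch. Note, however, that several rows labelled `my` actually contain English questions (their poll titles are marked "(English)"), so the language label does not always match the question language.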

In [13]:
non_eng_polls.loc[non_eng_polls['org_language']=='sr-rs@latin']

Unnamed: 0,poll_title,question_title,org_language,poll_category_name
22126,Životna sredina,Da li si informisan/a o stanju životne sredine...,sr-rs@latin,Zaštita životne sredine
22127,Životna sredina,Da li si informisan/a o stanju životne sredine...,sr-rs@latin,Zaštita životne sredine
22128,Životna sredina,Da li si informisan/a o stanju životne sredine...,sr-rs@latin,Zaštita životne sredine
22129,Životna sredina,Da li si informisan/a o stanju životne sredine...,sr-rs@latin,Zaštita životne sredine
22130,Životna sredina,Da li si informisan/a o stanju životne sredine...,sr-rs@latin,Zaštita životne sredine
...,...,...,...,...
23221,Nasilje,Kom tipu nasilja nad decom bi najpre trebalo d...,sr-rs@latin,Prevencija nasilja
23222,Nasilje,Na koji način bi deca trebalo da se uključe ka...,sr-rs@latin,Prevencija nasilja
23223,Nasilje,Na koji način bi deca trebalo da se uključe ka...,sr-rs@latin,Prevencija nasilja
23224,Nasilje,Na koji način bi deca trebalo da se uključe ka...,sr-rs@latin,Prevencija nasilja


A quick check reveals that this is Serbian written in Latin script (Serbo-Croatian), which we will map to Google Translate's 'hr' (Croatian).
Brazilian Portuguese ('pt-br') can be replaced by Portuguese ('pt').
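The two replacements described here could be captured in a small lookup, sketched below. The names `LANG_CODE_FIXES` and `normalize_lang_code` are hypothetical, and returning `None` for a missing label is an assumption made so that a later step can fall back to automatic language detection.

```python
# Mapping from the dataset's codes to the codes chosen in the text above
LANG_CODE_FIXES = {
    "pt-br": "pt",        # Brazilian Portuguese -> Portuguese
    "sr-rs@latin": "hr",  # Serbian (Latin script) -> Croatian
}


def normalize_lang_code(code):
    """Return a Google Translate style code, or None if the label is missing."""
    # NaN labels arrive as floats, so anything that is not a string is missing
    if not isinstance(code, str):
        return None
    code = code.strip().lower()
    # Codes without an entry in the lookup are passed through unchanged
    return LANG_CODE_FIXES.get(code, code)


for raw in ["es", "pt-br", "sr-rs@latin", float("nan")]:
    print(repr(raw), "->", repr(normalize_lang_code(raw)))
```

Applied to the data, this would look something like `non_eng_polls['org_language'].apply(normalize_lang_code)`.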

In [14]:
non_eng_polls.loc[non_eng_polls['org_language']=='uz']

Unnamed: 0,poll_title,question_title,org_language,poll_category_name
37217,Karantinda bolalar vaqtini uyda qanday o’tkazy...,2)\tSiz bu bolaga kimsiz? / Кем Вы являетесь д...,uz,Ta'lim / Образование
37218,Karantinda bolalar vaqtini uyda qanday o’tkazy...,3)\tKarantin tufayli bog'chalar yopilganidan s...,uz,Ta'lim / Образование
37219,Karantinda bolalar vaqtini uyda qanday o’tkazy...,"6)\tBog’chaga borolmaslik, do’stlari, tarbiyac...",uz,Ta'lim / Образование
37220,Karantinda bolalar vaqtini uyda qanday o’tkazy...,"6)\tBog’chaga borolmaslik, do’stlari, tarbiyac...",uz,Ta'lim / Образование
37221,Karantinda bolalar vaqtini uyda qanday o’tkazy...,"6)\tBog’chaga borolmaslik, do’stlari, tarbiyac...",uz,Ta'lim / Образование
...,...,...,...,...
38166,Yoshlarning ijtimoiy faolligi / Социальная акт...,Tuman/mahallangizni o’zgartirish bo’yicha g’oy...,uz,Yoshlar / Молодежь
38167,Yoshlarning ijtimoiy faolligi / Социальная акт...,Tuman/mahallangizni o’zgartirish bo’yicha g’oy...,uz,Yoshlar / Молодежь
38168,Yoshlarning ijtimoiy faolligi / Социальная акт...,Tuman/mahallangizni o’zgartirish bo’yicha g’oy...,uz,Yoshlar / Молодежь
38169,Yoshlarning ijtimoiy faolligi / Социальная акт...,Tuman/mahallangizni rivojlantirishda ko’ngilli...,uz,Yoshlar / Молодежь


Fortunately, because of Uzbekistan's Soviet history, the Uzbek questions carry a Russian translation after a slash. This could come in handy if the lookup in Uzbek fails. For now, however, having two languages in one piece of text is a liability for translation, so we will remove the Russian that occurs after the slash.
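One way to drop the Russian part is sketched below. It assumes the separator is a slash surrounded by spaces (as in `Ta'lim / Образование`) and that the Russian text is the first segment containing Cyrillic characters; `strip_russian_suffix` is a hypothetical helper name. Slashes without surrounding spaces, as in `Tuman/mahallangizni`, are left alone.

```python
import re

# Basic Cyrillic Unicode block; enough to recognise Russian text
CYRILLIC = re.compile(r"[\u0400-\u04FF]")


def strip_russian_suffix(text):
    """Keep the text before the first ' / ' segment that contains Cyrillic."""
    if not isinstance(text, str):
        return text  # leave NaN and other non-strings untouched
    parts = text.split(" / ")
    kept = []
    for part in parts:
        if CYRILLIC.search(part):
            break
        kept.append(part)
    # Nothing removed, or nothing left to keep: return the input unchanged
    if not kept or len(kept) == len(parts):
        return text
    return " / ".join(kept).strip()


print(strip_russian_suffix("Ta'lim / Образование"))
print(strip_russian_suffix("Tuman/mahallangizni o'zgartirish"))
```

Keeping the original column alongside the cleaned one would preserve the Russian text in case it is needed later as a fallback.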

## Translation Function

In [None]:
def translate_to_eng(txt, src_lang):
    """
    Takes in text in a non-English language and its source language code.
    Returns the English translation.
    """

    try:
        result = translator.translate(txt,
                     src=src_lang, dest="en")
    # In case the organization's language label is missing, unsupported,
    # or doesn't match the question language, let Google detect the source
    except Exception:
        result = translator.translate(txt, dest="en")

    return result.text

In [None]:
# Testing the function on one of the Spanish category names
translate_to_eng('Educación', 'es')