# UNICEF- Finding Common Themes

This notebook serves to convert the original dataset into clusters of translated questions with similar themes

In [4]:
!pip install googletrans==4.0.0-rc1



In [36]:
pd.options.mode.chained_assignment = None

In [5]:
#First,the usual imports
import numpy as np
import pandas as pd
import random
from tqdm import tqdm
tqdm.pandas()
import re
from nltk.corpus import stopwords

  from pandas import Panel


In [28]:
#Now the aim is to translate all the questions and find the relevant ones
from googletrans import Translator
translator = Translator()

In [7]:
df = pd.read_csv('ureport_sample.csv')

In [8]:
non_eng_polls = df[df['org_language'] != 'en'][['poll_title', 'question_title', 
                                                'org_language', 'poll_category_name']]
#It would make sense to remove duplicates

In [9]:
all_poll_categs = non_eng_polls['poll_category_name'].unique()
print(len(all_poll_categs))
print(all_poll_categs)

299
['Comunicación' 'Inclusión' 'Protección' 'Participación Adolescente'
 'Educación' 'Salud' 'Fechas Importantes' 'Igualdad de Género'
 'Cambio Climático' 'Adolescencia' 'Participación' 'Niñez' 'embarazo'
 'General' 'Ureporteri' 'U-Report Brasil' 'Saúde' 'Educação'
 'ODS 2 - Fome Zero' '+Q' 'Objetivos do Desenvolvimento Sustentável' 'ODS'
 'Mete a colher' 'ODS Geral' 'HIV ' 'Política' 'Migrantes' 'Aprendiz'
 'Esportes' 'Violência Sexual' 'Evasão Escolar' 'Corpo e Gordofobia'
 'ODS 5 - Igualdade de Gênero' 'Corpo' 'Redução' 'Juv e Trabalho'
 'Direitos Reprodutivos e Prevenções' 'Proteção'
 "ODS 14 - Vida debaixo D'Água"
 'ODS 9 - Indústria, Inovação e Infraestrutura'
 'ODS 3 - Saúde e Bem-Estar' 'ODS 1 - Erradicação da Pobreza'
 'Abordagem Policial' 'Acesso à justiça' 'ARMAS' 'DROGAS' 'Rede LGBT'
 'Letalidade Violenta' 'Segurança' 'S4D' 'Opinions' 'Participation'
 'Education' 'Santé' 'Cybercrime' 'Nutrition' 'Hygiène'
 "Droits de l'enfant" 'Général' 'Eau, Assainissement et Hygiène'
 "P

Important Observations:
- Some of these themes are in two languages separated by a slash. 
- Some polls- especially in the Balkan region- simply use 'Polls' or 'U-Report' in the titles. We will have to dig deeper to find the questions within them

There are only 299 categories. We will now create a function to store a dictionary of these 299 mappings (and any changes resulting from the points above). This would prove more efficient than running Google Translate over thousands of lines of code. It would most likely push us past the daily API call limit. 


## Preprocessing for Google Translate

In [10]:
#We are assuming that the language codes are the same in Google Translate and our data
non_eng_polls['org_language'].unique()

array(['es', 'bs', 'pt-br', 'bg', 'fr', 'id', 'ar', 'ro', 'pt', 'my', nan,
       'it', 'sr-rs@latin', 'uz', 'vi', 'uk', 'th'], dtype=object)

Cross-checking against the Google Translate list, we find a few mismatches
https://cloud.google.com/translate/docs/languages


In [11]:
non_eng_polls.loc[non_eng_polls['org_language']=='sr-rs@latin']

Unnamed: 0,poll_title,question_title,org_language,poll_category_name
22126,Životna sredina,Da li si informisan/a o stanju životne sredine...,sr-rs@latin,Zaštita životne sredine
22127,Životna sredina,Da li si informisan/a o stanju životne sredine...,sr-rs@latin,Zaštita životne sredine
22128,Životna sredina,Da li si informisan/a o stanju životne sredine...,sr-rs@latin,Zaštita životne sredine
22129,Životna sredina,Da li si informisan/a o stanju životne sredine...,sr-rs@latin,Zaštita životne sredine
22130,Životna sredina,Da li si informisan/a o stanju životne sredine...,sr-rs@latin,Zaštita životne sredine
...,...,...,...,...
23221,Nasilje,Kom tipu nasilja nad decom bi najpre trebalo d...,sr-rs@latin,Prevencija nasilja
23222,Nasilje,Na koji način bi deca trebalo da se uključe ka...,sr-rs@latin,Prevencija nasilja
23223,Nasilje,Na koji način bi deca trebalo da se uključe ka...,sr-rs@latin,Prevencija nasilja
23224,Nasilje,Na koji način bi deca trebalo da se uključe ka...,sr-rs@latin,Prevencija nasilja


A quick check reveals that this is Serbian/Croatian, which is represented in Google translate as 'hr'. 
Brazilian Portuguese can be replaced by Portuguese.

In [12]:
#Completing the first two replacements
non_eng_polls['org_language'].replace('pt-br', 'pt', inplace=True)
non_eng_polls['org_language'].replace('sr-rs@latin', 'hr', inplace=True)

In [13]:
#Exploring the Uzbek case
non_eng_polls.loc[non_eng_polls['org_language']=='uz']

Unnamed: 0,poll_title,question_title,org_language,poll_category_name
37217,Karantinda bolalar vaqtini uyda qanday o’tkazy...,2)\tSiz bu bolaga kimsiz? / Кем Вы являетесь д...,uz,Ta'lim / Образование
37218,Karantinda bolalar vaqtini uyda qanday o’tkazy...,3)\tKarantin tufayli bog'chalar yopilganidan s...,uz,Ta'lim / Образование
37219,Karantinda bolalar vaqtini uyda qanday o’tkazy...,"6)\tBog’chaga borolmaslik, do’stlari, tarbiyac...",uz,Ta'lim / Образование
37220,Karantinda bolalar vaqtini uyda qanday o’tkazy...,"6)\tBog’chaga borolmaslik, do’stlari, tarbiyac...",uz,Ta'lim / Образование
37221,Karantinda bolalar vaqtini uyda qanday o’tkazy...,"6)\tBog’chaga borolmaslik, do’stlari, tarbiyac...",uz,Ta'lim / Образование
...,...,...,...,...
38166,Yoshlarning ijtimoiy faolligi / Социальная акт...,Tuman/mahallangizni o’zgartirish bo’yicha g’oy...,uz,Yoshlar / Молодежь
38167,Yoshlarning ijtimoiy faolligi / Социальная акт...,Tuman/mahallangizni o’zgartirish bo’yicha g’oy...,uz,Yoshlar / Молодежь
38168,Yoshlarning ijtimoiy faolligi / Социальная акт...,Tuman/mahallangizni o’zgartirish bo’yicha g’oy...,uz,Yoshlar / Молодежь
38169,Yoshlarning ijtimoiy faolligi / Социальная акт...,Tuman/mahallangizni rivojlantirishda ko’ngilli...,uz,Yoshlar / Молодежь


Fortunately, because of its Soviet history, the Uzbek questions carries the Russian translation after a slash. This could come in handy in case the lookup in Uzbek fails. Currently however, it may be a liability by including two languages in one piece of text. So we will remove the Russian that occurs after the slash.  

In [14]:
def keep_preslash(text):
    return text.split('/')[0]

In [15]:
print(keep_preslash('Yoshlar / Молодежь'))

Yoshlar 


In [37]:
non_eng_polls.loc['poll_category_name'] = non_eng_polls['poll_category_name'].apply(lambda x: keep_preslash(x.poll_category_name) if x.org_language=='uz' else x.poll_category_name,
                                                                                    axis=1)

TypeError: <lambda>() got an unexpected keyword argument 'axis'

In [17]:
non_eng_polls.loc[non_eng_polls['org_language']=='uz']['poll_category_name']

37217    Ta'lim / Образование
37218    Ta'lim / Образование
37219    Ta'lim / Образование
37220    Ta'lim / Образование
37221    Ta'lim / Образование
                 ...         
38166      Yoshlar / Молодежь
38167      Yoshlar / Молодежь
38168      Yoshlar / Молодежь
38169      Yoshlar / Молодежь
38170      Yoshlar / Молодежь
Name: poll_category_name, Length: 954, dtype: object

## Translation Function

In [18]:
import time

In [19]:
def translate_to_eng(txt, src_lang):
    """
    takes in text in a non-english language
    returns the english translation
    """
    
    try:
        print('it worked')
        result = translator.translate(txt, 
                     src=src_lang, dest="en")
    #In case the organization's language label doesn't match the question language
    except:
        result = translator.translate(txt, dest="en")
        
    return result.text

In [29]:
def translate_unknown_to_eng(txt):
    """
    takes in text in a non-english language (not specified by user)
    returns the english translation
    """
    
    try:
        result = translator.translate(txt,
                                    dest="en")
        return result.text
    #In case the organization's language label doesn't match the question language
    except:
        return txt

In [21]:
#Testing the function
print(translate_unknown_to_eng('Prevencija nasilja'))

Prevention of violence


In [22]:
print(translate_to_eng('Prevencija nasilja', 'hr'))

it worked
Prevention of violence


In [23]:
time.sleep(10)
print(translate_to_eng('Bonjour', 'fr'))

it worked
Hello


In [24]:
unique_non_eng_polls = non_eng_polls.drop_duplicates(subset=['poll_category_name'])

In [25]:
unique_non_eng_polls

Unnamed: 0,poll_title,question_title,org_language,poll_category_name
0,¿Cuáles son las redes sociales preferidas por ...,¿Cuál es la red social que más usás?,es,Comunicación
6,¿Cómo afecta el Aislamiento Preventivo Social ...,Mencione los tres cambios más importantes en l...,es,Inclusión
57,¿Cómo es la situación en el hogar de los adole...,"Durante la cuarentena, ¿quién hace la mayor pa...",es,Protección
67,Covid-19: ¿Qué acciones pueden tomar los adole...,¿Qué tipo de acciones querés conocer?,es,Participación Adolescente
96,Educación a Distancia en contexto de Covid-19,Para quienes se encuentran estudiando actualme...,es,Educación
...,...,...,...,...
42397,เราได้จับมือกับสำนักงานกองทุนสนับสนุนการสร้างเ...,ปกติน้องๆ เล่นเกมออนไลน์บ่อยแค่ไหนครับ,th,Other images
42548,ทุกปียูนิเซฟจะจัดทำรายงานสภาวะเด็กโลก ซึ่งสำรว...,น้องๆ คิดว่าปกติตัวเองกินอาหารอย่างระวังสุขภาพ...,th,สุขภาพ
42572,เราได้ร่วมมือกับกลุ่มการศึกษาเพื่อความเป็นไทอี...,สิทธิขั้นพื้นฐานในสถานศึกษาด้านไหนที่น้องอยากไ...,th,สิทธิเด็ก
43915,ภัยพิบัติต่างๆ สร้างความสูญเสียแก่ชุมชนและครอบ...,น้องๆ คิดว่าตนเองมีความรู้เรื่องการเตรียมความพ...,th,สถานการณ์ฉุกเฉิน


In [30]:
unique_non_eng_polls['poll_category_eng'] = unique_non_eng_polls['poll_category_name'].progress_apply(translate_unknown_to_eng)


  0%|                                                                                          | 0/299 [00:00<?, ?it/s][A
  1%|▌                                                                                 | 2/299 [00:00<00:33,  8.94it/s][A
  1%|▊                                                                                 | 3/299 [00:00<00:40,  7.39it/s][A
  1%|█                                                                                 | 4/299 [00:00<00:50,  5.84it/s][A
  2%|█▎                                                                                | 5/299 [00:00<00:52,  5.62it/s][A
  2%|█▋                                                                                | 6/299 [00:01<00:54,  5.35it/s][A
  2%|█▉                                                                                | 7/299 [00:01<01:00,  4.84it/s][A
  3%|██▏                                                                               | 8/299 [00:01<00:59,  4.87it/s][A
  3%|██▍       

 22%|██████████████████▏                                                              | 67/299 [00:13<00:42,  5.41it/s][A
 23%|██████████████████▍                                                              | 68/299 [00:13<00:48,  4.80it/s][A
 23%|██████████████████▋                                                              | 69/299 [00:14<00:46,  4.96it/s][A
 23%|██████████████████▉                                                              | 70/299 [00:14<00:45,  5.08it/s][A
 24%|███████████████████▏                                                             | 71/299 [00:14<00:43,  5.27it/s][A
 24%|███████████████████▌                                                             | 72/299 [00:14<00:43,  5.28it/s][A
 24%|███████████████████▊                                                             | 73/299 [00:14<00:40,  5.58it/s][A
 25%|████████████████████                                                             | 74/299 [00:15<00:43,  5.14it/s][A
 25%|███████████

 44%|███████████████████████████████████▌                                            | 133/299 [00:26<00:33,  4.96it/s][A
 45%|███████████████████████████████████▊                                            | 134/299 [00:27<00:32,  5.15it/s][A
 45%|████████████████████████████████████                                            | 135/299 [00:27<00:30,  5.43it/s][A
 45%|████████████████████████████████████▍                                           | 136/299 [00:27<00:28,  5.64it/s][A
 46%|████████████████████████████████████▋                                           | 137/299 [00:27<00:27,  5.90it/s][A
 46%|████████████████████████████████████▉                                           | 138/299 [00:27<00:27,  5.85it/s][A
 46%|█████████████████████████████████████▏                                          | 139/299 [00:27<00:26,  5.98it/s][A
 47%|█████████████████████████████████████▍                                          | 140/299 [00:28<00:29,  5.31it/s][A
 47%|███████████

 67%|█████████████████████████████████████████████████████▏                          | 199/299 [00:40<00:19,  5.09it/s][A
 67%|█████████████████████████████████████████████████████▌                          | 200/299 [00:40<00:18,  5.36it/s][A
 67%|█████████████████████████████████████████████████████▊                          | 201/299 [00:40<00:18,  5.39it/s][A
 68%|██████████████████████████████████████████████████████                          | 202/299 [00:40<00:20,  4.84it/s][A
 68%|██████████████████████████████████████████████████████▎                         | 203/299 [00:40<00:21,  4.43it/s][A
 68%|██████████████████████████████████████████████████████▌                         | 204/299 [00:41<00:20,  4.73it/s][A
 69%|██████████████████████████████████████████████████████▊                         | 205/299 [00:41<00:18,  4.95it/s][A
 69%|███████████████████████████████████████████████████████                         | 206/299 [00:41<00:17,  5.17it/s][A
 69%|███████████

 89%|██████████████████████████████████████████████████████████████████████▉         | 265/299 [00:54<00:07,  4.33it/s][A
 89%|███████████████████████████████████████████████████████████████████████▏        | 266/299 [00:54<00:07,  4.16it/s][A
 89%|███████████████████████████████████████████████████████████████████████▍        | 267/299 [00:55<00:10,  3.14it/s][A
 90%|███████████████████████████████████████████████████████████████████████▋        | 268/299 [00:55<00:08,  3.50it/s][A
 90%|███████████████████████████████████████████████████████████████████████▉        | 269/299 [00:55<00:08,  3.71it/s][A
 90%|████████████████████████████████████████████████████████████████████████▏       | 270/299 [00:56<00:07,  4.00it/s][A
 91%|████████████████████████████████████████████████████████████████████████▌       | 271/299 [00:56<00:06,  4.31it/s][A
 91%|████████████████████████████████████████████████████████████████████████▊       | 272/299 [00:56<00:05,  4.53it/s][A
 91%|███████████

In [33]:
unique_non_eng_polls.sort_values(by='poll_category_eng')

Unnamed: 0,poll_title,question_title,org_language,poll_category_name,poll_category_eng
39606,Домашнє насильство,Чи стикалися Ви з домашнім насильством щодо Ва...,uk,#EndViolence - Протидія насильству,#EndViolence - Countering violence
1393,Voluntariado UNICEF,Você está interessado em trabalhar como volunt...,pt,+Q,+Q
8480,Poll: Apakah citra diri memengaruhi perilakumu?,Seberapa puas atau tidak puaskah kamu terhadap...,id,ADAP,ADAP
1906,Acesso à justiça: para todos ou para alguns?,"Se a escola violar seus direitos, você acessar...",pt,Acesso à justiça,Access to justice
19706,Conoscenza e servizi sulla violenza sessuale,"Sono...Maschio, Femmina o Altro",it,Accesso ai servizi,Access to services
...,...,...,...,...,...
5338,Día Mundial contra el Mosquito,1. ¿A cuántas personas mata el mosquito/zancud...,es,Zika,Zika
42548,ทุกปียูนิเซฟจะจัดทำรายงานสภาวะเด็กโลก ซึ่งสำรว...,น้องๆ คิดว่าปกติตัวเองกินอาหารอย่างระวังสุขภาพ...,th,สุขภาพ,health
20511,Sondaj despre Obiectivele de Dezvoltare Durabilă,Ai auzit până acum despre Obiectivele de Dezvo...,ro,Sondaje,polls
1042,Un 14% de los nacimientos en Bolivia es de mad...,¿Conoces casos de adolescentes embarazadas?,es,embarazo,pregnancy


This sorting behaviour seems odd. Not clear why health, polls, and sports are at the bottom, after Z (Zika)

In [35]:
sorted(unique_non_eng_polls['poll_category_eng'].unique())

['#EndViolence - Countering violence',
 '+Q',
 'ADAP',
 'Access to justice',
 'Access to services',
 'Adolescence',
 'Adolescent Health',
 'Adolescent Participation',
 'Adolescent participation',
 'Advocacy 4',
 'Advocacy and Participation',
 'Apprentice',
 'Art',
 'Aspirations',
 'Average',
 'Black Gold',
 'Body',
 'Body and Gordophobia',
 'CHILD FRIENDLY CITIES',
 'CHILD PROTECTION',
 'CORONAVIRUS 2019 - (COVID-19)',
 'COVID19',
 'Career',
 'Cheers',
 'Child Protection',
 'Child Protection - Identity Card',
 'Child Rights',
 'Child marriage',
 'Child protection',
 'Childhood',
 'Childhood and adolescence',
 'Children on the Move',
 "Children's rights",
 "Children's rights & monitoring",
 'Citizenship',
 'Climate Change',
 'Climate Change 2',
 'Climate and Environment',
 'Climate change',
 'Commitment / solidarity',
 'Common',
 'Communication',
 'Communication Externe',
 'Consumption',
 'Contingency',
 'Corona Virus Response',
 'Culture',
 'Cyber Bullying',
 'Cybercrime',
 'Cyclone ID

From this initial look, it seems like we have a few themes that are showing up across countries.
- We might need to remove the upper case letters in some cses
- The Uzbek terms, as expected, had two languages in the original title. So only the Russian has been translated
- Some polls are simply titled 'Polls'. We will need a separate sub-category for them

