# UNICEF Youth Opinions Visualization
## Using PySpark and Leaflet.js to Map Global Diversity of Mindsets Among Millennials


This notebook consists of two sections. The first experiments with a small random sample (1%) of the full dataset to understand the approaches needed to build the visualizations. The second then conducts similar analyses on the full dataset (>1 GB) in PySpark. Finally, results are visualized using Plotly and Leaflet.js.

"All poll titles, question titles, order of questions are metadata generated by U-Report programme managers – not data coming directly from the system."

"Org_language refers to the language on the website, not necessarily the language of the polls."
We should therefore expect cases where the language of a poll does not match the stated one. The initial assumption, however, is that the two match.




### Identifying Notification Polls (Needs Future Work)
"Poll_url is unique to every poll, total of 4,399 to date. A subset of these “polls” are notifications and only have a single “question” for which there are no set response"

We would benefit from removing these notifications. 
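
One possible way to flag notifications is sketched below on a toy frame. The rule used here (a `poll_url` with exactly one `question_id` and no rows of `data_type` "response") is only a guess based on the description quoted above; it has not been checked against the real data, and the toy values are made up.

```python
import pandas as pd

# Toy stand-in for the poll data; the values are invented for illustration.
toy = pd.DataFrame({
    "poll_url": ["p/1", "p/1", "p/1", "p/2", "p/3", "p/3"],
    "question_id": [10, 10, 11, 20, 30, 30],
    "data_type": ["response", "set_unset", "response",
                  "set_unset", "response", "set_unset"],
})

# Per poll: number of distinct questions and number of response rows.
per_poll = toy.groupby("poll_url").agg(
    n_questions=("question_id", "nunique"),
    n_response=("data_type", lambda s: (s == "response").sum()),
)

# Assumed rule: a single question and no response breakdown at all.
is_notification = (per_poll["n_questions"] == 1) & (per_poll["n_response"] == 0)
notification_urls = per_poll.index[is_notification].tolist()

# Keep only rows that belong to real polls.
polls_only = toy[~toy["poll_url"].isin(notification_urls)]
print(notification_urls)
```

A rule like this would need to run on the full dataset rather than the 1% sample, because sampling can drop the response rows of a genuine poll and make it look like a notification.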

In [16]:
df[df['question_results_open_ended']]['data_segment_category'].unique()

array(['total', 'location', 'age'], dtype=object)

In [17]:
df[['data_type', 'question_title', 'data_category_label']]

Unnamed: 0,data_type,question_title,data_category_label
0,response,¿Cuál es la red social que más usás?,Twitter
1,response,¿Cuál es la red social que más usás?,Twitter
2,set_unset,¿Cuánto tiempo por día pasas en la red social ...,set
3,response,¿Cuánto tiempo por día pasas entre todas las r...,Entre 1 y 2 horas por día
4,response,¿Cuánto tiempo por día pasas entre todas las r...,Más de 4 horas por día
...,...,...,...
44674,response,ทราบหรือไม่ว่าในอนุสัญญาว่าด้วยสิทธิเด็ก มีสาร...,ไม่รู้
44675,set_unset,ทราบหรือไม่ว่าในอนุสัญญาว่าด้วยสิทธิเด็ก มีสาร...,set
44676,set_unset,จากสิทธิพื้นฐานทั้ง 4 ด้าน คุณคิดว่าด้านใดสำคั...,unset
44677,set_unset,จากสิทธิพื้นฐานทั้ง 4 ด้าน คุณคิดว่าด้านใดสำคั...,unset


data_type
set_unset = whether respondents gave a suitable response for this question
response = breakdown of valid response options

data_segment_category

    total = all respondents, whether they have submitted age/gender/location or not
    age = for respondents that submitted age, in age-bands
    gender = for respondents that submitted gender
    location = most granular category for location, on most platforms this is region or district

data_segment_label = disaggregated data labels for each category of data_segment_category. 

For example, this would be '15-19' for age, 'Female' for gender, or 'Córdoba' for location.

data_category_label = label of either set/unset (for that data type) or label for the response

- This creates a problem: set/unset markers and actual response labels share one column, so the two data types probably need to be separated before answers are counted.

data_category_count = number of respondents falling into the data_category_label

Thoughts on exploring the data:
Is it possible to get a global aggregation of responses to the same set of questions? Do the same questions have the same ID? Is there some way of aggregating across them?

In [1]:
# First, the usual imports
import random
import re

import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from tqdm import tqdm

tqdm.pandas()  # enables DataFrame.progress_apply

filename = '../all_poll_data.2020.05.29/all_poll_data.2020.05.29.csv'
unicef_seed = 2020  # seed for reproducible sampling



In [None]:
from time import time

In [5]:

# Approach 1: Sampling
# https://stackoverflow.com/questions/22258491/read-a-small-random-sample-from-a-big-csv-file-into-a-python-data-frame#22259008
random.seed(a=unicef_seed)

p = 0.01  # 1% of the lines
# keep the header, then take only 1% of lines
# if random from [0,1] interval is greater than 0.01 the row will be skipped
df = pd.read_csv(
         filename,
         header=0, 
         skiprows=lambda i: i>0 and random.random() > p)

In [6]:
df.to_csv('ureport_sample.csv')

In [2]:
df = pd.read_csv('ureport_sample.csv')

In [3]:
print(df.shape)

(44679, 27)


In [18]:
df[['question_id', 'question_title', 'question_results_open_ended',
       'question_order', 'data_type', 'data_segment_category',
       'data_segment_label', 'data_category_label']].head()

Unnamed: 0,question_id,question_title,question_results_open_ended,question_order,data_type,data_segment_category,data_segment_label,data_category_label
0,12689,¿Cuál es la red social que más usás?,False,1,response,location,Entre Ríos,Twitter
1,12689,¿Cuál es la red social que más usás?,False,1,response,location,Córdoba,Twitter
2,12690,¿Cuánto tiempo por día pasas en la red social ...,False,2,set_unset,location,Mendoza,set
3,12691,¿Cuánto tiempo por día pasas entre todas las r...,False,3,response,location,Santa Fe,Entre 1 y 2 horas por día
4,12691,¿Cuánto tiempo por día pasas entre todas las r...,False,3,response,location,Formosa,Más de 4 horas por día


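Since this is a 1% sample with 44,679 rows, the full dataset should have roughly 4.47 million rows (44,679 / 0.01 ≈ 4,467,900). Note that these are aggregated rows, each carrying a respondent count, not individual responses. Right now, our aim is to find strategies that work on the subsample. Then we can find Big Data methods for parallelization.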

In [8]:
# Explore Dataset
df.columns

Index(['org_name', 'org_language', 'org_id', 'org_host', 'org_subdomain',
       'org_domain', 'poll_id', 'poll_flow_uuid', 'poll_title', 'poll_org',
       'poll_created_on', 'poll_date', 'poll_category_image_url',
       'poll_category_name', 'poll_url', 'question_ruleset_uuid',
       'question_title', 'question_id', 'question_results_open_ended',
       'question_order', 'data_type', 'data_segment_category',
       'data_segment_label', 'data_category_label', 'data_category_count',
       'data_order'],
      dtype='object')

In [8]:
df['data_order']

0        6
1        6
2        0
3        2
4        4
5        0
6        2
7        0
8        2
9        0
10       5
11       5
12       4
13       2
14       6
15       0
16       3
17       0
18       3
19       2
20       2
21       0
22       0
23       4
24       3
25       0
26       4
27       1
28       2
29       0
        ..
44649    6
44650    0
44651    4
44652    7
44653    0
44654    0
44655    1
44656    1
44657    0
44658    0
44659    1
44660    2
44661    0
44662    0
44663    0
44664    2
44665    2
44666    1
44667    0
44668    0
44669    2
44670    2
44671    0
44672    0
44673    0
44674    2
44675    0
44676    0
44677    0
44678    0
Name: data_order, Length: 44679, dtype: int64

In [5]:
df['data_segment_category'].value_counts()

location    32380
age          5914
total        4363
gender       2022
Name: data_segment_category, dtype: int64

In [6]:
df['data_segment_label'].value_counts()

total                   4363
15-19                   1011
20-24                   1001
0-14                     991
31-34                    982
35+                      967
25-30                    962
Female                   425
Male                     420
Homme                    168
Femme                    137
Ayeyarwady               120
Yangon                   114
Hombre                   112
အမျိုးသား                110
Kayah                    106
Chin                     104
Shan                     102
Tanintharyi              101
Rakhine                   98
Magway                    97
အမျိုးသမီး                97
Sagaing                   96
Mon                       95
Mujer                     95
Kachin                    92
Bago                      90
Чоловіки                  89
Mandalay                  86
Kayin                     85
                        ... 
Област Шумен               1
Ariana - أريانة            1
الأنبار                    1
Rumonge       

It seems like most of the variety in this variable comes from the location tag. There are also some variants with the meaning of 'male' and 'female' in different languages. Those could easily be standardized into one. 

Location will be challenging to use. We could later use a geospatial library like geopy to geocode the labels so that they can be mapped.

In [7]:
df.groupby(['data_segment_category'])['data_segment_label'].value_counts()

data_segment_category  data_segment_label  
age                    15-19                   1011
                       20-24                   1001
                       0-14                     991
                       31-34                    982
                       35+                      967
                       25-30                    962
gender                 Female                   425
                       Male                     420
                       Homme                    168
                       Femme                    137
                       Hombre                   112
                       အမျိုးသား                110
                       အမျိုးသမီး                97
                       Mujer                     95
                       Чоловіки                  89
                       Жінки                     75
                       Perempuan                 40
                       Laki-Laki                 35
                    

This exploration would be most valuable if we had some level of standardization. The age bands stay the same across countries, regardless of language. The gender labels, however, will need to be mapped to a standardized set.
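
A minimal sketch of how the gender labels could be standardized is given below. The mapping is partial and hand-built from labels visible in the output above; `GENDER_MAP` and `standardize_gender` are illustrative names, and labels that are not listed (for example those in other scripts) fall into an "unmapped" bucket to be handled later.

```python
from collections import Counter

# Partial mapping built from labels seen in the sample output.
GENDER_MAP = {
    "Female": "female", "Male": "male",
    "Femme": "female", "Homme": "male",          # French
    "Mujer": "female", "Hombre": "male",         # Spanish
    "Жінки": "female", "Чоловіки": "male",       # Ukrainian
    "Perempuan": "female", "Laki-Laki": "male",  # Indonesian
}

def standardize_gender(label):
    """Return 'female' or 'male', or 'unmapped' for labels not in GENDER_MAP."""
    return GENDER_MAP.get(label.strip(), "unmapped")

# Row counts copied from the value_counts() output (mapped labels only).
observed = {"Female": 425, "Male": 420, "Homme": 168, "Femme": 137,
            "Hombre": 112, "Mujer": 95, "Чоловіки": 89, "Жінки": 75,
            "Perempuan": 40, "Laki-Laki": 35}

# Aggregate the row counts under the standardized labels.
totals = Counter()
for label, count in observed.items():
    totals[standardize_gender(label)] += count
print(dict(totals))
```

On the DataFrame itself, the same dictionary could be applied with `Series.map` to the rows where `data_segment_category` is 'gender'.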

In [4]:
df['data_category_label'].value_counts()

unset                                   7213
set                                     7084
No                                      1878
Yes                                     1468
yes                                      292
no                                       276
unknown                                  246
Non                                      200
Da                                       199
Oui                                      188
Si                                       176
C                                        139
Nu                                       138
Others                                   137
Not sure                                 129
B                                        126
A                                        124
D                                        106
Other specify                            103
Так                                      102
Ні                                        99
Sí                                        96
Don't know

These are the sets of answers to poll questions. We can know whether or not they are open-ended through the 'question_results_open_ended' column.

There is no index or identifier for individual users. That becomes more problematic later, when we try to combine the columns for age, location and gender. We have multiple rows now for what could well have been the same individual. 

We also note that essentially, our dataset is a combination of three sets of information (three tables in database):
- Organization
- Poll
- Question

The complication emerges at the level of questions. We need to be sure that we can find the same questions being asked across the world so as to discern geographic patterns in youth responses.

In [3]:
df[['org_id', 'poll_org']]

Unnamed: 0,org_id,poll_org
0,8,8
1,8,8
2,8,8
3,8,8
4,8,8
...,...,...
44674,5,5
44675,5,5
44676,5,5
44677,5,5


In [7]:
df['org_id'].equals(df['poll_org'])

True

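The check above shows that 'org_id' and 'poll_org' are identical row by row in this sample. That is consistent with 'org_id' being the primary key of an 'org' table and 'poll_org' being the foreign key that references it from the Poll table, although equality alone does not prove the key relationship.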

In [9]:
for col in df.columns:
    print(df[col].unique())

['Argentina' 'Bangladesh' 'Belize' 'Bolivia' 'Bosnia and Herzegovina'
 'Botswana' 'Brasil' 'Bulgaria' 'Burkina Faso' 'Burundi' 'Cameroon'
 'Chile' 'Congo Brazzavile' 'Congo(RDC)' 'Costa Rica' "Côte d'Ivoire"
 'Ecuador' 'El Salvador' 'FSM' 'France' 'Gambia' 'Ghana' 'Guatemala'
 'Guinea' 'Haiti' 'Honduras' 'India' 'Indonesia' 'Ireland' 'Jamaica'
 'Jordan' 'Kiribati' 'Lesotho' 'Liberia' 'Malawi' 'Malaysia' 'Mexico'
 'Moldova' 'Moçambique' 'Myanmar' 'Nigeria' 'On the Move' 'Pacific'
 'Pakistan' 'Papua New Guinea' 'Philippines' 'România'
 'République Centrafricaine' 'Senegal' 'Sierra Leone' 'South Africa'
 'Srbija' 'Tanzania' 'Tchad' 'Trinidad and Tobago' 'Tunisie'
 'U-Report Global' 'U-Report24x7' 'U-report Mali' 'Uganda' 'Uzbekistan'
 'VIETNAM' 'Western Balkans' 'Zimbabwe' 'eSwatini' 'УКРАЇНА'
 'العراق\u200e' 'ประเทศไทย']
['es' 'en' 'bs' 'pt-br' 'bg' 'fr' 'id' 'ar' 'ro' 'pt' 'my' nan 'it'
 'sr-rs@latin' 'uz' 'vi' 'uk' 'th']
[ 8 17 15 27 19 42  1 33 23  5 10 12 32 46 28 26 49  4 40 38  7 2

In [9]:
df[['question_title','data_type', 'data_segment_category',
       'data_segment_label', 'data_category_label', 'data_category_count',
       'data_order']]

Unnamed: 0,question_title,data_type,data_segment_category,data_segment_label,data_category_label,data_category_count,data_order
0,¿Cuál es la red social que más usás?,response,location,Entre Ríos,Twitter,0.0,6
1,¿Cuál es la red social que más usás?,response,location,Córdoba,Twitter,0.0,6
2,¿Cuánto tiempo por día pasas en la red social ...,set_unset,location,Mendoza,set,0.0,0
3,¿Cuánto tiempo por día pasas entre todas las r...,response,location,Santa Fe,Entre 1 y 2 horas por día,0.0,2
4,¿Cuánto tiempo por día pasas entre todas las r...,response,location,Formosa,Más de 4 horas por día,0.0,4
...,...,...,...,...,...,...,...
44674,ทราบหรือไม่ว่าในอนุสัญญาว่าด้วยสิทธิเด็ก มีสาร...,response,location,ชุมพร,ไม่รู้,0.0,2
44675,ทราบหรือไม่ว่าในอนุสัญญาว่าด้วยสิทธิเด็ก มีสาร...,set_unset,location,พังงา,set,0.0,0
44676,จากสิทธิพื้นฐานทั้ง 4 ด้าน คุณคิดว่าด้านใดสำคั...,set_unset,location,บุรีรัมย์,unset,0.0,0
44677,จากสิทธิพื้นฐานทั้ง 4 ด้าน คุณคิดว่าด้านใดสำคั...,set_unset,location,สกลนคร,unset,0.0,0


In [9]:
assert(len(df['poll_org'].unique())  == len(df['org_id'].unique()))

The assertion above passed: 'poll_org' and 'org_id' have the same number of unique values, which is consistent with them referring to the same IDs and with each organization having run at least one poll.

We could translate questions and check for similarities across languages, but this would not be the most economical approach: we would be checking each question against thousands of others, in other languages, at a high computational cost. Instead, we can translate the poll titles and see what clusters arise in terms of similarity of meaning. We can then plot them in semantic space and check for clusters.

Then, within those clusters, we are far likelier to find questions with overlapping meanings, and can finally focus on the ones that are comparable across countries.

The easy wins here would be finding poll titles that cover the same topic.
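
As a rough illustration of grouping poll titles by topic, the sketch below uses plain word overlap (Jaccard similarity) with a greedy grouping pass. This is a deliberately crude stand-in for the semantic clustering discussed here, not the method the notebook ends up using; the titles, the threshold, and the function names are all made up for the example.

```python
import re

def tokens(title):
    """Lowercase word set of a title; a crude stand-in for an embedding."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))

def jaccard(a, b):
    """Share of words two token sets have in common (0 to 1)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def group_titles(titles, threshold=0.4):
    """Greedy grouping: join the first group whose seed title is similar enough."""
    groups = []
    for title in titles:
        for group in groups:
            if jaccard(tokens(title), tokens(group[0])) >= threshold:
                group.append(title)
                break
        else:
            # No existing group matched, so this title seeds a new one.
            groups.append([title])
    return groups

titles = [
    "Impact of COVID-19 on Education",
    "COVID-19 impact on education",
    "Ending Violence Online",
    "Violence online",
]
groups = group_titles(titles)
for group in groups:
    print(group)
```

Word overlap only works once the titles are in one language, which is why translation comes first; embeddings would additionally catch titles that share a topic but not vocabulary.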

In [10]:
# Now let's get the unique questions per poll
questions_by_poll = df.groupby('poll_title')['question_title'].apply(set)
questions_by_poll = pd.DataFrame(questions_by_poll.reset_index())

In [11]:
questions_by_poll['poll-length'] = questions_by_poll['question_title'].apply(len)
questions_by_poll['poll-length'].describe()

count    4039.000000
mean        3.818024
std         2.768573
min         1.000000
25%         2.000000
50%         3.000000
75%         5.000000
max        27.000000
Name: poll-length, dtype: float64

Some polls are more driven by the youth and their concerns, others by what the UN wishes to learn. On average, there are about 3-4 distinct questions per poll in this sample (mean 3.8, median 3), with outliers of up to 27.
There are also fewer

In [13]:
print(len(df['question_title'].unique()))
print(len(df['question_id'].unique()))

14542
13121


In [14]:
df.head()

Unnamed: 0,org_name,org_language,org_id,org_host,org_subdomain,org_domain,poll_id,poll_flow_uuid,poll_title,poll_org,...,question_title,question_id,question_results_open_ended,question_order,data_type,data_segment_category,data_segment_label,data_category_label,data_category_count,data_order
0,Argentina,es,8,ilhasoft,argentina,,1734,5a7577ea-aa1f-4f9c-b031-78ab91695448,¿Cuáles son las redes sociales preferidas por ...,8,...,¿Cuál es la red social que más usás?,12689,False,1,response,location,Entre Ríos,Twitter,0.0,6
1,Argentina,es,8,ilhasoft,argentina,,1734,5a7577ea-aa1f-4f9c-b031-78ab91695448,¿Cuáles son las redes sociales preferidas por ...,8,...,¿Cuál es la red social que más usás?,12689,False,1,response,location,Córdoba,Twitter,0.0,6
2,Argentina,es,8,ilhasoft,argentina,,1734,5a7577ea-aa1f-4f9c-b031-78ab91695448,¿Cuáles son las redes sociales preferidas por ...,8,...,¿Cuánto tiempo por día pasas en la red social ...,12690,False,2,set_unset,location,Mendoza,set,0.0,0
3,Argentina,es,8,ilhasoft,argentina,,1734,5a7577ea-aa1f-4f9c-b031-78ab91695448,¿Cuáles son las redes sociales preferidas por ...,8,...,¿Cuánto tiempo por día pasas entre todas las r...,12691,False,3,response,location,Santa Fe,Entre 1 y 2 horas por día,0.0,2
4,Argentina,es,8,ilhasoft,argentina,,1734,5a7577ea-aa1f-4f9c-b031-78ab91695448,¿Cuáles son las redes sociales preferidas por ...,8,...,¿Cuánto tiempo por día pasas entre todas las r...,12691,False,3,response,location,Formosa,Más de 4 horas por día,0.0,4


The important point to check here is whether the IDs repeat for different questions. So let's use groupby.

In [15]:
#https://kite.com/python/answers/how-to-count-unique-values-in-a-pandas-dataframe-group-in-python
question_by_id = df.groupby(['question_id']).aggregate({'org_language':'nunique'}).reset_index()

In [16]:
question_by_id

Unnamed: 0,question_id,org_language
0,1,0
1,2,0
2,3,0
3,4,1
4,5,1
5,6,1
6,7,1
7,8,0
8,9,1
9,10,1


In [17]:
question_by_id['org_language'].describe()

count    13121.000000
mean         1.069659
std          0.456853
min          0.000000
25%          1.000000
50%          1.000000
75%          1.000000
max          3.000000
Name: org_language, dtype: float64

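So on average, a question is associated with exactly one language (mean 1.07, median 1), and occasionally with up to three.
A count of zero does not mean the question was never asked: nunique ignores missing values, so these are questions whose rows all have a missing (NaN) org_language, which does appear among the unique values printed earlier. This might be problematic later.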
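
The zero counts can be reproduced on a toy frame. The sketch below, with made-up values, shows that the pandas group-wise `nunique` skips NaN by default and counts it as its own value when `dropna=False` is passed.

```python
import numpy as np
import pandas as pd

# Question 1 has only missing languages; question 3 appears in two languages.
toy = pd.DataFrame({
    "question_id": [1, 1, 2, 3, 3],
    "org_language": [np.nan, np.nan, "es", "en", "fr"],
})

# Default behaviour: NaN is ignored, so question 1 gets a count of 0.
default_counts = toy.groupby("question_id")["org_language"].nunique()

# With dropna=False, NaN is treated as one more distinct value.
with_nan = toy.groupby("question_id")["org_language"].nunique(dropna=False)

print(default_counts.tolist())
print(with_nan.tolist())
```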

In [34]:
#Now the aim is to translate all the questions and find the relevant ones
from googletrans import Translator
translator = Translator()

In [19]:
result = translator.translate("ไม่ทราบว่าเพื่อนๆ ของยูรีพอร์ตเตอร์มีประสบการณ์ถูกกลั่นแกล้งบนโลกออนไลน์หรือไม่ และทราบหรือไม่ว่าหากพบเห็นการกลั่นแกล้งบนโลกออนไลน์ควรทำอย่างไร?", 
                     src="th", dest="en")

In [20]:
result.text

"Do not know friends Of Yu-Reporter's experience in cyberbullying? And do you know what to do if you see cyberbullying online?"

Each poll has an ID with data already generated. 
https://thailand.ureport.in/poll/62

In [21]:
speaker_countries = df.groupby('org_language')['org_name'].apply(set)
speaker_countries = pd.DataFrame(speaker_countries.reset_index())

In [22]:
speaker_countries.columns

Index(['org_language', 'org_name'], dtype='object')

In [23]:
speaker_countries['num_country'] = speaker_countries['org_name'].apply(len) 

In [24]:
for country in speaker_countries[speaker_countries['org_language']=='en']['org_name']:
    print(country)

{'U-Report24x7', 'Lesotho', 'Bangladesh', 'Philippines', 'Ireland', 'Pacific', 'eSwatini', 'Papua New Guinea', 'FSM', 'Botswana', 'Tanzania', 'Uganda', 'Sierra Leone', 'India', 'Ghana', 'Pakistan', 'Malaysia', 'U-Report Global', 'Kiribati', 'Liberia', 'South Africa', 'Belize', 'Jamaica', 'Gambia', 'Malawi', 'Trinidad and Tobago'}


So the English-speaking organizations span quite a range, including regional groups such as 'Pacific' and more international entities like 'U-Report Global' and 'U-Report24x7'.
Now we look at the non-English polls.

In [28]:
non_eng_polls = df[df['org_language'] != 'en'][['poll_title', 'question_title', 
                                                'org_language', 'poll_category_name']]
#It would make sense to remove duplicates

In [29]:
non_eng_polls.shape

(26776, 4)

In [30]:
non_eng_polls['poll_category_name'].unique()

array(['Comunicación', 'Inclusión', 'Protección',
       'Participación Adolescente', 'Educación', 'Salud',
       'Fechas Importantes', 'Igualdad de Género', 'Cambio Climático',
       'Adolescencia', 'Participación', 'Niñez', 'embarazo', 'General',
       'Ureporteri', 'U-Report Brasil', 'Saúde', 'Educação',
       'ODS 2 - Fome Zero', '+Q',
       'Objetivos do Desenvolvimento Sustentável', 'ODS', 'Mete a colher',
       'ODS Geral', 'HIV ', 'Política', 'Migrantes', 'Aprendiz',
       'Esportes', 'Violência Sexual', 'Evasão Escolar',
       'Corpo e Gordofobia', 'ODS 5 - Igualdade de Gênero', 'Corpo',
       'Redução', 'Juv e Trabalho', 'Direitos Reprodutivos e Prevenções',
       'Proteção', "ODS 14 - Vida debaixo D'Água",
       'ODS 9 - Indústria, Inovação e Infraestrutura',
       'ODS 3 - Saúde e Bem-Estar', 'ODS 1 - Erradicação da Pobreza',
       'Abordagem Policial', 'Acesso à justiça', 'ARMAS', 'DROGAS',
       'Rede LGBT', 'Letalidade Violenta', 'Segurança', 'S4D', 'Opinio

We have gone from about 44.7k rows to 26.8k, which indicates that although each non-English language covers fewer countries than English does, together they still account for a considerable proportion of the total rows. Note that this filter also keeps rows whose org_language is missing. How many of these are in Spanish?

In [26]:
non_eng_polls.drop_duplicates(inplace=True)
non_eng_polls.shape

(10796, 4)

In [27]:
#Translating the Poll Titles

In [32]:
non_eng_polls.loc[non_eng_polls['org_language']=='fr']

Unnamed: 0,poll_title,question_title,org_language,poll_category_name
2304,Perception des adolescent-e-s et jeunes sur le...,[Si non] Pourquoi ?,fr,Participation
2305,Perception des adolescent-e-s et jeunes sur le...,Quelles mesures prends-tu ?,fr,Participation
2306,Perception des adolescent-e-s et jeunes sur le...,Quelles mesures prends-tu ?,fr,Participation
2307,Perception des adolescent-e-s et jeunes sur le...,Comment reçois-tu les informations sur le coro...,fr,Participation
2308,Perception des adolescent-e-s et jeunes sur le...,Comment reçois-tu les informations sur le coro...,fr,Participation
2309,Rétro information sur l'Etat Civil dans la rég...,Est-ce que la délivrance de la copie d’acte de...,fr,Participation
2310,Rétro information sur l'Etat Civil dans la rég...,Combien de fois doit-on se rendre dans la comm...,fr,Participation
2311,Rétro information sur l'Etat Civil dans la rég...,Comment jugez-vous l’accueil dans les services...,fr,Participation
2312,Rétro information sur l'Etat Civil dans la rég...,Comment jugez-vous l’accueil dans les services...,fr,Participation
2313,Rétro information sur l'Etat Civil dans la rég...,Comment jugez-vous l’accueil dans les services...,fr,Participation


In [33]:
non_eng_polls['translated_category'] = non_eng_polls.progress_apply(lambda x: translate_to_eng(x.poll_category_name, x.org_language),
                                                       axis=1)

  0%|                                                                             | 2/26776 [00:18<70:27:51,  9.47s/it]


AttributeError: ("'NoneType' object has no attribute 'group'", 'occurred at index 0')

#### Deduplication
So we have only about 10.8k entries now, which suggests that each poll/question combination appeared about 2-3 times before deduplication.

In [36]:
#let's check if it's the same number of questions for all languages
poll_by_lang = df.groupby(['org_language'])['poll_title'].count()

In [30]:
poll_by_lang
# Note: count() tallies non-null rows per language, not distinct polls,
# so these figures cannot tell us how many questions each poll has.
# Use nunique() on poll_title and question_title to compare the two.

org_language
ar               104
bg                 7
bs                72
en             17903
es              4360
fr              3512
id              1512
it               456
my              2702
pt               461
pt-br           1088
ro              2046
sr-rs@latin     1100
th              2723
uk              2657
uz               954
vi                32
Name: poll_title, dtype: int64
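
Because `count()` tallies non-null rows, the figures above are rows per language rather than polls per language. A small sketch on made-up data shows the difference between `count()` and `nunique()` for the same grouping:

```python
import pandas as pd

# Toy data: three rows for 'ar' spread over two polls, two rows for 'en' in one poll.
toy = pd.DataFrame({
    "org_language": ["ar", "ar", "ar", "en", "en"],
    "poll_title": ["CRC", "CRC", "Mira: Movement", "Poll A", "Poll A"],
})

# Rows per language (what count() reports).
rows_per_lang = toy.groupby("org_language")["poll_title"].count()

# Distinct polls per language (what we actually want to compare).
polls_per_lang = toy.groupby("org_language")["poll_title"].nunique()

print(rows_per_lang.to_dict())
print(polls_per_lang.to_dict())
```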

In [33]:
poll_by_lang_poll = df.groupby(['org_language', 'poll_title'])['question_title'].count()

In [35]:
poll_by_lang_poll.reset_index()

Unnamed: 0,org_language,poll_title,question_title
0,ar,COVID-19 Risk Perception at Community Level Su...,15
1,ar,COVID19 Information centre,23
2,ar,CRC,8
3,ar,Ending Violence Online,12
4,ar,Impact of COVID-19 on Education,5
5,ar,Mira: Movement,2
6,ar,Safe Internet Day Poll,1
7,ar,U-Report For Syria 9 Years,12
8,ar,World Mental Health Day - 10 October,2
9,ar,الصحة النفسية والرفاهية,10


#### Translation
Now we can use the Translator object to translate the full range of poll text.

In [52]:
def translate_to_eng(txt, src_lang):
    """
    takes in text in a non-english language and its language code
    returns the english translation
    """
    
    try:
        result = translator.translate(txt, 
                     src=src_lang, dest="en")
        # Only reached when the call with an explicit source language succeeds
        print('it worked')
    # In case the organization's language label doesn't match the question language
    # (or is missing), fall back to automatic language detection
    except Exception:
        result = translator.translate(txt, dest="en")
        
    return result.text

In [40]:
def translate_unknown_to_eng(txt):
    """
    takes in text in a non-english language (not specified by user)
    returns the english translation, or "" if the translation fails
    """
    
    try:
        # No src given, so the translator detects the language itself
        result = translator.translate(txt,
                                    dest="en")
        return result.text
    # Any failure (network error, unexpected API response) yields an empty string
    except Exception:
        return ""

In [44]:
_only_uniq_categs = non_eng_polls.drop_duplicates(subset=['org_language', 'poll_category_name'])

In [46]:
_only_uniq_categs.shape

(316, 4)

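Why is this not equal to the number of unique values of poll_category_name alone? Presumably some category names are identical across languages, so deduplicating on (org_language, poll_category_name) keeps one row per language for them. We need to decide whether that is acceptable for us.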


In [47]:
_only_uniq_categs

Unnamed: 0,poll_title,question_title,org_language,poll_category_name
0,¿Cuáles son las redes sociales preferidas por ...,¿Cuál es la red social que más usás?,es,Comunicación
6,¿Cómo afecta el Aislamiento Preventivo Social ...,Mencione los tres cambios más importantes en l...,es,Inclusión
57,¿Cómo es la situación en el hogar de los adole...,"Durante la cuarentena, ¿quién hace la mayor pa...",es,Protección
67,Covid-19: ¿Qué acciones pueden tomar los adole...,¿Qué tipo de acciones querés conocer?,es,Participación Adolescente
96,Educación a Distancia en contexto de Covid-19,Para quienes se encuentran estudiando actualme...,es,Educación
118,Actitudes y comportamientos sobre el COVID-19,¿Cómo te has sentido estos últimos siete días?,es,Salud
150,Día Internacional de la Mujer,¿Considerás que las mujeres deberían tener igu...,es,Fechas Importantes
313,#DíaDeLaNiña ¿Cuál es el trabajo de tus sueños?,¿Cuál es el trabajo de tus sueños?,es,Igualdad de Género
352,Cambio Climático,¿Considerás que a través de tus acciones podés...,es,Cambio Climático
978,COVID 19 - Actitudes y comportamientos durante...,¿Qué opción describe mejor cómo te has sentido...,es,Adolescencia


In [53]:
translate_to_eng('Comunicación', 'es')

it worked


'Comunicación'

In [48]:
_only_uniq_categs['poll_category_translation'] = _only_uniq_categs.progress_apply(lambda x: translate_to_eng(x.poll_category_name, x.org_language),
                                                       axis=1)

100%|████████████████████████████████████████████████████████████████████████████████| 316/316 [03:05<00:00,  1.71it/s]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


In [49]:
_only_uniq_categs

Unnamed: 0,poll_title,question_title,org_language,poll_category_name,poll_category_translation
0,¿Cuáles son las redes sociales preferidas por ...,¿Cuál es la red social que más usás?,es,Comunicación,Comunicación
6,¿Cómo afecta el Aislamiento Preventivo Social ...,Mencione los tres cambios más importantes en l...,es,Inclusión,Inclusión
57,¿Cómo es la situación en el hogar de los adole...,"Durante la cuarentena, ¿quién hace la mayor pa...",es,Protección,Protección
67,Covid-19: ¿Qué acciones pueden tomar los adole...,¿Qué tipo de acciones querés conocer?,es,Participación Adolescente,Participación Adolescente
96,Educación a Distancia en contexto de Covid-19,Para quienes se encuentran estudiando actualme...,es,Educación,Educación
118,Actitudes y comportamientos sobre el COVID-19,¿Cómo te has sentido estos últimos siete días?,es,Salud,Salud
150,Día Internacional de la Mujer,¿Considerás que las mujeres deberían tener igu...,es,Fechas Importantes,Fechas Importantes
313,#DíaDeLaNiña ¿Cuál es el trabajo de tus sueños?,¿Cuál es el trabajo de tus sueños?,es,Igualdad de Género,Igualdad de Género
352,Cambio Climático,¿Considerás que a través de tus acciones podés...,es,Cambio Climático,Cambio Climático
978,COVID 19 - Actitudes y comportamientos durante...,¿Qué opción describe mejor cómo te has sentido...,es,Adolescencia,Adolescencia



In [55]:
non_eng_polls.iloc[1208]

poll_title                         BreastMilk and No water Campaign
question_title    (For YES answers): Ok, What else should be giv...
org_language                                                     fr
Name: 2998, dtype: object

In [35]:
non_eng_polls['poll_category_translation'] = non_eng_polls.progress_apply(lambda x: translate_to_eng(x.poll_category_name, x.org_language),
                                                       axis=1)


  2%|█▋                                                                          | 592/26776 [03:04<2:15:38,  3.22it/s]

KeyboardInterrupt



A more efficient approach: do not translate every row.
Instead, translate each unique value once, store the results in a dictionary, and map the translations back onto the rows.


In [None]:
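# Assumption: all_poll_categs is the set of distinct category names
all_poll_categs = non_eng_polls['poll_category_name'].dropna().unique()
categ_map = {}
for category in all_poll_categs:
    # key on the value itself, not the literal string 'category'
    categ_map[category] = translate_unknown_to_eng(category)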
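
The saving from the dictionary approach can be illustrated without any network calls. In the sketch below, `fake_translate` is a hypothetical stand-in for the real translator that simply counts how often it is called; the category values are examples only.

```python
def fake_translate(text):
    """Hypothetical stand-in for a real translation call; counts its invocations."""
    fake_translate.calls += 1
    lookup = {"Salud": "Health", "Educación": "Education"}
    return lookup.get(text, text)

fake_translate.calls = 0

# Five rows, but only two distinct category names.
categories = ["Salud", "Educación", "Salud", "Salud", "Educación"]

# Translate each distinct value exactly once...
categ_map = {c: fake_translate(c) for c in set(categories)}

# ...then map every row through the dictionary, with no further calls.
translated = [categ_map[c] for c in categories]

print(translated, fake_translate.calls)
```

With a DataFrame, the last step would be `Series.map(categ_map)` on the category column, so the number of translation requests depends on the number of distinct values rather than the number of rows.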
    
    

In [45]:
poll_categories = non_eng_polls['poll_category_translation'].unique()

In [61]:
#What would be interesting is to see the differences to the same poll questions across different segments
non_eng_polls['poll_translation'] = non_eng_polls.progress_apply(lambda x: translate_to_eng(x.poll_title, x.org_language),
                                                       axis=1)


100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10780/10780 [47:11<00:00,  3.81it/s]


In [63]:
poll_topics = non_eng_polls['poll_translation'].unique()

In [64]:
print(len(poll_topics))

3126


In [65]:
print(poll_topics)

['What are the preferred social networks for Argentine teens?'
 '¿Cuáles son las redes sociales preferidas por los adolescentes argentinos?'
 'How does Social and Mandatory Preventive Isolation affect households with Disabilities?'
 ...
 'The Convention on the Rights of the Child states that the children have the right to express themselves in their thoughts. And their opinions should be considered as appropriate\r\n\r\nWhat level of opinion do you think you can exercise your rights to participate in?'
 "Don't know friends Of U-readers have experienced cyberbullying? And do you know what if bullying is seen online?"
 'Thailand signed as a member of the Convention on the Rights of the Child on 12 February 1992. Did you know that this Convention? What are the key principles in protecting the basic rights of children?']


In [46]:
# unique() returns a NumPy array, which has no to_pickle(); wrap it in a Series first
pd.Series(poll_categories).to_pickle('../non_english_category_titles.pkl')

In [None]:
#Save this for later use (poll_topics is a NumPy array, so wrap it in a Series)
pd.Series(poll_topics).to_pickle('../non_english_poll_titles.pkl')
# Merge with the English poll titles


The questions need to be organized by some measure of similarity. There are many different ways of doing this, but we will start with the overview covered in this comprehensive Medium post:
https://medium.com/@adriensieg/text-similarities-da019229c894
and the associated GitHub repository:
https://github.com/adsieg/text_similarity/blob/master/Different%20Embeddings%20%2B%20Cosine%20Similarity%20%2B%20HeatMap%20illustration.ipynb


In [13]:
import spacy
nlp = spacy.load("en_core_web_md")
tokens = nlp("dog cat banana afskfsd")

In [3]:
from googletrans import Translator
translator = Translator()

In [56]:
translator.translate('bonjour', src='fr').text

'bonjour'

In [3]:
for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))

dog dog 1.0
dog cat 0.80168545
dog banana 0.24327648
dog afskfsd 0.0
cat dog 0.80168545
cat cat 1.0
cat banana 0.28154367
cat afskfsd 0.0
banana dog 0.24327648
banana cat 0.28154367
banana banana 1.0
banana afskfsd 0.0
afskfsd dog 0.0
afskfsd cat 0.0
afskfsd banana 0.0
afskfsd afskfsd 1.0


In [None]:
gloveFile = "data\\glove.6B.50d.txt"
import numpy as np
def loadGloveModel(gloveFile):
    print ("Loading Glove Model")
    with open(gloveFile, encoding="utf8" ) as f:
        content = f.readlines()
    model = {}
    for line in content:
        splitLine = line.split()
        word = splitLine[0]
        embedding = np.array([float(val) for val in splitLine[1:]])
        model[word] = embedding
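    print ("Done.",len(model)," words loaded!")
    return model

# The helpers below read a module-level `model`, so load the vectors once here
model = loadGloveModel(gloveFile)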

def preprocess(raw_text):

    # keep only words
    letters_only_text = re.sub("[^a-zA-Z]", " ", raw_text)

    # convert to lower case and split 
    words = letters_only_text.lower().split()

    # remove stopwords
    stopword_set = set(stopwords.words("english"))
    cleaned_words = list(set([w for w in words if w not in stopword_set]))

    return cleaned_words

def cosine_distance_between_two_words(word1, word2):
    # Despite the name, this returns cosine *similarity* (1 - cosine distance)
    from scipy.spatial import distance
    return 1 - distance.cosine(model[word1], model[word2])

def calculate_heat_matrix_for_two_sentences(s1,s2):
    s1 = preprocess(s1)
    s2 = preprocess(s2)
    result_list = [[cosine_distance_between_two_words(word1, word2) for word2 in s2] for word1 in s1]
    result_df = pd.DataFrame(result_list)
    result_df.columns = s2
    result_df.index = s1
    return result_df

def cosine_distance_wordembedding_method(s1, s2):
    from scipy.spatial import distance
    # Average the word vectors of each sentence, skipping words missing from the GloVe vocabulary
    vector_1 = np.mean([model[word] for word in preprocess(s1) if word in model],axis=0)
    vector_2 = np.mean([model[word] for word in preprocess(s2) if word in model],axis=0)
    cosine = distance.cosine(vector_1, vector_2)
    similarity = round((1-cosine)*100,2)
    print('The word-embedding method with cosine distance assesses that our two sentences are',similarity,'% similar')
    return similarity

def heat_map_matrix_between_two_sentences(s1,s2):
    df = calculate_heat_matrix_for_two_sentences(s1,s2)
    import seaborn as sns
    import matplotlib.pyplot as plt
    fig, ax = plt.subplots(figsize=(5,5)) 
    ax_blue = sns.heatmap(df, cmap="YlGnBu")
    # ax_red = sns.heatmap(df)
    # The call below prints the overall sentence similarity itself
    cosine_distance_wordembedding_method(s1, s2)
    return ax_blue
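
To make the averaged-embedding comparison concrete without downloading GloVe, here is a small standard-library sketch. The three-dimensional vectors in `toy_model` are invented for illustration (real GloVe vectors have 50 or more dimensions), and the helper names are hypothetical rather than part of the code above.

```python
import math

# Made-up 3-dimensional "embeddings" standing in for GloVe vectors.
toy_model = {
    "school": [0.9, 0.1, 0.0],
    "education": [0.8, 0.2, 0.1],
    "violence": [0.0, 0.9, 0.3],
}

def mean_vector(words, model):
    """Average the vectors of the words that exist in the model."""
    vectors = [model[w] for w in words if w in model]
    if not vectors:
        raise ValueError("none of the words are in the vocabulary")
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]

def cosine_similarity(u, v):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

s1 = mean_vector(["school", "education"], toy_model)
s2 = mean_vector(["education"], toy_model)
s3 = mean_vector(["violence"], toy_model)

print(round(cosine_similarity(s1, s2), 3), round(cosine_similarity(s1, s3), 3))
```

Sentences about related topics end up with a higher similarity than unrelated ones, which is the property the poll-title clustering relies on.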

In [None]:
len(df[df['org_language']=="es"]['question_title'].unique())

In [None]:
# We don't know yet what each of these language codes means

In [None]:
#https://stackoverflow.com/questions/27842613/pandas-groupby-sort-within-groups

In [25]:
# What would be useful is to find the polls that ask the same question in multiple languages
# Then we could check for differences

In [None]:
wide_df = df.pivot()