# Data Visualization

## Assignment 6: Narrative, Figure Composition and Geographical Visualization

You can't learn technical subjects without hands-on practice. The assignments are an important part of the course. To submit this assignment you will need to make sure that you save your Jupyter notebook. 

Below are the links to 2 videos that explain:

1. [How to save your Jupyter notebook](https://youtu.be/0aoLgBoAUSA) and,       
2. [How to answer a question in a Jupyter notebook assignment](https://youtu.be/7j0WKhI3W4s).

<div class="alert alert-info" style="color:black">
    
Assignment Learning Goals:

By the end of the module, students are expected to:

- Create a narrative with visualizations
- Lay out plots in panels of a figure grid
- Visualize geographical data on maps
- Save figures outside the notebook


This assignment covers [Module 6](https://viz-learn.mds.ubc.ca/en/module6) of the online course. You should complete this module before attempting this assignment.
 
</div>

Any place you see `...`, you must fill in the function, variable, or data to complete the code. Substitute the `None` and the `raise NotImplementedError # No Answer - remove if you provide an answer` with your completed code and answers then proceed to run the cell!

Note that some of the questions in this assignment will have hidden tests. This means that no feedback will be given as to the correctness of your solution. It will be left up to you to decide if your answer is sufficiently correct. These questions are worth 2 points.

In [1]:
# Import libraries needed for this assignment

from hashlib import sha1
import altair as alt
import pandas as pd
import numpy as np
import test_assignment6 as t
from vega_datasets import data


# 0. The Back Story

For this assignment, instead of looking around us, we will look inside.
There are over 7,000 languages spoken across the globe (we will be exploring only 17 of them), and it would be interesting to know whether some of them allow for more rapid information transmission than others. The implications reach beyond communication: language also restricts many of your thoughts to those that can be expressed in words (again, fascinating!). This means that you could potentially (slightly) increase the rate at which you convey information by learning a new language. It could also mean that some groups of people have access to faster thinking than others, just because their language can convey information at a higher rate. What an enormous advantage!


## 0.1 The data

To aid our exploration of whether some languages are more efficient at conveying information than others, we have obtained data from [a study in Science Advances in 2019](https://advances.sciencemag.org/content/5/9/eaaw2594). There is also a popular science version of this article published in [The Economist](https://www.economist.com/graphic-detail/2019/09/28/why-are-some-languages-spoken-faster-than-others).

We have compiled two tables: one with general information on the languages we are studying, and one with experimental data in which the researchers recorded people reading a set of texts in different languages and noted down how fast they spoke, etc. You can find a description of the columns of both datasets below; the files are available in the `data/` folder.

***Note: many popular datasets have sex as a feature where the possible values are male and female. This representation reflects how the data were collected and is not meant to imply that, for example, sex is binary.***

---

<center><h3>Languages dataset</h3></center>

| Column              | Description                                                       |
|---------------------|-------------------------------------------------------------------|
| iso_lang            | [ISO_639-3 language code](https://en.wikipedia.org/wiki/ISO_639-3)|
| language            | Language name                                                     |
| information_density | Bits of information per syllable in the language                  |
| distinct_syllables  | The number of different syllables in the language                 |
| continent           | The continent where the language originated. This was derived from the original column "Family and subfamily", where languages were categorized by their continent of origin (for more details see [here](https://advances.sciencemag.org/content/advances/suppl/2019/08/29/5.9.eaaw2594.DC1/aaw2594_SM.pdf)) |
| id                  | The [ISO 3166-1 numeric code](https://en.wikipedia.org/wiki/ISO_3166-1_numeric) of the country where the language is spoken |
| lat                 | The country's latitude coordinate as per [Google's developer tool](https://developers.google.com/public-data/docs/canonical/countries_csv) |
| lon                 | The country's longitude coordinate as per [Google's developer tool](https://developers.google.com/public-data/docs/canonical/countries_csv) |

---

<center><h3>Spoken texts dataset</h3></center>

| Column    | Description                                                        |
|-----------|--------------------------------------------------------------------|
| speaker   | Speaker ID                                                         |
| iso_lang  | [ISO_639-3 language code](https://en.wikipedia.org/wiki/ISO_639-3) |
| text      | Text ID                                                            |
| sex       | The sex of the speaker                                             |
| duration  | The number of seconds it took to speak the text                    |
| syllables | Number of syllables uttered during the speech                      |
| age       | The age of the speaker                                             |

---


**Question 0.1** <br> {points: 1}

Read in the `languages.csv` data and take a look at the languages we have information on. 

*Save the dataframe in an object named `lang_df`*. 

In [2]:
lang_df = pd.read_csv('data/languages.csv')

lang_df

Unnamed: 0,iso_lang,language,information_density,distinct_syllables,continent,id,lat,lon
0,CAT,Catalan,5.49,3600,Europe,20.0,42.546245,1.601554
1,CMN,Mandarin,6.96,1274,Asia,156.0,35.86166,104.195397
2,DEU,German,6.08,5100,Europe,276.0,51.165691,10.451526
3,ENG,English,7.09,6949,Europe,826.0,55.378051,-3.435973
4,EUS,Basque,4.83,2082,Europe,,,
5,FIN,Finnish,5.49,3844,Europe,246.0,61.92411,25.748151
6,FRA,French,6.68,2949,Europe,250.0,46.227638,2.213749
7,HUN,Hungarian,5.9,4325,Europe,348.0,47.162494,19.503304
8,ITA,Italian,5.29,2729,Europe,380.0,41.87194,12.56738
9,JPN,Japanese,5.03,643,Asia,392.0,36.204824,138.252924


In [3]:
t.test_0_1(lang_df)

'Success'

**Question 0.2** <br> {points: 1}

Each language has its own number of distinct syllables that can be used to communicate.

For example, Japanese has only a few hundred, whereas English has around 7,000.

This makes you wonder how many distinct syllables you might be using each day.

Just from looking at the data, obtain the row for the language that has the greatest number of distinct syllables.

*Save your answer as a dataframe object named `answer0_2`*. 

In [4]:
answer0_2 = lang_df.sort_values(by='distinct_syllables', ascending=False).head(1)
answer0_2

Unnamed: 0,iso_lang,language,information_density,distinct_syllables,continent,id,lat,lon
3,ENG,English,7.09,6949,Europe,826.0,55.378051,-3.435973


In [5]:
t.test_0_2(answer0_2)

'Success'

**Question 0.3** <br> {points: 2}

Information density describes how much information is packed into the syllables of a language; it can be used as an estimate of the average amount of information per syllable.

For example, if one language has an information density of 4.23, a second language with an information density of 8.46 would be twice as informative! This means that, with an equal number of syllables, the second language would communicate twice as much information.

This is not to be confused with the speech rate, which is how quickly the language is spoken. 
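
As a quick check of the arithmetic in the example above, here is a minimal sketch. The two languages and the syllable count are hypothetical and are not taken from the dataset; only the two density values come from the example.

```python
# Information densities (bits per syllable) from the example above
density_lang_a = 4.23
density_lang_b = 8.46

# Assumed number of syllables spoken in each language (illustrative value)
n_syllables = 100

# Total information = bits per syllable * number of syllables
bits_lang_a = density_lang_a * n_syllables
bits_lang_b = density_lang_b * n_syllables

# Ratio of information conveyed with the same number of syllables
ratio = bits_lang_b / bits_lang_a
print(f"Language B conveys {ratio:.1f}x the information of language A")
```

Note that nothing here depends on how fast the syllables are spoken; that is the job of the speech rate.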


What is the language with the greatest information density? 

*Save your answer as a string in an object named `answer0_3`*. 

In [6]:
answer0_3 = lang_df.sort_values(by='information_density', ascending=False).iloc[0,1]
answer0_3

'Vietnamese'

In [7]:
# check that the variable exists
assert 'answer0_3' in globals(), "Please make sure that your solution is named 'answer0_3'"

# This test has been intentionally hidden. It will be up to you to decide if your solution
# is sufficiently good.

# 1. Language exploration

Let’s start our analysis by getting familiar with potential differences and similarities between the languages in this data set. A language's information density is going to be important in how quickly a speaker can convey information, since information is transferred in the form of syllables, but it would be interesting to see whether the number of distinct syllables is also relevant to how information is transferred.

Try and see if there are any patterns in the data by organizing the languages by these two variables - `distinct_syllables` and `information_density`.

**Question 1.1** <br> {points: 2}  

Create a scatterplot of the information density of each language versus the number of distinct syllables. Map `information_density` to the x-axis and `distinct_syllables` to the y-axis. Scale both axes so that they are not forced to include zero. Assign a size of 50 to each point.

Make sure to give each axis a label. Since we will be layering this plot, there is no need to add a title yet. 

*Save this plot in an object named `den_syl_plot`.*

In [8]:
den_syl_plot = alt.Chart(lang_df).mark_circle(size=50).encode(
    alt.X('information_density', scale=alt.Scale(zero=False), title="Information Density"),
    alt.Y('distinct_syllables', scale=alt.Scale(zero=False), title="Distinct Syllables"))

den_syl_plot

In [9]:
t.test_1_1_1(den_syl_plot)

'Success'

In [10]:
t.test_1_1_2(den_syl_plot)

'Success'

**Question 1.2** <br> {points: 1} 

If you look carefully at the plot `den_syl_plot` from **Question 1.1**, you'll notice that the data appears to cluster in groups in the scatter plot. 
Can you find a categorical variable in this dataframe that roughly explains this clustering?

Using the plot `den_syl_plot`, add a colour channel to the plot that reflects this categorical variable. 

*Save this plot in an object named `coloured_den_syl_plot`.*

In [11]:
coloured_den_syl_plot = alt.Chart(lang_df).mark_circle(size=50).encode(
    alt.X('information_density', scale=alt.Scale(zero=False), title="Information Density"),
    alt.Y('distinct_syllables', scale=alt.Scale(zero=False), title="Distinct Syllables"),
    alt.Color('continent'))

coloured_den_syl_plot

In [12]:
t.test_1_2(coloured_den_syl_plot)

'Success'

**Question 1.3** <br> {points: 2} 

It appears that there is a language that sits in the middle of the two groups in our plot above.

Which language does not fit in with the groups from the above plot?


*Save the language that appears not to fit in with its group as a string to an object named `middle_lang`.*

In [13]:
middle_lang = 'French'
middle_lang

'French'

In [14]:
# check that the variable exists
assert 'middle_lang' in globals(), "Please make sure that your solution is named 'middle_lang'"

# This test has been intentionally hidden. It will be up to you to decide if your solution
# is sufficiently good.

**Question 1.4** <br> {points: 1}

Filter the dataframe `lang_df` to only contain the outlying language(s) from **Question 1.3**. 

*Save the dataframe in an object named `outlier_lang_df`.*

In [15]:
outlier_lang_df = lang_df[lang_df['language'] == 'French']
outlier_lang_df

Unnamed: 0,iso_lang,language,information_density,distinct_syllables,continent,id,lat,lon
6,FRA,French,6.68,2949,Europe,250.0,46.227638,2.213749


In [16]:
t.test_1_4(outlier_lang_df)

'Success'

**Question 1.5** <br> {points: 2}

Take the dataframe `outlier_lang_df` from **Question 1.4** and create a `mark_text` plot that annotates the language(s) deemed as outliers. Name this plot `text_plot`. You may want to change the position of the text so that it's not hiding the point it's annotating. 

Add `text_plot` to `coloured_den_syl_plot` and save this layered plot in an object named `annotated_plot`.  An appropriate title should be added to the combined plots and not the plots individually (This is because we will be presenting this plot with others in part 5).

In [17]:
text_plot = alt.Chart(outlier_lang_df).mark_text(dx=10, dy=-10).encode(
    alt.X('information_density'),
    alt.Y('distinct_syllables'),
    text=alt.value('French'))

annotated_plot = (coloured_den_syl_plot + text_plot).properties(title='Language Data by Continent')

annotated_plot

In [18]:
t.test_1_5_1(text_plot)

'Success'

In [19]:
t.test_1_5_2(annotated_plot)

'Success'

**Question 1.6** <br> {points: 2}

Which of the following observations can be made from the plot `annotated_plot` above? 

Select all that apply:

i) Asian languages tend to have a positive relationship between the number of distinct syllables and their information density. 

ii) European languages tend to have a positive relationship between the number of distinct syllables and their information density.

iii) Asian languages appear to have fewer distinct syllables for a language's information density than European languages. 

iv) Asian languages tend to have a lower information density than European languages. 

v) European languages appear to have more distinct syllables. 

*To answer the question, select all that apply and add the letter(s) associated with the correct answer(s) to a list and assign it to a variable named `answer1_6`. For example, if you believe that i) and ii) are True, then your answer would look like this:*

`answer1_6 = ["i", "ii"]`

In [20]:
answer1_6 = ["i", "ii", "iii", "v"]
answer1_6

['i', 'ii', 'iii', 'v']

In [21]:
t.test_1_6(answer1_6)

'Success'

# 2. Speech Rate and Information Density

Hmm... what an intriguing pattern we just revealed... let’s keep these clusters in mind as we continue our exploration. The two variables that determine the information rate of a language **(IR)** are its information density **(ID)** (bits of information per syllable) and the rate at which it is spoken **(SR)** (syllables per second).


$IR = ID * SR$ 

([Source: Equation 6](https://advances.sciencemag.org/content/5/9/eaaw2594))


A high value in both would indicate a high information rate and efficient communication (a higher number of information bits conveyed per second).
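
To make the equation concrete, here is a small sketch with made-up values (the two languages and their numbers are hypothetical, and `information_rate` is just a helper name chosen for this example). It shows that a dense but slowly spoken language and a sparse but quickly spoken one can end up with the same information rate.

```python
def information_rate(information_density, speech_rate):
    """Return bits per second: (bits / syllable) * (syllables / second)."""
    return information_density * speech_rate

# Hypothetical values, chosen only for illustration
dense_slow = information_rate(information_density=8.0, speech_rate=5.0)
sparse_fast = information_rate(information_density=5.0, speech_rate=8.0)

print(dense_slow, sparse_fast)
```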



While the information density of each language has already been provided to us, we will have to approximate the speech rate (syllables per second) of each language using the `spoken_texts` dataset.

This data was collected by having the different candidates read excerpts in their native language. Excerpts such as the following were translated into all 17 languages and read by their respective candidates: 

- *I have a problem with my water softener. The water-level is too high and the overflow keeps
dripping. Could you arrange to send an engineer on Tuesday morning please? It's the only day I can
manage this week. I'd be grateful if you could confirm the arrangement in writing.*


- *Municipal Fire Service speaking. We're trying to locate an emergency caller who rang off
without giving any personal details. He appeared to be on the local network. He connected on our line
number 762 584. We'd appreciate immediate attempts to trace him because he sounded desperate* 


- *Can you give me a list of the restaurants in the neighbourhood? I live in Clancy Street, NW1.
I'm interested in something a little more exotic than usual. Perhaps a Polynesian place, for example. I'd
prefer it not to be vegetarian.*

**Question 2.1** <br> {points: 1}  

Read in the `spoken-texts.csv` data. 

*Save the dataframe in an object named `speech_df`*. 

In [22]:
speech_df = pd.read_csv('data/spoken-texts.csv')

speech_df.head()

Unnamed: 0,speaker,iso_lang,text,sex,duration,syllables,age
0,CAT_F1(Su),CAT,O1,F,12.43,88,42.0
1,CAT_F1(Su),CAT,O2,F,16.92,118,42.0
2,CAT_F1(Su),CAT,O3,F,19.96,139,42.0
3,CAT_F1(Su),CAT,O4,F,22.42,142,42.0
4,CAT_F1(Su),CAT,O6,F,13.49,99,42.0


In [23]:
t.test_2_1(speech_df)

'Success'

**Question 2.2** <br> {points: 1}  

To explore the speech rate, we will have to calculate it first from columns in the `speech_df` dataframe. 

Create a new dataframe that contains all the columns from `speech_df` as well as a newly calculated `speech_rate` column that calculates the speech rate from the `syllables` and `duration` columns. 

Using the same equation as the [study in Science Advances](https://advances.sciencemag.org/content/5/9/eaaw2594), speech rate can be calculated as:

$\text{speech rate} = \frac{\text{Number of syllables}}{\text{duration}}$

This new column will calculate the speech rate for each observation in the `speech_df`.

*Save the dataframe in an object named `speech_rate_df`*. 

In [24]:
speech_rate_df = speech_df.assign(speech_rate = speech_df['syllables']/speech_df['duration'])

speech_rate_df

Unnamed: 0,speaker,iso_lang,text,sex,duration,syllables,age,speech_rate
0,CAT_F1(Su),CAT,O1,F,12.43,88,42.0,7.079646
1,CAT_F1(Su),CAT,O2,F,16.92,118,42.0,6.973995
2,CAT_F1(Su),CAT,O3,F,19.96,139,42.0,6.963928
3,CAT_F1(Su),CAT,O4,F,22.42,142,42.0,6.333631
4,CAT_F1(Su),CAT,O6,F,13.49,99,42.0,7.338769
...,...,...,...,...,...,...,...,...
2283,YUE_M6(Hu),YUE,P3,M,16.82,83,23.0,4.934602
2284,YUE_M6(Hu),YUE,P8,M,15.25,77,23.0,5.049180
2285,YUE_M6(Hu),YUE,P9,M,18.20,90,23.0,4.945055
2286,YUE_M6(Hu),YUE,Q0,M,15.86,88,23.0,5.548550


In [25]:
t.test_2_2(speech_rate_df)

'Success'

**Question 2.3** <br> {points: 1}  

Since we are exploring *language* information rates and not just the speech rate of individuals, we are going to have to calculate the average `speech_rate` for each of the languages in our dataset. 

Calculate the mean speech rate of each of the languages from `speech_rate_df`. This should be a dataframe that only contains the `iso_lang` column, the `speech_rate` and an index with integer values. 

*Hint: You may need to reset your index.* 
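
To see why the hint matters, here is a minimal sketch on toy data (the dataframe and its column names `group` and `value` are hypothetical). Aggregating after a `groupby` gives a Series indexed by the group labels, and `reset_index()` turns it back into a dataframe with an integer index.

```python
import pandas as pd

# Toy data (hypothetical values, not the assignment data)
toy_df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "value": [1.0, 3.0, 10.0, 20.0],
})

# Grouping and aggregating gives a Series indexed by the group labels
group_means = toy_df.groupby("group")["value"].mean()

# reset_index() moves the labels into a regular column and
# restores a default integer index, giving a two-column dataframe
group_means_df = group_means.reset_index()

print(group_means_df)
```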


*Save the dataframe in an object named `lang_speech_rt_df`*. 

In [26]:
lang_speech_rt_df = speech_rate_df.groupby('iso_lang')['speech_rate'].mean().reset_index()
lang_speech_rt_df

Unnamed: 0,iso_lang,speech_rate
0,CAT,7.06533
1,CMN,5.857748
2,DEU,6.09194
3,ENG,6.338201
4,EUS,7.540279
5,FIN,7.171397
6,FRA,6.87613
7,HUN,5.868333
8,ITA,7.162335
9,JPN,8.034726


In [27]:
t.test_2_3(lang_speech_rt_df)

'Success'

**Question 2.4** <br> {points: 1}  


Combine the `lang_speech_rt_df` dataframe with our original language dataframe `lang_df` so that we can resume our language analysis and assess the information rates of different languages. Make sure that you are merging using a common key column name.
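
As a rough illustration of merging on a common key, the sketch below uses two toy tables (the tables, the column name `key` and the values are hypothetical). The `validate` argument is optional; it is included here only as one way to catch accidentally duplicated keys.

```python
import pandas as pd

# Toy tables (hypothetical values) that share the key column "key"
left = pd.DataFrame({"key": ["x", "y", "z"], "name": ["Ex", "Why", "Zed"]})
right = pd.DataFrame({"key": ["x", "y"], "rate": [1.5, 2.5]})

# Inner join (the default): keep only keys present in both tables.
# validate="one_to_one" raises an error if a key is duplicated on either side.
merged = pd.merge(left, right, on="key", how="inner", validate="one_to_one")

print(merged)
```

The row with key `z` is dropped because it has no match in the second table, which is worth keeping in mind when checking the number of rows after a merge.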


*Save the dataframe in an object named `lang_rates_df`*. 

In [28]:
lang_rates_df = pd.merge(lang_df, lang_speech_rt_df, on='iso_lang')
lang_rates_df

Unnamed: 0,iso_lang,language,information_density,distinct_syllables,continent,id,lat,lon,speech_rate
0,CAT,Catalan,5.49,3600,Europe,20.0,42.546245,1.601554,7.06533
1,CMN,Mandarin,6.96,1274,Asia,156.0,35.86166,104.195397,5.857748
2,DEU,German,6.08,5100,Europe,276.0,51.165691,10.451526,6.09194
3,ENG,English,7.09,6949,Europe,826.0,55.378051,-3.435973,6.338201
4,EUS,Basque,4.83,2082,Europe,,,,7.540279
5,FIN,Finnish,5.49,3844,Europe,246.0,61.92411,25.748151,7.171397
6,FRA,French,6.68,2949,Europe,250.0,46.227638,2.213749,6.87613
7,HUN,Hungarian,5.9,4325,Europe,348.0,47.162494,19.503304,5.868333
8,ITA,Italian,5.29,2729,Europe,380.0,41.87194,12.56738,7.162335
9,JPN,Japanese,5.03,643,Asia,392.0,36.204824,138.252924,8.034726


In [29]:
t.test_2_4(lang_rates_df)

'Success'

**Question 2.5** <br> {points: 3}  

Alright! Now we can finally take a look at our information rates via the two columns `information_density` and `speech_rate`. 

Create a scatter plot that maps the information density on the x-axis and the speech rate on the y-axis for each of the languages from the `lang_rates_df`  dataframe. Zoom in to the extent of the data instead of forcing zero to be in the axis and assign a size of 50 to each point. 

Use a colour channel for the same column as in the scatter plot from **Question 1.2**.  

Make sure to give each axis a label and an appropriate title. 



*Save the plot in an object named `info_rate_plot`*. 

In [30]:
info_rate_plot = alt.Chart(lang_rates_df).mark_circle(size=50).encode(
    alt.X('information_density', scale=alt.Scale(zero=False), title="Information Density"),
    alt.Y('speech_rate', scale=alt.Scale(zero=False), title="Speech Rate"),
    alt.Color('continent')).properties(
    title='Languages Speech Rate & Information Density')
info_rate_plot

In [31]:
t.test_2_5_1(info_rate_plot)

'Success'

In [32]:
t.test_2_5_2(info_rate_plot)

'Success'

In [33]:
t.test_titles(info_rate_plot)

'Success'

**Question 2.6** <br> {points: 2}

Alright, what have we learned about the information rates of languages? Which of the following observations can be said regarding the plot `info_rate_plot` from the question above?

***Remember that a high value in both speech rate and information density indicates a high information rate.***

Select all that apply:

i) There seems to be a big difference between Asian and European languages. 

ii) There do not seem to be any big differences between Asian and European languages. 

iii) There seem to be no languages that have a really high information rate (upper right corner of the plot).

iv) There seem to be no languages that have a really low information rate (lower left corner of the plot).

v) Information density and speech rate appear to have a linear positive relationship.

*To answer the question, select all that apply and add the letter(s) associated with the correct answer(s) to a list and assign it to a variable named `answer2_6`. For example, if you believe that i) and ii) are True, then your answer would look like this:*

`answer2_6 = ["i", "ii"]`

In [34]:
answer2_6 = ["ii","iii","iv"]
answer2_6

['ii', 'iii', 'iv']

In [35]:
t.test_2_6(answer2_6)

'Success'

# 3. Language information rate

Oh those results are very interesting…

It looks like there is no language that is high in both information density and speech rate. This may indicate that the human mind is not good at processing auditory information beyond a certain rate. (You can read more in the [article](https://advances.sciencemag.org/content/5/9/eaaw2594) if you're interested.)

In the previous question we looked at the average speech rate of each language. However, we have learned that it's important to look at distributions and not just a single summary value. Since we are looking at many different candidates speaking, there will be different speech rates and therefore different rates at which information is communicated.

Let’s directly plot the distribution of the rate at which information is conveyed, using an "information rate" measure, and see if there are at least small differences between the languages. This might help our analysis and answer our question of whether all languages convey information at a similar rate.


**Question 3.1** <br> {points: 1}

Before we can begin to plot, we need to add a few columns to our `speech_rate_df` dataframe so that we can calculate each speaker's information rate.

First, select the `iso_lang`, `information_density`, `continent`, `language`, `id`, `lat` and `lon` columns from the `lang_rates_df` dataframe and merge them with the `speech_rate_df` dataframe. This will help us differentiate between the two continents and let us calculate the information rate for each speaker in the `speech_rate_df`. 

*Save the dataframe in an object named `speech_larger_df`*. 

In [36]:
speech_larger_df = pd.merge(
    speech_rate_df,
    lang_rates_df.loc[:, ['iso_lang', 'information_density', 'continent',
                          'language', 'id', 'lat', 'lon']],
    on='iso_lang')
speech_larger_df

Unnamed: 0,speaker,iso_lang,text,sex,duration,syllables,age,speech_rate,information_density,continent,language,id,lat,lon
0,CAT_F1(Su),CAT,O1,F,12.43,88,42.0,7.079646,5.49,Europe,Catalan,20.0,42.546245,1.601554
1,CAT_F1(Su),CAT,O2,F,16.92,118,42.0,6.973995,5.49,Europe,Catalan,20.0,42.546245,1.601554
2,CAT_F1(Su),CAT,O3,F,19.96,139,42.0,6.963928,5.49,Europe,Catalan,20.0,42.546245,1.601554
3,CAT_F1(Su),CAT,O4,F,22.42,142,42.0,6.333631,5.49,Europe,Catalan,20.0,42.546245,1.601554
4,CAT_F1(Su),CAT,O6,F,13.49,99,42.0,7.338769,5.49,Europe,Catalan,20.0,42.546245,1.601554
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2283,YUE_M6(Hu),YUE,P3,M,16.82,83,23.0,4.934602,6.53,Asia,Cantonese,344.0,22.396428,114.109497
2284,YUE_M6(Hu),YUE,P8,M,15.25,77,23.0,5.049180,6.53,Asia,Cantonese,344.0,22.396428,114.109497
2285,YUE_M6(Hu),YUE,P9,M,18.20,90,23.0,4.945055,6.53,Asia,Cantonese,344.0,22.396428,114.109497
2286,YUE_M6(Hu),YUE,Q0,M,15.86,88,23.0,5.548550,6.53,Asia,Cantonese,344.0,22.396428,114.109497


In [37]:
t.test_3_1(speech_larger_df)

'Success'

**Question 3.2** <br> {points: 2}

Ok, let's now calculate the information rate for each speaker and add it as a new column named `information_rate` in the `speech_larger_df` dataframe.

The information rate (in bits of information per second) can be calculated with the following equation: 

$\text{information rate} = \text{speech rate} * \text{information density}$ 

Since we will have multiple information rates for each language, we can visualize them in a distribution and attempt to answer our question if all languages convey information at a similar rate.


*Save the dataframe in an object named `speech_full_df`*. 

In [39]:
speech_full_df = speech_larger_df.assign(information_rate = 
                                         speech_larger_df['speech_rate']*speech_larger_df['information_density'])
speech_full_df

Unnamed: 0,speaker,iso_lang,text,sex,duration,syllables,age,speech_rate,information_density,continent,language,id,lat,lon,information_rate
0,CAT_F1(Su),CAT,O1,F,12.43,88,42.0,7.079646,5.49,Europe,Catalan,20.0,42.546245,1.601554,38.867257
1,CAT_F1(Su),CAT,O2,F,16.92,118,42.0,6.973995,5.49,Europe,Catalan,20.0,42.546245,1.601554,38.287234
2,CAT_F1(Su),CAT,O3,F,19.96,139,42.0,6.963928,5.49,Europe,Catalan,20.0,42.546245,1.601554,38.231964
3,CAT_F1(Su),CAT,O4,F,22.42,142,42.0,6.333631,5.49,Europe,Catalan,20.0,42.546245,1.601554,34.771632
4,CAT_F1(Su),CAT,O6,F,13.49,99,42.0,7.338769,5.49,Europe,Catalan,20.0,42.546245,1.601554,40.289844
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2283,YUE_M6(Hu),YUE,P3,M,16.82,83,23.0,4.934602,6.53,Asia,Cantonese,344.0,22.396428,114.109497,32.222949
2284,YUE_M6(Hu),YUE,P8,M,15.25,77,23.0,5.049180,6.53,Asia,Cantonese,344.0,22.396428,114.109497,32.971148
2285,YUE_M6(Hu),YUE,P9,M,18.20,90,23.0,4.945055,6.53,Asia,Cantonese,344.0,22.396428,114.109497,32.291209
2286,YUE_M6(Hu),YUE,Q0,M,15.86,88,23.0,5.548550,6.53,Asia,Cantonese,344.0,22.396428,114.109497,36.232030


In [40]:
t.test_3_2(speech_full_df)

'Success'

**Question 3.3** <br> {points: 1}

What is the average information rate for each language? Create a pandas Series that contains the name of each language and its mean information rate, sorted in ascending order.

*Save the **pandas Series** in an object named `mean_info_rate`*.

In [48]:
mean_info_rate = speech_full_df.groupby('language')['information_rate'].mean().sort_values()
mean_info_rate

language
Thai          33.797382
Hungarian     34.623162
Cantonese     36.384961
Basque        36.419546
German        37.038995
Turkish       37.625198
Italian       37.888751
Catalan       38.788659
Serbian       39.132096
Finnish       39.370971
Korean        39.581189
Japanese      40.414673
Mandarin      40.769927
Spanish       41.956637
Vietnamese    42.534749
English       44.937847
French        45.932546
Name: information_rate, dtype: float64

In [49]:
t.test_3_3(mean_info_rate)

'Success'

**Question 3.4** <br> {points: 3}

Create a rug plot using the `speech_full_df` dataframe mapping the language on one axis and the information rate on the other. Colour this by the same variable as in the previous two scatter plots. Set opacity to 0.3 and a size of 10 for the marks. Make sure you scale your x-axis so that zero is not required in your plot.

Make sure that you sort the languages in order of **ascending mean information rate**.

*Hint: You'll have to use `mean_info_rate` here*.

Make sure to give each axis a label. Since we will be layering this plot, there is no need to add a title yet. 
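
One way to obtain an explicit category order from a sorted Series is sketched below on toy data (the dataframe and its column names `category` and `value` are hypothetical). The resulting list is the kind of object that can be passed to the `sort` argument of an Altair axis encoding.

```python
import pandas as pd

# Toy data (hypothetical values, not the assignment data)
toy_df = pd.DataFrame({
    "category": ["b", "a", "c", "a", "b", "c"],
    "value": [5.0, 1.0, 3.0, 2.0, 7.0, 4.0],
})

# Mean per category, sorted from lowest to highest
category_means = toy_df.groupby("category")["value"].mean().sort_values()

# The index now holds the category names in ascending order of their mean
category_order = category_means.index.to_list()

print(category_order)
```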

*Save the plot in an object named `info_rate_dist`*. 

In [64]:
info_rate_dist = alt.Chart(speech_full_df).mark_tick(opacity=0.3, size=10).encode(
    alt.X('information_rate', scale=alt.Scale(zero=False), title='Information Rate'),
    alt.Y('language', sort=mean_info_rate.index.to_list(), title='Language'),
    alt.Color('continent')).properties(title='Languages Information Rate')
info_rate_dist

In [65]:
t.test_3_4_1(info_rate_dist)

'Success'

In [66]:
t.test_3_4_2(info_rate_dist)

'Success'

In [67]:
t.test_3_4_3(info_rate_dist)

'Success'

**Question 3.5** <br> {points: 3}

This time, we want to include a mean value in our plot. Create a `.mark_circle()` plot **chained from the plot `info_rate_dist` from Question 3.4**, however this should only plot the mean information rate for each language. The point should be black and size 40. Save this in an object named `mean_info_plot`. 


Next, layer the plots `info_rate_dist` from **Question 3.4** and `mean_info_plot` together and save it in an object named `info_dists`.

An appropriate title should be added to the combined plots and not the plots individually (This is because we will be presenting this plot with others in part 5).



In [183]:
mean_info_plot = info_rate_dist.properties(title='').mark_circle(size=40).encode(
    alt.X('mean(information_rate)'),
    alt.Y('language', sort=mean_info_rate.index.to_list()),
    color=alt.value('black'))

info_dists = (info_rate_dist + mean_info_plot).properties(title='Information Rate for Languages')

info_dists

In [184]:
t.test_3_5_1(info_dists)

'Success'

In [185]:
t.test_3_5_2(info_dists)

'Success'

In [186]:
t.test_3_5_3(info_dists)

'Success'

**Question 3.6** <br> {points: 3}

Which language has the highest mean information rate for each continent? 

Save the languages as strings in objects named `asian_highest` and `european_highest`. 
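
If you would like to check a reading of the plot programmatically, one possible approach is sketched below on toy data (the dataframe and its column names `region`, `item` and `score` are hypothetical stand-ins). It finds the item with the highest mean score within each region.

```python
import pandas as pd

# Toy data (hypothetical values, not the assignment data)
toy_df = pd.DataFrame({
    "region": ["north", "north", "north", "south", "south", "south"],
    "item": ["a", "a", "b", "c", "d", "d"],
    "score": [1.0, 3.0, 5.0, 8.0, 2.0, 4.0],
})

# Mean score of each item, keeping the region it belongs to
item_means = toy_df.groupby(["region", "item"])["score"].mean().reset_index()

# idxmax() returns, for each region, the row label of the highest mean
top_rows = item_means.loc[item_means.groupby("region")["score"].idxmax()]

print(top_rows)
```

Replacing `idxmax()` with `idxmin()` would give the item with the lowest mean in each region instead.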

In [108]:
asian_highest = "Vietnamese"
european_highest = "French" 

In [109]:
t.test_3_6_1(european_highest)

'Success'

In [110]:
# check that the variable exists
assert 'asian_highest' in globals(), "Please make sure that your solution is named 'asian_highest'"

# This test has been intentionally hidden. It will be up to you to decide if your solution
# is sufficiently good.

**Question 3.7** <br> {points: 2}

Which language has the lowest mean information rate for each continent? 

Save each language as a string in an object named `asian_lowest` and `european_lowest`. 

In [111]:
asian_lowest = "Thai"
european_lowest = "Hungarian"

In [112]:
t.test_3_7_1(european_lowest)

'Success'

In [113]:
t.test_3_7_2(asian_lowest)

'Success'

**Question 3.8** <br> {points: 1}

Which language appears to be the least consistent (greatest range of values) when it comes to communicating information? 

Save your answer as a string in an object named `answer3_8`.
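
The "greatest range of values" criterion can also be computed directly. The sketch below does this on toy data (the dataframe and its column names `group` and `value` are hypothetical); it is only meant to illustrate the definition of range used in this question.

```python
import pandas as pd

# Toy data (hypothetical values, not the assignment data)
toy_df = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b"],
    "value": [10.0, 12.0, 11.0, 5.0, 20.0, 9.0],
})

# Range = maximum minus minimum within each group
value_range = toy_df.groupby("group")["value"].agg(lambda s: s.max() - s.min())

# Label of the group with the widest spread
widest = value_range.idxmax()

print(value_range)
print(widest)
```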

In [114]:
answer3_8 = "Vietnamese"

In [115]:
t.test_3_8(answer3_8)

'Success'

**Question 3.9** <br> {points: 1}

Let's return to our original question: **"Do all languages convey information at a similar rate?"** 
Although we would need more statistical methods to answer this properly, generally speaking, how would you answer it by looking at the plot above? 


A) Languages generally communicate information at an approximately similar rate (within 20 units). 

B) There is quite a bit of variation between the rates at which information is communicated.

C) Languages from the same continent communicate information at a similar rate.

To answer the question, assign the letter associated with the correct answer to a variable in the code cell below.

*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between `""`, assign the correct answer to an object called `answer3_9`.*


In [116]:
answer3_9 = "A"
answer3_9

'A'

In [117]:
t.test_3_9(answer3_9)

'Success'

# 4. Making Maps 

What a perfect opportunity to practice the mapping skills we just learned. Since our data includes geographical columns, a map may communicate the mean information rate of each language better than a standard chart or plot.

In these questions, we will be using the `world_110m` TopoJSON file of world countries (at 1:110 million scale) from the `vega_datasets` library. 

In [118]:
world_data = data.world_110m

**Question 4.1** <br> {points: 1}

In the lecture, you learned that TopoJSON is a specialized format that needs to be parsed using Altair to select the desired feature object from the topology. 

Fill in the code below to indicate that we wish to extract the GeoJSON features for the `countries` object from the `world_data` URL:

*Save the data in an object named `world_map`.*

In [119]:
world_map = alt.topo_feature(data.world_110m.url, 'countries')

In [120]:
t.test_4_1(world_map)

'Success'

**Question 4.2** <br> {points: 1}

Although the `world_map` data provides what we need to make a map, the reason for making this map is to communicate the mean information rate of each language. Using the `speech_full_df` dataframe we made in **Question 3.2**, make a new dataframe where you group by the columns:
- `language`
- `id` (the [ISO 3166-1 numeric code](https://en.wikipedia.org/wiki/ISO_3166-1_numeric) of the country)
- `lat`, and
- `lon`

Then obtain the mean values of all other columns from `speech_full_df` that we did not group by. Don't forget to reset your index to make sure that `language` and `id` are still available as columns for Altair to use. 

*Save this in a dataframe named `mean_rates_df`.*
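
The effect of resetting the index is easiest to see on a small example. The sketch below uses an invented four-row dataframe (all values and the labels `X` and `Y` are hypothetical); it is meant to illustrate the pattern, not to reproduce the assignment's data.

```python
import pandas as pd

# Toy data with invented values; every non-key column is numeric.
toy = pd.DataFrame({
    "language": ["X", "X", "Y", "Y"],
    "id": [1, 1, 2, 2],
    "lat": [10.0, 10.0, 20.0, 20.0],
    "lon": [5.0, 5.0, 15.0, 15.0],
    "speech_rate": [6.0, 8.0, 5.0, 7.0],
    "information_rate": [40.0, 44.0, 30.0, 34.0],
})

keys = ["language", "id", "lat", "lon"]

# After grouping, the key columns live in the index, not in the columns.
grouped = toy.groupby(keys).mean()
print(list(grouped.columns))

# reset_index moves the keys back into regular columns.
flat = grouped.reset_index()
print(list(flat.columns))
```

If the dataframe also held non-numeric columns outside the grouping keys, recent pandas versions would likely need `mean(numeric_only=True)`.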

In [121]:
mean_rates_df = speech_full_df.groupby(['language','id','lat','lon']).mean().reset_index()

mean_rates_df

| | language | id | lat | lon | duration | syllables | age | speech_rate | information_density | information_rate |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Cantonese | 344.0 | 22.396428 | 114.109497 | 14.550333 | 80.733333 | 22.0 | 5.57197 | 6.53 | 36.384961 |
| 1 | Catalan | 20.0 | 42.546245 | 1.601554 | 16.640533 | 116.8 | 35.4 | 7.06533 | 5.49 | 38.788659 |
| 2 | English | 826.0 | 55.378051 | -3.435973 | 13.015 | 81.666667 | | 6.338201 | 7.09 | 44.937847 |
| 3 | Finnish | 246.0 | 61.92411 | 25.748151 | 14.5928 | 104.0 | 33.2 | 7.171397 | 5.49 | 39.370971 |
| 4 | French | 250.0 | 46.227638 | 2.213749 | 13.391133 | 91.333333 | 32.5 | 6.87613 | 6.68 | 45.932546 |
| 5 | German | 276.0 | 51.165691 | 10.451526 | 16.0624 | 95.4 | | 6.09194 | 6.08 | 37.038995 |
| 6 | Hungarian | 348.0 | 47.162494 | 19.503304 | 17.8628 | 103.266667 | 39.3 | 5.868333 | 5.9 | 34.623162 |
| 7 | Italian | 380.0 | 41.87194 | 12.56738 | 14.493148 | 101.833333 | | 7.162335 | 5.29 | 37.888751 |
| 8 | Japanese | 392.0 | 36.204824 | 138.252924 | 17.275667 | 138.133333 | 30.6 | 8.034726 | 5.03 | 40.414673 |
| 9 | Korean | 410.0 | 35.907757 | 127.766922 | 16.478467 | 115.933333 | 28.6 | 7.118919 | 5.56 | 39.581189 |


In [122]:
t.test_4_2(mean_rates_df)

'Success'

**Question 4.3** <br> {points: 3}

Ok, now we've prepared our data, it's time to make the map.

Since this code is a little tricky, we've provided you with the majority of it. You only need to *fill in the blanks*, that is, *replace each `...` with the necessary code*. 

Let's explain the code below before you start.

To make this map, we use 2 Chart objects: `background` and `foreground`.  
We need to use the `world_map` data for the `background` map and both `world_map` and `mean_rates_df` for the `foreground` map where we want to colour the countries based on the `information_rate` values. 

Since the `foreground` map joins 2 data sources, we need to use `.transform_lookup()` and `alt.LookupData()` to merge them, using `id` in each source as the common column. The columns we wish to fetch from the `mean_rates_df` dataframe are `information_rate` and `language`. 
This may be a little confusing, so feel free to consult the documentation we've provided [here](https://altair-viz.github.io/altair-viz-v4/user_guide/transform/lookup.html#example-lookup-transforms-for-geographical-visualization).
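
Conceptually, a lookup behaves much like a one-sided (left) join: every row of the primary source is kept, and the requested columns are attached wherever the key matches. The pandas sketch below mimics that idea on hypothetical stand-in data (three "shapes", rates for only two of them); it is an analogy for building intuition, not what Altair runs internally.

```python
import pandas as pd

# Hypothetical stand-ins: three map shapes, but rates for only two of them.
shapes = pd.DataFrame({"id": [1, 2, 3], "shape": ["shape_1", "shape_2", "shape_3"]})
rates = pd.DataFrame({
    "id": [1, 3],
    "language": ["X", "Y"],
    "information_rate": [40.0, 35.0],
})

# Keep every shape and attach the two requested columns where the id matches.
joined = shapes.merge(
    rates[["id", "language", "information_rate"]], on="id", how="left"
)
print(joined)
```

The shape with `id` 2 has no match, so its `language` and `information_rate` are missing. In the same way, countries that do not appear in `mean_rates_df` have no `information_rate` to colour by.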

We are also assigning both the `language` and `information_rate` columns to the `tooltip` channel, which will help us confirm exact values and labels. 

You can use any colour scheme you think is appropriate but take a look [here](https://vega.github.io/vega/docs/schemes/) for suggestions.

Finally, we layer these two charts together in `lang_map` and make sure we give it a `naturalEarth1` projection with the plot zoomed in to `200` and panned to `120, 260`. 

In [135]:
# background = alt.Chart(...).mark_geoshape(color='white', stroke="grey")
#
# foreground = (
#     alt.Chart(...)
#     ....(stroke="black", strokeWidth=0.15)
#     .encode(
#         color=alt.Color(
#             "...:Q", scale=alt.Scale(...="..."), title = 'Mean information Rate (info bits/second)'
#         ),
#         tooltip=[
#             alt.Tooltip("...:N", title="Language"),
#             alt.Tooltip("...:Q", title="Mean information Rate (info bits/second)", format='.2f'),
#         ],
#     )....(
#         ...="id",
#         from_=alt....(mean_rates_df, "id", ["...", "..."]),
#     )
# )

# lang_map = (
#     (... + ...)
#     .properties(width=650, height=300)
#     ....("...", scale=..., ...=[120, 260])
# )

background = alt.Chart(world_map).mark_geoshape(color='white', stroke="grey")

foreground = (
     alt.Chart(world_map)
     .mark_geoshape(stroke="black", strokeWidth=0.15)
     .encode(
         color=alt.Color(
             "information_rate:Q", scale=alt.Scale(scheme="blues"), title = 'Mean information Rate (info bits/second)'
         ),
         tooltip=[
             alt.Tooltip("language:N", title="Language"),
             alt.Tooltip("information_rate:Q", title="Mean information Rate (info bits/second)", format='.2f'),
         ],
     ).transform_lookup(
         lookup="id",
         from_=alt.LookupData(mean_rates_df, "id", ["language", "information_rate"]),
     )
 )

lang_map = (
     (background + foreground)
     .properties(width=650, height=300)
     .project(type='naturalEarth1',scale=200, translate=[120, 260])
 )

lang_map


In [136]:
t.test_4_3_1(lang_map)

'Success'

In [137]:
t.test_4_3_2(lang_map)

'Success'

In [138]:
t.test_4_3_3(lang_map)

'Success'

**Question 4.4** <br> {points: 2}

Looking at the map above, which language has the greatest information rate? 

*Save your answer as a string in an object named `answer4_4`*. 

In [139]:
answer4_4 = "French"
answer4_4

'French'

In [140]:
# check that the variable exists
assert 'answer4_4' in globals(
), "Please make sure that your solution is named 'answer4_4'"

# This test has been intentionally hidden. It will be up to you to decide if your solution
# is sufficiently good.

**Question 4.5** <br> {points: 3}

What about a point map, this time mapping the speech rate to the size channel of each point? 

Let's explain the code below. 

To make this map, we use 2 Chart objects: `background` and `points`.  
We again use the `world_map` data for the `background` map and `mean_rates_df` for the `points` map, where we want to assign each point's size to the `speech_rate` of each country. 

Unlike before, we no longer need to join the 2 data sources, since we have `lat` and `lon` values we can use to place the points. That means that for the `points` chart we simply use `mean_rates_df` as the source, map `lat` and `lon` to the `latitude` and `longitude` channels, and map `speech_rate` to the `size` channel. 

Just as in the previous question, we are assigning the `language` and `speech_rate` columns to the `tooltip` channel, which will help us confirm exact values and labels. 

Finally, we layer these two charts together in `point_map` and make sure we give it a `naturalEarth1` projection with the plot zoomed in to `280` and panned to `90, 370`. 
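
The template below fixes the size scale with `domain=[4,9]` and `range=[50,500]`. Assuming the default linear scale for a quantitative field, a speech rate is mapped to a point size (the mark's area in square pixels) by straight-line interpolation between those endpoints. The helper `linear_scale` below is a hypothetical illustration of that arithmetic, not part of Altair.

```python
def linear_scale(value, domain=(4, 9), range_=(50, 500)):
    """Linearly interpolate `value` from `domain` onto `range_` (no clamping)."""
    d0, d1 = domain
    r0, r1 = range_
    # Fraction of the way through the domain, applied to the width of the range.
    return r0 + (value - d0) / (d1 - d0) * (r1 - r0)


# A rate at the bottom, middle and top of the domain.
for rate in (4, 6.5, 9):
    print(rate, linear_scale(rate))
```

Fixing the domain by hand, rather than letting it follow the data, keeps point sizes comparable if the data changes.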


In [145]:
# background = alt.Chart(...)....(color='white', stroke="grey")

# points = (
#     alt.Chart(...)
#     ....()
#     .encode(
#         ...='lon',
#         ...='lat',
#         ...=alt.Size('...', scale=alt.Scale(domain=[4,9], range=[50,500]), title="Average Speech Rate (syllables/sec)"),
#         tooltip=[
#             alt.Tooltip("...:N", title="Language"),
#             alt.Tooltip("...:Q", title="Average Speech Rate (syllables/sec)", format='.2f'),
#         ]
# ))


# point_map = (
#     (... + ...)
#     .properties(width=680, height=320)
#     ....("...", scale=280, ...)
# )

background = alt.Chart(world_map).mark_geoshape(color='white', stroke="grey")

points = (
     alt.Chart(mean_rates_df)
     .mark_circle()
     .encode(
         longitude='lon',
         latitude='lat',
         size=alt.Size('speech_rate', scale=alt.Scale(domain=[4,9], range=[50,500]), title="Average Speech Rate (syllables/sec)"),
         tooltip=[
             alt.Tooltip("language:N", title="Language"),
             alt.Tooltip("speech_rate:Q", title="Average Speech Rate (syllables/sec)", format='.2f'),
         ]
 ))


point_map = (
     (background + points)
     .properties(width=680, height=320)
     .project(type="naturalEarth1", scale=280, translate = [90,370])
 )

point_map

In [146]:
t.test_4_5_1(point_map)

'Success'

In [147]:
t.test_4_5_2(point_map)

'Success'

In [148]:
t.test_4_5_3(point_map)

'Success'

**Question 4.6** <br> {points: 1}

Which map is more effective at communicating its respective rate? (Some data may be missing from the plot `lang_map`. Also consider how the data is presented.)


A) `lang_map` from **Question 4.3** 

B) `point_map` from **Question 4.5** 

To answer the question, assign the letter associated with the correct answer to a variable in the code cell below.

*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between `""`, assign the correct answer to an object called `answer4_6`.*

In [151]:
answer4_6 = "A"
answer4_6

'A'

In [152]:
t.test_4_6(answer4_6)

'Success'

# 5. Figure Composition and Narrative


Now that we have the individual figures, the final step for us here is to compile our figures into the same layout for presentation and give it a strong narrative. 

This will help communicate your findings to your reader as efficiently as possible. 

**Question 5.1** <br> {points: 1}

Using the plots `annotated_plot` from **Question 1.5** and `info_rate_plot` from **Question 2.5**, combine the two plots horizontally. 

Give each plot a height and width of 150 and 180, respectively, and remove any individual titles, giving one overall title instead. (The test cannot check this, so please take care to remove the individual plot titles.) 
Make sure to anchor your overall title in the middle of the plot. 

*Save your combined plots in an object named `presentation_scatter`.*

In [174]:
presentation_scatter = (annotated_plot.properties(title = '', height=150, width=180) | 
                        info_rate_plot.properties(title = '', height=150, width=180)).properties(
    title=alt.TitleParams(text="Languages Syllables and Information Density", anchor='middle'))

presentation_scatter

In [175]:
t.test_5_1(presentation_scatter)

'Success'

**Question 5.2** <br> {points: 5}

Using the plots `annotated_plot` from **Question 1.5**, `info_rate_plot` from **Question 2.5**, `info_dists` from **Question 3.5**, and `lang_map` from **Question 4.3**, arrange these plots into a layout worthy of a presentation. 
The overall layout should look even.
Set only 1 overall title and subtitle for the entire plot presentation. 

*Save your combined plots in an object named `presentation_large`.*

In [193]:
presentation_large = ((annotated_plot.properties(width=200, height=100, title='') 
                      | info_rate_plot.properties(width=200, height=100, title='')) & info_dists.properties(width=400, height=200, title='')).properties(
    title=alt.TitleParams("Languages Data",
    subtitle = "French language has top information rate", anchor='middle'))

presentation_large

In [194]:
t.test_5_2_1(presentation_large)

'Success'

In [195]:
t.test_5_2_2(presentation_large)

'Success'

In [196]:
t.test_5_2_3(presentation_large)

'Success'

**Question 5.3** <br> {points: 2}

These results are looking pretty promising. Now we just need to add a narrative so that it will be easy for others to follow along with what we have done here. 

Below we have told 2 stories using the three plots we have created. 


Which of the following stories is clearest about what the plots are telling us and how each plot motivates the next one we make? There should be a clear storyline for the reader to follow as well as a clear take-home message.

##### Narrative A

> This dataset explores the different languages across the world and how they translate and communicate information among speakers. While looking at the relationship between the number of distinct syllables and the information density, we discovered that the French language does not follow a trend similar to its European cousins and appeared to resemble more like Asian languages. With that in tow, we see that there is a strong positive relationship between the number of distinct syllables and the information density.
>
> The plot from with information rate on the x-axis and the number of syllables on the y-axis, in contrast to the first plot, shows a negative relationship but this time between information density and speech rate. It appears that the faster the language is spoken, the lower the information density. It seems there is a trade-off between the speed at which a language is spoken and information density.
>
> Segue into our last plots with information rate on the x-axis and the languages on the y-axis, we learn that most languages in our data have a similar mean information rate even though the distributions are quite varied. The language with the greatest information rate appears to be French and the language with the lowest information rate is Thai. That being said it appears that there is some limitation that prevents humans from processing information through language most efficiently.


##### Narrative B 

> Here we explored the amount of information conveyed by languages of European and Asian descent. First, we examined the relationship between the number of distinct syllables in these languages and their information density (i.e. bits of information per syllable). We discovered that as the number of distinct syllables increased, so did the information density. This finding could suggest that speakers from languages with a higher number of distinct syllables may have an advantage in conveying more information to others, compared to speakers who speak languages with a lower number of distinct syllables. Given that speech is how information is conveyed, we turned to spoken language and the rate of spoken language to see if such an advantage exists.
> 
> To determine if more information is conveyed by one language group over another, we specifically looked at the rate at which syllables are spoken (labelled as speech rate). Here we observed slower speech rates for speakers of languages with a higher information density. This relationship is displayed in the plot with information rate on the x-axis and the languages on the y-axis. This finding suggests that there may be a balance between the information density of a language and the rate at which syllables of that language can be spoken. We hypothesized that there may be some cognitive limitations that are causing this relationship.
>
> If there was indeed a human cognitive limit on receiving auditory information, then we would predict that the information receiving rate (referred to as information rate in our plots) should fall within a close range for all human languages. When we explored the information rate for all languages in our dataset, we observed that most languages clustered together around an information rate of 40 to 45 bits per second conveyed. There was some variation, however, with English and French having the highest information rates, while Thai and Hungarian had the lowest information rates.



*Answer in the cell below by specifying either `"Narrative A"` or `"Narrative B"` as a string, and assign the correct answer to an object called `answer5_3`.*

In [201]:
answer5_3 = "Narrative B"
answer5_3

'Narrative B'

In [202]:
t.test_5_3(answer5_3)

'Success'

# 6. Presenting Figures to a General or a Technical Audience 

Below you can see the original figure from [the scientific article in “Science Advances”](https://advances.sciencemag.org/content/5/9/eaaw2594) and the [simplified version that was published in The Economist](https://www.economist.com/graphic-detail/2019/09/28/why-are-some-languages-spoken-faster-than-others) (you can create a free account to read the article; there is no need to pay or use the trial).
There is also [an R Markdown file for how they did their analysis](https://advances.sciencemag.org/highwire/filestream/218792/field_highwire_adjunct_files/1/aaw2594_Analysis_script_file_S1.zip) in case you're interested.

## Scientific figure

![image.png](img/scientific.jpg)



## Economist figure

![image.png](img/economist_plot.png)

**Question 6.1** <br> {points: 2}

Looking at the 2 plots above, how has the Economist improved the figure from the original scientific publication?

Select all that apply:

i) Removed language family colours to reduce the distraction. 

ii) Removed some of the languages.

iii) Removed title annotations.

iv) Added more data insights so the plots communicate more to the reader. 

v) Replaced abbreviations with full names of languages, countries and variables.

vi) Added extra explanatory labels which explain the variables better.

vii) Re-sorted the languages in a systematic way.

*To answer the question, select all that apply and add the letter(s) associated with the correct answer(s) to a list and assign it to a variable named `answer6_1`. For example, if you believe that i) and ii) are True, then your answer would look like this:*

`answer6_1 = ["i", "ii"]`


In [259]:
answer6_1 = ['i','v','vi', 'ii']
answer6_1

['i', 'v', 'vi', 'ii']

In [260]:
t.test_6_1(answer6_1)

'Success'

**Question 6.2** <br> {points: 2}

Do you agree with the choices made by the Economist? Is there anything you think could have been done differently or additionally, that would have communicated the findings more clearly to a general/popular science audience? 

Select all that apply:

i) Used colour to highlight something interesting (continents) while not overwhelming the visualization.

ii) Added rug plots to the densities to communicate more information. 

iii) Removed title annotations.

iv) Added a subtitle with a clear take-home message. 

v) Kept the median or mean line for all languages to use as a reference point. 

vi) Sorted by the information rate since that is what the article was focusing on. 

*To answer the question, select all that apply and add the letter(s) associated with the correct answer(s) to a list and assign it to a variable named `answer6_2`. For example, if you believe that i) and ii) are True, then your answer would look like this:*

`answer6_2 = ["i", "ii"]`

In [265]:
answer6_2 = ['i','iv','v','vi']
answer6_2

['i', 'iv', 'v', 'vi']

In [266]:
t.test_6_2(answer6_2)

'Success'

## Before Submitting 

Before submitting your assignment please do the following:

- Read through your solutions
- **Restart your kernel, clear output and rerun your cells from top to bottom** 
- Make sure that none of your code is broken 
- Verify that the tests from the questions you answered have obtained the output "Success"

This is a simple way to make sure that you are submitting all the variables needed to mark the assignment. This method should help avoid losing marks due to changes in your environment.  

## Attributions


- Coupé, C., Oh, Y., Dediu, D., & Pellegrino, F. (2019, September 01). Different languages, similar encoding efficiency: Comparable information rates across the human communicative niche. Retrieved March 02, 2021, from https://advances.sciencemag.org/content/5/9/eaaw2594

- The Economist (2019, September 28). Why are some languages spoken faster than others? Retrieved March 02, 2021, from https://www.economist.com/graphic-detail/2019/09/28/why-are-some-languages-spoken-faster-than-others

- Dataset processed and uploaded by Joel Ostblom 

- MDS DSCI 531: Data Visualization I - [MDS's GitHub website](https://github.com/UBC-MDS/DSCI_531_viz-1) 