# Countries/Languages dataset
---

Here we construct a pandas DataFrame of country names and their characteristics (ISO code, continent and official languages).

The data is imported from [this repository](https://github.com/annexare/Countries/tree/master/data), which contains several JSON files. We will load the ```countries.json``` and ```languages.json``` files: the former gives the country names together with their language codes, while the latter maps each code to the language name.
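To make the relationship between the two files concrete, here is a minimal sketch using tiny hand-written excerpts instead of the real files. The key names (`name`, `phone`, `capital`, `currency`, `languages`) are assumptions about the JSON layout; the values are taken from the outputs shown later in this notebook.

```python
import json

# Hypothetical excerpt in the shape we assume countries.json has:
# one entry per ISO2 code, with a list of language codes.
countries_sample = json.loads("""
{
  "AD": {"name": "Andorra", "phone": "376", "capital": "Andorra la Vella",
         "currency": "EUR", "languages": ["ca"]},
  "AF": {"name": "Afghanistan", "phone": "93", "capital": "Kabul",
         "currency": "AFN", "languages": ["ps", "uz", "tk"]}
}
""")

# Hypothetical excerpt in the shape we assume languages.json has:
# one entry per language code, holding the language name.
languages_sample = json.loads("""
{
  "ca": {"name": "Catalan"},
  "ps": {"name": "Pashto"},
  "uz": {"name": "Uzbek"},
  "tk": {"name": "Turkmen"}
}
""")

# Resolve each country's language codes to language names.
resolved = {
    iso2: [languages_sample[code]["name"] for code in record["languages"]]
    for iso2, record in countries_sample.items()
}

for iso2, names in resolved.items():
    print(iso2, countries_sample[iso2]["name"], names)
```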

```countries.json``` also contains ISO and continent codes, which can be resolved with the ```countries2to3.json``` and ```continents.json``` files, respectively. However, for simplicity, we will use the [country-converter](https://pypi.org/project/country-converter/) Python package to get the ISO3 codes and continent names.

In [1]:
import json
import pandas as pd
import country_converter as coco

with open('Countries/countries.json', 'r', encoding='utf-8') as json_file:
    countries_dict = json.load(json_file)
with open('Countries/languages.json', 'r', encoding='utf-8') as json_file:
    languages_dict = json.load(json_file)

In [2]:
## countries
countries = pd.DataFrame(countries_dict).T
countries.reset_index(drop=True, inplace=True)
countries.columns = ['Country_name', 'native', 'Phone_code', 
                     'Continent', 'Capital', 'Currency', 'Language']
# keep the columns of interest; .copy() avoids SettingWithCopyWarning below
countries = countries[['Country_name', 'Phone_code', 'Continent',
                       'Capital', 'Currency', 'Language']].copy()
# change name to the standard short name
countries['Country_name'] = countries.Country_name.apply(
    lambda x: coco.convert(names=x, to='name_short', not_found=None))
# add ISO3 codes
countries['Country_code'] = countries.Country_name.apply(
    lambda x: coco.convert(names=x, to='iso3', not_found=None))
# replace the continent codes with continent names
countries['Continent'] = coco.convert(names=countries.Country_name.tolist(),
                                      to='Continent', not_found=None)
countries.head()



Unnamed: 0,Country_name,Phone_code,Continent,Capital,Currency,Language,Country_code
0,Andorra,376,Europe,Andorra la Vella,EUR,[ca],AND
1,United Arab Emirates,971,Asia,Abu Dhabi,AED,[ar],ARE
2,Afghanistan,93,Asia,Kabul,AFN,"[ps, uz, tk]",AFG
3,Antigua and Barbuda,1268,America,Saint John's,XCD,[en],ATG
4,Anguilla,1264,America,The Valley,XCD,[en],AIA


> Notice that the ```Language``` field holds lists of language codes, as each country can have several official languages.

In [3]:
# languages
languages = pd.DataFrame(languages_dict).T
languages.reset_index(inplace=True)
languages.rename(columns={'index':'code'}, inplace=True)
languages = languages[['code', 'name']]
languages.head()

Unnamed: 0,code,name
0,aa,Afar
1,ab,Abkhazian
2,af,Afrikaans
3,ak,Akan
4,am,Amharic


> Now we replace the language codes in ```countries``` with the language names from ```languages```.

In [4]:
# we create a new dataframe where we keep the country names from countries
# and the language names from languages
langs_list = []
for list_ in countries.Language.values:
    int_l = []
    for code in list_:
        try:
            lang = languages[languages.code == code].name.tolist()[0]
            int_l.append(lang)
        except IndexError:
            # the code has no entry in languages: skip it
            pass
    langs_list.append(int_l)
        
lang_df = pd.DataFrame({'Languages': langs_list})
countries_lang = pd.concat([countries, lang_df], axis=1)
countries_lang.drop(['Language'], axis=1, inplace=True)
countries_lang.head(10)

Unnamed: 0,Country_name,Phone_code,Continent,Capital,Currency,Country_code,Languages
0,Andorra,376,Europe,Andorra la Vella,EUR,AND,[Catalan]
1,United Arab Emirates,971,Asia,Abu Dhabi,AED,ARE,[Arabic]
2,Afghanistan,93,Asia,Kabul,AFN,AFG,"[Pashto, Uzbek, Turkmen]"
3,Antigua and Barbuda,1268,America,Saint John's,XCD,ATG,[English]
4,Anguilla,1264,America,The Valley,XCD,AIA,[English]
5,Albania,355,Europe,Tirana,ALL,ALB,[Albanian]
6,Armenia,374,Asia,Yerevan,AMD,ARM,"[Armenian, Russian]"
7,Angola,244,Africa,Luanda,AOA,AGO,[Portuguese]
8,Antarctica,672,Antarctica,,,ATA,[]
9,Argentina,54,America,Buenos Aires,ARS,ARG,"[Spanish, Guarani]"
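The code-to-name replacement can also be sketched with a plain dictionary lookup, which avoids filtering the ```languages``` dataframe once per code. The toy data below stands in for the real tables; the `Nowhere` entry and its `zz` code are hypothetical and only show what happens to an unknown code.

```python
# Toy stand-in for the languages table: language code -> language name.
code_to_name = {"ca": "Catalan", "ps": "Pashto", "uz": "Uzbek", "tk": "Turkmen"}

# Toy stand-in for the Language column: country -> list of language codes.
country_codes = {
    "Andorra": ["ca"],
    "Afghanistan": ["ps", "uz", "tk"],
    "Antarctica": [],        # no languages listed
    "Nowhere": ["zz"],       # hypothetical country with an unknown code
}

# Translate each list of codes, silently skipping codes without a name.
country_languages = {
    country: [code_to_name[code] for code in codes if code in code_to_name]
    for country, codes in country_codes.items()
}

for country, names in country_languages.items():
    print(country, names)
```

With the real tables, the mapping could be built as `dict(zip(languages.code, languages.name))` and applied to each list in `countries.Language`.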


In [5]:
countries_lang.to_parquet('Countries/countries_lang.parquet')

> The dataframe might be useful as it is but, as a final step, we will expand the _Languages_ field so that each row holds a single language. In doing so, the remaining columns are duplicated for every language of a country.

In [6]:
# one row per (country, language) pair; the other columns are repeated
# countries with no listed language (e.g. Antarctica) produce no rows
countries_lang_full = (countries_lang
                       .explode('Languages')
                       .dropna(subset=['Languages']))

countries_lang_full.reset_index(drop=True, inplace=True)
countries_lang_full.head(20)

Unnamed: 0,Country_name,Phone_code,Continent,Capital,Currency,Country_code,Languages
0,Andorra,376,Europe,Andorra la Vella,EUR,AND,Catalan
1,United Arab Emirates,971,Asia,Abu Dhabi,AED,ARE,Arabic
2,Afghanistan,93,Asia,Kabul,AFN,AFG,Pashto
3,Afghanistan,93,Asia,Kabul,AFN,AFG,Uzbek
4,Afghanistan,93,Asia,Kabul,AFN,AFG,Turkmen
5,Antigua and Barbuda,1268,America,Saint John's,XCD,ATG,English
6,Anguilla,1264,America,The Valley,XCD,AIA,English
7,Albania,355,Europe,Tirana,ALL,ALB,Albanian
8,Armenia,374,Asia,Yerevan,AMD,ARM,Armenian
9,Armenia,374,Asia,Yerevan,AMD,ARM,Russian
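This kind of expansion is what pandas' `DataFrame.explode` does. As a minimal sketch on toy data (not the real dataset), the example below shows one detail worth knowing: a row with an empty list is kept with a missing value, so it has to be dropped explicitly if countries without languages should not appear in the result.

```python
import pandas as pd

# Toy frame with a list-valued column, including one empty list.
toy = pd.DataFrame({
    "Country_name": ["Andorra", "Afghanistan", "Antarctica"],
    "Languages": [["Catalan"], ["Pashto", "Uzbek", "Turkmen"], []],
})

# explode() repeats the other columns once per list element;
# the empty list becomes a single row with a missing value.
exploded = toy.explode("Languages")

# Drop the rows that came from empty lists and renumber the index.
expanded = exploded.dropna(subset=["Languages"]).reset_index(drop=True)

print(expanded)
```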


In [7]:
countries_lang_full.to_csv('Countries/countries_lang_full.csv')