# From CSV to zip to dict

## Slovenian occupations

by Koenraad De Smedt at UiB

---
At the CLARIN.SI repository, there is an [online table with Slovenian occupations in masculine and feminine forms](http://hdl.handle.net/11356/1347), e.g. *artist / artistka*. This can be used to translate masculine to feminine forms (or the other way around).

With these data, this notebook shows how to:

1.  Combine information from two external sources
2.  Read a remote CSV file into a dataframe
3.  Make a dict out of two columns in the dataframe
4.  Write a function that uses the dict to translate depending on gender information from the *genderize* API.

More specifically, given a name and a masculine occupation, the gender API is used to find out whether the first name is likely feminine; if so, the masculine term is translated into the feminine one.

---

In [None]:
import pandas as pd
import requests

Read the online table with occupations into a dataframe. The dataframe has four columns:
1.  The masculine form
2.  An alternative form for the masculine
3.  The feminine form
4.  An alternative form for the feminine

In [None]:
so_url = 'https://www.clarin.si/repository/xmlui/bitstream/handle/11356/1347/Male_and_female_occupations_Slovene.csv?sequence=1'
so_df = pd.read_csv(so_url, sep=';', encoding='utf-8', header=0)
so_df.head(10)
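
To try the same `read_csv` call without network access, here is a minimal sketch that reads a small in-memory table. The header names and the two rows are made up for illustration and are not copied from the CLARIN.SI file; only the semicolon separator and the four-column layout follow the description above.

```python
import io
import pandas as pd

# Hypothetical sample: four semicolon-separated columns, alternatives left empty
sample_csv = "m;m_alt;f;f_alt\nartist;;artistka;\nfrizer;;frizerka;\n"

# io.StringIO lets read_csv treat the string as if it were a file
sample_df = pd.read_csv(io.StringIO(sample_csv), sep=';', header=0)

print(sample_df.shape)
print(sample_df)
```

Empty fields are read as missing values (`NaN`), which is worth keeping in mind when the columns are later turned into a dict.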

Let's change the headers.

In [None]:
so_df.columns = ['masc', 'masc_alt', 'fem', 'fem_alt']
so_df.head(10)

Zip the `masc` and `fem` columns and make a dict that works as a translation dictionary. This means each `masc` item will be a key and the corresponding `fem` item will be the value. Disregard the alternative forms. Test a few occupations.

In [None]:
so_dict = dict(zip(so_df['masc'], so_df['fem']))

print(so_dict['pirotehnik'])
print(so_dict['artist'])
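
Two things can go unnoticed when zipping columns into a dict: rows with a missing value, and repeated keys. The sketch below uses a made-up toy dataframe (the value `artistka2` is a placeholder, not a real form) to show one way to drop incomplete rows first, and what happens when a key occurs twice.

```python
import pandas as pd

# Hypothetical rows: one feminine form is missing, one masculine form is repeated
toy_df = pd.DataFrame({
    'masc': ['artist', 'frizer', 'akrobat', 'artist'],
    'fem': ['artistka', 'frizerka', None, 'artistka2'],
})

# Keep only rows where both forms are present
complete = toy_df.dropna(subset=['masc', 'fem'])

# When a key is repeated, the value from the later row wins
toy_dict = dict(zip(complete['masc'], complete['fem']))
print(toy_dict)
```

Whether dropping rows is appropriate for the real table depends on how complete it is; checking `so_df.isna().sum()` would be one way to find out.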

---
Now let's use this dict to change the occupation from masculine to feminine depending on the likely gender of a name.
First define a function to determine the likely gender of a first name. See a previous [notebook on APIs](https://colab.research.google.com/drive/1vi1T1NPi9YIxVEJ4fClD-ynAmzdPL70F?usp=sharing) for details.

In [None]:
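def find_gender(name, country_code=None):
  # Query the genderize API; the optional country code narrows the lookup
  parameters = {'name': name}
  if country_code:
    parameters['country_id'] = country_code
  response = requests.get('https://api.genderize.io', params=parameters)
  # 'gender' is 'male', 'female', or None when the API has no prediction
  return response.json()['gender']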

find_gender('Marija', country_code='SI')
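
The function above keeps only the `gender` field. As a sketch of how the rest of a response might be used, the example below works on hand-written dicts instead of live API calls. The values are invented, the field name `probability` is assumed to match the API's JSON, and `confident_gender` is a hypothetical helper, not part of any library.

```python
# Illustrative response dicts with made-up values (no API call is made)
known = {'name': 'marija', 'gender': 'female', 'probability': 0.99, 'count': 1000}
unknown = {'name': 'xyzzy', 'gender': None, 'probability': 0.0, 'count': 0}

def confident_gender(response, threshold=0.9):
  # Hypothetical helper: return the gender only if the prediction is confident enough
  gender = response.get('gender')
  if gender is None or response.get('probability', 0.0) < threshold:
    return None
  return gender

print(confident_gender(known))
print(confident_gender(unknown))
```

The threshold of 0.9 is an arbitrary choice for the example.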

Make a function that takes a full name and a masculine occupation as arguments. If the first name is likely female, find and return the feminine version of the occupation, otherwise return the masculine one. Test.

In [None]:
def occupation(name, occ):
  # name is full name, occ is masculine occupation
  firstname = name.split()[0] # assume first word in name is first name
  if find_gender(firstname, country_code='SI') == 'female':
    return so_dict.get(occ)
  else:
    return occ

print(occupation('Marija', 'artist'))
print(occupation('Janez', 'artist'))
print(occupation('Nina', 'kosmonaut')) # not in the table
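
Note that `so_dict.get(occ)` returns `None` when the occupation is not a key in the dict. One possible variant, sketched below, falls back to the masculine form instead. To keep the example runnable offline, `toy_dict` and `fake_gender` are hypothetical stand-ins for `so_dict` and `find_gender`.

```python
toy_dict = {'artist': 'artistka'}  # stand-in for so_dict

def fake_gender(firstname):
  # Stand-in for find_gender: hard-coded answers, no API call
  return {'Marija': 'female', 'Nina': 'female', 'Janez': 'male'}.get(firstname)

def occupation_with_fallback(name, occ):
  firstname = name.split()[0]
  if fake_gender(firstname) == 'female':
    # The second argument of get() is returned when occ is not a key
    return toy_dict.get(occ, occ)
  return occ

print(occupation_with_fallback('Marija', 'artist'))
print(occupation_with_fallback('Nina', 'kosmonaut'))
```

Whether a silent fallback or an explicit `None` is preferable depends on how the result is used afterwards.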

Let's do a slightly longer test. Here is a list of names and a list of Slovenian occupations in the masculine form. Zip names and occupations, then make a dict so that the names are keys and the occupations are values.

In [None]:
names = ['Marija Krajnc', 'Andrej Novak', 'Mojca Horvat', 'Marko Kos',
         'Zarja Novak', 'Janez Kovačič', 'Ana Božič']

occupations = ['artist', 'akrobat', 'baletnik', 'etnolog',
               'statistik', 'vinogradnik', 'frizer']

nodict = dict(zip(names, occupations))
nodict
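
Keep in mind that `zip` stops at the end of the shorter input, so a missing occupation would silently drop a name. The sketch below shows this behaviour with toy lists, together with a hypothetical helper `zip_to_dict` that checks the lengths first.

```python
keys = ['a', 'b', 'c']
values = [1, 2]  # one value short

# zip stops at the shorter list, so 'c' is dropped without warning
print(dict(zip(keys, values)))

def zip_to_dict(keys, values):
  # Hypothetical helper: refuse to build the dict if the lengths differ
  if len(keys) != len(values):
    raise ValueError('keys and values differ in length')
  return dict(zip(keys, values))

print(zip_to_dict(['a', 'b'], [1, 2]))
```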

Now do the translation where necessary. Using a dict comprehension, iterate over the items and determine the occupation for each name.

In [None]:
{n: occupation(n, o) for n, o in nodict.items()}
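
The comprehension triggers one gender lookup per name. If first names repeat in a longer list, caching can avoid asking the same question twice. The sketch below illustrates the idea with `functools.lru_cache`; `cached_gender` is a hypothetical stand-in that uses a toy rule instead of calling the API, and the `calls` list only serves to show how often the function body actually runs.

```python
from functools import lru_cache

calls = []  # records which first names reached the (fake) lookup

@lru_cache(maxsize=None)
def cached_gender(firstname):
  # Stand-in for find_gender; a real version would call the API here
  calls.append(firstname)
  return 'female' if firstname.endswith('a') else 'male'  # toy rule only

people = ['Ana Novak', 'Ana Kos', 'Marko Kos']
genders = {p: cached_gender(p.split()[0]) for p in people}

print(genders)
print(calls)  # 'Ana' appears only once
```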

---
## Exercises

1.   Turn the genders around. Make a dict that translates feminine to masculine and make other necessary changes.