## Look at data
Look at the .csv file and study its structure to see where the information we need is stored.

Outline the issues you might run into (e.g. which fieldnames do I need to build the citation? Are the values clean?).
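
A quick way to get a first impression is to print the field names and the first record. The sketch below is only an illustration: it reads a small in-memory sample instead of the real file, and both the sample rows and the helper name `peek` are assumptions, not part of the lesson's data or code.

```python
import csv
import io

def peek(csvfile, n=1):
    """Return the field names and the first n rows of an open csv file."""
    reader = csv.DictReader(csvfile)
    rows = [row for _, row in zip(range(n), reader)]
    return reader.fieldnames, rows

# Hypothetical sample standing in for the real .csv file
sample = io.StringIO(
    'Title,All names,ISBN\n'
    'Title A,"Doe, Jane [person]",0000000000\n'
    'Title B,"Roe, Sam, 1950- [person] ; Poe, Kim [person]",\n'
)
fieldnames, rows = peek(sample)
print(fieldnames)  # the column names we can use as keys
print(rows[0])     # the first record, as a dictionary
```

With a real file you would pass the object returned by `open(...)` instead of the `io.StringIO` sample.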

## Collect resources required
In the first line of the .py script we add a comment that declares the encoding of the source file, so that the interpreter knows which encoding standard to use when reading it.

In [None]:
# coding: utf-8

Then we import the modules we'll need to accomplish our tasks, such as *csv* for reading and manipulating a .csv file. 

In [1]:
import csv

# Are there any duplicates in the list? Print the list of duplicates in the form `ISBN: number`

We define a function that looks into the values of a field in a .csv file and finds the values that appear more than once. First we name the function, e.g. **findDuplicates**; it accepts two strings as arguments: the path of the csv file, and the fieldname to look up.

In [None]:
def findDuplicates(inputFile, fieldName):
    ''' given a csv file in input and a field name for disambiguating records,
    returns a list of duplicates'''

We prepare an empty list, named *listDisambiguate*, to store all the values of the field 'ISBN', which is a standard for uniquely identifying bibliographic references. We also create a list called *duplicates*, in which we'll store only the repeated ones. We use the compact syntax to declare both of them on the same line.

In [2]:
listDisambiguate , duplicates = [] , []

We open the .csv file - in the same way we learnt in the last lesson - but we do not specify the name of the file. We use a variable instead, *inputFile*. This ensures that the function can be reused for the same purpose with other files in the future.

We iterate over the rows of the csv (which the class **DictReader** converts into dictionaries) and we fill the empty list named *listDisambiguate* with all the values of the field 'ISBN'. As with the name of the file, we do not hard-code the name of the field, which can vary according to the input file.

In [6]:
with open(inputFile, 'r', errors='ignore') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader: 
        listDisambiguate.append(row[fieldName]) 

Now that we have the list of all the ISBN numbers, we fill the list *duplicates* with only the strings that appear more than once. We use the class **Counter**, which we have already seen: it returns a dictionary-like object that counts the items in the list.

In [23]:
from collections import Counter
for item, count in Counter(listDisambiguate).items():
    if count > 1:
        duplicates.append(item)
duplicates

['',
 '9781783052578',
 '9788374323079',
 '1861891717',
 '1900924927',
 '0335150659',
 '2207256138 ; 9782207256138',
 '0670838454',
 '8445135147 ; 9788445135143',
 '1910705187 ; 9781910705186',
 '9781632151803',
 '1550545981',
 '9780316270397',
 '0887846068',
 '0956642098 ; 9780956642097',
 '0473289067 ; 9780473289065',
 '9780957260009',
 '1843534738 ; 9781843532644',
 '',
 '9781783052578',
 '9788374323079',
 '1861891717',
 '1900924927',
 '0335150659',
 '2207256138 ; 9782207256138',
 '0670838454',
 '8445135147 ; 9788445135143',
 '1910705187 ; 9781910705186',
 '9781632151803',
 '1550545981',
 '9780316270397',
 '0887846068',
 '0956642098 ; 9780956642097',
 '0473289067 ; 9780473289065',
 '9780957260009',
 '1843534738 ; 9781843532644',
 '',
 '9781783052578',
 '9788374323079',
 '1861891717',
 '1900924927',
 '0335150659',
 '2207256138 ; 9782207256138',
 '0670838454',
 '8445135147 ; 9788445135143',
 '1910705187 ; 9781910705186',
 '9781632151803',
 '1550545981',
 '9780316270397',
 '0887846068'

To clean the list and obtain a list of unique duplicates we use **set()**

In [24]:
list(set(duplicates))

['',
 '0670838454',
 '0335150659',
 '1861891717',
 '1910705187 ; 9781910705186',
 '0473289067 ; 9780473289065',
 '9781783052578',
 '1550545981',
 '0956642098 ; 9780956642097',
 '8445135147 ; 9788445135143',
 '9780957260009',
 '9788374323079',
 '9780316270397',
 '2207256138 ; 9782207256138',
 '1843534738 ; 9781843532644',
 '9781632151803',
 '1900924927',
 '0887846068']
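
Note that a set does not keep the original order of the items, as the output above shows. If the order matters, one possible alternative (not used in the rest of this lesson) is `dict.fromkeys`, which keeps the first occurrence of each value in its original position. The short list below reuses a few values from the output above.

```python
# A short list with repeats, using values from the output above
repeated = ['', '9781783052578', '1861891717', '', '9781783052578']

# Dictionary keys are unique and keep insertion order (Python 3.7+)
unique_in_order = list(dict.fromkeys(repeated))
print(unique_in_order)
```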

We can use list comprehension to obtain the same result, but with only one line of code.

In [19]:
from collections import Counter
duplicates = [item for item, count in Counter(listDisambiguate).items() if count > 1]
duplicates

['',
 '9781783052578',
 '9788374323079',
 '1861891717',
 '1900924927',
 '0335150659',
 '2207256138 ; 9782207256138',
 '0670838454',
 '8445135147 ; 9788445135143',
 '1910705187 ; 9781910705186',
 '9781632151803',
 '1550545981',
 '9780316270397',
 '0887846068',
 '0956642098 ; 9780956642097',
 '0473289067 ; 9780473289065',
 '9780957260009',
 '1843534738 ; 9781843532644']

Therefore, our final function looks like this:

In [33]:
# coding: utf-8
import csv , re
from collections import Counter

def findDuplicates(inputFile, fieldName):
    ''' given a csv file in input and a field name for disambiguating records,
    returns a list of duplicates'''
    listDisambiguate = []

    with open(inputFile, 'r', errors='ignore') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader: # iterates over the lines in the .csv file / i.e., key, value of the dictionary
            listDisambiguate.append(row[fieldName]) # look into a column and create a list
        duplicates = [item for item, count in Counter(listDisambiguate).items() if count > 1]

        print('\nDuplicates in list:')
        for dupl in duplicates:
            print((fieldName+':'+dupl))

We can call it with our specific parameters.

In [34]:
findDuplicates('titles.csv', 'ISBN')


Duplicates in list:
ISBN:
ISBN:9781783052578
ISBN:9788374323079
ISBN:1861891717
ISBN:1900924927
ISBN:0335150659
ISBN:2207256138 ; 9782207256138
ISBN:0670838454
ISBN:8445135147 ; 9788445135143
ISBN:1910705187 ; 9781910705186
ISBN:9781632151803
ISBN:1550545981
ISBN:9780316270397
ISBN:0887846068
ISBN:0956642098 ; 9780956642097
ISBN:0473289067 ; 9780473289065
ISBN:9780957260009
ISBN:1843534738 ; 9781843532644
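
The first "duplicate" printed above is the empty string, i.e., the records that have no ISBN at all. A possible variant, sketched below under a few assumptions, skips empty values and returns the list instead of printing it. The name `find_duplicates_in`, the choice of passing an already opened file, and the sample rows are all hypothetical.

```python
import csv
import io
from collections import Counter

def find_duplicates_in(csvfile, fieldName):
    """Return the non-empty values of fieldName that occur more than once."""
    reader = csv.DictReader(csvfile)
    counts = Counter(row[fieldName].strip() for row in reader)
    # 'if value' discards the empty string, i.e., records without a value
    return [value for value, count in counts.items() if value and count > 1]

# Hypothetical sample: two records share an ISBN, two have none
sample = io.StringIO(
    'Title,ISBN\n'
    'Title A,111\n'
    'Title B,\n'
    'Title C,111\n'
    'Title D,\n'
    'Title E,222\n'
)
print(find_duplicates_in(sample, 'ISBN'))
```

Returning the list makes it possible to reuse the result in other functions, while the printing can be left to the caller.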


# Can you delete duplicates from the dictionary?
We define a function that cleans a *list of dictionaries* (such as the one created by **DictReader**) by deleting all the dictionaries that are repeated. This means we delete all the records that are repeated in the bibliography.

As usual, we give the function a name (**cleanListOfDict**) and its arguments. In this case we ask for a list of dictionaries and we return a new, cleaned list of dictionaries.

In [None]:
def cleanListOfDict(oldList):
    ''' given a list of dictionaries with duplicate items
    returns a new cleaned list of dictionaries'''
    newList = []

We have prepared an empty list, *newList*, to store only the unique dictionaries.

Then we iterate over the positions of the list. We use the built-in function **range()** to create a sequence of numbers from 0 to *len(oldList) - 1*, i.e., one index for every dictionary included in the original list. We'll use this index to see whether an item (dictionary) is repeated among the following ones.

The **if clause** selects any dictionary in the current position \[i\] that is not repeated among the following items. *i* is a number between 0 and len(oldList) - 1 that increases by one at each iteration. We use the membership operator **not in** to compare each dictionary with all the following ones (hence the slice notation *i+1:*, which means: from position i+1 to the end of the sequence).

If the condition is satisfied, the dictionary is copied in the new list.

In [28]:
    for i in range(0, len(oldList)):
        if oldList[i] not in oldList[i+1:]: 
            newList.append(oldList[i]) 
    return newList

NameError: name 'oldList' is not defined

Or better, we can further simplify the code by using list comprehension:

In [None]:
newList = [oldList[i] for i in range(0, len(oldList)) if oldList[i] not in oldList[i+1:]]

Our final function looks like:

In [29]:
def cleanListOfDict(oldList):
    ''' given a list of dictionaries with duplicate items
    returns a new cleaned list of dictionaries'''
    newList = [oldList[i] for i in range(0, len(oldList)) if oldList[i] not in oldList[i+1:]]
    return newList

For example, we can call the function with a simple list of dictionaries as parameter:

In [30]:
anotherList = [{'a': 100}, {'b': 100}, {'a': 100}]
cleanListOfDict(anotherList)

[{'b': 100}, {'a': 100}]

It returns only the unique dictionaries as we wanted. So we are sure our function works.
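
Notice that, because each dictionary is compared with the *following* ones, the function keeps the last occurrence of a repeated item (this is why {'b': 100} comes first in the output). If you prefer to keep the first occurrence, a possible alternative is sketched below; the name `clean_keep_first` is hypothetical.

```python
def clean_keep_first(oldList):
    """Return a new list that keeps the first occurrence of each dictionary."""
    newList = []
    for item in oldList:
        if item not in newList:  # dictionaries are compared by content
            newList.append(item)
    return newList

print(clean_keep_first([{'a': 100}, {'b': 100}, {'a': 100}]))
```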

Now we try it with the list of dictionaries that we obtain from opening the .csv file with *DictReader*.

In [50]:
with open('titles.csv', 'r', errors='ignore') as csvfile:
    reader = [row for row in csv.DictReader(csvfile)]
    clean = cleanListOfDict(reader)
#check how many rows have been deleted 
print('new:',len(clean),' old:',len(reader) ) 

new: 323  old: 323


Oops, apparently we have not deleted any row! But the function works. This means that even if there are duplicates among the ISBN numbers, other data in the same rows differ, e.g. two different titles or different topics are recorded.
We should decide which record is the most reliable among the duplicates and exclude the others. But for the moment this is enough! Let's move to another task.
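
To help with that decision, one option is to group the rows by ISBN and list the fields whose values differ within each group. The following is only a sketch under stated assumptions: the function name `differing_fields` and the sample rows (including the 'Topics' column) are hypothetical.

```python
import csv
import io
from collections import defaultdict

def differing_fields(csvfile, keyField):
    """Map each repeated, non-empty key to the fields whose values differ."""
    groups = defaultdict(list)
    for row in csv.DictReader(csvfile):
        if row[keyField]:  # ignore records without a key
            groups[row[keyField]].append(row)
    report = {}
    for key, rows in groups.items():
        if len(rows) > 1:
            # a field differs when the group holds more than one distinct value
            report[key] = [f for f in rows[0] if len({r[f] for r in rows}) > 1]
    return report

# Hypothetical sample: same ISBN, different topics
sample = io.StringIO(
    'Title,Topics,ISBN\n'
    'Title A,punk,111\n'
    'Title A,music,111\n'
    'Title B,punk,222\n'
)
print(differing_fields(sample, 'ISBN'))
```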

# Print references in Chicago Style
A bibliographic citation in Chicago Style looks like:

`LastName1, FirstName1, FirstName2 LastName2, and FirstNameN LastNameN. Title. City: Publisher, date.`

The information we need in order to build the citation is split across several fields: 'All names', 'Title', 'Place of publication', 'Publisher', and 'Date of publication'.

Looking at the column of authors ('All names') we immediately notice that last names and first names are not in the correct order, and that strings have to be cleaned (removing parentheses and roles). Since cleaning such strings takes quite a lot of work, we create a bespoke function for that purpose, and we call it inside a more general function that only builds the citation.

Before starting, we summarise the operations we need to perform on such strings (the sequence is relevant!). We observe two situations: single author and multiple authors. A few operations must be performed case by case. Others can be performed regardless of the number of authors (i.e., in both cases at once).

 * **multiple authors** delete all the organisations from the list
 * **multiple authors** delete roles
 * **multiple authors** swap last name and first name for all the authors but the first one, and remove the comma
 * **multiple authors** add 'and' before the last author
 * **single author** delete roles
 * **both multiple authors and single author** remove parentheses and dates
 * **both multiple authors and single author** delete the semicolon and place a comma instead

As usual we start working with lists, therefore we create empty lists for storing our data (the list of authors and the final list of authors' names in the correct order).


In [59]:
authors , cleanAuthors = [] , [] # create a list with authors, prepare the list to store the cleaned list of authors

with open('titles.csv', 'r', errors='ignore') as csvfile:
    reader = csv.DictReader(csvfile)
    authors = [row['All names'] for row in reader] # we used list comprehension to make it short
    print(authors)

['Evans, Mike [person]', 'Jones, Steven [person]', 'Marcus, Greil [person]', 'Gibbon, Peter [person] ; Centret for udviklingsforskning (Denmark) [organisation]', 'Butler, Chris [person]', 'Mulholland, Garry [person]', 'Wright, Peter Poyntz [person]', 'Wells, Steven [person]', 'Jasper, Maura [person] ; Mancini, Hilken [person]', 'McGuirk, Niall [person]', 'Ercoli, Rikki [person]', 'Jocoy, Jim, 1952- [person] ; Moore, Thurston [person] ; Cervenka, Exene [person]', 'Walsh, Gavin, 1964- [person]', 'King, John, 1960- [person]', 'Cannon, Brian John Matthew [person]', 'Bayley, Roberta [person]', 'Bockris, Victor, 1949- [person]', 'Rocco, John (John M.) [person]', 'McIver, Joel [person]', 'Anscombe, Isabelle [person]', "O'Regan, Denis [person]", 'Belsito, Peter [person] ; Davis, Bob [person] ; Kester, Marion [person]', 'Leblanc, Lauraine, 1968- [person]', 'Alexandersson, Gunvor [person] ; Lundquist, Lena [person]', 'Teipel, Jürgen [person]', 'Ogram, patippu [person]', 'Kerr, Joe, 1958- [person

Then we work only on the list of authors we created and we iterate over its items to clean every string representing an author or a group of authors.

The first thing we check is whether we have only one author or several, separated by ';'. In the latter case we split the group into single authors and we clean them individually. Since there are no other situations at this level (i.e., only ';' is used to separate authors) we use an **if else statement**.

In [72]:
for auth in authors: 
    if ';' in auth:
        pass # clean every author's name
    else:
        pass # clean only one author's name

We start with the case of multiple authors, since it is the most complex situation.

First of all, we use **split(';')** to divide the authors' names separated by ';' and we get a sequence of names.

Secondly, we remove from the sequence all the names that refer to organisations, since they cannot be considered authors. We use the membership operator **in** to check whether a string contains the word 'organisation' and then we update the list of single names 'splitList' using list comprehension.

Then we iterate over each author's name and we split this string by the separator ',' (so that we retrieve last name, first name, and roles). If there are three or more items in such a list, it means that the third string and the following ones are roles (e.g. 'author', 'writer', etc.) and we delete them. Then we update the list of names replacing the whole string: we join the last name and first name with ', ', as it was in the beginning. If there are only two items separated by a comma, but the second string is 'artist' or 'author', we delete the second item and we update again the list of a single author (that in this case corresponds to the first item in *splitList2*).

Finally, we join the substrings we cleaned with ';' and we recreate the string representing a group of authors of a single publication (auth). Rebuilding the original string continuously is necessary because several operations we need to perform later require a precise sequence (e.g. if we delete ';' we cannot clearly recognize the boundaries of the authors' names, therefore it needs to be the last thing to be cleaned).
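
The core of these steps can also be tried in isolation on a couple of strings taken from the list of authors. The sketch below is a simplified illustration, not the code of the lesson: it returns a list rather than rebuilding the string, and the function name `split_people` is hypothetical.

```python
def split_people(auth):
    """Split a group of authors and keep 'last name, first name' for each person."""
    people = []
    for au in auth.split(';'):
        if 'organisation' in au:  # organisations are not treated as authors
            continue
        parts = [p.strip() for p in au.split(',')]
        people.append(', '.join(parts[:2]))  # drop anything after the first name
    return people

print(split_people('Gibbon, Peter [person] ; Centret for udviklingsforskning (Denmark) [organisation]'))
print(split_people('Jocoy, Jim, 1952- [person] ; Moore, Thurston [person]'))
```

Note that when the third item is dropped, the label '[person]' attached to it disappears as well; the remaining labels are removed at a later stage.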

In [123]:
for auth in authors: 
    if ';' in auth: # multiple authors
        splitList = auth.split(';') # split a group of authors by semicolon
        splitList = [au for au in splitList if 'organisation' not in au] # remove organizations from the list of authors
        for au in splitList: # iterate over each author in a group
            splitList2 = au.split(',') # split parts of the name of a single author by comma
            if len(splitList2) >=3: # if there are three or more items, the last ones are roles
                del splitList2[2:] # delete the substring after the third comma if exists (e.g. author, writer)
                splitList = [auth.replace(au, ', '.join(aux.strip() for aux in splitList2))] 
            if (len(splitList2) == 2) and ('artist' in splitList2[1] or 'author' in splitList2[1]) : # delete artist when in second position
                del splitList2[1]
                splitList = splitList2
        auth = ';'.join(au for au in splitList if au != ' ')
        print(auth) # print all the multiple authors

Gibbon, Peter [person] 
Jasper, Maura [person] ; Mancini, Hilken [person]
Jocoy, Jim, 1952- [person] ; Moore, Thurston [person] ; Cervenka, Exene [person]
Belsito, Peter [person] ; Davis, Bob [person] ; Kester, Marion [person]
Alexandersson, Gunvor [person] ; Lundquist, Lena [person]
Kerr, Joe, 1958- [person] ; Gibson, Andrew, 1949- [person]
Stevenson, Nils [person] ; Stevenson, Ray [person]
Andersen, Mark [person] ; Jenkins, Mark, 1954- [person]
Craine, Nick, 1971- [person] ; Turner, Michael, 1962- [person] ; MacDonald, Bruce [person]
Craine, Nick, 1971- [person] ; Turner, Michael, 1962- [person] ; MacDonald, Bruce [person]
McNeil, Legs [person] ; McCain, Gillian [person]
Davis, Guy, 1966- [person] ; Reed, Gary, 1956- [person]
Aizlewood, John [person] ; Collins, Andrew, 1965- [person] ; Prince, Bill [person]
Aizlewood, John [person] ; Collins, Andrew, 1965- [person] ; Prince, Bill [person]
McNeil, Legs [person] ; McCain, Gillian [person]
Perry, Mark, vocalist [person] ; Rawlings, Terr

Our strings are now partially cleaned. The next step is to swap the last name and the first name of all the authors but the first one.

We split the string (auth) again by ';' and we use another **if statement** to check whether there are at least two items, i.e., more than one author (otherwise there is nothing to swap).

We deal with the first item in the list (first author): here we just strip spaces using the method **strip**, which removes leading and trailing whitespace.

We then move to the other items (we use the notation *\[1:\]* to take all the items from the second to the end of the sequence), and we split each of them by ',' to retrieve last names and first names. By using variable assignment in one line, we swap the first item with the second one. Once this is done, we rebuild the full name of the author (now called *i* to avoid confusion): in the string of the group (auth) we replace the old name with the first name and last name joined by a white space. Then we rebuild the group of authors (auth) by joining the authors' names with a comma.

Lastly, we add 'and' before the last author in the group (auth). We split the string (auth) again and we assign a new value to the last item of the list (see the notation *\[-1\]*). As usual we rebuild our string representing a group of authors with **join**.
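
The swap and the final 'and' can be illustrated on a list of already cleaned names. This is a minimal sketch that assumes every name has the form 'Last, First'; the function name `join_chicago` is hypothetical and the code is a simplified alternative to the string-rebuilding approach described above.

```python
def join_chicago(people):
    """Join names given as 'Last, First'; all but the first become 'First Last'."""
    names = [people[0].strip()]  # the first author stays as 'Last, First'
    for person in people[1:]:
        last, _, first = person.partition(',')  # split at the first comma only
        names.append((first.strip() + ' ' + last.strip()).strip())
    if len(names) == 1:
        return names[0]
    return ', '.join(names[:-1]) + ', and ' + names[-1]

print(join_chicago(['Jasper, Maura', 'Mancini, Hilken']))
print(join_chicago(['Jocoy, Jim', 'Moore, Thurston', 'Cervenka, Exene']))
```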

In [196]:
# repeat the code for the interpreter of jupyter
for auth in authors: 
    if ';' in auth:      
        splitList = auth.split(';') # split a group of authors by semicolon
        splitList = [au for au in splitList if 'organisation' not in au] # remove organizations from the list of authors
        for au in splitList: # iterate over each author in a group
            splitList2 = au.split(',') # split parts of the name of a single author by comma
            if len(splitList2) >=3: # if there are three or more items, the last ones are roles
                del splitList2[2:] # delete the substring after the third comma if exists (e.g. author, writer)
                splitList = [auth.replace(au, ', '.join(aux.strip() for aux in splitList2))] 
            if (len(splitList2) == 2) and ('artist' in splitList2[1] or 'author' in splitList2[1]) : # delete artist when in second position
                del splitList2[1]
                splitList = splitList2 # delete role when there is only the name of art
        auth = ';'.join(au for au in splitList if au != ' ') # rebuild the string
        
# new code
        splitList = auth.split(';')  # split again the string to swap names
        if len(splitList) >= 2: # when there is more than one author
            for i in splitList[0]: # deal with the first author 
                i = i.strip() # remove whitespace in the first author's name
            for i in splitList[1:]: # deal with all the other authors' names 
                if ',' in i: 
                    splitList2 = i.split(',') # split each author's name by comma
                    splitList2[0], splitList2[1] = splitList2[1], splitList2[0] # swap last name and first name
                    splitList = [auth.replace(i, ' '.join(i for i in splitList2))] # rebuild the list of names
                    auth = ', '.join(au for au in splitList if au != ' ') # rebuild the list of authors
            
            splitList = auth.split(';') # split again the (updated) list of multiple authors
            splitList[-1] = str(' and '+splitList[-1]) # add "and" before the last author, after swapping last/first name
            auth = ';'.join(au for au in splitList if au != ' ') # update the list again
        print(auth)

Gibbon, Peter [person] 
Jasper, Maura [person] ; and  Hilken [person]  Mancini
Jocoy, Jim; Thurston [person]   Moore; and  Exene [person]  Cervenka
Belsito, Peter [person] ; Bob [person]   Davis; and  Marion [person]  Kester
Alexandersson, Gunvor [person] ; and  Lena [person]  Lundquist
Kerr, Joe, 1958- [person] ; and  Andrew Gibson
Stevenson, Nils [person] ; and  Ray [person]  Stevenson
Andersen, Mark [person] ; and  Mark Jenkins
Craine, Nick, 1971- [person] ; Michael Turner; and  Bruce [person]  MacDonald
Craine, Nick, 1971- [person] ; Michael Turner; and  Bruce [person]  MacDonald
McNeil, Legs [person] ; and  Gillian [person]  McCain
Davis, Guy, 1966- [person] ; and  Gary Reed
Aizlewood, John [person] ; Andrew Collins; and  Bill [person]  Prince
Aizlewood, John [person] ; Andrew Collins; and  Bill [person]  Prince
McNeil, Legs [person] ; and  Gillian [person]  McCain
Perry, Mark, vocalist [person] ; and  Terry Rawlings
Carroll, Jessica, 1976- [person] ; and  Craig Smith
Henri, Adria

We have done almost everything that concerns groups of authors (a few more things will be done at the end, for both groups of authors and single authors, e.g. removing parentheses and dates). We now deal with single authors, by using the **else clause** that we left empty.

We split our name by comma and we delete roles (that are all the items of the list 'splitList' after the second comma, i.e., *\[2:\]*). As for groups, we also take into account artists' names: when there are only two items separated by a comma and the second one includes the string 'artist' or 'author' we delete this item from the list. As usual, we rebuild the full string of the author's name (auth) joining the items with a comma.

Finally, we need to take into account another exception we might have forgotten. There might be empty groups of authors (i.e. a blank string). We substitute empty groups of authors with the string '-'.

In [150]:
for auth in authors: 
    if ';' in auth:
        pass # just for the sake of brevity, we skip the code here we have already seen
    else:
        splitList = auth.split(',') # split parts of the name of a single author by comma
        if len(splitList) >=3:
            del splitList[2:]
            splitList = [auth.replace(auth, ', '.join(aux.strip() for aux in splitList))]
        if (len(splitList) == 2) and ('artist' in splitList[1] or 'author' in splitList[1]) : # delete artist when in second position
            del splitList[1]
        auth = ','.join(au for au in splitList)
    if len(auth) == 0: # substitutes empty authors with -
        auth = '-'
    print(auth)

Evans, Mike [person]
Jones, Steven [person]
Marcus, Greil [person]
Gibbon, Peter [person] ; Centret for udviklingsforskning (Denmark) [organisation]
Butler, Chris [person]
Mulholland, Garry [person]
Wright, Peter Poyntz [person]
Wells, Steven [person]
Jasper, Maura [person] ; Mancini, Hilken [person]
McGuirk, Niall [person]
Ercoli, Rikki [person]
Jocoy, Jim, 1952- [person] ; Moore, Thurston [person] ; Cervenka, Exene [person]
Walsh, Gavin
King, John
Cannon, Brian John Matthew [person]
Bayley, Roberta [person]
Bockris, Victor
Rocco, John (John M.) [person]
McIver, Joel [person]
Anscombe, Isabelle [person]
O'Regan, Denis [person]
Belsito, Peter [person] ; Davis, Bob [person] ; Kester, Marion [person]
Leblanc, Lauraine
Alexandersson, Gunvor [person] ; Lundquist, Lena [person]
Teipel, Jürgen [person]
Ogram, patippu [person]
Kerr, Joe, 1958- [person] ; Gibson, Andrew, 1949- [person]
Caiafa, Janicł [person]
Stevenson, Nils [person] ; Stevenson, Ray [person]
O'Hara, Craig [person]
O'Hara, Cra

So far our code deals with single authors and groups of authors separately. Now we deal with both at once. We clean the text by using *re*, a module that provides a few functions to modify strings on the basis of a pattern.

The `re.sub` function accepts three arguments: 
 * the pattern to be replaced 
 * the string that replaces it 
 * the string to which the replacement is applied.

Analyse the following patterns:

`\s?[\(\[].*?[\)\]]\s?`

 * `\s` matches any whitespace character; 
 * `\s?` matches an optional whitespace (it might be there or not). We consider it both at the beginning and at the end of the pattern;
 * `[\(\[]` the outer square brackets define a character class, i.e., a set of characters of which any one may match: this is the opening of the pattern that we want to substitute with "" (an empty string). The class includes both ( and \[, which are escaped (i.e., since \[\] normally delimit a character class, we add a backslash before the characters so that the interpreter matches ( and \[ literally).
 * `[\)\]]` similarly to the prior one, this matches the closing parenthesis or bracket in our string.
 * `.` matches any character (in this case, any character included in either () or \[\]); `.*?` matches zero or more of them, as few as possible, so that we don't care about the actual content of the parentheses and the match stops at the first closing one.

`\,?\s?\d+\-?`

 * `\,?`, `\-?` match the optional characters ',' and '-'
 * `\d+` matches one or more digits (numbers)
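
To see the two patterns at work, we can apply them to a couple of strings taken from the list of authors. In this sketch the constant names `BRACKETS` and `DATES` and the function name `strip_brackets_and_dates` are hypothetical; the patterns are the ones analysed above, written as raw strings (prefix `r`) so that Python passes the backslashes to *re* unchanged.

```python
import re

BRACKETS = r"\s?[\(\[].*?[\)\]]\s?"  # optional space, ( or [, any text, ) or ], optional space
DATES = r"\,?\s?\d+\-?"              # optional comma and space, digits, optional hyphen

def strip_brackets_and_dates(auth):
    """Remove bracketed text first, then dates, from an author string."""
    auth = re.sub(BRACKETS, "", auth)
    return re.sub(DATES, "", auth)

print(strip_brackets_and_dates('Walsh, Gavin, 1964- [person]'))
print(strip_brackets_and_dates('Rocco, John (John M.) [person]'))
```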

In [159]:
for auth in authors:   
    auth = re.sub(r"\s?[\(\[].*?[\)\]]\s?", "", auth) # remove [], () and the text they enclose
    auth = re.sub(r"\,?\s?\d+\-?", "", auth) # remove digits and, if present, the following '-' and the preceding ','
    auth = re.sub(';',',',auth) # finally replace the ; with a comma
    auth = auth+'.' # rebuild the final string and add a dot at the end
    print(auth)

Evans, Mike.
Jones, Steven.
Marcus, Greil.
Gibbon, Peter, Centret for udviklingsforskning.
Butler, Chris.
Mulholland, Garry.
Wright, Peter Poyntz.
Wells, Steven.
Jasper, Maura, Mancini, Hilken.
McGuirk, Niall.
Ercoli, Rikki.
Jocoy, Jim, Moore, Thurston, Cervenka, Exene.
Walsh, Gavin.
King, John.
Cannon, Brian John Matthew.
Bayley, Roberta.
Bockris, Victor.
Rocco, John.
McIver, Joel.
Anscombe, Isabelle.
O'Regan, Denis.
Belsito, Peter, Davis, Bob, Kester, Marion.
Leblanc, Lauraine.
Alexandersson, Gunvor, Lundquist, Lena.
Teipel, Jürgen.
Ogram, patippu.
Kerr, Joe, Gibson, Andrew.
Caiafa, Janicł.
Stevenson, Nils, Stevenson, Ray.
O'Hara, Craig.
O'Hara, Craig.
Vague, Tom.
Fielding, Garry.
Home, Stewart.
Zimmermann, Peter, writer.
May, Michael.
Coon, Caroline.
.
King, John.
Andersen, Mark, Jenkins, Mark.
Craine, Nick, Turner, Michael, MacDonald, Bruce.
Craine, Nick, Turner, Michael, MacDonald, Bruce.
Carr, Emily.
Carr, Emily.
Mulholland, Garry.
McNeil, Legs, McCain, Gillian.
Steel, Mark.
Davi

Below is the entire code of our function, which we call 'cleanAuthors', with arguments *inputFile* and *fieldName*:

In [199]:
def cleanAuthors(inputFile, fieldName):
    ''' given a csv file and the name of the field that lists authors in the form lastName, firstName 
    returns the list of authors cleaned and ready to be integrated in a bibliographic citation in Chicago style'''
    authors , cleanAuthorsList = [] , [] # create a list with authors, prepare the list to store the cleaned list of authors

    with open(inputFile, 'r', errors='ignore') as csvfile:
        reader = csv.DictReader(csvfile)
        authors = [row[fieldName] for row in reader] # we used list comprehension to make it short
    
    for auth in authors: 
        if ';' in auth:      
            splitList = auth.split(';') # split a group of authors by semicolon
            splitList = [au for au in splitList if 'organisation' not in au] # remove organizations from the list of authors
            for au in splitList: # iterate over each author in a group
                splitList2 = au.split(',') # split parts of the name of a single author by comma
                if len(splitList2) >=3: # if there are three or more items, the last ones are roles
                    del splitList2[2:] # delete the substring after the third comma if exists (e.g. author, writer)
                    splitList = [auth.replace(au, ', '.join(aux.strip() for aux in splitList2))] 
                if (len(splitList2) == 2) and ('artist' in splitList2[1] or 'author' in splitList2[1]) : # delete artist when in second position
                    del splitList2[1]
                    splitList = splitList2 # delete role when there is only the name of art
                auth = ';'.join(au for au in splitList if au != ' ') # rebuild the string

            splitList = auth.split(';')  # split again the string to swap names
            if len(splitList) >= 2: # when there is more than one author
                for i in splitList[0]: # deal with the first author 
                    i = i.strip() # remove whitespace in the first author's name
                for i in splitList[1:]: # deal with all the other authors' names 
                    if ',' in i: 
                        splitList2 = i.split(',') # split each author's name by comma
                        splitList2[0], splitList2[1] = splitList2[1], splitList2[0] # swap last name and first name
                        splitList = [auth.replace(i, ' '.join(i for i in splitList2))] # rebuild the list of names
                        auth = ', '.join(au for au in splitList if au != ' ') # rebuild the list of authors

                splitList = auth.split(';') # split again the (updated) list of multiple authors
                splitList[-1] = str(' and'+splitList[-1]) # add "and" before the last author, after swapping last/first name
                auth = ';'.join(au for au in splitList if au != ' ') # update the list again

        else:
            splitList = auth.split(',') # split parts of the name of a single author by comma
            if len(splitList) >=3:
                del splitList[2:]
                splitList = [auth.replace(auth, ', '.join(aux.strip() for aux in splitList))]
            if (len(splitList) == 2) and ('artist' in splitList[1] or 'author' in splitList[1]) : # delete artist when in second position
                del splitList[1]
            auth = ','.join(au for au in splitList)
        if len(auth) == 0: # substitutes empty authors with -
            auth = '-'
            
        auth = re.sub(r"\s?[\(\[].*?[\)\]]\s?", "", auth) # remove [], () and the text they enclose
        auth = re.sub(r"\,?\s?\d+\-?", "", auth) # remove digits and, if present, the following '-' and the preceding ','
        auth = re.sub(';',',',auth) # finally replace the ; with a comma
        auth = auth+'.' # rebuild the final string and add a dot at the end
        
        cleanAuthorsList.append(auth)
    
    return cleanAuthorsList

In [200]:
cleanAuthors('titles.csv', 'All names')

['Evans, Mike.',
 'Jones, Steven.',
 'Marcus, Greil.',
 'Gibbon, Peter.',
 'Butler, Chris.',
 'Mulholland, Garry.',
 'Wright, Peter Poyntz.',
 'Wells, Steven.',
 'Jasper, Maura, and Hilken Mancini.',
 'McGuirk, Niall.',
 'Ercoli, Rikki.',
 'Jocoy, Jim, Thurston  Moore, and Exene Cervenka.',
 'Walsh, Gavin.',
 'King, John.',
 'Cannon, Brian John Matthew.',
 'Bayley, Roberta.',
 'Bockris, Victor.',
 'Rocco, John.',
 'McIver, Joel.',
 'Anscombe, Isabelle.',
 "O'Regan, Denis.",
 'Belsito, Peter, Bob  Davis, and Marion Kester.',
 'Leblanc, Lauraine.',
 'Alexandersson, Gunvor, and Lena Lundquist.',
 'Teipel, Jürgen.',
 'Ogram, patippu.',
 'Kerr, Joe, and Andrew Gibson.',
 'Caiafa, Janicł.',
 'Stevenson, Nils, and Ray Stevenson.',
 "O'Hara, Craig.",
 "O'Hara, Craig.",
 'Vague, Tom.',
 'Fielding, Garry.',
 'Home, Stewart.',
 'Zimmermann, Peter.',
 'May, Michael.',
 'Coon, Caroline.',
 '-.',
 'King, John.',
 'Andersen, Mark, and Mark Jenkins.',
 'Craine, Nick, Michael Turner, and Bruce MacDonal

Now we finally have the cleaned list of authors we need. We can move forward and create the function that prints our bibliographic citations in Chicago Style. We call it **chicagoCitation** and it takes as arguments: 
 
 * a csv file
 * all the fieldnames we need to retrieve: authors, title, city, publisher, and date of publishing
 
We proceed as before, namely:
 * open a csv file
 * call the function we created to clean authors' names and fill a new list called *authorsList*
 * create empty lists to store values of required fields
 * iterate over rows of the csv and fill the aforementioned lists
 
Then we collate the lists we created by using the **zip()** function. It iterates in parallel over all the lists we pass as arguments, and creates a tuple for each position, containing the corresponding item of every list. In order to query a zip object, we transform it into a list.
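
A tiny example may help to picture what **zip()** produces. The three short lists below reuse a few values from our data; they are only an illustration.

```python
authorsList = ['Marcus, Greil.', 'Butler, Chris.']
titlesList = ['In the fascist bathroom : writings on punk 1977-1992', 'Monkey punk']
datesList = ['1993', '1997']

# Each tuple collects the items found at the same position in every list
citations = list(zip(authorsList, titlesList, datesList))
print(citations[1])
```

Keep in mind that **zip()** stops at the end of the shortest list, so the lists should all have the same length.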

Finally we can iterate over the items of the latter list and build our citations. Since both the publisher and the city might be missing from the original data for some records, we define two situations (with and without the publisher). In our tuples the publisher is always the fourth item (*citation\[3\]*): when this equals '', we do not print the place and the publisher and skip directly to the date. Otherwise we print the complete reference.
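
The two situations can be tried out on single records before writing the whole function. In this sketch the helper name `format_citation` is hypothetical, and it returns the string instead of printing it; the sample values are taken from our data.

```python
def format_citation(author, title, city, publisher, date):
    """Build one citation; leave out place and publisher when the publisher is empty."""
    if publisher == '':
        return author + ' ' + title + '. ' + date + '.'
    return author + ' ' + title + '. ' + city + ': ' + publisher + ', ' + date + '.'

print(format_citation('Evans, Mike.', 'White Punks on Dope', '', '', '1977'))
print(format_citation('Butler, Chris.', 'Monkey punk', 'Hove', 'Slab-O-Concrete', '1997'))
```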

In [173]:
import csv 
def chicagoCitation(inputFile, authors, title, city, publisher, date):
    ''' given a csv file and fieldnames including all the elements of a bibliographic reference 
    returns the bibliographic citation in Chicago style'''

    with open(inputFile, 'r', errors='ignore') as csvfile:
        authorsList = cleanAuthors(inputFile, authors)
        titlesList , citiesList , publishersList , datesList = [] , [] , [] , [] # create lists
        reader = csv.DictReader(csvfile)
        for row in reader: 
            titlesList.append(row[title])
            citiesList.append(row[city])
            publishersList.append(row[publisher])
            datesList.append(row[date])
        
        citations = list(zip(authorsList, titlesList, citiesList, publishersList, datesList))
        for citation in citations:
            if citation[3] == '': # compare by value: 'is' checks identity and is unreliable with strings
                print(citation[0]+' '+citation[1]+'. '+citation[4]+'.')
            else:
                print(citation[0]+' '+citation[1]+'. '+citation[2]+': '+citation[3]+', '+citation[4]+'.')

We call the function with all the known arguments, and we get our result!

In [201]:
chicagoCitation('titles.csv', 'All names', 'Title', 'Place of publication', 'Publisher', 'Date of publication')

Evans, Mike. White Punks on Dope. 1977.
Jones, Steven. No One is Innocent : a punk prayer by Ronald Biggs. 1978.
Marcus, Greil. In the fascist bathroom : writings on punk 1977-1992. London: Viking, 1993.
Gibbon, Peter. Of saviours and punks : the political economy of the Nile perch marketing chain in Tanzania. Copenhagen: Centre for Development Research, 1997.
Butler, Chris. Monkey punk. Hove: Slab-O-Concrete, 1997.
Mulholland, Garry. This is uncool : the 500 greatest singles since punk and disco. London: Cassell, 2003.
Wright, Peter Poyntz. Hunky punks : a study in Somerset stone carving. Loughborough: Heart of Albion, 2004.
Wells, Steven. Anarchy in the UK : the stories behind the anthems of punk. London: Carlton, 2004.
Jasper, Maura, and Hilken Mancini. Punk rock aerobics : 75 killer moves, 50 punk classics, and 25 reasons to get off your ass and exercise. Cambridge, Massachusetts ; Oxford: Da Capo ; Oxford Publicity Partnership, 2004.
McGuirk, Niall. Please feed me : a punk vegan c