## Cleaning and Preparing Data in Python
#### Lesson 1

`csv` is a module built into Python; it provides the `reader` function.
We first open the CSV file, then interpret (or parse) it, then convert the result into a list of rows.

In [1]:
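from csv import reader
opened = open("artworks.csv") #Open the file
read = reader(opened) #Parse the CSV contents
Artwork = list(read) #Convert to a list of rows (each row is a list of strings)
opened.close() #Close the file once the rows are in memory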

for i in range(3): # ie 0,1,2
    print(f"\n{Artwork[i]}") #prints index 0, 1 and 2

artwork = Artwork[1:] #Removes header row


['Title', 'Artist', 'Nationality', 'BeginDate', 'EndDate', 'Gender', 'Date', 'Department']

['Dress MacLeod from Tartan Sets', 'Sarah Charlesworth', '(American)', '(1947)', '(2013)', '(Female)', '1986', 'Prints & Illustrated Books']

['Duplicate of plate from folio 11 verso (supplementary suite, plate 4) from ARDICIA', 'Pablo Palazuelo', '(Spanish)', '(1916)', '(2007)', '(Male)', '1978', 'Prints & Illustrated Books']
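
The open, parse, convert steps above can also be wrapped in a small function. The sketch below is only an illustration: `read_rows` is a hypothetical helper name, and a two-line in-memory sample stands in for `artworks.csv` so the block runs without the file. The `encoding="utf-8"` shown in the trailing comment is an assumption about how the file was saved.

```python
import csv
import io

def read_rows(file_obj):
    """Parse an open CSV file object and return (header, data_rows)."""
    rows = list(csv.reader(file_obj))
    return rows[0], rows[1:]

# Hypothetical in-memory sample standing in for artworks.csv
sample = io.StringIO(
    "Title,Artist,Nationality\n"
    "Los Angeles Airport,Garry Winogrand,(American)\n"
)
header, data = read_rows(sample)
print(header)
print(data[0])

# With the real file, a `with` block closes the handle automatically
# (the encoding is an assumption about the file):
# with open("artworks.csv", encoding="utf-8", newline="") as f:
#     header, data = read_rows(f)
```

Returning the header separately avoids having to remember to slice it off later.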


We can replace parts of a string using the `str.replace()` method. Suppose we have the following string:

In [2]:
fav_colour = "red is my favourite colour"
#Then we can change this to blue with fav_colour.replace(old, new)
fav_colour = fav_colour.replace("red","blue") #replace() returns a new string, so assign the result back to keep the change
print(fav_colour)

blue is my favourite colour
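
A short sketch of why the assignment matters: strings are immutable, so `str.replace()` never changes the original and always returns a new string. The longer sample sentence and the variable names `replaced_all` and `replaced_first` are made up for illustration. The optional third argument limits how many occurrences are replaced.

```python
fav_colour = "red is my favourite colour, red all the way"

# replace() returns a new string; the original is left untouched
replaced_all = fav_colour.replace("red", "blue")

# An optional third argument caps the number of replacements
replaced_first = fav_colour.replace("red", "blue", 1)

print(fav_colour)
print(replaced_all)
print(replaced_first)
```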


We can use this technique to remove the parentheses in the `Nationality`, `Gender`, `BeginDate` (year of birth) and `EndDate` (year of death) columns. Removing a substring works the same way, but with `""` as the 'new' substring.

In [3]:
for row in artwork:
    nationality = row[2]
    nationality_open = nationality.replace("(","")
    nationality_clean = nationality_open.replace(")","")
    row[2] = nationality_clean

In [4]:
print(artwork[2])

['Tailpiece (page 55) from SAGESSE', 'Maurice Denis', 'French', '(1870)', '(1943)', '(Male)', '1889-1911', 'Prints & Illustrated Books']


`French` no longer has parentheses. Repeat the same for the `Gender` column.

In [5]:
for row in artwork: 
    gender = row[5]
    gender_open = gender.replace("(","")
    gender_clean = gender_open.replace(")","")
    row[5] = gender_clean
    
print(artwork[2])

['Tailpiece (page 55) from SAGESSE', 'Maurice Denis', 'French', '(1870)', '(1943)', 'Male', '1889-1911', 'Prints & Illustrated Books']
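
The two loops above differ only in the column index, so they could be folded into one helper. This is a sketch rather than the notebook's method: `strip_parens` and `clean_columns` are hypothetical names, and the single sample row is copied from the first data row printed earlier.

```python
def strip_parens(value):
    """Return value with every '(' and ')' removed."""
    return value.replace("(", "").replace(")", "")

def clean_columns(rows, column_indexes):
    """Strip parentheses in place, but only in the listed columns."""
    for row in rows:
        for i in column_indexes:
            row[i] = strip_parens(row[i])

sample_rows = [
    ["Dress MacLeod from Tartan Sets", "Sarah Charlesworth", "(American)",
     "(1947)", "(2013)", "(Female)", "1986", "Prints & Illustrated Books"],
]
clean_columns(sample_rows, [2, 5])  # Nationality and Gender
print(sample_rows[0])
```

Limiting the cleaning to chosen columns matters because titles such as 'Tailpiece (page 55) from SAGESSE' contain parentheses that should be kept.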


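The date columns (`BeginDate` and `EndDate`) are also wrapped in parentheses, and even with the parentheses removed the values would still be strings. Instead of repeating code, we can write a function that strips the parentheses and converts each value to an integer.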

In [6]:
def date_clean(date):
    if date != "":
        date = date.replace("(","")
        date = date.replace(")","")
    if date == "":
        return "Date Unknown"
    return int(date)

#The below method will not work as artwork[3] is the 4th ROW not COLUMN
#for date in artwork[3]:
#    date = date_clean(date)
#for date in artwork[4]:
#    date = date_clean(date)

for row in artwork: 
    birth_date = row[3]
    death_date = row[4]
    row[3] = date_clean(birth_date)
    row[4] = date_clean(death_date)
    
#Check to find dates have no parentheses
for i in range(1,4): 
    print(f"\n{artwork[i]}")
    


['Duplicate of plate from folio 11 verso (supplementary suite, plate 4) from ARDICIA', 'Pablo Palazuelo', 'Spanish', 1916, 2007, 'Male', '1978', 'Prints & Illustrated Books']

['Tailpiece (page 55) from SAGESSE', 'Maurice Denis', 'French', 1870, 1943, 'Male', '1889-1911', 'Prints & Illustrated Books']

['Headpiece (page 129) from LIVRET DE FOLASTRIES, À JANOT PARISIEN', 'Aristide Maillol', 'French', 1861, 1944, 'Male', '1927-1940', 'Prints & Illustrated Books']
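
One consequence of the function above is that a date column can now hold both integers and the string "Date Unknown" (the last row printed further down is an example). The sketch below restates `date_clean` in a compact but equivalent form and shows one way to pick out only the known years before doing arithmetic; the sample values and the `known_years` name are illustrative.

```python
def date_clean(date):
    """Strip parentheses; return an int, or 'Date Unknown' for blanks."""
    date = date.replace("(", "").replace(")", "")
    if date == "":
        return "Date Unknown"
    return int(date)

samples = ["(1947)", "(2013)", ""]
cleaned = [date_clean(s) for s in samples]
print(cleaned)

# Keep only the integers before computing anything numeric
known_years = [d for d in cleaned if isinstance(d, int)]
print(known_years)
```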


None of the cleaned columns contain parentheses any longer. Some nationalities and genders appear to be blank, so we can replace these with descriptive placeholder values.

In [8]:
for row in artwork: 
    gender = row[5]
    if not gender: 
        gender = "Gender unknown/other"
    row[5] = gender.title() #Capitalisation; store the result back in the row
    
    nationality = row[2]
    if not nationality:
        nationality = "Nationality unknown"
    row[2] = nationality.title() #Capitalisation; store the result back in the row
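
A quick way to check that blanks were filled is to count how often each value occurs. The sketch below uses `collections.Counter` on a made-up list of gender values (not taken from the dataset) and applies the same placeholder-then-title-case logic.

```python
from collections import Counter

# Hypothetical raw values, including blanks and inconsistent capitalisation
sample_genders = ["Male", "", "female", "Male", ""]

normalised = [(g or "Gender unknown/other").title() for g in sample_genders]
counts = Counter(normalised)
print(counts)
```

If an empty string still shows up as a key in the counts, some blanks were missed.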

### Cleaning the Date (of creation) column
This column is much less clean as it has values such as "c. 1955.", "c. 1970's", "1990-1999" etc.
Instead of replacing each individual character, we can iterate over a list of them. 

In [9]:
bad_char = ["c", "C", ".", "'", "s", "S", "(", ")", " "]

def clean_char(string):
    #Remove each unwanted character in turn and return the cleaned string
    for char in bad_char:
        string = string.replace(char, "")
    return string

for row in artwork:
    row[6] = clean_char(row[6])

In [12]:
for i in range(6,9): 
    print(f"\n{artwork[i]}")


["Rue de l'Hôtel-de-Ville", 'Eugène Atget', 'French', 1857, 1927, 'Male', '1924', 'Photography']

['Los Angeles Airport', 'Garry Winogrand', 'American', 1928, 1984, 'Male', '1978-1983', 'Photography']

['Why Defy from Disasters of Peace', 'Diane Victor', 'South African', 1964, 'Date Unknown', 'Female', '2001', 'Prints & Illustrated Books']
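
To see the character-stripping step in isolation, the sketch below applies it to the three example values quoted earlier. It also shows `str.translate` with a deletion table as an alternative that removes all the characters in one pass; this is a swapped-in technique, not the notebook's own, and `strip_chars` and `delete_table` are hypothetical names.

```python
bad_chars = ["c", "C", ".", "'", "s", "S", "(", ")", " "]

def strip_chars(text, chars):
    """Remove every character in chars from text, one replace() at a time."""
    for ch in chars:
        text = text.replace(ch, "")
    return text

# The third argument of str.maketrans lists characters to delete
delete_table = str.maketrans("", "", "".join(bad_chars))

samples = ["c. 1955.", "c. 1970's", "1990-1999"]
looped = [strip_chars(s, bad_chars) for s in samples]
translated = [s.translate(delete_table) for s in samples]
print(looped)
print(translated)
```

Both approaches leave the "-" in place, so ranges still need separate handling.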


The only cleaning step that remains is handling date ranges (values containing "-"). Since precision isn't too important here, we can take the average of the two years in the range.

In [None]:
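for row in artwork: 
    date = row[6]
    if "-" in date:
        #Separate the two years and convert each to an integer
        #(assumes the value has the form "start-end", e.g. "1978-1983")
        start, end = date.split("-")
        #Average the two dates and round to a whole year
        date = round((int(start) + int(end)) / 2)
    else:
        date = int(date)
    row[6] = date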
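
The averaging step can be checked on a few of the date values printed earlier. The function below is a minimal sketch, assuming every value is either a single year or two years joined by one "-"; `to_year` is a hypothetical name. Note that Python's built-in `round()` rounds exact halves to the nearest even integer, so 1980.5 becomes 1980 while 1933.5 becomes 1934.

```python
def to_year(date):
    """Convert '1924' or '1978-1983' to a single integer year."""
    if "-" in date:
        # Assumes exactly two years separated by one dash
        start, end = date.split("-")
        return round((int(start) + int(end)) / 2)
    return int(date)

for value in ["1924", "1889-1911", "1927-1940", "1978-1983"]:
    print(value, to_year(value))
```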