# 01: Initial Cleaning

Originally gathered by the Austin Animal Center, the dataset in its initial form needed attention in three areas: rows with null values, feature datatypes, and entries with erroneous values.

Cleaning began with imports and reading the primary `csv` files.

In [1]:
import pandas as pd
import numpy as np

In [4]:
outcomes = pd.read_csv('../data/Austin_Animal_Center_Outcomes.csv')
intakes = pd.read_csv('../data/Austin_Animal_Center_Intakes.csv')

Below is a brief display of the two resulting dataframes to get a sense of their initial forms. There is a great deal of overlap between the two datasets. The columns unique to one or the other are the outcome- or intake-specific ones: `Date of Birth`, `Outcome Type`, `Outcome Subtype`, `Sex upon Outcome`, and `Age upon Outcome` in Outcomes, and `Found Location`, `Intake Type`, `Intake Condition`, `Sex upon Intake`, and `Age upon Intake` in Intakes. Likewise, the `DateTime` and `MonthYear` columns share a name but represent different timestamps: in the Outcomes dataset they reflect the date/time of an animal's outcome, and in the Intakes dataset they represent the date/time that an animal was brought in.

In [5]:
outcomes.head()

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Date of Birth,Outcome Type,Outcome Subtype,Animal Type,Sex upon Outcome,Age upon Outcome,Breed,Color
0,A794011,Chunk,05/08/2019 06:20:00 PM,05/08/2019 06:20:00 PM,05/02/2017,Rto-Adopt,,Cat,Neutered Male,2 years,Domestic Shorthair Mix,Brown Tabby/White
1,A776359,Gizmo,07/18/2018 04:02:00 PM,07/18/2018 04:02:00 PM,07/12/2017,Adoption,,Dog,Neutered Male,1 year,Chihuahua Shorthair Mix,White/Brown
2,A821648,,08/16/2020 11:38:00 AM,08/16/2020 11:38:00 AM,08/16/2019,Euthanasia,,Other,Unknown,1 year,Raccoon,Gray
3,A720371,Moose,02/13/2016 05:59:00 PM,02/13/2016 05:59:00 PM,10/08/2015,Adoption,,Dog,Neutered Male,4 months,Anatol Shepherd/Labrador Retriever,Buff
4,A674754,,03/18/2014 11:47:00 AM,03/18/2014 11:47:00 AM,03/12/2014,Transfer,Partner,Cat,Intact Male,6 days,Domestic Shorthair Mix,Orange Tabby


In [6]:
intakes.head()

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Found Location,Intake Type,Intake Condition,Animal Type,Sex upon Intake,Age upon Intake,Breed,Color
0,A786884,*Brock,01/03/2019 04:19:00 PM,01/03/2019 04:19:00 PM,2501 Magin Meadow Dr in Austin (TX),Stray,Normal,Dog,Neutered Male,2 years,Beagle Mix,Tricolor
1,A706918,Belle,07/05/2015 12:59:00 PM,07/05/2015 12:59:00 PM,9409 Bluegrass Dr in Austin (TX),Stray,Normal,Dog,Spayed Female,8 years,English Springer Spaniel,White/Liver
2,A724273,Runster,04/14/2016 06:43:00 PM,04/14/2016 06:43:00 PM,2818 Palomino Trail in Austin (TX),Stray,Normal,Dog,Intact Male,11 months,Basenji Mix,Sable/White
3,A665644,,10/21/2013 07:59:00 AM,10/21/2013 07:59:00 AM,Austin (TX),Stray,Sick,Cat,Intact Female,4 weeks,Domestic Shorthair Mix,Calico
4,A682524,Rio,06/29/2014 10:38:00 AM,06/29/2014 10:38:00 AM,800 Grove Blvd in Austin (TX),Stray,Normal,Dog,Neutered Male,4 years,Doberman Pinsch/Australian Cattle Dog,Tan/Gray


Below, column names are converted to snake case for both datasets.

In [7]:
outcomes.columns = outcomes.columns.str.lower().str.replace(' ', '_')
intakes.columns = intakes.columns.str.lower().str.replace(' ', '_')

## I. Dropping Columns

In [8]:
outcomes.columns

Index(['animal_id', 'name', 'datetime', 'monthyear', 'date_of_birth',
       'outcome_type', 'outcome_subtype', 'animal_type', 'sex_upon_outcome',
       'age_upon_outcome', 'breed', 'color'],
      dtype='object')

#### `monthyear`

The `monthyear` column is dropped from each dataset, as it is a copy of the `datetime` column.

In [9]:
outcomes.drop(columns=['monthyear'], inplace=True)
intakes.drop(columns=['monthyear'], inplace=True)
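
Before dropping a column as a duplicate, it can be worth confirming that it really matches its counterpart. The following is a minimal sketch on a toy frame, not part of the original notebook: the two rows are copied from the Outcomes preview above, and `toy` and `is_copy` are illustrative names.

```python
import pandas as pd

# Toy frame with two rows copied from the Outcomes preview above.
toy = pd.DataFrame({
    'datetime': ['05/08/2019 06:20:00 PM', '07/18/2018 04:02:00 PM'],
    'monthyear': ['05/08/2019 06:20:00 PM', '07/18/2018 04:02:00 PM'],
})

# True only when every row holds the same value in both columns.
is_copy = bool((toy['datetime'] == toy['monthyear']).all())
print(is_copy)
```

Running the same comparison on the full `outcomes` and `intakes` frames would confirm the duplication across every row rather than just the previewed ones.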

#### `found_location`

In [10]:
intakes['found_location'].value_counts()

Austin (TX)                               24010
Travis (TX)                                2137
Outside Jurisdiction                       1593
7201 Levander Loop in Austin (TX)           897
Manor (TX)                                  657
                                          ...  
25Th St And San Gabriel in Austin (TX)        1
4601 E Slaughter Ln in Austin (TX)            1
Kenneth Avenue in Austin (TX)                 1
18508 Deep Water Dr in Travis (TX)            1
4401 Freidrich in Austin (TX)                 1
Name: found_location, Length: 55484, dtype: int64

The `found_location` column is dropped from the Intakes dataset. With 55,484 distinct values, many of them street addresses that appear only once, it is too granular to be useful as a feature.

In [11]:
intakes.drop(columns=['found_location'], inplace=True)

#### `date_of_birth`

The `date_of_birth` column is dropped from the Outcomes dataset, as it is less consistent than the `age_upon_outcome` column.

In [12]:
outcomes.drop(columns=['date_of_birth'], inplace=True)

## II. Dropping Erroneous Entries

Some erroneous entries in the age columns are displayed below. Due to their infrequency, they are simply hard-coded out of each dataset.

In [13]:
outcomes['age_upon_outcome'].unique()

array(['2 years', '1 year', '4 months', '6 days', '7 years', '2 months',
       '2 days', '3 weeks', '9 months', '4 weeks', '2 weeks', '3 months',
       '9 years', '10 years', '6 months', '8 years', '3 years',
       '7 months', '6 years', '4 years', '1 month', '12 years', '5 years',
       '1 weeks', '5 months', '5 days', '15 years', '11 months',
       '10 months', '4 days', '16 years', '1 day', '8 months', '11 years',
       '13 years', '1 week', '14 years', '3 days', '0 years', '5 weeks',
       '17 years', '18 years', '20 years', '22 years', '-2 years',
       '19 years', '23 years', '24 years', '-1 years', '25 years',
       '21 years', '-3 years', nan], dtype=object)

In [14]:
outcomes = outcomes[~(outcomes['age_upon_outcome'] == '0 years')]
outcomes = outcomes[~(outcomes['age_upon_outcome'] == '-1 years')]
outcomes = outcomes[~(outcomes['age_upon_outcome'] == '-2 years')]
outcomes = outcomes[~(outcomes['age_upon_outcome'] == '-3 years')]
outcomes = outcomes[~(outcomes['age_upon_outcome'].isnull())]

In [15]:
intakes['age_upon_intake'].unique()

array(['2 years', '8 years', '11 months', '4 weeks', '4 years', '6 years',
       '5 months', '14 years', '1 month', '2 months', '18 years',
       '4 months', '1 year', '6 months', '3 years', '4 days', '1 day',
       '5 years', '2 weeks', '15 years', '7 years', '3 weeks', '3 months',
       '12 years', '1 week', '9 months', '10 years', '10 months',
       '7 months', '9 years', '8 months', '1 weeks', '5 days', '2 days',
       '11 years', '0 years', '17 years', '3 days', '13 years', '5 weeks',
       '19 years', '6 days', '16 years', '20 years', '-1 years',
       '22 years', '23 years', '-2 years', '21 years', '-3 years',
       '25 years', '24 years'], dtype=object)

In [16]:
intakes = intakes[~(intakes['age_upon_intake'] == '0 years')]
intakes = intakes[~(intakes['age_upon_intake'] == '-1 years')]
intakes = intakes[~(intakes['age_upon_intake'] == '-2 years')]
intakes = intakes[~(intakes['age_upon_intake'] == '-3 years')]
intakes = intakes[~(intakes['age_upon_intake'].isnull())]
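
The same filtering can be expressed as a single mask with `Series.isin`. This is a sketch on hypothetical toy data, not the notebook's own cell; `toy`, `invalid_ages`, and `cleaned` are illustrative names, and the list of invalid strings is taken from the `unique()` output above.

```python
import pandas as pd

# Hypothetical toy data mirroring the invalid values seen above.
toy = pd.DataFrame(
    {'age_upon_intake': ['2 years', '0 years', '-1 years', None, '4 weeks']}
)

# Age strings treated as erroneous in this notebook.
invalid_ages = ['0 years', '-1 years', '-2 years', '-3 years']

# Keep rows whose age is present and not one of the invalid strings.
mask = toy['age_upon_intake'].notna() & ~toy['age_upon_intake'].isin(invalid_ages)
cleaned = toy[mask].copy()
print(cleaned['age_upon_intake'].tolist())
```

Taking an explicit `.copy()` after filtering also avoids pandas warnings about assigning to a slice when new columns are added later.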

## III. Null Values

For the null values in the Intakes dataset, the single `NaN` entry in the `sex_upon_intake` column is replaced with `Unknown`. The null values in the `name` column in both datasets will be modified later.

In [17]:
intakes.isnull().sum()

animal_id               0
name                39624
datetime                0
intake_type             0
intake_condition        0
animal_type             0
sex_upon_intake         1
age_upon_intake         0
breed                   0
color                   0
dtype: int64

In [18]:
# Assign the result back rather than calling replace(..., inplace=True) on a column selection
intakes['sex_upon_intake'] = intakes['sex_upon_intake'].fillna('Unknown')

In [19]:
outcomes.isnull().sum()

animal_id               0
name                39951
datetime                0
outcome_type           23
outcome_subtype     69923
animal_type             0
sex_upon_outcome        1
age_upon_outcome        0
breed                   0
color                   0
dtype: int64

For the null values in the Outcomes dataset, entries with null values in the `outcome_type` column are simply dropped, as this information is essential to the modeling phase. `NaN` entries in the `outcome_subtype` and `sex_upon_outcome` columns are replaced with `Unknown`. The null values in the `name` column in both datasets will be modified later.

In [20]:
outcomes = outcomes[~(outcomes['outcome_type'].isnull())]

In [21]:
# Fill remaining nulls; rows with a null outcome_type were already dropped in the previous cell
outcomes['outcome_subtype'] = outcomes['outcome_subtype'].fillna('Unknown')
outcomes['sex_upon_outcome'] = outcomes['sex_upon_outcome'].fillna('Unknown')
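
The drop-then-fill pattern described above can also be written as one chained expression. The sketch below uses a hypothetical three-row frame (`toy` and `filled` are illustrative names) with `DataFrame.dropna(subset=...)` and a per-column `fillna` mapping.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with gaps in each of the three columns.
toy = pd.DataFrame({
    'outcome_type': ['Adoption', np.nan, 'Transfer'],
    'outcome_subtype': [np.nan, np.nan, 'Partner'],
    'sex_upon_outcome': ['Neutered Male', 'Intact Male', np.nan],
})

# Drop rows missing the target, then fill the remaining gaps per column.
filled = (
    toy.dropna(subset=['outcome_type'])
       .fillna({'outcome_subtype': 'Unknown', 'sex_upon_outcome': 'Unknown'})
)
print(filled)
```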

## IV. Creating Columns

#### `age_upon_outcome`,  `age_upon_intake`

The age columns in their original form have entries on several different time scales (years, months, weeks, and days). Here a function takes in the string form of each age and converts it to years as a float:

In [22]:
def clean_age(age):
    """
        Takes a string form of age (e.g. '4 months') and returns it in years as a float.
    """
    age_f = age.split(' ')
    if (age_f[1] == 'day' or age_f[1] == 'days'):
        return round(float(age_f[0])/365, 3)
    elif (age_f[1] == 'week' or age_f[1] == 'weeks'):
        return round(float(age_f[0])/52.143, 3)
    elif (age_f[1] == 'month' or age_f[1] == 'months'):
        return round(float(age_f[0])/12, 3)
    else:
        # Years: use the full number (age_f[0]), not the first character of the string
        return float(age_f[0])

The `age_upon_outcome` and `age_upon_intake` columns are overwritten with numeric values using the function above:

In [23]:
outcomes['age_upon_outcome'] = outcomes['age_upon_outcome'].apply(clean_age).astype(float)
intakes['age_upon_intake'] = intakes['age_upon_intake'].apply(clean_age).astype(float)
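
As a quick sanity check on the conversion, a table-driven variant makes the divisors explicit and is easy to spot-check on multi-digit values such as `'10 years'`. This is an alternative sketch, not the notebook's function; `UNITS_PER_YEAR` and `age_to_years` are hypothetical names, and the divisors simply mirror those used above.

```python
# Assumed conversion factors (units per year), mirroring the divisors used above.
UNITS_PER_YEAR = {'day': 365, 'week': 52.143, 'month': 12, 'year': 1}

def age_to_years(age):
    """Convert a string such as '4 months' to a float number of years."""
    value, unit = age.split(' ')
    unit = unit.rstrip('s')  # treat 'week' and 'weeks' alike
    return round(float(value) / UNITS_PER_YEAR[unit], 3)

examples = ['6 days', '3 weeks', '4 months', '1 year', '10 years']
print([age_to_years(a) for a in examples])
```

An unexpected unit raises a `KeyError` here instead of silently falling through to the years branch, which may be preferable if the source data changes.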

#### `is_named`

In [24]:
outcomes['is_named'] = outcomes['name'].notnull().astype(int)
intakes['is_named'] = intakes['name'].notnull().astype(int)

#### `month`, `year`, `day`

The `datetime` columns are converted to the appropriate datatype. Separate `year`, `month`, and `day` (day of the week) columns are then engineered from the timestamp, to test later whether any of these influences animal outcomes.

In [25]:
outcomes['datetime'] = pd.to_datetime(outcomes['datetime'])
intakes['datetime'] = pd.to_datetime(intakes['datetime'])

In [26]:
outcomes['year'] = outcomes['datetime'].dt.year.astype(object)
intakes['year'] = intakes['datetime'].dt.year.astype(object)

outcomes['month'] = outcomes['datetime'].dt.month.astype(object)
intakes['month'] = intakes['datetime'].dt.month.astype(object)

outcomes['day'] = outcomes['datetime'].dt.day_name().astype(object)
intakes['day'] = intakes['datetime'].dt.day_name().astype(object)

#### `age_range`

The `age_range` column categorizes an animal's age into one of five discrete ranges.

In [27]:
bins = [0, 0.5, 2, 5, 8, np.inf]
names = ['< 6 Months', '6 Months-2 Years', '2 Years-5 Years', 
         '5 Years-8 Years', '8 Years+']

intakes['age_range'] = pd.cut(intakes['age_upon_intake'], bins, labels=names)
outcomes['age_range'] = pd.cut(outcomes['age_upon_outcome'], bins, labels=names)
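
To see how boundary values are assigned, the sketch below applies the same bins and labels to a few hypothetical ages (`ages` and `ranges` are illustrative names). By default `pd.cut` uses right-inclusive intervals, so an age of exactly 0.5 lands in `< 6 Months`, an age of exactly 2 lands in `6 Months-2 Years`, and an age of exactly 0 falls outside the first bin and becomes `NaN` unless `include_lowest=True` is passed.

```python
import numpy as np
import pandas as pd

bins = [0, 0.5, 2, 5, 8, np.inf]
names = ['< 6 Months', '6 Months-2 Years', '2 Years-5 Years',
         '5 Years-8 Years', '8 Years+']

# Hypothetical ages chosen to sit on and around the bin edges.
ages = pd.Series([0.0, 0.25, 0.5, 2.0, 3.0, 9.0])
ranges = pd.cut(ages, bins, labels=names)
print(ranges.tolist())
```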

#### `sex`

In [28]:
intakes['sex'] = intakes['sex_upon_intake'].str.contains("Male").map({True: 'Male', False:'Female'})
outcomes['sex'] = outcomes['sex_upon_outcome'].str.contains("Male").map({True: 'Male', False:'Female'})

#### `is_neutered`

In [29]:
intakes['is_neutered'] = intakes['sex_upon_intake'].str.split(' ').str[0]
intakes['is_neutered'] = (intakes['is_neutered'] != 'Intact').map({True: 'Neutered/Spayed', False:'Intact'})

outcomes['is_neutered'] = outcomes['sex_upon_outcome'].str.split(' ').str[0]
outcomes['is_neutered'] = (outcomes['is_neutered'] != 'Intact').map({True: 'Neutered/Spayed', False:'Intact'})
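
One thing to keep in mind with the `sex` and `is_neutered` cells above: an `Unknown` entry contains neither `Male` nor `Intact`, so it is labeled `Female` and `Neutered/Spayed`. If keeping those rows distinct is preferred, a variant along the following lines could be used. This is a sketch under that assumption, on hypothetical toy values; `toy`, `parts`, `status`, and `sex_token` are illustrative names.

```python
import pandas as pd

# Hypothetical values covering each form seen in the sex columns.
toy = pd.Series(['Neutered Male', 'Spayed Female', 'Intact Male',
                 'Intact Female', 'Unknown'])

# Split 'Status Sex' into two columns; 'Unknown' has no second token.
parts = toy.str.split(' ', expand=True)
status, sex_token = parts[0], parts[1]

# Missing sex token means the original value was 'Unknown'.
sex = sex_token.fillna('Unknown')

# Map the known statuses explicitly; anything else stays 'Unknown'.
is_neutered = status.map({
    'Intact': 'Intact',
    'Neutered': 'Neutered/Spayed',
    'Spayed': 'Neutered/Spayed',
}).fillna('Unknown')

print(sex.tolist())
print(is_neutered.tolist())
```

Whether `Unknown` deserves its own category or should be dropped depends on how many such rows exist and how the modeling phase treats them.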

#### `mix`

In [30]:
intakes['mix'] = intakes['breed'].str.contains("Mix").astype(int)
outcomes['mix'] = outcomes['breed'].str.contains("Mix").astype(int)

## V. Dropping Duplicate Information

In [31]:
intakes.drop(columns='sex_upon_intake', inplace=True)

In [32]:
outcomes.drop(columns='sex_upon_outcome', inplace=True)

## VI. Saving Work

With initial cleaning done, the datasets are written to the `data` folder in a more prepared form, ready for EDA and feature engineering.

In [36]:
outcomes.to_csv('../data/outcomes_initial.csv', index=False)
intakes.to_csv('../data/intakes_initial.csv', index=False)

# ***Next Notebook*** - [02: Initial EDA](/02_Initial_EDA.ipynb)