## clean_animal_services_data
### this file cleans animal services data, including 
1. grouping factors into larger bins for categorical features 
2. scaling numeric features
3. getting dummies for categorical feature
4. labeling target as either 'adopted' or 'not_adopted'

### Data Source


https://data.louisvilleky.gov/dataset/animal-service-intake-and-outcome

https://data.louisvilleky.gov/dataset/animal-service-intake-and-outcome/resource/3a416835-fa66-4abc-bba7-3314435b26e9#{view-graph:{graphOptions:{hooks:{processOffset:{},bindEvents:{}}}},graphOptions:{hooks:{processOffset:{},bindEvents:{}}}}

### Data Dictionary

* __Animal ID__ - A generated unique identification when an animal's information is stored in the Chameleon Data Base
* __Animal Type__ - Type of animal
* __Intake Date__ - The date that the animal arrives at Metro Animal Services
* __Intake Type__ - The reason the animal is at Metro Animal Services
* __Intake Subtype__ - A secondary but more in depth reason to why the animal is at Metro Animal Services
* __Primary Color__ - The color that is most prevalent in the animal
* __Primary Breed__ - The breed of the animal or the breed the animal looks like the most
* __Secondary Breed__ - The other breed that the animal looks like
* __Gender__ - Sex of the animal
* __Secondary Color__ - A further description of the animal's color
* __DOB__ - The date of birth of the animal or an estimated date of birth
* __Intake Reason__ - The primary reason the animal is at Metro Animal Services
* __Outcome Date__ - The date the outcome is entered, if no outcome date is available the animal is still in the shelter
* __Outcome Type__ - The type of outcome for the animal that can include returned to owner, * adoption, sent to a rescue, etc.
* __Outcome Subtype__ - A secondary, more in depth definition of the outcome type (ex. Transfer, rescue group vs. Transfer, KHS)

In [468]:
import pandas as pd

In [469]:
df = pd.read_csv('data/Animal_IO_Data_1.csv')

In [470]:
#df.info()

In [471]:
df.drop_duplicates(inplace=True)

In [472]:
df = df[~pd.isnull(df['OutcomeType'])]

In [473]:
df = df[df['IntakeType']!='DEAD']

In [474]:
# drop OutcomeReason as it's entirely NaN
df.drop('OutcomeReason',axis=1, inplace=True)

In [475]:
df['IntakeDate'] = df.IntakeDate.apply(pd.to_datetime)
df['OutcomeDate'] = df.OutcomeDate.apply(pd.to_datetime)
df['DOB'] = df.DOB.apply(pd.to_datetime)

In [476]:
def data_info(df, data_name):

    print(data_name+ ' has ',df.shape[0], 'records. Per column \n')

    for col in df.columns:
        print(col, 'has ',len(df[col].value_counts()),'unique values and ',sum(pd.isnull(df[col])),'NaNs')

In [477]:
df['IntakeYear']=df['IntakeDate'].dt.year
df['IntakeMonth']=df['IntakeDate'].dt.month

df['OutcomeYear']=df['IntakeDate'].dt.year
df['OutcomeMonth']=df['IntakeDate'].dt.month


In [478]:
df['IntakeAgeInDays']=(df['IntakeDate']-df['DOB']).astype('timedelta64[h]')/24.0

In [479]:
df['OutcomeAgeInDays']=(df['OutcomeDate']-df['DOB']).astype('timedelta64[h]')/24.0

In [480]:
df['DaysInShelter']=(df['OutcomeDate']-df['IntakeDate']).astype('timedelta64[h]')/24.0

In [481]:
#df.info()

In [482]:
df['SecondaryBreed']=df['SecondaryBreed'].astype(str)
df['SecondaryColor']=df['SecondaryColor'].astype(str)
df['OutcomeSubtype']=df['OutcomeSubtype'].astype(str)

In [483]:
df['PrimaryColor']=df['PrimaryColor'].astype(str)
df['IntakeReason']=df['IntakeReason'].astype(str)
df['OutcomeInternalStatus']=df['OutcomeInternalStatus'].astype(str)


In [484]:
df['OutcomeAsilomarStatus']=df['OutcomeAsilomarStatus'].astype(str)
df['ReproductiveStatusAtOutcome']=df['ReproductiveStatusAtOutcome'].astype(str)
df['IntakeSubtype']=df['IntakeSubtype'].astype(str)

In [485]:
df['HasIntakeAge']=df['IntakeAgeInDays'].apply(pd.isnull)
df['HasOutcomeAge']=df['OutcomeAgeInDays'].apply(pd.isnull)
df['HasDaysInShelter']=df['DaysInShelter'].apply(pd.isnull)

In [486]:
df['HasIntakeAge']=df['HasIntakeAge'].astype(int)
df['HasOutcomeAge']=df['HasOutcomeAge'].astype(int)
df['HasDaysInShelter']=df['HasDaysInShelter'].astype(int)

In [487]:
df['IntakeAgeInDays'].fillna(-999, inplace=True)
df['OutcomeAgeInDays'].fillna(-999, inplace=True)
df['DaysInShelter'].fillna(-999, inplace=True)

In [488]:
df=df.drop(['DOB','IntakeDate','OutcomeDate'],axis=1)

In [489]:
freq=df.AnimalID.value_counts()
def get_numTimesInShelter(animalId):
    return freq[animalId]
df['TimesInShelter']=df.AnimalID.apply(get_numTimesInShelter)

In [490]:
df=df.drop(['AnimalID'],axis=1)

In [491]:
from sklearn import preprocessing

df['ScaledIntakeYear']=preprocessing.scale(df['IntakeYear'])
df['ScaledIntakeMonth']=preprocessing.scale(df['IntakeMonth'])
df['ScaledOutcomeYear']=preprocessing.scale(df['OutcomeYear'])
df['ScaledOutcomeMonth']=preprocessing.scale(df['OutcomeMonth'])
df['ScaledIntakeAgeInDays']=preprocessing.scale(df['IntakeAgeInDays'])
df['ScaledOutcomeAgeInDays']=preprocessing.scale(df['OutcomeAgeInDays'])
df['ScaledDaysInShelter']=preprocessing.scale(df['DaysInShelter'])




In [492]:
#df=df.drop(['IntakeYear','IntakeMonth','OutcomeYear','OutcomeMonth','IntakeAgeInDays','OutcomeAgeInDays','DaysInShelter'],axis=1)

In [493]:
df['intake_year_month'] = df.apply(lambda row: str(row.IntakeYear)+'-' + str(row.IntakeMonth), axis=1)
df['outcome_year_month'] = df.apply(lambda row: str(row.OutcomeYear)+'-' + str(row.OutcomeMonth), axis=1)


In [494]:
df2=pd.DataFrame.copy(df,deep=True)

In [495]:
not_adopted = list(df['OutcomeType'].value_counts().index)
not_adopted.remove('ADOPTION')

for outcome in not_adopted: 
    df['OutcomeType'].replace(outcome,"not_adopted",inplace=True)

In [496]:
df['OutcomeType'].value_counts()

not_adopted    124576
ADOPTION        25612
Name: OutcomeType, dtype: int64

In [497]:
df['AnimalType'].value_counts()

DOG          76764
CAT          68503
OTHER         1322
BIRD          1176
RABBIT        1138
RODENT         680
REPTILE        262
LIVESTOCK      252
FERRET          91
Name: AnimalType, dtype: int64

In [498]:
others = ['RODENT','REPTILE','LIVESTOCK','FERRET','BIRD','RABBIT']
for animal_type in others: 
    df['AnimalType'].replace(animal_type,'OTHER',inplace=True)
df['AnimalType'].value_counts()

DOG      76764
CAT      68503
OTHER     4921
Name: AnimalType, dtype: int64

In [499]:
df['IntakeType'].value_counts()

STRAY         93510
OWNER SUR     33749
CONFISCATE     5069
FOSTER         4912
EUTH REQ       3701
OUTSURGERY     3462
RETURN         2254
ET REQUEST     1097
DISPOSAL       1057
QUARANTINE      657
KHS             621
INVESTIGAT       39
TRANSFER         26
FOR TRANSP       20
MED OBSERV        8
EVACUEE           3
FOUND             2
LOST              1
Name: IntakeType, dtype: int64

In [500]:
df = df[df['IntakeType']!='DISPOSAL']
df = df[df['IntakeType']!='EUTH REQ']

In [501]:
df['IntakeType'].value_counts()

STRAY         93510
OWNER SUR     33749
CONFISCATE     5069
FOSTER         4912
OUTSURGERY     3462
RETURN         2254
ET REQUEST     1097
QUARANTINE      657
KHS             621
INVESTIGAT       39
TRANSFER         26
FOR TRANSP       20
MED OBSERV        8
EVACUEE           3
FOUND             2
LOST              1
Name: IntakeType, dtype: int64

In [502]:
df['IntakeType'].nunique()

16

In [503]:
medical = ['OUTSURGERY','QUARANTINE','MED OBSERV']
others = ['ET REQUEST','KHS','INVESTIGAT','TRANSFER','FOR TRANSP','EVACUEE','FOUND','LOST','POLICE','CONFISCATE']

for intake_type in medical: 
    df['IntakeType'].replace(intake_type,'medical',inplace=True)

for intake_type in others: 
    df['IntakeType'].replace(intake_type,'OTHER',inplace=True)
df['IntakeType'].value_counts()

STRAY        93510
OWNER SUR    33749
OTHER         6878
FOSTER        4912
medical       4127
RETURN        2254
Name: IntakeType, dtype: int64

In [504]:
df = df[df['IntakeSubtype']!='EUTH REQ']
df['IntakeSubtype'].value_counts()
behavior = ['DANGER DOG','BITE','AGGRESSIVE']
health=['SICK','HOSPITAL','POST SURG']
others=['NEGLECT','OWNER DIED','AN CONTROL','RESCUE GRP','OWNER SUR','COURT ORD','FOSTER','NUISANCE','WEB','NIGHT','FIELD OWN',' ','UNPERMITED','EVICTION']
for intake_subtype in behavior: 
    df['IntakeSubtype'].replace(intake_subtype,'behavior',inplace=True)
for intake_subtype in health: 
    df['IntakeSubtype'].replace(intake_subtype,'health',inplace=True)
for intake_subtype in others: 
    df['IntakeSubtype'].replace(intake_subtype,'OTHER',inplace=True)
others = df['IntakeSubtype'].value_counts().index[6:]
for intake_subtype in others: 
    df['IntakeSubtype'].replace(intake_subtype,'OTHER',inplace=True)


df['IntakeSubtype'].value_counts()

OTC         69451
FIELD       54719
OTHER        7988
RETURN       4838
nan          4433
ADOPTION     1164
Name: IntakeSubtype, dtype: int64

In [505]:
(df.PrimaryColor.value_counts())

BLACK                39738
WHITE                16323
BROWN                15662
GRAY                 11077
TAN                   7207
BROWN TABBY           6802
GRAY TABBY            4172
CALICO                3797
ORANGE TABBY          3713
TRICOLOR              3632
BROWN BRINDLE         3519
ORANGE                3464
TORTIE                3227
RED                   2854
BLACK TABBY           1723
YELLOW                1464
BLUE                  1448
GRAY TIGER            1364
BUFF                  1363
CHOCOLATE             1221
CREAM                 1111
BLACK BRINDLE          884
BROWN TIGER            762
FAWN                   747
BLACK TIGER            687
ORANGE TIGER           555
GOLD                   551
SEAL POINT             387
LYNX POINT             292
BLUE MERLE             266
                     ...  
TORTIE POINT           105
SILVER TIGER           103
BLUE POINT              90
GREEN                   68
PINK                    65
LIVER                   61
R

In [506]:
import numpy as np
def color_map(color):
    if color=='nan' or pd.isnull(color):
        return 'nan'
    
    blacks =['BLACK','BLACK TABBY','BLACK BRINDLE','BLACK TIGER']
    browns=['BROWN','TAN','BROWN TABBY','BROWN BRINDLE','CHOCOLATE']
    grays=['GRAY','GRAY TABBY','TORTIE']
    whites=['WHITE','CREAM','BUFF']
    calico=['CALICO','TRICOLOR']
    yellows=['ORANGE TABBY','YELLOW','ORANGE']
    
    if color in blacks:
        return 'BLACK'
    if color in browns:
        return 'BROWN'
    if color in grays:
        return 'GRAY'
    if color in whites:
        return 'WHITE'
    if color in calico:
        return 'CALICO'
    if color in yellows:
        return 'YELLOW'
    return 'OTHER'

In [507]:
#df[~pd.isnull(df['PrimaryColor'])]['PrimaryColor']=df[~pd.isnull(df['PrimaryColor'])]['PrimaryColor'].apply(color_map)

df['PrimaryColor'] = df['PrimaryColor'].apply(color_map)

In [508]:
df['PrimaryColor'].value_counts()

BLACK     43032
BROWN     34411
WHITE     18797
GRAY      18476
OTHER     11796
YELLOW     8641
CALICO     7429
nan          11
Name: PrimaryColor, dtype: int64

In [509]:
df['SecondaryColor'].value_counts()

nan               66102
WHITE             40786
BLACK             10619
BROWN              7705
TAN                5409
GRAY               3230
MUTED               921
ORANGE              794
TRICOLOR            698
BROWN BRINDLE       549
RED                 541
BROWN TABBY         539
CALICO              526
CREAM               457
TORTIE              456
BLACK TABBY         418
GRAY TABBY          336
YELLOW              272
ORANGE TABBY        212
SILVER              201
BLACK BRINDLE       198
BUFF                186
BLUE                179
BLACK TIGER         144
GRAY TIGER          141
GOLD                126
BROWN TIGER         108
CHOCOLATE            96
BLUE MERLE           56
ORANGE TIGER         48
                  ...  
SILVER TABBY         22
APRICOT              20
PINK                 14
FLAME POINT          14
BLUE TICKED          14
LYNX POINT           13
CREAM TABBY          11
CALICO POINT         11
BLUE POINT            9
RED MERLE             9
RED TICKED      

In [510]:
df['SecondaryColor'] = df['SecondaryColor'].apply(color_map)

In [511]:
#df['SecondaryColor'] = df2['SecondaryColor'].apply(color_map)

In [512]:
df['SecondaryColor'].value_counts()

nan       66102
WHITE     41429
BROWN     14298
BLACK     11379
GRAY       4022
OTHER      2861
YELLOW     1278
CALICO     1224
Name: SecondaryColor, dtype: int64

In [513]:
df['PrimaryBreed'].value_counts()[50:]

LHASA APSO                    257
SHETLAND SHEEPDOG             249
CHINESE SHARPEI               244
BLACK AND TAN COONOUND        240
PEKINGESE                     234
GREAT PYRENEES                232
STAFFORDSHIRE BULL TERRIER    229
ALASKAN HUSKY                 227
CANE CORSO                    226
POODLE - TOY                  215
GREAT DANE                    191
PIGEON                        183
DALMATIAN                     175
AMERICAN ESKIMO               158
POINTER                       158
WELSH CORGI - CARDIGAN        149
RUSSIAN BLUE                  148
ENGLISH BULLDOG               147
SNAKE                         143
FOX TERRIER - WIREHAIRED      138
TREEING WALKER COONHOUND      134
REX                           130
CATAHOULA LEOPARD HOUND       129
BENGAL                        127
HAMSTER                       126
PERSIAN                       125
BLOODHOUND                    125
SNOWSHOE                      124
RAT                           123
DUTCH         

In [514]:
def breed_map(breed):
    result = {}
    
    if pd.isnull(breed) or breed=='nan': 
        return 'nan'
    
    result['DOMESTIC SHORTHAIR']='SHORTHAIR CAT'
    result['OTHER']='OTHER'
    result['PIT BULL TERRIER']='MEDIUM DOG'
    result['LABRADOR RETRIEVER'] ='LARGE DOG'
    result['GERMAN SHEPHERD DOG'] = 'LARGE DOG'
    result['BEAGLE'] ='SMALL DOG'
    result['DOMESTIC LONGHAIR']='LONGHAIR CAT'
    result['BOXER'] = 'MEDIUM DOG'
    result['CHIHUAHUA'] = 'SMALL DOG'
    result['AMERICAN PIT BULL TERRIER'] = 'MEDIUM DOG'
    result['ROTTWEILER'] = 'LARGE DOG'
    result['CHOW CHOW'] = 'MEDIUM DOG'
    result['AMERICAN SHORTHAIR']= 'SHORTHAIR CAT'
    result['BORDER COLLIE'] = 'MEDIUM DOG'
    result['SHIH TZU'] = 'SMALL DOG'
    result['JACK RUSS TER'] = 'SMALL DOG'
    result['SIAMESE'] = 'SHORTHAIR CAT'
    result['SIBERIAN HUSKY'] = 'LARGE DOG'
    result['POODLE - MINIATURE'] = 'SMALL DOG'
    result['AUSTRALIAN SHEPHERD']='MEDIUM DOG'
    result['YORKSHIRE TERRIER'] = 'SMALL DOG'
    result['DACHSHUND'] = 'SMALL DOG'
    result['GOLDEN RETRIEVER'] = 'LARGE DOG'
    result['COCKER SPANIEL'] = 'SMALL DOG'
    result['PUG'] = 'SMALL DOG'
    result['POMERANIAN'] = 'SMALL DOG'
    result['AUSTRALIAN CATTLE DOG'] = 'MEDIUM DOG'
    result['MINIATURE PINSCHER'] = 'SMALL DOG'
    result['AMERICAN STAFFORDSHIRE TERRIER'] = 'MEDIUM DOG'
    result['CHICKEN'] = 'CHICKEN'
    
    result['AMERICAN BULLDOG'] ='LARGE DOG'
    result['BASSET HOUND'] = 'MEDIUM DOG'
    result['SCHNAUZER - MINIATURE'] = 'SMALL DOG'
    result['PARSON (JACK) RUSSELL TERRIER'] = 'SMALL DOG'
    result['RAT TERRIER'] = 'SMALL DOG'
    result['BOSTON TERRIER'] = 'SMALL DOG'
    result['DOBERMAN PINSCHER'] = 'LARGE DOG'
    result['COLLIE - SMOOTH ']='LARGE DOG'
    result['MASTIFF']='LARGE DOG'
    result['CAIRN TERRIER']='SMALL DOG'
    result['AKITA']='LARGE DOG'
    result['CHIHUAHUA - LONG HAIRED']= 'SMALL DOG'
    result['MALTESE']='SMALL DOG'
    result['COLLIE - ROUGH']='SMALL DOG'
    result['MAINE COON ']= 'LONGHAIR CAT'
    result['LHASA APSO']='SMALL DOG'
    result['SHETLAND SHEEPDOG']='SMALL DOG'
    
    result['CHINESE SHARPEI']= 'MEDIUM DOG'
    result['BLACK AND TAN COONOUND']= 'LARGE DOG'
    result['PEKINGESE']='SMALL DOG'
    result['GREAT PYRENEES']= 'LARGE DOG'
    result['STAFFORDSHIRE BULL TERRIER']='SMALL DOG'
    result['ALASKAN HUSKY']= 'MEDIUM DOG'
    result['CANE CORSO']=  'LARGE DOG'                  
    result['POODLE - TOY']='SMALL DOG'
    result['GREAT DANE']='LARGE DOG'
    result['DALMATIAN']='MEDIUM DOG'
    result['POINTER']='MEDIUM DOG'
    result['AMERICAN ESKIMO']='SMALL DOG'
    result['WELSH CORGI - CARDIGAN']='MEDIUM DOG'
    result['RUSSIAN BLUE']='SHORTHAIR CAT'
    result['ENGLISH BULLDOG']='MEDIUM DOG'
    result['FOX TERRIER - WIREHAIRED']='SMALL DOG'
    result['TREEING WALKER COONHOUND']='MEDIUM DOG'
    
    if breed in result.keys():
        return result[breed]
    else:
        return 'OTHER'
    
    

    
    
    
    

In [515]:
df['PrimaryBreed'] = df['PrimaryBreed'].apply(breed_map)

In [516]:
df['PrimaryBreed'].value_counts()

SHORTHAIR CAT    56395
MEDIUM DOG       25977
LARGE DOG        20256
OTHER            19154
SMALL DOG        17164
LONGHAIR CAT      3104
CHICKEN            543
Name: PrimaryBreed, dtype: int64

In [517]:
df['SecondaryBreed'] = df['SecondaryBreed'].apply(breed_map)

In [518]:
df['SecondaryBreed'].value_counts()

nan              104459
OTHER             26907
MEDIUM DOG         3313
SMALL DOG          3202
LARGE DOG          3200
SHORTHAIR CAT      1390
LONGHAIR CAT        122
Name: SecondaryBreed, dtype: int64

In [519]:
(df.IntakeReason.value_counts())[30]

100

In [520]:
df.IntakeInternalStatus.value_counts()

NORMAL        101566
SICK            6733
INJURED         6123
FEARFUL         6110
NURSING         5322
FERAL           4132
OTHER           3295
AGGRESSIVE      2813
AGED            2425
DEAD            1575
EMACIATED       1044
PREGNANT         678
OBESE            218
PARVO            114
AGG PEOPLE       100
HEARTWORM         89
TERITORIAL        66
AGG FEAR          56
AGG ANIMAL        50
FIV +             23
DEHYDRA           16
RINGWORM          15
FELV+             13
DIARRHEA           9
AGG FOOD           4
AGG BARRIE         4
Name: IntakeInternalStatus, dtype: int64

In [521]:
df = df[df['IntakeInternalStatus']!='DEAD']

In [522]:
df['IntakeInternalStatus'].value_counts().index

Index(['NORMAL', 'SICK', 'INJURED', 'FEARFUL', 'NURSING', 'FERAL', 'OTHER',
       'AGGRESSIVE', 'AGED', 'EMACIATED', 'PREGNANT', 'OBESE', 'PARVO',
       'AGG PEOPLE', 'HEARTWORM', 'TERITORIAL', 'AGG FEAR', 'AGG ANIMAL',
       'FIV +', 'DEHYDRA', 'RINGWORM', 'FELV+', 'DIARRHEA', 'AGG FOOD',
       'AGG BARRIE'],
      dtype='object')

In [523]:
def map_status(status):
    if pd.isnull(status) or status=='nan':
        return 'nan'
    medical = ['SICK', 'INJURED','NURSING','EMACIATED', 'PREGNANT', 'OBESE','PARVO','HEARTWORM','FIV +', 'DEHYDRA', 'RINGWORM', 'FELV+', 'DIARRHEA','DEAD']
    behavior = ['FEARFUL','FERAL','AGGRESSIVE','TERITORIAL','AGG FEAR', 'AGG ANIMAL','AGG PEOPLE','AGG BARRIE']
    if status in medical:
        return 'medical'
    if status in behavior:
        return 'behavior'
    if status == 'NORMAL' or status == 'AGED' or status =='DEAD' :
        return status
    return 'OTHER'

In [524]:
df['IntakeInternalStatus']=df['IntakeInternalStatus'].apply(map_status)

In [525]:
df['IntakeInternalStatus'].value_counts()

NORMAL      101566
medical      20397
behavior     13331
OTHER         3299
AGED          2425
Name: IntakeInternalStatus, dtype: int64

In [526]:
def reason_map(reason):
    
    if pd.isnull(reason) or reason=='nan':
        return 'nan'
    if reason =='STRAY':
        return reason
    if reason in ['AGG ANIMAL','BITES','HYPER','DESTRUC IN']:
        return 'behavior'
    return 'OTHER'
    

In [527]:
df['IntakeReason'] = df['IntakeReason'].apply(reason_map)

In [528]:
df['IntakeReason'].value_counts()

nan         109605
OTHER        18580
STRAY        11522
behavior      1311
Name: IntakeReason, dtype: int64

In [529]:
(df.OutcomeInternalStatus.value_counts())[10]

377

In [530]:
def map_status(status):
        if pd.isnull(status) or status == 'nan':
            return 'nan'
        medical = ['SICK', 'INJURED','NURSING','EMACIATED', 'PREGNANT', 'OBESE','PARVO','HEARTWORM','FIV +', 'DEHYDRA', 'RINGWORM', 'FELV+', 'DIARRHEA','DEAD']
        behavior = ['FEARFUL','FERAL','AGGRESSIVE','TERITORIAL','AGG FEAR', 'AGG ANIMAL','AGG PEOPLE','AGG BARRIE']
        if status in medical:
            return 'medical'
        if status in behavior:
            return 'behavior'
        if status == 'NORMAL' or status == 'AGED':
            return status
        return 'OTHER'

In [531]:
df['OutcomeInternalStatus']=df['OutcomeInternalStatus'].apply(map_status)

In [532]:
df['OutcomeInternalStatus'].value_counts()

nan         96640
NORMAL      26315
medical     12247
behavior     3697
OTHER        1114
AGED         1005
Name: OutcomeInternalStatus, dtype: int64

In [533]:
df.OutcomeAsilomarStatus .value_counts()

HEALTHY                  137872
UNHEALTHY/UNTREATABLE      2759
TREATABLE/MANAGEABLE        377
nan                          10
Name: OutcomeAsilomarStatus, dtype: int64

In [534]:
y=df.OutcomeType
X=df.drop(['OutcomeType','OutcomeSubtype'],axis=1)
clean_df=X
clean_df['OutcomeType']=y
clean_df['intake_year_month']=df['intake_year_month']
clean_df['outcome_year_month']=df['outcome_year_month']

clean_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 141018 entries, 0 to 150841
Data columns (total 36 columns):
AnimalType                     141018 non-null object
IntakeType                     141018 non-null object
IntakeSubtype                  141018 non-null object
PrimaryColor                   141018 non-null object
PrimaryBreed                   141018 non-null object
SecondaryBreed                 141018 non-null object
Gender                         141018 non-null object
SecondaryColor                 141018 non-null object
IntakeReason                   141018 non-null object
IntakeInternalStatus           141018 non-null object
IntakeAsilomarStatus           141018 non-null object
ReproductiveStatusAtIntake     141018 non-null object
OutcomeInternalStatus          141018 non-null object
OutcomeAsilomarStatus          141018 non-null object
ReproductiveStatusAtOutcome    141018 non-null object
IntakeYear                     141018 non-null int64
IntakeMonth               

In [535]:
import pickle
with open('/home/jieliang/proj3/mvp/mvp_data/eda-6-clean_data_without_dummies.pkl', 'wb') as fp:
    pickle.dump(clean_df, fp)

In [539]:

with open('/home/jieliang/proj3/mvp/mvp_data/eda-6-clean_data_without_dummies.pkl', 'rb') as fp:
    df = pickle.load(fp)

In [540]:
X = df.drop(['intake_year_month','outcome_year_month','OutcomeType'],axis=1)

In [541]:
X.shape

(141018, 33)

In [542]:
X=pd.get_dummies(X)
X.shape

(141018, 98)

In [543]:
X['intake_year_month']=df['intake_year_month']

In [544]:
X['outcome_year_month'] = df['outcome_year_month']
y=clean_df['OutcomeType']
X.shape

(141018, 100)

In [545]:
df= X
df['OutcomeType']=y
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 141018 entries, 0 to 150841
Columns: 101 entries, IntakeYear to OutcomeType
dtypes: float64(10), int64(8), object(3), uint8(80)
memory usage: 34.4+ MB


In [546]:
df.columns

Index(['IntakeYear', 'IntakeMonth', 'OutcomeYear', 'OutcomeMonth',
       'IntakeAgeInDays', 'OutcomeAgeInDays', 'DaysInShelter', 'HasIntakeAge',
       'HasOutcomeAge', 'HasDaysInShelter',
       ...
       'OutcomeAsilomarStatus_TREATABLE/MANAGEABLE',
       'OutcomeAsilomarStatus_UNHEALTHY/UNTREATABLE',
       'OutcomeAsilomarStatus_nan', 'ReproductiveStatusAtOutcome_ALTERED',
       'ReproductiveStatusAtOutcome_FERTILE',
       'ReproductiveStatusAtOutcome_UNKNOWN',
       'ReproductiveStatusAtOutcome_nan', 'intake_year_month',
       'outcome_year_month', 'OutcomeType'],
      dtype='object', length=101)

In [547]:
import pickle
with open('/home/jieliang/proj3/mvp/mvp_data/eda-6-clean_data.pkl', 'wb') as fp:
    pickle.dump(clean_df, fp)

In [548]:

with open('/home/jieliang/proj3/mvp/mvp_data/eda-6-clean_data.pkl', 'rb') as fp:
    df = pickle.load(fp)

In [549]:
df.to_csv('/home/jieliang/proj3/mvp/mvp_data/eda-6-clean_data.csv', sep=',')

In [550]:
df.dtypes


AnimalType                      object
IntakeType                      object
IntakeSubtype                   object
PrimaryColor                    object
PrimaryBreed                    object
SecondaryBreed                  object
Gender                          object
SecondaryColor                  object
IntakeReason                    object
IntakeInternalStatus            object
IntakeAsilomarStatus            object
ReproductiveStatusAtIntake      object
OutcomeInternalStatus           object
OutcomeAsilomarStatus           object
ReproductiveStatusAtOutcome     object
IntakeYear                       int64
IntakeMonth                      int64
OutcomeYear                      int64
OutcomeMonth                     int64
IntakeAgeInDays                float64
OutcomeAgeInDays               float64
DaysInShelter                  float64
HasIntakeAge                     int64
HasOutcomeAge                    int64
HasDaysInShelter                 int64
TimesInShelter           

In [551]:
df.IntakeInternalStatus.value_counts()

NORMAL      101566
medical      20397
behavior     13331
OTHER         3299
AGED          2425
Name: IntakeInternalStatus, dtype: int64