In [26]:
import pandas as pd
import plotly.express as px
from plotly.subplots import make_subplots
from wordcloud import WordCloud, STOPWORDS

In [12]:
df_train = pd.read_csv('./train.csv')
df_test = pd.read_csv('./test.csv')
df_color = pd.read_csv('./color_labels.csv')
df_breed = pd.read_csv('./breed_labels.csv')
df_state = pd.read_csv('./state_labels.csv')

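# Tag each row with its source so the combined frame can be split again later
df_train['dataset_type'] = 'train'
df_test['dataset_type'] = 'test'
# ignore_index=True avoids duplicate index labels in the combined frame
all_data = pd.concat([df_train, df_test], ignore_index=True)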

# Data Fields
- PetID - Unique hash ID of pet profile
- AdoptionSpeed - Categorical speed of adoption. Lower is faster. This is the value to predict. See below section for more info.
- Type - Type of animal (1 = Dog, 2 = Cat)
- Name - Name of pet (Empty if not named)
- Age - Age of pet when listed, in months
- Breed1 - Primary breed of pet (Refer to BreedLabels dictionary)
- Breed2 - Secondary breed of pet, if pet is of mixed breed (Refer to BreedLabels dictionary)
- Gender - Gender of pet (1 = Male, 2 = Female, 3 = Mixed, if profile represents group of pets)
- Color1 - Color 1 of pet (Refer to ColorLabels dictionary)
- Color2 - Color 2 of pet (Refer to ColorLabels dictionary)
- Color3 - Color 3 of pet (Refer to ColorLabels dictionary)
- MaturitySize - Size at maturity (1 = Small, 2 = Medium, 3 = Large, 4 = Extra Large, 0 = Not Specified)
- FurLength - Fur length (1 = Short, 2 = Medium, 3 = Long, 0 = Not Specified)
- Vaccinated - Pet has been vaccinated (1 = Yes, 2 = No, 3 = Not Sure)
- Dewormed - Pet has been dewormed (1 = Yes, 2 = No, 3 = Not Sure)
- Sterilized - Pet has been spayed / neutered (1 = Yes, 2 = No, 3 = Not Sure)
- Health - Health Condition (1 = Healthy, 2 = Minor Injury, 3 = Serious Injury, 0 = Not Specified)
- Quantity - Number of pets represented in profile
- Fee - Adoption fee (0 = Free)
- State - State location in Malaysia (Refer to StateLabels dictionary)
- RescuerID - Unique hash ID of rescuer
- VideoAmt - Total uploaded videos for this pet
- PhotoAmt - Total uploaded photos for this pet
- Description - Profile write-up for this pet. The primary language used is English, with some in Malay or Chinese.
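Several of the fields above (Breed1, Breed2, Color1 to Color3, State) store integer IDs that point into the label dictionaries. A minimal sketch of decoding one of them follows. It uses tiny in-memory frames instead of the real CSV files, and the column names `BreedID` and `BreedName` as well as the breed names themselves are assumptions made for illustration, not something shown in this notebook.

```python
import pandas as pd

# Toy stand-ins for the pet table and a label dictionary.
# Assumption: the label file exposes an ID column and a name column
# (called BreedID and BreedName here); the names are placeholders.
pets = pd.DataFrame({"PetID": ["a1", "b2", "c3"], "Breed1": [307, 265, 0]})
breed_labels = pd.DataFrame({"BreedID": [265, 307],
                             "BreedName": ["Breed A", "Breed B"]})

# Build an ID -> name lookup and map it onto the ID column.
breed_lookup = breed_labels.set_index("BreedID")["BreedName"]
pets["Breed1Name"] = pets["Breed1"].map(breed_lookup).fillna("Unknown")

print(pets)
```

IDs that are missing from the lookup (such as 0 in this toy example) end up as "Unknown" rather than NaN, which keeps later group-bys and plots simple.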

In [3]:
df_train.shape

(14993, 24)

In [4]:
df_train.head()

Unnamed: 0,Type,Name,Age,Breed1,Breed2,Gender,Color1,Color2,Color3,MaturitySize,...,Health,Quantity,Fee,State,RescuerID,VideoAmt,Description,PetID,PhotoAmt,AdoptionSpeed
0,2,Nibble,3,299,0,1,1,7,0,1,...,1,1,100,41326,8480853f516546f6cf33aa88cd76c379,0,Nibble is a 3+ month old ball of cuteness. He ...,86e1089a3,1.0,2
1,2,No Name Yet,1,265,0,1,1,2,0,2,...,1,1,0,41401,3082c7125d8fb66f7dd4bff4192c8b14,0,I just found it alone yesterday near my apartm...,6296e909a,2.0,0
2,1,Brisco,1,307,0,1,2,7,0,2,...,1,1,0,41326,fa90fa5b1ee11c86938398b60abc32cb,0,Their pregnant mother was dumped by her irresp...,3422e4906,7.0,3
3,1,Miko,4,307,0,2,1,2,0,2,...,1,1,150,41401,9238e4f44c71a75282e62f7136c6b240,0,"Good guard dog, very alert, active, obedience ...",5842f1ff5,8.0,2
4,1,Hunter,1,307,0,1,1,0,0,2,...,1,1,0,41326,95481e953f8aed9ec3d16fc4509537e8,0,This handsome yet cute boy is up for adoption....,850a43f90,3.0,2


In [6]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14993 entries, 0 to 14992
Data columns (total 24 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Type           14993 non-null  int64  
 1   Name           13736 non-null  object 
 2   Age            14993 non-null  int64  
 3   Breed1         14993 non-null  int64  
 4   Breed2         14993 non-null  int64  
 5   Gender         14993 non-null  int64  
 6   Color1         14993 non-null  int64  
 7   Color2         14993 non-null  int64  
 8   Color3         14993 non-null  int64  
 9   MaturitySize   14993 non-null  int64  
 10  FurLength      14993 non-null  int64  
 11  Vaccinated     14993 non-null  int64  
 12  Dewormed       14993 non-null  int64  
 13  Sterilized     14993 non-null  int64  
 14  Health         14993 non-null  int64  
 15  Quantity       14993 non-null  int64  
 16  Fee            14993 non-null  int64  
 17  State          14993 non-null  int64  
 18  Rescue

- We have almost 15 thousand dogs and cats in the training dataset;
- The main dataset contains all the important information about pets: age, breed, color, some characteristics and other things;
- Descriptions were analyzed using Google's Natural Language API, which provides sentiments and entities. I suppose we could do a similar thing ourselves;
- There are photos of some pets;
- Some meta-information was extracted from images and we can use it;
- Let's start with the main dataset.

I have also created a full dataset by combining train and test data. This is done purely for more convenient visualization. Column "dataset_type" shows which dataset the data belongs to.

## AdoptionSpeed
Contestants are required to predict this value. The value is determined by how quickly, if at all, a pet is adopted. The values are determined in the following way:

0 - Pet was adopted on the same day as it was listed.

1 - Pet was adopted between 1 and 7 days (1st week) after being listed.

2 - Pet was adopted between 8 and 30 days (1st month) after being listed.

3 - Pet was adopted between 31 and 90 days (2nd & 3rd month) after being listed.

4 - No adoption after 100 days of being listed. (There are no pets in this dataset that waited between 90 and 100 days).
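The class boundaries above can be written down as a small function. This is only a sketch of the definition, not code from the competition: the function name `adoption_speed_from_days` is hypothetical, `None` is an assumed convention for "never adopted", and treating any wait longer than 100 days as class 4 is my reading of the rule.

```python
from typing import Optional


def adoption_speed_from_days(days: Optional[int]) -> int:
    """Map days between listing and adoption to an AdoptionSpeed class.

    `None` stands for a pet that was not adopted (assumed convention).
    """
    if days is None or days > 100:
        return 4  # no adoption after 100 days
    if days < 0:
        raise ValueError("days must be non-negative")
    if days == 0:
        return 0  # same day
    if days <= 7:
        return 1  # 1st week
    if days <= 30:
        return 2  # 1st month
    if days <= 90:
        return 3  # 2nd and 3rd month
    # 91-100 days: the definition says no such pets exist in the data.
    raise ValueError("waits of 91-100 days are not covered by the definition")


print([adoption_speed_from_days(d) for d in (0, 3, 7, 8, 30, 31, 90, None)])
```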

In [9]:
adoption_speed_counts = df_train['AdoptionSpeed'].value_counts().sort_index()
adoption_speed_df = pd.DataFrame({'AdoptionSpeed': adoption_speed_counts.index,
                                  'Count': adoption_speed_counts.values})


fig = px.bar(adoption_speed_df, x='Count', y='AdoptionSpeed', orientation='h', color='AdoptionSpeed')

fig.update_layout(title='Adoption speed classes counts',
                  xaxis_title='Count',
                  yaxis_title='Adoption speed',
                  showlegend=False)

fig.show()

In [14]:
train_data = all_data.loc[all_data['dataset_type'] == 'train']
adoption_speed_counts = train_data['AdoptionSpeed'].value_counts().sort_index()
adoption_speed_df = pd.DataFrame({'AdoptionSpeed': adoption_speed_counts.index,
                                  'Count': adoption_speed_counts.values})

fig = px.bar(adoption_speed_df, x='AdoptionSpeed', y='Count', color='AdoptionSpeed')


fig.update_layout(title='Adoption speed classes counts',
                  xaxis_title='Adoption speed',
                  yaxis_title='Count',
                  showlegend=False)

fig.show()
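The bar charts plot raw counts per class. If class shares are more useful, `value_counts(normalize=True)` returns them directly. A minimal sketch on a toy series follows; the values are made up and only stand in for `df_train['AdoptionSpeed']`.

```python
import pandas as pd

# Toy target column standing in for df_train['AdoptionSpeed'].
speeds = pd.Series([4, 2, 2, 1, 4, 3, 0, 2, 4, 1], name="AdoptionSpeed")

# normalize=True returns each class's share of rows instead of raw counts.
rates = speeds.value_counts(normalize=True).sort_index()

# Express the shares as percentages for easier reading.
rates_pct = (rates * 100).round(1)
print(rates_pct)
```

The resulting series can be passed to `px.bar` in the same way as the counts, with the y-axis relabelled accordingly.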

### Type
1 - Dog, 2 - Cat

In [23]:
# replace() is safe to re-run: already converted labels are left untouched
all_data['Type'] = all_data['Type'].replace({1: 'Dog', 2: 'Cat'})
fig = px.histogram(all_data, x='dataset_type', color='Type')
fig.update_layout(title='Number of cats and dogs in train and test data')
fig.show()

In [5]:
fig = px.histogram(df_train, x="Type", color="Type", nbins=2, barmode="group")


fig.update_layout(
    title="Distribution of Types",
    xaxis_title="Type",
    yaxis_title="Count"
)

fig.show()

### Name

In [27]:
# Get the names of cats and dogs separately
cats_names = all_data.loc[all_data['Type'] == 'Cat', 'Name'].fillna('').values
dogs_names = all_data.loc[all_data['Type'] == 'Dog', 'Name'].fillna('').values

# Generate wordclouds for cats and dogs
cat_wordcloud = WordCloud(background_color='white', width=1200, height=1000).generate(' '.join(cats_names))
dog_wordcloud = WordCloud(background_color='white', width=1200, height=1000).generate(' '.join(dogs_names))

# Create the plot
fig = make_subplots(rows=1, cols=2, subplot_titles=('Top cat names', 'Top dog names'))

# Add the wordclouds to the plot
fig.add_trace(px.imshow(cat_wordcloud.to_array()).data[0], row=1, col=1)
fig.add_trace(px.imshow(dog_wordcloud.to_array()).data[0], row=1, col=2)

# Customize the plot layout
fig.update_layout(showlegend=False, width=1000, height=500, title_text='Top cat and dog names')

# Display the plot
fig.show()

Regarding the pet names:
- Some names are common and predictable, while others are more unique or unconventional.
- Some pets don't have names, which could be an important factor to consider when analyzing adoption rates or predicting outcomes.

In [28]:
print('Most popular pet names and AdoptionSpeed')
for n in df_train['Name'].value_counts().index[:5]:
    print(n)
    print(df_train.loc[df_train['Name'] == n, 'AdoptionSpeed'].value_counts().sort_index())
    print('')

Most popular pet names and AdoptionSpeed
Baby
0     2
1    11
2    15
3    11
4    27
Name: AdoptionSpeed, dtype: int64

Lucky
0     5
1    14
2    16
3    12
4    17
Name: AdoptionSpeed, dtype: int64

Brownie
0     1
1    11
2    14
3    12
4    16
Name: AdoptionSpeed, dtype: int64

No Name
0     3
1    14
2    11
3     6
4    20
Name: AdoptionSpeed, dtype: int64

Mimi
0     3
1    12
2    13
3     7
4    17
Name: AdoptionSpeed, dtype: int64



In [30]:
df_train['Name'] = df_train['Name'].fillna('Unnamed')
df_test['Name'] = df_test['Name'].fillna('Unnamed')
all_data['Name'] = all_data['Name'].fillna('Unnamed')

df_train['No_name'] = 0
df_train.loc[df_train['Name'] == 'Unnamed', 'No_name'] = 1
df_test['No_name'] = 0
df_test.loc[df_test['Name'] == 'Unnamed', 'No_name'] = 1
all_data['No_name'] = 0
all_data.loc[all_data['Name'] == 'Unnamed', 'No_name'] = 1

print(f"Rate of unnamed pets in train data: {df_train['No_name'].sum() * 100 / df_train['No_name'].shape[0]:.4f}%.")
print(f"Rate of unnamed pets in test data: {df_test['No_name'].sum() * 100 / df_test['No_name'].shape[0]:.4f}%.")

Rate of unnamed pets in train data: 8.4173%.
Rate of unnamed pets in test data: 10.3474%.


In [32]:
pd.crosstab(df_train['No_name'], df_train['AdoptionSpeed'], normalize='index')

No_name \ AdoptionSpeed,0,1,2,3,4
0,0.027966,0.205302,0.271211,0.22147,0.274051
1,0.020602,0.214739,0.248019,0.172742,0.343899

#### "Bad" names
I have noticed that shorter names tend to be meaningless. Here are some examples of names with 3 characters.

In [35]:
all_data[all_data['Name'].apply(lambda x: len(str(x))) == 3]['Name'].value_counts().tail()

C14    1
Bob    1
Nat    1
B.B    1
Ino    1
Name: Name, dtype: int64

In [36]:
all_data[all_data['Name'].apply(lambda x: len(str(x))) < 3]['Name'].unique()


array(['H3', 'Z3', 'C', 'BB', 'QQ', 'Y1', 'H1', 'D9', 'Y4', 'Z4', 'DD',
       'M2', 'H6', 'D4', 'JJ', 'F1', 'W7', '1F', 'Q1', '6', 'CJ', '3F',
       'KD', 'G1', 'B3', 'Cc', 'F6', 'Mk', 'A5', 'GM', 'D5', 'EE', 'A4',
       'Q4', 'B', 'CC', 'Y7', 'W6', 'A3', 'A1', 'T1', 'W1', 'M4', 'P5',
       'H2', 'GG', 'Y6', 'Z', 'D7', 'B4', 'C2', 'M8', '3', 'G2', 'ML',
       'DJ', 'PP', '8', 'OJ', 'D', 'F2', 'MJ', 'W8', 'W4', 'C1', 'W2',
       'GR', 'B1', '5', 'Fa', 'Y5', 'M', 'F5', 'Y0', 'B2', 'Q6', 'G3',
       '..', 'S1', 'Qu', 'R9', 'W3', 'R7', 'Tj', 'P3', '7', '!', 'RC',
       'Z2', 'Q3', 'A2', 'QD', 'S', '-', 'R6', 'IV', 'Mo', 'W5', 'F8',
       'M6', 'M9', 'Py', 'Rt', 'F9', 'P6', 'AJ', 'Y3', 'D6', 'T2', 'F4',
       'T3', 'YY', '99', 'F7', 'W+', 'D2', '1', '#1', 'S4', '2', 'Am',
       'P', 'P4', 'R5', 'M3', 'R3', 'JD', 'BJ', 'L', 'KC', 'VV', 'M1',
       '!.', 'V6', 'P1', 'J', 'S3', 'A6', 'Cq', 'M5', 'B5', 'J1', 'O',
       '2F', 'Q2', 'Y2', 'AB', 'A', 'Jo', 'ET', 'A9', 'ST', 'Po', 'KK'

I think we could create a new feature that flags meaningless names; pets with such names could have less success in adoption.
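One way such a feature could look is sketched here. The rules (fewer than 3 characters, contains a digit, or contains no letters at all) are my own heuristics based on the examples above, and the name `is_bad_name` is hypothetical. The sketch assumes missing names were already filled, since `str(NaN)` would otherwise pass as a regular name.

```python
import pandas as pd


def is_bad_name(name) -> int:
    """Heuristic flag: 1 for names that look like codes rather than words."""
    text = str(name).strip()
    too_short = len(text) < 3
    has_digit = any(ch.isdigit() for ch in text)
    # isalpha() is Unicode-aware, so non-Latin names still count as letters.
    no_letters = not any(ch.isalpha() for ch in text)
    return int(too_short or has_digit or no_letters)


# Toy names taken from the outputs above.
names = pd.Series(["H3", "C14", "B.B", "Bob", "!.", "Lucky", "Mimi"])
bad_name = names.apply(is_bad_name)
print(dict(zip(names, bad_name)))
```

Borderline cases such as "B.B" pass these rules, so the thresholds are worth tuning against the actual name list before using the flag in a model.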

### Description

In [37]:
text = ' '.join(df_train['Description'].astype(str))

wordcloud = WordCloud(stopwords=set(STOPWORDS), background_color='white', width=800, height=400).generate(text)


fig = px.imshow(wordcloud.to_array())
fig.update_layout(
    title="Word Cloud of Descriptions",
    xaxis=dict(showticklabels=False, showgrid=False, zeroline=False),
    yaxis=dict(showticklabels=False, showgrid=False, zeroline=False),
)
fig.show()