<a href="https://colab.research.google.com/github/nolll77/Tabular-Data/blob/master/Deep_Learning_for_Tabular_Data_using_PyTorch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Deep Learning for tabular data using PyTorch**

![On a multiclass classification problem](https://drive.google.com/uc?id=1jyQqIZ3DZDizA25fySnsq3SlerlMQ7HG)

Deep learning has proved groundbreaking in domains such as computer vision, natural language processing, and signal processing. However, for structured, tabular data consisting of categorical or numerical variables, traditional machine learning approaches (such as Random Forests and XGBoost) are generally believed to perform better. Neural networks have been catching up, though, and in many instances they perform equally well or even better.

The easiest way to perform deep learning with tabular data is through the fastai library, which gives really good results, but it might be a little too abstracted for someone who’s trying to understand what is really going on behind the scenes. Hence, in this article, I cover how to build a simple deep learning model for tabular data in PyTorch on a multiclass classification problem.

**A little background on PyTorch**

PyTorch is a popular open-source machine learning library. It is as simple to use and learn as Python. A few other advantages of using PyTorch are its multi-GPU support and custom data loaders. If you’re unfamiliar with the basics or need a refresher, here’s a good place to start:



<b>Dataset</b> - https://www.kaggle.com/c/shelter-animal-outcomes - It’s a tabular dataset consisting of about 26k rows and 10 columns in the training set. All columns except DateTime are categorical.

<b>Problem Statement</b> : Given certain features about a shelter animal (like age, sex, color, breed), predict its outcome.

There are 5 possible outcomes : Return_to_owner, Euthanasia, Adoption, Transfer, Died. We are expected to find the probability of an animal's outcome belonging to each of the 5 categories.

## Library imports

In [None]:
import pandas as pd
import numpy as np
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import torch
from torch.utils.data import Dataset, DataLoader
import torch.optim as torch_optim
import torch.nn as nn
import torch.nn.functional as F
from datetime import datetime

## **Data Preprocessing**

Although this step depends largely on the particular data and problem, a few steps are always necessary:

**Getting rid of NaN values:**
NaN (not a number) indicates a missing value in the dataset. The model doesn’t accept NaN values, hence they must be either deleted or replaced.

For numerical columns, a popular way of dealing with these values is to impute them with 0, the mean, the median, the mode, or some other function of the remaining data. Missing values might sometimes indicate an underlying feature in your dataset, so people often create a new binary column alongside the column with missing values to record whether the data was missing or not.

For categorical columns, NaN values can be considered as their own category!
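As a minimal sketch of the numerical case (the shelter data has no numerical columns, so the `age` column and its values below are made up purely for illustration), a missing-value indicator combined with median imputation could look like this:

```python
import numpy as np
import pandas as pd

# Hypothetical numerical column with missing entries (not from the shelter dataset)
df = pd.DataFrame({"age": [2.0, np.nan, 5.0, np.nan, 11.0]})

# Record which rows were missing before overwriting them
df["age_missing"] = df["age"].isna().astype(int)

# Impute with the median of the observed values
median_age = df["age"].median()
df["age"] = df["age"].fillna(median_age)

print(df)
```

The indicator column lets the model learn from the fact that a value was missing, even after the gap has been filled.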

**Label encoding all categorical columns:**
Since our model can only take numerical inputs, we convert all our categorical elements to numbers. This means instead of using strings to represent categories, we use numbers. The numbers chosen to represent the categories should range from 0 to one less than the number of distinct categories (including NaN) in the column. This is because, when we create categorical embeddings for the column, we index into an embedding matrix that has one row for each category. Here’s a simple example of label encoding:

![Label Encoder](https://drive.google.com/uc?id=1cZUIspp02WSh4s9gc6pSKA02pCaanA7i)


I’ve used the LabelEncoder class from the scikit-learn library to encode the categorical columns. You could define a custom class to do this and keep track of the category labels because you’d need them to encode test data too.
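A custom encoder of the kind described above might be sketched as follows. `CategoryEncoder` and its `unknown_token` parameter are hypothetical names introduced for illustration, and the sketch assumes missing values have already been replaced by the string `"NA"`:

```python
class CategoryEncoder:
    """Hypothetical helper: maps categories to integer codes and remembers the mapping."""

    def __init__(self, unknown_token="NA"):
        self.unknown_token = unknown_token
        self.mapping = {}

    def fit(self, values):
        # Reserve one code for missing/unseen values; sort for a stable order
        categories = sorted(set(values) | {self.unknown_token})
        self.mapping = {cat: idx for idx, cat in enumerate(categories)}
        return self

    def transform(self, values):
        # Categories never seen during fit fall back to the unknown code
        unknown_code = self.mapping[self.unknown_token]
        return [self.mapping.get(v, unknown_code) for v in values]

    def inverse_transform(self, codes):
        reverse = {idx: cat for cat, idx in self.mapping.items()}
        return [reverse[c] for c in codes]


encoder = CategoryEncoder().fit(["Dog", "Cat", "Dog"])
train_codes = encoder.transform(["Dog", "Cat", "Dog"])
test_codes = encoder.transform(["Cat", "Bird"])  # "Bird" was never seen in training
print(train_codes, test_codes, encoder.inverse_transform(test_codes))
```

Because the mapping is stored on the object, the same codes can be applied to test data later, and `inverse_transform` recovers the original labels.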

**Label encoding the target:**
We also need to label encode the target if it has string entries. Also, make sure you maintain a dictionary mapping the encodings to original values because you’ll need it to figure out the final output of your model.

**Data processing particular to the Shelter Outcome problem:**
Along with the above-mentioned steps, I did a little more processing for the example problem.

- Removed the AnimalID column because it’s unique and won’t help in training.
- Removed the OutcomeSubtype column because it’s a part of the target but we’re not asked to predict it.
- Removed the DateTime column because the exact timestamp of when the record was entered didn’t seem like an important feature. In fact, I first tried to split it out into separate month and year columns but later realized that removing the column altogether gave me a better result!
- Removed the Name column because it had too many NaN values (more than 10k missing). Also, it did not seem like a very important feature in determining an animal’s outcome.

**Note:** In my notebook, I stacked the train and test columns and then did the preprocessing to avoid having to do label encoding based on the train set labels on the test set (because it would involve maintaining a dictionary of encoded labels to actual values). It was okay to do the stacking and processing here because there are no numerical columns (hence no imputing done) and the number of categories per column was fixed. In practice, we must never do this because it may leak some data from the test/validation sets to the training data and lead to an inaccurate evaluation of the model. For example, if you had missing values in a numerical column like age and decided to impute it with the average value, the average value should be calculated only on the train set (not the stacked train-test-valid set) and this value should be used to impute missing values in the validation and test sets too.
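To illustrate the leakage point with a small sketch (the `age` values are invented, since the shelter data has no numerical columns), the imputation statistic is computed from the training rows only and then reused for the test rows:

```python
import numpy as np
import pandas as pd

# Hypothetical train/test frames with a numerical column containing gaps
train_df = pd.DataFrame({"age": [1.0, 3.0, np.nan, 8.0]})
test_df = pd.DataFrame({"age": [np.nan, 20.0]})

# Compute the statistic on the training data only: (1 + 3 + 8) / 3
train_mean = train_df["age"].mean()

# Reuse the training statistic everywhere; test values never influence it
train_df["age"] = train_df["age"].fillna(train_mean)
test_df["age"] = test_df["age"].fillna(train_mean)

print(train_mean, test_df["age"].tolist())
```

Had the mean been computed on the stacked data, the test value of 20 would have shifted it, which is exactly the kind of leak described in the note.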

**Categorical Embeddings**

[Categorical embeddings](https://drive.google.com/file/d/1BWt5aW9l5Ha-rR8YvUksHhe46ynOMVe3/view?usp=sharing) are very similar to the word embeddings commonly used in NLP. The basic idea is to have a fixed-length vector representation of each category in the column. Unlike one-hot encoding, which produces a sparse matrix, embeddings give a dense vector for each category, with similar categories having values close to each other in the embedding space. Hence, this process not only saves memory (the one-hot encoding of a column with many categories can really blow up the input matrix, and that matrix is very sparse) but can also reveal intrinsic properties of the categorical variables.

For example, if we had a column of colors and found embeddings for it, we could expect red and pink to be closer in the embedding space than red and blue.

Categorical embedding layers are equivalent to extra layers on top of each one-hot encoded input:

![Categorical embedding layers on top of one-hot encoded inputs](https://drive.google.com/uc?id=1iQpcTOm7qYSSYgUdBY2TOh63wKoZdDhE)
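The equivalence between an embedding lookup and a linear layer applied to a one-hot input can be checked directly. This is a small sketch with toy sizes chosen for illustration (4 categories, 3-dimensional embeddings), not the sizes used for the shelter data:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

n_categories, emb_dim = 4, 3          # toy sizes chosen for illustration
embedding = nn.Embedding(n_categories, emb_dim)

codes = torch.tensor([2, 0, 3])       # label-encoded categories

# Path 1: direct lookup into the embedding matrix
looked_up = embedding(codes)

# Path 2: one-hot encode, then multiply by the same weight matrix
one_hot = F.one_hot(codes, num_classes=n_categories).float()
multiplied = one_hot @ embedding.weight

print(torch.allclose(looked_up, multiplied))
```

Both paths select the same rows of the weight matrix; the lookup simply avoids materializing the sparse one-hot input.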

For our shelter outcome problem, we have only categorical columns, but I’ll treat columns with fewer than 3 distinct values as continuous. To decide the length of each column’s embedding vector, I’ve taken a simple rule from the fastai library: half the number of categories (rounded up), capped at 50.
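The sizing rule used later in this notebook, `min(50, (n_categories + 1) // 2)`, can be tried on its own. The helper name `embedding_size` and its `max_size` parameter are hypothetical; the category counts are the ones the notebook reports for the shelter columns:

```python
def embedding_size(n_categories, max_size=50):
    """Half the number of categories (rounded up), capped at max_size."""
    return min(max_size, (n_categories + 1) // 2)

# Category counts reported for the shelter columns later in this notebook
category_counts = {"SexuponOutcome": 6, "AgeuponOutcome": 46, "Breed": 1678, "Color": 411}

embedding_sizes = [(n, embedding_size(n)) for n in category_counts.values()]
print(embedding_sizes)  # [(6, 3), (46, 23), (1678, 50), (411, 50)]
```

The cap keeps high-cardinality columns such as Breed from getting very wide embeddings.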

**PyTorch Dataset and DataLoader**

We extend the (abstract) [Dataset](https://pytorch.org/docs/stable/_modules/torch/utils/data/dataset.html#TensorDataset) class provided by PyTorch for easier access to our dataset while training and for effectively using the DataLoader module to manage batches. This involves overriding the `__len__` and `__getitem__` methods for our particular dataset.

Since we only need to embed categorical columns, we split our input into two parts: numerical and categorical.

We then choose our batch size and feed it along with the dataset to the DataLoader. Deep learning is generally done in batches. DataLoader helps us in effectively managing these batches and shuffling the data before training.

As a sanity check, you can iterate through the created DataLoaders to look at each batch:

![Sample batch from the DataLoader](https://drive.google.com/uc?id=11HvL2OwpBW90myt-QQ55TIXvXycdSZHH)
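A sanity check of this kind could be sketched as follows. It uses random toy tensors and `TensorDataset` as a stand-in for the custom dataset class, so the shapes (10 rows, 4 categorical codes, 1 continuous value) are assumptions for illustration only:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in data: 10 rows, 4 categorical codes and 1 continuous value per row
x_cat = torch.randint(0, 5, (10, 4))
x_cont = torch.rand(10, 1)
y = torch.randint(0, 5, (10,))

toy_dl = DataLoader(TensorDataset(x_cat, x_cont, y), batch_size=4, shuffle=True)

batch_shapes = []
for cat_batch, cont_batch, y_batch in toy_dl:
    # Each batch yields the categorical part, the continuous part and the target
    batch_shapes.append((tuple(cat_batch.shape), tuple(cont_batch.shape), tuple(y_batch.shape)))
    print(cat_batch.shape, cont_batch.shape, y_batch.shape)
```

With 10 rows and a batch size of 4, the loader yields two full batches and a final batch of 2 rows.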

**Model**
Our data is split into continuous and categorical parts. We first convert the categorical parts into embedding vectors based on the previously determined sizes and concatenate them with the continuous parts to feed to the rest of the network. This picture demonstrates the model I’ve used:

![Model architecture](https://drive.google.com/uc?id=1Nnb1GZP6c9uOCUOu6NsBGGZ667C702mH)
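As a quick worked example of the concatenation step, the width of the first linear layer's input follows from the embedding sizes and the number of continuous columns used in this notebook:

```python
# (n_categories, embedding_dim) pairs used for the shelter columns in this notebook
embedding_sizes = [(6, 3), (46, 23), (1678, 50), (411, 50)]
n_cont = 1  # the single column treated as continuous

n_emb = sum(dim for _, dim in embedding_sizes)  # 3 + 23 + 50 + 50 = 126
first_layer_inputs = n_emb + n_cont             # 126 + 1 = 127

print(n_emb, first_layer_inputs)  # 126 127
```

This matches the `in_features=127` reported for the first linear layer in the model summary further down.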

**Training**
Now we train the model on the training set. I’ve used the Adam optimizer to minimize the cross-entropy loss. The training is pretty straightforward: iterate through each batch, do a forward pass, compute gradients, take an optimizer step, and repeat this process for as many epochs as needed.

**Test Output**
Since we’re interested in finding the probabilities of each class for our test inputs, we apply a Softmax function over our model output. I also made a Kaggle submission to see how well this model performs:

![Kaggle submission result](https://drive.google.com/uc?id=1cxxhEZARnsbKWIJoYiKwfjeJ-9W3va3M)

We’ve done very little feature engineering and data exploration and used a very basic deep learning architecture, yet our model has done better than about 50% of the solutions. This shows that this approach of modeling tabular data using neural networks is pretty powerful!
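For readers who want to see what the Softmax step used for the test output does, here is a plain-Python sketch. The logits are made up for illustration, and the notebook itself relies on PyTorch's built-in `F.softmax` rather than a hand-written function:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    shift = max(logits)                      # subtracting the max avoids overflow in exp
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for one animal, ordered as
# Adoption, Died, Euthanasia, Return_to_owner, Transfer
logits = [1.2, -3.0, -0.5, 0.3, 2.0]
probs = softmax(logits)

print([round(p, 3) for p in probs], sum(probs))
```

The outputs are all positive and sum to 1, so they can be written directly into the five probability columns of the submission file.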

#### Training set 

In [None]:
# Auth from my Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
path = "/content/drive/My Drive/Dataset/Jovian/Tabular Data/train.csv"
train = pd.read_csv(path)
print("Shape:", train.shape)
train.head()

Shape: (26729, 10)


Unnamed: 0,AnimalID,Name,DateTime,OutcomeType,OutcomeSubtype,AnimalType,SexuponOutcome,AgeuponOutcome,Breed,Color
0,A671945,Hambone,2014-02-12 18:22:00,Return_to_owner,,Dog,Neutered Male,1 year,Shetland Sheepdog Mix,Brown/White
1,A656520,Emily,2013-10-13 12:44:00,Euthanasia,Suffering,Cat,Spayed Female,1 year,Domestic Shorthair Mix,Cream Tabby
2,A686464,Pearce,2015-01-31 12:28:00,Adoption,Foster,Dog,Neutered Male,2 years,Pit Bull Mix,Blue/White
3,A683430,,2014-07-11 19:09:00,Transfer,Partner,Cat,Intact Male,3 weeks,Domestic Shorthair Mix,Blue Cream
4,A667013,,2013-11-15 12:52:00,Transfer,Partner,Dog,Neutered Male,2 years,Lhasa Apso/Miniature Poodle,Tan


#### Test set

In [None]:
test = pd.read_csv('/content/drive/My Drive/Dataset/Jovian/Tabular Data/test.csv')
print("Shape:", test.shape)
test.head()

Shape: (11456, 8)


Unnamed: 0,ID,Name,DateTime,AnimalType,SexuponOutcome,AgeuponOutcome,Breed,Color
0,1,Summer,2015-10-12 12:15:00,Dog,Intact Female,10 months,Labrador Retriever Mix,Red/White
1,2,Cheyenne,2014-07-26 17:59:00,Dog,Spayed Female,2 years,German Shepherd/Siberian Husky,Black/Tan
2,3,Gus,2016-01-13 12:20:00,Cat,Neutered Male,1 year,Domestic Shorthair Mix,Brown Tabby
3,4,Pongo,2013-12-28 18:12:00,Dog,Intact Male,4 months,Collie Smooth Mix,Tricolor
4,5,Skooter,2015-09-24 17:59:00,Dog,Neutered Male,2 years,Miniature Poodle Mix,White


#### Sample submission file

For each row, each outcome's probability needs to be filled into the columns

In [None]:
sample = pd.read_csv('/content/drive/My Drive/Dataset/Jovian/Tabular Data/sample_submission.csv')
sample.head()

Unnamed: 0,ID,Adoption,Died,Euthanasia,Return_to_owner,Transfer
0,1,1,0,0,0,0
1,2,1,0,0,0,0
2,3,1,0,0,0,0
3,4,1,0,0,0,0
4,5,1,0,0,0,0


## Very basic data exploration

#### How balanced is the dataset?

Adoption and Transfer occur far more often than the other outcomes, while Died is rare.

In [None]:
Counter(train['OutcomeType'])

Counter({'Adoption': 10769,
         'Died': 197,
         'Euthanasia': 1555,
         'Return_to_owner': 4786,
         'Transfer': 9422})

#### What are the most common names and how many times do they occur? 

There are a lot of NaN values in this column, and Name is unlikely to be a very important factor anyway.

In [None]:
Counter(train['Name']).most_common(5)

[(nan, 7691), ('Max', 136), ('Bella', 135), ('Charlie', 107), ('Daisy', 106)]

## Data preprocessing

The OutcomeSubtype column is part of the target but we are not asked to predict it, so we drop it. Also, since AnimalID is unique, it doesn't help in training.

In [None]:
train_X = train.drop(columns= ['OutcomeType', 'OutcomeSubtype', 'AnimalID'])
Y = train['OutcomeType']
test_X = test

#### Stacking train and test set so that they undergo the same preprocessing 

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows the same way
stacked_df = pd.concat([train_X, test_X.drop(columns=['ID'])])

#### dropping DateTime (splitting it into month and year did not help)

In [None]:
# stacked_df['DateTime'] = pd.to_datetime(stacked_df['DateTime'])
# stacked_df['year'] = stacked_df['DateTime'].dt.year
# stacked_df['month'] = stacked_df['DateTime'].dt.month
stacked_df = stacked_df.drop(columns=['DateTime'])
stacked_df.head()

Unnamed: 0,Name,AnimalType,SexuponOutcome,AgeuponOutcome,Breed,Color
0,Hambone,Dog,Neutered Male,1 year,Shetland Sheepdog Mix,Brown/White
1,Emily,Cat,Spayed Female,1 year,Domestic Shorthair Mix,Cream Tabby
2,Pearce,Dog,Neutered Male,2 years,Pit Bull Mix,Blue/White
3,,Cat,Intact Male,3 weeks,Domestic Shorthair Mix,Blue Cream
4,,Dog,Neutered Male,2 years,Lhasa Apso/Miniature Poodle,Tan


#### dropping columns with too many nulls

In [None]:
for col in stacked_df.columns:
    if stacked_df[col].isnull().sum() > 10000:
        print("dropping", col, stacked_df[col].isnull().sum())
        stacked_df = stacked_df.drop(columns = [col])

dropping Name 10916


In [None]:
stacked_df.head()

Unnamed: 0,AnimalType,SexuponOutcome,AgeuponOutcome,Breed,Color
0,Dog,Neutered Male,1 year,Shetland Sheepdog Mix,Brown/White
1,Cat,Spayed Female,1 year,Domestic Shorthair Mix,Cream Tabby
2,Dog,Neutered Male,2 years,Pit Bull Mix,Blue/White
3,Cat,Intact Male,3 weeks,Domestic Shorthair Mix,Blue Cream
4,Dog,Neutered Male,2 years,Lhasa Apso/Miniature Poodle,Tan


#### label encoding

In [None]:
for col in stacked_df.columns:
    if stacked_df.dtypes[col] == "object":
        stacked_df[col] = stacked_df[col].fillna("NA")
    else:
        stacked_df[col] = stacked_df[col].fillna(0)
    stacked_df[col] = LabelEncoder().fit_transform(stacked_df[col])

In [None]:
stacked_df.head()

Unnamed: 0,AnimalType,SexuponOutcome,AgeuponOutcome,Breed,Color
0,1,3,5,1482,146
1,0,4,5,775,184
2,1,3,21,1293,97
3,0,1,26,775,47
4,1,3,21,1101,311


In [None]:
# making all variables categorical
for col in stacked_df.columns:
    stacked_df[col] = stacked_df[col].astype('category')

#### splitting back train and test

In [None]:
n_train = len(train)  # avoid hard-coding the number of training rows
X = stacked_df.iloc[:n_train]
test_processed = stacked_df.iloc[n_train:]

# check that shape[0] matches the original
print("train shape: ", X.shape, "original: ", train.shape)
print("test shape: ", test_processed.shape, "original: ", test.shape)

train shape:  (26729, 5) original:  (26729, 10)
test shape:  (11456, 5) original:  (11456, 8)


#### Encoding target

In [None]:
Y = LabelEncoder().fit_transform(Y)

#sanity check to see numbers match and matching with previous counter to create target dictionary
print(Counter(train['OutcomeType']))
print(Counter(Y))
target_dict = {
    'Return_to_owner' : 3,
    'Euthanasia': 2,
    'Adoption': 0,
    'Transfer': 4,
    'Died': 1
}

Counter({'Adoption': 10769, 'Transfer': 9422, 'Return_to_owner': 4786, 'Euthanasia': 1555, 'Died': 197})
Counter({0: 10769, 4: 9422, 3: 4786, 2: 1555, 1: 197})


#### train-valid split

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X, Y, test_size=0.10, random_state=0)
X_train.head()

Unnamed: 0,AnimalType,SexuponOutcome,AgeuponOutcome,Breed,Color
6917,1,3,5,1293,146
13225,0,4,33,1515,231
2697,1,4,5,1353,43
21905,1,3,31,245,40
17071,0,4,37,775,156


#### Choosing columns for embedding

In [None]:
#categorical embedding for columns having more than two values
embedded_cols = {n: len(col.cat.categories) for n,col in X.items() if len(col.cat.categories) > 2}
embedded_cols

{'AgeuponOutcome': 46, 'Breed': 1678, 'Color': 411, 'SexuponOutcome': 6}

In [None]:
embedded_col_names = list(embedded_cols)  # a plain list is safer than a dict view for pandas indexing
len(X.columns) - len(embedded_cols) #number of numerical columns

1

#### Determining size of embedding 
(borrowed from https://www.usfca.edu/data-institute/certificates/fundamentals-deep-learning lesson 2)

In [None]:
embedding_sizes = [(n_categories, min(50, (n_categories+1)//2)) for _,n_categories in embedded_cols.items()]
embedding_sizes

[(6, 3), (46, 23), (1678, 50), (411, 50)]

## Pytorch Dataset

In [None]:
class ShelterOutcomeDataset(Dataset):
    def __init__(self, X, Y, embedded_col_names):
        X = X.copy()
        self.X1 = X.loc[:,embedded_col_names].copy().values.astype(np.int64) #categorical columns
        self.X2 = X.drop(columns=embedded_col_names).copy().values.astype(np.float32) #numerical columns
        self.y = Y
        
    def __len__(self):
        return len(self.y)
    
    def __getitem__(self, idx):
        return self.X1[idx], self.X2[idx], self.y[idx]

In [None]:
#creating train and valid datasets
train_ds = ShelterOutcomeDataset(X_train, y_train, embedded_col_names)
valid_ds = ShelterOutcomeDataset(X_val, y_val, embedded_col_names)

## Making the code device (GPU/CPU) agnostic
(borrowed from https://jovian.ml/aakashns/04-feedforward-nn)

In order to make use of a GPU if available, we'll have to move our data and model to it.

In [None]:
def get_default_device():
    """Pick GPU if available, else CPU"""
    if torch.cuda.is_available():
        return torch.device('cuda')
    else:
        return torch.device('cpu')

In [None]:
def to_device(data, device):
    """Move tensor(s) to chosen device"""
    if isinstance(data, (list,tuple)):
        return [to_device(x, device) for x in data]
    return data.to(device, non_blocking=True)

In [None]:
class DeviceDataLoader():
    """Wrap a dataloader to move data to a device"""
    def __init__(self, dl, device):
        self.dl = dl
        self.device = device
        
    def __iter__(self):
        """Yield a batch of data after moving it to device"""
        for b in self.dl: 
            yield to_device(b, self.device)

    def __len__(self):
        """Number of batches"""
        return len(self.dl)

In [None]:
device = get_default_device()
device

device(type='cpu')

## Model

(modified from https://www.usfca.edu/data-institute/certificates/fundamentals-deep-learning lesson 2)

In [None]:
class ShelterOutcomeModel(nn.Module):
    def __init__(self, embedding_sizes, n_cont):
        super().__init__()
        self.embeddings = nn.ModuleList([nn.Embedding(categories, size) for categories,size in embedding_sizes])
        n_emb = sum(e.embedding_dim for e in self.embeddings) #length of all embeddings combined
        self.n_emb, self.n_cont = n_emb, n_cont
        self.lin1 = nn.Linear(self.n_emb + self.n_cont, 200)
        self.lin2 = nn.Linear(200, 70)
        self.lin3 = nn.Linear(70, 5)
        self.bn1 = nn.BatchNorm1d(self.n_cont)
        self.bn2 = nn.BatchNorm1d(200)
        self.bn3 = nn.BatchNorm1d(70)
        self.emb_drop = nn.Dropout(0.6)
        self.drops = nn.Dropout(0.3)
        

    def forward(self, x_cat, x_cont):
        x = [e(x_cat[:,i]) for i,e in enumerate(self.embeddings)]
        x = torch.cat(x, 1)
        x = self.emb_drop(x)
        x2 = self.bn1(x_cont)
        x = torch.cat([x, x2], 1)
        x = F.relu(self.lin1(x))
        x = self.drops(x)
        x = self.bn2(x)
        x = F.relu(self.lin2(x))
        x = self.drops(x)
        x = self.bn3(x)
        x = self.lin3(x)
        return x

In [None]:
model = ShelterOutcomeModel(embedding_sizes, 1)
to_device(model, device)

ShelterOutcomeModel(
  (embeddings): ModuleList(
    (0): Embedding(6, 3)
    (1): Embedding(46, 23)
    (2): Embedding(1678, 50)
    (3): Embedding(411, 50)
  )
  (lin1): Linear(in_features=127, out_features=200, bias=True)
  (lin2): Linear(in_features=200, out_features=70, bias=True)
  (lin3): Linear(in_features=70, out_features=5, bias=True)
  (bn1): BatchNorm1d(1, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (bn2): BatchNorm1d(200, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (bn3): BatchNorm1d(70, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (emb_drop): Dropout(p=0.6, inplace=False)
  (drops): Dropout(p=0.3, inplace=False)
)

#### Optimizer

In [None]:
def get_optimizer(model, lr = 0.001, wd = 0.0):
    parameters = filter(lambda p: p.requires_grad, model.parameters())
    optim = torch_optim.Adam(parameters, lr=lr, weight_decay=wd)
    return optim

#### Training function

In [None]:
def train_model(model, optim, train_dl):
    model.train()
    total = 0
    sum_loss = 0
    for x1, x2, y in train_dl:
        batch = y.shape[0]
        output = model(x1, x2)
        loss = F.cross_entropy(output, y)   
        optim.zero_grad()
        loss.backward()
        optim.step()
        total += batch
        sum_loss += batch*(loss.item())
    return sum_loss/total

#### Evaluation function

In [None]:
def val_loss(model, valid_dl):
    model.eval()
    total = 0
    sum_loss = 0
    correct = 0
    # gradients are not needed for evaluation
    with torch.no_grad():
        for x1, x2, y in valid_dl:
            current_batch_size = y.shape[0]
            out = model(x1, x2)
            loss = F.cross_entropy(out, y)
            sum_loss += current_batch_size*(loss.item())
            total += current_batch_size
            pred = torch.max(out, 1)[1]  # index of the highest logit
            correct += (pred == y).float().sum().item()
    print("valid loss %.3f and accuracy %.3f" % (sum_loss/total, correct/total))
    return sum_loss/total, correct/total

In [None]:
def train_loop(model, epochs, lr=0.01, wd=0.0):
    optim = get_optimizer(model, lr = lr, wd = wd)
    for i in range(epochs): 
        loss = train_model(model, optim, train_dl)
        print("training loss: ", loss)
        val_loss(model, valid_dl)

## Training 

In [None]:
batch_size = 1000
train_dl = DataLoader(train_ds, batch_size=batch_size,shuffle=True)
valid_dl = DataLoader(valid_ds, batch_size=batch_size, shuffle=False)  # no need to shuffle for evaluation

In [None]:
train_dl = DeviceDataLoader(train_dl, device)
valid_dl = DeviceDataLoader(valid_dl, device)

In [None]:
train_loop(model, epochs=8, lr=0.05, wd=0.00001)

training loss:  1.23645598742904
valid loss 0.960 and accuracy 0.604
training loss:  1.0135881132884481
valid loss 0.953 and accuracy 0.603
training loss:  0.9957453416041611
valid loss 0.882 and accuracy 0.633
training loss:  0.9612488248517119
valid loss 0.883 and accuracy 0.631
training loss:  0.95715885815289
valid loss 0.877 and accuracy 0.640
training loss:  0.9451313052297947
valid loss 0.875 and accuracy 0.634
training loss:  0.9463126289713688
valid loss 0.877 and accuracy 0.634
training loss:  0.9450647421837804
valid loss 0.866 and accuracy 0.635


## Test Output

In [None]:
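# the test set has no labels, so pass dummy zeros to satisfy the Dataset interface
test_ds = ShelterOutcomeDataset(test_processed, np.zeros(len(test_processed)), embedded_col_names)
test_dl = DataLoader(test_ds, batch_size=batch_size)
test_dl = DeviceDataLoader(test_dl, device)  # keep inputs on the same device as the model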

In [None]:
preds = []
model.eval()  # disable dropout and use the running batch-norm statistics
with torch.no_grad():
    for x1, x2, _ in test_dl:
        out = model(x1, x2)
        prob = F.softmax(out, dim=1)
        preds.append(prob.cpu())

In [None]:
final_probs = [item for sublist in preds for item in sublist]

In [None]:
len(final_probs)

11456

In [None]:
target_dict

{'Adoption': 0,
 'Died': 1,
 'Euthanasia': 2,
 'Return_to_owner': 3,
 'Transfer': 4}

In [None]:
sample.head()

Unnamed: 0,ID,Adoption,Died,Euthanasia,Return_to_owner,Transfer
0,1,1,0,0,0,0
1,2,1,0,0,0,0
2,3,1,0,0,0,0
3,4,1,0,0,0,0
4,5,1,0,0,0,0


In [None]:
sample['Adoption'] = [float(t[0]) for t in final_probs]
sample['Died'] = [float(t[1]) for t in final_probs]
sample['Euthanasia'] = [float(t[2]) for t in final_probs]
sample['Return_to_owner'] = [float(t[3]) for t in final_probs]
sample['Transfer'] = [float(t[4]) for t in final_probs]
sample.head()

Unnamed: 0,ID,Adoption,Died,Euthanasia,Return_to_owner,Transfer
0,1,0.100872,0.004188,0.07126,0.183928,0.639751
1,2,0.524133,0.001504,0.021052,0.2887,0.16461
2,3,0.541326,0.002692,0.026802,0.103425,0.325755
3,4,0.065527,0.004611,0.053677,0.072495,0.80369
4,5,0.586185,0.001023,0.01181,0.216782,0.1842


In [None]:
sample.to_csv('samp.csv', index=False)