# Inductive Node Classification

![Neo4j version](https://img.shields.io/badge/Neo4j->=4.4.9-brightgreen)
![GDS version](https://img.shields.io/badge/GDS-2.3-brightgreen)
![GDS Python Client version](https://img.shields.io/badge/GDS_Python_Client-1.6-brightgreen)

__This notebook demonstrates how graph features can be used to improve Machine Learning accuracy__ in an inductive setting.
In this example, we see accuracy increase by ~25% for supervised node classification.

## Inductive Node Classification
In this application of graph machine learning we are provided a graph where some nodes have a target label that we wish to predict.  The goal is to train a model on the graph for one of two purposes:
1. Predict the target label of new nodes as new data is added to the graph
2. Predict target labels of nodes on a separate similar graph

We refer to this as *__inductive__* graph machine learning. It is a very useful approach for leveraging relationships to improve ML in situations where you must reuse the same model to predict on new, unseen, data.

## Dataset
The dataset we use to demonstrate is the [`ogbn-arxiv`](https://ogb.stanford.edu/docs/nodeprop/#ogbn-arxiv) citation graph which is composed of papers as nodes and citations between papers as relationships.  Each paper comes with a 128-dimensional floating point vector representing a word embedding of the paper's title and abstract.

- __Classification Task:__ The goal is to predict the paper subject, i.e. what the paper is about. There are 40 possible subjects.

- __Data Splitting__: To reflect the inductive setting, we mirror a realistic scenario where the model is trained on current data and then used to predict on new, unseen, future data. To accomplish this, we split the train, validation, and test sets by publication date. Training papers are those published through 2017, validation papers are those published in 2018, and test papers are those published in 2019 or later.


## Models
We will train 4 Machine Learning Models and compare their test set accuracy:

- __Default Best Guess__: A really naive heuristic (predict by most frequent class in training set) to use as a sanity check. A model must do better than this to be useful.
- __Non-Graph (NLP Only)__ : Use just the 128-dimensional word embeddings as feature inputs to a Neural Network classifier.
- __Graph ML with FastRP Embeddings:__ Generate Fast Random Projection (FastRP) node embeddings with the word embeddings as node feature properties. Use the FastRP embeddings as inputs to a Neural Network classifier.
- __Graph ML with GraphSAGE Embeddings:__ Train an unsupervised GraphSAGE model (a type of neural network) on a subset of the graph, using word embeddings as inputs, to generate node embeddings. Use the GraphSAGE model to predict node embeddings on the entire graph. Use the node embeddings as inputs to a Neural Network classifier.




In [47]:
import torch
import pandas as pd
import numpy as np
from graphdatascience import GraphDataScience
from dotenv import load_dotenv
import os
import benchmark.ogbn_arxiv as bm_ogbn_arxiv
from graph_data.data_import import get_ogbn_arxiv_data

## Prepare Data
Load the source data and create train, validation, and test set indexes, split by publication year to reflect the inductive setting.

In [48]:
VALID_YEAR = 2018

In [49]:
paper_source_df, citation_source_df = get_ogbn_arxiv_data()
paper_source_df

Unnamed: 0,nodeId,textEmbedding,year,subjectId
0,0,"[-0.05794300138950348, -0.05253000184893608, -...",2013,4
1,1,"[-0.12449999898672104, -0.07066500186920166, -...",2015,5
2,2,"[-0.08024200052022934, -0.02332800067961216, -...",2014,28
3,3,"[-0.1450439989566803, 0.05491499975323677, -0....",2014,8
4,4,"[-0.07115399837493896, 0.07076600193977356, -0...",2014,27
...,...,...,...,...
169338,169338,"[-0.32135099172592163, -0.03933500126004219, -...",2020,4
169339,169339,"[-0.15121200680732727, -0.12470199912786484, -...",2020,24
169340,169340,"[-0.22053000330924988, -0.03656800091266632, -...",2020,10
169341,169341,"[-0.13823600113391876, 0.04088500142097473, -0...",2020,4


In [50]:
# Work on a copy so the source data stays unchanged
paper_df = paper_source_df.copy()


# split train, valid and test by year to reflect inductive setting
train_idx = paper_df[paper_df.year < VALID_YEAR].nodeId
valid_idx = paper_df[paper_df.year == VALID_YEAR].nodeId
test_idx = paper_df[paper_df.year > VALID_YEAR].nodeId

print(f'{100*len(train_idx)/paper_df.shape[0]:.6}% of the papers are assigned to the train set')
print(f'{100*len(valid_idx)/paper_df.shape[0]:.6}% of the papers are assigned to the validation set')
print(f'{100*len(test_idx)/paper_df.shape[0]:.6}% of the papers are assigned to the test set')

print(f'{100*(len(train_idx) + len(test_idx) + len(valid_idx))/paper_df.shape[0]}% of the papers are assigned to just one of the above sets')

53.7022% of the papers are assigned to the train set
17.5968% of the papers are assigned to the validation set
28.7009% of the papers are assigned to the test set
100.0% of the papers are assigned to just one of the above sets


## Default Best Guess
As a dummy baseline - how well would we do if we predicted every example as the most frequent class in the training set?
A useful model must do better than this.

In [51]:
%%time
default_stats = bm_ogbn_arxiv.default_best_guess_benchmark(paper_df, train_idx, valid_idx, test_idx)
default_stats

CPU times: user 22.7 ms, sys: 1.82 ms, total: 24.5 ms
Wall time: 23.6 ms


{'train_acc': 0.17906114953651267,
 'valid_acc': 0.07627772744051814,
 'test_acc': 0.05861778079542415}

This performs horribly, as one may expect, with an accuracy of about 18% in training, and it gets worse in validation (about 8%) and test (about 6%).
The drop in accuracy going from train -> valid -> test is an example of the data drifting over time.  __Data drift is a common issue in inductive settings__.
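
The internals of `default_best_guess_benchmark` are not shown in this notebook. As a rough illustration of what a most-frequent-class baseline computes, here is a minimal standard-library sketch. The function name `majority_class_accuracy` and the toy labels are made up for this example and are not part of the benchmark module.

```python
from collections import Counter


def majority_class_accuracy(train_labels, eval_labels):
    """Accuracy obtained by always predicting the most frequent training label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in eval_labels if label == most_common_label)
    return hits / len(eval_labels)


# Toy labels (hypothetical): class 4 dominates the training period
# but becomes rare later, which mimics drift over time.
train_labels = [4, 4, 4, 5, 8, 4, 28, 5]
test_labels = [5, 24, 4, 10, 5, 24, 10, 27]

train_acc = majority_class_accuracy(train_labels, train_labels)
test_acc = majority_class_accuracy(train_labels, test_labels)
print(f"train_acc={train_acc:.3f}, test_acc={test_acc:.3f}")
```

Because the predicted class is fixed from the training set, any shift in class frequencies over time shows up directly as a drop in accuracy on later splits.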

## Non Graph (NLP Only)
Use non-graph features only.  For a model with graph features to be useful, it must do better than this.

In [52]:
# convert indexes to tensors for PyTorch
train_idx = torch.tensor(train_idx.to_numpy(), dtype=torch.long)
valid_idx = torch.tensor(valid_idx.to_numpy(), dtype=torch.long)
test_idx = torch.tensor(test_idx.to_numpy(), dtype=torch.long)

In [53]:
%%time
x=torch.tensor(np.stack(paper_df.textEmbedding), dtype=torch.float)
y=torch.tensor(paper_df.subjectId.to_numpy(), dtype=torch.long)

non_graph_benchmark, _ = bm_ogbn_arxiv.run_model(x, y, train_idx, valid_idx, test_idx,
                                                     hidden_dims=[64], epochs=300, patience=3, verbose=False)

CPU times: user 17.4 s, sys: 5.86 s, total: 23.3 s
Wall time: 8.48 s


In [54]:
non_graph_benchmark.qualityChecks

{'train': {'allZeroFeatureVec_Count': 0, 'allZeroFeatureVec_Percent': 0.0},
 'valid': {'allZeroFeatureVec_Count': 0, 'allZeroFeatureVec_Percent': 0.0},
 'test': {'allZeroFeatureVec_Count': 0, 'allZeroFeatureVec_Percent': 0.0}}

In [55]:
non_graph_benchmark.bestStats

{'epoch': 98,
 'loss': 1.6757088899612427,
 'train_acc': 0.5501478980877712,
 'valid_acc': 0.5436424041075204,
 'test_acc': 0.5174989198197643}

## Load Graph Into Neo4j for Feature Engineering

In [56]:
load_dotenv('db-credentials.env', override=True)

# Use Neo4j URI and credentials according to our setup
gds = GraphDataScience(
    os.getenv('NEO4J_URI'),
    auth=(os.getenv('NEO4J_USERNAME'),
          os.getenv('NEO4J_PASSWORD')),
    aura_ds=os.getenv('AURA_DS', 'false').strip().lower() == 'true')

# Necessary if you enabled Arrow on the db - this is true for AuraDS
gds.set_database("neo4j")
PROJ_NAME = 'proj'

In [57]:
gds.version()

'2.3.0'

In [58]:
# Use node labels to demarcate test, valid, train split
node_df = paper_df.drop(columns=['subjectId'])
def label_nodes_from_year(x):
    given_node_labels = ['Paper']
    res = given_node_labels.copy()
    if x < VALID_YEAR:
        res.append('Train')
    elif x == VALID_YEAR:
        res.append('Valid')
    else:
        res.append('Test')
    return res
node_df['labels'] = node_df.year.apply(label_nodes_from_year)
node_df

Unnamed: 0,nodeId,textEmbedding,year,labels
0,0,"[-0.05794300138950348, -0.05253000184893608, -...",2013,"[Paper, Train]"
1,1,"[-0.12449999898672104, -0.07066500186920166, -...",2015,"[Paper, Train]"
2,2,"[-0.08024200052022934, -0.02332800067961216, -...",2014,"[Paper, Train]"
3,3,"[-0.1450439989566803, 0.05491499975323677, -0....",2014,"[Paper, Train]"
4,4,"[-0.07115399837493896, 0.07076600193977356, -0...",2014,"[Paper, Train]"
...,...,...,...,...
169338,169338,"[-0.32135099172592163, -0.03933500126004219, -...",2020,"[Paper, Test]"
169339,169339,"[-0.15121200680732727, -0.12470199912786484, -...",2020,"[Paper, Test]"
169340,169340,"[-0.22053000330924988, -0.03656800091266632, -...",2020,"[Paper, Test]"
169341,169341,"[-0.13823600113391876, 0.04088500142097473, -0...",2020,"[Paper, Test]"


In [59]:
rel_df = citation_source_df.rename(columns={'paper': 'sourceNodeId', 'citedPaper': 'targetNodeId'})
rel_df['relationshipType'] = 'CITED'
rel_df

Unnamed: 0,sourceNodeId,targetNodeId,relationshipType
0,104447,13091,CITED
1,15858,47283,CITED
2,107156,69161,CITED
3,107156,136440,CITED
4,107156,107366,CITED
...,...,...,...
1166238,45118,79124,CITED
1166239,45118,147994,CITED
1166240,45118,162473,CITED
1166241,45118,162537,CITED


In [60]:
if gds.graph.exists(PROJ_NAME)['exists']:
    gds.graph.get(PROJ_NAME).drop()

In [61]:
%%time
g = gds.alpha.graph.construct(PROJ_NAME, node_df, rel_df, undirected_relationship_types = ['CITED'])

Uploading Nodes:   0%|          | 0/169343 [00:00<?, ?Records/s]

Uploading Relationships:   0%|          | 0/1166243 [00:00<?, ?Records/s]

CPU times: user 880 ms, sys: 1.41 s, total: 2.29 s
Wall time: 1min 18s


In [62]:
print(f'Node Count: {g.node_count():,}')
print(f'Relationship Count: {g.relationship_count():,}')

Node Count: 169,343
Relationship Count: 2,332,486


## Graph ML with FastRP Embeddings

### Generating Embeddings
To reflect the inductive setting, we can only use the data available at the time within each data split. This means that:
1. The training set embeddings cannot use nodes and relationships from the validation and test sets (i.e. papers published after 2017)
2. The validation set embeddings cannot use nodes and relationships from the test set (i.e. papers published after 2018)

Violating the above would be a form of [data leakage](https://en.wikipedia.org/wiki/Leakage_(machine_learning)). To accomplish this filtering we can use the `nodeLabels` parameter to filter out data splits when creating embeddings for each set.
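
To make the filtering idea concrete, here is a small conceptual sketch using plain Python on a toy graph. It is not how GDS implements `nodeLabels` internally. It only illustrates the intended effect: each split sees just the nodes with the allowed labels and the relationships whose endpoints are both among those nodes. The toy data and the helper `visible_subgraph` are hypothetical.

```python
# Toy data (hypothetical): node id -> split label, plus undirected citation edges.
split_of = {0: "Train", 1: "Train", 2: "Valid", 3: "Test", 4: "Train"}
edges = [(0, 1), (1, 2), (2, 3), (0, 3), (1, 4)]


def visible_subgraph(split_of, edges, allowed_splits):
    """Keep nodes in the allowed splits and edges with both endpoints kept."""
    nodes = {node for node, split in split_of.items() if split in allowed_splits}
    kept_edges = [(u, v) for u, v in edges if u in nodes and v in nodes]
    return nodes, kept_edges


# Each split may only see data that existed "at the time".
train_view = visible_subgraph(split_of, edges, {"Train"})
valid_view = visible_subgraph(split_of, edges, {"Train", "Valid"})
test_view = visible_subgraph(split_of, edges, {"Train", "Valid", "Test"})

for name, (nodes, kept) in [("train", train_view), ("valid", valid_view), ("test", test_view)]:
    print(f"{name}: {len(nodes)} nodes, {len(kept)} relationships")
```

In the toy graph, the edge between nodes 1 and 2 is invisible while computing training embeddings, because node 2 belongs to the validation split.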


In [63]:
RANDOM_SEED = 7474

In [64]:
stats = gds.fastRP.mutate(g, embeddingDimension=256, mutateProperty='trainFastrp', nodeLabels=['Train'],
                  featureProperties=['textEmbedding'], propertyRatio=0.5, randomSeed=RANDOM_SEED)
stats[['nodePropertiesWritten', 'computeMillis']]

FastRP:   0%|          | 0/100 [00:00<?, ?%/s]

nodePropertiesWritten    90941
computeMillis             1004
Name: 0, dtype: object

In [65]:
stats = gds.fastRP.mutate(g, embeddingDimension=256, mutateProperty='validFastrp', nodeLabels=['Train', 'Valid'],
                  featureProperties=['textEmbedding'], propertyRatio=0.5, randomSeed=RANDOM_SEED)
stats[['nodePropertiesWritten', 'computeMillis']]

FastRP:   0%|          | 0/100 [00:00<?, ?%/s]

nodePropertiesWritten    120740
computeMillis              1442
Name: 0, dtype: object

In [66]:
stats = gds.fastRP.mutate(g, embeddingDimension=256, mutateProperty='testFastrp',
                  featureProperties=['textEmbedding'], propertyRatio=0.5, randomSeed=RANDOM_SEED)
stats[['nodePropertiesWritten', 'computeMillis']]

FastRP:   0%|          | 0/100 [00:00<?, ?%/s]

nodePropertiesWritten    169343
computeMillis              1570
Name: 0, dtype: object

In [67]:
# retrieve embeddings and merge into other data
fastrp_train_df = gds.graph.nodeProperties.stream(g, node_properties=['trainFastrp'], node_labels=['Train'],
                                               separate_property_columns=True).rename(columns={'trainFastrp': 'fastrpEmb'})
fastrp_valid_df = gds.graph.nodeProperties.stream(g, node_properties=['validFastrp'], node_labels=['Valid'],
                                               separate_property_columns=True).rename(columns={'validFastrp': 'fastrpEmb'})
fastrp_test_df = gds.graph.nodeProperties.stream(g, node_properties=['testFastrp'], node_labels=['Test'],
                                              separate_property_columns=True).rename(columns={'testFastrp': 'fastrpEmb'})
fastrp_df = pd.concat([fastrp_train_df, fastrp_valid_df, fastrp_test_df]).reset_index(drop=True)
fastrp_df = paper_df.merge(fastrp_df, on='nodeId').drop(columns=['textEmbedding'])
fastrp_df

Unnamed: 0,nodeId,year,subjectId,fastrpEmb
0,0,2013,4,"[-0.018001456, 0.002976255, -0.0039167125, 0.0..."
1,1,2015,5,"[0.03180246, -0.0023662215, 0.010938041, -0.01..."
2,2,2014,28,"[-0.0033430415, -0.00047625753, 0.008955019, 0..."
3,3,2014,8,"[0.04108315, -0.020637406, 0.0030388825, -0.00..."
4,4,2014,27,"[0.030352138, 0.0052911597, -0.026630275, -0.0..."
...,...,...,...,...
169338,169338,2020,4,"[-0.008318075, -0.0039453013, -0.0050907657, 2..."
169339,169339,2020,24,"[-0.0048398594, -0.0017861046, 0.003423419, 0...."
169340,169340,2020,10,"[-0.00029355925, 0.0054876376, -0.0012249375, ..."
169341,169341,2020,4,"[-0.006040647, 0.0008232696, -0.0027380046, 0...."


### Train Model with FastRP Embeddings

In [68]:
%%time
x=torch.tensor(np.stack(fastrp_df.fastrpEmb), dtype=torch.float)
y=torch.tensor(fastrp_df.subjectId.to_numpy(), dtype=torch.long)

fastrp_benchmark, _= bm_ogbn_arxiv.run_model(x, y, train_idx, valid_idx, test_idx,
                                                 hidden_dims=[64], epochs=300, patience=3, verbose=False)

CPU times: user 19.8 s, sys: 6.62 s, total: 26.4 s
Wall time: 9.24 s


In [69]:
fastrp_benchmark.qualityChecks

{'train': {'allZeroFeatureVec_Count': 3342,
  'allZeroFeatureVec_Percent': 0.036749101065526},
 'valid': {'allZeroFeatureVec_Count': 1339,
  'allZeroFeatureVec_Percent': 0.044934393771603076},
 'test': {'allZeroFeatureVec_Count': 0, 'allZeroFeatureVec_Percent': 0.0}}
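
The code behind `qualityChecks` is not shown in this notebook. Judging from the output alone, it appears to count feature vectors whose entries are all zero, and the `allZeroFeatureVec_Percent` values look like fractions of the split (3342 of 90941 training nodes is about 0.0367). A minimal sketch of such a check, under those assumptions and with a hypothetical helper name, could look like this:

```python
def all_zero_vector_stats(vectors):
    """Count vectors whose entries are all zero and report the fraction of the total."""
    count = sum(1 for vec in vectors if all(x == 0 for x in vec))
    return {
        "allZeroFeatureVec_Count": count,
        # Reported as a fraction in [0, 1], mirroring the output above.
        "allZeroFeatureVec_Percent": count / len(vectors),
    }


# Toy embeddings (hypothetical): one of the four vectors carries no signal.
toy_embeddings = [
    [0.02, -0.01, 0.00],
    [0.00, 0.00, 0.00],
    [0.01, 0.03, -0.02],
    [0.00, 0.05, 0.00],
]
toy_stats = all_zero_vector_stats(toy_embeddings)
print(toy_stats)
```

An all-zero embedding gives the classifier nothing to work with for that node, so it is worth tracking how many there are in each split.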

In [70]:
fastrp_benchmark.bestStats

{'epoch': 100,
 'loss': 1.1686692237854004,
 'train_acc': 0.6832011963800706,
 'valid_acc': 0.6505251854089064,
 'test_acc': 0.6654527498302574}

In [71]:
# clean the graph projection
gds.graph.nodeProperties.drop(g, ['trainFastrp', 'validFastrp', 'testFastrp'])

graphName                                              proj
nodeProperties       [testFastrp, trainFastrp, validFastrp]
propertiesRemoved                                    381024
Name: 0, dtype: object

## Graph ML with GraphSage Embeddings

### Generating Embeddings
As with FastRP, we need to use the `nodeLabels` parameter to filter to the appropriate data split(s) when generating embeddings, so as to avoid any potential data leakage. In addition, for GraphSAGE, we also need to make sure that we only sample from the training set when creating a subgraph for training.
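
As a simplified illustration of sampling with a random walk with restarts while staying inside the training set, here is a standard-library sketch on a toy graph. It is not the GDS sampler, which has more options and a different implementation. The function `rwr_sample`, its parameters, and the toy adjacency list are assumptions made for this example.

```python
import random


def rwr_sample(adjacency, allowed, start, target_size, restart_prob, seed, max_steps=10_000):
    """Collect nodes with a random walk with restarts that only visits `allowed` nodes."""
    rng = random.Random(seed)
    visited = {start}
    current = start
    for _ in range(max_steps):
        if len(visited) >= target_size:
            break
        # Only neighbors inside the allowed set (e.g. training nodes) are reachable.
        neighbors = [n for n in adjacency.get(current, []) if n in allowed]
        if not neighbors or rng.random() < restart_prob:
            current = start  # restart, also used when the walk has nowhere to go
        else:
            current = rng.choice(neighbors)
            visited.add(current)
    return visited


# Toy undirected graph (hypothetical); nodes 0, 1 and 4 play the role of training nodes.
adjacency = {0: [1, 3], 1: [0, 2, 4], 2: [1, 3], 3: [0, 2], 4: [1]}
train_nodes = {0, 1, 4}

sample = rwr_sample(adjacency, train_nodes, start=0, target_size=2,
                    restart_prob=0.05, seed=7474)
print(sorted(sample))
```

Since the walk never steps onto a node outside `train_nodes`, the sampled subgraph cannot contain validation or test papers, which is the property the training sample needs.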

In [72]:
# create a graph subsample to train graphSage. This will speed up computation
if gds.graph.exists(PROJ_NAME + '_sample')['exists']:
    gds.graph.get(PROJ_NAME + '_sample').drop()
g_sample, _ = gds.alpha.graph.sample.rwr(PROJ_NAME + '_sample', g, nodeLabels=['Train'], samplingRatio=0.5,
                                         restartProbability=0.05, concurrency=1, randomSeed=RANDOM_SEED)

Random walk with restarts sampling:   0%|          | 0/100 [00:00<?, ?%/s]

In [73]:
# train GraphSage
if gds.beta.model.exists('gsModel')['exists']:
    gds.model.get('gsModel').drop()
gds.beta.graphSage.train(g_sample, modelName='gsModel', embeddingDimension=256, sampleSizes=[30, 30],
                         searchDepth=20, epochs=20, learningRate=0.001, activationFunction='RELU',
                         aggregator='MEAN', featureProperties=['textEmbedding'], randomSeed=RANDOM_SEED,
                         batchSize=10)

(GraphSageModel({'modelInfo': {0: {'modelName': 'gsModel', 'modelType': 'graphSage', 'metrics': {'ranIterationsPerEpoch': [10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10], 'iterationLossesPerEpoch': [[25.791606997865387, 25.114036037415012, 24.375698489073265, 23.092969808841165, 20.908147085065416, 20.11745489430456, 19.614694880398794, 18.16501598766947, 17.461258242101753, 17.811124640310517], [15.974288570042626, 16.68824702498322, 17.0807047504146, 16.758503383307772, 16.29289904352825, 16.04885266306575, 16.108261032916104, 16.22598518354161, 16.499142604490693, 15.82853004672163], [15.916939146288438, 16.64429836677249, 17.274962391115928, 16.293324827535205, 16.503968911663517, 17.138665891338245, 16.451237916505356, 16.42754990374648, 16.296662875259795, 15.554955869252575], [15.429310096682144, 16.628391177029304, 16.97519335026579, 16.933382472716467, 16.406737242337364, 17.21787291214148, 16.007909225133552, 15.988575019748092, 15.72984122104

In [74]:
# generate graphSage embeddings
stats = gds.beta.graphSage.mutate(g, modelName='gsModel', nodeLabels=['Train'], mutateProperty='trainGraphSageEmb')
stats[['nodePropertiesWritten', 'computeMillis']]

GraphSage:   0%|          | 0/100 [00:00<?, ?%/s]

nodePropertiesWritten    90941
computeMillis            15434
Name: 0, dtype: object

In [75]:
stats = gds.beta.graphSage.mutate(g, modelName='gsModel', nodeLabels=['Train', 'Valid'],mutateProperty='validGraphSageEmb')
stats[['nodePropertiesWritten', 'computeMillis']]

GraphSage:   0%|          | 0/100 [00:00<?, ?%/s]

nodePropertiesWritten    120740
computeMillis             23842
Name: 0, dtype: object

In [76]:
stats = gds.beta.graphSage.mutate(g, modelName='gsModel', mutateProperty='testGraphSageEmb')
stats[['nodePropertiesWritten', 'computeMillis']]

GraphSage:   0%|          | 0/100 [00:00<?, ?%/s]

nodePropertiesWritten    169343
computeMillis             37295
Name: 0, dtype: object

In [77]:
# retrieve embeddings and merge into other data
gs_train_df = gds.graph.nodeProperties.stream(g, node_properties=['trainGraphSageEmb'], node_labels=['Train'],
                                               separate_property_columns=True).rename(columns={'trainGraphSageEmb': 'graphSageEmb'})

gs_valid_df = gds.graph.nodeProperties.stream(g, node_properties=['validGraphSageEmb'], node_labels=['Valid'],
                                               separate_property_columns=True).rename(columns={'validGraphSageEmb': 'graphSageEmb'})

gs_test_df = gds.graph.nodeProperties.stream(g, node_properties=['testGraphSageEmb'], node_labels=['Test'],
                                              separate_property_columns=True).rename(columns={'testGraphSageEmb': 'graphSageEmb'})

graphsage_df = pd.concat([gs_train_df, gs_valid_df, gs_test_df]).reset_index(drop=True)
graphsage_df = paper_df.merge(graphsage_df, on='nodeId').drop(columns=['textEmbedding'])
graphsage_df

Unnamed: 0,nodeId,year,subjectId,graphSageEmb
0,0,2013,4,"[-0.005992484327022294, -0.005962161298943747,..."
1,1,2015,5,"[-0.008038558263026806, -0.007251417085306731,..."
2,2,2014,28,"[-0.012396935589389924, -0.019481974472033814,..."
3,3,2014,8,"[-0.0029723242472047863, -0.001560210898866543..."
4,4,2014,27,"[-0.0037868415162238406, -0.002975725965524208..."
...,...,...,...,...
169338,169338,2020,4,"[-0.013385240564853875, -0.018195681974958553,..."
169339,169339,2020,24,"[-0.015074901577559354, -0.017700057011142297,..."
169340,169340,2020,10,"[-0.03498560267038506, -0.0393931385640612, -0..."
169341,169341,2020,4,"[-0.00911268037011832, -0.011591528839730993, ..."


In [78]:
%%time
x=torch.tensor(np.stack(graphsage_df.graphSageEmb), dtype=torch.float)
y=torch.tensor(graphsage_df.subjectId.to_numpy(), dtype=torch.long)

gs_benchmark, _ = bm_ogbn_arxiv.run_model(x, y, train_idx, valid_idx, test_idx,
                                             hidden_dims=[64], epochs=300, patience=3, verbose=False)

CPU times: user 25.4 s, sys: 8.27 s, total: 33.7 s
Wall time: 12.2 s


In [79]:
gs_benchmark.qualityChecks

{'train': {'allZeroFeatureVec_Count': 0, 'allZeroFeatureVec_Percent': 0.0},
 'valid': {'allZeroFeatureVec_Count': 0, 'allZeroFeatureVec_Percent': 0.0},
 'test': {'allZeroFeatureVec_Count': 0, 'allZeroFeatureVec_Percent': 0.0}}

In [80]:
gs_benchmark.bestStats

{'epoch': 128,
 'loss': 1.2998781204223633,
 'train_acc': 0.6511694395267261,
 'valid_acc': 0.6334440753045404,
 'test_acc': 0.6413801617184124}

In [81]:
# clean the graph projections
g_sample.drop()
gds.graph.nodeProperties.drop(g, ['trainGraphSageEmb', 'validGraphSageEmb', 'testGraphSageEmb'])

graphName                                                         proj
nodeProperties       [testGraphSageEmb, trainGraphSageEmb, validGra...
propertiesRemoved                                               381024
Name: 0, dtype: object

## Results

In [82]:
non_graph_acc = non_graph_benchmark.bestStats['test_acc']
fastrp_acc = fastrp_benchmark.bestStats['test_acc']
gs_acc = gs_benchmark.bestStats['test_acc']

print(f'====== Model Results =========')
pd.DataFrame({'Model': ['Non Graph (NLP Only)', 'Graph ML with FastRP Embeddings', 'Graph ML with GraphSAGE Embeddings'],
    'Test Set Accuracy': [round(non_graph_acc,3), round(fastrp_acc, 3), round(gs_acc, 3)],
    '% Improvement Over Non-Graph': ['.', f'{(fastrp_acc - non_graph_acc)/non_graph_acc:.2%}', f'{(gs_acc - non_graph_acc)/non_graph_acc:.2%}']})



Unnamed: 0,Model,Test Set Accuracy,% Improvement Over Non-Graph
0,Non Graph (NLP Only),0.517,.
1,Graph ML with FastRP Embeddings,0.665,28.59%
2,Graph ML with GraphSAGE Embeddings,0.641,23.94%


## Cleanup

In [83]:
if gds.beta.model.exists('gsModel')['exists']:
    gds.model.get('gsModel').drop()

In [84]:
g.drop()

graphName                                                             proj
database                                                             neo4j
memoryUsage                                                               
sizeInBytes                                                             -1
nodeCount                                                           169343
relationshipCount                                                  2332486
configuration                                                           {}
density                                                           0.000081
creationTime                           2023-01-23T02:29:42.748771399+00:00
modificationTime                       2023-01-23T02:32:47.532329610+00:00
schema                   {'graphProperties': {}, 'relationships': {'CIT...
schemaWithOrientation    {'graphProperties': {}, 'relationships': {'CIT...
Name: 0, dtype: object