# Assignment: IMDB Classification


*IMDB-BINARY* is a movie collaboration dataset that consists of the ego-networks of 1,000 actors and actresses who played roles in movies listed on IMDB (https://www.imdb.com/). In each graph, nodes represent actors and actresses, and an edge connects two of them if they appear in the same movie. The graphs are derived from the **Action** and **Romance** genres. The task is to identify which genre an ego-network graph belongs to.

 

### Get Data

In [1]:
# Install required packages.
import os
import torch
os.environ['TORCH'] = torch.__version__
print(torch.__version__)

!pip install -q torch-scatter -f https://data.pyg.org/whl/torch-${TORCH}.html
!pip install -q torch-sparse -f https://data.pyg.org/whl/torch-${TORCH}.html
!pip install -q git+https://github.com/pyg-team/pytorch_geometric.git

1.11.0+cu113


### Get Data Properties

In [12]:
import torch
from torch_geometric.datasets import TUDataset

dataset = TUDataset(root='data/TUDataset', name='IMDB-BINARY')

print()
print(f'Dataset: {dataset}:')
print('====================')
print(f'Number of graphs: {len(dataset)}')
print(f'Number of features: {dataset.num_features}')
print(f'Number of classes: {dataset.num_classes}')

data = dataset[0]  # Get the first graph object.

print()
print(data)
print('=============================================================')

# Gather some statistics about the first graph.
print(f'Number of nodes: {data.num_nodes}')
print(f'Number of edges: {data.num_edges}')
print(f'Average node degree: {data.num_edges / data.num_nodes:.2f}')
print(f'Has isolated nodes: {data.has_isolated_nodes()}')
print(f'Has self-loops: {data.has_self_loops()}')
print(f'Is undirected: {data.is_undirected()}')


Dataset: IMDB-BINARY(1000):
Number of graphs: 1000
Number of features: 0
Number of classes: 2

Data(edge_index=[2, 146], y=[1], num_nodes=20)
Number of nodes: 20
Number of edges: 146
Average node degree: 7.30
Has isolated nodes: False
Has self-loops: False
Is undirected: True
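
The output above reports `Number of features: 0`, so the graphs carry no node attributes and a model needs some input derived from structure alone. It also shows that `edge_index` stores each undirected edge in both directions, which is why `num_edges / num_nodes` (146 / 20 = 7.30) equals the average node degree. The sketch below illustrates both points in plain Python on a toy graph. The helper names `node_degrees` and `one_hot_degree` are hypothetical and not part of the assignment; in PyG, the transform `torch_geometric.transforms.OneHotDegree` is one option that serves a similar purpose.

```python
def node_degrees(edge_index, num_nodes):
    """Count how often each node appears as a source in a [2, E] edge list.

    Assumes every undirected edge is stored in both directions,
    as in the dataset statistics above.
    """
    sources, _targets = edge_index
    deg = [0] * num_nodes
    for s in sources:
        deg[s] += 1
    return deg


def one_hot_degree(deg, max_degree):
    """Encode each degree as a one-hot vector of length max_degree + 1."""
    return [[1.0 if d == k else 0.0 for k in range(max_degree + 1)] for d in deg]


# Toy graph: triangle 0-1-2 plus a pendant node 3 attached to node 2.
edge_index = [
    [0, 1, 1, 2, 2, 0, 2, 3],
    [1, 0, 2, 1, 0, 2, 3, 2],
]
num_nodes = 4

deg = node_degrees(edge_index, num_nodes)       # [2, 2, 3, 1]
features = one_hot_degree(deg, max(deg))        # one row per node
avg_degree = len(edge_index[0]) / num_nodes     # 8 / 4 = 2.0
```

Whether a one-hot degree, a scalar degree, or a constant feature works best for this dataset is something to check experimentally.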


### Splitting Train and Test Data

In [13]:
torch.manual_seed(12345)
dataset = dataset.shuffle()

train_dataset = dataset[:850]
test_dataset = dataset[850:]

print(f'Number of training graphs: {len(train_dataset)}')
print(f'Number of test graphs: {len(test_dataset)}')

Number of training graphs: 850
Number of test graphs: 150


### Data Loader

In [6]:
from torch_geometric.loader import DataLoader

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)

for step, data in enumerate(train_loader):
    print(f'Step {step + 1}:')
    print('=======')
    print(f'Number of graphs in the current batch: {data.num_graphs}')
    print(data)
    print()

Step 1:
Number of graphs in the current batch: 64
DataBatch(edge_index=[2, 13694], y=[64], num_nodes=1328, batch=[1328], ptr=[65])

Step 2:
Number of graphs in the current batch: 64
DataBatch(edge_index=[2, 12360], y=[64], num_nodes=1294, batch=[1294], ptr=[65])

Step 3:
Number of graphs in the current batch: 64
DataBatch(edge_index=[2, 10680], y=[64], num_nodes=1166, batch=[1166], ptr=[65])

Step 4:
Number of graphs in the current batch: 64
DataBatch(edge_index=[2, 15298], y=[64], num_nodes=1340, batch=[1340], ptr=[65])

Step 5:
Number of graphs in the current batch: 64
DataBatch(edge_index=[2, 9880], y=[64], num_nodes=1217, batch=[1217], ptr=[65])

Step 6:
Number of graphs in the current batch: 64
DataBatch(edge_index=[2, 12160], y=[64], num_nodes=1352, batch=[1352], ptr=[65])

Step 7:
Number of graphs in the current batch: 64
DataBatch(edge_index=[2, 13044], y=[64], num_nodes=1210, batch=[1210], ptr=[65])

Step 8:
Number of graphs in the current batch: 64
DataBatch(edge_index=[2, 12
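
Each `DataBatch` above is a single large graph made of 64 disconnected ego-networks: `batch` assigns every node to its graph, and `ptr` holds the cumulative node counts (hence 65 entries for 64 graphs). The following plain-Python sketch mimics that bookkeeping on two toy graphs. The function `collate` is a hypothetical illustration of the idea, not the PyG implementation.

```python
def collate(graphs):
    """Merge graphs given as (edge_index, num_nodes) pairs into one batch.

    Returns the merged edge_index, a per-node graph id list (batch),
    and cumulative node offsets (ptr).
    """
    src, dst, batch, ptr = [], [], [], [0]
    for graph_id, (edge_index, num_nodes) in enumerate(graphs):
        offset = ptr[-1]  # shift node ids so they stay unique in the batch
        src += [s + offset for s in edge_index[0]]
        dst += [t + offset for t in edge_index[1]]
        batch += [graph_id] * num_nodes
        ptr.append(offset + num_nodes)
    return [src, dst], batch, ptr


g1 = ([[0, 1], [1, 0]], 2)              # single edge 0-1
g2 = ([[0, 1, 1, 2], [1, 0, 2, 1]], 3)  # path 0-1-2

edge_index, batch, ptr = collate([g1, g2])
# edge_index == [[0, 1, 2, 3, 3, 4], [1, 0, 3, 2, 4, 3]]
# batch      == [0, 0, 1, 1, 1]
# ptr        == [0, 2, 5]
```

Because no edges cross graph boundaries, message passing on the merged graph never mixes information between graphs, and the `batch` vector is what a pooling step uses to produce one embedding per graph.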

## Task: Build and train a GNN (for example, a GCN) to classify the graphs in the test set.