# A Crawler for https://kunstaspekte.art with Neo4j support

## The Website 

It calls itself an "international exhibition announcements and artist catalogue. It contains huge amount of data related to artists, venues and events.
It is constructed around artist, venue and event pages, all containing links to other related pages.

## The Crawler

It is built on Scrapy architecture. It starts with pre-programmed initial "starting url" list. It goes through all the initial urls, checks if there are new urls in those pages and stores them. 
For all stored urls, it makes a new request, which is put in the request que and later scraped in compliance with the settings file.

For all responses that it gets, it uses a appropriate parse method to extract data.

## Parsing

For parsing instead of the built-in parse functions of Scrapy, this script uses Beautiful Soup 4 module for Python, which gives more freedom in navigating the responses.

## Scraped Data

The appropriate data found in the responses after parsing, is sent to the items pipeline, a bulit in Scrapy functionality;
For every page category (event, person and venue) parser decides on the appropriate item class and initiates an instance of that class. After that the items pipeline class takes action and processes the items in que, in compliance with the settings file. In this program, items pipeline starts a connection with a local Neo4j database when the crawling starts. For every item which goes through the items pipeline, the category of the item is checked, Neo4j nodes are created containing the corresponding data, and also any Neo4j relationships are created for those nodes and commited to the databese.

This notebook will not connect to a Neo4j database, all the necessary code is present but commented out. instead it will write the data to a csv file.   
Although in the end of the notebook there are few cells showing the Neo4j functionality, and some data in graph form.

## This Notebook

It aims to go through the code part by part and show the neo4j integration. work in progress..



### This cell is just setting some jupyter related settings

In [1]:
# Settings for notebook
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
# Show Python version
import platform
print('python version: ')
platform.python_version()

python version: 


'3.6.8'

### This cell is checking for requirements, if not present it installs them

In [15]:
try:
    import scrapy
    print('scrapy already installed')
except:
    !pip install scrapy
    import scrapy 
try:
    import neo4jupyter
    print('neo4jupyter already installed')
except:
    !pip install neo4jupyter
    import neo4jupyter
    
from scrapy.crawler import CrawlerProcess

scrapy already installed
neo4jupyter already installed


### This cell contains the settings.py file, which is necessary for the Crawler
#### It contains settings as;
- Bot name
- spider info
- whether to obey or not robots.txt file
- settings for the scrapy-proxy-pool moddule
- settings for the scrapy-user-agents moddule
- delays and concurrent process limits
- and other Scrapy related settings

In [4]:
# settings.py

BOT_NAME = 'artist'

SPIDER_MODULES = ['artist.spiders']
NEWSPIDER_MODULE = 'artist.spiders'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

PROXY_POOL_ENABLED = True

CONCURRENT_REQUESTS = 100

DOWNLOAD_DELAY = 0.1

RANDOM_UA_TYPE = 'desktop.random'
RANDOM_UA_FALLBACK = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36'
RANDOM_UA_SAME_OS_FAMILY = True

PROXY_POOL_FILTER_ANONYMOUS = False
PROXY_POOL_FILTER_TYPES = ['http', 'https']
PROXY_POOL_FILTER_CODE = 'us'
PROXY_POOL_REFRESH_INTERVAL = 600
PROXY_POOL_CLOSE_SPIDER = False
PROXY_POOL_FORCE_REFRESH = False
PROXY_POOL_TRY_WITH_HOST = True
PROXY_POOL_PAGE_RETRY_TIMES = 3

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 800,

    'scrapy_proxy_pool.middlewares.ProxyPoolMiddleware': 610,
    'scrapy_proxy_pool.middlewares.BanDetectionMiddleware': 620,
}


ITEM_PIPELINES = {
   'artist.pipelines.ArtistPipeline': 300,
}

# RetryMiddleware settings
RETRY_ENABLED = True
RETRY_TIMES = 5
RETRY_HTTP_CODES = [500, 503, 408, 400]

### This cell contains the pipelines.py file, which is necessary for the Crawler
It contains a pipeline method for the spider, which takes the created item objects by the parser as an argument and processes them according to their type  
There are 3 types of items;
- person
- envet 
- venue

#### if it is a person item
- it creates a node of that person with all the available data as attributes and 'person' as a label and puts it in a local subgraph variable
- for the venue list which come with the item;
 - it iterates over those list and creates new nodes with 'venue' label and only a 'url' attribute
 - then it creates a relationship between the person node and venue node with the label of 'has an exhibition in" 
- for the event list which come with the item;
 - it iterates over those list and creates new nodes with 'event' label and only a 'url' attribute
 - then it creates a relationship between the person node and event node with the label of 'participated in" 
- for every node and relationship created by this item, it appends them in the local subgraph
- After going through all the data contained in the item and storing it as nodes or relationships in the subgraph it merges this subgraph with the Neo4j database

#### if it is a venue item
- it creates a node of that person with all the available data as attributes and 'venue' as a label and puts it in a local subgraph variable
- for the artist list which come with the item;
 - it iterates over those list and creates new nodes with 'person' label and only a 'url' attribute
 - then it creates a relationship between the person node and artist node with the label of 'has an exhibition in" 
- for the event list which come with the item;
 - it iterates over those list and creates new nodes with 'event' label and only a 'url' attribute
 - then it creates a relationship between the venue node and event node with the label of 'took place in" 
- for the connected_venue list which come with the item;
 - it iterates over those list and creates new nodes with 'venue' label and only a 'url' attribute
 - then it creates a relationship between the venue node and the new venue node with the label of 'is cooperating with" 
- for every node and relationship created by this item, it appends them in the local subgraph
- After going through all the data contained in the item and storing it as nodes or relationships in the subgraph it merges this subgraph with the Neo4j database

#### if it is a event item
- it creates a node of that event with all the available data as attributes and 'event' as a label and puts it in a local subgraph variable
- for the artist list which come with the item;
 - it iterates over those list and creates new nodes with 'person' label and only a 'url' attribute
 - then it creates a relationship between the person node and event node with the label of 'participated in" 
- for the venue which come with the item;
 - it creates a new node with 'venue' label and only a 'url' attribute
 - then it creates a relationship between the event node and venue node with the label of 'took place in" 
- for every node and relationship created by this item, it appends them in the local subgraph
- After going through all the data contained in the item and storing it as nodes or relationships in the subgraph it merges this subgraph with the Neo4j database


In [5]:
# pipelines.py
# defines pipeline elements

from py2neo import Node, Relationship
from py2neo import Graph, Schema


class ArtistPipeline(object):

    person_list = []
    venue_list = []
    event_list = []
    
    main_graph = Node('person') # main graph to append all the subgraphs, in order to show the neo4j functionality in jupyter notebooks

    def __init__(self):
        pass

    def process_item(self, item, spider):

        # transaction = self.graph.begin()
        # item_type = item.get('category')

        if item_type == 'person':
            person = Node(item.get('category'),
                          url=item.get('url'),
                          name=item.get('name'),
                          bio=item.get('bio'),
                          scrape_time=item.get('scrape_time'))
            person.__primarylabel__ = 'person'
            person.__primarykey__ = 'url'
            sub_graph = person

            for venues in item.get('venue_list'):
                venue = Node('venue', url=venues)
                venue.__primarylabel__ = 'venue'
                venue.__primarykey__ = 'url'
                sub_graph = sub_graph | venue
                sub_graph = sub_graph | Relationship(person, 'has an exhibition in', venue)

            for events in item.get('event_list'):
                event = Node('event', url=events)
                event.__primarylabel__ = 'event'
                event.__primarykey__ = 'url'
                sub_graph = sub_graph | event
                sub_graph = sub_graph | Relationship(person, 'participated in', event)

            # transaction.merge(sub_graph)
            self.main_graph = main_graph | sub_graph # main graph to append all the subgraphs, in order to show the neo4j functionality in jupyter notebooks
            print('person item')

        elif item_type == 'event':
            event = Node(item.get('category'),
                         url=item.get('url'),
                         name=item.get('name'),
                         start_date=item.get('url'),
                         end_date=item.get('url'),
                         press_release=item.get('url'),
                         scrape_time=item.get('scrape_time'))
            event.__primarylabel__ = 'event'
            event.__primarykey__ = 'url'
            sub_graph = event

            venue = Node('venue', url=item.get('venue'))
            venue.__primarylabel__ = 'venue'
            venue.__primarykey__ = 'url'

            sub_graph = sub_graph | venue
            sub_graph = sub_graph | Relationship(event, 'took place in', venue)

            for participants in item.get('participant_list'):
                participant = Node('person', url=participants)
                participant.__primarylabel__ = 'person'
                participant.__primarykey__ = 'url'
                sub_graph = sub_graph | participant
                sub_graph = sub_graph | Relationship(participant, 'participated in', event)

            # transaction.merge(sub_graph)
            self.main_graph = main_graph | sub_graph # main graph to append all the subgraphs, in order to show the neo4j functionality in jupyter notebooks
            print('event item')

        elif item_type == 'venue':
            venue = Node(item.get('category'),
                         url=item.get('url'),
                         name=item.get('name'),
                         city=item.get('city'),
                         latitude=item.get('latitude'),
                         longitude=item.get('longitude'),
                         street_address=item.get('street_address'),
                         website=item.get('website'),
                         email=item.get('email'),
                         scrape_time=item.get('scrape_time'))
            venue.__primarylabel__ = 'venue'
            venue.__primarykey__ = 'url'
            sub_graph = venue

            for venues in item.get('connected_venue_list'):
                connected_venue = Node('venue', url=venues)
                connected_venue.__primarylabel__ = 'venue'
                connected_venue.__primarykey__ = 'url'
                sub_graph = sub_graph | connected_venue
                sub_graph = sub_graph | Relationship(connected_venue, 'is cooperating with', venue)

            for events in item.get('event_list'):
                event = Node('event', url=events)
                event.__primarylabel__ = 'event'
                event.__primarykey__ = 'url'
                sub_graph = sub_graph | event
                sub_graph = sub_graph | Relationship(event, 'took place in', venue)

            for artists in item.get('artist_list'):
                artist = Node('person', url=artists)
                artist.__primarylabel__ = 'person'
                artist.__primarykey__ = 'url'
                sub_graph = sub_graph | artist
                sub_graph = sub_graph | Relationship(artist, 'has an exhibition in', venue)

            # transaction.merge(sub_graph)
            self.main_graph = main_graph | sub_graph # main graph to append all the subgraphs, in order to show the neo4j functionality in jupyter notebooks
            print('venue item')

        else:
            print('invalid item')

        # transaction.commit()
        return item

    def open_spider(self, spider):
        # self.graph = Graph("bolt://localhost:7687", auth=('neo4j', '123456'))
        # self.graph.schema.create_uniqueness_constraint("person", "url")
        # self.graph.schema.create_uniqueness_constraint("venue", "url")
        # self.graph.schema.create_uniqueness_constraint("event", "url")
        pass
        
    def close_spider(self, spider):
        pass


### This cell contains the middlewares.py file, which is necessary for the Crawler
#### It defines middlewares

In [6]:
# middlewares.py
# defines middlewares

from scrapy import signals


class ArtistSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn’t have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class ArtistDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

### This cell contains the items.py file, which is necessary for the Crawler
#### It defines item classes

There are 3 item classes, containing different data fields;
- PersonItem
 - category
 - url
 - name
 - bio
 - venue_list
 - event_list
 - scrape_time
- EventItem
 - category
 - url
 - name
 - venue
 - start_date
 - end_date
 - press_release
 - participant_list
 - scrape_time
- VenueItem
 - category
 - url
 - name
 - connected_venue_list
 - event_list
 - artist_list
 - city
 - latitude
 - longitude
 - street_address
 - website
 - email
 - scrape_time



In [7]:
# items.py
# defines item classes

import scrapy


class PersonItem(scrapy.Item):
    category = scrapy.Field()
    url = scrapy.Field()
    name = scrapy.Field()
    bio = scrapy.Field()
    venue_list = scrapy.Field()
    event_list = scrapy.Field()
    scrape_time = scrapy.Field()


class EventItem(scrapy.Item):
    category = scrapy.Field()
    url = scrapy.Field()
    name = scrapy.Field()
    venue = scrapy.Field()
    start_date = scrapy.Field()
    end_date = scrapy.Field()
    press_release = scrapy.Field()
    participant_list = scrapy.Field()
    scrape_time = scrapy.Field()


class VenueItem(scrapy.Item):
    category = scrapy.Field()
    url = scrapy.Field()
    name = scrapy.Field()
    connected_venue_list = scrapy.Field()
    event_list = scrapy.Field()
    artist_list = scrapy.Field()
    city = scrapy.Field()
    latitude = scrapy.Field()
    longitude = scrapy.Field()
    street_address = scrapy.Field()
    website = scrapy.Field()
    email = scrapy.Field()
    scrape_time = scrapy.Field()


### This cell contains the artist_spider.py file, which contains the spider definition
It contains the spider class which contains the parse and start_requests methods

#### ArtistSpider()
- Spider class definition
- contains all related methods
- contains spider name (necessary) and the domain allowed to scrape (not necessary)

#### ArtistSpider.start_requests()
- handles the initial request,
- contains the start url of the artist index and a pagination list, for all the letters
 - it appends the pagination list one by one and issues a request with the 'parse_initial' parser
 - example link https://kunstaspekte.art/artists-overview/a
- It issues for all the artist pages in the artist index

#### ArtistSpider.parse_initial()
- checks for all the links in the artist index
- issues requests for those links with the 'parse' parser

#### ArtistSpider.parse()
- checks if its a valid page
- decides what is the category of the page
- sends the response to the appropriate parser;
 - event_parse
 - person_parse
 - venue_ parse

#### ArtistSpider.event_parse(), ArtistSpider.venue_parse(), ArtistSpider.person_parse()
- creates an Item
- tries to scrape all appropriate data, if none leaves the field blank
- sends the item to the ItemPipeline

In [8]:
import logging

import scrapy
# from ..items import PersonItem, VenueItem, EventItem
from scrapy.crawler import CrawlerProcess
from bs4 import BeautifulSoup
import datetime

class ArtistSpider(scrapy.Spider):
    name = "artist"
    domain = 'https://kunstaspekte.art'
    
    scrape_all = 0

    def start_requests(self):
        
        # for demonstration purposes complete crawl is disabled, instead the spider will crawl just the url list below
        # if you want to try the complete crawl comment the urls variable and the for loop below and uncomment the 5 commented lines below. and change scrape_all variable above to 1
        
        
        
        
        urls = ['https://kunstaspekte.art/event/mediations-biennale-poznan-2010-event',
                'https://kunstaspekte.art/event/deuscthland-eine-ausstellung-von-jan-boehmermann-und-btf',
                'https://kunstaspekte.art/venue/biennial-of-graphic-arts-ljubljana-venue',
                'https://kunstaspekte.art/person/michel-blazy',
                'https://kunstaspekte.art/person/index-books-peter-gidal',
                'https://kunstaspekte.art/person/shannon-ebner'
                ]
        for url in urls:
            print(url)
            yield scrapy.Request(url=url, callback=self.parse)

        # page_list = {'0', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm',
        #              'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z'}
        # url = 'https://kunstaspekte.art/artists-overview/'
        # for char in page_list:
        #     yield scrapy.Request(url=url + char, callback=self.parse_initial)

    def parse_initial(self, response):

        raw = response.body
        soup = BeautifulSoup(raw, 'html.parser')
        url_list = soup.find_all('a')

        for urls in url_list:
            yield scrapy.Request(url=self.domain + urls['href'], callback=self.parse)
            print(self.domain + urls['href'])

    def parse(self, response):

        print(response)
        raw = response.body
        soup = BeautifulSoup(raw, 'html.parser')

        try:
            page_type = soup.find(class_='content-heading').find('h3').get_text(' ', strip=True)
            print(page_type)
        except AttributeError:
            page_type = 0
            print("An exception occurred")

        if page_type == 'artist / curator':
            yield from self.person_parse(response)
            print('person page')
        elif page_type == 'venue':
            yield from self.venue_parse(response)
            print('venue page')
        elif page_type == 'exhibition':
            print('exhibition page')
            yield from self.event_parse(response)
        else:
            print('non valid page')

    def event_parse(self, response):

        print('event scraper')
        event = EventItem()
        # category = scrapy.Field()
        # url = scrapy.Field()
        # name = scrapy.Field()
        # venue = scrapy.Field()
        # start_date = scrapy.Field()
        # end_date = scrapy.Field()
        # press_release = scrapy.Field()
        # participant_list = scrapy.Field()

        category = 'event'

        url = response.url

        raw = response.body
        soup = BeautifulSoup(raw, 'html.parser')

        name = ''
        try:
            name = soup.find('h1').get_text(' ', strip=True)
        except Exception as ex:
            print(ex)

        venue = ''
        try:
            venue = self.domain + soup.find(class_='venue-module').find('h3').find('a')['href']
        except Exception as ex:
            print(ex)

        start_date = ''
        try:
            start_date = soup.find(class_='begins').get_text(' ', strip=True)
        except Exception as ex:
            print(ex)

        end_date = ''
        try:
            end_date = soup.find(class_='ends').get_text(' ', strip=True)
        except Exception as ex:
            print(ex)

        press_release = ''
        try:
            text = soup.find(id='textblock').find_all('p')
            for i in text:
                press_release = press_release + i.get_text('\n', strip=True)
        except Exception as ex:
            print(ex)

        participant_list = []
        try:
            participants = soup.find_all(class_='artist-list')
            for i in participants:
                links = i.find_all('a')
                for j in links:
                    participant_list.append(self.domain + j['href'])
        except Exception as ex:
            print(ex)

        event['category'] = category
        event['url'] = url
        event['name'] = name
        event['venue'] = venue
        event['start_date'] = start_date
        event['end_date'] = end_date
        event['press_release'] = press_release
        event['participant_list'] = participant_list
        event['scrape_time'] = datetime.datetime.now()

        yield scrapy.Request(url=venue, callback=self.parse)
        
        if scrape_all:
            for url in participant_list:
                yield scrapy.Request(url=url, callback=self.parse)

        yield event

    def venue_parse(self, response):

        print('venue scraper')
        venue = VenueItem()
        # category = scrapy.Field()
        # url = scrapy.Field()
        # name = scrapy.Field()
        # connected_venue_list = scrapy.Field()
        # event_list = scrapy.Field()
        # artist_list = scrapy.Field()
        # city = scrapy.Field()
        # coordinates = scrapy.Field()
        # street_address = scrapy.Field()
        # website = scrapy.Field()
        # email = scrapy.Field()

        category = 'venue'

        url = response.url

        raw = response.body
        soup = BeautifulSoup(raw, 'html.parser')

        name = ''
        try:
            name = soup.find('h1').get_text(' ', strip=True)
        except Exception as ex:
            print(ex)

        connected_venue_list = []
        try:
            dependencies = soup.find(id='texts').find_all('a')
            for i in dependencies:
                connected_venue_list.append(self.domain + i['href'])
        except Exception as ex:
            print(ex)

        event_list = []
        try:
            exhibition = soup.find_all(class_='exhib-title')
            for links in exhibition:
                event_list.append(self.domain + links['href'])
        except Exception as ex:
            print(ex)

        artist_list = []
        try:
            artists = soup.find(class_='artist-list').find_all('a')
            for links in artists:
                artist_list.append(self.domain + links['href'])
        except Exception as ex:
            print(ex)

        city = ''
        coordinates = ['', '']
        street_address = ''
        website = ''
        mail = ''
        try:
            address = soup.find('div', class_='address')
            city = address.find('p').find('a').get_text(' ', strip=True)
            coordinates = address['data-latlon'].split(',')
            street_address = address.find('p').get_text(' ', strip=True)
            website = address.find(class_='website')['href']
            mail = address.find(class_='mail')['href'].split(':')[1]
        except Exception as ex:
            print(ex)

        venue['category'] = category
        venue['url'] = url
        venue['name'] = name
        venue['connected_venue_list'] = connected_venue_list
        venue['event_list'] = event_list
        venue['artist_list'] = artist_list
        venue['city'] = city
        venue['latitude'] = coordinates[0]
        venue['longitude'] = coordinates[1]
        venue['street_address'] = street_address
        venue['website'] = website
        venue['email'] = mail
        venue['scrape_time'] = datetime.datetime.now()
        
        if scrape_all:

            for url in connected_venue_list:
                yield scrapy.Request(url=url, callback=self.parse)

            for url in event_list:
                yield scrapy.Request(url=url, callback=self.parse)

            for url in artist_list:
                yield scrapy.Request(url=url, callback=self.parse)

        yield venue

    def person_parse(self, response):

        print('person scraper')
        person = PersonItem()
        # category = scrapy.Field()
        # url = scrapy.Field()
        # name = scrapy.Field()
        # bio = scrapy.Field()
        # venue_list = scrapy.Field()
        # event_list = scrapy.Field()

        category = 'person'

        url = response.url
        print(url)

        raw = response.body
        soup = BeautifulSoup(raw, 'html.parser')

        name = ''
        try:
            name = soup.find('h1').get_text(' ', strip=True)
        except Exception as ex:
            print(ex)

        bio = ''
        try:
            for i in soup.find_all('p'):
                bio = bio + ' ' + i.get_text(' ', strip=True)
        except Exception as ex:
            print(ex)

        venue_list = []
        try:
            collections = soup.find('div', class_='collections').find_all('a')
            for collection in collections:
                venue_list.append(self.domain + collection['href'])
        except Exception as ex:
            print(ex)

        try:
            galleries = soup.find('div', class_='galleries').find_all('a')
            for gallery in galleries:
                venue_list.append(self.domain + gallery['href'])
        except Exception as ex:
            print(ex)

        event_list = []
        try:
            events = soup.findAll(class_='exhib-title')
            for event in events:
                event_list.append(self.domain + event['href'])
        except Exception as ex:
            print(ex)

        person['category'] = category
        person['url'] = url
        person['name'] = name
        person['bio'] = bio
        person['venue_list'] = venue_list
        person['event_list'] = event_list
        person['scrape_time'] = datetime.datetime.now()
        
        if scrape_all:
            for url in venue_list:
                yield scrapy.Request(url=url, callback=self.parse)

            for url in event_list:
                yield scrapy.Request(url=url, callback=self.parse)

        yield person


### To try the spider out run the cell below

This is not the conventional way to run a spider. Normally, for full functionality you would run it from a terminal in the project directory with the following command

scrapy crawl your_spiders_name

In [9]:
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(ArtistSpider)
process.start()

2019-05-02 15:57:02 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: scrapybot)
2019-05-02 15:57:02 [scrapy.utils.log] INFO: Versions: lxml 4.3.3.0, libxml2 2.9.9, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 19.2.0, Python 3.6.8 |Anaconda, Inc.| (default, Feb 21 2019, 18:30:04) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1b  26 Feb 2019), cryptography 2.6.1, Platform Windows-10-10.0.17763-SP0
2019-05-02 15:57:02 [scrapy.crawler] INFO: Overridden settings: {'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}
2019-05-02 15:57:02 [scrapy.extensions.telnet] INFO: Telnet Password: 6ed766fc7019c22e
2019-05-02 15:57:02 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2019-05-02 15:57:02 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewa

<Deferred at 0x21da18a4198>

https://kunstaspekte.art/event/mediations-biennale-poznan-2010-event
https://kunstaspekte.art/event/deuscthland-eine-ausstellung-von-jan-boehmermann-und-btf
https://kunstaspekte.art/venue/biennial-of-graphic-arts-ljubljana-venue
https://kunstaspekte.art/person/michel-blazy
https://kunstaspekte.art/person/index-books-peter-gidal
https://kunstaspekte.art/person/shannon-ebner


2019-05-02 15:57:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://kunstaspekte.art/person/index-books-peter-gidal> (referer: None)
2019-05-02 15:57:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://kunstaspekte.art/person/shannon-ebner> (referer: None)
2019-05-02 15:57:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://kunstaspekte.art/venue/biennial-of-graphic-arts-ljubljana-venue> (referer: None)
2019-05-02 15:57:03 [scrapy.core.scraper] ERROR: Spider error processing <GET https://kunstaspekte.art/person/index-books-peter-gidal> (referer: None)
Traceback (most recent call last):
  File "C:\Users\Zibir Zibzan\Anaconda3\envs\Artist_Scraper_kunstaspekte\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
    yield next(it)
  File "C:\Users\Zibir Zibzan\Anaconda3\envs\Artist_Scraper_kunstaspekte\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
    for x in result:
  File "C:\Users\Zibir Zibzan\Anaconda3\envs

<200 https://kunstaspekte.art/person/index-books-peter-gidal>
artist / curator
person scraper
https://kunstaspekte.art/person/index-books-peter-gidal
'NoneType' object has no attribute 'find_all'
'NoneType' object has no attribute 'find_all'
<200 https://kunstaspekte.art/person/shannon-ebner>
artist / curator
person scraper
https://kunstaspekte.art/person/shannon-ebner
'NoneType' object has no attribute 'find_all'
<200 https://kunstaspekte.art/venue/biennial-of-graphic-arts-ljubljana-venue>
venue
venue scraper



2019-05-02 15:57:03 [scrapy.core.scraper] ERROR: Spider error processing <GET https://kunstaspekte.art/person/michel-blazy> (referer: None)
Traceback (most recent call last):
  File "C:\Users\Zibir Zibzan\Anaconda3\envs\Artist_Scraper_kunstaspekte\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
    yield next(it)
  File "C:\Users\Zibir Zibzan\Anaconda3\envs\Artist_Scraper_kunstaspekte\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
    for x in result:
  File "C:\Users\Zibir Zibzan\Anaconda3\envs\Artist_Scraper_kunstaspekte\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "C:\Users\Zibir Zibzan\Anaconda3\envs\Artist_Scraper_kunstaspekte\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "C:\Users\Zibir Zibzan\Anaconda3\envs\Artist_Scraper_kunstaspekte\

<200 https://kunstaspekte.art/person/michel-blazy>
artist / curator
person scraper
https://kunstaspekte.art/person/michel-blazy
<200 https://kunstaspekte.art/event/mediations-biennale-poznan-2010-event>
exhibition
exhibition page
event scraper
<200 https://kunstaspekte.art/event/deuscthland-eine-ausstellung-von-jan-boehmermann-und-btf>
exhibition
exhibition page
event scraper


2019-05-02 15:57:03 [scrapy.core.scraper] ERROR: Spider error processing <GET https://kunstaspekte.art/event/mediations-biennale-poznan-2010-event> (referer: None)
Traceback (most recent call last):
  File "C:\Users\Zibir Zibzan\Anaconda3\envs\Artist_Scraper_kunstaspekte\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
    yield next(it)
  File "C:\Users\Zibir Zibzan\Anaconda3\envs\Artist_Scraper_kunstaspekte\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
    for x in result:
  File "C:\Users\Zibir Zibzan\Anaconda3\envs\Artist_Scraper_kunstaspekte\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "C:\Users\Zibir Zibzan\Anaconda3\envs\Artist_Scraper_kunstaspekte\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "C:\Users\Zibir Zibzan\Anaconda3\envs\Artis

<200 https://kunstaspekte.art/venue/mediations-biennale-poznan-venue>
venue
venue scraper
<200 https://kunstaspekte.art/venue/nrw-forum>


2019-05-02 15:57:04 [scrapy.core.scraper] ERROR: Spider error processing <GET https://kunstaspekte.art/venue/nrw-forum> (referer: https://kunstaspekte.art/event/deuscthland-eine-ausstellung-von-jan-boehmermann-und-btf)
Traceback (most recent call last):
  File "C:\Users\Zibir Zibzan\Anaconda3\envs\Artist_Scraper_kunstaspekte\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
    yield next(it)
  File "C:\Users\Zibir Zibzan\Anaconda3\envs\Artist_Scraper_kunstaspekte\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
    for x in result:
  File "C:\Users\Zibir Zibzan\Anaconda3\envs\Artist_Scraper_kunstaspekte\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "C:\Users\Zibir Zibzan\Anaconda3\envs\Artist_Scraper_kunstaspekte\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter

venue
venue scraper


In [17]:
neo4jupyter.init_notebook_mode()

results = ArtistPipeline.main_graph
options = {"person": "name", "venue": "name", "event": "name"}

neo4jupyter.draw(results, options)

print(results)


<IPython.core.display.Javascript object>

AttributeError: 'Node' object has no attribute 'run'