## 4 How `ItemLoader` and `Item` work

---

In section 3 we extracted the values we want.

How do we pass them along and store them?

The answers are:

`ItemLoader` and 

`Item`

---



### Item

In [2]:
from scrapy import Item, Field
class SourceItem(Item):
    publication_title = Field()
    chief_editor = Field()
    issn = Field()
    description = Field()
    home_url = Field()
    coverimage = Field()
    title = Field()

In [3]:
# initialization
item = SourceItem()
isinstance(item, SourceItem)

True

In [4]:
# an Item behaves like a dictionary
item['issn'] = '1234'
item['coverimage'] = 'imageurl'
item

{'coverimage': 'imageurl', 'issn': '1234'}
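
Unlike a plain dictionary, a Scrapy `Item` only accepts the fields declared on its class; assigning to an undeclared key raises `KeyError`. The sketch below imitates that rule with a hypothetical, stdlib-only class `StrictRecord` (not Scrapy's implementation), so it runs without Scrapy installed.

```python
class StrictRecord(dict):
    """Hypothetical stand-in for an Item: a dict limited to declared keys."""

    fields = ('issn', 'coverimage')  # assumed field names, mirroring SourceItem

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f'{type(self).__name__} does not support field: {key}')
        super().__setitem__(key, value)


record = StrictRecord()
record['issn'] = '1234'          # declared field: accepted

try:
    record['publisher'] = 'x'    # undeclared field: rejected
    rejected = False
except KeyError:
    rejected = True

print(dict(record), rejected)
```

This is why the fields of `SourceItem` have to be planned up front: a misspelled key fails loudly instead of silently creating a new entry.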

### ItemLoader

---
`Item` becomes most useful when it is filled by an `ItemLoader`.

---

In [5]:
from scrapy.loader import ItemLoader

In [7]:
# we need headers to disguise our bot as a browser

headers = {
    "Connection": "keep-alive",
    "Cache-Control": "max-age=0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36",
    "Accept-Encoding": "gzip,deflate,sdch",
    "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.6,en;q=0.4,zh-TW;q=0.2",
}


import requests
from scrapy.http import TextResponse

In [9]:
r = requests.get('http://www.journals.elsevier.com/decision-support-systems/', 
                 headers = headers)

response = TextResponse(r.url, body = r.text, encoding = 'utf-8')

# there is a response we need to handle
response

<200 https://www.journals.elsevier.com/decision-support-systems/>

In [10]:
# Initialize the loader with an Item and a response.
# The Item (here SourceItem()) is the container the ItemLoader fills.
# The response is the raw material the ItemLoader extracts values from.
l = ItemLoader(item = SourceItem(), response = response)
type(l)

scrapy.loader.ItemLoader

---

`l`, the `ItemLoader` object, has many methods.

I am going to introduce three of them.

`l.get_xpath`

`l.add_xpath`

`l.add_value`

---

In [12]:
# here are the xpaths for the items
issn_xpath = '//*[@class="issn keyword"]/span/text()'
chief_editor_xpath = '//*[@id="Title"]//span[@class="nowrap"]/text()'
title_xpath = '//*[@id="Title"]//h1[@itemprop="name"]/text()'
description_xpath = '//*[@class="publication-description"]//p'
coverimage_xpath = '//*[@id="Title"]//img[@class="cover-img"]/@src'

In [13]:
# recall that l = ItemLoader(item = SourceItem(), response = response),
# so l has access to the response;
# l.get_xpath() does the same thing as response.xpath().extract()

response.xpath(issn_xpath).extract()

['0167-9236']

In [14]:
# you can read l.get_xpath() as "get the value via XPath".
l.get_xpath(issn_xpath)

['0167-9236']

In [15]:
# this shows what l.add_value() does,

# but first the same thing with a plain item.

# initialization, as in the Item section above
item = SourceItem()

# then store the issn in the SourceItem object: item
item['issn'] = response.xpath(issn_xpath).extract()
item

{'issn': ['0167-9236']}

In [16]:
# l.add_value can do the same thing
l.add_value('issn', l.get_xpath(issn_xpath))

# show the item.
# load_item() returns the item populated with the values collected so far
l.load_item()

{'issn': ['0167-9236']}

In [17]:
# you can add any value you like, but only to fields the item declares,
# so think carefully about which fields you define in the item.

l.add_value('home_url', response.url)
l.load_item()

{'home_url': ['https://www.journals.elsevier.com/decision-support-systems/'],
 'issn': ['0167-9236']}

In [18]:
# if you add another value to 'issn', the new value is appended to the existing list.
# for example:
newvalue = "next source's url"
l.add_value('issn', newvalue)
l.load_item()

{'home_url': ['https://www.journals.elsevier.com/decision-support-systems/'],
 'issn': ['0167-9236', "next source's url"]}

In [19]:
# so every time you start loading another item, create a fresh loader with a new SourceItem
# (all the values for the current source are collected in its own item;
# prepare a new item for the next source)

l = ItemLoader(item = SourceItem(), response = response)

In [20]:
# l.add_xpath()
# combines l.get_xpath() and l.add_value();
# read it as "add the value found via XPath to the item's field".

l.add_xpath('issn', issn_xpath)
l.load_item()

{'issn': ['0167-9236']}
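
The relationship between the three methods can be imitated without Scrapy. The sketch below is a hypothetical stand-in (`TinyLoader`) built on the standard library's `xml.etree.ElementTree`, whose XPath support is much more limited than Scrapy's selectors (for instance, it has no `text()` step). The markup string is invented for illustration.

```python
import xml.etree.ElementTree as ET


class TinyLoader:
    """Hypothetical stand-in for ItemLoader, using ElementTree's limited XPath."""

    def __init__(self, markup):
        self.root = ET.fromstring(markup)
        self.values = {}

    def get_xpath(self, path):
        # ElementTree has no text() step, so read .text from each matched element
        return [element.text for element in self.root.findall(path)]

    def add_value(self, field, value):
        # values accumulate in a list per field
        self.values.setdefault(field, []).extend(value)

    def add_xpath(self, field, path):
        # the two-step combination: look up via XPath, then add the result
        self.add_value(field, self.get_xpath(path))


markup = '<div><span class="issn">0167-9236</span></div>'  # made-up sample
loader = TinyLoader(markup)
loader.add_xpath('issn', './/span[@class="issn"]')
print(loader.values)
```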

In [21]:
# based on the same logic
l.add_xpath('chief_editor', chief_editor_xpath)
l.add_xpath('coverimage', coverimage_xpath)
l.add_xpath('description', description_xpath)
publication_title = l.get_xpath(title_xpath)
l.add_value('publication_title', publication_title)
l.add_value('home_url', response.url)
l.load_item()

{'chief_editor': ['James R. Marsden'],
 'coverimage': ['https://www.elsevier.com/__data/cover_img/505540.gif'],
 'description': ['<p>The common thread of articles published in <em>Decision '
                 'Support Systems</em> is their relevance to theoretical and '
                 'technical issues in the support of enhanced decision making. '
                 'The areas addressed may include foundations, functionality, '
                 'interfaces, implementation, impacts, and evaluation of '
                 'decision support systems...</p>',
                 '<p>The common thread of articles published in <em>Decision '
                 'Support Systems</em> is their relevance to theoretical and '
                 'technical issues in the support of enhanced decision making. '
                 'The areas addressed may include foundations, functionality, '
                 'interfaces, implementation, impacts, and evaluation of '
                 'decision support systems (DSSs

In [23]:
# let's see the item we got
a = l.load_item()
a['issn']
# the value is a list

# sometimes the returned list contains many elements

# what if we just want the first one?


['0167-9236']

#### Processor

---

In order to get clean data, we need another tool:

`Processor`

---

In [24]:
# in newer Scrapy releases these processors are importable from itemloaders.processors
from scrapy.loader.processors import Join, TakeFirst

In [25]:
print(type(TakeFirst))

# Join is a class, and so is TakeFirst
# instantiating one gives an object that can be called like a function

tf = TakeFirst()
print(type(tf))

<class 'type'>
<class 'scrapy.loader.processors.TakeFirst'>


In [27]:
# let's see how tf works 

a = l.get_xpath(issn_xpath)

a.append(a[0])
print(a)

# look at the result: tf returns the first non-empty element of the list.
tf(a)

['0167-9236', '0167-9236']


'0167-9236'
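
`TakeFirst` is documented as returning the first non-null, non-empty value it receives, so `None` entries and empty strings at the front of the list are skipped. A minimal stdlib-only sketch of that behaviour, using a hypothetical function `take_first` rather than Scrapy's class:

```python
def take_first(values):
    """Return the first value that is neither None nor an empty string."""
    for value in values:
        if value is not None and value != '':
            return value
    return None  # nothing usable was found


print(take_first(['0167-9236', '0167-9236']))
print(take_first(['', None, '0167-9236']))
print(take_first([]))
```

Returning `None` for an empty list, rather than raising `IndexError` as `values[0]` would, is what makes this safer than plain indexing when an XPath matches nothing.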

In [28]:
# let's see how Join works

join = Join()
print(a)

# join all the elements of the list into a single string.
join(a)

['0167-9236', '0167-9236']


'0167-9236 0167-9236'
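
`Join` uses a single space as its separator by default; a different separator can be given when instantiating it, for example `Join(', ')`. The idea can be sketched in plain Python with a hypothetical factory `make_join` (an illustration, not Scrapy's code):

```python
def make_join(separator=' '):
    """Build a callable that joins a list of strings with `separator`."""
    def join(values):
        return separator.join(values)
    return join


values = ['0167-9236', '0167-9236']
print(make_join()(values))       # default separator: one space
print(make_join(', ')(values))   # custom separator
```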

In [29]:
# in fact, processors can be passed straight to get_xpath()

# create a fresh loader
l = ItemLoader(item = SourceItem(), response = response)

# the logic: first, get a list of values via XPath,
# then pass that list to the callable join,
# and return its result
l.get_xpath(issn_xpath, join)



'0167-9236'
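
When several processors are passed, each one receives the result of the previous one. A minimal sketch of that chaining in plain Python follows; `apply_processors`, `join_with_space` and `strip_tags` are hypothetical helpers, and the fragments are made up.

```python
import re


def apply_processors(values, *processors):
    """Feed `values` through each processor in turn (hypothetical helper)."""
    result = values
    for processor in processors:
        result = processor(result)
    return result


def join_with_space(values):
    return ' '.join(values)


def strip_tags(text):
    return re.sub(r'<.*?>', '', text)


fragments = ['<p>first <em>part</em></p>', '<p>second part</p>']
print(apply_processors(fragments, join_with_space, strip_tags))
```

Order matters: `strip_tags` expects a string, so it has to come after the join, which turns the list into one.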

In [33]:
# we can also define our own function
# and apply it to the ItemLoader
import re

# this function is used to strip the html tags
def cleanhtml(raw_html):
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', raw_html)
    return cleantext
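
A quick check of `cleanhtml` on a short literal string (the sample markup is made up for illustration). Note that the regex removes tags only; character entities such as `&amp;` are left as they are, and the standard library's `html.unescape` is one way to decode them afterwards.

```python
import html
import re


def cleanhtml(raw_html):
    cleanr = re.compile('<.*?>')
    return re.sub(cleanr, '', raw_html)


sample = '<p>Decision &amp; <em>Support</em> Systems</p>'
stripped = cleanhtml(sample)
decoded = html.unescape(stripped)

print(stripped)   # tags removed, entity still encoded
print(decoded)    # entity decoded as well
```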

In [34]:
# before applying cleanhtml
l.get_xpath(description_xpath)

['<p>The common thread of articles published in <em>Decision Support Systems</em> is their relevance to theoretical and technical issues in the support of enhanced decision making. The areas addressed may include foundations, functionality, interfaces, implementation, impacts, and evaluation of decision support systems...</p>',
 '<p>The common thread of articles published in <em>Decision Support Systems</em> is their relevance to theoretical and technical issues in the support of enhanced decision making. The areas addressed may include foundations, functionality, interfaces, implementation, impacts, and evaluation of decision support systems (DSSs). Manuscripts may draw from diverse methods and methodologies, including those from decision theory, economics, econometrics, statistics, computer supported cooperative work, data base management, linguistics, management science, mathematical modeling, operations management, cognitive science, psychology, user interface management, and other

In [35]:
# after applying Join() and cleanhtml


l.get_xpath(description_xpath, Join(), cleanhtml)

# since join = Join(),
# this is equivalent to

# l.get_xpath(description_xpath, join, cleanhtml)


# the logic is:
# first, get a list of the selected values via XPath;
# then pass this list to Join() (join),
# which joins the elements of the list and returns a string;
# then pass that string to cleanhtml,
# which returns a new string without HTML tags.

"The common thread of articles published in Decision Support Systems is their relevance to theoretical and technical issues in the support of enhanced decision making. The areas addressed may include foundations, functionality, interfaces, implementation, impacts, and evaluation of decision support systems... The common thread of articles published in Decision Support Systems is their relevance to theoretical and technical issues in the support of enhanced decision making. The areas addressed may include foundations, functionality, interfaces, implementation, impacts, and evaluation of decision support systems (DSSs). Manuscripts may draw from diverse methods and methodologies, including those from decision theory, economics, econometrics, statistics, computer supported cooperative work, data base management, linguistics, management science, mathematical modeling, operations management, cognitive science, psychology, user interface management, and others. However, a manuscript focused 

----

Use 

`l.add_xpath`

`l.default_output_processor`

----

In [36]:
# get a new ItemLoader object: l
l = ItemLoader(item = SourceItem(), response = response)

# processors can also be passed to l.add_xpath
l.add_xpath('issn', issn_xpath, tf)

# however, the stored value is still a list:
# tf runs on the extracted values, and its result is stored back in the field's list
l.load_item()

{'issn': ['0167-9236']}

In [37]:
# to get the first element instead of a list here,
# set default_output_processor
from scrapy.loader import ItemLoader
from scrapy.loader.processors import Join, TakeFirst
l = ItemLoader(item = SourceItem(), response = response)
l.default_output_processor = TakeFirst()


l.add_xpath('issn', issn_xpath, TakeFirst())
l.load_item()


# compare this result with that of the previous cell.

{'issn': '0167-9236'}
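
Why a processor passed to `add_xpath` still leaves a list, while the output processor does not, can be pictured with a simplified model. `MiniLoader` and `take_first` below are hypothetical, stdlib-only stand-ins and leave out most of what the real `ItemLoader` does.

```python
def take_first(values):
    for value in values:
        if value is not None and value != '':
            return value


class MiniLoader:
    """Hypothetical, simplified model of how a loader stores and outputs values."""

    def __init__(self, output_processor=None):
        self.values = {}
        # default: hand back the collected list unchanged
        self.output_processor = output_processor or (lambda collected: collected)

    def add_value(self, field, value, *processors):
        # processors passed here run on the incoming value only
        for processor in processors:
            value = processor(value)
        # whatever comes out is stored in the field's list
        incoming = value if isinstance(value, list) else [value]
        self.values.setdefault(field, []).extend(incoming)

    def load_item(self):
        # the output processor runs once per field, on the whole collected list
        return {field: self.output_processor(collected)
                for field, collected in self.values.items()}


plain = MiniLoader()
plain.add_value('issn', ['0167-9236'], take_first)
print(plain.load_item())

with_output = MiniLoader(output_processor=take_first)
with_output.add_value('issn', ['0167-9236'])
print(with_output.load_item())
```

In this model the first loader still yields a one-element list, because the value picked by `take_first` is put back into the field's list; only the second loader, with an output processor, yields a bare string.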

In [38]:
# following the same logic, we can get a clean item here.

from scrapy.loader import ItemLoader
from scrapy.loader.processors import Join, TakeFirst
l = ItemLoader(item = SourceItem(), response = response)
l.default_output_processor = TakeFirst()

# issn
l.add_xpath('issn', issn_xpath)

# chief_editor
l.add_xpath('chief_editor', chief_editor_xpath)

# coverimage
l.add_xpath('coverimage', coverimage_xpath)

# description
# note Join() and cleanhtml here;
# see the cells above.
l.add_xpath('description', description_xpath, Join(), cleanhtml)

# publication_title
publication_title = l.get_xpath(title_xpath)
l.add_value('publication_title', publication_title)

# home_url
l.add_value('home_url', response.url)
l.load_item()


# relatively cleaner now!

{'chief_editor': 'James R. Marsden',
 'coverimage': 'https://www.elsevier.com/__data/cover_img/505540.gif',
 'description': 'The common thread of articles published in Decision Support '
                'Systems is their relevance to theoretical and technical '
                'issues in the support of enhanced decision making. The areas '
                'addressed may include foundations, functionality, interfaces, '
                'implementation, impacts, and evaluation of decision support '
                'systems... The common thread of articles published in '
                'Decision Support Systems is their relevance to theoretical '
                'and technical issues in the support of enhanced decision '
                'making. The areas addressed may include foundations, '
                'functionality, interfaces, implementation, impacts, and '
                'evaluation of decision support systems (DSSs). Manuscripts '
                'may draw from diverse methods