Use Item Loaders #19

matiskay · 2015-08-11T23:21:55Z

If we use Item Loaders for the items, we can reduce the amount of code needed in each spider and take advantage of their data-cleaning functions.

Item Loaders provide a convenient mechanism for populating scraped Items. 
Even though Items can be populated using their own dictionary-like API,
Item Loaders provide a much more convenient API for populating them 
from a scraping process, by automating some common tasks like parsing 
the raw extracted data before assigning it.

Item Loaders: http://doc.scrapy.org/en/latest/topics/loaders.html
ScrapyLib: https://github.com/scrapinghub/scrapylib
processors: https://github.com/scrapinghub/scrapylib/blob/master/scrapylib/processors/__init__.py
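
As a rough illustration of what a loader's processors do, here is a minimal pure-Python sketch. `map_compose`, `take_first` and `collapse_whitespace` are hypothetical stand-ins written for this example; they imitate the general behaviour of Scrapy's `MapCompose` and `TakeFirst` processors in simplified form and are not the Scrapy implementations. The sample name is made up.

```python
import re


def map_compose(*functions):
    """Return a processor that applies each function to every value in turn."""
    def processor(values):
        for function in functions:
            values = [function(value) for value in values]
            # Drop values that a function rejected by returning None.
            values = [value for value in values if value is not None]
        return values
    return processor


def take_first(values):
    """Return the first non-empty value, or None when there is none."""
    for value in values:
        if value is not None and value != '':
            return value
    return None


def collapse_whitespace(text):
    """Collapse runs of whitespace into one space and trim both ends."""
    return re.sub(r'\s+', ' ', text).strip()


# Raw values as a selector might return them (made-up sample data).
raw = ['  Juan \n  Perez  ', 'second match']
clean_name = take_first(map_compose(collapse_whitespace)(raw))
print(clean_name)  # Juan Perez
```

With real Item Loaders the same idea would be declared once per field as input and output processors, so the spider only has to say which XPath feeds which field.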

matiskay · 2015-08-14T14:36:21Z

Document the data-cleaning process. To do this, survey all the cleaning tasks performed in the spiders and in the pipelines.

matiskay · 2015-08-14T16:09:42Z

Pipelines

Trimming and whitespace cleaning.

        for k, v in item.items():
            # basestring is Python 2; use str on Python 3.
            if isinstance(v, basestring):
                # Collapse runs of whitespace, then trim both ends.
                item[k] = re.sub(r'\s+', ' ', v).strip()

The date can be a date object or a string formatted as %Y-%m-%d.

        try:
            item['date'] = datetime.date.strftime(item['date'], '%Y-%m-%d')
        except TypeError:
            # TypeError means the date is already a string; keep it as is
            pass
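
The try/except above relies on `strftime` raising `TypeError` for strings. A more explicit variant could look like the sketch below; `normalize_date` is a hypothetical helper name, and validating string input with `strptime` is an addition that the original pipeline does not do.

```python
import datetime


def normalize_date(value):
    """Return value as a 'YYYY-MM-DD' string.

    Accepts date/datetime objects or strings already in that format.
    Raises ValueError for strings that do not parse as a date.
    """
    if isinstance(value, datetime.date):  # also true for datetime.datetime
        return value.strftime('%Y-%m-%d')
    # Validate the string instead of assuming it is well formed.
    datetime.datetime.strptime(value, '%Y-%m-%d')
    return value


print(normalize_date(datetime.date(2015, 8, 13)))  # 2015-08-13
print(normalize_date('2015-08-13'))                # 2015-08-13
```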

Set the other item keys to the empty string ''.
Drop items without full_name.
If date_start contains the text HORA DE, drop the item.
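
These three rules could be sketched as a single validation step. Everything here is an assumption made for illustration: `validate_item` is a hypothetical name, `OPTIONAL_FIELDS` is an invented field list (the real item fields are not listed in this thread), the first rule is read as "fill missing keys with an empty string", and the local `DropItem` class only stands in for `scrapy.exceptions.DropItem` so the sketch runs without Scrapy.

```python
class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem."""


# Hypothetical field list, for illustration only.
OPTIONAL_FIELDS = ('host_name', 'office')


def validate_item(item):
    """Apply the drop rules, then fill missing optional fields with ''."""
    if not item.get('full_name'):
        raise DropItem('missing full_name')
    if 'HORA DE' in (item.get('date_start') or ''):
        raise DropItem("date_start contains 'HORA DE'")
    for field in OPTIONAL_FIELDS:
        if item.get(field) is None:
            item[field] = ''
    return item
```

In a real pipeline the body of `validate_item` would live in `process_item(self, item, spider)`.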

Spiders

All the spiders take the first element of the XPath result.

item['host_name'] = data[7].xpath('./span/text()').extract()[0].strip()

item['full_name'] = fields[1].xpath('text()').extract_first().strip()

Some spiders also strip the item data.

item['host_name'] = data[7].xpath('./span/text()').extract()[0].strip()

item['full_name'] = fields[1].xpath('text()').extract_first().strip()
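
Both patterns fail when the XPath matches nothing: `.extract()[0]` raises `IndexError`, and `.extract_first()` returns `None`, so the chained `.strip()` raises `AttributeError`. A small helper along these lines would avoid that; `first_stripped` is a hypothetical name, and it works on the plain list that `.extract()` returns.

```python
def first_stripped(values, default=''):
    """Return the first extracted string, stripped, or default if none matched."""
    for value in values:
        if value is not None:
            return value.strip()
    return default


print(first_stripped(['  some text \n']))  # some text
print(repr(first_stripped([])))            # ''
```

It would be called as `first_stripped(fields[1].xpath('text()').extract())`. With Item Loaders the same effect comes from the field's processors, so the spider does not need the helper at all.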

matiskay · 2015-08-14T16:39:14Z

                item = make_hash(item)

It can be used in its own pipeline.
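
A dedicated pipeline could be as small as the sketch below. The body of `make_hash` is not shown in this thread, so the version here is a hypothetical stand-in (the fields it hashes and the `sha1` key are assumptions); only the `process_item(self, item, spider)` signature follows Scrapy's pipeline interface.

```python
import hashlib


def make_hash(item):
    """Hypothetical stand-in: store a SHA-1 of two fields under 'sha1'."""
    key = '{}|{}'.format(item.get('full_name', ''), item.get('date', ''))
    item['sha1'] = hashlib.sha1(key.encode('utf-8')).hexdigest()
    return item


class HashPipeline(object):
    """Pipeline whose only job is to attach the hash to each item."""

    def process_item(self, item, spider):
        return make_hash(item)
```

It would be registered in the `ITEM_PIPELINES` setting with an order number that places it relative to the cleaning pipeline, which makes the ordering of cleaning and hashing explicit.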

matiskay · 2015-08-14T20:19:13Z

@aniversarioperu, the first spiders using item loaders are in pull request #26. For now the other spiders can stay on the old method until we find a way to do tests and data validation in the spiders (#24).

aniversarioperu · 2015-08-17T19:34:10Z

Nice, using item loaders is what the Scrapy folks recommend.

aniversarioperu · 2015-08-19T09:23:58Z

But now we have the problem that duplicates are being generated. Look:
http://manolo.rocks/search/?q=47344647

Some field must have changed so that the hash now comes out different, and after scraping the same records twice both copies were saved in the database. That did not happen before.

I think the problem is here: https://github.com/aniversarioperu/manolo_scraper/blob/master/manolo_scraper/manolo_scraper/spiders/congreso.py#L67

The date must be a string in YYYY-MM-DD format.

matiskay · 2015-08-19T12:55:04Z

I'll look into what the problem might be.

matiskay · 2015-08-19T15:49:04Z

@aniversarioperu, the problem is that the item loaders transform the data before it is hashed. Previously, the hash was computed first and the transformations were applied afterwards. We could rescrape everything and have a clean database from the start, which is what I had in mind for an upcoming change to Manolo. We need to keep working on the changes and then rescrape everything so that the data is consistent and the hashes are what they should be.
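
A minimal sketch of why the order matters, assuming the hash is a digest of the field text. `digest` is a hypothetical helper and the sample value is made up; the real `make_hash` may combine several fields.

```python
import hashlib
import re


def digest(text):
    """Return the SHA-1 hex digest of text."""
    return hashlib.sha1(text.encode('utf-8')).hexdigest()


raw = '  Juan   Perez '
cleaned = re.sub(r'\s+', ' ', raw).strip()

hash_before_cleaning = digest(raw)     # old order: hash, then clean
hash_after_cleaning = digest(cleaned)  # new order: clean, then hash

# Same record, two different hashes, so it is stored twice.
print(hash_before_cleaning != hash_after_cleaning)  # True
```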

matiskay · 2015-08-21T13:51:03Z

@aniversarioperu, we are using Item Loaders everywhere now :D.

Note: TCSpider doesn't count as a spider.

matiskay · 2015-08-21T13:51:31Z

closed #19

aniversarioperu mentioned this issue Aug 19, 2015

delete and rescrape Congreso from 2015-08-13 #37
