# Learn About Scrapy Project Structure

### 1 Start a Project

In the previous tutorials, we explored Scrapy's objects and methods.

From now on, we will explore the structure of a Scrapy project, so that we can run it from the terminal and yield the items we want from the websites.

**Open your terminal and enter the following commands to start a scrapy project**

---

```bash
$ scrapy startproject fintime50
$ cd fintime50
$ tree
```
---


![](startfintime50.png)

## Files inside the Project

Scrapy has already generated several files in this project package.

Here is a brief introduction to each of them.

## 1  `scrapy.cfg`

**Configuration file**

It holds the project-level configuration.

You do not need to study or edit this file, since `scrapy` generates it for you.
    
    
**File contents**:

---

```cfg
# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.org/en/latest/deploy.html

[settings]
default = fintime50.settings

[deploy]
#url = http://localhost:6800/
project = fintime50

```

---
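As a small illustration of what this file tells Scrapy, the sketch below parses the same contents with Python's standard `configparser` module. The file text is inlined as a string so the example runs on its own; this is only a way to inspect the file, not how Scrapy itself must be used.

---

```python
from configparser import ConfigParser

# The contents of scrapy.cfg shown above, inlined so the sketch is self-contained.
SCRAPY_CFG = """
[settings]
default = fintime50.settings

[deploy]
#url = http://localhost:6800/
project = fintime50
"""

parser = ConfigParser()
parser.read_string(SCRAPY_CFG)

# The [settings] section names the Python module that holds the project settings.
settings_module = parser.get("settings", "default")
# The [deploy] section is meant for deployment with scrapyd (see the URL in the file).
deploy_project = parser.get("deploy", "project")

print(settings_module)
print(deploy_project)
```

---

The `default` entry is why the settings described later live in `fintime50/settings.py`.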







**Inside the `fintime50` folder**

You will find four Python files:

`__init__.py`

`settings.py`

`items.py`

`pipelines.py`

and a folder named `spiders`.

---

### 0 `__init__.py`

This Python file is empty.

Its presence marks the `fintime50` folder as a Python package, so that the modules inside it can be imported as `fintime50.settings`, `fintime50.items` and so on.

---

## 2 `settings.py`


`settings.py` contains the settings that control the behavior of the spiders in this project.

Open this file and you will find many lines, most of which are commented out with `#`.

So the lines that actually take effect are:

---

```python

BOT_NAME = 'fintime50'

SPIDER_MODULES = ['fintime50.spiders']
NEWSPIDER_MODULE = 'fintime50.spiders'


ROBOTSTXT_OBEY = True

```

---


There are some settings you need to modify so that your spider runs well.


These include:

`ROBOTSTXT_OBEY`: set it to `False` so that the spider does not filter its requests according to each site's `robots.txt`.

`DEFAULT_REQUEST_HEADERS`: give the spider browser-like headers, so that its requests look like those of a regular browser.

`ITEM_PIPELINES`: registers the pipelines that carry the yielded items to their destination. More on this in the `pipelines.py` section.



So the modified `settings.py` is:

---

```python
BOT_NAME = 'fintime50'

SPIDER_MODULES = ['fintime50.spiders']
NEWSPIDER_MODULE = 'fintime50.spiders'

ROBOTSTXT_OBEY = False

DEFAULT_REQUEST_HEADERS = {
    "Connection": "keep-alive",
    "Cache-Control": "max-age=0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36",
    "Accept-Encoding": "gzip,deflate,sdch",
    "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.6,en;q=0.4,zh-TW;q=0.2",
}

ITEM_PIPELINES = {
   'fintime50.pipelines.Fintime50Pipeline': 300,
}
```

---
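The number `300` in `ITEM_PIPELINES` is a priority: when several pipelines are enabled, items pass through them from the lowest number to the highest. The sketch below only illustrates that ordering with a plain `sorted` call. It is not Scrapy's internal code, and `CleanupPipeline` is a hypothetical name invented for this example.

---

```python
# Hypothetical setting: CleanupPipeline is an invented name used only for this sketch.
ITEM_PIPELINES = {
    "fintime50.pipelines.Fintime50Pipeline": 300,
    "fintime50.pipelines.CleanupPipeline": 100,
}

# Items pass through the pipelines from the lowest number to the highest.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)

for path in run_order:
    print(ITEM_PIPELINES[path], path)
```

---

With these values, an item would reach `CleanupPipeline` first and `Fintime50Pipeline` second.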

## 3 `items.py`

When you open `items.py` file, you will find:

---


```python
import scrapy


class Fintime50Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass
```

---


Essentially, there is nothing here yet.


This file holds the `Item` classes you define, which act as containers for the data you scrape from the Internet.



In fact, `Item` was already introduced in the fourth tutorial, `ItemLoader`.

For example:

---

```python
from scrapy import Item, Field


class SourceItem(Item):
    publication_title = Field()
    chief_editor = Field()
    issn = Field()
    description = Field()
    home_url = Field()
    coverimage = Field()
    title = Field()


# initialization
item = SourceItem()
isinstance(item, SourceItem)
# True

# an item behaves like a dictionary
item['issn'] = '1234'
item['coverimage'] = 'imageurl'
item
# {'coverimage': 'imageurl', 'issn': '1234'}
```

---

Every item class for the project is defined in this Python file.

For our project, we need items to hold `sources`, `authors`, `documents` and `keywords`.

So the file ends up like this.

---

```python
from scrapy import Item, Field


class DocumentItem(Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    abstract = Field()

    publication_date = Field()
    submission_date = Field()
    online_date = Field()
    revision_date = Field()
    accepted_date = Field()

    title = Field()
    coverpage_url = Field()
    fpage = Field()
    lpage = Field()
    pages = Field()
    submission_path = Field()

    publication_title = Field()


class KeywordItem(Item):
    keyword = Field()

    title = Field()


class SourceItem(Item):
    publication_title = Field()
    chief_editor = Field()
    issn = Field()
    description = Field()
    home_url = Field()
    coverimage = Field()

    title = Field()

class AuthorItem(Item):
    institution = Field()
    email = Field()
    avatar = Field()
    vitae = Field()
    fname = Field()
    lname = Field()
    address = Field()

    title = Field()
```

---

There is a trick here: both `AuthorItem` and `KeywordItem` contain the field `title`, just like `DocumentItem`.

This shared field identifies the relationship between authors, keywords and documents.

The pipelines use it when storing the items in the database.
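To make the role of the shared `title` field concrete, here is a minimal sketch that links authors and keywords to a document by title. It makes two assumptions: plain dictionaries stand in for the item classes so the example runs without Scrapy, and all values are invented sample data. `group_by_title` is a hypothetical helper, not part of the project.

---

```python
from collections import defaultdict

# Plain dicts stand in for DocumentItem, AuthorItem and KeywordItem;
# every value below is invented sample data.
document = {"title": "Sample Paper", "abstract": "A short abstract."}
authors = [
    {"title": "Sample Paper", "fname": "Ada", "lname": "Lovelace"},
    {"title": "Another Paper", "fname": "Alan", "lname": "Turing"},
]
keywords = [
    {"title": "Sample Paper", "keyword": "scraping"},
    {"title": "Sample Paper", "keyword": "python"},
]


def group_by_title(items):
    """Hypothetical helper: collect items under the title they refer to."""
    grouped = defaultdict(list)
    for item in items:
        grouped[item["title"]].append(item)
    return grouped


authors_by_title = group_by_title(authors)
keywords_by_title = group_by_title(keywords)

# Look up everything that belongs to one document through its title.
doc_authors = [a["lname"] for a in authors_by_title[document["title"]]]
doc_keywords = [k["keyword"] for k in keywords_by_title[document["title"]]]

print(doc_authors)
print(doc_keywords)
```

---

In a database, the same idea would typically appear as a key on the title (or a document id derived from it) that joins the author and keyword tables to the document table.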

## 4 `pipelines.py`

When opening the `pipelines.py`, you will find:

---

```python
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

class Fintime50Pipeline(object):
    def process_item(self, item, spider):
        return item
```

---

When items are scraped, that is, yielded from a spider, each of them is passed to the pipeline's `process_item` method.

The items then show up in the terminal output.

For now, this code is enough.

Writing the items into a database is also done at this step.

That, of course, requires more code, as well as a database.
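As a rough idea of what that extra code could look like, here is a minimal sketch of a pipeline that stores item titles in SQLite. The class name `SQLitePipeline`, the `documents` table and the in-memory database are assumptions made for illustration; the project's real pipeline and schema may differ. A pipeline is a plain Python class, so the sketch can be driven by hand without starting a crawl.

---

```python
import sqlite3


class SQLitePipeline:
    """Hypothetical pipeline that stores the `title` of each item in SQLite."""

    def __init__(self, db_path=":memory:"):
        # An in-memory database keeps the sketch self-contained.
        self.db_path = db_path
        self.connection = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the database, create the table.
        self.connection = sqlite3.connect(self.db_path)
        self.connection.execute(
            "CREATE TABLE IF NOT EXISTS documents (title TEXT PRIMARY KEY)"
        )

    def process_item(self, item, spider):
        # Called for every yielded item; .get() works for dicts and Scrapy items.
        self.connection.execute(
            "INSERT OR IGNORE INTO documents (title) VALUES (?)",
            (item.get("title"),),
        )
        self.connection.commit()
        return item  # hand the item on to the next pipeline

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.connection.close()


# Drive the pipeline by hand, outside Scrapy, with a plain dict as the item.
pipeline = SQLitePipeline()
pipeline.open_spider(spider=None)
pipeline.process_item({"title": "Sample Paper"}, spider=None)
pipeline.process_item({"title": "Sample Paper"}, spider=None)  # duplicate is ignored
rows = pipeline.connection.execute("SELECT title FROM documents").fetchall()
print(rows)
pipeline.close_spider(spider=None)
```

---

To use a class like this in the project, it would go into `pipelines.py` and be registered in `ITEM_PIPELINES`, with a file path instead of `:memory:` so that the data survives after the crawl.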



