### Using the Python Scrapy module to design a web scraping program that collects the content of the following website

https://webscraper.io/test-sites/e-commerce/allinone/computers/tablets

This is a website for testing web scrapers. The given URL links to a computer e-commerce store that sells different models of tablets.

The task is to collect information on all the tablets listed on the webpage.

The information to be collected is the product, description, price, and review for each tablet.
You are supposed to write your code using the focused spider class (scrapy.Spider) in Python.
The scraped data must be stored in a JSON-format file, as shown below:
```
{"price": ["$603.99"], "description": ["Wi-Fi, 64GB, Silver"], "product": "Apple iPad Air", "review": "7 reviews"}
{"price": ["$172.99"], "description": ["Silver, 7\" IPS, Quad-Core 1.2Ghz, 16GB, 3G, Android 4.2"], "product": "IdeaTab S5000", "review": "8 reviews"}
{"price": ["$148.99"], "description": ["Blue, 7\" IPS, Quad-Core 1.3GHz, 8GB, 3G, Android 4.2"], "product": "IdeaTab A3500-H", "review": "9 reviews"}
{"price": ["$233.99"], "description": ["LTE (SM-T235), Quad-Core 1.2GHz, 8GB, Black"], "product": "Galaxy Tab 4", "review": "1 reviews"}
{"price": ["$399.99"], "description": ["10.1\", 3G, Android 4.0, Garnet Red"], "product": "Galaxy Note", "review": "12 reviews"}
```
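
The target layout is one JSON object per line rather than a single JSON array. A minimal sketch of writing and reading that layout with the standard `json` module follows. The `records` list and the in-memory buffer are illustrative assumptions that stand in for the scraped items and for the output file. Note that `json.dumps` writes the inch mark inside a description as `\"`, which a strict JSON parser requires.

```python
import io
import json

# Illustrative records shaped like the scraped items (values taken from the sample).
records = [
    {"price": ["$603.99"], "description": ["Wi-Fi, 64GB, Silver"],
     "product": "Apple iPad Air", "review": "7 reviews"},
    {"price": ["$399.99"], "description": ['10.1", 3G, Android 4.0, Garnet Red'],
     "product": "Galaxy Note", "review": "12 reviews"},
]

# Write one JSON object per line (the buffer stands in for the output file).
buffer = io.StringIO()
for record in records:
    buffer.write(json.dumps(record) + "\n")

# Read the lines back; each line parses independently of the others.
buffer.seek(0)
loaded = [json.loads(line) for line in buffer if line.strip()]

print(buffer.getvalue().splitlines()[1])  # the inch mark is written as \"
print(loaded == records)
```

Because every line is a complete object, the file can be appended to item by item and read back without loading it as one large document.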


#### Create a Python function to search for Android or Apple tablet information.

Function name: searchbybrand(string)

Argument: string -> android or apple

Return result: list of all matched items (with product, description, price, review)


In [1]:
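import json

class JsonPipeline:
    """Write each scraped item to result.json as one JSON object per line."""

    def open_spider(self, spider):
        # Called once when the spider starts: open the output file.
        self.file = open('result.json', 'w')

    def close_spider(self, spider):
        # Called once when the spider finishes: report and release the file.
        print('JSON File Generated')
        self.file.close()

    def process_item(self, item, spider):
        display(item)  # IPython/Jupyter built-in: echo the item in the notebook
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item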
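
As a possible alternative to a hand-written pipeline, Scrapy 2.1 and later can write a line-per-item file through its feed exports. The sketch below only builds a settings dictionary. The name `FEED_SETTINGS` is a hypothetical placeholder, and using it as a spider's `custom_settings` in place of the `ITEM_PIPELINES` entry is an assumption about how one might wire it up, not what this notebook does.

```python
import logging

# Hypothetical settings fragment: FEEDS maps an output path to its export options.
FEED_SETTINGS = {
    'LOG_LEVEL': logging.WARNING,
    'FEEDS': {
        'result.json': {'format': 'jsonlines'},  # one JSON object per line
    },
}
```

Depending on the Scrapy version, feed exports may append to an existing local file, so it is safer to remove an old result.json before re-running the crawl.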

In [2]:
import logging
import scrapy
from scrapy.crawler import CrawlerProcess

class InfoSpider(scrapy.Spider):
    name = "info"
    start_urls = [
        'https://webscraper.io/test-sites/e-commerce/allinone/computers/tablets'
    ]
    custom_settings = {
        'LOG_LEVEL': logging.WARNING,                   # Scrapy's default is DEBUG
        'ITEM_PIPELINES': {'__main__.JsonPipeline': 1}  # Send every item through JsonPipeline
    }

    def parse(self, response):
        # Each product card on the page is one of these grid cells.
        for itm in response.css("div[class='col-sm-4 col-lg-4 col-md-4']"):
            yield {
                'price': itm.css("h4[class='pull-right price']::text").getall(),
                'description': itm.css("p[class='description']::text").getall(),
                'product': itm.css('h4 a::attr(title)').get(),
                'review': itm.css("div.ratings p[class='pull-right']::text").get(),
            }


In [3]:
info_crawler_process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

info_crawler_process.crawl(InfoSpider)
info_crawler_process.start()

2020-06-29 12:39:45 [scrapy.utils.log] INFO: Scrapy 2.1.0 started (bot: scrapybot)
2020-06-29 12:39:45 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.9, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.7.7 (default, Mar 26 2020, 10:32:53) - [Clang 4.0.1 (tags/RELEASE_401/final)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1f  31 Mar 2020), cryptography 2.8, Platform Darwin-19.5.0-x86_64-i386-64bit
2020-06-29 12:39:45 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-06-29 12:39:45 [scrapy.crawler] INFO: Overridden settings:
{'LOG_LEVEL': 30,
 'USER_AGENT': 'Mozilla/5.0 (compatible; MSIE 7.0; Windows NT 5.1)'}


{'price': ['$69.99'],
 'description': ['7" screen, Android'],
 'product': 'Lenovo IdeaTab',
 'review': '7 reviews'}

{'price': ['$88.99'],
 'description': ['Black, 7" IPS, Quad-Core 1.2GHz, 8GB, Android 4.2'],
 'product': 'IdeaTab A3500L',
 'review': '7 reviews'}

{'price': ['$96.99'],
 'description': ['7" screen, Android, 16GB'],
 'product': 'Acer Iconia',
 'review': '7 reviews'}

{'price': ['$97.99'],
 'description': ['7", 8GB, Wi-Fi, Android 4.2, White'],
 'product': 'Galaxy Tab 3',
 'review': '2 reviews'}

{'price': ['$99.99'],
 'description': ['Black, 7", 1.6GHz Dual-Core, 8GB, Android 4.4'],
 'product': 'Iconia B1-730HD',
 'review': '1 reviews'}

{'price': ['$101.99'],
 'description': ['IPS, Dual-Core 1.2GHz, 8GB, Android 4.3'],
 'product': 'Memo Pad HD 7',
 'review': '10 reviews'}

{'price': ['$102.99'],
 'description': ['7" screen, Android, 8GB'],
 'product': 'Asus MeMO Pad',
 'review': '14 reviews'}

{'price': ['$103.99'],
 'description': ['6" screen, wifi'],
 'product': 'Amazon Kindle',
 'review': '3 reviews'}

{'price': ['$107.99'],
 'description': ['7", 8GB, Wi-Fi, Android 4.2, Yellow'],
 'product': 'Galaxy Tab 3',
 'review': '14 reviews'}

{'price': ['$121.99'],
 'description': ['Blue, 8" IPS, Quad-Core 1.3GHz, 16GB, Android 4.2'],
 'product': 'IdeaTab A8-50',
 'review': '13 reviews'}

{'price': ['$130.99'],
 'description': ['White, 7", Atom 1.2GHz, 8GB, Android 4.4'],
 'product': 'MeMO Pad 7',
 'review': '11 reviews'}

{'price': ['$148.99'],
 'description': ['Blue, 7" IPS, Quad-Core 1.3GHz, 8GB, 3G, Android 4.2'],
 'product': 'IdeaTab A3500-H',
 'review': '9 reviews'}

{'price': ['$172.99'],
 'description': ['Silver, 7" IPS, Quad-Core 1.2Ghz, 16GB, 3G, Android 4.2'],
 'product': 'IdeaTab S5000',
 'review': '8 reviews'}

{'price': ['$233.99'],
 'description': ['LTE (SM-T235), Quad-Core 1.2GHz, 8GB, Black'],
 'product': 'Galaxy Tab 4',
 'review': '1 reviews'}

{'price': ['$251.99'],
 'description': ['16GB, White'],
 'product': 'Galaxy Tab',
 'review': '14 reviews'}

{'price': ['$320.99'],
 'description': ['White, 10.1" IPS, 1.6GHz, 2GB, 16GB, Android 4.2'],
 'product': 'MeMo PAD FHD 10',
 'review': '7 reviews'}

{'price': ['$399.99'],
 'description': ['10.1", 3G, Android 4.0, Garnet Red'],
 'product': 'Galaxy Note',
 'review': '12 reviews'}

{'price': ['$489.99'],
 'description': ['12.2", 32GB, WiFi, Android 4.4, White'],
 'product': 'Galaxy Note',
 'review': '9 reviews'}

{'price': ['$537.99'],
 'description': ['Wi-Fi + Cellular, 32GB, Silver'],
 'product': 'iPad Mini Retina',
 'review': '8 reviews'}

{'price': ['$587.99'],
 'description': ['10.1", 32GB, Black'],
 'product': 'Galaxy Note 10.1',
 'review': '6 reviews'}

{'price': ['$603.99'],
 'description': ['Wi-Fi, 64GB, Silver'],
 'product': 'Apple iPad Air',
 'review': '7 reviews'}

JSON File Generated


In [4]:
import json

with open('result.json') as f:
    # One JSON object per line; the with block closes the file automatically.
    data = [json.loads(line) for line in f]
print(len(data))
data

21


[{'price': ['$69.99'],
  'description': ['7" screen, Android'],
  'product': 'Lenovo IdeaTab',
  'review': '7 reviews'},
 {'price': ['$88.99'],
  'description': ['Black, 7" IPS, Quad-Core 1.2GHz, 8GB, Android 4.2'],
  'product': 'IdeaTab A3500L',
  'review': '7 reviews'},
 {'price': ['$96.99'],
  'description': ['7" screen, Android, 16GB'],
  'product': 'Acer Iconia',
  'review': '7 reviews'},
 {'price': ['$97.99'],
  'description': ['7", 8GB, Wi-Fi, Android 4.2, White'],
  'product': 'Galaxy Tab 3',
  'review': '2 reviews'},
 {'price': ['$99.99'],
  'description': ['Black, 7", 1.6GHz Dual-Core, 8GB, Android 4.4'],
  'product': 'Iconia B1-730HD',
  'review': '1 reviews'},
 {'price': ['$101.99'],
  'description': ['IPS, Dual-Core 1.2GHz, 8GB, Android 4.3'],
  'product': 'Memo Pad HD 7',
  'review': '10 reviews'},
 {'price': ['$102.99'],
  'description': ['7" screen, Android, 8GB'],
  'product': 'Asus MeMO Pad',
  'review': '14 reviews'},
 {'price': ['$103.99'],
  'description': ['6" scr

#### There are a total of 21 items.

In [5]:
# While examining the scraped data, I noticed that the keyword 'Android' does not necessarily appear
# in every product name or description. Some products, such as Memo and IdeaTab, actually run on Android
# but do not have the keyword 'Android' in their descriptions.
# Likewise, the iPad is clearly an Apple product, but some entries lack the 'apple' keyword.
# Therefore, lists of Android and Apple keywords were created to enrich the search criteria.
 
def searchbybrand(s):
    """Return (matches, count) for the brand s, which is 'android' or 'apple'."""
    # Lowercase keywords per brand; any other value of s matches nothing.
    keywords = {
        'android': ['android', 'galaxy', 'memo', 'ideatab', 'iconia'],
        'apple': ['apple', 'ipad'],
    }.get(s.lower(), [])

    result = []
    for i in data:
        # 'description' is a list of strings and 'product' is a single string,
        # so join them into one lowercase text and test each keyword as a substring.
        text = ' '.join(i['description'] + [i['product'] or '']).lower()
        if any(k in text for k in keywords):
            result.append(i)

    return (result, len(result))


            
r,l=searchbybrand('apple')
print('Number of Apple Products: ', l)
display("Return Result :",r)

Number of Apple Products:  2


'Return Result :'

[{'price': ['$537.99'],
  'description': ['Wi-Fi + Cellular, 32GB, Silver'],
  'product': 'iPad Mini Retina',
  'review': '8 reviews'},
 {'price': ['$603.99'],
  'description': ['Wi-Fi, 64GB, Silver'],
  'product': 'Apple iPad Air',
  'review': '7 reviews'}]
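
The keyword reasoning in the comments above can be sketched on individual records. This is a minimal illustration, assuming case-insensitive substring matching over both the product name and the description strings. The helper `matches_brand` and the `BRAND_KEYWORDS` table are hypothetical names introduced here, and the three records are trimmed copies of items from the scraped output.

```python
# Hypothetical keyword table, mirroring the lists used in searchbybrand.
BRAND_KEYWORDS = {
    'android': ['android', 'galaxy', 'memo', 'ideatab', 'iconia'],
    'apple': ['apple', 'ipad'],
}


def matches_brand(item, brand):
    """Return True if any brand keyword occurs in the product or description."""
    # Join the product name and every description string into one lowercase text.
    text = ' '.join([item['product']] + item['description']).lower()
    return any(keyword in text for keyword in BRAND_KEYWORDS[brand])


samples = [
    {'product': 'iPad Mini Retina', 'description': ['Wi-Fi + Cellular, 32GB, Silver']},
    {'product': 'Galaxy Tab', 'description': ['16GB, White']},
    {'product': 'Amazon Kindle', 'description': ['6" screen, wifi']},
]

for item in samples:
    brands = [b for b in BRAND_KEYWORDS if matches_brand(item, b)]
    print(item['product'], '->', brands)
```

The iPad Mini Retina matches only through `ipad` in its product name and the Galaxy Tab only through `galaxy`, which is why brand and model names sit alongside the literal `android` and `apple` keywords.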

In [6]:
r2,l2=searchbybrand('android')

print('Number of Android Products: ', l2)
display("Return Result :",r2)


Number of Android Products:  18


'Return Result :'

[{'price': ['$69.99'],
  'description': ['7" screen, Android'],
  'product': 'Lenovo IdeaTab',
  'review': '7 reviews'},
 {'price': ['$88.99'],
  'description': ['Black, 7" IPS, Quad-Core 1.2GHz, 8GB, Android 4.2'],
  'product': 'IdeaTab A3500L',
  'review': '7 reviews'},
 {'price': ['$96.99'],
  'description': ['7" screen, Android, 16GB'],
  'product': 'Acer Iconia',
  'review': '7 reviews'},
 {'price': ['$97.99'],
  'description': ['7", 8GB, Wi-Fi, Android 4.2, White'],
  'product': 'Galaxy Tab 3',
  'review': '2 reviews'},
 {'price': ['$99.99'],
  'description': ['Black, 7", 1.6GHz Dual-Core, 8GB, Android 4.4'],
  'product': 'Iconia B1-730HD',
  'review': '1 reviews'},
 {'price': ['$101.99'],
  'description': ['IPS, Dual-Core 1.2GHz, 8GB, Android 4.3'],
  'product': 'Memo Pad HD 7',
  'review': '10 reviews'},
 {'price': ['$102.99'],
  'description': ['7" screen, Android, 8GB'],
  'product': 'Asus MeMO Pad',
  'review': '14 reviews'},
 {'price': ['$107.99'],
  'description': ['7", 8G

#### The Amazon Kindle is neither Android nor Apple, so neither list contains it.
#### The Android and Apple lists together contain 20 items (excluding the Kindle).
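
The count can be cross-checked by listing the items that fall in neither result. The sketch below is an illustration under assumptions: `sample_data`, `android_items`, and `apple_items` are small hand-built stand-ins for the loaded file and the two lists returned by `searchbybrand`, not the full 21-item data set.

```python
# Hand-built stand-ins (assumptions) for the loaded data and the two search results.
sample_data = [
    {'product': 'Galaxy Tab', 'description': ['16GB, White']},
    {'product': 'Amazon Kindle', 'description': ['6" screen, wifi']},
    {'product': 'Apple iPad Air', 'description': ['Wi-Fi, 64GB, Silver']},
]
android_items = [sample_data[0]]
apple_items = [sample_data[2]]

# Items that appear in neither brand list.
unmatched = [item for item in sample_data
             if item not in android_items and item not in apple_items]

# Every item is either matched by exactly one brand or left unmatched.
total = len(android_items) + len(apple_items) + len(unmatched)
print(total == len(sample_data))
print([item['product'] for item in unmatched])
```

Run against the real lists, the same comprehension should leave only the Kindle, accounting for the gap between 20 matched items and 21 scraped items.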
