## Assignment 3: Web Scraping

For this assignment, you are required to scrape data from an e-commerce or similar website such as [Lelong](http://www.lelong.com.my), [Lazada](http://www.lazada.com.my/), [Mudah](http://www.mudah.my/), [iProperty](https://www.iproperty.com.my/), [Booking](http://www.booking.com) or [Expedia](https://www.expedia.com.my/).

You are required to fork this Jupyter Notebook from my GitHub [here](https://github.com/kuanhoong/EDS-Assignment3) and then scrape the latest 1000 items from one of the websites mentioned above. The scraped data should include:

* Product Name/Product Title
* Amount/Price
* Brand
* Comments/Reviews
* Number of views

In addition, you are required to load the scraped data into a pandas dataframe and save a copy in CSV format. Once the data has been extracted into a dataframe, you are required to carry out a data analysis on it.
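As a rough illustration of the export step, the sketch below builds a dataframe from scraped lists and writes it out as CSV. The product names, prices and view counts are made-up placeholder values, and the in-memory buffer is used only so the example runs without touching the disk; in the notebook you would pass a filename such as `output.csv` instead.

```python
import io

import pandas as pd

# Hypothetical scraped values; in the notebook these lists are filled by the scraping loop.
records = {
    'name': ['MacBook Pro 13', 'MacBook Pro 15'],
    'price': [4599.0, 7299.0],
    'views': [120, 85],
}

# One column per scraped field, one row per item.
df = pd.DataFrame(records)

# Write the CSV into an in-memory buffer; index=False drops the row numbers.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

All lists must have the same length, so it is worth appending a placeholder such as `None` whenever a field is missing for an item.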

Your analysis should address the following:
* What do you think is interesting about this data? Tell a story about something interesting you have discovered by looking at the data.
* Visualize your data with matplotlib or with the folium library.

For example, you might consider whether prices differ by time of day or by city, or whether other factors influence the pricing. Another thing you might consider is whether there is a relationship between the price and the number of reviews or comments.
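One possible way to check for a relationship between price and number of reviews is to clean the price strings and compute a Pearson correlation. The sketch below uses only the standard library; the listings are invented sample values, and `parse_price` and `pearson` are hypothetical helper names, not part of the assignment template.

```python
import re
from math import sqrt


def parse_price(text):
    """Convert a price string such as 'RM 1,299.00' to a float, or None if no number is found."""
    match = re.search(r'\d[\d,]*(?:\.\d+)?', text)
    if match is None:
        return None
    return float(match.group().replace(',', ''))


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical listings: (price text, number of reviews)
listings = [('RM 1,299.00', 12), ('RM 899.90', 30), ('RM2,450', 4), ('RM 3,100.50', 2)]

prices = [parse_price(price_text) for price_text, _ in listings]
reviews = [count for _, count in listings]

r = pearson(prices, reviews)
print(f'Pearson r between price and reviews: {r:.2f}')
```

A value of `r` near -1 or 1 suggests a strong linear relationship, while a value near 0 suggests none; with real scraped data, a scatter plot in matplotlib is a useful companion to the single number.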

Show your analysis workflow in your Jupyter notebook.

The final submission should be pushed to your own GitHub account.

### Folium

[Folium](https://github.com/python-visualization/folium) makes it easy to visualize data that's been manipulated in Python on an interactive Leaflet map. It enables both the binding of data to a map for choropleth visualizations as well as passing Vincent/Vega visualizations as markers on the map.

In [1]:
import pandas as pd
import re
import numpy as np
import folium
import matplotlib.pyplot as plt
%matplotlib inline
from bs4 import BeautifulSoup
import requests

In [2]:
# Create the map once, centred on Le Meridien
mapit = folium.Map(location=[3.135732, 101.686989])

# (latitude, longitude), popup label, marker colour, marker icon
hotels = [
    ((3.144473, 101.708722), 'Federal Hotel', 'blue', 'star'),
    ((3.156374, 101.714579), 'Mandarin Oriental', 'green', 'info-sign'),
    ((3.135732, 101.686989), 'Le Meridien', 'red', 'star'),
]
# Add one marker per hotel
for coord, label, color, icon in hotels:
    folium.Marker(list(coord), icon=folium.Icon(color=color, icon=icon), popup=label).add_to(mapit)
mapit

In [3]:
# scrape from lelong

# find the pattern for the first page
url = 'https://www.lelong.com.my/catalog/all/list?TheKeyword=macbook+pro&D='

# write a loop to scrape from page 1 to the last page

product_name=[]
for page in range(1,19):
    url_page = url+str(page)
    scrape = requests.get(url_page)
    soup = BeautifulSoup(scrape.content, 'lxml')
    # each listing's title link sits inside a div with class 'summary'
    link = soup.find_all('div', {'class': 'summary'})
    for item in link:
        name = item.a.get('title')
        product_name.append(name)

In [4]:
# write to csv
# convert the list to a pandas dataframe

df = pd.DataFrame({'name':product_name})
df.to_csv('output.csv', index=False)
# leave df as the last expression so the notebook displays it
df

In [39]:
#Jason Phoon Answer
#The scraped data should include:
#Product Name/Product Title
#Amount/Price
#Brand
#Comments/Reviews
#Number of views

url = 'https://www.lelong.com.my/catalog/all/list?TheKeyword=macbook+pro&D='

names = []
prices = []
views = []

for page in range(1,19):
    url_page = url+str(page)
    # fetch and parse each results page before extracting its items
    scrape = requests.get(url_page)
    soup = BeautifulSoup(scrape.content, 'lxml')
    for item in soup.find_all('div',{'class':'item'}):
        names.append(item.find('div',{'class':'summary'}).a.get('title'))
        prices.append(item.find('div',{'class':'col total'}).span.b.string)
        views.append(item.find('div',{'class':'list-sv-icon'}).find('span',{'class':'hit'}).string)

df = pd.DataFrame({'name':names, 'price': prices, 'view': views})
# strip the currency prefix from the price text
df['price'] = df['price'].str.replace('RM','', regex=False)
# keep only the digits of the view counter
df['view'] = df['view'].str.extract(r'(\d+)', expand=False)

df.to_csv('output.csv', index=False)
df

