<center><h1>Introduction to Web-Scraping</h1></center>

Web scraping is a method of collecting data from websites. The software or script that does this is usually called a `Bot` or `Web Crawler`.

Related names and terms include:

* Web Harvesting
* Web Data Extraction
* Screen Scraping
* Growth Hacking

**Usual Ways**

* Collecting data from the web and storing it in a local file or database.
* Collecting data from the web and exposing it through an API or URL for further use.
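
The first of these, storing collected data locally, can be sketched with nothing but the standard library. The records, the column names `product` and `price`, and the table name `products` are all hypothetical placeholders for whatever a scraper actually collects.

```python
import csv
import sqlite3

# Hypothetical records standing in for scraped data.
records = [
    {"product": "notebook", "price": 4.50},
    {"product": "pen", "price": 1.20},
]

def save_to_csv(rows, path):
    """Write a list of dicts to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["product", "price"])
        writer.writeheader()
        writer.writerows(rows)

def save_to_sqlite(rows, path):
    """Insert the same rows into a SQLite table, creating it if needed."""
    conn = sqlite3.connect(path)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS products (product TEXT, price REAL)"
        )
        conn.executemany(
            "INSERT INTO products (product, price) VALUES (:product, :price)",
            rows,
        )
        conn.commit()
    finally:
        conn.close()
```

Calling `save_to_csv(records, "products.csv")` or `save_to_sqlite(records, "products.db")` writes the file. A CSV file is usually enough for a one-off scrape, while a database is easier to query when the scraper runs repeatedly.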

<img src="https://miro.medium.com/max/960/0*9PbNnwHdp8ocBn7b.jpg">

**Credit** - Image from Internet

### What is an API?

* API (Application Programming Interface) - acts as a mediator between the server and the client machine.
    - Think of an API as a URL (link) that returns different data when parts of it, such as its parameters, are changed.
    - The client (user) requests data from the server through the API.
    - The server responds to the client if the request is valid (success - status code → 200).
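
The request and response cycle described here can be tried without touching a real service. The sketch below starts a toy API on the local machine and calls it twice, once with a valid path and once with an invalid one. The handler `ToyApiHandler`, the `/city` endpoint, and the helper names are hypothetical, and the standard-library `urllib` is used only so that nothing needs to be installed.

```python
import json
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class ToyApiHandler(BaseHTTPRequestHandler):
    """A hypothetical API with a single valid endpoint, /city."""

    def do_GET(self):
        if self.path == "/city":
            body = json.dumps({"city": "Hyderabad"}).encode("utf-8")
            self.send_response(200)  # valid request -> success
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404, "Unknown endpoint")  # invalid request

    def log_message(self, format, *args):
        pass  # silence the default per-request logging

# Ignore any system proxy settings so the call goes straight to localhost.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))

def call_api(url):
    """Return (status_code, parsed_json_or_None) for a GET request."""
    try:
        with _opener.open(url, timeout=5) as resp:
            return resp.status, json.loads(resp.read().decode("utf-8"))
    except urllib.error.HTTPError as err:
        return err.code, None

def demo():
    """Start the toy server, call a valid and an invalid URL, then stop it."""
    server = HTTPServer(("127.0.0.1", 0), ToyApiHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = "http://127.0.0.1:{}".format(server.server_address[1])
    try:
        return call_api(base + "/city"), call_api(base + "/unknown")
    finally:
        server.shutdown()
        server.server_close()
```

`demo()` returns `((200, {'city': 'Hyderabad'}), (404, None))`: the valid request succeeds with status code 200 and a JSON body, while the unknown path is rejected with 404.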

<center><h1>Web Scraping and Hacking</h1></center>

Web scraping is often described as a growth hacking technique for building a sales pipeline and finding out how competitors price similar products.

That falls under marketing, though. So how are data science and coding related to web scraping?

**More information** - https://www.entrepreneur.com/article/296906

Web scraping is used to collect data that is publicly available. It helps businesses in many ways:

* To understand customer behaviour.
* To estimate what customers want.
* To build machine learning models **from public data** and predict customer interest.
* And so on.

### Is web scraping legal?

There are two sides to this question:

* Good Bots
* Bad Bots

**Good Bots** - They respect the site owner's standards and abide by the rules of scraping. They help customers learn more with less effort through uses such as `price comparison`, `social sentiment gauging`, `helping market researchers`, and many others.

**Bad Bots** - The very opposite of `Good Bots`. They are used for data breaches, user account hacking, online fraud, unauthorized vulnerability scans, spam, and digital ad fraud.

![gb_bots](https://user-images.githubusercontent.com/63333753/126785454-592335c1-6e99-4378-a041-d4c7fa389c02.PNG)

<center>Bad Bot <strong> vs </strong> Good Bot</center>

**Credits** - Image from Internet

<br>
Web scraping is not inherently illegal. Startups love web scraping because it lets them understand customers and get data without the effort of partnering with other data providers.

**More information** - https://www.imperva.com/blog/is-web-scraping-illegal/
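
One common way for a bot to abide by a site's rules is to read its `robots.txt` file before fetching anything. The text above does not name `robots.txt`, so treat this as one illustration of "following the rules" rather than the whole story. The sketch uses the standard-library `urllib.robotparser`. The rules, the bot name `MyBot`, and the `example.com` URLs are hypothetical, and the file is inlined so that the example needs no network access.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt, inlined for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

def build_parser(robots_txt):
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

parser = build_parser(ROBOTS_TXT)

# A good bot checks each URL before requesting it.
print(parser.can_fetch("MyBot", "https://example.com/products"))   # True
print(parser.can_fetch("MyBot", "https://example.com/private/x"))  # False

# It also waits between requests if the site asks for it.
print(parser.crawl_delay("MyBot"))  # 5
```

Against a live site, `RobotFileParser.set_url(...)` followed by `read()` downloads the file instead of parsing an inline string.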

<center><h1>Web Scraping in Python</h1></center>

Web scraping in Python can be done using the following packages:

* requests
* bs4
* selenium - requires a Chromium or Firefox driver
* scrapy

<img src="https://www.scrapingbee.com/images/post/python-101/python_101_cover.jpg">

**Credits** - Image from Internet

### Installation of the Packages

For Windows - open command prompt

* **bs4** - `py -m pip install bs4 --user`
* **requests** - `py -m pip install requests --user`

For Linux - open terminal

* **bs4** - `pip install bs4 --user`
* **requests** - `pip install requests --user`

Before running these commands on Windows, make sure `pip` is recognized, for example by checking that `py -m pip --version` prints a version number.

<center><h1>Live coding</h1></center>

**JSON** - JavaScript Object Notation

* lightweight data interchange format
* easy for humans to read
* data is extracted by parsing
* it can be treated as a dictionary in Python

**Structure of JSON**

```json
{
    "key" : "value",
    "key" : {
        "sub_key" : "value",
        "sub_key" : "value"
    },
    "key" : [
        {
            "sub_key" : "value",
            "sub_key" : "value"
        }, 
        {
            "sub_key" : "value",
            "sub_key" : "value"
        }
    ],
    "key" : "value",
    "key" : ["value", "value", "value"]
}
```
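
To see how such a structure behaves as a Python dictionary, here is a minimal sketch using the standard-library `json` module. The JSON text is hypothetical and hand-written. Note that a real JSON object should use distinct keys: the repeated `"key"` in the structure above is schematic, and `json.loads` would keep only the last value for a repeated key.

```python
import json

# Hypothetical JSON text following the structure shown above.
raw = """
{
    "name": "Lucknow",
    "coord": {"lat": 26.85, "lon": 80.92},
    "weather": [
        {"main": "Haze", "description": "haze"}
    ],
    "tags": ["city", "india"]
}
"""

data = json.loads(raw)  # JSON text -> Python dict

print(type(data).__name__)                # dict
print(data["coord"]["lat"])               # 26.85 (nested object -> nested dict)
print(data["weather"][0]["description"])  # haze (array of objects -> list of dicts)
print(data["tags"][-1])                   # india (array of values -> list)

text = json.dumps(data, indent=4)  # Python dict -> JSON text again
```

The same indexing pattern, a key for an object and a position for an array, is what the weather example later in this notebook relies on.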

### Let's scrape the device location

In [1]:
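# ip_url = 'http://ip-api.com/json'

import requests

class DeviceTracker:
    """Looks up the approximate location of this device from its public IP."""

    def __init__(self, ip_url):
        self.ip_url = ip_url

    def get_device_data(self):
        # Call the IP geolocation API and parse the JSON body into a dict.
        ip_req = requests.get(url=self.ip_url, timeout=10)
        ip_data = ip_req.json()
        return ip_data

    def get_user_loc(self):
        # The 'city' key of the response holds the city name.
        ip_data = self.get_device_data()
        city_name = ip_data['city']
        return city_name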

In [2]:
ip_url = 'http://ip-api.com/json'
ip_dev = DeviceTracker(ip_url=ip_url)
city_name = ip_dev.get_user_loc()
print(city_name)

Hyderabad
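
`ip_data['city']` raises a `KeyError` whenever the response has no `city` key, for example when the lookup fails. A slightly more defensive version is sketched below. The helper `extract_city` is hypothetical, and the field names `status` and `message` are assumptions about the shape of the response. Both payloads are hand-written for illustration and are not real API output.

```python
from typing import Optional

def extract_city(ip_data: dict) -> Optional[str]:
    """Return the city from an ip-api style payload, or None if unusable."""
    # Assumption: a failed lookup is reported as {"status": "fail", ...}.
    if ip_data.get("status") == "fail":
        return None
    city = ip_data.get("city")
    return city if city else None

# Hypothetical payloads, hand-written for illustration.
ok_payload = {"status": "success", "city": "Hyderabad", "country": "India"}
fail_payload = {"status": "fail", "message": "reserved range"}

print(extract_city(ok_payload))    # Hyderabad
print(extract_city(fail_payload))  # None
print(extract_city({}))            # None
```

Returning `None` instead of raising lets the caller decide what to do, such as asking the user to type a city name.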


### Let's scrape the location of any place and get the weather data

In [3]:
# 'http://api.openweathermap.org/data/2.5/weather?q={}&appid=9d41bd4e5bffd04e03a6cb6832066559'
# q - any city name
# celsius - kelvin - 273 (rounded; the exact offset is 273.15)
# fahrenheit - celsius * 9/5 + 32

import requests

class WeatherApp(DeviceTracker):
    def __init__(self, ip_url):
        # Let the parent class store ip_url.
        super().__init__(ip_url)
        self.weather_url = 'http://api.openweathermap.org/data/2.5/weather?q={}&appid=9d41bd4e5bffd04e03a6cb6832066559'
        self.place_name = None
    
    def get_weather_data(self):
        self.place_name = input("Please enter valid city name: ")
        w_url = self.weather_url.format(self.place_name)
        w_req = requests.get(url=w_url)
        
        if w_req.status_code == 200:
            w_data = w_req.json()
        else:
            # Any non-200 response is treated as an invalid place name;
            # fall back to the city detected from the device's IP address.
            print("-----------------")
            print("The entered place name is not valid")
            print("Getting the user location ...")
            self.place_name = self.get_user_loc()
            w_url = self.weather_url.format(self.place_name)
            w_req = requests.get(url=w_url)
            w_data = w_req.json()
        
        return w_data
    
    def get_parsed_details(self):
        w_data = self.get_weather_data()
        
        # Pick the fields of interest out of the nested JSON response.
        desc = w_data['weather'][0]['description']
        temp = w_data['main']['temp']
        humidity = w_data['main']['humidity']
        wind_speed = w_data['wind']['speed']
        all_clouds = w_data['clouds']['all']
        
        # The API returns kelvin by default; 273 is a rounded offset (exact: 273.15).
        celsius = temp - 273
        fahrenheit = (celsius * (9 / 5)) + 32
        
        print("-----------------")
        print("The weather details of the place - {}".format(self.place_name))
        print("Weather description - ", desc)
        print("The temp in celsius - ", round(celsius, 2))
        print("The temp in fahrenheit - ", round(fahrenheit, 2))
        # Wind speed is reported in metres per second by default.
        print("The wind speed - {} m/s".format(wind_speed))
        print("Humidity - ", humidity)
        print("Total clouds - ", all_clouds)

In [4]:
ip_url = 'http://ip-api.com/json'
w_app = WeatherApp(ip_url=ip_url)
w_app.get_parsed_details()

Please enter valid city name: lucknow
-----------------
The weather details of the place - lucknow
Weather description -  haze
The temp in celsius -  29.14
The temp in fahrenheit -  84.45
The wind speed - 1.03 m/s
Humidity -  89
Total clouds -  75
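
The temperature conversion in `get_parsed_details` subtracts 273, which is a rounded version of the exact kelvin offset of 273.15. A minimal sketch of the exact conversion is shown below. The function names and the sample reading of 302.14 K are hypothetical.

```python
def kelvin_to_celsius(kelvin):
    """Convert kelvin to degrees Celsius using the exact offset."""
    return kelvin - 273.15

def celsius_to_fahrenheit(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32

temp_k = 302.14  # hypothetical reading in kelvin
celsius = kelvin_to_celsius(temp_k)
fahrenheit = celsius_to_fahrenheit(celsius)

print(round(celsius, 2))     # 28.99
print(round(fahrenheit, 2))  # 84.18
```

With the rounded offset the same reading gives 29.14 °C, so the difference is a constant 0.15 °C. OpenWeatherMap also accepts a `units=metric` query parameter that returns Celsius directly, which avoids the manual conversion.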


### What did we learn?

* Web scraping definition
* Bot and crawlers
* Web scraping and Growth hacking
* Web scraping legal/illegal
* Live coding