## Web Scraping
- Web scraping means extracting the required data from a website and storing it in a structured format such as CSV or Excel.

### Libraries required
1. requests
2. BeautifulSoup (installed as `beautifulsoup4`, imported from `bs4`)

### Life cycle of Data Science
1. Business Understanding
2. Data Collection / Data Mining
3. Data Cleaning
4. Data Analysis
5. Data Visualization
6. Feature Engineering
7. Model Building
8. Model Evaluation
9. Model Deployment

##### Import required libraries

In [5]:
import requests
from bs4 import BeautifulSoup

In [9]:
page = requests.get("https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_area")

In [10]:
page.text

'<!DOCTYPE html>\n<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-0 vector-feature-client-preferences-disabled vector-feature-client-prefs-pinned-disabled vector-feature-night-mode-disabled skin-theme-clientpref-day vector-toc-available" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8">\n<title>List of countries and dependencies by area - Wikipedia</title>\n<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-m

In [11]:
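# parse the HTML string into a navigable BeautifulSoup tree
# (naming the parser explicitly avoids the "no parser was explicitly specified" warning)
soup = BeautifulSoup(page.text, "html.parser")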

In [12]:
soup

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-0 vector-feature-client-preferences-disabled vector-feature-client-prefs-pinned-disabled vector-feature-night-mode-disabled skin-theme-clientpref-day vector-toc-available" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>List of countries and dependencies by area - Wikipedia</title>
<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-m

In [14]:
soup.find_all("span",class_="mw-page-title-main")

[<span class="mw-page-title-main">List of countries and dependencies by area</span>]

In [17]:
str(soup.find_all("span",class_="mw-page-title-main"))

'[<span class="mw-page-title-main">List of countries and dependencies by area</span>]'

In [19]:
soup.find_all("span",class_="mw-page-title-main")[0].text

'List of countries and dependencies by area'

In [22]:
soup.find_all("b")[0].text

'Total area:'

In [24]:
soup.find_all("span",class_="mw-headline")[0].text

'Map'

### Problem Statement
- I want to analyse mobile phone prices.

1. Features to collect: Brand, RAM, ROM, Processor, Display, Camera, Battery, Color, Model, Price
2. Search for websites that list this data
    - E-commerce websites
    - The brands' own websites

In [27]:
request_header = {'Content-Type': 'text/html; charset=UTF-8', 
                  'User-Agent': 'Chrome/101.0.0.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/119.0', 
                  'Accept-Encoding': 'gzip, deflate, br'}

In [28]:
page = requests.get("https://www.flipkart.com/search?q=mobiles&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off",
                   headers = request_header)
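
The long search URL above is a base address followed by a query string. A minimal sketch, using only the standard library, of how that query string can be assembled from a dictionary and split apart again. The names `base_url`, `params`, `search_url` and `parsed` are illustrative and not part of the original notebook; the parameter values are copied from the URL above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Illustrative names; the parameter values are copied from the search URL above.
base_url = "https://www.flipkart.com/search"
params = {
    "q": "mobiles",
    "otracker": "search",
    "otracker1": "search",
    "marketplace": "FLIPKART",
    "as-show": "on",
    "as": "off",
}

# urlencode joins the pairs with "&" and percent-encodes unsafe characters.
search_url = f"{base_url}?{urlencode(params)}"
print(search_url)

# Going the other way: split an existing URL back into its query parameters.
parsed = parse_qs(urlsplit(search_url).query)
print(parsed["q"])
```

Keeping the parameters in a dictionary makes it easier to change the search term. `requests.get` also accepts such a dictionary through its `params` argument.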

In [29]:
page

<Response [200]>

In [30]:
soup = BeautifulSoup(page.text, "html.parser")

In [31]:
soup

<!DOCTYPE html>
<html lang="en"><head><link href="https://rukminim2.flixcart.com" rel="preconnect"/><link href="//static-assets-web.flixcart.com/fk-p-linchpin-web/fk-cp-zion/css/app_modules.chunk.c48a12.css" rel="stylesheet"/><link href="//static-assets-web.flixcart.com/fk-p-linchpin-web/fk-cp-zion/css/app.chunk.47c551.css" rel="stylesheet"/><meta content="text/html; charset=utf-8" http-equiv="Content-type"/><meta content="IE=Edge" http-equiv="X-UA-Compatible"/><meta content="102988293558" property="fb:page_id"/><meta content="658873552,624500995,100000233612389" property="fb:admins"/><link href="https://static-assets-web.flixcart.com/www/promos/new/20150528-140547-favicon-retina.ico" rel="shortcut icon"/><link href="/osdd.xml?v=2" rel="search" type="application/opensearchdescription+xml"/><meta content="website" property="og:type"/><meta content="Flipkart.com" name="og_site_name" property="og:site_name"/><link href="/apple-touch-icon-57x57.png" rel="apple-touch-icon" sizes="57x57"/><l

#### Prices

In [33]:
len(soup.find_all("div",class_="Nx9bqj _4b5DiR"))

24

In [44]:
soup.find_all("div",class_="Nx9bqj _4b5DiR")[20].text

'₹9,299'

In [45]:
p = soup.find_all("div",class_="Nx9bqj _4b5DiR")
p

[<div class="Nx9bqj _4b5DiR">₹23,999</div>,
 <div class="Nx9bqj _4b5DiR">₹14,999</div>,
 <div class="Nx9bqj _4b5DiR">₹11,999</div>,
 <div class="Nx9bqj _4b5DiR">₹11,999</div>,
 <div class="Nx9bqj _4b5DiR">₹11,999</div>,
 <div class="Nx9bqj _4b5DiR">₹7,999</div>,
 <div class="Nx9bqj _4b5DiR">₹8,999</div>,
 <div class="Nx9bqj _4b5DiR">₹9,999</div>,
 <div class="Nx9bqj _4b5DiR">₹7,699</div>,
 <div class="Nx9bqj _4b5DiR">₹7,999</div>,
 <div class="Nx9bqj _4b5DiR">₹10,490</div>,
 <div class="Nx9bqj _4b5DiR">₹7,699</div>,
 <div class="Nx9bqj _4b5DiR">₹10,999</div>,
 <div class="Nx9bqj _4b5DiR">₹9,999</div>,
 <div class="Nx9bqj _4b5DiR">₹8,999</div>,
 <div class="Nx9bqj _4b5DiR">₹10,999</div>,
 <div class="Nx9bqj _4b5DiR">₹6,999</div>,
 <div class="Nx9bqj _4b5DiR">₹24,999</div>,
 <div class="Nx9bqj _4b5DiR">₹17,620</div>,
 <div class="Nx9bqj _4b5DiR">₹19,598</div>,
 <div class="Nx9bqj _4b5DiR">₹9,299</div>,
 <div class="Nx9bqj _4b5DiR">₹19,509</div>,
 <div class="Nx9bqj _4b5DiR">₹7,999</div>,
 <div class="Nx9bqj _4b5DiR">₹14,999</div>]

In [48]:
prices = []
for i in p:
    prices.append(i.text)

In [50]:
prices

['₹23,999',
 '₹14,999',
 '₹11,999',
 '₹11,999',
 '₹11,999',
 '₹7,999',
 '₹8,999',
 '₹9,999',
 '₹7,699',
 '₹7,999',
 '₹10,490',
 '₹7,699',
 '₹10,999',
 '₹9,999',
 '₹8,999',
 '₹10,999',
 '₹6,999',
 '₹24,999',
 '₹17,620',
 '₹19,598',
 '₹9,299',
 '₹19,509',
 '₹7,999',
 '₹14,999']
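
The scraped prices are strings that carry the rupee sign and thousands separators, so they cannot be averaged or sorted numerically as they stand. A minimal sketch of one way to convert them to integers; `parse_price`, `sample_prices` and `clean_prices` are hypothetical names introduced here, and the sample values are taken from the output above.

```python
import re

def parse_price(text):
    """Convert a price string such as '₹23,999' to an integer number of rupees."""
    digits = re.sub(r"\D", "", text)  # drop everything that is not a digit
    if not digits:
        raise ValueError(f"no digits found in {text!r}")
    return int(digits)

# A few values copied from the scraped list above.
sample_prices = ["₹23,999", "₹14,999", "₹7,699"]
clean_prices = [parse_price(p) for p in sample_prices]
print(clean_prices)  # [23999, 14999, 7699]
```

This assumes whole-rupee prices, as in the listing shown here; a price with a decimal part would need a different rule.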

### Brands

In [52]:
soup.find_all("div",class_="KzDlHZ")[0].text

'Nothing Phone (2a) 5G (White, 128 GB)'

In [56]:
soup.find_all("div",class_="KzDlHZ")[15].text.split()[0]

'Motorola'

In [57]:
data = soup.find_all("div",class_="KzDlHZ")

In [60]:
brands = []
for i in data:
    brands.append(i.text.split()[0])

In [61]:
brands

['Nothing',
 'vivo',
 'Motorola',
 'Motorola',
 'Motorola',
 'MOTOROLA',
 'MOTOROLA',
 'POCO',
 'REDMI',
 'MOTOROLA',
 'SAMSUNG',
 'REDMI',
 'POCO',
 'POCO',
 'MOTOROLA',
 'Motorola',
 'MOTOROLA',
 'Motorola',
 'OnePlus',
 'OnePlus',
 'POCO',
 'OnePlus',
 'POCO',
 'Motorola']
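
The brand list contains the same brand in different capitalisations ('Motorola' and 'MOTOROLA'), which would be counted as two separate brands in any later grouping. A small sketch of one possible normalisation; `sample_brands`, `normalized` and `brand_counts` are illustrative names, and lowercasing is only one of several reasonable choices.

```python
from collections import Counter

# A few of the scraped brand strings from the output above.
sample_brands = ["Nothing", "vivo", "Motorola", "MOTOROLA", "POCO", "REDMI", "Motorola"]

# casefold() lowercases aggressively, so "Motorola" and "MOTOROLA" compare equal.
normalized = [b.casefold() for b in sample_brands]

brand_counts = Counter(normalized)
print(brand_counts["motorola"])  # 3
```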

In [69]:
data[1].text.split("(")[0]

'vivo T3x 5G '

In [71]:
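models = []
for i in data:
    # Caveat: this cuts at the first "(", so 'Nothing Phone (2a) 5G (White, 128 GB)'
    # yields 'Nothing Phone ' rather than 'Nothing Phone (2a) 5G'.
    models.append(i.text.split("(")[0])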

In [76]:
data[1].text.split(",")[0].split("(")[1]

'Crimson Bliss'

In [78]:
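colors = []
for i in data:
    # Caveat: this takes the text after the first "(", so 'Nothing Phone (2a) 5G (White, 128 GB)'
    # yields '2a) 5G ' rather than 'White'.
    colors.append(i.text.split(",")[0].split("(")[1])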
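
Splitting on the first "(" works for titles such as 'vivo T3x 5G (Crimson Bliss, ...)' but not when the model name itself contains parentheses. A hedged sketch of a regex-based alternative that anchors on the last parenthesised group instead. `TITLE_RE` and `parse_title` are hypothetical helpers, and the pattern assumes titles of the form `Model (Color, Storage)` as seen in this listing.

```python
import re

# Assumed title layout: "<model> (<color>, <storage>)", where only the final
# parenthesised group holds the colour and storage.
TITLE_RE = re.compile(
    r"^(?P<model>.+?)\s*\((?P<color>[^,()]+),\s*(?P<storage>[^()]+)\)\s*$"
)

def parse_title(title):
    """Return a dict with model, color and storage, or None if the title does not fit."""
    m = TITLE_RE.match(title)
    if m is None:
        return None
    return {key: value.strip() for key, value in m.groupdict().items()}

# Title copied from the output above.
print(parse_title("Nothing Phone (2a) 5G (White, 128 GB)"))
```

Returning `None` for titles that do not match keeps one entry per product, so the resulting lists stay aligned with the other columns.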

In [80]:
features = soup.find_all("li",class_="J+igdf")

In [81]:
for i in features:
    print(i.text)

8 GB RAM | 128 GB ROM
17.02 cm (6.7 inch) Full HD+ Display
50MP (OIS) + 50MP | 32MP Front Camera
5000 mAh Battery
Dimensity 7200 Pro Processor
1 Year Manufacturing Warranty
6 GB RAM | 128 GB ROM | Expandable Upto 1 TB
17.07 cm (6.72 inch) Full HD+ Display
50MP + 2MP | 8MP Front Camera
6000 mAh Battery
6 Gen 1 Processor
1 Year Manufacturer Warranty for Device and 6 Months Manufacturer Warranty for Inbox Accessories
8 GB RAM | 128 GB ROM
16.51 cm (6.5 inch) HD+ Display
50MP + 2MP | 16MP Front Camera
5000 mAh Battery
Snapdragon 695 5G Processor
Vegan Leather Design
1 Year on Handset and 6 Months on Accessories
8 GB RAM | 128 GB ROM
16.51 cm (6.5 inch) HD+ Display
50MP + 2MP | 16MP Front Camera
5000 mAh Battery
Snapdragon 695 5G Processor
1 Year on Handset and 6 Months on Accessories
8 GB RAM | 128 GB ROM
16.51 cm (6.5 inch) HD+ Display
50MP + 2MP | 16MP Front Camera
5000 mAh Battery
Snapdragon 695 5G Processor
1 Year on Handset and 6 Months on Accessories
4 GB RAM | 128 GB ROM | Expandabl

In [86]:
features[6].text

'6 GB RAM | 128 GB ROM | Expandable Upto 1 TB'

In [87]:
import re

In [93]:
re.findall(r"(\d+)\s\w+\sRAM",features[6].text)[0]

'6'

In [100]:
RAM = []
for i in features:
    # capture the number in front of "<unit> RAM", e.g. "6" in "6 GB RAM"
    b = re.findall(r"(\d+)\s\w+\sRAM", i.text)
    if b:
        RAM.append(b[0])

In [101]:
features[0].text

'8 GB RAM | 128 GB ROM'

In [104]:
re.findall(r"(\d+)\s\w+\sROM",features[0].text)[0]

'128'

In [111]:
ROM = []
for i in features:
    # capture the number in front of "<unit> ROM", e.g. "128" in "128 GB ROM"
    b = re.findall(r"(\d+)\s\w+\sROM", i.text)
    if b:
        ROM.append(b[0])

In [112]:
ROM

['128',
 '128',
 '128',
 '128',
 '128',
 '128',
 '128',
 '128',
 '128',
 '128',
 '128',
 '128',
 '128',
 '128',
 '128',
 '128',
 '64',
 '256',
 '128',
 '256',
 '128',
 '256',
 '128',
 '128']

In [114]:
features[1].text

'17.02 cm (6.7 inch) Full HD+ Display'

In [117]:
re.findall(r"(\d+\.\d+)\sinch",features[1].text)[0]

'6.7'

In [122]:
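display = []
for i in features:
    # capture the decimal size in inches, e.g. "6.7" in "(6.7 inch)";
    # a whole-number size such as "6 inch" would not match this pattern
    b = re.findall(r"(\d+\.\d+)\sinch", i.text)
    if b:
        display.append(b[0])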

In [123]:
display

['6.7',
 '6.72',
 '6.5',
 '6.5',
 '6.5',
 '6.6',
 '6.6',
 '6.79',
 '6.74',
 '6.6',
 '6.6',
 '6.74',
 '6.79',
 '6.79',
 '6.6',
 '6.5',
 '6.6',
 '6.55',
 '6.72',
 '6.72',
 '6.74',
 '6.72',
 '6.74',
 '6.5']

In [125]:
features[3].text

'5000 mAh Battery'

In [127]:
re.findall(r"(\d+)\smAh",features[3].text)[0]

'5000'

In [128]:
battery = []
for i in features:
    # capture the capacity in mAh, e.g. "5000" in "5000 mAh Battery"
    b = re.findall(r"(\d+)\smAh", i.text)
    if b:
        battery.append(b[0])

In [129]:
battery

['5000',
 '6000',
 '5000',
 '5000',
 '5000',
 '6000',
 '6000',
 '5000',
 '5000',
 '5000',
 '6000',
 '5000',
 '5000',
 '5000',
 '6000',
 '5000',
 '5000',
 '5000',
 '5000',
 '5000',
 '5000',
 '5000',
 '5000',
 '6000']
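
Each loop above filters the flat list of feature lines independently, so if one product lacks, say, a battery line, that list ends up one entry short and every later value shifts to the wrong product. A sketch of one way to keep values aligned: group the lines per product first, then extract each field with `None` for anything missing. It assumes that every product's feature block starts with its "... RAM | ... ROM" line, as in the printed output. `PATTERNS`, `group_by_product` and `extract_specs` are hypothetical names, and the second sample product has its battery line removed on purpose to show the missing case.

```python
import re

# Assumption: a line containing "RAM" marks the start of a new product's features.
PATTERNS = {
    "RAM": r"(\d+)\s*GB\s+RAM",
    "ROM": r"(\d+)\s*GB\s+ROM",
    "Display": r"(\d+(?:\.\d+)?)\s*inch",  # also accepts whole-number sizes
    "Battery": r"(\d+)\s*mAh",
}

def group_by_product(lines):
    """Split a flat list of feature lines into one list per product."""
    products = []
    for line in lines:
        if re.search(r"\bRAM\b", line) or not products:
            products.append([])
        products[-1].append(line)
    return products

def extract_specs(product_lines):
    """Extract each field from one product's lines; missing fields become None."""
    text = " ; ".join(product_lines)
    specs = {}
    for name, pattern in PATTERNS.items():
        m = re.search(pattern, text)
        specs[name] = m.group(1) if m else None
    return specs

sample_lines = [
    "8 GB RAM | 128 GB ROM",
    "17.02 cm (6.7 inch) Full HD+ Display",
    "50MP (OIS) + 50MP | 32MP Front Camera",
    "5000 mAh Battery",
    "Dimensity 7200 Pro Processor",
    "1 Year Manufacturing Warranty",
    # Second product, with its battery line deliberately left out.
    "6 GB RAM | 128 GB ROM | Expandable Upto 1 TB",
    "17.07 cm (6.72 inch) Full HD+ Display",
    "50MP + 2MP | 8MP Front Camera",
    "6 Gen 1 Processor",
]

specs = [extract_specs(p) for p in group_by_product(sample_lines)]
for row in specs:
    print(row)
```

With this layout every product contributes exactly one row, so a missing feature shows up as an explicit gap instead of silently misaligning the columns.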

### Dataframe creation

In [130]:
import pandas as pd

In [131]:
d = {"Brand":brands,
    "Model":models,
    "RAM":RAM,
    "ROM":ROM,
    "Display":display,
    "Color":colors,
    "Battery":battery,
    "Price":prices}
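
`pd.DataFrame` raises an error when the lists in such a dictionary have different lengths, and the message does not say which column is short. A small pre-check, sketched here with a hypothetical helper `check_equal_lengths`, reports the length of every column before the DataFrame is built. The sample dictionary is made up for illustration.

```python
def check_equal_lengths(columns):
    """Return the common length of all columns, or raise with per-column lengths."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"column lengths differ: {lengths}")
    return next(iter(lengths.values()), 0)

# Illustrative data: "Battery" is one entry short.
sample = {
    "Brand": ["Nothing", "vivo"],
    "Price": ["₹23,999", "₹14,999"],
    "Battery": ["5000"],
}

try:
    check_equal_lengths(sample)
except ValueError as err:
    print(err)
```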

In [132]:
df = pd.DataFrame(d)

In [133]:
df

| | Brand | Model | RAM | ROM | Display | Color | Battery | Price |
|---|---|---|---|---|---|---|---|---|
| 0 | Nothing | Nothing Phone | 8 | 128 | 6.7 | 2a) 5G | 5000 | ₹23,999 |
| 1 | vivo | vivo T3x 5G | 6 | 128 | 6.72 | Crimson Bliss | 6000 | ₹14,999 |
| 2 | Motorola | Motorola G34 5G | 8 | 128 | 6.5 | Ocean Green | 5000 | ₹11,999 |
| 3 | Motorola | Motorola G34 5G | 8 | 128 | 6.5 | Ice Blue | 5000 | ₹11,999 |
| 4 | Motorola | Motorola G34 5G | 8 | 128 | 6.5 | Charcoal Black | 5000 | ₹11,999 |
| 5 | MOTOROLA | MOTOROLA g24 Power | 4 | 128 | 6.6 | Ink blue | 6000 | ₹7,999 |
| 6 | MOTOROLA | MOTOROLA g24 Power | 8 | 128 | 6.6 | Ink blue | 6000 | ₹8,999 |
| 7 | POCO | POCO M6 Pro 5G | 4 | 128 | 6.79 | Power Black | 5000 | ₹9,999 |
| 8 | REDMI | REDMI 13C | 4 | 128 | 6.74 | Stardust Black | 5000 | ₹7,699 |
| 9 | MOTOROLA | MOTOROLA G04 | 8 | 128 | 6.6 | Sea Green | 5000 | ₹7,999 |


#### Save the data to the local directory

In [134]:
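# index=False keeps the row index out of the file, so it does not come back
# as an "Unnamed: 0" column when the CSV is read again
df.to_csv("result.csv", index=False)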