<a href="https://colab.research.google.com/github/lauramwichekha/Data-science-introduction/blob/main/scraping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Scraping

Web scraping is the automated process of extracting data from websites. It involves using bots or web crawlers to navigate through web pages, retrieve specific information, and store it for analysis or other purposes. This data can include text, images, links, and more. Web scraping is commonly used in various fields like research, data analysis, price monitoring, and competitive intelligence. However, it's important to respect website terms of service and legal regulations while scraping data.
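One practical way to respect a site's rules is to consult its `robots.txt` file before requesting a page. The sketch below uses Python's standard `urllib.robotparser` module. The rules, the `my-scraper` user agent, and the `example.com` URLs are made up for illustration and are inlined so the example runs offline; it is a minimal illustration, not a complete compliance check.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example needs no network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def build_parser(robots_text):
    """Return a RobotFileParser loaded with the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "my-scraper" is a made-up user agent name.
print(parser.can_fetch("my-scraper", "https://example.com/products"))
print(parser.can_fetch("my-scraper", "https://example.com/private/data"))
```

Against a live site, the same parser can load the real file with `set_url(...)` followed by `read()` instead of `parse(...)`.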

## HTML

HTML, or Hypertext Markup Language, is the standard language used to create and design web pages. For a data scientist interested in scraping data, understanding HTML is crucial because it's the structure in which information is organized on most websites.

Here's a brief overview:

1. **Elements**: HTML documents consist of elements, which are enclosed in tags. These tags define the structure of the document and determine how content is displayed. For example, `<p>` is a tag used for paragraphs, `<h1>` to `<h6>` for headings, `<div>` for divisions, etc.

2. **Attributes**: Tags can have attributes that provide additional information about the element. Attributes appear within the opening tag and modify the element's behavior or appearance. For instance, the `<a>` tag for links has an `href` attribute that specifies the URL the link points to.

3. **Hierarchy**: Elements in HTML are nested inside one another, forming a tree. Browsers expose this tree as the Document Object Model (DOM). Understanding this structure helps in navigating and selecting specific elements within a webpage.

4. **Classes and IDs**: Elements can also have `class` and `id` attributes, which are used to style or select them. A class can be shared by many elements with similar styling, while an ID should be unique to a single element on the page.
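The four ideas above can be seen by walking a small HTML fragment with Python's built-in `html.parser` module. This is only a sketch: the `SAMPLE_HTML` fragment and the `OutlineParser` class are invented for illustration, and the depth tracking assumes every opening tag has a matching closing tag (no void elements such as `<img>` or `<br>`).

```python
from html.parser import HTMLParser

# Made-up fragment with nested elements, attributes, a class and an id.
SAMPLE_HTML = """
<div id="catalog" class="listings">
  <p class="product-title">Sample phone</p>
  <a href="/listing/1">Details</a>
</div>
"""


class OutlineParser(HTMLParser):
    """Record each opening tag with its nesting depth and attributes."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.outline.append((self.depth, tag, dict(attrs)))
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


outline_parser = OutlineParser()
outline_parser.feed(SAMPLE_HTML)

# Indentation mirrors how deeply each element is nested in the tree.
for depth, tag, attrs in outline_parser.outline:
    print("  " * depth + f"<{tag}> {attrs}")
```

The `<p>` and `<a>` elements are printed one level deeper than the `<div>` because they are its children in the tree.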


Example of an HTML page:

```html
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>My First Webpage</title>
    <link rel="stylesheet" href="styles.css">
    <style>
        body {
            font-family: Arial, sans-serif;
            margin: 0;
            padding: 0;
            background-color: #f0f0f0;
        }
        header {
            background-color: #333;
            color: #fff;
            padding: 10px 20px;
            text-align: center;
        }
        h1 {
            margin-top: 20px;
            text-align: center;
        }
        p {
            margin: 10px auto;
            max-width: 600px;
            text-align: center;
        }
    </style>
</head>
<body>
    <header>
        <h1>Welcome to My First Webpage</h1>
    </header>
    <main>
        <p>This is a simple HTML page created using basic HTML and CSS.</p>
        <p>Feel free to explore and learn more about web development!</p>
    </main>
</body>
</html>
```

## CSS

CSS, or Cascading Style Sheets, is a stylesheet language used to describe the presentation of a document written in HTML (or XML). It defines how HTML elements are displayed on a web page, including their layout, colors, fonts, and other visual aspects.

```css
/* Resetting default margin and padding for all elements */
* {
    margin: 0;
    padding: 0;
    box-sizing: border-box; /* Ensures padding and border are included in element's total width and height */
}

/* Styling the body */
body {
    font-family: Arial, sans-serif; /* Setting default font */
    background-color: #f0f0f0; /* Setting background color */
}

/* Styling the header */
header {
    background-color: #333; /* Setting background color */
    color: #fff; /* Setting text color */
    padding: 20px; /* Adding padding */
    text-align: center; /* Aligning text to center */
}

/* Styling the main content */
main {
    max-width: 800px; /* Setting maximum width */
    margin: 0 auto; /* Centering the content horizontally */
    padding: 20px; /* Adding padding */
}

/* Styling headings */
h1 {
    font-size: 32px; /* Setting font size */
    margin-bottom: 20px; /* Adding space below heading */
}

/* Styling paragraphs */
p {
    font-size: 16px; /* Setting font size */
    line-height: 1.5; /* Setting line height */
    margin-bottom: 15px; /* Adding space below paragraphs */
}
```

## Kilimall Scraping Example

In [None]:
import requests as rq
from bs4 import BeautifulSoup as bs

In [None]:
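# Request the category page; the timeout keeps the call from hanging indefinitely.
# Note: the bare "&" in "Phones%20&%20Accessories" ends the ctgName parameter early; write it as %26 if the full category name is intended.
response = rq.get("https://www.kilimall.co.ke/search-result?id=872&form=category&ctgName=Phones%20&%20Accessories", timeout=30)
response.raise_for_status()  # stop early on HTTP errors (4xx/5xx)

soup = bs(response.content, "html.parser")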

In [None]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width,initial-scale=1.0,maximum-scale=1.0,user-scalable=0;" name="viewport"/>
  <title>
   Find what you favorite Phones  here
  </title>
  <meta content="all" name="robots"/>
  <meta content="OkbLQWAI5EwE1ZYBnoqkWAOIASsjGbOvnK09aFksCsY" name="google-site-verification"/>
  <meta content="kaecobftnbyknc6ln51jgopse42kfi" name="facebook-domain-verification"/>
  <link crossorigin="use-credentials" href="/js/manifest.webmanifest" rel="manifest"/>
  <link href="https://image.kilimall.com/kenya/new_wap/logo/icon_48.ico" rel="icon" type="image/x-icon"/>
  <script src="/js/globalThis.min.js">
  </script>
  <script async="" src="https://www.googletagmanager.com/gtag/js?id=AW-965268020">
  </script>
  <script>
   window.dataLayer = window.dataLayer || [];
          function gtag(){dataLayer.push(arguments);}
          gtag('js', new Date());
          gtag('config', 'AW-965268020');
  </script>
  <script async=

In [None]:
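# Outer container that holds the search results
listing_wrapper = soup.find("div", {"class": "result-listings-wrapper"})

In [None]:
# The list of products sits inside the wrapper
listings = listing_wrapper.find("div", {"class": "listings"})

In [None]:
# Each product is its own "listing-item" div
items = listings.find_all("div", {"class": "listing-item"})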

[<div class="listing-item" data-v-b41b726d=""><div class="inner-listing" data-v-b41b726d=""><div class="product-item" data-v-a810dad4="" data-v-b41b726d=""><a data-v-a810dad4="" href="/listing/2697191?title=Refurbished+Nokia+6310+Classic+Design,+Wireless+FM+Feature+Phone+-+Black&amp;image=https://image.kilimall.com/kenya/shop/store/goods/9349/2023/03/167777238245704712a700fa34fbe9711939c69ec9a99_360.png.webp%23&amp;source=&amp;skuId=18934713" target="_blank"><div class="product-image" data-v-a810dad4=""><img data-v-a810dad4=""/><!-- --></div><div class="info-box" data-v-a810dad4=""><p class="product-title" data-v-a810dad4="">Refurbished Nokia 6310 Classic Design, Wireless FM Feature Phone - Black</p><div class="product-price" data-v-a810dad4="" style="text-align:left;">KSh 1,500</div><!-- --><div class="rate" data-v-a810dad4=""><div aria-disabled="false" aria-readonly="false" class="van-rate" data-v-a810dad4="" role="radiogroup" tabindex="0"><!--[--><div aria-checked="true" aria-posins
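`find` returns `None` when nothing matches, so the chained lookups above raise `AttributeError` if the page layout changes. The sketch below repeats the same wrapper, listings, items chain on a small inline fragment so it runs offline. The fragment, the `extract_items` helper, and the `sample_items` name are invented for illustration; only the class names are taken from the page shown above.

```python
from bs4 import BeautifulSoup

# Made-up fragment that mimics the class names used on the results page.
SAMPLE_HTML = """
<div class="result-listings-wrapper">
  <div class="listings">
    <div class="listing-item">
      <p class="product-title">Sample phone A</p>
      <div class="product-price">KSh 1,500</div>
    </div>
    <div class="listing-item">
      <p class="product-title">Sample phone B</p>
      <div class="product-price">KSh 12,450</div>
    </div>
  </div>
</div>
"""


def extract_items(html):
    """Return the listing-item divs, or an empty list if a container is missing."""
    soup = BeautifulSoup(html, "html.parser")

    wrapper = soup.find("div", {"class": "result-listings-wrapper"})
    if wrapper is None:
        return []

    listings = wrapper.find("div", {"class": "listings"})
    if listings is None:
        return []

    return listings.find_all("div", {"class": "listing-item"})


sample_items = extract_items(SAMPLE_HTML)
print(len(sample_items))

# A page without the expected containers yields an empty list instead of an error.
print(len(extract_items("<html><body></body></html>")))
```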

In [None]:
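data = []
for item in items:
    name = item.find("p", {"class": "product-title"})
    price = item.find("div", {"class": "product-price"})

    # Skip listings missing a title or price; calling .text on None would raise AttributeError
    if name is None or price is None:
        continue

    data.append([name.text.strip(), price.text.strip()])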

In [None]:
print(data)

[['Refurbished Nokia 6310 Classic Design, Wireless FM Feature Phone - Black', 'KSh 1,500'], ['Infinix Hot 40i 128GB+4GB / 256GB+8GB 6.56" Screen 5000 mAh Battery Capacity With 18W Type-C Charging 50MP Dual AI Camera Fingerprint Lock Smart Phone', 'KSh 12,450'], ['TECNO SPARK 20 128GB +8(4+4)GB RAM 90Hz 6.6" MTK G85 Gaming Processor 50MP+32MP Dual Speakers Android 13 5000mAh Smart Phones 4G-Dual SIM Side Fingerprint', 'KSh 14,399'], ['Refurbished Phones Nokia 105 (2017)- Original Nokia Phone 800 mAh 1.5" Dual SIM Cards Unlocked Cheap Durable Old Phone Classic Feature', 'KSh 1,200'], ["Refurbished OPPO R9 oppor9 (F1s, F1, f1plus) - 5.5'', 64GB + 4GB, 3350mAh 13mp + 16mp, 4G dual sim Original  OLED smart phones fashion phone fingerprint unlock", 'KSh 6,599'], ["Refurbished  phones  OPPO   A57 F1    Smartphone   32+3GB 5.2'' Smartphones 16mp+13MP 2900mAh witDual h fingerprint unlocking Gold      Single Card   Unlock Smart Phone", 'KSh 4,977'], ['Refurbished OPPO A59s  F1s Smart Phone 5.5 i
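The scraped prices are strings such as `KSh 1,500`, which cannot be sorted or averaged as numbers. A minimal sketch of one way to convert them follows. The `parse_price` helper is hypothetical, and it assumes prices are whole shillings with commas as thousands separators, as in the output above.

```python
import re


def parse_price(price_text):
    """Convert a string like 'KSh 1,500' to the integer 1500.

    Returns None when the text contains no digits.
    Assumes whole-number prices with ',' as the thousands separator.
    """
    match = re.search(r"\d[\d,]*", price_text)
    if match is None:
        return None
    return int(match.group().replace(",", ""))


# Two rows copied from the scraped output above.
sample_rows = [
    ["Refurbished Nokia 6310 Classic Design, Wireless FM Feature Phone - Black", "KSh 1,500"],
    ["Refurbished Phones Nokia 105 (2017)- Original Nokia Phone 800 mAh 1.5\" Dual SIM Cards Unlocked Cheap Durable Old Phone Classic Feature", "KSh 1,200"],
]

numeric_prices = [parse_price(price) for _, price in sample_rows]
print(numeric_prices)
```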

In [None]:
import pandas as pd

df = pd.DataFrame(data, columns = ['Name', 'Price'])

df.head()

|   | Name | Price |
|---|------|-------|
| 0 | Refurbished Nokia 6310 Classic Design, Wireles... | KSh 1,500 |
| 1 | Infinix Hot 40i 128GB+4GB / 256GB+8GB 6.56" Sc... | KSh 12,450 |
| 2 | TECNO SPARK 20 128GB +8(4+4)GB RAM 90Hz 6.6" M... | KSh 14,399 |
| 3 | Refurbished Phones Nokia 105 (2017)- Original ... | KSh 1,200 |
| 4 | Refurbished OPPO R9 oppor9 (F1s, F1, f1plus) -... | KSh 6,599 |
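To keep the scraped rows for later analysis, they can be written out as CSV. The sketch below uses the standard library's `csv` module rather than pandas and writes to an in-memory buffer so it runs without touching the disk. The `rows_to_csv` helper is hypothetical, and the two rows are copied from the output above.

```python
import csv
import io

# Two rows copied from the scraped output above.
rows = [
    ["Refurbished Nokia 6310 Classic Design, Wireless FM Feature Phone - Black", "KSh 1,500"],
    ["Refurbished Phones Nokia 105 (2017)- Original Nokia Phone 800 mAh 1.5\" Dual SIM Cards Unlocked Cheap Durable Old Phone Classic Feature", "KSh 1,200"],
]


def rows_to_csv(rows, header=("Name", "Price")):
    """Serialize the header and rows to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(header)
    writer.writerows(rows)  # fields containing commas or quotes are quoted automatically
    return buffer.getvalue()


csv_text = rows_to_csv(rows)
print(csv_text)
```

With pandas already loaded, `df.to_csv(...)` is the more direct route to a file on disk.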
