
Commit

Documentation, Typo correction, Improve wording, Improve code for better Exception Handling, Improve pipeline function implementation

This tweak will probably end up incrementing the version to 0.1.1
max-efort committed Jun 19, 2023
1 parent b15ee85 commit 83d6ceb
Showing 13 changed files with 887 additions and 85 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -0,0 +1 @@
__pycache__/
46 changes: 26 additions & 20 deletions README.md
@@ -4,9 +4,8 @@ Scraple is a Python library designed to simplify the process of web scraping,
providing easy scraping and easy searching for selectors.

## Installation
-[Pypi page](https://pypi.org/project/scraple/)
-
-You can install the package using pip:
+The package is hosted on [PyPI](https://pypi.org/project/scraple/) and can be
+installed using pip:

```shell
pip install scraple
@@ -16,17 +15,17 @@ pip install scraple
The package provides two main classes: Rules and SimpleExtractor.

#### 1. Rules
-The Rules class allows you to define rules for extracting elements from a web page.
-You can add field rules using the add_field_rule method, which has the capability to
-automatically pick selectors based on a provided string. Also, support for regex
-matching.
+The Rules class allows you to define rules for extracting elements from a reference web page.
+You can pick a selector just by knowing a string that appears in that page: the `add_field_rule` method
+automatically searches for the selector of the element whose text content matches the string.
+Additionally, the `add_field_rule` method supports regular expression matching.

```python
from scraple import Rules

-some_rules = Rules("reference in the form of beautifulSoup4 object, html code or string path to local html file")
+some_rules = Rules("reference in the form of string path to local html file", "local")
some_rules.add_field_rule("a sentence or word exist in reference page", "field name 1")
-some_rules.add_field_rule("some othe.*?sentences", "field name 2", re_flag=True)
+some_rules.add_field_rule("some othe.*?text", "field name 2", re_flag=True)
# Add more field rules...

# It automatically searches for the selector; to see it, print the rule to the console
@@ -35,21 +34,28 @@ some_rules.add_field_rule("some othe.*?sentences", "field name 2", re_flag=True
```

#### 2. SimpleExtractor
-The SimpleExtractor class performs the actual scraping based on the defined rules.
-You provide the Rules object to the SimpleExtractor constructor and use the
-perform_extraction method to create a generator object that iterate dictionary of
-element or text information.
+The SimpleExtractor class performs the actual scraping based on a defined rule.
+A Rules object acts as the "what to extract" and the SimpleExtractor does the extracting, or
+scraping. First, pass a Rules object to the SimpleExtractor constructor, then use the
+`perform_extraction` method to create a generator that yields dictionaries of
+extracted elements.

```python
from scraple import SimpleExtractor

-extractor = SimpleExtractor(some_rules)
-result = extractor.rule(
-    "web page object in the form of beautifulSoup4 object, html code or string path to local html file")
+extractor = SimpleExtractor(some_rules)  # some_rules from the snippet above
+result = extractor.perform_extraction(
+    "web page in the form of beautifulSoup4 object",
+    "parsed"
+)

# print(next(result))
-# {"field name 1": element or text information (if you provide pipeline func.),
-#  "field name 2": ..., ...}
+# {
+#     "field name 1": [element, ...],
+#     "field name 2": ...,
+#     ...
+# }
```
-For more detail, see the [repository](https://github.com/max-efort/scraple)
+For more information and tutorials, see the [documentation](https://github.com/max-efort/scraple/doc) or
+visit the main [repository](https://github.com/max-efort/scraple)
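
Since `perform_extraction` returns a generator, `result` can also be consumed in a loop, one dictionary per matched element. A minimal sketch, reusing `result` from the snippet above and following the iteration pattern used in doc/code_example/example_code.py:

```python
# `result` is the generator created above; each item is a dictionary keyed by field name.
for item in result:
    print(item["field name 1"], item["field name 2"])
```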
118 changes: 118 additions & 0 deletions doc/code_example/Modified Quotes to Scrape.html

Large diffs are not rendered by default.

11 changes: 11 additions & 0 deletions doc/code_example/Modified Quotes to Scrape_files/bootstrap.min.css

Large diffs are not rendered by default.

111 changes: 111 additions & 0 deletions doc/code_example/Modified Quotes to Scrape_files/main.css
@@ -0,0 +1,111 @@
/* Custom page CSS */
body {
font-family: sans-serif;
}

.container .text-muted {
margin: 20px 0;
}

.tags-box {
text-align: right;
}

.tags-box h2 {
margin-top: 0px;
}

.tag-item {
display: block;
margin: 4px;
}

.quote {
padding: 10px;
margin-bottom: 30px;
border: 1px solid #333333;
border-radius: 5px;
box-shadow: 2px 2px 3px #333333;
}

.quote small.author {
font-weight: bold;
color: #3677E8;
}

.quote span.text {
display: block;
margin-bottom: 5px;
font-size: large;
font-style: italic;
}

.quote .tags {
margin-top: 10px;
}

.tag {
padding: 2px 5px;
border-radius: 5px;
color: white;
font-size: small;
background-color: #7CA3E6;
}

a.tag:hover {
text-decoration: none;
}

/* Sticky footer styles */
html {
position: relative;
min-height: 100%;
}

body {
/* Margin bottom by footer height */
margin-bottom: 60px;
}

.footer {
position: absolute;
bottom: 0;
width: 100%;
/* Set the fixed height of the footer here */
height: 6em;
background-color: #f5f5f5;
}

.error {
color: red;
}

.header-box {
padding-bottom: 40px;
}

.header-box p {
margin-top: 30px;
float: right;
}

.author-details {
width: 80%;
}

.author-description {
text-align: justify;
margin-bottom: 20px;
}

ul.pager {
margin-bottom: 100px;
}

.copyright {
text-align: center;
}

.sh-red {
color: #cc0b0f;
}
165 changes: 165 additions & 0 deletions doc/code_example/example_code.py
@@ -0,0 +1,165 @@
from scraple import Rules, SimpleExtractor
# to display the end results in a tabular manner we use pandas
from pandas import concat, DataFrame as Df
# ----------------------------------------------------------------------------------------------------------------------
# End Section
# ----------------------------------------------------------------------------------------------------------------------

# suppose we want to scrape just the quotes
quote_rule = Rules(r"Modified Quotes to Scrape.html", "local")
quote_rule.add_field_rule(
    "It is our choices, Harry, that show what we truly are, far more than our abilities.",
    "Quotes",
)
print(quote_rule)
# >>>
# Parent Selector:
# div.container div.row div.col-md-8 div.quote span.text,
# Field Rule:
# {'Quotes': ('', None)}

# create a DataFrame object to accumulate the iterated items
result_panda = Df()

# scrape using SimpleExtractor; for now we will just scrape the reference page
extract = SimpleExtractor(quote_rule)
extracting = extract.perform_extraction(r"Modified Quotes to Scrape.html", "local")

for index, dictionary in enumerate(extracting, 1):  # iterate dictionaries of scraping results
    result_panda = concat([result_panda, Df(dictionary, index=[index])])

print(result_panda)
# >>>
# Quotes
# 1 [“The world as we have created it is a process...
# 2 [“It is our choices, Harry, that show what we ...
# 3 [“The person, be it gentleman or lady, who has...
# 4 [“Try not to become a man of success. Rather b...

# ----------------------------------------------------------------------------------------------------------------------
# End Section
print("# --------------------------------------------------------------------------------------------------")
# ----------------------------------------------------------------------------------------------------------------------

rules = Rules(r"Modified Quotes to Scrape.html", "local")
# To make defining rules easier we iterate over lists of defined field names, string identifiers and,
# additionally, which occurrence (n-th) of the string to find, and we provide a pipeline to process
# the extracted elements internally.
#
# If any piece of code is confusing, there is more info in the API section.

field_name = [
    "Quote",
    "Author",
    "Tags"
]
# Using part of the string is valid in this case because the whole string is contained in a
# single element whose selector we want to pick.
string_identifier = [
    "It cannot be changed",
    "Einstein",
    "change"
]
find_string_of_nth = [
    1,
    1,
    2  # note that if you look at the page elements, the string "change" occurs in the quote text
       # first, so we need to find the 2nd occurrence to get the element that contains the tag.
]
processor_function = [
    "text",
    "text",
    "tags"
]

change_iterate_queue = [1, 2, 0]  # to show the Author first and the Quote last in the tabular data
for i in change_iterate_queue:
    rules.add_field_rule(
        string=string_identifier[i],
        field_name=field_name[i],
        find_string_of_nth=find_string_of_nth[i],
        pipeline=processor_function[i]
    )
print(rules)
# >>>
# Parent Selector:
# div.container div.row div.col-md-8 div.quote,
# Field Rule:
# {'Author': (' span small.author', <function text at 0x0000017CB4FA9A20>), 'Tags': (' div.tags...


# Scraping using SimpleExtractor class.
extract = SimpleExtractor(rules)

# For this tutorial, we will just use the reference page as the source.
# In the previous example we used:
# extracting = extract.perform_extraction(r"Modified Quotes to Scrape.html", "local")
# The Rules class provides a method that retrieves the BeautifulSoup object of the reference, so we can also use:
extracting = extract.perform_extraction(rules.get_reference_soup(), "parsed")

# create DataFrame object to accumulate the iterated item
result_panda = Df()

for index, dictionary in enumerate(extracting, 1):  # iterate dictionaries of scraping results
    # because one of the values in the dictionary is an array (a list, product of the "tags" pipeline function),
    # we convert it to a string so the DataFrame treats it as one (scalar) value.
    dictionary["Tags"] = ", ".join(dictionary["Tags"])
    result_panda = concat([result_panda, Df(dictionary, index=[index])])

print(result_panda)
# >>>
# Author ... Quote
# 1 Albert Einstein ... “The world as we have created it is a process ...
# 2 ... “It is our choices, Harry, that show what we t...
# 3 Jane Austen ... “The person, be it gentleman or lady, who has ...
# 4 Albert Einstein ... “Try not to become a man of success. Rather be...

# [4 rows x 3 columns]
# ----------------------------------------------------------------------------------------------------------------------
# End Section
print("# --------------------------------------------------------------------------------------------------")
# ----------------------------------------------------------------------------------------------------------------------
# continuation of the previous extraction example
rules.add_field_rule(
    string="Next",
    field_name="Pager",
    pipeline="link"
)

extract = SimpleExtractor(rules)
extracting = extract.perform_extraction(rules.get_reference_soup(), "parsed")

result_panda = Df()

for index, dictionary in enumerate(extracting, 1):
    dictionary["Tags"] = ", ".join(dictionary["Tags"])
    result_panda = concat([result_panda, Df(dictionary, index=[index])])

print(result_panda)
# Author ... Pager
# 1 Albert Einstein Jane Austen Albert Einstein ... https://quotes.toscrape.com/page/2/
#
# [1 rows x 4 columns]
# ----------------------------------------------------------------------------------------------------------------------
# End Section
print("# --------------------------------------------------------------------------------------------------")
# ----------------------------------------------------------------------------------------------------------------------

navigation_rules = Rules(r"Modified Quotes to Scrape.html", "local")
navigation_rules.add_field_rule(
    string="Next",
    field_name="Navigation",
    pipeline="link"
)
nav_selector = navigation_rules.get_parent_selector()
print(nav_selector)
# >>> div.container div.row div.col-md-8 nav ul.pager li.next a

scrap_page = rules.get_reference_soup()
next_link_dict = next(SimpleExtractor(navigation_rules).perform_extraction(scrap_page, "parsed"))
print(next_link_dict["Navigation"])
# >>> https://quotes.toscrape.com/page/2/
# ----------------------------------------------------------------------------------------------------------------------
# End Section
print("# --------------------------------------------------------------------------------------------------")
# ----------------------------------------------------------------------------------------------------------------------
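
The navigation example stops at printing the extracted "Next" link. A minimal sketch of how that link could drive a multi-page scrape, assuming `requests` and `bs4` are installed (neither is used in the committed example), that the live pages share the reference page's structure, and that the generator simply yields nothing when a page has no "Next" link:

```python
import requests
from bs4 import BeautifulSoup

# Reuse quote_rule and navigation_rules defined above.
quote_extractor = SimpleExtractor(quote_rule)
pager_extractor = SimpleExtractor(navigation_rules)

next_url = next_link_dict["Navigation"]  # e.g. https://quotes.toscrape.com/page/2/
while next_url:
    # Fetch and parse the next page, then hand the soup over as a "parsed" source.
    soup = BeautifulSoup(requests.get(next_url).text, "html.parser")
    for item in quote_extractor.perform_extraction(soup, "parsed"):
        print(item["Quotes"])
    # Follow the pager again; stop when the page has no "Next" link.
    links = [d["Navigation"] for d in pager_extractor.perform_extraction(soup, "parsed")]
    next_url = links[0] if links else None
```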