
Commit

Documentation, Typo correction, Improve wording, Improve code for better Exception Handling, Improve pipeline function implementation

This tweak will probably end up incrementing the version to 0.1.1
max-efort committed Jun 19, 2023
1 parent b15ee85 commit 83d6ceb
Showing 13 changed files with 887 additions and 85 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -0,0 +1 @@
__pycache__/
46 changes: 26 additions & 20 deletions README.md
@@ -4,9 +4,8 @@ Scraple is a Python library designed to simplify the process of web scraping,
providing easy scraping and easy searching for selectors.

## Installation
-[Pypi page](https://pypi.org/project/scraple/)
-
-You can install the package using pip:
+The package is hosted on [PyPI](https://pypi.org/project/scraple/) and can be
+installed using pip:

```shell
pip install scraple
@@ -16,17 +15,17 @@ pip install scraple
The package provides two main classes: Rules and SimpleExtractor.

#### 1. Rules
-The Rules class allows you to define rules for extracting elements from a web page.
-You can add field rules using the add_field_rule method, which has the capability to
-automatically pick selectors based on a provided string. Also, support for regex
-matching.
+The Rules class allows you to define rules for extracting elements from a reference web page.
+You can pick a selector just by knowing a string that appears in that page: the `add_field_rule` method
+automatically searches for the selector of the element whose text content matches the string.
+Additionally, the `add_field_rule` method supports regular expression matching.

```python
from scraple import Rules

-some_rules = Rules("reference in the form of beautifulSoup4 object, html code or string path to local html file")
+some_rules = Rules("reference in the form of string path to local html file", "local")
some_rules.add_field_rule("a sentence or word exist in reference page", "field name 1")
-some_rules.add_field_rule("some othe.*?sentences", "field name 2", re_flag=True)
+some_rules.add_field_rule("some othe.*?text", "field name 2", re_flag=True)
# Add more field rules...

# It automatically searches for the selector; to see it, print the rule to the console
@@ -35,21 +34,28 @@ some_rules.add_field_rule("some othe.*?sentences", "field name 2", re_flag=True
```

#### 2. SimpleExtractor
-The SimpleExtractor class performs the actual scraping based on the defined rules.
-You provide the Rules object to the SimpleExtractor constructor and use the
-perform_extraction method to create a generator object that iterate dictionary of
-element or text information.
+The SimpleExtractor class performs the actual scraping based on a defined rule.
+A Rules object acts as the "what to extract" and the SimpleExtractor does the extracting, or
+scraping. First, pass a Rules object to the SimpleExtractor constructor, then use the
+`perform_extraction` method to create a generator that yields dictionaries of
+extracted elements.

```python
from scraple import SimpleExtractor

-extractor = SimpleExtractor(some_rules)
-result = extractor.rule(
-    "web page object in the form of beautifulSoup4 object, html code or string path to local html file")
+extractor = SimpleExtractor(some_rules)  # some_rules from the snippet above
+result = extractor.perform_extraction(
+    "web page in the form of beautifulSoup4 object",
+    "parsed"
+)

# print(next(result))
-# {"field name 1": element or text information (if you provide pipeline func.),
-#  "field name 2": ..., ...}
+# {
+#     "field name 1": [element, ...],
+#     "field name 2": ...,
+#     ...
+# }
```
-For more detail, see the [repository](https://github.com/max-efort/scraple)
+For more information and tutorials, see the [documentation](https://github.com/max-efort/scraple/doc) or
+visit the main [repository](https://github.com/max-efort/scraple)
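
Since `perform_extraction` returns a generator, `result` can also be consumed in a loop, one dictionary per matched element. A minimal sketch, reusing `result` from the snippet above and following the iteration pattern used in doc/code_example/example_code.py:

```python
# `result` is the generator created above; each item is a dictionary keyed by field name.
for item in result:
    print(item["field name 1"], item["field name 2"])
```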
118 changes: 118 additions & 0 deletions doc/code_example/Modified Quotes to Scrape.html

Large diffs are not rendered by default.

11 changes: 11 additions & 0 deletions doc/code_example/Modified Quotes to Scrape_files/bootstrap.min.css

Large diffs are not rendered by default.

111 changes: 111 additions & 0 deletions doc/code_example/Modified Quotes to Scrape_files/main.css
@@ -0,0 +1,111 @@
/* Custom page CSS */
body {
font-family: sans-serif;
}

.container .text-muted {
margin: 20px 0;
}

.tags-box {
text-align: right;
}

.tags-box h2 {
margin-top: 0px;
}

.tag-item {
display: block;
margin: 4px;
}

.quote {
padding: 10px;
margin-bottom: 30px;
border: 1px solid #333333;
border-radius: 5px;
box-shadow: 2px 2px 3px #333333;
}

.quote small.author {
font-weight: bold;
color: #3677E8;
}

.quote span.text {
display: block;
margin-bottom: 5px;
font-size: large;
font-style: italic;
}

.quote .tags {
margin-top: 10px;
}

.tag {
padding: 2px 5px;
border-radius: 5px;
color: white;
font-size: small;
background-color: #7CA3E6;
}

a.tag:hover {
text-decoration: none;
}

/* Sticky footer styles */
html {
position: relative;
min-height: 100%;
}

body {
/* Margin bottom by footer height */
margin-bottom: 60px;
}

.footer {
position: absolute;
bottom: 0;
width: 100%;
/* Set the fixed height of the footer here */
height: 6em;
background-color: #f5f5f5;
}

.error {
color: red;
}

.header-box {
padding-bottom: 40px;
}

.header-box p {
margin-top: 30px;
float: right;
}

.author-details {
width: 80%;
}

.author-description {
text-align: justify;
margin-bottom: 20px;
}

ul.pager {
margin-bottom: 100px;
}

.copyright {
text-align: center;
}

.sh-red {
color: #cc0b0f;
}
165 changes: 165 additions & 0 deletions doc/code_example/example_code.py
@@ -0,0 +1,165 @@
from scraple import Rules, SimpleExtractor
# to display the end results in a tabular manner we use pandas
from pandas import concat, DataFrame as Df
# ----------------------------------------------------------------------------------------------------------------------
# End Section
# ----------------------------------------------------------------------------------------------------------------------

# suppose we want to scrape just the quotes
quote_rule = Rules(r"Modified Quotes to Scrape.html", "local")
quote_rule.add_field_rule(
    "It is our choices, Harry, that show what we truly are, far more than our abilities.",
    "Quotes",
)
print(quote_rule)
# >>>
# Parent Selector:
# div.container div.row div.col-md-8 div.quote span.text,
# Field Rule:
# {'Quotes': ('', None)}

# create a DataFrame object to accumulate the iterated items
result_panda = Df()

# scrape using SimpleExtractor; for now we will just scrape the reference page
extract = SimpleExtractor(quote_rule)
extracting = extract.perform_extraction(r"Modified Quotes to Scrape.html", "local")

for index, dictionary in enumerate(extracting, 1):  # iterate dictionaries of scraping results
    result_panda = concat([result_panda, Df(dictionary, index=[index])])

print(result_panda)
# >>>
# Quotes
# 1 [“The world as we have created it is a process...
# 2 [“It is our choices, Harry, that show what we ...
# 3 [“The person, be it gentleman or lady, who has...
# 4 [“Try not to become a man of success. Rather b...

# ----------------------------------------------------------------------------------------------------------------------
# End Section
print("# --------------------------------------------------------------------------------------------------")
# ----------------------------------------------------------------------------------------------------------------------

rules = Rules(r"Modified Quotes to Scrape.html", "local")
# To make defining rules easier we iterate over lists of defined field names, string identifiers and,
# additionally, which occurrence (n-th) of the string to find, and we provide a pipeline to process
# the extracted elements internally.
#
# If any piece of code is confusing, there is more info in the API section.

field_name = [
    "Quote",
    "Author",
    "Tags"
]
# Using part of the string is valid in this case because the whole string is contained in a
# single element whose selector we want to pick.
string_identifier = [
    "It cannot be changed",
    "Einstein",
    "change"
]
find_string_of_nth = [
    1,
    1,
    2  # note that if you look at the page elements, the string "change" occurs in the quote text
       # first, so we need to find the 2nd occurrence to get the element that contains the tag.
]
processor_function = [
    "text",
    "text",
    "tags"
]

change_iterate_queue = [1, 2, 0]  # to show the Author first and the Quote last in the tabular data
for i in change_iterate_queue:
    rules.add_field_rule(
        string=string_identifier[i],
        field_name=field_name[i],
        find_string_of_nth=find_string_of_nth[i],
        pipeline=processor_function[i]
    )
print(rules)
# >>>
# Parent Selector:
# div.container div.row div.col-md-8 div.quote,
# Field Rule:
# {'Author': (' span small.author', <function text at 0x0000017CB4FA9A20>), 'Tags': (' div.tags...


# Scraping using SimpleExtractor class.
extract = SimpleExtractor(rules)

# For this tutorial, we will just use the reference page as the source.
# In the previous example we used:
# extracting = extract.perform_extraction(r"Modified Quotes to Scrape.html", "local")
# The Rules class provides a method that retrieves the BeautifulSoup object of the reference, so we can also use:
extracting = extract.perform_extraction(rules.get_reference_soup(), "parsed")

# create DataFrame object to accumulate the iterated item
result_panda = Df()

for index, dictionary in enumerate(extracting, 1):  # iterate dictionaries of scraping results
    # because one of the values in the dictionary is an array (a list, product of the "tags" pipeline function),
    # we convert it to a string so the DataFrame treats it as one (scalar) value.
    dictionary["Tags"] = ", ".join(dictionary["Tags"])
    result_panda = concat([result_panda, Df(dictionary, index=[index])])

print(result_panda)
# >>>
# Author ... Quote
# 1 Albert Einstein ... “The world as we have created it is a process ...
# 2 ... “It is our choices, Harry, that show what we t...
# 3 Jane Austen ... “The person, be it gentleman or lady, who has ...
# 4 Albert Einstein ... “Try not to become a man of success. Rather be...

# [4 rows x 3 columns]
# ----------------------------------------------------------------------------------------------------------------------
# End Section
print("# --------------------------------------------------------------------------------------------------")
# ----------------------------------------------------------------------------------------------------------------------
# continuation of the previous extraction example
rules.add_field_rule(
    string="Next",
    field_name="Pager",
    pipeline="link"
)

extract = SimpleExtractor(rules)
extracting = extract.perform_extraction(rules.get_reference_soup(), "parsed")

result_panda = Df()

for index, dictionary in enumerate(extracting, 1):
    dictionary["Tags"] = ", ".join(dictionary["Tags"])
    result_panda = concat([result_panda, Df(dictionary, index=[index])])

print(result_panda)
# Author ... Pager
# 1 Albert Einstein Jane Austen Albert Einstein ... https://quotes.toscrape.com/page/2/
#
# [1 rows x 4 columns]
# ----------------------------------------------------------------------------------------------------------------------
# End Section
print("# --------------------------------------------------------------------------------------------------")
# ----------------------------------------------------------------------------------------------------------------------

navigation_rules = Rules(r"Modified Quotes to Scrape.html", "local")
navigation_rules.add_field_rule(
    string="Next",
    field_name="Navigation",
    pipeline="link"
)
nav_selector = navigation_rules.get_parent_selector()
print(nav_selector)
# >>> div.container div.row div.col-md-8 nav ul.pager li.next a

scrap_page = rules.get_reference_soup()
next_link_dict = next(SimpleExtractor(navigation_rules).perform_extraction(scrap_page, "parsed"))
print(next_link_dict["Navigation"])
# >>> https://quotes.toscrape.com/page/2/
# ----------------------------------------------------------------------------------------------------------------------
# End Section
print("# --------------------------------------------------------------------------------------------------")
# ----------------------------------------------------------------------------------------------------------------------
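
The navigation example stops at printing the extracted "Next" link. A minimal sketch of how that link could drive a multi-page scrape, assuming `requests` and `bs4` are installed (neither is used in the committed example), that the live pages share the reference page's structure, and that the generator simply yields nothing when a page has no "Next" link:

```python
import requests
from bs4 import BeautifulSoup

# Reuse quote_rule and navigation_rules defined above.
quote_extractor = SimpleExtractor(quote_rule)
pager_extractor = SimpleExtractor(navigation_rules)

next_url = next_link_dict["Navigation"]  # e.g. https://quotes.toscrape.com/page/2/
while next_url:
    # Fetch and parse the next page, then hand the soup over as a "parsed" source.
    soup = BeautifulSoup(requests.get(next_url).text, "html.parser")
    for item in quote_extractor.perform_extraction(soup, "parsed"):
        print(item["Quotes"])
    # Follow the pager again; stop when the page has no "Next" link.
    links = [d["Navigation"] for d in pager_extractor.perform_extraction(soup, "parsed")]
    next_url = links[0] if links else None
```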