Refactor cyberowl core code #32

Merged
karimhabush merged 21 commits into main on Aug 16, 2022

Conversation

karimhabush
Owner

@karimhabush karimhabush commented Aug 10, 2022

Description

  • Refactor all of cyberowl's spider classes.
  • Use Scrapy Item Pipelines to process each scraped item.
  • Save items to a markdown file within the item pipelines.
  • Switch the project's dependency management from venv to Poetry.
  • Refactor the GitHub Actions config file for cyberowl.
  • Use mdutils for markdown generation.
  • Create a spider abstraction class (static / dynamic scrapers).
  • Refactor mdtemplate and create utils for markdown functions.

Type of change

  • Breaking change (fix or feature that would cause existing functionality not to work as expected)

@karimhabush karimhabush added the enhancement label Aug 10, 2022
src/items.py Outdated (resolved)
src/main.py Outdated
-    except Exception:
-        raise ValueError("Error in the spiders!")
+    except Exception as exc:
+        raise ValueError("Error in the spiders!") from exc
Collaborator

Can you please explain what `from exc` does here?

Owner Author

`from exc` chains the exceptions: it records the original exception as the cause of the new `ValueError`, so whenever this exception is raised, the traceback also shows the exception that led to it.
However, I am still trying to find a way to implement it properly, because it currently seems unnecessary.
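
To make the effect of `from exc` concrete, here is a minimal sketch of exception chaining. The `run_spiders` function and the `ConnectionError` it raises are hypothetical stand-ins and are not taken from this PR; only the `except` / `raise ... from exc` shape mirrors the diff above.

```python
def run_spiders():
    # Hypothetical stand-in for whatever starts the spiders.
    raise ConnectionError("site unreachable")


def main():
    try:
        run_spiders()
    except Exception as exc:
        # `from exc` stores the original exception on the new one's
        # __cause__ attribute, so the traceback prints both.
        raise ValueError("Error in the spiders!") from exc


try:
    main()
except ValueError as err:
    # The original exception is still reachable from the new one.
    print(repr(err.__cause__))  # ConnectionError('site unreachable')
```

Without `from exc`, Python would still print the original exception as context ("During handling of the above exception, another exception occurred"), but `__cause__` would stay `None`.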


class Template:
    """
    This class is used to format the data into a table in markdown format.
Collaborator

Can we add descriptions of the attributes, if possible?

Owner Author

Sure thing, but I think it would be better to use mdutils to generate the markdown file and delete this class. 😅

Comment on lines 12 to 38
    source: str
    data: list

    def __init__(self, _source, _data):
        self.source = _source
        self.data = _data

    def _set_heading(self):
        return f"""---\n### {self.source} [:arrow_heading_up:](#cyberowl)\n"""

    def _set_table_headers(self):
        return """|Title|Description|Date|\n|---|---|---|\n"""

    def _set_table_content(self, title, link, description, date):
        return f"""| [{title}]({link}) | {description} | {date} |\n"""

    def fill_table(self) -> str:
        """
        Returns a table ready to be written to a file.
        """
        table = self._set_heading()
        table += self._set_table_headers()
        for row in self.data:
            table += self._set_table_content(
                row["title"], row["link"], row["description"], row["date"]
            )
        return table
Collaborator

Suggested change
    def __init__(self, _source: str, _data: list):
        # Store the values under private names so the read-only
        # properties below do not call themselves recursively.
        self._source = _source
        self._data = _data

    @property
    def source(self) -> str:
        """Returns the source"""
        return self._source

    @property
    def data(self) -> list:
        """Returns the data"""
        return self._data

    @property
    def heading(self) -> str:
        """Returns the heading"""
        return f"""---\n### {self.source} [:arrow_heading_up:](#cyberowl)\n"""

    @property
    def table_headers(self) -> str:
        """Returns the table headers"""
        return """|Title|Description|Date|\n|---|---|---|\n"""

    def _set_table_content(self, title, link, description, date) -> str:
        """Returns one table row"""
        return f"""| [{title}]({link}) | {description} | {date} |\n"""

    def fill_table(self) -> str:
        """
        Returns a table ready to be written to a file.
        """
        table = self.heading
        table += self.table_headers
        for row in self.data:
            table += self._set_table_content(
                row["title"], row["link"], row["description"], row["date"]
            )
        return table

Owner Author

Looks great to me!
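
One thing to watch with read-only properties like these: the stored value needs a different name than the property. A property called `source` that returns `self.source` would call itself, and assigning `self.source = ...` in `__init__` fails when the property has no setter. The sketch below shows the backing-attribute pattern; `TemplateSketch` and the sample source name are hypothetical and only the heading format is taken from this thread.

```python
class TemplateSketch:
    """Hypothetical, trimmed-down version of the Template class."""

    def __init__(self, source: str, data: list):
        # Private backing attributes; the public names are properties.
        self._source = source
        self._data = data

    @property
    def source(self) -> str:
        """Returns the source"""
        return self._source

    @property
    def heading(self) -> str:
        """Returns the heading"""
        return f"---\n### {self.source} [:arrow_heading_up:](#cyberowl)\n"


template = TemplateSketch("Example-Source", [])
print(template.heading)
```

Because `source` has no setter, `template.source = "other"` raises `AttributeError`, which is usually the intent of a read-only property.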

src/pipelines.py Outdated
"""
AlertPipeline class
"""

Collaborator

Is an `__init__` missing here?

src/pipelines.py Outdated
    Remove special characters from text.
    """
    return (
        text.replace("\n", "")
Collaborator

Not sure how far this list can grow, but I think you should create a list and loop over it to remove whatever you want, e.g.:

special_characters = ['\n','\r','  ','|']
return text.translate({ord(character): '' for character in special_characters})

Owner Author

This won't work because `ord()` expects a single character, and `'  '` (two spaces) is a two-character string.
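
One way to keep the "list plus loop" idea while still supporting multi-character entries is to call `str.replace` for each entry. This is a sketch rather than the code that was merged; the function name `remove_special_sequences` and the constant `SPECIAL_SEQUENCES` are hypothetical, and the list contents are the ones from the suggestion above.

```python
# Entries may be longer than one character, e.g. a double space.
SPECIAL_SEQUENCES = ["\n", "\r", "  ", "|"]


def remove_special_sequences(text: str) -> str:
    """Remove every listed sequence from text, in list order."""
    for sequence in SPECIAL_SEQUENCES:
        text = text.replace(sequence, "")
    return text


print(remove_special_sequences("CVE-1 |  critical\r\n"))  # CVE-1 critical
```

The entries are applied in order, so removing a later entry can leave behind a sequence that an earlier pass would have caught (removing `|` from `a | b` leaves a double space). If that matters, a second pass or a regular expression may be a better fit.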

src/pipelines.py Outdated

    def open_spider(self, spider):
        """
        Open spider
Collaborator

Why does this function take `spider` as a parameter and then just set `result` to an empty list?

Owner Author

Scrapy calls `open_spider` with the spider as an argument, so the signature has to accept it even if the body does not use it. However, we can use `*args` and `**kwargs` instead.
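
For context, Scrapy's item pipeline hooks are `open_spider(self, spider)`, `process_item(self, item, spider)` and `close_spider(self, spider)`. The sketch below shows why resetting a list in `open_spider` is useful even when `spider` itself goes unused there. It is not the project's actual pipeline: the class name `SketchAlertPipeline`, the `table` attribute, and the item fields are assumptions, and the markdown is kept in memory instead of being written to a file.

```python
class SketchAlertPipeline:
    """Hypothetical pipeline that collects items and renders a table."""

    def open_spider(self, spider):
        # Called once when a spider starts: reset per-run state so
        # items from an earlier spider do not leak into this one.
        self.result = []

    def process_item(self, item, spider):
        # Called for every scraped item; must return the item so that
        # later pipelines can keep processing it.
        self.result.append(item)
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes. Here the spider is
        # actually used: its name becomes the section heading.
        rows = "".join(
            f"| {item['title']} | {item['date']} |\n" for item in self.result
        )
        self.table = f"### {spider.name}\n|Title|Date|\n|---|---|\n{rows}"
```

The hooks share one signature style so the framework can call every pipeline the same way, whether or not a given hook needs the spider.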

src/pipelines.py Outdated (resolved)
    title_selector = "descendant-or-self::h3/span/a/text()"
    description_selector = "descendant-or-self::div[contains(@class,'field-content')]/p"

    def parse(self, response):
Collaborator

Isn't this function repeated across spiders? If so, can we move it somewhere it can be called by all of them?

Owner Author

@karimhabush karimhabush Aug 11, 2022

Yes, it is, but it is a method of the abstract class Spider, so I don't think we can simply move it elsewhere. Please let me know if you can suggest an implementation.
What I would do is add another layer of inheritance: abstract all the implemented spiders into one class that takes the website URL and the selectors as arguments. Let me know what you think.
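
The extra inheritance layer described above could look roughly like this. It is a dependency-free sketch, not the abstraction that was merged: in the real project the base would presumably extend `scrapy.Spider`, and the names `SelectorSpiderSketch`, `ExampleAlertsSpider`, `block_selector`, the example URL, and the block XPath are all hypothetical. Only the title and description selectors come from the diff above.

```python
class SelectorSpiderSketch:
    """Hypothetical shared base: subclasses only declare URL and selectors."""

    name = ""
    start_urls = []
    block_selector = ""
    title_selector = ""
    description_selector = ""

    def parse(self, response):
        # `response` is assumed to follow Scrapy's selector API:
        # .xpath(query) yields nodes, and .get() returns the first
        # match or None.
        for block in response.xpath(self.block_selector):
            yield {
                "title": block.xpath(self.title_selector).get(),
                "description": block.xpath(self.description_selector).get(),
            }


class ExampleAlertsSpider(SelectorSpiderSketch):
    """A concrete spider is reduced to configuration."""

    name = "example-alerts"
    start_urls = ["https://example.com/alerts"]  # placeholder URL
    block_selector = "//div[contains(@class,'views-row')]"  # assumed
    title_selector = "descendant-or-self::h3/span/a/text()"
    description_selector = (
        "descendant-or-self::div[contains(@class,'field-content')]/p"
    )
```

With this shape, `parse` lives in one place and each new source only supplies class attributes, which is one way to address the repetition raised in the review.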

Collaborator

@safoinme safoinme left a comment

Really nice changes!! looking forward to seeing this merged 🚀🚀
I have left a few comments

@karimhabush
Owner Author

> Really nice changes!! looking forward to seeing this merged 🚀🚀 I have left a few comments

L3zzz!! I'll be adding more changes to review 🙌😅

@karimhabush karimhabush linked an issue Aug 13, 2022 that may be closed by this pull request
@karimhabush karimhabush merged commit 58d27bb into main Aug 16, 2022
@sonarcloud

sonarcloud bot commented Aug 16, 2022

Kudos, SonarCloud Quality Gate passed!

  • Bugs: 0 (rating A)
  • Vulnerabilities: 0 (rating A)
  • Security Hotspots: 0 (rating A)
  • Code Smells: 5 (rating A)
  • Coverage: no coverage information
  • Duplication: 0.0%

Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Refactor code and release version 1.0.0
2 participants