Merge pull request #18 from Vedant950/main
Bug fix and minor update
mldsveda authored Feb 26, 2022
2 parents 9976dbf + 05067b6 commit 6d3ff98
Showing 8 changed files with 374 additions and 25 deletions.
2 changes: 1 addition & 1 deletion LICENSE
@@ -1,6 +1,6 @@
MIT License

Copyright (c) 2021 Vedant Tibrewal, Vedaant Singh.
Copyright (c) 2022 Vedant Tibrewal, Vedaant Singh.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
Expand Down
33 changes: 19 additions & 14 deletions README.md
@@ -6,9 +6,9 @@

## PyScrappy: powerful Python data scraping toolkit

[![forthebadge made-with-python](http://ForTheBadge.com/images/badges/made-with-python.svg)](https://www.python.org/)

[![Python 3.6](https://img.shields.io/badge/python-3.6-blue.svg)](https://www.python.org/downloads/release/python-360/)
[![PyPI Latest Release](https://img.shields.io/pypi/v/PyScrappy.svg)](https://pypi.org/project/PyScrappy/)

[![Package Status](https://img.shields.io/pypi/status/PyScrappy.svg)](https://pypi.org/project/PyScrappy/)
@@ -21,21 +21,22 @@

[![](https://img.shields.io/badge/pyscrappy-official%20documentation-blue)](https://pyscrappy.netlify.app/)


## What is it?

**PyScrappy** is a Python package that provides a fast, flexible, and exhaustive way to scrape data from a variety of sources. An easy and intuitive library, it aims to be the fundamental high-level building block for scraping **data** in Python. Additionally, it has the broader goal of becoming **the most powerful and flexible open source data scraping tool available**.

## Main Features

Here are just a few of the things that PyScrappy does well:

- Easy scraping of [**Data**](https://medium.com/analytics-vidhya/web-scraping-in-python-using-the-all-new-pyscrappy-5c136ed6906b) available on the internet
- Returns a [**DataFrame**](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) for further analysis and research purposes.
- Automatic [**Data Scraping**](https://medium.com/analytics-vidhya/web-scraping-in-python-using-the-all-new-pyscrappy-5c136ed6906b): Other than a few user input parameters the whole process of scraping the data is automatic.
- Powerful, flexible

## Where to get it

The source code is currently hosted on GitHub at:
https://github.com/mldsveda/PyScrappy

@@ -47,13 +48,14 @@
```
pip install PyScrappy
```
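
After installation, the typical workflow shown in the library's docstrings (see `src/PyScrappy.py` below) is to create a scrapper object and call one of its methods, which returns a pandas DataFrame. A minimal sketch, assuming `ECommerceScrapper` is importable from the top-level `PyScrappy` module and can be instantiated without arguments:

```python
# Minimal usage sketch; class and method names are taken from src/PyScrappy.py,
# but instantiating ECommerceScrapper with no arguments is an assumption.
import PyScrappy as ps

obj = ps.ECommerceScrapper()

# Scrape 3 pages of Flipkart search results; returns a pandas DataFrame
df = obj.flipkart_scrapper("headphones", 3)
print(df.head())
```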

## Dependencies
- [selenium - Selenium is a free (open-source) automated testing framework used to validate web applications across different browsers and platforms.](https://www.selenium.dev/)
- [webdriver-manger - WebDriverManager is an API that allows users to automate the handling of driver executables like chromedriver.exe, geckodriver.exe etc required by Selenium WebDriver API. Now let us see, how can we set path for driver executables for different browsers like Chrome, Firefox etc.](https://github.com/bonigarcia/webdrivermanager)
- [beautifulsoup4 - Beautiful Soup is a Python library for getting data out of HTML, XML, and other markup languages.](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [pandas - Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.](https://pandas.pydata.org/)

- [selenium](https://www.selenium.dev/) - Selenium is a free, open-source browser-automation framework used to drive and test web applications across different browsers and platforms.
- [webdriver-manager](https://github.com/bonigarcia/webdrivermanager) - WebDriverManager automates the management of the driver executables (e.g. chromedriver, geckodriver) required by the Selenium WebDriver API.
- [beautifulsoup4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) - Beautiful Soup is a Python library for pulling data out of HTML, XML, and other markup languages.
- [pandas](https://pandas.pydata.org/) - pandas is a fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation tool built on top of Python.
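
These pieces fit together roughly as in the sketch below, which mirrors the driver setup used in `src/amazon.py` later in this commit; the URL is a placeholder and the exact options are illustrative:

```python
# Hedged sketch of how selenium and webdriver-manager are combined in the scrappers.
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

options = webdriver.ChromeOptions()
options.add_argument("--headless")  # run Chrome without opening a window

# webdriver-manager downloads a matching chromedriver and returns its path
driver = webdriver.Chrome(ChromeDriverManager().install(), options=options)

driver.get("https://example.com")   # placeholder URL
print(driver.title)
driver.quit()
```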

## License

[MIT](https://github.com/mldsveda/PyScrappy/blob/main/LICENSE)

## Getting Help
@@ -62,16 +64,19 @@
For usage questions, the best place to go to is [StackOverflow](https://stackoverflow.com).
Further, general questions and discussions can also take place on GitHub in this [repository](https://github.com/mldsveda/PyScrappy).

## Discussion and Development

Most development discussions take place on GitHub in this [repository](https://github.com/mldsveda/PyScrappy).

Also visit the official documentation of [PyScrappy](https://pyscrappy.netlify.app/) for more information.

## Contributing to PyScrappy

All contributions, bug reports, bug fixes, documentation improvements, enhancements, and ideas are welcome.

If you are simply looking to start working with the PyScrappy codebase, navigate to the [GitHub "issues" tab](https://github.com/mldsveda/PyScrappy/issues) and start looking through interesting issues.
If you are simply looking to start working with the PyScrappy codebase, navigate to the GitHub ["issues"](https://github.com/mldsveda/PyScrappy/issues) tab and start looking through interesting issues.

## End Notes
*Learn More about this package on [Medium](https://medium.com/analytics-vidhya/web-scraping-in-python-using-the-all-new-pyscrappy-5c136ed6906b).*

### ***This package is solely made for educational and research purposes.***
_Learn More about this package on [Medium](https://medium.com/analytics-vidhya/web-scraping-in-python-using-the-all-new-pyscrappy-5c136ed6906b)._

### **_This package is solely made for educational and research purposes._**
6 changes: 3 additions & 3 deletions setup.py
@@ -6,21 +6,21 @@

setuptools.setup(
name="PyScrappy",
version="0.0.9",
version="0.1.0",
author="Vedant Tibrewal, Vedaant Singh",
author_email="mlds93363@gmail.com",
description="Powerful web scraping tool.",
long_description=long_description,
long_description_content_type="text/markdown",
url="https://github.com/mldsveda/PyScrappy",
keywords=['PyScrappy', 'Scraping', 'E-Commerce', 'Wikipedia', 'Image Scrapper', 'YouTube', 'Scrapy', 'Twitter', 'Social Media', 'Web Scraping', 'News', 'Stocks', 'Songs', 'Food', 'Instagram'],
keywords=['PyScrappy', 'Scraping', 'E-Commerce', 'Wikipedia', 'Image Scrapper', 'YouTube', 'Scrapy', 'Twitter', 'Social Media', 'Web Scraping', 'News', 'Stocks', 'Songs', 'Food', 'Instagram', 'Movies'],
classifiers=[
"Programming Language :: Python :: 3",
"License :: OSI Approved :: MIT License",
"Operating System :: OS Independent",
],
python_requires=">=3.6",
py_modules=["PyScrappy", "alibaba", "flipkart", "image", "instagram", "news", "snapdeal", "soundcloud", "stock", "swiggy", "twitter", "wikipedia", "youtube", "zomato"],
py_modules=["PyScrappy", "alibaba", "amazon", "flipkart", "image", "imdb", "instagram", "news", "snapdeal", "soundcloud", "spotify", "stock", "swiggy", "twitter", "wikipedia", "youtube", "zomato"],
package_dir={"": "src"},
install_requires=[
'selenium',
151 changes: 144 additions & 7 deletions src/PyScrappy.py
@@ -5,8 +5,9 @@ class ECommerceScrapper():
ECommerce Scrapper: Helps in scraping data from e-commerce websites
1. Alibaba
2. Flipkart
3. Snapdeal
2. Amazon
3. Flipkart
4. Snapdeal
Type: class
@@ -54,6 +55,40 @@ def alibaba_scrapper(self, product_name, n_pages):
return alibaba.scrappi(product_name, n_pages)


############## Amazon Scrapper ##############
def amazon_scrapper(self, product_name, n_pages):

"""
Amazon Scrapper: Helps in scraping Amazon data ('Description', 'Rating', 'Votes', 'Offer Price', 'Actual Price').
return type: DataFrame
Parameters
------------
product_name: Enter the name of the desired product
Type: str
n_pages: Enter the number of pages that you want to scrape
Type: int
Note
------
Both arguments are required.
If n_pages == 0: A prompt will ask you to enter a valid page number and the scrapper will re-run.
Example
---------
>>> obj.amazon_scrapper('product', 3)
out:    Description    Rating    Votes    Offer Price    Actual Price
        product a      3.5       440      $340           $440
        product b      4.5       240      $140           $240
"""

import amazon
return amazon.scrappi(product_name, n_pages)


############## Flipkart Scrapper ##############
def flipkart_scrapper(self, product_name, n_pages):

@@ -79,8 +114,8 @@ def flipkart_scrapper(self, product_name, n_pages):
---------
>>> obj.flipkart_scrapper("Product Name", 3)
out: Name Price Original Price Description Rating
abc ₹340 ₹440 Product 4.2
aec ₹140 ₹240 Product 4.7
"""

@@ -113,8 +148,8 @@ def snapdeal_scrapper(self, product_name, n_pages):
---------
>>> obj.snapdeal_scrapper('product', 3)
out: Name Price Original Price Number of Ratings
abc ₹340 ₹440 40
aec ₹140 ₹240 34
"""

@@ -216,7 +251,7 @@ def image_scrapper(data_name, n_images=10, img_format='jpg', folder_name='images

"""
Image Scrapper: Helps in scraping images from "Google", "Yahoo", "Bing".
Downloads it to the desired folder.
Parameters
@@ -257,6 +292,75 @@ def image_scrapper(data_name, n_images=10, img_format='jpg', folder_name='images

########################################################################################################################

############## IMDB Scrapper ##############
def imdb_scrapper(genre, n_pages):

"""
IMDB Scrapper: Helps in scraping movies from IMDb.
return type: DataFrame
Parameters
------------
genre: Enter the genre of the movie
Type: str
n_pages: Enter the number of pages to scrape in a single run.
Type: int
Note
------
Both parameters are required.
Example
---------
>>> imdb_scrapper('action', 4)
out: Title Year Certificate Runtime Genre Rating Description Stars Directors Votes
asd 2022 UA 49min action 3.9 about the.. asd dfgv 23
scr 2022 15+ 89min action 4.9 about the.. add dfgv 23
"""

import imdb
return imdb.scrappi(genre, n_pages)

########################################################################################################################

############## LinkedIn Scrapper ##############
def linkedin_scrapper(job_title, n_pages):

"""
LinkedIn Scrapper: Helps in scraping job-related data from LinkedIn (Job Title, Company Name, Location, Salary, Benefits, Date)
return type: DataFrame
Parameters
------------
job_title: Enter the job title or type.
Type: str
n_pages: Enter the number of pages to scrape in a single run.
Type: int
Note
------
Both parameters are required.
Example
---------
>>> linkedin_scrapper('python', 1)
out: Job Title Company Name Location Salary Benefits Date
abc PyScrappy US 2300 Actively Hiring +1 1 day ago
abc PyScrappy US 2300 Actively Hiring +1 1 day ago
...
..
"""

import linkedin
return linkedin.scrappi(job_title, n_pages)

########################################################################################################################

############## News Scrapper ##############
def news_scrapper(n_pages, genre = str()):

@@ -530,6 +634,39 @@ def soundcloud_scrapper(self, track_name, n_pages):
import soundcloud
return soundcloud.soundcloud_tracks(track_name, n_pages)


############## Spotify Scrapper ##############
def spotify_scrapper(self, track_name, n_pages):

"""
Spotify Scrapper: Helps in scraping data from Spotify ('Id', 'Title', 'Singers', 'Album', 'Duration')
return type: DataFrame
Parameters
------------
track_name: Enter the name of the desired track/song/music/artist/podcast
Type: str
n_pages: The number of pages to scrape in a single run
Type: int
Note
------
Make sure to enter a valid name
Example
---------
>>> obj.spotify_scrapper('pop', 3)
out: Id Title Singers Album Duration
1 abc abc abc 2:30
2 def def def 2:30
"""

import spotify
return spotify.scrappi(track_name, n_pages)

########################################################################################################################

############## stock Scrapper ##############
60 changes: 60 additions & 0 deletions src/amazon.py
@@ -0,0 +1,60 @@
import pandas as pd
from time import sleep
from webdriver_manager.chrome import ChromeDriverManager
from selenium import webdriver

def func(cards):
    data = []
    for card in cards:
        # Result cards vary in layout, so try the known container layouts in turn
        try: info = card.find_element_by_class_name("s-card-container").find_element_by_xpath("./div/div[3]")
        except:
            try: info = card.find_element_by_class_name("s-card-container").find_element_by_xpath("./div/div[2]")
            except:
                try: info = card.find_element_by_class_name("s-card-container").find_element_by_xpath("./div/div/div[3]")
                except: info = card.find_element_by_class_name("s-card-container").find_element_by_xpath("./div/div/div[2]")
        try: description = info.find_element_by_xpath("./div[1]/h2").text
        except: description = None
        try: rating = info.find_element_by_xpath("./div[2]/div/span").get_attribute("aria-label")
        except: rating = None
        try: votes = info.find_elements_by_xpath("./div[2]/div/span")[1].text
        except: votes = None
        try: offer_price = info.find_element_by_class_name("a-price").text.replace("\n", ".")
        except: offer_price = None
        try: actual_price = info.find_element_by_class_name("a-price").find_element_by_xpath("..//span[@data-a-strike='true']").text
        except: actual_price = offer_price

        data.append([description, rating, votes, offer_price, actual_price])

    return data

def scrappi(product_name, n_pages):
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument('--headless')
    driver = webdriver.Chrome(ChromeDriverManager(print_first_line=False).install(), options=chrome_options)

    url = "https://www.amazon.com/s?k=" + product_name
    driver.get(url)
    sleep(4)

    # Retry until the search results actually load, re-fetching the cards on every attempt
    cards = driver.find_elements_by_xpath('//div[@data-component-type="s-search-result"]')
    while len(cards) == 0:
        driver.get(url)
        sleep(4)
        cards = driver.find_elements_by_xpath('//div[@data-component-type="s-search-result"]')

    max_pages = int(driver.find_element_by_xpath(".//span[@class='s-pagination-strip']/span[last()]").text)
    while n_pages > max_pages or n_pages == 0:
        print(f"Please Enter a Valid Number of Pages Between 1 to {max_pages}:")
        n_pages = int(input())

    data = []

    while n_pages > 0:
        n_pages -= 1
        data.extend(func(driver.find_elements_by_xpath('//div[@data-component-type="s-search-result"]')))
        if n_pages > 0:
            # Only advance to the next page when more pages remain to be scraped
            driver.find_element_by_class_name("s-pagination-next").click()
            sleep(4)

    driver.close()
    return pd.DataFrame(data, columns=["Description", "Rating", "Votes", "Offer Price", "Actual Price"])
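
A quick way to exercise this module on its own might look like the following; the product name and page count are arbitrary illustrative values, and the module is assumed to be importable (it is listed in `py_modules` in setup.py):

```python
# Illustrative run of the new Amazon scrapper module; values are placeholders.
from amazon import scrappi

df = scrappi("wireless mouse", 2)   # scrape two pages of search results
print(df.shape)
print(df.head())
```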