Merge pull request #18 from Vedant950/main
Bug fix and minor update
mldsveda committed Feb 26, 2022
2 parents 9976dbf + 05067b6 commit 6d3ff98
Showing 8 changed files with 374 additions and 25 deletions.
2 changes: 1 addition & 1 deletion LICENSE
@@ -1,6 +1,6 @@
MIT License

-Copyright (c) 2021 Vedant Tibrewal, Vedaant Singh.
+Copyright (c) 2022 Vedant Tibrewal, Vedaant Singh.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
33 changes: 19 additions & 14 deletions README.md
@@ -6,9 +6,9 @@

## PyScrappy: powerful Python data scraping toolkit

[![forthebadge made-with-python](http://ForTheBadge.com/images/badges/made-with-python.svg)](https://www.python.org/)

[![Python 3.6](https://img.shields.io/badge/python-3.6-blue.svg)](https://www.python.org/downloads/release/python-360/)
[![PyPI Latest Release](https://img.shields.io/pypi/v/PyScrappy.svg)](https://pypi.org/project/PyScrappy/)

[![Package Status](https://img.shields.io/pypi/status/PyScrappy.svg)](https://pypi.org/project/PyScrappy/)
@@ -21,21 +21,22 @@

[![](https://img.shields.io/badge/pyscrappy-official%20documentation-blue)](https://pyscrappy.netlify.app/)


## What is it?

**PyScrappy** is a Python package that provides a fast, flexible, and exhaustive way to scrape data from a wide range of sources. An easy and intuitive library, it aims to be the fundamental high-level building block for scraping **data** in Python. Additionally, it has the broader goal of becoming **the most powerful and flexible open source data scraping tool available**.

## Main Features

Here are just a few of the things that PyScrappy does well:

- Easy scraping of [**Data**](https://medium.com/analytics-vidhya/web-scraping-in-python-using-the-all-new-pyscrappy-5c136ed6906b) available on the internet
- Returns a [**DataFrame**](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) for further analysis and research purposes (see the sketch below).
- Automatic [**Data Scraping**](https://medium.com/analytics-vidhya/web-scraping-in-python-using-the-all-new-pyscrappy-5c136ed6906b): other than a few user-input parameters, the whole process of scraping the data is automatic.
- Powerful and flexible
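
A minimal usage sketch (the product name and page count are illustrative; `ECommerceScrapper` is defined in `src/PyScrappy.py`, shown later in this diff):

```python
import PyScrappy as ps

# Instantiate the e-commerce scrapper and pull two pages of results
# into a pandas DataFrame. The query and page count are placeholders.
obj = ps.ECommerceScrapper()
df = obj.flipkart_scrapper("headphones", 2)
print(df.head())
```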

## Where to get it

The source code is currently hosted on GitHub at:
https://github.com/mldsveda/PyScrappy

@@ -47,13 +48,14 @@ pip install PyScrappy
```

## Dependencies
- [selenium](https://www.selenium.dev/) - Selenium is a free, open-source automated testing framework used to validate web applications across different browsers and platforms.
- [webdriver-manager](https://github.com/bonigarcia/webdrivermanager) - WebDriverManager automates the management of the driver executables (chromedriver, geckodriver, etc.) required by the Selenium WebDriver API.
- [beautifulsoup4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) - Beautiful Soup is a Python library for getting data out of HTML, XML, and other markup languages.
- [pandas](https://pandas.pydata.org/) - pandas is a fast, powerful, flexible, and easy-to-use open source data analysis and manipulation tool, built on top of the Python programming language.
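
Under the hood, the scrappers pair the first two of these: webdriver-manager fetches a chromedriver binary that matches the installed Chrome, and Selenium drives the browser with it. A rough sketch (Selenium 3-style API, as used elsewhere in this repository):

```python
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

# webdriver-manager downloads and caches a matching chromedriver;
# Selenium then drives a headless Chrome session with it.
options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(ChromeDriverManager().install(), options=options)
driver.get("https://example.com")
print(driver.title)
driver.quit()
```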

## License

[MIT](https://github.com/mldsveda/PyScrappy/blob/main/LICENSE)

## Getting Help
@@ -62,16 +64,19 @@ For usage questions, the best place to go to is [StackOverflow](https://stackove
Further, general questions and discussions can also take place on GitHub in this [repository](https://github.com/mldsveda/PyScrappy).

## Discussion and Development

Most development discussions take place on GitHub in this [repository](https://github.com/mldsveda/PyScrappy).

Also visit the official documentation of [PyScrappy](https://pyscrappy.netlify.app/) for more information.

## Contributing to PyScrappy

All contributions, bug reports, bug fixes, documentation improvements, enhancements, and ideas are welcome.

-If you are simply looking to start working with the PyScrappy codebase, navigate to the [GitHub "issues" tab](https://github.com/mldsveda/PyScrappy/issues) and start looking through interesting issues.
+If you are simply looking to start working with the PyScrappy codebase, navigate to the GitHub ["issues"](https://github.com/mldsveda/PyScrappy/issues) tab and start looking through interesting issues.

## End Notes

_Learn More about this package on [Medium](https://medium.com/analytics-vidhya/web-scraping-in-python-using-the-all-new-pyscrappy-5c136ed6906b)._

### **_This package is solely made for educational and research purposes._**
6 changes: 3 additions & 3 deletions setup.py
@@ -6,21 +6,21 @@

setuptools.setup(
    name="PyScrappy",
-   version="0.0.9",
+   version="0.1.0",
    author="Vedant Tibrewal, Vedaant Singh",
    author_email="mlds93363@gmail.com",
    description="Powerful web scraping tool.",
    long_description=long_description,
    long_description_content_type="text/markdown",
    url="https://github.com/mldsveda/PyScrappy",
-   keywords=['PyScrappy', 'Scraping', 'E-Commerce', 'Wikipedia', 'Image Scrapper', 'YouTube', 'Scrapy', 'Twitter', 'Social Media', 'Web Scraping', 'News', 'Stocks', 'Songs', 'Food', 'Instagram'],
+   keywords=['PyScrappy', 'Scraping', 'E-Commerce', 'Wikipedia', 'Image Scrapper', 'YouTube', 'Scrapy', 'Twitter', 'Social Media', 'Web Scraping', 'News', 'Stocks', 'Songs', 'Food', 'Instagram', 'Movies'],
    classifiers=[
        "Programming Language :: Python :: 3",
        "License :: OSI Approved :: MIT License",
        "Operating System :: OS Independent",
    ],
    python_requires=">=3.6",
-   py_modules=["PyScrappy", "alibaba", "flipkart", "image", "instagram", "news", "snapdeal", "soundcloud", "stock", "swiggy", "twitter", "wikipedia", "youtube", "zomato"],
+   py_modules=["PyScrappy", "alibaba", "amazon", "flipkart", "image", "imdb", "instagram", "news", "snapdeal", "soundcloud", "spotify", "stock", "swiggy", "twitter", "wikipedia", "youtube", "zomato"],
    package_dir={"": "src"},
    install_requires=[
        'selenium',
151 changes: 144 additions & 7 deletions src/PyScrappy.py
@@ -5,8 +5,9 @@ class ECommerceScrapper():

    ECommerce Scrapper: Helps in scraping data from e-commerce websites

        1. Alibaba
-       2. Flipkart
-       3. Snapdeal
+       2. Amazon
+       3. Flipkart
+       4. Snapdeal

    Type: class
@@ -54,6 +55,40 @@ def alibaba_scrapper(self, product_name, n_pages):

        return alibaba.scrappi(product_name, n_pages)


    ############## Amazon Scrapper ##############
    def amazon_scrapper(self, product_name, n_pages):

        """
        Amazon Scrapper: Helps in scraping Amazon data ('Description', 'Rating', 'Votes', 'Offer Price', 'Actual Price').

        return type: DataFrame

        Parameters
        ------------
        product_name: Enter the name of the desired product
                      Type: str

        n_pages: Enter the number of pages that you want to scrape
                 Type: int

        Note
        ------
        Both arguments are compulsory.
        If n_pages == 0: a prompt will ask you to enter a valid page number and the scrapper will re-run.

        Example
        ---------
        >>> obj.amazon_scrapper('product', 3)
        out: Description  Rating  Votes  Offer Price  Actual Price
             product a    3.5     440    $29.99       $39.99
             product b    4.5     240    $9.99        $9.99
        """

        import amazon
        return amazon.scrappi(product_name, n_pages)


    ############## Flipkart Scrapper ##############
    def flipkart_scrapper(self, product_name, n_pages):

@@ -79,8 +114,8 @@ def flipkart_scrapper(self, product_name, n_pages):

        ---------
        >>> obj.flipkart_scrapper("Product Name", 3)
        out: Name  Price  Original Price  Description  Rating
             abc   ₹340   ₹440            Product      4.2
             aec   ₹140   ₹240            Product      4.7
        """

@@ -113,8 +148,8 @@ def snapdeal_scrapper(self, product_name, n_pages):

        ---------
        >>> obj.snapdeal_scrapper('product', 3)
        out: Name  Price  Original Price  Number of Ratings
             abc   ₹340   ₹440            40
             aec   ₹140   ₹240            34
        """

@@ -216,7 +251,7 @@ def image_scrapper(data_name, n_images=10, img_format='jpg', folder_name='images

    """
    Image Scrapper: Helps in scraping images from "Google", "Yahoo", "Bing".
    Downloads them to the desired folder.

    Parameters

@@ -257,6 +292,75 @@ def image_scrapper(data_name, n_images=10, img_format='jpg', folder_name='images

########################################################################################################################

############## IMDB Scrapper ##############
def imdb_scrapper(genre, n_pages):

    """
    IMDB Scrapper: Helps in scraping movies from IMDB.

    return type: DataFrame

    Parameters
    ------------
    genre: Enter the genre of the movie
           Type: str

    n_pages: Enter the number of pages that it will scrape in a single run.
             Type: int

    Note
    ------
    Both parameters are compulsory.

    Example
    ---------
    >>> imdb_scrapper('action', 4)
    out: Title  Year  Certificate  Runtime  Genre   Rating  Description  Stars  Directors  Votes
         asd    2022  UA           49min    action  3.9     about the..  asd    dfgv       23
         scr    2022  15+          89min    action  4.9     about the..  add    dfgv       23
    """

    import imdb
    return imdb.scrappi(genre, n_pages)

########################################################################################################################

############## LinkedIn Scrapper ##############
def linkedin_scrapper(job_title, n_pages):

    """
    LinkedIn Scrapper: Helps in scraping job-related data from LinkedIn (Job Title, Company Name, Location, Salary, Benefits, Date).

    return type: DataFrame

    Parameters
    ------------
    job_title: Enter the job title or type.
               Type: str

    n_pages: Enter the number of pages that it will scrape in a single run.
             Type: int

    Note
    ------
    Both parameters are compulsory.

    Example
    ---------
    >>> linkedin_scrapper('python', 1)
    out: Job Title  Company Name  Location  Salary  Benefits            Date
         abc        PyScrappy     US        2300    Actively Hiring +1  1 day ago
         abc        PyScrappy     US        2300    Actively Hiring +1  1 day ago
         ...
    """

    import linkedin
    return linkedin.scrappi(job_title, n_pages)

########################################################################################################################

############## News Scrapper ##############
def news_scrapper(n_pages, genre = str()):

@@ -530,6 +634,39 @@ def soundcloud_scrapper(self, track_name, n_pages):

        import soundcloud
        return soundcloud.soundcloud_tracks(track_name, n_pages)


    ############## Spotify Scrapper ##############
    def spotify_scrapper(self, track_name, n_pages):

        """
        Spotify Scrapper: Helps in scraping data from Spotify ('Id', 'Title', 'Singers', 'Album', 'Duration').

        return type: DataFrame

        Parameters
        ------------
        track_name: Enter the name of the desired track/song/music/artist/podcast
                    Type: str

        n_pages: The number of pages that it will scrape in a single run
                 Type: int

        Note
        ------
        Make sure to enter a valid name.

        Example
        ---------
        >>> obj.spotify_scrapper('pop', 3)
        out: Id  Title  Singers  Album  Duration
             1   abc    abc      abc    2:30
             2   def    def      def    2:30
        """

        import spotify
        return spotify.scrappi(track_name, n_pages)

########################################################################################################################

############## Stock Scrapper ##############
60 changes: 60 additions & 0 deletions src/amazon.py
@@ -0,0 +1,60 @@
import pandas as pd
from time import sleep
from webdriver_manager.chrome import ChromeDriverManager
from selenium import webdriver

def func(cards):
    data = []
    for card in cards:
        # Amazon's result-card markup varies by listing type, so probe the
        # known alternative containers for the info block in turn.
        try: info = card.find_element_by_class_name("s-card-container").find_element_by_xpath("./div/div[3]")
        except:
            try: info = card.find_element_by_class_name("s-card-container").find_element_by_xpath("./div/div[2]")
            except:
                try: info = card.find_element_by_class_name("s-card-container").find_element_by_xpath("./div/div/div[3]")
                except: info = card.find_element_by_class_name("s-card-container").find_element_by_xpath("./div/div/div[2]")

        # Every field is optional on a result card, so default to None (and
        # fall back to the offer price when no strike-through price exists).
        try: description = info.find_element_by_xpath("./div[1]/h2").text
        except: description = None
        try: rating = info.find_element_by_xpath("./div[2]/div/span").get_attribute("aria-label")
        except: rating = None
        try: votes = info.find_elements_by_xpath("./div[2]/div/span")[1].text
        except: votes = None
        # The visible price splits the whole and fractional parts across
        # lines; rejoin them with ".".
        try: offer_price = info.find_element_by_class_name("a-price").text.replace("\n", ".")
        except: offer_price = None
        try: actual_price = info.find_element_by_class_name("a-price").find_element_by_xpath("..//span[@data-a-strike='true']").text
        except: actual_price = offer_price

        data.append([description, rating, votes, offer_price, actual_price])

    return data

def scrappi(product_name, n_pages):
    # Launch a headless Chrome; webdriver-manager supplies a matching chromedriver.
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument('--headless')
    driver = webdriver.Chrome(ChromeDriverManager(print_first_line=False).install(), options=chrome_options)

    url = "https://www.amazon.com/s?k=" + product_name
    driver.get(url)
    sleep(4)

    # Reload until the result cards render, re-querying on every attempt.
    cards = driver.find_elements_by_xpath('//div[@data-component-type="s-search-result"]')
    while len(cards) == 0:
        driver.get(url)
        sleep(4)
        cards = driver.find_elements_by_xpath('//div[@data-component-type="s-search-result"]')

    # The last entry in the pagination strip is the highest page number.
    max_pages = int(driver.find_element_by_xpath(".//span[@class='s-pagination-strip']/span[last()]").text)
    while n_pages > max_pages or n_pages == 0:
        print(f"Please Enter a Valid Number of Pages Between 1 and {max_pages}:")
        n_pages = int(input())

    data = []

    while n_pages > 0:
        n_pages -= 1
        data.extend(func(driver.find_elements_by_xpath('//div[@data-component-type="s-search-result"]')))
        if n_pages > 0:  # only click "next" when another page is still needed
            driver.find_element_by_class_name("s-pagination-next").click()
            sleep(4)

    driver.close()
    return pd.DataFrame(data, columns=["Description", "Rating", "Votes", "Offer Price", "Actual Price"])
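
A quick smoke test for the new module (a sketch, not part of the commit; it assumes `src/` is on the import path and a local Chrome installation is available):

```python
# Hypothetical usage of src/amazon.py -- the query string and page count
# are illustrative only.
import amazon

df = amazon.scrappi("usb c cable", 2)
print(df.shape)
print(df[["Description", "Offer Price", "Actual Price"]].head())
```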