<a href="https://colab.research.google.com/github/malaika-n/ETL-Pipeline/blob/main/ETL_Pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **ETL Pipeline using Web Data**

* A web data ETL (Extract, Transform, Load) pipeline collects data from sources on the internet, transforms it, and loads it into a structured format that is ready for analysis and storage.
* It is useful for managing and processing large volumes of data gathered from websites, online platforms, and other digital sources.
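
The three stages can be sketched in a few lines before building the full pipeline. This is only a minimal illustration using the standard library: the sample sentence and the function names `extract`, `transform`, and `load` are assumptions made for this sketch, and an in-memory buffer stands in for a real file or database.

```python
import csv
import io
from collections import Counter

def extract():
    # Stand-in for a web request: a hard-coded sample sentence (hypothetical data).
    return "Data pipelines move data and pipelines clean data"

def transform(text):
    # Lowercase the text, split it on whitespace, and count each word.
    return Counter(text.lower().split())

def load(word_counts, out):
    # Write the counts, most frequent first, to any file-like object as CSV.
    writer = csv.writer(out)
    writer.writerow(["Words", "Frequencies"])
    for word, count in word_counts.most_common():
        writer.writerow([word, count])

buffer = io.StringIO()  # in-memory stand-in for a file on disk
counts = transform(extract())
load(counts, buffer)
print(buffer.getvalue())
```

Keeping each stage in its own function makes it possible to swap the data source or the storage target without touching the other stages.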

In [None]:
# Import the required Python libraries.
# If needed, install the third-party packages first: pip install requests beautifulsoup4 nltk pandas

import requests
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from collections import Counter
import pandas as pd
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

**Extracting text from any article on the web:**

In [None]:
# WebScraper downloads a web page from a given URL and extracts its text content.
class WebScraper:
    def __init__(self, url):
        self.url = url

    # extract_article_text is a method of WebScraper.
    # It fetches the page at self.url and returns the text found in its HTML.
    def extract_article_text(self):
        response = requests.get(self.url, timeout=10)
        response.raise_for_status()  # stop early on HTTP errors such as 404 or 500
        html_content = response.content
        soup = BeautifulSoup(html_content, "html.parser")
        # get_text() returns the text of the whole page, not only the article body
        article_text = soup.get_text()
        return article_text

# The extracted text can then be processed or analyzed further.
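
Because `get_text()` is called on the whole page, navigation and footer text (for example reading-time labels such as "min read") end up in the result. One possible refinement, shown here only as a sketch, is to keep just the text inside `<p>` tags. The HTML string below is a made-up example, and real pages may structure their article body differently.

```python
from bs4 import BeautifulSoup

# Hypothetical page used only for this illustration.
html = (
    "<html><head><title>T</title></head><body>"
    "<nav>Home About</nav>"
    "<article><p>First paragraph.</p><p>Second one.</p></article>"
    "<footer>3 min read</footer>"
    "</body></html>"
)

soup = BeautifulSoup(html, "html.parser")

# Text of the whole page, including navigation and footer.
full_text = soup.get_text(separator=" ", strip=True)

# Text of the paragraph tags only.
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
article_text = " ".join(paragraphs)

print(full_text)
print(article_text)
```

Whether paragraph tags are the right selector depends on the site, so this is a starting point rather than a general solution.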

**To store the frequency of each word in the article, the data needs to be preprocessed:**

In [None]:
# TextProcessor splits text into words, lowercases them, and removes stopwords and non-alphabetic tokens.
class TextProcessor:
    def __init__(self, nltk_stopwords):
        self.nltk_stopwords = nltk_stopwords

    # tokenize_and_clean is a method of TextProcessor.
    # It returns a list of cleaned, filtered words from the input text.
    # Tokens with attached punctuation or digits (such as "world," or "2023") fail isalpha() and are dropped.
    def tokenize_and_clean(self, text):
        words = text.split()
        filtered_words = [word.lower() for word in words if word.isalpha() and word.lower() not in self.nltk_stopwords]
        return filtered_words
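
To see what this filtering does, the sketch below applies the same logic to one sentence. The small stopword set and the sample sentence are assumptions made for illustration (the notebook itself uses NLTK's English stopword list), and `tokenize_strip_punctuation` is a hypothetical variant that is not part of the notebook.

```python
import string

# Hypothetical, tiny stopword set used only for this illustration.
sample_stopwords = {"the", "is", "a", "and", "of", "in"}

def tokenize_and_clean(text, stop_words):
    # Same logic as TextProcessor.tokenize_and_clean.
    words = text.split()
    return [w.lower() for w in words if w.isalpha() and w.lower() not in stop_words]

def tokenize_strip_punctuation(text, stop_words):
    # Variant: strip leading and trailing punctuation before filtering.
    words = [w.strip(string.punctuation) for w in text.split()]
    return [w.lower() for w in words if w.isalpha() and w.lower() not in stop_words]

text = "The world of AI is changing, and AI is a big topic in 2024."
cleaned = tokenize_and_clean(text, sample_stopwords)
cleaned_variant = tokenize_strip_punctuation(text, sample_stopwords)

print(cleaned)          # ['world', 'ai', 'ai', 'big', 'topic']
print(cleaned_variant)  # ['world', 'ai', 'changing', 'ai', 'big', 'topic']
```

The original filter silently drops "changing," because of the trailing comma, so word counts from it slightly underestimate words that often appear next to punctuation.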

**Defining a class for the entire ETL process for extracting article text, processing it, and generating a DataFrame of word frequencies:**

In [None]:
# ETLPipeline class encapsulates the end-to-end process of extracting article text from a web page, cleaning and processing the text, calculating word frequencies, and generating a sorted DataFrame.

class ETLPipeline:
    def __init__(self, url):
        self.url = url
        self.nltk_stopwords = set(stopwords.words("english"))

    # The run method performs the complete ETL process: it extracts the page text, removes stopwords,
    # counts the remaining words, and returns a DataFrame sorted from most to least frequent.
    def run(self):
        scraper = WebScraper(self.url)
        article_text = scraper.extract_article_text()

        processor = TextProcessor(self.nltk_stopwords)
        filtered_words = processor.tokenize_and_clean(article_text)

        word_freq = Counter(filtered_words)
        df = pd.DataFrame(word_freq.items(), columns=["Words", "Frequencies"])
        df = df.sort_values(by="Frequencies", ascending=False)
        return df
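
The counting and sorting step can be tried without any network access. The sketch below uses a hand-written word list as an assumption in place of scraped text. It also adds two optional steps that the notebook does not perform: `reset_index(drop=True)` so the row labels follow the sorted order, and `to_csv` as one possible "load" step, written here to an in-memory buffer instead of a file.

```python
import io
from collections import Counter

import pandas as pd

# Hypothetical cleaned words standing in for the output of tokenize_and_clean.
filtered_words = ["ai", "world", "ai", "data", "ai", "world"]

word_freq = Counter(filtered_words)
df = pd.DataFrame(list(word_freq.items()), columns=["Words", "Frequencies"])

# Sort by frequency and renumber the rows 0, 1, 2, ... in sorted order.
df = df.sort_values(by="Frequencies", ascending=False).reset_index(drop=True)

# Optional load step: write the table as CSV to an in-memory buffer.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_lines = buffer.getvalue().splitlines()

print(df)
```

Without `reset_index`, the DataFrame keeps the index values from before sorting, which is why the row labels in the notebook's output are not consecutive.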

In [None]:
if __name__ == "__main__":
    article_url = "https://medium.com/carre4/women-and-ai-a8389ec6334c"
    pipeline = ETLPipeline(article_url)
    result_df = pipeline.run()
    print(result_df.head())

         Words  Frequencies
7           ai           17
9          min           11
24       world            6
13  artificial            6
3     inspirit            5
