<a href="https://www.kaggle.com/code/manishkr1754/keyword-extraction-with-keybert-and-wordwise?scriptVersionId=145452236" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

<center><h1> Keyword Extraction with KeyBERT and WordWise</h1></center>

---

<center><h3>Comparison, Parameter Tuning & Evaluation</h3></center>

# Process Flow of the Keyword Extraction
---

- 1. Getting System Ready
- 2. Data Collection
- 3. Data Eyeballing
- 4. Data Preprocessing
- 5. Keyword Extraction
    - 5.1 KeyBERT
    - 5.2 WordWise
- 6. Exporting Output as Excel File

## 1) Getting System Ready
---

Importing Required Packages

In [1]:
!pip install xlsxwriter -q
!pip install keybert -q
!pip install wordwise -q

1. `!pip install xlsxwriter -q`: This installs the `xlsxwriter` library, which is used for creating Excel files in Python. The `-q` flag suppresses verbose output, so you won't see detailed installation messages.

2. `!pip install keybert -q`: This installs the `keybert` library, which is a Python library for keyword extraction using BERT embeddings. Again, the `-q` flag suppresses verbose output during installation.

3. `!pip install wordwise -q`: This installs the `wordwise` library, which is another library for keyword extraction. The `-q` flag is used to keep the installation process quiet.

In [2]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import json
from pprint import pprint
import xlsxwriter
from keybert import KeyBERT
from wordwise import Extractor

- `requests`: This library allows you to send HTTP requests to web servers and retrieve data from websites. It's commonly used for web scraping and accessing web APIs.

- `BeautifulSoup`: BeautifulSoup is a popular Python library for parsing and navigating HTML and XML documents. It helps you extract and manipulate data from web pages easily.

- `pandas`: Pandas is a powerful data manipulation and analysis library. It provides data structures like DataFrames and Series, making it efficient for data cleaning, transformation, and analysis.

- `json`: The `json` module in Python is used for working with JSON (JavaScript Object Notation) data. It allows you to parse JSON strings and serialize Python objects into JSON format.

- `pprint`: The `pprint` module provides a way to pretty-print Python data structures, making complex data easier to read and understand when printed to the console.

- `xlsxwriter`: XlsxWriter is a Python library for creating Excel files (XLSX format). It allows you to generate Excel spreadsheets with various formatting options.

- `KeyBERT`: KeyBERT is a Python library for keyword extraction that leverages BERT embeddings. It's useful for extracting meaningful keywords from text data.

- `wordwise`: Wordwise is another Python library for keyword extraction. It provides a different approach to keyword extraction compared to BERT-based methods.

## 2) Data Collection
---
We aim to gather the necessary data from the **Google Cloud Skill Boost Course** website like module and submodule names along with their descriptions, which will serve as our input for keyword extraction.

- **Data Source:** Our primary data source is the "Google Cloud Skill Boost Course" website which contains information about various modules and submodules within the course. These modules are accompanied by descriptions that provide insights into their content.

- **Data Extraction Technique:** We will utilize **Web Scraping** techniques to extract the relevant data from the website. Specifically, we will use the Python library **BeautifulSoup** to parse the HTML content of the web pages and extract module and submodule names along with their descriptions.

#### Step-1: Define url of the Course Page

In [3]:
url = "https://www.cloudskillsboost.google/course_templates/53?catalog_rank=%7B%22rank%22%3A1%2C%22num_filters%22%3A0%2C%22has_search%22%3Atrue%7D&search_id=25346338"
url

'https://www.cloudskillsboost.google/course_templates/53?catalog_rank=%7B%22rank%22%3A1%2C%22num_filters%22%3A0%2C%22has_search%22%3Atrue%7D&search_id=25346338'

#### Step-2: Send an HTTP GET request to fetch the page content

In [4]:
response_from_page = requests.get(url)
response_from_page

<Response [200]>

The **<Response [200]>** is standard HTTP response code indicating a successful request. It means that the request to the website was successful and the server has returned the expected response. We can now proceed to parse the content of the web page using BeautifulSoup or another HTML parsing library.

**Common Response Code**
- `200 OK`: The request was successful and the server has returned the requested data.
- `404 Not Found`: The requested resource was not found on the server.
- `403 Forbidden`: Access to the requested resource is forbidden.
- `500 Internal Server Error`: An error occurred on the server while processing the request.

#### Step-3: Parse the HTML content with BeautifulSoup

In [5]:
soup = BeautifulSoup(response_from_page.content, "html.parser")

- `soup = BeautifulSoup(response_from_page.content, "html.parser")`: This line of code parses the HTML content of the page using BeautifulSoup's HTML parser and stores the parsed data in the `soup` variable. Now, we can navigate and extract information from the parsed HTML.

#### Step-4: Extracting Information from Parsed HTML Data

In [6]:
ql_course = soup.find("ql-course")
modules = ql_course.attrs['modules']

- `ql_course = soup.find("ql-course")`: Here `find()` method is used to locate the first occurrence of an HTML element with the tag name "ql-course" within the parsed HTML. The result is stored in the `ql_course` variable.

- `modules = ql_course.attrs['modules']`: Assuming that the "ql-course" element has an attribute named "modules," this line extracts the value of that attribute and stores it in the `modules` variable.

## 3) Data Eyeballing
---

#### Parsed HTML Data

In [7]:
print(type(soup))
print("\n\n")
print(soup)

<class 'bs4.BeautifulSoup'>




<!DOCTYPE html>

<html lang="en">
<head>
<title>Building Batch Data Pipelines on Google Cloud | Google Cloud Skills Boost</title>
<meta content="/cable" name="action-cable-url"/>
<script>
//<![CDATA[
window.gon={};gon.deployment="google-run";
//]]>
</script>
<script>
  (function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':
  new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],
  j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src=
  'https://www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);
  })(window,document,'script','dataLayer',"GTM-5XSKHDX");
</script>
<script async="async" src="https://www.googletagmanager.com/gtag/js?id=G-2X30ZRBDSG"></script>
<script>
  window.dataLayer = window.dataLayer || [];
  function gtag(){dataLayer.push(arguments);}
  gtag('js', new Date());
  gtag('config', "G-2X30ZRBDSG", {
    user_id: ""
  });
</script>
<script src="https://cdn.qwiklabs.com/assets/hallo

#### Information from Parsed HTML Data

In [8]:
print(type(modules))
print("\n\n")
print(modules)

<class 'str'>



[{"id":"59338","title":"Introduction","description":"\u003cp\u003eIn this module, we introduce the course and agenda\u003c/p\u003e","steps":[{"id":"386567","prompt":null,"isOptional":true,"activities":[{"id":"379215","href":null,"isLocked":false,"duration":55000,"title":"Course Introduction","type":"video","isComplete":false,"inProgress":false,"score":null,"disabled":false}],"isComplete":false,"allActivitiesRequired":false}],"expanded":false},{"id":"59339","title":"Introduction to Building Batch Data Pipelines","description":"\u003cp\u003eThis module reviews different methods of data loading: EL, ELT and ETL and when to use what\u003c/p\u003e","steps":[{"id":"386568","prompt":null,"isOptional":true,"activities":[{"id":"379216","href":null,"isLocked":false,"duration":69000,"title":"Module introduction","type":"video","isComplete":false,"inProgress":false,"score":null,"disabled":false}],"isComplete":false,"allActivitiesRequired":false},{"id":"386569","prompt":null,"isOpt

## 4) Data Preprocessing
___
Before performing keyword extraction, it is essential to preprocess the data like text cleaning, tokenization and removing stopwords.

1. **Text Cleaning**: Remove any special characters, HTML tags or irrelevant formatting from the descriptions.
2. **Tokenization**: Split the text into individual words or tokens.
3. **Stopword Removal**: Eliminate common stopwords from the text.

#### Converting the extracted `modules` data (JSON string) into a Python dictionary

In [9]:
json_modules = json.loads(modules)
print(type(json_modules))
print("\n\n")
pprint(json_modules[0])

<class 'list'>



{'description': '<p>In this module, we introduce the course and agenda</p>',
 'expanded': False,
 'id': '59338',
 'steps': [{'activities': [{'disabled': False,
                            'duration': 55000,
                            'href': None,
                            'id': '379215',
                            'inProgress': False,
                            'isComplete': False,
                            'isLocked': False,
                            'score': None,
                            'title': 'Course Introduction',
                            'type': 'video'}],
            'allActivitiesRequired': False,
            'id': '386567',
            'isComplete': False,
            'isOptional': True,
            'prompt': None}],
 'title': 'Introduction'}


- `json_modules = json.loads(modules)`: This line uses the `json.loads()` function to convert the `modules` string assumed to be in JSON format into a Python data structure. In this case, it is a list of dictionaries.

#### Converting to Pandas DataFrame

In [10]:
df = pd.DataFrame(json_modules)
df

Unnamed: 0,id,title,description,steps,expanded
0,59338,Introduction,"<p>In this module, we introduce the course and...","[{'id': '386567', 'prompt': None, 'isOptional'...",False
1,59339,Introduction to Building Batch Data Pipelines,<p>This module reviews different methods of da...,"[{'id': '386568', 'prompt': None, 'isOptional'...",False
2,59340,Executing Spark on Dataproc,<p>This module shows how to run Hadoop on Data...,"[{'id': '386575', 'prompt': None, 'isOptional'...",False
3,59341,Serverless Data Processing with Dataflow,<p>This module covers using Dataflow to build ...,"[{'id': '386587', 'prompt': None, 'isOptional'...",False
4,59342,Manage Data Pipelines with Cloud Data Fusion a...,<p>This module shows how to manage data pipeli...,"[{'id': '386604', 'prompt': None, 'isOptional'...",False
5,59343,Course Summary,<p>Course Summary</p>,"[{'id': '386620', 'prompt': None, 'isOptional'...",False
6,59344,Course Resources,<p>PDF links to all modules</p>,"[{'id': '386621', 'prompt': None, 'isOptional'...",False


In [11]:
print('The size of Dataframe is: ', df.shape)
print('-'*100)
print('The Column Name, Record Count and Data Types are as follows: ')
df.info()
print('-'*100)

The size of Dataframe is:  (7, 5)
----------------------------------------------------------------------------------------------------
The Column Name, Record Count and Data Types are as follows: 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   id           7 non-null      object
 1   title        7 non-null      object
 2   description  7 non-null      object
 3   steps        7 non-null      object
 4   expanded     7 non-null      bool  
dtypes: bool(1), object(4)
memory usage: 359.0+ bytes
----------------------------------------------------------------------------------------------------


#### Flattening JSON Data into a Pandas DataFrame

In [12]:
flatten_df = pd.json_normalize(json_modules, 
                               record_path=['steps', ['activities']], 
                               meta=['id','title', 'description'], 
                               meta_prefix='meta-', 
                               record_prefix='record-')
flatten_df.head(3)

Unnamed: 0,record-id,record-href,record-isLocked,record-duration,record-title,record-type,record-isComplete,record-inProgress,record-score,record-disabled,meta-id,meta-title,meta-description
0,379215,,False,55000,Course Introduction,video,False,False,,False,59338,Introduction,"<p>In this module, we introduce the course and..."
1,379216,,False,69000,Module introduction,video,False,False,,False,59339,Introduction to Building Batch Data Pipelines,<p>This module reviews different methods of da...
2,379217,,False,220000,"EL, ELT, ETL",video,False,False,,False,59339,Introduction to Building Batch Data Pipelines,<p>This module reviews different methods of da...


This code is taking a JSON data structure with nested arrays, extracting specific nested records while preserving some top-level metadata and converting the structured data into a Pandas DataFrame. The resulting DataFrame will be a tabular representation of the nested JSON data, making it easier to analyze and work with in a tabular format.

In [13]:
print('The size of Dataframe is: ', flatten_df.shape)
print('-'*100)
print('The Column Name, Record Count and Data Types are as follows: ')
flatten_df.info()
print('-'*100)

The size of Dataframe is:  (58, 13)
----------------------------------------------------------------------------------------------------
The Column Name, Record Count and Data Types are as follows: 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58 entries, 0 to 57
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   record-id          58 non-null     object
 1   record-href        0 non-null      object
 2   record-isLocked    58 non-null     bool  
 3   record-duration    58 non-null     int64 
 4   record-title       58 non-null     object
 5   record-type        58 non-null     object
 6   record-isComplete  58 non-null     bool  
 7   record-inProgress  58 non-null     bool  
 8   record-score       0 non-null      object
 9   record-disabled    58 non-null     bool  
 10  meta-id            58 non-null     object
 11  meta-title         58 non-null     object
 12  meta-description   58 non-null     

#### Dropping Unwanted Columns

In [14]:
flatten_df.drop(['record-href', 
                 'record-isLocked', 
                 'record-isComplete', 
                 'record-inProgress',
                 'record-score',
                 'meta-id',
                 'record-id'], axis=1, inplace=True)

In [15]:
print('The size of Dataframe is: ', flatten_df.shape)
print('-'*100)
print('The Column Name, Record Count and Data Types are as follows: ')
flatten_df.info()
print('-'*100)

The size of Dataframe is:  (58, 6)
----------------------------------------------------------------------------------------------------
The Column Name, Record Count and Data Types are as follows: 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58 entries, 0 to 57
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   record-duration   58 non-null     int64 
 1   record-title      58 non-null     object
 2   record-type       58 non-null     object
 3   record-disabled   58 non-null     bool  
 4   meta-title        58 non-null     object
 5   meta-description  58 non-null     object
dtypes: bool(1), int64(1), object(4)
memory usage: 2.4+ KB
----------------------------------------------------------------------------------------------------


#### Aggregating and Grouping the Data

In [16]:
grouped_df = flatten_df.groupby(['meta-title', 'meta-description'])
print(grouped_df.ngroups)

7


- There are 7(Seven) unique groups(combinations) of `meta-title` and `meta-description` in the grouped DataFrame

In [17]:
grouped_df.head()

Unnamed: 0,record-duration,record-title,record-type,record-disabled,meta-title,meta-description
0,55000,Course Introduction,video,False,Introduction,"<p>In this module, we introduce the course and..."
1,69000,Module introduction,video,False,Introduction to Building Batch Data Pipelines,<p>This module reviews different methods of da...
2,220000,"EL, ELT, ETL",video,False,Introduction to Building Batch Data Pipelines,<p>This module reviews different methods of da...
3,168000,Quality considerations,video,False,Introduction to Building Batch Data Pipelines,<p>This module reviews different methods of da...
4,180000,How to carry out operations in BigQuery,video,False,Introduction to Building Batch Data Pipelines,<p>This module reviews different methods of da...
5,208000,Shortcomings,video,False,Introduction to Building Batch Data Pipelines,<p>This module reviews different methods of da...
8,27000,Module introduction,video,False,Executing Spark on Dataproc,<p>This module shows how to run Hadoop on Data...
9,286000,The Hadoop ecosystem,video,False,Executing Spark on Dataproc,<p>This module shows how to run Hadoop on Data...
10,602000,Running Hadoop on Dataproc,video,False,Executing Spark on Dataproc,<p>This module shows how to run Hadoop on Data...
11,379000,Cloud Storage instead of HDFS,video,False,Executing Spark on Dataproc,<p>This module shows how to run Hadoop on Data...


#### Combining 'record-title' within Grouped Data

In [18]:
flatten_df['text'] = flatten_df.groupby(['meta-title', 
                                         'meta-description'])['record-title'].transform(lambda x:'. '.join(x))
flatten_df.head(10)

Unnamed: 0,record-duration,record-title,record-type,record-disabled,meta-title,meta-description,text
0,55000,Course Introduction,video,False,Introduction,"<p>In this module, we introduce the course and...",Course Introduction
1,69000,Module introduction,video,False,Introduction to Building Batch Data Pipelines,<p>This module reviews different methods of da...,"Module introduction. EL, ELT, ETL. Quality con..."
2,220000,"EL, ELT, ETL",video,False,Introduction to Building Batch Data Pipelines,<p>This module reviews different methods of da...,"Module introduction. EL, ELT, ETL. Quality con..."
3,168000,Quality considerations,video,False,Introduction to Building Batch Data Pipelines,<p>This module reviews different methods of da...,"Module introduction. EL, ELT, ETL. Quality con..."
4,180000,How to carry out operations in BigQuery,video,False,Introduction to Building Batch Data Pipelines,<p>This module reviews different methods of da...,"Module introduction. EL, ELT, ETL. Quality con..."
5,208000,Shortcomings,video,False,Introduction to Building Batch Data Pipelines,<p>This module reviews different methods of da...,"Module introduction. EL, ELT, ETL. Quality con..."
6,428000,ETL to solve data quality issues,video,False,Introduction to Building Batch Data Pipelines,<p>This module reviews different methods of da...,"Module introduction. EL, ELT, ETL. Quality con..."
7,0,Introduction to Building Batch Data Pipelines,quiz,False,Introduction to Building Batch Data Pipelines,<p>This module reviews different methods of da...,"Module introduction. EL, ELT, ETL. Quality con..."
8,27000,Module introduction,video,False,Executing Spark on Dataproc,<p>This module shows how to run Hadoop on Data...,Module introduction. The Hadoop ecosystem. Run...
9,286000,The Hadoop ecosystem,video,False,Executing Spark on Dataproc,<p>This module shows how to run Hadoop on Data...,Module introduction. The Hadoop ecosystem. Run...


This code is adding a new column named `text` to the Pandas DataFrame `flatten_df`. This new column contains concatenated values from the 'record-title' column within each group defined by the 'meta-title' and 'meta-description' columns which can be useful for various analysis and visualization tasks.

## 5) Keyword Extraction
___

### 5.1) KeyBERT

1. `model`: (str) The name of the language model to use for keyword extraction. It specifies the pre-trained model architecture, such as 'distilbert-base-nli-mean-tokens', 'roberta-base-nli-stsb-mean-tokens', 'bert-base-nli-mean-tokens', etc.

2. `top_n`: (int, optional) The number of top keywords to extract. It determines how many keywords will be returned by the model. The default value is 5.

3. `min_kw`: (int, optional) The minimum number of characters required for a keyword to be considered. Keywords shorter than this length are filtered out. The default value is 1.

4. `stop_words`: (str or list, optional) A list of stop words to be removed from the extracted keywords. Stop words are common words like 'the', 'and', 'in', etc., that are often removed to focus on more meaningful keywords. You can specify a list of stop words or 'english' to use a predefined list of English stop words. The default is None, which means no stop words are removed.

5. `use_mmr`: (bool, optional) If True, the Maximum Marginal Relevance (MMR) algorithm is applied to select a diverse set of keywords. MMR promotes diversity by selecting keywords that are not too similar to each other. The default value is False.

6. `diversity`: (float, optional) The diversity parameter for MMR. It controls the trade-off between relevance and diversity. A higher value encourages more diverse keywords. It only applies if `use_mmr` is set to True. The default value is 0.7.

7. `keyphrase_ngram_range`: (tuple, optional) The ngram range for keyphrase extraction. It determines the length of the keyphrases that the model will extract. For example, setting it to `(1, 1)` will extract single words as keyphrases, while `(1, 2)` will extract both single words and two-word phrases as keyphrases. The default value is `(1, 1)

### Sample Demo KeyBERT

In [19]:
sample_text ="""
         Supervised learning is the machine learning task of
         learning a function that maps an input to an output based
         on example input-output pairs.[1] It infers a function
         from labeled training data consisting of a set of
         training examples.[2] In supervised learning, each
         example is a pair consisting of an input object
         (typically a vector) and a desired output value (also
         called the supervisory signal). A supervised learning
         algorithm analyzes the training data and produces an
         inferred function, which can be used for mapping new
         examples. An optimal scenario will allow for the algorithm
         to correctly determine the class labels for unseen
         instances. This requires the learning algorithm to
         generalize from the training data to unseen situations
         in a 'reasonable' way (see inductive bias).
      """


from keybert import KeyBERT
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(sample_text)

In [20]:
keywords

[('supervised', 0.6523),
 ('labeled', 0.4702),
 ('learning', 0.467),
 ('training', 0.3858),
 ('labels', 0.3728)]

In [21]:
keywords = [x[0] for x in keywords]
keywords

['supervised', 'labeled', 'learning', 'training', 'labels']

### Extracting Keywords using KeyBERT from course modules

#### Step-1: Define a list of parameter values to tune

In [22]:
ngram_range_values = [(1, 1), (1, 2), (1, 3)]
stop_words_values = [None, "english"]

#### Step-2: Define a list of model names to tune

In [23]:
model_names = ['distilbert-base-nli-mean-tokens', 
               'roberta-base-nli-stsb-mean-tokens', 
               'bert-base-nli-mean-tokens']

The models used in the code are all part of the **Hugging Face Transformers library** and are pre-trained models for various natural language processing (NLP) tasks.These models are designed to generate meaningful sentence embeddings and can be used for various NLP tasks including keyword extraction, document similarity and more.

1. `distilbert-base-nli-mean-tokens`: This is a distilled version of the **BERT (Bidirectional Encoder Representations from Transformers) model**. It has been trained on various NLP tasks and is fine-tuned to generate sentence embeddings, making it useful for tasks like keyword extraction and semantic similarity.

2. `roberta-base-nli-stsb-mean-tokens`: This is a **RoBERTa model** which is another variant of the BERT architecture. It has been fine-tuned for the Natural Language Inference (NLI) task and uses sentence embeddings. The "stsb" in the name stands for the **Semantic Textual Similarity Benchmark**.

3. `bert-base-nli-mean-tokens`: This is a **standard BERT model** that has been fine-tuned for generating sentence embeddings and is suitable for tasks involving semantic similarity and keyword extraction.

#### Step-3: Create a function for extracting keywords for a specific model and parameters

In [24]:
def get_keywords_keybert(text, model_name, ngram_range, stop_words):
    """
    Extracts keywords from the given text using the KeyBERT model.

    Parameters:
    - text (str): The input text from which keywords will be extracted.
    - model_name (str): The name of the KeyBERT model to use for keyword extraction.
    - ngram_range (tuple): A tuple specifying the range of n-grams to consider during keyword extraction.
                          Example: (1, 2) for unigrams and bigrams.
    - stop_words (str or list): A list of stop words to be removed from the extracted keywords.
                              It can also be "english" to use a predefined list of English stop words.
                              If None, no stop words are removed.

    Returns:
    - list of str: A list of extracted keywords from the input text.
    """
    model = KeyBERT(model_name)
    keywords_arr = model.extract_keywords(text)
    return [x[0] for x in keywords_arr]

#### Step-4: Iterate through different models with different parameter combinations

In [25]:
for model_name in model_names:
    for ngram_range in ngram_range_values:
        for stop_words in stop_words_values:
            column_name = f'keywords_{model_name}_{ngram_range}_{stop_words}'
            
            flatten_df[column_name] = flatten_df['text'].apply(lambda x: get_keywords_keybert(x, 
                                                                                              model_name, 
                                                                                              ngram_range, 
                                                                                              stop_words))

This code is responsible for iterating through a set of predefined model names, n-gram range values, and stop words values to perform keyword extraction for different combinations of models and keyword extraction parameters. It creates new columns in a DataFrame to store the extracted keywords for each combination.

- The outermost loop iterates through a list of model names (`model_names`). These model names represent different pre-trained models for keyword extraction.

- The first nested loop iterates through a list of n-gram range values (`ngram_range_values`). N-gram range specifies the range of n-grams (e.g., unigrams, bigrams) to consider during keyword extraction.

- The second nested loop iterates through a list of stop words values (`stop_words_values`). Stop words are common words (e.g., "the," "and," "in") that are often removed from extracted keywords to focus on more meaningful terms.

- For each combination of model name, n-gram range and stop words, a unique column name is generated using string formatting. This column name will be used to store the extracted keywords in the DataFrame.

- Inside the innermost loop, the `get_keywords_keybert` function is applied to each row of the DataFrame. This function takes the input text, the current model name, n-gram range and stop words as parameters. It extracts keywords using the specified model and parameters.

- The extracted keywords are then stored in the corresponding column in the DataFrame.

In [26]:
print('The size of Dataframe is: ', flatten_df.shape)
print('-'*100)
print('The Column Name, Record Count and Data Types are as follows: ')
flatten_df.info()
print('-'*100)

The size of Dataframe is:  (58, 25)
----------------------------------------------------------------------------------------------------
The Column Name, Record Count and Data Types are as follows: 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58 entries, 0 to 57
Data columns (total 25 columns):
 #   Column                                                     Non-Null Count  Dtype 
---  ------                                                     --------------  ----- 
 0   record-duration                                            58 non-null     int64 
 1   record-title                                               58 non-null     object
 2   record-type                                                58 non-null     object
 3   record-disabled                                            58 non-null     bool  
 4   meta-title                                                 58 non-null     object
 5   meta-description                                           58 non-null     object
 6

In [27]:
flatten_df.head()

Unnamed: 0,record-duration,record-title,record-type,record-disabled,meta-title,meta-description,text,"keywords_distilbert-base-nli-mean-tokens_(1, 1)_None","keywords_distilbert-base-nli-mean-tokens_(1, 1)_english","keywords_distilbert-base-nli-mean-tokens_(1, 2)_None",...,"keywords_roberta-base-nli-stsb-mean-tokens_(1, 2)_None","keywords_roberta-base-nli-stsb-mean-tokens_(1, 2)_english","keywords_roberta-base-nli-stsb-mean-tokens_(1, 3)_None","keywords_roberta-base-nli-stsb-mean-tokens_(1, 3)_english","keywords_bert-base-nli-mean-tokens_(1, 1)_None","keywords_bert-base-nli-mean-tokens_(1, 1)_english","keywords_bert-base-nli-mean-tokens_(1, 2)_None","keywords_bert-base-nli-mean-tokens_(1, 2)_english","keywords_bert-base-nli-mean-tokens_(1, 3)_None","keywords_bert-base-nli-mean-tokens_(1, 3)_english"
0,55000,Course Introduction,video,False,Introduction,"<p>In this module, we introduce the course and...",Course Introduction,"[introduction, course]","[introduction, course]","[introduction, course]",...,"[course, introduction]","[course, introduction]","[course, introduction]","[course, introduction]","[introduction, course]","[introduction, course]","[introduction, course]","[introduction, course]","[introduction, course]","[introduction, course]"
1,69000,Module introduction,video,False,Introduction to Building Batch Data Pipelines,<p>This module reviews different methods of da...,"Module introduction. EL, ELT, ETL. Quality con...","[pipelines, introduction, batch, building, con...","[pipelines, introduction, batch, building, con...","[pipelines, introduction, batch, building, con...",...,"[introduction, etl, module, quality, elt]","[introduction, etl, module, quality, elt]","[introduction, etl, module, quality, elt]","[introduction, etl, module, quality, elt]","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq..."
2,220000,"EL, ELT, ETL",video,False,Introduction to Building Batch Data Pipelines,<p>This module reviews different methods of da...,"Module introduction. EL, ELT, ETL. Quality con...","[pipelines, introduction, batch, building, con...","[pipelines, introduction, batch, building, con...","[pipelines, introduction, batch, building, con...",...,"[introduction, etl, module, quality, elt]","[introduction, etl, module, quality, elt]","[introduction, etl, module, quality, elt]","[introduction, etl, module, quality, elt]","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq..."
3,168000,Quality considerations,video,False,Introduction to Building Batch Data Pipelines,<p>This module reviews different methods of da...,"Module introduction. EL, ELT, ETL. Quality con...","[pipelines, introduction, batch, building, con...","[pipelines, introduction, batch, building, con...","[pipelines, introduction, batch, building, con...",...,"[introduction, etl, module, quality, elt]","[introduction, etl, module, quality, elt]","[introduction, etl, module, quality, elt]","[introduction, etl, module, quality, elt]","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq..."
4,180000,How to carry out operations in BigQuery,video,False,Introduction to Building Batch Data Pipelines,<p>This module reviews different methods of da...,"Module introduction. EL, ELT, ETL. Quality con...","[pipelines, introduction, batch, building, con...","[pipelines, introduction, batch, building, con...","[pipelines, introduction, batch, building, con...",...,"[introduction, etl, module, quality, elt]","[introduction, etl, module, quality, elt]","[introduction, etl, module, quality, elt]","[introduction, etl, module, quality, elt]","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq..."


### 5.2) WordWise

WordWise is a Python library designed for keyword extraction and keyphrase generation from text data. This library is particularly useful for tasks such as content summarization, document categorization and information retrieval, where identifying key terms can provide valuable insights. 
However, this library may not have as extensive and customizable parameter options as some other libraries like KeyBERT.

1. `spacy_model` (str, optional): Specifies the spaCy model to use for text processing. Default is "en_core_web_sm".

2. `top_k` (int, optional): Determines the number of top keywords to extract. Default is 5.


### Sample Demo WordWise

In [28]:
sample_text ="""
         Supervised learning is the machine learning task of
         learning a function that maps an input to an output based
         on example input-output pairs.[1] It infers a function
         from labeled training data consisting of a set of
         training examples.[2] In supervised learning, each
         example is a pair consisting of an input object
         (typically a vector) and a desired output value (also
         called the supervisory signal). A supervised learning
         algorithm analyzes the training data and produces an
         inferred function, which can be used for mapping new
         examples. An optimal scenario will allow for the algorithm
         to correctly determine the class labels for unseen
         instances. This requires the learning algorithm to
         generalize from the training data to unseen situations
         in a 'reasonable' way (see inductive bias).
      """


from wordwise import Extractor
extractor = Extractor()
extractor.generate(sample_text)

['supervised learning', 'learning', 'inductive bias', 'labels', 'example']

### Extracting Keywords using WordWise from course modules

#### Creating Function to get keywords using WordWise

In [29]:
def get_keywords_wordwise(text):
    """
    Extract keywords from the given text using the WordWise library.

    Parameters:
        text (str): The input text from which keywords are to be extracted.

    Returns:
        list: A list of extracted keywords or an empty list if the extraction fails.

    Notes:
        - This function uses the WordWise library to extract keywords from the input text.
        - If the input text is empty or if an error occurs during keyword extraction, an
          empty list is returned.
        - Errors encountered during keyword extraction are printed to the console.

    Example:
        keywords = get_keywords_wordwise("This is a sample text for keyword extraction.")
        print(keywords)  # ['sample text', 'keyword extraction']
    """
    if not text:  # Check if text is empty
        print(f"Empty text encountered: {text}")
        return []  # Return an empty list if text is empty
    extractor = Extractor()
    try:
        return extractor.generate(text)
    except Exception as e:
        print(f"Error processing text: {text}")
        print(f"Error details: {str(e)}")
        return []

#### Applying Function to the DataFrame

In [30]:
flatten_df['wordwise_keywords'] = flatten_df['text'].apply(get_keywords_wordwise)

Error processing text: Course Introduction
Error details: list index out of range
Error processing text: Course Summary
Error details: list index out of range
Error processing text: Building Batch Data Pipelines on Google Cloud
Error details: list index out of range


In [31]:
print('The size of Dataframe is: ', flatten_df.shape)
print('-'*100)
print('The Column Name, Record Count and Data Types are as follows: ')
flatten_df.info()
print('-'*100)

The size of Dataframe is:  (58, 26)
----------------------------------------------------------------------------------------------------
The Column Name, Record Count and Data Types are as follows: 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58 entries, 0 to 57
Data columns (total 26 columns):
 #   Column                                                     Non-Null Count  Dtype 
---  ------                                                     --------------  ----- 
 0   record-duration                                            58 non-null     int64 
 1   record-title                                               58 non-null     object
 2   record-type                                                58 non-null     object
 3   record-disabled                                            58 non-null     bool  
 4   meta-title                                                 58 non-null     object
 5   meta-description                                           58 non-null     object
 6

In [32]:
flatten_df.head()

Unnamed: 0,record-duration,record-title,record-type,record-disabled,meta-title,meta-description,text,"keywords_distilbert-base-nli-mean-tokens_(1, 1)_None","keywords_distilbert-base-nli-mean-tokens_(1, 1)_english","keywords_distilbert-base-nli-mean-tokens_(1, 2)_None",...,"keywords_roberta-base-nli-stsb-mean-tokens_(1, 2)_english","keywords_roberta-base-nli-stsb-mean-tokens_(1, 3)_None","keywords_roberta-base-nli-stsb-mean-tokens_(1, 3)_english","keywords_bert-base-nli-mean-tokens_(1, 1)_None","keywords_bert-base-nli-mean-tokens_(1, 1)_english","keywords_bert-base-nli-mean-tokens_(1, 2)_None","keywords_bert-base-nli-mean-tokens_(1, 2)_english","keywords_bert-base-nli-mean-tokens_(1, 3)_None","keywords_bert-base-nli-mean-tokens_(1, 3)_english",wordwise_keywords
0,55000,Course Introduction,video,False,Introduction,"<p>In this module, we introduce the course and...",Course Introduction,"[introduction, course]","[introduction, course]","[introduction, course]",...,"[course, introduction]","[course, introduction]","[course, introduction]","[introduction, course]","[introduction, course]","[introduction, course]","[introduction, course]","[introduction, course]","[introduction, course]",[]
1,69000,Module introduction,video,False,Introduction to Building Batch Data Pipelines,<p>This module reviews different methods of da...,"Module introduction. EL, ELT, ETL. Quality con...","[pipelines, introduction, batch, building, con...","[pipelines, introduction, batch, building, con...","[pipelines, introduction, batch, building, con...",...,"[introduction, etl, module, quality, elt]","[introduction, etl, module, quality, elt]","[introduction, etl, module, quality, elt]","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[quality, data, introduction, considerations, ..."
2,220000,"EL, ELT, ETL",video,False,Introduction to Building Batch Data Pipelines,<p>This module reviews different methods of da...,"Module introduction. EL, ELT, ETL. Quality con...","[pipelines, introduction, batch, building, con...","[pipelines, introduction, batch, building, con...","[pipelines, introduction, batch, building, con...",...,"[introduction, etl, module, quality, elt]","[introduction, etl, module, quality, elt]","[introduction, etl, module, quality, elt]","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[quality, data, introduction, considerations, ..."
3,168000,Quality considerations,video,False,Introduction to Building Batch Data Pipelines,<p>This module reviews different methods of da...,"Module introduction. EL, ELT, ETL. Quality con...","[pipelines, introduction, batch, building, con...","[pipelines, introduction, batch, building, con...","[pipelines, introduction, batch, building, con...",...,"[introduction, etl, module, quality, elt]","[introduction, etl, module, quality, elt]","[introduction, etl, module, quality, elt]","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[quality, data, introduction, considerations, ..."
4,180000,How to carry out operations in BigQuery,video,False,Introduction to Building Batch Data Pipelines,<p>This module reviews different methods of da...,"Module introduction. EL, ELT, ETL. Quality con...","[pipelines, introduction, batch, building, con...","[pipelines, introduction, batch, building, con...","[pipelines, introduction, batch, building, con...",...,"[introduction, etl, module, quality, elt]","[introduction, etl, module, quality, elt]","[introduction, etl, module, quality, elt]","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[quality, data, introduction, considerations, ..."


#### Reaaranging and Renaming Columns

In [33]:
index_df = flatten_df.set_index(['meta-title', 'meta-description','record-title'])

In [35]:
flatten_df.head()

Unnamed: 0,record-duration,record-title,record-type,record-disabled,meta-title,meta-description,text,"keywords_distilbert-base-nli-mean-tokens_(1, 1)_None","keywords_distilbert-base-nli-mean-tokens_(1, 1)_english","keywords_distilbert-base-nli-mean-tokens_(1, 2)_None",...,"keywords_roberta-base-nli-stsb-mean-tokens_(1, 2)_english","keywords_roberta-base-nli-stsb-mean-tokens_(1, 3)_None","keywords_roberta-base-nli-stsb-mean-tokens_(1, 3)_english","keywords_bert-base-nli-mean-tokens_(1, 1)_None","keywords_bert-base-nli-mean-tokens_(1, 1)_english","keywords_bert-base-nli-mean-tokens_(1, 2)_None","keywords_bert-base-nli-mean-tokens_(1, 2)_english","keywords_bert-base-nli-mean-tokens_(1, 3)_None","keywords_bert-base-nli-mean-tokens_(1, 3)_english",wordwise_keywords
0,55000,Course Introduction,video,False,Introduction,"<p>In this module, we introduce the course and...",Course Introduction,"[introduction, course]","[introduction, course]","[introduction, course]",...,"[course, introduction]","[course, introduction]","[course, introduction]","[introduction, course]","[introduction, course]","[introduction, course]","[introduction, course]","[introduction, course]","[introduction, course]",[]
1,69000,Module introduction,video,False,Introduction to Building Batch Data Pipelines,<p>This module reviews different methods of da...,"Module introduction. EL, ELT, ETL. Quality con...","[pipelines, introduction, batch, building, con...","[pipelines, introduction, batch, building, con...","[pipelines, introduction, batch, building, con...",...,"[introduction, etl, module, quality, elt]","[introduction, etl, module, quality, elt]","[introduction, etl, module, quality, elt]","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[quality, data, introduction, considerations, ..."
2,220000,"EL, ELT, ETL",video,False,Introduction to Building Batch Data Pipelines,<p>This module reviews different methods of da...,"Module introduction. EL, ELT, ETL. Quality con...","[pipelines, introduction, batch, building, con...","[pipelines, introduction, batch, building, con...","[pipelines, introduction, batch, building, con...",...,"[introduction, etl, module, quality, elt]","[introduction, etl, module, quality, elt]","[introduction, etl, module, quality, elt]","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[quality, data, introduction, considerations, ..."
3,168000,Quality considerations,video,False,Introduction to Building Batch Data Pipelines,<p>This module reviews different methods of da...,"Module introduction. EL, ELT, ETL. Quality con...","[pipelines, introduction, batch, building, con...","[pipelines, introduction, batch, building, con...","[pipelines, introduction, batch, building, con...",...,"[introduction, etl, module, quality, elt]","[introduction, etl, module, quality, elt]","[introduction, etl, module, quality, elt]","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[quality, data, introduction, considerations, ..."
4,180000,How to carry out operations in BigQuery,video,False,Introduction to Building Batch Data Pipelines,<p>This module reviews different methods of da...,"Module introduction. EL, ELT, ETL. Quality con...","[pipelines, introduction, batch, building, con...","[pipelines, introduction, batch, building, con...","[pipelines, introduction, batch, building, con...",...,"[introduction, etl, module, quality, elt]","[introduction, etl, module, quality, elt]","[introduction, etl, module, quality, elt]","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[quality, data, introduction, considerations, ..."


In [36]:
print(f'index names ----> {index_df.index.names}')
print('\n')
print(f'column names ----> {index_df.columns}')

index names ----> ['meta-title', 'meta-description', 'record-title']


column names ----> Index(['record-duration', 'record-type', 'record-disabled', 'text',
       'keywords_distilbert-base-nli-mean-tokens_(1, 1)_None',
       'keywords_distilbert-base-nli-mean-tokens_(1, 1)_english',
       'keywords_distilbert-base-nli-mean-tokens_(1, 2)_None',
       'keywords_distilbert-base-nli-mean-tokens_(1, 2)_english',
       'keywords_distilbert-base-nli-mean-tokens_(1, 3)_None',
       'keywords_distilbert-base-nli-mean-tokens_(1, 3)_english',
       'keywords_roberta-base-nli-stsb-mean-tokens_(1, 1)_None',
       'keywords_roberta-base-nli-stsb-mean-tokens_(1, 1)_english',
       'keywords_roberta-base-nli-stsb-mean-tokens_(1, 2)_None',
       'keywords_roberta-base-nli-stsb-mean-tokens_(1, 2)_english',
       'keywords_roberta-base-nli-stsb-mean-tokens_(1, 3)_None',
       'keywords_roberta-base-nli-stsb-mean-tokens_(1, 3)_english',
       'keywords_bert-base-nli-mean-tokens_(1, 1)_None',

In [37]:
index_df.index.names = ['Titles', 'Description', 'Activities']
index_df.rename(columns={'record-duration':'Duration', 'record-type':'Type'}, errors='raise')

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Duration,Type,record-disabled,text,"keywords_distilbert-base-nli-mean-tokens_(1, 1)_None","keywords_distilbert-base-nli-mean-tokens_(1, 1)_english","keywords_distilbert-base-nli-mean-tokens_(1, 2)_None","keywords_distilbert-base-nli-mean-tokens_(1, 2)_english","keywords_distilbert-base-nli-mean-tokens_(1, 3)_None","keywords_distilbert-base-nli-mean-tokens_(1, 3)_english",...,"keywords_roberta-base-nli-stsb-mean-tokens_(1, 2)_english","keywords_roberta-base-nli-stsb-mean-tokens_(1, 3)_None","keywords_roberta-base-nli-stsb-mean-tokens_(1, 3)_english","keywords_bert-base-nli-mean-tokens_(1, 1)_None","keywords_bert-base-nli-mean-tokens_(1, 1)_english","keywords_bert-base-nli-mean-tokens_(1, 2)_None","keywords_bert-base-nli-mean-tokens_(1, 2)_english","keywords_bert-base-nli-mean-tokens_(1, 3)_None","keywords_bert-base-nli-mean-tokens_(1, 3)_english",wordwise_keywords
Titles,Description,Activities,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1
Introduction,"<p>In this module, we introduce the course and agenda</p>",Course Introduction,55000,video,False,Course Introduction,"[introduction, course]","[introduction, course]","[introduction, course]","[introduction, course]","[introduction, course]","[introduction, course]",...,"[course, introduction]","[course, introduction]","[course, introduction]","[introduction, course]","[introduction, course]","[introduction, course]","[introduction, course]","[introduction, course]","[introduction, course]",[]
Introduction to Building Batch Data Pipelines,"<p>This module reviews different methods of data loading: EL, ELT and ETL and when to use what</p>",Module introduction,69000,video,False,"Module introduction. EL, ELT, ETL. Quality con...","[pipelines, introduction, batch, building, con...","[pipelines, introduction, batch, building, con...","[pipelines, introduction, batch, building, con...","[pipelines, introduction, batch, building, con...","[pipelines, introduction, batch, building, con...","[pipelines, introduction, batch, building, con...",...,"[introduction, etl, module, quality, elt]","[introduction, etl, module, quality, elt]","[introduction, etl, module, quality, elt]","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[quality, data, introduction, considerations, ..."
Introduction to Building Batch Data Pipelines,"<p>This module reviews different methods of data loading: EL, ELT and ETL and when to use what</p>","EL, ELT, ETL",220000,video,False,"Module introduction. EL, ELT, ETL. Quality con...","[pipelines, introduction, batch, building, con...","[pipelines, introduction, batch, building, con...","[pipelines, introduction, batch, building, con...","[pipelines, introduction, batch, building, con...","[pipelines, introduction, batch, building, con...","[pipelines, introduction, batch, building, con...",...,"[introduction, etl, module, quality, elt]","[introduction, etl, module, quality, elt]","[introduction, etl, module, quality, elt]","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[quality, data, introduction, considerations, ..."
Introduction to Building Batch Data Pipelines,"<p>This module reviews different methods of data loading: EL, ELT and ETL and when to use what</p>",Quality considerations,168000,video,False,"Module introduction. EL, ELT, ETL. Quality con...","[pipelines, introduction, batch, building, con...","[pipelines, introduction, batch, building, con...","[pipelines, introduction, batch, building, con...","[pipelines, introduction, batch, building, con...","[pipelines, introduction, batch, building, con...","[pipelines, introduction, batch, building, con...",...,"[introduction, etl, module, quality, elt]","[introduction, etl, module, quality, elt]","[introduction, etl, module, quality, elt]","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[quality, data, introduction, considerations, ..."
Introduction to Building Batch Data Pipelines,"<p>This module reviews different methods of data loading: EL, ELT and ETL and when to use what</p>",How to carry out operations in BigQuery,180000,video,False,"Module introduction. EL, ELT, ETL. Quality con...","[pipelines, introduction, batch, building, con...","[pipelines, introduction, batch, building, con...","[pipelines, introduction, batch, building, con...","[pipelines, introduction, batch, building, con...","[pipelines, introduction, batch, building, con...","[pipelines, introduction, batch, building, con...",...,"[introduction, etl, module, quality, elt]","[introduction, etl, module, quality, elt]","[introduction, etl, module, quality, elt]","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[quality, data, introduction, considerations, ..."
Introduction to Building Batch Data Pipelines,"<p>This module reviews different methods of data loading: EL, ELT and ETL and when to use what</p>",Shortcomings,208000,video,False,"Module introduction. EL, ELT, ETL. Quality con...","[pipelines, introduction, batch, building, con...","[pipelines, introduction, batch, building, con...","[pipelines, introduction, batch, building, con...","[pipelines, introduction, batch, building, con...","[pipelines, introduction, batch, building, con...","[pipelines, introduction, batch, building, con...",...,"[introduction, etl, module, quality, elt]","[introduction, etl, module, quality, elt]","[introduction, etl, module, quality, elt]","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[quality, data, introduction, considerations, ..."
Introduction to Building Batch Data Pipelines,"<p>This module reviews different methods of data loading: EL, ELT and ETL and when to use what</p>",ETL to solve data quality issues,428000,video,False,"Module introduction. EL, ELT, ETL. Quality con...","[pipelines, introduction, batch, building, con...","[pipelines, introduction, batch, building, con...","[pipelines, introduction, batch, building, con...","[pipelines, introduction, batch, building, con...","[pipelines, introduction, batch, building, con...","[pipelines, introduction, batch, building, con...",...,"[introduction, etl, module, quality, elt]","[introduction, etl, module, quality, elt]","[introduction, etl, module, quality, elt]","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[quality, data, introduction, considerations, ..."
Introduction to Building Batch Data Pipelines,"<p>This module reviews different methods of data loading: EL, ELT and ETL and when to use what</p>",Introduction to Building Batch Data Pipelines,0,quiz,False,"Module introduction. EL, ELT, ETL. Quality con...","[pipelines, introduction, batch, building, con...","[pipelines, introduction, batch, building, con...","[pipelines, introduction, batch, building, con...","[pipelines, introduction, batch, building, con...","[pipelines, introduction, batch, building, con...","[pipelines, introduction, batch, building, con...",...,"[introduction, etl, module, quality, elt]","[introduction, etl, module, quality, elt]","[introduction, etl, module, quality, elt]","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[quality, data, introduction, considerations, ..."
Executing Spark on Dataproc,"<p>This module shows how to run Hadoop on Dataproc, how to leverage Cloud Storage, and how to optimize your Dataproc jobs.</p>",Module introduction,27000,video,False,Module introduction. The Hadoop ecosystem. Run...,"[dataproc, autoscaling, apache, lab, templates]","[dataproc, autoscaling, apache, lab, templates]","[dataproc, autoscaling, apache, lab, templates]","[dataproc, autoscaling, apache, lab, templates]","[dataproc, autoscaling, apache, lab, templates]","[dataproc, autoscaling, apache, lab, templates]",...,"[apache, spark, introduction, executing, optim...","[apache, spark, introduction, executing, optim...","[apache, spark, introduction, executing, optim...","[dataproc, templates, autoscaling, lab, optimi...","[dataproc, templates, autoscaling, lab, optimi...","[dataproc, templates, autoscaling, lab, optimi...","[dataproc, templates, autoscaling, lab, optimi...","[dataproc, templates, autoscaling, lab, optimi...","[dataproc, templates, autoscaling, lab, optimi...","[introduction, templates, storage, jobs, monit..."
Executing Spark on Dataproc,"<p>This module shows how to run Hadoop on Dataproc, how to leverage Cloud Storage, and how to optimize your Dataproc jobs.</p>",The Hadoop ecosystem,286000,video,False,Module introduction. The Hadoop ecosystem. Run...,"[dataproc, autoscaling, apache, lab, templates]","[dataproc, autoscaling, apache, lab, templates]","[dataproc, autoscaling, apache, lab, templates]","[dataproc, autoscaling, apache, lab, templates]","[dataproc, autoscaling, apache, lab, templates]","[dataproc, autoscaling, apache, lab, templates]",...,"[apache, spark, introduction, executing, optim...","[apache, spark, introduction, executing, optim...","[apache, spark, introduction, executing, optim...","[dataproc, templates, autoscaling, lab, optimi...","[dataproc, templates, autoscaling, lab, optimi...","[dataproc, templates, autoscaling, lab, optimi...","[dataproc, templates, autoscaling, lab, optimi...","[dataproc, templates, autoscaling, lab, optimi...","[dataproc, templates, autoscaling, lab, optimi...","[introduction, templates, storage, jobs, monit..."


In [39]:
flatten_df.head()

Unnamed: 0,record-duration,record-title,record-type,record-disabled,meta-title,meta-description,text,"keywords_distilbert-base-nli-mean-tokens_(1, 1)_None","keywords_distilbert-base-nli-mean-tokens_(1, 1)_english","keywords_distilbert-base-nli-mean-tokens_(1, 2)_None",...,"keywords_roberta-base-nli-stsb-mean-tokens_(1, 2)_english","keywords_roberta-base-nli-stsb-mean-tokens_(1, 3)_None","keywords_roberta-base-nli-stsb-mean-tokens_(1, 3)_english","keywords_bert-base-nli-mean-tokens_(1, 1)_None","keywords_bert-base-nli-mean-tokens_(1, 1)_english","keywords_bert-base-nli-mean-tokens_(1, 2)_None","keywords_bert-base-nli-mean-tokens_(1, 2)_english","keywords_bert-base-nli-mean-tokens_(1, 3)_None","keywords_bert-base-nli-mean-tokens_(1, 3)_english",wordwise_keywords
0,55000,Course Introduction,video,False,Introduction,"<p>In this module, we introduce the course and...",Course Introduction,"[introduction, course]","[introduction, course]","[introduction, course]",...,"[course, introduction]","[course, introduction]","[course, introduction]","[introduction, course]","[introduction, course]","[introduction, course]","[introduction, course]","[introduction, course]","[introduction, course]",[]
1,69000,Module introduction,video,False,Introduction to Building Batch Data Pipelines,<p>This module reviews different methods of da...,"Module introduction. EL, ELT, ETL. Quality con...","[pipelines, introduction, batch, building, con...","[pipelines, introduction, batch, building, con...","[pipelines, introduction, batch, building, con...",...,"[introduction, etl, module, quality, elt]","[introduction, etl, module, quality, elt]","[introduction, etl, module, quality, elt]","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[quality, data, introduction, considerations, ..."
2,220000,"EL, ELT, ETL",video,False,Introduction to Building Batch Data Pipelines,<p>This module reviews different methods of da...,"Module introduction. EL, ELT, ETL. Quality con...","[pipelines, introduction, batch, building, con...","[pipelines, introduction, batch, building, con...","[pipelines, introduction, batch, building, con...",...,"[introduction, etl, module, quality, elt]","[introduction, etl, module, quality, elt]","[introduction, etl, module, quality, elt]","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[quality, data, introduction, considerations, ..."
3,168000,Quality considerations,video,False,Introduction to Building Batch Data Pipelines,<p>This module reviews different methods of da...,"Module introduction. EL, ELT, ETL. Quality con...","[pipelines, introduction, batch, building, con...","[pipelines, introduction, batch, building, con...","[pipelines, introduction, batch, building, con...",...,"[introduction, etl, module, quality, elt]","[introduction, etl, module, quality, elt]","[introduction, etl, module, quality, elt]","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[quality, data, introduction, considerations, ..."
4,180000,How to carry out operations in BigQuery,video,False,Introduction to Building Batch Data Pipelines,<p>This module reviews different methods of da...,"Module introduction. EL, ELT, ETL. Quality con...","[pipelines, introduction, batch, building, con...","[pipelines, introduction, batch, building, con...","[pipelines, introduction, batch, building, con...",...,"[introduction, etl, module, quality, elt]","[introduction, etl, module, quality, elt]","[introduction, etl, module, quality, elt]","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[introduction, data, pipelines, building, bigq...","[quality, data, introduction, considerations, ..."


## 6) Exporting Keyword Extracted Output as Excel File
---

In [40]:
with pd.ExcelWriter('Keyword_Extraction_Comparison.xlsx', engine='xlsxwriter') as writer:
    index_df.to_excel(writer, sheet_name="Keyword_Extraction_Comparison")

**`Note:`** Information, when saved as a CSV (Comma Separated Values) file, may lose certain formatting or structural details, depending on the complexity of the data. CSV files are primarily used for storing tabular data and may not preserve specialized data types, hierarchical structures, or formatting elements found in the original data.

**No Significant change found in Keyword Extraction with different model and parameter tunning**