<a href="https://colab.research.google.com/github/khushboo-gehi/Py-ML-DL/blob/main/Scrape_my_Newsletter.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Set-up and Workflow

### Importing the packages

In [1]:
# Load the packages
import requests
from bs4 import BeautifulSoup

### Making a GET request

In [2]:
# Defining the url of the site
base_site = "https://www.linkedin.com/newsletters/ai-and-data-science-usecases-6877830316791226368/"

# Making a get request
response = requests.get(base_site)
response.status_code

200

In [3]:
# Extracting the HTML
html = response.content

# Checking that the reply is indeed HTML by inspecting the first 100 bytes
html[:100]

b'<!DOCTYPE html>\n\n    \n    \n    \n    \n    \n    \n    \n    \n    \n    \n\n    \n    <html lang="en">\n      '

### Making the soup

In [4]:
# Convert HTML to a BeautifulSoup object. This will allow us to parse out content from the HTML more easily.
# Using the default parser as it is included in Python
soup = BeautifulSoup(html, "html.parser")

### Exporting the HTML to a file

In [5]:
# It is extremely useful to be able to check this file when searching for where some info is located
# or to see how the document was parsed

# Exporting the HTML to a file
with open('newsletter.html', 'wb') as file:
    file.write(soup.prettify('utf-8'))


# The 'with' statement closes the file automatically, even if an error occurs (like a try-finally block)
# open() opens a file, creating it if it does not exist
# The 'wb' mode means write in binary: prettify('utf-8') returns bytes, not str
# .prettify() re-indents the HTML for better readability
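
The comments above describe `with` as shorthand for a `try`/`finally` block. A minimal sketch of that equivalence follows; the byte string and the file names are illustrative stand-ins, not part of the notebook, and the files are written to a temporary directory.

```python
import os
import tempfile

# Stand-in for soup.prettify('utf-8'), which returns bytes
html_bytes = b"<html><body><p>Hello</p></body></html>"
tmp_dir = tempfile.mkdtemp()

# Version 1: the 'with' statement closes the file automatically
path_with = os.path.join(tmp_dir, "with_version.html")
with open(path_with, "wb") as file:
    file.write(html_bytes)

# Version 2: roughly what 'with' does behind the scenes
path_manual = os.path.join(tmp_dir, "manual_version.html")
file = open(path_manual, "wb")
try:
    file.write(html_bytes)
finally:
    file.close()  # runs even if write() raises

# Both files should hold identical content
with open(path_with, "rb") as f1, open(path_manual, "rb") as f2:
    same_content = f1.read() == f2.read()
print(same_content)
```

The `with` form is usually preferable because the cleanup cannot be forgotten.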

# Searching and navigating the HTML tree

## Searching - find() and find_all()

In [6]:
# The soup variable (BeautifulSoup object) we defined earlier can be seen as representing the whole document
soup

<!DOCTYPE html>

<html lang="en">
<head>
<meta content="d_news-guest-series_entity" name="pageKey"/>
<!-- --> <meta content="en_US" name="locale"/>
<meta data-app-version="0.0.222" data-call-tree-id="AAXV2w+SvBEw8gNbq4OMAw==" data-service-name="news-guest-frontend" id="config"/>
<meta id="google-analytics-config"/>
<link href="https://www.linkedin.com/newsletters/ai-and-data-science-usecases-6877830316791226368" rel="canonical"/>
<!-- --><!-- -->
<!-- -->
<!-- -->
<link crossorigin="use-credentials" href="/homepage-guest/manifest.json" rel="manifest"/>
<link href="https://static-exp1.licdn.com/sc/h/al2o9zrvru7aqj8e1x2rzsrca" rel="icon"/>
<script>
            function getDfd() {let yFn,nFn;const p=new Promise(function(y, n){yFn=y;nFn=n;});p.resolve=yFn;p.reject=nFn;return p;}
            window.lazyloader = getDfd();
            window.tracking = getDfd();
            window.impressionTracking = getDfd();
            window.ingraphTracking = getDfd();
            window.appDetection = g

In [7]:
# We can search by tag name
# This returns the element with all its contents and nested elements inside
soup.find('head')

<head>
<meta content="d_news-guest-series_entity" name="pageKey"/>
<!-- --> <meta content="en_US" name="locale"/>
<meta data-app-version="0.0.222" data-call-tree-id="AAXV2w+SvBEw8gNbq4OMAw==" data-service-name="news-guest-frontend" id="config"/>
<meta id="google-analytics-config"/>
<link href="https://www.linkedin.com/newsletters/ai-and-data-science-usecases-6877830316791226368" rel="canonical"/>
<!-- --><!-- -->
<!-- -->
<!-- -->
<link crossorigin="use-credentials" href="/homepage-guest/manifest.json" rel="manifest"/>
<link href="https://static-exp1.licdn.com/sc/h/al2o9zrvru7aqj8e1x2rzsrca" rel="icon"/>
<script>
            function getDfd() {let yFn,nFn;const p=new Promise(function(y, n){yFn=y;nFn=n;});p.resolve=yFn;p.reject=nFn;return p;}
            window.lazyloader = getDfd();
            window.tracking = getDfd();
            window.impressionTracking = getDfd();
            window.ingraphTracking = getDfd();
            window.appDetection = getDfd();
            window.pemTra

In [8]:
# If there is no result it returns None
# Note: None is not displayed in IPython unless print() or repr() is used
soup.find('article')

In [9]:
# Display the None value
print(soup.find('article'))

None


In [10]:
# verify the type of output
type(soup.find('article'))

NoneType
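
Because `find()` returns `None` when nothing matches, calling a method on the result without checking it first raises an `AttributeError`. A small sketch of a guard, using a tiny made-up HTML string instead of the downloaded page:

```python
from bs4 import BeautifulSoup

# Tiny stand-in document; the notebook itself works on the downloaded page
sample_html = "<html><body><a href='/home'>Home</a></body></html>"
sample_soup = BeautifulSoup(sample_html, "html.parser")

article = sample_soup.find("article")   # no such tag, so this is None
first_link = sample_soup.find("a")      # found, so this is a Tag

# Guard before using the result; None has no get_text() method
article_text = article.get_text() if article is not None else ""
link_text = first_link.get_text() if first_link is not None else ""

print(repr(article_text), repr(link_text))
```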

In [11]:
# .find() returns only the first such result
soup.find('a')

<a class="nav__logo-link" data-tracking-control-name="news-guest_nav-header-logo" data-tracking-will-navigate="" href="/?trk=news-guest_nav-header-logo">
<span class="sr-only">LinkedIn</span>
<icon class="nav-logo" data-delayed-url="https://static-exp1.licdn.com/sc/h/8fkga714vy9b2wk5auqo5reeb"></icon>
</a>

In [12]:
# If we want all the results we use find_all() 
links = soup.find_all('a')
links

[<a class="nav__logo-link" data-tracking-control-name="news-guest_nav-header-logo" data-tracking-will-navigate="" href="/?trk=news-guest_nav-header-logo">
 <span class="sr-only">LinkedIn</span>
 <icon class="nav-logo" data-delayed-url="https://static-exp1.licdn.com/sc/h/8fkga714vy9b2wk5auqo5reeb"></icon>
 </a>,
 <a class="nav__button-tertiary" data-test-live-nav-primary-cta="" data-tracking-control-name="news-guest_nav-header-join" data-tracking-will-navigate="" href="https://www.linkedin.com/signup/cold-join?trk=news-guest_nav-header-join">
           Join now
         </a>,
 <a class="nav__button-secondary" data-tracking-control-name="news-guest_nav-header-signin" data-tracking-will-navigate="" href="https://www.linkedin.com/uas/login?fromSignIn=true&amp;trk=news-guest_nav-header-signin">Sign in</a>,
 <a class="top-card-layout__cta top-card-layout__cta--primary" data-tracking-control-name="news-guest_top-card-primary-button" data-tracking-will-navigate="" href="https://www.linkedin.c

In [13]:
# find_all() returns a ResultSet, which is a subclass of list
isinstance(links, list)

True

In [14]:
# We must be careful when using find_all()
# If no result is found it returns an empty list
soup.find_all('mini-card')

[]

In [15]:
# How many links are on the page?
len(links)

54

# Practical examples

## Links - absolute path URL

In [22]:
# Let's use the links variable we defined earlier for this example
# It contains all the 'a' tags on this page
links

[<a class="nav__logo-link" data-tracking-control-name="news-guest_nav-header-logo" data-tracking-will-navigate="" href="/?trk=news-guest_nav-header-logo">
 <span class="sr-only">LinkedIn</span>
 <icon class="nav-logo" data-delayed-url="https://static-exp1.licdn.com/sc/h/8fkga714vy9b2wk5auqo5reeb"></icon>
 </a>,
 <a class="nav__button-tertiary" data-test-live-nav-primary-cta="" data-tracking-control-name="news-guest_nav-header-join" data-tracking-will-navigate="" href="https://www.linkedin.com/signup/cold-join?trk=news-guest_nav-header-join">
           Join now
         </a>,
 <a class="nav__button-secondary" data-tracking-control-name="news-guest_nav-header-signin" data-tracking-will-navigate="" href="https://www.linkedin.com/uas/login?fromSignIn=true&amp;trk=news-guest_nav-header-signin">Sign in</a>,
 <a class="top-card-layout__cta top-card-layout__cta--primary" data-tracking-control-name="news-guest_top-card-primary-button" data-tracking-will-navigate="" href="https://www.linkedin.c

In [23]:
# Let's choose one link to manipulate
link = links[26]
link

<a class="social-action-bar__button" data-tracking-control-name="news-guest_share-update_like-cta" data-tracking-will-navigate="" href="https://www.linkedin.com/signup/cold-join?session_redirect=https%3A%2F%2Fwww%2Elinkedin%2Ecom%2Fnewsletters%2Fai-and-data-science-usecases-6877830316791226368&amp;trk=news-guest_share-update_like-cta">
<icon class="social-action-bar__icon" data-delayed-url="https://static-exp1.licdn.com/sc/h/a3tgrip1w48gegs3o6k4wh61q">
</icon>
<span class="social-action-bar__button-text">Like</span>
</a>

In [24]:
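# Get the link's text
# .string returns None here (so nothing is displayed) because the tag has more than one child;
# link.get_text() would return the combined text instead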
link.string
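
The cell above displays nothing because `.string` is `None` whenever a tag has more than one child. The sketch below reproduces this on a simplified copy of the "Like" button markup (attributes shortened for illustration) and shows `get_text()` as one way to get the text anyway.

```python
from bs4 import BeautifulSoup

# Simplified copy of the 'Like' button markup; attribute values are shortened
sample_html = (
    '<a class="social-action-bar__button" href="/signup">\n'
    '<icon class="social-action-bar__icon"></icon>\n'
    '<span class="social-action-bar__button-text">Like</span>\n'
    '</a>'
)
sample_link = BeautifulSoup(sample_html, "html.parser").find("a")

# None: the tag contains several children (newlines, <icon>, <span>)
string_value = sample_link.string

# Text of all descendants, with surrounding whitespace stripped
text_value = sample_link.get_text(strip=True)

print(string_value, text_value)
```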

In [25]:
# Extract the link's URL
link['href']

'https://www.linkedin.com/signup/cold-join?session_redirect=https%3A%2F%2Fwww%2Elinkedin%2Ecom%2Fnewsletters%2Fai-and-data-science-usecases-6877830316791226368&trk=news-guest_share-update_like-cta'

In [26]:
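# This particular href is already an absolute URL, but others on the page (e.g. '/?trk=news-guest_nav-header-logo') are relative
# To obtain an absolute URL in either case we will use urljoin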

from urllib.parse import urljoin

In [27]:
# Now we need the address of the current page + the relative URL to compute the full-path URL
base_site

'https://www.linkedin.com/newsletters/ai-and-data-science-usecases-6877830316791226368/'

In [28]:
relative_url = link['href']
relative_url

'https://www.linkedin.com/signup/cold-join?session_redirect=https%3A%2F%2Fwww%2Elinkedin%2Ecom%2Fnewsletters%2Fai-and-data-science-usecases-6877830316791226368&trk=news-guest_share-update_like-cta'

In [29]:
full_url = urljoin(base_site, relative_url)
full_url

'https://www.linkedin.com/signup/cold-join?session_redirect=https%3A%2F%2Fwww%2Elinkedin%2Ecom%2Fnewsletters%2Fai-and-data-science-usecases-6877830316791226368&trk=news-guest_share-update_like-cta'
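
The result above is identical to the input because the href was already absolute. To make the behaviour of `urljoin` clearer, here is a short sketch covering three kinds of href. The `issue-2` path is a hypothetical example; the other two values are taken from links on the page.

```python
from urllib.parse import urljoin

base = "https://www.linkedin.com/newsletters/ai-and-data-science-usecases-6877830316791226368/"

# Root-relative href: resolved against the site root
root_relative = urljoin(base, "/?trk=news-guest_nav-header-logo")

# Hypothetical path-relative href: resolved against the directory of the base URL
path_relative = urljoin(base, "issue-2")

# Absolute href: returned unchanged, the base is ignored
absolute = urljoin(base, "https://www.linkedin.com/uas/login")

print(root_relative)
print(path_relative)
print(absolute)
```

Note that the trailing slash on the base matters for path-relative hrefs: without it, the last path segment of the base is replaced rather than extended.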

## Processing multiple links at once

In [30]:
# We will work with:
links

[<a class="nav__logo-link" data-tracking-control-name="news-guest_nav-header-logo" data-tracking-will-navigate="" href="/?trk=news-guest_nav-header-logo">
 <span class="sr-only">LinkedIn</span>
 <icon class="nav-logo" data-delayed-url="https://static-exp1.licdn.com/sc/h/8fkga714vy9b2wk5auqo5reeb"></icon>
 </a>,
 <a class="nav__button-tertiary" data-test-live-nav-primary-cta="" data-tracking-control-name="news-guest_nav-header-join" data-tracking-will-navigate="" href="https://www.linkedin.com/signup/cold-join?trk=news-guest_nav-header-join">
           Join now
         </a>,
 <a class="nav__button-secondary" data-tracking-control-name="news-guest_nav-header-signin" data-tracking-will-navigate="" href="https://www.linkedin.com/uas/login?fromSignIn=true&amp;trk=news-guest_nav-header-signin">Sign in</a>,
 <a class="top-card-layout__cta top-card-layout__cta--primary" data-tracking-control-name="news-guest_top-card-primary-button" data-tracking-will-navigate="" href="https://www.linkedin.c

In [31]:
# Examining the links' addresses
[l.get('href') for l in links]   # Note: l['href'] raises a KeyError for any link without an href attribute, whereas l.get('href') returns None

['/?trk=news-guest_nav-header-logo',
 'https://www.linkedin.com/signup/cold-join?trk=news-guest_nav-header-join',
 'https://www.linkedin.com/uas/login?fromSignIn=true&trk=news-guest_nav-header-signin',
 'https://www.linkedin.com/signup/cold-join?session_redirect=https%3A%2F%2Fwww%2Elinkedin%2Ecom%2Fnewsletters%2Fai-and-data-science-usecases-6877830316791226368&trk=news-guest_top-card-primary-button',
 'https://www.linkedin.com/signup/cold-join?session_redirect=https%3A%2F%2Fwww%2Elinkedin%2Ecom%2Fnewsletters%2Fai-and-data-science-usecases-6877830316791226368&trk=news-guest_ellipsis-menu-sign-in-redirect',
 'https://www.linkedin.com/in/khushboo-gehi?trk=news-guest_mini-profile_title',
 'https://www.linkedin.com/signup/cold-join?session_redirect=https%3A%2F%2Fwww%2Elinkedin%2Ecom%2Fnewsletters%2Fai-and-data-science-usecases-6877830316791226368&trk=news-guest_share-update_ellipsis-menu-sign-in-redirect',
 'https://www.linkedin.com/pulse/not-so-perfect-chatbot-khushboo-gehi?trk=news-guest_

In [32]:
# Notice that some links don't have a URL (None appears)

# Dropping the links without an href attribute
clean_links = [l for l in links if l.get('href') is not None]
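
As an alternative to filtering after the fact, `find_all()` accepts attribute filters, and `href=True` matches only tags that carry an `href` attribute. A sketch on a made-up snippet of HTML:

```python
from bs4 import BeautifulSoup

# Made-up markup: the middle anchor has no href attribute
sample_html = (
    '<a href="/home">Home</a>'
    '<a name="top">Anchor without href</a>'
    '<a href="https://www.linkedin.com/uas/login">Sign in</a>'
)
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Approach from the cell above: fetch everything, then filter
filtered = [a for a in sample_soup.find_all("a") if a.get("href") is not None]

# Alternative: let find_all() do the filtering
with_href = sample_soup.find_all("a", href=True)

print([a["href"] for a in with_href])
```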

In [33]:
# Obtaining the href values (a mix of relative and absolute URLs)
relative_urls = [link.get('href') for link in clean_links]
relative_urls

['/?trk=news-guest_nav-header-logo',
 'https://www.linkedin.com/signup/cold-join?trk=news-guest_nav-header-join',
 'https://www.linkedin.com/uas/login?fromSignIn=true&trk=news-guest_nav-header-signin',
 'https://www.linkedin.com/signup/cold-join?session_redirect=https%3A%2F%2Fwww%2Elinkedin%2Ecom%2Fnewsletters%2Fai-and-data-science-usecases-6877830316791226368&trk=news-guest_top-card-primary-button',
 'https://www.linkedin.com/signup/cold-join?session_redirect=https%3A%2F%2Fwww%2Elinkedin%2Ecom%2Fnewsletters%2Fai-and-data-science-usecases-6877830316791226368&trk=news-guest_ellipsis-menu-sign-in-redirect',
 'https://www.linkedin.com/in/khushboo-gehi?trk=news-guest_mini-profile_title',
 'https://www.linkedin.com/signup/cold-join?session_redirect=https%3A%2F%2Fwww%2Elinkedin%2Ecom%2Fnewsletters%2Fai-and-data-science-usecases-6877830316791226368&trk=news-guest_share-update_ellipsis-menu-sign-in-redirect',
 'https://www.linkedin.com/pulse/not-so-perfect-chatbot-khushboo-gehi?trk=news-guest_

In [34]:
# Transforming to absolute path URLs
full_urls = [urljoin(base_site, url) for url in relative_urls]
full_urls

['https://www.linkedin.com/?trk=news-guest_nav-header-logo',
 'https://www.linkedin.com/signup/cold-join?trk=news-guest_nav-header-join',
 'https://www.linkedin.com/uas/login?fromSignIn=true&trk=news-guest_nav-header-signin',
 'https://www.linkedin.com/signup/cold-join?session_redirect=https%3A%2F%2Fwww%2Elinkedin%2Ecom%2Fnewsletters%2Fai-and-data-science-usecases-6877830316791226368&trk=news-guest_top-card-primary-button',
 'https://www.linkedin.com/signup/cold-join?session_redirect=https%3A%2F%2Fwww%2Elinkedin%2Ecom%2Fnewsletters%2Fai-and-data-science-usecases-6877830316791226368&trk=news-guest_ellipsis-menu-sign-in-redirect',
 'https://www.linkedin.com/in/khushboo-gehi?trk=news-guest_mini-profile_title',
 'https://www.linkedin.com/signup/cold-join?session_redirect=https%3A%2F%2Fwww%2Elinkedin%2Ecom%2Fnewsletters%2Fai-and-data-science-usecases-6877830316791226368&trk=news-guest_share-update_ellipsis-menu-sign-in-redirect',
 'https://www.linkedin.com/pulse/not-so-perfect-chatbot-khush

In [35]:
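# Extracting only internal URLs by matching on the domain
# Note: the filter string 'wikipedia.org' matches nothing on this LinkedIn page, so the result is empty;
# filtering on 'linkedin.com' would keep the internal URLs instead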
internal_links = [url for url in full_urls if 'wikipedia.org' in url]
internal_links

[]
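
A substring test such as `'linkedin.com' in url` also matches URLs that merely mention the domain in their query string. One possible refinement, sketched here under the assumption that "internal" means the host is `linkedin.com` or a subdomain of it, is to compare the parsed host name. The helper `is_internal` and the third sample URL are hypothetical.

```python
from urllib.parse import urlparse

# The first two URLs come from the output above; the third is a hypothetical external one
sample_urls = [
    "https://www.linkedin.com/?trk=news-guest_nav-header-logo",
    "https://www.linkedin.com/in/khushboo-gehi?trk=news-guest_mini-profile_title",
    "https://example.com/redirect?to=linkedin.com",
]

def is_internal(url, domain="linkedin.com"):
    """Return True if the URL's host is the domain or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return host == domain or host.endswith("." + domain)

internal_sample = [url for url in sample_urls if is_internal(url)]
print(internal_sample)
```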

# Extracting data from nested tags

In [36]:
# Our objective now is to extract all links nested inside div tags
# We start by collecting every div on the page

div_notes = soup.find_all("div")
div_notes

[<div class="search-bar__full-placeholder">
 <!-- -->              T'aipei
 <!-- --> </div>, <div class="switcher-tabs__trigger-and-tabs">
 <button aria-expanded="false" class="switcher-tabs__placeholder hide-on-mobile" data-tracking-control-name="news-guest_switcher-tabs-placeholder">
 <span class="switcher-tabs__placeholder-text"></span>
 <icon class="switcher-tabs__caret-down-filled onload" data-delayed-url="https://static-exp1.licdn.com/sc/h/7asbl4deqijhoy3z2ivveispv"></icon>
 </button>
 <div class="switcher-tabs hide-on-desktop hide-on-mobile">
 <ul aria-labelledby="switcher-label" class="switcher-tabs__list">
 <li class="switcher-tabs__tab switcher-tabs__tab--active">
 <button class="switcher-tabs__button" data-switcher-type="JOBS" data-tracking-control-name="news-guest_switcher-tabs-jobs-search-switcher" type="button">
               Jobs
             </button>
 </li>
 <li class="switcher-tabs__tab ">
 <button class="switcher-tabs__button" data-switcher-type="PEOPLE" data-tracki

In [37]:
div_notes[0]

<div class="search-bar__full-placeholder">
<!-- -->              T'aipei
<!-- --> </div>

In [38]:
# We can apply find() and find_all() to a tag in the same way we do it to the whole document
div_notes[0].find('a')

In [39]:
# A naive approach to get all links would be to use find
div_links = [div.find('a') for div in div_notes]
div_links

[None,
 None,
 None,
 None,
 None,
 None,
 <a class="nav__button-tertiary" data-test-live-nav-primary-cta="" data-tracking-control-name="news-guest_nav-header-join" data-tracking-will-navigate="" href="https://www.linkedin.com/signup/cold-join?trk=news-guest_nav-header-join">
           Join now
         </a>,
 None,
 <a class="top-card-layout__cta top-card-layout__cta--primary" data-tracking-control-name="news-guest_top-card-primary-button" data-tracking-will-navigate="" href="https://www.linkedin.com/signup/cold-join?session_redirect=https%3A%2F%2Fwww%2Elinkedin%2Ecom%2Fnewsletters%2Fai-and-data-science-usecases-6877830316791226368&amp;trk=news-guest_top-card-primary-button">
                     Join to Subscribe
                   </a>,
 None,
 <a class="top-card-layout__cta top-card-layout__cta--primary" data-tracking-control-name="news-guest_top-card-primary-button" data-tracking-will-navigate="" href="https://www.linkedin.com/signup/cold-join?session_redirect=https%3A%2F%2Fwww%2

In [40]:
len(div_links)

51

In [41]:
# However, some divs have more than 1 link
div_notes[6]

<div class="nav__cta-container">
<a class="nav__button-tertiary" data-test-live-nav-primary-cta="" data-tracking-control-name="news-guest_nav-header-join" data-tracking-will-navigate="" href="https://www.linkedin.com/signup/cold-join?trk=news-guest_nav-header-join">
          Join now
        </a>
<a class="nav__button-secondary" data-tracking-control-name="news-guest_nav-header-signin" data-tracking-will-navigate="" href="https://www.linkedin.com/uas/login?fromSignIn=true&amp;trk=news-guest_nav-header-signin">Sign in</a>
</div>

In [42]:
# This div has 2 links in it
div_notes[6].find_all('a')

[<a class="nav__button-tertiary" data-test-live-nav-primary-cta="" data-tracking-control-name="news-guest_nav-header-join" data-tracking-will-navigate="" href="https://www.linkedin.com/signup/cold-join?trk=news-guest_nav-header-join">
           Join now
         </a>,
 <a class="nav__button-secondary" data-tracking-control-name="news-guest_nav-header-signin" data-tracking-will-navigate="" href="https://www.linkedin.com/uas/login?fromSignIn=true&amp;trk=news-guest_nav-header-signin">Sign in</a>]

In [43]:
# Therefore we need to use find_all
# Let's use a for loop

# Define initially empty list of links
div_links = []

for div in div_notes:
    anchors = div.find_all('a')
    
    # Need to add every link from anchors to div_links
    for a in anchors:
        div_links.append(a)
    
    # Can use div_links.extend(anchors) instead of the for loop
    

In [44]:
div_links

[<a class="nav__button-tertiary" data-test-live-nav-primary-cta="" data-tracking-control-name="news-guest_nav-header-join" data-tracking-will-navigate="" href="https://www.linkedin.com/signup/cold-join?trk=news-guest_nav-header-join">
           Join now
         </a>,
 <a class="nav__button-secondary" data-tracking-control-name="news-guest_nav-header-signin" data-tracking-will-navigate="" href="https://www.linkedin.com/uas/login?fromSignIn=true&amp;trk=news-guest_nav-header-signin">Sign in</a>,
 <a class="top-card-layout__cta top-card-layout__cta--primary" data-tracking-control-name="news-guest_top-card-primary-button" data-tracking-will-navigate="" href="https://www.linkedin.com/signup/cold-join?session_redirect=https%3A%2F%2Fwww%2Elinkedin%2Ecom%2Fnewsletters%2Fai-and-data-science-usecases-6877830316791226368&amp;trk=news-guest_top-card-primary-button">
                     Join to Subscribe
                   </a>,
 <a class="ellipsis-menu__item-button" data-tracking-control-name="

In [45]:
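# We now have every link nested inside a div
# The count exceeds the 54 anchors on the page because a link inside nested divs is collected once per enclosing div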
len(div_links)

77

In [46]:
# Let's get the URLs
note_urls = [urljoin(base_site, l.get('href')) for l in div_links]
note_urls

['https://www.linkedin.com/signup/cold-join?trk=news-guest_nav-header-join',
 'https://www.linkedin.com/uas/login?fromSignIn=true&trk=news-guest_nav-header-signin',
 'https://www.linkedin.com/signup/cold-join?session_redirect=https%3A%2F%2Fwww%2Elinkedin%2Ecom%2Fnewsletters%2Fai-and-data-science-usecases-6877830316791226368&trk=news-guest_top-card-primary-button',
 'https://www.linkedin.com/signup/cold-join?session_redirect=https%3A%2F%2Fwww%2Elinkedin%2Ecom%2Fnewsletters%2Fai-and-data-science-usecases-6877830316791226368&trk=news-guest_ellipsis-menu-sign-in-redirect',
 'https://www.linkedin.com/signup/cold-join?session_redirect=https%3A%2F%2Fwww%2Elinkedin%2Ecom%2Fnewsletters%2Fai-and-data-science-usecases-6877830316791226368&trk=news-guest_top-card-primary-button',
 'https://www.linkedin.com/signup/cold-join?session_redirect=https%3A%2F%2Fwww%2Elinkedin%2Ecom%2Fnewsletters%2Fai-and-data-science-usecases-6877830316791226368&trk=news-guest_top-card-primary-button',
 'https://www.linked

In [47]:
len(note_urls)

77
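
The list holds 77 URLs although the page has only 54 anchors, so it contains repeats. If each page should be fetched only once, the duplicates can be dropped while keeping the original order. A sketch with shortened, hypothetical stand-ins for the entries of `note_urls`:

```python
# Shortened, hypothetical stand-ins for entries of note_urls
sample_urls = [
    "https://www.linkedin.com/signup/cold-join?trk=a",
    "https://www.linkedin.com/uas/login?trk=b",
    "https://www.linkedin.com/signup/cold-join?trk=a",
    "https://www.linkedin.com/pulse/some-article",
    "https://www.linkedin.com/uas/login?trk=b",
]

# dict keys are unique and keep insertion order (Python 3.7+)
unique_urls = list(dict.fromkeys(sample_urls))

print(len(sample_urls), len(unique_urls))
```

Unlike `set()`, this keeps the URLs in the order in which they first appeared.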

# Scraping multiple pages automatically - Extracting all the text from the note URLs

In [48]:
# We will use the links we obtained above
note_urls

['https://www.linkedin.com/signup/cold-join?trk=news-guest_nav-header-join',
 'https://www.linkedin.com/uas/login?fromSignIn=true&trk=news-guest_nav-header-signin',
 'https://www.linkedin.com/signup/cold-join?session_redirect=https%3A%2F%2Fwww%2Elinkedin%2Ecom%2Fnewsletters%2Fai-and-data-science-usecases-6877830316791226368&trk=news-guest_top-card-primary-button',
 'https://www.linkedin.com/signup/cold-join?session_redirect=https%3A%2F%2Fwww%2Elinkedin%2Ecom%2Fnewsletters%2Fai-and-data-science-usecases-6877830316791226368&trk=news-guest_ellipsis-menu-sign-in-redirect',
 'https://www.linkedin.com/signup/cold-join?session_redirect=https%3A%2F%2Fwww%2Elinkedin%2Ecom%2Fnewsletters%2Fai-and-data-science-usecases-6877830316791226368&trk=news-guest_top-card-primary-button',
 'https://www.linkedin.com/signup/cold-join?session_redirect=https%3A%2F%2Fwww%2Elinkedin%2Ecom%2Fnewsletters%2Fai-and-data-science-usecases-6877830316791226368&trk=news-guest_top-card-primary-button',
 'https://www.linked

In [49]:
# The objective is to get all the useful text from the linked pages

# We will do that by extracting all text contained in a paragraph element,
# for all paragraphs on a page,
# for all pages (in note_urls)

In [50]:
# Initialize a list to store the paragraph text of each webpage
par_text = []

# Loop through each URL in note_urls; enumerate supplies the 1-based counter
for i, url in enumerate(note_urls, start=1):

    # Connect to the webpage
    note_resp = requests.get(url)

    # Skip pages whose request did not succeed
    if note_resp.status_code != 200:
        print('Status code {0}: Skipping URL #{1}: {2}'.format(note_resp.status_code, i, url))
        continue

    # Print the iteration number and the URL to keep track of our place in the loop
    print('URL #{0}: {1}'.format(i, url))

    # Get the HTML from the webpage
    note_html = note_resp.content

    # Convert the HTML to a BeautifulSoup object (the 'lxml' parser requires the lxml package)
    note_soup = BeautifulSoup(note_html, 'lxml')

    # Find all "p" tags on the webpage
    note_pars = note_soup.find_all("p")

    # Get the text from each "p" tag
    text = [p.text for p in note_pars]

    # Append this page's list of paragraph strings to par_text
    par_text.append(text)


URL #1: https://www.linkedin.com/signup/cold-join?trk=news-guest_nav-header-join
URL #2: https://www.linkedin.com/uas/login?fromSignIn=true&trk=news-guest_nav-header-signin
URL #3: https://www.linkedin.com/signup/cold-join?session_redirect=https%3A%2F%2Fwww%2Elinkedin%2Ecom%2Fnewsletters%2Fai-and-data-science-usecases-6877830316791226368&trk=news-guest_top-card-primary-button
URL #4: https://www.linkedin.com/signup/cold-join?session_redirect=https%3A%2F%2Fwww%2Elinkedin%2Ecom%2Fnewsletters%2Fai-and-data-science-usecases-6877830316791226368&trk=news-guest_ellipsis-menu-sign-in-redirect
URL #5: https://www.linkedin.com/signup/cold-join?session_redirect=https%3A%2F%2Fwww%2Elinkedin%2Ecom%2Fnewsletters%2Fai-and-data-science-usecases-6877830316791226368&trk=news-guest_top-card-primary-button
URL #6: https://www.linkedin.com/signup/cold-join?session_redirect=https%3A%2F%2Fwww%2Elinkedin%2Ecom%2Fnewsletters%2Fai-and-data-science-usecases-6877830316791226368&trk=news-guest_top-card-primary-but

In [61]:
# Inspecting the result for one page (index 4)
par_text[4]

['Remove Photo',
 'or',
 'Already on LinkedIn? Sign in',
 'Looking to create a page for a business? Get help']

In [62]:
# We see that we have a list of all paragraph strings
# It would be more useful to have all the text as one string, not as a list of strings

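# Merging all paragraphs of the first page into one long string
# Note: "".join() inserts no separator, so adjacent paragraphs run together; " ".join() would keep them apart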
page_text = "".join(par_text[0])
page_text

'Remove PhotoorAlready on LinkedIn? Sign inLooking to create a page for a business? Get help'
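
In the output above the paragraphs run together ("Remove PhotoorAlready") because the empty string is used as the separator. The sketch below compares three separators on the same four paragraphs; which one is appropriate depends on how the text will be used later.

```python
# The four paragraph strings shown in the output above
paragraphs = [
    "Remove Photo",
    "or",
    "Already on LinkedIn? Sign in",
    "Looking to create a page for a business? Get help",
]

joined_empty = "".join(paragraphs)    # words at paragraph boundaries run together
joined_space = " ".join(paragraphs)   # one space between paragraphs
joined_lines = "\n".join(paragraphs)  # one paragraph per line

print(joined_empty)
print(joined_space)
```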

In [63]:
# Let's do that for all pages

# Merging all paragraphs for all pages
page_text = ["".join(text) for text in par_text]

# Inspect the result for some webpage
page_text[0]

'Remove PhotoorAlready on LinkedIn? Sign inLooking to create a page for a business? Get help'

In [64]:
# Inspect result
print(page_text[7])

Remove PhotoorAlready on LinkedIn? Sign inLooking to create a page for a business? Get help


In [65]:
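# Creating a dictionary with the (key, value) pairs being (url, text)
# zip() pairs the two lists by position and dict() turns the pairs into a mapping;
# this assumes no URL was skipped in the loop above, and duplicate URLs collapse into a single key
url_to_text = dict(zip(note_urls, page_text))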
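
Pairing by position is only correct if both lists have one entry per URL. If a request fails and its page is skipped, the text list becomes shorter and every later URL is matched with the wrong text. The sketch below illustrates the problem with hypothetical values, and shows one way around it: recording the URL together with its text at the moment the page is processed.

```python
# Hypothetical stand-ins: three URLs, but the request for 'b' failed and was skipped
urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
texts_from_successful_pages = ["text of a", "text of c"]

# Positional pairing: 'b' is wrongly matched with the text of 'c', and 'c' is dropped
misaligned = dict(zip(urls, texts_from_successful_pages))

# Safer: keep (url, text) pairs as they are produced, then build the dictionary
fetched = [
    ("https://example.com/a", "text of a"),
    ("https://example.com/c", "text of c"),
]
aligned = dict(fetched)

print(misaligned)
print(aligned)
```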

In [56]:
value_list = list(url_to_text.values())

In [57]:
value_list[4]

'This is a sequel of the article titled "Visual Representation of Topic Clusters (Part 1)" . Here we will look at the pcoa and tsne representations of topic models using LDAvis.PCOA or principal coordinate analysis is also known as classical multidimensional scaling. The visual below represents pcoa on an NMF(non-negative matrix factorization) model with 9 topics. This shows global clusters which represent the marginal distribution of the terms across the topics in the entire corpus. The bar chart showing top-30 most salient terms in the whole corpus on the right panel.  Observe that topic 1, to which 23.9% of the tokens in the corpus belong.  Clustered around the selected topic are topics 5,4,3 and 2. The bigger the circles, the higher the frequency of the said terms and the further the distance between, the lesser the similarity between terms amongst topics. In other words, topic 9 is more distinct in theme to topic 1 in comparison with topic 5. On the other hand, though it is simila

In [58]:
len(value_list)

13