# Working with Unstructured Data (Part II)

Today we are going to look at text data and what we can do with it. In today's lecture I am going to cover two topics:

1. Scraping text
2. Rudimentary Natural Language Processing


## Scraping Text/Data

[Text scraping](https://en.wikipedia.org/wiki/Text_scraping), or more generally [data scraping](https://en.wikipedia.org/wiki/Data_scraping), is the process of extracting text from web resources such as newspapers, web pages, or government sites. Data scraping is often needed to interface with a legacy system that does not expose its data through a convenient API. It involves making requests to websites and parsing the HTML responses to collect the desired information. The collected text can then feed data mining, data analysis, and many other applications.

We will be using the Python library BeautifulSoup, an easy-to-use tool for parsing HTML and XML documents and extracting useful data from them. Let's suppose we need to scrape quotes from a website such as "http://quotes.toscrape.com". First, we import the necessary libraries, BeautifulSoup and requests.

In [2]:
from bs4 import BeautifulSoup
import requests

For this task, we need texts of moderate size: neither too long nor too short. News articles are perfect for this purpose. I am going to use several sources.

* For English texts I am going to use the [Guardian](https://www.theguardian.com/international).
* For Turkish texts I am going to use [Milliyet](https://www.milliyet.com.tr/).
* For French texts I am going to use [Le Monde](https://www.lemonde.fr/).

We are going to pull articles on a specific subject using [RSS feeds](https://en.wikipedia.org/wiki/RSS). Each of these newspapers has its own RSS feeds.

### RSS Feeds

Let us start with the Guardian, whose RSS feeds follow a [predictable pattern](https://www.theguardian.com/help/feeds). For example, here are the feeds for some interesting subjects:

1. Economy: https://www.theguardian.com/economy/rss
2. Technology: https://www.theguardian.com/technology/rss
3. Film: https://www.theguardian.com/film/rss
4. NBA: https://www.theguardian.com/sport/nba/rss
5. Fashion: https://www.theguardian.com/fashion/rss

Each RSS feed is an XML file. We are going to parse it and extract the bits we are interested in:

In [3]:
from xmltodict import parse

with requests.get('https://www.theguardian.com/technology/rss') as link:
    raw = parse(link.text)

raw

{'rss': {'@xmlns:media': 'http://search.yahoo.com/mrss/',
  '@xmlns:dc': 'http://purl.org/dc/elements/1.1/',
  '@version': '2.0',
  'channel': {'title': 'Technology | The Guardian',
   'link': 'https://www.theguardian.com/uk/technology',
   'description': "Latest Technology news, comment and analysis from the Guardian, the world's leading liberal voice",
   'language': 'en-gb',
   'copyright': 'Guardian News and Media Limited or its affiliated companies. All rights reserved. 2023',
   'pubDate': 'Mon, 13 Nov 2023 11:39:06 GMT',
   'dc:date': '2023-11-13T11:39:06Z',
   'dc:language': 'en-gb',
   'dc:rights': 'Guardian News and Media Limited or its affiliated companies. All rights reserved. 2023',
   'image': {'title': 'The Guardian',
    'url': 'https://assets.guim.co.uk/images/guardian-logo-rss.c45beb1bafa34b347ac333af2e6fe23f.png',
    'link': 'https://www.theguardian.com'},
   'item': [{'title': 'Bose QC Ultra earbuds review: top-class noise cancelling with audio upgrade',
     'link

The response is a large XML data structure parsed into a Python dictionary. The part we are interested in is the list located at `raw['rss']['channel']['item']`. Let us look at its first element:

In [4]:
raw['rss']['channel']['item'][0]

{'title': 'Bose QC Ultra earbuds review: top-class noise cancelling with audio upgrade',
 'link': 'https://www.theguardian.com/technology/2023/nov/13/bose-qc-ultra-earbuds-review-top-class-noise-cancelling-with-audio-upgrade',
 'description': '<p>New immersive sound and better Bluetooth update for comfortable and popular earbuds</p><p>Bose’s commuter favourite QuietComfort earbuds have been given an upgrade, setting the standard with best-in-class noise cancelling and new immersive audio features.</p><p>Costing £300 (€350/$300/A$450) the QC Ultra earbuds are £20 more than the excellent <a href="https://www.theguardian.com/technology/2022/dec/14/bose-quietcomfort-earbuds-2-review-noise-cancelling-sound-battery-life">QC Earbuds II</a> they effectively replace, rubbing shoulders with the best in the business from Sennheiser, Sony and Apple.</p><p><strong>Water resistance:</strong> sweat resistant (IPX4)</p><p><strong>Connectivity:</strong> Bluetooth 5.3 (SBC, AAC, aptX Adaptive)</p><p><st
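
The live feed changes every day, so it can help to experiment with the same structure offline. The following is a minimal sketch, not the code used in this lecture: it parses a tiny hand-written feed (the titles and `example.com` links are made up) with the standard library's `xml.etree.ElementTree` instead of `xmltodict`, and collects the same `rss`, `channel`, `item` entries. The names `SAMPLE_FEED` and `feed_items` are hypothetical.

```python
import xml.etree.ElementTree as ET

# A made-up, minimal RSS 2.0 document with the same rss/channel/item layout
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item>
      <title>First story</title>
      <link>https://example.com/first</link>
    </item>
    <item>
      <title>Second story</title>
      <link>https://example.com/second</link>
    </item>
  </channel>
</rss>"""

def feed_items(xml_text):
    # The root element is <rss>; every article is an <item> under <channel>
    root = ET.fromstring(xml_text)
    return [{'title': item.findtext('title'), 'link': item.findtext('link')}
            for item in root.findall('./channel/item')]

sample_items = feed_items(SAMPLE_FEED)
for entry in sample_items:
    print(f"{entry['title']}\n{entry['link']}\n")
```

Each entry is a small dictionary with a `title` and a `link`, which is all we need from the real feed in order to fetch the articles.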

I am going to use a function to pull the links that I need:

In [5]:
def getSubjectGuardian(subject):
    # Fetch the RSS feed of a Guardian section (e.g. 'technology') and return its list of items
    with requests.get(f'https://www.theguardian.com/{subject}/rss') as link:
        raw = parse(link.text)
    return raw['rss']['channel']['item']

NBA = getSubjectGuardian('sport/nba')
Tech = getSubjectGuardian('technology')

Let us look at the links:

In [5]:
for x in NBA[:4]:
    print(f'{x["title"]}\n{x["link"]}\n\n')

Sixers’ Oubre Jr to miss ‘significant’ time after being hit by vehicle in Philadelphia
https://www.theguardian.com/sport/2023/nov/12/kelly-oubre-jr-hit-by-car-philadelphia-76ers


NBA Cup roundup: Curry’s last-second bucket lifts Warriors over Thunder
https://www.theguardian.com/sport/2023/nov/03/nba-in-season-tournament-scores-reports


‘I can’t wait’: excitement mounts for NBA’s first in-season tournament
https://www.theguardian.com/sport/2023/nov/03/nba-cup-preview-in-season-tournament


‘Unbelievable’ Victor Wembanyama explodes for 38 points in fifth NBA game
https://www.theguardian.com/sport/2023/nov/03/victor-wembanyama-spurs-suns-game-report




Let us retrieve the text from the first link:

In [6]:
with requests.get(NBA[0]['link']) as link:
    raw = BeautifulSoup(link.content,'html.parser')

print(raw)

<!DOCTYPE html>

<html lang="en">
<head>
<!-- Hello there, HTML enthusiast! -->
<title>Cam’ron and Mase: how two rap legends became US sports’ unlikely hit | US sports | The Guardian</title>
<meta content="The Harlem rap vets behind It Is What It Is are breathing fresh life into a hoary format with their chemistry, editorial freedom and unpredictability" name="description"/>
<link href="https://www.theguardian.com/sport/2023/nov/13/it-is-what-it-is-camron-mase-sports-talk-show-youtube" rel="canonical"/>
<meta charset="utf-8"/>
<meta content="width=device-width,minimum-scale=1,initial-scale=1" name="viewport"/>
<meta content="#052962" name="theme-color">
<link href="https://assets.guim.co.uk/static/frontend/manifest.json" rel="manifest"/>
<link href="https://assets.guim.co.uk/static/frontend/icons/homescreen/apple-touch-icon.svg" rel="apple-touch-icon" sizes="any"/>
<link href="https://assets.guim.co.uk/static/frontend/icons/homescreen/apple-touch-icon-512.png" rel="apple-touch-icon" si

The response is a large HTML document. This is where BeautifulSoup comes in: the call `BeautifulSoup(link.content,'html.parser')` above transforms the raw response into a BeautifulSoup object.

The `raw` object now holds a structured representation of the page, from which data can be easily extracted. To extract the article text, we find the HTML tags that contain it.

In [7]:
raw.find_all('p')

[<p>The Harlem rap vets behind It Is What It Is are breathing fresh life into a hoary format with their chemistry, editorial freedom and unpredictability</p>,
 <p class="dcr-1dpfw7k"><span class="dcr-11l45yn" style="color:#005689;font-weight:700;">T</span>V sports talk in America is a broken record. Every day brings the same warmed-over topics (Dallas Cowboys), the same personal triggers (LeBron James), the same stale mix of sportswriters (Skip Bayless) and ex-jocks (Michael Irvin) shouting over each other across the basic cable divide. Only one show manages to cut through the noise without really raising its voice.</p>,
 <p class="dcr-1dpfw7k">In late February an online-only production called It Is What It Is <a data-link-name="in body link" href="https://www.youtube.com/watch?v=kbp2L5s5iEc">premiered on YouTube</a> to little fanfare – a jarring setup for two hosts who are so far from understated. On one side of the dais, there’s Mase (government name: Mason Betha), the <a data-link-n

In this example, the `find_all` method selects every `<p>` (paragraph) tag. Next, we extract the text of each paragraph and join the pieces with spaces.

In [8]:
' '.join([x.text for x in raw.find_all('p')])

"The Harlem rap vets behind It Is What It Is are breathing fresh life into a hoary format with their chemistry, editorial freedom and unpredictability TV sports talk in America is a broken record. Every day brings the same warmed-over topics (Dallas Cowboys), the same personal triggers (LeBron James), the same stale mix of sportswriters (Skip Bayless) and ex-jocks (Michael Irvin) shouting over each other across the basic cable divide. Only one show manages to cut through the noise without really raising its voice. In late February an online-only production called It Is What It Is premiered on YouTube to little fanfare – a jarring setup for two hosts who are so far from understated. On one side of the dais, there’s Mase (government name: Mason Betha), the shiny suit-wearing star who hijacked the pop charts in the mid-90s with the Notorious BIG. On the other there’s Cam’ron (Cameron Giles), the neon-palette style icon who went platinum with Jay-Z’s Roc-A-Fella team. The notion that two r

Let us convert this to a function:

In [9]:
def getText(url):
    # Download the page and parse its HTML
    with requests.get(url) as link:
        raw = BeautifulSoup(link.content,'html.parser')
    # Join the text of all <p> tags into one string
    return ' '.join([x.text for x in raw.find_all('p')])

This function takes a URL and extracts the paragraph text from the page at that URL.
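
To see what this paragraph extraction amounts to without a network request, here is a minimal sketch that uses the standard library's `html.parser` on a made-up HTML snippet. It is a stand-in for illustration only, not a replacement for BeautifulSoup; `ParagraphCollector`, `text_from_html` and `SAMPLE_HTML` are hypothetical names.

```python
from html.parser import HTMLParser

class ParagraphCollector(HTMLParser):
    """Collect the text found inside <p>...</p> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.inside_p = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.inside_p = True
            self.paragraphs.append('')

    def handle_endtag(self, tag):
        if tag == 'p':
            self.inside_p = False

    def handle_data(self, data):
        # Text inside nested tags such as <a> still arrives here
        if self.inside_p:
            self.paragraphs[-1] += data

def text_from_html(html):
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()
    return ' '.join(collector.paragraphs)

# A made-up page: a headline, two paragraphs and a block we want to skip
SAMPLE_HTML = ('<html><body><h1>Headline</h1>'
               '<p>First <a href="#">linked</a> paragraph.</p>'
               '<div>Menu</div><p>Second paragraph.</p></body></html>')
extracted = text_from_html(SAMPLE_HTML)
print(extracted)
```

The headline and the `<div>` are dropped, while the link text inside the first paragraph is kept, which mirrors what `x.text` does for each `<p>` tag.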

In [10]:
tech_example = getText(Tech[0]['link'])
tech_example

'New immersive sound and better Bluetooth update for comfortable and popular earbuds Bose’s commuter favourite QuietComfort earbuds have been given an upgrade, setting the standard with best-in-class noise cancelling and new immersive audio features. Costing £300 (€350/$300/A$450) the QC Ultra earbuds are £20 more than the excellent QC Earbuds II they effectively replace, rubbing shoulders with the best in the business from Sennheiser, Sony and Apple. Unlike the revamped QC Ultra headphones, the earbuds have only been given a light touch on the outside with a few metallic accents here and there. Otherwise, they look like the Earbuds II: fairly large compared with competitors with flat stalks that point towards your mouth. The combination of silicone earbud tips and wings provide a comfortable and secure hold in your ear. Each bit comes in different sizes in the box so you can mix and match to get the right size for your ears, with a fit test available in the Bose Music app. Touch-sensi

In [12]:
nba_example = getText(NBA[3]['link'])
nba_example

'The race for the inaugural NBA Cup tips off on Friday night with all 30 teams set to compete for a newly minted trophy and cash bonuses When Greg Popovich is enthused, you know you’re onto something. The often-reserved coach of the San Antonio Spurs is known for keeping his composure and not using hyperbole. It’s what’s helped his team win five NBA championships during his ongoing tenure. Now, though, as the league is set to embark on its latest endeavor – the in-season tournament, beginning on Friday night – the 74-year-old coach says that the event is “exciting for everybody”. Speaking to reporters on Wednesday, the Spurs coach reminded those listening just how driven NBA players are. So, with a chance at winning the new NBA Cup, Pop says teams will rise to the challenge. “You have to understand all these guys are very competitive,” Popovich says of the NBA workforce. “If you put something out there like this, it just adds to that competition.” Many around the league, from current t

Of course, you don't have to go through RSS to get the links. You may provide them directly:

In [13]:
wiki_example = getText('https://en.wikipedia.org/wiki/Data_scraping')
wiki_example

'Data scraping is a technique where a computer program extracts data from human-readable output coming from another program.\n Normally, data transfer between programs is accomplished using data structures suited for automated processing by computers, not people.  Such interchange formats and protocols are typically rigidly structured, well-documented, easily parsed, and minimize ambiguity.  Very often, these transmissions are not human-readable at all.\n Thus, the key element that distinguishes data scraping from regular parsing is that the output being scraped is intended for display to an end-user, rather than as an input to another program. It is therefore usually neither documented nor structured for convenient parsing.  Data scraping often involves ignoring binary data (usually images or multimedia data), display formatting, redundant labels, superfluous commentary, and other information which is either irrelevant or hinders automated processing.\n Data scraping is most often don

## An Application: Word Frequencies

There are many ways one can use text. For today's lecture I am going to demonstrate one such use: text summarization and keyword extraction. Surprisingly, the task is a simple application of linear algebra.

### Splitting a text into words and sentences

We are going to split a human-language text into meaningful pieces. For this we need a natural language processing library called [NLTK](https://www.nltk.org/). Our first task is to split the text into sentences and words:


In [14]:
import nltk

from nltk.tokenize import sent_tokenize, word_tokenize

# nltk.download('punkt',download_dir='/home/kaygun/local/lib/nltk_data')

In [15]:
sent_tokenize(wiki_example)

['Data scraping is a technique where a computer program extracts data from human-readable output coming from another program.',
 'Normally, data transfer between programs is accomplished using data structures suited for automated processing by computers, not people.',
 'Such interchange formats and protocols are typically rigidly structured, well-documented, easily parsed, and minimize ambiguity.',
 'Very often, these transmissions are not human-readable at all.',
 'Thus, the key element that distinguishes data scraping from regular parsing is that the output being scraped is intended for display to an end-user, rather than as an input to another program.',
 'It is therefore usually neither documented nor structured for convenient parsing.',
 'Data scraping often involves ignoring binary data (usually images or multimedia data), display formatting, redundant labels, superfluous commentary, and other information which is either irrelevant or hinders automated processing.',
 'Data scrapi
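
To appreciate what `sent_tokenize` does for us, compare it with a crude rule: split after every period, question mark or exclamation mark that is followed by whitespace. The sketch below is only such a rule-based stand-in (it is not how NLTK works internally), and `naive_sentences` is a hypothetical name.

```python
import re

def naive_sentences(text):
    # Split after '.', '!' or '?' whenever it is followed by whitespace
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

print(naive_sentences('Data scraping is a technique. It is often needed.'))
print(naive_sentences('Dr. Smith scraped the site. It worked.'))
```

The first call gives the two sentences we expect, but the second one also splits after the abbreviation "Dr.", which is the kind of mistake a dedicated sentence tokenizer is designed to avoid.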

In [17]:
word_tokenize(wiki_example.lower())

['data',
 'scraping',
 'is',
 'a',
 'technique',
 'where',
 'a',
 'computer',
 'program',
 'extracts',
 'data',
 'from',
 'human-readable',
 'output',
 'coming',
 'from',
 'another',
 'program',
 '.',
 'normally',
 ',',
 'data',
 'transfer',
 'between',
 'programs',
 'is',
 'accomplished',
 'using',
 'data',
 'structures',
 'suited',
 'for',
 'automated',
 'processing',
 'by',
 'computers',
 ',',
 'not',
 'people',
 '.',
 'such',
 'interchange',
 'formats',
 'and',
 'protocols',
 'are',
 'typically',
 'rigidly',
 'structured',
 ',',
 'well-documented',
 ',',
 'easily',
 'parsed',
 ',',
 'and',
 'minimize',
 'ambiguity',
 '.',
 'very',
 'often',
 ',',
 'these',
 'transmissions',
 'are',
 'not',
 'human-readable',
 'at',
 'all',
 '.',
 'thus',
 ',',
 'the',
 'key',
 'element',
 'that',
 'distinguishes',
 'data',
 'scraping',
 'from',
 'regular',
 'parsing',
 'is',
 'that',
 'the',
 'output',
 'being',
 'scraped',
 'is',
 'intended',
 'for',
 'display',
 'to',
 'an',
 'end-user',
 ',',
 '

As you can see, there are many extraneous symbols, such as punctuation marks, that need to be cleaned.

### Regular expressions

I would like to remove every character that is neither alphanumeric nor whitespace from the text. For this I am going to use [regular expressions](https://en.wikipedia.org/wiki/Regular_expression). This is a technical tool that every knowledgeable data scientist should have in their arsenal.

In [20]:
import re
from collections import Counter

In [21]:
re.sub(r'[^\w\s]+','',wiki_example.lower())

'data scraping is a technique where a computer program extracts data from humanreadable output coming from another program\n normally data transfer between programs is accomplished using data structures suited for automated processing by computers not people  such interchange formats and protocols are typically rigidly structured welldocumented easily parsed and minimize ambiguity  very often these transmissions are not humanreadable at all\n thus the key element that distinguishes data scraping from regular parsing is that the output being scraped is intended for display to an enduser rather than as an input to another program it is therefore usually neither documented nor structured for convenient parsing  data scraping often involves ignoring binary data usually images or multimedia data display formatting redundant labels superfluous commentary and other information which is either irrelevant or hinders automated processing\n data scraping is most often done either to interface to 
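
The pattern `[^\w\s]+` matches any run of characters that are neither word characters (`\w`) nor whitespace (`\s`). A small sketch on a made-up sentence shows the effect, including the side effect visible above where hyphenated words are glued together. The second variant, which substitutes a space instead of deleting, is an assumption of mine and is not what the lecture code does.

```python
import re

sample = 'Well-documented, human-readable output!'

# Delete every run of characters that are neither word characters nor whitespace
deleted = re.sub(r'[^\w\s]+', '', sample.lower())

# Variant (an assumption, not used in the lecture): replace such runs with a space
spaced = re.sub(r'[^\w\s]+', ' ', sample.lower())

print(deleted)          # hyphenated words are merged
print(spaced.split())   # hyphenated words are split into their parts
```

Which variant is preferable depends on whether "human-readable" should count as one word or two.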

In [24]:
res = Counter(word_tokenize(re.sub(r'[^\w\s]+','',wiki_example.lower())))
dict(sorted(res.items(), key=lambda item: -item[1]))

{'the': 58,
 'to': 44,
 'a': 42,
 'data': 40,
 'and': 31,
 'of': 31,
 'is': 21,
 'scraping': 20,
 'for': 19,
 'from': 15,
 'or': 15,
 'system': 15,
 'in': 14,
 'this': 14,
 'as': 13,
 'web': 13,
 'an': 12,
 'screen': 12,
 'be': 11,
 'computer': 10,
 'can': 10,
 'program': 8,
 'output': 8,
 'by': 8,
 'such': 8,
 'often': 8,
 'that': 8,
 'are': 7,
 'with': 7,
 'more': 7,
 'on': 7,
 'processing': 6,
 'interface': 6,
 'used': 6,
 'use': 6,
 'not': 5,
 'which': 5,
 'human': 5,
 'modern': 5,
 'source': 5,
 'user': 5,
 'systems': 5,
 'humanreadable': 4,
 'another': 4,
 'using': 4,
 'automated': 4,
 'parsing': 4,
 'display': 4,
 'usually': 4,
 'api': 4,
 'applications': 4,
 'screens': 4,
 'through': 4,
 'scraper': 4,
 'techniques': 4,
 'report': 4,
 'technique': 3,
 'where': 3,
 'between': 3,
 'easily': 3,
 'these': 3,
 'intended': 3,
 'input': 3,
 'it': 3,
 'other': 3,
 'legacy': 3,
 'no': 3,
 'mechanism': 3,
 'provide': 3,
 'will': 3,
 'control': 3,
 'available': 3,
 'programming': 3,
 'term

The frequencies of the words that appear in this list follow [Zipf's Law](https://en.wikipedia.org/wiki/Zipf%27s_law). [Here](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4176592/) is a more academic treatment of the subject. But, as you may notice, most of these words are meaningless without a proper context. I would describe these words as words with *high noise and low signal* value. In natural language processing terminology, these are called [*stop words*](https://en.wikipedia.org/wiki/Stop_word).
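
To make the remark about Zipf's law concrete: the law says that the frequency of a word is roughly inversely proportional to its rank, so the product of rank and frequency should stay roughly constant. The sketch below checks this on the ten most frequent words, with the counts copied by hand from the output above; `top_counts` and `products` are hypothetical names.

```python
# The ten most frequent words and their counts, copied from the output above
top_counts = [('the', 58), ('to', 44), ('a', 42), ('data', 40), ('and', 31),
              ('of', 31), ('is', 21), ('scraping', 20), ('for', 19), ('from', 15)]

# Under Zipf's law, frequency is roughly proportional to 1 / rank,
# so rank * frequency should stay roughly constant
products = [rank * count for rank, (word, count) in enumerate(top_counts, start=1)]

for rank, ((word, count), product) in enumerate(zip(top_counts, products), start=1):
    print(f'{rank:2d} {word:10s} {count:3d} {product:4d}')
```

With a single short article the fit is only rough, but after the first few ranks the products settle within a fairly narrow band.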

In [25]:
from nltk.corpus import stopwords
nltk.download('stopwords',download_dir='/home/kaygun/local/lib/nltk_data/')

swEN = set(stopwords.words('english'))
swEN

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/kaygun/local/lib/nltk_data/...
[nltk_data]   Package stopwords is already up-to-date!


{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

In [26]:
cleaned = {k: v for k,v in res.items() if k not in swEN }
dict(sorted(cleaned.items(), key=lambda item: -item[1]))

{'data': 40,
 'scraping': 20,
 'system': 15,
 'web': 13,
 'screen': 12,
 'computer': 10,
 'program': 8,
 'output': 8,
 'often': 8,
 'processing': 6,
 'interface': 6,
 'used': 6,
 'use': 6,
 'human': 5,
 'modern': 5,
 'source': 5,
 'user': 5,
 'systems': 5,
 'humanreadable': 4,
 'another': 4,
 'using': 4,
 'automated': 4,
 'parsing': 4,
 'display': 4,
 'usually': 4,
 'api': 4,
 'applications': 4,
 'screens': 4,
 'scraper': 4,
 'techniques': 4,
 'report': 4,
 'technique': 3,
 'easily': 3,
 'intended': 3,
 'input': 3,
 'legacy': 3,
 'mechanism': 3,
 'provide': 3,
 'control': 3,
 'available': 3,
 'programming': 3,
 'terminal': 3,
 'instead': 3,
 'text': 3,
 'terminals': 3,
 'port': 3,
 'term': 3,
 'example': 3,
 'process': 3,
 'extract': 3,
 'mining': 3,
 'reports': 3,
 'normally': 2,
 'transfer': 2,
 'interchange': 2,
 'structured': 2,
 'enduser': 2,
 'convenient': 2,
 'involves': 2,
 'images': 2,
 'information': 2,
 'either': 2,
 'done': 2,
 'thirdparty': 2,
 'case': 2,
 'reasons': 2,
 '

Let us write a function for this:

In [28]:
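def processText(text, sws):
    # Lowercase, strip punctuation, and split into words
    words = word_tokenize(re.sub(r'[^\w\s]+','',text.lower()))
    # Count every word that is not a stop word
    res = Counter([w for w in words if w not in sws])
    # Return the counts sorted from most to least frequent
    return dict(sorted(res.items(), key=lambda item: -item[1]))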

In [29]:
processText(wiki_example,swEN)

{'data': 40,
 'scraping': 20,
 'system': 15,
 'web': 13,
 'screen': 12,
 'computer': 10,
 'program': 8,
 'output': 8,
 'often': 8,
 'processing': 6,
 'interface': 6,
 'used': 6,
 'use': 6,
 'human': 5,
 'modern': 5,
 'source': 5,
 'user': 5,
 'systems': 5,
 'humanreadable': 4,
 'another': 4,
 'using': 4,
 'automated': 4,
 'parsing': 4,
 'display': 4,
 'usually': 4,
 'api': 4,
 'applications': 4,
 'screens': 4,
 'scraper': 4,
 'techniques': 4,
 'report': 4,
 'technique': 3,
 'easily': 3,
 'intended': 3,
 'input': 3,
 'legacy': 3,
 'mechanism': 3,
 'provide': 3,
 'control': 3,
 'available': 3,
 'programming': 3,
 'terminal': 3,
 'instead': 3,
 'text': 3,
 'terminals': 3,
 'port': 3,
 'term': 3,
 'example': 3,
 'process': 3,
 'extract': 3,
 'mining': 3,
 'reports': 3,
 'normally': 2,
 'transfer': 2,
 'interchange': 2,
 'structured': 2,
 'enduser': 2,
 'convenient': 2,
 'involves': 2,
 'images': 2,
 'information': 2,
 'either': 2,
 'done': 2,
 'thirdparty': 2,
 'case': 2,
 'reasons': 2,
 '

In [30]:
processText(nba_example,swEN)

{'nba': 17,
 'teams': 12,
 'tournament': 11,
 'inseason': 9,
 'new': 9,
 'kerr': 9,
 'says': 8,
 'also': 8,
 'like': 7,
 'games': 7,
 'something': 6,
 'team': 6,
 'night': 5,
 'coach': 5,
 'league': 5,
 'first': 5,
 'year': 5,
 'inaugural': 4,
 'cup': 4,
 'trophy': 4,
 'win': 4,
 'players': 4,
 'former': 4,
 'played': 4,
 'december': 4,
 'championship': 4,
 'play': 4,
 'friday': 3,
 'popovich': 3,
 'event': 3,
 'reporters': 3,
 'winning': 3,
 'competition': 3,
 'many': 3,
 'schedule': 3,
 'nbas': 3,
 'season': 3,
 'group': 3,
 'game': 3,
 'player': 3,
 'leagues': 3,
 'may': 3,
 'going': 3,
 'warriors': 3,
 'lot': 3,
 'would': 3,
 'tips': 2,
 '30': 2,
 'set': 2,
 'cash': 2,
 'spurs': 2,
 'whats': 2,
 'five': 2,
 'though': 2,
 'chance': 2,
 'competitive': 2,
 'put': 2,
 'around': 2,
 'current': 2,
 'coaches': 2,
 'even': 2,
 'silver': 2,
 'regularseason': 2,
 'thats': 2,
 'soccer': 2,
 'else': 2,
 'records': 2,
 'stage': 2,
 'others': 2,
 'knockout': 2,
 'las': 2,
 'vegas': 2,
 'quarterf

In [31]:
processText(tech_example,swEN)

{'earbuds': 21,
 'case': 9,
 'audio': 8,
 'bose': 8,
 'sound': 7,
 'good': 7,
 'cost': 7,
 'noise': 6,
 'ultra': 6,
 'new': 5,
 'immersive': 5,
 'bluetooth': 5,
 'cancelling': 5,
 'qc': 5,
 'ii': 5,
 'best': 5,
 'battery': 5,
 'comfortable': 4,
 'standard': 4,
 'excellent': 4,
 'earbud': 4,
 'bit': 4,
 'get': 4,
 'music': 4,
 'life': 4,
 'hours': 4,
 'x': 4,
 'charging': 4,
 'sounds': 4,
 'quietcomfort': 3,
 'headphones': 3,
 'fairly': 3,
 'fit': 3,
 'available': 3,
 'modes': 3,
 'well': 3,
 'full': 3,
 'aptx': 3,
 'adaptive': 3,
 'set': 3,
 'mode': 3,
 'also': 3,
 'call': 3,
 'quality': 3,
 'making': 3,
 'better': 2,
 'boses': 2,
 'given': 2,
 'bestinclass': 2,
 'features': 2,
 '20': 2,
 'sennheiser': 2,
 'sony': 2,
 'apple': 2,
 'unlike': 2,
 'outside': 2,
 'metallic': 2,
 'accents': 2,
 'otherwise': 2,
 'like': 2,
 'large': 2,
 'point': 2,
 'tips': 2,
 'provide': 2,
 'ear': 2,
 'different': 2,
 'match': 2,
 'ears': 2,
 'app': 2,
 'controls': 2,
 'volume': 2,
 'playback': 2,
 '24': 2
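
The dictionaries above are long, and for keyword extraction we usually care only about the first few entries. A minimal offline sketch, assuming a tiny hand-picked stop word list and plain `str.split` in place of NLTK (`TOY_STOPWORDS` and `top_keywords` are hypothetical names), uses `Counter.most_common` to keep the top `k` words:

```python
import re
from collections import Counter

# A tiny hand-picked stop word list, standing in for NLTK's much longer one
TOY_STOPWORDS = {'the', 'a', 'of', 'and', 'is', 'to', 'from'}

def top_keywords(text, sws, k=2):
    # Same cleaning as processText, but with str.split instead of word_tokenize
    words = re.sub(r'[^\w\s]+', '', text.lower()).split()
    counts = Counter(w for w in words if w not in sws)
    # most_common(k) returns the k (word, count) pairs with the highest counts
    return counts.most_common(k)

sample = ('Data scraping is a technique to extract data from the output of a program. '
          'Scraping data is common.')
keywords = top_keywords(sample, TOY_STOPWORDS)
print(keywords)
```

For the real articles one would pass `swEN` and a larger `k`; the idea is the same.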

### Stemming

As you may notice above, certain words appear in different forms, such as 'use' vs 'used', 'technology' vs 'technologies', etc. One way of eliminating the different variations of a word is called [*stemming*](https://en.wikipedia.org/wiki/Stemming).

In [32]:
from nltk.stem.snowball import SnowballStemmer

In [33]:
stemmer = SnowballStemmer('english')
stemmer.stem('tokenization')

'token'

Let us modify our function now:

In [34]:
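def wordFrequencies(text, sws, st):
    # Lowercase, strip punctuation, and split into words
    words = word_tokenize(re.sub(r'[^\w\s]+','',text.lower()))
    # Drop stop words, then count the stems produced by the stemming function st
    res = Counter([st(w) for w in words if w not in sws])
    # Return the counts sorted from most to least frequent
    return dict(sorted(res.items(), key=lambda item: -item[1]))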

In [35]:
wordFrequencies(wiki_example,swEN,stemmer.stem)

{'data': 40,
 'scrape': 22,
 'system': 20,
 'use': 18,
 'screen': 16,
 'web': 13,
 'program': 12,
 'comput': 11,
 'process': 9,
 'output': 8,
 'often': 8,
 'interfac': 8,
 'report': 8,
 'techniqu': 7,
 'extract': 7,
 'display': 6,
 'control': 6,
 'human': 6,
 'termin': 6,
 'user': 6,
 'autom': 5,
 'pars': 5,
 'provid': 5,
 'api': 5,
 'modern': 5,
 'sourc': 5,
 'scraper': 5,
 'humanread': 4,
 'anoth': 4,
 'structur': 4,
 'format': 4,
 'endus': 4,
 'usual': 4,
 'involv': 4,
 'case': 4,
 'requir': 4,
 'applic': 4,
 'captur': 4,
 'connect': 4,
 'common': 4,
 'tool': 4,
 'easili': 3,
 'intend': 3,
 'input': 3,
 'document': 3,
 'legaci': 3,
 'mechan': 3,
 'avail': 3,
 'result': 3,
 'practic': 3,
 'instead': 3,
 'refer': 3,
 'text': 3,
 'port': 3,
 'term': 3,
 'exampl': 3,
 'page': 3,
 'file': 3,
 'mine': 3,
 'develop': 3,
 'normal': 2,
 'transfer': 2,
 'interchang': 2,
 'minim': 2,
 'conveni': 2,
 'imag': 2,
 'inform': 2,
 'either': 2,
 'done': 2,
 'thirdparti': 2,
 'oper': 2,
 'reason': 2,
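
The effect of stemming on the counts is easiest to see on a few entries. In the sketch below the counts before stemming are copied from the earlier `processText` output, and a hand-written lookup table stands in for the real stemmer (`TOY_STEMS` and `toy_stem` are hypothetical and for illustration only).

```python
from collections import Counter

# Counts before stemming, copied from the processText output for the Wikipedia article
raw_counts = {'system': 15, 'systems': 5, 'screen': 12, 'screens': 4}

# A hand-written lookup table standing in for a real stemmer (illustration only)
TOY_STEMS = {'systems': 'system', 'screens': 'screen'}

def toy_stem(word):
    return TOY_STEMS.get(word, word)

# Words that share a stem have their counts added together
merged = Counter()
for word, count in raw_counts.items():
    merged[toy_stem(word)] += count

print(dict(merged))
```

The merged totals for 'system' and 'screen' agree with the stemmed output above, where the singular and plural forms were counted together.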


In [36]:
wordFrequencies(nba_example,swEN,stemmer.stem)

{'team': 18,
 'nba': 17,
 'tournament': 11,
 'game': 10,
 'play': 10,
 'inseason': 9,
 'new': 9,
 'kerr': 9,
 'leagu': 8,
 'say': 8,
 'also': 8,
 'coach': 7,
 'win': 7,
 'player': 7,
 'like': 7,
 'someth': 6,
 'competit': 6,
 'year': 6,
 'night': 5,
 'trophi': 5,
 'championship': 5,
 'first': 5,
 'inaugur': 4,
 'cup': 4,
 'friday': 4,
 'former': 4,
 'group': 4,
 'decemb': 4,
 'final': 4,
 'go': 4,
 'popovich': 3,
 'event': 3,
 'excit': 3,
 'report': 3,
 'mani': 3,
 'schedul': 3,
 'give': 3,
 'nbas': 3,
 'season': 3,
 'follow': 3,
 'hope': 3,
 'may': 3,
 'warrior': 3,
 'lot': 3,
 'would': 3,
 'tip': 2,
 '30': 2,
 'set': 2,
 'cash': 2,
 'know': 2,
 'spur': 2,
 'what': 2,
 'five': 2,
 'though': 2,
 'chanc': 2,
 'put': 2,
 'add': 2,
 'around': 2,
 'current': 2,
 'even': 2,
 'silver': 2,
 'regularseason': 2,
 'that': 2,
 'soccer': 2,
 'els': 2,
 'slate': 2,
 'featur': 2,
 'record': 2,
 'stage': 2,
 'other': 2,
 'knockout': 2,
 'las': 2,
 'vega': 2,
 'quarterfin': 2,
 'semifin': 2,
 'espn': 

In [37]:
wordFrequencies(tech_example,swEN,stemmer.stem)

{'earbud': 25,
 'sound': 13,
 'bose': 10,
 'case': 9,
 'audio': 8,
 'cost': 8,
 'good': 7,
 'nois': 6,
 'ultra': 6,
 'mode': 6,
 'batteri': 6,
 'charg': 6,
 'new': 5,
 'immers': 5,
 'bluetooth': 5,
 'cancel': 5,
 'qc': 5,
 'ii': 5,
 'best': 5,
 'comfort': 4,
 'set': 4,
 'standard': 4,
 'excel': 4,
 'replac': 4,
 'ear': 4,
 'bit': 4,
 'get': 4,
 'music': 4,
 'life': 4,
 'hour': 4,
 'x': 4,
 'call': 4,
 'make': 4,
 'quietcomfort': 3,
 'upgrad': 3,
 'headphon': 3,
 'fair': 3,
 'come': 3,
 'fit': 3,
 'avail': 3,
 'control': 3,
 'well': 3,
 'full': 3,
 'resist': 3,
 'connect': 3,
 'aptx': 3,
 'adapt': 3,
 'also': 3,
 'qualiti': 3,
 'one': 3,
 'product': 3,
 'better': 2,
 'commut': 2,
 'given': 2,
 'bestinclass': 2,
 'featur': 2,
 '20': 2,
 'sennheis': 2,
 'soni': 2,
 'appl': 2,
 'unlik': 2,
 'outsid': 2,
 'metal': 2,
 'accent': 2,
 'otherwis': 2,
 'look': 2,
 'like': 2,
 'larg': 2,
 'point': 2,
 'combin': 2,
 'tip': 2,
 'provid': 2,
 'differ': 2,
 'size': 2,
 'match': 2,
 'app': 2,
 'volum'

### Now, in Turkish...

Let us repeat what we have done for a Turkish text. First, let us get the RSS feed from Milliyet:

In [38]:
def getSubjectMilliyet(subject):
    with requests.get(f'https://www.milliyet.com.tr/rss/rssNew/{subject}Rss.xml') as link:
        # The text is re-encoded as ISO-8859-9 (Turkish) bytes before it is handed to the XML parser
        raw = parse(link.text.encode('iso8859-9'))
    return raw['rss']['channel']['item']

In [39]:
ekonomi = getSubjectMilliyet('ekonomi')
ekonomi

[{'guid': {'@isPermaLink': 'false', '#text': '7033985'},
  'title': 'Haberler: Ekmek fabrikasında şaşırtan kampanya! 100 TL bozuk para getirene 2 ekmek',
  'description': '<img src="https://image.milimaj.com/i/milliyet/75/460x340/65520cb886b2472b80216a28.jpg"/>Sivas\'ta bir ekmek fabrikasında cama asılan kağıt görenleri şaşırttı. Bozuk para sıkıntısı çeken fırında 100 lira getirene iki ekmek bedava veriliyor. Kampanyanın büyük ilgi gördüğünü söyleyen fırın sahibi, "bu kampanyayla hem artık bozuk para sıkıntısı çekmiyoruz hem de insanlara destek oluyoruz" dedi.',
  'pubDate': 'Mon, 13 Nov 2023 14:51:47  Z',
  'atom:link': {'@href': 'https://www.milliyet.com.tr/galeri/sasirtan-kampanya-100-tl-bozuk-para-getirene-2-ekmek-7033985'}},
 {'guid': {'@isPermaLink': 'false', '#text': '7033986'},
  'title': 'Bakan Şimşek: Cari dengede 7.3 milyar dolar iyileşme sağlandı',
  'description': '<img src="https://image.milimaj.com/i/milliyet/75/460x340/65520c8a86b24a17c08827a7.jpg"/><p>Türkiye Cumhuriye

In [41]:
for x in ekonomi[:4]:
    print(f'{x["title"]}\n{x["atom:link"]["@href"]}\n')

Haberler: Ekmek fabrikasında şaşırtan kampanya! 100 TL bozuk para getirene 2 ekmek
https://www.milliyet.com.tr/galeri/sasirtan-kampanya-100-tl-bozuk-para-getirene-2-ekmek-7033985

Bakan Şimşek: Cari dengede 7.3 milyar dolar iyileşme sağlandı
https://www.milliyet.com.tr/ekonomi/bakan-simsek-cari-dengede-7-3-milyar-dolar-iyilesme-saglandi-7033986

Türk Hava Yolları, 355 uçak alacak
https://www.milliyet.com.tr/ekonomi/turk-hava-yollari-355-ucak-alacak-7033934

Haberler: Ev sahipleri pes dedirtti! İşte yeni yöntemleri
https://www.milliyet.com.tr/ekonomi/ev-sahiplerinin-yeni-yontemi-pes-dedirtti-yikim-oncesi-kiralik-daire-7033753



In [42]:
ekonomi_example = getText(ekonomi[0]['atom:link']['@href'])
ekonomi_example

"13.11.2023 - 14:48 | Son Güncellenme: 13.11.2023 - 14:51 İHA Sivas’ta madeni para sıkıntısı yaşanan bir fırında iş yerinin camlarına, “100 TL madeni para getirene 2 ekmek bedava” yazıları astı. Sivas’ta bulunan bir fırın işletmesi madeni para sıkıntısını gidermek için ürettiği çözüm dikkat çekti. İş yerinin camlarına ve duvarlarına “100 TL madeni para getirene 2 ekmek bedava” yazısı asan esnaf bu sayede hem bozuk para sıkıntısını giderdi hem de vatandaşa destek oldu. işletmesinin muhasebe bölümünde çalışan Alperen Karakaya, bozuk para getirene ekmek hediye ettiklerini ifade ederek,“ Bozuk para konusunda büyük bir sıkıntımız vardı, bankalardan da istediğimiz halde bozuk para temini edemiyorduk. Türkiye’de bunun örnekleri olduğu için biz deneyelim dedik. 100 TL bozuk para getirene 2 ekmek hediye ediyoruz. Geri dönüşler çok güzel oldu, evdeki çocuklar kumbaralarında biriktirdiği madeni paraları getirdiler. Böylelikle vatandaşa da destek olduk. Şuanda günlük madeni para ihtiyacımızı karşı

In [43]:
swTR = stopwords.words('turkish')
trStemmer = SnowballStemmer('turkish')

ValueError: The language 'turkish' is not supported.

In [45]:
from snowballstemmer import turkish_stemmer 

trStemmer = turkish_stemmer.TurkishStemmer()
trStemmer.stemWord('uygarlaştıramadıklarımızdanmışcasına')

'uygarlaştıramadık'

In [46]:
wordFrequencies(ekonomi_example,swTR,trStemmer.stemWord)

{'par': 14,
 'made': 9,
 'bozuk': 9,
 'ekmek': 7,
 'bir': 4,
 'fır': 4,
 '100': 4,
 'tl': 4,
 'getire': 4,
 '2': 4,
 'para': 4,
 'yer': 3,
 'esnaf': 3,
 'ol': 3,
 'çocuk': 3,
 'kumbara': 3,
 'getir': 3,
 '13112023': 2,
 'son': 2,
 'sivas': 2,
 'iş': 2,
 'cam': 2,
 'bedav': 2,
 'yazı': 2,
 'sıkıntı': 2,
 'çöz': 2,
 'vatandaş': 2,
 'destek': 2,
 'hedi': 2,
 'ifa': 2,
 'var': 2,
 'türkiye': 2,
 'güzel': 2,
 'ev': 2,
 'böylelik': 2,
 'oluyor': 2,
 'haber': 2,
 'milliyetcomtr': 2,
 '1448': 1,
 'güncellenme': 1,
 '1451': 1,
 'iha': 1,
 'sıkıntıs': 1,
 'yaşana': 1,
 'as': 1,
 'buluna': 1,
 'işletmes': 1,
 'gidermek': 1,
 'ürettik': 1,
 'dikkat': 1,
 'çek': 1,
 'duvar': 1,
 'yazıs': 1,
 'asa': 1,
 'saye': 1,
 'gider': 1,
 'işletme': 1,
 'muhasep': 1,
 'bölüm': 1,
 'çalışa': 1,
 'alpere': 1,
 'karaka': 1,
 'ettik': 1,
 'ederek': 1,
 'konu': 1,
 'büyük': 1,
 'sıkınt': 1,
 'banka': 1,
 'istedik': 1,
 'halde': 1,
 'tem': 1,
 'edemiyor': 1,
 'bu': 1,
 'örnek': 1,
 'olduk': 1,
 'deneyel': 1,
 'dedik

## Nutuk

In [47]:
with open('../data/nutuk.txt') as f:
    nutuk_text = f.read()

nutuk_text

'Skip to main content\nSearch\n\n UPLOAD\n SIGN UP | LOG IN\n \nBOOKS\n \nVIDEO\n \nAUDIO\n \nSOFTWARE\n \nIMAGES\nSign up for free\nLog in\n Search metadata\n Search text contents\n Search TV news captions\n Search radio transcripts\n Search archived web sites\nAdvanced Search\nABOUT BLOG PROJECTS HELP DONATE  CONTACT JOBS VOLUNTEER PEOPLE\nFull text of "Mustafa Kemal Atatürk Nutuk ( Orijinal Metin)"\nSee other formats\n\n\n\n\n\n\n\n\n\n\n\nGazi Mustafa Kemal \n\n\nNUTUK \n\n\nGenel Yayın Yönetmeni \nŞule Perinçek \n\n\nEditörler \nNejat Bayramoğlu, Kurtuluş Güran \n\n\nÇevriyazı \nErcüment Hüsnü Baki, Yücel Demirel, Ahmet Hezarfen, Sadık Perinçek, \nMusa Sarıkaya \n\n\nSayfa Düzeni \nGüler Kızılelma \n\n\nKapak Tasarımı \nBora Gürsoy \n\n\nBaskı ve Cilt \n\nErtem Basım Yayın Dağıtım Sanayi ve Ticaret Limited Şirketi \n\nBaşkent Organize Sanayi Bölgesi 22. Cadde No: 6 Malıköy - Temelli / ANKARA \nTel : 0312 640 16 23 \n\nSertifika No: 26886 \n\n\nBu eserin yayın hakları \nAnaliz Bası

In [49]:
wordFrequencies(nutuk_text,swTR,trStemmer.stemWord)

{'bir': 4001,
 'bey': 1868,
 'paş': 1268,
 'millet': 1103,
 'olduk': 1103,
 'heyet': 1027,
 'ola': 1026,
 'milli': 983,
 'et': 910,
 'paşa': 827,
 'hükümet': 816,
 'sonra': 806,
 'meclis': 801,
 'istanbul': 787,
 'efen': 719,
 'üzer': 715,
 'ol': 714,
 'be': 701,
 'p': 673,
 'kadar': 647,
 'taraf': 637,
 'büt': 632,
 'büyük': 580,
 'e': 552,
 'etmek': 547,
 'vaziyet': 545,
 'yn': 545,
 'telgraf': 544,
 'olarak': 543,
 'devlet': 540,
 'idi': 532,
 'yer': 518,
 'kuvvet': 491,
 'karşı': 488,
 'hareket': 487,
 'karar': 485,
 'hakk': 485,
 'ettik': 469,
 'söz': 464,
 'kendi': 456,
 'eder': 452,
 'biz': 450,
 'cevap': 437,
 'kabul': 436,
 'on': 431,
 'suret': 430,
 'tarih': 427,
 'rauf': 422,
 'vekil': 422,
 'değil': 421,
 'kumanda': 416,
 'buluna': 410,
 'i': 410,
 'iş': 402,
 'görüş': 399,
 'memleket': 394,
 'cemiyet': 393,
 'fırka': 383,
 'arz': 373,
 'reis': 372,
 'mebus': 371,
 'te': 370,
 'teklif': 364,
 'fakat': 364,
 'var': 360,
 'edecek': 360,
 'sıvas': 353,
 'teşkilat': 351,
 'bası

## A Second Example: Summarization and Keyword Extraction

Word frequencies give a rough idea of what a text is about, but there is a better method. Before we get to it, let us talk about [*word embeddings*](https://en.wikipedia.org/wiki/Word_embedding).

A word embedding is a mathematical way of converting a text into a collection of vectors such that the syntactic/semantic relations among words are imitated by the metric relations between the embedding vectors. Here is a picture of what we mean by this imitation:

![Analogy](https://www.ed.ac.uk/sites/default/files/styles/landscape_breakpoints_theme_uoe_tv_1x/public/thumbnails/image/diagram-20190710.png?itok=niBaDPXj)

The simplest way to convert a text into a collection of vectors is to count the words that occur within a given context.
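
As a minimal sketch of counting words within a context, the snippet below builds one count vector per word from the words that appear up to `window` positions away from it. The function name `cooccurrence_vectors`, the `window` parameter, and the two toy sentences are assumptions made for illustration; they are not part of the lecture's code.

```python
from collections import Counter, defaultdict

def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

# Hypothetical toy sentences, chosen only for illustration.
toy = ["the cat drinks milk", "the dog drinks water"]
vectors = cooccurrence_vectors(toy, window=2)

print(dict(vectors["cat"]))
print(dict(vectors["dog"]))
```

In this toy corpus "cat" and "dog" share the context words "the" and "drinks", so their count vectors overlap. That overlap is the kind of metric relation an embedding is meant to capture.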

In [51]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import PCA

### Bag of Words Embedding

Let's delve into the 'Bag of Words' (BoW) model, one of the simplest forms of word embeddings. In the BoW representation, a text is represented as the bag (multiset) of its words, disregarding grammar and word order but keeping multiplicity. Essentially, BoW constructs a vector for each document in which every word of the vocabulary has a fixed position, and the entry at that position is the number of times the word occurs in the document.

The scikit-learn library for Python provides two BoW vectorizers: `CountVectorizer` and `TfidfVectorizer`.

#### **CountVectorizer** 

`CountVectorizer` tokenizes the documents into words and produces a matrix in which the cell `a[i][j]` is the count of the j-th word in the i-th document.

Let's say we have the following list of sentences:

In [52]:
corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

vectorizer.get_feature_names_out()

array(['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third',
       'this'], dtype=object)

The method `fit_transform` learns the vocabulary from the training set `corpus` and also returns the transformed `corpus` as a document-term matrix. The feature names (the word at each index of the vector) are available through `get_feature_names_out()`.

The `transform` method converts documents into a document-term matrix using the vocabulary that has already been fitted. We can convert the counts to a pandas DataFrame for easier reading:

In [53]:
import pandas as pd
df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
df

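| | and | document | first | is | one | second | the | third | this |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 |
| 1 | 0 | 2 | 0 | 1 | 0 | 1 | 1 | 0 | 1 |
| 2 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 |
| 3 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 |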


The resulting DataFrame shows the count of each word in each document of the corpus.
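
To make the counting explicit, here is a minimal pure-Python sketch that rebuilds the same counts without scikit-learn. The helper names `tokenize`, `fit_vocabulary`, and `count_rows` are hypothetical, and the regular expression only approximates `CountVectorizer`'s default tokenization (lowercase, tokens of two or more word characters).

```python
import re
from collections import Counter

# Approximation of the default tokenization: lowercase the text,
# then keep runs of two or more word characters.
TOKEN = re.compile(r"\b\w\w+\b")

def tokenize(doc):
    return TOKEN.findall(doc.lower())

def fit_vocabulary(docs):
    """Sorted list of the distinct tokens found in `docs`."""
    return sorted({tok for doc in docs for tok in tokenize(doc)})

def count_rows(docs, vocabulary):
    """One row of counts per document; tokens outside the vocabulary are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts[word] for word in vocabulary])
    return rows

corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']

vocab = fit_vocabulary(corpus)
rows = count_rows(corpus, vocab)

# A document that was not part of the fitted corpus (made up for illustration).
unseen = count_rows(['A second document about scraping.'], vocab)
print(vocab)
print(rows)
print(unseen)
```

The last call mirrors what `transform` does with new text: words that were not in the fitted vocabulary, such as "scraping", simply do not contribute to the vector.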

#### **TfidfVectorizer** 

TF-IDF stands for "Term Frequency, Inverse Document Frequency". It scores the importance of a word (or "term") in a document by combining two counts: how often the word appears in that document, and in how many documents of the corpus it appears. A word that appears frequently in a document is likely to be important for it (the term frequency). A word that appears in many documents does not distinguish any one of them, which reduces its importance (the inverse document frequency). The product of the two gives each word a score, or weight.

Let's use the same sentences with the `TfidfVectorizer`. The columns represent the same words, but the counts are replaced with TF-IDF scores. A term gets a low score in a document when it occurs rarely in that document or when it occurs in most documents of the corpus; in both cases it says little about what sets that document apart.

In [54]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
df

| | and | document | first | is | one | second | the | third | this |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.469791 | 0.580286 | 0.384085 | 0.0 | 0.0 | 0.384085 | 0.0 | 0.384085 |
| 1 | 0.0 | 0.687624 | 0.0 | 0.281089 | 0.0 | 0.538648 | 0.281089 | 0.0 | 0.281089 |
| 2 | 0.511849 | 0.0 | 0.0 | 0.267104 | 0.511849 | 0.0 | 0.267104 | 0.511849 | 0.267104 |
| 3 | 0.0 | 0.469791 | 0.580286 | 0.384085 | 0.0 | 0.0 | 0.384085 | 0.0 | 0.384085 |
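
The scores in this table can be recomputed by hand. The sketch below assumes scikit-learn's default settings for `TfidfVectorizer`: raw counts as term frequency, the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each row. The helper name `tfidf_row` is hypothetical, and the tokenization is the same approximation of the default token pattern.

```python
import math
import re
from collections import Counter

corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']

# Term counts per document (lowercase tokens of two or more word characters).
docs = [Counter(re.findall(r"\b\w\w+\b", d.lower())) for d in corpus]
vocab = sorted(set().union(*docs))
n = len(docs)

# Document frequency: in how many documents each word occurs.
df = {w: sum(1 for d in docs if w in d) for w in vocab}
# Smoothed inverse document frequency.
idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocab}

def tfidf_row(counts):
    """TF-IDF scores of one document, scaled to unit Euclidean length."""
    raw = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]

row0 = tfidf_row(docs[0])
row1 = tfidf_row(docs[1])
for word, score in zip(vocab, row0):
    print(f"{word:>8}: {score:.6f}")
```

In the first document, "first" occurs in only two of the four documents and therefore scores higher than "is", "the", and "this", which occur in all four.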


#### A Larger Example

Let us use these vectorizers on the texts we worked with earlier. 


In [55]:
sentences = sent_tokenize(wiki_example)
processed = []

for x in sentences:
    tmp = re.sub(r'[^\w\s]+', '', x.lower())
    processed.append(tmp)

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(processed)

df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
df

Unnamed: 0,10,1960sthe,1980s,20,2480,3270s,50yearold,78,accomplished,acquire,...,where,whereas,which,will,with,without,working,write,wrote,xhtml
0,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,3,0,1,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [56]:
df.shape

(54, 519)

In [58]:
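# Project each sentence's count vector onto the first principal component
projection = PCA(n_components=1)
weights = projection.fit_transform(X.toarray())
# Use len(sentences) rather than a hard-coded 53 so that all 54 sentences are ranked
res = list(zip(range(len(sentences)),weights.T[0],sentences,processed))
sorted(res, key=lambda item: -item[1])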

[(25,
  6.835356889968498,
  'The screen scraper might connect to the legacy system via Telnet, emulate the keystrokes needed to navigate the old user interface, process the resulting display output, extract the desired data, and pass it on to the modern system.',
  'the screen scraper might connect to the legacy system via telnet emulate the keystrokes needed to navigate the old user interface process the resulting display output extract the desired data and pass it on to the modern system'),
 (8,
  5.822043289063032,
  'In the second case, the operator of the third-party system will often see screen scraping as unwanted, due to reasons such as increased system load, the loss of advertisement revenue, or the loss of control of the information content.',
  'in the second case the operator of the thirdparty system will often see screen scraping as unwanted due to reasons such as increased system load the loss of advertisement revenue or the loss of control of the information content'),


In [61]:
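words = df.columns
processed = [stemmer.stem(x) for x in words]

# Transpose X so that each row is a word, then score the words along the first principal component
projection = PCA(n_components=1)
weights = projection.fit_transform(X.T.toarray())
# NOTE: range(53) stops the zip after the first 53 of the 519 words;
# use range(len(words)) to rank the whole vocabulary
res = list(zip(range(53),weights.T[0],words,processed))
sorted(res, key=lambda item: -item[1])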

[(23, 3.0858069937506682, 'and', 'and'),
 (32, 1.8422678560512895, 'as', 'as'),
 (21, 1.2782754141478363, 'an', 'an'),
 (48, 1.1042935262755353, 'be', 'be'),
 (28, 0.5958323447474705, 'applications', 'applic'),
 (24, 0.22644201694927657, 'another', 'anoth'),
 (26, 0.17326963235370485, 'api', 'api'),
 (12, 0.06852351509316737, 'advertisement', 'advertis'),
 (37, 0.04756816123132114, 'automated', 'autom'),
 (5, -0.0019225729671915061, '3270s', '3270s'),
 (9, -0.0019225729671915061, 'acquire', 'acquir'),
 (18, -0.0019225729671915061, 'although', 'although'),
 (30, -0.014733877280239248, 'are', 'are'),
 (29, -0.08313992485658184, 'approach', 'approach'),
 (13, -0.10227329098088289, 'against', 'against'),
 (50, -0.11578862065275278, 'being', 'be'),
 (17, -0.1338104655259438, 'also', 'also'),
 (52, -0.1338104655259438, 'bidirectional', 'bidirect'),
 (41, -0.14085740054301465, 'available', 'avail'),
 (45, -0.14142021119727716, 'banks', 'bank'),
 (22, -0.144156037143828, 'analysis', 'analysi')

In [62]:
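def getWeights(text,vectorizer,sws,st):
    """Score each sentence of `text` by its projection onto the first principal component.

    vectorizer: a scikit-learn vectorizer such as CountVectorizer()
    sws: stop words to drop; st: stemming function applied to the remaining tokens
    Returns a list of (index, weight, original sentence, processed sentence).
    """
    sentences = sent_tokenize(text)
    processed = []
    for s in sentences:
        tmp = word_tokenize(s)
        res = []
        for w in tmp:
            # The stop-word check is case-sensitive, so capitalised words such as 'From' pass through
            if w not in sws:
                res.append(st(w))
        processed.append(' '.join(res))
    # Sentence-term matrix, reduced to a single score per sentence
    X = vectorizer.fit_transform(processed)
    projector = PCA(n_components=1)
    weights = projector.fit_transform(X.toarray())
    N = len(sentences)
    return list(zip(range(N),weights.T[0],sentences,processed))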

In [63]:
result = getWeights(nba_example,CountVectorizer(),swEN,stemmer.stem)
sorted(result, key=lambda x: -x[1])[:4]

[(12,
  4.891361540528905,
  'From there, the six group winners and two wildcards (teams with the best records that did not win their group) will advance to a single-elimination knockout stage played in Las Vegas, starting with quarter-finals on 4 and 5 December, followed by the semi-finals on 7 December, and concluding with the championship game on 9 December.',
  'from , six group winner two wildcard ( team best record win group ) advanc single-elimin knockout stage play las vega , start quarter-fin 4 5 decemb , follow semi-fin 7 decemb , conclud championship game 9 decemb .'),
 (19,
  1.2439879487039374,
  'And those teams that do not reach the final will be assigned home and away games also to be played in early December to make up the difference.',
  'and team reach final assign home away game also play earli decemb make differ .'),
 (10,
  1.2363391499496592,
  'The tournament, which tips off with a full slate of Friday games, features each of the NBA’s 30 teams broken up into fi

In [64]:
result = getWeights(tech_example,CountVectorizer(),swEN,stemmer.stem)
sorted(result, key=lambda x: -x[1])[:4]

[(8,
  8.868150946000338,
  'Water resistance: sweat resistant (IPX4) Connectivity: Bluetooth 5.3 (SBC, AAC, aptX Adaptive) Battery life: 6 hours (up to 24 hours with case) Earbud dimensions: 17.2 x 30.5 x 22.4mm Earbud weight: 6.24g each Charging case dimensions: 59.4 x 66.3 x 26.7mm Charging case weight: 59.8g Case charging: USB-C The earbuds have some of the best noise cancelling you can get on any headphones, let alone something as small as a set of earbuds.',
  'water resist : sweat resist ( ipx4 ) connect : bluetooth 5.3 ( sbc , aac , aptx adapt ) batteri life : 6 hour ( 24 hour case ) earbud dimens : 17.2 x 30.5 x 22.4mm earbud weight : 6.24g charg case dimens : 59.4 x 66.3 x 26.7mm charg case weight : 59.8g case charg : usb-c the earbud best nois cancel get headphon , let alon someth small set earbud .'),
 (7,
  1.5162074726420176,
  'The battery life remains the same six hours of noise cancelling as the previous models, with three full charges in the case for up to 24 hours of

## BERT Models

Bidirectional Encoder Representations from Transformers (BERT) is a Transformer-based language model developed by Google for natural language processing (NLP). BERT is designed to pretrain deep bidirectional representations from unlabelled text by jointly conditioning on both left and right context in all layers. The result is a pre-trained model that can be fine-tuned for a wide range of downstream tasks, usually with much better results than training a model for the task from scratch.

### Mathematical Background

BERT is built from the encoder part of the Transformer architecture. Transformers use attention mechanisms to weigh the influence of the words in a sequence on one another. Instead of recurrent or convolutional layers, a Transformer encoder stacks several self-attention layers, and each layer processes all positions of the input in parallel.
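
To illustrate how attention weighs words against each other, here is a minimal sketch of single-head scaled dot-product self-attention in plain Python. It is a simplification: a real Transformer layer first maps the inputs through learned query, key, and value projections and uses several heads, whereas this sketch uses the input vectors directly for all three roles. The function names and the toy vectors are assumptions for illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax of a list of numbers."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values = vectors."""
    d = len(vectors[0])
    outputs, weights = [], []
    for q in vectors:
        # Similarity of this position to every position, scaled by sqrt(d)
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in vectors]
        w = softmax(scores)
        weights.append(w)
        # New representation: weighted average of all input vectors
        outputs.append([sum(w_j * v[i] for w_j, v in zip(w, vectors)) for i in range(d)])
    return outputs, weights

# Three hypothetical 2-dimensional token vectors.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
for row in weights:
    print([round(x, 3) for x in row])
```

Each row of `weights` sums to one and tells how much every position contributes to the new representation of that token.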

One of the central ideas in BERT is masked language modelling. During pre-training, some of the input tokens are masked at random (15% of them in the original BERT setup), and the model is trained to predict only those masked tokens from the context provided by the non-masked ones.
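
The masking step can be sketched in a few lines. This is a simplified illustration, not BERT's actual preprocessing code: the original procedure also sometimes replaces a selected token with a random token or leaves it unchanged, which is omitted here. The function name `mask_tokens` and its parameters are hypothetical.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked, labels): labels[i] holds the original token where it was masked, else None."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)   # the model sees the mask ...
            labels.append(tok)    # ... and must predict the original token
        else:
            masked.append(tok)
            labels.append(None)   # no prediction is required at this position
    return masked, labels

tokens = "data scraping is a technique where a program extracts data".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)
print(labels)
```

Only the positions with a label other than `None` would contribute to the training loss.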

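Tokenization in BERT requires more than simply splitting the input into words. BERT uses WordPiece tokenization. A WordPiece vocabulary starts from individual characters and then grows by adding frequently occurring combinations of two, three, or more characters. As a result, common words stay whole, while rare words are split into several subword pieces; pieces that continue a word are marked with a `##` prefix.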
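
Once a vocabulary exists, splitting a word into pieces can be done greedily: repeatedly take the longest piece from the vocabulary that matches the start of what is left. The sketch below assumes this greedy longest-match-first strategy and the `##` prefix for continuation pieces; the function name and the tiny vocabulary are made up for illustration and are far smaller than a real BERT vocabulary.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split `word` into vocabulary pieces, longest match first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation of a word
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary.
toy_vocab = {"un", "##aff", "##able", "scrap", "##ing", "##e", "##r"}

print(wordpiece_tokenize("unaffable", toy_vocab))
print(wordpiece_tokenize("scraping", toy_vocab))
print(wordpiece_tokenize("scraper", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

With this vocabulary "scraping" and "scraper" share the piece "scrap", which is how subword tokenization lets related words share part of their representation.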

### BERT Representation

Each token is looked up in a WordPiece vocabulary of about 30k entries and mapped to an embedding vector. Position and segment embeddings are then added element-wise to these token embeddings to get the final input representation.

Each input sequence needs the special tokens `[CLS]` at the beginning and `[SEP]` at the end, and each word may be split into subwords. `BertTokenizer` from the `transformers` module handles all of this.
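
As a rough sketch of what the tokenizer produces, the function below wraps a token list in `[CLS]` and `[SEP]`, maps tokens to ids, and pads the result to a fixed length. The ids of the special tokens follow the `bert-base-uncased` vocabulary (`[PAD]` = 0, `[UNK]` = 100, `[CLS]` = 101, `[SEP]` = 102); the ids of the ordinary words and the function name `encode` are made up for illustration.

```python
def encode(tokens, vocab, max_length):
    """Return (input_ids, attention_mask), truncated or padded to `max_length`."""
    # Leave room for the two special tokens
    pieces = ["[CLS]"] + tokens[: max_length - 2] + ["[SEP]"]
    ids = [vocab.get(p, vocab["[UNK]"]) for p in pieces]
    attention_mask = [1] * len(ids)            # 1 = real token
    pad = max_length - len(ids)
    ids = ids + [vocab["[PAD]"]] * pad
    attention_mask = attention_mask + [0] * pad  # 0 = padding
    return ids, attention_mask

# Special-token ids as in bert-base-uncased; word ids are hypothetical.
toy_vocab = {"[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
             "data": 1, "scraping": 2}

ids, mask = encode(["data", "scraping"], toy_vocab, max_length=6)
print(ids)
print(mask)
```

The attention mask lets the model ignore the padding positions, which is why padded sequences of different original lengths can share one batch.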

In [65]:
import numpy as np
import pandas as pd
import torch
import tensorflow as tf
import tensorflow_datasets as tfds

from transformers import BertTokenizer, BertModel

2023-11-13 16:29:59.246830: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-11-13 16:29:59.246876: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-11-13 16:29:59.248877: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-11-13 16:29:59.412194: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [66]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', model_max_length=512, padding=True, truncation=True)
model = BertModel.from_pretrained("bert-base-uncased")


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [67]:
sentences

['Data scraping is a technique where a computer program extracts data from human-readable output coming from another program.',
 'Normally, data transfer between programs is accomplished using data structures suited for automated processing by computers, not people.',
 'Such interchange formats and protocols are typically rigidly structured, well-documented, easily parsed, and minimize ambiguity.',
 'Very often, these transmissions are not human-readable at all.',
 'Thus, the key element that distinguishes data scraping from regular parsing is that the output being scraped is intended for display to an end-user, rather than as an input to another program.',
 'It is therefore usually neither documented nor structured for convenient parsing.',
 'Data scraping often involves ignoring binary data (usually images or multimedia data), display formatting, redundant labels, superfluous commentary, and other information which is either irrelevant or hinders automated processing.',
 'Data scrapi

In [68]:
np.array([ tokenizer.encode(x, return_tensors='pt', max_length=80, padding='max_length', truncation=True).numpy().reshape(-1) for x in sentences ])

array([[  101,  2951, 23704, ...,     0,     0,     0],
       [  101,  5373,  1010, ...,     0,     0,     0],
       [  101,  2107,  8989, ...,     0,     0,     0],
       ...,
       [  101,  6168,  2951, ...,     0,     0,     0],
       [  101,  2122,  2064, ...,     0,     0,     0],
       [  101,  2023,  3921, ...,     0,     0,     0]])

In [None]:


# Construct a tf.data.Dataset
ds = tfds.load('mnist', split='train', shuffle_files=True)

# Build your input pipeline
ds = ds.shuffle(1024).batch(32).prefetch(tf.data.AUTOTUNE)
for example in ds.take(1):
  image, label = example["image"], example["label"]