# Working with Unstructured Data (Part II)

Today we are going to look at text data, and things we can do with text data. In today's lecture I am going to cover two topics:

1. Scraping text
2. Rudimentary Natural Language Processing


## Scraping Text/Data

[Text scraping](https://en.wikipedia.org/wiki/Text_scraping), or more generally [data scraping](https://en.wikipedia.org/wiki/Data_scraping), is defined as the process of extracting text out of web resources such as newspapers, web pages, or government sites. Data scraping is often needed to interface with an old legacy system that does not expose its data via a convenient API. It involves making requests to websites and parsing the HTML responses to collect the desired information. This technique is often used for data mining, data analysis, and many other applications. 
          

We will be using the Python library BeautifulSoup, an easy-to-use tool for parsing HTML and XML documents and extracting useful data from them, together with requests for fetching pages. First, we import both libraries.

In [1]:
from bs4 import BeautifulSoup
import requests

For this task, we need texts of moderate size: neither too long nor too short. News articles are perfect for this purpose. I am going to use several sources.

* For English texts I am going to use the [Guardian Newspaper](https://www.theguardian.com/international)
* For Turkish texts I am going to use [Milliyet](https://www.milliyet.com.tr/)
* For French texts I am going to use [Le Monde](https://www.lemonde.fr/)

We are going to pull articles on a specific subject using [RSS](https://en.wikipedia.org/wiki/RSS) feeds, a standard web feed format. Each of these newspapers has its own RSS feeds.

### RSS Feeds

Let us start with the Guardian, whose RSS feeds follow a [predictable pattern](https://www.theguardian.com/help/feeds). For example, here are some interesting subjects:

1. Economy: https://www.theguardian.com/economy/rss
2. Technology: https://www.theguardian.com/technology/rss
3. Film: https://www.theguardian.com/film/rss
4. NBA: https://www.theguardian.com/sport/nba/rss
5. Fashion: https://www.theguardian.com/fashion/rss

Each RSS feed is an XML file. We are going to parse it and extract the bits we are interested in:

In [2]:
from xmltodict import parse

with requests.get('https://www.theguardian.com/technology/rss') as link:
    raw = parse(link.text)

raw

{'rss': {'@xmlns:media': 'http://search.yahoo.com/mrss/',
  '@xmlns:dc': 'http://purl.org/dc/elements/1.1/',
  '@version': '2.0',
  'channel': {'title': 'Technology | The Guardian',
   'link': 'https://www.theguardian.com/uk/technology',
   'description': "Latest Technology news, comment and analysis from the Guardian, the world's leading liberal voice",
   'language': 'en-gb',
   'copyright': 'Guardian News and Media Limited or its affiliated companies. All rights reserved. 2023',
   'pubDate': 'Sun, 12 Nov 2023 18:19:04 GMT',
   'dc:date': '2023-11-12T18:19:04Z',
   'dc:language': 'en-gb',
   'dc:rights': 'Guardian News and Media Limited or its affiliated companies. All rights reserved. 2023',
   'image': {'title': 'The Guardian',
    'url': 'https://assets.guim.co.uk/images/guardian-logo-rss.c45beb1bafa34b347ac333af2e6fe23f.png',
    'link': 'https://www.theguardian.com'},
   'item': [{'title': 'How Chinese firm linked to repression of Uyghurs aids Israeli surveillance in West Bank'

The response is a large XML document parsed into a Python dictionary. The part we are interested in is the list located at `raw['rss']['channel']['item']`. Let us look at its first element:

In [3]:
raw['rss']['channel']['item'][0]

{'title': 'How Chinese firm linked to repression of Uyghurs aids Israeli surveillance in West Bank',
 'link': 'https://www.theguardian.com/technology/2023/nov/11/west-bank-palestinians-surveillance-cameras-hikvision',
 'description': '<p>Cameras made by Hikvision, which is blacklisted in US, blanket the occupied West Bank, according to Amnesty International</p><p>In the occupied Palestinian territories, there are cameras everywhere. In Silwan, in occupied East Jerusalem, residents say cameras were installed by Israeli police up and down their streets, peering into their homes. One resident named Sara said she and her family “could be detected as if the cameras were just in our house … we couldn’t feel at home in our own house and had to be fully dressed all the time.”</p><p>Surveillance cameras now cover the Damascus Gate, the main entrance into the old city of Jerusalem and one of the only public areas for Palestinians to gather socially and hold demonstrations. It’s at that gate that

I am going to use a function to pull the links that I need:

In [4]:
def getSubjectGuardian(subject):
    with requests.get(f'https://www.theguardian.com/{subject}/rss') as link:
        raw = parse(link.text)
    return raw['rss']['channel']['item']

NBA = getSubjectGuardian('sport/nba')
Tech = getSubjectGuardian('technology')
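
To see the `rss` → `channel` → `item` structure without going to the network, here is a minimal sketch on a small hand-written feed. It uses the standard library's `xml.etree.ElementTree` instead of `xmltodict`, and the feed contents and the `example.com` links are made up for illustration.

```python
import xml.etree.ElementTree as ET

# A made-up two-item feed with the same overall shape as an RSS 2.0 document.
toy_feed = '''<rss version="2.0">
  <channel>
    <title>Toy feed</title>
    <item><title>First story</title><link>https://example.com/1</link></item>
    <item><title>Second story</title><link>https://example.com/2</link></item>
  </channel>
</rss>'''

root = ET.fromstring(toy_feed)  # root is the <rss> element

# Every <item> under <channel> becomes one small dictionary.
items = [{'title': it.findtext('title'), 'link': it.findtext('link')}
         for it in root.findall('./channel/item')]

for x in items:
    print(f'{x["title"]}\n{x["link"]}\n')
```

The resulting list plays the same role as `raw['rss']['channel']['item']` above: one entry per article, each with a title and a link.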

Let us look at the links:

In [5]:
for x in NBA[:4]:
    print(f'{x["title"]}\n{x["link"]}\n\n')

Sixers’ Oubre Jr to miss ‘significant’ time after being hit by vehicle in Philadelphia
https://www.theguardian.com/sport/2023/nov/12/kelly-oubre-jr-hit-by-car-philadelphia-76ers


NBA Cup roundup: Curry’s last-second bucket lifts Warriors over Thunder
https://www.theguardian.com/sport/2023/nov/03/nba-in-season-tournament-scores-reports


‘I can’t wait’: excitement mounts for NBA’s first in-season tournament
https://www.theguardian.com/sport/2023/nov/03/nba-cup-preview-in-season-tournament


‘Unbelievable’ Victor Wembanyama explodes for 38 points in fifth NBA game
https://www.theguardian.com/sport/2023/nov/03/victor-wembanyama-spurs-suns-game-report




Let us retrieve the text from the first link:

In [6]:
with requests.get(NBA[0]['link']) as link:
    raw = BeautifulSoup(link.content,'html.parser')

print(raw)

<!DOCTYPE html>

<html lang="en">
<head>
<!-- Hello there, HTML enthusiast! -->
<title>Sixers’ Oubre Jr to miss ‘significant’ time after being hit by vehicle in Philadelphia | Philadelphia 76ers | The Guardian</title>
<meta content="Philadelphia 76ers guard Kelly Oubre Jr is expected to miss significant time after being struck by a vehicle Saturday, the team said" name="description"/>
<link href="https://www.theguardian.com/sport/2023/nov/12/kelly-oubre-jr-hit-by-car-philadelphia-76ers" rel="canonical"/>
<meta charset="utf-8"/>
<meta content="width=device-width,minimum-scale=1,initial-scale=1" name="viewport"/>
<meta content="#052962" name="theme-color">
<link href="https://assets.guim.co.uk/static/frontend/manifest.json" rel="manifest"/>
<link href="https://assets.guim.co.uk/static/frontend/icons/homescreen/apple-touch-icon.svg" rel="apple-touch-icon" sizes="any"/>
<link href="https://assets.guim.co.uk/static/frontend/icons/homescreen/apple-touch-icon-512.png" rel="apple-touch-icon" s

The response body is the raw HTML of the page. This is where BeautifulSoup comes in: in the cell above we passed that HTML to `BeautifulSoup`, which parsed it into a structured object.

The `raw` object now holds the structured representation of the page, from which data can be easily extracted. To extract the article text, we find the HTML tags that contain it. Here these are the paragraph (`<p>`) tags:

In [7]:
raw.find_all('p')

[<p class="dcr-1dpfw7k">Philadelphia 76ers guard Kelly Oubre Jr sustained undisclosed injuries after being struck by a vehicle on Saturday, the team said, and is expected to miss significant playing time.</p>,
 <p class="dcr-1dpfw7k">The 27-year-old Oubre was transported to a hospital in stable condition after being hit, and released a few hours later. The teams says he is not expected to miss the rest of the season, but did not provide information on his injuries.</p>,
 <p class="dcr-1dpfw7k">The Sixers said Oubre was walking near his residence in downtown <a data-component="auto-linked-tag" data-link-name="in body link" href="https://www.theguardian.com/us-news/philadelphia">Philadelphia</a> when he struck.</p>,
 <p dir="ltr" lang="qme">❤️💙 <a href="https://t.co/AgkaWnO7oT">pic.twitter.com/AgkaWnO7oT</a></p>,
 <p class="dcr-1dpfw7k">The Philadelphia Police said in an email to the Associated Press that Oubre was struck at about 7pm while crossing the street at Broad and Locust streets

In this example, `find_all` selects every `<p>` tag on the page. Next, we extract the text of each paragraph and join the pieces:

In [8]:
' '.join([x.text for x in raw.find_all('p')])

'Philadelphia 76ers guard Kelly Oubre Jr sustained undisclosed injuries after being struck by a vehicle on Saturday, the team said, and is expected to miss significant playing time. The 27-year-old Oubre was transported to a hospital in stable condition after being hit, and released a few hours later. The teams says he is not expected to miss the rest of the season, but did not provide information on his injuries. The Sixers said Oubre was walking near his residence in downtown Philadelphia when he struck. ❤️💙 pic.twitter.com/AgkaWnO7oT The Philadelphia Police said in an email to the Associated Press that Oubre was struck at about 7pm while crossing the street at Broad and Locust streets in Center City, and that he was taken to Jefferson Hospital. Police said there is an active investigation into the incident. 6ABC in Philadelphia reported police saying a silver vehicle fled the scene after the accident. Oubre is in his first season with the 76ers and has averaged 16.8 points in the fi

Let us convert this to a function:

In [9]:
def getText(url):
    with requests.get(url) as link:
        raw = BeautifulSoup(link.content,'html.parser')
    return ' '.join([x.text for x in raw.find_all('p')])

This function takes a URL and returns the text of all paragraphs on that page.
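
The idea behind `getText`, collecting the text inside `<p>` tags and ignoring everything else, can be tried offline on a small HTML string. The sketch below deliberately uses the standard library's `html.parser` rather than BeautifulSoup; the class name `ParagraphText` and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser

class ParagraphText(HTMLParser):
    """Collect the text found inside <p> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_p = False       # True while we are inside a <p> element
        self.paragraphs = []    # one string per paragraph

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.in_p = True
            self.paragraphs.append('')

    def handle_endtag(self, tag):
        if tag == 'p':
            self.in_p = False

    def handle_data(self, data):
        # Text inside nested tags such as <a> still arrives here.
        if self.in_p:
            self.paragraphs[-1] += data

html = '''<html><body>
<h1>Title</h1>
<p class="x">First <a href="#">linked</a> paragraph.</p>
<div>not a paragraph</div>
<p>Second paragraph.</p>
</body></html>'''

parser = ParagraphText()
parser.feed(html)
paragraph_text = ' '.join(parser.paragraphs)
print(paragraph_text)
```

Note that the heading and the `<div>` are dropped while the link text inside the first paragraph is kept, which is also what `x.text` gives us for each `<p>` in the BeautifulSoup version.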

In [10]:
tech_example = getText(Tech[0]['link'])
tech_example

'Cameras made by Hikvision, which is blacklisted in US, blanket the occupied West Bank, according to Amnesty International In the occupied Palestinian territories, there are cameras everywhere. In Silwan, in occupied East Jerusalem, residents say cameras were installed by Israeli police up and down their streets, peering into their homes. One resident named Sara said she and her family “could be detected as if the cameras were just in our house … we couldn’t feel at home in our own house and had to be fully dressed all the time.” Surveillance cameras now cover the Damascus Gate, the main entrance into the old city of Jerusalem and one of the only public areas for Palestinians to gather socially and hold demonstrations. It’s at that gate that “Palestinians are being watched and assessed at all times”, according to an Amnesty International report, Automated Apartheid. These cameras have created a chilling effect on not just the ability to protest but also on the daily lives of Palestinia

In [11]:
nba_example = getText(NBA[3]['link'])
nba_example

"Victor Wembanyama strolled through the hallways at Footprint Center after the best game of his short NBA career when he passed 13-time All-Star Kevin Durant, stopping for a quick handshake and quick hug. Some say the 7ft 4in Frenchman is a taller version of Durant. Durant’s first impression is that the budding San Antonio Spurs star might be even better. “His enthusiasm for the game – you can tell that through the TV and playing against him,” Durant said. ”He’s his own player, own person. He’s going to create his own lane and is much different than anyone else who has played.” In just his fifth NBA game, the 19-year-old Wembanyama showed why he was one of the most-hyped prospects in years, scoring 38 points and grabbing 10 rebounds in a standout performance during a 132-121 victory over Durant and the Phoenix Suns on Thursday night. Wembanyama produced several highlight-reel plays in the game, including a soaring dunk while running the floor during the second quarter. But it was his p

Of course, you don't have to go through RSS to get the links. You may provide them directly:

In [12]:
wiki_example = getText('https://en.wikipedia.org/wiki/Data_scraping')
wiki_example

'Data scraping is a technique where a computer program extracts data from human-readable output coming from another program.\n Normally, data transfer between programs is accomplished using data structures suited for automated processing by computers, not people.  Such interchange formats and protocols are typically rigidly structured, well-documented, easily parsed, and minimize ambiguity.  Very often, these transmissions are not human-readable at all.\n Thus, the key element that distinguishes data scraping from regular parsing is that the output being scraped is intended for display to an end-user, rather than as an input to another program. It is therefore usually neither documented nor structured for convenient parsing.  Data scraping often involves ignoring binary data (usually images or multimedia data), display formatting, redundant labels, superfluous commentary, and other information which is either irrelevant or hinders automated processing.\n Data scraping is most often don

## An Application: Word Frequencies

There are many ways one can use text. For today's lecture I am going to demonstrate one such use: text summarization and keyword extraction. Surprisingly, the task is a simple application of linear algebra.

### Splitting a text into words and sentences

We are going to break a human-language text into meaningful pieces. For this we employ a natural language processing library called [NLTK](https://www.nltk.org/). Our first task is to split the text into sentences and words:


In [13]:
import nltk

from nltk.tokenize import sent_tokenize, word_tokenize

# nltk.download('punkt',download_dir='/home/kaygun/local/lib/nltk_data')
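
For intuition about why we reach for a dedicated tokenizer, here is a deliberately naive sentence splitter built on a regular expression. It is only a sketch and not how `sent_tokenize` works; the function name `naive_sentences` and the sample text are made up.

```python
import re

def naive_sentences(text):
    # Split at any whitespace that follows '.', '!' or '?'.
    return re.split(r'(?<=[.!?])\s+', text.strip())

sample = 'Data scraping is common. Dr. Smith wrote about it! Is it legal?'
parts = naive_sentences(sample)

for p in parts:
    print(repr(p))
```

The sample has three sentences, but the naive rule produces four pieces because it also cuts after the abbreviation `Dr.`. Handling cases like this is the reason to use a proper sentence tokenizer.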

In [14]:
sent_tokenize(wiki_example)

['Data scraping is a technique where a computer program extracts data from human-readable output coming from another program.',
 'Normally, data transfer between programs is accomplished using data structures suited for automated processing by computers, not people.',
 'Such interchange formats and protocols are typically rigidly structured, well-documented, easily parsed, and minimize ambiguity.',
 'Very often, these transmissions are not human-readable at all.',
 'Thus, the key element that distinguishes data scraping from regular parsing is that the output being scraped is intended for display to an end-user, rather than as an input to another program.',
 'It is therefore usually neither documented nor structured for convenient parsing.',
 'Data scraping often involves ignoring binary data (usually images or multimedia data), display formatting, redundant labels, superfluous commentary, and other information which is either irrelevant or hinders automated processing.',
 'Data scrapi

In [15]:
word_tokenize(wiki_example.lower())

['data',
 'scraping',
 'is',
 'a',
 'technique',
 'where',
 'a',
 'computer',
 'program',
 'extracts',
 'data',
 'from',
 'human-readable',
 'output',
 'coming',
 'from',
 'another',
 'program',
 '.',
 'normally',
 ',',
 'data',
 'transfer',
 'between',
 'programs',
 'is',
 'accomplished',
 'using',
 'data',
 'structures',
 'suited',
 'for',
 'automated',
 'processing',
 'by',
 'computers',
 ',',
 'not',
 'people',
 '.',
 'such',
 'interchange',
 'formats',
 'and',
 'protocols',
 'are',
 'typically',
 'rigidly',
 'structured',
 ',',
 'well-documented',
 ',',
 'easily',
 'parsed',
 ',',
 'and',
 'minimize',
 'ambiguity',
 '.',
 'very',
 'often',
 ',',
 'these',
 'transmissions',
 'are',
 'not',
 'human-readable',
 'at',
 'all',
 '.',
 'thus',
 ',',
 'the',
 'key',
 'element',
 'that',
 'distinguishes',
 'data',
 'scraping',
 'from',
 'regular',
 'parsing',
 'is',
 'that',
 'the',
 'output',
 'being',
 'scraped',
 'is',
 'intended',
 'for',
 'display',
 'to',
 'an',
 'end-user',
 ',',
 '

As you can see there are many extraneous symbols that need to be cleaned.

### Regular expressions

I would like to remove all characters that are neither alphanumeric nor whitespace from the text. For this I am going to use [regular expressions](https://en.wikipedia.org/wiki/Regular_expression). This is a technical tool that every knowledgeable data scientist should have in their arsenal.
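
Before applying the pattern to a whole article, it helps to see what it does on one short sentence. The sample string below is made up; the pattern is the same `[^\w\s]+` used in the cells that follow.

```python
import re

sample = "Don't panic: human-readable output is well-documented (usually)!"

# [^\w\s]+ matches runs of characters that are neither word characters
# (letters, digits, underscore) nor whitespace; we delete those runs.
cleaned_sample = re.sub(r'[^\w\s]+', '', sample.lower())
print(cleaned_sample)
```

Punctuation disappears, but there is a side effect: hyphens and apostrophes are deleted too, so `human-readable` becomes `humanreadable` and `don't` becomes `dont`. You will see such fused words in the frequency tables.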

In [16]:
import re
from collections import Counter

In [17]:
res = Counter(word_tokenize(re.sub(r'[^\w\s]+','',wiki_example.lower())))
dict(sorted(res.items(), key=lambda item: -item[1]))

{'the': 58,
 'to': 44,
 'a': 42,
 'data': 40,
 'and': 31,
 'of': 31,
 'is': 21,
 'scraping': 20,
 'for': 19,
 'from': 15,
 'or': 15,
 'system': 15,
 'in': 14,
 'this': 14,
 'as': 13,
 'web': 13,
 'an': 12,
 'screen': 12,
 'be': 11,
 'computer': 10,
 'can': 10,
 'program': 8,
 'output': 8,
 'by': 8,
 'such': 8,
 'often': 8,
 'that': 8,
 'are': 7,
 'with': 7,
 'more': 7,
 'on': 7,
 'processing': 6,
 'interface': 6,
 'used': 6,
 'use': 6,
 'not': 5,
 'which': 5,
 'human': 5,
 'modern': 5,
 'source': 5,
 'user': 5,
 'systems': 5,
 'humanreadable': 4,
 'another': 4,
 'using': 4,
 'automated': 4,
 'parsing': 4,
 'display': 4,
 'usually': 4,
 'api': 4,
 'applications': 4,
 'screens': 4,
 'through': 4,
 'scraper': 4,
 'techniques': 4,
 'report': 4,
 'technique': 3,
 'where': 3,
 'between': 3,
 'easily': 3,
 'these': 3,
 'intended': 3,
 'input': 3,
 'it': 3,
 'other': 3,
 'legacy': 3,
 'no': 3,
 'mechanism': 3,
 'provide': 3,
 'will': 3,
 'control': 3,
 'available': 3,
 'programming': 3,
 'term

The frequencies of the words that appear in this list roughly follow [Zipf's Law](https://en.wikipedia.org/wiki/Zipf%27s_law). [Here](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4176592/) is a more academic treatment of the subject. But, as you may notice, most of the top words are meaningless without a proper context. I would describe these words as words with *high noise and low signal* value. In natural language processing terminology, these are called [*stop words*](https://en.wikipedia.org/wiki/Stop_word).

In [18]:
from nltk.corpus import stopwords
nltk.download('stopwords',download_dir='/home/kaygun/local/lib/nltk_data/')

swEN = set(stopwords.words('english'))
swEN

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/kaygun/local/lib/nltk_data/...
[nltk_data]   Package stopwords is already up-to-date!


{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

In [19]:
cleaned = {k: v for k,v in res.items() if k not in swEN }
dict(sorted(cleaned.items(), key=lambda item: -item[1]))

{'data': 40,
 'scraping': 20,
 'system': 15,
 'web': 13,
 'screen': 12,
 'computer': 10,
 'program': 8,
 'output': 8,
 'often': 8,
 'processing': 6,
 'interface': 6,
 'used': 6,
 'use': 6,
 'human': 5,
 'modern': 5,
 'source': 5,
 'user': 5,
 'systems': 5,
 'humanreadable': 4,
 'another': 4,
 'using': 4,
 'automated': 4,
 'parsing': 4,
 'display': 4,
 'usually': 4,
 'api': 4,
 'applications': 4,
 'screens': 4,
 'scraper': 4,
 'techniques': 4,
 'report': 4,
 'technique': 3,
 'easily': 3,
 'intended': 3,
 'input': 3,
 'legacy': 3,
 'mechanism': 3,
 'provide': 3,
 'control': 3,
 'available': 3,
 'programming': 3,
 'terminal': 3,
 'instead': 3,
 'text': 3,
 'terminals': 3,
 'port': 3,
 'term': 3,
 'example': 3,
 'process': 3,
 'extract': 3,
 'mining': 3,
 'reports': 3,
 'normally': 2,
 'transfer': 2,
 'interchange': 2,
 'structured': 2,
 'enduser': 2,
 'convenient': 2,
 'involves': 2,
 'images': 2,
 'information': 2,
 'either': 2,
 'done': 2,
 'thirdparty': 2,
 'case': 2,
 'reasons': 2,
 '

Let us write a function for this:

In [20]:
def processText(text, sws):
    words = word_tokenize(re.sub(r'[^\w\s]+','',text.lower()))
    res = Counter([w for w in words if w not in sws])
    return dict(sorted(res.items(), key=lambda item: -item[1]))
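
As a side note, `Counter` already knows how to list its entries by decreasing count through `most_common`, so the manual sort is optional. Here is a minimal standard-library sketch of the same pipeline. The tiny stop-word set, the function name `process_text_plain`, and the sample sentence are made up for illustration, and `str.split` stands in for `word_tokenize`.

```python
import re
from collections import Counter

# Tiny hand-picked stop-word set for illustration only; the lecture uses
# NLTK's much larger English list (swEN).
toy_stopwords = {'the', 'a', 'is', 'of', 'and', 'to', 'from'}

def process_text_plain(text, sws):
    # Once punctuation is removed, splitting on whitespace is enough here.
    words = re.sub(r'[^\w\s]+', '', text.lower()).split()
    return Counter(w for w in words if w not in sws)

sample = ('Data scraping is the extraction of data from a web page. '
          'Scraping a page is easy.')
counts = process_text_plain(sample, toy_stopwords)

# most_common(n) returns the n largest (word, count) pairs.
print(counts.most_common(3))
```

Returning the `Counter` itself, rather than a sorted dictionary, also keeps conveniences such as adding two counters together.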

In [21]:
processText(wiki_example,swEN)

{'data': 40,
 'scraping': 20,
 'system': 15,
 'web': 13,
 'screen': 12,
 'computer': 10,
 'program': 8,
 'output': 8,
 'often': 8,
 'processing': 6,
 'interface': 6,
 'used': 6,
 'use': 6,
 'human': 5,
 'modern': 5,
 'source': 5,
 'user': 5,
 'systems': 5,
 'humanreadable': 4,
 'another': 4,
 'using': 4,
 'automated': 4,
 'parsing': 4,
 'display': 4,
 'usually': 4,
 'api': 4,
 'applications': 4,
 'screens': 4,
 'scraper': 4,
 'techniques': 4,
 'report': 4,
 'technique': 3,
 'easily': 3,
 'intended': 3,
 'input': 3,
 'legacy': 3,
 'mechanism': 3,
 'provide': 3,
 'control': 3,
 'available': 3,
 'programming': 3,
 'terminal': 3,
 'instead': 3,
 'text': 3,
 'terminals': 3,
 'port': 3,
 'term': 3,
 'example': 3,
 'process': 3,
 'extract': 3,
 'mining': 3,
 'reports': 3,
 'normally': 2,
 'transfer': 2,
 'interchange': 2,
 'structured': 2,
 'enduser': 2,
 'convenient': 2,
 'involves': 2,
 'images': 2,
 'information': 2,
 'either': 2,
 'done': 2,
 'thirdparty': 2,
 'case': 2,
 'reasons': 2,
 '

In [22]:
processText(nba_example,swEN)

{'wembanyama': 7,
 'game': 7,
 'said': 7,
 'hes': 7,
 'durant': 6,
 'spurs': 6,
 'nba': 4,
 'player': 3,
 'suns': 3,
 'night': 3,
 'quarter': 3,
 'lead': 3,
 'also': 3,
 'got': 3,
 'victor': 2,
 'allstar': 2,
 'quick': 2,
 'say': 2,
 'first': 2,
 'star': 2,
 'going': 2,
 'anyone': 2,
 'else': 2,
 'one': 2,
 '38': 2,
 'phoenix': 2,
 'thursday': 2,
 'plays': 2,
 'dunk': 2,
 'fourth': 2,
 'rookie': 2,
 'every': 2,
 'wembanyamas': 2,
 '15': 2,
 'two': 2,
 'weve': 2,
 'play': 2,
 'us': 2,
 'dont': 2,
 'know': 2,
 'like': 2,
 'popovich': 2,
 'booker': 2,
 'make': 2,
 'advantage': 2,
 'strolled': 1,
 'hallways': 1,
 'footprint': 1,
 'center': 1,
 'best': 1,
 'short': 1,
 'career': 1,
 'passed': 1,
 '13time': 1,
 'kevin': 1,
 'stopping': 1,
 'handshake': 1,
 'hug': 1,
 '7ft': 1,
 '4in': 1,
 'frenchman': 1,
 'taller': 1,
 'version': 1,
 'durants': 1,
 'impression': 1,
 'budding': 1,
 'san': 1,
 'antonio': 1,
 'might': 1,
 'even': 1,
 'better': 1,
 'enthusiasm': 1,
 'tell': 1,
 'tv': 1,
 'playin

In [23]:
processText(tech_example,swEN)

{'palestinians': 15,
 'cameras': 13,
 'west': 11,
 'bank': 11,
 'israeli': 11,
 'surveillance': 11,
 'hikvision': 10,
 'amnesty': 10,
 'said': 10,
 'human': 8,
 'jerusalem': 7,
 'report': 7,
 'rights': 7,
 'according': 6,
 'east': 6,
 'used': 6,
 'international': 5,
 'palestinian': 5,
 'police': 5,
 'identified': 5,
 'recognition': 5,
 'mahmoudi': 5,
 'silwan': 4,
 'one': 4,
 'city': 4,
 'apartheid': 4,
 'company': 4,
 'facialrecognition': 4,
 'settlers': 4,
 'particular': 4,
 'facial': 4,
 'us': 3,
 'occupied': 3,
 'gate': 3,
 'old': 3,
 'daily': 3,
 'investigators': 3,
 'system': 3,
 'based': 3,
 'video': 3,
 'repression': 3,
 'tools': 3,
 'people': 3,
 'features': 3,
 'guardian': 3,
 'products': 3,
 'companys': 3,
 'vast': 3,
 'technologies': 3,
 'access': 3,
 'blacklisted': 2,
 'residents': 2,
 'homes': 2,
 'family': 2,
 'could': 2,
 'house': 2,
 'time': 2,
 'damascus': 2,
 'public': 2,
 'areas': 2,
 'watched': 2,
 'ability': 2,
 'protest': 2,
 'also': 2,
 'occupation': 2,
 'organi

### Stemming

As you may notice above, certain words appear in different forms, such as 'use' vs 'used' or 'technology' vs 'technologies'. One way of eliminating different variations of a word is called [*stemming*](https://en.wikipedia.org/wiki/Stemming).

In [24]:
from nltk.stem.snowball import SnowballStemmer

In [25]:
stemmer = SnowballStemmer('english')
stemmer.stem('tokenization')

'token'
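
What stemming does to a frequency table is merge the counts of all the forms that share a stem. The sketch below shows only that merging step: the counts are copied from the unstemmed Wikipedia output above, and the lookup table `toy_stems` is a hand-written, hypothetical stand-in for `stemmer.stem`, not the Snowball algorithm.

```python
from collections import Counter

# Counts copied from the unstemmed output above.
raw_counts = {'system': 15, 'systems': 5, 'used': 6, 'use': 6, 'using': 4}

# Hypothetical stand-in for a real stemmer: a hand-written lookup table.
toy_stems = {'systems': 'system', 'used': 'use', 'using': 'use'}

def toy_stem(word):
    # Words missing from the table are left unchanged.
    return toy_stems.get(word, word)

merged = Counter()
for word, count in raw_counts.items():
    merged[toy_stem(word)] += count

print(dict(merged))
```

The two forms of 'system' collapse into a single entry with 15 + 5 = 20 occurrences, which matches the count for `system` in the stemmed output further down.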

Let us modify our function now:

In [26]:
def wordFrequencies(text, sws, st):
    words = word_tokenize(re.sub(r'[^\w\s]+','',text.lower()))
    res = Counter([st(w) for w in words if w not in sws])
    return dict(sorted(res.items(), key=lambda item: -item[1]))

In [27]:
wordFrequencies(wiki_example,swEN,stemmer.stem)

{'data': 40,
 'scrape': 22,
 'system': 20,
 'use': 18,
 'screen': 16,
 'web': 13,
 'program': 12,
 'comput': 11,
 'process': 9,
 'output': 8,
 'often': 8,
 'interfac': 8,
 'report': 8,
 'techniqu': 7,
 'extract': 7,
 'display': 6,
 'control': 6,
 'human': 6,
 'termin': 6,
 'user': 6,
 'autom': 5,
 'pars': 5,
 'provid': 5,
 'api': 5,
 'modern': 5,
 'sourc': 5,
 'scraper': 5,
 'humanread': 4,
 'anoth': 4,
 'structur': 4,
 'format': 4,
 'endus': 4,
 'usual': 4,
 'involv': 4,
 'case': 4,
 'requir': 4,
 'applic': 4,
 'captur': 4,
 'connect': 4,
 'common': 4,
 'tool': 4,
 'easili': 3,
 'intend': 3,
 'input': 3,
 'document': 3,
 'legaci': 3,
 'mechan': 3,
 'avail': 3,
 'result': 3,
 'practic': 3,
 'instead': 3,
 'refer': 3,
 'text': 3,
 'port': 3,
 'term': 3,
 'exampl': 3,
 'page': 3,
 'file': 3,
 'mine': 3,
 'develop': 3,
 'normal': 2,
 'transfer': 2,
 'interchang': 2,
 'minim': 2,
 'conveni': 2,
 'imag': 2,
 'inform': 2,
 'either': 2,
 'done': 2,
 'thirdparti': 2,
 'oper': 2,
 'reason': 2,


In [28]:
wordFrequencies(nba_example,swEN,stemmer.stem)

{'wembanyama': 9,
 'game': 7,
 'durant': 7,
 'said': 7,
 'hes': 7,
 'spur': 6,
 'play': 6,
 'nba': 4,
 'player': 3,
 'sun': 3,
 'night': 3,
 'quarter': 3,
 'lead': 3,
 'make': 3,
 'also': 3,
 'know': 3,
 'got': 3,
 'victor': 2,
 'pass': 2,
 'allstar': 2,
 'stop': 2,
 'quick': 2,
 'say': 2,
 'first': 2,
 'star': 2,
 'go': 2,
 'anyon': 2,
 'els': 2,
 'one': 2,
 '38': 2,
 'phoenix': 2,
 'thursday': 2,
 'dunk': 2,
 'run': 2,
 'fourth': 2,
 'rooki': 2,
 'everi': 2,
 '15': 2,
 'two': 2,
 'weve': 2,
 'us': 2,
 'dont': 2,
 'like': 2,
 'popovich': 2,
 'booker': 2,
 'advantag': 2,
 'stroll': 1,
 'hallway': 1,
 'footprint': 1,
 'center': 1,
 'best': 1,
 'short': 1,
 'career': 1,
 '13time': 1,
 'kevin': 1,
 'handshak': 1,
 'hug': 1,
 '7ft': 1,
 '4in': 1,
 'frenchman': 1,
 'taller': 1,
 'version': 1,
 'impress': 1,
 'bud': 1,
 'san': 1,
 'antonio': 1,
 'might': 1,
 'even': 1,
 'better': 1,
 'enthusiasm': 1,
 'tell': 1,
 'tv': 1,
 'person': 1,
 'creat': 1,
 'lane': 1,
 'much': 1,
 'differ': 1,
 'fif

In [29]:
wordFrequencies(tech_example,swEN,stemmer.stem)

{'palestinian': 20,
 'camera': 13,
 'surveil': 12,
 'hikvis': 11,
 'west': 11,
 'bank': 11,
 'amnesti': 11,
 'isra': 11,
 'said': 10,
 'report': 8,
 'human': 8,
 'jerusalem': 7,
 'compani': 7,
 'right': 7,
 'identifi': 7,
 'use': 7,
 'accord': 6,
 'intern': 6,
 'east': 6,
 'technolog': 6,
 'polic': 5,
 'citi': 5,
 'protest': 5,
 'show': 5,
 'recognit': 5,
 'mahmoudi': 5,
 'silwan': 4,
 'one': 4,
 'public': 4,
 'apartheid': 4,
 'system': 4,
 'video': 4,
 'facialrecognit': 4,
 'settler': 4,
 'particular': 4,
 'facial': 4,
 'us': 3,
 'occupi': 3,
 'resid': 3,
 'home': 3,
 'detect': 3,
 'time': 3,
 'gate': 3,
 'old': 3,
 'daili': 3,
 'investig': 3,
 'previous': 3,
 'israel': 3,
 'base': 3,
 'repress': 3,
 'uyghur': 3,
 'act': 3,
 'tool': 3,
 'kill': 3,
 'peopl': 3,
 'featur': 3,
 'guardian': 3,
 'enabl': 3,
 'activ': 3,
 'crowd': 3,
 'product': 3,
 'research': 3,
 'vast': 3,
 'access': 3,
 'blacklist': 2,
 'famili': 2,
 'could': 2,
 'hous': 2,
 'damascus': 2,
 'area': 2,
 'gather': 2,
 'de

### Now, in Turkish...

Let us repeat what we have done for a Turkish text. First, let us get the RSS feed from Milliyet:

In [30]:
def getSubjectMilliyet(subject):
    with requests.get(f'https://www.milliyet.com.tr/rss/rssNew/{subject}Rss.xml') as link:
        raw = parse(link.text.encode('iso8859-9'))
    return raw['rss']['channel']['item']

In [31]:
ekonomi = getSubjectMilliyet('ekonomi')
ekonomi

[{'guid': {'@isPermaLink': 'false', '#text': '7033610'},
  'title': '12 Kasım Şans Topu çekiliş sonuçları açıklandı! Şans Topu çekilişinde büyük ikramiye kazandıran numaralar...',
  'description': '<img src="https://image.milimaj.com/i/milliyet/75/460x340/6550e80386b2442248771296.jpg"/><p>Sisal Şans ile kazandıran Şans Topu çekilişinde 5+1 bilen büyük ödülün kazananı oluyor. Çekiliş de canlı olarak aktarılıyor. 12 Kasım Şans Topu çekiliş sonucu sorgulama...</p>\n<p><strong>12 KASIM ŞANS TOPU ÇEKİLİŞ SONUÇLARI</strong></p>\n<p>Kazandıran Şans Topu çekilişi pazar akşamı saat <strong>20:00</strong>\'de yapıldı. Çekiliş ise canlı olarak <strong>millipiyangoonline.com ve Milli Piyango TV</strong> Youtube kanalından aktarılıyor.</p>\n<p>Kazandıran numaralar; <strong>3-7-18-20-25+10</strong> oldu. Açıklanan sonuçlara göre<strong> 5+1 bilen 1 talihli2.523.206,00 TL ödül kazandı.</strong></p>\n<p><strong><img src="https://image.milimaj.com/i/milliyet/75/770x0/6551087d86b24422487712dc.jpg" width

In [32]:
for x in ekonomi[:4]:
    print(f'{x["title"]}\n{x["atom:link"]["@href"]}\n')

12 Kasım Şans Topu çekiliş sonuçları açıklandı! Şans Topu çekilişinde büyük ikramiye kazandıran numaralar...
https://www.milliyet.com.tr/ekonomi/12-kasim-sans-topu-cekilis-sonuclari-aciklandi-sans-topu-cekilisinde-buyuk-ikramiye-kazandiran-numaralar-7033610

Süper Loto 12 Kasım çekiliş sonuçları saat kaçta açıklanıyor? Süper Loto çekiliş sonucu sorgulama...
https://www.milliyet.com.tr/ekonomi/super-loto-12-kasim-cekilis-sonuclari-saat-kacta-aciklaniyor-super-loto-cekilis-sonucu-sorgulama-7033611

Devasa alan 1 yılda tamamlandı! Bakan Göktaş CNN Türk’e anlattı: Dünyada örneği yok
https://www.milliyet.com.tr/ekonomi/devasa-alan-1-yilda-tamamlandi-bakan-goktas-cnn-turke-anlatti-dunyada-ornegi-yok-7033465

Bakan Şimşek: Türkiye'ye yatırımcı güveni geri geliyor
https://www.milliyet.com.tr/ekonomi/bakan-simsek-turkiyeye-yatirimci-guveni-geri-geliyor-7033560



In [33]:
ekonomi_example = getText(ekonomi[0]['atom:link']['@href'])
ekonomi_example

"12.11.2023 - 20:16 | Son Güncellenme: 12.11.2023 - 20:16 Sisal Şans ile kazandıran Şans Topu çekilişinde 5+1 bilen büyük ödülün kazananı oluyor. Çekiliş de canlı olarak aktarılıyor. 12 Kasım Şans Topu çekiliş sonucu sorgulama... 12 KASIM ŞANS TOPU ÇEKİLİŞ SONUÇLARI Kazandıran Şans Topu çekilişi pazar akşamı saat 20:00'de yapıldı. Çekiliş ise canlı olarak millipiyangoonline.com ve Milli Piyango TV Youtube kanalından aktarılıyor. Kazandıran numaralar; 3-7-18-20-25+10 oldu. Açıklanan sonuçlara göre 5+1 bilen 1 talihli\xa02.523.206,00 TL ödül kazandı.  Türkiye'den ve Dünya’dan son dakika haberler, köşe yazıları, magazinden siyasete, spordan seyahate bütün konuların tek adresi milliyet.com.tr; Milliyet.com.tr haber içerikleri izin alınmadan, kaynak gösterilerek dahi iktibas edilemez, kanuna aykırı ve izinsiz olarak kopyalanamaz, başka yerde yayınlanamaz."

In [34]:
swTR = stopwords.words('turkish')
trStemmer = SnowballStemmer('turkish')

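ValueError: The language 'turkish' is not supported.

NLTK's `SnowballStemmer` does not support Turkish, so the second line of the cell fails. The stop word list `swTR` is still defined because it was assigned before the error. For stemming we switch to the standalone `snowballstemmer` package, which does include a Turkish stemmer: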

In [35]:
from snowballstemmer import turkish_stemmer 

trStemmer = turkish_stemmer.TurkishStemmer()
trStemmer.stemWord('uygarlaştıramadıklarımızdanmışcasına')

'uygarlaştıramadık'

In [36]:
wordFrequencies(ekonomi_example,swTR,trStemmer.stemWord)

{'çekiliş': 6,
 'şans': 5,
 'top': 4,
 'kazandıra': 3,
 'olarak': 3,
 '12112023': 2,
 '2016': 2,
 'son': 2,
 '51': 2,
 'bile': 2,
 'ödül': 2,
 'canlı': 2,
 'aktarılıyor': 2,
 '12': 2,
 'kas': 2,
 'sonuç': 2,
 'haber': 2,
 'milliyetcomtr': 2,
 'güncellenme': 1,
 'sisal': 1,
 'büyük': 1,
 'kazana': 1,
 'oluyor': 1,
 'sorgula': 1,
 'sonuçlari': 1,
 'pazar': 1,
 'akşa': 1,
 'saat': 1,
 '2000de': 1,
 'yapıl': 1,
 'millipiyangoonlineco': 1,
 'milli': 1,
 'piyango': 1,
 'tv': 1,
 'youtube': 1,
 'kanal': 1,
 'numara': 1,
 '3718202510': 1,
 'ol': 1,
 'açıklana': 1,
 'gör': 1,
 '1': 1,
 'talihli': 1,
 '252320600': 1,
 'tl': 1,
 'kaza': 1,
 'türkiye': 1,
 'dünya': 1,
 'dakik': 1,
 'köş': 1,
 'yazı': 1,
 'magaz': 1,
 'siyase': 1,
 'spor': 1,
 'seyaha': 1,
 'büt': 1,
 'kon': 1,
 'tek': 1,
 'adres': 1,
 'içerik': 1,
 'iz': 1,
 'alınma': 1,
 'kaynak': 1,
 'gösterilerek': 1,
 'dahi': 1,
 'iktibas': 1,
 'edilemez': 1,
 'kan': 1,
 'aykır': 1,
 'izinsiz': 1,
 'kopyalanamaz': 1,
 'başka': 1,
 'yer': 1,
 '

## A Second Example: Summarization and Keyword Extraction

Word frequencies give a rough idea of what a text is about, but there is a better method. Before we get to it, let us talk about [*word embeddings*](https://en.wikipedia.org/wiki/Word_embedding).

A word embedding is a mathematical way of converting a text into a collection of vectors such that the syntactic and semantic relations among words are mirrored by the metric relations (distances and angles) between the embedding vectors. Here is a picture of what we mean by this:

![Analogy](https://www.ed.ac.uk/sites/default/files/styles/landscape_breakpoints_theme_uoe_tv_1x/public/thumbnails/image/diagram-20190710.png?itok=niBaDPXj)
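
To make "metric relations" concrete, here is a minimal sketch with made-up 2-dimensional vectors. The numbers and the word list are assumptions chosen for illustration, not the output of a trained model. It checks a commonly used analogy pattern: moving from `man` to `woman` should be roughly the same step as moving from `king` to `queen`.

```python
import math

# Hypothetical 2-D embeddings, invented for illustration only.
emb = {
    "man":   (1.0, 1.0),
    "woman": (1.0, 3.0),
    "king":  (4.0, 1.2),
    "queen": (4.0, 3.1),
    "apple": (0.5, -2.0),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(target, exclude):
    """Word (outside `exclude`) whose embedding is most similar to `target`."""
    candidates = [w for w in emb if w not in exclude]
    return max(candidates, key=lambda w: cosine(emb[w], target))

# king - man + woman should land close to queen
target = tuple(k - m + w for k, m, w in zip(emb["king"], emb["man"], emb["woman"]))
answer = nearest(target, exclude={"king", "man", "woman"})
print(answer)
```

With real embeddings the vectors have hundreds of dimensions, but the comparison works the same way.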

The simplest way to convert a text into a collection of vectors is to count the words that occur within a context, such as a sentence or a document.

In [37]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import PCA

### Bag of Words Embedding

Let's look at the 'Bag of Words' (BoW) model, one of the simplest ways of turning text into vectors. In a BoW representation, a text is represented as the bag (multiset) of its words, disregarding grammar and word order but keeping multiplicity. BoW builds a vector in which each word of the vocabulary corresponds to one position, and the value at that position is the number of times the word occurs in the document.
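
As a minimal sketch of the idea, using only the standard library: the helper `bag_of_words` below is a hypothetical name introduced for this illustration. It shows that two sentences with the same words in a different order produce the same bag.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lower-case the text, split it into word tokens, and count each token."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tokens)

a = bag_of_words("The dog chased the cat.")
b = bag_of_words("The cat chased the dog!")

print(a)
# Word order and punctuation are ignored, so both bags compare equal
print(a == b)
```

This is also the main limitation of the model: "the dog chased the cat" and "the cat chased the dog" mean different things but get identical representations.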

The scikit-learn library provides two vectorizers built on this idea: `CountVectorizer` and `TfidfVectorizer`.

#### **CountVectorizer** 

`CountVectorizer` tokenizes the documents into words and produces a matrix whose entry `a[i][j]` is the count of the j-th vocabulary word in the i-th document.

Let's say we have the following list of sentences:

In [38]:
corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

vectorizer.get_feature_names_out()

array(['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third',
       'this'], dtype=object)

The method `fit_transform` learns the vocabulary from the training set `corpus` and, in the same step, returns the transformed `corpus` as a document-term matrix. The feature names (the word at each index of the vector) are available through `get_feature_names_out()`, as shown above.

The `transform` method converts documents into a document-term matrix using a vocabulary that has already been learned. We can convert the counts to a pandas DataFrame for easier reading:

In [39]:
import pandas as pd
df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
df

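|   | and | document | first | is | one | second | the | third | this |
|---|-----|----------|-------|----|-----|--------|-----|-------|------|
| 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 |
| 1 | 0 | 2 | 0 | 1 | 0 | 1 | 1 | 0 | 1 |
| 2 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 |
| 3 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 |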


The resulting DataFrame shows the count of each word in each document of the corpus.

#### **TfidfVectorizer** 

TF-IDF stands for "term frequency, inverse document frequency". It scores the importance of a word (or "term") in a document relative to a collection of documents. If a word appears frequently in a document, it is likely to be important for that document (the term frequency part). If a word appears in many documents, it does not distinguish any one of them, so its importance is reduced (the inverse document frequency part). The product of the two gives each word a score, or weight.

Let's use the same sentences with `TfidfVectorizer`. The columns represent the same words, but the counts are replaced with TF-IDF scores. A low score means that the term is either rare within that document or common across the corpus (for example 'is', 'the' and 'this', which occur in every sentence), so it does little to characterize the document. A term that is frequent in a document but rare in the rest of the corpus gets a high score.

In [40]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
df

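|   | and | document | first | is | one | second | the | third | this |
|---|-----|----------|-------|----|-----|--------|-----|-------|------|
| 0 | 0.0 | 0.469791 | 0.580286 | 0.384085 | 0.0 | 0.0 | 0.384085 | 0.0 | 0.384085 |
| 1 | 0.0 | 0.687624 | 0.0 | 0.281089 | 0.0 | 0.538648 | 0.281089 | 0.0 | 0.281089 |
| 2 | 0.511849 | 0.0 | 0.0 | 0.267104 | 0.511849 | 0.0 | 0.267104 | 0.511849 | 0.267104 |
| 3 | 0.0 | 0.469791 | 0.580286 | 0.384085 | 0.0 | 0.0 | 0.384085 | 0.0 | 0.384085 |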
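
The scores in the first row can be reproduced by hand. The sketch below assumes scikit-learn's default settings: a smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) the number of documents containing t, followed by scaling each row to unit Euclidean length. The helper name `tfidf` is introduced here for illustration only.

```python
import math
import re
from collections import Counter

corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']

# Lower-cased tokens of two or more word characters (mimics the default tokenizer)
docs = [re.findall(r"\b\w\w+\b", d.lower()) for d in corpus]
n = len(docs)

# Document frequency: in how many documents does each term occur?
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """TF-IDF weights of one tokenized document, scaled to unit length."""
    counts = Counter(doc)
    raw = {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}

scores = tfidf(docs[0])
print({t: round(v, 6) for t, v in scores.items()})
```

The words 'is', 'the' and 'this' occur in all four sentences, so their idf is the minimum value 1, which is why they get the lowest non-zero scores in every row.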


#### A Larger Example

Let us use these vectorizers on the texts we worked with earlier. 


In [41]:
sentences = sent_tokenize(wiki_example)
processed = []

for x in sentences:
    tmp = re.sub(r'[^\w\s]+', '', x.lower())
    processed.append(tmp)

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(processed)

df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
df

Unnamed: 0,10,1960sthe,1980s,20,2480,3270s,50yearold,78,accomplished,acquire,...,where,whereas,which,will,with,without,working,write,wrote,xhtml
0,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,3,0,1,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [42]:
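# Project each sentence's count vector onto the first principal component
projection = PCA(n_components=1)
weights = projection.fit_transform(X.toarray())
# One tuple per sentence: (index, weight, original sentence, cleaned sentence)
res = list(zip(range(len(sentences)),weights.T[0],sentences,processed))
# Sort by weight, largest first
sorted(res, key=lambda item: -item[1])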

[(25,
  6.835362088130517,
  'The screen scraper might connect to the legacy system via Telnet, emulate the keystrokes needed to navigate the old user interface, process the resulting display output, extract the desired data, and pass it on to the modern system.',
  'the screen scraper might connect to the legacy system via telnet emulate the keystrokes needed to navigate the old user interface process the resulting display output extract the desired data and pass it on to the modern system'),
 (8,
  5.822033661372326,
  'In the second case, the operator of the third-party system will often see screen scraping as unwanted, due to reasons such as increased system load, the loss of advertisement revenue, or the loss of control of the information content.',
  'in the second case the operator of the thirdparty system will often see screen scraping as unwanted due to reasons such as increased system load the loss of advertisement revenue or the loss of control of the information content'),


In [43]:
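def getWeights(text,vectorizer,sws,st):
    """Score each sentence of `text` by its coordinate on the first principal component."""
    sentences = sent_tokenize(text)
    processed = []
    for s in sentences:
        # Drop stop words (the check is case-sensitive) and stem what remains
        res = [st(w) for w in word_tokenize(s) if w not in sws]
        processed.append(' '.join(res))
    # Build the sentence-by-term matrix and project it onto one dimension
    X = vectorizer.fit_transform(processed)
    projector = PCA(n_components=1)
    weights = projector.fit_transform(X.toarray())
    # One tuple per sentence: (index, weight, original sentence, processed sentence)
    return list(zip(range(len(sentences)),weights.T[0],sentences,processed))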
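
To see what the one-component PCA step computes, here is a minimal NumPy sketch on a small, made-up sentence-by-term matrix (the counts are assumptions for illustration). It centers the columns, finds the direction of largest variance with a singular value decomposition, and uses each sentence's coordinate along that direction as its weight.

```python
import numpy as np

# Hypothetical count matrix: 4 sentences (rows) by 3 terms (columns)
X = np.array([
    [3, 1, 0],
    [0, 1, 0],
    [1, 0, 2],
    [0, 0, 1],
], dtype=float)

# PCA centers each column before looking for the direction of largest variance
Xc = X - X.mean(axis=0)

# Rows of Vt are the principal directions, ordered by decreasing singular value
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Coordinate of every sentence along the first principal direction
scores = Xc @ Vt[0]

# Rank sentences by score, largest first
ranking = np.argsort(-scores)
print(scores)
print(ranking)
```

Up to an overall sign, `scores` should agree with `PCA(n_components=1).fit_transform(X)`. The sign of a principal direction is arbitrary, so which end of the ranking counts as "most important" depends on the sign convention of the implementation. Sentences with many words also tend to sit far from the mean, which partly explains why long sentences come out on top in the examples.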

In [44]:
result = getWeights(nba_example,CountVectorizer(),swEN,stemmer.stem)
sorted(result, key=lambda x: -x[1])[:4]

[(10,
  5.863722316209936,
  "EVERY HIGHLIGHT from Victor Wembanyama's dominant night in the Spurs' W 🎬38 PTS (career-high)10 REB2 BLK pic.twitter.com/E7FXwDlX9A Wembanyama shot 15 of 26 from the field and added two assists, two blocks and a steal to help the Spurs complete a two-game sweep in Phoenix.",
  "everi highlight victor wembanyama 's domin night spur ' w 🎬38 pts ( career-high ) 10 reb2 blk pic.twitter.com/e7fxwdlx9a wembanyama shot 15 26 field ad two assist , two block steal help spur complet two-gam sweep phoenix ."),
 (5,
  2.5282012526213413,
  'He’s going to create his own lane and is much different than anyone else who has played.” In just his fifth NBA game, the 19-year-old Wembanyama showed why he was one of the most-hyped prospects in years, scoring 38 points and grabbing 10 rebounds in a standout performance during a 132-121 victory over Durant and the Phoenix Suns on Thursday night.',
  'he ’ go creat lane much differ anyon els played. ” in fifth nba game , 19-year-

In [45]:
result = getWeights(tech_example,CountVectorizer(),swEN,stemmer.stem)
sorted(result, key=lambda x: -x[1])[:4]

[(22,
  3.7326599444489257,
  '“Hikvision’s critical role in surveilling and oppressing Muslims in Xinjiang and the company’s failure to take accountability shows that the company is not serious about ethics or protecting human rights,” said Conor Healy, the director of government research at surveillance research publication Internet Protocol Video Market (IPVM) in a statement to the Guardian.',
  '“ hikvis ’ critic role surveil oppress muslim xinjiang compani ’ failur take account show compani serious ethic protect human right , ” said conor heali , director govern research surveil research public internet protocol video market ( ipvm ) statement guardian .'),
 (21,
  1.617427878497036,
  'Experts on surveillance tools used in the repression of Uyghurs argued the company’s history shows Hikvision has not followed through on previous commitments to preserve human rights.',
  'expert surveil tool use repress uyghur argu compani ’ histori show hikvis follow previous commit preserv human

## BERT Models

Bidirectional Encoder Representations from Transformers (BERT) is a Transformer-based language model developed by Google for natural language processing (NLP). BERT is designed to pre-train deep bidirectional representations from unlabelled text by jointly conditioning on both left and right context in all layers. The result is a pre-trained model that can be fine-tuned for a wide range of tasks, often with better results than training a task-specific model from scratch.

### Mathematical Background

BERT is based on the encoder part of the Transformer architecture. Transformers use attention mechanisms to weigh the influence of the words in a sequence on each other. Instead of recurrent or convolutional layers, a Transformer encoder stacks several self-attention layers, and each layer processes all positions of the sequence in parallel.
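
The core operation of a self-attention layer is scaled dot-product attention. The NumPy sketch below shows a single attention head on random inputs. The matrix sizes and the random projection matrices `Wq`, `Wk`, `Wv` are assumptions for illustration, and a real Transformer layer adds multiple heads, residual connections, normalization, and a feed-forward block on top of this.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax, shifted for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    # weights[i, j]: how much position i attends to position j
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 8, 4           # toy sizes
X = rng.normal(size=(seq_len, d_model))   # one vector per token
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = self_attention(X, Wq, Wk, Wv)
print(out.shape)
print(weights.sum(axis=1))  # every row of attention weights sums to 1
```

Each output row is a weighted average of the value vectors of all positions, to the left and to the right, which is what makes the representation bidirectional.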

One of the central innovations in BERT is masked language modelling. During pre-training, some of the input tokens are masked at random (15% in the original BERT setup), and the model is trained to predict only those masked tokens from the context provided by the non-masked tokens on both sides.
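
The masking step can be sketched in a few lines. This is a simplified version under stated assumptions: the function name `mask_tokens` is hypothetical, every selected token is replaced with `[MASK]`, and the masking probability is raised to 0.3 so that the short example is likely to show some masks. The original BERT recipe additionally replaces some selected tokens with a random token or leaves them unchanged.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked tokens, labels); labels are None where nothing is predicted."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            labels.append(tok)    # the model must recover this token
        else:
            masked.append(tok)
            labels.append(None)   # no prediction at this position
    return masked, labels

tokens = "the screen scraper might connect to the legacy system via telnet".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)
print(labels)
```

The training loss is computed only at the positions where `labels` is not `None`.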

Tokenization in BERT requires more than simply splitting the input into words. BERT uses WordPiece tokenization. A WordPiece vocabulary starts from individual characters and is grown by repeatedly adding frequent multi-character combinations, so that common words are kept whole while rare words are split into smaller subword pieces.
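
Once the vocabulary exists, a word is split by greedy longest-match-first lookup, and pieces that continue a word are marked with a `##` prefix. The sketch below uses a tiny hypothetical vocabulary, not BERT's real one, and the function name `wordpiece` is introduced for illustration.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece   # continuation of a word
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]               # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary
vocab = {"un", "##aff", "##able", "scrap", "##ing", "##er"}

print(wordpiece("unaffable", vocab))
print(wordpiece("scraping", vocab))
print(wordpiece("xyz", vocab))
```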

### BERT Representation

Each token is mapped to an embedding vector through a WordPiece vocabulary of about 30k entries (30,522 for `bert-base-uncased`). Position and segment embeddings are added element-wise to these token embeddings to get the final input representation.
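
In the original BERT design the three embeddings have the same dimension and are summed, not concatenated. The NumPy sketch below illustrates this with random lookup tables. All sizes and token ids are toy assumptions (the real base model uses a hidden size of 768), and the real model also applies layer normalization and dropout after the sum.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_positions, n_segments, hidden = 100, 16, 2, 8  # toy sizes

# One lookup table per kind of embedding, all with the same hidden size
token_table = rng.normal(size=(vocab_size, hidden))
position_table = rng.normal(size=(max_positions, hidden))
segment_table = rng.normal(size=(n_segments, hidden))

token_ids = np.array([1, 42, 7, 2])        # hypothetical token ids
segment_ids = np.array([0, 0, 0, 0])       # all tokens belong to sentence A
positions = np.arange(len(token_ids))      # 0, 1, 2, 3

# Element-wise sum: the result keeps the hidden size
inputs = token_table[token_ids] + position_table[positions] + segment_table[segment_ids]
print(inputs.shape)
```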

Each input sequence in BERT needs special tokens at the beginning (`[CLS]`) and at the end (`[SEP]`), and each word is tokenized into subwords. `BertTokenizer` from the `transformers` library handles this tokenization.

In [131]:
import numpy as np
import pandas as pd
import torch  # needed for return_tensors='pt' below

from transformers import BertTokenizer, BertModel

In [127]:
# Padding and truncation are chosen per call (see the next cell), not when the tokenizer is loaded
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', model_max_length=512)
model = BertModel.from_pretrained("bert-base-uncased")

In [130]:
np.array([ tokenizer.encode(x, return_tensors='pt', max_length=80, padding='max_length', truncation=True).numpy().reshape(-1) for x in sentences ])

array([[  101,  2951, 23704, ...,     0,     0,     0],
       [  101,  5373,  1010, ...,     0,     0,     0],
       [  101,  2107,  8989, ...,     0,     0,     0],
       ...,
       [  101,  6168,  2951, ...,     0,     0,     0],
       [  101,  2122,  2064, ...,     0,     0,     0],
       [  101,  2023,  3921, ...,     0,     0,     0]])
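
Every row of the array above follows the same layout: the id of `[CLS]` (101), the WordPiece ids of the sentence, the id of `[SEP]` (102), and then padding with the id of `[PAD]` (0) up to the requested length. The sketch below rebuilds that layout by hand. The function name `add_special_tokens` is hypothetical, and the attention mask it returns is what a model would use to ignore the padded positions.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # ids in the bert-base-uncased vocabulary

def add_special_tokens(token_ids, max_length):
    """Wrap ids in [CLS] ... [SEP], then truncate or pad to max_length."""
    body = token_ids[: max_length - 2]       # leave room for the two special tokens
    ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(ids)          # 1 = real token
    padding = max_length - len(ids)
    return ids + [PAD_ID] * padding, attention_mask + [0] * padding

ids, mask = add_special_tokens([2951, 23704], max_length=6)
print(ids)
print(mask)
```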
