# PySpark Part of Speech (POS) analysis
Text taken from [Reuters](https://www.reuters.com/business/finance/banks-beware-outsiders-are-cracking-code-finance-2021-09-17/).

In [3]:
import nltk
from pyspark import SparkContext

In [4]:
from nltk.tokenize import word_tokenize

### Downloading the NLTK packages
This includes the "universal_tagset" package, which maps the tagger's fine-grained tags to a small set of coarse POS categories.

In [5]:
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")
nltk.download('universal_tagset')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\mateo\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\mateo\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package universal_tagset to
[nltk_data]     C:\Users\mateo\AppData\Roaming\nltk_data...
[nltk_data]   Package universal_tagset is already up-to-date!


True
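
To make the role of the universal tagset concrete, here is a minimal sketch of the kind of mapping it encodes: fine-grained Penn Treebank tags collapse into a few coarse categories. The dictionary is a hand-written, partial excerpt for illustration only, the names `PTB_TO_UNIVERSAL` and `to_universal` are hypothetical, and the sample input is made up. NLTK ships the full mapping in the "universal_tagset" package.

```python
# Partial, hand-written excerpt of the Penn Treebank -> universal mapping.
# Illustrative only: the real mapping lives in NLTK's "universal_tagset" data.
PTB_TO_UNIVERSAL = {
    "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN", "NNPS": "NOUN",
    "VB": "VERB", "VBD": "VERB", "VBG": "VERB",
    "VBN": "VERB", "VBP": "VERB", "VBZ": "VERB",
    "JJ": "ADJ", "RB": "ADV", "IN": "ADP", "DT": "DET",
    "CC": "CONJ", "CD": "NUM", "PRP": "PRON", "RP": "PRT",
    ",": ".", ".": ".",
}

def to_universal(ptb_tag):
    """Return the coarse tag for a Penn Treebank tag.

    Falls back to 'X', the universal tagset's catch-all category.
    This fallback is a choice made for this sketch.
    """
    return PTB_TO_UNIVERSAL.get(ptb_tag, "X")

# Made-up fine-grained input, only to show the collapse into coarse tags.
fine_grained = [("customers", "NNS"), ("offer", "VB"), ("and", "CC"), (",", ",")]
coarse = [(word, to_universal(tag)) for word, tag in fine_grained]
print(coarse)
```

This is why the results further down contain only a handful of distinct tags such as "NOUN", "VERB" and ".".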

In [6]:
# Entry point for working with RDDs
sc = SparkContext(appName="pyspark-pos-analysis")

In [7]:
# Loading a text file
rdd_reuters = sc.textFile("./data/reuters.txt")

### Verifying results
First, we count the elements in the RDD. Since "textFile()" creates one element per line, this is the number of lines in the file.

In [8]:
rdd_reuters.count()

87

Then we can use ".take(5)" to inspect how the data is stored.

In [9]:
rdd_reuters.take(5)

['Banks beware, Amazon and Walmart are cracking the code for finance',
 '',
 'LONDON, Sept 17 (Reuters) - Anyone can be a banker these days, you just need the right code.',
 '',
 'Global brands from Mercedes and Amazon (AMZN.O) to IKEA and Walmart (WMT.N) are cutting out the traditional financial middleman and plugging in software from tech startups to offer customers everything from banking and credit to insurance.']
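
The output above shows that blank lines are stored as empty strings, so the count of 87 includes them. A minimal sketch of filtering them out follows. It uses a plain Python list as a stand-in for the RDD so that it runs without Spark, and the helper name `has_text` is hypothetical. With Spark, the same predicate could be passed to `rdd_reuters.filter(has_text)`.

```python
def has_text(line):
    """True when the line contains at least one non-whitespace character."""
    return bool(line.strip())

# Stand-in for the first lines of the RDD, copied from the output above.
sample_lines = [
    "Banks beware, Amazon and Walmart are cracking the code for finance",
    "",
    "LONDON, Sept 17 (Reuters) - Anyone can be a banker these days, you just need the right code.",
    "",
]

non_empty = list(filter(has_text, sample_lines))
print(len(sample_lines), len(non_empty))
```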

## Splitting the text into words and storing them in a flat structure
The result is a new RDD with one token per element, rather than one list of tokens per line. Taking its first 15 elements shows the structure:

In [24]:
words = rdd_reuters.flatMap(lambda x: word_tokenize(x))
words.take(15)

['Banks',
 'beware',
 ',',
 'Amazon',
 'and',
 'Walmart',
 'are',
 'cracking',
 'the',
 'code',
 'for',
 'finance',
 'LONDON',
 ',',
 'Sept']
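
"flatMap()" differs from "map()" in that it concatenates the per-line results into one flat sequence instead of keeping one list per line. The sketch below mimics both behaviors on a plain Python list. `str.split` stands in for `word_tokenize`, an assumption made so the example runs without NLTK data; unlike `word_tokenize`, it does not separate punctuation from words.

```python
from itertools import chain

lines = ["Banks beware", "", "LONDON Sept 17"]

# map-like behavior: one list of tokens per input line (blank lines give []).
mapped = [line.split() for line in lines]

# flatMap-like behavior: the per-line lists are chained into one flat list.
flat_mapped = list(chain.from_iterable(line.split() for line in lines))

print(mapped)
print(flat_mapped)
```

Blank lines contribute nothing to the flat result, which is why no empty strings appear among the tokens.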

### Using NLTK to analyze POS on the split text
First, we use the "pos_tag()" function to pair each word with its Part of Speech (POS). Passing tagset='universal' applies the previously downloaded "universal" tagset, which reduces the output to a small set of coarse categories.

In [27]:
tagged_words = nltk.pos_tag(words.collect(), tagset='universal')
# Print only the first 5 results
for pair in tagged_words[:5]:
    print(pair)

('Banks', 'NOUN')
('beware', 'NOUN')
(',', '.')
('Amazon', 'NOUN')
('and', 'CONJ')
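
Note that `words.collect()` brings every token to the driver, so the tagging above runs on a single machine. For larger corpora, one option is to tag inside each partition instead. The following is a sketch of that idea, not the method used in this notebook: `tag_partition` is a hypothetical helper, and the tokenizer and tagger are passed in as parameters so the example can run with simple stand-ins. Tagging line by line may give slightly different tags than tagging the full token list, since the tagger sees less context.

```python
def tag_partition(lines, tokenize, tag):
    """Yield (word, tag) pairs for every line in one partition.

    `tokenize` maps a string to a list of tokens; `tag` maps a list of
    tokens to a list of (token, tag) pairs.
    """
    for line in lines:
        tokens = tokenize(line)
        if tokens:  # skip blank lines
            yield from tag(tokens)

# Toy stand-in tagger so the sketch runs without Spark or NLTK data.
def toy_tag(tokens):
    return [(t, "NUM" if t.isdigit() else "X") for t in tokens]

pairs = list(tag_partition(["Sept 17", "", "code"], str.split, toy_tag))
print(pairs)

# With Spark and NLTK, the same helper could be used as (untested sketch):
# tagged_rdd = rdd_reuters.mapPartitions(
#     lambda part: tag_partition(
#         part, word_tokenize,
#         lambda toks: nltk.pos_tag(toks, tagset="universal"),
#     )
# )
```

For a file of this size, collecting to the driver is perfectly adequate; the per-partition variant only pays off when the text no longer fits comfortably on one machine.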

With the tagged list, we can use NLTK's "FreqDist()" class and its "most_common()" method to count the occurrences of each POS tag and list them in descending order.

In [31]:
tag_fd = nltk.FreqDist(tag for (word, tag) in tagged_words)
tag_fd.most_common()

[('NOUN', 422),
 ('VERB', 248),
 ('.', 209),
 ('ADP', 143),
 ('ADJ', 95),
 ('DET', 91),
 ('ADV', 61),
 ('PRT', 59),
 ('PRON', 56),
 ('CONJ', 45),
 ('NUM', 35)]
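
Raw counts are easier to compare as proportions. Since `FreqDist` is a subclass of `collections.Counter`, the same arithmetic works on either. The sketch below rebuilds the counts reported above in a plain `Counter` so that it runs without NLTK; the variable names are illustrative.

```python
from collections import Counter

# Counts copied from the most_common() output above.
tag_counts = Counter({
    "NOUN": 422, "VERB": 248, ".": 209, "ADP": 143, "ADJ": 95, "DET": 91,
    "ADV": 61, "PRT": 59, "PRON": 56, "CONJ": 45, "NUM": 35,
})

total = sum(tag_counts.values())

# Share of each tag, ordered from most to least frequent.
shares = {tag: count / total for tag, count in tag_counts.most_common()}

for tag, share in shares.items():
    print(f"{tag:<5} {share:6.1%}")
```

On the original object, `tag_fd.freq("NOUN")` gives the same kind of proportion directly.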

In [32]:
sc.stop()