### Learning NLTK

#### Import Packages

In [10]:
import pandas as pd
import nltk

In [2]:
# download all the NLTK data packages - this takes a minute or so and opens a popup window
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

#### Tokenising

Tokenising just means splitting up some body of text (e.g. by word or by sentence).

We use the package's tokenisers because they are more reliable and more efficient than writing long, hand-rolled regular expressions ourselves. One advantage of NLTK is that it saves a lot of time on regex and pre-processing.
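
To make the comparison concrete, here is a minimal sketch of the hand-rolled alternative using only the standard library. The helper names `naive_word_tokenize` and `naive_sent_tokenize` are hypothetical and not part of NLTK, and the sample sentence is made up for illustration.

```python
import re

def naive_word_tokenize(text):
    # Hypothetical helper: runs of word characters, or single punctuation marks
    return re.findall(r"\w+|[^\w\s]", text)

def naive_sent_tokenize(text):
    # Hypothetical helper: split on whitespace that follows '.', '!' or '?'
    return re.split(r"(?<=[.!?])\s+", text)

sample = "Mr. Smith didn't go. He stayed."

tokens = naive_word_tokenize(sample)
sentences = naive_sent_tokenize(sample)

print(tokens)     # the contraction "didn't" is broken into three pieces
print(sentences)  # the abbreviation "Mr." is wrongly treated as a sentence end
```

Even on one short sentence the simple patterns mishandle a contraction and an abbreviation. Patching each case by hand is the kind of work a ready-made tokeniser is meant to save.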

In [None]:
from nltk.tokenize import sent_tokenize, word_tokenize

In [11]:
# set up body of text
text = 'The battle over abortion rights in the US shifted rapidly to Congress and the midterm elections after the Supreme Court overturned Roe vs Wade. Conservative states began to implement new abortion restrictions across the country in the wake of Friday’s ruling. Democrats on Capitol Hill and running for national office called for abortion rights to be protected through legislation, and sought to depict Republicans as dangerously out of step with average Americans heading into the November vote.'

# split into words
print(word_tokenize(text))

# split by sentence
print(sent_tokenize(text))

['The', 'battle', 'over', 'abortion', 'rights', 'in', 'the', 'US', 'shifted', 'rapidly', 'to', 'Congress', 'and', 'the', 'midterm', 'elections', 'after', 'the', 'Supreme', 'Court', 'overturned', 'Roe', 'vs', 'Wade', '.', 'Conservative', 'states', 'began', 'to', 'implement', 'new', 'abortion', 'restrictions', 'across', 'the', 'country', 'in', 'the', 'wake', 'of', 'Friday', '’', 's', 'ruling', '.', 'Democrats', 'on', 'Capitol', 'Hill', 'and', 'running', 'for', 'national', 'office', 'called', 'for', 'abortion', 'rights', 'to', 'be', 'protected', 'through', 'legislation', ',', 'and', 'sought', 'to', 'depict', 'Republicans', 'as', 'dangerously', 'out', 'of', 'step', 'with', 'average', 'Americans', 'heading', 'into', 'the', 'November', 'vote', '.']
['The battle over abortion rights in the US shifted rapidly to Congress and the midterm elections after the Supreme Court overturned Roe vs Wade.', 'Conservative states began to implement new abortion restrictions across the country in the wake of

#### Stop Words

These are commonly used but less meaningful words in the English language that we might want to exclude from analysis (e.g. 'the', 'should'). In the jargon we can distinguish 'context words' from 'content words' - context words are built around other things, but don't give much information themselves. (Nonetheless, context words can be useful when analysing writing styles.)

In [22]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# use the package's English stop words - the list is all lower case, so only lower-case tokens match it directly - store it as a set for fast lookups
stop_words = set(stopwords.words('english'))

# get text tokenized by words
text_word = word_tokenize(text)

# filter out stop words - casefold() lower-cases each token before the lookup, so capitalised stop words such as 'The' are filtered out too
filtered_text_word = [i for i in text_word if i.casefold() not in stop_words]
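
As a small illustration of why the `casefold()` call matters, the sketch below filters a few tokens with and without it. The `tiny_stop_words` set is a hypothetical stand-in for `stopwords.words('english')`, so that the example runs without any downloaded data.

```python
# Hypothetical miniature stop-word set, standing in for stopwords.words('english')
tiny_stop_words = {"the", "in", "to", "and", "of"}

tokens = ["The", "battle", "over", "abortion", "rights", "in", "the", "US"]

# Case-sensitive lookup: 'The' is kept because only 'the' is in the set
case_sensitive = [t for t in tokens if t not in tiny_stop_words]

# Case-folded lookup: 'The' is compared as 'the' and removed
case_folded = [t for t in tokens if t.casefold() not in tiny_stop_words]

print(case_sensitive)
print(case_folded)
```

Note that the tokens that survive keep their original capitalisation, because `casefold()` is applied only for the comparison.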

#### Stemming

We have 'playing' and 'play' - we want to treat these as sharing the same meaning/stem ('play'). We can use the package's stemming algorithms to reduce words to their stems.

In [29]:
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize

# a modern stemming algorithm to use
stemmer = SnowballStemmer('english')

to_stem = 'I played my play to the players in the audience. The musician was playing music.'
to_stem = word_tokenize(to_stem)

# apply the stemmer's stem method to each token - notice 'played' and 'playing' collapse to 'play' - though the noun 'play' is treated as the same stem too
stemmed = [stemmer.stem(word) for word in to_stem]
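
The cell above does not display its result, so here is a minimal sketch that prints each word next to its stem. It uses a hand-picked word list (an assumption for illustration) rather than the tokenised sentence, so it needs no tokeniser data.

```python
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer('english')

# Hand-picked forms of 'play', chosen for illustration
words = ['played', 'play', 'players', 'playing']

# Map each word to its stem so the pairs are easy to inspect
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

The verb forms all reduce to 'play', while 'players' keeps its '-er' and becomes 'player'. A stemmer only strips suffixes by rule, so it does not tell the noun 'play' apart from the verb.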