# Demo1 
## Cleaning Text Data: Removing Stopwords  
## Working with Unstructured Text Data
### Loading the Required Libraries and Model  
### Text file:

                2001: A SPACE ODYSSEY

					    Screenplay

						   by
 
			   Stanley Kubrick and Arthur C. Clark

					    Hawk Films Ltd.,
					    c/o. M-G-M Studios,
					    Boreham Wood,
					    Herts.

In [1]:
import spacy                                                  # Import the spaCy library
from spacy.lang.en import English                             # Import the English language class
nlp = spacy.load("en_core_web_sm")                            # Load the small English pipeline
with open('scifiscripts.txt') as f:                           # Open the input file; it is closed automatically
    contents = f.read()                                       # Read the input text
print(contents)                                               # Print all contents of the file
text = str(contents)                                          # Already a string; kept under the name used by later cells

Note from poster to Kubrick newsgroup:

I found this on a bbs a while ago and I thought I'd pass it along to all 
of you Kubrick freaks out there.

02/23/89
Transcriber's note:

For all you Clarke/Kubrick/2001 fans,

I found the original paper copy of this screenplay a while back and felt 
compelled to transcribe it to disk and upload it to various bulletin 
boards for the enjoyment of all.

The final movie deviates from this screenplay in a number of interesting 
ways. I've tried to maintain the format of the original document except 
the number of lines per page of the original. In order to reduce the 
length of this file I've used a bar of "------" to delimit the pages as 
there was a lot of whitespace per original screenplay page.


------------------------------------------------------------------------
				    
				    2001: A SPACE ODYSSEY

					    Screenplay

						   by
 
			   Stanley Kubrick and Arthur C. Clark

					    Hawk Films Ltd.,
					    c/o. M-G-M Studios,
					  

## Removing Stopwords

In [3]:
from spacy.lang.en.stop_words import STOP_WORDS                # Import spaCy's English stop-word set
spacy_stopwords = STOP_WORDS                                   # Default English stop words (a Python set)
print('Number of stop words: %d' % len(spacy_stopwords))       # Print the total number of stop words
print('First ten stop words: %s' % list(spacy_stopwords)[:10]) # Print ten stop words (set order is arbitrary)
filtered_sent = []                                             # Initialize an empty list for the kept tokens
doc = nlp(text)                                                # Calling nlp() creates a Doc with linguistic annotations
for word in doc:                                               # Loop over every token
    if not word.is_stop:                                       # Keep only tokens that are not stop words
        filtered_sent.append(word)                             # Append the token to the output list
print("\n \nFiltered Sentence:", filtered_sent)                # Tokens remaining after stop-word removal

Number of stop words: 326
First ten stop words: ['always', 'to', 'please', 'around', 'be', 'various', 'there', '’s', 'cannot', 'would']

 
Filtered Sentence: [Note, poster, Kubrick, newsgroup, :, 

, found, bbs, ago, thought, pass, 
, Kubrick, freaks, ., 

, 02/23/89, 
, Transcriber, note, :, 

, Clarke, /, Kubrick/2001, fans, ,, 

, found, original, paper, copy, screenplay, felt, 
, compelled, transcribe, disk, upload, bulletin, 
, boards, enjoyment, ., 

, final, movie, deviates, screenplay, number, interesting, 
, ways, ., tried, maintain, format, original, document, 
, number, lines, page, original, ., order, reduce, 
, length, file, bar, ", ------, ", delimit, pages, 
, lot, whitespace, original, screenplay, page, ., 


, ------------------------------------------------------------------------, 
				    
				    , 2001, :, SPACE, ODYSSEY, 

					    , Screenplay, 

						   , 
 
			   , Stanley, Kubrick, Arthur, C., Clark, 

					    , Hawk, Films, Ltd., ,, 
					    , c, /, o., M

### Number of stop words: 326
### The first ten stop words vary between runs because the stop words are stored in an unordered set, e.g. ['an', 'sixty', 'twelve', 'must', 'whereafter', 'else', 'most', 'next', 'as', 'formerly']
### Try printing more stop words
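
The filtered output above still contains punctuation marks and runs of newlines, because `is_stop` only tests for stop words. A minimal sketch of a stricter filter follows. It assumes tokens that expose the boolean attributes `is_stop`, `is_punct` and `is_space`, as spaCy tokens do; `FakeToken` and `keep_content_tokens` are hypothetical names introduced here so the sketch runs without a model.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy token, so the sketch runs without a model.
# Real spaCy tokens expose the same three boolean attributes.
FakeToken = namedtuple("FakeToken", ["text", "is_stop", "is_punct", "is_space"])


def keep_content_tokens(tokens):
    """Return the tokens that are not stop words, punctuation, or whitespace."""
    return [
        tok for tok in tokens
        if not (tok.is_stop or tok.is_punct or tok.is_space)
    ]


sample = [
    FakeToken("Note", False, False, False),
    FakeToken("from", True, False, False),     # stop word
    FakeToken("poster", False, False, False),
    FakeToken(":", False, True, False),        # punctuation
    FakeToken("\n\n", False, False, True),     # whitespace
    FakeToken("found", False, False, False),
]

content = keep_content_tokens(sample)
print([tok.text for tok in content])
```

Under these assumptions, calling `keep_content_tokens(doc)` on the `doc` created above should work the same way, since the function only reads the three attributes.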


# Add/Remove custom stop words 

## Adding your own stop words

In [4]:
nlp.Defaults.stop_words |= {"my_new_stopword1","my_new_stopword2",}         # To add several stopwords at once
print(nlp.Defaults.stop_words)                                              # To print the updated set of stopwords 

{'always', 'to', 'please', 'around', 'be', 'various', 'there', '’s', 'cannot', 'would', 'while', 'below', 'name', 'down', 'almost', 'thereupon', 'will', 'namely', 'five', 'keep', 'into', 'has', 'who', 'but', 'by', '‘re', 'both', 'an', "'ll", 'only', 'except', 'least', '’re', 'very', 'formerly', 'besides', 'hereby', 'after', 'about', 'few', 'through', 'since', 'they', 'me', 'each', 'might', 'rather', 'sometime', 'doing', 'often', 'these', 'beside', 'him', 'whither', 'himself', '‘m', 'fifty', 'onto', 'together', "'s", 'upon', 'nowhere', 'were', 'using', 'therein', 'along', 'seemed', 'between', 'three', 'as', 'wherever', 'hereafter', 'go', 'when', 'above', 'may', 'per', 'i', 'moreover', 'seem', 'whenever', '’d', 'bottom', 'take', "n't", 'whether', 'do', 'did', 'six', 'them', 'nobody', 'on', 'does', '’ve', 'yourselves', 'so', 'give', 'such', 'or', 'being', 'done', 'thru', 'been', 'every', 'without', 'thereafter', 'can', 'somewhere', 'seems', 'side', 'nevertheless', 'get', 'amongst', 'herse

## Removing stop words from the default list

In [5]:
nlp.Defaults.stop_words -= {"whatever", "whenever"}          # To remove several stopwords at once
print(nlp.Defaults.stop_words)                               # To print the updated set of stopwords                                               

{'always', 'to', 'please', 'around', 'be', 'various', 'there', '’s', 'cannot', 'would', 'while', 'below', 'name', 'down', 'almost', 'thereupon', 'will', 'namely', 'five', 'keep', 'into', 'has', 'who', 'but', 'by', '‘re', 'both', 'an', "'ll", 'only', 'except', 'least', '’re', 'very', 'formerly', 'besides', 'hereby', 'after', 'about', 'few', 'through', 'since', 'they', 'me', 'each', 'might', 'rather', 'sometime', 'doing', 'often', 'these', 'beside', 'him', 'whither', 'himself', '‘m', 'fifty', 'onto', 'together', "'s", 'upon', 'nowhere', 'were', 'using', 'therein', 'along', 'seemed', 'between', 'three', 'as', 'wherever', 'hereafter', 'go', 'when', 'above', 'may', 'per', 'i', 'moreover', 'seem', '’d', 'bottom', 'take', "n't", 'whether', 'do', 'did', 'six', 'them', 'nobody', 'on', 'does', '’ve', 'yourselves', 'so', 'give', 'such', 'or', 'being', 'done', 'thru', 'been', 'every', 'without', 'thereafter', 'can', 'somewhere', 'seems', 'side', 'nevertheless', 'get', 'amongst', 'herself', 'are', 
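
The two cells above rely on Python's in-place set operators. The sketch below shows the same operators on a small made-up set that stands in for `nlp.Defaults.stop_words`; the four starting words are chosen only for illustration.

```python
# A small made-up stop-word set standing in for nlp.Defaults.stop_words.
stop_words = {"the", "a", "whatever", "whenever"}

stop_words |= {"my_new_stopword1", "my_new_stopword2"}  # In-place union: add several words
stop_words -= {"whatever", "whenever"}                  # In-place difference: remove several words
stop_words -= {"not_in_the_set"}                        # Removing an absent word is a no-op

# sorted() gives a stable, readable order; a raw set prints in arbitrary order.
print(sorted(stop_words))
```

Printing `sorted(...)` instead of the raw set may also make the long outputs above easier to scan.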

# Demo2 
## Text Normalization: Stemming and Lemmatization

### 1. Lemmatization
#### a) Few words as input

In [7]:
doc = nlp("friendship studied was am is organizing matches asked")          # Input
for word in doc:                                                      # For all words
    print(word.text,'-> ',word.lemma_)               # Print result after lemmatization

friendship ->  friendship
studied ->  study
was ->  be
am ->  be
is ->  be
organizing ->  organize
matches ->  match
asked ->  ask


#### b) Full text file as input

In [8]:
for word in filtered_sent:                      # Loop over the tokens kept after stop-word removal
    print(word.text,word.lemma_)                # Print each token next to its lemma

Note note
poster poster
Kubrick Kubrick
newsgroup newsgroup
: :


 


found find
bbs bbs
ago ago
thought think
pass pass

 

Kubrick Kubrick
freaks freak
. .


 


02/23/89 02/23/89

 

Transcriber Transcriber
note note
: :


 


Clarke Clarke
/ /
Kubrick/2001 Kubrick/2001
fans fan
, ,


 


found find
original original
paper paper
copy copy
screenplay screenplay
felt felt

 

compelled compel
transcribe transcribe
disk disk
upload upload
bulletin bulletin

 

boards board
enjoyment enjoyment
. .


 


final final
movie movie
deviates deviate
screenplay screenplay
number number
interesting interesting

 

ways way
. .
tried try
maintain maintain
format format
original original
document document

 

number number
lines line
page page
original original
. .
order order
reduce reduce

 

length length
file file
bar bar
" "
------ ------
" "
delimit delimit
pages page

 

lot lot
whitespace whitespace
original original
screenplay screenplay
page page
. .



 



----------------------------
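
One practical use of the lemmas above is counting vocabulary, since inflected forms such as "pages" and "page" collapse into one entry. The following is a rough sketch of that idea. It assumes tokens with `lemma_`, `is_punct` and `is_space` attributes, as spaCy tokens have; `Tok` and `lemma_counts` are hypothetical names, and the sample pairs are taken from the output above.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for a spaCy token, so the sketch runs without a model.
Tok = namedtuple("Tok", ["text", "lemma_", "is_punct", "is_space"])


def lemma_counts(tokens):
    """Count lower-cased lemmas, skipping punctuation and whitespace tokens."""
    return Counter(
        tok.lemma_.lower()
        for tok in tokens
        if not (tok.is_punct or tok.is_space)
    )


sample = [
    Tok("found", "find", False, False),
    Tok("pages", "page", False, False),
    Tok(".", ".", True, False),          # punctuation, skipped
    Tok("\n", "\n", False, True),        # whitespace, skipped
    Tok("page", "page", False, False),
    Tok("found", "find", False, False),
]

counts = lemma_counts(sample)
print(counts.most_common())
```

Skipping whitespace tokens also avoids the blank lines that appear in the printed output above.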

>> ##### spaCy (lemmatization) is the approach preferred and recommended by NIIT for text processing!

### 2. Stemming
#### spaCy does not provide a stemmer; it relies on lemmatization only. This section therefore uses NLTK for stemming.

In [11]:
!pip install nltk

Collecting nltk
  Downloading nltk-3.7-py3-none-any.whl (1.5 MB)
     ---------------------------------------- 1.5/1.5 MB 7.9 MB/s eta 0:00:00
Collecting regex>=2021.8.3
  Downloading regex-2022.4.24-cp310-cp310-win_amd64.whl (262 kB)
     -------------------------------------- 262.0/262.0 KB 5.4 MB/s eta 0:00:00
Installing collected packages: regex, nltk
Successfully installed nltk-3.7 regex-2022.4.24


In [13]:
import nltk                                                                                 # Import the NLTK library
from nltk.stem.snowball import SnowballStemmer                                              # Import the Snowball stemmer
stemmer = SnowballStemmer(language='english')                                               # Create a stemmer for English
tokens = ['friend', 'friendship', 'studied', 'was', 'am', 'is', 'helped', 'troubling']      # Input words
for token in tokens:                                                                        # Loop over every word
    print(token + ' --> ' + stemmer.stem(token))                                            # Print each word and its stem

friend --> friend
friendship --> friendship
studied --> studi
was --> was
am --> am
is --> is
helped --> help
troubling --> troubl


## Comparison: Stemming and Lemmatization

In [14]:
stemmer = SnowballStemmer(language='english')                                                # Create the English Snowball stemmer
tokens = ['cries', 'this', 'lied', 'compute', 'computer', 'computed', 'computing', 'organizing', 'matches']   # Input words
for token in tokens:                                                                         # Loop over every word
    print('After stemming:',token + ' --> ' + stemmer.stem(token))                           # Print result after stemming
doc = nlp("cries this lied compute computer computed computing organizing matches")          # The same words as a spaCy Doc, for lemmatization
print('\n')                                                                                  # Print blank lines to separate the two outputs
for word in doc:                                                                             # Loop over every token
    print('After lemmatization:',word.text,word.lemma_)                                      # Print result after lemmatization

After stemming: cries --> cri
After stemming: this --> this
After stemming: lied --> lie
After stemming: compute --> comput
After stemming: computer --> comput
After stemming: computed --> comput
After stemming: computing --> comput
After stemming: organizing --> organ
After stemming: matches --> match


After lemmatization: cries cry
After lemmatization: this this
After lemmatization: lied lie
After lemmatization: compute compute
After lemmatization: computer computer
After lemmatization: computed compute
After lemmatization: computing compute
After lemmatization: organizing organize
After lemmatization: matches match


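### Lemmatization returns dictionary words (cry, compute, organize), whereas stemming produces truncated stems such as cri, comput and organ. For converting words into readable root forms, lemmatization gives the better results here.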
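
To make the difference concrete, here is a deliberately simplified sketch of the two ideas. It is not the Snowball algorithm and not spaCy's lemmatizer: `toy_stem` strips a few suffixes by rule, and `toy_lemma` looks words up in a tiny hand-written table. Both functions, the suffix list and the table are assumptions made for illustration.

```python
# Suffixes tried in order; a toy rule set, far smaller than a real stemmer's.
SUFFIXES = ("ing", "ed", "es", "er", "s", "e")

# Tiny hand-written lemma table; a real lemmatizer uses far larger resources.
LEMMAS = {
    "cries": "cry",
    "was": "be",
    "computed": "compute",
    "computing": "compute",
}


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemma(word):
    """Return the dictionary form if the word is in the table, else the word."""
    return LEMMAS.get(word, word)


for word in ["cries", "computing", "computer", "was"]:
    print(word, "stem:", toy_stem(word), "lemma:", toy_lemma(word))
```

The rule-based function needs no vocabulary but can return non-words and merge unrelated words, while the lookup returns real words but only for entries it knows about.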

# Demo3
## Part-of-Speech (POS) Tagging

### 1. For a simple sentence

In [16]:
doc = nlp("All is well that ends well.")                # Input sentence
for word in doc:                                        # For all input words
    print(word.text,'\t',word.pos_,'\t',word.tag_)      # Apply POS and print output

All 	 PRON 	 DT
is 	 AUX 	 VBZ
well 	 ADV 	 RB
that 	 PRON 	 DT
ends 	 VERB 	 VBZ
well 	 ADV 	 RB
. 	 PUNCT 	 .


#### Column 1: the original token; column 2: the coarse-grained POS tag (`pos_`); column 3: the fine-grained POS tag (`tag_`). More details about the POS tags are provided in the concept session.
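
Once tokens carry a POS tag, they can be grouped by it. The sketch below is one possible way to do that, assuming tokens with `text` and `pos_` attributes as in the loop above. `TaggedToken` and `group_by_pos` are hypothetical names, and the sample tags are copied from the output above so the sketch runs without a model.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in for a spaCy token; real tokens also expose .text and .pos_.
TaggedToken = namedtuple("TaggedToken", ["text", "pos_"])


def group_by_pos(tokens):
    """Map each coarse-grained POS tag to the token texts carrying it."""
    groups = defaultdict(list)
    for tok in tokens:
        groups[tok.pos_].append(tok.text)
    return dict(groups)


# Tags copied from the output for "All is well that ends well."
tagged = [
    TaggedToken("All", "PRON"),
    TaggedToken("is", "AUX"),
    TaggedToken("well", "ADV"),
    TaggedToken("that", "PRON"),
    TaggedToken("ends", "VERB"),
    TaggedToken("well", "ADV"),
    TaggedToken(".", "PUNCT"),
]

groups = group_by_pos(tagged)
for pos, words in groups.items():
    print(pos, words)
```

For an unfamiliar tag, spaCy's `spacy.explain()` helper can be used to look up a short description.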

### 2. For a second example sentence

In [17]:
doc = nlp("Apple is looking at buying U.K. startup for 1 billion dollars.")       # Input sentence
for word in doc:                                                                  # For all input words
    print(word.text,'\t',word.pos_,'\t',word.tag_)                                # Apply POS and print output

Apple 	 PROPN 	 NNP
is 	 AUX 	 VBZ
looking 	 VERB 	 VBG
at 	 ADP 	 IN
buying 	 VERB 	 VBG
U.K. 	 PROPN 	 NNP
startup 	 NOUN 	 NN
for 	 ADP 	 IN
1 	 NUM 	 CD
billion 	 NUM 	 CD
dollars 	 NOUN 	 NNS
. 	 PUNCT 	 .


#### For "Apple": the POS tag is PROPN (proper noun) and the fine-grained tag is NNP (proper noun, singular).

### 3. For the whole text document

In [18]:
with open('scifiscripts.txt') as f:                                             # Open the input file; it is closed automatically
    contents = f.read()                                                         # Read input data
text = str(contents)                                                            # Already a string; kept for clarity
doc = nlp(text)                                                                 # Create a spaCy Doc from the text
for word in doc:                                                                # Loop over every token
    print(word.text,'\t',word.pos_,'\t',word.tag_)                              # Print the token, its POS tag and its fine-grained tag

Note 	 VERB 	 VB
from 	 ADP 	 IN
poster 	 NOUN 	 NN
to 	 ADP 	 IN
Kubrick 	 PROPN 	 NNP
newsgroup 	 PROPN 	 NNP
: 	 PUNCT 	 :


 	 SPACE 	 _SP
I 	 PRON 	 PRP
found 	 VERB 	 VBD
this 	 PRON 	 DT
on 	 ADP 	 IN
a 	 DET 	 DT
bbs 	 NOUN 	 NN
a 	 DET 	 DT
while 	 NOUN 	 NN
ago 	 ADV 	 RB
and 	 CCONJ 	 CC
I 	 PRON 	 PRP
thought 	 VERB 	 VBD
I 	 PRON 	 PRP
'd 	 AUX 	 MD
pass 	 VERB 	 VB
it 	 PRON 	 PRP
along 	 ADP 	 RP
to 	 ADP 	 IN
all 	 DET 	 DT

 	 SPACE 	 _SP
of 	 ADP 	 IN
you 	 PRON 	 PRP
Kubrick 	 PROPN 	 NNP
freaks 	 VERB 	 VBZ
out 	 ADP 	 RP
there 	 ADV 	 RB
. 	 PUNCT 	 .


 	 SPACE 	 _SP
02/23/89 	 NUM 	 CD

 	 SPACE 	 _SP
Transcriber 	 PROPN 	 NNP
's 	 PART 	 POS
note 	 NOUN 	 NN
: 	 PUNCT 	 :


 	 SPACE 	 _SP
For 	 ADP 	 IN
all 	 DET 	 DT
you 	 PRON 	 PRP
Clarke 	 PROPN 	 NNP
/ 	 SYM 	 SYM
Kubrick/2001 	 PROPN 	 NNP
fans 	 NOUN 	 NNS
, 	 PUNCT 	 ,


 	 SPACE 	 _SP
I 	 PRON 	 PRP
found 	 VERB 	 VBD
the 	 DET 	 DT
original 	 ADJ 	 JJ
paper 	 NOUN 	 NN
copy 	 NOUN 	 NN
of 	 ADP 	 IN
th

. 	 PUNCT 	 .


 	 SPACE 	 _SP
10/4/65 	 NUM 	 CD
										   	 SPACE 	 _SP
b51 	 PROPN 	 NNP

 	 SPACE 	 _SP
------------------------------------------------------------------------ 	 SYM 	 SYM

 	 SPACE 	 _SP
B36 	 PROPN 	 NNP

 	 SPACE 	 _SP
CHILDREN 	 PROPN 	 NNP
IN 	 ADP 	 IN
SCHOOL 	 PROPN 	 NNP
. 	 PUNCT 	 .

 	 SPACE 	 _SP
TEACHER 	 NOUN 	 NN
SHOWING 	 VERB 	 VBG
THEM 	 PRON 	 PRP

 	 SPACE 	 _SP
VIEWS 	 NOUN 	 NNS
OF 	 ADP 	 IN
EARTH 	 PROPN 	 NNP
AND 	 CCONJ 	 CC
MAP 	 PROPN 	 NNP

 	 SPACE 	 _SP
OF 	 ADP 	 IN
EARTH 	 PROPN 	 NNP
. 	 PUNCT 	 .


					     	 SPACE 	 _SP
NARRATOR 	 PROPN 	 NNP

					     	 SPACE 	 _SP
The 	 DET 	 DT
personnel 	 NOUN 	 NNS
of 	 ADP 	 IN
the 	 DET 	 DT
Base 	 PROPN 	 NNP
and 	 CCONJ 	 CC
their 	 PRON 	 PRP$

					     	 SPACE 	 _SP
children 	 NOUN 	 NNS
were 	 AUX 	 VBD
the 	 DET 	 DT
forerunners 	 NOUN 	 NNS
of 	 ADP 	 IN
new 	 ADJ 	 JJ

					     	 SPACE 	 _SP
nations 	 NOUN 	 NNS
, 	 PUNCT 	 ,
new 	 ADJ 	 JJ
cultures 	 NOUN 	 NNS
that 	 PRON 

the 	 DET 	 DT
control 	 NOUN 	 NN
, 	 PUNCT 	 ,
please 	 INTJ 	 UH
. 	 PUNCT 	 .


					     	 SPACE 	 _SP
HAL 	 PROPN 	 NNP

					     	 SPACE 	 _SP
Look 	 VERB 	 VBP
, 	 PUNCT 	 ,
Dave 	 PROPN 	 NNP
your've 	 PROPN 	 NNP
probably 	 ADV 	 RB
got 	 VERB 	 VBD

					     	 SPACE 	 _SP
a 	 DET 	 DT
lot 	 NOUN 	 NN
to 	 PART 	 TO
do 	 VERB 	 VB
. 	 PUNCT 	 .
I 	 PRON 	 PRP
suggest 	 VERB 	 VBP
you 	 PRON 	 PRP
leave 	 VERB 	 VBP

					     	 SPACE 	 _SP
it 	 PRON 	 PRP
to 	 ADP 	 IN
me 	 PRON 	 PRP
. 	 PUNCT 	 .


					     	 SPACE 	 _SP
BOWMAN 	 PROPN 	 NNP

					     	 SPACE 	 _SP
Hal 	 PROPN 	 NNP
, 	 PUNCT 	 ,
switch 	 VERB 	 VBP
to 	 ADP 	 IN
manual 	 ADJ 	 JJ
hibernation 	 NOUN 	 NN

					     	 SPACE 	 _SP
control 	 NOUN 	 NN
. 	 PUNCT 	 .


					     	 SPACE 	 _SP
HAL 	 PROPN 	 NNP

					     	 SPACE 	 _SP
I 	 PRON 	 PRP
do 	 AUX 	 VBP
n't 	 PART 	 RB
like 	 VERB 	 VB
to 	 PART 	 TO
assert 	 VERB 	 VB
myself 	 PRON 	 PRP
, 	 PUNCT 	 ,
Dave 	 PROPN 	 NNP
, 	 PUNCT 	 ,

					     	

* For more information about part-of-speech tags, see:
https://universaldependencies.org/u/pos/