# Demo 1: Preprocessing of Text Data Using Regular Expressions (RegEx)

### 1. Remove Numeric Values From Input Text
### 2. Remove Extra White Spaces
### 3. Replace Symbols and Characters 

## Regular Expressions:
A regular expression is a sequence of characters that defines a search pattern, used to find matching text in a string or file.
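As a small illustration of the idea, the sketch below searches a made-up sentence (not part of the input file) with two patterns from Python's built-in `re` module. The sentence and the variable names are assumptions chosen only for this example.

```python
import re

# Made-up sentence used only for illustration
text = "Four score and seven years ago, on November 19, 1863."

# \d+ matches one or more consecutive digits
numbers = re.findall(r"\d+", text)
print(numbers)          # ['19', '1863']

# re.search returns the first match object, or None if nothing matches
first = re.search(r"[A-Z][a-z]+", text)
print(first.group())    # Four
```

`re.findall` returns every non-overlapping match as a list, while `re.search` stops at the first one.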

## Importing Required Libraries and Load Input Text Data

### About Text File 
#### Gettysburg Address 

    The Gettysburg Address is a speech that U.S. President Abraham Lincoln delivered during the American Civil War at the dedication of the Soldiers' National Cemetery in Gettysburg, Pennsylvania, on the afternoon of November 19, 1863.

    It is one of the best-known speeches in American history.

In [1]:
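import re                                          # Import the regular expression library
with open(r"Gettysburg_Address.txt", "r") as f:    # Open the input file; it is closed automatically
    file1 = f.read()                               # Read the whole file into one string
print(file1)                                       # Print the contents of the input file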

Four score and seven years ago our fathers brought forth upon this continent, a new nation,

 conceived in Liberty, and dedicated to the proposition that all men are created equal.
Now we are engaged in a great civil war, testing whether that nation, or any nation so

 conceived and so dedicated, can long endure. We are met on a great battle-field of that 

war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this.

But, in a larger sense, we can not dedicateâ€” we can not consecrateâ€” we can not hallowâ€” this
 ground. The brave men, living and dead, who struggled here, have consecrated it, far above
 our poor power to add or detract. The world will little note, nor long remember what we say
 here, but it can never forget what they did here. 


66666666666666666666 7777777777777
444  222 2222  000

It is for us the living, rather, to be 
de

### As you can see, the input text file contains many unnecessary white spaces, numerical values, and wrong symbols such as â€” that differ from the original speech.
### Let's clean the data using RegEx.

# 1. Removal of Numerical Values
### The original speech of Abraham Lincoln contains no numerical values, but our input text file does.
### Let's remove those values.

In [2]:
mod_string = ''.join(filter(lambda item: not item.isdigit(), file1))         # Filter all digits from characters in string & join remaining character
print('\nAfter removal of Numeric values:\n\n',mod_string)                   # print text after removal of Numeric values


After removal of Numeric values:

 Four score and seven years ago our fathers brought forth upon this continent, a new nation,

 conceived in Liberty, and dedicated to the proposition that all men are created equal.
Now we are engaged in a great civil war, testing whether that nation, or any nation so

 conceived and so dedicated, can long endure. We are met on a great battle-field of that 

war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this.

But, in a larger sense, we can not dedicateâ€” we can not consecrateâ€” we can not hallowâ€” this
 ground. The brave men, living and dead, who struggled here, have consecrated it, far above
 our poor power to add or detract. The world will little note, nor long remember what we say
 here, but it can never forget what they did here. 


 
     

It is for us the living, rather, to be 
dedicated he
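The cell above removes digits with `filter` and `str.isdigit`. Since this demo is about regular expressions, here is a hedged sketch of the same step written with `re.sub`. The `sample` string is a hypothetical excerpt, not the real file contents.

```python
import re

# Hypothetical excerpt that mimics the stray digits in the input file
sample = "war. 66666 777\n444  222\nIt is for us the living"

# \d+ matches each run of digits; replace every run with an empty string
no_digits = re.sub(r"\d+", "", sample)

# The filter-based approach from the notebook, for comparison
no_digits_filter = "".join(filter(lambda ch: not ch.isdigit(), sample))

print(repr(no_digits))
print(no_digits == no_digits_filter)   # True for this ASCII sample
```

For plain ASCII digits the two approaches give the same result. The white space that surrounded the digits is left behind in both cases, which is why the next step collapses extra white space.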

## 2. Remove Extra White Spaces so there is only one space between words

In [3]:
item = mod_string                                                   # Text with numeric values already removed
text_new = re.sub(r'\s+',' ', item)                                 # Replace each run of white space with a single space
print('Input TEXT:\n\n',item)                                       # Print text before removing extra white spaces
print('\n \n  After REMOVING extra White Spaces:\n\n',text_new)     # Print text after removing extra white spaces

Input TEXT:

 Four score and seven years ago our fathers brought forth upon this continent, a new nation,

 conceived in Liberty, and dedicated to the proposition that all men are created equal.
Now we are engaged in a great civil war, testing whether that nation, or any nation so

 conceived and so dedicated, can long endure. We are met on a great battle-field of that 

war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this.

But, in a larger sense, we can not dedicateâ€” we can not consecrateâ€” we can not hallowâ€” this
 ground. The brave men, living and dead, who struggled here, have consecrated it, far above
 our poor power to add or detract. The world will little note, nor long remember what we say
 here, but it can never forget what they did here. 


 
     

It is for us the living, rather, to be 
dedicated here to the unfinished w

### Compare the result with the input text to see how compact the speech becomes once the extra white spaces are removed.
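The pattern `\s+` matches any run of spaces, tabs, or newlines. A minimal sketch on a hypothetical string shows what it does and how `strip()` can additionally drop the single space left at each end:

```python
import re

# Hypothetical input with mixed spaces, a tab, and blank lines
raw = "  conceived in Liberty,\n\n and   dedicated \t to the proposition  "

collapsed = re.sub(r"\s+", " ", raw)   # every run of white space becomes one space
cleaned = collapsed.strip()            # remove the leading and trailing space

print(repr(collapsed))
print(repr(cleaned))
```

Without regular expressions, `" ".join(raw.split())` gives the same cleaned string.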

# 3. a) Replacing Symbols and Characters
### In the original speech of Abraham Lincoln, the symbol — is used, but the erroneous symbol â€” appears in our input text file.
#### Let's replace the symbol â€” with the right symbol —

In [4]:
sample_string = text_new                                       # Taking text with removed whitespaces as input here                                                 
#print('Original text:',sample_string)                         # print original text
char_to_replace = {'â€”': '—'}                                 # Define characters to be replaced                                
for key, value in char_to_replace.items():                     # Iterate over all key-value pairs in dictionary
    sample_string = sample_string.replace(key, value)          # Replace key character with value character in string
print('After Replacing symbol:\n\n',sample_string)             # print text after replacing characters

After Replacing symbol:

 Four score and seven years ago our fathers brought forth upon this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal. Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. We are met on a great battle-field of that war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this. But, in a larger sense, we can not dedicate— we can not consecrate— we can not hallow— this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they wh

#### You can try to replace other symbols and characters.
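Garbled sequences of this kind often appear when text saved as UTF-8 is decoded as Windows-1252 (`cp1252`). Assuming that is what happened to the input file (the notebook does not confirm it), the sketch below reproduces the problem and reverses it. Characters are written as `\u` escapes so the example does not depend on how this page is encoded.

```python
# The correct dash is U+2014; the sample phrase is a hypothetical excerpt
original = "we can not dedicate\u2014 we can not consecrate"

# Simulate the damage: UTF-8 bytes read back as cp1252
garbled = original.encode("utf-8").decode("cp1252")
print(garbled)

# Reverse it: re-encode as cp1252, then decode as UTF-8
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired == original)   # True
```

If this assumption holds, opening the file with `open(path, encoding="utf-8")` would likely avoid the problem at the source. The dictionary-based `replace` in the cell above remains useful when only a few known sequences need fixing.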

# 3. b) Removing Symbols and Characters From Text
#### Let's assume another scenario where HR wants to calculate the total salary of a few employees for some analysis.
#### But there is a dollar symbol before each number, which makes it difficult to calculate the sum.
#### Let's help HR remove the dollar signs so that the employees' salaries can be summed.

In [5]:
strs = "$1000,$2000,$3000,$4000,$100,$200,$300,$400"                              # Input string
nstr = re.sub(r'[$,]',' ',strs)                                                   # The character class '[$,]' matches either '$' or ','; inside [...] a '|' would be a literal pipe, not "or"
print('Original TEXT:\n\n',strs)                                                  # To print text before removing symbols
print('\n \nAfter REMOVING Dollar symbol for easy sum of salaries:\n\n',nstr)     # To print text after removing symbols

Original TEXT:

 $1000,$2000,$3000,$4000,$100,$200,$300,$400

 
After REMOVING Dollar symbol for easy sum of salaries:

  1000  2000  3000  4000  100  200  300  400
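The cell above only strips the symbols. To finish the task described, one possible sketch extracts the numbers directly with `re.findall` and adds them up. The names `amounts` and `total` are introduced here for illustration.

```python
import re

strs = "$1000,$2000,$3000,$4000,$100,$200,$300,$400"   # Input string from the cell above

# \d+ pulls out each run of digits, skipping '$' and ','
amounts = [int(n) for n in re.findall(r"\d+", strs)]
total = sum(amounts)

print(amounts)   # [1000, 2000, 3000, 4000, 100, 200, 300, 400]
print(total)     # 11000
```

Splitting the cleaned string with `nstr.split()` and converting each piece with `int` would give the same list.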


# Demo 2: spaCy Installation

## Check if pip is installed

Open command prompt    
pip --version  
pip 21.0.1

## Update pip, setuptools, wheel

In [6]:
pip install -U pip setuptools wheel 

Note: you may need to restart the kernel to use updated packages.


## Installation of spaCy in Python

In [7]:
pip install spacy

Note: you may need to restart the kernel to use updated packages.


## Downloading Specific Model for spaCy

In [8]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.3.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.3.0/en_core_web_sm-3.3.0-py3-none-any.whl (12.8 MB)
     ---------------------------------------- 12.8/12.8 MB 5.1 MB/s eta 0:00:00
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.3.0
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


## Importing Library and Loading Model

In [9]:
import spacy                                  # Import spaCy library
nlp = spacy.load("en_core_web_sm")            # load spaCy model
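Before calling `spacy.load`, it can help to confirm that both the library and the model package are importable. The following is a minimal sketch using only the standard library. It assumes the model was installed as the Python package `en_core_web_sm`, as the download log above suggests.

```python
import importlib.util

# find_spec returns None when a top-level package cannot be found
spacy_installed = importlib.util.find_spec("spacy") is not None
model_installed = importlib.util.find_spec("en_core_web_sm") is not None

print("spaCy installed:", spacy_installed)
print("en_core_web_sm installed:", model_installed)
```

If either value is `False`, rerun the corresponding install step and restart the kernel.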

# Demo 3
### How to open and write text files using spaCy
### Pipeline components
### Step1: Create an NLP object
### Step2: Tokenization 
### Frequency word count
### Most common words

## Let's get started

# To open text files

In [16]:
file_name = 'Review1.txt'                                       # File name
introduction_file_text = open(file_name).read()                 # To open text file

#CONVERTING TO NLP OBJECT/TYPE TO APPLY spaCy on it!
introduction_file_doc = nlp(introduction_file_text)             # Create NLP object

print(introduction_file_doc)                                    # To print contents

Rama eats apple


In [17]:
file_name = 'Review2.txt'                                       # File name
introduction_file_text = open(file_name).read()                 # To open text file

#CONVERTING TO NLP OBJECT/TYPE TO APPLY spaCy on it!
introduction_file_doc = nlp(introduction_file_text)             # Create NLP object

print(introduction_file_doc)                                    # To print contents

Rama eats apple

He plays cricket.


## Read a file line by line

In [19]:
with open('Review2.txt') as myfile:                             # Open the file; it is closed automatically
    print(myfile.readlines())                                   # Read all lines into a list, one string per line

['Rama eats apple\n', '\n', 'He plays cricket.']
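`readlines()` loads every line into memory, including blank lines and trailing `\n` characters. A common alternative is to iterate over the file object directly. The sketch below writes a throwaway temporary file with the same contents as `Review2.txt` so that it runs on its own.

```python
import os
import tempfile

# Create a small temporary file so the example is self-contained
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("Rama eats apple\n\nHe plays cricket.")
    path = tmp.name

lines = []
with open(path, encoding="utf-8") as fh:
    for line in fh:                 # The file object yields one line at a time
        stripped = line.strip()     # Drop the trailing newline and surrounding spaces
        if stripped:                # Skip blank lines
            lines.append(stripped)

os.remove(path)                     # Clean up the temporary file
print(lines)                        # ['Rama eats apple', 'He plays cricket.']
```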


# To write text files

In [23]:
file = open('New.txt', 'w')                                     # Open a new text file for writing
file.write('Created new file. using above steps!')              # Write the text; returns the number of characters written
file.close()                                                    # Close the file so the contents are flushed to disk

### A new text file is created
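The same write can be done inside a `with` block, which closes the file even if an error occurs. This sketch writes into a temporary directory (an assumption made so the example does not touch your working folder) and reads the text back to confirm it.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "New.txt")   # Temporary location for the demo
message = "Created new file. using above steps!"

with open(path, "w", encoding="utf-8") as fh:
    written = fh.write(message)      # write() returns the number of characters written

with open(path, encoding="utf-8") as fh:
    read_back = fh.read()

print(written == len(message))       # True
print(read_back)
```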

# Pipeline Components
To list the active pipeline components:

In [25]:
nlp.pipe_names                                # To display active pipeline components:

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

To disable the pipeline components

In [26]:
nlp.disable_pipes('tagger', 'parser')          # to disable the pipeline components
nlp.pipe_names                                 # Again check active pipeline components

['tok2vec', 'attribute_ruler', 'lemmatizer', 'ner']

### Note: All the pipeline components will be discussed in upcoming sessions.

# Step1: Create an NLP object

In [30]:
doc = nlp("Mahira goes to school daily.")     # Text as NLP object
print('Text1:',doc)                           # To print the text
doc2 = nlp("In academic writing, readers expect each paragraph to have a sentence or two that captures its main point. They’re often called “topic sentences,” though many writing instructors prefer to call them key sentences.")   # Text as NLP object
print('\nText2:',doc2)                        # To print the text

Text1: Mahira goes to school daily.

Text2: In academic writing, readers expect each paragraph to have a sentence or two that captures its main point. They’re often called “topic sentences,” though many writing instructors prefer to call them key sentences.


# Step2: Tokenization
 
## For the single input sentence "Mahira goes to school daily."

In [32]:
print ([token.text for token in doc])        # Extract tokens for the given doc...
                                             #(fullstop is also a token)

['Mahira', 'goes', 'to', 'school', 'daily', '.']


In [33]:
print([token.text for token in doc2])

['In', 'academic', 'writing', ',', 'readers', 'expect', 'each', 'paragraph', 'to', 'have', 'a', 'sentence', 'or', 'two', 'that', 'captures', 'its', 'main', 'point', '.', 'They', '’re', 'often', 'called', '“', 'topic', 'sentences', ',', '”', 'though', 'many', 'writing', 'instructors', 'prefer', 'to', 'call', 'them', 'key', 'sentences', '.']


## For Raw Script

In [34]:
nlp = spacy.load("en_core_web_sm")                             # Load the spaCy model
with open('Ten_things_I_hate.txt') as f:                       # Open the input file; it is closed automatically
    contents = f.read()                                        # Read the input dataset
contents = contents[:500]                                      # Keep the first 500 characters of the input file
print(contents)                                                # Print that excerpt
doc = nlp(contents)                                            # Create NLP object
for token in doc:
    print(token)                                               # Print each token
len(token)                                                     # Length in characters of the last token ('Lov'), not the number of tokens

                               TEN THINGS I HATE ABOUT YOU
          
                written by Karen McCullah Lutz & Kirsten Smith
          
              based on 'Taming of the Shrew" by William Shakespeare
          
          Revision November 12, 1997
          
          
          PADUA HIGH SCHOOL - DAY
          
          Welcome to Padua High School,, your typical urban-suburban 
          high school in Portland, Oregon.  Smarties, Skids, Preppies, 
          Granolas. Loners, Lov
                               
TEN
THINGS
I
HATE
ABOUT
YOU

          
                
written
by
Karen
McCullah
Lutz
&
Kirsten
Smith

          
              
based
on
'
Taming
of
the
Shrew
"
by
William
Shakespeare

          
          
Revision
November
12
,
1997

          
          
          
PADUA
HIGH
SCHOOL
-
DAY

          
          
Welcome
to
Padua
High
School
,
,
your
typical
urban
-
suburban

          
high
school
in
Portland
,
Oregon
.
 
Smarties
,
Skids
,
Preppies
,

     

3

# Frequency of Word Count

In [35]:
from collections import Counter
counts = Counter()
for token in doc:
    counts[token.orth_] += 1                       # Equivalently, token.text
print(counts)                                      # To print the frequency count

Counter({',': 8, 'by': 2, '\n          \n          ': 2, '-': 2, '\n          ': 2, '.': 2, '                               ': 1, 'TEN': 1, 'THINGS': 1, 'I': 1, 'HATE': 1, 'ABOUT': 1, 'YOU': 1, '\n          \n                ': 1, 'written': 1, 'Karen': 1, 'McCullah': 1, 'Lutz': 1, '&': 1, 'Kirsten': 1, 'Smith': 1, '\n          \n              ': 1, 'based': 1, 'on': 1, "'": 1, 'Taming': 1, 'of': 1, 'the': 1, 'Shrew': 1, '"': 1, 'William': 1, 'Shakespeare': 1, 'Revision': 1, 'November': 1, '12': 1, '1997': 1, '\n          \n          \n          ': 1, 'PADUA': 1, 'HIGH': 1, 'SCHOOL': 1, 'DAY': 1, 'Welcome': 1, 'to': 1, 'Padua': 1, 'High': 1, 'School': 1, 'your': 1, 'typical': 1, 'urban': 1, 'suburban': 1, 'high': 1, 'school': 1, 'in': 1, 'Portland': 1, 'Oregon': 1, ' ': 1, 'Smarties': 1, 'Skids': 1, 'Preppies': 1, 'Granolas': 1, 'Loners': 1, 'Lov': 1})
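The counts above include white space and punctuation tokens. spaCy tokens expose `token.is_space` and `token.is_punct` for filtering these out. The sketch below shows the same idea on a plain list of token strings so that it runs without a model. The `tokens` list and the helper `is_punct` are hypothetical stand-ins for `token.text` values and spaCy's attribute.

```python
from collections import Counter
import string

# Hypothetical token texts, similar to what the tokenizer produced above
tokens = ["TEN", "THINGS", "\n          ", "written", "by", ",", "by", "-", " ", "."]

def is_punct(text):
    """Return True when every character is ASCII punctuation (rough stand-in for token.is_punct)."""
    return bool(text) and all(ch in string.punctuation for ch in text)

counts = Counter(t for t in tokens if not t.isspace() and not is_punct(t))
print(counts.most_common(1))   # [('by', 2)]
```

With a real `doc`, the equivalent filter would be `Counter(t.text for t in doc if not t.is_space and not t.is_punct)`.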


# Most Common Words in Document

In [36]:
text = str(sample_string)                             # Lincoln speech as Input string
doc = nlp(text)                                       # Create NLP object
from collections import Counter
counts = Counter()                                    # Frequency count
for token in doc:
    counts[token.orth_] += 1                          # Equivalently, token.text
m = counts.most_common(15)                            # Most common words in document
m                                                     # To print most common words

[(',', 22),
 ('that', 13),
 ('.', 10),
 ('the', 9),
 ('to', 8),
 ('we', 8),
 ('here', 8),
 ('—', 8),
 ('a', 7),
 ('and', 6),
 ('nation', 5),
 ('can', 5),
 ('of', 5),
 ('have', 5),
 ('for', 5)]

### Here, we can see that the most common words are 'that', 'the', 'we', 'here' etc.
### These should be removed for better text analysis.
### In the next sprint, we will remove such stop words using spaCy.
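As a preview of the idea, here is a minimal sketch of stop word removal in plain Python. The `stop_words` set is a small hand-picked assumption, not spaCy's built-in list, and the sentence is a short excerpt from the speech. With spaCy, the check would typically use `token.is_stop` instead.

```python
from collections import Counter
import re

text = "that nation, or any nation so conceived and so dedicated, can long endure"

# Hand-picked stop words for this example only (not spaCy's list)
stop_words = {"that", "or", "any", "so", "and", "can"}

words = re.findall(r"[a-z]+", text.lower())               # Simple regex word split
counts = Counter(w for w in words if w not in stop_words)

print(counts.most_common(1))   # [('nation', 2)]
```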