# Text Cleaning

---

This Colab shows you how to clean text data in preparation for NLP.
In the first section, we use built-in Python functions; in the second, we introduce the NLTK library.

---

### Split by white space

Split a document or text into words.
Calling `split()` with no arguments splits the text on runs of whitespace only, so punctuation stays attached to the neighbouring word. "Who's", for example, is kept as a single word.

In [None]:
text = 'Albert Einstein is widely celebrated as one of the most brilliant scientists who’s ever lived. His fame was due to his original and creative theories that at first seemed crazy, but that later turned out to represent the actual physical world.  Nonetheless, when he applied his theory of general relativity to the universe as a whole in a paper published in 1917, while serving as Director of the Kaiser Wilhelm Institute for Physics and professor at the University of Berlin, Einstein suggested the notion of a "cosmological constant". He discarded this notion when it had been established that the universe was indeed expanding. His contributions to physics made it possible to envision how the universe evolved.'\
      ' In order to understand Einstein’s contribution to cosmology it is helpful to begin with his theory of gravity. Rather than thinking of gravity as an attractive force between two objects, in the tradition of Isaac Newton, Einstein’s conception was that gravity is a property of massive objects that “bends” space and time around itself. For example, consider the question of why the Moon does not fly off into space, rather than staying in orbit around Earth. Newton would say that gravity is a force acting between the Earth and Moon, holding it in orbit. Einstein would say that the massive Earth “bends” space and time around itself, so that the moon follows the curves created by the massive Earth. His theory was confirmed when he predicted that even starlight would bend when passing near the sun during a solar eclipse.'\
      ' In 1917 Einstein published a paper in which he applied this theory to all matter in space. His theory led to the conclusion that all the mass in the universe would bend space so much that it should have long ago contracted into a single dense blob. Given that the universe seems pretty well spread out, however, and does not seem to be contracting, Einstein decided to add a “fudge factor,” that acts like “anti-gravity” and prevents the universe from collapsing. He called this idea, which was represented as an additional term in the mathematical equation representing his theory of gravity, the cosmological constant. In other words, Einstein supposed the universe to be static and unchanging, because that is the way it looked to astronomers in 1917.'

# split into words by white space
words = text.split()


print(words[:100])
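
To see what "white space only" means in practice, here is a minimal sketch on a short, made-up sentence (the `sample` string is an assumption for illustration, not part of the Einstein text). It also contrasts `split()` with `split(' ')`, which treats every single space as a boundary.

```python
# Hypothetical sample sentence with doubled spaces in the middle.
sample = "He's one of the scientists who's ever lived.  Twice  spaced."

# No arguments: split on any run of whitespace, so no empty strings appear.
words = sample.split()
print(words)

# Explicit single-space separator: every space is a boundary,
# so doubled spaces produce empty strings in the result.
words_single_space = sample.split(' ')
print(words_single_space)
```

Note that "lived." keeps its full stop and "who's" keeps its apostrophe: `split()` knows nothing about punctuation.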

### Split by word

Here we use the regular expression module `re` and split on any run of non-word characters. Notice the difference in "who's": the apostrophe now acts as a separator.

In [None]:
import re

# split based on words only
words = re.split(r'\W+', text)


print(words[:100])
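
As a small illustration of how a word-based split behaves, the sketch below runs two common patterns on a made-up sentence (`sample` is an assumption for illustration). `re.split(r'\W+', ...)` cuts at every run of non-word characters, while `re.findall(r'\w+', ...)` collects the word characters directly.

```python
import re

# Hypothetical sample sentence ending in punctuation.
sample = "Einstein's theory: who's right?"

# Split on runs of non-word characters (spaces, punctuation, apostrophes).
split_words = re.split(r'\W+', sample)
print(split_words)

# Collect runs of word characters instead of splitting around them.
found_words = re.findall(r'\w+', sample)
print(found_words)
```

Because the sentence ends with a non-word character, `re.split` leaves an empty string at the end of its result; `re.findall` does not.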

### Normalizing case

Normalizing case means converting every word in the document to lower case. Be careful, however, not to apply this method blindly, because it can change the meaning of a word. Take, for example, the French telecom company Orange and the fruit orange: after lower-casing, the two can no longer be told apart.

In [None]:
# split into words by white space
words = text.split()


# convert to lower case
words = [word.lower() for word in words]

print(words[:100])
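
The Orange example can be sketched in a few lines. The `headline` string below is made up for illustration; the point is only that two distinct tokens collapse into one after lower-casing.

```python
# Hypothetical headline mixing the company name and the fruit.
headline = "Orange reported higher revenue while orange prices fell"

tokens = headline.split()
lowered = [token.lower() for token in tokens]

# Before normalizing, the two spellings are distinct tokens.
print(tokens.count("Orange"), tokens.count("orange"))

# After normalizing, they are indistinguishable.
print(lowered.count("orange"))
```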

# NLTK

The Natural Language Toolkit is a suite of libraries and programs for symbolic and statistical natural language processing for English, written in the Python programming language. *(Wikipedia)*



### Split into sentences

This tokenizer divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences. It must be trained on a large collection of plaintext in the target language before it can be used.

The NLTK data package includes a pre-trained Punkt tokenizer for English. *(nltk.org)*

In [None]:
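import nltk
from nltk import sent_tokenize
nltk.download('punkt')
nltk.download('punkt_tab')  # recent NLTK releases look for this resource instead

# split into sentences
sentences = sent_tokenize(text)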

for sentence in sentences:
  print(sentence)
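
To see why a trained model for abbreviations is useful, here is a deliberately naive alternative: a regular expression that splits after every `.`, `!` or `?` followed by whitespace. This is not the Punkt algorithm, only a sketch of the failure it is designed to avoid, and the `sample` string is an assumption for illustration.

```python
import re

# Hypothetical sample containing an abbreviation ("Dr.").
sample = "Dr. Einstein published a paper in 1917. It changed cosmology."

# Naive rule: a sentence ends at ., ! or ? followed by whitespace.
naive_sentences = re.split(r'(?<=[.!?])\s+', sample)

for sentence in naive_sentences:
    print(sentence)
```

The naive rule cuts right after "Dr.", producing three pieces instead of two sentences.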


### Split into words

From the same toolkit, we use the `tokenize` module and import the word tokenizer. Similarly to `re.split`, this function splits the text, but it returns tokens rather than plain words, so punctuation marks and parts of contractions become tokens of their own. Make sure you check the output and spot the differences!

In [None]:
from nltk.tokenize import word_tokenize

# split into words
tokens = word_tokenize(text)

print(tokens[:100])

### Filter out punctuation

Python strings have a built-in method, `isalpha()`, which returns `True` only when every character in the string is a letter. We can use it to tell alphabetic words apart from everything else (numbers, punctuation, special characters, etc.).

In [None]:
# split into words
tokens = word_tokenize(text)


# remove all tokens that are not alphabetic
words = [word for word in tokens if word.isalpha()]

print(words[:100])
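
Here is a minimal sketch of what `isalpha()` keeps and drops. The token list is hand-written for illustration (an assumption, not the output of a tokenizer).

```python
# Hypothetical tokens: words, a number, punctuation, and mixed strings.
sample_tokens = ["Einstein", "1917", "fudge", ",", "anti-gravity", "who's", "E=mc2"]

# Keep only tokens made up entirely of letters.
alphabetic = [token for token in sample_tokens if token.isalpha()]
print(alphabetic)
```

Note that hyphenated words and contractions are dropped as well, because the hyphen and the apostrophe are not letters.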

### Remove stopwords

Stopwords are words that do not add much meaning to a sentence. They can usually be ignored without sacrificing the meaning of the sentence. The most common are short function words such as "the", "is", "at", "which", and "on".

However, removing stopwords can cause problems when searching for phrases that include them, particularly in names such as “The Who” or “Take That”.

Treating the word "not" as a stopword can also reverse the meaning of a sentence once it is removed (try "this code is not good").
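
The "this code is not good" case can be tried without any download. The sketch below uses a tiny hand-written stopword set (`toy_stop_words` is an assumption for illustration, not the NLTK list) and shows one possible workaround: removing "not" from the set before filtering.

```python
# Hypothetical, hand-written stopword set; the real NLTK list is much longer.
toy_stop_words = {"this", "is", "not"}

sentence = "this code is not good"

# Plain filtering drops the negation and flips the meaning.
filtered = [w for w in sentence.split() if w not in toy_stop_words]
print(filtered)

# One option: keep negations by taking them out of the stopword set.
stop_words_keep_negation = toy_stop_words - {"not"}
filtered_keep_negation = [w for w in sentence.split() if w not in stop_words_keep_negation]
print(filtered_keep_negation)
```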

In [None]:
# let's list all the stopwords for NLTK
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

stop_words = stopwords.words('english')
print(stop_words)

As you can see, the stopwords are all lower case. If we want to compare them with our tokens, we need to make sure our text is prepared the same way.

This cell recaps everything we have learnt so far in this Colab: tokenizing, lower-casing, and keeping only alphabetic words.

In [None]:
# clean our text

# split into words
tokens = word_tokenize(text)

# convert to lower case
tokens = [w.lower() for w in tokens]

# remove all tokens that are not alphabetic
words = [word for word in tokens if word.isalpha()]

# filter out stop words
stop_words = set(stopwords.words('english'))
words = [w for w in words if w not in stop_words]
print(words[:100])

### Stem words

Stemming refers to the process of reducing each word to its root or base form; the resulting stem is not always a dictionary word.
Here we use two suffix-stripping stemmers, Porter and Lancaster. Each has its own algorithm, so they sometimes produce different outputs.
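
To get a feel for what "suffix stripping" means, here is a deliberately naive sketch in plain Python. It is neither the Porter nor the Lancaster algorithm: the `SUFFIXES` tuple and the `naive_stem` helper are hypothetical names made up for illustration, and the real stemmers apply many more rules.

```python
# Hypothetical suffix list, checked in order (longer suffixes first).
SUFFIXES = ("ing", "ies", "ed", "es", "s")

def naive_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

sample_words = ["theories", "applied", "contracting", "curves", "universe", "bends", "was"]
naive_stems = [naive_stem(word) for word in sample_words]
print(naive_stems)
```

Even this toy version shows that stems such as "theor" or "appli" are not real words: a stem only needs to be shared by related word forms.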

In [None]:
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer

# split into words
tokens = word_tokenize(text)

# stemming of words
porter = PorterStemmer()
stemmed = [porter.stem(word) for word in tokens]
print(stemmed[:100])

In [None]:
from nltk.tokenize import word_tokenize
from nltk.stem.lancaster import LancasterStemmer

# split into words
tokens = word_tokenize(text)

# stemming of words
lancaster = LancasterStemmer()
stemmed = [lancaster.stem(word) for word in tokens]
print(stemmed[:100])
