# Text preprocessing techniques
* Converting words into lowercase
* Removing leading and trailing whitespaces
* Removing punctuation
* Removing stopwords
* Expanding contractions
* Removing special characters (numbers, emojis, etc.)
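
Several of these steps can be sketched with the standard library alone. The snippet below is a minimal illustration, not the spaCy-based approach used in the rest of these notes; the function name `basic_clean` and the tiny contraction table are assumptions made up for this example.

```python
import string

# Hypothetical, deliberately tiny contraction table (illustration only)
CONTRACTIONS = {"don't": "do not", "i'm": "i am", "can't": "cannot"}

def basic_clean(text):
    # Lowercase and strip leading/trailing whitespace
    text = text.lower().strip()
    # Expand contractions word by word, before punctuation is removed
    words = [CONTRACTIONS.get(word, word) for word in text.split()]
    text = ' '.join(words)
    # Remove ASCII punctuation
    return text.translate(str.maketrans('', '', string.punctuation))

print(basic_clean("  Hello! I don't know what I'm doing here.  "))
# hello i do not know what i am doing here
```

Contractions are expanded before punctuation is removed because the apostrophe is what identifies them.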

## Tokenization using spaCy

In [4]:
import spacy
# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')
# Initialize string
string = "Hello! I don't know what I'm doing here."
# Create a Doc object
doc = nlp(string)
# Generate list of tokens
tokens = [token.text for token in doc]
print(tokens)

['Hello', '!', 'I', 'do', "n't", 'know', 'what', 'I', "'m", 'doing', 'here', '.']


## Lemmatization using spaCy

In [3]:
import spacy
# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')
# Initialize string
string = "Hello! I don't know what I'm doing here."
# Create a Doc object
doc = nlp(string)
# Generate list of lemmas
lemmas = [token.lemma_ for token in doc]
lemmas

['hello', '!', 'I', 'do', 'not', 'know', 'what', 'I', 'be', 'do', 'here', '.']

## Tokenizing the Gettysburg Address

In [11]:
gettysburg = "Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal. Now we're engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. We're met on a great battlefield of that war. We've come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It's altogether fitting and proper that we should do this. But, in a larger sense, we can't dedicate - we can not consecrate - we can not hallow - this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It's rather for us to be here dedicated to the great task remaining before us - that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion - that we here highly resolve that these dead shall not have died in vain - that this nation, under God, shall have a new birth of freedom - and that government of the people, by the people, for the people, shall not perish from the earth."

In [15]:
import spacy
doc = nlp(gettysburg)
tokens = [token.text for token in doc]
print(tokens)

['Four', 'score', 'and', 'seven', 'years', 'ago', 'our', 'fathers', 'brought', 'forth', 'on', 'this', 'continent', ',', 'a', 'new', 'nation', ',', 'conceived', 'in', 'Liberty', ',', 'and', 'dedicated', 'to', 'the', 'proposition', 'that', 'all', 'men', 'are', 'created', 'equal', '.', 'Now', 'we', "'re", 'engaged', 'in', 'a', 'great', 'civil', 'war', ',', 'testing', 'whether', 'that', 'nation', ',', 'or', 'any', 'nation', 'so', 'conceived', 'and', 'so', 'dedicated', ',', 'can', 'long', 'endure', '.', 'We', "'re", 'met', 'on', 'a', 'great', 'battlefield', 'of', 'that', 'war', '.', 'We', "'ve", 'come', 'to', 'dedicate', 'a', 'portion', 'of', 'that', 'field', ',', 'as', 'a', 'final', 'resting', 'place', 'for', 'those', 'who', 'here', 'gave', 'their', 'lives', 'that', 'that', 'nation', 'might', 'live', '.', 'It', "'s", 'altogether', 'fitting', 'and', 'proper', 'that', 'we', 'should', 'do', 'this', '.', 'But', ',', 'in', 'a', 'larger', 'sense', ',', 'we', 'ca', "n't", 'dedicate', '-', 'we', '

In [16]:
# Generate lemmas
lemmas = [token.lemma_ for token in doc]
print(lemmas)

['four', 'score', 'and', 'seven', 'year', 'ago', 'our', 'father', 'bring', 'forth', 'on', 'this', 'continent', ',', 'a', 'new', 'nation', ',', 'conceive', 'in', 'Liberty', ',', 'and', 'dedicate', 'to', 'the', 'proposition', 'that', 'all', 'man', 'be', 'create', 'equal', '.', 'now', 'we', 'be', 'engage', 'in', 'a', 'great', 'civil', 'war', ',', 'test', 'whether', 'that', 'nation', ',', 'or', 'any', 'nation', 'so', 'conceive', 'and', 'so', 'dedicated', ',', 'can', 'long', 'endure', '.', 'we', 'be', 'meet', 'on', 'a', 'great', 'battlefield', 'of', 'that', 'war', '.', 'we', "'ve", 'come', 'to', 'dedicate', 'a', 'portion', 'of', 'that', 'field', ',', 'as', 'a', 'final', 'resting', 'place', 'for', 'those', 'who', 'here', 'give', 'their', 'life', 'that', 'that', 'nation', 'might', 'live', '.', 'it', 'be', 'altogether', 'fitting', 'and', 'proper', 'that', 'we', 'should', 'do', 'this', '.', 'but', ',', 'in', 'a', 'large', 'sense', ',', 'we', 'can', 'not', 'dedicate', '-', 'we', 'can', 'not', 'c

# Text cleaning techniques
* Removing unnecessary whitespace and escape sequences
* Removing punctuation
* Removing special characters (numbers, emojis, etc.)
* Removing stopwords

## A word of caution
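* Abbreviations: U.S.A, U.K, etc. contain periods, so an alphabetic-only check such as `str.isalpha()` rejects them.
* Proper nouns: names such as word2vec and xto10x contain digits and are rejected for the same reason.
* Write your own custom function (using regex) for the more nuanced cases.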
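
A regex-based custom function of the kind suggested above might look like the sketch below. The pattern and the name `keep_nuanced_tokens` are assumptions chosen for illustration; a real project would tune the pattern to its own data.

```python
import re

# Hypothetical pattern: dotted abbreviations first, then names that start
# with a letter and may contain digits
TOKEN_PATTERN = re.compile(r"(?:[A-Za-z]\.)+[A-Za-z]\.?|[A-Za-z][A-Za-z0-9]*")

def keep_nuanced_tokens(text):
    # Return every token that matches the pattern, in order of appearance
    return TOKEN_PATTERN.findall(text)

print(keep_nuanced_tokens("The U.S.A and the U.K use word2vec and xto10x, top 5!"))
# ['The', 'U.S.A', 'and', 'the', 'U.K', 'use', 'word2vec', 'and', 'xto10x', 'top']
```

Standalone numbers and punctuation are still dropped, while the abbreviations and alphanumeric names survive.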

## Removing non-alphabetic characters

In [18]:
string = """
OMG!!!! This is like the best thing ever \t\n.
Wow, such an amazing song! I'm hooked. Top 5 definitely. ?
"""
import spacy
# Load the model and create a Doc object
nlp = spacy.load('en_core_web_sm')
doc = nlp(string)
# Generate list of lemmas
lemmas = [token.lemma_ for token in doc]
print(lemmas)

['\n', 'OMG', '!', '!', '!', '!', 'this', 'be', 'like', 'the', 'good', 'thing', 'ever', '\t\n', '.', '\n', 'wow', ',', 'such', 'an', 'amazing', 'song', '!', 'I', 'be', 'hook', '.', 'top', '5', 'definitely', '.', '?', '\n']


In [19]:
# Remove tokens that are not alphabetic
a_lemmas = [lemma for lemma in lemmas if lemma.isalpha()]
# Print string after text cleaning
print(' '.join(a_lemmas))

OMG this be like the good thing ever wow such an amazing song I be hook top definitely


## Stopwords
* Words that occur extremely commonly
* E.g. articles, forms of the verb "be", pronouns, etc.

In [24]:
# Removing stopwords using spaCy
from spacy.lang.en.stop_words import STOP_WORDS
# Get the set of stopwords
stopwords = STOP_WORDS
print(stopwords)

{'seemed', 'then', 'therein', 'almost', 'now', 'been', 'they', 'while', 'most', 'moreover', 'whose', 'ca', 'here', 'used', 'also', 'even', 'top', 'everyone', 'the', 'becoming', 'its', "'re", 'toward', 'their', 'hereby', 'however', 'same', 'behind', 'often', 'must', 'perhaps', 'regarding', 'how', '’d', 'whether', 'various', 'fifteen', 'i', 'thru', 'due', 'an', 'afterwards', 'were', 'again', 'forty', 'third', 'everywhere', 'within', 'otherwise', 'yourself', 'sixty', 'sometimes', 'seeming', 'latter', '‘m', 'name', 'make', 'should', 'himself', 'why', 'empty', 'noone', 'although', 'always', 'in', 'you', 'thence', 'for', 'whoever', 'below', 'into', 'ten', 'all', 'hundred', 'because', 'he', 'together', 'itself', 'than', 'thereupon', 'nowhere', 'twenty', 'hereafter', 'either', 'whom', 'hence', 'former', 'something', 'more', 'elsewhere', 'ours', 'amongst', 'several', 'or', 'made', 'myself', 'these', 'seem', 'which', 'through', 'his', 'beside', 'take', 'another', 'we', 'own', 'am', 'hers', 'any'

In [10]:
string = """
OMG!!!! This is like the best thing ever \t\n.
Wow, such an amazing song! I'm hooked. Top 5 definitely. ?
"""
# Remove stopwords and non-alphabetic tokens
a_lemmas = [lemma for lemma in lemmas
            if lemma.isalpha() and lemma not in stopwords]
# Print string after text cleaning
print(' '.join(a_lemmas))

OMG like good thing wow amazing song I hook definitely
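
Note that 'I' survives in the output above: the lemma is capitalized, while the stopword set printed earlier contains only the lowercase 'i'. If that is not wanted, one option is to compare the lowercased lemma. The sketch below uses a small hand-picked subset of stopwords and a hand-copied lemma list so that it runs without spaCy; both are assumptions for illustration.

```python
# Small illustrative subset; the real set comes from spaCy's STOP_WORDS
stopwords = {'i', 'this', 'be', 'the', 'such', 'an', 'ever', 'top'}

# Lemmas copied by hand from the alphabetic tokens printed earlier
lemmas = ['OMG', 'this', 'be', 'like', 'the', 'good', 'thing', 'ever',
          'wow', 'such', 'an', 'amazing', 'song', 'I', 'be', 'hook',
          'top', 'definitely']

# Compare the lowercased lemma so that 'I' matches the stopword 'i'
a_lemmas = [lemma for lemma in lemmas
            if lemma.isalpha() and lemma.lower() not in stopwords]
print(' '.join(a_lemmas))
# OMG like good thing wow amazing song hook definitely
```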


## TODO: Other text preprocessing techniques
* Removing HTML/XML tags
* Replacing accented characters (such as é)
* Correcting spelling errors

## Cleaning a blog post

In [25]:
blog = '\nTwenty-first-century politics has witnessed an alarming rise of populism in the U.S. and Europe. The first warning signs came with the UK Brexit Referendum vote in 2016 swinging in the way of Leave. This was followed by a stupendous victory by billionaire Donald Trump to become the 45th President of the United States in November 2016. Since then, Europe has seen a steady rise in populist and far-right parties that have capitalized on Europe’s Immigration Crisis to raise nationalist and anti-Europe sentiments. Some instances include Alternative for Germany (AfD) winning 12.6% of all seats and entering the Bundestag, thus upsetting Germany’s political order for the first time since the Second World War, the success of the Five Star Movement in Italy and the surge in popularity of neo-nazism and neo-fascism in countries such as Hungary, Czech Republic, Poland and Austria.\n'
# Create Doc object
doc = nlp(blog)
# Generate lemmatized tokens
lemmas = [token.lemma_ for token in doc]
# Remove stopwords and non-alphabetic tokens
a_lemmas = [lemma for lemma in lemmas 
            if lemma.isalpha() and lemma not in stopwords]
# Print string after text cleaning
print(' '.join(a_lemmas))



## Cleaning TED talks in a dataframe

In [32]:
import pandas as pd
ted = pd.read_csv('ted.csv')
ted['transcript']

0      We're going to talk — my — a new lecture, just...
1      This is a representation of your brain, and yo...
2      It's a great honor today to share with you The...
3      My passions are music, technology and making t...
4      It used to be that if you wanted to get a comp...
                             ...                        
495    Today I'm going to unpack for you three exampl...
496    Both myself and my brother belong to the under...
497    John Hockenberry: It's great to be here with y...
498    What you're doing, right now, at this very mom...
499    We've got a real problem with math education r...
Name: transcript, Length: 500, dtype: object

In [33]:
# Function to preprocess text
def preprocess(text):
    # Create Doc object (NER and the parser are not needed for lemmatization)
    doc = nlp(text, disable=['ner', 'parser'])
    # Generate lemmas
    lemmas = [token.lemma_ for token in doc]
    # Remove stopwords and non-alphabetic tokens
    a_lemmas = [lemma for lemma in lemmas
                if lemma.isalpha() and lemma not in stopwords]
    
    return ' '.join(a_lemmas)
  
# Apply preprocess to ted['transcript']
ted['transcript'] = ted['transcript'].apply(preprocess)
ted['transcript']

0      talk new lecture TED I illusion create TED I t...
1      representation brain brain break left half log...
2      great honor today share Digital Universe creat...
3      passion music technology thing combination thi...
4      use want computer new program programming requ...
                             ...                        
495    today I unpack example iconic design perfect s...
496    brother belong demographic Pat percent accord ...
497    John Hockenberry great Tom I want start questi...
498    right moment kill More car internet little mob...
499    real problem math education right basically ha...
Name: transcript, Length: 500, dtype: object

# Part-of-speech tagging

## Applications
* Word-sense disambiguation
    * "The bear is a majestic animal"
    * "Please bear with me"
* Sentiment analysis
* Question answering
* Fake news and opinion spam detection

> For example, one paper discovered that fake news headlines, on average, tend to use fewer common nouns and more proper nouns than mainstream headlines. Generating the POS tags for these words proved extremely useful in detecting false or hyperpartisan news.

## POS tagging
* Assigning each word its corresponding part of speech.  
    "Jane is an amazing guitarist."
* POS Tagging:
    * Jane → proper noun
    * is → verb (spaCy tags it as AUX, an auxiliary)
    * an → determiner
    * amazing → adjective
    * guitarist → noun

## POS tagging using spaCy

In [35]:
string = "Jane is an amazing guitarist"
# Create a Doc object
doc = nlp(string)
# Generate list of tokens and pos tags
pos = [(token.text, token.pos_) for token in doc]
print(pos)

[('Jane', 'PROPN'), ('is', 'AUX'), ('an', 'DET'), ('amazing', 'ADJ'), ('guitarist', 'NOUN')]


## POS annotations in spaCy
* PROPN → proper noun
* DET → determiner
* spaCy annotations at https://spacy.io/api/annotation
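
For quick reference, the tags that appear in the outputs of these notes can be kept in a small lookup table. The dictionary below is written by hand from the Universal POS tag names, and `describe` is a hypothetical helper; spaCy's own `spacy.explain(tag)` offers a similar lookup.

```python
# Universal POS tags that appear in the outputs of these notes
POS_DESCRIPTIONS = {
    'ADJ': 'adjective',
    'ADP': 'adposition',
    'AUX': 'auxiliary',
    'CCONJ': 'coordinating conjunction',
    'DET': 'determiner',
    'NOUN': 'noun',
    'NUM': 'numeral',
    'PRON': 'pronoun',
    'PROPN': 'proper noun',
    'PUNCT': 'punctuation',
    'SCONJ': 'subordinating conjunction',
    'VERB': 'verb',
}

def describe(tagged_tokens):
    # Replace each tag with its description; unknown tags are flagged
    return [(text, POS_DESCRIPTIONS.get(tag, 'unknown'))
            for text, tag in tagged_tokens]

pos = [('Jane', 'PROPN'), ('is', 'AUX'), ('an', 'DET'),
       ('amazing', 'ADJ'), ('guitarist', 'NOUN')]
for text, description in describe(pos):
    print(text, '→', description)
```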

## POS tagging in Lord of the Flies

In [38]:
lotf = 'He found himself understanding the wearisomeness of this life, where every path was an improvisation and a considerable part of one’s waking life was spent watching one’s feet.'
# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')
# Create a Doc object
doc = nlp(lotf)
# Generate tokens and pos tags
pos = [(token.text, token.pos_) for token in doc]
print(pos)

[('He', 'PRON'), ('found', 'VERB'), ('himself', 'PRON'), ('understanding', 'VERB'), ('the', 'DET'), ('wearisomeness', 'NOUN'), ('of', 'ADP'), ('this', 'DET'), ('life', 'NOUN'), (',', 'PUNCT'), ('where', 'SCONJ'), ('every', 'DET'), ('path', 'NOUN'), ('was', 'AUX'), ('an', 'DET'), ('improvisation', 'NOUN'), ('and', 'CCONJ'), ('a', 'DET'), ('considerable', 'ADJ'), ('part', 'NOUN'), ('of', 'ADP'), ('one', 'NUM'), ('’s', 'NUM'), ('waking', 'VERB'), ('life', 'NOUN'), ('was', 'AUX'), ('spent', 'VERB'), ('watching', 'VERB'), ('one', 'NUM'), ('’s', 'NUM'), ('feet', 'NOUN'), ('.', 'PUNCT')]


## Counting nouns in a piece of text

In [39]:
# Returns number of proper nouns
def proper_nouns(text, model=nlp):
    # Create Doc object
    doc = model(text)
    # Generate list of POS tags
    pos = [token.pos_ for token in doc]
    
    # Return number of proper nouns
    return pos.count('PROPN')

print(proper_nouns("Abdul, Bill and Cathy went to the market to buy apples.", nlp))

3


In [40]:
# Returns number of other nouns
def nouns(text, model=nlp):
    # Create Doc object
    doc = model(text)
    # Generate list of POS tags
    pos = [token.pos_ for token in doc]
    
    # Return number of other nouns
    return pos.count('NOUN')

print(nouns("Abdul, Bill and Cathy went to the market to buy apples.", nlp))

2
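
The two functions above each run the pipeline on the same text. If both counts are needed, one possible alternative is to tally every tag in a single pass. The sketch below assumes the tags have already been produced as (text, tag) pairs; the tag list is written by hand for illustration rather than generated by spaCy, and `count_pos` is a made-up name.

```python
from collections import Counter

def count_pos(tagged_tokens):
    # Tally every POS tag in one pass over (text, tag) pairs
    return Counter(tag for _, tag in tagged_tokens)

# Hand-written tags for illustration; with spaCy these would come from
# [(token.text, token.pos_) for token in nlp(text)]
tagged = [('Abdul', 'PROPN'), (',', 'PUNCT'), ('Bill', 'PROPN'),
          ('and', 'CCONJ'), ('Cathy', 'PROPN'), ('went', 'VERB'),
          ('to', 'ADP'), ('the', 'DET'), ('market', 'NOUN'),
          ('to', 'PART'), ('buy', 'VERB'), ('apples', 'NOUN'),
          ('.', 'PUNCT')]

counts = count_pos(tagged)
print(counts['PROPN'], counts['NOUN'])
# 3 2
```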


## Noun usage in fake news

In [42]:
headlines = pd.read_csv('fakenews.csv')
headlines

Unnamed: 0.1,Unnamed: 0,title,label
0,0,You Can Smell Hillary’s Fear,FAKE
1,1,Watch The Exact Moment Paul Ryan Committed Pol...,FAKE
2,2,Kerry to go to Paris in gesture of sympathy,REAL
3,3,Bernie supporters on Twitter erupt in anger ag...,FAKE
4,4,The Battle of New York: Why This Primary Matters,REAL
...,...,...,...
95,95,The Mandela Effect was made by one overlooked ...,FAKE
96,96,CNN: One voter can make a difference by voting...,FAKE
97,97,Give Social Security recipients a CEO-style raise,REAL
98,98,"Fireworks erupt between Trump and Bush, Rubio ...",REAL


In [43]:
headlines['num_propn'] = headlines['title'].apply(proper_nouns)

# Compute mean of proper nouns
real_propn = headlines[headlines['label'] == 'REAL']['num_propn'].mean()
fake_propn = headlines[headlines['label'] == 'FAKE']['num_propn'].mean()

# Print results
print("Mean no. of proper nouns in real and fake headlines are %.2f and %.2f respectively"%(real_propn, fake_propn))

Mean no. of proper nouns in real and fake headlines are 2.37 and 4.81 respectively


In [44]:
headlines['num_noun'] = headlines['title'].apply(nouns)

# Compute mean of other nouns
real_noun = headlines[headlines['label'] == 'REAL']['num_noun'].mean()
fake_noun = headlines[headlines['label'] == 'FAKE']['num_noun'].mean()

# Print results
print("Mean no. of other nouns in real and fake headlines are %.2f and %.2f respectively"%(real_noun, fake_noun))

Mean no. of other nouns in real and fake headlines are 2.39 and 1.60 respectively


# Named entity recognition

## Applications
* Efficient search algorithms
* Question answering
* News article classification
* Customer service

## Named entity recognition
* Identifying and classifying named entities into predefined categories
* Categories include person, organization, country, etc.  
    "John Doe is a software engineer working at Google. He lives in France."
* Named Entities
    * John Doe → person
    * Google → organization
    * France → country (geopolitical entity)

## NER using spaCy

In [46]:
string = "John Doe is a software engineer working at Google. He lives in France."
doc = nlp(string)
ne = [(ent.text, ent.label_) for ent in doc.ents]
print(ne)

[('John Doe', 'PERSON'), ('Google', 'ORG'), ('France', 'GPE')]


## NER annotations in spaCy
* More than 15 categories of named entities
* NER annotations at https://spacy.io/api/annotation#named-entities

## Named entities in a sentence

In [48]:
text = 'Sundar Pichai is the CEO of Google. Its headquarters is in Mountain View.'
doc = nlp(text)
# Print all named entities and their labels
for ent in doc.ents:
    print(ent.text, ent.label_)

Sundar Pichai PERSON
Google ORG
Mountain View GPE


## Identifying people mentioned in a news article

In [49]:
tc ="\nIt’s' been a busy day for Facebook  exec op-eds. Earlier this morning, Sheryl Sandberg broke the site’s silence around the Christchurch massacre, and now Mark Zuckerberg is calling on governments and other bodies to increase regulation around the sorts of data Facebook traffics in. He’s hoping to get out in front of heavy-handed regulation and get a seat at the table shaping it.\n"
def find_persons(text):
    doc = nlp(text)
    persons = [ent.text for ent in doc.ents if ent.label_ == 'PERSON']
    return persons

print(find_persons(tc))

['Facebook', 'Sheryl Sandberg', 'Mark Zuckerberg', 'Facebook']


## A word of caution
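* NER models are not perfect: in the example above, "Facebook" is labeled as a person.
* Performance depends on how closely the text resembles the model's training data.
* Train models with specialized data for nuanced cases.
* Models are language specific.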
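
Short of retraining, a simple post-processing filter can catch known mistakes such as the "Facebook" label in the `find_persons` output above. The sketch below is a stopgap under stated assumptions: the blocklist and the name `filter_persons` are hypothetical, and the entity list is written by hand to mirror that output.

```python
# Hypothetical blocklist of names known not to be people (illustration only)
KNOWN_NON_PERSONS = {'Facebook', 'Google'}

def filter_persons(entities, blocklist=KNOWN_NON_PERSONS):
    # Keep PERSON entities that are not on the blocklist, without duplicates
    persons = []
    for text, label in entities:
        if label == 'PERSON' and text not in blocklist and text not in persons:
            persons.append(text)
    return persons

# Entities written by hand to mirror the find_persons output
entities = [('Facebook', 'PERSON'), ('Sheryl Sandberg', 'PERSON'),
            ('Mark Zuckerberg', 'PERSON'), ('Facebook', 'PERSON')]
print(filter_persons(entities))
# ['Sheryl Sandberg', 'Mark Zuckerberg']
```

A blocklist only fixes the errors already known; training on specialized data remains the more general remedy.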