In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

In [3]:
from glob import glob
import re
import nltk
import spacy

In [65]:
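# datasets used
txt_files = glob('./*.txt')
names = ['blog', 'gettysburg', 'lotf', 'tc']

# read each file into a dict keyed by dataset name, so the unpacking
# below does not depend on the order in which glob() returns the paths
texts = {}
for name in names:
    for file in txt_files:
        if name in file:
            with open(file, 'r', encoding='utf-8') as f:
                texts[name] = f.read()

blog, gettysburg, lotf, tc = (texts[name] for name in names)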

ted = pd.read_csv('./ted.csv')
headlines = pd.read_csv('./fakenews.csv')

# Tokenization and Lemmatization

## Tokenizing the Gettysburg Address

In [17]:
# load the language model for spacy
nlp = spacy.load('en_core_web_sm')

In [27]:
# create a doc object
doc = nlp(gettysburg)

# generate the tokens
tokens = [token.text for token in doc]
print(f'tokens:\n{tokens}')

tokens:
['Four', 'score', 'and', 'seven', 'years', 'ago', 'our', 'fathers', 'brought', 'forth', 'on', 'this', 'continent', ',', 'a', 'new', 'nation', ',', 'conceived', 'in', 'Liberty', ',', 'and', 'dedicated', 'to', 'the', 'proposition', 'that', 'all', 'men', 'are', 'created', 'equal', '.', 'Now', 'we', "'re", 'engaged', 'in', 'a', 'great', 'civil', 'war', ',', 'testing', 'whether', 'that', 'nation', ',', 'or', 'any', 'nation', 'so', 'conceived', 'and', 'so', 'dedicated', ',', 'can', 'long', 'endure', '.', 'We', "'re", 'met', 'on', 'a', 'great', 'battlefield', 'of', 'that', 'war', '.', 'We', "'ve", 'come', 'to', 'dedicate', 'a', 'portion', 'of', 'that', 'field', ',', 'as', 'a', 'final', 'resting', 'place', 'for', 'those', 'who', 'here', 'gave', 'their', 'lives', 'that', 'that', 'nation', 'might', 'live', '.', 'It', "'s", 'altogether', 'fitting', 'and', 'proper', 'that', 'we', 'should', 'do', 'this', '.', 'But', ',', 'in', 'a', 'larger', 'sense', ',', 'we', 'ca', "n't", 'dedicate', '-',

## Lemmatizing the Gettysburg Address

In [28]:
print(gettysburg)

Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal. Now we're engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. We're met on a great battlefield of that war. We've come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It's altogether fitting and proper that we should do this. But, in a larger sense, we can't dedicate - we can not consecrate - we can not hallow - this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so no

In [29]:
# generate lemmas
lemmas = [token.lemma_ for token in doc]

# convert lemmas into a string
print(" ".join(lemmas))

four score and seven year ago -PRON- father bring forth on this continent , a new nation , conceive in Liberty , and dedicate to the proposition that all man be create equal . now -PRON- be engage in a great civil war , test whether that nation , or any nation so conceive and so dedicated , can long endure . -PRON- be meet on a great battlefield of that war . -PRON- have come to dedicate a portion of that field , as a final resting place for those who here give -PRON- life that that nation may live . -PRON- be altogether fitting and proper that -PRON- should do this . but , in a large sense , -PRON- can not dedicate - -PRON- can not consecrate - -PRON- can not hallow - this ground . the brave man , living and dead , who struggle here , have consecrate -PRON- , far above -PRON- poor power to add or detract . the world will little note , nor long remember what -PRON- say here , but -PRON- can never forget what -PRON- do here . -PRON- be for -PRON- the living , rather , to be dedicate her
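
The `-PRON-` entries in the output above are the placeholder lemma that spaCy 2.x assigns to pronouns; spaCy 3.x dropped this placeholder. A minimal sketch of removing it from a list of lemma strings follows. The helper name `drop_pron_placeholder` is hypothetical and not part of spaCy, and the sample is just the first few lemmas printed above.

```python
# Placeholder lemma used for pronouns by spaCy 2.x
PRON_PLACEHOLDER = "-PRON-"

def drop_pron_placeholder(lemmas):
    """Return the lemmas with the pronoun placeholder removed (hypothetical helper)."""
    return [lemma for lemma in lemmas if lemma != PRON_PLACEHOLDER]

# First few lemmas from the Gettysburg output above
sample = "four score and seven year ago -PRON- father bring forth".split()
print(" ".join(drop_pron_placeholder(sample)))
```

The `isalpha()` filter used in the text cleaning section below also removes the placeholder, because `-PRON-` contains hyphens.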

# Text cleaning
Removing:
* Unnecessary whitespace and escape sequences
* Punctuation
* Special characters (numbers, emojis, etc.)
* Stopwords
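
The steps listed above can be sketched with the standard library alone, before bringing spaCy into the picture. This is an illustration under stated assumptions, not the approach used in the rest of the notebook: `clean_text` is a hypothetical helper, and `STOPWORDS` is a deliberately tiny made-up list standing in for a real stopword set.

```python
import re
import string

# Hypothetical, deliberately tiny stopword list for illustration only
STOPWORDS = {"a", "an", "and", "the", "is", "of", "to", "in", "on"}

def clean_text(text, stopwords=STOPWORDS):
    """Lowercase, normalize whitespace, strip punctuation, drop non-words and stopwords."""
    # Collapse runs of whitespace (including \n and \t) into single spaces
    text = re.sub(r"\s+", " ", text).strip().lower()
    # Remove ASCII punctuation characters
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Keep purely alphabetic tokens that are not stopwords
    return [tok for tok in text.split() if tok.isalpha() and tok not in stopwords]

print(clean_text("The  cat\tsat on 2 mats,\nand purred!"))
```

Unlike the spaCy version below, this sketch does no lemmatization, so `mats` stays plural.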

## Cleaning a blog post

In [44]:
print(blog)




In [45]:
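# get default stopwords from spacy (explicit import, so this does not
# rely on the English language module having been loaded already)
from spacy.lang.en.stop_words import STOP_WORDS
stopwords = STOP_WORDS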

# create Doc object
doc = nlp(blog.lower())

# generate lemmatized tokens
lemmas = [token.lemma_ for token in doc]

# remove stopwords and non-alphabetic tokens
a_lemmas = [lemma for lemma in lemmas
            if lemma.isalpha() and lemma not in stopwords]

# print string after text cleaning
print(" ".join(a_lemmas))



Take a look at the cleaned text: it is lowercased and stripped of numbers, punctuation, and common stopwords. Note also that the original text contained the word `U.S.`. Because it has periods between its letters, `isalpha()` returns `False` for it, so the cleaning step dropped it entirely. This may not be the behavior you want. For nuanced cases like this one, consider writing a custom function to use in place of `isalpha()`.
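
As one possible shape for such a custom function, here is a minimal sketch that accepts plain alphabetic tokens as well as dotted abbreviations such as `u.s.`. The name `is_wordlike` and the regular expression are illustrative assumptions. Note also that the pattern covers ASCII letters only, whereas `isalpha()` accepts any Unicode letter.

```python
import re

# Either a run of ASCII letters, or an abbreviation built from single
# letters each followed by a period (e.g. "u.s.", "e.g.", "U.S.A").
# This pattern is an illustrative assumption, not a general solution.
WORD_OR_ABBREV = re.compile(r"[A-Za-z]+|(?:[A-Za-z]\.){2,}[A-Za-z]?")

def is_wordlike(token):
    """Return True for alphabetic tokens and dotted abbreviations."""
    return WORD_OR_ABBREV.fullmatch(token) is not None

for token in ["nation", "u.s.", "3.14", ",", "end."]:
    print(token, is_wordlike(token))
```

In the cleaning code above, the filter would then read `if is_wordlike(lemma) and lemma not in stopwords`.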

## Cleaning TED talks in a dataframe

In [51]:
# take an explicit copy of the first 20 rows so that later column
# assignments do not trigger pandas' SettingWithCopyWarning
ted_ = ted.iloc[:20, :].copy()
ted_['transcript'] = ted_['transcript'].str.lower()

In [52]:
ted_.head()

Unnamed: 0,transcript,url
0,"we're going to talk — my — a new lecture, just...",https://www.ted.com/talks/al_seckel_says_our_b...
1,"this is a representation of your brain, and yo...",https://www.ted.com/talks/aaron_o_connell_maki...
2,it's a great honor today to share with you the...,https://www.ted.com/talks/carter_emmart_demos_...
3,"my passions are music, technology and making t...",https://www.ted.com/talks/jared_ficklin_new_wa...
4,it used to be that if you wanted to get a comp...,https://www.ted.com/talks/jeremy_howard_the_wo...


In [53]:
# function to preprocess text
def preprocess(text):
    # create Doc object
    doc = nlp(text, disable=['ner', 'parser'])
    # generate lemmas
    lemmas = [token.lemma_ for token in doc]
    # remove stopwords and non-alphabetic tokens
    a_lemmas = [lemma for lemma in lemmas
                if lemma.isalpha() and lemma not in stopwords]
    
    return " ".join(a_lemmas)

In [54]:
# apply preprocess to ted_['transcript']; assign() returns a new
# DataFrame, which avoids pandas' SettingWithCopyWarning
ted_ = ted_.assign(transcript=ted_['transcript'].apply(preprocess))
ted_.head()

Unnamed: 0,transcript,url
0,talk new lecture ted illusion create ted try r...,https://www.ted.com/talks/al_seckel_says_our_b...
1,representation brain brain break left half log...,https://www.ted.com/talks/aaron_o_connell_maki...
2,great honor today share digital universe creat...,https://www.ted.com/talks/carter_emmart_demos_...
3,passion music technology thing combination thi...,https://www.ted.com/talks/jared_ficklin_new_wa...
4,use want computer new program programming requ...,https://www.ted.com/talks/jeremy_howard_the_wo...


# Part-of-speech (POS) tagging

## POS tagging in Lord of the Flies

In [56]:
# create a Doc object
doc = nlp(lotf)

# generate tokens and pos tags
pos = [(token.text, token.pos_) for token in doc]
print(pos)

[('He', 'PRON'), ('found', 'VERB'), ('himself', 'PRON'), ('understanding', 'VERB'), ('the', 'DET'), ('wearisomeness', 'NOUN'), ('of', 'ADP'), ('this', 'DET'), ('life', 'NOUN'), (',', 'PUNCT'), ('where', 'ADV'), ('every', 'DET'), ('path', 'NOUN'), ('was', 'AUX'), ('an', 'DET'), ('improvisation', 'NOUN'), ('and', 'CCONJ'), ('a', 'DET'), ('considerable', 'ADJ'), ('part', 'NOUN'), ('of', 'ADP'), ('one', 'NOUN'), ('’s', 'PART'), ('waking', 'VERB'), ('life', 'NOUN'), ('was', 'AUX'), ('spent', 'VERB'), ('watching', 'VERB'), ('one', 'PRON'), ('’s', 'PART'), ('feet', 'NOUN'), ('.', 'PUNCT')]
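
A list of `(token, tag)` pairs like the one above is easy to summarize with `collections.Counter`. The sketch below tallies how often each tag occurs; `tag_counts` is a hypothetical helper name, and the sample is just the first ten pairs from the output above.

```python
from collections import Counter

def tag_counts(tagged_tokens):
    """Count how often each POS tag occurs in a list of (text, tag) pairs."""
    return Counter(tag for _, tag in tagged_tokens)

# First ten (token, tag) pairs from the output above
sample = [('He', 'PRON'), ('found', 'VERB'), ('himself', 'PRON'),
          ('understanding', 'VERB'), ('the', 'DET'), ('wearisomeness', 'NOUN'),
          ('of', 'ADP'), ('this', 'DET'), ('life', 'NOUN'), (',', 'PUNCT')]

counts = tag_counts(sample)
print(counts['NOUN'], counts['PROPN'])
```

A `Counter` returns 0 for tags that never occur, so looking up `counts['PROPN']` is safe even when a text contains no proper nouns.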


## Counting nouns in a piece of text

In [62]:
# function returning the number of proper nouns
def proper_nouns(text, model=nlp):
    # create Doc object
    doc = model(text)
    # generate list of POS tags
    pos = [token.pos_ for token in doc]
    
    # return number of proper nouns
    return pos.count('PROPN')

# function returning the number of other (common) nouns
def nouns(text, model=nlp):
    # create Doc object
    doc = model(text)
    # generate list of POS tags
    pos = [token.pos_ for token in doc]
    
    # return number of other nouns
    return pos.count('NOUN')

In [64]:
text = "Abdul, Bill and Cathy went to the market to buy apples."

print(f'The number of proper nouns in \'{text}\' is {proper_nouns(text, nlp)}\n')
print(f'The number of nouns in \'{text}\' is {nouns(text, nlp)}')

The number of proper nouns in 'Abdul, Bill and Cathy went to the market to buy apples.' is 3

The number of nouns in 'Abdul, Bill and Cathy went to the market to buy apples.' is 2


## Noun usage in fake news

In [66]:
headlines.head()

Unnamed: 0.1,Unnamed: 0,title,label
0,0,You Can Smell Hillary’s Fear,FAKE
1,1,Watch The Exact Moment Paul Ryan Committed Pol...,FAKE
2,2,Kerry to go to Paris in gesture of sympathy,REAL
3,3,Bernie supporters on Twitter erupt in anger ag...,FAKE
4,4,The Battle of New York: Why This Primary Matters,REAL


In [67]:
headlines_ = headlines.copy()

In [68]:
headlines_['num_PROPN'] = headlines_['title'].apply(proper_nouns)

# compute the mean number of proper nouns per headline for each label
real_propn = headlines_[headlines_['label'] == 'REAL']['num_PROPN'].mean()
fake_propn = headlines_[headlines_['label'] == 'FAKE']['num_PROPN'].mean()

print(f'The mean numbers of proper nouns in real and fake news are {real_propn:.2f} and {fake_propn:.2f}, respectively.')

The mean numbers of proper nouns in real and fake news are 2.42 and 4.58, respectively.


In [69]:
headlines_['num_NOUN'] = headlines_['title'].apply(nouns)

# compute the mean number of other (common) nouns per headline for each label
real_noun = headlines_[headlines_['label'] == 'REAL']['num_NOUN'].mean()
fake_noun = headlines_[headlines_['label'] == 'FAKE']['num_NOUN'].mean()

print(f'The mean numbers of other nouns in real and fake news are {real_noun:.2f} and {fake_noun:.2f}, respectively.')

The mean numbers of other nouns in real and fake news are 2.30 and 1.67, respectively.


# Named entity recognition

## Named entities in a sentence

In [70]:
text = 'Sundar Pichai is the CEO of Google. Its headquarters is in Mountain View.'

# create a Doc object
doc = nlp(text)

# print named entities and their labels
print('Named entities and their labels:')
for ent in doc.ents:
    print(ent.text, ent.label_)

Named entities and their labels:
Sundar Pichai PERSON
Google ORG
Mountain View GPE
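
When a text contains many entities, it can be handier to group them by label than to print them one per line. A minimal sketch follows, assuming the entities are already available as `(text, label)` pairs. `group_entities` is a hypothetical helper name, and the sample pairs are the three entities printed above.

```python
from collections import defaultdict

def group_entities(entities):
    """Group (text, label) pairs into a dict mapping each label to its entity texts."""
    grouped = defaultdict(list)
    for text, label in entities:
        grouped[label].append(text)
    return dict(grouped)

# The three entities printed above
entities = [("Sundar Pichai", "PERSON"), ("Google", "ORG"), ("Mountain View", "GPE")]
print(group_entities(entities))
```

With a spaCy `Doc`, the input pairs would come from `[(ent.text, ent.label_) for ent in doc.ents]`.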


## Identifying people mentioned in a news article

In [72]:
print(tc)


It’s been a busy day for Facebook exec op-eds. Earlier this morning, Sheryl Sandberg broke the site’s silence around the Christchurch massacre, and now Mark Zuckerberg is calling on governments and other bodies to increase regulation around the sorts of data Facebook traffics in. He’s hoping to get out in front of heavy-handed regulation and get a seat at the table shaping it.


In [73]:
def find_persons(text):
    # create Doc object
    doc = nlp(text)
    
    # identify the persons
    persons = [ent.text for ent in doc.ents
               if ent.label_ == 'PERSON']
    
    return persons

In [76]:
print(f'The persons in the article are: {find_persons(tc)}')

The persons in the article are: ['Sheryl Sandberg', 'Mark Zuckerberg']
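
`find_persons` returns one entry per mention, so a longer article that names the same person several times would yield repeats. A minimal sketch of removing duplicates while keeping first-mention order is shown below. `unique_in_order` is a hypothetical helper name, and the list with a repeated name is made up for illustration.

```python
def unique_in_order(names):
    """Drop duplicate names while preserving the order of first appearance."""
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(names))

# Made-up example with a repeated mention
mentions = ['Sheryl Sandberg', 'Mark Zuckerberg', 'Sheryl Sandberg']
print(unique_in_order(mentions))
```

Using `set()` would also remove duplicates but would not preserve the order in which people are first mentioned.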
