**Introduction to NLP feature engineering**
___
- concepts covered
    - text preprocessing
    - basic features
    - word features
    - vectorization
___

In [None]:
#One-hot encoding

#In the previous exercise, we encountered a dataframe df1 that
#contained categorical features and was therefore unsuitable for
#ML algorithms.

#In this exercise, your task is to convert df1 into a format that is
#suitable for machine learning.

# Print the features of df1
#print(df1.columns)
#################################################
#<script.py> output:
#    Index(['feature 1', 'feature 2', 'feature 3', 'feature 4', 'feature 5', 'label'], dtype='object')
#################################################

# Perform one-hot encoding
#df1 = pd.get_dummies(df1, columns=['feature 5'])

# Print the new features of df1
#print(df1.columns)

# Print first five rows of df1
#print(df1.head())

#################################################
#Index(['feature 1', 'feature 2', 'feature 3', 'feature 4', 'label', 'feature 5_female', 'feature 5_male'], dtype='object')
#       feature 1  feature 2  feature 3  feature 4  label  feature 5_female  feature 5_male
#    0    29.0000          0          0   211.3375      1                 1               0
#    1     0.9167          1          2   151.5500      1                 0               1
#    2     2.0000          1          2   151.5500      0                 1               0
#    3    30.0000          1          2   151.5500      0                 0               1
#    4    25.0000          1          2   151.5500      0                 1               0
#################################################
#You have successfully performed one-hot encoding on this dataframe.
#Notice how the feature 5 (which represents sex) gets converted to
#two features feature 5_male and feature 5_female. With one-hot
#encoding performed, df1 only contains numerical features and can
#now be fed into any standard ML model!
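To make the transformation concrete, here is a minimal pure-Python sketch of what one-hot encoding does to a single categorical column. The rows, the column names `age` and `sex`, and the helper `one_hot` are illustrative assumptions, not part of the exercise; `pd.get_dummies` performs the equivalent step on a dataframe.

```python
# Toy rows standing in for a dataframe with one categorical column
# (the values are made up for illustration).
rows = [
    {"age": 29.0, "sex": "female"},
    {"age": 0.9, "sex": "male"},
    {"age": 2.0, "sex": "female"},
]


def one_hot(rows, column):
    """Replace `column` with one 0/1 indicator column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every field except the categorical one.
        new_row = {key: value for key, value in row.items() if key != column}
        # Add one indicator per category, named like get_dummies does.
        for category in categories:
            new_row[f"{column}_{category}"] = int(row[column] == category)
        encoded.append(new_row)
    return encoded


encoded_rows = one_hot(rows, "sex")
for row in encoded_rows:
    print(row)
```

Each row ends up with exactly one indicator set to 1, which is why the encoded columns carry the same information as the original category.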

**Basic feature extraction**
___
- number of characters
- number of words
- average word length
- special features
    - e.g., number of hashtags in a tweet
- other features
    - number of sentences
    - number of paragraphs
    - words starting with an uppercase letter
    - all-capital words
    - numeric quantities
___
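Several of the features listed above are not covered by the exercises that follow, so here is a small sketch that computes them for a single string. The function name `basic_features`, the feature names, and the sample sentence are assumptions made for illustration; words are approximated by splitting on whitespace.

```python
def basic_features(text):
    """Return simple count-based features for one string."""
    words = text.split()
    return {
        "char_count": len(text),
        "word_count": len(words),
        # Average word length; guard against empty input.
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        # Words whose first character is an uppercase letter.
        "uppercase_start_words": sum(1 for w in words if w[0].isupper()),
        # Words written entirely in capitals (single letters excluded).
        "all_caps_words": sum(1 for w in words if w.isupper() and len(w) > 1),
        # Tokens made up only of digits.
        "numeric_tokens": sum(1 for w in words if w.isdigit()),
    }


sample = "NASA launched 3 rockets from Florida in 2020"
features = basic_features(sample)
print(features)
```

On a dataframe, the same function could be applied to a text column with `.apply`, in the same way the exercises below apply `len` and `count_words`.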

In [None]:
#Character count of Russian tweets

#In this exercise, you have been given a dataframe tweets which
#contains some tweets associated with Russia's Internet Research
#Agency and compiled by FiveThirtyEight.

#Your task is to create a new feature 'char_count' in tweets which
#computes the number of characters for each tweet. Also, compute the
#average length of each tweet. The tweets are available in the
#content feature of tweets.

# Create a feature char_count
#tweets['char_count'] = tweets['content'].apply(len)

# Print the average character count
#print(tweets['char_count'].mean())

#################################################
#<script.py> output:
#    103.462
#################################################
#Notice that the average character count of these tweets is
#approximately 103, which is much higher than the overall average
#tweet length of around 40 characters. Depending on what you're
#working on, this may be something worth investigating. For
#your information, there is research that indicates that fake news
#articles tend to have longer titles! Therefore, even extremely
#basic features such as character counts can prove to be very useful
#in certain applications.

In [None]:
#Word count of TED talks

#ted is a dataframe that contains the transcripts of 500 TED talks.
#Your job is to compute a new feature word_count which contains the
#approximate number of words for each talk. Consequently, you also
#need to compute the average word count of the talks. The transcripts
#are available as the transcript feature in ted.

#In order to complete this task, you will need to define a function
#count_words that takes in a string as an argument and returns the
#number of words in the string. You will then need to apply this
#function to the transcript feature of ted to create the new feature
#word_count and compute its mean.

# Function that returns number of words in a string
#def count_words(string):
#    # Split the string into words
#    words = string.split()

#    # Return the number of words
#    return len(words)

# Create a new feature word_count
#ted['word_count'] = ted['transcript'].apply(count_words)

# Print the average word count of the talks
#print(ted['word_count'].mean())

#################################################
#<script.py> output:
#   1987.1
#################################################
#You now know how to compute the number of words in a given piece
#of text. Also, notice that the average length of a talk is close
#to 2000 words. You can use the word_count feature to compute its
#correlation with other variables such as number of views, number
#of comments, etc. and derive extremely interesting insights about
#TED.

In [None]:
#Hashtags and mentions in Russian tweets

#Let's revisit the tweets dataframe containing the Russian tweets.
#In this exercise, you will compute the number of hashtags and
#mentions in each tweet by defining two functions count_hashtags()
#and count_mentions() respectively and applying them to the content
#feature of tweets.

#In case you don't recall, the tweets are contained in the content
#feature of tweets.

# Function that returns number of hashtags in a string
#def count_hashtags(string):
#    # Split the string into words
#    words = string.split()

#    # Create a list of words that are hashtags
#    hashtags = [word for word in words if word.startswith('#')]

#    # Return number of hashtags
#    return len(hashtags)

# Create a feature hashtag_count and display distribution
#tweets['hashtag_count'] = tweets['content'].apply(count_hashtags)
#tweets['hashtag_count'].hist()
#plt.title('Hashtag count distribution')
#plt.show()

![_images/19.1.svg](_images/19.1.svg)

In [None]:
# Function that returns number of mentions in a string
#def count_mentions(string):
#    # Split the string into words
#    words = string.split()

#    # Create a list of words that are mentions
#    mentions = [word for word in words if word.startswith('@')]

#    # Return number of mentions
#    return len(mentions)

# Create a feature mention_count and display distribution
#tweets['mention_count'] = tweets['content'].apply(count_mentions)
#tweets['mention_count'].hist()
#plt.title('Mention count distribution')
#plt.show()

![_images/19.2.svg](_images/19.2.svg)
You now have a good grasp of how to compute various types of
summary features. In the next lesson, we will learn about more
advanced features that are capable of capturing more nuanced
information beyond simple word and character counts.
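One caveat with the `startswith` approach is that it counts a bare `#` as a hashtag and misses hashtags attached to preceding punctuation, such as `(#nlp)`. A regular-expression variant is sketched below. The patterns, the function names `count_hashtags_re` and `count_mentions_re`, and the sample tweet are assumptions for illustration, and they treat a tag as the marker followed by one or more word characters, which may not match Twitter's own rules exactly.

```python
import re

# The marker must not be preceded by a word character, so "bob@example.com"
# is not counted as a mention, and it must be followed by word characters.
HASHTAG_RE = re.compile(r"(?<!\w)#\w+")
MENTION_RE = re.compile(r"(?<!\w)@\w+")


def count_hashtags_re(text):
    """Count hashtag-like patterns in a string."""
    return len(HASHTAG_RE.findall(text))


def count_mentions_re(text):
    """Count mention-like patterns in a string."""
    return len(MENTION_RE.findall(text))


tweet = "Thanks @alice! (#nlp) # mail me at bob@example.com #python"
print(count_hashtags_re(tweet), count_mentions_re(tweet))
```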

**Readability tests**
___
- overview of readability tests
    - determine readability of an English passage
    - scale ranging from primary school up to college graduate level
    - a mathematical formula utilizing word, syllable, and sentence count
    - used in fake news and opinion spam detection
- readability test examples
    - **Flesch reading ease**
        - the greater the average sentence length, the harder the text is to read
        - the greater the average number of syllables in a word, the harder the text is to read
        - the higher the score, the greater the readability
        ![_images/19.1.PNG](_images/19.1.PNG)
    - **Gunning fog index**
        - developed in 1954
        - also dependent on average sentence length
        - the greater the percentage of complex words, the harder the text is to read
        - the higher the index, the lower the readability
        ![_images/19.2.PNG](_images/19.2.PNG)
    - Simple Measure of Gobbledygook (SMOG)
    - Dale-Chall score
- the textatistic library
    - not available for Anaconda/Windows
    - requires pip install (5GB Visual C++ build tools requirement)
___
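For reference, the two scores described above can be written out directly from their standard published formulas. The sketch below takes the word, sentence, syllable, and complex-word counts as inputs, since counting syllables reliably is the hard part that a library handles. The function names and the example counts are assumptions, and a library such as textatistic may count syllables or sentences slightly differently.

```python
def flesch_reading_ease(total_words, total_sentences, total_syllables):
    """Flesch reading ease: higher scores mean easier text."""
    return (206.835
            - 1.015 * (total_words / total_sentences)
            - 84.6 * (total_syllables / total_words))


def gunning_fog(total_words, total_sentences, complex_words):
    """Gunning fog index: higher values mean harder text.

    `complex_words` is the number of words with three or more syllables.
    """
    return 0.4 * ((total_words / total_sentences)
                  + 100 * (complex_words / total_words))


# Made-up counts: 100 words, 5 sentences, 150 syllables, 10 complex words.
print(round(flesch_reading_ease(100, 5, 150), 3))
print(round(gunning_fog(100, 5, 10), 3))
```

Both formulas share the average sentence length term, which is why long sentences lower the Flesch score and raise the fog index at the same time.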

In [1]:
#Readability of 'The Myth of Sisyphus'

#In this exercise, you will compute the Flesch reading ease score
#for Albert Camus' famous essay The Myth of Sisyphus. We will then
#interpret the value of this score as explained in the video and
#try to determine the reading level of the essay.

#The entire essay is in the form of a string and is available as
#sisyphus_essay.

sisyphus_essay = '\nThe gods had condemned Sisyphus to ceaselessly rolling a rock to the top of a mountain, whence the stone would fall back of its own weight. They had thought with some reason that there is no more dreadful punishment than futile and hopeless labor. If one believes Homer, Sisyphus was the wisest and most prudent of mortals. According to another tradition, however, he was disposed to practice the profession of highwayman. I see no contradiction in this. Opinions differ as to the reasons why he became the futile laborer of the underworld. To begin with, he is accused of a certain levity in regard to the gods. He stole their secrets. Egina, the daughter of Esopus, was carried off by Jupiter. The father was shocked by that disappearance and complained to Sisyphus. He, who knew of the abduction, offered to tell about it on condition that Esopus would give water to the citadel of Corinth. To the celestial thunderbolts he preferred the benediction of water. He was punished for this in the underworld. Homer tells us also that Sisyphus had put Death in chains. Pluto could not endure the sight of his deserted, silent empire. He dispatched the god of war, who liberated Death from the hands of her conqueror. It is said that Sisyphus, being near to death, rashly wanted to test his wife\'s love. He ordered her to cast his unburied body into the middle of the public square. Sisyphus woke up in the underworld. And there, annoyed by an obedience so contrary to human love, he obtained from Pluto permission to return to earth in order to chastise his wife. But when he had seen again the face of this world, enjoyed water and sun, warm stones and the sea, he no longer wanted to go back to the infernal darkness. Recalls, signs of anger, warnings were of no avail. Many years more he lived facing the curve of the gulf, the sparkling sea, and the smiles of earth. A decree of the gods was necessary. 
Mercury came and seized the impudent man by the collar and, snatching him from his joys, lead him forcibly back to the underworld, where his rock was ready for him. You have already grasped that Sisyphus is the absurd hero. He is, as much through his passions as through his torture. His scorn of the gods, his hatred of death, and his passion for life won him that unspeakable penalty in which the whole being is exerted toward accomplishing nothing. This is the price that must be paid for the passions of this earth. Nothing is told us about Sisyphus in the underworld. Myths are made for the imagination to breathe life into them. As for this myth, one sees merely the whole effort of a body straining to raise the huge stone, to roll it, and push it up a slope a hundred times over; one sees the face screwed up, the cheek tight against the stone, the shoulder bracing the clay-covered mass, the foot wedging it, the fresh start with arms outstretched, the wholly human security of two earth-clotted hands. At the very end of his long effort measured by skyless space and time without depth, the purpose is achieved. Then Sisyphus watches the stone rush down in a few moments toward tlower world whence he will have to push it up again toward the summit. He goes back down to the plain. It is during that return, that pause, that Sisyphus interests me. A face that toils so close to stones is already stone itself! I see that man going back down with a heavy yet measured step toward the torment of which he will never know the end. That hour like a breathing-space which returns as surely as his suffering, that is the hour of consciousness. At each of those moments when he leaves the heights and gradually sinks toward the lairs of the gods, he is superior to his fate. He is stronger than his rock. If this myth is tragic, that is because its hero is conscious. Where would his torture be, indeed, if at every step the hope of succeeding upheld him? 
The workman of today works everyday in his life at the same tasks, and his fate is no less absurd. But it is tragic only at the rare moments when it becomes conscious. Sisyphus, proletarian of the gods, powerless and rebellious, knows the whole extent of his wretched condition: it is what he thinks of during his descent. The lucidity that was to constitute his torture at the same time crowns his victory. There is no fate that can not be surmounted by scorn. If the descent is thus sometimes performed in sorrow, it can also take place in joy. This word is not too much. Again I fancy Sisyphus returning toward his rock, and the sorrow was in the beginning. When the images of earth cling too tightly to memory, when the call of happiness becomes too insistent, it happens that melancholy arises in man\'s heart: this is the rock\'s victory, this is the rock itself. The boundless grief is too heavy to bear. These are our nights of Gethsemane. But crushing truths perish from being acknowledged. Thus, Edipus at the outset obeys fate without knowing it. But from the moment he knows, his tragedy begins. Yet at the same moment, blind and desperate, he realizes that the only bond linking him to the world is the cool hand of a girl. Then a tremendous remark rings out: "Despite so many ordeals, my advanced age and the nobility of my soul make me conclude that all is well." Sophocles\' Edipus, like Dostoevsky\'s Kirilov, thus gives the recipe for the absurd victory. Ancient wisdom confirms modern heroism. One does not discover the absurd without being tempted to write a manual of happiness. "What!---by such narrow ways--?" There is but one world, however. Happiness and the absurd are two sons of the same earth. They are inseparable. It would be a mistake to say that happiness necessarily springs from the absurd. Discovery. It happens as well that the felling of the absurd springs from happiness. "I conclude that all is well," says Edipus, and that remark is sacred. 
It echoes in the wild and limited universe of man. It teaches that all is not, has not been, exhausted. It drives out of this world a god who had come into it with dissatisfaction and a preference for futile suffering. It makes of fate a human matter, which must be settled among men. All Sisyphus\' silent joy is contained therein. His fate belongs to him. His rock is a thing. Likewise, the absurd man, when he contemplates his torment, silences all the idols. In the universe suddenly restored to its silence, the myriad wondering little voices of the earth rise up. Unconscious, secret calls, invitations from all the faces, they are the necessary reverse and price of victory. There is no sun without shadow, and it is essential to know the night. The absurd man says yes and his efforts will henceforth be unceasing. If there is a personal fate, there is no higher destiny, or at least there is, but one which he concludes is inevitable and despicable. For the rest, he knows himself to be the master of his days. At that subtle moment when man glances backward over his life, Sisyphus returning toward his rock, in that slight pivoting he contemplates that series of unrelated actions which become his fate, created by him, combined under his memory\'s eye and soon sealed by his death. Thus, convinced of the wholly human origin of all that is human, a blind man eager to see who knows that the night has no end, he is still on the go. The rock is still rolling. I leave Sisyphus at the foot of the mountain! One always finds one\'s burden again. But Sisyphus teaches the higher fidelity that negates the gods and raises rocks. He too concludes that all is well. This universe henceforth without a master seems to him neither sterile nor futile. Each atom of that stone, each mineral flake of that night filled mountain, in itself forms a world. The struggle itself toward the heights is enough to fill a man\'s heart. One must imagine Sisyphus happy.\n'

# Import Textatistic
from textatistic import Textatistic

# Compute the readability scores
readability_scores = Textatistic(sisyphus_essay).scores

# Print the flesch reading ease score
flesch = readability_scores['flesch_score']
print("The Flesch Reading Ease is %.2f" % (flesch))

#################################################
#You now know how to compute the Flesch reading ease score for a
#given body of text. Notice that the score for this essay is
#approximately 81.67. This indicates that the essay is at the
#readability level of a 6th grade American student.

The Flesch Reading Ease is 81.67


In [2]:
#Readability of various publications

#In this exercise, you have been given excerpts of articles from
#four publications. Your task is to compute the readability of these
#excerpts using the Gunning fog index and consequently, determine
#the relative difficulty of reading these publications.

#The excerpts are available as the following strings:

#forbes- An excerpt from an article from Forbes magazine on the
#Chinese social credit score system.

#harvard_law- An excerpt from a book review published in Harvard
#Law Review.

#r_digest- An excerpt from a Reader's Digest article on flight
#turbulence.

#time_kids - An excerpt from an article on the ill effects of salt
#consumption published in TIME for Kids.

forbes = '\nThe idea is to create more transparency about companies and individuals that are breaking the law or are non-compliant with official obligations and incentivize the right behaviors with the overall goal of improving governance and market order. The Chinese Communist Party intends the social credit score system to “allow the trustworthy to roam freely under heaven while making it hard for the discredited to take a single step.” Even though the system is still under development it currently plays out in real life in myriad ways for private citizens, businesses and government officials. Generally, higher credit scores give people a variety of advantages. Individuals are often given perks such as discounted energy bills and access or better visibility on dating websites. Often, those with higher social credit scores are able to forgo deposits on rental properties, bicycles, and umbrellas. They can even get better travel deals. In addition, Chinese hospitals are currently experimenting with social credit scores. A social credit score above 650 at one hospital allows an individual to see a doctor without lining up to pay.\n'
harvard_law = '\nIn his important new book, The Schoolhouse Gate: Public Education, the Supreme Court, and the Battle for the American Mind, Professor Justin Driver reminds us that private controversies that arise within the confines of public schools are part of a broader historical arc — one that tracks a range of cultural and intellectual flashpoints in U.S. history. Moreover, Driver explains, these tensions are reflected in constitutional law, and indeed in the history and jurisprudence of the Supreme Court. As such, debates that arise in the context of public education are not simply about the conflict between academic freedom, public safety, and student rights. They mirror our persistent struggle to reconcile our interest in fostering a pluralistic society, rooted in the ideal of individual autonomy, with our desire to cultivate a sense of national unity and shared identity (or, put differently, our effort to reconcile our desire to forge common norms of citizenship with our fear of state indoctrination and overencroachment). In this regard, these debates reflect the unique role that both the school and the courts have played in defining and enforcing the boundaries of American citizenship. \n'
r_digest = '\nThis week 30 passengers were reportedly injured when a Turkish Airlines flight landing at John F. Kennedy International Airport encountered turbulent conditions. Injuries included bruises, bloody noses, and broken bones. In mid-February, a Delta Airlines flight made an emergency landing to assist three passengers in getting to the nearest hospital after some sudden and unexpected turbulence. Doctors treated 15 passengers after a flight from Miami to Buenos Aires last October for everything from severe bruising to nosebleeds after the plane caught some rough winds over Brazil. In 2016, 23 passengers were injured on a United Airlines flight after severe turbulence threw people into the cabin ceiling. The list goes on. Turbulence has been become increasingly common, with painful outcomes for those on board. And more costly to the airlines, too. Forbes estimates that the cost of turbulence has risen to over $500 million each year in damages and delays. And there are no signs the increase in turbulence will be stopping anytime soon.\n'
time_kids = '\nThat, of course, is easier said than done. The more you eat salty foods, the more you develop a taste for them. The key to changing your diet is to start small. “Small changes in sodium in foods are not usually noticed,” Quader says. Eventually, she adds, the effort will reset a kid’s taste buds so the salt cravings stop. Bridget Murphy is a dietitian at New York University’s Langone Medical Center. She suggests kids try adding spices to their food instead of salt. Eating fruits and veggies and cutting back on packaged foods will also help. Need a little inspiration? Murphy offers this tip: Focus on the immediate effects of a diet that is high in sodium. High blood pressure can make it difficult to be active. “Do you want to be able to think clearly and perform well in school?” she asks. “If you’re an athlete, do you want to run faster?” If you answered yes to these questions, then it’s time to shake the salt habit.\n'


# Import Textatistic
from textatistic import Textatistic

# List of excerpts
excerpts = [forbes, harvard_law, r_digest, time_kids]

# Loop through excerpts and compute gunning fog index
gunning_fog_scores = []
for excerpt in excerpts:
  readability_scores = Textatistic(excerpt).scores
  gunning_fog = readability_scores['gunningfog_score']
  gunning_fog_scores.append(gunning_fog)

# Print the gunning fog indices
print(gunning_fog_scores)

#################################################
#You are now adept at computing readability scores for various
#pieces of text. Notice that the Harvard Law Review excerpt has
#the highest Gunning fog index, indicating that it can be
#comprehended only by readers who have graduated college. On the
#other hand, the Time for Kids article, intended for children, has
#a much lower fog index and can be comprehended by 5th grade students.

[14.436002482929858, 20.735401069518716, 11.085587583148559, 5.926785009861934]


**Tokenization and Lemmatization**
___
- making text machine friendly
    - text preprocessing techniques
        - converting words into lowercase
        - removing leading and trailing whitespaces
        - removing punctuation
        - removing stopwords
        - expanding contractions
        - removing special characters (numbers, emojis, etc.)
- tokenization
- lemmatization
    - convert word into its base form
        - am, is, are -> be
        - reducing, reduces, reduced, reduction -> reduce
        - n't -> not
        - 've -> have
- both tokenization and lemmatization can be done using spaCy library
___
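Some of the simpler preprocessing steps listed above need only the standard library. The sketch below lowercases, strips whitespace, expands a few contractions, and removes punctuation. The `CONTRACTIONS` map and the function name `simple_preprocess` are illustrative assumptions, and the final whitespace split is only a rough stand-in for spaCy's tokenizer.

```python
import string

# Tiny illustrative contraction map; a practical one would be much larger.
CONTRACTIONS = {"can't": "cannot", "we've": "we have", "it's": "it is"}


def simple_preprocess(text):
    """Lowercase, trim, expand contractions, drop punctuation, and split."""
    text = text.strip().lower()
    # Expand contractions before punctuation removal strips the apostrophes.
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Remove ASCII punctuation characters.
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text.split()


tokens = simple_preprocess("  We've come far, but we can't stop now!  ")
print(tokens)
```

The order matters here: removing punctuation first would turn `can't` into `cant`, which the contraction map would no longer recognize.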

In [3]:
#Tokenizing the Gettysburg Address

#In this exercise, you will be tokenizing one of the most famous
#speeches of all time: the Gettysburg Address delivered by American
#President Abraham Lincoln during the American Civil War.

#The entire speech is available as a string named gettysburg.

gettysburg = "Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal. Now we're engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. We're met on a great battlefield of that war. We've come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It's altogether fitting and proper that we should do this. But, in a larger sense, we can't dedicate - we can not consecrate - we can not hallow - this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It's rather for us to be here dedicated to the great task remaining before us - that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion - that we here highly resolve that these dead shall not have died in vain - that this nation, under God, shall have a new birth of freedom - and that government of the people, by the people, for the people, shall not perish from the earth."

import spacy

# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')

# Create a Doc object
doc = nlp(gettysburg)

# Generate the tokens
tokens = [token.text for token in doc]
print(tokens)

#################################################
#You now know how to tokenize a piece of text. In the next exercise,
#we will perform similar steps and conduct lemmatization.

['Four', 'score', 'and', 'seven', 'years', 'ago', 'our', 'fathers', 'brought', 'forth', 'on', 'this', 'continent', ',', 'a', 'new', 'nation', ',', 'conceived', 'in', 'Liberty', ',', 'and', 'dedicated', 'to', 'the', 'proposition', 'that', 'all', 'men', 'are', 'created', 'equal', '.', 'Now', 'we', "'re", 'engaged', 'in', 'a', 'great', 'civil', 'war', ',', 'testing', 'whether', 'that', 'nation', ',', 'or', 'any', 'nation', 'so', 'conceived', 'and', 'so', 'dedicated', ',', 'can', 'long', 'endure', '.', 'We', "'re", 'met', 'on', 'a', 'great', 'battlefield', 'of', 'that', 'war', '.', 'We', "'ve", 'come', 'to', 'dedicate', 'a', 'portion', 'of', 'that', 'field', ',', 'as', 'a', 'final', 'resting', 'place', 'for', 'those', 'who', 'here', 'gave', 'their', 'lives', 'that', 'that', 'nation', 'might', 'live', '.', 'It', "'s", 'altogether', 'fitting', 'and', 'proper', 'that', 'we', 'should', 'do', 'this', '.', 'But', ',', 'in', 'a', 'larger', 'sense', ',', 'we', 'ca', "n't", 'dedicate', '-', 'we', '

In [5]:
#Lemmatizing the Gettysburg address

#In this exercise, we will perform lemmatization on the same
#gettysburg address from before.

#However, this time, we will also take a look at the speech, before
#and after lemmatization, and try to adjudge the kind of changes
#that take place to make the piece more machine friendly.

gettysburg = "Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal. Now we're engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. We're met on a great battlefield of that war. We've come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It's altogether fitting and proper that we should do this. But, in a larger sense, we can't dedicate - we can not consecrate - we can not hallow - this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It's rather for us to be here dedicated to the great task remaining before us - that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion - that we here highly resolve that these dead shall not have died in vain - that this nation, under God, shall have a new birth of freedom - and that government of the people, by the people, for the people, shall not perish from the earth."

# Print the gettysburg address
print(gettysburg)

import spacy

# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')

# Create a Doc object
doc = nlp(gettysburg)

# Generate lemmas
lemmas = [token.lemma_ for token in doc]

# Convert lemmas into a string
print(' '.join(lemmas))

#################################################
#You're now proficient at performing lemmatization using spaCy.
#Observe the lemmatized version of the speech. It isn't very
#readable to humans but it is in a much more convenient format for
#a machine to process.

Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal. Now we're engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. We're met on a great battlefield of that war. We've come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It's altogether fitting and proper that we should do this. But, in a larger sense, we can't dedicate - we can not consecrate - we can not hallow - this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so no

**Text cleaning**
___
- text cleaning techniques
    - unnecessary whitespaces and escape sequences
    - punctuations
    - special characters (numbers, emojis, etc.)
    - stopwords
- .isalpha() method
    - boolean
    - remove numbers, punctuation, emojis
    - caution: abbreviations (e.g., U.S.) and proper nouns containing digits or punctuation would also be removed
- stopwords
    - words that occur extremely commonly
    - e.g. articles, be verbs, pronouns, etc.
- other text preprocessing techniques
    - removing html/xml tags
    - replacing accented characters
    - correcting spelling errors
- a word of caution
    - always use only those text preprocessing techniques that are relevant to your application
___
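The behavior of `.isalpha()` described above is easy to check on a handful of tokens. In this sketch the token list and the two-word `stopwords_demo` set are made up for illustration.

```python
# Made-up tokens, including an abbreviation and a name containing a digit.
tokens = ["the", "U.S.", "economy", "grew", "3", "%", "word2vec", "is", "popular"]
stopwords_demo = {"the", "is"}

# Keep purely alphabetic tokens that are not stopwords.
kept = [t for t in tokens if t.isalpha() and t.lower() not in stopwords_demo]

# Tokens rejected by isalpha() alone, to show what gets lost.
dropped_by_isalpha = [t for t in tokens if not t.isalpha()]

print(kept)
print(dropped_by_isalpha)
```

The number and the percent sign are removed as intended, but `U.S.` and `word2vec` disappear as well, which is the caution noted in the list.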

In [6]:
#Cleaning a blog post

#In this exercise, you have been given an excerpt from a blog post.
#Your task is to clean this text into a more machine friendly
#format. This will involve converting to lowercase, lemmatization,
#and removing stopwords, punctuation, and non-alphabetic characters.

#The excerpt is available as a string blog and has been printed
#to the console. The list of stopwords is available as stopwords.

blog = '\nTwenty-first-century politics has witnessed an alarming rise of populism in the U.S. and Europe. The first warning signs came with the UK Brexit Referendum vote in 2016 swinging in the way of Leave. This was followed by a stupendous victory by billionaire Donald Trump to become the 45th President of the United States in November 2016. Since then, Europe has seen a steady rise in populist and far-right parties that have capitalized on Europe’s Immigration Crisis to raise nationalist and anti-Europe sentiments. Some instances include Alternative for Germany (AfD) winning 12.6% of all seats and entering the Bundestag, thus upsetting Germany’s political order for the first time since the Second World War, the success of the Five Star Movement in Italy and the surge in popularity of neo-nazism and neo-fascism in countries such as Hungary, Czech Republic, Poland and Austria.\n'
stopwords = ['fifteen', 'noone', 'whereupon', 'could', 'ten', 'all', 'please', 'indeed', 'whole', 'beside', 'therein', 'using', 'but', 'very', 'already', 'about', 'no', 'regarding', 'afterwards', 'front', 'go', 'in', 'make', 'three', 'here', 'what', 'without', 'yourselves', 'which', 'nothing', 'am', 'between', 'along', 'herein', 'sometimes', 'did', 'as', 'within', 'elsewhere', 'was', 'forty', 'becoming', 'how', 'will', 'other', 'bottom', 'these', 'amount', 'across', 'the', 'than', 'first', 'namely', 'may', 'none', 'anyway', 'again', 'eleven', 'his', 'meanwhile', 'name', 're', 'from', 'some', 'thru', 'upon', 'whither', 'he', 'such', 'down', 'my', 'often', 'whether', 'made', 'while', 'empty', 'two', 'latter', 'whatever', 'cannot', 'less', 'many', 'you', 'ours', 'done', 'thus', 'since', 'everything', 'for', 'more', 'unless', 'former', 'anyone', 'per', 'seeming', 'hereafter', 'on', 'yours', 'always', 'due', 'last', 'alone', 'one', 'something', 'twenty', 'until', 'latterly', 'seems', 'were', 'where', 'eight', 'ourselves', 'further', 'themselves', 'therefore', 'they', 'whenever', 'after', 'among', 'when', 'at', 'through', 'put', 'thereby', 'then', 'should', 'formerly', 'third', 'who', 'this', 'neither', 'others', 'twelve', 'also', 'else', 'seemed', 'has', 'ever', 'someone', 'its', 'that', 'does', 'sixty', 'why', 'do', 'whereas', 'are', 'either', 'hereupon', 'rather', 'because', 'might', 'those', 'via', 'hence', 'itself', 'show', 'perhaps', 'various', 'during', 'otherwise', 'thereafter', 'yourself', 'become', 'now', 'same', 'enough', 'been', 'take', 'their', 'seem', 'there', 'next', 'above', 'mostly', 'once', 'a', 'top', 'almost', 'six', 'every', 'nobody', 'any', 'say', 'each', 'them', 'must', 'she', 'throughout', 'whence', 'hundred', 'not', 'however', 'together', 'several', 'myself', 'i', 'anything', 'somehow', 'or', 'used', 'keep', 'much', 'thereupon', 'ca', 'just', 'behind', 'can', 'becomes', 'me', 'had', 'only', 'back', 'four', 'somewhere', 'if', 'by', 'whereafter', 
'everywhere', 'beforehand', 'well', 'doing', 'everyone', 'nor', 'five', 'wherein', 'so', 'amongst', 'though', 'still', 'move', 'except', 'see', 'us', 'your', 'against', 'although', 'is', 'became', 'call', 'have', 'most', 'wherever', 'few', 'out', 'whom', 'yet', 'be', 'own', 'off', 'quite', 'with', 'and', 'side', 'whoever', 'would', 'both', 'fifty', 'before', 'full', 'get', 'sometime', 'beyond', 'part', 'least', 'besides', 'around', 'even', 'whose', 'hereby', 'up', 'being', 'we', 'an', 'him', 'below', 'moreover', 'really', 'it', 'of', 'our', 'nowhere', 'whereby', 'too', 'her', 'toward', 'anyhow', 'give', 'never', 'another', 'anywhere', 'mine', 'herself', 'over', 'himself', 'to', 'onto', 'into', 'thence', 'towards', 'hers', 'nevertheless', 'serious', 'under', 'nine']

import spacy

# Load model and create Doc object
nlp = spacy.load('en_core_web_sm')
doc = nlp(blog)

# Generate lemmatized tokens
lemmas = [token.lemma_ for token in doc]

# Remove stopwords and non-alphabetic tokens
a_lemmas = [lemma for lemma in lemmas
            if lemma.isalpha() and lemma not in stopwords]

# Print string after text cleaning
print(' '.join(a_lemmas))

#################################################
#Take a look at the cleaned text; it is mostly lowercased (recent
#spaCy versions may keep proper nouns capitalized) and devoid of
#numbers, punctuation and commonly used stopwords. Also, note
#that the word U.S. was present in the original text. Since it
#contains periods, our text cleaning process removed it
#completely. This may not be ideal behavior. For more nuanced
#cases, it is advisable to write custom functions in place of
#isalpha().
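
As a minimal sketch of such a custom function, the filter below accepts purely alphabetic tokens as well as dotted abbreviations such as U.S. The name `is_word` and the regular expression are assumptions made for illustration, and the sketch assumes the tokenizer keeps U.S. as a single token.

```python
import re

# Assumption: a dotted abbreviation is two or more single letters,
# each followed by a period (e.g. "U.S." or "e.g.").
ABBREVIATION = re.compile(r"^(?:[A-Za-z]\.){2,}$")

def is_word(token):
    """Return True for alphabetic tokens and dotted abbreviations."""
    return token.isalpha() or bool(ABBREVIATION.match(token))

# Hand-picked tokens of the kind found in the blog text
tokens = ["U.S.", "Europe", "2016", "12.6", "%", "populism", "."]
kept = [t for t in tokens if is_word(t)]
print(kept)
```

In the cleaning step, `is_word(lemma)` would take the place of `lemma.isalpha()`; numbers, percentages and bare punctuation are still dropped.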



In [None]:
#Cleaning TED talks in a dataframe

#In this exercise, we will revisit the TED Talks from the first
#chapter. You have been given a dataframe ted consisting of 5
#TED Talks. Your task is to clean these talks using techniques
#discussed earlier by writing a function preprocess and applying
#it to the transcript feature of the dataframe.

#The stopwords list is available as stopwords.

# Function to preprocess text
#def preprocess(text):
    # Create Doc object
#    doc = nlp(text, disable=['ner', 'parser'])
    # Generate lemmas
#    lemmas = [token.lemma_ for token in doc]
    # Remove stopwords and non-alphabetic tokens
#    a_lemmas = [lemma for lemma in lemmas
#            if lemma.isalpha() and lemma not in stopwords]

#    return ' '.join(a_lemmas)

# Apply preprocess to ted['transcript']
#ted['transcript'] = ted['transcript'].apply(preprocess)
#print(ted['transcript'])

#################################################
#<script.py> output:
#    0     talk new lecture ted illusion create ted try r...
#    1     representation brain brain break left half log...
#    2     great honor today share digital universe creat...
#    3     passion music technology thing combination thi...
#    4     use want computer new program programming requ...
#    5     neuroscientist mixed background physics medici...
#    6     pat mitchell day january begin like work love ...
#    7     taylor wilson year old nuclear physicist littl...
#    8     grow northern ireland right north end absolute...
#    9     publish article new york times modern love col...
#    10    joseph member parliament kenya picture maasai ...
#    11    hi talk little bit music machine life specific...
#    12    hi let ask audience question lie child raise h...
#    13    historical record allow know ancient greeks dr...
#    14    good morning little boy experience change life...
#    15    slide year ago time short slide morning time w...
#    16    like world like share year old love story poor...
#    17    fail woman fail feminist passionate opinion ge...
#    18    revolution century significant longevity revol...
#    19    today baffle lady observe shell soul dwellsand...
#    Name: transcript, dtype: object
#################################################
#You have preprocessed all the TED talk transcripts contained in
#ted, and they are now in good shape for operations such as
#vectorization, which we will cover soon. You now have a good
#understanding of how text preprocessing works and why it is
#important. In the next lessons, we will move on to generating
#word-level features for our texts.

**Part-of-speech tagging**
___
- Applications
    - word-sense disambiguation
        - "the bear is a majestic animal"
        - "please bear with me"
    - sentiment analysis
    - question answering
    - fake news and opinion spam detection
- POS tagging
    - Assigning every word its corresponding part of speech
        - "Jane is an amazing guitarist"
            - Jane -> proper noun
            - is -> verb
            - an -> determiner
            - amazing -> adjective
            - guitarist -> noun
___
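
To make the word-sense disambiguation use case concrete, here is a toy sketch that looks up a sense from a (word, POS tag) pair. The `SENSES` table, its glosses and the `disambiguate` helper are hypothetical, and the tags are assigned by hand here rather than produced by a tagger.

```python
# Hypothetical sense inventory keyed by (word, POS tag); glosses are illustrative only
SENSES = {
    ("bear", "NOUN"): "a large mammal",
    ("bear", "VERB"): "to tolerate or endure",
}

def disambiguate(word, pos_tag):
    """Pick a sense for the word based on its part of speech."""
    return SENSES.get((word.lower(), pos_tag), "unknown sense")

# "the bear is a majestic animal" -> bear is used as a noun
noun_sense = disambiguate("bear", "NOUN")
# "please bear with me" -> bear is used as a verb
verb_sense = disambiguate("bear", "VERB")

print(noun_sense)
print(verb_sense)
```

Real systems need far more than a lookup table, but the sketch shows why a reliable POS tag is a useful first signal.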

In [7]:
#POS tagging in Lord of the Flies

#In this exercise, you will perform part-of-speech tagging on a
#famous passage from one of the most well-known novels of all time,
#Lord of the Flies, authored by William Golding.

#The passage is available as lotf and has already been printed to
#the console.

lotf = 'He found himself understanding the wearisomeness of this life, where every path was an improvisation and a considerable part of one’s waking life was spent watching one’s feet.'

import spacy

# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')

# Create a Doc object
doc = nlp(lotf)

# Generate tokens and pos tags
pos = [(token.text, token.pos_) for token in doc]
print(pos)

#################################################
#Examine the POS tags attached to each token and evaluate whether
#they make intuitive sense to you. Most are consistent with the
#standard rules of English grammar, but the tagger is not
#infallible: in the output below, the two occurrences of "one"
#in "one’s" receive different tags (NOUN and PRON).

[('He', 'PRON'), ('found', 'VERB'), ('himself', 'PRON'), ('understanding', 'VERB'), ('the', 'DET'), ('wearisomeness', 'NOUN'), ('of', 'ADP'), ('this', 'DET'), ('life', 'NOUN'), (',', 'PUNCT'), ('where', 'ADV'), ('every', 'DET'), ('path', 'NOUN'), ('was', 'AUX'), ('an', 'DET'), ('improvisation', 'NOUN'), ('and', 'CCONJ'), ('a', 'DET'), ('considerable', 'ADJ'), ('part', 'NOUN'), ('of', 'ADP'), ('one', 'NOUN'), ('’s', 'PART'), ('waking', 'VERB'), ('life', 'NOUN'), ('was', 'AUX'), ('spent', 'VERB'), ('watching', 'VERB'), ('one', 'PRON'), ('’s', 'PART'), ('feet', 'NOUN'), ('.', 'PUNCT')]
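
One quick way to audit output like this is to group the tags by token and list every token that received more than one tag. The sketch below copies the (token, tag) pairs printed above; the variable names are chosen for illustration.

```python
from collections import defaultdict

# (token, tag) pairs copied from the printed output
pos = [
    ('He', 'PRON'), ('found', 'VERB'), ('himself', 'PRON'), ('understanding', 'VERB'),
    ('the', 'DET'), ('wearisomeness', 'NOUN'), ('of', 'ADP'), ('this', 'DET'),
    ('life', 'NOUN'), (',', 'PUNCT'), ('where', 'ADV'), ('every', 'DET'),
    ('path', 'NOUN'), ('was', 'AUX'), ('an', 'DET'), ('improvisation', 'NOUN'),
    ('and', 'CCONJ'), ('a', 'DET'), ('considerable', 'ADJ'), ('part', 'NOUN'),
    ('of', 'ADP'), ('one', 'NOUN'), ('’s', 'PART'), ('waking', 'VERB'),
    ('life', 'NOUN'), ('was', 'AUX'), ('spent', 'VERB'), ('watching', 'VERB'),
    ('one', 'PRON'), ('’s', 'PART'), ('feet', 'NOUN'), ('.', 'PUNCT'),
]

# Collect the set of tags seen for each lowercased token
tags_by_word = defaultdict(set)
for text, tag in pos:
    tags_by_word[text.lower()].add(tag)

# Keep only tokens that were tagged in more than one way
ambiguous = {word: sorted(tags) for word, tags in tags_by_word.items() if len(tags) > 1}
print(ambiguous)
```

A token with several tags is not necessarily an error, since many words legitimately play different roles, but it is a good place to start a manual check.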


In [None]:
#Counting nouns in a piece of text

#In this exercise, we will write two functions, nouns() and
#proper_nouns(), that will count the number of other nouns and
#proper nouns in a piece of text, respectively.

#These functions will take in a piece of text and generate a list
#containing the POS tag of each token. They will then return the
#number of proper nouns/other nouns that the text contains. We will
#use these functions in the next exercise to generate interesting
#insights about fake news.

#The en_core_web_sm model has already been loaded as nlp in this
#exercise.

#nlp = spacy.load('en_core_web_sm')

# Returns number of proper nouns
#def proper_nouns(text, model=nlp):
    # Create doc object
#    doc = model(text)
    # Generate list of POS tags
#    pos = [token.pos_ for token in doc]

    # Return number of proper nouns
#    return pos.count('PROPN')

# Returns number of other nouns
#def nouns(text, model=nlp):
    # Create doc object
#    doc = model(text)
    # Generate list of POS tags
#    pos = [token.pos_ for token in doc]

    # Return number of other nouns
#    return pos.count('NOUN')

#################################################
#You now know how to write functions that compute the number of
#instances of a particular POS tag in a given piece of text. In
#the next exercise, we will use these functions to generate
#features from text in a dataframe.
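
Both functions above run the model over the same text, so when both counts are needed it may be cheaper to tag once and count several tags together. The helper below is a sketch under that assumption; `count_tags` is a hypothetical name, and it expects a list of tags such as `[token.pos_ for token in doc]`.

```python
from collections import Counter

def count_tags(pos_tags, wanted=("PROPN", "NOUN")):
    """Count each wanted POS tag in a single pass over an already-tagged text."""
    counts = Counter(pos_tags)
    # Counter returns 0 for tags that never occur
    return {tag: counts[tag] for tag in wanted}

# Hand-written tags for "Jane is an amazing guitarist" (not produced by a model)
tags = ["PROPN", "AUX", "DET", "ADJ", "NOUN"]
result = count_tags(tags)
print(result)
```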

In [None]:
#Noun usage in fake news

#In this exercise, you have been given a dataframe headlines that
#contains news headlines that are either fake or real. Your task
#is to generate two new features num_propn and num_noun that
#represent the number of proper nouns and other nouns contained
#in the title feature of headlines.

#Next, we will compute the mean number of proper nouns and other
#nouns used in fake and real news headlines and compare the values.
#If there is a remarkable difference, then there is a good chance
#that using the num_propn and num_noun features in fake news
#detectors will improve their performance.

#To accomplish this task, the functions proper_nouns and nouns
#that you had built in the previous exercise have already been
#made available to you.

#headlines['num_propn'] = headlines['title'].apply(proper_nouns)

# Compute mean of proper nouns
#real_propn = headlines[headlines['label'] == 'REAL']['num_propn'].mean()
#fake_propn = headlines[headlines['label'] == 'FAKE']['num_propn'].mean()

# Print results
#print("Mean no. of proper nouns in real and fake headlines are %.2f and %.2f respectively"%(real_propn, fake_propn))
#################################################
#<script.py> output:
#    Mean no. of proper nouns in real and fake headlines are 2.46
#    and 4.86 respectively
#################################################

#headlines['num_noun'] = headlines['title'].apply(nouns)

# Compute mean of other nouns
#real_noun = headlines[headlines['label'] == 'REAL']['num_noun'].mean()
#fake_noun = headlines[headlines['label'] == 'FAKE']['num_noun'].mean()

# Print results
#print("Mean no. of other nouns in real and fake headlines are %.2f and %.2f respectively"%(real_noun, fake_noun))
#################################################
#<script.py> output:
#    Mean no. of other nouns in real and fake headlines are 2.30
#    and 1.44 respectively
#################################################
#You now know how to construct features using POS tag information.
#Notice how the mean number of proper nouns is considerably higher
#for fake news than it is for real news (4.86 vs. 2.46). The
#opposite is true for other nouns (1.44 vs. 2.30). Differences
#like these can be put to good use in designing fake news detectors.
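
The per-label comparison can also be written as a single grouping step instead of one filter per label. The sketch below uses only the standard library, and the rows are hypothetical toy values rather than the headlines data. With pandas, `headlines.groupby('label')['num_propn'].mean()` should give the equivalent result.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy rows of (label, num_propn); NOT taken from the headlines dataframe
rows = [("REAL", 2), ("REAL", 3), ("FAKE", 5), ("FAKE", 4), ("FAKE", 6)]

# Group the feature values by label
values_by_label = defaultdict(list)
for label, num_propn in rows:
    values_by_label[label].append(num_propn)

# Compute the mean of the feature for every label in one go
means = {label: mean(values) for label, values in values_by_label.items()}
print(means)
```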

**Named entity recognition**
___
- Applications
    - efficient search algorithms
    - question answering
    - news article classification
    - customer service
- named entity recognition
    - identifying and classifying named entities into predefined categories
    - categories include person, organization, country, etc.
- a word of caution
    - spaCy's models are not perfect
    - performance is dependent on training and test data
    - train models with specialized data for nuanced cases
    - language specific
___

In [8]:
#Named entities in a sentence

#In this exercise, we will identify and classify the named
#entities in a body of text using one of spaCy's statistical
#models. We will also check whether the predicted labels are
#correct.

import spacy

# Load the required model
nlp = spacy.load('en_core_web_sm')

# Create a Doc instance
text = 'Sundar Pichai is the CEO of Google. Its headquarters is in Mountain View.'
doc = nlp(text)

# Print all named entities and their labels
for ent in doc.ents:
    print(ent.text, ent.label_)

#################################################
#Notice how the model correctly predicted the labels of Google
#and Mountain View. An earlier version of en_core_web_sm
#mislabeled Sundar Pichai as an organization; as the output below
#shows, the current version correctly labels him as a PERSON. As
#discussed in the video, the predictions of the model depend
#strongly on the data it is trained on. It is possible to train
#spaCy models on your custom data. You will learn to do this in
#more advanced NLP courses.

Sundar Pichai PERSON
Google ORG
Mountain View GPE
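
Downstream code usually wants entities grouped by category rather than as a flat list. As a small sketch, the pairs below are copied from the printed output and grouped by label; with a live Doc, the same pairs could be built from `(ent.text, ent.label_)`.

```python
from collections import defaultdict

# (text, label) pairs copied from the printed output
entities = [("Sundar Pichai", "PERSON"), ("Google", "ORG"), ("Mountain View", "GPE")]

# Group entity texts under their label
by_label = defaultdict(list)
for text, label in entities:
    by_label[label].append(text)

print(dict(by_label))
```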


In [9]:
#Identifying people mentioned in a news article

#In this exercise, you have been given an excerpt from a news
#article published in TechCrunch. Your task is to write a
#function find_persons that identifies the names of people
#mentioned in a piece of text. You will then use find_persons
#to identify the people of interest in the article.

#The article is available as the string tc and has been printed
#to the console. The required spaCy model has also already been
#loaded as nlp.

import spacy

# Load the required model
nlp = spacy.load('en_core_web_sm')

tc = "\nIt’s' been a busy day for Facebook  exec op-eds. Earlier this morning, Sheryl Sandberg broke the site’s silence around the Christchurch massacre, and now Mark Zuckerberg is calling on governments and other bodies to increase regulation around the sorts of data Facebook traffics in. He’s hoping to get out in front of heavy-handed regulation and get a seat at the table shaping it.\n"

def find_persons(text):
  # Create Doc object
  doc = nlp(text)

  # Identify the persons
  persons = [ent.text for ent in doc.ents if ent.label_ == 'PERSON']

  # Return persons
  return persons

print(find_persons(tc))

#################################################
#The article was about Facebook, and our function correctly
#identified both of the people mentioned. You can now see how NER
#could be used in a variety of applications. Publishers may use
#a technique like this to classify news articles by the people
#mentioned in them. A question answering system could also use
#something like this to answer questions such as 'Who are the
#people mentioned in this passage?'. This brings us to the end
#of this chapter. In the next chapter, we will learn how to
#vectorize documents.

['Sheryl Sandberg', 'Mark Zuckerberg']
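
To illustrate how a publisher might classify articles by the people mentioned in them, the sketch below builds a small index from person to article. The article IDs are hypothetical, and the second article is a made-up placeholder; only the first list comes from the output above.

```python
from collections import defaultdict

# People found per article; in practice each list would come from find_persons(text)
people_by_article = {
    "tc": ["Sheryl Sandberg", "Mark Zuckerberg"],  # from the output above
    "other": ["Mark Zuckerberg"],                   # hypothetical placeholder article
}

# Invert the mapping: person -> set of article IDs mentioning them
articles_by_person = defaultdict(set)
for article_id, people in people_by_article.items():
    for person in people:
        articles_by_person[person].add(article_id)

print(sorted(articles_by_person["Mark Zuckerberg"]))
print(sorted(articles_by_person["Sheryl Sandberg"]))
```

Using a set per person means repeated mentions within one article are recorded only once.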
