# Feature Engineering for NLP in Python
## 1. Basic features and readability scores
Learn to compute basic features such as number of words, number of characters, average word length and number of special characters (such as Twitter hashtags and mentions). You will also learn to compute readability scores and determine the amount of education required to comprehend a piece of text.

### Introduction to NLP feature engineering
Features should be numerical.
Categorical features can be converted to numbers with one-hot encoding:

| Sex    | one-hot encoding | sex_female | sex_male |
|--------|------------------|------------|----------|
| female | ->               | 1          | 0        |
| male   | ->               | 0          | 1        |
| female | ->               | 1          | 0        |
| ...    | ...              | ...        | ...      |
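To make the mapping in the table concrete, here is a minimal pure-Python sketch of the idea. The helper name `one_hot` is hypothetical and only illustrates the encoding; it is not how pandas implements it.

```python
def one_hot(values, prefix):
    """Return one dict per value with a 0/1 flag for every category."""
    categories = sorted(set(values))
    return [
        {f"{prefix}_{cat}": int(value == cat) for cat in categories}
        for value in values
    ]

rows = one_hot(["female", "male", "female"], prefix="sex")
for row in rows:
    print(row)
```

Each row gets exactly one flag set to 1, which is where the name "one-hot" comes from.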

#### One-hot encoding with pandas

In [4]:
# Import the pandas library
import pandas as pd
# initialize list of lists 
data = [['male', 10], ['female', 15], ['male', 14]] 
  
# Create the pandas DataFrame 
df = pd.DataFrame(data, columns = ['sex', 'age'])

# Perform one-hot encoding on the 'sex' feature of df
# dtype=int keeps the new columns as 0/1 (recent pandas versions default to booleans)
df = pd.get_dummies(df, columns = ['sex'], dtype = int)
print(df)

   age  sex_female  sex_male
0   10           0         1
1   15           1         0
2   14           0         1


#### Textual dataset
Movie review dataset

| review                    | class    |
|---------------------------|----------|
| This movie ...            | positive |
| The movie is forgettable. | negative |
| A truly amazing ...       | positive |
| ...                       | ...      |

1. Text pre-processing
- Converting to lowercase
  - **Example**: `Reduction` to `reduction`
- Converting to base-form
  - **Example**: `reduction` to `reduce`

2. Vectorization
After processing, the data looks like this:

| 0    | 1    | ... | n    | class    |
|------|------|-----|------|----------|
| 0.03 | 0.71 | ... | 0.22 | positive |
| 0.45 | 0.00 | ... | 0.19 | negative |
| 0.14 | 0.03 | ... | 0.45 | positive |

3. Basic features
- Number of words
- Number of characters
- Average length of words

Word features:
- POS tagging (Pronoun, Verb, Article, Noun ...)
- NER (Person, Organization ...)
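The vectorization step listed above turns each document into a row of numbers. As a rough sketch of the idea (not the vectorizer the course uses), relative word frequencies can be computed with the standard library. The function name `term_frequencies` and the two sample documents are assumptions made for illustration.

```python
from collections import Counter

def term_frequencies(docs):
    """Map each document to a vector of relative word frequencies."""
    tokenized = [doc.lower().split() for doc in docs]
    # One vector position per distinct word across all documents
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] / len(tokens) for word in vocabulary])
    return vocabulary, vectors

vocabulary, vectors = term_frequencies(["good movie", "bad movie bad plot"])
print(vocabulary)
print(vectors)
```

Every document ends up with the same number of columns, one per vocabulary word, which is the shape shown in the table.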

#### Course concept
- Text processing
- Basic features
- Word Features
- Vectorization

### Basic feature extraction
#### Number of characters


In [7]:
# Compute the number of characters
text = "I don't know."
num_char = len(text)
print(num_char)

data = [['male', 10], ['female', 15], ['male', 14]] 
df = pd.DataFrame(data, columns = ['sex', 'age'])

df['num_char'] = df['sex'].apply(len)
print(df)

13
      sex  age  num_char
0    male   10         4
1  female   15         6
2    male   14         4


#### Number of words

In [9]:
# Split the string into words
text = "Mary had a little lamb."
words = text.split()
print(words)
print(len(words))

['Mary', 'had', 'a', 'little', 'lamb.']
5


In [13]:
# Function that returns number of words in string
def word_count(string):
    # Split the argument itself, not the global variable text
    words = string.split()
    return len(words)

data = [["Mary had a little lamb.", 10], ["Mary had a little lamb.", 15]] 
df = pd.DataFrame(data, columns = ['review', 'age'])

df['num_words'] = df['review'].apply(word_count)
print(df)

                    review  age  num_words
0  Mary had a little lamb.   10          5
1  Mary had a little lamb.   15          5
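Note that `split()` only splits on whitespace, so punctuation stays attached to words (`'lamb.'` above) and stand-alone symbols count as words. A minimal alternative sketch, assuming that a "word" is a run of letters and apostrophes; the name `word_count_regex` is hypothetical.

```python
import re

def word_count_regex(string):
    """Count runs of letters and apostrophes, ignoring other punctuation."""
    return len(re.findall(r"[A-Za-z']+", string))

sample = "Well... yes - no!"
print(len(sample.split()))        # whitespace split counts the lone "-"
print(word_count_regex(sample))   # regex counts only Well, yes, no
```

Which definition is appropriate depends on the task; for rough features the whitespace split is usually good enough.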


#### Average word length

In [14]:
def avg_word_length(x):
    # Split the argument itself, not the global variable text
    words = x.split()
    word_lengths = [len(word) for word in words]
    return sum(word_lengths)/len(words)

data = [["Mary had a little lamb.", 10], ["Mary had a little lamb.", 15]] 
df = pd.DataFrame(data, columns = ['review', 'age'])

df['avg_word_length'] = df['review'].apply(avg_word_length)
print(df)

                    review  age  avg_word_length
0  Mary had a little lamb.   10              3.8
1  Mary had a little lamb.   15              3.8


#### Hashtags and mentions

In [16]:
def hashtag_count(string):
    # Split the string into words
    words = string.split()
    # Create a list of hashtags
    hashtags = [word for word in words if word.startswith('#')]
    return len(hashtags)

print(hashtag_count("My first tweet! #FirstTweet"))

1
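Mentions can be counted the same way by checking for a leading `@`. This is a sketch that mirrors `hashtag_count`; the function name `mention_count` and the handles in the sample tweet are made up for illustration.

```python
def mention_count(string):
    """Count whitespace-separated words that start with '@'."""
    words = string.split()
    mentions = [word for word in words if word.startswith('@')]
    return len(mentions)

print(mention_count("@alice and @bob, welcome! #FirstTweet"))
```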


#### Other features
- Number of sentences
- Number of paragraphs
- Words starting with an uppercase letter
- All-capital words
- Numeric quantities
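The features in this list can be approximated with a few lines of standard-library Python. This is a naive sketch under stated assumptions: sentences end at `.`, `!` or `?`, paragraphs are separated by blank lines, and the function name `other_features` is hypothetical.

```python
import re

def other_features(text):
    """Compute rough counts for the features listed above."""
    words = text.split()
    return {
        # Naive: every run of ., ! or ? is treated as a sentence end
        "num_sentences": len(re.findall(r"[.!?]+", text)),
        # Paragraphs are assumed to be separated by a blank line
        "num_paragraphs": len([p for p in text.split("\n\n") if p.strip()]),
        "num_capitalized": sum(word[0].isupper() for word in words),
        "num_all_caps": sum(word.isupper() and len(word) > 1 for word in words),
        "num_numeric": len(re.findall(r"\d+(?:\.\d+)?", text)),
    }

features = other_features("NASA launched 2 probes. Both work!\n\nGreat news.")
print(features)
```

The sentence rule miscounts abbreviations and decimals such as `3.5`; a tokenizer-based approach would be more robust.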

In [17]:
data = [["Mary had a little lamb."], ["Nick had a big dog."]] 
df = pd.DataFrame(data, columns = ['content'])

df['char_count'] = df['content'].apply(len)
print(df['char_count'].mean())

21.0


### Readability tests
- Determine readability of an English passage
- Scale ranging from primary school up to college graduate level
- A mathematical formula utilizing word, syllable and sentence count
- Used in fake news and opinion spam detection

#### Readability test examples (English text only)
- **Flesch reading ease**
- **Gunning fog index**
- Simple Measure of Gobbledygook (SMOG)
- Dale-Chall score

#### Flesch reading ease
- One of the oldest and most widely used tests
- Dependent on two factors:
  - **The greater the average sentence length, the harder the text is to read**
    - "This is a short sentence."
    - "This is a longer sentence with more words and it is harder to follow than the first sentence."
  - **The greater the average number of syllables in a word, the harder the text is to read**
    - "I live in my home."
    - "I reside in my domicile."
- The higher the score, the greater the readability
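The two factors above enter the standard Flesch reading ease formula, 206.835 - 1.015 × (words / sentences) - 84.6 × (syllables / words). The sketch below applies it to the two example sentences. The syllable counts are hand-counted assumptions, and the function name `flesch_reading_ease` is chosen for illustration.

```python
def flesch_reading_ease(num_words, num_sentences, num_syllables):
    """Standard Flesch reading ease formula computed from raw counts."""
    avg_sentence_length = num_words / num_sentences
    avg_syllables_per_word = num_syllables / num_words
    return 206.835 - 1.015 * avg_sentence_length - 84.6 * avg_syllables_per_word

# "I live in my home."       -> 5 words, 1 sentence, 5 syllables (hand-counted)
# "I reside in my domicile." -> 5 words, 1 sentence, 8 syllables (hand-counted)
easy = flesch_reading_ease(5, 1, 5)
hard = flesch_reading_ease(5, 1, 8)
print(round(easy, 2), round(hard, 2))
```

Both sentences have the same length, so the lower score of the second comes entirely from its longer words.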

#### Gunning fog index
- Developed in 1952
- Also dependent on average sentence length
- The greater the percentage of complex words (3 or more syllables), the harder the text is to read
- The higher the index, the lower the readability
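The usual formulation of the index is 0.4 × (average sentence length + percentage of complex words). A minimal sketch from raw counts follows; the function name `gunning_fog` and the sample counts are assumptions for illustration.

```python
def gunning_fog(num_words, num_sentences, num_complex_words):
    """Gunning fog index computed from raw counts."""
    avg_sentence_length = num_words / num_sentences
    percent_complex = 100 * num_complex_words / num_words
    return 0.4 * (avg_sentence_length + percent_complex)

# Hypothetical passage: 100 words, 5 sentences, 10 words with 3+ syllables
fog = gunning_fog(num_words=100, num_sentences=5, num_complex_words=10)
print(fog)
```

The result is commonly read as the approximate number of years of formal education needed to understand the text on a first reading.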

#### The textatistic library

In [6]:
from textatistic import Textatistic

text = '\nThe gods had condemned Sisyphus to ceaselessly rolling a rock to the top of a mountain, whence the stone would fall back of its own weight. They had thought with some reason that there is no more dreadful punishment than futile and hopeless labor. If one believes Homer, Sisyphus was the wisest and most prudent of mortals. According to another tradition, however, he was disposed to practice the profession of highwayman. I see no contradiction in this. Opinions differ as to the reasons why he became the futile laborer of the underworld. To begin with, he is accused of a certain levity in regard to the gods. He stole their secrets. Egina, the daughter of Esopus, was carried off by Jupiter. The father was shocked by that disappearance and complained to Sisyphus. He, who knew of the abduction, offered to tell about it on condition that Esopus would give water to the citadel of Corinth. To the celestial thunderbolts he preferred the benediction of water. He was punished for this in the underworld. Homer tells us also that Sisyphus had put Death in chains. Pluto could not endure the sight of his deserted, silent empire. He dispatched the god of war, who liberated Death from the hands of her conqueror. It is said that Sisyphus, being near to death, rashly wanted to test his wife\'s love. He ordered her to cast his unburied body into the middle of the public square. Sisyphus woke up in the underworld. And there, annoyed by an obedience so contrary to human love, he obtained from Pluto permission to return to earth in order to chastise his wife. But when he had seen again the face of this world, enjoyed water and sun, warm stones and the sea, he no longer wanted to go back to the infernal darkness. Recalls, signs of anger, warnings were of no avail. Many years more he lived facing the curve of the gulf, the sparkling sea, and the smiles of earth. A decree of the gods was necessary. 
Mercury came and seized the impudent man by the collar and, snatching him from his joys, lead him forcibly back to the underworld, where his rock was ready for him. You have already grasped that Sisyphus is the absurd hero. He is, as much through his passions as through his torture. His scorn of the gods, his hatred of death, and his passion for life won him that unspeakable penalty in which the whole being is exerted toward accomplishing nothing. This is the price that must be paid for the passions of this earth. Nothing is told us about Sisyphus in the underworld. Myths are made for the imagination to breathe life into them. As for this myth, one sees merely the whole effort of a body straining to raise the huge stone, to roll it, and push it up a slope a hundred times over; one sees the face screwed up, the cheek tight against the stone, the shoulder bracing the clay-covered mass, the foot wedging it, the fresh start with arms outstretched, the wholly human security of two earth-clotted hands. At the very end of his long effort measured by skyless space and time without depth, the purpose is achieved. Then Sisyphus watches the stone rush down in a few moments toward tlower world whence he will have to push it up again toward the summit. He goes back down to the plain. It is during that return, that pause, that Sisyphus interests me. A face that toils so close to stones is already stone itself! I see that man going back down with a heavy yet measured step toward the torment of which he will never know the end. That hour like a breathing-space which returns as surely as his suffering, that is the hour of consciousness. At each of those moments when he leaves the heights and gradually sinks toward the lairs of the gods, he is superior to his fate. He is stronger than his rock. If this myth is tragic, that is because its hero is conscious. Where would his torture be, indeed, if at every step the hope of succeeding upheld him? 
The workman of today works everyday in his life at the same tasks, and his fate is no less absurd. But it is tragic only at the rare moments when it becomes conscious. Sisyphus, proletarian of the gods, powerless and rebellious, knows the whole extent of his wretched condition: it is what he thinks of during his descent. The lucidity that was to constitute his torture at the same time crowns his victory. There is no fate that can not be surmounted by scorn. If the descent is thus sometimes performed in sorrow, it can also take place in joy. This word is not too much. Again I fancy Sisyphus returning toward his rock, and the sorrow was in the beginning. When the images of earth cling too tightly to memory, when the call of happiness becomes too insistent, it happens that melancholy arises in man\'s heart: this is the rock\'s victory, this is the rock itself. The boundless grief is too heavy to bear. These are our nights of Gethsemane. But crushing truths perish from being acknowledged. Thus, Edipus at the outset obeys fate without knowing it. But from the moment he knows, his tragedy begins. Yet at the same moment, blind and desperate, he realizes that the only bond linking him to the world is the cool hand of a girl. Then a tremendous remark rings out: "Despite so many ordeals, my advanced age and the nobility of my soul make me conclude that all is well." Sophocles\' Edipus, like Dostoevsky\'s Kirilov, thus gives the recipe for the absurd victory. Ancient wisdom confirms modern heroism. One does not discover the absurd without being tempted to write a manual of happiness. "What!---by such narrow ways--?" There is but one world, however. Happiness and the absurd are two sons of the same earth. They are inseparable. It would be a mistake to say that happiness necessarily springs from the absurd. Discovery. It happens as well that the felling of the absurd springs from happiness. "I conclude that all is well," says Edipus, and that remark is sacred. 
It echoes in the wild and limited universe of man. It teaches that all is not, has not been, exhausted. It drives out of this world a god who had come into it with dissatisfaction and a preference for futile suffering. It makes of fate a human matter, which must be settled among men. All Sisyphus\' silent joy is contained therein. His fate belongs to him. His rock is a thing. Likewise, the absurd man, when he contemplates his torment, silences all the idols. In the universe suddenly restored to its silence, the myriad wondering little voices of the earth rise up. Unconscious, secret calls, invitations from all the faces, they are the necessary reverse and price of victory. There is no sun without shadow, and it is essential to know the night. The absurd man says yes and his efforts will henceforth be unceasing. If there is a personal fate, there is no higher destiny, or at least there is, but one which he concludes is inevitable and despicable. For the rest, he knows himself to be the master of his days. At that subtle moment when man glances backward over his life, Sisyphus returning toward his rock, in that slight pivoting he contemplates that series of unrelated actions which become his fate, created by him, combined under his memory\'s eye and soon sealed by his death. Thus, convinced of the wholly human origin of all that is human, a blind man eager to see who knows that the night has no end, he is still on the go. The rock is still rolling. I leave Sisyphus at the foot of the mountain! One always finds one\'s burden again. But Sisyphus teaches the higher fidelity that negates the gods and raises rocks. He too concludes that all is well. This universe henceforth without a master seems to him neither sterile nor futile. Each atom of that stone, each mineral flake of that night filled mountain, in itself forms a world. The struggle itself toward the heights is enough to fill a man\'s heart. One must imagine Sisyphus happy.\n'
readability_scores = Textatistic(text).scores

# Generate scores
print(readability_scores['flesch_score'])
print(readability_scores['gunningfog_score'])

81.67466335836913
7.913698140200286


#### Example: Readability of various publications
In this exercise, you have been given excerpts of articles from four publications. Your task is to compute the readability of these excerpts using the Gunning fog index and consequently, determine the relative difficulty of reading these publications.

The excerpts are available as the following strings:

- `forbes`: An excerpt from an article from Forbes magazine on the Chinese social credit score system.
- `harvard_law`: An excerpt from a book review published in Harvard Law Review.
- `r_digest`: An excerpt from a Reader's Digest article on flight turbulence.
- `time_kids`: An excerpt from an article on the ill effects of salt consumption published in TIME for Kids.

In [7]:
from textatistic import Textatistic

forbes = '\nThe idea is to create more transparency about companies and individuals that are breaking the law or are non-compliant with official obligations and incentivize the right behaviors with the overall goal of improving governance and market order. The Chinese Communist Party intends the social credit score system to “allow the trustworthy to roam freely under heaven while making it hard for the discredited to take a single step.” Even though the system is still under development it currently plays out in real life in myriad ways for private citizens, businesses and government officials. Generally, higher credit scores give people a variety of advantages. Individuals are often given perks such as discounted energy bills and access or better visibility on dating websites. Often, those with higher social credit scores are able to forgo deposits on rental properties, bicycles, and umbrellas. They can even get better travel deals. In addition, Chinese hospitals are currently experimenting with social credit scores. A social credit score above 650 at one hospital allows an individual to see a doctor without lining up to pay.\n'
harvard_law = '\nIn his important new book, The Schoolhouse Gate: Public Education, the Supreme Court, and the Battle for the American Mind, Professor Justin Driver reminds us that private controversies that arise within the confines of public schools are part of a broader historical arc — one that tracks a range of cultural and intellectual flashpoints in U.S. history. Moreover, Driver explains, these tensions are reflected in constitutional law, and indeed in the history and jurisprudence of the Supreme Court. As such, debates that arise in the context of public education are not simply about the conflict between academic freedom, public safety, and student rights. They mirror our persistent struggle to reconcile our interest in fostering a pluralistic society, rooted in the ideal of individual autonomy, with our desire to cultivate a sense of national unity and shared identity (or, put differently, our effort to reconcile our desire to forge common norms of citizenship with our fear of state indoctrination and overencroachment). In this regard, these debates reflect the unique role that both the school and the courts have played in defining and enforcing the boundaries of American citizenship. \n'
r_digest = '\nThis week 30 passengers were reportedly injured when a Turkish Airlines flight landing at John F. Kennedy International Airport encountered turbulent conditions. Injuries included bruises, bloody noses, and broken bones. In mid-February, a Delta Airlines flight made an emergency landing to assist three passengers in getting to the nearest hospital after some sudden and unexpected turbulence. Doctors treated 15 passengers after a flight from Miami to Buenos Aires last October for everything from severe bruising to nosebleeds after the plane caught some rough winds over Brazil. In 2016, 23 passengers were injured on a United Airlines flight after severe turbulence threw people into the cabin ceiling. The list goes on. Turbulence has been become increasingly common, with painful outcomes for those on board. And more costly to the airlines, too. Forbes estimates that the cost of turbulence has risen to over $500 million each year in damages and delays. And there are no signs the increase in turbulence will be stopping anytime soon.\n'
time_kids = '\nThat, of course, is easier said than done. The more you eat salty foods, the more you develop a taste for them. The key to changing your diet is to start small. “Small changes in sodium in foods are not usually noticed,” Quader says. Eventually, she adds, the effort will reset a kid’s taste buds so the salt cravings stop. Bridget Murphy is a dietitian at New York University’s Langone Medical Center. She suggests kids try adding spices to their food instead of salt. Eating fruits and veggies and cutting back on packaged foods will also help. Need a little inspiration? Murphy offers this tip: Focus on the immediate effects of a diet that is high in sodium. High blood pressure can make it difficult to be active. “Do you want to be able to think clearly and perform well in school?” she asks. “If you’re an athlete, do you want to run faster?” If you answered yes to these questions, then it’s time to shake the salt habit.\n'

# List of excerpts
excerpts = [forbes, harvard_law, r_digest, time_kids]

# Loop through excerpts and compute gunning fog index
gunning_fog_scores = []
for excerpt in excerpts:
  readability_scores = Textatistic(excerpt).scores
  gunning_fog = readability_scores['gunningfog_score']
  gunning_fog_scores.append(gunning_fog)

# Print the gunning fog indices
print(gunning_fog_scores)

[14.436002482929858, 20.735401069518716, 11.085587583148559, 5.926785009861934]
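The printed list is easier to interpret when each score is paired with its publication. The sketch below reuses the scores printed above and ranks them from hardest to easiest; the variable names `publications` and `ranked` are introduced for illustration.

```python
publications = ['forbes', 'harvard_law', 'r_digest', 'time_kids']
gunning_fog_scores = [14.436002482929858, 20.735401069518716,
                      11.085587583148559, 5.926785009861934]

# Higher Gunning fog index means harder to read
ranked = sorted(zip(publications, gunning_fog_scores),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.2f}")
```

The Harvard Law Review excerpt is the hardest to read and the TIME for Kids excerpt the easiest, which matches the intended audiences.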


## 2. Text preprocessing, POS tagging and NER
In this chapter, you will learn about tokenization and lemmatization. You will then learn how to perform text cleaning, part-of-speech tagging, and named entity recognition using the spaCy library. Upon mastering these concepts, you will proceed to make the Gettysburg address machine-friendly, analyze noun usage in fake news, and identify people mentioned in a TechCrunch article.

### Tokenization and Lemmatization
#### Making text machine friendly
- Dogs, dog
- reduction, REDUCING, Reduce
- don't, do not

#### Text preprocessing techniques
- Converting words into lowercase
- Removing leading and trailing whitespaces
- Removing punctuation
- Removing stopwords
- Expanding contractions
- Removing special characters (numbers, emojis, etc.)

#### Tokenization
"I have a dog. His name is Hachi"
**Tokens:**
`["I", "have", "a", "dog", ".", "His", "name", "is", "Hachi"]`

"Don't do this."
**Tokens:**
`["Do", "n't", "do", "this", "."]`
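A naive regex tokenizer shows why contractions need special handling. This is a sketch for comparison only, with the hypothetical name `simple_tokenize`; it is not how spaCy tokenizes.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

first = simple_tokenize("I have a dog. His name is Hachi")
second = simple_tokenize("Don't do this.")
print(first)
print(second)
```

The first sentence comes out as in the example above, but `Don't` is broken into `Don`, `'` and `t` rather than `Do` and `n't`. Handling such cases is one reason to rely on a trained tokenizer.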

#### Tokenization using spaCy

In [10]:
import spacy

nlp = spacy.load('en_core_web_sm')
string = "Hello! I don't know what I'm doing here"
doc = nlp(string)
tokens = [token.text for token in doc]
print(tokens)

['Hello', '!', 'I', 'do', "n't", 'know', 'what', 'I', "'m", 'doing', 'here']


#### Lemmatization
- Convert word into its base form
  - reducing, reduces, reduced, reduction -> reduce
  - am, are, is -> be
  - n't -> not
  - 've -> have
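Conceptually, lemmatization maps each token to a base form. The toy lookup below uses only the mappings listed above to illustrate the idea; the names `LEMMA_LOOKUP` and `lookup_lemma` are hypothetical, and real lemmatizers rely on vocabulary and grammatical context rather than a hand-written table.

```python
# Toy table built from the examples above; not a real lemmatizer
LEMMA_LOOKUP = {
    "reducing": "reduce", "reduces": "reduce", "reduced": "reduce",
    "reduction": "reduce",
    "am": "be", "are": "be", "is": "be",
    "n't": "not", "'ve": "have",
}

def lookup_lemma(token):
    """Return the base form if known, otherwise the lowercased token."""
    lowered = token.lower()
    return LEMMA_LOOKUP.get(lowered, lowered)

toy_lemmas = [lookup_lemma(t) for t in ["Reduced", "is", "n't", "dog"]]
print(toy_lemmas)
```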

#### Lemmatization using spaCy
spaCy generates lemmas by default.

In [11]:
import spacy

nlp = spacy.load('en_core_web_sm')
string = "Hello! I don't know what I'm doing here"
doc = nlp(string)

lemmas = [token.lemma_ for token in doc]
print(lemmas)

['hello', '!', '-PRON-', 'do', 'not', 'know', 'what', '-PRON-', 'be', 'do', 'here']


#### Example: Tokenizing the Gettysburg Address
In this exercise, you will be tokenizing one of the most famous speeches of all time: the Gettysburg Address delivered by American President Abraham Lincoln during the American Civil War.

The entire speech is available as a string named gettysburg.

In [12]:
import spacy

gettysburg = "Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal. Now we're engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. We're met on a great battlefield of that war. We've come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It's altogether fitting and proper that we should do this. But, in a larger sense, we can't dedicate - we can not consecrate - we can not hallow - this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It's rather for us to be here dedicated to the great task remaining before us - that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion - that we here highly resolve that these dead shall not have died in vain - that this nation, under God, shall have a new birth of freedom - and that government of the people, by the people, for the people, shall not perish from the earth."

# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')

# Create a Doc object
doc = nlp(gettysburg)

# Generate the tokens
tokens = [token.text for token in doc]
print(tokens)

['Four', 'score', 'and', 'seven', 'years', 'ago', 'our', 'fathers', 'brought', 'forth', 'on', 'this', 'continent', ',', 'a', 'new', 'nation', ',', 'conceived', 'in', 'Liberty', ',', 'and', 'dedicated', 'to', 'the', 'proposition', 'that', 'all', 'men', 'are', 'created', 'equal', '.', 'Now', 'we', "'re", 'engaged', 'in', 'a', 'great', 'civil', 'war', ',', 'testing', 'whether', 'that', 'nation', ',', 'or', 'any', 'nation', 'so', 'conceived', 'and', 'so', 'dedicated', ',', 'can', 'long', 'endure', '.', 'We', "'re", 'met', 'on', 'a', 'great', 'battlefield', 'of', 'that', 'war', '.', 'We', "'ve", 'come', 'to', 'dedicate', 'a', 'portion', 'of', 'that', 'field', ',', 'as', 'a', 'final', 'resting', 'place', 'for', 'those', 'who', 'here', 'gave', 'their', 'lives', 'that', 'that', 'nation', 'might', 'live', '.', 'It', "'s", 'altogether', 'fitting', 'and', 'proper', 'that', 'we', 'should', 'do', 'this', '.', 'But', ',', 'in', 'a', 'larger', 'sense', ',', 'we', 'ca', "n't", 'dedicate', '-', 'we', '

In [13]:
import spacy

# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')

# Create a Doc object
doc = nlp(gettysburg)

# Generate lemmas
lemmas = [token.lemma_ for token in doc]

# Convert lemmas into a string
print(' '.join(lemmas))

four score and seven year ago -PRON- father bring forth on this continent , a new nation , conceive in Liberty , and dedicate to the proposition that all man be create equal . now -PRON- be engage in a great civil war , test whether that nation , or any nation so conceived and so dedicated , can long endure . -PRON- be meet on a great battlefield of that war . -PRON- have come to dedicate a portion of that field , as a final resting place for those who here give -PRON- life that that nation might live . -PRON- be altogether fitting and proper that -PRON- should do this . but , in a large sense , -PRON- can not dedicate - -PRON- can not consecrate - -PRON- can not hallow - this ground . the brave man , living and dead , who struggle here , have consecrate -PRON- , far above -PRON- poor power to add or detract . the world will little note , nor long remember what -PRON- say here , but -PRON- can never forget what -PRON- do here . -PRON- be for -PRON- the living , rather , to be dedicate 

### Text cleaning
#### Text cleaning techniques
- Unnecessary whitespaces and escape sequences
- Punctuation
- Special characters (numbers, emojis, etc.)
- Stopwords

In [16]:
print("Dog".isalpha())
print("3dogs".isalpha())
print("1233456".isalpha())
print("U.S.A".isalpha())

True
False
False
False


#### A word of caution
- Abbreviations: U.S.A, U.K., etc.
- Proper nouns: word2vec and xto10x
- Write your own custom function (using regex) for the more nuanced cases.
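As one possible shape for such a custom function, the sketch below keeps dotted abbreviations and names that mix letters and digits, while still rejecting pure numbers. The pattern and the name `is_wordlike` are assumptions for illustration and would need tuning for a real corpus.

```python
import re

# Dotted abbreviations (U.S.A, U.K.) or a letter followed by letters/digits (word2vec)
KEEP_PATTERN = re.compile(r"(?:[A-Za-z]\.){2,}[A-Za-z]?|[A-Za-z][A-Za-z0-9]*")

def is_wordlike(token):
    """Return True if the whole token matches one of the allowed shapes."""
    return KEEP_PATTERN.fullmatch(token) is not None

for token in ["Dog", "U.S.A", "U.K.", "word2vec", "xto10x", "3dogs", "1233456"]:
    print(token, is_wordlike(token))
```

Unlike `isalpha()`, this accepts `U.S.A`, `U.K.`, `word2vec` and `xto10x`, while `3dogs` and `1233456` are still rejected.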

#### Removing non-alphabetical characters

In [20]:
string = """OMG!!! This is like      the best thing ever \t\n.
Wow, such an amazing song! I'm hooked. Top 5 definitely. ?"""

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(string)
lemmas = [token.lemma_ for token in doc]
print(lemmas)

# Remove tokens that are not alphabetic
a_lemmas = [lemma for lemma in lemmas if lemma.isalpha() or lemma == '-PRON-']
print(' '.join(a_lemmas))

['OMG', '!', '!', '!', 'this', 'be', 'like', '     ', 'the', 'good', 'thing', 'ever', '\t\n', '.', '\n', 'wow', ',', 'such', 'an', 'amazing', 'song', '!', '-PRON-', 'be', 'hook', '.', 'Top', '5', 'definitely', '.', '?']
OMG this be like the good thing ever wow such an amazing song -PRON- be hook Top definitely


#### Stopwords
- Words that occur extremely commonly
- E.g. articles, forms of "be", pronouns, etc.

#### Removing stop words using spaCy

In [22]:
# Get list of stopwords (explicit import works even before a model is loaded)
from spacy.lang.en.stop_words import STOP_WORDS
stopwords = STOP_WORDS

string = """OMG!!! This is like      the best thing ever \t\n.
Wow, such an amazing song! I'm hooked. Top 5 definitely. ?"""

nlp = spacy.load('en_core_web_sm')
doc = nlp(string)
lemmas = [token.lemma_ for token in doc]

# Removing stopwords and non-alphabetic tokens
a_lemmas = [lemma for lemma in lemmas if lemma.isalpha() and lemma not in stopwords]
print(' '.join(a_lemmas))

OMG like good thing wow amazing song hook Top definitely
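Stopword lists are typically stored in lowercase, so the membership test is case-sensitive. This may be why a capitalized token such as `Top` survives in the output above. The sketch below shows the effect with a tiny hypothetical stopword set, `stopwords_demo`, which is not spaCy's actual list.

```python
# Tiny illustrative stopword set; assumed for this example only
stopwords_demo = {"this", "be", "the", "such", "an", "top"}
demo_lemmas = ["this", "be", "the", "good", "thing", "Top", "definitely"]

# Case-sensitive check: "Top" does not match "top"
case_sensitive = [lemma for lemma in demo_lemmas
                  if lemma.isalpha() and lemma not in stopwords_demo]

# Lowercasing first makes the check case-insensitive
case_insensitive = [lemma.lower() for lemma in demo_lemmas
                    if lemma.isalpha() and lemma.lower() not in stopwords_demo]

print(case_sensitive)
print(case_insensitive)
```

Lowercasing before the lookup is therefore a common first step when removing stopwords.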


#### Other text preprocessing techniques
- Removing HTML/XML tags
- Replacing accented characters
- Correcting spelling errors

#### Example: Cleaning a blog post
In this exercise, you have been given an excerpt from a blog post. Your task is to clean this text into a more machine-friendly format. This will involve converting to lowercase, lemmatization, and removing stopwords, punctuation and non-alphabetic characters.

The excerpt is available as a string blog and has been printed to the console. The list of stopwords is available as stopwords.

In [23]:
import spacy
# Explicit import works even before an English model has been loaded
from spacy.lang.en.stop_words import STOP_WORDS

stopwords = STOP_WORDS
blog = "Twenty-first-century politics has witnessed an alarming rise of populism in the U.S. and Europe. The first warning signs came with the UK Brexit Referendum vote in 2016 swinging in the way of Leave. This was followed by a stupendous victory by billionaire Donald Trump to become the 45th President of the United States in November 2016. Since then, Europe has seen a steady rise in populist and far-right parties that have capitalized on Europe’s Immigration Crisis to raise nationalist and anti-Europe sentiments. Some instances include Alternative for Germany (AfD) winning 12.6% of all seats and entering the Bundestag, thus upsetting Germany’s political order for the first time since the Second World War, the success of the Five Star Movement in Italy and the surge in popularity of neo-nazism and neo-fascism in countries such as Hungary, Czech Republic, Poland and Austria."

# Load model and create Doc object
nlp = spacy.load('en_core_web_sm')
doc = nlp(blog)

# Generate lemmatized tokens
lemmas = [token.lemma_ for token in doc]

# Remove stopwords and non-alphabetic tokens
a_lemmas = [lemma for lemma in lemmas 
            if lemma.isalpha() and lemma not in stopwords]

# Print string after text cleaning
print(' '.join(a_lemmas))



#### Example: Cleaning TED talks in a dataframe
In this exercise, we will revisit the TED Talks from the first chapter. You have been given a dataframe ted consisting of 5 TED Talks. Your task is to clean these talks using the techniques discussed earlier by writing a function preprocess and applying it to the transcript feature of the dataframe.

The stopwords list is available as stopwords.

In [25]:
import spacy
import pandas as pd
from spacy.lang.en.stop_words import STOP_WORDS

stopwords = STOP_WORDS
# Load the model here so the cell does not depend on earlier cells
nlp = spacy.load('en_core_web_sm')
data = [["Mary had a little lamb."], ["Nick had a big dog."]] 
ted = pd.DataFrame(data, columns = ['transcript'])

# Function to preprocess text
def preprocess(text):
    # Create Doc object, skipping pipeline components not needed for lemmas
    doc = nlp(text, disable=['ner', 'parser'])
    # Generate lemmas
    lemmas = [token.lemma_ for token in doc]
    # Remove stopwords and non-alphabetic characters
    a_lemmas = [lemma for lemma in lemmas 
            if lemma.isalpha() and lemma not in stopwords]
    
    return ' '.join(a_lemmas)
  
# Apply preprocess to ted['transcript']
ted['transcript'] = ted['transcript'].apply(preprocess)
print(ted['transcript'])

0    Mary little lamb
1        Nick big dog
Name: transcript, dtype: object


### Part-of-speech tagging
#### Applications
- Word-sense disambiguation
  - `"The bear is a majestic animal"`
  - `"Please bear with me"`
- Sentiment analysis
- Question answering
- Fake news and opinion spam detection

#### POS tagging
- Assigning every word its corresponding part of speech.

"Jane is an amazing guitarist."
- **POS Tagging:**
  - Jane -> proper noun
  - is -> verb
  - an -> determiner
  - amazing -> adjective
  - guitarist -> noun
  
#### POS tagging using spaCy

In [26]:
import spacy

nlp = spacy.load('en_core_web_sm')
string = "Jane is an amazing gutarist."
doc = nlp(string)

# Generate list of tokens and pos tags
pos = [(token.text, token.pos_) for token in doc]
print(pos)

[('Jane', 'PROPN'), ('is', 'AUX'), ('an', 'DET'), ('amazing', 'ADJ'), ('gutarist', 'NOUN'), ('.', 'PUNCT')]


The accuracy of spaCy's POS tagging depends on the data the model was trained on and on the data it is applied to.
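Once the tags are available, counting them per category is a one-liner with `collections.Counter`. The sketch below reuses the `(token, tag)` pairs printed above as a literal list, so it runs without spaCy; the variable name `tag_counts` is introduced for illustration.

```python
from collections import Counter

# (token, tag) pairs copied from the printed output above
pos = [('Jane', 'PROPN'), ('is', 'AUX'), ('an', 'DET'),
       ('amazing', 'ADJ'), ('gutarist', 'NOUN'), ('.', 'PUNCT')]

tag_counts = Counter(tag for _, tag in pos)
print(tag_counts['PROPN'], tag_counts['NOUN'], tag_counts['VERB'])
```

A `Counter` returns 0 for tags that never occur, which is convenient when the counts are used as features.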

#### Example: Counting nouns in a piece of text
In this exercise, we will write two functions, nouns() and proper_nouns() that will count the number of other nouns and proper nouns in a piece of text respectively.

These functions will take in a piece of text and generate a list containing the POS tags for each word. They will then return the number of proper nouns/other nouns that the text contains. We will use these functions in the next exercise to generate interesting insights about fake news.

The en_core_web_sm model has already been loaded as nlp in this exercise.

In [27]:
nlp = spacy.load('en_core_web_sm')

# Returns number of proper nouns
def proper_nouns(text, model=nlp):
    # Create doc object
    doc = model(text)
    # Generate list of POS tags
    pos = [token.pos_ for token in doc]
    
    # Return number of proper nouns
    return pos.count('PROPN')

print(proper_nouns("Abdul, Bill and Cathy went to the market to buy apples.", nlp))

3
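Both counts can also be read off in a single pass over the tags. The following is a minimal sketch, assuming the tags have already been produced by spaCy: the `tagged` list is copied from the POS tagging output shown earlier in this section rather than computed, so the snippet runs without a model.

```python
from collections import Counter

# (token, POS tag) pairs copied from the spaCy output shown earlier
tagged = [('Jane', 'PROPN'), ('is', 'AUX'), ('an', 'DET'),
          ('amazing', 'ADJ'), ('gutarist', 'NOUN'), ('.', 'PUNCT')]

# Count every tag once, then look up the tags of interest
tag_counts = Counter(pos for _, pos in tagged)

num_propn = tag_counts['PROPN']  # proper nouns
num_noun = tag_counts['NOUN']    # other nouns
print(num_propn, num_noun)
```

A `Counter` returns 0 for tags that never occur, so the lookup is safe for texts without any nouns.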


#### Example: Noun usage in fake news
In this exercise, you have been given a dataframe headlines that contains news headlines that are either fake or real. Your task is to generate two new features num_propn and num_noun that represent the number of proper nouns and other nouns contained in the title feature of headlines.

Next, we will compute the mean number of proper nouns and other nouns used in fake and real news headlines and compare the values. If there is a remarkable difference, then there is a good chance that using the num_propn and num_noun features in fake news detectors will improve their performance.

To accomplish this task, the functions proper_nouns and nouns that you had built in the previous exercise have already been made available to you.

In [32]:
import pandas as pd

def proper_nouns(text, model=nlp):
    # Create doc object
    doc = model(text)
    # Generate list of POS tags
    pos = [token.pos_ for token in doc]
    
    # Return number of proper nouns
    return pos.count('PROPN')

def nouns(text, model=nlp):
    # Create doc object
    doc = model(text)
    # Generate list of POS tags
    pos = [token.pos_ for token in doc]

    # Return number of other nouns
    return pos.count('NOUN')

data = [["Mary had a little lamb.", "REAL"], ["Nick had a big dog.", "FAKE"]] 
headlines = pd.DataFrame(data, columns = ['title', 'label'])

headlines['num_propn'] = headlines['title'].apply(proper_nouns)
headlines['num_noun'] = headlines['title'].apply(nouns)

# Compute mean of proper nouns
real_propn = headlines[headlines['label'] == 'REAL']['num_propn'].mean()
fake_propn = headlines[headlines['label'] == 'FAKE']['num_propn'].mean()

# Compute mean of other nouns
real_noun = headlines[headlines['label'] == 'REAL']['num_noun'].mean()
fake_noun = headlines[headlines['label'] == 'FAKE']['num_noun'].mean()

# Print results
print("Mean no. of proper nouns in real and fake headlines are %.2f and %.2f respectively"%(real_propn, fake_propn))
print("Mean no. of other nouns in real and fake headlines are %.2f and %.2f respectively"%(real_noun, fake_noun))

Mean no. of proper nouns in real and fake headlines are 1.00 and 1.00 respectively
Mean no. of other nouns in real and fake headlines are 1.00 and 1.00 respectively


### Named entity recognition
#### Applications
- Efficient search algorithm
- Question answering
- News article classification
- Customer service

#### Named entity recognition
- Identifying and classifying named entities into predefined categories.
- Categories include person, organization, country, etc.

"John Doe is a software engineer working at Google. He lives in France."

- **Named Entities**
- John Doe -> person
- Google -> organization
- France -> country

#### NER using spaCy

In [33]:
import spacy

nlp = spacy.load('en_core_web_sm')
string = "John Doe is a software engineer working at Google. He lives in France."
doc = nlp(string)

# generate named entities
ne = [(ent.text, ent.label_) for ent in doc.ents]
print(ne)

[('John Doe', 'PERSON'), ('Google', 'ORG'), ('France', 'GPE')]


#### A word of caution
- Not perfect
- Performance dependent on training and test data
- Train models with specialized data for nuanced cases
- Language specific (model)

In [36]:
import spacy

nlp = spacy.load('en_core_web_sm')
tc = "\nIt’s' been a busy day for Facebook  exec op-eds. Earlier this morning, Sheryl Sandberg broke the site’s silence around the Christchurch massacre, and now Mark Zuckerberg is calling on governments and other bodies to increase regulation around the sorts of data Facebook traffics in. He’s hoping to get out in front of heavy-handed regulation and get a seat at the table shaping it.\n"

def find_persons(text):
  # Create Doc object
  doc = nlp(text)
  
  # Identify the persons
  persons = [ent.text for ent in doc.ents if ent.label_ == 'PERSON']
  
  # Return persons
  return persons

print(find_persons(tc))

['Sheryl Sandberg', 'Mark Zuckerberg']


## 3. N-Gram models
Learn about n-gram modeling and use it to perform sentiment analysis on movie reviews.

### Building a bag of words model
For any ML algorithm:
- Data must be in tabular form
- Training features must be numerical

#### Bag of words model
- Extract word tokens
- Compute frequency of tokens
- Construct a word vector out of these frequencies and the vocabulary of the corpus

**Corpus**
`"The lion is the king of the jungle"`
`"Lions have lifespans of a decade"`
`"The lion is an endangered species"`

**Vocabulary of model** -> `a`, `an`, `decade`, `endangered`, `have`, `is`, `jungle`, `king`, `lifespans`, `lion`, `Lions`, `of`, `species`, `the`, `The`

Convert to word vector
`The lion is the king of the jungle`
`[0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 2, 1]`

Each value in the vector corresponds to the frequency of the corresponding word in the vocabulary.
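The conversion can be sketched in plain Python. This is an illustration of the idea only, not how scikit-learn implements it; the sort key is an assumption chosen to reproduce the vocabulary order listed above (alphabetical ignoring case, lowercase form first on ties).

```python
from collections import Counter

corpus = [
    "The lion is the king of the jungle",
    "Lions have lifespans of a decade",
    "The lion is an endangered species",
]

# Case-sensitive vocabulary, ordered as in the list above
vocabulary = sorted(
    {word for doc in corpus for word in doc.split()},
    key=lambda w: (w.lower(), w[0].isupper()),
)

def to_vector(doc, vocabulary):
    """Frequency of each vocabulary word in doc, in vocabulary order."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

vector = to_vector(corpus[0], vocabulary)
print(vocabulary)
print(vector)
```

Because no preprocessing is applied, `the` and `The` occupy separate dimensions, which is what the next subsection addresses.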

#### Text preprocessing
Text preprocessing can improve the model significantly:
- Lions, lion -> lion
- The, the -> the
- No punctuation
- No stopwords
- Preprocessing leads to smaller vocabularies
- Reducing the number of dimensions helps improve performance
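The effect on vocabulary size can be sketched without any NLP library. The stopword set below is a hypothetical, hand-picked list for this toy corpus only, and no lemmatization is performed, so `lions` and `lion` remain separate entries here.

```python
corpus = [
    "The lion is the king of the jungle",
    "Lions have lifespans of a decade",
    "The lion is an endangered species",
]

# Hypothetical stopword list, valid for this toy corpus only
stopwords = {"the", "is", "of", "a", "an", "have"}

# Vocabulary without any preprocessing (case-sensitive)
raw_vocab = {word for doc in corpus for word in doc.split()}

# Vocabulary after lowercasing and stopword removal
clean_vocab = {
    word.lower()
    for doc in corpus
    for word in doc.split()
    if word.lower() not in stopwords
}

print(len(raw_vocab), len(clean_vocab))
print(sorted(clean_vocab))
```

Adding a lemmatizer on top of this would merge `lions` into `lion` and shrink the vocabulary by one more dimension.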

#### Bag of words model using sklearn

In [2]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

corpus = pd.Series([
    'The lion is the king of the jungle',
    'Lions have lifespans of a decade',
    'The lion is an endangered species',
    'men may come and men may go but i go on forever'
])

vectorizer = CountVectorizer()

# Generate matrix of word vectors
# By default, CountVectorizer lowercases all words and ignores single-character tokens such as 'a'
# Columns follow the alphabetically sorted vocabulary
bow_matrix = vectorizer.fit_transform(corpus)
print(bow_matrix.toarray())

# Convert bow_matrix into a DataFrame
bow_df = pd.DataFrame(bow_matrix.toarray())
# Map the column names to vocabulary 
bow_df.columns = vectorizer.get_feature_names_out()
print(bow_df)

[[0 0 0 0 0 0 0 0 0 1 1 1 0 1 0 0 0 1 0 0 3]
 [0 0 0 0 1 0 0 0 1 0 0 0 1 0 1 0 0 1 0 0 0]
 [1 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 1 1]
 [0 1 1 1 0 0 1 2 0 0 0 0 0 0 0 2 2 0 1 0 0]]
   an  and  but  come  decade  endangered  forever  go  have  is  ...  king  \
0   0    0    0     0       0           0        0   0     0   1  ...     1   
1   0    0    0     0       1           0        0   0     1   0  ...     0   
2   1    0    0     0       0           1        0   0     0   1  ...     0   
3   0    1    1     1       0           0        1   2     0   0  ...     0   

   lifespans  lion  lions  may  men  of  on  species  the  
0          0     1      0    0    0   1   0        0    3  
1          1     0      1    0    0   1   0        0    0  
2          0     1      0    0    0   0   0        1    1  
3          0     0      0    2    2   0   1        0    0  

[4 rows x 21 columns]


In [1]:
import spacy
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

nlp = spacy.load('en_core_web_sm')
stopwords = spacy.lang.en.stop_words.STOP_WORDS
corpus = pd.Series([
    'The lion is the king of the jungle',
    'Lions have lifespans of a decade',
    'The lion is an endangered species',
    'men may come and men may go but i go on forever'
]) 
lemmas = pd.DataFrame(corpus, columns = ['data'])

# Function to preprocess text
def preprocess(text):
    # Create Doc object
    doc = nlp(text, disable=['ner', 'parser'])
    # Generate lemmas
    lemmas = [token.lemma_ for token in doc]
    # Remove stopwords and non-alphabetic characters
    a_lemmas = [lemma for lemma in lemmas 
            if lemma.isalpha() and lemma not in stopwords]
    
    return ' '.join(a_lemmas)
  
# Apply preprocess to lemmas['data']
lemmas['data'] = lemmas['data'].apply(preprocess)
print(lemmas['data'])

vectorizer = CountVectorizer()

# Generate matrix of word vectors from the preprocessed text
# (the output below has 10 columns, one per remaining lemma)
bow_matrix = vectorizer.fit_transform(lemmas['data'])
print(bow_matrix.toarray())

0         lion king jungle
1     lion lifespan decade
2    lion endanger species
3     man come man forever
Name: data, dtype: object
[[0 0 0 0 1 1 0 1 0 0]
 [0 1 0 0 0 0 1 1 0 0]
 [0 0 1 0 0 0 0 1 0 1]
 [1 0 0 1 0 0 0 0 2 0]]


### Building a BoW Naive Bayes classifier
#### Spam filtering

**Steps**
1. Text processing
2. Building a bag-of-words model (or representations)
3. Machine learning

#### Text preprocessing using CountVectorizer
CountVectorizer arguments
- `lowercase` : `False`, `True`
- `strip_accents` : `unicode`, `ascii`, `None`
- `stop_words` : `english`, `list`, `None`
- `token_pattern` : `regexp`
- `tokenizer` : `function`

#### Building the BoW model

In [17]:
review = [
    'The lion is the king of the jungle',
    'Lions have lifespans of a decade',
    'The lion is an endangered species',
    'men may come and men may go but i go on forever'
]
label = [1, 1, 1, 0]
data = list(zip(review, label))

# Create test spam dataframe
df = pd.DataFrame(data, columns = ['message', 'label'])

from sklearn.feature_extraction.text import CountVectorizer

# Create CountVectorizer object
vectorizer = CountVectorizer(strip_accents='ascii', stop_words='english', lowercase=False)

from sklearn.model_selection import train_test_split

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(df['message'], df['label'], test_size=0.25)
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)

# Print shape of X_train_bow and X_test_bow
print(X_train_bow.shape)
print(X_test_bow.shape)

# Words in the test data that are not in the vectorizer's vocabulary are ignored

(3, 9)
(1, 9)
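Both matrices have the same number of columns because the vocabulary is fixed when the vectorizer is fitted on the training data. A minimal pure-Python sketch of that behaviour follows; the documents and the `transform` helper are hypothetical and only mimic the idea, not scikit-learn's implementation.

```python
from collections import Counter

train_docs = ["lion king jungle", "lion endangered species"]
test_doc = "lion eats zebra"  # 'eats' and 'zebra' never appear in training

# "fit": the vocabulary is built from the training documents only
vocabulary = sorted({word for doc in train_docs for word in doc.split()})

def transform(doc, vocabulary):
    """Count only the words that are part of the fitted vocabulary."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

test_vector = transform(test_doc, vocabulary)
print(vocabulary)
print(test_vector)
```

The unseen words contribute nothing to the test vector, so its length always matches the training vocabulary.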


#### Training the Naive Bayes classifier
In the previous exercise, you generated the bag-of-words representations for the training and test movie review data. In this exercise, we will use this model to train a Naive Bayes classifier that can detect the sentiment of a movie review and compute its accuracy. Note that since this is a binary classification problem, the model is only capable of classifying a review as either positive (1) or negative (0). It is incapable of detecting neutral reviews.

In [16]:
# !!! Depends on previous task
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()

# Train classifier
clf.fit(X_train_bow, y_train)

# Compute accuracy on test set
accuracy = clf.score(X_test_bow, y_test)
print(accuracy)

1.0


### Building n-gram models
BoW shortcomings

| review                                | label    |
|---------------------------------------|----------|
| `'The movie was good and not boring'` | positive |
| `'The movie was not good and boring'` | negative |

- Exactly the same BoW representation!
- Context of the words is lost
- Sentiment depends on the position of 'not'
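This can be checked directly. The sketch below builds a bag of words for each review with `collections.Counter` (a stand-in for a real vectorizer) and compares the two.

```python
from collections import Counter

positive = 'The movie was good and not boring'
negative = 'The movie was not good and boring'

# A bag of words keeps word counts and discards word order
bow_positive = Counter(positive.lower().split())
bow_negative = Counter(negative.lower().split())

same_bag = bow_positive == bow_negative
print(same_bag)
```

The two reviews differ only in word order, which is exactly the information a bag of words throws away.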

**n-gram**
- Contiguous sequence of n elements (or words) in a given document.
- n=1 -> bag-of-words

`'for you a thousand times over'`

- n=2, n-grams:

`[
'for you',
'you a',
'a thousand',
'thousand times',
'times over'
]`

- n=3, n-grams:

`[
'for you a',
'you a thousand',
'a thousand times',
'thousand times over'
]`

- Capture more context
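Generating n-grams amounts to sliding a window of n words over the token list. The helper below is a hypothetical illustration that splits on whitespace only; a real tokenizer would handle punctuation and casing as well.

```python
def ngrams(text, n):
    """Return the contiguous word n-grams of text as strings."""
    tokens = text.split()
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = 'for you a thousand times over'
bigrams = ngrams(sentence, 2)
trigrams = ngrams(sentence, 3)
print(bigrams)
print(trigrams)

# Bigrams separate the two reviews that a bag of words cannot tell apart
same_bigrams = (set(ngrams('The movie was good and not boring', 2))
                == set(ngrams('The movie was not good and boring', 2)))
print(same_bigrams)
```

A sentence of k words yields k - n + 1 n-grams, so the six-word sentence gives five bigrams and four trigrams.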

#### Applications
- sentence completion
- spelling correction
- machine translation correction

#### Building n-gram models using scikit-learn
Generate only bigrams.

`bigrams = CountVectorizer(ngram_range=(2,2))`

Generate unigrams, bigrams and trigrams.

`ngram = CountVectorizer(ngram_range=(1,3))`

#### Shortcomings
- Curse of dimensionality
- Higher order n-grams are rare (more than 3)
- Keep n small
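The growth in dimensions is visible even on a tiny corpus. The sketch below counts distinct n-grams for n up to 1, 2 and 3; the helper names are hypothetical and the corpus is the lowercased lion example from earlier.

```python
def ngrams(tokens, n):
    """Contiguous n-grams of a token list, joined into strings."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

corpus = [
    'the lion is the king of the jungle',
    'lions have lifespans of a decade',
    'the lion is an endangered species',
]

def count_features(corpus, max_n):
    """Number of distinct n-grams for n = 1..max_n across the corpus."""
    features = set()
    for doc in corpus:
        tokens = doc.split()
        for n in range(1, max_n + 1):
            features.update(ngrams(tokens, n))
    return len(features)

sizes = [count_features(corpus, max_n) for max_n in (1, 2, 3)]
print(sizes)
```

Most of the added higher-order n-grams occur only once, so they add columns without adding much signal.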

#### Exercise: n-gram models for movie tag lines
In this exercise, we have been provided with a corpus of more than 9000 movie tag lines. Our job is to generate n-gram models up to n equal to 1, n equal to 2 and n equal to 3 for this data and discover the number of features for each model.

We will then compare the number of features generated for each model.

In [70]:
import pandas as pd
df = pd.read_csv('../dataset/movie/movies_metadata.csv', header=0)
df.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


In [73]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

df = pd.read_csv('../dataset/movie/movies_metadata.csv', header=0, keep_default_na = False, usecols=["tagline", "vote_average"])
print(df.head())

corpus = df['tagline']

# Generate n-grams up to n=1
vectorizer_ng1 = CountVectorizer(ngram_range=(1,1))
ng1 = vectorizer_ng1.fit_transform(corpus)

# Generate n-grams up to n=2
vectorizer_ng2 = CountVectorizer(ngram_range=(1,2))
ng2 = vectorizer_ng2.fit_transform(corpus)

# Generate n-grams up to n=3
vectorizer_ng3 = CountVectorizer(ngram_range=(1, 3))
ng3 = vectorizer_ng3.fit_transform(corpus)

# Print the number of features for each model
print("ng1, ng2 and ng3 have %i, %i and %i features respectively" % (ng1.shape[1], ng2.shape[1], ng3.shape[1]))


                                             tagline vote_average
0                                                             7.7
1          Roll the dice and unleash the excitement!          6.9
2  Still Yelling. Still Fighting. Still Ready for...          6.5
3  Friends are the people who let you be yourself...          6.1
4  Just When His World Is Back To Normal... He's ...          5.7
ng1, ng2 and ng3 have 12930, 86665 and 192984 features respectively


#### Exercise: Higher order n-grams for sentiment analysis
Similar to a previous exercise, we are going to build a classifier that can detect if the review of a particular movie is positive or negative. However, this time, we will use n-grams up to n=2 for the task.

The n-gram training reviews are available as X_train_ng. The corresponding test reviews are available as X_test_ng. Finally, use y_train and y_test to access the training and test sentiment classes respectively.

In [74]:
# !!! Depends on previous exercise
# Split data
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(df['tagline'], df['vote_average'], test_size=0.25)

ng_vectorizer = CountVectorizer(ngram_range=(1,2))
X_train_ng = ng_vectorizer.fit_transform(X_train)
X_test_ng = ng_vectorizer.transform(X_test)

from sklearn.naive_bayes import MultinomialNB

# Define an instance of MultinomialNB 
clf_ng = MultinomialNB()

# Fit the classifier 
clf_ng.fit(X_train_ng, y_train)

# Measure the accuracy 
accuracy = clf_ng.score(X_test_ng, y_test)
print("The accuracy of the classifier on the test set is %.3f" % accuracy)

# Predict the sentiment of a negative review
review = "The movie was not good. The plot had several holes and the acting lacked panache."
prediction = clf_ng.predict(ng_vectorizer.transform([review]))[0]
# The labels come from df['vote_average'], which was read as strings
# (keep_default_na=False), so the prediction is a string: format it with %s, not %i
print("The sentiment predicted by the classifier is %s" % (prediction))


The accuracy of the classifier on the test set is 0.073


#### Exercise: Comparing performance of n-gram models
You now know how to conduct sentiment analysis by converting text into various n-gram representations and feeding them to a classifier. In this exercise, we will conduct sentiment analysis for the same movie reviews from before using two n-gram models: unigrams and n-grams up to n equal to 3.

We will then compare the performance using three criteria: accuracy of the model on the test set, time taken to execute the program and the number of features created when generating the n-gram representation.

In [75]:
import time
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Assumes df is the movie review dataset with 'review' and 'sentiment' columns
# (not the tag line data loaded above)
start_time = time.time()
# Splitting the data into training and test sets
train_X, test_X, train_y, test_y = train_test_split(df['review'], df['sentiment'], test_size=0.5, random_state=42, stratify=df['sentiment'])

# Generating ngrams
vectorizer = CountVectorizer(ngram_range=(1,3))
train_X = vectorizer.fit_transform(train_X)
test_X = vectorizer.transform(test_X)

# Fit classifier
clf = MultinomialNB()
clf.fit(train_X, train_y)

# Print accuracy, time and number of dimensions
print("The program took %.3f seconds to complete. The accuracy on the test set is %.2f. The ngram representation had %i features." % (time.time() - start_time, clf.score(test_X, test_y), train_X.shape[1]))

**Result**
Unigram:
The program took 0.194 seconds to complete. The accuracy on the test set is 0.75. The ngram representation had 12347 features.

N-Gram (up to 3)
The program took 3.605 seconds to complete. The accuracy on the test set is 0.77. The ngram representation had 178240 features.

**Summary**
The program took around 0.2 seconds in the case of the unigram model and more than 10 times longer for the higher order n-gram model. The unigram model had over 12,000 features whereas the n-gram model for up to n=3 had over 178,000! Despite taking higher computation time and generating more features, the classifier only performs marginally better in the latter case, producing an accuracy of 77% in comparison to the 75% for the unigram model.

## 4. TF-IDF and similarity scores
Learn how to compute tf-idf weights and the cosine similarity score between two vectors. You will use these concepts to build a movie and a TED Talk recommender. Finally, you will also learn about word embeddings and using word vector representations, you will compute similarities between various Pink Floyd songs.

### Building tf-idf document vectors
#### n-gram modeling
- The weight of a dimension depends on the frequency of the word corresponding to that dimension.
  - A document contains the word `human` in five places.
  - The dimension corresponding to `human` has weight `5`.
- Some words occur very commonly across all documents.
- Consider a corpus of documents on the universe.
  - One document has `jupiter` and `universe` occurring 20 times each.
  - `jupiter` rarely occurs in the other documents, whereas `universe` is common.
  - Give more weight to `jupiter` on account of its exclusivity (`jupiter` characterizes the document more than `universe` does).

#### Applications
- automatically detect stopwords
- search algorithm
- recommender systems
- better performance in predictive modeling for some cases

#### Term frequency-inverse document frequency (tf-idf)
- Proportional to term frequency
- Inverse function of the number of documents in which the term occurs
- The higher the tf-idf weight, the more important the word is in characterizing the document


![id](images/td-idf-formula.png "Tf-Idf mathematical formula")
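
The image itself is not reproduced in these notes. A common textbook formulation, consistent with the exercise in this section in which a word occurring in every document gets weight 0, is given below; the symbol names are assumptions and may differ from those in the image.

```latex
w_{i,j} = \mathrm{tf}_{i,j} \cdot \log\left(\frac{N}{\mathrm{df}_i}\right)
```

Here $w_{i,j}$ is the weight of term $i$ in document $j$, $\mathrm{tf}_{i,j}$ is the number of occurrences of term $i$ in document $j$, $N$ is the number of documents in the corpus and $\mathrm{df}_i$ is the number of documents containing term $i$.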

#### tf-idf using scikit-learn

In [79]:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    'The lion is the king of the jungle',
    'Lions have lifespans of a decade',
    'The lion is an endangered species',
    'men may come and men may go but i go on forever'
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)
print(tfidf_matrix.toarray())

[[0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.25634672 0.3251433  0.3251433
  0.         0.25634672 0.         0.         0.         0.25634672
  0.         0.         0.76904015]
 [0.         0.         0.         0.         0.46516193 0.
  0.         0.         0.46516193 0.         0.         0.
  0.46516193 0.         0.46516193 0.         0.         0.36673901
  0.         0.         0.        ]
 [0.4533864  0.         0.         0.         0.         0.4533864
  0.         0.         0.         0.35745504 0.         0.
  0.         0.35745504 0.         0.         0.         0.
  0.         0.4533864  0.35745504]
 [0.         0.24253563 0.24253563 0.24253563 0.         0.
  0.24253563 0.48507125 0.         0.         0.         0.
  0.         0.         0.         0.48507125 0.48507125 0.
  0.24253563 0.         0.        ]]


#### Exercise: tf-idf weight of commonly occurring words
The word bottle occurs 5 times in a particular document D and also occurs in every document of the corpus. What is the tf-idf weight of bottle in D?

Answer:
In fact, the tf-idf weight for bottle in every document will be 0. This is because the inverse document frequency is constant across documents in a corpus and since bottle occurs in every document, its value is log(1), which is 0.
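
The answer can be verified numerically. The sketch below assumes the unsmoothed textbook weight tf * log(N / df) and a hypothetical corpus size of 100. For comparison it also evaluates the smoothed variant that scikit-learn's `TfidfVectorizer` uses by default (`smooth_idf=True`), which does not reach zero, so the library output will not match this exercise exactly.

```python
import math

def tfidf_textbook(tf, n_docs, doc_freq):
    """tf * log(N / df), the unsmoothed textbook weight."""
    return tf * math.log(n_docs / doc_freq)

def tfidf_smoothed(tf, n_docs, doc_freq):
    """tf * (ln((1 + N) / (1 + df)) + 1), before any vector normalization."""
    return tf * (math.log((1 + n_docs) / (1 + doc_freq)) + 1)

n_docs = 100  # hypothetical corpus size; 'bottle' occurs in every document
textbook = tfidf_textbook(5, n_docs, n_docs)
smoothed = tfidf_smoothed(5, n_docs, n_docs)
print(textbook, smoothed)
```

With the textbook formula the result is 0 regardless of the corpus size, because N / df is 1 whenever the word appears in every document.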

#### Exercise: tf-idf vectors for TED talks
In this exercise, you have been given a corpus ted which contains the transcripts of 500 TED Talks. Your task is to generate the tf-idf vectors for these talks.

In a later lesson, we will use these vectors to generate recommendations of similar talks based on the transcript. 

In [81]:
import pandas as pd
df = pd.read_csv('../dataset/ted_talks/transcripts.csv', header=0)
df.head()

Unnamed: 0,transcript,url
0,Good morning. How are you?(Laughter)It's been ...,https://www.ted.com/talks/ken_robinson_says_sc...
1,"Thank you so much, Chris. And it's truly a gre...",https://www.ted.com/talks/al_gore_on_averting_...
2,"(Music: ""The Sound of Silence,"" Simon & Garfun...",https://www.ted.com/talks/david_pogue_says_sim...
3,If you're here today — and I'm very happy that...,https://www.ted.com/talks/majora_carter_s_tale...
4,"About 10 years ago, I took on the task to teac...",https://www.ted.com/talks/hans_rosling_shows_t...


In [83]:
import pandas as pd

df = pd.read_csv('../dataset/ted_talks/transcripts.csv', header=0)

ted = df['transcript']

# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Create TfidfVectorizer object
vectorizer = TfidfVectorizer()

# Generate matrix of word vectors
tfidf_matrix = vectorizer.fit_transform(ted)

# Print the shape of tfidf_matrix
print(tfidf_matrix.shape)

(2467, 58795)


### Cosine similarity

![id](images/cosine-similarity.png "Cosine similarity")

#### The dot product

![id](images/dot-product.png "Dot product")

![id](images/magnitude-of-vector.png "Magnitude of vector")

![id](images/cosine-score.png "Cosine score")

#### Cosine Score: points to remember
- Value between -1 and 1
- In NLP, document vectors have non-negative components, so the score lies between 0 and 1 (0: no similarity, 1: identical)
- Robust to document length because it ignores vector magnitude
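
The score can be computed by hand from its definition: the dot product divided by the product of the two magnitudes. The sketch below assumes the images above show this standard formula, and uses the same example vectors as the scikit-learn cell in this section.

```python
import math

def cosine_score(a, b):
    """Dot product of a and b divided by the product of their magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    magnitude_a = math.sqrt(sum(x * x for x in a))
    magnitude_b = math.sqrt(sum(y * y for y in b))
    return dot / (magnitude_a * magnitude_b)

A = (4, 7, 1)
B = (5, 2, 3)
score = cosine_score(A, B)
print(round(score, 8))

# Scaling a vector changes its magnitude but not the score
scaled = cosine_score(A, tuple(10 * y for y in B))
print(math.isclose(score, scaled))
```

The second check illustrates the robustness to document length: a document repeated ten times points in the same direction and gets the same score.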

#### Implementation using scikit-learn


In [88]:
from sklearn.metrics.pairwise import cosine_similarity
# Define two 3-dimensional vectors A and B
A = (4, 7, 1)
B = (5, 2, 3)

# Compute the cosine score of A and B
score = cosine_similarity([A], [B])

# Print the cosine score
print(score)

[[0.73881883]]


#### Exercise: Computing dot product
In this exercise, we will learn to compute the dot product between two vectors, A = (1, 3) and B = (-2, 2), using the numpy library. More specifically, we will use the np.dot() function to compute the dot product of two numpy arrays.

In [89]:
import numpy as np

# Initialize numpy vectors
A = np.array([1, 3])
B = np.array([-2, 2])

# Compute dot product
dot_prod = np.dot(A, B)

# Print dot product
print(dot_prod)

4


The dot product of the two vectors is 1 * -2 + 3 * 2 = 4, which is indeed the output produced. We will not be using np.dot() too much in this course but it can prove to be a helpful function while computing dot products between two standalone vectors.

#### Exercise: Cosine similarity matrix of a corpus
In this exercise, you have been given a corpus, which is a list containing five sentences. The corpus is printed in the console. You have to compute the cosine similarity matrix which contains the pairwise cosine similarity score for every pair of sentences (vectorized using tf-idf).

Remember, the value corresponding to the ith row and jth column of a similarity matrix denotes the similarity score for the ith and jth vector.


In [90]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['The sun is the largest celestial body in the solar system',
          'The solar system consists of the sun and eight revolving planets',
          'Ra was the Egyptian Sun God',
          'The Pyramids were the pinnacle of Egyptian architecture',
          'The quick brown fox jumps over the lazy dog'
         ]
# Initialize an instance of tf-idf Vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Generate the tf-idf vectors for the corpus
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)

# Compute and print the cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
print(cosine_sim)

[[1.         0.36413198 0.18314713 0.18435251 0.16336438]
 [0.36413198 1.         0.15054075 0.21704584 0.11203887]
 [0.18314713 0.15054075 1.         0.21318602 0.07763512]
 [0.18435251 0.21704584 0.21318602 1.         0.12960089]
 [0.16336438 0.11203887 0.07763512 0.12960089 1.        ]]


As you will see in a subsequent lesson, computing the cosine similarity matrix lies at the heart of many practical systems such as recommenders. From our similarity matrix, we see that the first and the second sentence are the most similar. Also the fifth sentence has, on average, the lowest pairwise cosine scores. This is intuitive as it contains entities that are not present in the other sentences.
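
Both observations can be read off the matrix programmatically. In the sketch below the matrix is copied from the output above rather than recomputed, so it runs without scikit-learn.

```python
# Cosine similarity matrix copied from the output above
cosine_sim = [
    [1.0, 0.36413198, 0.18314713, 0.18435251, 0.16336438],
    [0.36413198, 1.0, 0.15054075, 0.21704584, 0.11203887],
    [0.18314713, 0.15054075, 1.0, 0.21318602, 0.07763512],
    [0.18435251, 0.21704584, 0.21318602, 1.0, 0.12960089],
    [0.16336438, 0.11203887, 0.07763512, 0.12960089, 1.0],
]
n = len(cosine_sim)

# Most similar pair of distinct sentences (upper triangle only)
pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
most_similar = max(pairs, key=lambda p: cosine_sim[p[0]][p[1]])

# Mean score of each sentence against all the others (diagonal excluded)
mean_scores = [
    sum(score for j, score in enumerate(row) if j != i) / (n - 1)
    for i, row in enumerate(cosine_sim)
]
least_similar = min(range(n), key=lambda i: mean_scores[i])

print(most_similar, least_similar)
```

Indices are zero-based, so the pair `(0, 1)` refers to the first and second sentence and index `4` to the fifth.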

### Building a plot line based recommender
#### Movie recommender

| Title        | Overview                                                                                                                                                                                                                                                                                        |
|--------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Plot         | Astronauts who have seen the Earth from space have often described the 'Overview Effect',  an experience that has transformed their perspective of the planet and mankind's place  upon it, and enabled them to perceive it as our shared home, without boundaries between  nations or species. |
| Interstellar | A team of explorers travel through a wormhole in space in an attempt to ensure humanity's survival.                                                                                                                                                                                             |

#### Task
Build a system that takes a movie title and outputs a list of movies with similar plot lines.

#### Steps
1. Preprocess movie overview
2. Generate tf-idf vectors
3. Generate cosine similarity matrix

#### Recommender function
1. Takes a movie title, cosine similarity matrix and indices series as arguments.
2. Extract pairwise cosine similarity scores for the movie.
3. Sort the scores in descending order.
4. Output the titles corresponding to the highest scores.
5. Ignore the highest similarity score (of 1), because the movie most similar to a given movie is the movie itself.
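
These steps can be sketched in plain Python. This is a hypothetical stand-in, not the `get_recommendations()` function supplied in the exercise: the extra `titles` and `top_n` parameters, the dictionary used for `indices` and the toy data are all assumptions made for illustration.

```python
def get_recommendations(title, cosine_sim, indices, titles, top_n=2):
    """Return the top_n titles whose plots are most similar to title."""
    idx = indices[title]
    # Pair every movie index with its similarity to the requested movie
    scores = list(enumerate(cosine_sim[idx]))
    # Highest score first
    scores.sort(key=lambda pair: pair[1], reverse=True)
    # Drop the movie itself by index rather than assuming it sorted first
    scores = [(i, s) for i, s in scores if i != idx]
    return [titles[i] for i, _ in scores[:top_n]]

# Hypothetical toy data for illustration only
titles = ['Movie A', 'Movie B', 'Movie C', 'Movie D']
indices = {t: i for i, t in enumerate(titles)}
cosine_sim = [
    [1.0, 0.8, 0.1, 0.3],
    [0.8, 1.0, 0.2, 0.5],
    [0.1, 0.2, 1.0, 0.4],
    [0.3, 0.5, 0.4, 1.0],
]
recommended = get_recommendations('Movie A', cosine_sim, indices, titles)
print(recommended)
```

Filtering by index instead of slicing off the first element keeps the function correct when another movie also has a score of 1, for example a duplicate plot.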

#### Generating tf-idf vectors


In [96]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import linear_kernel

movie_plots = ['The sun is the largest celestial body in the solar system',
          'The solar system consists of the sun and eight revolving planets',
          'Ra was the Egyptian Sun God',
          'The Pyramids were the pinnacle of Egyptian architecture',
          'The quick brown fox jumps over the lazy dog'
         ]
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(movie_plots)

cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
print(cosine_sim)

# The result is the same, but linear_kernel should take less time to compute
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
print(cosine_sim)

[[1.         0.36413198 0.18314713 0.18435251 0.16336438]
 [0.36413198 1.         0.15054075 0.21704584 0.11203887]
 [0.18314713 0.15054075 1.         0.21318602 0.07763512]
 [0.18435251 0.21704584 0.21318602 1.         0.12960089]
 [0.16336438 0.11203887 0.07763512 0.12960089 1.        ]]
[[1.         0.36413198 0.18314713 0.18435251 0.16336438]
 [0.36413198 1.         0.15054075 0.21704584 0.11203887]
 [0.18314713 0.15054075 1.         0.21318602 0.07763512]
 [0.18435251 0.21704584 0.21318602 1.         0.12960089]
 [0.16336438 0.11203887 0.07763512 0.12960089 1.        ]]


#### The linear_kernel function
- The magnitude of a tf-idf vector (as generated by `TfidfVectorizer`) is 1.
- The cosine score between two tf-idf vectors is therefore their dot product.
- Skipping the magnitude computation can significantly improve computation time.
- Therefore we can use `linear_kernel` instead of `cosine_similarity`.
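
The equivalence is easy to check on small vectors. The sketch below uses hypothetical weight vectors and normalizes them to unit length, which is what `TfidfVectorizer` does to each row by default (`norm='l2'`).

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit length."""
    magnitude = math.sqrt(sum(x * x for x in vector))
    return [x / magnitude for x in vector]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical unnormalized weight vectors
a = [3.0, 0.0, 4.0]
b = [1.0, 2.0, 2.0]

# Full cosine score: dot product divided by both magnitudes
cosine = dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))
# Plain dot product of the unit-length versions
unit_dot = dot(l2_normalize(a), l2_normalize(b))

print(math.isclose(cosine, unit_dot))
```

Once the vectors are already unit length, dividing by the magnitudes is redundant work, which is where the time saving comes from.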

#### Exercise: Comparing linear_kernel and cosine_similarity
In this exercise, you have been given tfidf_matrix which contains the tf-idf vectors of a thousand documents. Your task is to generate the cosine similarity matrix for these vectors first using cosine_similarity and then, using linear_kernel.

We will then compare the computation times for both functions.

In [98]:
import time
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

movie_plots = ['The sun is the largest celestial body in the solar system',
          'The solar system consists of the sun and eight revolving planets',
          'Ra was the Egyptian Sun God',
          'The Pyramids were the pinnacle of Egyptian architecture',
          'The quick brown fox jumps over the lazy dog'
         ]
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(movie_plots)

# Record start time
start = time.time()

# Compute cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

# Print cosine similarity matrix
print(cosine_sim)

# Print time taken
print("Time taken: %s seconds" %(time.time() - start))

[[1.         0.36413198 0.18314713 0.18435251 0.16336438]
 [0.36413198 1.         0.15054075 0.21704584 0.11203887]
 [0.18314713 0.15054075 1.         0.21318602 0.07763512]
 [0.18435251 0.21704584 0.21318602 1.         0.12960089]
 [0.16336438 0.11203887 0.07763512 0.12960089 1.        ]]
Time taken: 0.014595508575439453 seconds


In [99]:
import time
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

movie_plots = ['The sun is the largest celestial body in the solar system',
          'The solar system consists of the sun and eight revolving planets',
          'Ra was the Egyptian Sun God',
          'The Pyramids were the pinnacle of Egyptian architecture',
          'The quick brown fox jumps over the lazy dog'
         ]
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(movie_plots)

# Record start time
start = time.time()

# Compute cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

# Print cosine similarity matrix
print(cosine_sim)

# Print time taken
print("Time taken: %s seconds" %(time.time() - start))

[[1.         0.36413198 0.18314713 0.18435251 0.16336438]
 [0.36413198 1.         0.15054075 0.21704584 0.11203887]
 [0.18314713 0.15054075 1.         0.21318602 0.07763512]
 [0.18435251 0.21704584 0.21318602 1.         0.12960089]
 [0.16336438 0.11203887 0.07763512 0.12960089 1.        ]]
Time taken: 0.003167390823364258 seconds


**Result**
Notice how both linear_kernel and cosine_similarity produced the same result. However, linear_kernel took less time to execute. When you're working with a very large amount of data and your vectors are in the tf-idf representation, it is good practice to default to linear_kernel to improve performance. (NOTE: If you see linear_kernel taking more time, it's because the dataset we're dealing with is extremely small and a single measurement with Python's time module is incapable of capturing such minute time differences accurately.)
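
For timings this small, repeating the call many times and keeping the best run usually gives a steadier number than a single `time.time()` difference. The following is a minimal sketch using the standard library's `timeit` module. `best_time` and `sum_of_squares` are hypothetical names, and the workload is a stand-in for the similarity computation.

```python
import timeit

def best_time(func, repeat=5, number=100):
    """Return the best average time per call, in seconds."""
    # Each entry is the total time for `number` calls; keep the fastest batch
    timings = timeit.repeat(func, repeat=repeat, number=number)
    return min(timings) / number

def sum_of_squares():
    # Stand-in workload; replace with the function you want to measure
    return sum(i * i for i in range(1000))

elapsed = best_time(sum_of_squares)
print(f"Best time per call: {elapsed:.2e} seconds")
```

In the notebook, the same helper could be called as `best_time(lambda: linear_kernel(tfidf_matrix, tfidf_matrix))` and likewise for `cosine_similarity`.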

#### Exercise: Plot recommendation engine
In this exercise, we will build a recommendation engine that suggests movies based on similarity of plot lines. You have been given a get_recommendations() function that takes in the title of a movie, a similarity matrix and an indices series as its arguments and outputs a list of most similar movies. indices has already been provided to you.

You have also been given a movie_plots Series that contains the plot lines of several movies. Your task is to generate a cosine similarity matrix for the tf-idf vectors of these plots.

We will then check the quality of our engine by generating recommendations for one of my favorite movies, The Dark Knight Rises.

Data: [Movie Data Kaggle](https://www.kaggle.com/kokur123/analysis-of-movie-data)

In [101]:
import pandas as pd
df = pd.read_csv('../dataset/tmdb_movie/tmdb_5000_movies.csv', header=0)
df.head()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124


In [116]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
import pandas as pd

df = pd.read_csv('../dataset/tmdb_movie/tmdb_5000_movies.csv', header=0, keep_default_na = False)

# Generate mapping between titles and index
indices = pd.Series(df.index, index=df['title']).drop_duplicates()

def get_recommendations(title, cosine_sim, indices):
    # Get index of movie that matches title
    idx = indices[title]
    # Sort the movies based on the similarity scores
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Get the scores for 10 most similar movies
    sim_scores = sim_scores[1:11]
    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]
    # Return the top 10 most similar movies
    return df['title'].iloc[movie_indices]

movie_plots = df['overview']

# Initialize the TfidfVectorizer 
tfidf = TfidfVectorizer(stop_words='english')

# Construct the TF-IDF matrix
tfidf_matrix = tfidf.fit_transform(movie_plots)

# Generate the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
 
# Generate recommendations 
print(get_recommendations('The Dark Knight Rises', cosine_sim, indices))

65                              The Dark Knight
299                              Batman Forever
428                              Batman Returns
1359                                     Batman
3854    Batman: The Dark Knight Returns, Part 2
119                               Batman Begins
2507                                  Slow Burn
9            Batman v Superman: Dawn of Justice
1181                                        JFK
210                              Batman & Robin
Name: title, dtype: object
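
The ranking step inside `get_recommendations` can be sketched on a toy similarity matrix without pandas. The titles, scores and the function name `top_n_similar` below are made up for illustration. Unlike the slice `sim_scores[1:11]`, this variant removes the query item by index, which stays correct even if another item ties with it at a score of 1.0.

```python
def top_n_similar(idx, sim_matrix, titles, n=2):
    # Pair every row index with its similarity to the query item
    scores = list(enumerate(sim_matrix[idx]))
    # Drop the query item itself by index rather than by rank position
    scores = [(i, s) for i, s in scores if i != idx]
    # Highest similarity first
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return [titles[i] for i, _ in scores[:n]]

# Made-up titles and a symmetric similarity matrix
titles = ["A", "B", "C", "D"]
sim_matrix = [
    [1.0, 0.8, 0.1, 0.3],
    [0.8, 1.0, 0.2, 0.4],
    [0.1, 0.2, 1.0, 0.9],
    [0.3, 0.4, 0.9, 1.0],
]

print(top_n_similar(0, sim_matrix, titles))  # ['B', 'D']
```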


#### Exercise: TED talk recommender
In this exercise, we will build a recommendation system that suggests TED Talks based on their transcripts. You have been given a get_recommendations() function that takes in the title of a talk, a similarity matrix and an indices series as its arguments, and outputs a list of most similar talks. indices has already been provided to you.

You have also been given a transcripts series that contains the transcripts of around 500 TED talks. Your task is to generate a cosine similarity matrix for the tf-idf vectors of the talk transcripts.

We will then generate recommendations for a talk titled '5 ways to kill your dreams' by Brazilian entrepreneur Bel Pesce.

In [123]:
import pandas as pd
df = pd.read_csv('../dataset/ted_talks/ted_main.csv', header=0)
df.head()

Unnamed: 0,comments,description,duration,event,film_date,languages,main_speaker,name,num_speaker,published_date,ratings,related_talks,speaker_occupation,tags,title,url,views
0,4553,Sir Ken Robinson makes an entertaining and pro...,1164,TED2006,1140825600,60,Ken Robinson,Ken Robinson: Do schools kill creativity?,1,1151367060,"[{'id': 7, 'name': 'Funny', 'count': 19645}, {...","[{'id': 865, 'hero': 'https://pe.tedcdn.com/im...",Author/educator,"['children', 'creativity', 'culture', 'dance',...",Do schools kill creativity?,https://www.ted.com/talks/ken_robinson_says_sc...,47227110
1,265,With the same humor and humanity he exuded in ...,977,TED2006,1140825600,43,Al Gore,Al Gore: Averting the climate crisis,1,1151367060,"[{'id': 7, 'name': 'Funny', 'count': 544}, {'i...","[{'id': 243, 'hero': 'https://pe.tedcdn.com/im...",Climate advocate,"['alternative energy', 'cars', 'climate change...",Averting the climate crisis,https://www.ted.com/talks/al_gore_on_averting_...,3200520
2,124,New York Times columnist David Pogue takes aim...,1286,TED2006,1140739200,26,David Pogue,David Pogue: Simplicity sells,1,1151367060,"[{'id': 7, 'name': 'Funny', 'count': 964}, {'i...","[{'id': 1725, 'hero': 'https://pe.tedcdn.com/i...",Technology columnist,"['computers', 'entertainment', 'interface desi...",Simplicity sells,https://www.ted.com/talks/david_pogue_says_sim...,1636292
3,200,"In an emotionally charged talk, MacArthur-winn...",1116,TED2006,1140912000,35,Majora Carter,Majora Carter: Greening the ghetto,1,1151367060,"[{'id': 3, 'name': 'Courageous', 'count': 760}...","[{'id': 1041, 'hero': 'https://pe.tedcdn.com/i...",Activist for environmental justice,"['MacArthur grant', 'activism', 'business', 'c...",Greening the ghetto,https://www.ted.com/talks/majora_carter_s_tale...,1697550
4,593,You've never seen data presented like this. Wi...,1190,TED2006,1140566400,48,Hans Rosling,Hans Rosling: The best stats you've ever seen,1,1151440680,"[{'id': 9, 'name': 'Ingenious', 'count': 3202}...","[{'id': 2056, 'hero': 'https://pe.tedcdn.com/i...",Global health expert; data visionary,"['Africa', 'Asia', 'Google', 'demo', 'economic...",The best stats you've ever seen,https://www.ted.com/talks/hans_rosling_shows_t...,12005869


In [124]:
import pandas as pd
ted_main = pd.read_csv('../dataset/ted_talks/ted_main.csv', header=0)
ted_transcript = pd.read_csv('../dataset/ted_talks/transcripts.csv', header=0)
ted_data = pd.concat([ted_main,ted_transcript], axis=1)
ted_data.head()

Unnamed: 0,comments,description,duration,event,film_date,languages,main_speaker,name,num_speaker,published_date,ratings,related_talks,speaker_occupation,tags,title,url,views,transcript,url.1
0,4553,Sir Ken Robinson makes an entertaining and pro...,1164,TED2006,1140825600,60,Ken Robinson,Ken Robinson: Do schools kill creativity?,1,1151367060,"[{'id': 7, 'name': 'Funny', 'count': 19645}, {...","[{'id': 865, 'hero': 'https://pe.tedcdn.com/im...",Author/educator,"['children', 'creativity', 'culture', 'dance',...",Do schools kill creativity?,https://www.ted.com/talks/ken_robinson_says_sc...,47227110,Good morning. How are you?(Laughter)It's been ...,https://www.ted.com/talks/ken_robinson_says_sc...
1,265,With the same humor and humanity he exuded in ...,977,TED2006,1140825600,43,Al Gore,Al Gore: Averting the climate crisis,1,1151367060,"[{'id': 7, 'name': 'Funny', 'count': 544}, {'i...","[{'id': 243, 'hero': 'https://pe.tedcdn.com/im...",Climate advocate,"['alternative energy', 'cars', 'climate change...",Averting the climate crisis,https://www.ted.com/talks/al_gore_on_averting_...,3200520,"Thank you so much, Chris. And it's truly a gre...",https://www.ted.com/talks/al_gore_on_averting_...
2,124,New York Times columnist David Pogue takes aim...,1286,TED2006,1140739200,26,David Pogue,David Pogue: Simplicity sells,1,1151367060,"[{'id': 7, 'name': 'Funny', 'count': 964}, {'i...","[{'id': 1725, 'hero': 'https://pe.tedcdn.com/i...",Technology columnist,"['computers', 'entertainment', 'interface desi...",Simplicity sells,https://www.ted.com/talks/david_pogue_says_sim...,1636292,"(Music: ""The Sound of Silence,"" Simon & Garfun...",https://www.ted.com/talks/david_pogue_says_sim...
3,200,"In an emotionally charged talk, MacArthur-winn...",1116,TED2006,1140912000,35,Majora Carter,Majora Carter: Greening the ghetto,1,1151367060,"[{'id': 3, 'name': 'Courageous', 'count': 760}...","[{'id': 1041, 'hero': 'https://pe.tedcdn.com/i...",Activist for environmental justice,"['MacArthur grant', 'activism', 'business', 'c...",Greening the ghetto,https://www.ted.com/talks/majora_carter_s_tale...,1697550,If you're here today — and I'm very happy that...,https://www.ted.com/talks/majora_carter_s_tale...
4,593,You've never seen data presented like this. Wi...,1190,TED2006,1140566400,48,Hans Rosling,Hans Rosling: The best stats you've ever seen,1,1151440680,"[{'id': 9, 'name': 'Ingenious', 'count': 3202}...","[{'id': 2056, 'hero': 'https://pe.tedcdn.com/i...",Global health expert; data visionary,"['Africa', 'Asia', 'Google', 'demo', 'economic...",The best stats you've ever seen,https://www.ted.com/talks/hans_rosling_shows_t...,12005869,"About 10 years ago, I took on the task to teac...",https://www.ted.com/talks/hans_rosling_shows_t...


In [126]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
import pandas as pd

ted_main = pd.read_csv('../dataset/ted_talks/ted_main.csv', header=0)
ted_transcript = pd.read_csv('../dataset/ted_talks/transcripts.csv', header=0)
ted_data = pd.concat([ted_main,ted_transcript], axis=1).dropna().reset_index(drop=True)

transcripts = ted_data['transcript']
# Generate mapping between titles and row positions in ted_data
indices = pd.Series(ted_data.index, index=ted_data['title']).drop_duplicates()

def get_recommendations(title, cosine_sim, indices):
    # Get index of talk that matches title
    idx = indices[title]
    # Sort the talks based on the similarity scores
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Get the scores for 10 most similar talks
    sim_scores = sim_scores[1:11]
    # Get the talk indices
    ted_indices = [i[0] for i in sim_scores]
    # Return the top 10 most similar talks
    return ted_data['title'].iloc[ted_indices]

# Initialize the TfidfVectorizer 
tfidf = TfidfVectorizer(stop_words='english')

# Construct the TF-IDF matrix
tfidf_matrix = tfidf.fit_transform(transcripts)

# Generate the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
 
# Generate recommendations 
print(get_recommendations('5 ways to kill your dreams', cosine_sim, indices))

2197    How my son's short life made a lasting difference
1473              4 pillars of college success in science
1148                                 The true cost of oil
2186      The inside story of the Paris climate agreement
1781           Why I love a country that once betrayed me
131                  How to educate leaders? Liberal arts
1848      What I learned from spending 31 days underwater
1634                              Invest in social change
1379                     How to "sketch" with electronics
973                         Try something new for 30 days
Name: title, dtype: object


### Beyond n-grams: word embeddings
#### The problem with BoW and tf-idf
`I am happy`
`I am joyous`
`I am sad`
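
The three sentences differ in a single word, so a bag-of-words model sees "happy" as exactly as far from "joyous" as from "sad". The sketch below illustrates this with hand-built count vectors. It uses whitespace tokenization as a simplifying assumption (scikit-learn's default tokenizer behaves differently, for example it drops single-character tokens), and `bow` and `cosine` are hypothetical helper names.

```python
import math
from collections import Counter

sentences = ["I am happy", "I am joyous", "I am sad"]

def bow(sentence):
    # Bag-of-words: count each lowercase, whitespace-separated token
    return Counter(sentence.lower().split())

def cosine(c1, c2):
    # Cosine similarity between two sparse count vectors
    dot = sum(c1[w] * c2[w] for w in c1)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2)

vectors = [bow(s) for s in sentences]
happy_joyous = cosine(vectors[0], vectors[1])
happy_sad = cosine(vectors[0], vectors[2])

# Both pairs share two of three tokens, so the scores are identical
print(happy_joyous, happy_sad)
```

Counting words alone cannot tell a synonym from an antonym, which is the gap word embeddings aim to close.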

#### Word embeddings
- Mapping words into an n-dimensional vector space
- Produced using deep learning and huge amounts of data
- Discern how similar two words are to each other
- Used to detect synonyms and antonyms
- Capture complex relationships
  - `King`-`Queen` relates same way as `Man`-`Woman`
  - `France`-`Paris` -> `Russia`-`Moscow`
- Dependent on the spaCy model; independent of the dataset you use
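
The `King`-`Queen` / `Man`-`Woman` relationship can be pictured as vector arithmetic. The sketch below uses made-up 2-D vectors (one axis loosely standing for "royalty", the other for "gender"). Real embeddings have hundreds of dimensions and are learned from data, so this is only an illustration of the idea and not output from any model.

```python
import math

# Made-up 2-D vectors: first axis ~ "royalty", second axis ~ "gender"
toy_vectors = {
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0,  1.0],
    "woman": [0.0, -1.0],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# king - man + woman
target = [
    k - m + w
    for k, m, w in zip(toy_vectors["king"], toy_vectors["man"], toy_vectors["woman"])
]

# Pick the word whose vector points in the direction closest to the target
best = max(toy_vectors, key=lambda word: cosine(target, toy_vectors[word]))
print(best)  # queen
```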

#### Word embeddings using spaCy

In [1]:
import spacy

# Load model and create Doc object
nlp = spacy.load('en_core_web_lg')
doc = nlp('I am happy')

# Generate word vectors for each token
for token in doc:
    print(token.vector)

[ 1.8733e-01  4.0595e-01 -5.1174e-01 -5.5482e-01  3.9716e-02  1.2887e-01
  4.5137e-01 -5.9149e-01  1.5591e-01  1.5137e+00 -8.7020e-01  5.0672e-02
  1.5211e-01 -1.9183e-01  1.1181e-01  1.2131e-01 -2.7212e-01  1.6203e+00
 -2.4884e-01  1.4060e-01  3.3099e-01 -1.8061e-02  1.5244e-01 -2.6943e-01
 -2.7833e-01 -5.2123e-02 -4.8149e-01 -5.1839e-01  8.6262e-02  3.0818e-02
 -2.1253e-01 -1.1378e-01 -2.2384e-01  1.8262e-01 -3.4541e-01  8.2611e-02
  1.0024e-01 -7.9550e-02 -8.1721e-01  6.5621e-03  8.0134e-02 -3.9976e-01
 -6.3131e-02  3.2260e-01 -3.1625e-02  4.3056e-01 -2.7270e-01 -7.6020e-02
  1.0293e-01 -8.8653e-02 -2.9087e-01 -4.7214e-02  4.6036e-02 -1.7788e-02
  6.4990e-02  8.8451e-02 -3.1574e-01 -5.8522e-01  2.2295e-01 -5.2785e-02
 -5.5981e-01 -3.9580e-01 -7.9849e-02 -1.0933e-02 -4.1722e-02 -5.5576e-01
  8.8707e-02  1.3710e-01 -2.9873e-03 -2.6256e-02  7.7330e-02  3.9199e-01
  3.4507e-01 -8.0130e-02  3.3451e-01  2.7063e-01 -2.4544e-02  7.2576e-02
 -1.8120e-01  2.3693e-01  3.9977e-01  4.5012e-01  2

#### Word similarities

In [2]:
doc = nlp("happy joyous sad")
for token1 in doc:
    for token2 in doc:
        print(token1.text, token2.text, token1.similarity(token2))
        

happy happy 1.0
happy joyous 0.5333031
happy sad 0.64389884
joyous happy 0.5333031
joyous joyous 1.0
joyous sad 0.43832767
sad happy 0.64389884
sad joyous 0.43832767
sad sad 1.0


#### Document similarities

In [5]:
# Generate doc objects
sent1 = nlp("I am happy")
sent2 = nlp("I am sad")
sent3 = nlp("I am joyous")

# Compute similarity between sent1 and sent2
sent1.similarity(sent2)

0.9492464724721577

In [6]:
sent1.similarity(sent3)

0.9239675481730458
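
Both scores are high because, by default (per the spaCy documentation, for models that ship word vectors), a `Doc` vector is the average of its token vectors and `Doc.similarity` is the cosine similarity of those averages. The shared tokens `I` and `am` therefore dominate the comparison. A minimal sketch with made-up 3-D vectors follows. `toy_vectors` and `doc_vector` are hypothetical and do not reproduce real spaCy values.

```python
import math

# Made-up vectors: "happy" and "sad" point in opposite directions
toy_vectors = {
    "i":     [1.0, 0.0,  0.0],
    "am":    [0.0, 1.0,  0.0],
    "happy": [0.0, 0.0,  1.0],
    "sad":   [0.0, 0.0, -1.0],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def doc_vector(sentence):
    # Average the token vectors component by component
    vecs = [toy_vectors[token] for token in sentence.lower().split()]
    return [sum(column) / len(vecs) for column in zip(*vecs)]

word_sim = cosine(toy_vectors["happy"], toy_vectors["sad"])
doc_sim = cosine(doc_vector("I am happy"), doc_vector("I am sad"))

# The sentence-level score is much higher than the word-level score
print(word_sim, doc_sim)
```

Even with word vectors that are exact opposites, the averaged sentence vectors remain positively correlated because two of the three tokens are identical.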

#### Exercise: Generating word vectors
In this exercise, we will generate the pairwise similarity scores of all the words in a sentence. The sentence is available as sent and has been printed to the console for your convenience.

In [7]:
sent = 'I like apples and oranges'

# Create the doc object
doc = nlp(sent)

# Compute pairwise similarity scores
for token1 in doc:
  for token2 in doc:
    print(token1.text, token2.text, token1.similarity(token2))

I I 1.0
I like 0.55549127
I apples 0.20442723
I and 0.31607857
I oranges 0.18824081
like I 0.55549127
like like 1.0
like apples 0.32987145
like and 0.5267485
like oranges 0.27717474
apples I 0.20442723
apples like 0.32987145
apples apples 1.0
apples and 0.2409773
apples oranges 0.77809423
and I 0.31607857
and like 0.5267485
and apples 0.2409773
and and 1.0
and oranges 0.19245945
oranges I 0.18824081
oranges like 0.27717474
oranges apples 0.77809423
oranges and 0.19245945
oranges oranges 1.0
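
Because similarity is symmetric and every word is fully similar to itself, the nested loop prints each informative pair twice plus five trivial self-pairs. A sketch of how `itertools.combinations` enumerates each unordered pair once is shown below on plain strings. In the notebook the same call could, presumably, be applied to the tokens of `doc`, with `token1.similarity(token2)` computed per pair.

```python
from itertools import combinations

words = "I like apples and oranges".split()

# Each unordered pair appears exactly once; self-pairs are skipped
pairs = list(combinations(words, 2))
for w1, w2 in pairs:
    print(w1, w2)

print(len(pairs))  # 10 pairs instead of 25 loop iterations
```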


#### Exercise: Computing similarity of Pink Floyd songs
In this final exercise, you have been given lyrics of three songs by the British band Pink Floyd, namely 'High Hopes', 'Hey You' and 'Mother'. The lyrics to these songs are available as hopes, hey and mother respectively.

Your task is to compute the pairwise similarity between mother and hopes, and mother and hey.

In [8]:
mother = "\nMother do you think they'll drop the bomb?\nMother do you think they'll like this song?\nMother do you think they'll try to break my balls?\nOoh, ah\nMother should I build the wall?\nMother should I run for President?\nMother should I trust the government?\nMother will they put me in the firing mine?\nOoh ah,\nIs it just a waste of time?\nHush now baby, baby, don't you cry.\nMama's gonna make all your nightmares come true.\nMama's gonna put all her fears into you.\nMama's gonna keep you right here under her wing.\nShe won't let you fly, but she might let you sing.\nMama's gonna keep baby cozy and warm.\nOoh baby, ooh baby, ooh baby,\nOf course mama's gonna help build the wall.\nMother do you think she's good enough, for me?\nMother do you think she's dangerous, to me?\nMother will she tear your little boy apart?\nOoh ah,\nMother will she break my heart?\nHush now baby, baby don't you cry.\nMama's gonna check out all your girlfriends for you.\nMama won't let anyone dirty get through.\nMama's gonna wait up until you get in.\nMama will always find out where you've been.\nMama's gonna keep baby healthy and clean.\nOoh baby, ooh baby, ooh baby,\nYou'll always be baby to me.\nMother, did it need to be so high?\n"
hopes = "\nBeyond the horizon of the place we lived when we were young\nIn a world of magnets and miracles\nOur thoughts strayed constantly and without boundary\nThe ringing of the division bell had begun\nAlong the Long Road and on down the Causeway\nDo they still meet there by the Cut\nThere was a ragged band that followed in our footsteps\nRunning before times took our dreams away\nLeaving the myriad small creatures trying to tie us to the ground\nTo a life consumed by slow decay\nThe grass was greener\nThe light was brighter\nWhen friends surrounded\nThe nights of wonder\nLooking beyond the embers of bridges glowing behind us\nTo a glimpse of how green it was on the other side\nSteps taken forwards but sleepwalking back again\nDragged by the force of some in a tide\nAt a higher altitude with flag unfurled\nWe reached the dizzy heights of that dreamed of world\nEncumbered forever by desire and ambition\nThere's a hunger still unsatisfied\nOur weary eyes still stray to the horizon\nThough down this road we've been so many times\nThe grass was greener\nThe light was brighter\nThe taste was sweeter\nThe nights of wonder\nWith friends surrounded\nThe dawn mist glowing\nThe water flowing\nThe endless river\nForever and ever\n"
hey = "\nHey you, out there in the cold\nGetting lonely, getting old\nCan you feel me?\nHey you, standing in the aisles\nWith itchy feet and fading smiles\nCan you feel me?\nHey you, don't help them to bury the light\nDon't give in without a fight\nHey you out there on your own\nSitting naked by the phone\nWould you touch me?\nHey you with you ear against the wall\nWaiting for someone to call out\nWould you touch me?\nHey you, would you help me to carry the stone?\nOpen your heart, I'm coming home\nBut it was only fantasy\nThe wall was too high\nAs you can see\nNo matter how he tried\nHe could not break free\nAnd the worms ate into his brain\nHey you, out there on the road\nAlways doing what you're told\nCan you help me?\nHey you, out there beyond the wall\nBreaking bottles in the hall\nCan you help me?\nHey you, don't tell me there's no hope at all\nTogether we stand, divided we fall\n"

# Create Doc objects
mother_doc = nlp(mother)
hopes_doc = nlp(hopes)
hey_doc = nlp(hey)

# Print similarity between mother and hopes
print(mother_doc.similarity(hopes_doc))

# Print similarity between mother and hey
print(mother_doc.similarity(hey_doc))

0.8653562508450858
0.9595267703981097


### Review
- Basic features (characters, words, mentions, etc.)
- Readability scores
- Tokenization and lemmatization
- Text cleaning
- Part-of-speech tagging & named entity recognition
- n-gram modeling
- tf-idf
- Cosine similarity
- Word embedding

#### Further resources
- Advanced NLP with spaCy
- Deep Learning in Python