# Question: What are the steps in NLP pre-processing?

In NLP, as in any data science problem, questions are answered from data, and here the data is text. Before machine learning algorithms and techniques can be applied, the text must be transformed into numerical features. The steps that perform this transformation are collectively called NLP pre-processing.

## 1. Cleaning
Cleaning removes the less useful parts of the text: stopwords are dropped, capitalization is normalized, and unwanted characters and other details are handled.

### a) Capitalization
Text often has a variety of capitalization reflecting the beginning of sentences, proper nouns, or emphasis. The most common approach is to reduce everything to lower case for simplicity, but it is important to remember that some words change meaning when lowercased, for example “US” becoming “us”.

### b) Stopword
A major portion of the words in a text connect parts of a sentence rather than convey subjects, objects or intent. Words like “the” or “and” can be removed by comparing the text to a list of stopwords.

IN:
['He', 'did', 'not', 'try', 'to', 'navigate', 'after', 'the', 'first', 'bold', 'flight', ',', 'for', 'the', 'reaction', 'had', 'taken', 'something', 'out', 'of', 'his', 'soul', '.']

OUT:
['try', 'navigate', 'first', 'bold', 'flight', ',', 'reaction', 'taken', 'something', 'soul', '.']

Sometimes we can create our own stopword dictionary manually or utilize prebuilt libraries depending on the sensitivity required.
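As a minimal sketch of the manual approach, the stopword list below is hand-made and hypothetical, just large enough to reproduce the example above. The function name `remove_stopwords` and the set `CUSTOM_STOPWORDS` are illustrative, not part of any library.

```python
# Hypothetical hand-made stopword list, just large enough for this sentence.
CUSTOM_STOPWORDS = {"he", "did", "not", "to", "after", "the", "for",
                    "had", "out", "of", "his"}

def remove_stopwords(tokens, stopwords=CUSTOM_STOPWORDS):
    """Keep only tokens whose lowercase form is not in the stopword set."""
    return [tok for tok in tokens if tok.lower() not in stopwords]

tokens = ['He', 'did', 'not', 'try', 'to', 'navigate', 'after', 'the',
          'first', 'bold', 'flight', ',', 'for', 'the', 'reaction', 'had',
          'taken', 'something', 'out', 'of', 'his', 'soul', '.']
filtered = remove_stopwords(tokens)
print(filtered)
```

Comparing on the lowercase form means that capitalized stopwords such as "He" are removed as well, while the surviving tokens keep their original case.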

### c) Tokenization

Tokenization describes splitting paragraphs into sentences, or sentences into individual words. For the former, Sentence Boundary Disambiguation (SBD) can be applied to create a list of individual sentences. This relies on pre-trained, language-specific algorithms like the Punkt models from NLTK.

For the latter, the text is most commonly split on white space, with punctuation separated into its own tokens, for example:

IN:
"He did not try to navigate after the first bold flight, for the reaction had taken something out of his soul."

OUT:
['He', 'did', 'not', 'try', 'to', 'navigate', 'after', 'the', 'first', 'bold', 'flight', ',', 'for', 'the', 'reaction', 'had', 'taken', 'something', 'out', 'of', 'his', 'soul', '.']
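The word-level split above can be approximated with a single regular expression. This is only a sketch: the helper name `simple_tokenize` is made up here, and real tokenizers (such as the ones in NLTK or spaCy) handle many more cases, for example contractions and abbreviations.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

text = ("He did not try to navigate after the first bold flight, "
        "for the reaction had taken something out of his soul.")
tokens = simple_tokenize(text)
print(tokens)
```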

### d) Stemming
Stemming is the process of reducing words (generally inflected or derived forms) to their word stem or root form. The objective of stemming is to reduce related words to the same stem even if the stem is not a dictionary word.
For example, in the English language:

* beautiful and beautifully are stemmed to beauti 
* good, better and best are stemmed to good, better and best respectively

In [5]:
#!pip install stemming
from stemming.porter2 import stem
stem("beautifully")

'beauti'

### e) Lemmatisation
Lemmatisation is the process of reducing a group of words to their lemma, or dictionary form. It takes into account things like POS (part of speech), the meaning of the word in the sentence, and the meaning of the word in nearby sentences before reducing the word to its lemma. For example, in the English language:

* beautiful and beautifully are lemmatised to beautiful and beautifully respectively.
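* good, better and best are ideally lemmatised to good, good and good respectively. A given tool may not resolve every form: in the spaCy output below, better is left unchanged.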


In [10]:
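#!pip install spacy
#!python -m spacy download en
import spacy
# "en" is the spaCy 2.x shortcut; in spaCy 3.x load the model by its full name, e.g. "en_core_web_sm"
nlp=spacy.load("en")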
doc="good better best"

for token in nlp(doc):
    print(token,token.lemma_)


good good
better better
best good


**The following code converts text to lower case and removes irrelevant characters and stopwords, which completes the cleaning step.**

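In [None]:
# Corpus cleaning
import re
import nltk
from nltk.corpus import stopwords
from textblob import Word  # requires: pip install textblob

nltk.download('stopwords')  # stopword list used below
nltk.download('wordnet')    # WordNet data used by Word.lemmatize()
STOPWORDS = set(stopwords.words('english'))

def clean_str(string):
    """
    Tokenization/string cleaning for datasets.
    Original taken from https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py
    """
    string = re.sub(r"^b", "", string)      # drop a leading 'b' (presumably left over from a stringified bytes literal such as b'...')
    string = re.sub(r"\\n ", "", string)    # drop literal "\n " sequences
    # drop common contraction suffixes
    string = re.sub(r"\'s", "", string)
    string = re.sub(r"\'ve", "", string)
    string = re.sub(r"n\'t", "", string)
    string = re.sub(r"\'re", "", string)
    string = re.sub(r"\'d", "", string)
    string = re.sub(r"\'ll", "", string)
    # drop or isolate punctuation
    string = re.sub(r",", "", string)
    string = re.sub(r"!", " ! ", string)
    string = re.sub(r"\(", "", string)
    string = re.sub(r"\)", "", string)
    string = re.sub(r"\?", "", string)
    string = re.sub(r"'", "", string)
    string = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", string)  # replace any remaining special character with a space
    string = re.sub(r"[0-9]\w+|[0-9]","", string)           # drop numbers and tokens that start with a digit
    string = re.sub(r"\s{2,}", " ", string)                 # collapse repeated whitespace
    string = string.lower()  # lowercase first so that capitalized stopwords ("The") are removed too
    string = ' '.join(Word(word).lemmatize() for word in string.split() if word not in STOPWORDS) # delete stopwords from text and lemmatize the rest

    return string.strip()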

In [None]:
# Cleaning text
!pip install textblob
import nltk
nltk.download('wordnet')
from textblob import Word
df['comment_text_clean'] = df['comment_text'].apply(lambda x : clean_str(x))  # calling clean for all rows

## 2. Word Embedding/Text Vectors
Word embeddings are techniques that represent natural language as vectors of real numbers. They are useful because machine learning algorithms operate on numbers and cannot process raw text directly. Word embeddings capture the meaning of words and the relationships between them using real numbers. A word or a phrase is represented by a vector of fixed dimension, say of length 100. **Word2Vec** and **GloVe** are the most common models used to convert text to vectors.

For example, the word “man” might be represented as a 5-dimensional vector:
<img src="images/word-vector.png" alt="Word Vector" />
where each of these numbers is the magnitude of the word in a particular direction.
<img src="images/Word-Vectors-direction.png" alt="Word Vector" />

**Implementation:** Here is how you can obtain the pre-trained word vector of a word using the gensim package.

Download the Google News pre-trained word vectors from [here](https://drive.google.com/uc?id=0B7XkCwpI5KDYNlNUTTlSS21pQmM&export=download).

In [16]:
#!pip install gensim
from gensim.models.keyedvectors import KeyedVectors
word_vectors=KeyedVectors.load_word2vec_format('dataset/GoogleNews-vectors-negative300.bin',binary=True)
word_vectors['human']

array([ 5.59082031e-02,  9.22851562e-02,  1.07910156e-01,  2.83203125e-01,
       -2.43164062e-01,  1.90429688e-02,  4.08203125e-01, -3.17382812e-02,
       -4.78515625e-02,  6.34765625e-02, -9.32617188e-02, -4.46777344e-02,
       -2.41210938e-01, -1.58203125e-01, -5.83496094e-02,  2.51953125e-01,
       -3.24707031e-02,  1.00097656e-01, -4.56542969e-02,  1.35742188e-01,
       -2.07031250e-01, -3.73046875e-01,  4.39453125e-02,  4.24804688e-02,
        6.93359375e-02, -2.42187500e-01, -2.75390625e-01,  1.95312500e-01,
        2.26562500e-01, -1.90429688e-01, -2.35351562e-01, -5.56640625e-02,
       -1.25000000e-01, -8.78906250e-02, -2.33398438e-01,  9.61914062e-02,
       -4.83398438e-02,  4.54101562e-02,  9.81445312e-02,  5.76171875e-02,
       -4.17480469e-02,  2.02148438e-01, -9.03320312e-02,  2.75390625e-01,
       -6.34765625e-02,  4.93164062e-02,  2.92968750e-02,  2.57812500e-01,
        1.32812500e-01,  7.42187500e-02,  6.64062500e-02, -1.37695312e-01,
       -1.73828125e-01,  

**Implementation:** Here is how you can train your own word vectors using gensim

In [8]:
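import gensim
sentence=[['first','sentence'],['second','sentence']]
# `size` is the gensim 3.x name of this parameter; gensim 4.x renamed it to `vector_size`
model = gensim.models.Word2Vec(sentence, min_count=1,size=300,workers=4)
# read vectors through model.wv; indexing the model directly is deprecated
print(model.wv['sentence'][0])
print(model.wv['sentence'][1])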

0.0010816038
0.0007006851




## 3. Part-Of-Speech Tagging
In simple terms, Part-Of-Speech tagging is the process of marking the words in a sentence as nouns, verbs, adjectives, adverbs, etc. For example, the sentence below is tagged first with spaCy and then with NLTK:

In [17]:
# POS using Spacy
#!pip install spacy
#!python -m spacy download en 
nlp=spacy.load('en')
sentence="A look at what lies ahead for a Trump National Golf Club housekeeper who disclosed her status as an undocumented immigrant."
for token in nlp(sentence):
   print(token,token.pos_)

A DET
look NOUN
at ADP
what NOUN
lies VERB
ahead ADV
for ADP
a DET
Trump PROPN
National PROPN
Golf PROPN
Club PROPN
housekeeper NOUN
who NOUN
disclosed VERB
her ADJ
status NOUN
as ADP
an DET
undocumented ADJ
immigrant NOUN
. PUNCT


In [18]:
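# POS using NLTK
import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
# Tokenize the sentence, then tag each token.
# Note: the output shown below was produced while `sentence` held the
# automatic-summarization text from the Text Summarisation section, not the sentence above.
tokens = nltk.word_tokenize(sentence)
nltk.pos_tag(tokens)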



[('Automatic', 'JJ'),
 ('summarization', 'NN'),
 ('is', 'VBZ'),
 ('the', 'DT'),
 ('process', 'NN'),
 ('of', 'IN'),
 ('shortening', 'VBG'),
 ('a', 'DT'),
 ('text', 'NN'),
 ('document', 'NN'),
 ('with', 'IN'),
 ('software', 'NN'),
 (',', ','),
 ('in', 'IN'),
 ('order', 'NN'),
 ('to', 'TO'),
 ('create', 'VB'),
 ('a', 'DT'),
 ('summary', 'JJ'),
 ('with', 'IN'),
 ('the', 'DT'),
 ('major', 'JJ'),
 ('points', 'NNS'),
 ('of', 'IN'),
 ('the', 'DT'),
 ('original', 'JJ'),
 ('document', 'NN'),
 ('.', '.'),
 ('Technologies', 'NNS'),
 ('that', 'WDT'),
 ('can', 'MD'),
 ('make', 'VB'),
 ('a', 'DT'),
 ('coherent', 'NN'),
 ('summary', 'JJ'),
 ('take', 'NN'),
 ('into', 'IN'),
 ('account', 'NN'),
 ('variables', 'NNS'),
 ('such', 'JJ'),
 ('as', 'IN'),
 ('length', 'NN'),
 (',', ','),
 ('writing', 'VBG'),
 ('style', 'NN'),
 ('and', 'CC'),
 ('syntax.Automatic', 'JJ'),
 ('data', 'NNS'),
 ('summarization', 'NN'),
 ('is', 'VBZ'),
 ('part', 'NN'),
 ('of', 'IN'),
 ('machine', 'NN'),
 ('learning', 'NN'),
 ('and', '

## 4. Named Entity Recognition
Named Entity Recognition (NER) is the task of identifying entities in a sentence and classifying them into categories such as person, organisation, date, location, time, etc. For example, an NER system would take in the sentence below and label its tokens as follows:

In [19]:
import spacy
nlp=spacy.load('en')
sentence="Ram of Apple Inc. travelled to Sydney on 5th October 2017"
for token in nlp(sentence):
   print(token, token.ent_type_)

Ram 
of 
Apple ORG
Inc. ORG
travelled 
to 
Sydney GPE
on 
5th DATE
October DATE
2017 DATE


# Question: What is distillation?

## Distillation 
Distillation condenses an unstructured or structured data source into a compact extracted representation, with weights that reflect the importance and relevance of each part.

## 1. Sentiment Analysis
Sentiment analysis covers a broad range of subjective analyses that use natural language processing techniques to perform tasks such as identifying the sentiment of a customer review, detecting positive or negative feeling in a sentence, or judging mood from voice or written text.

In [9]:
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sentences = ["VADER is smart, handsome, and funny.", # positive sentence example
"VADER is smart, handsome, and funny!"] # punctuation emphasis handled correctly (sentiment intensity adjusted)
paragraph = "It was one of the worst movies I've seen, despite good reviews. \
 Unbelievably bad acting!! Poor direction. VERY poor production. \
 The movie was bad. Very bad movie. VERY bad movie. VERY BAD movie. VERY BAD movie!"
from nltk import tokenize
lines_list = tokenize.sent_tokenize(paragraph)
sentences.extend(lines_list)
sentences



['VADER is smart, handsome, and funny.',
 'VADER is smart, handsome, and funny!',
 "It was one of the worst movies I've seen, despite good reviews.",
 'Unbelievably bad acting!!',
 'Poor direction.',
 'VERY poor production.',
 'The movie was bad.',
 'Very bad movie.',
 'VERY bad movie.',
 'VERY BAD movie.',
 'VERY BAD movie!']

In [16]:
sid = SentimentIntensityAnalyzer()
for sentence in sentences:
     print(sentence)
     ss = sid.polarity_scores(sentence)
     for k in sorted(ss):
         print('{0}: {1}, '.format(k, ss[k]), end='')
     print() #negation-Contradiction

VADER is smart, handsome, and funny.
compound: 0.8316, neg: 0.0, neu: 0.254, pos: 0.746, 
VADER is smart, handsome, and funny!
compound: 0.8439, neg: 0.0, neu: 0.248, pos: 0.752, 
It was one of the worst movies I've seen, despite good reviews.
compound: -0.7584, neg: 0.394, neu: 0.606, pos: 0.0, 
Unbelievably bad acting!!
compound: -0.6572, neg: 0.686, neu: 0.314, pos: 0.0, 
Poor direction.
compound: -0.4767, neg: 0.756, neu: 0.244, pos: 0.0, 
VERY poor production.
compound: -0.6281, neg: 0.674, neu: 0.326, pos: 0.0, 
The movie was bad.
compound: -0.5423, neg: 0.538, neu: 0.462, pos: 0.0, 
Very bad movie.
compound: -0.5849, neg: 0.655, neu: 0.345, pos: 0.0, 
VERY bad movie.
compound: -0.6732, neg: 0.694, neu: 0.306, pos: 0.0, 
VERY BAD movie.
compound: -0.7398, neg: 0.724, neu: 0.276, pos: 0.0, 
VERY BAD movie!
compound: -0.7616, neg: 0.735, neu: 0.265, pos: 0.0, 


## 2. Semantic Text Similarity

Semantic Text Similarity is the process of analysing the similarity between two pieces of text with respect to their meaning and essence rather than their syntax. Note also that similarity is different from relatedness.<br>
Words can be similar in two ways: **lexically** and **semantically**. Words are lexically similar if they have a *similar character sequence*. Words are semantically similar if they mean the *same thing*, are opposites of each other, are used in the same way or in the *same context*, or one is a type of the other.
<br>
### > Lexical Similarity 
#### 1. String-Based Similarity 
String-based measures operate on string sequences and character composition. <br>
- **Character-Based Similarity -** An n-gram is a sub-sequence of n items from a given sequence of text. **N-gram** similarity algorithms compare the n-grams of characters or words in two strings. The similarity is computed by dividing the number of shared n-grams by the maximal number of n-grams.<br>
- **Term-based Similarity -** The best known measure is **cosine similarity**, which measures the similarity between two vectors of an inner product space as the cosine of the angle between them.
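
Both string-based measures can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions: n-grams are character bigrams collected into sets, and the term vectors for cosine similarity are raw word counts. The function names are made up for this sketch.

```python
import math
from collections import Counter

def char_ngrams(text, n=2):
    """Return the set of character n-grams in a string."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def ngram_similarity(a, b, n=2):
    """Number of shared n-grams divided by the size of the larger n-gram set."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / max(len(grams_a), len(grams_b))

def cosine_similarity(a, b):
    """Cosine of the angle between the word-count vectors of two texts."""
    counts_a, counts_b = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

print(ngram_similarity("night", "nacht"))   # 0.25: only "ht" is shared, out of 4 bigrams
print(cosine_similarity("the movie was bad", "the movie was very bad"))
```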
<br>

### > Semantic Similarity 
#### 1. Corpus-Based
Corpus-based measures determine the similarity between words according to information gained from large corpora. A corpus is a large collection of written or spoken texts that is used for language research.<br>
- **Latent Semantic Analysis (LSA) -** is the most popular corpus-based similarity technique. LSA assumes that words that are close in meaning will occur in similar pieces of text. A matrix containing word counts per paragraph (rows represent unique words and columns represent each paragraph) is constructed from a large piece of text, and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of columns while preserving the similarity structure among rows. Words are then compared by taking the cosine of the angle between the two vectors formed by any two rows.
- **Probabilistic latent semantic analysis (pLSA) -** also known as probabilistic latent semantic indexing, is a statistical technique for the analysis of two-mode and co-occurrence data.
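
The LSA steps described above can be sketched with scikit-learn, assuming a tiny made-up corpus of four "paragraphs". The corpus, the choice of two latent dimensions and the helper `word_cosine` are all illustrative; a real application would use a far larger corpus.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

# Hypothetical toy "paragraphs": two about sailing, two about finance
paragraphs = [
    "the ship sailed across the sea",
    "the boat sailed on the sea",
    "the stock market fell sharply",
    "investors sold stock as the market fell",
]

vectorizer = CountVectorizer()
doc_term = vectorizer.fit_transform(paragraphs)  # paragraphs x words
word_doc = doc_term.T                            # words x paragraphs (rows = words)

# SVD reduces the number of columns from 4 paragraphs to 2 latent dimensions
svd = TruncatedSVD(n_components=2, random_state=0)
word_vectors = svd.fit_transform(word_doc)

index = vectorizer.vocabulary_  # word -> row index

def word_cosine(w1, w2):
    """Cosine of the angle between the reduced row vectors of two words."""
    v1, v2 = word_vectors[index[w1]], word_vectors[index[w2]]
    return float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))

print(word_cosine("ship", "boat"))
print(word_cosine("ship", "stock"))
```

On this toy corpus "ship" and "boat" never co-occur, yet they end up closer to each other than "ship" and "stock" because they appear in similar paragraphs.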

#### 2. Knowledge-Based 
Knowledge-based measures determine the degree of similarity between words using information derived from semantic networks.<br>
- **Vector -** This measure creates a co-occurrence matrix for each word used in the WordNet glosses from a given corpus, and then represents each gloss/concept with a vector that is the average of these co-occurrence vectors. The most popular packages that cover knowledge-based similarity measures are **WordNet::Similarity** and the **Natural Language Toolkit (NLTK)**.

## 3.  Text Summarisation
Text summarisation is the process of shortening a text by identifying its important points and creating a summary from them. The goal is to retain as much information as possible while shortening the text as much as possible, without altering its meaning.<br>
Here is how you can quickly summarise your text using the gensim package.

In [17]:
from gensim.summarization import summarize  # available in gensim 3.x; the summarization module was removed in gensim 4.0
sentence="Automatic summarization is the process of shortening a text document with software, in order to create a summary with the major points of the original document. Technologies that can make a coherent summary take into account variables such as length, writing style and syntax.Automatic data summarization is part of machine learning and data mining. The main idea of summarization is to find a subset of data which contains the information of the entire set. Such techniques are widely used in industry today. Search engines are an example; others include summarization of documents, image collections and videos. Document summarization tries to create a representative summary or abstract of the entire document, by finding the most informative sentences, while in image summarization the system finds the most representative and important (i.e. salient) images. For surveillance videos, one might want to extract the important events from the uneventful context.There are two general approaches to automatic summarization: extraction and abstraction. Extractive methods work by selecting a subset of existing words, phrases, or sentences in the original text to form the summary. In contrast, abstractive methods build an internal semantic representation and then use natural language generation techniques to create a summary that is closer to what a human might express. Such a summary might include verbal innovations. Research to date has focused primarily on extractive methods, which are appropriate for image collection summarization and video summarization."
summarize(sentence)

'Document summarization tries to create a representative summary or abstract of the entire document, by finding the most informative sentences, while in image summarization the system finds the most representative and important (i.e. salient) images.\nExtractive methods work by selecting a subset of existing words, phrases, or sentences in the original text to form the summary.'

## 4. Count / Density

Counts are among the more basic tools for feature engineering: adding word counts, sentence counts, punctuation counts and industry-specific word counts can greatly help in prediction or classification. 

![Word Count](images/word-count-density.png)
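
A minimal sketch of such count features follows. The function name `count_features`, the sentence-splitting rule and the example `domain_terms` are assumptions made for illustration only.

```python
import re
import string

def count_features(text, domain_terms=()):
    """Return simple count-based features for one piece of text."""
    words = re.findall(r"\w+", text.lower())
    # Naive sentence split on ., ! and ?; empty fragments are discarded
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "word_count": len(words),
        "sentence_count": len(sentences),
        "punctuation_count": sum(ch in string.punctuation for ch in text),
        "domain_term_count": sum(w in domain_terms for w in words),
    }

features = count_features(
    "Server down! Please restart the server. Is the outage resolved?",
    domain_terms={"server", "outage"},  # hypothetical industry-specific vocabulary
)
print(features)
```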

## 5.  Topic Modeling
**Latent Dirichlet Allocation (LDA)** is an example of a topic model and is used to assign the text in a document to particular topics. It builds a topics-per-document model and a words-per-topic model, both modeled as Dirichlet distributions.
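
As a rough sketch, scikit-learn's `LatentDirichletAllocation` can fit both distributions on a toy corpus. The four documents and the choice of two topics are assumptions for illustration; meaningful topics need far more text.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy corpus
docs = [
    "the team won the football match",
    "the player scored a goal in the match",
    "the bank raised interest rates",
    "investors worry about interest rates and inflation",
]

counts = CountVectorizer(stop_words="english").fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # topics-per-document: one row per document
topic_words = lda.components_           # words-per-topic: one row per topic (unnormalized weights)

print(doc_topics.shape, topic_words.shape)
print(doc_topics.round(2))
```

Each row of `doc_topics` sums to 1 and can be read as the mixture of topics in that document.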

## 6. Author/Sender Importance 

The author, sender or writer of the text can also be an important addition to the feature list. Importance can be ranked on a numeric scale such as 1-10.

## 7. Severity, Priority, Urgency, Importance 

Depending on the context, severity, priority, urgency, and importance can be great additions to the feature list fed into machine learning algorithms. Importance can be ranked on a numeric scale such as 1-10.

## 8. Context 

The context of a text, such as historical or scientific, can determine its importance, so these contexts can be labeled and used as features. 

# Question: Describe the difference between CountVectorizer, tf-idf and word embeddings.

## Word Embeddings
Word embeddings are text converted into numbers. The same text can have different numerical representations depending on the embedding technique used.
More formally, a word embedding maps each word to a vector using a dictionary.

A simple word embedding example:

Sentence = "Word Embeddings are Word converted into numbers"<br>
A dictionary can be the list of all unique words in the sentence.<br>
Dictionary = ['Word', 'Embeddings', 'are', 'converted', 'into', 'numbers']<br>

Let's represent the word 'Embeddings' in vector form using this dictionary (a one-hot vector):<br>
[0,1,0,0,0,0]
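
The same one-hot lookup can be written in a few lines of Python. The helper name `one_hot` is illustrative.

```python
sentence = "Word Embeddings are Word converted into numbers"

# Dictionary: unique words in order of first appearance
dictionary = list(dict.fromkeys(sentence.split()))

def one_hot(word, dictionary):
    """Return a vector with 1 at the word's dictionary position and 0 elsewhere."""
    return [1 if entry == word else 0 for entry in dictionary]

vector = one_hot("Embeddings", dictionary)
print(dictionary)
print(vector)  # [0, 1, 0, 0, 0, 0]
```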

### Types of Word Embeddings
1. Frequency based Embedding
    * Count Vector
    * TF-IDF Vector
2. Prediction based Embedding
    * CBOW (Continuous Bag of words)
    * Skip – Gram model
3. Pre-trained word vectors
    * Word2Vec
4. Training own word vectors

I will analyze Count Vector and TF-IDF Vector in detail, as the question asks.

### Count Vector
CountVectorizer simply counts the occurrences of each word in its vocabulary. It can lowercase letters and disregard punctuation and stopwords, but it cannot lemmatize or stem on its own.

Let us understand this using a simple example.

D1: He is a lazy boy. She is also lazy.

D2: Neel is a lazy person.

The dictionary may be built as the list of unique tokens (words) in the corpus, here after dropping common words such as 'is', 'a' and 'also' = ['He', 'She', 'lazy', 'boy', 'Neel', 'person']

Here, D=2 (documents), N=6 (unique tokens)

The count matrix M of size 2 X 6 will be represented as:

 |  | He | She | lazy | boy | Neel | person | 
 | --- | --- | --- | --- | --- | --- | --- | 
 | D1 | 1 | 1 | 2 | 1 | 0 | 0 | 
 | D2 | 0 | 0 | 1 | 0 | 1 | 1 | 


sklearn CountVectorizer implements both tokenization and occurrence counting in a single class.

In [33]:
import pandas as pd
txt = ['His smile was not perfect', 'His smile was not not not not perfect.', 'she not sang']
from sklearn.feature_extraction.text import CountVectorizer
# Initialize a CountVectorizer object: count_vectorizer
count_vec = CountVectorizer(stop_words="english", analyzer='word', 
                            ngram_range=(1, 1), max_df=1.0, min_df=1, max_features=None)

# Transforms the data into a bag of words
count_train = count_vec.fit(txt) #text can be corpus text.tolist()
bag_of_words = count_vec.transform(txt) # can also use fit_transform(text) # vector features

print("vocabulary:\n{}".format(count_vec.vocabulary_))
# Print the first 10 features of the count_vec
#print("Every feature:\n{}".format(count_vec.get_feature_names()))
#print("\nEvery 3rd feature:\n{}".format(count_vec.get_feature_names()[::3]))

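# vocabulary_ maps each word to its column index, so order the column labels by that index
feature_names = sorted(count_vec.vocabulary_, key=count_vec.vocabulary_.get)
textmatrix = pd.DataFrame(bag_of_words.toarray(),columns=feature_names)
textmatrix.head(5)

vocabulary:
{'smile': 2, 'perfect': 0, 'sang': 1}


Unnamed: 0,perfect,sang,smile
0,1,0,1
1,1,0,1
2,0,1,0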


### TfidfVectorizer
tf-idf reduces the impact of tokens that occur very frequently across documents and are therefore empirically less informative than features that occur in only a small fraction of the training corpus. 
#### tf-idf(t, d) = tf(t, d) * idf(t)
* tf(t, d) = the term frequency, the number of times term 't' appears in document 'd'
* idf(t) = the inverse document frequency, which shrinks as the number of documents containing term 't' grows. With `smooth_idf=False`, as in the example below, scikit-learn computes idf(t) = ln(n / df(t)) + 1, where n is the total number of documents and df(t) is the number of documents that contain 't'

In [29]:
#TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer
txt1 = ['His smile was not perfect', 'His smile was not not not not perfect', 'she not sang']
tf = TfidfVectorizer(smooth_idf=False, sublinear_tf=False, norm=None, analyzer='word')
txt_fitted = tf.fit(txt1)
txt_transformed = txt_fitted.transform(txt1) # vector features

#pd.DataFrame(txt_transformed.toarray()).head(15)
# vocabulary_ maps each word to its column index, so order the column labels by that index
feature_names = sorted(tf.vocabulary_, key=tf.vocabulary_.get)
textmatrix = pd.DataFrame(txt_transformed.toarray(),columns=feature_names)
textmatrix.head(5)
#print ("The text: ", txt_transformed.todense())
#print("Every feature:\n{}".format(tf.get_feature_names()))

Unnamed: 0,his,not,perfect,sang,she,smile,was
0,1.405465,1.0,1.405465,0.0,0.0,1.405465,1.405465
1,1.405465,4.0,1.405465,0.0,0.0,1.405465,1.405465
2,0.0,1.0,0.0,2.098612,2.098612,0.0,0.0
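
The values in the table above can be checked by hand. The sketch below assumes the unsmoothed, unnormalized variant used in the example (`smooth_idf=False`, `norm=None`), for which idf(t) = ln(n / df(t)) + 1. The helper names `idf` and `tf_idf` are illustrative, and the whitespace tokenizer is a simplification that happens to match this corpus.

```python
import math

docs = ['His smile was not perfect',
        'His smile was not not not not perfect',
        'she not sang']
tokenized = [doc.lower().split() for doc in docs]
n_docs = len(tokenized)

def idf(term):
    """Unsmoothed inverse document frequency: ln(n / df) + 1."""
    df = sum(term in doc for doc in tokenized)
    return math.log(n_docs / df) + 1

def tf_idf(term, doc_index):
    """Raw term count in the document multiplied by the term's idf."""
    return tokenized[doc_index].count(term) * idf(term)

print(round(tf_idf("his", 0), 6))   # 1.405465 ("his" occurs in 2 of 3 documents)
print(round(tf_idf("not", 1), 6))   # 4.0      ("not" occurs in every document, 4 times here)
print(round(tf_idf("sang", 2), 6))  # 2.098612 ("sang" occurs in 1 of 3 documents)
```

Because "not" appears in every document its idf is exactly 1, so its score is just its raw count.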


## CBOW (Continuous Bag of words)
CBOW predicts the probability of a word given its context. A context may be a single word or a group of words. 

Pros:<br>
* Being probabilistic in nature, it generally performs better than deterministic (count-based) methods.
* It is low on memory. It does not have the huge RAM requirements of a co-occurrence matrix approach, which needs to store three huge matrices.

Cons:<br>
* CBOW averages over the contexts of a word. For example, Apple can be both a fruit and a company, but CBOW takes an average of both contexts and places the word between the clusters for fruits and companies.
* Training a CBOW model from scratch can take a very long time if it is not properly optimized.
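
To make the averaging step concrete, here is a forward-pass-only sketch of CBOW in NumPy. The weights are random and untrained, and the vocabulary, embedding size and function name are assumptions for illustration; a real model learns `W_in` and `W_out` by gradient descent over a large corpus.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "apple", "fell", "from", "tree"]
word_to_id = {w: i for i, w in enumerate(vocab)}
embed_dim = 4

# Untrained, randomly initialised weights (input and output embeddings)
W_in = rng.normal(scale=0.1, size=(len(vocab), embed_dim))
W_out = rng.normal(scale=0.1, size=(embed_dim, len(vocab)))

def cbow_probabilities(context_words):
    """Average the context embeddings, then softmax over the vocabulary."""
    ids = [word_to_id[w] for w in context_words]
    hidden = W_in[ids].mean(axis=0)              # the averaging step
    scores = hidden @ W_out
    exp_scores = np.exp(scores - scores.max())   # numerically stable softmax
    return exp_scores / exp_scores.sum()

# Probability of each vocabulary word being the centre word of "the ? fell"
probs = cbow_probabilities(["the", "fell"])
print(dict(zip(vocab, probs.round(3))))
```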

## Word2Vec
Word2vec is a group of related models that are used to produce word embeddings; in practice a pre-trained model is often used. The following example uses [Google's pre-trained model](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit). It contains word vectors for a vocabulary of 3 million words trained on around 100 billion words from the Google News dataset. The file size is about 1.5 GB.

In [42]:
# pretrained model
from gensim.models import KeyedVectors # use KeyedVectors; loading this format through Word2Vec is deprecated

word_vectors_model = KeyedVectors.load_word2vec_format('dataset/GoogleNews-vectors-negative300.bin', binary=True)

# getting word vectors of a word
word_vectors_model['dog']

array([ 5.12695312e-02, -2.23388672e-02, -1.72851562e-01,  1.61132812e-01,
       -8.44726562e-02,  5.73730469e-02,  5.85937500e-02, -8.25195312e-02,
       -1.53808594e-02, -6.34765625e-02,  1.79687500e-01, -4.23828125e-01,
       -2.25830078e-02, -1.66015625e-01, -2.51464844e-02,  1.07421875e-01,
       -1.99218750e-01,  1.59179688e-01, -1.87500000e-01, -1.20117188e-01,
        1.55273438e-01, -9.91210938e-02,  1.42578125e-01, -1.64062500e-01,
       -8.93554688e-02,  2.00195312e-01, -1.49414062e-01,  3.20312500e-01,
        3.28125000e-01,  2.44140625e-02, -9.71679688e-02, -8.20312500e-02,
       -3.63769531e-02, -8.59375000e-02, -9.86328125e-02,  7.78198242e-03,
       -1.34277344e-02,  5.27343750e-02,  1.48437500e-01,  3.33984375e-01,
        1.66015625e-02, -2.12890625e-01, -1.50756836e-02,  5.24902344e-02,
       -1.07421875e-01, -8.88671875e-02,  2.49023438e-01, -7.03125000e-02,
       -1.59912109e-02,  7.56835938e-02, -7.03125000e-02,  1.19140625e-01,
        2.29492188e-01,  

# Question: What are latent variables and manifolds? Why are they important? Give at least two examples.

## Latent variables
1.	A latent variable is a variable that cannot be observed directly, in either the training or the test phase.
2.	In a latent variable model, we assume that the distribution in data space is due to the influence of a small number of latent variables. We then map the latent space onto the data space. The model is typically fitted with a maximum likelihood criterion, using the EM algorithm.
3.	In latent variable modelling, Bayes' theorem gives the probability that a data point was generated by a particular point in latent space.
4.	Factor analysis is a classic latent variable model, and PCA is closely related to it.
5.	In dimensionality reduction the reverse is done: the data space is mapped to the latent space.
6. Latent features can be computed from observed features using matrix factorization.
7. An example is text document analysis, where the 'words' extracted from the documents are the observed features.
8. If you factorize the word data you can find 'topics', where a 'topic' is a group of words with semantic relevance.
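
The two directions of mapping can be sketched with synthetic data: two hidden variables generate five observed features, and PCA maps the observations back to a two-dimensional latent space. All sizes, the noise level and the variable names are assumptions for illustration, and the recovered axes are only a rotated and scaled version of the original latent variables.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Latent space -> data space: 2 hidden variables drive 5 observed features
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
observed = latent @ mixing + 0.05 * rng.normal(size=(200, 5))  # small observation noise

# Data space -> latent space: dimensionality reduction with PCA
pca = PCA(n_components=2)
recovered = pca.fit_transform(observed)

print(recovered.shape)
print(pca.explained_variance_ratio_.sum())
```

Two components capture almost all of the variance here because only two latent variables generated the data.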

## Latent Manifold
The latent manifold is the set of all latent variables that make up the latent space.

#### Why?
Latent variables and manifolds provide condensed, meaningful information in the form of features, which play an essential role in prediction or classification.

#### Example:
1. In a fake news project, the author name is transformed into a rank that considers different reputation factors. The domain of the news source is also converted to a domain rank using external sources. These features act as latent variables, and the combined reliability rating forms a latent variable manifold. 

2. In a housing price prediction project, 'location' is a factor that influences house price significantly, but it is not directly present in the data set. It shows up only through features like distance from a supermarket, school district, hospital, freeway, etc. So location becomes a latent variable. The number of landmark locations and the neighborhood rating also become latent variables and form a neighborhood latent manifold.

3. In NLP, features (words) like [sail-boat, schooner, yacht, steamer, cruiser] would 'factorize' to latent features (topics) like 'ship' and 'boat'.<br>
    * [sail-boat, schooner, yacht, steamer, cruiser, ...] -> [ship, boat]<br>
    * The underlying idea is that latent features are semantically relevant 'aggregates' of observed features. When you have large-scale, high-dimensional, and noisy observed features, it makes sense to build your classifier on latent features.
    * Ship and boat combined become the latent manifold.

# Question: Describe the Scikit-learn Pipeline function with an example.

Sklearn's pipeline functionality makes it easier to repeat commonly occurring steps in your modeling process. 

In the following example, the pipeline gives a single interface for the three most common steps: two transformations and a final estimator. It encapsulates the transformers and the predictor, and then you can do something like:
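
A minimal runnable sketch of such a pipeline, assuming a tiny made-up sentiment data set. The texts, labels and the names `X_train`, `y_train` and `X_test` are illustrative only, and with so little data the predictions are not meaningful.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 1 = positive review, 0 = negative review
X_train = ["great movie, loved it", "wonderful acting and story",
           "terrible plot, very boring", "awful film, waste of time"]
y_train = [1, 1, 0, 0]
X_test = ["loved the story", "boring and awful"]

pipeline = Pipeline([
    ("vect", CountVectorizer()),      # text -> token counts
    ("tfidf", TfidfTransformer()),    # counts -> tf-idf weights
    ("clf", SGDClassifier(random_state=0)),  # final estimator
])

pipeline.fit(X_train, y_train)        # fits every step on the training data
predicted = pipeline.predict(X_test)  # transforms with the fitted steps, then predicts
print(predicted)
```

Calling `predict` on the pipeline only transforms the test texts with the already fitted vectorizer and tf-idf weights, so nothing is re-fitted on test data.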

In [12]:
# Without a pipeline: every step is fitted and applied by hand
# (assumes Xtrain, ytrain and Xtest are already defined)
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier

vect = CountVectorizer()
tfidf = TfidfTransformer()
clf = SGDClassifier()

# Fit each step on the training set
vX = vect.fit_transform(Xtrain)
tfidfX = tfidf.fit_transform(vX)
clf.fit(tfidfX, ytrain)

# Now evaluate all steps on test set: only transform (never re-fit), then predict
vX = vect.transform(Xtest)
tfidfX = tfidf.transform(vX)
predicted = clf.predict(tfidfX)

In [11]:
# With pipeline (assumes Xtrain, ytrain and Xtest are already defined)
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])
# Fit all steps on the training set
pipeline.fit(Xtrain, ytrain)
# Now evaluate all steps on test set
predicted = pipeline.predict(Xtest)

# Question: Compare Decision Trees, Random Forests and SVM.

## Decision Tree

A decision tree is a graphical representation of all the possible solutions to a decision based on certain conditions. It's called a decision tree because it starts with a single box (or root), which then branches off into a number of solutions, just like a tree.
![Decision Trees](images/DecisionTree.png)
Credit: [Source](https://www.quora.com/What-is-the-difference-between-random-forest-and-decision-trees)
#### Pros
1. **Easy to understand:** Decision tree output is easy to understand, even for people from a non-analytical background. No statistical knowledge is required to read and interpret it. The graphical representation is intuitive, and users can easily relate it to their hypotheses.
2. **Useful in data exploration:** A decision tree is one of the fastest ways to identify the most significant variables and the relations between two or more variables. With the help of decision trees, we can create new variables/features that have better power to predict the target variable. For example, when working on a problem with hundreds of available variables, a decision tree helps identify the most significant ones.
3. **Less data cleaning required:** It requires less data cleaning than some other modeling techniques. To a fair degree, it is not influenced by outliers and missing values.
4. **Data type is not a constraint:** It can handle both numerical and categorical variables.
5. **Non-parametric method:** A decision tree is considered a non-parametric method, meaning it makes no assumptions about the distribution of the data or the structure of the classifier.

#### Cons
1. **Overfitting:** Overfitting is one of the most common practical difficulties with decision tree models. It is addressed by setting constraints on model parameters (for example, the maximum depth) and by pruning.
2. **Not fit for continuous variables:** When working with continuous numerical variables, a decision tree loses information when it bins the values into discrete categories.


In [13]:
'''
# Import library
# Import other necessary libraries like pandas, numpy...
from sklearn import tree
# Assumes you have X (predictors) and y (target) for the training set and x_test (predictors) for the test set
# Create tree object; criterion can be 'gini' (the default) or 'entropy' (information gain)
model = tree.DecisionTreeClassifier(criterion='gini')
# model = tree.DecisionTreeRegressor() for regression
# Train the model using the training set and check the score
model.fit(X, y)
model.score(X, y)
# Predict output
predicted = model.predict(x_test)
'''
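
The `criterion` option in the cell above selects the impurity measure the tree uses to choose splits. As a rough illustration only (this is not scikit-learn's implementation, and the labels and the candidate split are made up), Gini impurity and entropy can be sketched in plain Python:

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: -sum(p * log2(p)) over the classes present."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())

def weighted_impurity(left, right, measure):
    """Impurity of a split: size-weighted average over the two child nodes."""
    n = len(left) + len(right)
    return len(left) / n * measure(left) + len(right) / n * measure(right)

# Hypothetical node with 5 "yes" and 5 "no" labels, and one candidate split
parent = ["yes"] * 5 + ["no"] * 5
left = ["yes"] * 4 + ["no"]
right = ["yes"] + ["no"] * 4

print("parent gini:", gini(parent))
print("parent entropy:", entropy(parent))
print("split gini:", round(weighted_impurity(left, right, gini), 3))
```

A split is useful when the weighted impurity of the children is lower than the impurity of the parent, which is the case for this made-up split.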

## Random Forest
Random forest builds multiple decision trees and merges them together to get a more accurate and stable prediction.
![randomforest](images/RandomForest.png)
#### Pros

- This algorithm can solve both types of problems, classification and regression, and gives decent estimates for both.
- One of the most exciting benefits of Random Forest is its ability to handle large data sets with high dimensionality. It can handle thousands of input variables and identify the most significant ones, so it is sometimes used as a dimensionality reduction method. Further, the model outputs the **importance of each variable**, which can be a very handy feature.
- It has an effective method for estimating missing data and maintains accuracy when a large proportion of the data is missing.
- It has methods for balancing errors in data sets where classes are imbalanced.
- The capabilities above can be extended to unlabeled data, leading to unsupervised clustering, data views and outlier detection.
- Random Forest samples the input data with replacement, which is called bootstrap sampling. For each tree, about one third of the data is not used for training and can be used for testing. These are called the **out of bag** samples, and the error estimated on them is known as the _out of bag error_. Studies of out-of-bag error estimates give evidence that the out-of-bag estimate is as accurate as using a test set of the same size as the training set. Therefore, using the out-of-bag error estimate removes the need for a set-aside test set.
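
The "about one third" figure for out-of-bag samples can be checked with a small simulation. This is only a sketch using the standard library; the helper name `out_of_bag_fraction` and the sample size are illustrative assumptions, not part of any library:

```python
import random

def out_of_bag_fraction(n_samples, seed=0):
    """Draw one bootstrap sample (with replacement) and return the
    fraction of original rows that were never drawn."""
    rng = random.Random(seed)
    in_bag = {rng.randrange(n_samples) for _ in range(n_samples)}
    return 1 - len(in_bag) / n_samples

fraction = out_of_bag_fraction(10_000)
print(f"out-of-bag fraction: {fraction:.3f}")
```

For large data sets the fraction approaches 1/e, roughly 0.37, which is where the "one third" rule of thumb comes from.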

#### Cons

- It does a good job at classification but is not as good for regression problems, as it does not give precise continuous predictions. In regression, it cannot predict beyond the range of the training data, and it may overfit data sets that are particularly noisy.
- Random Forest can feel like a black box approach for statistical modelers: you have very little control over what the model does. At best, you can try different parameters and random seeds.

### Code:

In [None]:
'''
#Import Library
from sklearn.ensemble import RandomForestClassifier #use RandomForestRegressor for regression problem
# Assumes you have X (predictors) and y (target) for the training set and x_test (predictors) for the test set
# Create Random Forest object
model= RandomForestClassifier(n_estimators=1000)
# Train the model using the training sets and check score
model.fit(X, y)
#Predict Output
predicted= model.predict(x_test)
'''

## Support Vector Machine(SVM)
A Support Vector Machine (SVM) is a classifier that tries to maximize the margin between training data and the classification boundary (the separating hyperplane defined by 𝑋𝛽 = 0).
![svm](images/svm.png)

#### Pros:
* It works really well when there is a clear margin of separation.
* It is effective in high dimensional spaces.
* It is effective in cases where the number of dimensions is greater than the number of samples.
* It uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.

#### Cons:
* It does not perform well on large data sets, because the required training time is high.
* It also does not perform well when the data set is noisy, i.e. when the target classes overlap.
* SVM does not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation. In scikit-learn this is available through the `probability=True` option of the `SVC` class.

In [14]:
'''
# Import library
from sklearn import svm
# Assumes you have X (predictors) and y (target) for the training set and x_test (predictors) for the test set
# Create SVM classification object (the class is SVC and the regularization parameter is C, both upper case)
model = svm.SVC(kernel='linear', C=1, gamma=1)
# Other options include changing the kernel, gamma and C values; note that gamma is ignored by the linear kernel
# Train the model using the training set and check the score
model.fit(X, y)
model.score(X, y)
# Predict output
predicted = model.predict(x_test)
'''

### Side by side comparison
![df-rf-svm](images/df-rf-svm.png)

# Question: How does Logistic Regression compare to SVM? What if we have an SVM with only binary classification?

## Side by Side Compare

| no. | Logistic Regression | Support Vector Machine(SVM) |
| --- | --- | --- |
| 1 | Logistic Regression models the probability of a class as a continuous (sigmoid) function of the inputs. | SVM fits a function (hyperplane) that attempts to separate two classes of data that could be of multiple dimensions. |
| 2 | The data do not always follow such a function, so the model may have trouble classifying points near P = 0.5. | SVM could have difficulty when the classes are not separable or there is not enough margin to fit an (n_dimensions - 1) hyperplane between the two classes. |
| 3 | Generally resistant to overfitting; the fewer the parameters, the lower the chance of overfitting. | Relatively insensitive to overfitting, since maximizing the margin acts as regularization (controlled by C). |
| 4 | Logistic regression is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables. | The idea is that maximizing the margin maximizes the chance that classification will be correct on new data. We assume the new data of each class is near the training data of that class. |


## SVM with binary classification

SVMs (linear or otherwise) inherently do binary classification. However, there are several ways to extend them to multiclass problems. The most common procedure transforms the problem into a set of binary classification problems, using one of the following two strategies:

* One vs. the rest: For n classes, n binary classifiers are trained. Each predicts whether an example belongs to its 'own' class or to any other class. The classifier with the largest output determines the class of the example.
* One vs. one: A binary classifier is trained for each pair of classes. A voting procedure is used to combine the outputs.
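
To make the two strategies concrete, here is a small sketch of how many binary problems each one creates and how one-vs-one voting combines results. The class names and the list of pairwise winners are hypothetical, and no real classifier is trained:

```python
from collections import Counter
from itertools import combinations

classes = ["cat", "dog", "bird", "fish"]  # hypothetical class labels

# One vs. the rest: one binary problem per class -> n classifiers
ovr_problems = [(c, "rest") for c in classes]

# One vs. one: one binary problem per unordered pair -> n * (n - 1) / 2 classifiers
ovo_problems = list(combinations(classes, 2))

def ovo_vote(pairwise_winners):
    """Return the class that wins the most pairwise contests."""
    return Counter(pairwise_winners).most_common(1)[0][0]

# Hypothetical winners of the six pairwise classifiers for a single example,
# listed in the same order as ovo_problems
winners = ["dog", "cat", "cat", "dog", "dog", "bird"]

print(len(ovr_problems), "one-vs-rest classifiers")
print(len(ovo_problems), "one-vs-one classifiers")
print("one-vs-one prediction:", ovo_vote(winners))
```

One vs. one trains more classifiers, but each of them sees only the data of two classes.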

# Question: Describe naive bayes with an example.

## Naive Bayes
Naive Bayes is used successfully for several purposes, but it works especially well for natural language processing (NLP) problems. Naive Bayes is a family of probabilistic algorithms that use probability theory and Bayes' Theorem to predict the tag of a text, such as a piece of news or a customer review.

Because it is probabilistic, it calculates the probability of each tag for a given document and then outputs the tag with the highest probability. It gets these probabilities from Bayes’ Theorem, which describes the probability of an event based on prior knowledge of conditions that might be related to it.

The following example walks through Multinomial Naive Bayes with an NLP example.

### Example

Training data:

| no. | Text | Tag |
| --- | --- | --- |
| 1 | A great game | Sports |
| 2 | The election was over | Not Sports |
| 3 | Very clean match | Sports |
| 4 | A clean but forgettable game | Sports |
| 5 | It was a close election | Not Sports |

Naive Bayes is a probabilistic classifier, so we will calculate the probability that the sentence “A very close game” is Sports and the probability that it is Not Sports.

#### Feature Engineering
First we need to extract features from the text so we can apply models. Features are the pieces of information that we extract from the text. Because a model works on numbers, we need to convert the features into numbers. One basic way of doing this is **word frequencies**.

We will treat each text (document) as a set of words and their counts.

#### Bayes’ Theorem
![Bayes-theorem](images/Bayes-theorem.png)

Let's calculate the probability of Sports and Not Sports for the given text “A very close game”.
![probability](images/nb-probability.svg)
As we only need to find which tag has the higher probability, we can drop the common factor.
![probability compare](images/nb-probability-compare.png)

There is still a problem: “A very close game” does not appear in any of the Sports training texts. This is where the naive assumption helps.

#### Naive
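We treat each text as a set of words and make the naive assumption that every word in a sentence is independent of the others. With this assumption we can write the probability of a sentence in terms of the probabilities of its words.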
![wordset](images/nb-wordset.svg)
![wordset-prob](images/nb-wordset-prob.svg)

#### Calculation
Probability of each tag:<br>
P(Sports) = ⅗<br>
P(Not Sports) = ⅖<br>

Calculating word probabilities:<br>
P(game | Sports) = (number of times “game” appears in Sports texts) / (total number of words in Sports texts) = 2/11<br>
P(close | Sports) = 0, as “close” does not appear in any Sports text.<br>
This is a problem, because we multiply the word probabilities to get the text (document) probability, and a single zero makes the whole product zero.<br>
The solution is **Laplace smoothing**: we add 1 to every count so it is never zero. To balance this, we add the number of possible words to the divisor, so the result will never be greater than 1.

Possible words are ['a', 'great', 'very', 'over', 'it', 'but', 'game', 'election', 'clean', 'close', 'the', 'was', 'forgettable', 'match'].

Since the number of possible words is 14, applying smoothing gives

P(game | Sports) = (2 + 1)/(11 + 14) = 3/25.

The full results are:
![nb-classifier](images/nb-classifier.png)

#### Result: Classifier gives “A very close game” the Sports tag.
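
The calculation above can be reproduced with a few lines of plain Python. This is a minimal sketch of Multinomial Naive Bayes with Laplace smoothing on the five training texts; the helper names (`tokenize`, `score`) are illustrative and tokenization is simply lower-casing and splitting on spaces:

```python
from collections import Counter

training_data = [
    ("A great game", "Sports"),
    ("The election was over", "Not Sports"),
    ("Very clean match", "Sports"),
    ("A clean but forgettable game", "Sports"),
    ("It was a close election", "Not Sports"),
]

def tokenize(text):
    return text.lower().split()

# Count documents per tag and word occurrences per tag
doc_counts = Counter()
word_counts = {}
for text, tag in training_data:
    doc_counts[tag] += 1
    word_counts.setdefault(tag, Counter()).update(tokenize(text))

# All distinct words seen in training (14 for this data)
vocabulary = {word for counts in word_counts.values() for word in counts}

def score(text, tag):
    """P(tag) times the product of Laplace-smoothed P(word | tag)."""
    total_words = sum(word_counts[tag].values())
    prob = doc_counts[tag] / len(training_data)  # prior P(tag)
    for word in tokenize(text):
        prob *= (word_counts[tag][word] + 1) / (total_words + len(vocabulary))
    return prob

scores = {tag: score("A very close game", tag) for tag in doc_counts}
predicted = max(scores, key=scores.get)
print(scores)
print("Predicted tag:", predicted)
```

The Sports score comes out several times larger than the Not Sports score, which matches the result above.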

### Code with sklearn

In [1]:
# load the iris dataset 
from sklearn.datasets import load_iris 
iris = load_iris() 

# store the feature matrix (X) and response vector (y) 
X = iris.data 
y = iris.target 

# splitting X and y into training and testing sets 
from sklearn.model_selection import train_test_split 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1) 

# training the model on training set 
from sklearn.naive_bayes import GaussianNB 
gnb = GaussianNB() 
gnb.fit(X_train, y_train) 

# making predictions on the testing set 
y_pred = gnb.predict(X_test) 

# comparing actual response values (y_test) with predicted response values (y_pred) 
from sklearn import metrics 
print("Gaussian Naive Bayes model accuracy(in %):", metrics.accuracy_score(y_test, y_pred)*100)

Gaussian Naive Bayes model accuracy(in %): 95.0


# Question: What is deep learning? Name three activation functions.

## Deep Learning
Deep learning is a subset of AI and machine learning that uses multi-layered artificial neural networks to deliver state-of-the-art accuracy in tasks such as object detection, speech recognition, language translation and others.
<br>
![deeplearning](images/deeplearning.png)

Deep learning is a machine learning technique that teaches computers to do what comes naturally to humans: learn by example.<br>
In deep learning, a computer model learns to perform classification tasks directly from images, text, or sound. Deep learning models can achieve state-of-the-art accuracy, sometimes exceeding human-level performance. Models are trained by using a large set of labeled data and neural network architectures that contain many layers.
#### How Deep Learning Works
Most deep learning methods use neural network architectures, which is why deep learning models are often referred to as deep neural networks.

The term “deep” usually refers to the number of hidden layers in the neural network. Traditional neural networks only contain 2-3 hidden layers, while deep networks can have as many as 150.

Deep learning models are trained by using large sets of labeled data and neural network architectures that learn features directly from the data without the need for manual feature extraction.

#### Use Cases
* Artificial Neural Networks for Regression and Classification
* Convolutional Neural Networks for Computer Vision
* Recurrent Neural Networks for Time Series Analysis
* Self Organizing Maps for Feature Extraction
* Deep Boltzmann Machines for Recommendation Systems
* Auto Encoders for Recommendation Systems

#### Artificial Neural Network
1. Randomly initialise the weights to small numbers close to 0 (but not 0).
2. Input the first observation of your dataset into the input layer, with each feature in one input node.
3. Forward propagation: from left to right, the neurons are activated in such a way that the impact of each neuron's activation is limited by the weights. Propagate the activations until you get the predicted result y.
4. Compare the predicted result to the actual result. Measure the generated error.
5. Back-propagation: from right to left, the error is back-propagated. Update the weights according to how much they are responsible for the error. The learning rate decides by how much we update the weights.
6. Repeat steps 2 to 5 and update the weights after each observation (online, or stochastic, learning), or repeat steps 2 to 5 but update the weights only after a batch of observations (batch learning).
7. When the whole training set has passed through the ANN, that makes an epoch. Repeat for more epochs.
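
The steps above can be sketched in their simplest form: a single sigmoid neuron (a network with no hidden layer) trained with per-observation updates. The toy dataset (logical OR), the learning rate and the number of epochs are assumptions chosen for illustration only:

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy dataset (assumed for illustration): logical OR of two input features
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]

rng = random.Random(0)
# Step 1: small random weights close to 0
weights = [rng.uniform(-0.1, 0.1) for _ in range(2)]
bias = 0.0
learning_rate = 0.5

def predict(inputs):
    # Step 3: forward propagation (weighted sum plus bias, then activation)
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)

def mean_squared_error():
    return sum((predict(xs) - y) ** 2 for xs, y in data) / len(data)

initial_error = mean_squared_error()

for epoch in range(1000):                  # Step 7: repeat for many epochs
    for inputs, target in data:            # Step 2: one observation at a time
        output = predict(inputs)
        error = output - target            # Step 4: compare with the actual result
        # Step 5: gradient of 0.5 * error**2 with respect to the weighted sum
        delta = error * output * (1.0 - output)
        # Step 6: update the weights after each observation
        weights = [w - learning_rate * delta * x for w, x in zip(weights, inputs)]
        bias -= learning_rate * delta

final_error = mean_squared_error()
predictions = [round(predict(xs)) for xs, _ in data]
print("error before:", initial_error, "after:", final_error)
print("predictions:", predictions)
```

A real network adds hidden layers and propagates `delta` backwards through them, but the loop structure stays the same.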


### Activation Functions
An activation function is a function used to compute the output of a node in a layer of a neural network. Its main purpose is to convert the input signal of a node in an ANN into an output signal. That output signal is then used as an input to the next layer in the stack.

It is just a function (node) that you add to the output end of any neural network. It is also known as a transfer function. It can also be attached in between two neural networks.

In simple words, an artificial neuron calculates a "weighted sum" of its inputs, adds a bias, and then decides whether it should be “fired” or not (this decision is what the activation function does).

**Y = SUM(weight * input) + bias**

Now, the value of Y can be anything from -inf to +inf; the neuron does not know the bounds of the value. So we add an activation function to check the Y value produced by a neuron and decide whether outside connections should consider this neuron as “fired” (“activated”) or not.

**Most popular Activation Functions:**
![Activation Functions](images/activation-function.png)
![Activation Functions](images/activation-functions-cheatsheet.png)

**Note:** Use the rectifier function (“relu”) for hidden layers. Use the sigmoid function for an output layer where a probability is expected: sigmoid for a binary outcome, softmax for a categorical (multi-class) outcome.
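
As a quick reference, the functions named in the note can be written in a few lines with the standard library. This is a plain-Python sketch for single values (and a list, for softmax), not how a deep learning framework implements them:

```python
import math

def sigmoid(z):
    """Squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """Squashes any real number into (-1, 1)."""
    return math.tanh(z)

def relu(z):
    """Rectifier: passes positive values through, clips negatives to 0."""
    return max(0.0, z)

def softmax(zs):
    """Turns a list of scores into probabilities that sum to 1."""
    largest = max(zs)  # subtracting the max avoids overflow in exp
    exps = [math.exp(z - largest) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

for z in (-2.0, 0.0, 2.0):
    print(z, round(sigmoid(z), 3), round(tanh(z), 3), relu(z))
print([round(p, 3) for p in softmax([1.0, 2.0, 3.0])])
```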

# Question: How do we use sigmoid? When do we use it?

### How do we use sigmoid
The sigmoid activation function is used for the output layer in neural networks where a probability is expected.

**Code snippet:**

In [None]:
'''
# Sigmoid in an ANN

# Importing the Keras libraries and packages
import keras
from keras.models import Sequential
from keras.layers import Dense

# Initialising the ANN
classifier = Sequential()

# Adding the input layer and the first hidden layer
# (Keras 2+ argument names: units and kernel_initializer replace output_dim and init)
classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu', input_dim = 11))

# Adding the second hidden layer
classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu'))

# Adding the output layer: one sigmoid unit gives a probability for the binary outcome
classifier.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))

# Compiling the ANN
classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])

# Fitting the ANN to the Training set (epochs replaces nb_epoch in Keras 2+)
classifier.fit(X_train, y_train, batch_size = 10, epochs = 100)

# Making the predictions and evaluating the model

# Predicting the Test set results (each value is a probability between 0 and 1)
y_pred = classifier.predict(X_test)
'''

### When to use:
The sigmoid function is used in neural networks to give logistic neurons a real-valued output that is a smooth and bounded function of their total input. As an activation function, it behaves like a gate that models whether, and how strongly, a neuron fires.

Benefits:
- The sigmoid function is nonlinear, and combinations of it are nonlinear too, so layers can be stacked. Because it has a smooth gradient, it produces analog (non-binary) activations.
![SigmoidFunction](images/sigmoid-func.png)
- As the plot shows, Y is very steep for X between -2 and 2, which means a small change in X causes a significant change in Y. The function therefore tends to push Y values toward either end of the curve, which is good for classification because it makes clear distinctions between predictions.
- Another advantage is that the output of the sigmoid activation function is always bounded in the range (0, 1), unlike a linear function, whose output ranges over (-inf, inf).

# Question: What do we use to measure our models? Describe four parameters / metrics with examples.

Evaluation of a machine learning model is an essential part of a project. Mostly we use classification accuracy to measure the performance of a model, but sometimes it is not sufficient to judge performance. Let's look at some other evaluation metrics.

### 1. Classification Accuracy 
Accuracy is usually measured as classification accuracy. It is the ratio of the number of correct predictions to the total number of predictions. It works well when the data set has an equal number of samples from each class.
![Accuracy](images/accuracy.gif)
The best value is 1 and the worst value is 0.

### 2. Loss Functions
A cost (loss) function is a measure of how wrong the model is in its ability to estimate the relationship between X and y, i.e. how badly the model is performing.
Models learn by minimizing a cost function. You may naturally wonder how the cost function is minimized: enter gradient descent. Gradient descent is an efficient optimization algorithm that attempts to find a local or global minimum of a function.

Logarithmic Loss or **Log Loss**, works by penalising the false classifications. It works well for multi-class classification. **Cross-entropy loss**, or log loss, measures the performance of a classification model whose output is a probability value between 0 and 1. Cross-entropy loss increases as the predicted probability diverges from the actual label. So predicting a probability of .012 when the actual observation label is 1 would be bad and result in a high loss value. A perfect model would have a log loss of 0.
![LossFunctions](images/LossFunctions1.png)
The higher the value, the further the prediction is from the target.
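
The behaviour described above can be checked with a small, self-contained version of binary log loss. This is a sketch in plain Python, and the clipping constant `eps` is an assumption added only to avoid taking log(0):

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary cross-entropy for labels in {0, 1} and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # keep p away from exactly 0 or 1
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Actual label is 1 in both cases
confident_wrong = log_loss([1], [0.012])  # predicted probability far from the label
confident_right = log_loss([1], [0.98])   # predicted probability close to the label

print(round(confident_wrong, 4))
print(round(confident_right, 4))
```

The prediction of 0.012 for a true label of 1 is penalized far more heavily than the prediction of 0.98.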

### 3. Confusion Matrix
Confusion Matrix as the name suggests gives us a matrix as output and describes the complete performance of the model.
There are 4 important terms :

* True Positives : The cases in which we predicted YES and the actual output was also YES.
* True Negatives : The cases in which we predicted NO and the actual output was NO.
* False Positives : The cases in which we predicted YES and the actual output was NO.
* False Negatives : The cases in which we predicted NO and the actual output was YES.

Accuracy for the matrix can be calculated as:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

![ConfusionMatrix](images/ConfusionMatrix.png)

### 4. Classification Report : F1 Score
F1 Score is used to measure a test’s accuracy. It is the harmonic mean of precision and recall, and its range is [0, 1]. It tells how precise the classifier is (how many of the instances it labels positive are correct), as well as how robust it is (it does not miss a significant number of instances).
![f1 Score](images/f1-score.gif)
The greater the F1 Score, the better is the performance of our model. F1 score reaches its best value at 1 and worst score at 0.

**precision**
Measures the fraction (ratio) of actual positives among the examples that are predicted as positive.
The best value is 1 and the worst value is 0.
Precision = tp / (tp + fp)

**recall**
Measures the fraction of actual positives that are predicted as positive. The recall is intuitively the ability of the classifier to find all the positive samples.
The best value is 1 and the worst value is 0.
Recall = tp / (tp + fn)
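
The formulas above can be applied directly to confusion-matrix counts. The sketch below uses the counts from the confusion matrix in the Examples section (TN = 62, FP = 3, FN = 12, TP = 23); the function name is illustrative:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 for the positive class from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts taken from the confusion matrix [[62, 3], [12, 23]]
tn, fp, fn, tp = 62, 3, 12, 23

precision, recall, f1 = precision_recall_f1(tp, fp, fn)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print("precision:", round(precision, 2))
print("recall:", round(recall, 2))
print("f1:", round(f1, 2))
print("accuracy:", accuracy)
```

Rounded to two decimals, these values agree with the row for class 1 in the classification report in the Examples section.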

### 5. R squared
In statistics, the coefficient of determination, denoted R² or r² and pronounced "R squared", is the proportion of the variance in the dependent variable that is predictable from the independent variable(s), i.e. the fraction of the variance in the dependent variable that is explained by the independent variables.
R2 = 0: bad model, no linear relationship.
R2 = 1: good model, the line fits the data perfectly.

R2 = Explained variation / Total of Variation
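
A common way to compute this in practice is the equivalent form R2 = 1 - (residual sum of squares) / (total sum of squares). The sketch below uses that form; the function name `r_squared` is illustrative, and the labels are rebuilt from the confusion-matrix counts in the Examples section rather than taken from the real test set:

```python
def r_squared(y_true, y_pred):
    """R2 = 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Labels rebuilt from the confusion matrix [[62, 3], [12, 23]]:
# 65 actual zeros (3 predicted as 1) and 35 actual ones (12 predicted as 0)
y_true = [0] * 65 + [1] * 35
y_pred = [0] * 62 + [1] * 3 + [0] * 12 + [1] * 23

r2 = r_squared(y_true, y_pred)
print(round(r2, 4))
```

With these counts the result is about 0.34, in line with the `r2_score` output shown in the Examples section.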

## Examples

In [60]:
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
df = pd.read_csv('dataset/Social_Network_Ads.csv')
df.head()

Unnamed: 0,User ID,Gender,Age,EstimatedSalary,Purchased
0,15624510,Male,19,19000,0
1,15810944,Male,35,20000,0
2,15668575,Female,26,43000,0
3,15603246,Female,27,57000,0
4,15804002,Male,19,76000,0


In [61]:
X = df.iloc[:, [2, 3]].values
y = df.iloc[:, 4].values

In [62]:
# Train Test Split
from sklearn.model_selection import train_test_split 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.25, random_state = 100)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)



In [63]:
# Fit a logistic regression model and predict on the test set
from sklearn.linear_model import LogisticRegression
logmodel = LogisticRegression(random_state = 0)
logmodel.fit(X_train, y_train)
predictions = logmodel.predict(X_test)



### Evaluation

In [64]:
from sklearn.metrics import accuracy_score
print ("Accuracy : ", accuracy_score(y_test,predictions)*100)

Accuracy :  85.0


This is a percentage; the higher, the better.

In [65]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test,predictions))

[[62  3]
 [12 23]]


The off-diagonal entries should be 0 for the best result. Here 3 actual negatives (0) were predicted as positive and 12 actual positives (1) were predicted as negative. Accuracy = (62+23)/(62+23+3+12) = 85%

In [66]:
from sklearn.metrics import classification_report
print(classification_report(y_test,predictions))

              precision    recall  f1-score   support

           0       0.84      0.95      0.89        65
           1       0.88      0.66      0.75        35

   micro avg       0.85      0.85      0.85       100
   macro avg       0.86      0.81      0.82       100
weighted avg       0.85      0.85      0.84       100



For the best results the F-score should be close to 1. In the case above, the prediction of class 1 is not as good as the prediction of class 0, which is the same thing we saw in the confusion matrix.

In [67]:
# MAE L1 loss - Should be close to 0
from sklearn.metrics import mean_absolute_error  
mean_absolute_error(y_test,predictions) #y_target, y_pred

0.15

The L1 loss value is close to 0, which is good.

In [15]:
# MSE L2 loss - Should be close to 0
from sklearn.metrics import mean_squared_error 
mean_squared_error(y_test,predictions) #y_target, y_pred

0.15

The L2 loss value is close to 0, which is good.

In [16]:
# Log loss function
from sklearn.metrics import log_loss
log_loss(y_test,predictions)

5.1808404471595075

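Log loss has no upper bound and lies in the range [0, ∞). A log loss closer to 0 indicates better predictions, whereas a log loss far from 0 indicates worse predictions. Note that log loss is designed for predicted probabilities (for example, from `logmodel.predict_proba(X_test)`); here it was computed on hard 0/1 predictions, so every misclassified sample is penalized as a fully confident mistake, which inflates the value.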

In [19]:
# R Squared 
from sklearn.metrics import r2_score
r2 = r2_score(y_test, predictions)
r2

0.34065934065934056

R2 is meant for regression models, so it is not of much use in this classification case. For a good regression fit the value should be close to 1.

# Reference
* [A Survey of Text Similarity Approaches](https://pdfs.semanticscholar.org/5b5c/a878c534aee3882a038ef9e82f46e102131b.pdf)
* [How to solve 90% of NLP problems: a step-by-step guide](https://blog.insightdatascience.com/how-to-solve-90-of-nlp-problems-a-step-by-step-guide-fda605278e4e)
* [Latent Dirichlet Allocation (LDA)](https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24)
* [Understanding Activation Functions in Neural Networks](https://medium.com/the-theory-of-everything/understanding-activation-functions-in-neural-networks-9491262884e0)
* [Interns Explain Basic Neural Network](https://blog.datawow.io/interns-explain-basic-neural-network-ebc555708c9)
* [Metrics to Evaluate your Machine Learning Algorithm](https://towardsdatascience.com/metrics-to-evaluate-your-machine-learning-algorithm-f10ba6e38234)
* [A practical explanation of a Naive Bayes classifier](https://monkeylearn.com/blog/practical-explanation-naive-bayes-classifier/)
* [A Deep Dive Into Sklearn Pipelines](https://www.kaggle.com/baghern/a-deep-dive-into-sklearn-pipelines)
* [Pre-Processing in Natural Language Machine Learning](https://towardsdatascience.com/pre-processing-in-natural-language-machine-learning-898a84b8bd47)
* [SVM](https://www.analyticsvidhya.com/blog/2017/09/understaing-support-vector-machine-example-code/)
* [An Intuitive Understanding of Word Embeddings](https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/)