# This algorithm is applied to simple NLP

Import the preliminaries

In [8]:
import numpy as np
import pandas as pd
from collections import Counter
import matplotlib.pyplot as plt


Let's create a small data set

In [3]:
data = pd.DataFrame()
data['text'] = ["Barca paly well", "Obama is great politician", "foolball is the famous game", "this game was not fair", "caffine to code"]
data["tag"] = ["Sports", "Not Sports", "Sports", "Sports", "Not Sports"]

In [4]:
data

|   | text | tag |
|---|------|-----|
| 0 | Barca paly well | Sports |
| 1 | Obama is great politician | Not Sports |
| 2 | foolball is the famous game | Sports |
| 3 | this game was not fair | Sports |
| 4 | caffine to code | Not Sports |


Let's look at the sentence "that was the unforgettable game".

1. We want to calculate the probability that the sentence "that was the unforgettable game" is "Sports" and the probability that it is "Not Sports".


2. Written mathematically, what we want is P(Sports | that was the unforgettable game): the probability that the tag of a sentence is Sports given that the sentence is "that was the unforgettable game".

# Feature Engineering

Features are the pieces of information that we take from the text and give to the algorithm so it can work its magic.

For example, if we were doing classification on health, some features could be a person’s height, weight, gender, and so on. We would exclude things that maybe are known but aren’t useful to the model, like a person’s name or favorite color.

In this case though, we don’t even have numeric features. We just have text. We need to somehow convert this text into numbers that we can do calculations on.

In our case, we can use word frequencies. That is, we ignore word order and sentence construction, treating every document as a bag (multiset) of the words it contains.
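As a small illustration of word-frequency features, the sketch below counts words with `collections.Counter`. The helper name `bag_of_words` is introduced here only for illustration, and plain whitespace splitting is assumed.

```python
from collections import Counter

def bag_of_words(sentence):
    """Return a word -> count mapping; word order is discarded."""
    return Counter(sentence.split())

a = bag_of_words("this party was fun")
b = bag_of_words("fun was this party")

# Same words in a different order give the same features.
print(a == b)

# Repeated words are counted, which is why this is a bag and not a set.
the_count = bag_of_words("the game the fans love")["the"]
print(the_count)
```

Two sentences that differ only in word order end up with identical features, which is exactly the information the classifier will work with.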

# Naive Bayes

## 1. Bayes' Theorem

\begin{equation*}
P(A|B) =\frac{P(B|A)*P(A)}{P(B)}
\end{equation*}


In our case:

\begin{equation*}
P(Sports|that\ was\ the\ unforgettable\ game) =\frac{P(that\ was\ the\ unforgettable\ game|Sports)*P(Sports)}{P(that\ was\ the\ unforgettable\ game)}
\end{equation*}


Since our classifier only needs to find out which tag has the larger probability, we can discard the divisor, which is the same for both tags, and just compare


\begin{equation*}
P(that\ was\ the\ unforgettable\ game|Sports)*P(Sports)
\end{equation*}

with


\begin{equation*}
P(that\ was\ the\ unforgettable\ game|Not\ Sports)*P(Not\ Sports)
\end{equation*}
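To see why dropping the divisor is safe, here is a minimal sketch with made-up numbers. The likelihoods and priors below are hypothetical and serve only to show that dividing every score by the same P(sentence) does not change which tag wins.

```python
# Hypothetical values, chosen only for illustration.
likelihood = {"Sports": 0.002, "Not Sports": 0.0005}  # P(sentence | tag)
prior = {"Sports": 0.6, "Not Sports": 0.4}            # P(tag)

# Unnormalized scores: P(sentence | tag) * P(tag)
score = {tag: likelihood[tag] * prior[tag] for tag in prior}

# P(sentence) is the sum of the scores over all tags.
evidence = sum(score.values())
posterior = {tag: s / evidence for tag, s in score.items()}

best_by_score = max(score, key=score.get)
best_by_posterior = max(posterior, key=posterior.get)
print(best_by_score, best_by_posterior)
```

Both comparisons pick the same tag, so the classifier can skip computing P(sentence) altogether.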

Comparing these two products is better, since we could actually calculate these probabilities! Just count how many times the sentence “that was the unforgettable game” appears under the Sports tag, divide it by the total number of Sports sentences, and obtain
\begin{equation*}
P(that\ was\ the\ unforgettable\ game|Sports)
\end{equation*}



There’s a problem though: “that was the unforgettable game” doesn’t appear in our training data, so this probability is zero. Unless every sentence that we want to classify appears in our training data, the model won’t be very useful.
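The zero-probability problem can be reproduced in a few lines. This sketch copies the Sports sentences from the data set and estimates the whole-sentence likelihood by counting exact matches; the variable names are chosen here for illustration.

```python
# Sports sentences copied from the training data.
sports_sentences = [
    "Barca paly well",
    "foolball is the famous game",
    "this game was not fair",
]
test_sentence = "that was the unforgettable game"

# Estimate P(sentence | Sports) by counting exact matches.
matches = sports_sentences.count(test_sentence)
p_sentence_given_sports = matches / len(sports_sentences)
print(p_sentence_given_sports)
```

Because the test sentence never occurs verbatim, the estimate is zero regardless of how sports-like its individual words are.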

## 2. Naive

Being naive, we assume that every word in a sentence is independent of the other ones. This means that we're no longer looking at entire sentences, but rather at individual words. So for our purposes, "this was a fun party" is the same as "this party was fun" and "party fun was this".


We write this as:

\begin{equation*}
P(that\ was\ the\ unforgettable\ game) = P(that)*P(was)*P(the)*P(unforgettable)*P(game)
\end{equation*}

Therefore:
    
\begin{equation*}
P(that\ was\ the\ unforgettable\ game|Sports) = P(that|Sports)*P(was|Sports)*P(the|Sports)*P(unforgettable|Sports)*P(game|Sports)
\end{equation*}


And now, individual words are far more likely than whole sentences to show up in our training data, so we can estimate these probabilities by counting!
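The factorization can be sketched as a product over words. The per-word probabilities below are hypothetical, and `fractions.Fraction` is used so that the comparison is exact instead of being subject to floating-point rounding.

```python
import math
from fractions import Fraction

# Hypothetical values of P(word | Sports), for illustration only.
p_word_given_sports = {
    "this": Fraction(1, 10),
    "party": Fraction(1, 20),
    "was": Fraction(1, 10),
    "fun": Fraction(1, 20),
}

def sentence_likelihood(sentence, p_word):
    """Naive likelihood: the product of the per-word probabilities."""
    return math.prod(p_word[w] for w in sentence.split())

a = sentence_likelihood("this party was fun", p_word_given_sports)
b = sentence_likelihood("party fun was this", p_word_given_sports)
print(a, b, a == b)
```

Since multiplication is commutative, any reordering of the same words yields the same likelihood, which is the practical meaning of ignoring word order.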

## 3. Calculating probabilities

Calculating a probability is just counting in our training data.


1. Calculate the prior probabilities, i.e. P(Sports) and P(Not Sports)

In [5]:
test_text = "that was the unforgetable game"

In [6]:
number_of_sports_occurence = data['tag'][data['tag']=='Sports'].count()
number_of_not_sports_occurence = data['tag'][data['tag']=='Not Sports'].count()
total_occurence_of_tag = data['tag'].count()

In [7]:
p_sports = number_of_sports_occurence / total_occurence_of_tag
p_not_sports = number_of_not_sports_occurence / total_occurence_of_tag

In [8]:
print("number_of_sports_occurence: ", number_of_sports_occurence)
print("number_of_not_sports_occurence: ", number_of_not_sports_occurence)
print("total_occurence_of_tag: ", total_occurence_of_tag)
print()
print("p_sports: ", p_sports)
print("p_not_sports: ", p_not_sports)

number_of_sports_occurence:  3
number_of_not_sports_occurence:  2
total_occurence_of_tag:  5

p_sports:  0.6
p_not_sports:  0.4
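For any number of tags, the same priors can be computed in one pass. This is a minimal sketch using only the standard library; the tag list is copied from the data set above.

```python
from collections import Counter

tags = ["Sports", "Not Sports", "Sports", "Sports", "Not Sports"]

# Count each tag, then divide by the number of training sentences.
tag_counts = Counter(tags)
priors = {tag: count / len(tags) for tag, count in tag_counts.items()}
print(priors)  # {'Sports': 0.6, 'Not Sports': 0.4}
```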


Recall the two quantities we want to compare:

\begin{equation*}
P(that\ was\ the\ unforgettable\ game|Sports)*P(Sports)
\end{equation*}

with


\begin{equation*}
P(that\ was\ the\ unforgettable\ game|Not\ Sports)*P(Not\ Sports)
\end{equation*}
    

The naive assumption lets us write:

\begin{equation*}
P(that\ was\ the\ unforgettable\ game) = P(that)*P(was)*P(the)*P(unforgettable)*P(game)
\end{equation*}

and

\begin{equation*}
P(that\ was\ the\ unforgettable\ game|Sports) = P(that|Sports)*P(was|Sports)*P(the|Sports)*P(unforgettable|Sports)*P(game|Sports)
\end{equation*}

and

\begin{equation*}
P(that\ was\ the\ unforgettable\ game|Not\ Sports) = P(that|Not\ Sports)*P(was|Not\ Sports)*P(the|Not\ Sports)*P(unforgettable|Not\ Sports)*P(game|Not\ Sports)
\end{equation*}



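#### Here we have one issue:
In the given sentence, the words "that" and "unforgettable" do not appear under either tag, and "was", "the", and "game" never appear under Not Sports. Their raw counts are zero, so the whole product would be zero.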

#### To handle this we apply Laplace smoothing (Laplace estimation)

We add 1 to every count so it’s never zero. To balance this, we add the number of possible words to the divisor, so the division will never be greater than 1.
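The smoothing rule can be sketched as a small function. It assumes that "the number of possible words" means the number of distinct words in the training data (the vocabulary); the function name and the tiny corpus are hypothetical.

```python
from collections import Counter

def smoothed_probability(word, class_words, vocabulary):
    """Laplace-smoothed P(word | class) = (count + 1) / (N_class + |V|)."""
    counts = Counter(class_words)
    return (counts[word] + 1) / (len(class_words) + len(vocabulary))

# Tiny made-up corpus, for illustration only.
sports_words = "a great game a close game".split()
other_words = "the election was close".split()
vocabulary = set(sports_words) | set(other_words)

p_game = smoothed_probability("game", sports_words, vocabulary)
p_election = smoothed_probability("election", sports_words, vocabulary)
total = sum(smoothed_probability(w, sports_words, vocabulary) for w in vocabulary)
print(p_game, p_election, total)
```

A word never seen under the tag still gets a small non-zero probability, and with the vocabulary size in the divisor the smoothed probabilities over all known words still sum to 1 (up to floating-point rounding).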

In [9]:
# Flatten the training sentences into word lists: overall and per tag
all_word_in_text = [j for i in data["text"] for j in i.split()]
all_word_in_text_having_sports = [j for i in data["text"][data['tag']=="Sports"] for j in i.split()]
all_word_in_text_having_not_sports = [j for i in data["text"][data['tag']=="Not Sports"] for j in i.split()]

# "Possible words" are the distinct words (the vocabulary), so deduplicate with set()
number_of_possible_words = len(set(all_word_in_text))
number_of_words_having_sports = len(all_word_in_text_having_sports)
number_of_words_having_not_sports = len(all_word_in_text_having_not_sports)

# Laplace smoothing: divisor = number of words under the tag + vocabulary size
p_sports_denominator = number_of_possible_words + number_of_words_having_sports
p_not_sports_denominator = number_of_possible_words + number_of_words_having_not_sports

# Word counts per tag
occurence_of_all_word_having_sports = Counter(all_word_in_text_having_sports)
occurence_of_all_word_having_not_sports = Counter(all_word_in_text_having_not_sports)

# Smoothed counts (count + 1) for every word of the test sentence
observation = pd.DataFrame()
observation['words'] = test_text.split()
print([(word, occurence_of_all_word_having_sports.get(word, 0) + 1) for word in observation["words"]])

print([(word, occurence_of_all_word_having_not_sports.get(word, 0) + 1) for word in observation["words"]])

# Smoothed conditional probabilities
observation["p(word|Sports)"] = [((occurence_of_all_word_having_sports.get(word, 0) + 1) / p_sports_denominator) for word in observation["words"]]
observation["p(word|Not Sports)"] = [((occurence_of_all_word_having_not_sports.get(word, 0) + 1) / p_not_sports_denominator) for word in observation["words"]]


[('that', 1), ('was', 2), ('the', 2), ('unforgetable', 1), ('game', 3)]
[('that', 1), ('was', 1), ('the', 1), ('unforgetable', 1), ('game', 1)]


In [10]:
observation

|   | words | p(word\|Sports) | p(word\|Not Sports) |
|---|-------|-----------------|---------------------|
| 0 | that | 0.032258 | 0.040000 |
| 1 | was | 0.064516 | 0.040000 |
| 2 | the | 0.064516 | 0.040000 |
| 3 | unforgetable | 0.032258 | 0.040000 |
| 4 | game | 0.096774 | 0.040000 |


In [11]:
# Multiply the per-word probabilities, then weight by the prior of each tag
sports = observation["p(word|Sports)"].prod() * p_sports
not_sports = observation["p(word|Not Sports)"].prod() * p_not_sports


In [12]:
result = "Sports" if sports > not_sports else "Not Sports"
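For longer sentences, multiplying many small probabilities can underflow to zero in floating point. A common workaround, sketched here with hypothetical word probabilities, is to compare sums of logarithms instead of products; the helper name `log_score` is an assumption of this example.

```python
import math

def log_score(word_probs, prior):
    """log(prior) + sum of the log word probabilities."""
    return math.log(prior) + sum(math.log(p) for p in word_probs)

# Hypothetical smoothed probabilities for a five-word sentence.
sports_probs = [0.03, 0.06, 0.06, 0.03, 0.09]
not_sports_probs = [0.04, 0.04, 0.04, 0.04, 0.04]

sports_log = log_score(sports_probs, 0.6)
not_sports_log = log_score(not_sports_probs, 0.4)
log_result = "Sports" if sports_log > not_sports_log else "Not Sports"
print(log_result)

# A plain product of 200 small factors underflows to 0.0,
# while the equivalent log-sum stays a finite number.
underflowed = math.prod([0.01] * 200)
finite_log = 200 * math.log(0.01)
print(underflowed, finite_log)
```

Because the logarithm is monotonic, the tag with the larger log-score is also the tag with the larger product.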

In [13]:
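# DataFrame.append was removed in pandas 2.0; pd.concat likewise returns a new frame and leaves `data` unchanged
pd.concat([data, pd.DataFrame([dict(zip(data.columns, [test_text, result]))])], ignore_index=True)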

|   | text | tag |
|---|------|-----|
| 0 | Barca paly well | Sports |
| 1 | Obama is great politician | Not Sports |
| 2 | foolball is the famous game | Sports |
| 3 | this game was not fair | Sports |
| 4 | caffine to code | Not Sports |
| 5 | that was the unforgetable game | Sports |


In [14]:
data

|   | text | tag |
|---|------|-----|
| 0 | Barca paly well | Sports |
| 1 | Obama is great politician | Not Sports |
| 2 | foolball is the famous game | Sports |
| 3 | this game was not fair | Sports |
| 4 | caffine to code | Not Sports |
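The whole walkthrough can be condensed into two small functions. This is only a sketch under stated assumptions: the names `train` and `classify` are hypothetical, the vocabulary is taken to be the set of distinct training words, and scores are compared in log space. It is not meant as a replacement for a library implementation.

```python
import math
from collections import Counter

def train(texts, tags):
    """Collect tag counts, per-tag word counts, and the vocabulary."""
    tag_counts = Counter(tags)
    word_counts = {tag: Counter() for tag in tag_counts}
    for text, tag in zip(texts, tags):
        word_counts[tag].update(text.split())
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return tag_counts, word_counts, vocabulary

def classify(text, tag_counts, word_counts, vocabulary):
    """Return the tag with the highest Laplace-smoothed log-score."""
    total = sum(tag_counts.values())
    scores = {}
    for tag, n in tag_counts.items():
        denominator = sum(word_counts[tag].values()) + len(vocabulary)
        score = math.log(n / total)  # log prior
        for word in text.split():
            score += math.log((word_counts[tag][word] + 1) / denominator)
        scores[tag] = score
    return max(scores, key=scores.get)

# Training data copied from the data set above.
texts = [
    "Barca paly well",
    "Obama is great politician",
    "foolball is the famous game",
    "this game was not fair",
    "caffine to code",
]
tags = ["Sports", "Not Sports", "Sports", "Sports", "Not Sports"]

model = train(texts, tags)
prediction = classify("that was the unforgetable game", *model)
print(prediction)
```

Calling `classify` with another sentence reuses the same counts, so the training step only has to run once.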
