## Stop Words:

    - Stop words are words that are filtered out before or after the text is processed.
    - When machine learning is applied to text, these words can add a lot of noise, so we usually want to remove them.
    - "Stop words" usually refers to the most common words in a language, such as "and", "the", and "a", but there is no
      single universal list of stop words.
    - The list of stop words can change depending on the application you are working on.
    - NLTK ships a predefined list of stop words covering the most common words of a language.
    
    The first time you use the list, you need to download it with the command below:

In [1]:
import nltk

In [2]:
nltk.download('stopwords')

[nltk_data] Error loading stopwords: <urlopen error [Errno 11001]
[nltk_data]     getaddrinfo failed>


False

In [3]:
from nltk.corpus import stopwords

In [4]:
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [5]:
stop_words = set(stopwords.words('english'))

In [6]:
sentence = "Cricket is one of the most common games followed in india"

In [7]:
words = sentence.split()

In [8]:
cleaned_data = [word for word in words if word not in stop_words]

In [9]:
cleaned_data

['Cricket', 'one', 'common', 'games', 'followed', 'india']

In [10]:
sentence

'Cricket is one of the most common games followed in india'
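The membership test in the list comprehension above is case-sensitive, and the NLTK English list is all lowercase, so a capitalized stop word such as "The" at the start of a sentence would not be removed. The sketch below lowercases each word before the lookup and also shows how the list can be tuned for an application. To keep it runnable without downloading NLTK data, it uses a small hand-written `stop_words` set; that set, the helper `remove_stop_words`, and the example sentence are assumptions for illustration only.

```python
# A tiny hand-written stop-word set (an assumption for illustration;
# in the notebook this would be set(stopwords.words('english'))).
stop_words = {"is", "of", "the", "most", "in", "a", "an", "and"}

sentence = "The match is one of the most followed games in india"


def remove_stop_words(text, stop_words):
    """Drop every token whose lowercase form is in stop_words."""
    return [word for word in text.split() if word.lower() not in stop_words]


cleaned = remove_stop_words(sentence, stop_words)
print(cleaned)  # ['match', 'one', 'followed', 'games', 'india']

# Application-specific tuning: treat a domain word as a stop word,
# and keep a word that the base list would otherwise drop.
custom_stop_words = (stop_words | {"games"}) - {"most"}
custom_cleaned = remove_stop_words(sentence, custom_stop_words)
print(custom_cleaned)  # ['match', 'one', 'most', 'followed', 'india']
```

Using a `set` keeps each lookup fast, which is the same reason the notebook wraps `stopwords.words('english')` in `set(...)`.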

## Bag of Words:

    - Machine learning algorithms cannot work with raw text directly; the text must first be converted into vectors of
      numbers.
    - This process is called feature extraction.
    - The bag-of-words model is a popular and simple feature extraction technique for text.
    - It describes the occurrence of each word within a document.
    
    Steps to use:
        1) Design the vocabulary of known words (called tokens).
        2) Choose a measure of the presence of known words.
    
    - Any information about the order or structure of the words is discarded, which is why it is called a bag of words.
    - The model only captures whether a known word occurs in a document, not where in the document it occurs.
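To make the "order is discarded" point concrete, here is a minimal sketch using only the standard library. The helper `bag` and the two example sentences are hypothetical and not part of the notebook's data.

```python
from collections import Counter


def bag(text):
    """Count lowercase whitespace-separated tokens; word order is not kept."""
    return Counter(text.lower().split())


first = bag("the dog chased the cat")
second = bag("the cat chased the dog")

print(first["the"])     # 2
print(first == second)  # True: same words, same counts, different order
```

The two sentences mean different things, yet their bags are identical, which is exactly the information a bag-of-words model gives up.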

In [11]:
with open('data.txt','r') as data:
    raw_data = data.read().splitlines()
print(raw_data)

['i am a python developer,', "i like this movie, it's funny", 'this was awesome! i like it', 'nice one i love it.']


### Design the Vocabulary

    Collect all the unique words from the four loaded sentences, ignoring case, punctuation, and one-character tokens.
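Both steps can be sketched by hand before reaching for scikit-learn. The example below assumes a token is a run of two or more word characters, which is how `CountVectorizer` tokenizes by default; the names `sentences`, `tokenize`, `vocabulary`, and `first_row` are illustrative only, and behaviour on other inputs may differ from the library.

```python
import re

# The four sentences loaded from data.txt earlier in the notebook.
sentences = [
    "i am a python developer,",
    "i like this movie, it's funny",
    "this was awesome! i like it",
    "nice one i love it.",
]

# Runs of two or more word characters; this drops punctuation
# and one-character tokens such as "i" and "a".
TOKEN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and return its tokens."""
    return TOKEN.findall(text.lower())


# Step 1: the vocabulary is the sorted set of unique tokens.
vocabulary = sorted({token for s in sentences for token in tokenize(s)})
print(vocabulary)

# Step 2: score each known word by how often it occurs in a document.
first_row = [tokenize(sentences[0]).count(word) for word in vocabulary]
print(first_row)  # [1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
```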

In [12]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

In [13]:
CV = CountVectorizer()

In [14]:
# build the bag-of-words matrix (one row per sentence, one column per vocabulary word)

bag_of_words = CV.fit_transform(raw_data)

In [15]:
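# column names for the bag-of-words matrix
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())

feature_name = CV.get_feature_names_out()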

In [16]:
df = pd.DataFrame(bag_of_words.toarray(), columns=feature_name)

In [17]:
df

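|   | am | awesome | developer | funny | it | like | love | movie | nice | one | python | this | was |
|---|----|---------|-----------|-------|----|------|------|-------|------|-----|--------|------|-----|
| 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 2 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 3 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 |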


In [18]:
print(raw_data)

['i am a python developer,', "i like this movie, it's funny", 'this was awesome! i like it', 'nice one i love it.']


In [19]:
df.iloc[0]

am           1
awesome      0
developer    1
funny        0
it           0
like         0
love         0
movie        0
nice         0
one          0
python       1
this         0
was          0
Name: 0, dtype: int64

## Term Frequency - Inverse Document Frequency:

    - One problem with scoring raw word frequency is that the most frequent words in a document end up with the
      highest scores.
    - These frequent words may carry less information for the model than rarer, domain-specific words.
    - One way to fix this is to penalize words that are frequent across all documents.
    - This approach is called TF-IDF.
    - TF-IDF ("Term Frequency - Inverse Document Frequency") is a statistical measure used to evaluate the importance
      of a word to a document within a collection of documents.
    - The TF-IDF score increases proportionally to the number of times a word appears in a document and is offset by
      the number of documents that contain the word.

**Formula:**

    Term Frequency TF(t,d) = number of times term 't' appears in document 'd' / total number of words in document 'd'

    Inverse Document Frequency IDF(t) = log(total number of documents / number of documents containing term 't')

                                      = log(N / df)

    In practice, 1 is often added to the denominator so that it can never be zero:

                               IDF(t) = log(N / (df + 1))

    TF-IDF(t,d) = TF(t,d) x IDF(t)

                = TF(t,d) x log(N / (df + 1))
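As a worked example, the formula with `log(N / (df + 1))` can be applied by hand to the four sentences used in this notebook. This is a minimal sketch, not what scikit-learn computes: the tokenizer here simply splits on whitespace and strips surrounding punctuation, the logarithm is taken as the natural log, and the helper names `tokenize`, `tf`, `idf`, and `tf_idf` are assumptions for illustration.

```python
import math

documents = [
    "i am a python developer,",
    "i like this movie, it's funny",
    "this was awesome! i like it",
    "nice one i love it.",
]


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    return [word.strip(".,!") for word in text.lower().split()]


tokenized = [tokenize(doc) for doc in documents]


def tf(term, tokens):
    """Occurrences of the term divided by the number of words in the document."""
    return tokens.count(term) / len(tokens)


def idf(term, corpus):
    """log(N / (df + 1)), where df is the number of documents containing the term."""
    df = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / (df + 1))


def tf_idf(term, doc_index, corpus):
    return tf(term, corpus[doc_index]) * idf(term, corpus)


# "python" occurs in one document, "like" in two.
print(round(tf_idf("python", 0, tokenized), 4))  # 0.1386
print(round(tf_idf("like", 1, tokenized), 4))    # 0.0479
```

The rarer word gets the higher score. Note that with this smoothed form, a term that appears in every document (here "i") ends up with a slightly negative IDF, because log(4/5) is below zero.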

In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [21]:
tfid = TfidfVectorizer()

In [22]:
values = tfid.fit_transform(raw_data)

In [23]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
feature_name = tfid.get_feature_names_out()

In [24]:
df = pd.DataFrame(values.toarray(), columns=feature_name)

In [25]:
df

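|   | am | awesome | developer | funny | it | like | love | movie | nice | one | python | this | was |
|---|----|---------|-----------|-------|----|------|------|-------|------|-----|--------|------|-----|
| 0 | 0.57735 | 0.0 | 0.57735 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.57735 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.523381 | 0.334067 | 0.41264 | 0.0 | 0.523381 | 0.0 | 0.0 | 0.0 | 0.41264 | 0.0 |
| 2 | 0.0 | 0.523381 | 0.0 | 0.0 | 0.334067 | 0.41264 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.41264 | 0.523381 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.345783 | 0.0 | 0.541736 | 0.0 | 0.541736 | 0.541736 | 0.0 | 0.0 | 0.0 |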
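The numbers in this table do not come straight from the textbook formula. With its default settings, `TfidfVectorizer` uses the raw count as the term frequency, a smoothed IDF of `ln((1 + N) / (1 + df)) + 1`, and then scales each row to unit Euclidean (L2) length. The sketch below reproduces a few cells under those assumptions in plain Python; the token lists and the helper names `smooth_idf` and `tfidf_row` are written out by hand for illustration.

```python
import math

# Tokens of the four sentences after lowercasing and dropping
# punctuation and one-character tokens.
docs = [
    ["am", "python", "developer"],
    ["like", "this", "movie", "it", "funny"],
    ["this", "was", "awesome", "like", "it"],
    ["nice", "one", "love", "it"],
]
n_docs = len(docs)


def smooth_idf(term):
    """ln((1 + N) / (1 + df)) + 1, with df = documents containing the term."""
    df = sum(1 for tokens in docs if term in tokens)
    return math.log((1 + n_docs) / (1 + df)) + 1


def tfidf_row(tokens):
    """Raw count x smoothed IDF for each term, then L2-normalize the row."""
    raw = {term: tokens.count(term) * smooth_idf(term) for term in set(tokens)}
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {term: value / norm for term, value in raw.items()}


row0 = tfidf_row(docs[0])
row3 = tfidf_row(docs[3])
print(round(row0["python"], 4))  # 0.5774
print(round(row3["it"], 4))      # 0.3458
print(round(row3["love"], 4))    # 0.5417
```

In row 0 all three words occur in only one document, so they share the same weight and normalization leaves each at 1/sqrt(3). In row 3, "it" appears in three of the four documents and is therefore weighted below "love", "nice", and "one".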


In [26]:
raw_data

['i am a python developer,',
 "i like this movie, it's funny",
 'this was awesome! i like it',
 'nice one i love it.']