# Example 1. Basic Techniques of Natural Language Processing
---
This example demonstrates the basic techniques of **natural language processing (NLP)**. This chapter focuses on sentiment analysis of text, namely movie reviews, so preprocessing the data requires the following basic techniques:
- **Bag-of-words**
- **Term frequency-inverse document frequency, tf-idf**
- **Porter stemmer algorithm**
- **stop-word removal**

In [1]:
import pandas as pd
import numpy as np

### 1. Bag-of-words
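The bag-of-words model represents each document by how often each token occurs in it, ignoring word order. The tokens can be **$n$-grams**, where $n$ is the number of consecutive words in one token: a 1-gram is a single word and a 2-gram is a pair of adjacent words.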

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

In [3]:
docs = np.array([
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining and the weather is sweet'])

In [4]:
count_1g = CountVectorizer()
count_2g = CountVectorizer(ngram_range=(2,2))
bag_1g = count_1g.fit_transform(docs)
bag_2g = count_2g.fit_transform(docs)

In [5]:
print '1-gram: ', count_1g.vocabulary_

1-gram:  {u'and': 0, u'weather': 6, u'sweet': 4, u'sun': 3, u'is': 1, u'the': 5, u'shining': 2}


In [6]:
print '2-gram: ', count_2g.vocabulary_

2-gram:  {u'the sun': 5, u'shining and': 3, u'the weather': 6, u'sun is': 4, u'and the': 0, u'weather is': 7, u'is shining': 1, u'is sweet': 2}
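
The vocabularies above map each token to a column index, and each document becomes a vector of counts over those columns. The following is a minimal pure-Python sketch of that idea, not the scikit-learn implementation. The helper names `ngrams` and `bag_of_ngrams` are hypothetical, and the tokenizer (lowercase, split on whitespace) is a simplifying assumption that happens to agree with `CountVectorizer` on these three sentences.

```python
from collections import Counter

docs = [
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining and the weather is sweet',
]

def ngrams(text, n):
    """Return the n-grams of text as space-joined strings."""
    # Simplified tokenizer (an assumption): lowercase, split on whitespace.
    words = text.lower().split()
    return [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]

def bag_of_ngrams(docs, n):
    """Build a token -> column index mapping and one count vector per document."""
    grams = [ngrams(d, n) for d in docs]
    # Alphabetical order gives each token its column index.
    vocab = sorted(set(g for doc in grams for g in doc))
    index = {g: i for i, g in enumerate(vocab)}
    vectors = []
    for doc in grams:
        counts = Counter(doc)
        vectors.append([counts[g] for g in vocab])
    return index, vectors

vocab_1g, vectors_1g = bag_of_ngrams(docs, 1)
vocab_2g, vectors_2g = bag_of_ngrams(docs, 2)
```

For the third sentence, `vectors_1g[2]` is `[1, 2, 1, 1, 1, 2, 1]`: the words "is" (column 1) and "the" (column 5) each occur twice, and every other word occurs once.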


### 2. Term frequency-inverse document frequency, tf-idf

In [7]:
from sklearn.feature_extraction.text import TfidfTransformer

In [8]:
tfidf = TfidfTransformer()

In [9]:
np.set_printoptions(precision=2)
print tfidf.fit_transform(count_1g.fit_transform(docs)).toarray()

[[ 0.    0.43  0.56  0.56  0.    0.43  0.  ]
 [ 0.    0.43  0.    0.    0.56  0.43  0.56]
 [ 0.4   0.48  0.31  0.31  0.31  0.48  0.31]]
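
Tf-idf down-weights words that appear in many documents. With the default settings of `TfidfTransformer` (smoothed idf and L2 normalization), the weight of a term is its count multiplied by $\ln\frac{1+n}{1+\mathrm{df}(t)}+1$, where $n$ is the number of documents and $\mathrm{df}(t)$ is the number of documents containing the term; each row is then scaled to unit length. The sketch below recomputes the matrix by hand with NumPy under those assumptions. The variable names and the whitespace tokenizer are illustrative choices, not part of the library.

```python
import numpy as np

docs = [
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining and the weather is sweet',
]

# Simplified tokenizer (an assumption): lowercase, split on whitespace.
tokens = [d.lower().split() for d in docs]
vocab = sorted(set(w for t in tokens for w in t))

# Raw term counts: one row per document, one column per vocabulary word.
counts = np.array([[t.count(w) for w in vocab] for t in tokens], dtype=float)

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)                    # document frequency of each term
idf = np.log((1.0 + n_docs) / (1.0 + df)) + 1.0  # smoothed inverse document frequency

# Weight the counts by idf, then scale each row to unit Euclidean length.
tfidf_raw = counts * idf
tfidf_manual = tfidf_raw / np.linalg.norm(tfidf_raw, axis=1, keepdims=True)
```

Rounded to two decimals, `tfidf_manual` agrees with the matrix printed above. The words "is" and "the" occur in all three documents, so their idf is exactly 1, the smallest possible value.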


### 3. Porter stemmer algorithm
The Porter stemmer reduces each word to its stem by stripping common suffixes, e.g. runs -> run.

In [10]:
from nltk.stem.porter import PorterStemmer #pip install nltk

In [11]:
text='runners like running and thus they run'
print text

runners like running and thus they run


In [12]:
porter = PorterStemmer()
text_stem = []
for word in text.split():
    stem = porter.stem(word)
    text_stem.append(stem)
    print '%s -> %s'%( word, stem )
print text_stem

runners -> runner
like -> like
running -> run
and -> and
thus -> thu
they -> they
run -> run
[u'runner', 'like', u'run', 'and', u'thu', 'they', 'run']
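
The result `thus -> thu` looks odd but follows from the rules: the stemmer applies suffix rules mechanically and does not check whether the result is a real word. The sketch below implements only step 1a (plural removal) of the originally published Porter algorithm, as a rough illustration. The function name `porter_step_1a` is hypothetical, the remaining steps (which handle suffixes such as "-ing") are omitted, and NLTK's `PorterStemmer` may differ in some details.

```python
def porter_step_1a(word):
    """Apply only step 1a of the original Porter algorithm (plural removal)."""
    if word.endswith('sses'):
        return word[:-2]   # caresses -> caress
    if word.endswith('ies'):
        return word[:-2]   # ponies -> poni
    if word.endswith('ss'):
        return word        # caress -> caress (unchanged)
    if word.endswith('s'):
        return word[:-1]   # runners -> runner, thus -> thu
    return word            # no plural-like suffix: unchanged

step_1a_result = [porter_step_1a(w) for w in ['runners', 'thus', 'running']]
```

Here `step_1a_result` is `['runner', 'thu', 'running']`: the trailing "s" rule alone already explains `thus -> thu`, while "running" is left for a later step of the full algorithm.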


### 4. Stop-word removal
Stop words are very common words that carry little meaning on their own, e.g. is, the, has. Removing them keeps only the more informative words.

In [13]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /Users/Alpha/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [14]:
from nltk.corpus import stopwords

In [15]:
stop = stopwords.words('english')
text_stem_rm = []
for w in text_stem:
    if w not in stop:
        text_stem_rm.append(w)
print text_stem_rm

[u'runner', 'like', u'run', u'thu', 'run']
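
The same filtering can be sketched more compactly with a set and a list comprehension. `stopwords.words('english')` returns a list, so converting it to a set makes each membership test constant time on average. To keep the sketch self-contained, the snippet below uses a small hand-picked stop set (`stop_small`, a hypothetical stand-in for the full NLTK list) and the stemmed tokens copied from the stemming step.

```python
# Stemmed tokens copied from the Porter stemmer step.
text_stem = ['runner', 'like', 'run', 'and', 'thu', 'they', 'run']

# Hypothetical hand-picked stop set; with NLTK installed one could use
# set(stopwords.words('english')) instead.
stop_small = {'is', 'the', 'has', 'and', 'they'}

# Keep only the tokens that are not stop words, preserving their order.
text_stem_rm = [w for w in text_stem if w not in stop_small]
```

With this small stop set, `text_stem_rm` is `['runner', 'like', 'run', 'thu', 'run']`, the same tokens as in the output above.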
