### NLP Pipeline

#### Text processing :

    1.Cleaning :
            re
            Beautiful Soup

    2.Normalization,
    3.Tokenization,
    4.Stop word removal,
    5.Part of speech Tagging and Named Entity Recognition,
    6.Stemming and Lemmatization

#### Feature Extraction : 
    Bag of words, 
    TF-IDF, 
    Word Embeddings

#### Modelling

#### 1.Cleaning :
        re
        Beautiful soup

In [30]:
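from bs4 import BeautifulSoup
import requests

# Download the raw HTML bytes of the page
html = requests.get('http://www.jianshu.com/').content
# Parse the bytes, telling Beautiful Soup to decode them as UTF-8
soup = BeautifulSoup(html, 'html.parser', from_encoding='utf-8')
# Calling the soup object is shorthand for soup.find_all('div')
result = soup('div')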

In [31]:
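# Descendant selector: <title> anywhere inside <head> inside <html>
soup.select("html head title")

# Descendant selector: <a> nested at any depth under td --> div
soup.select('td div a')

# Child selector: each tag must be a direct child of the previous one
soup.select('td > div > a')
# All <div> elements whose class is "course-summary-card"
soup.find_all("div", {"class": "course-summary-card"})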

[]
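
The outline lists `re` as a cleaning tool, but the cells above only use Beautiful Soup. Below is a minimal sketch of regex-based cleaning, assuming an inline HTML string instead of a downloaded page. The names `clean_html` and `sample` are hypothetical, and stripping tags with a regex is a rough approach that can fail on malformed markup, so a real parser is usually the safer choice.

```python
import html
import re

def clean_html(raw):
    # Drop <script> and <style> blocks together with their contents
    no_scripts = re.sub(r"(?is)<(script|style)\b.*?>.*?</\1>", " ", raw)
    # Replace every remaining tag with a space
    no_tags = re.sub(r"(?s)<[^>]+>", " ", no_scripts)
    # Turn entities such as &amp; back into plain characters
    text = html.unescape(no_tags)
    # Collapse runs of whitespace into single spaces
    return re.sub(r"\s+", " ", text).strip()

sample = "<div><h1>NLP &amp; Text</h1><p>Clean   me.</p><script>var x = 1;</script></div>"
print(clean_html(sample))  # NLP & Text Clean me.
```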

#### 2.Normalization - Normalization converts text to all lowercase and removes punctuation.
    lower()
    re.sub()

In [32]:
text = '''Edit the Expression & Text to see matches. Roll over matches or the expression for details. PCRE & Javascript flavors of RegEx are supported.'''
text = text.lower()
print(text)

edit the expression & text to see matches. roll over matches or the expression for details. pcre & javascript flavors of regex are supported.


In [33]:
import re
text = re.sub(r"[^a-zA-Z0-9]", " ", text) 
print(text)

edit the expression   text to see matches  roll over matches or the expression for details  pcre   javascript flavors of regex are supported 
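
Replacing punctuation with spaces leaves runs of consecutive spaces, as the output above shows. Here is a small sketch that lowercases, removes punctuation and collapses the leftover whitespace in one helper. The function name `normalize` is an assumption for illustration, not part of the original notebook.

```python
import re

def normalize(text):
    # Lowercase first so the character class only needs a-z and 0-9
    lowered = text.lower()
    # Replace anything that is not a letter or digit with a space
    no_punct = re.sub(r"[^a-z0-9]", " ", lowered)
    # split() with no arguments drops empty strings, collapsing extra spaces
    return " ".join(no_punct.split())

print(normalize("PCRE & Javascript flavors of RegEx are supported."))
# pcre javascript flavors of regex are supported
```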


#### 3.Tokenization
    nltk word_tokenize, sent_tokenize

In [34]:
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
text = '''Edit the Expression & Text to see matches. Roll over matches or the expression for details. PCRE & Javascript flavors of RegEx are supported.'''
words = word_tokenize(text)
print(words)

['Edit', 'the', 'Expression', '&', 'Text', 'to', 'see', 'matches', '.', 'Roll', 'over', 'matches', 'or', 'the', 'expression', 'for', 'details', '.', 'PCRE', '&', 'Javascript', 'flavors', 'of', 'RegEx', 'are', 'supported', '.']


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\abhil\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [35]:
sentences = sent_tokenize(text)
print(sentences)

['Edit the Expression & Text to see matches.', 'Roll over matches or the expression for details.', 'PCRE & Javascript flavors of RegEx are supported.']
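
To get a feel for what word tokenization does, here is a rough regex approximation that needs no NLTK data. It is only a sketch: `regex_tokenize` is a hypothetical helper, and `word_tokenize` handles contractions, abbreviations and other cases that this pattern does not.

```python
import re

def regex_tokenize(text):
    # A token is either a run of word characters or one punctuation character
    return re.findall(r"\w+|[^\w\s]", text)

print(regex_tokenize("Edit the Expression & Text to see matches."))
# ['Edit', 'the', 'Expression', '&', 'Text', 'to', 'see', 'matches', '.']
```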


#### 4.Stop word removal

In [36]:
import re

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')

text = '''Edit the Expression & Text to see matches. Roll over matches or the expression for details. PCRE & Javascript flavors of RegEx are supported.'''

# Normalize text
text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())

# Tokenize text
words = word_tokenize(text)

# Remove stop words (build the set once instead of reloading the list for every word)
stop_words = set(stopwords.words("english"))
words = [w for w in words if w not in stop_words]
print(words)

['edit', 'expression', 'text', 'see', 'matches', 'roll', 'matches', 'expression', 'details', 'pcre', 'javascript', 'flavors', 'regex', 'supported']



#### 5.POS and NER

In [37]:
import nltk
from nltk import pos_tag, ne_chunk
from nltk.tokenize import word_tokenize

# Resources needed by the tokenizer, the POS tagger and the NE chunker
nltk.download('words')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')

text = "I always lie down to tell a lie."

# tokenize text
sentence = word_tokenize(text)

# tag each word with part of speech
pos_tag(sentence)


[('I', 'PRP'),
 ('always', 'RB'),
 ('lie', 'VBP'),
 ('down', 'RP'),
 ('to', 'TO'),
 ('tell', 'VB'),
 ('a', 'DT'),
 ('lie', 'NN'),
 ('.', '.')]
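
The tags in the output above come from the Penn Treebank tag set. As a reading aid, the sketch below maps each tag that appears in this output to a short description. The names `TAG_MEANINGS` and `describe` are hypothetical, and the table covers only these tags, not the full tag set.

```python
# Penn Treebank tags that appear in the output above
TAG_MEANINGS = {
    "PRP": "personal pronoun",
    "RB": "adverb",
    "VBP": "verb, non-3rd person singular present",
    "RP": "particle",
    "TO": "to",
    "VB": "verb, base form",
    "DT": "determiner",
    "NN": "noun, singular or mass",
    ".": "sentence-final punctuation",
}

# The tagged sentence, copied from the pos_tag output
tagged = [("I", "PRP"), ("always", "RB"), ("lie", "VBP"), ("down", "RP"),
          ("to", "TO"), ("tell", "VB"), ("a", "DT"), ("lie", "NN"), (".", ".")]

def describe(tagged_tokens):
    # Pair each word with a readable description of its tag
    return [(word, TAG_MEANINGS.get(tag, "unknown tag")) for word, tag in tagged_tokens]

for word, meaning in describe(tagged):
    print(f"{word:>6}  {meaning}")
```

The two occurrences of "lie" receive different tags (a verb and a noun), which is what this example sentence is meant to show.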

In [38]:
text = "Jim will go to Beijing to study in Peking University"
# tokenize, pos tag, then recognize named entities in text
tree = ne_chunk(pos_tag(word_tokenize(text)))
print(tree)

(S
  (PERSON Jim/NNP)
  will/MD
  go/VB
  to/TO
  (GPE Beijing/NNP)
  to/TO
  study/VB
  in/IN
  (GPE Peking/NNP University/NNP))


#### 6.Stemming and Lemmatization

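Stemming strips suffixes by rule and may produce stems that are not real words (such as 'renaiss'), while lemmatization maps each word to its dictionary form. Both are still in nltk: this time it’s PorterStemmer and WordNetLemmatizer.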

In [39]:
from nltk.stem.porter import PorterStemmer
words = ['renaissance', 'may', 'look', 'boring', 'look', 'least', 'twice']
# Reduce words to their stems (create the stemmer once and reuse it)
stemmer = PorterStemmer()
stemmed = [stemmer.stem(w) for w in words]
print(stemmed)

['renaiss', 'may', 'look', 'bore', 'look', 'least', 'twice']


In [40]:
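import nltk
from nltk.stem.wordnet import WordNetLemmatizer
nltk.download('wordnet')  # data required by WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmed = [lemmatizer.lemmatize(w, pos='v') for w in words]  # treat each word as a verb
print(lemmed)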

['renaissance', 'may', 'look', 'bore', 'look', 'least', 'twice']
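
The two outputs differ only for 'renaissance', so the difference between the techniques is easy to miss. The toy sketch below contrasts rule-based suffix stripping with dictionary lookup. It is not the Porter algorithm and not WordNet: `SUFFIXES`, `TOY_LEMMAS`, `toy_stem` and `toy_lemmatize` are all made up for illustration.

```python
# Toy rules and a toy dictionary, for illustration only
SUFFIXES = ("ing", "ed", "s")
TOY_LEMMAS = {"boring": "bore", "mice": "mouse", "was": "be"}

def toy_stem(word):
    # Strip the first matching suffix, keeping at least three characters
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    # Look the word up in the dictionary; unknown words are returned unchanged
    return TOY_LEMMAS.get(word, word)

for w in ["boring", "mice", "looked"]:
    print(w, toy_stem(w), toy_lemmatize(w))
# boring bor bore
# mice mice mouse
# looked look looked
```

The rule-based stemmer needs no vocabulary but can output non-words ('bor'), while the lookup only helps for words it knows about.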


### Feature Extraction

#### 1.Bag of Words

In [41]:
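import re
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["The first time you see The Second Renaissance it may look boring.",
        "Look at it at least twice and definitely watch part 2.",
        "It will change your view of the matrix.",
        "Are the human people the ones who started the war?",
        "Is AI a bad thing ?"]

# `tokenize` was never defined in this notebook; the version below is an
# assumed definition that chains the text-processing steps shown earlier
stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def tokenize(text):
    # normalize, tokenize, drop stop words, then lemmatize
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    return [lemmatizer.lemmatize(w) for w in word_tokenize(text) if w not in stop_words]

# initialize count vectorizer object with the custom tokenize function
vect = CountVectorizer(tokenizer=tokenize)
# get counts of each token (word) in text data
X = vect.fit_transform(corpus)
# convert sparse matrix to numpy array to view
print(X.toarray())
# view token vocabulary (token -> column index in X)
print(vect.vocabulary_)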
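
To make explicit what a bag-of-words vectorizer computes, here is a minimal pure-Python sketch that builds a vocabulary and a document-term count matrix by hand. It is not how `CountVectorizer` is implemented: `simple_tokenize`, `bag_of_words` and the two-sentence `docs` list are hypothetical and chosen only to keep the result easy to check.

```python
import re
from collections import Counter

def simple_tokenize(text):
    # Hypothetical helper: lowercase and keep runs of letters and digits
    return re.findall(r"[a-z0-9]+", text.lower())

def bag_of_words(corpus):
    tokenized = [simple_tokenize(doc) for doc in corpus]
    # A sorted vocabulary gives every token a stable column index
    all_tokens = sorted({tok for doc in tokenized for tok in doc})
    vocab = {tok: i for i, tok in enumerate(all_tokens)}
    matrix = []
    for doc in tokenized:
        row = [0] * len(vocab)
        # Write each token's count into its column
        for tok, n in Counter(doc).items():
            row[vocab[tok]] = n
        matrix.append(row)
    return vocab, matrix

docs = ["Look at it at least twice.", "It may look boring."]
vocab, matrix = bag_of_words(docs)
print(vocab)
# {'at': 0, 'boring': 1, 'it': 2, 'least': 3, 'look': 4, 'may': 5, 'twice': 6}
print(matrix)
# [[2, 0, 1, 1, 1, 0, 1], [0, 1, 1, 0, 1, 1, 0]]
```

Each row is one document and each column one vocabulary token, so word order is discarded and only the counts remain.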