# Tokenization
Tokenization is a common first step in Natural Language Processing (NLP). It splits a piece of text into smaller units called tokens, which can be words, characters, or subwords. Tokens are the building blocks that every later processing step works on.
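
As a quick illustration of the three granularities, the sketch below tokenizes one string using only the standard library. The sample text and the tiny `subword_vocab` are made up for illustration; real subword tokenizers such as BPE or WordPiece learn their vocabulary from data, so treat `to_subwords` as a toy stand-in rather than an actual algorithm from any library.

```python
# Sample text (made up for illustration).
text = "empowering farmers"

# Word-level tokens: split on whitespace.
word_tokens = text.split()

# Character-level tokens: every character except whitespace.
char_tokens = [ch for ch in text if not ch.isspace()]

# Subword-level tokens: greedy longest-match against a toy vocabulary.
# This vocabulary is hypothetical; real tokenizers learn theirs from a corpus.
subword_vocab = {"empower", "ing", "farm", "er", "s"}

def to_subwords(word, vocab):
    """Split `word` into the longest vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        # Try the longest candidate first and shrink until a piece matches.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No piece matched: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces

subword_tokens = [p for w in word_tokens for p in to_subwords(w, subword_vocab)]

print(word_tokens)       # ['empowering', 'farmers']
print(len(char_tokens))  # 17
print(subword_tokens)    # ['empower', 'ing', 'farm', 'er', 's']
```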

## Importing libraries

In [9]:
import pandas as pd
import numpy as np 

from tensorflow.keras.preprocessing.text import Tokenizer

## Reading data

In [4]:
tweets = pd.read_csv('C:\\Users\\nehal\\Music\\12.NLP\\Practise\\Datasets\\narendramodi_tweets.csv')
print(tweets.shape)
tweets.head()

(3220, 14)


Unnamed: 0,id,retweets_count,favorite_count,created_at,text,lang,retweeted,followers_count,friends_count,hashtags_count,description,location,background_image_url,source
0,8.263846e+17,1406.0,4903.0,2017-01-31 11:00:07,The President's address wonderfully encapsulat...,en,False,26809964.0,1641.0,1.0,Prime Minister of India,India,http://pbs.twimg.com/profile_background_images...,Twitter Web Client
1,8.263843e+17,907.0,2877.0,2017-01-31 10:59:12,Rashtrapati Ji's address to both Houses of Par...,en,False,26809964.0,1641.0,0.0,Prime Minister of India,India,http://pbs.twimg.com/profile_background_images...,Twitter Web Client
2,8.263827e+17,694.0,0.0,2017-01-31 10:52:33,RT @PMOIndia: Empowering the marginalised. htt...,en,False,26809964.0,1641.0,0.0,Prime Minister of India,India,http://pbs.twimg.com/profile_background_images...,Twitter Web Client
3,8.263826e+17,666.0,0.0,2017-01-31 10:52:22,RT @PMOIndia: Commitment to welfare of farmers...,en,False,26809964.0,1641.0,0.0,Prime Minister of India,India,http://pbs.twimg.com/profile_background_images...,Twitter Web Client
4,8.263826e+17,716.0,0.0,2017-01-31 10:52:16,RT @PMOIndia: Improving the quality of life fo...,en,False,26809964.0,1641.0,0.0,Prime Minister of India,India,http://pbs.twimg.com/profile_background_images...,Twitter Web Client


## Text Preprocessing

In [5]:
# Convert to lower case and keep only letters, whitespace and full stops.
# regex=True is needed in pandas 2.0+, where str.replace matches literally by default.
docs = tweets.text.str.lower().str.replace(r'[^a-z\s.]', '', regex=True)
docs[:5]

0    the presidents address wonderfully encapsulate...
1    rashtrapati jis address to both houses of parl...
2    rt pmoindia empowering the marginalised. https...
3    rt pmoindia commitment to welfare of farmers. ...
4    rt pmoindia improving the quality of life for ...
Name: text, dtype: object
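
To see what this cleaning pattern does to a single string, here is a minimal sketch that applies the same character class with the standard `re` module. The `clean` helper and the sample text are made up for illustration and are not part of the notebook.

```python
import re

def clean(text):
    """Lowercase, then drop everything except a-z, whitespace and full stops."""
    return re.sub(r"[^a-z\s.]", "", text.lower())

# Made-up sample resembling a tweet (not taken from the dataset).
sample = "The President's address! https://t.co/ab12"
cleaned = clean(sample)
print(cleaned)  # the presidents address httpst.coab
```

Note how the apostrophe disappears (`president's` becomes `presidents`) and the link collapses into `httpst.coab`. This is likely why `httpst` ranks so high in the vocabulary built in Method 2. Removing links first, for example with a pattern such as `https?://\S+`, would be one way to avoid these fragments.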

## Tokenization

### Method 1 

In [6]:
# Split each tweet into words
docs_tokens=docs.str.split(' ')
docs_tokens[:5]

0    [the, presidents, address, wonderfully, encaps...
1    [rashtrapati, jis, address, to, both, houses, ...
2    [rt, pmoindia, empowering, the, marginalised.,...
3    [rt, pmoindia, commitment, to, welfare, of, fa...
4    [rt, pmoindia, improving, the, quality, of, li...
Name: text, dtype: object
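
Splitting on a single space has two side effects worth knowing about: repeated spaces produce empty tokens, and punctuation stays attached to the preceding word (as in `marginalised.` above). The sketch below compares `split(' ')` with `split()` on a made-up string.

```python
# Made-up string with a double space and a trailing full stop.
text = "rt pmoindia  empowering the marginalised."

by_space = text.split(' ')   # splits on every single space
by_whitespace = text.split() # splits on any run of whitespace

print(by_space)       # ['rt', 'pmoindia', '', 'empowering', 'the', 'marginalised.']
print(by_whitespace)  # ['rt', 'pmoindia', 'empowering', 'the', 'marginalised.']
```

Switching the notebook to `docs.str.split()` would remove the empty tokens, at the cost of changing the token count reported below.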

In [7]:
# Collect the tokens of every tweet into one flat list
tokens_all=[]

for x in docs_tokens:
    tokens_all.extend(x)
print('No. of tokens in entire corpus:',len(tokens_all))

No. of tokens in entire corpus: 56862
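
The same flattening can be written without an explicit loop, and a `Counter` gives the number of distinct tokens and their frequencies. This is a minimal sketch on a made-up stand-in for `docs_tokens`; `sample_tokens` is not from the dataset.

```python
from collections import Counter
from itertools import chain

# Stand-in for docs_tokens: a small list of token lists (made up for illustration).
sample_tokens = [["rt", "pmoindia", "empowering", "the", "marginalised."],
                 ["rt", "pmoindia", "commitment", "to", "welfare"]]

# Flatten the nested lists into one list of tokens.
flat = list(chain.from_iterable(sample_tokens))

# Count how often each token occurs.
counts = Counter(flat)

print(len(flat))     # 10 tokens in total
print(len(counts))   # 8 distinct tokens
print(counts["rt"])  # 2
```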


### Method 2

In [10]:
from tensorflow.keras.preprocessing.text import Tokenizer

In [15]:
tokenizer=Tokenizer()

tokenizer.fit_on_texts(docs)
tokenizer.word_index

{'the': 1,
 'httpst': 2,
 'to': 3,
 'of': 4,
 'amp': 5,
 'in': 6,
 'a': 7,
 'for': 8,
 'on': 9,
 'rt': 10,
 'with': 11,
 'is': 12,
 'our': 13,
 'and': 14,
 'will': 15,
 'i': 16,
 'india': 17,
 'my': 18,
 'at': 19,
 'this': 20,
 'his': 21,
 'you': 22,
 'are': 23,
 'all': 24,
 'we': 25,
 'from': 26,
 'people': 27,
 'that': 28,
 'by': 29,
 'pm': 30,
 '\r': 31,
 'was': 32,
 'be': 33,
 'who': 34,
 'pmoindia': 35,
 'very': 36,
 'their': 37,
 'have': 38,
 'it': 39,
 'us': 40,
 'today': 41,
 'has': 42,
 'about': 43,
 'narendramodi': 44,
 'ties': 45,
 'an': 46,
 'thank': 47,
 'ji': 48,
 'wishes': 49,
 'president': 50,
 'sandeshsoldiers': 51,
 'day': 52,
 'life': 53,
 'good': 54,
 'your': 55,
 'greetings': 56,
 'app': 57,
 'development': 58,
 'great': 59,
 'also': 60,
 'as': 61,
 'mannkibaat': 62,
 'best': 63,
 'had': 64,
 'which': 65,
 'birthday': 66,
 'can': 67,
 'tirangayatra': 68,
 'how': 69,
 'visit': 70,
 'spoke': 71,
 'indias': 72,
 'meeting': 73,
 'via': 74,
 'nation': 75,
 'new': 76,
 ...
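
To make the numbering in `word_index` less opaque, here is a pure-Python approximation of what `fit_on_texts` does with default settings: lowercase the text, replace the characters in `filters` with spaces, split on spaces, and rank words by frequency starting at 1. `build_word_index` is a hypothetical helper written for this illustration, not part of Keras, and the mini corpus is made up.

```python
from collections import Counter

# Characters the Keras Tokenizer strips by default (its `filters` argument).
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def build_word_index(texts):
    """Approximate Tokenizer.fit_on_texts: rank words by frequency, starting at 1."""
    to_space = str.maketrans({ch: " " for ch in FILTERS})
    counts = Counter()
    for text in texts:
        # Lowercase, replace filtered characters with spaces, split on single spaces.
        words = [w for w in text.lower().translate(to_space).split(" ") if w]
        counts.update(words)
    # most_common() sorts by count; ties keep first-seen order.
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

# Made-up mini corpus (not from the dataset).
word_index = build_word_index(["the cat sat.", "the dog sat on the mat"])
print(word_index)
# {'the': 1, 'sat': 2, 'cat': 3, 'dog': 4, 'on': 5, 'mat': 6}
```

The default filter list contains `\t` and `\n` but not `\r`, which would explain why a carriage return shows up as a token in the output above.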

In [31]:
# Vocabulary size: number of distinct tokens (indices start at 1; 0 is reserved for padding)
vocab_size=len(tokenizer.word_index)
vocab_size

8868

In [36]:
# Converting to sequences
doc_sequences = tokenizer.texts_to_sequences(docs)
doc_sequences[:2]

[[1, 2408, 235, 1836, 3454, 72, 1480, 306, 341, 5, 1, 100, 112, 106],
 [532,
  403,
  235,
  3,
  533,
  3455,
  4,
  1243,
  32,
  2409,
  5,
  173,
  120,
  385,
  2,
  3456,
  280]]
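
Conceptually, `texts_to_sequences` is a dictionary lookup per word. The sketch below mimics that on a toy index; `to_sequence` and the toy `word_index` are assumptions made for illustration and simplify the real method (no punctuation filtering, no `num_words` limit).

```python
# Toy word index in the same shape as tokenizer.word_index (made up for illustration).
word_index = {"the": 1, "to": 2, "of": 3, "india": 4}

def to_sequence(text, index):
    """Map each known word to its id; unknown words are skipped."""
    return [index[w] for w in text.lower().split() if w in index]

seq = to_sequence("the people of india", word_index)
print(seq)  # [1, 3, 4]
```

The word `people` is missing from the toy index, so it is silently dropped. A `Tokenizer` created without an `oov_token` behaves the same way for words it did not see during fitting; passing `oov_token` maps such words to a dedicated index instead.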

In [40]:
# Pad each sequence with trailing zeros to a fixed length of 20
from tensorflow.keras.preprocessing.sequence import pad_sequences

doc_padded=pad_sequences(doc_sequences,padding='post',maxlen=20)
doc_padded[:2]

array([[   1, 2408,  235, 1836, 3454,   72, 1480,  306,  341,    5,    1,
         100,  112,  106,    0,    0,    0,    0,    0,    0],
       [ 532,  403,  235,    3,  533, 3455,    4, 1243,   32, 2409,    5,
         173,  120,  385,    2, 3456,  280,    0,    0,    0]])
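
`maxlen=20` also truncates longer sequences, and by default `pad_sequences` truncates from the front (`truncating='pre'`), so the last 20 tokens are kept. The following pure-Python sketch mimics that behaviour for one sequence; `pad_post` is a hypothetical helper, not a Keras function.

```python
def pad_post(seq, maxlen, value=0):
    """Mimic pad_sequences(padding='post', truncating='pre') for one sequence.

    Assumes maxlen >= 1.
    """
    # truncating='pre' (the default) keeps the LAST maxlen items.
    seq = seq[-maxlen:]
    # padding='post' appends the fill value on the right.
    return seq + [value] * (maxlen - len(seq))

short = pad_post([1, 2, 3], 5)
long = pad_post([1, 2, 3, 4, 5, 6], 4)
print(short)  # [1, 2, 3, 0, 0]
print(long)   # [3, 4, 5, 6]
```

If the beginning of a tweet matters more than its end, passing `truncating='post'` to `pad_sequences` keeps the first `maxlen` tokens instead.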