# Why Remove Stop Words

Stop words (such as "the", "to" and "of") occur in abundance in every human language. Removing them strips low-information words from the text so that the analysis can focus on the words that carry meaning.

## Importing libraries

In [1]:
import pandas as pd
import numpy as np 

import nltk
# nltk.download('stopwords')  # uncomment and run once to fetch the stop word lists

## Reading data

In [2]:
tweets = pd.read_csv('C:\\Users\\nehal\\Music\\12.NLP\\Practise\\Datasets\\narendramodi_tweets.csv')
print(tweets.shape)
tweets.head()

(3220, 14)


Unnamed: 0,id,retweets_count,favorite_count,created_at,text,lang,retweeted,followers_count,friends_count,hashtags_count,description,location,background_image_url,source
0,8.263846e+17,1406.0,4903.0,2017-01-31 11:00:07,The President's address wonderfully encapsulat...,en,False,26809964.0,1641.0,1.0,Prime Minister of India,India,http://pbs.twimg.com/profile_background_images...,Twitter Web Client
1,8.263843e+17,907.0,2877.0,2017-01-31 10:59:12,Rashtrapati Ji's address to both Houses of Par...,en,False,26809964.0,1641.0,0.0,Prime Minister of India,India,http://pbs.twimg.com/profile_background_images...,Twitter Web Client
2,8.263827e+17,694.0,0.0,2017-01-31 10:52:33,RT @PMOIndia: Empowering the marginalised. htt...,en,False,26809964.0,1641.0,0.0,Prime Minister of India,India,http://pbs.twimg.com/profile_background_images...,Twitter Web Client
3,8.263826e+17,666.0,0.0,2017-01-31 10:52:22,RT @PMOIndia: Commitment to welfare of farmers...,en,False,26809964.0,1641.0,0.0,Prime Minister of India,India,http://pbs.twimg.com/profile_background_images...,Twitter Web Client
4,8.263826e+17,716.0,0.0,2017-01-31 10:52:16,RT @PMOIndia: Improving the quality of life fo...,en,False,26809964.0,1641.0,0.0,Prime Minister of India,India,http://pbs.twimg.com/profile_background_images...,Twitter Web Client


## Text Preprocessing

In [3]:
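# Convert to lower case, then keep only the letters a-z, whitespace and full stops.
# regex=True is passed explicitly because newer pandas versions treat the pattern as a literal string by default.
docs = tweets.text.str.lower().str.replace(r'[^a-z\s.]', '', regex=True)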
docs[:5]

0    the presidents address wonderfully encapsulate...
1    rashtrapati jis address to both houses of parl...
2    rt pmoindia empowering the marginalised. https...
3    rt pmoindia commitment to welfare of farmers. ...
4    rt pmoindia improving the quality of life for ...
Name: text, dtype: object
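
To see what the character filter does to a single string, the same pattern can be tried with Python's `re` module. This is only an illustrative sketch: `sample` is a made-up sentence, not a tweet from the dataset.

```python
import re

# Made-up text for illustration; not taken from the dataset.
sample = "RT @PMOIndia: India's growth & 100 new ideas. https://t.co/abc"

# Lowercase first, then delete every character that is not a-z, whitespace or a full stop.
cleaned = re.sub(r"[^a-z\s.]", "", sample.lower())
print(cleaned)
```

Punctuation and digits vanish, the URL collapses into a single token, and runs of consecutive spaces are left behind where characters were deleted. Those runs are why a later split on a single space produces empty-string tokens.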

## Tokenization

In [4]:
# Splitting each tweet into words
docs_tokens=docs.str.split(' ')
docs_tokens[:5]

0    [the, presidents, address, wonderfully, encaps...
1    [rashtrapati, jis, address, to, both, houses, ...
2    [rt, pmoindia, empowering, the, marginalised.,...
3    [rt, pmoindia, commitment, to, welfare, of, fa...
4    [rt, pmoindia, improving, the, quality, of, li...
Name: text, dtype: object

In [5]:
#Putting all tokens into a list 
tokens_all=[]

for x in docs_tokens:
    tokens_all.extend(x)
print('No. of tokens in entire corpus:',len(tokens_all))

No. of tokens in entire corpus: 56862


### Bag of Words Analysis

In [6]:
bow=pd.Series(tokens_all).value_counts()
bow

               4690
the            2184
to             1516
of             1508
amp            1480
               ... 
collapse          1
hangzhou          1
linked            1
incentivise       1
station           1
Length: 10026, dtype: int64

### Removing Stop Words

#### Method 1 : Using NLTK

In [7]:
common_stopwords=nltk.corpus.stopwords.words('english')

In [8]:
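# Turn the frequency Series into a two-column table; naming the columns explicitly
# avoids relying on the default column labels, which differ between pandas versions.
df_tokens = bow.rename_axis('token').reset_index(name='frequency')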
df_tokens.head()

Unnamed: 0,token,frequency
0,,4690
1,the,2184
2,to,1516
3,of,1508
4,amp,1480


#### Removing common stopwords with NLTK

In [35]:
df_tokens[~df_tokens['token'].isin(common_stopwords)]

Unnamed: 0,token,frequency
0,,4690
4,amp,1480
9,rt,573
18,india,298
27,people,183
...,...,...
10021,collapse,1
10022,hangzhou,1
10023,linked,1
10024,incentivise,1


#### Removing custom stopwords

In [36]:
common_stopwords=nltk.corpus.stopwords.words('english')
custom_stopwords=['','amp','rt']
all_stopwords=common_stopwords+custom_stopwords
print(len(common_stopwords),len(custom_stopwords),len(all_stopwords))

179 3 182


In [37]:
df_tokens[~df_tokens['token'].isin(all_stopwords)]

Unnamed: 0,token,frequency
18,india,298
27,people,183
29,pm,167
36,pmoindia,143
40,us,124
...,...,...
10021,collapse,1
10022,hangzhou,1
10023,linked,1
10024,incentivise,1
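
The filter above operates on the frequency table. The same idea can be applied per tweet, so that each document keeps only its non-stopword tokens. The following is a minimal sketch in plain Python; `sample_docs` and the small `stop_words` set are made-up stand-ins for `docs_tokens` and `all_stopwords`, and `remove_stop_words` is a hypothetical helper, not part of NLTK.

```python
# Hypothetical stand-ins for docs_tokens and all_stopwords (illustrative only).
sample_docs = [
    ["rt", "pmoindia", "empowering", "the", "marginalised."],
    ["commitment", "to", "welfare", "of", "farmers.", ""],
]
stop_words = {"the", "to", "of", "", "amp", "rt"}  # a set gives fast membership tests


def remove_stop_words(tokens, stop_words):
    """Return the tokens that are not in stop_words, preserving their order."""
    return [tok for tok in tokens if tok not in stop_words]


filtered_docs = [remove_stop_words(doc, stop_words) for doc in sample_docs]
print(filtered_docs)
```

With the notebook's own objects, something like `docs_tokens.apply(lambda toks: remove_stop_words(toks, set(all_stopwords)))` should give the per-tweet equivalent.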


#### Method 2: Using Gensim

In [38]:
from gensim.parsing.preprocessing import remove_stopwords

In [39]:
docs.apply(remove_stopwords)

0       presidents address wonderfully encapsulated in...
1       rashtrapati jis address houses parliament inde...
2       rt pmoindia empowering marginalised. httpst.co...
3       rt pmoindia commitment welfare farmers. httpst...
4       rt pmoindia improving quality life poor. https...
                              ...                        
3215    passage real estate great news aspiring house ...
3216    rt dpradhanbjp highlights pradhan mantri ujjwa...
3217    successful launch irnssf accomplishment immens...
3218    cisfs raising day salute cisf personnel valour...
3219                               ... httpst.cooyscfulth
Name: text, Length: 3220, dtype: object

## TF-IDF

A problem with scoring by raw word frequency is that highly frequent words dominate each document (they receive the largest scores), even though they may carry less "informational content" for a model than rarer, domain-specific words.

One approach is to rescale the frequency of each word by how often it appears across all documents, so that words like “the”, which are frequent in every document, are penalized.

This approach to scoring is called Term Frequency – Inverse Document Frequency, or TF-IDF for short, where:

- Term Frequency: a score for how often the word occurs in the current document.
- Inverse Document Frequency: a score for how rare the word is across all documents.

The scores act as weights: not all words are equally important or interesting.

They have the effect of highlighting words that are distinctive (that is, words that carry useful information) in a given document.
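
The two components can be combined in a few lines of plain Python. This is a minimal sketch under stated assumptions: it uses the basic form tf × log(N / df), where tf is a token's count divided by the document length, N is the number of documents and df is the number of documents containing the token. The `corpus` below is a made-up toy example and `tf_idf` is a hypothetical helper. Library implementations such as scikit-learn's apply smoothing and normalization by default, so their numbers will differ.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each document is a list of tokens.
corpus = [
    ["the", "farmers", "welfare"],
    ["the", "youth", "welfare"],
    ["the", "budget"],
]


def tf_idf(corpus):
    """Return one {token: score} dict per document, using tf * log(N / df)."""
    n_docs = len(corpus)
    # Document frequency: number of documents that contain each token.
    doc_freq = Counter(tok for doc in corpus for tok in set(doc))
    scores = []
    for doc in corpus:
        counts = Counter(doc)
        scores.append({
            tok: (count / len(doc)) * math.log(n_docs / doc_freq[tok])
            for tok, count in counts.items()
        })
    return scores


scores = tf_idf(corpus)
for doc_scores in scores:
    print({tok: round(score, 3) for tok, score in doc_scores.items()})
```

Because "the" appears in every document, its inverse document frequency is log(1) = 0 and its score is zero everywhere, while a token that appears in only one document, such as "farmers", scores higher than one shared by two documents, such as "welfare".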