### Problem:
Suppose you have a text document about some topic, but you don't know what that topic is. The challenge is to identify the topic without reading the text.

### Importing Data

In [3]:
# Reading the file with a context manager, which closes it automatically
with open('moon.txt',mode='r',encoding='utf-8') as file:
    # Getting text data as string
    text=file.read()

In [4]:
print(text)

The moon is the satellite of the earth. It moves round the earth. It shines at night by light reflected from the Sun. It looks beautiful. The bright Moonlight is very soothing. The earthly objects shine like silver in the moonlight. We are fascinated by the enchanting beauty of the Moon. The moon is not as beautiful as it looks. It seems to be lovely when it shines in the sky at night. As a matter of fact it is devoid of plants and animals. The moon is not a suitable place for plants and animals. Therefore, no form of life can be found on the moon. Unlike the earth, the moon has got no atmosphere. Therefore, the lunar days are very hot and the lunar nights are intensely cold. The moon looks beautiful from the earth but in fact it has up forbidding appearance. It is full of rocks and craters. When we look at the moon at night we see some dark spots on it. These dark spots are dangerous rocks and craters. The gravitational pull of the moon is less than that of the earth, so it is difficu

### Cleaning Text

In [5]:
import re

# Function for cleaning text
def clean_text(text):
    
    # Lowercasing the text
    text=text.lower()
    
    # Removing comma(,), period(.) and newline character(\n)
    text=re.sub(r'[,.\n]','',text)
    
    # Replacing hyphen with blank space
    text=re.sub(r'-',' ',text)
    
    return text

In [7]:
# Cleaning Text
cleaned_text=clean_text(text)

In [8]:
print(cleaned_text)

the moon is the satellite of the earth it moves round the earth it shines at night by light reflected from the sun it looks beautiful the bright moonlight is very soothing the earthly objects shine like silver in the moonlight we are fascinated by the enchanting beauty of the moon the moon is not as beautiful as it looks it seems to be lovely when it shines in the sky at night as a matter of fact it is devoid of plants and animals the moon is not a suitable place for plants and animals therefore no form of life can be found on the moon unlike the earth the moon has got no atmosphere therefore the lunar days are very hot and the lunar nights are intensely cold the moon looks beautiful from the earth but in fact it has up forbidding appearance it is full of rocks and craters when we look at the moon at night we see some dark spots on it these dark spots are dangerous rocks and craters the gravitational pull of the moon is less than that of the earth so it is difficult to walk on the surf
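`clean_text` above handles only commas, periods, newlines and hyphens, and deleting a newline outright can join the last word of one line to the first word of the next. The following is a minimal sketch of a more general cleaner, not the function used in this notebook; the name `clean_text_general` and the sample string are made up for illustration.

```python
import re

def clean_text_general(text):
    """Hypothetical variant of clean_text that keeps words separated."""
    # Lowercasing the text
    text = text.lower()

    # Replacing every character that is not a word character
    # or whitespace (i.e. punctuation) with a blank space
    text = re.sub(r'[^\w\s]', ' ', text)

    # Collapsing runs of whitespace (including newlines) into one space
    text = re.sub(r'\s+', ' ', text)

    return text.strip()

sample = "Moon-light,\nis bright."
print(clean_text_general(sample))
```

Replacing punctuation with a space instead of an empty string is the design choice here: it costs an extra whitespace-collapsing step but never merges two neighbouring words.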

### Finding most frequent words

In [9]:
import spacy

In [10]:
# Loading spacy model
nlp=spacy.load('en_core_web_sm')

In [11]:
# creating doc object
doc=nlp(cleaned_text)

In [17]:
words_dict={}

# Add word-count pair to the dictionary
for token in doc:
    # Check if the word is already in dictionary 
    if token.text in words_dict:
        # Increment count of word by 1 
        words_dict[token.text]=words_dict[token.text]+1
    else:
        # Add the word to dictionary with count 1 
        words_dict[token.text]=1

In [18]:
import pandas as pd

In [14]:
# Creating a dataframe from dictionary
df = pd.DataFrame({'word':list(words_dict.keys()), 'count':list(words_dict.values())})

In [15]:
# Sorting dataframe in descending order
df.sort_values(by='count',ascending=False,inplace=True,ignore_index=True)

In [19]:
print('Shape=>',df.shape)
df.head(5)

Shape=> (151, 2)


| | word | count |
|---|---|---|
| 0 | the | 47 |
| 1 | moon | 21 |
| 2 | it | 15 |
| 3 | of | 13 |
| 4 | to | 11 |
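The counting loop and the sort can also be condensed with `collections.Counter` from the standard library. This is a minimal sketch under two assumptions: a plain whitespace split stands in for the spaCy tokens, and `sample_text` is a made-up string, not the contents of `moon.txt`.

```python
from collections import Counter

# Hypothetical sample; the notebook works on the cleaned moon.txt text
sample_text = "the moon is the satellite of the earth it moves round the earth"

# Whitespace split is an assumption standing in for spaCy tokenization
tokens = sample_text.split()

# Counter builds the word-count mapping in one step
word_counts = Counter(tokens)

# most_common(n) returns the n highest counts in descending order
top_words = word_counts.most_common(2)
print(top_words)
```

Even in this tiny sample the stopword `the` comes out on top, which is the same effect seen in the table above.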


### What are Stopwords?
Stopwords are the most common words in a language. They make sentences easier for humans to read but carry little topical meaning on their own. English examples include `a`, `an`, `the`, `for`, `where`, `when` and `at`. These words are often removed during the text pre-processing phase because they add little to the meaning of the document.

Consider a sample sentence:
##### String: "There is a pen on the table."
##### Stopwords: \["There", "is", "a", "on", "the" \]
##### Meaningful words: \["pen", "table"\]
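The sample sentence can be filtered in a few lines. This is a minimal sketch that uses a small hand-written stopword set (`sample_stopwords`, an assumption made for illustration) rather than spaCy's full list.

```python
import re

# Hand-written stopword set for this one sentence (illustrative only)
sample_stopwords = {"there", "is", "a", "on", "the"}

sentence = "There is a pen on the table."

# Lowercase the sentence and keep only runs of letters as words
words = re.findall(r"[a-z]+", sentence.lower())

# Keep the words that are not in the stopword set
meaningful_words = [word for word in words if word not in sample_stopwords]
print(meaningful_words)  # ['pen', 'table']
```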

In [20]:
print(nlp.Defaults.stop_words)

{'two', 'somewhere', 'all', '‘s', 'it', 'should', 'only', 'not', 'via', 'their', 'at', 'namely', 'using', 'ourselves', 'nevertheless', 'between', '‘m', 'he', 'go', 'forty', 'perhaps', 'then', 'whereupon', 'own', 'nowhere', 'could', 'sixty', 'seemed', 'still', 'fifteen', 'yourselves', 'neither', 'himself', '’ll', 'sometime', 'with', 'she', 'yourself', 'them', 'ever', 'latter', 'bottom', 'sometimes', 'few', 'might', 'moreover', 'mostly', '’d', 'keep', 'really', 'are', 'within', 'move', 'very', 'if', 'seems', 'former', 'hereupon', 'out', 'due', 'below', 'now', 'nine', 'thereby', 'each', 'hereby', 'my', 'first', 'these', 'unless', 'everyone', 'most', 'nobody', 'several', 'both', 'one', 'our', '‘ll', 'hence', 'much', 'herself', 'someone', 'have', '’s', 'nothing', 'becomes', 'themselves', 'around', 'done', 'above', 'becoming', 'anywhere', 'n‘t', 'whose', 'seem', 'whereby', 'against', 'just', 'about', 'they', "n't", 'whither', 'amongst', 'call', 'whereafter', 'hundred', 'must', 'back', 'doing

In [21]:
len(nlp.Defaults.stop_words)

326

In [22]:
# Getting words that are not stopwords
new_tokens=[token.text for token in doc if not token.is_stop]

In [23]:
print(new_tokens)

['moon', 'satellite', 'earth', 'moves', 'round', 'earth', 'shines', 'night', 'light', 'reflected', 'sun', 'looks', 'beautiful', 'bright', 'moonlight', 'soothing', 'earthly', 'objects', 'shine', 'like', 'silver', 'moonlight', 'fascinated', 'enchanting', 'beauty', 'moon', 'moon', 'beautiful', 'looks', 'lovely', 'shines', 'sky', 'night', 'matter', 'fact', 'devoid', 'plants', 'animals', 'moon', 'suitable', 'place', 'plants', 'animals', 'form', 'life', 'found', 'moon', 'unlike', 'earth', 'moon', 'got', 'atmosphere', 'lunar', 'days', 'hot', 'lunar', 'nights', 'intensely', 'cold', 'moon', 'looks', 'beautiful', 'earth', 'fact', 'forbidding', 'appearance', 'rocks', 'craters', 'look', 'moon', 'night', 'dark', 'spots', 'dark', 'spots', 'dangerous', 'rocks', 'craters', 'gravitational', 'pull', 'moon', 'earth', 'difficult', 'walk', 'surface', 'moon', 'moon', 'fascinated', 'man', 'beginning', 'life', 'earth', 'looked', 'wonder', 'poets', 'composed', 'beautiful', 'poems', 'moon', 'scientists', 'tried

In [24]:
new_words_dict={}

# Add word-count pair to the dictionary
for token in new_tokens:
    # Check if the word is already in dictionary 
    if token in new_words_dict:
        # Increment count of word by 1
        new_words_dict[token] = new_words_dict[token]+1
    else:
        # Add the word to dictionary with count 1 
        new_words_dict[token]=1

In [25]:
# Creating a dataframe from dictionary
new_df = pd.DataFrame({'word':list(new_words_dict.keys()), 'count':list(new_words_dict.values())})

In [26]:
# Sorting dataframe in descending order
new_df.sort_values(by='count',ascending=False,inplace=True,ignore_index=True)

In [27]:
print('Shape=>',new_df.shape)
new_df.head(5)

Shape=> (97, 2)


| | word | count |
|---|---|---|
| 0 | moon | 21 |
| 1 | earth | 9 |
| 2 | life | 4 |
| 3 | beautiful | 4 |
| 4 | looks | 3 |


| | word | count |
|---|---|---|
| 0 | moon | 21 |
| 1 | earth | 9 |
| 2 | life | 4 |
| 3 | beautiful | 4 |
| 4 | looks | 3 |
| ... | ... | ... |
| 92 | atmosphere | 1 |
| 93 | got | 1 |
| 94 | unlike | 1 |
| 95 | found | 1 |


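Stopwords consume storage and pre-processing time without adding much information. Removing them reduces the size of the corpus, which makes analysis and model building faster. Here, once the stopwords are gone, the most frequent word is `moon`, which answers the original question about the document's topic.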
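The size reduction can be quantified from the two `Shape=>` outputs printed earlier: 151 distinct tokens before stopword removal and 97 after. The helper below is a small sketch; the name `vocabulary_reduction` is introduced here for illustration only.

```python
def vocabulary_reduction(before, after):
    """Return the fraction of distinct tokens removed."""
    return (before - after) / before

# Row counts taken from the two dataframe shapes shown above
reduction = vocabulary_reduction(151, 97)
print(f"{reduction:.1%} of the distinct tokens were stopwords")
```

Note that this measures the vocabulary (distinct tokens), not the total number of tokens in the text.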

## Remove Stopwords:
- Text Classification
- Caption Generation
- Auto-Tag Generation

## Don't Remove Stopwords:
- Machine Translation
- Language Modeling
- Text Summarization
- Question-Answering Problems
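One likely reason for keeping stopwords in the tasks listed above is that some of them, such as `not`, change the meaning of a sentence. The sketch below illustrates this with a small hand-written stopword set (`sample_stopwords` and `remove_stopwords` are assumptions made for illustration; `not` does appear in the spaCy stopword set printed earlier).

```python
# Hand-written stopword set (illustrative only)
sample_stopwords = {"the", "is", "not"}

def remove_stopwords(sentence):
    """Split on whitespace and drop words found in sample_stopwords."""
    return [word for word in sentence.lower().split()
            if word not in sample_stopwords]

positive = remove_stopwords("the moon is beautiful")
negative = remove_stopwords("the moon is not beautiful")

# Both sentences collapse to the same words, so the negation is lost
print(positive)
print(negative)
print(positive == negative)
```

For topic detection this loss is harmless, but a translation or question-answering system would no longer be able to tell the two sentences apart.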