### Problem:
Suppose you have a text document that contains information about some topic, but you do not know what that topic is. Your challenge is to find out the topic without reading the text.

### Importing Data

In [1]:
# Reading the file using open(); the with block closes it automatically
with open('moon.txt',mode='r',encoding='utf-8') as file:
    # Getting text data as string
    text=file.read()

In [2]:
print(text)

The moon is the satellite of the earth. It moves round the earth. It shines at night by light reflected from the Sun. It looks beautiful. The bright Moonlight is very soothing. The earthly objects shine like silver in the moonlight. We are fascinated by the enchanting beauty of the Moon. The moon is not as beautiful as it looks. It seems to be lovely when it shines in the sky at night. As a matter of fact it is devoid of plants and animals. The moon is not a suitable place for plants and animals. Therefore, no form of life can be found on the moon. Unlike the earth, the moon has got no atmosphere. Therefore, the lunar days are very hot and the lunar nights are intensely cold. The moon looks beautiful from the earth but in fact it has up forbidding appearance. It is full of rocks and craters. When we look at the moon at night we see some dark spots on it. These dark spots are dangerous rocks and craters. The gravitational pull of the moon is less than that of the earth, so it is difficu

### Cleaning Text

In [3]:
import re

# Function for cleaning text
def clean_text(text):
    
    # Lowercasing the text
    text=text.lower()
    
    # Removing comma(,), period(.) and newline character(\n)
    text=re.sub('[,.\n]','',text)
    
    # Replacing hyphen with blank space
    text=re.sub('-',' ',text)
    
    return text
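
The function above strips only commas, periods, and newlines, so other punctuation such as parentheses or exclamation marks stays attached to the words. Below is a minimal sketch of a broader variant, assuming it is acceptable to drop every ASCII punctuation character and to replace newlines with a space instead of deleting them. The name `clean_text_strict` is hypothetical and not part of the original notebook.

```python
import re
import string

def clean_text_strict(text):
    """Lowercase text and strip ASCII punctuation (hypothetical helper)."""
    text = text.lower()
    # Replace hyphens and newlines with a space so joined words stay separate
    text = re.sub(r'[-\n]', ' ', text)
    # Delete every remaining ASCII punctuation character
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Collapse runs of whitespace left behind by the removals
    return re.sub(r'\s+', ' ', text).strip()

sample = "Well-known fact: the Moon (our satellite) shines!\nIt looks beautiful."
print(clean_text_strict(sample))
```

Whether this stricter cleaning is preferable depends on the text; numbers such as `41,285` would lose their comma, for example.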

In [4]:
# Cleaning Text
cleaned_text=clean_text(text)

In [5]:
print(cleaned_text)

the moon is the satellite of the earth it moves round the earth it shines at night by light reflected from the sun it looks beautiful the bright moonlight is very soothing the earthly objects shine like silver in the moonlight we are fascinated by the enchanting beauty of the moon the moon is not as beautiful as it looks it seems to be lovely when it shines in the sky at night as a matter of fact it is devoid of plants and animals the moon is not a suitable place for plants and animals therefore no form of life can be found on the moon unlike the earth the moon has got no atmosphere therefore the lunar days are very hot and the lunar nights are intensely cold the moon looks beautiful from the earth but in fact it has up forbidding appearance it is full of rocks and craters when we look at the moon at night we see some dark spots on it these dark spots are dangerous rocks and craters the gravitational pull of the moon is less than that of the earth so it is difficult to walk on the surf

### Finding most frequent words

In [6]:
import spacy

In [7]:
# Loading spacy model
nlp=spacy.load('en_core_web_sm')

In [8]:
# creating doc object
doc=nlp(cleaned_text)

In [9]:
words_dict={}

# Add word-count pair to the dictionary
for token in doc:
    # Check if the word is already in dictionary 
    if token.text in words_dict:
        # Increment count of word by 1 
        words_dict[token.text]=words_dict[token.text]+1
    else:
        # Add the word to dictionary with count 1 
        words_dict[token.text]=1
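
The same tally can be written more compactly with `collections.Counter` from the standard library. This is a minimal sketch that uses a short hard-coded sentence and plain whitespace splitting in place of the spaCy tokens, an assumption made so the snippet runs on its own.

```python
from collections import Counter

sample_text = "the moon is the satellite of the earth it moves round the earth"
tokens = sample_text.split()

# Counter builds the word -> count mapping in one step
word_counts = Counter(tokens)

# most_common(n) returns the n highest counts, largest first
top_two = word_counts.most_common(2)
print(top_two)
```

With the notebook's `doc` object, `Counter(token.text for token in doc)` should give the same counts as `words_dict`.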

In [10]:
import pandas as pd

In [11]:
# Creating a dataframe from dictionary
df = pd.DataFrame({'word':list(words_dict.keys()), 'count':list(words_dict.values())})

In [12]:
# Sorting dataframe in descending order
df.sort_values(by='count',ascending=False,inplace=True,ignore_index=True)

In [13]:
print('Shape=>',df.shape)
df.head(5)

Shape=> (151, 2)


|   | word | count |
|---|------|-------|
| 0 | the  | 47 |
| 1 | moon | 21 |
| 2 | it   | 15 |
| 3 | of   | 13 |
| 4 | to   | 11 |


# What are Stopwords?
Stopwords are the most common words in a language; they make text easier for humans to read but say little about its topic. English examples include `a`, `an`, `the`, `for`, `where`, `when`, and `at`. They are often removed during the text pre-processing phase because they add little to the meaning of the document.

Consider a sample sentence:
##### String: "There is a pen on the table."
##### Stopwords: \["There", "is", "a", "on", "the" \]
##### Meaningful words: \["pen", "table"\]
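
The sample sentence can be filtered with a simple membership test. This is a minimal sketch, assuming a tiny hand-written stopword set; `sample_stopwords` is hypothetical and far smaller than spaCy's own list.

```python
sentence = "There is a pen on the table."

# Hypothetical, hand-picked stopword set for this one sentence
sample_stopwords = {"there", "is", "a", "on", "the"}

# Strip the trailing period, then split on whitespace
sample_words = sentence.rstrip(".").split()

# Compare in lowercase so "There" matches "there"
meaningful = [w for w in sample_words if w.lower() not in sample_stopwords]
print(meaningful)  # ['pen', 'table']
```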

In [14]:
print(nlp.Defaults.stop_words)

{'latter', 'were', 'as', 'nevertheless', 'before', 'done', 'to', 'from', 'for', 'next', 'until', 'further', 'another', 'part', 'quite', 'also', 'other', 'ten', 'whole', 'nor', 'along', 'somehow', 'throughout', 'using', 'that', 'your', "'s", 'are', 'of', 'six', 'on', 'their', 'so', 'seeming', 'name', 'twelve', 'if', 'often', 'some', 'could', 'much', 'nobody', 'within', "'m", 'used', 'always', 'everyone', 'at', 'even', 'they', 'something', 'he', 'with', 'herein', 'fifty', '’m', 'front', 'onto', 'keep', 'go', 'please', 'third', 'his', 'those', 'beyond', "n't", 'side', 'thereupon', 'still', 'yet', 'via', 'really', 'namely', 'else', 'ourselves', 'we', 'almost', 'this', 'anyone', 'again', 'top', 'is', 'neither', "'d", 'whereas', 'together', 'around', 'not', 'behind', '‘d', 'after', 'once', 'them', 'rather', "'ve", 'latterly', 'seemed', 'anywhere', 'whom', 're', 'few', '’s', 'below', 'while', 'in', 'none', 'eleven', 'therein', 'well', 'per', 'becoming', 'three', 'yourselves', 'empty', 'the', 

In [15]:
len(nlp.Defaults.stop_words)

326

In [16]:
# Getting words that are not stopwords
new_tokens=[token.text for token in doc if not token.is_stop]

In [17]:
print(new_tokens)

['moon', 'satellite', 'earth', 'moves', 'round', 'earth', 'shines', 'night', 'light', 'reflected', 'sun', 'looks', 'beautiful', 'bright', 'moonlight', 'soothing', 'earthly', 'objects', 'shine', 'like', 'silver', 'moonlight', 'fascinated', 'enchanting', 'beauty', 'moon', 'moon', 'beautiful', 'looks', 'lovely', 'shines', 'sky', 'night', 'matter', 'fact', 'devoid', 'plants', 'animals', 'moon', 'suitable', 'place', 'plants', 'animals', 'form', 'life', 'found', 'moon', 'unlike', 'earth', 'moon', 'got', 'atmosphere', 'lunar', 'days', 'hot', 'lunar', 'nights', 'intensely', 'cold', 'moon', 'looks', 'beautiful', 'earth', 'fact', 'forbidding', 'appearance', 'rocks', 'craters', 'look', 'moon', 'night', 'dark', 'spots', 'dark', 'spots', 'dangerous', 'rocks', 'craters', 'gravitational', 'pull', 'moon', 'earth', 'difficult', 'walk', 'surface', 'moon', 'moon', 'fascinated', 'man', 'beginning', 'life', 'earth', 'looked', 'wonder', 'poets', 'composed', 'beautiful', 'poems', 'moon', 'scientists', 'tried

In [18]:
new_words_dict={}

# Add word-count pair to the dictionary
for token in new_tokens:
    # Check if the word is already in dictionary 
    if token in new_words_dict:
        # Increment count of word by 1
        new_words_dict[token] = new_words_dict[token]+1
    else:
        # Add the word to dictionary with count 1 
        new_words_dict[token]=1

In [19]:
# Creating a dataframe from dictionary
new_df = pd.DataFrame({'word':list(new_words_dict.keys()), 'count':list(new_words_dict.values())})

In [20]:
# Sorting dataframe in descending order
new_df.sort_values(by='count',ascending=False,inplace=True,ignore_index=True)

In [21]:
print('Shape=>',new_df.shape)
new_df.head(5)

Shape=> (97, 2)


|   | word      | count |
|---|-----------|-------|
| 0 | moon      | 21 |
| 1 | earth     | 9 |
| 2 | life      | 4 |
| 3 | beautiful | 4 |
| 4 | looks     | 3 |


Stopwords take up storage and processing time without contributing much to the analysis. Removing them shrinks the corpus, which makes analysis and model building faster.
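
For the moon text, the two dataframe shapes printed earlier give a rough measure of this reduction: 151 distinct tokens before filtering and 97 after. A small sketch of the arithmetic follows; the variable names are illustrative only.

```python
distinct_before = 151  # rows in df (all tokens)
distinct_after = 97    # rows in new_df (stopwords removed)

removed = distinct_before - distinct_after
reduction_pct = removed / distinct_before * 100
print(f"{removed} distinct tokens removed ({reduction_pct:.1f}% of the vocabulary)")
```

This counts distinct tokens only; the drop in total token count is larger, since stopwords such as `the` occur many times.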

## Remove Stopwords:
- Text Classification
- Caption Generation
- Auto-Tag Generation

## Don't Remove Stopwords:
- Machine Translation
- Language Modeling
- Text Summarization
- Question-Answering Problems

# Assignment

In [27]:
with open('switzerland.txt',mode='r',encoding='utf-8') as f:
    text=f.read()

In [28]:
text

"Switzerland, officially the Swiss Confederation, is a country situated in the confluence of Western, Central, and Southern Europe. It is a federal republic composed of 26 cantons, with federal authorities based in Bern. Switzerland is a landlocked country bordered by Italy to the south, France to the west, Germany to the north, and Austria and Liechtenstein to the east. It is geographically divided among the Swiss Plateau, the Alps, and the Jura, spanning a total area of 41,285 km2 (15,940 sq mi), and land area of 39,997 km2 (15,443 sq mi). While the Alps occupy the greater part of the territory, the Swiss population of approximately 8.5 million is concentrated mostly on the plateau, where the largest cities and economic centres are located, among them Zürich, Geneva and Basel, where multiple international organisations are domiciled (such as FIFA, the UN's second-largest Office, and the Bank for International Settlements) and where the main international airports of Switzerland are.\

In [29]:
import pandas as pd
import spacy
nlp=spacy.load('en_core_web_sm')

In [30]:
# Run the pipeline once and reuse the resulting doc
doc=nlp(text)
words=[token.text for token in doc]
stop_words=[token.text for token in doc if token.is_stop]
not_sw=[token.text for token in doc if not token.is_stop]

In [32]:
# Task: percentage of tokens that are stopwords
print((len(stop_words)/len(words))*100)

36.64921465968586
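
Because `words` holds every token spaCy produces, punctuation marks such as `,` and `.` are counted in the denominator. Below is a minimal sketch of a variant that skips punctuation-only tokens. It is written in plain Python with a hypothetical helper `stopword_percentage` and a small hand-written stopword set, so that it runs without spaCy.

```python
def stopword_percentage(tokens, stop_words):
    """Share of stopwords among tokens that contain a letter or digit."""
    # Ignore tokens made only of punctuation, e.g. ',' or '('
    word_tokens = [t for t in tokens if any(ch.isalnum() for ch in t)]
    if not word_tokens:
        return 0.0
    stop_count = sum(1 for t in word_tokens if t.lower() in stop_words)
    return stop_count / len(word_tokens) * 100

# Hypothetical example input; in the notebook these would come from spaCy
sample_tokens = ["Switzerland", "is", "a", "landlocked", "country", ",",
                 "bordered", "by", "Italy", "."]
sample_stop_words = {"is", "a", "by"}
print(stopword_percentage(sample_tokens, sample_stop_words))  # 37.5
```

With spaCy tokens, the same filter can be expressed with the `token.is_punct` attribute instead of the character check.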


In [33]:
# Tokens left after removing stopwords (punctuation is still included)
not_sw

['Switzerland',
 ',',
 'officially',
 'Swiss',
 'Confederation',
 ',',
 'country',
 'situated',
 'confluence',
 'Western',
 ',',
 'Central',
 ',',
 'Southern',
 'Europe',
 '.',
 'federal',
 'republic',
 'composed',
 '26',
 'cantons',
 ',',
 'federal',
 'authorities',
 'based',
 'Bern',
 '.',
 'Switzerland',
 'landlocked',
 'country',
 'bordered',
 'Italy',
 'south',
 ',',
 'France',
 'west',
 ',',
 'Germany',
 'north',
 ',',
 'Austria',
 'Liechtenstein',
 'east',
 '.',
 'geographically',
 'divided',
 'Swiss',
 'Plateau',
 ',',
 'Alps',
 ',',
 'Jura',
 ',',
 'spanning',
 'total',
 'area',
 '41,285',
 'km2',
 '(',
 '15,940',
 'sq',
 'mi',
 ')',
 ',',
 'land',
 'area',
 '39,997',
 'km2',
 '(',
 '15,443',
 'sq',
 'mi',
 ')',
 '.',
 'Alps',
 'occupy',
 'greater',
 'territory',
 ',',
 'Swiss',
 'population',
 'approximately',
 '8.5',
 'million',
 'concentrated',
 'plateau',
 ',',
 'largest',
 'cities',
 'economic',
 'centres',
 'located',
 ',',
 'Zürich',
 ',',
 'Geneva',
 'Basel',
 ',',
 'm