### 1. Text Scraping

In [2]:
import requests
from bs4 import BeautifulSoup

# URL of the Wikipedia page
url = "https://en.wikipedia.org/wiki/India"

# Send a request to the webpage and fail early on an HTTP error
response = requests.get(url, timeout=10)
response.raise_for_status()

# Parse the content using BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')

# Extract every <p> element on the page (the article body plus any other paragraphs)
content = soup.find_all('p')

# Combine all paragraphs into one string
paragraph = "".join(p.text for p in content)

# Print the extracted text
print(paragraph)


India, officially the Republic of India (ISO: Bhārat Gaṇarājya),[21] is a country in South Asia.  It is the seventh-largest country by area, the most populous country as of June 2023,[22][23] and from the time of its independence in 1947, the world's most populous democracy.[24][25][26] Bounded by the Indian Ocean on the south, the Arabian Sea on the southwest, and the Bay of Bengal on the southeast, it shares land borders with Pakistan to the west;[j] China, Nepal, and Bhutan to the north; and Bangladesh and Myanmar to the east. In the Indian Ocean, India is in the vicinity of Sri Lanka and the Maldives; its Andaman and Nicobar Islands share a maritime border with Thailand, Myanmar, and Indonesia.
Modern humans arrived on the Indian subcontinent from Africa no later than 55,000 years ago.[27][28][29]
Their long occupation, initially in varying forms of isolation as hunter-gatherers, has made the region highly diverse, second only to Africa in human genetic diversity.[30] Settled life
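
The scraped text still carries Wikipedia footnote markers such as `[21]` and `[j]`, and these end up fused into tokens and entities in the later sections (for example `Gaṇarājya),[21`). A minimal sketch of stripping them before running spaCy is shown below. The helper name `strip_citations` and the assumption that markers are at most four characters long are illustrative, not part of the original notebook.

```python
import re

# Assumption: footnote markers are short bracketed tokens such as [21], [j] or [k].
# The 1-4 character limit is a guess meant to avoid removing longer bracketed text.
CITATION_RE = re.compile(r"\[[^\[\]]{1,4}\]")


def strip_citations(text: str) -> str:
    """Remove Wikipedia-style footnote markers from scraped text."""
    return CITATION_RE.sub("", text)


sample = (
    "India, officially the Republic of India (ISO: Bhārat Gaṇarājya),[21] "
    "is a country in South Asia."
)
print(strip_citations(sample))
```

In the notebook this could be applied as `paragraph = strip_citations(paragraph)` before the text is passed to the spaCy pipeline.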

In [4]:
import spacy

### 2. Basic Text Processing

In [6]:
# Load spaCy's small English pipeline
text_pro = spacy.load("en_core_web_sm")
text = paragraph
doc = text_pro(text)

# Print each token with its part-of-speech tag and lemma
for token in doc:
    print(token.text, token.pos_, token.lemma_)


 SPACE 

India PROPN India
, PUNCT ,
officially ADV officially
the DET the
Republic PROPN Republic
of ADP of
India PROPN India
( PUNCT (
ISO PROPN ISO
: PUNCT :
Bhārat PROPN Bhārat
Gaṇarājya),[21 PROPN Gaṇarājya),[21
] PUNCT ]
is AUX be
a DET a
country NOUN country
in ADP in
South PROPN South
Asia PROPN Asia
. PUNCT .
  SPACE  
It PRON it
is AUX be
the DET the
seventh ADV seventh
- PUNCT -
largest ADJ large
country NOUN country
by ADP by
area NOUN area
, PUNCT ,
the DET the
most ADV most
populous ADJ populous
country NOUN country
as ADP as
of ADP of
June PROPN June
2023,[22][23 NUM 2023,[22][23
] PUNCT ]
and CCONJ and
from ADP from
the DET the
time NOUN time
of ADP of
its PRON its
independence NOUN independence
in ADP in
1947 NUM 1947
, PUNCT ,
the DET the
world NOUN world
's PART 's
most ADV most
populous ADJ populous
democracy.[24][25][26 NOUN democracy.[24][25][26
] PUNCT ]
Bounded VERB bound
by ADP by
the DET the
Indian PROPN Indian
Ocean PROPN Ocean
on ADP on
the DET the
south NO
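
The token listing is long, so a summary can be easier to read than the raw rows. The sketch below tallies part-of-speech tags after dropping punctuation and whitespace. It works on a handful of `(text, pos, lemma)` rows copied from the output above rather than on the live `doc`, so it runs without spaCy; the names `rows`, `content_rows`, and `pos_counts` are illustrative.

```python
from collections import Counter

# A few (text, pos, lemma) rows copied from the output above
rows = [
    ("India", "PROPN", "India"),
    (",", "PUNCT", ","),
    ("officially", "ADV", "officially"),
    ("the", "DET", "the"),
    ("Republic", "PROPN", "Republic"),
    ("of", "ADP", "of"),
    ("India", "PROPN", "India"),
    ("is", "AUX", "be"),
    ("a", "DET", "a"),
    ("country", "NOUN", "country"),
]

# Drop punctuation and whitespace rows before counting
content_rows = [row for row in rows if row[1] not in {"PUNCT", "SPACE"}]

# Count how often each part-of-speech tag occurs
pos_counts = Counter(pos for _, pos, _ in content_rows)
print(pos_counts.most_common())
```

With the real `doc`, the same tally could be built from `token.pos_`, using `token.is_punct` and `token.is_space` as the filter.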

### 3. Named Entity Recognition (NER)

In [7]:
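# Load the pipeline (the `text_pro` object from the previous section could be reused instead)
ner = spacy.load("en_core_web_sm")
text = paragraph

ner_doc = ner(text)

# Print each named entity with its label
for ent in ner_doc.ents:
    print(ent.text, ent.label_)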

India GPE
the Republic of India GPE
South Asia LOC
seventh ORDINAL
June 2023,[22][23 DATE
1947 DATE
the Indian Ocean LOC
the Arabian Sea LOC
the Bay of Bengal LOC
Pakistan GPE
Nepal GPE
Bhutan GPE
Bangladesh GPE
Myanmar GPE
the Indian Ocean LOC
India GPE
Sri Lanka GPE
Maldives GPE
Andaman PRODUCT
Nicobar Islands ORG
Thailand GPE
Myanmar GPE
Indonesia GPE
Indian NORP
Africa LOC
no later than 55,000 years DATE
second ORDINAL
Africa LOC
Indus GPE
9,000 years ago DATE
the Indus Valley Civilisation LOC
the third millennium DATE
BCE.[31 ORG
1200 DATE
BCE ORG
Sanskrit ORG
India GPE
today DATE
Rigveda ORG
Rigveda ORG
Dravidian NORP
India GPE
400 CARDINAL
BCE ORG
Gupta Empires PERSON
belief.[k][41] PERSON
South India GPE
Dravidian NORP
Southeast Asia.[42 LOC
Christianity NORP
Islam ORG
Judaism ORG
Zoroastrianism NORP
India GPE
Muslim NORP
Central Asia LOC
India GPE
the Delhi Sultanate GPE
India GPE
the 15th century DATE
Vijayanagara Empire GPE
Hindu NORP
south India.[46 GPE
Punjab PRODUCT
The M
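
Entities repeat often in this output, and a few labels look like misclassifications (`Andaman PRODUCT`, `Nicobar Islands ORG`). Grouping the unique entity texts by label makes both easier to spot. The sketch below uses a few `(text, label)` pairs copied from the output above; the names `pairs` and `by_label` are illustrative.

```python
from collections import defaultdict

# A few (text, label) pairs copied from the output above
pairs = [
    ("India", "GPE"),
    ("the Republic of India", "GPE"),
    ("South Asia", "LOC"),
    ("1947", "DATE"),
    ("the Indian Ocean", "LOC"),
    ("Pakistan", "GPE"),
    ("the Indian Ocean", "LOC"),
    ("India", "GPE"),
    ("Andaman", "PRODUCT"),
    ("Nicobar Islands", "ORG"),
]

# Collect the unique entity texts under each label
by_label = defaultdict(set)
for text, label in pairs:
    by_label[label].add(text)

for label in sorted(by_label):
    print(label, sorted(by_label[label]))
```

With the real document, the pairs could come from `(ent.text, ent.label_)` for each `ent` in `ner_doc.ents`.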

### 4. Dependency Parsing

In [8]:
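# Load the pipeline (the `text_pro` object from section 2 could be reused instead)
par = spacy.load("en_core_web_sm")
text = paragraph

par_doc = par(text)

# Print each token with its dependency label and the token it attaches to (its head)
for token in par_doc:
    print(token.text, token.dep_, token.head.text)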


 dep India
India nsubj is
, punct India
officially advmod Republic
the det Republic
Republic appos India
of prep Republic
India pobj of
( punct ISO
ISO conj Republic
: punct ISO
Bhārat compound Gaṇarājya),[21
Gaṇarājya),[21 conj India
] punct Gaṇarājya),[21
is ROOT is
a det country
country attr is
in prep country
South compound Asia
Asia pobj in
. punct is
  dep .
It nsubj is
is ROOT is
the det country
seventh advmod largest
- punct largest
largest amod country
country attr is
by prep country
area pobj by
, punct country
the det country
most advmod populous
populous amod country
country appos country
as prep country
of prep as
June pobj of
2023,[22][23 nummod June
] punct country
and cc is
from prep shares
the det time
time pobj from
of prep time
its poss independence
independence pobj of
in prep time
1947 pobj in
, punct shares
the det world
world poss democracy.[24][25][26
's case world
most advmod populous
populous amod democracy.[24][25][26
democracy.[24][25][26 meta ,
] punct dem
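
Each row above links a token to its head, so the rows for one sentence can be folded into a small tree. The sketch below does this for the first sentence, using `(text, dep, head)` triples copied from the output. Keying on the head's text is a simplification that only holds while no word repeats in the sentence; with real spaCy tokens, `token.i` and `token.head.i` would be the safer keys. The names `triples`, `children`, and `root` are illustrative.

```python
from collections import defaultdict

# (text, dep, head) triples for part of the first sentence, copied from the output above
triples = [
    ("India", "nsubj", "is"),
    ("is", "ROOT", "is"),
    ("a", "det", "country"),
    ("country", "attr", "is"),
    ("in", "prep", "country"),
    ("South", "compound", "Asia"),
    ("Asia", "pobj", "in"),
]

children = defaultdict(list)
root = None
for text, dep, head in triples:
    if dep == "ROOT":
        # The root token is listed as its own head, so record it separately
        root = text
        continue
    children[head].append((text, dep))

print("root:", root)
for head, kids in children.items():
    print(head, "->", kids)
```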

### 5. Text Classification

In [1]:
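import spacy
from spacy.training import Example

txtclf = spacy.load("en_core_web_sm")
text = "This movie is great!"

# spaCy v3 API: add_pipe takes the factory name and returns the component
# (the default textcat model is used here instead of single_label_cnn_config)
textcat = txtclf.add_pipe("textcat", last=True)
textcat.add_label("POSITIVE")
textcat.add_label("NEGATIVE")  # single-label textcat needs at least two labels

# Toy examples, used only to initialize the component's weights
examples = [
    Example.from_dict(txtclf.make_doc("This movie is great!"), {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}),
    Example.from_dict(txtclf.make_doc("This movie is awful."), {"cats": {"POSITIVE": 0.0, "NEGATIVE": 1.0}}),
]
textcat.initialize(lambda: examples, nlp=txtclf)

# The component is untrained, so these scores are not meaningful yet
txtclf_doc = txtclf(text)
print(txtclf_doc.cats)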