# Web Scraping and Data Preparation for News Articles

Author : Prasad Patharvat
    
KMeans is a clustering algorithm that partitions observations into k clusters. Because we choose the number of clusters ourselves, it can be used to group unlabelled data into categories: we set k equal to or greater than the number of classes we expect, and then assign a class name to each cluster.
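To make the idea concrete, here is a minimal sketch of KMeans on toy data. The points, the variable names and the `random_state` value are assumptions made for illustration; they are not part of the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans

# Six hypothetical 2-D points forming two well-separated groups
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])

# n_clusters is the k we choose up front; random_state makes the run repeatable
km_demo = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(km_demo.labels_)           # one cluster id per point
print(km_demo.cluster_centers_)  # one centroid per cluster
```

The cluster ids themselves are arbitrary: which group is called 0 and which is called 1 can change between runs unless the random seed is fixed.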

In [1]:
import requests
from bs4 import BeautifulSoup

In [2]:
url = "https://www.opindia.com/latest-news/?nocache"
data = requests.get(url)

In [3]:
data

<Response [200]>

In [4]:
data.content



In [5]:
soup = BeautifulSoup(data.content,"html.parser")

In [6]:
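def fetch_article(url):
    """Return the headline titles found on one listing page."""
    data = requests.get(url)
    soup = BeautifulSoup(data.content, "html.parser")
    articles = []
    # Each headline is an <h3> whose link carries the headline text in its title attribute
    for i in soup.find_all("h3", class_=["entry-title td-module-title"]):
        articles.append(i.find('a')['title'])
    return articles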
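The extraction step can be tried without a network call. The sketch below runs the same kind of lookup on a small, made-up HTML snippet; the markup, the helper name `extract_titles` and the looser `class_="entry-title"` match are assumptions for illustration, not the site's actual page structure.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, modelled on the selector used in fetch_article
sample_html = """
<div>
  <h3 class="entry-title td-module-title"><a href="/a" title="First headline">First headline</a></h3>
  <h3 class="entry-title td-module-title"><a href="/b" title="Second headline">Second headline</a></h3>
  <h3 class="entry-title td-module-title"><a href="/c">No title attribute</a></h3>
  <h3 class="other">Not an article</h3>
</div>
"""

def extract_titles(html):
    """Return the title attribute of each headline link in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for h3 in soup.find_all("h3", class_="entry-title"):
        link = h3.find("a")
        # Skip headings without a link or without a title attribute
        if link is not None and link.get("title"):
            titles.append(link["title"])
    return titles

print(extract_titles(sample_html))
```

Checking for a missing link or `title` attribute avoids a `TypeError` or `KeyError` if a heading on the page is structured differently.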

In [7]:
urllist = []
for i in range(2,21,1):
    url = "https://www.opindia.com/latest-news/?nocache" + str(i) + "/"
    urllist.append(url)

In [8]:
urllist

['https://www.opindia.com/latest-news/?nocache2/',
 'https://www.opindia.com/latest-news/?nocache3/',
 'https://www.opindia.com/latest-news/?nocache4/',
 'https://www.opindia.com/latest-news/?nocache5/',
 'https://www.opindia.com/latest-news/?nocache6/',
 'https://www.opindia.com/latest-news/?nocache7/',
 'https://www.opindia.com/latest-news/?nocache8/',
 'https://www.opindia.com/latest-news/?nocache9/',
 'https://www.opindia.com/latest-news/?nocache10/',
 'https://www.opindia.com/latest-news/?nocache11/',
 'https://www.opindia.com/latest-news/?nocache12/',
 'https://www.opindia.com/latest-news/?nocache13/',
 'https://www.opindia.com/latest-news/?nocache14/',
 'https://www.opindia.com/latest-news/?nocache15/',
 'https://www.opindia.com/latest-news/?nocache16/',
 'https://www.opindia.com/latest-news/?nocache17/',
 'https://www.opindia.com/latest-news/?nocache18/',
 'https://www.opindia.com/latest-news/?nocache19/',
 'https://www.opindia.com/latest-news/?nocache20/']

In [9]:
all_articles = []
for i in urllist:
    all_articles.extend(fetch_article(i))

In [11]:
all_articles

['From ‘hospital in place of Ram Mandir’ to wearing Hindu identity on sleeve: Hindutva as response to Left Liberal and deep Nehruvian State',
 'BBC, NYT jump in to defend Alt News’ Md Zubair, ignore his derogatory anti-Hindu posts and claim he is arrested for being a ‘Modi critic’',
 'Maharashtra crisis: Uddhav to face trust vote on June 30, rebel Shiv Sena MLAs to return from Guwahati',
 'Punjab: AAP govt heeds Congress’ suggestion, vows to pass a resolution against Centre’s ban on Sidhu Moosewala’s contentious song',
 '‘Christians under attack’ redux: SC accepts plea moved by scam-tainted Archbishop alleging increased attacks against pastors and churches',
 'From ‘hospital in place of Ram Mandir’ to wearing Hindu identity on sleeve: Hindutva as response to Left Liberal and deep Nehruvian State',
 'Islamists kill another Hindu man while journalists try to paint Mohammed Zubair, who triggered the Islamists, a victim\xa0',
 'Of Free Speech and 295A: Why I partially agree and wholly disa
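
The list above contains the same headline more than once (the first and sixth entries are identical). If repeated headlines are not wanted, one possible way to drop them while keeping the original order is sketched below; the `scraped` list is a made-up stand-in for `all_articles`.

```python
# Hypothetical stand-in for the scraped list
scraped = ["Headline A", "Headline B", "Headline A", "Headline C", "Headline B"]

# dict.fromkeys keeps the first occurrence of each key and preserves insertion order
unique_articles = list(dict.fromkeys(scraped))
print(unique_articles)
```

Removing duplicates before vectorizing keeps repeated headlines from dominating a cluster.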

In [12]:
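import re
from nltk.stem import PorterStemmer

ps = PorterStemmer()  # create the stemmer once and reuse it
p_art = []
for i in all_articles:
    # Upper-case the headline and drop everything except letters, digits and spaces
    q = re.sub("[^A-Z0-9 ]", "", i.upper())
    sent = ""
    for j in q.split(" "):
        # stem() lower-cases its result, so convert back to upper case
        sent = sent + " " + ps.stem(j).upper()
    p_art.append(sent)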
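
The regular expression in this cell deletes punctuation outright, so hyphenated words are fused into a single token (for example "anti-Hindu" becomes "ANTIHINDU"). A possible variation, shown here only as a sketch, replaces punctuation with a space instead. The function name `clean_headline` and the sample text are assumptions, and stemming is left out to keep the example short.

```python
import re

def clean_headline(text):
    """Upper-case a headline and replace non-alphanumeric characters with spaces."""
    text = text.upper()
    text = re.sub(r"[^A-Z0-9 ]", " ", text)   # punctuation becomes a space
    return re.sub(r" +", " ", text).strip()   # collapse runs of spaces

print(clean_headline("Scam-tainted Archbishop: anti-Hindu posts"))
```

The trade-off is that possessives are split as well, leaving a stray "S" token, so which variant works better depends on the headlines.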

In [13]:
p_art

[' FROM HOSPIT IN PLACE OF RAM MANDIR TO WEAR HINDU IDENT ON SLEEV HINDUTVA AS RESPONS TO LEFT LIBER AND DEEP NEHRUVIAN STATE',
 ' BBC NYT JUMP IN TO DEFEND ALT NEW MD ZUBAIR IGNOR HI DEROGATORI ANTIHINDU POST AND CLAIM HE IS ARREST FOR BE A MODI CRITIC',
 ' MAHARASHTRA CRISI UDDHAV TO FACE TRUST VOTE ON JUNE 30 REBEL SHIV SENA MLA TO RETURN FROM GUWAHATI',
 ' PUNJAB AAP GOVT HEED CONGRESS SUGGEST VOW TO PASS A RESOLUT AGAINST CENTR BAN ON SIDHU MOOSEWALA CONTENTI SONG',
 ' CHRISTIAN UNDER ATTACK REDUX SC ACCEPT PLEA MOVE BY SCAMTAINT ARCHBISHOP ALLEG INCREAS ATTACK AGAINST PASTOR AND CHURCH',
 ' FROM HOSPIT IN PLACE OF RAM MANDIR TO WEAR HINDU IDENT ON SLEEV HINDUTVA AS RESPONS TO LEFT LIBER AND DEEP NEHRUVIAN STATE',
 ' ISLAMIST KILL ANOTH HINDU MAN WHILE JOURNALIST TRI TO PAINT MOHAM ZUBAIR WHO TRIGGER THE ISLAMIST A VICTIM',
 ' OF FREE SPEECH AND 295A WHI I PARTIAL AGRE AND WHOLLI DISAGRE WITH THE ARGUMENT OF PROF ANAND RANGANATHAN',
 ' AS SANJAY RAUT SAY THEY MIGHT ABANDON MVA ALL

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer
tf = TfidfVectorizer()  # the vectorizer must be created before it is fitted
A = tf.fit_transform(p_art).toarray()

In [17]:
A

array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.17666163],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

In [18]:
from sklearn.cluster import KMeans
km = KMeans(n_clusters = 5)
cl_res = km.fit(A)

In [19]:
cl_res.labels_

array([4, 1, 2, ..., 1, 2, 3])

In [20]:
import pandas as pd
Q = pd.DataFrame(p_art,columns=["Article"])
Q['Cluster'] = cl_res.labels_

In [21]:
Q.head()

|   | Article | Cluster |
|---|---------|---------|
| 0 | FROM HOSPIT IN PLACE OF RAM MANDIR TO WEAR HI... | 4 |
| 1 | BBC NYT JUMP IN TO DEFEND ALT NEW MD ZUBAIR I... | 1 |
| 2 | MAHARASHTRA CRISI UDDHAV TO FACE TRUST VOTE O... | 2 |
| 3 | PUNJAB AAP GOVT HEED CONGRESS SUGGEST VOW TO ... | 1 |
| 4 | CHRISTIAN UNDER ATTACK REDUX SC ACCEPT PLEA M... | 3 |


In [22]:
# Hand-assigned category name for each cluster id
E = {0: "Entertainment",
     1: "Politics",
     2: "Regional_Politics",
     3: "Religion",
     4: "Geopolitics"}

In [23]:
# Map each cluster id to its category name
Q['category'] = Q['Cluster'].map(E)

In [26]:
Q.sample(10)

|      | Article | Cluster | category |
|------|---------|---------|----------|
| 1913 | MADHYA PRADESH KHARGON ADMINISTR ERECT BARRIC... | 1 | Politics |
| 1513 | UDAIPUR MURDER POLIC CONSTABL CONTROL THE MOB... | 3 | Religion |
| 563 | BBC NYT JUMP IN TO DEFEND ALT NEW MD ZUBAIR I... | 1 | Politics |
| 585 | FROM HOSPIT IN PLACE OF RAM MANDIR TO WEAR HI... | 4 | Geopolitics |
| 841 | GODHRA CARNAG THE LIE OF DEAD BODI PARAD HOW ... | 3 | Religion |
| 1204 | RAJASTHAN POSTMORTEM REPORT REVEAL KANHAIYA L... | 2 | Regional_Politics |
| 1048 | AS SANJAY RAUT SAY THEY MIGHT ABANDON MVA ALL... | 3 | Religion |
| 861 | MAHARASHTRA CRISI UDDHAV TO FACE TRUST VOTE O... | 2 | Regional_Politics |
| 286 | PUNJAB AAP GOVT HEED CONGRESS SUGGEST VOW TO ... | 1 | Politics |
| 1293 | MEDIA OUTFIT SECULARIS THE BRUTAL KILL OF KAN... | 0 | Entertainment |
