**Problem Statement**

The aim of this project is to predict the virality of news articles.

I approached this problem as follows:

* First, scrape news data from indiatimes (or another website) using Beautiful Soup.
* Train a machine learning model on the UCI news data, then apply the trained model to predict virality.
* Here, virality is defined as the number of shares an article receives (how many times the article is shared).



In [None]:
# Install the newspaper library
! pip install newspaper3k



In [None]:
# import all the required libraries
import requests
from bs4 import BeautifulSoup
from newspaper import Article  
import csv 
import pandas as pd
import numpy as np

To work with the text data, we need to import the Natural Language Toolkit library, i.e. nltk.

In [None]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

**Start the Scraping**

In [None]:
news_url = "https://timesofindia.indiatimes.com/world"
r = requests.get(news_url)

In [None]:
soup = BeautifulSoup(r.content, 'html5lib')
table = soup.find_all('a', attrs = {'class':'w_img'})

In [None]:
news=[]
for row in table: 
    if not row['href'].startswith('http'):
        news.append('https://timesofindia.indiatimes.com'+row['href'])
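
The loop above builds absolute URLs by string concatenation. As an alternative, here is a minimal sketch using the standard library's `urllib.parse.urljoin`, which resolves relative links against the site root and leaves absolute links unchanged. The helper name `absolutize_links` and the sample hrefs are hypothetical and only illustrate the idea. Unlike the loop above, this version keeps absolute links rather than skipping them.

```python
from urllib.parse import urljoin

BASE_URL = "https://timesofindia.indiatimes.com"

def absolutize_links(hrefs, base=BASE_URL):
    """Resolve each href against base and drop duplicates, keeping first-seen order."""
    seen = set()
    links = []
    for href in hrefs:
        url = urljoin(base, href)
        if url not in seen:
            seen.add(url)
            links.append(url)
    return links

# Hypothetical hrefs, for illustration only.
sample = ["/world/us/story-1.cms", "/world/us/story-1.cms", "https://example.com/a"]
links = absolutize_links(sample)
```

Removing duplicates here avoids downloading the same article twice in the later steps.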


Create a list that stores each article's data as a dictionary.

In [None]:
df=[]
for i in news:
    article = Article(i, language="en")
    article.download() 
    article.parse() 
    article.nlp() 
    data={}
    data['Title']=article.title
    data['Text']=article.text
    data['Summary']=article.summary
    data['Keywords']=article.keywords
    df.append(data)
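
A single failed download raises an exception and stops a loop like the one above. The sketch below shows one way to skip failing URLs and keep the rest. `collect_records` is a hypothetical helper, and `fake_fetch` stands in for the download, parse and nlp steps so the example runs without network access.

```python
def collect_records(urls, fetch):
    """Call fetch(url) for every URL; return (records, failed_urls)."""
    records, failed = [], []
    for url in urls:
        try:
            records.append(fetch(url))
        except Exception:  # a real notebook might catch a narrower error type
            failed.append(url)
    return records, failed

def fake_fetch(url):
    """Stand-in for the download/parse steps; fails on purpose for one URL."""
    if "broken" in url:
        raise RuntimeError("download failed")
    return {"Title": url.rsplit("/", 1)[-1], "Text": "", "Summary": "", "Keywords": []}

records, failed = collect_records(
    ["https://example.com/ok", "https://example.com/broken"], fake_fetch
)
```

Keeping the failed URLs in a separate list makes it easy to retry them later.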

In [None]:
# check the list
df

[{'Keywords': ['announced',
   'terror',
   'bill',
   'ms13',
   'diaz',
   'justice',
   'gang',
   'barr',
   'conspiracy',
   'drug',
   'leader',
   'department',
   'charges'],
  'Summary': 'Attorney General Bill Barr announced the charges against Salvador-based gang leader Armando Eliu Melgar Diaz a... Read MoreWASHINGTON: The US Justice Department announced Wednesday it is using terrorism charges for the first time to indict a member of the notoriously violent MS-13 gang.\nAttorney General Bill Barr announced the charges against Salvador-based gang leader Armando Eliu Melgar Diaz at a White House event meant to highlight the Trump administration\'s efforts to crack down on the group.\nThe Justice Department accused Diaz with conspiracy to provide material support to terrorists and conspiracy to cross-border terror acts, along with narco-terror financing and other charges.\nThe department did not explain why it was using terror charges for the first time against the gang, but it

In [None]:
# convert the list of dictionaries to a pandas dataframe

dataset=pd.DataFrame(df)
dataset.head()

Unnamed: 0,Title,Text,Summary,Keywords
0,"Bill Barr: In first, US charges MS-13 leader w...",Attorney General Bill Barr announced the charg...,Attorney General Bill Barr announced the charg...,"[announced, terror, bill, ms13, diaz, justice,..."
1,Chilean police train dogs to sniff out Covid-19,"Jul 15, 2020, 09:28PM IST\n\nSource: TOI.in\n\...","Jul 15, 2020, 09:28PM ISTSource: TOI.inPolice ...","[training, covid19, sniffing, trained, sniff, ..."
2,Desperation science slows hunt for virus drugs,"Jul 08, 2020, 10:27PM IST\n\nSource: AP\n\nDes...","Jul 08, 2020, 10:27PM ISTSource: APDesperate t...","[understanding, hunt, tens, desperation, virus..."
3,COVID-ravaged New York church reopens,"Jul 08, 2020, 10:19PM IST\n\nSource: AP\n\nAft...","Jul 08, 2020, 10:19PM ISTSource: APAfter losin...","[saint, roman, lost, months, reopens, parishio..."
4,America disrupted: US on edge as presidential ...,"Jul 07, 2020, 09:54PM IST\n\nSource: AP\n\nAme...","Jul 07, 2020, 09:54PM ISTSource: APAmerica Dis...","[edge, places, voting, likely, presidential, a..."


In [None]:
from xgboost.sklearn import XGBRegressor
from sklearn.model_selection import train_test_split

In [None]:
# To train the model we used the UCI news data which is available at the following link.
FILEPATH="https://raw.githubusercontent.com/heroorkrishna/News-Virality-Prediction/master/OnlineNewsPopularity.csv"

Transform the data to train the model

In [None]:
def clean_cols(data):
    """Clean the column names by stripping and lowercase."""
    clean_col_map = {x: x.lower().strip() for x in list(data)}
    return data.rename(index=str, columns=clean_col_map)

def TrainTestSplit(X, Y, R=0, test_size=0.2):
    """Easy Train Test Split call."""
    return train_test_split(X, Y, test_size=test_size, random_state=R)
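
To illustrate what `clean_cols` does, here is a plain-Python sketch of the same name mapping applied to a list of header names. The padded, mixed-case header below is a hypothetical example, on the assumption that the raw CSV header may contain leading spaces. `clean_col_map` is an illustrative helper, not part of the notebook.

```python
def clean_col_map(columns):
    """Map each raw column name to its stripped, lower-cased form."""
    return {name: name.lower().strip() for name in columns}

# Hypothetical raw header with padding and mixed case.
raw_header = ["url", " timedelta", " n_tokens_title", " Shares"]
mapping = clean_col_map(raw_header)
cleaned = [mapping[name] for name in raw_header]
```

Without this step, a lookup such as `data['shares']` would fail for a column stored as `' shares'`.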

* The features below are removed from the UCI training data because some of them cannot be computed for the scraped articles that the trained model is applied to.


In [None]:
full_data = clean_cols(pd.read_csv(FILEPATH))
train_set, test_set = train_test_split(full_data, test_size=0.20, random_state=42)

# Drop the same columns from both splits so that train and test share one feature set
# (the lda_* columns were previously dropped from the training split only).
drop_cols = ['url','shares', 'timedelta', 'lda_00','lda_01','lda_02','lda_03','lda_04','num_self_hrefs', 'kw_min_min', 'kw_max_min', 'kw_avg_min','kw_min_max','kw_max_max','kw_avg_max','kw_min_avg','kw_max_avg','kw_avg_avg','self_reference_min_shares','self_reference_max_shares','self_reference_avg_sharess','rate_positive_words','rate_negative_words','abs_title_subjectivity','abs_title_sentiment_polarity']

x_train = train_set.drop(drop_cols, axis=1)
y_train = train_set['shares']

x_test = test_set.drop(drop_cols, axis=1)
y_test = test_set['shares']
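
The model can only score data whose columns match the ones it was trained on, so it helps to compare column lists before predicting. The sketch below does this with plain lists. `feature_mismatch` and the sample column names are hypothetical and serve only as an illustration.

```python
def feature_mismatch(train_columns, predict_columns):
    """Return (missing, unexpected) column names relative to the training columns."""
    train, predict = list(train_columns), list(predict_columns)
    missing = [c for c in train if c not in predict]
    unexpected = [c for c in predict if c not in train]
    return missing, unexpected

# Hypothetical column lists, for illustration only.
train_cols = ["n_tokens_title", "num_hrefs", "is_weekend"]
other_cols = ["n_tokens_title", "num_hrefs", "lda_00", "is_weekend"]
missing, unexpected = feature_mismatch(train_cols, other_cols)
```

With DataFrames, the same check can be run as `feature_mismatch(x_train.columns, x_test.columns)`.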

In [None]:
clf=XGBRegressor(random_state=32,max_depth=5,n_estimators=1000)
clf.fit(x_train, y_train)



XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0,
             importance_type='gain', learning_rate=0.1, max_delta_step=0,
             max_depth=5, min_child_weight=1, missing=None, n_estimators=1000,
             n_jobs=1, nthread=None, objective='reg:linear', random_state=32,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
             silent=None, subsample=1, verbosity=1)

In [None]:
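# Predictions on the training rows, indexed by the actual share counts.
# The model has already seen these rows, so the comparison is optimistic.
rf_res = pd.DataFrame(clf.predict(x_train),list(y_train))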

In [None]:
rf_res.reset_index(level=0, inplace=True)
rf_res_df = rf_res.rename(index=str, columns={"index": "Actual shares", 0: "Predicted shares"})
rf_res_df.head()

Unnamed: 0,Actual shares,Predicted shares
0,16100,4824.28125
1,508,813.78064
2,1300,3281.530762
3,3100,4311.856445
4,6900,5627.784668
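
To summarise the gap between actual and predicted shares in one number, an error metric can be computed. Below is a minimal sketch of mean absolute error and root mean squared error in plain Python, applied to the five rows shown above. The function names are illustrative, and five training rows are far too few to judge the model.

```python
import math

def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def root_mean_squared_error(actual, predicted):
    """Square root of the average squared difference."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# The five rows shown in the table above.
actual = [16100, 508, 1300, 3100, 6900]
predicted = [4824.28125, 813.78064, 3281.530762, 4311.856445, 5627.784668]

mae = mean_absolute_error(actual, predicted)
rmse = root_mean_squared_error(actual, predicted)
```

Applying the same functions to `y_test` and `clf.predict(x_test)` would give an estimate on held-out data, which is a fairer measure than the training rows.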


**Converting Crawled News to Match the UCI Training Set**

In [None]:
nltk.download('stopwords')
from nltk.tokenize import word_tokenize 
from nltk.corpus import stopwords
stopwords=set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
def rate_unique(words):
    """Share of unique tokens among all tokens in the text."""
    words=tokenize(words)
    if not words:
        return 0.0
    return len(set(words))/len(words)

def rate_nonstop(words):
    """Return (non-stop-word rate, unique non-stop-word rate) for the text."""
    words=tokenize(words)
    if not words:
        return 0.0, 0.0
    filtered_sentence = [w for w in words if w not in stopwords]
    rate_nonstop=len(filtered_sentence)/len(words)
    rate_unique_nonstop=len(set(filtered_sentence))/len(words)
    return rate_nonstop,rate_unique_nonstop

def avg_token(words):
    """Average token length in characters."""
    words=tokenize(words)
    if not words:
        return 0.0
    return np.average([len(w) for w in words])
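
The three text ratios can be checked by hand on a short sentence. The sketch below recomputes them with the standard library only. It uses a regex tokenizer and a tiny stop-word set as stand-ins for `word_tokenize` and the NLTK stop-word list, so its numbers can differ from the notebook's. All names in it are hypothetical.

```python
import re

# Tiny illustrative stop-word set; the notebook uses NLTK's English list instead.
STOP_WORDS = {"the", "a", "of", "to", "and", "in", "is"}

def simple_tokens(text):
    """Split text into word tokens with a regex (a stand-in for word_tokenize)."""
    return re.findall(r"[A-Za-z0-9']+", text)

def token_rates(text):
    """Return (unique rate, non-stop rate, average token length); zeros for empty text."""
    tokens = simple_tokens(text)
    if not tokens:
        return 0.0, 0.0, 0.0
    unique_rate = len(set(tokens)) / len(tokens)
    non_stop = [t for t in tokens if t.lower() not in STOP_WORDS]
    non_stop_rate = len(non_stop) / len(tokens)
    avg_length = sum(len(t) for t in tokens) / len(tokens)
    return unique_rate, non_stop_rate, avg_length

rates_example = token_rates("The dog saw the cat")
```

The sketch lower-cases tokens before the stop-word lookup, so a sentence-initial "The" counts as a stop word. That is a design choice of this example.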

In [None]:
from textblob import TextBlob

In [None]:
# install the datefinder library
!pip install datefinder

Collecting datefinder
  Downloading https://files.pythonhosted.org/packages/0c/4f/29524c9ca35d2ba1a8a3c6c895b90fc92525cf0fe357f747133890953ebe/datefinder-0.7.1-py2.py3-none-any.whl
Installing collected packages: datefinder
Successfully installed datefinder-0.7.1


In [None]:
import datefinder

def day(article_text):
    """Return the weekday name of the first date found in the text ("Sunday" if none)."""
    # find_dates yields datetime objects; evaluate it once and reuse the result.
    dates = list(datefinder.find_dates(article_text))
    if dates:
        return dates[0].strftime("%A")
    return "Sunday"
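
The scraped article texts shown earlier start with a byline such as `Jul 15, 2020, 09:28PM IST`. As a cross-check for the date lookup, here is a standard-library sketch that parses only that pattern. `weekday_from_byline` and its regular expression are hypothetical, and month and weekday names depend on the process locale (English under the default C locale).

```python
import re
from datetime import datetime

# Matches dates written like "Jul 15, 2020".
BYLINE_DATE = re.compile(r"\b([A-Z][a-z]{2}) (\d{1,2}), (\d{4})\b")

def weekday_from_byline(text, default="Sunday"):
    """Return the weekday name of the first 'Mon DD, YYYY' date in text, else default."""
    match = BYLINE_DATE.search(text)
    if match is None:
        return default
    try:
        parsed = datetime.strptime(" ".join(match.groups()), "%b %d %Y")
    except ValueError:  # e.g. an unknown month abbreviation or an impossible day
        return default
    return parsed.strftime("%A")

weekday = weekday_from_byline("Jul 15, 2020, 09:28PM IST\n\nSource: TOI.in")
```

A fixed pattern is stricter than a general date finder, so it is less likely to pick up stray numbers as dates, at the cost of missing other date formats.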

In [None]:
def tokenize(text):
    """Split text into word tokens with NLTK."""
    return word_tokenize(text)

In [None]:
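def polar(words):
    """Return the positive and negative tokens of a text, judged by TextBlob polarity."""
    # Use fresh lists on every call so tokens from earlier articles do not accumulate.
    pos_words=[]
    neg_words=[]
    all_tokens=tokenize(words)
    for i in all_tokens:
        polarity=TextBlob(i).sentiment.polarity
        if polarity>0:
            pos_words.append(i)
        elif polarity<0:
            neg_words.append(i)
    return pos_words,neg_words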

In [None]:
def rates(words):
    """Return positive/negative word rates and polarity statistics for a text."""
    pos, neg = polar(words)
    # Rates are relative to the total number of tokens in the text.
    n_tokens = len(tokenize(words))
    global_rate_positive_words = len(pos)/n_tokens if n_tokens else 0.0
    global_rate_negative_words = len(neg)/n_tokens if n_tokens else 0.0
    pol_pos = [TextBlob(w).sentiment.polarity for w in pos]
    pol_neg = [TextBlob(w).sentiment.polarity for w in neg]
    # Fall back to 0.0 when a text has no positive or no negative words.
    min_positive_polarity = min(pol_pos, default=0.0)
    max_positive_polarity = max(pol_pos, default=0.0)
    min_negative_polarity = min(pol_neg, default=0.0)
    max_negative_polarity = max(pol_neg, default=0.0)
    avg_positive_polarity = np.average(pol_pos) if pol_pos else 0.0
    avg_negative_polarity = np.average(pol_neg) if pol_neg else 0.0
    return global_rate_positive_words,global_rate_negative_words,avg_positive_polarity,min_positive_polarity,max_positive_polarity,avg_negative_polarity,min_negative_polarity,max_negative_polarity
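
Python's built-in `min` and `max` raise `ValueError` on an empty sequence, which matters for articles that contain no positive or no negative words. Below is a minimal sketch of a helper that returns default values in that case. `polarity_stats` and the sample polarities are hypothetical.

```python
def polarity_stats(polarities, default=0.0):
    """Return (average, minimum, maximum) of a list, or defaults when it is empty."""
    if not polarities:
        return default, default, default
    return sum(polarities) / len(polarities), min(polarities), max(polarities)

# Hypothetical word polarities for one article.
positive = [0.5, 0.25, 0.75]
negative = []

pos_stats = polarity_stats(positive)
neg_stats = polarity_stats(negative)
```

Using 0.0 as the default is an assumption. Whether a neutral value or a missing value suits the model better depends on how the training data encodes such articles.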

In [None]:
df2=[]
for i in news:
    pred_info={}
    article = Article(i, language="en") # en for English
    article.download()
    article.parse()
    article.nlp()  # populates article.keywords, which num_keywords relies on
    analysis=TextBlob(article.text)
    title_analysis=TextBlob(article.title)
    # Compute both non-stop-word rates in a single pass.
    non_stop_rate, non_stop_unique_rate = rate_nonstop(article.text)
    pred_info['text']=article.text
    pred_info['n_tokens_title']=len(tokenize(article.title))
    pred_info['n_tokens_content']=len(tokenize(article.text))
    pred_info['n_unique_tokens']=rate_unique(article.text)
    pred_info['n_non_stop_words']=non_stop_rate
    pred_info['n_non_stop_unique_tokens']=non_stop_unique_rate
    pred_info['num_hrefs']=article.html.count("https://timesofindia.indiatimes.com")
    pred_info['num_imgs']=len(article.images)
    pred_info['num_videos']=len(article.movies)
    pred_info['average_token_length']=avg_token(article.text)
    pred_info['num_keywords']=len(article.keywords)
    
    if "life-style" in article.url:
        pred_info['data_channel_is_lifestyle']=1
    else:
        pred_info['data_channel_is_lifestyle']=0
    if "etimes" in article.url:
        pred_info['data_channel_is_entertainment']=1
    else:
        pred_info['data_channel_is_entertainment']=0
    if "business" in article.url:
        pred_info['data_channel_is_bus']=1
    else:
        pred_info['data_channel_is_bus']=0
    # Test each keyword separately: an expression like `"a" or "b" in s` is always truthy.
    text_lower = article.text.lower()
    is_world = "world" in article.url
    is_tech = (not is_world) and any(
        k in text_lower or k in article.url for k in ("technology", "tech"))
    is_socmed = (not is_world) and (not is_tech) and any(
        k in text_lower for k in ("social media", "facebook", "whatsapp"))
    # At most one flag is set: world takes priority, then tech, then social media.
    data_channel_is_socmed = int(is_socmed)
    data_channel_is_tech = int(is_tech)
    data_channel_is_world = int(is_world)
        
    pred_info['data_channel_is_socmed']=data_channel_is_socmed
    pred_info['data_channel_is_tech']=data_channel_is_tech
    pred_info['data_channel_is_world']=data_channel_is_world
    
    # Derive the weekday once from the article text (not the URL) and one-hot encode it.
    weekday = day(article.text)
    for name in ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]:
        pred_info['weekday_is_' + name.lower()] = int(weekday == name)
    # Set is_weekend once, so a Saturday is not overwritten by the Sunday check.
    pred_info['is_weekend'] = int(weekday in ("Saturday", "Sunday"))
        
    pred_info['global_subjectivity']=analysis.sentiment.subjectivity
    pred_info['global_sentiment_polarity']=analysis.sentiment.polarity
    # Call rates() once and unpack its eight return values in order.
    (pred_info['global_rate_positive_words'],
     pred_info['global_rate_negative_words'],
     pred_info['avg_positive_polarity'],
     pred_info['min_positive_polarity'],
     pred_info['max_positive_polarity'],
     pred_info['avg_negative_polarity'],
     pred_info['min_negative_polarity'],
     pred_info['max_negative_polarity']) = rates(article.text)
    pred_info['title_subjectivity']=title_analysis.sentiment.subjectivity
    pred_info['title_sentiment_polarity']=title_analysis.sentiment.polarity
    df2.append(pred_info)
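
Keyword checks such as the channel flags need one membership test per keyword, because Python parses `"a" or "b" in s` as `"a" or ("b" in s)`, and a non-empty string is always truthy. The sketch below shows the pitfall next to a version based on `any`. `contains_any` and the sample text are hypothetical.

```python
def contains_any(text, keywords):
    """Return True if any keyword occurs in text (case-insensitive)."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)

sample_text = "Police train dogs to sniff out Covid-19"

# This evaluates to the non-empty string "facebook", which is truthy for every text.
always_truthy = bool("facebook" or "whatsapp" in sample_text.lower())

# One membership test per keyword gives the intended result.
is_socmed = contains_any(sample_text, ("social media", "facebook", "whatsapp"))
```

Substring matching is still a rough heuristic: a short keyword such as "tech" also matches words like "technique".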

In [None]:
pred_df=pd.DataFrame(df2)
pred_test=pred_df.drop(['text'],axis=1)
pred_df.head()

Unnamed: 0,text,n_tokens_title,n_tokens_content,n_unique_tokens,n_non_stop_words,n_non_stop_unique_tokens,num_hrefs,num_imgs,num_videos,average_token_length,num_keywords,data_channel_is_lifestyle,data_channel_is_entertainment,data_channel_is_bus,data_channel_is_socmed,data_channel_is_tech,data_channel_is_world,weekday_is_monday,weekday_is_tuesday,weekday_is_wednesday,weekday_is_thursday,weekday_is_friday,weekday_is_saturday,weekday_is_sunday,is_weekend,global_subjectivity,global_sentiment_polarity,global_rate_positive_words,global_rate_negative_words,avg_positive_polarity,min_positive_polarity,max_positive_polarity,avg_negative_polarity,min_negative_polarity,max_negative_polarity,title_subjectivity,title_sentiment_polarity
0,Attorney General Bill Barr announced the charg...,12,407,0.511057,0.700246,0.425061,229,13,0,4.906634,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0.461776,-0.037361,0.055,0.09,0.236049,0.05,0.5,-0.392284,-0.8,-0.05,0.333333,0.25
1,"Jul 15, 2020, 09:28PM IST\n\nSource: TOI.in\n\...",8,112,0.732143,0.678571,0.544643,185,9,0,4.160714,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,0.5,0.4,0.45,0.36,0.246514,0.05,0.5,-0.392284,-0.8,-0.05,0.0,0.0
2,"Jul 08, 2020, 10:27PM IST\n\nSource: AP\n\nDes...",7,117,0.726496,0.649573,0.547009,185,9,0,4.487179,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0.68125,0.05625,0.545,0.39,0.278759,0.05,0.6,-0.41716,-0.8,-0.05,0.0,0.0
3,"Jul 08, 2020, 10:19PM IST\n\nSource: AP\n\nAft...",5,54,0.87037,0.796296,0.703704,185,9,0,4.518519,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,0.337576,0.037273,0.735,0.49,0.291708,0.05,0.6,-0.416122,-0.8,-0.05,0.454545,0.136364
4,"Jul 07, 2020, 09:54PM IST\n\nSource: AP\n\nAme...",10,51,0.823529,0.745098,0.627451,185,9,0,4.078431,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,0.508333,-0.066667,0.84,0.53,0.280138,0.05,0.6,-0.402222,-0.8,-0.05,0.0,0.0


Final Results depicting the Likelihood of Virality of News

In [None]:
test2=pd.DataFrame(clf.predict(pred_test),pred_df['text'])
test2.reset_index(level=0, inplace=True)
test2 = test2.rename(index=str, columns={"index": "News", 0: "Virality"})
test2

Unnamed: 0,text,Virality
0,Attorney General Bill Barr announced the charg...,9392.894531
1,"Jul 15, 2020, 09:28PM IST\n\nSource: TOI.in\n\...",35716.289062
2,"Jul 08, 2020, 10:27PM IST\n\nSource: AP\n\nDes...",40506.992188
3,"Jul 08, 2020, 10:19PM IST\n\nSource: AP\n\nAft...",51604.65625
4,"Jul 07, 2020, 09:54PM IST\n\nSource: AP\n\nAme...",32206.533203
5,"Jul 07, 2020, 09:53PM IST\n\nSource: AP\n\nTom...",23513.542969
6,"Jul 07, 2020, 09:51PM IST\n\nSource: AP\n\nUS ...",24541.496094
7,"Jul 03, 2020, 08:59PM IST\n\nSource: AP\n\nAhe...",72445.953125
8,"Jul 03, 2020, 08:54PM IST\n\nSource: AP\n\nRen...",60149.453125
9,"Jun 29, 2020, 03:44PM IST\n\nSource: TOI.in\n\...",28198.240234
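
To see which articles the model expects to be shared most, the predictions can be sorted. Here is a plain-Python sketch using the ten values shown above, keyed by row number. `top_n` is a hypothetical helper.

```python
# Predicted shares for the ten rows shown above, keyed by row number.
predicted_shares = {
    0: 9392.894531, 1: 35716.289062, 2: 40506.992188, 3: 51604.65625,
    4: 32206.533203, 5: 23513.542969, 6: 24541.496094, 7: 72445.953125,
    8: 60149.453125, 9: 28198.240234,
}

def top_n(scores, n=3):
    """Return the n (row, score) pairs with the highest scores, best first."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:n]

top_three = top_n(predicted_shares)
```

On the DataFrame itself, `test2.sort_values("Virality", ascending=False)` gives the same ordering.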


Here we obtain the predicted virality of each news article from the XGBoost regressor. These predictions are not the best possible, because the training data covers different news categories and a different time period from the scraped articles.