# DL_comp1_全_report

## Member

- 邱煒甯, 108072244
- 劉祥暉, 109072142
- 簡佩如, 112065525
- 陳凱揚, 108032053

## Load Data

Note: Must modifiy the data path of train.csv and test.csv before running code.

In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

from IPython.display import display

import warnings
warnings.filterwarnings("ignore")

# Download data from google drive
from google.colab import drive
drive.mount('/content/drive')

import re
from bs4 import BeautifulSoup
from datetime import datetime
from itertools import accumulate

from textblob import TextBlob

from gensim.corpora import Dictionary
from gensim.models import LdaModel

from datetime import time, timedelta
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

from datetime import time, timedelta
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')

from nltk.tokenize import word_tokenize, sent_tokenize
nltk.download('stopwords')
stop = stopwords.words('english')
import pickle

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, HashingVectorizer
from sklearn.feature_selection import RFECV
from sklearn.model_selection import train_test_split
from sklearn.metrics import auc, confusion_matrix,  f1_score, precision_score, recall_score, roc_curve
from sklearn.preprocessing import StandardScaler, MinMaxScaler

Mounted at /content/drive


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
# Must modify the data path
train_df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/DL/Competition 01_Predicting_News_Popularity/Data/train.csv')
test_df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/DL/Competition 01_Predicting_News_Popularity/Data/test.csv')

# train_df = train_df[:100]
# test_df = test_df[:100]

test_string = train_df.loc[0,'Page content']
train_df['Popularity'] = train_df['Popularity'].map(lambda x: (x+1)//2)

X_train_raw = train_df['Page content'].values
y_train_raw = train_df['Popularity'].values
X_test_raw = test_df['Page content'].values

## Preprocessing

### Preprocessing: Data Cleaning

**Data cleaning** is the process of detecting and correcting (or removing) corrupt or inaccurate pieces of information in the dataset. This feature is important since we don't want the raw text to be include as data for it bring more harm then good statictis when training the machine.

In [None]:
# text
def preprocessor(text):
  text = BeautifulSoup(text, 'html.parser').get_text()
  r = '(?::|;|=|X)(?:-)?(?:\)|\(|D|P)'
  emoticons = re.findall(r, text)
  text = re.sub(r, '', text)
  text = re.sub('[\W]+', ' ', text.lower()) + ' ' + ' '.join(emoticons).replace('-','')
  return text

**Stop-words** are simply words that are extremely common in all sorts of texts thus contain little useful information that can be used to distinguish between different classes of documents. Example stop-words are "is", "and", "has", and "the". We choose the Stop-Word Removal since we are using a Classification Model for they are best fit on working with raw or normalized term frequencies.

In [None]:
# token
def tokenizer_stem_nostop(text):
  porter = PorterStemmer()
  return [porter.stem(w) for w in re.split('\s+', text.strip()) \
      if w not in stop and re.match('[a-zA-Z]+', w)]

### Preprocessing: Datetime

**Datestime** are rich sources of information that can be used with machine learning models. However, these datetime variables do require some feature engineering to turn them into numerical data. By converting the strings into datetimes, we can break apart the date and get the year, month, week of year, day of month, hour, minute, second, etc. You can also get the day of the week (Monday = 0, Sunday = 6).

In the datetime library that feature class ```datetime.strptime``` which return a datetime corresponding to date_string, parsed according to format ```YYYY-MM-DD HH:MM:SS```. The class also include attribute that could obtain certain individual data of dates (year, month, day, weekday, hour, etc.)



In [None]:
# timedelta
def get_timedelta(s):
  now = datetime.strptime('2022-10-18 00:00:00', "%Y-%m-%d %H:%M:%S")
  try:
    t = datetime.strptime(s, "%Y-%m-%d %H:%M:%S")
    res = (now-t).days
  except:
    res = 3000
  return res

# weekday
def get_weekday(s):
  try:
    res = datetime.strptime(s, "%Y-%m-%d %H:%M:%S").weekday()
  except:
    res = 0
  return res

# hour
def get_hour(s):
  try:
    res = datetime.strptime(s, "%Y-%m-%d %H:%M:%S").hour
  except:
    res = 0
  return res

In [None]:
def create_df_soup(df):

  df_soup = pd.DataFrame()
  for index in range(0, df.shape[0]):

    html_data = df.loc[index,'Page content']
    soup = BeautifulSoup(html_data, 'html.parser')

    title = soup.find('h1').text.strip()
    paragraphs = soup.find_all('p')
    paragraph_text = [p.text.strip() for p in paragraphs]

    time = soup.find('time').get('datetime')
    categories = soup.find('footer').find_all('a')
    categories_text = [category.text.strip() for category in categories]

    data = {
        'Title': [title],
        'Paragraphs': ['\n'.join(paragraph_text)],
        'Time': [time],
        'Categories': [', '.join(categories_text)]
    }

    data = pd.DataFrame(data)

    df_soup = pd.concat([df_soup, data])
  return df_soup

In [None]:
def title_TDIDF(df_soup):
  nouns_titles = []

  for index in range(0, len(df_soup['Title'])):

    text = df_soup['Title'].iloc[index].lower()
    sentences = sent_tokenize(text)

    def pos_tagging(sentence):
        words = word_tokenize(sentence)
        tagged_words = nltk.pos_tag(words)
        return tagged_words

    nouns = []
    for sentence in sentences:
        tagged_words = pos_tagging(sentence)
        nouns.extend([word for word, pos in tagged_words if pos.startswith('N')])

    nouns_titles.append(nouns)

  corpus = [' '.join(sublist) for sublist in nouns_titles]
  vectorizer = TfidfVectorizer()
  tfidf_matrix = vectorizer.fit_transform(corpus)

  feature_names = vectorizer.get_feature_names_out()

  word_score_list = {}

  # In each article, identify the word with the highest TF-IDF value.
  for i in range(len(corpus)):
      feature_index = tfidf_matrix[i, :].nonzero()[1]
      tfidf_scores = zip(feature_index, [tfidf_matrix[i, x] for x in feature_index])
      for w, s in [(feature_names[i], s) for (i, s) in tfidf_scores]:
          word_score_list[w] = s

  word_score_list = dict(sorted(word_score_list.items(), key=lambda item: item[1], reverse=True))

  words_with_value_one = [word for word, score in word_score_list.items() if score >= 0.9]

  select_title_keyword = []
  for w in words_with_value_one:
    c = 0
    for i in corpus:
      if w in i:
        c+=1
    if c > 50: #50
      select_title_keyword.append(w)

  title_keyword_features = pd.DataFrame()

  for i in corpus:
    tmp_df = pd.DataFrame()
    for keyword in select_title_keyword:
      if keyword in i:
        tmp_df[keyword] = [1]
      else:
        tmp_df[keyword] = [0]

    title_keyword_features = pd.concat([title_keyword_features, tmp_df])

  return title_keyword_features, tfidf_matrix

In [None]:
def is_popular_time(timestamp):
    weekend_days = [4, 5, 6]
    morning = (time(6, 30), time(9, 30))
    lunch_time_range = (time(11, 0), time(14, 0))
    evening_time_range = (time(18, 0), time(0, 0))

    day_of_week = timestamp.weekday()
    time_of_day = timestamp.time()

    if day_of_week in weekend_days:
        return 1
    if day_of_week == 4 and time(12,0) <= time_of_day:
        return 1
    if lunch_time_range[0] <= time_of_day <= lunch_time_range[1] or evening_time_range[0] <= time_of_day <= evening_time_range[1] or morning[0] <= time_of_day <= morning[1]:
        return 1

    return 0

def create_popular_time(df_soup):
  parsed_time_lsit = []
  for t in df_soup['Time'].tolist():
    if t == None:
      t = df_soup['Time'].iloc[0]
    parsed_time_lsit.append(datetime.strptime(t, "%a, %d %b %Y %H:%M:%S %z"))

  popular_time_falg = []
  for t in parsed_time_lsit:
    popular_time_falg.append(is_popular_time(t))

  return popular_time_falg

### Preprocessing: Feature Extraction

By Using the three class above, we created our own feature extraction class that extract certain data.

In [None]:
def feature_extraction(text, lda, id2word):
  res = [None]*104
  soup = BeautifulSoup(text, 'html.parser')
  all_text = soup.get_text().split()
  token = tokenizer_stem_nostop(preprocessor(text))
  topic_dict = dict(lda[id2word.doc2bow(token)])

  time_tag = soup.find('time')
  timedelta = get_timedelta(time_tag.text[:19]) if time_tag else 3000
  title_tag = soup.find('h1', class_='title')
  article_tag = soup.find('article')
  href_tag = soup.find_all('a')
  href_http_tag = soup.find_all('a', attrs={'href': re.compile("^http://")})
  img_tag = soup.find_all('img')
  video_tag = soup.find_all('video')
  iframe_tag = soup.find_all('iframe')
  weekday = get_weekday(time_tag.text[:19]) if time_tag else 0
  hour = get_hour(time_tag.text[:19]) if time_tag else 0
  channel = article_tag.attrs.get('data-channel')

  all_weekday = ['monday', 'tuesday', 'wednesday', 'thursday', 'friday', 'saturday', 'sunday']
  all_hour = ['hour_'+str(i) for i in range(24)]
  all_channel = ['travel-leisure', 'gaming', 'tech', 'music', 'gadgets', 'film', 'viral',
          'business', 'memes', 'world', 'lifestyle', 'watercooler', 'sports', 'marketing',
          'bus', 'mobile', 'comics', 'mob', 'us', 'dev-design', 'how-to', 'jobs', 'home',
          'advertising', 'media', 'small-business', 'howto', 'apps-software', 'social-media',
          'startups', 'socmed', 'conversations', 'entertainment', 'pics']
  all_topic = ['LDA_'+str(i) for i in range(10)]

  # timedelta
  res[0] = timedelta
  # n_tokens_title
  res[1] = len(title_tag.text.split()) if title_tag else 10
  # n_tokens_content
  res[2] = len(article_tag.text.split()) if article_tag else 500
  # n_unique_tokens
  res[3] = len(set(all_text))/len(all_text)
  # n_non_stop_words
  res[4] = len([i for i in all_text if i not in stop])/len(all_text)
  # n_non_stop_unique_tokens
  res[5] = len(set(i for i in all_text if i not in stop))/len(set(all_text))
  # num_hrefs
  res[6] = len(href_tag)
  # num_self_hrefs
  res[7] = len(href_tag)-len(href_http_tag)
  # num_imgs
  res[8] = len(img_tag)
  # num_videos
  res[9] = len(video_tag)
  # num_iframes
  res[10] = len(iframe_tag)
  # average_token_length
  res[11] = sum([len(i) for i in all_text])/len(all_text)
  # sentiment_subjectivity
  res[12] = TextBlob(article_tag.text).sentiment.subjectivity
  # sentiment_polarity
  res[13] = TextBlob(article_tag.text).sentiment.polarity
  # weekend
  res[14] = 1 if weekday > 4 else 0
  # weekday
  for i, v in enumerate(all_weekday):
    res[15+i] = 1 if weekday == i else 0
  # hour
  for i, v in enumerate(all_hour):
    res[22+i] = 1 if hour == i else 0
  # channel
  for i, v in enumerate(all_channel):
    res[46+i] = 1 if channel == v else 0
  # positive, negative words
  pos, neu, neg = [], [], []
  for s in article_tag.text.split():
    p = TextBlob(s).sentiment.polarity
    if p >= 0.5:
      pos.append(p)
    elif p <= -0.5:
      neg.append(p)
    else:
      neu.append(p)
  # global_rate_positive_words
  res[80] = len(pos)/(len(pos)+len(neu)+len(neg)) if (len(pos)+len(neu)+len(neg)) else 0
  # global_rate_negative_words
  res[81] = len(neg)/(len(pos)+len(neu)+len(neg)) if (len(pos)+len(neu)+len(neg)) else 0
  # rate_positive_words
  res[82] = len(pos)/(len(pos)+len(neg)) if (len(pos)+len(neg)) else 0
  # rate_negative_words
  res[83] = len(neg)/(len(pos)+len(neg)) if (len(pos)+len(neg)) else 0
  # avg_positive_polarity
  res[84] = sum(pos)/len(pos) if len(pos) else 0.75
  # min_positive_polarity
  res[85] = min(pos) if len(pos) else 0.75
  # max_positive_polarity
  res[86] = max(pos) if len(pos) else 0.75
  # avg_negative_polarity
  res[87] = sum(neg)/len(neg) if len(neg) else -0.75
  # min_negative_polarity
  res[88] = min(neg) if len(neg) else -0.75
  # max_negative_polarity
  res[89] = max(neg) if len(neg) else -0.75
  # title_sentiment_subjectivity
  res[90] = TextBlob(title_tag.text).sentiment.subjectivity
  # title_sentiment_polarity
  res[91] = TextBlob(title_tag.text).sentiment.polarity
  # abs_title_sentiment_subjectivity
  res[92] = abs(res[90]-0.5)
  # abs_title_sentiment_polarity
  res[93] = abs(res[91])
  # topic
  for i, v in enumerate(all_topic):
    res[94+i] = topic_dict.get(i, 0);
  return res

Feature Extraction: 104 feature

*  0) find the days between the article publication and the dataset acquisition
*  1) n_tokens_title: Number of words in the title
*  2) n_tokens_content: Number of words in the content
*  3) n_unique_tokens: Rate of unique words in the content
*  4) n_non_stop_words: Rate of non-stop words in the content
*  5) n_non_stop_unique_tokens: Rate of unique non-stop words in the content
*  6) num_hrefs: Number of links
*  7) num_self_hrefs: Number of links to other articles published by Mashable
*  8) num_imgs: Number of images
*  9) num_videos: Number of videos
*  10) num_iframes: Number of inline frames
*  11) average_token_length: Average length of the words in the content
*  12) sentiment_subjectivity: content subjectivity
*  13) sentiment_polarity: content sentiment polarity
*  14) weekend: Was the article published on the weekend?
*  15-21) weekday: Was the article published on <Mon. Tue. Wen. Thur. Fri. Sat. Sun.>?
*  22-45) hour: Was the article published on <24 hours>?
*  46-79) channel: Is data channel <channels...>?
*  80) global_rate_positive_words: Rate of positive words in the content
*  81) global_rate_negative_words: Rate of negative words in the content
*  82) rate_positive_words: Rate of positive words among non-neutral tokens
*  83) rate_negative_words: Rate of negative words among non-neutral tokens
*  84) avg_positive_polarity: Avg. polarity of positive words
*  85) min_positive_polarity: Min. polarity of positive words
*  86) max_positive_polarity: Max. polarity of positive words
*  87) avg_negative_polarity: Avg. polarity of negative words
*  88) min_negative_polarity: Min. polarity of negative words
*  89) max_negative_polarity: Max. polarity of negative words
*  90) title_subjectivity: Title subjectivity
*  91) title_sentiment_polarity: Title polarity
*  92) abs_title_subjectivity: Absolute subjectivity level
*  93) abs_title_sentiment_polarity: Absolute polarity level
*  94-103) Closeness to LDA topic <0-9>
*  104) title_keyword: Indicates if the title contains the keyword selected by TF-IDF.
*  105)	title_tfidf: Represents the TF-IDF values of each title.
*  106) popular_time_falg: Indicates whether the time is during a popular period.


In [None]:
def feature_df(X, lda, id2word):
  '''
  X is a np array, whose element contains raw data (html).
  ex: X = ['<html><head>...', '<html><head>...', ...]
  '''
  label = ['timedelta', 'n_tokens_title', 'n_tokens_content', 'n_unique_tokens', 'n_non_stop_words',
        'n_non_stop_unique_tokens', 'num_hrefs', 'num_self_hrefs', 'num_imgs', 'num_videos',
        'num_iframes', 'average_token_length', 'sentiment_subjectivity', 'sentiment_polarity', 'weekend'] + \
        ['monday', 'tuesday', 'wednesday', 'thursday', 'friday', 'saturday', 'sunday'] + \
        ['hour_'+str(i) for i in range(24)] + \
        ['travel-leisure', 'gaming', 'tech', 'music', 'gadgets', 'film', 'viral',
        'business', 'memes', 'world', 'lifestyle', 'watercooler', 'sports', 'marketing',
        'bus', 'mobile', 'comics', 'mob', 'us', 'dev-design', 'how-to', 'jobs', 'home',
        'advertising', 'media', 'small-business', 'howto', 'apps-software', 'social-media',
        'startups', 'socmed', 'conversations', 'entertainment', 'pics'] + \
        ['global_rate_positive_words', 'global_rate_negative_words', 'rate_positive_words', 'rate_negative_words',
        'avg_positive_polarity', 'min_positive_polarity', 'max_positive_polarity',
        'avg_negative_polarity', 'min_negative_polarity', 'max_negative_polarity',
        'title_sentiment_subjectivity', 'title_sentiment_polarity', 'abs_title_sentiment_subjectivity', 'abs_title_sentiment_polarity'] + \
        ['LDA_'+str(i) for i in range(10)]
  return pd.DataFrame([feature_extraction(i, lda, id2word) for i in X], columns=label)

In [None]:
def feature_df2(df):
  df_soup = create_df_soup(df)
  # df_test_soup = create_df_soup(test_df)

  title_keyword_features, tfidf_matrix = title_TDIDF(df_soup)

  reduced_tfidf = tfidf_matrix.toarray()
  reduced_tfidf = np.mean(reduced_tfidf, axis=1)

  title_keyword_features_reduced = np.sum(title_keyword_features.values, axis=1)
  dim2 = np.column_stack((title_keyword_features_reduced, reduced_tfidf)) # Directly combining them into a two-dimensional array (one dimension for TF-IDF values and one for keywords) enhances the AUC

  popular_time_falg = create_popular_time(df_soup)
  dim3 = np.hstack((dim2, np.array(popular_time_falg).reshape(-1, 1)))

  df_dim3 = pd.DataFrame(dim3)
  column_names = ['title_keyword', 'title_tfidf', 'popular_time_falg']
  df_dim3df = pd.DataFrame(dim3, columns=column_names)

  return df_dim3df

In [None]:
data = [tokenizer_stem_nostop(preprocessor(i)) for i in X_train_raw]
id2word = Dictionary(data)
corpus = [id2word.doc2bow(i) for i in data]
lda = LdaModel(corpus=corpus, id2word=id2word, num_topics=10)


X_train_feature_df = feature_df(X_train_raw, lda, id2word)
X_test_feature_df = feature_df(X_test_raw, lda, id2word)

In [None]:
X_train_feature_df2 = feature_df2(train_df)
X_test_feature_df2 = feature_df2(test_df)

In [None]:
X_train_feature_df = pd.concat([X_train_feature_df, X_train_feature_df2], axis=1)
X_test_feature_df = pd.concat([X_test_feature_df, X_test_feature_df2], axis=1)

In [None]:
X_train_feature_df.head(3)

Unnamed: 0,timedelta,n_tokens_title,n_tokens_content,n_unique_tokens,n_non_stop_words,n_non_stop_unique_tokens,num_hrefs,num_self_hrefs,num_imgs,num_videos,...,LDA_3,LDA_4,LDA_5,LDA_6,LDA_7,LDA_8,LDA_9,title_keyword,title_tfidf,popular_time_falg
0,3407,8,577,0.618333,0.658333,0.865229,22,8,1,0,...,0.0,0.0,0.913512,0.0,0.0,0.038722,0.0,0.0,0.000141,0.0
1,3490,12,305,0.599407,0.664688,0.806931,18,10,2,0,...,0.995358,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000137,0.0
2,3085,12,1114,0.610635,0.671527,0.891854,11,7,2,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.827832,0.0,0.000185,0.0


In [None]:
X_test_feature_df.head(3)

Unnamed: 0,timedelta,n_tokens_title,n_tokens_content,n_unique_tokens,n_non_stop_words,n_non_stop_unique_tokens,num_hrefs,num_self_hrefs,num_imgs,num_videos,...,LDA_3,LDA_4,LDA_5,LDA_6,LDA_7,LDA_8,LDA_9,title_keyword,title_tfidf,popular_time_falg
0,3325,11,527,0.692168,0.752277,0.85,30,15,1,0,...,0.0,0.143564,0.112824,0.269268,0.0,0.028707,0.020707,0.0,0.000257,0.0
1,3273,6,142,0.652174,0.652174,0.761905,13,10,3,0,...,0.821052,0.0,0.0,0.0,0.0,0.0,0.170495,0.0,0.000206,1.0
2,3401,8,164,0.741758,0.675824,0.807407,13,6,2,0,...,0.355931,0.0,0.0,0.0,0.12682,0.0,0.390952,0.0,0.000147,1.0


In [None]:
# with open('/content/drive/MyDrive/Colab Notebooks/DL/Competition 01_Predicting_News_Popularity/X_train_feature_df.pkl', 'wb') as file:
#     pickle.dump(X_train_feature_df, file)

# with open('/content/drive/MyDrive/Colab Notebooks/DL/Competition 01_Predicting_News_Popularity/X_test_feature_df.pkl', 'wb') as file:
#     pickle.dump(X_test_feature_df, file)

In [4]:
import pickle
with open('/content/drive/MyDrive/Colab Notebooks/DL/Competition 01_Predicting_News_Popularity/X_train_feature_df.pkl', 'rb') as file:
    X_train_feature_df = pickle.load(file)

In [5]:
import pickle
with open('/content/drive/MyDrive/Colab Notebooks/DL/Competition 01_Predicting_News_Popularity/X_test_feature_df.pkl', 'rb') as file:
    X_test_feature_df = pickle.load(file)

In [None]:
X_train_feature_df

Unnamed: 0,timedelta,n_tokens_title,n_tokens_content,n_unique_tokens,n_non_stop_words,n_non_stop_unique_tokens,num_hrefs,num_self_hrefs,num_imgs,num_videos,...,LDA_3,LDA_4,LDA_5,LDA_6,LDA_7,LDA_8,LDA_9,title_keyword,title_tfidf,popular_time_falg
0,3407,8,577,0.618333,0.658333,0.865229,22,8,1,0,...,0.000000,0.000000,0.505577,0.000000,0.000000,0.000000,0.000000,0.0,0.000141,0.0
1,3490,12,305,0.599407,0.664688,0.806931,18,10,2,0,...,0.511926,0.000000,0.000000,0.000000,0.000000,0.000000,0.375583,0.0,0.000137,0.0
2,3085,12,1114,0.610635,0.671527,0.891854,11,7,2,0,...,0.000000,0.045030,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000185,0.0
3,3293,5,278,0.825342,0.784247,0.871369,13,6,1,0,...,0.000000,0.000000,0.000000,0.000000,0.877795,0.000000,0.000000,0.0,0.000100,1.0
4,3105,10,1370,0.550949,0.751230,0.934949,16,8,52,0,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000136,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27638,3114,9,303,0.623932,0.632479,0.835616,12,7,2,0,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.236871,0.0,0.000117,0.0
27639,3022,9,317,0.623932,0.638177,0.831050,23,14,3,0,...,0.899058,0.000000,0.038657,0.059009,0.000000,0.000000,0.000000,0.0,0.000156,0.0
27640,3021,8,170,0.703704,0.746032,0.789474,23,9,15,0,...,0.000000,0.000000,0.137163,0.107332,0.541112,0.000000,0.000000,0.0,0.000100,1.0
27641,3471,8,430,0.639198,0.634744,0.829268,17,7,3,0,...,0.417492,0.000000,0.000000,0.000000,0.000000,0.021826,0.557996,0.0,0.000158,0.0


In [None]:
X_test_feature_df

Unnamed: 0,timedelta,n_tokens_title,n_tokens_content,n_unique_tokens,n_non_stop_words,n_non_stop_unique_tokens,num_hrefs,num_self_hrefs,num_imgs,num_videos,...,LDA_3,LDA_4,LDA_5,LDA_6,LDA_7,LDA_8,LDA_9,title_keyword,title_tfidf,popular_time_falg
0,3325,11,527,0.692168,0.752277,0.850000,30,15,1,0,...,0.000000,0.468118,0.037595,0.052271,0.160124,0.108477,0.000000,0.0,0.000257,0.0
1,3273,6,142,0.652174,0.652174,0.761905,13,10,3,0,...,0.779021,0.000000,0.000000,0.196693,0.000000,0.000000,0.000000,0.0,0.000206,1.0
2,3401,8,164,0.741758,0.675824,0.807407,13,6,2,0,...,0.735742,0.000000,0.000000,0.040888,0.000000,0.000000,0.217057,0.0,0.000147,1.0
3,3533,6,153,0.765714,0.782857,0.843284,15,9,1,0,...,0.000000,0.682230,0.023915,0.000000,0.000000,0.000000,0.000000,0.0,0.000183,0.0
4,2936,12,219,0.780488,0.707317,0.817708,10,9,1,0,...,0.000000,0.279056,0.332744,0.000000,0.000000,0.254863,0.000000,0.0,0.000233,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11842,3537,8,123,0.756757,0.689189,0.830357,12,9,3,0,...,0.000000,0.087117,0.141892,0.000000,0.000000,0.000000,0.636433,0.0,0.000210,1.0
11843,3469,10,565,0.586735,0.593537,0.811594,10,9,4,0,...,0.186810,0.000000,0.611977,0.065516,0.000000,0.000000,0.084627,0.0,0.000183,0.0
11844,3190,10,973,0.520669,0.800197,0.948960,33,8,10,0,...,0.000000,0.919244,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000182,0.0
11845,3337,9,130,0.765101,0.744966,0.850877,11,7,2,0,...,0.751882,0.000000,0.000000,0.000000,0.000000,0.146200,0.000000,0.0,0.000180,1.0


### Normalize

Use MinMaxScaler to normalize above data.

In [6]:
# Normalize
sc = MinMaxScaler()
label = X_train_feature_df.columns
sc.fit(X_train_feature_df[label])
X_train_feature_df[label] = sc.transform(X_train_feature_df[label])
X_test_feature_df[label] = sc.transform(X_test_feature_df[label])

print(X_train_feature_df.iloc[:1, :].values)

display(X_train_feature_df.head())

[[7.91025641e-01 3.33333333e-01 6.91568047e-02 5.57464080e-01
  2.67380980e-01 6.33515985e-01 6.45161290e-02 3.95480226e-02
  8.92857143e-03 0.00000000e+00 0.00000000e+00 2.20867390e-04
  4.34848485e-01 4.57048379e-01 0.00000000e+00 0.00000000e+00
  0.00000000e+00 1.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 1.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 1.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.0000

Unnamed: 0,timedelta,n_tokens_title,n_tokens_content,n_unique_tokens,n_non_stop_words,n_non_stop_unique_tokens,num_hrefs,num_self_hrefs,num_imgs,num_videos,...,LDA_3,LDA_4,LDA_5,LDA_6,LDA_7,LDA_8,LDA_9,title_keyword,title_tfidf,popular_time_falg
0,0.791026,0.333333,0.069157,0.557464,0.267381,0.633516,0.064516,0.039548,0.008929,0.0,...,0.0,0.0,0.506093,0.0,0.0,0.0,0.0,0.0,0.611063,0.0
1,0.897436,0.555556,0.035626,0.535519,0.282128,0.440916,0.051613,0.050847,0.017857,0.0,...,0.512137,0.0,0.0,0.0,0.0,0.0,0.375973,0.0,0.597148,0.0
2,0.378205,0.555556,0.135355,0.548538,0.297996,0.721476,0.029032,0.033898,0.017857,0.0,...,0.0,0.045092,0.0,0.0,0.0,0.0,0.0,0.0,0.804166,0.0
3,0.644872,0.166667,0.032298,0.797488,0.559568,0.653801,0.035484,0.028249,0.008929,0.0,...,0.0,0.0,0.0,0.0,0.880099,0.0,0.0,0.0,0.435683,1.0
4,0.403846,0.444444,0.166913,0.479333,0.482951,0.863849,0.045161,0.039548,0.464286,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.591407,0.0


## Training



1.   AdaBoost Classifier
2.   Logistic Regression
3.   Random Forest Classifier
4.   XGBoost classifier
5.   LightBGM classifier

Originally used the RFECV for feature selection, but found out that although training speed is reduce, the performance is a little bit lower, therefore we didn't use it in the end.

We us
Using the three models of ADA logistic Randomforest for train and adjust parameters, it turn out that Random Forest give the best outcome.

We further applied XGBOOST and LightBGM for softening, and then averaged the results with the predictions from the previously voted Random Forest Classifier, achieving satisfactory performance.

In [None]:
import os
from os.path import join as pjoin
import pandas as pd
import pickle
import time
# from tqdm import tqdm
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

import xgboost as xgb
import lightgbm as lgbm
from sklearn.model_selection import cross_val_score, train_test_split
from hyperopt import Trials, STATUS_OK, tpe, hp, fmin
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, balanced_accuracy_score, confusion_matrix
from sklearn.metrics import auc

### AdaBoost Classifier

In [None]:
estimator_AB = AdaBoostClassifier(random_state=0)
selector_AB = RFECV(estimator_AB, step=1, cv=5)
selector_AB = selector_AB.fit(X_train_feature_df, y_train_raw)
selector_AB.ranking_
X_train_feature_AB_df = X_train_feature_df[X_train_feature_df.columns.values[selector_AB.ranking_ == 1]]
X_test_feature_AB_df = X_test_feature_df[X_test_feature_df.columns.values[selector_AB.ranking_ == 1]]

In [None]:
X = X_train_feature_AB_df.values
X_test = X_test_feature_AB_df.values

y = y_train_raw[:, np.newaxis]

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=0)

In [None]:
md = AdaBoostClassifier(random_state=0, n_estimators=500, learning_rate=0.5)
md.fit(X_train, y_train)

y_valid_pred_proba = md.predict_proba(X_valid)
fpr, tpr, thresholds = roc_curve(y_valid, y_valid_pred_proba[:, 1])
print('AUC: %.5f' % auc(fpr, tpr))

AUC: 0.58261


In [None]:
y_test_pred_proba = md.predict_proba(X_test)

d = {'Id': test_df['Id'], 'Popularity': y_test_pred_proba[:,1]}
df = pd.DataFrame(data=d)
df.to_csv('/content/drive/MyDrive/Colab Notebooks/DL/Competition 01_Predicting_News_Popularity/ada.csv', index=False)

### Logistic Regression

In [None]:
estimator_LR = LogisticRegression(random_state=0)
selector_LR = RFECV(estimator_LR, step=1, cv=5)
selector_LR = selector_LR.fit(X_train_feature_df, y_train_raw)
selector_LR.ranking_
X_train_feature_LR_df = X_train_feature_df[X_train_feature_df.columns.values[selector_LR.ranking_ == 1]]
X_test_feature_LR_df = X_test_feature_df[X_test_feature_df.columns.values[selector_LR.ranking_ == 1]]

In [None]:
X = X_train_feature_LR_df.values
X_test = X_test_feature_LR_df.values

y = y_train_raw[:, np.newaxis]

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=0)

In [None]:
md = LogisticRegression(random_state=0, solver="saga", class_weight='balanced', C=2)

md.fit(X_train, y_train)

y_valid_pred_proba = md.predict_proba(X_valid)
fpr, tpr, thresholds = roc_curve(y_valid, y_valid_pred_proba[:, 1])
print('AUC: %.5f' % auc(fpr, tpr))

AUC: 0.56952


In [None]:
y_test_pred_proba = md.predict_proba(X_test)

d = {'Id': test_df['Id'], 'Popularity': y_test_pred_proba[:,1]}
df = pd.DataFrame(data=d)
df.to_csv('/content/drive/MyDrive/Colab Notebooks/DL/Competition 01_Predicting_News_Popularity/lr.csv', index=False)

### Random Forest Classifier

In [None]:
estimator_RF = RandomForestClassifier(random_state=0)
selector_RF = RFECV(estimator_RF, step=1, cv=5)
selector_RF = selector_RF.fit(X_train_feature_df, y_train_raw)
selector_RF.ranking_
X_train_feature_RF_df = X_train_feature_df[X_train_feature_df.columns.values[selector_RF.ranking_ == 1]]
X_test_feature_RF_df = X_test_feature_df[X_test_feature_df.columns.values[selector_RF.ranking_ == 1]]

In [None]:
X = X_train_feature_RF_df.values
X_test = X_test_feature_RF_df.values

y = y_train_raw[:, np.newaxis]

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=0)

In [None]:
md = RandomForestClassifier(random_state=0, class_weight='balanced', n_estimators=1000)

md.fit(X_train, y_train)

y_valid_pred_proba = md.predict_proba(X_valid)
fpr, tpr, thresholds = roc_curve(y_valid, y_valid_pred_proba[:, 1])
print('AUC: %.5f' % auc(fpr, tpr))

AUC: 0.57912


In [None]:
y_test_pred_proba = md.predict_proba(X_test)

d = {'Id': test_df['Id'], 'Popularity': y_test_pred_proba[:,1]}
df = pd.DataFrame(data=d)
df.to_csv('/content/drive/MyDrive/Colab Notebooks/DL/Competition 01_Predicting_News_Popularity/rf.csv', index=False)

### XGBoost

In [2]:
# Must modify the data path
train_df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/DL/Competition 01_Predicting_News_Popularity/Data/train.csv')
test_df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/DL/Competition 01_Predicting_News_Popularity/Data/test.csv')

test_string = train_df.loc[0,'Page content']
train_df['Popularity'] = train_df['Popularity'].map(lambda x: (x+1)//2)

X_train_raw = train_df['Page content'].values
y_train_raw = train_df['Popularity'].values
X_test_raw = test_df['Page content'].values

In [7]:
# Normalize
sc = MinMaxScaler()
label = X_train_feature_df.columns
sc.fit(X_train_feature_df[label])
X_train_feature_df[label] = sc.transform(X_train_feature_df[label])
X_test_feature_df[label] = sc.transform(X_test_feature_df[label])
X_test_feature_df.head(3)

X = X_train_feature_df
y = pd.DataFrame(y_train_raw[:, np.newaxis])[0]

train_dict = {'X_train': X, 'y_train':y}
test_dict = {}

In [8]:
def classifier_XGBoost(scoring, max_evals, train_dict, test_dict={}):

    X_train, y_train = [train_dict['X_train'], train_dict['y_train']]

    def objective(space):
        classifier = xgb.XGBClassifier(n_estimators = space['n_estimators'],
                                    max_depth = int(space['max_depth']),
                                    learning_rate = space['learning_rate'],
                                    gamma = space['gamma'],
                                    min_child_weight = space['min_child_weight'],
                                    subsample = space['subsample'],
                                    colsample_bytree = space['colsample_bytree'],
                                    )

        classifier.fit(X_train, y_train)

        Scores = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 3, scoring=scoring)
        score = Scores.mean()
        print("cross_val_score AUC: ",score)
        loss = 1-score
        return {'loss': loss, 'status': STATUS_OK}

    # Tune Hyperparams
    space = {
        'max_depth' : hp.choice('max_depth', range(5, 30, 1)),
        'learning_rate' : hp.quniform('learning_rate', 0.01, 0.5, 0.01),
        'n_estimators' : hp.choice('n_estimators', range(20, 205, 5)),
        'gamma' : hp.quniform('gamma', 0, 0.50, 0.01),
        'min_child_weight' : hp.quniform('min_child_weight', 1, 10, 1),
        'subsample' : hp.quniform('subsample', 0.1, 1, 0.01),
        'colsample_bytree' : hp.quniform('colsample_bytree', 0.1, 1.0, 0.01)}

    trials = Trials()
    print("Tuning Hyperparameters ...")
    best = fmin(fn=objective,
                space=space,
                algo=tpe.suggest,
                max_evals=max_evals,
                trials=trials)

    print("Best Hyperparameters: ", best)

    # Fit the best model
    BestModel = xgb.XGBClassifier(n_estimators = best['n_estimators'],
                                max_depth = best['max_depth'],
                                learning_rate = best['learning_rate'],
                                gamma = best['gamma'],
                                min_child_weight = best['min_child_weight'],
                                subsample = best['subsample'],
                                colsample_bytree = best['colsample_bytree'],
                                )

    BestModel.fit(X_train, y_train)

    print('XGBoostClassifier Performance:')
    Scores = cross_val_score(estimator = BestModel, X = X_train, y = y_train, cv = 3, scoring='roc_auc')
    score_train = Scores.mean()
    print("Train Set 3-Fold roc_auc-Score: ", score_train)

    out_dict = {'model': BestModel, 'score': score_train}
    return out_dict

In [21]:
from hyperopt import Trials, STATUS_OK, tpe, hp, fmin
import xgboost as xgb
from sklearn.model_selection import cross_val_score, train_test_split
import lightgbm as lgbm

In [17]:
max_evals = 30 # max iters for tunings hyperparameters
Scorings = ['roc_auc']
out = {}
out['scoring'] = classifier_XGBoost(scoring='roc_auc', max_evals=max_evals, train_dict=train_dict, test_dict=test_dict)

Tuning Hyperparameters ...
cross_val_score AUC: 
0.5420666668062405
cross_val_score AUC: 
0.5699477787639785
cross_val_score AUC: 
0.5515267267420575
cross_val_score AUC: 
0.5167015903607713
cross_val_score AUC: 
0.5477953144953788
cross_val_score AUC: 
0.5815852023307988
cross_val_score AUC: 
0.5674834263709423
cross_val_score AUC: 
0.5207809506439137
cross_val_score AUC: 
0.5473201772116791
cross_val_score AUC: 
0.5513850061324379
cross_val_score AUC: 
0.5430122030909331
cross_val_score AUC: 
0.5540509882656447
cross_val_score AUC: 
0.571390015853798
cross_val_score AUC: 
0.589762518810813
cross_val_score AUC: 
0.5426814326055008
cross_val_score AUC: 
0.5540378831435468
cross_val_score AUC: 
0.5350944705618659
cross_val_score AUC: 
0.5755345906154842
cross_val_score AUC: 
0.5153124221479061
cross_val_score AUC: 
0.547256885126962
cross_val_score AUC: 
0.5587924435200796
cross_val_score AUC: 
0.5593070700117139
cross_val_score AUC: 
0.5798896060509324
cross_val_score AUC: 
0.554719741

In [18]:
y_proba_XGBoost = out['scoring']['model'].predict_proba(X_test_feature_df)
ans_XGBoost = pd.DataFrame()
ans_XGBoost['Id'] = list(range(27643,39490))
ans_XGBoost['Popularity'] = y_proba_XGBoost[:,1]

In [None]:
ans_XGBoost.to_csv('/content/drive/MyDrive/Colab Notebooks/DL/Competition 01_Predicting_News_Popularity/xgb.csv', index=False)

### LightGBM

In [19]:
def classifier_LightGBM(scoring, max_evals, train_dict, test_dict={}):

    X_train, y_train = [train_dict['X_train'], train_dict['y_train']]
    def objective(space):
        classifier = lgbm.LGBMClassifier(n_estimators = space['n_estimators'],
                                    max_depth = int(space['max_depth']),
                                    learning_rate = space['learning_rate'],
                                    min_child_weight = space['min_child_weight'],
                                    subsample = space['subsample'],
                                    colsample_bytree = space['colsample_bytree'],
                                    verbosity = -1,
                                         )

        classifier.fit(X_train, y_train)

        Scores = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 3, scoring=scoring)
        score = Scores.mean()
        print("cross_val_score AUC: ",score)
        loss = 1-score
        return {'loss': loss, 'status': STATUS_OK}

    # Tune Hyperparams
    space = {
        'max_depth' : hp.choice('max_depth', range(5, 30, 1)),
        'learning_rate' : hp.quniform('learning_rate', 0.01, 0.5, 0.01),
        'n_estimators' : hp.choice('n_estimators', range(20, 205, 5)),
        'min_child_weight' : hp.quniform('min_child_weight', 1, 10, 1),
        'subsample' : hp.quniform('subsample', 0.1, 1, 0.01),
        'colsample_bytree' : hp.quniform('colsample_bytree', 0.1, 1.0, 0.01),
        'num_leaves': hp.choice('num_leaves', range(5, 50, 1)),
        'min_child_samples': hp.choice('min_child_samples', range(1, 10, 1)),
        }

    trials = Trials()
    print("Tuning Hyperparameters ...")
    best = fmin(fn=objective,
                space=space,
                algo=tpe.suggest,
                max_evals=max_evals,
                trials=trials)

    print("Best Hyperparameters: ", best)

    # Fit the best model
    BestModel = lgbm.LGBMClassifier(n_estimators = best['n_estimators'],
                                max_depth = best['max_depth'],
                                learning_rate = best['learning_rate'],
                                min_child_weight = best['min_child_weight'],
                                subsample = best['subsample'],
                                colsample_bytree = best['colsample_bytree'],
                                verbosity = -1,
                                )

    BestModel.fit(X_train, y_train)

    print('LightGBMClassifier Performance:')
    Scores = cross_val_score(estimator = BestModel, X = X_train, y = y_train, cv = 3, scoring='roc_auc')
    score_train = Scores.mean()
    print("Train Set 3-Fold AUC-Score: ", score_train)

    out_dict = {'model': BestModel, 'score': score_train}

    return out_dict

In [22]:
max_evals = 30 # max iters for tunings hyperparameters
Scorings = ['roc_auc']
out = {}
out['scoring'] = classifier_LightGBM(scoring='roc_auc', max_evals=max_evals, train_dict=train_dict, test_dict=test_dict)

Tuning Hyperparameters ...
cross_val_score AUC: 
0.5447596688450163
cross_val_score AUC: 
0.5537374869348998
cross_val_score AUC: 
0.5705512895259381
cross_val_score AUC: 
0.5498384051349311
cross_val_score AUC: 
0.5454914476499813
cross_val_score AUC: 
0.5751134194937785
cross_val_score AUC: 
0.551484804960109
cross_val_score AUC: 
0.560164353771663
cross_val_score AUC: 
0.5689846353550466
cross_val_score AUC: 
0.5752982650040863
cross_val_score AUC: 
0.5831147087012906
cross_val_score AUC: 
0.5686335326984544
cross_val_score AUC: 
0.5813725985940632
cross_val_score AUC: 
0.562795653152307
cross_val_score AUC: 
0.5538139780371852
cross_val_score AUC: 
0.5633251405400476
cross_val_score AUC: 
0.5901818030290346
cross_val_score AUC: 
0.5634249090965936
cross_val_score AUC: 
0.5850162061665555
cross_val_score AUC: 
0.5556896130457868
cross_val_score AUC: 
0.5903619249777211
cross_val_score AUC: 
0.5921711356884654
cross_val_score AUC: 
0.5910430512621708
cross_val_score AUC: 
0.593594946

In [23]:
y_proba_LightGBM = out['scoring']['model'].predict_proba(X_test_feature_df)
ans_LightGBM = pd.DataFrame()
ans_LightGBM['Id'] = list(range(27643,39490))
ans_LightGBM['Popularity'] = y_proba_LightGBM[:,1]

In [None]:
ans_XGBoost.to_csv('/content/drive/MyDrive/Colab Notebooks/DL/Competition 01_Predicting_News_Popularity/lgbm.csv', index=False)

### Soft voting

In [None]:
# List of input file names
input_files = ['ada.csv', 'rf.csv', 'xgb.csv', 'lgbm.csv']
weights = [0.5857, 0.5810, 0.5834, 0.5938]

# Column name
column_name = 'Popularity'

# Read each input CSV file and store the data in a DataFrame
data_frames = [pd.read_csv(f'/content/drive/MyDrive/Colab Notebooks/DL/Competition 01_Predicting_News_Popularity/{file}') for file in input_files]

# Calculate the average of the "number" column across all DataFrames
average_data = sum(df[column_name] * weights[idx] for idx, df in enumerate(data_frames)) / len(data_frames)

# Create a new DataFrame with the averaged values
average_df = pd.DataFrame({'id': data_frames[0]['Id'], column_name: average_data})

# Save the averaged data to a new CSV file
average_df.to_csv("/content/drive/MyDrive/Colab Notebooks/DL/Competition 01_Predicting_News_Popularity/average.csv", index=False)

## Conclusion

After conducting numerous experiments, I found that when employing multiple classifiers to vote on the same features, their performance is superior compared to using a single classifier with identical features.

## Reference

- Feature
  - https://link.springer.com/content/pdf/10.1007/978-3-319-23485-4_53.pdf
  - https://medium.com/@syedsadiqalinaqvi/predicting-popularity-of-online-news-articles-a-data-scientists-report-fac298466e7
  - https://github.com/ymdong/MLND-Online-News-Popularity-Prediction
  - https://www.researchgate.net/publication/306061597_Predicting_the_Popularity_of_News_Articles

- LDA model
  - https://github.com/kapadias/mediumposts/blob/master/natural_language_processing/topic_modeling/notebooks/Introduction%20to%20Topic%20Modeling.ipynb
  - https://github.com/arezaz/ensemble-binary-classification
  