# Feedly ML Internship Challenge - Julien Duquesne
## Introduction
This challenge was proposed by Feedly to evaluate my ML skills for an internship on Leo (Feedly's ML program). The goal is to recognize articles dealing with leadership, a rather abstract topic that is difficult to categorize.
It consists of 5 steps:
* Downloading and exploring data
* Rule-based approach
* Categorizing our data
* Evaluating our rule-based models
* Supervised approach

## Step 1 : Downloading and exploring data

As data, I used the last 500 articles from the Harvard Business Review feed, which can be downloaded through the Feedly API. First, I need to install the Feedly client library and authenticate with my Feedly token, which I found on the console page of my account.

In [1]:
!pip install feedly-client --quiet

In [3]:
# Import the main libraries and the Feedly client library

from feedly.session import FeedlySession
from feedly.data import StreamOptions
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')  # "error", "ignore", "always", "default", "module" or "once"

import urllib3
urllib3.disable_warnings() #Disable warnings because warnings about HTTPS were displayed at each request

In [None]:
# Enter your Feedly token here
token = input()

In [4]:
# Initializing Feedly session
sess = FeedlySession(auth=token)

In [5]:
# Downloading data with the code provided in the challenge statement
source_feed = "feed/http://feeds.harvardbusiness.org/harvardbusiness/"
base_query = f"/v3/streams/contents?streamId={source_feed}"
data = sess.do_api_request(base_query)

In [6]:
print('Type of data received : ' + str(type(data)))
print(data.keys())
print(len(data['items']))

Type of data received : <class 'dict'>
dict_keys(['id', 'title', 'direction', 'updated', 'alternate', 'continuation', 'items'])
20


As shown above, `data` is a dictionary whose interesting values are under the `items` key: an array of dictionaries containing the features of each article. However, I obtained only 20 articles, which is not enough. So I explored the Feedly API documentation at https://developer.feedly.com/v3/streams/ and discovered that I can add a `count` parameter to the base query, which I did.

In [55]:
# Same request as before, with the count parameter added to fetch 500 articles
source_feed = "feed/http://feeds.harvardbusiness.org/harvardbusiness/"
base_query = f"/v3/streams/contents?streamId={source_feed}&count=500"
data = sess.do_api_request(base_query)

In [8]:
print(len(data['items']))

500


Here I have my 500 articles! Let's convert them into a DataFrame to manipulate the data easily.

In [56]:
df = pd.DataFrame(data['items'])
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 22 columns):
alternate         500 non-null object
author            395 non-null object
canonical         500 non-null object
content           500 non-null object
crawled           500 non-null int64
engagement        500 non-null int64
engagementRate    21 non-null float64
fingerprint       500 non-null object
id                500 non-null object
keywords          500 non-null object
memes             43 non-null object
origin            500 non-null object
originId          500 non-null object
published         500 non-null int64
recrawled         43 non-null float64
summary           144 non-null object
title             500 non-null object
unread            500 non-null bool
updateCount       43 non-null float64
updated           500 non-null int64
visual            500 non-null object
webfeeds          500 non-null object
dtypes: bool(1), float64(3), int64(4), object(14)
memory usage: 82

I see that, alongside the title and content fields, I have other useful features, the most important ones being the engagement, the author and the keywords describing the article.

In [10]:
df.head()

Unnamed: 0,alternate,author,canonical,content,crawled,engagement,engagementRate,fingerprint,id,keywords,...,originId,published,recrawled,summary,title,unread,updateCount,updated,visual,webfeeds
0,[{'href': 'http://feeds.harvardbusiness.org/~r...,Lauren Golembiewski,[{'href': 'https://hbr.org/2019/04/how-wearabl...,{'content': '<p>New devices offer less intrusi...,1556633327693,6073,3.76,34120a51,RUSFMaap2epUB1Hxr7coT6Cd7n5A3BtYAIay0ExHdcs=_1...,"[Technology, Talent management, Digital Article]",...,"tag:blogs.harvardbusiness.org,2007-03-31:999.2...",1556632812000,,,How Wearable AI Will Amplify Human Intelligence,True,,1556632812000,"{'processor': 'feedly-nikon-v3.1', 'url': 'htt...","{'relatedLayout': 'card', 'relatedTarget': 'br..."
1,[{'href': 'http://feeds.harvardbusiness.org/~r...,,[{'href': 'https://hbr.org/ideacast/2019/04/ho...,"{'content': '<p>Kimberly Whitler, assistant pr...",1556632875292,5757,3.56,ed2a5c20,RUSFMaap2epUB1Hxr7coT6Cd7n5A3BtYAIay0ExHdcs=_1...,"[Marketing, International business, Audio]",...,"tag:blogs.harvardbusiness.org,2007-03-31:999.2...",1556631032000,,,How China Is Upending Western Marketing Practices,True,,1556631032000,"{'processor': 'feedly-nikon-v3.1', 'url': 'htt...","{'relatedLayout': 'card', 'relatedTarget': 'br..."
2,[{'href': 'http://feeds.harvardbusiness.org/~r...,Larry Downes,[{'href': 'https://hbr.org/2019/04/the-u-s-gov...,{'content': '<p>Digital infrastructure is best...,1556631146678,5717,3.51,d8615a32,RUSFMaap2epUB1Hxr7coT6Cd7n5A3BtYAIay0ExHdcs=_1...,"[Internet, Regulation, Policy, Digital Article]",...,"tag:blogs.harvardbusiness.org,2007-03-31:999.2...",1556629207000,,,The U.S. Government Shouldn’t Run the Country’...,True,,1556629207000,"{'processor': 'feedly-nikon-v3.1', 'url': 'htt...","{'relatedLayout': 'card', 'relatedTarget': 'br..."
3,[{'href': 'http://feeds.harvardbusiness.org/~r...,Liz Morris,[{'href': 'https://hbr.org/2019/04/how-compani...,"{'content': '<p>Discrimination is widespread, ...",1556629196457,5806,3.54,553c73a5,RUSFMaap2epUB1Hxr7coT6Cd7n5A3BtYAIay0ExHdcs=_1...,"[Gender, Personnel policies, Digital Article]",...,"tag:blogs.harvardbusiness.org,2007-03-31:999.2...",1556625959000,,,How Companies Can Support Breastfeeding Employees,True,,1556625959000,"{'processor': 'feedly-nikon-v3.1', 'url': 'htt...","{'relatedLayout': 'card', 'relatedTarget': 'br..."
4,[{'href': 'http://feeds.harvardbusiness.org/~r...,,[{'href': 'https://hbr.org/podcast/2019/04/the...,{'content': '<p>There are a lot of reasons wom...,1556552454373,5525,2.85,a655c457,RUSFMaap2epUB1Hxr7coT6Cd7n5A3BtYAIay0ExHdcs=_1...,"[Gender, Audio]",...,"tag:audio.hbr.org,2018-01-01:999.227832",1556552036000,,,The Upside of Working Motherhood,True,,1556552036000,"{'processor': 'feedly-nikon-v3.1', 'url': 'htt...","{'relatedLayout': 'card', 'relatedTarget': 'br..."


I noticed that the content is not plain text but HTML, so I created a new feature containing only the text of the content. For this purpose, I used BeautifulSoup, a very useful web-scraping library capable of parsing HTML content.

In [57]:
from bs4 import BeautifulSoup

def extract_text(content):
    return BeautifulSoup(content['content'].replace("\n", ""), 'html.parser').text

df['text_content'] = df['content'].apply(extract_text)
df[['text_content']].head()

Unnamed: 0,text_content
0,"New devices offer less intrusive, more intuiti..."
1,"Kimberly Whitler, assistant professor at the U..."
2,Digital infrastructure is best left to the pri...
3,"Discrimination is widespread, and too few empl..."
4,There are a lot of reasons women should feel o...


In [12]:
#Checking if there has been an issue and if Beautiful soup has left some HTML tags
print(sum(df['text_content'].str.contains('<')))
print(sum(df['text_content'].str.contains('>')))


0
0


It seems OK: I have the plain text and can start analyzing it! The challenge asks how many words are capitalized, so I create columns containing the list of words in the title and in the content, plus one column that concatenates these two lists. Then I count how many words are capitalized and how many aren't.

In [58]:
df['title_words']=df['title'].str.split(' ')
df['content_words']=df['text_content'].str.split(' ')
df['total_words'] = df['title_words'] + df['content_words']
df[['title_words','content_words','total_words']].head()

Unnamed: 0,title_words,content_words,total_words
0,"[How, Wearable, AI, Will, Amplify, Human, Inte...","[New, devices, offer, less, intrusive,, more, ...","[How, Wearable, AI, Will, Amplify, Human, Inte..."
1,"[How, China, Is, Upending, Western, Marketing,...","[Kimberly, Whitler,, assistant, professor, at,...","[How, China, Is, Upending, Western, Marketing,..."
2,"[The, U.S., Government, Shouldn’t, Run, the, C...","[Digital, infrastructure, is, best, left, to, ...","[The, U.S., Government, Shouldn’t, Run, the, C..."
3,"[How, Companies, Can, Support, Breastfeeding, ...","[Discrimination, is, widespread,, and, too, fe...","[How, Companies, Can, Support, Breastfeeding, ..."
4,"[The, Upside, of, Working, Motherhood]","[There, are, a, lot, of, reasons, women, shoul...","[The, Upside, of, Working, Motherhood, There, ..."


In [59]:
#Counting capitalized words with a list comprehension checking 
# if the word is not empty and if the word first letter is upper
def count_capitalized(words):
    return sum([1 if (word and word[0].isupper()) else 0 for word in words])

def count_not_capitalized(words):
    return sum([1 if (word and word[0].islower()) else 0 for word in words])


df['count_capitalized'],df['count_not_capitalized'] = df['total_words'].apply(count_capitalized), df['total_words'].apply(count_not_capitalized)
ratio = sum(df['count_capitalized'])/sum(df['count_not_capitalized'])
print(f'Ratio of capitalized words over non capitalized words : {ratio}')

Ratio of capitalized words over non capitalized words : 0.12203174004291745


In [60]:
import operator

count_words = {}
#We create a list with all words of the articles
total_words = df['total_words'].sum()

# Looping over the list to count the words in a dictionary, which is computationally efficient
for word in total_words:
    lowered_word = word.lower()
    if lowered_word in count_words.keys():
        count_words[lowered_word] += 1
    else:
        count_words[lowered_word] = 1

sorted_words = sorted(count_words.items(), key=operator.itemgetter(1))
print(sorted_words[len(sorted_words)-10:])

[('are', 1121), ('is', 1400), ('for', 1479), ('that', 1902), ('in', 2402), ('a', 3249), ('of', 3633), ('and', 4388), ('to', 5018), ('the', 6066)]


As expected, the most frequent words are stop words. I need to remove them to find interesting words. For this I use the gensim library, which is commonly used for text processing.

In [61]:
import gensim
from gensim.parsing.preprocessing import STOPWORDS
count_words = {}
total_words = df['total_words'].sum()

for word in total_words:
    lowered_word = word.lower()
    if len(lowered_word)>1 and lowered_word not in STOPWORDS:
        if lowered_word in count_words.keys():
            count_words[lowered_word] += 1
        else:
            count_words[lowered_word] = 1

sorted_words = sorted(count_words.items(), key=operator.itemgetter(1))
print(sorted_words[len(sorted_words)-10:])

[('like', 207), ('data', 213), ('it’s', 218), ('need', 226), ('time', 243), ('business', 305), ('companies', 308), ('work', 351), ('people', 380), ('new', 411)]


This is much more interesting: words such as 'business', 'work' or 'companies' are very frequent, which is not surprising given the theme of the feed. But I can do even better and stem words so that, for instance, 'company' and 'companies' count as the same word. I will use gensim for tokenizing and NLTK for lemmatizing and stemming. I took inspiration from [Quentin's work](https://colab.research.google.com/drive/1jUpGwTaY9vJsUVw1tgwwXqKz6UOsvV1a#scrollTo=JfSH2BQVtAK3) and reused some of his preprocessing code.

In [62]:
from gensim import utils
from gensim import corpora, models
from nltk.stem import WordNetLemmatizer, SnowballStemmer
stemmer = SnowballStemmer('english')

In [63]:
# Stemming text using the NLTK stemmer
def lemmatize_stemming(text):
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

# Whole preprocessing of text
def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 1:
            result.append(lemmatize_stemming(token))
    return result

df["tokenized_content"] = df["text_content"].map(preprocess)
df["tokenized_title"] = df["title"].map(preprocess)
df[['title','tokenized_title','text_content','tokenized_content']].head()

Unnamed: 0,title,tokenized_title,text_content,tokenized_content
0,How Wearable AI Will Amplify Human Intelligence,"[wearabl, ai, amplifi, human, intellig]","New devices offer less intrusive, more intuiti...","[new, devic, offer, intrus, intuit, way, human..."
1,How China Is Upending Western Marketing Practices,"[china, upend, western, market, practic]","Kimberly Whitler, assistant professor at the U...","[kimber, whitler, assist, professor, univers, ..."
2,The U.S. Government Shouldn’t Run the Country’...,"[govern, shouldn, run, countri, network]",Digital infrastructure is best left to the pri...,"[digit, infrastructur, best, leav, privat, sec..."
3,How Companies Can Support Breastfeeding Employees,"[compani, support, breastfe, employe]","Discrimination is widespread, and too few empl...","[discrimin, widespread, employ, respond]"
4,The Upside of Working Motherhood,"[upsid, work, motherhood]",There are a lot of reasons women should feel o...,"[lot, reason, women, feel, optimist, have, car..."


In [64]:
#Aggregation of every token in one list
total_tokens = (df['tokenized_content'] + df['tokenized_title']).sum()

for word in total_tokens:
    if word in count_words.keys():
        count_words[word] += 1
    else:
        count_words[word] = 1

# Then I build a sorted representation of our dictionary
sorted_words = sorted(count_words.items(), key=operator.itemgetter(1))
print(sorted_words[len(sorted_words)-10:])

[('way', 470), ('help', 473), ('data', 489), ('team', 535), ('need', 592), ('like', 607), ('time', 695), ('compani', 710), ('new', 840), ('work', 1007)]


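The words are pretty much the same, but some have appeared, such as 'way', and others have disappeared, such as 'business'. Note that `count_words` is not reset in the cell above, so these counts are added on top of the unstemmed counts from the previous cell and should be read as indicative only.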
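
As a minimal sketch, the same counting can be written with `collections.Counter`. The helper name `top_tokens` and the toy `docs` lists are made up for illustration; in the notebook the input would be the tokenized title and content columns. Creating a fresh counter inside the function avoids carrying counts over from an earlier pass.

```python
from collections import Counter


def top_tokens(token_lists, n=10):
    """Count tokens across documents, starting from an empty counter."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)
    # most_common returns (token, count) pairs, highest count first
    return counts.most_common(n)


# Toy token lists (hypothetical data, not taken from the feed)
docs = [["work", "team", "work"], ["team", "compani"], ["work"]]
print(top_tokens(docs, n=2))  # [('work', 3), ('team', 2)]
```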


## Step 2 : Rule-based approach

Here I am going to create three rule-based models.
The first one classifies an article as positive if the word 'leadership' is in the title; the second one does so if the word 'leadership' is in the body.

In [65]:
# This function returns a boolean Series indicating whether each cell contains the word
def rule_column(df,column,word):
    word = word.lower()
    return df[column].str.lower().str.contains(word)

In [66]:
#Applying it to the title
sum_title_model = sum(rule_column(df,'title','leadership'))

#Applying it to the body
sum_body_model = sum(rule_column(df,'text_content','leadership'))
print(f'Number of articles about leadership according to model based on title : {sum_title_model}')
print(f'Number of articles about leadership according to model based on body : {sum_body_model}')

Number of articles about leadership according to model based on title : 6
Number of articles about leadership according to model based on body : 34


The third model combines the two previous ones with an 'OR' statement, i.e. if the word 'leadership' is either in the title or in the body, the article is classified as positive. Moreover, it is likely that if the word 'leader' is in the title, the article also deals with leadership, so I extend the title rule to this word (which also covers 'leadership', since it contains 'leader').

In [91]:
result_third_model = np.logical_or(rule_column(df,'title','leader'),rule_column(df,'text_content','leadership'))
sum_third_model = sum(result_third_model)
print(f'Number of articles about leadership according to our own model: {sum_third_model}')

Number of articles about leadership according to our own model: 51


I then have 51 articles classified as dealing with leadership. I will evaluate the models in a later step.
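
To illustrate how the combined rule behaves, here is a small sketch on plain strings rather than on the DataFrame. The function names and the three example articles are hypothetical. Because the test is a substring match, it would also fire on unrelated words that happen to contain 'leader' (for example 'cheerleader').

```python
def contains_word(text, word):
    """Case-insensitive substring test, mirroring rule_column on one string."""
    return word.lower() in text.lower()


def third_rule(title, body):
    # 'leader' is a substring of 'leadership', so the title check covers both
    return contains_word(title, "leader") or contains_word(body, "leadership")


# Made-up (title, body) pairs for illustration only
examples = [
    ("What Great Leaders Do", "Tips for managers."),            # title hit
    ("Managing Change", "Leadership matters in a crisis."),     # body hit
    ("How to Read a Balance Sheet", "A primer on accounting."), # no hit
]
print([third_rule(title, body) for title, body in examples])  # [True, True, False]
```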

## Step 3 : Building a labeled dataset

To save the time of labeling every article by hand, I will use the keyword tags provided with each article to build a gold dataset.

In [84]:
# Concise and efficient way of finding keywords that contain leadership
df['about_leadership'] = df['keywords'].str.join(';').str.lower().str.contains('leadership')
print(f'Number of articles about leadership : {sum(df["about_leadership"])}')

Number of articles about leadership : 44


I find 44 articles about leadership. As expected, the dataset is heavily imbalanced: only 44 of the 500 articles are about leadership.

## Step 4 : Evaluation of our rule-based models

### Choosing the measure

To evaluate my models, I need to find the best measure for my objective. Accuracy is not the best metric here since the dataset is heavily imbalanced. The F1-score, which is the harmonic mean of precision and recall, seems the most suitable measure, since it does not take into account true negative predictions, which matter little here.
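
A quick sketch of why accuracy is misleading here: with 44 positives out of 500, a classifier that always answers "not leadership" is right most of the time but never finds a single positive. The two helper functions below are written by hand for illustration and return 0.0 for F1 when there is no true positive (an assumption of this sketch).

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1(y_true, y_pred):
    """Harmonic mean of precision and recall; 0.0 if there is no true positive."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Same class balance as the labeled dataset: 44 positives, 456 negatives
y_true = [True] * 44 + [False] * 456
always_negative = [False] * 500

print(accuracy(y_true, always_negative))  # 0.912
print(f1(y_true, always_negative))        # 0.0
```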

In [92]:
from sklearn.metrics import f1_score, accuracy_score
score_model_1 = f1_score(df['about_leadership'],rule_column(df,'title','leadership'))
print(f'F1-score of the first model, based on the title : {score_model_1}')

score_model_2 = f1_score(df['about_leadership'],rule_column(df,'text_content','leadership'))
print(f'F1-score of the second model, based on the content : {score_model_2}')

score_model_3 = f1_score(df['about_leadership'],result_third_model)
print(f'F1-score of the third model : {score_model_3}')


F1-score of the first model, based on the title : 0.16
F1-score of the second model, based on the content : 0.1794871794871795
F1-score of the third model : 0.37894736842105264


My rule-based models therefore perform poorly, even though the third model scores much higher than the other two.

## Step 5 : Supervised approach

Now I am going to use TF-IDF and a logistic regression to classify the articles. Before applying TF-IDF, I need to preprocess the text by tokenizing, lemmatizing and stemming the contents while removing stop words. Lemmatizing and stemming reduce each word to its root so that variations of the same root are counted as the same word; tokenizing splits the text into a list of words. Contrary to what Quentin did, I include the title in the TF-IDF features, computing its vector separately from that of the text, because I thought the title had a particular importance that the model could pick up.

Then I can apply TF-IDF, but first I need to construct a bag-of-words representation of the texts, i.e. convert each text to a list of key-value pairs, where the key identifies the token and the value is the number of times this token appears in the text.
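
The two steps can be sketched in plain Python on toy token lists. This is a textbook variant (count times log2(N / document frequency), then scaling to unit length) written for illustration; it is not claimed to reproduce gensim's output exactly, and the helper names and `docs` data are hypothetical.

```python
import math
from collections import Counter


def build_vocab(docs):
    """Map each distinct token to an integer id, in order of first appearance."""
    vocab = {}
    for tokens in docs:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def doc2bow(tokens, vocab):
    """Return sorted (token_id, count) pairs for one document."""
    return sorted(Counter(vocab[t] for t in tokens).items())


def tfidf(bow, doc_freq, n_docs):
    """Weight counts by log2(N / df), drop zero weights, scale to unit length."""
    weights = [(i, c * math.log2(n_docs / doc_freq[i])) for i, c in bow]
    weights = [(i, w) for i, w in weights if w > 0]
    norm = math.sqrt(sum(w * w for _, w in weights))
    return [(i, w / norm) for i, w in weights]


# Toy tokenized documents (made up for illustration)
docs = [["work", "team", "work"], ["team", "leader"], ["work", "leader", "leader"]]
vocab = build_vocab(docs)
bows = [doc2bow(d, vocab) for d in docs]
# Number of documents in which each token id appears
doc_freq = Counter(i for bow in bows for i, _ in bow)

print(bows[0])  # [(0, 2), (1, 1)]
print(tfidf(bows[0], doc_freq, len(docs)))
```

A token that appears in every document gets a weight of zero and is dropped, which is the intended effect of the inverse document frequency term.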

In [72]:
tokens = list(df['tokenized_content'].values)
tokens.extend(list(df['tokenized_title'].values))
vocab = corpora.Dictionary(tokens)

# apply tfidf model and create a vector representation
def to_vector(key_value_tuples, vector_dim, default_value=0):
    rv = np.ones(vector_dim) * default_value
    for key, val in key_value_tuples:
        rv[key] = val
    return rv
  
# Create a bag of words representation of content and title
df["bow_corpus_content"] = df.tokenized_content.map(vocab.doc2bow)
df["bow_corpus_title"] = df.tokenized_title.map(vocab.doc2bow)

bow_corpus = list(df["bow_corpus_content"].values)
bow_corpus.extend(list(df["bow_corpus_title"].values))

tfidf = models.TfidfModel(bow_corpus)
df["tfidf_content"] = df.bow_corpus_content.map(lambda bow: to_vector(tfidf[bow], len(vocab)))
df["tfidf_title"] = df.bow_corpus_title.map(lambda bow: to_vector(tfidf[bow], len(vocab)))
df[['title','tokenized_title','text_content','tokenized_content','tokenized_title','bow_corpus_content','bow_corpus_title','tfidf_content','tfidf_title']].head()

Unnamed: 0,title,tokenized_title,text_content,tokenized_content,tokenized_title.1,bow_corpus_content,bow_corpus_title,tfidf_content,tfidf_title
0,The Upside of Working Motherhood,"[upsid, work, motherhood]",There are a lot of reasons women should feel o...,"[lot, reason, women, feel, optimist, have, car...","[upsid, work, motherhood]","[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1...","[(16, 1), (27, 1), (2008, 1)]","[0.17394563156110787, 0.1434033814243381, 0.17...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
1,Keep Your Company’s Toxic Culture from Infecti...,"[compani, toxic, cultur, infect, team]",Tips for staying positive and productive in a ...,"[tip, stay, posit, product, negat, environ]","[compani, toxic, cultur, infect, team]","[(28, 1), (29, 1), (30, 1), (31, 1), (32, 1), ...","[(39, 1), (227, 1), (320, 1), (1187, 1), (6344...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
2,Your Company Needs a Strategy for Voice Techno...,"[compani, need, strategi, voic, technolog]",Here’s what the current landscape looks like.,"[current, landscap, look, like]","[compani, need, strategi, voic, technolog]","[(34, 1), (35, 1), (36, 1), (37, 1)]","[(39, 1), (48, 1), (212, 1), (602, 1), (1774, 1)]","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
3,How to Manage Your Perfectionism,"[manag, perfection]",Learn when to put in more time and when to mov...,"[learn, time]","[manag, perfection]","[(24, 1), (38, 1)]","[(15, 1), (948, 1)]","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
4,Student Debt Is Stopping U.S. Millennials from...,"[student, debt, stop, millenni, entrepreneur]",And what companies can do to help.,"[compani, help]","[student, debt, stop, millenni, entrepreneur]","[(39, 1), (40, 1)]","[(60, 1), (418, 1), (590, 1), (928, 1), (3198,...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."


TF-IDF gives a lot of features: one per token of the vocabulary, computed once for the title and once for the content (the two vectors are summed element-wise in the code below, so the final vector has one entry per token). I therefore need to reduce the dimensionality of the data so that the model learns more easily. I apply PCA (principal component analysis), a very commonly used method that projects the data onto the directions of maximal variance. As is usual in ML, I also split the dataset into random train and test sets so that overfitting can be detected.
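
To make the feature count concrete, here is a toy sketch contrasting an element-wise sum of the title and content vectors, which keeps one feature per token, with a concatenation, which would keep two. The vectors are made up and are plain Python lists, not the NumPy arrays used in this notebook: for lists `+` concatenates, whereas for NumPy arrays `+` adds element-wise.

```python
def add_vectors(a, b):
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]


# Hypothetical TF-IDF vectors over a 4-token vocabulary
tfidf_content = [0.5, 0.0, 0.2, 0.0]
tfidf_title = [0.0, 0.7, 0.2, 0.0]

summed = add_vectors(tfidf_content, tfidf_title)  # one feature per token
concatenated = tfidf_content + tfidf_title        # list concatenation: two per token

print(len(summed), len(concatenated))  # 4 8
```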

In [22]:
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split, cross_val_score, RandomizedSearchCV, GridSearchCV

In [80]:
# Create a global feature containing TF-IDF of the content and the title
df['tfidf_total'] = df['tfidf_content'] + df['tfidf_title']
X = np.array(df['tfidf_total'].tolist())
y = df['about_leadership']
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [75]:
pca = PCA()
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

In [81]:
clf = LogisticRegression(solver = 'saga')
clf.fit(X_train_pca,y_train)
result = clf.predict(X_test_pca)
print(result)
print(f'Result with the logistic regression : {f1_score(y_test,result)}')

[False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False]
Result with the logistic regression : 0.0



Here we have a problem: positive outcomes are too rare, so the model cannot learn when they happen. A good explanation of this problem can be found here: [Logistic Regression for rare events](http://statisticalhorizons.com/logistic-regression-for-rare-events). The simplest solution, if we want to continue with logistic regression, is to use more samples. I therefore repeat the whole process with 5 batches of 1000 articles, the maximum number that can be fetched at one time. I limit myself to 5000 articles to keep the computing time reasonable, so that I can tune my parameters more easily.

In [43]:
# Same request with count=1000, then following the continuation token batch by batch
source_feed = "feed/http://feeds.harvardbusiness.org/harvardbusiness/"
base_query = f"/v3/streams/contents?streamId={source_feed}&count=1000"
data = sess.do_api_request(base_query)
print('Batch received')
articles = data['items']

while len(articles)<5000 and 'continuation' in data.keys():
    base_query = f"/v3/streams/contents?streamId={source_feed}&count=1000&continuation={data['continuation']}"
    data = sess.do_api_request(base_query)
    print('Batch received')
    articles.extend(data['items'])
    
print(f'Number of articles fetched : {len(articles)}')

Batch received
Batch received
Batch received
Batch received
Batch received
Number of articles fetched : 5000


In [44]:
# apply tfidf model and create a vector representation
def to_vector(key_value_tuples, vector_dim, default_value=0):
    rv = np.ones(vector_dim) * default_value
    for key, val in key_value_tuples:
        rv[key] = val
    return rv
  
df = pd.DataFrame(articles)
df = df.dropna(subset=['content', 'title','keywords'])

df['text_content'] = df['content'].apply(extract_text)

df['about_leadership'] = df['keywords'].str.join(';').str.lower().str.contains('leadership')

print('Tokenizing texts')

df["tokenized_content"] = df["text_content"].map(preprocess)
df["tokenized_title"] = df["title"].map(preprocess)

tokens = list(df['tokenized_content'].values)
tokens.extend(list(df['tokenized_title'].values))
vocab = corpora.Dictionary(tokens)

print('Bow corpus creation')
df["bow_corpus_content"] = df.tokenized_content.map(vocab.doc2bow)
df["bow_corpus_title"] = df.tokenized_title.map(vocab.doc2bow)

bow_corpus = list(df["bow_corpus_content"].values)
bow_corpus.extend(list(df["bow_corpus_title"].values))

print('Applying TF-IDF ')
tfidf = models.TfidfModel(bow_corpus)
df["tfidf_content"] = df.bow_corpus_content.map(lambda bow: to_vector(tfidf[bow], len(vocab)))
df["tfidf_title"] = df.bow_corpus_title.map(lambda bow: to_vector(tfidf[bow], len(vocab)))


Tokenizing texts
Bow corpus creation
Applying TF-IDF 


In [45]:
df['tfidf_total'] = df['tfidf_content']+ df['tfidf_title']
X = np.array(df['tfidf_total'].tolist())
y = df['about_leadership']
X_train, X_test, y_train, y_test = train_test_split(X, y,random_state=0)

print('Applying PCA')
pca = PCA()
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

Applying PCA


In [46]:
from sklearn.metrics import confusion_matrix

print('Applying Logistic Regression')
clf = LogisticRegression(solver = 'liblinear')
clf.fit(X_train_pca,y_train)
result = clf.predict(X_test_pca)
print('Confusion matrix with the logistic regression :')
print(confusion_matrix(y_test,result))
print(f'F1-score with the logistic regression : {f1_score(y_test,result)}')
print(f'Accuracy with the logistic regression : {accuracy_score(y_test,result)}')

score_model_1 = f1_score(y_test,rule_column(df.loc[y_test.index],'title','leadership'))
print(f'F1-score of the first rule-based model, based on the title : {score_model_1}')

score_model_2 = f1_score(y_test,rule_column(df.loc[y_test.index],'text_content','leadership'))
print(f'F1-score of the second rule-based model, based on the content : {score_model_2}')

result_third_model = np.logical_or(rule_column(df.loc[y_test.index],'title','leader'),rule_column(df.loc[y_test.index],'text_content','leadership'))
score_model_3 = f1_score(y_test,result_third_model)
print(f'F1-score of the third rule-based model : {score_model_3}')


Applying Logistic Regression
Confusion matrix with the logistic regression :
[[1095   13]
 [ 103   39]]
F1-score with the logistic regression : 0.40206185567010305
Accuracy with the logistic regression : 0.9072
F1-score of the first rule-based model, based on the title : 0.20731707317073172
F1-score of the second rule-based model, based on the content : 0.37303370786516854
F1-score of the third rule-based model : 0.411134903640257


With 5000 articles, I reached an **F1-score of 0.40**, which is much better and roughly equivalent to the third rule-based model (0.41). It is still not acceptable, though. The confusion matrix is very informative: as expected, most negative articles are classified as negative, and the number of negative articles classified as positive is quite low (13). But the number of false negatives is huge (103 of the 142 positive articles): most of the time the model has a hard time catching the theme of the article.
Tuning the hyperparameters is unlikely to fix this entirely, but I still try it, using cross-validation with a grid search.
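
The confusion matrix printed above can be turned into precision and recall by hand, which makes the false-negative problem explicit. This sketch assumes the usual scikit-learn layout for a binary problem, with rows as true labels and columns as predictions, i.e. `[[TN, FP], [FN, TP]]`; the helper name is made up.

```python
def precision_recall_f1(matrix):
    """Compute precision, recall and F1 from a [[TN, FP], [FN, TP]] matrix."""
    (tn, fp), (fn, tp) = matrix
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Values from the first logistic regression on 5000 articles
p, r, f = precision_recall_f1([[1095, 13], [103, 39]])
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
# precision=0.75 recall=0.27 f1=0.40
```

The model is right three times out of four when it predicts leadership, but it only retrieves about a quarter of the leadership articles, and this low recall is what drags the F1-score down.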

In [48]:
print('Applying PCA')
pca = PCA(svd_solver='full')
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

Applying PCA


In [54]:
print('Applying Logistic Regression')
clf = LogisticRegression(solver = 'liblinear',penalty='l1')

print(f'Cross-validation f1-score :{np.mean(cross_val_score(clf,X_train_pca,y_train,scoring="f1",cv=5))}')
      
clf.fit(X_train_pca,y_train)
result = clf.predict(X_test_pca)
print(f'F1-score on test set with the logistic regression : {f1_score(y_test,result)}')
print(f'Accuracy with the logistic regression : {accuracy_score(y_test,result)}')

score_model_1 = f1_score(y_test,rule_column(df.loc[y_test.index],'title','leadership'))
print(f'F1-score of the first rule-based model, based on the title : {score_model_1}')

score_model_2 = f1_score(y_test,rule_column(df.loc[y_test.index],'text_content','leadership'))
print(f'F1-score of the second rule-based model, based on the content : {score_model_2}')

result_third_model = np.logical_or(rule_column(df.loc[y_test.index],'title','leader'),rule_column(df.loc[y_test.index],'text_content','leadership'))
score_model_3 = f1_score(y_test,result_third_model)
print(f'F1-score of the third rule-based model : {score_model_3}')

Applying Logistic Regression
Cross-validation f1-score :0.4684237094367984
F1-score on test set with the logistic regression : 0.48571428571428577
Accuracy with the logistic regression : 0.9136
F1-score of the first rule-based model, based on the title : 0.20731707317073172
F1-score of the second rule-based model, based on the content : 0.37303370786516854
F1-score of the third rule-based model : 0.411134903640257


In [51]:
distribution={
    'penalty':['l1','l2'],
    'tol':[10**(-4),10**(-5),10**(-3),10**(-2)],
    'C':[0.01,0.1,1,10]
}

clf = GridSearchCV(LogisticRegression(solver='liblinear'),param_grid=distribution,scoring='f1',cv=5)
clf.fit(X_train_pca,y_train)
print(f'Best params obtained through the grid search : {clf.best_params_}')
print(f'Best score obtained with these params : {clf.best_score_}')

Best params obtained through the grid search : {'C': 1, 'penalty': 'l1', 'tol': 0.0001}
Best score obtained with these params : 0.4684237094367984


In [53]:
clf = LogisticRegression(solver = 'liblinear',penalty='l1',C=1,tol=0.0001)
clf.fit(X_train_pca,y_train)
result = clf.predict(X_test_pca)
print('Confusion matrix with the logistic regression :')
print(confusion_matrix(y_test,result))

print(f'F1-score on test set with the logistic regression : {f1_score(y_test,result)}')

Confusion matrix with the logistic regression :
[[1091   17]
 [  91   51]]
F1-score on test set with the logistic regression : 0.48571428571428577


I finally obtained an F1-score of 0.49, which is not very good, but it probably cannot get much better with such a simple model. The confusion matrix is a bit better and the model catches a few more positive articles (51 instead of 39), but still not enough. I think one problem is that many body contents are very short (only a few meaningful words) and do not give much information. One way to greatly improve the score would undoubtedly be to scrape the whole articles directly from the website, since this would give us much more information about each article and its content. But HBR limits us to 3 articles per month, so I would need to subscribe to be able to do this.

We could also consider using neural networks (probably recurrent neural networks such as LSTMs) to improve this score substantially, probably through transfer learning since we do not have many examples.