# Feedly ML Internship Challenge
## Introduction
This challenge was proposed by Feedly to evaluate our ML skills for an internship at Leo (the Feedly ML program). The goal is to recognize articles dealing with leadership, a fairly abstract subject that is difficult to categorize.
It consists of 5 steps:
* Downloading and exploring data
* Rule-based approach
* Building a labeled dataset
* Evaluating our rule-based models
* Supervised approach

## Step 1 : Downloading and exploring data

For data, we use the last 500 articles from the Harvard Business Review feed, which can be downloaded through the Feedly API. We first need to install the Feedly client library and authenticate with a Feedly token, which you will find on the console page of your account.

In [1]:
!pip install feedly-client --quiet

In [2]:
from feedly.session import FeedlySession
from feedly.data import StreamOptions
import pandas as pd
import numpy as np
import urllib3
urllib3.disable_warnings()  # Silence the HTTPS warnings that were displayed at each request

In [3]:
# Enter your Feedly token here
token = input()

86e9449b-479f-4127-8e1b-3b20f99d942e
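Since `input()` echoes whatever is typed into the cell output, one possible alternative is to read the token from an environment variable and fall back to a hidden prompt. This is only a minimal sketch: the variable name `FEEDLY_TOKEN` and the helper `read_token` are hypothetical and not part of the Feedly client.

```python
import os
from getpass import getpass

def read_token(env_var="FEEDLY_TOKEN"):
    """Return the token from the environment, or prompt for it without echo."""
    token = os.environ.get(env_var)  # FEEDLY_TOKEN is an assumed, illustrative name
    if token:
        return token.strip()
    # getpass hides the typed value, so it does not end up in the notebook output
    return getpass("Feedly token: ").strip()
```

The result can then be passed to `FeedlySession(auth=...)` exactly like the value returned by `input()`.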


In [4]:
# Initializing Feedly session
sess = FeedlySession(auth=token)

In [5]:
# Downloading data with the code provided in the challenge statement
source_feed = "feed/http://feeds.harvardbusiness.org/harvardbusiness/"
base_query = f"/v3/streams/contents?streamId={source_feed}"
data = sess.do_api_request(base_query)

In [6]:
print('Type of data received : ' + str(type(data)))
print(data.keys())
print(len(data['items']))

Type of data received : <class 'dict'>
dict_keys(['id', 'title', 'direction', 'updated', 'alternate', 'continuation', 'items'])
20


As shown above, data is a dictionary whose interesting values are under the items key: an array of dictionaries containing the features of each article. However, we obtained only 20 articles, which is not enough. I therefore explored the Feedly API documentation at https://developer.feedly.com/v3/streams/ and discovered that we can add a 'count' parameter to the base query, which is what I did.
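The stream id contains characters such as `/` and `:`, so a defensive way to assemble the query string is to let `urllib.parse.urlencode` build it. This is a sketch under assumptions: `build_stream_query` is a hypothetical helper, and whether the Feedly client needs the id percent-encoded is not something this notebook establishes (the un-encoded form is what the cells here use).

```python
from urllib.parse import urlencode

def build_stream_query(stream_id, count=20, continuation=None):
    """Assemble a /v3/streams/contents query with percent-encoded parameters."""
    params = {"streamId": stream_id, "count": count}
    if continuation is not None:
        params["continuation"] = continuation
    return "/v3/streams/contents?" + urlencode(params)

query = build_stream_query(
    "feed/http://feeds.harvardbusiness.org/harvardbusiness/", count=500
)
print(query)
```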

In [7]:
# Same query as before, with the count parameter added
source_feed = "feed/http://feeds.harvardbusiness.org/harvardbusiness/"
base_query = f"/v3/streams/contents?streamId={source_feed}&count=500"
data = sess.do_api_request(base_query)

In [8]:
print(len(data['items']))

500


Here we have our 500 articles! Let's convert them into a dataframe to manipulate the data easily.

In [9]:
df = pd.DataFrame(data['items'])
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 22 columns):
alternate         500 non-null object
author            396 non-null object
canonical         500 non-null object
content           500 non-null object
crawled           500 non-null int64
engagement        500 non-null int64
engagementRate    22 non-null float64
fingerprint       500 non-null object
id                500 non-null object
keywords          500 non-null object
memes             42 non-null object
origin            500 non-null object
originId          500 non-null object
published         500 non-null int64
recrawled         46 non-null float64
summary           152 non-null object
title             500 non-null object
unread            500 non-null bool
updateCount       46 non-null float64
updated           500 non-null int64
visual            500 non-null object
webfeeds          500 non-null object
dtypes: bool(1), float64(3), int64(4), object(14)
memory usage: 82

We see that, alongside the title and content fields, we have other useful features, the most important being the engagement, the author and the keywords describing the article.

In [10]:
df.head()

Unnamed: 0,alternate,author,canonical,content,crawled,engagement,engagementRate,fingerprint,id,keywords,...,originId,published,recrawled,summary,title,unread,updateCount,updated,visual,webfeeds
0,[{'href': 'http://feeds.harvardbusiness.org/~r...,Vadim Revzin,[{'href': 'https://hbr.org/2019/04/student-deb...,{'content': '<p>And what companies can do to h...,1556287872600,10128,5.29,867b4873,RUSFMaap2epUB1Hxr7coT6Cd7n5A3BtYAIay0ExHdcs=_1...,"[Innovation, Finance & Accounting, Social resp...",...,"tag:blogs.harvardbusiness.org,2007-03-31:999.2...",1556287224000,,,Student Debt Is Stopping U.S. Millennials from...,True,,1556287224000,"{'processor': 'feedly-nikon-v3.1', 'url': 'htt...","{'relatedLayout': 'card', 'relatedTarget': 'br..."
1,[{'href': 'http://feeds.harvardbusiness.org/~r...,Art Markman,[{'href': 'https://hbr.org/2019/04/should-you-...,{'content': '<p>Sometimes it’s better to let t...,1556286653488,5892,3.08,609d6b90,RUSFMaap2epUB1Hxr7coT6Cd7n5A3BtYAIay0ExHdcs=_1...,"[Employee retention, Managing people, Talent m...",...,"tag:blogs.harvardbusiness.org,2007-03-31:999.2...",1556283615000,,,Should You Try to Convince a Star Employee to ...,True,,1556283615000,"{'processor': 'feedly-nikon-v3.1', 'url': 'htt...","{'relatedLayout': 'card', 'relatedTarget': 'br..."
2,[{'href': 'http://feeds.harvardbusiness.org/~r...,Gokhanedge Ozturk,[{'href': 'https://hbr.org/2019/04/what-compan...,"{'content': '<p>When technology goes awry, you...",1556280927644,5714,2.98,583c6388,RUSFMaap2epUB1Hxr7coT6Cd7n5A3BtYAIay0ExHdcs=_1...,"[Technology, Digital Article]",...,"tag:blogs.harvardbusiness.org,2007-03-31:999.2...",1556280322000,,,What Companies Should Consider Before Investin...,True,,1556280322000,"{'processor': 'feedly-nikon-v3.1', 'url': 'htt...","{'relatedLayout': 'card', 'relatedTarget': 'br..."
3,[{'href': 'http://feeds.harvardbusiness.org/~r...,,[{'href': 'https://hbr.org/ideacast/2019/04/hb...,"{'content': '<p>Patrick McGinnis, creator of t...",1556223088824,5996,3.07,2e7d5ecd,RUSFMaap2epUB1Hxr7coT6Cd7n5A3BtYAIay0ExHdcs=_1...,"[Entrepreneurship, Diversity, Founders, Audio]",...,"tag:blogs.harvardbusiness.org,2007-03-31:999.2...",1556222623000,,,HBR Presents: FOMO Sapiens with Patrick J. McG...,True,,1556222623000,"{'processor': 'feedly-nikon-v3.1', 'url': 'htt...","{'relatedLayout': 'card', 'relatedTarget': 'br..."
4,[{'href': 'http://feeds.harvardbusiness.org/~r...,Jane Hyun,[{'href': 'https://hbr.org/2019/04/3-ways-to-i...,{'content': '<p>Be curious and open to learnin...,1556207737072,11218,5.59,8698519,RUSFMaap2epUB1Hxr7coT6Cd7n5A3BtYAIay0ExHdcs=_1...,"[Cross-cultural management, International busi...",...,"tag:blogs.harvardbusiness.org,2007-03-31:999.2...",1556207156000,,,3 Ways to Improve Your Cultural Fluency,True,,1556207156000,"{'processor': 'feedly-nikon-v3.1', 'url': 'htt...","{'relatedLayout': 'card', 'relatedTarget': 'br..."


We notice that the content is not plain text but HTML, so we create a new feature containing only the text of the content. For this purpose, we use BeautifulSoup, a library commonly used for web scraping that can parse HTML content.

In [11]:
from bs4 import BeautifulSoup

def extract_text(content):
    return BeautifulSoup(content['content'].replace("\n", ""), 'html.parser').text

df['text_content'] = df['content'].apply(extract_text)
df[['text_content']].head()

Unnamed: 0,text_content
0,And what companies can do to help.
1,Sometimes it’s better to let them go.
2,"When technology goes awry, your reputation can..."
3,"Patrick McGinnis, creator of the term FOMO, en..."
4,Be curious and open to learning a new way of m...
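For readers without BeautifulSoup installed, roughly the same tag stripping can be sketched with the standard library's `html.parser`. This is a swapped-in alternative, not what the notebook runs, and `TextExtractor` and `html_to_text` are hypothetical names.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML fragment and ignore the tags."""

    def __init__(self):
        super().__init__()  # convert_charrefs=True by default, so &amp; becomes &
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html.replace("\n", ""))
    parser.close()  # flush any buffered text
    return "".join(parser.parts)

print(html_to_text("<p>And what companies can do to <b>help</b>.</p>"))
```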


Here we have our plain text and we can start analyzing it! The challenge asks how many words are capitalized, so we create columns containing the list of words in the title and in the content, plus one column that concatenates these two lists. Then we count how many words are capitalized and how many are not.

In [12]:
df['title_words']=df['title'].apply(lambda x:x.split(' '))
df['content_words']=df['text_content'].apply(lambda x:x.split(' '))
df['total_words'] = df['title_words'] + df['content_words']
df[['title_words','content_words','total_words']].head()

Unnamed: 0,title_words,content_words,total_words
0,"[Student, Debt, Is, Stopping, U.S., Millennial...","[And, what, companies, can, do, to, help., ]","[Student, Debt, Is, Stopping, U.S., Millennial..."
1,"[Should, You, Try, to, Convince, a, Star, Empl...","[Sometimes, it’s, better, to, let, them, go., ]","[Should, You, Try, to, Convince, a, Star, Empl..."
2,"[What, Companies, Should, Consider, Before, In...","[When, technology, goes, awry,, your, reputati...","[What, Companies, Should, Consider, Before, In..."
3,"[HBR, Presents:, FOMO, Sapiens, with, Patrick,...","[Patrick, McGinnis,, creator, of, the, term, F...","[HBR, Presents:, FOMO, Sapiens, with, Patrick,..."
4,"[3, Ways, to, Improve, Your, Cultural, Fluency]","[Be, curious, and, open, to, learning, a, new,...","[3, Ways, to, Improve, Your, Cultural, Fluency..."


In [13]:
def count_capitalized(words):
    sum_capitalized = 0
    for word in words:
        if word:  # Skip empty strings produced by split(' ')
            if word[0].isupper():
                sum_capitalized += 1
    return sum_capitalized

def count_not_capitalized(words):
    sum_not_capitalized = 0
    for word in words:
        if word:  # Skip empty strings produced by split(' ')
            if word[0].islower():
                sum_not_capitalized += 1
    return sum_not_capitalized


df['count_capitalized'],df['count_not_capitalized'] = df['total_words'].apply(count_capitalized), df['total_words'].apply(count_not_capitalized)
ratio = sum(df['count_capitalized'])/sum(df['count_not_capitalized'])
print(f'Ratio of capitalized words over non capitalized words : {ratio}')

Ratio of capitalized words over non capitalized words : 0.11878054731923011
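Splitting on spaces keeps punctuation attached to words ("help.", "awry,"), and tokens that start with a digit or a quote fall into neither count. A possible refinement, shown only as a sketch, is to extract alphabetic words with a regular expression first. `WORD_RE` and `capitalization_counts` are hypothetical names, and this variant would not reproduce the exact ratio printed above.

```python
import re

# One or more letters, optionally followed by apostrophe + letters (e.g. "it’s")
WORD_RE = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*")

def capitalization_counts(text):
    """Return (capitalized, not_capitalized) word counts for a text."""
    capitalized = lowercase = 0
    for word in WORD_RE.findall(text):
        if word[0].isupper():
            capitalized += 1
        elif word[0].islower():
            lowercase += 1
    return capitalized, lowercase

print(capitalization_counts("3 Ways to Improve Your Cultural Fluency"))  # (5, 1)
```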


In [14]:
import operator

count_words = {}
total_words = df['total_words'].sum()

for word in total_words:
    lowered_word = word.lower()
    if lowered_word in count_words.keys():
        count_words[lowered_word] += 1
    else:
        count_words[lowered_word] = 1

sorted_words = sorted(count_words.items(), key=operator.itemgetter(1))
print(sorted_words[len(sorted_words)-10:])

[('are', 1180), ('is', 1489), ('for', 1557), ('that', 2018), ('in', 2560), ('a', 3449), ('of', 3923), ('and', 4664), ('to', 5332), ('the', 6531)]


As expected, the most frequent words are stop words. We need to remove them to find interesting words. To do so, we use the gensim library, which is commonly used for text processing.

In [15]:
import gensim
from gensim.parsing.preprocessing import STOPWORDS
count_words = {}
total_words = df['total_words'].sum()

for word in total_words:
    lowered_word = word.lower()
    if len(lowered_word)>1 and lowered_word not in STOPWORDS:
        if lowered_word in count_words.keys():
            count_words[lowered_word] += 1
        else:
            count_words[lowered_word] = 1

sorted_words = sorted(count_words.items(), key=operator.itemgetter(1))
print(sorted_words[len(sorted_words)-10:])

[('like', 215), ('it’s', 222), ('need', 241), ('time', 250), ('data', 289), ('companies', 310), ('business', 326), ('work', 363), ('people', 407), ('new', 441)]
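The counting loop used in the two cells above can also be written with `collections.Counter`, which is a common idiom for this kind of frequency table. The sketch below uses a tiny made-up stop list (`STOP`) rather than gensim's `STOPWORDS`, and `top_words` is a hypothetical helper. Note that `most_common` returns the words in descending order, whereas the notebook prints them in ascending order.

```python
from collections import Counter

# Tiny illustrative stop list, NOT gensim's STOPWORDS
STOP = frozenset({"the", "to", "and", "of", "a", "in"})

def top_words(words, n=10, stopwords=frozenset()):
    """Count lowercased words longer than one character, skipping stop words."""
    counts = Counter(
        w.lower() for w in words
        if len(w) > 1 and w.lower() not in stopwords
    )
    return counts.most_common(n)

sample = "The data and the people of the new data business".split(" ")
print(top_words(sample, n=3, stopwords=STOP))
```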


This is much more interesting: words such as business, work or companies are very frequent, which is not surprising given the theme of the feed.

## Step 2 : Rule-based approach

Here we are going to create three rule-based models.
The first one classifies an article as positive if the word leadership appears in the title, the second one if the word leadership appears in the body. The third one combines and extends these two rules.

In [16]:
def check_word(row,column,word):
    return int(word in row[column].lower())

def rule_column(df,column,word):
    word = word.lower()
    return df.apply(lambda row:check_word(row,column,word),axis=1)

In [17]:
sum_title_model = sum(rule_column(df,'title','leadership'))
sum_body_model = sum(rule_column(df,'text_content','leadership'))
print(f'Number of articles about leadership according to model based on title : {sum_title_model}')
print(f'Number of articles about leadership according to model based on body : {sum_body_model}')

Number of articles about leadership according to model based on title : 8
Number of articles about leadership according to model based on body : 38


For the third model, we first combine both previous models with an 'OR' statement, i.e. if the word leadership is either in the title or in the body, the article is classified as positive. Moreover, if the word leader appears in the title, the article likely deals with leadership as well, so we extend our rule to titles that contain this word. Since "leader" is a substring of "leadership", checking the title for "leader" covers both cases.

In [18]:
result_third_model = np.logical_or(rule_column(df,'title','leader'),rule_column(df,'text_content','leadership'))
sum_third_model = sum(result_third_model)
print(f'Number of articles about leadership according to our own model: {sum_third_model}')

Number of articles about leadership according to our own model: 55
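One caveat of the substring test in `check_word` is that it also fires inside unrelated words. A possible variant, sketched here with the hypothetical helper `starts_word`, anchors the match at a word boundary so that "leader", "leaders" and "leadership" still match but "cheerleader" does not. The title below is made up for illustration and the notebook's results do not use this variant.

```python
import re

def starts_word(text, prefix):
    """True if some word in `text` starts with `prefix` (case-insensitive)."""
    pattern = r"\b" + re.escape(prefix)
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

title = "What Cheerleaders Teach Us About Teamwork"  # made-up title
print("leader" in title.lower())     # substring rule: True
print(starts_word(title, "leader"))  # word-boundary rule: False
```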


We now have 55 articles classified as dealing with leadership. We will evaluate our models in a later step.

## Step 3 : Building a labeled dataset

To avoid spending time labeling every article by hand, we use the tags provided with each article to build a gold dataset.

In [19]:
def check_tag(row,word):
    for tag in row['keywords']:
        if word in tag.lower():
            return 1
    return 0

df['about_leadership'] = df.apply(lambda row:check_tag(row,'leadership'),axis=1)
print(f'Number of articles about leadership : {sum(df["about_leadership"])}')

Number of articles about leadership : 46


We find 46 articles about leadership. The check_tag function is a bit more complex than just checking whether one of the tags is "leadership", because some tags look like "Leadership and development", so we check whether "leadership" is contained in the tag. As expected, the dataset is really imbalanced: only 46 of the 500 articles are about leadership.

## Step 4 : Evaluation of our rule-based models

### Choosing the measure

To evaluate our models, we need to find the best measure for our objective. Accuracy is not the best metric here since the dataset is heavily imbalanced. The F1-score, the harmonic mean of precision and recall, seems to be the most suitable measure, since it does not take into account true negative predictions, which are not very important here.
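To make the argument concrete, here is a small pure-Python sketch that scores a classifier which always predicts "not leadership" on a dataset with the same class balance as ours (46 positives out of 500). `precision_recall_f1` is a hypothetical helper, and returning 0 when a ratio is undefined is a convention chosen for this sketch.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = [1] * 46 + [0] * 454
y_pred = [0] * 500  # always predict the majority class
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                              # 0.908
print(precision_recall_f1(y_true, y_pred))   # (0.0, 0.0, 0.0)
```

A model that never finds a single leadership article still reaches about 91% accuracy, while its F1-score is 0.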

In [20]:
from sklearn.metrics import f1_score, accuracy_score
score_model_1 = f1_score(df['about_leadership'],rule_column(df,'title','leadership'))
print(f'F1-score of the first model, based on the title : {score_model_1}')

score_model_2 = f1_score(df['about_leadership'],rule_column(df,'text_content','leadership'))
print(f'F1-score of the second model, based on the content : {score_model_2}')

score_model_3 = f1_score(df['about_leadership'],result_third_model)
print(f'F1-score of the third model : {score_model_3}')


F1-score of the first model, based on the title : 0.22222222222222218
F1-score of the second model, based on the content : 0.21428571428571427
F1-score of the third model : 0.396039603960396


Our rule-based models therefore perform poorly, even though the third model scores much better than the other two.
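The F1-scores above can be unpacked into precision and recall using the identity F1 = 2·TP / (predicted positives + actual positives). The sketch below applies it to the counts printed earlier (8, 38 and 55 predicted positives, 46 labeled positives). `confusion_from_f1` is a hypothetical helper and the numbers it prints are derived from those outputs, not measured separately.

```python
def confusion_from_f1(f1, n_predicted, n_actual):
    """Recover true positives, precision and recall from an F1-score and counts."""
    tp = round(f1 * (n_predicted + n_actual) / 2)
    return tp, tp / n_predicted, tp / n_actual

n_actual = 46  # articles tagged with leadership in the 500-article dataset
models = [
    ("title", 0.22222222222222218, 8),
    ("body", 0.21428571428571427, 38),
    ("third", 0.396039603960396, 55),
]
for name, f1, n_pred in models:
    tp, precision, recall = confusion_from_f1(f1, n_pred, n_actual)
    print(f"{name}: TP={tp} precision={precision:.2f} recall={recall:.2f}")
```

Under these figures, the title rule is fairly precise (6 of its 8 predictions are correct) but misses most labeled articles, while the third rule trades some precision for a much higher recall.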

## Step 5 : Supervised approach

Now we are going to use TF-IDF and a Logistic Regression to classify our articles. Before applying TF-IDF, we need to preprocess our text by tokenizing, lemmatizing and stemming it while removing stop words. Lemmatizing and stemming both reduce a word to a base form (its dictionary form and its stem, respectively), so that many variations of the same root are counted as the same word. Tokenizing splits the text into a list of words, from which we remove stop words. We took inspiration from [Quentin's work](https://colab.research.google.com/drive/1jUpGwTaY9vJsUVw1tgwwXqKz6UOsvV1a#scrollTo=JfSH2BQVtAK3) and reused some of his preprocessing code, with one exception: we include the title in our TF-IDF but process it separately from the text, because we thought the title had a particular importance that our model could capture.

In [21]:
from gensim.utils import simple_preprocess
from gensim import corpora, models
from nltk.stem import WordNetLemmatizer, SnowballStemmer
stemmer = SnowballStemmer('english')

In [22]:
# Lemmatize the token as a verb, then stem it with the NLTK Snowball stemmer
def lemmatize_stemming(text):
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

# Whole preprocessing of text
def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 1:
            result.append(lemmatize_stemming(token))
    return result

df["tokenized_content"] = df["text_content"].map(preprocess)
df["tokenized_title"] = df["title"].map(preprocess)
df[['title','tokenized_title','text_content','tokenized_content']].head()

Unnamed: 0,title,tokenized_title,text_content,tokenized_content
0,Student Debt Is Stopping U.S. Millennials from...,"[student, debt, stop, millenni, entrepreneur]",And what companies can do to help.,"[compani, help]"
1,Should You Try to Convince a Star Employee to ...,"[tri, convinc, star, employe, stay]",Sometimes it’s better to let them go.,"[better, let]"
2,What Companies Should Consider Before Investin...,"[compani, consid, invest, smart, speaker]","When technology goes awry, your reputation can...","[technolog, go, awri, reput, big, hit]"
3,HBR Presents: FOMO Sapiens with Patrick J. McG...,"[hbr, present, fomo, sapien, patrick, mcginni]","Patrick McGinnis, creator of the term FOMO, en...","[patrick, mcginni, creator, term, fomo, engag,..."
4,3 Ways to Improve Your Cultural Fluency,"[way, improv, cultur, fluenci]",Be curious and open to learning a new way of m...,"[curious, open, learn, new, way, manag]"


Then we can apply TF-IDF, but first we need to build a bag-of-words representation of our texts, i.e. convert each text to a list of key-value pairs, where the key is the token id and the value is the number of times this token appears in the text.

In [23]:
tokens = list(df['tokenized_content'].values)
tokens.extend(list(df['tokenized_title'].values))
vocab = corpora.Dictionary(tokens)

# apply tfidf model
def to_vector(key_value_tuples, vector_dim, default_value=0):
    rv = np.ones(vector_dim) * default_value
    for key, val in key_value_tuples:
        rv[key] = val
    return rv
  
df["bow_corpus_content"] = df.tokenized_content.map(vocab.doc2bow)
df["bow_corpus_title"] = df.tokenized_title.map(vocab.doc2bow)

bow_corpus = list(df["bow_corpus_content"].values)
bow_corpus.extend(list(df["bow_corpus_title"].values))

tfidf = models.TfidfModel(bow_corpus)
df["tfidf_content"] = df.bow_corpus_content.map(lambda bow: to_vector(tfidf[bow], len(vocab)))
df["tfidf_title"] = df.bow_corpus_title.map(lambda bow: to_vector(tfidf[bow], len(vocab)))
df[['title','tokenized_title','text_content','tokenized_content','tokenized_title','bow_corpus_content','bow_corpus_title','tfidf_content','tfidf_title']].head()

Unnamed: 0,title,tokenized_title,text_content,tokenized_content,tokenized_title.1,bow_corpus_content,bow_corpus_title,tfidf_content,tfidf_title
0,Student Debt Is Stopping U.S. Millennials from...,"[student, debt, stop, millenni, entrepreneur]",And what companies can do to help.,"[compani, help]","[student, debt, stop, millenni, entrepreneur]","[(0, 1), (1, 1)]","[(21, 1), (395, 1), (573, 1), (916, 1), (3191,...","[0.658932031251147, 0.7522024848345273, 0.0, 0...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
1,Should You Try to Convince a Star Employee to ...,"[tri, convinc, star, employe, stay]",Sometimes it’s better to let them go.,"[better, let]","[tri, convinc, star, employe, stay]","[(2, 1), (3, 1)]","[(161, 1), (261, 1), (615, 1), (1980, 1), (423...","[0.0, 0.0, 0.5901062324256953, 0.8073256062161...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
2,What Companies Should Consider Before Investin...,"[compani, consid, invest, smart, speaker]","When technology goes awry, your reputation can...","[technolog, go, awri, reput, big, hit]","[compani, consid, invest, smart, speaker]","[(4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1)]","[(0, 1), (251, 1), (368, 1), (582, 1), (1749, 1)]","[0.0, 0.0, 0.0, 0.0, 0.6800213710962498, 0.276...","[0.22165857538502837, 0.0, 0.0, 0.0, 0.0, 0.0,..."
3,HBR Presents: FOMO Sapiens with Patrick J. McG...,"[hbr, present, fomo, sapien, patrick, mcginni]","Patrick McGinnis, creator of the term FOMO, en...","[patrick, mcginni, creator, term, fomo, engag,...","[hbr, present, fomo, sapien, patrick, mcginni]","[(3, 1), (10, 1), (11, 1), (12, 2), (13, 1), (...","[(25, 1), (28, 1), (34, 1), (39, 1), (42, 1), ...","[0.0, 0.0, 0.0, 0.07807178436943807, 0.0, 0.0,...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
4,3 Ways to Improve Your Cultural Fluency,"[way, improv, cultur, fluenci]",Be curious and open to learning a new way of m...,"[curious, open, learn, new, way, manag]","[way, improv, cultur, fluenci]","[(36, 1), (56, 1), (57, 1), (58, 1), (59, 1), ...","[(60, 1), (202, 1), (292, 1), (6583, 1)]","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
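To show what the bag-of-words and TF-IDF columns contain, here is a dependency-free sketch on three toy documents. It assumes a base-2 logarithm for the inverse document frequency and L2 normalization, which to my understanding matches gensim's defaults (the first row above does have unit norm), but the id assignment differs from gensim's `Dictionary`. `build_vocab`, `doc2bow` and `tfidf` are hypothetical re-implementations, not the library functions.

```python
import math
from collections import Counter

def build_vocab(docs):
    """Map each token to an integer id, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def doc2bow(doc, vocab):
    """Return sorted (token_id, count) pairs; unknown tokens are ignored."""
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())

def tfidf(bow, doc_freq, n_docs):
    """Weight counts by log2(N / df), then scale the vector to unit L2 norm."""
    weights = [(i, tf * math.log2(n_docs / doc_freq[i])) for i, tf in bow]
    norm = math.sqrt(sum(w * w for _, w in weights))
    return [(i, w / norm) for i, w in weights if w > 0] if norm else []

docs = [["compani", "help"], ["better", "let"], ["compani", "consid", "invest"]]
vocab = build_vocab(docs)
bows = [doc2bow(d, vocab) for d in docs]
doc_freq = Counter(i for bow in bows for i, _ in bow)  # documents containing each id
print(bows[0])
print(tfidf(bows[0], doc_freq, len(docs)))
```

In this toy corpus "compani" appears in two documents and "help" in one, so "help" receives the larger weight.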


TF-IDF gives us a lot of features: one per token of the vocabulary for the title and one for the content (the two vectors are summed in the cells below, so the final dimension equals the vocabulary size). We therefore need to reduce the dimensionality of our data so that the model learns more easily. We apply PCA (Principal Component Analysis, a very commonly used dimensionality-reduction method that maximizes the variance of the projected data). As is usual in ML, we also divide our dataset into random train and test sets, so that we can measure how well the model generalizes and detect overfitting.
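With so few positive examples, a purely random split can leave the test set with very few leadership articles. One option, not used in this notebook, is a stratified split that keeps the class proportions in both sets (scikit-learn's `train_test_split` exposes this through its `stratify` parameter). The pure-Python sketch below illustrates the idea with the hypothetical helper `stratified_split`.

```python
import random

def stratified_split(labels, test_fraction=0.25, seed=0):
    """Split indices into train/test while keeping each class's proportion."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

labels = [1] * 46 + [0] * 454  # same class balance as the 500-article dataset
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
print(sum(labels[i] for i in test_idx), "positives in the test set")
```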

In [67]:
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split, cross_val_score, RandomizedSearchCV, GridSearchCV

In [25]:
df['tfidf_total'] = df['tfidf_content'] + df['tfidf_title']
X = np.array(df['tfidf_total'].tolist())
y = df['about_leadership']
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [26]:
pca = PCA()
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

In [27]:
clf = LogisticRegression(solver = 'saga')
clf.fit(X_train_pca,y_train)
result = clf.predict(X_test_pca)
print(result)
print(f'Result with the logistic regression : {f1_score(y_test,result)}')

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
Result with the logistic regression : 0.0


Here we have a problem: positive examples are too rare, so the model cannot learn when they occur and predicts the negative class for every test article. A good explanation of this problem can be found here: [Logistic Regression for rare events](http://statisticalhorizons.com/logistic-regression-for-rare-events). If we want to keep using Logistic Regression, the simplest solution is to use more samples. We therefore repeat the whole process with 5 batches of 1000 articles, 1000 being the maximum number of articles we can fetch per request.

In [28]:
# Downloading the data in batches of 1000, following the continuation token
source_feed = "feed/http://feeds.harvardbusiness.org/harvardbusiness/"
base_query = f"/v3/streams/contents?streamId={source_feed}&count=1000"
data = sess.do_api_request(base_query)
print('Batch received')
articles = data['items']

while len(articles)<10000 and 'continuation' in data.keys():
    base_query = f"/v3/streams/contents?streamId={source_feed}&count=1000&continuation={data['continuation']}"
    data = sess.do_api_request(base_query)
    print('Batch received')
    articles.extend(data['items'])
    
print(f'Number of articles fetched : {len(articles)}')

Batch received
Batch received
Batch received
Batch received
Batch received
Batch received
Batch received
Batch received
Batch received
Batch received
Number of articles fetched : 9298
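The continuation loop above can be isolated into a small helper that is easy to test without network access. In this sketch `fetch_all` and `fake_fetch_page` are hypothetical: the fake function stands in for `sess.do_api_request` and only mimics the `items` / `continuation` keys seen in the responses above. Unlike the cell above, the helper truncates the result to `max_items`.

```python
def fetch_all(fetch_page, max_items=10000):
    """Follow continuation tokens until the stream ends or max_items is reached."""
    items, continuation = [], None
    while len(items) < max_items:
        page = fetch_page(continuation)
        items.extend(page.get("items", []))
        continuation = page.get("continuation")
        if not continuation:  # last page: the API returned no continuation token
            break
    return items[:max_items]

def fake_fetch_page(continuation, _data=list(range(2500)), page_size=1000):
    """Stand-in for the API: serves 2500 fake items, 1000 per page."""
    start = int(continuation or 0)
    page = {"items": _data[start:start + page_size]}
    if start + page_size < len(_data):
        page["continuation"] = str(start + page_size)
    return page

print(len(fetch_all(fake_fetch_page)))  # 2500
```

With the real session, `fetch_page` would be a function that builds the query string from the continuation token and calls `sess.do_api_request`.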


In [30]:
df = pd.DataFrame(articles)
df = df.dropna(subset=['content', 'title','keywords'])

df['text_content'] = df['content'].apply(extract_text)

df['about_leadership'] = df.apply(lambda row:check_tag(row,'leadership'),axis=1)

print('Tokenizing texts')

df["tokenized_content"] = df["text_content"].map(preprocess)
df["tokenized_title"] = df["title"].map(preprocess)

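print('Bow corpus creation')
# NOTE: `vocab` is not rebuilt in this cell. If it still refers to the dictionary built
# from the first 500 articles, doc2bow ignores every token that dictionary has not seen.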
df["bow_corpus_content"] = df.tokenized_content.map(vocab.doc2bow)
df["bow_corpus_title"] = df.tokenized_title.map(vocab.doc2bow)

bow_corpus = list(df["bow_corpus_content"].values)
bow_corpus.extend(list(df["bow_corpus_title"].values))

print('Applying TF-IDF ')
tfidf = models.TfidfModel(bow_corpus)
df["tfidf_content"] = df.bow_corpus_content.map(lambda bow: to_vector(tfidf[bow], len(vocab)))
df["tfidf_title"] = df.bow_corpus_title.map(lambda bow: to_vector(tfidf[bow], len(vocab)))


Tokenizing texts
Bow corpus creation
Applying TF-IDF 


In [31]:
df['tfidf_total'] = df['tfidf_content'] + df['tfidf_title']
X = np.array(df['tfidf_total'].tolist())
y = df['about_leadership']
X_train, X_test, y_train, y_test = train_test_split(X, y,random_state=0)

print('Applying PCA')
pca = PCA()
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

Applying PCA


In [52]:
print('Applying Logistic Regression')
clf = LogisticRegression(solver = 'lbfgs')
clf.fit(X_train_pca,y_train)
result = clf.predict(X_test_pca)
print(f'F1-score with the logistic regression : {f1_score(y_test,result)}')

score_model_1 = f1_score(y_test,rule_column(df.loc[y_test.index],'title','leadership'))
print(f'Accuracy with the logistic regression : {accuracy_score(y_test,result)}')

print(f'F1-score of the first rule-based model, based on the title : {score_model_1}')

score_model_2 = f1_score(y_test,rule_column(df.loc[y_test.index],'text_content','leadership'))
print(f'F1-score of the second rule-based model, based on the content : {score_model_2}')

result_third_model = np.logical_or(rule_column(df.loc[y_test.index],'title','leader'),rule_column(df.loc[y_test.index],'text_content','leadership'))
score_model_3 = f1_score(y_test,result_third_model)
print(f'F1-score of the third rule-based model : {score_model_3}')


Applying Logistic Regression
F1-score with the logistic regression : 0.44607843137254904
Accuracy with the logistic regression : 0.9015679442508711
F1-score of the first rule-based model, based on the title : 0.18006430868167203
F1-score of the second rule-based model, based on the content : 0.41542288557213924
F1-score of the third rule-based model : 0.4404332129963898


With 5000 articles, we reached an **F1-score of 0.43**, which is much better than before and already better than the rule-based models. But this is still not acceptable, and optimizing our hyperparameters at that stage would not have been very useful. We therefore raised the number of fetched articles to 9298, which is the total number of articles on the feed (this is the run shown above). It gives an F1-score of 0.45, so having more examples helped a little, but not much. The next step is to tune our hyperparameters, using cross-validation with a grid search.

In [44]:
print('Applying PCA')
pca = PCA(svd_solver='full')
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

Applying PCA


In [61]:
print('Applying Logistic Regression')
clf = LogisticRegression(solver = 'liblinear',penalty='l1')

print(f'Cross-validation f1-score :{np.mean(cross_val_score(clf,X_train_pca,y_train,scoring="f1",cv=5))}')
      
clf.fit(X_train_pca,y_train)
result = clf.predict(X_test_pca)
print(f'F1-score on test set with the logistic regression : {f1_score(y_test,result)}')

score_model_1 = f1_score(df['about_leadership'],rule_column(df,'title','leadership'))
print(f'Accuracy with the logistic regression : {accuracy_score(y_test,result)}')

print(f'F1-score of the first rule-based model, based on the title : {score_model_1}')

score_model_2 = f1_score(df['about_leadership'],rule_column(df,'text_content','leadership'))
print(f'F1-score of the second rule-based model, based on the content : {score_model_2}')

result_third_model = np.logical_or(rule_column(df,'title','leader'),rule_column(df,'text_content','leadership'))
score_model_3 = f1_score(df['about_leadership'],result_third_model)
print(f'F1-score of the third rule-based model : {score_model_3}')


Applying Logistic Regression
Cross-validation f1-score :0.4551455978681241
F1-score on test set with the logistic regression : 0.4585365853658537
Accuracy with the logistic regression : 0.9033101045296167
F1-score of the first rule-based model, based on the title : 0.16637478108581435
F1-score of the second rule-based model, based on the content : 0.38892466194462333
F1-score of the third rule-based model : 0.4163290744780305


In [74]:
distribution={
    'penalty':['l1','l2'],
    'tol':[10**(-4),10**(-5),10**(-3),10**(-2)],
    'C':[0.01,0.1,1,10]
}

clf = GridSearchCV(LogisticRegression(solver='liblinear'),param_grid=distribution,scoring='f1',cv=5)
clf.fit(X_train_pca,y_train)
print(f'Best params obtained through the grid search : {clf.best_params_}')
print(f'Best score obtained with these params : {clf.best_score_}')

Best params obtained through the grid search : {'C': 10, 'penalty': 'l2', 'tol': 0.0001}
Best score obtained with these params : 0.465401906795641


In [77]:
clf = LogisticRegression(solver = 'liblinear',penalty='l2',C=10,tol=0.0001)
clf.fit(X_train_pca,y_train)
result = clf.predict(X_test_pca)
print(f'F1-score on test set with the logistic regression : {f1_score(y_test,result)}')

F1-score on test set with the logistic regression : 0.46017699115044247


We finally obtain an F1-score of 0.46 on the test set, which is not very good, but we probably cannot do much better with such a simple model. We could consider using neural networks to try to improve this score. However, the model still scores higher than our rule-based models, which is satisfying.