<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Guardian or Telegraph?

Sentiment analysis of article titles

---

**Objectives:**

1. Complete sentiment analysis manually using the sentiment dictionary from the lesson
2. Build a classification model to predict if the article was in The Guardian or the Telegraph
3. Evaluate your model with a classification report and confusion matrix
4. Do steps 1 to 3 using the VADER Sentiment Analyzer

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

import textacy
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from tqdm import tqdm_notebook

## Load data

The data is in the file `brexit_articles.csv`.

The word sentiments are in the file `word_sentiments.csv`.

In [2]:
brexit = pd.read_csv('./datasets/brexit_articles.csv')
sents = pd.read_csv('./datasets/word_sentiments.csv')

In [3]:
sents.head()

Unnamed: 0,pos,word,pos_score,neg_score,objectivity,pos_vs_neg
0,ADJ,.22-caliber,0.0,0.0,1.0,0.0
1,ADJ,.22-calibre,0.0,0.0,1.0,0.0
2,ADJ,.22_caliber,0.0,0.0,1.0,0.0
3,ADJ,.22_calibre,0.0,0.0,1.0,0.0
4,ADJ,.38-caliber,0.0,0.0,1.0,0.0


In [4]:
brexit.head()

Unnamed: 0,source,title
0,guardian,Sam Gyimah resigns over Theresa May's Brexit deal
1,guardian,SNP and Lib Dems back Benn amendment to preven...
2,guardian,The Guardian view on Donald Trump’s credibilit...
3,guardian,"In this high-stakes game of Brexit, how much o..."
4,guardian,Brexit: McDonnell says remain would be on ball...


In [5]:
brexit.shape

(427, 2)

## Create the `sen_dict` from the word_sentiments data frame

In [6]:
from collections import defaultdict
sen_dict = defaultdict(dict) # set up a default dictionary with an empty dictionary as default value

for row in tqdm_notebook(sents.itertuples()):
    sen_dict[row.pos] [row.word] = {'objectivity': row.objectivity, 'pos_vs_neg': row.pos_vs_neg}

HBox(children=(IntProgress(value=1, bar_style='info', max=1), HTML(value='')))




## Engineer an initial feature of title length

In [7]:
brexit['title_length'] = [len(t.split()) for t in brexit['title']]

## Complete sentiment analysis manually using the sentiment dictionary

In [8]:
en_nlp = textacy.load_spacy_lang('en_core_web_sm')

In [9]:
def process_text(documents, pos=False):
    nlp = textacy.load_spacy_lang('en_core_web_sm')
    
    texts = []
    tokenised_texts = []

    if pos: # pos can either be False or a list of parts of speech
        for document in tqdm_notebook(nlp.pipe(documents, batch_size=200)):
            assert document.is_parsed
            tokens = [token
                      for token in document 
                      if token.is_stop == False
                      and token.pos_ in pos
                      and token.pos_ != 'PUNCT']
            doc_ = ''
            for token in tokens:
                doc_ += str(token) + ' '
            
            doc_ = doc_.strip()
            texts.append(doc_)
            tokenised_texts.append(tokens)
    
    
    else:    
        for document in tqdm_notebook(nlp.pipe(documents, batch_size=200)):
            assert document.is_parsed
            tokens = [token
                      for token in document 
                      if token.is_stop == False
                      and token.pos_ != 'PUNCT']
            doc_ = ''
            for token in tokens:
                doc_ += str(token) + ' '
            
            doc_ = doc_.strip()
            texts.append(doc_)
            tokenised_texts.append(tokens)
            
    return texts, tokenised_texts

In [10]:
pos = ['NOUN', 'ADJ', 'VERB', 'ADV']

In [11]:
processed_titles, tokenised_titles = process_text(brexit['title'], pos=pos)

HBox(children=(IntProgress(value=1, bar_style='info', max=1), HTML(value='')))




In [12]:
brexit['processed_title'] = processed_titles
brexit['tokenised_title'] = tokenised_titles

In [13]:
brexit.head()

Unnamed: 0,source,title,title_length,processed_title,tokenised_title
0,guardian,Sam Gyimah resigns over Theresa May's Brexit deal,8,resigns deal,"[resigns, deal]"
1,guardian,SNP and Lib Dems back Benn amendment to preven...,11,amendment prevent deal Brexit,"[amendment, prevent, deal, Brexit]"
2,guardian,The Guardian view on Donald Trump’s credibilit...,12,view credibility compromised leader Editorial,"[view, credibility, compromised, leader, Edito..."
3,guardian,"In this high-stakes game of Brexit, how much o...",16,high stakes game gambler,"[high, stakes, game, gambler]"
4,guardian,Brexit: McDonnell says remain would be on ball...,15,Brexit says remain ballot second referendum Po...,"[Brexit, says, remain, ballot, second, referen..."


In [14]:
def scorer(parsed):
    
    obj_scores, pvn_scores = [], []
    
    for token in parsed:
        try:
            obj_scores.append(sen_dict[token.pos_][token.lemma_]['objectivity'])
            pvn_scores.append(sen_dict[token.pos_][token.lemma_]['pos_vs_neg'])
        except:
            pass
    
    if not obj_scores:
        obj_scores = [1.]
    if not pvn_scores:
        pvn_scores = [0.]
        
    return [np.mean(obj_scores), np.mean(pvn_scores)]

In [15]:
scores = brexit['tokenised_title'].map(scorer)
brexit['objectivity_avg'] = scores.map(lambda x: x[0])
brexit['polarity_avg'] = scores.map(lambda x: x[1])

In [16]:
brexit.head()

Unnamed: 0,source,title,title_length,processed_title,tokenised_title,objectivity_avg,polarity_avg
0,guardian,Sam Gyimah resigns over Theresa May's Brexit deal,8,resigns deal,"[resigns, deal]",0.96875,-0.03125
1,guardian,SNP and Lib Dems back Benn amendment to preven...,11,amendment prevent deal Brexit,"[amendment, prevent, deal, Brexit]",0.9375,0.020833
2,guardian,The Guardian view on Donald Trump’s credibilit...,12,view credibility compromised leader Editorial,"[view, credibility, compromised, leader, Edito...",0.778333,-0.016667
3,guardian,"In this high-stakes game of Brexit, how much o...",16,high stakes game gambler,"[high, stakes, game, gambler]",0.857955,-0.041396
4,guardian,Brexit: McDonnell says remain would be on ball...,15,Brexit says remain ballot second referendum Po...,"[Brexit, says, remain, ballot, second, referen...",0.978761,0.021239


## Build a classification model to predict if the article was in The Guardian or the Telegraph

I am using a Random Forest Model, but if you have time do try others!!

In [17]:
# baseline

brexit['source'].value_counts()

telegraph    277
guardian     150
Name: source, dtype: int64

In [18]:
brexit['source'].value_counts(normalize=True)

telegraph    0.648712
guardian     0.351288
Name: source, dtype: float64

In [19]:
X = brexit[['objectivity_avg', 'polarity_avg', 'title_length']].astype(float)
y = brexit['source']

In [20]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, stratify=y)

In [21]:
scaler = StandardScaler()
X_train_s = pd.DataFrame(scaler.fit_transform(X_train), columns=X.columns)
X_test_s = pd.DataFrame(scaler.transform(X_test), columns=X.columns)

In [22]:
rf = RandomForestClassifier(n_estimators=100)

params = {
    'criterion': ['entropy', 'gini'],
    'max_features': np.linspace(0.1,1.0, 20),
}

gs = GridSearchCV(rf, params, n_jobs=-1, cv=5, verbose=1, iid=False)
gs.fit(X_train_s, y_train)

Fitting 5 folds for each of 40 candidates, totalling 200 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    4.3s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:   11.4s
[Parallel(n_jobs=-1)]: Done 200 out of 200 | elapsed:   12.0s finished


GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
       fit_params=None, iid=False, n_jobs=-1,
       param_grid={'criterion': ['entropy', 'gini'], 'max_features': array([0.1    , 0.14737, 0.19474, 0.24211, 0.28947, 0.33684, 0.38421,
       0.43158, 0.47895, 0.52632, 0.57368, 0.62105, 0.66842, 0.71579,
       0.76316, 0.81053, 0.85789, 0.90526, 0.95263, 1.     ])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=1)

In [23]:
gs.score(X_train_s, y_train)

0.9790209790209791

In [24]:
gs.score(X_test_s, y_test)

0.624113475177305

In [25]:
gs.best_params_

{'criterion': 'entropy', 'max_features': 0.9052631578947369}

In [26]:
predictions_train = gs.predict(X_train_s)
predictions_test = gs.predict(X_test_s)

## Evaluate your model with a classification report and confusion matrix

Describe your results!

In [27]:
print(classification_report(y_train, predictions_train))

              precision    recall  f1-score   support

    guardian       0.97      0.97      0.97       100
   telegraph       0.98      0.98      0.98       186

   micro avg       0.98      0.98      0.98       286
   macro avg       0.98      0.98      0.98       286
weighted avg       0.98      0.98      0.98       286



In [28]:
pd.DataFrame(confusion_matrix(y_train, predictions_train, labels=['guardian', 'telegraph']),
             index=['actual guardian', 'actual telegraph'],
             columns=['predicted guardian', 'predicted telegraph'])

Unnamed: 0,predicted guardian,predicted telegraph
actual guardian,97,3
actual telegraph,3,183


In [29]:
print(classification_report(y_test, predictions_test))

              precision    recall  f1-score   support

    guardian       0.47      0.40      0.43        50
   telegraph       0.69      0.75      0.72        91

   micro avg       0.62      0.62      0.62       141
   macro avg       0.58      0.57      0.57       141
weighted avg       0.61      0.62      0.62       141



In [30]:
pd.DataFrame(confusion_matrix(y_test, predictions_test, labels=['guardian', 'telegraph']),
             index=['actual guardian', 'actual telegraph'],
             columns=['predicted guardian', 'predicted telegraph'])

Unnamed: 0,predicted guardian,predicted telegraph
actual guardian,20,30
actual telegraph,23,68


## Do steps 1 to 3 using the VADER Sentiment Analyzer

### Complete sentiment analysis using VADER

In [31]:
analyzer = SentimentIntensityAnalyzer()

In [32]:
vader_scores = brexit['title'].map(analyzer.polarity_scores)

In [33]:
vader_scores.head()

0    {'neg': 0.247, 'neu': 0.753, 'pos': 0.0, 'comp...
1    {'neg': 0.0, 'neu': 0.901, 'pos': 0.099, 'comp...
2    {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...
3    {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...
4    {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...
Name: title, dtype: object

In [34]:
from sklearn.feature_extraction import DictVectorizer

dvec = DictVectorizer()

vader_scores = dvec.fit_transform(vader_scores)
vader_scores

<427x4 sparse matrix of type '<class 'numpy.float64'>'
	with 1708 stored elements in Compressed Sparse Row format>

In [35]:
dvec.feature_names_

['compound', 'neg', 'neu', 'pos']

In [36]:
for i, col in enumerate(dvec.feature_names_):
    brexit['vader_{}'.format(col)] = vader_scores[:, i].toarray().ravel()

In [37]:
brexit.head()

Unnamed: 0,source,title,title_length,processed_title,tokenised_title,objectivity_avg,polarity_avg,vader_compound,vader_neg,vader_neu,vader_pos
0,guardian,Sam Gyimah resigns over Theresa May's Brexit deal,8,resigns deal,"[resigns, deal]",0.96875,-0.03125,-0.3182,0.247,0.753,0.0
1,guardian,SNP and Lib Dems back Benn amendment to preven...,11,amendment prevent deal Brexit,"[amendment, prevent, deal, Brexit]",0.9375,0.020833,0.0258,0.0,0.901,0.099
2,guardian,The Guardian view on Donald Trump’s credibilit...,12,view credibility compromised leader Editorial,"[view, credibility, compromised, leader, Edito...",0.778333,-0.016667,0.0,0.0,1.0,0.0
3,guardian,"In this high-stakes game of Brexit, how much o...",16,high stakes game gambler,"[high, stakes, game, gambler]",0.857955,-0.041396,0.0,0.0,1.0,0.0
4,guardian,Brexit: McDonnell says remain would be on ball...,15,Brexit says remain ballot second referendum Po...,"[Brexit, says, remain, ballot, second, referen...",0.978761,0.021239,0.0,0.0,1.0,0.0


### Build a classification model to predict if the article was in The Guardian or the Telegraph

In [38]:
# baseline

brexit['source'].value_counts(normalize=True)

telegraph    0.648712
guardian     0.351288
Name: source, dtype: float64

In [39]:
brexit.columns

Index(['source', 'title', 'title_length', 'processed_title', 'tokenised_title',
       'objectivity_avg', 'polarity_avg', 'vader_compound', 'vader_neg',
       'vader_neu', 'vader_pos'],
      dtype='object')

In [40]:
X = brexit[['vader_neg', 'vader_pos', 'vader_neu', 'vader_compound', 'title_length']].astype(float)
y = brexit['source']

In [41]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, stratify=y)

In [42]:
scaler = StandardScaler()
X_train_s = pd.DataFrame(scaler.fit_transform(X_train), columns=X.columns)
X_test_s = pd.DataFrame(scaler.transform(X_test), columns=X.columns)

In [43]:
rf = RandomForestClassifier(n_estimators=100)

params = {
    'criterion': ['entropy', 'gini'],
    'max_features': np.linspace(0.1,1.0, 20),
}


gs = GridSearchCV(rf, params, n_jobs=-1, cv=5, verbose=1, iid=False)
gs.fit(X_train_s, y_train)

Fitting 5 folds for each of 40 candidates, totalling 200 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    3.9s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:   13.6s
[Parallel(n_jobs=-1)]: Done 200 out of 200 | elapsed:   14.3s finished


GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
       fit_params=None, iid=False, n_jobs=-1,
       param_grid={'criterion': ['entropy', 'gini'], 'max_features': array([0.1    , 0.14737, 0.19474, 0.24211, 0.28947, 0.33684, 0.38421,
       0.43158, 0.47895, 0.52632, 0.57368, 0.62105, 0.66842, 0.71579,
       0.76316, 0.81053, 0.85789, 0.90526, 0.95263, 1.     ])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=1)

In [44]:
gs.score(X_train_s, y_train)

0.9090909090909091

In [45]:
gs.score(X_test_s, y_test)

0.6099290780141844

In [46]:
gs.best_params_

{'criterion': 'gini', 'max_features': 0.1}

### Evaluate your model with a classification report and confusion matrix

Describe your results!

In [47]:
print(classification_report(y_train, predictions_train))

              precision    recall  f1-score   support

    guardian       0.33      0.33      0.33       100
   telegraph       0.64      0.64      0.64       186

   micro avg       0.53      0.53      0.53       286
   macro avg       0.48      0.48      0.48       286
weighted avg       0.53      0.53      0.53       286



In [48]:
pd.DataFrame(confusion_matrix(y_train, predictions_train, labels=['guardian', 'telegraph']),
             index=['actual guardian', 'actual telegraph'],
             columns=['predicted guardian', 'predicted telegraph'])

Unnamed: 0,predicted guardian,predicted telegraph
actual guardian,33,67
actual telegraph,67,119


In [49]:
print(classification_report(y_test, predictions_test))

              precision    recall  f1-score   support

    guardian       0.44      0.38      0.41        50
   telegraph       0.68      0.74      0.71        91

   micro avg       0.61      0.61      0.61       141
   macro avg       0.56      0.56      0.56       141
weighted avg       0.60      0.61      0.60       141



In [50]:
pd.DataFrame(confusion_matrix(y_test, predictions_test, labels=['guardian', 'telegraph']),
             index=['actual guardian', 'actual telegraph'],
             columns=['predicted guardian', 'predicted telegraph'])

Unnamed: 0,predicted guardian,predicted telegraph
actual guardian,19,31
actual telegraph,24,67


In [51]:
for title in brexit[brexit['source']=='guardian'].sort_values('vader_pos', ascending=False)['title'][0:5]:
    print(title)
    print('============================================================\n')

Gentleman and Cadwalladr win joint journalist of the year award

Public Service Awards: the finalists

Flawless first tests for 'flying launchpad'

May to promote Brexit trade opportunities at fraught G20 summit

Yes, Donald Trump is talking perfect sense on May’s Brexit deal | Peter Mandelson



In [52]:
for title in brexit[brexit['source']=='telegraph'].sort_values('vader_pos', ascending=False)['title'][0:5]:
    print(title)
    print('============================================================\n')

Health Secretary Jeane Freeman wins top politician award

TOP TENS

Discover opportunities in India at Business Festival

Party will rue inviting Boris to entertain at annual conference

Gabby’s Great Brexit Bake Off cheers Theresa May after ‘tough week’



In [53]:
for title in brexit[brexit['source']=='guardian'].sort_values('vader_neg', ascending=False)['title'][0:5]:
    print(title)
    print('============================================================\n')

DICK COLE

No-deal Brexit 'could leave UK at risk from terrorism'

This means war: why cheesy churros will destroy UK-Spain relations


Trump blasts GM over job cuts, as deeper trade war looms - business live



In [54]:
for title in brexit[brexit['source']=='telegraph'].sort_values('vader_neg', ascending=False)['title'][0:5]:
    print(title)
    print('============================================================\n')

Man accused of killing west Belfast mother-of-three refused bail

Job loss fears and uncertainty inside JLR - workers speak out

Ignore the sore losers, May's deal is a good one

May’s Brexit deal is worst of all worlds and doomed to fail – Sir Michael Fallon

Hammond warns of threat to economy if MPs reject Brexit deal



In [55]:
for title in brexit[brexit['source']=='guardian'].sort_values('vader_neu', ascending=False)['title'][0:5]:
    print(title)
    print('============================================================\n')

Brexit: Theresa May says McDonnell wants to overturn will of British people – as it happened

Brexit: McDonnell says it is 'inevitable' Labour will back second referendum - Politics live

Philip Hammond says economy will be 'a little bit smaller' under all Brexit scenarios – video

UK car industry and Airbus cautiously back PM's Brexit deal

Second referendum campaigners split over parliamentary tactics



In [56]:
for title in brexit[brexit['source']=='telegraph'].sort_values('vader_neu', ascending=False)['title'][0:5]:
    print(title)
    print('============================================================\n')

'Junk backstop', May is urged as Brexiteer MPs hit out over her EU deal

Final reports commissioned for new University of Peterborough ‘to get every aspect right’

What the papers say – November 25

Westminster seats decision for Brexit vote up to Sinn Fein, says PM

PM’s letter to nation an utter work of fiction, says Labour MP



In [57]:
for title in brexit[brexit['source']=='guardian'].sort_values('vader_neu', ascending=True)['title'][0:5]:
    print(title)
    print('============================================================\n')

DICK COLE

No-deal Brexit 'could leave UK at risk from terrorism'

Investors hope for trade war breakthrough - business live

No-deal Brexit: Amazon blamed for lack of food warehouse space

Has France fallen out of love with Emmanuel Macron?



In [58]:
for title in brexit[brexit['source']=='telegraph'].sort_values('vader_neu', ascending=True)['title'][0:5]:
    print(title)
    print('============================================================\n')

Health Secretary Jeane Freeman wins top politician award

Ignore the sore losers, May's deal is a good one

TOP TENS

Man accused of killing west Belfast mother-of-three refused bail

However we leave, Northern Ireland's sure to suffer

