# Part 3: Data Analysis (40%)

In [133]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import requests
import json

from datetime import datetime 

import re
import nltk

## Step 1: Crawl a real-world dataset

Where does this data come from?
- This data comes from the NewsAPI, which returns English-language news articles from around the world. No country of origin is specified.
- It covers articles containing the word 'boris' published between 22 October 2021 and 22 November 2021 inclusive.
- This is not an exhaustive list of articles containing the word 'boris'.
- The API request is split into two calls of 100 articles each, which are then concatenated into a full dataset of 200 entries. This is because the free membership limits each request to 100 rows.

How was the data scraped/collected?
- We used the NewsAPI to collect this data, without restricting the news sources.
- The `requests` package sends the requests to the NewsAPI, and `json` parses the responses.
- Note: Unfortunately, due to the subscription cap, the `content` column does not hold the full article text, so this variable is less useful.
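
The two-call split described above can be sketched by building each request URL from a parameter dictionary. This is a minimal sketch rather than the notebook's actual code: `build_url` and the placeholder `YOUR_API_KEY` are hypothetical, while the endpoint and parameter names are taken from the request URLs used in this notebook. No request is sent.

```python
from urllib.parse import urlencode

# Endpoint as used in this notebook's request URLs
BASE_URL = 'http://newsapi.org/v2/everything'

def build_url(date_from, date_to, api_key='YOUR_API_KEY'):
    """Hypothetical helper: build one NewsAPI query URL for a date window."""
    params = {
        'q': 'boris',
        'language': 'en',
        'pageSize': 100,    # per-request row limit mentioned above
        'from': date_from,
        'to': date_to,
        'apiKey': api_key,  # placeholder, replace with a real key
    }
    return BASE_URL + '?' + urlencode(params)

# Two date windows that together cover the whole period
windows = [('2021-10-22', '2021-11-06'), ('2021-11-06', '2021-11-22')]
urls = [build_url(start, end) for start, end in windows]
```

Keeping the key in a parameter rather than in the URL string makes it easier to load it from outside the notebook.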

In [134]:
data = pd.DataFrame(columns = ['source', 'author', 'title', 'description', 'url', 'urlToImage', 'publishedAt', 'content'])

urls = (
    'http://newsapi.org/v2/everything?q=boris&language=en&pageSize=100&from=2021-10-22&to=2021-11-06&apiKey=823b384f7f3f4119b36bb73e3e82e0c9',
    'http://newsapi.org/v2/everything?q=boris&language=en&pageSize=100&from=2021-11-06&to=2021-11-22&apiKey=823b384f7f3f4119b36bb73e3e82e0c9'
)

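for url in urls:
    r = requests.get(url)
    r.raise_for_status()  # stop early if the API returns an HTTP error
    json_load = json.loads(r.content)
    # Append this batch of articles to the running dataset
    data = pd.concat([data, pd.DataFrame(json_load['articles'])], ignore_index=True)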
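
Because both requests include 2021-11-06, an article published that day could appear in both batches. Below is a minimal sketch of one way to guard against this, assuming `url` uniquely identifies an article (an assumption, not something this notebook verifies). The toy rows are made up for illustration.

```python
import pandas as pd

# Made-up batches standing in for the two API responses
batch_1 = pd.DataFrame({'title': ['a', 'b'], 'url': ['u1', 'u2']})
batch_2 = pd.DataFrame({'title': ['b', 'c'], 'url': ['u2', 'u3']})

combined = pd.concat([batch_1, batch_2], ignore_index=True)

# Keep the first occurrence of each URL and renumber the rows
deduped = combined.drop_duplicates(subset='url').reset_index(drop=True)
```

If any overlap exists, deduplicating would leave fewer than 200 rows.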

In [135]:
# Flatten the nested `source` dict into two plain columns
data['source_name'] = data['source'].apply(lambda x: x['name'])
data['source_id'] = data['source'].apply(lambda x: x['id'])
data.drop(columns='source', inplace=True)
data.head()

,author,title,description,url,urlToImage,publishedAt,content,source_name,source_id
0,Sarah Marsh,"Boris Johnson has ‘fragile male ego’, suggests...",First minister tells Vogue why she thinks PM s...,https://amp.theguardian.com/politics/2021/oct/...,https://i.guim.co.uk/img/media/01856ea1ddb24d3...,2021-10-29T15:41:40Z,Nicola Sturgeon has suggested Boris Johnsons f...,The Guardian,
1,Miranda Bryant,The Queen is ‘on very good form’ says Boris Jo...,Prime minister’s comments come after news that...,https://amp.theguardian.com/uk-news/2021/oct/3...,https://i.guim.co.uk/img/media/0f91310e6096525...,2021-10-30T12:14:09Z,Boris Johnson has said the Queen is on very go...,The Guardian,
2,Rowena Mason Deputy political editor,Cop26: Boris Johnson ‘cautiously optimistic’ a...,UK PM claims there has been a turnaround since...,https://amp.theguardian.com/environment/2021/n...,https://i.guim.co.uk/img/media/4239d0ea380a158...,2021-11-02T19:14:49Z,Boris Johnson has declared he is cautiously op...,The Guardian,
3,"Toby Helm, Michael Savage and Jo Ungoed-Thomas",Boris Johnson sleaze crisis deepens amid press...,Sir John Major attacks PM’s actions as ‘shamef...,https://amp.theguardian.com/politics/2021/nov/...,https://i.guim.co.uk/img/media/aec6318ff128a2c...,2021-11-06T20:30:04Z,The row over Tory sleaze reached new heights o...,The Guardian,
4,Letters,Owen Paterson affair exposes Boris Johnson’s c...,Readers respond to the Tory party closing rank...,https://amp.theguardian.com/politics/2021/nov/...,https://i.guim.co.uk/img/media/6ba3d61a7c34951...,2021-11-04T18:23:26Z,Martin Kettle writes: Just before the start of...,The Guardian,
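
As an alternative to the two `apply` calls above, `pd.json_normalize` can flatten the nested `source` field in one step. This is a minimal sketch using a made-up record shaped like the API's `articles` entries. The renamed columns mirror the names used in this notebook.

```python
import pandas as pd

# Made-up record with the same nested `source` structure
articles = [
    {'source': {'id': None, 'name': 'The Guardian'}, 'title': 'example title'},
]

# Nested keys become dotted column names: source.id, source.name
flat = pd.json_normalize(articles).rename(
    columns={'source.name': 'source_name', 'source.id': 'source_id'}
)
```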


What are the variables of interest?

- `title`: numeric features derived from the title will be interesting. Does word count or average word length vary over time, or with sentiment?
- `description`: will help us identify the topic of the article. Do many articles cover the same topic?
- `source_name`: Is there any bias depending on the source of the article?
- `publishedAt`: Not directly interesting, but lets us ask how the other variables vary over time.

## Step 2: Perform data preparation & cleaning

In [136]:
# data shape
data.shape

(200, 9)

In [137]:
data.columns

Index(['author', 'title', 'description', 'url', 'urlToImage', 'publishedAt',
       'content', 'source_name', 'source_id'],
      dtype='object')

In [138]:
# Null value counts
data.isnull().sum(axis=0)

author          74
title            0
description      0
url              0
urlToImage       0
publishedAt      0
content          0
source_name      0
source_id      110
dtype: int64

55% of `source_id` values are null (110 of 200), so it is prudent to drop this column.
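
The 55% figure follows from the counts above (110 nulls in 200 rows). On a DataFrame, the same share can be read off directly with `isnull().mean()`. Here is a minimal sketch on a made-up frame:

```python
import pandas as pd

# Share of nulls from the counts reported above
source_id_share = 110 / 200   # 0.55
author_share = 74 / 200       # 0.37

# Same idea on a made-up frame: the mean of a boolean mask is the null fraction
toy = pd.DataFrame({'author': ['x', None, 'y', None], 'title': ['a', 'b', 'c', 'd']})
null_share = toy.isnull().mean()
```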

In [139]:
# Drop unneeded columns
data.drop(
    columns=['url', 'urlToImage', 'content', 'source_id'],
    inplace=True
)

In [140]:
# Impute 'MISSING' for missing author values
data['author'] = data['author'].fillna('MISSING')

In [141]:
data.dtypes

author         object
title          object
description    object
publishedAt    object
source_name    object
dtype: object

In [142]:
# Convert publishedAt to a datetime column
data['publishedAt_dt'] = pd.to_datetime(data.publishedAt)
data.drop(columns='publishedAt', inplace=True)
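
The `publishedAt` strings are ISO 8601 timestamps with a trailing `Z` (UTC). Below is a minimal sketch of the conversion on the five timestamps shown in the first output table, plus a per-day article count as one possible starting point for the over-time questions. Passing `utc=True` makes the time zone explicit and is an addition to the cell above.

```python
import pandas as pd

published = pd.Series([
    '2021-10-29T15:41:40Z', '2021-10-30T12:14:09Z', '2021-11-02T19:14:49Z',
    '2021-11-06T20:30:04Z', '2021-11-04T18:23:26Z',
])

# Parse to timezone-aware datetimes in UTC
published_dt = pd.to_datetime(published, utc=True)

# Number of articles per calendar day, in date order
per_day = published_dt.dt.date.value_counts().sort_index()
```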

In [143]:
text_cols = ['title', 'description']

In [144]:
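def average_word_length(row):
    # Mean token length; return 0 for an empty token list to avoid dividing by zero
    return sum(map(len, row)) / len(row) if row else 0

def remove_special_characters(row):
    # Keep only ASCII letters, digits and whitespace
    return re.sub(r'[^a-zA-Z0-9\s]', '', row)

def get_tokenize(row):
    # Split text into word tokens (needs the NLTK tokenizer data to be downloaded)
    return nltk.word_tokenize(row)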

In [145]:
for i in text_cols:
    # Normalise the text: lower-case, then strip punctuation and symbols
    data[i] = data[i].str.lower()
    data[i] = data[i].apply(remove_special_characters)

    # Tokenise, then derive word count and average word length
    data[i + '_tokenize'] = data[i].apply(get_tokenize)
    data[i + '_word_count'] = data[i + '_tokenize'].apply(len)
    data[i + '_av_word_len'] = data[i + '_tokenize'].apply(average_word_length)
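
To make the loop concrete, here is a minimal sketch that runs the same steps on one made-up title. It assumes plain whitespace splitting (`str.split`) as a stand-in for `nltk.word_tokenize`, so that it runs without NLTK's tokenizer data. The two can differ on some inputs.

```python
import re

def clean(text):
    # Lower-case, then keep only ASCII letters, digits and whitespace
    return re.sub(r'[^a-zA-Z0-9\s]', '', text.lower())

sample = "PM's 'big' speech: 5 key points!"   # made-up title

cleaned = clean(sample)
tokens = cleaned.split()            # stand-in for nltk.word_tokenize
word_count = len(tokens)
av_word_len = sum(map(len, tokens)) / word_count
```

Removing the apostrophe merges "pm's" into "pms", the same effect visible in the cleaned descriptions in this notebook's output.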

In [146]:
# Flag articles whose title has a longer average word length than its description
data['title_has_longer_words'] = data.title_av_word_len > data.description_av_word_len

In [147]:
# Drop the token-list columns before displaying the result
data_out = data.drop(columns=['description_tokenize', 'title_tokenize'])
data_out.head()

,author,title,description,source_name,publishedAt_dt,title_word_count,title_av_word_len,description_word_count,description_av_word_len,title_has_longer_words
0,Sarah Marsh,boris johnson has fragile male ego suggests ni...,first minister tells vogue why she thinks pm s...,The Guardian,2021-10-29 15:41:40+00:00,9,5.666667,42,5.119048,True
1,Miranda Bryant,the queen is on very good form says boris johnson,prime ministers comments come after news that ...,The Guardian,2021-10-30 12:14:09+00:00,10,4.0,44,4.795455,False
2,Rowena Mason Deputy political editor,cop26 boris johnson cautiously optimistic abou...,uk pm claims there has been a turnaround since...,The Guardian,2021-11-02 19:14:49+00:00,9,6.333333,46,4.586957,True
3,"Toby Helm, Michael Savage and Jo Ungoed-Thomas",boris johnson sleaze crisis deepens amid press...,sir john major attacks pms actions as shameful...,The Guardian,2021-11-06 20:30:04+00:00,10,5.5,44,4.840909,True
4,Letters,owen paterson affair exposes boris johnsons co...,readers respond to the tory party closing rank...,The Guardian,2021-11-04 18:23:26+00:00,10,6.5,40,5.4,True


##### Title

In [23]:
data['title_word_count'] = data['title'].apply(lambda x: len(nltk.word_tokenize(x)))

## Step 3: Perform exploratory analysis and ask questions

## Step 5: Summarise and write a conclusion using markdown cells