# News Article EDA

In this notebook we load our dataset of categorised news articles and explore the distribution of topics. We then merge sparse categories and drop unusable rows so that the data is better suited to machine learning.

This notebook was run inside SageMaker Studio on the **Python 3 (Data Science)** kernel.

In [1]:
import pandas as pd
import numpy as np
import sagemaker
import boto3

boto_session = boto3.Session()
region = boto_session.region_name
sgmk_session = sagemaker.Session()
sgmk_role = sagemaker.get_execution_role()

In [2]:
ctstories = "s3://funnybones/news/topics/CTstories.csv"

In [4]:
df1 = pd.read_csv(ctstories)

In [5]:
df1["category"].value_counts()

sport          176
politics       124
arts            98
health          96
lifestyle       90
crime           73
society         69
ignore          46
business        35
realestate      30
human           29
accident        28
environment     26
education       23
science         15
labour          13
military        12
weather          5
transport        4
religion         1
Name: category, dtype: int64

# Low observation categories

Some of the original categories have too few observations for modelling: religion, weather, military, labour, science and transport.
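
Sparse categories like these can also be found programmatically by filtering the counts against a minimum number of observations. The sketch below rebuilds the counts printed above as a Series. The cutoff of 20 (`MIN_OBSERVATIONS`) is an assumption chosen for illustration, since the notebook does not state the threshold that was used; it happens to select the same six categories.

```python
import pandas as pd

# Category counts copied from the value_counts() output above.
counts = pd.Series({
    "sport": 176, "politics": 124, "arts": 98, "health": 96,
    "lifestyle": 90, "crime": 73, "society": 69, "ignore": 46,
    "business": 35, "realestate": 30, "human": 29, "accident": 28,
    "environment": 26, "education": 23, "science": 15, "labour": 13,
    "military": 12, "weather": 5, "transport": 4, "religion": 1,
})

# Hypothetical cutoff: not stated in the notebook, chosen for illustration.
MIN_OBSERVATIONS = 20

# Keep only the categories whose count falls below the cutoff.
low_count = sorted(counts[counts < MIN_OBSERVATIONS].index)
print(low_count)
```

In the notebook itself, `counts` could be taken directly from `df1["category"].value_counts()`.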

For the purposes of this project we collapse these categories into others using the following mapping:

* religion -> society
* weather -> lifestyle
* military -> politics
* labour -> politics
* science -> education
* transport -> politics

In the process of labelling we identified 46 articles that could not be classified; these carry the label `ignore` and are excluded from the training data. They were predominantly Letters sections that contained mixed-topic content.
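
The collapsing rules and the removal of `ignore` rows can also be expressed as a single dictionary lookup. This is a minimal sketch on a small made-up frame: the rows and the names `CATEGORY_MAP` and `toy` are illustrative and not taken from the dataset. It relies on `Series.replace`, which leaves unmapped and missing values untouched.

```python
import pandas as pd

# Sparse category -> category it is merged into (from the list above).
CATEGORY_MAP = {
    "religion": "society",
    "weather": "lifestyle",
    "military": "politics",
    "labour": "politics",
    "science": "education",
    "transport": "politics",
}

# Made-up rows, including an "ignore" label and a missing label.
toy = pd.DataFrame({
    "category": ["religion", "weather", "sport", "ignore", None, "transport"],
    "text": ["a", "b", "c", "d", "e", "f"],
})

# Apply all six rules in one pass.
toy["category"] = toy["category"].replace(CATEGORY_MAP)

# Drop rows that are unlabelled or marked as unclassifiable.
cleaned = toy[(toy["category"] != "ignore") & toy["category"].notnull()]
print(cleaned["category"].tolist())
```

Keeping the rules in one dictionary means the same mapping can be reused unchanged for any other dataset that needs the same treatment.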



In [6]:
df1["category"] = np.where( df1["category"]=="religion", "society", df1["category"])
df1["category"] = np.where( df1["category"]=="weather", "lifestyle", df1["category"])
df1["category"] = np.where( df1["category"]=="military", "politics", df1["category"])
df1["category"] = np.where( df1["category"]=="labour", "politics", df1["category"])
df1["category"] = np.where( df1["category"]=="transport", "politics", df1["category"])
df1["category"] = np.where( df1["category"]=="science", "education", df1["category"])

In [8]:
df_trainer = df1[df1["category"]!="ignore"]

In [10]:
df_trainer = df_trainer[ df_trainer["category"].notnull() ]

In [11]:
len(df_trainer)

947

In [12]:
df_trainer["category"].value_counts()

sport          176
politics       153
arts            98
health          96
lifestyle       95
crime           73
society         70
education       38
business        35
realestate      30
human           29
accident        28
environment     26
Name: category, dtype: int64

In [12]:
trainset = df_trainer.loc[:,['category','text']]

In [13]:
import os
os.makedirs("data", exist_ok=True)  # make sure the output directory exists
trainset.to_csv("data/training.csv", index=False, header=False)
trainset.to_csv("data/training_with_header.csv", index=False, header=True)
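
Because `training.csv` is written without a header row, whatever reads it back has to supply the column names itself. The following is a minimal sketch of that round trip using an in-memory buffer and made-up rows (the frame `sample` is illustrative only). It also shows that text containing commas and quotes survives the trip.

```python
import io
import pandas as pd

# Made-up rows; the text deliberately contains a comma and double quotes.
sample = pd.DataFrame({
    "category": ["sport", "politics"],
    "text": ["Local team wins, again", 'Minister says "no comment"'],
})

# Write in the same layout as training.csv: no index, no header.
buffer = io.StringIO()
sample.to_csv(buffer, index=False, header=False)
buffer.seek(0)

# Read it back, naming the columns explicitly.
restored = pd.read_csv(buffer, header=None, names=["category", "text"])
print(restored)
```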

In [14]:
bucket_name = "funnybones"
bucket_prefix="topics/train"

In [15]:
# Upload the headerless training CSV to S3 for SageMaker training
train_uri = sgmk_session.upload_data(
    path="data/training.csv",
    bucket=bucket_name,
    key_prefix=bucket_prefix,
)

# Separate Test Data

We use the original dataset (not sourced from the Canberra Times) as an independent test set.


In [16]:
stories = "s3://funnybones/news/topics/stories.csv"
df2 = pd.read_csv(stories)

In [17]:
df2["category"].value_counts()

sport          32
health          9
environment     9
business        8
crime           6
transport       6
politics        2
accident        2
human           2
arts            2
lifestyle       2
society         1
weather         1
military        1
Name: category, dtype: int64

In [18]:
df2["category"] = np.where( df2["category"]=="religion", "society", df2["category"])
df2["category"] = np.where( df2["category"]=="weather", "lifestyle", df2["category"])
df2["category"] = np.where( df2["category"]=="military", "politics", df2["category"])
df2["category"] = np.where( df2["category"]=="labour", "politics", df2["category"])
df2["category"] = np.where( df2["category"]=="transport", "politics", df2["category"])
df2["category"] = np.where( df2["category"]=="science", "education", df2["category"])

In [19]:
df2["category"].value_counts()

sport          32
politics        9
health          9
environment     9
business        8
crime           6
lifestyle       3
accident        2
human           2
arts            2
society         1
Name: category, dtype: int64
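
Since this test set comes from a different source than the training data, a quick check that every test label also occurs among the training labels can catch mismatches early. This is a minimal sketch using the collapsed label sets printed in this notebook; the helper name `unseen_labels` is hypothetical.

```python
# Collapsed labels copied from the two value_counts() outputs in this notebook.
train_labels = {
    "sport", "politics", "arts", "health", "lifestyle", "crime", "society",
    "education", "business", "realestate", "human", "accident", "environment",
}
test_labels = {
    "sport", "politics", "health", "environment", "business", "crime",
    "lifestyle", "accident", "human", "arts", "society",
}


def unseen_labels(train, test):
    """Return the test labels that never occur in the training labels."""
    return sorted(set(test) - set(train))


missing = unseen_labels(train_labels, test_labels)
print(missing)  # an empty list means every test label is covered
```

In the notebook itself the two sets could be taken from `df_trainer["category"].unique()` and `df2["category"].unique()`.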

In [20]:
test_data = df2[ df2["category"].notnull() ]

In [21]:
testset = test_data.loc[:,['category','text']]

In [22]:
testset.to_csv("data/test.csv", index=False, header=False)
testset.to_csv("data/test_with_header.csv", index=False, header=True)


In [23]:
# Upload the test CSV to S3 (this reuses bucket_prefix, so it lands under topics/train)
test_uri = sgmk_session.upload_data(
    path="data/test.csv",
    bucket=bucket_name,
    key_prefix=bucket_prefix,
)