Dataset source: https://www.kaggle.com/datasets/everydaycodings/global-news-dataset?select=data.csv

# Notebook 2: Text Analysis of Global News Dataset

In this notebook we will conduct text analysis of the global news dataset.


## Part 1: Load Dataset




In [None]:
cd /content/drive/MyDrive/CIND820

/content/drive/MyDrive/CIND820


In [None]:
# install required libraries


In [None]:
# import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import os

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('wordnet')

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
# Set option to display full column
# pd.set_option('display.max_colwidth', None)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


In [None]:
# load all datasets
data_df = pd.read_csv('global_news/data.csv')
rating_df = pd.read_csv('global_news/rating.csv')
raw_df = pd.read_csv('global_news/raw-data.csv', dtype='object')

All columns in raw-data.csv are read as 'object' so that every column has a consistent datatype.
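As a small illustration of what `dtype='object'` does, the sketch below reads a two-row CSV from an in-memory string. The rows are made up for the example and are not taken from `raw-data.csv`. Numeric-looking values stay as strings, while empty fields are still read as NaN.

```python
import io
import pandas as pd

# Hypothetical two-row CSV standing in for raw-data.csv
csv_text = "article_id,title\n89541,UN Chief Urges World\n89542,\n"

as_object = pd.read_csv(io.StringIO(csv_text), dtype='object')
inferred = pd.read_csv(io.StringIO(csv_text))

# With dtype='object' the ids are Python strings; without it pandas infers integers
print(type(as_object.loc[0, 'article_id']))
print(inferred['article_id'].dtype)

# Empty fields are still treated as missing values
print(as_object['title'].isna().tolist())
```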

In [None]:
# concatenate the three dataframes row-wise into a master dataframe
master_df = pd.concat([data_df, rating_df, raw_df], axis=0)
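
The row-wise concatenation above can be sketched on two toy frames (hypothetical data). Columns missing from one input are filled with NaN, and the original row labels are kept, so index values can repeat in the combined frame.

```python
import pandas as pd

# Toy frames with partly different columns (hypothetical data)
a = pd.DataFrame({'article_id': ['1', '2'], 'title': ['A', 'B']})
b = pd.DataFrame({'article_id': ['3'], 'title_sentiment': ['Neutral']})

combined = pd.concat([a, b], axis=0)

print(combined.shape)           # rows are stacked, columns are the union
print(combined.index.tolist())  # original labels are kept, so 0 appears twice
print(combined.isna().sum())    # cells absent from an input become NaN
```

If unique row labels are needed, passing `ignore_index=True` to `pd.concat` gives the combined frame a fresh 0..n-1 index.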

In [None]:
master_df.shape

(1096988, 14)

In [None]:
master_df.head()

Unnamed: 0,article_id,source_id,source_name,author,title,description,url,url_to_image,published_at,content,category,full_content,article,title_sentiment
0,89541,,International Business Times,Paavan MATHEMA,UN Chief Urges World To 'Stop The Madness' Of ...,UN Secretary-General Antonio Guterres urged th...,https://www.ibtimes.com/un-chief-urges-world-s...,https://d.ibtimes.com/en/full/4496078/nepals-g...,2023-10-30 10:12:35.000000,UN Secretary-General Antonio Guterres urged th...,Nepal,UN Secretary-General Antonio Guterres urged th...,,
1,89542,,Prtimes.jp,,RANDEBOOよりワンランク上の大人っぽさが漂うニットとベストが新登場。,[株式会社Ainer]\nRANDEBOO（ランデブー）では2023年7月18日(火)より公...,https://prtimes.jp/main/html/rd/p/000000147.00...,https://prtimes.jp/i/32220/147/ogp/d32220-147-...,2023-10-06 04:40:02.000000,"RANDEBOO2023718()WEB2023 Autumn Winter \n""Nepa...",Nepal,,,
2,89543,,VOA News,webdesk@voanews.com (Agence France-Presse),UN Chief Urges World to 'Stop the Madness' of ...,UN Secretary-General Antonio Guterres urged th...,https://www.voanews.com/a/un-chief-urges-world...,https://gdb.voanews.com/01000000-0a00-0242-60f...,2023-10-30 10:53:30.000000,"Kathmandu, Nepal UN Secretary-General Antonio...",Nepal,,,
3,89545,,The Indian Express,Editorial,Sikkim warning: Hydroelectricity push must be ...,Ecologists caution against the adverse effects...,https://indianexpress.com/article/opinion/edit...,https://images.indianexpress.com/2023/10/edit-...,2023-10-06 01:20:24.000000,At least 14 persons lost their lives and more ...,Nepal,At least 14 persons lost their lives and more ...,,
4,89547,,The Times of Israel,Jacob Magid,"200 foreigners, dual nationals cut down in Ham...","France lost 35 citizens, Thailand 33, US 31, U...",https://www.timesofisrael.com/200-foreigners-d...,https://static.timesofisrael.com/www/uploads/2...,2023-10-27 01:08:34.000000,"Scores of foreign citizens were killed, taken ...",Nepal,,,


In [None]:
master_df.describe()

Unnamed: 0,article_id,source_id,source_name,author,title,description,url,url_to_image,published_at,content,category,full_content,article,title_sentiment
count,1096988,161794,1031069,907159,1030479,1027259,967880,903375,967880,967414,967435,58432,58356,58356
unique,906200,210,5358,62261,393930,382750,399959,321282,319926,390985,258,54143,54148,3
top,The Visitor,the-times-of-india,Punknews.org,emmoore@nospam.punknews.org (emmoore),Cigar release video for “These Chances”,Cigar have released a video for their song “Th...,https://removed.com,https://www.marketscreener.com/images/twitter_...,1970-01-01 00:00:00.000000,[Removed],Stock,Attachment,Liite,Neutral
freq,65536,30417,63225,63219,63187,63187,21246,9992,21246,21246,18580,9,9,42926


In [None]:
master_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1096988 entries, 0 to 933256
Data columns (total 14 columns):
 #   Column           Non-Null Count    Dtype 
---  ------           --------------    ----- 
 0   article_id       1096988 non-null  object
 1   source_id        161794 non-null   object
 2   source_name      1031069 non-null  object
 3   author           907159 non-null   object
 4   title            1030479 non-null  object
 5   description      1027259 non-null  object
 6   url              967880 non-null   object
 7   url_to_image     903375 non-null   object
 8   published_at     967880 non-null   object
 9   content          967414 non-null   object
 10  category         967435 non-null   object
 11  full_content     58432 non-null    object
 12  article          58356 non-null    object
 13  title_sentiment  58356 non-null    object
dtypes: object(14)
memory usage: 125.5+ MB


In [None]:
# checking the categories of provided sentiment
master_df['title_sentiment'].value_counts()

Neutral     42926
Negative     9133
Positive     6297
Name: title_sentiment, dtype: int64

In [None]:
master_df['category'].unique()

array(['Nepal', 'New Zealand', 'Hiking', 'Sustainability', 'Europe',
       'Oman', 'Politics', 'Pakistan', 'Panama', 'Games', 'News',
       'Papua New Guinea', 'Poland', 'Climate', 'Qatar', 'YouTube',
       'Peru', 'Artificial Intelligence', 'America', 'Puerto Rico',
       'Real estate', 'Philippines', 'Weather', 'Palau', 'Romania',
       'Armenia', 'Amazon', 'Love', 'COVID', 'Paraguay', 'Music', 'Cars',
       'Photography', 'Stock', 'Space', 'History', 'Food', 'Art',
       'Bitcoin', 'Fashion', 'TikTok', 'Design', 'Technology',
       'Architecture', 'Motivation', 'Home', 'Asia', 'Relationships',
       'Coding', 'Beauty', 'Finance', 'Africa', 'Ghana',
       'Russian Federation', 'Rwanda', 'Nutrition', 'Anime', 'Podcasts',
       'Yoga', 'Cryptocurrency', 'Sudan', 'Singapore', 'Blockchain',
       'Startups', 'Productivity', 'Health', 'Uganda', 'Samoa', 'Yemen',
       'Chad', 'Philosophy', 'world', 'Movies', nan, 'Google', 'Facebook',
       'Jobs', 'Travel', 'Sports', 'Scien

In [None]:
# filter dataframe to only include finance-related categories
relevant_categories = ['News','Real estate',
       'Amazon', 'Stock', 'Bitcoin', 'Technology',
       'Finance', 'Cryptocurrency', 'Blockchain',
       'Startups', 'Google', 'Facebook']

filtered_df = master_df[master_df['category'].isin(relevant_categories)].copy()

In [None]:
filtered_df['category'].unique()

array(['News', 'Real estate', 'Amazon', 'Stock', 'Bitcoin', 'Technology',
       'Finance', 'Cryptocurrency', 'Blockchain', 'Startups', 'Google',
       'Facebook'], dtype=object)

In [None]:
filtered_df.shape

(138630, 14)

In [None]:
# check for missing values
filtered_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 138630 entries, 856 to 933256
Data columns (total 14 columns):
 #   Column           Non-Null Count   Dtype 
---  ------           --------------   ----- 
 0   article_id       138630 non-null  object
 1   source_id        18797 non-null   object
 2   source_name      138630 non-null  object
 3   author           127579 non-null  object
 4   title            138604 non-null  object
 5   description      137599 non-null  object
 6   url              138630 non-null  object
 7   url_to_image     133126 non-null  object
 8   published_at     138630 non-null  object
 9   content          138545 non-null  object
 10  category         138630 non-null  object
 11  full_content     12859 non-null   object
 12  article          13028 non-null   object
 13  title_sentiment  13028 non-null   object
dtypes: object(14)
memory usage: 15.9+ MB


**Let's divide up the dataset according to the attribute type:**

Textual attributes are:
- source_name
- author
- title
- description
- url
- url_to_image
- content
- category
- full_content
- article
- title_sentiment

Non-textual attributes are:
- article_id
- source_id
- published_at

## Part 2: Data Cleaning

Issues with the dataset:

Before starting the text analysis, two problems need to be addressed.

*   The dataset contains non-English text.
*   Almost every attribute has missing values.

Solutions:

* Since this is a global news dataset, translating the non-English text to English would make the most of the data. However, free translation services limit the number of calls that can be made, so we will skip translation.

* We will fill missing values with an empty string. For the text processing steps we will use only the 'title' column, since among the article text columns (title, description, content, full_content, article) it has the most non-null values (1,030,479).

In [None]:
# fill missing values in the 'title' column with an empty string
master_df['title'] = master_df['title'].fillna('')

In [None]:
# remove all special characters in text
import re

def remove_special_characters(text):
    # Pattern matching every character that is not an ASCII letter, digit, or whitespace
    pattern = r'[^a-zA-Z0-9\s]'  # non-ASCII letters (accented, non-Latin) are removed too

    # Use the sub() function to replace matched patterns with an empty string
    cleaned_text = re.sub(pattern, '', text)

    return cleaned_text

master_df['cleaned_text'] = master_df['title'].apply(remove_special_characters)
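
To check what this pattern keeps and drops, the function can be tried on individual titles. The sketch below redefines the function in compact form so that it runs on its own; both sample titles appear in the dataset preview above. Because the pattern only keeps ASCII letters and digits, non-Latin characters are removed as well, which matters for the non-English titles in this dataset.

```python
import re

def remove_special_characters(text):
    # Keep only ASCII letters, digits, and whitespace
    return re.sub(r'[^a-zA-Z0-9\s]', '', text)

samples = [
    "TMT Investments (LON:TMT) Stock Price Up 0.8%",
    "RANDEBOOよりワンランク上の大人っぽさが漂うニットとベストが新登場。",
]

for title in samples:
    # repr() makes empty or whitespace-only results visible
    print(repr(remove_special_characters(title)))
```

Note that punctuation inside numbers and tickers is dropped too, so `0.8%` becomes `08` and `LON:TMT` becomes `LONTMT`.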

In [None]:
# checking dataframe
master_df.tail()

Unnamed: 0,article_id,source_id,source_name,author,title,description,url,url_to_image,published_at,content,category,full_content,article,title_sentiment,cleaned_text
933252,594419,,ETF Daily News,MarketBeat News,TMT Investments (LON:TMT) Stock Price Up 0.8%,TMT Investments PLC (LON:TMT – Get Free Report...,https://www.etfdailynews.com/2023/11/21/tmt-in...,https://www.americanbankingnews.com/wp-content...,2023-11-21 05:44:47.000000,TMT Investments PLC (LON:TMT – Get Free Report...,Stock,,,,TMT Investments LONTMT Stock Price Up 08
933253,594420,,ETF Daily News,MarketBeat News,Avadel Pharmaceuticals (NASDAQ:AVDL) Stock Pri...,Avadel Pharmaceuticals plc (NASDAQ:AVDL – Get ...,https://www.etfdailynews.com/2023/11/21/avadel...,https://www.americanbankingnews.com/wp-content...,2023-11-21 16:30:51.000000,Avadel Pharmaceuticals plc (NASDAQ:AVDL – Get ...,Stock,,,,Avadel Pharmaceuticals NASDAQAVDL Stock Price ...
933254,594421,,New Atlas,Loz Blain,Utter chaos at OpenAI puts GPT in jeopardy – w...,One of the world's most important companies se...,https://newatlas.com/technology/openai-altman-...,https://assets.newatlas.com/dims4/default/898e...,2023-11-21 06:19:52.000000,One of the world's most important companies se...,Stock,,,,Utter chaos at OpenAI puts GPT in jeopardy wh...
933255,594422,,CNA,,Marketmind: Risk rally rages on,A look at the day ahead in European and global...,https://www.channelnewsasia.com/business/marke...,https://onecms-res.cloudinary.com/image/upload...,2023-11-21 05:37:34.000000,A look at the day ahead in European and global...,Stock,,,,Marketmind Risk rally rages on
933256,594423,,Dutchcowboys.nl,"Jeroen de Hooge, Jeroen de Hooge",Google Trends: PVV grootste partij in acht pro...,Een dag voor de verkiezingen stijgen GroenLink...,https://www.dutchcowboys.nl/search/google-tren...,https://www.dutchcowboys.nl/uploads/posts/list...,2023-11-21 12:00:00.000000,Een dag voor de verkiezingen stijgen GroenLink...,Stock,,,,Google Trends PVV grootste partij in acht prov...


## Part 3: Tokenization

In [None]:
from nltk.tokenize import word_tokenize
import spacy

# Load the English language model
nlp = spacy.load("en_core_web_sm")

In [None]:
# Convert 'cleaned_text' column to string type
master_df['cleaned_text'] = master_df['cleaned_text'].astype(str)

In [None]:
master_df['tokens'] = master_df['cleaned_text'].apply(word_tokenize)

In [None]:
from nltk.util import ngrams

In [None]:
# Get unigrams and bigrams
master_df['unigrams'] = master_df['tokens'].apply(lambda x: list(ngrams(x, 1)))
master_df['bigrams'] = master_df['tokens'].apply(lambda x: list(ngrams(x, 2)))
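
To make the shape of these columns concrete, the sketch below builds n-grams for one tokenized title with plain Python. The helper `simple_ngrams` is a hypothetical stand-in written for illustration; it is meant to produce the same tuples as `nltk.util.ngrams` in the unpadded case used above.

```python
def simple_ngrams(tokens, n):
    # Slide a window of length n over the token list and emit each window as a tuple
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ['Marketmind', 'Risk', 'rally', 'rages', 'on']

print(simple_ngrams(tokens, 1))  # one 1-tuple per token
print(simple_ngrams(tokens, 2))  # one pair per adjacent token pair
```

A title with fewer than `n` tokens yields an empty list, so empty titles produce empty `unigrams` and `bigrams` entries.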

In [None]:
master_df.tail()

Unnamed: 0,article_id,source_id,source_name,author,title,description,url,url_to_image,published_at,content,category,full_content,article,title_sentiment,cleaned_text,tokens,unigrams,bigrams
933252,594419,,ETF Daily News,MarketBeat News,TMT Investments (LON:TMT) Stock Price Up 0.8%,TMT Investments PLC (LON:TMT – Get Free Report...,https://www.etfdailynews.com/2023/11/21/tmt-in...,https://www.americanbankingnews.com/wp-content...,2023-11-21 05:44:47.000000,TMT Investments PLC (LON:TMT – Get Free Report...,Stock,,,,TMT Investments LONTMT Stock Price Up 08,"[TMT, Investments, LONTMT, Stock, Price, Up, 08]","[(TMT,), (Investments,), (LONTMT,), (Stock,), ...","[(TMT, Investments), (Investments, LONTMT), (L..."
933253,594420,,ETF Daily News,MarketBeat News,Avadel Pharmaceuticals (NASDAQ:AVDL) Stock Pri...,Avadel Pharmaceuticals plc (NASDAQ:AVDL – Get ...,https://www.etfdailynews.com/2023/11/21/avadel...,https://www.americanbankingnews.com/wp-content...,2023-11-21 16:30:51.000000,Avadel Pharmaceuticals plc (NASDAQ:AVDL – Get ...,Stock,,,,Avadel Pharmaceuticals NASDAQAVDL Stock Price ...,"[Avadel, Pharmaceuticals, NASDAQAVDL, Stock, P...","[(Avadel,), (Pharmaceuticals,), (NASDAQAVDL,),...","[(Avadel, Pharmaceuticals), (Pharmaceuticals, ..."
933254,594421,,New Atlas,Loz Blain,Utter chaos at OpenAI puts GPT in jeopardy – w...,One of the world's most important companies se...,https://newatlas.com/technology/openai-altman-...,https://assets.newatlas.com/dims4/default/898e...,2023-11-21 06:19:52.000000,One of the world's most important companies se...,Stock,,,,Utter chaos at OpenAI puts GPT in jeopardy wh...,"[Utter, chaos, at, OpenAI, puts, GPT, in, jeop...","[(Utter,), (chaos,), (at,), (OpenAI,), (puts,)...","[(Utter, chaos), (chaos, at), (at, OpenAI), (O..."
933255,594422,,CNA,,Marketmind: Risk rally rages on,A look at the day ahead in European and global...,https://www.channelnewsasia.com/business/marke...,https://onecms-res.cloudinary.com/image/upload...,2023-11-21 05:37:34.000000,A look at the day ahead in European and global...,Stock,,,,Marketmind Risk rally rages on,"[Marketmind, Risk, rally, rages, on]","[(Marketmind,), (Risk,), (rally,), (rages,), (...","[(Marketmind, Risk), (Risk, rally), (rally, ra..."
933256,594423,,Dutchcowboys.nl,"Jeroen de Hooge, Jeroen de Hooge",Google Trends: PVV grootste partij in acht pro...,Een dag voor de verkiezingen stijgen GroenLink...,https://www.dutchcowboys.nl/search/google-tren...,https://www.dutchcowboys.nl/uploads/posts/list...,2023-11-21 12:00:00.000000,Een dag voor de verkiezingen stijgen GroenLink...,Stock,,,,Google Trends PVV grootste partij in acht prov...,"[Google, Trends, PVV, grootste, partij, in, ac...","[(Google,), (Trends,), (PVV,), (grootste,), (p...","[(Google, Trends), (Trends, PVV), (PVV, groots..."


## Part 4: Part of Speech Tagging

In [None]:
from nltk.tag import pos_tag

In [None]:
master_df['POS_tag'] = master_df['tokens'].apply(pos_tag)

## Part 5: Named Entity Recognition

In [None]:
from nltk.chunk import ne_chunk

In [None]:
master_df['named_entities'] = master_df['POS_tag'].apply(ne_chunk)

KeyboardInterrupt: 

## Part 6: Stemming and Lemmatization

In [None]:
# import stemming and lemmatization algorithms
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

In [None]:
# Initialize stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Function to apply stemming
def apply_stemming(tokens):
    stemmed_tokens = [stemmer.stem(token) for token in tokens]
    return stemmed_tokens

# Function to apply lemmatization
def apply_lemmatization(tokens):
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
    return lemmatized_tokens

# Apply stemming and lemmatization to tokens column
master_df['stemmed_tokens'] = master_df['tokens'].apply(apply_stemming)
master_df['lemmatized_tokens'] = master_df['tokens'].apply(apply_lemmatization)
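
The difference between the two steps is easiest to see on single words. The sketch below runs only the Porter stemmer, which needs no extra NLTK data; the word list is made up for illustration. Stems come from rule-based suffix stripping and are lowercased, so they are not always dictionary words. The lemmatizer instead looks words up in WordNet and, when called without a part-of-speech argument as above, treats every token as a noun.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Made-up sample tokens
words = ['running', 'Investments', 'rages', 'Pharmaceuticals']

for word in words:
    print(word, '->', stemmer.stem(word))
```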

In [None]:
# checkpoint
# stored dataframe

## Part 7: Bag of Words Model and TF-IDF


Below is the formula for TF-IDF (Term Frequency-Inverse Document Frequency):

$$\text{TF-IDF}(t, d, C) = \text{TF}(t, d) \times \text{IDF}(t, C)$$

$$\text{TF}(t, d) = \frac{\text{number of times term } t \text{ appears in document } d}{\text{total number of terms in document } d}$$

$$\text{IDF}(t, C) = \log\left(\frac{\text{total number of documents in corpus } C}{\text{number of documents containing term } t}\right)$$
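
As a worked example of these formulas, the sketch below computes TF-IDF by hand for a three-document toy corpus. The corpus and the function names `tf`, `idf`, and `tf_idf` are assumptions made for illustration.

```python
import math

# Three toy "documents" (hypothetical titles), already tokenized
corpus = [
    ['stock', 'price', 'up'],
    ['stock', 'price', 'down'],
    ['bitcoin', 'rally'],
]

def tf(term, doc):
    # Share of the document's tokens that are `term`
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Log of (number of documents / number of documents containing `term`);
    # assumes `term` occurs in at least one document
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf_idf('stock', corpus[0], corpus))  # (1/3) * log(3/2)
print(tf_idf('up', corpus[0], corpus))     # (1/3) * log(3/1)
```

The term "up" occurs in only one document, so it scores higher than "stock", which occurs in two. Note that scikit-learn's `TfidfVectorizer`, used below, does not apply these formulas literally: by default it uses raw term counts, a smoothed IDF, and L2-normalizes each row, so its values differ from this hand calculation.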

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [None]:
# Bag of Words (BoW) model
# Initialize the CountVectorizer object
count_vectorizer = CountVectorizer()

# Fit and transform the text column to generate the BoW representation
bow_matrix = count_vectorizer.fit_transform(master_df['title'])

In [None]:
# TF-IDF
# Initialize the TfidfVectorizer object
tfidf_vectorizer = TfidfVectorizer()

# Fit and transform the text column to generate the TF-IDF representation
tfidf_matrix = tfidf_vectorizer.fit_transform(master_df['title'])

## Part 8: Semantic Approach

The semantic approach to text processing involves analyzing text at a deeper level to understand its meaning and context. Unlike syntactic or statistical approaches, which focus on the structure or frequency of words, the semantic approach aims to capture the underlying semantics of language.



Some key aspects of the semantic approach to text processing are:

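1. **Word Sense Disambiguation**: Word sense disambiguation identifies which meaning of a word is intended in a given context, for example whether "bank" refers to a financial institution or the side of a river.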

2. **Semantic Similarity**: Semantic similarity measures how similar two pieces of text are in meaning. It takes into account the semantics of words and phrases, rather than just their surface-level representations. Semantic similarity can be used in various applications such as information retrieval, question answering, and recommendation systems.

3. **Sentiment Analysis**: Sentiment analysis aims to determine the sentiment or opinion expressed in a piece of text. Semantic approaches to sentiment analysis involve understanding the semantics of words and phrases, as well as their contextual relationships, to infer sentiment.

4. **Topic Modeling**: Topic modeling algorithms such as Latent Dirichlet Allocation (LDA) and Non-Negative Matrix Factorization (NMF) are used to discover underlying topics or themes in a collection of documents. These topics represent groups of words that frequently co-occur and provide insights into the content of the text.
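
As a rough illustration of the similarity idea in item 2, the sketch below computes cosine similarity between word-count vectors of short texts. This is only a lexical stand-in: it rewards shared words, not shared meaning, and a semantic measure would replace the count vectors with embeddings. The function name and sample texts are assumptions made for the example.

```python
import math
from collections import Counter

def cosine_similarity(tokens_a, tokens_b):
    # Represent each text as a word-count vector and compare the vectors' directions
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)

t1 = 'stock price up'.split()
t2 = 'stock price down'.split()
t3 = 'bitcoin rally'.split()

print(cosine_similarity(t1, t2))  # two of three words shared
print(cosine_similarity(t1, t3))  # no words shared
```

The first pair scores high even though "up" and "down" have opposite meanings, which shows why count-based similarity is not enough for sentiment-sensitive tasks.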

Major stock market indices:


1. S&P 500: `^GSPC`
2. Dow Jones Industrial Average: `^DJI`
3.

