# Sentiment Analysis of News Headlines

## Contents
- Import data
- Initial data exploration

## News Category Dataset

This dataset contains around 200k news headlines published by HuffPost from 2012 to 2018. A model trained on this dataset could be used to identify tags for untracked news articles or to identify the type of language used in different news articles.

Applications of deep learning for text data:

- Document classification
- Article labelling
- Sentiment analysis
- Author identification
- Question answering
- Language detection
- Translation tasks

Sources: 
- [News Category Dataset](https://www.kaggle.com/datasets/rmisra/news-category-dataset?resource=download)
- [[NLP]News_articles_classif (Wordembeddings&RNN)](https://www.kaggle.com/code/avikumart/nlp-news-articles-classif-wordembeddings-rnn)

In [1]:
# NLTK and the resources it downloads
import nltk
nltk.download('punkt')    # sentence and word tokenizer models
nltk.download('stopwords')    # stopword lists
nltk.download('wordnet')    # WordNet database, used by the lemmatizer
nltk.download('averaged_perceptron_tagger')    # part-of-speech tagger
nltk.download('maxent_ne_chunker')    # named-entity chunker

from nltk.tokenize import sent_tokenize, word_tokenize    # tokenizers
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from nltk import pos_tag
from nltk import ne_chunk

# Stemmer and lemmatizer
from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

from nltk.probability import FreqDist
import re    # regular expressions

# Data handling, plotting, serialization and file-system access
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle as pk
import os    # for reading file names

# Silence all warnings for the rest of the notebook
import warnings
warnings.filterwarnings("ignore")

# HTML parsing
from bs4 import BeautifulSoup as bs

import unicodedata
from wordcloud import WordCloud




### Import Data
Common JSON import scenarios:
- Reading simple JSON from a local file
- Reading simple JSON from a URL
- Flattening a nested list from a JSON object
- Flattening a nested list and dict from a JSON object
- Extracting a value from deeply nested JSON

Sources:
- [How to convert JSON into a Pandas DataFrame](https://towardsdatascience.com/how-to-convert-json-into-a-pandas-dataframe-100b2ae1e0d8)
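
As a rough illustration of the scenarios listed above, the sketch below reads JSON Lines text with `pd.read_json(..., lines=True)` and flattens a nested object with `pd.json_normalize`. The sample records, the field names and the `nested` structure are invented for this example and are not taken from the dataset.

```python
import io

import pandas as pd

# Hypothetical JSON Lines text: one JSON record per line.
jsonl = (
    '{"category": "TECH", "headline": "First headline"}\n'
    '{"category": "SPORTS", "headline": "Second headline"}\n'
)
lines_df = pd.read_json(io.StringIO(jsonl), lines=True)

# Hypothetical nested object holding a list of records plus nested dicts.
nested = {
    "source": "example-feed",
    "meta": {"language": "en", "editor": {"name": "A. Editor"}},
    "articles": [
        {"headline": "First headline", "category": "TECH"},
        {"headline": "Second headline", "category": "SPORTS"},
    ],
}

# record_path selects the nested list to expand into rows;
# meta copies parent-level (and deeply nested) values onto every row.
flat = pd.json_normalize(
    nested,
    record_path="articles",
    meta=["source", ["meta", "language"], ["meta", "editor", "name"]],
)

print(lines_df)
print(flat)
```

Nested `meta` paths become dot-separated column names such as `meta.editor.name`, which is one way to pull a single value out of deeply nested JSON.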

In [3]:
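# Path to the raw dataset (JSON Lines: one JSON record per line)
file = r'D:\Project\News_Analytics\news_analytics_sentiment\data\raw\archive\News_Category_Dataset_v2.json'

# pd.read_json opens the path itself, so no separate open() call is needed;
# lines=True parses each line as its own record
df = pd.read_json(file, lines=True)

# Preview the first five rows
df.head()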

| | category | headline | authors | link | short_description | date |
|---|---|---|---|---|---|---|
| 0 | CRIME | There Were 2 Mass Shootings In Texas Last Week... | Melissa Jeltsen | https://www.huffingtonpost.com/entry/texas-ama... | She left her husband. He killed their children... | 2018-05-26 |
| 1 | ENTERTAINMENT | Will Smith Joins Diplo And Nicky Jam For The 2... | Andy McDonald | https://www.huffingtonpost.com/entry/will-smit... | Of course it has a song. | 2018-05-26 |
| 2 | ENTERTAINMENT | Hugh Grant Marries For The First Time At Age 57 | Ron Dicker | https://www.huffingtonpost.com/entry/hugh-gran... | The actor and his longtime girlfriend Anna Ebe... | 2018-05-26 |
| 3 | ENTERTAINMENT | Jim Carrey Blasts 'Castrato' Adam Schiff And D... | Ron Dicker | https://www.huffingtonpost.com/entry/jim-carre... | The actor gives Dems an ass-kicking for not fi... | 2018-05-26 |
| 4 | ENTERTAINMENT | Julianna Margulies Uses Donald Trump Poop Bags... | Ron Dicker | https://www.huffingtonpost.com/entry/julianna-... | The "Dietland" actress said using the bags is ... | 2018-05-26 |


### Dataframe properties

#### Properties
- Size and shape of the dataset
- Data issues
- Unique authors
- Unique categories
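
The properties listed above could be gathered in a single pass with a small helper like the one sketched below. `summarize_frame` and the `toy` frame are hypothetical names and data introduced only for illustration. The helper counts empty strings separately because pandas' non-null counts do not flag them, so a text column may look complete while still holding blank values.

```python
import pandas as pd


def summarize_frame(frame: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, null count, empty-string count and unique count per column."""
    # Count values that are strings containing only whitespace (or nothing).
    empty = {
        col: int(frame[col].map(lambda v: isinstance(v, str) and v.strip() == "").sum())
        for col in frame.columns
    }
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "nulls": frame.isna().sum(),
        "empty_strings": pd.Series(empty),
        "unique": frame.nunique(),
    })


# Toy data, invented for this example.
toy = pd.DataFrame({
    "category": ["TECH", "SPORTS", "TECH"],
    "headline": ["First headline", "Second headline", "Third headline"],
    "authors": ["A. Writer", "", "A. Writer"],
})

summary = summarize_frame(toy)
print(toy.shape)
print(summary)
```

Calling `summarize_frame(df)` on the loaded dataset would give the same kind of overview; note that an empty string counts as its own unique value.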

In [4]:
# Dataframe information
# df.info() prints its report itself and returns None, so it is not wrapped in print()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200853 entries, 0 to 200852
Data columns (total 6 columns):
 #   Column             Non-Null Count   Dtype         
---  ------             --------------   -----         
 0   category           200853 non-null  object        
 1   headline           200853 non-null  object        
 2   authors            200853 non-null  object        
 3   link               200853 non-null  object        
 4   short_description  200853 non-null  object        
 5   date               200853 non-null  datetime64[ns]
dtypes: datetime64[ns](1), object(5)
memory usage: 9.2+ MB


In [5]:
# Dataframe shape
nrows, ncols = df.shape
print(nrows, ncols)

200853 6


In [10]:
# Unique news categories, followed by how many there are
unique_news_categories = df['category'].unique()
print(unique_news_categories, '\n', len(unique_news_categories))

# Unique authors, followed by how many there are
unique_authors = df['authors'].unique()
print(unique_authors, '\n', len(unique_authors))

['CRIME' 'ENTERTAINMENT' 'WORLD NEWS' 'IMPACT' 'POLITICS' 'WEIRD NEWS'
 'BLACK VOICES' 'WOMEN' 'COMEDY' 'QUEER VOICES' 'SPORTS' 'BUSINESS'
 'TRAVEL' 'MEDIA' 'TECH' 'RELIGION' 'SCIENCE' 'LATINO VOICES' 'EDUCATION'
 'COLLEGE' 'PARENTS' 'ARTS & CULTURE' 'STYLE' 'GREEN' 'TASTE'
 'HEALTHY LIVING' 'THE WORLDPOST' 'GOOD NEWS' 'WORLDPOST' 'FIFTY' 'ARTS'
 'WELLNESS' 'PARENTING' 'HOME & LIVING' 'STYLE & BEAUTY' 'DIVORCE'
 'WEDDINGS' 'FOOD & DRINK' 'MONEY' 'ENVIRONMENT' 'CULTURE & ARTS'] 
 41
['Melissa Jeltsen' 'Andy McDonald' 'Ron Dicker' ...
 'Courtney Garcia, Contributor\nI tell stories and drink wine.'
 'Mateo Gutierrez, Contributor\nArtist'
 'John Giacobbi, Contributor\nTales from the Interweb by The Web Sheriff'] 
 27993
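
The output above hints at two data issues: some category labels look like variants of each other ('THE WORLDPOST' and 'WORLDPOST', 'ARTS & CULTURE' and 'CULTURE & ARTS'), and some author strings carry a contributor bio after a newline. The sketch below shows one possible cleanup. `CATEGORY_ALIASES` and `clean_author` are hypothetical names, and which labels to merge is an assumption that should be checked against the data before it is applied.

```python
import pandas as pd

# Assumed mapping of variant labels to a single label (a judgment call).
CATEGORY_ALIASES = {
    "THE WORLDPOST": "WORLDPOST",
    "CULTURE & ARTS": "ARTS & CULTURE",
}


def clean_author(raw: str) -> str:
    """Keep the text before the first newline and drop a trailing ', Contributor'."""
    name = raw.split("\n", 1)[0].strip()
    suffix = ", Contributor"
    if name.endswith(suffix):
        name = name[: -len(suffix)]
    return name


# Small sample built from values visible in the output above.
sample = pd.DataFrame({
    "category": ["THE WORLDPOST", "WORLDPOST", "CULTURE & ARTS"],
    "authors": [
        "Mateo Gutierrez, Contributor\nArtist",
        "Ron Dicker",
        "Courtney Garcia, Contributor\nI tell stories and drink wine.",
    ],
})

sample["category"] = sample["category"].replace(CATEGORY_ALIASES)
sample["authors"] = sample["authors"].map(clean_author)
print(sample)
```

This handles only the single-author pattern visible in the output; rows that list several authors in one string would need separate treatment.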
