# <center> News Classification with NLP and Neural Networks </center>

### Imports

In [1]:
import warnings
import numpy as np
import pandas as pd
import plotly.express as px
warnings.filterwarnings('ignore')

### Data

In [2]:
df = pd.read_json('News_Category_Dataset_v2.json', lines=True)
df

| | category | headline | authors | link | short_description | date |
|---|---|---|---|---|---|---|
| 0 | CRIME | There Were 2 Mass Shootings In Texas Last Week... | Melissa Jeltsen | https://www.huffingtonpost.com/entry/texas-ama... | She left her husband. He killed their children... | 2018-05-26 |
| 1 | ENTERTAINMENT | Will Smith Joins Diplo And Nicky Jam For The 2... | Andy McDonald | https://www.huffingtonpost.com/entry/will-smit... | Of course it has a song. | 2018-05-26 |
| 2 | ENTERTAINMENT | Hugh Grant Marries For The First Time At Age 57 | Ron Dicker | https://www.huffingtonpost.com/entry/hugh-gran... | The actor and his longtime girlfriend Anna Ebe... | 2018-05-26 |
| 3 | ENTERTAINMENT | Jim Carrey Blasts 'Castrato' Adam Schiff And D... | Ron Dicker | https://www.huffingtonpost.com/entry/jim-carre... | The actor gives Dems an ass-kicking for not fi... | 2018-05-26 |
| 4 | ENTERTAINMENT | Julianna Margulies Uses Donald Trump Poop Bags... | Ron Dicker | https://www.huffingtonpost.com/entry/julianna-... | The "Dietland" actress said using the bags is ... | 2018-05-26 |
| ... | ... | ... | ... | ... | ... | ... |
| 200848 | TECH | RIM CEO Thorsten Heins' 'Significant' Plans Fo... | Reuters, Reuters | https://www.huffingtonpost.com/entry/rim-ceo-t... | Verizon Wireless and AT&T are already promotin... | 2012-01-28 |
| 200849 | SPORTS | Maria Sharapova Stunned By Victoria Azarenka I... | | https://www.huffingtonpost.com/entry/maria-sha... | Afterward, Azarenka, more effusive with the pr... | 2012-01-28 |
| 200850 | SPORTS | Giants Over Patriots, Jets Over Colts Among M... | | https://www.huffingtonpost.com/entry/super-bow... | Leading up to Super Bowl XLVI, the most talked... | 2012-01-28 |
| 200851 | SPORTS | Aldon Smith Arrested: 49ers Linebacker Busted ... | | https://www.huffingtonpost.com/entry/aldon-smi... | CORRECTION: An earlier version of this story i... | 2012-01-28 |


# Cleaning

#### Check data types
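
The notebook leaves this step empty, so here is a minimal sketch of what a dtype check could look like. It runs on `sample`, a hypothetical three-row stand-in for `df` (not the real data), and assumes the `date` column should hold datetimes.

```python
import pandas as pd

# Hypothetical three-row stand-in for the news DataFrame loaded above.
sample = pd.DataFrame({
    "category": ["CRIME", "TECH", "SPORTS"],
    "headline": ["Headline one", "Headline two", "Headline three"],
    "date": pd.to_datetime(["2018-05-26", "2012-01-28", "2012-01-28"]),
})

# One dtype per column; text columns show up as object (or string) dtype.
dtypes = sample.dtypes
print(dtypes)

# Confirm the date column holds datetimes rather than plain strings.
date_is_datetime = pd.api.types.is_datetime64_any_dtype(sample["date"])
print(date_is_datetime)
```

On the real frame the same two calls would be `df.dtypes` and `pd.api.types.is_datetime64_any_dtype(df['date'])`.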

#### Check NaNs 
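
This step is also left empty. A sketch of one way to do it follows, again on a hypothetical `sample` frame rather than the real `df`. It counts empty strings as well as true NaNs, since `isna()` does not flag empty strings.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the news DataFrame; values are made up.
sample = pd.DataFrame({
    "headline": ["Headline one", "Headline two", None],
    "authors": ["Ron Dicker", "", ""],
    "short_description": ["Some text", np.nan, ""],
})

# True missing values (None / NaN) per column.
nan_counts = sample.isna().sum()

# Empty strings per column, which isna() does not count as missing.
empty_counts = (sample == "").sum()

# Side-by-side summary of both kinds of missing value.
summary = pd.DataFrame({"nan": nan_counts, "empty_string": empty_counts})
print(summary)
```

Running both counts matters for a dataset like this one, where a column can look complete under `isna()` while still holding many blank entries.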

#### Check Duplicates

In [3]:
print(df.duplicated().sum())
df = df.drop_duplicates()
df.duplicated().sum()

13


0

# EDA

#### View categories

In [None]:
print(f"There are {df['category'].nunique()} unique categories including the following:")
df['category'].value_counts()

#### View length of headlines
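
No cell follows this heading, so the snippet below is only a sketch of how headline lengths could be measured, in characters and in whitespace-separated words. `sample`, `char_len`, and `word_len` are hypothetical names; the first headline is taken from the table above and the rest are made up.

```python
import pandas as pd

# Hypothetical stand-in for df; only the first headline comes from the data.
sample = pd.DataFrame({
    "headline": [
        "Hugh Grant Marries For The First Time At Age 57",
        "A Short Made-Up Headline",
        "",
    ]
})

# Length in characters.
char_len = sample["headline"].str.len()

# Length in words: split on whitespace, then count the pieces.
word_len = sample["headline"].str.split().str.len()

# Summary statistics give a quick view of the distribution.
print(char_len.describe())
print(word_len.describe())
```

On the full dataset, either series could be passed to `px.histogram` to view the distribution, since `plotly.express` is already imported.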

#### Most Common Authors 
- The `authors` field is a single string that can contain:
    - Name(s)
    - Titles
    - Organizations
- Missing authors appear as empty strings rather than NaN
    - Replace them with 'unknown'
- 'Unknown' is the most frequent author value
- Casing is inconsistent (some names are all upper or all lower case), so apply `title()`
- Strip non-letter characters
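
The cleaning steps listed above can be sketched as a single helper. This is an illustration and not the notebook's own method. `parse_author_names` is a hypothetical function, and it assumes that the name part ends at the first comma or newline, that co-authors are joined by "and", and that the input strings in the usage example are made up.

```python
import re


def parse_author_names(raw: str) -> list[str]:
    """Return title-cased author names from a raw `authors` string."""
    # Empty or whitespace-only strings stand for a missing author.
    if not raw.strip():
        return ["Unknown"]
    # Assumption: text before the first comma or newline is the name part.
    name_part = re.split(r"[,\n]", raw, maxsplit=1)[0]
    # Assumption: multiple authors are joined by "and" (any casing).
    names = re.split(r"\s+and\s+", name_part, flags=re.IGNORECASE)
    cleaned = []
    for name in names:
        # Drop anything other than word characters, spaces, periods,
        # apostrophes, and hyphens, then normalize the casing.
        name = re.sub(r"[^\w .'\-]", "", name).strip()
        if name:
            cleaned.append(name.title())
    return cleaned or ["Unknown"]


# Made-up inputs covering the cases in the list above.
for raw in ["", "ron dicker", "Jane Doe and JOHN ROE, Contributors"]:
    print(repr(raw), "->", parse_author_names(raw))
```

Applied to the frame, this would look like `df['authors'].apply(parse_author_names)`. Splitting on newlines as well as commas keeps a title on the following line from being read as part of the name.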

In [4]:
# Replace missing authors (empty strings) with 'unknown'
df['authors'] = df['authors'].replace('', 'unknown')

# Keep the text before the first comma (the name part), title-case it,
# and split on ' And ' to separate co-authors into a list
df['author_names'] = df['authors'].apply(lambda x: x.split(',')[0].title().split(' And '))

In [5]:
print(len(df['author_names'].explode().value_counts()))
df['author_names'].explode().value_counts()

22226


Unknown                  36607
Lee Moran                 2432
Ron Dicker                1915
Reuters                   1577
Ed Mazza                  1328
                         ...  
Katrina Lantos Swett         1
Kris Hayashi                 1
Hilary Levey Friedman        1
Umit Ozdal                   1
Helen Moon                   1
Name: author_names, Length: 22226, dtype: int64

In [22]:
df[df['authors'].str.contains('And Travel')]['authors']

168836    Elyse Pasquale, Contributor\nFood And Travel W...
170022    Elyse Pasquale, Contributor\nFood And Travel W...
176020    Elyse Pasquale, Contributor\nFood And Travel W...
191232    Elyse Pasquale, Contributor\nFood And Travel W...
Name: authors, dtype: object

In [29]:
df['authors'].apply(lambda x: x.replace('\n',',').split(',')[-1].strip().title() if len(x.split(',')) > 1 else None).value_counts().head(50)

Reuters                                                                         4941
Contributor                                                                     2013
Ap                                                                               696
Writer                                                                           631
Author                                                                           565
Contributors                                                                     540
The Huffington Post                                                              460
...                                                                              449
Food52.Com                                                                       318
Contributorwriter                                                                302
Inc. And Book Author                                                             294
The Hotel Tell-All                                               