In [1]:
%matplotlib inline
# For auto-reload
%load_ext autoreload
%autoreload 2

In [2]:
# ----------------- Classics -------------------- #
import numpy as np
import pandas as pd

# ------------------- Plotting ------------------- #
import squarify
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
plt.style.use('fivethirtyeight')

# ---------------- Pandas settings --------------- #
# Removes rows and columns truncation of '...'
pd.set_option('display.max_rows', 200)
pd.set_option('display.max_columns', 200)

# ------------------- Python libs ---------------- #
import os
from pathlib import Path
import re
import sys
ROOT_PATH = Path().resolve().parent
sys.path.append(str(ROOT_PATH)) # Add folder root path



import warnings
warnings.filterwarnings(action='ignore')

Before starting any data science process we need to set our objectives. For this challenge we are given two main goals to direct our analysis:

> Objectives: 
> 
> 1. What are the most important events that occurred during the timeframe these articles were 
> captured? How do these change over time in the dataset? 
> 
> 2. Lockheed Martin is part of an expansive and ever-changing Aerospace and Defense industry. In 
> order to better understand the playing field, perform an analysis that would allow a stakeholder 
> to make an informed strategic business decision regarding an action the company should take 
> with respect to the Aerospace and Defense landscape. 

Now that we know what we are looking for, let's start with the data pre-processing and initial exploration phase.

 ## 1. Loading raw data  
 
 #### Background info from Kaggle:
 
> The publications include the New York Times, Breitbart, CNN, Business Insider, the Atlantic, Fox News, Talking Points Memo, Buzzfeed News, National Review, New York Post, the Guardian, NPR, Reuters, Vox, and the Washington Post. Sampling wasn't quite scientific; I chose publications based on my familiarity of the domain and tried to get a range of political alignments, as well as a mix of print and digital publications. By count, the publications break down accordingly:

> The data primarily falls between the years of 2016 and July 2017, although there is a not-insignificant number of articles from 2015, and a possibly insignificant number from before then.
 
 
 #### Dataset info:
``` 
articles1.csv - 50,000 news articles (Articles 1-50,000)
articles2.csv - 49,999 news articles (Articles 50,001-100,000)
articles3.csv - Articles 100,001+
```

![](https://i.imgur.com/QDPtuEv.png)



In [3]:
# Fetch file paths
file_names = ["articles1.csv", "articles2.csv", "articles3.csv"]
fpaths = [ROOT_PATH.joinpath(f"data/raw_data/{file_name}") for file_name in file_names]

# Load in individual files as dfs

def read_csv_strip(data, date_columns=None, index_col=None):
    # Use None instead of a mutable default list
    if date_columns is None:
        date_columns = []
    df = pd.read_csv(data, quotechar='"', parse_dates=date_columns, index_col=index_col)
    
    # for each column
    for col in df.columns:
        # check if the column contains string data
        if pd.api.types.is_string_dtype(df[col]):
            df[col] = df[col].str.strip() # removes leading and trailing white spaces
            # collapse runs of two or more white spaces (regex=True is required in recent pandas)
            df[col] = df[col].str.replace(r'\s{2,}', ' ', regex=True)
    df = df.replace({"":np.nan}) # if only an empty string "" remains, change it to NaN
    return df

articles1 = read_csv_strip(fpaths[0], date_columns=["date"], index_col=0)
articles2 = read_csv_strip(fpaths[1], date_columns=["date"], index_col=0)
articles3 = read_csv_strip(fpaths[2], date_columns=["date"], index_col=0)
articles1.head()

Unnamed: 0,id,title,publication,author,date,year,month,url,content
0,17283,House Republicans Fret About Winning Their Hea...,New York Times,Carl Hulse,2016-12-31,2016.0,12.0,,WASHINGTON — Congressional Republicans have a ...
1,17284,Rift Between Officers and Residents as Killing...,New York Times,Benjamin Mueller and Al Baker,2017-06-19,2017.0,6.0,,"After the bullet shells get counted, the blood..."
2,17285,"Tyrus Wong, ‘Bambi’ Artist Thwarted by Racial ...",New York Times,Margalit Fox,2017-01-06,2017.0,1.0,,"When Walt Disney’s “Bambi” opened in 1942, cri..."
3,17286,"Among Deaths in 2016, a Heavy Toll in Pop Musi...",New York Times,William McDonald,2017-04-10,2017.0,4.0,,"Death may be the great equalizer, but it isn’t..."
4,17287,Kim Jong-un Says North Korea Is Preparing to T...,New York Times,Choe Sang-Hun,2017-01-02,2017.0,1.0,,"SEOUL, South Korea — North Korea’s leader, Kim..."


In [4]:
articles2.head()

Unnamed: 0,id,title,publication,author,date,year,month,url,content
53293,73471,Patriots Day Is Best When It Digs Past the Her...,Atlantic,David Sims,2017-01-11,2017.0,1.0,,"Patriots Day, Peter Berg’s new thriller that r..."
53294,73472,A Break in the Search for the Origin of Comple...,Atlantic,Ed Yong,2017-01-11,2017.0,1.0,,"In Norse mythology, humans and our world were ..."
53295,73474,Obama’s Ingenious Mention of Atticus Finch,Atlantic,Spencer Kornhaber,2017-01-11,2017.0,1.0,,“If our democracy is to work in this increasin...
53296,73475,"Donald Trump Meets, and Assails, the Press",Atlantic,David A. Graham,2017-01-11,2017.0,1.0,,Updated on January 11 at 5:05 p. m. In his fir...
53297,73476,Trump: ’I Think’ Hacking Was Russian,Atlantic,Kaveh Waddell,2017-01-11,2017.0,1.0,,Updated at 12:25 p. m. After months of equivoc...


In [5]:
articles3.head()

Unnamed: 0,id,title,publication,author,date,year,month,url,content
103459,151908,Alton Sterling’s son: ’Everyone needs to prote...,Guardian,Jessica Glenza,2016-07-13,2016.0,7.0,https://www.theguardian.com/us-news/2016/jul/1...,The son of a Louisiana man whose father was sh...
103460,151909,Shakespeare’s first four folios sell at auctio...,Guardian,,2016-05-25,2016.0,5.0,https://www.theguardian.com/culture/2016/may/2...,Copies of William Shakespeare’s first four boo...
103461,151910,My grandmother’s death saved me from a life of...,Guardian,Robert Pendry,2016-10-31,2016.0,10.0,https://www.theguardian.com/commentisfree/2016...,"Debt: $20, 000, Source: College, credit cards,..."
103462,151911,I feared my life lacked meaning. Cancer pushed...,Guardian,Bradford Frost,2016-11-26,2016.0,11.0,https://www.theguardian.com/commentisfree/2016...,"It was late. I was drunk, nearing my 35th birt..."
103463,151912,Texas man serving life sentence innocent of do...,Guardian,,2016-08-20,2016.0,8.0,https://www.theguardian.com/us-news/2016/aug/2...,A central Texas man serving a life sentence fo...


## 2. Exploring raw data

### Missing data

Let's look at more info on all three data frames.

In [6]:
articles1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 50000 entries, 0 to 53291
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   id           50000 non-null  int64         
 1   title        49997 non-null  object        
 2   publication  50000 non-null  object        
 3   author       43693 non-null  object        
 4   date         50000 non-null  datetime64[ns]
 5   year         50000 non-null  float64       
 6   month        50000 non-null  float64       
 7   url          0 non-null      float64       
 8   content      49997 non-null  object        
dtypes: datetime64[ns](1), float64(3), int64(1), object(4)
memory usage: 3.8+ MB


In [7]:
articles2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 49999 entries, 53293 to 103457
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   id           49999 non-null  int64         
 1   title        49986 non-null  object        
 2   publication  49999 non-null  object        
 3   author       41401 non-null  object        
 4   date         47373 non-null  datetime64[ns]
 5   year         47373 non-null  float64       
 6   month        47373 non-null  float64       
 7   url          42988 non-null  object        
 8   content      49978 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(1), object(5)
memory usage: 3.8+ MB


In [8]:
articles3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 42571 entries, 103459 to 146032
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   id           42571 non-null  int64         
 1   title        42570 non-null  object        
 2   publication  42571 non-null  object        
 3   author       41599 non-null  object        
 4   date         42556 non-null  datetime64[ns]
 5   year         42556 non-null  float64       
 6   month        42556 non-null  float64       
 7   url          42571 non-null  object        
 8   content      42571 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(1), object(5)
memory usage: 3.2+ MB


In total we have `50000 + 49999 + 42571 = 142570` articles, with `url`, `content`, `date`, `title` and `author` info missing in some cases. For our scope, `id`, the redundant datetime columns (`year`, `month`) and similar fields will be unnecessary.

### Concatenate data frames

In [9]:
articles = pd.concat([articles1, articles2, articles3])

# Sanity check to make sure shapes match
assert articles.shape[0] == (articles1.shape[0] + articles2.shape[0] + articles3.shape[0])
assert articles.shape[1] == articles1.shape[1]
articles.head()

Unnamed: 0,id,title,publication,author,date,year,month,url,content
0,17283,House Republicans Fret About Winning Their Hea...,New York Times,Carl Hulse,2016-12-31,2016.0,12.0,,WASHINGTON — Congressional Republicans have a ...
1,17284,Rift Between Officers and Residents as Killing...,New York Times,Benjamin Mueller and Al Baker,2017-06-19,2017.0,6.0,,"After the bullet shells get counted, the blood..."
2,17285,"Tyrus Wong, ‘Bambi’ Artist Thwarted by Racial ...",New York Times,Margalit Fox,2017-01-06,2017.0,1.0,,"When Walt Disney’s “Bambi” opened in 1942, cri..."
3,17286,"Among Deaths in 2016, a Heavy Toll in Pop Musi...",New York Times,William McDonald,2017-04-10,2017.0,4.0,,"Death may be the great equalizer, but it isn’t..."
4,17287,Kim Jong-un Says North Korea Is Preparing to T...,New York Times,Choe Sang-Hun,2017-01-02,2017.0,1.0,,"SEOUL, South Korea — North Korea’s leader, Kim..."


### Missing values after concatenation

In [10]:
articles.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 142570 entries, 0 to 146032
Data columns (total 9 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   id           142570 non-null  int64         
 1   title        142553 non-null  object        
 2   publication  142570 non-null  object        
 3   author       126693 non-null  object        
 4   date         139929 non-null  datetime64[ns]
 5   year         139929 non-null  float64       
 6   month        139929 non-null  float64       
 7   url          85559 non-null   object        
 8   content      142546 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(1), object(5)
memory usage: 10.9+ MB


In [11]:
articles.isnull().sum()

id                 0
title             17
publication        0
author         15877
date            2641
year            2641
month           2641
url            57011
content           24
dtype: int64

In summary:

`url` - missing in many rows; since it is unimportant for our analysis, we can ignore it  
`title` - missing in a few rows; we need to check their content to see whether they are relevant   
`content` - missing in a few rows; we should check these too, as this is the most important feature    
`date` - missing in some rows; again, we need to check whether those rows are relevant   

The most important columns for our purposes are `title`, `content` and `date`.
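
To put the raw counts above in proportion, the share of missing values per column can be computed directly. The following is a minimal sketch on a made-up toy frame; `missing_report` is a hypothetical helper and not part of the original notebook.

```python
import pandas as pd

# Toy frame standing in for `articles`; the values are made up for illustration.
toy = pd.DataFrame({
    "title": ["a", None, "c", "d"],
    "content": ["x", "y", None, None],
    "date": pd.to_datetime(["2017-01-01", None, "2017-01-03", "2017-01-04"]),
})

def missing_report(df):
    """Return the count and percentage of missing values per column."""
    counts = df.isnull().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    # Columns with the most missing values come first
    return report.sort_values("missing", ascending=False)

report = missing_report(toy)
print(report)
```

Calling `missing_report(articles)` on the real data would give the same counts as `articles.isnull().sum()`, plus a percentage column.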

In [12]:
articles[articles["title"].isnull()]

Unnamed: 0,id,title,publication,author,date,year,month,url,content
47094,66261,,Business Insider,,2017-04-03,2017.0,4.0,,’’ ’ is trying desperately to shed its ”Whole ...
52032,71949,,Business Insider,,2016-10-09,2016.0,10.0,,’ ’ ’ Despite mounting calls for him to leave ...
52374,72340,,Business Insider,,2016-11-03,2016.0,11.0,,’ ’ ’ President Barack Obama delivered an impa...
68283,96065,,Talking Points Memo,,2016-02-12,2016.0,2.0,https://web.archive.org/web/20160213080712/htt...,Harry Reid tries to drop kick Alan Grayson rig...
70008,99956,,Buzzfeed News,Jessica Testa,2017-01-30,2017.0,1.0,https://web.archive.org/web/20170130044515/htt...,Spend enough time with Utah feminists and you’...
70025,99975,,Buzzfeed News,Tyler Kingkade,2017-02-25,2017.0,2.0,https://web.archive.org/web/20170225014538/htt...,"Lately, Laura Dunn has tried to avoid thinking..."
70035,99985,,Buzzfeed News,Darren Sands,2017-03-10,2017.0,3.0,https://web.archive.org/web/20170310090305/htt...,Two days after the country chose Donald Trump ...
70186,100152,,Buzzfeed News,Azeen Ghorayshi,2017-04-20,2017.0,4.0,https://web.archive.org/web/20170420233757/htt...,"A week before his 23rd birthday, Max Meehan an..."
72949,107996,,Buzzfeed News,Karla Zabludovsky,2016-05-05,2016.0,5.0,https://web.archive.org/web/20160505034853/htt...,"CIUDAD JUÁREZ, México — The police scanner buz..."
72997,108084,,Buzzfeed News,Borzou Daragahi,2016-06-01,2016.0,6.0,https://web.archive.org/web/20160601143429/htt...,"HAARLEM, Netherlands — The name had cropped up..."


It looks like there may be a duplicate article by the same author, `Borzou Daragahi`, regarding the Netherlands. We will need to look further and drop duplicated content later.

In [13]:
articles[articles["content"].isnull()]

Unnamed: 0,id,title,publication,author,date,year,month,url,content
41452,60381,Wonders of the universe,CNN,,2014-01-10,2014.0,1.0,,
41809,60745,The week in 32 photos,CNN,,2015-01-23,2015.0,1.0,,
44395,63359,Enchanting waterfront murals painted while bal...,CNN,Jacopo Prisco,2015-06-01,2015.0,6.0,,
54479,74983,Mass Effect: Andromeda Is More About Choice Th...,Atlantic,David Sims,2017-03-20,2017.0,3.0,,
70101,100055,27 Of The Most Amazing Science Photos Of 2016,Buzzfeed News,Kelly Oakes,2017-01-01,2017.0,1.0,https://web.archive.org/web/20170101112616/htt...,
70113,100068,Just A Few Of The LGBT Signs People Carried At...,Buzzfeed News,Sarah Karlan,2017-02-09,2017.0,2.0,https://web.archive.org/web/20170209162409/htt...,
70191,100158,All The Looks At The MTV Movie & TV Awards Red...,Buzzfeed News,Whitney Jefferson,2017-05-08,2017.0,5.0,https://web.archive.org/web/20170508055955/htt...,
70284,100294,23 Of The Most Powerful Photos Of The Week,Buzzfeed News,Gabriel H. Sanchez,2017-04-09,2017.0,4.0,https://web.archive.org/web/20170409173523/htt...,
70425,100523,27 Of The Most Insane Pictures Ever Taken At T...,Buzzfeed News,Gabriel H. Sanchez,2017-05-07,2017.0,5.0,https://web.archive.org/web/20170507045124/htt...,
70551,100755,A Love Letter To All My Gay Firsts,Buzzfeed News,Will Varner,2017-02-26,2017.0,2.0,https://web.archive.org/web/20170226024057/htt...,


Most of these Buzzfeed articles are not relevant to news or major events, so for our use case we can safely ignore them.

### Check for duplicates

We know the data was taken from a database, so each `id` should be unique. Let's check for duplicate `id`s and duplicate rows.

In [14]:
articles[articles.duplicated()]

Unnamed: 0,id,title,publication,author,date,year,month,url,content


In [15]:
articles[articles.duplicated(["id"])]

Unnamed: 0,id,title,publication,author,date,year,month,url,content


Now let's check for duplicated `title` values and see whether their `content` is exactly the same or different.

In [16]:
duplicated_articles = articles[articles.duplicated(["title"])]
duplicated_articles.head()

Unnamed: 0,id,title,publication,author,date,year,month,url,content
2149,19688,Right and Left: Partisan Writing You Shouldn’t...,New York Times,Anna Dubenko,2017-03-29,2017.0,3.0,,"The political news cycle is fast, and keeping ..."
2419,19986,17 Great Stories That Have Nothing to Do With ...,New York Times,Anna Dubenko and Michelle L. Dozois,2017-04-08,2017.0,4.0,,"Welcome to Our Picks, a guide to the best stuf..."
3571,21245,What to Cook This Weekend - The New York Times,New York Times,Sam Sifton,2016-08-12,2016.0,8.0,,Sam Sifton emails readers of Cooking five days...
4196,21941,What to Cook This Week - The New York Times,New York Times,Sam Sifton,2016-10-03,2016.0,10.0,,Sam Sifton emails readers of Cooking five days...
6891,25309,What to Cook This Week - The New York Times,New York Times,Sam Sifton,2016-09-19,2016.0,9.0,,Sam Sifton emails readers of Cooking five days...


In [17]:
print(f"There are {len(duplicated_articles)} duplicated articles with exact same title.")

There are 444 duplicated articles with exact same title.


Now that we know there are duplicated articles here, let's investigate them more.
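
One detail worth keeping in mind while investigating: `DataFrame.duplicated` marks only the later copies by default (`keep="first"`), so a count based on it excludes the first occurrence of each repeated title. The sketch below, on a made-up toy frame, contrasts the default with `keep=False`, which flags every member of a duplicate group.

```python
import pandas as pd

# Toy titles, made up for illustration: "A" appears 3 times, "B" twice, "C" once.
toy = pd.DataFrame({"title": ["A", "B", "A", "A", "C", "B"]})

later_copies = toy.duplicated(["title"])             # keep="first" is the default
all_members = toy.duplicated(["title"], keep=False)  # flags every row in a duplicate group

n_later = int(later_copies.sum())   # rows that repeat an earlier title
n_all = int(all_members.sum())      # every row whose title occurs more than once
n_titles = toy.loc[all_members, "title"].nunique()  # number of distinct repeated titles

print(n_later, n_all, n_titles)
```

With the default setting, filtering `duplicated_articles` by a title therefore never shows the first article that carried that title.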

In [18]:
duplicate_article_id = 25309
duplicate_article_title = duplicated_articles[duplicated_articles["id"] == duplicate_article_id]["title"].values[0]
mask = (duplicated_articles["title"] == duplicate_article_title)
nyt_duplicated_article = duplicated_articles[mask]
nyt_duplicated_article

Unnamed: 0,id,title,publication,author,date,year,month,url,content
4196,21941,What to Cook This Week - The New York Times,New York Times,Sam Sifton,2016-10-03,2016.0,10.0,,Sam Sifton emails readers of Cooking five days...
6891,25309,What to Cook This Week - The New York Times,New York Times,Sam Sifton,2016-09-19,2016.0,9.0,,Sam Sifton emails readers of Cooking five days...
7712,26417,What to Cook This Week - The New York Times,New York Times,Sam Sifton,2016-07-03,2016.0,7.0,,Sam Sifton emails readers of Cooking seven day...


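In this case all three rows share the same title, though their dates differ and the content previews are not all identical, so this looks like a recurring column rather than exact copies. Is the same true for the other repeated titles?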
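
One way to answer that question for all titles at once is to count the distinct `content` values within each repeated title. This is a rough sketch on made-up data; the variable names are illustrative and not from the original notebook.

```python
import pandas as pd

# Toy data, made up for illustration.
toy = pd.DataFrame({
    "title": ["Weekly column", "Weekly column", "Breaking", "Breaking", "Unique"],
    "content": ["week 1 text", "week 2 text", "same text", "same text", "other"],
})

# Keep every row whose title occurs more than once
repeated = toy[toy.duplicated(["title"], keep=False)]

# Number of distinct article bodies per repeated title
distinct_bodies = repeated.groupby("title")["content"].nunique()

# One distinct body: the rows are true copies; more than one: a recurring series
true_duplicates = distinct_bodies[distinct_bodies == 1].index.tolist()
recurring_series = distinct_bodies[distinct_bodies > 1].index.tolist()

print(true_duplicates, recurring_series)
```

Titles in `recurring_series` share a headline but carry different text, so dropping rows by title alone would discard distinct articles.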

In [19]:
print("------------------- Top 20 Duplicated Titles of Articles ------------------ ")
# value_counts sorts by frequency, so head(20) gives the 20 most repeated titles
top_titles = duplicated_articles["title"].value_counts(dropna=False).head(20)
for count, (k, v) in enumerate(top_titles.items(), start=1):
    print(f"{count} - {k} - {v}")

------------------- Top 20 Duplicated Titles of Articles ------------------ 
1 - nan - 16
2 - The Atlantic’s Week in Culture - 11
3 - From Whitewater to Benghazi: A Clinton-Scandal Primer - 7
4 - The 10 most important things in the world right now - 6
5 - The Many Scandals of Donald Trump: A Cheat Sheet - 6
6 - Heavy Rotation: 10 Songs Public Radio Can’t Stop Playing - 5
7 - The 2016 U.S. Presidential Race: A Cheat Sheet - 5
8 - The 3 plays in sports everybody will be talking about today - 4
9 - The Donald Trump Cabinet Tracker - 4
10 - BREAKING - 4
11 - Donald Trump’s Conflicts of Interest: A Crib Sheet - 4
12 - Premier League: 10 things to look out for this weekend - 4
13 - Where Republicans Stand on Donald Trump: A Cheat Sheet - 3
14 - Premier League: 10 talking points from the weekend’s action - 3
15 - What to Cook This Week - The New York Times - 3
16 - The Atlanticâ€™s Week in Culture - 3
17 - Which Republicans Oppose Donald Trump? A Cheat Sheet - 3
18 - Neil deGrasse Tyson and A

Let's pick one of the duplicated titles, `From Whitewater to Benghazi: A Clinton-Scandal Primer`, since it sounds like a recurring news-summary article.

In [20]:
duplicate_article_title = "From Whitewater to Benghazi: A Clinton-Scandal Primer"
mask = (duplicated_articles["title"] == duplicate_article_title)
atlantic_duplicated_article = duplicated_articles[mask]
atlantic_duplicated_article

Unnamed: 0,id,title,publication,author,date,year,month,url,content
56082,77163,From Whitewater to Benghazi: A Clinton-Scandal...,Atlantic,,2016-05-06,2016.0,5.0,,", I want to receive updates from partners and ..."
56644,77864,From Whitewater to Benghazi: A Clinton-Scandal...,Atlantic,David A. Graham,2016-06-10,2016.0,6.0,,", I want to receive updates from partners and ..."
57480,78914,From Whitewater to Benghazi: A Clinton-Scandal...,Atlantic,David A. Graham,2016-07-06,2016.0,7.0,,I want to receive updates from partners and sp...
58228,79872,From Whitewater to Benghazi: A Clinton-Scandal...,Atlantic,David A. Graham,2016-08-23,2016.0,8.0,,"For us to continue writing great stories, we n..."
58436,80117,From Whitewater to Benghazi: A Clinton-Scandal...,Atlantic,David A. Graham,2016-09-02,2016.0,9.0,,"For us to continue writing great stories, we n..."
59379,81308,From Whitewater to Benghazi: A Clinton-Scandal...,Atlantic,Emma Green,2016-10-28,2016.0,10.0,,Here’s that October surprise. The FBI will inv...
59505,81467,From Whitewater to Benghazi: A Clinton-Scandal...,Atlantic,The Editors,2016-11-06,2016.0,11.0,,"To use ArchiveBot, drop by #archivebot on EFNe..."


In [21]:
atlantic_duplicated_article.loc[56082]["content"][:500]

', I want to receive updates from partners and sponsors. Back in early March, The New York Times reported that the FBI would be interviewing Hillary Clinton and her top aides about her private email server within the coming weeks. A source told the paper the investigation would probably conclude by early May, at which point the Justice Department would be left to decide whether to file charges against Clinton or anyone else, and what charges to file. The final decision rests with Attorney General'

It looks like some of the content was repeated during scraping. We can remove the duplicated content and keep the first occurrence, so let's investigate `content` more thoroughly.
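
A quick way to see which article bodies repeat most often is `value_counts` on the `content` column. The sketch below uses made-up data and only illustrates the idea; it is not the notebook's own method.

```python
import pandas as pd

# Toy content column, made up for illustration.
toy = pd.DataFrame({
    "content": ["advertisement", "real story one", "advertisement",
                "advertisement", "real story two", "real story one"],
})

# Count how often each body occurs, most frequent first
content_counts = toy["content"].value_counts()

# Keep only bodies that occur more than once
repeated_bodies = content_counts[content_counts > 1]

print(repeated_bodies)
```

Bodies that repeat many times and are very short are good candidates for scraping placeholders rather than real articles.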

In [22]:
duplicated_articles_content = articles[articles.duplicated(["content"])]
duplicated_articles_content.head()

Unnamed: 0,id,title,publication,author,date,year,month,url,content
13010,31746,"Charlie Murphy, Comedian &amp Brother of Eddie...",Breitbart,Breitbart News,2017-04-12,2017.0,4.0,,advertisement
15722,34458,NYT: ’Angry Nationalism’ Spreads Across Europe...,Breitbart,Breitbart News,2016-06-18,2016.0,6.0,,advertisement
16257,34994,First Woman Nominated for Dean at West Point A...,Breitbart,Breitbart News,2016-04-30,2016.0,4.0,,advertisement
16724,35461,"Ryan Lochte, Three Other U.S. Olympic Swimmers...",Breitbart,Breitbart News,2016-08-14,2016.0,8.0,,advertisement
18346,37087,"Yahoo to Cut 1,700 Workers as Marissa Mayer Tr...",Breitbart,Breitbart News,2016-02-02,2016.0,2.0,,advertisement


So some of the duplicated contents are advertisements or scraping errors with missing values. Now let's drop those rows and put together a cleaned dataset.
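
As a small illustration of how dropping duplicates and dropping missing values interact, the sketch below runs both steps on a made-up toy frame. `drop_duplicates` treats missing values as equal to each other, so one missing row survives it and is only removed by the following `dropna`.

```python
import numpy as np
import pandas as pd

# Toy frame, made up for illustration.
toy = pd.DataFrame({
    "title": ["t1", "t2", "t3", "t4", "t5"],
    "content": ["advertisement", np.nan, "advertisement", np.nan, "story"],
})

# Keeps the first "advertisement" row and the first NaN row
deduped = toy.drop_duplicates(subset="content")

# Removes the NaN row that survived de-duplication
cleaned = deduped.dropna(subset=["content"])

print(cleaned)
```

Note that the first `"advertisement"` row is still present after both steps, so a placeholder body may need an explicit filter if it should be removed entirely.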

In [23]:
def clean_data(df):
    
    # Drop unneeded columns
    drop_cols = ["id", "url", "year", "month", "author"]
    df = df.drop(columns=drop_cols)
    
    # Drop duplicated articles with same content
    df = df.drop_duplicates(subset='content')
    
    # Drop rows with NAN values in content
    df = df.dropna(subset=["content"])
    return df

new_articles = clean_data(articles)
new_articles.head()

Unnamed: 0,title,publication,date,content
0,House Republicans Fret About Winning Their Hea...,New York Times,2016-12-31,WASHINGTON — Congressional Republicans have a ...
1,Rift Between Officers and Residents as Killing...,New York Times,2017-06-19,"After the bullet shells get counted, the blood..."
2,"Tyrus Wong, ‘Bambi’ Artist Thwarted by Racial ...",New York Times,2017-01-06,"When Walt Disney’s “Bambi” opened in 1942, cri..."
3,"Among Deaths in 2016, a Heavy Toll in Pop Musi...",New York Times,2017-04-10,"Death may be the great equalizer, but it isn’t..."
4,Kim Jong-un Says North Korea Is Preparing to T...,New York Times,2017-01-02,"SEOUL, South Korea — North Korea’s leader, Kim..."


In [24]:
new_articles.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 142023 entries, 0 to 146032
Data columns (total 4 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   title        142007 non-null  object        
 1   publication  142023 non-null  object        
 2   date         139387 non-null  datetime64[ns]
 3   content      142023 non-null  object        
dtypes: datetime64[ns](1), object(3)
memory usage: 5.4+ MB


In [25]:
new_articles.isnull().sum()

title            16
publication       0
date           2636
content           0
dtype: int64

## 3. Saving cleaned data

In [26]:
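# Save the cleaned data frame (create the output folder if it does not exist yet)
out_path = ROOT_PATH.joinpath("data/cleaned_data/cleaned_articles.csv")
out_path.parent.mkdir(parents=True, exist_ok=True)
new_articles.to_csv(out_path, index=False)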