# Text preprocessing

Dataset: https://www.kaggle.com/datasets/brendanmiles/nyt-news-dataset-20082021

We preprocess the news data, focusing on the title and abstract of each article.

In [1]:
import pandas as pd
import numpy as np
import string
from nltk.stem import WordNetLemmatizer
import contractions
from collections import Counter

In [2]:
wnl = WordNetLemmatizer()

In [3]:
df = pd.read_csv('NYT_Dataset.csv', index_col=0)

In [4]:
df = df[['title','abstract']].dropna()

In [5]:
punctuation = string.punctuation+'»«“”’'

In [6]:
import re

def clean_doc(data):
    data = data.fillna('') # replace NaN with '' before any string operation
    data = data.str.lower() # convert to lowercase
    data = data.str.replace('’s', ' ', regex=False) # drop possessive ’s
    data = data.str.replace(r'[\d\n]', ' ', regex=True) # remove digits and newlines
    data = data.apply(contractions.fix) # expand contractions into their full form
    # remove punctuation; re.escape keeps characters such as \ ] ^ - literal inside the class
    data = data.str.replace('[{}]'.format(re.escape(punctuation)), ' ', regex=True)
    data = data.apply(lambda x: x.split()) # tokenize on whitespace
    data = data.apply(lambda x: [wnl.lemmatize(y) for y in x]) # lemmatization
    data = data.apply(lambda x: ' '.join(x))
    return data
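
Interpolating a punctuation string into a regex character class needs some care, because `\`, `]`, `^` and `-` have a special meaning inside `[...]`. The following is a minimal sketch, using only the standard library, that compares a class built with and without `re.escape`. `strip_punctuation` and the sample string are hypothetical and not part of the notebook.

```python
import re
import string

# Same extra characters as the notebook's `punctuation` string.
punctuation = string.punctuation + '»«“”’'

# Without escaping, the sequence \] inside the class is read as an escaped ']',
# so the backslash itself is not part of the set.
unescaped = re.compile('[{}]'.format(punctuation))

# With re.escape, every special character is matched literally.
escaped = re.compile('[{}]'.format(re.escape(punctuation)))

def strip_punctuation(text, pattern=escaped):
    """Replace every character matched by `pattern` with a space (hypothetical helper)."""
    return pattern.sub(' ', text)

sample = 'path\\to, “quoted” text!'  # made-up input containing a backslash
print(strip_punctuation(sample, unescaped))  # the backslash survives
print(strip_punctuation(sample, escaped))    # the backslash is removed
```

The escaped variant is the safer default whenever the set of characters comes from a variable rather than being written by hand.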

In [7]:
df['abstract'] = clean_doc(df['abstract'])

In [8]:
df['title'] = clean_doc(df['title'])

In [9]:
def word_count(data):
    # number of whitespace-separated words in each text
    return [len(txt.split()) for txt in data.tolist()]

We add "ssstarttt" at the start and "eeenddd" at the end of each title.

In [10]:
df['title'] = df['title'].apply(lambda x: 'ssstarttt ' + x + ' eeenddd')
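
As a small illustration of this step, the sketch below wraps a single title in the two marker words and removes them again. The helper names `add_markers` and `strip_markers` and the sample title are assumptions made for the example. Note that the markers add two words to every title, which affects any word count taken afterwards.

```python
START, END = 'ssstarttt', 'eeenddd'  # marker words used in the notebook

def add_markers(title):
    """Wrap a cleaned title in the start and end markers (hypothetical helper)."""
    return f'{START} {title} {END}'

def strip_markers(text):
    """Drop the marker words again, e.g. before displaying a title."""
    return ' '.join(w for w in text.split() if w not in (START, END))

title = 'senate pass budget bill'  # made-up example title
marked = add_markers(title)
print(marked)
print(len(title.split()), len(marked.split()))  # word count before and after
```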

In [11]:
length_abstract = word_count(df['abstract'])
length_title = word_count(df['title'])

In [12]:
np.percentile(length_abstract,99), np.percentile(length_title,99)

(47.0, 17.0)

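99% of the abstracts have at most 47 words and 99% of the titles have at most 17 words (including the two marker words), so maximum lengths of about 50 and 20 cover almost all samples.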
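
A percentile only gives a cutoff. To check how much of the data a chosen maximum length would actually cover, the fraction can be computed directly. The sketch below uses made-up word counts and a hypothetical helper `coverage`; the cap of 50 is only an illustrative choice.

```python
def coverage(lengths, max_len):
    """Fraction of texts whose word count does not exceed max_len."""
    return sum(1 for n in lengths if n <= max_len) / len(lengths)

# Made-up word counts, not taken from the dataset.
toy_lengths = [12, 25, 31, 18, 47, 22, 60, 29, 35, 40]

print(coverage(toy_lengths, 50))
```

On the real data, the same function could be applied to `length_abstract` and `length_title`.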

In [13]:
def count_freq(data, min_freq=5):
    # count how often each word occurs across all texts
    count = Counter()
    for words in data.str.split():
        count.update(words)

    # words occurring fewer than min_freq times are considered rare
    count_rare = sum(1 for value in count.values() if value < min_freq)
    return count_rare, len(count)

In [14]:
temp = pd.concat([df['abstract'], df['title']], axis=0)
num_rare, tot_num = count_freq(temp,3)

num_rare = number of rare words (fewer than 3 occurrences in the dataset)
<br>
tot_num = total number of distinct words

In [15]:
num_rare, tot_num, tot_num-num_rare

(25483, 50370, 24887)

About 25,000 words (24,887) occur at least 3 times; we treat these as the common words.
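
The counts above only say how many words are common. A minimal sketch of collecting the common words themselves, using the same frequency threshold idea, could look like this. `common_vocab` and the toy texts are assumptions for illustration; the notebook does not define them.

```python
from collections import Counter

def common_vocab(texts, min_freq=3):
    """Return the set of words that occur at least min_freq times (hypothetical helper)."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    return {word for word, n in counts.items() if n >= min_freq}

# Made-up texts, already cleaned and lowercased.
toy = ['market rise again', 'market fall', 'market rise']
vocab = common_vocab(toy, min_freq=2)
print(sorted(vocab))
```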

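Split the data into train, validation, and test sets. The rows are not shuffled: the last 5% form the validation set, and the last 10% of the remainder form the test set.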

In [16]:
def split_data(data, split=0.1):
    # hold out the last `split` fraction of rows; the rows are not shuffled
    cut = int(data.shape[0]*(1-split))
    return data.iloc[:cut], data.iloc[cut:].reset_index(drop=True)

In [24]:
train, val = split_data(df, 0.05)

In [25]:
train, test = split_data(train, 0.1)

In [26]:
train = train.dropna()
val = val.dropna()
test = test.dropna()
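
Because the second split is taken from what remains after the first, the final proportions are roughly 85.5% train, 5% validation, and 9.5% test (0.95 × 0.9, 0.05, and 0.95 × 0.1). The sketch below reproduces the same arithmetic on row counts only. `split_sizes` is a hypothetical helper and the 1000 rows are an illustrative figure, not the size of the dataset.

```python
def split_sizes(n_rows, val_frac=0.05, test_frac=0.1):
    """Row counts after two successive tail splits (hypothetical helper)."""
    rest = int(n_rows * (1 - val_frac))    # rows kept after removing validation
    n_val = n_rows - rest
    n_train = int(rest * (1 - test_frac))  # rows kept after removing test
    n_test = rest - n_train
    return n_train, n_val, n_test

n_train, n_val, n_test = split_sizes(1000)  # 1000 rows is an assumption
print(n_train, n_val, n_test)
```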