In [2]:
!pip install -q nlpretext loguru



In [3]:
import pandas as pd
import numpy as np
from nlpretext import Preprocessor
from nlpretext.basic.preprocess import (
    lower_text,
    normalize_whitespace,
    remove_accents,
    remove_eol_characters,
    remove_multiple_spaces_and_strip_text,
    remove_punct,
    remove_stopwords,
    replace_currency_symbols,
    replace_emails,
    replace_urls,
    unpack_english_contractions,
)
from nlpretext.social.preprocess import (
    remove_emoji,
    remove_hashtag,
    remove_html_tags,
    remove_mentions,
)
from nlpretext.token.preprocess import remove_tokens_with_nonletters

import logging
import time
import datetime

import warnings
warnings.filterwarnings("ignore")

In [4]:
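import os

# Create the log directory first; logging.basicConfig does not create it.
os.makedirs('./logs', exist_ok=True)

logging.basicConfig(
    format='%(asctime)s %(levelname)-8s %(message)s',
    datefmt='%Y-%m-%d %H:%M:%S',
    level=logging.INFO,
    filename='./logs/Data Preprocessing.log',
)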

In [5]:
logging.info("==========================================================================================================")
logging.info("Data Preprocessing Started ")

In [6]:
train_df = pd.read_csv('./input/train.csv')
test_df = pd.read_csv('./input/test.csv')
validation_df = pd.read_csv('./input/validation.csv')

logging.info(f"Train Shape {train_df.shape}, Test Shape {test_df.shape}, Validation Shape {validation_df.shape}")

In [7]:
print('Article:\n', train_df.iloc[0]['article'][:2000])
print('Summary:\n', train_df.iloc[0]['highlights'])

Article:
 By . Associated Press . PUBLISHED: . 14:11 EST, 25 October 2013 . | . UPDATED: . 15:36 EST, 25 October 2013 . The bishop of the Fargo Catholic Diocese in North Dakota has exposed potentially hundreds of church members in Fargo, Grand Forks and Jamestown to the hepatitis A virus in late September and early October. The state Health Department has issued an advisory of exposure for anyone who attended five churches and took communion. Bishop John Folda (pictured) of the Fargo Catholic Diocese in North Dakota has exposed potentially hundreds of church members in Fargo, Grand Forks and Jamestown to the hepatitis A . State Immunization Program Manager Molly Howell says the risk is low, but officials feel it's important to alert people to the possible exposure. The diocese announced on Monday that Bishop John Folda is taking time off after being diagnosed with hepatitis A. The diocese says he contracted the infection through contaminated food while attending a conference for newly 

In [8]:
logging.info("Setting up preprocessing pipeline")

In [9]:
preprocessor = Preprocessor()
preprocessor.pipe(unpack_english_contractions)
preprocessor.pipe(normalize_whitespace)
# Replace e-mails and URLs and strip HTML before the steps below remove
# '@', ':' or '-', which would otherwise break those patterns apart.
preprocessor.pipe(replace_emails)
preprocessor.pipe(replace_urls)
preprocessor.pipe(remove_html_tags)
# Social-media artifacts; each step is registered once.
preprocessor.pipe(remove_mentions)
preprocessor.pipe(remove_hashtag)
preprocessor.pipe(remove_emoji)
# Strip only the listed marks, so sentence-ending periods are kept.
preprocessor.pipe(remove_punct, args={'marks': '])\'\",;:-([?'})
preprocessor.pipe(remove_accents)
preprocessor.pipe(replace_currency_symbols)
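
Assuming the piped steps run in the order they are added, it can matter whether URL replacement runs before or after punctuation removal. The sketch below illustrates this with two hypothetical stand-in functions built on `re`; they are not nlpretext's implementations, and the regular expressions and sample string are made up for illustration.

```python
import re

# Hypothetical stand-ins for illustration only (NOT nlpretext's code).
URL_RE = re.compile(r"https?://\S+")
# Same marks as passed to remove_punct above: ] ) ' " , ; : - ( [ ?
PUNCT_RE = re.compile(r"[\])'\",;:\-(\[?]+")


def replace_urls_sketch(text):
    """Replace anything that looks like an http(s) URL with a placeholder."""
    return URL_RE.sub("*URL*", text)


def remove_punct_sketch(text):
    """Replace each run of the listed marks with a single space."""
    return PUNCT_RE.sub(" ", text)


def run_pipeline(text, steps):
    """Apply each step to the text, in list order."""
    for step in steps:
        text = step(text)
    return text


sample = "Read more: https://example.com/a-b"
urls_first = run_pipeline(sample, [replace_urls_sketch, remove_punct_sketch])
punct_first = run_pipeline(sample, [remove_punct_sketch, replace_urls_sketch])

print(urls_first)   # the URL is replaced by the placeholder
print(punct_first)  # the URL was split up before it could be matched
```

In this toy setup, stripping ':' and '-' first leaves nothing for the URL pattern to match, so the order of the steps changes the result.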

In [10]:
sample_train_df = train_df
sample_train_df.head()

Unnamed: 0,id,article,highlights
0,0001d1afc246a7964130f43ae940af6bc6c57f01,By . Associated Press . PUBLISHED: . 14:11 EST...,"Bishop John Folda, of North Dakota, is taking ..."
1,0002095e55fcbd3a2f366d9bf92a95433dc305ef,(CNN) -- Ralph Mata was an internal affairs li...,Criminal complaint: Cop used his role to help ...
2,00027e965c8264c35cc1bc55556db388da82b07f,A drunk driver who killed a young woman in a h...,"Craig Eccleston-Todd, 27, had drunk at least t..."
3,0002c17436637c4fe1837c935c04de47adb18e9a,(CNN) -- With a breezy sweep of his pen Presid...,Nina dos Santos says Europe must be ready to a...
4,0003ad6ef0c37534f80b55b4235108024b407f0b,Fleetwood are the only team still to have a 10...,Fleetwood top of League One after 2-0 win at S...


In [11]:
logging.info(f"Cleaning Train dataset")
sample_train_df['article_cleaned'] = sample_train_df['article'].apply(preprocessor.run)

In [12]:
print('Article:\n', sample_train_df.iloc[4]['article'])
print('Article Cleaned:\n', sample_train_df.iloc[4]['article_cleaned'])

Article:
 Fleetwood are the only team still to have a 100% record in Sky Bet League One as a 2-0 win over Scunthorpe sent Graham Alexander’s men top of the table. The Cod Army are playing in the third tier for the first time in their history after six promotions in nine years and their remarkable ascent shows no sign of slowing with Jamie Proctor and Gareth Evans scoring the goals at Glanford Park. Fleetwood were one of five teams to have won two out of two but the other four clubs - Peterborough, Bristol City, Chesterfield and Crawley - all hit their first stumbling blocks. Posh were defeated 2-1 by Sheffield United, who had lost both of their opening contests. Jose Baxter’s opener gave the Blades a first-half lead, and although it was later cancelled out by Shaun Brisley’s goal, Ben Davies snatched a winner six minutes from time. In the lead: Jose Baxter (right) celebrates opening the scoring for Sheffield United . Up for the battle: Sheffield United's Michael Doyle (left) challenges

In [13]:
# Replace the raw article column with the cleaned text.
sample_train_df = sample_train_df.drop(columns=['article'])
sample_train_df = sample_train_df.rename(columns={'article_cleaned': 'article'})

In [14]:
# Save the preprocessed train data for fine-tuning models
sample_train_df.to_csv('./input/train_cleaned.csv')
logging.info("Cleaned Train dataset saved into './input/train_cleaned.csv' file")

In [15]:
logging.info(f"Cleaning Test dataset")
sample_test_df = test_df
sample_test_df.head()

Unnamed: 0,id,article,highlights
0,92c514c913c0bdfe25341af9fd72b29db544099b,Ever noticed how plane seats appear to be gett...,Experts question if packed out planes are put...
1,2003841c7dc0e7c5b1a248f9cd536d727f27a45a,A drunk teenage boy had to be rescued by secur...,Drunk teenage boy climbed into lion enclosure ...
2,91b7d2311527f5c2b63a65ca98d21d9c92485149,Dougie Freedman is on the verge of agreeing a ...,Nottingham Forest are close to extending Dougi...
3,caabf9cbdf96eb1410295a673e953d304391bfbb,Liverpool target Neto is also wanted by PSG an...,Fiorentina goalkeeper Neto has been linked wit...
4,3da746a7d9afcaa659088c8366ef6347fe6b53ea,Bruce Jenner will break his silence in a two-h...,"Tell-all interview with the reality TV star, 6..."


In [16]:
sample_test_df['article_cleaned'] = sample_test_df['article'].apply(preprocessor.run)

In [17]:
print('Article:\n', sample_test_df.iloc[4]['article'])
print('Article Cleaned:\n', sample_test_df.iloc[4]['article_cleaned'])

Article:
 Bruce Jenner will break his silence in a two-hour interview with Diane Sawyer later this month. The former Olympian and reality TV star, 65, will speak in a 'far-ranging' interview with Sawyer for a special edition of '20/20' on Friday April 24, ABC News announced on Monday. The interview comes amid growing speculation about the father-of-six's transition to a woman, and follows closely behind his involvement in a deadly car crash in California in February. And while the Kardashian women are known for enjoying center stage, they will not be stealing Bruce's spotlight because they will be in Armenia when the interview airs, according to TMZ. Scroll down for video . Speaking out: Bruce Jenner, pictured on 'Keeping Up with the Kardashians' will speak out in a 'far-ranging' interview with Diane Sawyer later this month, ABC News announced on Monday . Return: Diane Sawyer, who recently mourned the loss of her husband, will return to ABC for the interview . Rumors started swirling a

In [18]:
# Replace the raw article column with the cleaned text.
sample_test_df = sample_test_df.drop(columns=['article'])
sample_test_df = sample_test_df.rename(columns={'article_cleaned': 'article'})

In [19]:
# Save the preprocessed test data for evaluating fine-tuned models
sample_test_df.to_csv('./input/test_cleaned.csv')
logging.info("Cleaned Test dataset saved into './input/test_cleaned.csv' file")

In [20]:
logging.info(f"Cleaning Validation dataset")
sample_validation_df = validation_df
sample_validation_df.head()

Unnamed: 0,id,article,highlights
0,61df4979ac5fcc2b71be46ed6fe5a46ce7f071c3,"Sally Forrest, an actress-dancer who graced th...","Sally Forrest, an actress-dancer who graced th..."
1,21c0bd69b7e7df285c3d1b1cf56d4da925980a68,A middle-school teacher in China has inked hun...,Works include pictures of Presidential Palace ...
2,56f340189cd128194b2e7cb8c26bb900e3a848b4,A man convicted of killing the father and sist...,"Iftekhar Murtaza, 29, was convicted a year ago..."
3,00a665151b89a53e5a08a389df8334f4106494c2,Avid rugby fan Prince Harry could barely watch...,Prince Harry in attendance for England's crunc...
4,9f6fbd3c497c4d28879bebebea220884f03eb41a,A Triple M Radio producer has been inundated w...,Nick Slater's colleagues uploaded a picture to...


In [21]:
sample_validation_df['article_cleaned'] = sample_validation_df['article'].apply(preprocessor.run)

In [22]:
print('Article:\n', sample_validation_df.iloc[4]['article'])
print('Article Cleaned:\n', sample_validation_df.iloc[4]['article_cleaned'])

Article:
 A Triple M Radio producer has been inundated with messages from prospective partners after a workplace ploy. After Tuesday's Grill Team show, hosts Matty Johns, Mark Geyer and Gus Worland uploaded a picture of 26-year-old Nick Slater to Facebook with a mobile number where people could reach him. In less than 24 hours, he had received over 130 messages from a varied range of male and female listeners, reports News.com. Triple M producer Nick Slater, (C), pictured with his Grill Team hosts, was flooded with 130 voicemails in 24 hours . Workmates and Grill Team hosts Matty Johns, Mark Geyer and Gus Worland uploaded a picture of 26-year-old Nick Slater to Facebook with a mobile number where people could reach out . The ploy came about after a waitress handed the audio engineer her number while out at some work drinks. Unconvinced it was a one off, his colleagues decided to put it to the test and see if anyone else was romantically interested in him. 'The Producers had a few drink

In [23]:
# Replace the raw article column with the cleaned text.
sample_validation_df = sample_validation_df.drop(columns=['article'])
sample_validation_df = sample_validation_df.rename(columns={'article_cleaned': 'article'})

In [24]:
# Save the preprocessed validation data for fine-tuning models
sample_validation_df.to_csv('./input/validation_cleaned.csv')
logging.info("Cleaned Validation dataset saved into './input/validation_cleaned.csv' file")

In [25]:
logging.info("Data Preprocessing Completed ")
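
The train, test and validation cells above repeat the same three steps: clean the `article` column, swap it in for the raw text, and save the result. One possible refactor is sketched below. The helper name `clean_split` is hypothetical, and `str.lower` stands in for `preprocessor.run` so the sketch runs without nlpretext. Unlike the cells above, it overwrites the column in place, so `article` keeps its original position.

```python
import os
import tempfile

import pandas as pd


def clean_split(df, clean, out_path, column="article"):
    """Return a copy of df with `column` cleaned by `clean`, saved to out_path.

    `clean` is any callable that maps one string to one string.
    """
    cleaned = df.copy()  # leave the caller's frame untouched
    cleaned[column] = cleaned[column].apply(clean)
    cleaned.to_csv(out_path)  # default settings, as in the cells above
    return cleaned


# Toy usage with a made-up one-row frame; str.lower stands in for preprocessor.run.
toy = pd.DataFrame({"id": ["a1"], "article": ["Hello  World"], "highlights": ["Hi"]})
out_path = os.path.join(tempfile.mkdtemp(), "toy_cleaned.csv")
result = clean_split(toy, str.lower, out_path)
print(result["article"].iloc[0])
```

With the notebook's objects, a call would look like `clean_split(train_df, preprocessor.run, './input/train_cleaned.csv')`, and likewise for the other two splits.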

In [27]:
df = pd.read_csv('./input/train_cleaned.csv')
# The fourth column is the index that to_csv wrote by default ('Unnamed: 0').
df.shape

(287113, 4)
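
The four columns in the shape above are consistent with `to_csv` having written the DataFrame index, which `read_csv` then loads back as an `Unnamed: 0` column. A minimal sketch with a made-up two-row frame shows this round trip, along with the `index=False` alternative that keeps only the data columns:

```python
import io

import pandas as pd

# Made-up frame with the same three columns as the cleaned dataset.
toy = pd.DataFrame(
    {"id": ["a1", "b2"], "highlights": ["h1", "h2"], "article": ["t1", "t2"]}
)

# Default to_csv writes the index as an unnamed first column.
with_index = pd.read_csv(io.StringIO(toy.to_csv()))

# index=False writes only the data columns.
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

print(with_index.shape, list(with_index.columns))
print(without_index.shape, list(without_index.columns))
```

Passing `index=False` when saving, or `index_col=0` when loading, are two ways to avoid carrying the extra column into later steps.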