# PACKAGE IMPORTS

<div class="alert alert-block alert-info"> 
These libraries cover the three tasks in this notebook: handling tabular data (pandas, numpy), pattern-based text manipulation (re), and natural language processing (nltk), which here means stop-word lists, tokenization, stemming, and lemmatization. All of them are widely used in data science, machine learning, and text analytics projects.
</div>

In [None]:
import pandas as pd
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer


# DATA LOADING

In [8]:
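# Caution: train_data is read from test.csv and test_data from train.csv.
# Swap the two paths if the file names reflect the intended split.
train_data = pd.read_csv('Data/processed/test.csv')
test_data = pd.read_csv('Data/processed/train.csv')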

In [9]:
train_data.head()

Unnamed: 0,headlines,description,content,url,category
0,NLC India wins contract for power supply to Ra...,State-owned firm NLC India Ltd (NLCIL) on Mond...,State-owned firm NLC India Ltd (NLCIL) on Mond...,https://indianexpress.com/article/business/com...,business
1,SBI Clerk prelims exams dates announced; admit...,SBI Clerk Prelims Exam: The SBI Clerk prelims ...,SBI Clerk Prelims Exam: The State Bank of Indi...,https://indianexpress.com/article/education/sb...,education
2,"Golden Globes: Michelle Yeoh, Will Ferrell, An...","Barbie is the top nominee this year, followed ...","Michelle Yeoh, Will Ferrell, Angela Bassett an...",https://indianexpress.com/article/entertainmen...,entertainment
3,"OnePlus Nord 3 at Rs 27,999 as part of new pri...",New deal makes the OnePlus Nord 3 an easy purc...,"In our review of the OnePlus Nord 3 5G, we pra...",https://indianexpress.com/article/technology/t...,technology
4,Adani family’s partners used ‘opaque’ funds to...,Citing review of files from multiple tax haven...,Millions of dollars were invested in some publ...,https://indianexpress.com/article/business/ada...,business


In [10]:
test_data.head()

Unnamed: 0,headlines,description,content,url,category
0,RBI revises definition of politically-exposed ...,The central bank has also asked chairpersons a...,The Reserve Bank of India (RBI) has changed th...,https://indianexpress.com/article/business/ban...,business
1,NDTV Q2 net profit falls 57.4% to Rs 5.55 cror...,NDTV's consolidated revenue from operations wa...,Broadcaster New Delhi Television Ltd on Monday...,https://indianexpress.com/article/business/com...,business
2,"Akasa Air ‘well capitalised’, can grow much fa...",The initial share sale will be open for public...,Homegrown server maker Netweb Technologies Ind...,https://indianexpress.com/article/business/mar...,business
3,India’s current account deficit declines sharp...,The current account deficit (CAD) was 3.8 per ...,India’s current account deficit declined sharp...,https://indianexpress.com/article/business/eco...,business
4,"States borrowing cost soars to 7.68%, highest ...",The prices shot up reflecting the overall high...,States have been forced to pay through their n...,https://indianexpress.com/article/business/eco...,business
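The `Unnamed: 0` column in both previews is what pandas typically produces when a CSV was written with its index and read back without `index_col`. The sketch below assumes that is the case here and shows one way to read such a file and run a quick sanity check before cleaning. The in-memory CSV `sample_csv` and the helper `summarize` are hypothetical stand-ins, not part of the original notebook.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for the files under Data/processed/
# (same column layout as the previews above).
sample_csv = io.StringIO(
    ",headlines,description,content,url,category\n"
    "0,Headline A,Description A,Content A,https://example.com/a,business\n"
    "1,Headline B,,Content B,https://example.com/b,education\n"
)

# index_col=0 treats the first (unnamed) column as the index instead of data.
sample = pd.read_csv(sample_csv, index_col=0)

def summarize(df):
    """Return missing-value counts and label counts for a quick sanity check."""
    text_columns = ['headlines', 'description', 'content']
    missing = df[text_columns].isna().sum()
    labels = df['category'].value_counts()
    return missing, labels

missing, labels = summarize(sample)
print(list(sample.columns))
print(missing)
print(labels)
```

Missing values matter for the cleaning step: a function that calls `re.sub` on each cell raises a `TypeError` if it receives a NaN instead of a string.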


# Data Cleaning and Preprocessing for Text Analysis

<div class="alert alert-block alert-info">  
This section prepares the text columns (headlines, description, content) for analysis. The cell downloads the required NLTK packages, loads the datasets, and cleans the text by collapsing whitespace, removing punctuation and special characters, and converting to lowercase. The text is then tokenized, stop words are removed, and each word is stemmed and then lemmatized. Finally, the tokens are joined back into strings and, optionally, the cleaned datasets are saved to CSV files.
</div>

In [11]:

# Download required NLTK data
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

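# Load the datasets (same paths as in the DATA LOADING cell: train_data is read
# from test.csv and test_data from train.csv)
train_data = pd.read_csv('Data/processed/test.csv')
test_data = pd.read_csv('Data/processed/train.csv')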

# Clean text: remove noise and punctuation, convert to lower case
def clean_text(text):
    text = re.sub(r'\s+', ' ', text)  # Collapse each run of whitespace (spaces, tabs, newlines) into one space
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)  # Remove every character that is not an ASCII letter, digit, or whitespace
    return text.strip().lower()

# Columns that hold free text; the same steps are applied to each one
text_columns = ['headlines', 'description', 'content']

stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    # Clean the raw text, then tokenize it
    tokens = word_tokenize(clean_text(text))
    # Remove stop words
    tokens = [word for word in tokens if word not in stop_words]
    # Stemming
    tokens = [stemmer.stem(word) for word in tokens]
    # Lemmatization (applied to the stems)
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    # Convert the list back to a string
    return ' '.join(tokens)

for df in (train_data, test_data):
    for column in text_columns:
        df[column] = df[column].apply(preprocess)

# Save the cleaned datasets (optional)
train_data.to_csv('train_cleaned.csv', index=False)
test_data.to_csv('test_cleaned.csv', index=False)


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Sikha\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Sikha\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Sikha\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
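Because `clean_text` deletes punctuation instead of replacing it, hyphenated words fuse into a single token, which is consistent with stems such as `stateown` and `politicallyexpos` in the cleaned output. The sketch below reproduces the function from the cell above and compares it with `clean_text_spaced`, a hypothetical variant (not part of the original notebook) that substitutes a space for each removed character before collapsing whitespace.

```python
import re

def clean_text(text):
    # Same steps as the notebook cell above.
    text = re.sub(r'\s+', ' ', text)
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    return text.strip().lower()

def clean_text_spaced(text):
    # Hypothetical variant: replace punctuation with a space, then collapse whitespace.
    text = re.sub(r'[^a-zA-Z0-9\s]', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    return text.strip().lower()

sample = "State-owned firm NLC India Ltd (NLCIL)"
print(clean_text(sample))         # stateowned firm nlc india ltd nlcil
print(clean_text_spaced(sample))  # state owned firm nlc india ltd nlcil
```

Neither version preserves decimal numbers: the original turns `57.4%` into `574`, and the variant turns it into `57 4`. Which behavior is preferable depends on whether hyphenated compounds and numeric values carry signal for the downstream task.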


In [13]:
train_data.head()

Unnamed: 0,headlines,description,content,url,category
0,nlc india win contract power suppli rajasthan ...,stateown firm nlc india ltd nlcil monday said ...,stateown firm nlc india ltd nlcil monday said ...,https://indianexpress.com/article/business/com...,business
1,sbi clerk prelim exam date announc admit card ...,sbi clerk prelim exam sbi clerk prelim exam 20...,sbi clerk prelim exam state bank indian sbi an...,https://indianexpress.com/article/education/sb...,education
2,golden globe michel yeoh ferrel angela bassett...,barbi top nomine year follow close oppenheim f...,michel yeoh ferrel angela bassett amanda seyfr...,https://indianexpress.com/article/entertainmen...,entertainment
3,oneplu nord 3 r 27999 part new price cut here,new deal make oneplu nord 3 easi purchas r 30k,review oneplu nord 3 5g prais balanc combin fe...,https://indianexpress.com/article/technology/t...,technology
4,adani famili partner use opaqu fund invest sto...,cite review file multipl tax haven intern adan...,million dollar invest publicli trade stock ind...,https://indianexpress.com/article/business/ada...,business


In [14]:
test_data.head()

Unnamed: 0,headlines,description,content,url,category
0,rbi revis definit politicallyexpos person kyc ...,central bank also ask chairperson chief execut...,reserv bank india rbi chang definit politicall...,https://indianexpress.com/article/business/ban...,business
1,ndtv q2 net profit fall 574 r 555 crore impact...,ndtv consolid revenu oper r 9555 crore r 1058 ...,broadcast new delhi televis ltd monday report ...,https://indianexpress.com/article/business/com...,business
2,akasa air well capitalis grow much faster ceo ...,initi share sale open public subscript juli 17...,homegrown server maker netweb technolog india ...,https://indianexpress.com/article/business/mar...,business
3,india current account deficit declin sharpli 1...,current account deficit cad 38 per cent gdp us...,india current account deficit declin sharpli 1...,https://indianexpress.com/article/business/eco...,business
4,state borrow cost soar 768 highest far fiscal,price shot reflect overal higher risk avers in...,state forc pay nose weekli auction debt tuesda...,https://indianexpress.com/article/business/eco...,business
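One thing to watch when reloading the saved files: if a field ends up empty after cleaning (for example, a headline made up entirely of stop words), `to_csv` writes an empty field and `read_csv` parses it as NaN by default. A minimal sketch of that round trip, using a hypothetical two-row frame and an in-memory buffer in place of `train_cleaned.csv`:

```python
import io
import pandas as pd

# Hypothetical cleaned frame: the second headline lost all of its tokens.
cleaned = pd.DataFrame({
    'headlines': ['nlc india win contract', ''],
    'category': ['business', 'education'],
})

buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)

# Default parsing turns the empty field into NaN.
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# keep_default_na=False keeps empty fields as empty strings.
buffer.seek(0)
reloaded_as_text = pd.read_csv(buffer, keep_default_na=False)

print(reloaded['headlines'].isna().sum())
print((reloaded_as_text['headlines'] == '').sum())
```

Reading with `keep_default_na=False`, or calling `fillna('')` on the text columns after loading, keeps later string operations from failing on NaN values.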
