
**1. Import Libraries**

In [46]:
# Install necessary libraries
!pip install nltk
!pip install numpy pandas scikit-learn tensorflow



In [47]:
# Import the required libraries
import numpy as np
import pandas as pd
import nltk
import re

**2. Data Loading**

In [48]:
from google.colab import drive
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [49]:
# Load the dataset
df = pd.read_csv('/content/drive/My Drive/dataset.csv')

# View the first few rows of the dataset
df.head()

,title,text,subject,date,label
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017",0
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017",0
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017",0
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017",0
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017",0


In [50]:
df.describe()

,label
count,44898.0
mean,0.477015
std,0.499477
min,0.0
25%,0.0
50%,0.0
75%,1.0
max,1.0


In the `label` column, 0 marks an article as **False** and 1 marks it as **True**.

In [51]:
df['label'].value_counts()

label,count
0,23481
1,21417


In [52]:
df['subject'].unique()

array(['News', 'politics', 'Government News', 'left-news', 'US_News',
       'Middle-east', 'politicsNews', 'worldnews'], dtype=object)

**3. Data Cleaning**

Focus: Improving the quality of the dataset by removing or correcting inaccuracies.

1. Handling Missing Values: Identifying and handling missing data points to avoid issues during analysis or model training.

2. Removing Duplicates: Identifying and removing duplicate entries so that each article appears once and repeated texts do not bias the analysis.

3. Data Type Conversion: Ensuring that each column has a data type appropriate for analysis.
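The three steps above can be sketched on a small, made-up DataFrame. The frame `toy` and its values are hypothetical and only stand in for the real dataset; this is a minimal illustration, not the notebook's actual cleaning code.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    "text": ["a story", "a story", None, "another story"],
    "label": ["0", "0", "1", "1"],  # labels stored as strings on purpose
})

# 1. Missing values: count them per column, then drop incomplete rows
null_counts = toy.isnull().sum()
toy = toy.dropna(subset=["text"])

# 2. Duplicates: keep the first occurrence of each text
toy = toy.drop_duplicates(subset="text")

# 3. Data types: cast the label column from strings to integers
toy["label"] = toy["label"].astype(int)

print(null_counts)
print(toy)
```

Dropping rows is only one way to handle missing values; whether to drop or fill them depends on how many rows are affected.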

**3.1 Handling Null Values**

In [53]:
# View the shape of the dataset
print(f"Dataset shape: {df.shape}")

Dataset shape: (44898, 5)


In [54]:
# Check for null values
print("Null values in each column:")
print(df.isnull().sum())

Null values in each column:
title      0
text       0
subject    0
date       0
label      0
dtype: int64


**3.2 Removing Duplicates**

In [55]:
# Check for duplicates in the 'text' column
print(f"Number of duplicate texts: {df['text'].duplicated().sum()}")

# Drop duplicates based on the 'text' column
df = df.drop_duplicates(subset='text')

# View the shape of the dataset after removing duplicates
print(f"Dataset shape after removing duplicates in text column: {df.shape}")


Number of duplicate texts: 6252
Dataset shape after removing duplicates in text column: (38646, 5)


In [56]:
# Count the labels after dropping duplicate texts
df['label'].value_counts()

label,count
1,21191
0,17455
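Comparing the label counts before and after deduplication shows where the duplicates came from. The short calculation below uses only the counts printed by the two `value_counts()` calls; the variable names are illustrative.

```python
# Label counts copied from the two value_counts() outputs
before = {0: 23481, 1: 21417}
after = {0: 17455, 1: 21191}

# Rows removed per label by drop_duplicates
removed = {label: before[label] - after[label] for label in before}

# Share of label-1 rows before and after deduplication
share_true_before = before[1] / sum(before.values())
share_true_after = after[1] / sum(after.values())

print(removed)
print(round(share_true_before, 3), round(share_true_after, 3))
```

Most of the 6252 removed rows carried label 0, so label 1 moves from the minority class to the majority class after deduplication.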


**3.3 Data type checking**

In [57]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 38646 entries, 0 to 44897
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    38646 non-null  object
 1   text     38646 non-null  object
 2   subject  38646 non-null  object
 3   date     38646 non-null  object
 4   label    38646 non-null  int64 
dtypes: int64(1), object(4)
memory usage: 1.8+ MB
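The `date` column is reported as `object`, meaning it holds plain strings. If dates are needed later, one possible conversion is sketched below. It assumes the "Month day, year" format seen in the preview rows; the sample Series is hypothetical, and real rows in other formats would become `NaT` rather than raise an error.

```python
import pandas as pd

# Hypothetical sample; the first two values follow the format shown in df.head()
dates = pd.Series(["December 31, 2017", "December 25, 2017", "not a date"])

# errors="coerce" turns unparseable strings into NaT instead of raising
parsed = pd.to_datetime(dates, format="%B %d, %Y", errors="coerce")

print(parsed)
print("Unparsed rows:", parsed.isna().sum())
```

Counting the `NaT` values after conversion is a quick way to see how many rows do not match the assumed format.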


**4. Data Preprocessing**

Focus: Preparing the data for analysis or machine learning by transforming it into a suitable format.

1. Text Cleaning: Converting text to lowercase and removing special characters, numbers, and extra whitespace.

2. Tokenization: Breaking down text into individual words or tokens for analysis.

3. Stopwords Removal: Removing common words that do not contribute to the meaning of the text.

4. Lemmatization: Reducing words to their base (dictionary) forms so that variants of the same word are treated as one token.
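The four steps can be previewed end to end on a single sentence. The sketch below uses only the standard library: `STOP_WORDS` and `LEMMAS` are tiny hypothetical stand-ins for NLTK's stopword list and WordNet, and the whitespace split is a simplification of a real tokenizer.

```python
import re

# Hypothetical miniature resources; the notebook uses NLTK's full lists instead
STOP_WORDS = {"the", "were", "by", "of", "a"}
LEMMAS = {"senators": "senator", "votes": "vote"}  # toy lookup table

def preprocess(text):
    # 1. Text cleaning: lowercase, replace non-letters, collapse whitespace
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    text = re.sub(r"\s+", " ", text).strip()
    # 2. Tokenization: naive whitespace split
    tokens = text.split()
    # 3. Stopword removal
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # 4. Lemmatization: dictionary lookup, unknown words pass through
    return [LEMMAS.get(t, t) for t in tokens]

print(preprocess("The 52 votes were counted by the Senators!"))
# → ['vote', 'counted', 'senator']
```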

**4.1 Text cleaning**

In [58]:
import pandas as pd
import re

def clean_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove special characters; \W alone keeps underscores, so match "_" explicitly
    text = re.sub(r'[\W_]', ' ', text)
    # Remove digits
    text = re.sub(r'\d', '', text)
    # Collapse runs of whitespace into a single space and trim the ends
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Apply text cleaning to the 'text' column
df['cleaned_text'] = df['text'].apply(clean_text)

# View the cleaned text data
df[['text', 'cleaned_text']].head()


,text,cleaned_text
0,Donald Trump just couldn t wish all Americans ...,donald trump just couldn t wish all americans ...
1,House Intelligence Committee Chairman Devin Nu...,house intelligence committee chairman devin nu...
2,"On Friday, it was revealed that former Milwauk...",on friday it was revealed that former milwauke...
3,"On Christmas day, Donald Trump announced that ...",on christmas day donald trump announced that h...
4,Pope Francis used his annual Christmas Day mes...,pope francis used his annual christmas day mes...
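A detail worth checking in the cleaning regex: the class `\W` excludes letters, digits, and the underscore, so a substitution on `\W` alone leaves underscores in place, while `[\W_]` removes them too. The sketch below compares both patterns on a hypothetical sample string; `clean_with` is an illustrative helper, not part of the notebook.

```python
import re

def clean_with(pattern, text):
    # Same steps as the cleaning function, with the special-character pattern swappable
    text = text.lower()
    text = re.sub(pattern, ' ', text)
    text = re.sub(r'\d', '', text)
    return re.sub(r'\s+', ' ', text).strip()

sample = "Breaking_News: 100% TRUE!!  Read more..."  # hypothetical input

keeps_underscore = clean_with(r'\W', sample)
drops_underscore = clean_with(r'[\W_]', sample)

print(keeps_underscore)  # → breaking_news true read more
print(drops_underscore)  # → breaking news true read more
```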


**4.2 Tokenization**

In [59]:
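from nltk.tokenize import word_tokenize

# word_tokenize needs the Punkt tokenizer models; newer NLTK releases
# look for the 'punkt_tab' resource instead of 'punkt'
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)

def tokenize_text(text):
    return word_tokenize(text)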

# Apply tokenization to the 'cleaned_text' column
df['tokens'] = df['cleaned_text'].apply(tokenize_text)

# View the tokenized text data
df[['cleaned_text', 'tokens']].head()


,cleaned_text,tokens
0,donald trump just couldn t wish all americans ...,"[donald, trump, just, couldn, t, wish, all, am..."
1,house intelligence committee chairman devin nu...,"[house, intelligence, committee, chairman, dev..."
2,on friday it was revealed that former milwauke...,"[on, friday, it, was, revealed, that, former, ..."
3,on christmas day donald trump announced that h...,"[on, christmas, day, donald, trump, announced,..."
4,pope francis used his annual christmas day mes...,"[pope, francis, used, his, annual, christmas, ..."


**4.3 Stop words removal**

In [60]:
from nltk.corpus import stopwords

# Download stopwords if not already downloaded
nltk.download('stopwords')

# Initialize stopwords
stop_words = set(stopwords.words('english'))

def remove_stopwords(tokens):
    return [word for word in tokens if word not in stop_words]

# Apply stopwords removal to the 'tokens' column
df['filtered_tokens'] = df['tokens'].apply(remove_stopwords)

# View the filtered tokens
df[['tokens', 'filtered_tokens']].head()


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


,tokens,filtered_tokens
0,"[donald, trump, just, couldn, t, wish, all, am...","[donald, trump, wish, americans, happy, new, y..."
1,"[house, intelligence, committee, chairman, dev...","[house, intelligence, committee, chairman, dev..."
2,"[on, friday, it, was, revealed, that, former, ...","[friday, revealed, former, milwaukee, sheriff,..."
3,"[on, christmas, day, donald, trump, announced,...","[christmas, day, donald, trump, announced, wou..."
4,"[pope, francis, used, his, annual, christmas, ...","[pope, francis, used, annual, christmas, day, ..."


**4.4 Lemmatization**

In [61]:
from nltk.stem import WordNetLemmatizer

# Download necessary resources for lemmatization
nltk.download('wordnet')
nltk.download('omw-1.4')

# Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()

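def lemmatize_tokens(tokens):
    # lemmatize() defaults to pos='n', so only noun forms are reduced
    # (e.g. 'americans' -> 'american'); verb forms such as 'revealed' stay unchanged
    return [lemmatizer.lemmatize(word) for word in tokens]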

# Apply lemmatization to the 'filtered_tokens' column
df['lemmatized_tokens'] = df['filtered_tokens'].apply(lemmatize_tokens)

# View the lemmatized tokens
df[['filtered_tokens', 'lemmatized_tokens']].head()


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


,filtered_tokens,lemmatized_tokens
0,"[donald, trump, wish, americans, happy, new, y...","[donald, trump, wish, american, happy, new, ye..."
1,"[house, intelligence, committee, chairman, dev...","[house, intelligence, committee, chairman, dev..."
2,"[friday, revealed, former, milwaukee, sheriff,...","[friday, revealed, former, milwaukee, sheriff,..."
3,"[christmas, day, donald, trump, announced, wou...","[christmas, day, donald, trump, announced, wou..."
4,"[pope, francis, used, annual, christmas, day, ...","[pope, francis, used, annual, christmas, day, ..."
