
Data pre-processing is an essential step in machine learning. It involves cleaning and transforming raw data into a format suitable for machine learning algorithms. In this project, the pre-processing steps were removing rows with missing values, removing punctuation marks, converting text to lowercase, removing non-alphanumeric characters, and removing extra whitespace. These steps put the data in a consistent, standardized format, which supports accurate classification.

In [21]:
import pandas as pd
import re
import string

# Loading the dataset
# Counting the comma-separated fields in each line of the file
with open("bbc-news-data.csv", 'r') as temp_f:
    # number of fields in each line
    col_count = [len(line.split(",")) for line in temp_f]

# Generating column names (names will be 0, 1, 2, ..., maximum columns - 1)
column_names = list(range(max(col_count)))

### Reading csv
df = pd.read_csv("bbc-news-data.csv", header=None, delimiter=",", names=column_names)

print(len(df.index))

# Removing any rows with missing values
df.dropna(inplace=True)

# Removing punctuation marks from text
def remove_punctuation(text):
    return text.translate(str.maketrans("", "", string.punctuation))

df[2] = df[2].apply(remove_punctuation)
df[3] = df[3].apply(remove_punctuation)

# Converting text to lowercase
df[2] = df[2].str.lower()
df[3] = df[3].str.lower()

# Removing any non-alphanumeric characters from text
def remove_non_alphanumeric(text):
    return re.sub(r'[^a-zA-Z0-9\s]', '', text)

df[2] = df[2].apply(remove_non_alphanumeric)
df[3] = df[3].apply(remove_non_alphanumeric)

# Removing any extra whitespace from text
df[2] = df[2].str.strip()
df[3] = df[3].str.strip()
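# regex=True is required for the pattern to be treated as a regular expression
df[2] = df[2].str.replace(r'\s+', ' ', regex=True)
df[3] = df[3].str.replace(r'\s+', ' ', regex=True)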

# Saving the preprocessed data to a new file
df.to_csv('preprocessed_bbc-news-data.csv', index=False)


1624



This code loads the dataset, removes any rows with missing values, removes punctuation marks, converts text to lowercase, removes non-alphanumeric characters, removes extra whitespace, and saves the preprocessed data to a new file called preprocessed_bbc-news-data.csv. Note that you may need to modify the code depending on the specific requirements of your project.

A regular expression is used to remove non-alphanumeric characters from the text data. The remove_non_alphanumeric() function calls re.sub() with the pattern [^a-zA-Z0-9\s], which matches any character that is not a letter, a digit, or whitespace, and replaces each match with the empty string ''. The re.sub() method takes as arguments the pattern to match, the replacement string, and the string to apply the substitution to.

This step is important in the data preprocessing stage because it helps to clean the text data and remove any unwanted characters that could affect the accuracy of the model. By removing non-alphanumeric characters, the text data is transformed into a more uniform and standardized format that is easier to analyze and model.
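
As a minimal sketch of the cleaning steps described above, the snippet below applies them to a single string rather than to DataFrame columns. The helper name clean_text and the sample sentence are illustrative assumptions and are not part of the project code.

```python
import re
import string


def clean_text(text):
    """Apply the cleaning steps in order to one string (illustrative helper)."""
    # Remove ASCII punctuation marks
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Convert to lowercase
    text = text.lower()
    # Drop anything that is not a letter, digit, or whitespace
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)
    # Collapse runs of whitespace and trim both ends
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical sample sentence
sample = "  Profits rose 20% to £4.6bn,   a record!  "
print(clean_text(sample))
```

In this sample the pound sign survives the punctuation step, because string.punctuation covers only ASCII characters, and is removed by the regex step instead.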

In [31]:
# Dataset-https://www.kaggle.com/datasets/hgultekin/bbcnewsarchive?resource=download
import pandas as pd
import nltk

# Downloading required resources from nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Loading the dataset
# Note: the column count is taken from the preprocessed file, but the raw
# bbc-news-data.csv is the file actually read into the DataFrame below
with open("preprocessed_bbc-news-data.csv", 'r') as temp_f:
    # number of comma-separated fields in each line
    col_count = [len(line.split(",")) for line in temp_f]

# Generating column names (names will be 0, 1, 2, ..., maximum columns - 1)
column_names = list(range(max(col_count)))

# Reading csv
df = pd.read_csv("bbc-news-data.csv", header=None, delimiter=",", names=column_names)

print(len(df.index))

# Defining the categories we're interested in
categories = ['business', 'politics', 'entertainment', 'sports', 'tech']

# Looping through each row and extracting a category from the headline
correct = 0
total = 0
cnt = 0
for index, row in df.iterrows():
    headline = row[2]
    tokens = nltk.word_tokenize(str(headline))
    tagged = nltk.pos_tag(tokens)  # tags are computed but not used in the match below
    cnt += 1
    for word, tag in tagged:
        if word.lower() in categories:
            # first word of column 0 is treated as the actual category
            actual_category = row[0].split()[0]
            print(f"Category of headline '{headline}': {word.lower()}")
            total += 1
            if actual_category == word.lower():
                correct += 1
            break



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


1624
Category of headline ' Asian carmakers fared well - Toyota sales jumped 11% while rival Nissan notched up a 10% increase. Overall. sales across the industry also fell to 1.25 million vehicles from 1.27 million a year earlier.  GM and Ford blamed high fuel prices for low sales of big trucks and gas-guzzling sports utility vehicles (SUVs) - the vehicles that provide the biggest profits.  GM added that US truck sales fell 9% in February while car business tumbled 17%': sports
Category of headline ' has seen annual pre-tax profits climb to record levels boosted by a sharp rise in business at its investment arm.  Profits for the year to 31 December rose 20% to £4.6bn ($8.6bn). Barclays' chief John Varley said the bank had "caught the winds" of a very strong world economy. Earnings at Barclays Capital investment bank rose 25% to £1.04bn': business
Category of headline ' Monsanto also agreed to three years' close monitoring of its business practices by the American authorities. It said i

This code loads a CSV file using the pandas library and iterates over each row to extract the category based on the headline. Here's how it works:

The code first downloads two NLTK resources: punkt and averaged_perceptron_tagger, which are required for tokenization and part-of-speech tagging.

Next, it loads the dataset from a CSV file called bbc-news-data.csv using the pd.read_csv() function from pandas. Since the CSV file does not have column names, the header=None parameter is used. The delimiter="," parameter specifies that the data is comma-separated. The names parameter is used to generate column names for the DataFrame.
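
A minimal sketch of this loading pattern, using a small in-memory CSV in place of the real file. The sample rows are hypothetical; only the read_csv arguments mirror the notebook.

```python
import io

import pandas as pd

# Hypothetical ragged data: rows have 2, 4, and 3 comma-separated fields
raw = "business,a\nsport,b,c,d\ntech,e,f\n"

# Count the fields on each line, as the notebook does before reading the file
col_count = [len(line.split(",")) for line in io.StringIO(raw)]
column_names = list(range(max(col_count)))

# header=None treats the first line as data; names fixes the number of columns
df = pd.read_csv(io.StringIO(raw), header=None, delimiter=",", names=column_names)

print(df.shape)
print(df.isna().sum().tolist())
```

Rows with fewer fields than the widest row are padded with missing values. This is worth keeping in mind, since a later dropna() call would discard such rows.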

The code then defines a list of categories that we're interested in: ['business', 'politics', 'entertainment', 'sports', 'tech'].

For each row in the dataset, the code extracts the headline and tokenizes it using the nltk.word_tokenize() function. It then performs part-of-speech tagging on the tokens using the nltk.pos_tag() function.

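Finally, the code loops through the tagged tokens and checks whether any token, lowercased, matches one of the categories. At the first match it prints the category along with the corresponding headline, compares the matched word with the first word of column 0 (treated as the actual category), and updates the correct and total counters. The part-of-speech tags are computed but play no role in the match.

In summary, the code reads in a CSV file of news articles and assigns each article a category by keyword matching on its NLTK-tokenized headline.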
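
The matching logic can be sketched on its own, without the NLTK downloads. In the snippet below a plain whitespace split stands in for nltk.word_tokenize, which is an assumption made to keep the example self-contained. The function name first_category_word and the sample rows are hypothetical.

```python
# Category keywords, matching the list used in the notebook
CATEGORIES = ['business', 'politics', 'entertainment', 'sports', 'tech']


def first_category_word(headline):
    """Return the first token that is a category keyword, or None."""
    # Whitespace split is an assumption standing in for nltk.word_tokenize
    for token in str(headline).split():
        word = token.lower().strip('.,!?"\'')
        if word in CATEGORIES:
            return word
    return None


# Hypothetical (label, headline) pairs, not taken from the dataset
rows = [
    ("business", "Strong business results lift bank profits"),
    ("tech", "New sports gadget tracks every step"),
    ("politics", "Ministers debate the budget"),
]

correct = total = 0
for label, headline in rows:
    predicted = first_category_word(headline)
    if predicted is None:
        continue  # no keyword found, so the row is not scored
    total += 1
    correct += predicted == label

print(correct, total)
```

The second sample row shows the main weakness of this approach: a category word can appear in a headline that belongs to a different category.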

In [18]:
accuracy=(correct/total)*100
print(accuracy)


45.16129032258064
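
Note that total counts only the rows in which a category word was found, so the accuracy above is measured on that subset rather than on all rows. A minimal sketch that reports coverage alongside accuracy follows. The function name and the counts passed to it are hypothetical, since the notebook output does not show the values of correct or total.

```python
def accuracy_and_coverage(correct, total, n_rows):
    """Return (accuracy %, coverage %) for keyword-based labelling."""
    # Guard against division by zero when no headline contained a keyword
    accuracy = 100 * correct / total if total else 0.0
    # Coverage is the share of all rows that received a prediction
    coverage = 100 * total / n_rows if n_rows else 0.0
    return accuracy, coverage


# Hypothetical counts, chosen only for illustration
acc, cov = accuracy_and_coverage(correct=9, total=20, n_rows=100)
print(f"accuracy {acc:.2f}%, coverage {cov:.2f}%")
```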
