Project Deliverable #3: Data Collection and Data Cleaning

The following code loads our dataset and shows the first 5 rows.

In [None]:
import pandas as pd

# Load my dataset
df = pd.read_csv("Firefox_bugs.csv")

# See the first few rows
df.head()


The following code counts the number of duplicate rows. 

In [None]:
# Count total duplicate rows
num_dupes = df.duplicated().sum()
print(f"Number of duplicate rows: {num_dupes}")

Since the full-row check showed no duplicates, we repeated the check using only the Summary and Description columns. This found 62 duplicate rows.

In [None]:
df[df.duplicated(subset=["Summary", "Description"])]
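
As a minimal sketch of why the two checks can disagree, the toy frame below (hypothetical values, including an assumed "Bug ID" column that may not match the real dataset) has two rows with the same summary and description but different IDs. The full-row check misses them, while the subset check flags the second one.

```python
import pandas as pd

# Toy data (hypothetical): rows 0 and 1 share Summary and Description
# but differ in "Bug ID", so they are not duplicates as whole rows.
toy = pd.DataFrame({
    "Bug ID": [101, 102, 103],
    "Summary": ["Crash on startup", "Crash on startup", "Slow scrolling"],
    "Description": ["Firefox crashes", "Firefox crashes", "Page lags"],
})

# Compare every column vs. only the two text columns
full_row_dupes = int(toy.duplicated().sum())
subset_dupes = int(toy.duplicated(subset=["Summary", "Description"]).sum())

print("Full-row duplicates:", full_row_dupes)
print("Summary/Description duplicates:", subset_dupes)
```

By default, duplicated() leaves the first occurrence unmarked, so only the later copies are counted.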

This code removes the duplicates found, keeping the first occurrence of each. The cell after it repeats the check to confirm that none remain.

In [None]:
# Remove all duplicate rows based on summary and description, if found
df = df.drop_duplicates(subset=["Summary", "Description"])

In [None]:
df[df.duplicated(subset=["Summary", "Description"])]

The next few cells install the packages we need for our data pre-processing tasks and import the tools we use from them.

In [None]:
%pip install nltk

In [None]:
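import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
nltk.download("stopwords")  # stop word list used in preprocess_text
nltk.download("punkt")      # tokenizer data for word_tokenize
nltk.download("punkt_tab")  # tokenizer data name in newer NLTK releases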

In [None]:
%pip install scikit-learn

The following code one-hot encodes the Status and Resolution fields and shows the results in the first few rows.
We use OneHotEncoder for these two fields because their categories have no natural order.

In [None]:
from sklearn.preprocessing import OneHotEncoder


ohe = OneHotEncoder(sparse_output=False, handle_unknown="ignore")

encodCol = ohe.fit_transform(df[["Status", "Resolution"]])

df_encoded = pd.concat([
    df.reset_index(drop=True),
    pd.DataFrame(encodCol.astype(int), columns=ohe.get_feature_names_out(["Status", "Resolution"]))
], axis=1)

df_encoded.head()
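
To make the shape of the result easier to see, here is a small sketch on toy values (hypothetical, not taken from Firefox_bugs.csv). It uses pd.get_dummies, a pandas-only alternative to OneHotEncoder and not the encoder used above, to show that each category becomes its own 0/1 column and no ordering is implied.

```python
import pandas as pd

# Toy values (hypothetical), not taken from Firefox_bugs.csv
toy = pd.DataFrame({
    "Status": ["RESOLVED", "NEW", "RESOLVED"],
    "Resolution": ["FIXED", "WONTFIX", "DUPLICATE"],
})

# One 0/1 column per category, named <column>_<category>
toy_dummies = pd.get_dummies(toy, columns=["Status", "Resolution"], dtype=int)

# Each row has exactly one 1 per encoded field
row_sums = toy_dummies.sum(axis=1).tolist()

print(list(toy_dummies.columns))
print(row_sums)
```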

The next few cells set the ordinal order for the Priority field, with P1 being the most critical, encode it, and show the results in the first few rows.
We do not use a label encoder because it assigns codes in alphabetical order, which would give P1 the lowest code even though it is the highest priority, so we map the order manually to make sure it is right.
We also store the field as an ordered pd.Categorical, which preserves both the label and the order.

In [None]:
priority_order = ["P5","P4","P3","P2","P1"]  #lowest → highest
priority_map = {p:i for i,p in enumerate(priority_order)}

df["Priority_encoded"] = df["Priority"].map(priority_map)

unmapped = df.loc[df["Priority_encoded"].isna(), "Priority"].unique()
print("Unmapped values:", unmapped)

In [None]:
df["Priority_cat"] = pd.Categorical(df["Priority"], categories=priority_order, ordered=True)

In [None]:
# Check to make sure it worked
print(df["Priority_cat"].value_counts().sort_index())

# Last expression in the cell, so the notebook displays it
df[["Priority","Priority_encoded"]].head(10)

In [None]:
df[["Priority", "Priority_cat", "Priority_encoded"]].head(10)
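
The sketch below compares the two encodings on toy values (hypothetical; "--" is an assumed stand-in for any value outside the list). With the same category order, the manual map and the codes of the ordered categorical should agree, and they differ only in how they mark unlisted values.

```python
import pandas as pd

priority_order = ["P5", "P4", "P3", "P2", "P1"]  # lowest -> highest

# Toy values (hypothetical); "--" is not in priority_order
toy = pd.Series(["P1", "P3", "P5", "--"])

# Manual mapping: unlisted values become NaN
mapped = toy.map({p: i for i, p in enumerate(priority_order)})

# Ordered categorical: unlisted values get the code -1
cat = pd.Categorical(toy, categories=priority_order, ordered=True)
codes = cat.codes.tolist()

# Alphabetical order, for comparison, puts P1 first
alphabetical = sorted(priority_order)

print(mapped.tolist())
print(codes)
print(alphabetical)
```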

In [None]:
df.head(10)

This code creates a function to process raw text by removing special characters, collapsing extra whitespace, transforming to lowercase, tokenizing, removing stop words, and stemming.

In [None]:

EN_STOP = set(stopwords.words('english'))
STEMMER = PorterStemmer()

def preprocess_text(text: str):
    if not isinstance(text, str):
        text = "" if pd.isna(text) else str(text)

    #remove special characters 
    text = re.sub(r'[^A-Za-z0-9\s]', ' ', text)

    #Collapse multiple spaces if exists
    text = re.sub(r'\s+', ' ', text).strip()

    #lowercase
    text = text.lower()

    #tokenize
    tokens = word_tokenize(text)

    #remove stop words
    tokens = [t for t in tokens if t not in EN_STOP]

    #stemming
    tokens = [STEMMER.stem(t) for t in tokens]

    return tokens
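
The cleaning steps that run before tokenization can be checked on their own with only the standard library. This is a sketch on a made-up sample string, not a row from the dataset.

```python
import re

# Hypothetical sample text, not taken from the dataset
sample = "Firefox 118.0: crash @ startup!!  (see bug #123)"

cleaned = re.sub(r"[^A-Za-z0-9\s]", " ", sample)  # special characters -> spaces
cleaned = re.sub(r"\s+", " ", cleaned).strip()    # collapse repeated whitespace
cleaned = cleaned.lower()                         # lowercase

print(cleaned)
```

Note that the version number "118.0" is split into "118" and "0" because the period is treated as a special character.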

This code concatenates the Summary and Description fields into one new column, Text, and displays it.

In [None]:
# Create the combined Text column
df["Summary"] = df["Summary"].fillna("").astype(str)
df["Description"] = df["Description"].fillna("").astype(str)
df["Text"] = (df["Summary"] + " " + df["Description"]).str.replace(r"\s+", " ", regex=True).str.strip()

# Check to see if it worked
print("New column created:", "Text" in df.columns)
print(df[["Summary", "Description", "Text"]].head(5))
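
A short sketch of why the missing values are filled before concatenating, using toy rows (hypothetical) where one description is missing. Without the fill, the combined value for that row is itself missing.

```python
import pandas as pd

# Toy rows (hypothetical); the second description is missing
toy = pd.DataFrame({
    "Summary": ["Crash on startup", "Slow scrolling"],
    "Description": ["Firefox   crashes\nimmediately", None],
})

# Concatenating without filling propagates the missing value
without_fill = toy["Summary"] + " " + toy["Description"]

# Filling first keeps the summary and tidies the whitespace
with_fill = (
    toy["Summary"].fillna("") + " " + toy["Description"].fillna("")
).str.replace(r"\s+", " ", regex=True).str.strip()

print(without_fill.isna().tolist())
print(with_fill.tolist())
```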

This code applies our preprocessing function to the newly created column and returns a column with processed tokens.


In [None]:
# Apply the preprocessing function to the Text column
df["processed_tokens"] = df["Text"].apply(preprocess_text)
df["processed_text"]  = df["processed_tokens"].apply(lambda toks: " ".join(toks))

#Make sure it worked
df[["Text", "processed_tokens"]].head(5)