<a href="https://colab.research.google.com/github/luckysiabula-bit/classification_of_domain-subject_area_reference./blob/main/classification_of_domain_subject_area_reference.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. Business Understanding
### 1.1 Problem Statement

The rapid growth of digital content in education, research, and industry has made it increasingly difficult to organize and retrieve information effectively. Manual classification of documents into subject areas or domains is slow, costly, and prone to inconsistency due to human error. This creates barriers to efficient knowledge management and slows down research or learning processes.

An automated classification system for domain/subject area reference will allow organizations to process large volumes of documents quickly, assign them to appropriate categories, and improve accessibility for end-users.


### 1.2 Business Objectives

The main business objective is to develop an automated classification system that assigns documents to predefined subject areas with high accuracy and efficiency.

From a real-world perspective, success means:
- Reducing manual classification workload by at least 70%.
- Achieving a minimum classification accuracy of 80%.
- Improving document retrieval time in repositories and databases.
- Increasing user satisfaction by making content easier to find and navigate.

### 1.3 Data Mining Goals

To achieve the stated business objectives, the project will:
- Build a supervised classification model capable of predicting the correct subject area from textual input.
- Use Natural Language Processing (NLP) techniques such as TF-IDF vectorization and word embeddings to extract meaningful features from text.
- Test multiple algorithms including Logistic Regression, Random Forest, Support Vector Machines, and transformer-based models like BERT.
- Select the model that provides the best trade-off between accuracy, speed, and interpretability.
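
The first two goals above can be sketched as a single pipeline that turns text into TF-IDF features and fits a Logistic Regression classifier. This is only a minimal illustration using scikit-learn: the four-sentence corpus, its labels, and the parameter choices (`stop_words="english"`, `max_iter=1000`) are assumptions made for the example, not the project's final configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for the labelled documents.
texts = [
    "shares fell as the bank reported lower profits",
    "the company announced a merger and higher profits",
    "the new phone ships with a faster processor",
    "software update improves the phone battery and processor",
]
labels = ["business", "business", "tech", "tech"]

# TF-IDF turns each document into a weighted term vector;
# Logistic Regression then learns one linear decision rule over those weights.
model = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)

# Predict the subject area of an unseen sentence.
predictions = model.predict(["bank profits rise after merger"])
print(predictions)
```

Swapping `LogisticRegression` for another estimator (for example a Random Forest or a linear SVM) leaves the rest of the pipeline unchanged, which makes it easier to compare the candidate algorithms on equal footing.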

### 1.4 Initial Project Success Criteria

The project will be considered successful if:
- The model achieves at least 80% accuracy on the test dataset.
- Precision and recall for each subject area are above 0.75.
- The system processes at least 500 documents per minute without significant performance loss.
- Classifications match expert-labeled results in at least 8 out of 10 randomly reviewed cases.
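
The accuracy and per-class precision/recall criteria above can be checked mechanically once predictions exist. The sketch below uses only the standard library; the function name `evaluate_against_criteria` and the sample labels are hypothetical, and the throughput and expert-review criteria are not covered here.

```python
from collections import Counter

def evaluate_against_criteria(y_true, y_pred, min_accuracy=0.80, min_class_score=0.75):
    """Return (accuracy, per-class (precision, recall), criteria met?)."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    true_pos = Counter(t for t, p in pairs if t == p)  # correct predictions per class
    predicted = Counter(y_pred)                        # how often each class was predicted
    actual = Counter(y_true)                           # how often each class truly occurs

    per_class = {}
    for label in sorted(actual):
        precision = true_pos[label] / predicted[label] if predicted[label] else 0.0
        recall = true_pos[label] / actual[label]
        per_class[label] = (precision, recall)

    # "At least 80%" is inclusive; "above 0.75" is strict.
    passed = accuracy >= min_accuracy and all(
        p > min_class_score and r > min_class_score for p, r in per_class.values()
    )
    return accuracy, per_class, passed

# Hypothetical labels for illustration only.
y_true = ["business", "business", "tech", "tech", "sport"]
y_pred = ["business", "tech", "tech", "tech", "sport"]
accuracy, per_class, passed = evaluate_against_criteria(y_true, y_pred)
print(accuracy, per_class, passed)
```

In this example the overall accuracy reaches the 80% threshold, yet the check still fails because recall for `business` is only 0.5. This is why the per-class criterion is listed separately from overall accuracy.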

### 1.5 Section Integration

This section integrates all parts of the Business Understanding phase into a single, well-structured document. The text is organized into four main subsections: Problem Statement, Business Objectives, Data Mining Goals, and Initial Project Success Criteria. The same content is reflected in both the Google Colab notebook and the README.md file to ensure consistency between development and documentation. Formatting, headings, and numbering follow a clear and professional style for ease of reading.

# 2. Data Understanding

This section loads the raw dataset, performs first-look exploration, and visualizes key distributions to identify data quality issues and class balance.



### 2.1 Load the Dataset

The dataset is loaded into a pandas DataFrame. It contains BBC News articles described by three columns:

- **ArticleId** – a unique identifier for each document  
- **Text** – the full article text  
- **Category** – the subject area label (target variable)


In [None]:
from google.colab import drive
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Mount Google Drive
drive.mount('/content/drive')

# Path to your file in Google Drive
file_path = '/content/drive/MyDrive/BBC News Train.csv'

# Load dataset
df = pd.read_csv(file_path)

# Preview the first five rows
print(df.head())

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
   ArticleId                                               Text  Category
0       1833  worldcom ex-boss launches defence lawyers defe...  business
1        154  german business confidence slides german busin...  business
2       1101  bbc poll indicates economic gloom citizens in ...  business
3       1976  lifestyle  governs mobile choice  faster  bett...      tech
4        917  enron bosses in $168m payout eighteen former e...  business


### 2.2 Dataset Structure

The dataset contains **1,490 rows** and **3 columns**.  

**Columns:**  
- `ArticleId` *(integer, unique ID)*  
- `Text` *(string, main content)*  
- `Category` *(string, label for classification)*  

No missing values were found in any of the columns.  


In [5]:
from google.colab import drive
import pandas as pd

# Mount Google Drive
drive.mount('/content/drive')

# Path to your file in Google Drive
file_path = '/content/drive/MyDrive/BBC News Train.csv'

# Load dataset
df = pd.read_csv(file_path)
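print(df.shape)  # print explicitly: a bare expression is only echoed when it is the last line of a cell
df.info()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
(1490, 3)
<class 'pandas.core.frame.DataFrame'>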
RangeIndex: 1490 entries, 0 to 1489
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   ArticleId  1490 non-null   int64 
 1   Text       1490 non-null   object
 2   Category   1490 non-null   object
dtypes: int64(1), object(2)
memory usage: 35.1+ KB
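
The structural checks in this subsection (missing values, unique IDs) and the class-balance check mentioned in the section introduction can also be made explicit rather than read off `df.info()`. The sketch below runs on a small hypothetical DataFrame named `sample` with the same three columns, so it does not depend on the Drive file. The duplicate-text check is an extra assumption about what might be worth verifying, not something the original analysis reports.

```python
import pandas as pd

# Hypothetical miniature stand-in with the same columns as the real dataset.
sample = pd.DataFrame({
    "ArticleId": [1, 2, 3, 4],
    "Text": ["markets rally", "new phone launched", "markets rally", "cup final tonight"],
    "Category": ["business", "tech", "business", "sport"],
})

# Missing values per column (all zeros means no gaps).
missing_per_column = sample.isna().sum()

# Repeated identifiers or repeated article texts.
duplicate_ids = int(sample["ArticleId"].duplicated().sum())
duplicate_texts = int(sample["Text"].duplicated().sum())

# Share of each category, to judge class balance.
class_share = sample["Category"].value_counts(normalize=True)

print(missing_per_column)
print("duplicate ids:", duplicate_ids, "| duplicate texts:", duplicate_texts)
print(class_share)
```

Replacing `sample` with the `df` loaded above would run the same checks on the full 1,490-row dataset.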
