# Business Understanding
## Problem Statement
The University of Zambia (UNZA) hosts a growing repository of academic journal articles across multiple disciplines. However, these articles are not systematically categorized according to Zambia’s Vision 2030 development sectors. This lack of alignment presents a missed opportunity to leverage UNZA’s intellectual output for national strategic planning, policy formulation, and sectoral development monitoring.

This project aims to develop a data-driven classification system that maps UNZA journal articles to the appropriate Vision 2030 sectors using machine learning techniques. By automating this classification, we intend to bridge the gap between academic research and national development priorities, enabling policymakers, researchers, and institutions to better identify and track sectoral contributions and trends.

## Objectives
**I. To align the University of Zambia’s research with national priorities:**
Systematically map academic journal articles to Zambia’s Vision 2030 development sectors to highlight how UNZA’s intellectual output contributes to achieving national development goals.

**II. To enable evidence-based decision-making:**
Provide policymakers, researchers, and development stakeholders with an accessible, data-driven tool for identifying sectoral trends and gaps in research, thereby supporting targeted policy formulation and strategic resource allocation.

**III. To automate and scale research classification:**
Develop a machine learning–powered system to classify and update the categorization of research articles efficiently, ensuring scalability as UNZA’s repository grows and enabling continuous monitoring of sectoral contributions over time.

## Data Mining Goals

I. Design a supervised multi-class classification model to assign each UNZA journal article to one of Zambia’s Vision 2030 sectors based on the article’s metadata (title, abstract, and keywords).

Purpose: Reveal the alignment between academic output and national development areas.

Method: Train on a labeled subset of articles, each mapped to a Vision 2030 sector.

Expected Output: Accurate labels such as “Education,” “Agriculture,” “Health,” “Infrastructure,” etc.
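
To make the supervised goal concrete, here is a minimal sketch of a multi-class text classifier. It is an illustration under stated assumptions, not the project's chosen model: it uses a multinomial Naive Bayes classifier over word counts, written with the Python standard library only, and the training articles and the `NaiveBayesSectorClassifier` name are hypothetical. A production version would more likely rely on TF-IDF or BERT features with an established ML library.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on non-letters; a stand-in for real preprocessing."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesSectorClassifier:
    """Multinomial Naive Bayes with add-one smoothing over word counts."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # sector -> word -> count
        self.doc_counts = Counter(labels)        # sector -> number of articles
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        # Words never seen in training carry no evidence, so they are dropped.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in sorted(self.doc_counts):
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus summed log likelihoods of the observed words.
            score = math.log(self.doc_counts[label] / total_docs)
            score += sum(math.log((counts[t] + 1) / denom) for t in tokens)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical labeled articles: title, abstract, and keywords joined into one string.
train_texts = [
    "maize yield smallholder farmers irrigation soil fertility",
    "crop rotation and fertilizer use among rural farmers",
    "malaria prevalence in rural clinics and patient outcomes",
    "hospital maternal care and child vaccination coverage",
    "primary school literacy and teacher training outcomes",
    "secondary school curriculum and pupil examination results",
]
train_labels = ["Agriculture", "Agriculture", "Health", "Health", "Education", "Education"]

model = NaiveBayesSectorClassifier().fit(train_texts, train_labels)
prediction = model.predict("fertilizer subsidies and maize farmers")
print(prediction)
```

Each article receives exactly one sector, which matches the multi-class framing above. Articles that span several sectors would need a multi-label variant instead.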

II. Identify latent research clusters and anomalies through unsupervised learning (e.g., clustering or topic modeling) to uncover emerging themes or neglected areas.

Purpose: Help decision-makers identify new or missing areas of national interest not currently emphasized in the Vision 2030 framework.

Method: Apply clustering techniques such as K-Means or DBSCAN to text embeddings, or LDA topic modeling to bag-of-words term counts.

Expected Output: Visual or descriptive reports of discovered themes or outliers.
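
The clustering idea can be sketched as follows. This is a simplified illustration, assuming normalized word-count vectors as a stand-in for real text embeddings and a plain K-Means loop with deterministic farthest-point initialization. The abstracts and the helper names (`embed`, `kmeans`) are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text, vocab):
    """L2-normalized word-count vector; a stand-in for a real text embedding."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    vec = [counts[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=10):
    """Plain K-Means with deterministic farthest-point initialization."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next centroid: the point farthest from its nearest existing centroid.
        centroids.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    assignments = []
    for _ in range(iterations):
        # Assignment step: attach each point to its closest centroid.
        assignments = [min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # keep the old centroid if a cluster goes empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignments


# Hypothetical unlabeled abstracts.
abstracts = [
    "maize farmers irrigation soil",
    "maize farmers fertilizer yield",
    "malaria clinics patient care",
    "malaria clinics vaccination coverage",
]
vocab = sorted({w for a in abstracts for w in re.findall(r"[a-z]+", a.lower())})
points = [embed(a, vocab) for a in abstracts]
cluster_ids = kmeans(points, k=2)
print(cluster_ids)
```

Articles that sit far from every centroid are natural candidates for the anomaly review mentioned above, although the distance threshold for flagging them would have to be tuned on real data.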

III. Deploy a scalable, retrainable classification pipeline using modern ML techniques and modular design.

Purpose: Automate the tagging process for future UNZA research uploads.

Method: Build a modular pipeline for preprocessing, vectorization (e.g., TF-IDF or BERT), training, evaluation, and inference.

Expected Output: A script or web app that classifies new articles on upload.
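
One possible shape for such a modular pipeline is sketched below. Everything here is an assumption made for illustration: the `SectorPipeline` class, its stage functions, and the word-overlap scoring are hypothetical placeholders, not the project's design. The point is only that each stage is a swappable callable, so TF-IDF or BERT vectorization could replace the bag-of-words stage without touching the rest.

```python
import re
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable


def basic_preprocess(text: str) -> list[str]:
    """Lowercase and tokenize; swap in stop-word removal or stemming as needed."""
    return re.findall(r"[a-z]+", text.lower())


def count_vectorize(tokens: list[str]) -> Counter:
    """Bag-of-words counts; a placeholder for TF-IDF or BERT embeddings."""
    return Counter(tokens)


@dataclass
class SectorPipeline:
    """Chains preprocessing, vectorization, training, evaluation, and inference."""
    preprocess: Callable[[str], list[str]] = basic_preprocess
    vectorize: Callable[[list[str]], Counter] = count_vectorize
    profiles: dict[str, Counter] = field(default_factory=dict)

    def _features(self, text: str) -> Counter:
        return self.vectorize(self.preprocess(text))

    def train(self, texts: list[str], labels: list[str]) -> "SectorPipeline":
        # Retraining rebuilds the per-sector word profiles from scratch.
        self.profiles = {}
        for text, label in zip(texts, labels):
            self.profiles.setdefault(label, Counter()).update(self._features(text))
        return self

    def classify(self, text: str) -> str:
        # Placeholder model: pick the sector whose profile shares the most words.
        features = self._features(text)
        return max(
            sorted(self.profiles),
            key=lambda sector: sum((features & self.profiles[sector]).values()),
        )

    def evaluate(self, texts: list[str], labels: list[str]) -> float:
        """Fraction of articles whose predicted sector matches the label."""
        hits = sum(self.classify(t) == y for t, y in zip(texts, labels))
        return hits / len(labels)


# Hypothetical training data and a hypothetical newly uploaded article.
texts = [
    "maize farmers irrigation and soil fertility",
    "malaria clinics and patient care",
    "school literacy and teacher training",
]
labels = ["Agriculture", "Health", "Education"]

pipeline = SectorPipeline().train(texts, labels)
new_upload = "irrigation schemes for maize farmers"
predicted_sector = pipeline.classify(new_upload)
print(predicted_sector)
```

In a web app, an upload handler would call `classify` on the new article's combined title, abstract, and keywords, and `train` would be rerun whenever newly labeled articles become available.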

IV. Continuously evaluate model performance over time using metrics such as F1-score, accuracy, and confusion matrices.

Purpose: Ensure system reliability and adaptiveness as language and research topics evolve.

Method: Establish a validation framework with a held-out labeled set and benchmark models against it at regular intervals.

Expected Output: Monitoring logs and retraining criteria to detect and respond to model drift.
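
The metrics named above can be computed as in the following sketch, which uses only the standard library. The labels and predictions are made-up illustration data, and the `needs_retraining` rule with its 0.05 tolerance and 0.90 baseline is a hypothetical criterion, not a project decision.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true sectors, columns are predicted sectors."""
    pairs = Counter(zip(y_true, y_pred))
    return [[pairs[(t, p)] for p in labels] for t in labels]


def accuracy(y_true, y_pred):
    """Share of articles whose predicted sector equals the true sector."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-sector F1, so small sectors count equally."""
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(labels)


def needs_retraining(current_f1, baseline_f1, tolerance=0.05):
    """Hypothetical rule: retrain when macro-F1 drops more than `tolerance`."""
    return baseline_f1 - current_f1 > tolerance


# Made-up validation results for three sectors.
labels = ["Agriculture", "Education", "Health"]
y_true = ["Agriculture", "Agriculture", "Education", "Education", "Health", "Health"]
y_pred = ["Agriculture", "Agriculture", "Education", "Health", "Health", "Health"]

cm = confusion_matrix(y_true, y_pred, labels)
acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred, labels)
retrain = needs_retraining(f1, baseline_f1=0.90)
print(cm, round(acc, 3), round(f1, 3), retrain)
```

Macro-averaged F1 is used here because sector sizes are likely to be uneven, and accuracy alone can hide poor performance on small sectors.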