# Business Understanding
## Problem Statement
The University of Zambia (UNZA) hosts a growing repository of academic journal articles across multiple disciplines. However, these articles are not systematically categorized according to Zambia’s Vision 2030 development sectors. This lack of alignment presents a missed opportunity to leverage UNZA’s intellectual output for national strategic planning, policy formulation, and sectoral development monitoring.

This project aims to develop a data-driven classification system that maps UNZA journal articles to the appropriate Vision 2030 sectors using machine learning techniques. By automating this classification, we intend to bridge the gap between academic research and national development priorities, enabling policymakers, researchers, and institutions to better identify and track sectoral contributions and trends.

## Objectives
**1. To align the University of Zambia’s research with national priorities:**
Systematically map academic journal articles to Zambia’s Vision 2030 development sectors to highlight how UNZA’s intellectual output contributes to achieving national development goals.

**2. To enable evidence-based decision-making:**
Provide policymakers, researchers, and development stakeholders with an accessible, data-driven tool for identifying sectoral trends and gaps in research, thereby supporting targeted policy formulation and strategic resource allocation.

**3. To automate and scale research classification:**
Develop a machine learning–powered system to classify and update the categorization of research articles efficiently, ensuring scalability as UNZA’s repository grows and enabling continuous monitoring of sectoral contributions over time.

## Data Mining Goals

**1. Design a supervised multi-class classification model** to assign each UNZA journal article to one of Zambia’s Vision 2030 sectors based on the article’s metadata (title, abstract, and keywords).

*Purpose*: Reveal the alignment between academic output and national development areas.

*Method*: Use labeled training data mapped to Vision 2030 sectors, extracted from a subset of articles.

**Expected Output**: A sector label for each article, such as “Education,” “Agriculture,” “Health,” or “Infrastructure.”
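A minimal sketch of this goal, assuming scikit-learn is available. The toy records, the `sector` column name, and the `combine_fields` helper are illustrative assumptions, not drawn from the UNZA corpus. The sketch joins title, abstract, and keywords into one text field, vectorizes it with TF-IDF, and fits a logistic regression classifier; the choice of classifier is a placeholder rather than the project's confirmed model.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy records standing in for labeled articles (illustrative only).
toy = pd.DataFrame({
    "title": ["Maize yield under drought", "Malaria prevalence in rural clinics",
              "Soil fertility and crop rotation", "Maternal health service uptake"],
    "abstract": ["Field trials of maize varieties", "Survey of malaria cases in clinics",
                 "Crop rotation effects on soil", None],
    "keywords": ["maize; drought", "malaria; clinics", None, "maternal health"],
    "sector": ["Agriculture", "Health", "Agriculture", "Health"],
})

def combine_fields(frame: pd.DataFrame) -> pd.Series:
    """Join title, abstract, and keywords into one string per article."""
    cols = ["title", "abstract", "keywords"]
    return frame[cols].fillna("").agg(" ".join, axis=1).str.strip()

# TF-IDF features feeding a multi-class-capable linear classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(combine_fields(toy), toy["sector"])

# Classify an unseen article that has no abstract.
new_article = pd.DataFrame({"title": ["Drought-tolerant maize farming"],
                            "abstract": [None], "keywords": ["maize"]})
predicted = model.predict(combine_fields(new_article))
print(predicted[0])
```

Filling missing fields with empty strings before joining lets articles without an abstract still be classified from their title and keywords.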

**2. Identify latent research clusters and anomalies** through unsupervised learning (e.g., clustering or topic modeling) to uncover emerging themes or neglected areas.

*Purpose*: Help decision-makers identify new or missing areas of national interest not currently emphasized in the Vision 2030 framework.

*Method*: Apply clustering techniques such as K-Means or DBSCAN to text embeddings, or LDA topic modeling to bag-of-words term counts.

**Expected Output**: Visual or descriptive reports of discovered themes or outliers.

**3. Deploy a scalable, retrainable classification pipeline** using modern ML techniques and modular design.

*Purpose*: Automate the tagging process for future UNZA research uploads.

*Method*: Build a modular pipeline for preprocessing, vectorization (e.g., TF-IDF or BERT), training, evaluation, and inference.

**Expected Output**: A script or web app that classifies new articles on upload.
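One possible shape for such a pipeline is sketched below, assuming scikit-learn and the standard-library `pickle` module for persistence. The function names `build_pipeline`, `train`, and `classify` are hypothetical, the training texts are invented, and logistic regression is only a placeholder classifier.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

def build_pipeline() -> Pipeline:
    """Assemble named steps so each one can be swapped independently."""
    return Pipeline([
        ("vectorizer", TfidfVectorizer(lowercase=True, stop_words="english")),
        ("classifier", LogisticRegression(max_iter=1000)),
    ])

def train(texts, labels) -> Pipeline:
    """Fit a fresh pipeline; calling this again on new data retrains it."""
    pipeline = build_pipeline()
    pipeline.fit(texts, labels)
    return pipeline

def classify(pipeline: Pipeline, texts) -> list:
    """Return one predicted sector per input text."""
    return pipeline.predict(texts).tolist()

# Invented training examples (illustrative only).
texts = [
    "copper mining output and ore processing",
    "smallholder maize farming and irrigation",
    "mining safety in copper mines",
    "irrigation schemes for maize farmers",
]
labels = ["Mining", "Agriculture", "Mining", "Agriculture"]

model = train(texts, labels)

# Serialize and restore, as an upload handler might do between runs.
blob = pickle.dumps(model)
restored = pickle.loads(blob)

upload = ["copper ore mining"]
print(classify(restored, upload))
```

Keeping vectorization and classification inside one `Pipeline` object means the exact preprocessing used in training is reapplied at inference, and the whole object can be saved and reloaded as a unit.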

**4. Continuously evaluate model performance** over time using metrics such as F1-score, accuracy, and confusion matrices.

*Purpose*: Ensure the system stays reliable and adapts as language and research topics evolve.

*Method*: Establish a validation framework and regularly benchmark models.

**Expected Output**: Monitoring logs and retraining criteria for detecting and responding to model drift.
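The metrics named in this goal can be computed as in the sketch below, which assumes scikit-learn. The true and predicted labels are invented, and `RETRAIN_THRESHOLD` is a hypothetical value used only to show how a retraining criterion might be expressed.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Invented evaluation labels (illustrative only).
y_true = ["Health", "Health", "Agriculture", "Agriculture", "Education", "Education"]
y_pred = ["Health", "Agriculture", "Agriculture", "Agriculture", "Education", "Health"]
labels = ["Agriculture", "Education", "Health"]

accuracy = accuracy_score(y_true, y_pred)
# Macro averaging weights every sector equally, regardless of its size.
macro_f1 = f1_score(y_true, y_pred, labels=labels, average="macro")
# Rows are true sectors, columns are predicted sectors, in `labels` order.
cm = confusion_matrix(y_true, y_pred, labels=labels)

RETRAIN_THRESHOLD = 0.60  # hypothetical cut-off, not taken from the project
needs_retraining = macro_f1 < RETRAIN_THRESHOLD

print(f"accuracy={accuracy:.3f} macro_f1={macro_f1:.3f}")
print(cm)
print("retrain:", needs_retraining)
```

Macro F1 is worth tracking alongside accuracy because a model can score well on accuracy while failing on infrequent sectors.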

## Initial Project Success Criteria

The project will be considered initially successful if the supervised classification model achieves at least 60% accuracy in assigning UNZA journal articles to the correct Zambia Vision 2030 development sectors.

This baseline is realistic for a first iteration, considering:

- **Data quality issues**: incomplete or inconsistent titles, abstracts, or keywords.
- **Sector overlap**: some research spans multiple development areas.
- **Model maturity**: this is the initial deployment, and performance should improve with further training and tuning.

Achieving this baseline will:

- Demonstrate that the model performs significantly above random guessing.
- Provide policymakers and researchers with a usable starting point for tracking sectoral research contributions.
- Establish a functional foundation for refining the system toward higher accuracy and wider adoption.
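To make the comparison with guessing concrete, the sketch below checks a set of predictions against the 60% threshold and against two simple baselines: uniform random guessing and always predicting the most frequent sector. It uses only the standard library; the labels are invented and the helper names are hypothetical.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def majority_baseline_accuracy(y_true):
    """Accuracy obtained by always predicting the most frequent label."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)

def random_baseline_accuracy(n_classes):
    """Expected accuracy of uniform random guessing over n_classes labels."""
    return 1 / n_classes

def meets_initial_criterion(y_true, y_pred, threshold=0.60):
    """True when accuracy reaches the project's initial 60% target."""
    return accuracy(y_true, y_pred) >= threshold

# Invented labels (illustrative only).
y_true = ["Mining", "Mining", "Agriculture", "Health", "Education"]
y_pred = ["Mining", "Mining", "Agriculture", "Health", "Health"]

model_acc = accuracy(y_true, y_pred)
majority_acc = majority_baseline_accuracy(y_true)
random_acc = random_baseline_accuracy(len(set(y_true)))

print(f"model={model_acc:.2f} majority={majority_acc:.2f} random={random_acc:.2f}")
print("criterion met:", meets_initial_criterion(y_true, y_pred))
```

When sectors are imbalanced, the majority-class baseline is usually the harder one to beat, so reporting it next to the 60% target gives a fairer picture than random guessing alone.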

# Data Understanding

In [None]:
# --- Step 1: Import libraries ---
import pandas as pd
import matplotlib.pyplot as plt
# --- Step 2: Load dataset ---
file_path = "../data/vision2030_corpus.csv"
df = pd.read_csv(file_path)
# --- Step 3: Initial Exploration ---
print("First 5 rows:")
display(df.head())


The code above loads the dataset into our Colaboratory notebook and displays the first five rows.

In [None]:
# --- Step 4: Summary statistics ---
print("\nInfo:")
df.info()


- The dataset has 17,136 rows × 13 columns.
- All columns are stored as object/text.
- Missing values appear mainly in `authors`, `abstract`, `doi`, `pdf_url`, and `journal`.
- Titles and abstracts differ in length, showing variation in metadata.

In [None]:

print("\nSummary statistics (all columns):")
display(df.describe(include="all"))

The code above generates descriptive statistics for both numeric and categorical columns.

In [None]:
print("\nShape (rows, columns):", df.shape)

print("\nMissing values per column:")
print(df.isnull().sum())

The code above first prints the dataset’s overall dimensions (rows and columns) using `df.shape`. Then it shows how many missing values each column contains by running `df.isnull().sum()`.

In [None]:

# --- Create derived length columns ---
df["title_length"] = df["title"].fillna("").apply(len)
df["abstract_length"] = df["abstract"].fillna("").apply(len)

# --- Plot histograms ---
df[["title_length", "abstract_length"]].hist(
    figsize=(10, 5), bins=30, edgecolor="black"
)
plt.suptitle("Distribution of Title and Abstract Lengths", fontsize=14)
plt.show()

# Missing rate per column (in %), highest first
missing_rates = df.isnull().mean().sort_values(ascending=False) * 100

print("Missing data rates (%):")
print(missing_rates.round(2))

# Plot as bar chart
plt.figure(figsize=(10,5))
missing_rates.plot(kind="bar", edgecolor="black")
plt.title("Missing Data Percentage by Column", fontsize=14)
plt.ylabel("Percentage (%)")
plt.xticks(rotation=45, ha="right")
plt.show()


### Dataset Summary

The dataset has **17,136 rows × 13 columns** of academic publication metadata.

#### Observations

- **Data types**:  
  All columns are text. `published` should be converted to `datetime` for trend analysis.  

- **Missing data**:  
  - `authors`: 81% missing  
  - `journal`: 81% missing  
  - `abstract`: 21% missing  
  - `doi`: 28% missing  
  - `pdf_url`: 56% missing  
  - Other fields: complete  

- **Uniqueness**:  
  - `id` is unique (usable as primary key)  
  - Titles are mostly unique (99%)  

- **Distributions**:  
  - `source` dominated by **openalex** (81%)  
  - `assigned_sectors` has 26 categories, with **Mining** and **Agriculture** frequent  
  - Publication years span **1994–2020+**  

#### Interpretation

The dataset is **large and well-structured**, but metadata gaps (authors, journals, abstracts) limit certain analyses.  
Complete fields (`id`, `topics`, `assigned_sectors`) support sectoral and thematic exploration.  
Preprocessing (handling missingness, normalizing dates) is needed for deeper analysis.
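The uniqueness, distribution, and date-range observations above can be reproduced with checks along these lines. This is a sketch on an invented four-row frame that borrows the column names `id`, `published`, and `assigned_sectors` from the dataset; the values and date format are assumptions.

```python
import pandas as pd

# Invented rows using the dataset's column names (illustrative only).
toy = pd.DataFrame({
    "id": ["a1", "a2", "a3", "a4"],
    "published": ["2019-05-01", "2020-11-15", "not a date", None],
    "assigned_sectors": ["Mining", "Agriculture", "Mining", "Health"],
})

# Can `id` serve as a primary key?
id_is_unique = toy["id"].is_unique

# Share of articles per sector, most frequent first.
sector_share = toy["assigned_sectors"].value_counts(normalize=True)

# Unparseable or missing dates become NaT instead of raising an error.
published = pd.to_datetime(toy["published"], errors="coerce")
n_unparsed = int(published.isna().sum())
year_span = (int(published.dt.year.min()), int(published.dt.year.max()))

print("id unique:", id_is_unique)
print(sector_share)
print("unparsed dates:", n_unparsed, "| year span:", year_span)
```

Counting the values that fail to parse before imputing anything shows how much of the publication-year trend would rest on filled-in data.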


# Data Preparation

In [None]:
import numpy as np
import re
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer


In [None]:
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')

In [None]:
# Check the percentage of missing values for each column
missing_percentage = (df.isnull().sum() / len(df)) * 100
print("Missing Value Percentage:\n", missing_percentage)

# For text columns, fill missing values with empty strings.
# Assigning the result back avoids pandas' chained-assignment warning
# that inplace fillna on a column selection triggers.
text_columns = ['title', 'abstract', 'authors']
for col in text_columns:
    if col in df.columns:
        df[col] = df[col].fillna('')

# For categorical columns, fill missing values with 'Unknown'
categorical_columns = ['journal', 'topics', 'provenance_sources']
for col in categorical_columns:
    if col in df.columns:
        df[col] = df[col].fillna('Unknown')

# For the published date, extract the year and impute missing years with the median
df['published'] = pd.to_datetime(df['published'], errors='coerce')
df['published_year'] = df['published'].dt.year
df['published_year'] = df['published_year'].fillna(df['published_year'].median())

print("\nMissing values after treatment:\n", df.isnull().sum())