# MasakhaNER Project: Named Entity Recognition for African Languages

## Introduction and Problem Statement

This notebook explores Named Entity Recognition (NER) for low-resource African languages using the MasakhaNER dataset. NER is a fundamental NLP task that identifies named entities in text and classifies them into predefined categories such as persons, organizations, locations, and dates.
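To make the task concrete: NER labels are commonly written as BIO tags, where `B-` opens an entity, `I-` continues it, and `O` marks a token outside any entity. The sketch below groups such tags into entity spans. The helper name `extract_entities`, the example sentence, and its tags are illustrative assumptions, not taken from the dataset.

```python
from typing import List, Optional, Tuple


def extract_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs.

    Hypothetical helper for illustration; an I- tag that does not continue
    an open entity of the same type is treated like O.
    """
    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    for token, tag in zip(tokens, tags):
        if tag.startswith("I-") and current_type == tag[2:]:
            # Continue the entity that is currently open.
            current_tokens.append(token)
            continue
        # Any other tag closes the open entity, if there is one.
        if current_tokens:
            entities.append((" ".join(current_tokens), current_type))
        if tag.startswith("B-"):
            current_tokens, current_type = [token], tag[2:]
        else:
            current_tokens, current_type = [], None

    # Close an entity that runs to the end of the sentence.
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


# Made-up Swahili example sentence with assumed tags.
tokens = ["Wangari", "Maathai", "alizaliwa", "Nyeri", "mwaka", "1940"]
tags = ["B-PER", "I-PER", "O", "B-LOC", "O", "B-DATE"]
entities = extract_entities(tokens, tags)
print(entities)
```

Working at the span level rather than the token level matters later on, because NER quality is usually judged on whole entities, not on individual tags.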

The MasakhaNER dataset covers ten African languages: Amharic, Hausa, Igbo, Kinyarwanda, Luganda, Luo, Nigerian Pidgin, Swahili, Wolof, and Yorùbá. This project addresses the challenge of building effective NLP tools for languages with limited digital resources.

### Project Goals
1. Explore and analyze the MasakhaNER dataset
2. Implement preprocessing pipelines for African languages
3. Develop and compare different NER models
4. Evaluate model performance across languages
5. Identify challenges and opportunities for low-resource NLP

### Setup and Data Loading

In [None]:
# Import required libraries
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter, defaultdict
import re
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer, AutoModelForTokenClassification
from tqdm.notebook import tqdm
import requests
import zipfile
import io

# Set display options
pd.set_option('display.max_columns', None)
sns.set_theme(style='whitegrid')  # set_theme is the preferred name; sns.set is an alias

# Set random seeds for reproducibility
np.random.seed(42)
torch.manual_seed(42)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(42)

# Check for GPU availability
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
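Once the environment is ready, the annotated text has to be read into token and label sequences. The following is a minimal sketch, assuming the data comes as CoNLL-style text with one `token label` pair per line and a blank line between sentences. The function name `parse_conll` and the sample text are hypothetical; the actual file layout should be checked against the downloaded data.

```python
from typing import List, Tuple

Sentence = Tuple[List[str], List[str]]  # (tokens, tags)


def parse_conll(text: str) -> List[Sentence]:
    """Parse CoNLL-style text into a list of (tokens, tags) sentences.

    Assumes whitespace-separated columns with the token first and the
    label last; blank lines separate sentences.
    """
    sentences: List[Sentence] = []
    tokens: List[str] = []
    tags: List[str] = []

    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("-DOCSTART-"):
            # Document separator used in some CoNLL files; skip it.
            continue
        if not line:
            # Blank line: close the current sentence, if any.
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        parts = line.split()
        tokens.append(parts[0])
        tags.append(parts[-1])

    # Handle a final sentence that is not followed by a blank line.
    if tokens:
        sentences.append((tokens, tags))
    return sentences


# Made-up sample in the assumed format.
sample = "Wangari B-PER\nMaathai I-PER\nalizaliwa O\nNyeri B-LOC\n\nKano B-LOC\n"
parsed = parse_conll(sample)
print(f"Parsed {len(parsed)} sentences")
```

Keeping tokens and tags as parallel lists makes it straightforward to align labels with subword tokens later, when a transformer tokenizer splits words into pieces.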