
## **A Multiclass Classification Problem**

### **Problem Statement**

Algorithms for text classification have come a long way, but classifying long texts and working with under-resourced languages can still pose difficulties. This challenge gives participants the opportunity to improve on text classification techniques and algorithms for text in Chichewa. The texts are news articles of varying lengths.

The objective of this challenge is to classify these articles by topic. We hope that your solutions will illustrate some challenges and offer solutions.

Competition link: https://zindi.africa/competitions/ai4d-malawi-news-classification-challenge


**About the dataset**

The data was collected from news publications in Malawi. tNyasa Ltd Data Science Lab used three main sources: the Nation Online newspaper, Radio Maria and the Malawi Broadcasting Corporation. The dataset contains full articles spanning many genres, from social issues, family and relationships to political and economic issues.

- The articles were cleaned by removing special characters and HTML tags.

- Our task is to classify each news article into one of the 20 classes listed below. The classes are mutually exclusive.

List of classes: ['SOCIAL ISSUES', 'EDUCATION', 'RELATIONSHIPS', 'ECONOMY', 'RELIGION', 'POLITICS', 'LAW/ORDER', 'SOCIAL', 'HEALTH', 'ARTS AND CRAFTS', 'FARMING', 'CULTURE', 'FLOODING', 'WITCHCRAFT', 'MUSIC', 'TRANSPORT', 'WILDLIFE/ENVIRONMENT', 'LOCALCHIEFS', 'SPORTS', 'OPINION/ESSAY']


The evaluation metric for this challenge is Accuracy.
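
As a reminder of what the metric measures, here is a minimal sketch of accuracy as the fraction of exactly matching labels. The function name `accuracy` and the toy labels are illustrative assumptions, not part of the competition code.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty list")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical labels, for illustration only
true_labels = ["POLITICS", "HEALTH", "SPORTS", "POLITICS"]
predicted = ["POLITICS", "HEALTH", "POLITICS", "POLITICS"]
print(accuracy(true_labels, predicted))  # 0.75
```

Because every article counts equally, frequent classes dominate this score, which is worth keeping in mind for a dataset with uneven class sizes.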

The submission file should look like this:

| ID | Label |
| --- | --- |
| ID_ADHEtjTi | SOCIAL ISSUES |
| ID_AHfJktdQ | EDUCATION |
| ID_AUJIHpZr | RELATIONSHIPS |
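
The layout above can be produced with a few lines of standard-library code. This is only a sketch: `write_submission` is a hypothetical helper, the file name in the comment is an assumption, and the rows are the three examples from the table rather than real predictions.

```python
import csv
import io


def write_submission(ids, labels, file_obj):
    """Write ID/Label rows in the two-column submission layout."""
    if len(ids) != len(labels):
        raise ValueError("ids and labels must have the same length")
    writer = csv.writer(file_obj, lineterminator="\n")
    writer.writerow(["ID", "Label"])
    writer.writerows(zip(ids, labels))


# The three example rows from the table above
ids = ["ID_ADHEtjTi", "ID_AHfJktdQ", "ID_AUJIHpZr"]
labels = ["SOCIAL ISSUES", "EDUCATION", "RELATIONSHIPS"]

# In-memory stand-in for open("submission.csv", "w", newline="")
buffer = io.StringIO()
write_submission(ids, labels, buffer)
submission_text = buffer.getvalue()
print(submission_text)
```

With pandas, a DataFrame holding the same two columns can be written with `to_csv(index=False)` to get an equivalent file.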


**Checking the Working Directory**

In [1]:
!ls

sample_data


**Initialization**

In [2]:
# Loading Libraries

# Warning Libraries
import warnings 
warnings.filterwarnings("ignore")

# Scientific and Data Manipulation Libraries 
import os
import math
import numpy as np
import pandas as pd 

In [5]:
%%capture

# Loading Data
!gdown --id 1xqbJqmCpQyQjv8f4xPW8ApUYwBuOcqRt --output train.csv
!gdown --id 1uuB0K00Yl4weUWrsilTLHzkERmpm5Oio --output test.csv
!gdown --id 1u6xKBfVHiawkNLfWZrlnpstd4j9q4Y5H --output sample.csv

# Read the downloaded CSV files into DataFrames
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')
sample_data = pd.read_csv('sample.csv')

In [8]:
train_data.head()

|  | ID | Text | Label |
| --- | --- | --- | --- |
| 0 | ID_AASHwXxg | Mwangonde: Khansala wachinyamata Akamati achi... | POLITICS |
| 1 | ID_AGoFySzn | MCP siidakhutire ndi kalembera Chipani cha Ma... | POLITICS |
| 2 | ID_AGrrkBGP | Bungwe la MANEPO Lapempha Boma Liganizire Anth... | HEALTH |
| 3 | ID_AIJeigeG | Ndale zogawanitsa miyambo zanyanya Si zachile... | POLITICS |
| 4 | ID_APMprMbV | Nanga wapolisi ataphofomoka? Masiku ano sichi... | LAW/ORDER |


In [9]:
train_data.shape , test_data.shape

((1436, 3), (620, 2))

In [10]:
test_data.head()

|  | ID | Text |
| --- | --- | --- |
| 0 | ID_ADHEtjTi | Abambo odzikhweza akuchuluka Kafukufuku wa ap... |
| 1 | ID_AHfJktdQ | Ambuye Ziyaye Ayamikira Aphunzitsi a Tilitonse... |
| 2 | ID_AUJIHpZr | Anatcheleza: Akundiopseza a gogo wanga Akundi... |
| 3 | ID_AUKYBbIM | Ulova wafika posauzana Adatenga digiri ya uph... |
| 4 | ID_AZnsVPEi | Dzombe kukoma, koma Kuyambira makedzana, pant... |


In [11]:
sample_data.head()

|  | ID | Label |
| --- | --- | --- |
| 0 | ID_sQaPRMWO | 0 |
| 1 | ID_TanclvfR | 0 |
| 2 | ID_CNbveyvk | 0 |
| 3 | ID_MclKMhyP | 0 |
| 4 | ID_rNrmXOGD | 0 |


In [12]:
train_data.isnull().sum()

ID       0
Text     0
Label    0
dtype: int64

There are no missing values; the data appears to be clean.

In [13]:
train_data.Label.value_counts(normalize=True)

POLITICS                0.194290
SOCIAL                  0.105850
RELIGION                0.102368
LAW/ORDER               0.094708
SOCIAL ISSUES           0.093315
HEALTH                  0.088440
ECONOMY                 0.059889
FARMING                 0.054318
SPORTS                  0.034123
EDUCATION               0.029944
RELATIONSHIPS           0.027159
WILDLIFE/ENVIRONMENT    0.025070
OPINION/ESSAY           0.018106
LOCALCHIEFS             0.017409
CULTURE                 0.016017
WITCHCRAFT              0.011142
MUSIC                   0.010446
TRANSPORT               0.007660
FLOODING                0.004875
ARTS AND CRAFTS         0.004875
Name: Label, dtype: float64

In [15]:
sample_data.Label.value_counts(normalize=True)

0    1.0
Name: Label, dtype: float64

In [16]:
from collections import Counter
Counter(train_data["Label"])

Counter({'ARTS AND CRAFTS': 7,
         'CULTURE': 23,
         'ECONOMY': 86,
         'EDUCATION': 43,
         'FARMING': 78,
         'FLOODING': 7,
         'HEALTH': 127,
         'LAW/ORDER': 136,
         'LOCALCHIEFS': 25,
         'MUSIC': 15,
         'OPINION/ESSAY': 26,
         'POLITICS': 279,
         'RELATIONSHIPS': 39,
         'RELIGION': 147,
         'SOCIAL': 152,
         'SOCIAL ISSUES': 134,
         'SPORTS': 49,
         'TRANSPORT': 11,
         'WILDLIFE/ENVIRONMENT': 36,
         'WITCHCRAFT': 16})

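## Data Imbalance Issues

The classes are heavily imbalanced: POLITICS has 279 training articles, while FLOODING and ARTS AND CRAFTS have only 7 each.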
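
To put numbers on the imbalance, the sketch below applies the "balanced" weighting heuristic, `n_samples / (n_classes * class_count)`, to the label counts printed above. The variable names are illustrative assumptions. The pipeline later in this notebook wraps the classifier in a one-vs-rest scheme, so the weights it actually uses are computed per binary problem and will differ; the multiclass figures here only illustrate the formula.

```python
# Label counts copied from the Counter output above
label_counts = {
    "ARTS AND CRAFTS": 7, "CULTURE": 23, "ECONOMY": 86, "EDUCATION": 43,
    "FARMING": 78, "FLOODING": 7, "HEALTH": 127, "LAW/ORDER": 136,
    "LOCALCHIEFS": 25, "MUSIC": 15, "OPINION/ESSAY": 26, "POLITICS": 279,
    "RELATIONSHIPS": 39, "RELIGION": 147, "SOCIAL": 152, "SOCIAL ISSUES": 134,
    "SPORTS": 49, "TRANSPORT": 11, "WILDLIFE/ENVIRONMENT": 36, "WITCHCRAFT": 16,
}

n_samples = sum(label_counts.values())
n_classes = len(label_counts)

# "balanced" heuristic: rare classes get proportionally larger weights
balanced_weights = {
    label: n_samples / (n_classes * count)
    for label, count in label_counts.items()
}

# Ratio between the largest and the smallest class
imbalance_ratio = max(label_counts.values()) / min(label_counts.values())

print(n_samples, n_classes)  # 1436 20
for label in ("POLITICS", "ARTS AND CRAFTS"):
    print(label, round(balanced_weights[label], 3))
print(round(imbalance_ratio, 1))
```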

In [17]:
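#pre-processing
import re 
def clean_str(string):
    """
    Clean one article for vectorization: drop line breaks and quote
    characters, replace every digit with the token "digit", then strip
    surrounding whitespace and lower-case the text.
    """
    string = re.sub(r"\n", "", string)          # remove newlines
    string = re.sub(r"\r", "", string)          # remove carriage returns
    string = re.sub(r"[0-9]", "digit", string)  # each digit becomes "digit"
    string = re.sub(r"\'", "", string)          # remove single quotes
    string = re.sub(r"\"", "", string)          # remove double quotes
    return string.strip().lower()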
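
A quick check of what `clean_str` does to a string makes two side effects visible: words on either side of a line break are joined, and a four-digit number becomes four `digit` tokens glued together. The sketch below repeats the function so it runs on its own, then compares it with `clean_str_variant`, a hypothetical alternative that is not what this notebook ran and would change the reported scores. The English sample sentence is made up for illustration.

```python
import re


def clean_str(string):
    """Same cleaning steps as the notebook cell above."""
    string = re.sub(r"\n", "", string)
    string = re.sub(r"\r", "", string)
    string = re.sub(r"[0-9]", "digit", string)
    string = re.sub(r"\'", "", string)
    string = re.sub(r"\"", "", string)
    return string.strip().lower()


def clean_str_variant(string):
    """Hypothetical variant: keep word boundaries, one token per number."""
    string = re.sub(r"[\r\n]+", " ", string)     # line breaks become spaces
    string = re.sub(r"[0-9]+", "digit", string)  # a run of digits becomes one token
    string = re.sub(r"['\"]", "", string)        # drop both quote characters
    return string.strip().lower()


sample = "Budget 2020\nwas 'approved'"
original_result = clean_str(sample)
variant_result = clean_str_variant(sample)
print(original_result)  # budget digitdigitdigitdigitwas approved
print(variant_result)   # budget digit was approved
```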

In [18]:
import numpy as np

In [19]:
# Clean the text and create a hold-out split
from sklearn.model_selection import train_test_split
# Select the Text column by name rather than by position
X = [clean_str(text) for text in train_data["Text"]]
y = np.array(train_data["Label"])
# 70% for training, 30% held out for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=5)
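
The split above is purely random, so a class with only 7 articles can end up under-represented in, or missing from, the hold-out set. One common remedy is a stratified split; scikit-learn's `train_test_split` accepts a `stratify` argument for this. The standard-library sketch below only illustrates the idea: `stratified_split` is a hypothetical helper, it is not scikit-learn's algorithm, and the toy labels are made up.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.3, seed=5):
    """Return (train_idx, test_idx) with roughly equal class proportions."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # At least one test item per class, but never the whole class
        n_test = max(1, round(len(indices) * test_size))
        n_test = min(n_test, len(indices) - 1) if len(indices) > 1 else 0
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical toy labels mimicking an imbalanced dataset
toy_labels = ["POLITICS"] * 20 + ["HEALTH"] * 10 + ["FLOODING"] * 3
train_idx, test_idx = stratified_split(toy_labels)
test_labels = [toy_labels[i] for i in test_idx]

print(len(train_idx), len(test_idx))
print({label: test_labels.count(label) for label in sorted(set(toy_labels))})
```

Every class contributes about 30% of its items to the test side, so even the smallest class is represented in both parts.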

In [20]:

#feature engineering and model selection
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier

In [21]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split

In [22]:
#pipeline of feature engineering and model
model = Pipeline([('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC(class_weight="balanced")))])
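
To make the first two pipeline steps less opaque, here is a dependency-free sketch of counting tokens and then applying TF-IDF weighting. It assumes the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which is my understanding of the scikit-learn defaults; treat the details as an approximation. The function `tfidf` and the English toy corpus are illustrative assumptions.

```python
import math
import re
from collections import Counter

# Words of two or more characters, matching the default word-token pattern
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tfidf(docs):
    """Return one {term: weight} dict per document (smoothed idf, L2-normalised)."""
    tokenised = [TOKEN_PATTERN.findall(doc.lower()) for doc in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        weights = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


# Hypothetical English toy corpus, for illustration only
docs = ["the election results", "the football results", "the election campaign"]
vectors = tfidf(docs)
for vec in vectors:
    print({t: round(w, 3) for t, w in sorted(vec.items())})
```

Terms that occur in every document (here "the") receive the smallest weight, while terms unique to one document receive the largest.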

In [23]:
# Hyperparameter grid for the search
from sklearn.model_selection import GridSearchCV
parameters = {'vectorizer__ngram_range': [(1, 1), (1, 2),(2,2)],
               'tfidf__use_idf': (True, False)}
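
A grid search tries every combination in this dictionary. The sketch below enumerates them with `itertools.product` to show how many candidates there are. The fold count of 5 is an assumption based on the default in recent scikit-learn versions; the notebook does not set `cv` explicitly.

```python
from itertools import product

# Same grid as in the cell above
parameters = {
    "vectorizer__ngram_range": [(1, 1), (1, 2), (2, 2)],
    "tfidf__use_idf": (True, False),
}

names = sorted(parameters)
candidates = [
    dict(zip(names, values))
    for values in product(*(parameters[name] for name in names))
]

n_folds = 5  # assumption: default number of cross-validation folds

for candidate in candidates:
    print(candidate)
print(len(candidates))            # 6
print(len(candidates) * n_folds)  # 30
```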

In [24]:
gs_clf_svm = GridSearchCV(model, parameters, n_jobs=-1)
gs_clf_svm = gs_clf_svm.fit(X, y)
print(gs_clf_svm.best_score_)
print(gs_clf_svm.best_params_)

0.642065911730546
{'tfidf__use_idf': True, 'vectorizer__ngram_range': (1, 2)}


In [25]:
#preparing the final pipeline using the selected parameters
model = Pipeline([('vectorizer', CountVectorizer(ngram_range=(1,2))),
    ('tfidf', TfidfTransformer(use_idf=True)),
    ('clf', OneVsRestClassifier(LinearSVC(class_weight="balanced")))])

In [26]:
#fit model with training data
model.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 2), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabula...)),
                ('tfidf',
                 TfidfTransformer(norm='l2', smooth_idf=True,
                                  sublinear_tf=False, use_idf=True)),
                ('clf',
                 OneVsRestClassifier(estimator=LinearSVC(C=1.0,
                                                         class_weight='balanced',
     

In [27]:
# Predict on the hold-out split (X_test comes from train.csv, not from test.csv)
pred = model.predict(X_test)

In [28]:
model.classes_

array(['ARTS AND CRAFTS', 'CULTURE', 'ECONOMY', 'EDUCATION', 'FARMING',
       'FLOODING', 'HEALTH', 'LAW/ORDER', 'LOCALCHIEFS', 'MUSIC',
       'OPINION/ESSAY', 'POLITICS', 'RELATIONSHIPS', 'RELIGION', 'SOCIAL',
       'SOCIAL ISSUES', 'SPORTS', 'TRANSPORT', 'WILDLIFE/ENVIRONMENT',
       'WITCHCRAFT'], dtype='<U20')

In [29]:
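from sklearn.metrics import confusion_matrix, accuracy_score
# Predictions are passed first, so rows are predicted labels and columns are
# true labels (the transpose of scikit-learn's usual y_true, y_pred layout)
confusion_matrix(pred, y_test)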

array([[ 1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  1,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0],
       [ 0,  0, 10,  0,  1,  0,  1,  0,  0,  0,  4,  2,  0,  0,  0,  1,
         0,  0,  0,  0],
       [ 0,  0,  0,  6,  0,  0,  0,  0,  0,  0,  0,  1,  0,  0,  0,  1,
         0,  0,  0,  0],
       [ 0,  0,  2,  0, 17,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  1,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0],
       [ 0,  0,  1,  0,  1,  0, 23,  1,  0,  0,  0,  0,  0,  1,  5,  0,
         0,  0,  1,  0],
       [ 0,  0,  0,  0,  0,  0,  2, 29,  0,  0,  0,  4,  0,  1,  3,  7,
         1,  0,  1,  2],
       [ 0,  0,  1,  0,  0,  0,  0,  0,  1,  0,  0,  2,  0,  0,  1,  0,
         0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  3,  0,  0,  0,  1,  0,  0,
         0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0
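
The orientation of the matrix matters when reading it. Because the call above passes `pred` first, each row corresponds to a predicted label and each column to a true label, so an all-zero row (such as the sixth row, FLOODING) means that class was never predicted. The sketch below rebuilds both orientations for a made-up example and derives per-class recall; `confusion_counts` is a hypothetical helper and the labels are illustrative only.

```python
from collections import Counter


def confusion_counts(row_labels, col_labels, classes):
    """matrix[i][j] = samples with row label classes[i] and column label classes[j]."""
    pairs = Counter(zip(row_labels, col_labels))
    return [[pairs[(r, c)] for c in classes] for r in classes]


# Hypothetical labels, for illustration only
classes = ["HEALTH", "POLITICS", "SPORTS"]
y_true = ["HEALTH", "HEALTH", "POLITICS", "POLITICS", "POLITICS", "SPORTS"]
y_pred = ["HEALTH", "POLITICS", "POLITICS", "POLITICS", "HEALTH", "POLITICS"]

# Rows = predicted, columns = true (the orientation used in the cell above)
pred_rows = confusion_counts(y_pred, y_true, classes)
# Rows = true, columns = predicted (the more common convention)
true_rows = confusion_counts(y_true, y_pred, classes)

# Per-class recall = diagonal entry / number of true samples of that class
recall = {}
for j, label in enumerate(classes):
    n_true = sum(row[j] for row in pred_rows)  # column sum in the pred-rows layout
    recall[label] = pred_rows[j][j] / n_true if n_true else 0.0

print(pred_rows)  # [[1, 1, 0], [1, 2, 1], [0, 0, 0]]
print(true_rows)  # [[1, 1, 0], [1, 2, 0], [0, 1, 0]]
print(recall)
```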

In [31]:
accuracy_score(y_test, pred)

0.6194895591647331
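
For context, the hold-out accuracy can be compared with a trivial baseline that always predicts the most frequent class. The sketch below uses only numbers printed earlier in this notebook. It assumes the hold-out size is the test fraction rounded up, and it computes the baseline on the full training label distribution, so the baseline for the actual hold-out split may differ slightly.

```python
import math

n_samples = 1436                       # rows in train.csv (from train_data.shape)
test_fraction = 0.3                    # test_size used in train_test_split
majority_count = 279                   # POLITICS articles (from the Counter output)
holdout_accuracy = 0.6194895591647331  # accuracy_score output above

# Assumption: the hold-out size is the test fraction rounded up
n_holdout = math.ceil(n_samples * test_fraction)
n_correct = round(holdout_accuracy * n_holdout)

# Accuracy of always predicting the most frequent class
majority_baseline = majority_count / n_samples

print(n_holdout, n_correct)         # 431 267
print(round(majority_baseline, 4))  # 0.1943
```

Under these assumptions the model labels 267 of 431 hold-out articles correctly, roughly three times the share a majority-class guess would get.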