## South African Language Identification Hack 2022

EDSA 2201 & 2207 classification hackathon

---

In this challenge, you take text written in any of South Africa's 11 official languages and identify which language it is in. This is an example of language identification in NLP: the task of determining the natural language in which a piece of text is written.
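
To make the task concrete, here is a minimal, standard-library-only sketch of language identification by character trigram overlap. It is an illustration only and not the method used in this notebook, which relies on TF-IDF features and scikit-learn classifiers. The helper names `trigrams` and `identify` are hypothetical, and the two reference snippets are shortened from the training rows shown in Section 2.

```python
from collections import Counter


def trigrams(text):
    """Count overlapping character trigrams in a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


# Reference snippets (shortened from the training rows shown in Section 2)
REFERENCE = {
    "eng": ("the province of kwazulu-natal department of transport invites tenders "
            "from established contractors experienced in bridge construction"),
    "xho": ("umgaqo-siseko wenza amalungiselelo kumaziko axhasa ulawulo "
            "lwesininzi kunye nokuthath inxaxheba kwabafazi"),
}
PROFILES = {lang: trigrams(text) for lang, text in REFERENCE.items()}


def identify(text):
    """Return the language whose trigram profile overlaps most with the text."""
    query = trigrams(text)

    def overlap(profile):
        return sum(min(count, profile[gram]) for gram, count in query.items())

    return max(PROFILES, key=lambda lang: overlap(PROFILES[lang]))


guess = identify("construction of the bridge")
```

With only two tiny profiles this is far too weak for real use, but it shows why character and word n-grams are useful signals for telling languages apart.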

<a id="menu"></a>
## Sections Menu

<a href=#inp>1. Import Necessary Packages</a>

<a href=#ltd>2. Load Train Data</a>

<a href=#eda>3. Exploratory Data Analysis</a>

<a href=#de>4. Data Engineering</a>

<a href=#mod>5. Modeling</a>

<a href=#mp>6. Model Performance</a>


 <a id="inp"></a>
## 1. Import Necessary Packages
<a href=#menu>Back to Sections Menu</a>

In [47]:
# Data loading, data manipulation and data visualisation packages
import pandas as pd

# NLP and text preprocessing packages
from sklearn.feature_extraction.text import TfidfVectorizer

# Modelling packages
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn import feature_selection
from sklearn.feature_selection import f_classif
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

#libraries for score metrics
from sklearn.metrics import accuracy_score, classification_report

 <a id="ltd"></a>
## 2. Load Train Data
<a href=#menu>Back to Sections Menu</a>

In [6]:
#load training data
df_train = pd.read_csv('train_set.csv')

train_set = df_train.copy()
pd.set_option('max_colwidth', None)
train_set.head()

Unnamed: 0,lang_id,text
0,xho,umgaqo-siseko wenza amalungiselelo kumaziko axhasa ulawulo lwesininzi kunye nokuthath inxaxheba kwabafazi ezi ziquka phakathi kwezinye zazo ikomishoni yokulingana ngokwesini ikomishoni yamalungelo oluntu lomzantsi afrika
1,xho,i-dha iya kuba nobulumko bokubeka umsebenzi naphi na kwisebe ngokusekwe kwiimfuno zokusebenza zalo emva kokubonana nomsebenzi kunye okanye imanyano yakhe ukuba ulandulo lomntu onjalo alufanelekanga i-dha mayibize uncedo olufanelekileyo elungelweni layo
2,eng,the province of kwazulu-natal department of transport invites tenders from established contractors experienced in bridge construction for the construction of the kwajolwayo tugela river pedestrian bridge near tugela ferry the duration of the project will be months
3,nso,o netefatša gore o ba file dilo ka moka tše le dumelelanego ka tšona mohlala maleri a magolo a a šomišwago go fihlelela meagong e metelele scaffolds a a bolokegilego lefelo la maleba la go šomela go phela gabotse bjbj
4,ven,khomishini ya ndinganyiso ya mbeu yo ewa maana u ya nga mulayo wa khomishini ya ndinganyiso ya mbeu u thetshelesa mbilaelo dzine dza tshimbilelana na tshialula u ya nga mbeu nahone i ivhea sa foramu ya thungo u ya nga mulayo wa ndinganyiso


 <a id="eda"></a>
## 3. Exploratory Data Analysis
<a href=#menu>Back to Sections Menu</a>

In [19]:
print('DF shape?\n',train_set.shape)
print('\nNull entries?\n',train_set.isna().sum())
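# Note: columns.duplicated() flags repeated column names, not repeated rows
# (train_set.duplicated() would check for repeated rows)
print('\nDuplicates?', train_set.columns.duplicated().any())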
print('')

DF shape?
 (33000, 2)

Null entries?
 lang_id    0
text       0
dtype: int64

Duplicates? False
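
The duplicate check above inspects column names. As a minimal sketch of what a row-level check and a class-balance check could look like, the example below uses a small hypothetical frame (`toy`) standing in for `train_set`; the values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy frame standing in for train_set
toy = pd.DataFrame({
    "lang_id": ["eng", "eng", "afr", "afr"],
    "text": ["the bridge", "the bridge", "winste op valuta", "buitelandse valuta"],
})

# True only if two columns share the same name
duplicate_columns = toy.columns.duplicated().any()

# Number of rows that repeat an earlier row in full
duplicate_rows = toy.duplicated().sum()

# Number of samples per language label
class_counts = toy["lang_id"].value_counts()
```

On the real data, `train_set['lang_id'].value_counts()` would show whether the languages are evenly represented.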


 <a id="de"></a>
## 4. Data Engineering
<a href=#menu>Back to Sections Menu</a>

In [21]:
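# Intended to replace special characters, numbers and punctuation with spaces.
# Note: with regex=False the pattern is matched as a literal string, so nothing
# is replaced here (hyphens and diacritics remain in the output below).
# Passing regex=True would apply it, but would also strip letters such as 'š'.
train_set['clean_text'] = train_set['text'].str.replace('[^a-zA-Z#]', ' ', regex=False)
# Lowercase the text
train_set['clean_text'] = train_set['clean_text'].str.lower()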
train_set.head()

Unnamed: 0,lang_id,text,clean_text
0,xho,umgaqo-siseko wenza amalungiselelo kumaziko axhasa ulawulo lwesininzi kunye nokuthath inxaxheba kwabafazi ezi ziquka phakathi kwezinye zazo ikomishoni yokulingana ngokwesini ikomishoni yamalungelo oluntu lomzantsi afrika,umgaqo-siseko wenza amalungiselelo kumaziko axhasa ulawulo lwesininzi kunye nokuthath inxaxheba kwabafazi ezi ziquka phakathi kwezinye zazo ikomishoni yokulingana ngokwesini ikomishoni yamalungelo oluntu lomzantsi afrika
1,xho,i-dha iya kuba nobulumko bokubeka umsebenzi naphi na kwisebe ngokusekwe kwiimfuno zokusebenza zalo emva kokubonana nomsebenzi kunye okanye imanyano yakhe ukuba ulandulo lomntu onjalo alufanelekanga i-dha mayibize uncedo olufanelekileyo elungelweni layo,i-dha iya kuba nobulumko bokubeka umsebenzi naphi na kwisebe ngokusekwe kwiimfuno zokusebenza zalo emva kokubonana nomsebenzi kunye okanye imanyano yakhe ukuba ulandulo lomntu onjalo alufanelekanga i-dha mayibize uncedo olufanelekileyo elungelweni layo
2,eng,the province of kwazulu-natal department of transport invites tenders from established contractors experienced in bridge construction for the construction of the kwajolwayo tugela river pedestrian bridge near tugela ferry the duration of the project will be months,the province of kwazulu-natal department of transport invites tenders from established contractors experienced in bridge construction for the construction of the kwajolwayo tugela river pedestrian bridge near tugela ferry the duration of the project will be months
3,nso,o netefatša gore o ba file dilo ka moka tše le dumelelanego ka tšona mohlala maleri a magolo a a šomišwago go fihlelela meagong e metelele scaffolds a a bolokegilego lefelo la maleba la go šomela go phela gabotse bjbj,o netefatša gore o ba file dilo ka moka tše le dumelelanego ka tšona mohlala maleri a magolo a a šomišwago go fihlelela meagong e metelele scaffolds a a bolokegilego lefelo la maleba la go šomela go phela gabotse bjbj
4,ven,khomishini ya ndinganyiso ya mbeu yo ewa maana u ya nga mulayo wa khomishini ya ndinganyiso ya mbeu u thetshelesa mbilaelo dzine dza tshimbilelana na tshialula u ya nga mbeu nahone i ivhea sa foramu ya thungo u ya nga mulayo wa ndinganyiso,khomishini ya ndinganyiso ya mbeu yo ewa maana u ya nga mulayo wa khomishini ya ndinganyiso ya mbeu u thetshelesa mbilaelo dzine dza tshimbilelana na tshialula u ya nga mbeu nahone i ivhea sa foramu ya thungo u ya nga mulayo wa ndinganyiso
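
The `clean_text` column above is identical to `text`, because `regex=False` makes pandas search for the literal string `[^a-zA-Z#]`. The sketch below contrasts the three behaviours on two hypothetical sample strings. The helper `keep_letters` is an assumption introduced for illustration, not part of the notebook.

```python
import pandas as pd

# Hypothetical sample strings; 'tšona' uses a diacritic that occurs in the nso rows
sample = pd.Series(["kwazulu-natal 2022!", "tšona-2"])
pattern = "[^a-zA-Z#]"

# regex=False: the pattern is a literal substring, so nothing matches
literal = sample.str.replace(pattern, " ", regex=False)

# regex=True: every character outside a-z, A-Z and # becomes a space,
# which also removes non-ASCII letters such as 'š'
ascii_only = sample.str.replace(pattern, " ", regex=True)


def keep_letters(text):
    """Replace anything that is not a Unicode letter or whitespace with a space."""
    return "".join(ch if ch.isalpha() or ch.isspace() else " " for ch in text)


# Unicode-aware alternative that keeps letters with diacritics
unicode_aware = sample.map(keep_letters)
```

Which variant is appropriate depends on the data. For languages written with diacritics, the ASCII-only pattern discards information that may help the classifier.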


 <a id="mod"></a>
## 5. Modeling
<a href=#menu>Back to Sections Menu</a>

### 5.1 Vectorization

In [23]:
# Create features out of the text (Vectorization)
vectorizer = TfidfVectorizer(min_df=2, max_df=0.5, ngram_range=(1, 2))

# Fit and transform the data
training_x = vectorizer.fit_transform(train_set['clean_text'])

# Define the feature matrix and the target labels
X = training_x
y = train_set['lang_id']

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=50)

In [29]:
# Set up selector, choosing score function and number of features to retain
selector_kbest = feature_selection.SelectKBest(score_func=f_classif, k=50000)

In [30]:
# Transform: Run selection on the training data
X_train_kbest = selector_kbest.fit_transform(X_train, y_train)
X_train_kbest.shape


(26400, 50000)

In [31]:
# Transform: Run selection on the testing data
X_test_kbest = selector_kbest.transform(X_test)
X_test_kbest.shape

(6600, 50000)
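
The selection step above can be sketched on a tiny numeric example. The arrays below are hypothetical; the point is that the selector is fitted on the training split only and then reused, unchanged, on the held-out split.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Hypothetical toy data: columns 0 and 2 separate the classes, columns 1 and 3 do not
X_toy_train = np.array([
    [5, 1, 0, 2],
    [6, 2, 1, 1],
    [5, 2, 0, 2],
    [0, 1, 5, 1],
    [1, 2, 6, 2],
    [0, 2, 5, 2],
])
y_toy_train = np.array(["a", "a", "a", "b", "b", "b"])
X_toy_test = np.array([[6, 1, 0, 1], [0, 2, 6, 2]])

# Keep the two features with the highest ANOVA F-score
toy_selector = SelectKBest(score_func=f_classif, k=2)
X_toy_train_sel = toy_selector.fit_transform(X_toy_train, y_toy_train)

# Reuse the fitted selector on unseen data; do not refit it
X_toy_test_sel = toy_selector.transform(X_toy_test)

kept = toy_selector.get_support()
```

`get_support()` returns a boolean mask over the original columns, which is handy for checking which n-grams were retained.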

### 5.2 Models

### 5.2.1 Naive Bayes

In [34]:
# Candidate values for the smoothing parameter alpha
params = {'alpha': [0.01, 0.1, 0.5, 1.0, 10.0]}

# Perform a grid search to obtain the best alpha
# Note: the search runs on the full TF-IDF features (X_train), not the k-best subset
multinomial_nb_grid = GridSearchCV(MultinomialNB(), param_grid=params, n_jobs=-1, cv=5, verbose=5)
multinomial_nb_grid.fit(X_train, y_train)

Fitting 5 folds for each of 5 candidates, totalling 25 fits


GridSearchCV(cv=5, estimator=MultinomialNB(), n_jobs=-1,
             param_grid={'alpha': [0.01, 0.1, 0.5, 1.0, 10.0]}, verbose=5)

In [35]:
# Print out the accuracy and best parameters of the model
print('Train Accuracy : %.3f'%multinomial_nb_grid.best_estimator_.score(X_train, y_train))
print('Test Accuracy : %.3f'%multinomial_nb_grid.best_estimator_.score(X_test, y_test))
print('Best Accuracy Through Grid Search : %.3f'%multinomial_nb_grid.best_score_)
print('Best Parameters : ',multinomial_nb_grid.best_params_)

Train Accuracy : 1.000
Test Accuracy : 0.999
Best Accuracy Through Grid Search : 0.999
Best Parameters :  {'alpha': 0.5}


In [36]:
# Instantiate the model with the best alpha found above and fit it on the k-best features
naive_bayes = MultinomialNB(alpha=0.5)
naive_bayes.fit(X_train_kbest, y_train)

MultinomialNB(alpha=0.5)

In [37]:
# Generate predictions
nb_tuned_pred = naive_bayes.predict(X_test_kbest)

In [40]:
score = accuracy_score(y_test, nb_tuned_pred)
print("Accuracy is :", score)

Accuracy is : 0.9986363636363637

In [42]:
# Assess accuracy using the classification report
nb_report = classification_report(y_test, nb_tuned_pred)
print(nb_report)

              precision    recall  f1-score   support

         afr       1.00      1.00      1.00       596
         eng       1.00      1.00      1.00       595
         nbl       1.00      0.99      1.00       594
         nso       1.00      1.00      1.00       581
         sot       1.00      1.00      1.00       600
         ssw       1.00      1.00      1.00       601
         tsn       1.00      1.00      1.00       609
         tso       1.00      1.00      1.00       606
         ven       1.00      1.00      1.00       614
         xho       1.00      1.00      1.00       606
         zul       1.00      1.00      1.00       598

    accuracy                           1.00      6600
   macro avg       1.00      1.00      1.00      6600
weighted avg       1.00      1.00      1.00      6600
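
In Section 5.1 the vectorizer is fitted on the whole training frame before the train/test split, so the held-out rows influence the vocabulary. One possible alternative, sketched below under the assumption that a scikit-learn `Pipeline` is acceptable here, is to bundle vectorization and the classifier so that everything is fitted on training text only. The toy texts and labels are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy training data, for illustration only
train_texts = [
    "the bridge construction tender",
    "the department of transport",
    "winste op buitelandse valuta",
    "verliese op buitelandse valuta",
]
train_labels = ["eng", "eng", "afr", "afr"]

# The pipeline fits the vectorizer and the classifier on the same training text
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("nb", MultinomialNB(alpha=0.5)),
])
pipeline.fit(train_texts, train_labels)

# Raw strings go in; the pipeline applies the fitted vectorizer internally
predictions = pipeline.predict(["the bridge", "winste op valuta"])
```

A pipeline like this can also be passed to `GridSearchCV`, in which case the vectorizer is refitted inside each cross-validation fold.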



### 5.2.2 Random Forest Classifier

In [44]:
# Instantiate and fit the random forest classifier
RFC = RandomForestClassifier()
RFC.fit(X_train_kbest, y_train)
RFC_pred = RFC.predict(X_test_kbest)

In [46]:
# Assess accuracy using the classification report
rfc_report = classification_report(y_test, RFC_pred)
print(rfc_report)

              precision    recall  f1-score   support

         afr       1.00      1.00      1.00       596
         eng       0.99      1.00      1.00       595
         nbl       0.99      0.94      0.97       594
         nso       1.00      0.99      0.99       581
         sot       1.00      1.00      1.00       600
         ssw       0.96      0.98      0.97       601
         tsn       1.00      1.00      1.00       609
         tso       1.00      1.00      1.00       606
         ven       1.00      1.00      1.00       614
         xho       0.97      0.98      0.98       606
         zul       0.96      0.96      0.96       598

    accuracy                           0.99      6600
   macro avg       0.99      0.99      0.99      6600
weighted avg       0.99      0.99      0.99      6600



### 5.2.3 Support Vector Machine

In [48]:
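# Define a parameter grid for the SVM
parameters = {'kernel':('linear','rbf'), 'C':(0.25,1.0), 'gamma': (1,2)}

# Instantiate the SVM and wrap it in a grid search.
# Note: only the default SVC is fitted below; clf (the grid search) is never
# fitted, so the parameter grid has no effect on the reported results.
svm_kbest = SVC()
clf = GridSearchCV(svm_kbest, parameters)
svm_kbest.fit(X_train_kbest, y_train)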

SVC()

In [49]:
# Predictions for kbest
svm_y_pred_kbest = svm_kbest.predict(X_test_kbest)

In [50]:
# Assess accuracy using the classification report
svm_score = classification_report(y_test,svm_y_pred_kbest)
print(svm_score)

              precision    recall  f1-score   support

         afr       1.00      1.00      1.00       596
         eng       1.00      1.00      1.00       595
         nbl       0.99      0.98      0.99       594
         nso       1.00      0.99      1.00       581
         sot       1.00      1.00      1.00       600
         ssw       1.00      1.00      1.00       601
         tsn       1.00      1.00      1.00       609
         tso       1.00      1.00      1.00       606
         ven       1.00      1.00      1.00       614
         xho       0.99      1.00      0.99       606
         zul       0.98      0.99      0.99       598

    accuracy                           1.00      6600
   macro avg       1.00      1.00      1.00      6600
weighted avg       1.00      1.00      1.00      6600
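
The results above come from an `SVC` with default settings, since the `GridSearchCV` object is created but not fitted. As a minimal sketch of how the search itself could be run, the example below fits the grid on hypothetical toy data from `make_blobs`; on the real features this would be far slower, and the scores reported above would not necessarily hold.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Hypothetical toy data: 60 points in two well-separated clusters
X_toy, y_toy = make_blobs(n_samples=60, centers=2, random_state=0)

parameters = {"kernel": ("linear", "rbf"), "C": (0.25, 1.0), "gamma": (1, 2)}

# Fit the grid search object, not the bare estimator
clf = GridSearchCV(SVC(), parameters, cv=3)
clf.fit(X_toy, y_toy)

# The refitted best model is available once the search has run
best_svm = clf.best_estimator_
toy_pred = best_svm.predict(X_toy)
```

After fitting, `clf.best_params_` reports the chosen combination of `kernel`, `C` and `gamma`.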



 <a id="mp"></a>
## 6. Model Performance
<a href=#menu>Back to Sections Menu</a>

In [52]:
# Load the test set data
df_test = pd.read_csv('test_set.csv')
test_set = df_test.copy()
# Same cleaning step as for the training data; with regex=False the pattern is
# matched literally, so punctuation is kept (see the output below)
test_set['clean_text'] = test_set['text'].str.replace('[^a-zA-Z#]', ' ', regex=False)
# Lowercase the text
test_set['clean_text'] = test_set['clean_text'].str.lower()
test_set.head()
test_set.head()

Unnamed: 0,index,text,clean_text
0,1,"Mmasepala, fa maemo a a kgethegileng a letlelela kgato eo.","mmasepala, fa maemo a a kgethegileng a letlelela kgato eo."
1,2,Uzakwaziswa ngokufaneleko nakungafuneka eminye imitlolo engezelelako ukuqedelela ukutloliswa kwesibawo sakho.,uzakwaziswa ngokufaneleko nakungafuneka eminye imitlolo engezelelako ukuqedelela ukutloliswa kwesibawo sakho.
2,3,Tshivhumbeo tshi fana na ngano dza vhathu.,tshivhumbeo tshi fana na ngano dza vhathu.
3,4,Kube inja nelikati betingevakala kutsi titsini naticocisana.,kube inja nelikati betingevakala kutsi titsini naticocisana.
4,5,Winste op buitelandse valuta.,winste op buitelandse valuta.


In [54]:
# Vectorize data 
X = vectorizer.transform(test_set['clean_text'])

In [55]:
X_kbest_hack = selector_kbest.transform(X)

In [56]:
nb_predictions = naive_bayes.predict(X_kbest_hack)

In [58]:
# Add the predicted language IDs to the unlabelled test set
test_set['lang_id'] = nb_predictions
test_set.head()

Unnamed: 0,index,text,clean_text,lang_id
0,1,"Mmasepala, fa maemo a a kgethegileng a letlelela kgato eo.","mmasepala, fa maemo a a kgethegileng a letlelela kgato eo.",tsn
1,2,Uzakwaziswa ngokufaneleko nakungafuneka eminye imitlolo engezelelako ukuqedelela ukutloliswa kwesibawo sakho.,uzakwaziswa ngokufaneleko nakungafuneka eminye imitlolo engezelelako ukuqedelela ukutloliswa kwesibawo sakho.,nbl
2,3,Tshivhumbeo tshi fana na ngano dza vhathu.,tshivhumbeo tshi fana na ngano dza vhathu.,ven
3,4,Kube inja nelikati betingevakala kutsi titsini naticocisana.,kube inja nelikati betingevakala kutsi titsini naticocisana.,ssw
4,5,Winste op buitelandse valuta.,winste op buitelandse valuta.,afr


In [59]:
# Extract a dataframe for submission
submission_df = test_set[['index','lang_id']]

# Save submission dataframe to csv
submission_df.to_csv('sample_submission.csv', header=True, index=False)
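
Before uploading, it can help to read the file back and confirm its shape. The sketch below does this round trip in memory on a hypothetical three-row frame whose values mirror the first predictions shown above; the column names `index` and `lang_id` are taken from the submission cell.

```python
import io

import pandas as pd

# Hypothetical stand-in for submission_df
toy_submission = pd.DataFrame({
    "index": [1, 2, 3],
    "lang_id": ["tsn", "nbl", "ven"],
})

# Write to an in-memory buffer instead of disk, then read it back
buffer = io.StringIO()
toy_submission.to_csv(buffer, header=True, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Basic sanity checks: expected columns, row count, no missing labels
columns_ok = list(reloaded.columns) == ["index", "lang_id"]
rows_ok = len(reloaded) == len(toy_submission)
labels_ok = bool(reloaded["lang_id"].notna().all())
```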