## South African Language Identification 

© Explore Data Science Academy

---

I {**Stella, Njuki**}, confirm - by submitting this document - that the solutions in this notebook are a result of my own work and that I abide by the [EDSA honour code](https://drive.google.com/file/d/1QDCjGZJ8-FmJE3bZdIQNwnJyQKPhHZBn/view?usp=sharing).

### Overview

South Africa is a multilingual country with 11 official languages. As a result, systems and devices need to communicate in multiple languages. Language can be used to deepen democracy and to contribute to the social, cultural, intellectual, economic and political life of a society.

### Task
Given a piece of text, the goal is to predict the language it is written in. This is one application of natural language processing (NLP).

### Data
The dataset used for this challenge is the NCHLT Text Corpora collected by the South African Department of Arts and Culture & Centre for Text Technology (CTexT, North-West University, South Africa). The training set was improved through additional cleaning done by Praekelt.

### Language IDs
Each prediction is one of the following language IDs:
* afr - Afrikaans
* eng - English
* nbl - isiNdebele
* nso - Sepedi
* sot - Sesotho
* ssw - siSwati
* tsn - Setswana
* tso - Xitsonga
* ven - Tshivenda
* xho - isiXhosa
* zul - isiZulu

<a id="cont"></a>

## Table of Contents

<a href=#one>1. Importing Packages</a>

<a href=#two>2. Loading the Data</a>

<a href=#three>3. Exploratory Data Analysis (EDA)</a>

<a href=#four>4. Data Engineering</a>

<a href=#five>5. Modelling</a>

<a href=#six>6. Model Performance</a>


 <a id="one"></a>
## 1. Importing Packages
<a href=#cont>Back to Table of Contents</a>

In [1]:
#libraries for loading and manipulating data.
import numpy as np
import pandas as pd

#libraries for NLP and text preprocessing
import nltk
import re
import string 
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.utils import resample

#libraries for visualisation 
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from wordcloud import WordCloud

#libraries for modelling
from nltk.util import ngrams
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import feature_selection
from sklearn.feature_selection import f_classif
from mlxtend.feature_selection import SequentialFeatureSelector as sfs
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import RandomizedSearchCV
from sklearn.naive_bayes import MultinomialNB

#libraries for score metrics
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score, accuracy_score, confusion_matrix, log_loss


# set plot style
sns.set()

<a id="two"></a>
## 2. Loading the Data
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

In [37]:
#load training data
df_train = pd.read_csv('train_set.csv')

hackathon_set = df_train.copy()
pd.set_option('max_colwidth', None)
hackathon_set.head()

Unnamed: 0,lang_id,text
0,xho,umgaqo-siseko wenza amalungiselelo kumaziko axhasa ulawulo lwesininzi kunye nokuthath inxaxheba kwabafazi ezi ziquka phakathi kwezinye zazo ikomishoni yokulingana ngokwesini ikomishoni yamalungelo oluntu lomzantsi afrika
1,xho,i-dha iya kuba nobulumko bokubeka umsebenzi naphi na kwisebe ngokusekwe kwiimfuno zokusebenza zalo emva kokubonana nomsebenzi kunye okanye imanyano yakhe ukuba ulandulo lomntu onjalo alufanelekanga i-dha mayibize uncedo olufanelekileyo elungelweni layo
2,eng,the province of kwazulu-natal department of transport invites tenders from established contractors experienced in bridge construction for the construction of the kwajolwayo tugela river pedestrian bridge near tugela ferry the duration of the project will be months
3,nso,o netefatša gore o ba file dilo ka moka tše le dumelelanego ka tšona mohlala maleri a magolo a a šomišwago go fihlelela meagong e metelele scaffolds a a bolokegilego lefelo la maleba la go šomela go phela gabotse bjbj
4,ven,khomishini ya ndinganyiso ya mbeu yo ewa maana u ya nga mulayo wa khomishini ya ndinganyiso ya mbeu u thetshelesa mbilaelo dzine dza tshimbilelana na tshialula u ya nga mbeu nahone i ivhea sa foramu ya thungo u ya nga mulayo wa ndinganyiso


<a id="three"></a>
## 3. Exploratory Data Analysis (EDA)
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

Exploratory data analysis (EDA) involves:
* Checking the shape of the data
* Checking for missing values
* Identifying all the unique languages in the dataset
* Checking for duplicate entries

In [38]:
# General EDA summary
def eda(df):
    shape = df.shape
    null_entries = df.isnull().sum()
    languages = list(df.lang_id.unique())
    # NOTE: columns.duplicated() flags repeated column labels only;
    # df.duplicated().any() would flag repeated rows instead.
    duplicate = df.columns.duplicated().any()

    # print the summary (print returns None, so nothing is returned)
    print('Shape of dataframe is ' + str(shape[0]) + ' rows and ' + str(shape[1]) + ' columns')
    print('Unique languages are: ' + str(languages))
    print('Duplicate entries: ', duplicate)
    print('Checking for null entries in each column:\n', null_entries)

eda(hackathon_set)

Shape of dataframe is 33000 rows and 2 columns
Unique languages are: ['xho', 'eng', 'nso', 'ven', 'tsn', 'nbl', 'zul', 'ssw', 'tso', 'sot', 'afr']
Duplicate entries:  False
Checking for null entries in each column:
 lang_id    0
text       0
dtype: int64
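
The `duplicate` flag in `eda` is computed from `df.columns.duplicated()`, which inspects column labels rather than rows. A minimal sketch of a row-level check follows, using a made-up toy frame (the rows and the names `toy`, `duplicate_rows` and `deduplicated` are illustrative assumptions, not data from this notebook):

```python
import pandas as pd

# Toy frame (made-up rows) with one repeated (lang_id, text) pair
toy = pd.DataFrame({
    "lang_id": ["eng", "eng", "afr"],
    "text": ["hello world", "hello world", "hallo wereld"],
})

duplicate_columns = toy.columns.duplicated().any()  # repeated column labels
duplicate_rows = toy.duplicated().sum()             # rows identical to an earlier row
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(duplicate_columns, duplicate_rows, len(deduplicated))
```

Whether duplicated rows should be dropped before training is a modelling decision; the sketch only shows how to count them.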

<a id="four"></a>
## 4. Data Engineering
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>
 In this section the following data preprocessing will be done:
 * Remove numbers, special characters and punctuation
 * Lowercase all words

In [39]:
#replace everything except ASCII letters and '#' with a space (this also strips letters with diacritics such as š)
hackathon_set['clean_text'] = hackathon_set['text'].str.replace('[^a-zA-Z#]', ' ', regex=True)
hackathon_set.head()

Unnamed: 0,lang_id,text,clean_text
0,xho,umgaqo-siseko wenza amalungiselelo kumaziko axhasa ulawulo lwesininzi kunye nokuthath inxaxheba kwabafazi ezi ziquka phakathi kwezinye zazo ikomishoni yokulingana ngokwesini ikomishoni yamalungelo oluntu lomzantsi afrika,umgaqo siseko wenza amalungiselelo kumaziko axhasa ulawulo lwesininzi kunye nokuthath inxaxheba kwabafazi ezi ziquka phakathi kwezinye zazo ikomishoni yokulingana ngokwesini ikomishoni yamalungelo oluntu lomzantsi afrika
1,xho,i-dha iya kuba nobulumko bokubeka umsebenzi naphi na kwisebe ngokusekwe kwiimfuno zokusebenza zalo emva kokubonana nomsebenzi kunye okanye imanyano yakhe ukuba ulandulo lomntu onjalo alufanelekanga i-dha mayibize uncedo olufanelekileyo elungelweni layo,i dha iya kuba nobulumko bokubeka umsebenzi naphi na kwisebe ngokusekwe kwiimfuno zokusebenza zalo emva kokubonana nomsebenzi kunye okanye imanyano yakhe ukuba ulandulo lomntu onjalo alufanelekanga i dha mayibize uncedo olufanelekileyo elungelweni layo
2,eng,the province of kwazulu-natal department of transport invites tenders from established contractors experienced in bridge construction for the construction of the kwajolwayo tugela river pedestrian bridge near tugela ferry the duration of the project will be months,the province of kwazulu natal department of transport invites tenders from established contractors experienced in bridge construction for the construction of the kwajolwayo tugela river pedestrian bridge near tugela ferry the duration of the project will be months
3,nso,o netefatša gore o ba file dilo ka moka tše le dumelelanego ka tšona mohlala maleri a magolo a a šomišwago go fihlelela meagong e metelele scaffolds a a bolokegilego lefelo la maleba la go šomela go phela gabotse bjbj,o netefat a gore o ba file dilo ka moka t e le dumelelanego ka t ona mohlala maleri a magolo a a omi wago go fihlelela meagong e metelele scaffolds a a bolokegilego lefelo la maleba la go omela go phela gabotse bjbj
4,ven,khomishini ya ndinganyiso ya mbeu yo ewa maana u ya nga mulayo wa khomishini ya ndinganyiso ya mbeu u thetshelesa mbilaelo dzine dza tshimbilelana na tshialula u ya nga mbeu nahone i ivhea sa foramu ya thungo u ya nga mulayo wa ndinganyiso,khomishini ya ndinganyiso ya mbeu yo ewa maana u ya nga mulayo wa khomishini ya ndinganyiso ya mbeu u thetshelesa mbilaelo dzine dza tshimbilelana na tshialula u ya nga mbeu nahone i ivhea sa foramu ya thungo u ya nga mulayo wa ndinganyiso


In [42]:
#lowercase the text
hackathon_set['clean_text'] = hackathon_set['clean_text'].str.lower()
hackathon_set.head()

Unnamed: 0,lang_id,text,clean_text
0,xho,umgaqo-siseko wenza amalungiselelo kumaziko axhasa ulawulo lwesininzi kunye nokuthath inxaxheba kwabafazi ezi ziquka phakathi kwezinye zazo ikomishoni yokulingana ngokwesini ikomishoni yamalungelo oluntu lomzantsi afrika,umgaqo siseko wenza amalungiselelo kumaziko axhasa ulawulo lwesininzi kunye nokuthath inxaxheba kwabafazi ezi ziquka phakathi kwezinye zazo ikomishoni yokulingana ngokwesini ikomishoni yamalungelo oluntu lomzantsi afrika
1,xho,i-dha iya kuba nobulumko bokubeka umsebenzi naphi na kwisebe ngokusekwe kwiimfuno zokusebenza zalo emva kokubonana nomsebenzi kunye okanye imanyano yakhe ukuba ulandulo lomntu onjalo alufanelekanga i-dha mayibize uncedo olufanelekileyo elungelweni layo,i dha iya kuba nobulumko bokubeka umsebenzi naphi na kwisebe ngokusekwe kwiimfuno zokusebenza zalo emva kokubonana nomsebenzi kunye okanye imanyano yakhe ukuba ulandulo lomntu onjalo alufanelekanga i dha mayibize uncedo olufanelekileyo elungelweni layo
2,eng,the province of kwazulu-natal department of transport invites tenders from established contractors experienced in bridge construction for the construction of the kwajolwayo tugela river pedestrian bridge near tugela ferry the duration of the project will be months,the province of kwazulu natal department of transport invites tenders from established contractors experienced in bridge construction for the construction of the kwajolwayo tugela river pedestrian bridge near tugela ferry the duration of the project will be months
3,nso,o netefatša gore o ba file dilo ka moka tše le dumelelanego ka tšona mohlala maleri a magolo a a šomišwago go fihlelela meagong e metelele scaffolds a a bolokegilego lefelo la maleba la go šomela go phela gabotse bjbj,o netefat a gore o ba file dilo ka moka t e le dumelelanego ka t ona mohlala maleri a magolo a a omi wago go fihlelela meagong e metelele scaffolds a a bolokegilego lefelo la maleba la go omela go phela gabotse bjbj
4,ven,khomishini ya ndinganyiso ya mbeu yo ewa maana u ya nga mulayo wa khomishini ya ndinganyiso ya mbeu u thetshelesa mbilaelo dzine dza tshimbilelana na tshialula u ya nga mbeu nahone i ivhea sa foramu ya thungo u ya nga mulayo wa ndinganyiso,khomishini ya ndinganyiso ya mbeu yo ewa maana u ya nga mulayo wa khomishini ya ndinganyiso ya mbeu u thetshelesa mbilaelo dzine dza tshimbilelana na tshialula u ya nga mbeu nahone i ivhea sa foramu ya thungo u ya nga mulayo wa ndinganyiso
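
The pattern `[^a-zA-Z#]` also removes letters that carry diacritics, which is visible in the Sepedi row above (`netefatša` becomes `netefat a`). As a possible alternative, and not the cleaning used in this notebook, a Unicode-aware pattern could keep such letters. The helper name `clean_text_unicode` is hypothetical, and unlike the notebook's rule it collapses each run of removed characters into a single space and does not keep `#`:

```python
import re

def clean_text_unicode(text: str) -> str:
    """Hypothetical helper: drop punctuation and digits, keep accented letters."""
    # [\W\d_] matches non-word characters, decimal digits and the underscore,
    # which leaves Unicode letters such as š in place.
    text = re.sub(r"[\W\d_]+", " ", text)
    return text.lower().strip()

print(clean_text_unicode("O netefatša gore, kwazulu-natal 2020!"))
# prints: o netefatša gore kwazulu natal
```

Switching to a rule like this would change the vocabulary, so the scores reported later in the notebook would no longer apply as-is.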


<a id="five"></a>
## 5. Modelling
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

Modelling will involve:
 * Creating features from the text (vectorization)
 * Splitting the data into train and test sets
 * Selecting the best features
 
The models tested include:
* Support Vector Machine (SVM)
* Random Forest
* Multinomial Naive Bayes

### Vectorization

In [43]:
# vectorization: TF-IDF over word unigrams and bigrams
vect = TfidfVectorizer(min_df=2,
                       max_df=0.5,
                       # max_features=1000,
                       ngram_range=(1, 2))
# fit and transform the data
training_x = vect.fit_transform(hackathon_set['clean_text'])

In [44]:
#define features and variables
X = training_x 
y = hackathon_set['lang_id']

#split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=50)

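> NB: The data is split before the features are selected to avoid data leakage, where knowledge of the hold-out test set leaks into the data used to train the model. The TF-IDF vectorizer above was still fitted on the full labelled set, so its vocabulary and IDF weights do include the hold-out rows.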
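
One way to keep every fitted step inside the training split is to chain the steps in a scikit-learn `Pipeline` and split the raw text first. The sketch below is an illustration under assumptions, not the workflow used in this notebook: the sentences are made up, and `k=10` is chosen only so that it fits the tiny vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up sentences standing in for the training and hold-out splits
train_texts = ["the river bridge is closed", "the project will start soon",
               "die brug oor die rivier is toe", "die projek sal binnekort begin"]
train_labels = ["eng", "eng", "afr", "afr"]
test_texts = ["the bridge project", "die rivier projek"]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("kbest", SelectKBest(score_func=f_classif, k=10)),  # k is illustrative
    ("nb", MultinomialNB(alpha=0.5)),
])

# Vectorizer, selector and classifier are all fitted on the training texts only
pipeline.fit(train_texts, train_labels)
predictions = pipeline.predict(test_texts)
print(predictions)
```

With this layout the hold-out texts are only ever passed through `transform` and `predict`, so nothing about them reaches the fitted vocabulary, IDF weights or feature scores.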

Select the 50,000 best features using `SelectKBest` with the ANOVA F-score (`f_classif`).

In [45]:
# Set up selector, choosing score function and number of features to retain
selector_kbest = feature_selection.SelectKBest(score_func=f_classif, k=50000)

# Transform (i.e.: run selection on) the training data
X_train_kbest = selector_kbest.fit_transform(X_train, y_train)

import warnings
warnings.filterwarnings("ignore")

In [46]:
X_train_kbest.shape

(26400, 50000)

In [47]:
# transform the test data the same way we did the train data
X_test_kbest = selector_kbest.transform(X_test)

In [48]:
X_test_kbest.shape

(6600, 50000)

### Models

#### Support Vector Machine

In [49]:
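# hyperparameter grid for the SVM
parameters = {'kernel': ('linear', 'rbf'),
              'C': (0.25, 1.0),
              'gamma': (1, 2)
             }

# instantiate the SVM for the k-best features
svm_kbest = SVC()
# NOTE: clf is built but never fitted, so the grid above is not searched;
# the model trained below is an SVC with default hyperparameters.
clf = GridSearchCV(svm_kbest, parameters)
svm_kbest.fit(X_train_kbest, y_train)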

SVC()
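
A grid search only runs when `fit` is called on the `GridSearchCV` object itself. The sketch below shows that call on synthetic data; `X_demo`, `y_demo` and `cv=3` are assumptions for illustration, and the parameters it prints say nothing about the best setting for the language data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic data standing in for X_train_kbest / y_train
X_demo, y_demo = make_classification(n_samples=120, n_features=10,
                                     n_informative=5, n_classes=3,
                                     random_state=50)

param_grid = {"kernel": ["linear", "rbf"], "C": [0.25, 1.0], "gamma": [1, 2]}

search = GridSearchCV(SVC(), param_grid, cv=3)
search.fit(X_demo, y_demo)         # fit the search object, not the bare SVC

best_svm = search.best_estimator_  # refitted on all of X_demo with the best setting
print(search.best_params_)
```

Note that `gamma` has no effect for the linear kernel, so half of the linear candidates in this grid are redundant fits.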

In [50]:
#predictions for kbest
svm_y_pred_kbest = svm_kbest.predict(X_test_kbest)

In [51]:
# assess accuracy using the classification report
svm_score = classification_report(y_test,svm_y_pred_kbest)
print(svm_score)

import warnings
warnings.filterwarnings('ignore')

              precision    recall  f1-score   support

         afr       1.00      1.00      1.00       596
         eng       1.00      1.00      1.00       595
         nbl       0.99      0.98      0.99       594
         nso       1.00      0.99      1.00       581
         sot       1.00      1.00      1.00       600
         ssw       1.00      1.00      1.00       601
         tsn       1.00      1.00      1.00       609
         tso       1.00      1.00      1.00       606
         ven       1.00      1.00      1.00       614
         xho       0.99      1.00      0.99       606
         zul       0.98      0.99      0.99       598

    accuracy                           1.00      6600
   macro avg       1.00      1.00      1.00      6600
weighted avg       1.00      1.00      1.00      6600



#### Random Forest Classifier

In [52]:
# instantiate and train the random forest classifier
RF = RandomForestClassifier()
RF.fit(X_train_kbest, y_train)
RF_pred = RF.predict(X_test_kbest)

In [53]:
# assess accuracy using the classification report
rf_report = classification_report(y_test,RF_pred)
print(rf_report)

              precision    recall  f1-score   support

         afr       1.00      1.00      1.00       596
         eng       0.99      1.00      1.00       595
         nbl       0.99      0.95      0.97       594
         nso       1.00      0.99      0.99       581
         sot       0.99      1.00      1.00       600
         ssw       0.97      0.98      0.97       601
         tsn       1.00      1.00      1.00       609
         tso       1.00      1.00      1.00       606
         ven       1.00      1.00      1.00       614
         xho       0.97      0.99      0.98       606
         zul       0.96      0.96      0.96       598

    accuracy                           0.99      6600
   macro avg       0.99      0.99      0.99      6600
weighted avg       0.99      0.99      0.99      6600



#### Multinomial Naive Bayes

In [54]:
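# define the parameter grid
params = {'alpha': [0.01, 0.1, 0.5, 1.0, 10.0]}

# perform a grid search to obtain the best alpha
# NOTE: the search runs on the full feature matrix (X_train), while the
# final model below is trained on the k-best subset (X_train_kbest).
multinomial_nb_grid = GridSearchCV(MultinomialNB(), param_grid=params, n_jobs=-1, cv=5, verbose=5)
multinomial_nb_grid.fit(X_train, y_train)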

Fitting 5 folds for each of 5 candidates, totalling 25 fits


GridSearchCV(cv=5, estimator=MultinomialNB(), n_jobs=-1,
             param_grid={'alpha': [0.01, 0.1, 0.5, 1.0, 10.0]}, verbose=5)

In [55]:
# print out the accuracy and best parameters of the model
print('Train Accuracy : %.3f'%multinomial_nb_grid.best_estimator_.score(X_train, y_train))
print('Test Accuracy : %.3f'%multinomial_nb_grid.best_estimator_.score(X_test, y_test))
print('Best Accuracy Through Grid Search : %.3f'%multinomial_nb_grid.best_score_)
print('Best Parameters : ',multinomial_nb_grid.best_params_)

Train Accuracy : 1.000
Test Accuracy : 0.999
Best Accuracy Through Grid Search : 0.999
Best Parameters :  {'alpha': 0.5}


In [56]:
# instantiate the model with the best alpha found above
naive_bayes = MultinomialNB(alpha=0.5)
naive_bayes.fit(X_train_kbest, y_train)

MultinomialNB(alpha=0.5)

In [57]:
#generate predictions
nb_tuned_pred = naive_bayes.predict(X_test_kbest)

In [58]:
ac = accuracy_score(y_test, nb_tuned_pred)
print("Accuracy is :",ac)

Accuracy is : 0.9986363636363637


In [59]:
# assess accuracy using the classification report
nb_report = classification_report(y_test,nb_tuned_pred)
print(nb_report)

              precision    recall  f1-score   support

         afr       1.00      1.00      1.00       596
         eng       1.00      1.00      1.00       595
         nbl       1.00      0.99      1.00       594
         nso       1.00      1.00      1.00       581
         sot       1.00      1.00      1.00       600
         ssw       1.00      1.00      1.00       601
         tsn       1.00      1.00      1.00       609
         tso       1.00      1.00      1.00       606
         ven       1.00      1.00      1.00       614
         xho       1.00      1.00      1.00       606
         zul       1.00      1.00      1.00       598

    accuracy                           1.00      6600
   macro avg       1.00      1.00      1.00      6600
weighted avg       1.00      1.00      1.00      6600



<a id="six"></a>
## 6. Model Performance
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

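Of the three models (support vector machine, random forest and multinomial Naive Bayes), Naive Bayes scored highest on the hold-out split, with an accuracy of 0.9986 and every per-class F1-score rounding to 1.00. The SVM also rounds to 1.00 accuracy but drops to an F1-score of 0.99 on nbl, xho and zul, and the random forest reaches 0.99 accuracy. Naive Bayes is therefore used to generate predictions for the unlabelled test set.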
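
Accuracy alone does not show which languages are mistaken for each other. In the reports above, the lowest per-class scores fall on nbl, xho, zul and ssw, which are all Nguni languages. A confusion matrix makes such pairs visible. The sketch below uses made-up labels purely to show the layout; it does not reproduce results from this notebook.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration; not results from this notebook
labels = ["nbl", "xho", "zul"]
y_true = ["zul", "zul", "xho", "nbl"]
y_pred = ["zul", "xho", "xho", "nbl"]

# Rows are true languages, columns are predicted languages
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)
```

Here the last row shows one zul sentence predicted as xho. Applied to `y_test` and a model's predictions, the same call would list the actual confusions.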

The test data is preprocessed in exactly the same way as the training data:
* Remove numbers, special characters and punctuation
* Lowercase the words
* Vectorize using the TF-IDF vectorizer fitted during training
* Select features using the `SelectKBest` selector fitted during training
* Generate predictions using the trained Naive Bayes model
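
The steps listed above can be wrapped in a single function so that the fitted objects are reused and never refitted on the test text. This is a sketch under assumptions: `clean` and `predict_language` are hypothetical helper names, and the tiny training set is made up so the block runs on its own.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.naive_bayes import MultinomialNB

def clean(text):
    # same rule as the notebook: keep ASCII letters and '#', then lowercase
    return re.sub(r"[^a-zA-Z#]", " ", text).lower()

def predict_language(texts, vect, selector, model):
    """Hypothetical helper: apply the already fitted steps to raw strings."""
    features = vect.transform([clean(t) for t in texts])  # transform, not fit_transform
    return model.predict(selector.transform(features))

# Tiny made-up training set so the sketch runs on its own
train_texts = ["The river bridge is closed.", "The project will start soon!",
               "Die brug oor die rivier is toe.", "Die projek sal binnekort begin!"]
train_labels = ["eng", "eng", "afr", "afr"]

demo_vect = TfidfVectorizer(ngram_range=(1, 2))
demo_X = demo_vect.fit_transform([clean(t) for t in train_texts])
demo_selector = SelectKBest(score_func=f_classif, k=10)  # k is illustrative
demo_model = MultinomialNB(alpha=0.5).fit(
    demo_selector.fit_transform(demo_X, train_labels), train_labels)

predictions = predict_language(["The bridge, again?", "Die rivier 2020"],
                               demo_vect, demo_selector, demo_model)
print(predictions)
```

In the notebook the same call would take `vect`, `selector_kbest` and `naive_bayes` as the three fitted objects.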

### Test Data

In [60]:
#load test set data
df_test = pd.read_csv('test_set.csv')

test_set = df_test.copy()

pd.set_option('max_colwidth', None)
test_set.head()

Unnamed: 0,index,text
0,1,"Mmasepala, fa maemo a a kgethegileng a letlelela kgato eo."
1,2,Uzakwaziswa ngokufaneleko nakungafuneka eminye imitlolo engezelelako ukuqedelela ukutloliswa kwesibawo sakho.
2,3,Tshivhumbeo tshi fana na ngano dza vhathu.
3,4,Kube inja nelikati betingevakala kutsi titsini naticocisana.
4,5,Winste op buitelandse valuta.


### Preprocess

In [62]:
#replace everything except ASCII letters and '#' with a space, as done for the training data
test_set['clean_text'] = test_set['text'].str.replace('[^a-zA-Z#]', ' ', regex=True)
test_set.head()

Unnamed: 0,index,text,clean_text
0,1,"Mmasepala, fa maemo a a kgethegileng a letlelela kgato eo.",Mmasepala fa maemo a a kgethegileng a letlelela kgato eo
1,2,Uzakwaziswa ngokufaneleko nakungafuneka eminye imitlolo engezelelako ukuqedelela ukutloliswa kwesibawo sakho.,Uzakwaziswa ngokufaneleko nakungafuneka eminye imitlolo engezelelako ukuqedelela ukutloliswa kwesibawo sakho
2,3,Tshivhumbeo tshi fana na ngano dza vhathu.,Tshivhumbeo tshi fana na ngano dza vhathu
3,4,Kube inja nelikati betingevakala kutsi titsini naticocisana.,Kube inja nelikati betingevakala kutsi titsini naticocisana
4,5,Winste op buitelandse valuta.,Winste op buitelandse valuta


In [63]:
#lowercase
test_set['clean_text'] = test_set['clean_text'].str.lower()
test_set.head()

Unnamed: 0,index,text,clean_text
0,1,"Mmasepala, fa maemo a a kgethegileng a letlelela kgato eo.",mmasepala fa maemo a a kgethegileng a letlelela kgato eo
1,2,Uzakwaziswa ngokufaneleko nakungafuneka eminye imitlolo engezelelako ukuqedelela ukutloliswa kwesibawo sakho.,uzakwaziswa ngokufaneleko nakungafuneka eminye imitlolo engezelelako ukuqedelela ukutloliswa kwesibawo sakho
2,3,Tshivhumbeo tshi fana na ngano dza vhathu.,tshivhumbeo tshi fana na ngano dza vhathu
3,4,Kube inja nelikati betingevakala kutsi titsini naticocisana.,kube inja nelikati betingevakala kutsi titsini naticocisana
4,5,Winste op buitelandse valuta.,winste op buitelandse valuta


In [64]:
#vectorize data 
X = vect.transform(test_set['clean_text'])

In [65]:
X_kbest_hack = selector_kbest.transform(X)

In [66]:
#svm
#svm_y_pred = svm_kbest.predict(X_kbest_hack)
#randomforest
#rf_y_pred = RF.predict(X_kbest_hack)
#naive bayes
nb = naive_bayes.predict(X_kbest_hack)

In [70]:
#add the predicted language IDs to the unlabelled test set
test_set['lang_id'] = nb
test_set.head(50)

Unnamed: 0,index,text,clean_text,lang_id
0,1,"Mmasepala, fa maemo a a kgethegileng a letlelela kgato eo.",mmasepala fa maemo a a kgethegileng a letlelela kgato eo,tsn
1,2,Uzakwaziswa ngokufaneleko nakungafuneka eminye imitlolo engezelelako ukuqedelela ukutloliswa kwesibawo sakho.,uzakwaziswa ngokufaneleko nakungafuneka eminye imitlolo engezelelako ukuqedelela ukutloliswa kwesibawo sakho,nbl
2,3,Tshivhumbeo tshi fana na ngano dza vhathu.,tshivhumbeo tshi fana na ngano dza vhathu,ven
3,4,Kube inja nelikati betingevakala kutsi titsini naticocisana.,kube inja nelikati betingevakala kutsi titsini naticocisana,ssw
4,5,Winste op buitelandse valuta.,winste op buitelandse valuta,afr
5,6,"Ke feela dilense tše hlakilego, tša pono e tee goba tše pedi tšeo di lefelelwago. Kgetho ya diforeimo e a hwetšagala yeo maloko a ka kgethago go tšwa go yona. Ge o nyaka foreimo ya go bitša ga nnyane, o tla swanelwa ke go lefelela phapano yeo.",ke feela dilense t e hlakilego t a pono e tee goba t e pedi t eo di lefelelwago kgetho ya diforeimo e a hwet agala yeo maloko a ka kgethago go t wa go yona ge o nyaka foreimo ya go bit a ga nnyane o tla swanelwa ke go lefelela phapano yeo,nso
6,7,<fn>(762010101403 AM) 1495 Final Gems Birthing Options_ZULU.txt</fn>,fn am final gems birthing options zulu txt fn,eng
7,8,Ntjhafatso ya konteraka ya mosebetsi: Etsa bonnete hore tsohle tse lokelwang ho ngolwa fatshe di entswe!,ntjhafatso ya konteraka ya mosebetsi etsa bonnete hore tsohle tse lokelwang ho ngolwa fatshe di entswe,sot
8,9,"u-GEMS uhlinzeka ngezinzuzo zemithi yezifo ezingapheli kuwo wonke amalunga. ukuze ukwazi ukuthola imithi yezifo ezingapheli kufanele ubhalise ohlelweni lwemithi yezifo ezingapheli, emva kokuthola isaziso sokuthi iyilunga elibhalisiwe lakwa-GEMS.",u gems uhlinzeka ngezinzuzo zemithi yezifo ezingapheli kuwo wonke amalunga ukuze ukwazi ukuthola imithi yezifo ezingapheli kufanele ubhalise ohlelweni lwemithi yezifo ezingapheli emva kokuthola isaziso sokuthi iyilunga elibhalisiwe lakwa gems,zul
9,10,"So, on occasion, are statistics misused.",so on occasion are statistics misused,eng


In [68]:
#extract a dataframe for submission
sub_df = test_set[['index','lang_id']]

#save submission dataframe to csv
sub_df.to_csv('hack14.csv',header=True, index=False)
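
Before uploading, it can help to read the file back and check its shape and labels. The sketch below does this on a made-up frame written to an in-memory buffer. `check_submission` is a hypothetical helper, and the expected column layout is an assumption based only on the `sub_df` columns used above.

```python
import io
import pandas as pd

VALID_LANG_IDS = {"afr", "eng", "nbl", "nso", "sot", "ssw",
                  "tsn", "tso", "ven", "xho", "zul"}

def check_submission(df: pd.DataFrame) -> None:
    """Hypothetical sanity check for a submission frame."""
    assert list(df.columns) == ["index", "lang_id"], "unexpected columns"
    assert df["index"].is_unique, "duplicate index values"
    assert df["lang_id"].isin(VALID_LANG_IDS).all(), "unknown language code"

# Made-up rows in the same shape as sub_df
demo = pd.DataFrame({"index": [1, 2, 3], "lang_id": ["tsn", "nbl", "ven"]})
buffer = io.StringIO()
demo.to_csv(buffer, header=True, index=False)
buffer.seek(0)

round_trip = pd.read_csv(buffer)
check_submission(round_trip)
print(round_trip.shape)
```

For the real file, `pd.read_csv('hack14.csv')` would replace the in-memory buffer.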